June 3, 2014. Last week, drafts of the Human Proteome were published in Nature (1). These publications and associated data are an exciting addition to the rapidly expanding dataset on the human genome. Curious to learn how these data fit with our work, we focused on one of the two publications (2). The authors generated 25 million mass spectra yielding 293,000 peptides from 30 tissues and cells, and found peptides corresponding to 17,294 of the estimated 20,500 or so genes. An accompanying website enables gene-based queries of the data (humanproteomemap.org).
At Molquant, we are generating a comprehensive set of networks that organize the genome into biologically relevant groups using a diverse set of genomic data. Starting from networks based on gene expression data, we used the web portal to explore how our networks look in the Proteome data, assessing both coverage and consistency with our findings.
The data accessible from the proteome website only allow heat map analysis of input genes, so we are not able to generate numerical values or correlation statistics. Nevertheless, a visual representation of the protein levels of the queried genes provides a good sense of the degree of correlated expression, tissue specificity and extent of coverage of the queried networks in the proteome data.
Using correlation statistics, thousands of diverse transcriptome samples (tissues, cell lines, tumors) and biologically directed seed genes, we have been able to generate more than one hundred gene networks representing diverse biological pathways, subcellular structures, tissues and cell types.
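As an illustration of the general idea of seed-driven, correlation-based grouping, the sketch below ranks candidate genes by their average Pearson correlation to a set of seed genes. It is a minimal sketch and not the Molquant algorithm, which is not described here: the function names (pearson, expand_network), the gene labels and the toy expression values are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def expand_network(expression, seeds, top_n=10):
    """Rank non-seed genes by their mean correlation to the seed genes.

    Hypothetical helper: one simple way to grow a gene list around seeds.
    """
    scores = {}
    for gene, profile in expression.items():
        if gene in seeds:
            continue
        scores[gene] = sum(pearson(profile, expression[s]) for s in seeds) / len(seeds)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Toy data (invented): one expression profile per gene across five samples.
expression = {
    "SEED_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "SEED_B": [2.0, 4.1, 5.9, 8.2, 10.0],
    "GENE_X": [1.1, 1.9, 3.2, 3.9, 5.1],   # tracks the seeds
    "GENE_Y": [5.0, 4.0, 3.0, 2.0, 1.0],   # anti-correlated with the seeds
    "GENE_Z": [3.0, 3.0, 3.0, 3.0, 3.0],   # flat, uninformative
}

ranked = expand_network(expression, seeds={"SEED_A", "SEED_B"}, top_n=2)
print(ranked)
```

In this toy example the gene that tracks the seeds ranks first and the anti-correlated gene is excluded; a real analysis would draw on thousands of samples and would need a principled cutoff.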
To assess the strength of the networks, and to explore inter-network relationships, we developed a heatmap plot (bathymetry plot) of the expression correlation value of each queried gene to every other gene in a set. Note that the 10 selected genes representing each network exhibit a high degree of correlation; inter-network correlations can be “read” by examining the degree of correlation across networks.
One representative bathymetry plot is shown in figure 1, which plots 20 networks, each represented by 10 genes. Well-characterized genes were chosen to link the mathematically derived networks to known biology. Pathway, subcellular structure and tissue specific networks are shown, with labels above the 10-gene squares that represent each network.
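The data behind such a plot is a gene-by-gene correlation matrix in which the genes of each network sit next to each other, so that strong networks appear as high-correlation squares along the diagonal. The sketch below is a minimal illustration under assumed inputs: the gene labels, expression values and helper names (correlation_matrix, mean_block) are hypothetical, and two genes per network stand in for the 10 used in the figure.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def correlation_matrix(expression, genes):
    """Correlation of every queried gene to every other gene, in the given order."""
    return [[pearson(expression[a], expression[b]) for b in genes] for a in genes]


def mean_block(matrix, rows, cols):
    """Mean correlation over a block of the matrix, skipping the diagonal."""
    vals = [matrix[i][j] for i in rows for j in cols if i != j]
    return sum(vals) / len(vals)


# Toy data (invented): two networks of two genes each, six samples.
expression = {
    "NET1_A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "NET1_B": [1.2, 1.9, 3.1, 4.2, 4.8, 6.1],
    "NET2_A": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
    "NET2_B": [3.1, 0.9, 4.2, 1.1, 4.8, 2.2],
}
genes = ["NET1_A", "NET1_B", "NET2_A", "NET2_B"]  # grouped by network

matrix = correlation_matrix(expression, genes)
within_net1 = mean_block(matrix, [0, 1], [0, 1])   # square on the diagonal
within_net2 = mean_block(matrix, [2, 3], [2, 3])
between = mean_block(matrix, [0, 1], [2, 3])       # off-diagonal block

print(round(within_net1, 3), round(within_net2, 3), round(between, 3))
```

Comparing the mean of a diagonal square with the mean of an off-diagonal block gives a numerical counterpart to "reading" intra-network strength and inter-network relationships from the plot.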
We also visualize the expression of networks using expression heat maps and aggregate signature scores of the networks. Figure 2A shows a heat map and score of 50 genes from a Molquant proliferation network across more than 1600 human tissue samples from the GTEx project. In this case, GTEx represents a “validation set” since GTEx samples were excluded from the network generation that produced the gene list. The highly correlated expression of the proliferation network in GTEx highlights the general applicability of the algorithms and samples used for network generation.
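One common way to compute an aggregate signature score is to standardize each gene across samples and then average the standardized values of the network genes within each sample. The sketch below assumes that definition, which may differ from the score plotted in figure 2A; the gene labels, sample labels and values are invented for illustration.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize one gene's expression across samples (zeros if the gene is flat)."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        return [0.0] * len(values)
    return [(v - m) / s for v in values]


def signature_scores(expression, genes):
    """Per-sample score: mean z-score over the network genes (assumed definition)."""
    z = [zscores(expression[g]) for g in genes]
    n_samples = len(z[0])
    return [mean(gene_z[i] for gene_z in z) for i in range(n_samples)]


# Toy data (invented): three hypothetical network genes across four samples.
samples = ["testis", "heart", "lung", "blood"]
expression = {
    "GENE_1": [9.0, 2.0, 3.0, 4.0],
    "GENE_2": [8.5, 1.5, 2.5, 5.0],
    "GENE_3": [7.0, 3.0, 2.0, 3.5],
}

scores = signature_scores(expression, ["GENE_1", "GENE_2", "GENE_3"])
for sample, score in zip(samples, scores):
    print(f"{sample}: {score:.2f}")
```

Standardizing per gene keeps a few highly expressed genes from dominating the score, at the cost of making each score relative to the particular set of samples analyzed.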
Molquant networks are also found in Proteome data.
To compare the RNA based networks with the newly published proteome data, we first compared the proliferation network to a protein abundance heatmap generated from the proteome data (Figure 2B). More than half of the genes from the Molquant network were prominently represented in the adult testis Proteome sample, consistent with both figure 2A and the known biology of the testis. Fetal ovary also exhibited increased signal from many of the proliferation genes. However, the proteome data available on the portal only reports a single point per tissue (presumably summed from the raw data) and does not reveal the variation and nuance observed in the GTEx RNAseq data, where each individual sample is represented. For example, note the high proliferation signal in about 20% of the “blood” samples. Nevertheless, the consistency of high testis expression of many of the proliferation network genes/proteins supported the quality and quantitative nature of the proteome data. We recognize that it is an assumption that the Molquant proliferation network represents the gold standard, an assertion that is difficult to prove. However, the preponderance of well-characterized proliferation genes in the mathematically generated network supports the validity of the gene list.
“Housekeeping” gene networks exhibit varied expression across tissues
Many basic cellular functions such as translation, energy production and metabolism are common to virtually all cells. Although we typically refer to the genes of these processes as “housekeeping” genes, one of the surprises from our network generation work was that these housekeeping networks exhibited significant variation across tissues. Figure 3A shows an example for a mitochondria network where higher expression of the network is seen in heart and muscle, an observation consistent with the high energy requirement of contractile tissues. The increased coordinate expression of the network is consistent between RNA expression and the Proteome data. In Figure 3B, a Ribosome network exhibits coordinate but varied network expression across both RNAseq and proteome data, yet some discrepancies are seen between the two data sets. For example, the proteome data shows high network expression in fetal liver and adult pancreas, whereas the RNAseq GTEx data exhibits highest network expression in prostate and subsets of blood.
Tissue specific networks compared across RNAseq and Proteome
Continuing the comparison of Molquant derived networks between RNAseq and proteome data, we next examined four networks generated to represent tissue specific/cell type specific functions. Figure 4 shows comparisons of networks for Lung, Pancreas, Skeletal Muscle and Adipose cells. For Lung and Pancreas (Figure 4A, B) the networks generally perform very similarly between RNAseq and Proteome data, with nearly all of the queried genes exhibiting highest expression/abundance in the appropriate tissue. Extending the comparison to Skeletal Muscle and Adipose Cells (Figure 4C, D), tissues that were not directly collected in the Proteome set, revealed a much less consistent pattern in the Proteome data. While the RNAseq data exhibited the expected tissue restriction, the proteome signal for the muscle network was enriched in fetal and adult heart, as well as in esophagus. The latter tissue is perhaps enriched with smooth muscle constituents. In contrast to the well conserved network patterns for the other gene sets, the adipose cell network did not exhibit coordinate expression of network genes in any of the proteome samples. Although adipose cells are often present within and “contaminate” a wide variety of tissues (note network expression enrichment in blood vessel, breast and nerve GTEx RNAseq samples), a discernible signal is lacking in the proteome data. Among several possibilities, the precise process of sample collection in the proteome data may have excluded “contaminating” adipose tissue.
Molquant hematopoietic cell networks well defined in Proteome data
Molquant tools can generate networks for most areas of biology, and we have also derived networks for many hematopoietic cell lineages (using well defined cell type-specific genes as seeds). One feature of the Proteome data is the isolation and independent characterization of several hematopoietic lineages (2). The GTEx dataset, by contrast, only collected “blood” and therefore cannot discriminate distinct hematopoietic cell types. Figure 5 shows a comparison of six hematopoietic networks (platelets, erythrocytes, neutrophils, natural killer cells, CD8 T cells, macrophages). The GTEx dataset (figure 5A) exhibits strong network enrichment in “blood” as expected; however, it is difficult to recognize any sub-structure in the network expression. The same networks examined in the Proteome data (figure 5B) exhibit enriched protein abundance signal accurately corresponding to the derived networks. The erythrocyte network exhibits enriched expression in fetal liver (the site of fetal erythropoiesis), and both neutrophil and macrophage networks show enriched network expression in “monocytes”, presumably because no cell type selection was conducted on those samples.
Robust performance of Molquant Networks across diverse human data types
The gene expression based networks we've derived are drawn from thousands of diverse transcriptomes, yet are designed to accurately represent many aspects of human biology. If successful, these networks become powerful tools to explore disease biology, gene function annotation and gene/phenotype linkages, all of which enable translational efforts in diagnosis, drug discovery and development. However, RNA is an imprecise representation of protein (or post-translational modification...), so mapping RNA based networks onto the new Proteome data helps to assess the biological relevance of the tools we've built. The observation that most of the networks generated using transcriptomic data were also seen to exhibit correlated abundance in the proteomic data suggests that the algorithms and data used to generate the Molquant networks are robust and can be used for examining diverse genomic data types.
1. Marx, V. Proteomics: An atlas of expression. Nature 509, 645-649 (2014).
2. Kim, M.S., et al. A draft map of the human proteome. Nature 509, 575-581 (2014).