The Human Proteome Map Completed
Reporter and Curator: Larry H. Bernstein, MD, FCAP
UPDATED 6/02/2024
The genetic, pharmacogenomic, and immune landscapes associated with protein expression across human cancers.
Source: Chen C, Liu Y, Li Q, Zhang Z, Luo M, Liu Y, Han L. The Genetic, Pharmacogenomic, and Immune Landscapes Associated with Protein Expression across Human Cancers. Cancer Res. 2023 Nov 15;83(22):3673-3680. doi: 10.1158/0008-5472.CAN-23-0758. PMID: 37548539; PMCID: PMC10843800.
Abstract
Proteomics is a powerful approach that can rapidly enhance our understanding of cancer development. Detailed characterization of the genetic, pharmacogenomic, and immune landscape in relation to protein expression in cancer patients could provide new insights into the functional roles of proteins in cancer. By taking advantage of the genotype data from The Cancer Genome Atlas (TCGA) and protein expression data from The Cancer Proteome Atlas (TCPA), we characterized the effects of genetic variants on protein expression across 31 cancer types and identified approximately 100,000 protein quantitative trait loci (pQTL). Among these, over 8000 pQTL were associated with patient overall survival. Furthermore, characterization of the impact of protein expression on more than 350 imputed anticancer drug responses in patients revealed nearly 230,000 significant associations. In addition, approximately 21,000 significant associations were identified between protein expression and immune cell abundance. Finally, a user-friendly data portal, GPIP (https://hanlaboratory.com/GPIP), was developed featuring multiple modules that enable researchers to explore, visualize, and browse multidimensional data. This detailed analysis reveals the associations between the proteomic landscape and genetic variation, patient outcome, the immune microenvironment, and drug response across cancer types, providing a resource that may offer valuable clinical insights and encourage further functional investigations of proteins in cancer.
Introduction
Functional proteomics is a powerful approach that helps us understand cancer pathophysiology and identify potential therapeutic strategies (1). Functional protein analysis using reverse-phase protein arrays (RPPA) has already proven highly effective in studying large numbers of TCGA samples, especially when integrated with genomic, transcriptomic, and clinical information (2). Previous works demonstrated that a QTL mapping approach is effective to understand the genetic basis of multiple molecular features in human diseases (3). Identifying the sequence determinants of protein levels (pQTLs) may guide the search for causal genes and facilitate understanding the underlying mechanisms of human diseases. However, it remains challenging to further understand the functional roles of protein expression in cancers. For example, it is unclear whether proteins are associated with drug response and/or immune features in patients. In this study, we systematically investigated the effects of genetic variants on protein expression and characterized the impact of protein expression on imputed drug responses and immune cell abundances from different sources (Fig. 1). To facilitate broad access of these data for the biomedical research community, we developed a user-friendly database, GPIP (https://hanlaboratory.com/GPIP). We expect this study to have a significant clinical impact on the future development of protein-based targeted therapies.

Figure 1. A, Workflow of GPIP to identify pQTLs and survival-associated pQTLs. B, The number of pQTLs identified for each cancer type. C, Association between CYCLINB1 protein expression level and rs12576855 in LUAD patients. D, Association between CYCLINB1 protein expression level and rs2722796 in LGG patients. E, The number of survival-associated pQTLs identified for each cancer type. F, Kaplan–Meier plot showing the association between rs10918659 (pQTL of HER2_pY1248) genotypes and overall survival times of STAD patients. G, Kaplan–Meier plot showing the association between rs13158796 (pQTL of HER2_pY1248) genotypes and overall survival times of STAD patients.
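As a rough illustration of the kind of SNP-protein association shown in panels C and D, a pQTL test can be sketched as a regression of protein level on genotype dosage. The sketch assumes an additive linear model with genotypes coded as 0, 1, or 2 copies of the alternate allele. The text above does not state the exact model or covariates the authors used, and the function names and numbers below are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def additive_effect(dosages, expression):
    """Least-squares slope and intercept of expression on genotype dosage.

    Assumes an additive model (an assumption of this sketch, not a
    method confirmed by the source text).
    """
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    if sxx == 0:
        raise ValueError("all samples share one genotype; slope is undefined")
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))
    slope = sxy / sxx
    return slope, mean_e - slope * mean_g


def mean_by_genotype(dosages, expression):
    """Mean expression per genotype group, as a box plot would summarize."""
    groups = defaultdict(list)
    for g, e in zip(dosages, expression):
        groups[g].append(e)
    return {g: sum(v) / len(v) for g, v in sorted(groups.items())}


# Made-up example values, not data from the study.
dosages = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 1.5, 1.7, 2.0, 2.2]

slope, intercept = additive_effect(dosages, expression)
group_means = mean_by_genotype(dosages, expression)
print(f"effect per allele: {slope:.2f}, baseline: {intercept:.2f}")
print(group_means)
```

A real analysis would also attach a p-value to each SNP-protein pair and correct for the very large number of pairs tested; that part is omitted here.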
Identification of protein–drug associations
To investigate potential associations between protein expression and drug response, we calculated the Spearman rank correlation between protein expression data and drug response from DrVAEN and cancerRxTissue. These two datasets employed distinct predictive models that integrated omics data from CCLE and drug response data from GDSC to predict drug response in TCGA samples (Fig. 2A) (4,5). Associations with |Rs| > 0.3 and FDR < 0.05 were considered significant in each cancer type.
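The filtering rule described above (|Rs| > 0.3 and FDR < 0.05) can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes the FDR is computed with the Benjamini-Hochberg procedure, which the text does not specify, and it takes the per-pair p-values as given. All function names are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed FDR method)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for offset, idx in enumerate(reversed(order)):
        k = m - offset  # 1-based rank of this p-value
        running_min = min(running_min, pvalues[idx] * m / k)
        adjusted[idx] = running_min
    return adjusted


def is_significant(rho, fdr, rho_cutoff=0.3, fdr_cutoff=0.05):
    """Apply the thresholds quoted in the text."""
    return abs(rho) > rho_cutoff and fdr < fdr_cutoff
```

For example, `is_significant(-0.45, 0.01)` returns `True`, while a pair with a strong correlation but an adjusted p-value above 0.05 is rejected.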

Figure 2. A, Workflow of GPIP to identify drug-associated proteins. B, The number of protein-drug response pairs identified from DrVAEN (left) and cancerRxTissue (right) for each cancer type. C, Visualization of the associations between proteins and drugs (DrVAEN) within and across different cancer signaling pathways. Blue links represent associations within a single pathway, while orange links represent associations across pathways. D, Enrichment analysis of drug target pathways among significant protein-drug response pairs. The color represents the log2 (odds ratio) of Fisher’s exact test. The size represents the FDR value.
Identification of protein–immune cell associations
To examine the relationship between protein expression and immune cell abundance, we used the Spearman rank correlation coefficient to calculate the associations between protein expression data and immune cell abundance data from TIMER, CIBERSORT, ImmuCellAI, and ImmuneCellGSVA (Fig. 3). These datasets used different methods to evaluate immune cell abundance by leveraging immune gene signatures as a proxy (6–9). We considered correlations with |Rs| > 0.3 and FDR < 0.05 as significant associations.

Figure 3. A, Workflow of GPIP to identify immune cell-associated proteins. B, The number of protein-immune cell pairs identified from ImmuneCellsGSVA (purple), ImmuCellAI (yellow), TIMER (red), and CIBERSORT (green) for each cancer type. C, The top 10 proteins with the highest number of significantly associated immune cell types in HNSC. The color represents the Rs between protein expression and immune cell abundance (ImmuneCellGSVA). The size represents the FDR value. D, Association between PREX1 expression and imputed MDSC abundance in HNSC patients.
Database construction
GPIP was developed using Python Flask-RESTful API frameworks (https://flask-restful.readthedocs.io/), AngularJS (https://angularjs.org), and Bootstrap (https://getbootstrap.com/). The database for GPIP was implemented using the NoSQL database program MongoDB (https://www.mongodb.com/). The user-friendly interface of the GPIP web application was served through the Apache HTTP Server, allowing users to access the database and perform queries and analysis through a web browser.
Data availability
All results generated in this study can be found in the GPIP database (https://hanlaboratory.com/GPIP). Publicly available data generated by others were used by the authors in this study: the genotype data and clinical data were obtained from The Cancer Genome Atlas (TCGA) data portal at https://tcga-data.nci.nih.gov/tcga/. The reverse-phase protein array (RPPA) protein expression data were obtained from The Cancer Proteome Atlas (TCPA) data portal at https://www.tcpaportal.org/. The imputed pharmacogenomic data were obtained from DrVAEN at https://bioinfo.uth.edu/drvaen/ and cancerRxTissue at https://manticore.niehs.nih.gov/cancerRxTissue/. The immune-cell infiltration data were obtained from Tumor Immune Estimation Resource (TIMER) at http://timer.cistrome.org/, Immune Cell Abundance Identifier (ImmuCellAI) at http://bioinfo.life.hust.edu.cn/ImmuCellAI/, and CIBERSORT at https://cibersort.stanford.edu/.
A comprehensive data portal
We developed a user-friendly data portal, GPIP (https://hanlaboratory.com/GPIP), to facilitate visualizing, searching, and browsing of our results by the biomedical research community (Fig. 4A). GPIP contains four main modules: Protein-QTLs, Survival-QTLs, Drug Response, and Immune Infiltration (Fig. 4B). Querying can be easily performed by selecting cancer type, protein, drug, immune cell abundance, or entering the SNP ID of interest (Fig. 4C). For example, in the Protein-QTLs and Survival-QTLs modules, users can search for pQTLs by selecting a cancer type (e.g., LUAD) and entering a protein name (e.g., CYCLINB1) or an SNP ID (e.g., rs12576855). In the Drug Response module, users can search for protein-drug response associations by selecting a data source for imputed drug response (e.g., DrVAEN) and selecting an anticancer drug (e.g., Talazoparib) or a protein (e.g., PARP1). In the Immune Infiltration module, users can search for protein-immune infiltration pairs by selecting a data source for imputed immune cell abundance (e.g., ImmuneCellsGSVA), and selecting an immune cell type (e.g., Activated B cell) or a protein (e.g., PDL1). In addition, at the bottom of the main page, we developed a cancer type module where users can click on a specific cancer type (e.g., BLCA) to search for related information across all four modules (Fig. 4D). The search results for each module include a table listing the related information (Fig. 4E). Clicking the “Details” button for a result item generates a box plot in the Protein-QTLs module (Fig. 4F), a Kaplan–Meier plot in the Survival-QTLs module (Fig. 4G), and a scatter plot in the Drug Response and Immune Infiltration modules (Fig. 4H, I). Our database provides a valuable resource for cancer research and will be of great interest to the research community.

Figure 4. A, GPIP homepage and browser bar. B, The four main modules of GPIP. C, Search boxes in the pQTLs module. D, Search boxes in the cancer type-specific search module. E, An example of the resulting list in the pQTL module. F, An example of a box plot for the pQTLs module result. G, An example of a Kaplan–Meier plot for the Survival protein-QTLs module result. H, An example of a scatter plot for the Drug Response module result. I, An example of a scatter plot for the Immune Infiltration module result.
Discussion
Proteomics plays a crucial role in identifying potential therapeutic strategies and understanding cancer pathophysiology (2). In this study, we investigated the effects of genetic variants on protein expression and characterized the impact of protein expression on imputed drug responses and immune cell abundances across human cancers. We also developed the user-friendly data portal, GPIP, to provide access to these results. Our study provides a comprehensive analysis of protein expression in different cancer types and their association with drug response and immune cell abundance.
Identifying genetic variants associated with cancer has revolutionized our understanding of the disease and holds promise for improved diagnosis and treatment. In GPIP, we identified ~100,000 pQTLs across 31 cancer types and 8.8% of them were found to be associated with patient survival (Fig. 1). These genetic variants hold significant promise for unraveling the underlying biological mechanisms of disease progression and response to treatments. For example, a survival-associated pQTL may help to identify a genetic variant that controls the expression of a protein crucial for tumor growth or immune response, thus impacting patient survival. Our results suggest that pQTLs have the potential to serve as prognostic biomarkers and aid in the development of precision medicine.
Despite the promising implications, it is crucial to consider potential limitations of pQTL identification. One limitation is the small number of tumor samples in rare cancers, which limits statistical power and the detection of significant pQTLs. For example, only 8 proteins with pQTLs were found in CHOL, likely due to the small sample size (Table S1). Additionally, we observed that some cancer types with large sample sizes yielded only a small number of pQTLs (e.g., BRCA), possibly due to the data quality of protein abundance. Tumors originating from different tissues may have variations in protein extraction quality or protein measurement accuracy (3). Furthermore, cancer type heterogeneity can impact pQTL identification, as tumors from different tissues exhibit distinct protein expression profiles and genetic landscapes. Addressing these limitations is necessary to ensure valid and reliable results.
Protein expression levels in tumors can impact response of cancer cells to therapeutic drugs due to their role as targets of drug action, with alterations in expression potentially modifying drug sensitivity or resistance. In GPIP, we utilized the imputed drug response and protein expression data in TCGA patients to identify the potential associations between protein expression and drug response (Fig. 2). Our results revealed that certain proteins were significantly associated with drug sensitivity or resistance, suggesting that protein expression levels could potentially be used as biomarkers to predict drug response in cancer patients. Recent studies have shown that the impact of genetic variants on drug response can be mediated through protein-protein interaction (PPI) networks (19,20). Integrating genetic variants and PPI to further understand the associations between protein expression and drug response may provide further insights.
The protein expression level in tumors is crucial in the context of tumor immune microenvironment and immunotherapy, as it might impact immune cell abundance and response, and potentially improve the efficacy of immunotherapy. In GPIP, we examined the association between protein expression levels and imputed immune cell abundance across multiple cancer types. Our study identified ~21,000 significant correlations between proteins and immune cell types, highlighting the potential role of protein expression levels in shaping the tumor immune microenvironment (Fig. 3). Our results offer a promising avenue for future research to understand the interplay between protein expression and the tumor immune microenvironment, leading to personalized immunotherapy strategies and better treatment outcomes for cancer patients.
In summary, GPIP is a comprehensive and multifaceted data platform designed to aid functional and clinical research on protein in cancer patients. As more relevant datasets become available, we will continually update GPIP to ensure its relevance and usefulness to the research community.
Researchers Produce First Map of Human Proteome, and Reveal New
Significance in The Human Proteome
Two international teams have
independently produced the first drafts of the human proteome. These curated
catalogs of the proteins expressed in most non-diseased human tissues and
organs can be used as a baseline to better understand changes that occur in
disease states. Their findings were published today (May 29) in Nature.
Both teams uncovered new complexities of the human genome, identifying novel
proteins from regions of the genome previously thought to be non-coding.
“The real breakthrough with these two projects is the comprehensive coverage of
more than 80 percent of the expected human proteome,” said Hanno Steen, director
of proteomics at Boston Children’s Hospital, who was not involved in the work.
The human proteome map provides a catalog of proteins expressed in nondiseased tissues and organs to use as a baseline in understanding changes that occur in disease.
Given the growing importance of proteins in medical laboratory testing,
- pathologists will want to know that drafts of the complete human proteome
- have been released to the public.
Experts are comparing this to the first complete map of the human genome
- and this information provides for rapid advances
- in understanding transcriptomics and metabolomics
Map of Human Proteome Expected to Advance Medical Science
“Housekeeping genes” that are expressed in all tissues and cell types
- have been thought to be involved in basic cellular functions.
Two teams developing a Human Proteome Map
- detected proteins encoded by 2,350 genes
- across all human cells and tissues.
The corresponding housekeeping proteins comprised
about 75% of total protein mass and include
- histones,
- ribosomal proteins,
- metabolic enzymes, and
- cytoskeletal proteins
The two international teams produced
- the first drafts of the human proteome,
- a catalog of proteins expressed in most
- nondiseased human tissues and organs.
The evidence suggests there is translation from DNA regions
- that were not thought to be translated, including
- more than 400 translated long intergenic non-coding RNAs (lincRNAs), found by the Küster team, and
- 193 new proteins, uncovered by the Pandey team.
This proteome map can be used as a baseline to understand
- changes that occur in the disease state
These studies are part of the Human Proteome Project,
- an international effort by the Human Proteome Organization
- to revolutionize our understanding of the human proteome
- by coordinating research at laboratories around the world directed
- at mapping the entire human proteome.
This new information about the human proteome
- is expected to trigger rapid advances in medical science
- and a better understanding of the underlying causes of human diseases.
One Study Team Was at Johns Hopkins University
- In one study, headed by Akhilesh Pandey, M.D., at Johns Hopkins University in Baltimore,
- and colleague Harsha Gowda, Ph.D., of the Institute of Bioinformatics in Bangalore, India,
- the research team used an advanced form of mass spectrometry to analyze proteins
- to create the human proteome map,
according to a report published in NIH Research Matters.
The research team examined
- 30 normal human tissue and cell types:
- 17 adult tissues,
- 7 fetal tissues, and
- 6 blood cell types.
Samples from three people per tissue type
- were processed through several steps.
The protein fragments, or peptides, were analyzed on
- high-resolution Fourier-transform mass spectrometers.
The amino acid sequences were
- then compared to known sequences.
Their results were published in the May 29, 2014, issue of Nature.
The resulting draft map of the human proteome includes
- proteins encoded by more than 17,000 genes,
- noted the Research Matters article.
Among these are hundreds of proteins from regions
- previously thought to be non-coding.
This study also provided a new understanding of
- how genes are expressed.
For example, almost 200 genes begin in locations
- other than those predicted based on genetic sequence.
“The fact that 193 of the proteins came from DNA sequences
- predicted to be non-coding means that
- we don’t fully understand how cells read DNA,
- since the sequences code for proteins.”
This study also produced the Human Proteome Map,
- an interactive online portal.
This can be accessed at http://www.humanproteomemap.org.
The study data will soon be accessible through
Germany’s ProteomicsDB Analyzed a Mix of Available and New Tissue Data
The other study was conducted by a team led by Bernhard Küster
of the Technische Universität München in Germany.
Küster and his colleagues created a
- searchable,
- public database called
- ProteomicsDB.
This database contains 92% of the
- estimated 19,629 human proteins,
noted The Scientist article.
Küster’s team also used mass spectrometry
- to analyze human tissue samples.
This team’s approach differed from Johns Hopkins’ in that
- it compiled about 60% of the information
- in the ProteomicsDB database
- by using existing raw mass spec (MS) data
- from databases and colleagues’ contributions.
To fill data gaps, the Küster lab generated its own
MS data after analyzing
- 60 human tissues,
- 13 body fluids, and
- 147 cancer cell lines.
High-resolution public data
- was selected and computationally processed
- for strict quality
The database for ProteomicsDB is
- public and searchable.
It can be accessed at https://www.proteomicsdb.org.
German Study Added New Insights to Transcription Process
Comparing the ratio of protein to mRNA levels for every protein globally,
- the Küster lab found that the translation rate
- is a constant feature of each mRNA transcript.
The proteomics community has viewed
- transcriptome and proteome data as two sides of a coin.
But this analysis shows that at least, at steady state,
- once the ratio for an mRNA/protein pair has been calculated,
- protein levels can be determined
- just from specific mRNA levels.
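The fixed-ratio idea above can be made concrete with a small sketch: estimate one protein-to-mRNA ratio for a gene from tissues where both were measured, then multiply a new mRNA level by that ratio to predict the protein level. The numbers are invented, the function names are hypothetical, and taking the median across tissues is an assumption of this sketch rather than the procedure the Küster lab reported.

```python
from statistics import median


def estimate_ratio(protein_levels, mrna_levels):
    """Per-gene protein/mRNA ratio, summarized as the median across tissues.

    Tissues with zero mRNA are skipped to avoid division by zero.
    """
    ratios = [p / m for p, m in zip(protein_levels, mrna_levels) if m > 0]
    if not ratios:
        raise ValueError("no tissue with non-zero mRNA level")
    return median(ratios)


def predict_protein(mrna_level, ratio):
    """Steady-state prediction: protein level = ratio * mRNA level."""
    return ratio * mrna_level


# Invented measurements for one gene in three reference tissues.
reference_protein = [200.0, 410.0, 590.0]
reference_mrna = [10.0, 20.0, 30.0]

ratio = estimate_ratio(reference_protein, reference_mrna)
predicted = predict_protein(15.0, ratio)
print(f"ratio = {ratio:.1f}, predicted protein level = {predicted:.1f}")
```

The prediction only holds under the steady-state condition noted above; it says nothing about tissues where translation or degradation rates shift.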
Proteomics researchers in Toronto and in Boston commented on the
importance of the findings, even calling them a “new paradigm” because of
- the fixed ratio of protein to mRNA.
This is quite in keeping with what we have been learning
- with respect to homeostasis.
In 2003, the Human Genome Project created a
- draft map of the human genome—
- all the genes in the human body.
Genomics has since driven many advances in medical science.
This was progress from the classic discovery of Watson and Crick:
- the classical dogma holds that
- DNA makes RNA makes protein.
- no constraints are placed on this
But the cell is functioning in contact with other cells,
- immersed in interstitial fluid
- maintaining cationic and anionic balance
- and mitochondrial energy balance and ubiquitin systems interact
- and protein interacts with the chromatin and transcriptional RNA
So the restriction that has been discovered has credence,
- the classical diagram has to be redrawn
Deeper Knowledge of Proteome to Improve Diagnostics and Therapeutics
The real breakthrough in the two projects is
- the comprehensive coverage of more than 80% of
- the expected human proteome.
These studies indicate that to get to
- a deep level of proteome coverage,
- many different tissue types must be probed.
The studies are complementary.
- The Hopkins group provided a survey of human proteins from a single source, which allows for easy comparisons within their data.
- The ProteomicsDB effort connected new information with existing data.
A deeper knowledge of the human proteome could help
- fill the gap between genomes and phenotypes.
As this occurs, it has the potential to transform
- the way diagnostics and therapeutics are developed,
- enhancing overall biomedical research and healthcare,
it was noted in a report presented to scientific leaders at a NIH workshop
- on advances in proteomics and its applications.
The completed draft map of the human proteome,
the set of all proteins in the human body,
- opens another window to cell function.
It has been ASSUMED –
- genes control the most basic functions of the cell,
- including what proteins to make and when.
- but we have assumed far too much in assigning
full control to the genome
Researchers have identified more than 20,000 protein-coding genes.
However, scientific understanding of the proteome has
- lagged behind that of the genome,
- partly because of the proteome’s complexities.
The relationship between genes and proteins isn’t a simple matter of
- one gene coding for one protein.
Stretches of DNA can be read and translated
- into proteins in different ways.
Proteins are also more difficult to sequence than genes.
The importance of these latest studies to pathologists and Ph.D.s working
- in molecular diagnostics laboratories is that
- this information will expedite further research into the human proteome.
Such research is expected to lead to
- novel methods of diagnosis and complex
- “multi-analyte” clinical laboratory tests that
- look for multiple proteins in a single assay.
“The prevalent view was that information transfer was from genome to transcriptome to proteome.
What these efforts show is that it’s a two-way road— proteomics can be used to annotate the genome.
The importance is that, using these datasets, we can improve the annotation of the genome and the
algorithms that predict transcription and translation,” said Steen. “The genomics field can now hugely
benefit from proteomics data.”
Wilhelm et al., “Mass-spectrometry-based draft of the human proteome,”
Nature, http://dx.doi.org/10.1038/nature13319, 2014.
M.S. Kim et al., “A draft map of the human proteome,”
Nature, http://dx.doi.org/10.1038/nature13302, 2014.
http://www.the-scientist.com/?articles.view/articleNo/40083/title/Human-Proteome-Mapped/
__Patricia Kirk
__by Harrison Wein, Ph.D.
__by Anna Azvolinsky
Related Information:
- Finding Treasure in “Junk” DNA:
http://www.nih.gov/researchmatters/september2012/09242012junk.htm
- All About The Human Genome Project:
http://www.genome.gov/10001772
- What is Proteomics?:
http://proteomics.cancer.gov/whatisproteomics
- Human Proteome Map:
http://www.humanproteomemap.org
- ProteomicsDB:
https://www.proteomicsdb.org
Reference: A draft map of the human proteome.
Kim MS, Pinto SM, Getnet D, Nirujogi RS, Manda SS, Donahue CA, Gowda H, Pandey A.
Nature. 2014 May 29;509(7502):575-81. http://dx.doi.org/10.1038/nature13302. PMID: 24870542
Funding: NIH’s National Institute of General Medical Sciences (NIGMS), National Cancer Institute (NCI),
and National Heart, Lung, and Blood Institute (NHLBI); the Sol Goldman Pancreatic Cancer Research Center;
India’s Council of Scientific and Industrial Research; and Wellcome Trust/DBT India Alliance.
http://nihprod.cit.nih.gov/researchmatters/june2014/06092014proteome.htm