Funding, Deals & Partnerships: BIOLOGICS & MEDICAL DEVICES; BioMed e-Series; Medicine and Life Sciences Scientific Journal – http://PharmaceuticalIntelligence.com
This session will provide information regarding methodologic and computational aspects of proteogenomic analysis of tumor samples, particularly in the context of clinical trials. Availability of comprehensive proteomic and matching genomic data for tumor samples characterized by the National Cancer Institute’s Clinical Proteomic Tumor Analysis Consortium (CPTAC) and The Cancer Genome Atlas (TCGA) program will be described, including data access procedures and informatic tools under development. Recent advances on mass spectrometry-based targeted assays for inclusion in clinical trials will also be discussed.
Amanda G Paulovich, Shankha Satpathy, Meenakshi Anurag, Bing Zhang, Steven A Carr
Methods and tools for comprehensive proteogenomic characterization of bulk tumor to needle core biopsies
Shankha Satpathy
TCGA covers 11,000 cancers with >20,000 somatic alterations but only 128 proteins, as proteomics was still a young field
CPTAC is the NCI proteomics effort
Chemical labeling is now the method of choice for quantitative proteomics
Looked at ovarian and breast cancers: to measure PTMs such as phosphorylation, sample preparation is critical
Data access and informatics tools for proteogenomics analysis
Bing Zhang
Raw and processed data (including raw MS data) with linked clinical data can be extracted from CPTAC
Python scripts are available for bioinformatic programming
Pathways to clinical translation of mass spectrometry-based assays
Meenakshi Anurag
· Using kinase inhibitor pulldown (KIP) assay to identify unique kinome profiles
· Found single strand break repair defects in endometrial luminal cases, especially with immune checkpoint prognostic tumors
· Paper: JNCI 2019 analyzed 20,000 genes for correlation with ET resistance in luminal B cases (selected a list of 30 genes)
· Validated in METABRIC dataset
· KIP assay uses magnetic beads to pull out kinases to determine druggable kinases
· Looked in xenografts and was able to pull out differential kinomes
· Matched with PDX data so good clinical correlation
· Were able to detect ESR1 fusion correlated with ER+ tumors
The adoption of omic technologies in the cancer clinic is giving rise to an increasing number of large-scale high-dimensional datasets recording multiple aspects of the disease. This creates the need for frameworks for translatable discovery and learning from such data. Like artificial intelligence (AI) and machine learning (ML) for the cancer lab, methods for the clinic need to (i) compare and integrate different data types; (ii) scale with data sizes; (iii) prove interpretable in terms of the known biology and batch effects underlying the data; and (iv) predict previously unknown experimentally verifiable mechanisms. Methods for the clinic, beyond the lab, also need to (v) produce accurate actionable recommendations; (vi) prove relevant to patient populations based upon small cohorts; and (vii) be validated in clinical trials. In this educational session we will present recent studies that demonstrate AI and ML translated to the cancer clinic, from prognosis and diagnosis to therapy.
NOTE: Dr. Fish’s talk is not eligible for CME credit to permit the free flow of information of the commercial interest employee participating.
Ron C. Anafi, Rick L. Stevens, Orly Alter, Guy Fish
Overview of AI approaches in cancer research and patient care
Rick L. Stevens
Deep learning is less likely to saturate as data increases
Deep learning attempts to learn multiple layers of information
The ultimate goal is prediction but this will be the greatest challenge for ML
ML models can integrate data validation and cross database validation
What limits the performance of cross validation is the internal noise of data (reproducibility)
Learning curves: what matters is not more data but more reproducible data
Neural networks can outperform classical methods
Important to measure validation accuracy during training. Class weighting can help when building the training set, especially for unbalanced data sets
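The class weighting point can be illustrated with a minimal sketch. It assumes the common inverse-frequency heuristic (weight = n_samples / (n_classes * class_count)), which is not necessarily the scheme used in the talk; the labels and function names are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def weighted_accuracy(y_true, y_pred, weights):
    """Accuracy in which each sample counts according to its class weight."""
    total = sum(weights[t] for t in y_true)
    correct = sum(weights[t] for t, p in zip(y_true, y_pred) if t == p)
    return correct / total


# Hypothetical unbalanced label set: 2 responders, 8 non-responders.
labels = ["responder"] * 2 + ["non_responder"] * 8
weights = balanced_class_weights(labels)

# A trivial model that always predicts the majority class.
majority_guess = ["non_responder"] * len(labels)
plain = sum(t == p for t, p in zip(labels, majority_guess)) / len(labels)
weighted = weighted_accuracy(labels, majority_guess, weights)
```

With these weights the majority-class guess no longer looks good: its plain accuracy is 0.8, but its class-weighted accuracy drops to 0.5.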
Discovering genome-scale predictors of survival and response to treatment with multi-tensor decompositions
Orly Alter
Finding patterns using SVD component analysis. Gene and SVD patterns match 1:1
Comparative spectral decompositions can be used for global datasets
Validation of CNV data using this strategy
Found Ras, Shh and Notch pathways with altered CNV in glioblastoma which correlated with prognosis
These predictors were significantly better than independent prognostic indicators such as age at diagnosis
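As a rough illustration of what an SVD pattern is, the sketch below extracts the dominant singular value and its gene and sample patterns from a small matrix by power iteration. This is plain single-matrix SVD, not the comparative or multi-tensor decompositions discussed in the talk; the toy matrix and all names are hypothetical, and in practice a library routine such as `numpy.linalg.svd` would be used.

```python
import math


def top_singular_triplet(matrix, iterations=200):
    """Power iteration on A^T A; returns (sigma, u, v) for the largest singular value."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # uniform starting vector
    for _ in range(iterations):
        # av = A v
        av = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
        # w = A^T (A v)
        w = [sum(matrix[i][j] * av[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    av = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in av))
    u = [x / sigma for x in av]
    return sigma, u, v


# Hypothetical genes x samples matrix built from a single underlying pattern.
gene_pattern = [1.0, 2.0, 3.0]
sample_pattern = [2.0, 1.0, 2.0, 1.0]
data = [[g * s for s in sample_pattern] for g in gene_pattern]

sigma, u, v = top_singular_triplet(data)
total_variation = sum(x * x for row in data for x in row)
fraction_captured = sigma ** 2 / total_variation
```

Here `u` is the pattern across genes and `v` the matching pattern across samples; because the toy matrix has rank one, the first component captures all of the variation.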
Identifying targets for cancer chronotherapy with unsupervised machine learning
Ron C. Anafi
Many clinicians have noticed that some patients do better when chemo is given at certain times of the day and felt there may be a circadian rhythm or chronotherapeutic effect with respect to side effects or with outcomes
ML used to determine if there is indeed this chronotherapy effect or can we use unstructured data to determine molecular rhythms?
Found circadian transcription in human lung
Most datasets in cancer come from a single clinical trial, so more trials may need to be conducted that take circadian rhythms into consideration
Stratifying patients by live-cell biomarkers with random-forest decision trees
Guy Fish CEO Cellanyx Diagnostics
Some clinicians feel we may be overdiagnosing and overtreating certain cancers, especially the indolent disease
This educational session focuses on the chronic wound healing, fibrosis, and cancer “triad.” It emphasizes the similarities and differences seen in these conditions and attempts to clarify why sustained fibrosis commonly supports tumorigenesis. Importance will be placed on cancer-associated fibroblasts (CAFs), vascularity, extracellular matrix (ECM), and chronic conditions like aging. Dr. Dvorak will provide an historical insight into the triad field focusing on the importance of vascular permeability. Dr. Stewart will explain how chronic inflammatory conditions, such as the aging tumor microenvironment (TME), drive cancer progression. The session will close with a review by Dr. Cukierman of the roles that CAFs and self-produced ECMs play in enabling the signaling reciprocity observed between fibrosis and cancer in solid epithelial cancers, such as pancreatic ductal adenocarcinoma.
Harold F Dvorak, Sheila A Stewart, Edna Cukierman
The importance of vascular permeability in tumor stroma generation and wound healing
Harold F Dvorak
Aging in the driver’s seat: Tumor progression and beyond
Sheila A Stewart
Why won’t CAFs stay normal?
Edna Cukierman
Tuesday, June 23
3:00 PM – 5:00 PM EDT
Other Articles on this Open Access Online Journal on Cancer Conferences and Conference Coverage in Real Time Include
Improving diagnostic yield in pediatric cancer precision medicine
Elaine R Mardis
The advent of genomics has revolutionized how we diagnose and treat lung cancer
We currently need to understand the driver mutations and variants so that we can personalize therapy
PD-L1 and other checkpoint therapies have not really been used in pediatric cancers even though CAR-T has been successful
The incidence rates and mortality rates of pediatric cancers are rising
A large-scale study of over 700 pediatric cancers shows cancers driven by epigenetic drivers or fusion proteins. Need for transcriptomics. The study also demonstrated that we have underestimated germline mutations and hereditary factors.
They put together a database to nominate patients on their IGM Cancer protocol. Involves genetic counseling and obtaining germ line samples to determine hereditary factors. RNA and protein are evaluated as well as exome sequencing. RNASeq and Archer Dx test to identify driver fusions
PECAN curated database from St. Jude used to determine driver mutations. They use multiple databases and overlap within these databases and knowledge base to determine or weed out false positives
They have used these studies to understand the immune infiltrate into recurrent cancers (CytoCure)
They found 40 germline cancer predisposition genes, 47 driver somatic fusion proteins, 81 potential actionable targets, 106 CNV, 196 meaningful somatic driver mutations
NCI is functioning well with respect to grant reviews, research, and general functions in spite of the COVID pandemic and the massive demonstrations, and is also focusing on the disparities which occur in the cancer research field and in cancer care
There are ongoing efforts at NCI to make a positive difference in racial injustice, diversity in the cancer workforce, and for patients as well
Need a diverse workforce across the cancer research and care spectrum
Data show that areas where the clinicians are successful in putting African Americans on clinical trials are areas (geographic and site specific) where health disparities are narrowing
Grants are available through NCI's new SeroNet for COVID-19 serologic testing, funded by two RFAs through NIAID (RFA-CA-30-038 and RFA-CA-20-039), which will close on July 22, 2020
Tuesday, June 23
12:45 PM – 1:46 PM EDT
Virtual Educational Session
Immunology, Tumor Biology, Experimental and Molecular Therapeutics, Molecular and Cellular Biology/Genetics
This educational session will update cancer researchers and clinicians about the latest developments in the detailed understanding of the types and roles of immune cells in tumors. It will summarize current knowledge about the types of T cells, natural killer cells, B cells, and myeloid cells in tumors and discuss current knowledge about the roles these cells play in the antitumor immune response. The session will feature some of the most promising up-and-coming cancer immunologists who will inform about their latest strategies to harness the immune system to promote more effective therapies.
Judith A Varner, Yuliya Pylayeva-Gupta
Introduction
Judith A Varner
New techniques reveal critical roles of myeloid cells in tumor development and progression
Different cell types, such as myeloid cells, are becoming targets for immune checkpoint therapy
In T cell-excluded or desert tumors, T cells are held at the periphery but myeloid cells can still infiltrate, so targeting macrophages might be effective in these T cell-naïve tumors; macrophages are the most abundant type of immune cell in tumors
CXCLs are potential targets
PI3K delta inhibitors reduce the infiltrate of myeloid tumor suppressor cells like macrophages
The issue is when we should give myeloid or T cell therapy
Judith A Varner
Novel strategies to harness T-cell biology for cancer therapy
Positive and negative roles of B cells in cancer
Yuliya Pylayeva-Gupta
New approaches in cancer immunotherapy: Programming bacteria to induce systemic antitumor immunity
There are numerous examples of highly successful covalent drugs such as aspirin and penicillin that have been in use for a long period of time. Despite historical success, there was a period of reluctance among many to pursue covalent drugs based on concerns about toxicity. With advances in understanding features of a well-designed covalent drug, new techniques to discover and characterize covalent inhibitors, and clinical success of new covalent cancer drugs in recent years, there is renewed interest in covalent compounds. This session will provide a broad look at covalent probe compounds and drug development, including a historical perspective, examination of warheads and electrophilic amino acids, the role of chemoproteomics, and case studies.
Benjamin F Cravatt, Richard A. Ward, Sara J Buhrlage
Discovering and optimizing covalent small-molecule ligands by chemical proteomics
Benjamin F Cravatt
Multiple approaches are being investigated to find new covalent inhibitors, such as: 1) cysteine reactivity mapping, 2) mapping cysteine ligandability, and 3) functional screening in phenotypic assays for electrophilic compounds
Using fluorescent activity probes in proteomic screens; these have broad usability across the proteome but can be specific
They screened quiescent versus stimulated T cells to determine reactive cysteines in a phenotypic screen and analyzed by MS proteomics (cysteine reactivity profiling); can quantitate 15,000 to 20,000 reactive cysteines
Isocitrate dehydrogenase 1 and adapter protein LCP-1 are two examples of changes in reactive cysteines they have seen using this method
They use scout molecules to target ligands or proteins with reactive cysteines
For phenotypic screens they first use a cytotoxic assay to screen out toxic compounds which just kill cells without causing T cell activation (like IL10 secretion)
Interestingly, coupling these MS reactive cysteine screens with phenotypic screens can reveal noncanonical mechanisms for many of these target proteins (many of the compounds hit targets which were not predicted or known)
Electrophilic warheads and nucleophilic amino acids: A chemical and computational perspective on covalent modifier
The covalent targeting of cysteine residues in drug discovery and its application to the discovery of Osimertinib
Richard A. Ward
Cysteine activation: thiolate form of cysteine is a strong nucleophile
Thiolate form preferred in polar environment
Activation can be assisted by neighboring residues; pKa will have an effect on deprotonation
pKas of cysteines vary in EGFR
Cysteines that are too reactive cause toxicity, while those that are not reactive enough are ineffective
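The link between cysteine pKa and the reactive thiolate form can be made concrete with the Henderson-Hasselbalch relation, fraction deprotonated = 1 / (1 + 10^(pKa - pH)). The sketch below is illustrative only: the pKa values are hypothetical and are not measurements for EGFR or any other protein from the talk.

```python
def thiolate_fraction(pka, ph=7.4):
    """Fraction of a cysteine side chain in the thiolate form (Henderson-Hasselbalch)."""
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


# Hypothetical pKa values: one lowered by the local environment,
# one equal to the pH, and one a unit above the pH.
activated = thiolate_fraction(6.4)
at_ph = thiolate_fraction(7.4)
unactivated = thiolate_fraction(8.4)
```

A one-unit drop in pKa below the pH puts roughly 91% of the residue in the nucleophilic thiolate form, against roughly 9% when the pKa sits one unit above the pH, which is one way to see why neighboring residues matter so much.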
Accelerating drug discovery with lysine-targeted covalent probes
This Educational Session aims to guide discussion on the heterogeneous cells and metabolism in the tumor microenvironment. It is now clear that the diversity of cells in tumors each require distinct metabolic programs to survive and proliferate. Tumors, however, are genetically programmed for high rates of metabolism and can present a metabolically hostile environment in which nutrient competition and hypoxia can limit antitumor immunity.
Jeffrey C Rathmell, Lydia Lynch, Mara H Sherman, Greg M Delgoffe
T-cell metabolism and metabolic reprogramming antitumor immunity
Jeffrey C Rathmell
Introduction
Jeffrey C Rathmell
Metabolic functions of cancer-associated fibroblasts
Mara H Sherman
Tumor microenvironment metabolism and its effects on antitumor immunity and immunotherapeutic response
Greg M Delgoffe
Multiple metabolites and reactive oxygen species are present within the tumor microenvironment; is there heterogeneity within the TME metabolome that can predict whether tumors are immunosensitive?
Took melanoma cells and looked at metabolism using Seahorse (glycolysis): there was vast heterogeneity in melanoma tumor cells; some only do oxphos with no glycolytic metabolism (inverse Warburg)
As they profiled whole tumors they could separate out the metabolism of each cell type within the tumor, compare T cells versus stromal CAFs or tumor cells, and characterize cells as indolent or metabolic
T cells from hypoglycolytic tumors were fine, but T cells from highly glycolytic tumors were more indolent
When the glucose transporter is knocked down, the cells become more glycolytic
Patients with high oxidative metabolism had low PDL1 sensitivity
Showed this result in head and neck cancer as well
Metformin is a complex 1 inhibitor that is not as toxic as most mito oxphos inhibitors; with it the T cells have less hypoxia and can remodel the TME and stimulate the immune response
Metformin now in clinical trials
T cells, though, seem metabolically restricted; T cells that infiltrate tumors have low mitochondrial oxidative phosphorylation
T cells from tumors have defective mitochondria or little respiratory capacity
They have some preliminary findings that metabolic inhibitors may help with CAR-T therapy
Obesity, lipids and suppression of anti-tumor immunity
Lydia Lynch
Hypothesis: obesity causes issues with anti tumor immunity
Fewer NK cells in obese people; they also produce less IFN gamma
RNASeq on NOD mice; granzymes and perforins were at the top of the list of genes downregulated in obesity
Upregulated genes were involved in lipid metabolism
All were PPAR target genes
NK cells from obese patients take up palmitate and this reduces their glycolysis, but OXPHOS is also reduced; they think increased FFA basically overloads mitochondria
Long recognized for their role in cancer diagnosis and prognostication, pathologists are beginning to leverage a variety of digital imaging technologies and computational tools to improve both clinical practice and cancer research. Remarkably, the emergence of artificial intelligence (AI) and machine learning algorithms for analyzing pathology specimens is poised to not only augment the resolution and accuracy of clinical diagnosis, but also fundamentally transform the role of the pathologist in cancer science and precision oncology. This session will discuss what pathologists are currently able to achieve with these new technologies, present their challenges and barriers, and overview their future possibilities in cancer diagnosis and research. The session will also include discussions of what is practical and doable in the clinic for diagnostic and clinical oncology in comparison to technologies and approaches primarily utilized to accelerate cancer research.
Jorge S Reis-Filho, Thomas J Fuchs, David L Rimm, Jayanta Debnath
Using old methods and new methods: for cell counting you first find the cells, then phenotype them; for quantification, as with AQUA, densitometry of the positive signal sets a threshold that determines the presence of a cell for counting
Hiplex versus multiplex imaging, where you have ten channels to measure by cycling fluors on antibodies (can get up to 20plex)
Hiplex can be coupled with mass spectrometry (imaging mass spectrometry, based on heavy metal tags on mAbs)
However, it will still take a trained pathologist to define regions of interest or the desired field of view
Introduction
Jayanta Debnath
Challenges and barriers of implementing AI tools for cancer diagnostics
Jorge S Reis-Filho
Implementing robust digital pathology workflows into clinical practice and cancer research
Jayanta Debnath
Invited Speaker
Thomas J Fuchs
Founder of spinout of Memorial Sloan Kettering
Separates AI from computational algorithms
Dealing with not just machines but integrating human intelligence
Making decision for the patients must involve human decision making as well
How do we get experts to do these decisions faster
AI in pathology: what is difficult? => sandbox scenarios where machines are great; curated datasets; human decision support systems or maps; or try to predict nature
1) learn rules made by humans (human-to-human scenario); 2) constrained nature; 3) unconstrained nature like images and/or behavior; 4) predict nature's response to itself
In sandbox scenario the rules are set in stone and machines are great like chess playing
In second scenario can train computer to predict what a human would predict
So third scenario is like driving cars
A system on constrained nature or a constrained dataset will take a long time for the computer to get to a decision
Fourth category is long term data collection project
He is finding it is still difficult to predict nature, so going from clinical finding to prognosis still does not have good predictability with AI alone; need for human involvement
End to end partnering (EPL) is a new way where humans can get more involved with the algorithm and assist with the problem of constrained data
An example of a workflow for pathology, from Campanella et al 2019 Nature Medicine, would be as follows: obtain digital images (they digitized a million slides), train on a massive data set with high-throughput computing (needed a lot of time and a big software development effort), and then train it using input from the best expert pathologists (nature to human and unconstrained because no data curation done)
Led to the first clinical-grade machine learning system (Camelyon16 was the challenge for detecting metastatic cells in lymph tissue; tested on 12,000 patients from 45 countries)
The first big hurdle was moving from manually annotated slides (which was a big bottleneck) to automatically extracted data from path reports.
Now problem is in prediction: How can we bridge the gap from predicting humans to predicting nature?
With an AI system, pathologists drastically improved their ability to detect very small lesions
Incidence rates of several cancers (e.g., colorectal, pancreatic, and breast cancers) are rising in younger populations, which contrasts with either declining or more slowly rising incidence in older populations. Early-onset cancers are also more aggressive and have different tumor characteristics than those in older populations. Evidence on risk factors and contributors to early-onset cancers is emerging. In this Educational Session, the trends and burden, potential causes, risk factors, and tumor characteristics of early-onset cancers will be covered. Presenters will focus on colorectal and breast cancer, which are among the most common causes of cancer deaths in younger people. Potential mechanisms of early-onset cancers and racial/ethnic differences will also be discussed.
Stacey A. Fedewa, Xavier Llor, Pepper Jo Schedin, Yin Cao
Cancers that are and are not increasing in younger populations
Stacey A. Fedewa
Early onset cancers, pediatric cancers and colon cancers are increasing in younger adults
Younger people are more likely to be uninsured and these are their most productive years, so it is a horrible life event for a young adult to be diagnosed with cancer. They will have more financial hardship, and most (70%) of the young adults with cancer have had financial difficulties. It is very hard for women as they are in their childbearing years, which adds stress
Types of early onset cancer vary by age as well as geographic location. For example, in the 20s thyroid cancer is more common but in the 30s it is breast cancer. Colorectal and testicular are most common in the US.
SCC is decreasing but adenocarcinoma of the cervix is increasing in women in their 40s, potentially due to changing sexual behaviors
Breast cancer is increasing in younger women: it may be etiologically distinct, like triple negative, with larger racial disparities in younger African American women
Increased obesity among younger people is becoming a factor in this increasing incidence of early onset cancers
Live Notes, Real Time Conference Coverage 2020 AACR Virtual Meeting April 28, 2020 Session on Evaluating Cancer Genomics from Normal Tissues Through Metastatic Disease 3:50 PM
Presenter/Authors
Kelly L. Bolton, Ryan N. Ptashkin, Teng Gao, Lior Braunstein, Sean M. Devlin, Minal Patel, Antonin Berthon, Aijazuddin Syed, Mariko Yabe, Catherine Coombs, Nicole M. Caltabellotta, Mike Walsh, Ken Offit, Zsofia Stadler, Choonsik Lee, Paul Pharoah, Konrad H. Stopsack, Barbara Spitzer, Simon Mantha, James Fagin, Laura Boucai, Christopher J. Gibson, Benjamin Ebert, Andrew L. Young, Todd Druley, Koichi Takahashi, Nancy Gillis, Markus Ball, Eric Padron, David Hyman, Jose Baselga, Larry Norton, Stuart Gardos, Virginia Klimek, Howard Scher, Dean Bajorin, Eder Paraiso, Ryma Benayed, Maria Arcilla, Marc Ladanyi, David Solit, Michael Berger, Martin Tallman, Montserrat Garcia-Closas, Nilanjan Chatterjee, Luis Diaz, Ross Levine, Lindsay Morton, Ahmet Zehir, Elli Papaemmanuil. Memorial Sloan Kettering Cancer Center, New York, NY, University of North Carolina at Chapel Hill, Chapel Hill, NC, University of Cambridge, Cambridge, United Kingdom, Dana-Farber Cancer Institute, Boston, MA, Washington University, St Louis, MO, The University of Texas MD Anderson Cancer Center, Houston, TX, Moffitt Cancer Center, Tampa, FL, National Cancer Institute, Bethesda, MD
Abstract
Recent studies among healthy individuals show evidence of somatic mutations in leukemia-associated genes, referred to as clonal hematopoiesis (CH). To determine the relationship between CH and oncologic therapy we collected sequential blood samples from 525 cancer patients (median sampling interval time = 23 months, range: 6-53 months) of whom 61% received cytotoxic therapy or external beam radiation therapy and 39% received either targeted/immunotherapy or were untreated. Samples were sequenced using deep targeted capture-based platforms. To determine whether CH mutational features were associated with therapy-related myeloid neoplasm (tMN) risk, we performed Cox proportional hazards regression on 9,549 cancer patients exposed to oncologic therapy of whom 75 cases developed tMN (median time to transformation=26 months). To further compare the genetic and clonal relationships between tMN and the preceding CH, we analyzed 35 cases for which paired samples were available. We compared the growth rate of the variant allele fraction (VAF) of CH clones across treatment modalities and in untreated patients. A significant increase in the growth rate of CH mutations was seen in DNA damage response (DDR) genes among those receiving cytotoxic (p=0.03) or radiation therapy (p=0.02) during the follow-up period compared to patients who did not receive therapy. Similar growth rates among treated and untreated patients were seen for non-DDR CH genes such as DNMT3A. Increasing cumulative exposure to cytotoxic therapy (p=0.01) and external beam radiation therapy (p=2×10^-8) resulted in higher growth rates for DDR CH mutations. Among 34 subjects with at least two CH mutations in which one mutation was in a DDR gene and one in a non-DDR gene, we studied competing clonal dynamics for multiple gene mutations within the same patient. The risk of tMN was positively associated with CH in a known myeloid neoplasm driver mutation (HR=6.9, p<10^-6), and increased with the total number of mutations and clone size.
The strongest associations were observed for mutations in TP53 and for CH with mutations in spliceosome genes (SRSF2, U2AF1 and SF3B1). Lower hemoglobin, lower platelet counts, lower neutrophil counts, higher red cell distribution width and higher mean corpuscular volume were all positively associated with increased tMN risk. Among 35 cases for which paired samples were available, in 19 patients (59%), we found evidence of at least one of these mutations at the time of pre-tMN sequencing and in 13 (41%), we identified two or more in the pre-tMN sample. In all cases the dominant clone at tMN transformation was defined by a mutation seen at CH. Our serial sampling data provide clear evidence that oncologic therapy strongly selects for clones with mutations in the DDR genes and that these clones have limited competitive fitness, in the absence of cytotoxic or radiation therapy. We further validate the relevance of CH as a predictor and precursor of tMN in cancer patients. We show that CH mutations detected prior to tMN diagnosis were consistently part of the dominant clone at tMN diagnosis and demonstrate that oncologic therapy directly promotes clones with mutations in genes associated with chemo-resistant disease such as TP53.
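The abstract compares the growth rate of the variant allele fraction (VAF) between serial samples but does not give the formula. One simple way to express such a rate, assuming exponential clone growth between two time points, is the log ratio of VAFs per year. The sketch below uses that assumption; the function name and the VAF values are hypothetical.

```python
import math


def vaf_growth_rate(vaf_start, vaf_end, months_between):
    """Exponential growth rate per year between two VAF measurements (assumed model)."""
    if vaf_start <= 0 or vaf_end <= 0 or months_between <= 0:
        raise ValueError("VAFs and sampling interval must be positive")
    years = months_between / 12.0
    return math.log(vaf_end / vaf_start) / years


# Hypothetical clones sampled 24 months apart.
expanding_clone = vaf_growth_rate(0.02, 0.08, 24)  # VAF quadrupled
stable_clone = vaf_growth_rate(0.05, 0.05, 24)     # VAF unchanged
```

A positive rate marks a clone expanding over the sampling interval and a rate near zero marks a stable clone, which is the kind of contrast the abstract draws between DDR and non-DDR mutations under therapy.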
therapy also resulted in clonal evolution; saw changes in splice variants and the spliceosome
therapy promotes existing DDR mutations
clonal hematopoiesis is due to selective pressures
mutations and number of variants are all predictive of myeloid disease
deferring adjuvant therapy for breast cancer patients in the highest MDS risk group, based on biomarkers, greatly reduced their risk for MDS
Presenter/Authors
Olivia W. Lee, Akash Mitra, Won-Chul Lee, Kazutaka Fukumura, Hannah Beird, Miles Andrews, Grant Fischer, John N. Weinstein, Michael A. Davies, Jason Huse, P. Andrew Futreal. The University of Texas MD Anderson Cancer Center, TX, Olivia Newton-John Cancer Research Institute and School of Cancer Medicine, La Trobe University, Australia
Disclosures
O.W. Lee: None. A. Mitra: None. W. Lee: None. K. Fukumura: None. H. Beird: None. M. Andrews: Merck Sharp and Dohme. G. Fischer: None. J.N. Weinstein: None. M.A. Davies: Bristol-Myers Squibb; Novartis; Array BioPharma; Roche and Genentech; GlaxoSmithKline; Sanofi-Aventis; AstraZeneca; Myriad Genetics; Oncothyreon. J. Huse: None. P. Futreal: None.
Abstract: Brain metastases (BM) occur in 10-30% of patients with cancer. Approximately 200,000 new cases of brain metastases are diagnosed in the United States annually, with median survival after diagnosis ranging from 3 to 27 months. Recently, studies have identified significant genetic differences between BM and their corresponding primary tumors. It has been shown that BM harbor clinically actionable mutations that are distinct from those in the primary tumor samples. Additional genomic profiling of BM will provide deeper understanding of the pathogenesis of BM and suggest new therapeutic approaches.
We performed whole-exome sequencing of BM and matched tumors from 41 patients collected from renal cell carcinoma (RCC), breast cancer, lung cancer, and melanoma, which are known to be more likely to develop BM. We profiled total 126 fresh-frozen tumor samples and performed subsequent analyses of BM in comparison to paired primary tumor and extracranial metastases (ECM). We found that lung cancer shared the largest number of mutations between BM and matched tumors (83%), followed by melanoma (74%), RCC (51%), and Breast (26%), indicating that cancer type with high tumor mutational burden share more mutations with BM. Mutational signatures displayed limited differences, suggesting a lack of mutagenic processes specific to BM. However, point-mutation heterogeneity revealed that BM evolve separately into different subclones from their paired tumors regardless of cancer type, and some cancer driver genes were found in BM-specific subclones. These models and findings suggest that these driver genes may drive prometastatic subclones that lead to BM. 32 curated cancer gene mutations were detected and 71% of them were shared between BM and primary tumors or ECM. 29% of mutations were specific to BM, implying that BM often accumulate additional cancer gene mutations that are not present in primary tumors or ECM. Co-mutation analysis revealed a high frequency of TP53 nonsense mutation in BM, mostly in the DNA binding domain, suggesting TP53 nonsense mutation as a possible prerequisite for the development of BM. Copy number alteration analysis showed statistically significant differences between BM and their paired tumor samples in each cancer type (Wilcoxon test, p < 0.0385 for all). Both copy number gains and losses were consistently higher in BM for breast cancer (Wilcoxon test, p =1.307e-5) and lung cancer (Wilcoxon test, p =1.942e-5), implying greater genomic instability during the evolution of BM.
Our findings highlight that there are more unique mutations in BM, with significantly higher copy number alterations and tumor mutational burden. These genomic analyses could provide an opportunity for more reliable diagnostic decision-making, and these findings will be further tested with additional transcriptomic and epigenetic profiling for better characterization of BM-specific tumor microenvironments.
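The shared versus BM-specific percentages in the abstract can be read as a set comparison between the mutations called in a brain metastasis and those called in its matched tumor. The sketch below assumes the denominator is the set of mutations found in the BM sample, which the abstract does not state; the mutation identifiers are placeholders, not real calls.

```python
def partition_bm_mutations(bm_mutations, matched_mutations):
    """Split BM mutations into those shared with the matched tumor and BM-specific ones."""
    bm = set(bm_mutations)
    matched = set(matched_mutations)
    return bm & matched, bm - matched


def shared_fraction(bm_mutations, matched_mutations):
    """Fraction of BM mutations also present in the matched tumor (assumed denominator)."""
    bm = set(bm_mutations)
    if not bm:
        raise ValueError("no BM mutations to compare")
    shared, _ = partition_bm_mutations(bm, matched_mutations)
    return len(shared) / len(bm)


# Placeholder identifiers for one hypothetical patient.
bm_calls = ["GENE_A:var1", "GENE_B:var1", "GENE_C:var1", "GENE_D:var1"]
primary_calls = ["GENE_A:var1", "GENE_B:var1", "GENE_C:var1", "GENE_E:var1"]

shared, bm_specific = partition_bm_mutations(bm_calls, primary_calls)
fraction = shared_fraction(bm_calls, primary_calls)
```

In this toy case three of four BM mutations are shared (0.75) and one is BM-specific; a different denominator, such as the union of both samples, would give different percentages.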
are genomic signatures different in brain mets versus non-metastatic or normal tissue?
32 genes from curated databases were different between brain mets and primary tumor
frequent nonsense mutations in TP53
divergent clonal evolution of drivers in brain mets from the primary
they were able to match BM with other mutational signatures, such as smoking and lung cancer signatures
Presenter/Authors
Peter Horak, Malachi Griffith, Arpad Danos, Beth A. Pitel, Subha Madhavan, Xuelu Liu, Jennifer Lee, Gordana Raca, Shirley Li, Alex H. Wagner, Shashikant Kulkarni, Obi L. Griffith, Debyani Chakravarty, Dmitriy Sonkin. National Center for Tumor Diseases, Heidelberg, Germany, Washington University School of Medicine, St. Louis, MO, Mayo Clinic, Rochester, MN, Georgetown University Medical Center, Washington, DC, Dana-Farber Cancer Institute, Boston, MA, Frederick National Laboratory for Cancer Research, Rockville, MD, University of Southern California, Los Angeles, CA, Sunquest, Boston, MA, Baylor College of Medicine, Houston, TX, Memorial Sloan Kettering Cancer Center, New York, NY, National Cancer Institute, Rockville, MD
Disclosures
P. Horak: None. M. Griffith: None. A. Danos: None. B.A. Pitel: None. S. Madhavan: Perthera Inc. X. Liu: None. J. Lee: None. G. Raca: None. S. Li: Sunquest Information Systems, Inc. A.H. Wagner: None. S. Kulkarni: Baylor Genetics. O.L. Griffith: None. D. Chakravarty: None. D. Sonkin: None.
Abstract
Somatic variants in cancer-relevant genes are interpreted from multiple partially overlapping perspectives. When considered in discovery and translational research endeavors, it is important to determine if a particular variant observed in a gene of interest is oncogenic/pathogenic or not, as such knowledge provides the foundation on which targeted cancer treatment research is based. In contrast, clinical applications are dominated by diagnostic, prognostic, or therapeutic interpretations which in part also depends on underlying variant oncogenicity/pathogenicity. The Association for Molecular Pathology, the American Society of Clinical Oncology, and the College of American Pathologists (AMP/ASCO/CAP) have published structured somatic variant clinical interpretation guidelines which specifically address diagnostic, prognostic, and therapeutic implications. These guidelines have been well-received by the oncology community.
Many variant knowledgebases, clinical laboratories/centers have adopted or are in the process of adopting these guidelines. The AMP/ASCO/CAP guidelines also describe different data types which are used to determine oncogenicity/pathogenicity of a variant, such as: population frequency, functional data, computational predictions, segregation, and somatic frequency. A second collaborative effort created the European Society for Medical Oncology (ESMO) Scale for Clinical Actionability of molecular Targets to provide a harmonized vocabulary that provides an evidence-based ranking system of molecular targets that supports their value as clinical targets. However, neither of these clinical guideline systems provide systematic and comprehensive procedures for aggregating population frequency, functional data, computational predictions, segregation, and somatic frequency to consistently interpret variant oncogenicity/pathogenicity, as has been published in the ACMG/AMP guidelines for interpretation of pathogenicity of germline variants. In order to address this unmet need for somatic variant oncogenicity/pathogenicity interpretation procedures, the Variant Interpretation for Cancer Consortium (VICC, a GA4GH driver project) Knowledge Curation and Interpretation Standards (KCIS) working group (WG) has developed a Standard Operating Procedure (SOP) with contributions from members of ClinGen Somatic Clinical Domain WG, and ClinGen Somatic/Germline variant curation WG using an approach similar to the ACMG/AMP germline pathogenicity guidelines to categorize evidence of oncogenicity/pathogenicity as very strong, strong, moderate or supporting. This SOP enables consistent and comprehensive assessment of oncogenicity/pathogenicity of somatic variants and latest version of an SOP can be found at https://cancervariants.org/wg/kcis/.
best to use this SOP for somatic mutations and not rearrangements
the SOP ranks the evidence for a variant's oncogenicity from strong to weak
useful variant knowledge on pathogenicity is curated from known databases
the recommendations would provide some guidelines on curating unknown somatic variants versus known variants of hereditary diseases
they have not curated RB1 mutations or variants (or for other RBs like RB2? p130?)
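The abstract describes categorizing evidence as very strong, strong, moderate or supporting and then aggregating it across data types. A minimal sketch of that kind of aggregation follows. The point weights, the `Evidence` class and the `net_score` function are hypothetical, chosen only for illustration; the SOP itself defines the actual criteria and combination rules.

```python
from dataclasses import dataclass

# Hypothetical weights for illustration only; the SOP defines its own rules.
STRENGTH_POINTS = {"very_strong": 8, "strong": 4, "moderate": 2, "supporting": 1}


@dataclass(frozen=True)
class Evidence:
    data_type: str   # e.g. "functional data", "somatic frequency"
    strength: str    # one of the keys of STRENGTH_POINTS
    oncogenic: bool  # True supports oncogenicity, False argues against it


def net_score(evidence):
    """Sum signed points over all evidence items for one variant."""
    total = 0
    for item in evidence:
        points = STRENGTH_POINTS[item.strength]
        total += points if item.oncogenic else -points
    return total


# Hypothetical evidence for a single somatic variant.
example = [
    Evidence("functional data", "strong", True),
    Evidence("somatic frequency", "moderate", True),
    Evidence("population frequency", "supporting", False),
]
print(net_score(example))  # 4 + 2 - 1 = 5
```

Keeping each evidence item as a separate record means the data type and direction stay visible after aggregation, which is useful when curators need to audit why a variant received its score.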
Live Notes, Real Time Conference Coverage 2020 AACR Virtual Meeting April 27, 2020 Minisymposium on AACR Project Genie & Bioinformatics 4:00 PM – 6:00 PM
April 27, 2020, 4:00 PM – 6:00 PM
Virtual Meeting: All Session Times Are U.S. EDT
Session Type
Virtual Minisymposium
Track(s)
Bioinformatics and Systems Biology
17 Presentations
4:00 PM – 6:00 PM
– Chairperson Gregory J. Riely. Memorial Sloan Kettering Cancer Center, New York, NY
4:00 PM – 4:01 PM
– Introduction Gregory J. Riely. Memorial Sloan Kettering Cancer Center, New York, NY
Precision medicine requires an end-to-end learning healthcare system, wherein the treatment decisions for patients are informed by the prior experiences of similar patients. Oncology is currently leading the way in precision medicine because the genomic and other molecular characteristics of patients and their tumors are routinely collected at scale. A major challenge to realizing the promise of precision medicine is that no single institution is able to sequence and treat sufficient numbers of patients to improve clinical-decision making independently. To overcome this challenge, the AACR launched Project GENIE (Genomics Evidence Neoplasia Information Exchange).
AACR Project GENIE is a publicly accessible international cancer registry of real-world data assembled through data sharing between 19 of the leading cancer centers in the world. Through the efforts of strategic partners Sage Bionetworks (https://sagebionetworks.org) and cBioPortal (www.cbioportal.org), the registry aggregates, harmonizes, and links clinical-grade, next-generation cancer genomic sequencing data with clinical outcomes obtained during routine medical practice from cancer patients treated at these institutions. The consortium and its activities are driven by openness, transparency, and inclusion, ensuring that the project output remains accessible to the global cancer research community for the benefit of all patients.
AACR Project GENIE fulfills an unmet need in oncology by providing the statistical power necessary to improve clinical decision-making, particularly in the case of rare cancers and rare variants in common cancers. Additionally, the registry can power novel clinical and translational research.
Because we collect data from nearly every patient sequenced at participating institutions and have committed to sharing only clinical-grade data, the GENIE registry contains enough high-quality data to power decision making on rare cancers or rare variants in common cancers. We see the GENIE data providing another knowledge turn in the virtuous cycle of research, accelerating the pace of drug discovery, improving the clinical trial design, and ultimately benefiting cancer patients globally.
The first set of cancer genomic data aggregated through AACR Project Genomics Evidence Neoplasia Information Exchange (GENIE) was available to the global community in January 2017. The seventh data set, GENIE 7.0-public, was released in January 2020 adding more than 9,000 records to the database. The combined data set now includes nearly 80,000 de-identified genomic records collected from patients who were treated at each of the consortium’s participating institutions, making it among the largest fully public cancer genomic data sets released to date. These data will be released to the public every six months. The public release of the eighth data set, GENIE 8.0-public, will take place in July 2020.
The combined data set now includes data for over 80 major cancer types, including data from greater than 12,500 patients with lung cancer, nearly 11,000 patients with breast cancer, and nearly 8,000 patients with colorectal cancer.
For more details about the data, analyses, and summaries of the data attributes from this release, GENIE 7.0-public, consult the data guide.
Users can access the data directly via cBioPortal, or download the data from Sage Bionetworks. Users will need to create an account for either site and agree to the terms of access.
For frequently asked questions, visit our FAQ page.
In fall of 2019, AACR announced the Bio Collaborative, which collects pan-cancer data in collaboration with, and with support from, a host of big pharma and biotech companies
they have a goal to expand to more than 6 cancer types and more than 50,000 records, including smoking habits, lifestyle data, etc.
they started with NSCLC and have done mutational analysis on these cases
tumor mutational burden is included, and cBioPortal can be used to explore the genomic data even further
treatment data is included as well
need to collect highly CURATED data with the PRISM backbone to get more than outcome data, such as progression data
they might look to incorporate digital pathology but they are not there yet; will need good artificial intelligence systems
4:01 PM – 4:15 PM
– Invited Speaker Gregory J. Riely. Memorial Sloan Kettering Cancer Center, New York, NY
4:15 PM – 4:20 PM
– Discussion
4:20 PM – 4:30 PM
1092 – A systematic analysis of BRAF mutations and their sensitivity to different BRAF inhibitors: Zohar Barbash, Dikla Haham, Liat Hafzadi, Ron Zipor, Shaul Barth, Arie Aizenman, Lior Zimmerman, Gabi Tarcic. Novellusdx, Jerusalem, Israel
Abstract: The MAPK-ERK signaling cascade is among the most frequently mutated pathways in human cancer, with the BRAF V600 mutation being the most common alteration. FDA-approved BRAF inhibitors as well as combination therapies of BRAF and MEK inhibitors are available and provide survival benefits to patients with a BRAF V600 mutation in several indications. Yet non-V600 BRAF mutations are found in many cancers and are even more prevalent than V600 mutations in certain tumor types. As the use of NGS profiling in precision oncology is becoming more common, novel alterations in BRAF are being uncovered. This has led to the classification of BRAF mutations, which is dependent on its biochemical properties and affects its sensitivity to inhibitors. Therefore, annotation of these novel variants is crucial for assigning correct treatment. Using a high throughput method for functional annotation of MAPK activity, we profiled 151 different BRAF mutations identified in the AACR Project GENIE dataset, and their response to 4 different BRAF inhibitors: vemurafenib and 3 different exploratory 2nd generation inhibitors. The system is based on rapid synthesis of the mutations and expression of the mutated protein together with fluorescently labeled reporters in a cell-based assay. Our results show that from the 151 different BRAF mutations, ~25% were found to activate the MAPK pathway. All of the class 1 and 2 mutations tested were found to be active, providing positive validation for the method. Additionally, many novel activating mutations were identified, some outside of the known domains. When testing the response of the active mutations to different classes of BRAF inhibitors, we show that while vemurafenib efficiently inhibited V600 mutations, other types of mutations and specifically BRAF fusions were not inhibited by this drug. Alternatively, the second-generation experimental inhibitors were effective against both V600 as well as non-V600 mutations.
Using this large-scale approach to characterize BRAF mutations, we were able to functionally annotate the largest number of BRAF mutations to date. Our results show that the number of activating variants is large and that they possess differential sensitivity to different types of direct inhibitors. This data can serve as a basis for rational drug design as well as more accurate treatment options for patients.
Molecular profiling is becoming imperative for successful targeted therapies
500 unique mutations in BRAF, so a bioinformatic pipeline is needed; start with NGS panels, then cluster according to different subtypes or class-specific patterns
certain mutations, like V600E, show distinct clustering by tumor type
25% of mutations co-occur with other mutations; mutations may not be functional; they used a high-throughput system to analyze other V600 BRAF mutations to determine whether they are functional
active yet uncharacterized BRAF mutations are seen in a major proportion of human tumors
using genomic drug data, they found that many inhibitors, like vemurafenib, are specific to a particular mutation, but other inhibitors that are not specific to a cleft can inhibit other BRAF mutants
40% of 135 mutants were functionally active
USE of Functional Profiling instead of just genomic profiling
Q: They have already used this platform and analysis successfully for RTKs and other genes as well
Q: how do you deal with co-occurring mutations? The platform is able to handle RTKs plus signaling proteins
4:30 PM – 4:35 PM
– Discussion
4:35 PM – 4:45 PM
1093 – Calibration Tool for Genomic Aggregates (CTGA): A deep learning framework for calibrating somatic mutation profiling data from conventional gene panel data. Jordan Anaya, Craig Cummings, Jocelyn Lee, Alexander Baras. Johns Hopkins Sidney Kimmel Comprehensive Cancer Center, MD, Genentech, Inc., CA, AACR, Philadelphia, PA
Abstract: It has been suggested that aggregate genomic measures such as mutational burden can be associated with response to immunotherapy. Arguably, the gold standard for deriving such aggregate genomic measures (AGMs) would be from exome level sequencing. While many clinical trials run exome level sequencing, the vast majority of routine genomic testing performed today, as seen in AACR Project GENIE, is targeted / gene-panel based sequencing.
Despite the smaller size of these gene panels focused on clinically targetable alterations, it has been shown they can estimate, to some degree, exomic mutational burden; usually by normalizing mutation count by the relevant size of the panels. These smaller gene panels exhibit significant variability both in terms of accuracy relative to exomic measures and in comparison to other gene panels. While many genes are common to the panels in AACR Project GENIE, hundreds are not. These differences in extent of coverage and genomic loci examined can result in biases that may negatively impact panel to panel comparability.
To address these issues we developed a deep learning framework to model exomic AGMs, such as mutational burden, from gene panel data as seen in AACR Project GENIE. This framework can leverage any available sample and variant level information, in which variants are featurized to effectively re-weight their importance when estimating a given AGM, such as mutational burden, through the use of multiple instance learning techniques in this form of weakly supervised data.
Using TCGA data in conjunction with AACR Project GENIE gene panel definitions, as a proof of concept, we first applied this framework to learn expected variant features such as codons and genomic position from mutational data (greater than 99.9% accuracy observed). Having established the validity of the approach, we then applied this framework to somatic mutation profiling data in which we show that data from gene panels can be calibrated to exomic TMB and thereby improve panel to panel compatibility. We observed approximately 25% improvements in mean squared error and R-squared metrics when using our framework over conventional approaches to estimate TMB from gene panel data across the 9 tumor types examined (spanning melanoma, lung cancer, colon cancer, and others). This work highlights the application of sophisticated machine learning approaches towards the development of needed calibration techniques across seemingly disparate gene panel assays used clinically today.
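The conventional approach the abstract compares against, normalizing mutation count by panel size, can be sketched as follows, together with the mean squared error used to score an estimate against exome-derived TMB. This is a minimal illustration and not the CTGA framework itself; the function names and all sample values are hypothetical.

```python
def tmb_per_megabase(mutation_count, panel_size_bp):
    """Conventional TMB estimate: mutations per megabase of sequenced territory."""
    if panel_size_bp <= 0:
        raise ValueError("panel_size_bp must be positive")
    return mutation_count * 1_000_000 / panel_size_bp


def mean_squared_error(estimates, references):
    """Average squared difference between panel estimates and exome values."""
    if len(estimates) != len(references) or not estimates:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - r) ** 2 for e, r in zip(estimates, references)]
    return sum(squared) / len(squared)


# Hypothetical samples: (mutations on panel, panel size in bp, exome TMB).
samples = [(12, 1_200_000, 9.0), (9, 600_000, 13.5), (3, 1_000_000, 4.0)]
panel_tmb = [tmb_per_megabase(count, size) for count, size, _ in samples]
exome_tmb = [exome for _, _, exome in samples]

print(panel_tmb)
print(mean_squared_error(panel_tmb, exome_tmb))
```

Because the same mutation count yields different estimates on panels of different size and gene content, a per-panel calibration step such as the one the abstract proposes is aimed at reducing exactly this error term.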
4:45 PM – 4:50 PM
– Discussion
4:50 PM – 5:00 PM
1094 – Genetic determinants of EGFR-driven lung cancer growth and therapeutic response in vivo. Giorgia Foggetti, Chuan Li, Hongchen Cai, Wen-Yang Lin, Deborah Ayeni, Katherine Hastings, Laura Andrejka, Dylan Maghini, Robert Homer, Dmitri A. Petrov, Monte M. Winslow, Katerina Politi. Yale School of Medicine, New Haven, CT, Stanford University School of Medicine, Stanford, CA, Stanford University School of Medicine, Stanford, CA, Yale School of Medicine, New Haven, CT, Stanford University School of Medicine, Stanford, CA, Yale School of Medicine, New Haven, CT
5:00 PM – 5:05 PM
– Discussion
5:05 PM – 5:15 PM
1095 – Comprehensive pan-cancer analyses of RAS genomic diversity. Robert Scharpf, Gregory Riely, Mark Awad, Michele Lenoue-Newton, Biagio Ricciuti, Julia Rudolph, Leon Raskin, Andrew Park, Jocelyn Lee, Christine Lovly, Valsamo Anagnostou. Johns Hopkins Sidney Kimmel Comprehensive Cancer Center, Baltimore, MD, Memorial Sloan Kettering Cancer Center, New York, NY, Dana-Farber Cancer Institute, Boston, MA, Vanderbilt-Ingram Cancer Center, Nashville, TN, Amgen, Inc., Thousand Oaks, CA, AACR, Philadelphia, PA
5:15 PM – 5:20 PM
– Discussion
5:20 PM – 5:30 PM
1096 – Harmonization standards from the Variant Interpretation for Cancer Consortium. Alex H. Wagner, Reece K. Hart, Larry Babb, Robert R. Freimuth, Adam Coffman, Yonghao Liang, Beth Pitel, Angshumoy Roy, Matthew Brush, Jennifer Lee, Anna Lu, Thomas Coard, Shruti Rao, Deborah Ritter, Brian Walsh, Susan Mockus, Peter Horak, Ian King, Dmitriy Sonkin, Subha Madhavan, Gordana Raca, Debyani Chakravarty, Malachi Griffith, Obi L. Griffith. Washington University School of Medicine, Saint Louis, MO, Reece Hart Consulting, CA, Broad Institute, Boston, MA, Mayo Clinic, Rochester, MN, Washington University School of Medicine, Saint Louis, MO, Washington University School of Medicine, Saint Louis, MO, Baylor College of Medicine, Houston, TX, Oregon Health and Science University, Portland, OR, National Cancer Institute, Bethesda, MD, Georgetown University, Washington, DC, The Jackson Laboratory for Genomic Medicine, Farmington, CT, National Center for Tumor Diseases, Heidelberg, Germany, University of Toronto, Toronto, ON, Canada, University of Southern California, Los Angeles, CA, Memorial Sloan Kettering Cancer Center, New York, NY
Abstract: The use of clinical gene sequencing is now commonplace, and genome analysts and molecular pathologists are often tasked with the labor-intensive process of interpreting the clinical significance of large numbers of tumor variants. Numerous independent knowledge bases have been constructed to alleviate this manual burden, however these knowledgebases are non-interoperable. As a result, the analyst is left with a difficult tradeoff: for each knowledgebase used the analyst must understand the nuances particular to that resource and integrate its evidence accordingly when generating the clinical report, but for each knowledgebase omitted there is increased potential for missed findings of clinical significance.
The Variant Interpretation for Cancer Consortium (VICC; cancervariants.org) was formed as a driver project of the Global Alliance for Genomics and Health (GA4GH; ga4gh.org) to address this concern. VICC members include representatives from several major somatic interpretation knowledgebases including CIViC, OncoKB, Jax-CKB, the Weill Cornell PMKB, the IRB-Barcelona Cancer Biomarkers Database, and others. Previously, the VICC built and reported on a harmonized meta-knowledgebase of 19,551 biomarker associations of harmonized variants, diseases, drugs, and evidence across the constituent resources.
In that study, we analyzed the frequency with which the tumor samples from the AACR Project GENIE cohort would match to harmonized associations. Variant matches increased dramatically from 57% to 86% when broader matching to regions describing categorical variants were allowed. Unlike precise sequence variants with specified alternate alleles, categorical variants describe a collection of potential variants with a common feature, such as “V600” (non-valine alleles at the 600 residue), “Exon 20 mutations” (all non-silent mutations in exon 20), or “Gain-of-function” (hypermorphic alterations that activate or amplify gene activity).
However, matching observed sequence variants to categorical variants is challenging, as the latter are typically only described as unstructured text. Here we describe the expressive and computational GA4GH Variation Representation specification (vr-spec.readthedocs.io), which we co-developed as members of the GA4GH Genomic Knowledge Standards work stream. This specification provides a schema for common, precise forms of variation (e.g. SNVs and Indels) and the method for computing identifiers from these objects. We highlight key aspects of the specification and our work to apply it to the characterization of categorical variation, showcasing the variant terminology and classification tools developed by the VICC to support this effort. These standards and tools are free, open-source, and extensible, overcoming barriers to standardized variant knowledge sharing and search.
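The abstract mentions a method for computing identifiers from variation objects. The GA4GH specification bases these on a truncated SHA-512 digest (`sha512t24u`) of a canonical serialization of the object. The sketch below shows only the digest step. The JSON record and the `example:` prefix are illustrative stand-ins, not the specification's actual schema, serialization rules, or identifier format.

```python
import base64
import hashlib
import json


def sha512t24u(blob: bytes) -> str:
    """SHA-512 digest, truncated to 24 bytes, base64url-encoded (32 characters)."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


# Illustrative stand-in for a variation object; NOT the VRS schema.
allele = {"type": "Allele", "gene": "BRAF", "protein_change": "V600E"}

# Sorting keys and fixing separators makes the serialization deterministic,
# so the same object always yields the same identifier.
blob = json.dumps(allele, sort_keys=True, separators=(",", ":")).encode("utf-8")

identifier = "example:" + sha512t24u(blob)  # hypothetical prefix
print(identifier)
```

Deriving the identifier from content rather than assigning it centrally means that two knowledgebases describing the same variant in the same normalized form arrive at the same key independently.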
store information from different databases by curating, classifying, and then harmonizing it into common values
harmonize each variant across the knowledgebases, at any level of evidence
29% of patients' variants matched when compared across many knowledgebases versus only 13% when using individual databases
they are also trying to curate the database so a variant will have one code instead of various RefSeq codes or protein codes
VICC is an open consortium
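Matching an observed sequence variant to the categorical variants described in the abstract (“V600”, “Exon 20 mutations”) can be sketched with simple rules. This is a minimal illustration, not the VICC tooling: the class, the helper names and the rule table are hypothetical, exon numbers are assumed to come from the caller's own annotation, and tying the exon 20 category to EGFR is an assumption made for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ProteinVariant:
    gene: str
    ref: str       # reference amino acid, one-letter code
    position: int  # residue number
    alt: str       # observed amino acid, one-letter code
    exon: int      # exon number, supplied by the caller's annotation


def parse_variant(gene, change, exon):
    """Parse a simple substitution such as 'V600E' (hypothetical helper)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", change)
    if match is None:
        raise ValueError(f"unsupported protein change: {change!r}")
    ref, position, alt = match.groups()
    return ProteinVariant(gene, ref, int(position), alt, exon)


def is_braf_v600(variant):
    # "V600": any non-valine allele at residue 600.
    return (variant.gene == "BRAF" and variant.position == 600
            and variant.ref == "V" and variant.alt != "V")


def is_egfr_exon20(variant):
    # "Exon 20 mutations": any non-silent change located in exon 20.
    return (variant.gene == "EGFR" and variant.exon == 20
            and variant.alt != variant.ref)


# Hypothetical rule table mapping a categorical label to its predicate.
CATEGORICAL_RULES = {
    "BRAF V600": is_braf_v600,
    "EGFR exon 20 mutations": is_egfr_exon20,
}


def matching_categories(variant):
    """Return the labels of every categorical variant this variant falls under."""
    return [label for label, rule in CATEGORICAL_RULES.items() if rule(variant)]


observed = [
    parse_variant("BRAF", "V600E", exon=15),
    parse_variant("BRAF", "V600K", exon=15),
    parse_variant("EGFR", "T790M", exon=20),
    parse_variant("EGFR", "L858R", exon=21),
]
for variant in observed:
    print(variant.gene, f"{variant.ref}{variant.position}{variant.alt}",
          matching_categories(variant))
```

Expressing each category as a structured predicate instead of free text is what allows a precise variant such as V600K to match knowledge curated under the broader “V600” label.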
5:30 PM – 5:35 PM
– Discussion
5:35 PM – 5:45 PM
1097 – FGFR2 in-frame indels: A novel targetable alteration in intrahepatic cholangiocarcinoma. Yvonne Y. Li, James M. Cleary, Srivatsan Raghavan, Liam F. Spurr, Qibiao Wu, Lei Shi, Lauren K. Brais, Maureen Loftus, Lipika Goyal, Anuj K. Patel, Atul B. Shinagare, Thomas E. Clancy, Geoffrey Shapiro, Ethan Cerami, William R. Sellers, William C. Hahn, Matthew Meyerson, Nabeel Bardeesy, Andrew D. Cherniack, Brian M. Wolpin. Dana-Farber Cancer Institute, Boston, MA, Dana-Farber Cancer Institute, Boston, MA, Massachusetts General Hospital, Boston, MA, Brigham and Women’s Hospital, Boston, MA, Dana-Farber Cancer Institute, Boston, MA, Dana-Farber Cancer Institute, Boston, MA, Broad Institute of MIT and Harvard, Cambridge, MA, Massachusetts General Hospital, Boston, MA
5:45 PM – 5:50 PM
– Discussion
5:50 PM – 6:00 PM
– Closing Remarks Gregory J. Riely. Memorial Sloan Kettering Cancer Center, New York, NY
Science 07 Jun 2019:
Vol. 364, Issue 6444, pp. 941-942
DOI: 10.1126/science.aaw8299
Precision medicine is at a crossroads. Progress toward its central goal, to address persistent health inequities, will depend on enrolling populations in research that have been historically underrepresented, thus eliminating longstanding exclusions from such research (1). Yet the history of ethical violations related to protocols for inclusion in biomedical research, as well as the continued misuse of research results (such as white nationalists looking to genetic ancestry to support claims of racial superiority), continue to engender mistrust among these populations (2). For precision medicine research (PMR) to achieve its goal, all people must believe that there is value in providing information about themselves and their families, and that their participation will translate into equitable distribution of benefits. This requires an ethics of inclusion that considers what constitutes inclusive practices in PMR, what goals and values are being furthered through efforts to enhance diversity, and who participates in adjudicating these questions. The early stages of PMR offer a critical window in which to intervene before research practices and their consequences become locked in (3).
Initiatives such as the All of Us program have set out to collect and analyze health information and biological samples from millions of people (1). At the same time, questions of trust in biomedical research persist. For example, although the recent assertions of white nationalists were eventually denounced by the American Society of Human Genetics (4), the misuse of ancestry testing may have already undermined public trust in genetic research.
There are also infamous failures in research that included historically underrepresented groups, including practices of deceit, as in the Tuskegee Syphilis Study, or the misuse of samples, as with the Havasupai tribe (5). Many people who are being asked to give their data and samples for PMR must not only reconcile such past research abuses, but also weigh future risks of potential misuse of their data.
To help assuage these concerns, ongoing PMR studies should open themselves up to research, conducted by social scientists and ethicists, that examines how their approaches enhance diversity and inclusion. Empirical studies are needed to account for how diversity is conceptualized and how goals of inclusion are operationalized throughout the life course of PMR studies. This is not limited to selection and recruitment of populations but extends to efforts to engage participants and communities, through data collection and measurement, and interpretations and applications of study findings. A commitment to transparency is an important step toward cultivating public trust in PMR’s mission and practices.
From Inclusion to Inclusive
The lack of diverse representation in precision medicine and other biomedical research is a well-known problem. For example, rare genetic variants may be overlooked—or their association with common, complex diseases can be misinterpreted—as a result of sampling bias in genetics research (6). Concentrating research efforts on samples with largely European ancestry has limited the ability of scientists to make generalizable inferences about the relationships among genes, lifestyle, environmental exposures, and disease risks, and thereby threatens the equitable translation of PMR for broad public health benefit (7).
However, recruiting for diverse research participation alone is not enough. As with any push for “diversity,” related questions arise about how to describe, define, measure, compare, and explain inferred similarities and differences among individuals and groups (8). In the face of ambivalence about how to represent population variation, there is ample evidence that researchers resort to using definitions of diversity that are heterogeneous, inconsistent, and sometimes competing (9). Varying approaches are not inherently problematic; depending on the scientific question, some measures may be more theoretically justified than others and, in many cases, a combination of measures can be leveraged to offer greater insight (10). For example, studies have shown that American adults who do not self-identify as white report better mental and physical health if they think others perceive them as white (11, 12).
The benefit of using multiple measures of race and ancestry also extends to genetic studies. In a study of hypertension in Puerto Rico, not only did classifications based on skin color and socioeconomic status better predict blood pressure than genetic ancestry, the inclusion of these sociocultural measures also revealed an association between a genetic polymorphism and hypertension that was otherwise hidden (13). Thus, practices that allow for a diversity of measurement approaches, when accompanied by a commitment to transparency about the rationales for chosen approaches, are likely to benefit PMR research more than striving for a single gold standard that would apply across all studies. These definitional and measurement issues are not merely semantic. They also are socially consequential to broader perceptions of PMR research and the potential to achieve its goals of inclusion.
Study Practices, Improve Outcomes
Given the uncertainty and complexities of the current, early phase of PMR, the time is ripe for empirical studies that enable assessment and modulation of research practices and scientific priorities in light of their social and ethical implications. Studying ongoing scientific practices in real time can help to anticipate unintended consequences that would limit researchers’ ability to meet diversity recruitment goals, address both social and biological causes of health disparities, and distribute the benefits of PMR equitably. We suggest at least two areas for empirical attention and potential intervention.
First, we need to understand how “upstream” decisions about how to characterize study populations and exposures influence “downstream” research findings of what are deemed causal factors. For example, when precision medicine researchers rely on self-identification with U.S. Census categories to characterize race and ethnicity, this tends to circumscribe their investigation of potential gene-environment interactions that may affect health. The convenience and routine nature of Census categories seemed to lead scientists to infer that the reasons for differences among groups were self-evident and required no additional exploration (9). The ripple effects of initial study design decisions go beyond issues of recruitment to shape other facets of research across the life course of a project, from community engagement and the return of results to the interpretation of study findings for human health.
Second, PMR studies are situated within an ecosystem of funding agencies, regulatory bodies, disciplines, and other scholars. This partly explains the use of varied terminology, different conceptual understandings and interpretations of research questions, and heterogeneous goals for inclusion. It also makes it important to explore how expectations related to funding and regulation influence research definitions of diversity and benchmarks for inclusion.
For example, who defines a diverse study population, and how might those definitions vary across different institutional actors? Who determines the metrics that constitute successful inclusion, and why? Within a research consortium, how are expectations for data sharing and harmonization reconciled with individual studies’ goals for recruitment and analysis? In complex research fields that include multiple investigators, organizations, and agendas, how are heterogeneous, perhaps even competing, priorities negotiated? To date, no studies have addressed these questions or investigated how decisions facilitate, or compromise, goals of diversity and inclusion.
The life course of individual studies and the ecosystems in which they reside cannot be easily separated and therefore must be studied in parallel to understand how meanings of diversity are shaped and how goals of inclusion are pursued. Empirically “studying the studies” will also be instrumental in creating mechanisms for transparency about how PMR is conducted and how trade-offs among competing goals are resolved. Establishing open lines of inquiry that study upstream practices may allow researchers to anticipate and address downstream decisions about how results can be interpreted and should be communicated, with a particular eye toward the consequences for communities recruited to augment diversity. Understanding how scientists negotiate the challenges and barriers to achieving diversity that go beyond fulfilling recruitment numbers is a critical step toward promoting meaningful inclusion in PMR.
Transparent Reflection, Cultivation of Trust
Emerging research on public perceptions of PMR suggests that although there is general support, questions of trust loom large. What we learn from studies that examine on-the-ground approaches aimed at enhancing diversity and inclusion, and how the research community reflects and responds with improvements in practices as needed, will play a key role in building a culture of openness that is critical for cultivating public trust.
Cultivating long-term, trusting relationships with participants underrepresented in biomedical research has been linked to a broad range of research practices. Some of these include the willingness of researchers to (i) address the effect of history and experience on marginalized groups’ trust in researchers and clinicians; (ii) engage concerns about potential group harms and risks of stigmatization and discrimination; (iii) develop relationships with participants and communities that are characterized by transparency, clear communication, and mutual commitment; and (iv) integrate participants’ values and expectations of responsible oversight beyond initial informed consent (14). These findings underscore the importance of multidisciplinary teams that include social scientists, ethicists, and policy-makers, who can identify and help to implement practices that respect the histories and concerns of diverse publics.
A commitment to an ethics of inclusion begins with a recognition that risks from the misuse of genetic and biomedical research are unevenly distributed. History makes plain that a multitude of research practices ranging from unnecessarily limited study populations and taken-for-granted data collection procedures to analytic and interpretive missteps can unintentionally bolster claims of racial superiority or inferiority and provoke group harm (15). Sustained commitment to transparency about the goals, limits, and potential uses of research is key to further cultivating trust and building long-term research relationships with populations underrepresented in biomedical studies.
As calls for increasing diversity and inclusion in PMR grow, funding and organizational pathways must be developed that integrate empirical studies of scientific practices and their rationales to determine how goals of inclusion and equity are being addressed and to identify where reform is required. In-depth, multidisciplinary empirical investigations of how diversity is defined, operationalized, and implemented can provide important insights and lessons learned for guiding emerging science, and in so doing, meet our ethical obligations to ensure transparency and meaningful inclusion.
A Nonlinear Methodology to Explain Complexity of the Genome and Bioinformatic Information, Volume 2 (Volume Two: Latest in Genomics Methodologies for Therapeutics: Gene Editing, NGS and BioInformatics, Simulations and the Genome Ontology), Part 1: Next Generation Sequencing (NGS)
A Nonlinear Methodology to Explain Complexity of the Genome and Bioinformatic Information
Reporter: Stephen J. Williams, Ph.D.
Multifractal bioinformatics: A proposal to the nonlinear interpretation of genome
The following is an open access article by Pedro Moreno on a methodology for analyzing genetic information across species, and in particular the evolutionary trends of complex genomes, using a nonlinear analytic approach based on fractal geometry, coined “Nonlinear Bioinformatics”. This fractal approach stems from the complex nature of higher eukaryotic genomes, including mosaicism and multiple interspersed genomic elements such as intronic regions, noncoding regions, and mobile elements such as transposable elements. Although seemingly random, these elements have a repetitive nature. Such complexity of DNA regulation, structure and genomic variation is thought to be best understood by developing algorithms based on fractal analysis. These can model the regionalized and repetitive variability and structure within complex genomes by elucidating the individual components that contribute to an overall complex structure, rather than taking a “linear” or “reductionist” approach that looks at individual coding regions and does not take into consideration the aforementioned factors leading to genetic complexity and diversity.
Indeed, there have been many other attempts to describe the complexities of DNA as a fractal geometric pattern. In the paper “Fractals and Hidden Symmetries in DNA”, Carlo Cattani uses fractal analysis to construct a simple geometric pattern of the influenza A virus by modeling the primary sequence of this viral DNA, namely the bases A, G, C, and T. The main conclusions that
fractal shapes and symmetries in DNA sequences and DNA walks have been shown and compared with random and deterministic complex series. DNA sequences are structured in such a way that there exists some fractal behavior which can be observed both on the correlation matrix and on the DNA walks. Wavelet analysis confirms by a symmetrical clustering of wavelet coefficients the existence of scale symmetries.
suggested that, at least, the viral influenza genome structure could be analyzed into its basic components by fractal geometry.
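The "DNA walk" mentioned in the quoted conclusions can be sketched in a few lines. This is a minimal illustration only: the purine/pyrimidine step rule (+1 for A or G, -1 for C or T) is a common convention assumed here, not necessarily the mapping used in the cited paper, and the function name `dna_walk` is hypothetical.

```python
import numpy as np

# Assumed step rule (illustrative): purines step up, pyrimidines step down.
STEP = {"A": 1, "G": 1, "C": -1, "T": -1}

def dna_walk(seq):
    """Return the cumulative sum of +1/-1 steps along a DNA sequence."""
    steps = np.array([STEP[base] for base in seq.upper() if base in STEP])
    return np.cumsum(steps)

# Toy sequence, made up for illustration.
walk = dna_walk("AGCTTAGGCA")
print(walk.tolist())
```

Plotting such a walk for a long sequence is one simple way to look for the scale-dependent structure the quotation refers to.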
This approach has been used to model the complex nature of cancer as discussed in a 2011 Seminars in Oncology paper
Abstract: Cancer is a highly complex disease due to the disruption of tissue architecture. Thus, tissues, and not individual cells, are the proper level of observation for the study of carcinogenesis. This paradigm shift from a reductionist approach to a systems biology approach is long overdue. Indeed, cell phenotypes are emergent modes arising through collective non-linear interactions among different cellular and microenvironmental components, generally described by “phase space diagrams”, where stable states (attractors) are embedded into a landscape model. Within this framework, cell states and cell transitions are generally conceived as mainly specified by gene-regulatory networks. However, the system's dynamics is not reducible to the integrated functioning of the genome-proteome network alone; the epithelia-stroma interacting system must be taken into consideration in order to give a more comprehensive picture. Given that cell shape represents the spatial geometric configuration acquired as a result of the integrated set of cellular and environmental cues, we posit that fractal-shape parameters represent “omics” descriptors of the epithelium-stroma system. Within this framework, function appears to follow form, and not the other way around.
As the authors conclude:
“Transitions from one phenotype to another are reminiscent of phase transitions observed in physical systems. The description of such transitions could be obtained by a set of morphological, quantitative parameters, like fractal measures. These parameters provide reliable information about system complexity.”
In a separate study, the authors describe that gene expression time series display fractal and long-range dependence characteristics.
Abstract: Gene expression is a vital process through which cells react to the environment and express functional behavior. Understanding the dynamics of gene expression could prove crucial in unraveling the physical complexities involved in this process. Specifically, understanding the coherent complex structure of transcriptional dynamics is the goal of numerous computational studies aiming to study and finally control cellular processes. Here, we report the scaling properties of gene expression time series in Escherichia coli and Saccharomyces cerevisiae. Unlike previous studies, which report the fractal and long-range dependency of DNA structure, we investigate the individual gene expression dynamics as well as the cross-dependency between them in the context of gene regulatory network. Our results demonstrate that the gene expression time series display fractal and long-range dependence characteristics. In addition, the dynamics between genes and linked transcription factors in gene regulatory networks are also fractal and long-range cross-correlated. The cross-correlation exponents in gene regulatory networks are not unique. The distribution of the cross-correlation exponents of gene regulatory networks for several types of cells can be interpreted as a measure of the complexity of their functional behavior.
Given that a multitude of complex biomolecular networks and biomolecules can be described by fractal patterns, the development of bioinformatic algorithms would enhance our understanding of the interdependence and cross-functionality of these multiple biological networks, particularly in disease and drug resistance. The article below by Pedro Moreno describes the development of such bioinformatic algorithms.
Pedro A. Moreno
Escuela de Ingeniería de Sistemas y Computación, Facultad de Ingeniería, Universidad del Valle, Cali, Colombia
E-mail: pedro.moreno@correounivalle.edu.co
Subject area: Systems engineering
Received: September 19, 2012
Accepted: December 16, 2013
Abstract
The first draft of the human genome (HG) sequence was published in 2001 by two competing consortia. Since then, several structural and functional characteristics of the HG organization have been revealed. Today, more than 2,000 HGs have been sequenced, and these findings are having a strong impact on academia and public health. Despite all this, a major bottleneck, called genome interpretation, persists: the lack of a theory that explains the complex puzzle of coding and non-coding features that compose the HG as a whole. Ten years after the HG was sequenced, two recent studies, framed in the multifractal formalism, allow proposing a nonlinear theory that helps interpret the structural and functional variation of the genetic information of genomes. The present review article discusses this new approach, called “Multifractal bioinformatics”.
Keywords: Omics sciences, bioinformatics, human genome, multifractal analysis.
1. Introduction
Omic Sciences and Bioinformatics
In order to study genomes, their life properties and the pathological consequences of their impairment, the Human Genome Project (HGP) was created in 1990. Since then, about 500 Gbp (EMBL), represented in thousands of prokaryotic genomes and tens of different eukaryotic genomes, have been sequenced (NCBI, 1000 Genomes, ENCODE). Today, genomics is defined as the set of sciences and technologies dedicated to the comprehensive study of the structure, function and origin of genomes. Several types of genomics have arisen as a result of the expansion and application of genomics to the study of the Central Dogma of Molecular Biology (CDMB), Figure 1 (above). The catalog of different types of genomics uses the Latin suffix “-omic”, meaning “set of”, to denote the new massive approaches of the omics sciences (Moreno et al, 2009). Given the large amount of genomic information available in the databases and the urgency of its actual interpretation, the balance has begun to lean heavily toward the bioinformatics infrastructure requirements of research laboratories, Figure 1 (below).
Bioinformatics, or computational biology, is defined as the application of computer and information technology to the analysis of biological data (Mount, 2004). It is an interdisciplinary science that requires the use of computing, applied mathematics, statistics, computer science, artificial intelligence, biophysical information, biochemistry, genetics, and molecular biology. Bioinformatics was born from the need to understand the sequences of nucleotide or amino acid symbols that make up DNA and proteins, respectively. These analyses are made possible by the development of powerful algorithms that predict and reveal a multitude of structural and functional features in genomic sequences, such as gene location, the discovery of homologies between macromolecules in databases (BLAST), phylogenetic analysis, regulatory analysis, or the prediction of protein folding, among others. This great development has created a multiplicity of approaches, giving rise to new types of bioinformatics, such as the Multifractal Bioinformatics (MFB) proposed here.
1.1 Multifractal Bioinformatics and Theoretical Background
MFB is a proposal to analyze the information content in genomes and their life properties in a non-linear way. It is part of a specialized sub-discipline called “nonlinear bioinformatics”, which uses a number of related techniques for the study of nonlinearity (fractal geometry, Hurst exponents, power laws, wavelets, among others) and applies them to the study of biological problems (http://pharmaceuticalintelligence.com/tag/fractal-geometry/). Its application requires detailed knowledge of the structure of the genome to be analyzed and appropriate knowledge of multifractal analysis.
1.2 From the Worm Genome toward Human Genome
To explore a complex genome such as the HG it is relevant to implement multifractal analysis (MFA) in a simpler genome in order to show its practical utility. For example, the genome of the small nematode Caenorhabditis elegans is an excellent model to learn many extrapolated lessons of complex organisms. Thus, if the MFA explains some of the structural properties in that genome it is expected that this same analysis reveals some similar properties in the HG.
The C. elegans nuclear genome is composed of about 100 Mbp, with six chromosomes distributed into five autosomes and one sex chromosome. The molecular structure of the genome is particularly homogeneous along the chromosome sequences, due to the presence of several regular features, including large contents of genes and introns of similar sizes. The C. elegans genome also has a regional organization of the chromosomes, mainly because the majority of the repeated sequences are located in the chromosome arms, Figure 2 (left) (C. elegans Sequencing Consortium, 1998). Given these regular and irregular features, the MFA could be an appropriate approach to analyze such distributions.
Meanwhile, the HG sequencing revealed a surprising mosaicism in coding (genes) and noncoding (repetitive DNA) sequences, Figure 2 (right) (Venter et al., 2001). This structure of 6 Gbp is divided into 23 pairs of chromosomes (diploid cells), and these highly regionalized sequences introduce complex patterns of regularity and irregularity that must be taken into account to understand gene structure, the composition of repetitive DNA sequences, and their role in the study and application of life sciences. The coding regions of the genome are estimated at ~25,000 genes, which constitute 1.4% of the HG. These genes are embedded in a giant sea of various types of non-coding sequences, which compose 98.6% of the HG (popularly misnamed “junk DNA”). The non-coding regions are characterized by many types of repeated DNA sequences, of which 10.6% consists of Alu sequences, a type of SINE (short interspersed nuclear element) preferentially located near genes. LINEs, MIR, MER, LTRs, DNA transposons and introns are other types of non-coding sequences, which form about 86% of the genome. Some of these sequences overlap with each other, as with CpG islands, which complicates the analysis of the genomic landscape. This standard genomic landscape was recently revised: recent studies show that 80.4% of the HG is functional, owing to the discovery of more than five million “switches” that operate and regulate gene activity, prompting a re-evaluation of the concept of “junk DNA” (The ENCODE Project Consortium, 2012).
Given that all these genomic variations both in worm and human produce regionalized genomic landscapes it is proposed that Fractal Geometry (FG) would allow measuring how the genetic information content is fragmented. In this paper the methodology and the nonlinear descriptive models for each of these genomes will be reviewed.
1.3 The MFA and its Application to Genome Studies
Most problems in physics are implicitly non-linear in nature and give rise to phenomena such as chaos. Chaos theory deals with certain types of non-linear dynamic systems that are highly sensitive to initial conditions yet strictly deterministic, that is, their behavior can be completely determined by knowing the initial conditions (Peitgen et al, 1992). In turn, FG is an appropriate tool for studying chaotic dynamic systems (CDS). In other words, FG and chaos are closely related because the region of space toward which a chaotic orbit tends asymptotically has a fractal structure (strange attractors). Therefore, FG allows studying the framework on which CDS are defined (Moon, 1992). This is how the genome structure and function are expected to be organized.
The MFA is an extension of the FG and it is related to (Shannon) information theory, disciplines that have been very useful to study the information content over a sequence of symbols. Initially, Mandelbrot established the FG in the 80’s, as a geometry capable of measuring the irregularity of nature by calculating the fractal dimension (D), an exponent derived from a power law (Mandelbrot, 1982). The value of the D gives us a measure of the level of fragmentation or the information content for a complex phenomenon. That is because the D measures the scaling degree that the fragmented self-similarity of the system has. Thus, the FG looks for self-similar properties in structures and processes at different scales of resolution and these self-similarities are organized following scaling or power laws.
Sometimes, a single exponent is not sufficient to characterize a complex phenomenon, so more exponents are required. The multifractal formalism allows this, and applies when many subgroups of fractals with different scaling properties, and hence a large number of exponents or fractal dimensions, coexist simultaneously. As a result, when a multifractal singularity spectrum is generated, the scaling behavior of the frequency of symbols in a sequence can be quantified (Vélez et al, 2010).
The MFA has been implemented to study the spatial heterogeneity of theoretical and experimental fractal patterns in different disciplines. In post-genomics times, the MFA has been used to study multiple biological problems (Vélez et al, 2010). Nonetheless, very little attention has been given to the use of MFA to characterize the structural genetic information content of genomes obtained from images of the Chaos Game Representation (CGR). The first studies at this level were made recently, analyzing the C. elegans genome (Vélez et al, 2010) and human genomes (Moreno et al, 2011). The MFA methodology applied to the study of these genomes is developed below.
2. Methodology
The Multifractal Formalism from the CGR
2.1 Data Acquisition and Molecular Parameters
Databases for C. elegans and the 36.2 Hs_ refseq HG version were downloaded from the NCBI FTP server. Then, several strategies were designed to fragment the genomic DNA sequences into different length ranges. For example, the C. elegans genome was divided into 18 fragments, Figure 2 (left), and the human genome into 9,379 fragments. According to their annotation systems, the contents of molecular parameters of coding sequences (genes, exons and introns), noncoding sequences (repetitive DNA, Alu, LINEs, MIR, MER, LTR, promoters, etc.) and coding/non-coding DNA (TTAGGC, AAAAT, AAATT, TTTTC, TTTTT, CpG islands, etc.) are counted for each sequence.
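The fragment-and-count step described above can be sketched as follows. This is a simplified illustration under stated assumptions: fragments are cut at a fixed size (the article used several strategies and length ranges), motifs are counted as plain overlapping substrings rather than taken from genome annotations, and the helper names `fragment`, `count_motif` and `motif_table` are hypothetical.

```python
def fragment(seq, size):
    """Split a sequence into consecutive fragments of at most `size` bases."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def count_motif(seq, motif):
    """Count overlapping occurrences of `motif` in `seq`."""
    count, start = 0, seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count

def motif_table(seq, size, motifs):
    """One dict of motif counts per fragment, in fragment order."""
    return [{m: count_motif(f, m) for m in motifs} for f in fragment(seq, size)]

# Toy sequence, made up for illustration (not genomic data).
toy = "TTAGGCTTAGGCAAAATCGCGTTTTT"
rows = motif_table(toy, 13, ["TTAGGC", "TTTTT", "CG"])
for i, row in enumerate(rows):
    print(i, row)
```

Each row of such a table plays the role of the "molecular parameters" that are later compared with the multifractal parameters of the same fragment.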
2.2 Construction of the CGR
Subsequently, the CGR, a recursive algorithm (Jeffrey, 1990; Restrepo et al, 2009), is applied to each selected DNA sequence, Figure 3 (above, left), and an image is obtained from it, which is quantified by the box-counting algorithm. For example, Figure 3 (above, left) shows a CGR image for a human DNA sequence of 80,000 bp in length. Here, dark regions represent sub-quadrants with a high number of points (or nucleotides), and clear regions represent sections with a low number of points. The calculation of D for the Koch curve by the box-counting method is illustrated by a progression of changes in the grid size and its Cartesian graph, Table 1.
2.3 Fractal Measurement by the Box Counting Method
The CGR image for a given DNA sequence is quantified by a standard fractal analysis. A fractal is a fragmented geometric figure whose parts are an approximate copy of the whole at a reduced scale, that is, the figure has self-similarity. The D is basically a scaling rule that the figure obeys. Generally, a power law is given by the following expression:
$$N(E) = E^{D} \qquad (1)$$
where N(E) is the number of parts required for covering the figure when a scaling factor E is applied. The power law allows the fractal dimension to be calculated as:
$$D = \frac{\ln N(E)}{\ln E} \qquad (2)$$
To obtain D, the box-counting algorithm covers the figure with disjoint boxes of size ɛ = 1/E and counts the number of boxes required. Figure 4 (above, left) shows the multifractal measure at momentum q=1.
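The CGR construction and the box-counting estimate of D can be sketched together. This is a minimal sketch, not the authors' implementation: the corner assignment (A, C, G, T on the corners of the unit square), the starting point at the center, the dyadic grid sizes E = 2, 4, ..., 32 and the function names are all assumptions made for illustration.

```python
import numpy as np

# Assumed corner convention for the unit square (illustrative).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq):
    """Chaos game: move halfway from the current point to the base's corner."""
    x, y = 0.5, 0.5
    pts = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        pts.append((x, y))
    return np.array(pts)

def box_count_dimension(points, levels=range(1, 6)):
    """Slope of ln N(E) against ln E, with E = 2**k boxes per side."""
    log_E, log_N = [], []
    for k in levels:
        E = 2 ** k
        cells = np.minimum((points * E).astype(int), E - 1)
        n_boxes = len(np.unique(cells, axis=0))  # occupied boxes N(E)
        log_E.append(np.log(E))
        log_N.append(np.log(n_boxes))
    return np.polyfit(log_E, log_N, 1)[0]

# Sanity check on a synthetic random sequence (not genomic data):
# it should fill the square, giving D close to 2.
rng = np.random.default_rng(0)
random_seq = "".join(rng.choice(list("ACGT"), size=50_000))
D = box_count_dimension(cgr_points(random_seq))
print(round(D, 3))
```

A real sequence with missing or rare k-mers leaves empty sub-quadrants in the CGR image, which is what lowers the measured D below 2.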
2.4 Multifractal Measurement
When generalizing the box-counting algorithm for the multifractal case and according to the method of moments q, we obtain the equation (3) (Gutiérrez et al, 1998; Yu et al, 2001):
$$\sum_{i=1}^{N(\varepsilon)} \left(\frac{M_i}{M_0}\right)^{q} \propto \varepsilon^{(q-1) D_q} \qquad (3)$$
where M_i is the number of points falling in the i-th grid box, M_0 is the total number of points, and ɛ is the box size. Thus, the MFA is used when multiple scaling rules are applied. Figure 4 (above, right) shows the calculation of the multifractal measures at different momentum q (partition function). Here, linear regressions must have a coefficient of determination equal or close to 1. From each linear regression a D is obtained, and together these generate a spectrum of generalized fractal dimensions Dq for all integers q, Figure 4 (below, left). So, the multifractal spectrum is obtained as the limit:
$$D_q = \frac{1}{q-1} \lim_{\varepsilon \to 0} \frac{\ln \sum_{i} \left(M_i/M_0\right)^{q}}{\ln \varepsilon} \qquad (4)$$
The variation of the integer q allows emphasizing different regions and discriminating their fractal properties. Positive q values emphasize the dense regions; a high Dq is synonymous with richness of structure and properties in these regions. Negative q values emphasize the scarce regions; a high Dq indicates a lot of structure and properties in these regions. In real-world applications, the limit Dq is readily approximated from the data using a linear fit: the transformation of equation (3) yields:
$$\ln \sum_{i} \left(\frac{M_i}{M_0}\right)^{q} = D_q\,(q-1) \ln \varepsilon + c \qquad (5)$$
where c is a constant. This shows that, for a fixed q, ln Σ(M_i/M_0)^q is a linear function of ln(ɛ); Dq can therefore be evaluated as the slope of the relationship between ln Σ(M_i/M_0)^q and (q-1) ln(ɛ). The methodologies and approaches for the box-counting method and MFA are detailed in Moreno et al, 2000; Yu et al, 2001; Moreno, 2005. For a rigorous mathematical development of MFA from images, consult the Wikipedia article “Multifractal system”.
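The method-of-moments estimate of Dq described above can be sketched as a linear fit per value of q. This is a sketch under assumptions: points are taken to lie in the unit square (as CGR points do), box sizes are dyadic, the q = 1 case is handled by its standard limit (the slope of Σ p ln p against ln ɛ), and the function name `generalized_dimensions` is hypothetical.

```python
import numpy as np

def generalized_dimensions(points, qs, levels=range(2, 7)):
    """Estimate D_q as the slope of ln sum_i (M_i/M_0)^q vs (q-1) ln(eps)."""
    points = np.asarray(points, dtype=float)
    M0 = len(points)
    Dq = {}
    for q in qs:
        xs, ys = [], []
        for k in levels:
            E = 2 ** k               # boxes per side
            eps = 1.0 / E            # box size
            cells = np.minimum((points * E).astype(int), E - 1)
            _, counts = np.unique(cells, axis=0, return_counts=True)
            p = counts / M0          # M_i / M_0 for occupied boxes
            if q == 1:
                # Limit q -> 1: information dimension.
                xs.append(np.log(eps))
                ys.append(np.sum(p * np.log(p)))
            else:
                xs.append((q - 1) * np.log(eps))
                ys.append(np.log(np.sum(p ** q)))
        Dq[q] = np.polyfit(xs, ys, 1)[0]
    return Dq

# Sanity check on synthetic uniform points (not genomic data):
# a uniform measure is monofractal, so every D_q should be close to 2.
rng = np.random.default_rng(1)
pts = rng.random((100_000, 2))
spectrum = generalized_dimensions(pts, qs=[-2, 0, 1, 2])
print({q: round(d, 3) for q, d in spectrum.items()})
```

For a genuinely multifractal point set the values of Dq decrease as q increases, and the spread between them is what the next section quantifies.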
2.5 Measurement of Information Content
Subsequently, from the spectrum of generalized dimensions Dq, the degree of multifractality ΔDq (MD) is calculated as the difference between the maximum and minimum values of Dq: $\Delta D_q = D_q^{\max} - D_q^{\min}$ (Ivanov et al, 1999). When ΔDq is high, the multifractal spectrum is rich in information and highly aperiodic; when ΔDq is small, the resulting dimension spectrum is poor in information and highly periodic. It is expected, then, that aperiodic regions of the genome would be related to highly polymorphic genomic structures, and periodic regions to highly repetitive and not very polymorphic genomic structures. The correlation exponent τ(q) = (q − 1)Dq, Figure 4 (below, right), can also be obtained from the multifractal dimension Dq. The generalized dimension also provides significant specific information. D(q = 0) is equal to the capacity dimension, which in this analysis is the box-counting dimension. D(q = 1) is equal to the information dimension and D(q = 2) to the correlation dimension. Based on these multifractal parameters, many of the structural genomic properties can be quantified, related, and interpreted.
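Given a spectrum of generalized dimensions, the degree of multifractality and the correlation exponent follow directly from their definitions. The sketch below uses a made-up spectrum purely for illustration (these are not values from the article), and the function names are hypothetical.

```python
def multifractality_degree(Dq):
    """Delta D_q: maximum minus minimum of the generalized dimensions."""
    values = list(Dq.values())
    return max(values) - min(values)

def correlation_exponent(Dq):
    """tau(q) = (q - 1) * D_q for each q in the spectrum."""
    return {q: (q - 1) * d for q, d in Dq.items()}

# Hypothetical spectrum, invented for illustration only.
Dq_example = {-2: 2.10, 0: 2.00, 1: 1.95, 2: 1.90}
delta = multifractality_degree(Dq_example)
tau = correlation_exponent(Dq_example)
print(round(delta, 2), tau)
```

Here D(0), D(1) and D(2) in the dictionary correspond to the capacity, information and correlation dimensions named above.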
2.6 Multifractal Parameters and Statistical and Discrimination Analyses
Once the multifractal parameters are calculated (Dq for q in [−20, 20], ΔDq, τq, etc.), correlations with the molecular parameters are sought. These relations are established by plotting the genome molecular parameters versus MD, using discriminant analysis with Cartesian graphs in 2-D, Figure 5 (below, left), and 3-D, and by combining multifractal and molecular parameters. Finally, simple linear regression analysis, multivariate analysis, and analyses by ranges and clusterings are carried out to establish statistical significance.
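The simple linear regression step can be sketched with NumPy. The numbers below are placeholders chosen so that the fit is exact; they are not measurements from the article, and the variable names (`alu_content`, `delta_dq`) and the helper `linear_fit` are hypothetical.

```python
import numpy as np

def linear_fit(x, y):
    """Least-squares line y = slope * x + intercept, plus R squared."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    return slope, intercept, r ** 2

# Placeholder values for five hypothetical fragments (not real data):
# a molecular parameter per fragment and its degree of multifractality.
alu_content = [0, 10, 20, 30, 40]
delta_dq = [0.5 + 0.01 * a for a in alu_content]

slope, intercept, r2 = linear_fit(alu_content, delta_dq)
print(round(slope, 4), round(intercept, 4), round(r2, 4))
```

With real fragments, R squared would indicate how much of the variation in multifractality a given molecular parameter accounts for.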
3 Results and Discussion
3.1 Non-linear Descriptive Model for the C. elegans Genome
When the C. elegans genome was analyzed with the multifractal formalism, it revealed what the symmetry and asymmetry of the genome nucleotide composition had suggested. Thus, the multifractal scaling of the C. elegans genome is of interest because it indicates that the molecular structure of the chromosome may be organized as a system operating far from equilibrium following nonlinear laws (Ivanov et al, 1999; Burgos and Moreno-Tovar, 1996). This can be discussed from two points of view:
1) When comparing C. elegans chromosomes with each other, the X chromosome showed the lowest multifractality, Figure 5 (above). This means that the X chromosome is operating close to equilibrium, which results in increased genetic instability. Thus, the instability of the X could selectively contribute to the molecular mechanism that determines sex (XX or X0) during meiosis. The X chromosome would therefore be operating closer to equilibrium in order to maintain its particular sexual dimorphism.
2) When comparing different chromosome regions of the C. elegans genome, changes in multifractality were found in relation to the regional organization (at the center and arms) exhibited by the chromosomes, Figure 5 (below, left). These behaviors are associated with changes in the content of repetitive DNA, Figure 5 (below, right). The results indicated that the chromosome arms are even more complex than previously anticipated. Thus, TTAGGC telomere sequences would be operating far from equilibrium to protect the genetic information encoded by the entire chromosome.
All these biological arguments may explain why C. elegans genome is organized in a nonlinear way. These findings provide insight to quantify and understand the organization of the non-linear structure of the C. elegans genome, which may be extended to other genomes, including the HG (Vélez et al, 2010).
3.2 Nonlinear Descriptive Model for the Human Genome
Once the multifractal approach was validated in C. elegans genome, HG was analyzed exhaustively. This allowed us to propose a nonlinear model for the HG structure which will be discussed under three points of view.
1) It was found that the high multifractality of the HG depends strongly on the content of Alu sequences and, to a lesser extent, on the content of CpG islands. These contents would be located primarily in highly aperiodic regions, thus taking the chromosome far from equilibrium and giving it greater genetic stability, protection and attraction of mutations, Figure 6 (A-C). Thus, hundreds of regions in the HG may have high genetic stability, and the most important genetic information of the HG, the genes, would be safeguarded from environmental fluctuations. Other repeated elements (LINEs, MIR, MER, LTRs) showed no significant relationship, Figure 6 (D). Consequently, the human multifractal map developed in Moreno et al, 2011 constitutes a good tool to identify those regions rich in genetic information and genomic stability.
2) The multifractal context seems to be a significant requirement for the structural and functional organization of thousands of genes and gene families. Thus, a high multifractal context (aperiodic) appears to be a “genomic attractor” for many genes (KOGs, KEGGs), Figure 6 (E), and for some gene families, Figure 6 (F), that are involved in genetic and deterministic processes, in order to maintain deterministic regulatory control in the genome, although most HG sequences may be subject to complex epigenetic control.
3) The classification of human chromosomes and the analysis of chromosome regions may have some medical implications (Moreno et al, 2002; Moreno et al, 2009). This means that the low nonlinearity exhibited by some chromosomes (or chromosome regions) involves an environmental predisposition, making them potential targets for structural or numerical chromosomal alterations, Figure 6 (G). Additionally, sex chromosomes should have low multifractality to maintain sexual dimorphism and probably X chromosome inactivation.
All these fractal and biological arguments could explain why Alu elements are shaping the HG in a nonlinear manner (Moreno et al, 2011). Finally, the multifractal modeling of the HG serves as a theoretical framework to examine new discoveries made by the ENCODE project and new approaches to human epigenomes. That is, the non-linear organization of the HG might help to explain why most of the HG is expected to be functional.
4. Conclusions
All these results show that the multifractal formalism is appropriate for quantifying and evaluating the genetic information content of genomes and for relating it to the known molecular anatomy of the genome and some of its expected properties. Thus, the MFB allows interpreting in a logical manner the structural nature and variation of the genome.
The MFB allows understanding why a number of chromosomal diseases are likely to occur in the genome, thus opening a new perspective toward personalized medicine to study and interpret the HG and its diseases.
The entire genome contains nonlinear information that organizes it and supposedly makes it function, leading to the conclusion that virtually 100% of the HG is functional. Bioinformatics in general is enriched with a novel approach (MFB) that makes it possible to quantify the genetic information content of any DNA sequence, with practical applications in different disciplines in biology, medicine and agriculture. This novel breakthrough in the computational analysis of genomes and diseases contributes to defining Biology as a “hard” science.
MFB opens a door to develop a research program towards the establishment of an integrative discipline that contributes to “break” the code of human life. (http://pharmaceuticalintelligence.com/page/3/).
5. Acknowledgements
Thanks to the directors of the EISC, the Universidad del Valle and the School of Engineering for offering an academic, scientific and administrative space for conducting this research. Likewise, thanks to the co-authors (professors and students) who participated in the work from which excerpts are cited here. Finally, thanks to Colciencias for biotechnology project grant # 1103-12-16765.
6. References
Blanco, S., & Moreno, P.A. (2007). Representación del juego del caos para el análisis de secuencias de ADN y proteínas mediante el análisis multifractal (método “box-counting”). In The Second International Seminar on Genomics and Proteomics, Bioinformatics and Systems Biology (pp. 17-25). Popayán, Colombia.
Burgos, J.D., & Moreno-Tovar, P. (1996). Zipf scaling behavior in the immune system. BioSystems, 39, 227-232.
C. elegans Sequencing Consortium. (1998). Genome sequence of the nematode C. elegans: a platform for investigating biology. Science, 282, 2012-2018.
Gutiérrez, J.M., Iglesias, A., Rodríguez, M.A., Burgos, J.D., & Moreno, P.A. (1998). Analyzing the multifractals structure of DNA nucleotide sequences. In M. Barbie & S. Chillemi (Eds.), Chaos and Noise in Biology and Medicine (chap. 4). Hackensack (NJ): World Scientific Publishing Co.
Jeffrey, H.J. (1990). Chaos game representation of gene structure. Nucleic Acids Research, 18, 2163-2175.
Mandelbrot, B. (1982). La geometría fractal de la naturaleza. Barcelona, España: Tusquets editores.
Moon, F.C. (1992). Chaotic and fractal dynamics. New York: John Wiley.
Moreno, P.A. (2005). Large scale and small scale bioinformatics studies on the Caenorhabditis elegans genome. Doctoral thesis. Department of Biology and Biochemistry, University of Houston, Houston, USA.
Moreno, P.A., Burgos, J.D., Vélez, P.E., Gutiérrez, J.M., et al. (2000). Multifractal analysis of complete genomes. In Proceedings of the 12th International Genome Sequencing and Analysis Conference (pp. 80-81). Miami Beach (FL).
Moreno, P.A., Rodríguez, J.G., Vélez, P.E., Cubillos, J.R., & Del Portillo, P. (2002). La genómica aplicada en salud humana. Colombia Ciencia y Tecnología. Colciencias, 20, 14-21.
Moreno, P.A., Vélez, P.E., & Burgos, J.D. (2009). Biología molecular, genómica y post-genómica. Pioneros, principios y tecnologías. Popayán, Colombia: Editorial Universidad del Cauca.
Moreno, P.A., Vélez, P.E., Martínez, E., Garreta, L., Díaz, D., Amador, S., Gutiérrez, J.M., et al. (2011). The human genome: a multifractal analysis. BMC Genomics, 12, 506.
Mount, D.W. (2004). Bioinformatics. Sequence and genome analysis. New York: Cold Spring Harbor Laboratory Press.
Peitgen, H.O., Jürgens, H., & Saupe, D. (1992). Chaos and Fractals. New Frontiers of Science. New York: Springer-Verlag.
Restrepo, S., Pinzón, A., Rodríguez, L.M., Sierra, R., Grajales, A., Bernal, A., Barreto, E., et al. (2009). Computational biology in Colombia. PLoS Computational Biology, 5 (10), e1000535.
The ENCODE Project Consortium. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature, 489, 57-74.
Vélez, P.E., Garreta, L.E., Martínez, E., Díaz, N., Amador, S., Gutiérrez, J.M., Tischer, I., & Moreno, P.A. (2010). The Caenorhabditis elegans genome: a multifractal analysis. Genet and Mol Res, 9, 949-965.
Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., et al. (2001). The sequence of the human genome. Science, 291, 1304-1351.
The Value of Prediction for Response to Immunotherapies: Genomic Approaches for the Advancement of Neo-Antigen Understanding in Immunotherapy
Reporter: Aviva Lev-Ari, PhD, RN
Genomic Approaches for the Advancement of Neo-Antigen Understanding in Immunotherapy
Recorded February 17, 2016
Webinar Description:
Preview:
Cancer immune therapies have recently demonstrated exciting clinical benefits for a number of cancer types. Somatic mutations in an individual’s cancer cells encode neoantigens. Clinical responses to cancer immune therapies including T cell transfer and checkpoint blockade are primarily mediated by neoantigen specific reactivity. Advances in next-generation sequencing and bioinformatics prediction allow for the rapid and affordable identification of neoantigens in individuals, which have profoundly impacted immuno-oncology drug development.
In this webinar, Dr. Victor Velculescu will highlight efforts that his group and colleagues at PGDx have pioneered for whole exome sequencing and neoantigen prediction. Dr. Drew Pardoll and lab have used this approach to identify intratumoral mutations in lung and colorectal cancer patients who have received anti-PD-1 immunotherapy. Dr. Theresa Zhang will describe how this approach, which utilizes a streamlined neoantigen prediction pipeline, ImmunoSelect™ R, allows for prioritization of thousands of epitopes that result from somatic mutations into a selection that are most likely to produce adaptive responses. These results and experiences will illustrate how correlates of a response to immunotherapy may better identify patients who will benefit from anti-PD-1 and other forms of immune therapy.
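To make the idea of turning somatic mutations into candidate epitopes concrete, the following is a minimal sketch and not the pipeline named in the webinar description. It assumes a single amino-acid substitution, a 0-based residue position, and a default window of 9 residues; the function name `mutant_peptides` and the toy protein string are hypothetical. A real workflow would go on to score each peptide for MHC binding, which is not shown here.

```python
def mutant_peptides(protein, position, alt_residue, length=9):
    """Return every peptide of `length` residues that spans the mutated position.

    protein      -- wild-type amino-acid sequence (one-letter codes)
    position     -- 0-based index of the substituted residue (assumption)
    alt_residue  -- the residue introduced by the somatic mutation
    """
    if not 0 <= position < len(protein):
        raise ValueError("position lies outside the protein")
    # Apply the substitution to obtain the mutated sequence.
    mutated = protein[:position] + alt_residue + protein[position + 1:]
    # Windows must start early enough to include the mutation and
    # late enough to stay inside the sequence.
    first = max(0, position - length + 1)
    last = min(position, len(mutated) - length)
    return [mutated[start:start + length] for start in range(first, last + 1)]


if __name__ == "__main__":
    toy_protein = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up sequence for illustration
    for peptide in mutant_peptides(toy_protein, 10, "W"):
        print(peptide)
```

Each substitution yields at most `length` overlapping peptides, which is one reason a tumour with many somatic mutations can produce thousands of candidates that then need to be prioritized.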
Learning Objectives:
Overview of Cancer Genomics
Understand how neoantigen prediction can adapt response to immunotherapy in certain populations
Learn more about PGDx technologies for advancing the value of prediction for response to immunotherapies
Speaker Information:
Dr. Victor Velculescu, M.D., Ph.D.
Professor of Oncology and Co-Director of Cancer Biology Johns Hopkins Kimmel Cancer Center
Founder, Personal Genome Diagnostics
Dr. Velculescu is Professor of Oncology and Co-Director of Cancer Biology at Johns Hopkins Kimmel Cancer Center and a co-founder of Personal Genome Diagnostics. He has a B.S. from Stanford University, and M.D., Ph.D. degrees from Johns Hopkins University.
Dr. Drew Pardoll, M.D., Ph.D.
Abeloff Professor of Oncology and Director of Cancer Immunology
Johns Hopkins Kimmel Cancer Center
Drew M. Pardoll, M.D., Ph.D., is an Abeloff Professor of Oncology, Medicine, Pathology and Molecular Biology and Genetics at the Johns Hopkins University School of Medicine. He is director of Cancer Immunology at the Sidney Kimmel Comprehensive Cancer Center. Pardoll completed his medical and doctorate degrees, and medical residency and oncology fellowship at Johns Hopkins University.
Theresa Zhang, Ph.D.
VP of Research Services
Personal Genome Diagnostics
Dr. Zhang received BS degrees from Peking University and Bridgewater College and a PhD from the University of Virginia. She completed a Postdoctoral Fellowship in bioinformatics at Cold Spring Harbor Laboratories. Dr. Zhang is a co-author of numerous scientific publications and a frequent presenter at scientific meetings.
The first priority cited by the vice president was data sharing. Biden defended the concept as essential to advancing the process of cancer research and countered a January 21 New England Journal of Medicine editorial in which editor-in-chief Jeffrey Drazen, M.D., contended that data sharing could breed data “parasites.”
Four days later, Dr. Drazen clarified NEJM’s position by adding that with “appropriate systems” in place, “we will require a commitment from authors to make available the data that underlie the reported results of their work within 6 months after we publish them.”
Other priorities Biden said should serve as the basis of new incentives:
Involve patients in clinical trial design—Raising awareness of trials, and allowing patients to participate in how they are designed and conducted, could help address the difficulty of recruiting patients for studies. Only 4% of cancer patients are involved in a trial, he said.
“Let scientists do science”—Biden contrasted unfavorably NIH’s roughly 1-year process for decisions on grants to that of the Prostate Cancer Foundation, which limits grant applications to 10 pages and decides on those funding requests within 30 days: “Why is it that it takes multiple submissions and more than a year to get an answer from us?” Biden said.
Encourage grants from younger researchers—Biden decried the current professional system under which younger researchers are sidetracked for years doing administrative work in labs before they can pursue their own research grants: “It’s like asking Derek Jeter to take several years off to sell bonds to build Yankee Stadium,” the VP quipped.
Measure progress by outcomes—Rather than the quantity of research papers generated by grants, Biden said, “what you propose and how it affects patients, it seems to me, should be the basis of whether you continue to get the grant.”
Promote open-access publication of results—Biden criticized academic publishing’s reliance on paid-subscription journals that block content behind paywalls and which own data for up to a year. He contrasted that system with the Bill and Melinda Gates Foundation’s stipulation that the research it funds be published in an open-access journal and be freely available once published.
Reward verification—Research that verifies results through replication should be encouraged, Biden said, while acknowledging that few people now get such funding.
Biden recalled how following Beau’s diagnosis with cancer, he and his wife Jill Biden, Ed.D., who introduced the VP at the AACR event, “had access to the best doctors in the world.”
“The more we talked to them, the more we understood that we are on the cusp of a real inflection point in the fight against cancer.”
Updated 4/12/2019
Pediatric Cancer Initiatives
Data Sharing for Pediatric Cancers: President Trump Announces Pledge to Fight Childhood Cancer Will Involve Genomic Data Sharing Effort
In the journal Science, Drs. Olena Morozova Vaske and David Haussler (University of California, Santa Cruz) recently wrote an editorial entitled “Data Sharing for Pediatric Cancers”, in which they discuss the implications of President Trump’s intentions to increase funding for pediatric cancers with a corresponding effort for genomic data sharing. Also discussed are the current efforts on pediatric genomic data sharing as well as some opinions on coordinating these efforts on a worldwide scale to benefit patients, researchers, and clinicians.
The article is found below as it is a very good read on the state of data sharing in the pediatric cancer field and offers some very good insights in designing such a worldwide system to handle this data sharing, including allowing patients governance over their own data.
Last month, in a conference call held by the U.S. Department of Health and Human Services and National Institutes of Health (NIH), it was revealed that a large focus of President Trump’s pledge to fund childhood cancer research will be genomic data sharing. Although the United States has only 5% of the world’s pediatric cancer cases, it has disproportionately more resources and access to genomic information compared to low-income countries. We hope that the spotlight on genomic data sharing in the United States will galvanize the world’s pediatric cancer community to elevate genomic data sharing to a level where its full potential can finally be realized.
Pediatric cancers are rare, affecting 50 to 200 children per million a year worldwide. Thus, with 16 different major types and many subtypes, no cancer center encounters large cohorts of patients with the same diagnosis. To advance their understanding of particular cancer subtypes, pediatric oncologists must have access to data from similar cases at other centers. Because subtypes of pediatric cancer are rare, assembling large cohorts is a limiting factor in clinical trials as well. Here, too, data sharing is the first critical step.
Typically, pediatric cancers don’t have the number of mutations that make immunotherapies effective, and only a few subtypes have recurrent mutations that can be used to develop gene-targeted therapies. However, the abnormal expression level of genes gives a vivid picture of genetic misregulation, and just sharing this information would be a huge step forward. Using gene expression and mutation data, analysis of genetic misregulation in different pediatric cancer subtypes could point the way to new treatments.
A major challenge in genomic data sharing is the patient’s young age, which frequently precludes an opportunity for informed consent. Compounding this, the rarity of subtypes requires the aggregation of patients from multiple jurisdictions, raising barriers to assembling large representative data sets. A greater percentage of children than adults with cancer participate in research studies, and children often participate in multiple studies. However, this means that data collected on individual children may be found at multiple institutions, creating difficulties if there are no standards for data sharing.
To enable effective sharing of genomic and clinical data, the Global Alliance for Genomics and Health has developed the Key Implications for Data Sharing (KIDS) framework for pediatric genomics. The recommendations include involving children in the data-sharing decision-making process and imposing an ethical obligation on data generators to provide children and parents with the opportunity to share genomic and clinical information with researchers. Although KIDS guidelines are not legally binding, they could inform policy development worldwide.
To advance the sharing culture, along with the NIH, pediatric cancer foundations such as the St. Baldrick’s Foundation and Alex’s Lemonade Stand Foundation have incorporated genomic data-sharing requirements into their grants processes. Researchers and clinicians around the world have created dozens of pediatric cancer genomic databases and portals, but pulling these together into a larger network is problematic, especially for patients with data at more than one institution, as patient identifiers are stripped from shared data. However, initiatives like the Children’s Oncology Group’s Project Every Child and the European Network for Cancer Research in Children and Adolescents’ Unified Patient Identity may resolve this issue.
We urge the creators of pediatric cancer genomic resources to collaborate and build a real-time federated data-sharing system, and hope that the new U.S. initiative will inspire other countries to link databases rather than just create new siloed regional resources. The great advances in information technology and life sciences in the last decades have given us a new opportunity to save our children from the scourge of cancer. We must resolve to use them.
Bioinformatic Tools for Cancer Mutational Analysis: COSMIC and Beyond, Volume 2 (Volume Two: Latest in Genomics Methodologies for Therapeutics: Gene Editing, NGS and BioInformatics, Simulations and the Genome Ontology), Part 1: Next Generation Sequencing (NGS)
Bioinformatic Tools for Cancer Mutational Analysis: COSMIC and Beyond
Curator: Stephen J. Williams, Ph.D.
Updated 7/26/2019
Updated 04/27/2019
Signatures of Mutational Processes in Human Cancer (from COSMIC)
The genomic landscape of cancer. COSMIC is a fully curated and annotated database of recurrent genetic mutations found in various cancers (data taken from cancer sequencing projects). For an interactive map, please go to the COSMIC database here: http://cancer.sanger.ac.uk/cosmic
Somatic mutations are present in all cells of the human body and occur throughout life. They are the consequence of multiple mutational processes, including the intrinsic slight infidelity of the DNA replication machinery, exogenous or endogenous mutagen exposures, enzymatic modification of DNA and defective DNA repair. Different mutational processes generate unique combinations of mutation types, termed “Mutational Signatures”.
The current set of mutational signatures is based on an analysis of 10,952 exomes and 1,048 whole-genomes across 40 distinct types of human cancer. These analyses are based on curated data that were generated by The Cancer Genome Atlas (TCGA), the International Cancer Genome Consortium (ICGC), and a large set of freely available somatic mutations published in peer-reviewed journals. Complete details about the data sources will be provided in future releases of COSMIC.
The profile of each signature is displayed using the six substitution subtypes: C>A, C>G, C>T, T>A, T>C, and T>G (all substitutions are referred to by the pyrimidine of the mutated Watson–Crick base pair). Further, each of the substitutions is examined by incorporating information on the bases immediately 5’ and 3’ to each mutated base generating 96 possible mutation types (6 types of substitution ∗ 4 types of 5’ base ∗ 4 types of 3’ base). Mutational signatures are displayed and reported based on the observed trinucleotide frequency of the human genome, i.e., representing the relative proportions of mutations generated by each signature based on the actual trinucleotide frequencies of the reference human genome version GRCh37. Note that only validated mutational signatures have been included in the curated census of mutational signatures.
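The 96-type classification described above can be sketched in a few lines. This is an illustrative example and not COSMIC's own code: the function names and the `A[C>T]G` label format are assumptions chosen for readability. It enumerates the 6 × 4 × 4 mutation types and shows how a substitution observed at a purine is re-expressed from the pyrimidine of the base pair by taking the reverse complement of its trinucleotide context.

```python
from itertools import product

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
# The six substitution subtypes, written from the pyrimidine of the mutated pair.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def all_mutation_types():
    """Enumerate the 96 types: 6 substitutions x 4 five-prime x 4 three-prime bases."""
    return [
        f"{five}[{sub}]{three}"
        for sub, five, three in product(SUBSTITUTIONS, BASES, BASES)
    ]


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def classify(trinucleotide, alt):
    """Label a substitution at the centre base of a reference trinucleotide."""
    if len(trinucleotide) != 3:
        raise ValueError("expected a three-base context")
    ref = trinucleotide[1]
    if ref == alt:
        raise ValueError("alternate base equals reference base")
    if ref in "AG":
        # Purine reference: describe the same event on the opposite strand,
        # so that the mutated base is always reported as C or T.
        trinucleotide = reverse_complement(trinucleotide)
        alt = COMPLEMENT[alt]
        ref = trinucleotide[1]
    return f"{trinucleotide[0]}[{ref}>{alt}]{trinucleotide[2]}"


if __name__ == "__main__":
    print(len(all_mutation_types()))
    print(classify("ACG", "T"))  # pyrimidine reference, used as given
    print(classify("CGT", "A"))  # purine reference, flipped to the other strand
```

Counting mutations per label across a sample gives the raw 96-element catalogue; the displayed signatures additionally account for trinucleotide frequencies in the reference genome, a step not shown here.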
Additional information is provided for each signature, including the cancer types in which the signature has been found, proposed aetiology for the mutational processes underlying the signature, other mutational features that are associated with each signature and information that may be relevant for better understanding of a particular mutational signature.
The set of signatures will be updated in the future. This will include incorporating additional mutation types (e.g., indels, structural rearrangements, and localized hypermutation such as kataegis) and cancer samples. With more cancer genome sequences and the additional statistical power this will bring, new signatures may be found, the profiles of current signatures may be further refined, signatures may split into component signatures and signatures may be found in cancer types in which they are currently not detected.
COSMIC v75 includes curations across GRIN2A, fusion pair TCF3-PBX1, and genomic data from 17 systematic screen publications. We are also beginning a reannotation of TCGA exome datasets using Sanger’s Cancer Genome Project analysis pipeline to ensure consistency; four studies are included in this release, to be expanded across the next few releases. The Cancer Gene Census now has a dedicated curator, Dr. Zbyslaw Sondka, who will be focused on expanding the Census, enhancing the evidence underpinning it, and developing improved expert-curated detail describing each gene’s impact in cancer. Finally, as we begin to streamline our ever-growing website, we have combined all information for each gene onto one page and simplified the layout and design to improve navigation.
Signature 1
Cancer types:
Signature 1 has been found in all cancer types and in most cancer samples.
Proposed aetiology:
Signature 1 is the result of an endogenous mutational process initiated by spontaneous deamination of 5-methylcytosine.
Additional mutational features:
Signature 1 is associated with small numbers of small insertions and deletions in most tissue types.
Comments:
The number of Signature 1 mutations correlates with age of cancer diagnosis.
Signature 2
Cancer types:
Signature 2 has been found in 22 cancer types, but most commonly in cervical and bladder cancers. In most of these 22 cancer types, Signature 2 is present in at least 10% of samples.
Proposed aetiology:
Signature 2 has been attributed to activity of the AID/APOBEC family of cytidine deaminases. On the basis of similarities in the sequence context of cytosine mutations caused by APOBEC enzymes in experimental systems, a role for APOBEC1, APOBEC3A and/or APOBEC3B in human cancer appears more likely than for other members of the family.
Additional mutational features:
Transcriptional strand bias of mutations has been observed in exons, but is not present or is weaker in introns.
Comments:
Signature 2 is usually found in the same samples as Signature 13. It has been proposed that activation of AID/APOBEC cytidine deaminases is due to viral infection, retrotransposon jumping or to tissue inflammation. Currently, there is limited evidence to support these hypotheses. A germline deletion polymorphism involving APOBEC3A and APOBEC3B is associated with the presence of large numbers of Signature 2 and 13 mutations and with predisposition to breast cancer. Mutations of similar patterns to Signatures 2 and 13 are commonly found in the phenomenon of local hypermutation present in some cancers, known as kataegis, potentially implicating AID/APOBEC enzymes in this process as well.
Signature 3
Cancer types:
Signature 3 has been found in breast, ovarian, and pancreatic cancers.
Proposed aetiology:
Signature 3 is associated with failure of DNA double-strand break-repair by homologous recombination.
Additional mutational features:
Signature 3 associates strongly with elevated numbers of large (longer than 3bp) insertions and deletions with overlapping microhomology at breakpoint junctions.
Comments:
Signature 3 is strongly associated with germline and somatic BRCA1 and BRCA2 mutations in breast, pancreatic, and ovarian cancers. In pancreatic cancer, responders to platinum therapy usually exhibit Signature 3 mutations.
Signature 4
Cancer types:
Signature 4 has been found in head and neck cancer, liver cancer, lung adenocarcinoma, lung squamous carcinoma, small cell lung carcinoma, and oesophageal cancer.
Proposed aetiology:
Signature 4 is associated with smoking and its profile is similar to the mutational pattern observed in experimental systems exposed to tobacco carcinogens (e.g., benzo[a]pyrene). Signature 4 is likely due to tobacco mutagens.
Additional mutational features:
Signature 4 exhibits transcriptional strand bias for C>A mutations, compatible with the notion that damage to guanine is repaired by transcription-coupled nucleotide excision repair. Signature 4 is also associated with CC>AA dinucleotide substitutions.
Comments:
Signature 29 is found in cancers associated with tobacco chewing and appears different from Signature 4.
Signature 5
Cancer types:
Signature 5 has been found in all cancer types and most cancer samples.
Proposed aetiology:
The aetiology of Signature 5 is unknown.
Additional mutational features:
Signature 5 exhibits transcriptional strand bias for T>C substitutions at ApTpN context.
Comments:
Signature 6
Cancer types:
Signature 6 has been found in 17 cancer types and is most common in colorectal and uterine cancers. In most other cancer types, Signature 6 is found in less than 3% of examined samples.
Proposed aetiology:
Signature 6 is associated with defective DNA mismatch repair and is found in microsatellite unstable tumours.
Additional mutational features:
Signature 6 is associated with high numbers of small (shorter than 3bp) insertions and deletions at mono/polynucleotide repeats.
Comments:
Signature 6 is one of four mutational signatures associated with defective DNA mismatch repair and is often found in the same samples as Signatures 15, 20, and 26.
Signature 7
Cancer types:
Signature 7 has been found predominantly in skin cancers and in cancers of the lip categorized as head and neck or oral squamous cancers.
Proposed aetiology:
Based on its prevalence in ultraviolet exposed areas and the similarity of the mutational pattern to that observed in experimental systems exposed to ultraviolet light Signature 7 is likely due to ultraviolet light exposure.
Additional mutational features:
Signature 7 is associated with large numbers of CC>TT dinucleotide mutations at dipyrimidines. Additionally, Signature 7 exhibits a strong transcriptional strand-bias indicating that mutations occur at pyrimidines (viz., by formation of pyrimidine-pyrimidine photodimers) and these mutations are being repaired by transcription-coupled nucleotide excision repair.
Comments:
Signature 8
Cancer types:
Signature 8 has been found in breast cancer and medulloblastoma.
Proposed aetiology:
The aetiology of Signature 8 remains unknown.
Additional mutational features:
Signature 8 exhibits weak strand bias for C>A substitutions and is associated with double nucleotide substitutions, notably CC>AA.
Comments:
Signature 9
Cancer types:
Signature 9 has been found in chronic lymphocytic leukaemias and malignant B-cell lymphomas.
Proposed aetiology:
Signature 9 is characterized by a pattern of mutations that has been attributed to polymerase η, which is implicated with the activity of AID during somatic hypermutation.
Additional mutational features:
Comments:
Chronic lymphocytic leukaemias that possess immunoglobulin gene hypermutation (IGHV-mutated) have elevated numbers of mutations attributed to Signature 9 compared to those that do not have immunoglobulin gene hypermutation.
Signature 10
Cancer types:
Signature 10 has been found in six cancer types, notably colorectal and uterine cancer, usually generating huge numbers of mutations in small subsets of samples.
Proposed aetiology:
It has been proposed that the mutational process underlying this signature is altered activity of the error-prone polymerase POLE. The presence of large numbers of Signature 10 mutations is associated with recurrent POLE somatic mutations, viz., Pro286Arg and Val411Leu.
Additional mutational features:
Signature 10 exhibits strand bias for C>A mutations at TpCpT context and T>G mutations at TpTpT context.
Comments:
Signature 10 is associated with some of the most mutated cancer samples. Samples exhibiting this mutational signature have been termed ultra-hypermutators.
Signature 11
Cancer types:
Signature 11 has been found in melanoma and glioblastoma.
Proposed aetiology:
Signature 11 exhibits a mutational pattern resembling that of alkylating agents. Patient histories have revealed an association between treatments with the alkylating agent temozolomide and Signature 11 mutations.
Additional mutational features:
Signature 11 exhibits a strong transcriptional strand-bias for C>T substitutions indicating that mutations occur on guanine and that these mutations are effectively repaired by transcription-coupled nucleotide excision repair.
Comments:
Signature 12
Cancer types:
Signature 12 has been found in liver cancer.
Proposed aetiology:
The aetiology of Signature 12 remains unknown.
Additional mutational features:
Signature 12 exhibits a strong transcriptional strand-bias for T>C substitutions.
Comments:
Signature 12 usually contributes a small percentage (<20%) of the mutations observed in a liver cancer sample.
Signature 13
Cancer types:
Signature 13 has been found in 22 cancer types and seems to be commonest in cervical and bladder cancers. In most of these 22 cancer types, Signature 13 is present in at least 10% of samples.
Proposed aetiology:
Signature 13 has been attributed to activity of the AID/APOBEC family of cytidine deaminases converting cytosine to uracil. On the basis of similarities in the sequence context of cytosine mutations caused by APOBEC enzymes in experimental systems, a role for APOBEC1, APOBEC3A and/or APOBEC3B in human cancer appears more likely than for other members of the family. Signature 13 causes predominantly C>G mutations. This may be due to generation of abasic sites after removal of uracil by base excision repair and replication over these abasic sites by REV1.
Additional mutational features:
Transcriptional strand bias of mutations has been observed in exons, but is not present or is weaker in introns.
Comments:
Signature 2 is usually found in the same samples as Signature 13. It has been proposed that activation of AID/APOBEC cytidine deaminases is due to viral infection, retrotransposon jumping or to tissue inflammation. Currently, there is limited evidence to support these hypotheses. A germline deletion polymorphism involving APOBEC3A and APOBEC3B is associated with the presence of large numbers of Signature 2 and 13 mutations and with predisposition to breast cancer. Mutations of similar patterns to Signatures 2 and 13 are commonly found in the phenomenon of local hypermutation present in some cancers, known as kataegis, potentially implicating AID/APOBEC enzymes in this process as well.
Signature 14
Cancer types:
Signature 14 has been observed in four uterine cancers and a single adult low-grade glioma sample.
Proposed aetiology:
The aetiology of Signature 14 remains unknown.
Additional mutational features:
Comments:
Signature 14 generates very high numbers of somatic mutations (>200 mutations per MB) in all samples in which it has been observed.
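The mutations-per-megabase figure quoted above is a simple ratio of mutation count to sequenced territory. The sketch below is illustrative only; the function name and the example numbers are hypothetical and are not taken from COSMIC. It assumes the caller knows how many bases were callable in the sample.

```python
def mutations_per_megabase(mutation_count, callable_bases):
    """Return the somatic mutation burden in mutations per megabase (Mb)."""
    if callable_bases <= 0:
        raise ValueError("callable_bases must be positive")
    return mutation_count / (callable_bases / 1_000_000)


if __name__ == "__main__":
    # Made-up example: 9,000 somatic mutations over 30 Mb of callable sequence.
    burden = mutations_per_megabase(9_000, 30_000_000)
    print(burden, burden > 200)
```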
Signature 15
Cancer types:
Signature 15 has been found in several stomach cancers and a single small cell lung carcinoma.
Proposed aetiology:
Signature 15 is associated with defective DNA mismatch repair.
Additional mutational features:
Signature 15 is associated with high numbers of small (shorter than 3bp) insertions and deletions at mono/polynucleotide repeats.
Comments:
Signature 15 is one of four mutational signatures associated with defective DNA mismatch repair and is often found in the same samples as Signatures 6, 20, and 26.
Signature 16
Cancer types:
Signature 16 has been found in liver cancer.
Proposed aetiology:
The aetiology of Signature 16 remains unknown.
Additional mutational features:
Signature 16 exhibits an extremely strong transcriptional strand bias for T>C mutations at ApTpN context, with T>C mutations occurring almost exclusively on the transcribed strand.
Comments:
Signature 17
Cancer types:
Signature 17 has been found in oesophagus cancer, breast cancer, liver cancer, lung adenocarcinoma, B-cell lymphoma, stomach cancer and melanoma.
Proposed aetiology:
The aetiology of Signature 17 remains unknown.
Additional mutational features:
Comments:
Signature 18
Cancer types:
Signature 18 has been found commonly in neuroblastoma. Additionally, Signature 18 has been also observed in breast and stomach carcinomas.
Proposed aetiology:
The aetiology of Signature 18 remains unknown.
Additional mutational features:
Comments:
Signature 19
Cancer types:
Signature 19 has been found only in pilocytic astrocytoma.
Proposed aetiology:
The aetiology of Signature 19 remains unknown.
Additional mutational features:
Comments:
Signature 20
Cancer types:
Signature 20 has been found in stomach and breast cancers.
Proposed aetiology:
Signature 20 is believed to be associated with defective DNA mismatch repair.
Additional mutational features:
Signature 20 is associated with high numbers of small (shorter than 3bp) insertions and deletions at mono/polynucleotide repeats.
Comments:
Signature 20 is one of four mutational signatures associated with defective DNA mismatch repair and is often found in the same samples as Signatures 6, 15, and 26.
Signature 21
Cancer types:
Signature 21 has been found only in stomach cancer.
Proposed aetiology:
The aetiology of Signature 21 remains unknown.
Additional mutational features:
Comments:
Signature 21 is found only in four samples all generated by the same sequencing centre. The mutational pattern of Signature 21 is somewhat similar to the one of Signature 26. Additionally, Signature 21 is found only in samples that also have Signatures 15 and 20. As such, Signature 21 is probably also related to microsatellite unstable tumours.
Signature 22
Cancer types:
Signature 22 has been found in urothelial (renal pelvis) carcinoma and liver cancers.
Proposed aetiology:
Signature 22 has been found in cancer samples with known exposures to aristolochic acid. Additionally, the pattern of mutations exhibited by the signature is consistent with the one previously observed in experimental systems exposed to aristolochic acid.
Additional mutational features:
Signature 22 exhibits a very strong transcriptional strand bias for T>A mutations indicating adenine damage that is being repaired by transcription-coupled nucleotide excision repair.
Comments:
Signature 22 has a very high mutational burden in urothelial carcinoma; however, its mutational burden is much lower in liver cancers.
Signature 23
Cancer types:
Signature 23 has been found only in a single liver cancer sample.
Proposed aetiology:
The aetiology of Signature 23 remains unknown.
Additional mutational features:
Signature 23 exhibits very strong transcriptional strand bias for C>T mutations.
Comments:
Signature 24
Cancer types:
Signature 24 has been observed in a subset of liver cancers.
Proposed aetiology:
Signature 24 has been found in cancer samples with known exposures to aflatoxin. Additionally, the pattern of mutations exhibited by the signature is consistent with that previously observed in experimental systems exposed to aflatoxin.
Additional mutational features:
Signature 24 exhibits a very strong transcriptional strand bias for C>A mutations indicating guanine damage that is being repaired by transcription-coupled nucleotide excision repair.
Comments:
Signature 25
Cancer types:
Signature 25 has been observed in Hodgkin lymphomas.
Proposed aetiology:
The aetiology of Signature 25 remains unknown.
Additional mutational features:
Signature 25 exhibits transcriptional strand bias for T>A mutations.
Comments:
This signature has only been identified in Hodgkin’s cell lines. Data is not available from primary Hodgkin lymphomas.
Signature 26
Cancer types:
Signature 26 has been found in breast cancer, cervical cancer, stomach cancer and uterine carcinoma.
Proposed aetiology:
Signature 26 is believed to be associated with defective DNA mismatch repair.
Additional mutational features:
Signature 26 is associated with high numbers of small (shorter than 3bp) insertions and deletions at mono/polynucleotide repeats.
Comments:
Signature 26 is one of four mutational signatures associated with defective DNA mismatch repair and is often found in the same samples as Signatures 6, 15 and 20.
Signature 27
Cancer types:
Signature 27 has been observed in a subset of kidney clear cell carcinomas.
Proposed aetiology:
The aetiology of Signature 27 remains unknown.
Additional mutational features:
Signature 27 exhibits very strong transcriptional strand bias for T>A mutations. Signature 27 is associated with high numbers of small (shorter than 3bp) insertions and deletions at mono/polynucleotide repeats.
Comments:
Signature 28
Cancer types:
Signature 28 has been observed in a subset of stomach cancers.
Proposed aetiology:
The aetiology of Signature 28 remains unknown.
Additional mutational features:
Comments:
Signature 29
Cancer types:
Signature 29 has been observed only in gingivo-buccal oral squamous cell carcinoma.
Proposed aetiology:
Signature 29 has been found in cancer samples from individuals with a tobacco chewing habit.
Additional mutational features:
Signature 29 exhibits transcriptional strand bias for C>A mutations indicating guanine damage that is most likely repaired by transcription-coupled nucleotide excision repair. Signature 29 is also associated with CC>AA dinucleotide substitutions.
Comments:
The Signature 29 pattern of C>A mutations due to tobacco chewing appears different from the pattern of mutations due to tobacco smoking reflected by Signature 4.
Signature 30
Cancer types:
Signature 30 has been observed in a small subset of breast cancers.
Proposed aetiology:
The aetiology of Signature 30 remains unknown.
Examples in the literature of deposits into or analysis from the COSMIC database
“analysis of exons representing 20,857 transcripts from 18,191 genes, we conclude that the genomic landscapes of breast and colorectal cancers are composed of a handful of commonly mutated gene “mountains” and a much larger number of gene “hills” that are mutated at low frequency. “
found cellular pathways with multiple pathways
analyzed a highly curated database (Metacore, GeneGo, Inc.) that includes human protein-protein interactions, signal transduction and metabolic pathways
There were 108 pathways that were found to be preferentially mutated in breast tumors. Many of the pathways involved phosphatidylinositol 3-kinase (PI3K) signaling
the cancer genome landscape consists of relief features (mutated genes) with heterogeneous heights (determined by CaMP scores). There are a few “mountains” representing individual CAN-genes mutated at high frequency. However, the landscapes contain a much larger number of “hills” representing the CAN-genes that are mutated at relatively low frequency. It is notable that this general genomic landscape (few gene mountains and many gene hills) is a common feature of both breast and colorectal tumors.
developed software to analyze multiple mutations and mutation frequencies available from Harvard Bioinformatics at
R Software for Cancer Mutation Analysis (download here)
CancerMutationAnalysis Version 1.0:
R package to reproduce the statistical analyses of the Sjoblom et al article and the associated Technical Comment. This package is built for reproducibility of the original results and not for flexibility. Future versions will be more general and define classes for the data types used. Further details are available in Working Paper 126.
CancerMutationAnalysis Version 2.0:
R package to reproduce the statistical analyses of the Wood et al article. Like its predecessor, this package is still built for reproducibility of the original results and not for flexibility. Further details are available in Working Paper 126.
Update 04/27/2019
Review 2018. The COSMIC Cancer Gene Census: describing genetic dysfunction across all human cancers. Z. Sondka et al. Nature Reviews Cancer. 2018.
The Catalogue of Somatic Mutations in Cancer (COSMIC) Cancer Gene Census (CGC) periodically reevaluates the cancer genome landscape and curates the findings into a database of genetic changes occurring in various tumor types. The 2018 CGC describes in detail the effects of 719 cancer-driving genes. The recent expansion includes functional and mechanistic descriptions of how each gene contributes to disease etiology in terms of the cancer hallmarks described by Hanahan and Weinberg. These functional characteristics show the complexity of the cancer mutational landscape and suggest “multiple cancer-related functions for many genes, which are often highly tissue-dependent or tumour stage-dependent.” The 2018 CGC also expands a second tier of genes, lengthening the list of cancer-related genes.
Criteria for curation of genes into CGC (curation process)
candidate genes are selected from published literature, conference abstracts, large cancer genome screens deposited in databases, and analysis of the current COSMIC database
COSMIC data are analyzed to determine presence of patterns of somatic mutations and frequency of such mutations in cancer
literature review to determine the role of the gene in cancer
Minimum evidence
– at least two publications from different groups showing increased mutation frequency in at least one type of cancer (PubMed)
– at least two publications from different groups showing experimental evidence of functional involvement in at least one hallmark of cancer in order to classify the mutant gene as oncogene, tumor suppressor, or fusion partner (like BCR-Abl)
independent assessment by at least two postdoctoral fellows
gene must be classified as either a Tier 1 or a Tier 2 CGC gene
inclusion in database
continued curation efforts
definitions:
Tier 1 gene: genes which have strong evidence from both mutational and functional analysis as being involved in cancer
Tier 2 gene: genes with mutational patterns typical of cancer drivers but not functionally characterized as well as genes with published mechanistic description of involvement in cancer but without proof of somatic mutations in cancer
CGC genes (719 genes in total): include 103 oncogenes, 181 tumor suppressors, 134 fusion partners and 31 with unknown function
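The two tier definitions above can be read as a small decision rule. The Python sketch below is illustrative only: the function name `classify_cgc_tier` and its two boolean inputs are hypothetical simplifications, and actual CGC curation relies on expert literature review rather than a script.

```python
from typing import Optional


def classify_cgc_tier(mutational_evidence: bool,
                      functional_evidence: bool) -> Optional[str]:
    """Map two evidence flags onto the tier definitions listed above.

    mutational_evidence: mutation patterns typical of cancer drivers were reported.
    functional_evidence: a mechanistic role in cancer has been published.
    Both flags are hypothetical stand-ins for a manual curation decision.
    """
    if mutational_evidence and functional_evidence:
        return "Tier 1"   # strong evidence from both kinds of analysis
    if mutational_evidence or functional_evidence:
        return "Tier 2"   # only one kind of evidence is available
    return None           # no basis for inclusion in the census


print(classify_cgc_tier(True, True))
print(classify_cgc_tier(False, True))
```

Reducing each kind of evidence to a single flag hides the "at least two publications from different groups" requirement; a fuller model would count independent publications per evidence type.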
Updated 7/26/2019
The COSMIC database is undergoing an extensive update and reannotation, in order to ensure standardisation and modernisation across COSMIC data. This will substantially improve the identification of unique variants that may have been described at the genome, transcript and/or protein level. The introduction of a Genomic Identifier, along with complete annotation across multiple, high quality Ensembl transcripts and improved compliance with current HGVS syntax, will enable variant matching both within COSMIC and across other bioinformatic datasets.
As a result of these updates there will be significant changes in the upcoming releases as we work through this process. The first stage of this work was the introduction of improved HGVS syntax compliance in our May release. The majority of the changes will be reflected in COSMIC v90, which will be released in late August or early September, and the remaining changes will be introduced over the next few releases.
The significant changes in v90 include:
Updated genes, transcripts and proteins from Ensembl release 93 on both the GRCh37 and GRCh38 assemblies.
Full reannotation of COSMIC variants with known genomic coordinates using Ensembl’s Variant Effect Predictor (VEP). This provides accurate and standardised annotation uniformly across all relevant transcripts and genes that include the genomic location of the variant.
New stable genomic identifiers (COSV) that indicate the definitive position of the variant on the genome. These unique identifiers allow variants to be mapped between GRCh37 and GRCh38 assemblies and displayed on a selection of transcripts.
Updated cross-reference links between COSMIC genes and other widely-used databases such as HGNC, RefSeq, Uniprot and CCDS.
Complete standardised representation of COSMIC variants, following the most recent HGVS recommendations, where possible.
Remapping of gene fusions on the updated transcripts on both the GRCh37 and GRCh38 assemblies, along with the genomic coordinates for the breakpoint positions.
Reduced redundancy of mutations. Duplicate variants have been merged into one representative variant.
Key points for you
COSMIC variants have been annotated on all relevant Ensembl transcripts across both the GRCh37 and GRCh38 assemblies from Ensembl release 93. New genomic identifiers (e.g. COSV56056643) are used, which refer to the variant change at the genomic level rather than at the gene, transcript or protein level and can thus be used universally. Existing COSM IDs will continue to be supported and will now be referred to as legacy identifiers, e.g. COSM476. The legacy identifiers (COSM) are still searchable. In the case of mutations without genomic coordinates, hence without a COSV identifier, COSM identifiers will continue to be used.
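A pipeline that receives a mix of old and new identifiers can tell them apart by prefix. The following is a minimal Python sketch, assuming that both identifier types consist of the prefix followed only by digits, as in the two examples quoted above (COSV56056643 and COSM476); the function name `cosmic_id_kind` is hypothetical and not part of any COSMIC tool.

```python
import re

# Assumption: identifiers are the prefix plus digits, as in the examples
# COSV56056643 (genomic) and COSM476 (legacy) quoted in the text.
_GENOMIC_ID = re.compile(r"^COSV\d+$")
_LEGACY_ID = re.compile(r"^COSM\d+$")


def cosmic_id_kind(identifier: str) -> str:
    """Return 'genomic', 'legacy' or 'unknown' for a COSMIC mutation identifier."""
    identifier = identifier.strip()
    if _GENOMIC_ID.match(identifier):
        return "genomic"
    if _LEGACY_ID.match(identifier):
        return "legacy"
    return "unknown"


for example in ("COSV56056643", "COSM476", "rs12345"):
    print(example, cosmic_id_kind(example))
```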
All relevant Ensembl transcripts in COSMIC (which have been selected based on Ensembl canonical classification and on the quality of the dataset to include only GENCODE basic transcripts) will now have both accession and version numbers, so that the exact transcript is known, ensuring reproducibility. This also provides transparency and clarity as the data are updated.
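To make use of the accession and version numbers, downstream code usually needs them as separate fields. Here is a short Python sketch, assuming the common `accession.version` notation for Ensembl stable identifiers; the helper name and the example identifier `ENST00000123456.7` are illustrative and not taken from a COSMIC file.

```python
from typing import Tuple


def split_versioned_accession(transcript_id: str) -> Tuple[str, int]:
    """Split an 'accession.version' transcript ID into its two parts.

    Raises ValueError when no numeric version is present, so that an
    unversioned (and therefore ambiguous) transcript is never accepted silently.
    """
    accession, dot, version = transcript_id.strip().partition(".")
    if not accession or not dot or not version.isdigit():
        raise ValueError(f"no version number in transcript ID: {transcript_id!r}")
    return accession, int(version)


# Illustrative identifier only.
print(split_versioned_accession("ENST00000123456.7"))
```

Rejecting unversioned identifiers, rather than defaulting to a version, keeps the reproducibility guarantee described above explicit in the code.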
How these changes will be reflected in the download files
As we are now mapping all variants on all relevant Ensembl transcripts, the number of rows in the majority of variant download files has increased significantly. In the download files, additional columns are provided including the legacy identifier (COSM) and the new genomic identifier (COSV). An internal mutation identifier is also provided to uniquely represent each mutation, on a specific transcript, on a given assembly build. The accession and version number for each transcript are included. File descriptions for each of the download files will be available from the downloads page for clarity. We have included an example of the new columns below.
For example: COSMIC Complete Mutation Data (Targeted screens)
[17:Q] Mutation Id – An internal mutation identifier to uniquely represent each mutation on a specific transcript on a given assembly build.
[18:R] Genomic Mutation Id – Genomic mutation identifier (COSV) to indicate the definitive position of the variant on the genome. This identifier is trackable and stable between different versions of the release.
[19:S] Legacy Mutation Id – Legacy mutation identifier (COSM) that will represent existing COSM mutation identifiers.
We will shortly have some sample data that can be downloaded in the new table structure, to give you real data to manipulate and integrate; this will be available on the variant updates page.
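Because each variant now appears once per transcript, a common first step is to collapse rows back to one entry per genomic variant. The Python sketch below assumes a tab-separated file whose header uses the three column names exactly as listed above; the real download files may spell the headers differently, and the rows shown are made up for illustration (including the pairing of the two example identifiers).

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the assumed three-column layout (tab-separated).
SAMPLE = (
    "Mutation Id\tGenomic Mutation Id\tLegacy Mutation Id\n"
    "1001\tCOSV56056643\tCOSM476\n"
    "1002\tCOSV56056643\tCOSM476\n"
    "1003\t\tCOSM999\n"
)


def group_by_variant(handle):
    """Group internal mutation IDs under their COSV identifier.

    Rows without a COSV identifier (no genomic coordinates) fall back to
    the legacy COSM identifier, as described in the key points.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        key = row["Genomic Mutation Id"] or row["Legacy Mutation Id"]
        groups[key].append(row["Mutation Id"])
    return dict(groups)


groups = group_by_variant(io.StringIO(SAMPLE))
for variant, mutation_ids in groups.items():
    print(variant, mutation_ids)
```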
How this affects you
We are aware that many of the changes we are making will affect integration into your pipelines and analytical platforms. By giving you advance notice of the changes, we hope much of this can be mitigated, and the end result of having clean, standardised data will be well worth any disruption. The variant updates page on the COSMIC website will provide a central point for this information and further technical details of the changes that we are making to COSMIC.
Human Genetics and Childhood Diseases, Volume 2 (Volume Two: Latest in Genomics Methodologies for Therapeutics: Gene Editing, NGS and BioInformatics, Simulations and the Genome Ontology), Part 1: Next Generation Sequencing (NGS)
Human Genetics and Childhood Diseases
Curator: Larry H. Bernstein, MD, FCAP
Publication Roundup: HGMD
HGMD®, the Human Gene Mutation Database is used by scientists around the world to find information on reported genetic mutations. The papers below use the database to advance our understanding of disease, DNA dynamics, and more.
Local DNA dynamics shape mutational patterns of mononucleotide repeats in human genomes First author: Albino Bacolla
Scientists in the US and UK published results in Nucleic Acids Research of a detailed analysis of single-base substitutions and indels in the human genome. Their findings show that certain base positions are more susceptible to mutagenesis than others. They used HGMD Professional to find mutations in specific genomic regions for analysis; the paper includes charts showing mutation patterns, germline SNPs, and more from HGMD data.
High prevalence of CDH23 mutations in patients with congenital high-frequency sporadic or recessively inherited hearing loss First author: Kunio Mizutari
This Orphanet Journal of Rare Diseases paper from scientists in Japan sequenced 72 patients with unexplained hearing loss, finding several CDH23 mutations, some of which were novel. Mutations in the gene have been linked to Usher syndrome and other forms of hereditary hearing loss. The scientists used HGMD to find all known CDH23 mutations within nearly 70 coding regions.
Mutation analyses and prenatal diagnosis in families of X-linked severe combined immunodeficiency caused by IL2Rγ gene novel mutation First author: Q.L. Bai
In Genetics and Molecular Research, scientists report the utility of mutation analysis of the interleukin-2 receptor gamma gene to assess carrier status and perform prenatal diagnosis for X-linked severe combined immunodeficiency. They studied two high-risk families, along with 100 controls, to evaluate the approach. Sequence variation was determined using HGMD Professional and an X-SCID database, and a new mutation was discovered in the project.
Impact of glucocerebrosidase mutations on motor and nonmotor complications in Parkinson’s disease First author: Tomoko Oeda
Researchers from three hospitals in Japan published this Neurobiology of Aging report that may help stratify Parkinson’s disease patients by prognosis. They sequenced mutations in the GBA gene in 215 patients, finding that those who had mutations associated with Gaucher disease suffered dementia and psychosis much earlier than those who didn’t. The team found previously reported GBA mutations using HGMD Professional.
Comprehensive Genetic Characterization of a Spanish Brugada Syndrome Cohort First author: Elisabet Selga
In this PLoS One publication, scientists from a number of institutions in Spain examined genetic variation among patients with Brugada syndrome, a rare genetic cardiac arrhythmia. They sequenced 14 genes in 55 patients, identifying 61 variants and finding the subset that appear pathogenic. Variants were filtered against a number of databases, including HGMD.
Local DNA dynamics shape mutational patterns of mononucleotide repeats in human genomes
Single base substitutions (SBSs) and insertions/deletions are critical for generating population diversity and can lead both to inherited disease and cancer. Whereas on a genome-wide scale SBSs are influenced by cellular factors, on a fine scale SBSs are influenced by the local DNA sequence-context, although the role of flanking sequence is often unclear. Herein, we used bioinformatics, molecular dynamics and hybrid quantum mechanics/molecular mechanics to analyze sequence context-dependent mutagenesis at mononucleotide repeats (A-tracts and G-tracts) in human population variation and in cancer genomes. SBSs and insertions/deletions occur predominantly at the first and last base-pairs of A-tracts, whereas they are concentrated at the second and third base-pairs in G-tracts. These positions correspond to the most flexible sites along A-tracts, and to sites where a ‘hole’, generated by the loss of an electron through oxidation, is most likely to be localized in G-tracts. For A-tracts, most SBSs occur in the direction of the base-pair flanking the tracts. We conclude that intrinsic features of local DNA structure, i.e. base-pair flexibility and charge transfer, render specific nucleotides along mononucleotide runs susceptible to base modification, which then yields mutations. Thus, local DNA dynamics contributes to phenotypic variation and disease in the human population.
INTRODUCTION
Changes in human genomic DNA in the form of base substitutions and insertions/deletions (indels) are essential to ensure population diversity, adaptation to the environment, defense from pathogens and self-recognition; they are also a critical source of human inherited disease and cancer. On a genome-wide scale, base substitutions result from the combined action of several factors, including replication fidelity, lagging versus leading strand DNA synthesis, repair, recombination, replication timing, transcription, nucleosome occupancy, etc., both in the germline and in cancer (1–4). On a much finer scale (over a few base pairs [bp]), rates of base substitutions may be strongly influenced by interrelationships between base–protein and base–base interactions. For example, the mutator role of activation-induced deaminase (AID) in B-cells during class-switch recombination and somatic hypermutation (5) targets preferentially cytosines within WRC (W: A|T; R: A|G) sequences (6), whereas apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like (APOBEC) overexpression displays a preference for base substitutions at cytosines in TCW contexts (7). Other examples, such as the induction of C→T transitions at CG:CG dinucleotides by cytosine-5-methylation and the role of UV light in promoting base substitutions at pyrimidine dimers have been well documented (reviewed in (4,8)). More recently, complex patterns of base substitution at guanosines in cancer genomes have been found to correlate with changes in guanosine ionization potentials as a result of electronic interactions with flanking bases (9), suggesting a role for electron transfer and oxidation reactions in sequence-dependent mutagenesis. However, despite these advances, the increasing number of sequence-dependent patterns of mutation noted in genome-wide sequencing studies has met with a lack of understanding of most of the underlying mechanisms (10).
Thus, a picture is emerging in which mutations are often heavily dependent on sequence-context, but for which our comprehension is limited.
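The WRC and TCW contexts mentioned above are written with IUPAC ambiguity codes (W = A or T, R = A or G). As a minimal illustration of what "sequence context" means in practice, the Python sketch below locates such motifs on the forward strand of a toy sequence; the function name, the restriction to one strand and the example sequence are assumptions made for brevity.

```python
import re

# Subset of IUPAC ambiguity codes needed for the motifs named in the text.
IUPAC = {"W": "[AT]", "R": "[AG]"}


def motif_starts(sequence: str, motif: str) -> list:
    """Return 0-based start positions of every (possibly overlapping) motif match.

    Only the given strand is scanned; a real analysis would also scan the
    reverse complement.
    """
    pattern = "".join(IUPAC.get(symbol, symbol) for symbol in motif.upper())
    # A lookahead keeps overlapping occurrences from being skipped.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", sequence.upper())]


toy = "AGCTTCAAAC"  # made-up sequence
print("WRC:", motif_starts(toy, "WRC"))  # targeted C is at start + 2
print("TCW:", motif_starts(toy, "TCW"))  # targeted C is at start + 1
```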
Mononucleotide repeats comprise blocks of identical base pairs (A|T or C|G; hereafter referred to as A-tracts and G-tracts) and display distinct features: they are abundant in vertebrate genomes; mutations within the tracts occur more frequently than the genome-wide average; mutations generally increase with increasing tract length; length instability is a hallmark of mismatch repair-deficiency in cancers; and sequence polymorphism within the general population has been linked to phenotypic diversity (11–15). Thus, mononucleotide repeats appear ideal for addressing the question of sequence-dependent mutagenesis since base pairs within the tracts are flanked by identical neighbors. Both historic and recent investigations concur with the conclusion that a major source of mononucleotide repeat polymorphism is the occurrence of slippage (i.e. repeat misalignment) during semiconservative DNA replication, which gives rise to the addition or deletion of repeat units (11,12). An additional and equally important source of mutation has recently been suggested to arise from errors in DNA replication by translesion synthesis DNA polymerases, such as pol η and pol κ (13), also on slipped intermediates, leading to single base substitutions.
A key question that remains unanswered in these studies and which is relevant to the issue of sequence context-dependent mutagenesis is whether all base pairs within mononucleotide repeats display identical susceptibility to single base changes and whether indels (which are consequent to DNA breakage) occur randomly within the tracts.
Herein, we combine bioinformatics analyses on mononucleotide repeat variants from the 1000 Genomes Project and cancer genomes with molecular dynamics simulations and hybrid quantum mechanics/molecular mechanics calculations to address the question of sequence-dependent mutagenesis within these tracts. We show that mutations along both A-tracts and G-tracts are highly non-uniform. Specifically, both base substitutions and indels occur preferentially at the first and last bp of A-tracts, whereas they are concentrated between the second and third G:C base pairs in G-tracts. These positions coincide with the most flexible base pairs for A-tracts and with the preferential localization of a ‘hole’ that results when one electron is lost due to an oxidation reaction anywhere along G-tracts. Thus, despite the uniformity of sequence composition, mutations occur in a sequence-dependent context at homopolymeric runs according to a hierarchy that is imposed by both local DNA structural features and long-range base–base interactions. We also show that the repair processes leading to base substitution must differ between A- and G-tracts, since in the former, but not in the latter, base substitutions occur predominantly in the direction of the base immediately flanking the tracts. Additional sequence-dependent patterns of mutation are likely to arise from studies of more heterogeneous sequence combinations, possibly involving other aspects intrinsic to the structure of DNA.
RESULTS
Mononucleotide repeat variation is defined by tract length and flanking base composition
We define mononucleotide repeats in the GRCh37/hg19 (hg19) human genome assembly as uninterrupted runs of A:T and G:C base pairs (hereafter referred to as A-tracts and G-tracts, respectively) from 4 to 13 base pairs in length (Figure 1A). We retrieved a total of 48,767,945 A-tracts and 13,633,781 G-tracts, both of which displayed a biphasic distribution with an inflection point between tract lengths of 8 and 9 (bp) and with the number of runs declining with length more dramatically for G-tracts than for A-tracts (Figure 1B), as noted previously (29). Both the number of short tracts and the extent of decline varied with flanking base composition, TA[n]T runs being two- to three-fold more abundant than CA[n]Cs (Supplementary Figure S1A) and AG[n]As declining the most rapidly (Supplementary Figure S1B). Thus, mononucleotide runs exist as a collection of separate pools of sequences in extant human genomes, each maintained at distinctive rates of sequence stability, as determined by factors such as bp composition (A:T versus G:C), tract length and flanking sequence composition.
Figure 1. Mononucleotide repeat variation, evolutionary conservation and association with transcription. (A) The search algorithm was designed to retrieve runs of As or Ts (A-tracts) and Gs or Cs (G-tracts) of length n (n = 4 to 13), along with their 5′ (n = 0) and 3′ (n = n + 1) nearest neighbors from hg19. Tract bases were numbered 5′ to 3′ with respect to the purine-rich sequence. The panel exemplifies the nomenclature for A- and G-tracts of length 4. (B) Logarithmic plot of the number of A-tracts (closed circles) and G-tracts (open circles) in hg19 as a function of length. (C) Normalized fractions of polymorphic tracts (F SNV) (number of SNVs divided by both hg19 number of tracts and n) from the 1KGP for A-tracts (closed circles) and G-tracts (open circles). (D) Radial plot of SNVs in the 1KGP at the 5′ and 3′ nearest neighbors of A-tracts. Periphery, tract length; horizontal axis, scale for the fraction of SNVs (F SNV). (E) Radial plot of SNVs in the 1KGP at the 5′ and 3′ nearest neighbors of G-tracts. (F) Percent difference in the numbers of A-tracts (closed circles) and G-tracts (open circles) between syntenic regions of hg19 and HN genomes. (G) The exponents of Benjamini-corrected P-values for A-tract-containing genes enriched in transcription-factor binding sites plotted as a function of A-tract length (triangles); each value represents the median of the top 11 UCSC_TFBS terms. The percent A-tracts (closed circles) and G-tracts (open circles) intersecting genomic regions pulled-down by chromatin immunoprecipitation using antibodies against transcription factors are plotted as a function of tract length. (H) List of gene enrichment terms with a Benjamini-corrected P-value of <0.05 in common between genes containing A- and G-tracts of lengths 4–13, excluding the UCSC_TFBS terms.
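The tract search described in panel A can be approximated in a few lines. The Python sketch below is not the authors' algorithm: it is an assumed re-implementation that scans one strand for maximal homopolymer runs of 4 to 13 bases, labels runs of A or T as A-tracts and runs of G or C as G-tracts, and reports the flanking bases relative to the purine-rich strand. The function name and the test sequence are hypothetical.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_tracts(sequence, min_len=4, max_len=13):
    """Return (kind, start, length, flank5, flank3) for each homopolymer run.

    start is 0-based on the input strand. Flanks are reported 5' to 3' on the
    purine-rich strand, so for runs of T or C the neighbors are swapped and
    complemented. A flank is None when the run touches a sequence end.
    """
    sequence = sequence.upper()
    tracts = []
    for match in re.finditer(r"A+|C+|G+|T+", sequence):  # maximal runs only
        length = match.end() - match.start()
        if not min_len <= length <= max_len:
            continue
        base = sequence[match.start()]
        left = sequence[match.start() - 1] if match.start() > 0 else None
        right = sequence[match.end()] if match.end() < len(sequence) else None
        if base in "TC":
            # Re-express the neighbors on the complementary, purine-rich strand.
            left, right = (
                right.translate(COMPLEMENT) if right else None,
                left.translate(COMPLEMENT) if left else None,
            )
        kind = "A-tract" if base in "AT" else "G-tract"
        tracts.append((kind, match.start(), length, left, right))
    return tracts


for tract in find_tracts("CAAAAATGGTCCCCCA"):  # made-up sequence
    print(tract)
```

Matching maximal runs first and filtering by length afterwards avoids counting a window of a longer run (for example, 13 bases inside a 20-base run) as a separate tract.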
We examined the extent of sequence variation in the human population by mapping 38,878,546 single nucleotide variants (SNVs) from 1092 haplotype-resolved genomes (the 1000 Genomes Project, 1KGP) (30) to the hg19 A- and G-tracts. The normalized fractions of polymorphic tracts (F SNV) were greater for G-tracts than A-tracts and both displayed Gaussian-type distributions, with maxima of 0.067 for G-tracts of length 8 and 0.017 for A-tracts of length 9 (Figure 1C). CA[n]C and AG[n]A runs displayed the highest F SNV values for A- and G-tracts, respectively (Supplementary Figure S1C and D), with F SNV values for AG[n]As attaining ∼0.10 at length 8. We conclude that flanking base composition influences the rates of SNV within mononucleotide runs and, as a consequence, their representation in the reference human genome.
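The normalization behind F SNV, as defined in the Figure 1C legend (number of SNVs divided by both the number of tracts in hg19 and the tract length n), is simple enough to state as code. The Python sketch below uses a hypothetical function name and made-up counts; it only restates that definition.

```python
def normalized_snv_fraction(n_snvs, n_tracts, tract_length):
    """F SNV = SNV count / (number of tracts * tract length).

    Dividing by tract length puts tracts of different sizes on a per-base
    scale, so longer tracts do not score higher merely for having more bases.
    """
    if n_tracts <= 0 or tract_length <= 0:
        raise ValueError("n_tracts and tract_length must be positive")
    return n_snvs / (n_tracts * tract_length)


# Made-up counts: 680 SNVs observed in 10,000 tracts of length 8.
print(normalized_snv_fraction(680, 10_000, 8))
```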
F SNV values at the flanking 5′ and 3′ bp were similar between A- and G-tracts, except for minor differences for the least represented (i.e. longest) tracts and did not exceed 0.02 (Supplementary Figure S1E). These fractions are expected to be greater than at more distant positions from the tracts, based on previous data (29). SNVs at G-tracts, but not at A-tracts, were more frequent than at flanking base pairs. F SNVs for base pairs flanking short (≤8 bp) tracts were at least twice as high as those flanking long tracts; F SNVs also displayed distinct sequence preference with most (∼0.1) variants occurring at Ts 3′ of G-tracts (Figure 1D and E). In summary, SNVs at mononucleotide runs do not increase monotonically with length but peak at 8–9 bp. This behavior mirrors the genomic distributions, both with respect to the total number of tracts (Figure 1B) and the subsets flanked by specific-sequence combinations (Supplementary Figure S1A–D). Variation at flanking base pairs also displayed a biphasic pattern centered at a length of 8–9 bp, with a greater chance of variation adjacent to G- than A-tracts and with characteristic sequence preferences.
Long tracts are evolutionarily conserved and associated with high transcription
To assess whether more variable monosatellite runs (Figure 1C) might have undergone a greater reduction in number in extant humans relative to extinct hominids, we compared the number of A- and G-tracts between syntenic regions of five individuals comprising hg19 and three Neanderthal (HN) specimens (31). The difference between hg19 and HN was very small (<±2%) for the short tracts, but it displayed more negative values in hg19 with increasing tract length, which reached a maximum of −11.8 and −32.7% for A- and G-tracts, respectively, of length 9. Beyond this threshold, the numbers of tracts converged for A-tracts, whereas they were more abundant in hg19 for G-tracts >11 bp (Figure 1F). In summary, the largest difference in the number of mononucleotide runs between hg19 and HN sequences was centered at 9 bp for both A- and G-tracts, suggesting that the length distributions (Figure 1A and Supplementary Figure S1A and B) reflect distinct rates of evolutionary gains and losses due to differential sequence mutability (Figure 1C) as a function of length and flanking sequence composition (12).
The fact that long (>9 bp) mononucleotide runs display low variability in the human population (Figure 1C) and sequence conservation during evolutionary divergence (Figure 1F) raises the possibility that they might serve functional roles. Through gene enrichment analyses, we found that genes containing A- and G-tracts were enriched for genes associated with the term ‘UCSC_TFBS’, which pertains to transcripts harboring frequent transcription factor binding sites (32,33). For A-tract-containing genes, the median P-values for the top 11 UCSC_TFBS terms decreased from 2.95E-26 for tracts of length 4 to 5.22E-241 for tracts of length 13 (Figure 1G). The percent of A-tracts intersecting genomic fragments amplified from chromatin immunoprecipitation using transcription-factor binding antibodies (32,33) also increased from 8.7 to 9.9 from length 6 to 13, whereas it was constant (mean ± SD, 22.4 ± 1.1) for G-tracts (Figure 1G). For gene classes excluding ‘UCSC_TFBS’, a search for categories enriched at P < 0.05 and common to all A- and G-tract-containing genes returned a set of 25 terms, 22 of which were associated with high levels of tissue-specific gene expression (Figure 1H). In summary, these analyses extend prior work (14) supporting a role for mononucleotide tracts in enhancing gene expression, a function that for A-tracts appears to increase with increasing tract length.
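The text reports "Benjamini-corrected" P-values without giving the procedure. The sketch below shows the standard Benjamini-Hochberg adjustment in plain Python, on the assumption that this is what the correction refers to; the enrichment tool the authors used may differ in detail, and the input values here are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order.

    Each P-value is scaled by m / rank (m tests, rank 1 = smallest), then a
    running minimum from the largest rank downwards enforces monotonicity.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # made-up P-values
print(benjamini_hochberg(raw))
```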
Repeat variability is highly skewed
Next we addressed whether bp along A- and G-tracts display equal probability and type of variation. In the 1KGP dataset, the number of SNVs at each position along both A- and G-tracts of length 4 was within a two-fold difference (144,000–240,000); for both types of sequence, transitions (i.e. A→G and G→A) were the predominant (51–78%) type of base substitution (Supplementary Figure S2A and B). However, with increasing length, the number of SNVs decreased up to 30-fold more drastically for G-tracts than for A-tracts, with increasing numbers of transversions (A→T and G→C|T) being predominant. Normalizing the data for the number of tracts genome-wide revealed that the extent of SNV varied by up to 10-fold, depending upon tract length and bp position. Specifically, the highest degree of variation was observed at the first and last A within the A-tracts (i.e. A1 and An), which underwent up to 61% A→T and 43% A→C transversions, respectively, at length 9 (Figure 2A). Likewise, for G-tracts, the most polymorphic sites were G3, followed by G2, for mid-size tracts of 8–10 bp, with 44% G→C transversions at G3 for tracts of length 8 (Figure 2B). Thus, the extent of SNV at mononucleotide runs is grossly skewed in human genomes, both along the sequence itself and across tract length, which must account for the bell-shape behavior in F SNV for the tracts as a whole (Figure 1C).
Figure 2. Population variation spectra. (A) Variation spectra of A-tracts. Percent (number of SNVs at each position divided by the number of tracts in hg19 × 100) of A→T (black), A→C (red) and A→G (green) SNVs in the 1KGP dataset (left). Percent SNVs at A1 as a function of tract length (right). (B) Variation spectra of G-tracts. As in panel A with G→T (black), G→C (red) and G→A (cyan) (left). Percent SNVs at G3 as a function of tract length (right). (C) Percent A→T, A→C and A→G transitions at each position along A-tracts (stars) preceded and followed by a T (TA[n]T, left), C (CA[n]C, center) and G (GA[n]G, right) as a function of tract length. (D) Percent G→T, G→C and G→A transitions at each position along G-tracts (stars) preceded and followed by a T (TG[n]T, left), C (CG[n]C, center) and A (AG[n]A, right) as a function of tract length. (E) Percent transitions at base pairs (stars) preceding or following A-tracts (left) and G-tracts (right) as a function of tract length (n). *, mutated position.
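The results above distinguish transitions (purine to purine, or pyrimidine to pyrimidine) from transversions (purine to pyrimidine, or the reverse). The Python sketch below shows that classification and a simple percentage spectrum; the function names and the list of substitutions are hypothetical.

```python
from collections import Counter

PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")
BASES = PURINES | PYRIMIDINES


def substitution_class(ref, alt):
    """Classify a single-base substitution as 'transition' or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= BASES:
        raise ValueError(f"not a valid substitution: {ref!r} to {alt!r}")
    same_family = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_family else "transversion"


def spectrum_percent(substitutions):
    """Percent of each 'ref>alt' change among the given (ref, alt) pairs."""
    counts = Counter(f"{ref.upper()}>{alt.upper()}" for ref, alt in substitutions)
    total = sum(counts.values())
    return {change: 100.0 * n / total for change, n in counts.items()}


observed = [("A", "T"), ("A", "T"), ("A", "G"), ("A", "C")]  # made-up SNVs
print({f"{r}>{a}": substitution_class(r, a) for r, a in set(observed)})
print(spectrum_percent(observed))
```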
We assessed whether SNV hypervariability was associated with specific combinations of nearest neighbors. For A-tracts flanked 5′ by a T, C or G, the highest percentage of SNVs was observed at A1 when preceded by a T, which reached 7.9% for TA[n] tracts of length 9 (Supplementary Figure S2C). By contrast, for 3′ T, C or G, the greatest effect was elicited by a C, with the highest percentage (7.1%) of SNVs at An for A[n]C tracts of length 9 (Supplementary Figure S2D). Therefore, flanking base pairs play a critical role both in the spectra and frequencies of SNVs at A-tracts. More detailed plots along A-tracts either preceded (Supplementary Figure S2E), followed (Supplementary Figure S2F) or preceded and followed (Figure 2C) by a T, C or G revealed the dramatic and long-range (up to 9–10 bp for the longest tracts, higher than the value of 4 bp predicted by mathematical models of slippage (11)) influence of flanking base pairs on variation spectra, in which up to 95% of the changes were in the direction of the base flanking the tract. Because the number of A-tracts preceded or followed by a specific base varies by up to three-fold (Supplementary Figure S2G), we conclude that for A-tracts, the overall mutation fractions and spectra are the result of at least three variables; length, position along the tract, and base composition of the 5′ and 3′ nearest-neighbors.
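The statement that up to 95% of changes were "in the direction of the base flanking the tract" amounts to asking how often the new base equals the nearest flanking base. A minimal Python sketch of that tally follows; the function name and the observations are made up, and the pairing of each SNV with its nearest flank is assumed to have been done beforehand.

```python
def fraction_toward_flank(observations):
    """Fraction of SNVs whose new base matches the nearest flanking base.

    observations: iterable of (alt_base, flanking_base) pairs, for example
    ('T', 'T') for an A-to-T change at A1 of a tract preceded by T.
    """
    observations = list(observations)
    if not observations:
        raise ValueError("no observations supplied")
    matches = sum(1 for alt, flank in observations if alt.upper() == flank.upper())
    return matches / len(observations)


# Made-up SNVs at A1 of TA[n] tracts: three mimic the 5' T, one does not.
print(fraction_toward_flank([("T", "T"), ("T", "T"), ("C", "T"), ("T", "T")]))
```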
For G-tracts flanked 5′ by a T, C or A, high percentages (10–12%) of SNVs were observed at G1 for tracts preceded by a C, an effect that decreased with increasing tract length (Supplementary Figure S3A). This result, together with an exceedingly low number of G→A transitions at G1 for tracts not preceded by a C (Supplementary Figure S3C) relative to all tracts (Supplementary Figure S2B), is consistent with the known high mutability of CG:CG dinucleotides as a result of cytosine-5 methylation (9). The hypermutability at G2 was observed preferentially for tracts preceded by an A, and to a lesser extent T, whereas that at G3 was insensitive to flanking sequence composition. Likewise, G-tracts flanked 3′ by a T, C or A did not display marked sequence-dependent effects (Supplementary Figure S3B). Detailed plots of the SNV spectra along G-tracts either preceded (Supplementary Figure S3D), followed (Supplementary Figure S3E), or preceded and followed (Figure 2D) by a T, C or A revealed a noticeable effect only for 5′ T in association with G→T substitutions at G1 for tracts of length ≥8. Thus, despite a consistent over-representation of G-tracts flanked 5′ by a T (Supplementary Figures S3F and S1B), which must account for the high absolute number of SNVs at G1 for TG[n] relative to AG[n] and CG[n] (Supplementary Figure S3G), nearest-neighbor base composition seems to play a lesser role in SNV spectra at G-tracts than at A-tracts.
With respect to SNVs at the flanking 5′ and 3′ nearest positions, no B→A or H→G substitutions (Figure 1A) were found above a length threshold of 9 for A-tracts and 8 for G-tracts (Figure 2E, gray shading) out of 5969 SNVs, implying that tract expansion by recruiting flanking base pairs is disfavored at these lengths. In summary, base substitution along mononucleotide repeats is strongly skewed towards the edges of A-tracts and within the 5′ half of G-tracts, with frequencies that peak at midsize lengths (8–9 bp). For A-tracts ≥7 bp, base substitution occurred almost exclusively in the direction of the flanking nearest-neighbors. Finally, base substitution at flanking bases did not contribute to tract expansion for mononucleotide runs longer than 8–9 bp.
Insertions and deletions display length and positional preference
In addition to SNVs, mononucleotide runs are polymorphic in length as a result of indels. Herein, we consider separately two types of indels: one in which tract length changes by ±1 and flanking bp composition is not altered (slippage); the other comprising all other cases involving the addition or removal of 1–200 bp (indels). Slippage is a widely accepted mutational mechanism (11–12,34), whereby DNA replication errors at reiterated DNA motifs cause changes in the number of motifs (most often +/−1). The normalized fractions of slippage in the 1KGP dataset peaked at lengths of 8 bp for A-tracts and 9 bp for G-tracts (Figure 3A), generating bell-shaped curves similar to those observed for SNVs (Figure 1C) and with no differences in the highest fraction of ‘slipped’ tracts, which peaked at ∼0.02. By contrast, +1 slippage occurred more frequently than −1 slippage at A-tracts (Figure 3B). These results support recent studies on microsatellite repeats (12) and contrast with previous conclusions that slippage increases monotonically with tract length, and that the extent of slippage differs between A- and G-tracts (35,36).
Figure 3. Population insertions and deletions. (A) Normalized fractions of A-tracts (closed circles) and G-tracts (open circles) displaying +/−1 bp slippage in the 1KGP dataset as a function of tract length. Data were obtained by dividing the number of events by both the number of hg19 tracts and tract length (n). (B) Ratio of the number of +1 to −1 slippage for A-tracts (closed circles) and G-tracts (open circles). (C) Indels at A-tracts. For positions along the tracts (‘Tract’), ‘F Indel’ is the ratio between the number of indels and the number of tracts in hg19 multiplied by tract length. For the positions immediately flanking the tracts genomic coordinates (‘Before tract’ and ‘After tract’), ‘F Indel’ is the ratio between the number of indels and the number of tracts in hg19. (D) Indels at G-tracts, calculated as described in panel C. (E) Heatmap representation of insertions along A-tracts. The percent insertions (i.e. the number of insertions at each position divided by the number of tracts in hg19) (y-axis) plotted as a function of location (x-axis) from position 0 (insertion between the bp 5′ to the tract and the first bp of the tract) to position n + 1 (insertion between the bp 3′ to the last bp of the tract and the following bp) (see Figure 1A) and as a function of tract length (z-axis). (F) Heatmap representation of insertions along G-tracts.
With respect to indels, the normalized fractions were low (<1 × 10−3) along short (4–6 bp) A- and G-tracts, but rose to a plateau for longer tracts as reported earlier (11); this plateau was 10-fold higher for G-tracts (∼0.03) than for A-tracts (∼0.003) (Figure 3C and D). Indels also occurred more frequently (up to six-fold for A-tracts of length 11) at nearest-neighboring base pairs (‘Before tract’ and ‘After tract’ in Figure 3C and D) than along the tracts. Thus, contrary to SNVs and slippage, indels increased to a plateau with mononucleotide tract length.
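The normalizations described in the Figure 3 legend reduce to simple ratios, sketched below. The function names and the example counts are hypothetical; only the formulas (events divided by the number of hg19 tracts and, for positions along the tract, also by tract length) come from the text.

```python
def slippage_fraction(n_events, n_tracts, tract_length):
    """Normalized slippage fraction: events / (number of hg19 tracts * tract length)."""
    return n_events / (n_tracts * tract_length)


def f_indel_along_tract(n_indels, n_tracts, tract_length):
    """'F Indel' for positions along the tract: indels / (tracts * tract length)."""
    return n_indels / (n_tracts * tract_length)


def f_indel_flanking(n_indels, n_tracts):
    """'F Indel' for the 'Before tract' or 'After tract' position: indels / tracts."""
    return n_indels / n_tracts


# Hypothetical counts, chosen only to illustrate the arithmetic
print(f_indel_along_tract(30, 1000, 10))
print(f_indel_flanking(18, 1000))
```

Dividing by tract length puts tracts of different lengths on a per-position scale, which is why the flanking positions, each a single position, are divided by the number of tracts only.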
We analyzed in detail the locations of insertions along the tracts and the flanking positions with respect to the 5′ to 3′ orientation of the tracts (Figure 1A). The normalized fractions demonstrated that insertions peaked at the 3′, and to a lesser extent 5′, ends of the longest A-tracts (Figure 3E), but remained low. For G-tracts, insertions occurred most efficiently at two locations (G2–3 and G5) (Figure 3F), they increased with tract length (up to ∼0.04), and attained ∼10-fold higher values than for A-tracts. In conclusion, insertion sites at A- and G-tracts followed the patterns observed for SNVs (Figure 2A and B), suggesting that factors associated with local DNA dynamics sensitize specific bases along the tracts to genetic alteration, inducing both SBS and indels.
Base pair flexibility and charge localization map to sites of sequence changes
To elucidate elements of intrinsic DNA dynamics that may be responsible for the biases in SNV and insertion sites, we performed molecular dynamics (MD) and hybrid quantum mechanics/molecular mechanics (QM/MM) simulations on model A[6], A[9], G[6] and G[9] duplex DNA fragments. We focused on water bridge coordination (Figure 4A), bp step flexibility, and for G[6] and G[9], charge localization, as these properties are known to impact the susceptibility of DNA to base damage, repair and mutation. The fractions of one water coordination increased along the A[9] and A[6] structures in a 5′ to 3′ direction, irrespective of flanking sequence composition, in concert with a decrease in minor groove width (Figure 4B and Supplementary Figure S4A) as predicted (37). Vstep, a measure of bp structural fluctuation, displayed a prominent peak of ∼40 Å³deg³ at the 5′-TA-3′ step for both structures (Figure 4C and Supplementary Figure S4B), which together with low water occupancy points to 5′-TA-3′ being a preferred location for base modification and mutation. In the G[9] and G[6] structures water coordination involved mostly two-water bridges due to wide (∼14 Å) minor grooves (Figure 4D and Supplementary Figure S4C), whereas flexibility was modest (∼20–22 Å³deg³, Figure 4E and Supplementary Figure S4D). Thus, bp dynamics are likely to impact mutations at A-tracts to a greater extent than at G-tracts. Guanine has the lowest ionization potential (IP) of all four bases and IP further decreases at guanine runs, rendering them targets for electron loss, charge localization, oxidation and eventually mutation (4,38). Because after electron loss the ensuing charge (hole) can migrate along the DNA double-helix and relocalize at specific guanines, we addressed whether the preferred sites of mutation along G-tracts, i.e. G2–3 and G5, would also be preferred sites for charge localization.
The QM/MM determinations indicated that whereas for the short G[6] fragment the difference in the density-derived atomic partial charges (DDAPC) (i.e. the hole) localized most often (∼50%) to the first position (Figure 4F), for the long G[9] fragment charge localization shifted downstream (mostly to the second, but also to positions 6–7, Figure 4G). Importantly, the charge was found exclusively around the guanine rings (Figure 4H). Thus, the two main sites of sequence change along G-tracts, i.e. G2–3 and G5, coincide with positions where charge localization and hence one-electron oxidation reactions is predicted to occur most frequently. In summary, bp flexibility at A-tracts and charge transfer at G-tracts likely represent intrinsic DNA features underlying the bias in SNV and insertions at mononucleotide runs in human genomes.
MD and QM/MM simulations. (A) Molecular modeling of one (left) and two (right) minor groove water bridge coordination. (B) Fraction of one-water bridge occupancy (left axis) at A[9] DNA sequences flanked 5′ and 3′ by a T (black circles), C (red circles) or G (green circles). Minor groove widths (right axis), as determined from intrastrand phosphate-to-phosphate distances. (C) Vstep for A[9] DNA sequences, determined as the product of the square root of the eigenvalues (λi) described by the six bp step parameters shift, slide, rise, tilt, roll and twist; i.e. $V_{\mathrm{step}} = \prod_{i=1}^{6} \sqrt{\lambda_i}$. (D) Fraction of one- (black circles) and two-water (red circles) bridge occupancy (left axis) at G[9] DNA sequences. Minor groove widths (right axis), as assessed from intrastrand phosphate-to-phosphate distances. (E) Vstep for G[9] DNA sequences. (F) Average charge redistribution (open circles and right axis) for G[6] DNA structures upon vertical ionization, examined by calculating the difference on the density-derived atomic partial charges (DDAPC) for the neutral and negatively charged states. Histogram of the number of instances (left axis) in which the largest charge redistribution occurred at a specific position along the G[6] structures. (G) DDAPC for G[9] DNA structures (open circles and right axis) and histogram of the number of instances (left axis) in which the largest charge redistribution occurred at a specific position. (H) VMD rendering of a G[9] DNA structure displaying hole localization at G2. Capped base pairs were removed for clarity.
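The two summaries shown in panels F and G (the average charge redistribution per position, and the histogram of the position carrying the largest redistribution in each structure) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the input values are hypothetical, each inner list stands for the per-guanine DDAPC differences of one simulated structure, and "largest" is taken here to mean largest in absolute value.

```python
from collections import Counter


def hole_position_histogram(ddapc_by_structure):
    """Count how often each tract position (1 = G1) carries the largest
    DDAPC difference. Assumption: 'largest' means largest absolute value."""
    counts = Counter()
    for charges in ddapc_by_structure:
        best = max(range(len(charges)), key=lambda i: abs(charges[i]))
        counts[best + 1] += 1
    return counts


def mean_redistribution(ddapc_by_structure):
    """Average DDAPC difference at each position across all structures."""
    n = len(ddapc_by_structure)
    return [sum(column) / n for column in zip(*ddapc_by_structure)]


# Hypothetical DDAPC differences for three structures of a 3-guanine run
structures = [
    [0.6, 0.2, 0.1],
    [0.1, 0.7, 0.1],
    [0.5, 0.3, 0.1],
]
print(hole_position_histogram(structures))
print(mean_redistribution(structures))
```

The histogram reports where the hole localizes most often, whereas the mean reports how much charge each position carries on average; the two can differ when localization is sharp in individual structures but varies between them.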
Position and orientation along nucleosome core particles modulate sequence variation
DNA wrapped around histones in nucleosomes is subject to local deformation (39), which may impact mutation. Thus, we analyzed the 1KGP SNVs at A- and G-tracts predicted to overlap with well-positioned nucleosome core particles (NCPs) (16). In hg19, the percentage of tracts that overlap with NCPs decreased moderately from ∼90% at length of 4 to 81% and 71% for A- and G-tracts of length 13, respectively (Figure 5A), suggesting that mononucleotide runs are not depleted in NCPs in human genomes as previously proposed (40). A-tracts of lengths 4–8 base pairs displayed distinctive peaks along the NCP surface in phase with the helical repeat of DNA (10.5 bp) and with minor grooves facing toward the inner protein core (lengths 4–5) (16) (Figure 5B and Supplementary Figure S5A). A-tracts of length of 9–13 bp exhibited only half (six) the peaks evident for the shorter tracts. For the G-tracts, only small peaks with no clear minor groove-inward-facing regions were detected (Supplementary Figure S5B).
Positioning along nucleosome core particles. (A) Percent of A-tract (open circles) and G-tract (closed circles) base pairs in hg19 overlapping with well-positioned NCP genomic coordinates as a function of tract length. (B) Counts of base pairs in hg19 A-tracts of length 5 overlapping with NCP genomic regions as a function of distance from the histone octamer dyad axis. Minor groove-inward-facing regions (gray) were derived from the X-ray crystal structure of NCP147 (41). (C) Percent SNVs in the 1KGP dataset (left axis) at every bp along A-tracts of length 5 for tracts centered at maxima (black) and minima (gray) along NCPs (Figure 5B). Percent increase (right axis) of SNVs at minima relative to maxima (green). P-values for paired t-tests: 0.013 (*), 0.002 (**) and 4.7 × 10−6 (***). (D) Whisker plots of %SNVs (left axis) at A1 for A-tracts of length 5 centered at maxima and minima (black) along NCPs (Figure 5B). Percent difference (right axis) in the number of A-tracts of length 5 in hg19 preceded by C, T or G (red) between those centered at minima and those centered at maxima (Figure 5B). (E) C-containing/G-containing ratios (see text) for G-tracts of length 5 in hg19 as a function of distance from the NCP dyad axis (black) and location of core histones (maroon and green). Peaks correspond to negative iSAT (i.e. tilt parameters multiplied by the corresponding sin θ) values (gray) (39). Ratios of %SNV at G1 (upshifted by 0.5 for clarity) between C-containing (5′-CCCCCG-3′ sequences on the hg19 forward strand) and G-containing (5′-CGGGGG-3′ sequences on the hg19 forward strand) (Figure 1A) CG[5] tracts mapping NCP ChIP-seq genomic intervals (red) fitted by a non-parametric local regression (loess; sampling proportion, 0.100; polynomial degree, 3). (F) VMD rendering (top) of TATTT residues 34–38 (yellow) and the complementary AAATA residues 672–753 (pink) from the 1EQZ pdb nucleosomal crystal structure, corresponding to peak area from −40 to −36 in Figure 5E.
The switch in G-tract (lengths of 5 and 7) orientation along NCPs (bottom) serves to position the C-containing strand on the outside (yellow) and, correspondingly, the G-containing strand on the inside (pink).
To assess if tract-positioning along NCPs influences SNVs, we selected A-tracts of lengths 5, 7 and 9 bp and G-tracts of lengths 5 and 7 bp whose central positions coincided with either the maxima or minima (41) (Figure 5B and Supplementary Figure S5A and B) and conducted pairwise t-tests (330 total) between permutations of ‘categories’, including ‘tracts centered at maxima versus minima’, ‘position along the tracts’, ‘flanking sequence composition’, ‘specific NCP locations’ and ‘tract orientation’. For A-tracts, 79/207 (38%) significant pairs were found, 68 (86%) of which were related to differences between tracts centered at maxima versus minima, with a preponderance (63%) of tests displaying increased %SNVs at minima (Supplementary Figure S5C and E). For example, %SNVs at length 5 bp were greater at minima than at maxima at each position along the A-tracts (Figure 5C). A→C substitutions at A1 were more abundant at maxima than at minima (mean ± SD, 18.7 ± 0.7% at max and 17.6 ± 0.8% at min; P-value 0.001), whereas A→T substitutions at the same position displayed the opposite trend (mean ± SD, 18.4 ± 0.5% at max and 19.8 ± 1.1% at min; P-value 0.0005) (Figure 5D). A-tracts of length 7 also exhibited a similar pattern at A7 (Supplementary Figure S5H). The percentages of CA[5] and A[7]C tracts in hg19 centered at maxima were greater than at minima and the reverse was observed for the TA[5] and A[7]T tracts (Figure 5D and Supplementary Figure S5H). Thus, we conclude that positioning along the NCP surface of both the double-helical grooves and junctions with flanking base pairs influence SNVs along A-tracts. However, this influence is complex and for the most part, difficult to predict.
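For readers unfamiliar with the test used in these comparisons, the paired t statistic can be computed with the Python standard library as sketched below. The function name and the %SNV values are hypothetical and do not reproduce the data behind Figure 5. Converting the statistic to a P-value requires the Student t distribution, which a library routine such as SciPy's `scipy.stats.ttest_rel` provides.

```python
import math
import statistics


def paired_t_statistic(x, y):
    """Return (t, degrees of freedom) for a paired t-test on matched samples.

    t = mean(d) / (sd(d) / sqrt(n)), where d are the pairwise differences.
    Raises ZeroDivisionError if all differences are identical (sd = 0).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two matched samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(n)), n - 1


# Hypothetical %SNV values at matched NCP locations (minima vs maxima)
at_minima = [3.0, 5.0, 7.0]
at_maxima = [2.0, 3.0, 4.0]
t, dof = paired_t_statistic(at_minima, at_maxima)
print(t, dof)
```

Pairing matters here because each maximum is compared with a matched minimum, so the test operates on the differences rather than on the two groups as independent samples.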
For G-tracts, most pairwise comparisons (18/34, 53%) indicated SNV variation according to sequence orientation (Supplementary Figure S5F and G). In hg19, the ratio of the numbers of G-tracts of lengths 5 and 7 for which the C-containing strand coincided with the forward sequence (downstream example sequence in Figure 1A) to the numbers of G-tracts for which the G-containing strand coincided with the forward sequence (upstream example sequence in Figure 1A) (C-containing/G-containing ratios) displayed a prominent 10.5-bp oscillation in phase with iSAT (Figure 5E), a measure of ‘inside’ and ‘outside’ bases, according to the bp step tilt parameter (39). Analysis of the helical path of a 146-bp DNA fragment wrapped around histones showed that the oscillation in the C-containing/G-containing ratios corresponds to a preference for guanine bases to face the protein core (Figure 5F). We analyzed the subset of G-tracts preceded by a 5′ C (i.e. CG[5]) to assess whether SNVs at G1, the position known to be mutable due to CpG methylation also oscillated with the C-containing/G-containing ratios. Oscillation in SNV-C-containing/SNV-G-containing values was evident, with peaks aligning to the hg19 troughs (Figure 5E) implying that the cytosines facing the protein surface harbor more variants than those facing away. We conclude that A- and G-tracts display preferential positioning (the former) and orientation (the latter) along NCPs, which in turn modulate the rate of sequence variation.
Mutations associated with human disease
Knowing that the first and last As of long A-tracts and G2–3 in G-tracts are the major sites of SNV in the human population, we addressed whether these features are also discernible in mutated mononucleotide tracts associated with human genetic disease. We collected 9,450,456 unique SBSs (both SBSs and SNVs refer to single base changes) from sequenced cancer genomes and normalized the percent mutations along A- and G-tracts to enable a direct comparison with the 1KGP dataset. For A-tracts (Figure 6A and Supplementary Figure S6A), SBSs displayed the same trend as the 1KGP data (Figure 2A) with respect to the bell-shaped increase in mutations at A1 and An and the mutation spectra, although the susceptibility to mutation as a function of tract length attained greater values (6.36% for length 11 in cancer versus 4.15% for length 9 in the 1KGP datasets at A1). The first and last 3 bp also harbored more SBSs than in the 1KGP dataset for tracts >7 bp, a feature that we found to be due exclusively to a large cancer dataset (42) containing high-level microsatellite instability (MSI) samples (Supplementary Figure S6B and C), which are known to result from mismatch-repair deficiency (15). Thus, A-tracts display similar patterns of base substitution between the germline and somatic cancer tissues. For G-tracts, mutation spectra were characterized by G→T transversions at tract lengths >7, particularly at G1, the most frequently mutated position for tract lengths up to 11 bp (Figure 6B and Supplementary Figure S6D). This trend persisted even when the high rates of methylation-mediated deamination mutations at the CG dinucleotide were removed (Supplementary Figure S6E). Thus, mutation patterns in cancer genomes contrast with those observed in the germline, both with respect to the most mutable position (G1 versus G2–3) and the types of base substitution (G→T in cancer genomes versus G→T and G→C in the germline).
Mutation patterns in cancer genomes. (A) Mutation spectra for SBSs at A-tracts. Percent values were obtained by dividing the total number of SBSs at each position by the number of tracts in hg19 and then multiplying by 3.2516 to equalize the percentage of A-tracts of length 4 between the cancer genomes and the 1KGP datasets. (B) Mutation spectra for SBSs at G-tracts in cancer genomes. Percent values were obtained as in (A) using a multiplication factor of 3.7419. (C) Normalized fractions of A-tracts (closed circles) and G-tracts (open circles) displaying +/−1 bp slippage, obtained by dividing the number of events by both the number of tracts in hg19 and tract length. (D) Indels at A-tracts, calculated as described in Figure 3C. (E) Indels at G-tracts, calculated as described in Figure 3C. (F) Heatmap representation of insertions along G-tracts, as described in Figure 3E.
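One way to read the multiplication factors in this legend is as a single rescaling constant chosen so that the cancer and 1KGP percentages coincide at tract length 4. The sketch below illustrates that reading only. The function name and the percentages are hypothetical, and the legend does not state exactly how the factors 3.2516 and 3.7419 were derived.

```python
def scale_to_reference(target_pct, reference_pct, anchor_length=4):
    """Rescale target percentages so they match the reference at one tract length.

    target_pct, reference_pct: dicts mapping tract length -> percent mutated.
    Returns (factor, rescaled target dict). Assumes a single multiplicative
    factor anchored at `anchor_length`, which is an interpretation of the legend.
    """
    factor = reference_pct[anchor_length] / target_pct[anchor_length]
    scaled = {length: pct * factor for length, pct in target_pct.items()}
    return factor, scaled


# Hypothetical percentages (not the paper's data)
cancer = {4: 1.0, 5: 2.0, 6: 2.5}
one_kgp = {4: 3.0, 5: 3.5, 6: 3.8}
factor, cancer_scaled = scale_to_reference(cancer, one_kgp)
print(factor, cancer_scaled)
```

After rescaling, differences between the two datasets at longer tract lengths reflect a change in the shape of the length dependence rather than the overall number of mutations in each dataset.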
With respect to slippage, the fractions for A-tracts elicited an excess at lengths 9 and 10 bp relative to the 1KGP dataset, which was also due to the MSI-containing dataset. For G-tracts, the fractions peaked at length 8, as for the 1KGP dataset (Figures 3A and 6C), implying that the propensity to undergo slippage is indistinguishable between the germline and soma. Indels were also more abundant at flanking base pairs than along the tracts (Figure 6D and E), particularly for G-tracts of length >7, similar to the 1KGP dataset (Figure 3C and D). Detailed analyses of insertions revealed that both G1 and the preceding position were the most significant sites of mutation (F-values up to 0.08 at G1 for tracts of length 8) (Figure 6F). Thus, the 5′ end of long G-tracts is the most susceptible site for both SBSs and insertions in cancer genomes, in contrast to the germline where these occur within the runs, typically at G2–3.
We also extracted the mutated A- and G-tracts from the Human Gene Mutation Database (HGMD), a collection of >150,000 germline gene mutations associated with human inherited disease. A total of 1519 genes were mutated at A- or G-tracts out of a total of 3972 (38%); 3480 SBSs and 2866 slippage events were noted within these tracts, 85 and 46% of which were predicted to be disease-causing, respectively (Figure 7A and Supplementary Table S1). Ranking genes by the number of literature reports indicated that among the top 10 entries three were associated with cancer (BRCA1, BRCA2 and APC), two with hemophilia (F8 and F9), four with debilitating lesions of the skin (COL7A1), muscle (DMD), lung (CFTR) and kidney (PKD1), with one causing hypercholesterolemia (LDLR) (Figure 7B). Thus, mutations within A- and G-tracts carry a high social burden by contributing to some of the most common human pathological conditions.
Mutation patterns in HGMD and model for sequence context-dependent changes. (A) Number of germline SBSs and slippage events (Slip.) at A- and G-tracts in HGMD. Gene alterations were classified as disease-causing mutation (DM), likely disease-causing mutation (DM?), disease-associated and putatively functional polymorphism (DFP), disease-associated polymorphism with additional supporting functional evidence (DP) and in vitro/laboratory or in vivo functional polymorphism (FP). Codon changes (SIFT predictor) were classified as damaging (d), null (n), tolerated (t) and low-confidence prediction (l). (B) The 10 most commonly reported genes in HGMD with mutations at A- and G-tracts. Various mutated tracts were generally reported for the same gene in different reports. (C) Mutation spectra for SBSs at A- (left) and G-tracts (right) in HGMD. Percent values were obtained by dividing the total number of SBSs at each position by the number of tracts in hg19 exons. A|G→T (black), A|G→C (red), A→G (green), G→A (cyan). (D) Normalized fractions of A-tracts (closed circles) and G-tracts (open circles) displaying +/−1 bp slippage, obtained by dividing the total number of events by the number of tracts in hg19 exons and by tract length. (E) Model for sequence context-dependent changes at A-tracts (left) and G-tracts (right). *, site of base modification.
For both A- and G-tracts, SBSs occurred mostly at tract lengths of 4–7, with patterns more similar to those in the 1KGP than in the cancer datasets, both with respect to the location of the most mutable positions (first and last As and first/second Gs) and the types of base substitution (A→T and G→H) (Figure 7C and Supplementary Figure S6F). Likewise, slippage events peaked at tract lengths of 7–9 as observed in the 1KGP dataset (Figure 7D). In summary, the patterns of both SBSs and slippage in the HGMD dataset followed the trend observed in the 1KGP dataset, suggesting that germline variants at mononucleotide repeats leading to either population variation or human inherited disease may have arisen through similar mechanisms.
DISCUSSION
Why are specific A:T and G:C base pairs within A- and G-tracts more susceptible to sequence changes than their identical neighbors? For A-tracts, bp flexibility may play a role. Chemical damage to DNA, such as by hydroxyl radicals has been shown to be proportional to the geometrical solvent-accessible surface of the atomic groups, which increases with DNA flexibility (43). Along A-tracts flexibility is restricted, but it is high at both the 5′ and 3′ junctions. Thus, the fact that the highest rates of mutation coincide with the highest degree of flexibility at the 5′-TA-3′ bp step is consistent with the view that this position may be susceptible to DNA damage as a result of flexibility. Other sources of DNA dynamics are also likely to be relevant, such as sugar flexibility at the junctions, which increases with tract length (44). Chemical modification at these junctions may then lead to base substitution and indels, the latter as a result of strand breaks.
With respect to SNV mutation spectra, these were found mostly in the direction of flanking base composition above a length of 7–8 bp. We interpret this behavior in terms of DNA slippage along A-tracts when attempts are made during translesion synthesis (TLS) to bypass a damaged site (Figure 7Ei). Two scenarios may be considered to account for A→T transversions at A1. In the first, the last tract-template base would loop out into the polymerase active site permitting base-pairing and strand elongation (Figure 7Eii) using the tract-flanking base as a template (34,45–46). In the second (Figure 7Eiii), slippage would occur behind the polymerase, prompting extension past the newly created A*:T mispair generated by primer/template misalignment. Either pathway would yield a common intermediate (Figure 7Eiv) that contains the base complementary to the junction across from the damaged site upon slippage resolution (34). Following DNA synthesis (S) and/or repair (R) (Figure 7Ev and vi), this mispair will generate a base change that is always identical to the tract-flanking base.
For G-tracts, the high rates of G→T transversions at G1 in cancer genomes are also consistent with preferred chemical attack at this site due to high flexibility (Figure 7F top). Direct chemical attack at a guanine is known to result in stable products, such as 8-oxo-G and Fapy-G, both of which are known to yield G→T transversions (47–50). Thus, G1 may be the most susceptible site for such reactions for G-tracts of lengths ≥7 (Figure 7F right), which in cancer genomes would become a mutation hotspot. In the germline, SNVs peaked inside G-tract base pairs, while mutational spectra were insensitive to flanking base composition; these events are inconsistent with a role for template misalignment and slippage as noted for A-tracts. Rather, the correspondence between hotspot mutations at G2–3 and G5 and the QM/MM simulations suggests a role for charge transfer. A large body of work during the past 20 years using computational, theoretical chemistry and biophysical techniques on short oligonucleotides has shown that guanine is the most easily oxidizable base in DNA and that indeed a guanine radical cation can be generated through long-range hole transfer from an oxidant via one-electron oxidation mechanisms (51–55). GGG triplets were found to act as the most effective traps in hole transfer by both experimental and theoretical work (56–59), demonstrating that the resulting guanine radical cation (or its neutral deprotonated form) became rather delocalized, but it preferentially centered at the first and second G. These well-established patterns of chemical reactivity are consistent with our experimental observation of high mutation frequencies at G1 for short G-tracts and the results from QM/MM simulations on G[6]. For longer tracts, the downstream shift in mutation hotspots, i.e., G2–3 and G5, also correlates well with the charge localization predicted from QM/MM simulations, which explicitly included solvent effects and structural fluctuations.
Thus, in conjunction with the constrained density functional theory (60), both the neutral and oxidized forms of a guanine nucleobase can be reliably constructed to infer the accurate determination of mutational patterns of mononucleotide repeats in human genomic DNA.
The compact organization of the sperm genome (61), and presumably low levels of oxidative stress in the germline, may enable guanine oxidation through one-electron oxidation reactions rather than by direct chemical attack, thereby favoring the formation of radical cations. A charge injected at G1 by electron loss would then migrate to neighboring guanines and localize at sites of low IP, such as G2 (Figure 7F left). Guanine radical cations are known to readily undergo further chemical modification leading to products such as 8-oxo-G, oxazolone, imidazolone, guanidinohydantoin, and spiroiminodihydantoin (62) (M in Figure 7F), to yield G→T, G→C and G→A substitutions (4,63). Our model is in line with recent observations in which mutations at guanines within short G-runs (1–4 bp) correlate with sequence-dependent IPs at the target guanine in cancer genomes (9). Interestingly, these correlations were not observed in the germline (9). We interpret these composite observations as follows. The IP values for G-runs have been shown to decrease asymptotically with tract length, although the absolute values vary according to the methods and assumptions used (we obtained a value of 5.43 eV for both G[6] and G[9]) (64,65). We suggest that short G-runs with high IPs undergo one-electron oxidation reactions in the oxidative environment of cancer cells but would be refractory to such a mechanism in the germline (Figure 7F right yellow and left white sectors). As length increases and IP values fall, G-runs would be attacked directly by oxidants abundant in tumor cells (Figure 7F orange sector), whereas oxidation will be limited to electron loss in the germline environment (Figure 7F left yellow sector).
These models (template misalignment for A-tracts and charge transfer for G-tracts) suggest a more complex scenario for mechanisms underlying mononucleotide repeat polymorphism in the human population than recently proposed (13), in which nucleotide misincorporation by error-prone polymerases is proposed as a primary source of mutations at both A- and G-tracts. As already stated, the directionality of SNVs toward tract-flanking bases in A-tracts and the hotspot mutations at G2–3 support multiple and distinct mechanisms of base substitution at mononucleotide repeats.
Our analyses highlight additional information, including the lack of mutations in the direction of tract-base composition for base pairs flanking long tracts, the association with gene expression and the preference of guanines for the inner NCP surface, and extend prior observations (12) such as the bell-shaped character of base substitution and slippage, whose mechanisms remain to be fully clarified. Finally, we document the contribution of mononucleotide mutagenesis to key aspects of human pathology beyond the well-established MSI in cancer (15), including hemophilia and tissue degeneration. Our collective work supports the conclusion that as the human genome undergoes evolutionary diversification and along the way suffers disease-associated mutations, oxidation reactions including charge transfer may play a prominent role.
Severe combined immunodeficiency diseases (SCIDs) are a group of primary immunodeficiency diseases characterized by a severe lack of T cells (or T cell dysfunction) caused by various gene abnormalities and accompanied by B cell dysfunction (WHO, 1992; Buckley et al., 1997). The incidence rates in infants were 1/75,000-1/100,000 (WHO, 1992), but no morbidity statistics are available in China. The 2 genetic modes of SCID are X-linked recessive and autosomal recessive inheritance. X-linked severe combined immunodeficiency (X-SCID) is the most common form, accounting for 50-60% of SCID cases (Noguchi et al., 1993). The immune phenotype of patients with X-SCID is T-B+NK-, in which T cells (CD3+) and natural killer (NK) cells (CD16+/CD56+) are absent or significantly reduced, and the number of B cells (CD19+) is normal or increased, causing reduced immunoglobulin production and class switching disorder (Buckley, 2004; Fischer et al., 2005). The IL-2Rg gene mutation has been confirmed to be a major cause of X-SCID (Noguchi et al., 1993). In recent years, great progress has been made in understanding the pathogenesis of primary immunodeficiency disease and its application in clinical treatment, particularly regarding the development of critical care medicine and immune reconstruction technology. With timely control of infection and early bone marrow or stem cell transplantation, X-SCID patients can be treated, prolonging survival time. Therefore, early diagnosis of X-SCID is very important for patient treatment. Gene diagnosis has become a better method for early or differential diagnosis. In addition, familial X-SCID brings a great psychological burden to the relatives of patients. Ordinary chromosome analysis and immunological evaluation cannot be used for female carrier identification and fetal diagnosis, and gene diagnosis is the most effective method of carrier detection and prenatal diagnosis.
In this study, we detected mutations in 2 families with X-SCID and identified 2 novel mutations, confirming the X-SCID pedigrees. Based on the gene diagnosis, prenatal diagnosis was performed for the fetus carried by the mother of one of the probands. Female individuals in this family were subjected to carrier detection.
IL2Rg gene mutation test
Direct sequencing of exons 1-8 and the flanking regions of the IL2Rg gene by PCR in family 1 showed that the 3rd exon of the proband contained the c.361-363delGAG heterozygous deletion mutation, which led to deletion of the 121st amino acid, glutamate (p.E121del), in its coding product. There were no sequence variations in other coding regions or in the splice regions. The proband’s mother carried the same heterozygous mutation, while his father did not carry the mutation (Figure 2a, b, c). This mutation was not observed in any member of the control group, and this family was identified as an X-SCID family. The c.510-511insGAACT heterozygous insertion mutation was present in the 4th exon of the proband’s mother in family 2. This mutation was a 5-base repeat of GAACT, resulting in a change of amino acid 173 from tryptophan into a stop codon (p.W173X). There were no sequence variations in other coding regions or in the splice regions, and the patient’s father did not carry the mutation (see Figure 2d, e). We did not find this mutation in the healthy control group. We presumed that the 4th exon of the deceased child in family 2 contained the c.510-511insGAACT insertion mutation, leading to X-SCID symptoms, and thus we speculated that this family was an X-SCID pedigree.
Prenatal diagnosis
We verified the chorionic villus status of the fetus in family 1 using the PowerPlex 16 HS System kit. The results showed that the fetal tissue contained no maternal contamination and that this fetus was female. The results of prenatal diagnosis showed that there was no c.361-363delGAG (p.E121del) heterozygous mutation in the female fetus of family 1.
Figure 2. Sequencing graphs of the IL2Rg gene in 2 pedigrees with X-linked severe combined immunodeficiency. a.-c. Family 1. a. Normal control (rectangle indicates the 3 bases deleted in this patient). b. Proband carrying the c.361-363delGAG (p.E121del) mutation (arrow indicates the junction created by the deletion). c. The proband’s mother carried the c.361-363delGAG (p.E121del) heterozygous mutation (arrow). d.-e. Family 2. d. The proband’s mother carried the c.510-511insGAACT (p.W173X) heterozygous mutation (arrow indicates that the reverse sequencing graph was positive). e. Normal control (rectangular box indicates the 2 normal copies of GAACT; the mutated fragment contained 3 copies).
Carrier detection results
For the c.361-363delGAG (p.E121del) site, the gene analysis results of the female individuals in family 1 showed that I2 (proband’s grandmother) was a heterozygous carrier and that II3 (proband’s aunt) was a non-carrier and had no mutations.
IL-2 binds to the IL-2 receptor (IL-2R) on the immune cell membrane. IL-2R is composed of 3 subunits: the IL-2Ra chain (CD25), the IL-2Rb chain (CD122), and the IL-2Rg chain (CD132). IL-2Rg is a functional unit shared with the IL-4, IL-7, IL-9, IL-15, IL-21, and other cytokine receptors, and is therefore referred to as the common chain (Li et al., 2000). The IL-2Rg chain maintains the integrity of the IL-2R complex and is required for the internalization of the IL-2/IL-2R complex; it also links the receptor region at the cell membrane surface to downstream signal transduction molecules. Therefore, the integrity of the IL-2Rg chain is vital for the immune function of an organism (Malka et al., 2008; Shi et al., 2009).
Mutations in the IL2Rg gene, which encodes IL-2Rg, were identified as a major cause of X-SCID in 1993 (Noguchi et al., 1993). The IL2Rg gene is located on chromosome Xq21.3-22, is 37.5 kb in length, and contains 8 exons, which encode the 369 amino acids of IL-2Rg. The IL-2Rg chain comprises distinct structural regions: the signal peptide [amino acids (AA) 1-22], extracellular domain (AA 23-262), transmembrane region (AA 263-283), and intracellular region (AA 284-369). The WSXWS motif is located in the extracellular region (AA 237-241), while Box 1 is located in the intracellular region (AA 286-294).
By the end of 2013, the Human Gene Mutation Database contained a total of 200 mutations in the IL2Rg gene (HGMD Professional 2013.4). The most common mutation types were missense or nonsense mutations, which result from single base changes; a total of 100 have been identified. These were followed by insertion or deletion mutations (50 distinct mutations) and splice-site mutations (approximately 30). All eight exons contained mutations, and the 3rd and 4th exons carried the most, together accounting for 43% of all mutations (86/200). According to the X-SCID gene database (IL2RGbase) (http://research.nhgri.nih.gov/scid/), mutations in IL2Rg occur mainly in the extracellular region of the IL-2Rg chain (Fugmann et al., 1998). Zhang et al. (2013) reported that the IL2Rg gene mutations in 10 patients with X-SCID in China were located in the extracellular region. The two mutations reported in our study were also located in the extracellular region. The mutation in family 1 was a 3-base deletion that removes one codon in the 3rd exon. The c.361-363delGAG (p.E121del) mutation was located in the extracellular region of the IL-2Rg subunit, and we inferred that the loss of glutamate 121 would change the structure of the peptide chain, affecting signal transmission and resulting in serious symptoms. The mutation in family 2 was a GAACT repeat in the IL2Rg gene; this 5-base repeat changed codon 173 from tryptophan to a stop codon. The resulting peptide chain lacked 196 amino acids compared to the normal chain, including the intracellular, transmembrane, and some extracellular regions, directly affecting the structure and function of the receptor and causing disease. No studies have been reported regarding these 2 mutations.
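The position arithmetic behind these assignments can be checked with a short sketch. It assumes standard triplet numbering from the first base of the coding sequence and uses the domain boundaries quoted in the text; the function names `codon_of` and `domain_of` are illustrative and do not come from the study.

```python
# Domain boundaries of the IL-2Rg chain, as quoted in the text (amino-acid positions).
DOMAINS = [
    ("signal peptide", 1, 22),
    ("extracellular domain", 23, 262),
    ("transmembrane region", 263, 283),
    ("intracellular region", 284, 369),
]


def codon_of(cdna_pos: int) -> int:
    """Return the codon number that contains a 1-based coding-sequence position."""
    return (cdna_pos - 1) // 3 + 1


def domain_of(residue: int) -> str:
    """Return the name of the structural region that contains a residue."""
    for name, start, end in DOMAINS:
        if start <= residue <= end:
            return name
    raise ValueError(f"residue {residue} is outside the 369-residue chain")


# c.361-363delGAG: both ends of the deletion fall in the same codon,
# so exactly one residue is removed in frame.
first, last = codon_of(361), codon_of(363)
print(first, last, domain_of(first))

# p.W173X: the premature stop replaces residue 173.
print(173, domain_of(173))
```

Under these assumptions both variants map to the extracellular domain, in line with the statement that the two mutations lie in the extracellular region.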
Combining the mutation characteristics with the clinical manifestations, we diagnosed family 1 as an X-SCID pedigree. Although the patients in family 2 were deceased, it can be speculated that the 2 deceased patients in this family had X-SCID caused by c.510-511insGAACT (p.W173X).
Prenatal diagnosis can accurately establish the status of the fetus and help avoid birth defects, which can also ease the anxiety of the pregnant mother. Gene diagnosis for pedigrees of patients based on DNA samples has advanced recently, particularly with the application of high-throughput sequencing technology (Alsina et al., 2013). Gene analysis can now be used in the differential diagnosis of a variety of clinical infectious diseases. However, the effectiveness of prenatal diagnosis for pedigrees in which the proband is dead remains unclear. Because the gene mutation in the proband is unknown in these cases, the patient’s status can only be inferred from his mother’s genotype. However, we considered that if the mother of a deceased proband can be shown to carry a pathogenic mutation, then even if the proband did not have X-SCID, the woman is still at risk of having children with X-SCID, and the pedigree may show X-linked recessive inheritance. With informed consent, prenatal diagnosis may offer these families a way to prevent the birth of affected children.
Gene diagnosis of IL2Rg can also be used for carrier detection of suspected females in the family.
In the present study, we performed carrier detection of the patient’s grandmother and aunt in family 1 and determined that the patient’s pathogenic mutations were from his grandmother. His aunt did not inherit the pathogenic gene, and thus she was a non-carrier and her fertility will not be affected. In this study, we used direct sequencing of PCR products and identified IL2Rg gene mutations in 2 pedigrees with X-SCID. We found 2 unreported mutations in the IL2Rg gene, and prenatal diagnosis and carrier detection were conducted in 1 X-SCID family. Because the incidence rate of X-SCID is extremely low, it is difficult to promote the widespread use and application of genetic diagnosis. However, this study may provide some implications for the diagnosis of infants with immunodeficiency, and gene diagnosis techniques such as conventional or high-throughput sequencing should be used as soon as possible during pregnancy, which can be used to guide treatment. This method can also provide reliable prenatal diagnosis and carrier detection service for these families.
MEF2A gene mutations and susceptibility to coronary artery disease in the Chinese population
Coronary artery disease (CAD) has high morbidity and mortality rates worldwide. Thus, the pathogenesis of CAD has long been the focus of medical studies. Myocyte enhancer factor 2A (MEF2A) was first discovered as a CAD-related gene by Wang (2005) and Wang et al. (2003, 2005). Three mutation sites in exon 7 of MEF2A were subsequently identified by Bhagavatula et al. (2004); however, Altshuler and Hirschhorn (2005) and Weng et al. (2005) predicted that the MEF2A gene lacked mutations. Zhou et al. (2006a,b) analyzed the mutations and polymorphisms in exons 7 and 11 of the MEF2A gene in the Han population in Beijing, and various rare mutations were found in exon 11 rather than in exon 7. The clinical significance of specific 21-bp deletions in MEF2A was also explored, and previous studies have shown mixed results. In this study, polymerase chain reaction-single-strand conformation polymorphism (PCR-SSCP) and DNA sequencing were used to analyze exon 11 of the MEF2A gene in samples collected from 210 CAD patients and 190 healthy controls, in order to investigate the role of the MEF2A gene in CAD pathogenesis and the correlation between its variants and CAD.
CAD, a common disease in China, is induced by multiple factors, such as genetics, the environment, and lifestyle. Thus, a multi-faceted approach is necessary in the study of CAD pathogenesis, particularly in molecular biology research, which is important for developing comprehensive treatment of CAD based on gene therapy. The MEF2A gene was first identified as a CAD-related gene through linkage analysis of a large family with CAD (9 of 13 patients developed MI) in 2003.
In this study, we found the following mutations: 1) codon 451G/T (147191) heterozygous or homozygous mutation; 2) loss of 1 (Q), 2 (QQ), 3 (QQP), 6 (425QQQQQQ430), and 7 (424QQQQQQQ430) amino acids (147108-147131); and 3) codon 435G/A (147143) heterozygous mutation. Among these mutations, the synonymous mutation at locus 147191 was confirmed by reference to the National Center for Biotechnology Information (NCBI) database to be a single nucleotide polymorphism, which was also demonstrated in our study by the extensive presence of this polymorphism in healthy controls. However, the heterozygous mutation at locus 147143 was only found in the genomes of CAD patients, and was therefore identified as a mutation.
Although MEF2A has been reported as a CAD-related gene, the results of studies from different countries are conflicting. Weng et al. (2005) screened gene mutations in exon 11 of the MEF2A gene from 300 CAD patients and 1500 healthy controls. They hypothesized that the changes in 5-12 CAG repeats are genetic polymorphisms and that the 21-base deletion in exon 11 of the MEF2A gene did not induce autosomal dominant genetic CAD. Gonzalez et al. (2006) suggested that the CAG repeat polymorphism was independent of MI susceptibility in Spanish patients. Kajimoto et al. (2005) reported that the CAG repeat sequence was not correlated with MI susceptibility in Japanese patients. Horan et al. (2006) also found that the CAG repeat sequence was not associated with the susceptibility to early-onset familial CAD in an Irish population. Hsu et al. (2010) identified no correlation between the CAG repeat sequence and CAD susceptibility in the Taiwanese population. Dai et al. (2010) found that the structural change in exon 11 was not related to CAD in the Chinese Han population. Lieb et al. (2008) and Guella et al. (2009) hypothesized that MEF2A was independent of CAD. However, Yuan et al. (2006) and Han et al. (2007) suggested that the CAG repeat sequence was correlated with CAD because 9 CAG repeats was an independent predictor of CAD. Elhawari et al. (2010) and Maiolino et al. (2011) suggested that MEF2A is a susceptibility gene for CAD. Dai et al. (2013) showed that mutations in exon 12 are associated with the early onset of CAD in the Chinese population. Liu et al. (2012) failed to demonstrate a correlation between the CAG repeat sequence and CAD through case-control analysis, systematic review, and meta-analysis, but found that the 21-base deletion in exon 11 was strongly associated with CAD, and that genetic variation in MEF2A may be a relatively rare, but specific, cause of CAD/MI. Kajimoto et al. (2005) reported 4-15 CAG repeats.
However, only 4-11 CAG repeats were observed in our study, possibly because of genetic differences in patients in this study. Eleven CAG repeats were observed in most samples from the control group, and the proportion of 10, 9, and 8 repeats exceeded 1%. The heterozygous mutation at 147143, as well as the 4 and 5 CAG repeats, was only observed in CAD patients. Thus, we speculated that the CAG repeat sequence is correlated with CAD susceptibility, and the presence of 4 or 5 repeats may be a risk factor for CAD, which was inconsistent with the results obtained by Han et al. (2007). The inconsistency in these results may be explained by the differences in subjects and sample sizes among studies.
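As an illustration of what a repeat count means at the sequence level, the following minimal sketch reports the longest uninterrupted run of CAG units in a DNA string. The study itself genotyped samples by PCR-SSCP and sequencing; the helper name `longest_cag_run` and the toy sequence are invented for this example and are not MEF2A sequence.

```python
import re


def longest_cag_run(seq: str) -> int:
    """Return the length, in repeat units, of the longest uninterrupted CAG run."""
    runs = re.findall(r"(?:CAG)+", seq.upper())
    return max((len(run) // 3 for run in runs), default=0)


# Toy sequence (hypothetical): 11 CAG units, an interrupting CCG, then 2 more units.
toy_allele = "ATG" + "CAG" * 11 + "CCG" + "CAG" * 2 + "TGA"
print(longest_cag_run(toy_allele))
```

A real pipeline would also need to fix the reading frame and decide how to treat interrupted tracts; this sketch ignores both.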
Impact of glucocerebrosidase mutations on motor and nonmotor complications in Parkinson’s disease
Here, we conducted a multicenter retrospective cohort analysis, and the data were investigated by survival time analysis to show the impact of GBA mutations on PD clinical course. We also investigated regional cerebral blood flow (rCBF) and cardiac sympathetic nerve degeneration of subjects with GBA mutations, compared with matched PD controls.
3.1. Subjects
Among the 224 eligible PD patients (the subjects were not related to each other), 9 subjects were excluded from the analysis (4 due to multiple system atrophy findings on subsequent brain MRI and 5 because of insufficient clinical information). Therefore, 215 PD patients [female, 52.1%; age, 66.7 ± 10.8 (mean ± standard deviation)] were analyzed. For non-PD healthy controls, 126 patients’ spouses (female, 58.7%; age, 67.3 ± 10.3) without a family history of PD or GD were enrolled.
3.2. GBA mutations and risk ratios for PD
In the PD subjects, we identified 10 nonsynonymous and 2 synonymous GBA variants. Of the nonsynonymous variants, 7 had previously been reported in GD as GD-associated mutations [R120W, L444P-A456P-V460 (RecNciI), L444P, D409H, A384D, D380N, and 444L (1447-1466 del 20, insTG)]. The other 3 nonsynonymous mutations have never been reported in GD patients [I(-20)V, I489V, and one novel mutation, Y11H].
GD-associated GBA mutations were found in 19 of the 215 (8.8%) PD patients but in none of the healthy controls. The risk of PD development associated with these mutations was estimated as an OR of 25.1 [95% confidence interval (CI), 1.50–420; p = 0.0001] with 0-cell correction. The nonsynonymous mutations that were not reported in GD patients had no association with PD development (p = 0.506; OR, 1.3; 95% CI, 0.7–2.6) (Table 1). Four subjects had double mutations. For subsequent analyses, the 2 subjects with double mutations of I(-20)V and K466K were assigned to the group of mutations unreported in GD, and the 2 subjects with double mutations of R120W and I(-20)V, and of R120W and L336L, were assigned to the group of GD-associated mutations.
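The reported OR and interval can be reproduced if the "0-cell correction" is taken to mean adding 0.5 to every cell of the 2 × 2 table (the Haldane-Anscombe correction), followed by a Wald interval on the log scale. The source does not name the method, so this is an assumption, and the function name is illustrative. The p-value is not reproduced here because the test that produced it is not stated.

```python
import math


def corrected_odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI after adding 0.5 to each cell (assumed correction)."""
    a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Counts from the text: 19 carriers among 215 PD patients, 0 among 126 controls.
or_est, ci_low, ci_high = corrected_odds_ratio(19, 215 - 19, 0, 126)
print(f"OR = {or_est:.1f}, 95% CI {ci_low:.2f}-{ci_high:.0f}")
# prints "OR = 25.1, 95% CI 1.50-420"
```

The very wide interval follows directly from the empty control cell: after correction, that cell contributes 1/0.5 = 2 to the variance of the log OR and dominates the other three terms.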
Table 1. Frequency of glucocerebrosidase gene alleles in Parkinson’s disease patients and controls
3.3. Clinical features of PD patients by GBA mutation groups
The clinical features of PD patients with GD-associated mutations, those with mutations unreported in GD, and those without mutations are shown in Table 2. In the GD-associated mutation group, females, those with a family history, and those with dementia (DSM IV) were significantly more frequent than in the no-mutation group (p = 0.047, 0.012, and 0.020, respectively). The age of PD onset was lower in patients with GD-associated mutations (55.2 ± 9.9 years, mean ± standard deviation) than in those without mutations (59.3 ± 11.5), although the difference was not statistically significant. There were no differences in clinical manifestations between subjects with mutations unreported in GD and those without mutations, except for dopamine agonist dosage (p = 0.026) (Table 2).
Table 2. Epidemiological and clinical features of PD patients with Gaucher disease–associated GBA mutations, those with mutations previously unreported in GD, and those without mutations
3.4. Survival time analyses to develop dementia, psychosis, dyskinesia, and wearing-off
Time to develop clinical outcomes (dementia, psychosis, dyskinesia, and wearing-off) was compared in 19 subjects with GD-associated mutations, 29 with mutations unreported in GD, and 167 without mutation. The median observation time was 6.0 years. The subjects with GD-associated mutations showed a significantly earlier development of dementia and psychosis, compared with subjects without mutation (p < 0.001 and p = 0.017) (Supplementary Table e-1, Fig. 1A and B). We re-reviewed the clinical record of the subject who showed early dementia (defined by DSM IV) (Fig. 1A) and confirmed that it did not satisfy the criteria for DLB (McKeith et al., 2005).
Fig. 1.
Kaplan–Meier curves of dementia and psychosis in Parkinson’s disease (PD) patients with Gaucher disease (GD)-associated glucocerebrosidase gene (GBA) mutations and those without mutations. PD patients with GD-associated GBA mutations and those without GBA mutations were compared to investigate the time taken to develop dementia (A) and psychosis (B). Because of insufficient information in several patients, the numbers in each analysis were different. The patients with and without mutations were 17 and 165 (A), 18 and 165 (B) against a total of 19 and 167. DSM IV, Diagnostic and Statistical Manual of Mental Disorders, revised fourth edition. p-Values were calculated by log-rank tests.
The associations of GBA mutations with these symptoms were estimated as HRs, adjusting for sex and age at PD onset. HRs were 8.3 for dementia (95% CI, 3.3–20.9; p < 0.001) and 3.1 for psychosis (95% CI, 1.5–6.4; p = 0.002). The time until development of wearing-off and dyskinesia did not differ significantly, with HRs of 1.5 (95% CI, 0.8–3.1; p = 0.219) and 1.9 (95% CI, 0.9–4.1; p = 0.086), respectively (Table 3).
Table 3. Hazard ratios of GBA pathogenic mutations for clinical symptoms
Model | Clinical feature | Hazard ratio | 95% CI | p
1 | Dementia (DSM-IV) | 8.3 | 3.3–20.9 | <0.001
2 | Psychosis | 3.1 | 1.5–6.4 | 0.002
3 | Wearing-off | 1.5 | 0.8–3.1 | 0.219
4 | Dyskinesia | 1.9 | 0.9–4.1 | 0.086
Each model was adjusted for sex and age at onset.
Key: CI, confidence interval; DSM-IV, Diagnostic and Statistical Manual of Mental Disorders, fourth edition; GBA, glucocerebrosidase.
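The internal consistency of Table 3 can be checked by back-calculating a p-value from each hazard ratio and its confidence interval. The sketch below assumes that the intervals are Wald intervals on the log scale (the usual output of a Cox model, although the source does not say so) and works only from the rounded published values; the function name is illustrative.

```python
import math


def wald_p_from_ci(hr, lower, upper, z_crit=1.96):
    """Two-sided Wald p-value implied by a hazard ratio and its 95% CI."""
    se_log_hr = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(hr) / se_log_hr
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area


# Hazard ratios and 95% CIs as printed in Table 3.
table3 = {
    "Dementia (DSM-IV)": (8.3, 3.3, 20.9),
    "Psychosis": (3.1, 1.5, 6.4),
    "Wearing-off": (1.5, 0.8, 3.1),
    "Dyskinesia": (1.9, 0.9, 4.1),
}
implied_p = {name: wald_p_from_ci(*row) for name, row in table3.items()}
for name, p in implied_p.items():
    print(f"{name}: implied p = {p:.4f}")
```

The first two rows recover the reported p-values to the printed precision. The last two come out near 0.24 and 0.10, close to the reported 0.219 and 0.086; a gap of that size is expected when the inputs are rounded to one decimal place.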
Subjects with mutations unreported in GD did not show significant differences in time to develop all 4 outcomes, compared with no mutation subjects. Therefore, subjects with GD-unreported mutations were regarded as subjects without GBA mutations in further analyses.
3.5. rCBF on SPECT in patients with GD-associated GBA mutations
We conducted pixel-by-pixel comparisons of rCBF on SPECT between PD subjects with mutations (cases) and sex-, age-, and disease duration-matched PD subjects without any mutations in GBA (controls). Four controls were selected for each case (except for a 34-year-old female case who was matched to a single control), and in total 12 cases (female 50%; age at SPECT, mean ± standard error (SE), 58.9 ± 3.3 years; disease duration at SPECT, 7.3 ± 1.5 years) and 45 controls (female 64.4%; age at SPECT, mean ± SE, 61.0 ± 1.3 years; disease duration at SPECT, 7.1 ± 0.7 years) were analyzed. A significantly lower rCBF was seen in the cases than in the controls in the bilateral parietal cortex, including the precuneus (Fig. 2).
Fig. 2.
Regional cerebral blood flow in the group with GD-associated mutations compared with the matched Parkinson’s disease group without mutations. Regions with lower regional cerebral blood flow in the group with GD-associated mutations displayed on an anatomic reference map. Abbreviation: GD, Gaucher disease.
3.6. H/M ratios on MIBG scintigraphy in patients with GD-associated GBA mutations
Cardiac MIBG scintigraphy visualizes catecholaminergic terminals in vivo, which are reduced in PD patients along with brain dopaminergic neurons. We also compared MIBG scintigraphy between 16 cases (female 68.8%; age at examination, mean ± SE, 60.2 ± 2.6 years; disease duration at examination, 6.2 ± 1.2 years) and 61 sex-, age-, and disease duration-matched controls (female 63.8%; age, 62.0 ± 1.1 years; disease duration, 5.5 ± 0.6 years), matched 1:4 except for 1 young 34-year-old female case who was matched to a single control. Both early and late H/M ratios were reduced in both groups and did not differ significantly between them (p = 0.309 and 0.244) (Supplementary Table e-2).
4. Discussion
4.1. Contributions of GD-associated GBA mutations to the development of PD
In the analysis of 215 PD patients and 126 non-PD controls, we identified 10 nonsynonymous heterozygous GBA mutations, including 1 novel mutation. Among these mutations, 7 were GD-associated, and the patients carrying them represented 8.8% of the PD cohort. No significant association was found between the GD-unreported mutations and PD development, which suggests that only the GD-associated mutations are a genetic risk for PD. According to a worldwide multicenter analysis of 1883 fully sequenced PD patients, GD-associated mutations are found in 7% of non-Ashkenazi Jewish PD patients (Sidransky et al., 2009). Although the mutation frequency in the present study was similar to previous results, the OR of GD-associated heterozygous mutations (25.1) was considerably greater than the OR (5.43) in other ethnic cohorts (Sidransky et al., 2009) and was consistent with an OR of 28.0 from a previous Japanese report (Mitsui et al., 2009). These results, taken together, suggest the possibility that GBA mutations confer a distinct risk for PD in the Japanese population. However, a larger Japanese cohort study is required to confirm this.
4.2. Cross-sectional clinical figures of PD with GBA mutations
4.3. Impact of GBA mutations on the clinical course of PD
To investigate the impact of GBA mutations on the clinical course of PD, a long-term prospective study is preferred. Although there have been few longitudinal studies to date, follow-up clinical data over a median of 6 years for 121 PD cases from a community-based incident cohort were recently reanalyzed; the results demonstrate that progression to dementia defined by DSM IV (HR 5.7) and to Hoehn and Yahr stage 3 (HR 3.2) occurs significantly earlier in 4 GBA mutation-carrier patients compared with 117 patients with wild-type GBA (Winder-Rhodes et al., 2013). A 2-year follow-up clinical report of 28 heterozygous GBA carriers who were recruited from relatives of GD patients shows slight but significant deterioration of cognition and olfaction, compared to healthy controls (Beavan et al., 2015). Brockmann et al. (2015) assessed motor and nonmotor symptoms including cognitive and mood disturbances for 3 years in 20 PD patients with GBA mutations and showed a more rapid progression of motor impairment and cognitive decline in GBA mutation cases compared to sporadic PD controls. The current long-term retrospective cohort study, with up to 12 years of follow-up, reinforced these results. It revealed that dementia and psychosis developed significantly earlier in subjects with GD-associated mutations compared with those without mutation, and the HRs of GBA mutations were estimated at 8.3 for dementia and 3.1 for psychosis, with adjustments for sex and PD onset age. In contrast, the results showed no significant difference in developing wearing-off and dyskinesia.
In this study, we also investigated whether GD-unreported mutations affected the clinical course of PD. In both cross-sectional and survival time analyses, the mutations unreported in GD carried no increased burden on clinical symptoms such as dementia, psychosis, wearing-off, and dyskinesia.
4.4. Reduced rCBF in PD with GBA mutations compared with matched PD controls
We found a significantly decreased rCBF, reflecting decreased synaptic activity, in the bilateral parietal cortex including the precuneus, in subjects with GD-associated mutations compared with matched subjects without mutations. The pattern of reduced rCBF was very similar to the pattern on H₂¹⁵O positron-emission tomography that Goker-Alpan et al. (2012) reported, showing decreased resting rCBF in the lateral parietal association cortex and the precuneus bilaterally in GD subjects with parkinsonism (7 subjects with homozygous or compound heterozygous GBA mutations), compared with 11 PD subjects without GBA mutations. These results suggest that PD patients with heterozygous GBA mutations and GD patients presenting with parkinsonism share a common pattern of reduced rCBF. Interestingly, in their study, rCBF in the precuneus, but not in the lateral parietal cortex, correlated with IQ, suggesting that the involvement of the precuneus is critical for defining GBA-associated patterns.
4.5. Reduced cardiac MIBG H/M ratios, as in matched PD controls
We also showed that cardiac MIBG H/M ratios in subjects with GD-associated mutations were lower than the cutoff point for PD discrimination (Sawada et al., 2009), suggesting that postganglionic sympathetic nerve terminals to the epicardium were denervated, as they are in PD without mutations.
4.6. Mechanisms of impact on PD clinical course by GD-associated GBA mutations
Experimental studies suggesting a bidirectional pathogenic loop between α-synuclein and glucocerebrosidase have accumulated (Fishbein et al., 2014, Gegg et al., 2012, Mazzulli et al., 2011, Noelker et al., 2015, Schondorf et al., 2014 and Uemura et al., 2015). Loss of glucocerebrosidase function compromises α-synuclein degradation in the lysosome, whereas aggregated α-synuclein inhibits the normal lysosomal function of glucocerebrosidase. The pathogenic loop may facilitate neurodegeneration in the GD-associated PD brain, resulting in early development of dementia or psychosis as shown in the present study. Several recent studies propose that a mechanism similar to that in PD with GBA mutations exists even in the idiopathic PD brain (Alcalay et al., 2015, Chiasserini et al., 2015, Gegg et al., 2012 and Murphy et al., 2014). On the other hand, the impact of GD-associated GBA mutations on the development of motor complications such as wearing-off and dyskinesia was not statistically significant, suggesting that other pathophysiological mechanisms in the striatal circuit emerge after long-term therapy, especially with l-dopa.
4.7. Limitations
Our study has several limitations. In the design of the study, we assumed a sample size of 215 PD patients for survival time analyses and investigated 224 PD patients. We assumed that the mutation prevalence would be 9.4%, and in fact we found 19 patients with mutations (8.5%) among the 224 patients. Based on these figures, we estimated the risk ratios of heterozygous GBA mutations for the risk of PD development and PD clinical symptoms as ORs in the cross-sectional multivariate analyses, although the 95% CIs were broad. Larger numbers of subjects will be needed to determine robust risk ratios.
Comprehensive Genetic Characterization of a Spanish Brugada Syndrome Cohort
Brugada syndrome (BrS) was identified as a new clinical entity in 1992 [1]. Six years later, the first genetic basis for the disease was identified, with the discovery of genetic variations in SCN5A [2]. Nowadays, more than 300 pathogenic variations in this first gene are known to be associated with BrS [3]. SCN5A encodes the α subunit of the cardiac voltage-dependent sodium channel (Nav1.5), which is responsible for inward sodium current (INa), and thus plays an essential role in phase 0 of the cardiac action potential (AP). Genetic variations in this gene can explain around 20–25% of BrS cases [3].
Since BrS was classified as a genetic disease, several other genes have been described to confer BrS-susceptibility [4–7]. Pathogenic variations have been mainly described in: 1) genes encoding proteins that modulate Nav1.5 function, and 2) other calcium and potassium channels and their regulatory subunits. All these proteins participate, either directly or indirectly, in the development of the cardiac AP. Although the incidence of pathogenic variations in these BrS-associated genes is low [6], it is considered that, among all of them, they could provide a genetic diagnosis for up to an extra 5–10% of BrS cases. Hence, altogether, a genetic diagnosis can be achieved approximately in 35% of clinically diagnosed BrS patients.
Other types of genetic abnormalities have been suggested to explain the remaining percentage of undiagnosed patients. Indeed, multiplex ligation-dependent probe amplification (MLPA) has allowed the detection of large-scale gene rearrangements involving one or several exons of SCN5A in BrS cases. However, the low proportion of BrS patients carrying large genetic imbalances identified to date suggests that this type of rearrangement will provide a genetic diagnosis for a modest percentage of BrS cases [8–10].
BrS has been associated with an increased risk of sudden cardiac death (SCD), despite the reported variability in disease penetrance and expressivity [11]. The prevalence of BrS is estimated at about 1.34 cases per 100 000 individuals per year, with a higher incidence in Asia than in the United States and Europe [12]. However, the dynamic nature of the typical electrocardiogram (ECG) and the fact that it is often concealed, hinder the diagnosis of BrS. Therefore, an exhaustive genetic testing and subsequent family screening may prove to be crucial in identifying silent carriers. A large percentage of these pathogenic variation carriers are clinically asymptomatic, and may be at risk of SCD, which is, sometimes, the first manifestation of the disease [13].
In the present work, we aimed to determine the spectrum and prevalence of genetic variations in BrS-susceptibility genes in a Spanish cohort diagnosed with BrS, and to identify variation carriers among relatives, which would enable the adoption of preventive measures to avoid SCD in their families.
Table 1. Demographics of the 55 Spanish BrS patients included in the study.
The table shows the demographic characteristics of all the patients included in the study. Numbers in parentheses represent the relative percentages for each condition. T1 ECG refers to Type 1 BrS diagnostic electrocardiogram (ECG), obtained either spontaneously or after drug challenge. The information regarding both the electrophysiological studies (EPS) and the treatment was not available for all the patients. Two of the patients who did not receive any treatment died, and were not taken into account for the calculations of percentages (+2 dead). ICD, implantable cardioverter defibrillator.
Table 2. Characteristics of the Spanish BrS patients carrying rare genetic variations.
The table shows the clinical characteristics of the probands who carried rare genetic variations in SCN5A, SCN2B, or RANGRF. All of them are potentially pathogenic except that found in RANGRF, which is of unknown significance (see discussion). All the potentially pathogenic variations (PPVs) that had been previously reported, except p.P1725L and p.R1898C, had been identified in BrS patients. p.P1725L had been associated with Long QT Syndrome and p.R1898C was found in Exome Variant Server with a MAF of 0.0079%. No rare variations were identified in the control population. Patient’s age is expressed in years. Bold identifies the patients carrying variations that had not been described previously. M, male; F, female; S, syncope; ICD, implantable cardioverter defibrillator; UK, unknown; EPS, electrophysiological studies (+, positive response; -, negative response; N/P, not performed). The two patients who carried two PPVs each are identified by a and b, respectively.
We performed a genetic screening of 14 genes (SCN5A, CACNA1C, CACNB2, GPD1L, SCN1B, SCN2B, SCN3B, SCN4B, KCNE3, RANGRF, HCN4, KCNJ8, KCND3, and KCNE1L), which allowed the identification of 61 genetic variations in our cohort. Of these, 20 were classified as potentially pathogenic variations (PPVs), one as a variation of unknown significance, and 40 as common or synonymous variants considered benign.
The 20 PPVs were found in 18 of the 55 patients (32.7% of the patients, 83.3% males; Table 2). Sixteen patients (88.9%) carried one PPV, and two patients (11.1%) carried two different PPVs each. Nineteen out of the 20 PPVs identified were localized in SCN5A and one in SCN2B.
The vast majority of the PPVs identified were missense (70%). We also detected 2 nonsense variations (10%), 3 insertions or deletions causing frameshifts (15%), and one splicing variation (5%). The three frameshifts (p.R569Pfs*151, p.E625Rfs*95 and p.R1623Efs*7) were identified in SCN5A. These were not found in any of the databases consulted (see Methods), and were thus considered potentially pathogenic (see below). The other 16 rare variations identified in SCN5A had been previously described, and hence were also considered potentially pathogenic. Fourteen of them had been identified in BrS patients. Of these, 6 had also been identified in individuals diagnosed with other cardiac electric diseases (i.e. Sick Sinus Syndrome, Long QT Syndrome, Sudden Unexplained Nocturnal Death Syndrome or Idiopathic Ventricular Fibrillation [2,15,16,20,21,25]). The other 2, p.P1725L and p.R1898C, had only been associated with Long QT Syndrome or found in Exome Variant Server with a MAF of 0.0079%, respectively. Furthermore, we identified a variation in SCN2B (c.632A>G in exon 4 of the gene, resulting in p.D211G) which was considered pathogenic. This patient was included within our cohort, but the functional characterization of channels expressing SCN2B p.D211G was the subject of a previous study from our group [7]. We also identified a nonsense variation in RANGRF which has been formerly reported as a rare genetic variation of unknown significance [29].
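The percentages quoted for the PPVs follow from simple counts, as the sketch below shows. The missense count of 14 is not stated in the text; it is inferred here as the 20 PPVs minus the 2 nonsense, 3 frameshift, and 1 splicing variations. The carrier share uses the 18 of 55 patients reported earlier in the results.

```python
# Counts of potentially pathogenic variations (PPVs) by type.
# The missense count is inferred: 20 total minus the other three categories.
ppv_counts = {"missense": 14, "nonsense": 2, "frameshift indel": 3, "splicing": 1}

total_ppvs = sum(ppv_counts.values())
shares = {kind: round(100 * n / total_ppvs) for kind, n in ppv_counts.items()}
for kind, pct in shares.items():
    print(f"{kind}: {ppv_counts[kind]}/{total_ppvs} = {pct}%")

# Share of the cohort carrying at least one PPV (18 of 55 patients).
carrier_share = round(100 * 18 / 55, 1)
print(f"PPV carriers: {carrier_share}% of patients")
```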
Additionally, we screened the relatives of those probands carrying a PPV. We analysed a total of 129 relatives, 69 of which (53.5%) were variation carriers. Genotype-phenotype correlations evidenced that 8 of the families displayed complete penetrance (S3 Table). Additionally, no relatives were available for one of the probands carrying a PPV, thus hampering genotype-phenotype correlation assessment. The other 12 families showed incomplete penetrance.
MLPA analysis
The 37 patients with negative results after the genetic screening of the 14 BrS-associated genes underwent MLPA analyses of SCN5A. This technique did not reveal any large exon deletion or duplication in this gene for any of the patients.
SCN5A p.R569Pfs*151 (c.1705dupC), a novel PPV
A 41-year-old asymptomatic male presented a type 3 BrS ECG which was suggestive of BrS. Flecainide challenge unmasked a type 1 BrS ECG (Fig 1A, left), which was also observed spontaneously on occasion during medical follow-up. Sequencing of SCN5A revealed a duplication of a cytosine at position 1705 (c.1705dupC; Fig 1A, right), which causes a frameshift that leads to a truncated Nav1.5 channel (p.R569Pfs*151). The proband’s sister also carried this duplication, but had never presented signs of arrhythmogenesis. The proband’s twin daughters were also variation carriers, displayed normal ECGs and, to date, are asymptomatic (Fig 1A, middle). Thus, p.R569Pfs*151 represents a novel genetic alteration in the Nav1.5 channel that could potentially lead to BrS, but with incomplete penetrance.
Fig 1. Characteristics of the probands carrying non-reported potentially pathogenic variations (PPVs) in SCN5A and their families.
Left: Electrocardiograms of the probands: (A) patient carrying the p.R569Pfs*151 variation, showing the ST elevation characteristic of BrS in V1 at the time of the flecainide test; (B) patient carrying the p.E625Rfs*95 variation, showing the spontaneous ST elevation characteristic of BrS in V1 and V2; and (C) patient carrying the p.R1623Efs*7 variation, showing the spontaneous ST elevation characteristic of BrS in V1 and V2. Middle: Family pedigrees. Open symbols designate clinically normal subjects, filled symbols mark clinically affected individuals and question marks identify subjects without an available clinical diagnosis. Plus signs indicate the carriers of the PPVs and minus signs, non-carriers. The crosses mark deceased individuals and arrows identify the proband. Right: Detail of the electropherograms obtained after SCN5A sequence analysis of a control subject (left panels) and of the probands (right panels).
SCN5A p.E625Rfs*95 (c.1872dupA), a novel PPV
A 51-year-old asymptomatic male was diagnosed with BrS since he presented a spontaneous ST segment elevation in leads V1 and V2 characteristic of a type 1 BrS ECG (Fig 1B, left). Sequencing of SCN5A revealed an adenine duplication at position 1872 (c.1872dupA, Fig 1B, right). This genetic variation results in a truncated Nav1.5 channel (p.E625Rfs*95). The genetic analysis of the proband’s relatives showed that only his mother carried the variation (Fig 1B, middle). She was asymptomatic, but a BrS ECG was unmasked upon ajmaline challenge. The proband’s sister was found dead in her crib at 6 months of age, which suggests that her death might be compatible with BrS. Therefore, the p.E625Rfs*95 variation in the Nav1.5 channel represents a novel genetic alteration potentially causing BrS.
SCN5A p.R1623Efs*7 (c.4867delC), a novel PPV
The proband, a 31-year-old male, was admitted to hospital after suffering a syncope. His baseline 12-lead ECG showed an ST segment elevation in leads V1 and V2 that strongly suggested BrS type 1 (Fig 1C, left). A deletion of the cytosine at position 4867 (c.4867delC) was observed upon SCN5A sequencing (Fig 1C, right). This base deletion leads to a frameshift that produces a truncated Nav1.5 channel (p.R1623Efs*7). Genetic screening of his parents and sisters showed that none of them carried this novel variation (Fig 1C, middle). None of them had presented any signs of arrhythmogenicity, nor had a BrS ECG. Nevertheless, in utero genetic analysis of one of his daughters showed that she had inherited the variation. She died at 1 year of age of non-arrhythmogenic causes. Hence, the p.R1623Efs*7 variation in the Nav1.5 channel is a novel genetic alteration that arose de novo in the proband and could potentially lead to BrS.
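The residue numbers in the three protein-level names can be checked against the cDNA positions with simple codon arithmetic. The following is a minimal Python sketch, assuming coding positions are numbered from the first base of the start codon (the usual HGVS convention). The function names `codon_number` and `position_in_codon` are illustrative and do not come from the study.

```python
def codon_number(cdna_pos: int) -> int:
    """Return the 1-based codon that contains a 1-based coding-DNA position."""
    if cdna_pos < 1:
        raise ValueError("coding positions start at 1")
    return (cdna_pos - 1) // 3 + 1


def position_in_codon(cdna_pos: int) -> int:
    """Return 1, 2 or 3: where the base sits inside its codon."""
    if cdna_pos < 1:
        raise ValueError("coding positions start at 1")
    return (cdna_pos - 1) % 3 + 1


# cDNA positions of the three novel frameshifts, taken from the text
novel_frameshifts = [("c.1705dupC", 1705), ("c.1872dupA", 1872), ("c.4867delC", 4867)]

for name, pos in novel_frameshifts:
    print(name, "codon", codon_number(pos), "base", position_in_codon(pos))
```

Positions 1705 and 4867 are the first bases of codons 569 and 1623, matching p.R569Pfs*151 and p.R1623Efs*7. Position 1872 is the third base of codon 624, so a duplication there leaves codon 624 intact and the first residue that can change is 625, which is consistent with p.E625Rfs*95.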
Synonymous and common genetic variations portrayal
In our cohort, we identified 40 single nucleotide variations which were common genetic variants and/or synonymous variants (S2 Table). Twenty-nine had a minor allele frequency (MAF) over 1%, and were thus considered common genetic variants.
We also identified 11 variants with MAF less than 1%. Of them, 9 were synonymous variants, which led us to assume that they were not disease-causing. Four of these synonymous variants were not found in any of the databases consulted, and thus their MAF was considered to be less than 1%. Each of these synonymous variations was identified in 1 patient of the cohort. A similar proportion of individuals carrying these novel variations was detected upon sequencing of 300 healthy Spanish individuals (600 alleles). The remaining 2 variants were missense, and although they had either a MAF of less than 1% or an unknown MAF according to the Exome Variant Server and dbSNP websites, they were common in our cohort (29.2% and 50%, respectively; S2 Table), and a similar MAF was detected in a Spanish cohort of healthy individuals (26.7% and 48.8%, respectively).
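The sorting rule described in the two paragraphs above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' pipeline: the function name `classify_variant` and the output labels are hypothetical, a MAF of exactly 1% is treated as rare (the text only says "over 1%"), and a variant absent from the databases is represented by `None`.

```python
from typing import Optional

COMMON_MAF_THRESHOLD = 0.01  # 1%, the cut-off stated in the text


def classify_variant(maf: Optional[float], synonymous: bool) -> str:
    """Label a variant using the MAF and synonymy rules described in the text.

    maf is a fraction (0.01 == 1%). None means the variant was not found in
    the databases consulted, which the text treats as MAF < 1%.
    """
    if maf is not None and maf > COMMON_MAF_THRESHOLD:
        return "common"
    if synonymous:
        # Assumption made in the text: rare synonymous variants are not causal
        return "rare synonymous"
    return "rare non-synonymous"


print(classify_variant(0.25, synonymous=False))      # database MAF of 25%
print(classify_variant(None, synonymous=True))       # absent from databases
print(classify_variant(0.000079, synonymous=False))  # MAF of 0.0079%
```

A rule based on database MAF alone would have labelled the 2 missense variants as rare, even though they were frequent in both the patient cohort and the healthy Spanish controls. That is why the comparison with a population-matched control group matters here.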
Influence of phenotype and age on PPV discovery
To assess whether a connection existed between the probands’ phenotype and the PPV detection yield, we classified the patients in our cohort according to their ECG (spontaneous or induced type 1), the presence of BrS cases within their families, and the presence/absence of symptoms. Although the overall PPV detection yield was 32.7%, it was higher for symptomatic patients (Fig 2). Indeed, in this group of patients, having a family history of BrS was identified as a factor for increased PPV discovery yield. In the absence of BrS in the family, the variation discovery yield was almost double for patients having a spontaneous type 1 BrS ECG than for patients with a drug-induced type 1 ECG (45.5% vs 25%, respectively). In addition, we identified a PPV in 44.4% of the asymptomatic patients who presented a family history of BrS and a spontaneous type 1 BrS ECG. When the patient presented a drug-induced type 1 ECG or in the absence of family history of BrS, the PPV discovery yield was around 15%.
Fig 2. Influence of the phenotype on PPV discovery yield.
Bar graph comparing the PPV detection yield in 8 different clinical categories (stated below the graph). Each bar shows the total number of patients for each clinical category divided in those with a PPV (black) and those without an identified PPV (white). The number of patients (in brackets) and percentages are given. Pos, positive; Neg, negative; Spont, spontaneous type 1 BrS ECG; Drug, drug-induced type 1 BrS ECG; n, number of patients.
We also investigated the role of age in PPV occurrence. No significant age differences were observed between variation carriers and non-carriers (38.6±10.3 and 43.5±14.4 years, respectively, p = 0.16). However, the PPV discovery yield was higher for patients aged between 30 and 50 years: of all the patients carrying a PPV, 83.3% were in this age range, while 11.1% were younger and 5.6% were older (Fig 3A, upper panel). The PPV discovery yield was significantly higher for symptomatic than for asymptomatic patients (42.3% vs 24.1%, respectively; Fig 3A, lower panels).
Fig 3. Influence of the age on PPVs discovery yield.
(A) Pie charts showing the distribution of patients in the overall population as well as in the categories of symptomatic and asymptomatic patients regarding PPV discovery. The percentage and the number of patients (in brackets) are given for each group. The small pie charts correspond to the age distribution of patients with an identified PPV. (B) Bar graphs of the PPV detection yields obtained for each of the age groups (< 30 years, 30–50 years and > 50 years). Numbers inside each bar correspond to the number of patients carrying a PPV for each category and the percentages represent the variation detection yield.
Notably, in the 30–50 age range, 52.9% (9/17) of the symptomatic patients and 35.3% (6/17) of the asymptomatic patients carried one PPV (Fig 3B, middle). Additionally, 40% (2/5) of the symptomatic young patients (< 30 years) were variation carriers, while no PPVs were identified in asymptomatic patients within this age range.
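The detection yields quoted in this section are carrier counts divided by group sizes. A minimal Python sketch of that arithmetic follows. The counts 18/55, 9/17, 6/17 and 2/5 are stated in the text. The counts 11/26 and 7/29 are an assumption: they are inferred from the reported percentages and from the 7 asymptomatic carriers among the 18 PPV carriers, and are not given explicitly. The name `detection_yield` is illustrative.

```python
def detection_yield(carriers: int, total: int) -> float:
    """Return the PPV detection yield as a percentage of the group."""
    if total <= 0 or not 0 <= carriers <= total:
        raise ValueError("need 0 <= carriers <= total and total > 0")
    return 100.0 * carriers / total


# (carriers, group size)
groups = {
    "overall": (18, 55),
    "symptomatic": (11, 26),             # inferred, see lead-in
    "asymptomatic": (7, 29),             # inferred, see lead-in
    "symptomatic, 30-50 years": (9, 17),
    "asymptomatic, 30-50 years": (6, 17),
    "symptomatic, < 30 years": (2, 5),
}

for label, (carriers, total) in groups.items():
    print(f"{label}: {carriers}/{total} = {detection_yield(carriers, total):.1f}%")
```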
Overall, 55 unrelated Spanish patients clinically diagnosed with BrS were included in our study. Table 1 shows the demographics of this cohort, and Table 2 and S1 Table show the clinical and genetic characteristics of all the patients included in the study. The mean age at clinical diagnosis was 41.9±13.3 years. Although the majority of patients were males (74.5%), their age at diagnosis did not differ from that of females (41.8±12.1 years and 42.3±16.3 years, respectively; p = 0.92). A type 1 BrS ECG was present spontaneously in 37 patients (67.3%), and drug challenge revealed a type 1 BrS ECG for the remaining 18 patients (32.7%). Almost half of the patients had experienced symptoms, including 2 SCD and 4 aborted SCD. Patients who had not previously experienced any signs of arrhythmogenicity despite having a BrS ECG were considered asymptomatic. Comparison of symptomatic vs asymptomatic patients showed a similar percentage of males (73.1% and 75.9%, respectively). However, the mean age at diagnosis was different between the two groups of patients (37.7±14.3 and 45.7±11.4 years, respectively; p<0.05).
Discussion
To the best of our knowledge, this is the first comprehensive genetic evaluation of 14 BrS-susceptibility genes and MLPA of SCN5A in a Spanish cohort. Well-defined BrS cohorts from Japan, China, Greece and even Spain have been genetically studied [24,30–32]. Additionally, an international compendium of BrS genetic variations identified in more than 2100 unrelated patients from different countries was published in 2010 [3]. However, all these studies screened SCN5A exclusively. In 2012, Crotti et al. reported the spectrum and prevalence of genetic variations in 12 BrS-susceptibility genes in a BrS cohort [5]. However, this study included patients of different ethnicities. Here, we report the analysis of 14 genes conducted on a well-defined BrS cohort of the same ethnicity.
Our results confirm that SCN5A is still the most prevalent gene associated with BrS. Indeed, the proportion of SCN5A-mediated BrS in our cohort (30.9%) is higher than that described in other European reports [3,23], where a potentially causative variation is identified in only 20–25% of BrS patients. The reason for this discrepancy is unclear but could point towards a higher prevalence of SCN5A PPVs in the Spanish population or to selection bias. Additionally, we identified a genetic variation in SCN2B (c.632A>G, which results in p.D211G). We have previously published the comprehensive electrophysiological characterization of this variation, and showed that it could indeed be responsible for the phenotype of the patient, thus linking SCN2B with BrS for the first time [7]. We also identified a variation in RANGRF. This variation (c.181G>T leading to p.E61X) had been previously reported in a Danish atrial fibrillation cohort [33]. Surprisingly, the authors reported an incidence of 0.4% for this variation in the healthy Danish population, which brought its pathogenicity into question. Our finding of this variation in an asymptomatic patient displaying a type 2 BrS ECG also points toward considering it a rare genetic variation with a potential modifier effect on the phenotype but not clearly responsible for the disease [29].
No PPVs were identified in the other genes tested. Certainly, it is well accepted that the contribution of these genes to the disease is minor, and thus should only be considered under special circumstances [13,34]. In addition, recent studies have questioned the causality of variations identified in some of these minority genes [35].
We also used the MLPA technique for the detection of large exon duplications and/or deletions in SCN5A in patients without PPVs, and no large rearrangements were identified. This is in accordance with previous reports, which revealed that such imbalances are uncommon [8–10].
Kapplinger et al. [3] reported a predominance of PPVs in transmembrane regions of Nav1.5. Indeed, it has been proposed that most rare genetic variations in interdomain linkers may be considered non-pathogenic [36]. In contrast, the PPVs identified in this study are mainly located in extracellular loops and cytosolic linker regions of Nav1.5 (Fig 4). Additionally, 2 of our previously unreported frameshifts are located in the DI-DII linker. These 2 genetic variations lead to truncated proteins, which would lack around 75% of the protein sequence, and thus are presumed to be pathogenic.
Fig 4. Nav1.5 channel scheme showing the relative position of the SCN5A PPVs identified in our cohort.
Open symbols indicate already described variations and closed symbols locate novel variations reported in this study. DI to DIV designate the 4 domains of the protein, and numbers 1–6 identify the different segments within each domain. Crosses mark the voltage sensor.
In our cohort, we have identified 40 synonymous or common genetic variations, 4 of which have not been previously reported. These variations are gradually becoming more and more important in the explanation of certain phenotypes of genetic diseases. Only a few common variations identified here are already published as phenotypic modifiers [37,38]. The effect of these and other common variants identified in our cohort on BrS phenotype should be further studied.
Unexpectedly, almost 40% (7/18) of the PPV carriers did not present signs of arrhythmogenicity. We also performed genotype-phenotype correlations of the PPVs identified in the families (S3 Table). These studies uncovered relatives, most of whom were young individuals, who carried a familial variation but had never exhibited any clinical manifestations of the disease. This is in agreement with Crotti et al. and Priori et al. [5,23], who postulated that a positive genetic testing result is not always associated with the presence of symptoms. Indeed, the existence of asymptomatic patients carrying genetic variations described to cause a severe Nav1.5 channel dysfunction has been reported [39]. The identification of silent carriers is of paramount importance since it allows the adoption of preventive measures before any lethal episode takes place. Unknown environmental factors, medication and modifier genes have been suggested to influence and/or predispose to arrhythmogenesis [11]. Hence, this group of patients has to be cautiously followed in order to avoid fatal events.
Our studies on the connection between patients’ phenotype and the PPV detection yield highlighted the presence of symptoms as a factor for an increased variation discovery yield. Within the group of symptomatic individuals, a PPV was identified in a higher proportion of patients displaying a spontaneous type 1 BrS ECG than of patients showing a drug-induced ECG. Likewise, within the asymptomatic patients with a family history of BrS, those who presented a spontaneous type 1 BrS ECG carried a PPV more often than those with a drug-induced ECG (Fig 2). Regarding age, the vast majority (17/20, 85%) of the PPVs were identified in patients around their fourth decade of age (30–50 years). This is in accordance with the accepted mean age of disease manifestation. Moreover, in this age range, more than 50% of the patients who presented symptoms carried a variation that could be pathogenic (Fig 3). Importantly, 35.3% of asymptomatic patients of around 40 years of age also carried one such variation. These data highlight the importance of performing a genetic test even in the absence of clinical manifestations of the disease, particularly in the 30–50 years range, which is in accordance with consensus recommendations [13,34].
In conclusion, we have analysed for the first time 14 BrS-susceptibility genes and performed MLPA of SCN5A in a Spanish BrS cohort. Our cohort showed male prevalence with a mean age of disease manifestation around 40 years. BrS in this cohort was almost exclusively SCN5A-mediated. The mean PPV discovery yield in our Spanish BrS patients is higher than that described for other BrS cohorts (32.7% vs 20–25%, respectively), and is even higher for patients in the 30–50 years age range (up to 53% for symptomatic patients). Taken together, this evidence supports genetic testing, at least of SCN5A, in all clinically well diagnosed BrS patients.
Study Limitations
First of all, drug challenge tests were not performed for all the relatives who were asymptomatic variation carriers. This hampered their clinical diagnosis and represents an impediment to definitively assessing the link between PPVs and BrS. These patients are currently under follow-up.
New PPVs have been identified in our cohort. The clinical information available for the families suggests that these new variations could be pathogenic. Still, in vitro studies of these variations are required in order to evaluate their functional effects and verify their pathogenic role. Additionally, genotyping in an independent cohort would help reduce the likelihood of type I (false positive) error in genetic variant discovery.
We have to acknowledge that the study set is relatively small. Consequently, the classification of patients according to the different clinical categories rendered rather small sub-groups, which may lead to over-interpretation of the results. Future studies will be directed to the genetic screening of additional Spanish BrS patients, which will probably reinforce the significance of the tendencies observed here.
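To make the small-subgroup caveat concrete, one can attach a confidence interval to a subgroup yield. The sketch below uses the Wilson score interval, which is a choice made here for illustration and is not part of the authors' analysis. The function name `wilson_interval` and the 95% level (z = 1.96) are likewise assumptions.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Subgroup counts stated in the text: (carriers, group size)
for label, (k, n) in {"symptomatic, 30-50 years": (9, 17),
                      "symptomatic, < 30 years": (2, 5)}.items():
    low, high = wilson_interval(k, n)
    print(f"{label}: {k}/{n}, 95% CI {100 * low:.0f}% to {100 * high:.0f}%")
```

Under these assumptions the interval for 9/17 spans roughly 31% to 74%, and the interval for 2/5 roughly 12% to 77%, which illustrates how little the subgroup percentages constrain the underlying yield.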