Posts Tagged ‘genetic variants’

The Human Genome Gets Fully Sequenced: A Simplistic Take on Century Long Effort

 

Curator: Stephen J. Williams, PhD

Article ID #295: The Human Genome Gets Fully Sequenced: A Simplistic Take on Century Long Effort. Published on 6/14/2022

WordCloud Image Produced by Adam Tubman

Ever since Rosalind Franklin’s painstaking X-ray crystallography revealed key features of DNA structure, and Francis Crick and James Watson concurrently built their model of the molecule, DNA has been considered the basic unit of heredity and life, with the “Central Dogma” (DNA to RNA to protein) at its core.  These discoveries of the mid-twentieth century helped drive a transformational shift in biological experimentation, from protein isolation and characterization, to cloning protein-encoding genes, to characterizing how genes are expressed temporally, spatially, and contextually.

Rosalind Franklin, whose crystallographic data led to the determination of DNA structure, shown on a Time cover for 1953

Dr. Francis Crick and James Watson in front of their model structure of DNA

Up to this point (the 1970s to mid-1980s), genetic information was thought to be rather static. The main goal was still to understand and characterize protein structure and function, while knowledge of the underlying genetic information mattered mostly for efforts such as linkage analysis of genetic defects and as a source of tools for the rapidly developing field of molecular biology.  But the development of those molecular biology tools, including DNA cloning, sequencing, and synthesis, gave scientists the idea that a complete record of the human genome might be possible and worth the effort.

How the Human Genome Project Expanded Our View of Genes, Genetic Material, and Biological Processes

From the Human Genome Project Information Archive

Source:  https://web.ornl.gov/sci/techresources/Human_Genome/project/hgp.shtml

History of the Human Genome Project

The Human Genome Project (HGP) refers to the international 13-year effort, formally begun in October 1990 and completed in 2003, to discover all the estimated 20,000-25,000 human genes and make them accessible for further biological study. Another project goal was to determine the complete sequence of the 3 billion DNA subunits (bases) in the human genome. As part of the HGP, parallel studies were carried out on selected model organisms such as the bacterium E. coli and the mouse to help develop the technology and interpret human gene function. The DOE Human Genome Program and the NIH National Human Genome Research Institute (NHGRI) together sponsored the U.S. Human Genome Project.

 

Please see the following for goals, timelines, and funding for this project

 

History of the Project

It is interesting to note that multiple pieces of government legislation are credited with enabling the funding of such a massive project, including:

Project Enabling Legislation

  • The Atomic Energy Act of 1946 (P.L. 79-585) provided the initial charter for a comprehensive program of research and development related to the utilization of fissionable and radioactive materials for medical, biological, and health purposes.
  • The Atomic Energy Act of 1954 (P.L. 83-706) further authorized the AEC “to conduct research on the biologic effects of ionizing radiation.”
  • The Energy Reorganization Act of 1974 (P.L. 93-438) provided that responsibilities of the Energy Research and Development Administration (ERDA) shall include “engaging in and supporting environmental, biomedical, physical, and safety research related to the development of energy resources and utilization technologies.”
  • The Federal Non-nuclear Energy Research and Development Act of 1974 (P.L. 93-577) authorized ERDA to conduct a comprehensive non-nuclear energy research, development, and demonstration program to include the environmental and social consequences of the various technologies.
  • The DOE Organization Act of 1977 (P.L. 95-91) mandated the Department “to assure incorporation of national environmental protection goals in the formulation and implementation of energy programs; and to advance the goal of restoring, protecting, and enhancing environmental quality, and assuring public health and safety,” and to conduct “a comprehensive program of research and development on the environmental effects of energy technology and program.”

It should also be emphasized that the project was funded not just through the NIH but also through the Department of Energy.

Project Sponsors

For a great read on Dr. Craig Venter, including interviews with the scientist, see Dr. Larry Bernstein’s excellent post The Human Genome Project

 

By 2003 we had gained much information about the structure of DNA, genes, exons, and introns, along with new insight into the diversity of genetic material, the underlying protein-coding genes, and many of the regulatory elements that control gene expression.  However, much uninvestigated material remained dispersed between genes, the so-called “junk DNA”, and up to 2003 little was known about its function.  In addition, there were two other problems:

  • The reference DNA came from only a handful of individuals: the public reference was a mosaic of a few anonymous donors, while the competing Celera assembly drew heavily on the DNA of Craig Venter, who led that private effort
  • Multiple gaps in the DNA sequence existed and needed to be filled in

It is important to note that transcriptomic and proteomic studies have revealed a tremendous diversity of proteins.  Although only about 20,000 to 25,000 coding genes exist, the human proteome contains about 600,000 proteoforms (due to alternative splicing, post-translational modifications, etc.).

This expansion of proteoforms, through alternative splicing into isoforms and gene duplication into paralogs, has been shown to have major effects on, for example, cellular signaling pathways (1).
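The scale of that expansion is easy to check with back-of-the-envelope arithmetic. The short Python sketch below uses only the two figures quoted above (20,000 to 25,000 coding genes and about 600,000 proteoforms); the function and variable names are illustrative, and the per-gene average is a rough guide rather than a measured value, since proteoforms are far from evenly distributed across genes.

```python
# Figures quoted in the text; everything else is simple arithmetic.
CODING_GENES_LOW = 20_000
CODING_GENES_HIGH = 25_000
PROTEOFORMS = 600_000


def proteoforms_per_gene(proteoforms: int, genes: int) -> float:
    """Average number of proteoforms per protein-coding gene."""
    return proteoforms / genes


# Fewer genes in the denominator gives the higher per-gene average.
high = proteoforms_per_gene(PROTEOFORMS, CODING_GENES_LOW)
low = proteoforms_per_gene(PROTEOFORMS, CODING_GENES_HIGH)

print(f"Roughly {low:.0f} to {high:.0f} proteoforms per coding gene on average")
```

In other words, each coding gene would need to give rise to a few dozen distinct protein forms on average to account for the quoted total.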

More recently, however, it was reported that the full human genome has been sequenced, completed, and verified.  This was the focus of a recent issue of the journal Science.

Source: https://www.science.org/doi/10.1126/science.abj6987

Abstract

Since its initial release in 2000, the human reference genome has covered only the euchromatic fraction of the genome, leaving important heterochromatic regions unfinished. Addressing the remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium presents a complete 3.055 billion–base pair sequence of a human genome, T2T-CHM13, that includes gapless assemblies for all chromosomes except Y, corrects errors in the prior references, and introduces nearly 200 million base pairs of sequence containing 1956 gene predictions, 99 of which are predicted to be protein coding. The completed regions include all centromeric satellite arrays, recent segmental duplications, and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies.

 

The current human reference genome was released by the Genome Reference Consortium (GRC) in 2013 and most recently patched in 2019 (GRCh38.p13) (1). This reference traces its origin to the publicly funded Human Genome Project (2) and has been continually improved over the past two decades. Unlike the competing Celera effort (3) and most modern sequencing projects based on “shotgun” sequence assembly (4), the GRC assembly was constructed from sequenced bacterial artificial chromosomes (BACs) that were ordered and oriented along the human genome by means of radiation hybrid, genetic linkage, and fingerprint maps. However, limitations of BAC cloning led to an underrepresentation of repetitive sequences, and the opportunistic assembly of BACs derived from multiple individuals resulted in a mosaic of haplotypes. As a result, several GRC assembly gaps are unsolvable because of incompatible structural polymorphisms on their flanks, and many other repetitive and polymorphic regions were left unfinished or incorrectly assembled (5).

 

Fig. 1. Summary of the complete T2T-CHM13 human genome assembly.
(A) Ideogram of T2T-CHM13v1.1 assembly features. For each chromosome (chr), the following information is provided from bottom to top: gaps and issues in GRCh38 fixed by CHM13 overlaid with the density of genes exclusive to CHM13 in red; segmental duplications (SDs) (42) and centromeric satellites (CenSat) (30); and CHM13 ancestry predictions (EUR, European; SAS, South Asian; EAS, East Asian; AMR, ad-mixed American). Bottom scale is measured in Mbp. (B and C) Additional (nonsyntenic) bases in the CHM13 assembly relative to GRCh38 per chromosome, with the acrocentrics highlighted in black (B) and by sequence type (C). (Note that the CenSat and SD annotations overlap.) RepMask, RepeatMasker. (D) Total nongap bases in UCSC reference genome releases dating back to September 2000 (hg4) and ending with T2T-CHM13 in 2021. Mt/Y/Ns, mitochondria, chrY, and gaps.

Note in Figure 1D the steady growth in sequenced (non-gap) bases across successive reference genome releases.

Also very important is the ability to determine all the paralogs, isoforms, areas of potential epigenetic regulation, gene duplications, and transposable elements that exist within the human genome.

Analyses and resources

A number of companion studies were carried out to characterize the complete sequence of a human genome, including comprehensive analyses of centromeric satellites (30), segmental duplications (42), transcriptional (49) and epigenetic profiles (29), mobile elements (49), and variant calls (25). Up to 99% of the complete CHM13 genome can be confidently mapped with long-read sequencing, opening these regions of the genome to functional and variational analysis (23) (fig. S38 and table S14). We have produced a rich collection of annotations and omics datasets for CHM13—including RNA sequencing (RNA-seq) (30), Iso-seq (21), precision run-on sequencing (PRO-seq) (49), cleavage under targets and release using nuclease (CUT&RUN) (30), and ONT methylation (29) experiments—and have made these datasets available via a centralized University of California, Santa Cruz (UCSC), Assembly Hub genome browser (54).

 

To highlight the utility of these genetic and epigenetic resources mapped to a complete human genome, we provide the example of a segmentally duplicated region of the chromosome 4q subtelomere that is associated with facioscapulohumeral muscular dystrophy (FSHD) (55). This region includes FSHD region gene 1 (FRG1), FSHD region gene 2 (FRG2), and an intervening D4Z4 macrosatellite repeat containing the double homeobox 4 (DUX4) gene that has been implicated in the etiology of FSHD (56). Numerous duplications of this region throughout the genome have complicated past genetic analyses of FSHD.

The T2T-CHM13 assembly reveals 23 paralogs of FRG1 spread across all acrocentric chromosomes as well as chromosomes 9 and 20 (Fig. 5A). This gene appears to have undergone recent amplification in the great apes (57), and approximate locations of FRG1 paralogs were previously identified by FISH (58). However, only nine FRG1 paralogs are found in GRCh38, hampering sequence-based analysis.

Future of the human reference genome

The T2T-CHM13 assembly adds five full chromosome arms and more additional sequence than any genome reference release in the past 20 years (Fig. 1D). This 8% of the genome has not been overlooked because of a lack of importance but rather because of technological limitations. High-accuracy long-read sequencing has finally removed this technological barrier, enabling comprehensive studies of genomic variation across the entire human genome, which we expect to drive future discovery in human genomic health and disease. Such studies will necessarily require a complete and accurate human reference genome.

CHM13 lacks a Y chromosome, and homozygous Y-bearing CHMs are nonviable, so a different sample type will be required to complete this last remaining chromosome. However, given its haploid nature, it should be possible to assemble the Y chromosome from a male sample using the same methods described here and supplement the T2T-CHM13 reference assembly with a Y chromosome as needed.

Extending beyond the human reference genome, large-scale resequencing projects have revealed genomic variation across human populations. Our reanalyses of the 1KGP (25) and SGDP (42) datasets have already shown the advantages of T2T-CHM13, even for short-read analyses. However, these studies give only a glimpse of the extensive structural variation that lies within the most repetitive regions of the genome assembled here. Long-read resequencing studies are now needed to comprehensively survey polymorphic variation and reveal any phenotypic associations within these regions.

Although CHM13 represents a complete human haplotype, it does not capture the full diversity of human genetic variation. To address this bias, the Human Pangenome Reference Consortium (59) has joined with the T2T Consortium to build a collection of high-quality reference haplotypes from a diverse set of samples. Ideally, all genomes could be assembled at the quality achieved here, but automated T2T assembly of diploid genomes presents a difficult challenge that will require continued development. Until this goal is realized, and any human genome can be completely sequenced without error, the T2T-CHM13 assembly represents a more complete, representative, and accurate reference than GRCh38.

 

This paper was the focus of a Time article and the basis for Time naming the lead authors to its Time 100 list of the most influential people of 2022.

From TIME

The Human Genome Is Finally Fully Sequenced

Source: https://time.com/6163452/human-genome-fully-sequenced/

 

The first human genome was mapped in 2001 as part of the Human Genome Project, but researchers knew it was neither complete nor completely accurate. Now, scientists have produced the most completely sequenced human genome to date, filling in gaps and correcting mistakes in the previous version.

The sequence is the most complete reference genome for any mammal so far. The findings from six new papers describing the genome, which were published in Science, should lead to a deeper understanding of human evolution and potentially reveal new targets for addressing a host of diseases.

A more precise human genome

“The Human Genome Project relied on DNA obtained through blood draws; that was the technology at the time,” says Adam Phillippy, head of genome informatics at the National Institutes of Health’s National Human Genome Research Institute (NHGRI) and senior author of one of the new papers. “The techniques at the time introduced errors and gaps that have persisted all of these years. It’s nice now to fill in those gaps and correct those mistakes.”

“We always knew there were parts missing, but I don’t think any of us appreciated how extensive they were, or how interesting,” says Michael Schatz, professor of computer science and biology at Johns Hopkins University and another senior author of the same paper.

The work is the result of the Telomere to Telomere consortium, which is supported by NHGRI and involves genetic and computational biology experts from dozens of institutes around the world. The group focused on filling in the 8% of the human genome that remained a genetic black hole from the first draft sequence. Since then, geneticists have been trying to add those missing portions bit by bit. The latest group of studies identifies about an entire chromosome’s worth of new sequences, representing 200 million more base pairs (the letters making up the genome) and 1,956 new genes.

 

NOTE: In 2001 many scientists postulated that there were as many as 100,000 human coding genes; we now understand that there are about 20,000 to 25,000.  This does not, however, take into account the additional diversity that arises from alternative splicing, gene duplications, SNPs, and chromosomal rearrangements.

Scientists were also able to sequence the long stretches of DNA that contained repeated sequences, which genetic experts originally thought were similar to copying errors and dismissed as so-called “junk DNA”. These repeated sequences, however, may play roles in certain human diseases. “Just because a sequence is repetitive doesn’t mean it’s junk,” says Eichler. He points out that critical genes are embedded in these repeated regions—genes that contribute to machinery that creates proteins, genes that dictate how cells divide and split their DNA evenly into their two daughter cells, and human-specific genes that might distinguish the human species from our closest evolutionary relatives, the primates. In one of the papers, for example, researchers found that primates have different numbers of copies of these repeated regions than humans, and that they appear in different parts of the genome.

“These are some of the most important functions that are essential to live, and for making us human,” says Eichler. “Clearly, if you get rid of these genes, you don’t live. That’s not junk to me.”

Deciphering what these repeated sections mean, if anything, and how the sequences of previously unsequenced regions like the centromeres will translate to new therapies or better understanding of human disease, is just starting, says Deanna Church, a vice president at Inscripta, a genome engineering company who wrote a commentary accompanying the scientific articles. Having the full sequence of a human genome is different from decoding it; she notes that currently, of people with suspected genetic disorders whose genomes are sequenced, about half can be traced to specific changes in their DNA. That means much of what the human genome does still remains a mystery.

The lead investigators of the Telomere-to-Telomere Consortium were named to the Time 100 list of the most influential people of 2022.

Michael Schatz, Karen Miga, Evan Eichler, and Adam Phillippy

Illustration by Brian Lutz for Time (Source Photos: Will Kirk—Johns Hopkins University; Nick Gonzales—UC Santa Cruz; Patrick Kehoe; National Human Genome Research Institute)

BY JENNIFER DOUDNA

MAY 23, 2022 6:08 AM EDT

Ever since the draft of the human genome became available in 2001, there has been a nagging question about the genome’s “dark matter”—the parts of the map that were missed the first time through, and what they contained. Now, thanks to Adam Phillippy, Karen Miga, Evan Eichler, Michael Schatz, and the entire Telomere-to-Telomere Consortium (T2T) of scientists that they led, we can see the full map of the human genomic landscape—and there’s much to explore.

In the scientific community, there wasn’t a consensus that mapping these missing parts was necessary. Some in the field felt there was already plenty to do using the data in hand. In addition, overcoming the technical challenges to getting the missing information wasn’t possible until recently. But the more we learn about the genome, the more we understand that every piece of the puzzle is meaningful.

I admire the T2T group’s willingness to grapple with the technical demands of this project and their persistence in expanding the genome map into uncharted territory. The complete human genome sequence is an invaluable resource that may provide new insights into the origin of diseases and how we can treat them. It also offers the most complete look yet at the genetic script underlying the very nature of who we are as human beings.

Doudna is a biochemist and winner of the 2020 Nobel Prize in Chemistry

Source: https://time.com/collection/100-most-influential-people-2022/6177818/evan-eichler-karen-miga-adam-phillippy-michael-schatz/

Other articles on the Human Genome Project and Junk DNA in this Open Access Scientific Journal Include:

 

International Award for Human Genome Project

 

Cracking the Genome – Inside the Race to Unlock Human DNA – quotes in newspapers

 

The Human Genome Project

 

Junk DNA and Breast Cancer

 

A Perspective on Personalized Medicine

Additional References

 

  1. P. Scalia, A. Giordano, C. Martini, S. J. Williams, Isoform- and Paralog-Switching in IR-Signaling: When Diabetes Opens the Gates to Cancer. Biomolecules 10, (Nov 30, 2020).

 

 


Emergence of a new SARS-CoV-2 variant from GR clade with a novel S glycoprotein mutation V1230L in West Bengal, India

Authors: Rakesh Sarkar, Ritubrita Saha, Pratik Mallick, Ranjana Sharma, Amandeep Kaur, Shanta Dutta, Mamta Chawla-Sarkar

Reporter and Original Article Co-Author: Amandeep Kaur, B.Sc. , M.Sc.

Abstract
Since its inception in late 2019, SARS-CoV-2 has evolved resulting in emergence of various variants in different countries. These variants have spread worldwide resulting in devastating second wave of COVID-19 pandemic in many countries including India since the beginning of 2021. To control this pandemic continuous mutational surveillance and genomic epidemiology of circulating strains is very important. In this study, we performed mutational analysis of the protein coding genes of SARS-CoV-2 strains (n=2000) collected during January 2021 to March 2021. Our data revealed the emergence of a new variant in West Bengal, India, which is characterized by the presence of 11 co-existing mutations including D614G, P681H and V1230L in S-glycoprotein. This new variant was identified in 70 out of 412 sequences submitted from West Bengal. Interestingly, among these 70 sequences, 16 sequences also harbored E484K in the S glycoprotein. Phylogenetic analysis revealed strains of this new variant emerged from GR clade (B.1.1) and formed a new cluster. We propose to name this variant as GRL or lineage B.1.1/S:V1230L due to the presence of V1230L in S glycoprotein along with GR clade specific mutations. Co-occurrence of P681H, previously observed in UK variant, and E484K, previously observed in South African variant and California variant, demonstrates the convergent evolution of SARS-CoV-2 mutation. V1230L, present within the transmembrane domain of S2 subunit of S glycoprotein, has not yet been reported from any country. Substitution of valine with more hydrophobic amino acid leucine at position 1230 of the transmembrane domain, having role in S protein binding to the viral envelope, could strengthen the interaction of S protein with the viral envelope and also increase the deposition of S protein to the viral envelope, and thus positively regulate virus infection. 
P681H and E484K mutations have already been shown to favor increased infectivity and immune evasion, respectively. Therefore, the new variant carrying D614G, P681H, V1230L and E484K is expected to have better infectivity, transmissibility and immune evasion characteristics, which may pose an additional threat along with B.1.617 in the ongoing COVID-19 pandemic in India.
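The substitution labels used in the abstract follow a simple convention: reference amino acid, residue position, substituted amino acid. The Python sketch below is an illustration only, not the authors' pipeline. It parses the four labels named in the abstract and recomputes the proportions from the reported counts. The names `Substitution` and `parse_substitution` are made up for this example, and the spike length of 1,273 residues is an outside assumption (the commonly cited length of the reference S protein), used only as a sanity bound.

```python
import re
from typing import NamedTuple

# Assumption, not from the abstract: length of the reference spike (S)
# protein in amino acids, used only to sanity-check positions.
SPIKE_LENGTH = 1273

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
_PATTERN = re.compile(rf"^([{AMINO_ACIDS}])(\d+)([{AMINO_ACIDS}])$")


class Substitution(NamedTuple):
    ref: str  # reference amino acid (one-letter code)
    pos: int  # 1-based residue position
    alt: str  # substituted amino acid (one-letter code)


def parse_substitution(label: str) -> Substitution:
    """Parse a label such as 'V1230L' into its three parts."""
    match = _PATTERN.match(label)
    if match is None:
        raise ValueError(f"not a valid substitution label: {label!r}")
    ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
    if ref == alt:
        raise ValueError(f"{label!r} does not change the residue")
    if not 1 <= pos <= SPIKE_LENGTH:
        raise ValueError(f"position {pos} is outside the S protein")
    return Substitution(ref, pos, alt)


# Labels named in the abstract.
spike_mutations = [parse_substitution(s) for s in ("D614G", "P681H", "V1230L", "E484K")]

# Counts reported in the abstract.
west_bengal_sequences = 412
new_variant_sequences = 70
with_e484k = 16

variant_share = new_variant_sequences / west_bengal_sequences
e484k_share = with_e484k / new_variant_sequences

for sub in spike_mutations:
    print(f"{sub.ref} -> {sub.alt} at residue {sub.pos}")
print(f"New variant: {variant_share:.1%} of West Bengal sequences")
print(f"E484K within the new variant: {e484k_share:.1%}")
```

A parser of this kind also catches malformed labels early: a label whose reference and substituted residues are identical is rejected, because it does not describe a substitution.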

Reference: Sarkar, R. et al. (2021) Emergence of a new SARS-CoV-2 variant from GR clade with a novel S glycoprotein mutation V1230L in West Bengal, India. medRxiv. https://doi.org/10.1101/2021.05.24.21257705

https://www.medrxiv.org/content/10.1101/2021.05.24.21257705v1

Other related articles were published in this Open Access Online Scientific Journal, including the following:

Fighting Chaos with Care, community trust, engagement must be cornerstones of pandemic response

Reporter: Amandeep Kaur

https://pharmaceuticalintelligence.com/2021/04/13/fighting-chaos-with-care/

T cells recognize recent SARS-CoV-2 variants

Reporter: Aviva Lev-Ari, PhD, RN

https://pharmaceuticalintelligence.com/2021/03/30/t-cells-recognize-recent-sars-cov-2-variants/

Need for Global Response to SARS-CoV-2 Viral Variants

Reporter: Aviva Lev-Ari, PhD, RN

https://pharmaceuticalintelligence.com/2021/02/12/need-for-global-response-to-sars-cov-2-viral-variants/

Identification of Novel genes in human that fight COVID-19 infection

Reporter: Amandeep Kaur, B.Sc., M.Sc.

https://pharmaceuticalintelligence.com/2021/04/19/identification-of-novel-genes-in-human-that-fight-covid-19-infection/

Mechanism of Thrombosis with AstraZeneca and J & J Vaccines: Expert Opinion by Kate Chander Chiang & Ajay Gupta, MD

Reporter & Curator: Dr. Ajay Gupta, MD

https://pharmaceuticalintelligence.com/2021/04/14/mechanism-of-thrombosis-with-astrazeneca-and-j-j-vaccines-expert-opinion-by-kate-chander-chiang-ajay-gupta-md/


Unexpected Genetic Vulnerability to Menthol Cigarette Use

Reporter: Irina Robu, PhD

According to a study published in PLOS Genetics, a group of international researchers supported by the U.S. Food and Drug Administration and the National Institutes of Health have found a variant of the MRGPRX4 gene in people of African descent that increases a smoker’s preference for cigarettes containing menthol. The FDA determined that

  • nearly 20 million people in the United States smoke menthol cigarettes, which are particularly popular among African-American smokers.
  • Research has shown that 86 percent of African-American smokers use menthol cigarettes, compared with less than 30 percent of smokers of European descent.

In this study, the researcher Andrew Griffith uncovered clues as to how menthol may reduce the irritation and harshness of smoking cigarettes. The results can help public health agencies develop strategies to lower the rates of harmful cigarette smoking among particularly vulnerable groups.

At the same time, researchers at the University of Texas Southwestern Medical Center, led by Dennis Drayna, conducted detailed genetic analyses of 13,000 adults using data from a multiethnic, population-based group of smokers from the Dallas Heart Study and from an African-American group of smokers from the Dallas Biobank.

The researchers report that

  • 5 to 8 percent of the African-American study participants had the gene variant.
  • None of the participants of European, Asian, or Native American descent had the variant.
  • Identifying the genetic variant pointed the researchers in an unanticipated direction, leading them to offer the first characterization of this naturally occurring MRGPRX4 variant in humans.
  • The gene codes for a sensory receptor that is believed to be involved in detecting and responding to irritants from the environment in the lungs and airways.

Drayna further stated that while the gene variant can’t explain all of the increased use of menthol cigarettes by African-Americans, the results show that this variant is a potentially important factor underlying the preference for menthol cigarettes in this population.

Source

https://www.nih.gov/news-events/news-releases/researchers-find-genetic-vulnerability-menthol-cigarette-use

 


Bioinformatics Tool Review: Genome Variant Analysis Tools, Volume 2 (Volume Two: Latest in Genomics Methodologies for Therapeutics: Gene Editing, NGS and BioInformatics, Simulations and the Genome Ontology), Part 1: Next Generation Sequencing (NGS)

Bioinformatics Tool Review: Genome Variant Analysis Tools

Curator: Stephen J. Williams, Ph.D.

Updated 02/07/2021

Updated 11/15/2018

The following post will be an ongoing curation of reviews of gene variant bioinformatic software.

The Ensembl Variant Effect Predictor.

McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F.

Genome Biol. 2016 Jun 6;17(1):122. doi: 10.1186/s13059-016-0974-4.

Author information

1. European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK. wm2@ebi.ac.uk.

2. European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.

3. European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK. fiona@ebi.ac.uk.

Abstract

The Ensembl Variant Effect Predictor is a powerful toolset for the analysis, annotation, and prioritization of genomic variants in coding and non-coding regions. It provides access to an extensive collection of genomic annotation, with a variety of interfaces to suit different requirements, and simple options for configuring and extending analysis. It is open source, free to use, and supports full reproducibility of results. The Ensembl Variant Effect Predictor can simplify and accelerate variant interpretation in a wide range of study designs.

Rare diseases can be difficult to diagnose because of the low incidence and incomplete penetrance of the implicated alleles; however, variant analysis of whole genome sequencing (WGS) data can identify the underlying genetic events responsible for the disease (Nature, 2015).  A large cohort is nevertheless required for many WGS association studies in order to provide enough statistical power for interpretation (see post and here).  To this end, major sequencing projects have been initiated worldwide.

A more thorough curation of sequencing projects can be seen in the following post:

Icelandic Population Genomic Study Results by deCODE Genetics come to Fruition: Curation of Current genomic studies

And although sequencing costs have fallen dramatically over the years, the cost of determining the functional consequences of variants remains high: thorough basic research studies must be conducted to validate the interpretation of variant data with respect to the underlying disease, as only a small fraction of the variants found in a genome sequencing project will affect a functional protein.  Correct annotation of sequences and variants, and identification of the correct corresponding reference genes or transcripts in GENCODE or RefSeq, pose considerable challenges to identifying sequenced variants as potential functional variants.

To this effect, the authors developed the Ensembl Variant Effect Predictor (VEP), which is a software suite that performs annotations and analysis of most types of genomic variation in coding and non-coding regions of the genome.

Summary of Features

  • Annotation: VEP can annotate two broad categories of genomic variants
    • Sequence variants with specific and defined changes: indels, base substitutions, SNVs, tandem repeats
    • Larger structural variants > 50 nucleotides
  • Species and assembly/genomic database support: VEP can analyze data from any species with an assembled genome sequence and an annotated gene set. VEP supports chromosome assemblies such as the latest GRCh38, sequences in FASTA format, transcripts from RefSeq, and user-derived sequences
  • Transcript Annotation: VEP includes a wide variety of gene and transcript related information including NCBI Gene ID, Gene Symbol, Transcript ID, NCBI RefSeq ID, exon/intron information, and cross reference to other databases such as UniProt
  • Protein Annotation: Protein-related fields include Protein ID, RefSeq ID, SwissProt, UniParc ID, reference codons and amino acids, SIFT pathogenicity score, protein domains
  • Noncoding Annotation: VEP reports variants in noncoding regions including genomic regulatory regions, intronic regions, and transcription factor binding motifs. Data from ENCODE, BLUEPRINT, and the NIH Roadmap Epigenomics project are used for primary annotation.  Plugins to the Perl code are also available to link other databases that annotate noncoding sequence features.
  • Frequency, phenotype, and citation annotation: VEP searches Ensembl databases containing a large amount of germline variant information and checks variants against the dbSNP single nucleotide polymorphism database. VEP integrates with mutational databases such as COSMIC and the Human Gene Mutation Database, and with structural and copy number variants from the Database of Genomic Variants.  Allele frequencies are reported from 1000 Genomes and NHLBI, and VEP integrates with PubMed for literature annotation.  Phenotype information comes from OMIM, Orphanet, and GWAS, and clinical information on variants from ClinVar.
  • Flexible Input and Output Formats: VEP supports the input data format called “variant call format” or VCF, a standard in next-gen sequencing. VEP can also process variant identifiers from other database formats.  Output is tab-delimited and gives the user choices in the presentation of results (HTML or text based)
  • Choice of user interface
    • Online tool (VEP Web): simple point and click; incorporates Instant VEP Functionality and copy and paste features. Results can be stored online in cloud storage on Ensembl.
    • VEP script: VEP is available as a downloadable Perl script (see below for link) and can process large amounts of data rapidly. This interface is highly flexible, with the ability to integrate multiple plugins available from Ensembl and GitHub.  The ability to alter the Perl code and add plugins and code functions makes it possible to modify any feature of VEP.
    • VEP REST API: provides robust programmatic access from any programming language and returns basic variant annotation. Can make use of external plugins.
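As a rough illustration of the REST interface listed above, the following Python sketch builds a request for the public Ensembl REST server's `vep/:species/hgvs/:hgvs_notation` endpoint and reads the `most_severe_consequence` field from the JSON response. The helper names `build_vep_url` and `fetch_consequences` and the example HGVS string are assumptions made for this sketch, not part of VEP itself, and the fields returned may differ between Ensembl releases.

```python
import json
import urllib.parse
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"  # public Ensembl REST server


def build_vep_url(hgvs, species="human"):
    """Build the URL for the GET vep/:species/hgvs/:hgvs_notation endpoint."""
    # Percent-encode characters such as '>' while keeping ':' readable.
    notation = urllib.parse.quote(hgvs, safe=":")
    return f"{ENSEMBL_REST}/vep/{species}/hgvs/{notation}?content-type=application/json"


def fetch_consequences(hgvs, species="human", timeout=30):
    """Query the endpoint and return the most severe consequence of each result."""
    with urllib.request.urlopen(build_vep_url(hgvs, species), timeout=timeout) as resp:
        records = json.load(resp)  # the endpoint returns a JSON list of annotations
    return [rec.get("most_severe_consequence") for rec in records]


# Example call (hypothetical helper from this sketch; requires network access):
# print(fetch_consequences("ENST00000366667:c.803C>T"))
```

For large variant sets the downloadable script is likely the better fit, since the REST service returns only basic annotation.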

 

Watch Video on VEP Instructional Webinar: https://youtu.be/7Fs7MHfXjWk

Watch Video on VEP Web Version training on How to Analyze Your Sequence in VEP

Availability of data and materials

The dataset supporting the conclusions of this article is available from Illumina’s Platinum Genomes [93] and using the Ensembl release 75 gene set. Pre-built data sets are available for all Ensembl and Ensembl Genomes species [94]. They can also be downloaded automatically during set up whilst installing the VEP.

References

Deciphering Developmental Disorders Study. Large-scale discovery of novel genetic causes of developmental disorders. Nature. 2015 Mar 12;519(7542):223-8. doi: 10.1038/nature14135. PMID: 25533962

Updated 11/15/2018

Research Points to Caution in Use of Variant Effect Prediction Bioinformatic Tools

Although we have the ability to use high throughput sequencing to identify allelic variants occurring in rare disease, correlation of these variants with the underlying disease is often difficult due to a few concerns:

  • For rare sporadic diseases, classical gene/variant association studies have proven difficult to perform (Meyts et al. 2016)
  • Because Whole Exome Sequencing (WES) returns a considerable number of variants, it is hard to differentiate the normal allelic variation found in the human population from disease-causing pathogenic alleles
  • For rare diseases, pathogenic allele frequencies are generally low

Therefore, for these rare pathogenic alleles, the use of bioinformatics tools in order to predict the resulting changes in gene function may provide insight into disease etiology when validation of these allelic changes might be experimentally difficult.

In a 2017 Genes & Immunity paper, Line Lykke Andersen and Rune Hartmann tested the reliability of various bioinformatic tools in predicting the functional consequences of variants of six different genes involved in interferon induction and of sixteen allelic variants of the IFNLR1 gene.  These variants were found in cohorts of patients presenting with herpes simplex encephalitis (HSE). Most of the adult population is seropositive for Herpes Simplex Virus (HSV); however, only a minor fraction (1 in 250,000 individuals per year) of HSV-infected individuals will develop HSE (Hjalmarsson et al., 2007).  It has been suggested that HSE occurs in individuals with rare primary immunodeficiencies caused by gene defects affecting innate immunity through reduced production of interferons (IFN) (Zhang et al., Lim et al.).

References

Meyts I, Bosch B, Bolze A, Boisson B, Itan Y, Belkadi A, et al. Exome and genome sequencing for inborn errors of immunity. J Allergy Clin Immunol. 2016;138:957–69.

Hjalmarsson A, Blomqvist P, Skoldenberg B. Herpes simplex encephalitis in Sweden, 1990-2001: incidence, morbidity, and mortality. Clin Infect Dis. 2007;45:875–80.

Zhang SY, Jouanguy E, Ugolini S, Smahi A, Elain G, Romero P, et al. TLR3 deficiency in patients with herpes simplex encephalitis. Science. 2007;317:1522–7.

Lim HK, Seppanen M, Hautala T, Ciancanelli MJ, Itan Y, Lafaille FG, et al. TLR3 deficiency in herpes simplex encephalitis: high allelic heterogeneity and recurrence risk. Neurology. 2014;83:1888–97.

Genes Immun. 2017 Dec 4. doi: 10.1038/s41435-017-0002-z.

Frequently used bioinformatics tools overestimate the damaging effect of allelic variants.

Andersen LL, Terczyńska-Dyla E, Mørk N, Scavenius C, Enghild JJ, Höning K, Hornung V, Christiansen M, Mogensen TH, Hartmann R.

Abstract

We selected two sets of naturally occurring human missense allelic variants within innate immune genes. The first set represented eleven non-synonymous variants in six different genes involved in interferon (IFN) induction, present in a cohort of patients suffering from herpes simplex encephalitis (HSE) and the second set represented sixteen allelic variants of the IFNLR1 gene. We recreated the variants in vitro and tested their effect on protein function in a HEK293T cell based assay. We then used an array of 14 available bioinformatics tools to predict the effect of these variants upon protein function. To our surprise two of the most commonly used tools, CADD and SIFT, produced a high rate of false positives, whereas SNPs&GO exhibited the lowest rate of false positives in our test. As the problem in our test in general was false positive variants, inclusion of mutation significance cutoff (MSC) did not improve accuracy.

Methodology

  1. Identification of rare variants
  2. The genomes of nineteen Dutch patients with a history of HSE were sequenced by WES. Novel HSE-causing variants were identified by filtering for single nucleotide polymorphisms (SNPs) that had a frequency below 1% in the NHLBI Exome Sequencing Project Exome Variant Server and the 1000 Genomes Project and were present within 204 genes involved in the immune response to HSV.
  3. Identified variants (204) manually evaluated for involvement of IFN induction based on IDBase and KEGG pathway database analysis.
  4. In-silico predictions: Variants were classified by the in silico variant pathogenicity prediction programs SIFT, Mutation Assessor, FATHMM, PROVEAN, SNAP2, PolyPhen2, PhD-SNP, SNPs&GO, FATHMM-MKL, MutationTaster2, PredictSNP, Condel, MetaSNP, and CADD. Each program returned prediction scores measuring the likelihood of a variant being either ‘deleterious’ or ‘neutral’. Prediction accuracy was measured as

ACC = (true positive+true negative)/(true positive+true negative+false positive+false negative)
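The accuracy formula above can be written as a small Python helper. This is a minimal sketch; the function name `accuracy` is chosen here for illustration, and the example counts are those listed for SIFT on the HSE variants in Table 2 of the paper (1 true positive, 4 true negatives, 4 false positives, 0 false negatives).

```python
def accuracy(tp, tn, fp, fn):
    """ACC = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total


# SIFT on the nine HSE variants: 5 of 9 predictions agree with the functional assay.
print(round(accuracy(tp=1, tn=4, fp=4, fn=0), 2))  # prints 0.56
```

Because accuracy counts true negatives and true positives together, a tool can look moderately accurate while still mislabeling many neutral variants as deleterious, which is the problem the authors highlight.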

  1. Validation of prediction software/tools

In order to validate the predictive value of the software, HEK293T cells, deficient in IRF3, MAVS, and IKKε/TBK1, were cotransfected with the nine variants of the aforementioned genes and a luciferase reporter under control of the IFN-β promoter, and luciferase activity was measured as an indicator of IFN signaling function.  Western blot was performed to confirm the expression of the constructs.

Results

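Table 2. Summary of the bioinformatic predictions

| Tool | HSE TN | HSE TP | HSE FN | HSE FP | HSE Total | HSE ACC | IFNLR1 TN | IFNLR1 TP | IFNLR1 FN | IFNLR1 FP | IFNLR1 Total | IFNLR1 ACC | Overall ACC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Uniform cutoff | | | | | | | | | | | | | |
| SIFT | 4 | 1 | 0 | 4 | 9 | 0.56 | 8 | 1 | 0 | 7 | 16 | 0.56 | 0.56 |
| Mutation assessor | 6 | 1 | 0 | 2 | 9 | 0.78 | 9 | 1 | 0 | 6 | 16 | 0.63 | 0.68 |
| FATHMM | 7 | 1 | 0 | 1 | 9 | 0.89 | | | | | | | 0.89 |
| PROVEAN | 8 | 1 | 0 | 0 | 9 | 1.00 | 11 | 1 | 0 | 4 | 16 | 0.75 | 0.84 |
| SNAP2 | 5 | 1 | 0 | 3 | 9 | 0.67 | 8 | 0 | 1 | 7 | 16 | 0.50 | 0.56 |
| PolyPhen2 | 6 | 1 | 0 | 2 | 9 | 0.78 | 12 | 1 | 0 | 3 | 16 | 0.81 | 0.80 |
| PhD-SNP | 7 | 1 | 0 | 1 | 9 | 0.89 | 11 | 1 | 0 | 4 | 16 | 0.75 | 0.80 |
| SNPs&GO | 8 | 1 | 0 | 0 | 9 | 1.00 | 14 | 1 | 0 | 1 | 16 | 0.94 | 0.96 |
| FATHMM MKL | 4 | 1 | 0 | 4 | 9 | 0.56 | 13 | 0 | 1 | 2 | 16 | 0.81 | 0.72 |
| MutationTaster2 | 4 | 0 | 1 | 4 | 9 | 0.44 | 14 | 0 | 1 | 1 | 16 | 0.88 | 0.72 |
| PredictSNP | 6 | 1 | 0 | 2 | 9 | 0.78 | 11 | 1 | 0 | 4 | 16 | 0.75 | 0.76 |
| Condel | 6 | 1 | 0 | 2 | 9 | 0.78 | | | | | | | 0.78 |
| Meta-SNP | 8 | 1 | 0 | 0 | 9 | 1.00 | 11 | 1 | 0 | 4 | 16 | 0.75 | 0.84 |
| CADD | 2 | 1 | 0 | 6 | 9 | 0.33 | 8 | 0 | 1 | 7 | 16 | 0.50 | 0.44 |
| MSC 95% cutoff | | | | | | | | | | | | | |
| SIFT | 5 | 1 | 0 | 3 | 9 | 0.67 | 8 | 1 | 0 | 8 | 16 | 0.50 | 0.56 |
| PolyPhen2 | 6 | 1 | 0 | 2 | 9 | 0.78 | 13 | 1 | 0 | 3 | 16 | 0.81 | 0.80 |
| CADD | 4 | 1 | 0 | 4 | 9 | 0.56 | 7 | 0 | 1 | 9 | 16 | 0.44 | 0.48 |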

Note: TN: true negative, TP: true positive, FN: false negative, FP: false positive, ACC: accuracy

Functional testing (data obtained from reporter construct experiments) was considered the correct outcome.

Three prediction tools (PROVEAN, SNPs&GO, and MetaSNP) correctly predicted the effect of all nine variants tested.
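To make the false-positive problem concrete, the sketch below pools the uniform-cutoff counts for three of the tools in Table 2 across both variant sets and computes the overall accuracy alongside a false positive rate, FP / (FP + TN). The false positive rate is not a column in the published table; it is derived here only for illustration, and the names `COUNTS`, `pooled`, `overall_accuracy`, and `false_positive_rate` are assumptions of this sketch.

```python
# (TN, TP, FN, FP) per variant set, copied from the uniform-cutoff rows of Table 2
COUNTS = {
    "SIFT":    {"HSE": (4, 1, 0, 4), "IFNLR1": (8, 1, 0, 7)},
    "SNPs&GO": {"HSE": (8, 1, 0, 0), "IFNLR1": (14, 1, 0, 1)},
    "CADD":    {"HSE": (2, 1, 0, 6), "IFNLR1": (8, 0, 1, 7)},
}


def pooled(tool):
    """Sum TN, TP, FN and FP over both variant sets."""
    return tuple(sum(column) for column in zip(*COUNTS[tool].values()))


def overall_accuracy(tool):
    tn, tp, fn, fp = pooled(tool)
    return (tp + tn) / (tn + tp + fn + fp)


def false_positive_rate(tool):
    # Share of functionally neutral variants that the tool calls deleterious.
    tn, tp, fn, fp = pooled(tool)
    return fp / (fp + tn)


for tool in COUNTS:
    print(f"{tool}: ACC={overall_accuracy(tool):.2f}, FPR={false_positive_rate(tool):.2f}")
```

The pooled accuracies reproduce the "Overall ACC" column for these three tools (0.56, 0.96, and 0.44), while the derived false positive rate is highest for CADD and lowest for SNPs&GO.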

Updated 02/07/2021

InMeRF: prediction of pathogenicity of missense variants by individual modeling for each amino acid substitution
Jun-Ichi Takeda, Kentaro Nanatsue, Ryosuke Yamagishi, Mikako Ito, Nobuhiko Haga, Hiromi Hirata, Tomoo Ogi, Kinji Ohno in NAR Genomics and Bioinformatics. 2020 May 26;2(2):lqaa038. doi: 10.1093/nargab/lqaa038. eCollection 2020 Jun.

Abstract

In predicting the pathogenicity of a nonsynonymous single-nucleotide variant (nsSNV), a radical change in amino acid properties is prone to be classified as being pathogenic. However, not all such nsSNVs are associated with human diseases. We generated random forest (RF) models individually for each amino acid substitution to differentiate pathogenic nsSNVs in the Human Gene Mutation Database and common nsSNVs in dbSNP. We named a set of our models ‘Individual Meta RF’ (InMeRF). Ten-fold cross-validation of InMeRF showed that the areas under the curves (AUCs) of receiver operating characteristic (ROC) and precision-recall curves were on average 0.941 and 0.957, respectively. To compare InMeRF with seven other tools, the eight tools were generated using the same training dataset, and were compared using the same three testing datasets. ROC-AUCs of InMeRF were ranked first in the eight tools. We applied InMeRF to 155 pathogenic and 125 common nsSNVs in seven major genes causing congenital myasthenic syndromes, as well as in VANGL1 causing spina bifida, and found that the sensitivity and specificity of InMeRF were 0.942 and 0.848, respectively. We made the InMeRF web service, and also made genome-wide InMeRF scores available online (https://www.med.nagoya-u.ac.jp/neurogenetics/InMeRF/).

Source: https://pubmed.ncbi.nlm.nih.gov/33543123/
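The sensitivity and specificity quoted in the InMeRF abstract follow the usual definitions, sketched below in Python. The abstract reports only the rates and the set sizes (155 pathogenic and 125 common nsSNVs); the individual counts used here are back-calculated, illustrative values that happen to be consistent with those rates, and they are not taken from the paper.

```python
def sensitivity(tp, fn):
    """Fraction of pathogenic variants that are called pathogenic."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of common (benign) variants that are called benign."""
    return tn / (tn + fp)


# Illustrative counts only (NOT reported in the abstract):
# 146 of 155 pathogenic and 106 of 125 common nsSNVs classified correctly.
print(round(sensitivity(tp=146, fn=9), 3))   # prints 0.942
print(round(specificity(tn=106, fp=19), 3))  # prints 0.848
```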

ADDRESS: A database of disease-associated human variants incorporating protein structure and folding stabilities
Jaie Woodard, Chengxin Zhang, Yang Zhang in J Mol Biol. 2021 Feb 1;166840. doi: 10.1016/j.jmb.2021.166840.

Abstract

Numerous human diseases are caused by mutations in genomic sequences. Since amino acid changes affect protein function through mechanisms often predictable from protein structure, the integration of structural and sequence data enables us to estimate with greater accuracy whether and how a given mutation will lead to disease. Publicly available annotated databases enable hypothesis assessment and benchmarking of prediction tools. However, the results are often presented as summary statistics or black box predictors, without providing full descriptive information. We developed a new semi-manually curated human variant database presenting information on the protein contact-map, sequence-to-structure mapping, amino acid identity change, and stability prediction for the popular UniProt database. We found that the profiles of pathogenic and benign missense polymorphisms can be effectively deduced using decision trees and comparative analyses based on the presented dataset. The database is made publicly available through https://zhanglab.ccmb.med.umich.edu/ADDRESS.

Source: https://pubmed.ncbi.nlm.nih.gov/33539887/

PopDel identifies medium-size deletions simultaneously in tens of thousands of genomes

Abstract

Thousands of genomic structural variants (SVs) segregate in the human population and can impact phenotypic traits and diseases. Their identification in whole-genome sequence data of large cohorts is a major computational challenge. Most current approaches identify SVs in single genomes and afterwards merge the identified variants into a joint call set across many genomes. We describe the approach PopDel, which directly identifies deletions of about 500 to at least 10,000 bp in length in data of many genomes jointly, eliminating the need for subsequent variant merging. PopDel scales to tens of thousands of genomes as we demonstrate in evaluations on up to 49,962 genomes. We show that PopDel reliably reports common, rare and de novo deletions. On genomes with available high-confidence reference call sets PopDel shows excellent recall and precision. Genotype inheritance patterns in up to 6794 trios indicate that genotypes predicted by PopDel are more reliable than those of previous SV callers. Furthermore, PopDel’s running time is competitive with the fastest tested previous tools. The demonstrated scalability and accuracy of PopDel enables routine scans for deletions in large-scale sequencing studies.

Source: https://pubmed.ncbi.nlm.nih.gov/33526789/

Other articles related to Genomics and Bioinformatics on this online Open Access Journal Include:

Finding the Genetic Links in Common Disease: Caveats of Whole Genome Sequencing Studies

Large-scale sequencing does not support the idea that lower-frequency variants have a major role in predisposition to type 2 diabetes

US Personalized Cancer Genome Sequencing Market Outlook 2018 –

Icelandic Population Genomic Study Results by deCODE Genetics come to Fruition: Curation of Current genomic studies


Free Bio-IT World Webinar: Machine Learning to Detect Cancer Variants

Reporter: Stephen J. Williams, PhD

 

     


SomaticSeq: An Ensemble Approach with Machine Learning to Detect Cancer Variants

June 16 at 1pm EDT

Accurate detection of somatic mutations has proven to be challenging in cancer NGS analysis, due to tumor heterogeneity and cross-contamination between tumor and matched normal samples. Oftentimes, a somatic caller that performs well for one tumor may not for another.

In this webinar we will introduce SomaticSeq, a tool within the Bina Genomic Management Solution (Bina GMS) designed to boost the accuracy of somatic mutation detection with a machine learning approach. You will learn:

  • Benchmarking of leading somatic callers, namely MuTect, SomaticSniper, VarScan2, JointSNVMix2, and VarDict
  • Integration of such tools and how accuracy is achieved using a machine learning classifier that incorporates over 70 features with SomaticSeq
  • Accuracy validation including results from the ICGC-TCGA DREAM Somatic Mutation Calling Challenge, in which Bina placed 1st in indel calling and 2nd in SNV calling in stage 5
  • Creation of a new SomaticSeq classifier utilizing your own dataset
  • Review of the somatic workflow within the Bina Genomic Management Solution

Speakers:

Li Tai Fang
Sr. Bioinformatics Scientist
Bina Technologies, Part of Roche Sequencing

Anoop Grewal
Product Marketing Manager
Bina Technologies, Part of Roche Sequencing


Cost: No cost!

Schedule conflict? Register now and you’ll receive a copy of the recording.

This webinar is compliments of: 

Bio-ITWorld.com/Bio-IT-Webinars


Icelandic Population Genomic Study Results by deCODE Genetics come to Fruition: Curation of Current genomic studies

Reporter/Curator: Stephen J. Williams, Ph.D.

 

UPDATED on 9/6/2017

On 9/6/2017, Aviva Lev-Ari, PhD, RN attended a talk by Paul Nioi, PhD, Amgen, at HMS, Harvard BioTechnology Club (GSAS).

Nioi discussed his 2016 paper in NEJM (2016; 374:2131-2141):

Variant ASGR1 Associated with a Reduced Risk of Coronary Artery Disease

Paul Nioi, Ph.D., Asgeir Sigurdsson, B.Sc., Gudmar Thorleifsson, Ph.D., Hannes Helgason, Ph.D., Arna B. Agustsdottir, B.Sc., Gudmundur L. Norddahl, Ph.D., Anna Helgadottir, M.D., Audur Magnusdottir, Ph.D., Aslaug Jonasdottir, M.Sc., Solveig Gretarsdottir, Ph.D., Ingileif Jonsdottir, Ph.D., Valgerdur Steinthorsdottir, Ph.D., Thorunn Rafnar, Ph.D., Dorine W. Swinkels, M.D., Ph.D., Tessel E. Galesloot, Ph.D., Niels Grarup, Ph.D., Torben Jørgensen, D.M.Sc., Henrik Vestergaard, D.M.Sc., Torben Hansen, Ph.D., Torsten Lauritzen, D.M.Sc., Allan Linneberg, Ph.D., Nele Friedrich, Ph.D., Nikolaj T. Krarup, Ph.D., Mogens Fenger, Ph.D., Ulrik Abildgaard, D.M.Sc., Peter R. Hansen, D.M.Sc., Anders M. Galløe, Ph.D., Peter S. Braund, Ph.D., Christopher P. Nelson, Ph.D., Alistair S. Hall, F.R.C.P., Michael J.A. Williams, M.D., Andre M. van Rij, M.D., Gregory T. Jones, Ph.D., Riyaz S. Patel, M.D., Allan I. Levey, M.D., Ph.D., Salim Hayek, M.D., Svati H. Shah, M.D., Muredach Reilly, M.B., B.Ch., Gudmundur I. Eyjolfsson, M.D., Olof Sigurdardottir, M.D., Ph.D., Isleifur Olafsson, M.D., Ph.D., Lambertus A. Kiemeney, Ph.D., Arshed A. Quyyumi, F.R.C.P., Daniel J. Rader, M.D., William E. Kraus, M.D., Nilesh J. Samani, F.R.C.P., Oluf Pedersen, D.M.Sc., Gudmundur Thorgeirsson, M.D., Ph.D., Gisli Masson, Ph.D., Hilma Holm, M.D., Daniel Gudbjartsson, Ph.D., Patrick Sulem, M.D., Unnur Thorsteinsdottir, Ph.D., and Kari Stefansson, M.D., Ph.D.

N Engl J Med 2016; 374:2131-2141. June 2, 2016. DOI: 10.1056/NEJMoa1508419

Abstract

BACKGROUND

Several sequence variants are known to have effects on serum levels of non–high-density lipoprotein (HDL) cholesterol that alter the risk of coronary artery disease.

METHODS

We sequenced the genomes of 2636 Icelanders and found variants that we then imputed into the genomes of approximately 398,000 Icelanders. We tested for association between these imputed variants and non-HDL cholesterol levels in 119,146 samples. We then performed replication testing in two populations of European descent. We assessed the effects of an implicated loss-of-function variant on the risk of coronary artery disease in 42,524 case patients and 249,414 controls from five European ancestry populations. An augmented set of genomes was screened for additional loss-of-function variants in a target gene. We evaluated the effect of an implicated variant on protein stability.

RESULTS

We found a rare noncoding 12-base-pair (bp) deletion (del12) in intron 4 of ASGR1, which encodes a subunit of the asialoglycoprotein receptor, a lectin that plays a role in the homeostasis of circulating glycoproteins. The del12 mutation activates a cryptic splice site, leading to a frameshift mutation and a premature stop codon that renders a truncated protein prone to degradation. Heterozygous carriers of the mutation (1 in 120 persons in our study population) had a lower level of non-HDL cholesterol than noncarriers, a difference of 15.3 mg per deciliter (0.40 mmol per liter) (P=1.0×10^−16), and a lower risk of coronary artery disease (by 34%; 95% confidence interval, 21 to 45; P=4.0×10^−6). In a larger set of sequenced samples from Icelanders, we found another loss-of-function ASGR1 variant (p.W158X, carried by 1 in 1850 persons) that was also associated with lower levels of non-HDL cholesterol (P=1.8×10^−3).

CONCLUSIONS

ASGR1 haploinsufficiency was associated with reduced levels of non-HDL cholesterol and a reduced risk of coronary artery disease. (Funded by the National Institutes of Health and others.)

 

Amgen’s deCODE Genetics Publishes Largest Human Genome Population Study to Date

Mark Terry, BioSpace.com Breaking News Staff reported on results of one of the largest genome sequencing efforts to date, sequencing of the genomes of 2,636 people from Iceland by deCODE genetics, Inc., a division of Thousand Oaks, Calif.-based Amgen (AMGN).

Amgen had bought deCODE genetics Inc. in 2012, saving the company from bankruptcy.

There were a total of four studies, published online on March 25, 2015 in Nature Genetics, titled “Large-scale whole-genome sequencing of the Icelandic population[1],” “Identification of a large set of rare complete human knockouts[2],” “The Y-chromosome point mutation rate in humans[3]” and “Loss-of-function variants in ABCA7 confer risk of Alzheimer’s disease[4].”

The project identified some new genetic variants that increase the risk of Alzheimer’s disease and confirmed some variants known to increase the risk of diabetes and atrial fibrillation. A more in-depth post will curate these findings, but there was an interesting, discrete geographic distribution of certain rare variants around Iceland. The dataset offers a treasure trove of meaningful genetic information about the Icelandic population, as well as numerous new targets for breast and ovarian cancer and Alzheimer’s disease.

View Mark Terry’s article here on Biospace.com.

“This work is a demonstration of the unique power sequencing gives us for learning more about the history of our species,” said Kari Stefansson, founder and chief executive officer of deCODE and one of the lead authors, in a statement, “and for contributing to new means of diagnosing, treating and preventing disease.”

The scale and ambition of the study are impressive but, perhaps more importantly, the research identified a new genetic variant that increases the risk of Alzheimer’s disease and had already identified an APP variant that is associated with decreased risk of Alzheimer’s disease. It also confirmed variants that increase the risk of diabetes and a variant that results in atrial fibrillation.

The database of human genetic variation (dbSNP) contained over 50 million unique sequence variants, yet this database represents only a small proportion of the single nucleotide variants that are thought to exist. These “private” or rare variants undoubtedly contribute to important phenotypes, such as disease susceptibility. Non-SNV variants, like indels and structural variants, are also under-represented in public databases. The only way to fully elucidate the genetic basis of a trait is to consider all of these types of variants, and the only way to find them is by large-scale sequencing.

Curation of Population Genomic Sequencing Programs/Corporate Partnerships

Curation of genomic studies

| Study | Partners | Population | Enrolled | Disease areas | Analysis |
| --- | --- | --- | --- | --- | --- |
| Icelandic Genome Project | deCODE/Amgen | Icelandic | 2,636 | Variants related to: Alzheimer’s, cardiovascular, diabetes | WES + EMR; blood samples |
| Genome Sequencing Study | Geisinger Health System/Regeneron | Northeast PA, USA | 100,000 | Variants related to hypercholesterolemia, autism, obesity, other diseases | WES + EMR + MyCode; blood samples |
| The 100,000 Genomes Project | National Health Service/NHS Genome Centers/10 companies forming Gene Consortium including Abbvie, Alexion, AstraZeneca, Biogen, Dimension, GSK, Helomics, Roche, Takeda, UCB | Rare disorders population, UK | Starting to recruit 100,000 | Initially rare diseases, cancer, infectious diseases | WES of blood, saliva and tissue samples (Ref paper) |
| Saudi Human Genome Program | 7 centers across Saudi Arabia in conjunction with King Abdulaziz City Science & Tech., King Faisal Hospital & Research Centre/Life Technologies | General population, Saudi Arabia | 20,000 genomes over three years | First focus on rare severe early onset diseases: diabetes, deafness, cardiovascular, skeletal deformation | Whole genome sequence of blood samples + EMR |
| Genome of the Netherlands (GoNL) Consortium | Consortium of the UMCG, LUMC, Erasmus MC, VU University and UMCU. Samples were contributed by LifeLines, The Leiden Longevity Study, The Netherlands Twin Registry (NTR), The Rotterdam studies, and The Genetic Research in Isolated Populations program. All the sequencing work is done by BGI Hong Kong. | Families in Netherlands | 769 | Variants, SNV, indels, deletions from apparently healthy individuals, family trios | Whole genome NGS of whole blood, no EMR (Ref paper in Nat. Genetics; Ref paper describing project) |
| Faroese FarGen project | Privately funded | Faroe Islands, Faroese population | 50,000 | Small population allows for family analysis | Combine NGS with EMR and genealogy reports |
| Personal Genome Project Canada | $4000.00 fee from participants; collaboration with University of Toronto and SickKids Organization; technical assistance with Harvard | Canadian Health System | Goal: 100,000 | ? just started, no defined analysis goals yet | Whole exome and medical records |
| Singapore Sequencing Malay Project (SSMP) | Singapore Genome Variation Project; Singapore Pharmacogenomics Project | Malaysian | 100 healthy Malays from Singapore Pop. Health Study | Variant analysis | Deep whole genome sequencing |
| GenomeDenmark | Four Danish universities (KU, AU, DTU and AAU), two hospitals (Herlev and Vendsyssel) and two private firms (Bavarian Nordic and BGI-Europe) | | 150 complete genomes; first 30 published in Nature Comm. | ? | See link |
| Neuromics Consortium | University of Tübingen and 18 academic and industrial partners (see link for description) | European and Australian | 1,100 patients with neurodegenerative and neuromuscular disease | Moved from SNP to whole exome analysis | Whole Exome, RNASeq |

References

  1. Gudbjartsson DF, Helgason H, Gudjonsson SA, Zink F, Oddson A, Gylfason A, Besenbacher S, Magnusson G, Halldorsson BV, Hjartarson E et al: Large-scale whole-genome sequencing of the Icelandic population. Nature genetics 2015, advance online publication.
  2. Sulem P, Helgason H, Oddson A, Stefansson H, Gudjonsson SA, Zink F, Hjartarson E, Sigurdsson GT, Jonasdottir A, Jonasdottir A et al: Identification of a large set of rare complete human knockouts. Nature genetics 2015, advance online publication.
  3. Helgason A, Einarsson AW, Gudmundsdottir VB, Sigurdsson A, Gunnarsdottir ED, Jagadeesan A, Ebenesersdottir SS, Kong A, Stefansson K: The Y-chromosome point mutation rate in humans. Nature genetics 2015, advance online publication.
  4. Steinberg S, Stefansson H, Jonsson T, Johannsdottir H, Ingason A, Helgason H, Sulem P, Magnusson OT, Gudjonsson SA, Unnsteinsdottir U et al: Loss-of-function variants in ABCA7 confer risk of Alzheimer’s disease. Nature genetics 2015, advance online publication.

Other posts related to deCODE, population genomics, and NGS on this site include:

Illumina Says 228,000 Human Genomes Will Be Sequenced in 2014

CRACKING THE CODE OF HUMAN LIFE: The Birth of BioInformatics & Computational Genomics

CRACKING THE CODE OF HUMAN LIFE: The Birth of BioInformatics and Computational Genomics – Part IIB

Human genome: UK to become world number 1 in DNA testing

Synthetic Biology: On Advanced Genome Interpretation for Gene Variants and Pathways: What is the Genetic Base of Atherosclerosis and Loss of Arterial Elasticity with Aging

Genomic Promise for Neurodegenerative Diseases, Dementias, Autism Spectrum, Schizophrenia, and Serious Depression

Sequencing the exomes of 1,100 patients with neurodegenerative and neuromuscular diseases: A consortium of 18 European and Australian institutions

University of California Santa Cruz’s Genomics Institute will create a Map of Human Genetic Variations

Three Ancestral Populations Contributed to Modern-day Europeans: Ancient Genome Analysis

Impact of evolutionary selection on functional regions: The imprint of evolutionary selection on ENCODE regulatory elements is manifested between species and within human populations


Finding the Genetic Links in Common Disease:  Caveats of Whole Genome Sequencing Studies

Writer and Reporter: Stephen J. Williams, Ph.D.

In the November 23, 2012 issue of Science, Jocelyn Kaiser reports (Genetic Influences On Disease Remain Hidden in News and Analysis)[1] on the difficulties that many genomic studies are encountering in correlating genetic variants with high risk of type 2 diabetes and heart disease.  At the recent 2012 annual meeting of the American Society of Human Genetics, several DNA sequencing studies reported difficulties in finding genetic variants linked to high risk of type 2 diabetes and heart disease.  These studies were part of an international effort to determine the multiple genetic events contributing to complex, common diseases like diabetes.  Unlike Mendelian inherited diseases (like ataxia telangiectasia), which are characterized by defects mainly in one gene, finding genetic links to more complex diseases may pose a problem, as outlined in the article:

  • Variants may be so rare that a massive number of patients’ genomes would need to be analyzed
  • For most diseases, individual SNPs (single nucleotide polymorphisms) raise risk only modestly
  • It is hard to find isolated families (hemophilia) or isolated populations (Ashkenazi Jewish)
  • Disease-influencing genes have not been weeded out by natural selection after the human population explosion (~5000 years ago) resulted in numerous gene variants
  • It is unclear what percentage of disease heritability is accounted for by variants (studies have shown this is as low as 26% for diabetes, with the remaining risk determined by environment)

Although many genome-wide association studies have found SNPs associated with an increased risk of diseases such as cancer, diabetes, and heart disease, most individual SNPs for common diseases raise risk by only about 20-40% and would be useless for predicting an individual’s chance of developing disease or of being a candidate for a personalized therapy approach.  Therefore, for common diseases, investigators are relying on direct exome sequencing and whole-genome sequencing to detect these medium-rare risk variants, rather than relying on genome-wide association studies (which are usually fine for detecting the higher frequency variants associated with common diseases).
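A quick calculation shows why a 20-40% relative increase is of little use for individual prediction. The sketch below assumes a hypothetical baseline risk of 10%, a figure chosen only for illustration and not taken from the article; the function name `risk_with_variant` is likewise an assumption of this sketch.

```python
def risk_with_variant(baseline_risk, relative_increase):
    """Absolute risk for a carrier, given a baseline risk and a relative increase (0.2 = 20%)."""
    return baseline_risk * (1.0 + relative_increase)


BASELINE = 0.10  # hypothetical 10% baseline risk, chosen only for illustration
for increase in (0.20, 0.40):
    carrier = risk_with_variant(BASELINE, increase)
    print(f"+{increase:.0%} relative risk: {BASELINE:.0%} -> {carrier:.0%}")
```

Under this assumption a carrier's absolute risk moves from 10% to 12-14%, so the large majority of carriers would still not develop the disease.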

Three of the many projects (one for heart risk and two for diabetes risk) are highlighted in the article:

1.  National Heart, Lung and Blood Institute Exome Sequencing Project (ESP)[2]: heart, lung, blood

  • Sequenced 6,700 exomes of European or African descent
  • Majority of variants linked to disease too rare (as low as one variant)
  • Groups of variants in the same gene confirmed link between APOC3 and higher risk for early-onset heart attack
  • No other significant gene variants linked with heart disease

2.  T2D-GENES Consortium: diabetes

  • Sequenced 5,300 exomes of type 2 diabetes patients and controls from five ancestry groups
  • SNP in PAX4 gene associated with disease in East Asians
  • No low-frequency variant with large effect though

3.  GoT2D: diabetes

  • After sequencing 2,700 patients’ exomes and whole genomes, no new rare variants above 1.5% frequency with a strong effect on diabetes risk were found

A nice article by Dr. Sowmiya Moorthie, entitled Involvement of rare variants in common disease, can be found at the PHG Foundation site (http://www.phgfoundation.org/news/5164/). It further discusses this conundrum and is summarized below:

“Although GWAs have identified many SNPs associated with common disease, they have as yet had little success in identifying the causative genetic variants. Those that have been identified have only a weak effect on disease risk, and therefore only explain a small proportion of the heritable, genetic component of susceptibility to that disease. This has led to the common disease-common variant hypothesis, which predicts that common disease-causing genetic variants exist in all human populations, but each individual variant will necessarily only have a small effect on disease susceptibility (i.e. a low associated relative risk).

An alternative hypothesis is the common disease, many rare variants hypothesis, which postulates that disease is caused by multiple strong-effect variants, each of which is only found in a few individuals. Dickson et al. in a paper in PLoS Biology postulate that these rare variants can be indirectly associated with common variants; they call these synthetic associations and demonstrate how further investigation could help explain findings from GWA studies [Dickson et al. (2010) PLoS Biol. 8(1):e1000294][3].  In simulation experiments, 30% of synthetic associations were caused by the presence of rare causative variants and furthermore, the strength of the association with common variants also increased if the number of rare causative variants increased.”


Figure from Dr. Moorthie’s article showing the problem of “finding one in many”.


Indeed, other examples of such issues concerning gene variant association studies occur with other common diseases such as neurologic diseases and obesity, where it has been difficult to clearly and definitively associate any variant with prediction of risk.

For example, Nuytemans et al.[4] used exome sequencing to find variants in the vacuolar protein sorting 35 (VPS35) and eukaryotic translation initiation factor 4 gamma 1 (EIF4G1) genes, two genes causally linked to Parkinson’s Disease (PD).  Although they identified novel VPS35 variants, none of these variants could be correlated with a higher risk of PD.   One EIF4G1 variant seemed to be a strong Parkinson’s Disease risk factor; however, there was “no evidence for an overall contribution of genetic variability in VPS35 or EIF4G1 to PD development”.

These negative results may have relevance as companies such as 23andMe (www.23andme.com) claim to be able to test for Parkinson’s predisposition.  For a description of the LRRK2 mutational analysis which they use to determine risk for the disease, please see the following link: https://www.23andme.com/health/Parkinsons-Disease/. This company and others like it have been the subject of posts on this site (Personalized Medicine: Clinical Aspiration of Microarrays).

However, there seems to be more luck with strategies focused on analyzing intronic sequence rather than exome sequence. Jocelyn Kaiser’s Science article notes this in a brief interview with Harry Dietz of Johns Hopkins University, who suspects that “much of the missing heritability lies in gene-gene interactions”.  A recent publication in Genome Biology by Olivier Harismendy, Kelly Frazer and colleagues (http://genomebiology.com/content/11/11/R118) supports this notion[5].  The authors used targeted resequencing of two endocannabinoid metabolic enzyme genes, fatty-acid-amide hydrolase (FAAH) and monoglyceride lipase (MGLL), in 147 normal weight and 142 extremely obese patients.

These patients were enrolled in the CRESCENDO trial, and the patients analyzed were of European descent. However, instead of just exome sequencing, the group resequenced exome AND intronic sequence, focusing especially on promoter regions.   They identified 1,448 single nucleotide variants, but using a statistical filter (RareCover, which is referred to as a collapsing method) they found 4 variants in the promoters and intronic areas of the FAAH and MGLL genes which correlated with body mass index.  It should be noted that anandamide, a substrate for FAAH, is elevated in obese patients. The authors did note some limitations, though, mentioning that “some other loci, more weakly or inconsistently associated in the original GWASs, were not replicated in our samples, which is not too surprising given the sample size of our cohort is inadequate to replicate modest associations”.
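
To give a feel for what a collapsing method does, here is a minimal sketch. It is not RareCover itself and does not attempt to reproduce how that method chooses which variants to combine; the set of variants to collapse is simply fixed by hand. Each individual is reduced to a single flag (carries any selected rare variant or not), and carrier counts in cases and controls are compared with a one-sided Fisher exact test. The genotype tables, site indices, and function names are all hypothetical.

```python
from math import comb


def collapse_carriers(genotypes, rare_sites):
    """Return 1 for each individual carrying any selected rare variant, else 0."""
    return [int(any(person[site] > 0 for site in rare_sites)) for person in genotypes]


def fisher_one_sided(case_carriers, n_cases, control_carriers, n_controls):
    """P(at least this many carriers fall among cases) under the hypergeometric null."""
    total = n_cases + n_controls
    carriers = case_carriers + control_carriers
    upper = min(carriers, n_cases)
    favourable = sum(
        comb(n_cases, k) * comb(n_controls, carriers - k)
        for k in range(case_carriers, upper + 1)
    )
    return favourable / comb(total, carriers)


# Made-up genotypes: rows are individuals, columns are variant sites,
# values are counts of the alternate allele.
cases = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]
controls = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
rare_sites = [0, 1, 2]  # hypothetical choice: site 3 is treated as common and left out

case_flags = collapse_carriers(cases, rare_sites)
control_flags = collapse_carriers(controls, rare_sites)
p_value = fisher_one_sided(sum(case_flags), len(cases), sum(control_flags), len(controls))
print(f"carriers: {sum(case_flags)}/{len(cases)} cases, "
      f"{sum(control_flags)}/{len(controls)} controls, one-sided p = {p_value:.4f}")
```

The point of collapsing is visible even in this toy table: no single rare site is seen more than twice, so testing sites one at a time has little power, whereas the pooled carrier flag separates the two groups.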

PLEASE WATCH VIDEO on the National Heart, Lung and Blood Institute Exome Sequencing Project

https://www.youtube.com/watch?v=-Qr5ahk1HEI

REFERENCES

http://www.phgfoundation.org/news/5164/  PHG Foundation

1. Kaiser J: Human genetics. Genetic influences on disease remain hidden. Science 2012, 338(6110):1016-1017.

2. Tennessen JA, Bigham AW, O’Connor TD, Fu W, Kenny EE, Gravel S, McGee S, Do R, Liu X, Jun G et al: Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 2012, 337(6090):64-69.

3. Dickson SP, Wang K, Krantz I, Hakonarson H, Goldstein DB: Rare variants create synthetic genome-wide associations. PLoS Biology 2010, 8(1):e1000294.

4. Nuytemans K, Bademci G, Inchausti V, Dressen A, Kinnamon DD, Mehta A, Wang L, Zuchner S, Beecham GW, Martin ER et al: Whole exome sequencing of rare variants in EIF4G1 and VPS35 in Parkinson disease. Neurology 2013, 80(11):982-989.

5. Harismendy O, Bansal V, Bhatia G, Nakano M, Scott M, Wang X, Dib C, Turlotte E, Sipe JC, Murray SS et al: Population sequencing of two endocannabinoid metabolic genes identifies rare and common regulatory variants associated with extreme obesity and metabolite level. Genome Biology 2010, 11(11):R118.

Other posts on this site related to Genomics include:

Cancer Biology and Genomics for Disease Diagnosis

Diagnosis of Cardiovascular Disease, Treatment and Prevention: Current & Predicted Cost of Care and the Promise of Individualized Medicine Using Clinical Decision Support Systems

Ethical Concerns in Personalized Medicine: BRCA1/2 Testing in Minors and Communication of Breast Cancer Risk

Genomics & Genetics of Cardiovascular Disease Diagnoses: A Literature Survey of AHA’s Circulation Cardiovascular Genetics, 3/2010 – 3/2013

Genomics-based cure for diabetes on-the-way

Personalized Medicine: Clinical Aspiration of Microarrays

Late Onset of Alzheimer’s Disease and One-carbon Metabolism

Genetics of Disease: More Complex is How to Creating New Drugs

Genetics of Conduction Disease: Atrioventricular (AV) Conduction Disease (block): Gene Mutations – Transcription, Excitability, and Energy Homeostasis

Centers of Excellence in Genomic Sciences (CEGS): NHGRI to Fund New CEGS on the Brain: Mental Disorders and the Nervous System

Cancer Genomic Precision Therapy: Digitized Tumor’s Genome (WGSA) Compared with Genome-native Germ Line: Flash-frozen specimen and Formalin-fixed paraffin-embedded Specimen Needed

Mitochondrial Metabolism and Cardiac Function

Pancreatic Cancer: Genetics, Genomics and Immunotherapy

Issues in Personalized Medicine in Cancer: Intratumor Heterogeneity and Branched Evolution Revealed by Multiregion Sequencing

Quantum Biology And Computational Medicine

Personalized Cardiovascular Genetic Medicine at Partners HealthCare and Harvard Medical School

LEADERS in Genome Sequencing of Genetic Mutations for Therapeutic Drug Selection in Cancer Personalized Treatment: Part 2

Consumer Market for Personal DNA Sequencing: Part 4

Personalized Medicine: An Institute Profile – Coriell Institute for Medical Research: Part 3

Whole-Genome Sequencing Data will be Stored in Coriell’s Spin off For-Profit Entity

 
