Exosomes
Larry H. Bernstein, MD, FCAP, Curator
LPBI
Human Exomes Galore
A new database includes complete sequences of protein-coding DNA from 60,706 individuals.
November 16, 2015
|http://www.the-scientist.com//?articles.view/articleNo/44483/title/Human-Exomes-Galore/
The ability to sequence a person’s entire genome has led many researchers to hunt for the genetic causes of certain diseases. But without a larger set of genomes to compare mutations against, putting these variations into context is difficult. An international group of researchers has banked the full exomes of 60,706 individuals in a database called the Exome Aggregation Consortium (ExAC). The team’s analaysis, posted last month (October 30) on the preprint server bioRxiv, was presented at the Genome Science 2015 conference in Birmingham, U.K. (September 7).
Led by Daniel MacArthur from the Broad Institute of MIT and Harvard, the research team collected exomes from labs around the world for its dataset. “The resulting catalogue of human genetic diversity has unprecedented resolution,” the authors wrote in their preprint. Many of the variants observed in the dataset occurred only once.
“This is one of the most useful resources ever created for medical testing for genetic disorders,” Heidi Rehm, a clinical lab director at Harvard Medical School, told Science News.
Among other things, the team found 3,230 genes that are highly conserved across exomes, indicating likely involvement in critical cellular functions. Of these, 2,557 are not associated with diseases. The authors hypothesized that these genes, if mutated, either lead to embryonic death—before a problem can be diagnosed—or cause rare diseases that have not yet been genetically characterized.
“We should soon be able to say, with high precision: If you have a mutation at this site, it will kill you. And we’ll be able to say that without ever seeing a person with that mutation,” MacArthur said during his Genome Science talk, according to The Atlantic.
This is not the complete set of essential genes in the human body, David Goldstein, a geneticist at Columbia University in New York City, pointed out to Nature. Only by studying more exomes will researchers be able to refine that number, he noted.
Analysis of protein-coding genetic variation in 60,706 humans
http://biorxiv.org/content/early/2015/10/30/030338 doi: http://dx.doi.org/10.1101/030338
Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) sequence data for 60,706 individuals of diverse ethnicities. The resulting catalogue of human genetic diversity has unprecedented resolution, with an average of one variant every eight bases of coding sequence and the presence of widespread mutational recurrence. The deep catalogue of variation provided by the Exome Aggregation Consortium (ExAC) can be used to calculate objective metrics of pathogenicity for sequence variants, and to identify genes subject to strong selection against various classes of mutation; we identify 3,230 genes with near-complete depletion of truncating variants, 79% of which have no currently established human disease phenotype. Finally, we show that these data can be used for the efficient filtering of candidate disease-causing variants, and for the discovery of human knockout variants in protein-coding genes.
Analysis of protein-coding genetic variation in 60,706 humans https://t.co/z0PtB4c8aY
Over the last five years, the widespread availability of high-throughput DNA sequencing technologies has permitted the sequencing of the whole genomes or exomes (the 18 protein-coding regions of genomes) of over half a million humans. In theory, these data represent a powerful source of information about the global patterns of human genetic variation, but in practice, are difficult to access for practical, logistical, and ethical reasons; in addition, the inconsistent processing complicates variant-calling pipelines used by different groups. Current publicly available datasets of human DNA sequence variation contain only a small fraction of all sequenced samples: the Exome Variant Server, created as part of the NHLBI Exome Sequencing Project (ESP)1, contains frequency information spanning 6,503 exomes; and the 1000 Genomes (1000G) Project, which includes individual-level genotype data from whole-genome and exome sequence data for 2,504 individuals2.
Databases of genetic variation are important for our understanding of human population history and biology1–5, but also provide critical resources for the clinical interpretation of variants observed in patients suffering from rare Mendelian diseases6,7. The filtering of candidate variants by frequency in unselected individuals is a key step in any pipeline for the discovery of causal variants in Mendelian disease patients, and the efficacy of such filtering depends on both the size and the ancestral diversity of the available reference data.
Here, we describe the joint variant calling and analysis of high-quality variant calls across 60,706 human exomes, assembled by the Exome Aggregation Consortium (ExAC; exac.broadinstitute.org). This call set exceeds previously available exome-wide variant databases by nearly an order of magnitude, providing unprecedented resolution for the analysis of very low-frequency genetic variants. We demonstrate the application of this data set to the analysis of patterns of genetic variation including the discovery of widespread mutational recurrence, the inference of gene-level constraint against 10 truncating variation, the clinical interpretation of variation in Mendelian disease genes, and the discovery of human “knockout” variants in protein-coding genes.
…..
Deleterious variants are expected to have lower allele frequencies than neutral ones, due to negative selection. This theoretical property has been demonstrated previously in human population sequencing data18,19 and here (Figure 1d, Figure 1e). This allows inference of the degree of natural selection against specific functional classes of variation: however, mutational recurrence as described above indicates that allele frequencies observed in ExAC-scale samples are also skewed by mutation rate, with 10 more mutable sites less likely to be singletons (Figure 2c and Extended Data Figure 4d). Mutation rate is in turn non-uniformly distributed across functional classes – for instance, stop lost mutations can never occur at CpG dinucleotides (Extended Data Figure 4e). We corrected for mutation rates (Supplementary Information) by creating a mutability-adjusted proportion singleton (MAPS) metric. This metric reflects (as expected) strong selection against predicted PTVs, as well as missense variants predicted by conservation-based methods to be deleterious (Figure 2e).
The deep ascertainment of rare variation in ExAC also allows us to infer the extent of 19 selection against variant categories on a per-gene basis by examining the proportion of 20 variation that is missing compared to expectations under random mutation. Conceptually similar approaches have been applied to smaller exome datasets13,20 but have been underpowered, particularly for the analysis of depletion of PTVs. We compared the observed number of rare (MAF <0.1%) variants per gene to an expected number derived from a selection neutral, sequence-context based mutational model13. The model performs extremely well in predicting the number of synonymous variants, which should be under minimal purifying selection, per gene (r = 0.98; Extended Data Figure 5).
……
Critically, we note that LoF-intolerant genes include virtually all known severe haploinsufficient human disease genes (Figure 3b), but that 79% of LoF-intolerant genes have not yet been assigned a human disease phenotype despite the clear evidence for extreme selective constraint (Supplementary Information 4.11). These likely represent either undiscovered severe dominant disease genes, or genes in which loss of a single copy results in embryonic lethality.
The most highly constrained missense (top 25% missense Z scores) and PTV (pLI ≥0.9) genes show higher expression levels and broader tissue expression than the least constrained genes24 (Figure 3c). These most highly constrained genes are also depleted for eQTLs (p < 10-9 for missense and PTV; Figure 3d), yet are enriched within genome-wide significant trait-associated loci (χ2 p < 10-14, Figure 3e). Intuitively, genes intolerant of PTV variation are dosage sensitive: natural selection does not tolerate a 50% deficit in expression due to the loss of single allele. It is therefore unsurprising that these genes are also depleted of common genetic variants that have a large enough effect on expression to be detected as eQTLs with current limited sample sizes. However, smaller changes in the expression of these genes, through weaker eQTLs or functional variants, are more likely to contribute to medically relevant phenotypes. Therefore, highly constrained genes are dosage-sensitive, expressed more broadly across tissues (as expected for core cellular processes), and are enriched for medically relevant variation.
Finally, we investigated how these constraint metrics would stratify mutational classes according to their frequency spectrum, corrected for mutability as in the previous section (Figure 3f). The effect was most dramatic when considering stop-gained variants in the LoF-intolerant set of genes. For missense variants, the missense Z score offers information additional to Polyphen2 and CADD classifications, indicating that gene-level measures of constraint offer additional information to variant-level metrics in assessing potential pathogenicity.
We assessed the value of ExAC as a reference dataset for clinical sequencing approaches, which typically prioritize or filter potentially deleterious variants based on functional consequence and allele frequency6. To simulate a Mendelian variant analysis, we filtered variants in 100 ExAC exomes per continental population against ESP (the previous default reference data set for clinical analysis) or the remainder of ExAC, removing variants present at ≥0.1% allele frequency, a filter recommended for dominant 16 disease variant discovery6. Filtering on ExAC reduced the number of candidate protein-altering variants by 7-fold compared to ESP, and was most powerful when the highest 18 allele frequency in any one population (“popmax”) was used rather than average (“global”) allele frequency (Figure 4a). ESP is not well-powered to filter at 0.1% AF without removing many genuinely rare variants, as AF estimates based on low allele counts are both upward-biased and imprecise (Figure 4b). We thus expect that ExAC will provide a very substantial boost in the power and accuracy of variant filtering in Mendelian disease projects.
…….
The above curation efforts confirm the importance of allele frequency filtering in analysis of candidate disease variants. However, literature and database errors are prevalent even at lower allele frequencies: the average ExAC exome contains 0.89 reportedly Mendelian variants in well-characterized dominant disease genes at <1% popmax AF and 0.20 at <0.1% popmax AF. This inflation likely results from a combination of false reports of pathogenicity and incomplete penetrance, as we show for PRNP in the accompanying work [Minikel et al, submitted]. The abundance of rare functional variation in many disease genes in ExAC is a reminder that such variants should not be assumed to be causal or highly penetrant without careful segregation or case-control analysis28,7.
We investigated the distribution of PTVs, variants predicted to disrupt protein-coding genes through the introduction of a stop codon or frameshift or the disruption of an essential splice site; such variants are expected to be enriched for complete loss-of-function of the impacted genes. Naturally-occurring PTVs in humans provide a model for the functional impact of gene inactivation, and have been used to identify many genes in 6 which LoF causes severe disease31, as well as rare cases where LoF is protective against disease32.
Among the 7,404,909 HQ variants in ExAC, we found 179,774 high-confidence PTVs (as 10 defined in Supplementary Information Section 6), 121,309 of which are singletons. This 11 corresponds to an average of 85 heterozygous and 35 homozygous PTVs per individual (Figure 5a). The diverse nature of the cohort enables the discovery of substantial numbers of novel PTVs: out of 58,435 PTVs with an allele count greater than one, 33,625 occur in only one population. However, while PTVs as a category are extremely rare, the majority of the PTVs found in any one person are common, and each individual 16 has only ~2 singleton PTVs, of which 0.14 are found in PTV-constrained genes (pLI 17 >0.9). The site frequency spectrum of these variants across the populations represented in ExAC recapitulates known aspects of demographic models, including an increase in intermediate-frequency (1%-5%) PTVs in Finland33 and relatively common (>0.1%) PTVs in Africans (Figure 5b).
……
Discussion Here we describe the generation and analysis of the most comprehensive catalogue of 29 human protein-coding genetic variation to date, incorporating high-quality exome sequencing data from 60,706 individuals of diverse geographic ancestry. The resulting call set provides unprecedented resolution for the analysis of very low-frequency protein-coding variants in human populations, as well as a powerful resource for the clinical interpretation of genetic variants observed in disease patients. The complete frequency CC-BY-ND 4.0 International license for this preprint is the author/funder. It is made available under a bioRxiv preprint first posted online October 30, 2015;
http://dx.doi.org/10.1101/030338 ; The copyright holder and annotation data from this call-set has been made freely available through a public website [exac.broadinstitute.org]
The ExAC resource provides the largest database to date for the estimation of allele frequency for protein-coding genetic variants, providing a powerful filter for analysis of candidate pathogenic variants in severe Mendelian diseases. Frequency data from ESP1 have been widely used for this purpose, but those data are limited by population diversity and by resolution at allele frequencies ≤0.1%. ExAC therefore provides 21 substantially improved power for Mendelian analyses, although it is still limited in power at lower allele frequencies, emphasizing the need for more sophisticated pathogenic variant filtering strategies alongside on-going data aggregation efforts. ExAC also highlights an unexpected tolerance of many disease genes to functional variation, and reveals that the literature and public databases contain an inflated number of reportedly pathogenic variants across the frequency spectrum, indicating a need for stringent criteria for assertions of pathogenicity.
Finally, we show that different populations confer different advantages in the discovery of gene-disrupting PTVs, providing guidance for projects seeking to identify human “knockouts” to understand gene function. Individuals of African ancestry have more PTVs (140 on average), with this enrichment most pronounced at allele frequencies above 1% (Figure 5b). Finnish individuals, as a result of a population bottleneck, are depleted at the lowest (<0.1%) allele frequencies but have a peak in frequency at 1-5% (Figure 5b). However, these differences are diminished when considering only LoF-constrained (pLI > 0.9) genes (Extended Data Figure 10). Sampling multiple populations would likely be a fruitful strategy for a researcher investigating common PTV variation. However, discovery of homozygous PTVs is markedly enhanced in the South Asia samples, which come primarily from a Pakistani cohort with 38.3% of individuals self- reporting as having closely related parents, emphasizing the extreme value of consanguineous cohorts for “human knockout” discovery (Figure 5d) [Saleheen et al., to 8 be co-submitted].
…..
While the ExAC dataset dramatically exceeds the scale of previously available frequency reference datasets, much remains to be gained by further increases in sample size. Indeed, the fact that even the rarest transversions have mutational rates13 on the order of 1 x 10-9 implies that almost all possible non-lethal SNVs likely exist in some person on Earth. ExAC already includes >70% of all possible protein-coding CpG transitions at well-covered sites; order of magnitude increases in sample size will eventually lead to saturation of other classes of variation.