Reporter and Curator: Dr. Sudipta Saha, Ph.D.
Negative selection was examined using two measures that highlight different periods of selection in the human genome. The first measure, inter-species pan-mammalian constraint (GERP-based scores; 24 mammals), addresses selection during mammalian evolution. The second measure is intra-species constraint, estimated from the numbers of variants discovered in human populations using data from the 1000 Genomes Project, and covers selection over human evolution.
For DNaseI elements and bound motifs, most sets of elements show enrichment in pan-mammalian constraint and decreased human population diversity, though for some cell types the DNaseI sites do not appear overall to be subject to pan-mammalian constraint. Bound TF motifs have a natural control in the set of TF motifs with equal sequence potential for binding but without binding evidence from ChIP-seq experiments; in all cases, the bound motifs showed both more mammalian constraint and stronger suppression of human diversity.
Consistent with previous findings, genome-wide evidence was not observed for pan-mammalian selection of novel RNA sequences. There are also a large number of elements without mammalian constraint, ranging from 17% to 90% for TF-binding regions as well as DHSs and FAIRE regions. Previous studies could not determine whether these sequences are biochemically active but with little overall impact on the organism, or are under lineage-specific selection. This issue was examined specifically by isolating sequences preferentially inserted into the primate lineage, which is only feasible given the genome-wide scale of this data. The majority of primate-specific sequence is due to retrotransposon activity, but an appreciable proportion is non-repetitive primate-specific sequence. Of 104,343,413 primate-specific bases (excluding repetitive elements), 67,769,372 (65%) are found within ENCODE-identified elements. Examination of 227,688 variants segregating in these primate-specific regions revealed that all classes of elements (RNA and regulatory) show depressed derived allele frequencies, consistent with recent negative selection occurring in at least some of these regions. This suggests that an appreciable proportion of the unconstrained elements are lineage-specific elements required for organismal function, consistent with long-standing views of recent evolution, and that the remainder are likely to be “neutral” elements which are not currently under selection, but may still affect cellular or larger-scale phenotypes without an effect on fitness.
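To make the notion of "depressed derived allele frequencies" concrete, the following is a minimal sketch of how a derived allele frequency (DAF) could be computed and compared between two sets of variants. The counts are hypothetical illustration values, not ENCODE or 1000 Genomes data, and the function names are assumptions introduced here.

```python
def derived_allele_frequency(derived_count, total_chromosomes):
    """Fraction of sampled chromosomes carrying the derived (non-ancestral) allele."""
    if total_chromosomes <= 0:
        raise ValueError("total_chromosomes must be positive")
    if not 0 <= derived_count <= total_chromosomes:
        raise ValueError("derived_count must lie between 0 and total_chromosomes")
    return derived_count / total_chromosomes


def mean_daf(variants):
    """Mean DAF over an iterable of (derived_count, total_chromosomes) pairs."""
    freqs = [derived_allele_frequency(d, n) for d, n in variants]
    return sum(freqs) / len(freqs)


# Hypothetical counts: variants inside annotated elements vs. a background set.
element_variants = [(2, 100), (5, 100), (1, 100)]
background_variants = [(20, 100), (35, 100), (5, 100)]

element_mean = mean_daf(element_variants)
background_mean = mean_daf(background_variants)

# A lower mean DAF in elements than in background is the pattern
# described in the text as consistent with recent negative selection.
daf_is_depressed = element_mean < background_mean
```

A real analysis would also need to infer the ancestral allele and would compare full frequency spectra rather than means alone.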
The binding patterns of TFs are not uniform, and both inter- and intra-species measures of negative selection can be correlated with the overall information content of motif positions. The selection on some motif positions is as strong as that on protein-coding exons. These aggregate measures across motifs show that the binding preferences found in the population of sites are also relevant to the per-site behavior. By developing a per-site metric of population effect on bound motifs, it was found that bound instances that are highly constrained across mammals are able to buffer the impact of individual variation.
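The information content of a motif position is commonly computed as the relative entropy of its base probabilities against a background distribution. The sketch below assumes a uniform background of 0.25 per base and uses a hypothetical three-position motif; neither the background nor the motif comes from the study itself.

```python
import math


def position_information_content(column, background=0.25):
    """Information content (bits) of one motif column of base probabilities.

    `column` holds the probabilities of A, C, G, T at that position.
    A uniform background is assumed for illustration.
    """
    ic = 0.0
    for p in column:
        if p > 0:  # terms with p == 0 contribute nothing
            ic += p * math.log2(p / background)
    return ic


# Hypothetical motif: one invariant, one two-fold degenerate, one uninformative position.
motif = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.25, 0.25, 0.25, 0.25],
]

ic_per_position = [position_information_content(col) for col in motif]
```

Under this definition an invariant position carries 2 bits and a fully degenerate position carries 0 bits, so positions can be ranked by information content and compared against per-position constraint scores.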
It was proposed to express the deleterious effect of TFBS mutations in terms of mutational load, a known population genetics metric that combines the frequency of a mutation with the predicted phenotypic consequences it causes. This metric was adapted to use the reduction in PWM score associated with a mutation as a crude but computable measure of such phenotypic consequences. It was not assumed that TFBS load at a given site reduces an individual’s biological fitness. Rather, it was argued that binding sites that tolerate a higher load are less functionally constrained. This approach, although undoubtedly a crude one, makes it possible to consistently estimate TFBS constraints for different TFs and even different organisms, and to ask why TFBS mutations are tolerated differently in different contexts.
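One way to formalise such a motif load is sketched below. It borrows the classical genetic load form, (maximum fitness minus mean fitness) divided by maximum fitness, and substitutes PWM scores for fitness. The exact formula used in the original analysis is not given in this post, so this formulation, the function name, and the example values are all assumptions for illustration.

```python
import math


def motif_load(alleles):
    """Load of one binding site, given (population_frequency, pwm_score) pairs.

    Assumed formulation: (best_score - frequency-weighted mean score) / best_score.
    PWM scores are assumed to be positive.
    """
    total_freq = sum(freq for freq, _ in alleles)
    if not math.isclose(total_freq, 1.0, abs_tol=1e-6):
        raise ValueError("allele frequencies must sum to 1")
    best_score = max(score for _, score in alleles)
    if best_score <= 0:
        raise ValueError("best PWM score must be positive")
    mean_score = sum(freq * score for freq, score in alleles)
    return (best_score - mean_score) / best_score


# Hypothetical site: the major allele scores 10.0; a 10%-frequency variant scores 8.0.
polymorphic_load = motif_load([(0.9, 10.0), (0.1, 8.0)])

# A site with no segregating variation carries no load.
monomorphic_load = motif_load([(1.0, 10.0)])
```

With this definition, load grows both with the frequency of score-reducing alleles and with the size of the score reduction, which matches the verbal description of combining mutation frequency with predicted consequence.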
It was first asked whether motif load would be able to detect the expected link between evolutionary and individual variation. A published metric, Branch Length Score (BLS), was used to characterise the evolutionary conservation of a motif instance. This metric uses a PWM-based model of the conservation of bases and also allows for motif movement. Reassuringly, mutational load correlated with BLS in both species, with evolutionarily non-conserved motifs (BLS=0) showing by far the highest degree of variation in the population. At the same time, ∼40% of human and fly TFBSs with an appreciable load (L>5e-3) still mapped to reasonably conserved sites (BLS>0.2, approximately the 50th percentile in both organisms), demonstrating that score-reducing mutations at evolutionarily preserved sequences can be tolerated in these populations.
Using this metric, the original findings were confirmed, suggesting that TFBSs with higher PWM scores are generally more functionally constrained compared to ‘weaker’ sites. The fraction of detected sites mapping to bound regions remained similar across the whole analysed score range, suggesting that this relationship is unlikely to be an artefact of higher false-positive rates at ‘weaker’ sites. This global observation, however, does not rule out the possibility that a weaker match at some sites is specifically preserved to ensure dose-specific TF binding. This may be the case, for example, for Drosophila Bric-à-brac motifs, which exhibited no correlation between motif load and PWM score, consistent with the known dosage-dependent function of Bric-à-brac in embryo patterning.
Motif load was used to address whether TFBSs proximal to transcription start sites (TSS) are more constrained than more distant regulatory regions. This was found to be the case in human, but not in Drosophila. CTCF binding sites in both species were a notable exception, tolerating the lowest mutational load at locations 500 bp to 1 kb from the TSS, but not closer to the TSS, suggesting that the putative role of CTCF in establishing chromatin domains is particularly important in the proximity of gene promoters.
To gain further insight into the functional effects of TFBS mutations, a dataset was used that mapped human CTCF binding sites across four individuals. TFBS mutations detected in this dataset often did not result in a significant loss of binding, with ∼75% of mutated sites retaining at least two thirds of the binding signal. This was particularly prominent at conserved sites (BLS>0.5), 90% of which showed this ‘buffering’ effect. To address whether buffering could be explained solely by the flexibility of CTCF sequence preferences, between-allele differences in the PWM score were analysed at polymorphic binding sites. As expected, globally the CTCF binding signal correlated with the PWM score of the underlying motifs. Consistent with this, alleles with minor differences in PWM match generally had little effect on the binding signal compared to sites with larger PWM score changes, suggesting that the PWM model adequately describes the functional constraints of CTCF binding sites. At the same time, CTCF binding signals could be maintained even where mutations resulted in significant changes of PWM score, particularly at evolutionarily conserved sites. A linear interaction model confirmed that the effect of motif mutations on CTCF binding was significantly reduced with increasing conservation. These effects were not due to the presence of additional CTCF motifs (as 96% of bound regions contained only a single motif), and differences between more and less conserved sites could not be explained by differences in the PWM scores of their major alleles. A CTCF dataset from three additional individuals, generated by a different laboratory, yielded consistent conclusions, suggesting that these observations were not due to over-fitting.
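A linear interaction model of this kind can be sketched as binding change = b0 + b1·(PWM change) + b2·(conservation) + b3·(PWM change × conservation), where a negative b3 means that the effect of a PWM change shrinks as conservation rises. The example below fits such a model by ordinary least squares in plain Python. The data are synthetic and noise-free, generated with known coefficients purely to show the mechanics; the variable names and the solver are assumptions, not the original analysis code.

```python
def fit_ols(X, y):
    """Ordinary least squares via the normal equations and Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        if abs(A[col][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Synthetic grid of PWM score changes and conservation values (illustrative only).
pwm_changes = [0.0, 1.0, 2.0, 3.0]
conservations = [0.0, 0.25, 0.5, 0.75, 1.0]
sites = [(d, c) for d in pwm_changes for c in conservations]

# Generate binding changes with a known negative interaction term (-0.8).
binding_changes = [1.0 * d - 0.8 * d * c for d, c in sites]

# Design matrix columns: intercept, PWM change, conservation, interaction.
design = [[1.0, d, c, d * c] for d, c in sites]
b0, b1, b2, b3 = fit_ols(design, binding_changes)

# Effect of a unit PWM change at zero and at full conservation.
slope_unconserved = b1 + b3 * 0.0
slope_conserved = b1 + b3 * 1.0
```

In this synthetic setting the fitted slope drops from 1.0 at unconserved sites to 0.2 at fully conserved sites, which is the shape of result the text describes. Real data would include noise, and assessing significance would need standard errors for the interaction coefficient.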
Taken together, CTCF binding data for multiple individuals show that mutations can be buffered to maintain the levels of binding signal, particularly at highly conserved sites, and this effect cannot be explained solely by the flexibility of CTCF’s sequence consensus. It was asked whether mechanisms potentially accountable for such buffering would also affect the relationship between sequence and binding in the absence of mutations. Training a linear interaction model across the whole set of mapped CTCF binding sites revealed that conservation consistently weakens the relationship between PWM score and binding intensity. Thus, CTCF binding to evolutionarily conserved sites may generally have a reduced dependence on sequence.
Source References:
http://www.nature.com/encode/threads/impact-of-evolutionary-selection-on-functional-regions
Nice job.
Just finished reading the post.
Dr. Saha, do you see a connection between this field of genomics and the development of applications in pharmaceutics, and then their implementation in medicine? To answer this, I believe one needs to review the resources in use for genomics research and envision their future potential. Let’s try to do that for Reproductive Medicine.
I thank you for embarking on the new path of Genomics and Reproduction.