DNA code vs human language
Larry H. Bernstein, MD, FCAP, Curator
LPBI
Grammar of Genetic Code Much More Complex than Human Languages
Karolinska Institutet
A new study shows that the ‘grammar’ of the human genetic code is more complex than that of even the most intricately constructed spoken languages in the world. The findings, published in the journal Nature, explain why the human genome is so difficult to decipher — and contribute to the further understanding of how genetic differences affect the risk of developing diseases on an individual level.
Researchers Arttu Jolma and Jussi Taipale are part of a team that examined the binding preferences of pairs of transcription factors, and systematically mapped the compound DNA words to which they bind. Courtesy of Ulf Sirborn
“The genome contains all the information needed to build and maintain an organism, but it also holds the details of an individual’s risk of developing common diseases such as diabetes, heart disease and cancer,” says the study's lead author, Arttu Jolma, a doctoral student at the Department of Biosciences and Nutrition at Sweden’s Karolinska Institutet. “If we can improve our ability to read and understand the human genome, we will also be able to make better use of the rapidly accumulating genomic information on a large number of diseases for medical benefits.”
The sequencing of the human genome in 2000 revealed the order of the three billion letters (A, C, G and T) that make up the genome. However, knowing the order of the letters is not sufficient for translating genomic discoveries into medical benefits; one also needs to understand what the sequences of letters mean. In other words, it is necessary to identify the ‘words’ and the ‘grammar’ of the language of the genome.
The cells in our body have almost identical genomes, but differ from each other because different genes are active (expressed) in different types of cells. Each gene has a regulatory region that contains the instructions controlling when and where the gene is expressed. This gene regulatory code is read by proteins called transcription factors that bind to specific ‘DNA words’ and either increase or decrease the expression of the associated gene.
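To make the idea of a transcription factor reading a ‘DNA word’ more concrete, here is a minimal sketch of one common way such words are represented computationally: a table of per-position base preferences, scored against every window of a sequence. The motif values, the uniform background, and the names `TOY_MOTIF`, `log_odds_score` and `best_site` are all assumptions made up for illustration; they are not taken from the study.

```python
import math

# Toy 4-letter "DNA word" as per-position base probabilities.
# These numbers are invented for illustration, not measured binding data.
TOY_MOTIF = [
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
]
BACKGROUND = 0.25  # assumption: all four bases equally likely by chance


def log_odds_score(window, motif):
    """Sum of log2(motif probability / background probability) per position."""
    return sum(
        math.log2(column[base] / BACKGROUND)
        for column, base in zip(motif, window)
    )


def best_site(sequence, motif):
    """Return (start, score) of the highest-scoring window in the sequence."""
    width = len(motif)
    candidates = [
        (start, log_odds_score(sequence[start:start + width], motif))
        for start in range(len(sequence) - width + 1)
    ]
    return max(candidates, key=lambda item: item[1])


# The preferred word "AGGA" sits at index 3 of this toy sequence.
start, score = best_site("TTTAGGATTT", TOY_MOTIF)
print(start, round(score, 2))
```

A positive score means the window looks more like the word than like random sequence; a negative score means the opposite. Real motif models are derived from experimental data and are usually longer than four letters.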
Under the supervision of Professor Jussi Taipale, researchers at Karolinska Institutet have previously identified most of the DNA words recognized by individual transcription factors. Much like in a natural human language, however, the DNA words can be joined to form compound words that are read by multiple transcription factors. The mechanism by which such compound words are read had not previously been examined. In their recent study in Nature, the Taipale team therefore examines the binding preferences of pairs of transcription factors, and systematically maps the compound DNA words to which they bind.
Their analysis reveals that the grammar of the genetic code is much more complex than that of even the most complex human languages. Compound DNA words are not formed by simply joining two words together and deleting the space between them. Instead, the individual words are altered when they are joined, leading to a large number of completely new words.
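The contrast between a simple join and an altered compound word can be illustrated with a short sketch. It lists every way two words could be placed end to end, in either order and on either DNA strand, and then checks whether a given compound word is one of those simple joins. The words `ACGG` and `TTCA`, the compound `ACGTCA`, and the function names are hypothetical placeholders, not motifs reported in the study.

```python
def reverse_complement(word):
    """Return the same DNA word as read from the opposite strand."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(word))


def naive_compounds(word_a, word_b):
    """All end-to-end joins of two words, in either order and on either strand."""
    forms_a = {word_a, reverse_complement(word_a)}
    forms_b = {word_b, reverse_complement(word_b)}
    joined = set()
    for a in forms_a:
        for b in forms_b:
            joined.add(a + b)
            joined.add(b + a)
    return joined


def is_simple_join(compound, word_a, word_b):
    """True if the compound is just the two words placed side by side."""
    return compound in naive_compounds(word_a, word_b)


# Hypothetical words and compounds, chosen only to show the two cases.
print(is_simple_join("ACGGTTCA", "ACGG", "TTCA"))  # words placed side by side
print(is_simple_join("ACGTCA", "ACGG", "TTCA"))    # overlapping, altered word
```

In this toy case the second compound is shorter than either simple join and shares letters between the two words, which is the kind of compound that a catalogue of individual words alone would not predict.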
“Our study identified many such words, increasing the understanding of how genes are regulated both in normal development and cancer,” says Arttu Jolma. “The results pave the way for cracking the genetic code that controls the expression of genes.”
This project was supported by the Finnish Academy CoE in Cancer Genetics, Center for Innovative Medicine, Knut and Alice Wallenberg Foundation, Göran Gustafsson Foundations, and the Swedish Research Council. Professor Taipale is also affiliated to the University of Helsinki, Finland.
Citation: ‘DNA-dependent formation of transcription factor pairs alters their binding specificity,’
Jolma A, Yin Y, Nitta KR, Dave K, Popov A, Taipale M, Enge M, Kivioja T, Morgunova E and Taipale J.,
Nature, online November 9, 2015, http://dx.doi.org/10.1038/nature15518.
DNA-dependent formation of transcription factor pairs alters their binding specificity
Arttu Jolma, Yimeng Yin, Kazuhiro R. Nitta, Kashyap Dave, Alexander Popov, Minna Taipale, Martin Enge, Teemu Kivioja, Ekaterina Morgunova & Jussi Taipale
Nature (2015) http://dx.doi.org/10.1038/nature15518
Gene expression is regulated by transcription factors (TFs), proteins that recognize short DNA sequence motifs [1, 2, 3]. Such sequences are very common in the human genome, and an important determinant of the specificity of gene expression is the cooperative binding of multiple TFs to closely located motifs [4, 5, 6]. However, interactions between DNA-bound TFs have not been systematically characterized. To identify TF pairs that bind cooperatively to DNA, and to characterize their spacing and orientation preferences, we have performed consecutive affinity-purification systematic evolution of ligands by exponential enrichment (CAP-SELEX) analysis of 9,400 TF–TF–DNA interactions. This analysis revealed 315 TF–TF interactions recognizing 618 heterodimeric motifs, most of which have not been previously described. The observed cooperativity occurred promiscuously between TFs from diverse structural families. Structural analysis of the TF pairs, including a novel crystal structure of MEIS1 and DLX3 bound to their identified recognition site, revealed that the interactions between the TFs were predominantly mediated by DNA. Most TF pair sites identified involved a large overlap between individual TF recognition motifs, and resulted in recognition of composite sites that were markedly different from the individual TF’s motifs. Together, our results indicate that the DNA molecule commonly plays an active role in cooperative interactions that define the gene regulatory lexicon.
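As a rough illustration of what characterizing "spacing and orientation preferences" can mean in practice, the sketch below tallies, across a set of sequenced DNA fragments, how often two words occur together at each relative offset and strand combination. This is a simplified sketch under stated assumptions and not the analysis pipeline used in the paper: the reads, the words `ACGG` and `TTCA`, and the function names are all hypothetical, and exact string matching stands in for proper motif models.

```python
from collections import Counter


def reverse_complement(seq):
    """Return the sequence as read from the opposite strand."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def find_all(seq, word):
    """Start indices of every (possibly overlapping) exact match of word."""
    return [i for i in range(len(seq) - len(word) + 1) if seq.startswith(word, i)]


def pair_configurations(read, word_a, word_b):
    """Yield (strand_a, strand_b, offset) for each co-occurrence in one read.

    offset is start_of_b minus start_of_a, so a negative value means
    word_b lies upstream of word_a in the read.
    """
    for strand_a, a in (("+", word_a), ("-", reverse_complement(word_a))):
        for strand_b, b in (("+", word_b), ("-", reverse_complement(word_b))):
            for i in find_all(read, a):
                for j in find_all(read, b):
                    yield strand_a, strand_b, j - i


def tally_configurations(reads, word_a, word_b):
    """Count how many reads contain each (strand_a, strand_b, offset)."""
    counts = Counter()
    for read in reads:
        # set() so that each configuration is counted once per read
        counts.update(set(pair_configurations(read, word_a, word_b)))
    return counts


# Hypothetical reads, invented for illustration only.
toy_reads = ["TTACGGATTCATT", "GGACGGCTTCAAA", "ACGGTTCAGGGGG"]
for config, n in tally_configurations(toy_reads, "ACGG", "TTCA").most_common():
    print(config, n)
```

A configuration that appears far more often than others among the selected fragments would suggest a preferred spacing and orientation for the pair. This sketch does not merge configurations that are reverse complements of each other, and palindromic words would be counted on both strands, so a real analysis would need to handle both cases.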
Figure 1: CAP-SELEX reveals DNA-mediated TF–TF interactions.

a, Schematic description of CAP-SELEX. A TF1–TF2–DNA complex is formed (top left) and subjected to two consecutive affinity purifications, followed by amplification of DNA and sequencing.
http://www.nature.com/nature/journal/vaop/ncurrent/carousel/nature15518-f1.jpg
Figure 2: Overlapping composite TF motifs with novel specificity.

a, An example of a TF pair binding to an overlapping composite site. Top, a composite GCM1–ELK1 logo aligned to the individual logos. Middle, DNA–protein contacts for GCM1 (purple) and ELK1 (light blue) in the composite site, predicted…
http://www.nature.com/nature/journal/vaop/ncurrent/carousel/nature15518-f2.jpg
Figure 3: All identified TF–TF interactions.
a, PWM motif similarities between the heterodimer motifs (green bars) and monomeric and homodimeric representative motifs from ref. 8. Barcode logos for each factor are shown, and background colour of name indicates TF structural family…
http://www.nature.com/nature/journal/vaop/ncurrent/carousel/nature15518-f3.jpg