Funding, Deals & Partnerships: BIOLOGICS & MEDICAL DEVICES; BioMed e-Series; Medicine and Life Sciences Scientific Journal – http://PharmaceuticalIntelligence.com
Stanford-developed algorithm reveals complex protein dynamics behind gene expression
BY KRISTA CONGER
Michael Snyder
In yet another coup for a research concept known as “big data,” researchers at the Stanford University School of Medicine have developed a computerized algorithm to understand the complex and rapid choreography of hundreds of proteins that interact in mind-boggling combinations to govern how genes are flipped on and off within a cell.
To do so, they coupled findings from 238 DNA-protein-binding experiments performed by the ENCODE project — a massive, multiyear international effort to identify the functional elements of the human genome — with a laboratory-based technique to identify binding patterns among the proteins themselves.
The analysis is sensitive enough to have identified many previously unsuspected, multipartner trysts. It can also be performed quickly and repeatedly to track how a cell responds to environmental changes or crucial developmental signals.
“At a very basic level, we are learning who likes to work with whom to regulate around 20,000 human genes,” said Michael Snyder, PhD, professor and chair of genetics at Stanford. “If you had to look through all possible interactions pair-wise, it would be ridiculously impossible. Here we can look at thousands of combinations in an unbiased manner and pull out important and powerful information. It gives us an unprecedented level of understanding.”
Snyder is the senior author of a paper describing the research published Oct. 24 in Cell. The lead authors are postdoctoral scholars Dan Xie, PhD, Alan Boyle, PhD, and Linfeng Wu, PhD.
Proteins control gene expression by either binding to specific regions of DNA, or by interacting with other DNA-bound proteins to modulate their function. Previously, researchers could only analyze two to three proteins and DNA sequences at a time, and were unable to see the true complexities of the interactions among proteins and DNA that occur in living cells.
The challenge resembled trying to figure out interactions in a crowded mosh pit by studying a few waltzing couples in an otherwise empty ballroom, and it has severely limited what could be learned about the dynamics of gene expression.
The ENCODE project (short for Encyclopedia of DNA Elements) was a five-year collaboration of more than 440 scientists in 32 labs around the world to reveal the complex interplay among regulatory regions, proteins and RNA molecules that governs when and how genes are expressed. The project has been generating a treasure trove of data for researchers to analyze for the last eight years.
In this study, the researchers combined data from genomics (a field devoted to the study of genes) and proteomics (which focuses on proteins and their interactions). They studied 128 proteins, called trans-acting factors, which are known to regulate gene expression by binding to regulatory regions within the genome. Some of the regions control the expression of nearby genes; others affect the expression of genes great distances away.
The researchers used 238 data sets generated by the ENCODE project to study the specific DNA sequences bound by each of the 128 trans-acting factors. But these factors aren’t monogamous; they bind many different sequences in a variety of protein-DNA combinations. Xie, Boyle and Snyder designed a machine-learning algorithm to analyze all the data and identify which trans-acting factors tend to be seen together and which DNA sequences they prefer.
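The idea of asking which factors "tend to be seen together" can be illustrated with a much simpler measure than the one the researchers used. The sketch below is not their algorithm; it only scores pairwise co-occurrence with a Jaccard index over a toy table. FOS and JUN are named in the article, while the regions, the placeholder factors `TF_X` and `TF_Y`, and all binding assignments are invented for illustration.

```python
from itertools import combinations

# Hypothetical toy data: each genomic region maps to the set of factors bound there.
# The regions and the bindings are invented; they are not ENCODE data.
binding = {
    "region_1": {"FOS", "JUN"},
    "region_2": {"FOS", "JUN", "TF_X"},
    "region_3": {"FOS", "TF_X"},
    "region_4": {"TF_Y"},
}

def jaccard_cooccurrence(binding):
    """Return {(a, b): Jaccard index} for every pair of factors.

    The Jaccard index is the number of regions bound by both factors
    divided by the number of regions bound by at least one of them.
    """
    # Invert the table: factor -> set of regions where it binds.
    sites = {}
    for region, factors in binding.items():
        for factor in factors:
            sites.setdefault(factor, set()).add(region)

    scores = {}
    for a, b in combinations(sorted(sites), 2):
        shared = sites[a] & sites[b]
        union = sites[a] | sites[b]
        scores[(a, b)] = len(shared) / len(union)
    return scores

scores = jaccard_cooccurrence(binding)

# Print the pairs from most to least frequently co-bound.
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

A pairwise score like this is exactly what the article says does not scale to combinations of many factors, which is why a method that looks at whole binding patterns at once is needed.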
Wu then performed immunoprecipitation experiments, which use antibodies to identify protein interactions in the cell nucleus. In this way, they were able to tell which proteins interacted directly with one another, and which were seen together because their preferred DNA binding sites were adjoining.
“Before our work, only the combination of two or three regulatory proteins were studied, which oversimplified how gene regulators collaborate to find their targets,” Xie said. “With our method we are able to study the combination of more than 100 regulators and see a much more complex structure of collaboration. For example, it had been believed that a key regulator of cell proliferation called FOS typically only works with JUN protein family members. We show, in addition to JUN, FOS has different partners under different circumstances. In fact, we found almost all the canonical combinations of two or three trans-acting factors have many more partners than we previously thought.”
To broaden their analysis, the researchers included data from other sources that explored protein-binding patterns in five cell types. They found that patterns of co-localization among proteins, in which several proteins are found clustered closely on the DNA to govern gene expression, vary according to cell type and the conditions under which the cells are grown. They also found that many of these clusters can be explained through interactions among proteins, and that not every protein bound to DNA directly.
“We’d like to understand how these interactions work together to make different cell types and how they gain their unique identities in development,” Snyder said. “Furthermore, diseased cells will have a very different type of wiring diagram. We hope to understand how these cells go astray.”
Other Stanford co-authors include life science research assistant Jie Zhai and life science research associate Trupti Kawli, PhD.
An application of SOMs to study high-dimensional TF colocalization patterns
Colocalization patterns are dynamic through stimulation and across cell types
Many TF colocalizations can be explained by protein-protein interaction
Summary
Different trans-acting factors (TFs) collaborate and act in concert at distinct loci to perform accurate regulation of their target genes. To date, the cobinding of TF pairs has been investigated in a limited context both in terms of the number of factors within a cell type and across cell types and the extent of combinatorial colocalizations. Here, we use an approach to analyze TF colocalization within a cell type and across multiple cell lines at an unprecedented level. We extend this approach with large-scale mass spectrometry analysis of immunoprecipitations of 50 TFs. Our combined approach reveals large numbers of interesting TF-TF associations. We observe extensive change in TF colocalizations both within a cell type exposed to different conditions and across multiple cell types. We show distinct functional annotations and properties of different TF cobinding patterns and provide insights into the complex regulatory landscape of the cell.
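The highlights above mention self-organizing maps (SOMs) as the tool for studying high-dimensional colocalization patterns. As a rough illustration of the technique only, and not the authors' implementation, here is a minimal SOM written with NumPy. The grid size, learning rate, neighborhood width and the toy binary binding matrix are all assumptions chosen for the example.

```python
import numpy as np

def best_matching_unit(weights, x):
    """Return the (row, col) of the map unit whose weight vector is closest to x."""
    dist = np.linalg.norm(weights - x, axis=-1)
    return np.unravel_index(np.argmin(dist), dist.shape)

def train_som(data, grid=(4, 4), epochs=50, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small rectangular SOM; all hyperparameters are illustrative."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    weights = rng.random((rows, cols, data.shape[1]))
    # Grid coordinates of every unit, shape (rows, cols, 2).
    coords = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.permutation(data):  # shuffle the rows each epoch
            frac = step / n_steps
            lr = lr0 * (1 - frac)                  # learning rate decays linearly
            sigma = sigma0 * (1 - frac) + 1e-3     # neighborhood shrinks over time
            bmu = best_matching_unit(weights, x)
            # Squared grid distance of every unit to the best matching unit.
            d2 = ((coords - np.array(bmu)) ** 2).sum(axis=-1)
            h = np.exp(-d2 / (2 * sigma ** 2))     # Gaussian neighborhood function
            weights += lr * h[..., None] * (x - weights)
            step += 1
    return weights

# Hypothetical toy input: rows are genomic regions, columns are four factors,
# 1 means the factor is bound at that region. Two colocalization patterns.
pattern_a = np.array([1.0, 1.0, 0.0, 0.0])
pattern_b = np.array([0.0, 0.0, 1.0, 1.0])
data = np.array([pattern_a] * 10 + [pattern_b] * 10)

weights = train_som(data)
bmu_a = best_matching_unit(weights, pattern_a)
bmu_b = best_matching_unit(weights, pattern_b)
print("pattern A maps to unit", bmu_a, "and pattern B to unit", bmu_b)
```

Regions with similar binding patterns end up on the same or neighboring units, so each unit of the trained map can be read as one candidate colocalization pattern.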
Personalized medicine aims to assess medical risks, monitor, diagnose and treat patients according to their specific genetic composition and molecular phenotype. The advent of genome sequencing and the analysis of physiological states has proven to be powerful (Cancer Genome Atlas Research Network, 2011). However, its implementation for the analysis of otherwise healthy individuals for estimation of disease risk and medical interpretation is less clear. Much of the genome is difficult to interpret and many complex diseases, such as diabetes, neurological disorders and cancer, likely involve a large number of different genes and biological pathways (Ashley et al., 2010, Grayson et al., 2011, Li et al., 2011), as well as environmental contributors that can be difficult to assess. As such, the combination of genomic information along with a detailed molecular analysis of samples will be important for predicting, diagnosing and treating diseases as well as for understanding the onset, progression, and prevalence of disease states (Snyder et al., 2009).
Presently, healthy and diseased states are typically followed using a limited number of assays that analyze a small number of markers of distinct types. With the advancement of many new technologies, it is now possible to analyze upward of 10^5 molecular constituents. For example, DNA microarrays have allowed the subcategorization of lymphomas and gliomas (Mischel et al., 2003), and RNA sequencing (RNA-Seq) has identified breast cancer transcript isoforms (Li et al., 2011, van der Werf et al., 2007, Wu et al., 2010, Lapuk et al., 2010). Although transcriptome and RNA splicing profiling are powerful and convenient, they provide a partial portrait of an organism’s physiological state. Transcriptomic data, when combined with genomic, proteomic, and metabolomic data are expected to provide a much deeper understanding of normal and diseased states (Snyder et al., 2010). To date, comprehensive integrative omics profiles have been limited and have not been applied to the analysis of generally healthy individuals.
To obtain a better understanding of: (1) how to generate an integrative personal omics profile (iPOP) and examine as many biological components as possible, (2) how these components change during healthy and diseased states, and (3) how this information can be combined with genomic information to estimate disease risk and gain new insights into diseased states, we performed extensive omics profiling of blood components from a generally healthy individual over a 14-month period (24 months total when including time points with other molecular analyses). We determined the whole-genome sequence (WGS) of the subject, and together with transcriptomic, proteomic, metabolomic, and autoantibody profiles, used this information to generate an iPOP. We analyzed the iPOP of the individual over the course of healthy states and two viral infections (Figure 1A). Our results indicate that disease risk can be estimated by a whole-genome sequence and that, by regularly monitoring health states with iPOP, disease onset may also be observed. The wealth of information provided by detailed longitudinal iPOP revealed unexpected molecular complexity, which exhibited dynamic changes during healthy and diseased states, and provided insight into multiple biological processes. Detailed omics profiling coupled with genome sequencing can provide molecular and physiological information of medical significance. This approach can be generalized for personalized health monitoring and medicine.
NEW YORK (GenomeWeb News) – The funding squeeze from the sequestration of the US federal budget, now more than half-a-year old, has already had a sizable impact at the National Human Genome Research Institute, leading to cuts to ongoing programs, scaling back of new ones, and the deferring of efforts that have not yet launched.
The five percent cut in funding this year at NHGRI has led not only to trimmed-down renewal grants and fewer, smaller awards broadly, but also has chopped the budget for some of the institute’s important programs, according to NHGRI Director Eric Green.
The programs that have had their funding reduced, or in one case delayed, include the ENCODE (Encyclopedia of DNA Elements) program, projects focused on using genome sequencing in newborns and in clinical medicine, and other initiatives, Green said in his Director’s Report to the National Advisory Council on Human Genomics Research this week.
In addition, many renewal grants have been trimmed, and there are “numerous examples of detrimental cuts” to the institute’s intramural research program, said Green. These cuts to large and small NHGRI programs come at a pivotal time for genomics, he noted, as the products of such research are beginning to translate into clinical possibilities.
“It is tragic. [That] is the word I would use,” Green told GenomeWeb Daily News this week.
“[The field of genomics] is just so exciting. There are so many opportunities,” he said. “This is precisely the time that we should be pushing the accelerator hard, and we just cannot do it because we don’t have enough fuel in our fuel tank.
“It’s frustrating. I think the opportunities now are just spectacular,” said Green. “It’s tragic because it is just so obvious that we could do some remarkable things in genomics and we are not being able to do it.”
ENCODE, a decade-old flagship project at NIH that aims to identify all of the functional elements in the human genome, had its budget reduced by 16 percent.
The Genomic Sequencing and Newborn Screening Disorders program was cut by half, which left the program to fund fewer research projects than planned and its research consortium to go forward without the benefit of a data coordinating center. This new initiative, an effort to support pioneering studies on how sequencing might be used in the care of newborns and in neonatal care that was created jointly with the Eunice Kennedy Shriver National Institute of Child Health and Human Development, had its budget cut from $10 million to $5 million.
The Genomic Medicine Pilot Demonstration Projects program had its budget cut by 20 percent, and NHGRI’s Bioinformatics Resources and Analysis Research Portfolio had $5 million sliced out of its budget. The new Genomics of Gene Regulation (GGR) request for applications was bumped out of this funding year entirely, and has been delayed until 2014, according to Green.
Because the sequestration plan was concocted and agreed to well in advance of its arrival earlier this year, Green told GWDN that the institute did have some time to try to react to the sequestration and mitigate the pain from the cuts, spreading them around fairly and evenly while maintaining priorities. He said leadership at the institute tried to prepare for the possibility of sequestration by being conservative in its planning.
Programs that were already ongoing, like ENCODE, were likely to take priority over those that were not yet launched, like GGR, in part because the infrastructure is already in place for ongoing projects and because it is easier to plan for how they operate and generate outputs, like data.
“With ENCODE you know for every million dollars you invest you get so much back,” said Green. “With a program like newborn sequencing … we don’t totally know what it’s going to look like or play out like. We won’t know what we are missing because we won’t be able to launch it to the scale that we wanted to launch it originally.”
Green said some of the projects being cut or delayed were created under NHGRI’s strategic plan, a program it laid out in 2011 that involves restructuring of the institute’s divisions and some shifting in its research portfolio to include more efforts in applying genomics to medicine and healthcare.
“Some of these RFAs that we delayed really represent key elements that we started to anticipate two years ago,” said Green. “We knew we wanted to do more in sequencing, we knew we wanted to do some pilot projects in genomic medicine. We knew we wanted to continue to accelerate efforts in understanding how the genome works … ENCODE, GGR, and so forth. It just had to be slowed down,” he said.
Anastasia Wise, program director for the Genomic Sequencing and Newborn Screening Disorders program, told GWDN that the program was supposed to be much larger than the $5 million in awards unveiled last week, which funded a consortium of four research projects.
Wise said NHGRI and NICHD were each initially planning to provide double the amount of funding they were actually awarded, which is now expected to be a total of $25 million over five years, although that total could be subject to the availability of funding.
“There were definitely more scientifically meritorious applications than we were able to fund,” she said. “Even the four awards that we made ended up being cut an additional five percent because of the sequestration.”
She said the program “wanted to be able to make more awards, and we wanted to be able to fund a coordinating center to be able to bring the network together and help provide some harmonization of data and coordination of logistics between the different members of the consortium,” but it was unable to fund that part of the effort.
Although the fractured fiscal culture in Washington engenders caution at NHGRI as the agency looks forward, Green sees many scientific opportunities right now, as genomics begins to hit the clinic.
“Some people are saying we are not even going fast enough,” he said. “Lots of people have been discussing what the world is going to look like when somebody gets their genome sequenced in the newborn period, and [they] think about what the implications of that are for the patient for the rest of their lives. We want to start studying this,” he said.
“And we are starting to … but we’re not starting as aggressively as we wanted to,” Green said. “I mean, we took a big hit this year.”
Matt Jones is a staff reporter for GenomeWeb Daily News. He covers public policy, legislation, and funding issues that affect researchers in the genomics field, as well as the operations of research institutes.
Sequencing became a household name. In the 2000s, it was thought to be the key to the Pandora’s box of cures. Then the completion of the Human Genome Project showed that there are fewer genes than expected. This outcome prompted yet another set of sequencing programs and collaborations around the world, such as the Human Protein Project, Human Microorganisms Projects, ENCODE, transcriptome sequencing consortia, etc.
It is in human nature to believe in magic and illusion. The strength of biological diversity and the complex mechanisms of expression may challenge the setup of a simple but informative specific assay. Thus, a new field is developing that combines the rules of biology with mathematical formulas: bioinformatics, also called computational biology. It aims to predict transcription start and termination sites, exon boundaries, and possible binding sites of transcription regulators for chromatin modification activities, such as histone acetylases and enhancer- and insulator-associated factors, based on the human genome sequence. Underlying this is the assumption that the sequence contains signatures for chromatin modifications essential for gene regulation and development.
There are three primary colors, red, yellow and blue; yet an artist can create many shades. Recently, scientists have been combining and organizing more data to make sense of our blueprint of life, which transfers information from generation to generation, in the hope of curing the diseases of humankind.
Analyzing the genome and transcriptome opened the door. These studies suggested that all eukaryotic cells have a rich portfolio of RNAs. Among these, long non-coding RNAs have an impact on protein-coding gene expression, regulating multiple processes, including epigenetic gene expression.
Epigenetics, stemness and non-coding RNAs play a great role in manipulating and correcting gene expression, not only in the proper cell type but also at the proper location and time within the genome, without disturbing the host.
The main concern is the differentiation of embryonic stem cells under these epigenetic influences. The best known post-translational modifications, which include methylation, acetylation, ubiquitination, and SUMOylation of lysine residues, methylation of arginine residues, and phosphorylation of serines, occur on histone tails. “Epi” means “top” or “above,” so this mechanism gives a new direction to the genetic pathways for as long as the organism lives and may, over time, lead to evolution. It is critical to show the complexity of the mechanism and the relativity of a gene’s role with a single example for each.
For example, DNA methylation occurs mostly on cytosine residues in CpG islands, which are usually located in promoter regions associated with tissue-specific gene expression. However, there are many other forms of DNA methylation, such as monoallelic methylation in gene imprinting and inactivation of the X chromosome, and methylation in repetitive elements, like transposons. There are two main mechanisms, but this is not our main topic. Yet Myc and hypoxia-inducible factor-1α work differently from certain methyl-CpG-binding proteins, such as MBD1, MBD2, MBD4, MeCP2, and Kaiso.
Stemness is an important factor for an intervention to correct a pathological condition. In terms of epigenetics, regulation and non-coding RNA, vascular endothelial growth factor A (VEGF-A) is an interesting example for the differentiation of endothelial cells and morphogenesis of the vascular system during development, for several reasons: epigenetics, gene interactions, time and space. Everything has to be just right, because neither too little nor too much can fulfill the destiny of becoming a complete adult cell or organism. For example, both having only one VEGF-A allele and having a two-fold excess of VEGF-A result in death during early embryogenesis, since the mice cannot develop a proper vascular network. However, explaining the diverse mechanisms and functions of VEGF-A requires more information with specific details. VEGF-A plays many roles in many pathological conditions, such as cancer, inflammation, retinopathies, and arthritis, because VEGF-A also functions in epigenetic reprogramming of the promoter regions of the Rex1 and Oct4 genes, which are critical for a stem cell. The preferred mechanism is an anti-angiogenic state, but tumor cells prefer hypermethylation to induce a pro-angiogenic state; thus VEGF-A stimulates PIGF in tumor cells, among many other factors.
Now, let’s turn to the development of a cell with Polycomb repressive complexes (PRCs), because they are important chromatin regulators of embryonic stem (ES) cell function. Originally, RYBP was shown to function as a transcriptional repressor in reporter assays, both in tissue culture cells and in the fruit fly (Drosophila melanogaster), and as a direct interactor with Ring1A during embryogenesis through methylation. In addition, RYBP is implicated in epigenetic resetting during preimplantation development through repression of germ line genes and PcG targets before the formation of pluripotent epiblast cells. However, I do believe that the most important element is the efficient repression of endogenous retroviruses (the murine endogenous retrovirus class called MuERV), of genes specific to the preimplantation period, which contains the zygotic genome activation stage, and of germ line-specific genes. The selective repressor activity of RYBP is in the ES cell state. When RYBP−/− ES cells were analyzed by measuring gene expression during differentiation, as embryoid bodies formed from mutant and wild-type cells, the results showed that expression of the pluripotency genes Oct4 and Nanog was usually downregulated. However, RYBP is able to bind genomic regions independently of H3K27me3, and there is no relation between altered RYBP binding in Dnmt1-mutant cells and DNA methylation status. In sum, RYBP has great value in undifferentiated ES cells and may affect or even reset the epigenetic landscape during early developmental stages. These are the gaps filled by long non-coding RNAs.
We learn more compelling information by comparing and contrasting what is normal with what is abnormal. As a result, pathology is a key learning canvas for basic mechanisms in molecular genetics. Peppering it with functional genomics completes the story for a digestible outcome. We generally refer to this as translational research. For example, recent findings from a review of the Oncomine resource suggest that H19 contributes to cancer, including hepatocellular carcinoma (HCC). According to these observations, in most HCC cases H19 expression is lower than in the liver. Thus, in vitro and in vivo studies were undertaken, with classical genetic analyses based on loss and gain of H19 function, to characterize two outcomes that depend on H19: the effects on gene expression and on HCC metastasis. First, H19 expression varied, being lower in tumor cells than in peripheral tumor cells. Second, H19 affected cancer metastasis by altering the miR-200 pathway, which contributes to the mesenchymal-to-epithelial transition. Therefore, H19 and miR-200 are targets to be utilized in developing molecular diagnostics and establishing targeted therapies in cancer.
Long story short, there is a circle of life in which everything is connected even though the parts look different. As a result, when we see a sunflower or a baby we remember to smile, because life is still an act that puzzles humans.
“OCT4 establishes and maintains nucleosome-depleted regions that provide additional layers of epigenetic regulation of its target genes” Proc. Natl. Acad. Sci. USA 2011 108:14497-14502
“The CHD3 Chromatin Remodeler PICKLE and Polycomb Group Proteins Antagonistically Regulate Meristem Activity in the Arabidopsis Root” Plant Cell 2011 23:1047-1060
In a three part series: Part IIA. CRACKING THE CODE OF HUMAN LIFE: Milestones along the Way Part IIB. CRACKING THE CODE OF HUMAN LIFE: The Birth of BioInformatics & Computational Genomics Part IIC. CRACKING THE CODE OF HUMAN LIFE: Recent Advances in Genomic Analysis and Disease
Part III will conclude with Ubiquitin, its Role in Signaling and Regulatory Control. Part I reviewed the huge expansion of the biological research enterprise after the Second World War. It concentrated on the
discovery of cellular structures,
metabolic function, and
creation of a new science of Molecular Biology.
Part II follows the race to delineate the human genome, along with discovery methods and fundamental genomic patterns that are ancient in both animal and plant speciation. It also explores both the complexity and the systems view of the architecture that underlies an understanding of the genome.
These articles review a web-like connectivity between inter-connected scientific discoveries, as significant findings have led to novel hypotheses and many expectations over the last 75 years. This largely post WWII revolution has driven our understanding of biological and medical processes at an exponential pace owing to successive discoveries of
chemical structure,
the basic building blocks of DNA and proteins,
nucleotide and protein-protein interactions,
protein folding, allostericity,
genomic structure,
DNA replication,
nuclear polyribosome interaction, and
metabolic control.
In addition, the research framework has been transformed by the emergence of methods for
copying,
removal,
insertion,
as well as improvements in structural analysis and
developments in applied mathematics.
Part IIA:
CRACKING THE CODE OF HUMAN LIFE:
Milestones along the Way
A NOVA interview with Francis Collins (NHGRI) (FC), J. Craig Venter (CELERA) (JCV), and Eric Lander (EL).
RK: For the past ten years, scientists all over the world have been painstakingly trying to read the tiny instructions buried inside our DNA. And now, finally, the “Human Genome” has been decoded.
EL: The genome is a storybook that’s been edited for a couple billion years.
The following will address the odd similarity of genes between man and yeast.
EL: In the nucleus of your cell resides the DNA molecule, which is about 10 angstroms wide and curled up. The amount of curling is limited by the negative charges that repel one another, but there are folds upon folds. If the DNA were stretched out, its length would be thousands of feet.
EL: We have known for 2000 years that your kids look a lot like you. Well it’s because you must pass them instructions that give them the eyes, the hair color, and the nose shape they have.
RK: Cracking the code of those minuscule differences in DNA that influence health and illness is what the Human Genome Project is all about. Since 1990, scientists all over the world have been involved in the effort to read all three billion As, Ts, Gs, and Cs of human DNA. It took 10 years to find the one genetic mistake that causes cystic fibrosis. Another 10 years to find the gene for Huntington’s disease. Fifteen years to find one of the genes that increase the risk for breast cancer. One letter at a time, painfully slowly… And then came the revolution. In the last ten years the entire process has been computerized. The computations can do a thousand every second and that has made all the difference.
EL: This is basically a parts list with a lot of parts. If you take an airplane, a Boeing 777, I think it has like 100,000 parts. If I gave you a parts list for the Boeing 777 in one sense you’d know 100,000 components, screws and wires and rudders and things like that. But you wouldn’t know how to put it together, or why it flies. We now have a parts list, and that’s not enough to understand why it flies.
A Quest For Clarity
Tracy Vence is a senior editor of Genome Technology Tracy Vence @GenomeTechMag Projects supported by the US National Institutes of Health will have produced 68,000 total human genomes — around 18,000 of those whole human genomes — through the end of this year, National Human Genome Research Institute estimates indicate. And in his book, The Creative Destruction of Medicine, the Scripps Research Institute’s Eric Topol projects that 1 million human genomes will have been sequenced by 2013 and 5 million by 2014. Daniel MacArthur, a group leader in Massachusetts General Hospital’s Analytic and Translational Genetics Unit estimates that “From a capacity perspective … millions of genomes are not that far off. If you look at the rate that we’re scaling, we can certainly achieve that.” The prospect of so many genomes has brought clinical interpretation into focus. But there is an important distinction to be made between the interpretation of an apparently healthy person’s genome and that of an individual who is already affected by a disease. In an April Science Translational Medicine paper, Johns Hopkins University School of Medicine‘s Nicholas Roberts and his colleagues reported that personal genome sequences for healthy monozygotic twin pairs are not predictive of significant risk for 24 different diseases in those individuals. The researchers concluded that whole-genome sequencing was not likely to be clinically useful. Ambiguities have clouded even the most targeted interpretation efforts.
Technological challenges,
meager sample sizes,
a need for increased, fail-safe automation and, most important,
a lack of community-wide standards for the task
have hampered researchers’ attempts to reliably interpret the clinical significance of genomic variation.
How signals from the cell surface affect transcription of genes in the nucleus.
James Darnell, Jr., MD, Astor Professor, Rockefeller University. After graduation from Washington University School of Medicine he worked with Francois Jacob at the Pasteur Institute in Paris, and he served as Vice President for Academic Affairs at Rockefeller in 1990-91. He is the coauthor with S.E. Luria of General Virology and the founding author with Harvey Lodish and David Baltimore of Molecular Cell Biology, now in its sixth edition. His book RNA, Life’s Indispensable Molecule was published in July 2011 by Cold Spring Harbor Laboratory Press. He has been a member of the National Academy of Sciences since 1973 and is the recipient of numerous awards, including the 2003 National Medal of Science and the 2002 Albert Lasker Award. Using interferon as a model cytokine, the Darnell group discovered that cell transcription was quickly changed by binding of cytokines to the cell surface. The bound interferon led to the tyrosine phosphorylation of latent cytoplasmic proteins now called STATs (signal transducers and activators of transcription) that
dimerize by reciprocal phosphotyrosine-SH2 interchange,
accumulate in the nucleus,
bind DNA and drive transcription.
This pathway has proved to be of wide importance with seven STATs now known in mammals that take part in a wide variety of developmental and homeostatic events in all multicellular animals. Crystallographic analysis defined functional domains in the STATs, and current attention is focused on two areas:
how the STATs complete their cycle of activation and inactivation, which requires regulated tyrosine dephosphorylation; and how
persistent activation of STAT3 that occurs in a high proportion of many human cancers contributes to blocking apoptosis in cancer cells.
Current efforts are devoted to inhibiting STAT3 with modified peptides that can enter cells.
Cell cycle regulation and the cellular response to genotoxic stress
Stephen J. Elledge, PhD, Gregor Mendel Professor of Genetics and Medicine, Investigator, Howard Hughes Medical Institute, Harvard Medical School. As a postdoctoral fellow at Stanford working on eukaryotic homologous recombination, he serendipitously found a family of genes known as ribonucleotide reductases. He subsequently showed that
these genes are activated by DNA damage and
could serve as tools to help scientists dissect the signaling pathways
through which cells sense and respond to DNA damage and replication stress.
At Baylor College of Medicine he made a second major breakthrough with the discovery of the cyclin-dependent kinase 2 gene (Cdk2), which
controls the G1-to-S cell cycle transition,
an entry checkpoint for the cell proliferation cycle and
a critical regulatory step in tumorigenesis.
From there, using a novel “two-hybrid” cloning method he developed, Elledge and Wade Harper, PhD, proceeded to
isolate several members of the Cdk2-inhibitory family.
Their discoveries included the p21 and p57 genes; mutations in the latter are responsible for Beckwith-Wiedemann syndrome, which is characterized by somatic overgrowth and increased cancer risk. Elledge is also recognized for his work in understanding
proteome remodeling through ubiquitin-mediated proteolysis.
They identified F-box proteins that regulate protein degradation in the cell by
binding to specific target protein sequences and then
marking them with ubiquitin for destruction by the cell’s proteasome machinery.
This breakthrough resulted in
the elucidation of the cullin ubiquitin ligase family,
which controls regulated protein stability in eukaryotes.
Elledge’s recent research has focused on the cellular mechanisms underlying DNA damage detection and cancer using genetic technologies. In collaboration with Cold Spring Harbor Laboratory researcher Gregory Hannon, PhD, Elledge has generated complete human and mouse short hairpin RNA (shRNA) libraries for genome-wide loss-of-function studies. Their efforts have led to
the identification of a number of tumor suppressor proteins and
genes upon which cancer cells uniquely depend for survival.
This work led to the development of the “non-oncogene addiction” concept.
Playing the dual roles of inventor and investigator, Elledge developed original techniques to define
what drives the cell cycle and
how cells respond to DNA damage.
By using these tools, he and his colleagues have identified multiple genes involved in cell-cycle regulation.
Elledge’s work has earned him many awards, including a 2001 Paul Marks Prize for Cancer Research and a 2003 election to the National Academy of Sciences. In his Inaugural Article (1), published in this issue of PNAS, Elledge and his colleagues describe the function of Fbw7, a protein involved in controlling cell proliferation (see below). For his PhD thesis at MIT, Elledge studied the error-prone DNA repair mechanism in Escherichia coli (E. coli) called SOS mutagenesis. His work identified and described
the regulation of a group of enzymes now known as error-prone polymerases,
the first members of which were the umuCD genes in E. coli.
It was then that he developed a new cloning tool. Elledge invented a technique that allowed him to approach future cloning problems of this type with great rapidity. With the new technique, “you could make large libraries in lambda that behave like plasmids. We called them ‘phasmid’ vectors, like plasmid and phage together.” The phasmid cloning method was an early cornerstone for molecular biology research.
Elledge began working on homologous recombination, an important niche in the field of eukaryotic genetics, during a postdoctoral fellowship at Stanford University. Working with the yeast genome, Elledge searched for recA, a gene that allows DNA to recombine homologously. Although he never located recA, he discovered a family of genes known as ribonucleotide reductases (RNRs), which are involved in DNA production. RecA and RNRs share the same last 4 amino acids, which caused an antibody cross-reaction in one of Elledge’s experiments. Initially disappointed with the false positives in his hunt for recA, Elledge was later delighted with his luck. He found that
RNRs are turned on by DNA damage, and
these genes are regulated by the cell cycle.
Prior to leaving Stanford, Elledge attended a talk at the University of California, San Francisco, by Paul Nurse, a leader in cell-cycle research who would later win the 2001 Nobel Prize in Physiology or Medicine. Nurse described his success in isolating the homolog of a key human cell-cycle kinase gene, Cdc2, by using a mutant strain of yeast (8). Although Nurse’s methods were primitive, Elledge was struck by the message he carried: that
cell-cycle regulation was functionally conserved, and
many human genes could be isolated by looking for complementary genes in yeast.
Elledge then took advantage of his past successes in building phasmid vectors to build a versatile human cDNA library that could be expressed in yeast. After setting up a laboratory at Baylor, he introduced this library into yeast, screening for complementary cell-cycle genes. He quickly identified the same Cdc2 gene isolated by Nurse. However, Elledge also discovered a related gene known as Cdk2. Elledge subsequently found that
Cdk2 controlled the G1 to S cell-cycle transition, a step that often goes awry in cancer. These results were published in the EMBO Journal in 1991.
He then continued to use
RNRs to perform genetic screens to
identify genes involved in sensing and responding to DNA damage.
He subsequently worked out the
signal transduction pathways in both yeast and humans that recognize damaged DNA and replication problems.
These “checkpoint” pathways are central to the
prevention of genomic instability and a key to understanding tumorigenesis.
This contribution is part of the special series of Inaugural Articles by members of the National Academy of Sciences elected on April 29, 2003.
Defective cardiovascular development and elevated cyclin E and Notch proteins in mice lacking the Fbw7 F-box protein.
The mammalian F-box protein Fbw7 and its Caenorhabditis elegans counterpart Sel-10 have been implicated in
the ubiquitin-mediated turnover of cyclin E
as well as the Notch/Lin-12 family of transcriptional activators. Both unregulated Notch and cyclin E promote tumorigenesis, and inactivating mutations in human Fbw7 suggest that it may be a tumor suppressor. To generate an in vivo system to assess the consequences of such unregulated signaling, we generated mice deficient for Fbw7. Fbw7-null mice die around 10.5 days post coitus because of a combination of deficiencies in hematopoietic and vascular development and heart chamber maturation. The absence of Fbw7 results in elevated levels of cyclin E, concurrent with inappropriate DNA replication in placental giant trophoblast cells. Moreover, the levels of both Notch 1 and Notch 4 intracellular domains were elevated, leading to stimulation of downstream transcriptional pathways involving Hes1, Herp1, and Herp2. These data suggest essential functions for Fbw7 in controlling cyclin E and Notch signaling pathways in the mouse.
Science as an Adventure
Ubiquitins
Prof. Avram Hershko – Science as an Adventure
Prof. Avram Hershko shared the 2004 Nobel Prize in Chemistry with Aaron Ciechanover and Irwin Rose “for the discovery of ubiquitin-mediated protein degradation.”
Nipam Patel is a professor in the Departments of Molecular and Cell Biology and Integrative Biology at UC Berkeley and runs a research laboratory that studies the role, during embryonic development, of homeotic genes (the genetic switches described in this feature). “Ghost in Your Genes” focuses on epigenetic “switches” that turn genes “on” or “off.” But not all switches are epigenetic; some are genetic. That is, other genes within the chromosome turn genes on or off. In an animal’s embryonic stage, these gene switches play a predominant role in laying out the animal’s basic body plan and perform other early functions;
the epigenome begins to take over during the later stages of embryogenesis.
Beginning as a single fertilized egg, an organism develops many different kinds of cells. Altogether, multicellular organisms like humans have thousands of differentiated cells. Each is optimized for use in the brain, the liver, the skin, and so on. Remarkably, the DNA inside all these cells is exactly the same. What makes the cells differ from one another is that different genes in that DNA are either turned on or off in each type of cell.
Take a typical cell, such as a red blood cell. Each gene within that cell has a coding region that encodes the information used to make a particular protein. (Hemoglobin shuttles oxygen to the tissues and carbon dioxide back out to the lungs—or gills, if you’re a fish.) But another region of the gene, called “regulatory DNA,” determines whether and when the gene will be expressed, or turned on, in a particular kind of cell. This precise transcribing of genes is handled by proteins known as transcription factors, which bind to the regulatory DNA, thereby generating instructions for the coding region.
One important class of transcription factors is encoded by the so-called homeotic, or Hox, genes. Found in all animals, Hox genes act to “regionalize” the body along the embryo’s anterior-to-posterior (head-to-tail) axis. In a fruit fly, for example, Hox genes lay out the various main body segments: the head, thorax, and abdomen. Amazingly, all animals, from fruit flies to mice to people, rely on the same basic Hox-gene complex. Using different-colored antibody stains, we can see exactly where and to what degree Hox genes are expressed. Each Hox gene is expressed in a specific region along the anterior-to-posterior axis of the embryo.
A fly’s body has three main divisions: head, thorax, and abdomen. We’ll focus on the thorax, which itself has three main segments. In a normal adult fly, the second thoracic segment features a pair of wings, while the third thoracic segment has a pair of small, balloon-shaped structures called halteres. A modified second wing, the haltere serves as a flight stabilizer. In order for the pair of wings and the pair of halteres (as well as all other parts of the fly) to develop properly, the fly’s suite of
Hox genes must be expressed in a precise way and at precise times.
During development, the fly’s two wings grow from a structure in the larva known as the wing imaginal disk. (An imago is an insect in its final, adult state.) The haltere grows from the larval haltere imaginal disk. Remember the Ubx Hox gene? Using staining again, we can detect the gene product of Ubx. This reveals that
the Ubx gene is naturally “off” in the wing disk—
and is “on” in the haltere disk.
Now you’ll see what happens when the Ubx gene, just one of a large number of Hox genes, is turned off in the haltere disk. What if a genetic mutation caused the Ubx gene to be turned off, during the larval stage, in the third thoracic segment, the segment that normally produces the haltere? Instead of a pair of halteres, the fly has a second set of wings. With the switch of that single Hox gene, Ubx, from on to off, the third thoracic segment becomes an additional second thoracic segment and the pair of halteres becomes a second pair of wings. This illustrates the remarkable ability of transcription factors like Ubx to control patterning as well as cell type during development.
ENCODE
A. Data Suggests “Gene” Redefinition
As part of a huge collaborative effort called ENCODE (Encyclopedia of DNA Elements), a research team led by Cold Spring Harbor Laboratory (CSHL) Professor Thomas Gingeras, PhD, publishes a genome-wide analysis of RNA messages, called transcripts, produced within human cells. Their analysis—one component of a massive release of research results by ENCODE teams from 32 institutes in 5 countries, with 30 papers appearing in 3 different high-level scientific journals—shows that three-quarters of the genome is capable of being transcribed. This indicates that nearly all of our genome is dynamic and active. It stands in marked contrast to consensus views prior to ENCODE’s comprehensive research efforts, which suggested that
only the small protein-encoding fraction of the genome was transcribed.
The vast amount of data generated with advanced technologies by Gingeras’ group and others in the ENCODE project changes the prevailing understanding of what defines a gene. The current outstanding question concerns
the nature and range of those functions. It is thought that these
“non-coding” RNA transcripts act something like components of a giant, complex switchboard, controlling a network of many events in the cell by
regulating the processes of
replication,
transcription
and translation
– that is, the copying of DNA and the making of proteins based on information carried by messenger RNAs. With the understanding that so much of our DNA can be transcribed into RNA comes the realization that there is much less space between what we previously thought of as genes, Gingeras points out.
The full ENCODE Consortium data sets can be freely accessed through
the ENCODE project portal as well as at the University of California at Santa Cruz genome browser,
the National Center for Biotechnology Information, and
the European Bioinformatics Institute.
Topic threads that run through several different papers can be explored via the ENCODE microsite page at http://Nature.com/encode.
Date: September 5, 2012. Source: Cold Spring Harbor Laboratory
1000 Genomes Project Team Reports on Variation Patterns
(from Phase I Data) October 31, 2012 GenomeWeb
In a study appearing online today in Nature, members of the 1000 Genomes Project Consortium presented an integrated haplotype map representing the genomic variation present in more than 1,000 individuals from 14 human populations. Using data on 1,092 individuals tested by
low-coverage whole-genome sequencing,
deep exome sequencing, and/or
dense genotyping,
the team looked at the nature and extent of the rare and common variation present in the genomes of individuals within these populations. In addition to population-specific differences in common variant profiles, for example, the researchers found distinct rare variant patterns within populations from different parts of the world — information that is expected to be important in interpreting future disease studies. They also encountered a surprising number of the variants that are expected to impact gene function, such as
non-synonymous changes,
loss-of-function variants, and, in some cases,
potentially damaging mutations.
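The variant classes listed above can be illustrated with a minimal Python sketch. It is not the 1000 Genomes Project's annotation pipeline; it only shows, under simplifying assumptions, how a single-codon change can be sorted into the three categories. The function name `classify_codon_change` and the category labels are hypothetical, and only a premature stop codon is treated here as loss-of-function.

```python
# Standard genetic code, with codons enumerated in TCAG order
# (first base, then second base, then third base).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_codon_change(ref_codon, alt_codon):
    """Classify a codon substitution (hypothetical helper, DNA alphabet).

    Simplification: only a gained stop codon counts as loss-of-function;
    every other amino-acid change is reported as non-synonymous.
    """
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "loss-of-function (stop gained)"
    return "non-synonymous"


if __name__ == "__main__":
    print(classify_codon_change("GAA", "GAG"))  # Glu -> Glu
    print(classify_codon_change("GAG", "GTG"))  # Glu -> Val
    print(classify_codon_change("TGG", "TGA"))  # Trp -> stop
```

Real annotation tools also account for splice sites, frameshifts and transcript isoforms, none of which are modeled in this sketch.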
ENCODE was designed to pick up where the Human Genome Project left off. Although that massive effort revealed the blueprint of human biology, it quickly became clear that the instruction manual for reading the blueprint was sketchy at best. Researchers could identify in its 3 billion letters many of the regions that code for proteins, but they make up little more than 1% of the genome, contained in around 20,000 genes. ENCODE, which started in 2003, is a massive data-collection effort designed to catalogue the
‘functional’ DNA sequences,
learn when and in which cells they are active and
trace their effects on how the genome is
packaged,
regulated and
read.
After an initial pilot phase, ENCODE scientists started applying their methods to the entire genome in 2007. That phase came to a close with the publication of 30 papers, in Nature, Genome Research and Genome Biology. The consortium has assigned some sort of function to roughly 80% of the genome, including
more than 70,000 ‘promoter’ regions — the sites, just upstream of genes, where proteins bind to control gene expression —
and nearly 400,000 ‘enhancer’ regions that regulate expression of distant genes (see page 57) (1). But the job is far from done.
Overall, the ENCODE data define regulatory switches that are scattered all over the three billion nucleotides of the genome. In fact, the data suggest that
the regions that lie between gene-coding sequences contain a wealth of previously unrecognized functional elements, including
nonprotein-coding RNA transcribed sequences,
transcription factor binding sites,
chromatin structural elements, and
DNA methylation sites.
The combined results suggest that 95% of the genome lies within 8 kb of a DNA-protein interaction, and 99% lies within 1.7 kb of at least one of the biochemical events, the researchers say. Importantly, given the complex three-dimensional nature of DNA, it’s also apparent that
a regulatory element for one gene may be located quite some ‘linear’ distance from the gene itself.
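A statistic such as "95% of the genome lies within 8 kb of a DNA-protein interaction" can be thought of as an interval-coverage calculation. The sketch below is only an illustration of that idea on a toy linear genome, not the consortium's actual analysis; the function name `fraction_within` and the toy coordinates are assumptions made for the example.

```python
def fraction_within(genome_length, sites, max_distance):
    """Fraction of positions in [0, genome_length) that lie within
    max_distance of at least one site.

    sites: half-open (start, end) intervals on a single linear sequence.
    """
    # Extend every site by max_distance on both sides, clipped to the genome.
    windows = sorted(
        (max(0, start - max_distance), min(genome_length, end + max_distance))
        for start, end in sites
    )
    covered = 0
    cur_start = cur_end = None
    for start, end in windows:
        if cur_end is None or start > cur_end:
            # Close the previous merged window and start a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent windows are merged.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / genome_length


if __name__ == "__main__":
    # Toy genome of 100 positions with two binding sites.
    print(fraction_within(100, [(10, 12), (50, 55)], max_distance=5))  # 0.27
```

Merging the extended windows before summing avoids double-counting positions that are close to more than one site.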
“The information processing and the intelligence of the genome reside in the regulatory elements,” explains Jim Kent, director of the University of California, Santa Cruz Genome Browser project and head of the Encode Data Coordination Center. “With this project, we probably went from understanding less than 5% to now around 75% of them.” The ENCODE results also identified SNPs within regulatory regions that are associated with a range of diseases, providing new insights into the roles that
noncoding DNA plays in disease development.
“As much as nine out of 10 times, disease-linked genetic variants are not in protein-coding regions,” comments Mike Pazin, Encode program director at the National Human Genome Research Institute. “Far from being junk DNA, this regulatory DNA clearly makes important contributions to human disease.”
Other Related Articles on this Open Access Online Scientific Journal, include the following:
Impact of evolutionary selection on functional regions: The imprint of evolutionary selection on ENCODE regulatory elements is manifested between species and within human populations s Saha
Sequencing of the human genome via massive programs such as the Cancer Genome Atlas Program (CGAP) and the Encyclopedia of DNA Elements (ENCODE) consortium, in conjunction with considerable bioinformatics efforts led by the National Center for Biotechnology Information (NCBI), has unlocked a myriad of yet unclassified genes (for a good review see (2)). The ENCODE project encompasses 32 institutions worldwide which, so far, have generated 1640 data sets, initially depending on microarray platforms but now moving to the more cost-effective new sequencing technology. Initially the ENCODE project focused on three types of cells: an immature white blood cell line GM12878, the leukemic line K562, and an approved human embryonic cell line H1-hESC. The analysis was rapidly expanded to another 140 cell types. DNA sequencing had revealed 20,687 known coding regions with hints of 50 more coding regions. Another 11,224 DNA stretches were classified as pseudogenes. The ENCODE project reveals that many genes encode an RNA rather than a protein product, the so-called regulatory RNAs.
However, some of the most recent and interesting results focus on the noncoding regions of the human genome, previously discarded as uninteresting or “junk” DNA. Only 2% of the human genome contains coding regions, while the remaining 98%, the noncoding part of the genome, is actually found to be highly active “with about 4 million constantly communicating switches” (3). Some of these “switches” in the noncoding portion contain small, repetitive elements which are mobile throughout the genome and can control gene expression and/or predispose to diseases such as cancer. These mobile elements, found in almost all organisms, are classified as transposable elements (TEs), which insert themselves into far-reaching regions of the genome. Retrotransposons are capable of generating new insertions through RNA intermediates. These transposable elements are normally kept immobile by epigenetic mechanisms (4-6); however, some TEs can escape epigenetic repression and insert into areas of the genome, a process described as insertional mutagenesis because it can lead to the gene alterations seen in disease (7). In addition, this insertional mutagenesis can lead to the transformation of cells and, as described in Post 2, act as a model system to determine drivers of oncogenesis. Insertional mutagenesis is a different mechanism from the genetic alterations and rearrangements seen in cancer such as recombination and the fusion of gene fragments, as seen with the Philadelphia chromosome and the BCR/ABL fusion protein (8). The mechanism of transposition and its putative effects leading to mutagenesis are described in the following figure:
Figure. Insertional mutagenesis based on a transposon-mediated mechanism. A) The basic structure of a transposon contains a gene/sequence flanked by two inverted repeats (IR) and/or direct repeats (DR). An enzyme, the transposase (red hexagon), binds and cuts at the IR/DR, and the transposon is pasted at another site in the DNA containing an insertion site. B) Multiple transpositions may result in oncogenic events by inserting in promoters, leading to altered expression of genes driving oncogenesis, or by inserting within coding regions and inactivating tumor suppressors or activating oncogenes. Deep sequencing of the resultant tumor genomes (based on nested PCR from IR/DRs) may reveal common insertion sites (CIS), from which oncogenic mutations could be identified.
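The idea of a common insertion site (CIS) mentioned in the figure legend can be sketched in a few lines of Python. This is a deliberately simplified illustration that assumes fixed-size genomic bins and a hypothetical input format of (tumor, chromosome, position) tuples; published CIS analyses generally rely on statistical models rather than a plain count.

```python
from collections import defaultdict


def common_insertion_sites(insertions, window=10_000, min_tumors=2):
    """Report fixed-size windows hit by insertions in several tumors.

    insertions: iterable of (tumor_id, chromosome, position) tuples
    (a hypothetical format chosen for this sketch).
    Returns sorted (chromosome, window_start, window_end, n_tumors) tuples.
    """
    tumors_per_bin = defaultdict(set)
    for tumor_id, chrom, pos in insertions:
        # Count each tumor at most once per window.
        tumors_per_bin[(chrom, pos // window)].add(tumor_id)
    return sorted(
        (chrom, bin_index * window, (bin_index + 1) * window, len(tumors))
        for (chrom, bin_index), tumors in tumors_per_bin.items()
        if len(tumors) >= min_tumors
    )


if __name__ == "__main__":
    toy_calls = [
        ("T1", "chr1", 12_500),
        ("T2", "chr1", 17_900),
        ("T3", "chr1", 250_000),
        ("T1", "chr2", 12_600),
    ]
    print(common_insertion_sites(toy_calls))
    # [('chr1', 10000, 20000, 2)]
```

Fixed bins can split a true cluster across a window boundary, so the window size and the `min_tumors` threshold are tuning choices rather than established values.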
In a bioinformatics study, Eunjung Lee et al. (1), in collaboration with the Cancer Genome Atlas Research Network, analyzed 43 high-coverage whole-genome sequencing datasets from five cancer types to determine transposable element insertion sites. Using a novel computational method, the authors identified 194 high-confidence somatic TE insertion sites present in cancers of epithelial origin such as colorectal, prostate and ovarian, but not in brain or blood cancers. Sixty-four of the 194 detected somatic TE insertions were located within 62 annotated genes. Genes with TE insertions in colon cancers commonly have high mutation rates, and the enriched genes were associated with cell adhesion functions (CDH12, ROBO2, NRXN3, FPR2, COL1A1, NEGR1, NTM and CTNNA2) or tumor suppressor functions (NELL1, ROBO2, DBC1, and PARK2). None of the somatic events were located within coding regions; the TE sequences were detected in untranslated regions (UTRs) or intronic regions. Previous studies had shown that insertion in these regions (UTR or intronic) can disrupt gene expression (9). Interestingly, most of the genes with insertion sites were down-regulated, as suggested by a recent paper showing that local changes in the methylation status of transposable elements can drive retrotransposition (10,11). Indeed, the authors found that somatic insertions are biased toward the hypomethylated regions in cancer cell DNA. The authors also confirmed that the insertion sites were unique to cancer and were somatic rather than germline (inherited) in origin by analyzing 44 normal genomes (41 normal blood samples from cancer patients and three healthy individuals).
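The somatic-versus-germline check described above amounts to removing tumor insertion calls that also appear in normal genomes. The following Python sketch shows that filtering step only; it is not the computational method used by Lee et al., and the function name, the (chromosome, position) call format and the 100 bp matching tolerance are all assumptions made for illustration.

```python
def somatic_insertions(tumor_calls, normal_calls, tolerance=100):
    """Keep tumor insertion calls with no nearby call in normal samples.

    tumor_calls, normal_calls: iterables of (chromosome, position) tuples.
    tolerance: maximum distance in base pairs at which two calls are
    treated as the same insertion (an assumed value).
    """
    normal_by_chrom = {}
    for chrom, pos in normal_calls:
        normal_by_chrom.setdefault(chrom, []).append(pos)
    return [
        (chrom, pos)
        for chrom, pos in tumor_calls
        if not any(
            abs(pos - normal_pos) <= tolerance
            for normal_pos in normal_by_chrom.get(chrom, [])
        )
    ]


if __name__ == "__main__":
    tumor = [("chr1", 1_000), ("chr1", 50_000), ("chr2", 1_000)]
    normal = [("chr1", 1_040)]
    print(somatic_insertions(tumor, normal))
    # [('chr1', 50000), ('chr2', 1000)]
```

Insertions seen in both the tumor and normal samples are treated as inherited and dropped, which is why the chr1:1,000 call disappears in the example.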
The authors conclude:
“that some TE insertions provide a selective advantage during tumorigenesis,
rather than being merely passenger events that precede clonal expansion(1).”
The authors also suggest that more bioinformatics studies, which utilize the expansive genomic and epigenetic databases, could determine functional consequences of such transposable elements in cancer. The following Post will describe how use of transposon-mediated insertional mutagenesis is leading to discoveries of the drivers (main genetic events) leading to oncogenesis.
1. Lee, E., Iskow, R., Yang, L., Gokcumen, O., Haseley, P., Luquette, L. J., 3rd, Lohr, J. G., Harris, C. C., Ding, L., Wilson, R. K., Wheeler, D. A., Gibbs, R. A., Kucherlapati, R., Lee, C., Kharchenko, P. V., and Park, P. J. (2012) Science 337, 967-971
A recent post by Dr. Margaret Baker entitled “Junk DNA codes for valuable miRNAs: non-coding DNA controls Diabetes” talks about how the ENCODE project is revealing new insights into the functions of non-coding regions of the human genome previously labeled as “junk DNA”. MicroRNAs (miRNAs), as stated by Dr. Baker, “are among the non-gene encoding sequences in the genome and have been shown to play a major post-transcriptional role in expression of multiple genes.”
The post has touched upon several aspects of miRNA including origin, function, and mechanism of action. This commentary is an extension of Dr. Baker’s post, expanding upon the mechanism of action of miRNAs along with their role in potential disease therapy.
microRNA: Revisiting the past
MicroRNAs were discovered relatively recently; in fact, it was in 1998 that the presence of non-coding RNAs that could be involved in switching certain genes ‘on’ and ‘off’ was recognized. In the last decade, the 2006 Nobel Prize in Physiology or Medicine was awarded to scientists Andrew Fire and Craig Mello for their discovery of this new role of RNA molecules.
A breakthrough study published in the September 2010 issue of Nature reported that mammalian microRNAs predominantly act by decreasing the levels of target mRNA. miRNAs were initially thought to repress protein output without changes in the corresponding mRNA levels. Guo et al. challenged this notion of ‘translational repression’ and concluded on the basis of their experimental results that ‘mRNA destabilization’ is responsible for the major part of the miRNA-mediated repression of protein expression. The authors used ‘ribosome profiling’ to measure the overall effects of miRNAs on protein production and then compared these to simultaneously measured effects on mRNA levels. Ribosome profiling maps the exact positions of ribosomes on transcripts after nucleases digest the exposed parts of transcripts that are not covered by ribosomes. miR-1 and miR-155, neither of which is normally expressed in HeLa cells, were introduced into the HeLa cell line. Another miRNA used was miR-223, which is expressed in significant amounts in neutrophils. These miRNAs were chosen because proteomics research had already shown that they repress protein levels. The authors found that miRNA-mediated repression was similar regardless of target expression level and further stated that “for both ectopic and endogenous miRNA regulatory interactions, lowered mRNA levels accounted for most (≥84%) of the decreased protein production.” These results show that changes in mRNA levels closely reflect the impact of miRNAs on gene expression and indicate that destabilization of target mRNAs is the predominant reason for reduced protein output.
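The "≥84%" figure compares two simultaneously measured effects: the change in mRNA level and the change in protein production. One simple way to express such a comparison, sketched below in Python, is the ratio of the mean log2 fold change in mRNA to the mean log2 fold change in protein production. This is an illustrative assumption about how the comparison could be computed, not the authors' published procedure, and the numbers in the example are invented for the sketch.

```python
from statistics import mean


def fraction_explained_by_mrna(mrna_log2fc, production_log2fc):
    """Share of the repression in protein production accounted for by
    reduced mRNA levels, computed as a ratio of mean log2 fold changes.

    Both arguments are per-target log2 fold changes (miRNA vs. control);
    the ratio-of-means definition is an assumption of this sketch.
    """
    production_change = mean(production_log2fc)
    if production_change == 0:
        raise ValueError("no change in protein production to explain")
    return mean(mrna_log2fc) / production_change


if __name__ == "__main__":
    # Invented values for two hypothetical miRNA targets.
    mrna = [-0.40, -0.44]        # mRNA levels fall
    production = [-0.48, -0.52]  # protein production falls slightly more
    share = fraction_explained_by_mrna(mrna, production)
    print(f"{share:.0%} of the repression is accounted for by mRNA loss")
```

A ratio close to 1 would mean that almost all of the drop in protein output mirrors the drop in mRNA, which is the pattern the quoted result describes.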
Authors concluded that the discovery “will apply broadly to the vast majority of miRNA targeting interactions. If indeed general, this conclusion will be welcome news to biologists wanting to measure the ultimate impact of miRNAs on their direct regulatory targets.”
Since then and even before the paper was published, several other miRNAs and their roles have been discovered. Information on miRNAs has been consolidated in a database that can be accessed online at http://www.mirbase.org/
The p53 gene is known as a tumor suppressor gene, and its inactivation has been associated with some cancers such as neuroblastoma. One study reported that microRNA-380 (miR-380) was able to repress the expression of the p53 gene in cancer patients, causing uninhibited cell survival and proliferation. The research group was able to decrease tumor size in vivo in a mouse model of neuroblastoma by delivering a miR-380 antagonist. The researchers also observed that the inhibition of endogenous miR-380 in embryonic stem or neuroblastoma cells resulted in induction of p53 and extensive apoptotic cell death.
Thus, the success of the miR antagonist in decreasing tumor size speaks to the promise of miRs as potential therapeutic targets for cancer treatment.
In conclusion, as stated by Dr. Baker in her post, “the miRNA data for tissues and specific cell types involved in disease pathology form a new approach to either detecting or possibly correcting gene (coding or non-coding) dysregulation. miRNA mimics and anti-miRNA agents are being developed as new therapeutic modalities.”
Early in September, Nature, Genome Research and Genome Biology published 30 research papers on the results of the ambitious and once seemingly risky project named ENCODE (Encyclopedia of DNA Elements). The ENCODE results revealed that 80% of the human genome is not “junk”, as previously thought, but rather acts as regulatory domains for further signaling events.
When the human genome was first sequenced, more than a decade ago, scientists were surprised by the low ratio of protein-coding genes to the number of bases in human DNA. Out of 3 billion bases in human DNA, scientists found only about 21,000 genes. This unexpected finding led to a few basic questions:
Why do humans have so many base pairs?
How can the highly regulated, complex behavior of biochemical, cellular and physiological processes be translated to regulation at the genetic level?
The ENCODE project results reveal how limited our knowledge of the human genome has been until now. The results open up new ways of thinking about human DNA and its functional domains. They also bring huge challenges for both experimental development and data-driven computational approaches aimed at better understanding and applying these new findings.
To gain insight from large-scale data and identify the key players within it, bioinformatics approaches will probably be the only way forward. This also underlines the importance of developing new algorithms capable of linking regulatory functions with gene regulation. Presently, most algorithms are targeted toward identifying genes and their connections in a linear fashion. However, regulatory domains and their functional activities might be nonlinear, something that will be revealed as many more experimental results arrive in the coming years.
The functional characterization of the human genome will also lead to a better understanding of genetic differences between normal and disease states. Moreover, with proper identification of the functional characteristics of a particular gene’s regulation, drugs can be targeted with much more precision in future. However, succeeding at such a complicated problem will require visionary experimental design and execution, with experimental and computational biology teams working together.
It is already well recognized that bioinformatics approaches can help greatly in identifying key players in the regulation of genes. However, it is often not easy to translate information at the genetic level directly to the cellular or physiological level. Two of the main reasons are a) the complex cross-talk between proteins that leads to intracellular signaling events and b) highly nonlinear information sharing among receptors and ligands in extracellular signaling processes. To understand the functional characteristics of non-coding regions of DNA in the context of gene regulation, effort should be devoted to mapping the functional network of gene regulation onto the signaling pathways of protein networks. This will require the development of experimental as well as computational approaches that capture genetic and proteomic analyses together. Furthermore, for a better understanding of cellular and physiological decisions, the mapping between gene regulation and intracellular signaling pathways should be extended to dynamic analysis over time.
The extraordinary findings of the ENCODE project pose many challenges and leave many unknowns to be answered over the next decade or so, but they also answer some basic questions that have haunted the scientific world for almost a decade.
ENCODE data reveals important information from Genome Wide Association Studies relevant to understanding complex genetic diseases
Author: Ritu Saxena, Ph.D.
Introduction
“The depth, quality, and diversity of the ENCODE data are unprecedented,” stated John Stamatoyannopoulos, professor of genomic sciences at the University of Washington and one of the many principal investigators of the ENCODE project. ENCODE (Encyclopedia of DNA Elements) was indeed an ambitious project, launched as a pilot in 2003 and expanded in 2007 to whole-genome analysis and identification of all the functional elements of the human genome. The findings were striking, as they challenged the definition of “gene” and the central dogma of genetics (gene-mRNA-protein). In fact, the non-coding part that constitutes about 80% of the genome, the so-called “junk DNA”, was found to contain elements crucial for gene regulation. These elements, in large part, include RNA transcripts that are not translated into proteins but might have a regulatory role. For detailed reading, refer to the findings published in Nature: The ENCODE Project Consortium, “An integrated encyclopedia of DNA elements in the human genome,” Nature 489, 57–74 (2012).
Protein-coding genes: Proteins are molecules made of amino acids linked together in a specific sequence; the amino acid sequence is encoded by the sequence of DNA subunits, called nucleotides, that make up genes.
Non-coding genes: Stretches of DNA that are read by the cell as if they were genes but do not encode proteins. These appear to help regulate the activity of the genome.
Chromatin structure features: Complex physical structures, made from a combination of DNA and binding proteins, that make up the contents of the nucleus and affect genome function.
Histone modifications: Histones are the proteins that make up the chromatin structures that help shape and control the genome. In addition, histone proteins can be physically modified by adding chemical groups, such as a methyl group, which further regulates genomic activity.
DNA methylation: Just as with histones, methyl groups can be added to DNA itself in a process called DNA methylation. Chemically attaching methyl groups to DNA physically changes the ability of enzymes to reach the DNA and thus alters the gene expression pattern in cells. Methylation helps cells “remember what they are doing” or alter levels of gene expression, and it is a crucial part of normal development and cellular differentiation in higher organisms.
Transcription factor binding sites: Transcription factors are proteins that bind to specific DNA sequences, controlling the flow (or transcription) of genetic information from DNA to mRNA. Mapping the binding sites can help researchers understand how genomic activity is controlled.
How could ENCODE be helpful in the study of complex human diseases?
Complex diseases and Genome wide association studies (GWAS)
Coronary artery disease, type 2 diabetes and many forms of cancer are complex human diseases that have a significant genetic component. Unlike Mendelian disorders, which have defined loci, the genetic component of complex disorders takes the form of genetic variations across the genome that make an individual susceptible to these diseases.
Researchers have performed Genome-wide association studies (GWAS) of the human genome, leading to the identification of thousands of DNA variants that could be linked with complex traits and diseases. However, identifying the variants, referred to as SNPs (Single Nucleotide Polymorphisms), that actually contribute to the disease, and understanding how they exert influence on a disease has been more of a mystery.
How would ENCODE solve the puzzle?
The puzzle lies in interpreting how the SNPs found in the genome affect a person’s susceptibility to a particular trait or disease, and what mechanism lies behind it. As identified in GWAS, most variants that are associated with the phenotype of a trait or disease lie in the non-coding region of the genome. In fact, in more than 400 studies compiled in the GWAS catalog, only a small minority of the trait/disease-associated SNPs occur in protein-coding regions; the large majority (89%) are in noncoding regions. These variants fall in gene deserts that lie far from protein-coding regions, similar to those where cis-regulatory modules (CRMs) are found. CRMs such as promoters and enhancers are groups of binding sites for transcription factors, and the presence of transcription factors bound to these sites is a good indicator of potential regulatory regions.
The integrative analysis of ENCODE data has given important insights into the results of GWAS. Investigators have employed ENCODE data as an initial guide to discover regulatory regions in which genetic variation affects a complex trait. Additionally, when the ENCODE study examined the GWAS SNPs associated with trait phenotypes, it found that these SNPs are enriched in DNase-sensitive regions, i.e., they lie in function-associated regions of the genome that can be bound by transcription factors affecting the regulation of gene expression. Thus, the project demonstrates that non-coding regions must be considered when interpreting GWAS results, and it provides a strong motivation for reinterpreting previous GWAS findings.
Using ENCODE Data to Interpret GWAS Results
ENCODE and predisposition to CANCER:
c-Myc, a proto-oncogene, codes for a transcription factor that, when expressed constitutively, leads to uninhibited cell proliferation resulting in cancer. Common variants within a ~1 Mb region upstream of the c-Myc gene have been associated with cancers of the colon, prostate, and breast. Several SNPs reported in this region affect the phenotype even though they lie in the distal cis-region of the MYC gene. Alignment of the ENCODE data in this region with the significant variants from GWAS also reveals that key variants are found in the transcription factor-occupied DNA segments mapped by the consortium. One variant, rs6983267, lies within a DNase hypersensitive site that is bound by several transcription factors and the enhancer-associated protein p300, and that carries histone modifications characteristic of enhancers (high H3K4me1, low H3K4me3). ENCODE data thus indicate that non-coding regions at the human chromosome 8q24 locus are associated with cancer and, as observed in the case of the c-Myc gene, similar studies on other cancer-related genes could help explain predisposition to cancer.
ENCODE and fetal hemoglobin expression:
Another example of the use of ENCODE data is the regulation of fetal hemoglobin. ENCODE predicted several regions to be involved in the regulation of fetal hemoglobin, and these predicted regions were found to be close to the SNPs in the BCL11A gene that are associated with persistent expression of fetal hemoglobin.
Future perspective
As is evident from the above examples, the ENCODE data show that genetic variants do affect the regulated expression of a target gene. Recently, several research groups in the UK performed a large-scale GWAS to determine the genetic predisposition to fracture risk. The collaborative effort, published in a recent issue of PLoS Genetics, aimed to identify genetic variants associated with cortical bone thickness (CBT) and bone mineral density (BMD) using data from more than 10,000 subjects. http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1002745 The study generated a wealth of data, including the identification of SNPs in WNT16 and its adjacent gene, FAM3C, that were found to be relevant to CBT and BMD. ENCODE data, in this case, could be helpful in extracting more detailed information, including additional SNPs and the regulatory information of the genes involved. Thus, ENCODE data could be immensely useful in interpreting associations between disease and DNA sequences that vary from person to person.
Author: Margaret Baker, PhD, Registered Patent Agent
The Encyclopedia of DNA Elements (ENCODE) Project was launched in September of 2003. In 2007 the ENCODE project was expanded to study the entire human genome, with results bearing on genome-wide association studies (GWAS). This month the consortium published a Nature paper entitled “An integrated encyclopedia of DNA elements in the human genome,” and all data are available at http://genome.ucsc.edu/ENCODE/. Novel functional roles have been discovered for both transcribed and non-transcribed portions of DNA. See several articles and commentary in Science 7 September 2012: Vol. 337 no. 6099, including Maurano et al. pp. 1190-1195, DOI: 10.1126/science.1222794
For the first time, the 3-dimensional connections that cross the genome have been mapped as long-range looping interactions between functional elements and the genes controlled. These regions of the genome, formerly referred to as “junk DNA”, have the potential to be involved in disease initiation, pathophysiology, and complications. Further, epigenetic factors may be seen to play a more direct role in the expression or silencing of protein coding genes as DNase I hot spots, nucleosomal anchor points, and DNA methylation sites are added to the map.
Non-coding transcribed DNA includes a large percentage of sequences coding for RNA. In fact, the number of RNA-encoding genes is nearly equal to that of protein-encoding genes (18,400 vs. 20,687), and previously unknown non-coding RNAs (ncRNA) have also been characterized.
Some of the known elements that were cataloged include:
cis elements – promoters, transcription factor binding sites;
gene contiguous non-coding stretches such as introns, polyA, and UTR, splice variants;
pseudogenes (11,224);
long range gene associated elements – enhancers, insulators, suppressors, and predicted promoter flanking regions;
ribosomal RNA genes; and
sequences for 7,052 small RNAs, of which 85% are small nuclear (sn)RNA, small nucleolar (sno)RNA, transfer (t)RNA, and micro (mi)RNA.
What has been found is that distinct non-coding regions, including ncRNA, can be associated with distinct disease traits. miRNA are among the non-gene-encoding sequences in the genome that have already been shown to play a major post-transcriptional role in the expression of multiple genes.
Most miRNA genes are intergenic or oriented antisense to neighboring genes and are therefore assumed to be controlled by independent promoter units. However, in some cases a microRNA gene is transcribed together with its target gene, implying coupled regulation of the miRNA and the protein-coding gene. About one third of miRNA genes reside in polycistronic clusters. miRNA genes can occupy the introns of protein-coding genes, non-protein-coding genes, or non-protein-coding transcripts. Their promoters have been shown to share some motifs with the promoters of other genes transcribed by RNA polymerase II, such as protein-coding genes. The ENCODE project also noted that miRNA promoters were in chromatin regions of high promiscuity. There may be up to 1000 miRNA genes in the human genome. In addition, human miRNAs show RNA editing of sequences to yield products different from those encoded by their DNA. miRNA are implicated in cellular roles as diverse as developmental timing in worms, cell death and fat metabolism in flies, haematopoiesis in mammals, and leaf development and floral patterning in plants.
The final miRNA gene product is a ∼22 nt functional RNA molecule. The mature miRNA (designated miR-#) is processed from a characteristic stem–loop sequence (called a pre-mir), which in turn may be excised from a longer primary transcript (or pri-mir). It is processed by the same enzyme (DICER) that processes short hairpin RNA, forming interfering RNA, which provides an additional level of control.
miRNA controls gene expression by binding to complementary regions in the 3’ untranslated region of messenger transcripts to repress their translation or regulate their degradation. What makes the mechanism more powerful (or complicated) is that the imperfect but specific binding motif can associate with a large number of mRNAs whose 3’ untranslated regions carry the complementary motif. Conversely, each mRNA can potentially associate with a number of miRNAs. Mature, processed cytosolic miRNA can act in a manner akin to small interfering (si)RNA and form the RNA-induced silencing complex (RISC) to block translation. Computational methods have been used to identify potential gene targets based on complementarity between the miRNA and mRNA sequences.
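The idea of complementarity-based target prediction can be illustrated with a minimal sketch. It assumes the simplest possible rule, namely an exact match between the 3’ UTR and the reverse complement of the miRNA “seed” (nucleotides 2-8); published prediction tools combine this with further criteria that are not modelled here. The function names and the toy sequences are hypothetical and chosen only for illustration.

```python
# Watson-Crick pairing for RNA (G:U wobble pairs are ignored in this sketch).
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the reverse complement of an RNA sequence written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def seed_sites(mirna, utr, seed_start=1, seed_end=8):
    """Return 0-based positions in `utr` that pair perfectly with the seed.

    The seed is taken as miRNA nucleotides 2-8 (Python slice 1:8), which is
    an assumption of this sketch rather than a rule stated in the text.
    """
    seed = mirna[seed_start:seed_end]
    site = reverse_complement(seed)
    return [
        i
        for i in range(len(utr) - len(site) + 1)
        if utr[i : i + len(site)] == site
    ]


# Toy example: a 22-nt miRNA and a short, made-up 3' UTR fragment.
toy_mirna = "UGAGGUAGUAGGUUGUAUAGUU"
toy_utr = "AAACUACCUCAGGG"
print(seed_sites(toy_mirna, toy_utr))
```

Because the match involves only seven nucleotides, the same seed will occur in many unrelated UTRs, which is one way to see why a single miRNA can associate with a large number of mRNAs.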
Gerstein et al. explored the “Architecture of the human regulatory network derived from ENCODE data” Nature 489:91-100 (06 Sep 2012) focusing on the regulation of transcription factors (TF) and association between TF and miRNAs, miRNA and miRNA, protein-protein interactions, and protein phosphorylation. Not surprisingly, not all TF are the upstream factor in each network.
These new and remarkably detailed examinations of the different elements within and transcribed from the human genome perhaps do more to explain why we have stumbled in attempts to eradicate diseases by initially focusing on a single gene or a constellation of coding regions. The miRNA wikipedia is also being rewritten on a daily basis, with new disease associations being made*. As an example of a pathological state that may be linked to miRNA-controlled elements, in vitro studies as well as small population studies have examined miRNA species in diabetogenic conditions and in patients with diabetes (Type I and Type II).
Diabetes and miRNA
In adult β-cell islets, miR-375 is low when glucose is freely available and low miR-375 induces insulin secretion. Interestingly, miR-375 is found only in brain and β-cells which share a secretion pathway.
Diabetic Complications
Organ specific miRNA have been identified in liver, skeletal muscle, kidney, vascular, and adipose tissue which are responsive to transient or sustained hyperglycemia.
miR-17-5p and miR-132 were reported to show significant differences between obese and non-obese omental fat and were also abnormal in the blood of obese subjects. Altered expression of miR-17-5p and miR-132 was found to correlate significantly with BMI, fasting blood glucose and glycosylated hemoglobin (Kloting et al. PLoS ONE 4(3), e4699 (2009)).
Clinical practice related to miRNA in diabetes may be possible, as one group has identified eight miRNAs (miR-144, miR-146a, miR-150, miR-182, miR-192, miR-29a, miR-30d and miR-320) as potential ‘signature miRNAs’ that could distinguish prediabetic patients from those with overt T2D (Karolina DS, Armugam A, Tavintharan S et al. MicroRNA 144 impairs insulin signaling by inhibiting the expression of insulin receptor substrate 1 in Type 2 diabetes mellitus. PLoS ONE 6(8), e22839 (2011)).
Due to the autoimmune component of T1D, the constellation of miRNA would be expected to be different: upregulation of miR-510 and underexpression of miR-191 and miR-342 were observed in the Tregs (regulatory T-cells) of T1D patients (Hezova R, Slaby O, Faltejskova P et al. microRNA-342, microRNA-191 and microRNA-510 are differentially expressed in T regulatory cells of Type 1 diabetic patients. Cell. Immunol. 260(2), 70–74 (2010)).
Taken together with the “physical” mapping of miRNA genes in the context of the 3-dimensional genome provided by the ENCODE studies and new understanding of potential concerted regulatory mechanisms, the miRNA data for tissues and specific cell types involved in disease pathology form a new approach to either detecting or possibly correcting gene (coding or non-coding) dysregulation. miRNA mimics and anti-miRNA agents are being developed as new therapeutic modalities.
References
Bartel, DP. MicroRNAs: Genomics, Biogenesis, Mechanism, and Function. Cell 2004, 116:281-297.
Fernandez-Valverde, SL et al. MicroRNAs in beta-cell Biology, insulin resistance, diabetes and its complications. Diabetes July 2011 60 (7):1825-31.
*Based on initial studies in the worm C. elegans showing the temporal appearance of 21- and 22-nt RNAs during development, a family of highly conserved micro RNA sequences (miRNA) existing in invertebrates and vertebrates was cataloged by Tuschl et al. at the Max-Planck-Institute and others (see Eddy, SR Non-coding RNA genes and the modern RNA world Nature Reviews Genetics, 2:920-929, 2001). The sequence-specific post-transcriptional regulatory mechanisms mediated by these miRNAs have been associated with certain disease states such as cancer (miR-21) and, more specifically, lung cancer (miR-124) or breast cancer (miR-7, miR-21), and new species and functions continue to be found (see http://www.mirbase.org/ ).
Negative selection was examined using two measures that highlight different periods of selection in the human genome. The first measure, inter-species, pan-mammalian constraint (GERP-based scores; 24 mammals) addresses selection during mammalian evolution. The second measure is intra-species constraint estimated from the numbers of variants discovered in human populations using data from the 1000 Genomes project and covers selection over human evolution.
For DNaseI elements and bound motifs, most sets of elements show enrichment in pan-mammalian constraint and decreased human population diversity, though for some cell types the DNaseI sites do not appear overall to be subject to pan-mammalian constraint. Bound TF motifs have a natural control in the set of TF motifs with equal sequence potential for binding but without binding evidence from ChIP-seq experiments; in all cases, the bound motifs showed both more mammalian constraint and higher suppression of human diversity.
Consistent with previous findings, genome-wide evidence was not observed for pan-mammalian selection of novel RNA sequences. There are also a large number of elements without mammalian constraint, between 17% and 90% for TF-binding regions as well as DHSs and FAIRE regions. Previous studies could not determine whether these sequences are biochemically active but with little overall impact on the organism, or are under lineage-specific selection. By isolating sequences preferentially inserted into the primate lineage, which is only feasible given the genome-wide scale of these data, this issue was specifically examined. The majority of primate-specific sequence is due to retrotransposon activity, but an appreciable proportion is non-repetitive primate-specific sequence. Of 104,343,413 primate-specific bases (excluding repetitive elements), 67,769,372 (65%) are found within ENCODE-identified elements. Examination of 227,688 variants segregating in these primate-specific regions revealed that all classes of elements (RNA and regulatory) show depressed derived allele frequencies, consistent with recent negative selection occurring in at least some of these regions. This suggests that an appreciable proportion of the unconstrained elements are lineage-specific elements required for organismal function, consistent with long-standing views of recent evolution, and that the remainder are likely to be “neutral” elements which are not currently under selection, but may still affect cellular or larger-scale phenotypes without an effect on fitness.
The binding patterns of TFs are not uniform, and both inter- and intra-species measures of negative selection can be correlated with the overall information content of motif positions. The selection on some motif positions is as high as that on protein-coding exons. These aggregate measures across motifs show that the binding preferences found in the population of sites are also relevant to the per-site behavior. By developing a per-site metric of population effect on bound motifs, it was found that highly constrained bound instances across mammals are able to buffer the impact of individual variation.
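The 65% figure quoted above can be checked directly from the two base counts, and the notion of a derived allele frequency can be made concrete at the same time. The snippet below is only an arithmetic sketch; the helper names are hypothetical, and the allele counts passed to `derived_allele_frequency` in the last line are toy values, not data from the study.

```python
# Base counts quoted in the text above.
PRIMATE_SPECIFIC_BASES = 104_343_413
BASES_IN_ENCODE_ELEMENTS = 67_769_372


def percent(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


def derived_allele_frequency(derived_count, chromosomes_sampled):
    """Fraction of sampled chromosomes that carry the derived (new) allele."""
    if chromosomes_sampled <= 0:
        raise ValueError("chromosomes_sampled must be positive")
    return derived_count / chromosomes_sampled


coverage = percent(BASES_IN_ENCODE_ELEMENTS, PRIMATE_SPECIFIC_BASES)
print(f"Primate-specific bases inside ENCODE elements: {coverage:.2f}%")

# Toy example only: 3 derived alleles observed among 12 sampled chromosomes.
print(derived_allele_frequency(3, 12))
```

A class of elements whose variants sit, on average, at lower derived allele frequencies than a neutral reference is what the text describes as showing "depressed" frequencies.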
It was proposed to express the deleterious effect of TFBS mutations in terms of mutational load, a known population genetics metric that combines the frequency of mutation with predicted phenotypic consequences that it causes. This metric was adapted to use the reduction in PWM score associated with a mutation as a crude but computable measure of such phenotypic consequences. It was not assumed that TFBS load at a given site reduces an individual’s biological fitness. Rather, it was argued that binding sites that tolerate a higher load are less functionally constrained. This approach, although undoubtedly a crude one, makes it possible to consistently estimate TFBS constraints for different TFs and even different organisms and ask why TFBS mutations are tolerated differently in different contexts.
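One way to make the idea of motif load concrete is sketched below. The exact formula used in the original study is not given in this text, so the definition here is an assumption: each allele observed at a site contributes its population frequency multiplied by the relative drop in PWM score compared with the best-scoring allele. The PWM, the sequences, the frequencies and the function names are all toy values invented for illustration.

```python
def pwm_score(pwm, seq):
    """Sum the per-position PWM entries (e.g. log-odds) for a sequence."""
    return sum(pwm[i][base] for i, base in enumerate(seq))


def motif_load(pwm, alleles):
    """Frequency-weighted relative loss of PWM score at one binding site.

    `alleles` maps each observed sequence variant of the site to its
    population frequency. This is an assumed formalisation, not
    necessarily the one used in the study described above.
    """
    scores = {seq: pwm_score(pwm, seq) for seq in alleles}
    best = max(scores.values())
    if best <= 0:
        raise ValueError("the best allele must have a positive PWM score")
    return sum(
        freq * (best - scores[seq]) / best for seq, freq in alleles.items()
    )


# Toy 3-position PWM favouring the sequence "ACG".
toy_pwm = [
    {"A": 2.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 2.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 2.0, "T": -1.0},
]

# 90% of chromosomes carry the consensus, 10% carry a weaker variant.
toy_alleles = {"ACG": 0.9, "ACT": 0.1}
print(motif_load(toy_pwm, toy_alleles))
```

Under this definition a monomorphic site has a load of zero, and the load grows both when a score-reducing variant becomes more common and when it weakens the motif match more strongly, which matches the two ingredients named in the text.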
It was first asked whether motif load would be able to detect the expected link between evolutionary and individual variation. A published metric, Branch Length Score (BLS), was used to characterise the evolutionary conservation of a motif instance. This metric both utilises a PWM-based model of base conservation and allows for motif movement. Reassuringly, mutational load correlated with BLS in both species, with evolutionarily non-conserved motifs (BLS=0) showing by far the highest degree of variation in the population. At the same time, ∼40% of human and fly TFBSs with an appreciable load (L>5e-3) still mapped to reasonably conserved sites (BLS>0.2, ∼50th percentile in both organisms), demonstrating that score-reducing mutations at evolutionarily preserved sequences can be tolerated in these populations.
Using this metric, the original findings were confirmed, suggesting that TFBSs with higher PWM scores are generally more functionally constrained compared to ‘weaker’ sites. The fraction of detected sites mapping to bound regions remained similar across the whole analysed score range, suggesting that this relationship is unlikely to be an artefact of higher false-positive rates at ‘weaker’ sites. This global observation, however, does not rule out the possibility that a weaker match at some sites is specifically preserved to ensure dose-specific TF binding. This may be the case, for example, for Drosophila Bric-à-brac motifs, which exhibited no correlation between motif load and PWM score, consistent with the known dosage-dependent function of Bric-à-brac in embryo patterning.
Motif load was used to address whether TFBSs proximal to transcription start sites (TSS) are more constrained compared to more distant regulatory regions. This was found to be the case in the human, but not in Drosophila. CTCF binding sites in both species were a notable exception, tolerating the lowest mutational load at locations 500bp-1kb from TSS, but not closer to the TSS, suggesting that the putative role of CTCF in establishing chromatin domains is particularly important in proximity of gene promoters.
To gain further insight into the functional effects of TFBS mutations, a dataset was used that mapped human CTCF binding sites across four individuals. TFBS mutations detected in this dataset often did not result in a significant loss of binding, with ∼75% of mutated sites retaining at least two thirds of the binding signal. This was particularly prominent at conserved sites (BLS>0.5), 90% of which showed this ‘buffering’ effect. To address whether buffering could be explained solely by the flexibility of CTCF sequence preferences, between-allele differences in the PWM score at polymorphic binding sites were analysed. As expected, CTCF binding signal globally correlated with the PWM score of the underlying motifs. Consistent with this, alleles with minor differences in PWM match generally had little effect on the binding signal compared to sites with larger PWM score changes, suggesting that the PWM model adequately describes the functional constraints of CTCF binding sites. At the same time, it was found that CTCF binding signals could be maintained even in those cases where mutations resulted in significant changes of PWM score, particularly at evolutionarily conserved sites. A linear interaction model confirmed that the effect of motif mutations on CTCF binding was significantly reduced with increasing conservation. These effects were not due to the presence of additional CTCF motifs (as 96% of bound regions contained only a single motif), while differences between more and less conserved sites could not be explained away by differences in the PWM scores of their major alleles. A CTCF dataset from three additional individuals generated by a different laboratory yielded consistent conclusions, suggesting that these observations were not due to over-fitting.
Taken together, CTCF binding data for multiple individuals show that mutations can be buffered to maintain the levels of binding signal, particularly at highly conserved sites, and this effect cannot be explained solely by the flexibility of CTCF’s sequence consensus. It was asked whether mechanisms potentially accountable for such buffering would also affect the relationship between sequence and binding in the absence of mutations. Training an interaction linear model across the whole set of mapped CTCF binding sites revealed that conservation consistently weakens the relationship between PWM score and the binding intensity. Thus, CTCF binding to evolutionarily conserved sites may generally have a reduced dependence on sequence.
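What a linear interaction model expresses can be shown with a small, self-contained sketch. It assumes the form binding = b0 + b1·PWM + b2·conservation + b3·(PWM × conservation), fitted by ordinary least squares; the exact specification used in the study is not given in this text. The data below are synthetic and noise-free, generated from chosen coefficients purely to show how a negative interaction term encodes "conservation weakens the PWM-binding relationship". All names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_interaction(pwm, cons, binding):
    """Least-squares fit of binding ~ 1 + pwm + cons + pwm*cons."""
    rows = [[1.0, p, c, p * c] for p, c in zip(pwm, cons)]
    k = 4
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, binding)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic grid of PWM scores (0..1) and conservation values (0..1).
pwm_scores = [p / 4 for p in range(5) for _ in range(3)]
conservation = [c / 2 for _ in range(5) for c in range(3)]
binding = [
    1.0 + 2.0 * p + 0.5 * c - 1.5 * p * c
    for p, c in zip(pwm_scores, conservation)
]

b0, b_pwm, b_cons, b_inter = fit_interaction(pwm_scores, conservation, binding)
# Slope of binding with respect to PWM score at low and high conservation.
print(b_pwm + b_inter * 0.0, b_pwm + b_inter * 1.0)
```

In this toy setting the slope of binding against PWM score is steeper at unconserved sites than at fully conserved ones; a negative interaction coefficient is the model's way of stating that binding at conserved sites depends less on the sequence match.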