Washington University | Leaders in Pharmaceutical Business Intelligence Group, LLC, Doing Business As LPBI Group, Newton, MA


ENCODE Findings as Consortium

Posted in Bio Instrumentation in Experimental Life Sciences Research, Biological Networks, Gene Regulation and Evolution, Biomarkers & Medical Diagnostics, Cardiovascular Pharmacogenomics, Chemical Genetics, Computational Biology/Systems and Bioinformatics, Disease Biology, Small Molecules in Development of Therapeutic Drugs, FDA Regulatory Affairs, Genome Biology, Genomic Testing: Methodology for Diagnosis, Health Economics and Outcomes Research, Health Law & Patient Safety, International Global Work in Pharmaceutical, Metabolomics, Molecular Genetics & Pharmaceutical, Personalized and Precision Medicine & Genomic Research, Pharmaceutical R&D Investment, Population Health Management, Genetics & Pharmaceutical, Proteomics, Scientist: Career considerations, Statistical Methods for Research Evaluation, Technology Transfer: Biotech and Pharmaceutical, tagged DNA, ENCODE, Ewan Birney, Genome Biology, Genome Research, Genome-wide association study, National Human Genome Research Institute, Washington University on September 10, 2012| 4 Comments »

Reporter: Aviva Lev-Ari, PhD, RN

Set of Papers Outline ENCODE Findings as Consortium Looks Ahead to Future Studies

By a GenomeWeb staff reporter

NEW YORK (GenomeWeb News) – An international collaboration involving more than 400 researchers working to characterize gene regulatory networks in the human genome is publishing dozens of new studies this week.

In papers appearing in Nature, Science, Genome Research, Genome Biology, Journal of Biological Chemistry, and elsewhere, members of the Encyclopedia of DNA Elements, or ENCODE, consortium describe approaches used to define some four million regulatory regions in the genome, among other things. All told, the team explained, ENCODE efforts have made it possible to assign biological functions to around 80 percent of genome sequences, filling in large gaps left by studies that focused on protein-coding sequences alone.

“We found that a much bigger part of the genome — a surprising amount, in fact — is involved in controlling when and where proteins are produced, than in simply manufacturing the building blocks,” ENCODE’s lead analysis coordinator Ewan Birney, associate director of the European Molecular Biology Laboratory European Bioinformatics Institute, said in a statement.

“This concept of ‘junk DNA,’ which has been sort of perpetuated for the past 20 years or so is really not accurate,” ENCODE researcher Rick Myers, director of the HudsonAlpha Institute for Biotechnology, said during a telephone briefing with reporters today. “Most of the genome — more than 80 percent of the base pairs in the genome — has some biological activity, some biological function.”

Researchers participating in a complementary effort within the larger ENCODE project, known as GENCODE, have more completely characterized the coding portions of the genome. “As part of the ENCODE project, we both tidied up the protein-coding genes and we also found many non-coding RNA genes as well,” Birney said during today’s telebriefing.

Based on the success of ENCODE so far, the project is expected to be extended by another four years or so. The amount of new funding from the National Human Genome Research Institute for that follow-up work is expected to be as high as $123 million.

“Later this month, NHGRI will be announcing a new round of funding that will take the ENCODE project into its next phase,” NHGRI Director Eric Green said during the call.

Studies done in the decade or so since the human genome was deciphered have highlighted how little of the genome is actually comprised of gene sequences. With the realization that only around 2 percent of the genome is dedicated to protein-coding functions came a spate of speculation about the role of the other 98 percent of the genome.

While this portion of the genome was suspected of harboring regulatory sequences, the extent of that regulation and its impact on coding sequences in human tissues over time was not known.

“When the Human Genome Project ended in 2003, we quickly realized that we understood the meaning of only a very small percent of the human genome’s letters,” Green explained. “We did know the genetic code for determining the order of amino acids and proteins, but we understood precious little about the signals that turned genes on or off — or that controlled the amount of proteins produced in different tissues.”

To begin studying such control networks systematically, the international ENCODE consortium kicked off the main phase of its analyses in 2007, following an earlier pilot study.

NHGRI has provided $123 million for the project over the past five years. Another $30 million went to support the development of ENCODE-related technologies since the ENCODE pilot started in 2003, while $40.6 million from NHGRI went towards the pilot itself.

During the study’s main phase, investigators from nearly three-dozen labs around the world took multi-pronged approaches to assess transcription factor binding patterns, histone modification patterns, chromatin structure signatures and other features of the genome that interact with one another to control gene expression over time and across different tissues in the body.

To accomplish the roughly 1,600 experiments done to test some 180 cell types for ENCODE, teams turned to methods such as chromatin immunoprecipitation coupled with sequencing to define the genome-wide binding patterns for more than 100 different transcription factors, for example, while other strategies were used to profile DNA methylation patterns, chromatin features, and so forth.

“It’s really a detailed hierarchy, where proteins bind and epigenetic marks — like DNA methylation and other marks — precisely cooperate and regulate how the genes are going to get turned on [or off] and the amount of this,” Myers said. “These complex networks are one of the big components of the contributions of the 30 papers that are being published today.”

For example, a University of Washington-led team reporting in Science online today defined millions of regulatory regions, including some that are operational during normal development, by taking advantage of an enzyme known as DNase I, which chops off DNA specifically at open chromatin sites in the genome. That group found that more than three-quarters of disease-associated variants identified in genome-wide association studies fall in parts of the genome that overlap with regulatory sites.
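The overlap statistic described above (the share of disease-associated variants that fall inside regulatory regions) can be illustrated with a minimal Python sketch. This is not the team's pipeline: the function names, the half-open interval convention, and the toy coordinates are all assumptions made for illustration.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or touching half-open [start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def fraction_in_regions(variant_positions, regions):
    """Return the fraction of variant positions inside any region."""
    merged = merge_intervals(regions)
    starts = [s for s, _ in merged]
    hits = 0
    for pos in variant_positions:
        # Find the last merged region that starts at or before pos.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < merged[i][1]:
            hits += 1
    return hits / len(variant_positions) if variant_positions else 0.0


# Hypothetical toy data on a single chromosome (not ENCODE data).
toy_regions = [(100, 200), (150, 300), (1000, 1100)]
toy_variants = [120, 250, 500, 1050]
toy_fraction = fraction_in_regions(toy_variants, toy_regions)
```

With these made-up coordinates, three of the four variants land in a regulatory region. A real analysis would work per chromosome on published region and variant files and would also need to account for linkage between nearby variants.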

“We now know that the majority of these changes that are associated with common diseases and traits that don’t fall within genes actually occur within the gene-controlling switches,” University of Washington genome sciences researcher John Stamatoyannopoulos, senior author on that study, said during today’s telebriefing. “This phenomenon is not confined to a particular type of disease. It seems to be present across the board for a very wide variety of different diseases and traits.”

Results from such analyses also hint that some outwardly unrelated conditions might be traced back to similar regulatory processes. And, researchers say, by bringing together information on active regulatory regions with disease-risk variants, it may be possible to define new functionally important tissues for certain conditions.

“By creating these extensive blueprints of the control circuitry, we’re now exposing previously hidden connections between different kinds of diseases that may explain common clinical features,” Stamatoyannopoulos said.

“This has also allowed us to see that the GWAS studies that have been performed contain far more information than was previously believed,” he added, “because hundreds of additional DNA changes that were not thought to be important also appear to affect these gene-controlling switches.”

The new data are also expected to help in understanding genetic disease and interpreting information from personal genomes, according to Michael Snyder, an ENCODE investigator and director of Stanford University’s Center of Genomics and Personalized Medicine.

“We believe the ENCODE project will have a profound impact on personal genomes and, ultimately on personalized medicine,” Snyder told reporters. “We can now better see what personal variants do, in terms of causing phenotypic differences, drug responses, and disease risk.”

Many of the studies stemming from ENCODE can be viewed through a Nature, Genome Research, and Genome Biology-conceived website that links ENCODE papers that share themes or “threads” that are related to one another.

Along with the newly published papers, the ENCODE team is making data available to other members of the research community through the project’s website. Data from studies can also be accessed through an ENCODE browser housed at the University of California at Santa Cruz or via NCBI or EBI sites.

“For basic researchers, the ENCODE data represents a powerful resource for understanding fundamental questions about how life is encoded in our genome,” NHGRI’s Green said. “For more clinically-oriented researchers, the ENCODE data provide key information about which genome sequences are functionally important.”


Source:

http://www.genomeweb.com/sequencing/set-papers-outline-encode-findings-consortium-looks-ahead-future-studies

http://www.nature.com/encode/#/threads

NEWS & VIEWS

52 | NATURE | VOL 489 | 6 SEPTEMBER 2012

FORUM: Genomics

ENCODE explained

The Encyclopedia of DNA Elements (ENCODE) project dishes up a hearty banquet of data that illuminate the roles of the functional elements of the human genome. Here, five scientists describe the project and discuss how the data are influencing research directions across many fields. See Articles p.57, p.75, p.83, p.91, p.101 & Letter p.109

Serving up a genome feast

JOSEPH R. ECKER

Starting with a list of simple ingredients and blending them in the precise amounts needed to prepare a gourmet meal is a challenging task. In many respects, this task is analogous to the goal of the ENCODE project1, the recent progress of which is described in this issue2–7. The project aims to fully describe the list of common ingredients (functional elements) that make up the human genome (Fig. 1). When mixed in the right proportions, these ingredients constitute the information needed to build all the types of cells, body organs and, ultimately, an entire person from a single genome.

The ENCODE pilot project8 focused on just 1% of the genome — a mere appetizer — and its results hinted that the list of human genes was incomplete. Although there was scepticism about the feasibility of scaling up the project to the entire genome and to many hundreds of cell types, recent advances in low-cost, rapid DNA-sequencing technology radically changed that view9. Now the ENCODE consortium presents a menu of 1,640 genome-wide data sets prepared from 147 cell types, providing a six-course serving of papers in Nature, along with many companion publications in other journals.

One of the more remarkable findings described in the consortium’s ‘entrée’ paper (page 57)2 is that 80% of the genome contains elements linked to biochemical functions, dispatching the widely held view that the human genome is mostly ‘junk DNA’. The authors report that the space between genes is filled with enhancers (regulatory DNA elements), promoters (the sites at which DNA’s transcription into RNA is initiated) and numerous previously overlooked regions that encode RNA transcripts that are not translated into proteins but might have regulatory roles. Of note, these results show that many DNA variants previously correlated with certain diseases lie within or very near non-coding functional DNA elements, providing new leads for linking genetic variation and disease.

The five companion articles3–7 dish up diverse sets of genome-wide data regarding the mapping of transcribed regions, DNA binding of regulatory proteins (transcription factors) and the structure and modifications of chromatin (the association of DNA and proteins that makes up chromosomes), among other delicacies.

Djebali and colleagues3 (page 101) describe ultra-deep sequencing of RNAs prepared from many different cell lines and from specific compartments within the cells. They conclude that about 75% of the genome is transcribed at some point in some cells, and that genes are highly interlaced with overlapping transcripts that are synthesized from both DNA strands. These findings force a rethink of the definition of a gene and of the minimum unit of heredity.

Moving on to the second and third courses, Thurman et al.4 and Neph et al.5 (pages 75 and 83) have prepared two tasty chromatin-related treats. Both studies are based on the DNase I hypersensitivity assay, which detects genomic regions at which enzyme access to, and subsequent cleavage of, DNA is unobstructed by chromatin proteins. The authors identified cell-specific patterns of DNase I hypersensitive sites that show remarkable concordance with experimentally determined and computationally predicted binding sites of transcription factors. Moreover, they have doubled the number of known recognition sequences for DNA-binding proteins in the human genome, and have revealed a 50-base-pair ‘footprint’ that is present in thousands of promoters5.

The next course, provided by Gerstein and colleagues6 (page 91) examines the principles behind the wiring of transcription-factor networks. In addition to assigning relatively simple functions to genome elements (such as ‘protein X binds to DNA element Y’), this study attempts to clarify the hierarchies of transcription factors and how the intertwined networks arise.

Beyond the linear organization of genes and transcripts on chromosomes lies a more complex (and still poorly understood) network of chromosome loops and twists through which promoters and more distal elements, such as enhancers, can communicate their regulatory information to each other. In the final course of the ENCODE genome feast, Sanyal and colleagues7 (page 109) map more than 1,000 of these long-range signals in each cell type. Their findings begin to overturn the long-held (and probably oversimplified) prediction that the regulation of a gene is dominated by its proximity to the closest regulatory elements.

One of the major future challenges for ENCODE (and similarly ambitious projects) will be to capture the dynamic aspects of gene regulation. Most assays provide a single snapshot of cellular regulatory events, whereas a time series capturing how such processes change is preferable. Additionally, the examination of large batches of cells — as required for the current assays — may present too simplified a view of the underlying regulatory complexity, because individual cells in a batch (despite being genetically identical) can sometimes behave in different ways. The development of new technologies aimed at the simultaneous capture of multiple data types, along with their regulatory dynamics in single cells, would help to tackle these issues.

A further challenge is identifying how the genomic ingredients are combined to assemble the gene networks and biochemical pathways that carry out complex functions, such as cell-to-cell communication, which enable organs and tissues to develop. An even greater challenge will be to use the rapidly growing body of data from genome-sequencing projects to understand the range of human phenotypes (traits), from normal developmental processes, such as ageing, to disorders such as Alzheimer’s disease10.

© 2012 Macmillan Publishers Limited. All rights reserved

Achieving these ambitious goals may require a parallel investment of functional studies using simpler organisms — for example, of the type that might be found scampering around the floor, snatching up crumbs in the chefs’ kitchen. All in all, however, the ENCODE project has served up an all-you-can-eat feast of genomic data that we will be digesting for some time. Bon appétit!

Joseph R. Ecker is at the Howard Hughes Medical Institute and the Salk Institute for Biological Studies, La Jolla, California 92037, USA.

e-mail: ecker@salk.edu

Figure 1 | Beyond the sequence. The ENCODE project2–7 provides information on the human genome far beyond that contained within the DNA sequence: it describes the functional genomic elements that orchestrate the development and function of a human. The project contains data about the degree of DNA methylation and chemical modifications to histones that can influence the rate of transcription of DNA into RNA molecules (histones are the proteins around which DNA is wound to form chromatin). ENCODE also examines long-range chromatin interactions, such as looping, that alter the relative proximities of different chromosomal regions in three dimensions and also affect transcription. Furthermore, the project describes the binding activity of transcription-factor proteins and the architecture (location and sequence) of gene-regulatory DNA elements, which include the promoter region upstream of the point at which transcription of an RNA molecule begins, and more distant (long-range) regulatory elements. Another section of the project was devoted to testing the accessibility of the genome to the DNA-cleavage protein DNase I. These accessible regions, called DNase I hypersensitive sites, are thought to indicate specific sequences at which the binding of transcription factors and transcription-machinery proteins has caused nucleosome displacement. In addition, ENCODE catalogues the sequences and quantities of RNA transcripts, from both non-coding and protein-coding regions.

Expression control

WENDY A. BICKMORE

Once the human genome had been sequenced, it became apparent that an encyclopaedic knowledge of chromatin organization would be needed if we were to understand how gene expression is regulated. The ENCODE project goes a long way to achieving this goal and highlights the pivotal role of transcription factors in sculpting the chromatin landscape.

Although some of the analyses largely confirm conclusions from previous smaller-scale studies, this treasure trove of genome-wide data provides fresh insight into regulatory pathways and identifies prodigious numbers of regulatory elements. This is particularly so for Thurman and colleagues’ data4 regarding DNase I hypersensitive sites (DHSs) and for Gerstein and colleagues’ results6 concerning DNA binding of transcription factors. DHSs are genomic regions that are accessible to enzymatic cleavage as a result of the displacement of nucleosomes (the basic units of chromatin) by DNA-binding proteins (Fig. 1). They are the hallmark of cell-type-specific enhancers, which are often located far away from promoters.

The ENCODE papers expose the profusion of DHSs (more than 200,000 per cell type, far outstripping the number of promoters) and their variability between cell types. Through the simultaneous presence in the same cell type of a DHS and a nearby active promoter, the researchers paired half a million enhancers with their probable target genes. But this leaves more than 2 million putative enhancers without known targets, revealing the enormous expanse of the regulatory genome landscape that is yet to be explored. Chromosome-conformation-capture methods that detect long-range physical associations between distant DNA regions are attempting to bridge this gap. Indeed, Sanyal and colleagues7 applied these techniques to survey such associations across 1% of the genome.
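The idea of pairing a DHS with a nearby promoter that is active in the same cell types can be sketched in Python as follows. This is only an illustration under stated assumptions: the Jaccard score over cell-type sets, the 500 kb window, and all site and gene names are hypothetical stand-ins, not the scoring the ENCODE authors used.

```python
def pair_enhancers_with_promoters(dhs_sites, promoters, max_distance=500_000):
    """Pair each DHS with the nearby promoter sharing the most cell types.

    dhs_sites: dict of name -> (position, set of cell types where open)
    promoters: dict of gene -> (position, set of cell types where active)
    Returns a dict of DHS name -> (gene, score); unpaired DHSs are omitted.
    """
    pairs = {}
    for dhs, (dhs_pos, dhs_cells) in dhs_sites.items():
        best = None
        for gene, (prom_pos, prom_cells) in promoters.items():
            # Only consider promoters within the distance window.
            if abs(dhs_pos - prom_pos) > max_distance:
                continue
            shared = len(dhs_cells & prom_cells)
            union = len(dhs_cells | prom_cells)
            # Jaccard similarity of the two cell-type sets (an assumption).
            score = shared / union if union else 0.0
            if score > 0 and (best is None or score > best[1]):
                best = (gene, score)
        if best is not None:
            pairs[dhs] = best
    return pairs


# Hypothetical toy data: positions on one chromosome and cell-type sets.
toy_dhs = {
    "dhs1": (10_000, {"liver", "heart"}),
    "dhs2": (900_000, {"brain"}),
    "dhs3": (50_000, {"skin"}),
}
toy_promoters = {
    "GENE_A": (60_000, {"liver", "heart", "brain"}),
    "GENE_B": (40_000, {"liver"}),
    "GENE_C": (880_000, {"brain", "skin"}),
}
toy_pairs = pair_enhancers_with_promoters(toy_dhs, toy_promoters)
```

In this toy run, dhs1 pairs with GENE_A and dhs2 with GENE_C, while dhs3 has no nearby promoter active in the same cell type and stays unpaired, mirroring the putative enhancers left without known targets.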

The ENCODE data start to paint a picture of the logic and architecture of transcriptional networks, in which DNA binding of a few high-affinity transcription factors displaces nucleosomes and creates a DHS, which in turn facilitates the binding of further, lower-affinity factors. The results also support the idea that transcription-factor binding can block DNA methylation (a chemical modification of DNA that affects gene expression), rather than the other way around — which is highly relevant to the interpretation of disease-associated sites of altered DNA methylation11.

The exquisite cell-type specificity of regulatory elements revealed by the ENCODE studies emphasizes the importance of having appropriate biological material on which to test hypotheses. The researchers have focused their efforts on a set of well-established cell lines, with selected assays extended to some freshly isolated cells. Challenges for the future include following the dynamic changes in the regulatory landscape during specific developmental pathways, and understanding chromatin structure in tissues containing heterogeneous cell populations.

Wendy A. Bickmore is in the Medical Research Council Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh EH4 2XU, UK.

e-mail: wendy.bickmore@igmm.ed.ac.uk


11 Years Ago

The draft human genome

OUR GENOME UNVEILED

Unless the human genome contains a lot of genes that are opaque to our computers, it is clear that we do not gain our undoubted complexity over worms and plants by using many more genes. Understanding what does give us our complexity — our enormous behavioural repertoire, ability to produce conscious action, remarkable physical coordination (shared with other vertebrates), precisely tuned alterations in response to external variations of the environment, learning, memory … need I go on? — remains a challenge for the future.

David Baltimore

From Nature 15 February 2001

GENOME SPEAK

With the draft in hand, researchers have a new tool for studying the regulatory regions and networks of genes. Comparisons with other genomes should reveal common regulatory elements, and the environments of genes shared with other species may offer insight into function and regulation beyond the level of individual genes. The draft is also a starting point for studies of the three-dimensional packing of the genome into a cell’s nucleus. Such packing is likely to influence gene regulation … The human genome lies before us, ready for interpretation.

Peer Bork and Richard Copley

From Nature 15 February 2001

Non-coding but functional

INÊS BARROSO

The vast majority of the human genome does not code for proteins and, until now, did not seem to contain defined gene-regulatory elements. Why evolution would maintain large amounts of ‘useless’ DNA had remained a mystery, and seemed wasteful. It turns out, however, that there are good reasons to keep this DNA. Results from the ENCODE project2–8 show that most of these stretches of DNA harbour regions that bind proteins and RNA molecules, bringing these into positions from which they cooperate with each other to regulate the function and level of expression of protein-coding genes. In addition, it seems that widespread transcription from non-coding DNA potentially acts as a reservoir for the creation of new functional molecules, such as regulatory RNAs.

What are the implications of these results for genetic studies of complex human traits and disease? Genome-wide association studies (GWAS), which link variations in DNA sequence with specific traits and diseases, have in recent years become the workhorse of the field, and have identified thousands of DNA variants associated with hundreds of complex traits (such as height) and diseases (such as diabetes). But association is not causality, and identifying those variants that are causally linked to a given disease or trait, and understanding how they exert such influence, has been difficult. Furthermore, most of these associated variants lie in non-coding regions, so their functional effects have remained undefined.

The ENCODE project provides a detailed map of additional functional non-coding units in the human genome, including some that have cell-type-specific activity. In fact, the catalogue contains many more functional non-coding regions than genes. These data show that results of GWAS are typically enriched for variants that lie within such non-coding functional units, sometimes in a cell-type-specific manner that is consistent with certain traits, suggesting that many of these regions could be causally linked to disease. Thus, the project demonstrates that non-coding regions must be considered when interpreting GWAS results, and it provides a strong motivation for reinterpreting previous GWAS findings. Furthermore, these results imply that sequencing studies focusing on protein-coding sequences (the ‘exome’) risk missing crucial parts of the genome and the ability to identify true causal variants.

However, although the ENCODE catalogues represent a remarkable tour de force, they contain only an initial exploration of the depths of our genome, because many more cell types must yet be investigated. Some of the remaining challenges for scientists searching for causal disease variants lie in: accessing data derived from cell types and tissues relevant to the disease under study; understanding how these functional units affect genes that may be distantly located7; and the ability to generalize such results to the entire organism.

Inês Barroso is at the Wellcome Trust Sanger Institute, Hinxton CB10 1SA, UK, and at the University of Cambridge Metabolic Research Laboratories and NIHR Cambridge Biomedical Research Centre, Cambridge, UK.

e-mail: ib1@sanger.ac.uk

Evolution and the code

JONATHAN K. PRITCHARD & YOAV GILAD

One of the great challenges in evolutionary biology is to understand how differences in DNA sequence between species determine differences in their phenotypes. Evolutionary change may occur both through changes in protein-coding sequences and through sequence changes that alter gene regulation.

There is growing recognition of the importance of this regulatory evolution, on the basis of numerous specific examples as well as on theoretical grounds. It has been argued that potentially adaptive changes to protein-coding sequences may often be prevented by natural selection because, even if they are beneficial in one cell type or tissue, they may be detrimental elsewhere in the organism. By contrast, because gene-regulatory sequences are frequently associated with temporally and spatially specific gene-expression patterns, changes in these regions may modify the function of only certain cell types at specific times, making it more likely that they will confer an evolutionary advantage12.

However, until now there has been little information about which genomic regions have regulatory activity. The ENCODE project has provided a first draft of a ‘parts list’ of these regulatory elements, in a wide range of cell types, and moves us considerably closer to one of the key goals of genomics: understanding the functional roles (if any) of every position in the human genome.

Nonetheless, it will take a great deal of work to identify the critical sequence changes in the newly identified regulatory elements that drive functional differences between humans and other species. There are some precedents for identifying key regulatory differences (see, for example, ref. 13), but ENCODE’s improved identification of regulatory elements should greatly accelerate progress in this area. The data may also allow researchers to begin to identify sequence alterations occurring simultaneously in multiple genomic regions, which, when added together, drive phenotypic change — a process called polygenic adaptation14.

However, despite the progress brought by the ENCODE consortium and other research groups, it remains difficult to discern with confidence which variants in putative regulatory regions will drive functional changes, and what these changes will be. We also still have an incomplete understanding of how regulatory sequences are linked to target genes. Furthermore, the ENCODE project focused mainly on the control of transcription, but many aspects of post-transcriptional regulation, which may also drive evolutionary changes, are yet to be fully explored.

Nonetheless, these are exciting times for studies of the evolution of gene regulation. With such new resources in hand, we can expect to see many more descriptions of adaptive regulatory evolution, and how this has contributed to human evolution.

Jonathan K. Pritchard and Yoav Gilad are in the Department of Human Genetics, University of Chicago, Chicago, Illinois 60637, USA. J.K.P. is also at the Howard Hughes Medical Institute, University of Chicago.

e-mails: pritch@uchicago.edu; gilad@uchicago.edu

From catalogue to function

ERAN SEGAL

Projects that produce unprecedented amounts of data, such as the human genome project15 or the ENCODE project, present new computational and data-analysis challenges and have been a major force driving the development of computational methods in genomics. The human genome project produced one bit of information per DNA base pair, and led to advances in algorithms for sequence matching and alignment. By contrast, in its 1,640 genome-wide data sets, ENCODE provides a profile of the accessibility, methylation, transcriptional status, chromatin structure and bound molecules for every base pair. Processing the project’s raw data to obtain this functional information has been an immense effort.

For each of the molecular-profiling methods used, the ENCODE researchers devised novel processing algorithms designed to remove outliers and protocol-specific biases, and to ensure the reliability of the derived functional information. These processing pipelines and quality-control measures have been adapted by the research community as the standard for the analysis of such data. The high quality of the functional information they produce is evident from the exquisite detail and accuracy achieved, such as the ability to observe the crystallographic topography of protein–DNA interfaces in DNase I footprints5, and the observation of more than one-million-fold variation in dynamic range in the concentrations of different RNA transcripts3.

But beyond these individual methods for data processing, the profound biological insights of ENCODE undoubtedly come from computational approaches that integrated multiple data types. For example, by combining data on DNA methylation, DNA accessibility and transcription-factor expression, Thurman et al.4 provide fascinating insight into the causal role of DNA methylation in gene silencing. They find that transcription-factor binding sites are, on average, less frequently methylated in cell types that express those transcription factors, suggesting that binding-site methylation often results from a passive mechanism that methylates sites not bound by transcription factors.
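The comparison behind that finding can be sketched as a simple grouped frequency in Python. The function name, the input layout, and the toy observations are assumptions for illustration; the sketch shows only the correlation being described, not the integrated analysis of Thurman et al.

```python
def methylation_by_expression(observations):
    """Compare binding-site methylation frequency by TF expression status.

    observations: list of (tf_expressed, site_methylated) boolean pairs,
    one per (binding site, cell type) combination.
    Returns (frequency when the TF is expressed, frequency when it is not);
    a frequency is None if that group is empty.
    """
    expressed = [int(m) for e, m in observations if e]
    silent = [int(m) for e, m in observations if not e]
    freq_expressed = sum(expressed) / len(expressed) if expressed else None
    freq_silent = sum(silent) / len(silent) if silent else None
    return freq_expressed, freq_silent


# Hypothetical toy observations (not ENCODE data).
toy_observations = [
    (True, False), (True, False), (True, True), (True, False),
    (False, True), (False, True), (False, False), (False, True),
]
toy_frequencies = methylation_by_expression(toy_observations)
```

In the toy data the binding sites are methylated less often where the factor is expressed (0.25 versus 0.75), which is the direction of the pattern described above. Establishing causality would require more than this kind of tabulation.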

Despite the extensive functional information provided by ENCODE, we are still far from the ultimate goal of understanding the function of the genome in every cell of every person, and across time within the same person. Even if the throughput rate of the ENCODE profiling methods increases dramatically, it is clear that brute-force measurement of this vast space is not feasible. Rather, we must move on from descriptive and correlative computational analyses, and work towards deriving quantitative models that integrate the relevant protein, RNA and chromatin components. We must then describe how these components interact with each other, how they bind the genome and how these binding events regulate transcription.

If successful, such models will be able to predict the genome’s function at times and in settings that have not been directly measured. By allowing us to determine which assumptions regarding the physical interactions of the system lead to models that better explain measured patterns, the ENCODE data provide an invaluable opportunity to address this next immense computational challenge.

Eran Segal is in the Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 76100, Israel.

e-mail: eran.segal@weizmann.ac.il

1. The ENCODE Project Consortium Science 306, 636–640 (2004).

2. The ENCODE Project Consortium Nature 489, 57–74 (2012).

3. Djebali, S. et al. Nature 489, 101–108 (2012).

4. Thurman, R. E. et al. Nature 489, 75–82 (2012).

5. Neph, S. et al. Nature 489, 83–90 (2012).

6. Gerstein, M. B. et al. Nature 489, 91–100 (2012).

7. Sanyal, A., Lajoie, B., Jain, G. & Dekker, J. Nature 489, 109–113 (2012).

8. Birney, E. et al. Nature 447, 799–816 (2007).

9. Mardis, E. R. Nature 470, 198–203 (2011).

10. Gonzaga-Jauregui, C., Lupski, J. R. & Gibbs, R. A. Annu. Rev. Med. 63, 35–61 (2012).

11. Sproul, D. et al. Proc. Natl Acad. Sci. USA 108, 4364–4369 (2011).

12. Carroll, S. B. Cell 134, 25–36 (2008).

13. Prabhakar, S. et al. Science 321, 1346–1350 (2008).

14. Pritchard, J. K., Pickrell, J. K. & Coop, G. Curr. Biol. 20, R208–R215 (2010).

15. Lander, E. S. et al. Nature 409, 860–921 (2001).

6 SEPTEMBER 2012 | VOL 489 | NATURE | 55. NEWS & VIEWS RESEARCH. © 2012 Macmillan Publishers Limited. All rights reserved

http://www.sciencemag.org SCIENCE VOL 337 7 SEPTEMBER 2012 1159

NEWS & ANALYSIS

When researchers first sequenced the human genome, they were astonished by how few traditional genes encoding proteins were scattered along those 3 billion DNA bases. Instead of the expected 100,000 or more genes, the initial analyses found about 35,000, and that number has since been whittled down to about 21,000. In between were megabases of “junk,” or so it seemed.

This week, 30 research papers, including six in Nature and additional papers published by Science, sound the death knell for the idea that our DNA is mostly littered with useless bases. A decadelong project, the Encyclopedia of DNA Elements (ENCODE), has found that 80% of the human genome serves some purpose, biochemically speaking. “I don’t think anyone would have anticipated even close to the amount of sequence that ENCODE has uncovered that looks like it has functional importance,” says John A. Stamatoyannopoulos, an ENCODE researcher at the University of Washington, Seattle.

Beyond defining proteins, the DNA bases highlighted by ENCODE specify landing spots for proteins that influence gene activity, strands of RNA with myriad roles, or simply places where chemical modifications serve to silence stretches of our chromosomes. These results are going “to change the way a lot of [genomics] concepts are written about and presented in textbooks,” Stamatoyannopoulos predicts.

The insights provided by ENCODE into how our DNA works are already clarifying genetic risk factors for a variety of diseases and offering a better understanding of gene regulation and function. “It’s a treasure trove of information,” says Manolis Kellis, a computational biologist at Massachusetts Institute of Technology (MIT) in Cambridge who analyzed data from the project.

The ENCODE effort has revealed that a gene’s regulation is far more complex than previously thought, being influenced by multiple stretches of regulatory DNA located both near and far from the gene itself and by strands of RNA not translated into proteins, so-called noncoding RNA. “What we found is how beautifully complex the biology really is,” says Jason Lieb, an ENCODE researcher at the University of North Carolina, Chapel Hill.

Throughout the 1990s, various researchers called the idea of junk DNA into question. With the human genome in hand, the National Human Genome Research Institute (NHGRI) in Bethesda, Maryland, decided it wanted to find out once and for all how much of the genome was a wasteland with no functional purpose. In 2003, it funded a pilot ENCODE, in which 35 research teams analyzed 44 regions of the genome (30 million bases in all, about 1% of the total genome). In 2007, the pilot project’s results revealed that much of this DNA sequence was active in some way. The work called into serious question our gene-centric view of the genome, finding extensive RNA-generating activity beyond traditional gene boundaries (Science, 15 June 2007, p. 1556). But the question remained whether the rest of the genome was like this 1%. “We want to know what all the bases are doing,” says Yale University bioinformatician Mark Gerstein.

Teams at 32 institutions worldwide have now carried out scores of tests, generating 1640 data sets. While the pilot phase tests depended on computer chip–like devices called microarrays to analyze DNA samples, the expanded phase benefited from the arrival of new sequencing technology, which made it cost-effective to directly read the DNA bases. Taken together, the tests present “a greater idea of what the landscape of the genome looks like,” says NHGRI’s Elise Feingold.

Because the parts of the genome used could differ among various kinds of cells, ENCODE needed to look at DNA function in multiple types of cells and tissues. At first the goal was to study intensively three types of cells. They included GM12878, the immature white blood cell line used in the 1000 Genomes Project, a large-scale effort to catalog genetic variation across humans; a leukemia cell line called K562; and an approved human embryonic stem cell line, H1-hESC.

As ENCODE was ramping up, new sequencing technology brought the cost of sequencing down enough to make it feasible to test extensively even more cell types. ENCODE added a liver cancer cell line, HepG2; the laboratory workhorse cancer cell line, HeLa S3; and human umbilical cord tissue to the mix. Another 140 cell types were studied to a much lesser degree.

In these cells, ENCODE researchers closely examined which DNA bases are transcribed into RNA and then whether those strands of RNA are subsequently translated into proteins, verifying predicted protein-coding genes and more precisely locating each gene’s beginning, end, and coding regions. The latest protein-coding gene count is 20,687, with hints of about 50 more, the consortium reports in Nature. Those genes account for about 3% of the human genome, less if one counts only their coding regions. Another 11,224 DNA stretches are classified as pseudogenes, “dead” genes now known to be active in some cell types or individuals.

ENCODE Project Writes Eulogy For Junk DNA

GENOMICS

[Figure: diagram of a chromosome region. Labeled features: hypersensitive sites; epigenetic modifications (CH3CO, CH3); long-range regulatory elements (enhancers, repressors/silencers, insulators); cis-regulatory elements (promoters, transcription factor binding sites); RNA polymerase; gene; transcript. Labeled assays: 5C, DNase-seq, FAIRE-seq, ChIP-seq, RNA-seq, and computational predictions and RT-PCR.]

Zooming in. A diagram of DNA in ever-greater detail shows how ENCODE’s various tests (gray boxes) translate DNA’s features into functional elements along a chromosome.

CREDIT: ADAPTED FROM THE ENCODE PROJECT CONSORTIUM, PLOS BIOLOGY 9, 4 (APRIL 2011)

Published by AAAS

Downloaded from http://www.sciencemag.org on September 10, 2012

ENCODE drives home, however, that there are many “genes” out there in which DNA codes for RNA, not a protein, as the end product. The big surprise of the pilot project was that 93% of the bases studied were transcribed into RNA; in the full genome, 76% is transcribed. ENCODE defined 8800 small RNA molecules and 9600 long noncoding RNA molecules, each of which is at least 200 bases long. Thomas Gingeras of Cold Spring Harbor Laboratory in New York has found that various ones home in on different cell compartments, as if they have fixed addresses where they operate. Some go to the nucleus, some to the nucleolus, and some to the cytoplasm, for example. “So there’s quite a lot of sophistication in how RNA works,” says Ewan Birney of the European Bioinformatics Institute in Hinxton, U.K., one of the key leaders of ENCODE (see p. 1162).

As a result of ENCODE, Gingeras and others argue that the fundamental unit of the genome and the basic unit of heredity should be the transcript (the piece of RNA decoded from DNA) and not the gene. “The project has played an important role in changing our concept of the gene,” Stamatoyannopoulos says.

Another way to test for functionality of DNA is to evaluate whether specific base sequences are conserved between species, or among individuals in a species. Previous studies have shown that 5% of the human genome is conserved across mammals, even though ENCODE studies implied that much more of the genome is functional. So MIT’s Lucas Ward and Kellis compared functional regions newly identified by ENCODE among multiple humans, sampling from the 1000 Genomes Project. Some DNA sequences not conserved between humans and other mammals were nonetheless very much preserved across multiple people, indicating that an additional 4% of the genome is newly under selection in the human lineage, they report in a paper published online by Science (http://scim.ag/WardKellis). Two such regions were near genes for nerve growth and the development of cone cells in the eye, which underlie distinguishing traits in humans. On the flip side, they also found that some supposedly conserved regions of the human genome, as highlighted by the comparison with 29 mammals, actually varied among humans, suggesting these regions were no longer functional.

Beyond transcription, DNA’s bases function in gene regulation through their interactions with transcription factors and other proteins. ENCODE carried out several tests to map where those proteins bind along the genome (Science, 25 May 2007, p. 1120). Two, DNase-seq and FAIRE-seq, gave an overview of the genome, identifying where the protein-DNA complex chromatin unwinds and a protein can hook up with the DNA, and were applied to multiple cell types. ENCODE’s DNase-seq found 2.89 million such sites in 125 cell types. Stamatoyannopoulos and his colleagues describe their more extensive DNase-seq studies in Science (p. 1190): His team examined 349 types of cells, including 233 60- to 160-day-old fetal tissue samples. Each type of cell had about 200,000 accessible locations, and there seemed to be at least 3.9 million regions where transcription factors can bind in the genome. Across all cell types, about 42% of the genome can be accessible, he and his colleagues report. In many cases, the assays were able to pinpoint the specific bases involved in binding.

Last year, Stamatoyannopoulos showed that these newly discovered functional regions sometimes overlap with specific DNA bases linked to higher or lower risks of various diseases, suggesting that the regulation of genes might be at the heart of these risk variations (Science, 27 May 2011, p. 1031). The work demonstrated how researchers could use ENCODE data to come up with new hypotheses about the link between genetics and a particular disorder. (The ENCODE analysis found that 12% of these bases, or SNPs, colocate with transcription factor binding sites and 34% are in open chromatin defined by the DNase-seq tests.) Now, in their new work published in Science, Stamatoyannopoulos’s lab has linked those regulatory regions to their specific target genes, homing in on the risk-enhancing ones. In addition, the group finds it can predict the cell type involved in a given disease. For example, the analysis fingered two types of T cells as pathogenic in Crohn’s disease, both of which are involved in this inflammatory bowel disorder. “We are informing disease studies in a way that would be very hard to do otherwise,” Birney says.
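The percentages quoted above (the share of disease-linked SNPs that fall in binding sites or open chromatin) come down to an interval-overlap count. The following is a minimal sketch of that idea, assuming made-up coordinates on a single chromosome and half-open intervals; the helper names `merge_intervals` and `fraction_in_regions` are illustrative, not part of any ENCODE tool, and the consortium's actual pipelines are not described in this article.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or touching half-open [start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def fraction_in_regions(positions, regions):
    """Fraction of positions that fall inside at least one region."""
    if not positions:
        return 0.0
    merged = merge_intervals(regions)
    starts = [start for start, _ in merged]
    hits = 0
    for pos in positions:
        # Index of the last merged interval whose start is <= pos.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < merged[i][1]:
            hits += 1
    return hits / len(positions)


# Toy, invented coordinates (not real data).
open_chromatin = [(100, 200), (150, 300), (1000, 1100)]
snp_positions = [120, 250, 500, 1050, 2000]

fraction = fraction_in_regions(snp_positions, open_chromatin)
print(f"{fraction:.0%} of toy SNPs fall in open chromatin")
```

Merging the regions first means each SNP is checked with one binary search rather than against every region. A genome-wide version would repeat this per chromosome.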

Another test, called ChIP-seq, uses an antibody to home in on a particular DNA-binding protein and helps pinpoint the locations along the genome where that protein works. To date, ENCODE has examined about 100 of the 1500 or so transcription factors and about 20 other DNA-binding proteins, including those involved in modifying the chromatin-associated proteins called histones. The binding sites found through ChIP-seq coincided with the sites mapped through FAIRE-seq and DNase-seq. Overall, 8% of the genome falls within a transcription factor binding site, a percentage that is expected to double once more transcription factors have been tested.

Yale’s Gerstein used these results to figure out all the interactions among the transcription factors studied and came up with a network view of how these regulatory proteins work. These transcription factors formed a three-layer hierarchy, with the ones at the top having the broadest effects and the ones in the middle working together to coregulate a common target gene, he and his colleagues report in Nature.

Using a technique called 5C, other researchers looked for places where DNA from distant regions of a chromosome, or even different chromosomes, interacted. They found that an average of 3.9 distal stretches of DNA linked up with the beginning of each gene. “Regulation is a 3D puzzle that has to be put together,” Gingeras says. “That’s what ENCODE is putting out on the table.”

To date, NHGRI has put $288 million toward ENCODE, including the pilot project, technology development, and ENCODE efforts for the mouse, nematode, and fruit fly. All together, more than 400 papers have been published by ENCODE researchers. Another 110 or more studies have used ENCODE data, says NHGRI molecular biologist Michael Pazin. Molecular biologist Mathieu Lupien of the University of Toronto in Canada authored one of those papers, a study looking at epigenetics and cancer. “ENCODE data were fundamental” to the work, he says. “The cost is definitely worth every single dollar.”

–ELIZABETH PENNISI

ENCODE By the Numbers

147 cell types studied

80% functional portion of human genome

20,687 protein-coding genes

18,400 RNA genes

1640 data sets

30 papers published this week

442 researchers

$288 million funding for pilot, technology, model organism, and current project

http://www.nature.com/encode/
