Cancer Mutations Across the Landscape

Curator: Larry H. Bernstein, MD, FCAP

This is an up-to-date article about the significance of mutations found in 12 major types of cancer.

Mutational landscape and significance across 12 major cancer types

Cyriac Kandoth1*, Michael D. McLellan1*, Fabio Vandin2, Kai Ye1,3, Beifang Niu1, Charles Lu1, et al.

1The Genome Institute, Washington University in St Louis, Missouri 63108, USA. 2Department of Computer Science, Brown University, Providence, Rhode Island 02912, USA. 3Department of Genetics, Washington University in St Louis, Missouri 63108, USA. 4Department of Medicine, Washington University in St Louis, Missouri 63108, USA. 5Siteman Cancer Center, Washington University in St Louis, Missouri 63108, USA. 6Department of Mathematics, Washington University in St Louis, Missouri 63108, USA.

Nature 502, 17 Oct 2013.      http://dx.doi.org/10.1038/nature12634

The Cancer Genome Atlas (TCGA) has used the latest sequencing and analysis methods to identify somatic variants across thousands of tumours. Here we present data and analytical results for point mutations and small insertions/deletions from 3,281 tumours across 12 tumour types as part of the TCGA Pan-Cancer effort. We

  1. illustrate the distributions of mutation frequencies, types and contexts across tumour types, and
  2. establish their links to tissues of origin, environmental/carcinogen influences, and DNA repair defects.

Using the integrated data sets, we identified 127 significantly mutated genes from well-known and emerging cellular processes in cancer:

  1. well-known processes (for example, mitogen-activated protein kinase, phosphatidylinositol-3-OH kinase, Wnt/β-catenin and receptor tyrosine kinase signalling pathways, and cell cycle control), and
  2. emerging processes (for example, histone, histone modification, splicing, metabolism and proteolysis).

The average number of mutations in these significantly mutated genes varies across tumour types;

  1. most tumours have two to six, indicating that the number of driver mutations required during oncogenesis is relatively small.
  2. Mutations in transcriptional factors/regulators show tissue specificity, whereas
  3. histone modifiers are often mutated across several cancer types.

Clinical association analysis identifies genes having a significant effect on survival, and

  • investigations of mutations with respect to clonal/subclonal architecture delineate their temporal orders during tumorigenesis.

Taken together, these results lay the groundwork for developing new diagnostics and individualizing cancer treatment.

Introduction

The advancement of DNA sequencing technologies now enables the processing of thousands of tumours of many types for systematic mutation discovery. This expansion of scope, coupled with appreciable progress in algorithms1–5, has led directly to characterization of significant functional mutations, genes and pathways6–18. Cancer encompasses more than 100 related diseases19, making it crucial to understand the commonalities and differences among various types and subtypes. TCGA was founded to address these needs, and its large data sets are providing unprecedented opportunities for systematic, integrated analysis.

We performed a systematic analysis of 3,281 tumours from 12 cancer types to investigate underlying mechanisms of cancer initiation and progression. We describe variable mutation frequencies and contexts and their associations with environmental factors and defects in DNA repair. We identify 127 significantly mutated genes (SMGs) from diverse signalling and enzymatic processes. The finding of a TP53-driven breast, head and neck, and ovarian cancer cluster with a dearth of other mutations in SMGs suggests common therapeutic strategies might be applied for these tumours. We determined interactions among mutations and correlated mutations in BAP1, FBXW7 and TP53 with detrimental phenotypes across several cancer types. The subclonal structure and transcription status of underlying somatic mutations reveal the trajectory of tumour progression in patients with cancer.

Standardization of mutation data

Stringent filters (Methods) were applied to ensure high quality mutation calls for 12 cancer types: breast adenocarcinoma (BRCA), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), uterine corpus endometrial carcinoma (UCEC), glioblastoma multiforme (GBM), head and neck squamous cell carcinoma (HNSC), colon and rectal carcinoma (COAD, READ), bladder urothelial carcinoma (BLCA), kidney renal clear cell carcinoma (KIRC), ovarian serous carcinoma (OV) and acute myeloid leukaemia (LAML; conventionally called AML) (Supplementary Table 1). A total of 617,354 somatic mutations, consisting of

  • 398,750 missense,
  • 145,488 silent,
  • 36,443 nonsense,
  • 9,778 splice site,
  • 7,693 non-coding RNA,
  • 523 non-stop/readthrough,
  • 15,141 frameshift insertions/deletions (indels) and
  • 3,538 inframe indels,

were included for downstream analyses (Supplementary Table 2).
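As a quick arithmetic check, the eight category counts listed above do sum to the reported total of 617,354. A minimal sketch in Python, with the figures taken directly from the text:

```python
# Somatic mutation counts by category, as reported in the text above.
counts = {
    "missense": 398_750,
    "silent": 145_488,
    "nonsense": 36_443,
    "splice site": 9_778,
    "non-coding RNA": 7_693,
    "non-stop/readthrough": 523,
    "frameshift indel": 15_141,
    "inframe indel": 3_538,
}

total = sum(counts.values())
print(total)  # 617354, matching the reported total
```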

Distinct mutation frequencies and sequence context

Figure 1a shows that AML has the lowest median mutation frequency and LUSC the highest (0.28 and 8.15 mutations per megabase (Mb), respectively). Besides AML, all types average over 1 mutation per Mb, substantially higher than in pediatric tumours20. Clustering21 illustrates that

  • mutation frequencies for KIRC, BRCA, OV and AML are normally distributed within a single cluster, whereas
  • other types have several clusters (for example, 5 and 6 clusters in UCEC and COAD/READ, respectively) (Fig. 1a and Supplementary Table 3a, b).

In UCEC, the largest patient cluster has a frequency of approximately 1.5 mutations per Mb, and

  • the cluster with the highest frequency is more than 150 times greater.

Multiple clusters suggest that factors other than age contribute to development in these tumours14,16. Indeed,

  • there is a significant correlation between high mutation frequency and mutations in DNA repair pathway genes (for example, PRKDC, TP53 and MSH6) (Supplementary Table 3c). Notably,
  • PRKDC mutations are associated with high frequency in BLCA, COAD/READ, LUAD and UCEC, whereas
  • TP53 mutations are associated with higher frequencies in AML, BLCA, BRCA, HNSC, LUAD, LUSC and UCEC (all P < 0.05).

Mutations in POLQ and POLE associate with high frequencies in multiple cancer types; POLE association in UCEC is consistent with previous observations14.

Comparison of spectra across the 12 types (Fig. 1b and Supplementary Table 3d) reveals that LUSC and LUAD contain increased C>A transversions, a signature of cigarette smoke exposure10. Sequence context analysis across 12 types revealed

  • the largest difference being in C>T transitions and C>G transversions (Fig. 1c).

The frequency of thymine 1-bp (base pair) upstream of C>G transversions is markedly higher in BLCA, BRCA and HNSC than in other cancer types (Extended Data Fig. 1). GBM, AML, COAD/READ and UCEC have similar contexts in that

  • the proportions of guanine 1 base downstream of C>T transitions are between 59% and 67%, substantially higher than the approximately 40% in other cancer types.

Higher frequencies of transition mutations at CpG in gastrointestinal tumours, including colorectal, were previously reported22. We found three additional cancer types (GBM, AML and UCEC) clustered in the C>T mutation at CpG, consistent with previous findings of

  • aberrant DNA methylation in endometrial cancer23 and glioblastoma24.

BLCA has a unique signature for C>T transitions compared to the other types (enriched for TC) (Extended Data Fig. 1).
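The six-category spectrum used here (Fig. 1b) collapses every substitution onto a pyrimidine (C or T) reference by strand complementation, then labels it as a transition (Ti) or transversion (Tv). A minimal sketch of that bookkeeping (not the authors' code; the function name is illustrative):

```python
# Collapse a ref>alt substitution into one of the six standard
# transition/transversion (Ti/Tv) categories by complementing
# purine-reference changes, as in mutation-spectrum plots.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def titv_category(ref: str, alt: str) -> str:
    # Represent every substitution with a pyrimidine (C or T) reference base.
    if ref in ("A", "G"):
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    kind = "Ti" if (ref, alt) in {("C", "T"), ("T", "C")} else "Tv"
    return f"{ref}>{alt} ({kind})"

print(titv_category("G", "T"))  # C>A (Tv), the smoking-associated signature seen in LUSC/LUAD
print(titv_category("C", "T"))  # C>T (Ti)
```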

Significantly mutated genes

Genes under positive selection, either in individual or multiple tumour types, tend to display mutation frequencies above background. Our statistical analysis3, guided by expression data and curation (Methods), identified 127 such genes (SMGs; Supplementary Table 4). These SMGs are involved in a wide range of cellular processes, broadly classified into 20 categories (Fig. 2), including

  • transcription factors/regulators, histone modifiers, genome integrity, receptor tyrosine kinase signalling, cell cycle, mitogen-activated protein kinase (MAPK) signalling, phosphatidylinositol-3-OH kinase (PI(3)K) signalling, Wnt/β-catenin signalling, histones, ubiquitin-mediated proteolysis, and splicing (Fig. 2).

The identification of MAPK, PI(3)K and Wnt/β-catenin signalling pathways is consistent with classical cancer studies. Notably, newer categories (for example, splicing, transcription regulators, metabolism, proteolysis and histones) emerge as exciting guides for the development of new therapeutic targets. Genes categorized as histone modifiers (Z = 0.57), PI(3)K signalling (Z = 1.03), and genome integrity (Z = 0.66) all relate to more than one cancer type, whereas

  • transcription factor/regulator (Z = 0.40), TGF-β signalling (Z = 0.66), and Wnt/β-catenin signalling (Z = 0.55) genes tend to associate with single types (Methods).

Notably, 3,053 out of 3,281 total samples (93%) across the Pan-Cancer collection had at least one non-synonymous mutation in at least one SMG. The average number of point mutations and small indels in these genes varies across tumour types, with the highest (~6 mutations per tumour) in UCEC, LUAD and LUSC, and the lowest (~2 mutations per tumour) in AML, BRCA, KIRC and OV. This suggests that the numbers of both cancer-related genes (only 127 identified in this study) and cooperating driver mutations required during oncogenesis are small (most cases only had 2–6) (Fig. 3), although large-scale structural rearrangements were not included in this analysis.

Common mutations

The most frequently mutated gene in the Pan-Cancer cohort is TP53 (42% of samples). Its mutations predominate in serous ovarian (95%) and serous endometrial carcinomas (89%) (Fig. 2). TP53 mutations are also associated with basal subtype breast tumours. PIK3CA is the second most commonly mutated gene, occurring frequently (>10%) in most cancer types except OV, KIRC, LUAD and AML. PIK3CA mutations are frequent in UCEC (52%) and BRCA (33.6%), being specifically enriched in luminal subtype tumours. Tumours lacking PIK3CA mutations often had mutations in PIK3R1, with the highest occurrences in UCEC (31%) and GBM (11%) (Fig. 2).

Many cancer types carried mutations in chromatin re-modelling genes. In particular, histone-lysine N-methyltransferase genes (MLL2 (also known as KMT2D), MLL3 (KMT2C) and MLL4 (KMT2B)) cluster in bladder, lung and endometrial cancers, whereas the lysine (K)-specific demethylase KDM5C is prevalently mutated in KIRC (7%). Mutations in ARID1A are frequent in BLCA, UCEC, LUAD and LUSC, whereas mutations in ARID5B predominate in UCEC (10%) (Fig. 2).

Fig. 1 | Distribution of mutation frequencies across 12 cancer types.

Dashed grey and solid white lines denote average across cancer types and median for each type, respectively. b, Mutation spectrum of six transition (Ti) and transversion (Tv) categories for each cancer type. c, Hierarchically clustered mutation context (defined by the proportion of A, T, C and G nucleotides within ±2 bp of the variant site) for six mutation categories. Cancer types correspond to colours in a. Colour denotes degree of correlation: yellow (r = 0.75) and red (r = 1).

Fig. 2 | The 127 SMGs from 20 cellular processes in cancer identified in the Pan-Cancer analysis, with the highest mutation percentage in each gene among the 12 cancer types. (not shown)

Fig. 3 | Distribution of mutations in 127 SMGs across the Pan-Cancer cohort.

Box plot displays median numbers of non-synonymous mutations, with outliers shown as dots. In total, 3,210 tumours were used for this analysis (hypermutators excluded).

Figure 4 | Unsupervised clustering based on mutation status of SMGs. Tumours having no mutation or more than 500 mutations were excluded. A mutation status matrix was constructed for 2,611 tumours. Major clusters of mutations detected in UCEC, COAD, GBM, AML, KIRC, OV and BRCA were highlighted.
Complete gene list shown in Extended Data Fig. 3.  (not shown)

Fig. 5 | Driver initiation and progression mutations and tumour clonal architecture.

Survival Analysis

We examined which genes correlate with survival using the Cox proportional hazards model, first analysing individual cancer types using age and gender as covariates; an average of 2 genes (range: 0–4) with mutation frequency ≥2% were significant (P ≤ 0.05) in each type (Supplementary Table 10a and Extended Data Fig. 6). KDM6A and ARID1A mutations correlate with better survival in BLCA (P = 0.03, hazard ratio (HR) = 0.36, 95% confidence interval (CI): 0.14–0.92) and UCEC (P = 0.03, HR = 0.11, 95% CI: 0.01–0.84), respectively, whereas mutations in SETBP1, recently identified with worse prognosis in atypical chronic myeloid leukaemia (aCML)31, have a significant detrimental effect in HNSC (P = 0.006, HR = 3.21, 95% CI: 1.39–7.44). BAP1 strongly correlates with poor survival (P = 0.00079, HR = 2.17, 95% CI: 1.38–3.41) in KIRC. Conversely, BRCA2 mutations (P = 0.02, HR = 0.31, 95% CI: 0.12–0.85) associate with better survival in ovarian cancer, consistent with previous reports32,33; BRCA1 mutations showed a positive correlation with better survival, but did not reach significance here.

We extended our survival analysis across cancer types, restricting our attention to the subset of 97 SMGs whose mutations appeared in ≥2% of patients having survival data in ≥2 tumour types. Taking type, age and gender as covariates, we found 7 significant genes: BAP1, DNMT3A, HGF, KDM5C, FBXW7, BRCA2 and TP53 (Extended Data Table 1). In particular, BAP1 was highly significant (P = 0.00013, HR = 2.20, 95% CI: 1.47–3.29, more than 53 mutated tumours out of 888 total), with mutations associating with detrimental outcome in four tumour types and notable associations in KIRC (P = 0.00079), consistent with a recent report28, and in UCEC (P = 0.066). Mutations in several other genes are detrimental, including DNMT3A (HR = 1.59), previously identified with poor prognosis in AML34, and KDM5C (HR = 1.63), FBXW7 (HR = 1.57) and TP53 (HR = 1.19). TP53 has significant associations with poor outcome in KIRC (P = 0.012), AML (P = 0.0007) and HNSC (P = 0.00007). Conversely, BRCA2 (P = 0.05, HR = 0.62, 95% CI: 0.38 to 0.99) correlates with survival benefit in six types, including OV and UCEC (Supplementary Table 10a, b). IDH1 mutations are associated with improved prognosis across the Pan-Cancer set (HR = 0.67, P = 0.16) and also in GBM (HR = 0.42, P = 0.09) (Supplementary Table 10a, b), consistent with previous work35.
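The hazard ratios and confidence intervals quoted throughout this section follow directly from the fitted Cox coefficient β: HR = exp(β), with 95% CI exp(β ± 1.96·SE). A small illustrative sketch using the KIRC BAP1 figures above; the standard error is back-calculated from the reported interval, so this is a consistency check rather than a re-analysis:

```python
import math

def hr_with_ci(beta: float, se: float) -> tuple:
    """Hazard ratio and 95% CI from a Cox coefficient and its standard error."""
    lo, hi = beta - 1.96 * se, beta + 1.96 * se
    return math.exp(beta), math.exp(lo), math.exp(hi)

# KIRC BAP1: beta and SE back-calculated from HR = 2.17, 95% CI 1.38-3.41
# as reported above (illustrative only, not a refit of the data).
beta = math.log(2.17)
se = (math.log(3.41) - math.log(1.38)) / (2 * 1.96)

hr, lo, hi = hr_with_ci(beta, se)
print(f"HR = {hr:.2f}, 95% CI: {lo:.2f}-{hi:.2f}")  # HR = 2.17, 95% CI: 1.38-3.41
```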

Driver mutations and tumour clonal architecture

To understand the temporal order of somatic events, we analysed the variant allele fraction (VAF) distribution of mutations in SMGs across AML, BRCA and UCEC (Fig. 5a and Supplementary Table 11a) and other tumour types (Extended Data Fig. 7). To minimize the effect of copy number alterations, we focused on mutations in copy neutral segments. Mutations in TP53 have higher VAFs on average in all three cancer types, suggesting early appearance during tumorigenesis.

It is worth noting that copy neutral loss of heterozygosity is commonly found in classical tumour suppressors such as TP53, BRCA1, BRCA2 and PTEN, leading to increased VAFs in these genes. In AML, DNMT3A (permutation test P = 0), RUNX1 (P = 0.0003) and SMC3 (P = 0.05) have significantly higher VAFs than average among SMGs (Fig. 5a and Supplementary Table 11b). In breast cancer, AKT1, CBFB, MAP2K4, ARID1A, FOXA1 and PIK3CA have relatively high average VAFs. For endometrial cancer, multiple SMGs (for example, PIK3CA, PIK3R1, PTEN, FOXA2 and ARID1A) have similar median VAFs. Conversely, KRAS and/or NRAS mutations tend to have lower VAFs in all three tumour types (Fig. 5a), suggesting NRAS (for example, P = 0 in AML) and KRAS (for example, P = 0.02 in BRCA) have a progression role in a subset of AML, BRCA and UCEC tumours. For all three cancer types, we clearly observed a shift towards higher expression VAFs in SMGs versus non-SMGs, most apparent in BRCA and UCEC (Extended Data Fig. 8a and Methods).
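The VAF itself is simple bookkeeping over read counts at each variant site: in a copy-neutral diploid region, a clonal heterozygous mutation is expected near VAF 0.5 (scaled down by tumour purity), while subclonal mutations sit lower. A minimal sketch with made-up read counts:

```python
# Variant allele fraction (VAF): alt-supporting reads / total reads at a site.
def vaf(alt_reads: int, ref_reads: int) -> float:
    return alt_reads / (alt_reads + ref_reads)

# Hypothetical counts for illustration: a clonal-looking call vs a subclonal one.
print(round(vaf(48, 52), 2))  # 0.48, consistent with a clonal heterozygous event
print(round(vaf(12, 88), 2))  # 0.12, consistent with a subclonal event
```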

Previous analysis using whole-genome sequencing (WGS) detected subclones in approximately 50% of AML cases15,36,37; however, analysis is difficult using AML exome data owing to its relatively few coding mutations. Using 50 AML WGS cases, sciClone (http://github.com/genome/sciclone) detected DNMT3A mutations in the founding clone for 100% (8 out of 8) of cases and NRAS mutations in the subclone for 75% (3 out of 4) of cases (Extended Data Fig. 8b). Among 304 and 160 of BRCA and UCEC tumours, respectively, with enough coding mutations for clustering, 35% of BRCA and 44% of UCEC tumours contained subclones. Our analysis provides a lower bound for tumour heterogeneity, because only coding mutations were used for clustering. In BRCA, 95% (62 out of 65) of cases contained PIK3CA mutations in the founding clone, whereas 33% (3 out of 9) of cases had MLL3 mutations in the subclone. Similar patterns were found in UCEC tumours, with 96% (65 out of 68) and 95% (62 out of 65) of tumours containing PIK3CA and PTEN mutations, respectively, in the founding clone, and 9% (2 out of 22) of KRAS and 14% (1 out of 7) of NRAS mutations in the subclone (Extended Data Fig. 8b and Supplementary Table 12).

Mutation context (-2 to +2 bp) was calculated for each somatic variant in each mutation category, and hierarchical clustering was then performed using the pairwise mutation context correlation across all cancer types. The mutational significance in cancer (MuSiC)3 package was used to identify significant genes for both individual tumour types and the Pan-Cancer collective. The R function ‘hclust’ was used for complete-linkage hierarchical clustering across mutations and samples, and Dendrix30 was used to identify sets of approximately mutually exclusive mutations. Cross-cancer survival analysis was based on the Cox proportional hazards model, as implemented in the R package ‘survival’ (http://cran.r-project.org/web/packages/survival/), and the sciClone algorithm (http://github.com/genome/sciclone) generated mutation clusters using point mutations from copy number neutral segments. A complete description of the materials and methods used to generate this data set and its results is provided in the Methods.
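The ±2 bp mutation context described above amounts to a substring extraction around each variant position. A minimal sketch on a toy reference string with 0-based coordinates (this is not the MuSiC implementation, just an illustration of the definition):

```python
# Extract the -2..+2 bp sequence context around a variant position, the
# per-variant quantity that feeds the context-correlation clustering.
def mutation_context(reference: str, pos: int, flank: int = 2) -> str:
    start = max(0, pos - flank)  # clamp at the start of the sequence
    return reference[start:pos] + reference[pos] + reference[pos + 1:pos + 1 + flank]

ref = "AACGTTGCA"  # toy reference sequence
print(mutation_context(ref, 4))  # CGTTG: two bases either side of the T at index 4
print(mutation_context(ref, 0))  # AAC: context truncated at the sequence boundary
```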

References (20 of 38)

  1. Larson, D. E. et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 28, 311–317 (2012).
  2. Koboldt, D. C. et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 22, 568–576 (2012).
  3. Dees, N. D. et al. MuSiC: Identifying mutational significance in cancer genomes. Genome Res. 22, 1589–1598 (2012).
  4. Roth, A. et al. JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data. Bioinformatics 28, 907–913 (2012).
  5. Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nature Biotechnol. 31, 213–219 (2013).
  6. Jones, S. et al. Core signaling pathways in human pancreatic cancers revealed by global genomic analyses. Science 321, 1801–1806 (2008).
  7. Parsons, D. W. et al. An integrated genomic analysis of human glioblastoma multiforme. Science 321, 1807–1812 (2008).
  8. Sjöblom, T. et al. The consensus coding sequences of human breast and colorectal cancers. Science 314, 268–274 (2006).
  9. The Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455, 1061–1068 (2008).
  10. Ding, L. et al. Somatic mutations affect key pathways in lung adenocarcinoma. Nature 455, 1069–1075 (2008).
  11. Wood, L. D. et al. The genomic landscapes of human breast and colorectal cancers. Science 318, 1108–1113 (2007).
  12. The Cancer Genome Atlas Research Network. Integrated genomic analyses of ovarian carcinoma. Nature 474, 609–615 (2011).
  13. The Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70 (2012).
  14. Cancer Genome Atlas Research Network. Integrated genomic characterization of endometrial carcinoma. Nature 497, 67–73 (2013).
  15. The Cancer Genome Atlas Research Network. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N. Engl. J. Med. 368, 2059–2074 (2013).
  16. The Cancer Genome Atlas Network. Comprehensive molecular characterization of human colon and rectal cancer. Nature 487, 330–337 (2012).
  17. Ellis, M. J. et al. Whole-genome analysis informs breast cancer response to aromatase inhibition. Nature 486, 353–360 (2012).
  18. The Cancer Genome Atlas Research Network. Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature 499, 43–49 (2013).
  19. Hanahan, D. & Weinberg, R. A. The hallmarks of cancer. Cell 100, 57–70 (2000).
  20. Downing, J. R. et al. The Pediatric Cancer Genome Project. Nature Genet. 44, 619–622 (2012).


CRACKING THE CODE OF HUMAN LIFE: The Birth of BioInformatics & Computational Genomics – Part IIB

Curator: Larry H Bernstein, MD, FCAP

Part I: The Initiation and Growth of Molecular Biology and Genomics – Part I From Molecular Biology to Translational Medicine: How Far Have We Come, and Where Does It Lead Us?


Part II: CRACKING THE CODE OF HUMAN LIFE is divided into a three part series.

Part IIA. “CRACKING THE CODE OF HUMAN LIFE: Milestones along the Way” reviews the Human Genome Project and the decade beyond.

https://pharmaceuticalintelligence.com/2013/02/12/cracking-the-code-of-human-life-milestones-along-the-way/

Part IIB. “CRACKING THE CODE OF HUMAN LIFE: The Birth of BioInformatics & Computational Genomics” lays out the manifold multivariate systems-analysis tools that have moved the science forward to a ground that supports clinical application.

https://pharmaceuticalintelligence.com/2013/02/13/cracking-the-code-of-human-life-the-birth-of-bioinformatics-and-computational-genomics/

Part IIC. “CRACKING THE CODE OF HUMAN LIFE: Recent Advances in Genomic Analysis and Disease” will extend the discussion to advances in the management of patients as well as providing a roadmap for pharmaceutical drug targeting.

https://pharmaceuticalintelligence.com/2013/02/14/cracking-the-code-of-human-life-recent-advances-in-genomic-analysis-and-disease/

To be followed by:
Part III will conclude with Ubiquitin, its role in Signaling and Regulatory Control.

Part IIB. “CRACKING THE CODE OF HUMAN LIFE: The Birth of BioInformatics & Computational Genomics” is a continuation of a previous discussion on the role of genomics in discovery of therapeutic targets, titled Directions for Genomics in Personalized Medicine, which focused on:

  • key drivers of cellular proliferation,
  • stepwise mutational changes coinciding with cancer progression, and
  • potential therapeutic targets for reversal of the process.

It is a direct extension of The Initiation and Growth of Molecular Biology and Genomics – Part I 

These articles trace the web-like connectivity among scientific discoveries, as significant findings have led to novel hypotheses and many expectations over the last 75 years. This largely post-WWII revolution has driven our understanding of biological and medical processes at an exponential pace, owing to successive discoveries of
  • chemical structure,
  • the basic building blocks of DNA and proteins, of
  • nucleotide and protein-protein interactions,
  • protein folding,
  • allostericity,
  • genomic structure,
  • DNA replication,
  • nuclear polyribosome interaction, and
  • metabolic control.


In addition, the emergence of methods for

  • copying,
  • removal, and
  • insertion of genetic material,

together with

  • improvements in structural analysis and
  • developments in applied mathematics,

has transformed the research framework.

This last point, developments in applied mathematics, is the focus of this article:

CRACKING THE CODE OF HUMAN LIFE: The Birth of BioInformatics & Computational Genomics – Part IIB

Computational Genomics

1. Three-Dimensional Folding and Functional Organization Principles of The Drosophila Genome

Sexton T, Yaffe E, Kenigsberg E, Bantignies F, …, Cavalli G. Institut de Génétique Humaine, Montpellier GenomiX, and Weizmann Institute, France and Israel. Cell 2012; 148(3): 458–472.
http://dx.doi.org/10.1016/j.cell.2012.01.010
http://www.cell.com/retrieve/pii/S0092867412000165
http://www.ncbi.nlm.nih.gov/pubmed/22265598

Chromosomes are the physical realization of genetic information and thus form the basis for its readout and propagation.


Here we present a high-resolution chromosomal contact map derived from

  • a modified genome-wide chromosome conformation capture approach applied to Drosophila embryonic nuclei.

We find that

  • the entire genome is linearly partitioned into well-demarcated physical domains that overlap extensively with active and repressive epigenetic marks.
  • Chromosomal contacts are hierarchically organized between domains.
  • Global modeling of contact density and clustering of domains show that inactive domains are condensed and confined to their chromosomal territories, whereas
  • active domains reach out of the territory to form remote intra- and interchromosomal contacts.

Moreover, we systematically identify

  • specific long-range intrachromosomal contacts between Polycomb-repressed domains.

Together, these observations

  • allow for quantitative prediction of the Drosophila chromosomal contact map,
  • laying the foundation for detailed studies of chromosome structure and function in a genetically tractable system.


2A. Architecture Reveals Genome’s Secrets

Three-dimensional genome maps – Human chromosome

Genome sequencing projects have provided rich troves of information about

  • stretches of DNA that regulate gene expression, as well as
  • how different genetic sequences contribute to health and disease.

But these studies miss a key element of the genome—its spatial organization—which has long been recognized as an important regulator of gene expression.

  • Regulatory elements often lie thousands of base pairs away from their target genes, and recent technological advances are allowing scientists to begin examining
  • how distant chromosome locations interact inside a nucleus.
  • The creation and function of 3-D genome organization, some say, is the next frontier of genetics.

Mapping and sequencing may be completely separate processes. For example, it’s possible to determine the location of a gene—to “map” the gene—without sequencing it. Thus, a map may tell you nothing about the sequence of the genome, and a sequence may tell you nothing about the map. But the landmarks on a map are DNA sequences, and mapping is the cousin of sequencing. A map of a sequence might look like this:

    ----GCC----------------CCCC----

On this map, GCC is one landmark; CCCC is another. Here, the sequence itself serves as a landmark on the map. In general, particularly for humans and other species with large genomes,

  • creating a reasonably comprehensive genome map is quicker and cheaper than sequencing the entire genome.
  • mapping involves less information to collect and organize than sequencing does.

Completed in 2003, the Human Genome Project (HGP) was a 13-year project. The goals were:

  • identify all the approximately 20,000-25,000 genes in human DNA,
  • determine the sequences of the 3 billion chemical base pairs that make up human DNA,
  • store this information in databases,
  • improve tools for data analysis,
  • transfer related technologies to the private sector, and
  • address the ethical, legal, and social issues (ELSI) that may arise from the project.

Though the HGP is finished, analyses of the data will continue for many years. By licensing technologies to private companies and awarding grants for innovative research, the project catalyzed the multibillion-dollar U.S. biotechnology industry and fostered the development of new medical applications.

When genes are expressed, their sequences are first converted into messenger RNA transcripts, which can be isolated in the form of complementary DNAs (cDNAs). A small portion of each cDNA sequence is all that is needed to develop unique gene markers, known as sequence tagged sites or STSs, which can be detected using the polymerase chain reaction (PCR).

To construct a transcript map, cDNA sequences from a master catalog of human genes were distributed to mapping laboratories in North America, Europe, and Japan. These cDNAs were converted to STSs and their physical locations on chromosomes determined on one of two radiation hybrid (RH) panels or a yeast artificial chromosome (YAC) library containing human genomic DNA. This mapping data was integrated relative to the human genetic map and then cross-referenced to cytogenetic band maps of the chromosomes. (Further details are available in the accompanying article in the 25 October issue of SCIENCE.)

Tremendous progress has been made in the mapping of human genes, a major milestone in the Human Genome Project. Apart from its utility in advancing our understanding of the genetic basis of disease, it provides a framework and focus for accelerated sequencing efforts by highlighting key landmarks (gene-rich regions) of the chromosomes. The construction of this map has been possible through the cooperative efforts of an international consortium of scientists who provide equal, full and unrestricted access to the data for the advancement of biology and human health.

There are two types of maps: the genetic linkage map and the physical map. The genetic linkage map shows the arrangement of genes and genetic markers along the chromosomes as calculated by the frequency with which they are inherited together. The physical map is a representation of the chromosomes, providing the physical distance between landmarks on the chromosome, ideally measured in nucleotide bases. Physical maps can be divided into three general types: chromosomal or cytogenetic maps, radiation hybrid (RH) maps, and sequence maps.

2B. Genome-nuclear lamina interactions and gene regulation.

Kind J, van Steensel B. Division of Gene Regulation, Netherlands Cancer Institute, Amsterdam, The Netherlands.
The nuclear lamina, a filamentous protein network that coats the inner nuclear membrane, has long been thought to interact with specific genomic loci and regulate their expression. Molecular mapping studies have now identified
  • large genomic domains that are in contact with the lamina.
Genes in these domains are typically repressed, and artificial tethering experiments indicate that
  • the lamina can actively contribute to this repression.
Furthermore, the lamina indirectly controls gene expression in the nuclear interior by sequestration of certain transcription factors.
Molecular maps of the reorganization of genome-nuclear lamina interactions during differentiation.
Peric-Hupkes D, Meuleman W, Pagie L, Bruggeman SW, Solovei I, …, van Steensel B. Division of Gene Regulation, Netherlands Cancer Institute, Amsterdam, The Netherlands.
Mol Cell. 2010; 38(4):603–613.          http://dx.doi.org/10.1016/j.molcel.2010.03.016
To visualize three-dimensional organization of chromosomes within the nucleus, we generated high-resolution maps of genome-nuclear lamina interactions during subsequent differentiation of mouse embryonic stem cells via lineage-committed neural precursor cells into terminally differentiated astrocytes.  A basal chromosome architecture present in embryonic stem cells is cumulatively altered at hundreds of sites during lineage commitment and subsequent terminal differentiation. This remodeling involves both
  • individual transcription units and multigene regions and
  • affects many genes that determine cellular identity.
  • Genes that move away from the lamina are concomitantly activated;
  • others remain inactive yet become unlocked for activation in a subsequent differentiation step.

Thus, lamina-genome interactions are widely involved in the control of gene expression programs during lineage commitment and terminal differentiation.

Highlights
  • Various cell types share a core architecture of genome-nuclear lamina interactions
  • During differentiation, hundreds of genes change their lamina interactions
  • Changes in lamina interactions reflect cell identity
  • Release from the lamina may unlock some genes for activation

Fractal “globule”

About 10 years ago, just as the human genome project was completing its first draft sequence, Dekker pioneered a new technique, called chromosome conformation capture (3C), that allowed researchers to get a glimpse of how chromosomes are arranged relative to each other in the nucleus. The technique relies on the physical cross-linking of chromosomal regions that lie in close proximity to one another. The regions are then sequenced to identify which regions have been cross-linked. In 2009, using a high-throughput version of this basic method, called Hi-C, Dekker and his collaborators discovered that the human genome appears to adopt a "fractal globule" conformation:

  • a manner of crumpling without knotting.


In the last three years, Job Dekker and others have advanced the technology even further, allowing them to paint a more refined picture of how the genome folds, and how this influences gene expression and disease states. Dekker's 2009 findings were a breakthrough in modeling genome folding, but the resolution, about 1 million base pairs, was too crude to allow scientists to really understand how genes interacted with specific regulatory elements. The researchers report two striking findings.

First, the human genome is organized into two separate compartments, keeping

  • active genes separate and accessible
  • while sequestering unused DNA in a denser storage compartment.
  • Chromosomes snake in and out of the two compartments repeatedly
  • as their DNA alternates between active, gene-rich and inactive, gene-poor stretches.

Second, at a finer scale, the genome adopts an unusual organization known in mathematics as a “fractal.” The specific architecture the scientists found, called

  • a “fractal globule,” enables the cell to pack DNA incredibly tightly —

the information density in the nucleus is trillions of times higher than on a computer chip, while avoiding the knots and tangles that might interfere with the cell's ability to read its own genome. Moreover, the DNA can easily unfold and refold during

  • gene activation,
  • gene repression, and
  • cell replication.

Dekker and his colleagues discovered, for example, that chromosomes can be divided into folding domains—megabase-long segments within which

  • genes and regulatory elements associate more often with one another than with other chromosome sections.

The DNA forms loops within the domains that bring a gene into close proximity with a specific regulatory element at a distant location along the chromosome. Another group, that of molecular biologist Bing Ren at the University of California, San Diego, published a similar finding in the same issue of Nature.  Dekker thinks the discovery of [folding] domains will be one of the most fundamental [genetics] discoveries of the last 10 years. The big questions now are

  • how these domains are formed, and
  • what determines which elements are looped into proximity.

“By breaking the genome into millions of pieces, we created a spatial map showing how close different parts are to one another,” says co-first author Nynke van Berkum, a postdoctoral researcher at UMass Medical School in Dekker‘s laboratory. “We made a fantastic three-dimensional jigsaw puzzle and then, with a computer, solved the puzzle.”

Lieberman-Aiden, van Berkum, Lander, and Dekker's co-authors are Bryan R. Lajoie of UMMS; Louise Williams, Ido Amit, and Andreas Gnirke of the Broad Institute; Maxim Imakaev and Leonid A. Mirny of MIT; Tobias Ragoczy, Agnes Telling, and Mark Groudine of the Fred Hutchinson Cancer Research Center and the University of Washington; Peter J. Sabo, Michael O. Dorschner, Richard Sandstrom, M.A. Bender, and John Stamatoyannopoulos of the University of Washington; and Bradley Bernstein of the Broad Institute and Harvard Medical School.

2C. Three-dimensional structure of the human genome

Lieberman-Aiden et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science, 2009; DOI: 10.1126/science.1181369.
Harvard University (2009, October 11). 3-D Structure Of Human Genome: Fractal Globule Architecture Packs Two Meters Of DNA Into Each Cell. ScienceDaily.   Retrieved February 2, 2013, from        http://www.sciencedaily.com/releases/2009/10/091008142957

The paper describes a new technology, called Hi-C, applied to answer the thorny question of how each of our cells stows some three billion base pairs of DNA while maintaining access to functionally crucial segments. It comes from a team led by scientists at Harvard University, the Broad Institute of Harvard and MIT, the University of Massachusetts Medical School, and the Massachusetts Institute of Technology. "We've long known that on a small scale, DNA is a double helix," says co-first author Erez Lieberman-Aiden, a graduate student in the Harvard-MIT Division of Health Science and Technology and a researcher at Harvard's School of Engineering and Applied Sciences and in the laboratory of Eric Lander at the Broad Institute. "But if the double helix didn't fold further, the genome in each cell would be two meters long. Scientists have not really understood how the double helix folds to fit into the nucleus of a human cell, which is only about a hundredth of a millimeter in diameter. This new approach enabled us to probe exactly that question."

The mapping technique that Aiden and his colleagues have come up with bridges a crucial gap in knowledge—between what goes on at the smallest levels of genetics (the double helix of DNA and the base pairs) and the largest levels (the way DNA is gathered up into the 23 chromosomes that contain much of the human genome). The intermediate level, on the order of thousands or millions of base pairs, has remained murky.  As the genome is so closely wound, base pairs in one end can be close to others at another end in ways that are not obvious merely by knowing the sequence of base pairs. Borrowing from work that was started in the 1990s, Aiden and others have been able to figure out which base pairs have wound up next  to one another. From there, they can begin to reconstruct the genome—in three dimensions.

4C profiles validate the Hi-C genome-wide map

Even as the multi-dimensional mapping techniques remain in their early stages, their importance in basic biological research is becoming ever more apparent. "The three-dimensional genome is a powerful thing to know," Aiden says. "A central mystery of biology is the question of how different cells perform different functions—despite the fact that they share the same genome." How does a liver cell, for example, "know" to perform its liver duties when it contains the same genome as a cell in the eye? As Aiden and others reconstruct the trail of letters into a three-dimensional entity, they have begun to see that the way the genome is folded helps determine which genes are active.

2D. "Mr. President; The Genome is Fractal!"

Eric Lander (Science Adviser to the President and Director of the Broad Institute) et al. delivered the message on the cover of Science Magazine (Oct. 9, 2009), and the International HoloGenomics Society generated interest in it at a September meeting.

First, it may seem to be trivial to rectify the statement in “About cover” of Science Magazine by AAAS.

  • The statement “the Hilbert curve is a one-dimensional fractal trajectory” needs mathematical clarification.

The mathematical concept of a Hilbert space, named after David Hilbert, generalizes the notion of Euclidean space. It extends the methods of vector algebra and calculus from the two-dimensional Euclidean plane and three-dimensional space to spaces with any finite or infinite number of dimensions. A Hilbert space is an abstract vector space possessing the structure of an inner product that allows length and angle to be measured. Furthermore, Hilbert spaces must be complete, a property that stipulates the existence of enough limits in the space to allow the techniques of calculus to be used. A Hilbert curve (also known as a Hilbert space-filling curve) is a continuous fractal space-filling curve first described by the German mathematician David Hilbert in 1891, as a variant of the space-filling curves discovered by Giuseppe Peano in 1890. For multidimensional databases, Hilbert order has been proposed to be used instead of Z order because it has better locality-preserving behavior.

Representation as Lindenmayer system
The Hilbert Curve can be expressed by a rewrite system (L-system).

Alphabet : A, B

Constants : F + –

Axiom : A

Production rules:

A → – B F + A F A + F B –

B → + A F – B F B – F A +

Here, F means “draw forward”, – means “turn left 90°”, and + means “turn right 90°” (see turtle graphics).
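The production rules above can be expanded and traced directly. The short Python sketch below (an illustration written for this post, not code from any cited source) rewrites the axiom for a given number of iterations and interprets the result with exactly the turtle semantics just described:

```python
def expand_hilbert(iterations):
    """Expand the Hilbert-curve L-system: axiom A, with rules
    A -> -BF+AFA+FB-  and  B -> +AF-BFB-FA+ ."""
    rules = {"A": "-BF+AFA+FB-", "B": "+AF-BFB-FA+"}
    s = "A"
    for _ in range(iterations):
        s = "".join(rules.get(ch, ch) for ch in s)
    return s

def trace(commands):
    """Interpret the string with turtle semantics on a grid:
    F = draw forward one step, '-' = turn left 90 degrees,
    '+' = turn right 90 degrees. A and B are ignored when drawing.
    Returns the list of lattice points visited."""
    x, y, dx, dy = 0, 0, 1, 0
    points = [(x, y)]
    for ch in commands:
        if ch == "F":
            x, y = x + dx, y + dy
            points.append((x, y))
        elif ch == "-":              # left turn
            dx, dy = -dy, dx
        elif ch == "+":              # right turn
            dx, dy = dy, -dx
    return points

# The order-2 curve visits all 16 cells of a 4x4 grid exactly once
pts = trace(expand_hilbert(2))
print(len(pts), len(set(pts)))   # 16 16
```

At order n the traced path visits all 4^n cells of a 2^n by 2^n grid without ever crossing itself, which is precisely the self-avoiding, space-filling property the discussion below relies on.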


While the paper itself does not make this statement, the new editorship of the AAAS magazine might be even more advanced had the previous editorship not rejected, without review, a manuscript by 20+ founders of the (formerly) International PostGenetics Society in December 2006.

Second, it may not be sufficiently clear for the reader that the reasonable requirement for the DNA polymerase to crawl along a “knot-free” (or “low knot”) structure does not need fractals. A “knot-free” structure could be spooled by an ordinary “knitting globule” (such that the DNA polymerase does not bump into a “knot” when duplicating the strand; just like someone knitting can go through the entire thread without encountering an annoying knot): Just to be “knot-free” you don’t need fractals. Note, however, that

  • the "strand" can be accessed only at its beginning – it is impossible, for example, to pluck a segment from deep inside the "globule".

This is where certain fractals provide a major advantage – that could be the “Eureka” moment for many readers. For instance,

  • the mentioned Hilbert-curve is not only “knot free” –
  • but provides an easy access to “linearly remote” segments of the strand.

If the Hilbert curve starts from the lower right corner and ends at the lower left corner, for instance

  • the path shows the very easy access of what would be the mid-point
  • if the Hilbert-curve is measured by the Euclidean distance along the zig-zagged path.

Likewise, a point near the beginning of the Hilbert-curve is about equally easy to access – easier than reaching, from the origin, a point about 2/3 of the way down the path. The Hilbert-curve provides easy access between two points within the "spooled thread"; a point about 1/5 of the overall length along the path is also in a "close neighborhood" of one about 3/5 along.

This may be the “Eureka-moment” for some readers, to realize that

  • the strand of "the Double Helix" requires quite a finesse to fold into the densest possible globules (the chromosomes) in a clever way,
  • such that various segments can be easily accessed, and distances between various segments are minimized.

This marvellous fractal structure is illustrated by the 3D rendering of the Hilbert-curve. Once you observe such a fractal structure, you'll never again think of a chromosome as a "brillo mess", would you? It will dawn on you that the genome is orders of magnitude more finessed than we ever thought.

Those embarking on a somewhat complex review of some historical aspects of the power of fractals may wish to consult the oeuvre of Mandelbrot (also, to celebrate his 85th birthday). For the more sophisticated readers, even the fairly simple Hilbert-curve (a representative of the Peano class) becomes even more stunningly brilliant than just some "see-through density". Those who are familiar with the classic "Traveling Salesman Problem" know that finding "the shortest path along which every one of n given locations can be visited once, and only once" requires fairly sophisticated algorithms (and a tremendous amount of computation once n exceeds about 10). Some readers will be amazed, therefore, that for n=9 the underlying Hilbert-curve helps to provide an empirical solution.
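The connection between the Hilbert-curve and the Traveling Salesman Problem can be made concrete: visiting points in the order of their index along the Hilbert curve yields a short, locality-preserving tour. The sketch below is an illustration written for this post (using the classic rotate-and-flip construction of the Hilbert index, not anything from the cited sources), with randomly placed "cities":

```python
import math
import random

def xy2d(n, x, y):
    """Index of grid point (x, y) along the Hilbert curve filling an
    n x n grid (n a power of two), via the classic rotate-and-flip
    construction."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                          # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s //= 2
    return d

def tour_length(order):
    """Total length of a tour visiting the points in the given order."""
    return sum(math.dist(order[i], order[i + 1])
               for i in range(len(order) - 1))

random.seed(1)
cities = [(random.randrange(64), random.randrange(64)) for _ in range(50)]
hilbert_tour = sorted(cities, key=lambda p: xy2d(64, *p))

# Visiting cities in Hilbert order beats the arbitrary input order
print(tour_length(hilbert_tour) < tour_length(cities))   # True
```

This Hilbert-sort heuristic does not find the optimal tour, but it exploits exactly the locality-preserving property discussed above: points close along the curve are close in the plane.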

refer to pellionisz@junkdna.com

Briefly, the significance of the above realization, that the (recursive) fractal Hilbert-curve is intimately connected to the (recursive) solution of the Traveling Salesman Problem, a core concept of artificial neural networks, can be summarized as below.

Accomplished physicist John Hopfield (already a member of the National Academy of Sciences) aroused great excitement in 1982 with his (recursive) design of artificial neural networks and learning algorithms which were able to find reasonable solutions to combinatorial problems such as the Traveling Salesman Problem. (Book review by Clark Jeffries, 1991; see also J. Anderson, R. Rosenfeld, and A. Pellionisz (eds.), Neurocomputing 2: Directions for Research, MIT Press, Cambridge, MA, 1990):

"Perceptrons were modeled chiefly with neural connections in a 'forward' direction: A → B → C → D. The analysis of networks with strong backward coupling proved intractable. All our interesting results arise as consequences of the strong back-coupling" (Hopfield, 1982).

The Principle of Recursive Genome Function surpassed obsolete axioms that blocked, for half a century, the entry of recursive algorithms into interpretation of the structure and function of the (holo)genome. This breakthrough, by uniting the two largely separate fields of Neural Networks and Genome Informatics, is particularly important for

  • those who focused on biological (actually occurring) neural networks,
  • rather than abstract algorithms that may not, or, because of their core axioms, simply could not represent neural networks under the governance of DNA information.
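Hopfield's back-coupled design can be sketched in a few lines: Hebbian weights store ±1 patterns, and the recurrent (recursive) update drives a corrupted input back to the stored attractor. The example below is a toy illustration written for this post, with a made-up pattern:

```python
def train_hopfield(patterns):
    """Hebbian learning: W[i][j] is the average of p[i]*p[j] over the
    stored +/-1 patterns, with a zero diagonal (no self-coupling)."""
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j] / len(patterns)
    return W

def recall(W, state, steps=10):
    """Recurrent ('back-coupled') dynamics: the thresholded output is
    fed back as the next input until the state settles on an attractor."""
    s = list(state)
    for _ in range(steps):
        s = [1 if sum(W[i][j] * s[j] for j in range(len(s))) >= 0 else -1
             for i in range(len(s))]
    return s

stored = [1, -1, 1, -1, 1, -1, 1, -1]
W = train_hopfield([stored])
noisy = list(stored)
noisy[0] = -noisy[0]                 # corrupt one unit
print(recall(W, noisy) == stored)    # True
```

The essential point, as in the Hopfield quote above, is the back-coupling: the output is recursively fed back as input, which is what lets the network settle into a stored solution rather than merely transform its input once.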

DNA base triplets

3A. The FractoGene Decade

from Inception in 2002 to Proofs of Concept and Impending Clinical Applications by 2012

  1. Junk DNA Revisited (SF Gate, 2002)
  2. The Future of Life, 50th Anniversary of DNA (Monterey, 2003)
  3. Mandelbrot and Pellionisz (Stanford, 2004)
  4. Morphogenesis, Physiology and Biophysics (Simons, Pellionisz 2005)
  5. PostGenetics; Genetics beyond Genes (Budapest, 2006)
  6. ENCODE-conclusion (Collins, 2007)

The Principle of Recursive Genome Function (paper, YouTube, 2008)

  1. Cold Spring Harbor presentation of FractoGene (Cold Spring Harbor, 2009)
  2. Mr. President, the Genome is Fractal! (2009)
  3. HolGenTech, Inc. Founded (2010)
  4. Pellionisz on the Board of Advisers in the USA and India (2011)
  5. ENCODE – final admission (2012)
  6. Recursive Genome Function is Clogged by Fractal Defects in Hilbert-Curve (2012)
  7. Geometric Unification of Neuroscience and Genomics (2012)
  8. US Patent Office issues FractoGene 8,280,641 to Pellionisz (2012)

http://www.junkdna.com/the_fractogene_decade.pdf
http://www.scribd.com/doc/116159052/The-Decade-of-FractoGene-From-Discovery-to-Utility-Proofs-of-Concept-Open-Genome-Based-Clinical-Applications
http://fractogene.com/full_genome/morphogenesis.html

When the human genome was first sequenced in June 2000, there were two pretty big surprises. The first was that humans have only about 30,000-40,000 identifiable genes, not the 100,000 or more many researchers were expecting. The lower, and more humbling, number

  • means humans have just one-third more genes than a common species of worm.

The second stunner was

  • how much human genetic material — more than 90 percent — is made up of what scientists were calling “junk DNA.”

The term was coined to describe similar but not completely identical repetitive sequences of nucleotide bases (the same building blocks that make up genes), which appeared to have no function or purpose. The main theory at the time was that these apparently non-working sections of DNA were just evolutionary leftovers, much like our earlobes.

If biophysicist Andras Pellionisz is correct, genetic science may be on the verge of yielding its third — and by far biggest — surprise.

Pellionisz holds Ph.D.'s in computer sciences and experimental biology from the prestigious Budapest Technical University and the Hungarian National Academy of Sciences. A biophysicist by training, the 59-year-old is a former research associate professor of physiology and biophysics at New York University, author of numerous papers in respected scientific journals and textbooks, a past winner of the prestigious Humboldt Prize for scientific research, a former consultant to NASA, and holder of a patent on the world's first artificial cerebellum, a technology that has already been integrated into research on advanced avionics systems. Because of his background, the Hungarian-born brain researcher might also become one of the first people to successfully launch a new company by using the Internet to gather momentum for a novel scientific idea.

The genes we know about today, Pellionisz says, can be thought of as something similar to machines that make bricks (proteins, in the case of genes), with certain junk-DNA sections providing a blueprint for the different ways those proteins are assembled. Reflecting the notion that at least certain parts of junk DNA might have a purpose, many researchers now refer to those sections with a far less derogatory term: introns.

In a provisional patent application filed July 31, Pellionisz claims to have unlocked a key to the hidden role junk DNA plays in growth — and in life itself. His patent application covers all attempts to count, measure and compare the fractal properties of introns for diagnostic and therapeutic purposes.

3B. The Hidden Fractal Language of Intron DNA

To fully understand Pellionisz’ idea, one must first know what a fractal is.

Fractals are a way that nature organizes matter. Fractal patterns can be found in anything that has a non-smooth surface (unlike a billiard ball), such as coastal seashores, the branches of a tree or the contours of a neuron (a nerve cell in the brain). Some, but not all, fractals are self-similar, and natural fractals stop repeating their patterns at some stage; the branches of a tree, for example, can get only so small. Because they are geometric, meaning they have a shape, fractals can be described in mathematical terms. It's similar to the way a circle can be described by using a number to represent its radius (the distance from its center to its outer edge). When that number is known, it's possible to draw the circle it represents without ever having seen it before.

Although the math is much more complicated, the same is true of fractals. If one has the formula for a given fractal, it’s possible to use that formula

  • to construct, or reconstruct,
  • an image of whatever structure it represents,
  • no matter how complicated.
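The circle analogy can be pushed one step further: given only a fractal's generating rule, the structure can be regenerated at any depth. As a minimal illustration (written for this post, and not Pellionisz's method), the classic Koch curve is rebuilt here from nothing but its rule of replacing each segment with four segments one third as long:

```python
def koch(p, q, depth):
    """Recursively generate the Koch-curve polyline from segment p -> q.
    Each level replaces a segment with four segments one third as long,
    so depth d yields 4**d segments -- an image constructed purely from
    the fractal's generating rule."""
    if depth == 0:
        return [p, q]
    (x1, y1), (x2, y2) = p, q
    dx, dy = (x2 - x1) / 3.0, (y2 - y1) / 3.0
    a = (x1 + dx, y1 + dy)                  # one third along
    b = (x1 + 2 * dx, y1 + 2 * dy)          # two thirds along
    # apex of the equilateral bump: middle third rotated by 60 degrees
    c = (x1 + 1.5 * dx - 0.8660254 * dy, y1 + 1.5 * dy + 0.8660254 * dx)
    pts = []
    for s, e in ((p, a), (a, c), (c, b), (b, q)):
        pts.extend(koch(s, e, depth - 1)[:-1])
    pts.append(q)
    return pts

# depth-3 curve: 4**3 = 64 segments, hence 65 points
pts = koch((0.0, 0.0), (1.0, 0.0), 3)
print(len(pts) - 1)   # 64
```

Nothing about the final shape is stored anywhere: the handful of constants in the rule plays the same role the radius plays for a circle.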

The mysteriously repetitive but not identical strands of genetic material are in reality building instructions organized in a special type of pattern known as a fractal. It's this pattern of fractal instructions, he says, that

  • tells genes what they must do in order to form living tissue,
  • everything from the wings of a fly to the entire body of a full-grown human.

In a move sure to alienate some scientists, Pellionisz has chosen the unorthodox route of making his initial disclosures online on his own Web site. He picked that strategy, he says, because it is the fastest way he can document his claims and find scientific collaborators and investors. Most mainstream scientists usually blanch at such approaches, preferring more traditionally credible methods, such as publishing articles in peer-reviewed journals.

Basically, Pellionisz’ idea is that a fractal set of building instructions in the DNA plays a similar role in organizing life itself. Decode the way that language works, he says, and in theory it could be reverse engineered. Just as knowing the radius of a circle lets one create that circle, the more complicated fractal-based formula would allow us to understand how nature creates a heart or simpler structures, such as disease-fighting antibodies. At a minimum, we’d get a far better understanding of how nature gets that job done.

The complicated quality of the idea is helping encourage new collaborations across the boundaries that sometimes separate the increasingly intertwined disciplines of biology, mathematics and computer sciences.

Hal Plotkin, Special to SF Gate. Thursday, November 21, 2002.                          http://www.junkdna.com/Special to SF Gate/plotkin.htm


3C. multifractal analysis

The human genome: a multifractal analysis. Moreno PA, Vélez PE, Martínez E, et al.

BMC Genomics 2011, 12:506. http://www.biomedcentral.com/1471-2164/12/506

Background: Several studies have shown that genomes can be studied via a multifractal formalism. Recently, we used a multifractal approach to study the genetic information content of the Caenorhabditis elegans genome. Here we investigate the possibility that the human genome shows a similar behavior to that observed in the nematode.
Results: We report here multifractality in the human genome sequence. This behavior correlates strongly with the

  • presence of Alu elements and
  • to a lesser extent on CpG islands and (G+C) content.

In contrast, little or no relationship was found for LINE, MIR, MER, and LTR elements, and for DNA regions poor in genetic information.

  • Gene function,
  • cluster of orthologous genes,
  • metabolic pathways, and
  • exons tended to increase their frequencies with ranges of multifractality and
  • large gene families were located in genomic regions with varied multifractality.

Additionally, a multifractal map and classification for human chromosomes are proposed.

Conclusions

We propose a descriptive non-linear model for the structure of the human genome.

This model reveals

  • a multifractal regionalization where many regions coexist that are far from equilibrium and
  • this non-linear organization has significant molecular and medical genetic implications for understanding the role of
  • Alu elements in genome stability and structure of the human genome.

Given the role of Alu sequences in

  • gene regulation,
  • genetic diseases,
  • human genetic diversity,
  • adaptation
  • and phylogenetic analyses,

these quantifications are especially useful.
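A common route to such multifractal quantifications of a DNA sequence (sketched here only as an illustration, not the authors' actual pipeline) is a Chaos Game Representation of the sequence followed by box counting, whose moment sums feed the multifractal spectrum:

```python
from collections import Counter

CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}

def cgr_points(seq):
    """Chaos Game Representation: starting from the centre of the unit
    square, each base moves the current point halfway toward that
    base's corner, turning a sequence into a 2-D point set whose
    density structure can be analysed with (multi)fractal methods."""
    x, y = 0.5, 0.5
    pts = []
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        pts.append((x, y))
    return pts

def box_counts(pts, k):
    """Occupancy counts on a 2**k x 2**k grid over the unit square;
    the q-th moments of these counts feed a multifractal spectrum."""
    n = 2 ** k
    return Counter((min(int(x * n), n - 1), min(int(y * n), n - 1))
                   for x, y in pts)

seq = "ACGT" * 256 + "AAAA" * 64          # toy sequence, 1280 bases
counts = box_counts(cgr_points(seq), 3)
total = sum(counts.values())
# q = 2 moment, the raw material of the correlation dimension D2
d2_sum = sum((c / total) ** 2 for c in counts.values())
print(total)   # 1280
```

Repeating the moment sum across box sizes and a range of q, and fitting the scaling exponents, yields the generalized-dimension spectrum whose variation along the genome is what "multifractality" refers to above.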

MiIP: The Monomer Identification and Isolation Program

Bun C, Ziccardi W, Doering J and Putonti C. Evolutionary Bioinformatics 2012:8 293-300.    http://dx.doi.org/10.4137/EBO.S9248

Repetitive elements within genomic DNA are both functionally and evolutionarily informative. Discovering these sequences ab initio is

  • computationally challenging, compounded by the fact that
  • sequence identity between repetitive elements can vary significantly.

Here we present a new application, the Monomer Identification and Isolation Program (MiIP), which provides functionality to both

  • search for a particular repeat as well as
  • discover repetitive elements within a larger genomic sequence.

To compare MiIP’s performance with other repeat detection tools, analysis was conducted for

  • synthetic sequences as well as
  • several a21-II clones and
  • HC21 BAC sequences.

The primary benefit of MiIP is the fact that it is a single tool capable of searching for both

  • known monomeric sequences as well as
  • discovering the occurrence of repeats ab initio, per the user’s required sensitivity of the search.
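The search half of this functionality can be illustrated with a toy approximate-match scan, where a mismatch threshold stands in for the varying sequence identity between repeat copies. This is a hedged sketch written for this post, not MiIP's actual algorithm, and the sequence is made up:

```python
def find_monomer(genome, monomer, max_mismatch):
    """Report (start, mismatches) for every window of `genome` that
    matches `monomer` with at most `max_mismatch` substitutions
    (Hamming distance) -- a toy version of searching for a known
    repeat whose copies have diverged."""
    m = len(monomer)
    hits = []
    for i in range(len(genome) - m + 1):
        mism = sum(1 for a, b in zip(genome[i:i + m], monomer) if a != b)
        if mism <= max_mismatch:
            hits.append((i, mism))
    return hits

genome = "TTACGACGTTTACGTACGAA"
print(find_monomer(genome, "ACGT", 1))
# [(2, 1), (5, 0), (11, 0), (15, 1)]
```

Raising the mismatch threshold trades specificity for sensitivity, which mirrors the user-tunable sensitivity the abstract describes; ab initio discovery additionally requires proposing the monomer itself, which this sketch does not attempt.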

Methods for Examining Genomic and Proteomic Interactions

1. An Integrated Statistical Approach to Compare Transcriptomics Data Across Experiments: A Case Study on the Identification of Candidate Target Genes of the Transcription Factor PPARα

Ullah MO, Müller M and Hooiveld GJEJ. Bioinformatics and Biology Insights 2012:6 145–154.       http://dx.doi.org/10.4137/BBI.S9529

http://www.la-press.com/
Corresponding author email: guido.hooiveld@wur.nl

An effective strategy to elucidate the signal transduction cascades activated by a transcription factor is to compare the transcriptional profiles of wild type and transcription factor knockout models. Many statistical tests have been proposed for analyzing gene expression data, but most

  • tests are based on pair-wise comparisons. Since the analysis of microarrays involves the testing of multiple hypotheses within one study, it is
  • generally accepted that one should control for false positives by the false discovery rate (FDR). However, it has been reported that
  • this may be an inappropriate metric for comparing data across different experiments.

Here we propose an approach that addresses the above mentioned problem by the simultaneous testing and integration of the three hypotheses (contrasts) using the cell means ANOVA model.

These three contrasts test for the effect of

  • a treatment in wild type,
  • gene knockout, and
  • globally over all experimental groups.

We illustrate our approach on microarray experiments that focused on the identification of candidate target genes and biological processes governed by the fatty acid-sensing transcription factor PPARα in liver. Compared to the often-applied FDR-based across-experiment comparison, our approach identified a conservative but less noisy set of candidate genes with the same sensitivity and specificity. However, our method had the advantage of

  • properly adjusting for multiple testing while
  • integrating data from two experiments, and
  • was driven by biological inference.

We present a simple, yet efficient strategy to compare

  • differential expression of genes across experiments
  • while controlling for multiple hypothesis testing.
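The three simultaneous contrasts on the cell-means model can be made concrete with toy numbers. The values below are invented for illustration and omit the variance estimation and significance testing of the actual method; they show only what each contrast measures in a 2x2 genotype-by-treatment design:

```python
from statistics import mean

# Toy expression values for one gene in a 2x2 design:
# genotype (WT / KO) x treatment (control / treated).
cells = {
    ("WT", "ctrl"): [5.1, 5.3, 4.9],
    ("WT", "trt"):  [7.2, 6.8, 7.0],
    ("KO", "ctrl"): [5.0, 5.2, 5.1],
    ("KO", "trt"):  [5.3, 5.1, 5.2],
}
mu = {k: mean(v) for k, v in cells.items()}

# The three contrasts tested simultaneously on the cell means:
wt_effect = mu[("WT", "trt")] - mu[("WT", "ctrl")]     # treatment in wild type
ko_effect = mu[("KO", "trt")] - mu[("KO", "ctrl")]     # treatment in knockout
global_effect = (wt_effect + ko_effect) / 2.0          # over all groups

print(round(wt_effect, 2), round(ko_effect, 2), round(global_effect, 2))
# 1.9 0.1 1.0
```

A gene like this toy one, strongly induced in wild type but not in the knockout, is exactly the pattern that flags it as a candidate target of the knocked-out transcription factor; testing the three contrasts within one ANOVA model is what allows a single multiple-testing adjustment across both experiments.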

2. Managing biological complexity across orthologs with a visual knowledgebase of documented biomolecular interactions

Vincent VanBuren & Hailin Chen.   Scientific Reports 2, Article number: 1011  Received 02 October 2012 Accepted 04 December 2012 Published 20 December 2012
http://dx.doi.org/10.1038/srep01011

The complexity of biomolecular interactions and influences is a major obstacle to their comprehension and elucidation. Visualizing knowledge of biomolecular interactions increases comprehension and facilitates the development of new hypotheses. The rapidly changing landscape of high-content experimental results also presents a challenge for the maintenance of comprehensive knowledgebases. Distributing the responsibility for maintenance of a knowledgebase to a community of subject matter experts is an effective strategy for large, complex and rapidly changing knowledgebases.
Cognoscente serves these needs by

  • building visualizations for queries of biomolecular interactions on demand,
  • by managing the complexity of those visualizations, and
  • by crowdsourcing to promote the incorporation of current knowledge from the literature.

Imputing functional associations between biomolecules and imputing directionality of regulation for those predictions each

  • require a corpus of existing knowledge as a framework to build upon. Comprehension of the complexity of this corpus of knowledge
  • will be facilitated by effective visualizations of the corresponding biomolecular interaction networks.

Cognoscente

http://vanburenlab.medicine.tamhsc.edu/cognoscente.html
was designed and implemented to serve these roles as

  • a knowledgebase and
  • as an effective visualization tool for systems biology research and education.

Cognoscente currently contains over 413,000 documented interactions, with coverage across multiple species.  Perl, HTML, GraphViz, and a MySQL database were used in the development of Cognoscente. Cognoscente was motivated by the need to

  • update the knowledgebase of biomolecular interactions at the user level, and
  • flexibly visualize multi-molecule query results for heterogeneous interaction types across different orthologs.
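The on-demand visualization idea is easy to sketch: a query result is just a set of typed edges that can be emitted as GraphViz DOT for rendering. The example below is a hypothetical illustration written for this post (the genes and edge-style mapping are made up), not Cognoscente's actual code:

```python
def to_dot(interactions):
    """Emit a GraphViz DOT description of a set of documented
    interactions, using a different edge style per interaction type
    (solid = protein-protein, dashed = protein->DNA) -- a minimal
    sketch of building a visualization on demand from a query result."""
    styles = {"pp": "solid", "pd": "dashed"}
    lines = ["digraph query {"]
    for src, dst, kind in interactions:
        lines.append('  "%s" -> "%s" [style=%s];' % (src, dst, styles[kind]))
    lines.append("}")
    return "\n".join(lines)

# Hypothetical query result: two protein-protein edges, one protein->DNA
dot = to_dot([("TP53", "MDM2", "pp"), ("MDM2", "TP53", "pp"),
              ("TP53", "CDKN1A", "pd")])
print(dot)
```

Feeding the resulting text to the GraphViz `dot` tool produces the drawn network; encoding interaction type in edge style is the same device the comparison below calls "multiple edge encodings".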

Satisfying these needs provides a strong foundation for developing new hypotheses about regulatory and metabolic pathway topologies.  Several existing tools provide functions that are similar to Cognoscente, so we selected several popular alternatives to

  • assess how their feature sets compare with Cognoscente (Table 1). All databases assessed had
  • easily traceable documentation for each interaction, and
  • included protein-protein interactions in the database.

Most databases, with the exception of BIND,

  • provide an open-access database that can be downloaded as a whole.

Most databases, with the exceptions of EcoCyc and HPRD, provide

  • support for multiple organisms.

Most databases support web services for interacting with the database contents programmatically, whereas this is a planned feature for Cognoscente.

  • MINT, STRING, IntAct, EcoCyc, DIP and Cognoscente provide built-in visualizations of query results,
  • which we consider among the most important features for facilitating comprehension of query results.
  • BIND supports visualizations via Cytoscape. Cognoscente is among the few tools that support multiple organisms in the same query,
  • protein->DNA interactions, and
  • multi-molecule queries.

Cognoscente has planned support for small molecule interactants (i.e., pharmacological agents). MINT, STRING, and IntAct provide a prediction (i.e., score) of functional associations, whereas Cognoscente does not currently support this. Cognoscente provides

  • support for multiple edge encodings to visualize different types of interactions in the same display,
  • a crowdsourcing web portal that allows users to submit interactions that are then automatically incorporated in the knowledgebase, and
  • display of orthologs as compound nodes to provide clues about potential orthologous interactions.

The main strengths of Cognoscente are that

  1. it provides a combined feature set that is superior to any existing database,
  2. it provides a unique visualization feature for orthologous molecules, and relatively unique support for
  3. multiple edge encodings,
  4. crowdsourcing, and
  5. connectivity parameterization.

The current weaknesses of Cognoscente relative to these other tools are

  • that it does not fully support web service interactions with the database,
  • it does not fully support small molecule interactants, and
  • it does not score interactions to predict functional associations.

Web services and support for small molecule interactants are currently under development.

Other related articles on this Open Access Online Scientific Journal include the following:

Big Data in Genomic Medicine                    lhb                          https://pharmaceuticalintelligence.com/2012/12/17/big-data-in-genomic-medicine/

BRCA1 a tumour suppressor in breast and ovarian cancer – functions in transcription, ubiquitination and DNA repair S Saha                                                                                   https://pharmaceuticalintelligence.com/2012/12/04/brca1-a-tumour-suppressor-in-breast-and-ovarian-cancer-functions-in-transcription-ubiquitination-and-dna-repair/

Computational Genomics Center: New Unification of Computational Technologies at Stanford A Lev-Ari    https://pharmaceuticalintelligence.com/2012/12/03/computational-genomics-center-new-unification-of-computational-technologies-at-stanford/

Paradigm Shift in Human Genomics – Predictive Biomarkers and Personalized Medicine – Part 1 (pharmaceuticalintelligence.com) A Lev-Ari https://pharmaceuticalintelligence.com/2013/01/13/paradigm-shift-in-human-genomics-predictive-biomarkers-and-personalized-medicine-part-1/

LEADERS in Genome Sequencing of Genetic Mutations for Therapeutic Drug Selection in Cancer Personalized Treatment: Part 2 A Lev-Ari
https://pharmaceuticalintelligence.com/2013/01/13/leaders-in-genome-sequencing-of-genetic-mutations-for-therapeutic-drug-selection-in-cancer-personalized-treatment-part-2/

Personalized Medicine: An Institute Profile – Coriell Institute for Medical Research: Part 3 A Lev-Ari https://pharmaceuticalintelligence.com/2013/01/13/personalized-medicine-an-institute-profile-coriell-institute-for-medical-research-part-3/

GSK for Personalized Medicine using Cancer Drugs needs Alacris systems biology model to determine the in silico effect of the inhibitor in its “virtual clinical trial” A Lev-Ari    https://pharmaceuticalintelligence.com/2012/11/14/gsk-for-personalized-medicine-using-cancer-drugs-needs-alacris-systems-biology-model-to-determine-the-in-silico-effect-of-the-inhibitor-in-its-virtual-clinical-trial/

Recurrent somatic mutations in chromatin-remodeling and ubiquitin ligase complex genes in serous endometrial tumors S Saha
https://pharmaceuticalintelligence.com/2012/11/19/recurrent-somatic-mutations-in-chromatin-remodeling-and-ubiquitin-ligase-complex-genes-in-serous-endometrial-tumors/

Human Variome Project: encyclopedic catalog of sequence variants indexed to the human genome sequence A Lev-Ari

https://pharmaceuticalintelligence.com/2012/11/24/human-variome-project-encyclopedic-catalog-of-sequence-variants-indexed-to-the-human-genome-sequence/

Prostate Cancer Cells: Histone Deacetylase Inhibitors Induce Epithelial-to-Mesenchymal Transition sjwilliams
https://pharmaceuticalintelligence.com/2012/11/30/histone-deacetylase-inhibitors-induce-epithelial-to-mesenchymal-transition-in-prostate-cancer-cells/

https://pharmaceuticalintelligence.com/2013/01/09/the-cancer-establishments-examined-by-james-watson-co-discover-of-dna-wcrick-41953/

Directions for genomics in personalized medicine lhb https://pharmaceuticalintelligence.com/2013/01/27/directions-for-genomics-in-personalized-medicine/

How mobile elements in “Junk” DNA promote cancer. Part 1: Transposon-mediated tumorigenesis. Sjwilliams
https://pharmaceuticalintelligence.com/2012/10/31/how-mobile-elements-in-junk-dna-prote-cancer-part1-transposon-mediated-tumorigenesis/

Mitochondrial fission and fusion: potential therapeutic targets? Ritu saxena    https://pharmaceuticalintelligence.com/2012/10/31/mitochondrial-fission-and-fusion-potential-therapeutic-target/

Mitochondrial mutation analysis might be “1-step” away ritu saxena  https://pharmaceuticalintelligence.com/2012/08/14/mitochondrial-mutation-analysis-might-be-1-step-away/

mRNA interference with cancer expression lhb https://pharmaceuticalintelligence.com/2012/10/26/mrna-interference-with-cancer-expression/

Expanding the Genetic Alphabet and linking the genome to the metabolome https://pharmaceuticalintelligence.com/2012/09/24/expanding-the-genetic-alphabet-and-linking-the-genome-to-the-metabolome/

Breast Cancer: Genomic profiling to predict Survival: Combination of Histopathology and Gene Expression Analysis A Lev-Ari

https://pharmaceuticalintelligence.com/2012/12/24/breast-cancer-genomic-profiling-to-predict-survival-combination-of-histopathology-and-gene-expression-analysis/

Ubiquinin-Proteosome pathway, autophagy, the mitochondrion, proteolysis and cell apoptosis lhb https://pharmaceuticalintelligence.com/2012/10/30/ubiquinin-proteosome-pathway-autophagy-the-mitochondrion-proteolysis-and-cell-apoptosis/

Genomic Analysis: FLUIDIGM Technology in the Life Science and Agricultural Biotechnology A Lev-Ari https://pharmaceuticalintelligence.com/2012/08/22/genomic-analysis-fluidigm-technology-in-the-life-science-and-agricultural-biotechnology/

2013 Genomics: The Era Beyond the Sequencing Human Genome: Francis Collins, Craig Venter, Eric Lander, et al.  https://pharmaceuticalintelligence.com/2013_Genomics

Paradigm Shift in Human Genomics – Predictive Biomarkers and Personalized Medicine – Part 1 https://pharmaceuticalintelligence.com/Paradigm Shift in Human Genomics_/

English: DNA replication or DNA synthesis is the process of copying a double-stranded DNA molecule. This process is paramount to all life as we know it. (Photo credit: Wikipedia)

French: Chromosomal deletion (Photo credit: Wikipedia)

A slight mutation in the matched nucleotides can lead to chromosomal aberrations and unintentional genetic rearrangement. (Photo credit: Wikipedia)

Read Full Post »


Electronic health record – choice of cause of consultation (German) (Photo credit: Wikipedia)

Reporter: Larry H Bernstein, MD, FACP

The Electronic Health Record: How far we have travelled, and where is journey’s end?

A focus of the Accountable Care Act is improved delivery of quality, efficiency, and effectiveness to patients who receive healthcare in the US from providers in a coordinated system.  The largest confounders in all of this are the existence of silos that are not readily crossed, handovers, communication lapses, and a heavy paperwork burden.  We can add to that a large for-profit insurance overhead that is disinterested in the patient–physician encounter.  Finally, the knowledge base of medicine has grown sufficiently that physicians are challenged by the amount of data and its presentation in the medical record.

I present a review of the problems that have become more urgent to fix in the last decade.  The administration and paperwork necessitated by health insurers, HMOs, and other parties today may account for 40% of a physician's practice, and the formation of large physician practice groups and alliances of hospitals and hospital-staffed physicians (as well as hospital system alliances) has increased in response to the need to decrease the cost of non-patient-care overhead.  I discuss some of the points made by two innovators from the healthcare and communications sectors.

 


I also call attention to the New York Times front-page article reporting a sharp rise in inflation-adjusted Medicare payments for emergency-room services since 2006 due to upcoding at the highest level, partly related to physicians' ability to overstate the claim for service provided, a problem addressable by the correctible improvements I discuss below (NY Times, 9/22/2012).  The solution still has another built-in step that requires quality control of both the input and the output, achievable today.  This also comes at a time of nationwide implementation of ICD-10 to replace ICD-9 coding.

US medical groups' adoption of EHR (2005) (Photo credit: Wikipedia)

 

The first contribution, by Robert S. Didner ("Decision Making in the Clinical Setting"), concludes that gathering information carries large costs while reimbursements for the activities provided have decreased, to the detriment of the outcomes that are measured.  He suggests that these data can be gathered and reformatted to improve their value in the clinical setting by leading to decisions with optimal outcomes, and he outlines how this can be done.

The second is a discussion by Emergency Medicine physicians Thomas A. Naegele and Harry P. Wetzler, who have developed the Foresighted Practice Guideline (FPG) model ("The Foresighted Practice Guideline Model: A Win-Win Solution").  They focus on collecting data from similar patients, their interventions, and treatments to better understand the value of alternative courses of treatment.  Using the FPG model will enable physicians to elevate their practice to a higher level, and they will have hard information on what works.  These two views are more than 10 years old, and they are complementary.

Didner points out that there is no one sequence of tests and questions that can be optimal for all presenting clusters.  Even as data and test results are acquired, the optimal sequence of information gathering changes, depending on what has been gathered.  This creates a dilemma for collecting clinical data.  Currently, the way information is requested and presented does not support the way decisions are made.  Decisions are made in a "path-dependent" way, influenced by the sequence in which the components are considered.  Ideally, this would require a separate form for each combination of presenting history and symptoms, prior to ordering tests, which is unmanageable.  The blank-paper format is no better, as the data are not collected in the way they will be used, and they fall into separate clusters (vital signs; lab work, itself divided into CBC, chemistry panel, microbiology, immunology, blood bank, and special tests).  Improvements have been made in the graphical presentation of a series of tests.  Didner presents another means of gathering data in machine-manipulable form that improves the expected outcomes.  The basis for this model is that at any stage of testing and information gathering there is an expected outcome from the process, coupled with a metric, or hierarchy of values, to determine the relative desirability of the possible outcomes.

He creates a value hierarchy:

  1. Minimize the likelihood that a treatable, life-threatening disorder is not treated.
  2. Minimize the likelihood that a treatable, permanently-disabling or disfiguring disorder is not treated.
  3. Minimize the likelihood that a treatable, discomfort causing disorder is not treated.
  4. Minimize the likelihood that a risky procedure (treatment or diagnostic) is inappropriately administered.
  5. Minimize the likelihood that a discomfort causing procedure is inappropriately administered.
  6. Minimize the likelihood that a costly procedure is inappropriately administered.
  7. Minimize the time of diagnosing and treating the patient.
  8. Minimize the cost of diagnosing and treating the patient.

In reference to a way of minimizing the number, time, and cost of tests, he determines that the optimum sequence could be found using Claude Shannon's information theory.  As to a hierarchy of outcome values, he refers to the QALY scale as a starting point.  At any point where a determination is made, disparate information has to be brought together, such as weight, blood pressure, cholesterol, etc.  He points out, in addition, that the way clinical information is organized is not optimal for displaying information to enhance human cognitive performance in decision support.  Furthermore, he takes the limit of short-term memory to be about 10 chunks of information at any time, and he compares the positions of chess pieces on the board with the performance of a grand master when the pieces are in an order commensurate with a "line of attack".  The information has to be ordered in the way it is to be used!  By presenting the information used for a particular decision component in a compact space, the load on short-term memory is reduced, and there is less strain in searching for the relevant information.
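Didner's information-theoretic criterion can be made concrete: rank candidate tests by their expected reduction in diagnostic entropy, and run the most informative test first.  The sketch below is illustrative only — a minimal Python example with invented diagnoses and probabilities, not Didner's actual procedure:

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def expected_information_gain(prior, outcomes):
    """Expected entropy reduction from running one test.

    prior    -- probabilities over competing diagnoses
    outcomes -- list of (P(test result), posterior distribution given that result)
    """
    expected_posterior = sum(p_r * entropy(post) for p_r, post in outcomes)
    return entropy(prior) - expected_posterior

# Hypothetical numbers: three competing diagnoses, two candidate tests.
prior = [0.5, 0.3, 0.2]
test_a = [(0.6, [0.80, 0.10, 0.10]), (0.4, [0.05, 0.60, 0.35])]
test_b = [(0.5, [0.55, 0.25, 0.20]), (0.5, [0.45, 0.35, 0.20])]

gains = {"A": expected_information_gain(prior, test_a),
         "B": expected_information_gain(prior, test_b)}
best = max(gains, key=gains.get)  # run the most informative test first
```

Repeating the calculation after each result yields exactly the adaptive, path-dependent ordering of information gathering that Didner describes.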

He creates a Table to illustrate the point.

Correlation of weight with other cardiac risk factors

Chol        0.759384
HDL        -0.539080
LDL         0.177297
bp-syst     0.424728
bp-dia      0.516167
Triglyc     0.637817
The task of the information system designer is to provide or request the right information, in the best form, at each stage of the procedure.
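Correlation tables like the one above are straightforward to compute from structured records.  A minimal Pearson-correlation sketch in Python — the patient values below are invented for illustration and are not Didner's data:

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical records: (weight in kg, systolic BP in mmHg); real values
# would be pulled from the structured record described above.
weight = [62, 70, 81, 95, 104]
bp_syst = [110, 118, 125, 141, 150]
r = pearson(weight, bp_syst)   # strongly positive, as in the table
```

A production system would compute one such coefficient per risk factor and present the column as a single compact display, reducing the load on short-term memory.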

The FPG concept as deployed by Naegele and Wetzler is a model for the design of a more effective health record that has already shown substantial proof of concept in the emergency room setting.  In principle, every clinical encounter is viewed as a learning experience that requires the collection of data, learning from similar patients, and comparing the value of alternative courses of treatment.  The framework for standard data collection is the FPG model.  The FPG is distinguished from hindsighted guidelines, which are utilized by utilization and peer review organizations.  Over time, the data form patient clusters and enable the physician to function at a higher level.

Hypothesis construction is experiential, and hypothesis generation and testing are required to go from art to science in the complex practice of medicine.  In every encounter there are three components: patient, process, and outcome.  The key to the process is to collect data on patients, processes, and outcomes in a standard way.  The main problem with a large portion of the chart is that the description is not uniform.  This is not fully resolved even with good natural language processing.  The standard words and phrases that may be used for a particular complaint or condition constitute a guideline.  This type of "guided documentation" is a step in moving toward a guided practice.  It enables physicians to gather data on patients, processes, and outcomes of care in routine settings, and the data can be reviewed and updated.  This is a higher level of methodology than basing guidelines on "consensus and opinion".
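One way to picture "guided documentation" is as a controlled vocabulary per complaint, against which each chart entry is validated so that every physician records the same standard words and phrases.  The descriptor sets below are invented for illustration, not Naegele and Wetzler's actual guideline content:

```python
# Controlled vocabulary for one complaint: only these descriptors are
# accepted, so every chart uses the same standard words and phrases.
CHEST_PAIN_DESCRIPTORS = {
    "quality": {"pressure", "stabbing", "burning", "tearing"},
    "radiation": {"left arm", "jaw", "back", "none"},
    "duration": {"<20 min", "20 min - 1 h", ">1 h"},
}

def validate_entry(entry):
    """Return the fields whose values fall outside the guideline vocabulary."""
    return [field for field, value in entry.items()
            if value not in CHEST_PAIN_DESCRIPTORS.get(field, set())]

entry = {"quality": "stabbing", "radiation": "left arm", "duration": "2 hours"}
errors = validate_entry(entry)   # nonstandard phrasing is flagged for revision
```

Because every encounter is recorded against the same vocabulary, entries from similar patients become directly comparable, which is what allows the data to form the patient clusters described above.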
When Lee Goldman, et al., created the guideline for classifying chest pain in the emergency room, the characterization of the chest pain was problematic.  In dealing with this, he determined that if the chest pain was "stabbing", or if it radiated to the right foot, heart attack was excluded.

The IOM is intensely committed to practice guidelines for care.  Guidelines are the databases of the science of medical decision making and disposition processing, and they are related to process flow.  However, the hindsighted, or retrospective, approach is diagnosis- or procedure-oriented; hindsighted practice guidelines (HPGs) are the tool used in utilization review.  The FPG model focuses on the physician–patient encounter and is problem-oriented.  We can go back further and recall the contribution of Lawrence Weed to the "structured medical record".
Physicians today use an FPG-like framework in looking at a problem or pathology (especially in pathology, which extends classification by use of biomarker staining).  The Standard Patient File Format (SPPF), developed by Weed, includes: 1. patient demographics; 2. front of the chart; 3. subjective; 4. objective; 5. assessment/diagnosis; 6. plan; 7. back of the chart.  The FPG retains the structure of the SPPF.  All of the words and phrases in the FPG are the database for the problem or condition.  The current construct of the chart is uninviting: nurses' notes, medications, lab results, radiology, imaging.
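Weed's SPPF outline lends itself to a simple record type.  The sketch below is only illustrative — the field names paraphrase the numbered list above, and the sample values are invented:

```python
from dataclasses import dataclass, field

@dataclass
class SPPFRecord:
    """Standard Patient File Format, following Weed's problem-oriented
    structure (field names are illustrative, not a published schema)."""
    demographics: dict
    front_of_chart: list          # problem list, allergies, medications
    subjective: str               # patient-reported history
    objective: dict               # exam findings, vitals, lab results
    assessment: str               # working diagnosis
    plan: list                    # ordered interventions and follow-up
    back_of_chart: list = field(default_factory=list)

rec = SPPFRecord(
    demographics={"age": 64, "sex": "F"},
    front_of_chart=["HTN", "type 2 diabetes"],
    subjective="Intermittent chest pressure on exertion.",
    objective={"bp": "150/92", "troponin": "pending"},
    assessment="Possible stable angina.",
    plan=["ECG", "serial troponins", "cardiology referral"],
)
```

The point of the structure is that every encounter lands in the same named slots, so the chart is read in the order it is used rather than as disconnected clusters.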

Realtime Clinical Expert Support and Validation System
Gil David and Larry Bernstein have developed, in consultation with Prof. Ronald Coifman of the Yale University Applied Mathematics Program, a software system that is the equivalent of an intelligent Electronic Health Records dashboard, providing an empirical medical reference and suggesting quantitative diagnostic options.

The introduction of a DASHBOARD has allowed the presentation of drug reactions, allergies, primary and secondary diagnoses, and critical information about any patient to the caregiver needing access to the record.  The advantage of this innovation is obvious.  The startup problem is deciding what information is presented and how it is displayed, which is a source of variability and a key to its success.  It is also imperative that the extraction of data from disparate sources will, in the long run, further improve the diagnostic process.  Consider, for instance, the finding of ST depression on the ECG coincident with an increase of a cardiac biomarker (troponin).  Through the application of geometric clustering analysis, the data may be interpreted in a more sophisticated fashion in order to create a more reliable and valid knowledge-based opinion.  In the hemogram, one can view data reflecting the characteristics of a broad spectrum of medical conditions.  Characteristics are expressed as measurements of size, density, and concentration, resulting in more than a dozen composite variables, including the mean corpuscular volume (MCV), mean corpuscular hemoglobin concentration (MCHC), mean corpuscular hemoglobin (MCH), total white cell count (WBC), total lymphocyte count, neutrophil count (mature granulocyte count and bands), monocytes, eosinophils, basophils, platelet count, mean platelet volume (MPV), blasts, reticulocytes, and platelet clumps, as well as other features of classification.  This has been described in a previous post.
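As a rough illustration of the clustering idea — not the authors' actual algorithm — a plain k-means pass can group hemogram profiles into distinct patterns.  The (MCV, WBC) pairs below are synthetic, loosely mimicking a microcytic group and a leukocytosis group:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means on a list of numeric feature vectors."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Recompute each center as the mean of its cluster.
        centers = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

# Synthetic (MCV in fL, WBC in 10^9/L) profiles: two clinically
# distinct groups that a geometric method should separate.
profiles = [(68, 7), (70, 6), (72, 8), (90, 18), (92, 20), (88, 17)]
centers, clusters = kmeans(profiles, k=2)
sizes = sorted(len(c) for c in clusters)   # two groups of three
```

A real system would cluster the full set of composite hemogram variables, but the principle is the same: geometrically similar profiles fall into the same group, supporting a knowledge-based interpretation.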

It is beyond comprehension that a better construct has not been created for common use.

W Ruts, S De Deyne, E Ameel, W Vanpaemel, T Verbeemen, and G Storms. Dutch norm data for 13 semantic categories and 338 exemplars. Behavior Research Methods, Instruments, & Computers 2004; 36(3): 506–515.

S De Deyne, S Verheyen, E Ameel, W Vanpaemel, MJ Dry, W Voorspoels, and G Storms. Exemplar by feature applicability matrices and other Dutch normative data for semantic concepts. Behavior Research Methods 2008; 40(4): 1030–1048.

Landauer TK, Ross BH, and Didner RS. Processing visually presented single words: A reaction time analysis [Technical memorandum]. Murray Hill, NJ: Bell Laboratories; 1979.

Lewandowsky S. (1991).

Weed L. Automation of the problem oriented medical record. NCHSR Research Digest Series DHEW. 1977; (HRA)77-3177.

Naegele TA. Letter to the Editor. Amer J Crit Care 1993; 2(5): 433.

 

Read Full Post »