
A Nonlinear Methodology to Explain Complexity of the Genome and Bioinformatic Information

Reporter: Stephen J. Williams, Ph.D.

Multifractal bioinformatics: A proposal to the nonlinear interpretation of genome

The following is an open access article by Pedro Moreno on a methodology to analyze genetic information across species, and in particular the evolutionary trends of complex genomes, by a nonlinear analytic approach utilizing fractal geometry, coined "Nonlinear Bioinformatics".  This fractal approach stems from the complex nature of higher eukaryotic genomes, including mosaicism and multiple interspersed genomic elements such as intronic regions, noncoding regions, and mobile elements such as transposable elements.  Although seemingly random, these elements have a repetitive nature. Such complexity of DNA regulation, structure, and genomic variation is best understood by developing algorithms based on fractal analysis, which can model the regionalized and repetitive variability and structure within complex genomes by elucidating the individual components that contribute to an overall complex structure. This contrasts with a "linear" or "reductionist" approach that looks at individual coding regions and does not take into consideration the aforementioned factors leading to genetic complexity and diversity.

Indeed, many other attempts to describe the complexities of DNA as a fractal geometric pattern have been made.  In the paper "Fractals and Hidden Symmetries in DNA", Carlo Cattani uses fractal analysis to construct a simple geometric pattern of the influenza A virus by modeling the primary sequence of the viral genome, namely the bases A, G, C, and T. The main conclusions, namely that

fractal shapes and symmetries in DNA sequences and DNA walks have been shown and compared with random and deterministic complex series. DNA sequences are structured in such a way that there exists some fractal behavior which can be observed both on the correlation matrix and on the DNA walks. Wavelet analysis confirms by a symmetrical clustering of wavelet coefficients the existence of scale symmetries.

suggested that, at least, the influenza viral genome structure could be analyzed into its basic components by fractal geometry.
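The DNA-walk construction mentioned in the quoted passage can be sketched in a few lines. The purine/pyrimidine mapping below (A, G → +1; C, T → −1) is one common convention, not necessarily the one Cattani's paper uses:

```python
# A minimal DNA walk: map purines (A, G) to +1 and pyrimidines
# (C, T) to -1, then take the cumulative sum of the steps.
# The purine/pyrimidine rule is one common convention only.
def dna_walk(seq):
    steps = [1 if base in "AG" else -1 for base in seq.upper()]
    walk, total = [], 0
    for s in steps:
        total += s
        walk.append(total)
    return walk

print(dna_walk("AGCT"))  # [1, 2, 1, 0]
```

Fractal and scaling properties are then read off the resulting walk rather than the raw symbol sequence.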
This approach has also been used to model the complex nature of cancer, as discussed in a 2011 Seminars in Oncology paper.
Abstract: Cancer is a highly complex disease due to the disruption of tissue architecture. Thus, tissues, and not individual cells, are the proper level of observation for the study of carcinogenesis. This paradigm shift from a reductionist approach to a systems biology approach is long overdue. Indeed, cell phenotypes are emergent modes arising through collective non-linear interactions among different cellular and microenvironmental components, generally described by "phase space diagrams", where stable states (attractors) are embedded into a landscape model. Within this framework, cell states and cell transitions are generally conceived as mainly specified by gene-regulatory networks. However, the system's dynamics is not reducible to the integrated functioning of the genome-proteome network alone; the epithelium-stroma interacting system must be taken into consideration in order to give a more comprehensive picture. Given that cell shape represents the spatial geometric configuration acquired as a result of the integrated set of cellular and environmental cues, we posit that fractal-shape parameters represent "omics" descriptors of the epithelium-stroma system. Within this framework, function appears to follow form, and not the other way around.

As the authors conclude:

"Transitions from one phenotype to another are reminiscent of phase transitions observed in physical systems. The description of such transitions could be obtained by a set of morphological, quantitative parameters, like fractal measures. These parameters provide reliable information about system complexity."

Gene expression also displays a fractal nature. In a Frontiers in Physiology paper by Mahboobeh Ghorbani, Edmond A. Jonckheere, and Paul Bogdan, "Gene Expression Is Not Random: Scaling, Long-Range Cross-Dependence, and Fractal Characteristics of Gene Regulatory Networks",

the authors show that gene expression time series display fractal and long-range dependence characteristics.

Abstract: Gene expression is a vital process through which cells react to the environment and express functional behavior. Understanding the dynamics of gene expression could prove crucial in unraveling the physical complexities involved in this process. Specifically, understanding the coherent complex structure of transcriptional dynamics is the goal of numerous computational studies aiming to study and finally control cellular processes. Here, we report the scaling properties of gene expression time series in Escherichia coli and Saccharomyces cerevisiae. Unlike previous studies, which report the fractal and long-range dependency of DNA structure, we investigate the individual gene expression dynamics as well as the cross-dependency between them in the context of gene regulatory networks. Our results demonstrate that the gene expression time series display fractal and long-range dependence characteristics. In addition, the dynamics between genes and linked transcription factors in gene regulatory networks are also fractal and long-range cross-correlated. The cross-correlation exponents in gene regulatory networks are not unique. The distribution of the cross-correlation exponents of gene regulatory networks for several types of cells can be interpreted as a measure of the complexity of their functional behavior.
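The long-range dependence the authors report is usually summarized by a Hurst-type exponent. As an illustrative sketch (not the authors' own estimator), the aggregated-variance method regresses the log-variance of block means against log block size; for a self-similar series the slope equals 2H − 2:

```python
import numpy as np

# Aggregated-variance estimate of the Hurst exponent H: split the
# series into blocks of size m, take block means, and regress
# log(variance of means) on log(m).  For a self-similar series the
# slope is 2H - 2.  Illustrative only; Ghorbani et al. use their
# own estimators.
def hurst_aggvar(x, block_sizes=(2, 4, 8, 16, 32)):
    x = np.asarray(x, dtype=float)
    log_m, log_v = [], []
    for m in block_sizes:
        n_blocks = len(x) // m
        means = x[:n_blocks * m].reshape(n_blocks, m).mean(axis=1)
        log_m.append(np.log(m))
        log_v.append(np.log(means.var()))
    slope = np.polyfit(log_m, log_v, 1)[0]
    return 1 + slope / 2

rng = np.random.default_rng(0)
white = rng.normal(size=4096)  # uncorrelated noise: expect H near 0.5
print(round(hurst_aggvar(white), 2))
```

For uncorrelated noise the estimate should land near H = 0.5; values approaching 1 indicate long-range dependence.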


Given that a multitude of complex biomolecular networks and biomolecules can be described by fractal patterns, the development of bioinformatic algorithms would enhance our understanding of the interdependence and cross-functionality of these multiple biological networks, particularly in disease and drug resistance.  The article below by Pedro Moreno describes the development of such bioinformatic algorithms.

Pedro A. Moreno
Escuela de Ingeniería de Sistemas y Computación, Facultad de Ingeniería, Universidad del Valle, Cali, Colombia

Thematic area: Systems engineering
Received: September 19, 2012
Accepted: December 16, 2013




The first draft of the human genome (HG) sequence was published in 2001 by two competing consortia. Since then, several structural and functional characteristics of HG organization have been revealed. Today, more than 2,000 HGs have been sequenced, and these findings are having a strong impact on academia and public health. Despite all this, a major bottleneck, called genome interpretation, persists: that is, the lack of a theory that explains the complex puzzle of coding and non-coding features that compose the HG as a whole. Ten years after the HG was sequenced, two recent studies, discussed here within the multifractal formalism, allow proposing a nonlinear theory that helps interpret the structural and functional variation of the genetic information of genomes. The present review article discusses this new approach, called "multifractal bioinformatics".

Keywords: Omics sciences, bioinformatics, human genome, multifractal analysis.

1. Introduction

Omics Sciences and Bioinformatics

In order to study genomes, their life properties, and the pathological consequences of their impairment, the Human Genome Project (HGP) was created in 1990. Since then, about 500 Gbp (EMBL), represented in thousands of prokaryotic genomes and tens of different eukaryotic genomes, have been sequenced (NCBI, 1000 Genomes, ENCODE). Today, genomics is defined as the set of sciences and technologies dedicated to the comprehensive study of the structure, function, and origin of genomes. Several types of genomics have arisen as a result of the expansion and application of genomics to the study of the Central Dogma of Molecular Biology (CDMB), Figure 1 (above). The catalog of different types of genomics uses the suffix "-omics", meaning "set of", to denote the new massive approaches of the new omics sciences (Moreno et al, 2009). Given the large amount of genomic information available in the databases and the urgency of its actual interpretation, the balance has begun to lean heavily toward the bioinformatics infrastructure requirements of research laboratories, Figure 1 (below).

Bioinformatics, or computational biology, is defined as the application of computer and information technology to the analysis of biological data (Mount, 2004). It is an interdisciplinary science that requires the use of computing, applied mathematics, statistics, computer science, artificial intelligence, biophysics, biochemistry, genetics, and molecular biology. Bioinformatics was born from the need to understand the sequences of nucleotide or amino acid symbols that make up DNA and proteins, respectively. These analyses are made possible by the development of powerful algorithms that predict and reveal an infinity of structural and functional features in genomic sequences, such as gene location, discovery of homologies between macromolecule databases (BLAST), algorithms for phylogenetic analysis, regulatory analysis, or prediction of protein folding, among others. This great development has created a multiplicity of approaches, giving rise to new types of bioinformatics, such as the multifractal bioinformatics (MFB) proposed here.

1.1 Multifractal Bioinformatics and Theoretical Background

MFB is a proposal to analyze the information content of genomes and their life properties in a nonlinear way. It is part of a specialized sub-discipline called "nonlinear bioinformatics", which uses a number of related techniques for the study of nonlinearity (fractal geometry, Hurst exponents, power laws, wavelets, among others) applied to the study of biological problems. For its application, we must take into account a detailed knowledge of the structure of the genome to be analyzed and an appropriate knowledge of multifractal analysis.

1.2 From the Worm Genome toward Human Genome

To explore a complex genome such as the HG, it is relevant to implement multifractal analysis (MFA) first in a simpler genome in order to show its practical utility. For example, the genome of the small nematode Caenorhabditis elegans is an excellent model from which many lessons can be extrapolated to complex organisms. Thus, if the MFA explains some of the structural properties of that genome, it is expected that the same analysis will reveal similar properties in the HG.

The C. elegans nuclear genome comprises about 100 Mbp in six chromosomes: five autosomes and one sex chromosome. The molecular structure of the genome is particularly homogeneous along the chromosome sequences, due to the presence of several regular features, including large contents of genes and introns of similar sizes. The C. elegans genome also has a regional organization of the chromosomes, mainly because the majority of the repeated sequences are located in the chromosome arms, Figure 2 (left) (C. elegans Sequencing Consortium, 1998). Given these regular and irregular features, the MFA could be an appropriate approach to analyze such distributions.

Meanwhile, HG sequencing revealed a surprising mosaicism of coding (genes) and noncoding (repetitive DNA) sequences, Figure 2 (right) (Venter et al., 2001). This structure of 6 Gbp is divided into 23 pairs of chromosomes (in diploid cells), and these highly regionalized sequences introduce complex patterns of regularity and irregularity for understanding gene structure, the composition of repetitive DNA sequences, and their role in the study and application of the life sciences. The coding regions of the genome are estimated at ~25,000 genes, which constitute 1.4% of the HG. These genes are embedded in a giant sea of various types of non-coding sequences, which compose 98.6% of the HG (popularly misnamed "junk DNA"). The non-coding regions are characterized by many types of repeated DNA sequences: 10.6% consists of Alu sequences, a type of SINE (short interspersed element) preferentially located toward the genes. LINEs, MIR, MER, LTRs, DNA transposons, and introns are other types of non-coding sequences, which together form about 86% of the genome. Some of these sequences overlap with one another, as with CpG islands, which complicates analysis of the genomic landscape. This standard genomic landscape was recently clarified: the latest studies show that 80.4% of the HG is functional, owing to the discovery of more than five million "switches" that operate and regulate gene activity, re-evaluating the concept of "junk DNA" (The ENCODE Project Consortium, 2012).

Given that all these genomic variations both in worm and human produce regionalized genomic landscapes it is proposed that Fractal Geometry (FG) would allow measuring how the genetic information content is fragmented. In this paper the methodology and the nonlinear descriptive models for each of these genomes will be reviewed.

1.3 The MFA and its Application to Genome Studies

Most problems in physics are implicitly non-linear in nature, generating phenomena such as chaos. Chaos theory deals with dynamic systems that are highly sensitive to initial conditions yet deterministic in rigor: their behavior can, in principle, be completely determined by knowing the initial conditions (Peitgen et al, 1992). In turn, FG is an appropriate tool for studying chaotic dynamic systems (CDS). In other words, FG and chaos are closely related, because the region of space toward which a chaotic orbit tends asymptotically has a fractal structure (a strange attractor). Therefore, FG allows studying the framework on which CDS are defined (Moon, 1992). And this is how the genome's structure and function are expected to be organized.

The MFA is an extension of FG related to (Shannon) information theory, disciplines that have been very useful for studying the information content of a sequence of symbols. Mandelbrot established FG in the 1980s as a geometry capable of measuring the irregularity of nature by calculating the fractal dimension (D), an exponent derived from a power law (Mandelbrot, 1982). The value of D gives a measure of the level of fragmentation, or the information content, of a complex phenomenon, because D measures the degree of scaling of the system's fragmented self-similarity. Thus, FG looks for self-similar properties in structures and processes at different scales of resolution, and these self-similarities are organized following scaling or power laws.
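As a concrete instance of such a power law, the Koch curve splits into N = 4 self-similar copies when scaled down by a factor E = 3, so its fractal dimension follows directly from D = ln N / ln E:

```python
import math

# Fractal dimension from the self-similarity power law N(E) = E**D,
# i.e. D = ln N(E) / ln E.  For the Koch curve, scaling by E = 3
# yields N = 4 self-similar copies.
def fractal_dimension(n_parts, scale):
    return math.log(n_parts) / math.log(scale)

d_koch = fractal_dimension(4, 3)
print(round(d_koch, 4))  # 1.2619
```

A value between 1 and 2 reflects a curve that is more "space-filling" than a line but less than a plane.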

Sometimes one exponent is not sufficient to characterize a complex phenomenon, and more exponents are required. The multifractal formalism allows for this; it applies when many subsets of fractals with different scaling properties, and hence a large number of exponents or fractal dimensions, coexist simultaneously. As a result, when a multifractal singularity spectrum is generated, the scaling behavior of the frequency of symbols in a sequence can be quantified (Vélez et al, 2010).

The MFA has been implemented to study the spatial heterogeneity of theoretical and experimental fractal patterns in different disciplines. In post-genomic times, the MFA has been used to study multiple biological problems (Vélez et al, 2010). Nonetheless, very little attention has been given to using the MFA to characterize the structural genetic information content of genomes from images obtained by the Chaos Game Representation (CGR). The first studies at this level were made recently for the C. elegans genome (Vélez et al, 2010) and the human genome (Moreno et al, 2011). The MFA methodology applied to the study of these genomes is developed below.

2. Methodology

The Multifractal Formalism from the CGR

2.1 Data Acquisition and Molecular Parameters

Databases for the C. elegans genome and the 36.2 Hs_refseq HG version were downloaded from the NCBI FTP server. Then, several strategies were designed to fragment the genomic DNA sequences into different length ranges. For example, the C. elegans genome was divided into 18 fragments, Figure 2 (left), and the human genome into 9,379 fragments. According to their annotation systems, the contents of molecular parameters were counted for each sequence: coding sequences (genes, exons, and introns), noncoding sequences (repetitive DNA, Alu, LINEs, MIR, MER, LTR, promoters, etc.), and coding/non-coding DNA (TTAGGC, AAAAT, AAATT, TTTTC, TTTTT, CpG islands, etc.).
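The per-fragment tallying of simple motifs such as TTAGGC or CpG dinucleotides can be sketched as follows. This is a generic overlapping-match counter, not the authors' annotation pipeline:

```python
# Count occurrences of a motif in a genomic fragment, including
# overlapping matches (which str.count() would miss).
def count_motif(seq, motif):
    seq, motif = seq.upper(), motif.upper()
    return sum(1 for i in range(len(seq) - len(motif) + 1)
               if seq[i:i + len(motif)] == motif)

fragment = "TTAGGCTTAGGCCGCGAAAAATT"  # toy fragment, not real data
print(count_motif(fragment, "TTAGGC"))  # 2
print(count_motif(fragment, "CG"))      # 2
print(count_motif(fragment, "AAAAT"))   # 1
```

In the actual study, such counts per fragment become the molecular parameters later correlated with the multifractal parameters.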

2.2 Construction of the CGR

2.3 Fractal Measurement by the Box-Counting Method

Subsequently, the CGR, a recursive algorithm (Jeffrey, 1990; Restrepo et al, 2009), is applied to each selected DNA sequence, Figure 3 (above, left), to obtain an image, which is then quantified by the box-counting algorithm. For example, Figure 3 (above, left) shows a CGR image for a human DNA sequence 80,000 bp in length. Here, dark regions represent sub-quadrants with a high number of points (nucleotides); clear regions, sections with a low number of points. The calculation of D for the Koch curve by the box-counting method is illustrated by a progression of changes in the grid size, and its Cartesian graph, Table 1.
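A minimal version of the CGR algorithm can be written in a few lines. The corner assignment below (A, C, G, T at the four corners of the unit square) follows the spirit of Jeffrey (1990), though conventions vary between implementations:

```python
import numpy as np

# Chaos Game Representation sketch: start at the center of the unit
# square and move halfway toward the corner of each successive base.
# The resulting point cloud is then binned into a grid image whose
# box counts feed the box-counting / multifractal analysis.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0),
           "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq):
    x, y = 0.5, 0.5
    pts = []
    for base in seq.upper():
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        pts.append((x, y))
    return pts

def cgr_histogram(seq, grid=8):
    # Bin the CGR points into a grid x grid image of point counts.
    img = np.zeros((grid, grid), dtype=int)
    for x, y in cgr_points(seq):
        img[min(int(y * grid), grid - 1), min(int(x * grid), grid - 1)] += 1
    return img

img = cgr_histogram("ACGTACGTTTAGGC", grid=4)
print(img.sum())  # 14: one point per base
```

Dense cells of this histogram correspond to the dark sub-quadrants described above.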

The CGR image for a given DNA sequence is quantified by a standard fractal analysis. A fractal is a fragmented geometric figure whose parts are approximate copies of the whole at reduced scale; that is, the figure has self-similarity. The D is basically a scaling rule that the figure obeys. Generally, a power law is given by the following expression:

N(E) = E^D (1)

where N(E) is the number of parts required to cover the figure when a scaling factor E is applied. The power law permits calculating the fractal dimension as:

D = ln N(E) / ln E (2)

To obtain D by the box-counting algorithm, the figure is covered with disjoint boxes of size ɛ = 1/E and the number of boxes required is counted. Figure 4 (above, left) shows the multifractal measure at moment q = 1.
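The box-counting estimate of D can be sketched directly from this definition: cover the point set with boxes of side ɛ = 1/E for several E, count the occupied boxes N(E), and take the slope of ln N(E) versus ln E:

```python
import numpy as np

# Box-counting sketch: for each scale E, cover the unit square with
# boxes of side 1/E, count the boxes that contain at least one
# point, and estimate D as the slope of ln N(E) versus ln E.
def box_count_dimension(points, scales=(4, 8, 16, 32)):
    pts = np.asarray(points, dtype=float)
    log_e, log_n = [], []
    for e in scales:
        boxes = {tuple(np.minimum((p * e).astype(int), e - 1)) for p in pts}
        log_e.append(np.log(e))
        log_n.append(np.log(len(boxes)))
    return np.polyfit(log_e, log_n, 1)[0]

# A straight line in the unit square should give D close to 1.
line = [(t, t) for t in np.linspace(0, 1, 5000)]
print(round(box_count_dimension(line), 2))  # 1.0
```

The same routine applied to a CGR point cloud yields the fractal dimension of the corresponding image.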

2.4 Multifractal Measurement

When the box-counting algorithm is generalized to the multifractal case according to the method of moments q, we obtain equation (3) (Gutiérrez et al, 1998; Yu et al, 2001):

D_q = [1 / (q − 1)] · ln[Σ_i (M_i / M)^q] / ln ɛ (3)

where M_i is the number of points falling in the i-th grid box, M is the total number of points, and ɛ is the box size. Thus, the MFA is used when multiple scaling rules are applied. Figure 4 (above, right) shows the calculation of the multifractal measures at different moments q (partition function). Here, linear regressions must have a coefficient of determination equal or close to 1. From each linear regression, dimensions D_q are obtained, generating a spectrum of generalized fractal dimensions D_q for all integers q, Figure 4 (below, left). So, the multifractal spectrum is obtained as the limit:

D_q = [1 / (q − 1)] · lim_{ɛ→0} ln[Σ_i (M_i / M)^q] / ln ɛ (4)
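A hedged sketch of the method of moments follows: for each box size, the normalized box masses M_i/M are raised to the power q, and D_q is estimated from the log-log slope of the partition function (the case q = 1 requires a separate limit and is omitted here):

```python
import numpy as np

# Method-of-moments sketch for the generalized dimensions D_q
# (q != 1): for each box size eps = 1/E, form the partition
# function sum_i (M_i / M)**q over occupied boxes, then take
# D_q = [slope of ln(partition) versus ln(eps)] / (q - 1).
def generalized_dimension(points, q, scales=(4, 8, 16, 32)):
    pts = np.asarray(points, dtype=float)
    log_eps, log_z = [], []
    for e in scales:
        idx = np.minimum((pts * e).astype(int), e - 1)
        _, counts = np.unique(idx, axis=0, return_counts=True)
        mu = counts / counts.sum()           # box masses M_i / M
        log_eps.append(np.log(1.0 / e))
        log_z.append(np.log(np.sum(mu ** q)))
    slope = np.polyfit(log_eps, log_z, 1)[0]
    return slope / (q - 1)

# Uniform points fill the square, so D_q should stay near 2.
rng = np.random.default_rng(1)
uniform = rng.random((20000, 2))
print(round(generalized_dimension(uniform, q=2), 2))
```

For a monofractal set like this, D_q is flat in q; a genuinely multifractal measure produces a decreasing D_q curve.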

The variation of the integer q allows emphasizing different regions and discriminating their fractal behavior: positive values emphasize the dense regions and negative values the scarce ones, and in both cases a high D_q indicates a richness of structure and properties in those regions. In real-world applications, the limit D_q is readily approximated from the data using a linear fitting: the transformation of equation (3) yields:

ln[Σ_i (M_i / M)^q] = D_q (q − 1) ln ɛ (5)

which shows that ln[Σ_i (M_i / M)^q] for fixed q is a linear function of ln ɛ; D_q can therefore be evaluated as the slope of ln[Σ_i (M_i / M)^q] against (q − 1) ln ɛ. The methodologies and approaches for the box-counting method and the MFA are detailed in Moreno et al, 2000; Yu et al, 2001; Moreno, 2005. For a rigorous mathematical development of the MFA from images, consult the Wikipedia article on multifractal systems.

2.5 Measurement of Information Content

Subsequently, from the spectrum of generalized dimensions D_q, the degree of multifractality ΔD_q (MD) is calculated as the difference between the maximum and minimum values of D_q: ΔD_q = D_qmax − D_qmin (Ivanov et al, 1999). When ΔD_q is high, the multifractal spectrum is rich in information and highly aperiodic; when ΔD_q is small, the resulting dimension spectrum is poor in information and highly periodic. It is expected, then, that aperiodicity in the genome would be related to highly polymorphic, aperiodic genomic structures, and the periodic regions to highly repetitive, not very polymorphic genomic structures. The correlation exponent τ(q) = (q − 1)D_q, Figure 4 (below, right), can also be obtained from the multifractal dimension D_q. The generalized dimension also provides significant specific information: D(q = 0) is equal to the capacity dimension, which in this analysis is the size of the "box count"; D(q = 1) is equal to the information dimension, and D(q = 2) to the correlation dimension. Based on these multifractal parameters, many structural genomic properties can be quantified, related, and interpreted.
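These spectrum-derived quantities are simple to compute once the D_q values are in hand; the numbers below are illustrative, not taken from the paper:

```python
# Degree of multifractality and the correlation exponent, as defined
# above: Delta D_q = Dq_max - Dq_min and tau(q) = (q - 1) * D_q.
def multifractality_degree(dq_values):
    return max(dq_values) - min(dq_values)

def tau(q, d_q):
    return (q - 1) * d_q

flat = [2.0, 2.0, 2.0]  # hypothetical monofractal spectrum
rich = [2.3, 2.0, 1.6]  # hypothetical multifractal spectrum
print(multifractality_degree(flat))            # 0.0
print(round(multifractality_degree(rich), 2))  # 0.7
print(tau(2, 2.0))                             # 2.0
```

A flat spectrum (ΔD_q = 0) corresponds to a periodic, information-poor structure; a wide spectrum to an aperiodic, information-rich one.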

2.6 Multifractal Parameters and Statistical and Discrimination Analyses

Once the multifractal parameters are calculated (D_q for q = (−20, 20), ΔD_q, τ(q), etc.), correlations with the molecular parameters are sought. These relations are established by plotting the number of genome molecular parameters versus the MD, by discriminant analysis with Cartesian graphs in 2-D, Figure 5 (below, left), and 3-D, and by combining multifractal and molecular parameters. Finally, simple linear regression analysis, multivariate analysis, and analyses by ranges and clusterings are performed to establish statistical significance.
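A minimal sketch of this correlation step, using made-up illustrative values (not the paper's data) for per-fragment Alu content and the multifractality degree MD:

```python
import numpy as np

# Simple linear regression of a molecular parameter against the
# degree of multifractality (MD).  Both arrays below are
# hypothetical illustrative values, not results from the study.
alu_content = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # hypothetical %
md = np.array([0.5, 0.9, 1.4, 1.8, 2.2])            # hypothetical MD
slope, intercept = np.polyfit(alu_content, md, 1)
r = np.corrcoef(alu_content, md)[0, 1]              # Pearson correlation
print(round(slope, 3), round(r, 3))
```

A strong positive correlation of this kind is what the study reports between Alu content and HG multifractality.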

3 Results and Discussion

3.1 Non-linear Descriptive Model for the C. elegans Genome

Analyzing the C. elegans genome with the multifractal formalism revealed what the symmetry and asymmetry of the genome's nucleotide composition had suggested. Thus, the multifractal scaling of the C. elegans genome is of interest because it indicates that the molecular structure of the chromosomes may be organized as a system operating far from equilibrium, following nonlinear laws (Ivanov et al, 1999; Burgos and Moreno-Tovar, 1996). This can be discussed from two points of view:

1) When comparing C. elegans chromosomes with each other, the X chromosome showed the lowest multifractality, Figure 5 (above). This means that the X chromosome is operating close to equilibrium, which results in increased genetic instability. The instability of the X could thus selectively contribute to the molecular mechanism that determines sex (XX or X0) during meiosis; the X chromosome would be operating closer to equilibrium in order to maintain its particular sexual dimorphism.

2) When comparing different chromosome regions of the C. elegans genome, changes in multifractality were found in relation to the regional organization (at the center and arms) exhibited by the chromosomes, Figure 5 (below, left). These behaviors are associated with changes in the content of repetitive DNA, Figure 5 (below, right). The results indicated that the chromosome arms are even more complex than previously anticipated. Thus, TTAGGC telomere sequences would be operating far from equilibrium to protect the genetic information encoded by the entire chromosome.

All these biological arguments may explain why the C. elegans genome is organized in a nonlinear way. These findings provide insight to quantify and understand the organization of the non-linear structure of the C. elegans genome, which may be extended to other genomes, including the HG (Vélez et al, 2010).

3.2 Nonlinear Descriptive Model for the Human Genome

Once the multifractal approach was validated in the C. elegans genome, the HG was analyzed exhaustively. This allowed us to propose a nonlinear model for the HG structure, which will be discussed from three points of view.

1) It was found that the high multifractality of the HG depends strongly on the content of Alu sequences and, to a lesser extent, on the content of CpG islands. These contents would be located primarily in highly aperiodic regions, thus taking the chromosome far from equilibrium and giving it greater genetic stability, protection, and attraction of mutations, Figure 6 (A-C). Thus, hundreds of regions in the HG may have high genetic stability, and the most important genetic information of the HG, the genes, would be safeguarded from environmental fluctuations. Other repeated elements (LINEs, MIR, MER, LTRs) showed no significant relationship, Figure 6 (D). Consequently, the human multifractal map developed in Moreno et al, 2011 constitutes a good tool to identify regions rich in genetic information and genomic stability.

2) The multifractal context seems to be a significant requirement for the structural and functional organization of thousands of genes and gene families. Thus, a high (aperiodic) multifractal context appears to be a "genomic attractor" for many genes (KOGs, KEGGs), Figure 6 (E), and some gene families, Figure 6 (F), involved in genetic and deterministic processes, in order to maintain deterministic regulatory control in the genome, although most HG sequences may be subject to a complex epigenetic control.

3) The classification of human chromosomes and the analysis of chromosome regions may have some medical implications (Moreno et al, 2002; Moreno et al, 2009). This means that the low nonlinearity exhibited by some chromosomes (or chromosome regions) involves an environmental predisposition to act as potential targets for structural or numerical chromosomal alterations, Figure 6 (G). Additionally, sex chromosomes should have low multifractality to maintain sexual dimorphism and, probably, X chromosome inactivation.

All these fractal and biological arguments could explain why Alu elements are shaping the HG in a nonlinear manner (Moreno et al, 2011). Finally, the multifractal modeling of the HG serves as a theoretical framework to examine new discoveries made by the ENCODE project and new approaches to human epigenomes. That is, the nonlinear organization of the HG might help to explain why most of the HG is expected to be functional.

4. Conclusions

All these results show that the multifractal formalism is appropriate for quantifying and evaluating the genetic information content of genomes and for relating it to the known molecular anatomy of the genome and some of its expected properties. Thus, the MFB allows interpreting in a logical manner the structural nature and variation of the genome.

The MFB allows understanding why a number of chromosomal diseases are likely to occur in the genome, thus opening a new perspective toward personalized medicine for studying and interpreting the HG and its diseases.

The entire genome contains nonlinear information that organizes it and supposedly makes it function, leading to the conclusion that virtually 100% of the HG is functional. Bioinformatics in general is enriched with a novel approach (MFB) that makes it possible to quantify the genetic information content of any DNA sequence, with practical applications to different disciplines in biology, medicine, and agriculture. This novel breakthrough in the computational analysis of genomes and diseases contributes to defining biology as a "hard" science.

MFB opens a door to develop a research program towards the establishment of an integrative discipline that contributes to "break" the code of human life. (http://pharmaceuticalintelligence.com/page/3/).

5. Acknowledgements

Thanks to the directives of the EISC, the Universidad del Valle, and the School of Engineering for offering an academic, scientific, and administrative space for conducting this research. Likewise, thanks to the co-authors (professors and students) who participated in the implementation of excerpts from some of the works cited here. Finally, thanks to Colciencias for the biotechnology project grant # 1103-12-16765.

6. References

Blanco, S., & Moreno, P.A. (2007). Representación del juego del caos para el análisis de secuencias de ADN y proteínas mediante el análisis multifractal (método "box-counting"). In The Second International Seminar on Genomics and Proteomics, Bioinformatics and Systems Biology (pp. 17-25). Popayán, Colombia.

Burgos, J.D., & Moreno-Tovar, P. (1996). Zipf scaling behavior in the immune system. BioSystems, 39, 227-232.

C. elegans Sequencing Consortium. (1998). Genome sequence of the nematode C. elegans: a platform for investigating biology. Science, 282, 2012-2018.

Gutiérrez, J.M., Iglesias, A., Rodríguez, M.A., Burgos, J.D., & Moreno, P.A. (1998). Analyzing the multifractal structure of DNA nucleotide sequences. In M. Barbie & S. Chillemi (Eds.), Chaos and Noise in Biology and Medicine (chap. 4). Hackensack (NJ): World Scientific Publishing Co.

Ivanov, P.Ch., Nunes, L.A., Goldberger, A.L., Havlin, S., Rosenblum, M.G., Struzik, Z.R., & Stanley, H.E. (1999). Multifractality in human heartbeat dynamics. Nature, 399, 461-465.

Jeffrey, H.J. (1990). Chaos game representation of gene structure. Nucleic Acids Research, 18, 2163-2175.

Mandelbrot, B. (1982). La geometría fractal de la naturaleza. Barcelona, España: Tusquets Editores.

Moon, F.C. (1992). Chaotic and fractal dynamics. New York: John Wiley.

Moreno, P.A. (2005). Large scale and small scale bioinformatics studies on the Caenorhabditis elegans genome. Doctoral thesis. Department of Biology and Biochemistry, University of Houston, Houston, USA.

Moreno, P.A., Burgos, J.D., Vélez, P.E., Gutiérrez, J.M., et al. (2000). Multifractal analysis of complete genomes. In Proceedings of the 12th International Genome Sequencing and Analysis Conference (pp. 80-81). Miami Beach (FL).

Moreno, P.A., Rodríguez, J.G., Vélez, P.E., Cubillos, J.R., & Del Portillo, P. (2002). La genómica aplicada en salud humana. Colombia Ciencia y Tecnología. Colciencias, 20, 14-21.

Moreno, P.A., Vélez, P.E., & Burgos, J.D. (2009). Biología molecular, genómica y post-genómica. Pioneros, principios y tecnologías. Popayán, Colombia: Editorial Universidad del Cauca.

Moreno, P.A., Vélez, P.E., Martínez, E., Garreta, L., Díaz, D., Amador, S., Gutiérrez, J.M., et al. (2011). The human genome: a multifractal analysis. BMC Genomics, 12, 506.

Mount, D.W. (2004). Bioinformatics. Sequence and genome analysis. New York: Cold Spring Harbor Laboratory Press.

Peitgen, H.O., Jürgen, H., & Saupe, D. (1992). Chaos and Fractals. New Frontiers of Science. New York: Springer-Verlag.

Restrepo, S., Pinzón, A., Rodríguez, L.M., Sierra, R., Grajales, A., Bernal, A., Barreto, E., et al. (2009). Computational biology in Colombia. PLoS Computational Biology, 5(10), e1000535.

The ENCODE Project Consortium. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature, 489, 57-74.

Vélez, P.E., Garreta, L.E., Martínez, E., Díaz, N., Amador, S., Gutiérrez, J.M., Tischer, I., & Moreno, P.A. (2010). The Caenorhabditis elegans genome: a multifractal analysis. Genet and Mol Res, 9, 949-965.

Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., et al. (2001). The sequence of the human genome. Science, 291, 1304-1351.

Yu, Z.G., Anh, V., & Lau, K.S. (2001). Measure representation and multifractal analysis of complete genomes. Physical Review E, 64, 031903.


Other articles on Bioinformatics on this Open Access Journal include:

Bioinformatics Tool Review: Genome Variant Analysis Tools

2017 Agenda – BioInformatics: Track 6: BioIT World Conference & Expo ’17, May 23-25, 2017, Seaport World Trade Center, Boston, MA

Better bioinformatics

Broad Institute, Google Genomics combine bioinformatics and computing expertise

Autophagy-Modulating Proteins and Small Molecules Candidate Targets for Cancer Therapy: Commentary of Bioinformatics Approaches

CRACKING THE CODE OF HUMAN LIFE: The Birth of BioInformatics & Computational Genomics


Read Full Post »

Next Generation Sequencing in Clinical Laboratory

Curator: Larry H. Bernstein, MD, FCAP

INSIGHTS on Next-Generation Sequencing

Next-generation sequencing (NGS) brings scalability and sensitivity to diagnostics in ways that traditional DNA analysis could not

Enabling Technology for Diagnosis, Prognosis, and Personalized Medicine

Significantly higher speed, lower cost, smaller sample size, and higher accuracy compared with conventional Sanger sequencing make next-generation sequencing (NGS) an attractive platform for medical diagnostics. By practically eliminating cost and time barriers, NGS allows testing of virtually any gene or genetic mutation associated with diseases.

Scalability and Sensitivity

NGS brings scalability and sensitivity to diagnostics in ways that traditional DNA analysis could not. “NGS analyzes hundreds of gene variants or biomarkers simultaneously. Traditional sequencing is better suited for analysis of single genes or fewer than 100 variants,” notes Joseph Bernardo, president of next-generation sequencing and oncology at Thermo Fisher Scientific (Waltham, MA).

Related Article: Computational Changes in Next-Generation Sequencing

Thermo Fisher’s Oncomine Focus Assay for NGS, for example, analyzes close to 1,000 biomarkers across a 52-gene panel. These biomarkers constitute about 1,000 different locations on the 52 genes that correlate with the efficacy of certain drugs. The assay allows single-workflow concurrent analysis of DNA and RNA, enabling sequencing of 35 hot-spot genes, 19 genes associated with copy number gain, and 23 fusion genes.

NGS is also better suited to detect lower levels of variants present in heterogeneous material, such as tumor samples. And while both NGS and Sanger sequencing are versatile, NGS can analyze both DNA and RNA, including RNA fusions, at a much more cost-efficient price point.

“When interrogating a limited number of analytes, Sanger sequencing is the standard for many laboratory- developed tests, offering fast turnaround times and lower cost than NGS,” Bernardo says. “We view the two methods as complementary.”

Diagnostic NGS is moving inexorably toward targeted sequencing, particularly for tumor analysis. The targets are specific regions within a tumor’s DNA or individual genes, or specific locations on single genes.

“Targeted sequencing lends itself to diagnostic testing, particularly in oncology, as the goal is to analyze multiple genes associated with cancer using a platform that offers high sensitivity, reliability, and rapid turnaround time,” Bernardo tells Lab Manager. “It is simply more cost-effective.”

That is why the National Cancer Institute (NCI) chose Thermo Fisher’s Ion Torrent sequencing system and the Oncomine reagents for NCI-MATCH, the most ambitious trial to date of NGS oncology diagnostics.

NCI-MATCH will use a 143-gene panel to test submitted tumor samples at four centers (NCI, MD Anderson Cancer Center, Massachusetts General Hospital, and Yale University). The centers then provide sequencing data that helps direct appropriate treatments.

The NCI test protocol ensures consistency across multiple instruments and sites.

Personalized Treatments

Another great opportunity for NGS-based diagnostics is in personalized or precision medicine for both new and existing drugs. Companion diagnostics—co-approved with the relevant drug—drive this entire business. “The only way personalized medicine can succeed commercially is if pharmaceutical companies incorporate a universal assay philosophy in their development programs instead of developing a unique assay for each new drug,” Bernardo explains. For example, in late 2015, Thermo Fisher partnered with Pfizer and Novartis to develop a universal companion diagnostic with the goal of identifying personalized therapy selection from a menu of drugs targeting non-small-cell lung cancer, which annually causes more deaths than breast, colon, and prostate cancer combined.
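As a purely illustrative sketch of that "menu of drugs" idea, a companion-diagnostic decision can be reduced to a lookup from panel variants to targeted therapies. The gene/alteration pairs below correspond to well-known NSCLC biomarker classes, but the rule table and function are invented for illustration and do not represent any vendor's actual assay logic.

```python
# Hypothetical sketch: one universal panel result matched against a menu
# of biomarker -> therapy rules. The menu below is invented for illustration.
THERAPY_MENU = [
    ("EGFR", "L858R", "EGFR inhibitor"),
    ("ALK",  "fusion", "ALK inhibitor"),
    ("ROS1", "fusion", "ROS1 inhibitor"),
]

def match_therapies(variants):
    """variants: list of (gene, alteration) pairs from one NGS panel run."""
    return [drug for gene, alt, drug in THERAPY_MENU
            if (gene, alt) in variants]

# One assay, many possible drugs: only the rules that match this
# patient's variants are returned.
options = match_therapies([("EGFR", "L858R"), ("KRAS", "G12C")])
```

The point of the design is that adding a new drug means adding a row to the menu, not developing a new assay.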

While advances in sequencing have been remarkable in recent years, the eventual success of NGS-based diagnostics will not depend on instrumentation alone. “What [ensures] ease of use and commonality of results is the cohesiveness of the entire workflow, from sample prep to rapid sequencing systems and bioinformatics,” Bernardo says. “Those components working together will drive NGS into a realizable solution for the clinical market.”

In addition to confirming a disease condition (diagnosis), NGS also provides valuable information on disease susceptibility, prognosis, and the potential effect of drugs on individual patients. The latter idea, known as precision medicine or personalized medicine, uses an individual’s molecular profile to guide treatment. The idea is to differentiate diseases into subtypes based on molecular (usually genetic) characteristics and tailor therapies accordingly.

Precision medicine is still in its infancy, but dozens of pharmaceutical, diagnostics, and genetics firms have bought into the idea.

“We are just at the beginning of connecting genomic and genetic information to target specific therapies for patients,” says T.J. Johnson, president and CEO of HTG Molecular Diagnostics (Tucson, AZ). “Precision medicine will have a bright future as we gain better understanding of the root causes of disease.”

In 2013, HTG commercialized its HTG Edge instrument platform and a portfolio of RNA assays, which fully automate the company’s proprietary nuclease protection chemistry. This chemistry measures mRNA and miRNA gene expression levels from very small quantities of difficult-to-handle samples.

HTG entered the NGS market in 2014 with the launch of the first HTG EdgeSeq product, an assay that targets and digitally measures the expression of more than 2,000 microRNAs. The assay utilizes the HTG Edge for sample and library preparation, and it uses a suitable NGS instrument (from either Illumina or Thermo Fisher) for quantitation. The data is imported back into the HTG EdgeSeq instrument for analytics and reporting.

In 2015, the company launched four additional HTG EdgeSeq panels: immuno- oncology and pan-oncology biomarker panels, a lymphoma profiling panel, and a classifier for subtyping diffuse large B-cell lymphomas (DLBCL).

Eliminating Biopsies?

Traditional biopsies for tumor DNA analysis are invasive, risky, and often impossible to obtain, and they may not uncover the heterogeneity often present in tumors. It was recently discovered that dying tumor cells release small pieces of DNA into the bloodstream. This cell-free circulating tumor DNA (ctDNA) is detectable in samples through NGS.

In September 2015, Memorial Sloan Kettering Cancer Center (MSK) and NGS leader Illumina (San Diego, CA) entered a collaboration to study ctDNA for cancer diagnosis and monitoring. The aim is to establish ctDNA as a relevant cancer biomarker.

Heterogeneity as it pertains to cancer traditionally refers to multiple tissues located within a tumor, as determined histologically. A number of recent studies suggest that tumor heterogeneity occurs at the genetic level as well. “In particular, there appears to be a tremendous variety of sequence variants within the same tumor, even resulting in situations where one tumor can have multiple mutated genes that have been demonstrated to drive cancer,” says John Leite, PhD, vice president, oncology—market development and product marketing at Illumina.

Heterogeneity challenges the search for treatments that target a specific gene product or pathway. Once the patient is treated, biopsies tell very little about how that patient is responding. “Our hope is that ctDNA provides clinicians with a real-time measure of the abundance of those mutated genes and that a decrease in the relative abundance is synonymous with a lower tumor burden,” Leite adds.

Clinical trials are still needed to demonstrate that patients whose therapy was selected using ctDNA rather than traditional tissue biopsy testing have significantly improved outcomes, or that ctDNA analysis is informative for prognosis.

What about cancer cells that do not release DNA? “Studies show that tumors from different organs or tissues release more or less ctDNA into the peripheral blood,” Leite explains, “but in general the possibility that some cells might not release ctDNA is an open area of research.”

For the MSK-Illumina collaboration, the cancer center will collect samples, and Illumina will apply its sequencing technology to detect ctDNA in those samples. The big draw here is the potential to reduce the number of invasive, expensive diagnostic and monitoring procedures with a simple blood test. This would not be possible without deep next-generation sequencing—the genomics vernacular for sequencing at great depths of coverage.

“Whereas sequencing to identify germline variants can be performed at a nominal depth of coverage—for example, reading a DNA strand 30 times—sequencing rare variants such as in ctDNA requires a much higher level of sensitivity, which is driven by increasing depth of coverage [as much] as 25,000 times,” Leite tells Lab Manager.
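The coverage numbers in the quote can be motivated with a simple binomial model, an illustrative simplification that ignores sequencing error: if a variant is present at allele fraction f and a locus is read n times, the chance of observing at least k supporting reads is one minus the binomial tail below k.

```python
# Illustrative model (not from the article): why rare ctDNA variants need
# far deeper coverage than germline variants. The min_reads threshold is
# an assumed caller requirement, chosen here for illustration.
from math import comb

def detection_probability(depth, allele_fraction, min_reads=5):
    """P(at least `min_reads` of `depth` reads carry the variant allele)."""
    p_miss = sum(
        comb(depth, k) * allele_fraction**k * (1 - allele_fraction)**(depth - k)
        for k in range(min_reads)
    )
    return 1.0 - p_miss

# A heterozygous germline variant (f = 0.5) is reliably seen at 30x,
# while a 0.1% ctDNA variant is essentially invisible at 30x and needs
# coverage in the tens of thousands.
p_germline  = detection_probability(30, 0.5)
p_ctdna_30x = detection_probability(30, 0.001)
p_ctdna_deep = detection_probability(25_000, 0.001)
```

At 25,000x and f = 0.001, the expected number of variant-supporting reads is 25, which is why depths of that order are quoted for ctDNA work.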

In addition to the Illumina MSK collaboration and the work of Thermo Fisher Scientific described above, many more studies involving research consortia and pharmaceutical companies are under way.

“This is a really exciting time for oncology,” Leite says.

Reducing Sample Size

Similarly, in November 2015, Circulogene Theranostics (Birmingham, AL) launched its cfDNA (cell-free DNA) liquid biopsy products for testing ten tumor types, including breast, lung, and colon cancers. The company claims the ability to enrich circulating cfDNA from a single drop of blood.

“While all liquid biopsy companies are focusing on the downstream novel technologies to selectively enrich or amplify tumor-specific cfDNA from a dominantly normal population, the upstream 40 to 90 percent material loss during cfDNA extraction leads to potential false negative results of cancer mutation detection,” explains Chen Yeh, Circulogene’s chief scientific officer. “This is why 10 to 20 mL of blood [are] generally required for conventional cfDNA liquid biopsies.”

Related Article: Researcher Using Next-Generation Sequencing, Other New Methods to Rapidly Identify Pathogens

Released cfDNA fragments often complex with proteins and lipids, which shift their densities to values much lower than those of pure DNA or protein while protecting the corresponding cfDNA from attack by circulating nucleases. Circulogene’s cfDNA breakthrough concentrates and enriches these genetic fragments through density fractionation followed by enzyme-based DNA modification and manipulation, eliminating extraction-associated loss. The technology ensures near-full recovery of both small-molecular-weight (apoptotic cell death) and high-molecular-weight (necrotic cell death) cell-free DNA species from droplet volumes of plasma, serum, or other body fluids.

“The 50-gene panel is our first offering,” says Yeh. “We will continue to develop and cover more comprehensive, informative, and actionable genes and tests.”

The current bottleneck in personalized and precision medicine is the severe shortage of anticancer drugs. Yeh provides perspective, saying, “We have about 60 FDA-approved drugs for cancer-targeted therapies on market, while there are approximately 150 cancer driver genes to aim for. If counting all mutations within these 150 genes, the numbers will be overwhelming.”

Circulogene’s cell-free DNA enrichment technology may be followed up with NGS, conventional Sanger sequencing, or any DNA assay based on PCR or mass spectrometry. However, the sensitivity of Sanger sequencing is insufficient for detecting variants present at frequencies below about 15 percent. Moreover, the company’s multiplex NGS liquid biopsy test captures and monitors real-time, longitudinal tumor heterogeneity or tumor clonal evolution, which goes well beyond the single-mutation, single-sample testing of traditional sequencing.


Gene Editing Casts a Wide Net 

With CRISPR, Gene Editing Can Trawl the Murk, Catching Elusive Phenotypes amidst the Epigenetic Ebb and Flow

  • Genome editing, a much-desired means of accomplishing gene knockout, gene activation, and other tasks, once seemed just beyond the reach of most research scientists and drug developers. But that was before the advent of CRISPR technology, an easy, versatile, and dependable means of implementing genetic modifications. It is in the process of democratizing genome editing.

    CRISPR stands for “clustered, regularly interspaced, short palindromic repeats,” segments of DNA that occur naturally in many types of bacteria. These segments function as part of an ancient immune system. Each segment precedes “spacer DNA,” a short base sequence that is derived from a fragment of foreign DNA. Spacers serve as reminders of past encounters with phages or plasmids.

    The CRISPR-based immune system encompasses several mechanisms, including one in which CRISPR loci are transcribed into small RNAs that may complex with a nuclease called CRISPR-associated protein (Cas). Then the RNA guides Cas, which cleaves invading DNA on the basis of sequence complementarity.

    In the laboratory, CRISPR sequences are combined with a short RNA complementary to a target gene site. The result is a complex in which the RNA guides Cas to a preselected target.

    Cas produces precise site-specific DNA breaks, which, with imperfect repair, cause gene mutagenesis. In more recent applications, Cas can serve as an anchor for other proteins, such as transcriptional factors and epigenetic enzymes. This system, it seems, has almost limitless versatility.
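The targeting rule described above, a roughly 20-nt protospacer immediately followed by an NGG PAM for the common SpCas9 nuclease, can be sketched as a toy site finder. The sequence below is invented, the scan covers only the forward strand, and real guide design also requires off-target scoring, so this is a sketch of the logic rather than a design tool.

```python
# Minimal sketch: enumerate candidate SpCas9 target sites on one strand.
# SpCas9 cuts where a 20-nt protospacer is followed by an "NGG" PAM.
def find_cas9_sites(seq, protospacer_len=20):
    """Return (start, protospacer, pam) for every NGG PAM on the + strand."""
    seq = seq.upper()
    sites = []
    for j in range(protospacer_len, len(seq) - 2):
        pam = seq[j:j + 3]
        if pam[1:] == "GG":  # NGG: any base, then two Gs
            sites.append((j - protospacer_len, seq[j - protospacer_len:j], pam))
    return sites

# Invented example sequence with a single NGG PAM after a 20-nt stretch.
sites = find_cas9_sites("ATGC" * 6 + "TGG" + "AAAA")
```

In the laboratory workflow described above, the protospacer portion of each hit is what the guide RNA is designed against.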

  • Edited Stem Cells

    The Sanger Institute Mouse Genetic Program, along with other academic institutions around the world, provides access to thousands of genetically modified mouse strains. “Genetic engineering of mouse embryonic stem (ES) cells by homologous recombination is a powerful technique that has been around since the 1980s,” says William Skarnes, Ph.D., senior group leader at the Wellcome Trust Sanger Institute.

    “A significant drawback of the ES technology is the time required to achieve germline transmission of the modified genetic locus,” he continues. “While we have an exhaustive collection of modified ES cells, only about 5,000 knockout mice, or about a quarter of the genes in the mouse genome, have been derived on the basis of this methodology.”

    The dominant position of mouse ES cell engineering is now being effectively challenged by CRISPR technology. Compared with the very low rates of homologous recombination in fertilized eggs, CRISPR generates high levels of mutations, and off-target effects may be so few as to be undetectable.

    “We used whole-genome sequencing to thoroughly assess off-target mutations in the offspring of CRISPR-engineered founder animals,” says Dr. Skarnes. “A mutated Cas9 nuclease was deployed to increase specificity, resulting in nearly perfect targeting.”

    Dr. Skarnes explains that the major mouse genome centers are now switching to CRISPR to complete the creation of the world-wide repository of mouse knockouts. His own research is now focused on genetically engineered induced pluripotent stem cells (iPSCs). These cells are adult cells that have been reprogrammed to an embryonic stem cell–like state, and are thus devoid of ethical issues associated with research on human embryonic stem cells. The ultimate goal is to establish a world-wide panel of reference iPSCs created by high-throughput genetic editing of every single human gene.

    “We are poised to begin a large-scale phenotypic analysis of human genes,” declares Dr. Skarnes. His lab is releasing the first set of functional data on 100 DNA repair genes. “By knocking out individual proteins involved in DNA repair and sequencing the genomes of mutant cells,” declares Dr. Skarnes, “we hope to better understand the mutational signatures that occur in cancer.”

  • Pooled CRISPR Libraries

    Researchers hope to gain a better understanding of the mutational signatures found in cancers by using CRISPR techniques to knock out individual proteins involved in DNA repair and then sequencing the genomes of mutant cells. [iStock/zmeel]

    Connecting a phenotype to the underlying genomics requires an unbiased screening of multiple genes at once. “Pooled CRISPR libraries provide an opportunity to cast a wide net at a reasonably low cost,” says Donato Tedesco, Ph.D., lead research scientist at Cellecta. “Screening one gene at a time on genome scale is a significant investment of time and money that not everyone can afford, especially when looking for common genetic drivers across many cell models.”

    Building on years of experience with shRNA libraries, Cellecta is uniquely positioned to prepare pooled CRISPR libraries for genome-wide or targeted screens of gene families. While shRNA interferes with gene expression at the transcript level, CRISPR disrupts the gene at the genomic level by exploiting imperfections in the DNA repair mechanism.

    To determine if these different mechanisms for inactivating genes affect the results of genetic screens, the team conducted a side-by-side comparison of Cellecta’s Human Genome-Wide Module 1 shRNA Library, which expresses 50,000 shRNAs targeting 6,300 human genes, with a library of 50,000 gRNAs targeting the same gene set. The concordance between the approaches was very high, suggesting that the two technologies may be complementary and used for cross-confirmation of results.
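The cross-confirmation idea can be sketched as a correlation of gene-level scores from the two screens. The depletion scores below are invented for illustration; an actual analysis would span thousands of genes and typically use rank-based statistics.

```python
# Hypothetical sketch: concordance between an shRNA screen and a gRNA
# screen of the same genes, measured as Pearson correlation of invented
# gene-level depletion scores (more negative = more depleted).
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

shrna_scores = {"GENE_A": -2.1, "GENE_B": -0.3, "GENE_C": -1.8, "GENE_D": 0.1}
grna_scores  = {"GENE_A": -2.6, "GENE_B": -0.1, "GENE_C": -2.0, "GENE_D": 0.0}

genes = sorted(shrna_scores)
r = pearson([shrna_scores[g] for g in genes], [grna_scores[g] for g in genes])
# A high r across the shared gene set is what "concordance" means here.
```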

    Also, a recently completed Phase I NIH SBIR grant was used to create and test single guide RNA (sgRNA) structures to drastically improve the efficiency of gene targeting. For this work, Cellecta used a pooled library strategy to simultaneously test multiple sgRNA structures for efficiency and specificity. An early customized Cellecta pooled gRNA library was successfully utilized in a screen for epigenetic genes. This particular screen is highly dependent on a complete loss of function and could not have been accomplished by shRNA inhibition.

    Scientists from Epizyme interrogated 600 genes in a panel of 100 cell lines and, in addition to finding many epigenetic genes required for proliferation in nearly all cell lines, were able to identify and validate several essential epigenetic genes required only in subsets of cells with specific genetic lesions. In other words, pooled cell line screening was able to distinguish targets that are likely to produce toxic side effects in certain types of cancer cells from gene targets that are essential in most cells.

    “A more complicated application of CRISPR technology is to use it for gene activation,” adds Dr. Tedesco. “Cellecta plans to optimize this application to bring forth highly efficient, inexpensive, high-throughput genetic screens based on our pooled libraries.”

  • Chemically Modified sgRNA

    Scientists based at Agilent Research Laboratories and Stanford University worked together to demonstrate that chemically modified single guide RNA can be used to enhance the genome editing of primary hematopoietic stem cells and T cells. This image, which is from the Stanford laboratory of Matthew Porteus, M.D., Ph.D., shows CD34+ human hematopoietic stem cells that were edited to turn green. Editing involved inserting a construct for green fluorescent protein. About 1,000 cells are pictured here.

    Researchers at Agilent Technologies applied their considerable experience in DNA and RNA synthesis to develop a novel chemical synthesis method that can generate long RNAs of 100 nucleotides or more, such as single guide RNAs (sgRNAs) for CRISPR genome editing. “We have used this capability to design and test numerous chemical modifications at different positions of the RNA molecule,” said Laurakay Bruhn, Ph.D., section manager, biological chemistry, Agilent.

    Agilent Research Laboratories worked closely with the laboratory of Matthew Porteus, M.D., Ph.D., an associate professor of pediatrics and stem cell transplantation at Stanford University. The Agilent and Stanford researchers collaborated to further explore the benefits of chemically modified sgRNAs in genome editing of primary hematopoietic stem cells and T cells.

    Dr. Porteus’ lab chose three key target genes implicated in the development of severe combined immunodeficiency (SCID), sickle cell anemia, and HIV transmission. Editing these genes in the patient-derived cells offers an opportunity for novel precision therapies, as the edited cells can renew, expand, and colonize the donor’s bone marrow.

    Dr. Bruhn emphasized the importance of editing specificity, so that no other cellular function is affected by the change. The collaborators focused on three chemical modifications strategically placed at each end of sgRNAs that Agilent had previously tested to show they maintained sgRNA function. A number of other optimization strategies in cell culturing and transfection were explored to ensure high editing yields.

    “Primary cells are difficult to manipulate and edit in comparison with cell lines already adapted to cell culture,” maintains Dr. Bruhn. Widely varied cellular properties of primary cells may require experimental adaptation of editing techniques for each primary cell type.

    The resulting data showed that chemical modifications can greatly enhance efficiency of gene editing. The next step would translate these findings into animal models. Another advantage of chemical synthesis of RNA is that it can potentially be used to make large enough quantities for therapeutics.

    “We are working with Agilent’s Nucleic Acid Solution Division—a business focused on GMP manufacturing of oligonucleotides for therapeutics—to engage with customers interested in this capability and better understand how we might be able to help them accomplish their goals,” says Dr. Bruhn.

  • Customized Animal Models

    “Given their gene-knockout capabilities, zinc-finger-based technologies and CRISPR-based technologies opened the doors for creation of animal models that more closely resemble human disease than mouse models,” says Myung Shin, Ph.D., senior principal scientist, Merck & Co. Dr. Shin’s team supports Merck’s drug discovery and development program by creating animal models mimicking human genetics.

    For example, Dr. Shin’s team has worked with the Dahl salt-sensitive strain of rats, a widely studied model of hypertension. “We used zinc-finger nucleases to generate a homozygous knockout of the renal outer medullary potassium channel (ROMK) gene,” elaborates Dr. Shin. “The resulting model represents a major advance in elucidating the role of the ROMK gene.”

    According to Dr. Shin, the model may also provide a bridge between genetics and physiology, particularly in studies of renal regulation and blood pressure. In one study, the model generated animal data that suggest ROMK plays a key role in kidney development and sodium absorption. Work along these lines may lead to a pharmacological strategy to manage hypertension.

    In another study, the team applied a zinc-finger nuclease strategy to knock out coagulation Factor XII and thoroughly characterized the knockout animals in thrombosis and hemostasis studies. The results confirmed and extended previous literature findings suggesting Factor XII as a potential target for antithrombotic therapies that carry minimal bleeding risk. The model can be further utilized to study the safety profiles and off-target effects of novel Factor XII inhibitors.

    “We use one-cell embryos to conduct genome editing with zinc-fingers and CRISPR,” continues Dr. Shin. “The ease of this genetic manipulation speeds up generation of animal models for testing of various hypotheses.”

    A zinc finger–generated knockout of the multidrug resistance protein MDR 1a P-glycoprotein became an invaluable tool for evaluating drug candidates for targets located in the central nervous system. For example, it demonstrated utility in pharmacological analyses.

    Dr. Shin’s future research is directed toward preclinical animal models that would contain specific nucleotide changes corresponding to those of humans. “CRISPR technology,” insists Dr. Shin, “brings an unprecedented power to manipulate genome at the level of a single nucleotide, to create gain- or loss-of-function genetic alterations, and to deeply understand the biology of a disease.”

  • Transcriptionally Active dCas9

    “Epigenome editing is important for several reasons,” says Charles Gersbach, Ph.D., an associate professor of biomedical engineering at Duke University. “It is a tool that helps us answer fundamental questions about biology. It advances disease modeling and drug screening. And it may, in the future, serve as mode of genetic therapy.”

    “One part of our research focuses on studying the function of epigenetic marks,” Dr. Gersbach continues. “While many of these marks are catalogued, and some have been associated with certain gene-expression states, the exact causal link between these marks and their effect on gene expression is not known. CRISPR technology can potentially allow for targeted, direct manipulation of each epigenetic mark, one at a time.”

    Dr. Gersbach’s team mutated the Cas9 nuclease to create deactivated Cas9 (dCas9), which is devoid of endonuclease activity. Although the dCas9 protein lacks catalytic activity, it can still serve as an anchor for a plethora of other important proteins, such as transcription factors and methyltransferases.

    In an elegant study, Dr. Gersbach and colleagues demonstrated that recruitment of a histone acetyltransferase by dCas9 to a genomic site activates nearby gene expression. Moreover, the activation occurred even when the acetyltransferase domain was targeted to a distal enhancer. Similarly, recruitment of KRAB repressor to a distant site silenced the target gene in a very specific manner. These findings support the important role of three-dimensional chromatin structures in gene activation.

    “Genome regulation by epigenetic markers is not static,” maintains Dr. Gersbach. “It responds to changes in the environment and other stimuli. It also changes during cell differentiation. We designed an inducible system providing us with an ability to execute dynamic control over the target genes.”

    In a light-activated CRISPR-Cas9 effector (LACE) system, blue light may be used to control the recruitment of the transcriptional factor VP64 to target DNA sequences. The system has been used to provide robust activation of four target genes with only minimal background activity. Selective illumination of culture plates created a pattern of gene expression in a population of cells, which could be used to mimic what is observed in natural tissues.

    Together with collaborators at Duke University, Dr. Gersbach intends to carry out the high-throughput analysis of all currently identified regulatory elements in the genome. “Our ultimate goal,” he declares, “is to assign function to all of these elements.”

Read Full Post »