Posts Tagged ‘artificial neural network’

The Binding of Oligonucleotides in DNA and 3-D Lattice Structures

Curator: Larry H Bernstein, MD, FCAP


This article is a renewal of a previous discussion on the role of genomics in discovery of therapeutic targets which focused on:

  •  key drivers of cellular proliferation,
  •  stepwise mutational changes coinciding with cancer progression, and
  •  potential therapeutic targets for reversal of the process.

“The Birth of BioInformatics & Computational Genomics” lays out the manifold multivariate systems analytical tools that have moved the science forward to a ground that ensures clinical application. There is a web-like connectivity between inter-connected scientific discoveries, as significant findings have led to novel hypotheses and have driven our understanding of biological and medical processes at an exponential pace owing to insights into the chemical structure of DNA,

  • the basic building blocks of DNA  and proteins,
  • of nucleotide and protein-protein interactions,
  • protein folding, allostericity, genomic structure,
  • DNA replication,
  • nuclear polyribosome interaction, and
  • metabolic control.

In addition, the emergence of methods for

  • copying,
  • removal and insertion, and
  • improvements in structural analysis as well as
  • developments in applied mathematics have transformed the research framework.

Three-Dimensional Folding and Functional Organization Principles of the Drosophila Genome

Sexton T, Yaffe E, Kenigsberg E, Bantignies F,…Cavalli G. Institut de Génétique Humaine, Montpellier GenomiX, and Weizmann Institute, France and Israel. Cell 2012; 148(3): 458-472. http://dx.doi.org/10.1016/j.cell.2012.01.010 http://www.ncbi.nlm.nih.gov/pubmed/22265598

Chromosomes are the physical realization of genetic information and thus form the basis for its

  •   readout and propagation.

Here we present a high-resolution chromosomal contact map derived from a modified genome-wide chromosome conformation capture approach applied to Drosophila embryonic nuclei. The entire genome is linearly partitioned into well-demarcated physical domains that overlap extensively with

  •   active and repressive epigenetic marks.

Chromosomal contacts are hierarchically organized between domains. Global modeling of contact density and clustering of domains show that

  •   inactive domains are condensed and confined to their chromosomal territories, whereas
  •  active domains reach out of the territory to form remote intra- and interchromosomal contacts.
  •  We systematically identify specific long-range intrachromosomal contacts between Polycomb-repressed domains.

Together, these observations allow for quantitative prediction of the Drosophila chromosomal contact map, laying the foundation for detailed studies of

  • chromosome structure and function in a genetically tractable system.
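A chromosomal contact map of the kind described above is, at its simplest, a symmetric matrix that counts how often two genomic bins were captured together. The following is a minimal sketch of that binning step only. The function name `contact_matrix`, the bin size, and the read-pair coordinates are hypothetical placeholders for illustration and are not taken from the Sexton et al. pipeline.

```python
def contact_matrix(pairs, bin_size, n_bins):
    """Count ligation pairs per pair of genomic bins (symmetric matrix)."""
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in pairs:
        i, j = pos_a // bin_size, pos_b // bin_size
        matrix[i][j] += 1
        if i != j:
            # Mirror off-diagonal contacts so the map stays symmetric.
            matrix[j][i] += 1
    return matrix


# Hypothetical read pairs (positions in base pairs along one chromosome).
toy_pairs = [(100, 900), (1200, 1800), (150, 2500), (2100, 2900), (1100, 1300)]
for row in contact_matrix(toy_pairs, bin_size=1000, n_bins=3):
    print(row)
```

Dense blocks along the diagonal of such a matrix correspond to the physical domains discussed above, while off-diagonal entries record the longer-range contacts.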

“Mr. President; The Genome is Fractal!”

Eric Lander (Science Adviser to the President and Director of the Broad Institute) et al. delivered the message on the cover of Science Magazine (Oct. 9, 2009), which generated interest at a September meeting of the International HoloGenomics Society. First, it may seem to be trivial to rectify the statement in “About cover” of Science Magazine by AAAS. The statement

  • “the Hilbert curve is a one-dimensional fractal trajectory” needs mathematical clarification.

While the paper itself does not make this statement, the new Editorship of the AAAS Magazine might be even more advanced had the previous Editorship not rejected (without review)

  • a Manuscript by 20+ Founders of (formerly) International PostGenetics Society in December, 2006.

Second, it may not be sufficiently clear for the reader that the reasonable requirement for the

  • DNA polymerase to crawl along a “knot-free” (or “low knot”) structure does not need fractals.

A “knot-free” structure could be spooled by an ordinary “knitting globule” (such that the DNA polymerase does not bump into a “knot” when duplicating the strand; just like someone knitting can go through the entire thread without encountering an annoying knot):

  • Just to be “knot-free” you don’t need fractals.

Note, however, that the “strand” can be accessed only at its beginning; it is impossible, e.g.,

  • to pluck a segment from deep inside the “globulus”.

This is where certain fractals provide a major advantage – that could be the “Eureka” moment. For instance, the mentioned Hilbert-curve is not only “knot free” – but provides an easy access to

  • “linearly remote” segments of the strand.

If the Hilbert curve starts from the lower right corner and ends at the lower left corner, for instance,

  • the path shows the very easy access to what would be the mid-point when the Hilbert curve
  • is measured by the distance travelled along the zig-zagged path.

Likewise, even the path from the beginning of the Hilbert-curve is about equally easy to access – easier than to reach from the origin a point that is about 2/3 down the path. The Hilbert-curve provides an easy access between two points within the “spooled thread”; from a point that is about 1/5 of the overall length to about 3/5 is also in a “close neighborhood”. This marvellous fractal structure is illustrated by the 3D rendering of the Hilbert-curve. Once you observe such fractal structure,

  • you’ll never again think of a chromosome as a “brillo mess”, will you?
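The claim that “linearly remote” segments can sit in a close neighborhood can be checked on a small two-dimensional Hilbert curve. The sketch below uses the widely published index-to-coordinate conversion for the 2D curve; the function names and the choice of an 8 x 8 grid are assumptions made for illustration, and this orientation starts in the lower left corner and ends in the lower right, the mirror image of the example above.

```python
def hilbert_point(order, d):
    """Return (x, y) of step d along a Hilbert curve filling a 2**order square grid."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Reflect the sub-square before transposing it.
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def remote_neighbours(order):
    """Find the pair of adjacent grid cells that are farthest apart along the path."""
    n = 1 << order
    index = {hilbert_point(order, d): d for d in range(n * n)}
    best = (0, None, None)
    for (x, y), d in index.items():
        for neighbour in ((x + 1, y), (x, y + 1)):
            if neighbour in index:
                gap = abs(index[neighbour] - d)
                if gap > best[0]:
                    best = (gap, (x, y), neighbour)
    return best


gap, cell_a, cell_b = remote_neighbours(3)
print(f"cells {cell_a} and {cell_b} touch, yet lie {gap} steps apart on a 63-step path")
```

Every consecutive pair of steps is adjacent on the grid, so the thread is never broken, while some touching cells are separated by more than half the length of the path.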

It will dawn on you that the genome is orders of magnitude more finessed than we ever thought. Those embarking on a somewhat complex review of some historical aspects of the power of fractals may wish to consult the oeuvre of Mandelbrot (also, to celebrate his 85th birthday). For the more sophisticated readers, even the fairly simple Hilbert curve (a representative of the Peano class) becomes even more stunningly brilliant than just some “see through density”. Those who are familiar with the classic “Traveling Salesman Problem” know that “the shortest path along which every given n locations can be visited once, and only once” requires fairly sophisticated algorithms (and a tremendous amount of computation if n>10, or much more). Some readers will be amazed, therefore, that for n=9 the underlying Hilbert curve helps to provide an empirical solution (refer to pellionisz@junkdna.com).

Briefly, the significance of the above realization, that the (recursive) Fractal Hilbert Curve is intimately connected to the (recursive) solution of the Traveling Salesman Problem, a core concept of Artificial Neural Networks, can be summarized as below. Accomplished physicist John Hopfield (already a member of the National Academy of Sciences) aroused great excitement in 1982 with his (recursive) design of artificial neural networks and learning algorithms which were able to find solutions to combinatorial problems such as the Traveling Salesman Problem. (Book review, Clark Jeffries, 1991; see J Anderson, Rosenfeld, and A Pellionisz (eds.), Neurocomputing 2: Directions for research, MIT Press, Cambridge, MA, 1990): “Perceptrons were modeled chiefly with neural connections in a “forward” direction: A -> B -> C -> D. The analysis of networks with strong backward coupling proved intractable. All our interesting results arise as consequences of the strong back-coupling” (Hopfield, 1982).

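The “strong back-coupling” in the quotation can be illustrated with a very small recurrent network in which every unit feeds back to every other unit through symmetric weights. This is a minimal sketch of the associative-memory form of such a network with a Hebbian storage rule. It is not a Traveling Salesman solver, and the function names and the two stored patterns are assumptions chosen for illustration.

```python
def train_hopfield(patterns):
    """Hebbian weights for +1/-1 patterns: symmetric, with a zero diagonal."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for pattern in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += pattern[i] * pattern[j] / n
    return weights


def recall(weights, state, sweeps=5):
    """Asynchronous updates: each unit is driven by feedback from all the others."""
    s = list(state)
    for _ in range(sweeps):
        for i in range(len(s)):
            field = sum(weights[i][j] * s[j] for j in range(len(s)))
            s[i] = 1 if field >= 0 else -1
    return s


# Two hypothetical, mutually orthogonal patterns.
stored = [[1, 1, 1, 1, -1, -1, -1, -1], [1, -1, 1, -1, 1, -1, 1, -1]]
w = train_hopfield(stored)
noisy = [-1, 1, 1, 1, -1, -1, -1, -1]  # first pattern with its first unit flipped
print(recall(w, noisy) == stored[0])
```

Because the output of each unit re-enters the network as input, the state settles recursively into a stored pattern rather than being computed in a single forward pass.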
The Principle of Recursive Genome Function surpassed obsolete axioms that blocked, for half a century, entry of recursive algorithms to interpretation of the structure and function of the (Holo)Genome. This breakthrough,

  • by uniting the two largely separate fields of Neural Networks and Genome Informatics,

is particularly important for those who focused on Biological (actually occurring) Neural Networks (rather than abstract algorithms that may not, or because of their core axioms simply could not, represent neural networks under the governance of DNA information).

If biophysicist Andras Pellionisz is correct, genetic science may be on the verge of yielding its third, and by far biggest, surprise. With a doctorate in physics, Pellionisz is the holder of Ph.D.’s in computer sciences and experimental biology from the prestigious Budapest Technical University and the Hungarian National Academy of Sciences. A biophysicist by training, the 59-year-old is a former research associate professor of physiology and biophysics at New York University, author of numerous papers in respected scientific journals and textbooks, a past winner of the prestigious Humboldt Prize for scientific research, a former consultant to NASA and holder of a patent on the world’s first artificial cerebellum, a technology that has already been integrated into research on advanced avionics systems. Because of his background, the Hungarian-born brain researcher might also become one of the first people to successfully launch a new company by

  • using the Internet to gather momentum for a novel scientific idea.

The genes we know about today, Pellionisz says, can be thought of as something similar to machines that make bricks (proteins, in the case of genes), with certain junk-DNA sections providing a blueprint for the different ways those proteins are assembled. Reflecting the notion that at least certain parts of junk DNA might have a purpose, many researchers, for example,

  • now refer to them with a far less derogatory term: introns.

In a provisional patent application filed July 31, Pellionisz claims to have

  • unlocked a key to the hidden role junk DNA

plays in growth — and in life itself. His patent application covers all attempts to

  • count,
  • measure and
  • compare

the fractal properties of introns for diagnostic and therapeutic purposes.

The FractoGene Decade: from Inception in 2002 to Proofs of Concept and Impending Clinical Applications by 2012

  • Junk DNA Revisited (SF Gate, 2002)
  • The Future of Life, 50th Anniversary of DNA (Monterey, 2003)
  • Mandelbrot and Pellionisz (Stanford, 2004)
  • Morphogenesis, Physiology and Biophysics (Simons, Pellionisz 2005)
  • PostGenetics; Genetics beyond Genes (Budapest, 2006)
  • ENCODE-conclusion (Collins, 2007)
  • The Principle of Recursive Genome Function (paper, YouTube, 2008)
  • YouTube Cold Spring Harbor presentation of FractoGene (Cold Spring Harbor, 2009)
  • Mr. President, the Genome is Fractal! (2009)
  • HolGenTech, Inc. Founded (2010)
  • Pellionisz on the Board of Advisers in the USA and India (2011)
  • ENCODE – final admission (2012)
  • Recursive Genome Function is Clogged by Fractal Defects in Hilbert-Curve (2012)
  • Geometric Unification of Neuroscience and Genomics (2012)
  • US Patent Office issues FractoGene 8,280,641 to Pellionisz (2012)

http://www.junkdna.com/the_fractogene_decade.pdf

The Hidden Fractal Language of Intron DNA

To fully understand Pellionisz’ idea, one must first know what a fractal is. Fractals are a way that nature organizes matter. Fractal patterns can be found in anything that has a nonsmooth surface (unlike a billiard ball), such as

  • coastal seashores,
  • the branches of a tree or
  • the contours of a neuron (a nerve cell in the brain).

Some, but not all, fractals are self-similar and stop repeating their patterns at some stage;

  • the branches of a tree, for example, can get only so small.

Because they are geometric, meaning they have a shape, fractals can be described in mathematical terms. It’s similar to the way a circle can be described by using a number to represent its radius (the distance from its center to its outer edge). When that number is known, it’s possible to draw the circle it represents without ever having seen it before. Although the math is much more complicated, the same is true of fractals. If one has the formula for a given fractal, it’s possible to use that formula to construct, or reconstruct, an image of whatever structure it represents, no matter how complicated. The mysteriously repetitive but not identical strands of genetic material are in reality

  • building instructions organized in a special type of pattern known as a fractal.

It’s this pattern of fractal instructions, he says, that tells genes what they must do in order to form living tissue, everything from the wings of a fly to the entire body of a full-grown human. In a move sure to alienate some scientists, Pellionisz chose the unorthodox route of making his initial disclosures online on his own Web site. He picked that strategy, he says, because it is the fastest way he can document his claims and find scientific collaborators and investors. Most mainstream scientists usually blanch at such approaches, preferring more traditionally credible methods, such as publishing articles in peer-reviewed journals. Pellionisz’ idea is that a fractal set of building instructions in the DNA plays a role in organizing life itself. Decode the language, and in theory it could be reverse engineered. Just as knowing the radius of a circle lets one create that circle, the fractal-based formula

  • would allow us to understand how a heart or disease-fighting antibodies are created.

The idea is to encourage new collaborations across the boundaries that separate the intertwined

  • disciplines of biology, mathematics and computer sciences.

Hal Plotkin, Special to SF Gate. Thursday, November 21, 2002.

http://www.junkdna.com/
http://www.junkdna.com/the_fractogene_decade.pdf
http://www.sciencentral.com/articles/view.php3?article_id=218392305
http://www.news-medical.net/health/Junk-DNA-What-is-Junk-DNA.aspx
http://www.kurzweilai.net/junk-dna-plays-active-role-in-cancer-progression-researchers-find
http://marginalrevolution.com/marginalrevolution/2013/05/the-battle-over-junk-dna
http://profiles.nlm.nih.gov/SC/B/B/F/T/_/scbbft.pdf

Human Genome is Multifractal

The human genome: a multifractal analysis. Moreno PA, Vélez PE, Martínez E, et al. BMC Genomics 2011, 12:506. http://www.biomedcentral.com/1471-2164/12/506

Several studies have shown that genomes can be studied via a multifractal formalism. These researchers had used a multifractal approach to study the genetic information content of the Caenorhabditis elegans genome. Here they investigated the possibility that the human genome shows a similar behavior to that observed in the nematode. They report

  • multifractality in the human genome sequence.

This behavior correlates strongly with the presence of Alu elements and, to a lesser extent, with CpG islands and (G+C) content.

  1. Gene function,
  2. clusters of orthologous genes,
  3. metabolic pathways, and
  4. exons
  • tended to increase their frequencies with ranges of multifractality, and
  • large gene families were located in genomic regions with varied multifractality.
  • A multifractal map and classification for human chromosomes are proposed.

They propose a descriptive non-linear model for the structure of the human genome. This model reveals a multifractal regionalization where many regions coexist that are

  • far from equilibrium and this non-linear organization has significant
  • molecular and medical genetic implications for understanding the role of Alu elements in genome stability and structure of the human genome.

Given the role of Alu sequences in

  • gene regulation
  • genetic diseases
  • human genetic diversity
  • adaptation and phylogenetic analyses

these quantifications are especially useful.
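What distinguishes a multifractal from a simple fractal is that its generalized dimension D_q changes with the moment order q. The sketch below estimates D_q from the scaling of the partition sum between two box sizes, using a synthetic binomial cascade for which the exact answer is known. The cascade, the function names, and the parameter p = 0.3 are illustrative assumptions; this is not the analysis pipeline of Moreno et al. and no genomic data are involved.

```python
import math


def binomial_cascade(p, levels):
    """Measure on 2**levels equal boxes, built by repeatedly splitting mass p : 1 - p."""
    masses = [1.0]
    for _ in range(levels):
        masses = [m * f for m in masses for f in (p, 1.0 - p)]
    return masses


def partition_sum(masses, q):
    return sum(m ** q for m in masses if m > 0)


def generalized_dimension(masses, q):
    """Estimate D_q from how sum(mu_i**q) scales when the box size is doubled."""
    if q == 1:
        raise ValueError("q = 1 needs the information-dimension limit, not handled here")
    coarse = [masses[i] + masses[i + 1] for i in range(0, len(masses), 2)]
    # Doubling the box size means log(eps_fine / eps_coarse) = log(1/2).
    tau = (math.log(partition_sum(masses, q))
           - math.log(partition_sum(coarse, q))) / math.log(0.5)
    return tau / (q - 1)


measure = binomial_cascade(0.3, levels=8)
for q in (-2, 0, 2):
    print(q, round(generalized_dimension(measure, q), 4))
```

For a uniform (monofractal) measure the printed values would all be equal; a spread of D_q across q is the signature of multifractality.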

MiIP: The Monomer Identification and Isolation Program

Bun C, Ziccardi W, Doering J and Putonti C. Evolutionary Bioinformatics 2012:8 293-300. http://dx.doi.org/10.4137/EBO.S9248

Repetitive elements within genomic DNA are both functionally and evolutionarily informative. Discovering these sequences ab initio is computationally challenging, compounded by the fact that sequence identity between repetitive elements can vary significantly. These investigators present a new application, the Monomer Identification and Isolation Program (MiIP),

  • which provides functionality to both search for a particular repeat as well as
  • discover repetitive elements within a larger genomic sequence.

To compare MiIP’s performance with other repeat detection tools, analysis was conducted for synthetic sequences as well as several α21-II clones and HC21 BAC sequences. The main benefit of MiIP is that

  • it is a single tool capable of both searching for known monomeric sequences and
  • discovering the occurrence of repeats ab initio.
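Searching for a known monomer when copies differ in sequence identity amounts to approximate matching. The following is a minimal sketch that slides a window along the sequence and accepts positions within a chosen number of mismatches. It is not MiIP's algorithm, which is not described here; the function names, the Hamming-distance criterion, and the toy sequence are assumptions for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_monomer(sequence, monomer, max_mismatches):
    """Return (position, mismatches) for each window close enough to the monomer."""
    k = len(monomer)
    hits = []
    for i in range(len(sequence) - k + 1):
        distance = hamming(sequence[i:i + k], monomer)
        if distance <= max_mismatches:
            hits.append((i, distance))
    return hits


# Hypothetical sequence containing two exact copies and one diverged copy of ACGT.
print(find_monomer("TTACGTACGAACGTTT", "ACGT", max_mismatches=1))
# → [(2, 0), (6, 1), (10, 0)]
```

Raising `max_mismatches` trades specificity for sensitivity, which is the practical difficulty the authors point to when repeat copies have drifted apart.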

Triplex DNA: A third strand for DNA

The DNA double helix can under certain conditions accommodate

  • a third strand in its major groove.

Researchers in the UK  presented a complete set of four variant nucleotides that makes it

  • possible to use this phenomenon in gene regulation and mutagenesis.

Natural DNA only forms a triplex if the targeted strand is rich in purines – guanine (G) and adenine (A) – which in addition to the bonds of the Watson-Crick base pairing

  • can form two further hydrogen bonds, and
  • the ‘third strand’ oligonucleotide has the matching sequence of pyrimidines – cytosine (C) and thymine (T).

Any Cs or Ts in the target strand of the duplex will only bind very weakly, as

  • they contribute just one hydrogen bond.

Moreover, because the recognition of G requires the C in the probe strand to be protonated,

  • triplex formation will only work at low pH.
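The constraints on natural triplex formation described above can be summarized as a simple design rule: the target strand must be all purine, T in the third strand reads A, and protonated C reads G. The sketch below encodes that rule for the pyrimidine motif, in which the third strand runs parallel to the purine strand. The function names and the example target are hypothetical, and the code does not model the modified nucleotides discussed in this section.

```python
PURINES = set("AG")
# Pyrimidine motif: T recognizes an A-T pair, protonated C recognizes a G-C pair.
THIRD_STRAND = {"A": "T", "G": "C"}


def pyrimidine_tfo(purine_strand):
    """Third strand, written in the same 5'->3' direction as the purine strand."""
    if any(base not in PURINES for base in purine_strand):
        raise ValueError("target strand must contain only purines (A, G)")
    return "".join(THIRD_STRAND[base] for base in purine_strand)


def cytosine_fraction(tfo):
    """Share of positions that need protonation, and hence favour low pH."""
    return tfo.count("C") / len(tfo)


target = "AAGGAGA"  # hypothetical homopurine tract
tfo = pyrimidine_tfo(target)
print(tfo, round(cytosine_fraction(tfo), 2))
```

A target interrupted by C or T is rejected outright here, which mirrors the very weak binding noted above, and a high cytosine fraction flags a probe that would depend strongly on acidic conditions.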

To overcome all these problems, the groups of Tom Brown and Keith Fox at the University of Southampton have developed modified building blocks, and have now completed

  • a set of four new nucleotides, each of which will bind to one DNA nucleotide from the major groove of the double helix.

They tested the binding of a 19-mer of these designer nucleotides to a double helix target sequence in comparison with the corresponding triplex-forming oligonucleotide made from natural DNA bases. Using fluorescence-monitored thermal melting and DNase I footprinting, the researchers showed that

  • their construct forms a stable triplex even at neutral pH.

Tests with mutated versions of the target sequence showed that

  • three of the novel nucleotides are highly selective for their target base pair,
  • while the ‘S’ nucleotide, designed to bind to T, also tolerates C.


DA Rusling et al, Nucleic Acids Res. 2005, 33, 3025 http://nucleicacidsres.com/Rusling_DA
KM Vasquez et al, Science 2000, 290, 530 http://Science.org/2000/290.530/Vazquez_KM/

Frank-Kamenetskii MD, Mirkin SM. Annual Rev Biochem 1995; 64:69-95. http://www.annualreviews.org/aronline/1995/Frank-Kamenetski_MD/64.69/

Since the pioneering work of Felsenfeld, Davies, & Rich, double-stranded polynucleotides containing purines in one strand and pyrimidines in the other strand [such as poly(A)/poly(U), poly(dA)/poly(dT), or poly(dAG)/poly(dCT)] have been known to be able to undergo a stoichiometric transition forming a triple-stranded structure containing one polypurine and two polypyrimidine strands. Early on, it was assumed that the third strand was located in the major groove and associated with the duplex via non-Watson-Crick interactions now

  • known as Hoogsteen pairing.

Triple helices consisting of one pyrimidine and two purine strands were also proposed. However, notwithstanding the fact that single-base triads in tRNA structures were well-documented, triple-helical DNA escaped wide attention before the mid-1980s. The interest in DNA triplexes arose due to two partially independent developments.

  1. Homopurine-homopyrimidine stretches in supercoiled plasmids were found to adopt an unusual DNA structure, called H-DNA, which includes a triplex.
  2. Several groups demonstrated that homopyrimidine and some purine-rich oligonucleotides
  • can form stable and sequence-specific complexes with
  • corresponding homopurine-homopyrimidine sites on duplex DNA.

These complexes were shown to be triplex structures rather than D-loops, where

  • the oligonucleotide invades the double helix and displaces one strand.

A characteristic feature of all these triplexes is that the two

  • chemically homologous strands (both pyrimidine or both purine) are antiparallel.

These findings led to explosive growth in triplex studies. One can easily imagine numerous “geometrical” ways to form a triplex, several of which have been studied experimentally. The canonical intermolecular triplex consists of either

  • three independent oligonucleotide chains, or
  • a long DNA duplex carrying a homopurine-homopyrimidine insert and the corresponding oligonucleotide.

Triplex formation strongly depends on the oligonucleotide(s) concentration. A single DNA

  • chain may also fold into a triplex connected by two loops.

To comply with the sequence and polarity requirements for triplex formation, such a DNA strand must have a peculiar sequence: It contains a mirror repeat

  1. (homopyrimidine for YR*Y triplexes and homopurine for YR*R triplexes)
  2. flanked by a sequence complementary to one half of this repeat.
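The mirror-repeat requirement can be checked mechanically: the tract must read the same in both directions on one strand and must consist only of purines or only of pyrimidines. The sketch below finds the longest such tract by expanding around every possible center. The function name `longest_h_motif` and the test sequence are hypothetical, and the code checks the sequence rule only, not whether a triplex actually forms.

```python
KIND = {"A": "R", "G": "R", "C": "Y", "T": "Y"}  # R = purine, Y = pyrimidine


def longest_h_motif(seq):
    """Longest mirror repeat made only of purines or only of pyrimidines."""
    best = ""
    n = len(seq)
    for center in range(2 * n - 1):
        lo, hi = center // 2, (center + 1) // 2
        kind = KIND[seq[center // 2]]
        # Grow outwards while both ends match and stay in the same base class.
        while lo >= 0 and hi < n and seq[lo] == seq[hi] and KIND[seq[lo]] == kind:
            lo -= 1
            hi += 1
        candidate = seq[lo + 1:hi]
        if len(candidate) > len(best):
            best = candidate
    return best


print(longest_h_motif("CATGAAGGAGGAAGTC"))
# → GAAGGAGGAAG
```

Note that a mirror repeat is symmetric on a single strand, which distinguishes it from the inverted repeats (palindromes in the double-stranded sense) that form hairpins and cruciforms.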

Such DNA sequences fold into triplex configuration much more readily than do the corresponding intermolecular triplexes, because all triplex forming segments are brought together within the same molecule. It has become clear that both

  • sequence requirements and chain polarity rules for triplex formation

can be met by DNA target sequences built of clusters of purines and pyrimidines. The third strand consists of adjacent homopurine and homopyrimidine blocks forming Hoogsteen hydrogen bonds with purines on alternate strands of the target duplex, and

  • this strand switch preserves the proper chain polarity.

These structures, called alternate-strand triplexes, have been experimentally observed as both intra- and inter-molecular triplexes. These results increase the number of potential targets for triplex formation in natural DNAs somewhat by adding sequences composed of purine and pyrimidine clusters, although arbitrary sequences are still not targetable because

  • strand switching is energetically unfavorable.

References:

Lyamichev VI, Mirkin SM, Frank-Kamenetskii MD. J. Biomol. Struct. Dyn. 1986; 3:667-69. http://JbiomolStractDyn.com/1986/Lyamichev_VI/3.667/
Filippov SA, Frank-Kamenetskii MD. Nature 1987; 330:495-97. http://Nature.com/1987/Fillipov_SA/330.495/
Demidov V, Frank-Kamenetskii MD, Egholm M, Buchardt O, Nielsen PE. Nucleic Acids Res. 1993; 21:2103-7. http://NucleicAcidsResearch.com/1993/Demidov_V/21.2103/
Mirkin SM, Frank-Kamenetskii MD. Annu. Rev. Biophys. Biomol. Struct. 1994; 23:541-76. http://AnnRevBiophysBiomolecStructure.com/1994/Mirkin_SM/23.541/
Hoogsteen K. Acta Crystallogr. 1963; 16:907-16. http://ActaCrystallogr.com/1963/Hoogsteen_K/16.907/
Malkov VA, Voloshin ON, Veselkov AG, Rostapshov VM, Jansen I, et al. Nucleic Acids Res. 1993; 21:105-11. http://NucleicAcidsResearch.com/1993/Malkov_VA/21.105
Malkov VA, Voloshin ON, Soyfer VN, Frank-Kamenetskii MD. Nucleic Acids Res. 1993; 21:585-91. http://NucleicAcidsRes.com/1993/Malkov_VA/21.585/
Cherny DY, Belotserkovskii BP, Frank-Kamenetskii MD, Egholm M, Buchardt O, et al. Proc. Natl. Acad. Sci. USA 1993; 90:1667-70. http://PNAS.org/1993/Chemy_DY/90.1667/

Triplex forming oligonucleotides

Triplex forming oligonucleotides: sequence-specific tools for genetic targeting. Knauert MP, Glazer PM. Human Molec Genetics 2001; 10(20):2243-2251. http://HumanMolecGenetics.com/2001/Knauert_ MP/10.2243/

Triplex forming oligonucleotides (TFOs) bind in the major groove of duplex DNA with a

  • high specificity and affinity.

Because of these characteristics, TFOs have been proposed as

  • homing devices for genetic manipulation in vivo.

These investigators review work demonstrating the ability of TFOs and related molecules

  • to alter gene expression and mediate gene modification in mammalian cells.

TFOs can mediate targeted gene knock out in mice, providing a foundation for potential

  • application of these molecules in human gene therapy.

The Triplex Genetic Code

Novagon DNA

John Allen Berger, founder of Novagon DNA and The Triplex Genetic Code

Over the past 12+ years, Novagon DNA has amassed a vast array of empirical findings which

  • challenge the “validity” of the “central dogma theory”, especially the current five nucleotide
  • Watson-Crick DNA and RNA genetic codes. DNA = A1T1G1C1, RNA =A2U1G2C2.

We propose that our new Novagon DNA 6 nucleotide Triplex Genetic Code has more validity than

  • the existing 5 nucleotide (A1T1U1G1C1) Watson-Crick genetic codes.

Our goal is to conduct a “world class” validation study to replicate and extend our findings.

Methods for Examining Genomic and Proteomic Interactions.

An Integrated Statistical Approach to Compare Transcriptomics Data Across Experiments: A Case Study on the Identification of Candidate Target Genes of the Transcription Factor PPARα

Ullah MO, Müller M and Hooiveld GJEJ. Bioinformatics and Biology Insights 2012;6: 145–154. http://dx.doi.org/10.4137/BBI.S9529 http://www.ncbi.nlm.nih.gov/pubmed/22783064 Corresponding author email: guido.hooiveld@wur.nl http://edepot.wur.nl/213859

An effective strategy to elucidate the signal transduction cascades activated by a transcription factor

  • is to compare the transcriptional profiles of wild type and transcription factor knockout models.

Many statistical tests have been proposed for analyzing gene expression data, but

  • most tests are based on pair-wise comparisons.

Since the analysis of microarrays involves the testing of multiple hypotheses within one study,

  • it is generally accepted to control for false positives by the false discovery rate (FDR).

However, this may be an inappropriate metric for

    • comparing data across different experiments.
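For readers unfamiliar with how the false discovery rate is controlled within a single study, the step-up procedure of Benjamini and Hochberg is one common choice. The text above does not say which FDR procedure the authors have in mind, so the sketch below is an assumption in that respect, and the p-values are invented for illustration.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return booleans marking which hypotheses are rejected at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value lies under its own threshold.
        if pvalues[i] <= alpha * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for i in order[:cutoff_rank]:
        rejected[i] = True
    return rejected


# Hypothetical p-values for ten genes.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p))
```

Because each threshold depends on the number of tests m and on the ranks within one list, the same gene can pass in one experiment and fail in another, which is one reason a per-experiment FDR is awkward to compare across experiments.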

Here we propose  the simultaneous testing and integration of

  • the three hypotheses (contrasts) using the cell means ANOVA model.

These three contrasts test for the effect of a treatment in

  1. wild type,
  2. gene knockout, and
  3. globally over all experimental groups
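In a cell means model each experimental group has its own mean, and each hypothesis is a weighted sum (contrast) of those means tested against a pooled error variance. The sketch below is a minimal illustration under stated assumptions: four groups (wild type and knockout, each with and without treatment), the third contrast read as the average treatment effect over both genotypes, and invented expression values. The group labels and the function name are hypothetical, and the authors' exact definition of the global contrast may differ.

```python
from statistics import mean


def contrast_t(groups, weights):
    """Estimate and t statistic for sum(w_g * mean_g) with a pooled error variance."""
    means = {g: mean(values) for g, values in groups.items()}
    df = sum(len(values) - 1 for values in groups.values())
    sse = sum((x - means[g]) ** 2 for g, values in groups.items() for x in values)
    mse = sse / df  # pooled within-group variance from all cells
    estimate = sum(w * means[g] for g, w in weights.items())
    se = (mse * sum(w * w / len(groups[g]) for g, w in weights.items())) ** 0.5
    return estimate, estimate / se, df


# Hypothetical log-expression values for one gene.
groups = {
    "wt_ctl": [1.0, 1.2, 0.8], "wt_trt": [3.0, 3.2, 2.8],
    "ko_ctl": [1.1, 0.9, 1.0], "ko_trt": [1.0, 1.2, 1.1],
}
contrasts = {
    "treatment in wild type": {"wt_trt": 1, "wt_ctl": -1},
    "treatment in knockout": {"ko_trt": 1, "ko_ctl": -1},
    "treatment overall (assumed)": {"wt_trt": 0.5, "wt_ctl": -0.5, "ko_trt": 0.5, "ko_ctl": -0.5},
}
for name, weights in contrasts.items():
    estimate, t, df = contrast_t(groups, weights)
    print(f"{name}: estimate={estimate:.2f}, t={t:.2f}, df={df}")
```

A gene that responds in wild type but not in the knockout, as in this toy example, is the pattern expected for a candidate target of the deleted transcription factor.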

We compare differential expression of genes across experiments while

  • controlling for multiple hypothesis testing.

Managing biological complexity across orthologs with a visual knowledgebase of documented biomolecular interactions

Vincent Van Buren & Hailin Chen. Scientific Reports 2012; 2, Article number: 1011 http://dx.doi.org/10.1038/srep01011

The complexity of biomolecular interactions and influences is a major obstacle

  • to their comprehension and elucidation.

Visualizing knowledge of biomolecular interactions increases

  • comprehension and facilitates the development of new hypotheses.

The rapidly changing landscape of high-content experimental results also presents a challenge

  • for the maintenance of comprehensive knowledgebases.

Distributing the responsibility for maintenance of a knowledgebase to a community of

  • experts is an effective strategy for large, complex and rapidly changing knowledgebases.

Cognoscente serves these needs

  • by building visualizations for queries of biomolecular interactions on demand,
  • by managing the complexity of those visualizations, and
  • by crowdsourcing to promote the incorporation of current knowledge from the literature.

Imputing functional associations

  • between biomolecules and imputing directionality of regulation
  • for those predictions each require a corpus of existing knowledge as a framework.

Comprehension of the complexity of this corpus of knowledge will be facilitated by effective

  • visualizations of the corresponding biomolecular interaction networks.

Cognoscente (http://vanburenlab.medicine.tamhsc.edu/cognoscente.html) was designed and implemented to serve these roles as a knowledgebase and as

  • an effective visualization tool for systems biology research and education.

Cognoscente currently contains over 413,000 documented interactions, with coverage across multiple species. Perl, HTML, GraphViz, and a MySQL database were used in the development of Cognoscente. Cognoscente was motivated by the need to update the knowledgebase of

  • biomolecular interactions at the user level, and
  • flexibly visualize multi-molecule query results for
    • heterogeneous interaction types across different orthologs.
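Visualizing heterogeneous interaction types on demand comes down to turning query results into a graph description that a layout engine can draw. The sketch below writes a GraphViz DOT digraph from a list of interaction triples. It is written in Python rather than the Perl used for Cognoscente, it does not reproduce Cognoscente's schema or queries, and the molecule names, interaction kinds, and the function name `to_dot` are hypothetical.

```python
def to_dot(interactions):
    """Render (source, target, kind) triples as a GraphViz DOT digraph string."""
    styles = {
        "activates": "arrowhead=normal",
        "inhibits": "arrowhead=tee",
        "binds": "dir=none",
    }
    lines = ["digraph interactions {"]
    for source, target, kind in interactions:
        style = styles.get(kind, "arrowhead=normal")
        lines.append(f'  "{source}" -> "{target}" [label="{kind}", {style}];')
    lines.append("}")
    return "\n".join(lines)


# Placeholder molecules, not records from any knowledgebase.
print(to_dot([
    ("geneA", "geneB", "activates"),
    ("geneB", "geneC", "inhibits"),
    ("proteinX", "proteinY", "binds"),
]))
```

The resulting text can be passed to the GraphViz `dot` program to produce an image; distinct arrowheads keep different interaction types readable in one drawing.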

Satisfying these needs provides a strong foundation for developing new hypotheses about

  • regulatory and metabolic pathway topologies.

Several existing tools provide functions that are similar to Cognoscente.

Hilbert 3D curve, iteration 3 (Photo credit: Wikipedia)

3-dimensional Hilbert cube. (Photo credit: Wikipedia)

0th, 1st and 2nd iteration of Hilbert curve in 3D. (Photo credit: Wikipedia)

8 first steps of the building of the Hilbert curve in animated gif (Photo credit: Wikipedia)



Demonstration of a diagnostic clinical laboratory neural network agent applied to three laboratory data conditioning problems

Izaak Mayzlin, Principal Scientist, MayNet, Boston, MA

Larry Bernstein, MD, Technical Director, Methodist Hospital Laboratory, Brooklyn, NY

Our clinical chemistry section services a hospital emergency room seeing 15,000 patients with chest pain annually. We have used a neural network agent, MayNet, for data conditioning. The three applications are: troponin, CKMB, and EKG for chest pain; B-type natriuretic peptide (BNP) and EKG for congestive heart failure (CHF); and red cell count (RBC), mean corpuscular volume (MCV), and hemoglobin A2 (Hgb A2) for beta thalassemia. The three data sets were extensively validated prior to neural network analysis using receiver operating characteristic (ROC) analysis, Latent Class Analysis, and a multinomial regression approach. Optimum decision points for classifying using these data were determined using ROC (SYSTAT, 11.0), LCM (Latent Gold), and ordinal regression (GOLDminer). The ACS and CHF studies both had over 700 patients, and had a different validation sample than the initial exploratory population. MayNet incorporates prior clustering and sample extraction features in its application. MayNet results are in agreement with the other methods.

Introduction: Our clinical laboratory services a hospital with an emergency room seeing 15,000 patients with chest pain and produces over 2 million quality-controlled chemistry accessions annually. We have used a neural network agent, MayNet, to tackle the quality control of the information product. The agent combines a statistical tool that first performs clustering of input variables by Euclidean distances in multi-dimensional space. The clusters are trained on output variables by the artificial neural network performing non-linear discrimination on clusters’ averages. In applying this new agent system to diagnosis of acute myocardial infarction (AMI) we demonstrated that at an optimum clustering distance the number of classes is minimized with efficient training on the neural network. The software agent also performs a random partitioning of the patients’ data into training and testing sets, one-time neural network training, and an accuracy estimate on the testing data set. Three examples illustrate this: troponin, CKMB, and EKG for acute coronary syndrome (ACS); B-type natriuretic peptide (BNP) and EKG for the estimation of ejection fraction in congestive heart failure (CHF); and red cell count (RBC), mean corpuscular volume (MCV), and hemoglobin A2 (Hgb A2) for identifying beta thalassemia. We use three data sets that have been extensively validated prior to neural network analysis using receiver operating characteristic (ROC) analysis, Latent Class Analysis, and a multinomial regression approach.
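The random partitioning and accuracy estimate mentioned above follow a familiar pattern, sketched here in a few lines. This is not MayNet code: the function names, the 30% test fraction, the fixed seed, and the toy records are all assumptions for illustration.

```python
import random


def split_train_test(records, test_fraction, seed):
    """Shuffle a copy of the records and cut off a test set of the requested size."""
    rng = random.Random(seed)  # fixed seed so the partition can be reproduced
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, test_set):
    """Fraction of held-out (features, label) pairs that predict() gets right."""
    return sum(predict(features) == label for features, label in test_set) / len(test_set)


# Hypothetical records: (feature value, class label).
records = [(i, i % 2) for i in range(10)]
train, test = split_train_test(records, test_fraction=0.3, seed=1)
print(len(train), len(test), accuracy(lambda x: x % 2, test))
```

Estimating accuracy only on records the model never saw during training is what keeps the reported figure from being inflated by memorization.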

In previous studies [1,2], CK-MB and LD1 sampled at 12 and 18 hours postadmission, which were near-optimum times, were used to form a classification by the analysis of information in the data set. The population consisted of 101 patients with and 41 patients without AMI based on review of the medical records, clinical presentation, electrocardiography, serial enzyme and isoenzyme assays, and other tests. The clinical or EKG data, and other enzymes or sampling times, were not used to form a classification but could be handled by the program developed. All diagnoses were established by cardiologist review. An important methodological problem is the assignment of a correct diagnosis by a “gold standard” that is independent of the method being tested so that the method tested can be suitably validated. This solution is not satisfactory in the case of myocardial infarction because of the dependence of diagnosis on a constellation of observations with different sensitivities and specificities. We have argued that the accuracy of diagnosis is associated with the classes formed by combined features and has greatest uncertainty when associated with a single measure.

Methods: Neural network analysis is performed with MayNet, developed by one of the authors. Optimum decision points for classifying these data were determined using ROC (SYSTAT, 11.0), LCM (Latent Gold) (3), and ordinal regression (GOLDminer) (4). The ACS and CHF validation sets each had over 700 patients, and every study used a validation sample different from the initial exploratory population. MayNet incorporates prior clustering and sample extraction features. We now report on a new classification method and its application to the diagnosis of acute myocardial infarction (AMI). The method combines clustering by Euclidean distance in multi-dimensional space with non-linear discrimination by an Artificial Neural Network (ANN) trained on the clusters’ averages. These studies indicate that at an optimum clustering distance the number of classes is minimized with efficient training of the ANN. This novel approach reduces the number of patterns used for ANN learning and also works as an effective tool for smoothing data, removing singularities, and increasing the accuracy of classification by the ANN. The studies involve training and testing on separate clinical data sets and achieve a high accuracy of diagnosis (97%).

Unlike classification, which assumes that the borders between classes are defined in advance (5,6), the clustering procedure establishes these borders by processing statistical information and applying a given criterion for the difference (distance) between classes. We perform clustering using the geometrical (Euclidean) distance between two points in n-dimensional space, formed by n variables, including both input and output variables. Since this distance assumes compatibility of different variables, the values of all input variables are linearly transformed (scaled) to the range from 0 to 1.
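The scaling and distance-based clustering described above can be sketched in a few lines of Python. This is a minimal illustration, not MayNet's algorithm: the text does not specify how clusters are grown, so the greedy single-pass rule in the hypothetical `cluster` function (join the first cluster whose centroid lies within `max_distance`, otherwise start a new one) is an assumption, as are all function names and the sample values.

```python
import math

def min_max_scale(rows):
    """Linearly rescale every column of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def euclidean(p, q):
    """Geometrical (Euclidean) distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(members):
    """Coordinate-wise average of a list of points."""
    return [sum(coord) / len(members) for coord in zip(*members)]

def cluster(points, max_distance):
    """Assumed greedy rule: a point joins the first cluster whose current
    centroid is within max_distance, otherwise it starts a new cluster.
    Returns the cluster averages."""
    clusters = []
    for point in points:
        for members in clusters:
            if euclidean(point, centroid(members)) <= max_distance:
                members.append(point)
                break
        else:
            clusters.append([point])
    return [centroid(members) for members in clusters]

# Hypothetical values for illustration only (not patient data).
inputs = [[5.0, 20.0], [6.0, 22.0], [40.0, 60.0], [42.0, 58.0]]
diagnoses = [0, 0, 1, 1]

scaled = min_max_scale(inputs)
# The output variable is appended as an extra coordinate, since the
# distance is taken over both input and output variables.
points = [row + [float(d)] for row, d in zip(scaled, diagnoses)]
averages = cluster(points, max_distance=0.3)
```

Because the output variable takes part in the distance, a maximum distance below 1 keeps patterns with different diagnoses in different clusters in this sketch.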

For readers accustomed to classical statistics, the ANN technique can be viewed as an extension of multivariate regression analysis with such new features as non-linearity and the ability to process categorical data. Categorical (not continuous) variables represent two or more levels, groups, or classes of the corresponding feature; in our case this concept is used to signify patient condition, for example the presence or absence of AMI.

The ANN is an acyclic directed graph with input and output nodes corresponding respectively to the input and output variables. There are also “intermediate” nodes, comprising the so-called “hidden” layers. Each node $n_j$ is assigned a value $x_j$ that is evaluated by the node’s “processing” element as a non-linear function of the weighted sum of the values $x_i$ of the nodes $n_i$ connected to $n_j$ by directed edges $(n_i, n_j)$:

$$x_j = f\left(w_{i(1),j}\,x_{i(1)} + w_{i(2),j}\,x_{i(2)} + \dots + w_{i(l),j}\,x_{i(l)}\right),$$

where $x_k$ is the value in node $n_k$ and $w_{k,j}$ is the “weight” of the edge $(n_k, n_j)$. In our research we used the standard “sigmoid” function, defined as $f(x) = 1/(1 + \exp(-x))$. This function is suitable for categorical output and allows the use of an efficient back-propagation algorithm (7) to calculate the optimal weight values, which provide the best fit to the learning data set and, eventually, the most accurate classification.
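As a small illustration of the node-value formula and the sigmoid function, the sketch below evaluates one node and then a network with a single hidden layer. The function names, the layer layout, and the weights are hypothetical; like the formula in the text, the sketch has no bias term.

```python
import math

def sigmoid(x):
    """f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """f'(x) = f(x) * (1 - f(x)); this simple form is what keeps
    back-propagation inexpensive."""
    fx = sigmoid(x)
    return fx * (1.0 - fx)

def node_value(inputs, weights):
    """Value of one node: sigmoid of the weighted sum of its inputs."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)))

def forward(inputs, hidden_weights, output_weights):
    """Feed-forward pass through one hidden layer to a single output node.
    hidden_weights holds one weight list per hidden node."""
    hidden = [node_value(inputs, w) for w in hidden_weights]
    return node_value(hidden, output_weights)

# Hypothetical weights: three scaled inputs, two hidden nodes, one output.
output = forward(
    [0.2, 0.7, 0.5],
    hidden_weights=[[0.4, -0.6, 0.1], [-0.3, 0.8, 0.5]],
    output_weights=[1.2, -0.7],
)
```

The output always lies strictly between 0 and 1, which is why a fixed decision point can turn it into a binary diagnosis.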

Process description: We implemented the proposed algorithm for the diagnosis of AMI. All calculations were performed on a PC with a Pentium 3 processor using the authors’ own software agent, MayNet. First, using the automatic random extraction procedure, the initial data set (139 patients) was partitioned into two sets, training and testing. This randomization also determined the size of these sets (96 and 43, respectively), since the program was instructed to assign approximately 70% of the data to the training set.
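One way to obtain a split whose sizes are themselves set by the randomization, as described above, is to assign each record independently with probability 0.7. This is an assumed reading of the extraction procedure, not MayNet's documented behavior; `random_partition` and its parameters are hypothetical names.

```python
import random

def random_partition(records, train_fraction=0.7, seed=None):
    """Assign each record to the training set with probability
    train_fraction, otherwise to the testing set. The set sizes are
    therefore only approximately 70/30."""
    rng = random.Random(seed)
    training, testing = [], []
    for record in records:
        if rng.random() < train_fraction:
            training.append(record)
        else:
            testing.append(record)
    return training, testing

# 139 placeholder record identifiers standing in for the patients.
training_set, testing_set = random_partition(list(range(139)), seed=1)
```

With 139 records this scheme yields a training set of roughly 97 on average, which is in line with the 96 and 43 reported.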

The main process consists of three successive steps: (1) clustering performed on the training data set, (2) training of the neural network on the clusters from the previous step, and (3) evaluation of the classifier’s accuracy on the testing data.

The classifier in this research is the ANN created in step 2, with output in the range [0,1], which provides a binary result (1 – AMI, 0 – not AMI) using a decision point of 0.5.
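The decision rule and the accuracy evaluation of step 3 can be sketched as follows. The text does not say how an output of exactly 0.5 is treated, so mapping it to AMI here is an assumption; the function names and sample outputs are hypothetical.

```python
def classify(ann_output, decision_point=0.5):
    """Binary result: 1 (AMI) at or above the decision point, else 0."""
    return 1 if ann_output >= decision_point else 0

def misrecognized(ann_outputs, diagnoses, decision_point=0.5):
    """Count and percentage of testing patterns whose binary result
    disagrees with the recorded diagnosis."""
    errors = sum(
        1 for out, dx in zip(ann_outputs, diagnoses)
        if classify(out, decision_point) != dx
    )
    return errors, 100.0 * errors / len(diagnoses)

# Hypothetical ANN outputs and diagnoses for four testing patterns.
count, percent = misrecognized([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 0])
```

Here the third pattern is the only disagreement, so `count` is 1 and `percent` is 25.0.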

In this demonstration we used the data of two previous studies (1,2) with three patients, potential outliers, removed (n = 139). The data contain three input variables, CK-MB, LD-1, and LD-1/total LD, and one output variable, diagnosis, coded as 1 (AMI) or 0 (non-AMI).

Results: The application of this software intelligent agent is first demonstrated here using the initial model. Figures 1-2 illustrate the history of the training process. The upper function is the maximum error (among training patterns) and the lower function is the average error. The latter determines the duration of the training process: training terminates when the average error reaches 5%.

The back-propagation algorithm converged slowly when applied to the training set of 96 patients: 6800 iterations were needed to reach a sufficiently small (5%) average error.
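The stopping rule (train until the average error over the training patterns reaches 5%) can be illustrated with a deliberately reduced model. The sketch below trains a single sigmoid unit by gradient descent rather than a multi-layer network with full back-propagation; the bias term, the learning rate, the iteration cap, and the sample patterns are all assumptions made for the illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_until(patterns, targets, rate=0.5,
                target_avg_error=0.05, max_iterations=100000):
    """Gradient-descent training of one sigmoid unit. Stops when the
    average absolute error over the patterns is at most target_avg_error,
    or when max_iterations passes have been made."""
    weights = [0.0] * len(patterns[0])
    bias = 0.0
    avg_error = 1.0
    for iteration in range(1, max_iterations + 1):
        errors = []
        for x, t in zip(patterns, targets):
            y = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            errors.append(abs(t - y))
            # Squared-error gradient passed through the sigmoid.
            delta = (t - y) * y * (1.0 - y)
            weights = [w + rate * delta * v for w, v in zip(weights, x)]
            bias += rate * delta
        avg_error = sum(errors) / len(errors)
        if avg_error <= target_avg_error:
            break
    return weights, bias, iteration, avg_error

# Hypothetical scaled cluster averages (two inputs) and their diagnoses.
patterns = [[0.0, 0.0], [0.1, 0.1], [0.9, 1.0], [1.0, 0.9]]
targets = [0, 0, 1, 1]
weights, bias, iterations, avg_error = train_until(patterns, targets)
```

Each pass costs time proportional to the number of patterns, which is one reason a handful of cluster averages is cheaper to train on than the full set of patients.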

Figure 1 shows the training process in step 2. Convergence is rapid because we deal with only 9 patterns, representing the 9 classes formed in step 1.

Table 1 illustrates the effect of the selected maximum distance on the number of classes formed and on the production of errors. The number of classes increased with decreasing distance, but the accuracy of classification did not decrease.

The rate of learning is inversely related to the number of classes. Using back-propagation to train on the entire data set without prior processing is slower than training on the cluster patterns.

     Figure 2 is a two-dimensional projection of the three-dimensional space of input variables onto CKMB and LD1, with small dots corresponding to the patterns and rectangles to the cluster centroids (black – AMI, white – not AMI).

     We carried out a larger study using troponin I (instead of LD1) and CKMB for the diagnosis of myocardial infarction (MI). The probabilities and odds ratios for TnI scaled into intervals near the entropy decision point are shown in Table 2 (N = 782). The cross-table shows the frequencies for scaled TnI results versus the observed MI, the percent of values within MI, and the predicted probabilities and odds ratios for MI within TnI intervals. Regressing the scaled values places the optimum decision point at or near 0.61 mg/L (the probability of MI at 0.46-0.6 mg/L is 3% with an odds ratio of 13, while the probability of MI at 0.61-0.75 mg/L is 26% with an odds ratio of 174).

     The RBC and MCV criteria were applied to a series of 40 patients different from the series used to derive the cutoffs. A latent class cluster analysis is shown in Table 3. MayNet was run on all three data sets (MI, CHF, and beta thalassemia) for comparison, and the results will be shown.

Discussion: CKMB has long been heavily used to detect heart attacks. It is used in conjunction with a troponin test and the EKG to identify MI, but it is not as sensitive as needed. A joint committee of the American College of Cardiology and European Society of Cardiology (ACC/ESC) has established the criteria for acute, recent, or evolving AMI, predicated on a typical increase in troponin in the clinical setting of myocardial ischemia (1), which includes the 99th percentile of a healthy normal population. The improper selection of a troponin decision value is, however, likely to increase overuse of hospital resources. A study by Zarich (8) showed that using an MI cutoff concentration for TnT from a non-acute coronary syndrome (ACS) reference improves risk stratification, but fails to detect a positive TnT in 11.7% of subjects with ACS. The specificity of the test increased from 88.4% to 96.7% with corresponding negative predictive values of 99.7% and 96.2%. Lin et al. (9) recently reported that the use of low reference cutoffs suggested by the new guidelines results in markedly increased TnI-positive cases overall. When a positive TnI is associated with a negative CKMB, these cases are most likely false positives for MI. MayNet addresses this problem and the following one effectively.

Monitoring BNP levels is a new and highly efficient way of diagnosing CHF as well as excluding non-cardiac causes of shortness of breath. Listening to breath sounds is only accurate when the disease is advanced to the stage in which the pumping function of the heart is impaired. The pumping of the heart is impaired when the circulation pressure increases above the osmotic pressure of the blood proteins that keep fluid in the circulation, causing fluid to pass into the lung’s airspaces.  Our studies combine the BNP with the EKG measurement of QRS duration to predict whether a patient has a high or low ejection fraction, a measure to stage the severity of CHF.

We also had to integrate the information from the hemogram (RBC, MCV) with the hemoglobin A2 quantitation (BioRad Variant II) for the diagnosis of beta thalassemia. We chose an approach to the data that requires no assumption about the distribution of test values or the variances. Our detailed analysis validates an approach to thalassemia screening that has been widely used, the Mentzer index (10), and in addition uses critical decision values for the tests that make up the Mentzer index. We also showed that Hgb S has an effect on both Hgb A2 and Hgb F. This study is adequately powered to assess the usefulness of the Hgb A2 criteria, but not to assess thalassemias with elevated Hgb F.
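For readers unfamiliar with it, the Mentzer index is commonly defined as the MCV (in fL) divided by the RBC count (in millions per microliter), with values below about 13 taken to suggest thalassemia trait. The sketch below shows only that conventional arithmetic; the cutoff of 13 is the commonly cited one and is an assumption here, not a decision value derived in this study, and the function names and sample values are hypothetical.

```python
def mentzer_index(mcv_fl, rbc_millions_per_ul):
    """Mentzer index: MCV (fL) divided by RBC count (10^6 per microliter)."""
    return mcv_fl / rbc_millions_per_ul

def suggests_thalassemia_trait(mcv_fl, rbc_millions_per_ul, cutoff=13.0):
    """True when the index falls below the conventional cutoff (assumed)."""
    return mentzer_index(mcv_fl, rbc_millions_per_ul) < cutoff

# Hypothetical hemogram values for illustration only.
low_index = mentzer_index(60.0, 6.0)
high_index = mentzer_index(80.0, 4.0)
```

A screen of this kind would still be followed by Hgb A2 quantitation, as in the study described above.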


1. Adan J, Bernstein LH, Babb J. Lactate dehydrogenase isoenzyme-1/total ratio: accurate for determining the existence of myocardial infarction. Clin Chem 1986;32:624-8.

2. Rudolph RA, Bernstein LH, Babb J. Information induction for predicting acute myocardial infarction. Clin Chem 1988;34:2031-2038.

3. Magidson J. Maximum Likelihood Assessment of Clinical Trials Based on an Ordered Categorical Response. Drug Information Journal, Maple Glen, PA: Drug Information Association 1996;309[1]: 143-170.

4. Magidson J, Vermunt J. Latent Class Cluster Analysis. In: Hagenaars JA, McCutcheon AL, eds. Applied Latent Class Analysis. Cambridge: Cambridge University Press, 2002, pp. 89-106.

5. Mkhitarian VS, Mayzlin IE, Troshin LI, Borisenko LV. Classification of the base objects upon integral parameters of the attached network. Applied Mathematics and Computers. Moscow, USSR: Statistika, 1976: 118-24.

6. Mayzlin IE, Mkhitarian VS. Determining the optimal bounds for objects of different classes. In: Dubrow AM, ed. Computational Mathematics and Applications. Moscow, USSR: Economics and Statistics Institute. 1976: 102-105.

7. Rumelhart DE, Hinton GE, Williams RJ. Learning internal representations by error propagation. In: Rumelhart DE, McClelland JL, eds. Parallel distributed processing. Cambridge, Mass: MIT Press, 1986; 1: 318-62.

8. Zarich SW, Bradley K, Mayall ID, Bernstein LH. Minor Elevations in Troponin T Values Enhance Risk Assessment in Emergency Department Patients with Suspected Myocardial Ischemia: Analysis of Novel Troponin T Cut-off Values. Clin Chim Acta 2004 (in press).

9. Lin JC, Apple FS, Murakami MM, Luepker RV. Rates of positive cardiac troponin I and creatine kinase MB mass among patients hospitalized for suspected acute coronary syndromes. Clin Chem 2004;50:333-338.

10. Makris PE. Utilization of a new index to distinguish heterozygous thalassemic syndromes: comparison of its specificity to five other discriminants. Blood Cells. 1989;15(3):497-506.

Acknowledgements:   Jerard Kneifati-Hayek and Madeleine Schlefer, Midwood High School, Brooklyn, and Salman Haq, Cardiology Fellow, Methodist Hospital.

Table 1. Effect of selection of maximum distance on the number of classes formed and on the accuracy of recognition by ANN

Columns: Clustering Distance Factor F (D = F * R); Number of Classes; Number of Nodes in the Hidden Layers; Number of Misrecognized Patterns in the Testing Set of 43; Percent Misrecognized

2414135  1, 0  2, 0  3, 0  1, 0  2, 0  3, 0  3, 2  3, 2

Figure 1.

Figure 2.

Table 2. Frequency cross-table, probabilities and odds-ratios for scaled TnI versus expected diagnosis

Range       Not MI    MI     N   Pct in MI   Prob by TnI   Odds Ratio
< 0.45         655     2   657         2          0              1
0.46-0.6         7     0     7         0          0.03          13
0.61-0.75        4     0     4         0          0.26         175
0.76-0.9        13    59    72        57.3        0.82        2307
> 0.9            0    42    42        40.8        0.98       30482
Total          679   103   782       100



A Software Agent for Diagnosis of ACUTE MI

Authors: Isaac E. Mayzlin, Ph.D. (1), David Mayzlin (1), Larry H. Bernstein, M.D. (2)

(1) MayNet, Carlsbad, CA; (2) Department of Pathology and Laboratory Medicine, Bridgeport Hospital, Bridgeport, CT.

Agent-based decision support systems are designed to provide medical staff with the information needed for making critical decisions. We describe a Software Agent for evaluating multiple tests based on a large database. It is especially efficient when the time available for making the decision is critical for successful treatment of serious conditions, such as stroke or acute myocardial infarction (AMI).

Goldman and others (1) developed a screening algorithm based on characteristics of the chest pain, EKG changes, and key clinical findings to separate high-risk from low-risk patients at the time they present, using clinical features without a serum marker. The Goldman algorithm was not widely used because of a 7 percent misclassification error, mostly false positives. Moreover, a third of emergency room visits by patients presenting with symptoms for which AMI must be ruled out are not associated with chest pain. A related issue is the finding that a significant number of patients who are at high risk have to be identified using a cardiac marker. Cardiac isoenzymes have been used to classify patients meeting the high-risk criteria, many of whom are not subsequently found to have AMI.

Software Agent for Diagnosis based on the Knowledge incorporated in the Trained Artificial Neural Network and Data Clustering

This Software Agent is based on the combination of clustering by Euclidean distances in multi-dimensional space and non-linear discrimination fulfilled by the Artificial Neural Network (ANN) trained on clusters’ averages. Our studies indicate that at an optimum clustering distance the number of classes is minimized with efficient training on the ANN, retaining accuracy of classification by the ANN at 97%. The studies conducted involve training and testing on separate clinical data sets. We perform clustering using the geometrical (Euclidean) distance between two points in n-dimensional space, formed by n variables, including both input and output variables. Since this distance assumes compatibility of different variables, the values of all input variables are linearly transformed (scaled) to the range from 0 to 1.

For readers accustomed to classical statistics, the ANN technique can be viewed as an extension of multivariate regression analysis with such new features as non-linearity and the ability to process categorical data. Categorical (not continuous) variables represent two or more levels, groups, or classes of the corresponding features; in our case this concept is used to signify patient condition, for example the presence or absence of AMI.

Process description. We implemented the proposed algorithm for the diagnosis of AMI. All calculations were performed with the authors’ own Software Agent, MayNet. First, using the automatic random extraction procedure, the initial data set (139 patients) was partitioned into two sets, training and testing. This randomization also determined the size of these sets (96 and 43, respectively), since the program was instructed to assign approximately 70% of the data to the training set.

The main process consists of three successive steps:

(1) clustering performed on the training data set,

(2) training of the neural network on the clusters from the previous step, and

(3) evaluation of the classifier’s accuracy on the testing data.

The classifier in this research is the ANN created in step 2, with output in the range [0,1], which provides a binary result (1 – AMI, 0 – not AMI) using a decision point of 0.5.

In this paper we used the data of two previous studies (2,3) with three patients, potential outliers, removed (n = 139). The data contain three input variables, CK-MB, LD-1, and LD-1/total LD, and one output variable, diagnosis, coded as 1 (AMI) or 0 (non-AMI).

Table 1. Effect of selection of maximum distance on the number of classes formed and on the accuracy of recognition by ANN

Columns: Clustering Distance Factor F (D = F * R); Number of Classes; Number of Nodes in the Hidden Layers; Number of Misrecognized Patterns in the Testing Set of 43; Percent Misrecognized

1, 0  2, 0  3, 0  1, 0  2, 0  3, 0  3, 2  3, 2

Abbreviations: creatine kinase MB isoenzyme: CK-MB; lactate dehydrogenase isoenzyme-1: LD1; LD1/total LD ratio: %LD1; acute myocardial infarction: AMI; artificial neural network: ANN
