Posts Tagged ‘learning algorithms’

Artificial Intelligence Versus the Scientist: Who Will Win?

Will DARPA Replace the Human Scientist: Not So Fast, My Friend!

Writer, Curator: Stephen J. Williams, Ph.D.


Last month’s issue of Science article by Jia You “DARPA Sets Out to Automate Research”[1] gave a glimpse of how science could be conducted in the future: without scientists. The article focused on the U.S. Defense Advanced Research Projects Agency (DARPA) program called ‘Big Mechanism”, a $45 million effort to develop computer algorithms which read scientific journal papers with ultimate goal of extracting enough information to design hypotheses and the next set of experiments,

all without human input.

The head of the project, artificial intelligence expert Paul Cohen, says the overall goal is to help scientists cope with the complexity with massive amounts of information. As Paul Cohen stated for the article:


Just when we need to understand highly connected systems as systems,

our research methods force us to focus on little parts.


The Big Mechanisms project aims to design computer algorithms to critically read journal articles, much as scientists will, to determine what and how the information contributes to the knowledge base.

As a proof of concept DARPA is attempting to model Ras-mutation driven cancers using previously published literature in three main steps:

  1. Natural Language Processing: Machines read literature on cancer pathways and convert information to computational semantics and meaning

One team is focused on extracting details on experimental procedures, using the mining of certain phraseology to determine the paper’s worth (for example using phrases like ‘we suggest’ or ‘suggests a role in’ might be considered weak versus ‘we prove’ or ‘provide evidence’ might be identified by the program as worthwhile articles to curate). Another team led by a computational linguistics expert will design systems to map the meanings of sentences.

  1. Integrate each piece of knowledge into a computational model to represent the Ras pathway on oncogenesis.
  2. Produce hypotheses and propose experiments based on knowledge base which can be experimentally verified in the laboratory.

The Human no Longer Needed?: Not So Fast, my Friend!

The problems the DARPA research teams are encountering namely:

  • Need for data verification
  • Text mining and curation strategies
  • Incomplete knowledge base (past, current and future)
  • Molecular biology not necessarily “requires casual inference” as other fields do


Notice this verification step (step 3) requires physical lab work as does all other ‘omics strategies and other computational biology projects. As with high-throughput microarray screens, a verification is needed usually in the form of conducting qPCR or interesting genes are validated in a phenotypical (expression) system. In addition, there has been an ongoing issue surrounding the validity and reproducibility of some research studies and data.

See Importance of Funding Replication Studies: NIH on Credibility of Basic Biomedical Studies

Therefore as DARPA attempts to recreate the Ras pathway from published literature and suggest new pathways/interactions, it will be necessary to experimentally validate certain points (protein interactions or modification events, signaling events) in order to validate their computer model.

Text-Mining and Curation Strategies

The Big Mechanism Project is starting very small; this reflects some of the challenges in scale of this project. Researchers were only given six paragraph long passages and a rudimentary model of the Ras pathway in cancer and then asked to automate a text mining strategy to extract as much useful information. Unfortunately this strategy could be fraught with issues frequently occurred in the biocuration community namely:

Manual or automated curation of scientific literature?

Biocurators, the scientists who painstakingly sort through the voluminous scientific journal to extract and then organize relevant data into accessible databases, have debated whether manual, automated, or a combination of both curation methods [2] achieves the highest accuracy for extracting the information needed to enter in a database. Abigail Cabunoc, a lead developer for Ontario Institute for Cancer Research’s WormBase (a database of nematode genetics and biology) and Lead Developer at Mozilla Science Lab, noted, on her blog, on the lively debate on biocuration methodology at the Seventh International Biocuration Conference (#ISB2014) that the massive amounts of information will require a Herculaneum effort regardless of the methodology.

Although I will have a future post on the advantages/disadvantages and tools/methodologies of manual vs. automated curation, there is a great article on researchinformation.infoExtracting More Information from Scientific Literature” and also see “The Methodology of Curation for Scientific Research Findings” and “Power of Analogy: Curation in Music, Music Critique as a Curation and Curation of Medical Research Findings – A Comparison” for manual curation methodologies and A MOD(ern) perspective on literature curation for a nice workflow paper on the International Society for Biocuration site.

The Big Mechanism team decided on a full automated approach to text-mine their limited literature set for relevant information however was able to extract only 40% of relevant information from these six paragraphs to the given model. Although the investigators were happy with this percentage most biocurators, whether using a manual or automated method to extract information, would consider 40% a low success rate. Biocurators, regardless of method, have reported ability to extract 70-90% of relevant information from the whole literature (for example for Comparative Toxicogenomics Database)[3-5].

Incomplete Knowledge Base

In an earlier posting (actually was a press release for our first e-book) I had discussed the problem with the “data deluge” we are experiencing in scientific literature as well as the plethora of ‘omics experimental data which needs to be curated.

Tackling the problem of scientific and medical information overload


Figure. The number of papers listed in PubMed (disregarding reviews) during ten year periods have steadily increased from 1970.

Analyzing and sharing the vast amounts of scientific knowledge has never been so crucial to innovation in the medical field. The publication rate has steadily increased from the 70’s, with a 50% increase in the number of original research articles published from the 1990’s to the previous decade. This massive amount of biomedical and scientific information has presented the unique problem of an information overload, and the critical need for methodology and expertise to organize, curate, and disseminate this diverse information for scientists and clinicians. Dr. Larry Bernstein, President of Triplex Consulting and previously chief of pathology at New York’s Methodist Hospital, concurs that “the academic pressures to publish, and the breakdown of knowledge into “silos”, has contributed to this knowledge explosion and although the literature is now online and edited, much of this information is out of reach to the very brightest clinicians.”

Traditionally, organization of biomedical information has been the realm of the literature review, but most reviews are performed years after discoveries are made and, given the rapid pace of new discoveries, this is appearing to be an outdated model. In addition, most medical searches are dependent on keywords, hence adding more complexity to the investigator in finding the material they require. Third, medical researchers and professionals are recognizing the need to converse with each other, in real-time, on the impact new discoveries may have on their research and clinical practice.

These issues require a people-based strategy, having expertise in a diverse and cross-integrative number of medical topics to provide the in-depth understanding of the current research and challenges in each field as well as providing a more conceptual-based search platform. To address this need, human intermediaries, known as scientific curators, are needed to narrow down the information and provide critical context and analysis of medical and scientific information in an interactive manner powered by web 2.0 with curators referred to as the “researcher 2.0”. This curation offers better organization and visibility to the critical information useful for the next innovations in academic, clinical, and industrial research by providing these hybrid networks.

Yaneer Bar-Yam of the New England Complex Systems Institute was not confident that using details from past knowledge could produce adequate roadmaps for future experimentation and noted for the article, “ “The expectation that the accumulation of details will tell us what we want to know is not well justified.”

In a recent post I had curated findings from four lung cancer omics studies and presented some graphic on bioinformatic analysis of the novel genetic mutations resulting from these studies (see link below)

Multiple Lung Cancer Genomic Projects Suggest New Targets, Research Directions for

Non-Small Cell Lung Cancer

which showed, that while multiple genetic mutations and related pathway ontologies were well documented in the lung cancer literature there existed many significant genetic mutations and pathways identified in the genomic studies but little literature attributed to these lung cancer-relevant mutations.


  This ‘literomics’ analysis reveals a large gap between our knowledge base and the data resulting from large translational ‘omic’ studies.

Different Literature Analyses Approach Yeilding

A ‘literomics’ approach focuses on what we don NOT know about genes, proteins, and their associated pathways while a text-mining machine learning algorithm focuses on building a knowledge base to determine the next line of research or what needs to be measured. Using each approach can give us different perspectives on ‘omics data.

Deriving Casual Inference

Ras is one of the best studied and characterized oncogenes and the mechanisms behind Ras-driven oncogenenis is highly understood.   This, according to computational biologist Larry Hunt of Smart Information Flow Technologies makes Ras a great starting point for the Big Mechanism project. As he states,” Molecular biology is a good place to try (developing a machine learning algorithm) because it’s an area in which common sense plays a minor role”.

Even though some may think the project wouldn’t be able to tackle on other mechanisms which involve epigenetic factors UCLA’s expert in causality Judea Pearl, Ph.D. (head of UCLA Cognitive Systems Lab) feels it is possible for machine learning to bridge this gap. As summarized from his lecture at Microsoft:

“The development of graphical models and the logic of counterfactuals have had a marked effect on the way scientists treat problems involving cause-effect relationships. Practical problems requiring causal information, which long were regarded as either metaphysical or unmanageable can now be solved using elementary mathematics. Moreover, problems that were thought to be purely statistical, are beginning to benefit from analyzing their causal roots.”

According to him first

1) articulate assumptions

2) define research question in counter-inference terms

Then it is possible to design an inference system using calculus that tells the investigator what they need to measure.

To watch a video of Dr. Judea Pearl’s April 2013 lecture at Microsoft Research Machine Learning Summit 2013 (“The Mathematics of Causal Inference: with Reflections on Machine Learning”), click here.

The key for the Big Mechansism Project may me be in correcting for the variables among studies, in essence building a models system which may not rely on fully controlled conditions. Dr. Peter Spirtes from Carnegie Mellon University in Pittsburgh, PA is developing a project called the TETRAD project with two goals: 1) to specify and prove under what conditions it is possible to reliably infer causal relationships from background knowledge and statistical data not obtained under fully controlled conditions 2) develop, analyze, implement, test and apply practical, provably correct computer programs for inferring causal structure under conditions where this is possible.

In summary such projects and algorithms will provide investigators the what, and possibly the how should be measured.

So for now it seems we are still needed.


  1. You J: Artificial intelligence. DARPA sets out to automate research. Science 2015, 347(6221):465.
  2. Biocuration 2014: Battle of the New Curation Methods []
  3. Davis AP, Johnson RJ, Lennon-Hopkins K, Sciaky D, Rosenstein MC, Wiegers TC, Mattingly CJ: Targeted journal curation as a method to improve data currency at the Comparative Toxicogenomics Database. Database : the journal of biological databases and curation 2012, 2012:bas051.
  4. Wu CH, Arighi CN, Cohen KB, Hirschman L, Krallinger M, Lu Z, Mattingly C, Valencia A, Wiegers TC, John Wilbur W: BioCreative-2012 virtual issue. Database : the journal of biological databases and curation 2012, 2012:bas049.
  5. Wiegers TC, Davis AP, Mattingly CJ: Collaborative biocuration–text-mining development task for document prioritization for curation. Database : the journal of biological databases and curation 2012, 2012:bas037.

Other posts on this site on include: Artificial Intelligence, Curation Methodology, Philosophy of Science

Inevitability of Curation: Scientific Publishing moves to embrace Open Data, Libraries and Researchers are trying to keep up

A Brief Curation of Proteomics, Metabolomics, and Metabolism

The Methodology of Curation for Scientific Research Findings

Scientific Curation Fostering Expert Networks and Open Innovation: Lessons from Clive Thompson and others

The growing importance of content curation

Data Curation is for Big Data what Data Integration is for Small Data

Cardiovascular Original Research: Cases in Methodology Design for Content Co-Curation The Art of Scientific & Medical Curation

Exploring the Impact of Content Curation on Business Goals in 2013

Power of Analogy: Curation in Music, Music Critique as a Curation and Curation of Medical Research Findings – A Comparison

conceived: NEW Definition for Co-Curation in Medical Research

Reconstructed Science Communication for Open Access Online Scientific Curation

Search Results for ‘artificial intelligence’

 The Simple Pictures Artificial Intelligence Still Can’t Recognize

Data Scientist on a Quest to Turn Computers Into Doctors

Vinod Khosla: “20% doctor included”: speculations & musings of a technology optimist or “Technology will replace 80% of what doctors do”

Where has reason gone?


Read Full Post »

typical changes in CK-MB and cardiac troponin ...

typical changes in CK-MB and cardiac troponin in Acute Myocardial Infarction (Photo credit: Wikipedia)

Reporter and curator:

Larry H Bernstein, MD, FCAP

This posting is a followup on two previous posts covering the design and handling of HIT to improve healthcare outcomes as well as lower costs from better workflow and diagnostics, which is self-correcting over time.

The first example is a non technology method designed by Lee Goldman (Goldman Algorithm) that was later implemented at Cook County Hospital in Chicago with great success.     It has been known that there is over triage of patients to intensive care beds, adding to the costs of medical care.  If the differentiation between acute myocardial infarction and other causes of chest pain could be made more accurate, the quantity of scare resources used on unnecessary admissions could be reduced.  The Goldman algorithm was introduced in 1982 during a training phase at Yale-New Haven Hospital based on 482 patients, and later validated at the BWH (in Boston) on 468 patients.They demonstrated improvement in sensitivity as well as specificity (67% to 77%), and positive predictive value (34% to 42%).  They modified the computer derived algorithm in 1988 to achieve better results in triage of patients to the ICU of patients with chest pain based on a study group of 1379 patients.  The process was tested prospectively on 4770 patients at two university and 4 community hospitals.  The specificity improved by 74% vs 71% in recognizing absence of AMI by the algorithm vs physician judgement. The sensitivity was not different for admission (88%).  Decisions based solely on the protocol would have decreased admissions of patients without AMI by 11.5% without adverse effects.  The study was repeated by Qamar et al. with equal success.

Pain in acute myocardial infarction (front)

Pain in acute myocardial infarction (front) (Photo credit: Wikipedia)

An ECG showing pardee waves indicating acute m...

An ECG showing pardee waves indicating acute myocardial infarction in the inferior leads II, III and aVF with reciprocal changes in the anterolateral leads. (Photo credit: Wikipedia)

Acute myocardial infarction with coagulative n...

Acute myocardial infarction with coagulative necrosis (4) (Photo credit: Wikipedia)

Goldman L, Cook EF, Brand DA, Lee TH, Rouan GW, Weisberg MC, et al. A computer protocol to predict myocardial infarction in emergency department patients with chest pain. N Engl J Med. 1988;318:797-803.

A Qamar, C McPherson, J Babb, L Bernstein, M Werdmann, D Yasick, S Zarich. The Goldman algorithm revisited: prospective evaluation of a computer-derived algorithm versus unaided physician judgment in suspected acute myocardial infarction. Am Heart J 1999; 138(4 Pt 1):705-709. ICID: 825629

The usual accepted method for determining the decision value of a predictive variable is the Receiver Operator Characteristic Curve, which requires a mapping of each value of the variable against the percent with disease on the Y-axis.   This requires a review of every case entered into the study.  The ROC curve is done to validate a study to classify data on leukemia markers for research purposes as shown by Jay Magidson in his demonstation of  Correlated Component Regression (2012)(Statistical Innovations, Inc.)  The test for the contribution of each predictor is measured by Akaike Information Criteria and Bayes Information Criteria, which have proved to be critically essential tests over the last 20 years.

I go back 20 years and revisit the application of these principles in clinical diagnostics, but the ROC was introduced to medicine in radiology earlier.   A full rendering of this matter can be found in the following:
R A Rudolph, L H Bernstein, J Babb. Information induction for predicting acute myocardial infarction.Clin Chem 1988; 34(10):2031-2038. ICID: 825568.

Rypka EW. Methods to evaluate and develop the decision process in the selection of tests. Clin Lab Med 1992; 12:355

Rypka EW. Syndromic Classification: A process for amplifying information using S-Clustering. Nutrition 1996;12(11/12):827-9.

Christianson R. Foundations of inductive reasoning. 1964.  Entropy Publications. Lincoln, MA.

Inability to classify information is a major problem in deriving and validating hypotheses from PRIMARY data sets necessary to establish a measure of outcome effectiveness.  When using quantitative data, decision limits have to be determined that best distinguish the populations investigated.   We are concerned with accurate assignment into uniquely verifiable groups by information in test relationships.  Uncertainty in assigning to a supervisory classification can only be relieved by providing suffiuciuent data.

A method for examining the endogenous information in the data is used to determine decision points.  The reference or null set is defined as a class having no information.  When information is present in the data, the entropy (uncertainty in the data set) is reduced by the amount of information provided.  This is measureable and may be referred to as the Kullback-Liebler distance, which was extended by Akaike to include statistical theory.   An approach is devised using EW Rypka’s S-Clustering has been created to find optimal decision values that separate the groups being classified.  Further, it is possible to obtain PRIMARY data on-line and continually creating primary classifications (learning matrices).  From the primary classifications test-minimized sets of features are determined with optimal useful and sufficient information for accurately distinguishing elements (patients).  Primary classifications can be continually created from PRIMARY data.  More recent and complex work in classifying hematology data using a 30,000 patient data set and 16 variables to identify the anemias, moderate SIRS, sepsis, lymphocytic and platelet disorders has been  published and recently presented.  Another classification for malnutrition and stress hypermetabolism is now validated and in press in the journal Nutrition (2012), Elsevier.
G David, LH Bernstein, RR Coifman. Generating Evidence Based Interpretation of Hematology Screens via Anomaly Characterization. Open Clinical Chemistry Journal 2011; 4 (1):10-16. 1874-2416/11 2011 Bentham Open.  ICID: 939928

G David; LH Bernstein; RR Coifman. The Automated Malnutrition Assessment. Accepted 29 April 2012. Nutrition (2012), doi:10.1016/j.nut.2012.04.017.

Keywords: Network Algorithm; unsupervised classification; malnutrition screening; protein energy malnutrition (PEM); malnutrition risk; characteristic metric; characteristic profile; data characterization; non-linear differential diagnosis

Summary: We propose an automated nutritional assessment (ANA) algorithm that provides a method for malnutrition risk prediction with high accuracy and reliability. The problem of rapidly identifying risk and severity of malnutrition is crucial for minimizing medical and surgical complications. We characterized for each patient a unique profile and mapped similar patients into a classification. We also found that the laboratory parameters were sufficient for the automated risk prediction.
We here propose a simple, workable algorithm that provides assistance for interpreting any set of data from the screen of a blood analysis with high accuracy, reliability, and inter-operability with an electronic medical record. This has been made possible at least recently as a result of advances in mathematics, low computational costs, and rapid transmission of the necessary data for computation.  In this example, acute myocardial infarction (AMI) is classified using isoenzyme CKMB activity, total LD, and isoenzyme LD-1, and repeated studies have shown the high power of laboratory features for diagnosis of AMI, especially with NSTEMI.  A later study includes the scale values for chest pain and for ECG changes to create the model.

LH Bernstein, A Qamar, C McPherson, S Zarich.  Evaluating a new graphical ordinal logit method (GOLDminer) in the diagnosis of myocardial infarction utilizing clinical features and laboratory data. Yale J Biol Med 1999; 72(4):259-268. ICID: 825617

The quantitative measure of information, Shannon entropy treats data as a message transmission.  We are interested in classifying data with near errorless discrimination.  The method assigns upper limits of normal to tests computed from Rudolph’s maximum entropy definitions of group-based normal reference.  Using the Bernoulli trial to determine maximum entropy reference, we determine from the entropy in the data a probability of a positive result that is the same for each test and conditionally independent of other results by setting the binary decision level for each test.  The entropy of the discrete distribution is calculated from the probabilities of the distribution. The probability distribution of the binary patterns is not flat and the entropy decreases when there is information in the data.  The decrease in entropy is the Kullback-Liebler distance.

The basic principle of separatory clustering is extracting features from endogenous data that amplify or maximize structural information into disjointed or separable classes.  This differs from other methods because it finds in a database a theoretic – or more – number of variables with required VARIETY that map closest to an ideal, theoretic, or structural information standard. Scaling allows using variables with different numbers of message choices (number bases) in the same matrix, binary, ternary, etc (representing yes-no; small-modest, large, largest).   The ideal number of class is defined by x^n.   In viewing a variable value we think of it as low, normal, high, high high, etc.  A system works with related parts in harmony.  This frame of reference improves the applicability of S-clustering.  By definition, a unit of information is log.r r = 1.

The method of creating a syndromic classification to control variety in the system also performs a semantic function by attributing a term to a Port Royal Class.  If any of the attributes are removed, the meaning of the class is made meaningless.  Any significant overlap between the groups would be improved by adding requisite variety.  S-clustering is an objective and most desirable way to find the shortest route to diagnosis, and is an objective way of determining practice parameters.

Multiple Test Binary Decision Patterns where CK-MB = 18 u/l, LD-1 = 36 u/l, %LD1 = 32 u/l.

No.               Pattern       Freq                   P1                       Self information                Weighted information

0                    000             26                   0.1831                    2.4493                                     0.4485
1                    001                3                    0.0211                   5.5648                                     0.1176
2                    010               4                    0.0282                   5.1497                                     0.1451
3                    011                2                    0.0282                   6.1497                                     0.0866
4                    100               6                    0.0423                   4.5648                                     0.1929
6                    110                8                    0.0563                  4.1497                                     0.2338
7                    111               93                   0.6549                  0.6106                                     0.3999

Entropy: sum of weighted information (average)           1.6243 bits

The effective information values are the least-error points. Non AMI patients exhibit patterns 0, 1, 2, 3, and 4: AMI patients are 6 and 7.  There is 1 fp 4, and 1 fn 6.  The error rate is 1.4%.


A major problem in using quantitative data is lack of a justifiable definition of reference (normal).  Our information model consists of a population group, a set of attributes derived from observations, and basic definitions using Shannon’s information measure entropy. In this model, the population set and its values for its variables are considered to be the only information available.  The finding of a flat distribution with the Bernoulli test defines the reference population that has no information.  The complementary syndromic group, treated in the same way, produces a distribution that is not flat and has a less than maximum information uncertainty.

The vector of probabilities – (1/2), (1/2), …(1/2), can be related to the path calculated from the Rypka-Fletcher equation, which

Ct = 1 – 2^-k/1 -2^-n

determines the theoretical maximum comprehension from the test of n attributes.  We constructed a ROC curve from theoriginal IRIS  data of R Fisher from four measurements of leaf and petal with a result obtained using information-based induction principles to determine discriminant points without the classification that had to be used for the discriminant analysis.   The principle of maximum entropy, as formu;ated by Jaynes and Tribus proposes that for problems of statistical inference – which as defined, are problems of induction – the probabilities should be assigned so that the entropy function is maximized.  Good proposed that maximum entropy be used to define the null hypothesis and Rudolph proposed that medical reference be defined as at maximum entropy.

Rudolph RA. A general purpose information processing automation: generating Port Royal Classes with probabilistic information. Intl Proc Soc Gen systems Res 1985;2:624-30.

Jaynes ET. Information theory and statistical mechanics. Phys Rev 1956;106:620-30.

Tribus M. Where do we stand after 30 years of maximum entropy? In: Levine RD, Tribus M, eds. The maximum entropy formalism. Cambridge, Ma: MIT Press, 1978.

Good IJ. Maximum entropy for hypothesis formulation, especially for multidimensional contingency tables. Ann Math Stat 1963;34:911-34.

The most important reason for using as many tests as is practicable is derived from the prominent role of redundancy in transmitting information (Noisy Channel Theorem).  The proof of this theorem does not tell how to accomplish nearly errorless discrimination, but redundancy is essential.

In conclusion, we have been using the effective information (derived from Kullback-Liebler distance) provided by more than one test to determine normal reference and locate decision values.  Syndromes and patterns that are extracted are empirically verifiable.

Related articles

Read Full Post »