
Unlocking the Microbiome

Larry H. Bernstein, MD, FCAP, Curator



Machine-learning technique uncovers unknown features of multi-drug-resistant pathogen

Relatively simple “unsupervised” learning system reveals important new information to microbiologists
January 29, 201

According to the CDC, Pseudomonas aeruginosa is a common cause of healthcare-associated infections, including pneumonia, bloodstream infections, urinary tract infections, and surgical site infections. Some strains of P. aeruginosa have been found to be resistant to nearly all or all antibiotics. (illustration credit: CDC)

A new machine-learning technique can uncover previously unknown features of organisms and their genes in large datasets, according to researchers from the Perelman School of Medicine at the University of Pennsylvania and the Geisel School of Medicine at Dartmouth.

For example, the technique learned to identify the characteristic gene-expression patterns that appear when a bacterium is exposed to different conditions, such as low oxygen and the presence of antibiotics.

The technique, called “ADAGE” (Analysis using Denoising Autoencoders of Gene Expression), uses a “denoising autoencoder” algorithm, which learns to identify recurring features or patterns in large datasets without being told what specific features to look for (that is, “unsupervised”).*
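The mechanics of a denoising autoencoder can be sketched in a few lines. The toy below is not the published ADAGE code; it is a minimal numpy version on random stand-in data (the sizes and learning rate are invented), using tied weights and masking noise: corrupt the input, train the network to reconstruct the clean version, and read each learned node's per-gene weights out of the weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an expression compendium: 200 samples x 50 genes,
# scaled to [0, 1] (the real compendium had 950 arrays x ~5,000 genes).
X = rng.random((200, 50))

n_genes, n_hidden = X.shape[1], 10   # ADAGE itself used 50 hidden nodes
W = rng.normal(0.0, 0.1, (n_genes, n_hidden))
b_hid = np.zeros(n_hidden)
b_vis = np.zeros(n_genes)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for epoch in range(300):
    # Denoising step: randomly zero out ~10% of the input values.
    noisy = X * (rng.random(X.shape) > 0.1)

    hidden = sigmoid(noisy @ W + b_hid)     # encode
    recon = sigmoid(hidden @ W.T + b_vis)   # decode (tied weights)

    # Gradient of squared reconstruction error against the CLEAN input.
    d_out = (recon - X) * recon * (1 - recon)
    d_hid = (d_out @ W) * hidden * (1 - hidden)
    W -= lr * (d_out.T @ hidden + noisy.T @ d_hid) / len(X)
    b_vis -= lr * d_out.mean(axis=0)
    b_hid -= lr * d_hid.mean(axis=0)

# Each column of W is one learned node: its per-gene weights are the
# kind of "signature" that ADAGE inspects for co-regulated gene sets.
print(W.shape)
```

The key point is in the loss: the network sees a corrupted input but is scored on reconstructing the original, which forces it to learn recurring structure rather than memorize noise.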

Last year, Casey Greene, PhD, an assistant professor of Systems Pharmacology and Translational Therapeutics at Penn, and his team published, in an open-access paper in the American Society for Microbiology’s mSystems, the first demonstration of ADAGE in a biological context: an analysis of two gene-expression datasets of breast cancer.

Tracking down gene patterns of a multi-drug-resistant bacterium

The new study, published Jan. 19 in an open-access paper in mSystems, was more ambitious. It applied ADAGE to a dataset of 950 gene-expression arrays publicly available at the time for the multi-drug-resistant bacterium Pseudomonas aeruginosa. This bacterium is a notorious pathogen in the hospital and in individuals with cystic fibrosis and other chronic lung conditions; it is often difficult to treat due to its high resistance to standard antibiotic therapies.

The data included only the identities of the roughly 5,000 P. aeruginosa genes and their measured expression levels in each published experiment. The goal was to see if this “unsupervised” learning system could uncover important patterns in P. aeruginosa gene expression and clarify how those patterns change when the bacterium’s environment changes, for example in the presence of an antibiotic.

Even though the model built with ADAGE was relatively simple — roughly equivalent to a brain with only a few dozen neurons — it had no trouble learning which sets of P. aeruginosa genes tend to work together or in opposition. To the researchers’ surprise, the ADAGE system also detected differences between the main laboratory strain of P. aeruginosa and strains isolated from infected patients. “That turned out to be one of the strongest features of the data,” Greene said.

“We expect that this approach will be particularly useful to microbiologists researching bacterial species that lack a decades-long history of study in the lab,” said Greene. “Microbiologists can use these models to identify where the data agree with their own knowledge and where the data seem to be pointing in a different direction … and to find completely new things in biology that we didn’t even know to look for.”

Support for the research came from the Gordon and Betty Moore Foundation, the William H. Neukom Institute for Computational Science, the National Institutes of Health, and the Cystic Fibrosis Foundation.

* In 2012, Google-sponsored researchers applied a similar method to randomly selected YouTube images; their system learned to recognize major recurring features of those images — including cats of course.

Abstract of ADAGE-Based Integration of Publicly Available Pseudomonas aeruginosa Gene Expression Data with Denoising Autoencoders Illuminates Microbe-Host Interactions

The increasing number of genome-wide assays of gene expression available from public databases presents opportunities for computational methods that facilitate hypothesis generation and biological interpretation of these data. We present an unsupervised machine learning approach, ADAGE (analysis using denoising autoencoders of gene expression), and apply it to the publicly available gene expression data compendium for Pseudomonas aeruginosa. In this approach, the machine-learned ADAGE model contained 50 nodes which we predicted would correspond to gene expression patterns across the gene expression compendium. While no biological knowledge was used during model construction, cooperonic genes had similar weights across nodes, and genes with similar weights across nodes were significantly more likely to share KEGG pathways. By analyzing newly generated and previously published microarray and transcriptome sequencing data, the ADAGE model identified differences between strains, modeled the cellular response to low oxygen, and predicted the involvement of biological processes based on low-level gene expression differences. ADAGE compared favorably with traditional principal component analysis and independent component analysis approaches in its ability to extract validated patterns, and based on our analyses, we propose that these approaches differ in the types of patterns they preferentially identify. We provide the ADAGE model with analysis of all publicly available P. aeruginosa GeneChip experiments and open source code for use with other species and settings. Extraction of consistent patterns across large-scale collections of genomic data using methods like ADAGE provides the opportunity to identify general principles and biologically important patterns in microbial biology. This approach will be particularly useful in less-well-studied microbial species.

Abstract of Unsupervised feature construction and knowledge extraction from genome-wide assays of breast cancer with denoising autoencoders

Big data bring new opportunities for methods that efficiently summarize and automatically extract knowledge from such compendia. While both supervised learning algorithms and unsupervised clustering algorithms have been successfully applied to biological data, they are either dependent on known biology or limited to discerning the most significant signals in the data. Here we present denoising autoencoders (DAs), which employ a data-defined learning objective independent of known biology, as a method to identify and extract complex patterns from genomic data. We evaluate the performance of DAs by applying them to a large collection of breast cancer gene expression data. Results show that DAs successfully construct features that contain both clinical and molecular information. There are features that represent tumor or normal samples, estrogen receptor (ER) status, and molecular subtypes. Features constructed by the autoencoder generalize to an independent dataset collected using a distinct experimental platform. By integrating data from ENCODE for feature interpretation, we discover a feature representing ER status through association with key transcription factors in breast cancer. We also identify a feature highly predictive of patient survival and it is enriched by FOXM1 signaling pathway. The features constructed by DAs are often bimodally distributed with one peak near zero and another near one, which facilitates discretization. In summary, we demonstrate that DAs effectively extract key biological principles from gene expression data and summarize them into constructed features with convenient properties.



electronic health record – choice of cause of consultation (German) (Photo credit: Wikipedia)

Reporter: Larry H Bernstein, MD, FACP

The Electronic Health Record: How far we have travelled, and where is journey’s end?

A focus of the Affordable Care Act is improved delivery of quality, efficiency, and effectiveness to the patients who receive healthcare in the US from providers in a coordinated system.  The largest confounders in all of this are silos that are not readily crossed, handovers, communication lapses, and a heavy paperwork burden.  We can add to that a large for-profit insurance overhead that is disinterested in the patient-physician encounter.  Finally, the knowledge base of medicine has grown sufficiently that physicians are challenged by the amount of data and its presentation in the medical record.

I present a review of the problems that have become more urgent to fix in the last decade.  The administration and paperwork necessitated by health insurers, HMOs, and other parties today may account for 40% of a physician’s practice, and the formation of large physician practice groups and alliances of hospitals and hospital-staffed physicians (as well as hospital-system alliances) has increased in response to the need to decrease the cost of non-patient-care overhead.  I discuss some of the points made by two innovators from the healthcare and communications sectors.



I also call attention to the New York Times front-page article reporting a sharp rise in inflation-adjusted Medicare payments for emergency-room services since 2006, due to upcoding at the highest level, partly related to physicians’ ability to overstate the claim for the service provided, something addressable by the correctible improvements I discuss below (NY Times, 9/22/2012).  The solution still has another built-in step that requires quality control of both the input and the output, achievable today.  This also comes at a time of nationwide implementation of ICD-10 to replace ICD-9 coding.

US medical groups’ adoption of EHR (2005) (Photo credit: Wikipedia)


The first contribution, by Robert S Didner, on “Decision Making in the Clinical Setting,” concludes that gathering information carries large costs while reimbursement for the activities provided has decreased, to the detriment of the outcomes that are measured.  He suggests that this data can be gathered and reformatted to improve its value in the clinical setting by leading to decisions with optimal outcomes, and he outlines how this can be done.

The second is a discussion by Emergency Medicine physicians Thomas A Naegele and Harry P Wetzler, who have developed a Foresighted Practice Guideline (FPG) model (“The Foresighted Practice Guideline Model: A Win-Win Solution”).  They focus on collecting data from similar patients, their interventions, and treatments to better understand the value of alternative courses of treatment.  Using the FPG model will enable physicians to elevate their practice to a higher level, and they will have hard information on what works.  These two views are more than 10 years old, and they are complementary.

Didner points out that no single sequence of tests and questions can be optimal for all presenting clusters.  Even as data and test results are acquired, the optimal sequence of information gathering changes, depending on the information gathered.  This creates the dilemma of how to collect clinical data.  Currently, the way information is requested and presented does not support the way decisions are made.  Decisions are made in a “path-dependent” way, influenced by the sequence in which the components are considered.  Ideally, a separate form would be required for each combination of presenting history and symptoms, prior to ordering tests, which is unmanageable.  The blank-paper format is no better, as the data are not collected in the way they will be used, and they fall into separate clusters (vital signs; lab work, itself divided into CBC, chemistry panel, microbiology, immunology, blood bank, and special tests).  Improvements have been made in the graphical presentation of a series of tests.  Didner presents another means of gathering data in machine-manipulable form that improves the expected outcomes.  The basis for this model is that at any stage of testing and information gathering there is an expected outcome from the process, coupled with a metric, or hierarchy of values, to determine the relative desirability of the possible outcomes.

He creates a value hierarchy:

  1. Minimize the likelihood that a treatable, life-threatening disorder is not treated.
  2. Minimize the likelihood that a treatable, permanently-disabling or disfiguring disorder is not treated.
  3. Minimize the likelihood that a treatable, discomfort causing disorder is not treated.
  4. Minimize the likelihood that a risky procedure (treatment or diagnostic) is inappropriately administered.
  5. Minimize the likelihood that a discomfort causing procedure is inappropriately administered.
  6. Minimize the likelihood that a costly procedure is inappropriately administered.
  7. Minimize the time of diagnosing and treating the patient.
  8. Minimize the cost of diagnosing and treating the patient.
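Didner's eight criteria form a strict priority ordering: a lower-numbered criterion always outranks the ones below it. That maps naturally onto lexicographic comparison. The sketch below is hypothetical (the workup names and all probabilities, times, and costs are invented), but it shows how Python's tuple ordering ranks candidate workups exactly as the hierarchy prescribes.

```python
# Hypothetical scores for three candidate workups on the eight criteria,
# in hierarchy order (all numbers invented; lower is better on each):
# (p_missed_lethal, p_missed_disabling, p_missed_discomfort,
#  p_risky_misuse, p_discomfort_misuse, p_costly_misuse, time_hr, cost_usd)
workups = {
    "A": (0.01, 0.05, 0.20, 0.02, 0.10, 0.30, 4.0, 1200.0),
    "B": (0.01, 0.05, 0.20, 0.02, 0.10, 0.25, 6.0, 900.0),
    "C": (0.02, 0.01, 0.10, 0.01, 0.05, 0.10, 2.0, 300.0),
}

# Tuples compare element by element, so a plain ascending sort applies
# criteria 1-8 in strict priority order: criterion 8 (cost) only breaks
# ties that survive criteria 1-7.
ranked = sorted(workups, key=workups.get)
print(ranked)  # ['B', 'A', 'C']
```

Note how C, despite being fastest and cheapest, ranks last: its higher chance of missing a lethal disorder (criterion 1) dominates everything else, which is exactly the behavior the hierarchy demands.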

In reference to minimizing the number, time, and cost of tests, he determines that the optimum sequence could be found using Claude Shannon’s information theory.  As to a hierarchy of outcome values, he refers to the QALY scale as a starting point.  At any point where a determination is made, disparate information has to be brought together, such as weight, blood pressure, cholesterol, etc.  He points out, in addition, that the way clinical information is organized is not optimal for displaying information to enhance human cognitive performance in decision support.  Furthermore, he takes the limit of short-term memory to be about 10 chunks of information at any time, and he notes that a grand master’s recall of chess positions is far better when the pieces are arranged in an order commensurate with a “line of attack”.  The information has to be ordered in the way it is to be used!  By presenting the information used for a particular decision component in a compact space, the load on short-term memory is reduced, and there is less strain in searching for the relevant information.
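The Shannon-style calculation alluded to here can be made concrete: the expected information gain of a diagnostic test is the mutual information between its result and the disease state, i.e., the pretest uncertainty minus the expected post-test uncertainty. The numbers below (pretest probability, sensitivity, specificity) are invented for illustration, not drawn from Didner.

```python
import math

def entropy(p):
    """Binary entropy in bits of a probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def expected_info_gain(prior, sens, spec):
    """Expected reduction in diagnostic uncertainty from one test result."""
    p_pos = prior * sens + (1 - prior) * (1 - spec)   # P(test positive)
    post_pos = prior * sens / p_pos                   # P(disease | +)
    post_neg = prior * (1 - sens) / (1 - p_pos)       # P(disease | -)
    return entropy(prior) - (p_pos * entropy(post_pos)
                             + (1 - p_pos) * entropy(post_neg))

# Hypothetical test: 30% pretest probability, 90% sensitive, 80% specific.
gain = expected_info_gain(0.30, 0.90, 0.80)
print(round(gain, 3))  # about 0.33 bits of the 0.88-bit pretest uncertainty
```

Ranking available tests by this quantity (per unit time or cost) is one way to operationalize the "optimum sequence" of questions and tests: at each step, order the test whose answer is expected to remove the most remaining uncertainty.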

He creates a Table to illustrate the point.

Correlation of weight with other cardiac risk factors

Chol       0.759384
HDL       -0.53908
LDL        0.177297
bp-syst    0.424728
bp-dia     0.516167
Triglyc    0.637817

The task of the information system designer is to provide or request the right information, in the best form, at each stage of the procedure.
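A correlation column like the one in the table above is straightforward to produce once the values are in machine-manipulable form. The sketch below uses synthetic patient data (all numbers invented, not Didner's) and numpy's `corrcoef` to compute the Pearson correlation of weight with each risk factor.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical patient records (all numbers invented): one value per patient.
n = 100
weight = rng.normal(80, 12, n)
factors = {
    "Chol":    0.8 * weight + rng.normal(0, 15, n),   # rises with weight
    "HDL":    -0.4 * weight + rng.normal(0, 10, n),   # falls with weight
    "bp-syst": 0.5 * weight + rng.normal(0, 12, n),
}

# Pearson correlation of weight with each factor, one row per factor,
# in the same layout as the table above.
corrs = {name: np.corrcoef(weight, vals)[0, 1] for name, vals in factors.items()}
for name, r in corrs.items():
    print(f"{name:8s} {r: .6f}")
```

Presenting such derived quantities together in one compact display, rather than scattered across chart sections, is exactly the short-term-memory argument made above.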

The FPG concept as deployed by Naegele and Wetzler is a model for the design of a more effective health record that has already shown substantial proof of concept in the emergency-room setting.  In principle, every clinical encounter is viewed as a learning experience that requires the collection of data, learning from similar patients, and comparing the value of alternative courses of treatment.  The framework for standard data collection is the FPG model.  The FPG is distinguished from hindsighted guidelines, which are utilized by utilization- and peer-review organizations.  Over time, the data form patient clusters and enable the physician to function at a higher level.

Hypothesis construction is experiential, and hypothesis generation and testing are required to go from art to science in the complex practice of medicine.  In every encounter there are three components: patient, process, and outcome.  The key to the process is to collect data on patients, processes, and outcomes in a standard way.  The main problem with a large portion of the chart is that the description is not uniform, and this is not fully resolved even with good natural language processing.  The standard words and phrases that may be used for a particular complaint or condition constitute a guideline.  This type of “guided documentation” is a step toward a guided practice.  It enables physicians to gather data on patients, processes, and outcomes of care in routine settings, and the guidelines can be reviewed and updated.  This is a higher level of methodology than basing guidelines on “consensus and opinion.”
When Lee Goldman et al. created the guideline for classifying chest pain in the emergency room, characterizing the chest pain was problematic.  In dealing with this, he determined that if the chest pain was “stabbing,” or if it radiated to the right foot, heart attack was excluded.

The IOM is intensely committed to practice guidelines for care.  The guidelines are the databases of the science of medical decision making and disposition processing, and are related to process flow.  However, the hindsighted or retrospective approach is diagnosis- or procedure-oriented.  Hindsighted practice guidelines (HPGs) are the tool used in utilization review.  The FPG model focuses on the physician-patient encounter and is problem-oriented.  We can go back further and remember the contribution by Lawrence Weed to the “structured medical record.”
Physicians today use an FPG framework in looking at a problem or pathology (especially in pathology, which extends classification by use of biomarker staining).  The Standard Patient File Format (SPPF), developed by Weed, includes: 1. patient demographics; 2. front of the chart; 3. subjective; 4. objective; 5. assessment/diagnosis; 6. plan; 7. back of the chart.  The FPG retains the structure of the SPPF.  All of the words and phrases in the FPG are the database for the problem or condition.  The current construct of the chart is uninviting: nurses’ notes, medications, lab results, radiology, imaging.

Realtime Clinical Expert Support and Validation System
Gil David and Larry Bernstein have developed, in consultation with Prof. Ronald Coifman of the Yale University Applied Mathematics Program, a software system that is the equivalent of an intelligent Electronic Health Records Dashboard: it provides empirical medical reference and suggests quantitative diagnostic options.

The introduction of a DASHBOARD allows presentation of drug reactions, allergies, primary and secondary diagnoses, and critical information about any patient to the caregiver needing access to the record. The advantage of this innovation is obvious. The startup problem is deciding what information is presented and how it is displayed, which is a source of variability and a key to its success. It is also imperative that the extraction of data from disparate sources will, in the long run, further improve the diagnostic process: for instance, the finding of ST depression on EKG coincident with an increase of a cardiac biomarker (troponin). Through the application of geometric clustering analysis, the data may be interpreted in a more sophisticated fashion in order to create a more reliable and valid knowledge-based opinion.  In the hemogram one can view data reflecting a broad spectrum of medical conditions: characteristics expressed as measurements of size, density, and concentration, resulting in more than a dozen composite variables, including the mean corpuscular volume (MCV), mean corpuscular hemoglobin concentration (MCHC), mean corpuscular hemoglobin (MCH), total white cell count (WBC), total lymphocyte count, neutrophil count (mature granulocytes and bands), monocytes, eosinophils, basophils, platelet count, mean platelet volume (MPV), blasts, reticulocytes, and platelet clumps, as well as other features of classification.  This has been described in a previous post.
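The geometric clustering analysis referred to is the Yale group's own method, which is not reproduced here. As a minimal stand-in, the sketch below runs plain k-means on two synthetic hemogram variables (all values invented) to show the basic idea: once disparate lab values are standardized onto comparable scales, patient groups can separate as geometric clusters.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two synthetic 50-patient groups, described by (MCV fL, hemoglobin g/dL).
# Values are invented for illustration: a microcytic-anemic-like cluster
# and a normal-range cluster.
low = rng.normal([70.0, 9.0], [3.0, 0.8], (50, 2))
normal = rng.normal([90.0, 14.5], [3.0, 0.8], (50, 2))
X = np.vstack([low, normal])

# Standardize so the two analytes contribute on comparable scales.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Plain k-means with k = 2: assign each point to its nearest centroid,
# recompute the centroids, repeat.
centroids = X[rng.choice(len(X), 2, replace=False)]
for _ in range(20):
    dists = ((X[:, None, :] - centroids) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    centroids = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(2)
    ])

counts = np.bincount(labels, minlength=2)
print(counts)
```

With two well-separated synthetic groups the recovered clusters line up with the patient groups; real hemogram data would need the richer geometry (a dozen-plus composite variables) described above.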

It is beyond comprehension that a better construct has not been created for common use.

W Ruts, S De Deyne, E Ameel, W Vanpaemel, T Verbeemen, and G Storms. Dutch norm data for 13 semantic categories and 338 exemplars. Behavior Research Methods, Instruments, & Computers 2004; 36(3): 506–515.

S De Deyne, S Verheyen, E Ameel, W Vanpaemel, MJ Dry, W Voorspoels, and G Storms. Exemplar by feature applicability matrices and other Dutch normative data for semantic concepts. Behavior Research Methods 2008; 40(4): 1030–1048.

Landauer TK, Ross BH, Didner RS. Processing visually presented single words: A reaction time analysis [Technical memorandum]. Murray Hill, NJ: Bell Laboratories; 1979.

Weed L. Automation of the problem oriented medical record. NCHSR Research Digest Series DHEW. 1977;(HRA)77-3177.

Naegele TA. Letter to the Editor. Amer J Crit Care 1993;2(5):433.

