
Recap of Bio-IT World 2016 by Sanjay Joshi CTO, Healthcare & Life Sciences, EMC Emerging Technologies Division
Guest Author: Sanjay Joshi, CTO, Healthcare & Life Sciences, EMC Emerging Technologies Division
04/21/2016
This month, I attended Bio-IT World Conference & Expo where I was able to gather with colleagues to discuss the latest trends in data and storage management, data computing and more. I’ve included my key highlights and takeaways below.
Keynotes
Two keynotes stood out for me. The first was from Heidi Rehm, chief laboratory director for molecular medicine at Partners Healthcare Personalized Medicine, who spoke about ClinGen and ClinVar (with approximately 120,000 variants in the database), and hierarchy of human gene ontology and curation. The take-home message was a need for consistency in variant classification and knowledge improvement, along with a massive effort in data sharing.
The second was from Howard Jacob, executive vice president for medical genomics and chief medical genomics officer for HudsonAlpha, who delivered a compelling keynote on Clinical Grade Sequencing and the importance of negative results.
CIO of Human Longevity, Inc. (HLI) Yaron Turpaz’s keynote is also worth mentioning, as it was both honest and candid, addressing why they charge $25,000+ per test which includes full-body MRI, and why they will not share their results. HLI also presented the famous “imputing facial structure from genome” slide, forcing us to consider the privacy and security implications of a person’s face and nude body image being created from his or her genome.
Genome Compression
During the conference, I also had the opportunity to visit several vendor booths. PetaGene, categorized under compression and recoding, was one of the most interesting companies.
I first heard of them late last summer, when the Sanger Institute presented their paper on variation graphs (VG), which I would summarize as: make some biological assumptions, walk/shuffle a matrix based on nucleotides, and create a sparse matrix.
A more technical explanation is that variation graphs provide a succinct encoding of the sequences of many genomes.
A variation graph (in particular as implemented in VG) is composed of:
- Nodes, which are labeled by sequences and ids.
- Edges, which connect two nodes via either of their respective ends.
- Paths, which describe genomes, sequence alignments, and annotations (such as gene models and transcripts) as walks through nodes connected by edges.
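To make the three components above concrete, here is a minimal sketch in Python. It is not how VG is implemented: the class name, method names, and the toy sequences are all hypothetical, and edges are simplified to one-directional links from the end of one node to the start of the next, whereas VG edges can attach to either end of a node.

```python
from dataclasses import dataclass, field


@dataclass
class VariationGraph:
    """Toy variation graph: nodes, edges, and named paths (illustrative only)."""
    nodes: dict = field(default_factory=dict)  # node id -> sequence label
    edges: set = field(default_factory=set)    # (from_id, to_id) pairs
    paths: dict = field(default_factory=dict)  # path name -> list of node ids

    def add_node(self, node_id, sequence):
        self.nodes[node_id] = sequence

    def add_edge(self, from_id, to_id):
        # Simplifying assumption: edges only go end-of-node -> start-of-node.
        self.edges.add((from_id, to_id))

    def add_path(self, name, node_ids):
        # A path is a walk, so every consecutive pair must share an edge.
        for a, b in zip(node_ids, node_ids[1:]):
            if (a, b) not in self.edges:
                raise ValueError(f"no edge between node {a} and node {b}")
        self.paths[name] = list(node_ids)

    def path_sequence(self, name):
        # Spell out the sequence by concatenating node labels along the walk.
        return "".join(self.nodes[i] for i in self.paths[name])


def build_example_graph():
    """Two haplotypes that differ by a single base (A vs. C) in the middle."""
    g = VariationGraph()
    g.add_node(1, "GATT")
    g.add_node(2, "A")    # reference allele
    g.add_node(3, "C")    # alternate allele
    g.add_node(4, "ACA")
    for a, b in [(1, 2), (1, 3), (2, 4), (3, 4)]:
        g.add_edge(a, b)
    g.add_path("ref", [1, 2, 4])
    g.add_path("alt", [1, 3, 4])
    return g
```

In this toy graph the shared flanking sequences are stored once, and each genome is just a different walk through the same nodes, which is where the succinctness comes from.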
GCSA2 (linked below) is a reimplementation of the Generalized Compressed Suffix Array (GCSA), a Burrows-Wheeler transform (BWT)-based index for directed graphs. The implementation is based on the Succinct Data Structures Library 2.0 (SDSL).
https://github.com/jltsiren/gcsa2
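For readers unfamiliar with the BWT that GCSA builds on, the sketch below shows the transform on a single plain string. This is only the textbook version, written naively for clarity. It is not GCSA2's code or API, it does not handle graphs, and the function names and the `$` terminator are assumptions made for the illustration.

```python
def bwt(text, terminator="$"):
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if terminator in text:
        raise ValueError("terminator must not occur in the input text")
    s = text + terminator
    # Build every rotation of the terminated string and sort them.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, terminator="$"):
    """Invert the transform by repeatedly prepending the last column and sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The original text is the row that ends with the terminator.
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


if __name__ == "__main__":
    transformed = bwt("GATTACA")
    print(transformed)
    print(inverse_bwt(transformed))
```

The transform is reversible and tends to group identical characters together, which is why it underpins both compression and compact sequence indexes. Production tools use suffix-array-based construction rather than sorting full rotations as done here.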
First, Second Derivative, Convolution and Quadratic Fitting and all that via MCMC (Markov Chain Monte Carlo)
PetaGene demonstrated a 5x compression ratio and won “Best of Show”:
http://www.bio-itworld.com/2016/4/6/bio-itworld-honors-new-products.aspx
My personal take on the subject is that the computation rests on several assumptions: biological and Bayesian, data quality, etc. The data that HLI has spoken about is worrisome: about 50 rare variants per cohort! Similar results (of about 5x compression ratios) from larger samples and in a clinical setting would be a boon to the community.
GATK4: C’mon, YAWL?! (Yet Another Workflow Language?!)
This response came from me and many other folks when the Broad Institute, Google and Intel announced GATK4 (Genome Analysis Toolkit). They are testing GATK4 on private cloud instances (SwiftStack) as well.
GATK4 is Apache Spark ready!
The YAWL comment was prompted by yet another workflow language, the Workflow Definition Language (WDL), and its execution engine, Cromwell:
https://github.com/broadinstitute/gatk
https://github.com/broadinstitute/cromwell
GATK4 handles read alignment, QC + pre-processing, SNV, Indel, CNV, SV and Pathogen for both Germline and Cancer workflows.
Another (earlier) announcement, from the AMPLab at UC Berkeley on ADAM (which uses Apache Parquet), changes the compute landscape for genomics. ADAM uses in-memory analytics for genome alignment and assembly:
http://ampcamp.berkeley.edu/5/exercises/genome-analysis-with-adam.html
EMC DSSD and MetaLnx summaries (each in one slide)
I was on an Intel breakfast panel with the National Cancer Institute (NCI), AWS and BioTeam.
Attached is the DSSD and MetaLnx summary, with the use cases for each on one slide.
Qumulo and the 75GB index, Igneous and?
I chaired the session that included Qumulo and Igneous.
Igneous teasingly mentioned ARM processors, mobile technology and a software stack on top for analytics and storage, and left us all waiting for more information.
BioPharma business models in the Public Cloud?
Several very senior folks in BioPharma are becoming increasingly aware of the “business model watchers” on the public cloud. Here is the argument: workflows that the best scientific minds bring into the public cloud are now being closely monitored, so that similar workflows can be created (without treading on patents). This story has been repeated umpteen times in the retail space. I would shelve this (no pun intended) under the “privacy and cybersecurity” category, which is the topic du jour of 2016.
Memes that still linger on:
The “10x meme” (“10 times more bacteria than human cells within the human body”) still survives in the science world even though it was recently quashed:
http://biorxiv.org/content/early/2016/01/06/036103.full.pdf+html
Tidbits from various presentations:
- Common tools among presentations: Arvados, Docker, Elevada, eTRIKS, IO Informatics, Luigi, Nextflow, Spotfire, Tamr, tranSMART
- “Method Patents” are not allowed after Mayo vs. Prometheus and AMP vs. Myriad. Algorithms and Statistical Analyses are considered “laws of nature”
- Roche mentioned http://www.pioneeringhealthcare.com/
- Pfizer noted that 60 to 70 percent of its IT investments go toward Data Processing, Loading and Curation
- Amgen classified Real World Evidence (RWE) into four categories:
  - Pharmacovigilance
  - Disease Profile/Forecast
  - Clinical Program Design and Implementation
  - Risk/Benefit Analysis
NIST noted that 707.5 million data records were either lost or stolen in 2015. It presented NIST SP 500-299, “Cloud Computing Security Reference Architecture.”
A quote from Leonard Lipovich’s presentation: “…more than 95% of diseases associated with GWAS (Genome Wide Association Studies) SNPs (Single Nucleotide Polymorphisms) are in non-coding regions. lncRNA (long non-coding RNA) may be the new ‘theranostics’ (therapeutics and diagnostics) for personalized medicine.”
As always, the Bio-IT World showrunners organized an information-packed event, and I enjoyed meeting with life sciences, pharmaceutical, clinical, healthcare and IT professionals from around the world.
This is very insightful. There is no doubt that there is the bias you refer to. 42 years ago, when I was postdocing in biochemistry/enzymology before completing my residency in pathology, I knew that there were very influential members of the faculty, who also had large programs, and attracted exceptional students. My mentor, it was said (although he was a great writer), could draft a project on toilet paper and call the NIH. It can’t be true, but it was a time in our history preceding a great explosion. It is bizarre for me to read now about eNOS and iNOS, and about the CaMKII-α, β, γ, δ isoenzymes. They were overlooked during the search for the genome, so intermediary metabolism took a back seat. But the work on protein conformation, and on the mechanism of action of enzymes and ligand and coenzyme was just out there, and became more important with the research on signaling pathways. The work on the mechanism of pyridine nucleotide isoenzymes preceded the work by Burton Sobel on the MB isoenzyme in heart. The Vietnam War cut into the funding, and it has actually declined linearly since.
A few years later, I was an Associate Professor at a new Medical School and I submitted a proposal that was reviewed by the Chairman of Pharmacology, who was a former Director of NSF. He thought it was good enough. I was a pathologist and it went to a Biochemistry Review Committee. It was approved, but not funded. The verdict was that I would not be able to carry out the studies needed, and they would have approached it differently. A thousand young investigators are out there now with similar letters. I was told that the Department Chairmen have to build up their faculty. It’s harder now than then. So I filed for and received 3 patents based on my work at the suggestion of my brother-in-law. When I took it to Boehringer-Mannheim, they were actually clueless.