Recap of Bio-IT World 2016 by Sanjay Joshi CTO, Healthcare & Life Sciences, EMC Emerging Technologies Division
Reporter: Aviva Lev-Ari, PhD, RN

Series B, Volume 2:
Latest in Genomics Methodologies for Therapeutics: Gene Editing, NGS & BioInformatics, Simulations and the Genome Ontology
https://www.amazon.com/dp/B08385KF87
Guest Author: Sanjay Joshi, CTO, Healthcare & Life Sciences, EMC Emerging Technologies Division
04/21/2016
This month, I attended Bio-IT World Conference & Expo where I was able to gather with colleagues to discuss the latest trends in data and storage management, data computing and more. I’ve included my key highlights and takeaways below.
Keynotes
Two keynotes stood out for me. The first was from Heidi Rehm, chief laboratory director for molecular medicine at Partners Healthcare Personalized Medicine, who spoke about ClinGen and ClinVar (with approximately 120,000 variants in the database), and the hierarchy of human gene ontology and curation. The take-home message was the need for consistency in variant classification and knowledge improvement, along with a massive effort in data sharing.
The second was from Howard Jacob, executive vice president for medical genomics and chief medical genomics officer for HudsonAlpha, who delivered a compelling keynote on Clinical Grade Sequencing and the importance of negative results.
CIO of Human Longevity, Inc. (HLI) Yaron Turpaz’s keynote is also worth mentioning, as it was both honest and candid, addressing why they charge $25,000+ per test which includes full-body MRI, and why they will not share their results. HLI also presented the famous “imputing facial structure from genome” slide, forcing us to consider the privacy and security implications of a person’s face and nude body image being created from his or her genome.
Genome Compression
During the conference, I also had the opportunity to visit several vendor booths. PetaGene, categorized under compression and recoding, was one of the most interesting companies.
I first heard of them late last summer, when the Sanger Institute presented their paper on variation graphs (VG), which I would summarize as: make some biological assumptions, walk/shuffle a matrix based on nucleotides and create a sparse matrix.
A more technical explanation is that variation graphs provide a succinct encoding of the sequences of many genomes.
A variation graph (in particular as implemented in VG) is composed of:
- Nodes, which are labeled by sequences and ids.
- Edges, which connect two nodes via either of their respective ends.
- Paths, which describe genomes, sequence alignments, and annotations (such as gene models and transcripts) as walks through nodes connected by edges.
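The three components listed above can be sketched in a few lines of Python. This is a minimal toy model, not the VG implementation: the class name `VariationGraph`, its methods, and the example sequences are all hypothetical, and the sketch deliberately ignores strand orientation (real VG edges can attach to either end of a node).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class VariationGraph:
    """Toy variation graph (hypothetical names, forward strand only)."""
    nodes: Dict[int, str] = field(default_factory=dict)        # node id -> sequence label
    edges: Set[Tuple[int, int]] = field(default_factory=set)   # (from_id, to_id)
    paths: Dict[str, List[int]] = field(default_factory=dict)  # path name -> walk of node ids

    def add_node(self, node_id: int, sequence: str) -> None:
        self.nodes[node_id] = sequence

    def add_edge(self, src: int, dst: int) -> None:
        self.edges.add((src, dst))

    def add_path(self, name: str, walk: List[int]) -> None:
        # A walk is valid only if each consecutive pair of nodes is joined by an edge.
        for src, dst in zip(walk, walk[1:]):
            if (src, dst) not in self.edges:
                raise ValueError(f"no edge {src}->{dst} for path {name!r}")
        self.paths[name] = walk

    def sequence(self, name: str) -> str:
        # Spell out a genome by concatenating node labels along its path.
        return "".join(self.nodes[n] for n in self.paths[name])


def build_example() -> VariationGraph:
    """Two genomes that differ by one SNP share three of their four nodes."""
    g = VariationGraph()
    g.add_node(1, "ACG")
    g.add_node(2, "T")    # reference allele (made-up data)
    g.add_node(3, "A")    # alternate allele (made-up data)
    g.add_node(4, "GGA")
    for src, dst in [(1, 2), (1, 3), (2, 4), (3, 4)]:
        g.add_edge(src, dst)
    g.add_path("reference", [1, 2, 4])
    g.add_path("sample", [1, 3, 4])
    return g


if __name__ == "__main__":
    graph = build_example()
    print(graph.sequence("reference"))
    print(graph.sequence("sample"))
```

The point of the toy example is that the shared flanking sequences are stored once, and each additional genome costs only its path plus any novel nodes, which is where the succinct encoding comes from.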
The index linked below, GCSA2, is a reimplementation of the Generalized Compressed Suffix Array (GCSA), a BWT-based (Burrows-Wheeler transform) index for directed graphs. The implementation is based on the Succinct Data Structures Library 2.0 (SDSL).
https://github.com/jltsiren/gcsa2
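For readers unfamiliar with the BWT, here is a minimal sketch of the transform on a plain string. It is only meant to show the underlying idea; GCSA generalizes it to paths in a graph and uses succinct data structures, none of which is attempted here. The function names and the `$` sentinel are choices made for this illustration.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, sentinel: str = "$") -> str:
    """Naive inversion: repeatedly prepend the last column and re-sort."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(c + row for c, row in zip(last_column, table))
    # The original string is the row that ends with the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)                # annb$aa
    print(inverse_bwt(transformed))   # banana
```

The transform tends to group identical characters together, which helps compression, and it is reversible, which is what makes it usable as a searchable index. This naive version sorts every rotation and is only practical for short strings.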
First and second derivatives, convolution, quadratic fitting and all that, via MCMC (Markov Chain Monte Carlo)
PetaGene demonstrated a 5x compression ratio and won “Best of Show”:
http://www.bio-itworld.com/2016/4/6/bio-itworld-honors-new-products.aspx
My personal take on the subject is that the compute approach rests on several assumptions: biological and Bayesian, data quality, etc. The data that HLI has spoken about is worrisome: about 50 rare variants per cohort! Similar results (of about 5x compression ratios) from larger samples and in a clinical setting would be a boon to the community.
GATK4: C’mon, YAWL?! (Yet Another Workflow Language?!)
This response came from me and many other folks when the Broad Institute, Google and Intel announced GATK4 (Genome Analysis Toolkit). They are testing GATK4 on private cloud instances (SwiftStack) as well.
GATK4 is Apache Spark ready!
The YAWL comment was prompted by yet another workflow language, Workflow Definition Language (WDL), and its execution engine, Cromwell (not to be confused with the Common Workflow Language, CWL):
https://github.com/broadinstitute/gatk
https://github.com/broadinstitute/cromwell
GATK4 handles read alignment, QC + pre-processing, SNV, Indel, CNV, SV and Pathogen for both Germline and Cancer workflows.
Another (earlier) announcement from AMPLab at UC Berkeley on ADAM (using Parquet) changes the compute landscape for genomics. It uses in-memory analytics for genome alignment and assembly:
http://ampcamp.berkeley.edu/5/exercises/genome-analysis-with-adam.html
EMC DSSD and MetaLnx summaries (each in one slide)
I was on an Intel breakfast panel with the National Cancer Institute (NCI), AWS and BioTeam.
Attached are the DSSD and MetaLnx summaries, with the use cases for each on one slide.
Qumulo and the 75GB index, Igneous and?
I chaired the session that included Qumulo and Igneous.
Igneous teasingly mentioned ARM processors, mobile technology and a software stack on top related to analytics and storage, and left us all waiting for more information.
BioPharma business models in the Public Cloud?
Several very senior folks in BioPharma are becoming increasingly aware of the “business model watchers” on the public cloud. Here is the argument: workflows that the best scientific minds bring into the public cloud are now being closely monitored, and those watching can create similar workflows (without treading on patents). This story has been repeated umpteen times in the retail space. I would shelve this (no pun intended) under the “privacy and cybersecurity” category, which is the topic du jour of 2016.
Memes that still linger on:
The “10x meme” of “10 times more bacteria than human cells within the human body” still survives in the science world even though it was recently quashed:
http://biorxiv.org/content/early/2016/01/06/036103.full.pdf+html
Tidbits from various presentations:
- Common tools among presentations: Arvados, Docker, Elevada, eTRIKS, IO Informatics, Luigi, Nextflow, Spotfire, Tamr, tranSMART
- “Method Patents” are not allowed after Mayo vs. Prometheus and AMP vs. Myriad. Algorithms and Statistical Analyses are considered “laws of nature”
- Roche mentioned http://www.pioneeringhealthcare.com/
- Pfizer noted that 60 to 70 percent of its IT investments go toward Data Processing, Loading and Curation
- Amgen classified Real World Evidence (RWE) into four categories:
  - Pharmacovigilance
  - Disease Profile/Forecast
  - Clinical Program Design and Implementation
  - Risk/Benefit Analysis
NIST noted that 707.5 million data records were either lost or stolen in 2015. It presented NIST SP 500-299, “Cloud Computing Security Reference Architecture.”
A quote from Leonard Lipovich’s presentation: “…more than 95% of diseases associated with GWAS (Genome Wide Association Studies) SNPs (Single Nucleotide Polymorphisms) are in non-coding regions. lncRNA (long non-coding RNA) may be the new ‘theranostics’ (therapeutics and diagnostics) for personalized medicine.”
As always, the Bio-IT World showrunners organized an information-packed event, and I enjoyed meeting with life sciences, pharmaceutical, clinical, healthcare and IT professionals from around the world.