Huge Data Network Bites into Cancer Genomics
Larry H. Bernstein, MD, FCAP, Curator
LPBI
Closer to a Cure for Gastrointestinal Cancer
Suzanne Tracy, Editor-in-Chief, Scientific Computing and HPC Source
http://www.scientificcomputing.com/news/2015/11/closer-cure-gastrointestinal-cancer
In order to streamline workflows and keep pace with data-intensive discovery demands, CCS integrated its HPC environment with data capture and analytics capabilities, allowing data to move transparently between research steps, and driving discoveries such as a link between certain viruses and gastrointestinal cancers.
SANTA CLARA, CA — At the University of Miami’s Center for Computational Science (CCS), more than 2,000 internal researchers and a dozen expert collaborators across academic and industry sectors worldwide are working together in workflow management, data management, data mining, decision support, visualization and cloud computing. CCS maintains one of the largest centralized academic cyberinfrastructures in the country, which fuels vital and critical discoveries in Alzheimer’s, Parkinson’s, gastrointestinal cancer, paralysis and climate modeling, as well as marine and atmospheric science research.
In order to streamline workflows and keep pace with data-intensive discovery demands, CCS integrated its high performance computing (HPC) environment with data capture and analytics capabilities, allowing data to move transparently between research steps. To speed scientific discoveries and boost collaboration with researchers around the world, the center deployed high-performance DataDirect Networks (DDN) GS12K scale-out file storage. CCS now relies on GS12K storage to handle bandwidth-driven workloads while serving very high IOPS demand resulting from intense user interaction, which simplifies data capture and analysis. As a result, the center is able to capture, store and distribute massive amounts of data generated from multiple scientific models running different simulations on 15 Illumina HiSeq sequencers simultaneously on DDN storage. Moreover, number-crunching time for genome mapping and SNP calling has been reduced from 72 to 17 hours.
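The article notes that data from 15 sequencers is captured and analyzed simultaneously, but does not describe how CCS schedules that work. As a loose illustration only, per-sample analysis jobs might be fanned out in parallel along the following lines; a real HPC deployment would submit these as jobs to a cluster scheduler rather than a local process pool, and the sample names here are placeholders.

```python
# Illustrative sketch only: fanning out per-sample analysis in parallel,
# standing in for simultaneous processing of runs from multiple sequencers.
# A production HPC setup would hand these jobs to a cluster scheduler.
from concurrent.futures import ProcessPoolExecutor

SAMPLES = [f"run_{i:02d}" for i in range(1, 16)]  # e.g., one active run per sequencer

def analyze(sample: str) -> str:
    # Placeholder for the real mapping/variant-calling pipeline.
    return f"{sample}: mapped and variant-called"

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        for result in pool.map(analyze, SAMPLES):
            print(result)
```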
“DDN enabled us to analyze thousands of samples for The Cancer Genome Atlas, which amounts to nearly a petabyte of data,” explained Dr. Nicholas Tsinoremas, director of the Center for Computational Science at the University of Miami. “Having a robust storage platform like DDN is essential to driving discoveries, such as our recent study that revealed a link between certain viruses and gastrointestinal cancers. Previously, we couldn’t have done that level of computation.”
In addition to providing significant storage processing power to meet both high I/O and interactive processing requirements, CCS needed a flexible file system that could support both large parallel jobs and short serial jobs. The center also needed to address “data in flight” challenges that result from major data surges during analysis, which often cause as much as a 10X spike in storage demand. The system’s performance for genomics assembly, alignment and mapping is enabling CCS to support all its application needs, including the use of BWA and Bowtie for initial read mapping, as well as SAMtools and GATK for variant analysis and SNP calling.
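The article names the tools but not the commands or settings used at CCS. A minimal, hypothetical sketch of such a BWA/SAMtools/GATK per-sample workflow might look like the following; the reference and FASTQ paths are placeholders, the flags are illustrative, and the reference is assumed to be pre-indexed.

```python
# Minimal sketch of a BWA -> SAMtools -> GATK pipeline, assuming the tools are
# on PATH and the reference FASTA is already indexed (bwa index, samtools faidx,
# gatk CreateSequenceDictionary). All file names are placeholders.
import subprocess

REF = "reference.fa"
R1, R2 = "sample_R1.fastq.gz", "sample_R2.fastq.gz"

def run(cmd: str) -> None:
    print(f"+ {cmd}")
    subprocess.run(cmd, shell=True, check=True)

# 1. Map reads to the reference (with a read group for downstream GATK use)
#    and produce a sorted, indexed BAM.
run(f"bwa mem -t 8 -R '@RG\\tID:sample\\tSM:sample\\tPL:ILLUMINA' {REF} {R1} {R2} "
    f"| samtools sort -@ 8 -o sample.sorted.bam -")
run("samtools index sample.sorted.bam")

# 2. Call SNPs and indels with GATK HaplotypeCaller.
run(f"gatk HaplotypeCaller -R {REF} -I sample.sorted.bam -O sample.vcf.gz")
```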
“Our arrangement is to share data or make it available to anyone asking, anywhere in the world,” added Tsinoremas. “Now, we have the storage versatility to attract researchers from both within and outside the HPC community … we’re well-positioned to generate, analyze and integrate all types of research data to drive major scientific discoveries and breakthroughs.”
About DDN
DataDirect Networks is a big data storage supplier to data-intensive, global organizations. For more than 15 years, the company has designed, developed, deployed and optimized systems, software and solutions that enable enterprises, service providers, universities and government agencies to generate more value and to accelerate time to insight from their data and information, on premises and in the cloud. Organizations leverage DDN technology and the technical expertise of its team to capture, store, process, analyze, collaborate and distribute data, information and content at the largest scale in the most efficient, reliable and cost-effective manner. DDN customers include financial services firms and banks, healthcare and life science organizations, manufacturing and energy companies, government and research facilities, and web and cloud service providers.
“Where DDN really stood out is in the ability to adapt to whatever we would need. We have both IOPS-centric storage and the deep, slower I/O pool at full bandwidth. No one else could do that.”
Joel P. Zysman
Director of High Performance Computing
Center for Computational Science at the University of Miami
The University of Miami maintains one of the largest centralized academic cyberinfrastructures in the US, which is integral to addressing and solving major scientific challenges. At its Center for Computational Science (CCS), more than 2,000 researchers, faculty, staff and students across multiple disciplines collaborate on diverse and interdisciplinary projects requiring HPC resources.
With 50% of the center’s users coming from the University of Miami’s Miller School of Medicine, including ongoing projects at the Hussman Institute for Human Genomics, the explosion of next-generation sequencing has had a major impact on compute and storage demands. At CCS, the heavy I/O required to create four billion reads from one genome in a couple of days only intensifies when the data from those reads must be managed and analyzed.
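To put that I/O load in perspective, a rough back-of-envelope estimate follows; the read length and per-base FASTQ overhead are assumptions for illustration, not figures from the article.

```python
# Rough, illustrative estimate of raw FASTQ volume for a run of this size.
# Read length and per-base overhead are assumptions, not figures from the article.
reads = 4_000_000_000          # "four billion reads from one genome"
read_length_bp = 150           # assumed Illumina read length
bytes_per_base = 2             # ~1 byte of sequence + ~1 byte of quality in FASTQ

total_bases = reads * read_length_bp         # ~6.0e11 bases
raw_bytes = total_bases * bytes_per_base     # ~1.2e12 bytes
print(f"~{total_bases / 1e9:.0f} Gbp sequenced, "
      f"~{raw_bytes / 1e12:.1f} TB of uncompressed FASTQ")
```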
Aside from providing sufficient storage power to meet both high I/O and interactive processing demands, CCS needed a powerful file system flexible enough to handle very large parallel jobs as well as smaller, shorter serial jobs. CCS also needed to absorb spikes of as much as 10X in storage demand, so it was critical to scale to petabytes of machine-generated data without adding a layer of complexity or creating inefficiencies.
Read their success story to learn how high-performance DDN® Storage I/O has helped the University of Miami:
- Establish links between certain viruses and gastrointestinal cancers through a level of computation that was not possible before
- Reduce genomics compute and analysis time from 72 to 17 hours
CHALLENGES
- Meet both high I/O and interactive processing demands from more than 2,000 researchers
- Support very large parallel jobs alongside smaller, shorter serial jobs
- Absorb “data in flight” surges of as much as 10X in storage demand during analysis
- Scale to petabytes of machine-generated sequencing data without adding complexity
SOLUTION
An end-to-end, high-performance DDN GRIDScaler® solution featuring a GS12K™ scale-out appliance with an embedded IBM® GPFS™ parallel file system
TECHNICAL BENEFITS
- Centralized storage with an embedded file system makes it easy to add storage where needed—in the high-performance, high-transaction or slower storage pools—and then manage it all through a single pane of glass
- DDN’s transparent data movement enables using one platform for data capture, download, analysis and retention
- The ability to maintain an active archive of storage lets the center accommodate different types of analytics with varied I/O needs