Speeding Up Genome Analysis: MIT Algorithms for Direct Computation on Compressed Genomic Datasets

September 18, 2012 by 2012pharmaceutical

Reporter: Aviva Lev-Ari, PhD, RN

A New Approach Uses Compression to Speed Up Genome Analysis

Public-Domain Computing Resources

Structural Bioinformatics

The BetaWrap program detects the right-handed parallel beta-helix super-secondary structural motif in primary amino acid sequences by using beta-strand interactions learned from non-beta-helix structures.

Wrap-and-pack detects beta-trefoils in protein sequences by using both pairwise beta-strand interactions and 3-D energetic packing information.

The BetaWrapPro program predicts right-handed beta-helices and beta-trefoils by using both sequence profiles and pairwise beta-strand interactions, and returns coordinates for the structure.

The MSARi program identifies conserved RNA secondary structure in non-coding RNA genes and mRNAs by searching multiple sequence alignments of a large set of candidate catalogs for correlated arrangements of reverse-complementary regions.

The Paircoil2 program predicts coiled-coil domains in protein sequences by using pairwise residue correlations obtained from a coiled-coil database. The original Paircoil program is still available for use.

The MultiCoil program predicts the location of coiled-coil regions in amino acid sequences and classifies the predictions as dimeric or trimeric. An updated version, Multicoil2, will soon be available.

The LearnCoil Histidine Kinase program uses an iterative learning algorithm to detect possible coiled-coil domains in histidine kinase receptors.

The LearnCoil-VMF program uses an iterative learning algorithm to detect coiled-coil-like regions in viral membrane-fusion proteins.

The Trilogy program discovers novel sequence-structure patterns in proteins by exhaustively searching through three-residue motifs using both sequence and structure information.

The ChainTweak program efficiently samples from the neighborhood of a given base configuration by iteratively modifying a conformation using a dihedral angle representation.

The TreePack program uses a tree-decomposition based algorithm to solve the side-chain packing problem more efficiently. This algorithm is more efficient than SCWRL 3.0 while maintaining the same level of accuracy.

PartiFold: Ensemble prediction of transmembrane protein structures. Using statistical mechanics principles, partiFold computes residue contact probabilities and samples super-secondary structures from sequence only.

tFolder: Prediction of beta-sheet folding pathways. Predicts a coarse-grained representation of the folding pathway of beta-sheet proteins in a couple of minutes.

RNAmutants: Algorithms for exploring the RNA mutational landscape. Predicts the effect of mutations on structures and, reciprocally, the influence of structures on mutations. A tool for molecular evolution studies and RNA design.

AmyloidMutants is a statistical mechanics approach for de novo prediction and analysis of wild-type and mutant amyloid structures. Based on the premise of protein mutational landscapes, AmyloidMutants energetically quantifies the effects of sequence mutation on fibril conformation and stability.

Genomics

GLASS aligns large orthologous genomic regions using an iterative global alignment system. Rosetta identifies genes based on conservation of exonic features in sequences aligned by GLASS.

RNAiCut – Automated Detection of Significant Genes from Functional Genomic Screens.

MinoTar – Predict microRNA Targets in Coding Sequence.

Systems Biology

The Struct2Net program predicts protein-protein interactions (PPI) by integrating structure-based information with other functional annotations, such as GO, co-expression and co-localization. The structure-based protein interaction prediction is conducted using the protein threading server RAPTOR plus logistic regression.

IsoRank is an algorithm for global alignment of multiple protein-protein interaction (PPI) networks. The intuition is that a protein in one PPI network is a good match for a protein in another network if the former’s neighbors are good matches for the latter’s neighbors.
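
The neighbor-matching intuition can be written as a simple fixed-point iteration. What follows is a minimal sketch of that idea, not the IsoRank implementation: the function name `neighbor_match_scores`, the toy networks, and the damping toward a uniform prior (added here so that the iteration settles) are all assumptions made for illustration.

```python
from itertools import product


def neighbor_match_scores(g1, g2, alpha=0.6, iterations=50):
    """Score node pairs (i, j) by how well their neighbors match.

    g1, g2: dicts mapping each node to the set of its neighbors
    (undirected, no isolated nodes). Hypothetical helper for illustration.
    alpha: weight of the neighbor term; the rest goes to a uniform prior
    (an assumption of this sketch).
    """
    pairs = list(product(g1, g2))
    prior = 1.0 / len(pairs)
    scores = {p: prior for p in pairs}
    for _ in range(iterations):
        updated = {}
        for i, j in pairs:
            # Each neighboring pair (u, v) spreads its score evenly over
            # the pairs formed by the neighbors of u and the neighbors of v.
            from_neighbors = sum(
                scores[(u, v)] / (len(g1[u]) * len(g2[v]))
                for u in g1[i]
                for v in g2[j]
            )
            updated[(i, j)] = alpha * from_neighbors + (1 - alpha) * prior
        scores = updated
    return scores


# Toy networks: one hub protein with three partners in each network.
net_a = {"a0": {"a1", "a2", "a3"}, "a1": {"a0"}, "a2": {"a0"}, "a3": {"a0"}}
net_b = {"b0": {"b1", "b2", "b3"}, "b1": {"b0"}, "b2": {"b0"}, "b3": {"b0"}}

scores = neighbor_match_scores(net_a, net_b)
best_for_hub = max(net_b, key=lambda j: scores[("a0", j)])
print("best match for a0:", best_for_hub)
```

In this toy case the hub of one network scores highest against the hub of the other, because their neighbors are in turn each other's best supporters. Turning pairwise scores into an actual alignment needs a further matching step, which this sketch leaves out.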

Other

t-sample is an online algorithm for time-series experiments that allows an experimenter to determine which biological samples should be hybridized to arrays to recover expression profiles within a given error bound.

http://people.csail.mit.edu/bab/computing_new.html#systems

Compressive genomics

http://www.nature.com/nbt/journal/v30/n7/abs/nbt.2241.html

Nature Biotechnology 30, 627–630 (2012) doi:10.1038/nbt.2241

Published online 10 July 2012

Algorithms that compute directly on compressed genomic data allow analyses to keep pace with data generation.

Introduction

In the past two decades, genomic sequencing capabilities have increased exponentially¹⁻³, outstripping advances in computing power⁴⁻⁸. Extracting new insights from the data sets currently being generated will require not only faster computers, but also smarter algorithms. However, most genomes currently sequenced are highly similar to ones already collected⁹; thus, the amount of new sequence information is growing much more slowly.

Here we show that this redundancy can be exploited by compressing sequence data in such a way as to allow direct computation on the compressed data using methods we term ‘compressive’ algorithms. This approach reduces the task of computing on many similar genomes to only slightly more than that of operating on just one. Moreover, its relative advantage over existing algorithms will grow with the accumulation of genomic data. We demonstrate this approach by implementing compressive versions of both the Basic Local Alignment Search Tool (BLAST)¹⁰ and the BLAST-Like Alignment Tool (BLAT)¹¹, and we emphasize how compressive genomics will enable biologists to keep pace with current data.
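
The redundancy argument can be illustrated with a toy example. The sketch below is not the compression scheme from the paper: it assumes, for simplicity, that all sequences have the same length and differ only by substitutions, and the names `compress` and `decompress` are hypothetical. It stores the first sequence in full and each later one only as its differences from the first.

```python
def compress(genomes):
    """Keep the first sequence whole; keep later ones as (position, base) edits.

    Simplifying assumption: equal-length sequences that differ only by
    substitutions. Real genomes also contain insertions and deletions.
    """
    reference = genomes[0]
    diffs = []
    for seq in genomes[1:]:
        if len(seq) != len(reference):
            raise ValueError("toy scheme assumes equal-length sequences")
        diffs.append(
            [(i, b) for i, (a, b) in enumerate(zip(reference, seq)) if a != b]
        )
    return reference, diffs


def decompress(reference, diffs):
    """Rebuild every original sequence from the reference and the edit lists."""
    genomes = [reference]
    for edits in diffs:
        bases = list(reference)
        for pos, base in edits:
            bases[pos] = base
        genomes.append("".join(bases))
    return genomes


genomes = [
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTACCTACGTACGT",  # one substitution, at position 10
    "ACGTTCGTACGTACGTACGA",  # substitutions at positions 4 and 19
]
reference, diffs = compress(genomes)
stored_items = len(reference) + sum(len(d) for d in diffs)
print(stored_items, "stored items versus", sum(len(g) for g in genomes), "raw bases")
```

Here 60 raw bases shrink to 23 stored items (20 reference bases plus 3 edits), and each additional similar genome adds only its own differences. That is the sense in which the cost of many similar genomes stays close to the cost of one.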

Conclusions

Compressive algorithms for genomics have the great advantage of becoming proportionately faster with the size of the available data. Although the compression schemes for BLAST and BLAT that we presented yield an increase in computational speed and, more importantly, in scaling, they are only a first step. Many enhancements of our proof-of-concept implementations are possible; for example, hierarchical compression structures, which respect the phylogeny underlying a set of sequences, may yield additional long-term performance gains. Moreover, analyses of such compressive structures will lead to insights as well. As sequencing technologies continue to improve, the compressive genomic paradigm will become critical to fully realizing the potential of large-scale genomics. Software is available at http://cast.csail.mit.edu/.

References

1. Lander, E.S. et al. Nature 409, 860–921 (2001).
2. Venter, J.C. et al. Science 291, 1304–1351 (2001).
3. Kircher, M. & Kelso, J. Bioessays 32, 524–536 (2010).
4. Kahn, S.D. Science 331, 728–729 (2011).
5. Gross, M. Curr. Biol. 21, R204–R206 (2011).
6. Huttenhower, C. & Hofmann, O. PLoS Comput. Biol. 6, e1000779 (2010).
7. Schatz, M., Langmead, B. & Salzberg, S. Nat. Biotechnol. 28, 691–693 (2010).
8. 1000 Genomes Project data available on Amazon Cloud. NIH press release, 29 March 2012.
9. Stratton, M. Nat. Biotechnol. 26, 65–66 (2008).
10. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. J. Mol. Biol. 215, 403–410 (1990).
11. Kent, W.J. Genome Res. 12, 656–664 (2002).
12. Grumbach, S. & Tahi, F. J. Inf. Process. Manag. 30, 875–886 (1994).
13. Chen, X., Li, M., Ma, B. & Tromp, J. Bioinformatics 18, 1696–1698 (2002).
14. Christley, S., Lu, Y., Li, C. & Xie, X. Bioinformatics 25, 274–275 (2009).
15. Brandon, M.C., Wallace, D.C. & Baldi, P. Bioinformatics 25, 1731–1738 (2009).
16. Mäkinen, V., Navarro, G., Sirén, J. & Välimäki, N. in Research in Computational Molecular Biology, vol. 5541 of Lecture Notes in Computer Science (Batzoglou, S., ed.) 121–137 (Springer Berlin/Heidelberg, 2009).
17. Kozanitis, C., Saunders, C., Kruglyak, S., Bafna, V. & Varghese, G. in Research in Computational Molecular Biology, vol. 6044 of Lecture Notes in Computer Science (Berger, B., ed.) 310–324 (Springer Berlin/Heidelberg, 2010).
18. Hsi-Yang Fritz, M., Leinonen, R., Cochrane, G. & Birney, E. Genome Res. 21, 734–740 (2011).
19. Mäkinen, V., Navarro, G., Sirén, J. & Välimäki, N. J. Comput. Biol. 17, 281–308 (2010).
20. Deorowicz, S. & Grabowski, S. Bioinformatics 27, 2979–2986 (2011).
21. Li, H., Ruan, J. & Durbin, R. Genome Res. 18, 1851–1858 (2008).
22. Li, H. & Durbin, R. Bioinformatics 25, 1754–1760 (2009).
23. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. Genome Biol. 10, R25 (2009).
24. Carter, D.M. Saccharomyces genome resequencing project. Wellcome Trust Sanger Institute http://www.sanger.ac.uk/Teams/Team118/sgrp/ (2005).
25. Tweedie, S. et al. Nucleic Acids Res. 37, D555–D559 (2009).

Primary authors

P.-R.L. and M.B. contributed equally to this work.
- Po-Ru Loh &
- Michael Baym

Affiliations

Po-Ru Loh, Michael Baym and Bonnie Berger are in the Department of Mathematics and Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.
Michael Baym is also in the Department of Systems Biology, Harvard Medical School, Boston, Massachusetts, USA.

Competing financial interests

The authors declare no competing financial interests.

Corresponding authors

Correspondence to:

Bonnie Berger or
Michael Baym

September 2012

By Matthew Dublin

Compressing a dataset with specialized algorithms is typically done in the context of data storage, where compression tools can shrink data to save space on a hard drive. But a group of researchers at MIT has developed tools that compute directly on compressed genomic datasets by exploiting the fact that most sequenced genomes are very similar to previously sequenced genomes.

Led by MIT professor Bonnie Berger, the group has recently released tools called CaBlast and CaBlat, compressive versions of the widely used Blast and Blat alignment tools, respectively.

In a Nature Biotechnology paper published in July, Berger and her colleagues describe how the algorithms deliver alignment and analysis results up to four times faster than Blast and Blat when searching for a particular sequence in 36 yeast genomes.

“What we demonstrate is that the more highly similar genomes there are in a database, the greater the relative speed of CaBlast and CaBlat compared to the original non-compressive versions,” Berger says. “As we increase the number of genomes, the amount of work required for compressive algorithms scales only linearly in the amount of non-redundant data. The idea is that we’ve already done most of the work on the first genome.”

These two algorithms are still in the beta phase, and the MIT team has several refinements planned for future release to optimize performance. To that end, Berger has made the code for both algorithms available with the hope that developers will help them build “industrial-strength” software that can be used by the research community.

“To achieve optimal performance in real-use cases, we expect the code will need to be tuned for the engineering trade-offs specific to the application at hand,” she says. “The algorithm used to find and compress similar sequences in the database may need to be tweaked to take this issue into account, and the coarse- and fine-search steps should be aware of these constraints as well.”
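
The coarse and fine search steps Berger mentions can be pictured with a toy two-stage search. This is only a sketch under stated assumptions, not the CaBlast or CaBlat code: it uses plain mismatch counting in place of a real alignment algorithm, it assumes every sequence in a group has the same length as its representative, and the names `hamming_hits` and `two_stage_search` are hypothetical.

```python
def hamming_hits(query, text, max_mismatches):
    """Return start offsets where query matches text with at most max_mismatches."""
    hits = []
    for start in range(len(text) - len(query) + 1):
        window = text[start:start + len(query)]
        mismatches = sum(a != b for a, b in zip(query, window))
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


def two_stage_search(query, representative, members, coarse_slack=2):
    """Coarse pass on one representative, fine pass only at coarse hits.

    representative: one sequence standing in for a group of similar sequences
    members: dict of name -> full sequence, same length as the representative
             (a simplification; a real tool would rebuild them from stored edits)
    coarse_slack: mismatches tolerated in the coarse pass, so that variants
                  of the representative are not missed
    """
    # Coarse: scan the representative once, with a relaxed threshold.
    candidates = hamming_hits(query, representative, coarse_slack)
    results = []
    for name, seq in members.items():
        for start in candidates:
            # Fine: exact comparison, restricted to the candidate windows.
            if seq[start:start + len(query)] == query:
                results.append((name, start))
    return results


representative = "ACGTACGTACGTACGTACGT"
members = {
    "strain1": "ACGTACGTACGTACGTACGT",
    "strain2": "ACGTACGTACCTACGTACGT",
    "strain3": "ACGTTCGTACGTACGTACGA",
}
print(two_stage_search("TACCTAC", representative, members))
```

The query occurs only in `strain2`, at offset 7, yet the full scan runs once over the representative rather than once per strain. The trade-off is in `coarse_slack`: set it too low and true matches in divergent members are missed; set it too high and the fine step has more candidates to check.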

While computing resources are becoming increasingly powerful, Berger contends that better algorithms and the use of compression technology will play a crucial role in helping researchers to keep up with the production of next-generation sequencing data.

Matthew Dublin is a senior writer at Genome Technology.
