Reporter: Aviva Lev-Ari, PhD, RN
A New Approach Uses Compression to Speed Up Genome Analysis
Public-Domain Computing Resources
http://people.csail.mit.edu/bab/computing_new.html#systems
Compressive genomics
Compressing a dataset with specialized algorithms is typically done in the context of data storage, where compression tools can shrink data to save space on a hard drive. But a group of researchers at MIT has developed tools that compute directly on compressed genomic datasets by exploiting the fact that most sequenced genomes are very similar to previously sequenced genomes.
Led by MIT professor Bonnie Berger, the group has recently released tools called CaBlast and CaBlat, compressive versions of the widely used Blast and Blat alignment tools, respectively.
In a Nature Biotechnology paper published in July, Berger and her colleagues describe how the algorithms deliver alignment and analysis results up to four times faster than Blast and Blat when searching for a particular sequence in 36 yeast genomes.
“What we demonstrate is that the more highly similar genomes there are in a database, the greater the relative speed of CaBlast and CaBlat compared to the original non-compressive versions,” Berger says. “As we increase the number of genomes, the amount of work required for compressive algorithms scales only linearly in the amount of non-redundant data. The idea is that we’ve already done most of the work on the first genome.”
Both algorithms are still in beta, and the MIT team plans several refinements in future releases to optimize performance. To that end, Berger has made the code for both algorithms available in the hope that developers will help her team build “industrial-strength” software that the research community can use.
“To achieve optimal performance in real-use cases, we expect the code will need to be tuned for the engineering trade-offs specific to the application at hand,” she says. “The algorithm used to find and compress similar sequences in the database may need to be tweaked to take this issue into account, and the coarse- and fine-search steps should be aware of these constraints as well.”
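The coarse- and fine-search idea Berger describes can be illustrated with a toy sketch. This is not the CaBlast or CaBlat code: the fixed-size blocks, the exact-match seeds, and the names `compress`, `search`, `BLOCK`, and `SEED` are all assumptions made for illustration. The sketch stores each distinct block of sequence once, runs the coarse pass over that non-redundant data only, and then checks the original sequences near each linked copy. Reading the originals directly stands in for the decompression step a real tool would need.

```python
from collections import defaultdict

BLOCK = 8  # hypothetical block size
SEED = 4   # hypothetical seed length; queries should have len >= 2 * SEED - 1


def compress(genomes, block=BLOCK):
    """Store each distinct block once and record where every copy came from."""
    unique = {}                # block text -> block id
    links = defaultdict(list)  # block id -> [(genome index, offset), ...]
    for g, seq in enumerate(genomes):
        for off in range(0, len(seq), block):
            bid = unique.setdefault(seq[off:off + block], len(unique))
            links[bid].append((g, off))
    return list(unique), links  # list index equals block id


def search(query, genomes, blocks, links, block=BLOCK, seed=SEED):
    """Coarse pass over unique blocks, then fine pass near linked copies."""
    seeds = {query[i:i + seed] for i in range(len(query) - seed + 1)}
    hits = set()
    for bid, text in enumerate(blocks):        # coarse: non-redundant data only
        if any(s in text for s in seeds):
            for g, off in links[bid]:          # fine: follow links to originals
                lo = max(0, off - len(query))
                window = genomes[g][lo:off + block + len(query)]
                pos = window.find(query)
                while pos != -1:
                    hits.add((g, lo + pos))
                    pos = window.find(query, pos + 1)
    return sorted(hits)


# Three near-identical toy "genomes"; the second and third differ by one base.
genomes = [
    "ACGTACGTTTGACCAGTACGGATCCA",
    "ACGTACGTTTGACCAGTACGGAACCA",
    "ACGTACGTTTGATCAGTACGGATCCA",
]
blocks, links = compress(genomes)
total_blocks = sum(len(v) for v in links.values())
hits = search("TTGACCAGT", genomes, blocks, links)
print(len(blocks), "unique blocks out of", total_blocks)  # 6 unique blocks out of 12
print(hits)                                               # [(0, 8), (1, 8)]
```

In this toy the coarse pass scans 6 blocks instead of 12, and adding another near-identical sequence would add few or no new unique blocks. The real tools tolerate mismatches and gaps, which exact-match blocks like these do not.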
While computing resources are becoming increasingly powerful, Berger contends that better algorithms and the use of compression technology will play a crucial role in helping researchers to keep up with the production of next-generation sequencing data.
Matthew Dublin is a senior writer at Genome Technology.
http://www.genomeweb.com/node/1122021/?hq_e=el&hq_m=1349154&hq_l=7&hq_v=09187c3305
I actually consider this amazing blog, “SAME SCIENTIFIC IMPACT: Scientific Publishing – Open Journals vs. Subscription-based « Pharmaceutical Intelligence”, very compelling plus the blog post ended up being a good read.
Many thanks, Annette