Reporter: Aviva Lev-Ari, PhD, RN
A New Approach Uses Compression to Speed Up Genome Analysis
Public-Domain Computing Resources
Structural Bioinformatics
The BetaWrap program detects the right-handed parallel beta-helix super-secondary structural motif in primary amino acid sequences by using beta-strand interactions learned from non-beta-helix structures.
Wrap-and-pack detects beta-trefoils in protein sequences by using both pairwise beta-strand interactions and 3-D energetic packing information
The BetaWrapPro program predicts right-handed beta-helices and beta-trefoils by using both sequence profiles and pairwise beta-strand interactions, and returns coordinates for the structure.
The MSARi program indentifies conserved RNA secondary structure in non-coding RNA genes and mRNAs by searching multiple sequence alignments of a large set of candidate catalogs for correlated arrangements of reverse-complementary regions
The Paircoil2 program predicts coiled-coil domains in protein sequences by using pairwise residue correlations obtained from a coiled-coil database. The original Paircoil program is still available for use.
The Trilogy program discovers novel sequence-structure patterns in proteins by exhaustively searching through three-residue motifs using both sequence and structure information.
The ChainTweak program efficiently samples from the neighborhood of a given base configuration by iteratively modifying a conformation using a dihedral angle representation.
The TreePack program uses a tree-decomposition based algorithm to solve the side-chain packing problem more efficiently. This algorithm is more efficient than SCWRL 3.0 while maintaining the same level of accuracy.
PartiFold: Ensemble prediction of transmembrane protein structures. Using statistical mechanics principles, partiFold computes residue contact probabilities and sample super-secondary structures from sequence only.
tFolder: Prediction of beta sheet folding pathways. Predict a coarse grained representation of the folding pathway of beta sheet proteins in a couple of minutes.
RNAmutants: Algorithms for exploring the RNA mutational landscape.Predict the effect of mutations on structures and reciprocally the influence of structures on mutations. A tool for molecular evolution studies and RNA design.
AmyloidMutants is a statistical mechanics approach for de novo prediction and analysis of wild-type and mutant amyloid structures. Based on the premise of protein mutational landscapes, AmyloidMutants energetically quantifies the effects of sequence mutation on fibril conformation and stability.Genomics
GLASS aligns large orthologous genomic regions using an iterative global alignment system. Rosetta identifies genes based on conservation of exonic features in sequences aligned by GLASS.
MinoTar – Predict microRNA Targets in Coding Sequence.Systems Biology
The Struct2Net program predicts protein-protein interactions (PPI) by integrating structure-based information with other functional annotations, e.g. GO, co-expression and co-localization etc. The structure-based protein interaction prediction is conducted using a protein threading server RAPTOR plus logistic regression.
IsoRank is an algorithm for global alignment of multiple protein-protein interaction (PPI) networks. The intuition is that a protein in one PPI network is a good match for a protein in another network if the former’s neighbors are good matches for the latter’s neighbors.Other
t-sample is an online algorithm for time-series experiments that allows an experimenter to determine which biological samples should be hybridized to arrays to recover expression profiles within a given error bound.http://people.csail.mit.edu/bab/computing_new.html#systems
Compressive genomics
Figures at a glance
Introduction
Conclusions
- Lander, E.S. et al. Nature 409, 860–921 (2001).
- Venter, J.C. et al. Science 291, 1304–1351 (2001).
- Kircher, M. & Kelso, J. Bioessays 32, 524–536 (2010).
- Kahn, S.D. Science 331, 728–729 (2011).
- Gross, M. Curr. Biol. 21, R204–R206 (2011).
- Huttenhower, C. & Hofmann, O. PLoS Comput. Biol. 6, e1000779 (2010).
- Schatz, M., Langmead, B. & Salzberg, S. Nat. Biotechnol. 28, 691–693 (2010).
- 1000 Genomes Project data available on Amazon Cloud. NIH press release, 29 March 2012.
- Stratton, M. Nat. Biotechnol. 26, 65–66 (2008).
- Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. J. Mol. Biol. 215, 403–410 (1990).
- Kent, W.J. Genome Res. 12, 656–664 (2002).
- Grumbach, S. & Tahi, F. J. Inf. Process. Manag. 30, 875–886 (1994).
- Chen, X., Li, M., Ma, B. & Tromp, J. Bioinformatics 18, 1696–1698 (2002).
- Christley, S., Lu, Y., Li, C. & Xie, X. Bioinformatics 25, 274–275 (2009).
- Brandon, M.C., Wallace, D.C. & Baldi, P. Bioinformatics 25, 1731–1738 (2009).
- Mäkinen, V., Navarro, G., Sirén, J. & Välimäki, N. in Research in Computational Molecular Biology, vol. 5541 of Lecture Notes in Computer Science (Batzoglou, S., ed.) 121–137 (Springer Berlin/Heidelberg, 2009).
- Kozanitis, C., Saunders, C., Kruglyak, S., Bafna, V. & Varghese, G. in Research in Computational Molecular Biology, vol. 6044 of Lecture Notes in Computer Science (Berger, B., ed.) 310–324 (Springer Berlin/Heidelberg, 2010).
- Hsi-Yang Fritz, M., Leinonen, R., Cochrane, G. & Birney, E. Genome Res. 21, 734–740 (2011).
- Mäkinen, V., Navarro, G., Sirén, J. & Välimäki, N. J. Comput. Biol. 17, 281–308 (2010).
- Deorowicz, S. & Grabowski, S. Bioinformatics 27, 2979–2986 (2011).
- Li, H., Ruan, J. & Durbin, R. Genome Res. 18, 1851–1858 (2008).
- Li, H. & Durbin, R. Bioinformatics 25, 1754–1760 (2009).
- Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. Genome Biol. 10, R25 (2009).
- Carter, D.M. Saccharomyces genome resequencing project. Wellcome Trust Sanger Institute http://www.sanger.ac.uk/Teams/Team118/sgrp/ (2005).
- Tweedie, S. et al. Nucleic Acids Res. 37, D555–D559 (2009).
Compressing a dataset with specialized algorithms is typically done in the context of data storage, where compression tools can shrink data to save space on a hard drive. But a group of researchers at MIT has developed tools that compute directly on compressed genomic datasets by exploiting the fact that most sequenced genomes are very similar to previously sequenced genomes.
by exploiting the fact that most sequenced genomes are very similar to previously sequenced genomes.
Led by MIT professor Bonnie Berger, the group has recently released tools called CaBlast and CaBlat, compressive versions of the widely used Blast and Blat alignment tools, respectively.
In a Nature Biotechnology paper published in July, Berger and her colleagues describe how the algorithms deliver alignment and analysis results up to four times faster than Blast and Blat when searching for a particular sequence in 36 yeast genomes.
“What we demonstrate is that the more highly similar genomes there are in a database, the greater the relative speed of CaBlast and CaBlat compared to the original non-compressive versions,” Berger says. “As we increase the number of genomes, the amount of work required for compressive algorithms scales only linearly in the amount of non-redundant data. The idea is that we’ve already done most of the work on the first genome.”
These two algorithms are still in the beta phase, and the MIT team has several refinements planned for future release to optimize performance. To that end, Berger has made the code for both algorithms available with the hope that developers will help them build “industrial-strength” software that can be used by the research community.
“To achieve optimal performance in real-use cases, we expect the code will need to be tuned for the engineering trade-offs specific to the application at hand,” she says. “The algorithm used to find and compress similar sequences in the database may need to be tweaked to take this issue into account, and the coarse- and fine-search steps should be aware of these constraints as well.”
While computing resources are becoming increasingly powerful, Berger contends that better algorithms and the use of compression technology will play a crucial role in helping researchers to keep up with the production of next-generation sequencing data.
![]() |
Matthew Dublin is a senior writer at Genome Technology. |
Related Stories
http://www.genomeweb.com/node/1122021/?hq_e=el&hq_m=1349154&hq_l=7&hq_v=09187c3305
- Researchers Find the Unexpected from Their Map of the Mouse Functional Genome
September 3, 2012 / Genome Technology
- Researchers Launch a New Journal to Focus on Microbiome Research
September 3, 2012 / Genome Technology
- Researchers Develop Method to Tell Methylation and Hydroxymethylation Apart
July 1, 2012 / Genome Technology
- NCI-Led Team Builds Petabyte-Scale Cancer Genome Data Repository
July 1, 2012 / Genome Technology
- New Algorithm Identifies Short Tandem Repeats From Next-Gen Sequencing Data
June 1, 2012 / Genome Technology




I actually consider this amazing blog , âSAME SCIENTIFIC IMPACT: Scientific Publishing –
Open Journals vs. Subscription-based « Pharmaceutical Intelligenceâ, very compelling plus the blog post ended up being a good read.
Many thanks,Annette
I actually consider this amazing blog , âSAME SCIENTIFIC IMPACT: Scientific Publishing –
Open Journals vs. Subscription-based « Pharmaceutical Intelligenceâ, very compelling plus the blog post ended up being a good read.
Many thanks,Annette