Simulation Tools of Genomic Next Generation Sequencing Data: Comparative Analysis & Genetic Simulation Resources

Lopez-Doriga A, Feliubadaló L, Menéndez M, Lopez-Doriga S, Morón-Duran FD, del Valle J, Tornero E, Montes E, Cuesta R, Campos O, Gómez C, Pineda M, González S, Moreno V, Capellá G, Lázaro C.

Hum Mutat. 2014 Mar;35(3):271-7.

PMID: 24227591

Development and analytical validation of a 25-gene next generation sequencing panel that includes the BRCA1 and BRCA2 genes to assess hereditary cancer risk.

Judkins T, Leclair B, Bowles K, Gutin N, Trost J, McCulloch J, Bhatnagar S, Murray A, Craft J, Wardell B, Bastian M, Mitchell J, Chen J, Tran T, Williams D, Potter J, Jammulapati S, Perry M, Morris B, Roa B, Timms K.

BMC Cancer. 2015 Apr 2;15:215. doi: 10.1186/s12885-015-1224-y.

PMID: 25886519

Free PMC Article

Clinical Applications of Next-Generation Sequencing in Cancer Diagnosis.

Sabour L, Sabour M, Ghorbian S.

Pathol Oncol Res. 2017 Apr;23(2):225-234. doi: 10.1007/s12253-016-0124-z. Epub 2016 Oct 8. Review.

PMID: 27722982

Studying cancer genomics through next-generation DNA sequencing and bioinformatics.

Doyle MA, Li J, Doig K, Fellowes A, Wong SQ.

Methods Mol Biol. 2014;1168:83-98. doi: 10.1007/978-1-4939-0847-9_6. Review.

PMID: 24870132

IMMUNOINFORMATICS

Immunoinformatics and epitope prediction in the age of genomic medicine.

Backert L, Kohlbacher O.

Genome Med. 2015 Nov 20;7:119. doi: 10.1186/s13073-015-0245-0. Review.

PMID: 26589500

Free PMC Article

IgSimulator: a versatile immunosequencing simulator.

Safonova Y, Lapidus A, Lill J.

Bioinformatics. 2015 Oct 1;31(19):3213-5. doi: 10.1093/bioinformatics/btv326. Epub 2015 May 25.

PMID: 26007226

Computational genomics tools for dissecting tumour-immune cell interactions.

Hackl H, Charoentong P, Finotello F, Trajanoski Z.

Nat Rev Genet. 2016 Jul 4;17(8):441-58. doi: 10.1038/nrg.2016.67. Review.

PMID: 27376489

RNA SEQUENCING

SimBA: A methodology and tools for evaluating the performance of RNA-Seq bioinformatic pipelines.

Audoux J, Salson M, Grosset CF, Beaumeunier S, Holder JM, Commes T, Philippe N.

BMC Bioinformatics. 2017 Sep 29;18(1):428. doi: 10.1186/s12859-017-1831-5.

PMID: 28969586

Free PMC Article

COMPLEX INSERTIONS AND DELETIONS

INDELseek: detection of complex insertions and deletions from next-generation sequencing data.

Au CH, Leung AY, Kwong A, Chan TL, Ma ES.

BMC Genomics. 2017 Jan 5;18(1):16. doi: 10.1186/s12864-016-3449-9.

PMID: 28056804

Free PMC Article

EVOLUTIONARY BIOLOGY

The State of Software for Evolutionary Biology.

Darriba D, Flouri T, Stamatakis A.

Mol Biol Evol. 2018 May 1;35(5):1037-1046. doi: 10.1093/molbev/msy014. Review.

PMID: 29385525

Free PMC Article

SIMULATION PROGRAMS

Systematic review of next-generation sequencing simulators: computational tools, features and perspectives.

Zhao M, Liu D, Qu H.

Brief Funct Genomics. 2017 May 1;16(3):121-128. doi: 10.1093/bfgp/elw012. Review.

PMID: 27069250

A comparison of tools for the simulation of genomic next-generation sequencing data

Merly Escalona,¹ Sara Rocha,¹ and David Posada¹,²

Nat Rev Genet. 2016 Aug; 17(8): 459–469.

Published online 2016 Jun 20. doi: 10.1038/nrg.2016.57

PMCID: PMC5224698

EMSID: EMS70941

PMID: 27320129

The publisher’s final edited version of this article is available at Nat Rev Genet

This article has been corrected. See Nat Rev Genet. 2018 October 03.

Online Summary

There is a large number of tools for the simulation of genomic data for all currently available NGS platforms, with partially overlapping functionality. Here we review 23 of these tools, highlighting their distinct functionalities, requirements and potential applications.

The parameterization of these simulators is often complex. Users may choose between using existing sets of parameter values, called profiles, or re-estimating them from their own data.

Parameters that can be modulated in these simulations include the effects of the PCR amplification of the libraries, read features and quality scores, base call errors, variation of sequencing depth across the genomes and the introduction of genomic variants.

Several types of genomic variants can be introduced in the simulated reads, such as SNPs, indels, inversions, translocations, copy-number variants and short-tandem repeats.

Reads can be generated from single or multiple genomes, and with distinct ploidy levels. NGS data from metagenomic communities can be simulated given an “abundance profile” that reflects the proportion of taxa in a given sample.

Many of the simulators have not been formally described and/or tested in dedicated publications. We encourage the formal publication of these tools and the realization of comprehensive, comparative benchmarkings.

Choosing among the different genomic NGS simulators is not easy. Here we provide a guidance tree to help users choose a suitable tool for their specific interests.
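The "abundance profile" mentioned in the summary above can be illustrated with a minimal sketch. The taxon names, proportions and the function `reads_per_taxon` are hypothetical, and the sketch treats abundance as the expected share of reads, ignoring genome length, which real metagenome simulators may handle differently.

```python
import random

def reads_per_taxon(abundance, total_reads, seed=0):
    """Draw how many reads each taxon contributes, given relative abundances."""
    rng = random.Random(seed)
    taxa = list(abundance)
    weights = [abundance[t] for t in taxa]
    counts = {t: 0 for t in taxa}
    # Each read is assigned to a taxon with probability proportional to its abundance.
    for taxon in rng.choices(taxa, weights=weights, k=total_reads):
        counts[taxon] += 1
    return counts

# Hypothetical abundance profile: proportions of three taxa in one sample.
profile = {"taxon_A": 0.6, "taxon_B": 0.3, "taxon_C": 0.1}
counts = reads_per_taxon(profile, total_reads=10_000)
print(counts)
```

Reads for each taxon would then be generated from that taxon's reference genome, using the drawn count as the number of reads to simulate.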

Abstract

Computer simulation of genomic data has become increasingly popular for assessing and validating biological models or to gain understanding about specific datasets. Multiple computational tools for the simulation of next-generation sequencing (NGS) data have been developed in recent years, which could be used to compare existing and new NGS analytical pipelines. Here we review 23 of these tools, highlighting their distinct functionality, requirements and potential applications. We also provide a decision tree for the informed selection of an appropriate NGS simulation tool for the specific question at hand.

Image source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5224698/

An overview of current NGS technologies

The most popular NGS technologies on the market are Illumina’s sequencing by synthesis, which is probably the most widely used platform at present¹⁷; Roche’s 454 pyrosequencing (454); SOLiD sequencing-by-ligation (SOLiD); IonTorrent semiconductor sequencing¹⁸ (IonTorrent); Pacific Biosciences’ (PacBio) single-molecule real-time sequencing¹⁹; and Oxford Nanopore Technologies’ (Nanopore) single-molecule DNA strand sequencing. These strategies differ, for example, in the type of reads they produce and the kind of sequencing errors they introduce (Table 1). Only two of the current technologies (Illumina and SOLiD) are capable of producing all three sequencing read types: single end, paired end and mate pair. Read length also depends on the machine and the kit used; in platforms like Illumina, SOLiD or IonTorrent it is possible to specify the number of desired base pairs per read. Depending on the sequencing run type selected, it is possible to obtain reads with maximum lengths of 75 bp (SOLiD), 300 bp (Illumina) or 400 bp (IonTorrent). On the other hand, in platforms like 454, Nanopore or PacBio, information is only given about the mean and maximum read length that can be obtained, with average lengths of 700 bp, 10 kb and 15 kb and maximum lengths of 1 kb, 10 kb and 15 kb, respectively. Error rates vary by platform, from ≤1% in Illumina to ~30% in Nanopore. Further overviews and comparisons of NGS strategies can be found in ⁵,²⁰–²².

Table 1

Main characteristics of current NGS technologies.

Technology	Run type: single-read	Run type: paired-end	Run type: mate-pair	Maximum read length	Quality scores	Error rates	References
Illumina	X	X	X	300 bp	> Q30	0.0034 – 1%	⁶⁵
SOLiD	X	X	X	75 bp	> Q30	0.01 – 1%	⁶⁶
IonTorrent	X	X	-	400 bp	~ Q20	1.78%	²²
454	X	X	-	~700 bp (up to 1 Kb)	> Q20	1.07 – 1.7%	⁵⁹,⁶⁷
Nanopore	X	-	-	5.4 – 10 Kb	NA	10 – 40%	⁶⁸–⁷²
PacBio	X	-	-	~15 Kb (up to 40 Kb)	< Q10	5 – 10%	²²,⁷³–⁷⁵
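To relate the quality-score and error-rate columns of Table 1, it may help to recall the standard Phred definition, Q = -10 log10(p), where p is the probability that a base call is wrong. The sketch below applies that definition; the function names are illustrative, and the `offset=33` default assumes the common Phred+33 FASTQ encoding. Note that the two columns in Table 1 are reported from different sources, so they need not agree exactly with this conversion.

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score corresponding to a per-base error probability p."""
    return -10 * math.log10(p)

def fastq_char_to_phred(ch, offset=33):
    """Decode one FASTQ quality character (Phred+33 encoding assumed)."""
    return ord(ch) - offset

for q in (10, 20, 30):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
```

Under this definition, Q20 corresponds to one expected error per 100 bases and Q30 to one per 1,000.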

Simulation parameters

The existing sequencing platforms use distinct protocols that result in datasets with different characteristics¹. Many of these attributes can be taken into account by the simulators (Fig. 2), although no single tool incorporates all possible variations. The main characteristics of the 23 simulators considered here are summarized in Tables 2 and 3. These tools differ in multiple aspects, such as sequencing technology, input requirements or output format, but share several common features. With some exceptions, all programs need a reference sequence, multiple parameter values indicating the characteristics of the sequencing experiment to be simulated (read length, error distribution, type of variation to be generated, if any, etc.) and/or a profile (a set of parameter values, conditions and/or data used for controlling the simulation), which can be provided by the simulator or estimated de novo from empirical data. The outcome will be aligned or unaligned reads in different standard file formats, such as FASTQ, FASTA or BAM. An overview of the NGS data simulation process is represented in Fig. 3. In the following sections we delve into the different steps involved.
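The common inputs and outputs described above (a reference sequence, a read length, an error rate, FASTQ output) can be sketched in a few lines. This is a deliberately simplified toy, not the method of any of the reviewed tools: it assumes single-end reads, a uniform substitution-only error model and a constant quality string, and the names `simulate_reads` and `to_fastq` are hypothetical.

```python
import math
import random

BASES = "ACGT"

def simulate_reads(reference, n_reads, read_length, error_rate, seed=0):
    """Yield (name, sequence, quality) tuples sampled uniformly from reference."""
    rng = random.Random(seed)
    # Constant Phred score implied by the uniform error rate, capped at Q40.
    phred = 40 if error_rate <= 0 else min(40, round(-10 * math.log10(error_rate)))
    quality = chr(phred + 33) * read_length
    for i in range(n_reads):
        start = rng.randrange(len(reference) - read_length + 1)
        bases = list(reference[start:start + read_length])
        for pos, base in enumerate(bases):
            # Substitution errors only; indels are ignored in this sketch.
            if rng.random() < error_rate:
                bases[pos] = rng.choice([b for b in BASES if b != base])
        yield f"read{i}_pos{start}", "".join(bases), quality

def to_fastq(reads):
    """Format reads as four-line FASTQ records."""
    return "".join(f"@{name}\n{seq}\n+\n{qual}\n" for name, seq, qual in reads)

# Toy random reference standing in for a real genome.
ref_rng = random.Random(42)
reference = "".join(ref_rng.choice(BASES) for _ in range(500))
reads = list(simulate_reads(reference, n_reads=20, read_length=50, error_rate=0.01))
fastq_text = to_fastq(reads)
print(fastq_text[:120])
```

Storing the true start position in the read name, as done here, is one simple way to keep the ground truth needed for later benchmarking.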

Figure 2

General overview of the sequencing process and steps that can be parameterized in the simulations.

NGS simulators try to imitate the real sequencing process as closely as possible by considering all the steps that could influence the characteristics of the reads. a | NGS simulators do not take into account the effect of the different DNA extraction protocols on the resulting data. However, they can consider whether the sample to be sequenced includes one or more individuals, from the same or different organisms (e.g., pool-sequencing, metagenomics). Pools of related genomes can be simulated by replicating the reference sequence and introducing variants on the resulting genomes. Some tools can also simulate metagenomes with distinct taxa abundance. b | Simulators can try to mimic the length range of DNA fragmentation (empirically obtained by sonication or digestion protocols) or assume a fixed amplicon length. c | Library preparation involves ligating sequencing-platform-dependent adaptors and/or barcodes to the selected DNA fragments (inserts). Some simulators can control the insert size and produce reads with adaptors/barcodes. d | Most NGS techniques include an amplification step for the preparation of libraries. Several simulators can take this step into account (for example, by introducing errors and/or chimaeras), with the possibility of specifying the number of reads per amplicon. e | Sequencing runs imply a decision about coverage, read length, read type (single-end, paired-end, mate-pair) and a given platform (with its specific errors and biases). Simulators exist for the different platforms, and they can use particular parameter profiles, often estimated from real data.
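Two of the choices listed in the caption above, coverage and insert size, lend themselves to a short worked sketch. It assumes the usual relation coverage = (number of reads x read length) / genome length, a normally distributed insert size, and error-free reads; all function names are illustrative and none is taken from a reviewed simulator.

```python
import math
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reads_for_coverage(genome_length, read_length, coverage):
    """Number of reads needed so that reads * read_length / genome_length >= coverage."""
    return math.ceil(coverage * genome_length / read_length)

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def sample_fragment(genome_length, mean_insert, sd_insert, read_length, rng):
    """Pick one fragment (start, length) with a normally distributed insert size."""
    length = int(round(rng.gauss(mean_insert, sd_insert)))
    # Clamp so the fragment holds at least one read and fits in the genome.
    length = max(read_length, min(length, genome_length))
    start = rng.randrange(genome_length - length + 1)
    return start, length

def paired_end(reference, start, length, read_length):
    """Forward read from the fragment start, reverse read from the opposite strand end."""
    fragment = reference[start:start + length]
    return fragment[:read_length], revcomp(fragment)[:read_length]

# 30x coverage of a 5 Mb genome with 150 bp reads.
print(reads_for_coverage(5_000_000, 150, 30))

rng = random.Random(1)
start, length = sample_fragment(10_000, mean_insert=400, sd_insert=50, read_length=100, rng=rng)
print(start, length)
```

For paired-end data the same formula applies if each mate is counted as one read.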

Figure 3

General overview of NGS simulation.

The simulation process begins with the input of a reference sequence (in most cases) and simulation parameters. Some of the parameters can be given via a profile that is estimated (by the simulator or other tools) from other reads or alignments. The outcome of this process may be reads (with or without quality information) or genome alignments in different formats.
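As a rough illustration of what estimating a profile from existing reads can mean, the sketch below computes the mean Phred quality at each read position from FASTQ text. Real profiles are usually much richer (error types, context dependence, length distributions); this sketch assumes unwrapped four-line FASTQ records with Phred+33 encoding, and the function names are hypothetical.

```python
def parse_fastq(text):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        yield lines[i][1:], lines[i + 1], lines[i + 3]

def quality_profile(records, offset=33):
    """Mean Phred score per read position, over reads long enough to cover it."""
    totals, counts = [], []
    for _, _, qual in records:
        for pos, ch in enumerate(qual):
            if pos == len(totals):
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(ch) - offset
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]

# Two toy reads of different lengths.
example = "@r1\nACGT\n+\nIIII\n@r2\nACG\n+\nI5+\n"
profile = quality_profile(parse_fastq(example))
print(profile)
```

A simulator could then draw per-position quality values around such a profile instead of using a single constant score.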

CONCLUSIONS

NGS is having a big impact in a broad range of areas that benefit from genetic information, from medical genomics, phylogenetics and population genomics to the reconstruction of ancient genomes, epigenomics and environmental barcoding. These applications include approaches such as de novo sequencing, resequencing, target sequencing or genome reduction methods. In all cases, caution is necessary in choosing a proper sequencing design and/or a reliable analytical approach for the specific biological question of interest. The simulation of NGS data can be extremely useful for planning experiments, testing hypotheses, benchmarking tools and evaluating particular results. Given a reference genome or dataset, for instance, one can experiment with an array of sequencing technologies to choose the best-suited technology and parameters for the particular goal, possibly optimizing time and costs. Yet this is still not standard practice, and researchers often base their choices on practical considerations such as the availability of technology and funding. As shown throughout this Review, simulation of NGS data from known genomes or transcriptomes can be extremely useful when evaluating assembly, mapping, phasing or genotyping algorithms (e.g., ²,⁷,¹⁰,¹³,⁶⁴), exposing their advantages and drawbacks under different circumstances.
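Because simulated reads have a known origin, the output of a mapping tool can be scored directly against the truth. The following sketch shows one simple way to do so; the dictionaries, the `tolerance` parameter and the function `mapping_accuracy` are illustrative assumptions, not part of any reviewed benchmark.

```python
def mapping_accuracy(truth, mapped, tolerance=0):
    """Precision and recall of reported positions against true positions.

    truth:  read name -> true start position
    mapped: read name -> reported start position, or None if unmapped
    """
    correct = sum(
        1
        for read, pos in mapped.items()
        if pos is not None and read in truth and abs(pos - truth[read]) <= tolerance
    )
    n_mapped = sum(1 for pos in mapped.values() if pos is not None)
    precision = correct / n_mapped if n_mapped else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical truth from a simulator and hypothetical output from a mapper.
truth = {"r1": 100, "r2": 250, "r3": 900, "r4": 40}
mapped = {"r1": 100, "r2": 252, "r3": None, "r4": 700}
precision, recall = mapping_accuracy(truth, mapped, tolerance=5)
print(precision, recall)
```

In this toy case two of the three mapped reads are placed correctly and two of the four simulated reads are recovered.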

Altogether, current NGS simulators consider most, if not all, of the important features regarding the generation of NGS data. However, they are not problem-free. The different simulators are largely redundant, implementing the same or very similar procedures. In our opinion, many are poorly documented and can be difficult for non-experts to use, and some of them are no longer maintained. Most importantly, for the most part they have not been benchmarked or validated. Remarkably, among the 23 tools considered here, only 13 have been described in dedicated application notes, 3 have been mentioned as add-ons in the methods section of bigger articles, and 5 have never been referenced in a journal. Indeed, peer-reviewed publication of these tools in dedicated articles would be highly desirable. While this would not definitively guarantee quality, at least it would encourage authors to reach minimum standards in terms of validation, benchmarking and documentation. Collaborative efforts like the Assemblathon (e.g., ²⁷) or iEvo (http://www.ievobio.org/) might also be a source of inspiration. Meanwhile, we hope that the decision tree presented in Fig. 1 helps users make appropriate choices.

SOURCE

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5224698/

REFERENCES

Systematic benchmarking of omics computational tools

Serghei Mangul, Lana S. Martin, Brian L. Hill, Angela Ka-Mei Lam, Margaret G. Distler, Alex Zelikovsky, Eleazar Eskin, Jonathan Flint

Nat Commun. 2019; 10: 1393. Published online 2019 Mar 27. doi: 10.1038/s41467-019-09406-4

PMCID: PMC6437167

Long fragments achieve lower base quality in Illumina paired-end sequencing

Ge Tan, Lennart Opitz, Ralph Schlapbach, Hubert Rehrauer

Sci Rep. 2019; 9: 2856. Published online 2019 Feb 27. doi: 10.1038/s41598-019-39076-7

PMCID: PMC6393434

sim1000G: a user-friendly genetic variant simulator in R for unrelated individuals and family-based designs

Apostolos Dimitromanolakis, Jingxiong Xu, Agnieszka Krol, Laurent Briollais

BMC Bioinformatics. 2019; 20: 26. Published online 2019 Jan 15. doi: 10.1186/s12859-019-2611-1

PMCID: PMC6332552

Analysis validation has been neglected in the Age of Reproducibility

Kathleen E. Lotterhos, Jason H. Moore, Ann E. Stapleton

PLoS Biol. 2018 Dec; 16(12): e3000070. Published online 2018 Dec 10. doi: 10.1371/journal.pbio.3000070

PMCID: PMC6301703

Enterovirus D68 – The New Polio?

Hayley Cassidy, Randy Poelman, Marjolein Knoester, Coretta C. Van Leer-Buter, Hubert G. M. Niesters

Front Microbiol. 2018; 9: 2677. Published online 2018 Nov 13. doi: 10.3389/fmicb.2018.02677

PMCID: PMC6243117

Genetic Simulation Resources and the GSR Certification Program

Bo Peng, Man Chong Leong, Huann-Sheng Chen, Melissa Rotunno, Katy R Brignole, John Clarke, Leah E Mechanic

Bioinformatics. 2019 Feb 15; 35(4): 709–710. Published online 2018 Aug 7. doi: 10.1093/bioinformatics/bty666

PMCID: PMC6378936

Currently embargoed: free in PMC on Feb 15, 2020

Simulating Illumina metagenomic data with InSilicoSeq

Hadrien Gourlé, Oskar Karlsson-Lindsjö, Juliette Hayer, Erik Bongcam-Rudloff

Bioinformatics. 2019 Feb 1; 35(3): 521–522. Published online 2018 Jul 19. doi: 10.1093/bioinformatics/bty630

PMCID: PMC6361232

NPBSS: a new PacBio sequencing simulator for generating the continuous long reads with an empirical model

Ze-Gang Wei, Shao-Wu Zhang

BMC Bioinformatics. 2018; 19: 177. Published online 2018 May 22. doi: 10.1186/s12859-018-2208-0

PMCID: PMC5964698

DeepSimulator: a deep simulator for Nanopore sequencing

Yu Li, Renmin Han, Chongwei Bi, Mo Li, Sheng Wang, Xin Gao

Bioinformatics. 2018 Sep 1; 34(17): 2899–2908. Published online 2018 Apr 6. doi: 10.1093/bioinformatics/bty223

PMCID: PMC6129308

Xome-Blender: A novel cancer genome simulator

Roberto Semeraro, Valerio Orlandini, Alberto Magi

PLoS One. 2018; 13(4): e0194472. Published online 2018 Apr 5. doi: 10.1371/journal.pone.0194472

PMCID: PMC5886411

Bamgineer: Introduction of simulated allele-specific copy number variants into exome and targeted sequence data sets

Soroush Samadian, Jeff P. Bruce, Trevor J. Pugh

PLoS Comput Biol. 2018 Mar; 14(3): e1006080. Published online 2018 Mar 28. doi: 10.1371/journal.pcbi.1006080

PMCID: PMC5891060

Environmental and Host Effects on Skin Bacterial Community Composition in Panamanian Frogs

Brandon J. Varela, David Lesbarrères, Roberto Ibáñez, David M. Green

Front Microbiol. 2018; 9: 298. Published online 2018 Feb 22. doi: 10.3389/fmicb.2018.00298

PMCID: PMC5826957

Novel read density distribution score shows possible aligner artefacts, when mapping a single chromosome

Fedor M. Naumenko, Irina I. Abnizova, Nathan Beka, Mikhail A. Genaev, Yuriy L. Orlov

BMC Genomics. 2018; 19(Suppl 3): 92. Published online 2018 Feb 9. doi: 10.1186/s12864-018-4475-6

PMCID: PMC5836841

HgtSIM: a simulator for horizontal gene transfer (HGT) in microbial communities

Weizhi Song, Kerrin Steensen, Torsten Thomas

PeerJ. 2017; 5: e4015. Published online 2017 Nov 8. doi: 10.7717/peerj.4015

PMCID: PMC5681852

Profiling of Short-Tandem-Repeat Disease Alleles in 12,632 Human Whole Genomes

Haibao Tang, Ewen F. Kirkness, Christoph Lippert, William H. Biggs, Martin Fabani, Ernesto Guzman, Smriti Ramakrishnan, Victor Lavrenko, Boyko Kakaradov, Claire Hou, Barry Hicks, David Heckerman, Franz J. Och, C. Thomas Caskey, J. Craig Venter, Amalio Telenti

Am J Hum Genet. 2017 Nov 2; 101(5): 700–715. Published online 2017 Nov 2. doi: 10.1016/j.ajhg.2017.09.013

PMCID: PMC5673627

Simulating the dynamics of targeted capture sequencing with CapSim

Minh Duc Cao, Devika Ganesamoorthy, Chenxi Zhou, Lachlan J M Coin

Bioinformatics. 2018 Mar 1; 34(5): 873–874. Published online 2017 Oct 28. doi: 10.1093/bioinformatics/btx691

PMCID: PMC6192212

Next-generation sequencing applications in clinical bacteriology

Yair Motro, Jacob Moran-Gilad

Biomol Detect Quantif. 2017 Dec; 14: 1–6. Published online 2017 Oct 23. doi: 10.1016/j.bdq.2017.10.002

PMCID: PMC5727008

A multi-scenario genome-wide medical population genetics simulation framework

Jacquiline W Mugo, Ephifania Geza, Joel Defo, Samar S M Elsheikh, Gaston K Mazandu, Nicola J Mulder, Emile R Chimusa

Bioinformatics. 2017 Oct 1; 33(19): 2995–3002. Published online 2017 Jun 24. doi: 10.1093/bioinformatics/btx369

PMCID: PMC5870573

Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads

Ryan R. Wick, Louise M. Judd, Claire L. Gorrie, Kathryn E. Holt

PLoS Comput Biol. 2017 Jun; 13(6): e1005595. Published online 2017 Jun 8. doi: 10.1371/journal.pcbi.1005595

PMCID: PMC5481147

NanoSim: nanopore sequence read simulator based on statistical characterization

Chen Yang, Justin Chu, René L Warren, Inanç Birol

Gigascience. 2017 Apr; 6(4): 1–6. Published online 2017 Feb 24. doi: 10.1093/gigascience/gix010

PMCID: PMC5530317

Leaders in Pharmaceutical Business Intelligence Group, LLC, Doing Business As LPBI Group, Newton, MA