Simulation Tools of Genomic Next Generation Sequencing Data: Comparative Analysis & Genetic Simulation Resources
Reporting: Aviva Lev-Ari, PhD, RN
INTRODUCTION
What is next generation sequencing?
Behjati S, Tarpey PS.
Arch Dis Child Educ Pract Ed. 2013 Dec;98(6):236-8. doi: 10.1136/archdischild-2013-304340. Epub 2013 Aug 28. Review.
Computational pan-genomics: status, promises and challenges.
Computational Pan-Genomics Consortium.
Brief Bioinform. 2018 Jan 1;19(1):118-135. doi: 10.1093/bib/bbw089. Review.
Dahlö M, Scofield DG, Schaal W, Spjuth O.
Gigascience. 2018 May 1;7(5). doi: 10.1093/gigascience/giy028.
PMID: 29659792
NGS IN THE CLINIC
[Clinical Applications of Next-Generation Sequencing].
Rebollar-Vega RG, Arriaga-Canon C, de la Rosa-Velázquez IA.
Rev Invest Clin. 2018;70(4):153-157. doi: 10.24875/RIC.18002544.
Clinical Genomics: Challenges and Opportunities.
Vijay P, McIntyre AB, Mason CE, Greenfield JP, Li S.
Crit Rev Eukaryot Gene Expr. 2016;26(2):97-113. doi: 10.1615/CritRevEukaryotGeneExpr.2016015724. Review.
Next-generation sequencing in the clinic: promises and challenges.
Xuan J, Yu Y, Qing T, Guo L, Shi L.
Cancer Lett. 2013 Nov 1;340(2):284-95. doi: 10.1016/j.canlet.2012.11.025. Epub 2012 Nov 19. Review.
The Future of Whole-Genome Sequencing for Public Health and the Clinic.
Allard MW.
J Clin Microbiol. 2016 Aug;54(8):1946-8. doi: 10.1128/JCM.01082-16. Epub 2016 Jun 15.
PMID: 27307454
Roy S, Coldren C, Karunamurthy A, Kip NS, Klee EW, Lincoln SE, Leon A, Pullambhatla M, Temple-Smolkin RL, Voelkerding KV, Wang C, Carter AB.
J Mol Diagn. 2018 Jan;20(1):4-27. doi: 10.1016/j.jmoldx.2017.11.003. Epub 2017 Nov 21. Review.
PMID: 29154853
MUTATION ANALYSIS – GENE ENCODING
Nagy PL, Worman HJ.
Methods Mol Biol. 2018;1840:321-336. doi: 10.1007/978-1-4939-8691-0_22.
PMID: 30141054
Genome-wide genetic marker discovery and genotyping using next-generation sequencing.
Davey JW, Hohenlohe PA, Etter PD, Boone JQ, Catchen JM, Blaxter ML.
Nat Rev Genet. 2011 Jun 17;12(7):499-510. doi: 10.1038/nrg3012. Review.
PMID: 21681211
Best practices for evaluating mutation prediction methods.
Rogan PK, Zou GY.
Hum Mutat. 2013 Nov;34(11):1581-2. doi: 10.1002/humu.22401. Epub 2013 Sep 10. No abstract available.
PMID: 23955774
MITOCHONDRIAL VARIATIONS
Vellarikkal SK, Dhiman H, Joshi K, Hasija Y, Sivasubbu S, Scaria V.
Hum Mutat. 2015 Apr;36(4):419-24. doi: 10.1002/humu.22767.
PMID: 25677119
VARIANT ANALYSIS
A survey of tools for variant analysis of next-generation genome sequencing data.
Pabinger S, Dander A, Fischer M, Snajder R, Sperk M, Efremova M, Krabichler B, Speicher MR, Zschocke J, Trajanoski Z.
Brief Bioinform. 2014 Mar;15(2):256-78. doi: 10.1093/bib/bbs086. Epub 2013 Jan 21.
PMID: 23341494
Variant callers for next-generation sequencing data: a comparison study.
Liu X, Han S, Wang Z, Gelernter J, Yang BZ.
PLoS One. 2013 Sep 27;8(9):e75619. doi: 10.1371/journal.pone.0075619. eCollection 2013.
PMID: 24086590
VARIANT DETECTION IN HEREDITARY CANCER GENES
Lopez-Doriga A, Feliubadaló L, Menéndez M, Lopez-Doriga S, Morón-Duran FD, del Valle J, Tornero E, Montes E, Cuesta R, Campos O, Gómez C, Pineda M, González S, Moreno V, Capellá G, Lázaro C.
Hum Mutat. 2014 Mar;35(3):271-7.
Judkins T, Leclair B, Bowles K, Gutin N, Trost J, McCulloch J, Bhatnagar S, Murray A, Craft J, Wardell B, Bastian M, Mitchell J, Chen J, Tran T, Williams D, Potter J, Jammulapati S, Perry M, Morris B, Roa B, Timms K.
BMC Cancer. 2015 Apr 2;15:215. doi: 10.1186/s12885-015-1224-y.
Clinical Applications of Next-Generation Sequencing in Cancer Diagnosis.
Sabour L, Sabour M, Ghorbian S.
Pathol Oncol Res. 2017 Apr;23(2):225-234. doi: 10.1007/s12253-016-0124-z. Epub 2016 Oct 8. Review.
PMID: 27722982
Studying cancer genomics through next-generation DNA sequencing and bioinformatics.
Doyle MA, Li J, Doig K, Fellowes A, Wong SQ.
Methods Mol Biol. 2014;1168:83-98. doi: 10.1007/978-1-4939-0847-9_6. Review.
PMID: 24870132
IMMUNOINFORMATICS
Immunoinformatics and epitope prediction in the age of genomic medicine.
Backert L, Kohlbacher O.
Genome Med. 2015 Nov 20;7:119. doi: 10.1186/s13073-015-0245-0. Review.
PMID: 26589500
IgSimulator: a versatile immunosequencing simulator.
Safonova Y, Lapidus A, Lill J.
Bioinformatics. 2015 Oct 1;31(19):3213-5. doi: 10.1093/bioinformatics/btv326. Epub 2015 May 25.
Computational genomics tools for dissecting tumour-immune cell interactions.
Hackl H, Charoentong P, Finotello F, Trajanoski Z.
Nat Rev Genet. 2016 Jul 4;17(8):441-58. doi: 10.1038/nrg.2016.67. Review.
PMID: 27376489
RNA SEQUENCING
SimBA: A methodology and tools for evaluating the performance of RNA-Seq bioinformatic pipelines.
Audoux J, Salson M, Grosset CF, Beaumeunier S, Holder JM, Commes T, Philippe N.
BMC Bioinformatics. 2017 Sep 29;18(1):428. doi: 10.1186/s12859-017-1831-5.
INDELseek: detection of complex insertions and deletions from next-generation sequencing data.
Au CH, Leung AY, Kwong A, Chan TL, Ma ES.
BMC Genomics. 2017 Jan 5;18(1):16. doi: 10.1186/s12864-016-3449-9.
The State of Software for Evolutionary Biology.
Darriba D, Flouri T, Stamatakis A.
Mol Biol Evol. 2018 May 1;35(5):1037-1046. doi: 10.1093/molbev/msy014. Review.
PMID: 29385525
SIMULATION PROGRAMS
Published online 2016 Jun 20. doi: 10.1038/nrg.2016.57
Zhao M, Liu D, Qu H.
Brief Funct Genomics. 2017 May 1;16(3):121-128. doi: 10.1093/bfgp/elw012. Review.
A comparison of tools for the simulation of genomic next-generation sequencing data
Online Summary
A large number of tools are available for simulating genomic data for all currently available NGS platforms, with partially overlapping functionality. Here we review 23 of these tools, highlighting their distinct functionalities, requirements and potential applications.
The parameterization of these simulators is often complex. The user may choose between using existing sets of parameter values, called profiles, or re-estimating them from their own data.
Parameters that can be modulated in these simulations include the effects of the PCR amplification of the libraries, read features and quality scores, base call errors, variation of sequencing depth across the genomes and the introduction of genomic variants.
Several types of genomic variants can be introduced in the simulated reads, such as SNPs, indels, inversions, translocations, copy-number variants and short-tandem repeats.
Reads can be generated from single or multiple genomes, and with distinct ploidy levels. NGS data from metagenomic communities can be simulated given an “abundance profile” that reflects the proportion of taxa in a given sample.
Many of the simulators have not been formally described and/or tested in dedicated publications. We encourage the formal publication of these tools and comprehensive comparative benchmarking.
Choosing among the different genomic NGS simulators is not easy. Here we provide a guidance tree to help users choose a suitable tool for their specific interests.
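The idea of an "abundance profile" mentioned in the summary can be illustrated with a small sketch. It is not taken from any of the reviewed simulators: the function name, the taxon labels and the proportions are hypothetical, and it treats abundance simply as the fraction of reads per taxon, ignoring genome size.

```python
import random

def allocate_reads(abundance, total_reads, seed=0):
    """Assign each simulated read to a taxon according to relative abundance.

    `abundance` maps taxon name -> relative weight (hypothetical values).
    Weights do not have to sum to 1; random.choices normalizes them.
    """
    rng = random.Random(seed)
    taxa = list(abundance)
    weights = [abundance[t] for t in taxa]
    draws = rng.choices(taxa, weights=weights, k=total_reads)
    counts = {t: 0 for t in taxa}
    for taxon in draws:
        counts[taxon] += 1
    return counts

# Hypothetical three-taxon community
profile = {"taxon_A": 0.6, "taxon_B": 0.3, "taxon_C": 0.1}
counts = allocate_reads(profile, 10_000)
```

Each taxon's read count would then be passed to a read generator together with that taxon's reference genome. Real metagenomic simulators may define abundance differently (for example, per genome copy rather than per read).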
Abstract
Computer simulation of genomic data has become increasingly popular for assessing and validating biological models or to gain understanding about specific datasets. Multiple computational tools for the simulation of next-generation sequencing (NGS) data have been developed in recent years, which could be used to compare existing and new NGS analytical pipelines. Here we review 23 of these tools, highlighting their distinct functionality, requirements and potential applications. We also provide a decision tree for the informed selection of an appropriate NGS simulation tool for the specific question at hand.
Image source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5224698/
An overview of current NGS technologies
The most popular NGS technologies on the market are Illumina’s sequencing by synthesis, which is probably the most widely used platform at present [17], Roche’s 454 pyrosequencing (454), SOLiD sequencing-by-ligation (SOLiD), IonTorrent semiconductor sequencing [18] (IonTorrent), Pacific Biosciences’s (PacBio) single molecule real-time sequencing [19], and Oxford Nanopore Technologies (Nanopore) single-cell DNA template strand sequencing. These strategies can differ, for example, regarding the type of reads they produce or the kind of sequencing errors they introduce (Table 1). Only two of the current technologies (Illumina and SOLiD) are capable of producing all three sequencing read types: single end, paired end and mate pair. Read length is also dependent on the machine and the kit used; in platforms like Illumina, SOLiD, or IonTorrent it is possible to specify the number of desired base pairs per read. According to the sequencing run type selected it is possible to obtain reads with maximum lengths of 75 bp (SOLiD), 300 bp (Illumina) or 400 bp (IonTorrent). On the other hand, in platforms like 454, Nanopore or PacBio, information is only given about the mean and maximum read length that can be obtained, with average lengths of 700 bp, 10 kb and 15 kb and maximum lengths of 1 kb, 10 kb and 15 kb, respectively. Error rates vary depending on the platform from ≤1% in Illumina to ~30% in Nanopore. Further overviews and comparisons of NGS strategies can be found in [5, 20–22].
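Per-base error rates like those quoted above are commonly expressed as Phred quality scores, where Q = -10 · log10(P) and P is the probability that a base call is wrong. The following minimal sketch converts between the two and estimates the expected number of erroneous bases in a read, assuming a uniform per-base error rate. The function names are illustrative and not part of any reviewed tool.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred quality score to a base-call error probability."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert a base-call error probability (0 < p <= 1) to a Phred score."""
    return -10 * math.log10(p)

def expected_errors(read_length, error_rate):
    """Expected number of wrong bases in a read, assuming a uniform error rate."""
    return read_length * error_rate

# Q30 corresponds to an error probability of 0.1%, Q20 to 1%
p_q30 = phred_to_error_prob(30)
q_for_1_percent = error_prob_to_phred(0.01)
# A 300 bp read at a 1% error rate carries about 3 erroneous bases on average
errs = expected_errors(300, 0.01)
```

In practice error rates are rarely uniform along a read, which is one reason simulators rely on empirically estimated profiles.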
Table 1
Main characteristics of current NGS technologies.
Technology | Single-read | Paired-end | Mate-pair | Maximum Read Length | Quality Scores | Error Rates | References
Illumina | X | X | X | 300 bp | > Q30 | 0.0034 – 1% | 65
SOLiD | X | X | X | 75 bp | > Q30 | 0.01 – 1% | 66
IonTorrent | X | X | | 400 bp | ~ Q20 | 1.78% | 22
454 | X | X | | ~700 bp (up to 1 Kb) | > Q20 | 1.07 – 1.7% | 59,67
Nanopore | X | | | 5.4 – 10 Kb | NA | 10 – 40% | 68–72
PacBio | X | | | ~15 Kb (up to 40 Kb) | < Q10 | 5 – 10% | 22,73–75

Simulation parameters
The existing sequencing platforms use distinct protocols that result in datasets with different characteristics [1]. Many of these attributes can be taken into account by the simulators (Fig. 2), although there is not a single tool that incorporates all possible variations. The main characteristics of the 23 simulators considered here are summarized in Tables 2 and 3. These tools differ in multiple aspects, such as sequencing technology, input requirements or output format, but maintain several common aspects. With some exceptions, all programs need a reference sequence, multiple parameter values indicating the characteristics of the sequencing experiment to be simulated (read length, error distribution, type of variation to be generated, if any, etc.) and/or a profile (a set of parameter values, conditions and/or data used for controlling the simulation), which can be provided by the simulator or estimated de novo from empirical data. The outcome will be aligned or unaligned reads in different standard file formats, such as FASTQ, FASTA or BAM. An overview of the NGS data simulation process is represented in Fig. 3. In the following sections we delve into the different steps involved.
Figure 2. General overview of the sequencing process and steps that can be parameterized in the simulations.
NGS simulators try to imitate the real sequencing process as closely as possible by considering all the steps that could influence the characteristics of the reads. a | NGS simulators do not take into account the effect of the different DNA extraction protocols in the resulting data. However, they can consider whether the sample we want to sequence includes one or more individuals, from the same or different organisms (e.g., pool-sequencing, metagenomics). Pools of related genomes can be simulated by replicating the reference sequence and introducing variants on the resulting genomes. Some tools can also simulate metagenomes with distinct taxa abundance. b | Simulators can try to mimic the length range of DNA fragmentation (empirically obtained by sonication or digestion protocols) or assume a fixed amplicon length. c | Library preparation involves ligating sequencing-platform-dependent adaptors and/or barcodes to the selected DNA fragments (inserts). Some simulators can control the insert size, and produce reads with adaptors/barcodes. d | Most NGS techniques include an amplification step for the preparation of libraries. Several simulators can take this step into account (for example, by introducing errors and/or chimaeras), with the possibility of specifying the number of reads per amplicon. e | Sequencing runs imply a decision about coverage, read length, read type (single-end, paired-end, mate-pair) and a given platform (with their specific errors and biases). Simulators exist for the different platforms, and they can use particular parameter profiles, often estimated from real data.
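The coverage decision mentioned in step e can be made concrete with the standard relation between mean coverage C, number of reads N, read length L and genome length G: C = N · L / G. The sketch below is only an illustration under that idealized relation (it ignores unmapped reads, duplicates and uneven depth), and the function names and example values are hypothetical.

```python
import math

def reads_for_coverage(genome_length, read_length, coverage, paired=False):
    """Reads (or read pairs, if paired=True) needed to reach a mean coverage.

    Uses C = N * L / G; a pair contributes two reads of `read_length` bases.
    """
    bases_per_unit = read_length * (2 if paired else 1)
    return math.ceil(coverage * genome_length / bases_per_unit)

def mean_coverage(n_reads, read_length, genome_length):
    """Mean depth of coverage obtained from n_reads reads of a fixed length."""
    return n_reads * read_length / genome_length

# Hypothetical 5 Mb genome, 150 bp reads, 30x target coverage
n_single = reads_for_coverage(5_000_000, 150, 30)
n_pairs = reads_for_coverage(5_000_000, 150, 30, paired=True)
depth = mean_coverage(n_single, 150, 5_000_000)
```

Many simulators accept either a read count or a target coverage, so a conversion like this is useful when comparing runs across tools.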
Figure 3. General overview of NGS simulation.
The simulation process begins with the input of a reference sequence (most cases) and simulation parameters. Some of the parameters can be given via a profile, that is estimated (by the simulator or other tools) from other reads or alignments. The outcome of this process may be reads (with or without quality information) or genome alignments in different formats.
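The process described above (a reference sequence plus parameters in, reads with quality information out) can be sketched as a toy single-end simulator. This is a deliberately simplified illustration, not the method of any reviewed tool: it assumes uniform read start positions, a fixed read length, substitution errors only at a constant rate, and a single constant quality value. All names are hypothetical.

```python
import math
import random

BASES = "ACGT"

def simulate_reads(reference, n_reads, read_length, error_rate, seed=0):
    """Return FASTQ records for single-end reads with uniform substitution errors."""
    if read_length > len(reference):
        raise ValueError("read_length exceeds reference length")
    rng = random.Random(seed)
    # One constant Phred score implied by the error rate, capped at Q40,
    # written in Phred+33 encoding.
    q = 40 if error_rate <= 0 else min(40, round(-10 * math.log10(error_rate)))
    qual = chr(33 + max(q, 0)) * read_length
    records = []
    for i in range(n_reads):
        # Uniformly chosen start position on the forward strand
        start = rng.randrange(len(reference) - read_length + 1)
        read = []
        for base in reference[start:start + read_length]:
            if rng.random() < error_rate:
                # Substitute with one of the three other bases
                base = rng.choice([b for b in BASES if b != base])
            read.append(base)
        records.append(f"@read_{i}_pos{start}\n{''.join(read)}\n+\n{qual}")
    return records

# Hypothetical 200 bp reference, ten 25 bp reads at a 1% error rate
ref = "ACGT" * 50
fastq_records = simulate_reads(ref, 10, 25, 0.01)
```

Real simulators replace each simplifying assumption with platform-specific models, typically driven by a profile estimated from empirical reads or alignments.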
CONCLUSIONS
NGS is having a big impact on a broad range of areas that benefit from genetic information, from medical genomics, phylogenetic and population genomics, to the reconstruction of ancient genomes, epigenomics and environmental barcoding. These applications include approaches such as de novo sequencing, resequencing, target sequencing or genome reduction methods. In all cases, caution is necessary in choosing a proper sequencing design and/or a reliable analytical approach for the specific biological question of interest. The simulation of NGS data can be extremely useful for planning experiments, testing hypotheses, benchmarking tools and evaluating particular results. Given a reference genome or dataset, for instance, one can play with an array of sequencing technologies to choose the best-suited technology and parameters for the particular goal, possibly optimizing time and costs. Yet, this is still not the standard practice and researchers often base their choices on practical considerations like the availability of technology and funding. As shown throughout this Review, simulation of NGS data from known genomes or transcriptomes can be extremely useful when evaluating assembly, mapping, phasing or genotyping algorithms (e.g. [2, 7, 10, 13, 64]), exposing their advantages and drawbacks under different circumstances.
Altogether, current NGS simulators consider most, if not all, of the important features regarding the generation of NGS data. However, they are not problem-free. The different simulators are largely redundant, implementing the same or very similar procedures. In our opinion, many are poorly documented and can be difficult to use for non-experts, and some of them are no longer maintained. Most importantly, for the most part they have not been benchmarked or validated. Remarkably, among the 23 tools considered here, only 13 have been described in dedicated application notes, 3 have been mentioned as add-ons in the methods section of bigger articles, and 5 have never been referenced in a journal. Indeed, peer-reviewed publication of these tools in dedicated articles would be highly desirable. While this would not definitively guarantee quality, at least it would encourage authors to reach minimum standards in terms of validation, benchmarking, and documentation. Collaborative efforts like the Assemblathon (e.g. [27]) or iEvo (http://www.ievobio.org/) might also be a source of inspiration. Meanwhile, we hope that the decision tree presented in Fig. 1 helps users make appropriate choices.
SOURCE
PMCID: PMC6437167
PMCID: PMC6393434
PMCID: PMC6332552
PMCID: PMC6301703
PMCID: PMC6243117
PMCID: PMC6378936
PMCID: PMC6361232
PMCID: PMC5964698
PMCID: PMC6129308
PMCID: PMC5886411
PMCID: PMC5891060
PMCID: PMC5826957
PMCID: PMC5836841
PMCID: PMC5681852
PMCID: PMC5673627
PMCID: PMC6192212
PMCID: PMC5727008
PMCID: PMC5870573
PMCID: PMC5481147
PMCID: PMC5530317