Simulation Modeling in NGS | Leaders in Pharmaceutical Business Intelligence (LPBI) Group

Archive for the ‘Simulation Modeling in NGS’ Category

Data Science: Step by Step – A Resource for LPBI Group’s One-Year Internship in IT, IS, DS

Posted in Advanced Computing Platform, Artificial Intelligence - Breakthroughs in Theories and Technologies, Artificial Intelligence - General, Big Data, BioIT: BioInformatics, Computational Biology/Systems and Bioinformatics, Intelligent Information Systems, Machine Learning, Natural Language Processing (NLP), Simulation Modeling in NGS on May 7, 2022


Reporter: Aviva Lev-Ari, PhD, RN

9 free Harvard courses: learning Data Science

In this article, I will list 9 free Harvard courses that you can take to learn data science from scratch. Feel free to skip any of these courses if you already possess knowledge of that subject.

Step 1: Programming

The first step you should take when learning data science is to learn to code. You can do this with the programming language of your choice, ideally Python or R.

If you’d like to learn R, Harvard offers an introductory R course created specifically for data science learners, called Data Science: R Basics.

This program will take you through R concepts like variables, data types, vector arithmetic, and indexing. You will also learn to wrangle data with libraries like dplyr and create plots to visualize data.

If you prefer Python, you can choose to take CS50’s Introduction to Programming with Python offered for free by Harvard. In this course, you will learn concepts like functions, arguments, variables, data types, conditional statements, loops, objects, methods, and more.

Both programs above are self-paced. However, the Python course is more detailed than the R program, and requires a longer time commitment to complete. Also, the rest of the courses in this roadmap are taught in R, so it might be worth learning R to be able to follow along easily.

Step 2: Data Visualization

Visualization is one of the most powerful techniques for communicating your findings in data to another person.

With Harvard’s Data Visualization program, you will learn to build visualizations using the ggplot2 library in R, along with the principles of communicating data-driven insights.

Step 3: Probability

In this course, you will learn essential probability concepts that are fundamental to conducting statistical tests on data. The topics taught include random variables, independence, Monte Carlo simulations, expected values, standard errors, and the Central Limit Theorem.

The concepts above will be introduced with the help of a case study, which means that you will be able to apply everything you learned to an actual real-world dataset.
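To make the link between Monte Carlo simulation, expected values, standard errors, and the Central Limit Theorem more concrete, here is a minimal sketch in Python (the course itself is taught in R). The die-rolling setup and the function name `simulate_sample_means` are assumptions made for this illustration and are not taken from the course material.

```python
import random
import statistics

def simulate_sample_means(n_rolls, n_experiments, seed=0):
    """Return the mean of n_rolls fair-die rolls, repeated n_experiments times."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_experiments):
        rolls = [rng.randint(1, 6) for _ in range(n_rolls)]
        means.append(statistics.fmean(rolls))
    return means

# Theoretical values for a single fair six-sided die.
EXPECTED_VALUE = 3.5
DIE_SD = statistics.pstdev(range(1, 7))

means = simulate_sample_means(n_rolls=100, n_experiments=2000)
observed_se = statistics.stdev(means)      # spread of the simulated sample means
theoretical_se = DIE_SD / 100 ** 0.5       # sd / sqrt(n), as the CLT predicts

print(f"average of sample means:    {statistics.fmean(means):.3f}")
print(f"observed standard error:    {observed_se:.3f}")
print(f"theoretical standard error: {theoretical_se:.3f}")
```

The simulated sample means should cluster around the expected value, and their spread should be close to the theoretical standard error.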

Step 4: Statistics

After learning probability, you can take this course to learn the fundamentals of statistical inference and modelling.
This program will teach you to define population estimates and margins of error, introduce you to Bayesian statistics, and provide you with the fundamentals of predictive modeling.
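As a small illustration of what a margin of error is, the sketch below computes an approximate 95% margin of error for a sample proportion using the normal approximation. It is written in Python rather than R, and the poll numbers are hypothetical values chosen only for the example.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for a sample proportion (normal approximation).

    z=1.96 corresponds to a 95% confidence level.
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical poll: 52% support among 1,000 respondents.
p_hat, n = 0.52, 1000
moe = margin_of_error(p_hat, n)
print(f"estimate: {p_hat:.2f} +/- {moe:.3f}")
```

Quadrupling the sample size halves the margin of error, because the sample size appears under a square root.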

Step 5: Productivity Tools (Optional)

I’ve included this project management course as optional since it isn’t directly related to learning data science. Rather, you will be taught to use Unix/Linux for file management, work with GitHub and version control, and create reports in R.

The ability to do the above will save you a lot of time and help you better manage end-to-end data science projects.

Step 6: Data Pre-Processing

The next course in this list is called Data Wrangling, and will teach you to prepare data and convert it into a format that is easily digestible by machine learning models.

You will learn to import data into R, tidy data, process string data, parse HTML, work with date-time objects, and mine text.

As a data scientist, you often need to extract data that is publicly available on the Internet in the form of a PDF document, HTML webpage, or a Tweet. You will not always be presented with clean, formatted data in a CSV file or Excel sheet.

By the end of this course, you will learn to wrangle and clean data to come up with critical insights from it.

Step 7: Linear Regression

Linear regression is a machine learning technique used to model a linear relationship between two or more variables. It can also be used to identify and adjust for the effect of confounding variables.

This course will teach you the theory behind linear regression models, how to examine the relationship between two variables, and how confounding variables can be detected and removed before building a machine learning algorithm.
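The idea of adjusting for a confounder can be sketched with simulated data. In the Python example below (the course uses R), a variable `z` drives both `x` and `y`, while `x` has no direct effect on `y`. All variable names and the data-generating setup are assumptions made for this illustration. The adjustment is done by regressing out `z` from both variables and then regressing the residuals on each other, which is one standard way to obtain the adjusted slope.

```python
import random

def slope(xs, ys):
    """Least-squares slope of ys regressed on xs (with an intercept)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def residuals(xs, ys):
    """Residuals of ys after removing the linear effect of xs."""
    b = slope(xs, ys)
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return [y - (my + b * (x - mx)) for x, y in zip(xs, ys)]

rng = random.Random(1)
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]       # confounder
x = [zi + rng.gauss(0, 1) for zi in z]        # depends on z
y = [2 * zi + rng.gauss(0, 1) for zi in z]    # depends on z only, not on x

naive = slope(x, y)                                  # ignores the confounder
adjusted = slope(residuals(z, x), residuals(z, y))   # slope of x after adjusting for z

print(f"naive slope of y on x:    {naive:.2f}")
print(f"adjusted slope of y on x: {adjusted:.2f}")
```

The naive slope suggests an association between `x` and `y`, while the adjusted slope is close to zero, reflecting that the association is explained by `z`.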

Step 8: Machine Learning

Finally, the course you’ve probably been waiting for! Harvard’s machine learning program will teach you the basics of machine learning, techniques to mitigate overfitting, supervised and unsupervised modelling approaches, and recommendation systems.

Step 9: Capstone Project

After completing all the above courses, you can take Harvard’s data science capstone project, where your skills in data visualization, probability, statistics, data wrangling, data organization, regression, and machine learning will be assessed.

With this final project, you will get the opportunity to put together all the knowledge learnt from the above courses and gain the ability to complete a hands-on data science project from scratch.

Note: All the courses above are available on the edX online learning platform and can be audited for free. If you want a course certificate, however, you will have to pay for one.

Building a data science learning roadmap with free courses offered by MIT.

8 Free MIT Courses to Learn Data Science Online

I enrolled in an undergraduate computer science program and decided to major in data science. I spent over $25K in tuition fees over the span of three years, only to graduate and realize that I wasn’t equipped with the skills necessary to land a job in the field.

I barely knew how to code, and was unclear about the most basic machine learning concepts.

I took some time out to try and learn data science myself — with the help of YouTube videos, online courses, and tutorials. I realized that all of this knowledge was publicly available on the Internet and could be accessed for free.

It came as a surprise that even Ivy League universities started making many of their courses accessible to students worldwide, for little to no charge. This meant that people like me could learn these skills from some of the best institutions in the world, instead of spending thousands of dollars on a subpar degree program.

In this article, I will provide you with a data science roadmap I created using only freely available MIT online courses.

Step 1: Learn to code

I highly recommend learning a programming language before going deep into the math and theory behind data science models. Once you learn to code, you will be able to work with real-world datasets and get a feel of how predictive algorithms function.

MIT OpenCourseWare offers a beginner-friendly Python course called Introduction to Computer Science and Programming.

This course is designed to help people with no prior coding experience to write programs to tackle useful problems.

Step 2: Statistics

Statistics is at the core of every data science workflow — it is required when building a predictive model, analyzing trends in large amounts of data, or selecting useful features to feed into your model.

MIT OpenCourseWare offers a beginner-friendly course called Introduction to Probability and Statistics. In this course, you will learn the basic principles of statistical inference and probability. Some concepts covered include conditional probability, Bayes’ theorem, covariance, the central limit theorem, resampling, and linear regression.
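For readers who have not met conditional probability and Bayes’ theorem before, a tiny worked example may help. The sketch below is in Python; the screening-test numbers are hypothetical and are not taken from the MIT course.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test), computed with Bayes' theorem."""
    true_pos = sensitivity * prior                   # P(positive and condition)
    false_pos = false_positive_rate * (1 - prior)    # P(positive and no condition)
    return true_pos / (true_pos + false_pos)

# Hypothetical numbers chosen only for illustration.
p = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(f"P(condition | positive) = {p:.3f}")
```

Even with an accurate test, the posterior probability stays modest when the condition is rare, because false positives from the large unaffected group outnumber true positives.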

This course will also walk you through statistical analysis using the R programming language, which is useful as it adds on to your tool stack as a data scientist.

Another useful program offered by MIT for free is called Statistical Thinking and Data Analysis. This is another elementary course in the subject that will take you through different data analysis techniques in Excel, R, and MATLAB.

You will learn about data collection, analysis, different types of sampling distributions, statistical inference, linear regression, multiple linear regression, and nonparametric statistical methods.

Step 3: Foundational Math Skills

Calculus and linear algebra are two other branches of math that are used in the field of machine learning. Taking a course or two in these subjects will give you a different perspective on how predictive models function and on the workings of the underlying algorithms.

To learn calculus, you can take Single Variable Calculus offered by MIT for free, followed by Multivariable Calculus.

Then, you can take this Linear Algebra class by Prof. Gilbert Strang to get a strong grasp of the subject.

All of the above courses are offered by MIT OpenCourseWare and are paired with lecture notes, problem sets, exam questions, and solutions.

Step 4: Machine Learning

Finally, you can use the knowledge gained in the courses above to take MIT’s Introduction to Machine Learning course. This program will walk you through the implementation of predictive models in Python.

The core focus of this course is on supervised and reinforcement learning problems, and you will be taught concepts such as generalization and how overfitting can be mitigated. Apart from just working with structured datasets, you will also learn to process image and sequential data.

MIT’s machine learning program lists three prerequisites (Python, linear algebra, and calculus), which is why it is advisable to take the courses above before starting this one.

Are These Courses Beginner-Friendly?

Even if you have no prior knowledge of programming, statistics, or mathematics, you can take all the courses listed above.

MIT has designed these programs to take you through the subject from scratch. However, unlike many MOOCs out there, the pace builds up quickly and the courses cover the material in considerable depth.

Due to this, it is advisable to do all the exercises that come with the lectures and work through all the reading material provided.

SOURCE

Natassha Selvaraj is a self-taught data scientist with a passion for writing. You can connect with her on LinkedIn.

https://www.kdnuggets.com/2022/03/8-free-mit-courses-learn-data-science-online.html


Simulation Tools of Genomic Next Generation Sequencing Data: Comparative Analysis & Genetic Simulation Resources

Posted in BioIT: BioInformatics, NGS, Clinical & Translational, Pharmaceutical R&D Informatics, Clinical Genomics, Cancer Informatics, Next Generation Sequencing (NGS), Simulation Modeling in NGS on May 31, 2019


Reporting: Aviva Lev-Ari, PhD, RN

INTRODUCTION

What is next generation sequencing?

Behjati S, Tarpey PS.

Arch Dis Child Educ Pract Ed. 2013 Dec;98(6):236-8. doi: 10.1136/archdischild-2013-304340. Epub 2013 Aug 28. Review.

PMID: 23986538

Free PMC Article

Computational pan-genomics: status, promises and challenges.

Computational Pan-Genomics Consortium.

Brief Bioinform. 2018 Jan 1;19(1):118-135. doi: 10.1093/bib/bbw089. Review.

PMID: 27769991

Free PMC Article

Tracking the NGS revolution: managing life science research on shared high-performance computing clusters.

Dahlö M, Scofield DG, Schaal W, Spjuth O.

Gigascience. 2018 May 1;7(5). doi: 10.1093/gigascience/giy028.

PMID: 29659792

Free PMC Article

NGS IN THE CLINIC

[Clinical Applications of Next-Generation Sequencing].

Rebollar-Vega RG, Arriaga-Canon C, de la Rosa-Velázquez IA.

Rev Invest Clin. 2018;70(4):153-157. doi: 10.24875/RIC.18002544.

PMID: 30067721

Free Article

Clinical Genomics: Challenges and Opportunities.

Vijay P, McIntyre AB, Mason CE, Greenfield JP, Li S.

Crit Rev Eukaryot Gene Expr. 2016;26(2):97-113. doi: 10.1615/CritRevEukaryotGeneExpr.2016015724. Review.

PMID: 27480773

Free PMC Article

Next-generation sequencing in the clinic: promises and challenges.

Xuan J, Yu Y, Qing T, Guo L, Shi L.

Cancer Lett. 2013 Nov 1;340(2):284-95. doi: 10.1016/j.canlet.2012.11.025. Epub 2012 Nov 19. Review.

PMID: 23174106

Free PMC Article

The Future of Whole-Genome Sequencing for Public Health and the Clinic.

Allard MW.

J Clin Microbiol. 2016 Aug;54(8):1946-8. doi: 10.1128/JCM.01082-16. Epub 2016 Jun 15.

PMID: 27307454

Free PMC Article

Standards and Guidelines for Validating Next-Generation Sequencing Bioinformatics Pipelines: A Joint Recommendation of the Association for Molecular Pathology and the College of American Pathologists.

Roy S, Coldren C, Karunamurthy A, Kip NS, Klee EW, Lincoln SE, Leon A, Pullambhatla M, Temple-Smolkin RL, Voelkerding KV, Wang C, Carter AB.

J Mol Diagn. 2018 Jan;20(1):4-27. doi: 10.1016/j.jmoldx.2017.11.003. Epub 2017 Nov 21. Review.

PMID: 29154853

MUTATION ANALYSIS – GENE ENCODING

Next-Generation Sequencing and Mutational Analysis: Implications for Genes Encoding LINC Complex Proteins.

Nagy PL, Worman HJ.

Methods Mol Biol. 2018;1840:321-336. doi: 10.1007/978-1-4939-8691-0_22.

PMID: 30141054

Genome-wide genetic marker discovery and genotyping using next-generation sequencing.

Davey JW, Hohenlohe PA, Etter PD, Boone JQ, Catchen JM, Blaxter ML.

Nat Rev Genet. 2011 Jun 17;12(7):499-510. doi: 10.1038/nrg3012. Review.

PMID: 21681211

Best practices for evaluating mutation prediction methods.

Rogan PK, Zou GY.

Hum Mutat. 2013 Nov;34(11):1581-2. doi: 10.1002/humu.22401. Epub 2013 Sep 10. No abstract available.

PMID: 23955774

MITOCHONDRIAL VARIATIONS

mit-o-matic: a comprehensive computational pipeline for clinical evaluation of mitochondrial variations from next-generation sequencing datasets.

Vellarikkal SK, Dhiman H, Joshi K, Hasija Y, Sivasubbu S, Scaria V.

Hum Mutat. 2015 Apr;36(4):419-24. doi: 10.1002/humu.22767.

PMID: 25677119

VARIANT ANALYSIS

A survey of tools for variant analysis of next-generation genome sequencing data.

Pabinger S, Dander A, Fischer M, Snajder R, Sperk M, Efremova M, Krabichler B, Speicher MR, Zschocke J, Trajanoski Z.

Brief Bioinform. 2014 Mar;15(2):256-78. doi: 10.1093/bib/bbs086. Epub 2013 Jan 21.

PMID: 23341494

Free PMC Article

Variant callers for next-generation sequencing data: a comparison study.

Liu X, Han S, Wang Z, Gelernter J, Yang BZ.

PLoS One. 2013 Sep 27;8(9):e75619. doi: 10.1371/journal.pone.0075619. eCollection 2013.

PMID: 24086590

Free PMC Article

VARIANT DETECTION IN HEREDITARY CANCER GENES

ICO amplicon NGS data analysis: a Web tool for variant detection in common high-risk hereditary cancer genes analyzed by amplicon GS Junior next-generation sequencing.

Lopez-Doriga A, Feliubadaló L, Menéndez M, Lopez-Doriga S, Morón-Duran FD, del Valle J, Tornero E, Montes E, Cuesta R, Campos O, Gómez C, Pineda M, González S, Moreno V, Capellá G, Lázaro C.

Hum Mutat. 2014 Mar;35(3):271-7.

PMID: 24227591

Development and analytical validation of a 25-gene next generation sequencing panel that includes the BRCA1 and BRCA2 genes to assess hereditary cancer risk.

Judkins T, Leclair B, Bowles K, Gutin N, Trost J, McCulloch J, Bhatnagar S, Murray A, Craft J, Wardell B, Bastian M, Mitchell J, Chen J, Tran T, Williams D, Potter J, Jammulapati S, Perry M, Morris B, Roa B, Timms K.

BMC Cancer. 2015 Apr 2;15:215. doi: 10.1186/s12885-015-1224-y.

PMID: 25886519

Free PMC Article

Clinical Applications of Next-Generation Sequencing in Cancer Diagnosis.

Sabour L, Sabour M, Ghorbian S.

Pathol Oncol Res. 2017 Apr;23(2):225-234. doi: 10.1007/s12253-016-0124-z. Epub 2016 Oct 8. Review.

PMID: 27722982

Studying cancer genomics through next-generation DNA sequencing and bioinformatics.

Doyle MA, Li J, Doig K, Fellowes A, Wong SQ.

Methods Mol Biol. 2014;1168:83-98. doi: 10.1007/978-1-4939-0847-9_6. Review.

PMID: 24870132

IMMUNOINFORMATICS

Immunoinformatics and epitope prediction in the age of genomic medicine.

Backert L, Kohlbacher O.

Genome Med. 2015 Nov 20;7:119. doi: 10.1186/s13073-015-0245-0. Review.

PMID: 26589500

Free PMC Article

IgSimulator: a versatile immunosequencing simulator.

Safonova Y, Lapidus A, Lill J.

Bioinformatics. 2015 Oct 1;31(19):3213-5. doi: 10.1093/bioinformatics/btv326. Epub 2015 May 25.

PMID: 26007226

Computational genomics tools for dissecting tumour-immune cell interactions.

Hackl H, Charoentong P, Finotello F, Trajanoski Z.

Nat Rev Genet. 2016 Jul 4;17(8):441-58. doi: 10.1038/nrg.2016.67. Review.

PMID: 27376489

RNA SEQUENCING

SimBA: A methodology and tools for evaluating the performance of RNA-Seq bioinformatic pipelines.

Audoux J, Salson M, Grosset CF, Beaumeunier S, Holder JM, Commes T, Philippe N.

BMC Bioinformatics. 2017 Sep 29;18(1):428. doi: 10.1186/s12859-017-1831-5.

PMID: 28969586

Free PMC Article

COMPLEX INSERTIONS AND DELETIONS

INDELseek: detection of complex insertions and deletions from next-generation sequencing data.

Au CH, Leung AY, Kwong A, Chan TL, Ma ES.

BMC Genomics. 2017 Jan 5;18(1):16. doi: 10.1186/s12864-016-3449-9.

PMID: 28056804

Free PMC Article

EVOLUTIONARY BIOLOGY

The State of Software for Evolutionary Biology.

Darriba D, Flouri T, Stamatakis A.

Mol Biol Evol. 2018 May 1;35(5):1037-1046. doi: 10.1093/molbev/msy014. Review.

PMID: 29385525

Free PMC Article

SIMULATION PROGRAMS

Nat Rev Genet. 2016 Aug; 17(8): 459–469.

Published online 2016 Jun 20. doi: 10.1038/nrg.2016.57

PMCID: PMC5224698

EMSID: EMS70941

PMID: 27320129

Systematic review of next-generation sequencing simulators: computational tools, features and perspectives.

Zhao M, Liu D, Qu H.

Brief Funct Genomics. 2017 May 1;16(3):121-128. doi: 10.1093/bfgp/elw012. Review.

PMID: 27069250

A comparison of tools for the simulation of genomic next-generation sequencing data

Merly Escalona,¹ Sara Rocha,¹ and David Posada¹,²


The publisher’s final edited version of this article is available at Nat Rev Genet

This article has been corrected. See Nat Rev Genet. 2018 October 03.

Online Summary

There is a large number of tools for the simulation of genomic data for all currently available NGS platforms, with partially overlapping functionality. Here we review 23 of these tools, highlighting their distinct functionalities, requirements and potential applications.

The parameterization of these simulators is often complex. The user may decide between using existing sets of parameter values, called profiles, or re-estimating them from their own data.

Parameters that can be modulated in these simulations include the effects of the PCR amplification of the libraries, read features and quality scores, base call errors, variation of sequencing depth across the genomes and the introduction of genomic variants.

Several types of genomic variants can be introduced in the simulated reads, such as SNPs, indels, inversions, translocations, copy-number variants and short-tandem repeats.

Reads can be generated from single or multiple genomes, and with distinct ploidy levels. NGS data from metagenomic communities can be simulated given an “abundance profile” that reflects the proportion of taxa in a given sample.
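The notion of an abundance profile can be illustrated with a short Python sketch. The taxon names, proportions, and the function `allocate_reads` below are hypothetical and do not correspond to any of the reviewed simulators; the sketch only shows how a total read count might be split across taxa in proportion to their relative abundance.

```python
import random
from collections import Counter

def allocate_reads(abundance, total_reads, seed=0):
    """Draw a source taxon for each read in proportion to its relative abundance."""
    rng = random.Random(seed)
    taxa = list(abundance)
    weights = [abundance[t] for t in taxa]
    draws = rng.choices(taxa, weights=weights, k=total_reads)
    return Counter(draws)

# Hypothetical abundance profile for a three-taxon community.
profile = {"taxon_A": 0.6, "taxon_B": 0.3, "taxon_C": 0.1}
counts = allocate_reads(profile, total_reads=10000)

for taxon, n_reads in counts.most_common():
    print(f"{taxon}: {n_reads} reads")
```

In a fuller simulation, each taxon's read count would then be passed to a read generator together with that taxon's reference genome.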

Many of the simulators have not been formally described and/or tested in dedicated publications. We encourage the formal publication of these tools and comprehensive comparative benchmarking.

Choosing among the different genomic NGS simulators is not easy. Here we provide a guidance tree to help users choose a suitable tool for their specific interests.

Abstract

Computer simulation of genomic data has become increasingly popular for assessing and validating biological models or to gain understanding about specific datasets. Multiple computational tools for the simulation of next-generation sequencing (NGS) data have been developed in recent years, which could be used to compare existing and new NGS analytical pipelines. Here we review 23 of these tools, highlighting their distinct functionality, requirements and potential applications. We also provide a decision tree for the informed selection of an appropriate NGS simulation tool for the specific question at hand.

Image source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5224698/

An overview of current NGS technologies

The most popular NGS technologies on the market are Illumina’s sequencing by synthesis, which is probably the most widely used platform at present¹⁷, Roche’s 454 pyrosequencing (454), SOLiD sequencing-by-ligation (SOLiD), IonTorrent semiconductor sequencing¹⁸ (IonTorrent), Pacific Biosciences’s (PacBio) single molecule real-time sequencing¹⁹, and Oxford Nanopore Technologies (Nanopore) single-cell DNA template strand sequencing. These strategies can differ, for example, regarding the type of reads they produce or the kind of sequencing errors they introduce (Table 1). Only two of the current technologies (Illumina and SOLiD) are capable of producing all three sequencing read types: single end, paired end and mate pair. Read length is also dependent on the machine and the kit used; in platforms like Illumina, SOLiD, or IonTorrent it is possible to specify the number of desired base pairs per read. According to the sequencing run type selected it is possible to obtain reads with maximum lengths of 75 bp (SOLiD), 300 bp (Illumina) or 400 bp (IonTorrent). On the other hand, in platforms like 454, Nanopore or PacBio, information is only given about the mean and maximum read length that can be obtained, with average lengths of 700 bp, 10 kb and 15 kb and maximum lengths of 1 kb, 10 kb and 15 kb, respectively. Error rates vary depending on the platform from ≤1% in Illumina to ~30% in Nanopore. Further overviews and comparisons of NGS strategies can be found in refs ⁵,²⁰–²².

Table 1

Main characteristics of current NGS technologies.

Run type is given in the Single-read, Paired-end and Mate-pair columns (X = supported, - = not supported).

Technology	Single-read	Paired-end	Mate-pair	Maximum Read Length	Quality Scores	Error Rates	References
Illumina	X	X	X	300 bp	> Q30	0.0034 – 1%	65
SOLiD	X	X	X	75 bp	> Q30	0.01 – 1%	66
IonTorrent	X	X	-	400 bp	~ Q20	1.78%	22
454	X	X	-	~700 bp (up to 1 Kb)	> Q20	1.07 – 1.7%	59,67
Nanopore	X	-	-	5.4 – 10 Kb	NAY	10 – 40%	68–72
PacBio	X	-	-	~15 Kb (up to 40 Kb)	< Q10	5 – 10%	22,73–75
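The quality scores and error rates in Table 1 are related through the standard Phred scale, where a quality score Q corresponds to a per-base error probability of 10^(-Q/10). The short Python sketch below shows the conversion in both directions; the function names are chosen for this illustration only.

```python
import math

def phred_to_error_prob(q):
    """Per-base error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred quality score implied by a per-base error probability."""
    return -10 * math.log10(p)

for q in (10, 20, 30):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
```

On this scale, Q30 corresponds to roughly one expected error per 1,000 bases and Q10 to one per 10 bases, which is in line with the ordering of platforms in the table.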


Simulation parameters

The existing sequencing platforms use distinct protocols that result in datasets with different characteristics¹. Many of these attributes can be taken into account by the simulators (Fig. 2), although there is not a single tool that incorporates all possible variations. The main characteristics of the 23 simulators considered here are summarized in Tables 2 and 3. These tools differ in multiple aspects, such as sequencing technology, input requirements or output format, but maintain several common aspects. With some exceptions, all programs need a reference sequence, multiple parameter values indicating the characteristics of the sequencing experiment to be simulated (read length, error distribution, type of variation to be generated, if any, etc.) and/or a profile (a set of parameter values, conditions and/or data used for controlling the simulation), which can be provided by the simulator or estimated de novo from empirical data. The outcome will be aligned or unaligned reads in different standard file formats, such as FASTQ, FASTA or BAM. An overview of the NGS data simulation process is represented in Fig. 3. In the following sections we delve into the different steps involved.
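To give a feel for the inputs and outputs described above, here is a deliberately simplified Python sketch of a read simulator. It is not the method of any of the 23 reviewed tools: it assumes a random toy reference, single-end reads of fixed length, a uniform substitution error rate, and a constant quality score derived from that rate. All names (`simulate_reads`, `to_fastq`) are hypothetical.

```python
import math
import random

BASES = "ACGT"

def simulate_reads(reference, n_reads, read_length, error_rate, seed=0):
    """Yield (name, sequence, quality) for single-end reads with substitution errors."""
    rng = random.Random(seed)
    # Constant Phred+33 quality character implied by the error rate (capped at Q40).
    # Real simulators typically use per-base quality profiles instead.
    q = 40 if error_rate <= 0 else min(40, round(-10 * math.log10(error_rate)))
    qual_char = chr(q + 33)
    for i in range(n_reads):
        start = rng.randint(0, len(reference) - read_length)
        read = []
        for base in reference[start:start + read_length]:
            if rng.random() < error_rate:
                # Substitute with a different base to mimic a base-call error.
                base = rng.choice([b for b in BASES if b != base])
            read.append(base)
        yield f"read_{i}_pos_{start}", "".join(read), qual_char * read_length

def to_fastq(records):
    """Format records as FASTQ text (four lines per read)."""
    return "".join(f"@{name}\n{seq}\n+\n{qual}\n" for name, seq, qual in records)

# Hypothetical 500 bp reference sequence generated at random for the example.
ref_rng = random.Random(42)
reference = "".join(ref_rng.choice(BASES) for _ in range(500))

fastq_text = to_fastq(
    simulate_reads(reference, n_reads=5, read_length=50, error_rate=0.01)
)
print(fastq_text)
```

Real simulators replace each simplification here with an empirically estimated profile, for example position-dependent error rates and quality scores.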


Figure 2

General overview of the sequencing process and steps that can be parameterized in the simulations.

NGS simulators try to imitate the real sequencing process as closely as possible by considering all the steps that could influence the characteristics of the reads.

a | NGS simulators do not take into account the effect of the different DNA extraction protocols in the resulting data. However, they can consider whether the sample we want to sequence includes one or more individuals, from the same or different organisms (e.g., pool-sequencing, metagenomics). Pools of related genomes can be simulated by replicating the reference sequence and introducing variants on the resulting genomes. Some tools can also simulate metagenomes with distinct taxa abundance.

b | Simulators can try to mimic the length range of DNA fragmentation (empirically obtained by sonication or digestion protocols) or assume a fixed amplicon length.

c | Library preparation involves ligating sequencing-platform-dependent adaptors and/or barcodes to the selected DNA fragments (inserts). Some simulators can control the insert size, and produce reads with adaptors/barcodes.

d | Most NGS techniques include an amplification step for the preparation of libraries. Several simulators can take this step into account (for example, by introducing errors and/or chimaeras), with the possibility of specifying the number of reads per amplicon.

e | Sequencing runs imply a decision about coverage, read length, read type (single-end, paired-end, mate-pair) and a given platform (with their specific errors and biases). Simulators exist for the different platforms, and they can use particular parameter profiles, often estimated from real data.
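The decision about coverage and read length mentioned for the sequencing run translates directly into a number of reads, using the usual relation coverage = (number of reads × read length) / genome length. The Python sketch below applies it; the genome size and run settings are hypothetical, and the function name is chosen for this illustration.

```python
import math

def reads_for_coverage(genome_length, coverage, read_length, paired=False):
    """Number of reads (or read pairs, if paired=True) needed for a target mean coverage."""
    n_reads = math.ceil(coverage * genome_length / read_length)
    return math.ceil(n_reads / 2) if paired else n_reads

# Hypothetical 5 Mb genome sequenced to 30x with 150 bp reads.
single = reads_for_coverage(5_000_000, coverage=30, read_length=150)
pairs = reads_for_coverage(5_000_000, coverage=30, read_length=150, paired=True)
print(f"single-end reads needed: {single}")
print(f"read pairs needed:       {pairs}")
```

This gives only the mean depth; as noted in the summary above, simulators can also model how depth varies across the genome.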


Figure 3

General overview of NGS simulation.

The simulation process begins with the input of a reference sequence (most cases) and simulation parameters. Some of the parameters can be given via a profile, which is estimated (by the simulator or other tools) from other reads or alignments. The outcome of this process may be reads (with or without quality information) or genome alignments in different formats.

CONCLUSIONS

NGS is having a big impact in a broad range of areas that benefit from genetic information, from medical genomics, phylogenetic and population genomics, to the reconstruction of ancient genomes, epigenomics and environmental barcoding. These applications include approaches such as de novo sequencing, resequencing, target sequencing or genome reduction methods. In all cases, caution is necessary in choosing a proper sequencing design and/or a reliable analytical approach for the specific biological question of interest. The simulation of NGS data can be extremely useful for planning experiments, testing hypotheses, benchmarking tools and evaluating particular results. Given a reference genome or dataset, for instance, one can play with an array of sequencing technologies to choose the best-suited technology and parameters for the particular goal, possibly optimizing time and costs. Yet, this is still not the standard practice and researchers often base their choices on practical considerations like technology and money availability. As shown throughout this Review, simulation of NGS data from known genomes or transcriptomes can be extremely useful when evaluating assembly, mapping, phasing or genotyping algorithms (e.g., refs ²,⁷,¹⁰,¹³,⁶⁴), exposing their advantages and drawbacks under different circumstances.

Altogether, current NGS simulators consider most, if not all, of the important features regarding the generation of NGS data. However, they are not problem-free. The different simulators are largely redundant, implementing the same or very similar procedures. In our opinion, many are poorly documented and can be difficult to use for non-experts, and some of them are no longer maintained. Most importantly, for the most part they have not been benchmarked or validated. Remarkably, among the 23 tools considered here, only 13 have been described in dedicated application notes, 3 have been mentioned as add-ons in the methods section of bigger articles, and 5 have never been referenced in a journal. Indeed, peer-reviewed publication of these tools in dedicated articles would be highly desirable. While this would not definitively guarantee quality, at least it would encourage authors to reach minimum standards in terms of validation, benchmarking, and documentation. Collaborative efforts like the Assemblathon (e.g., ref. ²⁷) or iEvo (http://www.ievobio.org/) might also be a source of inspiration. Meanwhile, we hope that the decision tree presented in Fig. 1 helps users make appropriate choices.

SOURCE

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5224698/


REFERENCES

Systematic benchmarking of omics computational tools

Serghei Mangul, Lana S. Martin, Brian L. Hill, Angela Ka-Mei Lam, Margaret G. Distler, Alex Zelikovsky, Eleazar Eskin, Jonathan Flint

Nat Commun. 2019; 10: 1393. Published online 2019 Mar 27. doi: 10.1038/s41467-019-09406-4

PMCID: PMC6437167

Long fragments achieve lower base quality in Illumina paired-end sequencing

Ge Tan, Lennart Opitz, Ralph Schlapbach, Hubert Rehrauer

Sci Rep. 2019; 9: 2856. Published online 2019 Feb 27. doi: 10.1038/s41598-019-39076-7

PMCID: PMC6393434

sim1000G: a user-friendly genetic variant simulator in R for unrelated individuals and family-based designs

Apostolos Dimitromanolakis, Jingxiong Xu, Agnieszka Krol, Laurent Briollais

BMC Bioinformatics. 2019; 20: 26. Published online 2019 Jan 15. doi: 10.1186/s12859-019-2611-1

PMCID: PMC6332552

Analysis validation has been neglected in the Age of Reproducibility

Kathleen E. Lotterhos, Jason H. Moore, Ann E. Stapleton

PLoS Biol. 2018 Dec; 16(12): e3000070. Published online 2018 Dec 10. doi: 10.1371/journal.pbio.3000070

PMCID: PMC6301703

Enterovirus D68 – The New Polio?

Hayley Cassidy, Randy Poelman, Marjolein Knoester, Coretta C. Van Leer-Buter, Hubert G. M. Niesters

Front Microbiol. 2018; 9: 2677. Published online 2018 Nov 13. doi: 10.3389/fmicb.2018.02677

PMCID: PMC6243117

Genetic Simulation Resources and the GSR Certification Program

Bo Peng, Man Chong Leong, Huann-Sheng Chen, Melissa Rotunno, Katy R Brignole, John Clarke, Leah E Mechanic

Bioinformatics. 2019 Feb 15; 35(4): 709–710. Published online 2018 Aug 7. doi: 10.1093/bioinformatics/bty666

PMCID: PMC6378936

Simulating Illumina metagenomic data with InSilicoSeq

Hadrien Gourlé, Oskar Karlsson-Lindsjö, Juliette Hayer, Erik Bongcam-Rudloff

Bioinformatics. 2019 Feb 1; 35(3): 521–522. Published online 2018 Jul 19. doi: 10.1093/bioinformatics/bty630

PMCID: PMC6361232

NPBSS: a new PacBio sequencing simulator for generating the continuous long reads with an empirical model

Ze-Gang Wei, Shao-Wu Zhang

BMC Bioinformatics. 2018; 19: 177. Published online 2018 May 22. doi: 10.1186/s12859-018-2208-0

PMCID: PMC5964698

DeepSimulator: a deep simulator for Nanopore sequencing

Yu Li, Renmin Han, Chongwei Bi, Mo Li, Sheng Wang, Xin Gao

Bioinformatics. 2018 Sep 1; 34(17): 2899–2908. Published online 2018 Apr 6. doi: 10.1093/bioinformatics/bty223

PMCID: PMC6129308

Xome-Blender: A novel cancer genome simulator

Roberto Semeraro, Valerio Orlandini, Alberto Magi

PLoS One. 2018; 13(4): e0194472. Published online 2018 Apr 5. doi: 10.1371/journal.pone.0194472

PMCID: PMC5886411

Bamgineer: Introduction of simulated allele-specific copy number variants into exome and targeted sequence data sets

Soroush Samadian, Jeff P. Bruce, Trevor J. Pugh

PLoS Comput Biol. 2018 Mar; 14(3): e1006080. Published online 2018 Mar 28. doi: 10.1371/journal.pcbi.1006080

PMCID: PMC5891060

Environmental and Host Effects on Skin Bacterial Community Composition in Panamanian Frogs

Brandon J. Varela, David Lesbarrères, Roberto Ibáñez, David M. Green

Front Microbiol. 2018; 9: 298. Published online 2018 Feb 22. doi: 10.3389/fmicb.2018.00298

PMCID: PMC5826957

Novel read density distribution score shows possible aligner artefacts, when mapping a single chromosome

Fedor M. Naumenko, Irina I. Abnizova, Nathan Beka, Mikhail A. Genaev, Yuriy L. Orlov

BMC Genomics. 2018; 19(Suppl 3): 92. Published online 2018 Feb 9. doi: 10.1186/s12864-018-4475-6

PMCID: PMC5836841

HgtSIM: a simulator for horizontal gene transfer (HGT) in microbial communities

Weizhi Song, Kerrin Steensen, Torsten Thomas

PeerJ. 2017; 5: e4015. Published online 2017 Nov 8. doi: 10.7717/peerj.4015

PMCID: PMC5681852

Profiling of Short-Tandem-Repeat Disease Alleles in 12,632 Human Whole Genomes

Haibao Tang, Ewen F. Kirkness, Christoph Lippert, William H. Biggs, Martin Fabani, Ernesto Guzman, Smriti Ramakrishnan, Victor Lavrenko, Boyko Kakaradov, Claire Hou, Barry Hicks, David Heckerman, Franz J. Och, C. Thomas Caskey, J. Craig Venter, Amalio Telenti

Am J Hum Genet. 2017 Nov 2; 101(5): 700–715. Published online 2017 Nov 2. doi: 10.1016/j.ajhg.2017.09.013

PMCID: PMC5673627

Simulating the dynamics of targeted capture sequencing with CapSim

Minh Duc Cao, Devika Ganesamoorthy, Chenxi Zhou, Lachlan J M Coin

Bioinformatics. 2018 Mar 1; 34(5): 873–874. Published online 2017 Oct 28. doi: 10.1093/bioinformatics/btx691

PMCID: PMC6192212

Next-generation sequencing applications in clinical bacteriology

Yair Motro, Jacob Moran-Gilad

Biomol Detect Quantif. 2017 Dec; 14: 1–6. Published online 2017 Oct 23. doi: 10.1016/j.bdq.2017.10.002

PMCID: PMC5727008

A multi-scenario genome-wide medical population genetics simulation framework

Jacquiline W Mugo, Ephifania Geza, Joel Defo, Samar S M Elsheikh, Gaston K Mazandu, Nicola J Mulder, Emile R Chimusa

Bioinformatics. 2017 Oct 1; 33(19): 2995–3002. Published online 2017 Jun 24. doi: 10.1093/bioinformatics/btx369

PMCID: PMC5870573

Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads

Ryan R. Wick, Louise M. Judd, Claire L. Gorrie, Kathryn E. Holt

PLoS Comput Biol. 2017 Jun; 13(6): e1005595. Published online 2017 Jun 8. doi: 10.1371/journal.pcbi.1005595

PMCID: PMC5481147

NanoSim: nanopore sequence read simulator based on statistical characterization

Chen Yang, Justin Chu, René L Warren, Inanç Birol

Gigascience. 2017 Apr; 6(4): 1–6. Published online 2017 Feb 24. doi: 10.1093/gigascience/gix010

PMCID: PMC5530317

IBM’s $3 Billion Investment In Synthetic Brains And Quantum Computing

Posted in Advanced Computing Platform, Artificial Intelligence - Breakthroughs in Theories and Technologies, Artificial Intelligence - General, Big Data, BioIT: BioInformatics, BioIT: BioInformatics, NGS, Clinical & Translational, Pharmaceutical R&D Informatics, Clinical Genomics, Cancer Informatics, Blockchain Transactions System, Intelligent Information Systems, Simulation Modeling in NGS on September 5, 2015

Reporter: Aviva Lev-Ari, PhD, RN

IBM thinks the future belongs to computers that mimic the human brain and use quantum physics…and they’re betting $3 billion on it.

Sourced through Scoop.it from: www.fastcompany.com

See on Scoop.it – Cardiovascular and vascular imaging

Technology	Run Type: Single-read	Run Type: Paired-end	Run Type: Mate-pair	Maximum Read Length	Quality Scores	Error Rates	References
Illumina	X	X	X	300 bp	> Q30	0.0034 – 1%	65
SOLiD	X	X	X	75 bp	> Q30	0.01 – 1%	66
IonTorrent	X	X		400 bp	~ Q20	1.78%	22
454	X	X		~700 bp (up to 1 Kb)	> Q20	1.07 – 1.7%	59, 67
Nanopore	X			5.4 – 10 Kb	NA	10 – 40%	68–72
PacBio	X			~15 Kb (up to 40 Kb)	< Q10	5 – 10%	22, 73–75
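
The Quality Scores and Error Rates columns above are related but not interchangeable. As a minimal sketch, assuming the Q values in the table are Phred-scaled (the usual convention for sequencing quality scores), a score Q implies a per-base error probability of 10^(-Q/10), so Q30 corresponds to 0.1% and Q20 to 1%. The function names below are illustrative and do not come from any of the simulators listed in this post.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Per-base error probability implied by a Phred-scaled quality score."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred-scaled quality score implied by an error probability in (0, 1]."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Thresholds quoted in the table: Q10 (PacBio), Q20 (IonTorrent, 454), Q30 (Illumina, SOLiD)
    for q in (10, 20, 30):
        print(f"Q{q}: {phred_to_error_probability(q):.4%} implied per-base error")
```

The observed error rates reported in the table come from empirical studies and need not match the value implied by the quoted quality score, which is one reason read simulators typically model quality and error profiles per platform.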
