
Genomic Pathogen Typing
Larry H. Bernstein, MD, FCAP, Curator
LPBI
Genomic Pathogen Typing Using Solid-State Nanopores
Citation: Squires AH, Atas E, Meller A (2015) Genomic Pathogen Typing Using Solid-State Nanopores. PLoS ONE 10(11): e0142944. http://dx.doi.org:/10.1371/journal.pone.0142944
Editor: Niyaz Ahmed, University of Hyderabad, INDIA
In clinical settings, rapid and accurate characterization of pathogens is essential for effective treatment of patients; however, subtle genetic changes in pathogens which elude traditional phenotypic typing may confer dangerous pathogenic properties such as toxicity, antibiotic resistance, or virulence. Existing options for molecular typing techniques characterize the critical genomic changes that distinguish harmful and benign strains, yet the well-established approaches, in particular those that rely on electrophoretic separation of nucleic acid fragments on a gel, have room for only incremental future improvements in speed, cost, and complexity. Solid-state nanopores are an emerging class of single-molecule sensors that can electrophoretically characterize charged biopolymers, and which offer significant advantages in terms of sample and reagent requirements, readout speed, parallelization, and automation. We present here the first application of nanopores for single-molecule molecular typing using length-based “fingerprints” of critical sites in bacterial genomes. This technique is highly adaptable for detection of different types of genetic variation; as we illustrate using prototypical examples including Mycobacterium tuberculosis and methicillin-resistant Staphylococcus aureus, the solid-state nanopore diagnostic platform may be used to detect large insertions or deletions, small insertions or deletions, and even single-nucleotide variations in bacterial DNA. We further show that Bayesian classification of test samples can provide highly confident pathogen typing results based on only a few tens of independent single-molecule events, making this method extremely sensitive and statistically robust.
Subtle genetic changes in bacteria can produce large variations in factors affecting pathogenicity, such as toxicity, antibiotic resistance, and virulence. These genetic variations are not only used to trace the epidemic and phylogenetic relationships among strains of bacteria, but are also critically important in clinical settings for proper patient diagnosis and treatment. Most existing approaches require sample incubation and growth over the course of multiple days prior to testing, and nearly all require expert handling of samples and interpretation of results. Traditional phenotypic typing techniques such as serotypes, biotypes, phage-types, and antibiograms lack the necessary sensitivity to distinguish between closely related pathogen strains, and therefore fail to adequately capture these critical variations for clinical applications. Gel-based techniques such as restriction fragment length polymorphism (RFLP) or cleaved amplified polymorphic sequences (CAPS) require a large amount of time and results are not easily compared or transferred among labs. Next-generation sequencing is an increasingly popular method of fully characterizing bacterial strains [1] and may be used for typing strains according to the sequences of a panel of housekeeping genes, as in multi-locus sequence typing (MLST) [2], but this approach is more commonly used to trace post hoc epidemic and phylogenetic relationships among clinical isolates. Furthermore, the complexity and quantity of sequencing data far exceeds the minimum information required to efficiently and accurately diagnose a patient. For example, bioinformatics studies suggest that a panel of just 30–50 single nucleotide variations (SNVs) could be used to uniquely identify thousands of strains of Mycobacterium tuberculosis [3, 4]. 
Yet SNVs are not the only source of variation among pathogens; polymorphisms from SNVs and short indels up to genetic changes as large as whole plasmids or sets of genes may be responsible for critical changes to pathogenicity. Thus there exists a clear clinical need for a novel approach to molecular typing that can quickly and simply screen patient samples for a panel of widely varying known genetic polymorphisms of dangerous pathogens.
Solid-state nanopores may be used to discriminate the lengths of unlabeled individual biopolymers such as DNA molecules across a wide range of lengths [5, 6]. Biopolymers are electrophoretically attracted and threaded through a voltage-biased nanoscale pore drilled in an ultrathin freestanding SiNx membrane [7, 8]. When a DNA molecule is threaded through a nanopore, it partially blocks the flow of ions moving through the pore, allowing real-time detection of the analyte by monitoring changes in the ion current. Nanopore sensing is biochemically simple, as it does not require labeling of the analyte with radioactive or fluorescent probes, yet it can be used to detect minute quantities of nucleic acid molecules, surpassing the sensitivity of bulk methods [8]. Moreover, nanopore sensing involves relatively simple instrumentation (primarily a current amplifier) and may be used to analyze thousands of molecules in just a few minutes, making this technique an ideal candidate for applications such as nucleic acid based diagnostics.
Here we describe and practice a novel detection scheme (Fig 1) for molecular typing of pathogens using solid-state nanopores, and demonstrate its ability to discriminate a wide range of critical genetic polymorphisms in closely related organisms with starkly different pathogenicities. In the first sensing mode of our approach (Mode I), large insertions or deletions are detected by directly classifying the length of DNA in the nanopore. In the second sensing mode (Mode II), small indels down to SNVs may be detected by sequence-specific digestion at the site of the polymorphism to produce either one or two DNA fragments, which are then detected in the nanopore. We first characterize the practical range of our nanopore system for detecting variation in DNA length, and show that fragment length differences are more readily apparent for shorter DNA lengths and for asymmetric cut sites. We then demonstrate that statistical analysis tools such as Bayesian classifiers, commonly used for automated classification, are highly effective for rapid and statistically robust discrimination among different lengths and combinations of DNA fragments translocating through a nanopore, even in cases where significant portions of these distributions overlap. We apply these techniques to demonstrate polymorphism discrimination down to the single nucleotide level in prototypical strains of Mycobacterium tuberculosis (virulent vs. avirulent) and Staphylococcus aureus (methicillin-resistant vs. multi-drug resistant). This highly versatile combination of rapid length and digest discrimination, spanning several orders of magnitude of possible genomic variation size, in a single, parallelizable device, could be extended to probe a large panel of critical sites within a genome for point-of-care determination of critical pathogenic properties and sequence typing.
Mode I: Direct length detection according to analyte translocation dwell time and depth enables discrimination of longer vs. shorter fragments, i.e., whether or not an insertion or deletion is present (left). Mode II: Prior to translocation, samples are exposed to a restriction enzyme that cuts at the site of an SNV, short indel, or mutation. Detection of cleaved vs. uncleaved DNA fragments in the nanopore reveals whether or not the critical genomic variation is present.
http://dx.doi.org:/10.1371/journal.pone.0142944.g001
Detection of DNA Sequence Polymorphisms in Solid-State Nanopores
The simplest form of nanopore translocation analysis involves the measurement of the depth of each current blockade (ΔIB) and the dwell time of each molecule within the pore (tD). Both parameters have been shown to grow nonlinearly with DNA length, forming the basis for fragment length separation in the nanopore system. The statistical distributions of these independently measured quantities may be used to distinguish between analytes of different lengths, such as DNAs [5, 6, 9], or proteins having identical molecular weight but slightly different charge or 3D structure [10–13]. Variation in the translocation dwell time (tD) in solid-state nanopores, measured for different DNA lengths (l), is empirically described by a power law, tD ∼ l^α with α = 1.38 ± 0.02, which has been reproduced by multiple experimental approaches [5, 9, 14]. Using a log-scale distribution of translocation times to estimate the distribution of tD, note that the difference in log(tD) for two sequences (lengths l0 and l0 + Δl) is more apparent when the length l0 is short compared with the insertion or deletion Δl (i.e., when Δl/l0 ∼ 1), according to Eq 1, where log denotes the base-10 logarithm:
$$\Delta\log(t_D) = \log t_D(l_0 + \Delta l) - \log t_D(l_0) = \alpha \log\left(1 + \frac{\Delta l}{l_0}\right) \tag{1}$$
If the presence of two fragment lengths must be identified from within a single sample, it is desirable that their distributions of ΔIB or tD should be as well-separated as possible. Furthermore, if the presence of a cut sample must be distinguished from an uncut sample, then by Eq 1 the peak produced by the shorter part of a cut sample will appear farther away from the uncut peak than the longer part of a cut sample. To statistically distinguish the samples, it is desirable for the peak of the shorter part to be as dissimilar as possible from the uncut peak. Therefore, asymmetrically cut DNA pieces from a restriction digest are more readily distinguished from the original uncut length than those produced by symmetrically positioned restriction sites, provided that the shorter piece is of sufficient length to be detected by the nanopore. In cases where separation between two biopolymers of similar length (Δl/l0 ≪ 1) is required, the measured histograms of either ΔIB or tD may overlap significantly, making discrimination between these molecules difficult. Combinations of multiple fragment lengths within a sample pose additional challenges, as their more complicated distributions may overlap or otherwise preclude simple contour cluster separation.
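The effect of cut-site position can be illustrated with a short numerical sketch. It assumes only the power law tD ∼ l^α with α = 1.38 quoted above and base-10 logarithms; the function names and the 1000 bp example amplicon are illustrative choices, not taken from the paper's methods.

```python
import math

ALPHA = 1.38  # power-law exponent quoted in the text (tD ~ l^alpha)


def log_td_shift(l_ref, l_frag, alpha=ALPHA):
    """log10 dwell-time shift of a fragment of length l_frag relative to l_ref."""
    return alpha * math.log10(l_frag / l_ref)


def cut_peak_shifts(uncut_bp, cut_pos_bp):
    """Return (short_shift, long_shift) of the two digest fragments vs. the uncut peak."""
    short = min(cut_pos_bp, uncut_bp - cut_pos_bp)
    long_ = uncut_bp - short
    return log_td_shift(uncut_bp, short), log_td_shift(uncut_bp, long_)


if __name__ == "__main__":
    # Hypothetical 1000 bp amplicon cut at progressively more asymmetric positions.
    for cut in (500, 300, 100):
        short_shift, long_shift = cut_peak_shifts(1000, cut)
        print(f"cut at {cut} bp: short {short_shift:+.3f}, long {long_shift:+.3f}")
```

Under these assumptions, moving the cut from the middle of the amplicon toward one end pushes the short-fragment peak farther from the uncut peak, while the long-fragment peak moves closer to it.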
In the context of sequence typing, identification of fragments by sizing will indicate the presence of specific insertions and deletions that may enhance or reduce pathogenicity or otherwise uniquely identify a pathogenic strain. Upper bounds on Δl are set by: 1) sample preparation parameters and limitations; for example, robust and fast PCR amplification is most easily achieved for fragment lengths of ~10^2–10^3 bp [15]; and 2) nanopore stability considerations; for example, nanopores are more frequently clogged by very long DNA (>20 kbp). Lower bounds on l0 are set by nanopore sensitivity; while several groups have demonstrated detection of small DNA fragments (<50 bp) [16], we find that a minimum l0 on the order of 100 bp is more reliable since it is readily detectable in small nanopores with no additional modifications [5], producing an extremely small fraction of missed events due to the finite system bandwidth. Thus a reasonable design range for sequence typing fragments is ~100 bp minimum length for l0, ranging up to a few thousand base pairs maximum length for l0 + Δl. Many types of common genetic variations used for strain typing fall within this size range. For example, one complete IS6110 (insertion-like sequence element) insertion in M. tuberculosis is 1358 bp [17]. At the other end of this range, multi-drug resistant strains of methicillin-resistant S. aureus (MRSA) have many insertions and deletions in the range of 47 bp to 643 bp that affect their pathogenicity [18]. To detect the smallest indels, which fall below the minimum detectable Δl, we turn to the exquisite sequence specificity of digestion by restriction enzymes, which can identify sequence polymorphisms down to a single nucleotide variation.
Using these design principles, we present here two alternative modes of detection that illustrate the wide range of genomic variations that may be detected using a single sensor. For large insertions or deletions (Fig 1: Mode I, left panel), a nanopore may be used to discriminate the raw change in DNA length caused by the presence or absence of this sequence according to the duration of translocation events. For short indels, mutations, or single nucleotide variations (SNVs) (Fig 1: Mode II, right panel), which are more difficult to identify solely by length as discussed above, we utilize a restriction enzyme. The sample is only cut in the presence (or absence) of the critical sequence, and subsequent detection in a nanopore reveals either one or two fragments in the nanopore according to the observed durations and blockage levels of translocation events.
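The Mode II logic (a single-nucleotide difference decides whether one or two fragments reach the pore) can be sketched in a few lines. This is a toy illustration, not the authors' pipeline: the two sequences are made-up placeholders rather than real amplicons, and the only enzyme fact used is that NaeI recognizes the palindromic site GCCGGC and cuts bluntly between GCC and GGC.

```python
def digest(seq, site, cut_offset):
    """Return fragment lengths after cutting seq at every occurrence of site.

    cut_offset is the cut position inside the site (3 for a blunt GCC^GGC cut).
    Only the given strand is searched, which suffices for a palindromic site.
    """
    lengths = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        cut = pos + cut_offset
        lengths.append(cut - start)
        start = cut
        pos = seq.find(site, pos + 1)
    lengths.append(len(seq) - start)
    return lengths


NAEI_SITE = "GCCGGC"  # NaeI recognition sequence
NAEI_CUT_OFFSET = 3   # blunt cut between GCC and GGC

# Toy amplicons (hypothetical): identical except for one nucleotide in the site.
LEFT, RIGHT = "AT" * 30, "TA" * 60
cut_variant = LEFT + "GCCGGC" + RIGHT    # site present, so two fragments
uncut_variant = LEFT + "GACGGC" + RIGHT  # one-base change removes the site

if __name__ == "__main__":
    print(digest(cut_variant, NAEI_SITE, NAEI_CUT_OFFSET))
    print(digest(uncut_variant, NAEI_SITE, NAEI_CUT_OFFSET))
```

The list of fragment lengths returned for each variant is what the nanopore readout would then have to tell apart: a single full-length population versus a pair of shorter ones.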
Event Diagram Discrimination of Sample Length and Composition
We first experimentally illustrate the practical length resolution of the nanopore platform for identifying sample length and composition. We analyzed samples containing mixtures of DNA fragments composed of one or two well-defined lengths. The resulting event diagrams create unique fingerprints that can be used to distinguish different lengths of DNA (Mode I) or whether or not a fragment of DNA has been cut (Mode II). Fig 2A–2E show event diagrams for 100 bp, 200 bp, 900 bp, 1000 bp, and 100+900 bp DNA in a single nanopore (diameter 4.8 nm, effective height 7 nm) at +300 mV bias (for additional examples, see Figs B-E in S1 File). Here, each translocation event is represented by its corresponding ion current event amplitude (ΔIB) and dwell time (tD). From comparison of Fig 2A and 2D, it is evident that insertions and deletions Δl several times larger than the base length (here: Δl:l0 = 9:1) are indeed easily distinguishable (Fig C in S1 File). Comparison of Fig 2A and 2B illustrates that Δl = 100 bp results in reasonably distinct event diagrams for l0 = 100 bp, which may be distinguished to >95% confidence with just a few events each, taking both dwell time and current amplitude into consideration (Fig D in S1 File). However, at l0 = 900 bp a minimum of several hundred events are required to confidently (>95%) differentiate l0 (Fig 2C) from l0 + Δl (1000 bp, Fig 2D), since their event diagrams overlap significantly (Fig E in S1 File). Returning to Eq 1, for Δl = 100 bp, we expect Δlog(tD) = 0.415 for l0 = 100 bp, and Δlog(tD) = 0.063 for l0 = 900 bp. For the data shown in Fig 2F, Δlog(tD) = 0.1 for l0 = 100 bp, and Δlog(tD) = 0.03 for l0 = 900 bp. The inability to easily and quickly discriminate the 900 bp DNA from the 1000 bp DNA demonstrates the practical limits set on Mode I sample identification according to the size of the insertion or deletion that must be detected.
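The two expected values quoted above can be reproduced by hand. The working below assumes that Eq 1 takes the form implied by the power law tD ∼ l^α, with base-10 logarithms and α = 1.38; this form is inferred from the quoted numbers rather than copied from the source.

```latex
\begin{align*}
\Delta\log_{10}(t_D) &= \alpha \log_{10}\!\left(1 + \frac{\Delta l}{l_0}\right) \\
l_0 = 100\ \text{bp},\ \Delta l = 100\ \text{bp}: \quad
\Delta\log_{10}(t_D) &= 1.38 \times \log_{10}(2) \approx 1.38 \times 0.3010 \approx 0.415 \\
l_0 = 900\ \text{bp},\ \Delta l = 100\ \text{bp}: \quad
\Delta\log_{10}(t_D) &= 1.38 \times \log_{10}\!\left(\tfrac{1000}{900}\right) \approx 1.38 \times 0.0458 \approx 0.063
\end{align*}
```

The same 100 bp difference therefore produces a shift roughly six to seven times smaller at l0 = 900 bp than at l0 = 100 bp, which is consistent with the much larger number of events needed to separate the 900 bp and 1000 bp samples.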
(a) 100 bp at 1 nM. (b) 200 bp at 1 nM. (c) 900 bp at 1 nM. (d) 1000 bp at 1 nM. (e) 1:1 combination of 100 bp and 900 bp, total concentration 2 nM. (f) Semilog(x) distributions of translocation dwell times for all samples (a)-(e). Translocations for all samples were collected in a single nanopore (4.8 nm diameter, effective thickness ~7 nm) with a +300 mV bias relative to trans (open pore current: 13 nA). To facilitate visualization of population density, a random white noise offset below the acquisition rate of this data (-2 μs < Δt < +2 μs, acquisition rate 250 kHz) has been added to each tD. http://dx.doi.org:/10.1371/journal.pone.0142944.g002
Fig 2E illustrates how Mode II may overcome these limitations by digesting DNA into fragments: here, a highly asymmetric ratio of lengths in a mixed sample (100+900 bp) clearly facilitates sample identification as compared to the full length 1000 bp DNA (Fig 2D). However, Mode II also presents a more challenging case for quantitative discrimination between an uncut and a cut sample. Whereas single-length samples can be identified using either their tD or ΔIB distribution (as shown in Fig 2F), the longer fragment in a cut sample may share significant overlap with the uncut sample. This is particularly true in the case of a highly asymmetric cut site.
….
(a) 2-D GMM for 1000 bp DNA fragment translocations. (b) 2-D GMM for 900+100 bp DNA fragment translocations. (c) Bayesian posterior estimates p(A|Θ) of correctly identifying a data set Θ as Case A, calculated for each increment of N points in Θ, repeated 1000 times (first 50 shown in gray) and averaged (blue), each using M = 1500 points in the model data set. (d) Bayesian posterior estimates p(B|Θ) of correctly identifying a data set Θ as Case B, calculated for each increment of N points in Θ, repeated 1000 times (first 50 shown in gray) and averaged (red), all using M = 1500 points in the model data set. (e) Bayesian posterior estimates p(A|Θ) for test data sets of N points given a model based on data set size M. Each point represents the average of 1000 separate bootstrap simulations. (f) Bayesian posterior estimates p(A|Θ) for test data sets of N points given a model based on data set size M. Each point represents the average of 1000 separate bootstrap simulations. Insets: range of N for which p(A|Θ) reaches 0.95. See Methods and S1 File for complete numerical simulation details. http://dx.doi.org:/10.1371/journal.pone.0142944.g003
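The classification summarized in this caption can be sketched as a two-model Bayesian comparison. The version below is a simplified illustration, not the authors' implementation: it uses 1-D Gaussian mixtures over log10 dwell time instead of the 2-D models of the figure, the mixture parameters in `MODEL_A` and `MODEL_B` are invented placeholders rather than fitted values, and equal priors are assumed.

```python
import math
import random


def gauss_pdf(x, mu, sigma):
    """Density of a 1-D normal distribution."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def mixture_loglik(data, components):
    """Log-likelihood of data under a mixture given as [(weight, mu, sigma), ...]."""
    return sum(
        math.log(sum(w * gauss_pdf(x, mu, s) for w, mu, s in components))
        for x in data
    )


def posterior_a(data, model_a, model_b, prior_a=0.5):
    """Posterior p(A | data) from Bayes' rule with two candidate models."""
    log_odds = (
        mixture_loglik(data, model_a) + math.log(prior_a)
        - mixture_loglik(data, model_b) - math.log(1.0 - prior_a)
    )
    # Numerically stable logistic function of the log posterior odds.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    return math.exp(log_odds) / (1.0 + math.exp(log_odds))


def events_until_call(events, model_a, model_b, threshold=0.95):
    """Return (n, label) once either posterior reaches threshold, else None."""
    for n in range(1, len(events) + 1):
        p_a = posterior_a(events[:n], model_a, model_b)
        if p_a >= threshold:
            return n, "A"
        if 1.0 - p_a >= threshold:
            return n, "B"
    return None


def sample(components, n, rng):
    """Draw n synthetic log10 dwell times from a mixture."""
    picks = rng.choices(components, weights=[w for w, _, _ in components], k=n)
    return [rng.gauss(mu, s) for _, mu, s in picks]


# Hypothetical models over log10(tD): Case A = uncut, Case B = cut into two fragments.
MODEL_A = [(1.0, 2.0, 0.3)]
MODEL_B = [(0.5, 1.9, 0.3), (0.5, 1.0, 0.3)]

if __name__ == "__main__":
    rng = random.Random(0)
    unknown = sample(MODEL_B, 200, rng)  # synthetic "unknown" sample
    for n in (5, 20, 50):
        print(n, round(posterior_a(unknown[:n], MODEL_A, MODEL_B), 4))
    print(events_until_call(unknown, MODEL_A, MODEL_B))
```

The `events_until_call` helper mirrors the idea of reporting how many events are needed before the posterior reaches 0.95. A fuller treatment would fit the mixtures to model data and average over bootstrap resamples, as the caption describes.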
……
(a) Diagram of the main steps in sample preparation, detection, and classification: PCR fragments from isolated pathogens are subjected to a restriction digest, which recognizes and cuts only one genomic variant. Nanopore translocations are used to classify the pathogen according to the combination of fragment lengths detected. (b) The mazG gene of the avirulent M. tuberculosis strain H37Ra is not cut by NaeI (942 bp), while the same gene in the closely related virulent strain H37Rv, which differs by only a single A-to-C mutation, is cut by NaeI (621 bp + 321 bp). (c) Gaussian mixture model (one component) fit to translocations of mazG fragments from H37Ra. (d) Gaussian mixture model (two components) fit to translocations of mazG fragments from H37Rv. (e) Posterior probabilities for correctly identifying the H37Ra and H37Rv strains as a function of number of translocation events collected from an unknown sample, simulated using bootstrap sampling from nanopore translocation data. (f) The parC gene of the multi-drug-resistant MRSA strain FPR3757 is not cut by BseRI (886 bp) due to a single C-to-A mutation, while the closely related and less resistant strain HOU-MR is cut by BseRI (640 bp + 245 bp). (g) Gaussian mixture model (one component) fit to translocations of parC fragments from FPR3757. (h) Gaussian mixture model (two components) fit to translocations of parC fragments from HOU-MR. (i) Posterior probabilities for correctly identifying the FPR3757 and HOU-MR strains as a function of number of translocation events collected from an unknown sample, simulated using bootstrap sampling from nanopore translocation data. http://dx.doi.org:/10.1371/journal.pone.0142944.g004
Conclusion
Solid-state nanopore-based biosensing is a rapidly growing field due to its practical and conceptual simplicity, portability and versatility. To date, few reports have demonstrated the utility of the method towards clinical diagnostic applications. Yet as we have shown here, nanopores are well-suited to make statistically robust diagnostic classifications among different DNA lengths with real single-molecule data, even in cases where the distributions significantly overlap. Utilizing a Bayesian statistical model, we have demonstrated that nanopore sensing can be used to discriminate among pathogens based on well-known genomic variations. Both large indels (Mode I) and short indels and single nucleotide variations (Mode II) can be targeted, the latter using proper sequence-specific digestion with off-the-shelf restriction enzymes. Furthermore, the Bayesian classifiers indicate the statistical confidence of each classification as a function of the number of nanopore events obtained in each measurement. Even at this preliminary stage of development we find that only a few tens of events (obtained in just a few minutes using a single pore) are sufficient to produce a statistically reliable result with well-defined and small error margins.
Our method is general and can be adapted to address many different “multiple-choice” clinical questions using a nanopore biosensor or other single molecule approaches. Future extensions of this work may seek to design and implement large panels of critical sites that represent the minimum sets necessary to characterize genomic variation for various applications in healthcare and research, and to develop additional sensing modalities. Although the primary design challenge currently remains linked to the location and availability of restriction digestion sites, we expect that the ongoing development of designer restriction enzymes, for example systems based on modular zinc fingers [27], TALENs [28], or CRISPR-like proteins, will provide additional design flexibility for this technique.
The nanopore fingerprinting approach presented here addresses clear needs in clinical molecular diagnostics for a rapid and simple sensor that can identify a wide range of genomic variation in pathogens to inform treatment options. We have shown here discrimination of both large and small scale genomic variations between pathogen strains, down to single-nucleotide variations. The large, flexible sample design space for lengths, cut sites, and enzyme selection at each critical locus ensures that the technique is highly customizable for different genomic variation panels that could profile pathogenicity, antibiotic resistance, or even sequence type. The inherent scalability, minimal sample requirements, speed, and simple readout of the nanopore platform would all facilitate on-site and perhaps even automated use: as successive events are recorded, an increasingly clear fingerprint of translocation times and blockage levels will permit online software to “call” the sample as soon as enough events have been accumulated. Our technique is highly portable and customizable, and the binary data would be readily transferable among different labs.