Genomic Pathogen Typing

Larry H. Bernstein, MD, FCAP, Curator



Genomic Pathogen Typing Using Solid-State Nanopores

Citation: Squires AH, Atas E, Meller A (2015) Genomic Pathogen Typing Using Solid-State Nanopores. PLoS ONE 10(11): e0142944.

Editor: Niyaz Ahmed, University of Hyderabad, INDIA

In clinical settings, rapid and accurate characterization of pathogens is essential for effective treatment of patients; however, subtle genetic changes in pathogens which elude traditional phenotypic typing may confer dangerous pathogenic properties such as toxicity, antibiotic resistance, or virulence. Existing options for molecular typing techniques characterize the critical genomic changes that distinguish harmful and benign strains, yet the well-established approaches, in particular those that rely on electrophoretic separation of nucleic acid fragments on a gel, have room for only incremental future improvements in speed, cost, and complexity. Solid-state nanopores are an emerging class of single-molecule sensors that can electrophoretically characterize charged biopolymers, and which offer significant advantages in terms of sample and reagent requirements, readout speed, parallelization, and automation. We present here the first application of nanopores for single-molecule molecular typing using length-based “fingerprints” of critical sites in bacterial genomes. This technique is highly adaptable for detection of different types of genetic variation; as we illustrate using prototypical examples including Mycobacterium tuberculosis and methicillin-resistant Staphylococcus aureus, the solid-state nanopore diagnostic platform may be used to detect large insertions or deletions, small insertions or deletions, and even single-nucleotide variations in bacterial DNA. We further show that Bayesian classification of test samples can provide highly confident pathogen typing results based on only a few tens of independent single-molecule events, making this method extremely sensitive and statistically robust.


Subtle genetic changes in bacteria can produce large variations in factors affecting pathogenicity, such as toxicity, antibiotic resistance, and virulence. These genetic variations are not only used to trace the epidemic and phylogenetic relationships among strains of bacteria, but are also critically important in clinical settings for proper patient diagnosis and treatment. Most existing approaches require sample incubation and growth over the course of multiple days prior to testing, and nearly all require expert handling of samples and interpretation of results. Traditional phenotypic typing techniques such as serotypes, biotypes, phage-types, and antibiograms lack the necessary sensitivity to distinguish between closely related pathogen strains, and therefore fail to adequately capture these critical variations for clinical applications. Gel-based techniques such as restriction fragment length polymorphism (RFLP) or cleaved amplified polymorphic sequences (CAPS) require a large amount of time and results are not easily compared or transferred among labs. Next-generation sequencing is an increasingly popular method of fully characterizing bacterial strains [1] and may be used for typing strains according to the sequences of a panel of housekeeping genes, as in multi-locus sequence typing (MLST) [2], but this approach is more commonly used to trace post hoc epidemic and phylogenetic relationships among clinical isolates. Furthermore, the complexity and quantity of sequencing data far exceeds the minimum information required to efficiently and accurately diagnose a patient. For example, bioinformatics studies suggest that a panel of just 30–50 single nucleotide variations (SNVs) could be used to uniquely identify thousands of strains of Mycobacterium tuberculosis [3, 4]. 
Yet SNVs are not the only source of variation among pathogens; polymorphisms from SNVs and short indels up to genetic changes as large as whole plasmids or sets of genes may be responsible for critical changes to pathogenicity. Thus there exists a clear clinical need for a novel approach to molecular typing that can quickly and simply screen patient samples for a panel of widely varying known genetic polymorphisms of dangerous pathogens.

Solid-state nanopores may be used to discriminate the lengths of unlabeled individual biopolymers such as DNA molecules across a wide range of lengths [5, 6]. Biopolymers are electrophoretically attracted and threaded through a voltage-biased nanoscale pore drilled in an ultrathin freestanding SiNx membrane [7, 8]. When a DNA molecule is threaded through a nanopore, it partially blocks the flow of ions moving through the pore, allowing real-time detection of the analyte by monitoring changes in the ion current. Nanopore sensing is biochemically simple, as it does not require labeling of the analyte with radioactive or fluorescent probes, yet it can be used to detect minute quantities of nucleic acid molecules, surpassing the sensitivity of bulk methods [8]. Moreover, nanopore sensing involves relatively simple instrumentation (primarily a current amplifier) and may be used to analyze thousands of molecules in just a few minutes, making this technique an ideal candidate for applications such as nucleic acid based diagnostics.

Here we describe and practice a novel detection scheme (Fig 1) for molecular typing of pathogens using solid-state nanopores, and demonstrate its ability to discriminate a wide range of critical genetic polymorphisms in closely related organisms with starkly different pathogenicities. In the first sensing mode of our approach (Mode I), large insertions or deletions are detected by directly classifying the length of DNA in the nanopore. In the second sensing mode (Mode II), small indels down to SNVs may be detected by sequence-specific digestion at the site of the polymorphism to produce either one or two DNA fragments, which are then detected in the nanopore. We first characterize the practical range of our nanopore system for detecting variation in DNA length, and show that fragment length differences are more readily apparent for shorter DNA lengths and for asymmetric cut sites. We then demonstrate that statistical analysis tools such as Bayesian classifiers, commonly used for automated classification, are highly effective for rapid and statistically robust discrimination among different lengths and combinations of DNA fragments translocating through a nanopore, even in cases where significant portions of these distributions overlap. We apply these techniques to demonstrate polymorphism discrimination down to the single nucleotide level in prototypical strains of Mycobacterium tuberculosis (virulent vs. avirulent) and Staphylococcus aureus (methicillin-resistant vs. multi-drug resistant). This highly versatile combination of rapid length and digest discrimination, spanning several orders of magnitude of possible genomic variation size, in a single, parallelizable device, could be extended to probe a large panel of critical sites within a genome for point-of-care determination of critical pathogenic properties and sequence typing.

Fig 1.  Two Principal Modes for Nanopore Discrimination of Pathogen Genomic Variation.

Mode I (left): Direct length detection according to analyte translocation dwell time and depth enables discrimination of longer vs. shorter fragments, i.e., whether or not an insertion or deletion is present. Mode II (right): Prior to translocation, samples are exposed to a restriction enzyme that cuts at the site of a SNV, short indel, or mutation. Detection of cleaved vs. uncleaved DNA fragments in the nanopore reveals whether or not the critical genomic variation is present.

Detection of DNA Sequence Polymorphisms in Solid-State Nanopores  

The simplest form of nanopore translocation analysis involves the measurement of the depth of each current blockade (ΔIB) and the dwell time of each molecule within the pore (tD). Both parameters have been shown to grow nonlinearly with DNA length, forming the basis for fragment length separation in the nanopore system. The statistical distributions of these independently measured quantities may be used to distinguish between analytes of different lengths, such as DNAs [5, 6, 9], or proteins having identical molecular weight but slightly different charge or 3D structure [10–13]. Variation in the translocation dwell time (tD) in solid-state nanopores, measured for different DNA lengths (l), is empirically described by a power law, tD ∝ l^α with α = 1.38 ± 0.02, which has been reproduced by multiple experimental approaches [5, 9, 14]. Using a log-scale distribution of translocation times to estimate the distribution of tD, note that the difference in log(tD) for two sequences (lengths l0 and l0 + Δl) is more apparent for shorter length l0 as compared with the insertions and deletions Δl (i.e. when Δl/l0 ∼ 1) according to Eq 1, in which log denotes the base-10 logarithm:

$$\Delta \log(t_D) = \log t_D(l_0 + \Delta l) - \log t_D(l_0) = \alpha \, \log\!\left(1 + \frac{\Delta l}{l_0}\right) \qquad (1)$$

If the presence of two fragment lengths must be identified from within a single sample, it is desirable that their distributions of ΔIB or tD should be as well-separated as possible. Furthermore, if the presence of a cut sample must be distinguished from an uncut sample, then by Eq 1 the peak produced by the shorter part of a cut sample will appear farther away from the uncut peak than the longer part of a cut sample. To statistically distinguish the samples, it is desirable for the peak of the shorter part to be as dissimilar as possible from the uncut peak. Therefore, asymmetrically cut DNA pieces from a restriction digest are more readily distinguished from the original uncut length than those produced by symmetrically positioned restriction sites, provided that the shorter piece is of sufficient length to be detected by the nanopore. In cases where separation between two similar length biopolymers (Δl/l0 ≪ 1) is required, the measured histograms of either ΔIB or tD may overlap significantly, making discrimination between these molecules difficult. Combinations of multiple fragment lengths within a sample pose additional challenges, as their more complicated distributions may overlap or otherwise preclude simple contour cluster separation.
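The cut-site argument above can be made concrete with a small calculation. The sketch below assumes only the power-law exponent quoted in the text (α = 1.38) and base-10 logarithms; the function names and the 1000 bp example are illustrative choices, not part of the published analysis. It reports how far, in log dwell time, each fragment peak sits from the uncut peak for a symmetric and an asymmetric cut.

```python
import math

ALPHA = 1.38  # dwell-time power-law exponent quoted in the text


def log_dwell_shift(fragment_bp, uncut_bp, alpha=ALPHA):
    """Offset in log10(tD) of a fragment peak relative to the uncut peak.

    Follows from tD proportional to l**alpha; negative means a shorter dwell time.
    """
    return alpha * math.log10(fragment_bp / uncut_bp)


def cut_peak_shifts(uncut_bp, cut_position_bp):
    """Return (shift of shorter piece, shift of longer piece) for one cut."""
    short = min(cut_position_bp, uncut_bp - cut_position_bp)
    long_piece = uncut_bp - short
    return (log_dwell_shift(short, uncut_bp),
            log_dwell_shift(long_piece, uncut_bp))


if __name__ == "__main__":
    # Hypothetical 1000 bp amplicon cut in the middle vs. near one end.
    for cut in (500, 100):
        short_shift, long_shift = cut_peak_shifts(1000, cut)
        print(f"cut at {cut} bp: short piece {short_shift:+.3f}, "
              f"long piece {long_shift:+.3f} (log10 units)")
```

Under these assumptions the asymmetric cut moves the short-fragment peak much farther from the uncut peak, while the long-fragment peak stays close to it, which is the overlap problem the text goes on to describe.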

In the context of sequence typing, identification of fragments by sizing will indicate the presence of specific insertions and deletions that may enhance or reduce pathogenicity or otherwise uniquely identify a pathogenic strain. Upper bounds on Δl are set by: 1) sample preparation parameters and limitations; for example, robust and fast PCR amplification is most easily achieved for fragment lengths of ~10^2–10^3 bp [15]; and 2) nanopore stability considerations; for example, nanopores are more frequently clogged by very long DNA (>20 kbp). Lower bounds on l0 are set by nanopore sensitivity; while several groups have demonstrated detection of small DNA fragments (<50 bp) [16], we find that a minimum l0 on the order of ~100 bp is more reliable since it is readily detectable in small nanopores with no additional modifications [5], producing an extremely small fraction of missed events due to the finite system bandwidth. Thus a reasonable design range for sequence typing fragments is ~100 bp minimum length for l0, ranging up to a few thousand base pairs maximum length for l0 + Δl. Many types of common genetic variations used for strain typing fall within this size range. For example, one complete IS6110 (insertion-like sequence element) insertion in M. tuberculosis is 1358 bp [17]. At the other end of this range, multi-drug resistant strains of methicillin-resistant S. aureus (MRSA) have many insertions and deletions in the range 47–643 bp that affect their pathogenicity [18]. To detect the smallest indels, which fall below the minimum detectable Δl, we turn to the exquisite sequence specificity of digestion by restriction enzymes, which can identify sequence polymorphisms down to a single nucleotide variation.

Using these design principles, we present here two alternative modes of detection that illustrate the wide range of genomic variations that may be detected using a single sensor. For large insertions or deletions (Fig 1: Mode I, left panel), a nanopore may be used to discriminate the raw change in DNA length caused by the presence or absence of this sequence according to the duration of translocation events. For short indels, mutations, or single nucleotide variations (SNVs) (Fig 1: Mode II, right panel), which are more difficult to identify solely by length as discussed above, we utilize a restriction enzyme. The sample is only cut in the presence (or absence) of the critical sequence, and subsequent detection in a nanopore reveals either one or two fragments in the nanopore according to the observed durations and blockage levels of translocation events.
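As a rough illustration of the Mode II idea, the sketch below digests a linear DNA string at a recognition site and returns the fragment lengths that a nanopore would then have to tell apart. The sequences are short, made-up toy strings (not the real gene sequences), and `digest_fragment_lengths` is a hypothetical helper. The recognition site GCCGGC, cut between GCC and GGC, is that of NaeI, the enzyme named in the Fig 4 caption.

```python
def digest_fragment_lengths(sequence, site, cut_offset):
    """Fragment lengths after cutting a linear sequence at every `site`.

    `cut_offset` is the number of bases into the site (top strand) at which
    the cut falls. Only the top strand is scanned, which is sufficient for
    palindromic sites such as GCCGGC.
    """
    sequence = sequence.upper()
    site = site.upper()
    cut_points = []
    start = sequence.find(site)
    while start != -1:
        cut_points.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cut_points + [len(sequence)]
    # Drop zero-length pieces that would arise from a cut at either end.
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


# Toy sequences (assumptions for illustration only): the two variants differ
# by a single base inside the recognition site.
LEFT, RIGHT = "AT" * 10, "TA" * 30
variant_with_site = LEFT + "GCCGGC" + RIGHT
variant_without_site = LEFT + "GCCGGA" + RIGHT

if __name__ == "__main__":
    print(digest_fragment_lengths(variant_with_site, "GCCGGC", 3))
    print(digest_fragment_lengths(variant_without_site, "GCCGGC", 3))
```

A single-base difference inside the site is enough to switch the outcome between one full-length fragment and two shorter ones, which is what makes the length readout informative at the SNV level.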

Event Diagram Discrimination of Sample Length and Composition

We first experimentally illustrate the practical length resolution of the nanopore platform for identifying sample length and composition. We analyzed samples containing mixtures of DNA fragments composed of one or two well-defined lengths. The resulting event diagrams create unique fingerprints that can be used to distinguish different lengths of DNA (Mode I) or whether or not a fragment of DNA has been cut (Mode II). Fig 2A–2E show event diagrams for 100 bp, 200 bp, 900 bp, 1000 bp, and 100+900 bp DNA in a single nanopore (diameter 4.8 nm, effective height 7 nm) at +300 mV bias (for additional examples, see Figs B-E in S1 File). Here, each translocation event is represented by its corresponding ion current event amplitude (ΔIB) and dwell time (tD). From comparison of Fig 2A and 2D, it is evident that insertions and deletions Δl several times larger than the base length (here: Δl:l0 = 9:1) are indeed easily distinguishable (Fig C in S1 File). Comparison of Fig 2A and 2B illustrates that Δl = 100 bp results in reasonably distinct event diagrams for l0 = 100 bp, which may be distinguished to >95% confidence with just a few events each, taking both dwell time and current amplitude into consideration (Fig D in S1 File). However, at l0 = 900 bp a minimum of several hundred events are required to confidently (>95%) differentiate l0 (Fig 2C) from l0 + Δl (1000 bp, Fig 2D), since their event diagrams overlap significantly (Fig E in S1 File). Returning to Eq 1, for Δl = 100 bp, we expect Δlog(tD) = 0.415 for l0 = 100 bp, and Δlog(tD) = 0.063 for l0 = 900 bp. For the data shown in Fig 2F, Δlog(tD) = 0.1 for l0 = 100 bp, and Δlog(tD) = 0.03 for l0 = 900 bp. The inability to easily and quickly discriminate the 900 bp DNA from the 1000 bp DNA demonstrates the practical limits set on Mode I sample identification according to the size of the insertion or deletion that must be detected.
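The expected values quoted above follow directly from the power law tD ∝ l^α. As a quick check, the sketch below evaluates Δlog(tD) = α·log10(1 + Δl/l0) with α = 1.38; the use of base-10 logarithms is an assumption, chosen because it reproduces the figures given in the text.

```python
import math


def delta_log_dwell(l0_bp, delta_bp, alpha=1.38):
    """Expected shift in log10(tD) between fragments of l0 and l0 + delta bp."""
    return alpha * math.log10(1 + delta_bp / l0_bp)


if __name__ == "__main__":
    print(round(delta_log_dwell(100, 100), 3))  # 0.415
    print(round(delta_log_dwell(900, 100), 3))  # 0.063
```

The same 100 bp insertion therefore produces a shift more than six times larger on a 100 bp background than on a 900 bp one, consistent with the difference in the number of events needed for confident discrimination.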

Fig 2.  Translocation Event Diagrams Uniquely Identify DNA Fragment Lengths in a Single Nanopore.

(a) 100 bp at 1 nM. (b) 200 bp at 1 nM. (c) 900 bp at 1 nM. (d) 1000 bp at 1 nM. (e) 1:1 combination of 100 bp and 900 bp, total concentration 2 nM. (f) Semilog(x) distributions of translocation dwell times for all samples (a)-(e). Translocations for all samples were collected in a single nanopore (4.8 nm diameter, effective thickness ~7 nm) with a +300 mV bias relative to trans (open pore current: 13 nA). To facilitate visualization of population density, a random white noise offset below the acquisition rate of this data (-2 μs < Δt < +2 μs, acquisition rate 250 kHz) has been added to each tD.

Fig 2E illustrates how Mode II may overcome these limitations by digesting DNA into fragments: here, a highly asymmetric ratio of lengths in a mixed sample (100+900 bp) clearly facilitates sample identification as compared to the full length 1000 bp DNA (Fig 2D). However, Mode II also presents a more challenging case for quantitative discrimination between an uncut and a cut sample. Whereas single-length samples can be identified using either their tD or ΔIB distribution (as shown in Fig 2F), the longer fragment in a cut sample may share significant overlap with the uncut sample. This is particularly true in the case of a highly asymmetric cut site.



Fig 3. Gaussian Mixture Models for Mode II Classification of 1000 bp vs. 900+100 bp DNA Fragments.

(a) 2-D GMM for 1000 bp DNA fragment translocations. (b) 2-D GMM for 900+100 bp DNA fragment translocations. (c) Bayesian posterior estimates p(A|Θ) of correctly identifying a data set Θ as Case A, calculated for each increment of N points in Θ, repeated 1000 times (first 50 shown in gray) and averaged (blue), each using M = 1500 points in the model data set. (d) Bayesian posterior estimates p(B|Θ) of correctly identifying a data set Θ as Case B, calculated for each increment of N points in Θ, repeated 1000 times (first 50 shown in gray) and averaged (red), all using M = 1500 points in the model data set. (e) Bayesian posterior estimates p(A|Θ) for test data sets of N points given a model based on data set size M. Each point represents the average of 1000 separate bootstrap simulations. (f) Bayesian posterior estimates p(A|Θ) for test data sets of N points given a model based on data set size M. Each point represents the average of 1000 separate bootstrap simulations. Insets: range of N for which p(A|Θ) reaches 0.95. See Methods and S1 File for complete numerical simulation details.
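To illustrate the kind of calculation the caption describes, here is a minimal sketch of a two-class Bayesian posterior built on Gaussian mixture likelihoods. It is not the authors' implementation: it works in one dimension (log dwell time only) rather than the 2-D models of the figure, assumes equal priors, and uses hand-picked, hypothetical mixture parameters and simulated events in place of fitted models and measured data.

```python
import math
import random


def gaussian_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def mixture_log_likelihood(events, components):
    """Sum of log densities of `events` under a mixture.

    `components` is a list of (weight, mean, sd) tuples.
    """
    total = 0.0
    for x in events:
        density = sum(w * gaussian_pdf(x, m, s) for w, m, s in components)
        total += math.log(max(density, 1e-300))  # guard against log(0)
    return total


def posterior_case_a(events, model_a, model_b, prior_a=0.5):
    """p(A | events), treating events as independent draws."""
    log_a = math.log(prior_a) + mixture_log_likelihood(events, model_a)
    log_b = math.log(1 - prior_a) + mixture_log_likelihood(events, model_b)
    diff = log_b - log_a
    if diff > 700:  # avoid overflow in exp(); posterior is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


def sample_events(components, n, rng):
    """Draw n simulated events from a mixture (stand-in for real data)."""
    weights = [w for w, _, _ in components]
    events = []
    for _ in range(n):
        _, mean, sd = rng.choices(components, weights=weights, k=1)[0]
        events.append(rng.gauss(mean, sd))
    return events


# Hypothetical models in arbitrary log10(tD) units, NOT fitted values:
# Case A = uncut (one peak), Case B = cut (long + short fragment peaks).
MODEL_A = [(1.0, 2.00, 0.30)]
MODEL_B = [(0.5, 1.95, 0.30), (0.5, 0.90, 0.30)]

if __name__ == "__main__":
    rng = random.Random(0)
    for n in (5, 10, 30):
        test_set = sample_events(MODEL_B, n, rng)
        p_a = posterior_case_a(test_set, MODEL_A, MODEL_B)
        print(f"N = {n:2d} events from the cut sample: p(A) = {p_a:.3g}")
```

Even though the long-fragment peak in the hypothetical Case B model overlaps the uncut peak almost completely, the occasional short-fragment event carries enough evidence that the posterior is expected to become decisive after a few tens of events, which mirrors the qualitative behavior reported in the caption.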



Fig 4. Gaussian Mixture Models of DNA Fragments for Actual Mode II Pathogen Typing at the SNV Level.

(a) Diagram of the main steps in sample preparation, detection, and classification: PCR fragments from isolated pathogens are subjected to a restriction digest, which recognizes and cuts only one genomic variant. Nanopore translocations are used to classify the pathogen according to the combination of fragment lengths detected. (b) The mazG gene of the avirulent M. tuberculosis strain H37Ra is not cut by NaeI (942 bp), while the same gene in the closely related virulent strain H37Rv, which differs by only a single A-to-C mutation, is cut by NaeI (621 bp + 321 bp). (c) Gaussian mixture model (one component) fit to translocations of mazG fragments from H37Ra. (d) Gaussian mixture model (two components) fit to translocations of mazG fragments from H37Rv. (e) Posterior probabilities for correctly identifying the H37Ra and H37Rv strains as a function of number of translocation events collected from an unknown sample, simulated using bootstrap sampling from nanopore translocation data. (f) The parC gene of the multi-drug-resistant MRSA strain FPR3757 is not cut by BseRI (886 bp) due to a single C-to-A mutation, while the closely related and less resistant strain HOU-MR is cut by BseRI (640 bp + 245 bp). (g) Gaussian mixture model (one component) fit to translocations of parC fragments from FPR3757. (h) Gaussian mixture model (two components) fit to translocations of parC fragments from HOU-MR. (i) Posterior probabilities for correctly identifying the FPR3757 and HOU-MR strains as a function of number of translocation events collected from an unknown sample, simulated using bootstrap sampling from nanopore translocation data.


Solid-state nanopore-based biosensing is a rapidly growing field due to its practical and conceptual simplicity, portability and versatility. To date, few reports have demonstrated the utility of the method towards clinical diagnostic applications. Yet as we have shown here, nanopores are well-suited to make statistically robust diagnostic classifications among different DNA lengths with real single-molecule data, even in cases where the distributions significantly overlap. Utilizing a Bayesian statistical model, we have demonstrated that nanopore sensing can be used to discriminate among pathogens based on well-known genomic variations. Both large indels (Mode I) and short indels and single nucleotide variations (Mode II) can be targeted using proper sequence-specific digestion with off-the-shelf restriction enzymes. Furthermore, the Bayesian classifiers indicate the statistical confidence of each classification as a function of the number of nanopore events obtained in each measurement. Even at this preliminary stage of development we find that only a few tens of events (obtained in just a few minutes using a single pore) are sufficient to produce a statistically reliable result with well-defined and small error margins.

Our method is general and can be adapted to address many different “multiple-choice” clinical questions using a nanopore biosensor or other single molecule approaches. Future extensions of this work may seek to design and implement large panels of critical sites that represent the minimum sets necessary to characterize genomic variation for various applications in healthcare and research, and to develop additional sensing modalities. Although the primary design challenge currently remains linked to the location and availability of restriction digestion sites, we expect that the ongoing development of designer restriction enzymes, for example systems based on modular zinc fingers [27], TALENs [28], or CRISPR-like proteins will provide additional design flexibility for this technique.

The nanopore fingerprinting approach presented here addresses clear needs in clinical molecular diagnostics for a rapid and simple sensor that can identify a wide range of genomic variation in pathogens to inform treatment options. We have shown here discrimination of both large and small scale genomic variations between pathogen strains, down to single SNVs. The large, flexible sample design space for lengths, cut sites, and enzyme selection at each critical locus ensures that the technique is highly customizable for different genomic variation panels that could profile pathogenicity, antibiotic resistance, or even sequence type. The inherent scalability, minimal sample requirements, speed, and simple readout of the nanopore platform would all facilitate on-site and perhaps even automated use: as successive events are recorded, an increasingly clear fingerprint of translocation times and blockage levels will permit online software to “call” the sample as soon as enough events have been accumulated. Our technique is highly portable and customizable, and the binary data would be readily transferable among different labs.
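One way such online "calling" might look in software is a sequential rule that accumulates evidence event by event and stops once a confidence threshold is crossed. The sketch below is an assumption-laden illustration, not the authors' software: the function name, the equal-prior starting point, the 0.95 threshold, and the per-event log-likelihood callables are all hypothetical.

```python
import math


def call_sample_online(event_stream, log_lik_a, log_lik_b,
                       threshold=0.95, max_events=1000):
    """Return (label, events_used) for a stream of translocation events.

    `log_lik_a` and `log_lik_b` map one event to its log-likelihood under
    each candidate model. Equal priors are assumed, so the running log-odds
    start at zero; a call is made once the posterior of either class reaches
    `threshold`.
    """
    limit = math.log(threshold / (1.0 - threshold))
    log_odds = 0.0
    used = 0
    for used, event in enumerate(event_stream, start=1):
        log_odds += log_lik_a(event) - log_lik_b(event)
        if log_odds >= limit:
            return "A", used
        if log_odds <= -limit:
            return "B", used
        if used >= max_events:
            break
    return "undetermined", used


if __name__ == "__main__":
    # Toy unit-variance Gaussian models centred at 0 (A) and 2 (B).
    def model_a(x):
        return -0.5 * (x - 0.0) ** 2

    def model_b(x):
        return -0.5 * (x - 2.0) ** 2

    print(call_sample_online([0.0, 0.0, 0.0], model_a, model_b))  # ('A', 2)
    print(call_sample_online([2.0, 2.0, 2.0], model_a, model_b))  # ('B', 2)
    print(call_sample_online([1.0, 1.0], model_a, model_b))       # ('undetermined', 2)
```

A rule of this form uses only as many events as the sample requires, so well-separated cases are called quickly while ambiguous ones keep collecting data up to a fixed limit.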



Author: Marcus W. Feldman, PhD

Insofar as the genetic evolution of modern humans is concerned, large scale SNP studies of worldwide populations have provided a consistent picture of a migration out of Africa that gave rise to the human populations of the other continents. This migration probably began 60–80 kya, was probably not continuous, and could have resulted in a division during the passage through the Levant en route from east Africa. One division may have moved in a more southerly direction towards south and east Asia, possibly to Australia, and eventually, 15–30 kya into the Americas. The other division may have “turned left” and moved towards Europe.

In this process, which we call the “serial founder” model of human expansion (refs. 1, 2), migration and demography probably had effects that constrained the subsequent action of natural selection on human genes.

  • Variation in skin pigmentation genes today provides some of the strongest signals of natural selection during this human expansion.
  • However, it is also likely that immune response genes, e.g., MHC genes, achieved their high levels of polymorphism in response to new pathogens encountered in the great expansion.

Many of the strongest signals of natural selection indicate the importance of the innovations of farming and pastoralism. The gene sequences involved in lactose tolerance and starch metabolism, for example, are strikingly different in groups that adopted dairying or farming, respectively, from hunter-gatherers, who did not.

From the analysis of SNPs, I take home two messages.

  • The first is that although some parts of the genome show clear signals of selection, most of our DNA perceived via SNPs does not.
  • The second is that population growth and migration have been major forces in determining the patterns of variation. Indeed, recent analyses of exome sequences confirm that the spectrum of rare allele frequencies is compatible only with recent and rapid population growth (ref. 3). Indeed, recent analyses of the 1000 genomes data, that is, data from whole genome sequencing of one thousand human genomes representing Africa (Yoruba), Europe (from Utah), and East Asia (China and Japan), identified only 35 non-synonymous SNPs from 33 genes as having been subject to recent adaptive selection (ref. 4).

The next phase of genomic analysis of humans, complete exome sequencing of large cohorts, or whole genome sequencing of samples from many representative populations, will focus more on two themes.

  • The first will be the role of rare alleles in human phenotypes, especially diseases. The previous phase, GWAS (genome-wide association studies), has been disappointing in revealing genetic “causes” of complex traits.
  • However, my view is that the second theme, the molecular genetics of gene regulation, and interaction of this regulation with the environment, is likely to have bigger payoffs, not only for determination of phenotypes, but also in showing where in the genome the strongest signals of selection lie. As more methylation profiles, small RNA patterns of interference, and other gene-regulatory analyses of whole genomes are completed, both the medical and evolutionary significance of DNA variation will become clearer.

Pemberton, T. J., D. Absher, M. W. Feldman, R. M. Myers, N. A. Rosenberg, and J. Z. Li. 2012. Genomic patterns of homozygosity in worldwide human populations. Am. J. Hum. Genet. 91: 275–292.

Genome-wide patterns of homozygosity runs and their variation across individuals provide a valuable and often untapped resource for studying human genetic diversity and evolutionary history. Using genotype data at 577,489 autosomal SNPs, we employed a likelihood-based approach to identify runs of homozygosity (ROH) in 1,839 individuals representing 64 worldwide populations, classifying them by length into three classes (short, intermediate, and long) with a model-based clustering algorithm. For each class, the number and total length of ROH per individual show considerable variation across individuals and populations. The total lengths of short and intermediate ROH per individual increase with the distance of a population from East Africa, in agreement with similar patterns previously observed for locus-wise homozygosity and linkage disequilibrium. By contrast, total lengths of long ROH show large inter-individual variations that probably reflect recent inbreeding patterns, with higher values occurring more often in populations with known high frequencies of consanguineous unions. Across the genome, distributions of ROH are not uniform, and they have distinctive continental patterns. ROH frequencies across the genome are correlated with local genomic variables such as recombination rate, as well as with signals of recent positive selection. In addition, long ROH are more frequent in genomic regions harboring genes associated with autosomal-dominant diseases than in regions not implicated in Mendelian diseases. These results provide insight into the way in which homozygosity patterns are produced, and they generate baseline homozygosity patterns that can be used to aid homozygosity mapping of genes associated with recessive diseases.


Pepperell, C. S., J. M. Granka, D. C. Alexander, M. A. Behr, L. Chui, J. Gordon, J. L. Guthrie, F. B. Jamieson, D. Langlois-Klassen, R. Long, D. Nguyen, W. Wobeser, and M. W. Feldman. 2011. Dispersal of Mycobacterium tuberculosis via the Canadian fur trade. Proc. Natl. Acad. Sci. USA 108: 6526–6531.

Patterns of gene flow can have marked effects on the evolution of populations. To better understand the migration dynamics of Mycobacterium tuberculosis, we studied genetic data from European M. tuberculosis lineages currently circulating in Aboriginal and French Canadian communities. A single M. tuberculosis lineage, characterized by the DS6Quebec genomic deletion, is at highest frequency among Aboriginal populations in Ontario, Saskatchewan, and Alberta; this bacterial lineage is also dominant among tuberculosis (TB) cases in French Canadians resident in Quebec. Substantial contact between these human populations is limited to a specific historical era (1710–1870), during which individuals from these populations met to barter furs. Statistical analyses of extant M. tuberculosis minisatellite data are consistent with Quebec as a source population for M. tuberculosis gene flow into Aboriginal populations during the fur trade era. Historical and genetic analyses suggest that tiny M. tuberculosis populations persisted for ∼100 y among indigenous populations and subsequently expanded in the late 19th century after environmental changes favoring the pathogen. Our study suggests that spread of TB can occur by two asynchronous processes: (i) dispersal of M. tuberculosis by minimal numbers of human migrants, during which small pathogen populations are sustained by ongoing migration and slow disease dynamics, and (ii) expansion of the M. tuberculosis population facilitated by shifts in host ecology. If generalizable, these migration dynamics can help explain the low DNA sequence diversity observed among isolates of M. tuberculosis and the difficulties in global elimination of tuberculosis, as small, widely dispersed pathogen populations are difficult both to detect and to eradicate.


Henn, B. M., C. R. Gignoux, M. Jobin, J. M. Granka, J. M. Macpherson, J. M. Kidd, L. Rodríguez-Botigué, S. Ramachandran, L. Hon, A. Brisbin, A. A. Lin, P. A. Underhill, D. Comas, K. K. Kidd, P. J. Norman, P. Parham, C. D. Bustamante, J. L. Mountain, and M. W. Feldman. 2011. Hunter-gatherer genomic diversity suggests a southern African origin for modern humans. Proc. Natl. Acad. Sci. USA 108: 5154–5162.

Africa is inferred to be the continent of origin for all modern human populations, but the details of human prehistory and evolution in Africa remain largely obscure owing to the complex histories of hundreds of distinct populations. We present data for more than 580,000 SNPs for several hunter-gatherer populations: the Hadza and Sandawe of Tanzania, and the !Khomani Bushmen of South Africa, including speakers of the nearly extinct N|u language. We find that African hunter-gatherer populations today remain highly differentiated, encompassing major components of variation that are not found in other African populations. Hunter-gatherer populations also tend to have the lowest levels of genome-wide linkage disequilibrium among 27 African populations. We analyzed geographic patterns of linkage disequilibrium and population differentiation, as measured by FST, in Africa. The observed patterns are consistent with an origin of modern humans in southern Africa rather than eastern Africa, as is generally assumed. Additionally, genetic variation in African hunter-gatherer populations has been significantly affected by interaction with farmers and herders over the past 5,000 y, through both severe population bottlenecks and sex-biased migration. However, African hunter-gatherer populations continue to maintain the highest levels of genetic diversity in the world.


Casto, A. M., and M. W. Feldman. 2011. Genome-wide association study SNPs in the human genome diversity project populations: does selection affect unlinked SNPs with shared trait associations? PLoS Genet. 7(1): e1001266.

Genome-wide association studies (GWAS) have identified more than 2,000 trait-SNP associations, and the number continues to increase. GWAS have focused on traits with potential consequences for human fitness, including many immunological, metabolic, cardiovascular, and behavioral phenotypes. Given the polygenic nature of complex traits, selection may exert its influence on them by altering allele frequencies at many associated loci, a possibility which has yet to be explored empirically. Here we use 38 different measures of allele frequency variation and 8 iHS scores to characterize over 1,300 GWAS SNPs in 53 globally distributed human populations. We apply these same techniques to evaluate SNPs grouped by trait association. We find that groups of SNPs associated with pigmentation, blood pressure, infectious disease, and autoimmune disease traits exhibit unusual allele frequency patterns and elevated iHS scores in certain geographical locations. We also find that GWAS SNPs have generally elevated scores for measures of allele frequency variation and for iHS in Eurasia and East Asia. Overall, we believe that our results provide evidence for selection on several complex traits that has caused changes in allele frequencies and/or elevated iHS scores at a number of associated loci. Since GWAS SNPs collectively exhibit elevated allele frequency measures and iHS scores, selection on complex traits may be quite widespread. Our findings are most consistent with this selection being either positive or negative, although the relative contributions of the two are difficult to discern. Our results also suggest that trait-SNP associations identified in Eurasian samples may not be present in Africa, Oceania, and the Americas, possibly due to differences in linkage disequilibrium patterns. This observation suggests that non-Eurasian and non-East Asian sample populations should be included in future GWAS.


Casto, A. M., J. Z. Li, D. Absher, R. Myers, S. Ramachandran, and M. W. Feldman. 2010. Characterization of X-linked SNP genotypic variation in globally distributed human populations. Genome Biol. 11:R10.

Background: The transmission pattern of the human X chromosome reduces its population size relative to the autosomes, subjects it to disproportionate influence by female demography, and leaves X-linked mutations exposed to selection in males. As a result, the analysis of X-linked genomic variation can provide insights into the influence of demography and selection on the human genome. Here we characterize the genomic variation represented by 16,297 X-linked SNPs genotyped in the CEPH human genome diversity project samples.
Results: We found that X chromosomes tend to be more differentiated between human populations than autosomes, with several notable exceptions. Comparisons between genetically distant populations also showed an excess of X-linked SNPs with large allele frequency differences. Combining information about these SNPs with results from tests designed to detect selective sweeps, we identified two regions that were clear outliers from the rest of the X chromosome for haplotype structure and allele frequency distribution. We were also able to more precisely define the geographical extent of some previously described X-linked selective sweeps.
Conclusions: The relationship between male and female demographic histories is likely to be complex as evidence supporting different conclusions can be found in the same dataset. Although demography may have contributed to the excess of SNPs with large allele frequency differences observed on the X chromosome, we believe that selection is at least partially responsible. Finally, our results reveal the geographical complexities of selective sweeps on the X chromosome and argue for the use of diverse populations in studies of selection.



1.  Cavalli-Sforza, L.L., and M.W. Feldman. 2003. The application of molecular genetic approaches to the study of human evolution. Nat. Genet. Supp. 33: 266–275.

2.  Henn, B. M., L. L. Cavalli-Sforza, and M. W. Feldman. 2012. The great human expansion. Proc. Natl. Acad. Sci. USA 109: 17758–17764.

3.  Keinan, A., and A. G. Clark. 2012. Recent explosive human population growth has resulted in an excess of rare genetic variants. Science 336: 740–743.

4.  Grossman, S. R., K. G. Andersen, I. Shlyakhter, S. Tabrizi, S. Winnicki, A. Yen, D. J. Park, D. Griesemer, E. K. Karlsson, S. H. Wong, M. Cabili, R. A. Adegbola, R. N. K. Bamezai, A. V. S. Hill, F. O. Vannberg, J. L. Rinn, 1000 Genomes Project, E. S. Lander, S. F. Schaffner, and P. C. Sabeti. 2013. Identifying recent adaptations in large-scale genomic data. Cell 152: 703–713.
