220 | VOLUME 46 | NUMBER 3 | MARCH 2014 | Nature Genetics | Analysis

Human populations have undergone major changes in population size in the past 100,000 years, including recent rapid growth. How these demographic events have affected the burden of deleterious mutations in individuals and the frequencies of disease mutations in populations remains unclear. We use population genetic models to show that recent human demography has probably had little impact on the average burden of deleterious mutations. This prediction is supported by two exome sequence data sets showing that individuals of west African and European ancestry carry very similar burdens of damaging mutations. We further show that for many diseases, rare alleles are unlikely to contribute a large fraction of the heritable variation, and therefore the impact of recent growth is likely to be modest. However, for those diseases that have a direct impact on fitness, strongly deleterious rare mutations probably do have an important role, and recent growth will have increased their impact.

Recent work has highlighted the impact of demographic history on the distribution of human genetic variation. Deep sequencing studies have identified huge numbers of very rare variants in human populations, which are the consequence of explosive population growth in the past 5,000 years1–6. Additionally, Europeans and east Asians have a greater fraction of high-frequency variants compared to Africans, probably because of an ancient bottleneck of non-African populations5,7–10. Given these observations, it is natural to ask whether recent demographic history has affected the burden of genetic disease in modern human populations3,6,11,12.
Keinan and Clark3 recently hypothesized that “some degree of genetic risk for complex disease may be due to this recent rapid increase in the number of rare variants in the human population.” A second important question concerns the relative importance of rare and common variants in causing disease13–15. If much of the genetic variation underlying disease is due to rare variants, it could help to explain the so-called ‘missing heritability’ of complex traits, implying that mapping approaches based on deep sequencing will be essential for the dissection of complex traits16.

RESULTS

The model

To address these questions, we analyzed a theoretical model with a large number of biallelic sites, each of which was subject to two-way mutation and natural selection against one of the alleles (Online Methods). We studied three types of demographic models thought to be relevant for human populations: (i) a bottleneck; (ii) exponential growth starting from a constant-sized population; and (iii) a complex demographic model for African Americans (including rapid recent growth) and European Americans (including two bottlenecks followed by growth) inferred by Tennessen et al.5. The main features of the Tennessen model are similar to those of other recent models9,10,17, but the Tennessen model uses a larger data set for parameter estimation.

Our main results focus on selection against semidominant (i.e., additive) alleles in which the three genotypes have fitnesses of 1, 1 − s/2 and 1 − s, where s is the selection coefficient, and selection against recessive alleles with genotype fitnesses 1, 1 and 1 − s. The effects of demography in these two models are qualitatively representative of those over the range of dominance coefficients (Supplementary Note). In addition to the simulation results shown here, further results and detailed theoretical analysis for all our key results are provided in the Supplementary Note.
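The two selection schemes described above can be written as one parameterized fitness function. The following is a minimal sketch, not the authors' simulation code: the dominance coefficient `h`, the function names and the use of Hardy-Weinberg genotype proportions are all assumptions introduced here for illustration.

```python
def genotype_fitnesses(s, h):
    """Return fitnesses for 0, 1 and 2 copies of the deleterious allele.

    h is a dominance coefficient (a name assumed for this sketch):
    h = 0.5 gives the semidominant case (1, 1 - s/2, 1 - s) and
    h = 0.0 gives the recessive case (1, 1, 1 - s).
    """
    return (1.0, 1.0 - h * s, 1.0 - s)


def site_load(q, s, h):
    """Expected reduction in fitness at one site with deleterious frequency q.

    Assumes Hardy-Weinberg genotype proportions (an assumption of this
    sketch, not a statement about the paper's Online Methods).
    """
    p = 1.0 - q
    w_ref, w_het, w_hom = genotype_fitnesses(s, h)
    mean_fitness = p * p * w_ref + 2.0 * p * q * w_het + q * q * w_hom
    return 1.0 - mean_fitness


if __name__ == "__main__":
    s, q = 0.01, 0.1
    print(site_load(q, s, h=0.5))  # semidominant: equals s * q
    print(site_load(q, s, h=0.0))  # recessive: equals s * q ** 2
```

Under these assumptions the semidominant load at a site is linear in the allele frequency, whereas the recessive load depends on the frequency of homozygotes, which is why the two cases can respond differently to the same demographic event.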
The impact of demographic changes on individual load

We focused first on the impact of demographic changes on individual load; that is, we wanted to understand whether demographic history has affected the burden of deleterious variation carried by a typical individual in a population. Individual load is related directly to the number of deleterious alleles carried by an individual or, for recessive mutations, the number of homozygous sites per individual (Online Methods and Supplementary Note).

Figure 1 illustrates the impact of a bottleneck and population growth on the numbers of deleterious variants when selection is strong (s = 1%). As we expected, these demographic events have a major impact on the number and frequency spectra of deleterious variants: the bottleneck causes a decrease in the total number of segregating sites in a population largely because of loss of rare variants, whereas the mean frequency of alleles that survive increases. Meanwhile, exponential growth causes a rapid increase in the number of segregating sites because of a major influx of rare variants but also causes a consequent drop in the mean frequency at segregating sites.

The deleterious mutation load is insensitive to recent population history

Yuval B Simons1,8, Michael C Turchin2,8, Jonathan K Pritchard2–5 & Guy Sella6,7

1Department of Ecology, Evolution and Behavior, Hebrew University of Jerusalem, Jerusalem, Israel. 2Department of Human Genetics, University of Chicago, Chicago, Illinois, USA. 3Howard Hughes Medical Institute, Stanford University, Stanford, California, USA. 4Department of Biology, Stanford University, Stanford, California, USA. 5Department of Genetics, Stanford University, Stanford, California, USA. 6Department of Ecology and Evolution, University of Chicago, Chicago, Illinois, USA. 7Present address: Department of Biological Sciences, Columbia University, New York, New York, USA. 8These authors contributed equally to this work.
Correspondence should be addressed to J.K.P. (pritch@stanford.edu) or G.S. (gsella@math.huji.ac.il).

Received 27 August 2013; accepted 16 January 2014; published online 9 February 2014; doi:10.1038/ng.2896

npg © 2014 Nature America, Inc. All rights reserved.

But despite these substantial shifts in the overall frequency spectrum, the impact on genetic load (namely, the mean number of deleterious variants per individual and thus the average fitness) is much more subtle. In the semidominant case, the individual burden is essentially unaffected by these demographic events (Fig. 1c,d). With growth, the increased number of segregating sites is balanced exactly by a decrease in the mean frequency (with the converse being true for the bottleneck model) so that the number of variants per individual stays constant. This kind of balance is predicted by classic mutation-selection balance models18 and can be shown to hold for general changes in population size, provided that selection is strong and deleterious alleles are at least partially dominant (Supplementary Note).

The behavior of the recessive model is more complicated (Fig. 1e,f). In the bottleneck model, the mean number of deleterious variants per individual drops by 60% as a result of the bottleneck. This drop is due to the loss of rare alleles. However, during the bottleneck, some deleterious alleles drift to higher frequencies11,19, contributing disproportionately to the number of homozygotes. This causes a transient increase in the number of deleterious homozygous sites per individual, i.e., the recessive load. Meanwhile, population growth has a less pronounced effect on recessive variation, leaving the mean number of deleterious alleles per individual unchanged but causing a slight decrease in load.

More generally, the manner in which demography affects individual load varies with the degree of dominance and the strength of selection (Fig. 2, Supplementary Note and Supplementary Table 1). The behavior of these models can be classified into three selection regimes: strong, weak and effectively neutral.

In the case of strong selection, i.e., where selection is much stronger than drift (approximately s ≥ 10⁻³ for semidominant mutations), deleterious variants are extremely unlikely to fix, and virtually all of the genetic load is due to segregating variation. In this range, we infer that human demography has had no impact on semidominant load (and, more generally, for mutations with at least some dominance component) and has had only small effects on recessive load.

The case of weak selection, where drift and selection have comparable effects, is more complex, as fixed alleles may contribute appreciably to load, and the steady-state load depends on population size20. However, the approach to the steady state is very slow, being limited by both the time to fixation (on the order of 4N generations) and the mutational input (on the order of 1/(2NU) generations, where U is the mutation rate). For both the semidominant and recessive cases, population growth is too recent to have substantially decreased the load. Recent growth increases the input of new deleterious mutations, but this effect is counterbalanced by the fact that the new deleterious mutations are proportionally rarer, as well as by the input of beneficial mutations. The bottleneck in Europeans is estimated to have occurred further in the past and at much lower population sizes5 (Supplementary Fig. 1), thus increasing its effect. In this case, the increase in drift causes segregating deleterious alleles to increase in frequency, sometimes reaching fixation, and results in a slight increase in load (Supplementary Fig. 2). The out-of-Africa bottleneck should thus lead to a slight increase of load in Europeans, most notably for recessive sites.
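The mutation-selection balance invoked above for strong selection can be illustrated with a deterministic recursion. This is a hedged sketch under stated assumptions, not the stochastic simulations behind Figure 1: it ignores drift entirely, assumes Hardy-Weinberg proportions, and the function name and the dominance parameter `h` are hypothetical. Its purpose is only to show that the equilibrium frequency of a strongly selected semidominant allele settles near 2u/s, a value that does not involve population size.

```python
def equilibrium_frequency(s, u, h=0.5, generations=20_000):
    """Iterate selection followed by two-way mutation until near equilibrium.

    s: selection coefficient; u: per-site mutation rate per generation;
    h: dominance coefficient (0.5 = semidominant). Deterministic sketch:
    population size does not appear anywhere in the recursion.
    """
    q = 0.0  # frequency of the deleterious allele
    w_het, w_hom = 1.0 - h * s, 1.0 - s
    for _ in range(generations):
        p = 1.0 - q
        mean_w = p * p + 2.0 * p * q * w_het + q * q * w_hom
        q = (p * q * w_het + q * q * w_hom) / mean_w  # selection
        q = q * (1.0 - u) + (1.0 - q) * u             # two-way mutation
    return q


if __name__ == "__main__":
    s, u = 0.01, 1e-8  # values quoted for Figure 1
    q_eq = equilibrium_frequency(s, u)
    # Expected deleterious alleles per diploid individual per megabase.
    print(2.0 * q_eq * 1e6)
    # Classical approximation for the semidominant case: 4u/s per site.
    print(4.0 * u / s * 1e6)
```

With the parameter values quoted for Figure 1 (s = 1%, mutation rate 10⁻⁸), both printed numbers are close to four deleterious alleles per individual per megabase. More segregating sites at lower frequency, or fewer sites at higher frequency, leave this product unchanged, which is the balance described in the text.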
In the effectively neutral range, where selection has negligible effects on the population dynamics, segregating variation contributes negligibly, and hence the load does not change with demography. Thus, across all three selection regimes, recent human demographic history is likely to have had virtually no impact on genetic load at partially dominant sites and only weak effects at recessive sites.

Analysis of exome data

To test these predictions, we analyzed two recent data sets of exome sequences from individuals of west African and European descent. Previous work comparing load in different populations has produced conflicting conclusions depending on the data set, choice of measures and functional annotations used. For example, Lohmueller et al.11 reported that there is “proportionally more deleterious variation in European than in African populations.” Similarly, Tennessen et al.5 found that European Americans had more nonreference genotypes when they used a conservative classification of deleterious sites but observed the opposite result when using a more liberal classification of sites (both observations were highly significant).

[Figure 1, panels a–f. Panels a,b: population size versus time (generations) for the bottleneck and growth models. Panels c–f: number per MB versus time since the beginning of the bottleneck or of growth (generations), for semidominant and recessive mutations. Curves: number of segregating sites; number of rare segregating sites; number of deleterious alleles per individual; number of rare deleterious alleles per individual; load (number of deleterious alleles per individual for semidominant sites, number of homozygous sites per individual for recessive sites).]

Figure 1 Time course of load and other key aspects of variation through a bottleneck and exponential growth. (a,b) The bottleneck (a) and exponential growth (b). (c–f) The expected number of variants and alleles per MB assuming semidominant mutations (c,d) or recessive mutations (e,f) with s = 1% and a mutation rate per site per generation of 10⁻⁸.

We first analyzed single-nucleotide variant (SNV) frequency data from a recent exome sequencing study of 2,217 African Americans (AAs) and 4,298 European Americans (EAs) sequenced at 15,336 protein-coding genes by Fu et al.6 (the allele frequencies are available from the National Heart, Lung, and Blood Institute (NHLBI) Grand Opportunity (GO) Exome Variant Server). Additionally, we analyzed exome data from 88 Yoruba (YRI) and 81 European (CEU) individuals collected by the 1000 Genomes Project21.

To test whether there are differences in load between individuals of west African and European descent, we considered the average number of derived alleles per individual at putatively deleterious segregating sites. For this purpose, we considered a site to be segregating if and only if it is variable within the combined sample of both populations. This definition ensures that the derived counts are comparable across populations.
Under a semidominant model, the number of derived alleles increases monotonically with the segregating genetic load. Thus, any difference in average load between populations would be apparent as a difference in the mean number of derived alleles per individual. Here we focused on an equivalent measure that also facilitates comparisons across different types of sites, namely, the mean derived allele frequency within functional classes. The mean derived allele frequency is equal simply to the number of derived alleles per individual divided by twice the number of segregating sites in that class, and so any difference in the mean number of derived alleles per individual will also be a difference in the mean derived frequencies. For sites that are either neutral or semidominant, our model predicts that the mean derived allele frequency should be virtually identical in Africans and Europeans (Supplementary Note and Supplementary Fig. 3). At recessive sites, we expect a slight increase in mean derived frequency in Africans compared to Europeans (Supplementary Fig. 3), but overall we expect any differences to be small. We obtained functional predictions of SNVs from PolyPhen-2, which employs a method that uses sequence conservation and structural information to infer which nonsynonymous changes are most likely to have functional consequences22 (Supplementary Table 2 shows similar analyses with other functional prediction methods). When using the functional predictions, we observed a strong bias: SNVs for which the genome reference carries the derived allele are much more likely to be classified as benign than SNVs for which the reference allele is ancestral—this observation was true even when we controlled for the overall population frequency (Supplementary Fig. 4). 
Hence, our analysis incorporates a correction to account for this bias; we obtained very similar results using a separate set of unpublished human-independent PolyPhen scores provided by the Sunyaev lab (Supplementary Tables 3 and 4). Figure 3 summarizes the results from the data of Fu et al.6. The mean allele frequency declines with increasing functional severity5 from 2.8% at noncoding SNVs to 0.6% at probably damaging SNVs, implying that there is selection against most SNVs with predicted damaging effects. More striking, however, is the finding that within each of the five functional categories, the mean allele frequencies—and hence the numbers of derived alleles per individual—are essentially identical in the two populations despite the very large size of the data sets (P > 0.05 for all five comparisons). Results from the 1000 Genomes Project data are qualitatively similar: we found no significant differences between the YRI and CEU populations in the numbers of derived alleles per individual in any functional category (Supplementary Table 5). In summary, these observations are consistent with our model predictions that load should be very similar in these populations. Our conclusions probably differ from those of previous studies in part because earlier studies used measures that are related to load but are also sensitive to other differences between the populations being compared (for example, the number of neutral segregating sites and the frequency spectrum) and in part because of the reference bias in the functional annotations accounted for here (Supplementary Note). We note that D. Reich, S. Sunyaev and colleagues have recently made similar observations regarding load in different populations (personal communication). 
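The summary statistic used in this analysis, the mean derived allele frequency within a functional class, follows directly from its definition in the text: the number of derived alleles per individual divided by twice the number of segregating sites in the class. The sketch below is illustrative only; the function names and the toy allele counts are assumptions, not values from either data set.

```python
def derived_alleles_per_individual(derived_counts, n_individuals):
    """Average number of derived alleles carried by one diploid individual.

    derived_counts holds, for each site that segregates in the COMBINED
    sample, the derived-allele count observed in one population's sample
    of n_individuals diploid individuals (a count of 0 is allowed).
    """
    return sum(derived_counts) / n_individuals


def mean_derived_allele_frequency(derived_counts, n_individuals):
    """Derived alleles per individual divided by twice the number of sites."""
    n_sites = len(derived_counts)
    if n_sites == 0:
        raise ValueError("at least one segregating site is required")
    per_individual = derived_alleles_per_individual(derived_counts, n_individuals)
    return per_individual / (2 * n_sites)


if __name__ == "__main__":
    # Toy data: three sites, four diploid individuals per population.
    population_a = [1, 3, 4]
    population_b = [0, 2, 2]  # first site is fixed ancestral in this sample
    for counts in (population_a, population_b):
        print(derived_alleles_per_individual(counts, 4),
              mean_derived_allele_frequency(counts, 4))
```

Because both populations are scored on the same list of sites, the denominator is identical for the two, so any difference in mean derived allele frequency reflects a difference in derived alleles per individual.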
[Figure 2, panels a (Semidominant) and b (Recessive). x axis: selection coefficient (0, 10⁻⁶ to 10⁻²). y axis: change in load (from −1 × 10⁻⁵ to 1 × 10⁻⁵). Curves: total, segregating and fixed, shown for European and African histories.]

Figure 2 Changes in load due to changes in population size during the histories of European and African Americans. (a,b) Semidominant (a) and recessive (b) sites. The blue lines denote the difference in load per base pair of DNA sequence in the present-day population compared to the ancestral (constant) population size as a function of the selection coefficient. The green and red lines show the difference in load due to segregating and fixed variants, respectively. The increase in load due to segregating variation in modern populations approximately cancels out the decrease in load due to fixed sites. The scale on the y axis is linear within the gray region and is logarithmic outside this region.

[Figure 3, "Mean derived allele frequencies at different types of SNVs". y axis: mean derived allele frequency (0.005 to 0.030). Categories: noncoding, synonymous, benign nonsynonymous, possibly damaging, probably damaging. Series: African American, European American. Rows labeled "Number per individual, AA" and "Number per individual, EA" are printed beneath the categories (values as extracted: 21,421; 21,345; 15,401; 15,231; 1,682; 2,002; 1,969; 1,695; 5,373; 5,338).]

Figure 3 Observed mean allele frequencies in AAs and EAs at various classes of SNVs. The plot shows the mean frequencies in each population (± 2 s.d.) using exome sequence data from Fu et al.6. Here a site is considered a SNV if it is segregating in the combined AA-EA sample of 6,515 individuals. The functional classifications of sites are from PolyPhen-2 (ref. 22) with bias-correcting modifications. The AA and EA mean frequencies are essentially identical within all five functional categories (P > 0.05).
The impact of demography on genetic architecture

Although changes in population size have had little impact on the average load carried by individuals, growth has greatly increased the number of rare variants in populations. So do rare variants have a greater (and substantial) role in the genetics of disease as a result of recent growth (Fig. 4)? Given the differences in population history, do higher-frequency variants have a greater role in Europeans and Asians than in Africans? The answers to these questions are of practical importance because different study designs may be needed to identify rare variants13,15,16,23.

To study these questions, we computed the contributions of different allele frequencies to the heritable phenotypic variation among individuals in the population, namely x(1 − x)f(x)/2, where f(x) is the probability that a derived allele is at frequency x given the demographic model and selection coefficient. These distributions show the fraction of genetic variance for a disease that is contributed by alleles below frequency x for the simplest case in which the loci underlying the trait all have the same effect size and selection coefficient and are all semidominant (Supplementary Note). In practice, we anticipate that variants underlying a given disease would have a variety of selection coefficients and effect sizes, in which case the overall distribution would be an appropriately weighted mixture of distributions for different selection coefficients. Of note, here we consider the proportional contribution of variants at different frequencies, and thus these results should hold regardless of the number of loci underlying variation in the trait.

Analysis of this model shows several interesting points. For effectively neutral or weakly deleterious sites (Fig. 4a), only a small fraction of the total variance comes from very rare alleles: although there are many rare alleles, each one contributes very little to population variance and individual load. The same is true for recessive variation across almost the entire range of selection coefficients (Supplementary Note and Supplementary Fig. 5). Likewise, if we assume that the frequency density f(x) follows the frequency spectrum observed at all nonsynonymous sites classified as probably damaging22, then under the same model, it is still only a modest fraction of the genetic variance that is due to rare alleles (Fig. 4b and ref. 5). Meanwhile, in all of these cases, the out-of-Africa bottleneck increases the contribution of intermediate-frequency alleles to the genetic variance (Fig. 4a–c); for example, at probably damaging sites, 62% of the variance in EAs is contributed by alleles with minor allele frequency above 10% as compared to only 49% in AAs.

It is only for the case of strong, dominant selection that very rare variants (<0.1%) become important (Fig. 4c,d). For example, for a selection coefficient of 1%, most of the variation is due to rare alleles that arose within the recent exponential-growth phase. As a result, the contribution of extremely rare variants is much greater than it would have been in the absence of growth; for example, in AAs and EAs, 80% and 65% of the variance, respectively, is due to alleles below frequency 0.1% compared to just 25% in the constant population model.

In practice, the genetic variants that contribute to a complex trait probably have a range of selection coefficients (s) and a range of effect sizes (a) on the phenotype in question (Supplementary Note).
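The variance calculation described above can be sketched for a discretized frequency spectrum. The code below is an illustration under explicit assumptions rather than the authors' procedure: the function name is hypothetical, the spectrum is indexed by derived-allele count in a sample of chromosomes, and the example uses the standard constant-size neutral spectrum (density proportional to 1/x), which is not one of the demographic models analyzed in the paper.

```python
def rare_variance_fraction(spectrum, n_chromosomes, maf_threshold):
    """Fraction of genetic variance from alleles with MAF below a threshold.

    spectrum[i - 1] is the relative density f of sites whose derived allele
    has count i among n_chromosomes (i = 1 .. n_chromosomes - 1). Each
    frequency class contributes x * (1 - x) * f(x) / 2, as in the text,
    assuming equal effect sizes at all sites.
    """
    total = 0.0
    rare = 0.0
    for i, f in enumerate(spectrum, start=1):
        x = i / n_chromosomes
        contribution = x * (1.0 - x) * f / 2.0
        total += contribution
        minor_count = min(i, n_chromosomes - i)
        if minor_count / n_chromosomes < maf_threshold:
            rare += contribution
    return rare / total


if __name__ == "__main__":
    n = 1000
    # Assumed example spectrum: neutral, constant population size, f ~ 1/x.
    neutral = [1.0 / i for i in range(1, n)]
    print(rare_variance_fraction(neutral, n, 0.10))
    print(rare_variance_fraction(neutral, n, 0.001 + 1e-9))
```

For the assumed neutral spectrum, alleles with minor allele frequency below 10% account for roughly one fifth of the variance, in line with the statement that rare alleles contribute little when selection is weak or absent. Substituting a spectrum skewed toward very low frequencies, as expected under strong selection and recent growth, raises that fraction.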
When

[Figure 4, panels a–f. Panels a (Weak selection), b (Data: probably damaging sites) and c (Strong selection): cumulative contribution to variance versus minor allele frequency (0 to 0.5 in a,b; 0 to 0.03 in c). Panel d (Variance from rare alleles (%)): contribution of rares (%) versus selection coefficient (10⁻⁶ to 10⁻²). Panel e (Genetic variance per site): variance versus selection coefficient, for constant effect size and for effect size ∝ s. Panel f (Variance from rare alleles (%)): contribution of rares (%) versus correlation between selection and effect size, ranging from effect size independent of s to effect size ∝ s. Series: African, European, Constant.]

Figure 4 Predicted effect of demography on the genetic architecture of disease risk. All plots (a–f) assume an additive trait and, with the exception of b, are based on simulations with semidominant selection under the Tennessen et al.5 demographic model. Results for the constant population size model are also provided for comparison. The upper plots (a–c) show the cumulative fractions of genetic variance due to alleles at frequency