Genes Genet. Syst. (2018) 93, p. 149–161
Detecting ongoing selective sweeps

A new inference method for detecting an ongoing selective sweep

Naoko T. Fujito¹, Yoko Satta¹, Toshiyuki Hayakawa²,³ and Naoyuki Takahata¹

¹ School of Advanced Sciences, SOKENDAI (The Graduate University for Advanced Studies), Shonan Village, Hayama, Kanagawa 240-0193, Japan
² Graduate School of Systems Life Sciences and ³ Faculty of Arts and Science, Kyushu University, 744 Motooka, Nishi-ku, Fukuoka 819-0395, Japan

(Received 1 March 2018, accepted 11 June 2018; J-STAGE Advance published date: 30 September 2018)

A simple method was developed to detect signatures of ongoing selective sweeps in single nucleotide polymorphism (SNP) data. Based largely on the traditional site frequency spectrum (SFS), the method additionally incorporates linkage disequilibrium (LD) between pairs of SNP sites and uniquely represents both SFS and LD information as hierarchical "barcodes." This barcode representation allows the identification of a hitchhiking genomic region surrounding a putative target site of positive selection, or a core site. Sweep signals at linked neutral sites are then measured by the proportion ($F_c$) of derived alleles within the hitchhiking region that are linked in the derived allele group defined at the core site. In measuring $F_c$ or intra-allelic variability in an informative way, certain conditions for derived allele frequencies are required, as illustrated with the human ST8SIA2 locus. Coalescent simulators with and without positive selection are used to assess the false-positive and false-negative rates of the $F_c$ statistic. To demonstrate its power, the method was further applied to the LCT, OCA2, EDAR, SLC24A5 and ASPM loci, which are known to have undergone positive selection in human populations. Overall, the method is powerful and can be used to identify core sites responsible for ongoing selective sweeps.
Key words: hitchhiking, human evolution, linkage disequilibrium, population genomics, site frequency spectrum

Edited by Ryosuke Kimura
* Corresponding author. E-mail: [email protected]
DOI: http://doi.org/10.1266/ggs.18-00008

INTRODUCTION

Population genomics has identified a large number of genetic variants that are associated with either global or local adaptations of species, in particular of anatomically modern humans (e.g., Li et al., 2014; Schrider and Kern, 2017). As reviewed in Lachance and Tishkoff (2013), Vitti et al. (2013) and Fan et al. (2016), there are several approaches for detecting positive Darwinian selection from DNA polymorphisms. One approach is based on the allele frequency distribution or the site frequency spectrum (SFS) and uses either summary statistics (e.g., Watterson, 1975; Tajima, 1989; Fu and Li, 1993; Fay and Wu, 2000; Zeng et al., 2006; Field et al., 2016) or the entire allele frequency distribution (e.g., Sawyer and Hartl, 1992; Bustamante et al., 2001; Nielsen et al., 2005; Gutenkunst et al., 2009; Pavlidis et al., 2013). However, when positive selection is ongoing, the allele frequency at a selected site may not be sufficiently high, which naturally results in reduced statistical power of the SFS-based methods. It is also known that the inference of positive selection made by these SFS-based methods tends to be muddled by the confounding effect of demography if they attempt to detect deviations from the standard neutral model of constant population size. In addition, these methods often assume no recombination within a locus but free recombination between loci or their loose linkage, and are extremely sensitive to these assumptions on recombination as linkage inflates the variance of allele frequencies (Bustamante et al., 2001).

Li (2011) proposed an alternative method that uses the maximum frequency of derived alleles in a sample as an indicator of unbalanced coalescent trees. This method relies on the fact that the probability of unbalanced basal branches is extremely low under neutrality and is independent of temporal changes in population size (see also Yang et al., 2018 for an extension to a varying size population, and Ferretti et al., 2017, who provided a systematic analysis of the impact of the waiting times and the branching order of coalescent events on the SFS). One challenge, however, arises from the fact that the required imbalance must be strong in a large sample. In addition, the power of the method is naturally weakened when a selective sweep is partial and incomplete.

The long-range haplotype (LRH) test is also commonly used for detecting sweep signals (Sabeti et al., 2002; Vitti et al., 2013). This approach relies on undisrupted long-range associations with neighboring polymorphisms, or the relationship between an allele's frequency at a focal site and the extent of linkage disequilibrium (LD) surrounding it (Sabeti et al., 2002), and may be classified into "haplotype" tests in contrast to SFS tests (Zeng et al., 2006). However, it appears that LRH-based methods tend to miss such signals that are localized in small genomic regions due to nearby recombination hotspots.

In the present paper, we develop a new method that permits the detection of ongoing selective sweeps. Although a sweep can be either "hard" (Maynard Smith and Haigh, 1974; Kaplan et al., 1989) or "soft" (Innan and Kim, 2004; Hermisson and Pennings, 2005; Przeworski et al., 2005), we here presume a soft sweep for generality as it can accommodate the theoretical framework to the case of a single variant on which positive selection starts to act. In particular, we are concerned with the situation under which the size of two basal branches in genealogy may not be particularly distorted, but wherein one descendant allelic lineage has expanded rapidly. We demonstrate by simulation that such lineage-specific expansion is seldom observed under the standard neutral model (a low false-positive rate) as well as even under certain demographic models of neutral variants subjected to population bottleneck and expansion. Although we do not address the reverse problem that positive selection may severely bias estimates of demographic parameters (for which readers may refer to Schrider et al., 2016), we examine whether our method can equally exhibit a low false-negative rate when positive selection in fact operates. For this purpose, we conduct coalescent simulation with selection and further apply our method to well-known cases of selective sweeps.

AN INFERENCE METHOD

We consider $S$ bi-allelic single nucleotide polymorphism (SNP) sites in a sample of $n$ homologous chromosomes randomly taken from a population. For simplicity, these SNP sites are numbered from the left end of a genomic region under study. To summarize such data, it is convenient to define an indicator matrix $\{\chi_{kl}\}$, the $(k,l)$ element of which takes the value of 1 if the $k$-th SNP site of the $l$-th chromosome ($1 \le k \le S$ and $1 \le l \le n$) possesses a derived allele, otherwise being assigned as 0 (Table 1). The number of derived alleles at the $k$-th SNP site in a sample is then counted simply as

$$n_k = \sum_{l=1}^{n} \chi_{kl}. \tag{1}$$

If $n_k = 1$, the site $k$ is a singleton (or the site has only one derived allele segregating in a sample) and the total number of such sites across the sampled genomic region is denoted by $\xi_1$. Likewise, if $n_k = 2$, the site $k$ is a doubleton (or the site has two segregating derived alleles) and the total number of such sites is denoted by $\xi_2$. Here we follow Fu (1995), who used the symbol $\xi_i$ ($1 \le i \le n - 1$) to express the random number of SNP sites with each exhibiting $i$ derived alleles segregating in a sample: $\xi_i$ = the number of SNP sites at each of which $n_k = i$ for $1 \le k \le S$. By definition, the total number of segregating sites is equal to the assumed number of SNP sites and expressed as $S = \sum_{i=1}^{n-1} \xi_i$, and the total number of derived alleles is expressed as $L = \sum_{i=1}^{n-1} i\xi_i$, where $i\xi_i$ is the scaled SFS (Fu, 1995; Ronen et al., 2013). Under the standard neutral model of constant effective population size $N_e$, the mean of $\xi_i$ is given by $E\{\xi_i\} = \theta/i$ for all possible values of $i$ (Sawyer and Hartl, 1992; Charlesworth and Charlesworth, 2012) and the mean of $i\xi_i$ by $E\{i\xi_i\} = \theta$, where $\theta = 4N_e u$ and $u$ is the mutation rate per region per generation. The means of $S$ and $L$ therefore reduce to $E\{S\} = \theta \sum_{i=1}^{n-1} 1/i = \theta a_n$ and $E\{L\} = \theta(n-1)$ (Watterson, 1975; Zeng et al., 2006). Moreover, we have $E\{\pi\} = \theta$ for the pairwise mean of nucleotide differences, defined by $\pi = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} i(n-i)\xi_i$ (Nei and Li, 1979; Tajima, 1989), and $E\{H\} = \theta$ for $H = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} i^2 \xi_i$ (Fay and Wu, 2000; Ferretti et al., 2017). Notably, these mean values are all independent of recombination rates, although the variances are not.

Table 1. Example of the site frequency spectrum $\{\xi_i\}$ = ($\xi_1 = 2$, $\xi_2 = 3$, $\xi_3 = 2$, $\xi_4 = 2$, $\xi_5 = 0$) with the total number of segregating sites $S = 9$ and the sample size $n = 6$.

                     SNP sites
                          core
          1  2  3  4  5  6  7  8  9
chr 1     1  0  1  1  0  1  1  1  1
chr 2     0  0  1  0  0  1  1  1  1
chr 3     0  0  0  0  0  1  1  1  0
chr 4     0  1  0  1  1  0  0  1  1
chr 5     0  0  0  0  1  0  0  0  1
chr 6     0  0  0  0  0  0  0  0  0
$n_k$     1  1  2  2  2  3  3  4  4
$n_{rk}$  1  0  2  1  0  3  3  3  2
$g_k$     3  4  3  4  5  3  3  4  5

The core SNP site ($r$) is arbitrarily set at $r = 6$ with $n_r = 3$. The values of $n_k$ and $n_{rk}$ for $1 \le k \le 9$ determine the barcodes at these nine SNP sites, retaining their spatial positions. The conditions for recombinants are $n_{rk} > 0$, $n_k > n_{rk}$, $n_r > n_{rk}$ and $n > g_k = n_k + n_r - n_{rk}$ as in Eq. (3a) in the text.
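The bottom rows of Table 1 can be re-derived directly from the 0/1 matrix. The following is a minimal Python sketch, not the authors' code: the function names are illustrative, the haplotypes are stored one string per chromosome (so rows are chromosomes and columns are sites, the transpose of the $\chi_{kl}$ indexing), and the recombinant check simply applies the four conditions quoted in the note to Table 1.

```python
# Table 1: one string per chromosome (chr 1..6), one character per SNP site
# (sites 1..9); '1' marks a derived allele.
TABLE1 = [
    "101101111",
    "001001111",
    "000001110",
    "010110011",
    "000010001",
    "000000000",
]


def derived_counts(haps):
    """n_k of Eq. (1): number of derived alleles at each site k."""
    return [sum(int(h[k]) for h in haps) for k in range(len(haps[0]))]


def site_frequency_spectrum(n_k, n):
    """xi_i for i = 1..n-1: number of sites with exactly i derived alleles."""
    return {i: n_k.count(i) for i in range(1, n)}


def linked_counts(haps, r):
    """n_rk: derived alleles at site k on chromosomes that also carry the
    derived allele at the core site r (1-based, as in the text)."""
    core = [h for h in haps if h[r - 1] == "1"]
    return [sum(int(h[k]) for h in core) for k in range(len(haps[0]))]


def recombinant_sites(haps, r):
    """1-based sites satisfying all four conditions in the note to Table 1."""
    n = len(haps)
    n_k = derived_counts(haps)
    n_rk = linked_counts(haps, r)
    n_r = n_k[r - 1]
    hits = []
    for k, (nk, nrk) in enumerate(zip(n_k, n_rk), start=1):
        g_k = nk + n_r - nrk
        if nrk > 0 and nk > nrk and n_r > nrk and n > g_k:
            hits.append(k)
    return hits


n = len(TABLE1)
n_k = derived_counts(TABLE1)
xi = site_frequency_spectrum(n_k, n)
n_rk = linked_counts(TABLE1, r=6)
g_k = [nk + n_k[6 - 1] - nrk for nk, nrk in zip(n_k, n_rk)]

print("n_k  =", n_k)
print("xi   =", xi)
print("S, L =", sum(xi.values()), sum(i * x for i, x in xi.items()))
print("n_rk =", n_rk)
print("g_k  =", g_k)
print("recombinant sites:", recombinant_sites(TABLE1, r=6))
```

With the core at $r = 6$, the sketch reproduces the $n_k$, $n_{rk}$ and $g_k$ rows of Table 1 and flags sites 4 and 9 as the only ones meeting all four conditions.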
In a large sample, it is likely that $\xi_i = 0$ for many large $i$ values. For this reason, we bin individual SNP sites into the following eight classes, although the number of such frequency classes depends on sample size $n$. Because of its preponderance, the first class ($c_1$) consists solely of singletons and is expected to have the mean of $E\{\xi_1\} = \theta$ (Fu and Li, 1993). The second class ($c_2$) consists of doubletons and tripletons with $E\{\xi_2 + \xi_3\} = 5\theta/6$. Likewise, the remaining classes are $c_3$ with $E\{\sum_{i=4}^{9} \xi_i\}$, $c_4$ with $E\{\sum_{i=10}^{25} \xi_i\}$, $c_5$ with $E\{\sum_{i=26}^{68} \xi_i\}$, $c_6$ with $E\{\sum_{i=69}^{185} \xi_i\}$ and $c_7$ with $E\{\sum_{i=186}^{503} \xi_i\}$, each of which is close to $\theta$ because $E\{\xi_i\} = \theta/i$, and $c_8$ with $E\{\sum_{i=504}^{1007} \xi_i\} \approx 0.69\theta$ if $n = 1008$ as for the East Asia meta-population in the 1000 Genomes Project database (1000 Genomes Project Consortium, 2015). However, if $n \ge 1369$ and $\ge 3720$, we additionally need classes up to $c_9$ and $c_{10}$, respectively, whereas if $n = 68$, we need only the first five classes. In any case, we sort each individual SNP site $k$ or SFS into these classes with nearly equal weight. This strategy has been previously utilized, for example, in estimating the distribution of fitness effects for amino acid-altering mutations (Eyre-Walker et al., 2006). In our method, however, this hierarchical binning of SFS plays an essential role in extracting information on "intra-allelic variability" (Slatkin and Rannala, 1997). In this context, it is instructive to recall that the higher the allele frequency, the older the mean age of an allele (Kimura and Ohta, 1973; Slatkin and Rannala, 2000).

In human populations, the expected rough uniformity of SFS over the eight classes is often severely violated. The SFS generally reveals an excess of rare as well as intermediate or high allele frequencies, thereby supporting demographic models of recent population growth and past bottlenecks, in particular in non-African populations (e.g., Schaffner et al., 2005; Gutenkunst et al., 2009; Liu and Fu, 2015; Terhorst et al., 2017). Alternatively, positive selection may leave similar signatures at linked neutral sites (e.g., Kaplan et al., 1989; Braverman et al., 1995; Hermisson and Pennings, 2005; Evans et al., 2006; Ronen et al., 2013; Schrider and Kern, 2017). In the following analysis, we inquire whether and how we can isolate the effect of positive selection from that of demographic causes.

We pay special attention to one particular SNP site, henceforth termed a core site or the $r$-th SNP site, which might be a target of positive selection. This core site is assumed to be given or of interest, and does not require any a priori knowledge about its evolutionary aspects. Our aim is to examine whether there exists a genomic region hitchhiking and in linkage disequilibrium (LD) with derived alleles at a core site. As a rule, the tighter the linkage, the stronger the hitchhiking effect; also, the broader the genomic region, the more recent the positive selection. There are $n_r = \sum_{l=1}^{n} \chi_{rl}$ derived alleles at a core site. Using these $n_r$ derived alleles and $n - n_r = \sum_{l=1}^{n} (1 - \chi_{rl})$ ancestral alleles at a core site, we divide our sample of $n$ homologous chromosomes into two mutually exclusive groups: derived and ancestral allele groups. The derived allele frequency at the core site is equal to $f_r = n_r/n$ and the frequency class ($c_r$) to which the core site belongs is also specified by $n_r$. For example, if the site is located at $r = 6$ as in Table 1, we have $n_r = 3$, which belongs to $c_2$, and divide the sample into the derived allele group of (1,2,3) chromosomes and the ancestral allele group of (4,5,6) chromosomes. This definition of allele groups and the following procedure can be extended, without any change, to the case where more than one SNP site determines two core haplotypes. Figure 1 shows one such example in which two core haplotypes, CGC vs. non-CGC, are defined at three SNP sites (rs3759916, rs3759915 and rs3759914) that are located in the promoter region of the human sialyltransferase (ST8SIA2) locus (Fujito et al., 2018). This gene is schizophrenia-associated and expressed preferentially in the brain, with the level being largely determined by the promoter SNPs. It is suggested that the expression level is a genetic determinant of schizophrenia risk and a non-risk SNP type (CGC-type) has significantly reduced promoter activity. In Asia and Europe, the major promoter type is not CGC but TCT.

Next, we compute the number ($n_{rk}$) of derived alleles at the $k$-th SNP site that simultaneously occur with derived alleles at the core site $r$. This number becomes non-zero only when both $\chi_{kl} = 1$ and $\chi_{rl} = 1$ hold true for some chromosome $l$. The $n_{rk}$ is thus counted as

$$n_{rk} = \sum_{l=1}^{n} \chi_{rl}\chi_{kl} \tag{2}$$

and provides a measure of pairwise LD between the core site and any other neighboring SNP site. The $n_k$ in Eq. (1) ranges from 1 to $n - 1$ and the $n_{rk}$ in Eq. (2) ranges from 0 to $n_k$ or $n_r$, whichever is smaller. In Fig. 1, these numbers are represented as "barcodes" that depict SNP information by two-tone colored bars.

Fig. 1. Barcodes for the numbers of derived alleles (C6 A and C7 A) and $r^2$ values for the extent of linkage disequilibrium (C6 B and C7 B) in classes $c_6$ and $c_7$ at the ST8SIA2 locus. Magenta bars represent the number ($n_{rk}$) of derived alleles at the $k$-th SNP site that are linked in the core CGC haplotype, whereas gray bars represent the number ($n_k - n_{rk}$) of derived alleles that are linked in the core non-CGC haplotypes, as described in the text. The height of each bar $n_k$ in $c_7$ ranges from 186 to 503 and that of each magenta bar ranges from 0 to 349 according to the number of CGC chromosomes in East Asians. The abscissa spans approximately 54 kb and the location of the core CGC sites is indicated by a single asterisk at the bottom (Fujito et al., 2018).

The barcode representation of $n_k$ and $n_{rk}$ differs from the traditional SFS in that it preserves spatial information about SNP sites and is stratified in eight layers that are on average in proportion to their ages. Figure 1 depicts derived alleles in $c_6$ and $c_7$ of ST8SIA2. In a central region in $c_7$, 12 equally tall red bars for $n_{rk}$ derived alleles are linked in the derived allele group; these sites may also contain a small number of derived alleles that are linked in the ancestral allele group (gray bars for $n_k - n_{rk}$ derived alleles). These barcodes indicate that the central region is in tight linkage with respect to the core site and the 12 SNP sites show 12 mutations that have accumulated in the common ancestral lineage of $n_{rk}$ chromosomes. Later, it is more rigorously determined that this central LD region is approximately 16 kb long and ranges from 24 kb to 40 kb distant from the left end. When the central region is extended to left and right regions with each being approximately 19 kb long, it immediately becomes apparent that derived alleles in these extended regions tend to be linked with not only derived but also ancestral alleles at the core site. Under the infinite-sites model with recombination (Kimura, 1969), there are only two possibilities that can result in such allele configurations at two sites. One possibility is that derived alleles at the $k$-th site are of ancient origin and can therefore appear together with the derived alleles at the core site (e.g., $k = 8$ in Table 1). In this case, we have $n_{rk} = n_r < n_k$. The other is that derived alleles at the $k$-th site experienced recombination with respect to the core site. In this case, we have $n_{rk} < n_r$ and $n_{rk} < n_k$. More thoroughly, the conditions for the latter case of recombination are that all of $\sum_{l=1}^{n} \chi_{rl}\chi_{kl}$, $\sum_{l=1}^{n} (1 - \chi_{rl})\chi_{kl}$, $\sum_{l=1}^{n} \chi_{rl}(1 - \chi_{kl})$ and $\sum_{l=1}^{n} (1 - \chi_{rl})(1 - \chi_{kl})$ are strictly positive as in the four-gamete test (Hudson and Kaplan, 1985):

$$n_{rk} > 0, \quad n_k > n_{rk}, \quad n_r > n_{rk} \quad \text{and} \quad n > g_k = n_k + n_r - n_{rk}. \tag{3a}$$

In the example given in Table 1, only the two sites $k = 4$ and 9 satisfy these conditions with respect to the core site $r = 6$. In the barcode representation, the first two inequalities in Eq. (3a) are indicated by a two-tone colored bar at the $k$-th site, whereas the third is indicated by the difference between two red bars at the core site and the $k$-th site, and the last is not depicted explicitly but corresponds to the case where $n - n_r$ ancestral alleles at the core site are greater in number than derived alleles in a gray bar at the $k$-th site. This allele association between the core site $r$ and the $k$-th site is commonly measured by the following LD formulae:

$$D = f_{rk} - f_r f_k \quad \text{or} \quad r^2 = \frac{(f_{rk} - f_r f_k)^2}{f_r(1 - f_r)\,f_k(1 - f_k)} \tag{3b}$$

where $f_{rk} = n_{rk}/n$, $f_k = n_k/n$ and $f_r = n_r/n$ as before. The above "$r$" in $r^2$ (Hill and Robertson, 1968) should not be confused with the same symbol $r$ used for the position of a core site or for the recombination rate given in the Discussion section. The $|D|$ or $r^2$ in Eq. (3b) is useful to grasp an outline of a core region: it constitutes a high LD region that includes a core site and exhibits a low level of derived allele polymorphism. Figure 1 also plots pairwise $r^2$ values in $c_6$ and $c_7$, revealing that the central region exhibits strong LD with the core site and is bounded by abruptly declined LD regions on both sides. Of importance is the examination of LD in individual classes within each of which SNPs are age-related and tend to disclose the linkage relationships. To further delimit the boundaries of the core region, we carried out exhaustive window analyses of $F_c$ and determined the region that showed the minimum value.

In Fig. 2, we draw the entire barcode representation of the same genomic region as in Fig. 1 with the CGC haplotype and the alternative TCT haplotype as a respective core, and the regions on both sides are used for comparison with the core region and the size is taken as being almost the same as the core region. Two features emerge. Signals for tight linkage in the core region are characteristic to the CGC haplotype group, whereas several SNP sites in the TCT haplotype group fall under suspicion of recombination. It is therefore consistent with the view that the extent of polymorphism within the TCT haplotype group is relatively high and the age is relatively old. Conversely, the pattern and extent of derived allele polymorphism in the CGC haplotype group are qualitatively and quantitatively different from those in the TCT haplotype group, suggesting rapid lineage-specific expansion of the former. This difference is captured by contrasting values of $F_c$, which is defined below.

Fig. 2. Entire barcode representation of the SNPs in a 54-kb region at the ST8SIA2 locus. The three core SNPs are located in the right middle (an asterisk in each class) and define the CGC and TCT haplotype groups with frequency 349/1008 = 35% and 515/1008 = 51%, respectively, in East Asians (the 1000 Genomes Project database). Note that according to the copy number of derived alleles at a site, the vertical scale differs greatly among different classes: $c_1$ (top) to $c_8$ (bottom).

Provided that we have identified a core region, we are now in a position to quantify such lineage-specific expansion by an appropriate measure. We define $F_c$ as the ratio of the number of derived alleles in the derived allele group to that in the entire sample, where:

$$F_c = \frac{\sum_{k \in c} n_{rk}}{\sum_{k \in c} n_k} \tag{4}$$

for $n_{rk}$ and $n_k$ at all the SNP sites in certain frequency classes ($c$). The numerator and denominator are calculated in the same genomic region so that one can act as an internal control of the other. However, it is strongly recommended to exclude not only the own class ($c_r$) to which the core site belongs, but also the higher-than-own classes altogether, because these classes contain overwhelming numbers of derived alleles that accumulated in the common stem lineage of a derived allele (haplotype) group and overshadow useful information regarding the internal structure of descendant lineages. For this reason, we may rewrite Eq. (4) as $F_{c<c_r}$ to express explicitly that it is computed in all the classes lower than the core allele frequency class. This $F_c$ or $F_{c<c_r}$ quantifies intra-allelic variability in the sense of Slatkin and Rannala (1997, 2000) without invoking any prior knowledge regarding intra-allelic genealogy. It is shown by simulation that the $F_c$ statistic in Eq. (4) satisfactorily reflects genealogical structure within a derived allele group. It is important to realize that $F_c \approx f_r$ if no linkage relationship is expected between the core site and all other sites under consideration.

Table 2 shows the estimates of $F_c$ in each of the eight frequency classes of ST8SIA2 among East Asians in the 1000 Genomes Project database. These are obtained separately in the central core region and, for comparison, in the neighboring region of 38 kb length as well (the rightmost column in Table 2). Here and subsequently, a caret over a statistic indicates an estimate. As the copy number of the CGC haplotype is 349 in the present sample of $n = 1008$, the core haplotype belongs to $c_7$ so that Eq. (4) is applied to all the classes lower than $c_7$. Our estimate of $\hat{F}_{c<c_7} \approx 0.015$ implies that within $c_1$ through $c_6$, the CGC haplotype group contains only 1.5% of derived alleles relative to the whole in the corresponding frequency classes. Comparison of this estimate with $\hat{F}_{c<c_7} \approx 0.240$ in the neighboring region indicates that although the latter is not yet close to the CGC haplotype frequency in East Asians (0.35), it is much higher than that in the core region, suggesting that both sides are apparently in loose linkage with the core haplotype. It is also noted that $\hat{F}_{c<c_7} \approx 0.015$ in the CGC haplotype group is in sharp contrast to $\hat{F}_{c<c_8} = 0.349$ in the TCT haplotype group (Fig. 2). Although the large $\hat{F}_{c<c_8}$ is due partly to the high core frequency $f_r = 0.51$ of the TCT haplotype, the rest is attributed to a large number of TCT-linked derived alleles in $c_6$ and $c_7$. In contrast, the extremely low $\hat{F}_{c<c_7}$ value in the CGC haplotype stems from the virtual absence of CGC-linked derived alleles that belong to $c_4$ to $c_6$ despite abundant non-CGC-linked derived alleles (Table 2). In terms of genealogy, the deficiency of $\hat{F}_{c_4}$ to $\hat{F}_{c_6}$ corresponds to the presence of a large cluster without any solid family relationship among its members after they originated from a common ancestor. We take this unusually unstructured pattern and lack of derived alleles as evidence for rapid expansion of the CGC haplotype lineage. Exceptions are a few SNP sites at which there are one and two CGC-linked derived alleles in total in $c_5$ and $c_6$, respectively (Table 2). As these CGC-linked derived alleles occur together with a large number of non-CGC-linked derived alleles at the same SNP sites, recombination is the most likely cause of these minor associations. Thus, the central core region we have just identified may not be perfectly free from recombination. Such imperfection is practically inescapable in actual data analyses and thus prompts us to be conservative in testing a null hypothesis.

Table 2. Proportions of derived alleles that are linked to the CGC haplotype ($n_{rk}$) at the human ST8SIA2 locus. Data are taken from East Asians ($n = 1008$) in the 1000 Genomes Project database. $\hat{F}_c = \sum_{k \in c} n_{rk} / \sum_{k \in c} n_k$, where $c$ stands for specified classes of allele frequencies.

Class   Entire 54-kb region    Central 16-kb region   Both sides
$c_1$   61/180 = 0.34          15/51 = 0.29           46/129 = 0.36
$c_2$   44/154 = 0.29          15/42 = 0.36           29/112 = 0.26
$c_3$   46/242 = 0.19          8/104 = 0.08           38/138 = 0.28
$c_4$   77/441 = 0.17          0/64 = 0.00            77/377 = 0.20
$c_5$   81/355 = 0.23          1/102 = 0.01           80/253 = 0.32
$c_6$   1245/7735 = 0.16       2/2439 = 0.001         1243/5296 = 0.23
$c_7$   12312/29184 = 0.42     4198/5853 = 0.72       8114/23331 = 0.35
$c_8$   7799/23454 = 0.33      2541/9091 = 0.28       5258/14363 = 0.37
Total   21665/61745 = 0.35     6780/17746 = 0.38      14885/43999 = 0.34

The 54-kb region is composed of the central region and both side regions. For the definition of the core (central) region, see the text. Both sides implies the left and right side from the core. The size of each side region is approximately the same as that of the core.

Equally importantly, the observed values in low frequency classes are much higher than those in high frequency classes (Table 2). In fact, we have $\hat{F}_{c_{1-2}} \approx 0.323$ in classes $c_{1-2}$ in contrast to $\hat{F}_{c_{3-6}} \approx 0.004$ in classes $c_{3-6}$. This increased level of rare variants may stem from the fact that when new mutations begin to accumulate within the derived allele group, the rate of return to equilibrium is faster in lower frequency classes than in higher frequency classes (Satta et al., 2018). Alternatively, the rate of return to equilibrium may suffer from recombination and/or the phasing problem for rare variants unless long-read sequencing is applied (cf. Field et al., 2016). In either case, inclusion of such rare variants inflates the estimate of $\hat{F}_{c<c_7}$ to the level of 0.015. Before going further, we simulated a model of selective sweeps and computed $F_{c_{3-6}}$ and $F_{c_{1-2}}$ separately to evaluate the effect of new mutations. We found that the ratio of $F_{c_{1-2}}$ to $F_{c_{3-6}}$ seldom exceeds the observed level of about 79 for a wide range of recombination rates (data not shown). Hence, it appears that the large value of $\hat{F}_{c_{1-2}}$ relative to $\hat{F}_{c_{3-6}}$ can be attributed to inaccurate haplotype phasing. Moreover, it is worth mentioning that under the standard neutral model, the $F_{c<c_7}$ statistic is fairly insensitive to rare variants. For instance, the 1% threshold value (0.0346) of $F_{c_{3-6}}$ obtained under neutrality is approximately the same as 0.0384 of $F_{c<c_7}$.
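The pooled estimates quoted from Table 2 follow from Eq. (4) by summing numerators and denominators over the relevant classes. The sketch below is an illustrative Python check rather than the authors' pipeline: the per-class counts are copied from Table 2, the class boundaries are the ones listed in the text for $n = 1008$, and the helper names are hypothetical.

```python
from bisect import bisect_left

# Largest derived-allele count in classes c_1..c_7 as listed in the text;
# for n = 1008 every larger count (504..1007) falls in c_8.
UPPER = [1, 3, 9, 25, 68, 185, 503]


def frequency_class(count):
    """Return j such that a site with `count` derived alleles lies in c_j."""
    return bisect_left(UPPER, count) + 1


# (sum of n_rk, sum of n_k) for classes c_1..c_8, copied from Table 2.
CENTRAL = [(15, 51), (15, 42), (8, 104), (0, 64),
           (1, 102), (2, 2439), (4198, 5853), (2541, 9091)]
SIDES = [(46, 129), (29, 112), (38, 138), (77, 377),
         (80, 253), (1243, 5296), (8114, 23331), (5258, 14363)]


def f_stat(counts, first, last):
    """Eq. (4) pooled over classes c_first..c_last (1-based, inclusive)."""
    chosen = counts[first - 1:last]
    return sum(a for a, _ in chosen) / sum(b for _, b in chosen)


core_class = frequency_class(349)              # 349 CGC copies in n = 1008
f_core = f_stat(CENTRAL, 1, core_class - 1)    # classes below the core class
f_sides = f_stat(SIDES, 1, core_class - 1)
f_rare = f_stat(CENTRAL, 1, 2)                 # c_1 and c_2
f_mid = f_stat(CENTRAL, 3, 6)                  # c_3 to c_6

print("core class: c_%d" % core_class)
print("F(c<c_7), central region: %.3f" % f_core)
print("F(c<c_7), both sides:     %.3f" % f_sides)
print("F(c_1-2) = %.3f, F(c_3-6) = %.3f, ratio = %.1f"
      % (f_rare, f_mid, f_rare / f_mid))
```

The counts place the CGC haplotype in $c_7$ and give 41/2802 for the central region, 1513/6305 for both sides, and 30/93 against 11/2709 for the rare versus intermediate classes, matching the values of about 0.015, 0.240, 0.323 and 0.004 and the ratio of about 79 quoted in the text.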
The near-equality of these two thresholds reflects the fact that under such a null equilibrium model, each frequency class contributes almost equally to $F_{c<c_7}$. In other words, $F_{c<c_7}$ and $F_{c_{3-6}}$ exhibit similar false-positive rates in rejecting neutrality (see below), and the use of $\hat{F}_{c<c_7}$ instead of $\hat{F}_{c_{3-6}}$ may make the test conservative, depending on the quality of sequence data we use.

TESTING NEUTRALITY

To evaluate the false-positive rate (the probability $\alpha$ of a type I error) in our method, we performed ms (Hudson, 2002) and/or fastsimcoal (Excoffier and Foll, 2011) using the standard neutral model of constant $N_e$ together with several demographic models proposed for modern human prehistory (e.g., Schaffner et al., 2005; Gutenkunst et al., 2009; Liu and Fu, 2015; Terhorst et al., 2017). In doing so, we restricted our examination to demographic models inferred for the East Asian or European meta-population and ignored migration with other continental populations. Moreover, these models differ from each other in detail. For instance, the model of Terhorst et al. (2017) is somewhat different from that of Schaffner et al. (2005) in the strength and timing of bottlenecks even if we discretize the population size estimated in the former. Accordingly, here we use two simplified versions of Schaffner et al. (2005) and Terhorst et al. (2017) that are subsequently referred to as SFD and TFD, respectively (Supplementary Fig. S1).

In simulating a core LD region of ST8SIA2, we assume no recombination and set an estimated mutation rate in a core region ($\theta$) or the observed number of segregating sites ($S$). The choice between $\theta$ and $S$ does not matter under the standard neutral model as long as $S = \theta a_n$ approximately holds and both parameters are sufficiently large (e.g., $S \ge 100$); however, the choice may make a difference under the SFD and TFD models. For these models of changing population size, we use the observed $S$ value to avoid unnecessary complications. First, we examined the $F_{c<c_7}$ statistic under the standard neutral model and carried out 1,000 replications with a specified range of derived allele frequencies at a core site (0.35 ± 0.015). The distribution of $F_{c<c_7}$ becomes unimodal with the mean and standard deviation (SD) of 0.26 ± 0.15 (Fig. 3). The threshold value of $F_{c<c_7}$ for $\alpha = 0.01$ is 0.038 and the observed value of $\hat{F}_{c<c_7} \approx 0.015$ is significantly small with $\alpha < 0.001$. We can thus reject the standard neutral model with a very low false-positive rate.

Fig. 3. Distribution of $F_{c<c_7}$ ($1 \le n_k \le 185$) under the standard neutral model, provided that $S = 146$ and the frequency ($f_r$) of derived alleles at a core site is within 0.35 ± 0.015. With over 1,000 replications, the mean and standard deviation of $F_{c<c_7}$ are 0.255 ± 0.147.

Under the SFD model, we obtained the distribution of $F_{c<c_7}$ with the mean and SD of 0.24 ± 0.18, finding that it becomes somewhat broad and skewed toward 0. Because of this and if the SFD model is treated as a null hypothesis, the false-positive rate increases; however, the threshold value of $F_{c<c_7}$ for $\alpha = 0.01$ is as large as 0.026 and $\hat{F}_{c<c_7} \approx 0.015$ is still significantly small ($\alpha < 0.002$). Likewise, under the TFD model, we obtained the distribution of $F_{c<c_7}$ with the mean and SD of 0.24 ± 0.19 and found that the threshold value for $\alpha = 0.01$ is 0.017 and that $\alpha < 0.004$ for $\hat{F}_{c<c_7}$. Hence, changing population size indeed increases the false-positive rate, but the effect on the $F_c$ statistic is surprisingly small. Conversely, if the SFD and TFD models are treated as alternatives, the power of the $F_c$ statistic defined by $1 - \beta$ (the probability $\beta$ of a type II error) is too low to detect the effect in comparison with the standard neutral model. In fact, setting $\alpha = 0.05$ with the corresponding threshold value of $F_{c<c_7} = 0.07$, we found that $\beta > 0.8$ under the SFD and TFD models. This high probability of a type II error or the low power reflects the broad and largely overlapping distributions of $F_c$ between the standard neutral model of constant size and demographic models of changing population size (Fig. 4). In our subsequent analyses, we thus use the traditional SFD model as a null hypothesis rather than the standard neutral model unless otherwise specified.

Fig. 4. Distribution of $F_c$ under the Schaffner et al. (2005) fluctuating drift (SFD) model (blue) in comparison with that under the standard neutral model (gray): $f_r = 0.35$, $S = 146$, no recombination, and $n = 1008$. The command lines in ms (Hudson, 2002) are: ./ms 1008 4000 -s 160 for the standard neutral model and ./ms 1008 4000 -s 160 -eN 0.001 0.077 -eN 0.004745 0.007 -eN 0.004995 0.077 -eN 0.0084975 0.006 -eN 0.0087475 0.24 -eN 0.0425 0.125 for the SFD model.

UNDER A SELECTIVE SWEEP

To examine the statistical power of $F_c$ when a selective sweep is an alternative, we carried out coalescent simulation with selection at a single site that incorporates a selected allelic lineage into the neutral genealogy background.

Fig. 5. Distribution of the $F_c$ statistic under a selective sweep model (pink) in comparison with the SFD model (blue). The parameters and conditions for a soft sweep by genic selection are set as follows: $n = 1008$, 1,000 replications, region size = 16 kb, no recombination, $\theta = 20$, $N_e s = 200$, initial frequency $f_0 = 0.008$ and current frequency $f_r = 0.35$. The command line in discoal (Kern and Schrider, 2016) is: ./discoal 1008 1000 16000 -t 20 -ws 0 -a 400 -x 0.5 -f 0.008 -c 0.35.
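Thresholds for a given $\alpha$ and the power $1 - \beta$ can both be read as tail fractions of simulated $F_c$ values. The text does not spell out the exact procedure, so the Python sketch below is only one plausible reading: it takes the lower-tail $\alpha$-quantile of the null replicates as the rejection threshold and counts the share of alternative-model replicates at or below it. The function names and the input numbers are placeholders for illustration, not simulation output from this study.

```python
import math


def lower_tail_threshold(null_values, alpha):
    """Value t such that about a fraction alpha of the null replicates are
    <= t; the test rejects the null when the observed F_c is <= t.
    (With tied null values the realized fraction can exceed alpha.)"""
    ordered = sorted(null_values)
    k = math.floor(alpha * len(ordered) + 1e-9)  # replicates allowed in tail
    if k < 1:
        raise ValueError("too few replicates for this alpha")
    return ordered[k - 1]


def power(alt_values, threshold):
    """1 - beta: share of alternative-model replicates that are rejected."""
    return sum(v <= threshold for v in alt_values) / len(alt_values)


# Made-up placeholder values, NOT results from the paper.
null_fc = [i / 100 for i in range(1, 101)]   # 0.01, 0.02, ..., 1.00
sweep_fc = [0.01, 0.02, 0.02, 0.03, 0.04, 0.045, 0.06, 0.08, 0.10, 0.20]

t = lower_tail_threshold(null_fc, alpha=0.05)
print("threshold:", t)
print("power:", power(sweep_fc, t))
```

In practice the two lists would hold one $F_c$ value per simulated replicate, for example from the null and sweep command lines given in the figure captions, after conditioning on the core allele frequency as described in the text.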
We used discoal, which generates either deter- sweeps. The best known is perhaps the locus for the ministic or stochastic trajectories for selective sweeps adaptation of lactase (LCT) persistence that has occurred (Kern and Schrider, 2016); readers may also refer to mbs independently in Africans and Europeans (Tishkoff et al., by Teshima and Innan (2009) and MSMS by Ewing and 2007). Surveying a 1-Mb region surrounding rs4988235 Hermisson (2010). For a given value of the population- (Supplementary Fig. S4A), we found that although the scaled selection intensity N s, discoal efficiently simu- whole region under the hitchhiking effect may exceed 500 lated soft sweeps for a given initial frequency f and a kb, a 50-kb central region is in extremely tight linkage and given current frequency f of a selected allele (Supplemen- exhibits marked deficiency of core-linked derived alleles tary Fig. S2). in the European meta-population. Because the current To imitate a putative soft sweep at ST8SIA2 (Fujito frequency ( ) of the core derived allele is 0.51, the core et al., 2018), we set f as 0.008 (the CGC haplotype fre- allele group belongs to c and the within the 50-kb 0 8 < c quency in Africans) and f as 0.35 (the CGC haplotype core region is calculated as 0.006. To assess the statis- frequency in East Asians). As expected, the distribution tical significance of the with the , we simulated < c r of F becomes narrow, and for N s = 200 and θ = 20, it the SFD model for the European meta-population and c e concentrates in a small range of 0.00053 − 0.21 with the a hard sweep model as an alternative hypothesis. The mean and SD of 0.020 ± 0.024 (Fig. 5). The threshold SFD simulation with S = 263 observed in the core region value for α = 0.01 is accordingly as small as 0.001. To yielded a threshold value of F for α = 0.01 at 0.032, < c evaluate the power, we further examined various values of which indicates that the observed value of = 0.006 < c N s. 
When compared with the mean and SD of F (0.24 ± is too small to be compatible with the SFD model (α < e c 0.18) for N s = 0, they sharply decrease to 0.077 ± 0.042 0.001). Conversely, the sweep model of genic selection for N s = 50, 0.050 ± 0.028 for N s = 100, and 0.039 ± with θ = S⁄a = 35 and N s = 200 yielded β = 0.218 for e e n e 0.029 for N s = 150 (Supplementary Fig. S3). Figure 6 α = 0.01, suggesting that the hard sweep on LCT is much shows the power of the F statistic against N s for three stronger than expected from N s = 200 (Fig. 7). c e e type I error rates (α = 0.1, 0.05, and 0.01 under the null As the second example, we studied oculocutaneous albi- hypothesis of the standard neutral model). For α = 0.01, nism type II (OCA2) eye color variation (rs1800414) with the F statistic can exhibit the power of 1 − β = 0.64 if = 0.60 in Asians (e.g., Sturm and Duffy, 2012). This c f N s is 100. As N s increases up to 200, the power fur- derived allele is also associated with skin pigmenta- e e Detecting ongoing selective sweeps 157 43 and N s = 200. For α = 0.01, the threshold value of tion and acts as a skin-lightening allele under epistatic F is 0.041, which is smaller than the observed value of interactions with the melanocortin-1 receptor gene. As < c = 0.086 (Fig. 7). Conversely, the F becomes too with LCT, we first performed the barcode analysis for F < c < c 8 high to reject the SFD model at the 1% level. Thus the an approximately 1-Mb region surrounding rs1800414 statistic on OCA2 is only suggestive of positive selection. (Supplementary Fig. S4B). The core region becomes 49 As the third example, we examined the gene encoding kb long and yields F = 0.086. We simulated the SFD < c ectodysplasin A receptor (EDAR), which is associated with model with the f and S = 320 for the East Asian meta- Figure 6. Asian hair thickness (Fujimoto et al., 2008). 
At the core population as well as the same sweep model with θ = site 1540 (rs3827760) within the coding region, the Asian- specific derived allele C resulted in a non-synonymous change and increased its frequency to f = 0.85 − 0.90 in CHB (Han Chinese Beijing) and JPT (Japanese Tokyo) populations, most likely after East Asians diverged from Europeans. The slow rate of LD decay measured by extended haplotype homozygosity (EHH) in Sabeti et al. (2002) and high extents of population differentia- tion measured by F both strongly supported an ongo- ST ing selective sweep (Fujimoto et al., 2008). Consistent with this, we found a 73-kb core region with = 0.196 < c (Supplementary Fig. S4C) and noted that this relatively N s high value of the is correlated with the high inci- < c Fig. 6. Power of the F statistic (ordinate) against N s (abscissa) c e dence of the 1540C allele in the Chinese and Japanese and the three curves for three α values of 0.1 (green), 0.05 (blue) populations. The simulation revealed that for α = 0.01, and 0.01 (red). These are evaluated by discoal with the same the threshold value of F is 0.233 under the SFD model < c command line as in Fig. 5. Genic selection follows stochastic 8 with S = 543 and β = 0.107 under the sweep model with frequency trajectories with selection intensity s and begins at initial frequency f at time t and ends at current frequency f at θ = 72 and N s = 200 (Fig. 7). Thus, the is signifi- 0 0 r e F < c time t . The time required to change the variant frequency from cantly small (α = 0.004) and supports the notion that the f to f is well approximated by 0 r hard sweep on EDAR is not much weaker than expected 1 ff 2 84 . 0 r from N s = 200. tt t ln . e r 0 s 1 ff s r 0 Figure 7. 
Fig. 7. Cumulative distribution of the $F_c$ statistic under the neutral SFD model (blue) and the hard sweep model with $N_e s$ = 100, 200 and 300 for LCT, OCA2, EDAR and ASPM. Simulation with 1,000 replications is performed for a fixed $S$ value under the neutral SFD model, whereas it is performed for a specified θ value under the sweep model. A red triangle at the bottom of each panel shows an observed $F_c$ value. Panels: LCT ($f_r$ = 0.51, observed $F_{<c_8}$ = 0.006); OCA2 ($f_r$ = 0.60, observed $F_{<c_8}$ = 0.086); EDAR ($f_r$ = 0.87, observed $F_{<c_8}$ = 0.196); ASPM ($f_r$ = 0.41, observed $F_{<c_7}$ = 0.052). Ordinate: cumulative distribution.

DISCUSSION

In our illustrative example, we examined the significance of the $F_c$ statistic under the assumption of a fixed core allele frequency $f_r$ = 0.35. However, as with EHH (Sabeti et al., 2002) and other statistics of the LRH (Vitti et al., 2013), $F_c$ also depends noticeably on $f_r$. It is therefore necessary to carry out appropriate simulation studies to evaluate the statistical significance of observed $F_c$ and/or to examine the empirical distribution for a number of its realized values in a genome (see Fujito et al., 2018 for comparison of various methods in their statistical power when applied to ST8SIA2). Nevertheless, to obtain a rough idea, we considered it useful to tabulate the $f_r$ dependency of threshold values of $F_c$ and other related statistics under the standard neutral model. In making such a table, we further assumed fairly large values of θ or $S$ to reduce stochastic errors that might be induced by low mutation rates and, as a consequence, become more appropriate for EDAR with a larger core region than that of ST8SIA2. Table 3 shows that the threshold value of $F_c$ is very sensitive to $f_r$ while simultaneously illustrating the atypical characteristic of LCT with $f_r$ = 0.51 and $F_{<c_8}$ = 0.006. In general, if $f_r$ ≤ 0.6 or $f_r$ ≥ 0.7, the mean of $F_c$ is greater or smaller than the median, and the distribution is positively or negatively skewed accordingly.

Table 3. $f_r$ dependency of the threshold values of the $F_c$ statistic for specified type I error rates α and the same dependency of the mean and standard deviation (SD). As a window for a specified value of $f_r$ must be set, it is assumed to be $f_r \pm \sqrt{f_r(1 - f_r)/n}$. Under the assumption of the standard neutral model with $n$ = 1008 and θ = 100 or $S$ = 1000, the $F_c$ is computed for $<c_6$ when $f_r$ = 0.1, for $<c_7$ when $f_r$ = 0.2 to 0.4 and for $<c_8$ when $f_r$ ≥ 0.6. The exception is the case of $f_r$ = 0.5, for which the average of $F_c$s is taken over realized values for $<c_7$ and $<c_8$. For each set of parameters, the number of replications is 1000. The differences between the two tables can likely be attributed to replication errors.

θ = 100
α \ f_r   0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
0.01      0.009  0.017  0.037  0.064  0.074  0.096  0.147  0.245  0.387
0.05      0.016  0.028  0.056  0.097  0.116  0.151  0.219  0.316  0.517
0.1       0.021  0.037  0.073  0.121  0.143  0.181  0.263  0.382  0.574
0.5       0.054  0.101  0.168  0.258  0.279  0.345  0.444  0.577  0.746
mean      0.072  0.127  0.197  0.283  0.300  0.345  0.440  0.562  0.723
SD        0.062  0.096  0.120  0.145  0.136  0.122  0.132  0.130  0.111

S = 1000
α \ f_r   0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
0.01      0.010  0.017  0.039  0.071  0.082  0.109  0.125  0.222  0.428
0.05      0.017  0.029  0.062  0.101  0.117  0.159  0.206  0.327  0.527
0.1       0.023  0.039  0.075  0.122  0.143  0.193  0.254  0.386  0.585
0.5       0.053  0.099  0.172  0.262  0.280  0.343  0.438  0.578  0.753
mean      0.068  0.127  0.207  0.290  0.300  0.351  0.436  0.566  0.733
SD        0.053  0.097  0.132  0.146  0.132  0.121  0.136  0.133  0.106

Although the distribution of $F_c$ is generally very broad under neutrality, it may be claimed that $F_c < f_r$ can be expected only when recombination is rare or absent, because only particular branches in a coalescent tree can produce a cluster of chromosomes with specified frequency $f_r$ (Fu, 1995; Li, 2011; Ferretti et al., 2017; Yang et al., 2018) and mutations in the common stem lineage are not counted in $F_c$. To illustrate this, let us consider a simple case of $n$ = 4 (Supplementary Fig. S5). For $f_r$ = 3/4, we can uniquely determine the tree topology in which only one suitable internal branch satisfies the frequency condition. Assuming the number of nucleotide changes is ideally in proportion to the average coalescent branch length, we can compute $F_{<c}$ = 0.69 rather than 3/4. Similarly, for $f_r$ = 1/2, we obtain $F_{<c}$ = 0.38 rather than 1/2. Hence, $F_c$ is slightly smaller than $f_r$ in a coalescent tree. However, it appears that the difference becomes even larger for a large sample size $n$ and for a high current frequency $f_r$ (Table 3). Alternatively, if recombination occurs frequently, $F_c$ should approach $f_r$ because a core site and any other site tend to be in linkage equilibrium regardless of frequency classes. Simulation confirmed this assertion (Supplementary Fig. S6). Conversely, the observation of a large difference between $f_r$ and $F_c$ thus constitutes a first favorable signal of LD being worthy of further scrutiny.

When a derived allele at a core site increases in frequency and is eventually fixed in a population, we can no longer compute the $F_c$ statistic. SLC24A5 is one of two known loci strongly affecting light skin pigmentation in Europeans and shows a fixed derived allele in the CEU population (Utah residents with Northern and Western European ancestry; Olalde et al., 2014). Although the variant is fixed in the population, the hierarchical barcodes of an approximately 20-kb region surrounding rs1426654 still indicate marked deficiency of SNPs in c[...] and c[...] in addition to a notable recovery of polymorphism in the lower classes (data not shown). As these features are likely footprints of a complete selective sweep, it is both worthwhile and possible to quantify such characteristics by some other measures than $F_c$. Kimura et al. (2007) considered two measures of the LRH and developed a method for when a selected allele has reached fixation in a local population (see also Maynard Smith and Haigh, 1974; Kaplan et al., 1989; Ronen et al., 2013; Schrider and Kern, 2017). Alternatively, we may apply the $F_c$ statistic in a polymorphic population such as South Asians (SAS), although this may not work for the case of local adaptation. In SAS, we actually found a 134-kb core region with $f_r$ = 0.60 and $F_{<c}$ = 0.136. Although the probability of a type I error (α = 0.03) may not be sufficiently small, this finding suggests that positive selection has acted on SLC24A5 in SAS as well.

In most simulations, we have assumed no recombination within a core region, not addressing the effect of recombination on neighboring regions. Clearly, it is worth demonstrating the effect on both the barcode representation and the $F_c$ statistic. For this purpose, we ran discoal with recombination under θ = 20, $N_e s$ = 200 and $f_r$ = 0.35. Although it is necessary to assume a rather small sample size of $n$ = 200 or less for technical reasons, the $c_6$ barcode thus generated reveals patterns similar to the $c_7$ barcode in Fig. 1 (Supplementary Fig. S7). We also confirmed that the $F_{<c}$ becomes approximately 0.01 in the central region, but increases to 0.34 on both sides. In short, simulation with selection and recombination can reproduce the pattern and level of polymorphism that resemble those observed in some human genomic regions under selective sweeps. More intriguingly, we found that the $F_c$ statistic is little affected even when the recombination rate ($r$) per region is fairly high (e.g., $\rho = 4N_e r$ ≥ 10). This unexpected robustness certainly depends on how to evaluate intra-allelic variability under selective sweeps. As mentioned earlier, the $F_c$ statistic is intended to exclude SNP mutations with $n_k \ge n_{rk} = n_r$ at site $k$ because they have accumulated in the stem lineage of a derived allele group. However, recombination, if it occurs in one or a few lineages in the group, may act so as to incorporate some of those stem-lineage mutations into intra-allelic variability and substantially enhance it: this is because although the number of remaining unrecombined lineages is smaller than $n_r$, they can be still large in size (large $n_{rk}$). Fortunately, it is likely that SNP sites thus created within the derived allele group remain in the same frequency class as the core site $c$. In this case, the $F_c$ statistic in Eq. (4) simply ignores those $n_{rk}$ mutations from intra-allelic variability.

Next, we briefly discuss one issue concerning the distinction between hard sweeps and soft sweeps, the latter being the more dominant mode of adaptation (Schrider and Kern, 2017). As a soft sweep may drive multiple standing variants at linked neighboring sites from the beginning and retain the polymorphism to some extent even if selection has been completed, it is in general more challenging to detect a soft sweep than a hard sweep. Peter et al. (2012) concluded that whereas LCT and EDAR are subjected to hard sweeps, ASPM (abnormal spindle-like microcephaly-associated) (Mekel-Bobrov et al., 2005) is likely affected by a soft sweep. Owing to the similarity to ST8SIA2, we examined a core SNP site in ASPM with $f_r$ = 0.41 in Europeans and found that there exists a 26-kb core region with $F_{<c_7}$ = 0.052. Under the assumption of a hard sweep with θ = 20 and $N_e s$ = 200, contrary to Peter et al. (2012), the threshold value of $F_{<c_7}$ becomes 0.029 for α = 0.01 and the corresponding value of β becomes 0.186 (Fig. 7). These results may be taken as evidence for positive selection on ASPM, though not very strong. However, unlike that for a hard sweep, the power (1 − β) for detecting a soft sweep depends strongly on the initial frequency $f_0$ of a selected allele (Innan and Kim, 2004; Hermisson and Pennings, 2005; Przeworski et al., 2005). For this reason, we examined the distribution of $F_c$ for various values of $f_0$ under the conditions of θ = 20, $n$ = 1008 and $N_e s$ = 200. Although the power is still as high as 0.74 when $f_0$ = 0.01, it sharply declines as $f_0$ further increases: 0.26 for $f_0$ = 0.05, 0.12 for $f_0$ = 0.1, and almost 0 for $f_0$ = 0.2, which is nearly halfway to the current frequency $f_r$. Without knowing the initial frequency of a selected allele, as is usually the case, it seems almost impossible to make definite conclusions about the power of a method for detecting soft sweeps (but see Satta et al., 2018 for one possibility). Nevertheless, if ASPM showed more unambiguous non-neutrality (i.e., α < 0.01) concerning intra-allelic variability, it could have also been suggested that the initial frequency of the selected allele was no greater than 1% when a soft sweep began to operate; such a low initial frequency appears consistent with the geographic distribution of the ASPM variant, which is confined to Eurasia.

As a final caveat, whereas we have shown that changing population size does not markedly affect the $F_c$ statistic, the demographic models examined here are not exhaustive. Many other demographic models or events potentially influence the $F_c$ statistic. These include range expansion with so-called “surfing” in a wave front that may mimic a selective sweep and be relevant to the present context (Edmonds et al., 2004; Klopfstein et al., 2006). However, this and other demographic causes such as population structure and admixture are beyond the scope of the present paper.

Source code for $F_c$ statistic. The source code (Python) for the $F_c$ statistic is available freely from our web site (https://sites.google.com/site/sattalab/research).

We thank Dr. Andy D. Kern for his kind and prompt response to our inquiry and request to utilize discoal. We also thank the editor and reviewers for their constructive criticisms, and Editage (www.editage.jp) for English language editing on early versions of this manuscript. This work was supported in part by the Japan Society for Promotion of Science (JSPS) grant 16H04821 to Y. S., the JSPS grants 23570271, 25101705 and 16K07535 to T. H., and the Scientific Research on Innovative Areas, a MEXT Grant-in-Aid Project FY2016-2020 to N. T.

REFERENCES

1000 Genomes Project Consortium (2015) A global reference for human genetic variation. Nature 526, 68–74.
Braverman, J. M., Hudson, R. R., Kaplan, N. L., Langley, C. H., and Stephan, W. (1995) The hitchhiking effect on the site frequency spectrum of DNA polymorphisms. Genetics 140, 783–796.
Bustamante, C. D., Wakeley, J., Sawyer, A., and Hartl, D. L. (2001) Directional selection and the site-frequency spectrum. Genetics 159, 1779–1788.
Charlesworth, B., and Charlesworth, D. (2012) Elements of Evolutionary Genetics (2nd edition). Roberts and Company Publishers, Colorado, USA.
Edmonds, C. A., Lillie, A. S., and Cavalli-Sforza, L. L. (2004) Mutations arising in the wave front of an expanding population. Proc. Natl. Acad. Sci. USA 101, 975–979.
Evans, P. D., Mekel-Bobrov, N., Vallender, E. J., Hudson, R. R., and Lahn, B. T. (2006) Evidence that the adaptive allele of the brain size gene microcephalin introgressed into Homo sapiens from archaic Homo lineage. Proc. Natl. Acad. Sci. USA 103, 18178–18183.
Ewing, G., and Hermisson, J. (2010) MSMS: a coalescent simulation program including recombination, demographic structure and selection at a single locus. Bioinformatics 26, 2064–2065.
Excoffier, L., and Foll, M. (2011) fastsimcoal: a continuous-time coalescent simulator of genomic diversity under arbitrary complex evolutionary scenarios. Bioinformatics 27, 1332–1334.
Eyre-Walker, A., Woolfit, M., and Phelps, T. (2006) The distribution of fitness effects on new deleterious amino acid mutations in humans. Genetics 173, 891–900.
Fan, S., Hansen, M. E., Lo, Y., and Tishkoff, S. A. (2016) Going global by adapting local: a review of recent human adaptation. Science 354, 54–59.
Fay, J. C., and Wu, C. I. (2000) Hitchhiking under positive Darwinian selection. Genetics 155, 1405–1413.
Ferretti, L., Ledda, A., Wiehe, T., Achaz, G., and Ramos-Onsins, S. E. (2017) Decomposing the site frequency spectrum: The impact of tree topology on neutrality tests. Genetics 207, 229–240.
Field, Y., Boyle, E. A., Telis, N., Gao, Z., Gaulton, K. J., Golan, D., Yengo, L., Rocheleau, G., Froguel, P., McCarthy, M. I., et al. (2016) Detection of human adaptation during the past 2000 years. Science 354, 760–764.
Fu, Y. X. (1995) Statistical properties of segregating sites. Theor. Popul. Biol. 48, 172–197.
Fu, Y. X., and Li, W.-H. (1993) Statistical tests of neutrality of mutations. Genetics 133, 693–709.
Fujimoto, A., Kimura, R., Ohashi, J., Omi, K., Yuliwulandari, R., Batubara, L., Mustofa, M. S., Samakkarn, U., Settheetham-Ishida, W., Ishida, T., et al. (2008) A scan for genetic determinants of human hair morphology: EDAR is associated with Asian hair thickness. Hum. Mol. Genet. 17, 835–843.
Fujito, N. T., Satta, Y., Hane, M., Matsui, A., Yashima, K., Kitajima, K., Sato, C., Takahata, N., and Hayakawa, T. (2018) Positive selection on schizophrenia-associated ST8SIA2 gene in post-glacial Asia. PLoS One 13, e0200278.
Gutenkunst, R. N., Hernandez, R. D., Williamson, S. H., and Bustamante, C. D. (2009) Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet. 5, e1000695.
Hermisson, J., and Pennings, P. S. (2005) Soft sweeps: molecular population genetics of adaptation from standing genetic variation. Genetics 169, 2335–2352.
Hill, W. G., and Robertson, A. R. (1968) Linkage disequilibrium in finite populations. Theoret. Appl. Genet. 38, 226–231.
Hudson, R. R. (2002) Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18, 337–338.
Hudson, R. R., and Kaplan, N. L. (1985) Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics 111, 147–164.
Innan, H., and Kim, Y. (2004) Pattern of polymorphism after strong artificial selection in a domestication event. Proc. Natl. Acad. Sci. USA 101, 10667–10672.
Kaplan, N. L., Hudson, R. R., and Langley, C. H. (1989) The “hitchhiking effect” revisited. Genetics 123, 887–899.
Kern, A. D., and Schrider, D. R. (2016) Discoal: flexible coalescent simulations with selection. Bioinformatics 32, 3839–3841.
Kimura, M. (1969) The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics 61, 893–903.
Kimura, M., and Ohta, T. (1973) The age of a neutral mutant persisting in a finite population. Genetics 75, 199–212.
Kimura, R., Fujimoto, A., Tokunaga, K., and Ohashi, J. (2007) A practical genome scan for population-specific strong selective sweeps that have reached fixation. PLoS One 2, e286.
Klopfstein, S., Currat, M., and Excoffier, L. (2006) The fate of mutations surfing on the wave of a range expansion. Mol. Biol. Evol. 23, 482–490.
Lachance, J., and Tishkoff, S. A. (2013) Population genomics of human adaptation. Annu. Rev. Ecol. Evol. Syst. 44, 123–143.
Li, H. (2011) A new test for detecting recent positive selection that is free from the confounding impacts of demography. Mol. Biol. Evol. 28, 365–375.
Li, M. J., Wang, L. Y., Xia, Z., Wong, M. P., Sham, P. C., and Wang, J. (2014) dbPSHP: a database of recent positive selection across human populations. Nucleic Acids Res. 42, D910–D916.
Liu, X., and Fu, Y. X. (2015) Exploring population size changes using SNP frequency spectra. Nat. Genet. 47, 555–559.
Maynard Smith, J., and Haigh, J. (1974) The hitch-hiking effect of a favourable gene. Genet. Res. 23, 23–35.
Mekel-Bobrov, N., Gilbert, S. L., Evans, P. D., Vallender, E. J., Anderson, J. R., Hudson, R. R., Tishkoff, S. A., and Lahn, B. T. (2005) Ongoing adaptive evolution of ASPM, a brain size determinant in Homo sapiens. Science 309, 1720–1722.
Nei, M., and Li, W.-H. (1979) Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc. Natl. Acad. Sci. USA 76, 5269–5273.
Nielsen, R., Williamson, S., Kim, Y., Hubisz, M. J., Clark, A. D., and Bustamante, C. (2005) Genomic scans for selective sweeps using SNP data. Genome Res. 15, 1566–1575.
Olalde, I., Allentoft, M. E., Sánchez-Quinto, F., Santpere, G., Chiang, C. W. K., DeGiorgio, M., Prado-Martínez, J., Rodríguez, J. A., Rasmussen, S., Quilez, J., et al. (2014) Derived immune and ancestral pigmentation alleles in a 7,000-year-old Mesolithic European. Nature 507, 225–228.
Pavlidis, P., Živković, D., Stamatakis, A., and Alachiotis, N. (2013) SweeD: likelihood-based detection of selective sweeps in thousands of genomes. Mol. Biol. Evol. 30, 2224–2234.
Peter, B. M., Huerta-Sánchez, E., and Nielsen, R. (2012) Distinguishing between selective sweeps from standing variation and from a de novo mutation. PLoS Genet. 8, e1003011.
Przeworski, M., Coop, G., and Wall, J. D. (2005) The signature of positive selection on standing genetic variation. Evolution 59, 2312–2323.
Ronen, R., Udpa, N., Halperin, E., and Bafna, V. (2013) Learning natural selection from the site frequency spectrum. Genetics 195, 181–193.
Sabeti, P. C., Reich, D. E., Higgins, J. M., Levine, H. Z. P., Richter, D. J., Schaffner, S. F., Gabriel, S. B., Platko, J. V., Patterson, N. J., McDonald, G. J., et al. (2002) Detecting recent positive selection in the human genome from haplotype structure. Nature 419, 832–837.
Satta, Y., Fujito, N. T., and Takahata, N. (2018) Nonequilibrium neutral theory for hitchhikers. Mol. Biol. Evol. 35, 1362–1365.
Sawyer, S. A., and Hartl, D. L. (1992) Population genetics of polymorphism and divergence. Genetics 132, 1161–1176.
Schaffner, S. F., Foo, C., Gabriel, S., Reich, D., Daly, M. J., and Altshuler, D. (2005) Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 15, 1576–1583.
Schrider, D. R., and Kern, A. D. (2017) Soft sweeps are the dominant mode of adaptation in the human genome. Mol. Biol. Evol. 34, 1863–1877.
Schrider, D. R., Shanku, A. G., and Kern, A. D. (2016) Effects of linked selective sweeps on demographic inference and model selection. Genetics 204, 1207–1223.
Slatkin, M., and Rannala, B. (1997) Estimating the age of alleles by use of intraallelic variability. Am. J. Hum. Genet. 60, 447–458.
Slatkin, M., and Rannala, B. (2000) Estimating allele age. Annu. Rev. Genomics Hum. Genet. 1, 225–249.
Sturm, R. A., and Duffy, D. L. (2012) Human pigmentation genes under environmental selection. Genome Biol. 13, 248.
Tajima, F. (1989) Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123, 585–595.
Terhorst, J., Kamm, J. A., and Song, Y. S. (2017) Robust and scalable inference of population history from hundreds of unphased whole genomes. Nat. Genet. 49, 303–309.
Teshima, K. M., and Innan, H. (2009) mbs: modifying Hudson’s ms software to generate samples of DNA sequences with a biallelic site under selection. BMC Bioinformatics 10, 166.
Tishkoff, S. A., Reed, F. A., Ranciaro, A., Voight, B. F., Babbitt, C. C., Silverman, J. S., Powell, K., Mortensen, H. M., Hirbo, J. B., Osman, M., et al. (2007) Convergent adaptation of human lactase persistence in Africa and Europe. Nat. Genet. 39, 31–40.
Vitti, J. J., Grossman, S. R., and Sabeti, P. C. (2013) Detecting natural selection in genomic data. Annu. Rev. Genet. 47, 97–120.
Watterson, G. A. (1975) On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7, 256–276.
Yang, Z., Li, J., Wiehe, T., and Li, H. (2018) Detecting recent positive selection with a single locus test bipartitioning the coalescent tree. Genetics 208, 791–805.
Zeng, K., Fu, Y. X., Shi, S., and Wu, C. I. (2006) Statistical tests for detecting positive selection by utilizing high-frequency variants. Genetics 174, 1431–1439.
Genes & Genetic Systems – Unpaywall
Published: Aug 1, 2018