Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Detecting ancient positive selection in humans using extended lineage sorting

Detecting ancient positive selection in humans using extended lineage sorting Method Detecting ancient positive selection in humans using extended lineage sorting Stéphane Peyrégne, Michael James Boyle, Michael Dannemann, and Kay Prüfer Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, 04103 Leipzig, Germany Natural selection that affected modern humans early in their evolution has likely shaped some of the traits that set present- day humans apart from their closest extinct and living relatives. The ability to detect ancient natural selection in the human genome could provide insights into the molecular basis for these human-specific traits. Here, we introduce a method for detecting ancient selective sweeps by scanning for extended genomic regions where our closest extinct relatives, Neandertals and Denisovans, fall outside of the present-day human variation. Regions that are unusually long indicate the presence of lineages that reached fixation in the human population faster than expected under neutral evolution. Using simulations, we show that the method is able to detect ancient events of positive selection and that it can differentiate those from background selection. Applying our method to the 1000 Genomes data set, we find evidence for ancient selec- tive sweeps favoring regulatory changes and present a list of genomic regions that are predicted to underlie positively se- lected human specific traits. [Supplemental material is available for this article.] Modern humans differ from their closest extinct relatives, (up to 150,000 yr ago) (Fig. 2 from Racimo 2016) but has little pow- Neandertals, in several aspects, including skeletal and skull mor- er to detect events older than 200,000 yr ago in modern humans. A phology (Weaver 2009), and may also differ in other traits that second method applied an approximate Bayesian computation on are not preserved in the archeological record (Varki et al. 2008; patterns of homozygosity and haplotype diversity around alleles Laland et al. 2010). Natural selection may have played a role in fix- that reach fixation (Racimo et al. 2014). Although this approach ing these traits on the modern human lineage. However, the selec- expands our ability to investigate older time frames, this signal tion events driving the fixation would have been restricted to a of selection also fades over time, and events of positive selection specific timeframe, extending from the split between archaic and older than 300 thousand years ago (kya) become undetectable. modern humans ca. 650,000 yr ago to the split of modern human Based on a method introduced by Green et al. (2010), Prüfer populations from each other around 100,000 yr ago (Prüfer et al. et al. (2014) presented a hidden Markov model that identifies re- 2014). While methods exist that can be used to scan the genome gions in the genome where the Neandertal and Denisovan individ- for the remnants of past or ongoing positive selection (Nielsen uals fall outside of present-day human variation (i.e., the archaic et al. 2007; Pybus and Shapiro 2009), current methods have lineages fall basal compared to all present-day humans) and ap- limited power to detect positive selection on the human lineage plied the model to detect selective sweeps on the modern human that acted during this older timeframe (see Sabeti et al. 2006 for lineage. Regions that are unusually long are candidates for ancient a review on detection methods and their timeframes): an unusual- selective sweeps as variants are likely to have swept rapidly to fix- ly high ratio of functional changes to nonfunctional changes, such ation, dragging along with them large parts of the chromosomes as the d /d test, requires millions of years and often multiple that did not have time to be broken up by recombination. While N S events of selection to generate detectable signals (Kryazhimskiy this method is, in principle, expected to be able to detect events and Plotkin 2008), while unusual patterns of genetic diversity be- as old as the modern human split from Neandertals and Deniso- tween individuals and populations (e.g., extended homozygosity, vans, this power was never formally tested, and it has several other Tajima’s D, F ) are most powerful during the selective sweep or shortcomings. First, the method was limited to modern human ST shortly after (Sabeti et al. 2006; Oleksyk et al. 2010). A favorable polymorphisms, ignoring the additional information given by substitution is not expected to leave a mark on linked neutral var- fixed substitutions. Second, the method does not fit parameters iation beyond 250,000 yr in humans (Przeworski 2002, 2003). to the data but requires these parameters to be estimated through The genome sequencing of archaic humans (Neandertals and coalescent simulations. Denisovans) to high coverage (Meyer et al. 2012; Prüfer et al. 2014) Here, we introduce a refined version of this method, called has spawned new methods to investigate the genetic basis of mod- ELS (Extended Lineage Sorting), that models explicitly the longer ern human traits that are not shared by the archaics (Pääbo 2014). regions produced under selection and includes the fixed differenc- One method, called 3P-CLR, models allele frequency changes be- es between archaic and modern human genomes as an additional fore and after the split of two populations using the archaic ge- source of information. The ELS method also takes advantage of an nomes as an outgroup (Racimo 2016). 3P-CLR outperforms expectation-maximization algorithm to estimate the model pa- previous methods in the detection of older events of selection rameters from the data itself, making it free from assumptions re- garding human demographic history. Corresponding authors: [email protected], pruefer@ eva.mpg.de Article published online before print. Article, supplemental material, and publi- © 2017 Peyrégne et al. This article, published in Genome Research, is available cation date are at http://www.genome.org/cgi/doi/10.1101/gr.219493.116. under a Creative Commons License (Attribution 4.0 International), as described Freely available online through the Genome Research Open Access option. at http://creativecommons.org/licenses/by/4.0/. 27:1563–1572 Published by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/17; www.genome.org Genome Research 1563 www.genome.org ́ Peyregne et al. To evaluate the power of the ELS method to detect ancient outgroup, i.e., all lineages are sorted. Because of recombination, selective sweeps, we tested its performance under scenarios of the human genome is a mosaic of independent evolutionary histo- background selection and neutrality. Finally, we present an updat- ries, and the process of lineage sorting is expected to randomly af- ed list of candidate regions that likely underwent positive selection fect regions, until, ultimately, all lineages will be sorted. In the case on the modern human lineage since the split from the common of modern humans, only a fraction of the regions in the genome ancestor with Neandertals and Denisovans. are expected to show lineage sorting (Prüfer et al. 2014), and the genome can be partitioned into regions where an archaic lineage falls either within the variation of modern humans (internal re- Results gion) or outside of the human variation (external region) (Fig. 1B). While lineage sorting can occur under neutrality, selection on Selection causes extended lineage sorting between closely related the modern human branch is expected to always lead to external populations regions as long as the selective sweep finished. In cases where The ancestors of modern humans split from the ancestors of the selective sweep is sufficiently strong, there will not be suffi- Neandertals and Denisovans between 450,000 and 750,000 yr cient time for recombination to break the linkage with neighbor- ago (Prüfer et al. 2014). Because the two newly formed descendant ing sites and a large region will reach fixation (extended lineage groups sampled the genetic variation from the ancestral popula- sorting) (Fig. 1C). In contrast, selection on standing variation tion, a derived variant can be shared between some members of may fail to generate such large regions, since recombination can both groups, while other individuals show the ancestral variant. act on the haplotype(s) with the prospective advantageous variant At these positions, some lineages from one group share a more re- before selection sets in. We note that neither demography nor se- cent common ancestor with some lineages in the other group than lection on the archaic lineage affect the lineage sorting within within the same group (Rosenberg 2002), a phenomenon called modern humans and thus the power to detect selective sweeps. incomplete lineage sorting (Fig. 1A). Eventually, a derived allele may reach fixation as part of a re- Expected incomplete lineage sorting among humans to archaics gion that has not been unlinked by recombination. In these re- gions, all descendants will derive from one common ancestor, We used coalescent simulations to determine the incidence and and any lineage from the other population will constitute an expected length of regions resulting from incomplete lineage Figure 1. Illustration of the lineage sorting process. (A) Effects on the genealogy. The process starts with a random distribution of lineages when the ancestral population splits. The lineage in black is an outgroup to lineages in blue, so that the blue lineages show a closer relationship between populations than to the black lineage (incomplete lineage sorting). When the blue lineages in the top population reach fixation (through a selective sweep for instance), any lineage from the other populations will constitute an outgroup, thereby completing the sorting of lineages. (B) Two types of genealogies illustrating the possible relationships between an archaic lineage and modern human lineages. (C) Local effects in the genome at different time points. The curves rep- resent the progression of lineage sorting for two independent regions, evolving under neutrality (black curve) and positive selection (blue curve), respec- tively. Longer fixation times are associated with more recombination so that neutrality produces smaller external regions. 1564 Genome Research www.genome.org Ancient positive selection in humans sorting in modern humans. Using a model of human demographic cM for most simulations with a selection coefficient of 0.001. In history (Yang et al. 2014), we estimated the fraction of lineage sort- contrast, external regions longer than 0.1 cM were observed for se- ing in modern humans in regard to Neandertals and Denisovans. lection coefficients above 0.05. Therefore, detectable signals are In simulations with 370 African chromosomes, and assuming a expected to be biased toward strong events with a selection coeffi- uniform recombination rate, ∼10% of the archaic genome is cient larger than 0.001. more divergent than the time to the most recent common ancestor of all sampled human variation. The length of the external regions Hidden Markov model to detect extended lineage sorting is expected to be ∼0.0016 centimorgan (cM) (95% CI: 0.001– 0.0095 cM; e.g., 1–9.5 kb for a recombination rate of 1 cM/Mb) To detect regions of extended lineage sorting, we modeled the with the longest regions on the order of 0.02 cM. In contrast, changes of local genealogies along the genome with a hidden internal regions are expected to be 0.012 cM long (95% CI: Markov model. We distinguish two types of genealogies, internal 0.0097–0.07 cM). or external, depending on whether the archaic lineage falls inside or outside of the human variation, respectively (Fig. 3A). The mod- el includes a third state corresponding to extended lineage sorting, Minimum strength of selection to produce detectable and external regions produced by this state are required to be lon- sweep signals ger, on average, than those produced by the external state. The three states are inferred from the state of the archaic allele (ances- We investigated the range of selection coefficients that could have led to the fixation of a lineage after the split with the archaic hom- tral or derived) either at a polymorphic position in modern hu- inins but before the differentiation of genetically modern humans mans or at a position where modern humans carry a fixed about 100–120 kya (Li and Durbin 2011) by simulating mutations derived variant. In the following, we describe the different statisti- cal properties expected for each type of genealogy. occurring at different times and evolving with different selection coefficients. While the simulations show that completed selective We first considered external regions. At modern human poly- morphic sites, the archaic genome is expected to carry the ances- sweeps could have occurred with selection coefficients as low as 0.0005 (Fig. 2A), the length distribution of haplotypes reaching tral variant since the derived variant would indicate incomplete lineage sorting. To account for sequencing errors or misassign- fixation is indistinguishable from neutrality for selection coeffi- cients under 0.001 (Fig. 2B,C). Under neutrality, the average ment of the ancestral state, we allow a probability of 0.01 for carry- ing the derived allele (see Methods). At sites where the derived length of external regions was 0.02 cM and remained below 0.03 allele is fixed, the archaic genome will of- ten carry the derived state, if the fixation event occurred before the split of the ar- chaic from the modern human lineage, or occasionally, the ancestral state, if the fixation event is more recent and oc- curred after the split. For internal regions, the archaic is expected to share the derived allele at modern human fixed derived sites but can carry the ancestral allele in our model to accommodate errors, albeit with low probability. In contrast, at sites that are polymorphic in modern humans, the probabilities of observing the ancestral or the derived allele in the archaic ge- nome will depend on the age of the de- rived variant, with young variants being less likely to be shared compared to older variants. The frequency of the derived variant in the modern human population can be used as a proxy for its age, and the emission probabilities in our model take the modern human derived allele fre- quency into account (see Methods). We modeled the transition proba- bilities between internal and external re- gions (related to the length of the regions) by exponential distributions. The extended lineage sorting state has Figure 2. (A) Fraction of selected alleles reaching fixation (gray) or segregating (orange) at present, the same chance of emitting derived al- depending on the strength of selection (columns) and the age of the mutation (rows, in kya) in our sim- leles as the other external state but is re- ulations. Events for which the selected variant was lost are not shown. (B) Distribution of the genetic quired to have a larger average length. length of external regions simulated under neutrality. (C) Distributions of the genetic length of external We used the Baum-Welch algorithm regions depending on the strength of selection (columns) and age of mutations in kya (rows). The blue line corresponds to the upper limit for the length of external regions produced under neutrality from B. (Durbin et al. 1998), an expectation- Genome Research 1565 www.genome.org ́ Peyregne et al. We do not expect ELS regions to be detected in our neutral simulations, and indeed we found that either the estimat- ed proportion of ELS converged to zero or the maximum likelihood estimate for the length of ELS and external regions con- verge to the same value (49% and 51% of simulations, respectively). A likeli- hood ratio test comparing a model with- out the ELS state to the full model with the ELS state also showed no significant improvement with the additional state in almost all neutral simulations (only one likelihood ratio test out of 100 simu- lations showed a significant improve- ment after Bonferroni correction for multiple testing). We then evaluated the accuracy of the ELS method to assign the correct ge- nealogy to regions based on sequences obtained through coalescent simulations with selection (Fig. 3B,C). In these simu- lations, the underlying genealogy at each site along the sequences is known and can be compared to the estimates. To be conservative, we only focused on results Figure 3. (A) Graphical representation of the Extended Lineage Sorting hidden Markov model. States with the smallest selection coefficient are depicted by nodes and transitions by edges. Each state emits an archaic allele as either derived, D, or (s = 0.005) that produces regions long ancestral, A, depending on the type of site in the modern human population (fixed or segregating at a enough to be detectable. In Figure 3B, given frequency). States are labeled I for Internal, E for External, and ELS for Extended Lineage Sorting. (B) we show the accuracy for labeling the ex- Receiver operator curves for varying cutoffs on the posterior probability of the ELS state and counting the number of sites in ELS regions that were correctly labeled. All bases labeled ELS outside of simulated ELS tended lineage sorting regions depen- regions are considered false positives. Sites in ELS regions with a posterior probability below the cutoff are dent on the posterior probability cutoff considered false negatives. (C) Example of the labeling of a simulated ELS region. Horizontal bars indicate for the ELS state. The results demonstrate true external (top) and internal (bottom) regions. The posterior probability is shown in red for ELS regions that the model has sufficient power to ac- and in gray for E regions. The region overlapping position 50,000 (red bar) is caused by a simulated selec- curately label sites that experienced se- tive sweep. lection with a coefficient s ≥ 0.005 and an occurrence of the beneficial mutation maximization algorithm, to estimate the emission probabilities as long as 600,000 yr ago. We also used the simulations of positive selection events (s = and estimate the transition probabilities with a likelihood maximi- zation algorithm. 0.005) with two different times at which the beneficial mutation occurred, 300 and 600 kya, to test how often the beneficial simu- lated variant falls within a detected ELS region (Supplemental Accuracy of parameter estimates and inferred genealogies Table S1). To put this rate of true positives into perspective, we We first investigated the performance of the parameter inference also counted how many ELS regions did not overlap the selected on simulated data under neutral evolution. We found that the es- variant (false positives). A large fraction of selected mutations timated probabilities for encountering ancestral/derived alleles in were detected (87%–92%). However, we also found a substantial external and internal regions fit the simulated parameters well fraction of false positive ELS regions (10%–11%). When restricting (on average, less than ± 0.08 from simulated under all tested detected ELS regions to those that are longer than 0.025 cM, we conditions) (Supplemental Figs. S1, S2), while the estimated find less than 0.1% false positives compared to 65%–68% true pos- length of internal and external regions deviate more from the sim- itives. Not all simulated regions with a selection coefficient of ulated lengths (around 15% overestimate of the mean length) 0.005 produce ELS regions of this size, so that the rate of true pos- (Supplemental Fig. S3). However, we found that the model exhib- itives for truly long regions is expected to be higher. For all follow- its better accuracy in labeling the correct genealogies with the esti- ing analyses, we used this minimal length cutoff of 0.025 cM. mated length parameters compared to the simulated true values (Supplemental Fig. S4). This difference seems to originate from Role of background selection the difficulty in accurately detecting very short external regions or internal regions with very few informative sites. We note that Background selection is defined as the constant removal of neutral detecting selection is not affected by this problem since we are pri- alleles due to linked deleterious mutations (Charlesworth et al. marily interested in detecting long external regions. Including 1993). In regions of the genome that undergo background selec- fixed differences improves the power to assign the correct geneal- tion, a fraction of the population will not contribute to subsequent ogies compared to a version of the method without this additional generations, causing a reduced effective population size. As a con- source of information (Supplemental Fig. S4). sequence, remaining neutral alleles can reach fixation faster than 1566 Genome Research www.genome.org Ancient positive selection in humans under neutrality, potentially producing unusually long external ty about recombination rates. We will refer to the set of 81 regions regions that could be mistaken as signals of positive selection. as the core set (Supplemental File S1) and the set including the 233 We investigated the effects of background selection by running putatively selected regions found with just one recombination map forward simulations with parameters that mimic the strength as the extended set (314 regions) (Supplemental File S2). and extent of background selection estimated for the human ge- For completeness, we also ran our model on the X nome (Messer 2013). While background selection simulations Chromosome and identified 12 additional candidates (43 if we did produce some long outlier regions that fall outside the distribu- consider candidates found with at least one recombination tion observed in neutral simulations, most regions are still smaller map), applying a more stringent length cutoff of 0.035 cM to ac- than regions simulated with positive selection at a conservative se- count for the stronger effects of random drift on this chromosome lection coefficient of 0.005 (Fig. 4A). Indeed, among the 1160 ex- (Methods). Interestingly, we also found a significant increase of ternal regions detected in our simulations of background selection posterior probabilities for selection within previously reported re- (s = 0.05) (Fig. 4A), only six were labeled as ELS and only three gions under potential recurrent selective sweeps in apes (Mann- −16 passed the minimal length filter of 0.025 cM. Whitney U one-sided test, P-value < 2.2 × 10 )(Supplemental Table S3; Dutheil et al. 2015; Nam et al. 2015). The detected selection candidate regions on the autosomes Candidate regions of positive selection on the human lineage do not show a decrease in B-scores (McVicker et al. 2009), a local measure of background selection strength, compared with random To identify ancient events of positive selection on the human lin- eage, we applied the ELS method to African genomes from the regions (Wilcoxon rank-sum test comparing the average B-scores with permuted regions, P-value = 0.565, or comparing the lowest 1000 Genomes Project (The 1000 Genomes Project Consortium 2012). We disregarded non-African populations since Neandertal B-scores in our regions to permuted regions, P-value = 0.504) (Fig. 4B). This suggests that candidate regions are not primarily generat- introgression in these populations could mask selective sweeps and lead to false negatives. A model with ELS fits the data signifi- ed by strong background selection. We compared our candidate regions to the top candidates of cantly better than a model without the ELS state for all chromo- somes and for both tested recombination maps (P-value < 1 × eight previous scans for selection, including iHS, F , XP-CLR, and ST −8 HKA (Pybus et al. 2014; Cagan et al. 2016). Using the estimated 10 )(Supplemental Table S2). We identified 81 regions of human extended lineage sorting time to the most recent common ancestor among Africans for each identified region/site, we found that our ELS scan identified for which both recombination maps support a genetic length great- er than 0.025 cM (average length: 0.05 cM). Depending on the re- significantly older events than other screens (Mann-Whitney U tests) (Fig. 5; Supplemental Table S4). We found 23 regions from combination map, the longest overlap between the maps is 0.12 (African-American map) or 0.17 (deCODE map) cM long, which the core set (detected by both recombination maps) overlapping with candidates from previous scans and 68 for the extended set is three to four times longer than the longest regions produced under background selection in our simulations. An additional (detected by at least one recombination map); neither overlap is more than expected at random (P-values are 0.06 and 0.595, re- 233 regions are longer than 0.025 cM according to only one recom- bination map, with 71% of those additional regions showing sup- spectively). In contrast, our candidate regions overlap candidate regions from 3P-CLR (Racimo 2016) and the ABC approach for port for the ELS state using both recombination maps. This suggests that the variation in the candidate set mostly stems from uncertain- detecting ancient selection (Racimo et al. 2014) more often than expected by chance (P-values < 0.05) (Supplemental Table S5). Biological functions of the candidate regions Since positive selection acts on advanta- geous phenotypes that are caused by changes to functional elements in the ge- nome, we would expect that our candi- date regions would overlap functional elements in the genome more often than expected. We first tested this hypothesis by counting the overlap between sweep candidate regions and protein coding genes (Ensembl release 82) (Aken et al. 2016). We find no statistically significant overlap of ELS regions with protein cod- Figure 4. Effects of background selection. (A) Comparison of the length of ELS regions in simulations of ing genes compared to randomly placed different scenarios. For the distribution under background selection, the s parameter corresponds to the regions of the same size (P-value = 0.671 average selection coefficient from the gamma distribution (shape parameter of 0.2). We assumed that the deleterious mutations are recessive with dominance coefficient h = 0.1. The horizontal blue line cor- and 0.124, for core and extended set, re- responds to the length cutoff applied to the real data. (B) Distribution of B-scores in the candidate sweep spectively) (Fig. 6). Previous work has regions (red curve) compared to sets of random regions with matching physical lengths (blue area with identified 96 proteins that carry human dotted blue lines indicating the 95% confidence intervals over 1000 random sets of regions). The lowest fixed derived nonsynonymous changes B-score (i.e., stronger background selection) was chosen when a region overlapped several B-score annotations. compared to Neandertal and Denisova, Genome Research 1567 www.genome.org ́ Peyregne et al. and Huber 2010) and found again no enrichment (Supplemental Table S6). In an attempt to include potential regulatory changes in the enrichment test, we assigned genes to candidate regions when a region fell upstream of or downstream from a gene (see Supplemental Methods). Although many candidate genes that were annotated in this way were most highly expressed in the brain or the heart (Odds ratio = 2.10 for both tissues), this enrich- ment is not significant when correcting for gene length and mul- tiple testing (Family-wise error rate = 0.336 and 0.997, respectively) (Supplemental Table S7). Additional work will be required to investigate the phenotyp- ic consequences of changes in candidate regions for selection. To facilitate this work, we provide an annotated list of fixed or nearly fixed sites on the human lineage that fall within our candidate re- gions (Supplemental File S3). Overlap with Neandertal introgression Introgression from Neandertals and Denisovans into modern hu- mans occurred approximately 37,000 to 86,000 yr ago (Sankararaman et al. 2012, 2016; Fu et al. 2014, 2015). For those advantageous derived variants that arose on the modern human Figure 5. Distributions of estimated ages of the modern human segre- gating derived variants with the highest frequency in putatively selected lineage prior to introgression, we would expect that selection regions or the age of the derived variants at sites identified by various ge- may have acted against the re-introduction of the ancestral variant nome-wide scans. Our candidate regions are labeled as ELS, for Extended through admixture. We tested whether this selection may have af- Lineage Sorting; other candidate regions are from Pybus et al. (2014) and fected the distribution of Neandertal introgressed DNA around Cagan et al. (2016). The color coding indicates the type of signal detected by each method. Ages were estimated by ARGweaver (Rasmussen et al. fixed changes in candidate sweep regions. Out of a total of 963 2014). We only report events between 0 and 600 kya. fixed derived variants in Africans overlapping the extended set of sweep regions, 240 (25%) show the ancestral allele in non- Africans and show evidence for re-introduction by admixture us- which constitute a particularly interesting subset of potentially ing a map of Neandertal introgression (Vernot and Akey 2014). functional changes to genes that may have been caused by selec- This level of Neandertal ancestry is comparable to the genome- tive sweeps (Prüfer et al. 2014). We found no overlap between wide fraction of out-of-Africa ancestral alleles at African fixed de- these genes and the core set of sweep candidate regions that rived sites (∼26%; bootstrap P-value = 0.583). We also find no sig- were identified by both recombination maps. However, when con- nificant reduction in frequency of Neandertal ancestry around sidering the extended set of sweep candidate regions, 11 regions candidate substitutions in sweep regions, when comparing one overlapped such genes: ADSL, BBIP1, ENTHD1, HERC5, KATNA1, KIF18A, NCOA6, PRDM10, SCAP, SLITRK1, and ZNHIT2. This over- lap is significantly larger than expected by chance (only two genes −3 are expected on average; P-value < 10 ). In all instances, the can- didate regions contained at least one fixed amino acid change. Since fixed changes are part of the information used to infer exter- nal regions, it stands to reason that the presence of such a change may bias toward observing an overlap with candidate regions (72/ 81 core regions and 275/314 regions from the extended set contain fixed changes). However, we note that the overlap with fixed ami- no acid changes is also significantly larger than the overlap with other fixed changes (963 of 20,347 fixed changes fall within can- didate regions from the extended set; binomial P-value = 0.006). Phenotype may also be influenced by regulatory changes that affect gene expressions. Interestingly, we found a significant en- richment for regions overlapping enhancers and promoters (P-val- ue < 0.001 and P-value = 0.002, respectively) (see Fig. 6) when considering the extended set of 314 candidate regions. However, this enrichment was not significant for the smaller core set of candidates. To further investigate the biological function of our regions, we tested for Gene Ontology enrichment in genes within the ex- Figure 6. Enrichment for regulatory elements (enhancers, P-value < tended set of regions (Prüfer et al. 2007). No category showed sig- 0.001; protein-coding genes, P-value = 0.124; and promoters, P-value = nificant enrichment when comparing to randomly placed regions 0.002) in the extended set of 314 candidate sweep regions. The distribu- of identical sizes in the genome (see Supplemental Methods). We tions were obtained by randomly placing candidate regions in the genome also assigned genes that overlap our extended data set to tissues to obtain lists of regions with similar physical lengths. The red lines repre- in which they show the significantly highest expression (Anders sent the value observed in the real extended set. 1568 Genome Research www.genome.org Ancient positive selection in humans randomly sampled fixed African substitution per region against overlap with other types of positive selection scans is smaller. random regions matched for size and distance to genes (Supple- Among our candidates, 55 are novel regions (234 if considering mental Figs. S5, S6). the extended set) that were not detected in any of the previous If selection against the re-introduction of an ancestral variant screens, including previous versions of the screen without fixed were very strong, selection may have depleted Neandertal ancestry differences (Supplemental Fig. S7). in a large region surrounding the selected allele. Interestingly, we While we find no difference in the fraction of genes in select- find some of our sweep candidate regions that fall within the ed regions compared to randomly placed regions, we detect an en- longest deserts of both Neandertal and Denisova ancestry richment for enhancers and promoter regions. This result is in (Supplemental Table S8; Vernot et al. 2016). A significantly high agreement with the hypothesis that regulatory changes may play number of the core set of regions fall in these deserts (5/81 regions, an important role in human-specific phenotypes (King and P-value = 0.024), while the extended set shows no significant en- Wilson 1975; Carroll 2003; Enard et al. 2014), maybe more so richment (9/314 regions, P-value = 0.205). than amino acid changes (Hernandez et al. 2011; see also Enard et al. 2014; Racimo et al. 2014). Interestingly, several gene candi- dates falling within sweep regions play a role in the function and Discussion development of the brain. A particularly interesting observation Many genetic changes set modern humans apart from Neandertals is the potential selection on the genes encoding both the ligand, and Denisovans, but their functions remain elusive. Most of these SLIT2, and its receptor, ROBO2, which reside on Chromosomes 4 changes probably resulted in either no change to the phenotype or and 3, respectively (see Supplemental File S3 for an annotated list of changes in those genes). Members of the Roundabout (ROBO) to a selectively neutral change. However, in rare instances, selec- tion may have favored changes modifying the appearance, behav- gene family play an important role in guiding developing axons ior, and abilities of present-day humans. Unfortunately, current in the nervous system through interactions with the ligands methods to identify selection have limited power to detect such SLITs. SLITs proteins act as attractive or repulsive signals for axons old events of positive selection (Przeworski 2002, 2003; Sabeti expressing different ROBO receptors. ROBO2 has been further asso- et al. 2006). ciated with vocabulary growth (St Pourcain et al. 2014), autism Here, we introduce a hidden Markov model to detect ancient (Suda et al. 2011), and dyslexia (Fisher and DeFries 2002) and is in- selective sweeps based on a signal of extended lineage sorting. volved in the development of neural circuits related to vocal learn- Using simulations, we were able to show that the method can ing in birds (Wang et al. 2015). Interestingly, ROBO2 is also in a long detect older events of selection as long as the selected variant desert of both Denisovan and Neandertal ancestry in non-Africans. was sufficiently advantageous. The power to detect older events We also identified interesting brain-related candidates on the is due to the fact that the method increases in power with the num- X Chromosome, among them DCX, a protein controlling neuro- ber of mutations that accumulated after the sweep finished. We nal migration by regulating the organization and stability of mi- also showed that background selection can cause false signals crotubules (Gleeson et al. 1999). Mutations in this gene can have and have chosen a minimum length cutoff on candidate regions. consequences for the expansion and folding of the cerebral cortex, While this cutoff reduces the number of false positives due to back- leading to the “double cortex” syndrome in females and “smooth ground selection, we note that this cutoff is expected to exclude brain” syndrome in males (Gleeson et al. 1998). bona fide events of positive selection, too. We have presented a new approach to detect ancient selective We applied the ELS method to 185 African genomes, the Altai sweeps based on a signal of extended lineage sorting. Applying this Neandertal genome, and the Denisovan genome and detected 81 approach to modern human data revealed that selection may have candidate regions of selection when requiring a minimum genetic acted primarily on regulatory changes. With population level se- length supported by two independent recombination maps. The quencing of nonhuman species becoming more readily available, uncertainty in the recombination maps has a large effect on our re- we anticipate that this approach will help to reveal the targets of sults, as shown by the much larger numberof 314 regions identified ancient selection in other species. by either recombination map. Recombination rates over the ge- nome are known to evolve rapidly (Lesecque et al. 2014), and of Methods particular concern are recent changes in recombination rates that make some regions appear larger in genetic length than they Data were in the past. By comparing the current recombination rates We used single nucleotide polymorphisms (SNPs) from 185 unre- in our regions to recombination rates in the ancestral population lated Luhya and Yoruba individuals from the 1000 Genomes of both chimpanzee and humans (Munch et al. 2014), we identi- Project phase I (The 1000 Genomes Project Consortium 2012) to- fied some candidate regions that may have increased in recombina- gether with four ape reference genome assemblies (chimpanzee tion rates (Supplemental Table S9). However, it is currently [panTro3] [Mikkelsen et al. 2005], bonobo [panPan1.1] [Prüfer impossible to date the change in recombination rates confidently, et al. 2012], gorilla [gorGor3] [Scally et al. 2012], and orangutan and these candidate sweeps may post-date the change. [ponAbe2] [Locke et al. 2011]) to compile a list of polymorphic A particular strength of our screen for selective sweeps is the and fixed derived changes in Luhya and Yoruba. Neandertal and ability to detect older events, as indicated by the estimated power Denisova alleles at these positions were extracted from published to detect simulated events of positive selection of old age and mod- VCFs (Danecek et al. 2011) using recommended filters (Prüfer erate strength. This sets the ELS method apart from previous ap- et al. 2014; see Supplemental Material for further details). Sites proaches that made use of archaic genomes, which were geared where either Neandertal or Denisova carried a third allele were toward detecting younger events with an age of less than disregarded. 300,000 yr ago (Racimo et al. 2014; Racimo 2016). Despite this dif- Genetic distances between these positions were calculated us- ference, we found significant overlap between the ELS candidates ing the African-American (Hinch et al. 2011) and the deCODE and the candidates identified by these other approaches, while the (Kong et al. 2010) recombination maps (available in Build 37 Genome Research 1569 www.genome.org ́ Peyregne et al. from http://www.well.ox.ac.uk/~anjali/). Both maps were chosen as implemented in the NLopt library (Steven G. Johnson, The since they estimate recombination rates from events that occurred NLopt nonlinear-optimization package, http://ab-initio.mit.edu/ within a few generations before the present. Recombination maps nlopt) to maximize the log-likelihood values calculated by the based on older events (i.e., LD-based map) can underestimate re- Forward algorithm. Convergence was attained in a maximum of combination rates in regions that underwent recent selective 1000 evaluations, and the log-likelihood maximization accuracy −4 sweeps, potentially masking true signals. was set to 10 . To test for convergence to local maxima, we ran the algorithm twice with different starting points and used the pa- rameters of the run with the highest likelihood to run the re-esti- Hidden Markov model mation algorithm a third time, starting with those parameters. We would like to estimate for each informative position the prob- All three runs gave similar results on all chromosomes. abilities for the three possible genealogies, external (E), internal (I), and extended lineage sorting (ELS), given the observed data. Post-processing Formally and following the notation from Durbin et al. (1998), The HMM was executed independently on all chromosomes for we calculate P(π = k|x), where i denotes the position, k ∈ {E,I,ELS} both Denisova and Neandertal and using the African-American and x is the sequence of observations with the ith observation de- and deCODE recombination maps. An external region was defined noted x . With the genetic distance d between consecutive sites as a stretch of high posterior probabilities (P ≥ 0.7) for the extended and l , the average genetic length of a region in state k, we specify −d/l k lineage sorting state that was uninterrupted by sites with a low the transition probabilities between identical states as t = e . k,k probability (P ≤ 0.1). The two cutoffs on the posterior probabilities Transitions from I to the states ELS and E depend on an addi- were determined by simulating sequences with positive selection tional parameter p, the proportion of transitions from I to ELS, −d/l I (s = 0.005, 500 kya; see below). Sites that were simulated external and their probability is given by t = p 1 − e and I,ELS −d/l I in both archaics were labeled as 1 and the remaining sites as t =(1 − p) 1 − e . Lastly, transitions from the two external I,E −d/l j 0. The HMM was then run on the simulations. By running a states to internal have the probability t = 1 − e , with j ∈ {E, j,I grid-search over possible cutoffs (step-sizes of 0.05 for the two pa- ELS}. By construction, transitions between E and ELS genealogies rameters) and labeling the HMM output accordingly, we identified are not allowed: it would not be possible to detect such transitions, the set of chosen parameters by minimizing the root mean square as those two states have the same statistical properties. error The inference further requires the probability for observing an ancestral or derived allele in the archaic at a site i with a derived () t − o allele frequency f > 0 in modern humans (noted x ) given that the i i i i i true genealogy is k ∈ {I, E, ELS}: e (x )= P(x | π = k). We assume that n k i i i ∀x : e (x)= e (x) , i.e., that both external states give rise to ances- ELS E with n the number of labeled sites, t the true label, and o the ob- i i tral and derived alleles in the archaic with equal probabilities given served label. the same observation. Since external regions are not expected to give rise to derived sites when the derived allele is segregating in modern humans, the only sources for such an observation can Simulations be errors or independent coinciding identical mutations, and we We simulated sequences using a model of recent human demogra- define an error rate for external regions: ε = e (x = derived, f < 1). E E i i phy to test the performance of our HMM under different scenarios Similarly fixed derived sites are expected to show the derived allele of neutral evolution, positive selection, or background selection. in the archaics if the local genealogy is internal, and we define an Each simulation consisted of one chimpanzee chromosome, one error rate for internal regions: ε = e (x = derived, f = 1). I I i i chromosome from each archaic hominin, and 370 human chro- We compute the posterior probability P(π = k | x) that an ob- mosomes, matching the 185 Luhya and Yoruba individuals used servation x came from state k given the observed sequence x as: in our analysis. For all simulations in this study, a constant muta- −8 −1 −1 tion rate of 1.45 × 10 bp ·generation , a constant recombina- P(x,p = k) −1 −1 P(p = k | x)= . tion rate of 1 cM.Mb .generation , and a generation time of 29 P(x) yr were assumed. We used estimates of population sizes from P(x, π = k)= f (i)b (i), where f (i)= P(x …x , π = k) and b (i)= i k k k 1 i i k Yang et al. (2014) and population split estimates from Prüfer P(x …x | π = k) are the output of the Forward and Backward i+1 L i et al. (2014) as parameters for the simulated demography algorithms, respectively (Rabiner 1989; Durbin et al. 1998). P(x) (Supplemental Information 1, 2). corresponds to the likelihood of the data given our model and Neutral simulations were generated with the coalescent was also calculated from the Forward algorithm. simulator scrm (Staab et al. 2014) and give a good match to our observed data when plotting derived allele frequency in modern humans against the proportion of derived alleles in the outgroup Parameter estimate (Supplemental Fig. S8). Simulations with positive selection were We used the Baum-Welch algorithm to estimate all emission prob- generated with the coalescent simulator msms (Ewing and abilities, with the exception of ε , the proportion of segregating Hermisson 2010), and background selection was explored using sites derived in the archaic genome in external regions, due to lim- forward-in-time simulations generated by SLiM (Messer 2013). ited accuracy in the estimates. We set this last parameter to a value Further details on simulation parameters are given in the of 0.01, a conservative upper limit on contamination and sequenc- Supplemental Material. ing error in the two high-coverage archaic genomes. The Baum- Welch algorithm was run for a maximum of 40 iterations, and Age comparison with other scans for selection the convergence criteria were set to a log-likelhood maxima differ- −4 ence of less than 10 . To compare our sweep screen with previous scans, we downloaded We estimated the remaining parameters (average lengths of candidate regions from the 1000G positive selection database regions and the proportion of transitions to the ELS state) using (Pybus et al. 2014). Only candidates with a P-value lower than the derivative free optimization method COBYLA (Powell 1994) 0.001 were considered. We added to this set of regions 1570 Genome Research www.genome.org Ancient positive selection in humans the top reported regions from a HKA scan (Cagan et al. 2016). Author contributions: S.P. implemented the method. S.P., Allele age estimates were obtained from ARGweaver (Rasmussen M.J.B., and M.D. analyzed data. S.P., M.J.B., M.D., and K.P. inter- et al. 2014). preted the results. K.P. designed the study. S.P. and K.P. wrote F , iHS, and XP-EHH are site-based statistics which localize the manuscript with input from all authors. ST sites that may have been selected (Malécot 1948; Wright 1951; Voight et al. 2006; Sabeti et al. 2007), whereas selective scans such as CLR, XP-CLR, Tajima’s D, Fay and Wu’s H, and HKA iden- References tify candidate regions (Hudson et al. 1987; Tajima 1989; Fay and The 1000 Genomes Project Consortium. 2012. An integrated map of genetic Wu 2000; Kim and Stephan 2002; Chen et al. 2010). In order to variation from 1,092 human genomes. Nature 491: 56–65. Aken BL, Ayling S, Barrell D, Clarke L, Curwen V, Fairley S, Fernandez Banet compare the age of the selection events, we assumed that the se- J, Billis K, García Girón C, Hourlier T, et al. 2016. The Ensembl gene an- lected variant in candidate regions was the site with the highest notation system. Database (Oxford) doi: 10.1093/database/baw093. frequency. We note that this procedure will underestimate the Anders S, Huber W. 2010. DESeq: differential expression analysis for se- age of events if the true selected site reached fixation, as is often ex- quence count data. Genome Biol 11: R106. Cagan A, Theunert C, Laayouni H, Santpere G, Pybus M, Casals F, Prüfer K, pected for our method; the comparison is thus conservative. Navarro A, Marques- Bonet T, Bertranpetit J, et al. 2016. Natural selec- tion in the great apes. Mol Biol Evol 33: 3268–3283. Carroll SB. 2003. Genetics and the making of Homo sapiens. Nature 422: Annotations 849–857. We annotated candidate regions using protein coding genes from Charlesworth B, Morgan MT, Charlesworth D. 1993. The effect of deleteri- ous mutations on neutral molecular variation. Genetics 134: Ensembl (release 82), promoters and enhancers mapped by 1289–1303. GenoSTAN (Zacher et al. 2016), a measure of background selection Chen H, Patterson N, Reich D. 2010. Population differentiation as a test for (B-scores) (McVicker et al. 2009). Candidate regions were also over- selective sweeps. Genome Res 20: 393–402. lapped with regions previously suggested to have experienced re- Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, et al. 2011. The variant current selective sweeps in apes on the X Chromosome (Dutheil call format and VCFtools. Bioinformatics 27: 2156–2158. et al. 2015; Nam et al. 2015), regions of Neandertal ancestry Durbin R, Eddy S, Krogh A, Mitchison G. 1998. Biological sequence analysis: (Sankararaman et al. 2014; Vernot and Akey 2014), and long re- probabilistic models of proteins and nucleic acids. Cambridge University gions devoid of Neandertal and Denisova ancestry (Vernot et al. Press, Cambridge. Dutheil JY, Munch K, Nam K, Mailund T, Schierup MH. 2015. Strong selec- 2016). tive sweeps on the X chromosome in the human-chimpanzee ancestor To statistically test the overlap of our regions with these anno- explain its low divergence. PLoS Genet 11: e1005451. tations, we randomly placed regions of similar physical sizes in the Enard D, Messer PW, Petrov DA. 2014. Genome-wide signals of positive se- parts of the genome that passed our quality filters. Quality-filtered lection in human evolution. Genome Res 24: 885–895. Ewing G, Hermisson J. 2010. MSMS: a coalescent simulation program in- regions that were smaller than the longest gap present in our can- cluding recombination, demographic structure and selection at a single didate ELS regions were regarded as sufficiently short to not pro- locus. Bioinformatics 26: 2064–2065. hibit the placement of regions. Fay JC, Wu CI. 2000. Hitchhiking under positive Darwinian selection. Changes of recombination rates along the human lineage Genetics 155: 1405–1413. Fisher SE, DeFries JC. 2002. Developmental dyslexia: genetic dissection of a could limit our power to detect selected regions, and we used an complex cognitive trait. Nat Rev Neurosci 3: 767–780. ancestral recombination map of the human-chimpanzee ancestor Fu Q, Li H, Moorjani P, Jay F, Slepchenko SM, Bondarev AA, Johnson PL, to annotate top candidate regions (Supplemental Table S9; Munch Aximu-Petri A, Prüfer K, de Filippo C, et al. 2014. Genome sequence et al. 2014). of a 45,000-year-old modern human from western Siberia. Nature 514: 445–449. Finally, we further characterized fixed or nearly fixed human- Fu Q, Hajdinjak M, Moldovan OT, Constantin S, Mallick S, Skoglund P, specific changes within the candidate regions using annotations of Patterson N, Rohland N, Lazaridis I, Nickel B, et al. 2015. An early mod- histone marks (enhancers, promoters), eQTLs, transcription factor ern human from Romania with a recent Neanderthal ancestor. Nature binding sites, and conservation scores (Supplemental File S3). 524: 216–219. Gleeson JG, Allen KM, Fox JW, Lamperti ED, Berkovic S, Scheffer I, Cooper EC, Dobyns WB, Minnerath SR, Ross ME, et al. 1998. Doublecortin, a Software availability brain-specific gene mutated in human X-linked lissencephaly and dou- ble cortex syndrome, encodes a putative signaling protein. Cell 92: The software and input files used in this study have been made 63–72. available through the website http://bioinf.eva.mpg.de/ELS/ and Gleeson JG, Lin PT, Flanagan LA, Walsh CA. 1999. Doublecortin is a micro- tubule-associated protein and is expressed widely by migrating neurons. https://github.com/StephanePeyregne/ELS/. A version of the Neuron 23: 257–271. source code is also available as Supplemental Code in the online Green RE, Krause J, Briggs AW, Maricic T, Stenzel U, Kircher M, Patterson N, version of this article. Li H, Zhai W, Fritz MH, et al. 2010. A draft sequence of the Neandertal genome. Science 328: 710–722. Hernandez RD, Kelley JL, Elyashiv E, Melton SC, Auton A, McVean G, 1000 Genomes Project, Sella G, Przeworski M. 2011. Classic selective sweeps Acknowledgments were rare in recent human evolution. Science 331: 920–924. Hinch AG, Tandon A, Patterson N, Song Y, Rohland N, Palmer CD, Chen We thank Michael Lachmann for early discussions about the de- GK, Wang K, Buxbaum SG, Akylbekova EL, et al. 2011. The landscape sign of the study; Janet Kelso and Mark Stoneking for many useful of recombination in African Americans. Nature 476: 170–175. comments on the manuscript; Svante Pääbo, Udo Stenzel, Hudson RR, Kreitman M, Aguadé M. 1987. A test of neutral molecular evo- Fernando Racimo, Amy Ko, and Adam Siepel for helpful discus- lution based on nucleotide data. Genetics 116: 153–159. Kim Y, Stephan W. 2002. Detecting a local signature of genetic hitchhiking sions; Julien Peyrégne for help in implementing the software; along a recombining chromosome. Genetics 160: 765–777. and Matthias Ongyerth, Manjusha Chintalapati, Steffi Grote, King MC, Wilson AC. 1975. Evolution at two levels in humans and chim- and Christoph Theunert for help with earlier analysis. We also panzees. Science 188: 107–116. thank three anonymous reviewers for their comments and sugges- Kong A, Thorleifsson G, Gudbjartsson DF, Masson G, Sigurdsson A, Jonasdottir A, Walters GB, Jonasdottir A, Gylfason A, Kristinsson KT, tions that helped to improve this manuscript. This research was et al. 2010. Fine-scale recombination rate differences between sexes, funded by the Max Planck Society, the Paul G. Allen Family foun- populations and individuals. Nature 467: 1099–1103. dation, and the European Research Council (grant agreement no. Kryazhimskiy S, Plotkin JB. 2008. The population genetics of dN/dS. PLoS 694707). Genet 4: e1000304. Genome Research 1571 www.genome.org ́ Peyregne et al. Laland KN, Odling-Smee J, Myles S. 2010. How culture shaped the human Racimo F, Kuhlwilm M, Slatkin M. 2014. A test for ancient selective sweeps genome: bringing genetics and the human sciences together. Nat Rev and an application to candidate sites in modern humans. Mol Biol Evol Genet 11: 137–148. 31: 3344–3358. Lesecque Y, Glémin S, Lartillot N, Mouchiroud D, Duret L. 2014. The Red Rasmussen MD, Hubisz MJ, Gronau I, Siepel A. 2014. Genome-wide infer- Queen model of recombination hotspots evolution in the light of archa- ence of ancestral recombination graphs. PLoS Genet 10: e1004342. ic and modern human genomes. PLoS Genet 10: e1004790. Rosenberg NA. 2002. The probability of topological concordance of gene Li H, Durbin R. 2011. Inference of human population history from individ- trees and species trees. Theor Popul Biol 61: 225–247. Sabeti PC, Schaffner SF, Fry B, Lohmueller J, Varilly P, Shamovsky O, Palma ual whole-genome sequences. Nature 475: 493–496. A, Mikkelsen TS, Altshuler D, Lander ES. 2006. Positive natural selection Locke DP, Hillier LW, Warren WC, Worley KC, Nazareth LV, Muzny DM, in the human lineage. Science 312: 1614–1620. Yang SP, Wang ZY, Chinwalla AT, Minx P, et al. 2011. Comparative Sabeti PC, Varilly P, Fry B, Lohmueller J, Hostetter E, Cotsapas C, Xie X, and demographic analysis of orang-utan genomes. Nature 469: Byrne EH, McCarroll SA, Gaudet R, et al. 2007. Genome-wide detection 529–533. and characterization of positive selection in human populations. Nature Malécot G. 1948. Les mathématiques de l’hérédité. Masson & Cie, Paris, 449: 913–918. France. Sankararaman S, Patterson N, Li H, Pääbo S, Reich D. 2012. The date of in- McVicker G, Gordon D, Davis C, Green P. 2009. Widespread genomic signa- terbreeding between Neandertals and modern humans. PLoS Genet 8: tures of natural selection in hominid evolution. PLoS Genet 5: e1000471. e1002947. Messer PW. 2013. SLiM: simulating evolution with selection and linkage. Sankararaman S, Mallick S, Dannemann M, Prüfer K, Kelso J, Pääbo S, Genetics 194: 1037–1039. Patterson N, Reich D. 2014. The genomic landscape of Neanderthal an- Meyer M, Kircher M, Gansauge MT, Li H, Racimo F, Mallick S, Schraiber JG, cestry in present-day humans. Nature 507: 354–357. Jay F, Prüfer K, de Filippo C, et al. 2012. A high-coverage genome se- Sankararaman S, Mallick S, Patterson N, Reich D. 2016. The combined land- quence from an archaic Denisovan individual. Science 338: 222–226. scape of Denisovan and Neanderthal ancestry in present-day humans. Mikkelsen TS, Hillier LW, Eichler EE, Zody MC, Jaffe DB, Yang SP, Enard W, Curr Biol 26: 1241–1247. Hellmann I, Lindblad-Toh K, Altheide TK, et al. 2005. Initial sequence of Scally A, Dutheil JY, Hillier LW, Jordan GE, Goodhead I, Herrero J, Hobolth the chimpanzee genome and comparison with the human genome. A, Lappalainen T, Mailund T, Marques-Bonet T, et al. 2012. Insights into Nature 437: 69–87. hominid evolution from the gorilla genome sequence. Nature 483: Munch K, Mailund T, Dutheil JY, Schierup MH. 2014. A fine-scale recombi- 169–175. nation map of the human-chimpanzee ancestor reveals faster change in St Pourcain B, Cents RA, Whitehouse AJ, Haworth CM, Davis OS, O’Reilly humans than in chimpanzees and a strong impact of GC-biased gene PF, Roulstone S, Wren Y, Ang QW, Velders FP, et al. 2014. Common var- conversion. Genome Res 24: 467–474. iation near ROBO2 is associated with expressive vocabulary in infancy. Nam K, Munch K, Hobolth A, Dutheil JY, Veeramah KR, Woerner AE, Nat Commun 5: 4831. Hammer MF, Great Ape Genome Diversity Project, Mailund T, Staab PR, Zhu S, Metzler D, Lunter G. 2014. Scrm: efficiently simulating long Schierup MH. 2015. Extreme selective sweeps independently targeted sequences using the approximated coalescent with recombination. the X chromosomes of the great apes. Proc Natl Acad Sci 112: Bioinformatics 31: 1680–1682. 6413–6418. Suda S, Iwata K, Shimmura C, Kameno Y, Anitha A, Thanseem I, Nakamura Nielsen R, Hellmann I, Hubisz M, Bustamante C, Clark AG. 2007. Recent K, Matsuzaki H, Tsuchiya KJ, Sugihara G, et al. 2011. Decreased expres- and ongoing selection in the human genome. Nat Rev Genet 8: 857–868. sion of axon-guidance receptors in the anterior cingulate cortex in au- Oleksyk TK, Smith MW, O’Brien SJ. 2010. Genome-wide scans for footprints tism. Mol Autism 2: 14. of natural selection. Philos Trans R Soc Lond B Biol Sci 365: 185–205. Tajima F. 1989. Statistical method for testing the neutral mutation hypoth- Pääbo S. 2014. The human condition—a molecular approach. Cell 157: esis by DNA polymorphism. Genetics 123: 585–595. 216–226. Varki A, Geschwind DH, Eichler EE. 2008. Explaining human uniqueness: Powell MJD. 1994. A direct search optimization method that models the ob- genome interactions with environment, behaviour and culture. Nat jective and constraint functions by linear interpolation. Adv Optim Rev Genet 9: 749–763. Numer Anal 275: 51–67. Vernot B, Akey JM. 2014. Resurrecting surviving Neandertal lineages from Prüfer K, Muetzel B, Do HH, Weiss G, Khaitovich P, Rahm E, Pääbo S, modern human genomes. Science 343: 1017–1021. Lachmann M, Enard W. 2007. FUNC: a package for detecting significant Vernot B, Tucci S, Kelso J, Schraiber JG, Wolf AB, Gittelman RM, associations between gene sets and ontological annotations. BMC Dannemann M, Grote S, McCoy RC, Norton H, et al. 2016. Bioinformatics 8: 41. Excavating Neandertal and Denisovan DNA from the genomes of Prüfer K, Munch K, Hellmann I, Akagi K, Miller JR, Walenz B, Koren S, Melanesian individuals. Science 352: 235–239. Sutton G, Kodira C, Winer R, et al. 2012. The bonobo genome compared Voight BF, Kudaravalli S, Wen X, Pritchard JK. 2006. A map of recent posi- with the chimpanzee and human genomes. Nature 486: 527–531. tive selection in the human genome. PLoS Biol 4: 446–458. Prüfer K, Racimo F, Patterson N, Jay F, Sankararaman S, Sawyer S, Heinze A, Wang R, Chen CC, Hara E, Rivas MV, Roulhac PL, Howard JT, Chakraborty Renaud G, Sudmant PH, de Filippo C, et al. 2014. The complete genome M, Audet JN, Jarvis ED. 2015. Convergent differential regulation of SLIT- sequence of a Neanderthal from the Altai Mountains. Nature 505: ROBO axon guidance genes in the brains of vocal learners. J Comp Neurol 43–49. 523: 892–906. Przeworski M. 2002. The signature of positive selection at randomly chosen Weaver TD. 2009. The meaning of Neandertal skeletal morphology. Proc loci. Genetics 160: 1179–1189. Natl Acad Sci 106: 16028–16033. Przeworski M. 2003. Estimating the time since the fixation of a beneficial Wright S. 1951. The genetical structure of populations. Ann Eugen 15: allele. Genetics 164: 1667–1676. 322–354. Pybus OG, Shapiro B. 2009. Natural selection and adaptation of molecular Yang MA, Harris K, Slatkin M. 2014. The projection of a test genome onto a sequences. In The phylogenetic handbook (ed. Lemey P, et al.), pp. reference population and applications to humans and archaic homi- 415–417. Cambridge University Press, New York. nins. Genetics 198: 1655–1670. Pybus M, Dall’Olio GM, Luisi P, Uzkudun M, Carreño-Torres A, Pavlidis P, Zacher B, Michel M, Schwalb B, Cramer P, Tresch A, Gagneur J. 2016. Laayouni H, Bertranpetit J, Engelken J. 2014. 1000 genomes selection Accurate promoter and enhancer identification in 127 ENCODE and browser 1.0: a genome browser dedicated to signatures of natural selec- roadmap epigenomics cell types and tissues by GenoSTAN. PLoS One tion in modern humans. Nucleic Acids Res 42: D903–D909. 12: e0169249. Rabiner LR. 1989. A tutorial on hidden Markov models and selected appli- cations in speech recognition. Proc IEEE 77: 257–286. Racimo F. 2016. Testing for ancient selection using cross-population allele frequency differentiation. Genetics 202: 733–750. Received December 9, 2016; accepted in revised form July 5, 2017. 1572 Genome Research www.genome.org http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Genome Research Pubmed Central

Detecting ancient positive selection in humans using extended lineage sorting

Genome Research , Volume 27 (9) – Sep 1, 2017

Loading next page...
 
/lp/pubmed-central/detecting-ancient-positive-selection-in-humans-using-extended-lineage-NHGl5dEY8j

References (99)

Publisher
Pubmed Central
Copyright
© 2017 Peyrégne et al.; Published by Cold Spring Harbor Laboratory Press
ISSN
1088-9051
eISSN
1549-5469
DOI
10.1101/gr.219493.116
Publisher site
See Article on Publisher Site

Abstract

Method Detecting ancient positive selection in humans using extended lineage sorting Stéphane Peyrégne, Michael James Boyle, Michael Dannemann, and Kay Prüfer Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, 04103 Leipzig, Germany Natural selection that affected modern humans early in their evolution has likely shaped some of the traits that set present- day humans apart from their closest extinct and living relatives. The ability to detect ancient natural selection in the human genome could provide insights into the molecular basis for these human-specific traits. Here, we introduce a method for detecting ancient selective sweeps by scanning for extended genomic regions where our closest extinct relatives, Neandertals and Denisovans, fall outside of the present-day human variation. Regions that are unusually long indicate the presence of lineages that reached fixation in the human population faster than expected under neutral evolution. Using simulations, we show that the method is able to detect ancient events of positive selection and that it can differentiate those from background selection. Applying our method to the 1000 Genomes data set, we find evidence for ancient selec- tive sweeps favoring regulatory changes and present a list of genomic regions that are predicted to underlie positively se- lected human specific traits. [Supplemental material is available for this article.] Modern humans differ from their closest extinct relatives, (up to 150,000 yr ago) (Fig. 2 from Racimo 2016) but has little pow- Neandertals, in several aspects, including skeletal and skull mor- er to detect events older than 200,000 yr ago in modern humans. A phology (Weaver 2009), and may also differ in other traits that second method applied an approximate Bayesian computation on are not preserved in the archeological record (Varki et al. 2008; patterns of homozygosity and haplotype diversity around alleles Laland et al. 2010). Natural selection may have played a role in fix- that reach fixation (Racimo et al. 2014). Although this approach ing these traits on the modern human lineage. However, the selec- expands our ability to investigate older time frames, this signal tion events driving the fixation would have been restricted to a of selection also fades over time, and events of positive selection specific timeframe, extending from the split between archaic and older than 300 thousand years ago (kya) become undetectable. modern humans ca. 650,000 yr ago to the split of modern human Based on a method introduced by Green et al. (2010), Prüfer populations from each other around 100,000 yr ago (Prüfer et al. et al. (2014) presented a hidden Markov model that identifies re- 2014). While methods exist that can be used to scan the genome gions in the genome where the Neandertal and Denisovan individ- for the remnants of past or ongoing positive selection (Nielsen uals fall outside of present-day human variation (i.e., the archaic et al. 2007; Pybus and Shapiro 2009), current methods have lineages fall basal compared to all present-day humans) and ap- limited power to detect positive selection on the human lineage plied the model to detect selective sweeps on the modern human that acted during this older timeframe (see Sabeti et al. 2006 for lineage. Regions that are unusually long are candidates for ancient a review on detection methods and their timeframes): an unusual- selective sweeps as variants are likely to have swept rapidly to fix- ly high ratio of functional changes to nonfunctional changes, such ation, dragging along with them large parts of the chromosomes as the d /d test, requires millions of years and often multiple that did not have time to be broken up by recombination. While N S events of selection to generate detectable signals (Kryazhimskiy this method is, in principle, expected to be able to detect events and Plotkin 2008), while unusual patterns of genetic diversity be- as old as the modern human split from Neandertals and Deniso- tween individuals and populations (e.g., extended homozygosity, vans, this power was never formally tested, and it has several other Tajima’s D, F ) are most powerful during the selective sweep or shortcomings. First, the method was limited to modern human ST shortly after (Sabeti et al. 2006; Oleksyk et al. 2010). A favorable polymorphisms, ignoring the additional information given by substitution is not expected to leave a mark on linked neutral var- fixed substitutions. Second, the method does not fit parameters iation beyond 250,000 yr in humans (Przeworski 2002, 2003). to the data but requires these parameters to be estimated through The genome sequencing of archaic humans (Neandertals and coalescent simulations. Denisovans) to high coverage (Meyer et al. 2012; Prüfer et al. 2014) Here, we introduce a refined version of this method, called has spawned new methods to investigate the genetic basis of mod- ELS (Extended Lineage Sorting), that models explicitly the longer ern human traits that are not shared by the archaics (Pääbo 2014). regions produced under selection and includes the fixed differenc- One method, called 3P-CLR, models allele frequency changes be- es between archaic and modern human genomes as an additional fore and after the split of two populations using the archaic ge- source of information. The ELS method also takes advantage of an nomes as an outgroup (Racimo 2016). 3P-CLR outperforms expectation-maximization algorithm to estimate the model pa- previous methods in the detection of older events of selection rameters from the data itself, making it free from assumptions re- garding human demographic history. Corresponding authors: [email protected], pruefer@ eva.mpg.de Article published online before print. Article, supplemental material, and publi- © 2017 Peyrégne et al. This article, published in Genome Research, is available cation date are at http://www.genome.org/cgi/doi/10.1101/gr.219493.116. under a Creative Commons License (Attribution 4.0 International), as described Freely available online through the Genome Research Open Access option. at http://creativecommons.org/licenses/by/4.0/. 27:1563–1572 Published by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/17; www.genome.org Genome Research 1563 www.genome.org ́ Peyregne et al. To evaluate the power of the ELS method to detect ancient outgroup, i.e., all lineages are sorted. Because of recombination, selective sweeps, we tested its performance under scenarios of the human genome is a mosaic of independent evolutionary histo- background selection and neutrality. Finally, we present an updat- ries, and the process of lineage sorting is expected to randomly af- ed list of candidate regions that likely underwent positive selection fect regions, until, ultimately, all lineages will be sorted. In the case on the modern human lineage since the split from the common of modern humans, only a fraction of the regions in the genome ancestor with Neandertals and Denisovans. are expected to show lineage sorting (Prüfer et al. 2014), and the genome can be partitioned into regions where an archaic lineage falls either within the variation of modern humans (internal re- Results gion) or outside of the human variation (external region) (Fig. 1B). While lineage sorting can occur under neutrality, selection on Selection causes extended lineage sorting between closely related the modern human branch is expected to always lead to external populations regions as long as the selective sweep finished. In cases where The ancestors of modern humans split from the ancestors of the selective sweep is sufficiently strong, there will not be suffi- Neandertals and Denisovans between 450,000 and 750,000 yr cient time for recombination to break the linkage with neighbor- ago (Prüfer et al. 2014). Because the two newly formed descendant ing sites and a large region will reach fixation (extended lineage groups sampled the genetic variation from the ancestral popula- sorting) (Fig. 1C). In contrast, selection on standing variation tion, a derived variant can be shared between some members of may fail to generate such large regions, since recombination can both groups, while other individuals show the ancestral variant. act on the haplotype(s) with the prospective advantageous variant At these positions, some lineages from one group share a more re- before selection sets in. We note that neither demography nor se- cent common ancestor with some lineages in the other group than lection on the archaic lineage affect the lineage sorting within within the same group (Rosenberg 2002), a phenomenon called modern humans and thus the power to detect selective sweeps. incomplete lineage sorting (Fig. 1A). Eventually, a derived allele may reach fixation as part of a re- Expected incomplete lineage sorting among humans to archaics gion that has not been unlinked by recombination. In these re- gions, all descendants will derive from one common ancestor, We used coalescent simulations to determine the incidence and and any lineage from the other population will constitute an expected length of regions resulting from incomplete lineage Figure 1. Illustration of the lineage sorting process. (A) Effects on the genealogy. The process starts with a random distribution of lineages when the ancestral population splits. The lineage in black is an outgroup to lineages in blue, so that the blue lineages show a closer relationship between populations than to the black lineage (incomplete lineage sorting). When the blue lineages in the top population reach fixation (through a selective sweep for instance), any lineage from the other populations will constitute an outgroup, thereby completing the sorting of lineages. (B) Two types of genealogies illustrating the possible relationships between an archaic lineage and modern human lineages. (C) Local effects in the genome at different time points. The curves rep- resent the progression of lineage sorting for two independent regions, evolving under neutrality (black curve) and positive selection (blue curve), respec- tively. Longer fixation times are associated with more recombination so that neutrality produces smaller external regions. 1564 Genome Research www.genome.org Ancient positive selection in humans sorting in modern humans. Using a model of human demographic cM for most simulations with a selection coefficient of 0.001. In history (Yang et al. 2014), we estimated the fraction of lineage sort- contrast, external regions longer than 0.1 cM were observed for se- ing in modern humans in regard to Neandertals and Denisovans. lection coefficients above 0.05. Therefore, detectable signals are In simulations with 370 African chromosomes, and assuming a expected to be biased toward strong events with a selection coeffi- uniform recombination rate, ∼10% of the archaic genome is cient larger than 0.001. more divergent than the time to the most recent common ancestor of all sampled human variation. The length of the external regions Hidden Markov model to detect extended lineage sorting is expected to be ∼0.0016 centimorgan (cM) (95% CI: 0.001– 0.0095 cM; e.g., 1–9.5 kb for a recombination rate of 1 cM/Mb) To detect regions of extended lineage sorting, we modeled the with the longest regions on the order of 0.02 cM. In contrast, changes of local genealogies along the genome with a hidden internal regions are expected to be 0.012 cM long (95% CI: Markov model. We distinguish two types of genealogies, internal 0.0097–0.07 cM). or external, depending on whether the archaic lineage falls inside or outside of the human variation, respectively (Fig. 3A). The mod- el includes a third state corresponding to extended lineage sorting, Minimum strength of selection to produce detectable and external regions produced by this state are required to be lon- sweep signals ger, on average, than those produced by the external state. The three states are inferred from the state of the archaic allele (ances- We investigated the range of selection coefficients that could have led to the fixation of a lineage after the split with the archaic hom- tral or derived) either at a polymorphic position in modern hu- inins but before the differentiation of genetically modern humans mans or at a position where modern humans carry a fixed about 100–120 kya (Li and Durbin 2011) by simulating mutations derived variant. In the following, we describe the different statisti- cal properties expected for each type of genealogy. occurring at different times and evolving with different selection coefficients. While the simulations show that completed selective We first considered external regions. At modern human poly- morphic sites, the archaic genome is expected to carry the ances- sweeps could have occurred with selection coefficients as low as 0.0005 (Fig. 2A), the length distribution of haplotypes reaching tral variant since the derived variant would indicate incomplete lineage sorting. To account for sequencing errors or misassign- fixation is indistinguishable from neutrality for selection coeffi- cients under 0.001 (Fig. 2B,C). Under neutrality, the average ment of the ancestral state, we allow a probability of 0.01 for carry- ing the derived allele (see Methods). At sites where the derived length of external regions was 0.02 cM and remained below 0.03 allele is fixed, the archaic genome will of- ten carry the derived state, if the fixation event occurred before the split of the ar- chaic from the modern human lineage, or occasionally, the ancestral state, if the fixation event is more recent and oc- curred after the split. For internal regions, the archaic is expected to share the derived allele at modern human fixed derived sites but can carry the ancestral allele in our model to accommodate errors, albeit with low probability. In contrast, at sites that are polymorphic in modern humans, the probabilities of observing the ancestral or the derived allele in the archaic ge- nome will depend on the age of the de- rived variant, with young variants being less likely to be shared compared to older variants. The frequency of the derived variant in the modern human population can be used as a proxy for its age, and the emission probabilities in our model take the modern human derived allele fre- quency into account (see Methods). We modeled the transition proba- bilities between internal and external re- gions (related to the length of the regions) by exponential distributions. The extended lineage sorting state has Figure 2. (A) Fraction of selected alleles reaching fixation (gray) or segregating (orange) at present, the same chance of emitting derived al- depending on the strength of selection (columns) and the age of the mutation (rows, in kya) in our sim- leles as the other external state but is re- ulations. Events for which the selected variant was lost are not shown. (B) Distribution of the genetic quired to have a larger average length. length of external regions simulated under neutrality. (C) Distributions of the genetic length of external We used the Baum-Welch algorithm regions depending on the strength of selection (columns) and age of mutations in kya (rows). The blue line corresponds to the upper limit for the length of external regions produced under neutrality from B. (Durbin et al. 1998), an expectation- Genome Research 1565 www.genome.org ́ Peyregne et al. We do not expect ELS regions to be detected in our neutral simulations, and indeed we found that either the estimat- ed proportion of ELS converged to zero or the maximum likelihood estimate for the length of ELS and external regions con- verge to the same value (49% and 51% of simulations, respectively). A likeli- hood ratio test comparing a model with- out the ELS state to the full model with the ELS state also showed no significant improvement with the additional state in almost all neutral simulations (only one likelihood ratio test out of 100 simu- lations showed a significant improve- ment after Bonferroni correction for multiple testing). We then evaluated the accuracy of the ELS method to assign the correct ge- nealogy to regions based on sequences obtained through coalescent simulations with selection (Fig. 3B,C). In these simu- lations, the underlying genealogy at each site along the sequences is known and can be compared to the estimates. To be conservative, we only focused on results Figure 3. (A) Graphical representation of the Extended Lineage Sorting hidden Markov model. States with the smallest selection coefficient are depicted by nodes and transitions by edges. Each state emits an archaic allele as either derived, D, or (s = 0.005) that produces regions long ancestral, A, depending on the type of site in the modern human population (fixed or segregating at a enough to be detectable. In Figure 3B, given frequency). States are labeled I for Internal, E for External, and ELS for Extended Lineage Sorting. (B) we show the accuracy for labeling the ex- Receiver operator curves for varying cutoffs on the posterior probability of the ELS state and counting the number of sites in ELS regions that were correctly labeled. All bases labeled ELS outside of simulated ELS tended lineage sorting regions depen- regions are considered false positives. Sites in ELS regions with a posterior probability below the cutoff are dent on the posterior probability cutoff considered false negatives. (C) Example of the labeling of a simulated ELS region. Horizontal bars indicate for the ELS state. The results demonstrate true external (top) and internal (bottom) regions. The posterior probability is shown in red for ELS regions that the model has sufficient power to ac- and in gray for E regions. The region overlapping position 50,000 (red bar) is caused by a simulated selec- curately label sites that experienced se- tive sweep. lection with a coefficient s ≥ 0.005 and an occurrence of the beneficial mutation maximization algorithm, to estimate the emission probabilities as long as 600,000 yr ago. We also used the simulations of positive selection events (s = and estimate the transition probabilities with a likelihood maximi- zation algorithm. 0.005) with two different times at which the beneficial mutation occurred, 300 and 600 kya, to test how often the beneficial simu- lated variant falls within a detected ELS region (Supplemental Accuracy of parameter estimates and inferred genealogies Table S1). To put this rate of true positives into perspective, we We first investigated the performance of the parameter inference also counted how many ELS regions did not overlap the selected on simulated data under neutral evolution. We found that the es- variant (false positives). A large fraction of selected mutations timated probabilities for encountering ancestral/derived alleles in were detected (87%–92%). However, we also found a substantial external and internal regions fit the simulated parameters well fraction of false positive ELS regions (10%–11%). When restricting (on average, less than ± 0.08 from simulated under all tested detected ELS regions to those that are longer than 0.025 cM, we conditions) (Supplemental Figs. S1, S2), while the estimated find less than 0.1% false positives compared to 65%–68% true pos- length of internal and external regions deviate more from the sim- itives. Not all simulated regions with a selection coefficient of ulated lengths (around 15% overestimate of the mean length) 0.005 produce ELS regions of this size, so that the rate of true pos- (Supplemental Fig. S3). However, we found that the model exhib- itives for truly long regions is expected to be higher. For all follow- its better accuracy in labeling the correct genealogies with the esti- ing analyses, we used this minimal length cutoff of 0.025 cM. mated length parameters compared to the simulated true values (Supplemental Fig. S4). This difference seems to originate from Role of background selection the difficulty in accurately detecting very short external regions or internal regions with very few informative sites. We note that Background selection is defined as the constant removal of neutral detecting selection is not affected by this problem since we are pri- alleles due to linked deleterious mutations (Charlesworth et al. marily interested in detecting long external regions. Including 1993). In regions of the genome that undergo background selec- fixed differences improves the power to assign the correct geneal- tion, a fraction of the population will not contribute to subsequent ogies compared to a version of the method without this additional generations, causing a reduced effective population size. As a con- source of information (Supplemental Fig. S4). sequence, remaining neutral alleles can reach fixation faster than 1566 Genome Research www.genome.org Ancient positive selection in humans under neutrality, potentially producing unusually long external ty about recombination rates. We will refer to the set of 81 regions regions that could be mistaken as signals of positive selection. as the core set (Supplemental File S1) and the set including the 233 We investigated the effects of background selection by running putatively selected regions found with just one recombination map forward simulations with parameters that mimic the strength as the extended set (314 regions) (Supplemental File S2). and extent of background selection estimated for the human ge- For completeness, we also ran our model on the X nome (Messer 2013). While background selection simulations Chromosome and identified 12 additional candidates (43 if we did produce some long outlier regions that fall outside the distribu- consider candidates found with at least one recombination tion observed in neutral simulations, most regions are still smaller map), applying a more stringent length cutoff of 0.035 cM to ac- than regions simulated with positive selection at a conservative se- count for the stronger effects of random drift on this chromosome lection coefficient of 0.005 (Fig. 4A). Indeed, among the 1160 ex- (Methods). Interestingly, we also found a significant increase of ternal regions detected in our simulations of background selection posterior probabilities for selection within previously reported re- (s = 0.05) (Fig. 4A), only six were labeled as ELS and only three gions under potential recurrent selective sweeps in apes (Mann- −16 passed the minimal length filter of 0.025 cM. Whitney U one-sided test, P-value < 2.2 × 10 )(Supplemental Table S3; Dutheil et al. 2015; Nam et al. 2015). The detected selection candidate regions on the autosomes Candidate regions of positive selection on the human lineage do not show a decrease in B-scores (McVicker et al. 2009), a local measure of background selection strength, compared with random To identify ancient events of positive selection on the human lin- eage, we applied the ELS method to African genomes from the regions (Wilcoxon rank-sum test comparing the average B-scores with permuted regions, P-value = 0.565, or comparing the lowest 1000 Genomes Project (The 1000 Genomes Project Consortium 2012). We disregarded non-African populations since Neandertal B-scores in our regions to permuted regions, P-value = 0.504) (Fig. 4B). This suggests that candidate regions are not primarily generat- introgression in these populations could mask selective sweeps and lead to false negatives. A model with ELS fits the data signifi- ed by strong background selection. We compared our candidate regions to the top candidates of cantly better than a model without the ELS state for all chromo- somes and for both tested recombination maps (P-value < 1 × eight previous scans for selection, including iHS, F , XP-CLR, and ST −8 HKA (Pybus et al. 2014; Cagan et al. 2016). Using the estimated 10 )(Supplemental Table S2). We identified 81 regions of human extended lineage sorting time to the most recent common ancestor among Africans for each identified region/site, we found that our ELS scan identified for which both recombination maps support a genetic length great- er than 0.025 cM (average length: 0.05 cM). Depending on the re- significantly older events than other screens (Mann-Whitney U tests) (Fig. 5; Supplemental Table S4). We found 23 regions from combination map, the longest overlap between the maps is 0.12 (African-American map) or 0.17 (deCODE map) cM long, which the core set (detected by both recombination maps) overlapping with candidates from previous scans and 68 for the extended set is three to four times longer than the longest regions produced under background selection in our simulations. An additional (detected by at least one recombination map); neither overlap is more than expected at random (P-values are 0.06 and 0.595, re- 233 regions are longer than 0.025 cM according to only one recom- bination map, with 71% of those additional regions showing sup- spectively). In contrast, our candidate regions overlap candidate regions from 3P-CLR (Racimo 2016) and the ABC approach for port for the ELS state using both recombination maps. This suggests that the variation in the candidate set mostly stems from uncertain- detecting ancient selection (Racimo et al. 2014) more often than expected by chance (P-values < 0.05) (Supplemental Table S5). Biological functions of the candidate regions Since positive selection acts on advanta- geous phenotypes that are caused by changes to functional elements in the ge- nome, we would expect that our candi- date regions would overlap functional elements in the genome more often than expected. We first tested this hypothesis by counting the overlap between sweep candidate regions and protein coding genes (Ensembl release 82) (Aken et al. 2016). We find no statistically significant overlap of ELS regions with protein cod- Figure 4. Effects of background selection. (A) Comparison of the length of ELS regions in simulations of ing genes compared to randomly placed different scenarios. For the distribution under background selection, the s parameter corresponds to the regions of the same size (P-value = 0.671 average selection coefficient from the gamma distribution (shape parameter of 0.2). We assumed that the deleterious mutations are recessive with dominance coefficient h = 0.1. The horizontal blue line cor- and 0.124, for core and extended set, re- responds to the length cutoff applied to the real data. (B) Distribution of B-scores in the candidate sweep spectively) (Fig. 6). Previous work has regions (red curve) compared to sets of random regions with matching physical lengths (blue area with identified 96 proteins that carry human dotted blue lines indicating the 95% confidence intervals over 1000 random sets of regions). The lowest fixed derived nonsynonymous changes B-score (i.e., stronger background selection) was chosen when a region overlapped several B-score annotations. compared to Neandertal and Denisova, Genome Research 1567 www.genome.org ́ Peyregne et al. and Huber 2010) and found again no enrichment (Supplemental Table S6). In an attempt to include potential regulatory changes in the enrichment test, we assigned genes to candidate regions when a region fell upstream of or downstream from a gene (see Supplemental Methods). Although many candidate genes that were annotated in this way were most highly expressed in the brain or the heart (Odds ratio = 2.10 for both tissues), this enrich- ment is not significant when correcting for gene length and mul- tiple testing (Family-wise error rate = 0.336 and 0.997, respectively) (Supplemental Table S7). Additional work will be required to investigate the phenotyp- ic consequences of changes in candidate regions for selection. To facilitate this work, we provide an annotated list of fixed or nearly fixed sites on the human lineage that fall within our candidate re- gions (Supplemental File S3). Overlap with Neandertal introgression Introgression from Neandertals and Denisovans into modern hu- mans occurred approximately 37,000 to 86,000 yr ago (Sankararaman et al. 2012, 2016; Fu et al. 2014, 2015). For those advantageous derived variants that arose on the modern human Figure 5. Distributions of estimated ages of the modern human segre- gating derived variants with the highest frequency in putatively selected lineage prior to introgression, we would expect that selection regions or the age of the derived variants at sites identified by various ge- may have acted against the re-introduction of the ancestral variant nome-wide scans. Our candidate regions are labeled as ELS, for Extended through admixture. We tested whether this selection may have af- Lineage Sorting; other candidate regions are from Pybus et al. (2014) and fected the distribution of Neandertal introgressed DNA around Cagan et al. (2016). The color coding indicates the type of signal detected by each method. Ages were estimated by ARGweaver (Rasmussen et al. fixed changes in candidate sweep regions. Out of a total of 963 2014). We only report events between 0 and 600 kya. fixed derived variants in Africans overlapping the extended set of sweep regions, 240 (25%) show the ancestral allele in non- Africans and show evidence for re-introduction by admixture us- which constitute a particularly interesting subset of potentially ing a map of Neandertal introgression (Vernot and Akey 2014). functional changes to genes that may have been caused by selec- This level of Neandertal ancestry is comparable to the genome- tive sweeps (Prüfer et al. 2014). We found no overlap between wide fraction of out-of-Africa ancestral alleles at African fixed de- these genes and the core set of sweep candidate regions that rived sites (∼26%; bootstrap P-value = 0.583). We also find no sig- were identified by both recombination maps. However, when con- nificant reduction in frequency of Neandertal ancestry around sidering the extended set of sweep candidate regions, 11 regions candidate substitutions in sweep regions, when comparing one overlapped such genes: ADSL, BBIP1, ENTHD1, HERC5, KATNA1, KIF18A, NCOA6, PRDM10, SCAP, SLITRK1, and ZNHIT2. This over- lap is significantly larger than expected by chance (only two genes −3 are expected on average; P-value < 10 ). In all instances, the can- didate regions contained at least one fixed amino acid change. Since fixed changes are part of the information used to infer exter- nal regions, it stands to reason that the presence of such a change may bias toward observing an overlap with candidate regions (72/ 81 core regions and 275/314 regions from the extended set contain fixed changes). However, we note that the overlap with fixed ami- no acid changes is also significantly larger than the overlap with other fixed changes (963 of 20,347 fixed changes fall within can- didate regions from the extended set; binomial P-value = 0.006). Phenotype may also be influenced by regulatory changes that affect gene expressions. Interestingly, we found a significant en- richment for regions overlapping enhancers and promoters (P-val- ue < 0.001 and P-value = 0.002, respectively) (see Fig. 6) when considering the extended set of 314 candidate regions. However, this enrichment was not significant for the smaller core set of candidates. To further investigate the biological function of our regions, we tested for Gene Ontology enrichment in genes within the ex- Figure 6. Enrichment for regulatory elements (enhancers, P-value < tended set of regions (Prüfer et al. 2007). No category showed sig- 0.001; protein-coding genes, P-value = 0.124; and promoters, P-value = nificant enrichment when comparing to randomly placed regions 0.002) in the extended set of 314 candidate sweep regions. The distribu- of identical sizes in the genome (see Supplemental Methods). We tions were obtained by randomly placing candidate regions in the genome also assigned genes that overlap our extended data set to tissues to obtain lists of regions with similar physical lengths. The red lines repre- in which they show the significantly highest expression (Anders sent the value observed in the real extended set. 1568 Genome Research www.genome.org Ancient positive selection in humans randomly sampled fixed African substitution per region against overlap with other types of positive selection scans is smaller. random regions matched for size and distance to genes (Supple- Among our candidates, 55 are novel regions (234 if considering mental Figs. S5, S6). the extended set) that were not detected in any of the previous If selection against the re-introduction of an ancestral variant screens, including previous versions of the screen without fixed were very strong, selection may have depleted Neandertal ancestry differences (Supplemental Fig. S7). in a large region surrounding the selected allele. Interestingly, we While we find no difference in the fraction of genes in select- find some of our sweep candidate regions that fall within the ed regions compared to randomly placed regions, we detect an en- longest deserts of both Neandertal and Denisova ancestry richment for enhancers and promoter regions. This result is in (Supplemental Table S8; Vernot et al. 2016). A significantly high agreement with the hypothesis that regulatory changes may play number of the core set of regions fall in these deserts (5/81 regions, an important role in human-specific phenotypes (King and P-value = 0.024), while the extended set shows no significant en- Wilson 1975; Carroll 2003; Enard et al. 2014), maybe more so richment (9/314 regions, P-value = 0.205). than amino acid changes (Hernandez et al. 2011; see also Enard et al. 2014; Racimo et al. 2014). Interestingly, several gene candi- dates falling within sweep regions play a role in the function and Discussion development of the brain. A particularly interesting observation Many genetic changes set modern humans apart from Neandertals is the potential selection on the genes encoding both the ligand, and Denisovans, but their functions remain elusive. Most of these SLIT2, and its receptor, ROBO2, which reside on Chromosomes 4 changes probably resulted in either no change to the phenotype or and 3, respectively (see Supplemental File S3 for an annotated list of changes in those genes). Members of the Roundabout (ROBO) to a selectively neutral change. However, in rare instances, selec- tion may have favored changes modifying the appearance, behav- gene family play an important role in guiding developing axons ior, and abilities of present-day humans. Unfortunately, current in the nervous system through interactions with the ligands methods to identify selection have limited power to detect such SLITs. SLITs proteins act as attractive or repulsive signals for axons old events of positive selection (Przeworski 2002, 2003; Sabeti expressing different ROBO receptors. ROBO2 has been further asso- et al. 2006). ciated with vocabulary growth (St Pourcain et al. 2014), autism Here, we introduce a hidden Markov model to detect ancient (Suda et al. 2011), and dyslexia (Fisher and DeFries 2002) and is in- selective sweeps based on a signal of extended lineage sorting. volved in the development of neural circuits related to vocal learn- Using simulations, we were able to show that the method can ing in birds (Wang et al. 2015). Interestingly, ROBO2 is also in a long detect older events of selection as long as the selected variant desert of both Denisovan and Neandertal ancestry in non-Africans. was sufficiently advantageous. The power to detect older events We also identified interesting brain-related candidates on the is due to the fact that the method increases in power with the num- X Chromosome, among them DCX, a protein controlling neuro- ber of mutations that accumulated after the sweep finished. We nal migration by regulating the organization and stability of mi- also showed that background selection can cause false signals crotubules (Gleeson et al. 1999). Mutations in this gene can have and have chosen a minimum length cutoff on candidate regions. consequences for the expansion and folding of the cerebral cortex, While this cutoff reduces the number of false positives due to back- leading to the “double cortex” syndrome in females and “smooth ground selection, we note that this cutoff is expected to exclude brain” syndrome in males (Gleeson et al. 1998). bona fide events of positive selection, too. We have presented a new approach to detect ancient selective We applied the ELS method to 185 African genomes, the Altai sweeps based on a signal of extended lineage sorting. Applying this Neandertal genome, and the Denisovan genome and detected 81 approach to modern human data revealed that selection may have candidate regions of selection when requiring a minimum genetic acted primarily on regulatory changes. With population level se- length supported by two independent recombination maps. The quencing of nonhuman species becoming more readily available, uncertainty in the recombination maps has a large effect on our re- we anticipate that this approach will help to reveal the targets of sults, as shown by the much larger numberof 314 regions identified ancient selection in other species. by either recombination map. Recombination rates over the ge- nome are known to evolve rapidly (Lesecque et al. 2014), and of Methods particular concern are recent changes in recombination rates that make some regions appear larger in genetic length than they Data were in the past. By comparing the current recombination rates We used single nucleotide polymorphisms (SNPs) from 185 unre- in our regions to recombination rates in the ancestral population lated Luhya and Yoruba individuals from the 1000 Genomes of both chimpanzee and humans (Munch et al. 2014), we identi- Project phase I (The 1000 Genomes Project Consortium 2012) to- fied some candidate regions that may have increased in recombina- gether with four ape reference genome assemblies (chimpanzee tion rates (Supplemental Table S9). However, it is currently [panTro3] [Mikkelsen et al. 2005], bonobo [panPan1.1] [Prüfer impossible to date the change in recombination rates confidently, et al. 2012], gorilla [gorGor3] [Scally et al. 2012], and orangutan and these candidate sweeps may post-date the change. [ponAbe2] [Locke et al. 2011]) to compile a list of polymorphic A particular strength of our screen for selective sweeps is the and fixed derived changes in Luhya and Yoruba. Neandertal and ability to detect older events, as indicated by the estimated power Denisova alleles at these positions were extracted from published to detect simulated events of positive selection of old age and mod- VCFs (Danecek et al. 2011) using recommended filters (Prüfer erate strength. This sets the ELS method apart from previous ap- et al. 2014; see Supplemental Material for further details). Sites proaches that made use of archaic genomes, which were geared where either Neandertal or Denisova carried a third allele were toward detecting younger events with an age of less than disregarded. 300,000 yr ago (Racimo et al. 2014; Racimo 2016). Despite this dif- Genetic distances between these positions were calculated us- ference, we found significant overlap between the ELS candidates ing the African-American (Hinch et al. 2011) and the deCODE and the candidates identified by these other approaches, while the (Kong et al. 2010) recombination maps (available in Build 37 Genome Research 1569 www.genome.org ́ Peyregne et al. from http://www.well.ox.ac.uk/~anjali/). Both maps were chosen as implemented in the NLopt library (Steven G. Johnson, The since they estimate recombination rates from events that occurred NLopt nonlinear-optimization package, http://ab-initio.mit.edu/ within a few generations before the present. Recombination maps nlopt) to maximize the log-likelihood values calculated by the based on older events (i.e., LD-based map) can underestimate re- Forward algorithm. Convergence was attained in a maximum of combination rates in regions that underwent recent selective 1000 evaluations, and the log-likelihood maximization accuracy −4 sweeps, potentially masking true signals. was set to 10 . To test for convergence to local maxima, we ran the algorithm twice with different starting points and used the pa- rameters of the run with the highest likelihood to run the re-esti- Hidden Markov model mation algorithm a third time, starting with those parameters. We would like to estimate for each informative position the prob- All three runs gave similar results on all chromosomes. abilities for the three possible genealogies, external (E), internal (I), and extended lineage sorting (ELS), given the observed data. Post-processing Formally and following the notation from Durbin et al. (1998), The HMM was executed independently on all chromosomes for we calculate P(π = k|x), where i denotes the position, k ∈ {E,I,ELS} both Denisova and Neandertal and using the African-American and x is the sequence of observations with the ith observation de- and deCODE recombination maps. An external region was defined noted x . With the genetic distance d between consecutive sites as a stretch of high posterior probabilities (P ≥ 0.7) for the extended and l , the average genetic length of a region in state k, we specify −d/l k lineage sorting state that was uninterrupted by sites with a low the transition probabilities between identical states as t = e . k,k probability (P ≤ 0.1). The two cutoffs on the posterior probabilities Transitions from I to the states ELS and E depend on an addi- were determined by simulating sequences with positive selection tional parameter p, the proportion of transitions from I to ELS, −d/l I (s = 0.005, 500 kya; see below). Sites that were simulated external and their probability is given by t = p 1 − e and I,ELS −d/l I in both archaics were labeled as 1 and the remaining sites as t =(1 − p) 1 − e . Lastly, transitions from the two external I,E −d/l j 0. The HMM was then run on the simulations. By running a states to internal have the probability t = 1 − e , with j ∈ {E, j,I grid-search over possible cutoffs (step-sizes of 0.05 for the two pa- ELS}. By construction, transitions between E and ELS genealogies rameters) and labeling the HMM output accordingly, we identified are not allowed: it would not be possible to detect such transitions, the set of chosen parameters by minimizing the root mean square as those two states have the same statistical properties. error The inference further requires the probability for observing an ancestral or derived allele in the archaic at a site i with a derived () t − o allele frequency f > 0 in modern humans (noted x ) given that the i i i i i true genealogy is k ∈ {I, E, ELS}: e (x )= P(x | π = k). We assume that n k i i i ∀x : e (x)= e (x) , i.e., that both external states give rise to ances- ELS E with n the number of labeled sites, t the true label, and o the ob- i i tral and derived alleles in the archaic with equal probabilities given served label. the same observation. Since external regions are not expected to give rise to derived sites when the derived allele is segregating in modern humans, the only sources for such an observation can Simulations be errors or independent coinciding identical mutations, and we We simulated sequences using a model of recent human demogra- define an error rate for external regions: ε = e (x = derived, f < 1). E E i i phy to test the performance of our HMM under different scenarios Similarly fixed derived sites are expected to show the derived allele of neutral evolution, positive selection, or background selection. in the archaics if the local genealogy is internal, and we define an Each simulation consisted of one chimpanzee chromosome, one error rate for internal regions: ε = e (x = derived, f = 1). I I i i chromosome from each archaic hominin, and 370 human chro- We compute the posterior probability P(π = k | x) that an ob- mosomes, matching the 185 Luhya and Yoruba individuals used servation x came from state k given the observed sequence x as: in our analysis. For all simulations in this study, a constant muta- −8 −1 −1 tion rate of 1.45 × 10 bp ·generation , a constant recombina- P(x,p = k) −1 −1 P(p = k | x)= . tion rate of 1 cM.Mb .generation , and a generation time of 29 P(x) yr were assumed. We used estimates of population sizes from P(x, π = k)= f (i)b (i), where f (i)= P(x …x , π = k) and b (i)= i k k k 1 i i k Yang et al. (2014) and population split estimates from Prüfer P(x …x | π = k) are the output of the Forward and Backward i+1 L i et al. (2014) as parameters for the simulated demography algorithms, respectively (Rabiner 1989; Durbin et al. 1998). P(x) (Supplemental Information 1, 2). corresponds to the likelihood of the data given our model and Neutral simulations were generated with the coalescent was also calculated from the Forward algorithm. simulator scrm (Staab et al. 2014) and give a good match to our observed data when plotting derived allele frequency in modern humans against the proportion of derived alleles in the outgroup Parameter estimate (Supplemental Fig. S8). Simulations with positive selection were We used the Baum-Welch algorithm to estimate all emission prob- generated with the coalescent simulator msms (Ewing and abilities, with the exception of ε , the proportion of segregating Hermisson 2010), and background selection was explored using sites derived in the archaic genome in external regions, due to lim- forward-in-time simulations generated by SLiM (Messer 2013). ited accuracy in the estimates. We set this last parameter to a value Further details on simulation parameters are given in the of 0.01, a conservative upper limit on contamination and sequenc- Supplemental Material. ing error in the two high-coverage archaic genomes. The Baum- Welch algorithm was run for a maximum of 40 iterations, and Age comparison with other scans for selection the convergence criteria were set to a log-likelhood maxima differ- −4 ence of less than 10 . To compare our sweep screen with previous scans, we downloaded We estimated the remaining parameters (average lengths of candidate regions from the 1000G positive selection database regions and the proportion of transitions to the ELS state) using (Pybus et al. 2014). Only candidates with a P-value lower than the derivative free optimization method COBYLA (Powell 1994) 0.001 were considered. We added to this set of regions 1570 Genome Research www.genome.org Ancient positive selection in humans the top reported regions from a HKA scan (Cagan et al. 2016). Author contributions: S.P. implemented the method. S.P., Allele age estimates were obtained from ARGweaver (Rasmussen M.J.B., and M.D. analyzed data. S.P., M.J.B., M.D., and K.P. inter- et al. 2014). preted the results. K.P. designed the study. S.P. and K.P. wrote F , iHS, and XP-EHH are site-based statistics which localize the manuscript with input from all authors. ST sites that may have been selected (Malécot 1948; Wright 1951; Voight et al. 2006; Sabeti et al. 2007), whereas selective scans such as CLR, XP-CLR, Tajima’s D, Fay and Wu’s H, and HKA iden- References tify candidate regions (Hudson et al. 1987; Tajima 1989; Fay and The 1000 Genomes Project Consortium. 2012. An integrated map of genetic Wu 2000; Kim and Stephan 2002; Chen et al. 2010). In order to variation from 1,092 human genomes. Nature 491: 56–65. Aken BL, Ayling S, Barrell D, Clarke L, Curwen V, Fairley S, Fernandez Banet compare the age of the selection events, we assumed that the se- J, Billis K, García Girón C, Hourlier T, et al. 2016. The Ensembl gene an- lected variant in candidate regions was the site with the highest notation system. Database (Oxford) doi: 10.1093/database/baw093. frequency. We note that this procedure will underestimate the Anders S, Huber W. 2010. DESeq: differential expression analysis for se- age of events if the true selected site reached fixation, as is often ex- quence count data. Genome Biol 11: R106. Cagan A, Theunert C, Laayouni H, Santpere G, Pybus M, Casals F, Prüfer K, pected for our method; the comparison is thus conservative. Navarro A, Marques- Bonet T, Bertranpetit J, et al. 2016. Natural selec- tion in the great apes. Mol Biol Evol 33: 3268–3283. Carroll SB. 2003. Genetics and the making of Homo sapiens. Nature 422: Annotations 849–857. We annotated candidate regions using protein coding genes from Charlesworth B, Morgan MT, Charlesworth D. 1993. The effect of deleteri- ous mutations on neutral molecular variation. Genetics 134: Ensembl (release 82), promoters and enhancers mapped by 1289–1303. GenoSTAN (Zacher et al. 2016), a measure of background selection Chen H, Patterson N, Reich D. 2010. Population differentiation as a test for (B-scores) (McVicker et al. 2009). Candidate regions were also over- selective sweeps. Genome Res 20: 393–402. lapped with regions previously suggested to have experienced re- Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, et al. 2011. The variant current selective sweeps in apes on the X Chromosome (Dutheil call format and VCFtools. Bioinformatics 27: 2156–2158. et al. 2015; Nam et al. 2015), regions of Neandertal ancestry Durbin R, Eddy S, Krogh A, Mitchison G. 1998. Biological sequence analysis: (Sankararaman et al. 2014; Vernot and Akey 2014), and long re- probabilistic models of proteins and nucleic acids. Cambridge University gions devoid of Neandertal and Denisova ancestry (Vernot et al. Press, Cambridge. Dutheil JY, Munch K, Nam K, Mailund T, Schierup MH. 2015. Strong selec- 2016). tive sweeps on the X chromosome in the human-chimpanzee ancestor To statistically test the overlap of our regions with these anno- explain its low divergence. PLoS Genet 11: e1005451. tations, we randomly placed regions of similar physical sizes in the Enard D, Messer PW, Petrov DA. 2014. Genome-wide signals of positive se- parts of the genome that passed our quality filters. Quality-filtered lection in human evolution. Genome Res 24: 885–895. Ewing G, Hermisson J. 2010. MSMS: a coalescent simulation program in- regions that were smaller than the longest gap present in our can- cluding recombination, demographic structure and selection at a single didate ELS regions were regarded as sufficiently short to not pro- locus. Bioinformatics 26: 2064–2065. hibit the placement of regions. Fay JC, Wu CI. 2000. Hitchhiking under positive Darwinian selection. Changes of recombination rates along the human lineage Genetics 155: 1405–1413. Fisher SE, DeFries JC. 2002. Developmental dyslexia: genetic dissection of a could limit our power to detect selected regions, and we used an complex cognitive trait. Nat Rev Neurosci 3: 767–780. ancestral recombination map of the human-chimpanzee ancestor Fu Q, Li H, Moorjani P, Jay F, Slepchenko SM, Bondarev AA, Johnson PL, to annotate top candidate regions (Supplemental Table S9; Munch Aximu-Petri A, Prüfer K, de Filippo C, et al. 2014. Genome sequence et al. 2014). of a 45,000-year-old modern human from western Siberia. Nature 514: 445–449. Finally, we further characterized fixed or nearly fixed human- Fu Q, Hajdinjak M, Moldovan OT, Constantin S, Mallick S, Skoglund P, specific changes within the candidate regions using annotations of Patterson N, Rohland N, Lazaridis I, Nickel B, et al. 2015. An early mod- histone marks (enhancers, promoters), eQTLs, transcription factor ern human from Romania with a recent Neanderthal ancestor. Nature binding sites, and conservation scores (Supplemental File S3). 524: 216–219. Gleeson JG, Allen KM, Fox JW, Lamperti ED, Berkovic S, Scheffer I, Cooper EC, Dobyns WB, Minnerath SR, Ross ME, et al. 1998. Doublecortin, a Software availability brain-specific gene mutated in human X-linked lissencephaly and dou- ble cortex syndrome, encodes a putative signaling protein. Cell 92: The software and input files used in this study have been made 63–72. available through the website http://bioinf.eva.mpg.de/ELS/ and Gleeson JG, Lin PT, Flanagan LA, Walsh CA. 1999. Doublecortin is a micro- tubule-associated protein and is expressed widely by migrating neurons. https://github.com/StephanePeyregne/ELS/. A version of the Neuron 23: 257–271. source code is also available as Supplemental Code in the online Green RE, Krause J, Briggs AW, Maricic T, Stenzel U, Kircher M, Patterson N, version of this article. Li H, Zhai W, Fritz MH, et al. 2010. A draft sequence of the Neandertal genome. Science 328: 710–722. Hernandez RD, Kelley JL, Elyashiv E, Melton SC, Auton A, McVean G, 1000 Genomes Project, Sella G, Przeworski M. 2011. Classic selective sweeps Acknowledgments were rare in recent human evolution. Science 331: 920–924. Hinch AG, Tandon A, Patterson N, Song Y, Rohland N, Palmer CD, Chen We thank Michael Lachmann for early discussions about the de- GK, Wang K, Buxbaum SG, Akylbekova EL, et al. 2011. The landscape sign of the study; Janet Kelso and Mark Stoneking for many useful of recombination in African Americans. Nature 476: 170–175. comments on the manuscript; Svante Pääbo, Udo Stenzel, Hudson RR, Kreitman M, Aguadé M. 1987. A test of neutral molecular evo- Fernando Racimo, Amy Ko, and Adam Siepel for helpful discus- lution based on nucleotide data. Genetics 116: 153–159. Kim Y, Stephan W. 2002. Detecting a local signature of genetic hitchhiking sions; Julien Peyrégne for help in implementing the software; along a recombining chromosome. Genetics 160: 765–777. and Matthias Ongyerth, Manjusha Chintalapati, Steffi Grote, King MC, Wilson AC. 1975. Evolution at two levels in humans and chim- and Christoph Theunert for help with earlier analysis. We also panzees. Science 188: 107–116. thank three anonymous reviewers for their comments and sugges- Kong A, Thorleifsson G, Gudbjartsson DF, Masson G, Sigurdsson A, Jonasdottir A, Walters GB, Jonasdottir A, Gylfason A, Kristinsson KT, tions that helped to improve this manuscript. This research was et al. 2010. Fine-scale recombination rate differences between sexes, funded by the Max Planck Society, the Paul G. Allen Family foun- populations and individuals. Nature 467: 1099–1103. dation, and the European Research Council (grant agreement no. Kryazhimskiy S, Plotkin JB. 2008. The population genetics of dN/dS. PLoS 694707). Genet 4: e1000304. Genome Research 1571 www.genome.org ́ Peyregne et al. Laland KN, Odling-Smee J, Myles S. 2010. How culture shaped the human Racimo F, Kuhlwilm M, Slatkin M. 2014. A test for ancient selective sweeps genome: bringing genetics and the human sciences together. Nat Rev and an application to candidate sites in modern humans. Mol Biol Evol Genet 11: 137–148. 31: 3344–3358. Lesecque Y, Glémin S, Lartillot N, Mouchiroud D, Duret L. 2014. The Red Rasmussen MD, Hubisz MJ, Gronau I, Siepel A. 2014. Genome-wide infer- Queen model of recombination hotspots evolution in the light of archa- ence of ancestral recombination graphs. PLoS Genet 10: e1004342. ic and modern human genomes. PLoS Genet 10: e1004790. Rosenberg NA. 2002. The probability of topological concordance of gene Li H, Durbin R. 2011. Inference of human population history from individ- trees and species trees. Theor Popul Biol 61: 225–247. Sabeti PC, Schaffner SF, Fry B, Lohmueller J, Varilly P, Shamovsky O, Palma ual whole-genome sequences. Nature 475: 493–496. A, Mikkelsen TS, Altshuler D, Lander ES. 2006. Positive natural selection Locke DP, Hillier LW, Warren WC, Worley KC, Nazareth LV, Muzny DM, in the human lineage. Science 312: 1614–1620. Yang SP, Wang ZY, Chinwalla AT, Minx P, et al. 2011. Comparative Sabeti PC, Varilly P, Fry B, Lohmueller J, Hostetter E, Cotsapas C, Xie X, and demographic analysis of orang-utan genomes. Nature 469: Byrne EH, McCarroll SA, Gaudet R, et al. 2007. Genome-wide detection 529–533. and characterization of positive selection in human populations. Nature Malécot G. 1948. Les mathématiques de l’hérédité. Masson & Cie, Paris, 449: 913–918. France. Sankararaman S, Patterson N, Li H, Pääbo S, Reich D. 2012. The date of in- McVicker G, Gordon D, Davis C, Green P. 2009. Widespread genomic signa- terbreeding between Neandertals and modern humans. PLoS Genet 8: tures of natural selection in hominid evolution. PLoS Genet 5: e1000471. e1002947. Messer PW. 2013. SLiM: simulating evolution with selection and linkage. Sankararaman S, Mallick S, Dannemann M, Prüfer K, Kelso J, Pääbo S, Genetics 194: 1037–1039. Patterson N, Reich D. 2014. The genomic landscape of Neanderthal an- Meyer M, Kircher M, Gansauge MT, Li H, Racimo F, Mallick S, Schraiber JG, cestry in present-day humans. Nature 507: 354–357. Jay F, Prüfer K, de Filippo C, et al. 2012. A high-coverage genome se- Sankararaman S, Mallick S, Patterson N, Reich D. 2016. The combined land- quence from an archaic Denisovan individual. Science 338: 222–226. scape of Denisovan and Neanderthal ancestry in present-day humans. Mikkelsen TS, Hillier LW, Eichler EE, Zody MC, Jaffe DB, Yang SP, Enard W, Curr Biol 26: 1241–1247. Hellmann I, Lindblad-Toh K, Altheide TK, et al. 2005. Initial sequence of Scally A, Dutheil JY, Hillier LW, Jordan GE, Goodhead I, Herrero J, Hobolth the chimpanzee genome and comparison with the human genome. A, Lappalainen T, Mailund T, Marques-Bonet T, et al. 2012. Insights into Nature 437: 69–87. hominid evolution from the gorilla genome sequence. Nature 483: Munch K, Mailund T, Dutheil JY, Schierup MH. 2014. A fine-scale recombi- 169–175. nation map of the human-chimpanzee ancestor reveals faster change in St Pourcain B, Cents RA, Whitehouse AJ, Haworth CM, Davis OS, O’Reilly humans than in chimpanzees and a strong impact of GC-biased gene PF, Roulstone S, Wren Y, Ang QW, Velders FP, et al. 2014. Common var- conversion. Genome Res 24: 467–474. iation near ROBO2 is associated with expressive vocabulary in infancy. Nam K, Munch K, Hobolth A, Dutheil JY, Veeramah KR, Woerner AE, Nat Commun 5: 4831. Hammer MF, Great Ape Genome Diversity Project, Mailund T, Staab PR, Zhu S, Metzler D, Lunter G. 2014. Scrm: efficiently simulating long Schierup MH. 2015. Extreme selective sweeps independently targeted sequences using the approximated coalescent with recombination. the X chromosomes of the great apes. Proc Natl Acad Sci 112: Bioinformatics 31: 1680–1682. 6413–6418. Suda S, Iwata K, Shimmura C, Kameno Y, Anitha A, Thanseem I, Nakamura Nielsen R, Hellmann I, Hubisz M, Bustamante C, Clark AG. 2007. Recent K, Matsuzaki H, Tsuchiya KJ, Sugihara G, et al. 2011. Decreased expres- and ongoing selection in the human genome. Nat Rev Genet 8: 857–868. sion of axon-guidance receptors in the anterior cingulate cortex in au- Oleksyk TK, Smith MW, O’Brien SJ. 2010. Genome-wide scans for footprints tism. Mol Autism 2: 14. of natural selection. Philos Trans R Soc Lond B Biol Sci 365: 185–205. Tajima F. 1989. Statistical method for testing the neutral mutation hypoth- Pääbo S. 2014. The human condition—a molecular approach. Cell 157: esis by DNA polymorphism. Genetics 123: 585–595. 216–226. Varki A, Geschwind DH, Eichler EE. 2008. Explaining human uniqueness: Powell MJD. 1994. A direct search optimization method that models the ob- genome interactions with environment, behaviour and culture. Nat jective and constraint functions by linear interpolation. Adv Optim Rev Genet 9: 749–763. Numer Anal 275: 51–67. Vernot B, Akey JM. 2014. Resurrecting surviving Neandertal lineages from Prüfer K, Muetzel B, Do HH, Weiss G, Khaitovich P, Rahm E, Pääbo S, modern human genomes. Science 343: 1017–1021. Lachmann M, Enard W. 2007. FUNC: a package for detecting significant Vernot B, Tucci S, Kelso J, Schraiber JG, Wolf AB, Gittelman RM, associations between gene sets and ontological annotations. BMC Dannemann M, Grote S, McCoy RC, Norton H, et al. 2016. Bioinformatics 8: 41. Excavating Neandertal and Denisovan DNA from the genomes of Prüfer K, Munch K, Hellmann I, Akagi K, Miller JR, Walenz B, Koren S, Melanesian individuals. Science 352: 235–239. Sutton G, Kodira C, Winer R, et al. 2012. The bonobo genome compared Voight BF, Kudaravalli S, Wen X, Pritchard JK. 2006. A map of recent posi- with the chimpanzee and human genomes. Nature 486: 527–531. tive selection in the human genome. PLoS Biol 4: 446–458. Prüfer K, Racimo F, Patterson N, Jay F, Sankararaman S, Sawyer S, Heinze A, Wang R, Chen CC, Hara E, Rivas MV, Roulhac PL, Howard JT, Chakraborty Renaud G, Sudmant PH, de Filippo C, et al. 2014. The complete genome M, Audet JN, Jarvis ED. 2015. Convergent differential regulation of SLIT- sequence of a Neanderthal from the Altai Mountains. Nature 505: ROBO axon guidance genes in the brains of vocal learners. J Comp Neurol 43–49. 523: 892–906. Przeworski M. 2002. The signature of positive selection at randomly chosen Weaver TD. 2009. The meaning of Neandertal skeletal morphology. Proc loci. Genetics 160: 1179–1189. Natl Acad Sci 106: 16028–16033. Przeworski M. 2003. Estimating the time since the fixation of a beneficial Wright S. 1951. The genetical structure of populations. Ann Eugen 15: allele. Genetics 164: 1667–1676. 322–354. Pybus OG, Shapiro B. 2009. Natural selection and adaptation of molecular Yang MA, Harris K, Slatkin M. 2014. The projection of a test genome onto a sequences. In The phylogenetic handbook (ed. Lemey P, et al.), pp. reference population and applications to humans and archaic homi- 415–417. Cambridge University Press, New York. nins. Genetics 198: 1655–1670. Pybus M, Dall’Olio GM, Luisi P, Uzkudun M, Carreño-Torres A, Pavlidis P, Zacher B, Michel M, Schwalb B, Cramer P, Tresch A, Gagneur J. 2016. Laayouni H, Bertranpetit J, Engelken J. 2014. 1000 genomes selection Accurate promoter and enhancer identification in 127 ENCODE and browser 1.0: a genome browser dedicated to signatures of natural selec- roadmap epigenomics cell types and tissues by GenoSTAN. PLoS One tion in modern humans. Nucleic Acids Res 42: D903–D909. 12: e0169249. Rabiner LR. 1989. A tutorial on hidden Markov models and selected appli- cations in speech recognition. Proc IEEE 77: 257–286. Racimo F. 2016. Testing for ancient selection using cross-population allele frequency differentiation. Genetics 202: 733–750. Received December 9, 2016; accepted in revised form July 5, 2017. 1572 Genome Research www.genome.org

Journal

Genome ResearchPubmed Central

Published: Sep 1, 2017

There are no references for this article.