Access the full text.
Sign up today, get DeepDyve free for 14 days.
L. Cavalli-Sforza, A. Edwards (1967)
PHYLOGENETIC ANALYSIS: MODELS AND ESTIMATION PROCEDURESEvolution, 21
P. Armitage (1955)
Tests for Linear Trends in Proportions and FrequenciesBiometrics, 11
Desper (2002)
687J. Comput. Biol., 19
Devlin (1999)
997Biometrics, 55
Saitou (1987)
406Mol. Biol. Evol., 4
B. Monien, Ewald Speckenmeyer (1985)
Ramsey numbers and an approximation algorithm for the vertex cover problemActa Informatica, 22
Gad Kimmel, Michael Jordan, E. Halperin, R. Shamir, R. Karp (2007)
A randomization test for controlling population stratification in whole-genome association studies.American journal of human genetics, 81 5
J. Pritchard, J. Pritchard, N. Rosenberg (1999)
Use of unlinked genetic markers to detect population stratification in association studies.American journal of human genetics, 65 1
B. Keating, Sam Tischfield, S. Murray, T. Bhangale, Tom Price, J. Glessner, L. Galver, J. Barrett, S. Grant, Deborah Farlow, Hareesh Chandrupatla, M. Hansen, S. Ajmal, G. Papanicolaou, Yiran Guo, Mingyao Li, S. Derohannessian, P. Bakker, Swneke Bailey, A. Montpetit, A. Edmondson, K. Taylor, X. Gai, Susanna Wang, M. Fornage, T. Shaikh, L. Groop, M. Boehnke, A. Hall, A. Hattersley, E. Frackelton, N. Patterson, Charleston Chiang, Cecelia Kim, R. Fabsitz, W. Ouwehand, A. Price, P. Munroe, M. Caulfield, T. Drake, E. Boerwinkle, D. Reich, A. Whitehead, T. Cappola, N. Samani, A. Lusis, E. Schadt, James Wilson, W. Koenig, M. McCarthy, S. Kathiresan, S. Gabriel, H. Hakonarson, Sonia Anand, M. Reilly, J. Engert, D. Nickerson, D. Rader, J. Hirschhorn, G. FitzGerald (2008)
Concept, Design and Implementation of a Cardiovascular Gene-Centric 50 K SNP Array for Large-Scale Genomic Association StudiesPLoS ONE, 3
N. Saitou, M. Nei (1987)
The neighbor-joining method: a new method for reconstructing phylogenetic trees.Molecular biology and evolution, 4 4
Campbell (2005)
868Nat. Genet., 37
Cavalli-Sforza (1967)
233Am. J. Hum. Genet., 19
J. Pritchard, Matthew Stephens, Noah Rosenberg, P. Donnelly (2000)
Association mapping in structured populations.American journal of human genetics, 67 1
Garey (1979)
190
Pritchard (2000)
170Am. J. Hum. Genet., 67
Epstein (2007)
921Am. J. Hum. Genet., 80
Catarina Campbell, Elizabeth Ogburn, K. Lunetta, H. Lyon, M. Freedman, L. Groop, D. Altshuler, K. Ardlie, J. Hirschhorn (2005)
Demonstrating stratification in a European American populationNature Genetics, 37
Rosenberg (2002)
2381Science, 298
D. Serre, A. Montpetit, G. Paré, J. Engert, S. Yusuf, B. Keavney, T. Hudson, Sonia Anand (2008)
Correction of Population Stratification in Large Multi-Ethnic Association StudiesPLoS ONE, 3
Price (2006)
904Nat. Genet., 38
H. Hattemer (1982)
Genetic distance between populationsTheoretical and Applied Genetics, 62
Marchini (2004)
512Nat. Genet., 36
Balding (1995)
3Genetica, 96
Serre (2008)
e1382PLoS ONE, 1
M. Garey, David Johnson (1978)
Computers and Intractability: A Guide to the Theory of NP-Completeness
Nei (1972)
283Am. Naturalist, 106
Felsenstein (1985)
783Evolution, 39
M. Jakobsson, Sonja Scholz, P. Scheet, J. Gibbs, Jenna Vanliere, H. Fung, Zachary Szpiech, J. Degnan, Kai Wang, R. Guerreiro, J. Bras, Jennifer Schymick, D. Hernandez, B. Traynor, J. Simón-Sánchez, M. Matarin, A. Britton, J. Leemput, Ian Rafferty, M. Bucan, H. Cann, J. Hardy, N. Rosenberg, A. Singleton (2008)
Genotype, haplotype and copy-number variation in worldwide human populationsNature, 451
Goldstein (1995)
463Genetics, 139
Luca (2008)
453Am. J. Hum. Genet., 82
R. Tibshirani, G. Walther, T. Hastie (2000)
Estimating the number of clusters in a data set via the gap statisticJournal of the Royal Statistical Society: Series B (Statistical Methodology), 63
Li (2008)
1100Science, 319
Pritchard (1999)
220Am. J. Hum. Genet., 65
David Goldstein, M. Feldman, L. Cavalli-Sforza (1994)
An evaluation of genetic distances for use with microsatellite loci.Genetics, 139 1
Li (2008)
215Genet. Epid., 32
G. Nemhauser, L. Trotter (1975)
Vertex packings: Structural properties and algorithmsMathematical Programming, 8
B. Devlin, K. Roeder (1999)
Genomic Control for Association StudiesBiometrics, 55
J. Felsenstein (1985)
CONFIDENCE LIMITS ON PHYLOGENIES: AN APPROACH USING THE BOOTSTRAPEvolution, 39
Qizhai Li, Kai Yu (2008)
Improved correction for population stratification in genome‐wide association studies by identifying hidden population structuresGenetic Epidemiology, 32
M. Epstein, A. Allen, G. Satten (2007)
A simple and improved correction for population stratification in case-control studies.American journal of human genetics, 80 5
J. Studier, Karl Keppler (1988)
A note on the neighbor-joining algorithm of Saitou and Nei.Molecular biology and evolution, 5 6
J. Marchini, L. Cardon, M. Phillips, P. Donnelly (2004)
The effects of human population structure on large genetic association studiesNature Genetics, 36
Armitage (1955)
375Biometrics, 11
A. Price, N. Patterson, R. Plenge, M. Weinblatt, N. Shadick, D. Reich (2006)
Principal components analysis corrects for stratification in genome-wide association studiesNature Genetics, 38
N. Rosenberg, J. Pritchard, J. Weber, H. Cann, K. Kidd, L. Zhivotovsky, M. Feldman (2002)
Genetic Structure of Human PopulationsScience, 298
Jun Li, D. Absher, Hua Tang, Audrey Southwick, A. Casto, Sohini Ramachandran, H. Cann, G. Barsh, M. Feldman, L. Cavalli-Sforza, R. Myers (2008)
Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation
Cavalli-Sforza (1967)
Phylogenetic analysis: models and estimation proceduresAm. J. Hum. Genet., 19
Jakobsson (2008)
998Nature, 451
Tibshirani (2001)
411J. R. Stat. Soc. B, 63
D. Balding, R. Nichols (2005)
A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternityGenetica, 96
Studier (1988)
729Mol. Biol. Evol., 5
Diana Luca, S. Ringquist, L. Klei, Ann Lee, C. Gieger, H. Wichmann, S. Schreiber, M. Krawczak, Ying Lu, A. Styche, B. Devlin, K. Roeder, M. Trucco (2008)
On the use of general control samples for genome-wide association studies: genetic matching highlights causal variants.American journal of human genetics, 82 2
Kimmel (2007)
895Am. J. Hum. Genet., 81
R. Desper, O. Gascuel (2002)
Fast and Accurate Phylogeny Reconstruction Algorithms Based on the Minimum-Evolution PrincipleJournal of computational biology : a journal of computational molecular cell biology, 9 5
Keating (2008)
e3583PLoS ONE, 3
M. Garey (1979)
Johnson: computers and intractability: a guide to the theory of np- completeness (freeman
Vol. 26 no. 6 2010, pages 798–806 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btq025 Genetics and population analysis Advance access publication January 22, 2010 Correcting population stratification in genetic association studies using a phylogenetic approach 1,∗ 2 2 3,4,5,∗ Mingyao Li , Muredach P. Reilly , Daniel J. Rader and Li-San Wang 1 2 3 Department of Biostatistics and Epidemiology, Cardiovascular Institute, Department of Pathology and Laboratory 4 5 Medicine, Penn Center for Bioinformatics and Institute on Aging, University of Pennsylvania School of Medicine, Philadelphia, PA 19104, USA Associate Editor: Jeffrey Barrett ABSTRACT the human genome have made genetic association studies the mainstream for gene mapping of complex human diseases. For Motivation: The rapid development of genotyping technology and many diseases, the most practical approach is the population-based extensive cataloguing of single nucleotide polymorphisms (SNPs) design with unrelated individuals. Although having the advantages across the human genome have made genetic association studies of easier sample collection and greater power than family-based the mainstream for gene mapping of complex human diseases. For designs, population-based design is prone to population stratification many diseases, the most practical approach is the population-based (Marchini et al., 2004). Population stratification refers to the design with unrelated individuals. Although having the advantages presence of a systematic difference in allele frequencies between of easier sample collection and greater power than family-based subpopulations in a study due to ancestry difference between study designs, unrecognized population stratification in the study samples subjects. Unrecognized population stratification can lead to both can lead to both false-positive and false-negative findings and might false-positive and false-negative findings and can obscure the true obscure the true association signals if not appropriately corrected. association signals if not appropriately corrected. Methods: We report PHYLOSTRAT, a new method that corrects for There are three types of population structures that might population stratification by combining phylogeny constructed from be observed in genetic association studies: discrete population SNP genotypes and principal coordinates from multi-dimensional structure consists of populations that are remotely related (such scaling (MDS) analysis. This hybrid approach efficiently captures both as Europeans, Africans and Asians) and the population structure discrete and admixed population structures. is easy to discern as the individuals are clearly separated. Admixed Results: By extensive simulations, the analysis of a synthetic population structure consists of subjects of admixed ancestry (such genome-wide association dataset created using data from the as African Americans and Hispanic Americans) with different Human Genome Diversity Project, and the analysis of a lactase- individuals having different degrees of admixture, and that cannot height dataset, we show that our method can correct for population be separated into discrete clusters. Intercontinental gradients can stratification more efficiently than several existing population also be considered as admixed, although the degree of admixture stratification correction methods, including EIGENSTRAT, a hybrid is smaller than African Americans and Hispanic Americans. In approach based on MDS and clustering, and STRATSCORE , in terms other scenarios, we may see hierarchical population structure of requiring fewer random SNPs for inference of population structure. that consists of both discrete and admixed population structures. By combining the flexibility and hierarchical nature of phylogenetic Hierarchical population structures may be seen in studies that trees with the advantage of representing admixture using MDS, our involve multi-ethnic cohorts, which are becoming increasingly hybrid approach can capture the complex population structures in common in genetics consortiums (Serre et al., 2008). human populations effectively. Recognizing the issue of population stratification induced by Software Availability: Codes can be downloaded from population structures, various methods have been developed http://people.pcbi.upenn.edu/∼lswang/phylostrat/ to control for population stratification. Two early approaches Contact: mingyao@upenn.edu; iswang@upenn.edu. are genomic control (Devlin and Roeder, 1999) and structured Supplementary information: Supplementary data are available at association (Pritchard et al., 2000). The genomic control method Bioinformatics online. corrects for stratification by adjusting association statistics with an Received on September 10, 2009; revised on January 11, 2010; overall inflation factor obtained from a set of random markers that accepted on January 18, 2010 are not associated with the phenotypes of interest. However, some markers differ in their allele frequencies across ancestral populations more than others. Thus, the uniform adjustment may be insufficient 1 INTRODUCTION at markers having strong differentiation across ancestral populations The rapid development of genotyping technology and extensive and may be superfluous at markers lacking such differentiation. cataloguing of single nucleotide polymorphisms (SNPs) across Structured association uses STRUCTURE program (Pritchard and Rosenberg, 1999) to assign the study subjects to discrete subpopulations and then aggregates evidence of association within To whom correspondence should be addressed. each subpopulation. This method is computationally intensive, and 798 © The Author 2010. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org [15:11 19/2/2010 Bioinformatics-btq025.tex] Page: 798 798–806 Population stratification in genetic association studies assignments of subjects to clusters are sensitive to the number of efficient and versatile programs are widely available. Third, the clusters, which is often ill defined in many real studies. hierarchical characteristic of a phylogeny is easy to interpret and The current state-of-the-art approach for the correction of visualize. Finally, phylogenetic trees are more flexible to represent population stratification is EIGENSTRAT (Price et al., 2006), which highly complex spatial structures than clusters obtained from computes principal components for SNPs across the genome to clustering-based algorithms (Li and Yu, 2008). identify population structure. In this approach, a small number To correct for discrete, admixed and hierarchical population of ‘top’ principal components will capture the main axes of structures, we propose to combine information from phylogeny genetic variation in the study subjects. Correction for population and MDS together with the phylogenetic tree representing discrete stratification is carried out by including these top principal population structure and the principal coordinates of MDS analysis components as covariates in a regression framework. Although representing admixed population structure. Given the flexible nature popular, several studies have demonstrated that EIGENSTRAT and the hierarchical characteristic of phylogenetic trees, we expect will fail to correct for population stratification in certain scenarios this hybrid approach to perform well under complex population (Epstein et al., 2007; Kimmel et al., 2007; Luca et al., 2008). structures, such as those from multi-ethnic studies with some of Recently, Li and Yu (2008) proposed an extension of the study samples showing admixed population structure, whereas EIGENSTRAT by incorporating cluster information obtained from other study samples showing discrete population structure (Serre multi-dimensional scaling (MDS) analysis as additional covariates et al., 2008). in the adjustment. This MDS clustering approach tries to identify both discrete and admixed patterns of genetic variation and correct 2 METHODS for their potential confounding effects by adjusting each position of subject along identified axes of genetic variation and the cluster 2.1 Construction of a phylogenetic tree membership simultaneously. This method is a direct extension The first step in our method is to construct a phylogenetic tree of subjects from of EIGENSTRAT when the metric used by EIGENSTRAT is the genetic marker data. We opt to use the distance-based approach, which adopted for measuring the genetic correlation between two subjects. accepts as input a distance matrix, i.e. the pairwise dissimilarities between Simulation results demonstrate that the MDS clustering method every pair of individuals based on SNP genotypes and constructs trees provides a more appropriate correction for population stratification entirely from the distance matrix (Saitou and Nei, 1987; Studier and Keppler, 1988). In our analysis, we code the SNP genotypes as 0, 1 and 2, representing than EIGENSTRAT under simulation settings that they considered the number of minor alleles, and calculate the distance as ‘2—the number of (Li and Yu, 2008). alleles that are identical by state’ between two individuals. We then built The goal of this article is to utilize phylogenetic trees to correct the phylogenetic tree using the FastME algorithm (Desper and Gascuel, for population stratification in genetic association studies with 2002), a very fast distance-based phylogeny reconstruction algorithm that unrelated individuals. Widely used in the study of evolution and shows better topological accuracy than the commonly used neighbor joining other fields of biology, a phylogeny represents the evolutionary algorithm (Saitou and Nei, 1987) in simulation studies. Compared with other relationship between species as a tree structure, where each leaf more computationally intensive methods such as maximum likelihood or is an observed species, each internal node corresponds to the maximum parsimony, the distance-based approach is much more tractable, most recent common ancestors of all species below it and each especially when the number of individuals or the number of markers becomes internal edge represents a bipartition of leaves due to evolutionary large. Note that the leaves of the constructed phylogenetic tree are individual subjects in the study sample instead of subpopulations. divergence. The use of phylogenetics in the study of human genetic diversity has a long history dating back to the sixties: the approach was first proposed by Cavalli-Sforza and Edwards (1967) who 2.2 Reduced representation for the phylogenetic tree developed likelihood estimation methods using allelic frequencies We reduce the phylogenetic structure into a collection of bipartitions on in subpopulations. Distance measures that adjust allelic frequencies subjects in order for us to incorporate information from the phylogeny in the are also available (Goldstein et al., 1995; Nei, 1972). The association tests as covariates in a regression framework. Each bipartition corresponds to an internal edge in a phylogenetic tree, which divides the phylogenetic approach has been used extensively in the analysis of data into two groups: a phylogenetic tree with n leaves (subjects) will have the Human Genome Diversity Project (HGDP) (http://www.stanford up to n−3 internal edges; in turn, the entire phylogenetic tree can be fully .edu/group/morrinst/hgdp.html), which genotyped more than 1000 recovered by these (n−3) bipartitions. For each bipartition, we can construct individuals collected at more than 50 geographic locations around a 0-1 vector indicating the bipartition membership by assigning all members the world. The HGDP dataset has well-defined subpopulations from one of the two subsets in a bipartition to have value 0, and all members and complete population identification. The resulting phylogenies, from the other set to have value 1. We cannot use all of the bipartitions in the using microsatellites and SNPs, have been highly consistent and regression analysis, because (i) the degrees of freedom in the regression is compatible with the widely accepted pattern of human migration close to the number of observations (subjects) and significance will never be (Jakobsson et al., 2008; Li et al., 2008; Rosenberg et al., 2002). reached; (ii) most bipartitions, in particular, bipartitions that separate a very We propose a phylogenetics-based approach for correction small number of subjects from the rest, are not informative for the purpose of population stratification correction; and (iii) the bipartitions are not entirely of population stratification based on several motivations. First, independent; for example, bipartitions for adjacent edges may only differ by widely used and thoroughly tested for decades to study evolution, a small number of subjects. We select a subset of representative bipartitions phylogenetics is a natural choice for detecting divergence in human based on the following criteria: subpopulations. Phylogenetic analysis of the HGDP dataset shows that this approach is robust and sensitive to subtle population (1) The relative size of either side of a selected bipartition is above some given threshold. We used 2.5% in our analysis. structures (Jakobsson et al., 2008; Li et al., 2008). Second, thanks to decades of work by an active research community, many (2) Filter bipartitions by the correlation threshold so that each remaining phylogenetic reconstruction algorithms have been developed, and bipartition is correlated with at least one selected bipartition. We used [15:11 19/2/2010 Bioinformatics-btq025.tex] Page: 799 798–806 M.Li et al. the i-th element of c . With this geometric interpretation of the n observations, the Euclidean distance between the h-th and i-th observations is 2 2 = (c −c ) . hj ij hi j=1 MDS analysis attempts to find the optional q dimensional (q < n) approximation to the n dimensional representation so that the distance is preserved. 2.4 Association test with adjustment of phylogenetic bipartitions and principal coordinates from MDS Fig. 1. Illustration of the bipartition selection procedure. (A) A phylogeny analysis with 15 leaves. (B) The matrix of six bipartitions with size threshold of at least three leaves on either side of the bipartition. (C) The pairwise The phylogenetic bipartitions can effectively capture the population structure absolute correlation matrix. (D) The corresponding instance to the minimum information in a dataset when the structure is discrete or hierarchical. On dominating set problem using correlation threshold 0.7. Black vertices the other hand, some of the populations, such as admixed populations, (1, 2, 3, 6) form the solution returned by the greedy algorithm; bipartitions may contain continuous patterns of genetic variation. To incorporate both corresponding to the four vertices are also emboldened in the phylogeny (A) types of population structures, similar to MDS clustering, a hybrid approach and labeled in boldface in the matrix of bipartitions (B). Note that because proposed by Li and Yu (2008), we adjust for the population structure by the each vertex in (D) is adjacent to at least one black vertex, this ensures each phylogenetic bipartitions and the principal coordinates from MDS analysis. of the six bipartitions is correlated with at least one of the four chosen This is done by representing the population structure as a collection of bipartitions, so the reduced set of bipartitions still represents the topology of selected bipartitions and principal coordinates from the MDS analysis and the original phylogeny well. introducing them together with the genotypes of the SNP as independent variables, and the binary phenotype as the dependent variable in a regression framework. To test for genetic association for case–control studies, we 0.7 as the correlation threshold for discrete population structures, and conduct logistic regression with the following model: 0.1 for admixed population structures in the simulation studies. For the analysis of the HGDP data and the lactase-height data, we also used T T 0.7 as the threshold. We discuss the selection of correlation threshold logit[P(Y =1|g,X)]= α + βg + γ + η , in Section 4. where Y represents disease status (1: affected; 0: unaffected), represents The bipartition selection problem can be formulated as the NP-hard minimum the selected phylogenetic bipartitions, represents the selected principal dominating set problem in graph theory as follows (Garey and Johnson, coordinates from the MDS analysis, X represents a set of random markers 1979): let G =(V, E) be a graph where each vertex in the vertex set V that are used for inference of population structure and g is the genotype score corresponds to a bipartition passing the size threshold. We add an edge for the testing SNP, with (0, 1, 2) for a multiplicative model, (0, 1, 1) for a between any two vertices (u, v) to the edge set E if the absolute value dominant model and (0, 0, 1) for a recessive model. To test for association of the correlation between bipartitions associated with u and v is above between the SNP and the disease status, either a likelihood ratio test, a score the threshold. The minimum dominating set problem finds a smallest set of test, or a Wald test can be carried out. For continuous phenotypes, we can vertices V ⊆ V such that for every vertex u ∈ V −V, there exists a vertex v conduct linear regression with similar adjustments. We note that principal in V such that (u,v) ∈ E. Our problem can then be solved by finding such a components analysis (PCA) can be used instead of MDS and the computation minimum dominating set for G; then every bipartition is correlated with at for PCA is faster than similarity matrix. We choose to use MDS because the least one bipartition associated with a vertex in the dominating set. To select similarity matrix is already calculated when constructing the phylogenetic bipartitions, we implemented a simple approximation algorithm [where the tree; moreover, the results of MDS and PCA are generally similar. size of the solution is at most 1 + ln the size of the optimal solution (Garey and Note that the selected phylogenetic bipartitions and principal coordinates Johnson, 1979)] by iteratively (i) selecting a vertex with largest degree and might be correlated because both of them represent population structure in add to the cover, then (ii) removing all adjacent vertices and their incident the data, although with emphasis on different types of variation. To avoid edges from the graph. The selection step is repeated until all vertices are the issue of multi-collinearity, when the correlation between a phylogenetic removed. Figure 1 illustrates the bipartition selection procedure. bipartition and a principal coordinate is greater than 0.7, we will keep only the principal coordinate in the regression model. It is worth noting that multi- 2.3 MDS analysis collinearity is also a concern for the MDS clustering approach. Similar to MDS is a statistical technique that aims at displaying the similarity of what we did for PHYLOSTRAT, when testing the MDS clustering method members of a set of objects. This technique starts with a matrix of similarities by Li and Yu (2008), we also used 0.7 as the correlation cut-off point to or dissimilarities between a set of observations, and embeds the observations remove a correlated cluster dummy variable. as points in a low-dimensional Euclidean space so that Euclidean distances between points in the plot are close to the original dissimilarities. Suppose 2.5 Simulation set up that T isa(n × n) positive-semidefinite symmetric matrix of similarities We conducted extensive simulations under various settings to compare among a set of n observations. From the spectral decomposition of T,we the performance of PHYLOSTRAT with four other methods, including have the standard Cochran–Armitage trend test without corrections (Armitage, T = τ b b + τ b b +···+ τ b b , 1 1 2 2 n n 1 2 n 1955), the EIGENSTRAT approach (Price et al., 2006), the MDS clustering where τ ≥ τ ≥ ... ≥ τ are the eigen values of T and b , b , ..., b are the 1 2 n 1 2 n approach (Li and Yu, 2008) and the STRATSCORE approach (Epstein et al., corresponding eigen vectors. Alternatively, this may be written as 2007). We designed three sets of simulations, one for discrete population structures, one for admixed populations and the other based on genotype T =c c +c c +···+c c , 1 2 n 1 2 n data from the HGDP samples (Li et al., 2008). In all scenarios, we coded the 1/2 where c =τ b , j =1,2,…, n. Now consider the n observations as points in n j j genotype score for the testing SNP assuming a multiplicative model. In the dimensional space with the j-th coordinate for the i-th observation equal to c , ij four methods we tested, either top 10 principal components (EIGENSTRAT [15:11 19/2/2010 Bioinformatics-btq025.tex] Page: 800 798–806 Population stratification in genetic association studies Table 1. Population stratification configurations for discrete population Table 2. Population stratification configuration in the synthetic HGDP case– structure control dataset Configuration Group Population 1 Population 2 Population 3 Population 4 Population Continent Case (n = 452) Control (n = 543) C1 Case 200 300 1 Africa Central African Control 300 200 Republic (n = 29), Namibia (n =5) C2 Case 250 250 Congo (n = 15) Nigeria (n = 23) Control 0 500 Kenya (n = 12) Senegal (n = 23) C3 Case 225 175 100 control 175 100 225 2 Middle Algeria-Mzab (n = 29), France (n = 48) C4 Case 165 335 0 East/Europe Israel-Carmel (n = 45), Italy (n = 35) Control 0 165 335 Israel-Central (n = 49) Italy-Bergamo (n = 12) C5 Case 175 150 100 50 Israel-Negev (n = 47) Orkney Islands (n = 16) Control 75 100 150 175 Russia (n = 25) C6 Case 125 125 250 0 Russia-Caucasus (n = 17) Control 0 250 125 125 3 Central South China (n = 10) Cambodia (n =11) Asia/Oceania/ Pakistan (n = 182) China (n = 169) Configurations C1 and C2 contain two subpopulations; configurations C3 and East Asia/ Bougainville (n = 18) Japan (n = 27) C4 contain three subpopulations; and configurations C5 and C6 contain four America New Guinea (n = 16) Siberia (n = 25) subpopulations. Brazil (n =45) Colombia (n = 13) and STRATSCORE) or top 10 principal coordinates (MDS clustering and Mexico (n = 49) PHYLOSTRAT) were included in the analysis. Setting 1: Discrete population structures were simulated in a similar method as Price et al. (2006) and Li and Yu (2008). We considered six configurations (Table 1), representing two, three and four subpopulations. population structure. Type I error and power were estimated based on 100 In each setting, we generated simulated datasets each consisting of 500 (datasets) ×1000 (testing SNPs) = 100 000 tests. cases and 500 controls, with varying numbers of random SNPs carrying Setting 2: Cases and controls from an admixed population were simulated information capable of differentiating among populations. To generate similar to that as described by Price et al. (2006). Disease status for genotypes for the random SNPs, we followed the algorithm of Price individuals with ancestry proportions a from population 1 and (1 − a) from et al. (2006). Specifically, for each subpopulation, the allele frequency for population 2 were simulated using disease risk proportional to r , where r is each SNP was generated using the Balding–Nichols model (Balding and the ancestry risk and a is uniformly distributed on (0, 1). To insure an average Nichols, 1995) using F = 0.01. For each SNP, an ancestral population allele value of 0.5 across possible values of a, the probability of being affected was ST frequency p was drawn from the uniform distribution on (0.1, 0.9). The set to 0.5 log(r)r /(r − 1). The risk model with a genotype relative risk of 1.5 allele frequencies for each subpopulation were drawn from β-distribution for the disease allele was implemented the same way as discrete populations, with parameters p(1 − F )/F and (1 − p)(1 − F )/F . This distribution with allele frequency ap + (1 − a)p . Similar to the simulations for discrete ST ST ST ST 1 2 has mean p and variance F p(1 − p). populations, we also considered two categories of SNPs to evaluate the type I ST To evaluate the performance of different population stratification errors and two categories of SNPs to evaluate the power. correction methods, we considered four categories of testing SNPs: (i) the Setting 3: A case–control dataset of hierarchical population structure first category (random SNPs without association with disease) was generated was simulated based on individuals genotyped in HGDP (Li et al., 2008), the same way as those SNPs chosen for detecting population structure, an international project for studying the diversity and unity of the entire i.e. F = 0.01. (ii) For the second category (differentiated SNPs without human population. A total of 1064 individuals in this project, representing ST association with disease), we assumed a large allele frequency difference individuals from 51 populations from sub-Saharan Africa, North Africa, between the subpopulations. More specifically, for C1 and C2, we chose Europe, the Middle East, South/Central Asia, East Asia, Oceania and the allele frequencies of 0.2 for population 1 and 0.8 for population 2; for C3 Americas, were genotyped by the Illumina HumanHap 650K SNP array, and C4, the allele frequencies are 0.8, 0.8 and 0.2, respectively, for the three which includes 650 000 SNPs. We downloaded the genotype data and sample subpopulations; and for C5 and C6, the allele frequencies are 0.2, 0.8, 0.2 description information from http://hagsc.org/hgdp/files.html. and 0.8, respectively, for the four subpopulations. (iii) The third category After merging genotype data and sample description and data cleaning, of SNPs (random causal SNPs with association with disease) was for power 995 individuals remained for analysis. To obtain case–control data evaluation in which the allele frequency was generated the same way as those with population structure, we created three artificial subpopulations: random SNPs, i.e. F = 0.01; we then assumed a multiplicative model with (i) subpopulation 1 consists of 107 individuals fromAfrica, (ii) subpopulation ST a genotype relative risk of 1.5 for the causal allele to generate genotypes for 2 consists of 170 individuals from Middle East and 153 individuals from the causal SNPs conditioned on the disease status. (iv) The fourth category Europe; and (iii) subpopulation 3 consists of 107 individuals from America, of SNPs (differentiated causal SNPs with association with disease) was 192 individuals from Central South Asia, 232 individuals from East Asia generated in a similar fashion as category (iii) except that the marker allele and 34 individuals from Oceania. For each of the three subpopulations, frequency was generated the same way as those from category (ii). Testing we selected cases and controls based on the numbers specified in Table 2. SNPs in categories (i) and (ii) allow us to evaluate type I error rates of To infer population structure, we randomly selected 10 000 autosomal SNPs different methods, whereas testing SNPs in categories (iii) and (iv) allow us that have no missing genotypes and are in linkage equilibrium with each to evaluate the power when modest or extreme population stratifications are other (at r < 0.05). We then tested for association with the case–control present. status using 515 710 autosomal SNPs that satisfy the following quality To evaluate the type I error and power under each population structure, control criteria: (i) minor allele frequency >1% in both cases and controls, −7 we generated 100 datasets of 500 cases and 500 controls with each (ii) Hardy–Weinberg equilibrium test P-value >1 × 10 and (iii) fraction dataset consisting of 1000 testing SNPs for each of the above-mentioned of missingness <5%. For each method, we estimated the type I error rate at four SNP categories. Moreover, for each of the 100 datasets, we also various levels (α = 0.01, 0.005, 0.001 and 0.0005 level) and calculated the simulated m = 100, 300 or 500 random SNPs with F = 0.01 to infer genomic control inflation factor (Devlin and Roeder, 1999). ST [15:11 19/2/2010 Bioinformatics-btq025.tex] Page: 801 798–806 M.Li et al. Table 3. Empirical type I error rates (%) and power (%) under two discrete 2.6 Application to lactase-height data populations at 1% significance level A classic way to test the performance of a population stratification correction method is to examine whether the method is able to remove the effect Configuration M Trend ES PS MC SS of population stratification between the lactase (LCT) gene and height (Campbell et al., 2005). As we do not have access to the original data Non-causal SNPs: C1 100 45.61 1.04 1.08 1.01 0.85 reported by Campbell et al. (2005), we created a lactase-height dataset using random (category 1) 300 45.41 0.98 1.06 1.13 0.93 data from PennCAC, an ongoing candidate-gene study on coronary artery 500 45.53 1.03 1.04 0.95 0.8 classification that we are working on. PennCAC includes 1361 Caucasians C2 100 76.82 1.72 1.09 1.01 2.48 with phenotypes on height. These individuals were genotyped using the 300 77.01 1.11 1.11 1.06 1.23 ITMAT/Broad/CARe (IBC) 50K SNP array (Keating et al., 2008), which 500 77.17 1 1.08 1.08 1.29 includes 1755 autosomal ancestry informative markers (AIMs) and 21 LCT SNPs. The originally reported LCT SNP, rs4988235, is not included in the Non-causal SNPs: C1 100 99.84 1.18 1.1 1.05 0.59 highly differentiated 300 99.87 1.02 1.08 1.03 0.5 IBC array, but two other LCT SNPs, rs3769005 and rs7579771, have strong (category 2) 500 99.87 1 1 0.95 0.49 LD with rs4988235 (r = 0.72 in HapMap CEU samples). We, therefore, C2 100 100 3.43 1.08 1.06 4.85 tested association between height and these two SNPs. For all population 300 100 1.29 1.09 1.02 1.43 stratification correction methods that we considered for comparison, we 500 100 1.08 1.05 0.96 1.23 inferred population structure using 100, 200, 250, 300, 500, 700, 1000 and 1755 autosomal AIMs. Causal SNPs: random C1 100 68.16 87.11 87.48 87.61 84.83 (category 3) 300 67.8 87.59 87.72 87.77 85.46 500 67.9 87.59 87.44 87.72 85.21 3 RESULTS C2 100 79.62 69.09 70.19 70.17 69.62 300 79.67 69.44 70.32 70.15 70.05 To evaluate our proposed method, we carried out extensive 500 79.6 69.25 70.09 70.47 70.01 simulations, including discrete population structure, admixed population structure and a synthetic case–control GWAS dataset Causal SNPs: highly C1 100 23.56 75.27 82.3 82.56 67.44 differentiated 300 23.6 80.81 82.18 82.53 72.98 generated from the HGDP data, which represents hierarchical (category 4) 500 23.91 81 81.99 82.05 73.72 population structure. We assessed whether the proposed method C2 100 100 22.15 55.17 55.15 21.29 can appropriately correct for population stratification by estimating 300 100 39.9 55.08 55.06 40.08 type I error rate, and also assessed its power in detecting 500 100 44.49 55.14 55.14 45.06 disease association. We compared PHYLOSTRAT with four other m is the number of random SNPs used for inference of population structure. ES: methods, including (i) the conventional Cochran–Armitage trend EIGENSTRAT; PS: PYLOSTRAT; MC: MDS clustering; SS: STRATSCORE. test (Armitage, 1955), which does not control for population stratification, (ii) the EIGENSTRAT approach (Price et al., 2006), (iii) the MDS clustering approach (Li and Yu, 2008); and (iv) the still slightly inflated. Similar results are observed for EIGENSTRAT, STRATSCORE approach (Epstein et al., 2007). For STRATSCORE, which also yields inflated type I errors. In contrast, for this extreme as suggested by the authors (Dr Michael Epstein, personal situation, the type I error rate of PHYLOSTRAT is close to the communication), we adjusted the continuous stratification scores nominal level even with only 100 random SNPs, much less than the (obtained from principal components) rather than the quartiles of the number of random SNPs required by other methods to achieve the stratification scores because this modified version of STRATSCORE same level of type I error rate. This implies that PHYLOSTRAT generally leads to smaller type I errors than the original method uses the ancestry information contained in the random SNPs more under the simulation settings we considered. We did not compare efficiently. We observed that STRATSCORE can yield either inflated with the genomic control approach because Price et al. (2006) and or conservative type I errors. As suggested by the authors (Dr Glen Li and Yu (2008) have demonstrated its unsatisfactory performance. Satten, personal communication), we also implemented a modified For type I error and power estimation, significance was evaluated at version of STRATSCORE with 20 strata. This modified version the 1% level. yields appropriate type I errors when there are two underlying subpopulations; however, when the number of subpopulations is 3.1 Setting 1: discrete population structure greater than 2, e.g. configurations C4 and C6, the type I errors are Tables 3–4 display the results for discrete population structures still inflated even with 500 random SNPs. It is possible that more with two and four subpopulations, respectively. Results for three strata are needed to appropriately control the type I errors when subpopulations are similar (Supplementary Table 1). In all situations the number of subpopulations is more than two and the degree of we considered, PHYLOSTRAT has type I error rates that are close population stratification is extreme. to the nominal level. The MDS clustering approach performs well For causal SNPs in category (iii), PHYLOSTRAT, MDS when there are two discrete subpopulations; but when the number clustering and EIGENSTRAT yield similar power, but the power of discrete subpopulations is three or four and the population for STRATSCORE is generally lower than the other three methods. stratification is extreme, its type I error rates can be greater than the For causal SNPs in category (iv), PHYLOSTRAT is more powerful nominal level, especially when the number of random SNPs is small. than the other three methods under several settings. For example, for For example, when there are four discrete subpopulations and when configuration C6 in Table 4, with 300 random SNPs, the power for the degree of population stratification is extreme (i.e. configuration PHYLOSTRAT is 51.27%, whereas the powers for EIGENSTRAT, C6 in Table 4), the type I error rate of the MDS clustering approach MDS clustering and STRATSCORE are 39.06, 39.49 and 20.77%, can be as high as 10.19% with 100 random SNPs, 2.06% with 300 respectively. This is because EIGENSTRAT and MDS clustering random SNPs, even with 500 random SNPs, the type I error rate is cannot completely remove the confounding effect due to population [15:11 19/2/2010 Bioinformatics-btq025.tex] Page: 802 798–806 Population stratification in genetic association studies Table 4. Empirical type I error rates (%) and power (%) under four discrete Table 5. Empirical type I error rates (%) and power (%) under admixed populations at 1% significance level populations at 1% significance level Configuration m Trend ES PS MC SS rm Trend ES PS MC SS Non-causal SNPs: C5 100 46.22 1.3 1.09 1.22 0.53 Non-causal SNPs: 2 100 21.62 1.11 1.14 1.18 1.04 random (category 1) 300 46.6 1.04 1.06 1.04 0.41 random (category 1) 300 21.45 1.03 1.03 1.07 1.01 500 46.17 1.02 1.05 1.02 0.22 500 21.68 1.03 1.04 1.05 0.92 C6 100 64.21 3.31 1.19 3.52 2.19 3 100 41.14 1.44 1.44 1.43 1.31 300 64.09 1.18 1.06 1.42 0.87 300 41.23 1.15 1.14 1.15 0.99 500 64.11 1.11 1.05 1.15 0.75 500 41.06 0.97 0.99 0.98 0.91 Non-causal SNPs: C5 100 34.56 1.14 1.07 1.16 0.97 Non-causal SNPs: 2 100 62.79 1.43 1.42 1.46 1.2 highly differentiated 300 34.41 0.96 1.04 1.03 0.93 highly differentiated 300 62.71 1.09 1.11 1.12 0.83 (category 2) 500 34.39 1.02 0.98 1.04 0.92 (category 2) 500 62.92 1.06 1.05 1.02 0.83 C6 100 100 13.45 1.21 10.19 6.67 3 100 97.99 2.01 2.04 2.07 1.89 300 100 2.2 1.09 2.06 0.68 300 98.91 1.21 1.19 1.18 1.05 500 100 1.55 1.11 1.45 0.39 500 98.16 0.99 1.01 1 0.94 Causal SNPs: random C5 100 64.42 86.82 86.42 85.67 74.42 Causal SNPs: random 2 100 74.02 90.72 90.62 90.75 90.36 (category 3) 300 64.24 86.06 86.33 86.28 74.73 (category 3) 300 74.46 91.25 91.19 91.31 90.77 500 64.02 86.34 86.46 86.49 74.91 500 74.37 91.22 91.14 91.24 90.77 C6 100 70.54 70.14 71.86 71.11 62.16 3 100 68.21 88.62 88.45 88.61 88.27 300 70.33 71.58 71.98 72.06 63.86 300 68.41 89.49 89.49 89.57 89.15 500 70.18 71.4 71.9 72.26 64.2 500 67.92 89.09 89.01 89.18 89.03 Causal SNPs: highly C5 100 0.04 73.09 77.64 77.77 35.05 Causal SNPs: highly 2 100 4.97 87.96 87.54 88.1 86.8 differentiated 300 0.05 76.32 77.91 79.37 37.4 differentiated 300 4.45 92.36 92.23 92.43 91.19 (category 4) 500 0.05 78.7 78.6 79.88 38.59 (category 4) 500 5.32 92.74 92.64 92.7 91.77 C6 100 100 11.52 40.03 25.55 4.61 3 100 2.78 80.88 80.44 80.89 79.67 300 100 39.06 51.27 39.49 20.77 300 3.06 88.83 88.56 88.82 87.89 500 100 46.78 52.45 52.18 28.08 500 2.83 90.63 90.47 90.63 89.6 m is the number of random SNPs used for inference of population structure. ES: m is the number of random SNPs used for inference of population structure. r is EIGENSTRAT; PS: PHYLOSTRAT; MC: MDS clustering; SS: STRATSCORE. the ancestry risk between the two ancestral populations. ES: EIGENSTRAT; PS: PHYLOSTRAT; MC: MDS-clustering; SS: STRATSCORE. stratification, which obscures the true association signal, whereas STRATSCORE has conservative type I error rate. For causal SNPs in population stratification in the data. Such a complex population category (iv), we observed noticeable power change for each method structure poses a challenge to genetic association analysis, but also as the number of random SNPs, m, increases, and such power change offers an opportunity to evaluate various methods. is also due to difference in ability to remove the confounding effect We analyzed the 515 710 autosomal SNPs using EIGENSTRAT, of population stratification. MDS clustering, STRATSCORE and PHYLOSTRAT, and then estimated type I error rate for each method based on the 515 710 tests as none of the SNPs are associated with case– 3.2 Setting 2: admixed population structure control status by design. At α = 0.01 significance level, the type Table 5 shows the results for admixed populations. We I error rates of EIGENSTRAT, MDS clustering, STRATSCORE observed similar patterns for PHYLOSTRAT, MDS clustering, and PHYLOSTRAT are 0.0428, 0.0290, 0.0373 and 0.0232, EIGENSTRAT and STRATSCORE. All these methods yield type respectively; when α = 0.001, the type I error rates of the four I error rates that are close to the nominal level with STRATSCORE methods are 0.0087, 0.0047, 0.0067 and 0.0016, respectively; when being slightly conservative. The power for detecting causal SNPs is α = 0.0005, the corresponding type I error rates are 0.0054, 0.0026, similar for all methods. 0.0039 and 0.00048, respectively. We also estimated the genomic control inflation factor for each of the four methods: 1.66 for 3.3 Setting 3: HGDP data EIGENSTRAT, 1.45 for MDS clustering, 1.59 for STRATSCORE We applied our method to a dataset of 955 individuals from the and 1.35 for PHYLOSTRAT. We also investigated the performance HGDP data genotyped on the Illumina HumanHap 650 SNP array of these three methods with larger number of random SNPs (Li et al., 2008). To simulate a dataset with population structure, (m = 30 000, 50 000 and 70 000), and obtained similar results. we artificially created three populations (Table 2). The phylogenetic As the continent information is known for each individual, one tree built from 10 000 randomly selected autosomal SNPs is shown might consider controlling population stratification by adjusting in Figure 2. With the same 10 000 random SNPs, we also conducted continent; however, the genomic control inflation factor for this MDS analysis and plotted the first two principal coordinates simple approach is 2.67, much higher than the other three methods, (Supplementary Fig. 1). As shown in both figures, this synthetic suggesting that simply adjusting for continent is not sufficient. GWAS dataset contains both discrete and admixed population We also explored a hybrid approach by adjusting MDS principal structures. The genomic control inflation factor for unadjusted coordinates and continent; the genomic control inflation factor for trend test with 515 710 autosomal SNPs is 17.3, indicating strong this approach is 1.45, similar to that of MDS clustering. Although [15:11 19/2/2010 Bioinformatics-btq025.tex] Page: 803 798–806 M.Li et al. Table 6. P-values for analysis of the lactase-height dataset No. of AIMs LCT SNP ES PS MC SS 100 rs3769005 0.0007 0.0006 0.0006 0.0001 rs7579771 0.0008 0.0007 0.0008 0.0002 200 rs3769005 0.0007 0.0016 0.0009 0.0002 rs7579771 0.0009 0.0019 0.0011 0.0002 250 rs3769005 0.0358 0.0506 0.045 0.0445 rs7579771 0.0523 0.0676 0.0656 0.0506 300 rs3769005 0.1331 0.2612 0.1732 0.063 rs7579771 0.1637 0.329 0.217 0.0707 500 rs3769005 0.0957 0.1726 0.2141 0.0798 rs7579771 0.0843 0.1932 0.2487 0.0874 700 rs3769005 0.2278 0.3994 0.2437 0.0649 rs7579771 0.2485 0.4021 0.2647 0.0699 1000 rs3769005 0.5407 0.3835 0.6098 0.1101 rs7579771 0.5845 0.4075 0.647 0.1229 1755 rs3769005 0.3952 0.3739 0.3085 0.4807 rs7579771 0.4261 0.4032 0.3279 0.4516 ES: EIGENSTRAT; PS: PHYLOSTRAT; MC: MDS clustering; SS: STRATSCORE. Fig. 2. HGDP phylogenetic tree based on 10 000 randomly selected As shown in our simulations, this hybrid approach effectively autosomal SNPs. captures both discrete and admixed population structures. It yields a more appropriate correction for population stratification than EIGENSTRAT (Price et al., 2006), MDS clustering (Li and Yu, none of the methods we considered completely corrected for 2008) and STRATSCORE (Epstein et al., 2007) under discrete population stratification for this complex synthetic GWAS dataset, population structures; its performance is similar to these three PHYLOSTRAT did show more encouraging result than the other methods. approaches under admixed population structures. To evaluate the performance of our method when the population structure is hierarchical, we applied our method to the HGDP dataset, 3.4 Analysis of lactase-height data which contains real genetic variation patterns. Although none of Using the naïve Armitage trend test, we observed significant the methods could completely remove the confounding effect of associations between the LCT SNPs and height after adjusting population stratification, PHYLOSTRAT performs favorably against −6 sex (rs3769005: P-value= 1.8 × 10 ; rs7579771: P-value= 2.1 × the other methods and yields type I error rates that are closer to −6 10 ), suggesting the presence of population stratification. We then the nominal level. To test the performance of our method in real analyzed this dataset with various population stratification correction genetic association studies, we applied our method to a lactase- methods with different numbers of AIMs (Table 6). We adjusted sex height dataset and found that PHYLOSTRAT is able to correct in all analyses. for population stratification with only 250 AIMs, smaller than the Our results indicate that PHYLOSTRAT performs favorably number of AIMs required by the other methods. Our results suggest against the other methods. For example, with 250 AIMs, among the that phylogenetics is a robust and useful tool for inferring complex four methods we considered for comparison, only PHYLOSTRAT population structures, and appropriate utilization of information yields P-values >0.05 for both SNPs; whereas the other three captured by phylogenetics trees can help correct for population methods require more AIMs to remove the effect of population stratification in genetic association analysis, especially when the stratification. Our results are consistent with Price et al. (2006) and number of random SNPs is small. Epstein et al. (2007), who also observed that EIGENSTRAT cannot We note that as the number of random SNPs used for inference completely resolve the confounding issue between LCT and height of population structure increases (e.g. when m = 50 000), the type I when only a small number of AIMs were used in the analysis. errors of PHYLOSTRAT, EIGENSTRAT and MDS clustering are all close to the nominal level under simulation settings we considered, but the type I errors of STRATSCORE are conservative 4 DISCUSSION with the patterns similar to those seen in Tables 3–4 and We have developed a new method to correct for population Supplementary Table 1. These results suggest that when only a small stratification in genetic association analysis by combining number of random SNPs are available, one might consider using information obtained from phylogenetic trees and MDS analysis. PHYLOSTRAT to control for population stratification, but when Our method represents relations between individual’s genetic the number of random SNPs is large, EIGENSTRAT would be a background using a set of phylogenetic bipartitions and principal preferable approach as it is computationally faster. It is worth noting coordinates from MDS analysis, and incorporates them as covariates that the bipartitions obtained from the phylogenetics tree can be in a regression framework to adjust for the confounding effect due used together with principal components as basis functions to build to hidden population stratification. the stratification scores (Dr Glen Satten, personal communication), [15:11 19/2/2010 Bioinformatics-btq025.tex] Page: 804 798–806 Population stratification in genetic association studies and such a modified version may improve the performance of be properly placed, which we expect to happen very often in human STRATSCORE. genetic data. What methods best reduce correlated bipartitions and Our method shares similarity with another hybrid approach how the bipartition reduction step affects the stratification correction MDS clustering (Li and Yu, 2008) in that both methods use the is an important future research direction. principal coordinates from the MDS analysis to capture admixed For PHYLOSTRAT, the running time is dominated by the population structure. To capture discrete population structure, the pairwise distance computation as FastME is a very fast algorithm MDS clustering method assigns each individual a group membership with sub-quadratic asymptotic running time in practice. Because based on clustering of the MDS principal coordinates, where the pairwise distance computation is more time-consuming than PCA- number of clusters is determined by the gap statistic (Tibshirani based approaches, the running time of PHYLOSTRAT is longer et al., 2001). However, the problem of estimating the number of than EIGENSTRAT and STRATSCORE. For example, for a dataset clusters can be difficult, because in many situations there is no clear consisting of 500 cases and 500 controls with 500 random SNPs and definition of a ‘cluster’. Moreover, for data that are not clearly 1000 testing SNPs, using an Intel Xeon CPU (2.66 GHz, 8 GB RAM, separated into groups, e.g. when there are overlapping classes, linux), it took 9 s for EIGENTRAT, 14 s for STRATSCORE, 19 s determining the number of distinct clusters can be highly subjective. for PHYLOSTRAT and 40 s for MDS clustering. However, given In contrast, the use of phylogenies in our method circumvents this that high-performance and low-cost computational capabilities are problem because our method does not require such parameter to easily accessible, the required computing time for pairwise distance be predefined. It is worth noting that clustering is a degenerate or matrix calculation in PHYLOSTRAT would be a small overhead. a simplified version of phylogenetic tree. Unlike clustering, which One possible approach to reduce the running time is to limit the can only handle simple population structures, phylogenetic trees number of markers by LD pruning. Our experience with the analysis are suitable for handling more complex, hierarchical population of the HGDP data and several other GWAS datasets suggests that structures as demonstrated in our analysis of the HGDP data. this approach generally works well in GWAS. Compared with existing methods for population stratification In summary, we have proposed a new method that combines correction, our method has two distinct advantages. First, the information from phylogenetic tree and MDS together. Given phylogenetic approach allows better interpretation and visualization the flexible nature and the hierarchical characteristic, this hybrid of the population structure in the study sample, especially when approach is expected to perform well under complex population the data contain a hierarchical structure. Many tools have been structures. We expect our method will provide a useful tool in the developed for molecular phylogenetic studies since the fifties, and analysis of both candidate gene and GWAS studies. it will be an important research direction to find how these methods can be applied for the population stratification problem. Second, as shown in our simulations and the analysis of the lactase-height data, ACKNOWLEDGEMENTS our method requires fewer markers to infer population structure than We thank Drs Michael Epstein, Glen Satten, Maja Bucan, Junhyong other existing methods; therefore our method is ideal for candidate Kim and the late Richard Spielman for discussions. We also thank gene studies or replication study of GWAS, in which a large number three anonymous reviewers for their helpful comments that greatly of random SNPs may not be available. Although we considered a improved the manuscript. small number of random SNPs in our simulations, our method can be applied to GWAS as demonstrated by the analysis of the HGDP Funding: Penn Genomics Frontier Institute (an internal grant to M.L. dataset. and L.-S.W.). National Institutes of Health (grant R01HG004517 to Our method has to make a decision on how a set of bipartitions M.L.). is selected without losing too much information for subsequent Conflict of Interest: none declared. association analysis. If the population has a small number of subpopulations, most bipartitions will be similar to one of the bipartitions selected by our bipartition reduction algorithm. Having too many bipartitions retained in the stratification correction REFERENCES procedure will introduce many partially correlated covariates in Armitage,P. (1955) Tests for linear trends in proportions and frequencies. Biometrics, logistic regression, and can have an adverse effect on the numerical 11, 375–386. Balding,D.J. and Nichols,R.A. (1995) A method for quantifying differentiation between regression procedure; on the contrary, over-reduction of bipartitions populations at multi-allelic loci and its implications for investigating identity and may lead to loss of important information on the population paternity. Genetica, 96, 3–12. structure. Both scenarios can hurt the performance of our algorithm. Campbell,C.D. et al. (2005) Demonstrating stratification in an European American In our analysis, we tried various thresholds for bipartition selection. population. Nat. Genet., 37, 868–872. We found that using 2.5% as the threshold for the number of leaves Cavalli-Sforza,L.L. and Edwards,A.W.F. (1967) Phylogenetic analysis: models and estimation procedures. Am. J. Hum. Genet. 19, 233–257. and 0.7 as the threshold for the correlation coefficient generally Desper,R. and Gascuel,O. (2002) Fast and accurate phylogeny reconstruction algorithms leads to stable results for discrete population structures, and the based on the minimum-evolution principle. J. Comput. Biol., 19, 687–705. corresponding thresholds are 2.5% and 0.1 for admixed population Devlin,B. and Roeder,K. (1999) Genomic control for association studies. Biometrics, structures. In general, we recommend the users to try multiple 55, 997–1004. Epstein,M.P. et al. (2007) A simple and improved correction for population stratification sets of thresholds and select the threshold that gives the smallest in case-control studies. Am. J. Hum. Genet., 80, 921–930. genomic control inflation factor. We recognize that there are other Felsenstein,J. (1985) Confidence limits on phylogenies: an approach using the bootstrap. methods such as bootstrapping that can reduce bipartitions in the Evolution, 39, 783–791. phylogeny (Felsenstein, 1985); however, we expect these methods Garey,M.R. and Johnson,D.S. (1979) Computers and Intractability: A Guide to the are very sensitive to any leaves (individuals) in the tree that cannot Theory of NP-Completeness. WH Freeman, New York, NY, USA, pp. 190. [15:11 19/2/2010 Bioinformatics-btq025.tex] Page: 805 798–806 M.Li et al. Goldstein,D.B. et al. (1995) An evaluation of genetic distances for use with Nei,M. (1972) Genetic distance between populations. Am. Naturalist, 106, 283–292. microsatellite loci. Genetics, 139, 463–471. Price,A.L. et al. (2006) Principal components analysis corrects for stratification in Jakobsson,M. et al. (2008) Genotype, haplotype and copy-number variation in genome-wide association studies. Nat. Genet., 38, 904–909. worldwide human populations. Nature, 451, 998–1002. Pritchard,J.K. and Rosenberg,N.A. (1999) Use of unlinked genetic markers to detect Keating,B.J. et al. (2008) Concept, design and implementation of a cardiovascular gene- population stratification in association studies. Am. J. Hum. Genet., 65, 220–228. centric 50K SNP array for large-scale genomic association studies. PLoS ONE, 3, Pritchard,J.K. et al. (2000) Association mapping in structured populations. Am. J. Hum. e3583. Genet., 67, 170–181. Kimmel,G. et al. (2007) A randomization test for controlling population stratification Rosenberg,N.A. et al. (2002) Genetic structure of human populations. Science, 298, in whole-genome association studies. Am. J. Hum. Genet., 81, 895–905. 2381–2385. Li,Q. and Yu,K. (2008) Improved correction for population stratification in genome- Saitou,N. and Nei,M. (1987) The neighbor-joining method: a new method for wide association studies by identifying hidden population structures. Genet. Epid., reconstructing phylogenetic trees. Mol. Biol. Evol., 4, 406–425. 32, 215–226. Serre,D. et al. (2008) Correction of population stratification in large multi-ethnic Li,J.Z. et al. (2008) Worldwide human relationships inferred from genome-wide patterns association studies. PLoS ONE, 1, e1382. of variation. Science, 319, 1100–1104. Studier,J.A. and Keppler,K.L. (1988) A note on the neighbor-joining algorithm of Saitou Luca,D. et al. (2008) On the use of general control samples for genome-wide association and Nei. Mol. Biol. Evol., 5, 729–731. studies: genetic matching highlights causal variants. Am. J. Hum. Genet., 82, Tibshirani,R. et al. (2001) Estimating the number of clusters in a data set via the gap 453–463. statistic. J. R. Stat. Soc. B, 63, 411–423. Marchini,J. et al. (2004) The effects of human population structure on large genetic association studies. Nat. Genet., 36, 512–517. [15:11 19/2/2010 Bioinformatics-btq025.tex] Page: 806 798–806
Bioinformatics – Oxford University Press
Published: Jan 22, 2010
You can share this free article with as many people as you like with the url below! We hope you enjoy this feature!
Read and print from thousands of top scholarly journals.
Already have an account? Log in
Bookmark this article. You can see your Bookmarks on your DeepDyve Library.
To save an article, log in first, or sign up for a DeepDyve account if you don’t already have one.
Copy and paste the desired citation format or use the link below to download a file formatted for EndNote
Access the full text.
Sign up today, get DeepDyve free for 14 days.
All DeepDyve websites use cookies to improve your online experience. They were placed on your computer when you launched this website. You can change your cookie settings through your browser.