GAPIT: genome association and prediction integrated tool

BIOINFORMATICS APPLICATIONS NOTE
Vol. 28 no. 18 2012, pages 2397–2399
doi:10.1093/bioinformatics/bts444
Genetics and population analysis
Advance Access publication July 13, 2012

Alexander E. Lipka¹, Feng Tian², Qishan Wang³, Jason Peiffer⁴, Meng Li²,⁵, Peter J. Bradbury¹, Michael A. Gore⁶, Edward S. Buckler¹,²,⁴ and Zhiwu Zhang²,*

¹Computational Biologist with the United States Department of Agriculture - Agricultural Research Service (USDA-ARS), Ithaca, NY 14853, USA, ²Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853, USA, ³Department of Animal Science, Shanghai Jiao Tong University, Shanghai 200240, China, ⁴Department of Plant Breeding and Genetics, Cornell University, Ithaca, NY 14853, USA, ⁵Centre of Pear Engineering Technology Research, Nanjing Agricultural University, Nanjing 210095, China and ⁶US Arid-Land Agricultural Research Center, United States Department of Agriculture-Agricultural Research Service, Maricopa, AZ 85138, USA

Associate Editor: Jeffrey Barrett

*To whom correspondence should be addressed.

Published by Oxford University Press 2012. All rights reserved. For Permissions, please email: [email protected]

ABSTRACT

Summary: Software programs that conduct genome-wide association studies and genomic prediction and selection need to use methodologies that maximize statistical power, provide high prediction accuracy and run in a computationally efficient manner. We developed an R package called Genome Association and Prediction Integrated Tool (GAPIT) that implements advanced statistical methods including the compressed mixed linear model (CMLM) and CMLM-based genomic prediction and selection. The GAPIT package can handle large datasets in excess of 10 000 individuals and 1 million single-nucleotide polymorphisms with minimal computational time, while providing user-friendly access and concise tables and graphs to interpret results.

Availability: http://www.maizegenetics.net/GAPIT.

Contact: [email protected]

Supplementary Information: Supplementary data are available at Bioinformatics online.

Received on April 11, 2012; revised on July 3, 2012; accepted on July 8, 2012

1 INTRODUCTION

Advances in high-throughput single-nucleotide polymorphism (SNP) genotyping are enabling powerful genome-wide association studies (GWAS), thereby enhancing the ability to identify causal mutations that underlie human diseases and agriculturally important traits. The resulting SNPs are also valuable for genomic prediction and selection (GS), which provides criteria for disease risk management in humans and expedited selection in animal and plant breeding (Heffner et al., 2009; Meuwissen et al., 2001). Before the full potential of GWAS and GS is realized, inflated false-positive rates, extensive computational requirements and suboptimal prediction accuracies need to be addressed.

Newly developed GWAS statistical methods based on the mixed linear model (MLM) hold great promise to overcome these challenges. They are flexible because they incorporate fixed and random effects. To address the spurious associations that arise from population structure, covariates from either STRUCTURE (Pritchard et al., 2000) or principal components (PCs) can be included as fixed effects. The cryptic relationships between individuals are accounted for through a kinship matrix in the unified MLM (Yu et al., 2006). The more computationally efficient and powerful compressed MLM (CMLM) (Zhang et al., 2010) uses a group kinship matrix calculated from clustered individuals.

Because the typical number of genotypic data points now exceeds hundreds of millions, solving MLMs using the traditional restricted maximum likelihood approach is computationally intensive. Therefore, the efficient mixed model association (EMMA) algorithm (Kang et al., 2008) was developed to reduce this computational burden by reparameterizing the MLM likelihood functions. EMMA eXpedited (EMMAX) (Kang et al., 2010) and population parameters previously determined (P3D) (Zhang et al., 2010) were independently developed to further reduce computing time by eliminating the need to re-estimate variance components at each marker.

Most GS methods make predictions with the sum of the effects from all available SNPs or Genomic Best Linear Unbiased Prediction (gBLUP) based on a kinship matrix derived from these SNPs. The former approach offers higher prediction accuracies for simpler traits, while the latter approach is more accurate for complex traits (Daetwyler et al., 2010). Our work implements an improved gBLUP method that increases accuracy, especially for simple traits.

Most software packages were developed for a particular GWAS or GS approach. For example, packages were written exclusively for the EMMA and EMMAX algorithms. Other software such as the Trait Analysis by aSSociation, Evolution and Linkage (TASSEL) (Bradbury et al., 2007) and PLINK (Purcell et al., 2007) make multiple GWAS approaches available in one package. We continue these software development efforts by creating Genome Association and Prediction Integrated Tool (GAPIT), which integrates the most powerful, accurate and computationally efficient GWAS and GS methods into a single R package.

2 IMPLEMENTATION

The GAPIT program accepts several combinations of genotypic data, phenotypic data, externally obtained kinship matrices, and covariates such as population structure and age. Multiple traits can be stored in a single phenotypic dataset, which allows sequential analysis of each trait. The genotypic data may be stored in HapMap or numerical formats. If genotypic data are absent, then phenotypic data and a kinship matrix are required to perform GS.

By default, GAPIT uses the CMLM approach with P3D/EMMAX for GWAS. GS is performed using the same optimization settings as GWAS (Supplementary Sections I and II and Fig. S1). There is an option to perform GS only by specifying ‘SNP.test = FALSE’. Seven algorithms are available to cluster individuals into groups. GAPIT can also perform the MLM and GLM approaches by adjusting the ‘group.to’ and ‘group.from’ input parameters. When the kinship matrix is not provided, it will be calculated with the methods of VanRaden (VanRaden, 2008), Loiselle (Loiselle et al., 1995) or EMMA (Kang et al., 2008). GAPIT can also perform principal component analysis of the genotypic data to control for population structure (Zhao et al., 2007).

GAPIT has several strategies for analyzing large SNP datasets. One is to import genotypic data stored in multiple smaller files. If these files still exceed memory limits, the ‘file.fragment’ parameter can be used to sequentially load fragments within each file. If there is not enough memory to use all SNPs to calculate the kinship matrix and PCs, then the ‘SNP.fraction’ input parameter will select a random sample of the SNPs for these calculations (Yu et al., 2009).

The results from GAPIT are accessed as both objects within the R workspace and as external files. The R objects, which include GWAS and GS results, may be used for follow-up analyses in R. The external files include publication-ready summaries of GWAS and GS results. GWAS results are summarized by Manhattan plots, quantile–quantile plots and a table. Similarly, GS results are presented in a heat map and a table. Graphs of the heritability estimates and the likelihood function at various compression levels are included. A subset of the graphs and tables produced by GAPIT is presented in Figure 1.

Fig. 1. Gallery of GAPIT output. (a) Plot of the first two principal components (PC1 and PC2). (b) Plot of twice the negative log likelihood (-2LL, smaller is better) at various numbers of groups. (c) Graph showing the optimum cluster algorithm, method to calculate group kinship, group number, -2LL, and the proportion of genetic variance (group heritability) and residual variance. (d) Distribution of best linear unbiased predictors (BLUPs) and their prediction error variance (PEV). (e) Genomic prediction and selection output summary. The individual id (taxa), group, RefInf which indicates whether the individual is in the reference group (1) or not (2), the group ID number, the BLUP and the PEV of the BLUP. (f) Manhattan plot. -log10 P-values are plotted against physical map position of SNPs. Chromosomes are alternatingly colored. (g) Quantile–quantile (QQ) plot determines how GWAS results compare to the expected results under the null hypothesis of no association. (h) Output table of GWAS results. The SNP id, chromosome, bp position, P-value, minor allele frequency (maf), sample size (nobs), R² of the model without the SNP, R² of the model with the SNP and adjusted P-value following a false discovery rate-controlling procedure (Benjamini and Hochberg, 1995).

3 PERFORMANCE TESTS

EMMA and TASSEL were compared with GAPIT. These two packages were selected because both use the EMMA algorithm, while TASSEL also implements the CMLM approach with P3D. When the same approach was used, identical results were obtained (Supplementary Figs S2 and S3). The computing time of all three packages increases linearly with the number of SNPs (Supplementary Fig. S4). However, the average computing time per SNP in GAPIT is 7-fold and 180-fold shorter than in TASSEL and EMMA, respectively (Supplementary Fig. S4). It took 69.5 h to analyze a dataset with 11 000 individuals and 500 000 SNPs, which extrapolates to 7195 SNPs per CPU hour or less than 6 days to analyze 1 million SNPs.

4 CONCLUSIONS

This R package uses state-of-the-art mixed model methods to conduct GWAS and GS. GAPIT analyzes large datasets with minimum computational time and produces comprehensive results including R objects and high-quality graphs.

ACKNOWLEDGEMENTS

J.C. Glaubitz is acknowledged for assistance with data analysis in UNIX.

Funding: National Science Foundation (grants DBI-0321467, DBI-0820619, and DBI-0922493); United States Department of Agriculture - Agricultural Research Service.

Conflict of Interest: none declared.

REFERENCES

Benjamini,Y. and Hochberg,Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy Statis. Soc. B, 57, 289–300.
Bradbury,P.J. et al. (2007) TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics, 23, 2633–2635.
Daetwyler,H.D. et al. (2010) The impact of genetic architecture on genome-wide evaluation methods. Genetics, 185, 1021–1031.
Heffner,E.L. et al. (2009) Genomic selection for crop improvement. Crop Sci., 49, 1–12.
Kang,H.M. et al. (2010) Variance component model to account for sample structure in genome-wide association studies. Nat. Genet., 42, 348–354.
Kang,H.M. et al. (2008) Efficient control of population structure in model organism association mapping. Genetics, 178, 1709–1723.
Loiselle,B.A. et al. (1995) Spatial genetic-structure of a tropical understory shrub, Psychotria Officinalis (Rubiaceae). Am. J. Bot., 82, 1420–1425.
Meuwissen,T.H. et al. (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics, 157, 1819–1829.
Pritchard,J.K. et al. (2000) Association mapping in structured populations. Am. J. Hum. Genet., 67, 170–181.
Purcell,S. et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet., 81, 559–575.
VanRaden,P.M. (2008) Efficient methods to compute genomic predictions. J. Dairy Sci., 91, 4414–4423.
Yu,J. et al. (2009) Simulation appraisal of the adequacy of number of background markers for relationship estimation in association mapping. Plant Genome, 2, 63–77.
Yu,J.M. et al. (2006) A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet., 38, 203–208.
Zhang,Z. et al. (2010) Mixed linear model approach adapted for genome-wide association studies. Nat. Genet., 42, 355–360.
Zhao,K. et al. (2007) An Arabidopsis example of association mapping in structured samples. PLoS Genet., 3, e4.
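
As a worked check of the extrapolation quoted in the performance tests (69.5 h for 500 000 SNPs, scaled to 1 million SNPs), the arithmetic can be sketched as follows. This is a minimal Python illustration, not part of GAPIT (which is an R package); the variable names are hypothetical, and the calculation assumes the linear scaling with SNP count that the article reports.

```python
# Benchmark figures quoted in the article's performance tests.
hours_taken = 69.5        # time to analyze the benchmark dataset
snps_analyzed = 500_000   # number of SNPs in that dataset

# Throughput in SNPs per CPU hour.
snps_per_hour = snps_analyzed / hours_taken

# Extrapolated time for 1 million SNPs, assuming computing time
# grows linearly with the number of SNPs (as stated in the text).
target_snps = 1_000_000
hours_needed = target_snps / snps_per_hour
days_needed = hours_needed / 24

print(f"{snps_per_hour:.0f} SNPs per CPU hour")
print(f"{hours_needed:.1f} h, i.e. {days_needed:.2f} days, for 1 million SNPs")
```

The throughput comes out at roughly 7194 SNPs per CPU hour, in line with the reported figure of 7195, and the extrapolated time is about 139 h, which is under 6 days.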

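
The adjusted P-value column in the GWAS output table follows a false discovery rate-controlling procedure (Benjamini and Hochberg, 1995). The sketch below shows the standard Benjamini-Hochberg step-up adjustment in plain Python. It is an illustration under assumptions: the function name `bh_adjust` and the example P-values are hypothetical, and the code is not taken from GAPIT, whose exact implementation may differ in detail.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down to the smallest, scaling each
    # by m / rank and keeping the adjusted values monotone and <= 1.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical P-values for four SNPs.
example = [0.01, 0.04, 0.03, 0.005]
print(bh_adjust(example))
```

For these four hypothetical P-values the adjusted values are approximately 0.02, 0.04, 0.04 and 0.02, returned in the same order as the input so they can be attached to the corresponding SNP rows.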


Publisher
Oxford University Press
Copyright
Published by Oxford University Press 2012. All rights reserved. For Permissions, please email: [email protected]
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/bts444
pmid
22796960


Journal

Bioinformatics, Oxford University Press

Published: Jul 13, 2012
