iPat: intelligent prediction and association tool for genomic research

Summary: The ultimate goal of genomic research is to effectively predict phenotypes from genotypes so that medical management can improve human health and molecular breeding can increase agricultural production. Genomic prediction or selection (GS) plays a complementary role to genome-wide association studies (GWAS), which is the primary method to identify genes underlying phenotypes. Unfortunately, most computing tools cannot perform data analyses for both GWAS and GS. Furthermore, the majority of these tools are executed through a command-line interface (CLI), which requires programming skills. Non-programmers struggle to use them efficiently because of the steep learning curves and zero tolerance for data formats and mistakes when inputting keywords and parameters. To address these problems, this study developed a software package, named the Intelligent Prediction and Association Tool (iPat), with a user-friendly graphical user interface. With iPat, GWAS or GS can be performed using a pointing device to simply drag and/or click on graphical elements to specify input data files, choose input parameters and select analytical models. Models available to users include those implemented in third party CLI packages such as GAPIT, PLINK, FarmCPU, BLINK, rrBLUP and BGLR. Users can choose any data format and conduct analyses with any of these packages. File conversions are automatically conducted for specified input data and selected packages. A GWAS-assisted genomic prediction method was implemented to perform genomic prediction using any GWAS method such as FarmCPU. iPat was written in Java for adaptation to multiple operating systems including Windows, Mac and Linux.

Availability and implementation: The iPat executable file, user manual, tutorials and example datasets are freely available at http://zzlab.net/iPat.
C.J. Chen and Z. Zhang
Contact: zhiwu.zhang@wsu.edu

1 Introduction

Genome-wide association studies (GWAS) have become the primary method for dissecting complex traits. To incorporate population structure, a general linear model was implemented in PLINK (Purcell et al., 2007) to reduce the spurious associations. Mixed linear models have been developed to incorporate cryptic relationships among individuals to further reduce the spurious associations. Software packages have been developed correspondingly to conduct the analyses, including TASSEL (Bradbury et al., 2007), EMMA (Kang et al., 2008), GAPIT (Lipka et al., 2012; Tang et al., 2016) and FarmCPU (Liu et al., 2016).

Other recently developed analytical methods have also given genomic research a boost toward improving disease risk management in humans and molecular breeding of plants and animals, the ultimate goals of genomic prediction. These packages include rrBLUP (Endelman, 2011) and BGLR (Pérez and De Los Campos, 2014). rrBLUP implements ridge regression and genomic BLUP (gBLUP) and BGLR implements Bayesian methods such as Bayes A, B, Cπ and LASSO. Some genomic prediction methods can be used for GWAS, for example, Bayes A, B and Cπ. In return, GWAS results can also enhance genomic prediction (Spindel et al., 2016).

The multiple available software packages provide the potential to enhance data analyses, but also create challenges for users. Most packages only use a command-line interface (CLI), which has a very steep learning curve for non-programmers. Furthermore, users must spend great effort when shifting from one package to another due to inconsistent format requirements for input data. Users must take the time to reformat their data accordingly.
As a result, a user-friendly graphical user interface (GUI)-based software package that can access multiple CLI packages, use any type of input file format, and perform both GWAS and genomic prediction or selection is critically needed. The objective of this study was to develop a software package with the following functions: (1) performs both GWAS and genomic prediction, including GWAS-assisted genomic prediction; (2) offers a friendly GUI to reduce user learning time and (3) requires only one input data format to conduct any analysis with any incorporated method.

Fig. 1. Design of iPat. iPat provides users the ability to access incorporated software packages and data inputs (a) by using a GUI. The GUI (b) allows users to control all the processes, including modeling (c) and displaying results (d). Currently, incorporated packages include GAPIT, PLINK, FarmCPU, BLINK, rrBLUP and BGLR. Genotype data can be input in any format, including numerical, hapmap, VCF and PLINK. The GUI allows users to drag any type of data file into the interface and create project icons to link data files, manage analyses and display results.

2 GWAS-assisted genomic prediction

By default, the Intelligent Prediction and Association Tool (iPat) conducts genomic prediction after GWAS with any implemented CLI package. Genomic prediction is conducted by gBLUP with associated loci fitted as fixed effects in the following model:

$$y = Wc + Xb + Zu + e \tag{1}$$

where $y$ is a vector of phenotypes; $c$ and $b$ represent unknown fixed effects, with $c$ as inheritable factors (e.g. population structure and associated genetic loci) and $b$ as uninheritable factors (e.g. environmental treatments); and $u$ is a vector of size $n$ (the number of individuals) of unknown random polygenic effects. These random effects follow a distribution with a mean of zero and a covariance matrix of $G = 2K\sigma_u^2$, where $K$ is the kinship matrix with element $k_{ij}$ $(i, j = 1, 2, \ldots, n)$ representing the relationship between individuals $i$ and $j$, and $\sigma_u^2$ is an unknown genetic variance. $W$, $X$ and $Z$ are the incidence matrices for $c$, $b$ and $u$, respectively. $e$ is a vector of random residual effects that are normally distributed with a mean of zero and a covariance of $R = I\sigma_e^2$, where $I$ is the identity matrix and $\sigma_e^2$ is the unknown residual variance. The predicted genetic merits (GM) of individuals are calculated by the following equation:

$$\mathrm{GM} = W\hat{c} + Z\hat{u} \tag{2}$$

where $\hat{c}$ and $\hat{u}$ are the estimate and prediction of $c$ and $u$, respectively.

The associated loci are defined as the genetic markers with P-values below the Bonferroni threshold. The associated loci are also filtered for markers that are in linkage disequilibrium (LD). Markers are sorted with the most strongly associated marker on top. Any other marker with an LD of 50% ($R^2$) or above with the top marker is removed. Then, the second most strongly associated marker is selected as the top marker and the same process is repeated until no markers can be removed. The total number of associated markers and other fixed effects must be less than the square root of the number of individuals. If not, the less significant markers are removed until this requirement is satisfied.

3 GUI, data and third party CLI packages

iPat's GUI is designed to drag and click input data and access third party CLI packages using a computer's pointing device (Fig. 1). Users can also use the keyboard to change parameters.

After iPat is launched, the GUI appears as a blank frame labeled iPat. The frame is used to manage data files and project analyses. The frame behaves like a folder that users can drag any object into, including files and other folders. The graphical icons on the frame are links to the original files and folders. By double-clicking on these icons, the computer's operating system opens them with the appropriate default programs. For example, a folder is opened by the file explorer and a text file is opened by a text editor.

A project icon can be created by double-clicking anywhere on the iPat frame. Multiple project icons are acceptable. The project icons are used for linking the input files, defining parameters and initiating modeling analyses. Both project icons and file icons can be repositioned by dragging them with the pointing device. An icon can be deleted by dragging it to the bottom right-hand corner. When the icon is close to the corner, a trashcan will appear at the corner to indicate the deletion.

Overlapping a project icon and a data icon creates their connection, which is indicated by a dashed line (Fig. 1). Clicking on the dashed line turns it into a solid line. Clicking again returns the solid line back to a dashed line. When the line is solid, the connection can be dragged to the trashcan at the bottom right-hand corner for deletion. When a project icon is linked to the required genotype and phenotype data files by the dashed line, the project icon can be opened as a dialog by right-clicking. In the dialog box, the user can define parameters, select the desired model and execute the incorporated CLI packages to perform analyses. During the execution, the project icon will spin. The spinning will stop and display either a green or a red flag upon success or failure of the execution, respectively. Results of a successful run can be displayed by double-clicking the project icon.

4 Implementation

iPat's GUI was developed in Java. Input data and parameters are passed to specified CLI packages through the command-line interpreter. The interpreters are MS-DOS in a Windows operating system and Terminal in Mac OS or Linux systems. For an R-based package, the input parameters are translated into an R script. The pre-requisite R packages are imported into a library before calling the R package for the analysis. iPat then opens a new thread and executes this R script file by calling the 'Rscript' command in the command-line interpreter. For instance, if a user would like to perform GWAS by FarmCPU, iPat will pass 'Rscript FarmCPU.r mydata.dat mydata.map mydata.txt ...' to the command-line interpreter. The first argument of 'Rscript' signals which R script file should be run. The remaining arguments are used in FarmCPU.r, which defines the genotype data, genetic map and phenotype. For C-driven packages, iPat calls the command-line interpreter directly. For example, iPat will execute the command 'plink --bfile mydata --assoc --out mydata_out' if binary files are used to run GWAS in PLINK, where mydata and mydata_out specify the path and name of the input and output files, respectively.

Execution of the CLI packages is monitored using Java system functions. A new message panel is initiated to collect screen output from the CLI packages by calling java.lang.Process.getInputStream(). All information on the message panel is saved as a log file. A project can be terminated at any time by closing this message panel, an action that calls java.lang.Process.destroy(). iPat catches java.io.IOException and java.lang.InterruptedException raised while the command is executed, allowing the program to detect whether or not the computation was completed successfully.

Input file formats are automatically converted to the formats corresponding to the specified CLI packages. iPat uses the first three lines of each input file to determine the formats. Acceptable genotype formats include hapmap, numerical, VCF, PLINK and BLINK. Phenotype formats are acceptable with or without individual identification. When input data formats match the chosen CLI package format requirements, analyses are conducted directly. Otherwise, format conversion is performed first.

Display results are presented uniformly with the same array of information and graphics, regardless of which CLI package is used. Most CLI packages produce a limited set of results, such as P-values and genomic predictions. iPat uses the display functions in GAPIT as the universal set of result graphics, which include Manhattan plots, QQ plots and heat maps for prediction and accuracy distribution.

5 Conclusions

Because of its GUI, iPat allows users to perform genomic analyses without pre-requisite programming skills. Analyses include both mapping genes through GWAS and genomic prediction through understanding the relationships between genotypes and phenotypes. Additionally, iPat gives users the flexibility to combine different analysis methods (such as FarmCPU or rrBLUP) with different input formats (such as PLINK or hapmap genotype data) without requiring the tedious process of manual reformatting. These features should attract users of all levels. In turn, widespread use of iPat has the potential to spawn faster advances in genomic research.

Acknowledgement

The authors thank Linda R. Klein for helpful comments and editing the manuscript.

Funding

This work was partly supported by an Emerging Research Issues Internal Competitive Grant from the Agricultural Research Center at Washington State University, College of Agricultural, Human and Natural Resource Sciences; and the Endowment, Research Project (No. 126593) from the Washington Grain Commission, Department of Energy (awards of DE-SC0016366) and the National Institute of Food and Agriculture, U.S. Department of Agriculture (awards of 2015-05798 and 2016-68004-24770).

Conflict of Interest: none declared.

References

Bradbury,P.J. et al. (2007) TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics, 23, 2633–2635.
Endelman,J. (2011) Ridge regression and other kernels for genomic selection in the R package rrBLUP. Plant Genome, 4, 250–255.
Kang,H.M. et al. (2008) Efficient control of population structure in model organism association mapping. Genetics, 178, 1709–1723.
Lipka,A.E. et al. (2012) GAPIT: genome association and prediction integrated tool. Bioinformatics, 28, 2397–2399.
Liu,X. et al. (2016) Iterative usage of fixed and random effect models for powerful and efficient genome-wide association studies. PLoS Genet., 12, e1005767.
Pérez,P. and De Los Campos,G. (2014) Genome-wide regression and prediction with the BGLR statistical package. Genetics, 198, 483–495.
Purcell,S. et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet., 81, 559–575.
Spindel,J.E. et al. (2016) Genome-wide prediction models that incorporate de novo GWAS are a powerful new tool for tropical rice improvement. Heredity (Edinb), 116, 395–408.
Tang,Y. et al. (2016) GAPIT version 2: an enhanced integrated tool for genomic association and prediction. Plant J., 9, 1–9.
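The marker filter described in Section 2 (Bonferroni cut-off, LD pruning against the current top marker, and the square-root cap on fixed effects) can be sketched in Java as follows. This is a minimal illustration and is not taken from iPat's source: the class name `LdPruner`, the method `select`, the precomputed pairwise LD matrix `r2` and the significance level `alpha` are all assumptions introduced here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative sketch of the Section 2 marker filter; names are hypothetical. */
final class LdPruner {

    /**
     * @param pValues           GWAS P-value of each tested marker
     * @param r2                pairwise LD (R^2) between markers, indexed like pValues
     * @param alpha             family-wise significance level (an assumed input)
     * @param otherFixedEffects number of fixed effects other than markers
     * @param nIndividuals      number of individuals
     * @return indices of the retained markers, most significant first
     */
    static List<Integer> select(double[] pValues, double[][] r2, double alpha,
                                int otherFixedEffects, int nIndividuals) {
        // Bonferroni threshold: significance level divided by the number of tests.
        double threshold = alpha / pValues.length;

        // Keep only markers that pass the threshold, strongest association first.
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < pValues.length; i++) {
            if (pValues[i] < threshold) {
                candidates.add(i);
            }
        }
        candidates.sort(Comparator.comparingDouble(i -> pValues[i]));

        // Greedy pruning: a marker survives only if its R^2 with every
        // already-retained (stronger) marker is below 0.5.
        List<Integer> kept = new ArrayList<>();
        for (int candidate : candidates) {
            boolean inLd = false;
            for (int top : kept) {
                if (r2[candidate][top] >= 0.5) {
                    inLd = true;
                    break;
                }
            }
            if (!inLd) {
                kept.add(candidate);
            }
        }

        // Drop the least significant markers until the number of fixed effects
        // is below the square root of the number of individuals.
        while (!kept.isEmpty()
                && kept.size() + otherFixedEffects >= Math.sqrt(nIndividuals)) {
            kept.remove(kept.size() - 1);
        }
        return kept;
    }

    public static void main(String[] args) {
        double[] p = {1e-9, 1e-8, 1e-7, 0.5};
        double[][] r2 = {
            {1.0, 0.8, 0.1, 0.1},
            {0.8, 1.0, 0.1, 0.1},
            {0.1, 0.1, 1.0, 0.1},
            {0.1, 0.1, 0.1, 1.0}
        };
        System.out.println(select(p, r2, 0.05, 1, 100));
    }
}
```

Walking the sorted list once and comparing each marker with the markers already retained gives the same result as repeatedly promoting the next marker to "top", because every retained marker has already been checked against all stronger ones. In the toy data, marker 1 is dropped for being in LD with marker 0, so the program prints `[0, 2]`.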
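Section 4 states that iPat collects the output of a CLI package through java.lang.Process.getInputStream(), can stop it with java.lang.Process.destroy(), and treats java.io.IOException and java.lang.InterruptedException as signs of failure. A minimal sketch of that pattern is given below. It is not iPat's code: the class `CliRunner`, the method `run` and the use of a `StringBuilder` as a stand-in for the message panel and log file are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Illustrative sketch of launching and monitoring a CLI package; names are hypothetical. */
final class CliRunner {

    /**
     * Runs the command, appends everything it prints to the log, and
     * returns true only if the process exits with status 0.
     */
    static boolean run(List<String> command, StringBuilder log) {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // merge stderr into stdout
        Process process = null;
        try {
            process = builder.start();
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    process.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    log.append(line).append(System.lineSeparator());
                }
            }
            return process.waitFor() == 0;
        } catch (IOException e) {
            // The command could not be started or its output could not be read.
            log.append("I/O failure: ").append(e.getMessage())
               .append(System.lineSeparator());
            return false;
        } catch (InterruptedException e) {
            // The waiting thread was interrupted: stop the external process.
            if (process != null) {
                process.destroy();
            }
            Thread.currentThread().interrupt();
            log.append("Interrupted").append(System.lineSeparator());
            return false;
        }
    }

    public static void main(String[] args) {
        StringBuilder log = new StringBuilder();
        boolean ok = run(List.of("plink", "--bfile", "mydata",
                                 "--assoc", "--out", "mydata_out"), log);
        System.out.println(ok ? "success" : "failure");
        System.out.print(log);
    }
}
```

The same helper would cover the R-based case by passing `List.of("Rscript", "FarmCPU.r", "mydata.dat", "mydata.map", "mydata.txt")`. Passing the arguments as a list avoids quoting problems that arise when a single command string is handed to a shell.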
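Section 4 also notes that iPat inspects the first three lines of each input file to determine its format, without giving the rules it applies. The sketch below shows one way such a check could look. The rules are assumptions, not iPat's logic: a VCF file is recognised by its `##fileformat=VCF` header line, a hapmap file by a header starting with `rs#`, and a numerical genotype file by data rows whose fields after the first are numbers or `NA`. The class `FormatSniffer` and its enum are hypothetical names.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative format check on the first lines of a file; rules are assumed. */
final class FormatSniffer {

    enum Format { VCF, HAPMAP, NUMERIC, UNKNOWN }

    private static final Pattern NUMBER_OR_NA = Pattern.compile("-?\\d+(\\.\\d+)?|NA");

    /** Guesses the format from up to the first three lines of a file. */
    static Format detect(List<String> head) {
        if (head.isEmpty()) {
            return Format.UNKNOWN;
        }
        String first = head.get(0);
        if (first.startsWith("##fileformat=VCF")) {
            return Format.VCF;
        }
        if (first.startsWith("rs#")) {
            return Format.HAPMAP;
        }

        // Treat line 1 as a header and check the data rows on lines 2 and 3.
        int limit = Math.min(head.size(), 3);
        if (limit < 2) {
            return Format.UNKNOWN;
        }
        for (int row = 1; row < limit; row++) {
            String[] fields = head.get(row).trim().split("\\s+");
            if (fields.length < 2) {
                return Format.UNKNOWN;
            }
            // Field 0 is taken to be the individual's identifier.
            for (int col = 1; col < fields.length; col++) {
                if (!NUMBER_OR_NA.matcher(fields[col]).matches()) {
                    return Format.UNKNOWN;
                }
            }
        }
        return Format.NUMERIC;
    }

    public static void main(String[] args) {
        System.out.println(detect(List.of("taxa\tm1\tm2", "id1\t0\t2", "id2\t1\tNA")));
    }
}
```

A check of this kind only needs a few lines of input, which keeps detection fast for large genotype files. PLINK and BLINK inputs are left out of the sketch because they need file-specific rules.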


Bioinformatics, Volume 34 (11): 3 – Jan 11, 2018


Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/bty015


Journal

Bioinformatics, Oxford University Press

Published: Jan 11, 2018
