PGA: post-GWAS analysis for disease gene identification

J.-R. Lin et al.

Summary: Although the genome-wide association study (GWAS) is a powerful method to identify disease-associated variants, it does not directly address the biological mechanisms underlying such genetic association signals. Here, we present PGA, a Perl- and Java-based program for post-GWAS analysis that predicts likely disease genes given a list of GWAS-reported variants. Designed with a command line interface, PGA incorporates genomic and eQTL data in identifying disease gene candidates and uses gene network and ontology data to score them based upon the strength of their relationship to the disease in question.

Availability and implementation: http://zdzlab.einstein.yu.edu/1/pga.html

Contact: zhengdong.zhang@einstein.yu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

© The Author(s) 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com

1 Introduction

In the past several years, genome-wide association studies (GWAS) have been successfully applied to various human complex diseases, leading to the identification of a large number of disease-associated genetic loci. Interpreting these results, however, remains difficult, as GWAS only detect statistical associations, not functional signals, among a subset of all variants, and most associated SNPs are non-coding (either intronic or intergenic). To uncover the biological mechanisms underlying disease association signals, it is necessary to identify genes potentially affected by the reported variants as possible sources of these signals.

For lack of a better approach, in current GWAS the genes closest to or in the vicinity of disease-associated variants are used as the causal genes. This method cannot effectively handle variants found in gene deserts or in close proximity to multiple genes, and it also overlooks the possibility that risk variants may be contained in regulatory elements and therefore affect distant genes.

Here, we present a Perl- and Java-based application with a command line interface for post-GWAS analysis. It integrates both gene network and annotation data with GWAS signals to predict disease causal genes and assign them evidence-based scores. By considering associations between regulatory elements and promoters, our program can predict disease genes both proximal and distal to GWAS signals and regulatory elements (e.g. enhancers) that could harbor non-coding causal SNPs.

2 Materials and methods

Following the framework of our recent post-GWAS analysis of schizophrenia (Lin et al., 2016), this application performs two distinct operations: it identifies candidate disease genes given a set of GWAS-reported variants, and it scores these candidate genes to prioritize those most likely to be the sources of the disease-association signals.

Identifying risk regions. As GWAS only detect statistical associations from a pre-selected subset of variants, it is necessary to identify all unexamined variants that are in strong linkage disequilibrium (LD) with the GWAS-reported variants as potential alternative sources of the disease-association signals. Using VCFtools (Danecek et al., 2011) and a 1000 Genomes Project (1KG) reference panel (Genomes Project et al., 2012), we calculate the LD between each GWAS-reported variant and every 1KG variant within a 400-kb range. An LD block is then formed from all within-range SNPs with r > 0.5 and indexed by the corresponding GWAS variant. We merge all overlapping or adjacent (within 250 kb) LD blocks to form genomic risk regions and then use them to identify both proximal and distal risk gene candidates.

Identifying risk gene candidates. Proximal risk gene candidates are genes that, after extending their ranges by 20 kb on each end, overlap with these genomic risk regions. Distal risk gene candidates are genes affected by an expression quantitative trait locus (eQTL) or transcriptional regulatory element (TRE) containing any of the variants found in strong LD with GWAS-reported variants in an LD block (including the GWAS-reported variant itself). In order to incorporate this gene regulatory information, we collected lists of eQTL and TREs with their target genes from ENCODE (Thurman et al., 2012; Wang et al., 2018), FANTOM5 (Andersson et al., 2014; Wang et al., 2018), and GTEx (Lonsdale et al., 2013) data.

Scoring risk gene candidates. Variants associated with a particular disease may implicate a large number of disease gene candidates, particularly when distal gene candidates are considered as well, and it is therefore useful to prioritize these candidate genes. Our application employs a statistical method to score the disease-relatedness of risk gene candidates using predictive features derived from gene networks and annotation based on a set of training genes that are known to play a role in the etiology of the disease.

Given this set of known disease genes $D$ and the set of known genes $G$ (from GENCODE v19), we obtain the set of background genes $B = G - D$. Then, from the set of known disease genes $D$ we extract our predictive features: the frequent combinations of the Gene Ontology (GO) terms associated with the genes in $D$ and of the neighbors of genes in $D$ in the genome-scale human protein-protein interaction network that we employ (Li et al., 2017). GO terms of genes in $D$ include both annotated GO terms and their ancestor GO terms along the path of the 'is a' relationship in the gene ontology structure. Our application uses the FP-growth algorithm for frequent itemset mining (Han et al., 2000) with a support value of $\lceil \ldots \rceil$ (a ceiling expression in $|D|$). We limit the predictive features to 3-itemsets to avoid redundancy and intensive computation.

After feature extraction, we assign a score to each predictive feature $f$ based on the frequency of its association with genes in $D$ and $B$:

$$S_f = \frac{F_D / N_D}{(F_B + 1) / N_B},$$

in which $F_D$ is the frequency with which $f$ occurs in $D$ and $N_D$ is the number of genes in $D$. $F_B$ and $N_B$ are the corresponding values in $B$. Next, for each candidate disease gene, we identify all the predictive features with which it is associated and assign it the highest score of these features. In the event that a risk gene candidate is a training gene as well, the score $S_f$ of each predictive feature it contains must be adjusted:

$$S_f = \frac{(F_D - 1) / (N_D - 1)}{(F_B + 1) / N_B}.$$

As network and annotation scores are treated separately, each gene has two different scores that are combined to produce a final gene score:

$$S = \alpha S_f^{(n)} + (1 - \alpha) S_f^{(a)} \quad (0 < \alpha < 1),$$

in which $S_f^{(n)}$ and $S_f^{(a)}$ are the network and annotation-based scores, respectively, and $\alpha$ is a coefficient controlling the relative weights of these two scores on the final gene score. $\alpha = 0.4$ yields the best predictive power according to our evaluation (data not shown). Every candidate gene is excluded from gene sets $B$ and $D$ when scored to avoid biased scoring.

Notably, our gene scoring method is limited by the information available about known disease genes and is based on the hypothesis that novel disease genes will be involved in the same pathways and mechanisms as known disease genes.

Evaluating scores. Our application produces a score threshold to indicate that candidate genes with scores greater than this threshold should be considered as putative causal disease genes. To evaluate the prediction precision of a score threshold, we use disease training genes as the positive gene set and random background genes as negative gene sets. The threshold is the value that achieves a prediction precision of 0.8.

Application output and efficiency. PGA produces a number of output files: a list of scored disease gene candidates based on the scoring method outlined above, a list of risk regions indexed by associated variants and their linked genes (with scores and markings indicating proximal or distal), and several intermediate files such as the regulatory information for tracing the links of distal genes to GWAS signals and lists of network and annotation-based predictive features that can be reused with the same training gene set in order to avoid redundant computation. The first operation performed by the application, generating risk gene candidates, requires between 15 s and 90 s per input variant, depending on the size of the 1KG reference panel in use. The next operation, scoring these candidate genes, can take from less than one hour to several hours, depending on the size of the network/annotation data and their relationship with the training gene set used to generate predictive features. As the frequent itemset mining takes the majority of the time at this step, using pre-generated predictive feature sets significantly reduces the time used to score candidate genes.

Custom annotation data. PGA automatically uses built-in loci-gene regulatory information from ENCODE, FANTOM5 and GTEx eQTL data to identify distal risk gene candidates. It can also incorporate additional loci-gene regulatory information provided by users to potentially uncover more risk gene candidates.

3 Application

PGA can identify putative risk genes proximal or distal to GWAS signals. In a case study of Alzheimer's disease (AD), we first collected the top 50 AD risk genes from MalaCards (Rappaport et al., 2017) as training genes (Supplementary Table S1) and 310 AD-associated SNPs from the GWAS Catalog (MacArthur et al., 2017) as variant input (Supplementary Table S2). Given these two types of input, PGA identified 242 risk genomic regions and 552 connected candidate genes (Fig. 1 and Supplementary Table S3), of which 131 were scored high and thus predicted as putative AD risk genes (Supplementary Table S4). In the subsequent GO term and pathway analysis of these genes (Supplementary Tables S5 and S6), the most significantly enriched biological process (BP) GO term, 'Negative regulation of beta-amyloid formation,' is directly related to the pathogenesis of AD (Sadigh-Eteghad et al., 2015). Among the many over-represented inflammatory signaling pathways, 'Signaling by Interleukins' and 'Cytokine Signaling in Immune system' have both been suggested to play a role in the pathology and progression of AD (Weisman et al., 2006).

Fig. 1. The flowchart of the integrated post-GWAS study of Alzheimer's disease. 131 putative risk genes were identified from 310 GWAS reported SNPs for AD.

We also examined the effectiveness of PGA in putative disease risk gene identification. In an AD risk genomic region in 22q13.2 (Supplementary Fig. S1A), PGA linked five candidate genes to the AD-associated SNP (rs7364180). Although this SNP is located in an intron of CCDC134, PGA identified a high-scoring gene in the risk region, SREBF2, which has been implicated in the pathology of AD (Barbero-Camps et al., 2013). In another case, PGA linked five candidate genes to three AD-associated SNPs (rs9877502, rs61174035 and rs10937470) in an AD risk genomic region in 3q28 (Supplementary Fig. S1B). Among them, three are proximal candidate genes that have low scores. PGA identified a high-scoring distal gene, IL1RAP, as a possible AD risk gene underlying the disease association signals of these SNPs. Interestingly, IL1RAP was implicated as a novel causal gene for AD in a recent GWAS (Ramanan et al., 2015). The details of regulatory information of high-scoring genes are shown in Supplementary Table S7.

Indeed, a systematic performance evaluation of PGA's results indicated that it is more effective than other methods at prioritizing risk genes in AD GWAS (Supplementary Fig. S2). The superior performance of PGA is due to the fact that it integrates different types of data (Supplementary Table S8), allowing it to uncover plausible risk genes implicated by GWAS signals that might be missed by other methods.

Funding

This work was supported by NIH grants R01 HG008153 from the National Human Genome Research Institute and R01 AG057909 from the National Institute on Aging to Z.D.Z. This work was also supported by NIH grant U01 MH101720 from the National Institute of Mental Health to the International Consortium on Brain and Behavior in 22q11.2 Deletion Syndrome.

Conflict of Interest: none declared.

References

Andersson,R. et al. (2014) An atlas of active enhancers across human cell types and tissues. Nature, 507, 455–461.
Barbero-Camps,E. et al. (2013) APP/PS1 mice overexpressing SREBP-2 exhibit combined Abeta accumulation and tau pathology underlying Alzheimer's disease. Human Mol. Genet., 22, 3460–3476.
Danecek,P. et al. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156–2158.
Genomes Project,C. et al. (2012) An integrated map of genetic variation from 1,092 human genomes. Nature, 491, 56–65.
Han,J. et al. (2000) Mining frequent patterns without candidate generation. In: SIGMOD '00 Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. Association for Computing Machinery (ACM), Dallas, Texas, USA.
Li,T.B. et al. (2017) A scored human protein-protein interaction network to catalyze genomic interpretation. Nat. Methods, 14, 61–64.
Lin,J.R. et al. (2016) Integrated post-GWAS analysis sheds new light on the disease mechanisms of schizophrenia. Genetics, 204, 1587–1600.
Lonsdale,J. et al. (2013) The genotype-tissue expression (GTEx) project. Nat. Genet., 45, 580–585.
MacArthur,J. et al. (2017) The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res., 45, D896–D901.
Ramanan,V.K. et al. (2015) GWAS of longitudinal amyloid accumulation on 18F-florbetapir PET in Alzheimer's disease implicates microglial activation gene IL1RAP. Brain, 138, 3076–3088.
Rappaport,N. et al. (2017) MalaCards: an amalgamated human disease compendium with diverse clinical and genetic annotation and structured search. Nucleic Acids Res., 45, D877–D887.
Sadigh-Eteghad,S. et al. (2015) Amyloid-beta: a crucial factor in Alzheimer's disease. Med. Princ. Pract., 24, 1–10.
Thurman,R.E. et al. (2012) The accessible chromatin landscape of the human genome. Nature, 489, 75–82.
Wang,Z. et al. (2018) HEDD: Human Enhancer Disease Database. Nucleic Acids Res., 46, D113–D120.
Weisman,D. et al. (2006) Interleukins, inflammation, and mechanisms of Alzheimer's disease. Vitam. Horm., 74, 505–530.
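
Worked sketch of the scoring step. The feature score, its adjustment for training genes, the per-gene maximum and the weighted combination of network and annotation scores described in Section 2 can be illustrated in a few lines of Python. This is a minimal sketch and not PGA's own code (PGA is written in Perl and Java). All function names and the dictionary-based data layout are hypothetical, and the score of 0.0 for a gene that matches no predictive feature is an assumption the text does not state.

```python
from typing import Dict, Iterable, FrozenSet, Set

Feature = FrozenSet[str]  # an itemset of GO terms or network neighbors


def feature_score(f_d: int, n_d: int, f_b: int, n_b: int,
                  is_training_gene: bool = False) -> float:
    """Score of one predictive feature: (F_D / N_D) / ((F_B + 1) / N_B).

    If the gene being scored is itself a training gene, its own contribution
    is removed first: F_D -> F_D - 1 and N_D -> N_D - 1.
    """
    if is_training_gene:
        f_d, n_d = f_d - 1, n_d - 1
    return (f_d / n_d) / ((f_b + 1) / n_b)


def support(feature: Feature, genes: Iterable[str],
            gene_items: Dict[str, Set[str]]) -> int:
    """Number of genes whose items contain every item of the feature."""
    return sum(1 for g in genes if feature <= gene_items.get(g, set()))


def gene_score(gene: str, features: Iterable[Feature],
               gene_items: Dict[str, Set[str]],
               disease: Set[str], background: Set[str]) -> float:
    """Highest feature score among the features associated with the gene.

    Assumption: a gene associated with no feature gets 0.0.
    """
    items = gene_items.get(gene, set())
    is_training = gene in disease
    best = 0.0
    for f in features:
        if not f <= items:
            continue  # the gene is not associated with this feature
        f_d = support(f, disease, gene_items)
        f_b = support(f, background, gene_items)
        score = feature_score(f_d, len(disease), f_b, len(background),
                              is_training)
        best = max(best, score)
    return best


def combined_score(network_score: float, annotation_score: float,
                   alpha: float = 0.4) -> float:
    """Final gene score: alpha * S(n) + (1 - alpha) * S(a), with 0 < alpha < 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return alpha * network_score + (1.0 - alpha) * annotation_score
```

In use, gene_score would be called once with network-derived features and once with GO-derived features, and the two results passed to combined_score. The default alpha of 0.4 is the value the text reports as giving the best predictive power. The sketch does not guard against a training set of size one, for which the adjusted denominator N_D - 1 would be zero.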
Bioinformatics – Oxford University Press
Published: Dec 29, 2017