Post genome-wide association analysis: dissecting computational pathway/network-based approaches

Abstract

Thousands of genetic associations with diseases have been identified by genome-wide association studies (GWASs), which are conceptually a single-marker-based approach. There are potentially many uses of these identified variants, including a better understanding of the pathogenesis of diseases, new leads for studying underlying risk prediction and clinical prediction of treatment. However, because of inadequate power, GWASs might miss disease genes and/or pathways with weak genetic or strong epistatic effects. Driven by the need to extract useful information from GWAS summary statistics, post-GWAS approaches (PGAs) were introduced. Here, we dissect and discuss advances made in pathway/network-based PGAs, with a particular focus on protein–protein interaction networks that leverage GWAS summary statistics by combining effects of multiple loci, subnetworks or pathways to detect genetic signals associated with complex diseases. We conclude with a discussion of research areas where further work on summary statistic-based methods is needed.

Keywords: genome-wide association, post-GWAS, subnetwork, pathways, biological network, protein–protein interaction

Introduction

Many new genetic associations with diseases have been identified by genome-wide association studies (GWASs) [1], which are conceptually a single-marker-based approach that leverages thousands of genomes and/or sequences of sick and healthy individuals to detect genetic polymorphisms with genome-wide significant differences in allele frequency. However, GWASs have shown some limitations. For example, the translation of associated variants into biological hypotheses suitable for further investigation in the laboratory has not been as successful as initially anticipated [2, 3].
Another important challenge is determining how multiple, modestly associated genetic variants interact to influence a phenotype [4]. Furthermore, because the effect of a genetic polymorphism is viewed in isolation, GWASs may fail to reveal the true contribution of the detected genetic signal if the effects of other variants are not taken into account [2–8]. Genes can influence each other, for example, through enhancement or hindrance, by a process known as epistasis [5–7]. This can occur directly at the genomic level, where a gene could encode a transcriptional repressor that prevents transcription of other genes in the same or different biological pathways. Thus, biological networks, especially protein–protein interaction (PPI) networks, play critical roles in elucidating the causes of disease [2, 5, 7].

Given the challenges posed by current GWAS approaches, post-GWAS approaches (PGAs) have been developed [3, 4, 8]. They are driven by the need to extract useful information from GWAS summary statistics (summary association statistics), comprising allelic single-nucleotide polymorphism (SNP) effect sizes (log odds ratios for case-control traits) together with the associated standard errors, z-scores or P-values [8, 9]. Post-GWAS methods can broadly be clustered into three main categories, namely, single-variant-based post-GWAS, gene-scoring post-GWAS and pathway/network-based approaches [2, 5, 10]. Pasaniuc et al. have recently reviewed progress on single-variant and gene-scoring association PGAs [10–12]. However, there is no review of the progress made in pathway/network-based PGAs. Supplementary Table S1 provides a list of software tools for each post-GWAS category.

Here, we dissect the pathway/network-based approaches that leverage GWAS summary statistics within PPI networks. We explore some critical concepts and current methods and discuss some of the technical differences as well as some common challenges across these approaches.
Use of biological networks in post-GWAS

Proteins, which are gene products, perform a critical range of biological functions in an organism through interactions with the cellular environment and the promotion of growth and functioning of the cell [12, 13]. Biological functioning can also be affected at the level of RNA. For example, noncoding RNAs such as microRNA and long noncoding RNA are important for maintaining the right cellular environment, allowing proteins to catalyze critical processes [14]. This suggests that any disease outcome or drug response requires the concerted biological action of many genes (and RNAs) involved in diverse processes and/or biological pathways. Identifying the biological functions of disease-related genes will substantially increase our understanding of the biological mechanisms involved in disease pathogenesis.

The limitations of the conventional single-marker-based approach for GWAS have necessitated the exploration and development of alternative or complementary approaches [3], including PGA (Table 1). The main assumption behind PGA is that, although the association signals (GWAS summary statistics) of several variants involved in disease etiology may be too small to detect using the conventional single-marker-based approach, they may be detected collectively from the combined effect of multiple variants in genes that interact or are grouped by shared function within subnetworks, or of variants within biological pathways. Over recent years, several different approaches have been developed to detect the significance of a biological pathway from a collection of GWAS SNPs and to adjust for multiple testing at the pathway level [26, 27] (Supplementary Text S1). PGAs can broadly be classified into three categories with respect to the overall strategy used (Supplementary Table S1): (i) single-variant-based, (ii) gene-scoring and (iii) pathway/network-based PGAs [26–28].
Details of these approaches can be found in Supplementary Text S1. PGAs can be further characterized by the way the null hypothesis is tested or formulated [29]; details can be found in Supplementary Text S2. Based on the null hypothesis, PGAs are characterized as competitive (or enrichment) and self-contained (or association) methods. The distinction between these two hypothesis-based methods is important, and the key differences between them emanate from the null hypothesis stated [29].

It is worth noting that these PGAs have been designed for different applications: (a) meta-analysis, (b) conditional association and imputation using summary statistics, (c) fine-mapping causal variants by integrating functional annotations and/or trans-ethnic data, (d) polygenic prediction of disease risk and inference of polygenic architectures, (e) recovering association signals from multiple variants and (f) analyzing multiple variants, traits or phenotypes. As mentioned in the introduction, progress on single-variant-based and gene-scoring PGAs was recently reviewed by Pasaniuc et al. [1, 8].

The pathway/network-based PGAs are distinguished by whether they (1) conduct the association on subnetworks of an interactive PPI network (the subnetworks are obtained with approaches that search for subnetworks in a network) or (2) conduct the association on pathways (where each pathway contains information on its genes). Both categories of pathway/network-based PGAs require a gene-based score or association summary statistics for each gene [11, 12]. Figure 1 illustrates the summary workflow of these pathway/network-based PGAs. Subsequent sections provide a detailed discussion of the steps illustrated in Figure 1.

Table 1.
Some existing online databases for retrieving comprehensive PPI maps, including PPIs within an organism (intraorganism map) or between organisms (interorganisms maps)

Database | Description | Data type | URL (ref.)
STRING | Search Tool for the Retrieval of Interacting Genes/Proteins | Integrated sets of known and predicted protein–protein functional interactions for several organisms | [15]
BioGRID | Biological General Repository for Interaction Data sets | Physical and genetic interactions | [16]
DIP | Database of Interacting Proteins | Experimentally identified interactions between proteins | [17]
IntAct | Database of protein interaction data | Literature or direct data deposition-based molecular interaction | [18]
HPRD | Human Protein Reference Database | Manually curated PPIs | [19]
MINT | Molecular Interaction database | Experimentally verified and literature-based curated PPIs | [20]
MIPS | Mammalian Protein–Protein Interaction Database | Manually curated high-quality PPI data | [21]
HPIDB | Host–Pathogen Interaction Database | Experimentally verified and predicted host–pathogen interactions | [22]
HPI-base | Host–Pathogen Interactions | Literature-based host–pathogen interactions | [23]
PATRIC | Pathosystems Resource Integration Center | Experimentally inferred host–pathogen interactions | [24]
VirHostNet | Virus–Host Network release | Biocurated virus–host molecular interactions | [25]

Figure 1. A summary workflow of the pathway/subnetworks for PGAs.

How are PPI data sets obtained?

Cells are functional units of life and each protein contributes to different biological processes, which, in turn, act in diverse biological pathways. Under a given condition, a biological pathway is a series of events (interactions amongst molecules or biological processes) occurring sequentially in a cell to change the cellular environment, ensuring stability. Information on biological processes and pathways is stored in bioinformatics resources and can be retrieved with bioinformatics tools. These pathway databases cover most of the known metabolic and regulatory pathway maps for several genomes. Generally, PPIs can be detected experimentally or by computational analysis.
Physical interactions are detected using direct experimental techniques, such as pull-down assays, co-immunoprecipitation or tandem affinity purification, high-throughput mass spectrometry techniques [30] and high-throughput yeast two-hybrid (Y2H) screens [31]. Other functional interactions can be inferred from biological knowledge, such as co-expression data from microarray analysis, text mining, shared evolutionary history based on sequence data (sequence similarity or shared domains) and genomic context (conserved genomic neighborhood or gene order, gene fusion events, gene co-occurrence or phylogenetic profiles across genomes) [14, 32]. The PPI databases are publicly available and freely accessible via Web interfaces, and a partial list of these is shown in Table 1. Although several PPI resources and databases exist, biological pathways can also be predicted through a scoring approach and genome-wide PPI network analysis (Supplementary Text S3).

Characterizing individual proteins in a PPI network

A biological network is a graph that models a biological system as an entity composed of subunits, and it has become a useful tool for integrating different biological data into a single framework [26]. Types of biological networks include signaling networks, gene regulatory or DNA–protein interaction networks [12, 32]; disease–gene networks linking diseases to genes; and drug interaction networks connecting drugs to their targets [15, 16, 33]. The most used biological network is the PPI network, which has been applied in different biomedical applications for knowledge discovery, including prediction of protein function [17–19] and of disease-associated genes [14, 20, 21, 26, 27, 31], filtration and prioritization of protein targets [20], drug discovery [21–24, 29], drug resistance analysis [23] and detection of subnetworks or modules underlying disease risk [12]. It is worth noting that most PPI networks are not disease specific [10, 12, 19].
A PPI network is defined as a set of nodes (or vertices), representing proteins (or genes), connected by undirected edges (or links), representing the interactions or relationships between them (either direct physical or functional interactions) [2, 10]. Several types of PPI networks exist, and when they are integrated in a single network, the relationships between proteins are referred to as functional interactions or connections. Characterizing individual proteins in the context of an integrated or unified PPI network is important to understand how proteins function at the systems level, and GWAS results may be mapped to such a PPI network to identify disease-associated subnetworks. In this context, the PPI network is modeled as a graph, denoted by $G = (V, E)$, where $V$ is the set of vertices (genes or proteins) and $E$ the set of edges (interactions). Network centrality measures such as degree or connectivity, eccentricity, betweenness, closeness and eigenvector centrality [32] can be used to numerically characterize the importance of proteins in the network and are described below:

1. Degree or connectivity centrality characterizes the importance of the protein in the network based on the number of proteins connected to it. For a given protein $p \in V$, the degree of $p$, denoted $C_{deg}(p)$, is given by:

$$C_{deg}(p) = \sum_{q \in V} \delta(p, q), \tag{1}$$

where $\delta(p, q)$ is 1 if $p$ is functionally linked to $q$, and 0 otherwise. Note that if proteins in $V$ are numbered from 1 to $n$, then the matrix $A = (a_{pq})_{1 \le p, q \le n}$ with $a_{pq} = \delta(p, q)$ is referred to as an adjacency matrix. A protein with a large number of functional connections is considered to be a key protein, as it may contribute to several processes in the system.

2. Eccentricity centrality shows how easily accessible a node is from other nodes, expressing its capability to quickly communicate with other proteins in the network. For a given protein $p \in V$, the eccentricity, denoted $C_{ecc}(p)$, is the length of the longest shortest path departing from the protein $p$, i.e.

$$C_{ecc}(p) = \max\{\gamma_{pq} : q \in V\}, \tag{2}$$

where $\gamma_{pq}$ is the length of the shortest path from $p$ to $q$.

3. Closeness centrality assesses the essentiality of a protein $p$ based on how it keeps other proteins close to each other, thus speeding up the spread of biological information in the network. The closeness score, $C_{clos}(p)$, of protein $p$ is given by:

$$C_{clos}(p) = \frac{1}{\frac{1}{n_c - 1} \sum_{q \in V} \gamma_{pq}}, \tag{3}$$

where $n_c$ is the number of proteins in the connected subnetwork that contains the protein $p$.

4. Betweenness centrality scores the ability of a protein to maintain the transmission of biological information between other proteins in the network based on the total number of shortest paths passing through the protein. The betweenness score of protein $p \in V$, denoted $C_{bet}(p)$, is given by:

$$C_{bet}(p) = \sum_{s \ne p \ne t,\; s \ne t} \frac{\sigma_{st}(p)}{\sigma_{st}}, \tag{4}$$

with $\sigma_{st}(p)$ the number of shortest paths between the protein pair $(s, t)$, $s \ne t$, that pass through $p$, and $\sigma_{st}$ the total number of shortest paths from $s$ to $t$.

5. Eigenvector centrality assigns weight to a protein based on how influential the proteins connected to it are. In this measure, the relevance of a protein depends on the quality of its neighbors rather than the number of its connections. The eigenvector score, $C_{eig}(p)$, of protein $p \in V$ is calculated as follows:

$$C_{eig}(p) = \frac{1}{\lambda} \sum_{q \in V} a_{pq} C_{eig}(q), \tag{5}$$

where $\lambda$ is the largest eigenvalue of the adjacency matrix $A$. It follows that $A C_{eig} = \lambda C_{eig}$, where the vector $C_{eig} = (C_{eig}(q))_{q \in V}^{T}$ is the eigenvector of $A$ associated with $\lambda$.

The topological structure of a PPI network provides information on the pattern associated with the general behavior of the biological system under consideration, which may clarify the global role and biological relevance of individual proteins in the network. Therefore, determining the topological structure of the PPI network can help understand the biological mechanisms underlying the functioning of the organism, including cellular organization and processes.
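As an illustration of Equations (1) to (4), the following is a minimal sketch in plain Python that computes degree, eccentricity, closeness and betweenness on an unweighted, undirected graph stored as an adjacency dictionary. The function names and the four-protein toy network are hypothetical and are not taken from any of the cited tools. Betweenness uses Brandes' accumulation scheme and, as an assumption, counts each unordered pair (s, t) once. Eigenvector centrality (Equation 5) is omitted for brevity.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def degree(adj, p):
    """Equation (1): number of proteins linked to p."""
    return len(adj[p])


def eccentricity(adj, p):
    """Equation (2): longest shortest path departing from p."""
    return max(bfs_distances(adj, p).values())


def closeness(adj, p):
    """Equation (3): inverse mean distance within p's connected component."""
    dist = bfs_distances(adj, p)
    n_c = len(dist)  # size of the component containing p
    if n_c == 1:
        return 0.0  # isolated protein: no partners to be close to
    return (n_c - 1) / sum(dist.values())


def betweenness(adj):
    """Equation (4) for all proteins, via Brandes' algorithm.

    Assumption: each unordered pair (s, t) is counted once, so the
    accumulated scores are halved for an undirected graph.
    """
    score = {v: 0.0 for v in adj}
    for s in adj:
        order = []                      # nodes in order of discovery
        preds = {v: [] for v in adj}    # predecessors on shortest paths
        sigma = {v: 0 for v in adj}     # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    return {v: c / 2.0 for v, c in score.items()}


# Hypothetical toy network: a chain of four proteins A - B - C - D.
toy = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
print(degree(toy, "B"), eccentricity(toy, "A"), closeness(toy, "B"))
print(betweenness(toy))
```

On a genome-scale PPI network, a dedicated graph library would normally be preferable; the sketch is only meant to make the definitions concrete.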
In fact, it has been observed that, in general, biological networks exhibit scale-free properties [15, 16, 32], meaning that their degree distribution, which is the probability that a randomly selected protein is connected to $k$ proteins (is of degree $k$) in the network, approximates a power law, i.e.

$$P(k) \sim k^{-\gamma}, \tag{6}$$

where the power exponent $\gamma$ is a constant characteristic of the network. This distribution is independent of the number of nodes; thus, the networks are said to be scale free. In scale-free networks, the probability that a protein has a number of links larger than the mean degree of all proteins in the network is small [16]. This indicates that scale-free networks are heterogeneous, with few proteins highly connected and many proteins having only a few interacting partners [16]. This is in contrast to random networks, which are homogeneous: proteins have roughly the same degree, and the degree distribution follows a Poisson distribution. For example, Figure 2A shows the degree distribution of the human (Homo sapiens) PPI network, which has scale-free topological properties. In this plot, we observe that although some proteins have many interacting partners, most of them have only a few partners. It is worth mentioning that these properties hold even for other model organisms, as well as pathogenic organisms, such as Mycobacterium tuberculosis, suggesting that the subnetwork-based post-GWAS can be applied or adapted to other organisms, where possible [10, 13, 17, 18]. Proteins participating in many interactions are referred to as ‘high degree’ or key proteins or hubs and have a high impact on the topological structure of the network, possibly ensuring the completion of basic chemical operations essential for the survival of the organism, such as energy transfer and redox reactions [16, 32].
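To make the degree distribution in Equation (6) concrete, the sketch below tabulates P(k) from an adjacency dictionary and estimates the exponent γ from the slope of a least-squares line through the points (log k, log P(k)). Both function names are hypothetical. The log-log regression is a deliberately simple assumption chosen for illustration; it is known to be a crude estimator on noisy empirical data, where maximum-likelihood fitting is generally preferred.

```python
import math
from collections import Counter


def degree_distribution(adj):
    """Return {k: P(k)}, the fraction of proteins that have exactly k partners."""
    n = len(adj)
    counts = Counter(len(partners) for partners in adj.values())
    return {k: c / n for k, c in sorted(counts.items())}


def fit_power_law_exponent(p_k):
    """Estimate gamma in P(k) ~ k**(-gamma) by least squares on the log-log scale.

    Assumption: a simple regression is adequate for illustration only.
    Degrees k = 0 and empty bins are skipped because their logarithm is undefined.
    """
    points = [(math.log(k), math.log(p)) for k, p in p_k.items() if k > 0 and p > 0]
    if len(points) < 2:
        raise ValueError("need at least two distinct positive degrees")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx  # the slope of the fitted line is -gamma


# Hypothetical star-shaped toy network: one hub linked to four proteins.
star = {"hub": {"a", "b", "c", "d"}, "a": {"hub"}, "b": {"hub"},
        "c": {"hub"}, "d": {"hub"}}
print(degree_distribution(star))
```

Only the slope matters for the estimate, so the input to `fit_power_law_exponent` does not need to be normalized.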
Importantly, hub proteins are less likely to mutate, as their high connectivity is related to their functions [17], which are highly conserved and essential for the survival of the organism under consideration [18]. These networks have a higher error tolerance or resistance to random node failure or perturbations but are often vulnerable to targeted disruption or removal of hubs, which play a major role in maintaining the network's connectivity [17, 25]. This also implies that the average network clustering coefficient is significantly higher in the PPI network than in a random network [32]. This average network clustering coefficient [32, 33] is given by:

$$cc = \frac{1}{n} \sum_{p \in V} c_p, \tag{7}$$

where $n$ is the number of proteins in the network, and $c_p$ is the clustering coefficient of protein $p$, that is, the ratio of the actual number of interacting partners, $n_p$, of protein $p$ to the total number of possible interacting partners of the protein $p$ in the network, given by [13, 18]:

$$c_p = \frac{2 n_p}{n(n - 1)}. \tag{8}$$

Figure 2. (A) Protein connectivity or degree distribution in human PPI networks. Circles represent the frequency P(k) of observing a protein interacting with k partners in a network. The solid line plots the power law function approximating the connectivity distribution. (B) Path-length distribution in human PPI networks. The histogram represents the path-length distribution, i.e. the frequency of occurrence of shortest paths of a given length, and the dashed line is the normal distribution approximating the path-length distribution.

Another special property characterizing PPI networks is the ‘small world’ property, i.e. the transmission of biological information between any two proteins is achieved through only a few steps, or a much shorter path than would be expected in a random network of similar size and order [16]. This property provides insight into the network navigability, indicating how fast information can be spread in the system independently of the number of proteins [16, 32]. Figure 2B shows the path-length distribution for human (H. sapiens). The plot indicates that the average path length ranges between 3 and 4 hops, independent of the size (number of edges) and order (number of genes or proteins) of the network of the organism under consideration. This suggests that the spread of biological information in these systems is relatively fast, which is important, especially for pathogenic organisms, as they need to survive and adapt to environmental niches. This ‘small world’ property may also provide the organism with an evolutionary advantage in the sense that it would be able to respond efficiently to changes in the environment and quickly exhibit a qualitative change of behavior in response to these perturbations [10].

It is common that a specific network property involves only some parts of the network rather than the whole network. These are referred to as subnetwork features, and the subnetworks can be either motifs or modules [19, 34]. Classifying proteins into subnetwork modules and functional motifs from a PPI network can help understand the biological mechanisms underlying the system and the organization and dynamics of cell functions. To discover hidden information in a complex network, several approaches exist to detect modules [21, 35] and functionally related proteins [21].
Examples of these approaches are provided in Boxes 1 and 2. The next section discusses techniques used in leveraging topological properties of a PPI network to search for subnetworks.

Box 1: The procedure to generate a subnetwork based on topological structure of the network [10]

Input: A mapped network G, containing genes weighted by either P-values or z-scores and LD.

1. From the network G, find structural hubs and connected components.
2. For each gene, compute the betweenness, the closeness and the eigenvector scores.
3. For each centrality score, compute the cutoff for central genes of subgraphs. A gene is a hub if its score is greater than or equal to a user-defined cutoff.
4. A gene is a central gene if it is a hub for all the four scoring measures in Step (3).
5. For each central gene, search its neighbors for n iterations, or the mean shortest path.
6. The central gene and its neighbors constitute a subnetwork of the network G.

Output: subnetworks.

Box 2: Greedy algorithm to search for a subnetwork [2]

1. Assign a seed subnetwork S and calculate the subnetwork score $S_m$ of S. Initially, the seed subnetwork is a single gene.

$$S_m = \beta \frac{\sum_{e \in E} \mathrm{edgeweight}(e)}{\psi} + (1 - \beta) \frac{\sum_{v \in V} \mathrm{nodeweight}(v)}{\gamma},$$

where $E$ and $V$ represent the edges and nodes of the module, $\beta$ is a parameter between 0 and 1 to balance GWAS and weight signals from either gene expression or LD structure, and $\psi$ and $\gamma$ are the total number of edges and nodes in $E$ and $V$, respectively.

2. Examine all the first-order neighbors of S, and identify the neighbor node $N_{max}$ that generates the maximum increment of the subnetwork score.
3. Add $N_{max}$ to the current subnetwork S if the score increment is greater than $S_m \cdot r$, where $r$ is a parameter that decides the magnitude of increment.
4. Repeat Steps 1–3 until no more neighbors can be added.

Mapping SNP-based GWAS summary statistics and searching for subnetworks

Highly connected genes in PPI networks can be functionally important, and the removal of such genes is related to lethality.
Deleterious variants in such genes are usually observed in aborted fetuses [36]. Consider an undirected weighted PPI network, G = (V, E), where V is the set of n genes as nodes, and E is the set of edges as interactions found between genes. Current approaches use different link weights, such as gene correlation from genotype data [10], gene expression-based weights [2], and functional and topological weights [10]. The following steps are executed sequentially to search for subnetworks in networks:

Mapping SNPs to genes and identifying weights

Map single SNP-based GWAS summary statistics to the respective genes. These GWAS summary statistics are usually assigned to a given gene in different ways: (a) if SNPs are located within the primary gene transcript and/or within a user-defined base-pair distance downstream or upstream of a specific gene; (b) if they are in linkage disequilibrium (LD) within a specified boundary cutoff within a gene and (c) closest SNPs within the gene at zero distance.

Attribute scores or weights to functional interactions in the PPI network based on the weighting scheme under consideration. Here, several types of weights can be considered, including LD based on the reference genotype data set [10], gene co-expression [2] and scores from the network topological structure.

Searching subnetworks from weighted network

There are two major approaches applied in PGA to search for subnetworks in weighted networks: (a) Analyze the general topological properties of the PPI network and quantify the usefulness of each gene using centrality measures to cluster the network into subnetworks [10]. Box 1 illustrates how the topological properties of a network (see section above) are used to obtain clustered nodes in subnetworks. (b) A greedy algorithm (described in Box 2) is used in some software tools to search for dense subnetworks [2]. The obtained subnetworks can be used to perform the association test as discussed in the section below.
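The greedy search of Box 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used by any published tool: the function names, the toy weights and the default values of `beta` and `r` are hypothetical, node weights are taken to be positive gene-level scores, edge weights are keyed by `frozenset` pairs, and the edge term of a single-gene seed (which has no edges) is set to zero.

```python
def module_score(nodes, node_w, edge_w, beta):
    """Box 2 score: beta * mean edge weight + (1 - beta) * mean node weight.

    Assumption: a module without internal edges gets an edge term of 0.
    """
    inner = [w for pair, w in edge_w.items() if pair <= nodes]
    edge_term = sum(inner) / len(inner) if inner else 0.0
    node_term = sum(node_w[v] for v in nodes) / len(nodes)
    return beta * edge_term + (1.0 - beta) * node_term


def greedy_subnetwork(seed, adj, node_w, edge_w, beta=0.5, r=0.1):
    """Grow a module from one seed gene, following Steps 1 to 4 of Box 2."""
    module = {seed}
    current = module_score(module, node_w, edge_w, beta)
    while True:
        # Step 2: examine all first-order neighbors of the current module.
        neighbors = {v for u in module for v in adj[u]} - module
        best, best_score = None, current
        for v in sorted(neighbors):  # sorted only to make ties deterministic
            s = module_score(module | {v}, node_w, edge_w, beta)
            if s > best_score:
                best, best_score = v, s
        # Step 3: accept only if the increment exceeds current score * r.
        if best is None or best_score - current <= current * r:
            return module, current
        module.add(best)
        current = best_score


# Hypothetical toy data: node weights could be, e.g., -log10 gene P-values.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
node_w = {"A": 3.0, "B": 2.5, "C": 0.2, "D": 0.1}
edge_w = {frozenset("AB"): 0.9, frozenset("BC"): 0.1, frozenset("CD"): 0.5}
print(greedy_subnetwork("A", adj, node_w, edge_w))
```

In this toy run the seed A absorbs B, whose strong node and edge weights raise the score, and the search then stops because adding C would lower it.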
Central genes within a given subnetwork, in association with complex disease susceptibility genes, are cores of biological subnetworks and are linked to other genes in that subnetwork via a few steps (paths in the network) [2, 10]. These centers are structural hubs with network centrality scores beyond a certain user-defined threshold value.

Pathway/subnetwork association from GWAS data

Most association tests at the subnetwork or pathway level require gene association summary statistics (gene scores) and a biological network that has been split into subnetworks [10] or dense modules [2], as discussed in the previous section.

Gene-based scoring

Different methods are generally used to combine association signals of SNPs to assess the association of the gene with the phenotype. These include (1) using the minimum SNP-specific P-value (or the maximum test statistic) as the P-value for the gene (Sidak’s combination test) [2, 10], and (2) using an SNP-specific summary measure within the gene, such as Fisher’s or Simes’ combination test [2, 10, 29]. Many gene scores in pathway/subnetwork-based PGA methods have implemented the minimum P-value (or the maximum test statistic) of all the SNPs within a gene [2, 10, 19, 33]. However, when several distinct SNPs in the gene contribute to the overall association signal, and all have a modest effect on the phenotype, using the minimum P-value may not be the best or most powerful approach to capture such information. In addition, genes with more SNPs are likely to have smaller minimum P-values compared with genes with fewer SNPs [19]. Let us assume that multiple SNPs within a gene contribute to the overall association of the gene. We also assume an independent and uniform distribution of the P-values $p_i$ for the corresponding test statistic, $T_i$, testing the i-th marker, under the null hypothesis, although this assumption of independence may be violated because of LD among SNPs within the gene.
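Under the independence and uniformity assumption just stated, the dependence of the minimum P-value on the number of SNPs in a gene can be written down directly. The short derivation below is a worked illustration added for clarity; the symbols $U_i$ and $\omega$ are introduced here for that purpose only.

```latex
% U_1, ..., U_k: independent Uniform(0,1) P-values of the k SNPs in a gene (null hypothesis)
\begin{align*}
P\Big(\min_{1 \le i \le k} U_i \le \omega\Big)
  &= 1 - \prod_{i=1}^{k} P(U_i > \omega) = 1 - (1 - \omega)^{k}, \\
E\Big[\min_{1 \le i \le k} U_i\Big]
  &= \int_{0}^{1} (1 - \omega)^{k} \, d\omega = \frac{1}{k + 1}.
\end{align*}
```

For example, a gene with 9 independent null SNPs has an expected minimum P-value of 0.1, whereas a gene with 99 such SNPs has an expected minimum of 0.01, which is why an uncorrected minimum P-value favors genes with many SNPs.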
If we consider a continuous monotonic function $H$, then a transformation of the P-value, $p_i$, of the i-th SNP is given by:

$$Z_i = H^{-1}(1 - p_i). \tag{9}$$

Below we describe four methods that enable the combining of P-values for all SNPs within a gene: (a) the maximum test statistic (Sidak’s combination test), (b) Fisher’s combination test, (c) Simes’ combination test and (d) the false discovery rate (FDR) method [2, 3, 10].

(a) Sidak’s combination test

Considering only the SNP with the maximum test statistic (the best SNP), we can define the statistic $Z_B = \min_i p_i$, which is distributed as $P(Z_B \le \omega) = 1 - (1 - \omega)^k$, where $k$ is the number of independent SNPs, and $\omega$ is the type I error rate.

(b) Fisher’s combination test

The statistic to combine $k$ independent P-values, or to combine information from $k$ SNPs, is given by:

$$Z_F = -2 \sum_{i=1}^{k} \log(p_i), \tag{10}$$

which follows a chi-squared distribution, $\chi^2_{2k}$, with $2k$ degrees of freedom.

(c) Simes’ combination test

Let the $p_i$ be ordered as $p_{(1)} \le p_{(2)} \le \cdots \le p_{(k)}$. The combined P-value is given by:

$$p = \min_i \left\{ \frac{k \, p_{(i)}}{i} \right\}. \tag{11}$$

(d) FDR method

Let $F(\alpha)$ be the expected proportion of tests yielding P-values less than or equal to $\alpha$. Suppose the set $\{p_1, \ldots, p_k\}$ contains $d$ different P-values, $\tilde{p}_j$, $j = 1, \ldots, d$, such that $\tilde{p}_1 < \tilde{p}_2 < \cdots < \tilde{p}_d$. Let $m_j$ be the number of P-values equal to $\tilde{p}_j$ in the set. Then, the estimate of the expected proportion $F(\alpha)$ is given by:

$$\hat{F}(\alpha) = \frac{1}{k} \sum_{j=1}^{d} I(\tilde{p}_j \le \alpha) \, m_j, \tag{12}$$

where $I$ is an indicator function. For a one-sided test (chi-square test or trend test), consider $\hat{\pi} = \min(1, 2\bar{a})$, and for a two-sided test, consider $\hat{\pi} = \min(1, 2\bar{p})$, where $\bar{p} = \frac{1}{k} \sum_{i=1}^{k} p_i$, $\bar{a} = \frac{1}{k} \sum_{i=1}^{k} a_i$ and $a_i = 2\min(p_i, 1 - p_i)$. The estimate $\hat{v}(\alpha)$ of the expected proportion $v(\alpha)$ of tests resulting in false positives when $\alpha$ is used as the P-value threshold for significance is given by $\hat{v}(\alpha) = \alpha \hat{\pi}$, where $\hat{\pi}$ is the estimate of the proportion $\pi$ of tests with a true null hypothesis.
Therefore, the FDRs are expressed as ratios of the form:

$$t_{(i)} = \frac{\hat{v}(p_{(i)})}{\hat{F}(p_{(i)})} \quad (13)$$

and $q_{(i)} = \min\{t_{(j)} : j \ge i\}$, where $q_{(1)} \le q_{(2)} \le \dots \le q_{(k)}$ are the ordered q-values (FDRs). Finally, $q_{(1)} = \min\{t_{(j)} : j \ge 1\}$ is the false discovery rate.

Pathways/subnetwork association

In this section, we distinguish two existing strategies for pathway/subnetwork association, namely, one- and two-step approaches [29]. In the one-step approach, all the SNPs in a pathway are used without any consideration of gene-level scores. The two-step approach first uses the SNPs in each gene to assess the association with the gene and then combines the gene-level tests to evaluate the association of the phenotype with the pathway [22, 23]. Both the one-step and two-step methods have advantages and disadvantages, and their efficiency depends mostly on the underlying disease-causing mechanisms, which are generally unknown. To handle the independence assumption and the correlation of P-values among neighboring genes, many pathway/subnetwork association methods use (a) Fisher's combined probability, (b) Stouffer–Liptak [37, 38] and (c) chi-square score methods, which account for spatial correlations among SNPs/genes within a given subnetwork or pathway. These commonly used association tests are described below.

(a) Fisher's combined probability test

Under the null hypothesis, the P-values $p_i$, $i = 1, \dots, L$, for a test statistic with a continuous null distribution are uniformly distributed in the interval (0, 1). In this framework, a parametric cumulative distribution function $F$ can be chosen, and the P-values can be transformed into quantiles according to $q_i = F^{-1}(p_i)$, $i = 1, \dots, L$. The combined test statistic $C_p = \sum_{i=1}^{L} q_i$ is a sum of independent and identically distributed random variables $q_i$, each of which follows the probability distribution corresponding to $F$. Let $\hat{F}$ be the cumulative chi-square distribution and $p_i$, $i = 1, 2, \dots, L$, the P-values of SNPs associated with a gene or a subnetwork.
Taking $F$ to be the chi-square distribution with 2 degrees of freedom and applying the quantile transformation to $1 - p_i$ (which is also uniform under the null hypothesis), so that small P-values yield large quantiles, we obtain Fisher's combined test statistic [38–42]

$$C_p = -2 \sum_{i=1}^{L} \log(p_i),$$

which has a chi-square distribution with $2L$ degrees of freedom because of the additive property of independent chi-square variables. The combined P-value is the upper tail probability $\hat{p} = 1 - \hat{F}(C_p)$, where $\hat{F}$ is the cumulative distribution function of the chi-square distribution with $2L$ degrees of freedom.

Suppose the $p_i$ are dependent. We can then approximate the distribution of Fisher's combined statistic $C_p$ by a scaled chi-square distribution, $C_p \approx c\,\chi^2_f$, where $c$ is a scaling factor and $f$ is the number of degrees of freedom. It follows that $E(C_p) = E(c\,\chi^2_f) = c f$ and $\mathrm{var}(C_p) = \mathrm{var}(c\,\chi^2_f) = 2 c^2 f$. Equating these to $E(C_p) = 2L$ and $\mathrm{var}(C_p) = 4L + 2\sum_{i<j} \mathrm{cov}[-2\log(p_i), -2\log(p_j)]$ and solving for $c$ and $f$, we obtain:

$$c = \frac{4L + 2\sum_{i<j} \mathrm{cov}[-2\log(p_i), -2\log(p_j)]}{4L} \quad (14)$$

and

$$f = \frac{2(2L)^2}{4L + 2\sum_{i<j} \mathrm{cov}[-2\log(p_i), -2\log(p_j)]}. \quad (15)$$

The combined P-value for $C_p$ is then determined from the approximating distribution $C_p / c \approx \chi^2_f$. To compute the covariance terms in $\mathrm{var}(C_p)$, one can apply the approximations described in [39], which were obtained by fitting a polynomial regression to the true values over a grid of degrees of freedom ($9 \le \nu \le 125$) and correlations ($-0.98 \le \rho \le 0.98$). The q-value score can then be obtained using the Benjamini–Hochberg false discovery correction [40].

(b) Stouffer–Liptak test

Let $F$ be the cumulative standard normal distribution N(0, 1), that is, $F(x) = \Phi(x)$, so that $q_i = \Phi^{-1}(p_i)$. Each $q_i$, $i = 1, \dots, L$, follows a standard normal distribution, and using the additive property of independent normal random variables, Liptak's combined test statistic [41] is:

$$C_p = \frac{\sum_{i=1}^{L} q_i}{\sqrt{L}}, \quad (16)$$

and the related combined P-value is $\hat{p} = \Phi(C_p)$. To adjust for P-value dependency, let us assume that the $p_i$, $i = 1, \dots, L$, are correlated according to the correlation matrix $\Sigma$, which is positive definite and nondegenerate. Thus, the Cholesky factor $C$ exists such that $\Sigma = C C^{T}$.
The correlated quantiles $Q = (q_1, \dots, q_L)^{T}$ are transformed into independent quantiles $\hat{Q}$ as done by Zaykin et al. [42]: the transformation $\hat{Q} = C^{-1} Q$ yields components that are independent and follow a standard normal distribution [42, 43]. Finally, the Stouffer–Liptak test is applied to $\hat{Q}$. As for Fisher's combined probability test above, a q-value can be computed on a null model from shuffled P-values.

(c) Chi-square score method

This test has recently been used in a PPI-free method (Figure 1) that leverages gene information in biological pathways to perform the association [44]. The pathway-based chi-square test also uses gene score P-values [44]. Box 3 describes the steps to obtain the gene scores and to perform the pathway association test [44].

Box 3: PPI-free pathway chi-square score methods [44]

Chi-square method:
(1) Gene score P-values are ranked such that the lowest P-value gets the highest rank.
(2) The rank value is divided by the number of genes plus 1 to obtain a uniform distribution.
(3) The uniform values are transformed by the chi-square quantile function to obtain chi-square distributed gene scores.
(4) The chi-square-based gene scores of a given pathway of size $m$ are summed and tested against a $\chi^2_m$ distribution.

Empirical sampling method:
(1) Gene score P-values are directly transformed with a chi-square quantile function to obtain new gene scores: $F^{-1}_{\chi^2_1}(1 - P)$.
(2) A raw pathway score for a pathway of size $m$ is computed by summing the transformed gene scores of all pathway genes.
(3) A Monte Carlo estimate of the P-value is obtained by sampling random gene sets of size $m$ and calculating the fraction of sets reaching a higher score than the gene set of the given pathway [44].
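As a hedged illustration of the gene-level combination tests (a) to (c) from the gene-based scoring subsection, the following Python sketch computes Sidak, Fisher and Simes gene scores from a list of SNP P-values. It assumes independent SNPs, as the closed-form null distributions require. The function names and the helper `chi2_sf_even` are hypothetical and are not taken from any of the reviewed tools; Simes' rule is written in its standard form, the minimum over the ordered P-values.

```python
import math


def chi2_sf_even(x, k):
    """Upper tail P(X > x) for a chi-square variable with 2k degrees of freedom.

    For even degrees of freedom the tail has the closed form
    exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!.
    """
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return math.exp(-half) * total


def sidak(pvals):
    """Best-SNP test: P(min p <= observed minimum) = 1 - (1 - p_min)^k."""
    return 1.0 - (1.0 - min(pvals)) ** len(pvals)


def fisher(pvals):
    """Fisher's test: -2 * sum(log p_i) is chi-square with 2k df under the null."""
    stat = -2.0 * sum(math.log(p) for p in pvals)
    return chi2_sf_even(stat, len(pvals))


def simes(pvals):
    """Simes' test: minimum over i of k * p_(i) / i for the ordered P-values."""
    k = len(pvals)
    return min(k * p / i for i, p in enumerate(sorted(pvals), start=1))
```

For instance, `sidak([0.01, 0.5, 0.9])` evaluates $1 - 0.99^3 \approx 0.0297$, whereas `simes([0.01, 0.04, 0.5])` returns 0.03. With SNPs in LD, none of these closed-form P-values is exact, which is the caveat raised in the text.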
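The Stouffer–Liptak test with correlated P-values described above can be sketched in a few lines of pure Python. This is a minimal sketch under stated assumptions: the correlation matrix `sigma` is supplied by the user (for example, estimated from LD), it is positive definite, and the helper names `cholesky`, `solve_lower` and `stouffer_liptak` are hypothetical rather than part of any published tool.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def cholesky(sigma):
    """Lower-triangular C with sigma = C C^T (sigma must be positive definite)."""
    n = len(sigma)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(c[i][k] * c[j][k] for k in range(j))
            if i == j:
                c[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                c[i][j] = (sigma[i][j] - s) / c[j][j]
    return c


def solve_lower(c, q):
    """Forward substitution: returns x with C x = q, that is, x = C^{-1} q."""
    x = []
    for i, row in enumerate(c):
        x.append((q[i] - sum(row[k] * x[k] for k in range(i))) / row[i])
    return x


def stouffer_liptak(pvals, sigma=None):
    """Combined P-value Phi(sum(q_i) / sqrt(L)).

    If sigma is given, the quantiles are first decorrelated with C^{-1}.
    P-values must lie strictly between 0 and 1.
    """
    q = [STD_NORMAL.inv_cdf(p) for p in pvals]
    if sigma is not None:
        q = solve_lower(cholesky(sigma), q)
    return STD_NORMAL.cdf(sum(q) / math.sqrt(len(q)))
```

Calling `stouffer_liptak(pvals)` gives the independent-case statistic of equation (16), and passing `sigma` applies the decorrelation step first. For two concordant small P-values with positive correlation, the adjusted combined P-value is larger than the unadjusted one, reflecting the reduced amount of independent evidence.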
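The empirical sampling method in Box 3 can be illustrated with the sketch below. It is not the implementation of [44]: the function names, the dictionary input format and the default number of samples are assumptions made for illustration, and the add-one correction in the returned P-value is a common Monte Carlo convention that the source does not specify. The chi-square quantile with 1 degree of freedom is obtained from the standard normal quantile, using the fact that the square of a standard normal variable is $\chi^2_1$ distributed.

```python
import random
from statistics import NormalDist


def chi2_1_quantile(u):
    """Quantile function of chi-square with 1 df, via chi2_1 = Z^2."""
    return NormalDist().inv_cdf((1.0 + u) / 2.0) ** 2


def pathway_empirical_p(gene_p, pathway, n_samples=10000, seed=0):
    """Monte Carlo pathway P-value from gene score P-values.

    gene_p  : dict mapping gene name -> gene score P-value, strictly in (0, 1)
    pathway : iterable of gene names; genes without a score are ignored
    """
    # Step 1: transform each gene P-value to a chi-square(1 df) score.
    scores = {g: chi2_1_quantile(1.0 - p) for g, p in gene_p.items()}
    members = [g for g in pathway if g in scores]
    m = len(members)
    # Step 2: raw pathway score is the sum over pathway genes.
    observed = sum(scores[g] for g in members)
    # Step 3: compare against random gene sets of the same size.
    rng = random.Random(seed)
    genes = list(scores)
    hits = 0
    for _ in range(n_samples):
        if sum(scores[g] for g in rng.sample(genes, m)) >= observed:
            hits += 1
    # Add-one correction (assumption) keeps the estimate away from zero.
    return (hits + 1) / (n_samples + 1)
```

Because random gene sets are drawn from all scored genes, the estimate is calibrated against the genome-wide distribution of gene scores, which is why the method adjusts naturally for pathway size.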
Characterization and enrichment of the identified subnetworks

To test the association between each subnetwork $S_j$, $j = 1, \dots, T$, with $n_1, \dots, n_T$ genes, respectively, and a human pathway $P_k \in P$, where $P$ is the set of all available curated human pathways, curated pathways can be obtained from several annotated pathway databases (Table 2). Let $a$ denote the number of genes in the intersection between the genes within $S_j$ and the genes within pathway $P_k$, and $b$ the number of genes in the intersection between the genes within $S_j$ and those in the union of all pathways $P_k$, $k = 1, \dots, K$. Let $n$ be the number of genes in the intersection between the genes in pathway $P_k$ and those in the union of all pathways $P_k$, $k = 1, \dots, K$ with $k \ne j$, and $m$ be the total number of genes in all pathways $P_k$, $k = 1, \dots, K$. The overlap between subnetwork $S_j$ of $n_j$ genes and a given pathway $P_k$ can be scored using the z-score (ZS), which is based on the binomial proportion test and is given by:

$$ZS = \frac{\dfrac{a}{n} - \dfrac{b}{m}}{\sqrt{\dfrac{b}{m^2}\left(1 - \dfrac{b}{m}\right)}}.$$

Table 2. Some existing online databases for retrieving comprehensive biological pathway maps and processes

| Scheme | Description | Types | URL |
| --- | --- | --- | --- |
| KEGG | Kyoto Encyclopedia of Genes and Genomes | Integrated database of manually verified metabolic pathway information | [34] |
| BioCyc | EcoCyc, MetaCyc and derivatives | Literature-based curated and computationally predicted metabolic pathway information | [35] |
| NCI-NATURE | Pathway database | Biomedical database of human cellular signaling pathway | [36] |
| Reactome | Curated pathway database | Curated and peer reviewed database of biological pathway maps | [37] |
| GO | Gene Ontology | A classification system for annotation of genes and gene products using Molecular Function, Biological Process and Cellular Component | [38] |
| BioCarta | A database of gene interaction models | Database of gene interaction models, containing high-quality images of several cellular signaling and interaction pathways and each diagram | [39] |
| WikiPathways | Pathway database | Database of biological pathways maintained by and for the scientific community | [40] |
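The overlap z-score defined before Table 2 can be sketched as a small function. This is a hedged illustration only: it reads the printed statistic as $(a/n - b/m)$ divided by $\sqrt{(b/m^2)(1 - b/m)}$, which is an assumption about the intended formula, and the function name `overlap_z_score` is hypothetical.

```python
import math


def overlap_z_score(a, n, b, m):
    """Binomial-proportion z-score for subnetwork/pathway overlap.

    a : genes shared by the subnetwork and the pathway
    n : pathway genes considered
    b : genes shared by the subnetwork and the union of all pathways
    m : total number of genes in all pathways
    The denominator follows the assumed reading sqrt((b / m^2) * (1 - b / m)).
    """
    p_pathway = a / n        # proportion of pathway genes in the subnetwork
    p_background = b / m     # background proportion over all pathway genes
    se = math.sqrt(p_background * (1.0 - p_background) / m)
    return (p_pathway - p_background) / se
```

A positive score indicates that the pathway contains a larger share of subnetwork genes than expected from the background proportion. The score is undefined when `b` equals 0 or `m`, since the standard error is then zero.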
The above approach not only scores the association between overlapping gene sets and a given pathway but also has the advantage of accounting for the network topological structure of interactions between genes in the subnetwork [10].

Discussion

Over the past decade, we have experienced a shift from single-marker approaches toward whole-genome-based methods in the hope of achieving a more global view of disease etiology [21]. Pathway-based methods provide a mechanism for more sensitive and powerful analysis of GWAS data sets [43]. PGAs have provided a new paradigm for GWAS and might enable a complete characterization of genetic susceptibility to a disease. These PGAs are broadly grouped into three categories (Supplementary Table S1), namely, single-variant association tests, gene-based association tests and pathway/network-based approaches [4, 8].
These methods were designed to leverage GWAS summary statistics for meta-analysis, conditional association and imputation, fine-mapping of causal variants by integrating functional annotations and/or trans-ethnic data, polygenic prediction of disease risk, inference of polygenic architectures and analyses of multiple variants, traits or phenotypes. This manuscript provides pertinent details on PPI and discusses the advances made in pathway/network-based PGAs, with the aim of guiding current users in performing these types of analyses and enabling the development of new methods that can overcome some of the challenges discussed below.

Current challenges and opportunity

Several challenges limit the practical use of pathway/network-based PGAs. While methods for pathway/network-based analysis are rapidly growing in number, several sources of bias must be taken into consideration, including the capacity for strongly associated markers to drive pathway association, the possible effects of SNPs being assigned to multiple genes and, more specifically, bias arising from differences in gene size and in pathway size. These biases are not considered in many of the existing methods (Supplementary Table S1) [7, 34]. Ignoring gene size when assessing gene association signals and when testing for pathway association can lead to inflated type I error rates. Permutation procedures can be used to adjust for pathway size. The results from network/pathway-based approaches are sensitive to the way SNPs are assigned to genes, the weights assigned to edges in the PPI network and the accuracy of the PPI network itself. This is mainly because of the lack of accurate knowledge of complex traits and the incompleteness of the human protein interaction network, which makes it challenging to directly compare the results from different network/pathway-based approaches [45].
Most of these methods do not accept a user-defined network, only a list of SNPs or genes. This makes it impossible to compare these methods directly and to advise on the superiority of one strategy over another. The rapid growth of multi-loci and epistatic association approaches now offers the opportunity to design new pathway/subnetwork-based PGAs, or to update current ones, so that they accept both multi-loci and epistatic association summary statistics as inputs.

Conclusions and perspectives

Several approaches have been designed to study complex genetic diseases by incorporating large-scale data such as transcriptomics, proteomics, genomics and metabolomics. The inclusion of such information in pathway/network-based analysis would provide a critical mechanism for more powerful and sensitive analysis of GWAS data sets [43]. Pathway/network-based approaches that incorporate biological network structures, such as hubs and motifs, into the analysis of GWAS data sets will be vital in the years to come, providing integrative models that enable analysis at the systems level and ensure increased coverage, confidence, precision and accuracy. This manuscript focused on network/pathway-based PGAs that leverage GWAS summary statistics within human PPI network data, combining the effects of multiple loci within genes, biological subnetworks and pathways to detect genetic signals beyond single-gene polymorphisms. We provide some key concepts and discuss the assumptions and methods behind pathway/network-based approaches. Moreover, the development of appropriate tools that allow combinations of GWAS and multiple OMICs is increasingly needed. Such approaches will integrate, for example, whole-exome/genome sequencing data, transcriptomic, proteomic, epigenomic and metagenomic data, in addition to GWAS and PPI data, to refine our knowledge of the pathophysiology of common complex diseases as well as monogenic conditions with variable penetrance and expressivity.
Key Points

Dissecting pathway/network-based post-GWAS approaches that leverage GWAS summary statistics within human PPI networks.
Consistent classification of existing and future pathway/network-based post-GWAS approaches.
Discussion of issues related to pathway/network-based post-GWAS approaches.

Supplementary data

Supplementary data are available online at https://academic.oup.com/bib.

Emile R. Chimusa received his PhD in Bioinformatics from the University of Cape Town. He is a Senior Lecturer at the Division of Human Genetics, Department of Pathology, University of Cape Town.

Shareefa Dalvie received her PhD in Human Genetics from the University of Cape Town. She is a lecturer at the Department of Psychiatry and Mental Health, University of Cape Town.

Collet Dandara received his PhD degree from the University of Zimbabwe. He is a Professor of Human Genetics, Division of Human Genetics, University of Cape Town.

Ambroise Wonkam received his PhD degree from the University of Geneva. He is a Professor/Senior Specialist at the Division of Human Genetics, University of Cape Town.

Gaston K. Mazandu received his PhD in Bioinformatics from the University of Cape Town. He is an Honorary Senior Member of the Computational Biology Division at the University of Cape Town; a Researcher at the African Institute for Mathematical Sciences; and a Senior Lecturer at the Division of Human Genetics, University of Cape Town.

Acknowledgments

The authors thank the researchers who have contributed towards advancing PGAs and all GWAS summary statistics donors around the world. The authors also thank the computing platform CHPC (https://www.ac.za/) and those who have helped in the preparation of this manuscript.

Funding

Some of the authors are funded in part by the National Institutes of Health Common Fund under grant number 1U54HG009790-01(IFGeneRA), U01HG009716 (HI Genes Africa), U24HG006941 (H3ABioNet), 1u01hg007459-01 (SADaCC) and Wellcome Trust/ AESA Ref: H3A/18/001.
The content of this publication is solely the responsibility of the authors and does not necessarily represent the official views of the funders.

References

1. Li MJ, Liu Z, Wang P. GWASdb v2: an update database for human genetic variants identified by genome-wide association studies. Nucleic Acids Res 2016;44(D1):D869–76.
2. Jia P, Zheng S, Long J. dmGWAS: dense module searching for genome-wide association studies in protein-protein interaction networks. Bioinformatics 2011;27(1):95–102.
3. Peng G, Luo L, Siu H. Gene and pathway-based second-wave analysis of genome-wide association studies. Eur J Hum Genet 2010;18(1):111–17.
4. Cantor RM, Lange K, Sinsheimer JS. Prioritizing GWAS results: a review of statistical methods and recommendations for their application. Am J Hum Genet 2010;86(1):6–22.
5. Shahbaba B, Shachaf CM, Yu Z. A pathway analysis method for genome-wide association studies. Stat Med 2012;31(10):988–1000.
6. Wang K, Li M, Hakonarson H. Analysing biological pathways in genome-wide association studies. Nat Rev Genet 2010;11(12):843–54.
7. Wang K, Li M, Bucan M. Pathway-based approaches for analysis of genomewide association studies. Am J Hum Genet 2007;81(6):1278–83.
8. Pasaniuc B, Price AL. Dissecting the genetics of complex traits using summary association statistics. Nat Rev Genet 2017;18(2):117–27.
9. Li MJ, Pan Z, Liu Z, et al. Predicting regulatory variants with composite statistic. Bioinformatics 2016;32(18):2729–36.
10. Chimusa ER, Mbiyavanga M, Mazandu GK, et al. AncGWAS: a post genome-wide association study method for interaction, pathway, and ancestry analysis in homogeneous and admixed populations. Bioinformatics 2016;32(4):549–56.
11. Li M, Li J, Li MJ, et al. Robust and rapid algorithms facilitate large-scale whole genome sequencing downstream analysis in an integrative framework. Nucleic Acids Res 2017;45(9):e75.
12. Li MJ, Li M, Liu Z, et al. cepip: context-dependent epigenomic weighting for prioritization of regulatory variants and disease-associated genes. Genome Biol 2017;18(1):52.
13. Mulder NJ, Akinola RO, Mazandu GK, et al. Using biological networks to improve our understanding of infectious diseases. Comput Struct Biotechnol J 2014;11(18):1–10.
14. Chang SM, Hu WW. Long non-coding RNA MALAT1 promotes oral squamous cell carcinoma development via microRNA-125b/STAT3 axis. J Cell Physiol 2018;233(4):3384–96.
15. Ma'ayan A. Introduction to network analysis in systems biology. Sci Signal 2011;4(190):tr5.
16. Albert R, Jeong H, Barabasi AL. Error and attack tolerance of complex networks. Nature 2000;406(6794):378–82.
17. Ekman D, Light S, Björklund AK, Elofsson A. What properties characterize the hub proteins of the protein-protein interaction network of Saccharomyces cerevisiae? Genome Biol 2006;7(6):R45.
18. Akinola RO, Mazandu GK, Mulder NJ. A quantitative approach to analyzing genome reductive evolution using protein-protein interaction networks: a case study of Mycobacterium leprae. Front Genet 2016;7:39.
19. Ma X, Gao L. Biological network analysis: insights into structure and functions. Brief Funct Genomics 2012;11(6):434–42.
20. Mazandu GK, Mulder NJ. DaGO-Fun: tool for gene ontology-based functional analysis using term information content measures. BMC Bioinformatics 2013;14(1):284.
21. Nelson MR, Tipney H, Painter JL, et al. The support of human genetic evidence for approved drug indications. Nat Genet 2015;47(8):856–60.
22. Holmans P, Green EK, Pahwa JS, et al. Gene ontology analysis of GWA study data sets provides insights into the biology of bipolar disorder. Am J Hum Genet 2009;85(1):13–24.
23. Wu MC, Lin X. Prior biological knowledge-based approaches for the analysis of genome-wide expression profiles using gene sets and pathways. Stat Methods Med Res 2009;18(6):577–93.
24. Yu K, Li Q, Bergen AW, et al. Pathway analysis by adaptive combination of P-values. Genet Epidemiol 2009;33(8):700–9.
25. Zotenko E, Mestre J, O'Leary DP, et al. Why do hubs in the yeast protein interaction network tend to be essential: reexamining the connection between the network topology and essentiality. PLoS Comput Biol 2008;4(8):e1000140.
26. Chen X, Wang L, Hu B, et al. Pathway-based analysis for genome-wide association studies using supervised principal components. Genet Epidemiol 2010;34(7):716–24.
27. Guo YF, Li J, Chen Y, et al. A new permutation strategy of pathway-based approach for genome-wide association study. BMC Bioinformatics 2009;10:429.
28. Kraft P, Raychaudhuri S. Complex diseases, complex genes: keeping pathways on the right track. Epidemiology 2009;20(4):508–11.
29. Fridley BL, Patch C. Gene set analysis of SNP data: benefits, challenges, and future directions. Eur J Hum Genet 2011;19(8):837–43.
30. Yellaboina S, Goyal K, Mande SC. Inferring genome-wide functional linkages in E. coli by combining improved genome context methods: comparison with high-throughput experimental data. Genome Res 2007;17(4):527–35.
31. Piñero J, Bravo À, Queralt-Rosinach N. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res 2017;45(D1):D833–9.
32. Mazandu GK, Mulder NJ. Generation and analysis of large-scale data-driven Mycobacterium tuberculosis functional networks for drug target identification. Adv Bioinformatics 2011;2011:801478.
33. Wang P, Qin J, Qin Y, et al. ChIP-Array 2: integrating multiple omics data to construct gene regulatory networks. Nucleic Acids Res 2015;43(W1):W264–9.
34. Alm E, Arkin PA. Biological networks. Curr Opin Struct Biol 2003;13(2):193–202.
35. Mazandu GK, Mulder NJ. Using the underlying biological organization of the Mycobacterium tuberculosis functional network for protein function prediction. Infect Genet Evol 2012;12(5):922–932.
36. Bessieres-Grattagliano B, Foliguet B, Devisme L, et al. Refining the clinicopathological pattern of cerebral proliferative glomeruloid vasculopathy (Fowler syndrome): report of 16 fetal cases. Eur J Med Genet 2009;52(6):386–92.
37. Fisher RA. Statistical methods for research workers. In: Breakthroughs in Statistics. New York, NY: Springer, 1992, 66–70.
38. Hess A, Iyer H. Fisher's combined p-value for detecting differentially expressed genes using Affymetrix expression arrays. BMC Genomics 2007;8(1):96.
39. Kost JT, McDermott MP. Combining dependent p-values. Stat Probab Lett 2002;60(2):183–90.
40. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B Methodol 1995;57:289–300.
41. Liptak T. On the combination of independent tests. Magyar Tud Akad Mat Kutato Int Kozl 1958;3:171–97.
42. Zaykin DV, Zhivotovsky LA, Westfall PH, et al. Truncated product method for combining P-values. Genet Epidemiol 2002;22(2):170–85.
43. Ramanan VK, Shen L, Moore JH, et al. Pathway analysis of genomic data: concepts, methods, and prospects for future development. Trends Genet 2012;28(7):323–32.
44. Lamparter D, Marbach D, Rueedi R, et al. Fast and rigorous computation of gene and pathway scores from SNP-based summary statistics. PLoS Comput Biol 2016;12(1):e1004714.
45. Wang T, Birsoy K, Hughes NW, et al. Identification and characterization of essential genes in the human genome. Science 2015;350(6264):1096–101.

© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please email: journals.permissions@oup.com. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices).

Briefings in Bioinformatics, Oxford University Press

Publisher
Oxford University Press
ISSN
1467-5463
eISSN
1477-4054
D.O.I.
10.1093/bib/bby035
Use of biological networks in post-GWAS Proteins, which are gene products, perform a critical range of biological functions in an organism through interactions with the cellular environment and the promotion of growth and functioning of the cell [12, 13]. It is also important to note that biological functioning can also be affected at the level of RNA. For example, noncoding RNA such microRNA and long noncoding RNA are important for maintaining the right cellular environment, allowing proteins to catalyze critical processes [14]. This suggests that any disease outcome or drug response requires concerted biological action of many genes (and RNA) involved in diverse processes and/or biological pathways. Identification of biological functions of disease-related genes will substantially increase our understanding of the biological mechanisms involved in disease pathogenesis. The limitations of the conventional single-marker-based approach for GWAS have necessitated the exploration and development of alternative or complementary approaches [3], including PGA (Table 1). The main assumption behind PGA is that, although association signals (GWAS summary statistics) of several variants involved in disease etiology may be too small to detect using the conventional single-marker-based approach, they may be collectively detected from the combined effect of multiple variants in interacting or grouped genes according to their shared functions within subnetworks or variants within biological pathways. Over recent years, several different approaches have been developed to detect the significance of a biological pathway, from a collection of SNPs from a GWAS, and to adjust for multiple testing at the pathway level [26, 27] (Supplementary Text S1). PGAs can broadly be classified into three categories with respect to the overall strategy used (Supplementary Table S1): (i) single-variant-based, (ii) gene-scoring and (iii) pathway/network-based PGAs [26–28]. 
Details of these approaches can be found in Supplementary Text S1. Further criteria to characterize PGAs are based on the way the null hypothesis is formulated or tested [29]; details can be found in Supplementary Text S2. Based on the null hypothesis, PGAs are characterized as competitive (or enrichment) or self-contained (or association) methods. The distinction between these two hypothesis-based classes is important, and the key differences between them emanate from the null hypothesis stated [29]. It is worth noting that these PGAs have been designed for different applications in (a) meta-analysis, (b) conditional association and imputation using summary statistics, (c) fine-mapping causal variants by integrating functional annotations and/or trans-ethnic data, (d) polygenic predictions of disease risk and inferring polygenic architectures, (f) recovery of association signals from multiple variants and (g) analyzing multiple variants, traits or phenotypes. As mentioned in the introduction, progress on single-variant-based and gene-scoring PGAs was recently reviewed by Pasaniuc et al. [1, 8]. The pathway/network-based PGAs are distinguished by whether they (1) conduct the association on subnetworks of interactive PPI networks (where subnetworks are obtained by searching the network) or (2) conduct the association on pathways (where each pathway contains gene information). Both categories of pathway/network-based PGAs require a gene-based score or association summary statistic for each gene [11, 12]. Figure 1 illustrates the summary workflow of these pathway/network-based PGAs. Subsequent sections provide a detailed discussion of the steps illustrated in Figure 1. Table 1.
Some existing online databases for retrieving comprehensive PPI maps, including PPIs within an organism (intraorganism maps) or between organisms (interorganism maps)

| Database | Description | Data type | URL (ref.) |
|---|---|---|---|
| STRING | Search Tool for the Retrieval of Interacting Genes/Proteins | Integrated sets of known and predicted protein–protein functional interactions for several organisms | [15] |
| BioGRID | Biological General Repository for Interaction Data sets | Physical and genetic interactions | [16] |
| DIP | Database of Interacting Proteins | Experimentally identified interactions between proteins | [17] |
| IntAct | Database of protein interaction data | Literature or direct data deposition-based molecular interaction | [18] |
| HPRD | Human Protein Reference Database | Manually curated PPIs | [19] |
| MINT | Molecular Interaction database | Experimentally verified and literature-based curated PPIs | [20] |
| MIPS | Mammalian Protein–Protein Interaction Database | Manually curated high-quality PPI data | [21] |
| HPIDB | Host–Pathogen Interaction Database | Experimentally verified and predicted host–pathogen interactions | [22] |
| HPI-base | Host–Pathogen Interactions | Literature-based host–pathogen interactions | [23] |
| PATRIC | Pathosystems Resource Integration Center | Experimentally inferred host–pathogen interactions | [24] |
| VirHostNet | Virus–Host Network release | Biocurated virus–host molecular interactions | [25] |

Figure 1. A summary workflow of the pathway/subnetworks for PGAs.

How are PPI data sets obtained?

Cells are functional units of life and each protein contributes to different biological processes, which, in turn, act in diverse biological pathways. Under a given condition, a biological pathway is a series of events (interactions amongst molecules or biological processes) occurring sequentially in a cell to change the cellular environment, ensuring stability. Information on biological processes and pathways is stored in bioinformatics resources and can be retrieved with bioinformatics tools. These pathway databases cover most of the known metabolic and regulatory pathway maps for several genomes. Generally, PPIs can be detected experimentally or by computational analysis.
Physical interactions are detected using direct experimental techniques, such as pull-down assays, co-immunoprecipitation or tandem affinity purification, high-throughput mass spectrometry techniques [30] and high-throughput yeast two-hybrid (Y2H) screens [31]. Other functional interactions can be inferred from biological knowledge, such as co-expression data from microarray analysis, text mining, shared evolutionary history based on sequence data (sequence similarity or shared domains) and genomic context (conserved genomic neighborhood or gene order, gene fusion events, gene co-occurrence or phylogenetic profiles across genomes) [14, 32]. The PPI databases are publicly available and freely accessible via Web interfaces, and a partial list of these is shown in Table 1. Although several PPI resources and databases exist, predicting biological pathways is also possible through a scoring approach and genome-wide PPI network analysis (Supplementary Text S3).

Characterizing individual proteins in a PPI network

A biological network is a graph modeling a biological system as an entity composed of subunits, and has become a useful tool enabling the integration of different biological data into a single framework [26]. Types of biological networks include signaling networks; gene regulatory or DNA–protein interaction networks [12, 32]; disease–gene networks linking diseases to genes; and drug interaction networks connecting drugs to their targets [15, 16, 33]. The most widely used biological network is the PPI network, which has been applied in different biomedical applications for knowledge discovery, including prediction of protein function [17–19] and of disease-associated genes [14, 20, 21, 26, 27, 31], filtration and prioritization of protein targets [20], drug discovery [21–24, 29], drug resistance analysis [23] and detection of subnetworks or modules underlying disease risk [12]. It is worth noting that most PPI networks are not disease specific [10, 12, 19].
A PPI network is defined as a set of nodes (or vertices), representing proteins (or genes), connected by undirected edges (or links), representing the interactions or relationships between them (either direct physical or functional interactions) [2, 10]. Several types of PPI networks exist, and when they are integrated into a single network, the relationships between proteins are referred to as functional interactions or connections. Characterizing individual proteins in the context of an integrated or unified PPI network is important for understanding how proteins function at the systems level, and GWAS results may be mapped to such a PPI network to identify disease-associated subnetworks. In this context, the PPI network is modeled as a graph, denoted by $G = (V, E)$, where $V$ is the set of vertices (genes or proteins) and $E$ the set of edges (interactions). Network centrality measures such as degree or connectivity, eccentricity, betweenness, closeness and eigenvector centrality [32] can be used to numerically characterize the importance of proteins in the network and are described below:

1. Degree or connectivity centrality characterizes the importance of a protein in the network based on the number of proteins connected to it. For a given protein $p \in V$, the degree of $p$, denoted $C_{\mathrm{deg}}(p)$, is given by:

$$C_{\mathrm{deg}}(p) = \sum_{q \in V} \delta(p, q), \qquad (1)$$

where $\delta(p, q)$ is 1 if $p$ is functionally linked to $q$, and 0 otherwise. Note that if the proteins in $V$ are numbered from 1 to $n$, then the matrix $A = (a_{pq})_{1 \le p, q \le n}$ with $a_{pq} = \delta(p, q)$ is referred to as the adjacency matrix. A protein with a large number of functional connections is considered to be a key protein, as it may contribute to several processes in the system.

2. Eccentricity centrality shows how easily accessible a node is from other nodes, expressing its capability to communicate quickly with other proteins in the network. For a given protein $p \in V$, the eccentricity of $p$, denoted $C_{\mathrm{ecc}}(p)$, is the length of the longest of the shortest paths departing from $p$, i.e.

$$C_{\mathrm{ecc}}(p) = \max\{\gamma_{pq} : q \in V\}, \qquad (2)$$

where $\gamma_{pq}$ is the length of the shortest path from $p$ to $q$.

3. Closeness centrality assesses the essentiality of a protein $p$ based on how it keeps other proteins close to each other, thus speeding up the spread of biological information in the network. The closeness score, $C_{\mathrm{clos}}(p)$, of protein $p$ is given by:

$$C_{\mathrm{clos}}(p) = \frac{1}{\frac{1}{n_c - 1} \sum_{q \in V} \gamma_{pq}}, \qquad (3)$$

where $n_c$ is the number of proteins in the connected subnetwork that contains the protein $p$.

4. Betweenness centrality scores the ability of a protein to maintain the transmission of biological information between other proteins in the network, based on the total number of shortest paths passing through the protein. The betweenness score of protein $p \in V$, denoted $C_{\mathrm{bet}}(p)$, is given by:

$$C_{\mathrm{bet}}(p) = \sum_{s \ne p \ne t,\; s \ne t} \frac{\sigma_{st}(p)}{\sigma_{st}}, \qquad (4)$$

with $\sigma_{st}(p)$ the number of shortest paths between the protein pair $(s, t)$, $s \ne t$, that pass through $p$, and $\sigma_{st}$ the total number of shortest paths from $s$ to $t$.

5. Eigenvector centrality assigns a weight to a protein based on how influential the proteins connected to it are. In this measure, the relevance of a protein depends on the quality of its neighbors rather than the number of its connections. The eigenvector score, $C_{\mathrm{eig}}(p)$, of protein $p \in V$ is calculated as follows:

$$C_{\mathrm{eig}}(p) = \frac{1}{\lambda} \sum_{q \in V} a_{pq}\, C_{\mathrm{eig}}(q), \qquad (5)$$

where $\lambda$ is the largest eigenvalue of the adjacency matrix $A$. It follows that $A\,C_{\mathrm{eig}} = \lambda\, C_{\mathrm{eig}}$, where the column vector $C_{\mathrm{eig}} = (C_{\mathrm{eig}}(q))_{q \in V}^{T}$ is the eigenvector of $A$ associated with $\lambda$.

The topological structure of a PPI network provides information on the pattern associated with the general behavior of the biological system under consideration, which may clarify the global role and biological relevance of individual proteins in the network. Therefore, determining the topological structure of the PPI network can help in understanding the biological mechanisms underlying the functioning of the organism, including cellular organization and processes.
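The five centrality scores in equations (1)–(5) can be illustrated on a toy graph. The following is a minimal sketch using only the Python standard library, not the implementation used by any tool cited here. The function names (`bfs_paths`, `centralities`) and the toy network are hypothetical. It assumes an unweighted, undirected network stored as an adjacency dictionary, counts each unordered pair (s, t) once in the betweenness sum, and runs the eigenvector power iteration on A + I (which has the same eigenvectors as A) so that the iteration also converges on bipartite graphs.

```python
from collections import deque


def bfs_paths(adj, source):
    """Breadth-first search from `source` on an unweighted graph.

    Returns shortest-path lengths, shortest-path counts, predecessor lists
    and the visiting order (non-decreasing distance from `source`).
    """
    dist, sigma, preds, order = {source: 0}, {source: 1}, {source: []}, []
    queue = deque([source])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in dist:              # first time w is reached
                dist[w] = dist[v] + 1
                sigma[w], preds[w] = 0, []
                queue.append(w)
            if dist[w] == dist[v] + 1:     # v lies on a shortest path to w
                sigma[w] += sigma[v]
                preds[w].append(v)
    return dist, sigma, preds, order


def centralities(adj, n_iter=200):
    """Degree, eccentricity, closeness, betweenness and eigenvector scores."""
    nodes = sorted(adj)
    degree = {p: len(adj[p]) for p in nodes}            # equation (1)
    ecc, clos = {}, {}
    bet = {p: 0.0 for p in nodes}
    for s in nodes:
        dist, sigma, preds, order = bfs_paths(adj, s)
        ecc[s] = max(dist.values())                     # equation (2)
        total = sum(dist.values())
        clos[s] = (len(dist) - 1) / total if total else 0.0   # equation (3)
        # Brandes-style back-propagation of pair dependencies, equation (4)
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bet[w] += delta[w]
    bet = {p: b / 2.0 for p, b in bet.items()}          # unordered pairs once
    # Power iteration on A + I for the leading eigenvector, equation (5)
    eig = {p: 1.0 for p in nodes}
    for _ in range(n_iter):
        nxt = {p: eig[p] + sum(eig[q] for q in adj[p]) for p in nodes}
        norm = max(nxt.values())
        eig = {p: x / norm for p, x in nxt.items()}
    return degree, ecc, clos, bet, eig


# Hypothetical toy network: A-B, B-C, B-D, C-D, D-E
toy = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B", "D"],
       "D": ["B", "C", "E"], "E": ["D"]}
deg, ecc, clos, bet, eig = centralities(toy)
```

In this toy network, B and D are the hubs: they have the largest degree and carry all shortest paths between non-adjacent proteins, whereas C has a betweenness of zero even though it has two partners.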
In fact, it has been observed that, in general, biological networks exhibit scale-free properties [15, 16, 32], meaning that their degree distribution, which is the probability that a randomly selected protein is connected to $k$ proteins (i.e. has degree $k$) in the network, approximates a power law, i.e.

$$P(k) \sim k^{-\gamma}, \qquad (6)$$

where the power exponent $\gamma$ is a constant characteristic of the network. The form of this distribution is independent of the number of nodes; thus, the networks are said to be scale free. In such networks, the probability that a protein has more links than the mean degree of all proteins in the network is small [16]. This indicates that scale-free networks are heterogeneous, with a few highly connected proteins and many proteins having only a few interacting partners [16]. This is in contrast to random networks, which are homogeneous: proteins have roughly the same degree, and the degree distribution follows a Poisson distribution. For example, Figure 2A shows the degree distribution of the human (Homo sapiens) PPI network, which has scale-free topology. In this plot, we observe that although some proteins have many interacting partners, most have only a few. It is worth mentioning that these properties also hold for other model organisms, as well as for pathogenic organisms such as Mycobacterium tuberculosis, suggesting that subnetwork-based post-GWAS analysis can be applied or adapted to other organisms, where possible [10, 13, 17, 18]. Proteins participating in many interactions are referred to as 'high degree' or key proteins or hubs and have a high impact on the topological structure of the network, possibly ensuring the completion of basic chemical operations essential for the survival of the organism, such as energy transfer and redox reactions [16, 32].
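As an illustration of equation (6), the sketch below computes the empirical degree distribution P(k) of a network and estimates the exponent γ from the slope of log P(k) against log k. The function names are hypothetical, and the ordinary least-squares fit on the log-log scale is a deliberately simple assumption made here for illustration; it is known to be a crude estimator, and likelihood-based fits are generally preferred for real PPI data.

```python
import math
from collections import Counter


def degree_distribution(adj):
    """Return {k: P(k)}, the fraction of proteins with exactly k partners."""
    counts = Counter(len(partners) for partners in adj.values())
    n = len(adj)
    return {k: c / n for k, c in sorted(counts.items())}


def fit_power_law_exponent(pk):
    """Estimate gamma in P(k) ~ k^(-gamma) by least squares on log-log axes."""
    pts = [(math.log(k), math.log(p)) for k, p in pk.items() if k > 0 and p > 0]
    if len(pts) < 2:
        raise ValueError("need at least two distinct positive degrees")
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return -sxy / sxx          # the slope is -gamma


# Hypothetical star-shaped network: one hub with four partners
star = {"hub": ["a", "b", "c", "d"],
        "a": ["hub"], "b": ["hub"], "c": ["hub"], "d": ["hub"]}
pk_star = degree_distribution(star)

# Synthetic distribution that follows an exact power law with gamma = 2.5
pk_synthetic = {k: k ** -2.5 for k in range(1, 51)}
gamma_hat = fit_power_law_exponent(pk_synthetic)
```

On the synthetic input the fit recovers the exponent used to generate it; on real networks the estimate depends on the range of degrees included in the fit.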
Importantly, hub proteins are less likely to mutate, as their high connectivity is related to their functions [17], which are highly conserved and essential for the survival of the organism under consideration [18]. These networks have a higher error tolerance, or resistance to random node failure or perturbations, but are often vulnerable to targeted disruption or removal of hubs, which play a major role in maintaining the network's connectivity [17, 25]. This also implies that the average network clustering coefficient is significantly higher in the PPI network than in a random network [32]. This average network clustering coefficient [32, 33] is given by:

$$cc = \frac{1}{n} \sum_{p \in V} c_p, \qquad (7)$$

where $n$ is the number of proteins in the network, and $c_p$ is the clustering coefficient of protein $p$, that is, the ratio of the actual number of interacting partners, $n_p$, of protein $p$ to the total number of possible interacting partners of the protein $p$ in the network, given by [13, 18]:

$$c_p = \frac{2 n_p}{n(n-1)}. \qquad (8)$$

Figure 2. (A) Protein connectivity or degree distribution in human PPI networks. Circles represent the frequency P(k) of observing a protein interacting with k partners in a network. The solid line plots the power law function approximating the connectivity distribution. (B) Path-length distribution in human PPI networks. The histogram represents the path-length distribution, i.e. the frequency of occurrence of shortest paths of a given length, and the dashed line is the normal distribution approximating the path-length distribution.

Another special property characterizing PPI networks is the 'small world' property, i.e. the transmission of biological information between any two proteins is achieved through only a few steps, or a much shorter path than would be expected in a random network of similar size and order [16]. This property provides insight into network navigability, indicating how fast information can spread in the system independently of the number of proteins [16, 32]. Figure 2B shows the path-length distribution of the human (H. sapiens) PPI network. This plot indicates that the average path length lies between 3 and 4 hops, independent of the size (number of edges) and order (number of genes or proteins) of the network under consideration. This suggests that the spread of biological information in these systems is relatively fast, which is important, especially for pathogenic organisms, as they need to survive and adapt to environmental niches. This 'small world' property may also provide the organism with an evolutionary advantage, in the sense that it would be able to respond efficiently to changes in the environment and quickly exhibit a qualitative change of behavior in response to these perturbations [10]. It is common for a specific network property to involve only some parts of the network rather than the whole network. Such properties are referred to as subnetwork features, and these subnetworks can be either motifs or modules [19, 34]. Classifying proteins into subnetwork modules and functional motifs from a PPI network can help in understanding the biological mechanisms underlying the system and the organization and dynamics of cell functions. To discover hidden information in a complex network, there exist several approaches to detect modules [21, 35] and functionally related proteins [21].
Examples of these approaches are provided in Boxes 1 and 2. The next section discusses techniques used in leveraging topological properties of a PPI network to search for subnetworks.

Box 1: The procedure to generate a subnetwork based on topological structure of the network [10]

Input: A mapped network G, containing genes weighted by either P-values or z-scores, and LD.
1. From the network G, find structural hubs and connected components.
2. For each gene, compute the betweenness, the closeness and the eigenvector scores.
3. For each centrality score, compute the cutoff for central genes of subgraphs.
4. A gene is a hub if its score is greater than or equal to a user-defined cutoff.
5. A gene is a central gene if it is a hub for all the four scoring measures in Step (3).
6. For each central gene, search its neighbors for n iterations, or the mean shortest path.
7. The central gene and its neighbors constitute a subnetwork of the network G.
Output: subnetworks.

Box 2: Greedy algorithm to search for a subnetwork [2]

1. Assign a seed subnetwork S and calculate the subnetwork score $S_m$ of S. Initially, the seed subnetwork is a single gene.

$$S_m = \beta\, \frac{\sum_{e \in E} \mathrm{edgeweight}(e)}{\psi} + (1 - \beta)\, \frac{\sum_{v \in V} \mathrm{nodeweight}(v)}{\gamma},$$

where $E$ and $V$ represent the edges and nodes of the module, $\beta$ is a parameter between 0 and 1 that balances GWAS signals and weight signals from either gene expression or LD structure, and $\psi$ and $\gamma$ are the total numbers of edges and nodes in $E$ and $V$, respectively.
2. Examine all the first-order neighbors of S, and identify the neighbor node $N_{\max}$ that generates the maximum increment of the subnetwork score.
3. Add $N_{\max}$ to the current subnetwork S if the score increment is greater than $S_m \cdot r$, where $r$ is a parameter that decides the magnitude of increment.
4. Repeat Steps 1–3 until no more neighbors can be added.

Mapping SNP-based GWAS summary statistics and searching for subnetworks

Highly connected genes in PPI networks can be functionally important, and the removal of such genes is related to lethality.
Deleterious variants in such genes are usually observed in aborted fetuses [36]. Consider an undirected weighted PPI network, $G = (V, E)$, where $V$ is the set of $n$ genes (nodes) and $E$ is the set of edges (interactions between genes). Current approaches use different link weights, such as gene correlation from genotype data [10], gene expression-based weights [2], and functional and topological weights [10]. The following steps are executed sequentially to search for subnetworks in networks:

Mapping SNPs to genes and identifying weights

1. Map single SNP-based GWAS summary statistics to the respective genes. These GWAS summary statistics are usually assigned to a given gene in different ways: (a) if SNPs are located within the primary gene transcript and/or within a user-defined base-pair distance downstream or upstream of a specific gene; (b) if they are in linkage disequilibrium (LD) within a specified boundary cutoff within a gene; and (c) the closest SNPs within the gene, at zero distance.
2. Attribute scores or weights to functional interactions in the PPI network based on the weighting scheme under consideration. Here, several types of weights can be considered, including LD based on the reference genotype data set [10], gene co-expression [2] and scores from the network topological structure.

Searching subnetworks from weighted network

There are two major approaches applied in PGAs to search for subnetworks in weighted networks: (a) Analyze the general topological properties of the PPI network and quantify the usefulness of each gene using centrality measures to cluster the network into subnetworks [10]. Box 1 illustrates how the topological properties of a network (see section above) are used to obtain clustered nodes in subnetworks. (b) Use a greedy algorithm (described in Box 2), as implemented in some software tools, to search for dense subnetworks [2]. The obtained subnetworks can be used to perform the association test as discussed in the section below.
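The greedy search of Box 2 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation of any published tool: the function names, the toy weights and the parameter defaults are hypothetical; node weights stand for non-negative gene-level GWAS scores; edge weights stand for co-expression or LD-based weights and are keyed by unordered gene pairs; and the edge term is taken to be zero for a single-gene seed, where the module has no edges.

```python
def module_score(nodes, node_w, edge_w, beta=0.5):
    """Score of Box 2: beta * mean edge weight + (1 - beta) * mean node weight."""
    nodes = set(nodes)
    inside = [w for pair, w in edge_w.items() if pair <= nodes]
    edge_term = sum(inside) / len(inside) if inside else 0.0   # assumption: 0 if no edges
    node_term = sum(node_w[v] for v in nodes) / len(nodes)
    return beta * edge_term + (1.0 - beta) * node_term


def greedy_module(seed, adj, node_w, edge_w, beta=0.5, r=0.1):
    """Grow a module from `seed` by repeatedly adding the best first-order neighbor."""
    module = {seed}
    score = module_score(module, node_w, edge_w, beta)
    while True:
        neighbors = {w for v in module for w in adj[v]} - module
        best, best_score = None, score
        for w in sorted(neighbors):                  # sorted for reproducibility
            s = module_score(module | {w}, node_w, edge_w, beta)
            if s > best_score:
                best, best_score = w, s
        # Stop when no neighbor improves the score by more than score * r
        if best is None or best_score - score <= score * r:
            return module, score
        module.add(best)
        score = best_score


# Hypothetical toy inputs
adj = {"A": ["B", "C"], "B": ["A", "C", "D"], "C": ["A", "B"], "D": ["B"]}
node_w = {"A": 3.0, "B": 2.5, "C": 0.2, "D": 3.5}
edge_w = {frozenset(("A", "B")): 0.9, frozenset(("A", "C")): 0.2,
          frozenset(("B", "C")): 0.1, frozenset(("B", "D")): 0.95}
module, score = greedy_module("A", adj, node_w, edge_w, beta=0.5, r=0.05)
```

Starting from gene A, the search adds B and then D, and rejects C because its low node and edge weights would reduce the module score. Because the score averages over nodes and edges, the choice of `r` and `beta` strongly influences how far a module grows.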
Central genes within a given subnetwork, in association with complex disease susceptibility genes, are cores of biological subnetworks and are linked to other genes in that subnetwork via a few steps (paths in the network) [2, 10]. These centers are structural hubs with network centrality scores beyond a certain user-defined threshold value.

Pathway/subnetwork association from GWAS data

Most association tests at the subnetwork or pathway level require gene association summary statistics (gene scores) and require the biological network to be split into subnetworks [10] or dense modules [2], as discussed in the previous section [10].

Gene-based scoring

Different methods are generally used to combine the association signals of SNPs to assess the association of a gene with the phenotype. These include (1) using the minimum SNP-specific P-value (or the maximum test statistic) as the P-value for the gene (Sidak's combination test) [2, 10] and (2) using an SNP-specific summary measure within the gene, such as Fisher's or Simes' combination test [2, 10, 29]. Many gene scores in pathway/subnetwork-based PGA methods use the minimum P-value (or the maximum test statistic) of all the SNPs within a gene [2, 10, 19, 33]. However, when several distinct SNPs in the gene contribute to the overall association signal, and all have a modest effect on the phenotype, using the minimum P-value may not be the best or most powerful approach to capture such information. In addition, genes with more SNPs are likely to have smaller minimum P-values than genes with fewer SNPs [19]. Let us assume that multiple SNPs within a gene contribute to the overall association of the gene. We also assume that, under the null hypothesis, the P-values $p_i$ of the test statistics $T_i$ testing the $i$-th marker are independent and uniformly distributed, although this assumption of independence may be violated because of LD among SNPs within the gene.
If we consider a continuous monotonic function $H$, then a transformation of the P-value, $p_i$, of the $i$-th SNP is given by:

$$Z_i = H^{-1}(1 - p_i). \qquad (9)$$

Below we describe four methods that enable the combining of P-values for all SNPs within a gene: (a) the maximum test statistic (Sidak's combination test), (b) Fisher's combination test, (c) Simes' combination test and (d) the false discovery rate (FDR) method [2, 3, 10].

(a) Sidak's combination test. Considering only the SNP with the maximum test statistic (the best SNP), we can define the statistic $Z_B = \min_{1 \le i \le k} p_i$, which is distributed as $P(Z_B \le \omega) = 1 - (1 - \omega)^k$, where $k$ is the number of independent SNPs and $\omega$ is the type I error rate.

(b) Fisher's combination test. The statistic to combine $k$ independent P-values, or to combine information from $k$ SNPs, is given by:

$$Z_F = -2 \sum_{i=1}^{k} \log(p_i), \qquad (10)$$

which follows a chi-squared distribution, $\chi^2_{2k}$, with $2k$ degrees of freedom.

(c) Simes' combination test. Let the $p_i$ be ordered as $p_{(1)} \le p_{(2)} \le \cdots \le p_{(k)}$. The combined P-value is given by:

$$p = \min_{i} \left\{ \frac{k\, p_{(i)}}{i} \right\}. \qquad (11)$$

(d) FDR method. Let $F(\alpha)$ be the expected proportion of tests yielding P-values less than or equal to $\alpha$. Suppose that the set $\{p_1, \ldots, p_k\}$ contains $d$ distinct P-values, $\tilde{p}_j$, $j = 1, \ldots, d$, such that $\tilde{p}_1 < \tilde{p}_2 < \cdots < \tilde{p}_d$, and let $m_j$ be the number of P-values equal to $\tilde{p}_j$. Then, the estimate of the expected proportion $F(\alpha)$ is given by:

$$\hat{F}(\alpha) = \frac{1}{k} \sum_{j=1}^{d} I(\tilde{p}_j \le \alpha)\, m_j, \qquad (12)$$

where $I$ is an indicator function. For a one-sided test (chi-square test or trend test), consider $\hat{\pi} = \min(1, 2\bar{a})$, and for a two-sided test, consider $\hat{\pi} = \min(1, 2\bar{p})$, where $\bar{p} = \frac{1}{k} \sum_{i=1}^{k} p_i$, $\bar{a} = \frac{1}{k} \sum_{i=1}^{k} a_i$ and $a_i = 2\min(p_i, 1 - p_i)$. The estimate $\hat{v}(\alpha)$ of the expected proportion $v(\alpha)$ of tests resulting in false positives when $\alpha$ is used as the P-value threshold for significance is given by $\hat{v}(\alpha) = \alpha\,\hat{\pi}$, where $\hat{\pi}$ is the estimate of the proportion $\pi$ of tests with a true null hypothesis. Therefore, the FDRs are expressed as ratios of the form:

$$t_{(i)} = \frac{\hat{v}(p_{(i)})}{\hat{F}(p_{(i)})} \qquad (13)$$

and $q_{(i)} = \min\{t_{(j)} : j \ge i\}$, where $q_{(1)} \le q_{(2)} \le \cdots \le q_{(k)}$ are the ordered q-value FDRs. Finally, $q_{(1)} = \min\{t_{(j)} : j \ge 1\}$ is the false discovery rate.

Pathways/subnetwork association

In this section, we distinguish two existing methods for pathway/subnetwork association, namely, one- and two-step approaches [29]. In the one-step approach, all the SNPs in a pathway are used without any consideration of gene-level scores. In the two-step approach, the SNPs in each gene are first used to assess the association of the gene with the phenotype, and the gene-level tests are then combined to evaluate the association of the phenotype with the pathway [22, 23]. Both the one-step and two-step methods have advantages and disadvantages, and their efficiency depends mostly on the underlying disease-causing mechanisms, which are generally unknown. To account for the independence assumption and the correlation of P-values among neighboring genes, many pathway/subnetwork association methods use (a) Fisher's combined probability, (b) Stouffer–Liptak [37, 38] and (c) chi-square score methods, which account for spatial correlations among SNPs/genes within a given subnetwork or pathway. These commonly used association tests are described below:

(a) Fisher's combined probability test. Under the null hypothesis, the P-values $p_i$, $i = 1, \ldots, L$, for a test statistic with a continuous null distribution are uniformly distributed in the interval (0, 1). In this framework, a parametric cumulative distribution function $F$ can be chosen, and the P-values can be transformed into quantiles according to $q_i = F^{-1}(p_i)$, $i = 1, \ldots, L$. The combined test statistic $C_p = \sum_{i=1}^{L} q_i$ is a sum of independent and identically distributed random variables $q_i$, each of which follows the probability distribution corresponding to $F$. Let $\hat{F}$ be the cumulative chi-square distribution and $p_i$, $i = 1, 2, \ldots, L$, the P-values of SNPs associated with a gene or a subnetwork.
We obtain the combined P-value test statistic [38–42]

$$C_p = -2 \sum_{i=1}^{L} \log(1 - p_i),$$

which has a chi-square distribution with $2L$ degrees of freedom because of the additive property of independent chi-square distributions. The combined P-value is $\hat{p} = \hat{F}(C_p)$, where $\hat{F}$ is the cumulative distribution function of the chi-square distribution. If the $p_i$ are dependent, we can approximate the distribution of Fisher's combined probability statistic $C_p$ by a scaled chi-square distribution such that $C_p \approx c\,\chi^2_f$, where $c$ is a scaling factor and $f$ is the number of degrees of freedom. It follows that $E(C_p) = E(c\,\chi^2_f) = c f$ and $\mathrm{var}(C_p) = \mathrm{var}(c\,\chi^2_f) = 2 c^2 f$. Setting $E(C_p) = 2L$ and $\mathrm{var}(C_p) = 4L + 2\sum_{i<j} \mathrm{cov}[-2\log(p_i), -2\log(p_j)]$ and solving these equations with respect to $c$ and $f$, we obtain:

$$c = \frac{4L + 2\sum_{i<j} \mathrm{cov}[-2\log(p_i), -2\log(p_j)]}{4L} \qquad (14)$$

and

$$f = \frac{2(2L)^2}{4L + 2\sum_{i<j} \mathrm{cov}[-2\log(p_i), -2\log(p_j)]}. \qquad (15)$$

Under independence, the covariance terms vanish, so that $c = 1$, $f = 2L$ and $C_p \sim \chi^2_{2L}$; in general, the combined P-value for $C_p$ is determined by using the approximating distribution $C_p / c \approx \chi^2_f$. To compute the terms in $\mathrm{var}(C_p)$, one can apply the approximations described in [39], obtained by fitting a polynomial regression to the true values over a grid of values for the degrees of freedom ($9 \le \nu \le 125$) and the autocorrelations $\rho$ ($-0.98 \le \rho \le 0.98$). Thus, the q-value score can be obtained based on the Benjamini–Hochberg false discovery correction [40].

(b) Stouffer–Liptak test. Let $F$ be the cumulative standard normal distribution $N(0, 1)$. It follows that $F(x) = \Phi(x)$, where $\Phi$ is the cumulative distribution function of the standard normal, and $q_i = \Phi^{-1}(p_i)$. Each $q_i$, $i = 1, \ldots, L$, follows a standard normal distribution, and using the additive property of independent normal random variables, Liptak's combined P-value test statistic [41] is:

$$C_p = \frac{\sum_{i=1}^{L} q_i}{\sqrt{L}}, \qquad (16)$$

and the related combined P-value is $\hat{p} = \Phi(C_p)$. To adjust for P-value dependency, let us assume that the $p_i$, $i = 1, \ldots, L$, are correlated according to the correlation matrix $\Sigma$, which is positive definite and nondegenerate. Thus, the Cholesky factor $C$ exists such that $\Sigma = C C^{T}$.
The correlated quantiles $Q = (q_i)$ are transformed into independent quantiles $\hat{Q}$, as done by Zaykin et al. [42]. One obtains the transformation $\hat{Q} = C^{-1} Q$, whose components, $i = 1, \ldots, L$, are now independent and follow a standard normal distribution [42, 43]. Finally, the Stouffer–Liptak test is applied to $\hat{Q}$. As for Fisher's combined probability test above, we can compute a q-value on a null model from shuffled P-values.

(c) Chi-square score method. This test has recently been used in a free PPI-based method (Figure 1) that leverages gene information in biological pathways to perform the association [44]. The pathway-based chi-square test also uses gene score P-values [44]. Box 3 describes the steps to obtain the gene scores and to perform the pathway association test [44].

Box 3: Free PPI-based pathway chi-square score methods [44]

1. Gene score P-values are ranked such that the lowest P-value gets the highest rank.
2. The rank value is then divided by the number of genes plus 1 to obtain a uniform distribution.
3. Uniform distribution values are transformed by the chi-square quantile function to obtain a chi-square distribution of gene scores.
4. Chi-square-based gene scores of a given pathway of size $m$ are summed and tested against a $\chi^2_m$ distribution.

The empirical sampling method is as follows:

1. Gene score P-values are directly transformed with a chi-square quantile function to obtain new gene scores: $F^{-1}_{\chi^2_1}(1 - P)$.
2. A raw pathway score for a pathway of size $m$ is computed by summing the transformed gene scores for all pathway genes.
3. A Monte Carlo estimate of the P-value is obtained by sampling random gene sets of size $m$ and calculating the fraction of sets reaching a higher score than the gene set of the given pathway [44].
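The empirical sampling variant in Box 3 can be sketched with the Python standard library alone. This is an illustrative sketch under stated assumptions, not the code of the method in [44]: the function names, gene identifiers and P-values are hypothetical; gene P-values are assumed to lie in (0, 1]; and the chi-square quantile with one degree of freedom is obtained from the standard normal quantile, using the fact that the square of a standard normal variable is chi-square distributed with one degree of freedom.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def gene_score(p_value):
    """Chi-square (1 df) quantile of 1 - P, computed as (Phi^{-1}(P / 2))^2."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("P-value must lie in (0, 1]")
    z = _STD_NORMAL.inv_cdf(p_value / 2.0)   # lower tail, stable for small P
    return z * z


def pathway_empirical_p(gene_p, pathway, n_samples=2000, seed=0):
    """Raw pathway score and Monte Carlo P-value following Box 3.

    The P-value is the fraction of random gene sets of the same size whose
    summed score is strictly higher than that of the pathway, so it can be
    exactly 0 when no sampled set exceeds the observed score.
    """
    scores = {g: gene_score(p) for g, p in gene_p.items()}
    members = [g for g in pathway if g in scores]
    observed = sum(scores[g] for g in members)
    rng = random.Random(seed)
    background = list(scores.values())
    higher = sum(
        1 for _ in range(n_samples)
        if sum(rng.sample(background, len(members))) > observed
    )
    return observed, higher / n_samples


# Hypothetical gene-level P-values: three strong genes and 17 weak ones
gene_p = {f"g{i}": 1e-6 for i in range(3)}
gene_p.update({f"g{i}": 0.30 + 0.04 * (i - 3) for i in range(3, 20)})

strong_score, strong_p = pathway_empirical_p(gene_p, ["g0", "g1", "g2"])
weak_score, weak_p = pathway_empirical_p(gene_p, ["g17", "g18", "g19"])
```

Sampling random gene sets from the same background keeps the test competitive: a pathway is judged against other gene sets of equal size rather than against a theoretical null distribution.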
Characterization and enrichment of the identified subnetworks

To identify the association between each subnetwork $S_j$, $j = 1,\ldots,T$, with $n_1,\ldots,n_T$ genes, and a human pathway $P_k \in P$, where $P$ is the set of all available curated human pathways, curated pathways can be obtained from several annotated pathway databases (Table 2). Let $a$ denote the number of genes in the intersection between the genes within $S_j$ and the genes within pathway $P_k$, and $b$ the number of genes in the intersection between the genes within $S_j$ and those in the union of all pathways $P_k$, $k = 1,\ldots,K$. Let $n$ be the number of genes in the intersection between the genes in pathway $P_k$ and those in the union of all pathways $P_k$, $k = 1,\ldots,K$ with $k \neq j$, and $m$ the total number of genes in all pathways $P_k$, $k = 1,\ldots,K$. The overlap between subnetwork $S_j$ of $n_j$ genes and a given pathway $P_k$ can be quantified using the z-score (ZS), which uses the binomial proportion test and is given by

$$ZS = \frac{\frac{a}{n} - \frac{b}{m}}{\sqrt{\frac{b}{m^2}\left(1 - \frac{b}{m}\right)}}.$$

Table 2. Some existing online databases for retrieving comprehensive biological pathway maps and processes

| Scheme | Description | Types | URL/reference |
| KEGG | Kyoto Encyclopedia of Genes and Genomes | Integrated database of manually verified metabolic pathway information | [34] |
| BioCyc | EcoCyc, MetaCyc and derivatives | Literature-based curated and computationally predicted metabolic pathway information | [35] |
| NCI-NATURE | Pathway database | Biomedical database of human cellular signaling pathway | [36] |
| Reactome | Curated pathway database | Curated and peer reviewed database of biological pathway maps | [37] |
| GO | Gene Ontology | A classification system for annotation of genes and gene products using Molecular Function, Biological Process and Cellular Component | [38] |
| BioCarta | A database of gene interaction models | Database of gene interaction models, containing high-quality images of several cellular signaling and interaction pathways and each diagram | [39] |
| WikiPathways | Pathway database | Database of biological pathways maintained by and for the scientific community | [40] |
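The overlap z-score defined in this section can be sketched in a few lines of Python. The sketch takes the four counts $a$, $n$, $b$ and $m$ as inputs and assumes that the denominator of ZS is the square root of $(b/m^2)(1 - b/m)$, that is, the standard error of the background proportion $b/m$. The function name `overlap_z_score` is hypothetical.

```python
import math


def overlap_z_score(a, n, b, m):
    """Binomial proportion z-score comparing a/n against the background b/m.

    Assumes n > 0, m > 0 and 0 < b < m, so the standard error is non-zero.
    """
    observed = a / n    # proportion of overlap with the given pathway
    expected = b / m    # background proportion over all pathways
    std_err = math.sqrt((b / m ** 2) * (1.0 - b / m))
    return (observed - expected) / std_err
```

For instance, with the illustrative counts a = 5, n = 10, b = 20 and m = 100, the observed proportion is 0.5, the background proportion is 0.2 and the standard error is 0.04, so the z-score is 7.5.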
The above approach not only scores the association between overlapping gene sets and a given pathway but also has the advantage of accounting for the network topological structure of interactions between genes in the subnetwork [10].

Discussion

Over the past decade, we have experienced a shift from single-marker approaches toward whole-genome-based methods, in the hope of achieving a more global view of disease etiology [21]. Pathway-based methods provide a mechanism for more sensitive and powerful analysis of GWAS data sets [43]. PGAs have provided a new paradigm for GWAS and might enable a complete characterization of genetic susceptibility to a disease. These PGAs are broadly grouped into three categories (Supplementary Table S1), namely, single-variant association tests, gene-based association tests and pathway/network-based approaches [4, 8].
These methods were designed to leverage GWAS summary statistics for meta-analysis, conditional association and imputation, fine-mapping of causal variants by integrating functional annotations and/or trans-ethnic data, polygenic prediction of disease risk, and inference of polygenic architectures or analyses of multiple variants, traits or phenotypes. This manuscript provides pertinent details on PPI networks and discusses the advances made in pathway/network-based PGAs, with the aim of guiding current users in performing these types of analyses and of enabling the development of new methods that can overcome some of the challenges discussed below.

Current challenges and opportunity

Several challenges limit the practical use of pathway/network-based PGAs. While methods for pathway/network-based analysis are rapidly growing in number, several sources of bias must be taken into consideration, including the capacity for strongly associated markers to drive pathway association, the possible effects of SNPs being assigned to multiple genes and, more specifically, bias arising from differences in gene size and in pathway size. These biases are not always considered in the existing methods (Supplementary Table S1) [7, 34]. Ignoring gene size when assessing gene association signals and when testing for pathway association can lead to inflated type 1 error rates. Permutation procedures can be used to adjust for pathway size. The results from network/pathway-based approaches are sensitive to the way SNPs are assigned to genes, to the weights assigned to edges in the PPI network and to the accuracy of the PPI network itself. This is mainly because of the lack of accurate knowledge of complex traits and the incompleteness of the human protein interaction network, which makes it challenging to directly compare the results from different network/pathway-based approaches [45].
Most of these methods do not accept a user-defined network, only a list of SNPs or genes. This makes it impossible to compare the methods directly and to advise on which strategy is superior. The rapid growth of multi-loci and epistatic association approaches now offers the opportunity to design new, or update current, pathway/subnetwork-based PGAs that accept both multi-loci and epistatic association summary statistics as inputs.

Conclusions and perspectives

Several approaches have been designed to study complex genetic diseases by incorporating large-scale data such as transcriptomics, proteomics, genomics and metabolomics. The inclusion of such information in pathway/network-based analysis would provide a critical mechanism for more powerful and sensitive analysis of GWAS data sets [43]. Pathway/network-based approaches that incorporate biological network structures, such as hubs and motifs, into the analysis of GWAS data sets will be vital in the years to come, providing integrative models that enable analysis at the systems level and ensure increased coverage, confidence, precision and accuracy. This manuscript focused on network/pathway-based PGAs that leverage GWAS summary statistics within human PPI network data, combining the effects of multiple loci within genes, biological subnetworks and pathways to detect genetic signals beyond single-gene polymorphisms. We provide some key concepts and discuss the assumptions and methods behind pathway/network-based approaches. Moreover, appropriate tools that allow GWAS to be combined with multiple OMICs are increasingly needed. Such approaches will integrate, for example, whole-exome/genome sequencing, transcriptomic, proteomic, epigenomic and metagenomic data, in addition to GWAS and PPI data, to refine our knowledge of the pathophysiology of common complex diseases as well as monogenic conditions with variable penetrance and expressivity.
Key Points

- Dissecting pathway/network-based post-GWAS approaches that leverage GWAS summary statistics within human PPI networks.
- Consistent classification of existing and future pathway/network-based post-GWAS approaches.
- Discussion of issues related to pathway/network-based post-GWAS approaches.

Supplementary data

Supplementary data are available online at https://academic.oup.com/bib.

Emile R. Chimusa holds a PhD in Bioinformatics from the University of Cape Town. He is a Senior Lecturer at the Division of Human Genetics, Department of Pathology, University of Cape Town.

Shareefa Dalvie holds a PhD in Human Genetics from the University of Cape Town. She is a lecturer at the Department of Psychiatry and Mental Health, University of Cape Town.

Collet Dandara received his PhD degree from the University of Zimbabwe. He is a Professor of Human Genetics, Division of Human Genetics, University of Cape Town.

Ambroise Wonkam received his PhD degree from the University of Geneva. He is a Professor/Senior Specialist at the Division of Human Genetics, University of Cape Town.

Gaston K. Mazandu holds a PhD in Bioinformatics from the University of Cape Town. He is an Honorary Senior Member of the Computational Biology Division at the University of Cape Town, a Researcher at the African Institute for Mathematical Sciences and a Senior Lecturer at the Division of Human Genetics, University of Cape Town.

Acknowledgments

The authors thank the researchers who have contributed towards advancing PGAs and all GWAS summary statistics donors around the world. The authors also thank the computing platform CHPC (https://www.ac.za/) and those who have helped in the preparation of this manuscript.

Funding

Some of the authors are funded in part by the National Institutes of Health Common Fund under grant number 1U54HG009790-01(IFGeneRA), U01HG009716 (HI Genes Africa), U24HG006941 (H3ABioNet), 1u01hg007459-01 (SADaCC) and Wellcome Trust/ AESA Ref: H3A/18/001.
The content of this publication is solely the responsibility of the authors and does not necessarily represent the official views of the funders.

References

1. Li MJ, Liu Z, Wang P. GWASdb v2: an update database for human genetic variants identified by genome-wide association studies. Nucleic Acids Res 2016;44(D1):D869–76.
2. Jia P, Zheng S, Long J. dmGWAS: dense module searching for genome-wide association studies in protein-protein interaction networks. Bioinformatics 2011;27(1):95–102.
3. Peng G, Luo L, Siu H. Gene and pathway-based second-wave analysis of genome-wide association studies. Eur J Hum Genet 2010;18(1):111–17.
4. Cantor RM, Lange K, Sinsheimer JS. Prioritizing GWAS results: a review of statistical methods and recommendations for their application. Am J Hum Genet 2010;86(1):6–22.
5. Shahbaba B, Shachaf CM, Yu Z. A pathway analysis method for genome-wide association studies. Stat Med 2012;31(10):988–1000.
6. Wang K, Li M, Hakonarson H. Analysing biological pathways in genome-wide association studies. Nat Rev Genet 2010;11(12):843–54.
7. Wang K, Li M, Bucan M. Pathway-based approaches for analysis of genomewide association studies. Am J Hum Genet 2007;81(6):1278–83.
8. Pasaniuc B, Price AL. Dissecting the genetics of complex traits using summary association statistics. Nat Rev Genet 2017;18(2):117–27.
9. Li MJ, Pan Z, Liu Z, et al. Predicting regulatory variants with composite statistic. Bioinformatics 2016;32(18):2729–36.
10. Chimusa ER, Mbiyavanga M, Mazandu GK, et al. AncGWAS: a post genome-wide association study method for interaction, pathway, and ancestry analysis in homogeneous and admixed populations. Bioinformatics 2016;32(4):549–56.
11. Li M, Li J, Li MJ, et al. Robust and rapid algorithms facilitate large-scale whole genome sequencing downstream analysis in an integrative framework. Nucleic Acids Res 2017;45(9):e75.
12. Li MJ, Li M, Liu Z, et al. cepip: context-dependent epigenomic weighting for prioritization of regulatory variants and disease-associated genes. Genome Biol 2017;18(1):52.
13. Mulder NJ, Akinola RO, Mazandu GK, et al. Using biological networks to improve our understanding of infectious diseases. Comput Struct Biotechnol J 2014;11(18):1–10.
14. Chang SM, Hu WW. Long non-coding RNA MALAT1 promotes oral squamous cell carcinoma development via microRNA-125b/STAT3 axis. J Cell Physiol 2018;233(4):3384–96.
15. Ma'ayan A. Introduction to network analysis in systems biology. Sci Signal 2011;4(190):tr5.
16. Albert R, Jeong H, Barabasi AL. Error and attack tolerance of complex networks. Nature 2000;406(6794):378–82.
17. Ekman D, Light S, Björklund AK, Elofsson A. What properties characterize the hub proteins of the protein-protein interaction network of Saccharomyces cerevisiae? Genome Biol 2006;7(6):R45.
18. Akinola RO, Mazandu GK, Mulder NJ. A quantitative approach to analyzing genome reductive evolution using protein-protein interaction networks: a case study of Mycobacterium leprae. Front Genet 2016;7:39.
19. Ma X, Gao L. Biological network analysis: insights into structure and functions. Brief Funct Genomics 2012;11(6):434–42.
20. Mazandu GK, Mulder NJ. DaGO-Fun: tool for gene ontology-based functional analysis using term information content measures. BMC Bioinformatics 2013;14(1):284.
21. Nelson MR, Tipney H, Painter JL, et al. The support of human genetic evidence for approved drug indications. Nat Genet 2015;47(8):856–60.
22. Holmans P, Green EK, Pahwa JS, et al. Gene ontology analysis of GWA study data sets provides insights into the biology of bipolar disorder. Am J Hum Genet 2009;85(1):13–24.
23. Wu MC, Lin X. Prior biological knowledge-based approaches for the analysis of genome-wide expression profiles using gene sets and pathways. Stat Methods Med Res 2009;18(6):577–93.
24. Yu K, Li Q, Bergen AW, et al. Pathway analysis by adaptive combination of P-values. Genet Epidemiol 2009;33(8):700–9.
25. Zotenko E, Mestre J, O'Leary DP, et al. Why do hubs in the yeast protein interaction network tend to be essential: reexamining the connection between the network topology and essentiality. PLoS Comput Biol 2008;4(8):e1000140.
26. Chen X, Wang L, Hu B, et al. Pathway-based analysis for genome-wide association studies using supervised principal components. Genet Epidemiol 2010;34(7):716–24.
27. Guo YF, Li J, Chen Y, et al. A new permutation strategy of pathway-based approach for genome-wide association study. BMC Bioinformatics 2009;10:429.
28. Kraft P, Raychaudhuri S. Complex diseases, complex genes: keeping pathways on the right track. Epidemiology 2009;20(4):508–11.
29. Fridley BL, Patch C. Gene set analysis of SNP data: benefits, challenges, and future directions. Eur J Hum Genet 2011;19(8):837–43.
30. Yellaboina S, Goyal K, Mande SC. Inferring genome-wide functional linkages in E. coli by combining improved genome context methods: comparison with high-throughput experimental data. Genome Res 2007;17(4):527–35.
31. Piñero J, Bravo À, Queralt-Rosinach N. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res 2017;45(D1):D833–9.
32. Mazandu GK, Mulder NJ. Generation and analysis of large-scale data-driven Mycobacterium tuberculosis functional networks for drug target identification. Adv Bioinformatics 2011;2011:801478.
33. Wang P, Qin J, Qin Y, et al. ChIP-Array 2: integrating multiple omics data to construct gene regulatory networks. Nucleic Acids Res 2015;43(W1):W264–9.
34. Alm E, Arkin PA. Biological networks. Curr Opin Struct Biol 2003;13(2):193–202.
35. Mazandu GK, Mulder NJ. Using the underlying biological organization of the Mycobacterium tuberculosis functional network for protein function prediction. Infect Genet Evol 2012;12(5):922–932.
36. Bessieres-Grattagliano B, Foliguet B, Devisme L, et al. Refining the clinicopathological pattern of cerebral proliferative glomeruloid vasculopathy (Fowler syndrome): report of 16 fetal cases. Eur J Med Genet 2009;52(6):386–92.
37. Fisher RA. Statistical methods for research workers. In: Breakthroughs in Statistics. New York, NY: Springer, 1992, 66–70.
38. Hess A, Iyer H. Fisher's combined p-value for detecting differentially expressed genes using Affymetrix expression arrays. BMC Genomics 2007;8(1):96.
39. Kost JT, McDermott MP. Combining dependent p-values. Stat Probab Lett 2002;60(2):183–90.
40. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B Methodol 1995;57:289–300.
41. Liptak T. On the combination of independent tests. Magyar Tud Akad Mat Kutato Int Kozl 1958;3:171–97.
42. Zaykin DV, Zhivotovsky LA, Westfall PH, et al. Truncated product method for combining P-values. Genet Epidemiol 2002;22(2):170–85.
43. Ramanan VK, Shen L, Moore JH, et al. Pathway analysis of genomic data: concepts, methods, and prospects for future development. Trends Genet 2012;28(7):323–32.
44. Lamparter D, Marbach D, Rueedi R, et al. Fast and rigorous computation of gene and pathway scores from SNP-based summary statistics. PLoS Comput Biol 2016;12(1):e1004714.
45. Wang T, Birsoy K, Hughes NW, et al. Identification and characterization of essential genes in the human genome. Science 2015;350(6264):1096–101.

© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please email: journals.permissions@oup.com

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Journal

Briefings in Bioinformatics, Oxford University Press

Published: Apr 26, 2018
