Edge-group sparse PCA for network-guided high dimensional data analysis

Edge-group sparse PCA for network-guided high dimensional data analysis Abstract Motivation Principal component analysis (PCA) has been widely used to deal with high-dimensional gene expression data. In this study, we proposed an Edge-group Sparse PCA (ESPCA) model by incorporating the group structure from a prior gene network into the PCA framework for dimension reduction and feature interpretation. ESPCA enforces sparsity of principal component (PC) loadings through considering the connectivity of gene variables in the prior network. We developed an alternating iterative algorithm to solve ESPCA. The key of this algorithm is to solve a new k-edge sparse projection problem and a greedy strategy has been adapted to address it. Here we adopted ESPCA for analyzing multiple gene expression matrices simultaneously. By incorporating prior knowledge, our method can overcome the drawbacks of sparse PCA and capture some gene modules with better biological interpretations. Results We evaluated the performance of ESPCA using a set of artificial datasets and two real biological datasets (including TCGA pan-cancer expression data and ENCODE expression data), and compared their performance with PCA and sparse PCA. The results showed that ESPCA could identify more biologically relevant genes, improve their biological interpretations and reveal distinct sample characteristics. Availability and implementation An R package of ESPCA is available at http://page.amss.ac.cn/shihua.zhang/ Contact zsh@amss.ac.cn or liujuan@whu.edu.cn Supplementary information Supplementary data are available at Bioinformatics online. 1 Introduction With the rapid development of deep sequencing technology, there has been a lot of various RNA-seq datasets on different cancers or tissues from large-scale projects such as the International Cancer Genome Consortium (ICGC) (Hudson et al., 2010), the Encyclopedia of DNA Elements (ENCODE) (Dunham et al., 2012), The Cancer Genome Atlas (TCGA) (Weinstein et al., 2013) and the Genotype-Tissue Expression (GTEx) project (Lonsdale et al., 2013). Generally, these transcriptional data referring to ∼20 000 genes are typically high dimensional ones. Many dimension reduction and variable selection methods have been used to deal with such data for biological pattern discovery (Huisman et al., 2017; Lin et al., 2016; Van Der Maaten, 2014). Principal component analysis (PCA) is a classical dimension reduction method and has been widely adopted in high-dimensional gene expression analysis (Breschi et al., 2016; Chung and Storey, 2015; Ho et al., 2016; Ji et al., 2013; Liu et al., 2016; Ma and Dai, 2011; Ringnér, 2008). PCA and its variants can capture the linear relationship of variables to best explain the latent patterns of samples. However, the non-sparse principal component (PC) loadings by PCA employ all gene variables and lead to limited biological interpretability. Thus, variable selection is needed for gene expression analysis to select a small number of important genes. Recently, a number of studies have focused on developing sparse PCA models to encourage sparsity of PC loadings to extract gene modules with a limited number of genes for better interpretation (Deshpande and Montanari, 2014; Gu et al., 2014; Jolliffe et al., 2003; Journée et al., 2010; Lee et al., 2010; Rahmani et al., 2016; Shen and Huang, 2008; Sill et al., 2011, 2015; Witten et al., 2009; Yuan and Zhang, 2013; Zou et al., 2006). For example, LASSO regularized and elastic net regularized sparse PCA models have been proposed respectively (Jolliffe et al., 2003; Zou et al., 2006). However, these sparse PCA models can not accurately control the sparse level of PC loadings. Thus, the sparse PCA model with L0-penalty has been proposed to solve this issue (Journée et al., 2010; Yuan and Zhang, 2013). Several studies have developed different algorithms to solve these sparse PCA models (Deshpande and Montanari, 2014; Gu et al., 2014; Journée et al., 2010; Shen and Huang, 2008; Witten et al., 2009; Yuan and Zhang, 2013). A kind of commonly used methods employ regularized SVD framework to solve sparse PCA models (Shen and Huang, 2008; Witten et al., 2009). In addition, sparse PCA models have been widely used in cancer research (Hsu et al., 2014). For example, Sill et al. proposed a sparse PCA with stability selection (S4VDPCA) to deal with a medulloblastoma brain gene expression dataset and revealed that genes determined by the first two sparse PC loadings are significantly involved in several key pathways between molecular subgroups of medulloblastoma (Sill et al., 2011). However, these sparse PCA models assume that all the genes have equal prior probability of being selected in each PC loading and cannot integrate prior group structure among gene variables for variable selection. These gene interactions from a prior gene interaction network can be seen as a specific group structure. How to integrate the gene group structure into the sparse PCA framework for gene variable selection is an open challenge. Recently, some network-based classification and regression models have been proposed to identify a context-specific gene set or module via integrating gene expression data and an interaction network (Dittrich et al., 2008; Leiserson et al., 2015; Ruan et al., 2016; Sharan et al., 2007). Moreover, some other network-based methods have also been used to extract active gene modules by integrating differentially expressed levels (P-values) of genes between two given phenotypes and a gene interaction network (Ansari et al., 2017; Gwinner et al., 2017; Liu et al., 2017). However, PCA and sparse PCA do not consider the prior interaction network structure (Glaab, 2016). To our knowledge, little effort has been made to incorporate the group structure among genes from a gene interaction network into the SPCA framework for dimension reduction and feature interpretation. In this paper, we proposed an Edge-group Sparse PCA (ESPCA) for network-guided high dimensional gene expression data analysis (Fig. 1). Each gene interaction in the prior gene interaction network is considered as a group. ESPCA constrains the number of nonzero elements of each PC loading via selecting gene interactions (edges). That is, ESPCA encourages the selected genes in each PC loading to be linked in the prior gene interaction network (Fig. 1A). We developed an alternating iterative algorithm to solve ESPCA (Fig. 1B). Simulation tests showed superior performance of ESPCA over SPCA for variable selection. Extensive results of ESPCA applied onto two datasets (including TCGA pan-cancer expression data and ENCODE expression dataset) with a gene interaction network showed that ESPCA could identify more biologically relevant genes, improve their biological interpretations and reveal distinct sample characteristics. Fig. 1. View largeDownload slide (A) ESPCA for integrative analysis of gene expression and interaction network data. (B) Workflow of ESPCA algorithm for learning the first PC loading (u) and PC (v): (0) Initialize the algorithm with a randomly generated vector v; (1) Rebuild a gene weighted network based on Xv (denoted as z) by using wij=zi2+zj2 for any edge (i,j)∈G; (2) Sort all edge-weights from the weighted network and extract k edges with the largest k weights; (3) Obtain the selected genes based on the top k edges; (4) The elements of PC loading corresponding to non-selected genes are set to zero; (5) Normalize the PC loading to meet the normalizing condition; (6) Update v; (7) Obtain a new z. Repeat the steps (1) to (7) until convergence Fig. 1. View largeDownload slide (A) ESPCA for integrative analysis of gene expression and interaction network data. (B) Workflow of ESPCA algorithm for learning the first PC loading (u) and PC (v): (0) Initialize the algorithm with a randomly generated vector v; (1) Rebuild a gene weighted network based on Xv (denoted as z) by using wij=zi2+zj2 for any edge (i,j)∈G; (2) Sort all edge-weights from the weighted network and extract k edges with the largest k weights; (3) Obtain the selected genes based on the top k edges; (4) The elements of PC loading corresponding to non-selected genes are set to zero; (5) Normalize the PC loading to meet the normalizing condition; (6) Update v; (7) Obtain a new z. Repeat the steps (1) to (7) until convergence 2 Materials and methods Here, we first briefly described a regularized sparse PCA via L0-penalty (SPCA) and then proposed the ESPCA (Fig. 1). 2.1 SPCA Suppose X∈ℝp×n is a gene expression matrix with p genes and n samples, centered for each gene. Then the SPCA via L0-penalty (Journée et al., 2010; Yuan and Zhang, 2013) can be adopted to analyze the gene expression matrix:   maximize||u||2≤1 uTXXTu,  s.t.  ||u||0≤s, (1) where s is a positive integer, ||·||2 (or ||·||) denotes the Euclidean norm, and ||·||0 denotes the L0 norm. Note that ||u||0 is equal to the number of nonzero elements of u. A framework based on singular value decomposition (SVD) has been recommended to solve SPCA (Lin et al., 2016; Sill et al., 2015; Shen and Huang, 2008; Witten et al., 2009). That is, the following problem is used to solve the above one (Equation 1):   maximize||u||2≤1,||v||2≤1 uTXv,  s.t.  ||u||0≤s, (2) where u∈ℝp×1 and v∈ℝn×1 are the first PC loading and PC respectively. The following alternating iterative projection strategy (Journée et al., 2010; Yuan and Zhang, 2013) is used to solve problem (2) until convergence: u←u^||u^||, where u^=P(z,s) and z=Xv, v←v^||v^||, where v^=XTu,where P(z,s) is called s-sparse projection operator. It is a p-dimensional column vector and its ith ( i=1, 2,…,p) element is defined as follows:   [P(z,s)]i={zi, if  i∈supp(z,s),0, otherwise, (3) where supp(z,s) denotes the set of indexes of the largest s absolute elements of z. For example, if z=[−7, 5, 1, 2,−3]T in Eq. (3), then supp(z,2)={1, 2} and P(z,2)=[−7, 5, 0, 0, 0]T. 2.2 ESPCA Suppose G is a group structure. When G={g1,…,gM} is non-overlapping, the group sparse penalties with L1-norm (GL1) and L0-norm (GL0) are ||u||GL1=∑gi∈G||ugi|| and ||u||GL0=∑gi∈GI(||ugi||≠0) respectively. However, a number of applications need to consider overlapping group structures (Jacob et al., 2009). For example, two linked genes in the gene interaction network, can be considered as a group. Obviously, such edge-groups are overlapping. We denoted G={e1,…,eM} as an edge set with all edges from a given gene interaction network. Here we presented an Edge-group Sparse penalty (ES-penalty) as follows:   ||u||ES=minimize∀G′⊆G, support(u)⊆V(G′) |G′|, (4) where G′ is a subset of G, V(G′) is a vertex (gene) set induced from the edge set G′, |G′| denotes the number of elements of G′, and support(u) denotes the set of indexes of nonzero elements of u. A simple example is used to illustrate formula ( 4). If u=(5,3,2,0)T with an edge set G={(1, 2),(1,4),(2, 3),(3, 4)} in the formula ( 4). We can find an edge set G′={(1, 2),(2, 3)} with two edges and V(G′)={1,2,3} with three vertices to satisfy support(u)⊆V(G′), where support(u)={1, 2, 3}. Thus ||u||ES=2. Note that if G is a non-overlapping group structure, then ||u||ES reduces to ||u||GL0. In other words, the ES-penalty leads to a sparse PC loading whose nonzero elements selected based on some important gene interactions (edges) of G (Fig. 1B). Finally, we developed a regularized PCA with ES-penalty (ESPCA) for network-guided high dimensional gene expression data analysis:   maximize||u||2≤1,||v||2≤1 uTXv, s.t. ||u||ES≤k, (5) where u is the first PC loading, v is the first PC and k is a parameter to control the number of edges selected. 2.3 Learning algorithm for ESPCA The key of solving problem (5) is to solve a projection problem with fixed v and z=Xv in (5) as follows:   maximize||u||2≤1 uTz, s.t. ||u||ES≤k. (6) It is a NP-hard combinatorial optimization problem. Here we adopted a greedy strategy to get a proximal solution of (6) (Algorithm 1). Algorithm 1 Greedy k-edge sparse projection Require: z∈ℝp×1, parameter k, edge set G={e1,⋯,eM} 1: Let normG(z)=(||ze1||,…,||zeM||)T 2: Extract top k edge indexes: I=supp(normG(z),k) 3: G′={ei| i∈I, ei∈G} #The selected k edges. 4: VG′=V(G′) #The selected genes induced from G′. 5: Let u^=0∈ℝp×1 6: for any gene i in VG′do 7:   u^i=zi 8: end for 9: u=u^||u^|| 10: return u and PG(z,k)=u^ For convenience, the proximal solution of (6) is denoted as u=u^||u^|| where u^=PG(z,k), named as a k-edge sparse projection operator, and [PG(z,k)]i ( i=1,…,p) meets:   [PG(z,k)]i={zi,if  G(i)∩supp(normG(z),k)≠∅,0, otherwise, (7) where G(i) is the set of indexes of the edges containing gene i. That is, if the gene i is in the selected edges, then [PG(z,k)]i=zi, otherwise [PG(z,k)]i=0. Figure 1B illustrates an example with steps (1)–(5) to explain it. Based on Algorithm 1, we can thus solve the problem (5) by using the following alternating iterative strategy until convergence: u←u^||u^||, where u^=PG(z,k) and z=Xv, v←v^||v^||, where v^=XTu. Furthermore, ESPCA can be applied to generate multiple PCs and PC loadings. Specifically, given the current PCs, we adopted the following model to compute the next PC and its loading (Witten et al., 2009):   maximize||uℓ||2≤1,||vℓ||2≤1uℓTXvℓ    subject to||uℓ||ES<kℓ,vℓ⊥v1,…,vℓ−1. (8) where ℓ≥2 and the pairs {ui, vi} ( i=1,…,ℓ−1) are known, and vℓ must be orthogonal with v1,…,vℓ−1. Here, we adopted the alternating iterative strategy to solve Eq. (8) by fixing uℓ update vℓ and vice versa. Thus, the key of solving problem (8) is to solve another projection problem with fixed uℓ:   maximize||vℓ||2≤1uℓTXvℓsubject tovℓ⊥v1,…,vℓ−1. (9) Here we employed a Gram-Schmidt orthogonalization strategy, which has been used to solve FastICA for multiple components extraction (Hyvärinen and Oja, 2000). Let Vℓ−1=[v1;⋯;vℓ−1]∈ℝn×(ℓ−1) and Vℓ−1⊥∈ℝn×(n−ℓ+1), whose column space is orthogonal complement with the column space of Vℓ−1, i.e. [Vℓ−1;Vℓ−1⊥] is an orthogonal basis. Since vℓ⊥v1,…,vℓ−1 in Eq. (9), vℓ belongs to the column space of Vℓ−1⊥. Thus, vℓ can be written as Vℓ−1⊥β. We replaced vℓ with Vℓ−1⊥β in Eq. (9):   maximizeβuℓTXVℓ−1⊥βsubject to||β||2≤1. (10) We can easily get the optimal solution of problem (10) as follows:   β=Vℓ−1⊥TXTuℓ||Vℓ−1⊥TXTuℓ||. (11) Since vℓ=Vℓ−1⊥β, we replaced β using formula (11) and ensured vℓ to be a unit vector, thus we can obtain the optimal solution of problem (9):   vℓ=Vℓ−1⊥Vℓ−1⊥TXTuℓ||Vℓ−1⊥Vℓ−1⊥TXTuℓ||, (12) where Vℓ−1⊥Vℓ−1⊥T is not known. Next we introduced Lemma 1 to compute it. Lemma 1. If the column space of Vℓ−1⊥is orthogonal complement with the column space of Vℓ−1, then Vℓ−1⊥Vℓ−1⊥T=I−Vℓ−1Vℓ−1T. Proof. Let V=[Vℓ−1⊥;Vℓ−1], thus V is an orthogonal matrix. Based on the property of orthogonal matrix, VVT=I, we can get [Vℓ−1⊥;Vℓ−1][Vℓ−1⊥;Vℓ−1]T=I. It leads to Vℓ−1⊥Vℓ−1⊥T+Vℓ−1Vℓ−1T=I. Thus, the Lemma is true. □ Theorem 1. The optimal solution of problem (9) is   vℓ=v^ℓ||v^ℓ||,  where  v^ℓ=(I−Vℓ−1Vℓ−1T)XTuℓ. (13) Proof. Based on the Lemma 1, by replacing Vℓ−1⊥Vℓ−1⊥T with I−Vℓ−1Vℓ−1T in Eq. (12), we can get the optimal solution of problem (9) as vℓ=v^ℓ||v^ℓ||, where v^ℓ=(I−Vℓ−1Vℓ−1T)XTuℓ. □ When vℓ is fixed in the problem (8), we adopted the greedy k-edge sparse projection strategy (Algorithm 1) to learn uℓ. Finally, we proposed the following alternating iterative strategy to solve problem (8): uℓ←u^||u^||, where u^=PG(z,kℓ) and z=Xvℓ, vℓ←v^||v^||, where v^=(I−Vℓ−1Vℓ−1T)XTuℓ. Algorithm 2 ESPCA Require: X∈ℝp×n, edge set G, parameters L and k 1: Let X1=X 2: Let U= ∈ℝp×L, V= ∈ℝn×L, D= ∈ℝL×L 3: for ℓ=1 to Ldo 4:  Initialize vℓ with ||vℓ||2=1 5:   Vℓ−1=V[ ,1:ℓ] 6:  repeat 7:   Let z=Xvℓ 8:    uℓ←u^||u^||, where u^=PG(z,k) 9:    vℓ←v^||v^||, where v^=(I−Vℓ−1Vℓ−1T)XTuℓ 10:    dℓ=uℓTXℓv 11:  until convergence 12:   U[ ,ℓ]=uℓ, V[ ,ℓ]=vℓ, D[ℓ,ℓ]=dℓ 13:   Xℓ+1=Xℓ−dℓuℓvℓT 14: end for 15: return U, V and D In summary, we proposed Algorithm 1 to extract L PCs (columns of V) and PC loadings (columns of U). Note that it is stopped when some convergence criteria are satisfied (step 11). The computational complexity of Algorithm 2 is dominated by the cost of step 8 (k-edge sparse projection) and step 9 (multiplication between matrix and vector). A linear time selection algorithm (Quick Select) is used to select the largest k absolute values for a given vector. Thus, the cost of step 8 is O(|G|), where |G| is the number of edges in G. And the cost of step 9 is O(n2L+np). We set the maximum number of iterations to T between step 6 and step 11. Thus, the computational complexity of Algorithm 2 is O(n2L·TL+np·TL+|G|·TL). Note that we cannot guarantee that the algorithm converges to the optimal solution due to the non-convexity of this problem. Thus, we repeated our algorithm with a number of different random initial solutions. One of the advantages of Algorithm 2 is to ensure that these PCs (v1,v2,…,vL) are orthogonal to each other. However, there is still a subtle issue unresolved: How to ensure the orthogonality of the PC loadings (u1,u2,…,uL). Sometimes orthogonality of us may be at odds with sparsity of us. If we enforce orthogonality of us, then we may destroy the sparsity of us. To balance the sparsity and orthogonality, a sparse orthogonal decomposition strategy may be used (Ma et al., 2014). 2.4 ESPCA for multiple expression matrices Generally, PCA and its variants have been applied to a single gene expression dataset (Chung and Storey, 2015; Ho et al., 2016; Ji et al., 2013; Liu et al., 2016; Ma and Dai, 2011). Here, we extended ESPCA to deal with more than one expression matrix. Suppose Xi is the ith cancer-type (or cell-type) gene expression matrix with ℝp×ni ( i=1,…,t), where p is the number of genes and ni is the number of samples. We extended ESPCA model for joint analysis of gene expression data with samples from different cancer-types (or cell-types) to study their sharing characteristics:   maximize||u||2≤1 ∑i=1tuTXiXiTu, s.t. ||u||ES≤k. (14) It is easy to verify that ∑i=1tuTXiXiTu=uTXXTu, where X=[X1,…,Xt]∈ℝp×(n1+⋯+nt). Thus, solving problem (14) is equivalent to solving problem (5). Based on the outputs of Algorithm 2: U and V, we extract gene modules. Specifically, for each sparse PC loading uℓ ( 1≤ℓ≤L), the nonzero elements of uℓ are used to form a gene module. Meanwhile, the high-dimensional data points (columns of X) can be projected to low dimensional data points (columns of VT): X∈ℝp×n↦VT∈ℝL×n, where L≪p. 2.5 Model selection In ESPCA, a key issue is how to determine the parameter k (i.e. the number of edges) and rank L (i.e. the number of modules). In fact, how to determine these parameters is still an open problem for ESPCA. Inspired by previous studies (Lee et al., 2010), we can select the parameters by minimizing the following Bayesian Information Criterion (BIC):   BIC(k,L)=X−∑l=1LdlulvlTn·p+ log ⁡(n·p)n·p∑l=1L(||ul||0+||vl||0) Thus, k and L can always be chosen via minimizing BIC. Like previous studies, PCA or SPCA consider two principal components for further analysis (Sill et al., 2015). In the paper, we empirically set k = 100 and L = 2 as default values for ESPCA. This ensures that we yield two gene modules with less than 100 genes. 2.6 Functional analysis of gene modules We first assessed whether the genes from each module were tightly connected in the gene interaction network. For a given gene module i with pi genes and mi edges in the prior gene interaction network, we defined the edge-density (ED) score of this module as ED=mi/(pi2)M/(p2), where M is the total number of edges (interactions) and p is the total number of genes in the gene interaction network. We applied the right-tailed hypergeometric test to assess the enrichment significance. We also collected 1607 cancer genes from the allOnco database which merges some different cancer genes from several databases (http://www.bushmanlab.org/links/genelists). We also employed the right-tailed hypergeometric test to compute a significance level to measures whether these genes from the same module are significantly overlapped with the cancer genes. Furthermore, to assess the biological relevance of these genes from each module, we downloaded multiple gene functional annotations including GO biological processes (GOBP), KEGG pathways and reactome pathways from Molecular Signatures Database (MSigDB) (http://software.broadinstitute.org/gsea/msigdb). In total, we obtained 4653 GOBP terms, 186 KEGG pathway terms and 674 reactome pathway terms. We performed functional enrichment analysis for the genes from each module using hypergeometric test adjusted with the Benjamini-Hochberg (BH) correction (P-value <0.05). 2.7 Biological data We first considered a combined gene expression dataset of multiple cancer types and the gene expression data of each cancer was extracted from level 3 RNA-seq data in TCGA database (Weinstein et al., 2013) (https://gdc-portal.nci.nih.gov/). The 12 cancer types used in this study include bladder urothelial carcinoma (BLCA, 134 samples), breast invasive carcinoma (BRCA, 847 samples), colon and rectum carcinoma (CRC, 550 samples), head and neck squamous-cell carcinoma (HNSC, 303 samples), kidney renal clear-cell carcinoma (KIRC, 474 samples), brain lower grade glioma (LGG, 179 samples), lung adenocarcinoma (LUAD, 350 samples), lung squamous-cell carcinoma (LUSC, 315 samples), prostate adenocarcinoma (PRAD, 170 samples), skin cutaneous melanoma (SKCM, 234 samples), thyroid carcinoma (THCA, 224 samples), uterine corpus endometrioid carcinoma (UCEC, 478 samples). We also downloaded a gene interaction network built by integrating multiple data sources (e.g. HPRD, NCI-PID, Interact and BioGRID) from Pathway-Commons database (http://www.pathwaycommons.org/) as prior knowledge for the two biological applications. Finally, we obtained the pan-cancer gene expression dataset with 4258 samples and 10 399 genes, and a gene interaction network with 10 399 genes and 257 039 gene interactions. We also downloaded the corresponding clinical data of the 12 cancer types from Firehose (http://firebrowse.org/) for further analysis. We obtained another gene expression dataset based on the RNA-seq technique from ENCODE project (Dunham et al., 2012) (http://www.genome.ucsc.edu/ENCODE/). The ENCODE expression dataset contains 158 cell lines (samples) including GM12878 (29 samples), K562 (24 samples), HepG2 (12 samples), MCF7 (11 samples), HelaS3 (8 samples), A549 (4 samples), others (70 samples). The raw expression level of each gene was quantified by RPKM (reads per kilobase of transcript per million reads). Genes with expression in less than 20% samples were filtered out. For each sample, these raw gene expression values were log2-transformed after replacing all the zero values with the lowest nonzero values in the corresponding sample. Finally, we obtained a gene expression data with 9486 genes and 158 cell lines, and a gene interaction network with 9486 genes and 219 304 gene interactions. 3 Simulation study In this section, we first presented a simple example to explain how ESPCA overcomes the limitation of SPCA. We generated the simulated gene expression matrix X and an interaction network (i.e. an edge set G): Generate two PC loadings by the following formulas:   u1=[1,−1,0.7,0.1,−0.5,[rep(0,5)]T]T, (15a)  u2=[[rep(0,5)]T,0.1,−2,−0.5,0.3,0.1]T, (15b) where rep(0,a) denotes a column-vector of size a, whose elements are zeros. Generate two PCs by the following formulas:   v1=rnorm(100), v2=rnorm(100), (16) where rnorm(b) denotes a column-vector of size b, whose elements are randomly sampled from the standard normal distribution. Generate the expression matrix X∈ℝ10×100 with 10 genes and 100 samples by the following formula:   X=d1u1v1T+d2u2v2T+γϵ, (17) where d1=10, d2=5, γ = 5 and ϵ∼iidN(0,1). Generate the gene interactions (edges) with G=G1∪G2, where G1 and G2 are as follows:   G1={(1,2),(1,3),(1,5),(2,3),(2,4),(3,4),(4,5)},  G2={(6,7),(6,10),(7,8),(7,10),(8,9),(8,10),(9,10)}. We tested ESPCA and compared it with PCA and SPCA by applied them to this simulated data (Table 1, Supplementary Material and Supplementary Figs S1A and S2B). The true patterns of the simulated data with respect to PC loadings is that the first five variables are related to PC1 loading and the last five variables are related to PC2 loading. We ensured that each identified PC loading by ESPCA (k = 6 in Eq. 5) has five nonzero entries. We also set s = 5 (Eq. 2) for SPCA to ensure that each identified PC loading with five nonzero entries for comparison. Clearly, ESPCA correctly extracted all five expected variables for PC1 and PC2 loadings, respectively (Table 1). However, SPCA misses the 4th variable for PC1 loading and the 6th variable for PC2 loading, and PCA only obtained two non-sparse PC loadings. Intuitively, we can see that the absolute value of the signal value of the 4th variable is less than the 7th variable and the 7th variable for the PC1 loading of PCA (second column of Table 1). It causes that SPCA misses the 4th variable and mistakenly extracts 7th variable (fourth column in Table 1). However, the 4th variable is connected to multiple important variables (the 2nd, 3rd, 5th variables) in the prior edges G. Thus, the 4th variable also tends to be an important one. ESPCA is able to correctly identify it by integrating the edge information G. Similarly, we can also explain these results with respect to PC2 loadings. Table 1. The top two identified PC1 and PC2 loadings by PCA, SPCA and ESPCA, respectively Method  PCA   SPCA   ESPCA   PC  PC1  PC2  PC1  PC2  PC1  PC2  Var1  0.57  −0.08  0.57  0.08  0.57    Var2  −0.61  −0.04  −0.61    −0.61    Var3  0.45  −0.04  0.45    0.46    Var4  0.02  −0.01      0.02    Var5  −0.30  0.08  −0.30    −0.31    Var6  −0.04  0.04        0.04  Var7  −0.07  −0.95  −0.07  0.95    −0.96  Var8  −0.01  −0.24    0.24    −0.24  Var9  −0.01  0.12    −0.13    0.12  Var10  −0.01  0.11    −0.11    0.11  Method  PCA   SPCA   ESPCA   PC  PC1  PC2  PC1  PC2  PC1  PC2  Var1  0.57  −0.08  0.57  0.08  0.57    Var2  −0.61  −0.04  −0.61    −0.61    Var3  0.45  −0.04  0.45    0.46    Var4  0.02  −0.01      0.02    Var5  −0.30  0.08  −0.30    −0.31    Var6  −0.04  0.04        0.04  Var7  −0.07  −0.95  −0.07  0.95    −0.96  Var8  −0.01  −0.24    0.24    −0.24  Var9  −0.01  0.12    −0.13    0.12  Var10  −0.01  0.11    −0.11    0.11  Note: Empty cells are zeros. Table 1. The top two identified PC1 and PC2 loadings by PCA, SPCA and ESPCA, respectively Method  PCA   SPCA   ESPCA   PC  PC1  PC2  PC1  PC2  PC1  PC2  Var1  0.57  −0.08  0.57  0.08  0.57    Var2  −0.61  −0.04  −0.61    −0.61    Var3  0.45  −0.04  0.45    0.46    Var4  0.02  −0.01      0.02    Var5  −0.30  0.08  −0.30    −0.31    Var6  −0.04  0.04        0.04  Var7  −0.07  −0.95  −0.07  0.95    −0.96  Var8  −0.01  −0.24    0.24    −0.24  Var9  −0.01  0.12    −0.13    0.12  Var10  −0.01  0.11    −0.11    0.11  Method  PCA   SPCA   ESPCA   PC  PC1  PC2  PC1  PC2  PC1  PC2  Var1  0.57  −0.08  0.57  0.08  0.57    Var2  −0.61  −0.04  −0.61    −0.61    Var3  0.45  −0.04  0.45    0.46    Var4  0.02  −0.01      0.02    Var5  −0.30  0.08  −0.30    −0.31    Var6  −0.04  0.04        0.04  Var7  −0.07  −0.95  −0.07  0.95    −0.96  Var8  −0.01  −0.24    0.24    −0.24  Var9  −0.01  0.12    −0.13    0.12  Var10  −0.01  0.11    −0.11    0.11  Note: Empty cells are zeros. We also applied ESPCA and SPCA to the second simulated data for further comparison and evaluation. We first generated two sparse PC loadings and two PCs by using u1=u^1/||u^1|| with u^1=[rnorm(100); rep(0,100)], u2=u^2/||u^2|| with u^2=[rep(0,100); rnorm(100)], and v1=v^1/||v^1|| with v^1=rnorm(100), v2=v^2/||v^2|| with v^2=rnorm(100), and d1=10, d2=5. Then a sequence of observed matrices Xs were generated by using formula (17) with different noise levels γ∈[0.1,0.5] in steps of 0.1. For each γ, we generated 50 observed matrices Xs by using formula (17). And a common prior gene interaction network for row variables of X was generated which contains two subnetwork, the first network contains the first 100 nodes (genes) which are connected with probability P = 0.3, and the second contains the remaining ones connected with probability P = 0.3. For each γ, we applied ESPCA to its corresponding 50 observation matrices. The parameter k of ESPCA (Algorithm 2) was set as 400 (i.e. 400 edges) to ensure that the identified two sparse PC loadings can reveal the true genes. We also applied SPCA to the same observation matrices for comparison and ensured the sparse PC loadings identified by SPCA has the same sparse level with those identified by ESPCA. Three criteria (sensitivity, specificity and accuracy) were used to evaluate the performance of ESPCA and SPCA for extracting the patterns of PC1 and PC2 loadings. We found that ESPCA outperforms SPCA with respect to PC1 and PC2 loadings at all different noise levels (Fig. 2). Since the eigenvalue d12 of PC1 loading are greater than the eigenvalue d22 of PC2 loading, both ESPCA and SPCA obtain higher sensitivity and specificity for PC1 loading than those for PC2 loading at different noise levels. In addition, as the noise level becomes larger, both sensitivity and specificity of ESPCA and specificity of SPCA are getting smaller. In short, overall sensitivity and specificity of ESPCA are higher than those of SPCA in all cases. Finally, we also show that random prior networks will affect the performance of ESPCA greatly (Supplementary Material and Supplementary Fig. S2). Fig. 2. View largeDownload slide Performance comparison of SPCA and ESPCA in terms of sensitivity, specificity and accuracy (ACC) in the second simulated data where sensitivity measures proportion of true nonzero elements of each PC loading that are correctly identified, and specificity measures the proportion of zero elements that are correctly identified. The bar plots of the average values of 50 realizations on the simulated data with respect to different noise levels were shown with error bars Fig. 2. View largeDownload slide Performance comparison of SPCA and ESPCA in terms of sensitivity, specificity and accuracy (ACC) in the second simulated data where sensitivity measures proportion of true nonzero elements of each PC loading that are correctly identified, and specificity measures the proportion of zero elements that are correctly identified. The bar plots of the average values of 50 realizations on the simulated data with respect to different noise levels were shown with error bars 4 Biological applications We applied ESPCA to the two biological datasets and set k = 100 to yield modules of less than 100 genes. 4.1 Application to the pan-cancer dataset We applied ESPCA to the TCGA pan-cancer dataset and compared its performance with that of SPCA (Journée et al., 2010; Yuan and Zhang, 2013). ESPCA and SPCA cost about nine and one minutes to identify a PC for this real data respectively (similarly for the ENCODE dataset). In addition, both ESPCA and SPCA algorithms use five random initial solutions and the one with the largest objective value is considered as the final one. For comparison, we extracted the top two PC loadings by SPCA with the same number of nonzero genes as those identified by ESPCA. As we expected, we found that the two gene modules (ESPC1 and ESPC2) identified by ESPCA contain more gene interactions in the prior gene interaction network than those two modules (SPC1 and SPC2) identified by SPCA significantly. Moreover, ESPCA identified 20 (of 86) and 13 (of 55) cancer genes in ESPC1 dodule (P-value < 0.036, hypergeometric test) and ESPC2 module (P-value < 0.073, hypergeometric test), respectively. However, SPCA only identified 17 and 8 cancer genes in SPC1 module (P-value < 0.168, hypergeometric test) and SPC2 module (P-value < 0.63, hypergeometric test), respectively. In summary, the number of edges, cancer genes and enriched functional terms, and the ED score for PC loadings identified by ESPCA are all higher than those of SPCA (Fig. 3A). Fig. 3. View largeDownload slide Comparison of the results based on the top two PC loadings identified by ESPCA and SPCA in the TCGA pan-cancer dataset. (A) Summary statistics in the two modules identified by ESPCA and SPCA, respectively. #gene, #edge, #cancer-gene and #GO BP (or #KEGG, #reactome) are the number of genes, edges, cancer genes and enriched GO BP (or KEGG, reactome) terms (*P < 0.05 and ** P < 0.01); ED: edge-density score. (B) Two scatter plots based on the first two PCs by ESPCA and SPCA respectively. Data points were colored by the 12 different cancer types. (C) Box plot of normalized mutual information (NMI) scores based on 50 random seeds for K-means (which was used to cluster TCGA pan-cancer samples based on the first two PCs). P-value was calculated using the Wilcoxon rank sum test (Color version of this figure is available at Bioinformatics online.) Fig. 3. View largeDownload slide Comparison of the results based on the top two PC loadings identified by ESPCA and SPCA in the TCGA pan-cancer dataset. (A) Summary statistics in the two modules identified by ESPCA and SPCA, respectively. #gene, #edge, #cancer-gene and #GO BP (or #KEGG, #reactome) are the number of genes, edges, cancer genes and enriched GO BP (or KEGG, reactome) terms (*P < 0.05 and ** P < 0.01); ED: edge-density score. (B) Two scatter plots based on the first two PCs by ESPCA and SPCA respectively. Data points were colored by the 12 different cancer types. (C) Box plot of normalized mutual information (NMI) scores based on 50 random seeds for K-means (which was used to cluster TCGA pan-cancer samples based on the first two PCs). P-value was calculated using the Wilcoxon rank sum test (Color version of this figure is available at Bioinformatics online.) We then evaluated the biological relevance of the identified modules. The two modules ESPC1 and ESPC2 were significantly enriched in 54, 6, 7 and 81, 2, 26 GOBP, KEGG and reactome terms, respectively (Fig. 3A). As expected, the gene enrichment analysis revealed a number of cancer related pathways such as biological adhesion, epidermis development, cell cycle, cell division, and ECM-receptor interaction, cell adhesion molecules cams and pathways in cancer. For example, seven genes (ITGB4, ITGB6, LAMC2, LAMB3, LAMA3, SDC1, SDC3) from the ESPC1 module were discovered in the KEGG ECM-receptor interaction pathway, and six genes (PRKDC, ATR, STAG1, RAD21, SMC1A, SMC3) from the ESPC2 module were discovered in the KEGG cell cycle pathway (Supplementary Fig. S4A). However, we only found 26, 0, 4 and 1, 0, 0 GOBP, KEGG and reactome terms for the SPC1 and SPC2 modules, respectively (Fig. 3A). Moreover, the enriched functional terms of ESPCA have distinct lower P-values than those of SPCA (Supplementary Fig. S4B). These results suggest that ESPCA can identify more biologically relevant gene sets than SPCA. We also compared the ability of PCs of ESPCA and SPCA respectively to discover sample patterns. The top two PCs identified by ESPCA demonstrates more distinct samples patterns especially for LGG, SKCM, KIRC than those of SPCA visually (Fig. 3B). Furthermore, we also performed K-means clustering with 50 realizations on each dataset and computed the normalized mutual information (NMI) that measures the similarity between the K-means clustering and the true sample classes. The average NMI was significantly higher for ESPCA reduced data than that for SPCA reduced data (P < 2.2e-16, one-sided Wilcoxon rank sum, Fig. 3C). Finally, we found that the ESPC1 and ESPC2 were significantly related to LGG and UCEC with z-score > 2, respectively (Fig. 4A). Intriguingly, ESPC1 module is significantly associated with survival of LGG and ESPC2 is associated with survival with UCEC, respectively (Fig. 4B), suggesting their functional specificity of these PCs. Fig. 4. View largeDownload slide (A) Two bar plots of z-scores based on the first two PCs identified by ESPCA in the TCGA Pan-cancer dataset. The z-score for ESPC1 (or ESPC2) module was computed based on the cancer relevance score vector. The cancer relevance score was calculated based on the average of the squares of its PC values of the tumors in each cancer. (B) ESPC1 and ESPC2 modules are significantly associated with survival of LGG and UCEC, respectively. For LGG and ESPC1 module (similarly for UCEC and ESPC2 module), LGG tumors were divided into two groups based on the median of their corresponding PC coefficients and the Kaplan-Meier survival curve were drawn to show the difference between the two group patients. Log rank test was used to calculate the P-value and ‘+’ denotes the censored patients Fig. 4. View largeDownload slide (A) Two bar plots of z-scores based on the first two PCs identified by ESPCA in the TCGA Pan-cancer dataset. The z-score for ESPC1 (or ESPC2) module was computed based on the cancer relevance score vector. The cancer relevance score was calculated based on the average of the squares of its PC values of the tumors in each cancer. (B) ESPC1 and ESPC2 modules are significantly associated with survival of LGG and UCEC, respectively. For LGG and ESPC1 module (similarly for UCEC and ESPC2 module), LGG tumors were divided into two groups based on the median of their corresponding PC coefficients and the Kaplan-Meier survival curve were drawn to show the difference between the two group patients. Log rank test was used to calculate the P-value and ‘+’ denotes the censored patients 4.2 Application to the ENCODE dataset We also applied ESPCA and SPCA to the ENCODE gene expression dataset from multiple cell types. Similar to the TCGA pan-cancer application, we found that the two gene modules (ESPC1 and ESPC2) identified by ESPCA (Supplementary Fig. S5A) contain more gene interactions in the prior gene interaction network than those two modules (SPC1 and SPC2) identified by SPCA significantly (Fig. 5A). We also found that ESPCA identified 14 (of 69) and 30 (of 96) cancer genes in them, respectively. However, SPCA only identified 8 and 15 cancer genes in SPCA modules, respectively. In short, the number of the edges, cancer genes and enriched functional terms, and the ED score for PC loadings identified by ESPCA were higher than those by SPCA (Fig. 5A). Fig. 5. View largeDownload slide Comparison of the results based on the top two PC loadings identified by ESPCA and SPCA in the ENCODE dataset. (A) Summary statistics in these modules identified by ESPCA and SPCA, respectively. (B) Two scatter plots based on the first two PCs by ESPCA and SPCA. Data points were colored by the different cell lines. (C) Boxplot of NMI scores based on 50 random seeds for K-means (which was used to cluster ENCODE samples based on the first two PCs). P-value was calculated using Wilcoxon rank sum test (Color version of this figure is available at Bioinformatics online.) Fig. 5. View largeDownload slide Comparison of the results based on the top two PC loadings identified by ESPCA and SPCA in the ENCODE dataset. (A) Summary statistics in these modules identified by ESPCA and SPCA, respectively. (B) Two scatter plots based on the first two PCs by ESPCA and SPCA. Data points were colored by the different cell lines. (C) Boxplot of NMI scores based on 50 random seeds for K-means (which was used to cluster ENCODE samples based on the first two PCs). P-value was calculated using Wilcoxon rank sum test (Color version of this figure is available at Bioinformatics online.) Moreover, the two modules ESPC1 and ESPC2 were significantly enriched in 67, 4, 54 and 63, 11, 12 GOBP, KEGG and reactome terms, respectively (Fig. 5A). However, only 0, 1 and 7 significant terms are found for SPC1, and nothing for SPC2. Correspondingly, the enriched functional terms of ESPCA have more lower P-value than those of SPCA (Supplementary Fig. S5B). Moreover, compared to SPCA modules, ESPCA modules tend to be related with some cancer related functional terms such as metabolism of RNA, ECM receptor interaction, focal adhesion. Finally, the visualized PC plots of ESPCA and SPCA showed that ESPCA can reveal more distinct classes than those by SPCA (Fig. 5B). Clearly, samples of K562 and GM12878 cell types are more separated under ESPCA than those of SPCA. Moreover, the average NMI is significantly higher for ESPCA reduced data than that for SPCA reduced data (Fig. 5C). All the results suggested that ESPCA could reveal more biologically relevant gene sets and more underlying expression patterns. 5 Discussion and conclusion On one hand, sparse PCA is a typical unsupervised learning method for dimension reduction and feature selection. On the other hand, network-based methods for analysis have been employed to extract gene biomarkers. Inspired by these two facts, we proposed an Edge-group sparse PCA (ESPCA) for high-dimensional data analysis. ESPCA employs the gene interaction network to guide the selection of relevant genes. Mathematically, we considered an edge-group sparse penalty in the proposed ESPCA framework. The key of solving it lies in a subproblem, called k-edge sparse projection operation. We designed a greedy k-edge sparse projection algorithm to solve it. Based on the greedy algorithm, we developed an alternating iterative algorithm to solve ESPCA. One of the advantages of ESPCA is to ensure the obtained PCs are orthogonal. We first applied ESPCA and SPCA to a set of simulated data and showed that ESPCA were more effective compared to SPCA. We also applied ESPCA for joint analysis of gene expression data with samples from different cancer-types (or cell-types) to study their sharing characteristics. Two biological datasets were tested in our study, including the TCGA pan-cancer expression data with about 4000 tumor samples from different cancers, and the ENCODE expression dataset with 158 cell lines from different cell types. ESPCA could identify gene modules with significant biological relevance for the two real datasets and yield better biological interpretations by integrating a gene interaction network. We note that the singleton genes in the prior network (if there are such genes) can also be selected by ESPCA (Supplementary Material and Supplementary Fig. S6). Moreover, the distinct functional enrichment is not caused mostly by the prior network (Supplementary Material and Supplementary Fig. S7). Although PCA-based methods are usually employed for dimension reduction with two or three dimensions, ESPCA can be used to extract more gene modules (Supplementary Material and Supplementary Tables S1 and S2). Lastly, compared with typical PCA, ESPCA can reveal more biologically relevant modules (Supplementary Material and Supplementary Fig. S8). In the future, ESPCA may be extended in the following aspects. (i) In this paper, we only considered two linked genes (an edge) in a given gene interaction network as a group and proposed an edge-group sparse penalty in Eq. (4). In fact, one alternative is that we can also consider the multiple genes from the same functional class (e.g. GO BP term, KEGG pathway) as a group in such a penalty and extend it to consider more structured knowledge. (ii) Moreover, in the paper, we applied ESPCA for multiple expression matrices with equal importance to each one in Eq. (14). However, the actual situation is that these expression matrices from different cancers (or cell types), which have different characteristics, such as different number of samples. Thus, we may set different weights for the cancer-type gene expression matrices. Mathematically, we will get a new objective function ∑i=1tλiuXiXiTuT, which will lead to a regularized tensor PCA model with edge-sparse penalty. (iii) Finally, we believe that the concept of edge-group sparse penalty will be valuable and can be extended to other statistical learning frameworks with some optimization techniques. Acknowledgements Wenwen Min would like to thank the support of the Academy of Mathematics and Systems Science at CAS during his visit. Funding This work was supported by the National Natural Science Foundation of China [No. 11661141019, 61621003, 61422309 and 61379092]; the Strategic Priority Research Program of the Chinese Academy of Sciences (CAS) [No. XDB13040600], the Key Research Program of the Chinese Academy of Sciences [No. KFZD-SW-219], National Key Research and Development Program of China (2017YFC0908405) and CAS Frontier Science Research Key Project for Top Young Scientist [No. QYZDB-SSW-SYS008]. Conflict of Interest: none declared. References Ansari S. et al.   ( 2017) An approach to infer putative disease-specific mechanisms using neighboring gene networks. Bioinformatics , 33, 1987– 1994. Google Scholar CrossRef Search ADS PubMed  Breschi A. et al.   ( 2016) Gene-specific patterns of expression variation across organs and species. Genome Biol ., 17, 151. Google Scholar CrossRef Search ADS PubMed  Chung N.C., Storey J.D. ( 2015) Statistical significance of variables driving systematic variation in high-dimensional data. Bioinformatics , 31, 545– 554. Google Scholar CrossRef Search ADS PubMed  Deshpande Y., Montanari A. ( 2014) Sparse PCA via covariance thresholding. In: Conference on Neural Information Processing Systems . pp. 334– 342. Dittrich M.T. et al.   ( 2008) Identifying functional modules in protein–protein interaction networks: an integrated exact approach. Bioinformatics , 24, i223– i231. Google Scholar CrossRef Search ADS PubMed  Dunham I. et al.   ( 2012) An integrated encyclopedia of DNA elements in the human genome. Nature , 489, 57– 74. Google Scholar CrossRef Search ADS PubMed  Glaab E. ( 2016) Using prior knowledge from cellular pathways and molecular networks for diagnostic specimen classification. Brief. Bioinform ., 17, 440– 452. Google Scholar CrossRef Search ADS PubMed  Gu Q. et al.   ( 2014) Sparse PCA with oracle property. In: Conference on Neural Information Processing Systems. pp. 1529– 1537. Gwinner F. et al.   ( 2017) Network-based analysis of omics data: the LEAN method. Bioinformatics , 33, 701– 709. Google Scholar PubMed  Ho R. et al.   ( 2016) Als disrupts spinal motor neuron maturation and aging pathways within gene co-expression networks. Nat. Neurosci ., 19, 1256– 1267. Google Scholar CrossRef Search ADS PubMed  Hsu Y.L. et al.   ( 2014) Sparse principal component analysis in cancer research. Transl. Cancer Res ., 3, 182. Google Scholar PubMed  Hudson T.J. et al.   ( 2010) International network of cancer genome projects. Nature , 464, 993– 998. Google Scholar CrossRef Search ADS PubMed  Huisman S. et al.   ( 2017) BrainScope: interactive visual exploration of the spatial and temporal human brain transcriptome. Nucleic Acids Res ., 45, e83. Google Scholar PubMed  Hyvärinen A., Oja E. ( 2000) Independent component analysis: algorithms and applications. Neural Netw ., 13, 411– 430. Google Scholar CrossRef Search ADS PubMed  Jacob L. et al.   ( 2009) Group lasso with overlap and graph lasso. In: International Conference on Machine Learning. ACM, New York, NY, USA, pp. 433– 440. Ji H. et al.   ( 2013) Differential principal component analysis of ChIP-seq. Proc. Natl. Acad. Sci. USA , 110, 6789– 6794. Google Scholar CrossRef Search ADS   Jolliffe I.T. et al.   ( 2003) A modified principal component technique based on the lasso. J. Comput. Graph. Stat ., 12, 531– 547. Google Scholar CrossRef Search ADS   Journée M. et al.   ( 2010) Generalized power method for sparse principal component analysis. J. Mach. Learn. Res ., 11, 517– 553. Lee M. et al.   ( 2010) Biclustering via sparse singular value decomposition. Biometrics , 66, 1087– 1095. Google Scholar CrossRef Search ADS PubMed  Leiserson M.D. et al.   ( 2015) Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nat. Genet ., 47, 106– 114. Google Scholar CrossRef Search ADS PubMed  Lin Z. et al.   ( 2016) Simultaneous dimension reduction and adjustment for confounding variation. Proc. Natl. Acad. Sci. USA , 113, 14662– 14667. Google Scholar CrossRef Search ADS   Liu J.X. et al.   ( 2016) A class-information-based sparse component analysis method to identify differentially expressed genes on RNA-Seq data. IEEE/ACM Trans. Comput. Biol. Bioinform ., 13, 392– 398. Google Scholar CrossRef Search ADS PubMed  Liu Y. et al.   ( 2017) Sigmod: an exact and efficient method to identify a strongly interconnected disease-associated module in a gene network. Bioinformatics , 33, 1536– 1544. Google Scholar PubMed  Lonsdale J. et al.   ( 2013) The genotype-tissue expression (GTEx) project. Nat. Genet ., 45, 580– 585. Google Scholar CrossRef Search ADS PubMed  Ma S., Dai Y. ( 2011) Principal component analysis based methods in bioinformatics studies. Brief. Bioinform ., 12, 714– 722. Google Scholar CrossRef Search ADS PubMed  Ma X. et al.   ( 2014) Learning regulatory programs by threshold SVD regression. Proc. Natl. Acad. Sci. USA , 111, 15675– 15680. Google Scholar CrossRef Search ADS   Rahmani E. et al.   ( 2016) Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies. Nat. Methods , 13, 443– 445. Google Scholar CrossRef Search ADS PubMed  Ringnér M. ( 2008) What is principal component analysis? Nat. Biotechnol ., 26, 303– 304. Google Scholar CrossRef Search ADS PubMed  Ruan P. et al.   ( 2016) NEpiC: a network-assisted algorithm for epigenetic studies using mean and variance combined signals. Nucleic Acids Res ., 44, e134. Google Scholar CrossRef Search ADS PubMed  Sharan R. et al.   ( 2007) Network-based prediction of protein function. Mol. Syst. Biol ., 3, 88. Google Scholar CrossRef Search ADS PubMed  Shen H., Huang J.Z. ( 2008) Sparse principal component analysis via regularized low rank matrix approximation. J. Multivar. Anal ., 99, 1015– 1034. Google Scholar CrossRef Search ADS   Sill M. et al.   ( 2011) Robust biclustering by sparse singular value decomposition incorporating stability selection. Bioinformatics , 27, 2089– 2097. Google Scholar CrossRef Search ADS PubMed  Sill M. et al.   ( 2015) Applying stability selection to consistently estimate sparse principal components in high-dimensional molecular data. Bioinformatics , 31, 2683– 2690. Google Scholar CrossRef Search ADS PubMed  Van Der Maaten L. ( 2014) Accelerating t-SNE using tree-based algorithms. J. Mach. Learn. Res ., 15, 3221– 3245. Weinstein J.N. et al.   ( 2013) The cancer genome atlas pan-cancer analysis project. Nat. Genet ., 45, 1263– 1120. Google Scholar CrossRef Search ADS PubMed  Witten D.M. et al.   ( 2009) A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics , 10, 515– 534. Google Scholar CrossRef Search ADS PubMed  Yuan X.T., Zhang T. ( 2013) Truncated power method for sparse eigenvalue problems. J. Mach. Learn. Res ., 14, 899– 925. Zou H. et al.   ( 2006) Sparse principal component analysis. J. Comput. Graph. Stat ., 15, 265– 286. Google Scholar CrossRef Search ADS   © The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press

Edge-group sparse PCA for network-guided high dimensional data analysis

Loading next page...
 
/lp/ou_press/edge-group-sparse-pca-for-network-guided-high-dimensional-data-n8ViVzxcZO
Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
ISSN
1367-4803
eISSN
1460-2059
D.O.I.
10.1093/bioinformatics/bty362
Publisher site
See Article on Publisher Site

Abstract

Abstract Motivation Principal component analysis (PCA) has been widely used to deal with high-dimensional gene expression data. In this study, we proposed an Edge-group Sparse PCA (ESPCA) model by incorporating the group structure from a prior gene network into the PCA framework for dimension reduction and feature interpretation. ESPCA enforces sparsity of principal component (PC) loadings through considering the connectivity of gene variables in the prior network. We developed an alternating iterative algorithm to solve ESPCA. The key of this algorithm is to solve a new k-edge sparse projection problem and a greedy strategy has been adapted to address it. Here we adopted ESPCA for analyzing multiple gene expression matrices simultaneously. By incorporating prior knowledge, our method can overcome the drawbacks of sparse PCA and capture some gene modules with better biological interpretations. Results We evaluated the performance of ESPCA using a set of artificial datasets and two real biological datasets (including TCGA pan-cancer expression data and ENCODE expression data), and compared their performance with PCA and sparse PCA. The results showed that ESPCA could identify more biologically relevant genes, improve their biological interpretations and reveal distinct sample characteristics. Availability and implementation An R package of ESPCA is available at http://page.amss.ac.cn/shihua.zhang/ Contact zsh@amss.ac.cn or liujuan@whu.edu.cn Supplementary information Supplementary data are available at Bioinformatics online. 1 Introduction With the rapid development of deep sequencing technology, there has been a lot of various RNA-seq datasets on different cancers or tissues from large-scale projects such as the International Cancer Genome Consortium (ICGC) (Hudson et al., 2010), the Encyclopedia of DNA Elements (ENCODE) (Dunham et al., 2012), The Cancer Genome Atlas (TCGA) (Weinstein et al., 2013) and the Genotype-Tissue Expression (GTEx) project (Lonsdale et al., 2013). Generally, these transcriptional data referring to ∼20 000 genes are typically high dimensional ones. Many dimension reduction and variable selection methods have been used to deal with such data for biological pattern discovery (Huisman et al., 2017; Lin et al., 2016; Van Der Maaten, 2014). Principal component analysis (PCA) is a classical dimension reduction method and has been widely adopted in high-dimensional gene expression analysis (Breschi et al., 2016; Chung and Storey, 2015; Ho et al., 2016; Ji et al., 2013; Liu et al., 2016; Ma and Dai, 2011; Ringnér, 2008). PCA and its variants can capture the linear relationship of variables to best explain the latent patterns of samples. However, the non-sparse principal component (PC) loadings by PCA employ all gene variables and lead to limited biological interpretability. Thus, variable selection is needed for gene expression analysis to select a small number of important genes. Recently, a number of studies have focused on developing sparse PCA models to encourage sparsity of PC loadings to extract gene modules with a limited number of genes for better interpretation (Deshpande and Montanari, 2014; Gu et al., 2014; Jolliffe et al., 2003; Journée et al., 2010; Lee et al., 2010; Rahmani et al., 2016; Shen and Huang, 2008; Sill et al., 2011, 2015; Witten et al., 2009; Yuan and Zhang, 2013; Zou et al., 2006). For example, LASSO regularized and elastic net regularized sparse PCA models have been proposed respectively (Jolliffe et al., 2003; Zou et al., 2006). However, these sparse PCA models can not accurately control the sparse level of PC loadings. Thus, the sparse PCA model with L0-penalty has been proposed to solve this issue (Journée et al., 2010; Yuan and Zhang, 2013). Several studies have developed different algorithms to solve these sparse PCA models (Deshpande and Montanari, 2014; Gu et al., 2014; Journée et al., 2010; Shen and Huang, 2008; Witten et al., 2009; Yuan and Zhang, 2013). A kind of commonly used methods employ regularized SVD framework to solve sparse PCA models (Shen and Huang, 2008; Witten et al., 2009). In addition, sparse PCA models have been widely used in cancer research (Hsu et al., 2014). For example, Sill et al. proposed a sparse PCA with stability selection (S4VDPCA) to deal with a medulloblastoma brain gene expression dataset and revealed that genes determined by the first two sparse PC loadings are significantly involved in several key pathways between molecular subgroups of medulloblastoma (Sill et al., 2011). However, these sparse PCA models assume that all the genes have equal prior probability of being selected in each PC loading and cannot integrate prior group structure among gene variables for variable selection. These gene interactions from a prior gene interaction network can be seen as a specific group structure. How to integrate the gene group structure into the sparse PCA framework for gene variable selection is an open challenge. Recently, some network-based classification and regression models have been proposed to identify a context-specific gene set or module via integrating gene expression data and an interaction network (Dittrich et al., 2008; Leiserson et al., 2015; Ruan et al., 2016; Sharan et al., 2007). Moreover, some other network-based methods have also been used to extract active gene modules by integrating differentially expressed levels (P-values) of genes between two given phenotypes and a gene interaction network (Ansari et al., 2017; Gwinner et al., 2017; Liu et al., 2017). However, PCA and sparse PCA do not consider the prior interaction network structure (Glaab, 2016). To our knowledge, little effort has been made to incorporate the group structure among genes from a gene interaction network into the SPCA framework for dimension reduction and feature interpretation. In this paper, we proposed an Edge-group Sparse PCA (ESPCA) for network-guided high dimensional gene expression data analysis (Fig. 1). Each gene interaction in the prior gene interaction network is considered as a group. ESPCA constrains the number of nonzero elements of each PC loading via selecting gene interactions (edges). That is, ESPCA encourages the selected genes in each PC loading to be linked in the prior gene interaction network (Fig. 1A). We developed an alternating iterative algorithm to solve ESPCA (Fig. 1B). Simulation tests showed superior performance of ESPCA over SPCA for variable selection. Extensive results of ESPCA applied onto two datasets (including TCGA pan-cancer expression data and ENCODE expression dataset) with a gene interaction network showed that ESPCA could identify more biologically relevant genes, improve their biological interpretations and reveal distinct sample characteristics. Fig. 1. View largeDownload slide (A) ESPCA for integrative analysis of gene expression and interaction network data. (B) Workflow of ESPCA algorithm for learning the first PC loading (u) and PC (v): (0) Initialize the algorithm with a randomly generated vector v; (1) Rebuild a gene weighted network based on Xv (denoted as z) by using wij=zi2+zj2 for any edge (i,j)∈G; (2) Sort all edge-weights from the weighted network and extract k edges with the largest k weights; (3) Obtain the selected genes based on the top k edges; (4) The elements of PC loading corresponding to non-selected genes are set to zero; (5) Normalize the PC loading to meet the normalizing condition; (6) Update v; (7) Obtain a new z. Repeat the steps (1) to (7) until convergence Fig. 1. View largeDownload slide (A) ESPCA for integrative analysis of gene expression and interaction network data. (B) Workflow of ESPCA algorithm for learning the first PC loading (u) and PC (v): (0) Initialize the algorithm with a randomly generated vector v; (1) Rebuild a gene weighted network based on Xv (denoted as z) by using wij=zi2+zj2 for any edge (i,j)∈G; (2) Sort all edge-weights from the weighted network and extract k edges with the largest k weights; (3) Obtain the selected genes based on the top k edges; (4) The elements of PC loading corresponding to non-selected genes are set to zero; (5) Normalize the PC loading to meet the normalizing condition; (6) Update v; (7) Obtain a new z. Repeat the steps (1) to (7) until convergence 2 Materials and methods Here, we first briefly described a regularized sparse PCA via L0-penalty (SPCA) and then proposed the ESPCA (Fig. 1). 2.1 SPCA Suppose X∈ℝp×n is a gene expression matrix with p genes and n samples, centered for each gene. Then the SPCA via L0-penalty (Journée et al., 2010; Yuan and Zhang, 2013) can be adopted to analyze the gene expression matrix:   maximize||u||2≤1 uTXXTu,  s.t.  ||u||0≤s, (1) where s is a positive integer, ||·||2 (or ||·||) denotes the Euclidean norm, and ||·||0 denotes the L0 norm. Note that ||u||0 is equal to the number of nonzero elements of u. A framework based on singular value decomposition (SVD) has been recommended to solve SPCA (Lin et al., 2016; Sill et al., 2015; Shen and Huang, 2008; Witten et al., 2009). That is, the following problem is used to solve the above one (Equation 1):   maximize||u||2≤1,||v||2≤1 uTXv,  s.t.  ||u||0≤s, (2) where u∈ℝp×1 and v∈ℝn×1 are the first PC loading and PC respectively. The following alternating iterative projection strategy (Journée et al., 2010; Yuan and Zhang, 2013) is used to solve problem (2) until convergence: u←u^||u^||, where u^=P(z,s) and z=Xv, v←v^||v^||, where v^=XTu,where P(z,s) is called s-sparse projection operator. It is a p-dimensional column vector and its ith ( i=1, 2,…,p) element is defined as follows:   [P(z,s)]i={zi, if  i∈supp(z,s),0, otherwise, (3) where supp(z,s) denotes the set of indexes of the largest s absolute elements of z. For example, if z=[−7, 5, 1, 2,−3]T in Eq. (3), then supp(z,2)={1, 2} and P(z,2)=[−7, 5, 0, 0, 0]T. 2.2 ESPCA Suppose G is a group structure. When G={g1,…,gM} is non-overlapping, the group sparse penalties with L1-norm (GL1) and L0-norm (GL0) are ||u||GL1=∑gi∈G||ugi|| and ||u||GL0=∑gi∈GI(||ugi||≠0) respectively. However, a number of applications need to consider overlapping group structures (Jacob et al., 2009). For example, two linked genes in the gene interaction network, can be considered as a group. Obviously, such edge-groups are overlapping. We denoted G={e1,…,eM} as an edge set with all edges from a given gene interaction network. Here we presented an Edge-group Sparse penalty (ES-penalty) as follows:   ||u||ES=minimize∀G′⊆G, support(u)⊆V(G′) |G′|, (4) where G′ is a subset of G, V(G′) is a vertex (gene) set induced from the edge set G′, |G′| denotes the number of elements of G′, and support(u) denotes the set of indexes of nonzero elements of u. A simple example is used to illustrate formula ( 4). If u=(5,3,2,0)T with an edge set G={(1, 2),(1,4),(2, 3),(3, 4)} in the formula ( 4). We can find an edge set G′={(1, 2),(2, 3)} with two edges and V(G′)={1,2,3} with three vertices to satisfy support(u)⊆V(G′), where support(u)={1, 2, 3}. Thus ||u||ES=2. Note that if G is a non-overlapping group structure, then ||u||ES reduces to ||u||GL0. In other words, the ES-penalty leads to a sparse PC loading whose nonzero elements selected based on some important gene interactions (edges) of G (Fig. 1B). Finally, we developed a regularized PCA with ES-penalty (ESPCA) for network-guided high dimensional gene expression data analysis:   maximize||u||2≤1,||v||2≤1 uTXv, s.t. ||u||ES≤k, (5) where u is the first PC loading, v is the first PC and k is a parameter to control the number of edges selected. 2.3 Learning algorithm for ESPCA The key of solving problem (5) is to solve a projection problem with fixed v and z=Xv in (5) as follows:   maximize||u||2≤1 uTz, s.t. ||u||ES≤k. (6) It is a NP-hard combinatorial optimization problem. Here we adopted a greedy strategy to get a proximal solution of (6) (Algorithm 1). Algorithm 1 Greedy k-edge sparse projection Require: z∈ℝp×1, parameter k, edge set G={e1,⋯,eM} 1: Let normG(z)=(||ze1||,…,||zeM||)T 2: Extract top k edge indexes: I=supp(normG(z),k) 3: G′={ei| i∈I, ei∈G} #The selected k edges. 4: VG′=V(G′) #The selected genes induced from G′. 5: Let u^=0∈ℝp×1 6: for any gene i in VG′do 7:   u^i=zi 8: end for 9: u=u^||u^|| 10: return u and PG(z,k)=u^ For convenience, the proximal solution of (6) is denoted as u=u^||u^|| where u^=PG(z,k), named as a k-edge sparse projection operator, and [PG(z,k)]i ( i=1,…,p) meets:   [PG(z,k)]i={zi,if  G(i)∩supp(normG(z),k)≠∅,0, otherwise, (7) where G(i) is the set of indexes of the edges containing gene i. That is, if the gene i is in the selected edges, then [PG(z,k)]i=zi, otherwise [PG(z,k)]i=0. Figure 1B illustrates an example with steps (1)–(5) to explain it. Based on Algorithm 1, we can thus solve the problem (5) by using the following alternating iterative strategy until convergence: u←u^||u^||, where u^=PG(z,k) and z=Xv, v←v^||v^||, where v^=XTu. Furthermore, ESPCA can be applied to generate multiple PCs and PC loadings. Specifically, given the current PCs, we adopted the following model to compute the next PC and its loading (Witten et al., 2009):   maximize||uℓ||2≤1,||vℓ||2≤1uℓTXvℓ    subject to||uℓ||ES<kℓ,vℓ⊥v1,…,vℓ−1. (8) where ℓ≥2 and the pairs {ui, vi} ( i=1,…,ℓ−1) are known, and vℓ must be orthogonal with v1,…,vℓ−1. Here, we adopted the alternating iterative strategy to solve Eq. (8) by fixing uℓ update vℓ and vice versa. Thus, the key of solving problem (8) is to solve another projection problem with fixed uℓ:   maximize||vℓ||2≤1uℓTXvℓsubject tovℓ⊥v1,…,vℓ−1. (9) Here we employed a Gram-Schmidt orthogonalization strategy, which has been used to solve FastICA for multiple components extraction (Hyvärinen and Oja, 2000). Let Vℓ−1=[v1;⋯;vℓ−1]∈ℝn×(ℓ−1) and Vℓ−1⊥∈ℝn×(n−ℓ+1), whose column space is orthogonal complement with the column space of Vℓ−1, i.e. [Vℓ−1;Vℓ−1⊥] is an orthogonal basis. Since vℓ⊥v1,…,vℓ−1 in Eq. (9), vℓ belongs to the column space of Vℓ−1⊥. Thus, vℓ can be written as Vℓ−1⊥β. We replaced vℓ with Vℓ−1⊥β in Eq. (9):   maximizeβuℓTXVℓ−1⊥βsubject to||β||2≤1. (10) We can easily get the optimal solution of problem (10) as follows:   β=Vℓ−1⊥TXTuℓ||Vℓ−1⊥TXTuℓ||. (11) Since vℓ=Vℓ−1⊥β, we replaced β using formula (11) and ensured vℓ to be a unit vector, thus we can obtain the optimal solution of problem (9):   vℓ=Vℓ−1⊥Vℓ−1⊥TXTuℓ||Vℓ−1⊥Vℓ−1⊥TXTuℓ||, (12) where Vℓ−1⊥Vℓ−1⊥T is not known. Next we introduced Lemma 1 to compute it. Lemma 1. If the column space of Vℓ−1⊥is orthogonal complement with the column space of Vℓ−1, then Vℓ−1⊥Vℓ−1⊥T=I−Vℓ−1Vℓ−1T. Proof. Let V=[Vℓ−1⊥;Vℓ−1], thus V is an orthogonal matrix. Based on the property of orthogonal matrix, VVT=I, we can get [Vℓ−1⊥;Vℓ−1][Vℓ−1⊥;Vℓ−1]T=I. It leads to Vℓ−1⊥Vℓ−1⊥T+Vℓ−1Vℓ−1T=I. Thus, the Lemma is true. □ Theorem 1. The optimal solution of problem (9) is   vℓ=v^ℓ||v^ℓ||,  where  v^ℓ=(I−Vℓ−1Vℓ−1T)XTuℓ. (13) Proof. Based on the Lemma 1, by replacing Vℓ−1⊥Vℓ−1⊥T with I−Vℓ−1Vℓ−1T in Eq. (12), we can get the optimal solution of problem (9) as vℓ=v^ℓ||v^ℓ||, where v^ℓ=(I−Vℓ−1Vℓ−1T)XTuℓ. □ When vℓ is fixed in the problem (8), we adopted the greedy k-edge sparse projection strategy (Algorithm 1) to learn uℓ. Finally, we proposed the following alternating iterative strategy to solve problem (8): uℓ←u^||u^||, where u^=PG(z,kℓ) and z=Xvℓ, vℓ←v^||v^||, where v^=(I−Vℓ−1Vℓ−1T)XTuℓ. Algorithm 2 ESPCA Require: X∈ℝp×n, edge set G, parameters L and k 1: Let X1=X 2: Let U= ∈ℝp×L, V= ∈ℝn×L, D= ∈ℝL×L 3: for ℓ=1 to Ldo 4:  Initialize vℓ with ||vℓ||2=1 5:   Vℓ−1=V[ ,1:ℓ] 6:  repeat 7:   Let z=Xvℓ 8:    uℓ←u^||u^||, where u^=PG(z,k) 9:    vℓ←v^||v^||, where v^=(I−Vℓ−1Vℓ−1T)XTuℓ 10:    dℓ=uℓTXℓv 11:  until convergence 12:   U[ ,ℓ]=uℓ, V[ ,ℓ]=vℓ, D[ℓ,ℓ]=dℓ 13:   Xℓ+1=Xℓ−dℓuℓvℓT 14: end for 15: return U, V and D In summary, we proposed Algorithm 1 to extract L PCs (columns of V) and PC loadings (columns of U). Note that it is stopped when some convergence criteria are satisfied (step 11). The computational complexity of Algorithm 2 is dominated by the cost of step 8 (k-edge sparse projection) and step 9 (multiplication between matrix and vector). A linear time selection algorithm (Quick Select) is used to select the largest k absolute values for a given vector. Thus, the cost of step 8 is O(|G|), where |G| is the number of edges in G. And the cost of step 9 is O(n2L+np). We set the maximum number of iterations to T between step 6 and step 11. Thus, the computational complexity of Algorithm 2 is O(n2L·TL+np·TL+|G|·TL). Note that we cannot guarantee that the algorithm converges to the optimal solution due to the non-convexity of this problem. Thus, we repeated our algorithm with a number of different random initial solutions. One of the advantages of Algorithm 2 is to ensure that these PCs (v1,v2,…,vL) are orthogonal to each other. However, there is still a subtle issue unresolved: How to ensure the orthogonality of the PC loadings (u1,u2,…,uL). Sometimes orthogonality of us may be at odds with sparsity of us. If we enforce orthogonality of us, then we may destroy the sparsity of us. To balance the sparsity and orthogonality, a sparse orthogonal decomposition strategy may be used (Ma et al., 2014). 2.4 ESPCA for multiple expression matrices Generally, PCA and its variants have been applied to a single gene expression dataset (Chung and Storey, 2015; Ho et al., 2016; Ji et al., 2013; Liu et al., 2016; Ma and Dai, 2011). Here, we extended ESPCA to deal with more than one expression matrix. Suppose Xi is the ith cancer-type (or cell-type) gene expression matrix with ℝp×ni ( i=1,…,t), where p is the number of genes and ni is the number of samples. We extended ESPCA model for joint analysis of gene expression data with samples from different cancer-types (or cell-types) to study their sharing characteristics:   maximize||u||2≤1 ∑i=1tuTXiXiTu, s.t. ||u||ES≤k. (14) It is easy to verify that ∑i=1tuTXiXiTu=uTXXTu, where X=[X1,…,Xt]∈ℝp×(n1+⋯+nt). Thus, solving problem (14) is equivalent to solving problem (5). Based on the outputs of Algorithm 2: U and V, we extract gene modules. Specifically, for each sparse PC loading uℓ ( 1≤ℓ≤L), the nonzero elements of uℓ are used to form a gene module. Meanwhile, the high-dimensional data points (columns of X) can be projected to low dimensional data points (columns of VT): X∈ℝp×n↦VT∈ℝL×n, where L≪p. 2.5 Model selection In ESPCA, a key issue is how to determine the parameter k (i.e. the number of edges) and rank L (i.e. the number of modules). In fact, how to determine these parameters is still an open problem for ESPCA. Inspired by previous studies (Lee et al., 2010), we can select the parameters by minimizing the following Bayesian Information Criterion (BIC):   BIC(k,L)=X−∑l=1LdlulvlTn·p+ log ⁡(n·p)n·p∑l=1L(||ul||0+||vl||0) Thus, k and L can always be chosen via minimizing BIC. Like previous studies, PCA or SPCA consider two principal components for further analysis (Sill et al., 2015). In the paper, we empirically set k = 100 and L = 2 as default values for ESPCA. This ensures that we yield two gene modules with less than 100 genes. 2.6 Functional analysis of gene modules We first assessed whether the genes from each module were tightly connected in the gene interaction network. For a given gene module i with pi genes and mi edges in the prior gene interaction network, we defined the edge-density (ED) score of this module as ED=mi/(pi2)M/(p2), where M is the total number of edges (interactions) and p is the total number of genes in the gene interaction network. We applied the right-tailed hypergeometric test to assess the enrichment significance. We also collected 1607 cancer genes from the allOnco database which merges some different cancer genes from several databases (http://www.bushmanlab.org/links/genelists). We also employed the right-tailed hypergeometric test to compute a significance level to measures whether these genes from the same module are significantly overlapped with the cancer genes. Furthermore, to assess the biological relevance of these genes from each module, we downloaded multiple gene functional annotations including GO biological processes (GOBP), KEGG pathways and reactome pathways from Molecular Signatures Database (MSigDB) (http://software.broadinstitute.org/gsea/msigdb). In total, we obtained 4653 GOBP terms, 186 KEGG pathway terms and 674 reactome pathway terms. We performed functional enrichment analysis for the genes from each module using hypergeometric test adjusted with the Benjamini-Hochberg (BH) correction (P-value <0.05). 2.7 Biological data We first considered a combined gene expression dataset of multiple cancer types and the gene expression data of each cancer was extracted from level 3 RNA-seq data in TCGA database (Weinstein et al., 2013) (https://gdc-portal.nci.nih.gov/). The 12 cancer types used in this study include bladder urothelial carcinoma (BLCA, 134 samples), breast invasive carcinoma (BRCA, 847 samples), colon and rectum carcinoma (CRC, 550 samples), head and neck squamous-cell carcinoma (HNSC, 303 samples), kidney renal clear-cell carcinoma (KIRC, 474 samples), brain lower grade glioma (LGG, 179 samples), lung adenocarcinoma (LUAD, 350 samples), lung squamous-cell carcinoma (LUSC, 315 samples), prostate adenocarcinoma (PRAD, 170 samples), skin cutaneous melanoma (SKCM, 234 samples), thyroid carcinoma (THCA, 224 samples), uterine corpus endometrioid carcinoma (UCEC, 478 samples). We also downloaded a gene interaction network built by integrating multiple data sources (e.g. HPRD, NCI-PID, Interact and BioGRID) from Pathway-Commons database (http://www.pathwaycommons.org/) as prior knowledge for the two biological applications. Finally, we obtained the pan-cancer gene expression dataset with 4258 samples and 10 399 genes, and a gene interaction network with 10 399 genes and 257 039 gene interactions. We also downloaded the corresponding clinical data of the 12 cancer types from Firehose (http://firebrowse.org/) for further analysis. We obtained another gene expression dataset based on the RNA-seq technique from ENCODE project (Dunham et al., 2012) (http://www.genome.ucsc.edu/ENCODE/). The ENCODE expression dataset contains 158 cell lines (samples) including GM12878 (29 samples), K562 (24 samples), HepG2 (12 samples), MCF7 (11 samples), HelaS3 (8 samples), A549 (4 samples), others (70 samples). The raw expression level of each gene was quantified by RPKM (reads per kilobase of transcript per million reads). Genes with expression in less than 20% samples were filtered out. For each sample, these raw gene expression values were log2-transformed after replacing all the zero values with the lowest nonzero values in the corresponding sample. Finally, we obtained a gene expression data with 9486 genes and 158 cell lines, and a gene interaction network with 9486 genes and 219 304 gene interactions. 3 Simulation study In this section, we first presented a simple example to explain how ESPCA overcomes the limitation of SPCA. We generated the simulated gene expression matrix X and an interaction network (i.e. an edge set G): Generate two PC loadings by the following formulas:   u1=[1,−1,0.7,0.1,−0.5,[rep(0,5)]T]T, (15a)  u2=[[rep(0,5)]T,0.1,−2,−0.5,0.3,0.1]T, (15b) where rep(0,a) denotes a column-vector of size a, whose elements are zeros. Generate two PCs by the following formulas:   v1=rnorm(100), v2=rnorm(100), (16) where rnorm(b) denotes a column-vector of size b, whose elements are randomly sampled from the standard normal distribution. Generate the expression matrix X∈ℝ10×100 with 10 genes and 100 samples by the following formula:   X=d1u1v1T+d2u2v2T+γϵ, (17) where d1=10, d2=5, γ = 5 and ϵ∼iidN(0,1). Generate the gene interactions (edges) with G=G1∪G2, where G1 and G2 are as follows:   G1={(1,2),(1,3),(1,5),(2,3),(2,4),(3,4),(4,5)},  G2={(6,7),(6,10),(7,8),(7,10),(8,9),(8,10),(9,10)}. We tested ESPCA and compared it with PCA and SPCA by applied them to this simulated data (Table 1, Supplementary Material and Supplementary Figs S1A and S2B). The true patterns of the simulated data with respect to PC loadings is that the first five variables are related to PC1 loading and the last five variables are related to PC2 loading. We ensured that each identified PC loading by ESPCA (k = 6 in Eq. 5) has five nonzero entries. We also set s = 5 (Eq. 2) for SPCA to ensure that each identified PC loading with five nonzero entries for comparison. Clearly, ESPCA correctly extracted all five expected variables for PC1 and PC2 loadings, respectively (Table 1). However, SPCA misses the 4th variable for PC1 loading and the 6th variable for PC2 loading, and PCA only obtained two non-sparse PC loadings. Intuitively, we can see that the absolute value of the signal value of the 4th variable is less than the 7th variable and the 7th variable for the PC1 loading of PCA (second column of Table 1). It causes that SPCA misses the 4th variable and mistakenly extracts 7th variable (fourth column in Table 1). However, the 4th variable is connected to multiple important variables (the 2nd, 3rd, 5th variables) in the prior edges G. Thus, the 4th variable also tends to be an important one. ESPCA is able to correctly identify it by integrating the edge information G. Similarly, we can also explain these results with respect to PC2 loadings. Table 1. The top two identified PC1 and PC2 loadings by PCA, SPCA and ESPCA, respectively Method  PCA   SPCA   ESPCA   PC  PC1  PC2  PC1  PC2  PC1  PC2  Var1  0.57  −0.08  0.57  0.08  0.57    Var2  −0.61  −0.04  −0.61    −0.61    Var3  0.45  −0.04  0.45    0.46    Var4  0.02  −0.01      0.02    Var5  −0.30  0.08  −0.30    −0.31    Var6  −0.04  0.04        0.04  Var7  −0.07  −0.95  −0.07  0.95    −0.96  Var8  −0.01  −0.24    0.24    −0.24  Var9  −0.01  0.12    −0.13    0.12  Var10  −0.01  0.11    −0.11    0.11  Method  PCA   SPCA   ESPCA   PC  PC1  PC2  PC1  PC2  PC1  PC2  Var1  0.57  −0.08  0.57  0.08  0.57    Var2  −0.61  −0.04  −0.61    −0.61    Var3  0.45  −0.04  0.45    0.46    Var4  0.02  −0.01      0.02    Var5  −0.30  0.08  −0.30    −0.31    Var6  −0.04  0.04        0.04  Var7  −0.07  −0.95  −0.07  0.95    −0.96  Var8  −0.01  −0.24    0.24    −0.24  Var9  −0.01  0.12    −0.13    0.12  Var10  −0.01  0.11    −0.11    0.11  Note: Empty cells are zeros. Table 1. The top two identified PC1 and PC2 loadings by PCA, SPCA and ESPCA, respectively Method  PCA   SPCA   ESPCA   PC  PC1  PC2  PC1  PC2  PC1  PC2  Var1  0.57  −0.08  0.57  0.08  0.57    Var2  −0.61  −0.04  −0.61    −0.61    Var3  0.45  −0.04  0.45    0.46    Var4  0.02  −0.01      0.02    Var5  −0.30  0.08  −0.30    −0.31    Var6  −0.04  0.04        0.04  Var7  −0.07  −0.95  −0.07  0.95    −0.96  Var8  −0.01  −0.24    0.24    −0.24  Var9  −0.01  0.12    −0.13    0.12  Var10  −0.01  0.11    −0.11    0.11  Method  PCA   SPCA   ESPCA   PC  PC1  PC2  PC1  PC2  PC1  PC2  Var1  0.57  −0.08  0.57  0.08  0.57    Var2  −0.61  −0.04  −0.61    −0.61    Var3  0.45  −0.04  0.45    0.46    Var4  0.02  −0.01      0.02    Var5  −0.30  0.08  −0.30    −0.31    Var6  −0.04  0.04        0.04  Var7  −0.07  −0.95  −0.07  0.95    −0.96  Var8  −0.01  −0.24    0.24    −0.24  Var9  −0.01  0.12    −0.13    0.12  Var10  −0.01  0.11    −0.11    0.11  Note: Empty cells are zeros. We also applied ESPCA and SPCA to the second simulated data for further comparison and evaluation. We first generated two sparse PC loadings and two PCs by using u1=u^1/||u^1|| with u^1=[rnorm(100); rep(0,100)], u2=u^2/||u^2|| with u^2=[rep(0,100); rnorm(100)], and v1=v^1/||v^1|| with v^1=rnorm(100), v2=v^2/||v^2|| with v^2=rnorm(100), and d1=10, d2=5. Then a sequence of observed matrices Xs were generated by using formula (17) with different noise levels γ∈[0.1,0.5] in steps of 0.1. For each γ, we generated 50 observed matrices Xs by using formula (17). And a common prior gene interaction network for row variables of X was generated which contains two subnetwork, the first network contains the first 100 nodes (genes) which are connected with probability P = 0.3, and the second contains the remaining ones connected with probability P = 0.3. For each γ, we applied ESPCA to its corresponding 50 observation matrices. The parameter k of ESPCA (Algorithm 2) was set as 400 (i.e. 400 edges) to ensure that the identified two sparse PC loadings can reveal the true genes. We also applied SPCA to the same observation matrices for comparison and ensured the sparse PC loadings identified by SPCA has the same sparse level with those identified by ESPCA. Three criteria (sensitivity, specificity and accuracy) were used to evaluate the performance of ESPCA and SPCA for extracting the patterns of PC1 and PC2 loadings. We found that ESPCA outperforms SPCA with respect to PC1 and PC2 loadings at all different noise levels (Fig. 2). Since the eigenvalue d12 of PC1 loading are greater than the eigenvalue d22 of PC2 loading, both ESPCA and SPCA obtain higher sensitivity and specificity for PC1 loading than those for PC2 loading at different noise levels. In addition, as the noise level becomes larger, both sensitivity and specificity of ESPCA and specificity of SPCA are getting smaller. In short, overall sensitivity and specificity of ESPCA are higher than those of SPCA in all cases. Finally, we also show that random prior networks will affect the performance of ESPCA greatly (Supplementary Material and Supplementary Fig. S2). Fig. 2. View largeDownload slide Performance comparison of SPCA and ESPCA in terms of sensitivity, specificity and accuracy (ACC) in the second simulated data where sensitivity measures proportion of true nonzero elements of each PC loading that are correctly identified, and specificity measures the proportion of zero elements that are correctly identified. The bar plots of the average values of 50 realizations on the simulated data with respect to different noise levels were shown with error bars Fig. 2. View largeDownload slide Performance comparison of SPCA and ESPCA in terms of sensitivity, specificity and accuracy (ACC) in the second simulated data where sensitivity measures proportion of true nonzero elements of each PC loading that are correctly identified, and specificity measures the proportion of zero elements that are correctly identified. The bar plots of the average values of 50 realizations on the simulated data with respect to different noise levels were shown with error bars 4 Biological applications We applied ESPCA to the two biological datasets and set k = 100 to yield modules of less than 100 genes. 4.1 Application to the pan-cancer dataset We applied ESPCA to the TCGA pan-cancer dataset and compared its performance with that of SPCA (Journée et al., 2010; Yuan and Zhang, 2013). ESPCA and SPCA cost about nine and one minutes to identify a PC for this real data respectively (similarly for the ENCODE dataset). In addition, both ESPCA and SPCA algorithms use five random initial solutions and the one with the largest objective value is considered as the final one. For comparison, we extracted the top two PC loadings by SPCA with the same number of nonzero genes as those identified by ESPCA. As we expected, we found that the two gene modules (ESPC1 and ESPC2) identified by ESPCA contain more gene interactions in the prior gene interaction network than those two modules (SPC1 and SPC2) identified by SPCA significantly. Moreover, ESPCA identified 20 (of 86) and 13 (of 55) cancer genes in ESPC1 dodule (P-value < 0.036, hypergeometric test) and ESPC2 module (P-value < 0.073, hypergeometric test), respectively. However, SPCA only identified 17 and 8 cancer genes in SPC1 module (P-value < 0.168, hypergeometric test) and SPC2 module (P-value < 0.63, hypergeometric test), respectively. In summary, the number of edges, cancer genes and enriched functional terms, and the ED score for PC loadings identified by ESPCA are all higher than those of SPCA (Fig. 3A). Fig. 3. View largeDownload slide Comparison of the results based on the top two PC loadings identified by ESPCA and SPCA in the TCGA pan-cancer dataset. (A) Summary statistics in the two modules identified by ESPCA and SPCA, respectively. #gene, #edge, #cancer-gene and #GO BP (or #KEGG, #reactome) are the number of genes, edges, cancer genes and enriched GO BP (or KEGG, reactome) terms (*P < 0.05 and ** P < 0.01); ED: edge-density score. (B) Two scatter plots based on the first two PCs by ESPCA and SPCA respectively. Data points were colored by the 12 different cancer types. (C) Box plot of normalized mutual information (NMI) scores based on 50 random seeds for K-means (which was used to cluster TCGA pan-cancer samples based on the first two PCs). P-value was calculated using the Wilcoxon rank sum test (Color version of this figure is available at Bioinformatics online.) Fig. 3. View largeDownload slide Comparison of the results based on the top two PC loadings identified by ESPCA and SPCA in the TCGA pan-cancer dataset. (A) Summary statistics in the two modules identified by ESPCA and SPCA, respectively. #gene, #edge, #cancer-gene and #GO BP (or #KEGG, #reactome) are the number of genes, edges, cancer genes and enriched GO BP (or KEGG, reactome) terms (*P < 0.05 and ** P < 0.01); ED: edge-density score. (B) Two scatter plots based on the first two PCs by ESPCA and SPCA respectively. Data points were colored by the 12 different cancer types. (C) Box plot of normalized mutual information (NMI) scores based on 50 random seeds for K-means (which was used to cluster TCGA pan-cancer samples based on the first two PCs). P-value was calculated using the Wilcoxon rank sum test (Color version of this figure is available at Bioinformatics online.) We then evaluated the biological relevance of the identified modules. The two modules ESPC1 and ESPC2 were significantly enriched in 54, 6, 7 and 81, 2, 26 GOBP, KEGG and reactome terms, respectively (Fig. 3A). As expected, the gene enrichment analysis revealed a number of cancer related pathways such as biological adhesion, epidermis development, cell cycle, cell division, and ECM-receptor interaction, cell adhesion molecules cams and pathways in cancer. For example, seven genes (ITGB4, ITGB6, LAMC2, LAMB3, LAMA3, SDC1, SDC3) from the ESPC1 module were discovered in the KEGG ECM-receptor interaction pathway, and six genes (PRKDC, ATR, STAG1, RAD21, SMC1A, SMC3) from the ESPC2 module were discovered in the KEGG cell cycle pathway (Supplementary Fig. S4A). However, we only found 26, 0, 4 and 1, 0, 0 GOBP, KEGG and reactome terms for the SPC1 and SPC2 modules, respectively (Fig. 3A). Moreover, the enriched functional terms of ESPCA have distinct lower P-values than those of SPCA (Supplementary Fig. S4B). These results suggest that ESPCA can identify more biologically relevant gene sets than SPCA. We also compared the ability of PCs of ESPCA and SPCA respectively to discover sample patterns. The top two PCs identified by ESPCA demonstrates more distinct samples patterns especially for LGG, SKCM, KIRC than those of SPCA visually (Fig. 3B). Furthermore, we also performed K-means clustering with 50 realizations on each dataset and computed the normalized mutual information (NMI) that measures the similarity between the K-means clustering and the true sample classes. The average NMI was significantly higher for ESPCA reduced data than that for SPCA reduced data (P < 2.2e-16, one-sided Wilcoxon rank sum, Fig. 3C). Finally, we found that the ESPC1 and ESPC2 were significantly related to LGG and UCEC with z-score > 2, respectively (Fig. 4A). Intriguingly, ESPC1 module is significantly associated with survival of LGG and ESPC2 is associated with survival with UCEC, respectively (Fig. 4B), suggesting their functional specificity of these PCs. Fig. 4. View largeDownload slide (A) Two bar plots of z-scores based on the first two PCs identified by ESPCA in the TCGA Pan-cancer dataset. The z-score for ESPC1 (or ESPC2) module was computed based on the cancer relevance score vector. The cancer relevance score was calculated based on the average of the squares of its PC values of the tumors in each cancer. (B) ESPC1 and ESPC2 modules are significantly associated with survival of LGG and UCEC, respectively. For LGG and ESPC1 module (similarly for UCEC and ESPC2 module), LGG tumors were divided into two groups based on the median of their corresponding PC coefficients and the Kaplan-Meier survival curve were drawn to show the difference between the two group patients. Log rank test was used to calculate the P-value and ‘+’ denotes the censored patients Fig. 4. View largeDownload slide (A) Two bar plots of z-scores based on the first two PCs identified by ESPCA in the TCGA Pan-cancer dataset. The z-score for ESPC1 (or ESPC2) module was computed based on the cancer relevance score vector. The cancer relevance score was calculated based on the average of the squares of its PC values of the tumors in each cancer. (B) ESPC1 and ESPC2 modules are significantly associated with survival of LGG and UCEC, respectively. For LGG and ESPC1 module (similarly for UCEC and ESPC2 module), LGG tumors were divided into two groups based on the median of their corresponding PC coefficients and the Kaplan-Meier survival curve were drawn to show the difference between the two group patients. Log rank test was used to calculate the P-value and ‘+’ denotes the censored patients 4.2 Application to the ENCODE dataset We also applied ESPCA and SPCA to the ENCODE gene expression dataset from multiple cell types. Similar to the TCGA pan-cancer application, we found that the two gene modules (ESPC1 and ESPC2) identified by ESPCA (Supplementary Fig. S5A) contain more gene interactions in the prior gene interaction network than those two modules (SPC1 and SPC2) identified by SPCA significantly (Fig. 5A). We also found that ESPCA identified 14 (of 69) and 30 (of 96) cancer genes in them, respectively. However, SPCA only identified 8 and 15 cancer genes in SPCA modules, respectively. In short, the number of the edges, cancer genes and enriched functional terms, and the ED score for PC loadings identified by ESPCA were higher than those by SPCA (Fig. 5A). Fig. 5. View largeDownload slide Comparison of the results based on the top two PC loadings identified by ESPCA and SPCA in the ENCODE dataset. (A) Summary statistics in these modules identified by ESPCA and SPCA, respectively. (B) Two scatter plots based on the first two PCs by ESPCA and SPCA. Data points were colored by the different cell lines. (C) Boxplot of NMI scores based on 50 random seeds for K-means (which was used to cluster ENCODE samples based on the first two PCs). P-value was calculated using Wilcoxon rank sum test (Color version of this figure is available at Bioinformatics online.) Fig. 5. View largeDownload slide Comparison of the results based on the top two PC loadings identified by ESPCA and SPCA in the ENCODE dataset. (A) Summary statistics in these modules identified by ESPCA and SPCA, respectively. (B) Two scatter plots based on the first two PCs by ESPCA and SPCA. Data points were colored by the different cell lines. (C) Boxplot of NMI scores based on 50 random seeds for K-means (which was used to cluster ENCODE samples based on the first two PCs). P-value was calculated using Wilcoxon rank sum test (Color version of this figure is available at Bioinformatics online.) Moreover, the two modules ESPC1 and ESPC2 were significantly enriched in 67, 4, 54 and 63, 11, 12 GOBP, KEGG and reactome terms, respectively (Fig. 5A). However, only 0, 1 and 7 significant terms are found for SPC1, and nothing for SPC2. Correspondingly, the enriched functional terms of ESPCA have more lower P-value than those of SPCA (Supplementary Fig. S5B). Moreover, compared to SPCA modules, ESPCA modules tend to be related with some cancer related functional terms such as metabolism of RNA, ECM receptor interaction, focal adhesion. Finally, the visualized PC plots of ESPCA and SPCA showed that ESPCA can reveal more distinct classes than those by SPCA (Fig. 5B). Clearly, samples of K562 and GM12878 cell types are more separated under ESPCA than those of SPCA. Moreover, the average NMI is significantly higher for ESPCA reduced data than that for SPCA reduced data (Fig. 5C). All the results suggested that ESPCA could reveal more biologically relevant gene sets and more underlying expression patterns. 5 Discussion and conclusion On one hand, sparse PCA is a typical unsupervised learning method for dimension reduction and feature selection. On the other hand, network-based methods for analysis have been employed to extract gene biomarkers. Inspired by these two facts, we proposed an Edge-group sparse PCA (ESPCA) for high-dimensional data analysis. ESPCA employs the gene interaction network to guide the selection of relevant genes. Mathematically, we considered an edge-group sparse penalty in the proposed ESPCA framework. The key of solving it lies in a subproblem, called k-edge sparse projection operation. We designed a greedy k-edge sparse projection algorithm to solve it. Based on the greedy algorithm, we developed an alternating iterative algorithm to solve ESPCA. One of the advantages of ESPCA is to ensure the obtained PCs are orthogonal. We first applied ESPCA and SPCA to a set of simulated data and showed that ESPCA were more effective compared to SPCA. We also applied ESPCA for joint analysis of gene expression data with samples from different cancer-types (or cell-types) to study their sharing characteristics. Two biological datasets were tested in our study, including the TCGA pan-cancer expression data with about 4000 tumor samples from different cancers, and the ENCODE expression dataset with 158 cell lines from different cell types. ESPCA could identify gene modules with significant biological relevance for the two real datasets and yield better biological interpretations by integrating a gene interaction network. We note that the singleton genes in the prior network (if there are such genes) can also be selected by ESPCA (Supplementary Material and Supplementary Fig. S6). Moreover, the distinct functional enrichment is not caused mostly by the prior network (Supplementary Material and Supplementary Fig. S7). Although PCA-based methods are usually employed for dimension reduction with two or three dimensions, ESPCA can be used to extract more gene modules (Supplementary Material and Supplementary Tables S1 and S2). Lastly, compared with typical PCA, ESPCA can reveal more biologically relevant modules (Supplementary Material and Supplementary Fig. S8). In the future, ESPCA may be extended in the following aspects. (i) In this paper, we only considered two linked genes (an edge) in a given gene interaction network as a group and proposed an edge-group sparse penalty in Eq. (4). In fact, one alternative is that we can also consider the multiple genes from the same functional class (e.g. GO BP term, KEGG pathway) as a group in such a penalty and extend it to consider more structured knowledge. (ii) Moreover, in the paper, we applied ESPCA for multiple expression matrices with equal importance to each one in Eq. (14). However, the actual situation is that these expression matrices from different cancers (or cell types), which have different characteristics, such as different number of samples. Thus, we may set different weights for the cancer-type gene expression matrices. Mathematically, we will get a new objective function ∑i=1tλiuXiXiTuT, which will lead to a regularized tensor PCA model with edge-sparse penalty. (iii) Finally, we believe that the concept of edge-group sparse penalty will be valuable and can be extended to other statistical learning frameworks with some optimization techniques. Acknowledgements Wenwen Min would like to thank the support of the Academy of Mathematics and Systems Science at CAS during his visit. Funding This work was supported by the National Natural Science Foundation of China [No. 11661141019, 61621003, 61422309 and 61379092]; the Strategic Priority Research Program of the Chinese Academy of Sciences (CAS) [No. XDB13040600], the Key Research Program of the Chinese Academy of Sciences [No. KFZD-SW-219], National Key Research and Development Program of China (2017YFC0908405) and CAS Frontier Science Research Key Project for Top Young Scientist [No. QYZDB-SSW-SYS008]. Conflict of Interest: none declared. References Ansari S. et al.   ( 2017) An approach to infer putative disease-specific mechanisms using neighboring gene networks. Bioinformatics , 33, 1987– 1994. Google Scholar CrossRef Search ADS PubMed  Breschi A. et al.   ( 2016) Gene-specific patterns of expression variation across organs and species. Genome Biol ., 17, 151. Google Scholar CrossRef Search ADS PubMed  Chung N.C., Storey J.D. ( 2015) Statistical significance of variables driving systematic variation in high-dimensional data. Bioinformatics , 31, 545– 554. Google Scholar CrossRef Search ADS PubMed  Deshpande Y., Montanari A. ( 2014) Sparse PCA via covariance thresholding. In: Conference on Neural Information Processing Systems . pp. 334– 342. Dittrich M.T. et al.   ( 2008) Identifying functional modules in protein–protein interaction networks: an integrated exact approach. Bioinformatics , 24, i223– i231. Google Scholar CrossRef Search ADS PubMed  Dunham I. et al.   ( 2012) An integrated encyclopedia of DNA elements in the human genome. Nature , 489, 57– 74. Google Scholar CrossRef Search ADS PubMed  Glaab E. ( 2016) Using prior knowledge from cellular pathways and molecular networks for diagnostic specimen classification. Brief. Bioinform ., 17, 440– 452. Google Scholar CrossRef Search ADS PubMed  Gu Q. et al.   ( 2014) Sparse PCA with oracle property. In: Conference on Neural Information Processing Systems. pp. 1529– 1537. Gwinner F. et al.   ( 2017) Network-based analysis of omics data: the LEAN method. Bioinformatics , 33, 701– 709. Google Scholar PubMed  Ho R. et al.   ( 2016) Als disrupts spinal motor neuron maturation and aging pathways within gene co-expression networks. Nat. Neurosci ., 19, 1256– 1267. Google Scholar CrossRef Search ADS PubMed  Hsu Y.L. et al.   ( 2014) Sparse principal component analysis in cancer research. Transl. Cancer Res ., 3, 182. Google Scholar PubMed  Hudson T.J. et al.   ( 2010) International network of cancer genome projects. Nature , 464, 993– 998. Google Scholar CrossRef Search ADS PubMed  Huisman S. et al.   ( 2017) BrainScope: interactive visual exploration of the spatial and temporal human brain transcriptome. Nucleic Acids Res ., 45, e83. Google Scholar PubMed  Hyvärinen A., Oja E. ( 2000) Independent component analysis: algorithms and applications. Neural Netw ., 13, 411– 430. Google Scholar CrossRef Search ADS PubMed  Jacob L. et al.   ( 2009) Group lasso with overlap and graph lasso. In: International Conference on Machine Learning. ACM, New York, NY, USA, pp. 433– 440. Ji H. et al.   ( 2013) Differential principal component analysis of ChIP-seq. Proc. Natl. Acad. Sci. USA , 110, 6789– 6794. Google Scholar CrossRef Search ADS   Jolliffe I.T. et al.   ( 2003) A modified principal component technique based on the lasso. J. Comput. Graph. Stat ., 12, 531– 547. Google Scholar CrossRef Search ADS   Journée M. et al.   ( 2010) Generalized power method for sparse principal component analysis. J. Mach. Learn. Res ., 11, 517– 553. Lee M. et al.   ( 2010) Biclustering via sparse singular value decomposition. Biometrics , 66, 1087– 1095. Google Scholar CrossRef Search ADS PubMed  Leiserson M.D. et al.   ( 2015) Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nat. Genet ., 47, 106– 114. Google Scholar CrossRef Search ADS PubMed  Lin Z. et al.   ( 2016) Simultaneous dimension reduction and adjustment for confounding variation. Proc. Natl. Acad. Sci. USA , 113, 14662– 14667. Google Scholar CrossRef Search ADS   Liu J.X. et al.   ( 2016) A class-information-based sparse component analysis method to identify differentially expressed genes on RNA-Seq data. IEEE/ACM Trans. Comput. Biol. Bioinform ., 13, 392– 398. Google Scholar CrossRef Search ADS PubMed  Liu Y. et al.   ( 2017) Sigmod: an exact and efficient method to identify a strongly interconnected disease-associated module in a gene network. Bioinformatics , 33, 1536– 1544. Google Scholar PubMed  Lonsdale J. et al.   ( 2013) The genotype-tissue expression (GTEx) project. Nat. Genet ., 45, 580– 585. Google Scholar CrossRef Search ADS PubMed  Ma S., Dai Y. ( 2011) Principal component analysis based methods in bioinformatics studies. Brief. Bioinform ., 12, 714– 722. Google Scholar CrossRef Search ADS PubMed  Ma X. et al.   ( 2014) Learning regulatory programs by threshold SVD regression. Proc. Natl. Acad. Sci. USA , 111, 15675– 15680. Google Scholar CrossRef Search ADS   Rahmani E. et al.   ( 2016) Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies. Nat. Methods , 13, 443– 445. Google Scholar CrossRef Search ADS PubMed  Ringnér M. ( 2008) What is principal component analysis? Nat. Biotechnol ., 26, 303– 304. Google Scholar CrossRef Search ADS PubMed  Ruan P. et al.   ( 2016) NEpiC: a network-assisted algorithm for epigenetic studies using mean and variance combined signals. Nucleic Acids Res ., 44, e134. Google Scholar CrossRef Search ADS PubMed  Sharan R. et al.   ( 2007) Network-based prediction of protein function. Mol. Syst. Biol ., 3, 88. Google Scholar CrossRef Search ADS PubMed  Shen H., Huang J.Z. ( 2008) Sparse principal component analysis via regularized low rank matrix approximation. J. Multivar. Anal ., 99, 1015– 1034. Google Scholar CrossRef Search ADS   Sill M. et al.   ( 2011) Robust biclustering by sparse singular value decomposition incorporating stability selection. Bioinformatics , 27, 2089– 2097. Google Scholar CrossRef Search ADS PubMed  Sill M. et al.   ( 2015) Applying stability selection to consistently estimate sparse principal components in high-dimensional molecular data. Bioinformatics , 31, 2683– 2690. Google Scholar CrossRef Search ADS PubMed  Van Der Maaten L. ( 2014) Accelerating t-SNE using tree-based algorithms. J. Mach. Learn. Res ., 15, 3221– 3245. Weinstein J.N. et al.   ( 2013) The cancer genome atlas pan-cancer analysis project. Nat. Genet ., 45, 1263– 1120. Google Scholar CrossRef Search ADS PubMed  Witten D.M. et al.   ( 2009) A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics , 10, 515– 534. Google Scholar CrossRef Search ADS PubMed  Yuan X.T., Zhang T. ( 2013) Truncated power method for sparse eigenvalue problems. J. Mach. Learn. Res ., 14, 899– 925. Zou H. et al.   ( 2006) Sparse principal component analysis. J. Comput. Graph. Stat ., 15, 265– 286. Google Scholar CrossRef Search ADS   © The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Journal

BioinformaticsOxford University Press

Published: May 3, 2018

There are no references for this article.

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off