A knowledge-based T2-statistic to perform pathway analysis for quantitative proteomic data
A knowledge-based T2-statistic to perform pathway analysis for quantitative proteomic data
2017-06-16 00:00:00
Approaches to identify significant pathways from high-throughput quantitative data have been developed in recent years. Still, the analysis of proteomic data stays difficult because of limited sample size. This limitation also leads to the practice of using a competitive null as OPENACCESS common approach; which fundamentally implies genes or proteins as independent units. Citation: Lai E-Y, Chen Y-H, Wu K-P (2017) A The independent assumption ignores the associations among biomolecules with similar knowledge-based T -statistic to perform pathway functions or cellular localization, as well as the interactions among them manifested as analysis for quantitative proteomic data. PLoS Comput Biol 13(6): e1005601. https://doi.org/ changes in expression ratios. Consequently, these methods often underestimate the asso- 10.1371/journal.pcbi.1005601 ciations among biomolecules and cause false positives in practice. Some studies incorpo- Editor: Jacquelyn S. Fetrow, University of rate the sample covariance matrix into the calculation to address this issue. However, Richmond, UNITED STATES sample covariance may not be a precise estimation if the sample size is very limited, which Received: January 12, 2017 is usually the case for the data produced by mass spectrometry. In this study, we introduce a multivariate test under a self-contained null to perform pathway analysis for quantitative Accepted: May 29, 2017 proteomic data. The covariance matrix used in the test statistic is constructed by the confi- Published: June 16, 2017 dence scores retrieved from the STRING database or the HitPredict database. We also Copyright:© 2017 Lai et al. This is an open access design an integrating procedure to retain pathways of sufficient evidence as a pathway article distributed under the terms of the Creative group. The performance of the proposed T -statistic is demonstrated using five published Commons Attribution License, which permits unrestricted use, distribution, and reproduction in experimental datasets: the T-cell activation, the cAMP/PKA signaling, the myoblast differen- any medium, provided the original author and tiation, and the effect of dasatinib on the BCR-ABL pathway are proteomic datasets pro- source are credited. duced by mass spectrometry; and the protective effect of myocilin via the MAPK signaling Data Availability Statement: All relevant data are pathway is a gene expression dataset of limited sample size. Compared with other popular within the paper and its Supporting Information statistics, the proposed T -statistic yields more accurate descriptions in agreement with the files. discussion of the original publication. We implemented the T -statistic into an R package Funding: This work is funded by Ministry of T2GA, which is available at https://github.com/roqe/T2GA. Science and Technology (https://www.most.gov. tw/?l=en). The grant number is 104-2221-E-010- 009-MY2, and the receiver is KPW. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: The authors have declared that no competing interests exist. PLOS Computational Biology | https://doi.org/10.1371/journal.pcbi.1005601 June 16, 2017 1 / 29 2 A T -statistic for pathway analysis Author summary Pathway analysis is a common approach to quickly access the pathways being regulated in the experiments. There are numerous statistics to perform pathway analysis; most of them assume that the genes or proteins are independent of each other for statistical ease. This assumption, however, is unrealistic to the real biological system and may cause false posi- tives in practice. A standard way to address this issue is to measure the associations among genes or proteins. Unfortunately, the estimation of associations requires sufficient sample size, which is usually not available for proteomic data produced by mass spectrom- etry. In this study, we propose a T -statistic, which estimates the associations among gene products, to perform pathway analysis for quantitative proteomic data. Instead of calculat- ing the associations directly from data, we use the confidence scores retrieved from protein-protein interaction databases. We also design an integrating procedure to reserve pathways of sufficient evidence as a regulated pathway group. We compare the proposed T -statistic to other popular statistics using five published experimental datasets, and the T -statistic yields more accurate descriptions in agreement with the discussion of the orig- inal papers. Introduction Great progress has been made toward the development of high-throughput technologies and their application to biological and clinical research. In a quantitative experiment, genes or proteins with significant changes in expression are potential to have important roles in a given phenotype or phenomenon. Therefore, the analysis of quantitative experimental data generally produces a list of differentially expressed genes or proteins in order. The list may share some insights if the aim of the experiments is restricted to few targets. As regards high- throughput data, the list hardly provides biological understanding of the mechanisms being studied, since the data involve complicated regulations among biomolecules and the number of biomolecules is too large to examine all candidates individually. Systematically investigat- ing the underlying mechanisms from a high-throughput data therefore become a new challenge. To confront the challenge, one idea is to apply pathway analysis to identify the genes or pro- teins that are known to be involved in a biological process or interaction based on the existing knowledge. Many approaches of pathway analysis have been developed over years concerning different methodologies. Some general reviews of pathway analysis approaches can be found in [1±4]. These approaches can be broadly divided into two major factions: a competitive null with a gene sampling model and a self-contained null with a subject sampling model. The null hypothesis of a competitive test suggests that the target pathway (or a predefined gene set) is differentially expressed as well as the rest of all pathways. Practically a competitive null is closely tied to (although not necessarily) a gene sampling approach [5]. A gene sampling model principally implies the independence assumption; the assumption presupposes that genes are expressed independently of each other so that these genes can be manipulated as the sampling subject to produce the null distribution. The null hypothesis of a self-contained test, on the other hand, suggests that the target pathway is not differentially expressed between distinct phenotypes. Using the phenotypes of the experiments as the sampling subject is the setup of a subject sampling model. In other words, a competitive null with a gene sampling approach let pathways compete with each other in order to rank these pathways and use the number of genes as the sample size in the meanwhile; whereas a self-contained null with a subject PLOS Computational Biology | https://doi.org/10.1371/journal.pcbi.1005601 June 16, 2017 2 / 29 2 A T -statistic for pathway analysis sampling approach examine each single pathway to determine if the pathway is indeed differ- entially expressed between phenotypes and use the number of experiments as the sample size. Intuitively, a competitive null aims to find the pathway that is most significant among all path- ways; a self-contained null aims to find the pathway that is the most significantly expressed between phenotypes. The classification of null hypotheses and sampling methods is firstly sug- gested by Geoman et al [5]. Even both categories have their own pitfalls and benefits, the authors suggested using a self-contained null with a subject sampling model in pathway analy- sis. Competitive null with gene sampling model usually implies the independence assumption which may produce inaccurate small p-values, cause serious false positives [6], and result mis- leading interpretations [5]. Further methodology issues of using a self-contained null can be found in [7±9], and of using a competitive null can be found in [10]. Another issue of pathway analysis comes from the construction of test statistics. These sta- tistics can be divided into univariate tests and multivariate tests. Univariate statistics, such as a modification or a weighted summation of the t-scores [11±14], only focus on expression of genes or proteins and assume these biomolecules as independent units for statistical ease. The independence assumption ignores the associations among biomolecules with similar functions or cellular localization, as well as the interactions among them manifested as changes in expression ratios. In contrast, multivariate statistics take into consideration the associations among genes or proteins [15±20]. Some methodology studies [9, 21±23] have evaluated uni- variate tests and multivariate tests with synthetic and experimental datasets. Compared with multivariate tests, univariate tests generally result a decrease in statistical power [21] and an increase in false positive rate [22] along with the rise of average correlation. Multivariate approaches usually incorporate the sample covariance matrix into calculation to address bio- logical interaction. However, the sample covariance may not be precise enough to estimate the associations if the sample size is very limited. Compared with other gene expression data, proteomic data produced by mass spectrometry are more difficult to analyze systematically due to the limited number of experiments. This limitation causes current multivariate tests incompetent because the sample covariance will not be a robust statistic. The composition of pathway diagrams also become a challenge to pathway analysis. A path- way is a group of biomolecules that participate in a particular cellular process. The members of a pathway are usually defined by the tradition (i.e., the history of pathway discovery) of molec- ular biology scientists. The structure of pathway diagrams is not standardized and therefore arises some issues to pathway analysis. First, the same pathway from different databases or other sources may have the same core members but different side members. Under different experimental conditions, the size of accessible biomolecules also changes. For example, the phosphorylation proteomic data may not provide information to the proteins not belong to phosphoproteome. Since the number of members within a pathway is not a constant, using this number as a parameter to determine if the pathway is significant or not may lead to incon- sistent results. This issue arises with the assumption of a competitive null. Second, some mole- cules appear over pathways may play important roles as communication centers. For example, p53 appears in 38 pathways in the KEGG database. The shared members may interact or coop- erate with each other and form a functional module. If this module is regulated (i.e., the mem- bers within the module are differentially expressed as they cooperate together), the subsets of this module may present in abundant pathways and make these pathways seemingly signifi- cant. To identify the regulated pathway among the significant pathways of common modules, biologists usually utilize the distinctive molecules participating in that specific pathway; whereas pathways do not contain distinctive molecules may be irrelevant to the underlying mechanism. However, most of the current approaches do not take consideration of this issue. The suggestion of irrelevant pathways due to the redundancy over the significant pathways PLOS Computational Biology | https://doi.org/10.1371/journal.pcbi.1005601 June 16, 2017 3 / 29 2 A T -statistic for pathway analysis usually causes confusion to data description. A recent study [24] also focuses on the second issue. They demonstrate how the shared members affect p-values, and try to address this prob- lem under a competitive null. In this study, we introduced a multivariate test, based on the Hotelling's T -statistic, to per- form pathway analysis for quantitative proteomic data. The most serious problem of analyzing proteomic data produced by mass spectrometry is the limited number of experiments. We usually obtain only a few replicates (biological or technical) per experimental condition. To manage this issue, we had two special designs in our test. First, instead of using the sample covariance matrix (which is not robust when the sample size is limited), we use the covariance matrix that is constructed of the probabilistic confidence scores provided by the STRING and HitPredict databases. The proposed T -statistic is then built of the protein expression profile and the covariance matrix to consider the expression level of individual proteins along with the associations among them. Second, we designed a self-contained model to produce a null distribution of altering protein expression while retaining the structure of protein associations. We are not capable of applying a subject sampling model because the number of experiments is too limited. In addition to the settlement of sample size issue, we designed an integrating procedure to categorize significant pathways as well as to avoid redundancy. The performance of the proposed T -statistic is demonstrated using five public experimental datasets with differ- ent levels of biological complexity: the T-cell activation, the cAMP/PKA signaling, the effect of dasatinib on the BCR-ABL pathway, the differentiation process of myoblast, and the protective effect of myocilin via the MAPK signaling pathway. The first four datasets are proteomic data produced by mass spectrometry; the last dataset is a gene expression data of low sample size. We compared T with other popular statistics: DPA [25], GSEA [26], DAVID [27, 28], and IPA [29]. For most of the situations, T yielded more accurate descriptions in agreement with the discussion of the original publication. Materials and methods Experimental data We took four proteomic datasets of different biological complexity and experimental proper- ties to demonstrate our approach. To be comprehensive, the testing datasets include the case of pathway activation and inhibition; also the case of signaling phosphoproteome and cellular proteome. We only used the final ratios provided by the datasets since they may have different integration approaches (e.g. to combine the results of biological and/or technical repeats, to handle missing data or outliers) under different experimental designs. The summarized ratio is also the most available format for quantitative proteomic data. In this situation, the sample size of data becomes only one, calculating a covariance matrix is not even possible. Our approach provide a solution to undertake this difficulty. We also applied our approach on a gene expression dataset of three samples to demonstrate that the general idea is applicable to other quantitative data of low sample size. The phosphoproteomic data of TCR signaling. The T-cell receptor (TCR) signaling phosphoproteomic data [30] aim to reveal system-wide phenomena associated with T cell activation. The authors used OKT3, an antibody specific to CD3, to initiate the TCR signaling transduction in the human leukemia cell line Jurkat T lymphocyte. They used stable isotope labeling by amino acids in cell culture (SILAC) method to perform large-scale quantitative phosphoproteomic analyses and identified 696 TCR-responsive phosphorylation sites on 453 proteins. Phosphopeptides showing more than 1.85-fold change in abundance are quali- fied as ªTCR-responsiveº sites, suggested by the authors under their experimental conditions. We used these 696 proteins since the original publication does not provide raw data, and we PLOS Computational Biology | https://doi.org/10.1371/journal.pcbi.1005601 June 16, 2017 4 / 29 2 A T -statistic for pathway analysis aimed to use this small dataset as a positive control to evaluate the proposed statistic. The data- set contains three time points: 5, 15, and 60 min; the number of UniProt accessions are 30, 376, and 330, respectively. The authors specifically enriched pTyr-contained peptides in 5 and 15 min experiments since pTyr constitutes less than 1% of the total amino acid content; a global phosphopeptide-enrichment approach was used for 15 and 60 min experiments. The 5 min experiment is extremely small because it contained only pTyr-enriched peptides. The phosphoproteomic data of cAMP/PKA signaling. The cAMP-dependent protein kinase (i.e. protein kinase A, PKA) signaling phosphoproteomic data [31] aim to provide a resource of substrates of PKA. The authors also used SILAC to profile quantitative changes of potential PKA substrates. Prostaglandin E2 (PGE2) was applied to increase intracellular cAMP and activate PKA in Jurkat T cells. The light cells (L) was used for control, the medium (M) and the heavy (H) were treated with PGE2 for 1 and 60 min, respectively. The authors identi- fied 4284 phosphorylation sites on 607 proteins, and they discussed the dynamic phosphoryla- tion upon stimulation by PGE2. The study considered two expression ratios: M/L and H/L; the number of UniProt accessions are 594 and 595, respectively. The cellular proteomic data of myogenesis. This cellular proteomic data [32] monitor the changes in protein expression underlying the phenotypic conversion of human primary myoblasts; from primary mononucleated muscle cells to multinucleated myotubes, using the SILAC method. The authors used human satellite cells isolated from a quadriceps muscle biopsy of a 5-day old infant. The light (L), medium (M) and heavy (H) cells were first treated with the growth medium (serum-supplemented; low glucose), then switched to the differentia- tion medium (serum-free; high glucose) for 0, 24, and 72 hr, respectively. This study quantified 2240 proteins, in which 2227 unique UniProt identifiers are quantified for both M/L and H/L expression ratios. The phosphoproteomic data of BCR-ABL signaling for CML treatment. The pathology of chronic myeloid leukemia (CML) is commonly associated with an oncogenic tyrosine kinase BCR-ABL. Dasatinib is an inhibitor of the BCR-ABL and Src family tyrosine kinase, and it serves as a clinical drug for treatment of CML. This phosphoproteomic data [33] aim to inspect the effects of dasatinib on the entire cell signaling network, using the SILAC method. The authors used the human leukemia cell line K562, which expresses the activated BCR-ABL fusion protein, to examine cellular phosphorylation levels for three conditions. For one hour, the light cells (L) was treated with DMSO only, the medium (M) and the heavy (H) were treated with 5 and 50 nM dasatinib, respectively. The authors identified 5063 phosphorylation sites on 1889 proteins, and they further discussed the mechanisms induced by dasatinib. The study considered two expression ratios: M/L and H/L; the number of UniProt accessions are 5453 and 5443, respectively. The gene expression data of MAPK signaling. The MAPK signaling gene expression data [34] aims to understand the functions of myocilin, a causative gene for open angle glau- coma. The authors suggested that myocilin promotes cell proliferation and resistance to apo- ptosis via the ERK1/2 MAPK signaling pathway. The microarray analysis compared a stably transfected HEK293 cell line expressing myocilin with the control cell lines. The cells were treated with the 10μm MEK inhibitor U0126 two hours prior to the induction of myocilin expression. The gene expression profiling was performed using Human Affymetrix Gene Chip U133 Plus version 2.0. The processed dataset includes three treatment and three control expression values. We took the mean of the triplicated data as the representative expression. Then we took the difference of the treatment group and the control group since the processed data is already normalized. We used the identifier mapping files from the official websites and the final data include 21816 unique UniProt identifiers. PLOS Computational Biology | https://doi.org/10.1371/journal.pcbi.1005601 June 16, 2017 5 / 29 2 A T -statistic for pathway analysis Databases To provide more generalized results, we choose two pathway databases and two protein- protein interaction databases: KEGG and Reactome provide pathway categories served as pre- defined gene sets, STRING and HitPredict contributes the confidence scores to estimate the covariance between protein expressions. KEGG. The Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database [35] collects manually drawn pathway maps in various topics, especially in metabolism. We down- loaded the pathway database in the KGML format from the KEGG website. We used the ver- sion last updated on 5th July 2016. The human pathway database (the organism code: hsa) contains 303 pathway diagrams. We excluded the overview maps since they depict very general biological concepts (e.g. one overview map is titled ªMetabolic pathwaysº). There remain 290 pathways for the following analysis. For each pathway, we retrieved the components tagged as gene, enzyme, and group. Those components are associated with gene products so we use them to map the identifiers of experimental data onto pathways. Reactome. The Reactome pathway database [36] provides curated pathways of signal transduction, transport, DNA replication, protein synthesis, metabolism and other cellular processes. We downloaded the pathway database in the tab-delimited format from the Reac- tome website. We used the version last updated on 30th June 2016 (v57). The human pathway database (the organism code: 9606) contains 1618 pathway diagrams of the lowest level (i.e. most specific) and we only used the lowest level pathways for the following analysis. STRING. The STRING database [37] scores and integrates known and predicted protein- protein interactions and gives a global perspective for many organisms. STRING provides eight PPI types: neighborhood, gene fusion, co-occurrence, co-expression, experiments, databases, textmining and homology [38]. STRING suggests four levels of confidence score: low confidence (0.150), medium confidence (0.400), high confidence (0.700), and highest confidence (0.900). The version for the following analysis is version 9.1, and we only used the interactions above medium confidence since it is the default threshold of the STRING website.The results were produced using the experiment PPI score to avoid artificial correlation due to special interest in the research community. HitPredict. The HitPredict database [39] provides the reliability scores of experimentally identified, physical protein-protein interactions (PPI). The database integrates five popular source database: IntAct, BioGrid, HPRD, DIP, and MINT. HitPredict also annotates the confi- dence scores into two levels: low and high. The version for the following analysis is version 4, and we only retrieved the interactions annotated with high quality. A knowledge-based T -statistic Data processing and pathway mapping. The required input format is a list of protein identifiers (e.g. UniProt accession number) with the expression ratios of the experimental group to the control. The list is subjected to the following data processing steps. First, we per- form log transformation and winsorization on the data subsequently (if needed). Extreme val- ues beyond the threshold are replaced with a maximum permitted value within the threshold. For example, if the expression ratios are (−7, −1, 3, 4, 6) and the threshold is set to 5, then the extreme value 6 will be replaced by the maximum permitted value 4, and -7 replaced by -4. Users can adjust the threshold based on the experimental conditions (i.e., different models or settings of mass spectrometry machines). The reason to set a threshold is to manually control the contribution of proteins with extreme values, since those proteins usually have very a low abundance (maybe beneath the threshold of quantification) in one of the experimental condi- tions. The default setting of the threshold is 10 and none of the testing data presented in this PLOS Computational Biology | https://doi.org/10.1371/journal.pcbi.1005601 June 16, 2017 6 / 29 2 A T -statistic for pathway analysis paper exceed the threshold. Second, we deal with the problem of multiple identifications and multiple ratios. This problem originates from the mapping of different primary identifiers to UniProt accession numbers. If there exists any ratio with multiple identifiers, we assign the same expression ratio to those identifiers. If there exists any identifier with multiple ratios, we assign the median of the ratios to the identifier. Third, we standardize the data by dividing the standard deviation. Last, we map these proteins with their expression ratios to the pathways. The identifier mapping tables are downloaded from the KEGG and the Ensembl [40] official websites. 2 2 The proposed T -statistic. Hotelling's T -statistic is a multivariate generalization of Stu- dent's t statistic. Let x , . . ., x be n independent random variables of an m-variate normal dis- 1 n 2 2 tribution, an original Hotelling's T -statistic for an one-sample T test is defined as follows: 2 1 T n
x μ S
x μ 0 0 where x is the vector of column means, S is an m × m sample covariance matrix, and μ is a given vector. The null hypothesis is that the population mean vector of data μ is equal to a given vector μ ; x is served as an estimation of μ. Here we introduce the notation to describe the design of the proposed T -statistic. Let a pathway P be the set of the proteins that are mapped from the data. Then, P fpji 1; . . . ; qg where each p indicates a specific protein with a corresponding ratio x , and q the number of i i mapped proteins (i.e., the size of a pathway). To ensure the biological robustness of the expres- sion ratios {x }, we have another threshold representing our confidence toward the expression direction of the data, the default setting is 1.5. Only the proteins showing more fold change beyond the threshold are qualified to possess their original ratios; the ratios of other proteins are recognized as disturbance and set to zero. We use the proteins with high expression ratios to represent the expression directions of the pathway. If directions of these proteins fit the covariance matrix, then the pathway is more likely to be enriched. In other words, we think the direction is not stable for those proteins of low expression ratios; we only care if the pro- teins of high expression ratios fit the covariance matrix. Again, users should adjust the thresh- old based on the experimental conditions. We collect the processed ratios into a column vector x. The proposed T -statistic is con- structed with the vector x and the covariance matrix S of x, the former represents the signifi- cance of protein expression; the latter represents the interaction structure among the proteins. The covariance matrix S is not computed from the experimental data. Due to the limitation of sample size, calculating an informative covariance is infeasible. Our idea is to use the confi- dence score provided by protein-protein interaction databases to represent the strength of the covariance, and use the expression direction provided by the testing dataset to indicate the sign of the covariance. In order to apply this idea, we have an assumption below: If p and p have a strong interaction supported by experimental evidence, their expression i j ratios x and x will have a high covariance s with a consistent sign, i j ij where x and x are any two elements of x and s is the corresponding covariance of S. To be i j ij more precise, we denote the PPI database as the set I and the interaction pairs collected as its PLOS Computational Biology | https://doi.org/10.1371/journal.pcbi.1005601 June 16, 2017 7 / 29 2 A T -statistic for pathway analysis elements: ( ) p and p are any two distinctive proteins with a confidence score v u I c p p v u c describing the strength of their interaction: p p v u On the basis of the confidence scores in I and the protein expression ratios, each element s of ij S is determined by the following four rules: 1. For the elements where i = j, the main diagonal of S, namely the variance of x. In this case, there is no evidence score to refer, we have to assign a constant to the diagonal elements. This constant represents our confidence toward the accuracy of the data, we use the medium confidence level suggested by STRING, which is 0.4. 2. For the elements where i6 j, the confidence score between protein p and p exist in I , and i j the ratios x and x are of the same sign. In this case, either x and x are both up-regulated or i j i j both down-regulated, the covariance between them should be positive. We use c directly p pj as the substitute of the covariance. 3. For the elements where i6 j, the confidence score between protein p and p exist in I , and i j the ratios x and x are of opposite signs. In this case, one of x and x is up-regulated and the i j i j other is down-regulated, the covariance between them should be negative. We take the neg- ative value of c to be the substitute of the covariance. p pj 4. For the elements where i6 j, and the confidence score between protein p and p does not i j exist in I . In this case, we assign zero to these elements. The possible values of s can be summarized as follows: ij 0:4 if i j: > c if i 6 j; c 2 I; and x x 0: < p p p p i j i j i j s ij >