Borja Calvo, N. López-Bigas, S. Furney, P. Larrañaga, J. Lozano (2007)
A partially supervised classification approach to dominant and recessive human disease gene prediction. Computer Methods and Programs in Biomedicine, 85(3)
E. Petretto, J. Mangion, N. Dickens, S. Cook, M. Kumaran, Han Lu, J. Fischer, H. Maatz, V. Křen, M. Pravenec, N. Hubner, T. Aitman (2006)
Heritability and Tissue Specificity of Expression Quantitative Trait Loci. PLoS Genetics, 2
E. Schmidt, E. Birney, David Croft, B. Bono, P. D’Eustachio, M. Gillespie, Gopal Gopinath, B. Jassal, S. Lewis, L. Matthews, L. Stein, Imre Vastrik, Guanming Wu (2004)
Reactome: a knowledgebase of biological pathways. Nucleic Acids Research, 33
Jung Choi, U. Yu, O. Yoo, Sangsoo Kim (2005)
Differential coexpression analysis using microarray data and its application to human cancer
D. Altshuler, M. Daly, L. Kruglyak (2000)
Guilt by association. Nature Genetics, 26
Natalie Wilson (2004)
Human Protein Reference Database. Nature Reviews Molecular Cell Biology, 5
T. Billiar, D. Camp, G. Casella, M. Choudhry, Charles Cooper, Constance Elson, Bradley Freeman, R. Gamelli, Celeste Campbell-Finnerty, N. Gibran, D. Hayden, B. Harbrecht, David Herndon, J. Horton, W. Hubbard, J. Hunt, Jeffrey Johnson, M. Klein, J. Lederer, T. Logvinenko, R. Maier, J. Mannick, Philip Mason, B. McKinley, J. Minei, Ernest Moore, F. Moore, A. Nathens, G. O’Keefe, L. Rahme, D. Remick, D. Schoenfeld, M. Schwacha, M. Shapiro, G. Silver, Richard Smith, John Storey, M. Toner, H. Warren (2005)
A network-based analysis of systemic inflammation in humans. Nature, 437
A. Hamosh, A. Scott, J. Amberger, D. Valle, V. McKusick (2000)
Online Mendelian Inheritance In Man (OMIM). Human Mutation, 15
JK Choi, U Yu, OJ Yoo, S Kim (2005)
Differential coexpression analysis using microarray data and its application to human cancer. Bioinformatics, 21
Paul Shannon, Andrew Markiel, Owen Ozier, N. Baliga, Jonathan Wang, Daniel Ramage, Nada Amin, Benno Schwikowski, T. Ideker (2003)
Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Research, 13(11)
Gary Bader, I. Donaldson, Cheryl Wolting, B. Ouellette, T. Pawson, C. Hogue (2003)
BIND: the Biomolecular Interaction Network Database. Nucleic Acids Research, 31(1)
L. Elo, H. Järvenpää, M. Orešič, R. Lahesmaa, T. Aittokallio (2007)
Systematic construction of gene coexpression networks with applications to human T helper cell differentiation process. Bioinformatics, 23(16)
L. Franke, H. Bakel, L. Fokkens, E. Jong, M. Egmont-Petersen, C. Wijmenga (2006)
Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. American Journal of Human Genetics, 78(6)
Jing Yang, A. Su, Wen-Hsiung Li (2005)
Gene expression evolves faster in narrowly than in broadly expressed mammalian genes. Molecular Biology and Evolution, 22(10)
H. Ogata, S. Goto, Kazushige Sato, W. Fujibuchi, H. Bono, M. Kanehisa (1999)
KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 27(1)
C. Jongeneel, M. Delorenzi, C. Iseli, Daixing Zhou, C. Haudenschild, I. Khrebtukova, D. Kuznetsov, Brian Stevenson, R. Strausberg, A. Simpson, T. Vasicek (2005)
An atlas of human gene expression from massively parallel signature sequencing (MPSS). Genome Research, 15(7)
K. Goh, M. Cusick, D. Valle, B. Childs, M. Vidal, A. Barabási (2007)
The human disease network. Proceedings of the National Academy of Sciences, 104
E. H. Simpson (1951)
The Interpretation of Interaction in Contingency Tables. Journal of the Royal Statistical Society: Series B (Methodological), 13
C. Wolfe, I. Kohane, A. Butte (2005)
Systematic survey reveals general applicability of "guilt-by-association" within gene coexpression networks. BMC Bioinformatics, 6
Arzucan Özgür, T. Vu, Günes Erkan, Dragomir Radev (2008)
Identifying gene-disease associations using centrality on a literature mined gene-interaction network. Bioinformatics, 24
Lois Oakes (1957)
Human Disease. The Ulster Medical Journal, 26
(2005)
Inflammation and Host Response to Injury Large Scale Collaborative Research: A network-based analysis of systemic inflammation
A. Reverter, A. Ingham, S. Lehnert, Siok-Hwee Tan, Yonghong Wang, A. Ratnakumar, B. Dalrymple (2006)
Simultaneous identification of differential gene expression and connectivity in inflammation, adipogenesis and cancer. Bioinformatics, 22(19)
Han Liang, Wen-Hsiung Li (2007)
Gene essentiality, gene duplicability and protein connectivity in human and mouse. Trends in Genetics, 23(8)
T. Reverter-Gomez, S. McWilliam, W. Barris, B. Dalrymple (2005)
A rapid method for computationally inferring transcriptome coverage and microarray sensitivity. Bioinformatics, 21(1)
A. Su, T. Wiltshire, S. Batalov, H. Lapp, K. Ching, David Block, Jie Zhang, R. Soden, M. Hayakawa, Gabriel Kreiman, M. Cooke, J. Walker, J. Hogenesch (2004)
A gene atlas of the mouse and human protein-encoding transcriptomes. Proceedings of the National Academy of Sciences of the United States of America, 101(16)
J. Oyston (1998)
Online Mendelian Inheritance in Man. Anesthesiology, 89(3)
N. Luscombe, M. Babu, Haiyuan Yu, M. Snyder, S. Teichmann, M. Gerstein (2004)
Genomic analysis of regulatory network dynamics reveals large topological changes. Nature, 431
E. Winter, L. Goodstadt, C. Ponting (2003)
Elevated rates of protein secretion, evolution, and disease among tissue-specific genes. Genome Research, 14(1)
Goparani Mishra, M. Suresh, K. Kumaran, N. Kannabiran, Shubha Suresh, P. Bala, K. Shivakumar, N. Anuradha, R. Reddy, T. Raghavan, Shalini Menon, G. Hanumanthu, Malvika Gupta, Sapna Upendran, Shweta Gupta, M. Mahesh, Bincy Jacob, Pinky Mathew, P. Chatterjee, K. Arun, Salil Sharma, K. Chandrika, Nandan Deshpande, Kshitish Palvankar, R. Raghavnath, R. Krishnakanth, H. Karathia, B. Rekha, R. Nayak, G. Vishnupriya, H. Kumar, M. Nagini, Sameer Kumar, Rojan Jose, P. Deepthi, S. Mohan, T. Gandhi, H. Harsha, K. Deshpande, M. Sarker, T. Prasad, A. Pandey
Human Protein Reference Database—2009 Update
Liqing Zhang, Wen-Hsiung Li (2005)
Human SNPs reveal no evidence of frequent positive selection. Molecular Biology and Evolution, 22(12)
Background: The tissue specificity of gene expression has been linked to a number of significant outcomes including level of expression, and differential rates of polymorphism, evolution and disease association. Recent studies have also shown the importance of exploring differential gene connectivity and sequence conservation in the identification of disease-associated genes. However, no study relates gene interactions with tissue specificity and disease association. Methods: We adopted an a priori approach making as few assumptions as possible to analyse the interplay among gene-gene interactions with tissue specificity and its subsequent likelihood of association with disease. We mined three large datasets comprising expression data drawn from massively parallel signature sequencing across 32 tissues, describing a set of 55,606 true positive interactions for 7,197 genes, and microarray expression results generated during the profiling of systemic inflammation, from which 126,543 interactions among 7,090 genes were reported. Results: Amongst the myriad of complex relationships identified between expression, disease, connectivity and tissue specificity, some interesting patterns emerged. These include elevated rates of expression and network connectivity in housekeeping and disease-associated tissue-specific genes. We found that disease-associated genes are more likely to show tissue specific expression and most frequently interact with other disease genes. Using the thresholds defined in these observations, we develop a guilt-by-association algorithm and discover a group of 112 non-disease annotated genes that predominantly interact with disease-associated genes, impacting on disease outcomes. Conclusion: We conclude that parameters such as tissue specificity and network connectivity can be used in combination to identify a group of genes, not previously confirmed as disease causing, that are involved in interactions with disease causing genes. 
Our guilt-by-association algorithm should be useful for the discovery of additional modifiers of genetic diseases, and more generally, for the ability to associate genes of unknown function to clusters of genes with defined functions allowing for novel biological inference that can be subsequently validated.

BioData Mining 2008, 1:8 http://www.biodatamining.org/content/1/1/8

Background

The understanding of the biology underlying phenotype is still a limiting factor in delivering the promise of high-throughput genomics. However, as new datasets become available, new data mining methods are developed and the goal appears ever more achievable.

Among the high-throughput technologies, gene expression profiling has led to the identification of genes that perform in a coordinated manner allowing researchers to reasonably predict the role of genes for which no biological function was attributed, based on the known performance of other group members. These predictions rely on the guilt-by-association heuristic, widely invoked in genomics and with proven applicability [1].

At the same time, a comprehensive atlas of transcribed genes in humans has revealed that genes may be split into two broad categories based on the number of tissues they are expressed in [2]. Genes that are expressed in many tissues are designated as housekeeping (HK) while those that are expressed in few tissues are termed tissue-specific (TS).

Tissue specificity has subsequently been linked to a number of significant outcomes including level of expression [3], ability to detect cis-acting and trans-acting expression-quantitative trait loci [4], and differential rates of polymorphism [5], evolution [6] and disease-association [7]. In addition, we [8] and others [9,10] have demonstrated the importance of exploring differential gene connectivity in the identification of disease-associated genes using microarray gene expression data. More recently, the combination of text mining with gene interaction network analysis has been proposed to infer unknown gene-disease associations [11].

Furthermore, genes with a high degree of connectivity (network hubs) have been shown to be conserved across species [12] and their knockout phenotype more likely to be lethal [13]. Finally, based on sequence conservation across species, a computational algorithm has been developed to identify genes associated with disease [14]. However, no study relates gene interactions with tissue specificity and its subsequent likelihood of association with disease.

To address this situation, we mined three large independent datasets and classified transcribed human genes based on transcript abundance, tissue specificity, gene connectivity and disease association. We discuss how these factors relate to each other and, based on this new knowledge, implement a simple yet powerful guilt-by-association algorithm that allows us to identify several candidate genes that, while not previously associated with disease, may impact the development of diseases, including cancers, and hypothesize that many other members of this list will ultimately be confirmed as modifiers of various genetic diseases.

Methods

Data resources, edits and nomenclature

We merged three large datasets as follows: Firstly, we accessed expression data drawn from massively parallel signature sequencing (MPSS) covering 182,719 tag signatures across 32 tissues [2]. Tissues represented on the MPSS data included nine different central nervous system (CNS) areas (amygdala, caudate nucleus, cerebellum, corpus callosum, fetal brain, hypothalamus, thalamus, spinal cord, and pituitary gland) and 23 non-CNS organs (adrenal gland, bladder, bone marrow, heart, kidney, lung, mammary gland, pancreas, placenta, prostate, retina, salivary gland, small intestine, spleen, stomach, testis, thymus, thyroid, trachea, uterus, colon, monocytes and peripheral blood lymphocytes). A total of 18,677 unique genes were represented on the MPSS data and the number of expressed genes per tissue averaged 8,943 and ranged from 5,845 in pancreas to 12,267 in testis.

Secondly, we downloaded a set of 55,606 true positive interactions among 7,197 genes that were defined from functional studies [15]. This interactions dataset was built including 2,788 confirmed, direct, physical protein-protein interactions derived from the Biomolecular Interaction Network Database (BIND; http://binddb.org) [16], 18,176 confirmed human protein interactions from the Human Protein Reference Database (HPRD; http://www.hprd.org/) [17], 22,012 direct functional interactions from the Kyoto Encyclopedia of Genes and Genomes (KEGG; http://www.genome.jp/kegg) [18], and 16,295 interactions derived from Reactome (http://www.reactome.org) [19].

Finally, we used the microarray expression results generated during the profiling of systemic inflammation across 44,924 probe sets [20] and from which 126,543 interactions among 7,090 genes were reported [8]. The microarray experiment used 92 Affymetrix GeneChips (Affymetrix, Santa Clara, CA) to examine gene expression profiles in whole blood leukocytes immediately before and at 2, 4, 6, 9 and 24 h after intravenous administration of bacterial lipopolysaccharide (LPS) endotoxin to four healthy human subjects. For the control (placebo) data, four additional subjects were studied under identical conditions but without LPS administration.

For the present study, and to enable the merging of the three datasets, a number of edits were performed as follows: For the MPSS data, tags not expressed at more than 5 transcripts per million (tpm), in at least one tissue, were disregarded. The threshold of 5 tpm corresponds to the sensitivity of MPSS technology as claimed by the manufacturers and independently assessed in our laboratory [21]. Also, when the same gene was represented by more than one MPSS tag, the reading from the most abundant tag, summed across all tissues, was assigned to that gene. Finally, for the true positive interactions and the inflammation datasets, interactions involving genes not surveyed in the MPSS data were also discarded.

These criteria resulted in 15,050 genes [see Additional file 1] of which 5,198 and 4,950 were included in the true positive interactions and the inflammation datasets, respectively, and with 2,499 genes in common. In addition, a total of 6,151 (41%) of the genes were associated with disease according to the OMIM database [22] as of September 19, 2007; and with 1,445 of them defined as disease-causing (i.e., associated with either known disease phenotype or polymorphic sequence known).

Hereafter, we refer to DIS to indicate the 6,151 genes from our resulting dataset that are disease-associated according to OMIM, and to NDIS to indicate the remaining 8,899 non-disease-associated genes also according to OMIM. Similarly, we refer to INT (and NINT) to indicate genes in our dataset for which interactions have (and have not) been reported.

Data mining approaches

In order to further characterize the relationship existing between tissue specificity, gene connectivity and disease association, the 15,050 genes were classified as either TS or HK. To ensure that these two categories together represented the majority of the genes, we searched for category limits from either extreme of the distribution of the number of genes expressed in one, two, and up to 32 tissues, until equivalent categories were defined, cumulatively representing > 50% of the total number of genes. In doing so, there were 4,232 (28%) TS genes expressed in 1 to 4 tissues, and 4,006 (27%) HK genes expressed in more than 25 tissues. The remaining 6,812 (45%) genes were classified as non-specific (NS).

Finally, and in order to identify novel candidate genes impacting disease, we developed a guilt-by-association algorithm. Selection thresholds based on the average number of known interactions combined with the average proportion of DIS genes among their interactors were determined from DIS genes. These thresholds were then applied to genes in the NDIS category. Genes exceeding both thresholds were identified as likely disease-associated candidates.

Results and discussion

Initial gene groupings and unknown biological processes

Figure 1 illustrates the way in which the 15,050 genes were simultaneously annotated as either disease-associated or included in the true positive interactions and the inflammation datasets. These genes were further classified as either TS, NS or HK, and the number of disease-associated and/or interacting genes contained within each of the resulting 12 categories was determined. The proportion of genes with unknown biological process was also registered.

As expected, the discovery of interactions as well as disease-association for a given gene provides additional biological knowledge, allowing inferences as to its genomic functionality. Nevertheless, the biological process of about 10% of these presumably well-characterized genes remains to be elucidated. On the other extreme, and highlighting the extent to which further research is needed, as many as 85% of NDIS, NINT genes and across the three expression categories (TS, NS and HK) belong to an unknown biological process.

Figure 1: Gene groupings. Genes were classified as tissue-specific (TS), non-specific (NS) or housekeeping (HK). Among each class, the number of interacting and disease-associated genes is noted, and for each of the resulting 12 categories, the percentage of genes with unknown biological process ontology is given.
[Bar chart; x-axis: % Unknown Biological Process, 0 to 100. Bar labels:]

Class  Interaction  Disease  # Genes
HK     Y            Y        1,581
HK     N            Y        -
HK     Y            N        1,288
HK     N            N        918
NS     Y            Y        2,034
NS     N            Y        -
NS     Y            N        1,325
NS     N            N        2,678
TS     Y            Y        954
TS     N            Y        588
TS     Y            N        467
TS     N            N        2,223

The impact of tissue-specificity

Among the myriad of complex relationships, some interesting patterns emerged. Consistent with previous findings [3], we observed a strong relationship between the number of tissues in which a gene was expressed and its level of expression (Table 1). Importantly, this relationship was unaffected by disease or interaction status.

Table 1: Relationship between the number of tissues in which a gene is expressed and a series of variables.

Variable                                         Correlation  Regression
Expression:
  Overall genes                                   0.706        2.034
  Non-Interacting (NI) genes only                 0.709        1.802
  Interacting genes only                          0.707        2.107
  Non-Disease (ND)                                0.707        1.769
  Disease (D)                                     0.709        2.382
  Non-Interacting and Non-Disease                 0.691        1.759
  Non-Interacting and Disease                     0.764        2.039
  Interacting and Non-Disease                     0.719        1.803
  Interacting and Disease                         0.702        2.438
Proportion of interacting genes:
  Overall genes                                   0.949        0.012
  Non-Disease genes only                          0.942        0.013
  Disease genes only                              0.917        0.008
Proportion of disease genes:
  Overall                                         0.527        0.002
  Non-Interacting genes only                     -0.263       -0.001
  Interacting genes only                         -0.733       -0.004
Tissue specificity of interactors:
  Overall genes                                   0.887        0.112
  Non-Disease genes only                          0.736        0.062
  Disease genes only                              0.872        0.151
Proportion of disease genes among interactors:
  Overall genes                                   0.229        0.000
  Non-Disease genes only                         -0.048        0.000
  Disease genes only                              0.575        0.001

Overall, the distribution of the expression of genes among tissues was grossly bimodal. However, this bimodality vanished when the distribution was examined separately for INT and NINT genes (Figure 2). INT genes are over-represented among HK genes, while NINT genes are predominantly TS. We conclude that the more tissues a gene is expressed in, the higher its chances of interacting with at least one other gene, irrespective of the tissue-specificity of this second gene.

Figure 2: Frequency histogram of gene expression. For each gene, the tissues showing expression at more than 5 transcripts per million were counted and the histogram explored separately for non-interacting (green) and interacting genes (red). The two distributions are statistically different (Kolmogorov-Smirnov test P-value < 0.001).
[Histogram panels: Interacting genes; Non-Interacting genes.]

Figure 3 presents the relationship between tissue specificity and proportion of disease-associated genes. The overall Pearson correlation coefficient (PCC) was moderate (0.53) yet significant (P = 0.0019) indicating an increase in the number of DIS genes among broadly expressed genes. Computing the PCC conditional on interaction status results in a non-significant PCC of -0.26 (P = 0.1459) for NINT genes, and a strong negative PCC of -0.73 (P < 0.0001) for INT genes. This counterintuitive pattern of correlation is representative of Simpson's Paradox [23] with the paradox being that, although INT genes tend to be expressed in many tissues, those that are expressed in a tissue specific manner are more likely to be DIS. This is likely due to the increased number of relationships an interacting HK gene would have compared to a TS equivalent, thereby increasing the likelihood of a mutation leading to a detrimental and potentially lethal outcome, as previously determined [6]. We conclude that it is not so much that TS genes are more likely to be associated with disease, but rather that HK genes associated with disease are rarely observed.

Figure 3: Disease association and tissue specificity. Relationship between tissue specificity (x-axis) and proportion of disease-associated genes (y-axis) computed using all genes (blue pattern), and separate for non-interacting (green pattern) and interacting genes (red pattern).
[Plot; x-axis: Number of tissues; y-axis: Proportion of disease genes.]

Gene interactions in the context of tissue-specificity and disease association

Our analyses revealed that interacting HK genes are more likely to interact with genes that are also HK (PCC = 0.89; P < 0.0001) and vice-versa (i.e., TS genes are more likely to interact among themselves). Importantly, this correlation remained strong when conditioning on disease status (Table 1). Also, interactions between two HK genes were 12.8 times more frequent (P < 0.0001) and 3.3 times more cohesive (P < 0.0001) as measured by the clustering coefficient, than interactions between two TS genes. The clustering coefficient is a measure of network cohesiveness and captures how many neighbours of a given gene are connected to each other.

Similarly, interactions between two DIS genes were 3.1 times more frequent (P < 0.0001) and 1.6 times more cohesive (P < 0.001) than interactions between two NDIS genes (Figure 4). Consistent with our results, genes associated with similar disorders have been shown to have higher likelihood of physical interactions between their products and a higher expression profiling similarity for their transcripts [24].

Figure 4: Relating gene connectivity with disease association and tissue specificity. Percentage of gene-gene interactions that exists between two groups of genes depending on their tissue specificity (TS: tissue-specific, NS: non-specific, and HK: housekeeping) and disease association. Colours indicate interactions between two disease-associated genes (red), between a disease-associated and a non-disease-associated gene (yellow), and between two non-disease-associated genes (green). The size of the rectangles indicates the relative number of interacting genes in each group.

                    TS                    NS                    HK
                    Disease  Non-Disease  Disease  Non-Disease  Disease  Non-Disease
TS  Disease         1.0      0.4          2.6      1.0          2.6      1.2
TS  Non-Disease     0.4      0.2          1.0      0.5          1.2      0.7
NS  Disease         2.6      1.0          6.7      3.1          7.4      3.9
NS  Non-Disease     1.0      0.5          3.1      1.9          4.2      2.7
HK  Disease         2.6      1.2          7.4      4.2          9.7      5.7
HK  Non-Disease     1.2      0.7          3.9      2.7          5.7      3.8

Identification of candidate disease genes via guilt-by-association

Given our measurement confirming that like associates with like, we developed a guilt-by-association algorithm with the aim of identifying candidate genes among the previously classified non-disease subset. Our guilt-by-association algorithm starts by examining the connectivity properties of the DIS genes. In this context, DIS genes were found to be involved, on average, in 12 interactions (ranging from 0 to 176). Also on average, their interactors were themselves DIS genes in 75% of instances. Importantly, while only 1,132 (or 18.4%) of DIS genes had > 12 interactions (revealing the skewness in the number of interactions), 651 (or 57.5%) of them interacted with DIS genes > 75% of the time. When these same thresholds (i.e., > 12 interactions and > 75% of DIS genes among interactors) were applied to NDIS genes, we revealed the presence of 112 genes [see Additional file 2], including 26 TS, 50 NS and 36 HK, that while not being associated with disease, have higher than average connectivity degree (> 12 connections) and higher than average proportion (> 75%) of genes in OMIM among their connectors. Table 2 presents the number of genes in the contingency table underlying our guilt-by-association algorithm.

To assess the optimality of our approach, we repeated the analyses using only the 1,445 DIS genes (out of the initial 6,151) with known disease phenotype and either sequence mutation or molecular basis known as those declared as truly disease-associated. The new thresholds for connectivity and proportion of DIS genes among interactors were 12 and 35%, respectively. The new list of candidate genes included 127 genes of which 107 were assessed as DIS in the initial list of 6,151. Assuming the remaining 20 genes are indeed false positives, this implies a precision of at least 84%.

It should be noted that precision alone is not enough to assess the goodness of a classifier, as it is only concerned with the ratio of identified genes that are positive, but not with the total number of discovered genes.

In order to further ascertain the optimality of various location parameters to be used as thresholds in the guilt-by-association algorithm, we explored the proportion of truly disease associated genes from the total number of captured genes and the results are presented in Table 3. While the median performs slightly better (i.e. by up to 1.03 times better, or 78.9 over 76.3) than the mean when used as a threshold for the proportion of disease genes among interactors, this improvement is at the expense of generating substantially larger lists of candidate genes. When exploring the number of connections, the mean is very close to the 75th percentile, indicating the skewness in the connectivity distribution with most genes having few connections and few genes having many connections. Also, as a threshold for the number of connections, the mean performs favourably against either inter-quartile.

However, the infeasibility of directly computing performance measures associated with a given algorithm in the absence of negative examples should be acknowledged. That is, although one can be relatively sure that certain genes are associated with a disease, it is not possible to ensure that a set of genes is not involved in any disease. In other words: Absence of evidence is not evidence of absence. On the other extreme, some of the genes annotated as disease associated by OMIM could also be false positives. In these situations, partially supervised learning algorithms have been proposed to address this issue, including in the context of identifying disease genes [14].

Nevertheless, a literature survey revealed that 44 of the 112 candidate genes [see Additional file 2] have been previously associated with polymorphisms or differential gene expression leading to a modified risk of disease. A further 10 genes exist within chromosomal regions associated with disease. The remaining 58 genes have no obvious association to disease in any system. The 39% rate of disease association determined here is much higher (hypergeometric P = 7.5 × 10^-16) than the 14% predicted by OMIM across the genome, with 2,549 genes defined as the basis of heritable disease out of the 18,091 total.

Clusters of disease among candidate genes

In order to determine what diseases these genes might impact, we explored the gene networks spanned by the members of our guilt-by-association list, alone and in combination with their interactors. Based on the disease associations shown [see Additional file 2], each cluster was examined for a common disease. In this fashion, we identified two clusters of genes that impact on either breast or gastric cancer. Figure 5 depicts the Cytoscape [25] representation of the breast cancer cluster where seven of our guilt-by-association genes (APBA2BP, CCNA2, COBRA1, PCAF, RAD51, SMARCA4 and STAT5A) were linked to the well characterized human breast cancer susceptibility genes, BRCA1 and BRCA2. Although none of these genes are annotated as disease causing in OMIM, five have been previously associated with the development of breast cancer, for example, alleles of RAD51 are epistatic with alleles of BRCA2. However, CCNA2 is only mentioned in a very small number of reports on breast cancer and APBA2BP is not a well studied gene.

For the case of gastric cancer, another cluster of seven genes (AKT3, KRAS, MAP2K4, PIK3CB, PLCB1, PIK3R5 and PPP3R2) was identified. Four of these genes have been previously associated with gastrointestinal disease while AKT3, PIK3CB and PIK3R5 have not, although the differential expression of AKT3 in gastric cancer is well defined [see Additional file 2]. We suggest these previously non-associated genes are strong candidates for further study into the basis of these diseases and are potential prognostic markers.

Conclusion

Data mining approaches have allowed us to gain an insight into the complex relationships existing between gene expression, disease association, network connectivity and tissue specificity. We have identified elevated rates of expression and network connectivity among broadly expressed genes, and among disease-associated tissue-specific genes.

In particular, when exploring the relationship between tissue specificity and disease association, we found this relationship most interesting. While there is a moderate positive relationship between the number of tissues in which a gene is expressed and the proportion of disease genes, we show that this relationship is reversed when only considering genes for which interactions have been reported. We present this phenomenon as an example of the well-reported Simpson's Paradox. To a great extent, the inclusion of number of interactions as a threshold parameter in our guilt-by-association algorithm obviates the need to also include tissue specificity.

However, it should also be acknowledged that probability values associated with testing the null hypothesis of a given PCC not being statistically different from zero were computed assuming asymptotic normality and as such are prone to inaccuracies. With this in mind, we focussed on combining discrete parameters such as number of connections and the association to disease-associated genes to identify a group of genes, not previously confirmed as disease causing, that are involved in interactions with disease causing genes. The nature of these newly identified interactions could range from epistatic interactions (i.e., the action of one gene is suppressed by another such as the case of RAD51 and BRCA1) to physical gene-gene interactions to correlated co-expression. Based on bibliographical validation and network re-construction we have identified several candidate genes that may impact the development of cancer and hypothesize that many other members of this list will ultimately be confirmed as modifiers of various genetic diseases.

Finally, it should be noted that while new algorithms are being proposed in the literature at a rather frantic pace, the task of comprehensively comparing algorithms could be unattainable if not futile. Instead, we claim that our use of conservative thresholds for predicting disease association is justified because using thresholds of known disease genes increases our likelihood of success given any estimation process is going to have a degree of false positives. We acknowledge the list does not exhaust all possible disease genes but merely gives researchers the best short list for further study.

Abbreviations

HK: housekeeping; MPSS: massively parallel signature sequencing; NS: non-specific; PCC: Pearson correlation coefficient; TS: tissue-specific; DIS: genes in our dataset that are disease-associated according to OMIM as of September 19, 2007; NDIS: genes in our dataset that are non-disease-associated genes also according to OMIM; INT: genes in our dataset for which interactions have been reported; NINT: genes in our dataset for which interactions have not been reported.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

AR conceived the study, carried out the data mining approaches and drafted the manuscript. AI directed the design and coordination of the biological/immunological relevance of the results and drafted the manuscript. BD participated in the coordination of the whole study and drafted the manuscript. All authors read and approved the final manuscript.

Table 2: Contingency table underlying the guilt-by-association algorithm

                                             % Disease-associated genes among interactors
Disease Associated?  Number of Connections   ≤ 75      > 75
Yes                  ≤ 12                    3,112     1,907
Yes                  > 12                    481       651
No                   ≤ 12                    7,853     705
No                   > 12                    229       112

Number of disease- and non-disease-associated genes by thresholds on number of connections and percentage of disease-associated genes among interactors. The thresholds are obtained from exploring disease-associated genes and correspond to the average number of connections (12) among disease-associated genes and the average proportion of disease-associated genes (75%) among their interactors. The 112 non-disease-associated genes (bottom right cell) form the basis of the newly reported disease-associated genes [see Additional file 2].

Table 3: Precision analysis of the guilt-by-association algorithm

Threshold for % disease genes               Threshold for number of connections (TC)
among interactors (TD)                      Q1        Q2        Q3        Mean
                                            TC = 1    TC = 4    TC = 13   TC = 12
Q1    TD = 12.8    N Captured               1,943     1,391     638       683
                   % Known                  73.3      75.0      76.5      76.4
Q2    TD = 28.6    N Captured               1,024     563       195       219
                   % Known                  74.8      78.9      85.1      84.9
Q3    TD = 50.0    N Captured               251       118       16        19
                   % Known                  70.5      67.8      75.0      78.9
Mean  TD = 35.0    N Captured               748       409       109       127
                   % Known                  73.4      76.3      84.4      84.2

The optimality of various location parameters to be used as thresholds in the guilt-by-association algorithm was explored by computing the proportion of known (% Known) disease associated genes from the total number of captured genes (N Captured). The analysis was performed using only the 1,445 genes (out of the initial 6,151) with known disease phenotype as the set of truly disease causing, and with the remaining 4,706 declared as disease associated.
The three inter-quartiles (Q1: 25 percentile; Q2: 50 percentile or median; and Q3: 75 percentile) plus the mean were used as thresholds. Page 9 of 11 (page number not for citation purposes) BioData Mining 2008, 1:8 http://www.biodatamining.org/content/1/1/8 Gu Figure 5 ilt-by-association network analysis on breast cancer Guilt-by-association network analysis on breast cancer. A cluster of 7 non-disease associated genes (yellow) each inter- acting with BRCA1 and/or BRCA2. Acknowledgements The authors are grateful to Victor Jongeneel and Christian Haudenschild Additional material for providing the gene-centric and tag-centric annotated MPSS data files. The authors would like to acknowledge three reviewers who provided Additional file 1 important insights. In particular, comments by Borja Calvo on previous ver- sions of this manuscript greatly improved its final outcome. This work was Additional Table 1: The set of 15,050 genes. List of 15,050 genes included in the analyses. For each gene, the number of tissues (out of 32) supported by the CSIRO Centre for Complex Systems Science http:// in which the gene is being expressed, its average expression, disease asso- www.csiro.au/science/ComplexSystemsScience.html. ciation and connectivity structure is provided. Click here for file References [http://www.biomedcentral.com/content/supplementary/1756- 1. Wolfe CJ, Kohane IS, Butte AJ: Systematic survey reveals gen- 0381-1-8-S1.xls] eral applicability of "guilt-by-association" within gene coex- pression networks. BMC Bioinformatics 2005, 6:227. 2. Jongeneel CV, Delorenzi M, Iseli C, Zhou D, Haudenschild CD, Additional file 2 Khrebtukova I, Kutnetsov D, Stevenson BJ, Strausberg RL, Simpson Additional Table 2: Set of 112 guilt-by-association genes. List of 112 AJG, Vasicek TJ: An atlas of human gene expression from mas- genes not associated with disease according to OMIM yet with high con- sively parallel signature sequencing (MPSS). Genome Res 2005, 15:1007-1014. 
nectivity with disease-associated genes. For each gene, the proportion of 3. Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, disease genes among connectors and polymorphism or differential expres- Soden R, Hayakawa M, Kreiman G, Cooke MP, Walker JR, Hogenesch sion associated with disease along with the relevant literature reference is JB: A gene atlas of the mouse and human protein-encoding provided. transcriptomes. Proc Natl Acad Sci USA 2004, 101:6062-6067. Click here for file 4. Pettretto E, Mangion J, Dickens NJ, Cook SA, Kumaran MK, Lu H, [http://www.biomedcentral.com/content/supplementary/1756- Fischer J, Maatz H, Kren V, Pravenec M, Hubner M, Hubner N, Aitman TJ: Heritability and tissue specificity of expression quantita- 0381-1-8-S2.doc] tive trait loci. PLoS Genetics 2006, 2:1625-1633. 5. Zhang L, Li WH: Human SNPs reveal no evidence of frequent positive selection. Mol Biol Evol 2005, 22:2504-2507. Page 10 of 11 (page number not for citation purposes) BioData Mining 2008, 1:8 http://www.biodatamining.org/content/1/1/8 6. Yang J, Su AI, Li WH: Gene expression evolves faster in nar- rowly than in broadly expressed mammalian genes. Mol Biol Evol 2005, 22:2113-2118. 7. Winter EE, Goodstadt L, Ponting CP: Elevated rates of protein secretion, evolution and disease among tissue-specific genes. Genome Res 2004, 14:54-61. 8. Reverter A, Ingham A, Lehnert SA, Tan SH, Wang YH, Ratnakumar A, Dalrymple BP: Simultaneous identification of differential gene expression and connectivity in inflammation, adipogenesis and cancer. Bioinformatics 2006, 22:2396-2404. 9. Choi JK, Yu U, Yoo OJ, Kim S: Differential coexpression analysis using microarray data and its application to human cancer. Bioinformatics 2005, 21:4348-4355. 10. Elo LL, Järvenpää H, Oresic M, Lahesmaa R, Aittokallio T: System- atic construction of gene coexpression networks with appli- cation to human T helper cell differentiation process. Bioinformatics 2007, 23:2096-2103. 11. 
Özgür A, Vu T, Erkan G, Radv DR: Identifying gene-disease asso- ciations using centrality on a literature mined gene-interac- tion network. Bioinformatics 2008, 24:i277-i285. 12. Luscombe NM, Babu MM, Yu H, Snyder M, Telchmann SA, Gerstein M: Genomic analysis of regulatory network dynamics reveals large topological changes. Nature 2004, 431:308-312. 13. Liang H, Li WH: Gene essentiality, gene duplicability and pro- tein connectivity in human and mouse. Trends Genet 2007, 23:375-378. 14. Calvo B, López-Bigas N, Furney SJ, Larrañaga P, Lozano JA: A par- tially supervised classification approach to dominant and recessive human disease gene prediction. Comput Meth Prog Bio 2007, 85:229-237. 15. Franke L, van Bakel H, Fokkens L, de Jong ED, Egmont-Petersen M, Wijmenga C: Reconstruction of a functional human gene net- work, with an application for prioritizing positional candi- date genes. Am J Hum Genet 2006, 78:1011-1025. 16. Bader GD, Donaldson I, Wolting C, Ouellette BF, Pawson T, Hoque CW: BIND – The Biomolecular Interaction Network Data- base. Nucleic Acids Res 2001, 29:242-245. 17. Mishra GR, Suresh M, Kumaran K, Kannabiran N, Suresh S, Bala P, Shi- vakumar K, Anuradha N, Reddy R, Raghavan TM, Menon S, Hanuman- thu G, Gupta M, Upendran S, Gupta S, Mahesh M, Jacob B, Mathew P, Chatterjee P, Arun KS, Sharma S, Chandrika KN, Deshpande N, Pal- vankar K, Raghavnath R, Krishnakanth R, Karathia H, Rekha B, Nayak R, Vishnupriya G, Kumar HG, Nagini M, Kumar GS, Jose R, Deepthi P, Mohan SS, Gandhi TK, Harsha HC, Deshpande KS, Sarker M, Prasad TS, Pandey A: Human protein reference database – 2006 update. Nucleic Acids Res 2006, 34:D411-D414. 18. Kanehisa M, Goto S: KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000, 28:27-30. 19. Joshi-Tope G, Gillespie M, Vastrik I, D'Eustachio P, Schmidt E, de Bono B, Jassal B, Gopinath GR, Wu GR, Matthews L, Lewis S, Birney E, Stein L: Reactome: a knowledgebase of biological pathways. 
Nucleic Acids Res 2005, 33:D428-D432. 20. Calvano SE, Xiao W, Richards DR, Felciano RM, Baker HV, Cho RJ, Chen RO, Brownstein BH, Cobb JP, Tscheke SK, Miller-Graziano C, Moldawer LL, Mindrinos MN, Davis RW, Tompkins RG, Lowry SF, the Inflammation and Host Response to Injury Large Scale Collabora- tive Research: A network-based analysis of systemic inflamma- tion in human. Nature 2005, 337:1032-1037. 21. Reverter A, McWilliam SM, Barris W, Dalrymple BP: A rapid method for computationally inferring transcriptome cover- age and microarray sensitivity. Bioinformatics 2005, 21:80-89. Publish with Bio Med Central and every 22. McKusick VA: Online Mendelian Inheritance in Man, OMIM™. scientist can read your work free of charge [http://www.ncbi.nlm.nih.gov/Omim]. 23. Simpson EH: The interpretation of interaction in contingency "BioMed Central will be the most significant development for tables. J Royal Stat Soc Ser B 1951, 13:238-241. disseminating the results of biomedical researc h in our lifetime." 24. Goh K, Cusick ME, Valle D, Childs B, Vidal M, Barabási AL: The Sir Paul Nurse, Cancer Research UK human disease network. Proc Natl Acad Sci USA 2007, 104:8685-8690. Your research papers will be: 25. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin available free of charge to the entire biomedical community N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. peer reviewed and published immediately upon acceptance Genome Res 2003, 13:2498-504. cited in PubMed and archived on PubMed Central yours — you keep the copyright BioMedcentral Submit your manuscript here: http://www.biomedcentral.com/info/publishing_adv.asp Page 11 of 11 (page number not for citation purposes)
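The two-threshold rule summarized in Table 2, and the precision measure reported in Table 3, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Gene` record, the function names and the toy genes `G1` to `G6` are hypothetical, and the strict "greater than" comparison is an assumption taken from the "> 12" and "> 75" cells of Table 2.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Gene:
    """Hypothetical per-gene record; field names are illustrative only."""
    name: str
    n_connections: int           # number of reported interactions
    pct_disease_partners: float  # % of interactors that are disease-associated
    known_disease: bool          # already annotated as disease-associated


def capture(genes: List[Gene], tc: float, td: float) -> List[Gene]:
    """Genes strictly above both thresholds (assumed to mirror '> 12' and '> 75')."""
    return [g for g in genes
            if g.n_connections > tc and g.pct_disease_partners > td]


def precision_pct(captured: List[Gene]) -> Optional[float]:
    """Percentage of captured genes that are already known; None if none captured."""
    if not captured:
        return None
    n_known = sum(g.known_disease for g in captured)
    return 100.0 * n_known / len(captured)


def candidates(genes: List[Gene], tc: float = 12, td: float = 75.0) -> List[Gene]:
    """Captured genes that are not yet annotated: the guilt-by-association short list."""
    return [g for g in capture(genes, tc, td) if not g.known_disease]


# Toy data with made-up values, used only to exercise the functions.
TOY_GENES = [
    Gene("G1", 20, 80.0, True),
    Gene("G2", 15, 90.0, False),
    Gene("G3", 5, 95.0, False),    # too few connections
    Gene("G4", 30, 50.0, True),    # too few disease-associated interactors
    Gene("G5", 13, 76.0, True),
    Gene("G6", 12, 100.0, False),  # exactly at the threshold, so not captured
]

if __name__ == "__main__":
    captured = capture(TOY_GENES, tc=12, td=75.0)
    print([g.name for g in captured])
    print(precision_pct(captured))
    print([g.name for g in candidates(TOY_GENES)])
```

Under these assumptions, `capture` corresponds to the "> 12 and > 75" column of Table 2, `candidates` to its bottom right cell, and `precision_pct` to the "% Known" rows of Table 3 for a single (TC, TD) pair. Repeating the call over a grid of thresholds would give a table of the same shape.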
BioData Mining – Springer Journals
Published: Sep 19, 2008