Knowledge discovery by automated identification and ranking of implicit relationships

Jonathan D. Wren; Raffi Bekeredjian; Jelena A. Stewart; Ralph V. Shohet; Harold R. Garner

doi:10.1093/bioinformatics/btg421

Knowledge discovery by automated identification and ranking of implicit relationships

Wren, Jonathan D.; Bekeredjian, Raffi; Stewart, Jelena A.; Shohet, Ralph V.; Garner, Harold R. 2004-01-22 00:00:00 Vol. 20 no. 3 2004, pages 389–398 BIOINFORMATICS DOI: 10.1093/bioinformatics/btg421 Knowledge discovery by automated identiﬁcation and ranking of implicit relationships 1,∗ 2,3 2,3 Jonathan D. Wren , Rafﬁ Bekeredjian , Jelena A. Stewart , 2,3 2,4 Ralph V. Shohet and Harold R. Garner Advanced Center for Genome Technology, Department of Botany and Microbiology, The University of Oklahoma, 620 Parrington Oval Rm. 106, Norman, OK 73019, USA, 2 3 4 Department of Internal Medicine, Division of Cardiology and McDermott Center for Human Growth and Development, Department of Biochemistry, Center for Biomedical Inventions, The University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390, USA Received on December 11, 2002; revised on March 11, 2003; accepted on August 10, 2003 Advance Access Publication January 22, 2004 ABSTRACT small fraction of the collective scientiﬁc knowledge within Motivation: New relationships are often implicit from existing any given ﬁeld. As increasing amounts of information and information, but the amount and growth of published literat- observations are compiled from different areas of research ure limits the scope of analysis an individual can accomplish. as individual reports, they can contribute towards a greater Our goal was to develop and test a computational method to understanding in apparently unrelated areas when considered identify relationships within scientiﬁc reports, such that large collectively. For example, it has been demonstrated that the sets of relationships between unrelated items could be sought useful implications of scientiﬁc discoveries can go unnoticed out and statistically ranked for their potential relevance as a set. or unutilized because they exist only implicitly from informa- Results: We ﬁrst construct a network of tentative relation- tion scattered among different areas of research (Swanson, ships between ‘objects’ of biomedical research interest (e.g. 1986). By using software to identify words shared between art- genes, diseases, phenotypes, chemicals) by identifying their icle titles, Swanson and Smalheiser were able to identify com- co-occurrences within all electronically available MEDLINE mon intermediates between Raynaud’s Disease (a circulatory records. Relationships shared by two unrelated objects are disorder restricting blood-ﬂow to the extremities) and the diet- then ranked against a random network model to estimate ary effects of ﬁsh oil, leading to the hypothesis and subsequent the statistical signiﬁcance of any given grouping. When com- proof (DiGiacomo et al., 1989) that compounds within dietary pared against known relationships, we ﬁnd that this ranking ﬁsh oil could ameliorate the symptoms of Raynaud’s Disease correlates with both the probability and frequency of object (Swanson, 1986; Smalheiser and Swanson, 1998). The term co-occurrence, demonstrating the method is well suited to ‘non-interactive literatures’ was coined to explain why such discover novel relationships based upon existing shared rela- a reasonable hypothesis had gone unnoticed by researchers tionships. To test this, we identiﬁed compounds whose shared in either ﬁeld alone. Finding methods to utilize greater the relationships predicted they might affect the development biomedical literature in an automated manner to aid scientiﬁc and/or progression of cardiac hypertrophy. When laboratory discovery is becoming a topic of increasing interest (Yandell tests were performed in a rodent model, chlorpromazine was and Majoros, 2002). found to reduce the progression of cardiac hypertrophy. While innovative, a keyword-based method such as that Contact: [email protected] of Swanson and Smalheiser is both limiting and highly Supplementary information: http://innovation.swmed.edu/ cumbersome, especially where a signiﬁcant body of literat- IRIDESCENT/Supplemental_Info.htm ure is concerned, for several reasons: First, only titles are used; second, word phrases such as ‘Interleukin 6’ are not INTRODUCTION taken into account, being reduced to ‘Interleukin’ and ‘6’; third, synonyms (e.g. ‘IL-6’) are not considered; ﬁnally, and There is a large difference between what is known and what perhaps most importantly, the number of unique keywords we know as individuals. We are only aware of a relatively grows rapidly per record analyzed, providing an impractically large amount of output for any user to examine (additional To whom correspondence should be addressed. Bioinformatics 20(3) © Oxford University Press 2004; all rights reserved. 389 J.D.Wren et al. a b c Fig. 1. Using literature-based relationships to engage in the discovery of new knowledge. (a) Beginning with an object of interest (black node), tentative relationships are assigned to other objects (gray nodes) when they are co-mentioned within MEDLINE abstracts. (b) Each related object (gray) is then queried for its relationships with other objects (white nodes). The white nodes are not directly related to the primary node and are thus only implicitly related, through intermediates. (c) The relationships shared by white and black nodes are ranked against a random network model to establish how many would be expected by chance alone, given the connectivity of each node in the set. Suppose the entire network consists of 1000 nodes and the numbers to the right represent how many connections each of the implicit nodes have to other nodes in the network. We can then assign a statistical weighting that reﬂects how exceptional any given implied relationship is based upon the shared intermediates. In this example, a node with 950 relationships may share many relationships with the primary node, but this is nothing exceptional because it is related to most objects in the network. It is thus down-weighted in importance (dashed lines). The connections of the gray nodes must also be taken into account, but is not shown here for simplicity. discussion in online supplement). Others have designed lit- (Fig. 1a) by identifying the co-occurrence of A and B objects erature exploration systems that overcome the ﬁrst three within MEDLINE records (titles and abstracts). Each B object barriers mentioned by looking for the co-occurrences of major can then be queried to identify other objects (‘C’) that co- Medical Subject Headings (MeSH) descriptors (Hristovski occurred with them in the literature (Fig. 1b). Each of these et al., 2001) or by mapping text to UMLS concepts (Wee- new objects, C, that are not themselves A or B objects, are ber et al., 2000, 2003). The size of the domain to be analyzed related to A only implicitly. That is, they have no documented remains the most signiﬁcant problem, however, and has been relationship with A, but share one or more relationships with dealt with thus far by user intervention in the selection of A. This large list of implicitly related objects will contain intermediates for analysis. potential discoveries of new relationships. However, because The ability to seek out novel, undocumented relationships of their abundance, it is necessary to prioritize and rank these that are logically implicit from a body of information—yet objects in some manner. To do so, we describe a method to not explicitly stated within that body—has obvious scientiﬁc rank relationships shared by two objects within a literature value. It enables us to use the current state of knowledge to network against a random network model to evaluate how infer possible new relationships that have yet to be studied. statistically exceptional any given set of shared relationships Our ability to postulate a potential relationship between two is (Fig. 1c). We also show that this ranking correlates with the or more things depends principally upon how many com- probability that two objects are related as well as the strength mon relationships we are aware of between these things, if (frequency of co-occurrence) by which they are related. any, to suggest that a relationship exists where none has been documented. Awareness of relationships, especially sets of Object co-occurrences exhaustively identify relationships, is central to the human process of insight and potential relationships discovery. We attempt to identify as many relationships as possible by postulating that a potential relationship exists between General approach two objects when they are observed to co-occur within the Herein we describe a method to identify potential relationships same MEDLINE record, an approach also taken by others to within the biomedical literature by deﬁning areas of research identify potential relationships between genes (Stapley and interest such as genes, diseases, phenotypes and chemical Benoit, 2000; Jenssen et al., 2001), proteins (Blaschke et al., compounds (hereafter referred to simply as ‘objects’). Begin- 1999) and drugs (Rindﬂesch et al., 2000). Some have used ning with an object of interest (call it ‘A’), we can identify the co-occurrence of certain MeSH terms to reﬂect potential other objects (‘B’) tentatively related to it within the literature relationships (Hristovski et al., 2001; Perez-Iratxeta et al., 390 Identifying new relationships from existing ones 2002), but MeSH terms for each object within an abstract to identify novel relationships that are implicit by virtue of are not always provided, and in a number of cases (e.g. shared relationships. gene names) are used to reﬂect the existence of a more gen- For an implicit relationship—two objects related only eral category as opposed to a speciﬁc entity. Here, we are through shared intermediates—it is not yet clear what stat- interested in associations between any areas of active bio- istical parameters best correlate with the probability of it medical research interest. We assemble the primary names representing a biologically meaningful relationship. However, and synonyms for genes, diseases, phenotypes and chem- we can assume that the probability of an implicit relation- ical/pharmaceutical compounds into a composite database so ship (A ↔ C) being biologically meaningful would not be that the names can be recognized as they occur within text. greater than the least probable of the two individual (A ↔ B These objects were chosen because they are of broad interest or B ↔ C) relationships linking them, where the symbol ↔ in biology and medicine, but the approach we present allows is deﬁned as the existence of a non-directional relationship for any ‘class’ of object to be incorporated that is considered between two objects. This is equivalent to stating that the of research interest (e.g. tissue types, protein motifs, cell lines, strength of a chain is no greater than its weakest link. etc.). It should be noted, however, that addition of a new object class presumes that co-citations with other object classes could SYSTEMS AND METHODS be construed as meaningful. For example, country names Code was written in Visual Basic 6.0 (SP5) using ODBC could be added as an object class, but it is doubtful that extensions to interface with a Microsoft Access 2000 data- any co-occurrences with other objects would be considered base, with database queries written in SQL. Analyses were run biologically meaningful or interesting. on a Pentium 4–2.4 GHz desktop PC running Windows 2000. Database entries were obtained from the following sources, all Fuzzy logic is used to weight importance of downloaded between December 13 and December 25, 2001: co-occurrence The disadvantage of using co-occurrence is that it does not Database Location always reﬂect the existence of a biologically meaningful rela- tionship. To reﬂect this ambiguity, we borrow from Fuzzy Set OMIM ftp://ftp.ncbi.nlm.nih.gov/repository/ Theory and model relationships as probabilistic, that is, ran- OMIM/omim.txt.Z ging from 0 to 1, rather than binary values [for an overview GDB http://gdbwww.gdb.org/gdb/ of Fuzzy Set Theory see Steimann (1997) and for a thorough advancedSearch.html discussion see Klir and Yuan (1995)]. By manually survey- HGNC http://www.gene.ucl.ac.uk/public- ing each object co-mentioned within a sample of MEDLINE ﬁles/nomen/nomeids.txt records, we can estimate the probability that a co-mention LocusLink ftp://ftp.ncbi.nih.gov/refseq/ reﬂects the presence of a non-trivial relationship between the LocusLink/LL.out_hs.gz (Human) two objects. This base probability can then be used to assign MeSH http://www.nlm.nih.gov/cgi/ a fuzzy score to each relationship, reﬂecting the probability request.meshdata (MeSH Trees ﬁle) that one or more co-occurrences are meaningful. Since terms MEDLINE National Library of Medicine that co-occur more frequently are more likely to represent http://www.nlm.nih.gov biologically meaningful relationships (Jenssen et al., 2001), Genome Ontology http://www.geneontology.org each relationship is assigned a score based on the frequency and type (i.e. abstract or sentence) of co-mentions observed and their corresponding error rates (discussed in the section Values used to ascertain relatedness to one of the two on ‘Implementation’). fuzzy sets (i.e. belonging to the category ‘related’ or ‘not By deﬁning what objects will be recognized rather than related’) were based upon the probability that a co-occurrence using all words reduces the magnitude of analysis and allows of objects equated to a non-trivial relationship between the a focus on relationships with a higher potential of being con- two (see online supplement Fig. S2). Thus, the value of sidered ‘interesting’. Diseases and clinical phenotypes were relatedness can range from 0 to 1 and is estimated by the obtained from Online Mendelian Inheritance in Man (OMIM) number of times (n) that the two objects were co-mentioned (Hamosh et al., 2000); chemical compounds and drugs from and the error rate (r) associated with the co-mention metric the MeSH database (Lowe and Barnett, 1994); and genes from (i.e. sentence or abstract) used to establish the relationship. Locuslink (Maglott et al., 2000) and the Human Gene Nomen- This formula, P(related) = 1 − r , is used to calculate clature Committee (HGNC) (Povey et al., 2001). As tentative the relatedness of two objects and is referred to as a veracity relationships between these objects are identiﬁed within text, score. The veracity score can range in value from 0.58 (two they are entered into a database. This database enables the con- objects only co-mentioned once within one abstract) to 1.0. struction of a network of relationships, which can be queried Rather than summing the raw number of co-mentions shared to identify relationships shared among a set of objects and by two objects, summing their veracity scores permits a more 391 J.D.Wren et al. accurate estimate of how many relationships are truly shared randomly connected in a network with a total of N nodes, based upon the known error rate. the probability that a node will be connected to A is given as We deﬁne the ‘strength’ of a relationship as a function of K /N where K is the total number of connections to A. The A t A the number of times two objects have been co-mentioned probability B will be connected to A [written as P(B ∈ A)]is and the probability that each co-mention represents a non- K /N and the probability A will be connected to B [written A t trivial relationship. The term ‘strength’ is used rather than as P(A ∈ B)], is K /N . Because the formula P(A ∈ B) or B t frequency because we record both sentence co-mentions (C ) P(B ∈ A) is more easily represented in mathematical terms and abstract co-mentions (C ), and need a convenient way as the probability B is not related to A and vice versa, written to combine the two. The strength score (S) is assigned based as NOT [P(A ∈/ B) AND P(B ∈/ A)], we can deﬁne the upon the individual co-mention error rates, r (17% FP) and probability in mathematical terms as: r (42% FP) respectively, by the formula S = C ∗ (1 − r ) + a s s K K A B C ∗ (1 − r ). a a P(A ↔ B) = 1 − 1 − ∗ 1 − (1) N N t t Materials and methods for the Chlorpromazine–Cardiac Hypertrophy experiments are contained in the online Intuitively, we expect that if K = N or K = N then A t B t supplement. P(A ↔ B) = 1, since the number of connections to one node does not matter if the other node is connected to all nodes. This formula applies for all non-zero values of K and ALGORITHM K . Random network simulations were conducted to con- In this section, we adopt terminology from graph theory and ﬁrm the validity of this formula (data not shown). Summing refer to objects as ‘nodes’ and relationships (co-citations) as the probability of each individual relationship, we can extend ‘connections’, also known as the ‘edges’ between nodes. We this formula to estimate the expected number of connections also deﬁne an implicitly related node (C) as one that has no a set of nodes, B , would share with another object, A, by the direct connection to the query node (A), yet is connected to equation: one or more intermediate nodes (B) that are simultaneously connected to A. To evaluate the potential signiﬁcance of an K K A B Expect(A ↔ B ) = 1 − 1 − ∗ 1 − (2) implicitly related node, we compare the set of i nodes (B ) N N t t i=1 shared by both the query node A and the implicit node C, against a random network model. Given that we are interested Equation (2) is used to estimate the expected number of in an node A, and know from processing all literature associ- shared relationships between B and C, given the connectivity ated with A that it is related to all nodes in the set B , we ask of each intermediate (shared) node in the set B that A is known the question ‘Given the number of connections each node in to be connected to. the set B has, and the number of connections the target node (C) has, how many connections might we expect between B IMPLEMENTATION and C by chance alone?’ For example, if C were related to Precision and recall rates are estimated every node in a 1000 node network and A had 100 connec- tions within this network, all of which were shared with C, First, we estimated the precision of using co-occurrence as this would be expected and therefore unexceptional. Thus, a method of identifying the existence of a non-trivial rela- dividing the number of observed connections (Obs) between tionship between two objects by manually evaluating the B and C by the number of connections we would expect by co-occurring objects within a random set of 25 MEDLINE chance (Exp) provides us with a value reﬂecting the statist- records (titles and abstracts). We found that two objects ical signiﬁcance of the shared connections. This value allows co-mentioned within the same sentence were more likely us to estimate the potential relevance of a set of connections. to be related (83%) than objects co-mentioned in the same For example, if a set of connections linking a disease (A) abstract (58%). Using sentence co-mentions alone, how- to a chemical (C) were to encompass highly common nodes ever, would miss 43% of the non-trivial relationships within such as ‘sodium’ and ‘symptom’, we recognize that—whether an abstract. This proportion of correct relationships among true or not—these types of connections are sufﬁciently vague abstract co-mentions is similar to the estimates others have to be of little use to a scientist in postulating how A and C obtained (Jenssen et al., 2001; Ding et al., 2002). Addition- might have an interesting and speciﬁc connection through ally, because judgment of what constitutes a ‘relationship’ and these intermediates. If the shared connections involve spe- what is ‘non-trivial’ is somewhat subjective, we attempted to ciﬁc transporters or genes, which would not be as frequently estimate this error rate in a more objective way by identifying mentioned in the literature, it becomes easier to postulate how objects co-mentioned in the ﬁrst half of MEDLINE (records speciﬁc actions of (C) could produce (A). up until approximately Nov. 1991), but not in the second We derived an expectation value based upon the relative half. The rationale for this approach comes from the obser- connectivity of each node involved. Assuming nodes are vation that related objects (e.g. insulin–glucose) tend to be 392 Identifying new relationships from existing ones composite object database, adding 3094 acronyms to data- base entries that did not have an acronym speciﬁed. ARGH also identiﬁed 4786 spelling/hyphenation variants observed within MEDLINE for objects within the composite database. It is difﬁcult to assess what impact ARGH has upon the pre- cision or recall when processing records, as the reduction in the false-negative (FN) rate depends upon how common the variant or acronym is and reduction in the false-positive rate depends upon the acronym. For example, the gene calcitonin is associated with the acronym CT, which has a different deﬁnition within MEDLINE 96% of the time (Computed Tomography). Gene names like SOCS-3 are unambiguous and unaffected by the use of ARGH to resolve acronyms, but less than half of the deﬁnitions of SOCS-3 within MEDLINE would be recognized without the spelling vari- Fig. 2. Analysis of the ﬁrst 10 000 co-cited objects found within the ants provided by ARGH (the ARGH database can be queried 1st but not 2nd half of MEDLINE, grouped by the total number of at http://lethargy.swmed.edu/argh/argh.asp). co-citations identiﬁed within MEDLINE for the two objects. The fact that these co-citations are non-recurring suggests that the co-citation Estimated recall rate of using abstracts versus did not reﬂect the existence of a relationship studied between the full-text articles two. As shown here, this distribution is signiﬁcantly different than Abstracts presumably contain the most important ﬁndings the overall distribution in the co-citation frequency of the 63 836 of a report and important ﬁndings are usually reiterated in records analyzed. future abstracts, but it could be argued that some relation- ships might not be found within abstracts. To estimate this and obtain a recall rate, we calculated the total number of rela- co-mentioned over the course of many studies after their ﬁrst tionships within a domain of knowledge (MEDLINE articles) co-mention. We reasoned that if two objects are co-mentioned that are contained in their electronically accessible summary early (establishing their co-existence in the literature), but not form (MEDLINE titles and abstracts). A set of objects men- again after an equal number of publications, there are several tioned within review articles was manually compiled and possibilities: the co-mention was the result of two unrelated compared to the relationships found within MEDLINE titles topics being discussed together (e.g. incidental, broad topical coverage), the objects were once studied for a relationship but and abstracts. The same list was compared to the object none was found or it was in error, or a relationship was estab- database to estimate what percent of object types mentioned lished but was not of sufﬁcient interest to warrant further study. in MEDLINE were represented in the databases used. Four Regardless of the exact reason, these represent non-persistent objects were randomly chosen from the collective object data- ‘relationships’, and are suggestive of a class of co-mentions base, representing one of each object type, with the stipulation that at least two review articles had been written about the (erroneous or uninteresting) that we wish to exclude. We object within the past three years. A set of review articles examined the ﬁrst 10 000 non-persistent co-citations found was selected for CTLA-4 (gene) (McCoy and Le Gros, 1999; within MEDLINE and found a similar distribution as would Green, 2000; Tomer, 2001), Fragile-X Syndrome (disease) be predicted by the error rate formula (Fig. 2), although by (Bardoni et al., 2000; Jin and Warren, 2000; Kooy et al., 2000), this method the predicted error rate would be slightly higher. cachexia (clinical phenotype) (Barber, 2001; Hasselgren and This helps to conﬁrm the accuracy of the estimates and to Fischer, 2001; Tisdale, 2001) and dynorphin (chemical com- justify the use of a power-law decay function to represent the pound) (Steiner and Gerfen, 1998; Caudle and Mannes, 2000). probability of error. Only objects of the same types (i.e. other genes, diseases, Resolving ambiguous acronyms phenotypes and chemicals) were counted. Acronyms were resolved as they occurred within text using There were a total of 40 objects mentioned in the literat- an Acronym Resolving General Heuristic (ARGH) to reduce ure but not found in the database (2 diseases, 9 phenotypes, both random and systematic errors in term recognition, which 7 genes and 22 chemical compounds). The 2 disease names operates with ∼ 96% precision and 92% recall (Wren and (Graves’ Opthalamopathy and Relapsing-remitting Experi- Garner, 2002). A total of 4309 acronyms were ﬂagged as mental Autoimmune Encephalomyelitis) and 9 phenotypes ambiguous (i.e. one deﬁnition must be >95% of all identi- were not mentioned in OMIM. Three of these phenotypes, ﬁed deﬁnitions to be considered unambiguous) and requiring however, were simply the result of a semantic differ- resolution. The ARGH database of MEDLINE acronyms was ence between the OMIM entry and the article (‘rocking’ also used to expand the acronym list for entries within the versus ‘body-rocking’, ‘greater interocular distance’ versus 393 J.D.Wren et al. ‘increased interocular distance’, ‘fetal akinesia’ versus ‘akin- esia’). The most problematic category was ‘small molecules’, for which many chemicals and drugs widely mentioned in the literature (e.g. DAMGO, DADLE, isoprenaline) were simply not found in the MeSH trees database. In this sample, there were 181 objects found within the review articles, 141 of which were also in the composite data- base (78%). From the 40 objects mentioned in the reviews but not found in the database, 2 were diseases, 9 phenotypes, 7 genes and 22 chemical compounds. From these 141 data- base objects mentioned within the full-text of the reviews, 138 of them (98%) could be found within the body of a MED- LINE title or abstract, suggesting that most objects pertinent to a review can also be found within an abstract or title. Seman- tically, 124 of these 138 objects were spelled in the literature the same way they were found in the database, giving a recall rate of 90% (124/138) in terms of identifying the conceptual occurrence of database objects within textual input, and 69% (124/181) in terms of identifying relevant relationships within its domain of knowledge (MEDLINE). Some of the FN failures to identify objects within text were systematic (e.g. the MeSH entry 5,8,11,14,17- Eicosapentaenoic Acid is almost always referred to in MED- LINE simply as eicosapentaenoic acid) while other failures varied in their rates (e.g. JNK was found to be spelled 81 dif- ferent ways including ‘c-Jun N-terminal kinase’ 605 times, ‘c-Jun NH2-terminal kinase’ 154 times and ‘c-Jun amino- Fig. 3. The distribution in the number of relationships per object terminal kinase’ 62 times). follows a scale-free power law distribution. (a) A relatively small fraction of the objects in the database are directly related to a large Creating a network of relationships using percentage of the total, contributing to a rapid explosion in the MEDLINE number of implicitly related objects. (b) As the number of direct A total of 12 037 763 MEDLINE records recorded from relationships increases, the number of implicit relationships rapidly 1967 to January 2002 were processed to create a network approaches the theoretical maximum, which is the total number of of 3 482 204 unique relationships between objects. Approx- nodes in the network, and then decreases linearly with the number imately two-third of the objects in the database found exact of possible implicit relationships. literal matches within the literature, identifying at least one ‘relatedness’ of two objects solely by examining the relation- relationship for 22 482 of the 33 539 unique objects (85 234 ships they share. To establish this, an Obs/Exp was calculated total terms when including synonyms) within the database. for all relationship sets (of at least 100 objects) shared by As expected, we ﬁnd a highly disproportionate distribution a central query object and any other object in the database, in the number of relationships per object (Fig. 3a), indicating regardless of whether a direct relationship was known or not. the network is scale-free in nature. As such, this connectivity The Obs/Exp scores were sorted from highest to lowest on contributes to a rapid explosion in the number of implicitly the x-axis, and the strength of the relationship, if known, related objects as the number of direct relationships increases was plotted on the y-axis. If the strength was not known (Fig. 3b). Thus, identifying implicitly related objects becomes (i.e. it was an implicit relationship), then no bar was plotted. secondary to being able to rank their potential signiﬁcance. For the object ‘Cardiac Hypertrophy’, we see that the higher Furthermore, this also shows that the search for implicit rela- tionships more than one level removed (i.e. A ↔ B ↔ C ↔ the Obs/Exp ratio, the more likely the relationship is known D) would likely be fruitless in the absence of any further con- (Fig. 4). Furthermore, we note that the higher the Obs/Exp straints, since all non-circular connections from A are reached ratio, the stronger the relationship tends to be (i.e. the more relatively rapidly as the domain grows to a modest size. frequently they are co-cited). To conﬁrm that the trend observed in Figure 4 is not speciﬁc Ranking all implicit relationships to the analysis of cardiac hypertrophy, but rather is a general We evaluated whether this observed to expected ratio trend, we randomly picked 100 objects from the database that (Obs/Exp) we are calculating could be used to estimate the had between 500 and 1000 relationships within the network 394 Identifying new relationships from existing ones Fig. 4. Objects were ranked for their implicit ‘relatedness’ to car- diac hypertrophy solely on the basis of the relationships they shared Fig. 5. The observed to expected ratio obtained from identifying (Obs/Exp). If a relationship in MEDLINE has been established, its and ranking shared relationships correlates with the existence and strength (based upon frequency of co-occurrence within MEDLINE) known strength of a relationship. This enables novel (implicit) rela- is plotted on the y-axis, otherwise it will appear as a gap (meaning tionships to be correlated with the probability they are relevant (as no relationship has been established). Shown is a subset of 4887 judged by existing relationships) and important (as judged by the objects sharing at least 100 relationships with cardiac hypertrophy, strength/frequency of historical reporting). sorted by their calculated observed to expected ratio. Due to x-axis compression, not all gaps will be visible on this graph. is induced in response to environmental stimuli, such as arter- (this range was chosen simply to ensure that the approximate ial hypertension, increased cardiac work or hormonal stimuli. scale of analysis for each object was similar). Implicit rela- It is an intensively studied condition as evidenced by the tionships were identiﬁed for these objects and ranked by their 4092 articles in MEDLINE containing the key phrase ‘cardiac Obs/Exp values. The top 1000 Obs/Exp scores were taken hypertrophy’. A total of 2102 unique objects were co- for each analysis and ranked from 1 (highest Obs/Exp) to mentioned within all articles mentioning cardiac hypertrophy 1000 (lowest), and a normalized strength score calculated and 19 718 unique objects were implicitly related to cardiac for each object analyzed, ranging from 1.0 (strongest dir- hypertrophy through a total of 1 842 599 different paths. ect relationship observed) to 0.0 (no relationship observed). Examining the shared relationships for the implicitly related Figure 5 shows this average strength plotted against the objects in Table 1, we excluded ‘endotoxins’ from further Obs/Exp rankings, indicating that this is a general trend. study in part because this refers to a class of immuno- In some ways, the correlation of exceptional groupings with inductive compounds rather than a speciﬁc compound, and known relationships is not too surprising, as we would expect in part because endotoxins are known to have substantial that two objects with very similar purposes, functions, or effects on cardiac function that would complicate interpreta- involvement in a biological process should interact with and/or tion of hypertrophy. Morphine was excluded from the study be studied with many of the same objects. This does, however, because it would have substantial effects on the behavior of establish that the relatedness of two objects can be correl- mice (including somnolence and reduced feeding) that would ated with the statistical exceptionality of the relationships limit the dose used for study. Chlorpromazine (CPZ) was they share. More importantly, this provides us with a means deemed more suitable for further study, in part because it to evaluate quantitatively implicit relationships by demon- is a commonly used drug and an unrecognized effect on car- strating that the Obs/Exp score correlates positively with diac hypertrophy could have clinical importance. A list of established relationships. This numeric evaluation enables us shared relationships between cardiac hypertrophy and CPZ is to identify new relationships, not found within MEDLINE available in the online web supplement. records, that are more likely to be logically plausible and relev- Chlorpromazine (CPZ) is an aliphatic phenothiazine com- ant to the query object because of the relationships they share. pound used principally as an anti-psychotic and anti-emetic drug (Shen, 1999). It has a number of physiological effects Wet-lab testing of in-silico predictions Cardiac hypertrophy is deﬁned as an increase in the size of myocytes that is associated with detrimental effects on Statistics are as of June 23, 2003, although this analysis was initially aspects of contractile and electrical function in the heart. It conducted in January 2002 when there were fewer articles than this. 395 J.D.Wren et al. Table 1. Chemical compounds within the composite database implicitly related to cardiac hypertrophy Rank Implicit relationship Shared Quality Expected Obs/Exp rels estimate 1 Endotoxin 1301 1025 307 3.34 2 Morphine 1217 939 283 3.32 3 Chlorpromazine 1089 824 252 3.28 4 Globulin 1130 850 265 3.20 5 Cisplatin 1129 862 274 3.14 6 Neomycin 1105 842 272 3.10 7 Polyethylene glycol 1153 863 279 3.09 8 Phytohemagglutinin 1099 807 266 3.03 9 Methotrexate 1190 897 308 2.91 10 Casein 1165 895 308 2.91 11 Isoleucine 1142 852 293 2.91 12 Galactose 1104 826 284 2.91 13 Progesterone 1448 1132 392 2.89 Fig. 6. Chlorpromazine protects against the development of cardiac 14 Esterase 1197 908 317 2.86 hypertrophy. Several parameters of ventricular hypertrophy were 15 Tetracycline 1066 800 283 2.83 determined by echocardiography. One group of mice received iso- 16 Acetone 1075 804 285 2.82 proterenol only (ISO, n = 10) and the other received both isoproter- 17 Concanavalin A 1317 1002 355 2.82 enol and chlorpromazine (CPZ + ISO, n = 8). Symbols represent 18 Polysaccharide 1092 829 295 2.81 individual mice, brackets denote mean (center triangle) and standard 19 Bromide 1368 1048 381 2.75 deviation for group. LVW = left ventricle weight (CPZ + ISO 11 ± 20 Methanol 1221 930 354 2.63 27%, ISO 51 ± 43%,P< 0.02), LVMI = left ventricular mass index The 20 objects with the most implicit connections (shared rels) were extracted and sorted (CPZ + ISO 11 ± 28%, ISO 50 ± 52%,P< 0.04), PWT = posterior by observed to expected ratio, which is calculated using the probability each direct wall thickness (CPZ + ISO 16 ± 16%, ISO 36 ± 27%,P< 0.05), relationship comprising the implicit relationship is valid (quality estimate). These are IVSWT = intraventricular septum wall thickness (CPZ + ISO 19 ± compounds that should not have any documented relationship with cardiac hypertrophy 18%, ISO 31 ± 20%,P< 0.12). within the MEDLINE titles and abstracts analyzed yet, at the same time, share many relationships with it. again before the mice were sacriﬁced to allow estimation of and molecular targets that suggest it might provide an anti- their left ventricular weight (LVW), left ventricular mass index hypertrophic effect in the heart, one of which is its alpha- (LVMI), posterior wall thickness (PWT) and intraventricular adrenergic blocking activity (Morgan and Van Maanen, 1980). septum wall thickness (IVSWT). In all four of the parameters Hypertrophy can be induced through over-stimulation of measured, we found that the amount of cardiac hypertrophy alpha-adrenergic receptors by agonists and this effect can be was signiﬁcantly reduced in the isoproterenol (ISO) plus CPZ blocked by alpha-adrenergic antagonists (Colucci, 1982). It treated mice (n = 8) in comparison to the control group given has recently been recognized that the calmodulin-dependent only ISO (n = 10), as evaluated by 1-tailed Student’s t -test phosphatase calcineurin plays an important role in some with unequal variance (Fig. 6). forms of hypertrophy (Molkentin et al., 1998). CPZ has been reported to interact with calmodulin (Marshak et al., 1981) as DISCUSSION an antagonist, suggesting a potential role beyond that of an alpha-adrenergic receptor. Despite the potential mechanistic A relationship between CPZ and cardiac hypertrophy has not connections between cardiac hypertrophy and CPZ, there is been previously suggested in the literature. The application of no indication within MEDLINE that any relationship between implicit relationship analysis was required for generating the the two has been suggested. underlying hypothesis of this study. It was previously known that CPZ had modest alpha-blocking activity, but the ﬁnd- In-silico predicted effect conﬁrmed ing that it interferes with ISO-induced hypertrophy, a pure in rodent model beta-adrenergic effect, is surprising and provocative. Possible We looked for an association between CPZ and cardiac hyper- mechanisms include: (1) a previously unsuspected activity of trophy in a rodent model. Two groups of mice were given CPZ on beta receptors, either directly or through cross-talk 20 mg/kg/day isoproterenol by osmotic minipump, with one between different classes of receptors; (2) an effect of CPZ group additionally receiving 10 mg/kg/day CPZ. This dose on downstream signaling from the beta receptor in the car- of CPZ did not perceptibly alter feeding behavior or physical diac cells; or (3) a ‘pseudo-hypertrophic’, non-cellular, effect activity. Echocardiograms were obtained before treatment and related to increased myocardial edema or matrix deposition. 396 Identifying new relationships from existing ones Such an investigation could have clinical implications. If this PGA grant 5U01HL6688002, NIH/NHLBI grant P50 drug exerts a similar effect against common precipitants of CA70907, a Hudson Foundation grant and a National AHA hypertrophy, it could provide a clue to molecular structures Grant-in-Aid (R.V.S.). that should be explored for therapeutic beneﬁt. Moreover, many tens of thousands of patients already receive CPZ, and it may be contributing to cardiac pathology, or protection, in a REFERENCES previously unsuspected fashion. Conﬁrmation and evaluation Barber,M.D. (2001) Cancer cachexia and its treatment with ﬁsh-oil- of the mechanism of CPZ in cardiac hypertrophy will require enriched nutritional supplementation. Nutrition, 17, 751–755. further work beyond these preliminary investigations. Bardoni,B., Mandel,J.L. and Fisch,G.S. (2000) FMR1 gene and Overall, we have demonstrated that an analysis of shared fragile X syndrome. Am. J. Med. Genet., 97, 153–163. Blaschke,C., Andrade,M.A., Ouzounis,C. and Valencia,A. (1999) relationships scored against a random network model has Automatic extraction of biological information from scientiﬁc the potential to elucidate novel and interesting relationships text: protein–protein interactions. ISMB, 7, 60–67. not documented within MEDLINE, but rather based upon Caudle,R.M. and Mannes,A.J. (2000) Dynorphin: friend or foe? information contained therein. Automating the relationship Pain, 87, 235–239. identiﬁcation process enables us to bypass the monumental Colucci,W.S. (1982) Alpha-adrenergic receptor blockade with time and effort that would be required to record manually prazosin. Consideration of hypertension, heart failure, and poten- every relationship within MEDLINE’s 12 million abstracts, tial new applications. Ann. Int. Med., 97, 67–77. and using an object-based model reduces the need to ascer- DiGiacomo,R.A., Kremer,J.M. and Shah,D.M. (1989) Fish-oil diet- tain which relationships are of interest, since the objects ary supplementation in patients with Raynaud’s phenomenon: within the database are presumably those a user would be a double-blind, controlled, prospective study. Am. J. Med., 86, interested in. However, there are shortcomings in the use 158–164. Ding,J., Berleant,D., Nettleton,D. and Wurtele,E. (2002) Mining of this method: ﬁrst, there is the problem of ‘uninteresting’ Medline: abstracts, sentences or phrases? Pac. Symp. Biocomput., relationships. To some extent, this will be user-dependant. Kauau, Hawaii, 7, 326–337. Objects that share many relationships may indeed have a Green,J.M. (2000) The B7/CD28/CTLA4 T-cell activation pathway. relationship themselves, but the nature of their relationship Implications for inﬂammatory lung disease. Am. J. Respir. Cell may be such that it would not be considered interesting or Mol. Biol., 22, 261–264. worth investigating. Second, ascertaining the nature of the Hamosh,A., Scott,A.F., Amberger,J., Valle,D. and McKusick,V.A. implied relationship by examining the shared relationships (2000) Online Mendelian Inheritance in Man (OMIM). Hum. is time-consuming. Methods of providing a summary ana- Mutat., 15, 57–61. lysis or better evaluating which of the shared relationships Hasselgren,P.O. and Fischer,J.E. (2001) Muscle cachexia: current are potentially interesting would be highly desirable. Third, concepts of intracellular mechanisms and molecular regulation. comparison to a random network model relies upon the ana- Ann. Surg., 233, 9–17. Hristovski,D., Stare,J., Peterlin,B. and Dzeroski,S. (2001) Support- lyzed text to be focused (non-random) in nature. To the extent ing discovery in medicine by association rule mining in Medline that writing is random or functionally detached within an and UMLS. Medinfo, 10, 1344–1348. analyzed textual unit, trivial connections will be made and Jenssen,T.K., Laegreid,A., Komorowski,J. and Hovig,E. (2001) A lit- fewer groupings will stand out statistically. Finally, work erature network of human genes for high-throughput analysis of still needs to be done on better establishing relationships, gene expression. Nat. Genet., 28, 21–28. beyond what we have done here in scoring the probabil- Jin,P. and Warren,S.T. (2000) Understanding the molecular basis of ity that a co-occurrence was meaningful. For example, the fragile X syndrome. Hum. Mol. Genet., 9, 901–908. method asserts that a relationship is known when two objects Klir,G. and Yuan,B. (1995) Fuzzy Sets and Fuzzy Logic: Theory and have been mentioned together within the same abstract, Applications. Prentice Hall. and unknown if they have not. While this may be a good Kooy,R.F., Willemsen,R. and Oostra,B.A. (2000) Fragile X syn- generalization, two objects may have been co-mentioned sev- drome at the turn of the century. Mol. Med. Today, 6, 193–198. Lowe,H.J. and Barnett,G.O. (1994) Understanding and using the eral times and yet the overall nature of their relationship, medical subject headings (MeSH) vocabulary to perform literature or certain aspects thereof, remains unknown. Nonetheless, searches. JAMA, 271, 1103–1108. we believe this method will prove to be of utility in a Maglott,D.R., Katz,K.S., Sicotte,H. and Pruitt,K.D. (2000) NCBI’s ﬁeld where the amount of information continues to increase LocusLink and RefSeq. Nucleic Acids Res., 28, 126–128. exponentially. Marshak,D.R., Watterson,D.M. and Van Eldik,L.J. (1981) Calcium- dependent interaction of S100b, troponin C, and calmodulin with an immobilized phenothiazine. Proc. Natl Acad. Sci., USA, 78, ACKNOWLEDGEMENTS 6793–6797. Thanks to Irene Rombel for a helpful review of the manuscript. McCoy,K.D. and Le Gros,G. (1999) The role of CTLA-4 in the This work was funded in part by NSF-EPSCoR grant no. EPS- regulation of T cell immune responses. Immunol. Cell Biol., 77, 0132534 (J.D.W.), NIH/NCI grant CA81656, NIH/NHLBI 1–10. 397 J.D.Wren et al. Molkentin,J.D., Lu,J.R., Antos,C.L., Markham,B., Richardson,J., Steimann,F. (1997) Fuzzy set theory in medicine. Artif. Intell. Med., Robbins,J., Grant,S.R. and Olson,E.N. (1998) A calcineurin- 11, 1–7. dependent transcriptional pathway for cardiac hypertrophy. Cell, Steiner,H. and Gerfen,C.R. (1998) Role of dynorphin and enkephalin 93, 215–228. in the regulation of striatal output pathways and behavior. Exp. Morgan,J.P. and Van Maanen,E.F. (1980) The role of differen- Brain Res., 123, 60–76. tial blockade of alpha-adrenergic agonists in chlorpromazine- Swanson,D.R. (1986) Fish oil, Raynaud’s syndrome, and undis- induced hypotension. Arch. Int. Pharmacodyn. Ther., 247, covered public knowledge. Perspect. Biol. Med., 30, 7–18. 135–144. Tisdale,M.J. (2001) Cancer anorexia and cachexia. Nutrition, 17, Perez-Iratxeta,C., Bork,P. and Andrade,M.A. (2002) Association of 438–442. genes to genetically inherited diseases using data mining. Nat. Tomer,Y. (2001) Unraveling the genetic susceptibility to autoimmune Genet., 31, 316–319. thyroid diseases: CTLA-4 takes the stage. Thyroid, 11, 167–169. Povey,S., Lovering,R., Bruford,E., Wright,M., Lush,M. and Wain,H. Weeber,M., Klein,H., Aronson,A.R., Mork,J.G., de Jong-van den (2001) The HUGO Gene Nomenclature Committee (HGNC). Berg,L.T. and Vos,R. (2000) Text-based discovery in biomedicine: Hum. Genet., 109, 678–680. the architecture of the DAD-system. Proceedings of AMIA Annual Rindﬂesch,T.C., Tanabe,L., Weinstein,J.N. and Hunter,L. (2000) Fall Symposium, Los Angeles, CA, 903–907. EDGAR: extraction of drugs, genes and relations from the Weeber,M., Vos,R., Klein,H., De Jong-Van Den Berg,L.T., Aron- biomedical literature. Pac. Symp. Biocomput., 517–528. son,A.R. and Molema,G. (2003) Generating hypotheses by dis- Shen,W.W. (1999) A history of antipsychotic drug development. covering implicit associations in the literature: a case report of a Comp. Psychiatry 40, 407–414. search for new potential therapeutic uses for thalidomide. J. Am. Smalheiser,N.R. and Swanson,D.R. (1998) Using ARROWSMITH: Med. Inform. Assoc., 10, 252–259. a computer-assisted approach to formulating and assess- Wren,J.D. and Garner,H.R. (2002) Heuristics for identiﬁcation of ing scientiﬁc hypotheses. Comp. Meth. Prog. Biomed., 57, acronym-deﬁnition patterns within text: towards an automated 149–153. construction of comprehensive acronym-deﬁnition dictionaries. Stapley,B.J. and Benoit,G. (2000) Biobibliometrics: information Meth. Inform. Med., 41, 426–434. retrieval and visualization from co-occurrences of gene names Yandell,M.D. and Majoros,W.H. (2002) Genomics and natural in Medline abstracts. Pac. Symp. Biocomput., 5, 529–540. language processing. Nat. Rev. Genet., 3, 601–610. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press http://www.deepdyve.com/lp/oxford-university-press/knowledge-discovery-by-automated-identification-and-ranking-of-s0s2y0QKz0

Loading next page...

References (40)

H. Steiner, C. Gerfen (1998)
Role of dynorphin and enkephalin in the regulation of striatal output pathways and behavior
Experimental Brain Research, 123
D. Marshak, D. Watterson, L. Eldik (1981)
Calcium-dependent interaction of S100b, troponin C, and calmodulin with an immobilized phenothiazine.
Proceedings of the National Academy of Sciences of the United States of America, 78 11
M. Weeber, R. Vos, H. Klein, L. Berg, A. Aronson, G. Molema (2003)
Case Report: Generating Hypotheses by Discovering Implicit Associations in the Literature: A Case Report of a Search for New Potential Therapeutic Uses for Thalidomide
Journal of the American Medical Informatics Association : JAMIA, 10 3
C. Perez-Iratxeta, P. Bork, M. Andrade (2002)
Association of genes to genetically inherited diseases using data mining
Nature Genetics, 31
D. Hristovski, S. Džeroski, B. Peterlin, A. Rozic-Hristovski (2000)
Supporting Discovery in Medicine by Association Rule Mining of Bibliographic Databases
M. Tisdale (2001)
Cancer anorexia and cachexia.
Nutrition, 17 5
P. Hasselgren, J. Fischer (2001)
Muscle Cachexia: Current Concepts of Intracellular Mechanisms and Molecular Regulation
Annals of Surgery, 233
Jonathan Wren, Harold Garner (2002)
Heuristics for Identification of Acronym-Definition Patterns within Text: Towards an Automated Construction of Comprehensive Acronym-Definition Dictionaries
Methods of Information in Medicine, 41
Sangeeta Amladi (2003)
Online Mendelian Inheritance in Man 'OMIM'.
Indian journal of dermatology, venereology and leprology, 69 6
Jonathan Green (2000)
The B7/CD28/CTLA4 T-cell activation pathway. Implications for inflammatory lung disease.
American journal of respiratory cell and molecular biology, 22 3
G. Klir, B. Yuan (1995)
Fuzzy sets and fuzzy logic - theory and applications
M. Barber (2001)
Cancer cachexia and its treatment with fish-oil-enriched nutritional supplementation.
Nutrition, 17 9
Y. Tomer (2001)
Unraveling the genetic susceptibility to autoimmune thyroid diseases: CTLA-4 takes the stage.
Thyroid : official journal of the American Thyroid Association, 11 2
W. Shen (1999)
A history of antipsychotic drug development.
Comprehensive psychiatry, 40 6
Kathy McCoy, G. Gros (1999)
The role of CTLA‐4 in the regulation of T cell immune responses
Immunology and Cell Biology, 77
Donna Maglott, Kenneth Katz, Hugues Sicotte, Kim Pruitt (2000)
NCBI's LocusLink and RefSeq
Nucleic acids research, 28 1
J. Morgan, E. Maanen (1980)
the role of differential blockade of alpha-adrenergic agonists in chlorpromazine-induced hypotension.
Archives internationales de pharmacodynamie et de therapie, 247 1
B. Stapley, G. Benoît (1999)
Biobibliometrics: information retrieval and visualization from co-occurrences of gene names in Medline abstracts.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
J. Ding, D. Berleant, D. Nettleton, E. Wurtele (2001)
Mining MEDLINE: Abstracts, Sentences, or Phrases?
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
Ralph Digiacomo, J. Kremer, D. Shah (1989)
Fish-oil dietary supplementation in patients with Raynaud's phenomenon: a double-blind, controlled, prospective study.
The American journal of medicine, 86 2
R. Kooy, R. Willemsen, B. Oostra (2000)
Fragile X syndrome at the turn of the century.
Molecular medicine today, 6 5
Hum. Genet
B. Bardoni, Jean-Louis Mandel, Gene Fisch (2000)
FMR1 gene and fragile X syndrome.
American journal of medical genetics, 97 2
M. Yandell, W. Majoros (2002)
Genomics and natural language processing
Nature Reviews Genetics, 3
S. Povey, R. Lovering, E. Bruford, Mathew Wright, M. Lush, H. Wain (2001)
The HUGO Gene Nomenclature Committee (HGNC)
Human Genetics, 109
D. Swanson (2015)
Fish Oil, Raynaud's Syndrome, and Undiscovered Public Knowledge
Perspectives in Biology and Medicine, 30
R. Caudle, A. Mannes (2000)
Dynorphin: friend or foe?
Pain, 87
M. Weeber, H. Klein, A. Aronson, James Mork, L. Berg, R. Vos (2000)
Text-based discovery in biomedicine: the architecture of the DAD-system
Proceedings. AMIA Symposium
T. Jenssen, A. Lægreid, J. Komorowski, E. Hovig (2001)
A literature network of human genes for high-throughput analysis of gene expression
Nature Genetics, 28
W. Colucci (1982)
Alpha-adrenergic receptor blockade with prazosin. Consideration of hypertension, heart failure, and potential new applications.
Annals of internal medicine, 97 1
D. Hristovski, J. Stare, B. Peterlin, S. Džeroski (2001)
Supporting Discovery in Medicine by Association Rule Mining in Medline and UMLS
Studies in health technology and informatics, 84 Pt 2
T. Rindflesch, L. Tanabe, J. Weinstein, L. Hunter (1999)
EDGAR: extraction of drugs, genes and relations from the biomedical literature.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
H. Lowe, Gene Barnett (1994)
Understanding and using the medical subject headings (MeSH) vocabulary to perform literature searches.
JAMA, 271 14
Vanmaanen Ef (1980)
the role of differential blockade of alpha-adrenergic agonists in chlorpromazine-induced hypotension.
Archives internationales de pharmacodynamie et de thérapie, 247
N. Smalheiser, D. Swanson (1998)
Using ARROWSMITH: a computer-assisted approach to formulating and assessing scientific hypotheses.
Computer methods and programs in biomedicine, 57 3
P. Jin, S. Warren (2000)
Understanding the molecular basis of fragile X syndrome.
Human molecular genetics, 9 6
C. Blaschke, Miguel Andrade, C. Ouzounis, A. Valencia (1999)
Automatic Extraction of Biological Information from Scientific Text: Protein-Protein Interactions
Proceedings. International Conference on Intelligent Systems for Molecular Biology
J. Molkentin, Jian-rong Lu, C. Antos, B. Markham, J. Richardson, J. Robbins, S. Grant, E. Olson (1998)
A Calcineurin-Dependent Transcriptional Pathway for Cardiac Hypertrophy
Cell, 93
F. Steimann (1997)
Fuzzy set theory in medicine
Artificial intelligence in medicine, 11 1
Dimitar Filev (1996)
Fuzzy SETS AND FUZZY LOGIC

Publisher: Oxford University Press
Copyright: Bioinformatics 20(3) © Oxford University Press 2004; all rights reserved.
ISSN: 1367-4803
eISSN: 1460-2059
DOI: 10.1093/bioinformatics/btg421
pmid: 14960466
Publisher site: See Article on Publisher Site

Abstract

Vol. 20 no. 3 2004, pages 389–398 BIOINFORMATICS DOI: 10.1093/bioinformatics/btg421 Knowledge discovery by automated identiﬁcation and ranking of implicit relationships 1,∗ 2,3 2,3 Jonathan D. Wren , Rafﬁ Bekeredjian , Jelena A. Stewart , 2,3 2,4 Ralph V. Shohet and Harold R. Garner Advanced Center for Genome Technology, Department of Botany and Microbiology, The University of Oklahoma, 620 Parrington Oval Rm. 106, Norman, OK 73019, USA, 2 3 4 Department of Internal Medicine, Division of Cardiology and McDermott Center for Human Growth and Development, Department of Biochemistry, Center for Biomedical Inventions, The University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390, USA Received on December 11, 2002; revised on March 11, 2003; accepted on August 10, 2003 Advance Access Publication January 22, 2004 ABSTRACT small fraction of the collective scientiﬁc knowledge within Motivation: New relationships are often implicit from existing any given ﬁeld. As increasing amounts of information and information, but the amount and growth of published literat- observations are compiled from different areas of research ure limits the scope of analysis an individual can accomplish. as individual reports, they can contribute towards a greater Our goal was to develop and test a computational method to understanding in apparently unrelated areas when considered identify relationships within scientiﬁc reports, such that large collectively. For example, it has been demonstrated that the sets of relationships between unrelated items could be sought useful implications of scientiﬁc discoveries can go unnoticed out and statistically ranked for their potential relevance as a set. or unutilized because they exist only implicitly from informa- Results: We ﬁrst construct a network of tentative relation- tion scattered among different areas of research (Swanson, ships between ‘objects’ of biomedical research interest (e.g. 1986). By using software to identify words shared between art- genes, diseases, phenotypes, chemicals) by identifying their icle titles, Swanson and Smalheiser were able to identify com- co-occurrences within all electronically available MEDLINE mon intermediates between Raynaud’s Disease (a circulatory records. Relationships shared by two unrelated objects are disorder restricting blood-ﬂow to the extremities) and the diet- then ranked against a random network model to estimate ary effects of ﬁsh oil, leading to the hypothesis and subsequent the statistical signiﬁcance of any given grouping. When com- proof (DiGiacomo et al., 1989) that compounds within dietary pared against known relationships, we ﬁnd that this ranking ﬁsh oil could ameliorate the symptoms of Raynaud’s Disease correlates with both the probability and frequency of object (Swanson, 1986; Smalheiser and Swanson, 1998). The term co-occurrence, demonstrating the method is well suited to ‘non-interactive literatures’ was coined to explain why such discover novel relationships based upon existing shared rela- a reasonable hypothesis had gone unnoticed by researchers tionships. To test this, we identiﬁed compounds whose shared in either ﬁeld alone. Finding methods to utilize greater the relationships predicted they might affect the development biomedical literature in an automated manner to aid scientiﬁc and/or progression of cardiac hypertrophy. When laboratory discovery is becoming a topic of increasing interest (Yandell tests were performed in a rodent model, chlorpromazine was and Majoros, 2002). found to reduce the progression of cardiac hypertrophy. While innovative, a keyword-based method such as that Contact: [email protected] of Swanson and Smalheiser is both limiting and highly Supplementary information: http://innovation.swmed.edu/ cumbersome, especially where a signiﬁcant body of literat- IRIDESCENT/Supplemental_Info.htm ure is concerned, for several reasons: First, only titles are used; second, word phrases such as ‘Interleukin 6’ are not INTRODUCTION taken into account, being reduced to ‘Interleukin’ and ‘6’; third, synonyms (e.g. ‘IL-6’) are not considered; ﬁnally, and There is a large difference between what is known and what perhaps most importantly, the number of unique keywords we know as individuals. We are only aware of a relatively grows rapidly per record analyzed, providing an impractically large amount of output for any user to examine (additional To whom correspondence should be addressed. Bioinformatics 20(3) © Oxford University Press 2004; all rights reserved. 389 J.D.Wren et al. a b c Fig. 1. Using literature-based relationships to engage in the discovery of new knowledge. (a) Beginning with an object of interest (black node), tentative relationships are assigned to other objects (gray nodes) when they are co-mentioned within MEDLINE abstracts. (b) Each related object (gray) is then queried for its relationships with other objects (white nodes). The white nodes are not directly related to the primary node and are thus only implicitly related, through intermediates. (c) The relationships shared by white and black nodes are ranked against a random network model to establish how many would be expected by chance alone, given the connectivity of each node in the set. Suppose the entire network consists of 1000 nodes and the numbers to the right represent how many connections each of the implicit nodes have to other nodes in the network. We can then assign a statistical weighting that reﬂects how exceptional any given implied relationship is based upon the shared intermediates. In this example, a node with 950 relationships may share many relationships with the primary node, but this is nothing exceptional because it is related to most objects in the network. It is thus down-weighted in importance (dashed lines). The connections of the gray nodes must also be taken into account, but is not shown here for simplicity. discussion in online supplement). Others have designed lit- (Fig. 1a) by identifying the co-occurrence of A and B objects erature exploration systems that overcome the ﬁrst three within MEDLINE records (titles and abstracts). Each B object barriers mentioned by looking for the co-occurrences of major can then be queried to identify other objects (‘C’) that co- Medical Subject Headings (MeSH) descriptors (Hristovski occurred with them in the literature (Fig. 1b). Each of these et al., 2001) or by mapping text to UMLS concepts (Wee- new objects, C, that are not themselves A or B objects, are ber et al., 2000, 2003). The size of the domain to be analyzed related to A only implicitly. That is, they have no documented remains the most signiﬁcant problem, however, and has been relationship with A, but share one or more relationships with dealt with thus far by user intervention in the selection of A. This large list of implicitly related objects will contain intermediates for analysis. potential discoveries of new relationships. However, because The ability to seek out novel, undocumented relationships of their abundance, it is necessary to prioritize and rank these that are logically implicit from a body of information—yet objects in some manner. To do so, we describe a method to not explicitly stated within that body—has obvious scientiﬁc rank relationships shared by two objects within a literature value. It enables us to use the current state of knowledge to network against a random network model to evaluate how infer possible new relationships that have yet to be studied. statistically exceptional any given set of shared relationships Our ability to postulate a potential relationship between two is (Fig. 1c). We also show that this ranking correlates with the or more things depends principally upon how many com- probability that two objects are related as well as the strength mon relationships we are aware of between these things, if (frequency of co-occurrence) by which they are related. any, to suggest that a relationship exists where none has been documented. Awareness of relationships, especially sets of Object co-occurrences exhaustively identify relationships, is central to the human process of insight and potential relationships discovery. We attempt to identify as many relationships as possible by postulating that a potential relationship exists between General approach two objects when they are observed to co-occur within the Herein we describe a method to identify potential relationships same MEDLINE record, an approach also taken by others to within the biomedical literature by deﬁning areas of research identify potential relationships between genes (Stapley and interest such as genes, diseases, phenotypes and chemical Benoit, 2000; Jenssen et al., 2001), proteins (Blaschke et al., compounds (hereafter referred to simply as ‘objects’). Begin- 1999) and drugs (Rindﬂesch et al., 2000). Some have used ning with an object of interest (call it ‘A’), we can identify the co-occurrence of certain MeSH terms to reﬂect potential other objects (‘B’) tentatively related to it within the literature relationships (Hristovski et al., 2001; Perez-Iratxeta et al., 390 Identifying new relationships from existing ones 2002), but MeSH terms for each object within an abstract to identify novel relationships that are implicit by virtue of are not always provided, and in a number of cases (e.g. shared relationships. gene names) are used to reﬂect the existence of a more gen- For an implicit relationship—two objects related only eral category as opposed to a speciﬁc entity. Here, we are through shared intermediates—it is not yet clear what stat- interested in associations between any areas of active bio- istical parameters best correlate with the probability of it medical research interest. We assemble the primary names representing a biologically meaningful relationship. However, and synonyms for genes, diseases, phenotypes and chem- we can assume that the probability of an implicit relation- ical/pharmaceutical compounds into a composite database so ship (A ↔ C) being biologically meaningful would not be that the names can be recognized as they occur within text. greater than the least probable of the two individual (A ↔ B These objects were chosen because they are of broad interest or B ↔ C) relationships linking them, where the symbol ↔ in biology and medicine, but the approach we present allows is deﬁned as the existence of a non-directional relationship for any ‘class’ of object to be incorporated that is considered between two objects. This is equivalent to stating that the of research interest (e.g. tissue types, protein motifs, cell lines, strength of a chain is no greater than its weakest link. etc.). It should be noted, however, that addition of a new object class presumes that co-citations with other object classes could SYSTEMS AND METHODS be construed as meaningful. For example, country names Code was written in Visual Basic 6.0 (SP5) using ODBC could be added as an object class, but it is doubtful that extensions to interface with a Microsoft Access 2000 data- any co-occurrences with other objects would be considered base, with database queries written in SQL. Analyses were run biologically meaningful or interesting. on a Pentium 4–2.4 GHz desktop PC running Windows 2000. Database entries were obtained from the following sources, all Fuzzy logic is used to weight importance of downloaded between December 13 and December 25, 2001: co-occurrence The disadvantage of using co-occurrence is that it does not Database Location always reﬂect the existence of a biologically meaningful rela- tionship. To reﬂect this ambiguity, we borrow from Fuzzy Set OMIM ftp://ftp.ncbi.nlm.nih.gov/repository/ Theory and model relationships as probabilistic, that is, ran- OMIM/omim.txt.Z ging from 0 to 1, rather than binary values [for an overview GDB http://gdbwww.gdb.org/gdb/ of Fuzzy Set Theory see Steimann (1997) and for a thorough advancedSearch.html discussion see Klir and Yuan (1995)]. By manually survey- HGNC http://www.gene.ucl.ac.uk/public- ing each object co-mentioned within a sample of MEDLINE ﬁles/nomen/nomeids.txt records, we can estimate the probability that a co-mention LocusLink ftp://ftp.ncbi.nih.gov/refseq/ reﬂects the presence of a non-trivial relationship between the LocusLink/LL.out_hs.gz (Human) two objects. This base probability can then be used to assign MeSH http://www.nlm.nih.gov/cgi/ a fuzzy score to each relationship, reﬂecting the probability request.meshdata (MeSH Trees ﬁle) that one or more co-occurrences are meaningful. Since terms MEDLINE National Library of Medicine that co-occur more frequently are more likely to represent http://www.nlm.nih.gov biologically meaningful relationships (Jenssen et al., 2001), Genome Ontology http://www.geneontology.org each relationship is assigned a score based on the frequency and type (i.e. abstract or sentence) of co-mentions observed and their corresponding error rates (discussed in the section Values used to ascertain relatedness to one of the two on ‘Implementation’). fuzzy sets (i.e. belonging to the category ‘related’ or ‘not By deﬁning what objects will be recognized rather than related’) were based upon the probability that a co-occurrence using all words reduces the magnitude of analysis and allows of objects equated to a non-trivial relationship between the a focus on relationships with a higher potential of being con- two (see online supplement Fig. S2). Thus, the value of sidered ‘interesting’. Diseases and clinical phenotypes were relatedness can range from 0 to 1 and is estimated by the obtained from Online Mendelian Inheritance in Man (OMIM) number of times (n) that the two objects were co-mentioned (Hamosh et al., 2000); chemical compounds and drugs from and the error rate (r) associated with the co-mention metric the MeSH database (Lowe and Barnett, 1994); and genes from (i.e. sentence or abstract) used to establish the relationship. Locuslink (Maglott et al., 2000) and the Human Gene Nomen- This formula, P(related) = 1 − r , is used to calculate clature Committee (HGNC) (Povey et al., 2001). As tentative the relatedness of two objects and is referred to as a veracity relationships between these objects are identiﬁed within text, score. The veracity score can range in value from 0.58 (two they are entered into a database. This database enables the con- objects only co-mentioned once within one abstract) to 1.0. struction of a network of relationships, which can be queried Rather than summing the raw number of co-mentions shared to identify relationships shared among a set of objects and by two objects, summing their veracity scores permits a more 391 J.D.Wren et al. accurate estimate of how many relationships are truly shared randomly connected in a network with a total of N nodes, based upon the known error rate. the probability that a node will be connected to A is given as We deﬁne the ‘strength’ of a relationship as a function of K /N where K is the total number of connections to A. The A t A the number of times two objects have been co-mentioned probability B will be connected to A [written as P(B ∈ A)]is and the probability that each co-mention represents a non- K /N and the probability A will be connected to B [written A t trivial relationship. The term ‘strength’ is used rather than as P(A ∈ B)], is K /N . Because the formula P(A ∈ B) or B t frequency because we record both sentence co-mentions (C ) P(B ∈ A) is more easily represented in mathematical terms and abstract co-mentions (C ), and need a convenient way as the probability B is not related to A and vice versa, written to combine the two. The strength score (S) is assigned based as NOT [P(A ∈/ B) AND P(B ∈/ A)], we can deﬁne the upon the individual co-mention error rates, r (17% FP) and probability in mathematical terms as: r (42% FP) respectively, by the formula S = C ∗ (1 − r ) + a s s K K A B C ∗ (1 − r ). a a P(A ↔ B) = 1 − 1 − ∗ 1 − (1) N N t t Materials and methods for the Chlorpromazine–Cardiac Hypertrophy experiments are contained in the online Intuitively, we expect that if K = N or K = N then A t B t supplement. P(A ↔ B) = 1, since the number of connections to one node does not matter if the other node is connected to all nodes. This formula applies for all non-zero values of K and ALGORITHM K . Random network simulations were conducted to con- In this section, we adopt terminology from graph theory and ﬁrm the validity of this formula (data not shown). Summing refer to objects as ‘nodes’ and relationships (co-citations) as the probability of each individual relationship, we can extend ‘connections’, also known as the ‘edges’ between nodes. We this formula to estimate the expected number of connections also deﬁne an implicitly related node (C) as one that has no a set of nodes, B , would share with another object, A, by the direct connection to the query node (A), yet is connected to equation: one or more intermediate nodes (B) that are simultaneously connected to A. To evaluate the potential signiﬁcance of an K K A B Expect(A ↔ B ) = 1 − 1 − ∗ 1 − (2) implicitly related node, we compare the set of i nodes (B ) N N t t i=1 shared by both the query node A and the implicit node C, against a random network model. Given that we are interested Equation (2) is used to estimate the expected number of in an node A, and know from processing all literature associ- shared relationships between B and C, given the connectivity ated with A that it is related to all nodes in the set B , we ask of each intermediate (shared) node in the set B that A is known the question ‘Given the number of connections each node in to be connected to. the set B has, and the number of connections the target node (C) has, how many connections might we expect between B IMPLEMENTATION and C by chance alone?’ For example, if C were related to Precision and recall rates are estimated every node in a 1000 node network and A had 100 connec- tions within this network, all of which were shared with C, First, we estimated the precision of using co-occurrence as this would be expected and therefore unexceptional. Thus, a method of identifying the existence of a non-trivial rela- dividing the number of observed connections (Obs) between tionship between two objects by manually evaluating the B and C by the number of connections we would expect by co-occurring objects within a random set of 25 MEDLINE chance (Exp) provides us with a value reﬂecting the statist- records (titles and abstracts). We found that two objects ical signiﬁcance of the shared connections. This value allows co-mentioned within the same sentence were more likely us to estimate the potential relevance of a set of connections. to be related (83%) than objects co-mentioned in the same For example, if a set of connections linking a disease (A) abstract (58%). Using sentence co-mentions alone, how- to a chemical (C) were to encompass highly common nodes ever, would miss 43% of the non-trivial relationships within such as ‘sodium’ and ‘symptom’, we recognize that—whether an abstract. This proportion of correct relationships among true or not—these types of connections are sufﬁciently vague abstract co-mentions is similar to the estimates others have to be of little use to a scientist in postulating how A and C obtained (Jenssen et al., 2001; Ding et al., 2002). Addition- might have an interesting and speciﬁc connection through ally, because judgment of what constitutes a ‘relationship’ and these intermediates. If the shared connections involve spe- what is ‘non-trivial’ is somewhat subjective, we attempted to ciﬁc transporters or genes, which would not be as frequently estimate this error rate in a more objective way by identifying mentioned in the literature, it becomes easier to postulate how objects co-mentioned in the ﬁrst half of MEDLINE (records speciﬁc actions of (C) could produce (A). up until approximately Nov. 1991), but not in the second We derived an expectation value based upon the relative half. The rationale for this approach comes from the obser- connectivity of each node involved. Assuming nodes are vation that related objects (e.g. insulin–glucose) tend to be 392 Identifying new relationships from existing ones composite object database, adding 3094 acronyms to data- base entries that did not have an acronym speciﬁed. ARGH also identiﬁed 4786 spelling/hyphenation variants observed within MEDLINE for objects within the composite database. It is difﬁcult to assess what impact ARGH has upon the pre- cision or recall when processing records, as the reduction in the false-negative (FN) rate depends upon how common the variant or acronym is and reduction in the false-positive rate depends upon the acronym. For example, the gene calcitonin is associated with the acronym CT, which has a different deﬁnition within MEDLINE 96% of the time (Computed Tomography). Gene names like SOCS-3 are unambiguous and unaffected by the use of ARGH to resolve acronyms, but less than half of the deﬁnitions of SOCS-3 within MEDLINE would be recognized without the spelling vari- Fig. 2. Analysis of the ﬁrst 10 000 co-cited objects found within the ants provided by ARGH (the ARGH database can be queried 1st but not 2nd half of MEDLINE, grouped by the total number of at http://lethargy.swmed.edu/argh/argh.asp). co-citations identiﬁed within MEDLINE for the two objects. The fact that these co-citations are non-recurring suggests that the co-citation Estimated recall rate of using abstracts versus did not reﬂect the existence of a relationship studied between the full-text articles two. As shown here, this distribution is signiﬁcantly different than Abstracts presumably contain the most important ﬁndings the overall distribution in the co-citation frequency of the 63 836 of a report and important ﬁndings are usually reiterated in records analyzed. future abstracts, but it could be argued that some relation- ships might not be found within abstracts. To estimate this and obtain a recall rate, we calculated the total number of rela- co-mentioned over the course of many studies after their ﬁrst tionships within a domain of knowledge (MEDLINE articles) co-mention. We reasoned that if two objects are co-mentioned that are contained in their electronically accessible summary early (establishing their co-existence in the literature), but not form (MEDLINE titles and abstracts). A set of objects men- again after an equal number of publications, there are several tioned within review articles was manually compiled and possibilities: the co-mention was the result of two unrelated compared to the relationships found within MEDLINE titles topics being discussed together (e.g. incidental, broad topical coverage), the objects were once studied for a relationship but and abstracts. The same list was compared to the object none was found or it was in error, or a relationship was estab- database to estimate what percent of object types mentioned lished but was not of sufﬁcient interest to warrant further study. in MEDLINE were represented in the databases used. Four Regardless of the exact reason, these represent non-persistent objects were randomly chosen from the collective object data- ‘relationships’, and are suggestive of a class of co-mentions base, representing one of each object type, with the stipulation that at least two review articles had been written about the (erroneous or uninteresting) that we wish to exclude. We object within the past three years. A set of review articles examined the ﬁrst 10 000 non-persistent co-citations found was selected for CTLA-4 (gene) (McCoy and Le Gros, 1999; within MEDLINE and found a similar distribution as would Green, 2000; Tomer, 2001), Fragile-X Syndrome (disease) be predicted by the error rate formula (Fig. 2), although by (Bardoni et al., 2000; Jin and Warren, 2000; Kooy et al., 2000), this method the predicted error rate would be slightly higher. cachexia (clinical phenotype) (Barber, 2001; Hasselgren and This helps to conﬁrm the accuracy of the estimates and to Fischer, 2001; Tisdale, 2001) and dynorphin (chemical com- justify the use of a power-law decay function to represent the pound) (Steiner and Gerfen, 1998; Caudle and Mannes, 2000). probability of error. Only objects of the same types (i.e. other genes, diseases, Resolving ambiguous acronyms phenotypes and chemicals) were counted. Acronyms were resolved as they occurred within text using There were a total of 40 objects mentioned in the literat- an Acronym Resolving General Heuristic (ARGH) to reduce ure but not found in the database (2 diseases, 9 phenotypes, both random and systematic errors in term recognition, which 7 genes and 22 chemical compounds). The 2 disease names operates with ∼ 96% precision and 92% recall (Wren and (Graves’ Opthalamopathy and Relapsing-remitting Experi- Garner, 2002). A total of 4309 acronyms were ﬂagged as mental Autoimmune Encephalomyelitis) and 9 phenotypes ambiguous (i.e. one deﬁnition must be >95% of all identi- were not mentioned in OMIM. Three of these phenotypes, ﬁed deﬁnitions to be considered unambiguous) and requiring however, were simply the result of a semantic differ- resolution. The ARGH database of MEDLINE acronyms was ence between the OMIM entry and the article (‘rocking’ also used to expand the acronym list for entries within the versus ‘body-rocking’, ‘greater interocular distance’ versus 393 J.D.Wren et al. ‘increased interocular distance’, ‘fetal akinesia’ versus ‘akin- esia’). The most problematic category was ‘small molecules’, for which many chemicals and drugs widely mentioned in the literature (e.g. DAMGO, DADLE, isoprenaline) were simply not found in the MeSH trees database. In this sample, there were 181 objects found within the review articles, 141 of which were also in the composite data- base (78%). From the 40 objects mentioned in the reviews but not found in the database, 2 were diseases, 9 phenotypes, 7 genes and 22 chemical compounds. From these 141 data- base objects mentioned within the full-text of the reviews, 138 of them (98%) could be found within the body of a MED- LINE title or abstract, suggesting that most objects pertinent to a review can also be found within an abstract or title. Seman- tically, 124 of these 138 objects were spelled in the literature the same way they were found in the database, giving a recall rate of 90% (124/138) in terms of identifying the conceptual occurrence of database objects within textual input, and 69% (124/181) in terms of identifying relevant relationships within its domain of knowledge (MEDLINE). Some of the FN failures to identify objects within text were systematic (e.g. the MeSH entry 5,8,11,14,17- Eicosapentaenoic Acid is almost always referred to in MED- LINE simply as eicosapentaenoic acid) while other failures varied in their rates (e.g. JNK was found to be spelled 81 dif- ferent ways including ‘c-Jun N-terminal kinase’ 605 times, ‘c-Jun NH2-terminal kinase’ 154 times and ‘c-Jun amino- Fig. 3. The distribution in the number of relationships per object terminal kinase’ 62 times). follows a scale-free power law distribution. (a) A relatively small fraction of the objects in the database are directly related to a large Creating a network of relationships using percentage of the total, contributing to a rapid explosion in the MEDLINE number of implicitly related objects. (b) As the number of direct A total of 12 037 763 MEDLINE records recorded from relationships increases, the number of implicit relationships rapidly 1967 to January 2002 were processed to create a network approaches the theoretical maximum, which is the total number of of 3 482 204 unique relationships between objects. Approx- nodes in the network, and then decreases linearly with the number imately two-third of the objects in the database found exact of possible implicit relationships. literal matches within the literature, identifying at least one ‘relatedness’ of two objects solely by examining the relation- relationship for 22 482 of the 33 539 unique objects (85 234 ships they share. To establish this, an Obs/Exp was calculated total terms when including synonyms) within the database. for all relationship sets (of at least 100 objects) shared by As expected, we ﬁnd a highly disproportionate distribution a central query object and any other object in the database, in the number of relationships per object (Fig. 3a), indicating regardless of whether a direct relationship was known or not. the network is scale-free in nature. As such, this connectivity The Obs/Exp scores were sorted from highest to lowest on contributes to a rapid explosion in the number of implicitly the x-axis, and the strength of the relationship, if known, related objects as the number of direct relationships increases was plotted on the y-axis. If the strength was not known (Fig. 3b). Thus, identifying implicitly related objects becomes (i.e. it was an implicit relationship), then no bar was plotted. secondary to being able to rank their potential signiﬁcance. For the object ‘Cardiac Hypertrophy’, we see that the higher Furthermore, this also shows that the search for implicit rela- tionships more than one level removed (i.e. A ↔ B ↔ C ↔ the Obs/Exp ratio, the more likely the relationship is known D) would likely be fruitless in the absence of any further con- (Fig. 4). Furthermore, we note that the higher the Obs/Exp straints, since all non-circular connections from A are reached ratio, the stronger the relationship tends to be (i.e. the more relatively rapidly as the domain grows to a modest size. frequently they are co-cited). To conﬁrm that the trend observed in Figure 4 is not speciﬁc Ranking all implicit relationships to the analysis of cardiac hypertrophy, but rather is a general We evaluated whether this observed to expected ratio trend, we randomly picked 100 objects from the database that (Obs/Exp) we are calculating could be used to estimate the had between 500 and 1000 relationships within the network 394 Identifying new relationships from existing ones Fig. 4. Objects were ranked for their implicit ‘relatedness’ to car- diac hypertrophy solely on the basis of the relationships they shared Fig. 5. The observed to expected ratio obtained from identifying (Obs/Exp). If a relationship in MEDLINE has been established, its and ranking shared relationships correlates with the existence and strength (based upon frequency of co-occurrence within MEDLINE) known strength of a relationship. This enables novel (implicit) rela- is plotted on the y-axis, otherwise it will appear as a gap (meaning tionships to be correlated with the probability they are relevant (as no relationship has been established). Shown is a subset of 4887 judged by existing relationships) and important (as judged by the objects sharing at least 100 relationships with cardiac hypertrophy, strength/frequency of historical reporting). sorted by their calculated observed to expected ratio. Due to x-axis compression, not all gaps will be visible on this graph. is induced in response to environmental stimuli, such as arter- (this range was chosen simply to ensure that the approximate ial hypertension, increased cardiac work or hormonal stimuli. scale of analysis for each object was similar). Implicit rela- It is an intensively studied condition as evidenced by the tionships were identiﬁed for these objects and ranked by their 4092 articles in MEDLINE containing the key phrase ‘cardiac Obs/Exp values. The top 1000 Obs/Exp scores were taken hypertrophy’. A total of 2102 unique objects were co- for each analysis and ranked from 1 (highest Obs/Exp) to mentioned within all articles mentioning cardiac hypertrophy 1000 (lowest), and a normalized strength score calculated and 19 718 unique objects were implicitly related to cardiac for each object analyzed, ranging from 1.0 (strongest dir- hypertrophy through a total of 1 842 599 different paths. ect relationship observed) to 0.0 (no relationship observed). Examining the shared relationships for the implicitly related Figure 5 shows this average strength plotted against the objects in Table 1, we excluded ‘endotoxins’ from further Obs/Exp rankings, indicating that this is a general trend. study in part because this refers to a class of immuno- In some ways, the correlation of exceptional groupings with inductive compounds rather than a speciﬁc compound, and known relationships is not too surprising, as we would expect in part because endotoxins are known to have substantial that two objects with very similar purposes, functions, or effects on cardiac function that would complicate interpreta- involvement in a biological process should interact with and/or tion of hypertrophy. Morphine was excluded from the study be studied with many of the same objects. This does, however, because it would have substantial effects on the behavior of establish that the relatedness of two objects can be correl- mice (including somnolence and reduced feeding) that would ated with the statistical exceptionality of the relationships limit the dose used for study. Chlorpromazine (CPZ) was they share. More importantly, this provides us with a means deemed more suitable for further study, in part because it to evaluate quantitatively implicit relationships by demon- is a commonly used drug and an unrecognized effect on car- strating that the Obs/Exp score correlates positively with diac hypertrophy could have clinical importance. A list of established relationships. This numeric evaluation enables us shared relationships between cardiac hypertrophy and CPZ is to identify new relationships, not found within MEDLINE available in the online web supplement. records, that are more likely to be logically plausible and relev- Chlorpromazine (CPZ) is an aliphatic phenothiazine com- ant to the query object because of the relationships they share. pound used principally as an anti-psychotic and anti-emetic drug (Shen, 1999). It has a number of physiological effects Wet-lab testing of in-silico predictions Cardiac hypertrophy is deﬁned as an increase in the size of myocytes that is associated with detrimental effects on Statistics are as of June 23, 2003, although this analysis was initially aspects of contractile and electrical function in the heart. It conducted in January 2002 when there were fewer articles than this. 395 J.D.Wren et al. Table 1. Chemical compounds within the composite database implicitly related to cardiac hypertrophy Rank Implicit relationship Shared Quality Expected Obs/Exp rels estimate 1 Endotoxin 1301 1025 307 3.34 2 Morphine 1217 939 283 3.32 3 Chlorpromazine 1089 824 252 3.28 4 Globulin 1130 850 265 3.20 5 Cisplatin 1129 862 274 3.14 6 Neomycin 1105 842 272 3.10 7 Polyethylene glycol 1153 863 279 3.09 8 Phytohemagglutinin 1099 807 266 3.03 9 Methotrexate 1190 897 308 2.91 10 Casein 1165 895 308 2.91 11 Isoleucine 1142 852 293 2.91 12 Galactose 1104 826 284 2.91 13 Progesterone 1448 1132 392 2.89 Fig. 6. Chlorpromazine protects against the development of cardiac 14 Esterase 1197 908 317 2.86 hypertrophy. Several parameters of ventricular hypertrophy were 15 Tetracycline 1066 800 283 2.83 determined by echocardiography. One group of mice received iso- 16 Acetone 1075 804 285 2.82 proterenol only (ISO, n = 10) and the other received both isoproter- 17 Concanavalin A 1317 1002 355 2.82 enol and chlorpromazine (CPZ + ISO, n = 8). Symbols represent 18 Polysaccharide 1092 829 295 2.81 individual mice, brackets denote mean (center triangle) and standard 19 Bromide 1368 1048 381 2.75 deviation for group. LVW = left ventricle weight (CPZ + ISO 11 ± 20 Methanol 1221 930 354 2.63 27%, ISO 51 ± 43%,P< 0.02), LVMI = left ventricular mass index The 20 objects with the most implicit connections (shared rels) were extracted and sorted (CPZ + ISO 11 ± 28%, ISO 50 ± 52%,P< 0.04), PWT = posterior by observed to expected ratio, which is calculated using the probability each direct wall thickness (CPZ + ISO 16 ± 16%, ISO 36 ± 27%,P< 0.05), relationship comprising the implicit relationship is valid (quality estimate). These are IVSWT = intraventricular septum wall thickness (CPZ + ISO 19 ± compounds that should not have any documented relationship with cardiac hypertrophy 18%, ISO 31 ± 20%,P< 0.12). within the MEDLINE titles and abstracts analyzed yet, at the same time, share many relationships with it. again before the mice were sacriﬁced to allow estimation of and molecular targets that suggest it might provide an anti- their left ventricular weight (LVW), left ventricular mass index hypertrophic effect in the heart, one of which is its alpha- (LVMI), posterior wall thickness (PWT) and intraventricular adrenergic blocking activity (Morgan and Van Maanen, 1980). septum wall thickness (IVSWT). In all four of the parameters Hypertrophy can be induced through over-stimulation of measured, we found that the amount of cardiac hypertrophy alpha-adrenergic receptors by agonists and this effect can be was signiﬁcantly reduced in the isoproterenol (ISO) plus CPZ blocked by alpha-adrenergic antagonists (Colucci, 1982). It treated mice (n = 8) in comparison to the control group given has recently been recognized that the calmodulin-dependent only ISO (n = 10), as evaluated by 1-tailed Student’s t -test phosphatase calcineurin plays an important role in some with unequal variance (Fig. 6). forms of hypertrophy (Molkentin et al., 1998). CPZ has been reported to interact with calmodulin (Marshak et al., 1981) as DISCUSSION an antagonist, suggesting a potential role beyond that of an alpha-adrenergic receptor. Despite the potential mechanistic A relationship between CPZ and cardiac hypertrophy has not connections between cardiac hypertrophy and CPZ, there is been previously suggested in the literature. The application of no indication within MEDLINE that any relationship between implicit relationship analysis was required for generating the the two has been suggested. underlying hypothesis of this study. It was previously known that CPZ had modest alpha-blocking activity, but the ﬁnd- In-silico predicted effect conﬁrmed ing that it interferes with ISO-induced hypertrophy, a pure in rodent model beta-adrenergic effect, is surprising and provocative. Possible We looked for an association between CPZ and cardiac hyper- mechanisms include: (1) a previously unsuspected activity of trophy in a rodent model. Two groups of mice were given CPZ on beta receptors, either directly or through cross-talk 20 mg/kg/day isoproterenol by osmotic minipump, with one between different classes of receptors; (2) an effect of CPZ group additionally receiving 10 mg/kg/day CPZ. This dose on downstream signaling from the beta receptor in the car- of CPZ did not perceptibly alter feeding behavior or physical diac cells; or (3) a ‘pseudo-hypertrophic’, non-cellular, effect activity. Echocardiograms were obtained before treatment and related to increased myocardial edema or matrix deposition. 396 Identifying new relationships from existing ones Such an investigation could have clinical implications. If this PGA grant 5U01HL6688002, NIH/NHLBI grant P50 drug exerts a similar effect against common precipitants of CA70907, a Hudson Foundation grant and a National AHA hypertrophy, it could provide a clue to molecular structures Grant-in-Aid (R.V.S.). that should be explored for therapeutic beneﬁt. Moreover, many tens of thousands of patients already receive CPZ, and it may be contributing to cardiac pathology, or protection, in a REFERENCES previously unsuspected fashion. Conﬁrmation and evaluation Barber,M.D. (2001) Cancer cachexia and its treatment with ﬁsh-oil- of the mechanism of CPZ in cardiac hypertrophy will require enriched nutritional supplementation. Nutrition, 17, 751–755. further work beyond these preliminary investigations. Bardoni,B., Mandel,J.L. and Fisch,G.S. (2000) FMR1 gene and Overall, we have demonstrated that an analysis of shared fragile X syndrome. Am. J. Med. Genet., 97, 153–163. Blaschke,C., Andrade,M.A., Ouzounis,C. and Valencia,A. (1999) relationships scored against a random network model has Automatic extraction of biological information from scientiﬁc the potential to elucidate novel and interesting relationships text: protein–protein interactions. ISMB, 7, 60–67. not documented within MEDLINE, but rather based upon Caudle,R.M. and Mannes,A.J. (2000) Dynorphin: friend or foe? information contained therein. Automating the relationship Pain, 87, 235–239. identiﬁcation process enables us to bypass the monumental Colucci,W.S. (1982) Alpha-adrenergic receptor blockade with time and effort that would be required to record manually prazosin. Consideration of hypertension, heart failure, and poten- every relationship within MEDLINE’s 12 million abstracts, tial new applications. Ann. Int. Med., 97, 67–77. and using an object-based model reduces the need to ascer- DiGiacomo,R.A., Kremer,J.M. and Shah,D.M. (1989) Fish-oil diet- tain which relationships are of interest, since the objects ary supplementation in patients with Raynaud’s phenomenon: within the database are presumably those a user would be a double-blind, controlled, prospective study. Am. J. Med., 86, interested in. However, there are shortcomings in the use 158–164. Ding,J., Berleant,D., Nettleton,D. and Wurtele,E. (2002) Mining of this method: ﬁrst, there is the problem of ‘uninteresting’ Medline: abstracts, sentences or phrases? Pac. Symp. Biocomput., relationships. To some extent, this will be user-dependant. Kauau, Hawaii, 7, 326–337. Objects that share many relationships may indeed have a Green,J.M. (2000) The B7/CD28/CTLA4 T-cell activation pathway. relationship themselves, but the nature of their relationship Implications for inﬂammatory lung disease. Am. J. Respir. Cell may be such that it would not be considered interesting or Mol. Biol., 22, 261–264. worth investigating. Second, ascertaining the nature of the Hamosh,A., Scott,A.F., Amberger,J., Valle,D. and McKusick,V.A. implied relationship by examining the shared relationships (2000) Online Mendelian Inheritance in Man (OMIM). Hum. is time-consuming. Methods of providing a summary ana- Mutat., 15, 57–61. lysis or better evaluating which of the shared relationships Hasselgren,P.O. and Fischer,J.E. (2001) Muscle cachexia: current are potentially interesting would be highly desirable. Third, concepts of intracellular mechanisms and molecular regulation. comparison to a random network model relies upon the ana- Ann. Surg., 233, 9–17. Hristovski,D., Stare,J., Peterlin,B. and Dzeroski,S. (2001) Support- lyzed text to be focused (non-random) in nature. To the extent ing discovery in medicine by association rule mining in Medline that writing is random or functionally detached within an and UMLS. Medinfo, 10, 1344–1348. analyzed textual unit, trivial connections will be made and Jenssen,T.K., Laegreid,A., Komorowski,J. and Hovig,E. (2001) A lit- fewer groupings will stand out statistically. Finally, work erature network of human genes for high-throughput analysis of still needs to be done on better establishing relationships, gene expression. Nat. Genet., 28, 21–28. beyond what we have done here in scoring the probabil- Jin,P. and Warren,S.T. (2000) Understanding the molecular basis of ity that a co-occurrence was meaningful. For example, the fragile X syndrome. Hum. Mol. Genet., 9, 901–908. method asserts that a relationship is known when two objects Klir,G. and Yuan,B. (1995) Fuzzy Sets and Fuzzy Logic: Theory and have been mentioned together within the same abstract, Applications. Prentice Hall. and unknown if they have not. While this may be a good Kooy,R.F., Willemsen,R. and Oostra,B.A. (2000) Fragile X syn- generalization, two objects may have been co-mentioned sev- drome at the turn of the century. Mol. Med. Today, 6, 193–198. Lowe,H.J. and Barnett,G.O. (1994) Understanding and using the eral times and yet the overall nature of their relationship, medical subject headings (MeSH) vocabulary to perform literature or certain aspects thereof, remains unknown. Nonetheless, searches. JAMA, 271, 1103–1108. we believe this method will prove to be of utility in a Maglott,D.R., Katz,K.S., Sicotte,H. and Pruitt,K.D. (2000) NCBI’s ﬁeld where the amount of information continues to increase LocusLink and RefSeq. Nucleic Acids Res., 28, 126–128. exponentially. Marshak,D.R., Watterson,D.M. and Van Eldik,L.J. (1981) Calcium- dependent interaction of S100b, troponin C, and calmodulin with an immobilized phenothiazine. Proc. Natl Acad. Sci., USA, 78, ACKNOWLEDGEMENTS 6793–6797. Thanks to Irene Rombel for a helpful review of the manuscript. McCoy,K.D. and Le Gros,G. (1999) The role of CTLA-4 in the This work was funded in part by NSF-EPSCoR grant no. EPS- regulation of T cell immune responses. Immunol. Cell Biol., 77, 0132534 (J.D.W.), NIH/NCI grant CA81656, NIH/NHLBI 1–10. 397 J.D.Wren et al. Molkentin,J.D., Lu,J.R., Antos,C.L., Markham,B., Richardson,J., Steimann,F. (1997) Fuzzy set theory in medicine. Artif. Intell. Med., Robbins,J., Grant,S.R. and Olson,E.N. (1998) A calcineurin- 11, 1–7. dependent transcriptional pathway for cardiac hypertrophy. Cell, Steiner,H. and Gerfen,C.R. (1998) Role of dynorphin and enkephalin 93, 215–228. in the regulation of striatal output pathways and behavior. Exp. Morgan,J.P. and Van Maanen,E.F. (1980) The role of differen- Brain Res., 123, 60–76. tial blockade of alpha-adrenergic agonists in chlorpromazine- Swanson,D.R. (1986) Fish oil, Raynaud’s syndrome, and undis- induced hypotension. Arch. Int. Pharmacodyn. Ther., 247, covered public knowledge. Perspect. Biol. Med., 30, 7–18. 135–144. Tisdale,M.J. (2001) Cancer anorexia and cachexia. Nutrition, 17, Perez-Iratxeta,C., Bork,P. and Andrade,M.A. (2002) Association of 438–442. genes to genetically inherited diseases using data mining. Nat. Tomer,Y. (2001) Unraveling the genetic susceptibility to autoimmune Genet., 31, 316–319. thyroid diseases: CTLA-4 takes the stage. Thyroid, 11, 167–169. Povey,S., Lovering,R., Bruford,E., Wright,M., Lush,M. and Wain,H. Weeber,M., Klein,H., Aronson,A.R., Mork,J.G., de Jong-van den (2001) The HUGO Gene Nomenclature Committee (HGNC). Berg,L.T. and Vos,R. (2000) Text-based discovery in biomedicine: Hum. Genet., 109, 678–680. the architecture of the DAD-system. Proceedings of AMIA Annual Rindﬂesch,T.C., Tanabe,L., Weinstein,J.N. and Hunter,L. (2000) Fall Symposium, Los Angeles, CA, 903–907. EDGAR: extraction of drugs, genes and relations from the Weeber,M., Vos,R., Klein,H., De Jong-Van Den Berg,L.T., Aron- biomedical literature. Pac. Symp. Biocomput., 517–528. son,A.R. and Molema,G. (2003) Generating hypotheses by dis- Shen,W.W. (1999) A history of antipsychotic drug development. covering implicit associations in the literature: a case report of a Comp. Psychiatry 40, 407–414. search for new potential therapeutic uses for thalidomide. J. Am. Smalheiser,N.R. and Swanson,D.R. (1998) Using ARROWSMITH: Med. Inform. Assoc., 10, 252–259. a computer-assisted approach to formulating and assess- Wren,J.D. and Garner,H.R. (2002) Heuristics for identiﬁcation of ing scientiﬁc hypotheses. Comp. Meth. Prog. Biomed., 57, acronym-deﬁnition patterns within text: towards an automated 149–153. construction of comprehensive acronym-deﬁnition dictionaries. Stapley,B.J. and Benoit,G. (2000) Biobibliometrics: information Meth. Inform. Med., 41, 426–434. retrieval and visualization from co-occurrences of gene names Yandell,M.D. and Majoros,W.H. (2002) Genomics and natural in Medline abstracts. Pac. Symp. Biocomput., 5, 529–540. language processing. Nat. Rev. Genet., 3, 601–610.

Journal

Bioinformatics – Oxford University Press

Published: Jan 22, 2004

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

Knowledge discovery by automated identification and ranking of implicit relationships

Knowledge discovery by automated identification and ranking of implicit relationships

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

Knowledge discovery by automated identification and ranking of implicit relationships

Knowledge discovery by automated identification and ranking of implicit relationships

References (40)

Abstract

Journal

Recommended Articles

There are no references for this article.

Our policy towards the use of cookies