Identification of reaction organization patterns that naturally cluster enzymatic transformations

Identification of reaction organization patterns that naturally cluster enzymatic transformations Background: Metabolic reactions are chemical transformations commonly catalyzed by enzymes. In recent years, the explosion of genomic data and individual experimental characterizations have contributed to the construction of databases and methodologies for the analysis of metabolic networks. Some methodologies based on graph theory organize compound networks into metabolic functional categories without preserving biochemical pathways. Other methods based on chemical group exchange and atom flow trace the conversion of substrates into products in detail, which is useful for inferring metabolic pathways. Methods: Here, we present a novel rule-based approach incorporating both methods that decomposes each reaction into architectures of compound pairs and loner compounds that can be organized into tree structures. We compared the tree structure-compound pairs to those reported in the KEGG-RPAIR dataset and obtained a match precision of 81%. The generated tree structures naturally clustered all reactions into general reaction patterns of compounds with similar chemical transformations. The match precision of each cluster was calculated and used to suggest reactant-pairs for which manual curation can be avoided because this is the main goal of the method. We evaluated catalytic processes in the clusters based on Enzyme Commission categories that revealed preferential use of enzyme classes. Conclusions: We demonstrate that the application of simple rules can enable the identification of reaction patterns reflecting metabolic reactions that transform substrates into products and the types of catalysis involved in these transformations. Our rule-based approach can be incorporated as the input in pathfinders or as a tool for the construction of reaction classifiers, indicating its usefulness for predicting enzyme catalysis. Keywords: Metabolic reaction, Reaction patterns, Compound transformation, Reactant pairs, Enzyme catalysis Background systems and detailed group reaction exchanges. How- One of the main forces defining genome content is me- ever, the results arising from system-oriented studies tabolism, a chemical system that generates all the neces- are difficult to incorporatedirectlyintothe enzyme sary chemical substances in living cells through chemical catalysis context. reactions mainly catalyzed by genome-encoded enzymes Approaches based on graph theory build compound– [1]. The increased availability of metabolic information compound networks that reveal hubs (compounds that has attracted the interest of bioinformaticians, who have are highly connected) and modules (compound sets that developed computational methods to uncover important suggest communities of similar chemical and functional information regarding global properties of metabolic properties) [2]. Various properties arise from the re- moval of hubs, providing a partial view of the metabolic * Correspondence: rmaria@ibt.unam.mx network due to loss of biochemical information required Departamento de Microbiología Molecular, Instituto de Biotecnología to include compound modifications present in more Universidad Nacional Autónoma de México, Apdo, Postal 510-3, 62250 traditional metabolic maps. Such a level of detail can be Cuernavaca, Morelos, Mexico Full list of author information is available at the end of the article © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 2 of 17 achieved by adopting a reaction as a unit with the aim of principle was used to create the balance rule, and the analyzing specific compound transformations. Arita M. second was used to propose the count rule. Both rules created a method to detail atom correspondence were employed to analyze a set of 6230 well-curated re- between substrates and products within reactions, dem- actions documented in the KEGG database. Each TS onstrating that metabolic networks do not follow a contained at least one pair of compounds similar in small-world network distribution (many elements in structure to RPairs. To assess the performance of our modules connected by a few hubs) [3]. rules, we compared each resultant pair in a reaction to Other methods used to describe relationships between curated RPairs from the KEGG database (8) and ob- chemical compounds that participate in metabolic reac- tained a precision value of 81% for the full set. tions are so-called atom mappers, which automatically Because our approach outlines the specific architec- compare compounds to locate group transfer. An inter- ture of each reaction, it can then be grouped according esting property of atom mappers is that they identify to its TS topology, grouping reactions of the same pat- structural transformations between single-compound tern into a single cluster of TSs (CTS). As a result, all pairs, allowing creation of reactant pairs (RPairs) such as the 6230 reactions analyzed clustered into 71 CTSs. those in the Kyoto Encyclopedia of Genes and Genomes Moreover, the reactions grouped in each CTS transfer (KEGG) RPAIR database, which are defined as “pairs of similar chemical groups. Based on this property, we compounds that have atoms or atom groups in common propose that our rule-based approach can be used as a on two sides of a reaction” [4]. RPairs have also been classifier of metabolic reactions. Given this observation, proven to be useful when combined with hierarchical we analyzed whether our CTSs were associated with en- clustering algorithms to elucidate relationships between zyme categories represented by the Enzyme Commission reactions and enzymes [5], ultimately defining groups of number (ECN) available for each reaction [10]. There- compounds with related metabolic pathways [6]. These fore, we measured the enrichment of each ECN in each methods have been developed as promising alternatives CTS and revealed that the ECNs naturally fit the chem- for pathway discovery [7] but are slow to automate [8] ical patterns disclosed by our protocol. and tend to require both a priori information and sec- ondary methods to handle special cases, such as com- Results pounds with rings, prior to the final step of manual Generation of tree structures curation. Nonetheless, atom transfer is still a useful tool Cofactors used in metabolic reactions are compounds for predicting enzymatic catalysis in reaction sets [9]. that suffer small transformations compared to other re- Considering the aforementioned properties of atom actants that are modified dramatically in one or more mappers and graph theory as tools for pathway discov- steps. One way to compare these changes within reac- ery, the first goal of our work was to implement a simple tions is to measure the difference in molecular weights and fast method to complement atom mappers with the between compounds. Using this observation, we devised aim of avoiding a manual curation step as much as pos- a computational protocol, called the “balance rule”, that sible. For this purpose, we performed a statistical com- estimates the similarity of compounds within a reaction. parison of the TS-pairs proposed by our method with The purpose of this rule is to distinguish groups of com- those in the RPAIR/RCLASS datasets, generating a pre- pounds participating in a reaction based on differences cision value that can be interpreted as the confidence of in their molecular weights (Fig. 1a). the predicted set of reactant-pairs. We propose that the As a first step, for each analyzed reaction, we assigned results obtained by our method will provide researchers the elements of the Cartesian product (ECPs) derived with sets of confident pairs, a property proven to show from the compounds on the left and right sides of the better performance in pathfinders. Most importantly, equation, estimating the relative mass for each (de- our results can be used as a guide to determine which scribed in more detail in the Methods section). The ECP reactions should be candidates for manual curation. with the smallest difference in relative mass was then Therefore, we systematically analyzed single-chemical separated from the rest of the reaction, and then all metabolic reactions and outlined the mutual associations ECPs of the reaction for which the compounds were of their compounds, representing each reaction as a tree previously selected were eliminated (Fig. 1a). The structure (TS). We then constructed a rule-based remaining ECPs were then subjected to the same process approach centered on two principles: (i) wide sets of until no ECPs remained or until the establishing of a dif- metabolic reactions involve compounds, such as pool ference between the relative ECP masses was not pos- metabolites and cofactors, that undergo small modifica- sible. Thus, at the end of the process, no compounds in tions; and (ii) some compounds, usually coenzymes or the analyzed reaction participated in more than one pool metabolites, are frequently associated with various ECP. The implications of this restriction are discussed in reactions in unrelated metabolic pathways. The first the next section. We represented rounds of separation Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 3 of 17 ab Fig. 1 Representation of the constructed rules and tree structure (TS) construction. We illustrate graphical representations of our rules and their applications. a Example of using the balance rule to generate a tree structure for reaction R00658, in which the substrate 2-phospho-D-glycerate (C00631) is transformed into phosphoenolpyruvate (C00074) and water (C00001), by selecting the element of the Cartesian product (ECP) C00631_ C00074 as the element with the smallest difference in molecular weight within the reaction. b Example of using the count rule to construct a tree structure for reaction R02090, in which the substrates ATP (C00002) and dGMP (C00362) are transformed into ADP (C00008) and dGDP (C00361), by selecting the element of Cartesian product C00002_C00008 as the more frequently represented ECP in the entire network. For the TS representation, nodes: the gray octagon defines the tree root (reaction), the squares define the CP node, the rhomboid defines lone compound nodes, and the white circle defines compounds. Edges: blue, split through balance; orange, split through count; thin line, node/compound link. For the reaction string format (RSF), we represent the split in a line described in detail in the Methods section and selection into a tree structure (TS) in which the tips reflects an internal reaction property, and the count rule is showed the selected ECPs that described the pairs and used when the balance rule fails to select a single ECP. loner compounds defining each reaction (Fig. 2). When To validate our approach, we used only the 6392 reac- applying the previous rules, we observed that for some re- tions that were fully described in the KEGG LIGAND data- actions, the establishing of mass differences for particular set that were available in 2015 (see Methods section) [11]. sets of ECPs was not possible. For these reactions, our From the original set, we discarded 1099 reactions with just protocol uses a second rule, called the “count rule”,that one substrate and one product, including 335 reactions cat- selects an ECP based on its frequency in the entire net- aloged as isomerizations, which occur when a substrate work (Fig. 1b). The notion behind this rule is that some atom is arranged into another position on the product. For compounds, such as coenzymes and pool metabolites, are example, in reaction R01518, 2-phospho-D-glycerate frequently associated with numerous reactions in many (C00631) is transformed into 3-phospho-D-glycerate unrelated pathways. This property has been used previ- (C00197) by isomerization. For this reaction, the only pos- ously to identify metabolic network hubs [2]. In our work- sible tree structure would be formed by both compounds flow, the balance rule is always applied first because it (C00631_C00197) alone, and we therefore decided not to Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 4 of 17 Fig. 2 Analysis of the metabolic network using the rule-based approach. We illustrate the general procedures of how we applied the balance and count rules to the KEGG database. The procedure starts with the compounds and reactions from the KEGG, and the rules are applied to each reaction to generate a tree structure. Tree structures with identical architecture organizations are clustered together. Finally, we calculate the RPAIR correlation and Enzymatic Commission number (ECN) abundance levels to evaluate the performance of the method include these transformations in successive analyses. The remain alone on either side) (Fig. 2). A graphical same splitting method was applied to the rest of the group, representation of these architectures is shown in including 642 reactions lacking an enzyme entry. Additional file 1:Table S2.The next step wastoval- The balance rule was able to describe 6291 of 6392 re- idate the proposed architectures and interpret their actions from the KEGG database, and the count rule was chemical properties. used for the 101 remaining reactions. Importantly, 15 of these 101 reactions, including reaction R00509, have a TS-compound pairs correlated to KEGG RPairs small number of reactants (one or two) that are trans- We considered the presence of compound pairs identi- formed into one or two products (Fig. 3a). In this case, fied for each reaction to evaluate the architecture of each the compound arrangements were selected using only TS. Compound pairs are also described in reactions as the count rule. The 86 remaining reactions that were in- reactant pairs in the KEGG RPAIR database, in which a volved in more complex transformations required both pair in a reaction represents group transfer interactions rules to generate the TS for each reaction. An example among compounds [4, 12]. The current version of the is presented in Fig. 3b for reaction R10376, which is in- KEGG is different from previous versions in that RPAIR volved in the biosynthesis of indole diterpene alkaloids. has been replaced by the RCLASS set, which contains The selection begins by separating the terpendole E only well-curated pairs according to the authors. (C20536) and 13-desoxyterpendole I (C20542) molecules Well-curated pairs are the “classification of reactions from the rest of the reaction to form the pair based on the chemical structure transformation patterns (C20536_C20542). The count rule was then used again of substrate-product pairs (reactant pairs), which are to separate NADP+ (C00006) and NADPH (C00005) represented by so-called RDM patterns” [11]. The 2015 from the other compounds in the reaction to form the version of the KEGG contains the original RPAIR, in pair (C00005_ C00006). Finally, the minimal differences in which RCLASS pairs and other pairs have weaker evi- molecular weight for the remaining compounds were suf- dence of being correct. We compared the RCLASS list ficient to complete the process using the balance rule. to our predicted pairs reaction by reaction to measure Another notable observation from our approach the level of correspondence. From a total of 11,093 was that the TSs found for the 6392 reactions re- RPairs in the 6392 reactions, 8966 of our tree structure-- vealed common architectures that could be clustered compound pairs (TS-compound pairs) matched an RPair. considering the order of splitting within the reaction, We used a Bayesian approach to infer the probability of a and the two rules were used by the protocol. This TS-compound pair associating with its corresponding clustering procedure resulted in a set of 71 different RPair to measure our prediction confidence levels. As ex- TS arrangements that included all the reactions tensively described in the Methods section, the confidence tested (Table 1 and Additional file 1:Table S2).Not- interval was determined by the 2.5 and 97.5% quantiles. ably, the tips of each TS were composed of two Therefore, we found that the correspondence of the architecture types: pairs (two compounds associated TS-compound pairs over the entire RCLASS set was 0.81, directly through ECP selection by leaving exactly one a value that should be interpreted as the precision of our compound on each side of the equation) and loner method (Fig. 4a and Table 2) and that we consider very compounds (compounds from pair selection that good given the simplicity of the method. Notably, some Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 5 of 17 Fig. 3 Representation of three tree structure examples. Panel a shows the split of reaction R00509 (subsection a1) as represented in the KEGG database. Subsection a.2 shows a graphical representation of a TS. The TS split representations are viewed from the top down, and compact strings are read left to right. Nodes: the gray octagon defines the tree root (reaction), the squares define the CP node, the rhomboid defines lone compound nodes, and the white circle defines compounds. Edges: blue, split through balance; orange, split through count; thin line, node/compound link. In subsection a.3, we represent the reaction split in a string format constructed as explained in the Methods section. In panels b and c, we show the TS graphical representation of reactions R10376 and R00025, respectively. For both reactions, the same caption descriptions for panels a2 and a3 are applied to each respective subsection RPairs were constructed by mating one compound with are conformed by pairs sharing a common compound. This more than one of the same reaction because an atom type is an important difference between our approach and the can be shared by more than one compound, which is the manner in which atom mappers construct reactant pairs. case in the KEGG because a total of 1157 KEGG reactions An example showing the result of this difference is reaction Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 6 of 17 Table 1 Patterns identified and their condensed representations TS-cluster identifier CTS patterns Number of reactions per CTS Rule used to generate the pattern CTS-1 >(!(C_C)(C_C)) 1627 Balance CTS-2 >(!(C)(C_C)) 1059 Balance CTS-3 >(!(C)(!(C)(C_C))) 1028 Balance CTS-4 >(!(C)(!(C_C)(C_C))) 897 Balance CTS-5 >(!(C)(!(!(C)(!(C)(C_C)))(C_C))) 439 Balance CTS-6 >(!(C)(!(!(C)(C_C))(C_C))) 349 Balance CTS-7 >(!(C_C)(!(C_C)(C_C))) 150 Balance CTS-8 >(!(C_C)(!(C)(C_C))) 143 Balance CTS-9 >(!(!(C)(C_C))(!(C)(C_C))) 103 Balance CTS-10 >(!(C)(!(!(C_C)(C_C))(C_C))) 97 Balance CTS-11 >(!(C)(!(!(C_C)(!(C)(C_C)))(C_C))) 82 Balance CTS-12 >(!(C)(!(C)(!(C)(C_C)))) 66 Balance CTS-13 >(!(!(C)(C_C))(C_C)) 54 Balance CTS-14 >(!(C_C)(!(C)(!(C)(C_C)))) 42 Balance CTS-15 >(!(!(C)(!(C)(C_C)))(C_C)) 29 Balance CTS-16 >(!(!(C_C)(!(C)(C_C)))(C_C)) 26 Balance CTS-20 >(!(C)(!(C)(!(C)(!(C)(C_C))))) 12 Balance CTS-21 >(!(C)(!(C_C)(!(C)(C_C)))) 12 Balance CTS-23 >(!(C_C)(!(C_C)(!(C)(C_C)))) 8 Balance CTS-24 >(!(C_C)(!(!(C)(C)_C))(C_C))) 8 Balance CTS-25 >(!(C)(!(!(!(C)(C_C))(C_C))(C_C))) 5 Balance CTS-26 >(!(!(C)(C_C))(!(C_C)(C_C))) 5 Balance CTS-29 >(!(C)(!(!(C)(!(C)(!(C)(C_C))))(C_C))) 4 Balance CTS-30 >(!(C)(!(!(C)(C_C))(!(C)(C_C)))) 4 Balance CTS-31 >(!(!(C_C)(C_C))(C_C)) 4 Balance CTS-33 >(!(C)(!(!(C)(!(C_C)(!(C)(!(C)(C_C)))))(C_C))) 3 Balance CTS-34 >(!(C)(!(!(C)(!(!(C)(C_C))(C_C)))(C_C))) 3 Balance CTS-38 >(!(C)(!(!(C)(C_C))(!(!(C)(C_C))(C_C)))) 2 Balance CTS-39 >(!(C)(!(C_C)(!(!(C)(C_C))(C_C)))) 2 Balance CTS-40 >(!(C)(!(!(C)(!(C)(C_C)))(!(C)(C_C)))) 2 Balance CTS-41 >(!(!(C)(C_C))(!(!(C)(C_C))(C_C))) 2 Balance CTS-42 >(!(!(C)(!(C)(C_C)))(!(C)(!(C)(C_C)))) 2 Balance CTS-44 >(!(C)(!(C)(!(C)(!(C)(!(C)(C_C)))))) 2 Balance CTS-45 >(!(C_C)(!(C)(!(C_C)(C_C)))) 2 Balance CTS-46 >(!(!(C)(!(C)(C_C)))(!(C)(C_C))) 2 Balance CTS-47 >(!(!(C_C)(C_C))(!(C)(C_C))) 2 Balance CTS-50 >(!(!(C)(!(C)(C_C)))(!(C_C)(!(!(C)(C_C))(C_C)))) 1 Balance CTS-51 >(!(C)(!(!(C)(!(!(C)(C_C))(!(C)(C_C))))(C_C))) 1 Balance CTS-52 >(!(!(C)(!(C)(C_C)))(!(C)(!(C)(!(C)(!(C)(C_C)))))) 1 Balance CTS-53 >(!(C)(!(!(C)(!(C)(!(C_C)(C_C))))(C_C))) 1 Balance CTS-54 >(!(C)(!(!(C_C)(C_C))(!(C)(!(C)(C_C))))) 1 Balance CTS-55 >(!(C_C)(!(!(C)(!(C)(C_C)))(!(C)(C_C)))) 1 Balance CTS-57 >(!(C)(!(C_C)(!(C)(!(C_C)(C_C))))) 1 Balance CTS-58 >(!(C)(!(!(C)(!(C_C)(C_C)))(C_C))) 1 Balance Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 7 of 17 Table 1 Patterns identified and their condensed representations (Continued) TS-cluster identifier CTS patterns Number of reactions per CTS Rule used to generate the pattern CTS-59 >(!(C_C)(!(C)(!(C_C)(!(C)(C_C))))) 1 Balance CTS-60 >(!(C_C)(!(C_C)(!(C)(!(C)(C_C))))) 1 Balance CTS-61 >(!(C_C)(!(C_C)(!(C_C)(C_C)))) 1 Balance CTS-62 >(!(!(C)(C_C))(!(C)(!(C_C)(C_C)))) 1 Balance CTS-65 >(!(!(C)(C_C))(!(C)(!(C)(C_C)))) 1 Balance CTS-66 >(!(!(C)(!(C)(!(C)(C_C))))(C_C)) 1 Balance CTS-19 >(!!(C_C)(C_C)) 14 Count CTS-71 >(!!(C)(C_C)) 1 Count CTS-17 >(!!(!(C)(C_C))(!(C_C)(C_C))) 24 Count-Balance CTS-18 >(!!(!(C_C)(C_C))(!(C)(!(C)(C_C)))) 16 Count-Balance CTS-22 >(!!(C_C)(!(C)(C_C))) 10 Count-Balance CTS-27 >(!!(!(C)(C_C))(!(C)(!(C)(C_C)))) 5 Count-Balance CTS-28 >(!(C)(!!(C)(C_C))) 5 Count-Balance CTS-32 >(!!(!(C)(C_C))(!(C)(C_C))) 4 Count-Balance CTS-35 >(!!(!(C)(!(C)(C_C)))(!(C)(!(C)(C_C)))) 3 Count-Balance CTS-36 >(!!(!(C_C)(C_C))(C_C)) 3 Count-Balance CTS-37 >(!!(!(C)(C_C))(C_C)) 3 Count-Balance CTS-43 >(!!(!(C)(C_C))(!(!(C)(C_C))(C_C))) 2 Count-Balance CTS-48 >(!!(C_C)(!(C)(!(C)(C_C)))) 2 Count-Balance CTS-49 >(!!(!(C)(!(C)(C_C)))(C_C)) 2 Count-Balance CTS-56 >(!!(!(C)(!(C_C)(C_C)))(!(C)(!(C)(C_C)))) 1 Count-Balance CTS-63 >(!!(C_C)(!!(!(C)(!(C)(C_C)))(C_C))) 1 Count-Balance CTS-64 >(!!(!(C_C)(!(C)(C_C)))(!(C)(C_C))) 1 Count-Balance CTS-67 >(!!(C_C)(!(!(C)(C_C))(C_C))) 1 Count-Balance CTS-68 >(!!(!(C_C)(!(C)(C_C)))(C_C)) 1 Count-Balance CTS-69 >(!!(C_C)(!(C_C)(C_C))) 1 Count-Balance CTS-70 >(!(C)(!!(C_C)(C_C))) 1 Count-Balance We describe the general topologies of the reactions found by our approach. Column 1 indicates the CTS identifier, and Column 2 shows the general pattern found for groups of reactions. “>” indicates the root of the tree structure; “!” indicates use of the balance rule; “!!” indicates use of the count rule;and “C” indicates a compound. A pair is described as (C_C), and a loner compound is (C). The number of parentheses around the pair or loner compound indicates its depth in the tree and the number of partitions employed for separation. Column 4 describes the rule or rules used to generate the arrangements R00025, where ethylnitronate (C18091) is catalyzed to nitrite method keeps this compound in the architecture and thus (C00088) and acetaldehyde (C00084) by the enzyme nitro- shows its contribution to the reaction. Because our method nate monooxygenase (Fig. 3c). In this reaction, ethylnitro- generates unique compound combinations as the final re- nate is decomposed into three reactants pairs, sult, the undetected pair composed of acetaldehyde and RC00126-(C00061_C01847), RC02541-(C00084_C18091) ethylnitronate (RC02541-(C00084_C18091)) was eliminated and RC02759-(C00088_C18091), which correspond to as apairat someother pointinthe processdue to itspres- (FMN_reduced_FMN), (acetaldehyde_ethylnitronate) and ence in another ECP with a bigger mass difference com- (nitrite_ethylnitronate), respectively. In this reaction, our pared with the remaining products in the tips of the tree. protocol positively detected FMN_ reduced_FMN as a pair We also observed that our protocol generated 11,093 (C00061_C01847) and an additional group ((_C00001) TS-compound pairs and from these 2127 are absent in the (C00088_ C18091)) composed of water as a loner com- RPAIR data set, making these pairs candidates for manual pound (_C00001) and the pair ethylnitronate_nitrite curation to establish their confidence. (C00088_ C18091) matching an RPair in RCLASS Because our protocol naturally clustered each reaction RC02759. Notably, water (C00001) is not associated with according to its architectural arrangement into 71 any RPair for reaction R00025. As an advantage, our groups, we estimated the precision of CTSs for which Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 8 of 17 ab c Fig. 4 Correlation between TS pairs and RPAIRs. We present three examples of posterior distributions of the inferred parameter θ. a Total TS pairs vs. the entire RPAIR dataset. b Example showing the sharp-peaked distribution observed in CTS-1. c Example of a broad distribution showing the effect of a high correlation level in a group with a small number of reactions Table 2 Comparison of TS-compound pairs versus RPairs Cluster ID Number Number of Number of Number of Number Number of Number of Inferior Superior Average of total reactions with reactions with reactions with of RPAIRs matching mismatched interval interval value of reactions only matching no matched mixed pair in a CTS TS-compound TS-compound 0.025 0.975 the interval pairs pairs matches pairs pairs General 6392 4869 618 905 11,193 8966 2127 0.8006 0.8153 0.8070 CTS-1 1627 960 391 276 3254 2196 1058 0.6587 0.6909 0.6748 CTS-2 1059 1016 43 0 1059 1016 43 0.9467 0.9704 0.9586 CTS-3 1028 926 102 0 1028 926 102 0.8818 0.9183 0.9 CTS-4 897 867 8 22 1794 1756 38 0.9717 0.985 0.9783 CTS-5 439 435 1 3 878 873 5 0.9884 0.9981 0.9933 CTS-6 349 341 0 8 698 690 8 0.9794 0.995 0.9872 CTS-7 150 1 17 132 450 206 244 0.412 0.5039 0.458 CTS-8 143 57 7 79 286 193 93 0.6195 0.7278 0.6737 CTS-9 103 51 3 49 206 151 55 0.6707 0.791 0.7309 CTS-10 97 2 0 95 291 196 95 0.6187 0.7261 0.6724 CTS-11 82 0 0 82 246 163 83 0.6024 0.7203 0.6613 CTS-12 66 58 8 0 66 58 8 0.7906 0.9453 0.868 CTS-13 54 39 7 8 108 86 22 0.7158 0.8664 0.7911 CTS-14 42 21 5 16 84 58 26 0.5882 0.7841 0.6862 CTS-15 29 28 0 1 58 57 1 0.9373 0.9996 0.9684 CTS-16 26 0 0 26 78 29 49 0.2687 0.4812 0.3749 CTS-17 24 0 0 24 72 48 24 0.5545 0.77 0.6623 CTS-18 16 8 0 8 48 37 11 0.6434 0.877 0.7602 CTS-19 14 5 6 3 28 13 15 0.2867 0.6467 0.4667 CTS-20 12 11 1 0 12 11 1 0.7151 0.9977 0.8564 CTS-21 12 4 3 5 24 13 11 0.3449 0.7318 0.5384 CTS-22 10 6 1 3 20 15 5 0.5443 0.9085 0.7264 We compared the numbers of hits and fails of our predicted pairs with RPairs in the KEGG RPAIR dataset. The table presents precision estimations based on the calculated posterior probabilities of each group. The confidence interval and average value are shown in the last 3 columns Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 9 of 17 the number of reactions was sufficiently large to perform Tree structure distribution reveals that the predicted patterns a reasonable statistical inference. Therefore, estimations yield reaction groups with similar chemical events were performed only for CTSs with at least 10 reactions The 2015 version of the KEGG retained the classification with predicted pairs. The results of our precision ana- of RPairs proposed by Kotera M. et al. regarding “main- lysis of 6291 reactions clustered in 22 different CTSs are pairs (describing main changes in substrates), cofac pairs shown in Table 2, with the CTSs numbered from 1 to 22 (describing changes in cofactors for oxidoreductases), (1 represents the group with the highest number of reac- trans pairs (focused on transferred groups for transfer- tions, and 22 represents the group with the smallest ases), ligase pairs (describing the consumption of number of reactions). As expected from the group size nucleoside triphosphates for ligases), and leave pairs (de- differences, the results presented in Table 2 show that scribing the separation or addition of inorganic com- the CTSs had different levels of precision. We observed pounds for enzymes, such as lyases and hydrolases)” [4]. eight CTSs encompassing 3879 reactions with an aver- These classes link compound pairs within a reaction age confidence value higher than 0.80. For example, for with the enzyme activity exerted upon them, indicating CTS-5, we predicted only five pairs that were different that the same pair could be assigned to a different class from those assigned by the KEGG out of 878 total reac- depending on the reaction in which it participates. tions (Table 2), indicating that every pair found in this Therefore, our next step was to analyze our predicted CTS had a very high probability of being an RPair. We pairs in this pair-reaction context. We labeled each pair also noticed that all the reactions in CTS-5 were oxidor- with its RPAIR class (main, cofac, trans, leave or ligase), eductions reactions in which the incorporated oxygen is and we also labeled each pair according to the propor- not necessarily derived from molecular oxygen. tion of predicted RPairs in its corresponding reaction, Another 2198 reactions were grouped in CTSs with resulting in four categories. The first category included average precision values ranging from 0.79 to 0.60 (Table reactions in which the predicted pairs (EPP) entirely fit 2). For example, CTS-1 had the highest number of mem- the published RPairs. The second and third categories bers with the same architecture arrangement (1627 reac- involved reactions with at least one hit pair and one or tions). From the 3254 possible pairs in this CTS, we more failed pairs. We labeled these reactions as mixed predicted that 2196 matched an RPair, representing pairs (MP) as they produced mixed-positive pairs (MPP) two-thirds of the group. This gave CTS-1 an average and mixed-failed pairs (MFP). The last group consisted probability of 0.67, which also showed a sharp-peaked dis- of pairs in reactions in which no RPair was found, called tribution (Fig. 4b). Therefore, this CTS provided a large failed pairs (FP). We labeled the pairs using three confi- amount of information supporting our conclusion that dence ranges taken from the Bayesian probability estima- two of its three pairs are dependable RPairs, and the tion previously defined for their respective CTSs. The CTSs remaining one-third of the group (1058) represents candi- were ranked by their mean probability x,where treeswith dates for validation through manual curation. We found ¨high confidence¨ were in the range x ≥ 0.8, trees with that other CTSs in the same confidence range showed “medium confidence” were between 0.8 > x ≥ 0.6, and CTSs broad distributions, such as CTS-18 (Fig. 4c), which ex- in which 0.6 > x were labeled “low confidence”. hibited a value of 0.64 for the 2.5% interval and 0.87 for Figure 5 shows the frequency of the reactant pairs the 97.5% interval. In this case, in which 37 of 48 possible grouped according to their RPAIR class (main, cofac, RPairs were matched, we had less information supporting trans, leave and ligase), our hit-fail categories (EPP, MPP, our estimate, and the greater uncertainty could be ex- MFP and FP) and our confidence ranges (high, medium plained by the small sample size (16 reactions). Despite and low). The EPP category included a total of 7672 pre- the small sample size, the proportion of hits revealed that dicted pairs, and notably, 48% of these matched a main CTS-18 had an intermediate probability of yielding a posi- RPair with high confidence. Interestingly, another 21% tive RPair. Nevertheless, inspection of this CTS showed of the EPP pairs fell in the cofactor class (cofac) with that all the reactions in the cluster are catalyzed by ligases high confidence. A very small group, less than 1% of the classified by enzyme commissions in classes 6.3.4 and EPP category, belonged to the ligase and leave classes, 6.3.5. Furthermore, two reactions involved in the exhibiting a very small proportion of pairs with one and biosynthesis of neomycin, kanamycin and gentamicin, four members, respectively. In the same category, the R10767 and R10768, are catalyzed by a tobramycin other 26% of the pairs with medium confidence were carbamoyltransferase of the group 6.1.2.2, indicating that main pairs, and another 2.7% were cofac pairs. Similar despite the medium significance, compounds of reactions behavior was observed for the MPP category because the in CTS-18 undergo similar transformations catalyzed by main and cofac categories predominated in 1120 pairs; similar enzymes. The next step was to determine whether however, in contrast to the EPP category, the concordant the reaction groups in the remaining CTSs also exhibit pairs exhibited mainly a medium confidence level. An- similar chemical transformations. other 240 pairs in the low-confidence group were mostly Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 10 of 17 could not be provided or when we obtained results that were inconsistent with the proposed RPairs. The first ex- ample is reaction R10088 from CTS-32 in which we ob- tained mixed pairs. As shown in Fig. 6a, this reaction entry describes the conversion of D-Ribose 5-phosphate and D-Glyceraldehyde 3-phosphate (G3P) into Pyridoxal phosphate(vitaminB6) in thepresenceofanammoniamol- ecule, yielding four molecules of water and inorganic phos- phate (Pi). Our resulting TS shows that vitamin B6 is paired to the D-ribose 5-phosphate molecule (C00117_C00018) as proposed in the RCLASS ID RC03049 (C00018_C00117). The overall TS partially reflects the proposed mechanism, as showninFig. 6a. The vitamin is proposed to be assem- bled from ribose, ammonia and glyceraldehyde 3-phosphate (G3P), with the phosphate group from the ribose molecule leaving the complex [13]. In the TS, vitamin B6 is paired with the ribose molecule, with ammonia as the closest loner compound, which is consistent with the overall mechanism up to the intermediary before the excision of Pi. G3P (C00118), however, is paired with Pi (C00009) because they Fig. 5 Analysis of TS pairs using RPAIR classes. The figure represents are more similar in terms of molecular weight, implying that the abundance of each TS pair as a function of the class proposed the Pi group that enters and leaves is the group in G3P, not by Kotera [4]. We categorized the pairs as being in reactions in which the group in the ribose molecule. The mapping proposed by the whole predicted pair entirely fit the published RPair (we called this the KEGG also pairs G3P with pyridoxal phosphate - EPP). The second group includes reactions representing at least one hit RC01783 (C00018_C00118) -, which is consistent with the with one or more fails, which we called mixed pairs, and this group was subdivided into positive (MPP) and failed pairs (MFP). The proposed mechanism [13]. Nonetheless, we should remem- last group contains reactions in which we failed to detect an RPair, called ber that our method does not consider molecular mecha- failed pairs (FP). We also arranged the TS pairs according to their mean nisms in performing the splitting in contrast to atom precision levels (x) estimated from the CTSs by Bayesian analysis. TS pairs mappers, which attempt to reflect the atom transitions be- in the precision range x ≥0.8wereclassified as “high confidence”,those tween chemical species in detail. Finally, an important trait in the range 0.80 > x ≥ 0.60 were classified as “medium confidence”,and those in the range x < 0.60 were labeled “low confidence” of the TS layout is the first splitting performed by the count rule, which reflects the fact that most of the mass in the pyridoxalmoleculeisalreadypresent in the intermediary distributed in the ligase category. Pairs in the MFP prior to Pi excision. One important aspect of the reaction group (446 pairs) tended to belong to the cofac and leave described in Fig. 6a is that the enzyme Pdx1 (pyridoxal categories, both with medium confidence. Finally, as 5′-phosphate synthase, ECN 4.3.3.6) prefers the substrate shown in Fig. 5, the approach failed to provide a con- G3P and joins the molecule through imine formation, which cordant RPair for 576 reactions in the FP category. Inter- is an important trait for understanding the mechanism and estingly, these FP pairs were classified by Kotera as trans reaching the proposed conclusion. We think that consider- and leave pairs, and 383 were clustered into CTS-1 (data ing enzymatic mechanisms will be valuable for improving not shown). Furthermore, they were shown to be prefer- our method. entially catalyzed by hexosyltransferases, pentosyltrans- Another interesting example, shown in Fig. 6b,is ferases and phosphotransferases, with an alcohol group R07795 in CTS-4. The reaction describes the conversion as the acceptor, and by some enzymes other than methyl of 3-sulfocatechol (C06336) into 2-hydroxymuconate groups that transfer alkyl or aryl groups. (C02501). This reaction is catalyzed by the enzyme cat- The results presented thus far provide evidence that echol 2,3-dioxygenase (ECN 1.13.11.2), which also cata- our rules are exceptional in defining pairs in reactions lyzes R04089 - the conversion of 2,3-dihydroxytoluene catalyzed by oxidoreductases. In contrast, these rules into 2-hydroxy-6-keto-2,4-heptadienoate - and reactions have limitations in establishing the transitions of specific R05295, R05404 and R05406, all of which are implicated types of transferases, lyases and hydrolases. The results in conversions of different chemical species of catechol obtained for these reactions prompted us to perform involving aromatic biodegradation. All these reactions, manual curation with the aim of establishing the degree except for R07795, were clustered in CTS-2, which had to which the mechanisms proposed in the literature fol- a mean precision of 95%. Notably, the TS-pairs in these low our proposed splitting when a statistical description reactions match the reactions’ RCLASS pairs, which is Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 11 of 17 Fig. 6 Manual curation of reactant-pairs. Panel a illustrates the proposed mechanism for reaction R00018 and its tree structure (TS). In this reaction, D-Ribose 5-phosphate (red) and D-Glyceraldehyde 3-phosphate (blue) are converted into Pyridoxal phosphate in the presence of an ammonia molecule (green). Figure adapted from reference [13]. Panel b illustrates the proposed mechanism for reaction R07795 and its TS. This reaction is the conversion of 3-Sulfocatechol into 2-Hydroxymuconate. Figure adapted from reference [14]. The pairs in TSs that follow that reaction’s proposed mechanism are marked by a checkmark. The TSs are also shown in string format (RSF), (see Methods) important because these reactions yield an additional the KEGG data, and we also propose that this pair product despite the similarity in their ring cleavage should be considered incorrect. mechanism by incorporating oxygen during biological A count of the pairs recovered in the CTSs that were oxidations of these organic substrates [14]. In contrast, not statistically evaluated also reveals some patterns. in R07795, oxygen is incorporated into 3-sulfocatechol Overall, 194 of the 324 pairs in these reactions are con- through an attack that releases sulfite, which we paired firmed RCLASS entries. However, the proportion of pair with oxygen (C00007_C00094). The 2015 version of matches is not homogeneous, which is similar to the 22 RPAIR included this pair as a leave pair, but RCLASS ig- evaluated CTSs. For example, in CTS-32, which includes nores it due to mechanistic inconsistency. The mechan- four reactions, seven of eight pairs are confirmed ism proposed by Junker F et al. [14] is consistent with RCLASS entries. Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 12 of 17 The results presented in this section, determined using This result can be explained by the fact that some reac- the Kotera classification, suggest that reactions in the tions are catalyzed by more than one enzyme or by multi- CTSs might be correlated to specific catalytic conver- functional enzymes. For example, CTS-2 has reactions sions. We tested this hypothesis with the aim of proving catalyzed by the enzyme chloromuconate cycloisomerase whether our rules naturally cluster reactions that share (EC 5.5.1.7), which is cataloged as both an isomerase and the same ECN within a CTS because this number is an intramolecular lyase due to its capacity to release free used to classify reactions in terms of the enzymes by hydrogen chloride when 2-chloro-cis, cis-muconic acid is which they are catalyzed. transformed into cis-4-carboxymethylenebut-2-en-4-olide [15]. Therefore, finding a representation of more than one Tree structures cluster into general enzyme patterns enzyme with similar catalysis types despite classification in Analysis of our TS-compound pairs and CTSs suggested different ECNs should not difficult. that our approach can cluster reactions catalyzed by As previously mentioned, isomerization reactions with enzymes belonging to similar classes as observed for the only one substrate and one product were not considered ligases of CTS-18. To evaluate the generality of this ob- because they form trivial tree structures. Nonetheless, servation, we associated the ECNs corresponding to each isomerizations embedded in reactions with more than reaction. ECNs are composed of four digits separated by one compound on each side of the equation were con- periods, and the numbers represent a progressively spe- sidered and mapped to specific CTSs (CTS-2, CTS-3 cific classification of each enzyme. The first digit groups and CTS-4). Similar distributions were observed for enzymes into general types of catalysis. For example, other ECNs, such as ligases represented in CTS-7, ECN type 1 (ECN-1) describes oxidoreductases that CTS-9, CTS-14 and CTS-18 and hydrolases represented catalyze oxidoreduction reactions in which hydrogen in CTS-1, CTS-2 and CTS-3. and oxygen atoms or electrons are transferred from one As shown in Fig. 7, our CTSs tended to concentrate par- substance to another. The subsequent digits add levels ticular ECNs. To test whether this over-representation of specificity; ECN-1.1, for example, contains oxidore- was statistically significant, we calculated the enrichment ductases that act on donor alcohol groups. Figure 7 of each ECN in each CTS considering only the first ECN shows the frequency of each ECN associated with each digit and CTSs grouping more than 10 reactions using reaction grouped in each CTS considering only its first Fisher’s exact test. We considered enriched CTSs to have digit. The bar plot in this figure shows that all CTSs Pvalues <0.05. Figure 8 shows the significantly enriched were concentrated mainly in the ECNs of a single class, groups represented in terms of logarithmic odd ratios, and CTSs with higher numbers of members tended to which are useful for evaluating the strength of enrichment. include other ECNs of other classes in lesser proportions. The CTSs considered enriched in an ECN category were Fig. 7 Correlation of tree structure clusters with general enzymatic categories. The clusters of tree structures (CTSs) tended to naturally Fig. 8 Enrichment of enzyme categories into tree clusters. We illustrate group enzymatic categories provided by the Enzyme Commission the results of the significance of each enzyme category within the numbers (ECNs). This figure presents this tendency using the first digit clusters of tree structures (CTSs). The graph shows the logarithm of the ECN class, in which enzymes are classified as oxidoreductases of the odds ratio, which represents the strength of the enrichment (ECN-1), transferases (ECN-2), hydrolases (ECN-3), lyases (ECN-4), (spectrum color bar), and the number of hits in the group (point density). isomerases (ECN-5) and ligases (ECN-6) We only show CTSs with False discovery rate < 0.05 Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 13 of 17 those displaying a false discovery rate, reported as the observation, we applied Fisher’s exact test while considering logodd ratios ≥ 0.5. the second and third ECN digits, and our results showed As shown in Figs. 7 and 8, the probability of having re- that at these class levels, the remaining CTSs, including actions catalyzed by specific enzyme types increased as CTS-2 and CTS-17, tended to be significantly enriched in the number of reactions in the CTS decreased, as shown more specific ECNs. The results of these tests are by the significant enrichment in transferases, ligases and shown in Additional file 1: Figures S1 and S2. oxidoreductases in CTS-20, CTS-21 and CTS-22, re- Considering the presented results, we conclude that spectively. Moreover, 17 of the 22 CTSs were signifi- our proposed rules showed that a significant number cantly enriched in only one ECN, and another three of the reactions evaluated reflected a clear biochem- were significantly enriched in only two ECNs; CTS-2 ical link between tree structures and specific chemical has lyases (ECN-4) and isomerases (ECN-5), CTS-4 has events. oxidoreductases (ECN-1) and isomerases (ECN-5), and CTS-7 has oxidoreductases (ECN-1) and ligases (ECN-6). Discussion An analysis of CTS-2, which is significantly enriched in We described a clustering analysis of enzymatic reac- ECN-5 and ECN-4, showed that its isomerases belong to tions described in the KEGG database using our the subclasses 5.5.1.7 (a chloromuconate cycloisomerase) rule-based approach. Our results allowed us to classify and 5.5.1.11 (a dichloromuconate cycloisomerase). These different metabolic reactions into patterns, revealing as- enzymes catalyze eight reactions for the degradation of sociations among compounds at a glance. The rules of chlorocyclohexane or chlorobenzene, respectively, and our approach were implemented as a protocol that first they release a hydrogen chloride molecule in all cases. locates and separates groups with the smallest differ- Surprisingly, when analyzing the KEGG database in more ences among compounds within a reaction. Thus, using detail, we found two other reactions (R08119 and R09215) our approach, the most different compounds are identi- catalyzed by a chloromuconate cycloisomerase that were fied by successive elimination. The logic behind this ap- not grouped in CTS-2. Interestingly, the isomerization proach is that some compounds, such as coenzymes and processes in R08119, which transforms 2-fluoro-cis, pool compounds (such as S-adenosyl-L-methionine cis-muconate into 5-fluoromuconolactone, and R09215, [16]), tend to undergo less dramatic changes during a in which 3,5-dichloro-2-methylmuconate is transformed reaction (the basis of the balance rule) and tend to be into 3,5-dichloro-2-methylmuconolactone, do not yield associated with more enzymes in many unrelated path- halogens as reaction products because fluorine and chlor- ways (the basis of the count rule). Other methods, such ine remain bound to the products, respectively. This is a as compound mappers, highlight atom exchanges very good example showing that our rule-based approach between substrates and products through heuristic can separate very similar reactions in CTSs with very high graph-matching algorithms, which requires knowledge precision levels (mean 0.95, see Table 2) even if they are of the reaction modifications. Water, for example, is re- catalyzed by the same enzyme. moved from the analysis to increase the detection of Figure 8 shows that of the 17 significantly enriched oxygen and hydrogen traces at the cost of losing cataly- CTSs, the clusters CTS-16, CTS-17, CTS-18 and sis conducted by hydrolysis [3, 12, 17]. In contrast, our CTS-22 were most significantly enriched for specific approach does not remove any compound unless it is ECN categories. Other groups, such as CTS-5 and present in an unbalanced reaction or in a reaction in CTS-9, were almost homogenous because they had only which the subscripts are not well defined, such as reac- one or two reactions with ECNs of another class. tion R00001 (polyphosphate + n H O < => (n + 1) oligo- Figure 8 shows that despite significant enrichment in phosphate). Therefore, we were able to analyze a wider only one ECN class at the first digit, CTS-20 did not range of compounds within reactions. have one of the highest odd ratios due to the small size Our analysis also showed that despite the complexity of the group, which had only 10 elements. In contrast, of enzyme catalysis, reactions with similar catalytic pat- CTS-5 clustered 482 reactions, 480 of which were classi- terns were grouped through our TSs, which outlined the fied in ECN-5. chemical transformations of substrates and products Careful examination of the 17 CTSs with statistically sig- within reactions. We demonstrated that the TSs pre- nificant enrichment in one ECN revealed a tendency to served pair architectures comparable to those of reactant concentrate reactions with similar catalysis types, even at pairs of the RPAIR dataset [12], which were based on the second digit as shown for CTS-2. Other examples in- identification of the atom type [8]. In contrast, our ap- clude reactions in CTS-17 catalyzed by enzymes of subclass proach based on two simple rules performed well, par- 1.4.1 and involved in amino acid oxi-reductions. Reaction ticularly for pairs classified in the RPAIR set as main, R00025 served as the exception because it is catalyzed by cofac (redox cofactors) and ligases (mainly phosphoryla- nitronate monooxygenase (1.13.12.16). Driven by this tions). Our protocol was less precise when describing Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 14 of 17 small groups of pairs cataloged as trans and leave, pathways and showed that the metabolic network with specifically transferring compounds such as hexose, the best performance combines biochemical knowledge pentose, alkyl or aryl groups and a small group of phos- encoded in the KEGG RPAIR with a weighting scheme photransferases. These reactions were found mainly in that penalizes highly connected compounds. This is con- CTS-1 and CTS-2; therefore, different rules need to be sistent with the pair-reaction classification that we established or a manual curation step should be imple- tested, which shows that most of the our EPPs were mented to verify these reactions. main RPairs. We believe that incorporation of our We also found that for a particular set of reactions, TS rule-based approach into pathfinders will substantially pairs were predicted that did not correspond to the pre- improve the performance of these methods as our dicted RPairs, most of which were catalogued by Kotera method could provide the relative relevance of com- as trans or leave [4]. R06726 (retronecine + 2 L-isoleu- pound pairs within reactions. cine <= > senecionine, grouped in CTS-2) serves as a Kotera and coworkers [18] and Rahman et al. [9] de- good example of a reaction with inconsistent pairs. In veloped methods to predict reactant pairs using similar- this case, we predicted the pair (C0047_C06176, ity between reaction centers, proving the feasibility of L-isoleucine_senecionine) by using a balance reaction grouping reactions by their ECNs through the chemical transforming two L-isoleucine molecules. In comparison, transformations of compounds. Our approach can clus- the KEGG pairs a retronecine molecule with the senecio- ter metabolic reactions by their general patterns, result- nine in an RPair given that the molecules have more sig- ing in groups with significant enrichment in specific nificant common atom types in contrast to L-isoleucine. ECNs, even at the third digit. We found secondary ECN Another five of six reactions in CTS-2 with a predicted classes in smaller proportions of CTSs, which does not pair according to our approach could not be compared disprove the above observation because individual reac- due to the lack of an RPair. Three of these reactions tions can be linked to several ECNs. Furthermore, slight (R07652, R07916 and R10177) were involved in geranyl variations in these classes are expected due to factors and farnesyl transformations, which are catalyzed by en- such as multiple domain enzymes and enzyme promis- zymes that transfer acyl or aryl groups from the substrate cuity, which has been documented as a common to the product. Among these reactions, the ECPs defined phenomenon [19]. as more similar included compounds with acyl or aryl Statistical evaluation performed using the false discov- groups because they have similar molecular weights and ery rate, represented as the odds ratio logarithm, showed are separated by diphosphate, a compound with a smaller that 10 of the 22 CTSs used in this analysis are enriched mass compared with that of the others within the reaction. in oxidoreductases, which we also proved when testing These reactions in the KEGG were not assigned an the Kotera classification [4]. However, we discarded all RPair because they involve multiple steps. For this reactions without an assigned ECN, and we predicted 49 type of reaction, careful curation is required to evalu- CTSs that clustered 113 reactions that were not used in ate the consistency of our partitioning. Fisher’s exact test. These 49 CTSs had less than ten re- We can conclude that our rules, which are based on actions, and their small sample sizes made statistical intrinsic biochemical reaction properties and interac- tests less convenient. Therefore, the pair architectures tions of compounds across the whole network, organize found in these reactions are candidates for manual cur- reactions in individual tree structures, which is useful ation to test their reliability. Overall, we propose that for clustering considering only the proposed compound with confirmation of our curated set, our CTSs can arrangements. This approach is also helpful for easily serve as good guides for predicting reaction catalysis. identifying reactions with similar transformations. How- ever, for a small group of transformations, this simple Conclusions approach was insufficient, as shown in Fig. 5 for the Some methodologies based on graph theory organize MFP and FP categories. Therefore, this approach should compound networks into metabolic functional categories be improved for these cases in future work. without preserving biochemical pathways. Other methods Another advantage of our rule-based approach is based on chemical group exchange and atom flow trace that it does not eliminate any compounds during tree the conversion of substrates into products in detail, which assembly, allowing observation of the relative ordering is useful for inferring metabolic pathways. To analyze of compounds predicted by our rules within a reac- metabolic networks, we presented a rule-based approach tion and among reactions, a feature that is not avail- incorporating both methods that decomposes each reac- able in atom mappers. tion into architectures of compound pairs and loner com- Work published by Faust C. and coworkers [16] evalu- pounds that can be organized into tree structures. We ated the effect of using well-curated reactant pairs and found that the tree structure-compound pairs fit those other parameters on reconstruction of metabolic reported in the KEGG-RPAIR with excellent match Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 15 of 17 precision. The generated tree structures naturally clus- Weight metrics tered all reactions into 71 general reaction patterns of For the balance method, we used molecular weights compounds with similar chemical transformations. Fi- as reported in the COMPOUND dataset. For the nally, we evaluated the catalysis types in the clusters count method, we calculated the frequencies of all based on Enzyme Commission categories and revealed Cartesian products and used these numbers as the preferential use of enzyme classes. We demonstrated that weight measures. applying simple rules can allow identification of reaction patterns reflecting metabolic reactions that transform sub- Tree structure construction strates into products and the type of catalysis involved in For every reaction in the dataset, we constructed a TS. the transformation. The pairs generated using our We used Perl scripts to construct an algorithm based on rule-based approach can be incorporated as inputs to im- the calculated mass differences and frequencies of Cartesian prove the performance of pathfinders because products in the metabolic network to divide each reaction well-curated pairs provided better results, as demon- in the dataset. For this purpose, we created two rules, the strated by Faust and collaborators [7]. Our method is a re- balance and count rules. action classifier that can correlate EC numbers to our CTSs. Therefore, we propose that our method could con- Balance rule stitute a useful first step for the prediction of reaction ca- In a given reaction R, A and B are defined as the set of talysis, which can be conducted by simply incorporating chemical species on the left and right sides of R, respect- the discarded reactions that were not considered in this ively. Next, we define the operation P′(A) as the set of A analysis. Finally, using this last property, testing whether subsets, excluding the empty set. Elements of the Cartesian ′(A) ′(B) the enzymes clustered in the CTSs are evolutionarily re- product CP = P × P are the basic input of our proced- lated, will be useful, a task our group have started. ure. For each element (a, b) ∈ CP, excluding the whole In future work, we intend to model chemical transi- reaction, we calculate tions of generic reactions, such as those with discordant pairs. For these, we will introduce fewer generic rules d ¼jj ðÞ W −W ; ð1Þ ab a b that will consider other measures different from molecu- K lar weight, such as giving carbon atoms a greater weight where K = max (W , W ) and W is the sum of the because carbon flow is the main feature of metabolic a b a, b molecular weights for all the compounds in a given CP transformations. Additionally, as suggested by the results (ECP) element for sides A and B, respectively. for the reactions described in terms of their catalytic To clarify the operations described above, illustrating mechanism, we will include the roles of enzymes in the our procedure applied to the generic reaction C + D→ reaction layout. E + F nearly step by step is convenient. Therefore, P (A) ={{C}, {D}, {C, D}} and P (B)={{E}, {F}, {E, F}}. To be con- crete, when considering the ECP ({C}, {E, F}), Methods d ¼ jðW −ðW þ W ÞÞj , where W , W and Datasets C;EF C E F C E We analyzed datasets stored in the 2015 version of the W are the molecular weights of compounds C, E and F, KEGG database [11]. We collected information from the respectively. REACTION and COMPOUND datasets from the LIG- Next, we seek the minimum d among all the ele- ab AND collection. From REACTION, we collected IDs ments in the ECP. Because the reaction is balanced, in and equations (including coefficients) from 9910 reac- this case d = 0, we exclude the entire reaction. This whole tions along with ENZYMES and RPAIR data. For RPAIR, ECP with the minimum d will be considered the first we also used RCLASS identifiers and enzyme category branch of the TS and will be placed on the right side. In information regarding 15,349 entries; we retained only a subsequent step, we eliminate the remaining ECPs RPairs with RCLASS identifiers. From COMPOUND, we containing at least one ECP compound with the mini- collected the IDs, chemical formulas and molecular mum d. This is an iterative process that continues for weights of 7661 compounds. To limit our analysis to a each branch until a minimal difference among each ECP well-curated and verifiable set, all reactions that included cannot be established or no remaining ECPs are found. compounds from the GLYCAN dataset and reactions with incompletely described coefficients and subscripts Count rule were removed. We also removed 1099 reactions with The count rule is a means of selecting ECPs based on one compound on each side of the equation from the their occurrence in the metabolic network. This rule was analysis. Ultimately, 6392 curated reactions were in- inspired by other works reporting that some compounds cluded in the final set. are frequently used in metabolic transformations as Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 16 of 17 exchange currency [20–22]. For implementation of this the distribution of θ. The probability Ρ(θ| y) given y co- rule, we simply count the frequency of each ECP in the incidences between two datasets out of n trials follows a whole network and use this value as a measure to com- beta distribution. For practical purposes, we gave the ex- pare compounds within a reaction. When applied, we pectation value as a summary of the whole distribution. took the selected branch as the ECP with the highest fre- This value is easily calculated using the standard formula yþ1 quency. The ECP selected with the balance rule was ΕðθÞ¼ . Therefore, as in the present cases, when nþ2 placed on the right side, and the remaining ECPs were y, n ≫ 1, this result is very close to the ratio y/n placed on the left, discarding all ECPs with compounds [23]. Here, the parameters α and β are the number present in the other group. The balance rule is always of hits + 1 and fails + 1, respectively. This test was applied on an ECP first, and the count rule is used to performedfor theentireset andfor the22CTS disambiguate cases in which the balance rule fails to se- with at least ten reactions using the betainv function lect a single ECP. in GNU Octave 4.2.1. We represented the distribu- tion with the gnuplot 5.0 program. Reaction in a string format (RSF) After successive application of the rules, we con- Enrichment of ECN classes in TS patterns structed a representation visualized as a tree (Fig. 3 We tested for enrichment of the EC numbers classified and Additional file 1: Table S2). We also represented into CTSs by our method using a two-sided Fisher’s each TS in a JSON format (JavaScript Object Notation) exact test. Additionally, we controlled the false discovery and in two simplified formats. These formats are exempli- rate using the Benjamini-Hochberg procedure [24]. We fied below; Eq. 2a gives a generic syntax outline, and evaluated the strength of the enrichment using the odds Eq. 2b specify reaction R00760. ratio. For each possible combination of a given EC cat- egory “C” and a particular tree “X”, the odds ratio is de- rootðbalanceðÞ compound compoundðÞ compound compound ð2aÞ fined as (A/B)/(C/D), where A is the number of ECs of category “C” classified in tree “X”; B is the number of rootðÞ balanceðÞ C00095 C00085ðÞ C00002 C00008 ð2bÞ ECs that are not of category “C” classified in tree “X”; C >ðÞ !ðÞ C CðÞ C C ð2cÞ is the number of ECs of category “C” that are not in tree “X”; and D is the number of ECs that are not of category In Eq. 2c, “>” indicates the tree root for the TS, and “!” “C” and are not in tree “X”. We considered only odds ra- indicates the split according to the rule used (one mark tios with P values < 0.05. For this purpose, we extracted (“!”) indicates the balance rule, and two marks (“!!”) indi- EC number(s) related to each reaction from the KEGG cate the count rule). The number of parentheses around database, which corresponded to 4552 ECs distributed at each pair or loner compound show the depth at which it least one time in a reaction. Notably, 1134 of 8957 reac- is nested, indicating the number of partitions employed tions did not have ECs earmarked in the KEGG data- to construct the tree. “C” in Eq. 2c represents compound base. All the graphical representations were created positions independent of the specific compound. using R scripts developed in RStudio Version 1.0.136 and edited in Inkscape 0.91. CTS grouping For each reaction, a TS was proposed, and the architec- Endnotes tures found were represented as in Eq. 2c.The TSsavail- The power of a given set S is defined as the set of all S able for each reaction were clustered into clusters of TSs subsets, including the empty set {} and S itself. Therefore, (CTSs) according to their topology. Table 1 shows the re- the operation we performed on A and B can be described sultant clusters that were grouped considering the arrange- in standard mathematical notation as P′(S) = P(S) \ {{}}, ments but not the specific compounds within the reaction. where P(S) is the power of S, and the set difference is indicated by “\”. Pair validation We compared the pairs generated by our method with Additional file the RPair structures stored in the KEGG REACTION file. For this purpose, we counted the frequency with Additional file 1: Table S1. List of the reactions split by only the count rule or by the count rule in some step of division. Table S2. which a TS-compound pair was equal to an RPair with Representation of the CTS in a string and tree structure format. The first an RCLASS in the same reaction. We then estimated the column, shows the CTS ID; second column represents the reaction split posterior probability of successful data distribution ver- in a string; third column shows the graphical (node-edges), representation. Figure S1. CTS vs EC_two_digits comparison by the False sus having a failed pair as follows: Let θ be the corres- Discovery Rate. Figure S2. CTS vs EC_three_digits comparison by the pondence of the TS-compund pairs with the RCLASS False Discovery Rate. (PDF 3439 kb) set. Using Bayesian analysis, we were able to determine Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 17 of 17 Abbreviations 6. Oh M, Yamada T, Hattori M, Goto S, Kanehisa M. Systematic analysis of CTS: Cluster of tree structures; ECN: Enzyme Commission number; ECP: Element enzyme-catalyzed reaction patterns and prediction of microbial biodegradation of the Cartesian product; EPP: Entirely predicted pair; FP: Failed pair; G3P: D- pathways. J Chem Inf Model. 2007;47:1702–12. Glyceraldehyde 3-phosphate; MFP: Mixed-failed pair; MPP: Mixed-positive 7. Faust K, Croes D, van Helden J. Prediction of metabolic pathways from pair; Pi: Inorganic phosphate; RPair: Reactant pair; RSF: Reaction in a string format; genome-scale metabolic networks. Biosystems. 2011;105:109–21. [cited TS: Tree structure; TS-compound pairs: Tree structure-compound pairs 2013 Aug 21] 8. Heinonen M, Lappalainen S, Mielikäinen T, Rousu J. Computing atom mappings for biochemical reactions without subgraph isomorphism. Acknowledgements J Comput Biol. 2011;18:43–58. We thank Ricardo Ciria, Juan Manuel Hurtado and Shirley Ainsworth for their 9. Rahman SA, Cuesta SM, Furnham N, Holliday GL, Thornton JM. EC-BLAST: a technical and bibliographical support. We also want to thank Enrique Merino tool to automatically search and compare enzyme reactions. Nat Methods. for reviewing and providing comments regarding this manuscript. This study 2014;11:171–4. was supported by grant IN204515 from PAPIIT-UNAM awarded to RMGR. Carlos 10. Enzyme Nomenclature. Recommendations 1992. Supplement: corrections Daniel Vazquez was supported by a CONACyT scholarship and by the and additions. Eur J Biochem. 1994;223:1–5. [cited 2017 Aug 3]. Blackwell Consorcio de Investigación del Golfo de México (CIGOM) funded by the Fondo Publishing Ltd. Sectorial CONACYT-Secretaría de Energía-Hidrocarburos. 11. Kanehisa M, Goto S, Kawashima S, Nakaya A. The KEGG databases at GenomeNet. Nucleic Acids Res. 2002;30:42–6. Availability of data and materials 12. Hattori M, Okuno Y, Goto S, Kanehisa M. Development of a chemical structure The data supporting the findings of this study are available from the KEGG comparison method for integrated analysis of chemical and genomic under reference number BJ2100/vrk245yj, but restrictions apply to the availability information in the metabolic pathways. J Am Chem Soc. 2003;125:11853–65. of these data, which were used under license for the current study and are not 13. Hanes JW, Burns KE, Hilmey DG, Chatterjee A, Dorrestein PC, Begley TP. publicly available. However, data are available from the authors upon reasonable Mechanistic studies on pyridoxal phosphate synthase: the reaction pathway request and with permission of the KEGG database. leading to a chromophoric intermediate. J Am Chem Soc. 2008;32291–300. 14. Junker F, Field JA, Bangerter F, Ramsteiner K, Kohler H, Joannou CL, et al. Software and code Oxygenation and spontaneous deamination of 2-aminobenzenesulphonic All the programs used in our analysis are available at our web page acid in Alcaligenes sp. strain 0-1 with subsequent meta ring cleavage and http://www.ibt.unam.mx/biocomputo/reaCTS_programs.html. spontaneous desulphonation to 2-hydroxymuconic acid. Biochem J. 1994; 436:429–36. Authors’ contributions 15. Dorn E, Knackmuss H-J. Chemical structure and biodegradability of halogenated CVH conceived the original idea and developed the method. AL conceived aromatic compounds. Biochem J. 1978;174:73–84. and developed the presicion test. EPS conceived and developed the 16. Faust K, Croes D, van Helden J. Metabolic Pathfinding using RPAIR annotation. enrichment test. LS suggested ideas for data curation. RMGR improved the J Mol Biol. 2009;388:390–414. Elsevier Ltd method, analyzed the data, supervized the project and wrote the original 17. Latendresse M, Malerich JP, Travers M, Karp PD. Accurate atom-mapping draft. Writing – review and editing: CVH, AL EPS, LS and RMGR. All authors computation for biochemical reactions. J Chem Inf Model. 2012;52:2970–82. read and approved the final manuscript. 18. Kotera M, Okuno Y, Hattori M, Goto S, Kanehisa M. Computational assignment of the EC numbers for genomic-scale analysis of enzymatic reactions. J Am Ethics approval and consent to participate Chem Soc. 2004;126:16487–98. Not applicable. 19. Khersonsky O, Tawfik DS. Enzyme promiscuity: a mechanistic and evolutionary perspective. Annu Rev Biochem. 2010;79:471–505. Competing interests 20. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi A-L. The large-scale The authors declare that they have no competing interests. organization of metabolic networks. Nature. 2000;407:651–4. 21. Fell D, Wagner A. The small world of metabolism. Nat Biotechnol. 2000;18: 1121–2. Publisher’sNote 22. Lima-Mendez G, van Helden J. The powerful law of the power law and Springer Nature remains neutral with regard to jurisdictional claims in other myths in network biology. Mol BioSyst. 2009;5:1482–93. published maps and institutional affiliations. 23. Link C. Advances in Empirical Bayes Modeling and Bayesian Computation The Harvard community has made this article openly available. Please share Author details how this access benefits you. Your story matters. 2015; Departamento de Microbiología Molecular, Instituto de Biotecnología 24. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical Universidad Nacional Autónoma de México, Apdo, Postal 510-3, 62250 and powerful approach to multiple testing. J R Stat Soc Ser B (Methodol). Cuernavaca, Morelos, Mexico. Departamento de Ingeniería Celular y 1995;57(1):289–300. Biocatálisis, Instituto de Biotecnología Universidad Nacional Autónoma de México, Apdo, Postal 510-3, 62250 Cuernavaca, Morelos, Mexico. Received: 13 October 2017 Accepted: 9 May 2018 References 1. Muto A, Kotera M, Tokimatsu T, Nakagawa Z, Goto S, Kanehisa M. Modular architecture of metabolic pathways revealed by conserved sequences of reactions. J Chem Inf Model. 2013;53:613–22. 2. Ravasz E, Somera a L, Mongru D a, Oltvai ZN, Barabási a L. Hierarchical organization of modularity in metabolic networks. Science. 2002;297: 1551–5. [cited 2013 Aug 13] 3. Arita M. The metabolic world of Escherichia coli is not small. Proc Natl Acad Sci U S A. 2004;101:1543–7. 4. Kotera M, Hattori M, Oh M-A, Yamamoto R, Komeno T, Yabuzaki J, et al. (2004). RPAIR: a reactant-pair database representing chemical changes in enzymatic reactions. Genome Inf. 2004;15:P062. 5. Shimizu Y, Hattori M, Goto S, Kanehisa M. Generalized reaction patterns for prediction of unknown enzymatic reactions. Genome Informatics Ser. 2008; 20:149–58. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png BMC Systems Biology Springer Journals

Identification of reaction organization patterns that naturally cluster enzymatic transformations

Free
17 pages

Loading next page...
 
/lp/springer_journal/identification-of-reaction-organization-patterns-that-naturally-PoXQqsTyT6
Publisher
Springer Journals
Copyright
Copyright © 2018 by The Author(s).
Subject
Life Sciences; Bioinformatics; Systems Biology; Simulation and Modeling; Computational Biology/Bioinformatics; Physiological, Cellular and Medical Topics; Algorithms
eISSN
1752-0509
D.O.I.
10.1186/s12918-018-0583-9
Publisher site
See Article on Publisher Site

Abstract

Background: Metabolic reactions are chemical transformations commonly catalyzed by enzymes. In recent years, the explosion of genomic data and individual experimental characterizations have contributed to the construction of databases and methodologies for the analysis of metabolic networks. Some methodologies based on graph theory organize compound networks into metabolic functional categories without preserving biochemical pathways. Other methods based on chemical group exchange and atom flow trace the conversion of substrates into products in detail, which is useful for inferring metabolic pathways. Methods: Here, we present a novel rule-based approach incorporating both methods that decomposes each reaction into architectures of compound pairs and loner compounds that can be organized into tree structures. We compared the tree structure-compound pairs to those reported in the KEGG-RPAIR dataset and obtained a match precision of 81%. The generated tree structures naturally clustered all reactions into general reaction patterns of compounds with similar chemical transformations. The match precision of each cluster was calculated and used to suggest reactant-pairs for which manual curation can be avoided because this is the main goal of the method. We evaluated catalytic processes in the clusters based on Enzyme Commission categories that revealed preferential use of enzyme classes. Conclusions: We demonstrate that the application of simple rules can enable the identification of reaction patterns reflecting metabolic reactions that transform substrates into products and the types of catalysis involved in these transformations. Our rule-based approach can be incorporated as the input in pathfinders or as a tool for the construction of reaction classifiers, indicating its usefulness for predicting enzyme catalysis. Keywords: Metabolic reaction, Reaction patterns, Compound transformation, Reactant pairs, Enzyme catalysis Background systems and detailed group reaction exchanges. How- One of the main forces defining genome content is me- ever, the results arising from system-oriented studies tabolism, a chemical system that generates all the neces- are difficult to incorporatedirectlyintothe enzyme sary chemical substances in living cells through chemical catalysis context. reactions mainly catalyzed by genome-encoded enzymes Approaches based on graph theory build compound– [1]. The increased availability of metabolic information compound networks that reveal hubs (compounds that has attracted the interest of bioinformaticians, who have are highly connected) and modules (compound sets that developed computational methods to uncover important suggest communities of similar chemical and functional information regarding global properties of metabolic properties) [2]. Various properties arise from the re- moval of hubs, providing a partial view of the metabolic * Correspondence: rmaria@ibt.unam.mx network due to loss of biochemical information required Departamento de Microbiología Molecular, Instituto de Biotecnología to include compound modifications present in more Universidad Nacional Autónoma de México, Apdo, Postal 510-3, 62250 traditional metabolic maps. Such a level of detail can be Cuernavaca, Morelos, Mexico Full list of author information is available at the end of the article © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 2 of 17 achieved by adopting a reaction as a unit with the aim of principle was used to create the balance rule, and the analyzing specific compound transformations. Arita M. second was used to propose the count rule. Both rules created a method to detail atom correspondence were employed to analyze a set of 6230 well-curated re- between substrates and products within reactions, dem- actions documented in the KEGG database. Each TS onstrating that metabolic networks do not follow a contained at least one pair of compounds similar in small-world network distribution (many elements in structure to RPairs. To assess the performance of our modules connected by a few hubs) [3]. rules, we compared each resultant pair in a reaction to Other methods used to describe relationships between curated RPairs from the KEGG database (8) and ob- chemical compounds that participate in metabolic reac- tained a precision value of 81% for the full set. tions are so-called atom mappers, which automatically Because our approach outlines the specific architec- compare compounds to locate group transfer. An inter- ture of each reaction, it can then be grouped according esting property of atom mappers is that they identify to its TS topology, grouping reactions of the same pat- structural transformations between single-compound tern into a single cluster of TSs (CTS). As a result, all pairs, allowing creation of reactant pairs (RPairs) such as the 6230 reactions analyzed clustered into 71 CTSs. those in the Kyoto Encyclopedia of Genes and Genomes Moreover, the reactions grouped in each CTS transfer (KEGG) RPAIR database, which are defined as “pairs of similar chemical groups. Based on this property, we compounds that have atoms or atom groups in common propose that our rule-based approach can be used as a on two sides of a reaction” [4]. RPairs have also been classifier of metabolic reactions. Given this observation, proven to be useful when combined with hierarchical we analyzed whether our CTSs were associated with en- clustering algorithms to elucidate relationships between zyme categories represented by the Enzyme Commission reactions and enzymes [5], ultimately defining groups of number (ECN) available for each reaction [10]. There- compounds with related metabolic pathways [6]. These fore, we measured the enrichment of each ECN in each methods have been developed as promising alternatives CTS and revealed that the ECNs naturally fit the chem- for pathway discovery [7] but are slow to automate [8] ical patterns disclosed by our protocol. and tend to require both a priori information and sec- ondary methods to handle special cases, such as com- Results pounds with rings, prior to the final step of manual Generation of tree structures curation. Nonetheless, atom transfer is still a useful tool Cofactors used in metabolic reactions are compounds for predicting enzymatic catalysis in reaction sets [9]. that suffer small transformations compared to other re- Considering the aforementioned properties of atom actants that are modified dramatically in one or more mappers and graph theory as tools for pathway discov- steps. One way to compare these changes within reac- ery, the first goal of our work was to implement a simple tions is to measure the difference in molecular weights and fast method to complement atom mappers with the between compounds. Using this observation, we devised aim of avoiding a manual curation step as much as pos- a computational protocol, called the “balance rule”, that sible. For this purpose, we performed a statistical com- estimates the similarity of compounds within a reaction. parison of the TS-pairs proposed by our method with The purpose of this rule is to distinguish groups of com- those in the RPAIR/RCLASS datasets, generating a pre- pounds participating in a reaction based on differences cision value that can be interpreted as the confidence of in their molecular weights (Fig. 1a). the predicted set of reactant-pairs. We propose that the As a first step, for each analyzed reaction, we assigned results obtained by our method will provide researchers the elements of the Cartesian product (ECPs) derived with sets of confident pairs, a property proven to show from the compounds on the left and right sides of the better performance in pathfinders. Most importantly, equation, estimating the relative mass for each (de- our results can be used as a guide to determine which scribed in more detail in the Methods section). The ECP reactions should be candidates for manual curation. with the smallest difference in relative mass was then Therefore, we systematically analyzed single-chemical separated from the rest of the reaction, and then all metabolic reactions and outlined the mutual associations ECPs of the reaction for which the compounds were of their compounds, representing each reaction as a tree previously selected were eliminated (Fig. 1a). The structure (TS). We then constructed a rule-based remaining ECPs were then subjected to the same process approach centered on two principles: (i) wide sets of until no ECPs remained or until the establishing of a dif- metabolic reactions involve compounds, such as pool ference between the relative ECP masses was not pos- metabolites and cofactors, that undergo small modifica- sible. Thus, at the end of the process, no compounds in tions; and (ii) some compounds, usually coenzymes or the analyzed reaction participated in more than one pool metabolites, are frequently associated with various ECP. The implications of this restriction are discussed in reactions in unrelated metabolic pathways. The first the next section. We represented rounds of separation Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 3 of 17 ab Fig. 1 Representation of the constructed rules and tree structure (TS) construction. We illustrate graphical representations of our rules and their applications. a Example of using the balance rule to generate a tree structure for reaction R00658, in which the substrate 2-phospho-D-glycerate (C00631) is transformed into phosphoenolpyruvate (C00074) and water (C00001), by selecting the element of the Cartesian product (ECP) C00631_ C00074 as the element with the smallest difference in molecular weight within the reaction. b Example of using the count rule to construct a tree structure for reaction R02090, in which the substrates ATP (C00002) and dGMP (C00362) are transformed into ADP (C00008) and dGDP (C00361), by selecting the element of Cartesian product C00002_C00008 as the more frequently represented ECP in the entire network. For the TS representation, nodes: the gray octagon defines the tree root (reaction), the squares define the CP node, the rhomboid defines lone compound nodes, and the white circle defines compounds. Edges: blue, split through balance; orange, split through count; thin line, node/compound link. For the reaction string format (RSF), we represent the split in a line described in detail in the Methods section and selection into a tree structure (TS) in which the tips reflects an internal reaction property, and the count rule is showed the selected ECPs that described the pairs and used when the balance rule fails to select a single ECP. loner compounds defining each reaction (Fig. 2). When To validate our approach, we used only the 6392 reac- applying the previous rules, we observed that for some re- tions that were fully described in the KEGG LIGAND data- actions, the establishing of mass differences for particular set that were available in 2015 (see Methods section) [11]. sets of ECPs was not possible. For these reactions, our From the original set, we discarded 1099 reactions with just protocol uses a second rule, called the “count rule”,that one substrate and one product, including 335 reactions cat- selects an ECP based on its frequency in the entire net- aloged as isomerizations, which occur when a substrate work (Fig. 1b). The notion behind this rule is that some atom is arranged into another position on the product. For compounds, such as coenzymes and pool metabolites, are example, in reaction R01518, 2-phospho-D-glycerate frequently associated with numerous reactions in many (C00631) is transformed into 3-phospho-D-glycerate unrelated pathways. This property has been used previ- (C00197) by isomerization. For this reaction, the only pos- ously to identify metabolic network hubs [2]. In our work- sible tree structure would be formed by both compounds flow, the balance rule is always applied first because it (C00631_C00197) alone, and we therefore decided not to Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 4 of 17 Fig. 2 Analysis of the metabolic network using the rule-based approach. We illustrate the general procedures of how we applied the balance and count rules to the KEGG database. The procedure starts with the compounds and reactions from the KEGG, and the rules are applied to each reaction to generate a tree structure. Tree structures with identical architecture organizations are clustered together. Finally, we calculate the RPAIR correlation and Enzymatic Commission number (ECN) abundance levels to evaluate the performance of the method include these transformations in successive analyses. The remain alone on either side) (Fig. 2). A graphical same splitting method was applied to the rest of the group, representation of these architectures is shown in including 642 reactions lacking an enzyme entry. Additional file 1:Table S2.The next step wastoval- The balance rule was able to describe 6291 of 6392 re- idate the proposed architectures and interpret their actions from the KEGG database, and the count rule was chemical properties. used for the 101 remaining reactions. Importantly, 15 of these 101 reactions, including reaction R00509, have a TS-compound pairs correlated to KEGG RPairs small number of reactants (one or two) that are trans- We considered the presence of compound pairs identi- formed into one or two products (Fig. 3a). In this case, fied for each reaction to evaluate the architecture of each the compound arrangements were selected using only TS. Compound pairs are also described in reactions as the count rule. The 86 remaining reactions that were in- reactant pairs in the KEGG RPAIR database, in which a volved in more complex transformations required both pair in a reaction represents group transfer interactions rules to generate the TS for each reaction. An example among compounds [4, 12]. The current version of the is presented in Fig. 3b for reaction R10376, which is in- KEGG is different from previous versions in that RPAIR volved in the biosynthesis of indole diterpene alkaloids. has been replaced by the RCLASS set, which contains The selection begins by separating the terpendole E only well-curated pairs according to the authors. (C20536) and 13-desoxyterpendole I (C20542) molecules Well-curated pairs are the “classification of reactions from the rest of the reaction to form the pair based on the chemical structure transformation patterns (C20536_C20542). The count rule was then used again of substrate-product pairs (reactant pairs), which are to separate NADP+ (C00006) and NADPH (C00005) represented by so-called RDM patterns” [11]. The 2015 from the other compounds in the reaction to form the version of the KEGG contains the original RPAIR, in pair (C00005_ C00006). Finally, the minimal differences in which RCLASS pairs and other pairs have weaker evi- molecular weight for the remaining compounds were suf- dence of being correct. We compared the RCLASS list ficient to complete the process using the balance rule. to our predicted pairs reaction by reaction to measure Another notable observation from our approach the level of correspondence. From a total of 11,093 was that the TSs found for the 6392 reactions re- RPairs in the 6392 reactions, 8966 of our tree structure-- vealed common architectures that could be clustered compound pairs (TS-compound pairs) matched an RPair. considering the order of splitting within the reaction, We used a Bayesian approach to infer the probability of a and the two rules were used by the protocol. This TS-compound pair associating with its corresponding clustering procedure resulted in a set of 71 different RPair to measure our prediction confidence levels. As ex- TS arrangements that included all the reactions tensively described in the Methods section, the confidence tested (Table 1 and Additional file 1:Table S2).Not- interval was determined by the 2.5 and 97.5% quantiles. ably, the tips of each TS were composed of two Therefore, we found that the correspondence of the architecture types: pairs (two compounds associated TS-compound pairs over the entire RCLASS set was 0.81, directly through ECP selection by leaving exactly one a value that should be interpreted as the precision of our compound on each side of the equation) and loner method (Fig. 4a and Table 2) and that we consider very compounds (compounds from pair selection that good given the simplicity of the method. Notably, some Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 5 of 17 Fig. 3 Representation of three tree structure examples. Panel a shows the split of reaction R00509 (subsection a1) as represented in the KEGG database. Subsection a.2 shows a graphical representation of a TS. The TS split representations are viewed from the top down, and compact strings are read left to right. Nodes: the gray octagon defines the tree root (reaction), the squares define the CP node, the rhomboid defines lone compound nodes, and the white circle defines compounds. Edges: blue, split through balance; orange, split through count; thin line, node/compound link. In subsection a.3, we represent the reaction split in a string format constructed as explained in the Methods section. In panels b and c, we show the TS graphical representation of reactions R10376 and R00025, respectively. For both reactions, the same caption descriptions for panels a2 and a3 are applied to each respective subsection RPairs were constructed by mating one compound with are conformed by pairs sharing a common compound. This more than one of the same reaction because an atom type is an important difference between our approach and the can be shared by more than one compound, which is the manner in which atom mappers construct reactant pairs. case in the KEGG because a total of 1157 KEGG reactions An example showing the result of this difference is reaction Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 6 of 17 Table 1 Patterns identified and their condensed representations TS-cluster identifier CTS patterns Number of reactions per CTS Rule used to generate the pattern CTS-1 >(!(C_C)(C_C)) 1627 Balance CTS-2 >(!(C)(C_C)) 1059 Balance CTS-3 >(!(C)(!(C)(C_C))) 1028 Balance CTS-4 >(!(C)(!(C_C)(C_C))) 897 Balance CTS-5 >(!(C)(!(!(C)(!(C)(C_C)))(C_C))) 439 Balance CTS-6 >(!(C)(!(!(C)(C_C))(C_C))) 349 Balance CTS-7 >(!(C_C)(!(C_C)(C_C))) 150 Balance CTS-8 >(!(C_C)(!(C)(C_C))) 143 Balance CTS-9 >(!(!(C)(C_C))(!(C)(C_C))) 103 Balance CTS-10 >(!(C)(!(!(C_C)(C_C))(C_C))) 97 Balance CTS-11 >(!(C)(!(!(C_C)(!(C)(C_C)))(C_C))) 82 Balance CTS-12 >(!(C)(!(C)(!(C)(C_C)))) 66 Balance CTS-13 >(!(!(C)(C_C))(C_C)) 54 Balance CTS-14 >(!(C_C)(!(C)(!(C)(C_C)))) 42 Balance CTS-15 >(!(!(C)(!(C)(C_C)))(C_C)) 29 Balance CTS-16 >(!(!(C_C)(!(C)(C_C)))(C_C)) 26 Balance CTS-20 >(!(C)(!(C)(!(C)(!(C)(C_C))))) 12 Balance CTS-21 >(!(C)(!(C_C)(!(C)(C_C)))) 12 Balance CTS-23 >(!(C_C)(!(C_C)(!(C)(C_C)))) 8 Balance CTS-24 >(!(C_C)(!(!(C)(C)_C))(C_C))) 8 Balance CTS-25 >(!(C)(!(!(!(C)(C_C))(C_C))(C_C))) 5 Balance CTS-26 >(!(!(C)(C_C))(!(C_C)(C_C))) 5 Balance CTS-29 >(!(C)(!(!(C)(!(C)(!(C)(C_C))))(C_C))) 4 Balance CTS-30 >(!(C)(!(!(C)(C_C))(!(C)(C_C)))) 4 Balance CTS-31 >(!(!(C_C)(C_C))(C_C)) 4 Balance CTS-33 >(!(C)(!(!(C)(!(C_C)(!(C)(!(C)(C_C)))))(C_C))) 3 Balance CTS-34 >(!(C)(!(!(C)(!(!(C)(C_C))(C_C)))(C_C))) 3 Balance CTS-38 >(!(C)(!(!(C)(C_C))(!(!(C)(C_C))(C_C)))) 2 Balance CTS-39 >(!(C)(!(C_C)(!(!(C)(C_C))(C_C)))) 2 Balance CTS-40 >(!(C)(!(!(C)(!(C)(C_C)))(!(C)(C_C)))) 2 Balance CTS-41 >(!(!(C)(C_C))(!(!(C)(C_C))(C_C))) 2 Balance CTS-42 >(!(!(C)(!(C)(C_C)))(!(C)(!(C)(C_C)))) 2 Balance CTS-44 >(!(C)(!(C)(!(C)(!(C)(!(C)(C_C)))))) 2 Balance CTS-45 >(!(C_C)(!(C)(!(C_C)(C_C)))) 2 Balance CTS-46 >(!(!(C)(!(C)(C_C)))(!(C)(C_C))) 2 Balance CTS-47 >(!(!(C_C)(C_C))(!(C)(C_C))) 2 Balance CTS-50 >(!(!(C)(!(C)(C_C)))(!(C_C)(!(!(C)(C_C))(C_C)))) 1 Balance CTS-51 >(!(C)(!(!(C)(!(!(C)(C_C))(!(C)(C_C))))(C_C))) 1 Balance CTS-52 >(!(!(C)(!(C)(C_C)))(!(C)(!(C)(!(C)(!(C)(C_C)))))) 1 Balance CTS-53 >(!(C)(!(!(C)(!(C)(!(C_C)(C_C))))(C_C))) 1 Balance CTS-54 >(!(C)(!(!(C_C)(C_C))(!(C)(!(C)(C_C))))) 1 Balance CTS-55 >(!(C_C)(!(!(C)(!(C)(C_C)))(!(C)(C_C)))) 1 Balance CTS-57 >(!(C)(!(C_C)(!(C)(!(C_C)(C_C))))) 1 Balance CTS-58 >(!(C)(!(!(C)(!(C_C)(C_C)))(C_C))) 1 Balance Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 7 of 17 Table 1 Patterns identified and their condensed representations (Continued) TS-cluster identifier CTS patterns Number of reactions per CTS Rule used to generate the pattern CTS-59 >(!(C_C)(!(C)(!(C_C)(!(C)(C_C))))) 1 Balance CTS-60 >(!(C_C)(!(C_C)(!(C)(!(C)(C_C))))) 1 Balance CTS-61 >(!(C_C)(!(C_C)(!(C_C)(C_C)))) 1 Balance CTS-62 >(!(!(C)(C_C))(!(C)(!(C_C)(C_C)))) 1 Balance CTS-65 >(!(!(C)(C_C))(!(C)(!(C)(C_C)))) 1 Balance CTS-66 >(!(!(C)(!(C)(!(C)(C_C))))(C_C)) 1 Balance CTS-19 >(!!(C_C)(C_C)) 14 Count CTS-71 >(!!(C)(C_C)) 1 Count CTS-17 >(!!(!(C)(C_C))(!(C_C)(C_C))) 24 Count-Balance CTS-18 >(!!(!(C_C)(C_C))(!(C)(!(C)(C_C)))) 16 Count-Balance CTS-22 >(!!(C_C)(!(C)(C_C))) 10 Count-Balance CTS-27 >(!!(!(C)(C_C))(!(C)(!(C)(C_C)))) 5 Count-Balance CTS-28 >(!(C)(!!(C)(C_C))) 5 Count-Balance CTS-32 >(!!(!(C)(C_C))(!(C)(C_C))) 4 Count-Balance CTS-35 >(!!(!(C)(!(C)(C_C)))(!(C)(!(C)(C_C)))) 3 Count-Balance CTS-36 >(!!(!(C_C)(C_C))(C_C)) 3 Count-Balance CTS-37 >(!!(!(C)(C_C))(C_C)) 3 Count-Balance CTS-43 >(!!(!(C)(C_C))(!(!(C)(C_C))(C_C))) 2 Count-Balance CTS-48 >(!!(C_C)(!(C)(!(C)(C_C)))) 2 Count-Balance CTS-49 >(!!(!(C)(!(C)(C_C)))(C_C)) 2 Count-Balance CTS-56 >(!!(!(C)(!(C_C)(C_C)))(!(C)(!(C)(C_C)))) 1 Count-Balance CTS-63 >(!!(C_C)(!!(!(C)(!(C)(C_C)))(C_C))) 1 Count-Balance CTS-64 >(!!(!(C_C)(!(C)(C_C)))(!(C)(C_C))) 1 Count-Balance CTS-67 >(!!(C_C)(!(!(C)(C_C))(C_C))) 1 Count-Balance CTS-68 >(!!(!(C_C)(!(C)(C_C)))(C_C)) 1 Count-Balance CTS-69 >(!!(C_C)(!(C_C)(C_C))) 1 Count-Balance CTS-70 >(!(C)(!!(C_C)(C_C))) 1 Count-Balance We describe the general topologies of the reactions found by our approach. Column 1 indicates the CTS identifier, and Column 2 shows the general pattern found for groups of reactions. “>” indicates the root of the tree structure; “!” indicates use of the balance rule; “!!” indicates use of the count rule;and “C” indicates a compound. A pair is described as (C_C), and a loner compound is (C). The number of parentheses around the pair or loner compound indicates its depth in the tree and the number of partitions employed for separation. Column 4 describes the rule or rules used to generate the arrangements R00025, where ethylnitronate (C18091) is catalyzed to nitrite method keeps this compound in the architecture and thus (C00088) and acetaldehyde (C00084) by the enzyme nitro- shows its contribution to the reaction. Because our method nate monooxygenase (Fig. 3c). In this reaction, ethylnitro- generates unique compound combinations as the final re- nate is decomposed into three reactants pairs, sult, the undetected pair composed of acetaldehyde and RC00126-(C00061_C01847), RC02541-(C00084_C18091) ethylnitronate (RC02541-(C00084_C18091)) was eliminated and RC02759-(C00088_C18091), which correspond to as apairat someother pointinthe processdue to itspres- (FMN_reduced_FMN), (acetaldehyde_ethylnitronate) and ence in another ECP with a bigger mass difference com- (nitrite_ethylnitronate), respectively. In this reaction, our pared with the remaining products in the tips of the tree. protocol positively detected FMN_ reduced_FMN as a pair We also observed that our protocol generated 11,093 (C00061_C01847) and an additional group ((_C00001) TS-compound pairs and from these 2127 are absent in the (C00088_ C18091)) composed of water as a loner com- RPAIR data set, making these pairs candidates for manual pound (_C00001) and the pair ethylnitronate_nitrite curation to establish their confidence. (C00088_ C18091) matching an RPair in RCLASS Because our protocol naturally clustered each reaction RC02759. Notably, water (C00001) is not associated with according to its architectural arrangement into 71 any RPair for reaction R00025. As an advantage, our groups, we estimated the precision of CTSs for which Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 8 of 17 ab c Fig. 4 Correlation between TS pairs and RPAIRs. We present three examples of posterior distributions of the inferred parameter θ. a Total TS pairs vs. the entire RPAIR dataset. b Example showing the sharp-peaked distribution observed in CTS-1. c Example of a broad distribution showing the effect of a high correlation level in a group with a small number of reactions Table 2 Comparison of TS-compound pairs versus RPairs Cluster ID Number Number of Number of Number of Number Number of Number of Inferior Superior Average of total reactions with reactions with reactions with of RPAIRs matching mismatched interval interval value of reactions only matching no matched mixed pair in a CTS TS-compound TS-compound 0.025 0.975 the interval pairs pairs matches pairs pairs General 6392 4869 618 905 11,193 8966 2127 0.8006 0.8153 0.8070 CTS-1 1627 960 391 276 3254 2196 1058 0.6587 0.6909 0.6748 CTS-2 1059 1016 43 0 1059 1016 43 0.9467 0.9704 0.9586 CTS-3 1028 926 102 0 1028 926 102 0.8818 0.9183 0.9 CTS-4 897 867 8 22 1794 1756 38 0.9717 0.985 0.9783 CTS-5 439 435 1 3 878 873 5 0.9884 0.9981 0.9933 CTS-6 349 341 0 8 698 690 8 0.9794 0.995 0.9872 CTS-7 150 1 17 132 450 206 244 0.412 0.5039 0.458 CTS-8 143 57 7 79 286 193 93 0.6195 0.7278 0.6737 CTS-9 103 51 3 49 206 151 55 0.6707 0.791 0.7309 CTS-10 97 2 0 95 291 196 95 0.6187 0.7261 0.6724 CTS-11 82 0 0 82 246 163 83 0.6024 0.7203 0.6613 CTS-12 66 58 8 0 66 58 8 0.7906 0.9453 0.868 CTS-13 54 39 7 8 108 86 22 0.7158 0.8664 0.7911 CTS-14 42 21 5 16 84 58 26 0.5882 0.7841 0.6862 CTS-15 29 28 0 1 58 57 1 0.9373 0.9996 0.9684 CTS-16 26 0 0 26 78 29 49 0.2687 0.4812 0.3749 CTS-17 24 0 0 24 72 48 24 0.5545 0.77 0.6623 CTS-18 16 8 0 8 48 37 11 0.6434 0.877 0.7602 CTS-19 14 5 6 3 28 13 15 0.2867 0.6467 0.4667 CTS-20 12 11 1 0 12 11 1 0.7151 0.9977 0.8564 CTS-21 12 4 3 5 24 13 11 0.3449 0.7318 0.5384 CTS-22 10 6 1 3 20 15 5 0.5443 0.9085 0.7264 We compared the numbers of hits and fails of our predicted pairs with RPairs in the KEGG RPAIR dataset. The table presents precision estimations based on the calculated posterior probabilities of each group. The confidence interval and average value are shown in the last 3 columns Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 9 of 17 the number of reactions was sufficiently large to perform Tree structure distribution reveals that the predicted patterns a reasonable statistical inference. Therefore, estimations yield reaction groups with similar chemical events were performed only for CTSs with at least 10 reactions The 2015 version of the KEGG retained the classification with predicted pairs. The results of our precision ana- of RPairs proposed by Kotera M. et al. regarding “main- lysis of 6291 reactions clustered in 22 different CTSs are pairs (describing main changes in substrates), cofac pairs shown in Table 2, with the CTSs numbered from 1 to 22 (describing changes in cofactors for oxidoreductases), (1 represents the group with the highest number of reac- trans pairs (focused on transferred groups for transfer- tions, and 22 represents the group with the smallest ases), ligase pairs (describing the consumption of number of reactions). As expected from the group size nucleoside triphosphates for ligases), and leave pairs (de- differences, the results presented in Table 2 show that scribing the separation or addition of inorganic com- the CTSs had different levels of precision. We observed pounds for enzymes, such as lyases and hydrolases)” [4]. eight CTSs encompassing 3879 reactions with an aver- These classes link compound pairs within a reaction age confidence value higher than 0.80. For example, for with the enzyme activity exerted upon them, indicating CTS-5, we predicted only five pairs that were different that the same pair could be assigned to a different class from those assigned by the KEGG out of 878 total reac- depending on the reaction in which it participates. tions (Table 2), indicating that every pair found in this Therefore, our next step was to analyze our predicted CTS had a very high probability of being an RPair. We pairs in this pair-reaction context. We labeled each pair also noticed that all the reactions in CTS-5 were oxidor- with its RPAIR class (main, cofac, trans, leave or ligase), eductions reactions in which the incorporated oxygen is and we also labeled each pair according to the propor- not necessarily derived from molecular oxygen. tion of predicted RPairs in its corresponding reaction, Another 2198 reactions were grouped in CTSs with resulting in four categories. The first category included average precision values ranging from 0.79 to 0.60 (Table reactions in which the predicted pairs (EPP) entirely fit 2). For example, CTS-1 had the highest number of mem- the published RPairs. The second and third categories bers with the same architecture arrangement (1627 reac- involved reactions with at least one hit pair and one or tions). From the 3254 possible pairs in this CTS, we more failed pairs. We labeled these reactions as mixed predicted that 2196 matched an RPair, representing pairs (MP) as they produced mixed-positive pairs (MPP) two-thirds of the group. This gave CTS-1 an average and mixed-failed pairs (MFP). The last group consisted probability of 0.67, which also showed a sharp-peaked dis- of pairs in reactions in which no RPair was found, called tribution (Fig. 4b). Therefore, this CTS provided a large failed pairs (FP). We labeled the pairs using three confi- amount of information supporting our conclusion that dence ranges taken from the Bayesian probability estima- two of its three pairs are dependable RPairs, and the tion previously defined for their respective CTSs. The CTSs remaining one-third of the group (1058) represents candi- were ranked by their mean probability x,where treeswith dates for validation through manual curation. We found ¨high confidence¨ were in the range x ≥ 0.8, trees with that other CTSs in the same confidence range showed “medium confidence” were between 0.8 > x ≥ 0.6, and CTSs broad distributions, such as CTS-18 (Fig. 4c), which ex- in which 0.6 > x were labeled “low confidence”. hibited a value of 0.64 for the 2.5% interval and 0.87 for Figure 5 shows the frequency of the reactant pairs the 97.5% interval. In this case, in which 37 of 48 possible grouped according to their RPAIR class (main, cofac, RPairs were matched, we had less information supporting trans, leave and ligase), our hit-fail categories (EPP, MPP, our estimate, and the greater uncertainty could be ex- MFP and FP) and our confidence ranges (high, medium plained by the small sample size (16 reactions). Despite and low). The EPP category included a total of 7672 pre- the small sample size, the proportion of hits revealed that dicted pairs, and notably, 48% of these matched a main CTS-18 had an intermediate probability of yielding a posi- RPair with high confidence. Interestingly, another 21% tive RPair. Nevertheless, inspection of this CTS showed of the EPP pairs fell in the cofactor class (cofac) with that all the reactions in the cluster are catalyzed by ligases high confidence. A very small group, less than 1% of the classified by enzyme commissions in classes 6.3.4 and EPP category, belonged to the ligase and leave classes, 6.3.5. Furthermore, two reactions involved in the exhibiting a very small proportion of pairs with one and biosynthesis of neomycin, kanamycin and gentamicin, four members, respectively. In the same category, the R10767 and R10768, are catalyzed by a tobramycin other 26% of the pairs with medium confidence were carbamoyltransferase of the group 6.1.2.2, indicating that main pairs, and another 2.7% were cofac pairs. Similar despite the medium significance, compounds of reactions behavior was observed for the MPP category because the in CTS-18 undergo similar transformations catalyzed by main and cofac categories predominated in 1120 pairs; similar enzymes. The next step was to determine whether however, in contrast to the EPP category, the concordant the reaction groups in the remaining CTSs also exhibit pairs exhibited mainly a medium confidence level. An- similar chemical transformations. other 240 pairs in the low-confidence group were mostly Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 10 of 17 could not be provided or when we obtained results that were inconsistent with the proposed RPairs. The first ex- ample is reaction R10088 from CTS-32 in which we ob- tained mixed pairs. As shown in Fig. 6a, this reaction entry describes the conversion of D-Ribose 5-phosphate and D-Glyceraldehyde 3-phosphate (G3P) into Pyridoxal phosphate(vitaminB6) in thepresenceofanammoniamol- ecule, yielding four molecules of water and inorganic phos- phate (Pi). Our resulting TS shows that vitamin B6 is paired to the D-ribose 5-phosphate molecule (C00117_C00018) as proposed in the RCLASS ID RC03049 (C00018_C00117). The overall TS partially reflects the proposed mechanism, as showninFig. 6a. The vitamin is proposed to be assem- bled from ribose, ammonia and glyceraldehyde 3-phosphate (G3P), with the phosphate group from the ribose molecule leaving the complex [13]. In the TS, vitamin B6 is paired with the ribose molecule, with ammonia as the closest loner compound, which is consistent with the overall mechanism up to the intermediary before the excision of Pi. G3P (C00118), however, is paired with Pi (C00009) because they Fig. 5 Analysis of TS pairs using RPAIR classes. The figure represents are more similar in terms of molecular weight, implying that the abundance of each TS pair as a function of the class proposed the Pi group that enters and leaves is the group in G3P, not by Kotera [4]. We categorized the pairs as being in reactions in which the group in the ribose molecule. The mapping proposed by the whole predicted pair entirely fit the published RPair (we called this the KEGG also pairs G3P with pyridoxal phosphate - EPP). The second group includes reactions representing at least one hit RC01783 (C00018_C00118) -, which is consistent with the with one or more fails, which we called mixed pairs, and this group was subdivided into positive (MPP) and failed pairs (MFP). The proposed mechanism [13]. Nonetheless, we should remem- last group contains reactions in which we failed to detect an RPair, called ber that our method does not consider molecular mecha- failed pairs (FP). We also arranged the TS pairs according to their mean nisms in performing the splitting in contrast to atom precision levels (x) estimated from the CTSs by Bayesian analysis. TS pairs mappers, which attempt to reflect the atom transitions be- in the precision range x ≥0.8wereclassified as “high confidence”,those tween chemical species in detail. Finally, an important trait in the range 0.80 > x ≥ 0.60 were classified as “medium confidence”,and those in the range x < 0.60 were labeled “low confidence” of the TS layout is the first splitting performed by the count rule, which reflects the fact that most of the mass in the pyridoxalmoleculeisalreadypresent in the intermediary distributed in the ligase category. Pairs in the MFP prior to Pi excision. One important aspect of the reaction group (446 pairs) tended to belong to the cofac and leave described in Fig. 6a is that the enzyme Pdx1 (pyridoxal categories, both with medium confidence. Finally, as 5′-phosphate synthase, ECN 4.3.3.6) prefers the substrate shown in Fig. 5, the approach failed to provide a con- G3P and joins the molecule through imine formation, which cordant RPair for 576 reactions in the FP category. Inter- is an important trait for understanding the mechanism and estingly, these FP pairs were classified by Kotera as trans reaching the proposed conclusion. We think that consider- and leave pairs, and 383 were clustered into CTS-1 (data ing enzymatic mechanisms will be valuable for improving not shown). Furthermore, they were shown to be prefer- our method. entially catalyzed by hexosyltransferases, pentosyltrans- Another interesting example, shown in Fig. 6b,is ferases and phosphotransferases, with an alcohol group R07795 in CTS-4. The reaction describes the conversion as the acceptor, and by some enzymes other than methyl of 3-sulfocatechol (C06336) into 2-hydroxymuconate groups that transfer alkyl or aryl groups. (C02501). This reaction is catalyzed by the enzyme cat- The results presented thus far provide evidence that echol 2,3-dioxygenase (ECN 1.13.11.2), which also cata- our rules are exceptional in defining pairs in reactions lyzes R04089 - the conversion of 2,3-dihydroxytoluene catalyzed by oxidoreductases. In contrast, these rules into 2-hydroxy-6-keto-2,4-heptadienoate - and reactions have limitations in establishing the transitions of specific R05295, R05404 and R05406, all of which are implicated types of transferases, lyases and hydrolases. The results in conversions of different chemical species of catechol obtained for these reactions prompted us to perform involving aromatic biodegradation. All these reactions, manual curation with the aim of establishing the degree except for R07795, were clustered in CTS-2, which had to which the mechanisms proposed in the literature fol- a mean precision of 95%. Notably, the TS-pairs in these low our proposed splitting when a statistical description reactions match the reactions’ RCLASS pairs, which is Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 11 of 17 Fig. 6 Manual curation of reactant-pairs. Panel a illustrates the proposed mechanism for reaction R00018 and its tree structure (TS). In this reaction, D-Ribose 5-phosphate (red) and D-Glyceraldehyde 3-phosphate (blue) are converted into Pyridoxal phosphate in the presence of an ammonia molecule (green). Figure adapted from reference [13]. Panel b illustrates the proposed mechanism for reaction R07795 and its TS. This reaction is the conversion of 3-Sulfocatechol into 2-Hydroxymuconate. Figure adapted from reference [14]. The pairs in TSs that follow that reaction’s proposed mechanism are marked by a checkmark. The TSs are also shown in string format (RSF), (see Methods) important because these reactions yield an additional the KEGG data, and we also propose that this pair product despite the similarity in their ring cleavage should be considered incorrect. mechanism by incorporating oxygen during biological A count of the pairs recovered in the CTSs that were oxidations of these organic substrates [14]. In contrast, not statistically evaluated also reveals some patterns. in R07795, oxygen is incorporated into 3-sulfocatechol Overall, 194 of the 324 pairs in these reactions are con- through an attack that releases sulfite, which we paired firmed RCLASS entries. However, the proportion of pair with oxygen (C00007_C00094). The 2015 version of matches is not homogeneous, which is similar to the 22 RPAIR included this pair as a leave pair, but RCLASS ig- evaluated CTSs. For example, in CTS-32, which includes nores it due to mechanistic inconsistency. The mechan- four reactions, seven of eight pairs are confirmed ism proposed by Junker F et al. [14] is consistent with RCLASS entries. Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 12 of 17 The results presented in this section, determined using This result can be explained by the fact that some reac- the Kotera classification, suggest that reactions in the tions are catalyzed by more than one enzyme or by multi- CTSs might be correlated to specific catalytic conver- functional enzymes. For example, CTS-2 has reactions sions. We tested this hypothesis with the aim of proving catalyzed by the enzyme chloromuconate cycloisomerase whether our rules naturally cluster reactions that share (EC 5.5.1.7), which is cataloged as both an isomerase and the same ECN within a CTS because this number is an intramolecular lyase due to its capacity to release free used to classify reactions in terms of the enzymes by hydrogen chloride when 2-chloro-cis, cis-muconic acid is which they are catalyzed. transformed into cis-4-carboxymethylenebut-2-en-4-olide [15]. Therefore, finding a representation of more than one Tree structures cluster into general enzyme patterns enzyme with similar catalysis types despite classification in Analysis of our TS-compound pairs and CTSs suggested different ECNs should not difficult. that our approach can cluster reactions catalyzed by As previously mentioned, isomerization reactions with enzymes belonging to similar classes as observed for the only one substrate and one product were not considered ligases of CTS-18. To evaluate the generality of this ob- because they form trivial tree structures. Nonetheless, servation, we associated the ECNs corresponding to each isomerizations embedded in reactions with more than reaction. ECNs are composed of four digits separated by one compound on each side of the equation were con- periods, and the numbers represent a progressively spe- sidered and mapped to specific CTSs (CTS-2, CTS-3 cific classification of each enzyme. The first digit groups and CTS-4). Similar distributions were observed for enzymes into general types of catalysis. For example, other ECNs, such as ligases represented in CTS-7, ECN type 1 (ECN-1) describes oxidoreductases that CTS-9, CTS-14 and CTS-18 and hydrolases represented catalyze oxidoreduction reactions in which hydrogen in CTS-1, CTS-2 and CTS-3. and oxygen atoms or electrons are transferred from one As shown in Fig. 7, our CTSs tended to concentrate par- substance to another. The subsequent digits add levels ticular ECNs. To test whether this over-representation of specificity; ECN-1.1, for example, contains oxidore- was statistically significant, we calculated the enrichment ductases that act on donor alcohol groups. Figure 7 of each ECN in each CTS considering only the first ECN shows the frequency of each ECN associated with each digit and CTSs grouping more than 10 reactions using reaction grouped in each CTS considering only its first Fisher’s exact test. We considered enriched CTSs to have digit. The bar plot in this figure shows that all CTSs Pvalues <0.05. Figure 8 shows the significantly enriched were concentrated mainly in the ECNs of a single class, groups represented in terms of logarithmic odd ratios, and CTSs with higher numbers of members tended to which are useful for evaluating the strength of enrichment. include other ECNs of other classes in lesser proportions. The CTSs considered enriched in an ECN category were Fig. 7 Correlation of tree structure clusters with general enzymatic categories. The clusters of tree structures (CTSs) tended to naturally Fig. 8 Enrichment of enzyme categories into tree clusters. We illustrate group enzymatic categories provided by the Enzyme Commission the results of the significance of each enzyme category within the numbers (ECNs). This figure presents this tendency using the first digit clusters of tree structures (CTSs). The graph shows the logarithm of the ECN class, in which enzymes are classified as oxidoreductases of the odds ratio, which represents the strength of the enrichment (ECN-1), transferases (ECN-2), hydrolases (ECN-3), lyases (ECN-4), (spectrum color bar), and the number of hits in the group (point density). isomerases (ECN-5) and ligases (ECN-6) We only show CTSs with False discovery rate < 0.05 Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 13 of 17 those displaying a false discovery rate, reported as the observation, we applied Fisher’s exact test while considering logodd ratios ≥ 0.5. the second and third ECN digits, and our results showed As shown in Figs. 7 and 8, the probability of having re- that at these class levels, the remaining CTSs, including actions catalyzed by specific enzyme types increased as CTS-2 and CTS-17, tended to be significantly enriched in the number of reactions in the CTS decreased, as shown more specific ECNs. The results of these tests are by the significant enrichment in transferases, ligases and shown in Additional file 1: Figures S1 and S2. oxidoreductases in CTS-20, CTS-21 and CTS-22, re- Considering the presented results, we conclude that spectively. Moreover, 17 of the 22 CTSs were signifi- our proposed rules showed that a significant number cantly enriched in only one ECN, and another three of the reactions evaluated reflected a clear biochem- were significantly enriched in only two ECNs; CTS-2 ical link between tree structures and specific chemical has lyases (ECN-4) and isomerases (ECN-5), CTS-4 has events. oxidoreductases (ECN-1) and isomerases (ECN-5), and CTS-7 has oxidoreductases (ECN-1) and ligases (ECN-6). Discussion An analysis of CTS-2, which is significantly enriched in We described a clustering analysis of enzymatic reac- ECN-5 and ECN-4, showed that its isomerases belong to tions described in the KEGG database using our the subclasses 5.5.1.7 (a chloromuconate cycloisomerase) rule-based approach. Our results allowed us to classify and 5.5.1.11 (a dichloromuconate cycloisomerase). These different metabolic reactions into patterns, revealing as- enzymes catalyze eight reactions for the degradation of sociations among compounds at a glance. The rules of chlorocyclohexane or chlorobenzene, respectively, and our approach were implemented as a protocol that first they release a hydrogen chloride molecule in all cases. locates and separates groups with the smallest differ- Surprisingly, when analyzing the KEGG database in more ences among compounds within a reaction. Thus, using detail, we found two other reactions (R08119 and R09215) our approach, the most different compounds are identi- catalyzed by a chloromuconate cycloisomerase that were fied by successive elimination. The logic behind this ap- not grouped in CTS-2. Interestingly, the isomerization proach is that some compounds, such as coenzymes and processes in R08119, which transforms 2-fluoro-cis, pool compounds (such as S-adenosyl-L-methionine cis-muconate into 5-fluoromuconolactone, and R09215, [16]), tend to undergo less dramatic changes during a in which 3,5-dichloro-2-methylmuconate is transformed reaction (the basis of the balance rule) and tend to be into 3,5-dichloro-2-methylmuconolactone, do not yield associated with more enzymes in many unrelated path- halogens as reaction products because fluorine and chlor- ways (the basis of the count rule). Other methods, such ine remain bound to the products, respectively. This is a as compound mappers, highlight atom exchanges very good example showing that our rule-based approach between substrates and products through heuristic can separate very similar reactions in CTSs with very high graph-matching algorithms, which requires knowledge precision levels (mean 0.95, see Table 2) even if they are of the reaction modifications. Water, for example, is re- catalyzed by the same enzyme. moved from the analysis to increase the detection of Figure 8 shows that of the 17 significantly enriched oxygen and hydrogen traces at the cost of losing cataly- CTSs, the clusters CTS-16, CTS-17, CTS-18 and sis conducted by hydrolysis [3, 12, 17]. In contrast, our CTS-22 were most significantly enriched for specific approach does not remove any compound unless it is ECN categories. Other groups, such as CTS-5 and present in an unbalanced reaction or in a reaction in CTS-9, were almost homogenous because they had only which the subscripts are not well defined, such as reac- one or two reactions with ECNs of another class. tion R00001 (polyphosphate + n H O < => (n + 1) oligo- Figure 8 shows that despite significant enrichment in phosphate). Therefore, we were able to analyze a wider only one ECN class at the first digit, CTS-20 did not range of compounds within reactions. have one of the highest odd ratios due to the small size Our analysis also showed that despite the complexity of the group, which had only 10 elements. In contrast, of enzyme catalysis, reactions with similar catalytic pat- CTS-5 clustered 482 reactions, 480 of which were classi- terns were grouped through our TSs, which outlined the fied in ECN-5. chemical transformations of substrates and products Careful examination of the 17 CTSs with statistically sig- within reactions. We demonstrated that the TSs pre- nificant enrichment in one ECN revealed a tendency to served pair architectures comparable to those of reactant concentrate reactions with similar catalysis types, even at pairs of the RPAIR dataset [12], which were based on the second digit as shown for CTS-2. Other examples in- identification of the atom type [8]. In contrast, our ap- clude reactions in CTS-17 catalyzed by enzymes of subclass proach based on two simple rules performed well, par- 1.4.1 and involved in amino acid oxi-reductions. Reaction ticularly for pairs classified in the RPAIR set as main, R00025 served as the exception because it is catalyzed by cofac (redox cofactors) and ligases (mainly phosphoryla- nitronate monooxygenase (1.13.12.16). Driven by this tions). Our protocol was less precise when describing Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 14 of 17 small groups of pairs cataloged as trans and leave, pathways and showed that the metabolic network with specifically transferring compounds such as hexose, the best performance combines biochemical knowledge pentose, alkyl or aryl groups and a small group of phos- encoded in the KEGG RPAIR with a weighting scheme photransferases. These reactions were found mainly in that penalizes highly connected compounds. This is con- CTS-1 and CTS-2; therefore, different rules need to be sistent with the pair-reaction classification that we established or a manual curation step should be imple- tested, which shows that most of the our EPPs were mented to verify these reactions. main RPairs. We believe that incorporation of our We also found that for a particular set of reactions, TS rule-based approach into pathfinders will substantially pairs were predicted that did not correspond to the pre- improve the performance of these methods as our dicted RPairs, most of which were catalogued by Kotera method could provide the relative relevance of com- as trans or leave [4]. R06726 (retronecine + 2 L-isoleu- pound pairs within reactions. cine <= > senecionine, grouped in CTS-2) serves as a Kotera and coworkers [18] and Rahman et al. [9] de- good example of a reaction with inconsistent pairs. In veloped methods to predict reactant pairs using similar- this case, we predicted the pair (C0047_C06176, ity between reaction centers, proving the feasibility of L-isoleucine_senecionine) by using a balance reaction grouping reactions by their ECNs through the chemical transforming two L-isoleucine molecules. In comparison, transformations of compounds. Our approach can clus- the KEGG pairs a retronecine molecule with the senecio- ter metabolic reactions by their general patterns, result- nine in an RPair given that the molecules have more sig- ing in groups with significant enrichment in specific nificant common atom types in contrast to L-isoleucine. ECNs, even at the third digit. We found secondary ECN Another five of six reactions in CTS-2 with a predicted classes in smaller proportions of CTSs, which does not pair according to our approach could not be compared disprove the above observation because individual reac- due to the lack of an RPair. Three of these reactions tions can be linked to several ECNs. Furthermore, slight (R07652, R07916 and R10177) were involved in geranyl variations in these classes are expected due to factors and farnesyl transformations, which are catalyzed by en- such as multiple domain enzymes and enzyme promis- zymes that transfer acyl or aryl groups from the substrate cuity, which has been documented as a common to the product. Among these reactions, the ECPs defined phenomenon [19]. as more similar included compounds with acyl or aryl Statistical evaluation performed using the false discov- groups because they have similar molecular weights and ery rate, represented as the odds ratio logarithm, showed are separated by diphosphate, a compound with a smaller that 10 of the 22 CTSs used in this analysis are enriched mass compared with that of the others within the reaction. in oxidoreductases, which we also proved when testing These reactions in the KEGG were not assigned an the Kotera classification [4]. However, we discarded all RPair because they involve multiple steps. For this reactions without an assigned ECN, and we predicted 49 type of reaction, careful curation is required to evalu- CTSs that clustered 113 reactions that were not used in ate the consistency of our partitioning. Fisher’s exact test. These 49 CTSs had less than ten re- We can conclude that our rules, which are based on actions, and their small sample sizes made statistical intrinsic biochemical reaction properties and interac- tests less convenient. Therefore, the pair architectures tions of compounds across the whole network, organize found in these reactions are candidates for manual cur- reactions in individual tree structures, which is useful ation to test their reliability. Overall, we propose that for clustering considering only the proposed compound with confirmation of our curated set, our CTSs can arrangements. This approach is also helpful for easily serve as good guides for predicting reaction catalysis. identifying reactions with similar transformations. How- ever, for a small group of transformations, this simple Conclusions approach was insufficient, as shown in Fig. 5 for the Some methodologies based on graph theory organize MFP and FP categories. Therefore, this approach should compound networks into metabolic functional categories be improved for these cases in future work. without preserving biochemical pathways. Other methods Another advantage of our rule-based approach is based on chemical group exchange and atom flow trace that it does not eliminate any compounds during tree the conversion of substrates into products in detail, which assembly, allowing observation of the relative ordering is useful for inferring metabolic pathways. To analyze of compounds predicted by our rules within a reac- metabolic networks, we presented a rule-based approach tion and among reactions, a feature that is not avail- incorporating both methods that decomposes each reac- able in atom mappers. tion into architectures of compound pairs and loner com- Work published by Faust C. and coworkers [16] evalu- pounds that can be organized into tree structures. We ated the effect of using well-curated reactant pairs and found that the tree structure-compound pairs fit those other parameters on reconstruction of metabolic reported in the KEGG-RPAIR with excellent match Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 15 of 17 precision. The generated tree structures naturally clus- Weight metrics tered all reactions into 71 general reaction patterns of For the balance method, we used molecular weights compounds with similar chemical transformations. Fi- as reported in the COMPOUND dataset. For the nally, we evaluated the catalysis types in the clusters count method, we calculated the frequencies of all based on Enzyme Commission categories and revealed Cartesian products and used these numbers as the preferential use of enzyme classes. We demonstrated that weight measures. applying simple rules can allow identification of reaction patterns reflecting metabolic reactions that transform sub- Tree structure construction strates into products and the type of catalysis involved in For every reaction in the dataset, we constructed a TS. the transformation. The pairs generated using our We used Perl scripts to construct an algorithm based on rule-based approach can be incorporated as inputs to im- the calculated mass differences and frequencies of Cartesian prove the performance of pathfinders because products in the metabolic network to divide each reaction well-curated pairs provided better results, as demon- in the dataset. For this purpose, we created two rules, the strated by Faust and collaborators [7]. Our method is a re- balance and count rules. action classifier that can correlate EC numbers to our CTSs. Therefore, we propose that our method could con- Balance rule stitute a useful first step for the prediction of reaction ca- In a given reaction R, A and B are defined as the set of talysis, which can be conducted by simply incorporating chemical species on the left and right sides of R, respect- the discarded reactions that were not considered in this ively. Next, we define the operation P′(A) as the set of A analysis. Finally, using this last property, testing whether subsets, excluding the empty set. Elements of the Cartesian ′(A) ′(B) the enzymes clustered in the CTSs are evolutionarily re- product CP = P × P are the basic input of our proced- lated, will be useful, a task our group have started. ure. For each element (a, b) ∈ CP, excluding the whole In future work, we intend to model chemical transi- reaction, we calculate tions of generic reactions, such as those with discordant pairs. For these, we will introduce fewer generic rules d ¼jj ðÞ W −W ; ð1Þ ab a b that will consider other measures different from molecu- K lar weight, such as giving carbon atoms a greater weight where K = max (W , W ) and W is the sum of the because carbon flow is the main feature of metabolic a b a, b molecular weights for all the compounds in a given CP transformations. Additionally, as suggested by the results (ECP) element for sides A and B, respectively. for the reactions described in terms of their catalytic To clarify the operations described above, illustrating mechanism, we will include the roles of enzymes in the our procedure applied to the generic reaction C + D→ reaction layout. E + F nearly step by step is convenient. Therefore, P (A) ={{C}, {D}, {C, D}} and P (B)={{E}, {F}, {E, F}}. To be con- crete, when considering the ECP ({C}, {E, F}), Methods d ¼ jðW −ðW þ W ÞÞj , where W , W and Datasets C;EF C E F C E We analyzed datasets stored in the 2015 version of the W are the molecular weights of compounds C, E and F, KEGG database [11]. We collected information from the respectively. REACTION and COMPOUND datasets from the LIG- Next, we seek the minimum d among all the ele- ab AND collection. From REACTION, we collected IDs ments in the ECP. Because the reaction is balanced, in and equations (including coefficients) from 9910 reac- this case d = 0, we exclude the entire reaction. This whole tions along with ENZYMES and RPAIR data. For RPAIR, ECP with the minimum d will be considered the first we also used RCLASS identifiers and enzyme category branch of the TS and will be placed on the right side. In information regarding 15,349 entries; we retained only a subsequent step, we eliminate the remaining ECPs RPairs with RCLASS identifiers. From COMPOUND, we containing at least one ECP compound with the mini- collected the IDs, chemical formulas and molecular mum d. This is an iterative process that continues for weights of 7661 compounds. To limit our analysis to a each branch until a minimal difference among each ECP well-curated and verifiable set, all reactions that included cannot be established or no remaining ECPs are found. compounds from the GLYCAN dataset and reactions with incompletely described coefficients and subscripts Count rule were removed. We also removed 1099 reactions with The count rule is a means of selecting ECPs based on one compound on each side of the equation from the their occurrence in the metabolic network. This rule was analysis. Ultimately, 6392 curated reactions were in- inspired by other works reporting that some compounds cluded in the final set. are frequently used in metabolic transformations as Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 16 of 17 exchange currency [20–22]. For implementation of this the distribution of θ. The probability Ρ(θ| y) given y co- rule, we simply count the frequency of each ECP in the incidences between two datasets out of n trials follows a whole network and use this value as a measure to com- beta distribution. For practical purposes, we gave the ex- pare compounds within a reaction. When applied, we pectation value as a summary of the whole distribution. took the selected branch as the ECP with the highest fre- This value is easily calculated using the standard formula yþ1 quency. The ECP selected with the balance rule was ΕðθÞ¼ . Therefore, as in the present cases, when nþ2 placed on the right side, and the remaining ECPs were y, n ≫ 1, this result is very close to the ratio y/n placed on the left, discarding all ECPs with compounds [23]. Here, the parameters α and β are the number present in the other group. The balance rule is always of hits + 1 and fails + 1, respectively. This test was applied on an ECP first, and the count rule is used to performedfor theentireset andfor the22CTS disambiguate cases in which the balance rule fails to se- with at least ten reactions using the betainv function lect a single ECP. in GNU Octave 4.2.1. We represented the distribu- tion with the gnuplot 5.0 program. Reaction in a string format (RSF) After successive application of the rules, we con- Enrichment of ECN classes in TS patterns structed a representation visualized as a tree (Fig. 3 We tested for enrichment of the EC numbers classified and Additional file 1: Table S2). We also represented into CTSs by our method using a two-sided Fisher’s each TS in a JSON format (JavaScript Object Notation) exact test. Additionally, we controlled the false discovery and in two simplified formats. These formats are exempli- rate using the Benjamini-Hochberg procedure [24]. We fied below; Eq. 2a gives a generic syntax outline, and evaluated the strength of the enrichment using the odds Eq. 2b specify reaction R00760. ratio. For each possible combination of a given EC cat- egory “C” and a particular tree “X”, the odds ratio is de- rootðbalanceðÞ compound compoundðÞ compound compound ð2aÞ fined as (A/B)/(C/D), where A is the number of ECs of category “C” classified in tree “X”; B is the number of rootðÞ balanceðÞ C00095 C00085ðÞ C00002 C00008 ð2bÞ ECs that are not of category “C” classified in tree “X”; C >ðÞ !ðÞ C CðÞ C C ð2cÞ is the number of ECs of category “C” that are not in tree “X”; and D is the number of ECs that are not of category In Eq. 2c, “>” indicates the tree root for the TS, and “!” “C” and are not in tree “X”. We considered only odds ra- indicates the split according to the rule used (one mark tios with P values < 0.05. For this purpose, we extracted (“!”) indicates the balance rule, and two marks (“!!”) indi- EC number(s) related to each reaction from the KEGG cate the count rule). The number of parentheses around database, which corresponded to 4552 ECs distributed at each pair or loner compound show the depth at which it least one time in a reaction. Notably, 1134 of 8957 reac- is nested, indicating the number of partitions employed tions did not have ECs earmarked in the KEGG data- to construct the tree. “C” in Eq. 2c represents compound base. All the graphical representations were created positions independent of the specific compound. using R scripts developed in RStudio Version 1.0.136 and edited in Inkscape 0.91. CTS grouping For each reaction, a TS was proposed, and the architec- Endnotes tures found were represented as in Eq. 2c.The TSsavail- The power of a given set S is defined as the set of all S able for each reaction were clustered into clusters of TSs subsets, including the empty set {} and S itself. Therefore, (CTSs) according to their topology. Table 1 shows the re- the operation we performed on A and B can be described sultant clusters that were grouped considering the arrange- in standard mathematical notation as P′(S) = P(S) \ {{}}, ments but not the specific compounds within the reaction. where P(S) is the power of S, and the set difference is indicated by “\”. Pair validation We compared the pairs generated by our method with Additional file the RPair structures stored in the KEGG REACTION file. For this purpose, we counted the frequency with Additional file 1: Table S1. List of the reactions split by only the count rule or by the count rule in some step of division. Table S2. which a TS-compound pair was equal to an RPair with Representation of the CTS in a string and tree structure format. The first an RCLASS in the same reaction. We then estimated the column, shows the CTS ID; second column represents the reaction split posterior probability of successful data distribution ver- in a string; third column shows the graphical (node-edges), representation. Figure S1. CTS vs EC_two_digits comparison by the False sus having a failed pair as follows: Let θ be the corres- Discovery Rate. Figure S2. CTS vs EC_three_digits comparison by the pondence of the TS-compund pairs with the RCLASS False Discovery Rate. (PDF 3439 kb) set. Using Bayesian analysis, we were able to determine Vazquez-Hernandez et al. BMC Systems Biology (2018) 12:63 Page 17 of 17 Abbreviations 6. Oh M, Yamada T, Hattori M, Goto S, Kanehisa M. Systematic analysis of CTS: Cluster of tree structures; ECN: Enzyme Commission number; ECP: Element enzyme-catalyzed reaction patterns and prediction of microbial biodegradation of the Cartesian product; EPP: Entirely predicted pair; FP: Failed pair; G3P: D- pathways. J Chem Inf Model. 2007;47:1702–12. Glyceraldehyde 3-phosphate; MFP: Mixed-failed pair; MPP: Mixed-positive 7. Faust K, Croes D, van Helden J. Prediction of metabolic pathways from pair; Pi: Inorganic phosphate; RPair: Reactant pair; RSF: Reaction in a string format; genome-scale metabolic networks. Biosystems. 2011;105:109–21. [cited TS: Tree structure; TS-compound pairs: Tree structure-compound pairs 2013 Aug 21] 8. Heinonen M, Lappalainen S, Mielikäinen T, Rousu J. Computing atom mappings for biochemical reactions without subgraph isomorphism. Acknowledgements J Comput Biol. 2011;18:43–58. We thank Ricardo Ciria, Juan Manuel Hurtado and Shirley Ainsworth for their 9. Rahman SA, Cuesta SM, Furnham N, Holliday GL, Thornton JM. EC-BLAST: a technical and bibliographical support. We also want to thank Enrique Merino tool to automatically search and compare enzyme reactions. Nat Methods. for reviewing and providing comments regarding this manuscript. This study 2014;11:171–4. was supported by grant IN204515 from PAPIIT-UNAM awarded to RMGR. Carlos 10. Enzyme Nomenclature. Recommendations 1992. Supplement: corrections Daniel Vazquez was supported by a CONACyT scholarship and by the and additions. Eur J Biochem. 1994;223:1–5. [cited 2017 Aug 3]. Blackwell Consorcio de Investigación del Golfo de México (CIGOM) funded by the Fondo Publishing Ltd. Sectorial CONACYT-Secretaría de Energía-Hidrocarburos. 11. Kanehisa M, Goto S, Kawashima S, Nakaya A. The KEGG databases at GenomeNet. Nucleic Acids Res. 2002;30:42–6. Availability of data and materials 12. Hattori M, Okuno Y, Goto S, Kanehisa M. Development of a chemical structure The data supporting the findings of this study are available from the KEGG comparison method for integrated analysis of chemical and genomic under reference number BJ2100/vrk245yj, but restrictions apply to the availability information in the metabolic pathways. J Am Chem Soc. 2003;125:11853–65. of these data, which were used under license for the current study and are not 13. Hanes JW, Burns KE, Hilmey DG, Chatterjee A, Dorrestein PC, Begley TP. publicly available. However, data are available from the authors upon reasonable Mechanistic studies on pyridoxal phosphate synthase: the reaction pathway request and with permission of the KEGG database. leading to a chromophoric intermediate. J Am Chem Soc. 2008;32291–300. 14. Junker F, Field JA, Bangerter F, Ramsteiner K, Kohler H, Joannou CL, et al. Software and code Oxygenation and spontaneous deamination of 2-aminobenzenesulphonic All the programs used in our analysis are available at our web page acid in Alcaligenes sp. strain 0-1 with subsequent meta ring cleavage and http://www.ibt.unam.mx/biocomputo/reaCTS_programs.html. spontaneous desulphonation to 2-hydroxymuconic acid. Biochem J. 1994; 436:429–36. Authors’ contributions 15. Dorn E, Knackmuss H-J. Chemical structure and biodegradability of halogenated CVH conceived the original idea and developed the method. AL conceived aromatic compounds. Biochem J. 1978;174:73–84. and developed the presicion test. EPS conceived and developed the 16. Faust K, Croes D, van Helden J. Metabolic Pathfinding using RPAIR annotation. enrichment test. LS suggested ideas for data curation. RMGR improved the J Mol Biol. 2009;388:390–414. Elsevier Ltd method, analyzed the data, supervized the project and wrote the original 17. Latendresse M, Malerich JP, Travers M, Karp PD. Accurate atom-mapping draft. Writing – review and editing: CVH, AL EPS, LS and RMGR. All authors computation for biochemical reactions. J Chem Inf Model. 2012;52:2970–82. read and approved the final manuscript. 18. Kotera M, Okuno Y, Hattori M, Goto S, Kanehisa M. Computational assignment of the EC numbers for genomic-scale analysis of enzymatic reactions. J Am Ethics approval and consent to participate Chem Soc. 2004;126:16487–98. Not applicable. 19. Khersonsky O, Tawfik DS. Enzyme promiscuity: a mechanistic and evolutionary perspective. Annu Rev Biochem. 2010;79:471–505. Competing interests 20. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi A-L. The large-scale The authors declare that they have no competing interests. organization of metabolic networks. Nature. 2000;407:651–4. 21. Fell D, Wagner A. The small world of metabolism. Nat Biotechnol. 2000;18: 1121–2. Publisher’sNote 22. Lima-Mendez G, van Helden J. The powerful law of the power law and Springer Nature remains neutral with regard to jurisdictional claims in other myths in network biology. Mol BioSyst. 2009;5:1482–93. published maps and institutional affiliations. 23. Link C. Advances in Empirical Bayes Modeling and Bayesian Computation The Harvard community has made this article openly available. Please share Author details how this access benefits you. Your story matters. 2015; Departamento de Microbiología Molecular, Instituto de Biotecnología 24. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical Universidad Nacional Autónoma de México, Apdo, Postal 510-3, 62250 and powerful approach to multiple testing. J R Stat Soc Ser B (Methodol). Cuernavaca, Morelos, Mexico. Departamento de Ingeniería Celular y 1995;57(1):289–300. Biocatálisis, Instituto de Biotecnología Universidad Nacional Autónoma de México, Apdo, Postal 510-3, 62250 Cuernavaca, Morelos, Mexico. Received: 13 October 2017 Accepted: 9 May 2018 References 1. Muto A, Kotera M, Tokimatsu T, Nakagawa Z, Goto S, Kanehisa M. Modular architecture of metabolic pathways revealed by conserved sequences of reactions. J Chem Inf Model. 2013;53:613–22. 2. Ravasz E, Somera a L, Mongru D a, Oltvai ZN, Barabási a L. Hierarchical organization of modularity in metabolic networks. Science. 2002;297: 1551–5. [cited 2013 Aug 13] 3. Arita M. The metabolic world of Escherichia coli is not small. Proc Natl Acad Sci U S A. 2004;101:1543–7. 4. Kotera M, Hattori M, Oh M-A, Yamamoto R, Komeno T, Yabuzaki J, et al. (2004). RPAIR: a reactant-pair database representing chemical changes in enzymatic reactions. Genome Inf. 2004;15:P062. 5. Shimizu Y, Hattori M, Goto S, Kanehisa M. Generalized reaction patterns for prediction of unknown enzymatic reactions. Genome Informatics Ser. 2008; 20:149–58.

Journal

BMC Systems BiologySpringer Journals

Published: May 30, 2018

References

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off