Origin, evolution, and divergence of plant class C GH9 endoglucanases

Background: Glycoside hydrolases of the GH9 family comprise cellulases that predominantly function as endoglucanases and have wide applications in the food, paper, pharmaceutical, and biofuel industries. The partitioning of plant GH9 endoglucanases into classes A, B, and C is based on the differential presence of transmembrane, signal peptide, and carbohydrate binding module (CBM49) regions. There is considerable debate on the distribution and the functions of these enzymes, which may vary in different organisms. In light of these findings we examined the origin, emergence, and subsequent divergence of plant GH9 endoglucanases, with an emphasis on elucidating the role of CBM49 in the digestion of crystalline cellulose by class C members.

Results: Since the digestion of crystalline cellulose mandates the presence of a well-defined set of aromatic and polar amino acids and/or an attributable domain that can mediate this conversion, we hypothesize a vertical mode of transfer of genes that could favour the emergence of class C like GH9 endoglucanase activity in land plants from potentially ancestral non-plant taxa. We demonstrated the concomitant occurrence of a GH9 domain with CBM49 and other homologous carbohydrate binding modules in putative endoglucanase sequences from several non-plant taxa. In the absence of comparable full length CBMs, we have characterized several low strength patterns that could approximate the CBM49, thereby extending support for digestion of crystalline cellulose to other segments of the protein. We also provide data suggestive of the ancestral role of putative class C GH9 endoglucanases in land plants, which includes detailed phylogenetics and the presence and subsequent loss of CBM49, transmembrane, and signal peptide regions in certain populations of early land plants.
These findings suggest that classes A and B of modern vascular land plants may have emerged by diverging directly from CBM49 encompassing putative class C enzymes. Conclusion: Our detailed phylogenetic and bioinformatics analysis of putative GH9 endoglucanase sequences across major taxa suggests that plant class C enzymes, despite their recent discovery, could function as the last common ancestor of classes A and B. Additionally, research into their ability to digest or inter-convert crystalline and amorphous forms of cellulose could make them lucrative candidates for engineering biofuel feedstock. Keywords: Cellulase, Cellulose, Glycoside hydrolase, GH9, Endoglucanases, Phylogenetics Background presence/ absence of transmembrane (TM) and/ or signal Glycoside hydrolase 9 (GH9) endoglucanases utilize water peptide (SP) sub regions [1, 2]. Theabundantlypresent (EC3.x.y.z) to cleave the glycoside (1→ 4) or (1→ 3) bonds amorphous cellulose is enzymatically amenable to diges- between repeated monomeric β(D)-glucopyranose units of tion,and is thedefacto substratefor theseenzymes.How- cellulose and comprise sequences from all major kingdoms ever, an editing/ modifying function for crystalline cellulose of life [1, 2]. GH9 endoglucanases in land plants were previ- has been ascribed to class A endoglucanases, either exclu- ously clustered into classes A and B on the basis of the sively or in association with the cellulosome [3–5]. The dis- covery and further characterization of a carbohydrate binding module (CBM49) at the C-termini of previously * Correspondence: siddhartha_kundu@yahoo.co.in; rita.genomics@gmail.com Department of Biochemistry, Government of NCT of Delhi, Dr. 
Baba Saheb annotated GH9 endoglucanases (classes A and B) in Sola- Ambedkar Medical College & Hospital, New Delhi 110085, India num lycopersicum, Oryza sativa, Arabidopsis thaliana, and Crop Genetics and Informatics Group, School of Computational and Nicotiana tabacum conferred, on this family, catalytic Integrative Sciences, Jawaharlal Nehru University, New Delhi 110067, India © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Kundu and Sharma BMC Evolutionary Biology (2018) 18:79 Page 2 of 19 competency for crystalline cellulose [6–8]. The hydrogen- reciprocally, the existence of mixed function endo- and bond stabilized crystalline cellulose, is the preferred sub- exo-glucanases acting in tandem with biosynthetic cata- strate for bacteria, fungi, archaea, and protists, organisms lysts to modulate the composition of the encompassing which predate the emergence of green land plants by sev- cell wall matrix/ capsule/ coat [15–17]. Observations by eral millions of years [9–14]. The discovery, therefore, that several investigators suggest a correlation between exhib- a subset of plant GH9 endoglucanases could utilize crystal- ited function with the occurrence of sequence homology or line cellulose as its cognate substrate raises fundamental manifested enzymatic activity. 
Thus, despite the proximity questions not only on the evolution and ancestry of plant of divergence between multicellular green algae and GH9 endoglucanases, but also the functional relevance primitive land plants 470 − 480 Million years ago (Mya), of an additional hydrolase with a hitherto novel homologous GH9 endoglucanase sequences are either spectrum of catalytic activity. completely absent or at best partial and fragmented in uni- Cellulose, is a straight chain polymer of repeating cellular members (Chlamydomonas reinhardtii, Volvox car- units of β(1→ 4) linked D-glucopyranose residues and teri)[16, 17]. In contrast, bacteria (≅3200 − 3950 Mya), consists of microcrystalline (I , I ) and amorphous archaea (≅390 − 1350 Mya), protists (≅2000 − 3000 Mya), α β (I am, I am) regions (Fig. 1a and b). This heteroge- fungi (≅1000 − 1500 Mya), and some animals (180 − 670 α β neous distribution is dictated by the presence of a Mya) not just possess sequences with ascribable GH9 endo- rich inter-and intra-fibrillar hydrogen bond network. glucanase activity of crystalline cellulose, but also a demon- Whilst, the paucity of hydrogen bonds in the former strable and relevant function (Table 1)[18–40]. These facilitates enzymatic cleavage, the ordered structure of include modulation of sporulation (Dictyostelium spp.,clos- the latter, imposes constraints on the activity profile tridiales, bacillales), host-pathogen interactions (fungi, nem- of plant GH9 endoglucanases. Natural cellulose is atodes, protists, plants), repair and survival (Euryarchaea), rarely pure (Gossypium spp., 90%), and is frequently and preventive desiccation (bacteria, Dictyostelium spp.) found in association with other carbohydrates (hemi- [15, 41–49]. Genomic evidence of GH9 endoglucanases in cellulose) and/ or other macromolecules (lipids, pro- some animals (marine invertebrates, termites, arthropods, teins). 
The presence of these complexes would also imply, parasitic and saprophytic nematodes), in the absence of Fig. 1 Taxonomic distribution and analysis of the GH9 domain in putative endoglucanse sequences. a Molecular structure of cellulose with repeating units of D-glucopyranose linked by a β(1→ 4) glycosidic bond. The liberated mono- or oligosaccharides either retain the β-hydroxyl group (retaining), or are inverted (α-hydroxyl) after transformation, b Generic reaction mechanism of hydrolytic GH9 endoglucanase (EC 3.2.1.4) mediated transformation of cellulose into simpler oligo- and/or mono-saccharides, c Alignment compatible sequences of GH9 domains from putative GH9 endoglucanases across all taxa (n = 607). Abbreviations: GH9, glycoside hydrolase; EC, enzyme commission Kundu and Sharma BMC Evolutionary Biology (2018) 18:79 Page 3 of 19 Table 1 Literature based divergence rates of taxa utilized for itself [4, 5]. The presence of signal peptide regions, in calibrating the time trees contrast, posits that these enzymes may be secreted Taxa Divergence (Mya) and digest cellulose extracellularly. Such a mechanism might benefit fungal pathogens of plants, may be de- Bacteria 3200–3950 ployed by termites, and participate in glucose extrac- Protists 1600–3000 tion in ruminants as well [15, 42, 44, 48]. The Archaea 500–2500 proportion of sequences that exhibit class B and C Animals 1000 activity is subject to much debate. Whilst, a simple Crustacea 511 sequence similarity suggests a preponderance of class Insects 396 B members, complex classification schema using hidden markov models (HMM) and artificial neural networks C.intestinalis 180 (ANN) indicates a marginally greater number of putative Chordates 542 class C GH9 endoglucanases in primary transcript data Arthropoda 540 from sequenced land plants [16, 58–60]. 
Fungi 1500 The potential importance of class C enzymes in biomass Green algae 500–2000 conversion notwithstanding, a paradigm shift in the chem- Bryophytes 470–475 ical nature of cellulose, the inconsistencies in the numbers observed between predicted and observed members, and Tracheophytes 395–425 a conserved reaction chemistry in extant non plant taxa, Land Plants suggest that plant class C GH9 endoglucanases may pre- Monocots 90–141 date classes A and B enzymes [16, 58–61]. Here, we at- Eudicots 90–141 tempt to resolve some of these queries by investigating the Rosids 108–117 origins, evolution, and subsequent divergence of the GH9 Asterids 107–117 domain in putative plant endoglucanase sequences, with Abbreviations: Mya Millions of years particular emphasis on the contribution of class C mem- bers. The role of the aromatic (W/ Y / F) and polar un- demonstrable function, was postulated to have occurred charged (S/T/N/Q) is critical to the functioning of during phases of co-infection with gastrointestinal and oral endoglucanases in the presence and absence of well- microbiota [15, 42, 44, 45, 50–54]. However, the con- defined CBMs, and, in the presence of low complexity re- firmed presence in numerous other animals, similarity gions their incorporation into the GH9 domain might in substrate and reaction chemistry, and sequence constitute the only measure of approximating the CBM49 conservation, along with supporting laboratory data [62–64]. These residues despite being non-catalytic them- has refuted much of this horizontal transfer mode of selves have been shown to confer the capacity on the gene transfer [15, 41, 42, 44, 45, 55–57]. Davison and encompassing enzymes to discriminate between related li- Blaxter suggested a single origin of GH9 genes based gands (cellulose/ X, X = {xylose, lignin, chitin; β-1,3/β-1,4), on monophyly in the phylogenetic tree and conserved effect and in some cases even the binding affinity for a intron positions [55]. 
cognate substrate, contribute to processivity and thermal In land plants (Viridiplantae), the activity profile of stability, and interestingly introduce catalytic competency GH9 endoglucanases on cellulose, correlates, in part, [62–78]. We utilize a combination of phylogenetic ana- with their distribution, as well as the purported roles lysis, pattern approximation, identification, distribution in growth, development, flowering, and seed germin- analysis, and residue mapping of the CBM49 to investigate ation [16]. The carbohydrate binding modules/ do- the emergence of crystalline cellulose digesting activity in mains (n = 64), are sequences 40 − 200 aa in length, land plants. Finally, we complement these analyses by and despite being intrinsically non catalytic can facili- examining the presence and distribution of transmem- tate the hydrolytic cleavage of the glycosidic linkage brane and signal peptide regions in vascular land plants, [47]. Unlike the C-terminally localized CBM49 of and the possible routes by which endoglucanase se- plant GH9 endoglucanases, different CBMs favouring quences with putative class C activity could contribute to the activity on crystalline cellulose in bacteria, fungi, the emergence of sequences with novel functionality. protists, animals, and possibly archaea and green algae are distributed throughout the length of the se- Methods quence [16]. 
The presence of one or more TM re- Collation, annotation, and domain extraction of GH9 gions also suggests that at least in plants cellulose endoglucanases metabolism may occur in clusters of (biosynthetic, de- Sequences of putative GH9 endoglucanases were down- grading enzymes) and be localized at the membrane loaded from the publically available databases National Kundu and Sharma BMC Evolutionary Biology (2018) 18:79 Page 4 of 19 Center for Biotechnology Information (NCBI; http:// opening = gap extension = 10), with gap opening penal- www.ncbi.nlm.nih.gov) and Carbohydrate-active en- ties of 0.1 (pairwise alignment) and 0.2 (MSA), a diver- zymes (CAZy; http://www.cazy.org/) [16, 79, 80]. Se- gence cut off of 20%, and the BLOSUM62 set of quences of green land plants (Viridiplantae) utilized for matrices (Additional file 5: Table S1, Additional file 1: this analysis were downloaded from Phytozome (https:// Text S1 and Additional file 3: Text S4) [85, 86]. This phytozome.jgi.doe.gov/pz/portal.html), extensively cu- was chosen to account for the purported domain distri- rated, and classified into classes A, B, and C as described bution of classes A, B, and C among the various taxa. previously [16]. Annotation for non-plant GH9 endoglu- Sequences were deemed compatible if and only if their canases was in accordance with the schema adopted by pairwise alignments were free from errors as determined dbCAN (Carbohydrate enzyme annotation; http://csbl. by the distance matrix computed by MEGA7.0. The top bmb.uga.edu/dbCAN)[81]. 
The pooled sequences were scoring amino acid substitution models for the afore- filtered on the basis of their contribution to a compatible mentioned MSAs was selected amongst all (n = 56) using multiple sequence alignment (MSA) and the presence of the Akaike information criteria corrected (min(AICc)) a single GH9 domain as determined by MEGA7.0 (Mo- and the Bayesian information criteria (min(BIC)) as indi- lecular evolutionary genetic analysis, local installation) ces (Additional file 6: Table S3). BEAST v2.4.7 (Bayesian and the SMART (Simple modular architecture research evolutionary analysis by sampling trees) and the accom- tool) server [82–84]. Exclusion criteria for this prelimin- panying software suite (FigTree v1.4.3, DensiTree, Tracer ary data were: a) an indeterminable MSA, b) the complete v1.6, TreeAnnotator) was utilized to infer the date and absence of a demonstrable GH9 domain, c) more than visualize a maximum clade credibility tree with median one GH9 domain ((GH9) : x > 1) in the same sequence, heights, and tabulate descriptive statistics after the poster- and d) presence of a concomitant GH domain other than ior probabilities converged (Tables 2 and 3; Additional file GH9 ((GH9 ∧ GHx): x∈[1, 8] ∧ [10 − 130]). Amino acids at 7: Table S4) [87–89]. Whilst, the age of the node and the the start and end positions of the GH9 domains were branch times of the clades were inferred directly (Mya), noted and extracted (n ) using in-house developed PERL support was denoted as the posterior probabilities (PP%) scripts (Additional file 1: Text S1, Additional file 2:Text and bootstrap values (n = 1000) by maximum likelihood S3, and Additional file 3: Text S4). Here, the final set of (ML%), i.e., support = PP %, ML%, (FigTree v1.4.3). 
compatible sequences of the GH9 domains (n ), pattern Whilst, the selection of the root for evaluating the 1A selected GH9 domains (n , n ), pattern selected and evolution of the GH9 domain (parent of the bacterial 1B 1C GH9 encompassing full length sequences (n ), CBM49/ clade), was based on fossil records that suggested that CBM49-like sequences of land plants (n = n ; X ={A, bacteria were amongst the earliest forms of life 3X LPSX B, C}) comprised the datasets utilized in this study. The (≈3170 − 4180 Mya), the same for the CBM49/ distinct and delineable CBM49 from putative class C GH9 CBM49-like land plants was the presence of a distinct endoglucanases was similarly isolated and comprised (n and delineable CBM49 in the ancestral bryophytes 3C = n ) (Additional file 4: Text S2). The amino acid con- and tracheophytes coupled with the assumption that LPSC tent of the extricated GH9 and CBM49 domains were the parent of class C vascular land plants (≈201 − assessed using PIR (Protein information server, http://pir. 241 Mya) were likely to possess the same architecture georgetown.edu) and categorized on the basis of side (Table 2; Additional file 5: Table S1 and Additional chain content into those with hydrophobic side chains file 8:Table S2)[18, 19]. (HSC), aromatic amino acids (AAA), polar uncharged (PUC), polar charged acidic (PCA), and polar charged basic (PCB). 
The GH9 domains were used for phylogen- Pattern analysis and motif approximation of CBM49 in etic analysis and time tree estimation (n ), CBM49 was putative GH9 endoglucanases 1A utilized for pattern analysis and motif approximation (n ) The boundaries of CBM49 were defined in characterized 3C , and CBM49-like full length sequences from plant and and putative class C GH9 endoglucanase sequences with non-plant taxa were utilized for assessing relevant bio- single- and multiple-copies of the GH9 domain (n =116) informatics indices (n , n )(Additional file 1:Text S1, (Additional file 5: Table S1C and Additional file 8:Table 1B 1C Additional file 4: Text S2, Additional file 2:TextS3 and S2B) [6–8, 83, 84]. These were then clustered, realigned, Additional file 3:TextS4). and represented using the Clustal Omega and WebLogo servers (https://www.ebi.ac.uk/Tools/msa/clustalo; http:// Model selection, phylogenetic analysis, and time tree weblogo.berkeley.edu/logo.cgi) with default parameters estimation [90–92]. The refined list of CBM49 sequences in (n =100) Multiple sequence alignments (MSA) of the extracted class C GH9 endoglucanases were then submitted to the GH9 domains and the CBM49/ CBM49-like in land PRATT v 2.1 server (http://web.expasy.org/pratt), and uti- plants were generated using the default parameters (gap lized to identify and score suitable domain spanning Kundu and Sharma BMC Evolutionary Biology (2018) 18:79 Page 5 of 19 Table 2 Parameters utilized for Bayesian inference of evolution Table 3 Taxonomic distribution of bacteria in datasets of the GH9 and CBM49 domains Dataset n n 1 2 Site Model: Gamma Number of sequences 116 64 Subsitution rate = 1.0 1. Firmicutes 65 44 Substitution model: WAG/ JTT Clostridiales 49 40 Gamma category count = 5 Bacillales 15 4 Shape: 1.537/ 0.813 Selenomonadales 1 – Proportion invariant: NA/ 0.027 2. 
Actinobacteria 20 10 Clock Model: Relaxed clock Log Normal Micrococcales 3 1 Number of Discrete Rates = − 1 Streptomycetales 11 6 Clock rate = 1.0 Streptosporangiales 2 1 Calibrated Yule model Birth rate = 1.0 Micromonosporales 2 – Type (Full) Pseudonocardiales 2 2 birthRate Model: Gamma 3. Proteobacteria 20 5 Initial = 1.0 [−∞, ∞] Gamma (γ)14 3 α = 1.0E − 03 Alpha (α)4 2 β = 1.0E +03 Delta (δ)1 – Mode:= Shape Scale Undefined 1 – Offset = 0.0 4. CFB 9 3 gammaShape Model: Gamma 5. Cyanobacteria 1 – Initial = 1.0 [−∞, ∞] α = 1.0E − 03 6. Undefined 1 1 β = 1.0E +03 Abbreviations: GH9 Glycoside hydrolase 9, CFB Chlorobi, Mode: Shape Scale Fibrobacteres, Bacteroidetes Offset = 0.0 Population mean Model: Exponential patterns [93]. A profile of these patterns (n = 20) was gen- erated based on the numbers of putative class C enzymes Initial = 1.0 [−∞, ∞] μ = 10.0 that they were found in, i.e., 5→ 100 (Table 4). This was Offset = 0.0 used to search for sequences with CBM49-like motifs amongst full length GH9 endoglucanase sequences with- Uncorrelated relaxed local clock mean Model: Exponential Initial = 1.0 [−∞, ∞] outa delineable CBM49region, andonthe GH9domainit- μ = 10.0 self and was accomplished using the server ScanProsite Offset = 0.0 (http://prosite.expasy.org/scanprosite)(Additional file 9: Uncorrelated relaxed local clock Model: Exponential Table S5). These datasets (n , n , n , n ) along with the standard deviation 1B 1C 2 3 Initial = 1.0 [−∞, ∞] subset of was used for all further analyses (Tables 4, 5 and σ = 0.3337 6; Additional file 9: Table S5, Additional file 10: Table S6, Offset = 0.0 Additional file 11: Table S7 and Additional file 12:Table Root Parent of: Bacteria/ Vascular class C S8, Additional file 13 Text S9, Additional file 16:TextS10, land plants Additional file 14: Text S11 and Additional file 15:Text Monophyletic S12). 
Alternatively, a Hidden Markov Model or support Model: Log Normal vector machine(SVM) may havebeenutilizedfor μ = 8.2/ 5.41 this part of the analysis. SVMs, are binary classifiers σ = 0.07/ 0.055 and incorporate several features of the training se- Offset = 0.0 quences to determine presence/ absence in an un- 2.5% Quantile = 3170/ 201Mya known sequence of interest. Whilst the SVM for the 97.5% Quantile = 4180/ 249Mya CBM49 could have been easily constructed, its utility Markov chain monte carlo Chain length = 14,917,000 / in identifying the same in a distantly related se- 16,120,000 quence is likely to be limited. The HMM, however, Pre Burnin = 4,200,000/ 2,130,000 for this specific module hand would simply indicate Recording interval = 1000 the existence of a similar region above a certain threshold. Since, our requirement mandated features Abbreviations: GH9 Glycoside hydrolase 9, CBM49 Carbohydrate binding module 49, WAG Whelan and Goldman, JTT Jones, Taylor, and Thornton of both these, i.e., presence/ absence of CBM49-like Kundu and Sharma BMC Evolutionary Biology (2018) 18:79 Page 6 of 19 Table 4 Alignment based pattern analysis of CBM49 in putative and characterized class C GH9 endoglucanases Motif Fs Sm Rm 1 GPIWGLTK[AS]G[DN]SY[GTV]FP[EST][HW][IL][NS][ST]L[APS][AV]GKS[LM]EFVYIH[AS][AT]S 140.4529 5 2.47E-35 2 GPIWGL[ST][KR]SG[DN]S[FY][AGT][FL]P[EST][HW][ILM]x[ST]Lx[AS]GKSLEFVYIH[AS][AT][ST] 131.0036 10 3.02E-32 3 GPIWGL[NST]x(2)[GP][DENQ]x(2)[AGTV] 75.6634 15 7.32E-04 4 GPIWGL[ST]x(2)[GP][DEN]x(2)[AGTV]x[PV]x(4)[STV]x(3)[GQ]x[GS]xE[FV][NV][FY][IV][HY][ASTV][AQT][GPST] 35.2349 20 5.86E-16 5 GPIWG[LV][NST]x[AST][GP][DENQT]x(2)[AGSTV] 36.2141 25 4.10E-04 6 GPIWG[LV][ANST]x(2)[GP][DENQT]x(2)[AGSTV] 33.2127 30 2.92E-03 7 GPI[WY]G[LV][ANST]x(3)[DENQT]x(2)[ADGSTV] 28.9138 35 1.00E-01 8 GP[IL]WGL[ANST]x(3)[ADEGNQ] 28.034 40 0.16 9 GP[IL]WG[LV][NST]x(3)[DENQT] 27.6347 45 0.14 10 GP[IL]WG[LV][AENST]x(3)[ADEGNQT] 26.4888 50 0.39 11 GP[IL][WY]G[LV][AENST]x(3)[ADEGNQT] 
25.6627 55 1.4 12 GP[ILV][WY]G[LV][AENST] 23.5981 61 4.9 13 GP[ILV]xG[LV] 18.3557 65 356 14 G[NPS][IL][WY]G[LV][ANST] 22.9977 70 9.2 15 G[NPS][ILV]WG[LV] 20.977 77 15 16 G[NPS][ILV][WY]G[LV] 20.1508 80 53 17 G[DNPQS][ILV][WY]G[LV] 19.4127 85 83 18 G[DENPQST]x(2)G[LV] 12.9137 90 12,445 19 Gx[ILV][WY]G[LV] 17.5296 98 323 20 Gx(3)G[LV] 11.5238 100 33,184 Abbreviations: Fs Fitness score, E Glutamic acid, Sm Number of sequences matched, Q Glutamine, Rm Estimated number of random matches, S Serine, A Alanine, T Threonine, L Leucine, C Cysteine, M Methionine, Y Tyrosine, I Isoleucine, F Phenylalanine, V Valine, W Tryptophan, G Glycine, K Lysine, D Aspartic acid, H Histidine, N Asparagine, P Proline, R Arginine, x Any amino acid regions in GH9 domain containing endoglucanases C) (Additional file 6 Table S3, Additional file 7:Table S4, across taxa, these predictors of the extrema would Additional file 9: Table S5, Additional file 10:Table S6 and not have sufficed. Additional file 11:Table S7,Additional file 1: Texts S1, Additional file 4:Texts S2,Additionalfile 2: Texts S3 and Domain analysis of plant GH9 endoglucanases Additional file 3: Texts S4. 
Since the methods discussed af- The above compiled datasets (n − n )weremeant to offer ford compelling evidence of the ancestral nature of class C 1 3 an insight into the origin and evolution of the GH9- GH9 endoglucanase sequences, our subsequent analyses CBM49-like domain across all taxa, the end point being the (domain frequency) was focussed on establishing potential emergence of plant GH9 endoglucanases (classes A, B, and divergence of class C members and/ or the emergence of Table 5 Distribution of sequence segments in classes A, B, and C plant GH9 endoglucanases MEMSAT-SVM DAS PHOBIUS SP TM SP TM SP TM C0 (NN) 0.0000 (0/97) 0.0588 (3/51) 0.0000 (0/100) C1 (YY) 0.7525 (73/97) 1.0000 (97/97) 0.8604 (37/43) 0.8958 (43/48) 0.0000 (0/2) 0.0200 (2/100) C2 (NY) 0.2474 (24/97) 0.1395 (6/43) 1.0000 (2/2) B0 (NN) 0.0000 (0/75) 0.0196 (1/51) 0.0533 (4/75) B1 (YY) 0.8133 (61/75) 1.0000 (75/75) 0.5600 (28/50) 0.9803 (50/51) 0.3333 (1/3) 0.0422 (3/71) B2 (NY) 0.1866 (14/75) 0.4400 (22/50) 0.6667 (2/3) A0 (NN) 0.0000 (0/22) 0.0454 (1/22) 0.0909 (2/22) A1 (YY) 0.0000 (0/22) 1.0000 (22/22) 0.0000 (0/21) 0.9545 (21/22) 0.0000 (0/20) 0.9090 (20/22) A2 (NY) 1.0000 (22/22) 1.0000 (21/21) 1.0000 (20/20) + + Abbreviations: SVM Support vector machine, SP Signal peptide, TM Transmembrane region, DAS Density alignment surface, YY (SP ) ∧ (TM ∨ PH ∨ RH) , NY − + − − (SP ) ∧ (TM ∨ PH ∨ RH) , NN (SP ) ∧ (TM ∨ PH ∨ RH) Kundu and Sharma BMC Evolutionary Biology (2018) 18:79 Page 7 of 19 Table 6 Salient features of putative GH9 endoglucanase sequences with multiple delineable domains GH9 CBM2 CBM3 CBM4_9 CBM10 CBMX_2 CBM49 pattern 20 ALS (n = 3) gi|313241202 Y Y (cT) Y gi|260808721 Y Y (nT) Y gi|254553092 Y Y (nT) Y BAC (n = 24) gi|15894203 Y Y (cT) Y gi|15894200 Y Y (cT) Y gi|15893851 Y Y (nT) Y gi|300789210 Y Y (cT) Y (nT) Y gi|300785821 Y YY (cT) Y gi|121833 Y YY (cT) YY (cT) Y gi|320006799 Y Y (cT) Y (nT) Y gi|295094191 Y Y (nT) Y gi|291544575 Y Y (nT) Y gi|291543938 Y Y (cT) Y 
gi|34811382 Y Y (cT) Y gi|34811081 Y Y (cT) Y gi|2554767 Y Y (cT) Y gi|551774 Y Y (cT) Y gi|311900744 Y Y (cT) Y gi|311900370 Y Y (cT) Y (cT) Y gi|270288703 Y Y (cT) Y gi|270288702 Y Y (cT) Y gi|270288700 Y Y (nT) Y gi|270288699 Y Y (cT) Y gi|39636954 Y Y (cT) Y gi|6272570 Y Y (nT) YYYYY gi|237858935 Y Y (cT) Y (cT) YY gi|4490766 Y YY (cT) YYY PRS (n = 2) gi|281207043 Y Y (cT) gi|281207029 Y Y (cT) Abbreviations: ALS Animals, BAC Bacteria, PRS Protists, Y Present, nT N-terminal, cT C-terminal, GH9 Glycoside hydrolase 9, CBM Carbohydrate binding module classes A and B. Plant GH9 endoglucanase sequences pos- sequence as strong (TM), weak pore-lining (PH), or re- sess a differential distribution of TM, SP, and CBM49 re- entrant (RH), i.e., (TM ∨ PH ∨ RH). [94, 100]. The dense gions. and the frequency of occurrence of these was alignment surface (DAS-TMfilter) differs from other pre- analysed by directly comparing CBM49 positive class C dictors of transmembrane regions in considering hydropho- members (n = n = 97) with pattern 20 selected se- bic region(s) of a query protein, and mapping the results to 3C LPSC quences of putative classes A (n = n = 22) and known transmembrane regions [95, 96]. PHOBIUS, is a 3A LPSA B(n = n = 75) (Additional file 10:Table S6, hidden Markov model based delineator of signal peptide re- 3B LPSB Additional file 3:TextS4). Since,the hydrophobic gions and uses sub models of the sequences that comprise profile of these regions overlap, we utilized data these regions along with topology information to make pre- from three algorithms that predict both TM and SP dictions [101]. regions to arrive at a consensus. 
The servers con- sulted were: MEMSAT-SVM, DAS-TMfilter, and PHOBIUS Algorithm to assess contribution of prediction method to [94–101](Additional file 11: Table S7, Additional file 13: each sub segment Text S9, Additional file 16: Text S10 Additional file 14: Full length sequences of land plants encompassing the Text S11 and Additional file 15:Text S12).The MEMSAT- CBM49-pattern 20,i.e., classesA,B,and C(n =(n = 3 3LPSA SVM classifies membrane spanning helical regions in a n )+(n = n )+(n = n ) = 187) were searched 3A 3LPSB 3B 3LPSC 3C Kundu and Sharma BMC Evolutionary Biology (2018) 18:79 Page 8 of 19 for well defined amino acid segments using the aforemen- archaeal sequence (Methanohalobium evestigatum; tioned servers (MEMSAT-SVM, DAS, PHOBIUS). The tr|D7E938). This sequence has a predicted GH9 subset (NN) was used to define sequences without deline- domain length of 222 aa (Eval =1.2E − 08), and sub able TM and SP regions (NN ={C0, B0, A0}). The method optimally aligned sequences are likely to have inflated of choice was determined by rendering the resultant data scores in excess of the threshold for inclusion. On the other equivalent and therefore, comparable. 
The definitions utilized are as under:

TM := Sequences with one or more predicted transmembrane domains
SP := Sequences with one or more predicted signal peptide regions
PH := Sequences with one or more predicted pore lining helices
RH := Sequences with one or more predicted re-entrant helices

$$NN := (SP)^{-} \wedge (TM \vee PH \vee RH)^{-}$$
$$NY := (SP)^{-} \wedge (TM \vee PH \vee RH)^{+}$$
$$YY := (SP)^{+} \wedge (TM \vee PH \vee RH)^{+}$$
$$Y := (TM \vee PH \vee RH)^{+}$$

Step 1: Sequences with negative predictions for both SP and TM regions, $f(NN) \leftrightarrow \mathbb{N}$ and $\{x \in NN \subset n_3 \mid (SP^{-}) \wedge (TM \vee PH \vee RH)^{-}_{i},\ i \in \mathbb{N}\}$, were removed from the computations.
Step 2: The remaining sequences were assessed for the presence of the transmembrane subregions, $f(Y) \leftrightarrow \mathbb{N}$ and $\{x \in Y \subset n_3 \mid (TM \vee PH \vee RH)^{+}_{i},\ i \in \mathbb{N}\}$.
Step 3: The data computed in Step 2 was then used to calculate the number of sequences with or without an associated signal peptide region, $f(NY) \leftrightarrow \mathbb{N}$ and $f(YY) \leftrightarrow \mathbb{N}$, i.e., $\{x \in NY \subset n_3 \mid (SP^{-}) \wedge (TM \vee PH \vee RH)^{+}_{i},\ i \in \mathbb{N}\}$ and $\{x \in YY \subset n_3 \mid (SP^{+}) \wedge (TM \vee PH \vee RH)^{+}_{i},\ i \in \mathbb{N}\}$.
Step 4: The data from the above steps was used to compute ratios that establish equivalence between the predictions, and thereby a rationale for their subsequent inclusion/exclusion, $\left(\frac{|NY|}{|Y|};\ \frac{|YY|}{|Y|}\right)$.

Results

Taxonomic distribution of the GH9 domain

The GH9 domain averages ≈448 aa, and is present as a single copy in the sequences investigated (n = 607), i.e., bacteria (BAC), land plants (LPS), animals (ALS), fungi (FGI), green algae (GAL), protists (PRS), and archaea (ARC) (Fig. 1c; Additional file 5: Table S1A). Although the vast majority of sequences selected for this study were putative GH9 endoglucanases, empirical data (kinetic, transcript data, 3D structure) were available for many of these taxa and were included (n_LPS = 26; n_ALS = 1; n_BAC = 11). Whilst most sequences possessed alignment compatible GH9 domains (n_1A = 601), there were a few sequences (n = 6) which could not be aligned and were not utilized in the estimation of divergence of GH9 domains across taxa (Additional file 5: Table S1A, Additional file 1: Text S1). The source of error was most likely the [...]

[...] hand, despite possessing GH9 domains of suitable length, the lower confidence levels of the HMM predictor for α-proteobacteria (Asticcacaulis biprosthecum; gi|328841530, gi|328840708; Evals = 2.20E−17, 8.40E−15), and a member each of the Chlorobi-Fibrobacter-Bacillales (CFB) ancestral phylum (Bacteroides fluxus YIT 12057; gi|328530713, gi|328531610; Evals = 2.80E−25, 2.40E−23) and subgroup Bacillales of the Firmicutes (Listeria innocua; gi|313621564; Eval = 1.50E−20) were probable confounders for the alignment mismatch (Additional file 5: Table S1A). The bacterial subgroup comprised Gram negative (proteobacteria) and Gram positive organisms (members of CFB phylum, cyanobacteria, firmicutes, and bacillales) (Table 3, Fig. 1c). However, multiple distinct representations of the GH9 domain in one protein are not uncommon, and are present as two or four (Saccoglossus kowalevskii; gi|291236258) copies (n = 16; n_ALS = 7, n_BAC = 2, n_LPS = 7) (Additional file 5: Table S1B). Additionally, we observed the concomitant presence of heterogenous glycoside hydrolase domains in some bacterial species (n_BAC = 4), which included Caldocellum saccharolyticum (gi|1708078; GH9, GH48), Ruminococcus champanellensis (gi|291543673; GH9, GH16), Ruminiclostridium thermocellum (gi|1663519; GH9, GH44), and Caldicellulosiruptor spp. (gi|12743885; GH9, GH44) (Additional file 5: Table S1C). Interestingly, despite being classified as GH9 members, only the anaerobic methanogen (Methanohalobium evestigatum; tr|D7E938) of the archaea subgroup Euryarchaeota possessed the requisite GH9 domain (Additional file 5: Table S1D).

Evolution and emergence of the GH9 and CBMs in plant and non plant taxa

The data suggests that the GH9 domain is conserved across all taxa and a catalytically functional copy may have been present in bacteria (≈3000 Mya; support = 100%, 96%) (Fig. 2; Additional file 7: Table S4A, Additional file 17: Text S5 and Additional file 18: Text S6). Interestingly, the clades of the land plants and green algae appear to have diverged relatively early and independently of the animals, fungi, and the protists (≈1961 Mya; support = 100%). The GH9 domains of the land plants and green algae continued to evolve for another ≈1750 Mya, finally diverging from each other relatively recently (≈211 Mya; support = 97%). In contrast, the protists diverged from animals and fungi (≈817 Mya; support = 97%), whilst GH9 domains of animals and fungi diverged from each other (≈11 Mya; support = 97%).

Fig. 2 Evolution of GH9 domain. A Bayesian inference (BI) dated tree was estimated (maximum clade credibility) from the computed tree population (n = 4476; burn-in = 70%) using the WAG amino acid substitution model and parent of the clade of bacteria as the root. Whilst node ages (= node height = branch time of the longest diverging taxa) and branch times are in Mya, support for branch points is indicated by the posterior probabilities (PP%) and bootstrap values (n = 1000; ML%), i.e., support = PP%, ML%. The root for this tree was the parent of bacteria (3170−4180 Mya). The log likelihood for this tree was (≈−0.0838233). Abbreviations: BI, Bayesian inference; GH9, glycoside hydrolase; Mya, millions of years; WAG, Whelan and Goldman
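The set definitions (NN, NY, YY, Y) and Steps 1 to 4 of the filtering procedure given earlier in this section can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the `Prediction` record, the function names, and the toy input are hypothetical, and the label "YN" (signal peptide predicted, no helix predicted) is an assumption added here for completeness because the text does not name that case.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Prediction:
    """Hypothetical per-sequence summary of predictor output."""
    seq_id: str
    sp: bool  # signal peptide predicted
    tm: bool  # transmembrane helix predicted
    ph: bool  # pore lining helix predicted
    rh: bool  # re-entrant helix predicted


def label(p: Prediction) -> str:
    """Assign NN, NY or YY as defined in the text ("YN" is our own label)."""
    helix = p.tm or p.ph or p.rh  # (TM v PH v RH)
    if helix:
        return "YY" if p.sp else "NY"
    return "YN" if p.sp else "NN"


def step4_ratios(preds: Iterable[Prediction]) -> Optional[Tuple[float, float]]:
    """Return (|NY|/|Y|, |YY|/|Y|), or None when the set Y is empty."""
    labels = [label(p) for p in preds]
    kept = [lab for lab in labels if lab != "NN"]     # Step 1: drop NN
    y = [lab for lab in kept if lab in ("NY", "YY")]  # Step 2: helix-positive set Y
    if not y:
        return None
    n_ny, n_yy = y.count("NY"), y.count("YY")         # Step 3: split Y by SP status
    return n_ny / len(y), n_yy / len(y)               # Step 4: ratios


# Toy input (made-up predictions, not data from the study).
toy = [
    Prediction("a", sp=False, tm=True, ph=False, rh=False),   # NY
    Prediction("b", sp=True, tm=False, ph=True, rh=False),    # YY
    Prediction("c", sp=True, tm=False, ph=False, rh=True),    # YY
    Prediction("d", sp=False, tm=False, ph=False, rh=False),  # NN, removed
    Prediction("e", sp=True, tm=False, ph=False, rh=False),   # YN, not in Y
]
ratios = step4_ratios(toy)
```

Running the same function on the output of each predictor would give one pair of ratios per tool, which is the kind of comparison Step 4 describes.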
A generic timeline for the evolution of the GH9 domain, i.e., BAC > PRS > {FGI, GAL, ALS, LPS}, is perfectly plausible (Fig. 2). We also posited, and thence investigated, the contribution of non-GH9 regions (CBM49, linker(s)) to substrate dichotomy (crystalline, amorphous) in plant GH9 endoglucanases. We observed distinct and delineable CBM49s (79−84 aa; median = 81 aa) in putative class C GH9 endoglucanase sequences of flowering land plants (n = 102) after outlier exclusion (n = 2; Zea mays, GRMZM2G143747_P01; Selaginella moellendorffii, 109529) (Additional file 8: Table S2A and B). The only exception was the presence of a single CBM49 (82 aa) in the protist, Polysphondylium pallidum PN500 (gi|281207043, gi|281207029) (Additional file 5: Table S1A). Remarkably, our results indicate a unique copy of CBM49 in bryophytes (n = 4; Physcomitrella patens) and tracheophytes (n = 3; S. moellendorffii) (Additional file 8: Table S2). Analysis of the primary sequences also indicates the presence of one or more linker sequences connecting the GH9 to the CBMs. In CBM49 class C sequences this constitutes 7–77 aa (Prunus persica, ppa022524m; Phaseolus vulgaris, Phvul.011G030300.1) (Additional file 5: Table S1 and Additional file 8: Table S2).

Characterization, analysis, and assessment of relevance of CBM49-spanning patterns in non-plant taxa

The amino acid profile (HSC ≅ 46.2%; AAA ≅ 11%; PUC ≅ 36%; PCA ≅ 4.7%; PCB ≅ 13%) of the truncated CBM49 sequences (n = 102) suggests a high percentage of amino acids whose side chain functional groups (PUC = {−OH, −SH, −NH2}), i.e., Serine (S), Threonine (T), Cysteine (C), Tyrosine (Y), Asparagine (N), and Glutamine (Q), could potentially contribute to the catalytic machinery of these putative enzymes (Additional file 8: Table S2C). Interestingly, there was a paucity of the catalytic permissive (PCA = {−COO−}) amino acids (D/E) in the sequences analysed (Fig. 3a and b; Additional file 8: Table S2, Additional file 4: Text S2). Clearly, the restricted taxonomic distribution of CBM49 precludes a direct comparison, thereby justifying our search for patterns that could approximate CBM49 (Fig. 3; Additional file 9: Table S5). These patterns were partitioned into those with low/high fitness strengths, which was correlated to their compositional complexity (Table 4, Fig. 3c). Patterns of reduced complexity are likely to be present in a greater number of sequences, and also possess low fitness (Fs) scores (Table 4, Fig. 4c). The Rm-value is the expected number of random matches in 100,000 unrelated sequences [102]. For instance, the pattern with the lowest fitness score (p20), i.e., Gx(3)G[LV], has the value Rm = 33184 (n = 100), whilst the same for the high scoring pattern 1 (p1) was Rm = 2.47E−35 (n = 5) (Table 5, Fig. 3c).

The presence of these patterns in CBM49-containing characterized class C sequences was confirmed initially, following which their occurrence in non-class C members was evaluated (Fig. 4a and b). These data, for full length sequences of putative GH9 endoglucanases without a delineable CBM49, in terms of number of hits and sequences correspond to: p1−p17 (hits = 0), p18 (hits = 93; sequences = 81), p19 (hits = 2; sequences = 2), and p20 (hits = 233; sequences = 194) (Additional file 9: Table S5A). The results for all taxa with the GH9 domain were: p18 (hits = 98; sequences = 89), p19 (hits = 1; sequences = 1), and p20 (hits = 315; sequences = 265) (Additional file 9: Table S5B).

Fig. 3 Characterizing the carbohydrate binding module (CBM49). a Multiple sequence alignment of the CBM49 in class C GH9 endoglucanases. This region has been highlighted in the presented alignment, and suggests a conservation, in not just the overall structure, but also several key residues (W|F|Y; K|R|N|H|Q). Additionally, the highest (p1) and the lowest (p20) scoring patterns that approximate CBM49 have been illustrated. The rudimentary p20 derived from class C sequences was found in several organisms (n = 194), including classes A and B of plant GH9 members, b WebLogo of the carbohydrate binding domain 49 of putative class C plant GH9 endoglucanases. Truncated sequences with a well defined 81 AA region corresponding to the CBM49 were utilized to construct this, and c Analysis of 20 patterns spanning the CBM49 with number of matched sequences (Sm), fitness (Fs), and randomly matched sequences (log(Rm/R)) as indices. Abbreviations: AA, amino acids; CBM49, carbohydrate binding module; Fs, fitness score; GH9, glycoside hydrolase; Sm, number of sequences with matches; Rm, number of randomly matched sequences

The low scoring p18 (Gx[DENQPST]x(2)G[LV]) and p20 (Gx(3)G[LV]) are the only patterns equivalent to the CBM49 which are found in classes A and B along with other taxa, in both full length and GH9 domain sequences (Table 4). The maximal population (≈44−48%) and taxa-specific coverage (ALS, BAC, FGI, PRS, LPS) then justify the utilization of p20 in defining a dataset that could be used to develop an evolutionary trace of putative class C specific endoglucanase activity (Additional file 7: Table S4 and Additional file 9: Table S5, Additional file 13: Text S9, Additional file 16: Text S10, Additional file 14: Text S11 and Additional file 15: Text S12). This combined, i.e., inclusive of class C sequences, dataset (n_2 = 291) of full length putative GH9 endoglucanase sequences then possessed GH9 (n = 1) and CBM49-p20 (n ≥ 1) occurrences, and includes bacteria (n = 64), animals (n = 18), fungi (n = 5), protists (n = 8), and green algae (n = 2) (Fig. 4b; Additional file 2: Text S3). The distribution of bacteria between the datasets (n_1, n_2) was similar: firmicutes (≈56%, ≈69%), actinobacteria (≈17.2%, ≈15.6%), and proteobacteria (≈17.2%, ≈8%) (Table 3). However, the sole archaeal sequence (tr|D7E938) was conspicuous in the absence of the same (Figs. 1c and 4b; Additional file 5: Table S1A). We also observed that while several sequences of land plants, bacteria, and fungi included more than one occurrence of this pattern, green algae, protists, and animals only contained one occurrence of Gx(3)G[LV] (Additional file 9: Table S5A).

Fig. 4 Pattern analysis and major findings in selected plant GH9 endoglucanases. a Distribution and presence of high- and low-fitness strength CBM49-spanning patterns (53 'hits' on 4 sequences) in characterized class C enzymes, b Taxonomic distribution of the low strength p20 (n = 291), and c Analysis of the presence of all 20 CBM49-spanning patterns in selected sequences of classes A, B, and C (n = 187). Clearly, the ubiquitous presence of p20 favours its use as an index of the presence of CBM49 in non class C taxa. The higher strength patterns (p1-p17) are limited to putative class C GH9 endoglucanases. Abbreviations: CBM49, carbohydrate binding module; GH9, glycoside hydrolase; p20, pattern 20

A search for sequences with pattern 18 (G[DENPQST]x(2)G[LV]), with a marginal increase in fitness strength (|δ_(p20,p18)| ≅ 1.4), eliminated green algae altogether (Table 4; Additional file 9: Table S5). The taxonomic spread for matched occurrences on the GH9 domain (n_1 = 607) with p18 (n_1B; n_ALS = 3, n_BAC = 34, n_FGI = 3, n_LPSA = 7, n_LPSB = 28, n_LPSC = 14) and p20 (n_1C; n_ALS = 14, n_BAC = 53, n_FGI = 4, n_PRS = 6, n_LPSA = 14, n_LPSB = 70, n_LPSC = 108) reiterates the generic nature of these patterns (Additional file 9: Table S5B).
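The 'hits' and 'sequences' counts reported for the low strength patterns can be illustrated by translating the pattern notation used above into a regular expression and scanning a set of sequences. The sketch below makes several assumptions that are not taken from the study: lowercase `x` is read as "any residue" and `x(n)` as "any n residues", overlapping matches are counted as separate hits, and the function names and toy sequences are invented for illustration. The pattern-discovery step itself (and its fitness scoring) is not reproduced here.

```python
import re
from typing import Dict, Tuple


def pattern_to_regex(pattern: str) -> str:
    """Translate e.g. 'Gx(3)G[LV]' into the regex 'G.{3}G[LV]'.

    Assumes uppercase letters are residues, lowercase 'x' is a wildcard,
    and bracketed groups are already valid regex character classes.
    """
    out = re.sub(r"x\((\d+)\)", r".{\1}", pattern)  # x(3) -> .{3}
    return out.replace("x", ".")                    # remaining x -> .


def count_hits(pattern: str, sequences: Dict[str, str]) -> Tuple[int, int]:
    """Return (total hits, number of sequences with at least one hit)."""
    # A lookahead lets the scan report overlapping occurrences too.
    rx = re.compile(f"(?=({pattern_to_regex(pattern)}))")
    per_seq = {name: sum(1 for _ in rx.finditer(seq))
               for name, seq in sequences.items()}
    hits = sum(per_seq.values())
    matched = sum(1 for n in per_seq.values() if n > 0)
    return hits, matched


# Toy sequences (made up for illustration, not from the supplementary files).
toy_sequences = {
    "s1": "MAGAAAGLKK",    # one occurrence of Gx(3)G[LV]
    "s2": "GSTPGVGAAAGV",  # two occurrences
    "s3": "MKKLLP",        # none
}
p20_hits, p20_seqs = count_hits("Gx(3)G[LV]", toy_sequences)
```

Because a pattern such as p20 constrains only three of its six positions, it is expected to match many unrelated sequences, which is consistent with the high Rm-value quoted for it.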
In complete contrast is the profile of occurrences of p19, which despite its low fitness registers a single hit (class C, S. moellendorffii, 109529).

Analysis of CBM49 and CBM49-like GH9 endoglucanases of vascular land plants

In addition to establishing the origins of CBM49, we examined the divergence of putative class C GH9 endoglucanase sequences and the emergence of classes A and B in vascular land plants. To accomplish this, a subset of pattern 20 selected GH9 endoglucanase sequences in land plants (n = 186; n_LPSA = 22, n_LPSB = 75, n_LPSC = 89) was collated and compared. The node ages and branch times suggest that vascular class C (≈222 Mya; support = 100%, 99%) GH9 endoglucanases predate members of classes A and B (≈114 Mya; support = 87%, 99%) (Fig. 5; Additional file 19: Text S7 and Additional file 20: Text S8). The molecular basis of these findings was ascertained by examining CBM49 (class C) and CBM49-like (classes A and B) sequences of vascular land plants for the presence of concomitant transmembrane and signal peptide regions (Table 5, Fig. 4a; Additional file 11: Table S7, Additional file 16: Text S10, Additional file 14: Text S11 and Additional file 15: Text S12).

The MEMSAT-SVM data clearly suggest that all classes of GH9 endoglucanase sequences possess distinct high- (transmembrane; n_LPSA ≈ 96%, n_LPSB ≈ 83%, n_LPSC ≈ 80%) or low-scoring (pore-lining; n_LPSA ≈ 4%, n_LPSB ≈ 19%, n_LPSC ≈ 20%) helical regions, with the exception of the class B sequence (MDP0000199273), which possessed both classes of helices. Interestingly, a third class (re-entrant helical) was computed in class A members (n_LPSA = 3). When these data were combined, i.e., TM ∨ PH ∨ RH, all classes A, B, and C were shown to possess one or more TM subregions (n_LPSA = n_LPSB = n_LPSC = 100%) (Table 5). The same for the DAS-TMfilter was (n_LPSA = 95%, n_LPSB = 98%, n_LPSC = 90%), and for PHOBIUS (n_LPSA = 91%, n_LPSB = 4%, n_LPSC = 2%) (Table 5). The computations also suggest a bimodal distribution of signal peptide regions ((SP)^− ∧ (TM ∨ PH ∨ RH)^+ := NY, (SP)^+ ∧ (TM ∨ PH ∨ RH)^+ := YY). While the data for MEMSAT-SVM was (n_LPSB ≅ 80%, n_LPSC = 75%; YY), the same for the DAS-TMfilter was (n_LPSB ≅ 56%, n_LPSC = 86%; YY). In contrast, the data from PHOBIUS differed considerably (n_LPSB ≅ 33.3%, n_LPSC = 0; YY), and was applicable to only 3 sequences. This was primarily due to the almost complete absence of TM (n_LPSB = 4%, n_LPSC = 2%), or conversely the overwhelming presence of signal peptide regions in classes B and C (n_LPSB ≅ 96%, n_LPSC ≅ 98%) enzymes (Table 5; Additional file 11: Table S7, Additional file 16: Text S10 and Additional file 15: Text S12). However, as discussed vide supra, the corresponding results for the presence of the TM ∨ PH ∨ RH regions in class A GH9 endoglucanases predicted by MEMSAT-SVM (n_LPSA = 100%), DAS-TMfilter (n_LPSA = 95%), and PHOBIUS (n_LPSA = 91%) were almost identical (Table 5). Additionally, whilst the results from DAS-TMfilter were similar to MEMSAT-SVM, its coverage of classes B (n_LPSB = 67%) and C (n_LPSC = 51%) was suboptimal. The MEMSAT-SVM data, therefore, was deemed most appropriate for predicting the molecular events that may have occurred during the evolution of plant GH9 endoglucanases (Table 5; Additional file 11: Table S7, Additional file 15: Text S12).

Fig. 5 Insights into divergence of plant class C GH9 endoglucanases. A Bayesian inference (BI) dated tree was estimated (maximum clade credibility) from the computed tree population (n = 4837; burn-in = 70%) using the JTT + I + G amino acid substitution model. Whilst node ages (= node height = branch time of the longest diverging taxa) and branch times are in Mya, support for branch points is indicated by the posterior probabilities (PP%) and bootstrap values (n = 1000; ML%), i.e., support = PP%, ML%. The root for this tree was the parent of vascular class C land plants (201−249 Mya). The log likelihood for this tree was (≈−0.1350387). Abbreviations: BI, Bayesian inference; GH9, glycoside hydrolase; I, proportion of invariant sites; G, gamma parameter; Mya, millions of years; JTT, Jones, Taylor, and Thornton

Discussion

Evolutionary significance of crystalline cellulose digesting non plant GH9 endoglucanases

Our results on the evolution of the GH9 and CBM49 regions suggest a pyramidal model with vertical gene transfer and progressive evolution (loss or modification of function) as a plausible explanation for the emergence, occurrence, and divergence of GH9 endoglucanase activity (≈3000 Mya) (Figs. 2 and 5) [15–28, 32–34]. Conversely, since crystalline cellulose is the preferred substrate, this also implies a conserved active site architecture of the encoded protein and a correspondingly similar reaction chemistry in non-plant taxa and land plants with putative class C GH9 endoglucanase activity (Tables 1 and 4, Fig. 4a; Additional file 5: Table S1 and Additional file 8: Table S2) [7, 46, 47].

The structure of crystalline cellulose renders it resistant to alterations in temperature, salt, and pH of the surrounding environment, clearly a desirable trait in archaea (methanogens) and bacteria (halophiles, thermophiles) which inhabit extreme environments such as hot springs and the oral and gastrointestinal microbiomes of several animals. Here, perhaps, the role of GH9 endoglucanases could be critical in remodelling the cell membranes, thereby maintaining intracellular homeostasis [47, 49]. Additionally, crystalline cellulose is inert, compact, and insoluble in aqueous and several organic solvents. These physicochemical properties would imply that spores and seeds made predominantly of this polymer would be resistant to desiccation and stressors such as weather fluctuations [14, 41, 43]. Clearly, protists (Dictyostelium and Polysphondylium spp.) and gram positive bacteria may have utilized GH9 endoglucanases to regulate the processes of sporulation, dissemination, and effective germination [14, 41, 43]. The lipopolysaccharides (complexes of crystalline cellulose with lipids) synthesized by gram negative bacteria (proteobacteria, actinobacteria) and fungi, too, could aid protection of the organism from host immune systems (phagocytosis) while concomitantly establishing an infection (Cryptococcus neoformans, Pseudomonas spp., Vibrio spp.) or infestation in developing protists and marine invertebrates [9–14, 41, 43, 48, 102–104]. Reciprocally, an interesting utility of GH9 endoglucanases is to facilitate the symbiotic/parasitic association between some fungi and bacteria of animal and plant hosts (macrophages, leguminous nodules of the rhizomes) by digesting the crystalline cellulose of the host. Thus, bacteria/fungi could secrete these enzymes and/or, in association with the cellulosome, could digest the cellulose and hemicellulose in root hairs and wood to extract/exchange nutrients (Laccaria bicolor, Sporisorium reilianum, Phanerochaete chrysosporium) [42, 44, 53, 54, 105–109].

Although cellulose is unequivocally inert, reports of its potential to stimulate an immune response in the host are not unknown. In fact, specialized cells in the tunics of marine invertebrates (O. dioica, S. kowalevskii, and C. intestinalis) might function as primitive phagocytes that could detect the presence of crystalline cellulose (potential pathogen, index of nutritional status) and could moderate a suitable response (adhesion to the substratum, infection by marine microbes). The ability to utilize the nutritionally superior crystalline cellulose may be an important, albeit indirect, consideration for the dominant global presence of arthropods including insects (Apis mellifera, Camponotus floridanus, Nasonia vitripennis, Nasutitermes takasagoensis), crustaceans (Daphnia pulex), and segmented worms (Additional file 5: Table S1A, Additional file 1: Text S1) [15, 50, 51, 108–114]. Since GH9 endoglucanase producing bacteria populate the microbiomes of these animals, they are able to extract glucose from diverse substrates (wood, chitooligosaccharides) and can subsist in several seemingly inhospitable environments. Additionally, and in comparison to the kingdom specific analysis (bacteria, fungi, land plants, animals) with corresponding multiple trees by previous investigators, we were able to generate a unified time tree of over 600 GH9 domain sequences spread over every major taxa (n_BAC ≈ 6.5X, n_ALS ≈ 3.4X, n_FGI ≈ 1.6X, n_LPS ≈ 4.8X), which includes green algae and protists [55].

Rationale and relevance of a multimodal approach to approximating the CBM49

As discussed vide supra, the carbohydrate binding module CBM49 is unique to class C members of land plants (Fig. 3; Additional file 8: Table S2 and Additional file 4: Text S2). Our data suggests that homologous CBMs (GH9 ∧ (CBMx)_y | x ∈ {2, 3, 4, 10, 49, X}, y = {1, 2}) distributed across the length of the protein might contribute to catalysis of crystalline cellulose in bacteria (n = 37), animals (n = 18), and protists (n = 2) (Table 6; Additional file 12: Table S8). The data from the SMART server also indicated the presence of several low complexity regions, both in full length and truncated (GH9 domains) sequences. This, coupled with the sparse CBM data (<10%), prompted us to search for CBM49 spanning patterns amongst putative non class C GH9 endoglucanase sequences, reasoning that patterns with low fitness scores might constitute a superior index of approximating the CBM49. In our analysis the CBM49-approximating and low scoring p18 (Gx[DENQPST]x(2)G[LV]), p19 (Gx[ILV][WY]G[LV]), and p20 (Gx(3)G[LV]) possessed amino acids that may be both potentially catalytic and/or facilitatory. Whilst the bulky side chains of the aromatic amino acids can physically stretch the glycosidic linkage between adjacent β(D)-glucopyranose residues and weaken it several fold, amino acids with side chain functional groups (−OH, −NH2, −SH) can effect electron-proton transfers and are critical components of the catalytic machinery of any enzyme [62–64, 115]. The concomitant occurrence of these residues with the GH9, i.e., (GH9 ∧ p18) ∨ (GH9 ∧ p20), could function as an index of CBM49-presence on the GH9 domain in sequences of non class C taxa and can then be utilized to trace the origins of CBM49. The biological relevance of this approach may be gleaned by examining the correlation between the presence of aromatic amino acids, which are known to influence catalysis of crystalline cellulose, and the 'hits' or 'occurrences' of low strength patterns in non class C enzymes (Table 4; Additional file 9: Table S5) [62–78]. Whilst the complete absence of aromatic amino acids could be responsible for the generic distribution of p18 and p20 (93 ≤ n_Hits ≤ 230, full length; 98 ≤ n_Hits ≤ 315, GH9 domain), the incorporation of a single residue W/Y into p19 results in a significant reduction in its occurrence in non class C members (n_Hits = 2, full length; n_Hits = 1, GH9 domain) (Table 4, Fig. 4b; Additional file 9: Table S5).

Evolution of the CBM49 encompassing class C GH9 endoglucanases

The identification of the CBM49 as the facilitator of crystalline cellulose digestion (class C activity) in a select population of previously annotated GH9 endoglucanases in land plants raises intriguing queries with regards to the origin, subsequent divergence, and physiological relevance of substrate shuffling (amorphous, crystalline) in plant GH9 endoglucanases [6–8, 33, 34]. In the absence of an identifiable CBM49, the analysis of full length putative GH9 endoglucanase sequences with occurrences of p20 (low strength generic approximator of CBM49) might constitute a viable approach, and provide insights into the origins and subsequent divergence of CBM49 containing enzymes.

Emergence and origin of the CBM49

The influence of non-GH9 regions of the primary sequence on the catalytic spectrum of plant GH9 endoglucanases suggests that these, like the GH9, may have originated in non-plant taxa. These could include the presence of: a) homologous CBMs throughout the length of the protein sequence, and b) delocalized residue-specific activity of the GH9 domain itself. Extensive sequence analysis of full length and GH9 domain sequences of non-plant taxa reveals the presence of several regions of low complexity, along with sparsely present pre-defined CBMs (n = 57; ≅9.3%) (Table 6; Additional file 12: Table S8). The numbers notwithstanding, distinct copies of CBM2 (animals, bacteria), CBM3 (bacteria), CBM4 (animals, bacteria), CBM10 (bacteria), CBMX (bacteria), and the CBM49 (protists) itself (GH9 ∧ (CBMx)_y) have been characterized in literature, with the encompassing GH9 endoglucanases exhibiting a clear preference for crystalline cellulose [39–53] (Table 6; Additional file 12: Table S8). Interestingly, the CBMs 2 and 4 of animals and bacteria were present at opposite termini of the GH9 domain. Thus, while CBM4_9 is C-terminal in animals, its position in bacteria is distinctly N-terminal, with the reverse being true for CBM2 (Additional file 12: Table S8). This mobility of CBMs across taxa suggests that either N- or C-terminal positioned CBMs could have functioned as precursors of CBM49. The length of the linker sequences exhibited considerably greater variation in non-plant taxa (27−230 aa) as compared to land plants (7−77 aa) (Additional file 5: Table S1A, Additional file 8: Table S2A and Additional file 12: Table S8). In contrast, the low strength CBM49-approximator, i.e., pattern 20, could be mapped directly onto the full length and GH9 domains (≅50%). In the presence of key aromatic and/or polar uncharged amino acids this mapping could also confer competency to digest crystalline cellulose. Whilst the exact origin of the CBM49 remains speculative, our results when combined indicate a distinct probability (>0.00) that a double ((GH9 ∧ (CBMx)_y) = {0.093} ∨ (GH9 ∧ p20) = {0.44, 0.48}) or triple event ((GH9 ∧ (CBMx)_y ∧ p20) = {0.041, 0.046}) may have resulted in the emergence of CBM49 in early land plants (Table 6; Additional file 12: Table S8).

Divergence of class C GH9 endoglucanases

The interdomain linker, a common feature between the GH9 and CBMs, is surprisingly stable and seems to have remained as such for ≅450−480 Mya. Whilst the evidence for the ancestral role of class C members of vascular land plant GH9 endoglucanases is fairly unequivocal, a clear insight into the downstream molecular events that may have occurred in their transformation to classes A and B is debatable (Figs. 5 and 6). Here too, we posited that vertical gene loss of class C GH9 endoglucanase sequences was operative and could result in the emergence of classes A (A1) and B (B1, B2) (Table 5, Fig. 6; Additional file 11: Table S7, Additional file 13: Text S9, Additional file 16: Text S10, Additional file 14: Text S11 and Additional file 15: Text S12). The extensive computational analysis conducted in this work suggests that classes B (B1, B2) and C (C1, C2) could be considered a union of two distinct groups each, a partitioning that is based on the presence or absence of a signal peptide region (Table 5, Fig. 6; Additional file 11: Table S7, Additional file 13: Text S9, Additional file 16: Text S10, Additional file 14: Text S11 and Additional file 15: Text S12).

The first model purports that the last common ancestor (LCA) of vascular plant GH9 endoglucanases were class C-like enzymes in bryophytes and early tracheophytes. Subsequent losses, in parallel, of the CBM49 could have resulted in the appearance of modern vascular equivalents (Figs. 5 and 6). This model also offers an explanation for the fewer numbers of class C members frequently observed by investigators, despite contrasting bioinformatics evidence [14, 58–60]. Indeed, this may be the route of choice for the emergence of class C (≈222 Mya; support = 100%, 99%) and classes A and B (≈114 Mya; support = 87%, 99%) (Figs. 5 and 6). Clearly, this model would mandate the presence of distinct subpopulations of the LCA, i.e., CBM49 with either TM or SP regions. Alternatively, class C GH9 endoglucanases of land plants may have been the first to emerge after the tracheophytes, whilst classes A and B evolved from them by the progressive loss of the signal peptide. This route, too, seems perfectly plausible given the presence of two distinct subpopulations of class C GH9 endoglucanases (C1, C2), with each diverging secondary to the loss of the CBM49 subregion (class C2 → class A1 ≈ class B2; n_1A), and the considerably earlier divergence of class C vascular plants (Table 5, Figs. 5 and 6). Since classes A and B in vascular land plants could originate in parallel and directly from their class C counterparts, the fewer numbers observed could simply mean fewer original class C members left as compared to class B GH9 endoglucanases. A third scenario could be the origin of later members sequentially, i.e., class C → class A → class B or class C → class B → class A (Fig. 5). Phylogenetic and sequence analysis of this dataset (n) suggests that the most probable routes were class C1 → class B1 → class A1 and/or (class C2 → class A1 ≈ class B2; n_1A) (Fig. 6).

Fig. 6 Evolution, divergence, and emergence of plant class C GH9 endoglucanases. a Evolutionary theories for the emergence and divergence of classes A, B, and C plant GH9 endoglucanases. The major considerations in proposing these were data gleaned from the time trees, and analysis of the sequences for the presence and/or absence of the transmembrane, signal peptides, and the CBM49 itself. b Phylogenetic and bioinformatics analysis of full length sequences of classes A, B, and C plant endoglucanases with one or more occurrences of p20. CBM49 attributable class C activity along with a GH9 domain was present in early land plants, bryophytes (avascular) and tracheophytes (vascular), and suggests the presence of two class C populations which may have diverged so as to result in the newer classes A and B. Abbreviations: CBM49, carbohydrate binding module; GH9, glycoside hydrolase; p20, pattern 20

Class C GH9 enzymes, last common ancestor of plant GH9 endoglucanases

Physiologically, the development of an intact vascular system could have brought about a paradigm shift in not just the utilization of extant endoglucanase activity, but also in the nature of cellulose itself. The introduction and persistence of water molecules between the microfibrils of cellulose could have resulted in competition for hydrogen bonds with water rather than other fibrils of cellulose. These events could have been complemented by the late emergence of the crystalline cellulose (Iα, Iβ) editing subclass A GH9 endoglucanases, and could have shifted the reaction equilibria towards the right, i.e., synthesis of amorphous cellulose (Iα am, Iβ am) [10]. These reactions can be depicted as: [...]

The proliferation of amorphous regions would have rendered cellulose accessible and amenable to enzymatic conversion with lesser stringency. Evolutionarily, this means that the CBM49 in land plants (avascular and early vascular), despite its ancestral origins, may no longer be necessary for cellulose metabolism. This in turn may have initiated a series of molecular events in extant class C endoglucanase sequences of late tracheophytes such as S. moellendorffii, and may have culminated in the divergence and subsequent appearance of late vascular GH9 endoglucanases of class C (Table 5, Fig. 6) [62–64]. The presence of the linker region, too, may have facilitated the progressive loss of CBM49 and its progressive transformation into classes A and B over ≅114 Mya (Fig. 5). Since the modified chemistry and quantity of cellulose made it amenable to rapid digestion, enzymes of classes A and B were more suited to digesting the now abundant amorphous regions of cellulose, and could utilize it as a source of carbon, as well as remodel it to effect growth, development, flowering, and germination [16, 58]. Whilst the presence of crystalline cellulose in the stems of cereal crops (Hordeum vulgare, Brachypodium distachyon, O. sativa) facilitates growth and cultivation, its secretion in the mucilage from the epidermal cells of differentiating eudicot seeds is a critical event in germination [58, 60, 116–118]. The recent divergence of land plant GH9 endoglucanases into monocots such as the cereals (O. sativa, B. distachyon, Panicum virgatum) and the asterid subdivision of the eudicots (S. tuberosum, S. lycopersicum and N. tabacum) is consistent in all classes and in both datasets (n_1A, n_3) (Table 1). These could reflect a modification of the culinary habits of a developing civilization with a desire for bulk and storage foods (Table 4). Here, too, the in situ digestion of crystalline cellulose by class C enzymes, or its conversion to amorphous forms thereof, could proceed unhindered. The continuing molecular evolution of classes A and B enzymes also suggests a versatile and adaptive mechanism of action, perhaps in tandem with the emergence of novel pathophysiological stimuli. The existence of high levels of mRNA of putative class C members observed from the internode regions (high cellulose content) of the developing stems of O. sativa and A. thaliana suggests that these enzymes could still be of benefit to modern land plants, as they could direct the higher affinity classes A and B enzymes to regions of growth and development, where the concentrations of cellulose would be much lower [16, 58, 60, 116–118]. The CBM49 of class C plant GH9 endoglucanases could also function as a gene/protein repository for newly emerging functions, thus justifying their title as living fossils of the plant world.

Conclusions

Our work, when coupled with extant data on class C plant GH9 endoglucanases, suggests that these enzymes are ancestral to classes A and B of this family. Plant GH9 endoglucanases are able to digest crystalline cellulose (class C activity) in a manner reminiscent of catalysis by bacteria, animals, protists, fungi, and archaea. Our work here suggests that the GH9 domain is relatively well conserved across taxa. We also present plausible phylogenetic time lines coupled with bioinformatics evidence that favour a vertical mode of gene evolution that may have contributed to the origin and emergence of the CBM49 between the GH9 endoglucanases of plants and non plant taxa, as well as its subsequent divergence (tracheophytes and the vascular land plants of classes A, B, and C). Finally, we review the computational evidence in context of likely physiological events that may have occurred during their divergence and evolution.

Additional files

Additional file 1: Text S1. Sequences of GH9 in all taxa (fasta). (FASTA 283 kb)
Additional file 2: Text S3. Sequences with pattern 20 across all taxa (fasta). (FASTA 197 kb)
Additional file 3: Text S4. Sequences of land plants (CBM49, pattern 20; fasta). (FASTA 114 kb)
Additional file 4: Text S2. Sequences of CBM49 in predicted class C land plants (fasta). (FASTA 11 kb)
Additional file 5: Table S1. GH9 domain based classification of taxa. (XLSX 54 kb)
Additional file 6: Table S3. Maximum likelihood based evaluation of amino acid substitution models. (XLSX 30 kb)
Additional file 7: Table S4. Posterior probabilities for parameters utilized to date GH9/CBM49 evolution across taxa. (XLSX 15 kb)
Additional file 8: Table S2. CBM49 based classification of land plants. (XLSX 22 kb)
Additional file 9: Table S5. Distribution of low strength patterns in non class C taxa. (XLSX 41 kb)
Additional file 10: Table S6. Distribution of low strength patterns in non class C land plants. (XLSX 19 kb)
Additional file 11: Table S7. Distribution of TM, SP, and CBM49 in land plants. (XLSX 22 kb)
Additional file 12: Table S8. Distribution of CBMs and low strength patterns in taxa. (XLSX 18 kb)
Additional file 13: Text S9. Distribution of low strength patterns in land plants. (TXT 84 kb)
Additional file 14: Text S11. Distribution of TM and SP in land plants (DAS-TMfilter). (TXT 36 kb)
Additional file 15: Text S12. Distribution of TM and SP in land plants (MEMSAT-SVM). (ZIP 2102 kb)
Additional file 16: Text S10. Distribution of TM and SP in land plants (PHOBIUS). (TXT 9 kb)
Additional file 17: Text S5. Maximum clade credibility tree to assess evolution of the GH9 domain. (TXT 5 kb)
Additional file 18: Text S6. Maximum likelihood estimate of branching times of GH9 evolution with bootstrapping. (PDF 49 kb)
Additional file 19: Text S7. Maximum clade credibility tree to assess divergence of the CBM49 in land plants. (TXT 2 kb)
Additional file 20: Text S8. Maximum likelihood estimate of branching times of CBM49 in land plants with bootstrapping. (PDF 10 kb)

Abbreviations

AAA: Aromatic amino acids; ALS: Animals; ANN: Artificial neural network; BAC: Bacteria; BEAST: Bayesian evolutionary analysis by sampling trees; BRY: Bryophytes; CAZy: Carbohydrate active enzymes; CBM: Carbohydrate binding module; DAS-TMfilter: Density alignment server; dbCAN: Database of carbohydrate enzymes annotated; EC: Enzyme commission; FGI: Fungi; GAL: Green algae; GH: Glycoside hydrolase; HMM: Hidden Markov model; HSC: Hydrophobic side chains; Iα, Iβ: Crystalline cellulose; Iα am, Iβ am: Amorphous cellulose; LCA: Last common ancestor; LPS: Land plants; MEGA: Molecular evolutionary genetic analysis; MSA: Multiple sequence alignment; Mya: Millions of years; PCA: Polar charged acidic; PCB: Polar charged basic; PIR: Protein information server; PRATT: Pattern analysis; PRS: Protists; PUC: Polar uncharged; SMART: Simple modular architecture research tool; SP: Signal peptide; SVM: Support vector machine; TM: Trans-membrane; TRY: Tracheophytes

Acknowledgements

RS gratefully acknowledges financial support from JNU through UPE-II grant and Ramalingaswami fellowship from DBT, India. These, however, had no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Funding

RS gratefully acknowledges financial support from JNU through UPE-II grant and Ramalingaswami fellowship from DBT, India. These, however, had no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Availability of data and materials

Data is available as supporting material with the manuscript. The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Authors' contributions

SK outlined and designed the study, conceptualized the algorithm(s) and formulae for prediction, manually collated all the sequences and their references, carried out the computational analysis, constructed the models, formulated the filters, and wrote the manuscript. RS outlined the study and participated in manuscript discussions. All authors read and approved the final manuscript.

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in

18. Bell EA, Boehnke P, Harrison TM, Mao WL. Potentially biogenic carbon preserved in a 4.1 billion-year-old zircon. Proc Natl Acad Sci U S A. 2015;112(47):14518–21.
19. Noffke N, Christian D, Wacey D, Hazen RM. Microbially induced sedimentary structures recording an ancient ecosystem in the ca. 3.48 billion-year-old dresser formation, Pilbara, Western Australia. Astrobiology. 2013;13(12):1103–24.
20. Schopf JW. Fossil evidence of Archaean life. Philos Trans R Soc Lond Ser B Biol Sci. 2006;361(1470):869–85.
21.
Bengtson S, Belivanova V, Rasmussen B, Whitehouse M. The controversial published maps and institutional affiliations. “Cambrian” fossils of the Vindhyan are real but more than a billion years older. Proc Natl Acad Sci U S A. 2009;106(19):7729–34. Received: 24 July 2017 Accepted: 18 April 2018 22. Brocks JJ, Logan GA, Buick R, Summons RE. Archean molecular fossils and the early rise of eukaryotes. Science. 1999;285(5430):1033–6. 23. Peterson KJ, Butterfield NJ. Origin of the Eumetazoa: testing ecological References predictions of molecular clocks against the Proterozoic fossil record. Proc 1. Libertini E, Li Y, McQueen-Mason SJ. Phylogenetic analysis of the plant Natl Acad Sci U S A. 2005;102(27):9547–52. endo-beta-1,4-glucanase gene family. J Mol Evol. 2004;58(5):506–15. 24. Budd GE, Butterfield NJ, Jensen S. Crustaceans and the “Cambrian 2. Molhoj M, Pagant S, Hofte H. Towards understanding the role of explosion”. Science. 2001;294(5549):2047. membrane-bound endo-beta-1,4-glucanases in cellulose biosynthesis. Plant 25. Engel MS, Grimaldi DA. New light shed on the oldest insect. Nature. 2004; Cell Physiol. 2002;43(12):1399–406. 427(6975):627–30. 3. Maloney VJ, Mansfield SD. Characterization and varied expression of a 26. Berna L, Alvarez-Valin F. Evolutionary genomics of fast evolving tunicates. membrane-bound endo-β-1,4-glucanase in hybrid poplar. Plant Biotechnol Genome Biol Evol. 2014;6(7):1724–38. J. 2010;8(3):294–307. 27. Erwin DH, Davidson EH. The last common bilaterian ancestor. Development. 4. Mansoori N, Timmers J, Desprez T, Alvim-Kamei CL, Dees DC, Vincken JP, 2002;129(13):3021–32. Visser RG, Hofte H, Vernhettes S, Trindade LM. KORRIGAN1 interacts 28. Betts MJ, Topper TP, Valentine JL, Skovsted CB, Paterson JR, Brock GA. A specifically with integral components of the cellulose synthase machinery. new early Cambrian bradoriid (Arthropoda) assemblage from the northern PLoS One. 2014;9(11):e112387. flinders ranges, South Australia. Gondwana Res. 
2014;25(1):420–37. 5. Vain T, Crowell EF, Timpano H, Biot E, Desprez T, Mansoori N, Trindade LM, 29. Braun A, Chen J, Waloszek D, Maas A. First early Cambrian Radiolaria. Geol Pagant S, Robert S, Hofte H, et al. The Cellulase KORRIGAN is part of the Soc Lond, Spec Publ. 2007;286(1):143–9. cellulose synthase complex. Plant Physiol. 2014;165(4):1521–32. 30. Butterfield NJ. Probable Proterozoic fungi. Paleobiology. 2005;31(1):165. 6. Brummell DA, Bird CR, Schuch W, Bennett AB. An endo-1,4-beta-glucanase https://doi.org/10.1666/0094-8373. expressed at high levels in rapidly expanding tissues. Plant Mol Biol. 1997; 31. Lucking R, Huhndorf S, Pfister DH, Plata ER, Lumbsch HT. Fungi evolved 33(1):87–95. right on track. Mycologia. 2009;101(6):810–22. 7. Urbanowicz BR, Catala C, Irwin D, Wilson DB, Ripoll DR, Rose JK. A tomato 32. Bhattacharya D. Dating algal origin using molecular clock methods. Protist. endo-beta-1,4-glucanase, SlCel9C1, represents a distinct subclass with a new 2004;155(1):9–10. family of carbohydrate binding modules (CBM49). J Biol Chem. 2007;282(16): 33. Bhattacharya D, Medlin aL. Algal phylogeny and the origin of land plants. 12066–74. Plant Physiol. 1998;116(1):9–15. 8. Yoshida K, Imaizumi N, Kaneko S, Kawagoe Y, Tagiri A, Tanaka H, Nishitani K, 34. Gray, J., Massa, D. & Boucot, A. J. Caradocian land plant microfossils from Komae K. Carbohydrate-binding module of a rice endo-beta-1,4-glycanase, Libya. Geology 10, 197–201, doi: https://doi.org/10.1130/0091-7613(1982). OsCel9A, expressed in auxin-induced lateral root primordia, is post- 35. Crane PR, Herendeen P, Friis EM. Fossils and plant phylogeny. Am J Bot. translationally truncated. Plant Cell Physiol. 2006;47(11):1555–71. 2004;91(10):1683–99. 9. Blouzard JC, Bourgeois C, de Philip P, Valette O, Belaich A, Tardif C, Belaich 36. Kenrick P, Crane PR. Nature. 1997;389(6646):33–9. JP, Pages S. Enzyme diversity of the cellulolytic system produced by 37. 
Qiu YL, Li L, Wang B, Chen Z, Knoop V, Groth-Malonek M, Dombrovska O, Lee Clostridium cellulolyticum explored by two-dimensional analysis: J, Kent L, Rest J, et al. The deepest divergences in land plants inferred from identification of seven genes encoding new dockerin-containing proteins. J phylogenetic evidence. Proc Natl Acad Sci U S A. 2006;103(42):15511–6. Bacteriol. 2007;189(6):2300–9. 38. Chaw SM, Chang CC, Chen HL, Li WH. Dating the monocot-dicot 10. Mingardon F, Bagert JD, Maisonnier C, Trudeau DL, Arnold FH. Comparison divergence and the origin of core eudicots using whole chloroplast of family 9 cellulases from mesophilic and thermophilic bacteria. Appl genomes. J Mol Evol. 2004;58(4):424–41. Environ Microbiol. 2011;77(4):1436–42. 39. Gandolfo MA, Nixon KC, Crepet WL. Triuridaceae fossil flowers from the 11. Qi M, Jun HS, Forsberg CW. Cel9D, an atypical 1,4-beta-D-glucan upper cretaceous of New Jersey. Am J Bot. 2002;89(12):1940–57. glucohydrolase from Fibrobacter succinogenes: characteristics, catalytic 40. Gandolfo MA, Nixon KC, Crepet WL, Stevenson DW, Friis EM. Nature. 1998; residues, and synergistic interactions with other cellulases. J Bacteriol. 2008; 394(6693):532–3. 190(6):1976–84. 41. Blume JE, Ennis HL, Dictyostelium A. Discoideum cellulase is a member of a 12. Yi Z, Su X, Revindran V, Mackie RI, Cann I. Molecular and biochemical spore germination-specific gene family. J Biol Chem. 1991;266(23):15432–7. analyses of CbCel9A/Cel48A, a highly secreted multi-modular cellulase by 42. del Campillo E, Gaddam S, Mettle-Amuah D, Heneks J. A tale of two tissues: Caldicellulosiruptor bescii during growth on crystalline cellulose. PLoS One. AtGH9C1 is an endo-beta-1,4-glucanase involved in root hair and 2013;8(12):e84172. endosperm development in Arabidopsis. PLoS One. 2012;7(11):e49363. 13. Zhang C, Zhang W, Lu X. Expression and characteristics of a ca(2)(+ 43. Ficko-Blean E, Boraston AB. 
The interaction of a carbohydrate-binding )-dependent endoglucanase from Cytophaga hutchinsonii. Appl Microbiol module from a Clostridium perfringens N-acetyl-beta-hexosaminidase with Biotechnol. 2015;99(22):9617–23. its carbohydrate receptor. J Biol Chem. 2006;281(49):37748–57. 14. Ramalingam R, Blume JE, Ennis HL. The Dictyostelium discoideum spore germination-specific cellulase is organized into functional domains. J 44. Goellner M, Wang X, Davis EL. Endo-beta-1,4-glucanase expression in Bacteriol. 1992;174(23):7834–7. compatible plant-nematode interactions. Plant Cell. 2001;13(10):2241–55. 15. Allardyce BJ, Linton SM, Saborowski R. The last piece in the cellulase puzzle: 45. Matthysse AG, Deschet K, Williams M, Marry M, White AR, Smith WC. A the characterisation of beta-glucosidase from the herbivorous gecarcinid functional cellulose synthase from ascidian epidermis. Proc Natl Acad Sci U land crab Gecarcoidea natalis. J Exp Biol. 2010;213(Pt 17):2950–7. S A. 2004;101(4):986–91. 16. Kundu S, Sharma R. In silico identification and taxonomic distribution of 46. McLean BW, Bray MR, Boraston AB, Gilkes NR, Haynes CA, Kilburn DG. plant class C GH9 endoglucanases. Front Plant Sci. 2016;7:1185. Analysis of binding of the family 2a carbohydrate-binding module 17. Domozych DS, Ciancia M, Fangel JU, Mikkelsen MD, Ulvskov P, Willats WG. from Cellulomonas fimi xylanase 10A to cellulose: specificity and The cell walls of green algae: a journey through evolution and diversity. identification of functionally important amino acid residues. Protein Front Plant Sci. 2012;3:82. Eng. 2000;13(11):801–9. Kundu and Sharma BMC Evolutionary Biology (2018) 18:79 Page 18 of 19 47. Boraston AB, Bolam DN, Gilbert HJ, Davies GJ. Carbohydrate-binding modules: engineered cellulose-binding domains of cellobiohydrolase I from fine-tuning polysaccharide recognition. Biochem J. 2004;382(Pt 3):769–81. Trichoderma reesei. Protein Sci. 1997;6(2):294–303. 48. O'Meara TR, Alspaugh JA. 
The Cryptococcus neoformans capsule: a sword 70. Morrill J, Kulcinskaja E, Sulewska AM, Lahtinen S, Stalbrand H, and a shield. Clin Microbiol Rev. 2012;25(3):387–408. Svensson B, Abou Hachem M. The GH5 1,4-beta-mannanase from Bifidobacterium animalis subsp. lactis Bl-04 possesses a low-affinity 49. Gao B, Gupta RS. Phylogenomic analysis of proteins that are distinctive of mannan-binding module and highlights the diversity of mannanolytic archaea and its main subgroups and the origin of methanogenesis. BMC enzymes. BMC Biochem. 2015;16:26. Genomics. 2007;8:86. 71. Nishijima H, Nozaki K, Mizuno M, Arai T, Amano Y. Extra tyrosine in the 50. Dehal P, Satou Y, Campbell RK, Chapman J, Degnan B, De Tomaso A, carbohydrate-binding module of Irpex lacteus Xyn10B enhances its Davidson B, Di Gregorio A, Gelpke M, Goodstein DM, et al. The draft cellulose-binding ability. Biosci Biotechnol Biochem. 2015;79(5):738–46. genome of Ciona intestinalis: insights into chordate and vertebrate origins. Science. 2002;298(5601):2157–67. 72. Parsiegla G, Reverbel-Leroy C, Tardif C, Belaich JP, Driguez H, Haser R. 51. Linton SM, Greenaway P, Towle DW. Endogenous production of endo-beta- Crystal structures of the cellulase Cel48F in complex with inhibitors and 1,4-glucanase by decapod crustaceans. J Comp Physiol B. 2006;176(4):339–48. substrates give insights into its processive action. Biochemistry. 2000; 52. Lo N, Watanabe H, Sugimura M. Evidence for the presence of a cellulase 39(37):11238–46. gene in the last common ancestor of bilaterian animals. Proc Biol Sci. 2003; 73. Simpson HD, Barras F. Functional analysis of the carbohydrate-binding 270(Suppl 1):S69–72. domains of Erwinia chrysanthemi Cel5 (endoglucanase Z) and an Escherichia coli putative chitinase. J Bacteriol. 1999;181(15):4611–6. 53. Scholl EH, Thorne JL, McCarter JP, Bird DM. Horizontally transferred genes in 74. Simpson PJ, Xie H, Bolam DN, Gilbert HJ, Williamson MP. 
The structural basis plant-parasitic nematodes: a high-throughput genomic approach. Genome for the ligand specificity of family 2 carbohydrate-binding modules. J Biol Biol. 2003;4(6):R39. Chem. 2000;275(52):41137–42. 54. Smant G, Stokkermans JP, Yan Y, de Boer JM, Baum TJ, Wang X, Hussey RS, 75. Strobel KL, Pfeiffer KA, Blanch HW, Clark DS. Structural insights into Gommers FJ, Henrissat B, Davis EL, et al. Endogenous cellulases in animals: the affinity of Cel7A carbohydrate-binding module for lignin. J Biol isolation of beta-1, 4-endoglucanase genes from two species of plant- Chem. 2015;290(37):22818–26. parasitic cyst nematodes. Proc Natl Acad Sci U S A. 1998;95(9):4906–11. 55. Davison A, Blaxter M. Ancient origin of glycosyl hydrolase family 9 cellulase 76. Taylor CB, Talib MF, McCabe C, Bu L, Adney WS, Himmel ME, Crowley MF, genes. Mol Biol Evol. 2005;22(5):1273–84. Beckham GT. Computational investigation of glycosylation effects on a 56. Salzberg SL, White O, Peterson J, Eisen JA. Microbial genes in the human family 1 carbohydrate-binding module. J Biol Chem. 2012;287(5):3147–55. genome: lateral transfer or gene loss? Science. 2001;292(5523):1903–6. 77. YanivO,Petkun S,Shimon LJ, Bayer EA, LamedR,FrolowF.Asingle mutation reforms the binding activity of an adhesion-deficient family 57. Stanhope MJ, Lupas A, Italia MJ, Koretke KK, Volker C, Brown JR. 3 carbohydrate-binding module. Acta Crystallogr D Biol Crystallogr. Phylogenetic analyses do not support horizontal gene transfers from 2012;68(Pt 7):819–28. bacteria to vertebrates. Nature. 2001;411(6840):940–4. 78. Zhang C, Wang Y, Li Z, Zhou X, Zhang W, Zhao Y, Lu X. Characterization of 58. Buchanan M, Burton RA, Dhugga KS, Rafalski AJ, Tingey SV, Shirley NJ, Fincher a multi-function processive endoglucanase CHU_2103 from Cytophaga GB. Endo-(1,4)-beta-glucanase gene families in the grasses: temporal and hutchinsonii. Appl Microbiol Biotechnol. 2014;98(15):6679–87. spatial co-transcription of orthologous genes. 
BMC Plant Biol. 2012;12:235. 79. Henrissat B. A classification of glycosyl hydrolases based on amino acid 59. Montanier C, Flint JE, Bolam DN, Xie H, Liu Z, Rogowski A, Weiner DP, sequence similarities. Biochem J. 1991;280(Pt 2):309–16. Ratnaparkhe S, Nurizzo D, Roberts SM, et al. Circular permutation provides an evolutionary link between two families of calcium-dependent 80. Henrissat B, Bairoch A. New families in the classification of glycosyl carbohydrate binding modules. J Biol Chem. 2010;285(41):31742–54. hydrolases based on amino acid sequence similarities. Biochem J. 1993; 293(Pt 3):781–8. 60. Xie G, Yang B, Xu Z, Li F, Guo K, Zhang M, Wang L, Zou W, Wang Y, Peng L. 81. Yin Y, Mao X, Yang J, Chen X, Mao F, Xu Y. dbCAN: a web resource for Global identification of multiple OsGH9 family members and their involvement in cellulose crystallinity modification in rice. PLoS One. 2013;8(1):e50171. automated carbohydrate-active enzyme annotation. Nucleic Acids Res. 2012; 40(Web Server issue):W445–51. 61. Lopez-Casado G, Urbanowicz BR, Damasceno CM, Rose JK. Plant glycosyl 82. Kumar S, Stecher G, Tamura K. MEGA7: molecular evolutionary genetics hydrolases and biofuels: a natural marriage. Curr Opin Plant Biol. 2008;11(3):329–37. analysis version 7.0 for bigger datasets. Mol Biol Evol. 2016;33(7):1870–4. 62. Alahuhta M, Xu Q, Bomble YJ, Brunecky R, Adney WS, Ding SY, Himmel ME, 83. Letunic I, Doerks T, Bork P. SMART: recent updates, new developments and Lunin VV. The unique binding mode of cellulosomal CBM4 from Clostridium status in 2015. Nucleic Acids Res. 2015;43(Database issue):D257–60. thermocellum cellobiohydrolase a. J Mol Biol. 2010;402(2):374–87. 63. Duan CJ, Feng YL, Cao QL, Huang MY, Feng JX. Identification of a novel 84. Schultz J, Milpetz F, Bork P, Ponting CP. SMART, a simple modular family of carbohydrate-binding modules with broad ligand specificity. Sci architecture research tool: identification of signaling domains. Proc Natl Rep. 2016;6:19392. 
Acad Sci U S A. 1998;95(11):5857–64. 85. Henikoff S, Henikoff JG. Amino acid substitution matrices from protein 64. Prates ET, Stankovic I, Silveira RL, Liberato MV, Henrique-Silva F, Pereira N, Jr., blocks. Proc Natl Acad Sci U S A. 1992;89(22):10915–9. Polikarpov I, Skaf MS: X-ray structure and molecular dynamics simulations of 86. Styczynski MP, Jensen KL, Rigoutsos I, Stephanopoulos G. BLOSUM62 endoglucanase 3 from Trichoderma harzianum: structural organization and miscalculations improve search performance. Nat Biotechnol. 2008;26(3):274–5. substrate recognition by endoglucanases that lack cellulose binding 87. Bouckaert R, Heled J, Kuhnert D, Vaughan T, Wu CH, Xie D, Suchard MA, module. PLoS One 2013, 8(3):e59069. Rambaut A, Drummond AJ. BEAST 2: a software platform for Bayesian 65. Boraston AB, Nurizzo D, Notenboom V, Ducros V, Rose DR, Kilburn DG, evolutionary analysis. PLoS Comput Biol. 2014;10(4):e1003537. Davies GJ. Differential oligosaccharide recognition by evolutionarily- related beta-1,4 and beta-1,3 glucan-binding modules. J Mol Biol. 2002; 88. Drummond AJ, Ho SY, Phillips MJ, Rambaut A. Relaxed phylogenetics and 319(5):1143–56. dating with confidence. PLoS Biol. 2006;4(5):e88. 66. Charnock SJ, Bolam DN, Nurizzo D, Szabo L, McKie VA, Gilbert HJ, Davies GJ. 89. Drummond AJ, Suchard MA, Xie D, Rambaut A. Bayesian phylogenetics with Promiscuity in ligand-binding: the three-dimensional structure of a BEAUti and the BEAST 1.7. Mol Biol Evol. 2012;29(8):1969–73. Piromyces carbohydrate-binding module, CBM29-2, in complex with cello- 90. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, and mannohexaose. Proc Natl Acad Sci U S A. 2002;99(22):14077–82. Remmert M, Soding J, et al. Fast, scalable generation of high-quality protein 67. Crennell SJ, Cook D, Minns A, Svergun D, Andersen RL, Nordberg Karlsson E. multiple sequence alignments using Clustal omega. Mol Syst Biol. 2011;7:539. 
Dimerisation and an increase in active site aromatic groups as adaptations 91. Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: a sequence logo to high temperatures: X-ray solution scattering and substrate-bound crystal generator. Genome Res. 2004;14(6):1188–90. structures of Rhodothermus marinus endoglucanase Cel12A. J Mol Biol. 92. Schneider TD, Stephens RM. Sequence logos: a new way to display 2006;356(1):57–71. consensus sequences. Nucleic Acids Res. 1990;18(20):6097–100. 68. Kim SJ, Kim SH, Shin SK, Hyeon JE, Han SO. Mutation of a conserved 93. Jonassen I, Collins JF, Higgins DG. Finding flexible patterns in unaligned tryptophan residue in the CBM3c of a GH9 endoglucanase inhibits activity. protein sequences. Protein Sci. 1995;4(8):1587–95. Int J Biol Macromol. 2016;92:159–66. 94. Buchan DW, Minneci F, Nugent TC, Bryson K, Jones DT. Scalable web 69. Mattinen ML, Kontteli M, Kerovuo J, Linder M, Annila A, Lindeberg G, services for the PSIPRED protein analysis workbench. Nucleic Acids Res. Reinikainen T, Drakenberg T. Three-dimensional structures of three 2013;41(Web Server issue):W349–57. Kundu and Sharma BMC Evolutionary Biology (2018) 18:79 Page 19 of 19 95. Cserzo M, Eisenhaber F, Eisenhaber B, Simon I. On filtering false positive transmembrane protein predictions. Protein Eng. 2002;15(9):745–52. 96. Cserzo M, Wallin E, Simon I, von Heijne G, Elofsson A. Prediction of transmembrane alpha-helices in prokaryotic membrane proteins: the dense alignment surface method. Protein Eng. 1997;10(6):673–6. 97. Jones DT. Improving the accuracy of transmembrane protein topology prediction using evolutionary information. Bioinformatics. 2007;23(5):538–44. 98. Jones DT, Taylor WR, Thornton JM. A model recognition approach to the prediction of all-helical membrane protein structure and topology. Biochemistry. 1994;33(10):3038–49. 99. Kall L, Krogh A, Sonnhammer EL. Advantages of combined transmembrane topology and signal peptide prediction–the Phobius web server. 
Nucleic Acids Res. 2007;35(Web Server issue):W429–32. 100. Nugent T, Jones DT. Transmembrane protein topology prediction using support vector machines. BMC Bioinformatics. 2009;10:159. 101. Kall L, Krogh A, Sonnhammer EL. A combined transmembrane topology and signal peptide prediction method. J Mol Biol. 2004;338(5):1027–36. 102. Nicodeme P. Fast approximate motif statistics. J Comput Biol. 2001;8(3):235–48. 103. Nasser W, Santhanam B, Miranda ER, Parikh A, Juneja K, Rot G, Dinh C, Chen R, Zupan B, Shaulsky G, et al. Bacterial discrimination by dictyostelid amoebae reveals the complexity of ancient interspecies interactions. Curr Biol. 2013;23(10):862–72. 104. Sanders D, Borys KD, Kisa F, Rakowski SA, Lozano M, Filutowicz M. Multiple Dictyostelid species destroy biofilms of Klebsiella oxytoca and other gram negative species. Protist. 2017;168(3):311–25. 105. Dashtban M, Schraft H, Qin W. Fungal bioconversion of lignocellulosic residues; opportunities & perspectives. Int J Biol Sci. 2009;5(6):578–95. 106. Ghareeb H, Becker A, Iven T, Feussner I, Schirawski J. Sporisorium reilianum infection changes inflorescence and branching architectures of maize. Plant Physiol. 2011;156(4):2037–52. 107. Hilden L, Daniel G, Johansson G. Use of a fluorescence labelled, carbohydrate-binding module from Phanerochaete chrysosporium Cel7D for studying wood cell wall ultrastructure. Biotechnol Lett. 2003;25(7):553–8. 108. Martin F, Aerts A, Ahren D, Brun A, Danchin EG, Duchaussoy F, Gibon J, Kohler A, Lindquist E, Pereda V, et al. The genome of Laccaria bicolor provides insights into mycorrhizal symbiosis. Nature. 2008;452(7183):88–92. 109. Sims PF, Soares-Felipe MS, Wang Q, Gent ME, Tempelaars C, Broda P. Differential expression of multiple exo-cellobiohydrolase I-like genes in the lignin-degrading fungus Phanerochaete chrysosporium. Mol Microbiol. 1994; 12(2):209–16. 110. Sagane Y, Zech K, Bouquet JM, Schmid M, Bal U, Thompson EM. 
Functional specialization of cellulose synthase genes of prokaryotic origin in chordate larvaceans. Development. 2010;137(9):1483–92. 111. Di Bella MA, Fedders H, De Leo G, Leippe M. Localization of antimicrobial peptides in the tunic of Ciona intestinalis (Ascidiacea, Tunicata) and their involvement in local inflammatory-like reactions. Results Immunol. 2011;1(1):70–5. 112. Fischer R, Ostafe R, Twyman RM. Cellulases from insects. Adv Biochem Eng Biotechnol. 2013;136:51–64. 113. Grell MN, Linde T, Nygaard S, Nielsen KL, Boomsma JJ, Lange L. The fungal symbiont of Acromyrmex leaf-cutting ants expresses the full spectrum of genes to degrade cellulose and other plant cell wall polysaccharides. BMC Genomics. 2013;14:928. 114. Khademi S, Guarino LA, Watanabe H, Tokuda G, Meyer EF. Structure of an endoglucanase from termite, Nasutitermes takasagoensis. Acta Crystallogr D Biol Crystallogr. 2002;58(Pt 4):653–9. 115. Kundu S. Distribution and prediction of catalytic domains in 2-oxoglutarate dependent dioxygenases. BMC Res Notes. 2012;5:410. 116. Matos DA, Whitney IP, Harrington MJ, Hazen SP. Cell walls and the developmental anatomy of the Brachypodium distachyon stem internode. PLoS One. 2013;8(11):e80640. 117. Sullivan S, Ralet MC, Berger A, Diatloff E, Bischoff V, Gonneau M, Marion-Poll A, North HM. CESA5 is required for the synthesis of cellulose with a role in structuring the adherent mucilage of Arabidopsis seeds. Plant Physiol. 2011; 156(4):1725–39. 118. Tan HT, Shirley NJ, Singh RR, Henderson M, Dhugga KS, Mayo GM, Fincher GB, Burton RA. Powerful regulatory systems and post-transcriptional gene silencing resist increases in cellulose content in cell walls of barley. BMC Plant Biol. 2015;15:62. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png BMC Evolutionary Biology Springer Journals

Origin, evolution, and divergence of plant class C GH9 endoglucanases

Publisher
Springer Journals
Copyright
Copyright © 2018 by The Author(s).
Subject
Life Sciences; Evolutionary Biology; Animal Systematics/Taxonomy/Biogeography; Entomology; Genetics and Population Dynamics; Life Sciences, general
eISSN
1471-2148
D.O.I.
10.1186/s12862-018-1185-2

cognate substrate, contribute to processivity and thermal In land plants (Viridiplantae), the activity profile of stability, and interestingly introduce catalytic competency GH9 endoglucanases on cellulose, correlates, in part, [62–78]. We utilize a combination of phylogenetic ana- with their distribution, as well as the purported roles lysis, pattern approximation, identification, distribution in growth, development, flowering, and seed germin- analysis, and residue mapping of the CBM49 to investigate ation [16]. The carbohydrate binding modules/ do- the emergence of crystalline cellulose digesting activity in mains (n = 64), are sequences 40 − 200 aa in length, land plants. Finally, we complement these analyses by and despite being intrinsically non catalytic can facili- examining the presence and distribution of transmem- tate the hydrolytic cleavage of the glycosidic linkage brane and signal peptide regions in vascular land plants, [47]. Unlike the C-terminally localized CBM49 of and the possible routes by which endoglucanase se- plant GH9 endoglucanases, different CBMs favouring quences with putative class C activity could contribute to the activity on crystalline cellulose in bacteria, fungi, the emergence of sequences with novel functionality. protists, animals, and possibly archaea and green algae are distributed throughout the length of the se- Methods quence [16]. 
The presence of one or more TM re- Collation, annotation, and domain extraction of GH9 gions also suggests that at least in plants cellulose endoglucanases metabolism may occur in clusters of (biosynthetic, de- Sequences of putative GH9 endoglucanases were down- grading enzymes) and be localized at the membrane loaded from the publically available databases National Kundu and Sharma BMC Evolutionary Biology (2018) 18:79 Page 4 of 19 Center for Biotechnology Information (NCBI; http:// opening = gap extension = 10), with gap opening penal- www.ncbi.nlm.nih.gov) and Carbohydrate-active en- ties of 0.1 (pairwise alignment) and 0.2 (MSA), a diver- zymes (CAZy; http://www.cazy.org/) [16, 79, 80]. Se- gence cut off of 20%, and the BLOSUM62 set of quences of green land plants (Viridiplantae) utilized for matrices (Additional file 5: Table S1, Additional file 1: this analysis were downloaded from Phytozome (https:// Text S1 and Additional file 3: Text S4) [85, 86]. This phytozome.jgi.doe.gov/pz/portal.html), extensively cu- was chosen to account for the purported domain distri- rated, and classified into classes A, B, and C as described bution of classes A, B, and C among the various taxa. previously [16]. Annotation for non-plant GH9 endoglu- Sequences were deemed compatible if and only if their canases was in accordance with the schema adopted by pairwise alignments were free from errors as determined dbCAN (Carbohydrate enzyme annotation; http://csbl. by the distance matrix computed by MEGA7.0. The top bmb.uga.edu/dbCAN)[81]. 
The pooled sequences were scoring amino acid substitution models for the afore- filtered on the basis of their contribution to a compatible mentioned MSAs was selected amongst all (n = 56) using multiple sequence alignment (MSA) and the presence of the Akaike information criteria corrected (min(AICc)) a single GH9 domain as determined by MEGA7.0 (Mo- and the Bayesian information criteria (min(BIC)) as indi- lecular evolutionary genetic analysis, local installation) ces (Additional file 6: Table S3). BEAST v2.4.7 (Bayesian and the SMART (Simple modular architecture research evolutionary analysis by sampling trees) and the accom- tool) server [82–84]. Exclusion criteria for this prelimin- panying software suite (FigTree v1.4.3, DensiTree, Tracer ary data were: a) an indeterminable MSA, b) the complete v1.6, TreeAnnotator) was utilized to infer the date and absence of a demonstrable GH9 domain, c) more than visualize a maximum clade credibility tree with median one GH9 domain ((GH9) : x > 1) in the same sequence, heights, and tabulate descriptive statistics after the poster- and d) presence of a concomitant GH domain other than ior probabilities converged (Tables 2 and 3; Additional file GH9 ((GH9 ∧ GHx): x∈[1, 8] ∧ [10 − 130]). Amino acids at 7: Table S4) [87–89]. Whilst, the age of the node and the the start and end positions of the GH9 domains were branch times of the clades were inferred directly (Mya), noted and extracted (n ) using in-house developed PERL support was denoted as the posterior probabilities (PP%) scripts (Additional file 1: Text S1, Additional file 2:Text and bootstrap values (n = 1000) by maximum likelihood S3, and Additional file 3: Text S4). Here, the final set of (ML%), i.e., support = PP %, ML%, (FigTree v1.4.3). 
compatible sequences of the GH9 domains (n ), pattern Whilst, the selection of the root for evaluating the 1A selected GH9 domains (n , n ), pattern selected and evolution of the GH9 domain (parent of the bacterial 1B 1C GH9 encompassing full length sequences (n ), CBM49/ clade), was based on fossil records that suggested that CBM49-like sequences of land plants (n = n ; X ={A, bacteria were amongst the earliest forms of life 3X LPSX B, C}) comprised the datasets utilized in this study. The (≈3170 − 4180 Mya), the same for the CBM49/ distinct and delineable CBM49 from putative class C GH9 CBM49-like land plants was the presence of a distinct endoglucanases was similarly isolated and comprised (n and delineable CBM49 in the ancestral bryophytes 3C = n ) (Additional file 4: Text S2). The amino acid con- and tracheophytes coupled with the assumption that LPSC tent of the extricated GH9 and CBM49 domains were the parent of class C vascular land plants (≈201 − assessed using PIR (Protein information server, http://pir. 241 Mya) were likely to possess the same architecture georgetown.edu) and categorized on the basis of side (Table 2; Additional file 5: Table S1 and Additional chain content into those with hydrophobic side chains file 8:Table S2)[18, 19]. (HSC), aromatic amino acids (AAA), polar uncharged (PUC), polar charged acidic (PCA), and polar charged basic (PCB). 
The GH9 domains were used for phylogen- Pattern analysis and motif approximation of CBM49 in etic analysis and time tree estimation (n ), CBM49 was putative GH9 endoglucanases 1A utilized for pattern analysis and motif approximation (n ) The boundaries of CBM49 were defined in characterized 3C , and CBM49-like full length sequences from plant and and putative class C GH9 endoglucanase sequences with non-plant taxa were utilized for assessing relevant bio- single- and multiple-copies of the GH9 domain (n =116) informatics indices (n , n )(Additional file 1:Text S1, (Additional file 5: Table S1C and Additional file 8:Table 1B 1C Additional file 4: Text S2, Additional file 2:TextS3 and S2B) [6–8, 83, 84]. These were then clustered, realigned, Additional file 3:TextS4). and represented using the Clustal Omega and WebLogo servers (https://www.ebi.ac.uk/Tools/msa/clustalo; http:// Model selection, phylogenetic analysis, and time tree weblogo.berkeley.edu/logo.cgi) with default parameters estimation [90–92]. The refined list of CBM49 sequences in (n =100) Multiple sequence alignments (MSA) of the extracted class C GH9 endoglucanases were then submitted to the GH9 domains and the CBM49/ CBM49-like in land PRATT v 2.1 server (http://web.expasy.org/pratt), and uti- plants were generated using the default parameters (gap lized to identify and score suitable domain spanning Kundu and Sharma BMC Evolutionary Biology (2018) 18:79 Page 5 of 19 Table 2 Parameters utilized for Bayesian inference of evolution Table 3 Taxonomic distribution of bacteria in datasets of the GH9 and CBM49 domains Dataset n n 1 2 Site Model: Gamma Number of sequences 116 64 Subsitution rate = 1.0 1. Firmicutes 65 44 Substitution model: WAG/ JTT Clostridiales 49 40 Gamma category count = 5 Bacillales 15 4 Shape: 1.537/ 0.813 Selenomonadales 1 – Proportion invariant: NA/ 0.027 2. 
Actinobacteria 20 10 Clock Model: Relaxed clock Log Normal Micrococcales 3 1 Number of Discrete Rates = − 1 Streptomycetales 11 6 Clock rate = 1.0 Streptosporangiales 2 1 Calibrated Yule model Birth rate = 1.0 Micromonosporales 2 – Type (Full) Pseudonocardiales 2 2 birthRate Model: Gamma 3. Proteobacteria 20 5 Initial = 1.0 [−∞, ∞] Gamma (γ)14 3 α = 1.0E − 03 Alpha (α)4 2 β = 1.0E +03 Delta (δ)1 – Mode:= Shape Scale Undefined 1 – Offset = 0.0 4. CFB 9 3 gammaShape Model: Gamma 5. Cyanobacteria 1 – Initial = 1.0 [−∞, ∞] α = 1.0E − 03 6. Undefined 1 1 β = 1.0E +03 Abbreviations: GH9 Glycoside hydrolase 9, CFB Chlorobi, Mode: Shape Scale Fibrobacteres, Bacteroidetes Offset = 0.0 Population mean Model: Exponential patterns [93]. A profile of these patterns (n = 20) was gen- erated based on the numbers of putative class C enzymes Initial = 1.0 [−∞, ∞] μ = 10.0 that they were found in, i.e., 5→ 100 (Table 4). This was Offset = 0.0 used to search for sequences with CBM49-like motifs amongst full length GH9 endoglucanase sequences with- Uncorrelated relaxed local clock mean Model: Exponential Initial = 1.0 [−∞, ∞] outa delineable CBM49region, andonthe GH9domainit- μ = 10.0 self and was accomplished using the server ScanProsite Offset = 0.0 (http://prosite.expasy.org/scanprosite)(Additional file 9: Uncorrelated relaxed local clock Model: Exponential Table S5). These datasets (n , n , n , n ) along with the standard deviation 1B 1C 2 3 Initial = 1.0 [−∞, ∞] subset of was used for all further analyses (Tables 4, 5 and σ = 0.3337 6; Additional file 9: Table S5, Additional file 10: Table S6, Offset = 0.0 Additional file 11: Table S7 and Additional file 12:Table Root Parent of: Bacteria/ Vascular class C S8, Additional file 13 Text S9, Additional file 16:TextS10, land plants Additional file 14: Text S11 and Additional file 15:Text Monophyletic S12). 
Alternatively, a Hidden Markov Model or support Model: Log Normal vector machine(SVM) may havebeenutilizedfor μ = 8.2/ 5.41 this part of the analysis. SVMs, are binary classifiers σ = 0.07/ 0.055 and incorporate several features of the training se- Offset = 0.0 quences to determine presence/ absence in an un- 2.5% Quantile = 3170/ 201Mya known sequence of interest. Whilst the SVM for the 97.5% Quantile = 4180/ 249Mya CBM49 could have been easily constructed, its utility Markov chain monte carlo Chain length = 14,917,000 / in identifying the same in a distantly related se- 16,120,000 quence is likely to be limited. The HMM, however, Pre Burnin = 4,200,000/ 2,130,000 for this specific module hand would simply indicate Recording interval = 1000 the existence of a similar region above a certain threshold. Since, our requirement mandated features Abbreviations: GH9 Glycoside hydrolase 9, CBM49 Carbohydrate binding module 49, WAG Whelan and Goldman, JTT Jones, Taylor, and Thornton of both these, i.e., presence/ absence of CBM49-like Kundu and Sharma BMC Evolutionary Biology (2018) 18:79 Page 6 of 19 Table 4 Alignment based pattern analysis of CBM49 in putative and characterized class C GH9 endoglucanases Motif Fs Sm Rm 1 GPIWGLTK[AS]G[DN]SY[GTV]FP[EST][HW][IL][NS][ST]L[APS][AV]GKS[LM]EFVYIH[AS][AT]S 140.4529 5 2.47E-35 2 GPIWGL[ST][KR]SG[DN]S[FY][AGT][FL]P[EST][HW][ILM]x[ST]Lx[AS]GKSLEFVYIH[AS][AT][ST] 131.0036 10 3.02E-32 3 GPIWGL[NST]x(2)[GP][DENQ]x(2)[AGTV] 75.6634 15 7.32E-04 4 GPIWGL[ST]x(2)[GP][DEN]x(2)[AGTV]x[PV]x(4)[STV]x(3)[GQ]x[GS]xE[FV][NV][FY][IV][HY][ASTV][AQT][GPST] 35.2349 20 5.86E-16 5 GPIWG[LV][NST]x[AST][GP][DENQT]x(2)[AGSTV] 36.2141 25 4.10E-04 6 GPIWG[LV][ANST]x(2)[GP][DENQT]x(2)[AGSTV] 33.2127 30 2.92E-03 7 GPI[WY]G[LV][ANST]x(3)[DENQT]x(2)[ADGSTV] 28.9138 35 1.00E-01 8 GP[IL]WGL[ANST]x(3)[ADEGNQ] 28.034 40 0.16 9 GP[IL]WG[LV][NST]x(3)[DENQT] 27.6347 45 0.14 10 GP[IL]WG[LV][AENST]x(3)[ADEGNQT] 26.4888 50 0.39 11 GP[IL][WY]G[LV][AENST]x(3)[ADEGNQT] 
25.6627 55 1.4 12 GP[ILV][WY]G[LV][AENST] 23.5981 61 4.9 13 GP[ILV]xG[LV] 18.3557 65 356 14 G[NPS][IL][WY]G[LV][ANST] 22.9977 70 9.2 15 G[NPS][ILV]WG[LV] 20.977 77 15 16 G[NPS][ILV][WY]G[LV] 20.1508 80 53 17 G[DNPQS][ILV][WY]G[LV] 19.4127 85 83 18 G[DENPQST]x(2)G[LV] 12.9137 90 12,445 19 Gx[ILV][WY]G[LV] 17.5296 98 323 20 Gx(3)G[LV] 11.5238 100 33,184 Abbreviations: Fs Fitness score, E Glutamic acid, Sm Number of sequences matched, Q Glutamine, Rm Estimated number of random matches, S Serine, A Alanine, T Threonine, L Leucine, C Cysteine, M Methionine, Y Tyrosine, I Isoleucine, F Phenylalanine, V Valine, W Tryptophan, G Glycine, K Lysine, D Aspartic acid, H Histidine, N Asparagine, P Proline, R Arginine, x Any amino acid regions in GH9 domain containing endoglucanases C) (Additional file 6 Table S3, Additional file 7:Table S4, across taxa, these predictors of the extrema would Additional file 9: Table S5, Additional file 10:Table S6 and not have sufficed. Additional file 11:Table S7,Additional file 1: Texts S1, Additional file 4:Texts S2,Additionalfile 2: Texts S3 and Domain analysis of plant GH9 endoglucanases Additional file 3: Texts S4. 
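The motifs in Table 4 are written in PROSITE pattern syntax, where x stands for any residue, square brackets list alternatives, and a number in parentheses is a repeat count. As a minimal sketch, and not the ScanProsite implementation used in the study, the Python below translates such a pattern into a regular expression and counts hits. The function names `prosite_to_regex` and `count_hits` and the three toy sequences are illustrative assumptions, and only non-overlapping matches are counted, which may differ from the server's behaviour.

```python
import re

# One PROSITE element: x, a residue letter, [alternatives] or {exclusions},
# optionally followed by a repeat count such as (3) or (2,4).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern (e.g. 'Gx(3)G[LV]') into a regex string."""
    pattern = pattern.replace("-", "").strip()  # official syntax separates elements with '-'
    parts = []
    pos = 0
    while pos < len(pattern):
        match = _ELEMENT.match(pattern, pos)
        if match is None:
            raise ValueError(f"unsupported PROSITE syntax at position {pos}: {pattern!r}")
        element, low, high = match.groups()
        if element == "x":
            core = "."                          # any amino acid
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"   # excluded residues
        else:
            core = element                      # literal residue or [alternatives]
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
        pos = match.end()
    return "".join(parts)


def count_hits(pattern: str, sequences: dict) -> tuple:
    """Return (total non-overlapping hits, number of sequences with >= 1 hit)."""
    regex = re.compile(prosite_to_regex(pattern))
    hits = 0
    matched_sequences = 0
    for residues in sequences.values():
        n = sum(1 for _ in regex.finditer(residues))
        hits += n
        if n:
            matched_sequences += 1
    return hits, matched_sequences


# Toy sequences (hypothetical, for illustration only; not taken from the datasets).
toy = {
    "toy_1": "AAGPIWGLTK",
    "toy_2": "GAAAGVGAAAGL",
    "toy_3": "MKTAYIAK",
}

print(prosite_to_regex("Gx(3)G[LV]"))           # G.{3}G[LV]
print(prosite_to_regex("G[DENPQST]x(2)G[LV]"))  # G[DENPQST].{2}G[LV]
print(count_hits("Gx(3)G[LV]", toy))            # (3, 2)
```

Reporting hits and matched sequences separately mirrors the way the study distinguishes the two counts, since a short pattern such as p20 can occur more than once in a single sequence.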
The above compiled datasets (n_1 − n_3) were meant to offer an insight into the origin and evolution of the GH9-CBM49-like domain across all taxa, the end point being the emergence of plant GH9 endoglucanases (classes A, B, and

Since the methods discussed afford compelling evidence of the ancestral nature of class C GH9 endoglucanase sequences, our subsequent analyses (domain frequency) were focussed on establishing potential divergence of class C members and/or the emergence of classes A and B. Plant GH9 endoglucanase sequences possess a differential distribution of TM, SP, and CBM49 regions, and the frequency of occurrence of these was analysed by directly comparing CBM49 positive class C members (n_3C = n_LPSC = 97) with pattern 20 selected sequences of putative classes A (n_3A = n_LPSA = 22) and B (n_3B = n_LPSB = 75) (Additional file 10: Table S6, Additional file 3: Text S4). Since the hydrophobic profiles of these regions overlap, we utilized data from three algorithms that predict both TM and SP regions to arrive at a consensus.

sequence as strong (TM), weak pore-lining (PH), or re-entrant (RH), i.e., (TM ∨ PH ∨ RH) [94, 100]. The dense alignment surface (DAS-TMfilter) differs from other predictors of transmembrane regions in considering hydrophobic region(s) of a query protein, and mapping the results to known transmembrane regions [95, 96]. PHOBIUS is a hidden Markov model based delineator of signal peptide regions and uses sub models of the sequences that comprise these regions along with topology information to make predictions [101].

Table 5 Distribution of sequence segments in classes A, B, and C plant GH9 endoglucanases
Columns: MEMSAT-SVM (SP, TM), DAS (SP, TM), PHOBIUS (SP, TM)
C0 (NN): MEMSAT-SVM 0.0000 (0/97); DAS 0.0588 (3/51); PHOBIUS 0.0000 (0/100)
C1 (YY): MEMSAT-SVM SP 0.7525 (73/97), TM 1.0000 (97/97); DAS SP 0.8604 (37/43), TM 0.8958 (43/48); PHOBIUS SP 0.0000 (0/2), TM 0.0200 (2/100)
C2 (NY): MEMSAT-SVM SP 0.2474 (24/97); DAS SP 0.1395 (6/43); PHOBIUS SP 1.0000 (2/2)
B0 (NN): MEMSAT-SVM 0.0000 (0/75); DAS 0.0196 (1/51); PHOBIUS 0.0533 (4/75)
B1 (YY): MEMSAT-SVM SP 0.8133 (61/75), TM 1.0000 (75/75); DAS SP 0.5600 (28/50), TM 0.9803 (50/51); PHOBIUS SP 0.3333 (1/3), TM 0.0422 (3/71)
B2 (NY): MEMSAT-SVM SP 0.1866 (14/75); DAS SP 0.4400 (22/50); PHOBIUS SP 0.6667 (2/3)
A0 (NN): MEMSAT-SVM 0.0000 (0/22); DAS 0.0454 (1/22); PHOBIUS 0.0909 (2/22)
A1 (YY): MEMSAT-SVM SP 0.0000 (0/22), TM 1.0000 (22/22); DAS SP 0.0000 (0/21), TM 0.9545 (21/22); PHOBIUS SP 0.0000 (0/20), TM 0.9090 (20/22)
A2 (NY): MEMSAT-SVM SP 1.0000 (22/22); DAS SP 1.0000 (21/21); PHOBIUS SP 1.0000 (20/20)
Abbreviations: SVM Support vector machine, SP Signal peptide, TM Transmembrane region, DAS Density alignment surface, YY $(SP)^{+} \wedge (TM \vee PH \vee RH)^{+}$, NY $(SP)^{-} \wedge (TM \vee PH \vee RH)^{+}$, NN $(SP)^{-} \wedge (TM \vee PH \vee RH)^{-}$

Table 6 Salient features of putative GH9 endoglucanase sequences with multiple delineable domains
Columns: GH9, CBM2, CBM3, CBM4_9, CBM10, CBMX_2, CBM49, pattern 20
ALS (n = 3)
gi|313241202   Y   Y (cT)   Y
gi|260808721   Y   Y (nT)   Y
gi|254553092   Y   Y (nT)   Y
BAC (n = 24)
gi|15894203   Y   Y (cT)   Y
gi|15894200   Y   Y (cT)   Y
gi|15893851   Y   Y (nT)   Y
gi|300789210   Y   Y (cT)   Y (nT)   Y
gi|300785821   Y   YY (cT)   Y
gi|121833   Y   YY (cT)   YY (cT)   Y
gi|320006799   Y   Y (cT)   Y (nT)   Y
gi|295094191   Y   Y (nT)   Y
gi|291544575   Y   Y (nT)   Y
gi|291543938   Y   Y (cT)   Y
gi|34811382   Y   Y (cT)   Y
gi|34811081   Y   Y (cT)   Y
gi|2554767   Y   Y (cT)   Y
gi|551774   Y   Y (cT)   Y
gi|311900744   Y   Y (cT)   Y
gi|311900370   Y   Y (cT)   Y (cT)   Y
gi|270288703   Y   Y (cT)   Y
gi|270288702   Y   Y (cT)   Y
gi|270288700   Y   Y (nT)   Y
gi|270288699   Y   Y (cT)   Y
gi|39636954   Y   Y (cT)   Y
gi|6272570   Y   Y (nT)   YYYYY
gi|237858935   Y   Y (cT)   Y (cT)   YY
gi|4490766   Y   YY (cT)   YYY
PRS (n = 2)
gi|281207043   Y   Y (cT)
gi|281207029   Y   Y (cT)
Abbreviations: ALS Animals, BAC Bacteria, PRS Protists, Y Present, nT N-terminal, cT C-terminal, GH9 Glycoside hydrolase 9, CBM Carbohydrate binding module
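The fractions reported in Table 5 are ratios of simple set counts over per-sequence predictions (NN, NY, and YY as defined in the table abbreviations). The Python below is a minimal sketch of that bookkeeping, assuming each sequence carries boolean calls for SP, TM, PH, and RH from a single predictor. The `Prediction` class, the `segment_ratios` function, and the synthetic input are illustrative assumptions and not the authors' scripts.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Prediction:
    """Boolean calls from one predictor for one sequence (hypothetical container)."""
    seq_id: str
    sp: bool   # signal peptide predicted
    tm: bool   # strong transmembrane helix predicted
    ph: bool   # pore-lining helix predicted
    rh: bool   # re-entrant helix predicted

    @property
    def tm_like(self) -> bool:
        # Combined call, i.e. TM or PH or RH
        return self.tm or self.ph or self.rh


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    return numerator / denominator if denominator else None


def segment_ratios(predictions: Iterable[Prediction]) -> dict:
    preds = list(predictions)
    # Sequences negative for both SP and TM-like regions (NN) are set aside.
    nn = [p for p in preds if not p.sp and not p.tm_like]
    # Sequences with any TM-like region (Y).
    y = [p for p in preds if p.tm_like]
    # Split Y by signal peptide status (NY without SP, YY with SP).
    ny = [p for p in y if not p.sp]
    yy = [p for p in y if p.sp]
    # Ratios |NY|/|Y| and |YY|/|Y| make predictors comparable.
    # Sequences with SP but no TM-like region fall outside NN, NY and YY.
    return {
        "NN": len(nn),
        "Y": len(y),
        "NY": len(ny),
        "YY": len(yy),
        "NY/Y": _ratio(len(ny), len(y)),
        "YY/Y": _ratio(len(yy), len(y)),
    }


# Synthetic input shaped like the MEMSAT-SVM class C counts in Table 5:
# 97 sequences with a TM-like region, 73 of which also carry a signal peptide.
example = [
    Prediction(f"seq_{i}", sp=(i < 73), tm=True, ph=False, rh=False)
    for i in range(97)
]
result = segment_ratios(example)
print(result["YY"], result["NY"], result["Y"])
print(round(result["YY/Y"], 4), round(result["NY/Y"], 4))
```

With these synthetic calls the counts are 73, 24, and 97, which correspond to the 73/97 and 24/97 entries listed for class C under MEMSAT-SVM.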
The servers consulted were: MEMSAT-SVM, DAS-TMfilter, and PHOBIUS [94–101] (Additional file 11: Table S7, Additional file 13: Text S9, Additional file 16: Text S10, Additional file 14: Text S11 and Additional file 15: Text S12). The MEMSAT-SVM classifies membrane spanning helical regions in a

Algorithm to assess contribution of prediction method to each sub segment

Full length sequences of land plants encompassing the CBM49-pattern 20, i.e., classes A, B, and C (n_3 = (n_3LPSA = n_3A) + (n_3LPSB = n_3B) + (n_3LPSC = n_3C) = 187) were searched for well defined amino acid segments using the aforementioned servers (MEMSAT-SVM, DAS, PHOBIUS). The subset (NN) was used to define sequences without delineable TM and SP regions (NN = {C0, B0, A0}). The method of choice was determined by rendering the resultant data equivalent and, therefore, comparable. The definitions utilized are as under:

$TM$ := Sequences with one or more predicted transmembrane domains
$SP$ := Sequences with one or more predicted signal peptide regions
$PH$ := Sequences with one or more predicted pore lining helices
$RH$ := Sequences with one or more predicted re-entrant helices
$NN := (SP)^{-} \wedge (TM \vee PH \vee RH)^{-}$
$NY := (SP)^{-} \wedge (TM \vee PH \vee RH)^{+}$
$YY := (SP)^{+} \wedge (TM \vee PH \vee RH)^{+}$
$Y := (TM \vee PH \vee RH)^{+}$

Step 1: Sequences with negative predictions for both SP and TM regions ($f(NN) \leftrightarrow \mathbb{N}$), i.e., $\{x_i \in NN \subset n_3 \mid (SP)^{-} \wedge (TM \vee PH \vee RH)^{-},\ i \in \mathbb{N}\}$, were removed from the computations.
Step 2: The remaining sequences were assessed for the presence of the transmembrane subregions ($f(Y) \leftrightarrow \mathbb{N}$), i.e., $\{x_i \in Y \subset n_3 \mid (TM \vee PH \vee RH)^{+},\ i \in \mathbb{N}\}$.
Step 3: The data computed in Step 2 were then used to calculate the number of sequences with or without an associated signal peptide region ($f(NY) \leftrightarrow \mathbb{N}$ and $f(YY) \leftrightarrow \mathbb{N}$), i.e., $\{x_i \in NY \subset n_3 \mid (SP)^{-} \wedge (TM \vee PH \vee RH)^{+},\ i \in \mathbb{N}\}$ and $\{x_i \in YY \subset n_3 \mid (SP)^{+} \wedge (TM \vee PH \vee RH)^{+},\ i \in \mathbb{N}\}$.
Step 4: The data from the above steps were used to compute the ratios $\left(\frac{|NY|}{|Y|}, \frac{|YY|}{|Y|}\right)$, which establish equivalence between the predictions and, thereby, a rationale for their subsequent inclusion/exclusion.

Results

Taxonomic distribution of the GH9 domain

The GH9 domain averages ≈448 aa, and is present as a single copy in the sequences investigated (n = 607), i.e., bacteria (BAC), land plants (LPS), animals (ALS), fungi (FGI), green algae (GAL), protists (PRS), and archaea (ARC) (Fig. 1c; Additional file 5: Table S1A). Although the vast majority of sequences selected for this study were putative GH9 endoglucanases, empirical data (kinetic, transcript data, 3D structure) for many of these taxa were available and included (n_LPS = 26; n_ALS = 1; n_BAC = 11). Whilst most sequences possessed alignment compatible GH9 domains (n_1A = 601), there were few sequences (n = 6) which could not be aligned and were not utilized in the estimation of divergence of GH9 domains across taxa (Additional file 5: Table S1A, Additional file 1: Text S1). The source of error was most likely the archaeal sequence (Methanohalobium evestigatum; tr|D7E938). This sequence has a predicted GH9 domain length of 222 aa (Eval = 1.2E−08), and sub optimally aligned sequences are likely to have inflated scores in excess of the threshold for inclusion. On the other hand, despite possessing GH9 domains of suitable length, the lower confidence levels of the HMM predictor for α-proteobacteria (Asticcacaulis biprosthecum; gi|328841530, gi|328840708; Evals = 2.20E−17, 8.40E−15), and a member each of the Chlorobi-Fibrobacteres-Bacteroidetes (CFB) ancestral phylum (Bacteroides fluxus YIT 12057; gi|328530713, gi|328531610; Evals = 2.80E−25, 2.40E−23) and subgroup Bacillales of the Firmicutes (Listeria innocua; gi|313621564; Eval = 1.50E−20) were probable confounders for the alignment mismatch (Additional file 5: Table S1A). The bacterial subgroup comprised Gram negative (proteobacteria) and Gram positive organisms (members of CFB phylum, cyanobacteria, firmicutes, and bacillales) (Table 3, Fig. 1c). However, multiple distinct representations of the GH9 domain in one protein are not uncommon, and are present as two or four (Saccoglossus kowalevskii; gi|291236258) copies (n = 16; n_ALS = 7, n_BAC = 2, n_LPS = 7) (Additional file 5: Table S1B). Additionally, we observed the concomitant presence of heterogeneous Glycoside hydrolase domains in some bacterial species (n_BAC = 4), which included Caldocellum saccharolyticum (gi|1708078; GH9, GH48), Ruminococcus champanellensis (gi|291543673; GH9, GH16), Ruminiclostridium thermocellum (gi|1663519; GH9, GH44), and Caldicellulosiruptor spp. (gi|12743885; GH9, GH44) (Additional file 5: Table S1C). Interestingly, despite being classified as GH9 members, only the anaerobic methanogen (Methanohalobium evestigatum; tr|D7E938) of the archaea subgroup Euryarchaeota possessed the requisite GH9 domain (Additional file 5: Table S1D).

Evolution and emergence of the GH9 and CBMs in plant and non plant taxa

The data suggest that the GH9 domain is conserved across all taxa and a catalytically functional copy may have been present in bacteria (≈3000 Mya; support = 100%, 96%) (Fig. 2; Additional file 7: Table S4A, Additional file 17: Text S5 and Additional file 18: Text S6). Interestingly, the clades of the land plants and green algae appear to have diverged relatively early and independently of the animals, fungi, and the protists (≈1961 Mya; support = 100%). The GH9 domains of the land plants and green algae continued to evolve for another ≈1750 million years, finally diverging from each other relatively recently (≈211 Mya; support = 97%). In contrast, the protists diverged from animals and fungi (≈817 Mya; support = 97%), whilst GH9 domains of animals and fungi diverged from each other (≈11 Mya; support = 97%).

Fig. 2 Evolution of GH9 domain. A Bayesian inference (BI) dated tree was estimated (maximum clade credibility) from the computed tree population (n = 4476; burn-in = 70%) using the WAG amino acid substitution model and parent of the clade of bacteria as the root. Whilst node ages (= node height = branch time of the longest diverging taxa) and branch times are in Mya, support for branch points is indicated by the posterior probabilities (PP%) and bootstrap values (n = 1000; ML%), i.e., support = PP%, ML%. The root for this tree was the parent of bacteria (3170 − 4180 Mya). The log likelihood for this tree was (≈ −0.0838233). Abbreviations: BI, Bayesian inference; GH9, glycoside hydrolase; Mya, millions of years; WAG, Whelan and Goldman
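The root calibration quoted in the Fig. 2 caption (3170 − 4180 Mya) matches the 2.5% and 97.5% quantiles listed in Table 2 for the log-normal root prior. The Python below is a minimal sketch of that check, assuming the prior is parameterised in log space (μ and σ are the mean and standard deviation of the logarithm of the age in Mya) with zero offset. The function name `lognormal_interval` is illustrative, and this is not the BEAST code path.

```python
import math
from statistics import NormalDist


def lognormal_interval(mu: float, sigma: float,
                       lower: float = 0.025, upper: float = 0.975,
                       offset: float = 0.0) -> tuple:
    """Quantiles of a log-normal distribution whose log has mean mu and sd sigma."""
    standard_normal = NormalDist()
    low = offset + math.exp(mu + sigma * standard_normal.inv_cdf(lower))
    high = offset + math.exp(mu + sigma * standard_normal.inv_cdf(upper))
    return low, high


# Root prior parameters as listed in Table 2 (first/second value of each pair).
first_low, first_high = lognormal_interval(mu=8.2, sigma=0.07)
second_low, second_high = lognormal_interval(mu=5.41, sigma=0.055)

print(round(first_low), round(first_high))
print(round(second_low), round(second_high))
```

Under this assumption the first pair of parameters gives roughly 3174 and 4176 Mya and the second pair roughly 201 and 249 Mya, in line with the tabulated quantiles after rounding.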
acids whose side chain functional groups (PUC ={−OH, A generic timeline for the evolution of the GH9 −SH, −NH }), i.e., Serine (S), Threonine (T), Cysteine (C), domain, i.e., BAC > PRS >{FGI, GAL, ALS, LPS}, is Tyrosine (Y), Asparagine (N), and Glutamine (Q), could perfectly plausible (Fig. 2). We also posited, and potentially contribute to the catalytic machinery of these thence investigated the contribution of non-GH9 re- putative enzymes (Additional file 8: Table S2C). Interest- gions (CBM49, linker(s)) to substrate dichotomy (crystal- ingly, there was a paucity of the catalytic permissive (PCA line, amorphous) in plant GH9 endoglucanases. We ={−COO }) amino acids (D/E) in the sequences analysed observed distinct and delineable CBM49s (79 − 84 aa; me- (Fig. 3a and b;Additional file 8: Table S2, Additional file 4: dian =81 aa) in putative class C GH9 endoglucanase se- Text S2). Clearly, the restricted taxonomic distribution of quences of flowering land plants (n = 102) after outlier CBM49 precludes a direct comparison, thereby justifying exclusion (n =2; Zea mays, GRMZM2G143747_P01; Sela- our search for patterns that could approximate CBM49 ginella. moellendorffii, 109529)(Additional file 8:Table (Fig. 3;Additional file 9: Table S5). These patterns were S2A and B). The only exceptions were the presence of a partitioned into those with low/ high fitness strengths, single CBM49 (82 aa) in the protist, Polysphondylium which was correlated to its compositional complexity pallidumPN500 (gi|281207043, gi|281207029) (Additional (Table 4, Fig. 3c). Since, patterns of reduced complexity file 5: Table S1A). Remarkably, our results indicate a unique are likely to be present in a greater number of sequences, copy of CBM49 in bryophytes (n =4; Physcomitrella and also possess low fitness (Fs)scores (Table 4,Fig. 4c). patens) and tracheophytes (n =3; S. moellendorffii) The Rm-value is the expected number of random matches (Additional file 8: Table S2). 
Analysis of the primary sequences also indicates the presence of one or more linker sequences connecting the GH9 to the CBMs. In CBM49 class C sequences this constitutes a 7–77 AA (Prunus persica, ppa022524m; Phaseolus vulgaris, Phvul.011G030300.1) (Additional file 5: Table S1 and Additional file 8: Table S2).

Characterization, analysis, and assessment of relevance of CBM49-spanning patterns in non-plant taxa

The amino acid profile (HSC ≅ 46.2%; AAA ≅ 11%; PUC ≅ 36%; PCA ≅ 4.7%; PCB ≅ 13%) of the truncated CBM49 sequences (n = 102) suggests a high percentage of amino

in 100,000 unrelated sequences [102]. For instance, the pattern with the lowest fitness score (p20), i.e., Gx(3)G[LV], has the value Rm = 33184 (n = 100), whilst the same for the high scoring pattern 1 (p1) was Rm = 2.47E−35 (n = 5) (Table 5, Fig. 3c). The presence of these patterns in CBM49-containing characterized class C sequences was confirmed initially, following which, their occurrence in non-class C members was evaluated (Fig. 4a and b). These data, for full length sequences of putative GH9 endoglucanases without a delineable CBM49 in terms of number of hits and sequences corresponds to: p1–p17 (hits = 0), p18 (hits = 93; sequences = 81), p19 (hits = 2; sequences = 2), and p20 (hits = 233; sequences = 194) (Additional file 9: Table S5A). The results for all taxa with the GH9 domain: p18 (hits = 98; sequences = 89), p19 (hits = 1; sequences = 1), and p20 (hits = 315; sequences = 265) (Additional file 9: Table S5B). The low scoring p18 (Gx[DENQPST]x(2)G[LV]) and p20 (Gx(3)G[LV]) are the only patterns equivalent to the CBM49 which are found in classes A and B along with other taxa, in both full length and GH9 domain sequences (Table 4). The maximal population (≈44–48%) and taxa-specific coverage (ALS, BAC, FGI, PRS, LPS), then justifies the utilization of p20 in defining a dataset that could be used to develop an evolutionary trace of putative class C specific endoglucanase activity (Additional file 7: Table S4 and Additional file 9: Table S5, Additional file 13: Text S9, Additional file 16: Text S10, Additional file 14: Text S11 and Additional file 15: Text S12).

Kundu and Sharma BMC Evolutionary Biology (2018) 18:79 Page 10 of 19

Fig. 3 Characterizing the carbohydrate binding module (CBM49). a Multiple sequence alignment of the CBM49 in class C GH9 endoglucanases. This region has been highlighted in the presented alignment, and suggests a conservation, in not just the overall structure, but also several key residues (W| F| Y; K| R| N| H| Q). Additionally, the highest (p1) and the lowest (p20) scoring patterns that approximate CBM49 have been illustrated. The rudimentary p20 derived from class C sequences was found in several organisms (n = 194), including classes A and B of plant GH9 members, b WebLogo of the carbohydrate binding domain 49 of putative class C plant GH9 endoglucanases. Truncated sequences with a well defined 81 AA region corresponding to the CBM49 were utilized to construct this, and c Analysis of 20 patterns spanning the CBM49 with number of matched sequences (Sm), fitness (Fs), and randomly matched sequences (log(Rm=R)) as indices. Abbreviations: AA, amino acids; CBM49, carbohydrate binding module; Fs, fitness score; GH9, glycoside hydrolase; Sm, number of sequences with matches; Rm, number of randomly matched sequences

This combined, i.e., inclusive of class C sequences, dataset (n_2 = 291) of full length putative GH9 endoglucanase sequences then possessed GH9 (n = 1) and CBM49-p20 (n ≥ 1) occurrences, and includes bacteria (n = 64), animals (n = 18), fungi (n = 5), protists (n = 8), and green algae (n = 2) (Fig. 4b; Additional file 2: Text S3). The distribution of bacteria between the datasets (n_1, n_2) was similar firmicutes (≈56%, ≈69%), actinobacteria (≈17.2%, ≈15.6%), and proteobacteria (≈17.2%, ≈8%) (Table 3). However, the sole archaeal sequence (tr|D7E938) was conspicuous in the absence of the same (Figs. 1c and 4b; Additional file 5: Table S1A). We also observed that while several sequences of land plants, bacteria, and fungi included more than one occurrence of this pattern, green algae, protists, and animals only contained one occurrence of Gx(3)G[LV] (Additional file 9: Table S5A). A search for sequences with pattern 18 (G[DENPQST]x(2)G[LV]), with a marginal increase in fitness strength (|δ_p20,p18| ≅ 1.4) eliminated green algae altogether (Table 4; Additional file 9: Table S5). The taxonomic spread for matched occurrences on the GH9 domain (n_1 = 607) with p18 (n_1B; n_ALS = 3, n_BAC = 34, n_FGI = 3, n_LPSA = 7, n_LPSB = 28, n_LPSC = 14) and p20 (n_1C; n_ALS = 14, n_BAC = 53, n_FGI = 4, n_PRS = 6, n_LPSA = 14, n_LPSB = 70, n_LPSC = 108), reiterates the generic nature of these patterns (Additional file 9: Table S5B). Interestingly,

Fig. 4 Pattern analysis and major findings in selected plant GH9 endoglucanases. a Distribution and presence of high- and low-fitness strength CBM49-spanning patterns (53 'hits' on 4 sequences) in characterized class C enzymes, b Taxonomic distribution of the low strength p20 (n = 291), and c Analysis of the presence of all 20 CBM49-spanning patterns in selected sequences of classes A, B, and C (n = 187). Clearly, the ubiquitous presence of p20 favours its use as an index of the presence of CBM49 in non class C taxa. The higher strength patterns (p1-p17) are limited to putative class C GH9 endoglucanases.
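The 'hits' and 'sequences' tallies quoted for the CBM49-spanning patterns can be illustrated with a short script. What follows is a minimal sketch under stated assumptions, not the authors' pattern-discovery workflow: it assumes that p19 and p20 are written in a PROSITE-style shorthand (x for any residue, x(n) for n residues, square brackets for alternatives), it counts overlapping matches, and it runs on three made-up toy sequences. The names `PATTERNS`, `TOY_SEQUENCES`, `prosite_to_regex`, and `count_hits` are hypothetical.

```python
import re

# Patterns as quoted in the text; the shorthand interpretation is an assumption.
PATTERNS = {
    "p19": "Gx[ILV][WY]G[LV]",
    "p20": "Gx(3)G[LV]",
}

# Hypothetical toy sequences for illustration only (not real GH9 sequences).
TOY_SEQUENCES = {
    "toy_A": "MKGAVWGLTTGSSSGV",
    "toy_B": "MGQDNGLK",
    "toy_C": "MKTAYSPEEL",
}


def prosite_to_regex(pattern: str) -> str:
    """Translate the shorthand into a regular expression string."""
    regex = re.sub(r"x\((\d+)\)", r".{\1}", pattern)  # x(3) -> .{3}
    return regex.replace("x", ".")  # a lone x -> any single residue


def count_hits(sequences: dict, pattern: str) -> tuple:
    """Return (total hits, number of sequences with at least one hit)."""
    # The lookahead lets overlapping occurrences be counted separately.
    regex = re.compile(f"(?=({prosite_to_regex(pattern)}))")
    per_sequence = [sum(1 for _ in regex.finditer(seq)) for seq in sequences.values()]
    return sum(per_sequence), sum(1 for n in per_sequence if n > 0)


if __name__ == "__main__":
    for name, pattern in PATTERNS.items():
        hits, matched = count_hits(TOY_SEQUENCES, pattern)
        print(f"{name}: hits = {hits}; sequences = {matched}")
```

On the toy input the stricter p19, which requires an aromatic W/Y, matches less often than p20, which mirrors in miniature why a low-fitness pattern is expected to be the more generic one.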
Abbreviations: CBM49, carbohydrate binding module; GH9, glycoside hydrolase; p20, pattern 20

and in complete contrast is the profile of occurrences of p19, which despite its low fitness registers a single hit (class C, S. moellendorffii, 109529).

Analysis of CBM49 and CBM49-like GH9 endoglucanases of vascular land plants

In addition to establishing the origins of CBM49, we examined the divergence of putative class C GH9 endoglucanase sequences and the emergence of classes A and B in vascular land plants. To accomplish this a subset of pattern 20 selected GH9 endoglucanase sequences in land plants (n = 186; n_LPSA = 22, n_LPSB = 75, n_LPSC = 89) was collated and compared. The node ages and branch times suggest that vascular class C (≈222 Mya; support = 100%, 99%) GH9 endoglucanases predate members of classes A and B (≈114 Mya; support = 87%, 99%) (Fig. 5; Additional file 19: Text S7 and Additional file 20: Text S8). The molecular basis of these findings was ascertained by examining CBM49 (class C) and CBM49-like (classes A and B) sequences of vascular land plants for the presence of concomitant transmembrane and signal peptide regions (Table 5, Fig. 4a; Additional file 11: Table S7, Additional file 16: Text S10, Additional file 14: Text S11 and Additional file 15: Text S12). The MEMSAT-SVM data clearly suggest that all classes of GH9 endoglucanase sequences possess distinct high- (transmembrane; n_LPSA ≈ 96%, n_LPSB ≈ 83%, n_LPSC ≈ 80%) or low-scoring (pore-lining; n_LPSA ≈ 4%, n_LPSB ≈ 19%, n_LPSC ≈ 20%) helical regions, with the exception of the class B sequence (MDP0000199273), which possessed both classes of helices. Interestingly, a third class (re-entrant helical) was computed in class A members (n_LPSA = 3). When these data were combined, i.e., TM ∨ PH ∨ RH, all classes A, B, and C were shown to possess one or more TM subregions (n_LPSA = n_LPSB = n_LPSC = 100%) (Table 5). The same for the DAS-TMfilter (n_LPSA = 95%, n_LPSB = 98%, n_LPSC = 90%), and PHOBIUS (n_LPSA = 91%, n_LPSB = 4%, n_LPSC = 2%) (Table 5).

The computations also suggest a bimodal distribution of signal peptide regions ((SP−) ∧ (TM ∨ PH ∨ RH)+ := NY, (SP+) ∧ (TM ∨ PH ∨ RH)+ := YY). While the data for MEMSAT-SVM was (n_LPSB ≅ 80%, n_LPSC = 75%; YY), the same for the DAS-TMfilter was (n_LPSB ≅ 56%, n_LPSC = 86%; YY). In contrast, the data from PHOBIUS differed considerably (n_LPSB ≅ 33.3%, n_LPSC = 0; YY), and was applicable to only 3 sequences. This was primarily due to the almost complete absence of TM (n_LPSB = 4%, n_LPSC = 2%), or conversely the overwhelming presence of signal peptide regions in classes B and C (n_LPSB ≅ 96%, n_LPSC ≅ 98%) enzymes (Table 5; Additional file 11: Table S7, Additional file 16: Text S10 and Additional file 15: Text S12). However, as discussed vide supra, the corresponding results for the presence of the TM ∨ PH ∨ RH regions in class A GH9 endoglucanases predicted by MEMSAT-SVM (n_LPSA = 100%), DAS-TMfilter (n_LPSA = 95%), and PHOBIUS (n_LPSA = 91%) were almost identical (Table 5). Additionally, whilst the results from DAS-TMfilter were similar to MEMSAT-SVM, its coverage of classes B (n_LPSB = 67%) and C (n_LPSC = 51%) was suboptimal. The MEMSAT-SVM data, therefore, were deemed most appropriate for predicting the molecular events that may have occurred during the evolution of plant GH9 endoglucanases (Table 5; Additional file 11: Table S7, Additional file 15: Text S12).

Fig. 5 Insights into divergence of plant class C GH9 endoglucanases. A Bayesian inference (BI) dated tree was estimated (maximum clade credibility) from the computed tree population (n = 4837; burn-in = 70%) using the JTT + I + G amino acid substitution model. Whilst node ages (= node height = branch time of the longest diverging taxa) and branch times are in Mya, support for branch points is indicated by the posterior probabilities (PP%) and bootstrap values (n = 1000; ML%), i.e., support = PP%, ML%. The root for this tree was the parent of vascular class C land plants (201–249 Mya). The log likelihood for this tree was (≈−0.1350387). Abbreviations: BI, Bayesian inference; GH9, glycoside hydrolase; I, proportion of invariant sites; G, gamma parameter; Mya, millions of years; JTT, Jones, Taylor, and Thornton

Discussion

Evolutionary significance of crystalline cellulose digesting non plant GH9 endoglucanases

Our results on the evolution of the GH9 and CBM49 regions suggest a pyramidal model with vertical gene transfer and progressive evolution (loss or modification of function) as a plausible explanation for the emergence, occurrence, and divergence of GH9 endoglucanase activity (≈3000 Mya) (Figs. 2 and 5) [15–28, 32–34]. Conversely, since crystalline cellulose is the preferred substrate, this also implies a conserved active site architecture of the encoded protein and a correspondingly similar reaction chemistry in non-plant taxa and land plants with putative class C GH9 endoglucanase activity (Tables 1 and 4, Fig. 4a; Additional file 5: Table S1 and Additional file 8: Table S2) [7, 46, 47].

The structure of crystalline cellulose renders it resistant to alterations in temperature, salt, pH of the surrounding environment, clearly a desirable trait in archaea (methanogens) and bacteria (halophiles, thermophiles) which inhabit extreme environments such as hot springs and the oral and gastrointestinal microbiomes of several animals. Here, perhaps, the role of GH9 endoglucanases could be critical in remodelling the cell membranes, thereby maintaining intracellular homeostasis [47, 49]. Additionally, crystalline cellulose is inert, compact, and insoluble in aqueous and several organic solvents. These physicochemical properties would imply that spores and seeds made predominantly of this polymer would be resistant to desiccation and stressors such as weather fluctuations [14, 41, 43]. Clearly, protists (Dictyostelium- and Polysphondylium-spp.) and gram positive bacteria may have utilized GH9 endoglucanases to regulate the processes of sporulation, dissemination, and effective germination [14, 41, 43]. The lipopolysaccharides (complexes of crystalline cellulose with lipids) synthesized by gram negative bacteria (proteobacteria, actinobacteria) and fungi, too, could aid protection of the organism from host immune systems (phagocytosis) while concomitantly establishing an infection (Cryptococcus neoformans, Pseudomonas spp., Vibrio spp.) or infestation in developing protists and marine invertebrates [9–14, 41, 43, 48, 102–104]. Reciprocally, an interesting utility of GH9 endoglucanases is to facilitate the symbiotic/parasitic association between some fungi and bacteria of animal and plant hosts (macrophages, leguminous nodules of the rhizomes) by digesting the crystalline cellulose of the host. Thus, bacteria/fungi could secrete these enzymes and/or in association with the cellulosome could digest the cellulose and hemicellulose in root hairs and wood to extract/exchange nutrients (Laccaria bicolor, Sporisorium reilianum, Phanerochaete chrysosporium) [42, 44, 53, 54, 105–109].

Although cellulose is unequivocally inert, reports of its potential to stimulate an immune response in the host are not unknown. In fact, specialized cells in the tunics of marine invertebrates (O. dioica, S. kowalevskii, and C. intestinalis) might function as primitive phagocytes that could detect the presence of crystalline cellulose (potential pathogen, index of nutritional status) and could moderate a suitable response (adhesion to the substratum, infection by marine microbes). The ability to utilize the nutritionally superior crystalline cellulose may be an important consideration, albeit indirect, for the dominant global presence of arthropods including insects (Apis mellifera, Camponotus floridanus, Nasonia vitripennis, Nasutitermes takasagoensis), crustaceans (Daphnia pulex), and segmented worms (Additional file 5: Table S1A, Additional file 1: Text S1) [15, 50, 51, 108–114]. Since GH9 endoglucanase producing bacteria populate the microbiomes of these animals, they are able to extract glucose from diverse substrates (wood, chitooligosaccharides) and can subsist in several seemingly inhospitable environments. Additionally, and in comparison to the kingdom specific analysis (bacteria, fungi, land plants, animals) with corresponding multiple trees by previous investigators, we were able to generate a unified time tree of over 600 GH9 domain sequences spread over every major taxon (n_BAC ≈ 6.5X, n_ALS ≈ 3.4X, n_FGI ≈ 1.6X, n_LPS ≈ 4.8X), and include green algae and protists [55].

Rationale and relevance of a multimodal approach to approximating the CBM49

As discussed vide supra, the carbohydrate binding module CBM49 is unique to class C members of land plants (Fig. 3; Additional file 8: Table S2 and Additional file 4: Text S2). Our data suggests that homologous CBMs (GH9 ∧ (CBMx)_y | x ∈ {2, 3, 4, 10, 49, X}, y = {1, 2}) distributed across the length of the protein might contribute to catalysis of crystalline cellulose in bacteria (n = 37), animals (n = 18), and protists (n = 2) (Table 6; Additional file 12: Table S8). The data from the SMART server also indicated the presence of several low complexity regions, both in full length and truncated (GH9 domains) sequences. This, coupled with the sparse CBM data (<10%), prompted us to search for CBM49 spanning patterns amongst putative non class C GH9 endoglucanase sequences, reasoning that patterns with low fitness scores might constitute a superior index of approximating the CBM49. In our analysis the CBM49-approximating and low scoring p18 (Gx[DENQPST]x(2)G[LV]), p19 (Gx[ILV][WY]G[LV]), and p20 (Gx(3)G[LV]), possessed amino acids that may be both potentially catalytic and/or facilitatory. Whilst the bulky side chains of the aromatic amino acids can physically stretch the glycosidic linkage between adjacent β(D)-glucopyranose residues and weaken it several fold, amino acids with side chain functional groups (−OH, −NH2, −SH) can effect electron-proton transfers and are critical components of the catalytic machinery of any enzyme [62–64, 115]. The concomitant occurrence of these residues with the GH9, i.e., (GH9 ∧ p18) ∨ (GH9 ∧ p20), could function as an index of CBM49-presence on the GH9 domain in sequences of non class C taxa and can then be utilized to trace the origins of CBM49. The biological relevance of this approach may be gleaned by examining the correlation between the presence of aromatic amino acids which are known to influence catalysis of crystalline cellulose and the 'hits' or 'occurrences' of low strength patterns in non class C enzymes (Table 4; Additional file 9: Table S5) [62–78]. Whilst the complete absence of aromatic acids could be responsible for the generic distribution of p18 and p20 (93 ≤ n_Hits ≤ 230, full length; 98 ≤ n_Hits ≤ 315, GH9 domain), the incorporation of a single residue W/Y into p19 results in a significant reduction in its occurrence in non class C members (n_Hits = 2, full length; n_Hits = 1, GH9 domain) (Table 4, Fig. 4b; Additional file 9: Table S5).

Evolution of the CBM49 encompassing class C GH9 endoglucanases

The identification of the CBM49 as the facilitator of crystalline cellulose digestion (class C activity) in a select population of previously annotated GH9 endoglucanases in land plants raises intriguing queries with regards to the origin, subsequent divergence, and physiological relevance of substrate shuffling (amorphous, crystalline) in plant GH9 endoglucanases [6–8, 33, 34]. In the absence of an identifiable CBM49, the analysis of full length putative GH9 endoglucanase sequences with occurrences of p20 (low strength generic approximator of CBM49) might constitute a viable approach, and provide insights into the origins and subsequent divergence of CBM49 containing enzymes.

Emergence and origin of the CBM49

The influence of non-GH9 regions of the primary sequence on the catalytic spectrum of plant GH9 endoglucanases suggests that these, like the GH9, may have originated in non-plant taxa. These could include the presence of: a) homologous CBMs throughout the length of the protein sequence, and b) delocalized residue-specific activity of the GH9 domain itself. Extensive sequence analysis of full length and GH9 domain sequences of non-plant taxa reveals the presence of several regions of low complexity, along with sparsely present pre-defined CBMs (n = 57; ≅9.3%) (Table 6; Additional file 12: Table S8). The numbers notwithstanding, distinct copies of CBM2 (animals, bacteria), CBM3 (bacteria), CBM4 (animals, bacteria), CBM10 (bacteria), CBMX (bacteria), and the CBM49 (protists) itself (GH9 ∧ (CBMx)_y), have been characterized in literature with the encompassing GH9 endoglucanases exhibiting a clear preference for crystalline cellulose [39–53] (Table 6; Additional file 12: Table S8). Interestingly, the CBMs 2 and 4 of animals and bacteria were present at opposite termini of the GH9 domain. Thus, while CBM4_9 is C-terminal in animals, its position in bacteria is distinctly N-terminal, with the reverse being true for CBM2 (Additional file 12: Table S8). This mobility of CBMs across taxa suggests that either N- or C-terminal positioned CBMs could have functioned as precursors of CBM49. The length of the linker sequences exhibited considerably greater variation in non-plant taxa (27–230 aa) as compared to land plants (7–77 aa) (Additional file 5: Table S1A, Additional file 8: Table S2A and Additional file 12: Table S8). In contrast, the low strength CBM49-approximator, i.e., pattern 20, could be mapped directly onto the full length and GH9 domains (≅50%). In the presence of key aromatic and/or polar uncharged amino acids this mapping could also confer competency to digest crystalline cellulose. Whilst the exact origin of the CBM49 remains speculative, our results when combined indicate a distinct probability (>0.00) that a double ((GH9 ∧ (CBMx)_y) = {0.093} ∨ (GH9 ∧ p20) = {0.44, 0.48}) or triple event ((GH9 ∧ (CBMx)_y ∧ p20) = {0.041, 0.046}) may have resulted in the emergence of CBM49 in early land plants (Table 6; Additional file 12: Table S8).

Divergence of class C GH9 endoglucanases

The interdomain linker, a common feature between the GH9 and CBMs, is surprisingly stable and seems to have remained as such for ≅450–480 Mya. Whilst the evidence for the ancestral role of class C members of vascular land plant GH9 endoglucanases is fairly unequivocal, a clear insight into the downstream molecular events that may have occurred in their transformation to classes A and B is debatable (Figs. 5 and 6). Here too, we posited that vertical gene loss of class C GH9 endoglucanase sequences was operative and could result in the emergence of classes A (A1) and B (B1, B2) (Table 5, Fig. 6; Additional file 11: Table S7, Additional file 13: Text S9, Additional file 16: Text S10, Additional file 14: Text S11 and Additional file 15: Text S12). The extensive computational analysis conducted in this work suggests that classes B (B1, B2) and C (C1, C2) could be considered a union of two distinct groups each, a partitioning that is based on the presence or absence of a signal peptide region (Table 5, Fig. 6; Additional file 11: Table S7, Additional file 13: Text S9, Additional file 16: Text S10, Additional file 14: Text S11 and Additional file 15: Text S12).

The first model purports that the last common ancestor (LCA) of vascular plant GH9 endoglucanases were class C-like enzymes in bryophytes and early tracheophytes. Subsequent losses, in parallel, of the CBM49 could have resulted in the appearance of modern vascular equivalents (Figs. 5 and 6). This model also offers an explanation for the fewer numbers of class C members frequently observed by investigators, despite contrasting bioinformatics evidence [14, 58–60]. Indeed, this may be the route of choice for the emergence of class C (≈222 Mya; support = 100%, 99%) and classes A and B (≈114 Mya; support = 87%, 99%) (Figs. 5 and 6). Clearly, this model would mandate the presence of distinct subpopulations of the LCA, i.e., CBM49 with either TM or SP regions. Alternatively, class C GH9 endoglucanases of land plants may have been the first to emerge after the tracheophytes, whilst classes A and B evolved from them by the progressive loss of the signal peptide. This route, too, seems perfectly plausible given the presence of two distinct subpopulations of class C GH9 endoglucanases (C1, C2), with each diverging secondary to the loss of the CBM49 subregion (class C2 → class A1 ≈ class B2; n_1A) and the considerable earlier divergence of class C vascular plants (Table 5, Figs. 5 and 6). Since classes A and B in vascular land plants could originate in parallel and directly from their class C counterparts, the fewer numbers observed could simply mean fewer original class C members left as compared to class B GH9 endoglucanases. A third scenario could be the origin of later members sequentially, i.e., class C → class A → class B or class C → class B → class A (Fig. 5). Phylogenetic and sequence analysis of this dataset (n) suggests that the most probable routes were class C1 → class B1 → class A1 and/or (class C2 → class A1 ≈ class B2; n_1A) (Fig. 6).

Class C GH9 enzymes, last common ancestor of plant GH9 endoglucanases

Physiologically, the development of an intact vascular system could have brought about a paradigm shift in not just the utilization of extant endoglucanase activity, but also in the nature of cellulose itself. The introduction and persistence of water molecules between the microfibrils of cellulose could have resulted in competition for hydrogen bonds with water rather than other fibrils of cellulose. These events could have been complemented by the late

Fig. 6 Evolution, divergence, and emergence of plant class C GH9 endoglucanases. a Evolutionary theories for the emergence and divergence of classes A, B, and C plant GH9 endoglucanases. The major considerations in proposing these were data gleaned from the time trees, and analysis of the sequences for the presence and/or absence of the transmembrane, signal peptides, and the CBM49 itself. b Phylogenetic and bioinformatics analysis of full length sequences of classes A, B, and C plant endoglucanases with one or more occurrences of p20.
CBM49 attributable class C activity along with a GH9 domain was present in early land plants, bryophytes (avascular) and tracheophytes (vascular), and suggests the presence of two class C populations which may have diverged so as to result in the newer classes A and B. Abbreviations: CBM49, carbohydrate binding module; GH9, glycoside hydrolase; p20, pattern 20

emergence of the crystalline cellulose (Iα, Iβ) editing subclass A GH9 endoglucanases, and could have shifted the reaction equilibria towards the right, i.e., synthesis of amorphous cellulose (Iα am, Iβ am) [10]. These reactions can be depicted as:

The proliferation of amorphous regions would have rendered cellulose accessible and amenable to enzymatic conversion with lesser stringency. Evolutionarily, this means that the CBM49 in land plants (avascular and early vascular) despite its ancestral origins may no longer be necessary for cellulose metabolism. This in turn may have initiated a series of molecular events in extant class C endoglucanase sequences of late tracheophytes such as S. moellendorffii, and may have culminated in the divergence and subsequent appearance of late vascular GH9 endoglucanases of class C (Table 5, Fig. 6) [62–64]. The presence of the linker region, too, may have facilitated the progressive loss of CBM49 and its progressive transformation into classes A and B over ≅114 Mya (Fig. 5). Since the modified chemistry and quantity of cellulose made it amenable to rapid digestion, enzymes of classes A and B were more suited to digesting the now abundant amorphous regions of cellulose, and could utilize it as a source of carbon, as well as remodel it to effect growth, development, flowering, and germination [16, 58]. Whilst the presence of crystalline cellulose in the stems of cereal crops (Hordeum vulgare, Brachypodium distachyon, O. sativa) facilitates growth and cultivation, its secretion in the mucilage from the epidermal cells of differentiating eudicot seeds is a critical event in germination [58, 60, 116–118]. The recent divergence of land plant GH9 endoglucanases into monocots such as the cereals (O. sativa, B. distachyon, Panicum virgatum) and the asterid subdivision of the eudicots (S. tuberosum, S. lycopersicum and N. tabacum) is consistent in all classes and in both datasets (n_1A, n_3) (Table 1). These could reflect a modification of the culinary habits of a developing civilization with a desire for bulk and storage foods (Table 4). Here, too, the in situ digestion of crystalline cellulose by class C enzymes or its conversion to amorphous forms thereof, could proceed unhindered. The continuing molecular evolution of classes A and B enzymes also suggests a versatile and adaptive mechanism of action perhaps in tandem with the emergence of novel pathophysiological stimuli. The existence of high levels of mRNA of putative class C members observed from the internode regions (high cellulose content) of the developing stems of O. sativa and A. thaliana suggests that these enzymes could still be of benefit to modern land plants, as they could direct the higher affinity classes A and B enzymes to regions of growth and development, where the concentrations of cellulose would be much lower [16, 58, 60, 116–118]. The CBM49 of class C plant GH9 endoglucanases could also function as a gene/protein repository for newly emerging functions, thus justifying their title as living fossils of the plant world.

Conclusions

Our work when coupled with extant data on class C plant GH9 endoglucanases suggests that these enzymes are ancestral to classes A and B of this family. Plant GH9 endoglucanases are able to digest crystalline cellulose (class C activity) in a manner reminiscent of catalysis by bacteria, animals, protists, fungi, and archaea. Our work here suggests that the GH9 domain is relatively well conserved across taxa. We also present plausible phylogenetic time lines coupled with bioinformatics evidence that favour a vertical mode of gene evolution that may have contributed to the origin and emergence of the CBM49 between the GH9 endoglucanases of plants and non plant taxa, as well as its subsequent divergence (tracheophytes and the vascular land plants of classes A, B, and C). Finally, we review the computational evidence in context of likely physiological events that may have occurred during their divergence and evolution.

Additional files

Additional file 1: Text S1. Sequences of GH9 in all taxa (fasta). (FASTA 283 kb)
Additional file 2: Text S3. Sequences with pattern 20 across all taxa (fasta). (FASTA 197 kb)
Additional file 3: Text S4. Sequences of land plants (CBM49, pattern 20; fasta). (FASTA 114 kb)
Additional file 4: Text S2. Sequences of CBM49 in predicted class C land plants (fasta). (FASTA 11 kb)
Additional file 5: Table S1. GH9 domain based classification of taxa. (XLSX 54 kb)
Additional file 6: Table S3. Maximum likelihood based evaluation of amino acid substitution models. (XLSX 30 kb)
Additional file 7: Table S4. Posterior probabilities for parameters utilized to date GH9/CBM49 evolution across taxa. (XLSX 15 kb)
Additional file 8: Table S2. CBM49 based classification of land plants. (XLSX 22 kb)
Additional file 9: Table S5. Distribution of low strength patterns in non class C taxa. (XLSX 41 kb)
Additional file 10: Table S6. Distribution of low strength patterns in non class C land plants. (XLSX 19 kb)
Additional file 11: Table S7. Distribution of TM, SP, and CBM49 in land plants. (XLSX 22 kb)
Additional file 12: Table S8. Distribution of CBMs and low strength patterns in taxa. (XLSX 18 kb)
Additional file 13: Text S9. Distribution of low strength patterns in land plants. (TXT 84 kb)
Additional file 14: Text S11. Distribution of TM and SP in land plants (DAS-TMfilter). (TXT 36 kb)
Additional file 15: Text S12. Distribution of TM and SP in land plants (MEMSAT-SVM). (ZIP 2102 kb)
Additional file 16: Text S10. Distribution of TM and SP in land plants (PHOBIUS). (TXT 9 kb)
Additional file 17: Text S5. Maximum clade credibility tree to assess evolution of the GH9 domain. (TXT 5 kb)
Additional file 18: Text S6. Maximum likelihood estimate of branching times of GH9 evolution with bootstrapping. (PDF 49 kb)
Additional file 19: Text S7. Maximum clade credibility tree to assess divergence of the CBM49 in land plants. (TXT 2 kb)
Additional file 20: Text S8. Maximum likelihood estimate of branching times of CBM49 in land plants with bootstrapping. (PDF 10 kb)

Abbreviations

AAA: Aromatic amino acids; ALS: Animals; ANN: Artificial neural network; BAC: Bacteria; BEAST: Bayesian evolutionary analysis by sampling trees; BRY: Bryophytes; CAZy: Carbohydrate active enzymes; CBM: Carbohydrate binding module; DAS-TMfilter: Density alignment server; dbCAN: Database of carbohydrate enzymes annotated; EC: Enzyme commission; FGI: Fungi; GAL: Green algae; GH: Glycoside hydrolase; HMM: Hidden markov model; HSC: Hydrophobic side chains; Iα, Iβ: Crystalline cellulose; Iα am, Iβ am: Amorphous cellulose; LCA: Last common ancestor; LPS: Land plants; MEGA: Molecular evolutionary genetic analysis; MSA: Multiple sequence alignment; Mya: Millions of years; PCA: Polar charged acidic; PCB: Polar charged basic; PIR: Protein information server; PRATT: Pattern analysis; PRS: Protists; PUC: Polar uncharged; SMART: Simple modular architecture research tool; SP: Signal peptide; SVM: Support vector machine; TM: Trans-membrane; TRY: Tracheophytes

Acknowledgements

RS gratefully acknowledges financial support from JNU through UPE-II grant and Ramalingaswami fellowship from DBT, India. These however, had no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Funding

RS gratefully acknowledges financial support from JNU through UPE-II grant and Ramalingaswami fellowship from DBT, India. These however, had no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Availability of data and materials

Data is available as supporting material with the manuscript. The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Authors' contributions

SK outlined and designed the study, conceptualized the algorithm(s) and formulae for prediction, manually collated all the sequences and their references, carried out the computational analysis, constructed the models, formulated the filters, and wrote the manuscript. RS outlined the study and participated in manuscript discussions. All authors read and approved the final manuscript.

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Received: 24 July 2017 Accepted: 18 April 2018

References

1. Libertini E, Li Y, McQueen-Mason SJ. Phylogenetic analysis of the plant endo-beta-1,4-glucanase gene family. J Mol Evol. 2004;58(5):506–15.
2. Molhoj M, Pagant S, Hofte H. Towards understanding the role of membrane-bound endo-beta-1,4-glucanases in cellulose biosynthesis. Plant Cell Physiol. 2002;43(12):1399–406.
3. Maloney VJ, Mansfield SD. Characterization and varied expression of a membrane-bound endo-β-1,4-glucanase in hybrid poplar. Plant Biotechnol J. 2010;8(3):294–307.
4. Mansoori N, Timmers J, Desprez T, Alvim-Kamei CL, Dees DC, Vincken JP, Visser RG, Hofte H, Vernhettes S, Trindade LM. KORRIGAN1 interacts specifically with integral components of the cellulose synthase machinery. PLoS One. 2014;9(11):e112387.
5. Vain T, Crowell EF, Timpano H, Biot E, Desprez T, Mansoori N, Trindade LM, Pagant S, Robert S, Hofte H, et al. The Cellulase KORRIGAN is part of the cellulose synthase complex. Plant Physiol. 2014;165(4):1521–32.
6. Brummell DA, Bird CR, Schuch W, Bennett AB. An endo-1,4-beta-glucanase expressed at high levels in rapidly expanding tissues. Plant Mol Biol. 1997;33(1):87–95.
7. Urbanowicz BR, Catala C, Irwin D, Wilson DB, Ripoll DR, Rose JK. A tomato endo-beta-1,4-glucanase, SlCel9C1, represents a distinct subclass with a new family of carbohydrate binding modules (CBM49). J Biol Chem. 2007;282(16):12066–74.
8. Yoshida K, Imaizumi N, Kaneko S, Kawagoe Y, Tagiri A, Tanaka H, Nishitani K, Komae K. Carbohydrate-binding module of a rice endo-beta-1,4-glycanase, OsCel9A, expressed in auxin-induced lateral root primordia, is post-translationally truncated. Plant Cell Physiol. 2006;47(11):1555–71.
9. Blouzard JC, Bourgeois C, de Philip P, Valette O, Belaich A, Tardif C, Belaich JP, Pages S. Enzyme diversity of the cellulolytic system produced by Clostridium cellulolyticum explored by two-dimensional analysis: identification of seven genes encoding new dockerin-containing proteins. J Bacteriol. 2007;189(6):2300–9.
10. Mingardon F, Bagert JD, Maisonnier C, Trudeau DL, Arnold FH. Comparison of family 9 cellulases from mesophilic and thermophilic bacteria. Appl Environ Microbiol. 2011;77(4):1436–42.
11. Qi M, Jun HS, Forsberg CW. Cel9D, an atypical 1,4-beta-D-glucan glucohydrolase from Fibrobacter succinogenes: characteristics, catalytic residues, and synergistic interactions with other cellulases. J Bacteriol. 2008;190(6):1976–84.
12. Yi Z, Su X, Revindran V, Mackie RI, Cann I. Molecular and biochemical analyses of CbCel9A/Cel48A, a highly secreted multi-modular cellulase by Caldicellulosiruptor bescii during growth on crystalline cellulose. PLoS One. 2013;8(12):e84172.
13. Zhang C, Zhang W, Lu X. Expression and characteristics of a Ca(2+)-dependent endoglucanase from Cytophaga hutchinsonii. Appl Microbiol Biotechnol. 2015;99(22):9617–23.
14. Ramalingam R, Blume JE, Ennis HL. The Dictyostelium discoideum spore germination-specific cellulase is organized into functional domains. J Bacteriol. 1992;174(23):7834–7.
15. Allardyce BJ, Linton SM, Saborowski R. The last piece in the cellulase puzzle: the characterisation of beta-glucosidase from the herbivorous gecarcinid land crab Gecarcoidea natalis. J Exp Biol. 2010;213(Pt 17):2950–7.
16. Kundu S, Sharma R. In silico identification and taxonomic distribution of plant class C GH9 endoglucanases. Front Plant Sci. 2016;7:1185.
17. Domozych DS, Ciancia M, Fangel JU, Mikkelsen MD, Ulvskov P, Willats WG. The cell walls of green algae: a journey through evolution and diversity. Front Plant Sci. 2012;3:82.
18. Bell EA, Boehnke P, Harrison TM, Mao WL. Potentially biogenic carbon preserved in a 4.1 billion-year-old zircon. Proc Natl Acad Sci U S A. 2015;112(47):14518–21.
19. Noffke N, Christian D, Wacey D, Hazen RM. Microbially induced sedimentary structures recording an ancient ecosystem in the ca. 3.48 billion-year-old Dresser Formation, Pilbara, Western Australia. Astrobiology. 2013;13(12):1103–24.
20. Schopf JW. Fossil evidence of Archaean life. Philos Trans R Soc Lond Ser B Biol Sci. 2006;361(1470):869–85.
21. Bengtson S, Belivanova V, Rasmussen B, Whitehouse M. The controversial "Cambrian" fossils of the Vindhyan are real but more than a billion years older. Proc Natl Acad Sci U S A. 2009;106(19):7729–34.
22. Brocks JJ, Logan GA, Buick R, Summons RE. Archean molecular fossils and the early rise of eukaryotes. Science. 1999;285(5430):1033–6.
23. Peterson KJ, Butterfield NJ. Origin of the Eumetazoa: testing ecological predictions of molecular clocks against the Proterozoic fossil record. Proc Natl Acad Sci U S A. 2005;102(27):9547–52.
24. Budd GE, Butterfield NJ, Jensen S. Crustaceans and the "Cambrian explosion". Science. 2001;294(5549):2047.
25. Engel MS, Grimaldi DA. New light shed on the oldest insect. Nature. 2004;427(6975):627–30.
26. Berna L, Alvarez-Valin F. Evolutionary genomics of fast evolving tunicates. Genome Biol Evol. 2014;6(7):1724–38.
27. Erwin DH, Davidson EH. The last common bilaterian ancestor. Development. 2002;129(13):3021–32.
28. Betts MJ, Topper TP, Valentine JL, Skovsted CB, Paterson JR, Brock GA. A new early Cambrian bradoriid (Arthropoda) assemblage from the northern Flinders Ranges, South Australia. Gondwana Res. 2014;25(1):420–37.
29. Braun A, Chen J, Waloszek D, Maas A. First early Cambrian Radiolaria. Geol Soc Lond, Spec Publ. 2007;286(1):143–9.
30. Butterfield NJ. Probable Proterozoic fungi. Paleobiology. 2005;31(1):165. https://doi.org/10.1666/0094-8373.
31. Lucking R, Huhndorf S, Pfister DH, Plata ER, Lumbsch HT. Fungi evolved right on track. Mycologia. 2009;101(6):810–22.
32. Bhattacharya D. Dating algal origin using molecular clock methods. Protist. 2004;155(1):9–10.
33. Bhattacharya D, Medlin L. Algal phylogeny and the origin of land plants. Plant Physiol. 1998;116(1):9–15.
34. Gray J, Massa D, Boucot AJ. Caradocian land plant microfossils from Libya. Geology. 10:197–201. https://doi.org/10.1130/0091-7613(1982).
35. Crane PR, Herendeen P, Friis EM. Fossils and plant phylogeny. Am J Bot. 2004;91(10):1683–99.
36. Kenrick P, Crane PR. Nature. 1997;389(6646):33–9.
37. Qiu YL, Li L, Wang B, Chen Z, Knoop V, Groth-Malonek M, Dombrovska O, Lee J, Kent L, Rest J, et al. The deepest divergences in land plants inferred from phylogenetic evidence. Proc Natl Acad Sci U S A. 2006;103(42):15511–6.
38. Chaw SM, Chang CC, Chen HL, Li WH. Dating the monocot-dicot divergence and the origin of core eudicots using whole chloroplast genomes. J Mol Evol. 2004;58(4):424–41.
39. Gandolfo MA, Nixon KC, Crepet WL. Triuridaceae fossil flowers from the Upper Cretaceous of New Jersey. Am J Bot. 2002;89(12):1940–57.
40. Gandolfo MA, Nixon KC, Crepet WL, Stevenson DW, Friis EM. Nature. 1998;394(6693):532–3.
41. Blume JE, Ennis HL. A Dictyostelium discoideum cellulase is a member of a spore germination-specific gene family. J Biol Chem. 1991;266(23):15432–7.
42. del Campillo E, Gaddam S, Mettle-Amuah D, Heneks J. A tale of two tissues: AtGH9C1 is an endo-beta-1,4-glucanase involved in root hair and endosperm development in Arabidopsis. PLoS One. 2012;7(11):e49363.
43. Ficko-Blean E, Boraston AB. The interaction of a carbohydrate-binding module from a Clostridium perfringens N-acetyl-beta-hexosaminidase with its carbohydrate receptor. J Biol Chem. 2006;281(49):37748–57.
44. Goellner M, Wang X, Davis EL. Endo-beta-1,4-glucanase expression in compatible plant-nematode interactions. Plant Cell. 2001;13(10):2241–55.
45. Matthysse AG, Deschet K, Williams M, Marry M, White AR, Smith WC. A functional cellulose synthase from ascidian epidermis. Proc Natl Acad Sci U S A. 2004;101(4):986–91.
46. McLean BW, Bray MR, Boraston AB, Gilkes NR, Haynes CA, Kilburn DG. Analysis of binding of the family 2a carbohydrate-binding module from Cellulomonas fimi xylanase 10A to cellulose: specificity and identification of functionally important amino acid residues. Protein Eng. 2000;13(11):801–9.
47. Boraston AB, Bolam DN, Gilbert HJ, Davies GJ. Carbohydrate-binding modules: fine-tuning polysaccharide recognition. Biochem J. 2004;382(Pt 3):769–81.
engineered cellulose-binding domains of cellobiohydrolase I from Trichoderma reesei. Protein Sci. 1997;6(2):294–303.
48. O'Meara TR, Alspaugh JA.
The Cryptococcus neoformans capsule: a sword 70. Morrill J, Kulcinskaja E, Sulewska AM, Lahtinen S, Stalbrand H, and a shield. Clin Microbiol Rev. 2012;25(3):387–408. Svensson B, Abou Hachem M. The GH5 1,4-beta-mannanase from Bifidobacterium animalis subsp. lactis Bl-04 possesses a low-affinity 49. Gao B, Gupta RS. Phylogenomic analysis of proteins that are distinctive of mannan-binding module and highlights the diversity of mannanolytic archaea and its main subgroups and the origin of methanogenesis. BMC enzymes. BMC Biochem. 2015;16:26. Genomics. 2007;8:86. 71. Nishijima H, Nozaki K, Mizuno M, Arai T, Amano Y. Extra tyrosine in the 50. Dehal P, Satou Y, Campbell RK, Chapman J, Degnan B, De Tomaso A, carbohydrate-binding module of Irpex lacteus Xyn10B enhances its Davidson B, Di Gregorio A, Gelpke M, Goodstein DM, et al. The draft cellulose-binding ability. Biosci Biotechnol Biochem. 2015;79(5):738–46. genome of Ciona intestinalis: insights into chordate and vertebrate origins. Science. 2002;298(5601):2157–67. 72. Parsiegla G, Reverbel-Leroy C, Tardif C, Belaich JP, Driguez H, Haser R. 51. Linton SM, Greenaway P, Towle DW. Endogenous production of endo-beta- Crystal structures of the cellulase Cel48F in complex with inhibitors and 1,4-glucanase by decapod crustaceans. J Comp Physiol B. 2006;176(4):339–48. substrates give insights into its processive action. Biochemistry. 2000; 52. Lo N, Watanabe H, Sugimura M. Evidence for the presence of a cellulase 39(37):11238–46. gene in the last common ancestor of bilaterian animals. Proc Biol Sci. 2003; 73. Simpson HD, Barras F. Functional analysis of the carbohydrate-binding 270(Suppl 1):S69–72. domains of Erwinia chrysanthemi Cel5 (endoglucanase Z) and an Escherichia coli putative chitinase. J Bacteriol. 1999;181(15):4611–6. 53. Scholl EH, Thorne JL, McCarter JP, Bird DM. Horizontally transferred genes in 74. Simpson PJ, Xie H, Bolam DN, Gilbert HJ, Williamson MP. 
The structural basis plant-parasitic nematodes: a high-throughput genomic approach. Genome for the ligand specificity of family 2 carbohydrate-binding modules. J Biol Biol. 2003;4(6):R39. Chem. 2000;275(52):41137–42. 54. Smant G, Stokkermans JP, Yan Y, de Boer JM, Baum TJ, Wang X, Hussey RS, 75. Strobel KL, Pfeiffer KA, Blanch HW, Clark DS. Structural insights into Gommers FJ, Henrissat B, Davis EL, et al. Endogenous cellulases in animals: the affinity of Cel7A carbohydrate-binding module for lignin. J Biol isolation of beta-1, 4-endoglucanase genes from two species of plant- Chem. 2015;290(37):22818–26. parasitic cyst nematodes. Proc Natl Acad Sci U S A. 1998;95(9):4906–11. 55. Davison A, Blaxter M. Ancient origin of glycosyl hydrolase family 9 cellulase 76. Taylor CB, Talib MF, McCabe C, Bu L, Adney WS, Himmel ME, Crowley MF, genes. Mol Biol Evol. 2005;22(5):1273–84. Beckham GT. Computational investigation of glycosylation effects on a 56. Salzberg SL, White O, Peterson J, Eisen JA. Microbial genes in the human family 1 carbohydrate-binding module. J Biol Chem. 2012;287(5):3147–55. genome: lateral transfer or gene loss? Science. 2001;292(5523):1903–6. 77. YanivO,Petkun S,Shimon LJ, Bayer EA, LamedR,FrolowF.Asingle mutation reforms the binding activity of an adhesion-deficient family 57. Stanhope MJ, Lupas A, Italia MJ, Koretke KK, Volker C, Brown JR. 3 carbohydrate-binding module. Acta Crystallogr D Biol Crystallogr. Phylogenetic analyses do not support horizontal gene transfers from 2012;68(Pt 7):819–28. bacteria to vertebrates. Nature. 2001;411(6840):940–4. 78. Zhang C, Wang Y, Li Z, Zhou X, Zhang W, Zhao Y, Lu X. Characterization of 58. Buchanan M, Burton RA, Dhugga KS, Rafalski AJ, Tingey SV, Shirley NJ, Fincher a multi-function processive endoglucanase CHU_2103 from Cytophaga GB. Endo-(1,4)-beta-glucanase gene families in the grasses: temporal and hutchinsonii. Appl Microbiol Biotechnol. 2014;98(15):6679–87. spatial co-transcription of orthologous genes. 
BMC Plant Biol. 2012;12:235. 79. Henrissat B. A classification of glycosyl hydrolases based on amino acid 59. Montanier C, Flint JE, Bolam DN, Xie H, Liu Z, Rogowski A, Weiner DP, sequence similarities. Biochem J. 1991;280(Pt 2):309–16. Ratnaparkhe S, Nurizzo D, Roberts SM, et al. Circular permutation provides an evolutionary link between two families of calcium-dependent 80. Henrissat B, Bairoch A. New families in the classification of glycosyl carbohydrate binding modules. J Biol Chem. 2010;285(41):31742–54. hydrolases based on amino acid sequence similarities. Biochem J. 1993; 293(Pt 3):781–8. 60. Xie G, Yang B, Xu Z, Li F, Guo K, Zhang M, Wang L, Zou W, Wang Y, Peng L. 81. Yin Y, Mao X, Yang J, Chen X, Mao F, Xu Y. dbCAN: a web resource for Global identification of multiple OsGH9 family members and their involvement in cellulose crystallinity modification in rice. PLoS One. 2013;8(1):e50171. automated carbohydrate-active enzyme annotation. Nucleic Acids Res. 2012; 40(Web Server issue):W445–51. 61. Lopez-Casado G, Urbanowicz BR, Damasceno CM, Rose JK. Plant glycosyl 82. Kumar S, Stecher G, Tamura K. MEGA7: molecular evolutionary genetics hydrolases and biofuels: a natural marriage. Curr Opin Plant Biol. 2008;11(3):329–37. analysis version 7.0 for bigger datasets. Mol Biol Evol. 2016;33(7):1870–4. 62. Alahuhta M, Xu Q, Bomble YJ, Brunecky R, Adney WS, Ding SY, Himmel ME, 83. Letunic I, Doerks T, Bork P. SMART: recent updates, new developments and Lunin VV. The unique binding mode of cellulosomal CBM4 from Clostridium status in 2015. Nucleic Acids Res. 2015;43(Database issue):D257–60. thermocellum cellobiohydrolase a. J Mol Biol. 2010;402(2):374–87. 63. Duan CJ, Feng YL, Cao QL, Huang MY, Feng JX. Identification of a novel 84. Schultz J, Milpetz F, Bork P, Ponting CP. SMART, a simple modular family of carbohydrate-binding modules with broad ligand specificity. Sci architecture research tool: identification of signaling domains. Proc Natl Rep. 2016;6:19392. 
Acad Sci U S A. 1998;95(11):5857–64. 85. Henikoff S, Henikoff JG. Amino acid substitution matrices from protein 64. Prates ET, Stankovic I, Silveira RL, Liberato MV, Henrique-Silva F, Pereira N, Jr., blocks. Proc Natl Acad Sci U S A. 1992;89(22):10915–9. Polikarpov I, Skaf MS: X-ray structure and molecular dynamics simulations of 86. Styczynski MP, Jensen KL, Rigoutsos I, Stephanopoulos G. BLOSUM62 endoglucanase 3 from Trichoderma harzianum: structural organization and miscalculations improve search performance. Nat Biotechnol. 2008;26(3):274–5. substrate recognition by endoglucanases that lack cellulose binding 87. Bouckaert R, Heled J, Kuhnert D, Vaughan T, Wu CH, Xie D, Suchard MA, module. PLoS One 2013, 8(3):e59069. Rambaut A, Drummond AJ. BEAST 2: a software platform for Bayesian 65. Boraston AB, Nurizzo D, Notenboom V, Ducros V, Rose DR, Kilburn DG, evolutionary analysis. PLoS Comput Biol. 2014;10(4):e1003537. Davies GJ. Differential oligosaccharide recognition by evolutionarily- related beta-1,4 and beta-1,3 glucan-binding modules. J Mol Biol. 2002; 88. Drummond AJ, Ho SY, Phillips MJ, Rambaut A. Relaxed phylogenetics and 319(5):1143–56. dating with confidence. PLoS Biol. 2006;4(5):e88. 66. Charnock SJ, Bolam DN, Nurizzo D, Szabo L, McKie VA, Gilbert HJ, Davies GJ. 89. Drummond AJ, Suchard MA, Xie D, Rambaut A. Bayesian phylogenetics with Promiscuity in ligand-binding: the three-dimensional structure of a BEAUti and the BEAST 1.7. Mol Biol Evol. 2012;29(8):1969–73. Piromyces carbohydrate-binding module, CBM29-2, in complex with cello- 90. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, and mannohexaose. Proc Natl Acad Sci U S A. 2002;99(22):14077–82. Remmert M, Soding J, et al. Fast, scalable generation of high-quality protein 67. Crennell SJ, Cook D, Minns A, Svergun D, Andersen RL, Nordberg Karlsson E. multiple sequence alignments using Clustal omega. Mol Syst Biol. 2011;7:539. 
Dimerisation and an increase in active site aromatic groups as adaptations 91. Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: a sequence logo to high temperatures: X-ray solution scattering and substrate-bound crystal generator. Genome Res. 2004;14(6):1188–90. structures of Rhodothermus marinus endoglucanase Cel12A. J Mol Biol. 92. Schneider TD, Stephens RM. Sequence logos: a new way to display 2006;356(1):57–71. consensus sequences. Nucleic Acids Res. 1990;18(20):6097–100. 68. Kim SJ, Kim SH, Shin SK, Hyeon JE, Han SO. Mutation of a conserved 93. Jonassen I, Collins JF, Higgins DG. Finding flexible patterns in unaligned tryptophan residue in the CBM3c of a GH9 endoglucanase inhibits activity. protein sequences. Protein Sci. 1995;4(8):1587–95. Int J Biol Macromol. 2016;92:159–66. 94. Buchan DW, Minneci F, Nugent TC, Bryson K, Jones DT. Scalable web 69. Mattinen ML, Kontteli M, Kerovuo J, Linder M, Annila A, Lindeberg G, services for the PSIPRED protein analysis workbench. Nucleic Acids Res. Reinikainen T, Drakenberg T. Three-dimensional structures of three 2013;41(Web Server issue):W349–57. Kundu and Sharma BMC Evolutionary Biology (2018) 18:79 Page 19 of 19 95. Cserzo M, Eisenhaber F, Eisenhaber B, Simon I. On filtering false positive transmembrane protein predictions. Protein Eng. 2002;15(9):745–52. 96. Cserzo M, Wallin E, Simon I, von Heijne G, Elofsson A. Prediction of transmembrane alpha-helices in prokaryotic membrane proteins: the dense alignment surface method. Protein Eng. 1997;10(6):673–6. 97. Jones DT. Improving the accuracy of transmembrane protein topology prediction using evolutionary information. Bioinformatics. 2007;23(5):538–44. 98. Jones DT, Taylor WR, Thornton JM. A model recognition approach to the prediction of all-helical membrane protein structure and topology. Biochemistry. 1994;33(10):3038–49. 99. Kall L, Krogh A, Sonnhammer EL. Advantages of combined transmembrane topology and signal peptide prediction–the Phobius web server. 
Nucleic Acids Res. 2007;35(Web Server issue):W429–32. 100. Nugent T, Jones DT. Transmembrane protein topology prediction using support vector machines. BMC Bioinformatics. 2009;10:159. 101. Kall L, Krogh A, Sonnhammer EL. A combined transmembrane topology and signal peptide prediction method. J Mol Biol. 2004;338(5):1027–36. 102. Nicodeme P. Fast approximate motif statistics. J Comput Biol. 2001;8(3):235–48. 103. Nasser W, Santhanam B, Miranda ER, Parikh A, Juneja K, Rot G, Dinh C, Chen R, Zupan B, Shaulsky G, et al. Bacterial discrimination by dictyostelid amoebae reveals the complexity of ancient interspecies interactions. Curr Biol. 2013;23(10):862–72. 104. Sanders D, Borys KD, Kisa F, Rakowski SA, Lozano M, Filutowicz M. Multiple Dictyostelid species destroy biofilms of Klebsiella oxytoca and other gram negative species. Protist. 2017;168(3):311–25. 105. Dashtban M, Schraft H, Qin W. Fungal bioconversion of lignocellulosic residues; opportunities & perspectives. Int J Biol Sci. 2009;5(6):578–95. 106. Ghareeb H, Becker A, Iven T, Feussner I, Schirawski J. Sporisorium reilianum infection changes inflorescence and branching architectures of maize. Plant Physiol. 2011;156(4):2037–52. 107. Hilden L, Daniel G, Johansson G. Use of a fluorescence labelled, carbohydrate-binding module from Phanerochaete chrysosporium Cel7D for studying wood cell wall ultrastructure. Biotechnol Lett. 2003;25(7):553–8. 108. Martin F, Aerts A, Ahren D, Brun A, Danchin EG, Duchaussoy F, Gibon J, Kohler A, Lindquist E, Pereda V, et al. The genome of Laccaria bicolor provides insights into mycorrhizal symbiosis. Nature. 2008;452(7183):88–92. 109. Sims PF, Soares-Felipe MS, Wang Q, Gent ME, Tempelaars C, Broda P. Differential expression of multiple exo-cellobiohydrolase I-like genes in the lignin-degrading fungus Phanerochaete chrysosporium. Mol Microbiol. 1994; 12(2):209–16. 110. Sagane Y, Zech K, Bouquet JM, Schmid M, Bal U, Thompson EM. 
Functional specialization of cellulose synthase genes of prokaryotic origin in chordate larvaceans. Development. 2010;137(9):1483–92. 111. Di Bella MA, Fedders H, De Leo G, Leippe M. Localization of antimicrobial peptides in the tunic of Ciona intestinalis (Ascidiacea, Tunicata) and their involvement in local inflammatory-like reactions. Results Immunol. 2011;1(1):70–5. 112. Fischer R, Ostafe R, Twyman RM. Cellulases from insects. Adv Biochem Eng Biotechnol. 2013;136:51–64. 113. Grell MN, Linde T, Nygaard S, Nielsen KL, Boomsma JJ, Lange L. The fungal symbiont of Acromyrmex leaf-cutting ants expresses the full spectrum of genes to degrade cellulose and other plant cell wall polysaccharides. BMC Genomics. 2013;14:928. 114. Khademi S, Guarino LA, Watanabe H, Tokuda G, Meyer EF. Structure of an endoglucanase from termite, Nasutitermes takasagoensis. Acta Crystallogr D Biol Crystallogr. 2002;58(Pt 4):653–9. 115. Kundu S. Distribution and prediction of catalytic domains in 2-oxoglutarate dependent dioxygenases. BMC Res Notes. 2012;5:410. 116. Matos DA, Whitney IP, Harrington MJ, Hazen SP. Cell walls and the developmental anatomy of the Brachypodium distachyon stem internode. PLoS One. 2013;8(11):e80640. 117. Sullivan S, Ralet MC, Berger A, Diatloff E, Bischoff V, Gonneau M, Marion-Poll A, North HM. CESA5 is required for the synthesis of cellulose with a role in structuring the adherent mucilage of Arabidopsis seeds. Plant Physiol. 2011; 156(4):1725–39. 118. Tan HT, Shirley NJ, Singh RR, Henderson M, Dhugga KS, Mayo GM, Fincher GB, Burton RA. Powerful regulatory systems and post-transcriptional gene silencing resist increases in cellulose content in cell walls of barley. BMC Plant Biol. 2015;15:62.

Journal

BMC Evolutionary Biology, Springer Journals

Published: May 30, 2018
