A data-mining approach to spacer oligonucleotide typing of Mycobacterium tuberculosis

M. Sebban; I. Mokrousov; N. Rastogi; C. Sola

doi:10.1093/bioinformatics/18.2.235

BIOINFORMATICS Vol. 18 no. 2 2002, Pages 235–243 (2002-02-01)

M. Sebban (1), I. Mokrousov (2), N. Rastogi (2) and C. Sola (2,*)

(1) French West Indies and Guiana University, TRIVIA, Department of Mathematics and Computer Science, Campus Fouillole, 97159 Pointe-à-Pitre Cedex, Guadeloupe and (2) Unité de la Tuberculose et des Mycobactéries, Institut Pasteur de Guadeloupe, BP 484, F-97165 Pointe-à-Pitre Cedex, Guadeloupe

Received on February 9, 2001; revised on June 25, 2001; accepted on October 9, 2001

(*) To whom correspondence should be addressed.

© Oxford University Press 2002

ABSTRACT

Motivation: The Direct Repeat (DR) locus of Mycobacterium tuberculosis is a suitable model to study (i) molecular epidemiology and (ii) the evolutionary genetics of tuberculosis. This is achieved by a DNA analysis technique (genotyping) called spacer oligonucleotide typing (spoligotyping). In this paper, we investigated data analysis methods to discover intelligible knowledge rules from spoligotyping, an approach that has not yet been applied to such a representation. This processing was achieved by applying the C4.5 induction algorithm, and knowledge rules were produced. Finally, a Prototype Selection (PS) procedure was applied to eliminate noisy data. This simplified both the decision rules and the number of spacers to be tested to solve classification tasks. In the second part of this paper, the contribution of 25 new additional spacers and the knowledge rules inferred were studied from a machine learning point of view. From a statistical point of view, the correlations between spacers were analyzed and suggested that both negative and positive ones may be related to potential structural constraints within the DR locus that may shape its evolution directly or indirectly.

Results: By generating knowledge rules induced from decision trees, it was shown that the expert knowledge may not only be modeled but also improved and simplified to solve automatic classification tasks on unknown patterns. A practical consequence of this study may be a simplification of the spoligotyping technique, resulting in a reduction of the experimental constraints and an increase in the number of samples processed.

Contact: csola@pasteur.gp

1 INTRODUCTION

As a result of increased human population migrations, the spread of infectious diseases may have global consequences. To avoid this, emerging outbreaks must be detected before they reach the epidemic stage (Grein et al., 2000). In this context, tuberculosis remains the leading cause of death by an infectious disease, and its control relies both on improved drug availability and on improved diagnostic abilities. Among new diagnostic tools, which have led to a better understanding of global tuberculosis epidemiology, DNA fingerprinting and the building of genotyping databases represent a new and powerful way to analyze tuberculosis transmission (van Embden and van Soolingen, 2000). Nevertheless, when handling a large amount of data, especially binary data, the astonishing ability of the human brain to intuitively separate noisy from significant information is most efficiently challenged by computers, which may easily learn and reproduce some human tasks such as classification or similarity analysis.

The tuberculosis agent belongs to the Mycobacterium tuberculosis complex, which may be split into various subclasses. With overlapping yet distinct epidemiologies, these include M. tuberculosis sensu stricto, Mycobacterium bovis, Mycobacterium africanum and two other less investigated subspecies, M. microti and M. canetti, which will not be further discussed here. Among DNA fingerprinting studies, spacer oligonucleotide typing, or spoligotyping, has been applied to characterize these subtypes and has gained increased international acceptance because it may be both rapid and easily applied as a first-line discriminatory test (Kremer et al., 1999). It is based on the properties of the Direct Repeat (DR) locus of the M. tuberculosis complex genome, one of the best known polymorphic loci of this pathogenic agent (Hermans et al., 1991; Groenen et al., 1993; van Embden et al., 2000). It is also a suitable genetic model to study recombination since it shows an extensive strain-to-strain polymorphism and may be used both for molecular epidemiological studies (Kamerbeek et al., 1997) and for molecular evolutionary studies (Fang et al., 1999; Sola et al., 2001a) on tuberculosis. This locus is composed of multiple identical or nearly identical (differing by one or a few nucleotides) 36-base pair (bp) DR copies, which are interspersed with short and nonrepetitive inter-DR spacer sequences (between 35 and 41 bp long). The precise physiological role, if any, of this locus remains unknown. It has been suggested that a similar locus found in the Archaea Haloferax mediterranei may be involved in replicon partitioning (Mojica et al., 1995). The association of one DR and one spacer is designated Direct Variable Repeat (DVR), and strains may consequently differ by one or more discrete DVRs (Groenen et al., 1993). Alternatively, strains may sometimes also differ by long deletions of DR repeats, and IS6110-mediated transposition or homologous recombination are likely to be involved in such structural changes (Fang et al., 1999; van Embden et al., 2000). Nevertheless, the mechanisms that generate this diversity of repetitive structures are as yet poorly understood. However, it may be speculated that a combination of replication slippage, homologous recombination and insertion-sequence-mediated events are the driving forces that shape its evolution.

Based on an initial representation of 43 specific inter-DR spacers, spoligotyping, a PCR-based reverse cross-blot hybridization procedure, was invented in 1997 (Kamerbeek et al., 1997); spacers 20, 21 and 33–36 were extracted from the M. bovis BCG sequence (Hermans et al., 1991), whereas the others were extracted from the M. tuberculosis H37Rv reference strain (Groenen et al., 1993; Cole et al., 1998). Spoligotyping allows subtyping of M. tuberculosis complex strains, either directly on sputum specimens, on cultures, or even on ancient histological specimens and bone extracts. A first global phylogeographical classification of the M. tuberculosis complex by spoligotyping was recently attempted by our team, and two new phylogeographical classes were defined, the East-African Indian and the Latin American and Mediterranean clades (EA-I and LA-M respectively), which harbor specific spoligotyping signatures (Sola et al., 2001a,b). We built a first database, called DB1, composed of 7352 strains split into 342 shared types (spoligotypes common to two or more isolates) and representative of more than 60 countries. Our main aim in the first part of this paper was to generate knowledge rules that would be useful for automatically classifying new profiles. Due to the intrinsic properties of human brain functions, e.g. global memorization of various shapes and a synthetic way of treating information, human experts may have difficulty expressing their knowledge analytically in the form of formal decision rules. This is particularly true when handling binary data, i.e. spoligotyping patterns in our case. Indeed, data mining methods appear to be better suited than human observation to obtaining relevant knowledge rules for classifying patterns, a task fit for the Knowledge Discovery in Database (KDD) field. Even if deriving simple rules for classification is not a new approach, it has not yet been applied to spoligotyping of M. tuberculosis. The results obtained show that the number of spacers can be dramatically reduced for determining the clade (class) of a new profile. However, the large size of DB1, which results in the presence of noisy data (typing errors, mislabeled instances, misclassified examples, overlapping between 2 clades), made it necessary to 'clean' the database. This was achieved with an efficient data reduction technique that selects only the relevant prototypes. Such methods have been developed during the last decade to combat noisy data in modern, larger databases. Nevertheless, since Prototype Selection (PS) is not the theoretical subject of this article, interested readers are referred to a number of papers dealing with this topic (Hart, 1968; Gates, 1972; Aha et al., 1971; Wilson and Martinez, 1998). In this article, we applied to DB1 a recently published PS method (Sebban and Nock, 2000), which consists in keeping profiles representative of the probability density of each clade. This results in the deletion of border examples (overlapping) and of useless examples at the center of clades. The results obtained show that the new knowledge base may be dramatically reduced after PS without reducing the performance of the decision rules.

Lastly, we also analyzed a second database, called DB2, where 25 new spacers were added to the 43 original ones. Ten new spacer sequences discovered in an M. bovis strain (isolate 401) were published previously (Beggs et al., 1996). Recently, 41 new spacer sequences, 15 of which were retained in our study, were also described (van Embden et al., 2000). This constitutes a new representation space for spoligotyping which now includes 43 + 25, i.e. 68 spacers. The goal of this experiment was to study the contribution of the 25 new spacers on DB2, rather than to compare knowledge rules deduced from DB1 and DB2. Actually, because of the experimental difficulties, DB2 contains for the moment only 323 profiles, resulting in non-comparable knowledge bases. Preliminary results show that the decision rules obtained after the insertion of the 25 new spacers on DB2 are fewer in number than before. Finally, a correlation search between the spacers showed the existence of both negative and positive correlations between specific spacers. These results suggest that potential topological constraints on the DNA of the DR locus may directly or indirectly shape its evolution.

2 SYSTEMS AND METHODS

2.1 Oligonucleotide design, 68-dimensional spoligotypes

The original set of 43 spacers initially published (Kamerbeek et al., 1997) was recently modified (van Embden et al., 2000). Parameters of the spoligotyping technique were kept unchanged. The first and second additional sets of oligonucleotides (25mer) contained 10 (set A) and 15 (set B) oligonucleotides and were respectively selected according to published sequences (Beggs et al., 1996; van Embden et al., 2000). In each case, the best 25mer was selected within the DR spacer sequence with the software 'Primers for the Mac' (v1.0a Apple Pi, Ashland, MA). Preliminary experiments for each oligonucleotide set (12.5, 25, 50 and 100 pmol/150 µl for set A, and 12.5, 40, 100 pmol/150 µl for set B), including adequate positive and negative control strains, were performed to optimize hybridization conditions. Final membrane preparation was performed using these optimized concentrations. Oligonucleotide description and concentrations are available upon request to the corresponding author. DB1 is part of an ongoing population-based study which contained at the time 7352 profiles, split into 342 shared types, and which were representative of about 60 countries. A limited version of this database and the sources of the data were published recently (Sola et al., 2001a). Seven modified-spoligotyping experiments were performed and the origin of the DNAs was as follows: Japan (n = 6), Russia (n = 92), USA (n = 5), Caribbean (n = 95) and Italy (n = 12). A total of 116 profiles were retained for DB2 construction and mixed with the 207 previously published profiles (170 M. tuberculosis sensu stricto + 37 M. bovis, as described in van Embden et al., 2000). The number of profiles in DB2 totaled 323, including orphan patterns, i.e. profiles found only once.

Note: Descriptions of DB1 and DB2 are available upon request to the authors.

2.2 Notations and clades

Since we applied machine learning algorithms, the standard notations of this field for describing databases from a modeling point of view will be used. Let $(x_i, y_i)$ be an instance of the database, where $x_i$ is a p-dimensional vector and $y_i$ its class. p corresponds to the number of features characterizing the n instances of a Learning Sample (LS). The following equivalence list was established for the specific problem to be treated here:

• p is the number of spacers (features or descriptors) in each spoligotype of the databases. p = 43 for DB1 and p = 68 for DB2.

• n is the number of spoligotypes (n = 7352 for DB1 and n = 323 for DB2), i.e. the size of the LS.

• $x_i$ is the p-dimensional binary vector $(x_i^1, x_i^2, \ldots, x_i^p)$ corresponding to the p values taken by each spacer (0 or 1) for the ith instance.

• $y_i$ is the class of the spoligotype $x_i$. The 7352 instances of DB1 were clustered in 9 classes, a priori defined by the human expert, using previously published results (Kremer et al., 1999; van Embden et al., 2000; Sola et al., 2001a,b). The 7352 strains are labeled according to the criteria described below. Although the human expert may both recognize and efficiently label many specific fingerprints 'by eye', the literal expression of formal and exclusive knowledge rules compatible with automatic information treatment is currently a challenging task. Given the presence of noisy data, we will see in this manuscript that the a priori visual rules will be most efficiently challenged by the machine learning and data-mining approaches.

• Afri, for M. africanum ($n_{\text{Afri}} = 180$ strains): this subclass of the M. tuberculosis complex, as recently described in Viana-Niero et al. (2001), includes any profile where spacers 8, 9 and 39 are absent.

• T ($n_{\text{T}} = 1590$): this clade is to be split in future into various yet undefined families. It includes any strain where at least one of the spacers 1–30 is present, spacers 33–36 are simultaneously absent, spacer 31 is present, spacer 9 or 10 is present, and at least one of the spacers 21–24 is present.

• Beijing ($n_{\text{Beijing}} = 1268$): this clade (van Soolingen et al., 1995) includes any profile where spacers 1–34 are absent.

• EA-I, for East-African–Indian ($n_{\text{EAI}} = 907$): this clade (Kallenius et al., 1999; Sola et al., 2001a) includes any profile where spacers 29–32 and 34 are simultaneously absent, and at least one of the spacers 1–30 is present.

• Haarlem ($n_{\text{Haarlem}} = 1034$): this clade (Kremer et al., 1999) includes any strain where spacers 31 and 33–36 are simultaneously absent, and at least one of the spacers 1–30 is present.

• LAM-1, LAM for Latin America and Mediterranean ($n_{\text{LAM1}} = 819$): this clade (Sola et al., 2001a) includes any strain where spacers 21–24 and 33–36 are simultaneously absent, and at least one of the spacers 1–30 is present.

• LAM-2 ($n_{\text{LAM2}} = 294$): this clade is an attempt to define a new family. It includes any strain where spacers 9–10 and 33–36 are simultaneously absent, and at least one of the spacers 1–30 is present. As shown in Figures 1 and 2, subfamilies of LAM1 and LAM2 will be defined by SIPINA and were not defined by the expert. This discrepancy suggests that the visual rules defined by the expert cannot be easily modeled by SIPINA because of spoligotypes harboring large deletions, which currently jeopardizes classification of these families.
Moreover, recent yet unpublished observations suggest that there are indeed more than two, yet undefined, subclades among the LAM strains.

• X ($n_{\text{X}} = 1186$): this clade is currently found to be highly prevalent in some English-speaking countries (Soini et al., 2000, our own unpublished observations). It includes any profile where spacer 18 is missing and spacers 33–36 are absent; the missing spacer 18 may sometimes be linked to the absence of spacers 39–42.

• M. bovis ($n_{\text{M. bovis}} = 74$): this well defined subclass includes any strain where spacers 39–43 are simultaneously absent, and spacers 33–36 are simultaneously present.

Some of the clade or subclass definitions cited above may be considered as having been validated independently by other investigators (e.g. for M. africanum, Haarlem, Beijing, EA-I, M. bovis), whereas the description of others is either new and speculative (e.g. LAM-2, clade X) or remains undefined (clade T). Although the search for knowledge rules and the definition of clades are distinct processes using different software, both processes are interlinked and results in one domain may boost the knowledge in the other one. We will see that this is indeed the case and that data mining constitutes a powerful approach when handling genotyping databases.

2.3 Knowledge rules induced by decision trees

In this part, induction algorithms were applied to the original LS in order to generate decision trees. These trees are built by splitting the sample into sub-samples according to an optimized criterion (often an information measure), which selects in our case the best discriminant spacer $x^j$; each subset is then split according to the same strategy, resulting in two new subtrees, and so on, until the information gain is no longer sufficient. Once the tree is built, each path from the root to a leaf describes a decision rule (or knowledge rule), useful for understanding the model but also for labeling new examples. This type of model is very useful in fields in which the understandability of the deduced model is important. For instance, while it is not crucial to understand the reasoning of a model specialized in postal code recognition (in this case an artificial neural network can be very efficient), the understandability of an artificial model for a medical diagnosis system is obviously more important. This is the case for our problem. Decision trees are well suited, not only for providing very efficient classification models but also for describing the process from the input vector x (the spacers) to the predicted output label y (the class). In these experiments, we used the SIPINA for Windows software, developed at the ERIC laboratory of the University of Lyon 2 (http://eric.univ-lyon2.fr). This software collects the state-of-the-art tree induction algorithms, such as C4.5 (Quinlan, 1993), CART (Breiman et al., 1984), and SIPINA (Zighed and Rakotomalala, 2000).

Fig. 1. Decision tree induced from DB1 in the 43-dimensional space on 7352 strains. N and Y represent respectively the absence and the presence of the current spacer. A square represents the discriminant spacer used for splitting the current sample; a circle represents a leaf of the tree. The number below each clade definition is the success rate for classification, whereas the number on the top left of each clade represents the relative percentage of patterns found in that leaf.

Fig. 2. Decision tree induced from the subset of prototypes of DB1 in the 43-dimensional space.

2.4 Accuracy estimation by cross-validation procedure

In order to assess the predictive ability of a model, i.e. the determination of the class of a new spoligotype, a cross-validation procedure is frequently used in the computer science and statistics fields (Efron and Tibshirani, 1993). Based on the results obtained on the LS, this method provides an unbiased estimation of the theoretical generalization accuracy of the model on the whole population. The generalization accuracy is thus estimated from the information contained in the LS. The principle is as follows: the original sample is divided into f folds. A predictive model (a decision tree, in our case) is built from f − 1 subsets during a learning stage, and tested on the remaining one. The same procedure is repeated for the f permutations. The estimation of the generalization accuracy is the average of the f accuracies computed at each stage of the cross-validation. This estimation method will be used in the following experimental section.

2.5 Prototype selection

Prototype selection was historically used to improve the efficiency of the k-nearest-neighbor classifier (Hart, 1968), which classifies an unknown instance according to a local vote by its k nearest neighbors and according to a given distance function (often the Euclidean). Although its use was widespread and encouraged by early theoretical results linking its generalization error to the Bayes risk, this classifier suffers from several practical limitations (Breiman et al., 1984). Firstly, it is computationally expensive because it stores all the instances in memory. Secondly, it is intolerant of noisy instances. Thirdly, it is intolerant of irrelevant attributes, and it is sensitive to the chosen distance function. Pioneering work in PS first sought solutions only to the first two problems listed above (Hart, 1968; Gates, 1972; Aha et al., 1971). The third problem (intolerance of irrelevant attributes) is a matter of feature selection. PS algorithms usually use the Euclidean distance. This function is appropriate only if the attributes are numeric. Wilson and Martinez (1998) proposed new distance functions to handle numeric and nominal attributes.

In this paper, we used a recently proposed method, Prototype Selection using Relative Certainty Gain (PSRCG; Sebban and Nock, 2000). We investigated PS as an information preserving problem. Rather than optimizing the accuracy of a classifier, we built a statistical information criterion, the Relative Certainty Gain (RCG), based on a quadratic entropy computed from the nearest-neighbor topology. From the neighbors linked by an edge to a given instance, PSRCG computes a quadratic entropy by taking into account the label of each neighbor. From this entropy (which conveys a local uncertainty), it deduces a global uncertainty of the LS. As long as an instance deletion is statistically significant, PSRCG eliminates uninformative examples. Using k-nearest neighbors for building the neighborhood graph, PSRCG has shown a high ability for reducing the database and improving classification performance (Sebban and Nock, 2000). In the following section, we will use PSRCG as a data preprocessing step on the 7352 profiles, before constructing a new decision tree. By eliminating irrelevant strains, we aim at improving the current knowledge base induced for automatically classifying spoligotypes.

3 EXPERIMENTAL RESULTS

3.1 Knowledge discovery from DB1

In this section, we automatically generated knowledge rules and built a predictive model from DB1. We applied the C4.5 induction algorithm to generate the decision tree described in Figure 1 from the 7352 strains. As shown in this figure, each path from the root to a leaf describes a knowledge rule. For the nine defined clades, the problem may be modeled by 14 knowledge rules only. Each of the 7352 strains, a priori labeled by the expert, belongs to only one leaf. Each leaf describes a given class, according to the majority clade represented in the subset of these profiles. Actually, a given strain $x_i$ may belong to a leaf for which the majority clade is not $y_i$, i.e. the clade of $x_i$. This means that some leaves may be pure, i.e. they contain examples of a unique clade, whereas others describe a class by a relative vote. In the latter case, the minority examples represent the learning errors of the model. A learning error may either represent misclassification by the expert or be due to biological processes such as recombination or gene flows, which may lead to fuzzy classification boundaries. Indeed, a decision rule may be assessed not only according to its accuracy on the LS, but also by the number of examples which follow this rule. The accuracy of the rule is described in Figure 1 by the Confidence Interval (CI) of the proportion of correctly classified strains (value within each circle). It is actually more suitable to describe the accuracy by a CI, which takes into account the number of strains which follow the rule, than by the simple success rate. A CI, with an α risk of 5%, is defined as follows:

$$\mathrm{CI} = p \pm 1.96 \sqrt{\frac{p(1 - p)}{n}}$$

where p is the success rate and n the number of strains. This explains why there is no CI for 100% success rates, because in these cases (1 − p) = 0. The number of examples that follow the rule is represented by the strain proportion of the clade in which this conjunction of spacers is found (value on each circle). In order to compare the performance of this automatic model (decision tree) with the expert's knowledge base previously mentioned in Section 2.2, we decided to characterize the two approaches by the following criteria:

• the number of rules: it describes the complexity of the model. We take into account here the conjunctions of spacers, disjunctions corresponding to different paths in the tree;

• the mean size of the rules (number of spacers from the root to the leaf): note that between two decision trees with the same performance, we usually choose the smallest one;

• the success rate on the LS: the accuracy is obviously the most popular criterion for assessing the quality of a model;

• the number of different spacers used in the decision rule: for our specific problem, we aim at using the smallest number of spacers in order to simplify the experimental processing.

According to these criteria, the expert's knowledge base and the decision rules induced by C4.5 on DB1 are summarized in Table 1. We may also note that:

• The expert's ability to label an unknown strain is real in many cases. However, it may not be so straightforward to generate simple knowledge rules in all cases. Moreover, a knowledge base should respect a set of constraints, particularly the absence of overlap between rules. This is not the case for the rules listed above. This explains the weak average accuracy (76.6%) of the human rules in comparison with the knowledge base automatically deduced from DB1 with SIPINA (97.6%), which increases to 99% after PS. Figure 1 shows that the a priori expert clustering is validated by knowledge rules with strong CIs in most instances. In other cases, and for less well defined clades (e.g. T, X, or LAM2), the existence of secondary or tertiary rules together with new subsets (e.g. X1, X2) suggests that today's global classification of tubercle bacilli clades based on spoligotyping results remains sub-optimal and is likely to evolve in the near future.

• Except for two spacers (37 and 38), the whole information (41 spacers) contained in the spoligotypes was used by the expert to discriminate the nine clades. In contrast, only 13 spacers are used by the model automatically deduced from DB1 on the 7352 strains, and this with a higher performance. The selected spacers are: 8–10, 17–19, 22, 31, 33, 34, 36, 42, 43. In other words, presence or absence of these spacers may be considered as highly informative in the 43-dimensional space for classifying profiles. Neighborhood relationships may indicate that the recombination mechanisms do not all act similarly on each spacer. As an example, spacers 9 and 31 are well known hot spots for IS6110 transposition, an evolutionary driving force that may change the DR locus structure (van Embden et al., 2000; Filliol et al., 2000; Legrand et al., 2001; Benjamin et al., 2001).

• Finally, while an average of 25.4 spacers per rule is used by the expert for finding the label, only 5.8 are used by the induced decision tree. This result may find practical application in the near future to simplify the spoligotyping technique, once the definition of the major spoligotyping clades has been improved and minimal extra genotyping information is required to automatically label an unknown strain.

The learning accuracy on DB1 suggests that it is possible to build a predictive model which would automatically classify any new profile. To assess the generalization accuracy of this model, we carried out a 5-fold cross-validation procedure on DB1. A mean success rate of 98% with a standard deviation of 0.8% was obtained.

3.2 Prototype selection from the 7352 strains

In this section, we used the PSRCG algorithm (for more details see Sebban and Nock, 2000). This PS algorithm is based on the optimization of an information measure (a quadratic entropy). It consists in deleting the irrelevant strains and globally improves the information in the representation space. In practice, the irrelevant strains are those located at the clade boundaries, where the uncertainty is likely to be high. Moreover, strains at the center of clusters (subsets of strains of the same clade in the representation space) are also weakly relevant for classification purposes. PSRCG is particularly suited to such a context, because it does not depend on a learning algorithm and does not generate an inductive bias. In other words, the resulting subsets of prototypes may then be used by an induction algorithm to build decision trees. Among the 7352 original strains, our algorithm kept only 4014 profiles. From this subset of strains, called $\mathrm{DB1}_{\text{sub}}$, the decision tree was built using SIPINA. The new model is presented in Figure 2. As was done for the knowledge base deduced from DB1, we computed the same criteria as above on $\mathrm{DB1}_{\text{sub}}$. Results are included in Table 1. We note that the deletion of noisy data results in a decrease in the number of rules (11 versus 14) and in an increase in accuracy (99.0 versus 97.6%). Moreover, the number of spacers is dramatically reduced. Actually, nine spacers are sufficient to totally discriminate the 4014 strains (8, 9, 18, 19, 22, 23, 31, 36, 43), eight of which were already present in the decision tree built from DB1. In the new tree shown in Figure 2, a closer look at the subsets of the LAM clades (designated LAM11, LAM12 and LAM13 in that tree) shows that this splitting represents left (LAM12) or right (LAM11) entire deletions of the DR locus, whereas LAM13 represents 'true' LAM as defined above. The full significance of this result requires further investigation.

Table 1. Results according to four performance criteria

                                      Rules deduced from
Criterion                 Expert      DB1       DB1_sub
Number of rules           17.0        14.0      11.0
Mean size of the rules    25.4        5.8       3.7
Accuracy (%)              76.6        97.6      99.0
Number of spacers         41.0        13.0      9.0

3.3 Contribution of the 25 additional spacers

The database DB2 was constructed by assembling previously published profiles (n = 207) and our own in-house generated experimental profiles (n = 116). Since M. africanum strains have not yet been tested for the 25 additional spacers, the results described in this section may unfortunately not be applied to M. africanum. Besides, since DB2 is still limited in size, our aim is not to compare knowledge discovered in DB1 and DB2, but rather in DB2 before and after insertion of the 25 new spacers. Actually, the comparison between the two representation spaces (43 and 68) must be carried out on the same LS. We obtained two new decision trees, which are respectively shown in Figures 3 and 4. In order to facilitate the comparison, we now use as spacer labels their new position in the spoligotype represented by 68 features (for instance, spacer 21 is the 21st value in the 68-dimensional representation, and not the 21st in the 43-dimensional space). The renumbering of the spacers was previously done by van Embden et al. (2000), because of the conserved order of the spacers among different isolates of M. tuberculosis. Consequently, the numbers reflect the genetic organization in the genome. Among the 11 rules, 7 are very relevant, and describe classes Haarlem, M. bovis, EAI1, Beijing, LAM1, T1 and LAM2 almost perfectly. The 4 others are either weakly accurate (T2, X) or have weak influence (T3, EAI2). Two spacers, and consequently the related DVRs (31 and 47), have been shown to harbor point mutations in the DR repeat itself (van Embden et al., 2000). This is a striking result that could be explained by the presence/absence of a single nucleotide polymorphism in the various clades described.

Comparing the first and the second decision trees, the following two observations may be made:

(1) The number of knowledge rules is reduced in the 68-dimensional space. 11 rules (with 10 spacers) were inferred in the 43-dimensional space for discriminating the 323 profiles into the 8 classes, whereas 9 spacers and 10 rules were required for reaching the same objective in the 68-dimensional space. Theoretically, a smaller model is desirable because it will not suffer from overfitting of the learning data, and will be more reliable for classifying new unknown individuals.

(2) The reduction of the tree size is explained by the presence of the new spacer 1, which makes it possible to totally discriminate the right side of the tree. On the other hand, spacers 26 and 43 were required in the first decision tree. Thus, the presence of spacer 1 now makes it possible to remove an irrelevant rule.

The contribution of the 25 additional spacers may also be assessed through a statistical study. We performed a principal component analysis on the two datasets, in order to compare the part of the total variance explained by the two main factorial axes. While this percentage of the total variance was around 53% when mapping individuals of DB2 from a 43-dimensional space to a planar representation, the part of the variance explained by the two axes was improved by the addition of the 25 new spacers. Actually, about 60% of the total variance was explained by such a representation, which confirms the contribution of these 25 new spacers. Lastly, by using this data analysis method, a correlation circle between spacers was drawn (see Figure 5). In this figure, the correlation level between two spacers is measured by the value of the angle between two points. High positive correlation between spacers is shown by an acute angle. Inversely, a negative correlation is shown as an obtuse angle and no correlation between spacers by a right angle. Figure 5 shows a very high negative correlation between spacers 41 and 47. This means that the presence of spacer 41 is almost always accompanied by the absence of spacer 47, and vice versa. This knowledge is refuted for only 12 individuals among 323, which are all characterized by a large deletion in the locus, likely to be linked to IS6110-mediated deletion events (van Embden et al., 2000). In this context, it is striking that the presence of an inverted IS6110 copy in the Beijing clade occurs precisely between these two spacers. This very same DVR 47 also harbors point mutations in the DR repeat in all the M. bovis strains investigated so far (van Embden et al., 2000). We also found a positive correlation between spacers 1, 21, 28, 29, 31 and 35, but these spacers are independent of spacers 41 and 47. The significance of this correlation remains to be investigated.

4 CONCLUSION

In this paper, we applied data mining methods to generate knowledge rules for solving classification tasks from existing or specifically created databases. To the best of our knowledge, this study is a first attempt to automatically discover simple knowledge rules from spoligotyping data. Amazingly, the generated rules are different from and simpler than those previously defined by the expert. A possible explanation of this phenomenon may come from the fact that decision trees use pruning to avoid overfitting (which results in deep trees), whereas the expert takes into account all data as signals, i.e. without differentiating noisy from unnoisy data. This is especially true for spoligotyping, where $2^{43}$ (and now $2^{68}$) combinations are theoretically

Fig. 5. Correlation circle from the eight spacers used in the decision tree.

berculosis stricto sensu into the 8 or 9 validated clades. This would represent a first level of sub-classification for

Fig. 3.
Decision tree induced from DB2 in the 43-dimensional space epidemiological purposes, an application that deserves after removing the 25 additional spacers. further investigation. Future investigations will intend to study other M. tuberculosis representations, such as the N Variable Number of Tandem DNA Repeats (VNTRs) or the Mycobacterial Interspersed Repetitive Units (MIRUs; Frothingham and Meeker-O’Connell, 1998; Supply et al., M.Bovis N Y 47 100 2000). Data-mining methods will undoubtedly allow to distinguish useful information from noisy data obtained 1 by these markers. 95.2 100 ACKNOWLEDGEMENTS N Y Beijing EAI N Y 35 31 This work was supported by the Del ´ egation ´ Gen ´ erale ´ 100 100 au Reseau ´ International des Instituts Pasteur et Instituts Associes ´ and the Fondation Raoul Follereau. We are also 14.9 Haarlem T2 very grateful to Dr C.Mammina, Italy, Dr B.Vishnevsky Y 28 + 100 76.9 11.7 and Dr T.Otten, Russia and Dr H.Kasai, Japan, for N Y providing some DNAs of M. tuberculosis clinical isolates. 7.5 71.6 T1 T3 X 21 + + 98.O 2 73.3 11.4 − 100 − REFERENCES Aha,D., Kibler,K. and Albert,M. (1971) Instance-based learning 96.6 66.7 algorithms. Mach. Learn., 6,37–66. LAM1 LAM2 Beggs,M.L., Cave,M.D., Marlowe,C., Cloney,L., Duck,P. and Eise- nach,K.D. (1996) Characterization of Mycobacterium tuberculo- sis complex direct repeat sequences for use in cycling probe re- Fig. 4. Decision tree induced from DB2 in the 68-dimensional space actions. J. Clin. Microbiol., 34, 2985–2989. by the C4.5 induction algorithm. Benjamin,W.H. Jr., Lok,K.H., Harris,R., Brook,N., Bond,L., Mulc- ahy,D., Robinson,N., Pruitt,V., Kirkpatrick,D.P., Kimerling,M.E. and Dunlap,N.E. (2001) Identiﬁcation of a contaminating My- cobacterium tuberculosis strain with a transposition of an likely. A direct consequence of these simpler rules may IS6110 insertion element resulting in an altered spoligotype. J. be a change in the laboratory experimental setup by Clin. Microbiol., 39, 1092–1096. 
eliminating the use of uninformative spacers. Another Breiman,L., Friedman,J.H., Olshen,R.A. and Stone,C.J. (1984) consequence may be that uninformative spacers may be Classiﬁcation and Regression Trees. Chapman and Hall, downweighted or excluded when dealing with phylogeny Wadsworth, CA. reconstruction using parsimony principles. The described Cole,S.T., Brosch,R., Parkhill,J., Garnier,T., Churcher,C., Harris,D. process could also allow to automatically classify M. tu- and Gordon,S.V. et al. (1998) Deciphering the biology of My- 242 Data mining and spoligotyping of M. tuberculosis cobacterium tuberculosis from the complete genome sequence. Riley,L.W., Yakrus,M.A., Musser,J.M. and van Embden,J.D.A. Nature, 393, 537–544. (1999) Comparison of methods based on different molecular epi- van Embden,J.D.A. and van Soolingen,D. (2000) Molecular epi- demiologial markers for typing of Mycobacterium tuberculosis demiology of tuberculosis: coming of age. Int. J. Tuberc. Lung. strains: interlaboratory study of discriminatory power and repro- Dis., 4, 285–286. ducibility. J. Clin. Microbiol., 37, 2607–2618. van Embden,J.D.A., van Gorkom,T., Kremer,K., Jansen,R., Van der Legrand,E., Filliol,I., Sola,C. and Rastogi,N. (2001) Use of spoligo- Zeijst,B.A.M. and Schouls,L.M. (2000) Genetic variation and typing to study the evolution of the direct repeat locus by IS6110 evolutionary origin of the direct repeat locus of Mycobacterium transposition in Mycobacterium tuberculosis. J. Clin. Microbiol., tuberculosis complex bacteria. J. Bacteriol., 182, 2393–2401. 39, 1595–1599. Efron,B. and Tibshirani,R. (1993) An Introduction to the Bootstrap. Mojica,F.J., Ferrer,C., Juez,G. and Rodriguez-Valera,F. (1995) Long Chapman and Hall, London. stretches of short tandem repeats are present in the largest Fang,Z., Doig,C., Kenna,D.T., Smittipat,N., Palittapongarnpim,P., replicons of the Archaea Haloferax mediterranei and Haloferax Watt,B. and Forbes,K.J. 
(1999) IS6110-mediated deletions of volcanii and could be involved in replicon partitioning. Mol. wild-type chromosomes of Mycobacterium tuberculosis. J. Bac- Microbiol., 17,85–93. teriol., 181, 1014–1020. Quinlan,J.R. (1993) C4.5: Programs for Machine Learning. Morgan Filliol,I., Sola,C. and Rastogi,N. (2000) Detection of a previously Kaufmann, San Mateo, CA. unampliﬁed spacer within the DR locus of Mycobacterium Sebban,M. and Nock,R. (2000) Instance pruning as an information tuberculosis: epidemiological implications. J. Clin. Microbiol., preserving problem. In Proceedings of the 17th International 38, 1231–1234. Conference on Machine Learning. Stanford University, pp. 855– Frothingham,R. and Meeker-O’Connell,W.A. (1998) Genetic diver- 862. sity in the Mycobacterium tuberculosis complex based on vari- Soini,H., Pan,X., Amin,X., Graviss,E.A., Siddiqui,A. and able numbers of tandem DNA repeats. Microbiol., 144, 1189– Musser,J.M. (2000) Characterization of Mycobacterium 1196. tuberculosis isolates from patients in Houston, Texas, by spoligotyping. J. Clin. Microbiol., 38, 669–676. Gates,G.W. (1972) The reduced nearest neighbor rule. IEEE Trans. Inf. Theor., IT13, 431–433. Sola,C., Filliol,I., Gutierrez,M.C., Mokrousov,I., Vincent,V. and Grein,T.W., Kamara,K.B., Rodier,G., Plant,A.J., Ryan,M.J., Rastogi,N. (2001a) An update of the spoligotyping database of Ohyama,T. and Heymann,D.L. (2000) Rumors of disease in Mycobacterium tuberculosis: epidemiological and phylogeneti- the global village: outbreak veriﬁcation. Emerg. Infect. Dis., 6, cal perspectives. Emerg. Inf. Dis., 9, 390–396. 97–102. Sola,C., Filliol,I., Legrand,E., Mokrousov,I. and Rastogi,N. (2001b) Groenen,P.M.A., Bunschoten,A.E., van Soolingen,D. and van Em- Mycobacterium tuberculosis phylogeny reconstruction based on bden,J.D.A. (1993) Nature of DNA polymorphism in the direct combined numerical analysis with IS1081,IS6110, VNTR and repeat cluster of Mycobacterium tuberculosis. Mol. 
Microbiol., DR-based spoligotyping suggests the existence of two new 10, 1057–1065. phylogeographical clades. J. Mol. Evol., 53, 680–689. Hart,P.E. (1968) The condensed nearest neighbor rule. IEEE Trans. van Soolingen,D., Qian,L., de Haas,P.E.W., Douglas,J.T., Traore,H., Inf. Theor., IT14, 515–516. Portaels,F., Qing,H.Z., Enkhsaikan,D., Nymadawa,P. and van Hermans,P.W.M., van Soolingen,D., Bik,E.M., de Haas,P.E.W., Embden,J.D.A. (1995) Predominance of a single genotype of Dale,J.W. and van Embden,J.D.A. (1991) Insertion element Mycobacterium tuberculosis in countries of East Asia. J. Clin. IS987 from Mycobacterium bovis BCG is located in a hot- Microbiol., 33, 3234–3238. spot integration region for insertion elements in Mycobacterium Supply,P., Mazars,E., Lesjean,S., Vincent,V., Gicquel,B. and tuberculosis complex strains. Infect. Immun., 59, 2695–2705. Locht,C. (2000) Variable human minisatellite-like regions in Kallenius,G., ¨ Koivula,T., Ghebremichael,S., Hoffner,S.E., Nor- the Mycobacterium tuberculosis genome. Mol. Microbiol., 36, berg,R.E., Svensson,E., Dias,F., Marklund,B. and Svenson,S.B. 762–771. (1999) Evolution and clonal traits of Mycobacterium tuberculo- Viana-Niero,C., Gutierrez,M.C., Sola,C., Filliol,I., Boulahbal,F., sis in Guinea-Bissau. J. Clin. Microbiol., 37, 3872–3878. Vincent,V. and Rastogi,N. (2001) Genetic diversity of Mycobac- Kamerbeek,J., Schouls,L., Kolk,A., van Agterveld,M., van Soolin- terium africanum clinical isolates based on IS6110-restriction gen,D., Kuijper,S., Bunschoten,A., Molhuizen,H., Shaw,R., fragment length polymorphism analysis, spoligotyping, and vari- Goyal,M. and van Embden,J.D.A. (1997) Simultaneous detec- able number of tandem DNA. J. Clin. Microbiol., 39,57–65. Wilson,D. and Martinez,T. (1998) Reduction techniques for tion and strain differentiation of Mycobacterium tuberculosis for diagnosis and epidemiology. J. Clin. Microbiol., 35, 907–914. instances-based learning algorithms. Mach. Learn., 38, 257–286. 
Kremer,K., van Soolingen,D., Frothingham,R., Haas,W.H., Her- Zighed,D.A. and Rakotomalala,R. (2000) Graphe Induction- mans,P.W.M., Martin,C., Palittapongarnpim,P., Plikaytis,B.B., Apprentissage et Data Mining. Hermes, Paris. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press http://www.deepdyve.com/lp/oxford-university-press/a-data-mining-approach-to-spacer-oligonucleotide-typing-of-QLn6gc1UOP
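
Supplementary sketch. Section 3.3 reads the correlation circle of Figure 5 by angle (acute for positive, right for no, obtuse for negative correlation) and compares the share of variance carried by the first two factorial axes. The Python fragment below is only an illustration of those two quantities and is not the authors' software. It assumes synthetic binary data rather than DB2; the names spacer_a, spacer_b, spacer_c, explained_variance_2d and correlation_angle_deg are hypothetical. It uses the full-space relation cos(angle) = Pearson correlation, which the planar projection of Figure 5 only approximates, and a covariance-based PCA, since the text does not state whether the spacers were standardized.

```python
import numpy as np


def explained_variance_2d(X):
    """Share of the total variance carried by the first two principal axes."""
    Xc = X - X.mean(axis=0)
    # Squared singular values of the centered matrix are proportional
    # to the variance along each principal axis.
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2
    return float(var[:2].sum() / var.sum())


def correlation_angle_deg(a, b):
    """Angle (degrees) whose cosine is the Pearson correlation of a and b."""
    r = np.corrcoef(a, b)[0, 1]
    return float(np.degrees(np.arccos(np.clip(r, -1.0, 1.0))))


# Synthetic stand-in for 323 profiles (assumption, not the real DB2 data).
rng = np.random.default_rng(0)
n_profiles = 323
spacer_a = rng.integers(0, 2, size=n_profiles)
spacer_b = 1 - spacer_a                      # present exactly when spacer_a is absent
deleted = rng.choice(n_profiles, size=12, replace=False)
spacer_a[deleted] = 0                        # 12 exceptions: both spacers absent,
spacer_b[deleted] = 0                        # mimicking a large deletion
spacer_c = rng.integers(0, 2, size=n_profiles)   # unrelated spacer

X = np.column_stack([spacer_a, spacer_b, spacer_c]).astype(float)
angle_ab = correlation_angle_deg(spacer_a, spacer_b)   # obtuse: negative correlation
angle_ac = correlation_angle_deg(spacer_a, spacer_c)   # close to a right angle
share_2d = explained_variance_2d(X)

print(f"angle(a, b) = {angle_ab:.1f} degrees")
print(f"angle(a, c) = {angle_ac:.1f} degrees")
print(f"variance explained by two axes = {share_2d:.1%}")
```

With real data, X would be the 323 x 43 or 323 x 68 binary spoligotype matrix, and comparing explained_variance_2d on the two matrices would correspond to the 53% versus 60% comparison reported in Section 3.3.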


Publisher: Oxford University Press
Copyright: © Oxford University Press 2002
ISSN: 1367-4803
eISSN: 1460-2059
DOI: 10.1093/bioinformatics/18.2.235

Abstract

Vol. 18 no. 2 2002 BIOINFORMATICS Pages 235–243 A data-mining approach to spacer oligonucleotide typing of Mycobacterium tuberculosis 1 2 2 2,∗ M. Sebban , I. Mokrousov , N. Rastogi and C. Sola French West Indies and Guiana University, TRIVIA, Department of Mathematics and Computer Science, Campus Fouillole, 97159 Pointe-a-Pitr ` e Cedex, Guadeloupe and Unite ´ de la Tuberculose et des Mycobacteries, ´ Institut Pasteur de Guadeloupe, BP 484, F-97165 Pointe-a-Pitr ` e Cedex, Guadeloupe Received on February 9, 2001; revised on June 25, 2001; accepted on October 9, 2001 ABSTRACT quences. To avoid this, it is needed to detect emerging Motivation: The Direct Repeat (DR) locus of Mycobac- outbreaks before they reach epidemic stage (Grein et al., terium tuberculosis is a suitable model to study (i) molec- 2000). In this context, tuberculosis remains the leading ular epidemiology and (ii) the evolutionary genetics of tu- cause of death by an infectious disease and its control berculosis. This is achieved by a DNA analysis technique relies both on improvement of drug availability and (genotyping), called spacer oligonucleotide typing (spolig- diagnosis abilities. Among new diagnostic tools, which otyping). In this paper, we investigated data analysis meth- have led to a better understanding of global tuberculosis ods to discover intelligible knowledge rules from spoligo- epidemiology, DNA ﬁngerprinting and the building of typing, that has not yet been applied on such representa- genotyping databases represent a new and powerful way tion. This processing was achieved by applying the C4.5 to analyze tuberculosis transmission (van Embden and van induction algorithm and knowledge rules were produced. Soolingen, 2000). Nevertheless, when handling a large Finally, a Prototype Selection (PS) procedure was applied amount of data, especially binary data, the astonishing to eliminate noisy data. 
This both simpliﬁed decision rules, ability of human brain to intuitively separate noisy from as well as the number of spacers to be tested to solve signiﬁcant information is most efﬁciently challenged by classiﬁcation tasks. In the second part of this paper, the computers, that may easily learn and reproduce some contribution of 25 new additional spacers and the knowl- human tasks such as classiﬁcation or similarity analysis. edge rules inferred were studied from a machine learning The Tuberculosis agent belongs to the Mycobacterium point of view. From a statistical point of view, the correla- tuberculosis complex that may be split into various tions between spacers were analyzed and suggested that subclasses. With overlapping yet distinct epidemiologies, both negative and positive ones may be related to potential these include M. tuberculosis sensu stricto, Mycobac- structural constraints within the DR locus that may shape terium bovis, Mycobacterium africanum and two other its evolution directly or indirectly. less investigated subspecies M. microti and M. canetti, Results: By generating knowledge rules induced from de- which will not be further discussed here. Among DNA cision trees, it was shown that not only the expert knowl- ﬁngerprinting studies, spacer oligonucleotide typing or edge may be modeled but also improved and simpliﬁed to spoligotyping has been applied to characterize these sub- solve automatic classiﬁcation tasks on unknown patterns. types and has gained increased international acceptance A practical consequence of this study may be a simpliﬁ- because it may be both rapid and easily applied as a ﬁrst cation of the spoligotyping technique, resulting in a reduc- line discriminatory test (Kremer et al., 1999). It is based tion of the experimental constraints and an increase in the on the properties of the Direct Repeat (DR) locus of the number of samples processed. M. 
tuberculosis complex genome, one of the best known Contact: csola@pasteur.gp polymorphic loci of this pathogenic agent (Hermans et al., 1991; Groenen et al., 1993; van Embden et al., 2000). 1 INTRODUCTION It is also a suitable genetic model to study recombination since it shows an extensive strain-to-strain polymorphism As a result of increased human population migrations, and may be used both for molecular epidemiological infectious disease spreading may have global conse- studies (Kamerbeek et al., 1997) and for molecular evolu- To whom correspondence should be addressed. tionary studies (Fang et al., 1999; Sola et al., 2001a) on c Oxford University Press 2002 235 M.Sebban et al. tuberculosis. This locus is composed of multiple identical (KDD) ﬁeld. Even if an approach consisting in deriving or nearly identical (differing by one or a few nucleotides) simple rules for classiﬁcation is not new, it has not yet 36-base pairs (bp) DR copies, which are interspersed been applied for spoligotyping of M. tuberculosis. The by short and nonrepetitive inter-DR spacer sequences results obtained show that the number of spacers can be (between 35 and 41 bp long). The precise physiological dramatically reduced for determining the clade (class) of role—if any—of this locus remains unknown. It has a new proﬁle. However, the large size of DB1, resulting been suggested that a similar locus found in the Archaea in the presence of noisy data (typing errors, mislabeled Haloferax mediterranei may be involved in replicon instances, misclassiﬁed examples, overlapping between partitioning (Mojica et al., 1995). The association of one 2 clades), required to ‘clean’ the database. Thanks to an DR and one spacer is designated Direct Variable Repeat efﬁcient data reduction technique it was achieved by se- (DVR), and strains may consequently differ by one or lecting only the relevant prototypes. These methods have more discrete DVRs (Groenen et al., 1993). 
Alternatively, been developed during the last decade to combat noisy strains may sometimes also differ by long deletions data in the modern, larger databases. Nevertheless, since of DR repeats, and IS6110-mediated transposition or Prototype Selection (PS) is not the theoretical subject of homologous recombination are likely to be involved in this article, interested readers are referred to a number such structural changes (Fang et al., 1999; van Embden of papers dealing with this topic (Hart, 1968; Gates, et al., 2000). Nevertheless, the mechanisms that generate 1972; Aha et al., 1971; Wilson and Martinez, 1998). this diversity of repetitive structures are yet poorly under- In this article, we applied on DB1 a recently published stood. However it may be speculated that a combination PS method (Sebban and Nock, 2000) which consists of replication slippage, homologous recombination as in keeping proﬁles representative of the probability of well as insertion-sequence-mediated events are driving density of each clade. This results in the deletion of forces that shape its evolution. Based on an initial repre- border examples (overlapping), and useless examples sentation of 43 speciﬁc inter-DR spacers, spoligotyping, at the center of clades. The results obtained show that a PCR-based reverse cross-blot hybridization procedure, the new knowledge base may be dramatically reduced was invented in 1997 (Kamerbeek et al., 1997); spac- after PS without reducing the decision rule performances. ers 20, 21, 33–36 were extracted from M. bovis BCG Lastly, we also analyzed a second data base, called DB2, sequence (Hermans et al., 1991), whereas others were where 25 new spacers were added to the 43 original ones. extracted from the M. tuberculosis H37Rv reference strain 10 new spacer sequences discovered in an M. bovis strain (Groenen et al., 1993; Cole et al., 1998). Spoligotyping (isolate 401) were published previously (Beggs et al., permits to subtype M. 
tuberculosis complex strains, either 1996). Recently, 41 new spacer sequences among which directly on sputum specimen, on cultures or even on 15 were retained in our study, were also described (van ancient histological specimen and bone extracts. A ﬁrst Embden et al., 2000). This constitutes a new representa- global phylogeographical classiﬁcation of M. tuberculosis tion space for spoligotyping which now includes 43 + 25, complex by spoligotyping was recently attempted by i.e. 68 spacers. The goal of this experiment was to study our team and two new phylogeographical classes were the contribution of the 25 new spacers on DB2, rather deﬁned, the East-African Indian and the Latin American than comparing knowledge rules deduced from DB1 and and Mediterranean clades (EA-I and LA-M respectively), DB2. Actually, because of the experimental difﬁculties, that harbor speciﬁc spoligotyping signatures (Sola et al., DB2 contains for the moment only 323 proﬁles, resulting 2001a,b). We built a ﬁrst database, called DB1, composed in non-comparable knowledge bases. Preliminary results of 7352 strains split into 342 shared types (spoligotypes obtained show that decision rules after the insertion of the common to two or more isolates) and representative of 25 new spacers on DB2 are fewer in number than before. more than 60 countries. Our main aim in the ﬁrst part Finally, a correlation search between the spacers showed of this paper consisted in generating knowledge rules the existence of both negative and positive correlations which would be useful for automatically classifying new between speciﬁc spacers. These results suggest that proﬁles. Due to the intrinsic properties of human brain potential topological constraints on the DNA of the DR functions, e.g. global memorization of various shapes and locus may directly or indirectly shape its evolution. 
synthetic way to treat information, human experts may have difficulty expressing their knowledge analytically in the form of formal decision rules. This is particularly true when handling binary data, i.e. spoligotyping patterns in our case. Indeed, data mining methods appear to be better suited than human observation to obtain relevant knowledge rules for classifying patterns, a task fit for the Knowledge Discovery in Database

2 SYSTEMS AND METHODS

2.1 Oligonucleotide design, 68-dimensional spoligotypes

The original set of 43 spacers initially published (Kamerbeek et al., 1997) was recently modified (van Embden et al., 2000). Parameters of the spoligotyping technique were kept unchanged. The first and second additional sets of oligonucleotides (25mer) contained 10 (set A) and 15 (set B) oligonucleotides, respectively, and were selected according to published sequences (Beggs et al., 1996; van Embden et al., 2000). In each case, the best 25mer was selected within the DR spacer sequence with the software 'Primers for the Mac' (v1.0a Apple Pi, Ashland, MA). Preliminary experiments for each oligonucleotide set (12.5, 25, 50 and 100 pmol/150 µl for set A, and 12.5, 40, 100 pmol/150 µl for set B), including adequate positive and negative control strains, were performed to optimize hybridization conditions. Final membrane preparation was performed using these optimized concentrations. Oligonucleotide description and concentrations are available upon request to the corresponding author.

DB1 is part of an ongoing population-based study which contained at the time 7352 profiles, split into 342 shared-types, and which were representative of about 60 countries. A limited version of this database and the sources of the data were published recently (Sola et al., 2001a). Seven modified-spoligotyping experiments were performed and the origin of the DNAs was as follows: Japan (n = 6), Russia (n = 92), USA (n = 5), Caribbean (n = 95) and Italy (n = 12). A total of 116 profiles were retained for DB2 construction and mixed with the 207 previously published profiles (170 M. tuberculosis sensu stricto + 37 M. bovis, as described in van Embden et al., 2000). The number of profiles in DB2 totaled 323, including orphan patterns, i.e. profiles found only once. Descriptions of DB1 and DB2 are available upon request to the authors.

2.2 Notations and clades

Since we applied machine learning algorithms, the standard notations of this field for describing databases from a modeling point of view will be used. Let (x_i, y_i) be an instance of the database, where x_i is a p-dimensional vector and y_i its class. p corresponds to the number of features characterizing the n instances of a Learning Sample (LS). The following equivalence list was established for the specific problem to be treated here:

• p is the number of spacers (features or descriptors) in each spoligotype of the databases. p = 43 for DB1 and p = 68 for DB2.

• n is the number of spoligotypes (n = 7352 for DB1 and n = 323 for DB2), i.e. the size of the LS.

• x_i is the p-dimensional binary vector (x_i^1, x_i^2, ..., x_i^p) corresponding to the p values taken by each spacer (0 or 1) for the i-th instance.

• y_i is the class of the spoligotype x_i. The 7352 instances of DB1 were clustered in 9 classes, a priori defined by the human expert, using previously published results (Kremer et al., 1999; van Embden et al., 2000; Sola et al., 2001a,b).

The 7352 strains are labeled according to the criteria described below. Although the human expert may both recognize and efficiently label many specific fingerprints 'by eye', the literal expression of formal and exclusive knowledge rules compatible with automatic information treatment is currently a challenging task. Given the presence of noisy data, we will see in this manuscript that the a priori visual rules are most efficiently challenged by machine learning and data-mining approaches.

• Afri, for M. africanum (n_Afri = 180 strains): this subclass of the M. tuberculosis complex, as recently described in Viana-Niero et al. (2001), includes any profile where spacers 8, 9 and 39 are absent.

• T (n_T = 1590): this clade is to be split in future into various yet undefined families. It includes any strain where at least one of the spacers 1–30 is present, spacers 33–36 are simultaneously absent, spacer 31 is present, spacer 9 or 10 is present, and at least one of the spacers 21–24 is present.

• Beijing (n_Beijing = 1268): this clade (van Soolingen et al., 1995) includes any profile where spacers 1–34 are absent.

• EA-I, for East-African–Indian (n_EAI = 907): this clade (Källenius et al., 1999; Sola et al., 2001a) includes any profile where spacers 29–32 and 34 are simultaneously absent, and at least one of the spacers 1–30 is present.

• Haarlem (n_Haarlem = 1034): this clade (Kremer et al., 1999) includes any strain where spacers 31 and 33–36 are simultaneously absent, and at least one of the spacers 1–30 is present.

• LAM-1, LAM for Latin America and Mediterranean (n_LAM1 = 819): this clade (Sola et al., 2001a) includes any strain where spacers 21–24 and 33–36 are simultaneously absent, and at least one of the spacers 1–30 is present.

• LAM-2 (n_LAM2 = 294): this clade is an attempt to define a new family. It includes any strain where spacers 9–10 and 33–36 are simultaneously absent, and at least one of the spacers 1–30 is present. As shown in Figures 1 and 2, subfamilies of LAM1 and LAM2 are defined by SIPINA that were not defined by the expert. This discrepancy suggests that the visual rules defined by the expert cannot be easily modeled by SIPINA because of spoligotypes harboring large deletions, which currently jeopardize the classification of these families. Moreover, recent yet unpublished observations suggest that there are indeed more than 2, yet undefined, subclades among the LAM strains.

• X (n_X = 1186): this clade is currently found to be highly prevalent in some English-speaking countries (Soini et al., 2000, our own unpublished observations). It includes any profile where spacer 18 is missing and spacers 33–36 are absent; the missing spacer 18 may sometimes be linked to the absence of spacers 39–42.

• M. bovis (n_M.bovis = 74): this well defined subclass includes any strain where spacers 39–43 are simultaneously absent, and spacers 33–36 are simultaneously present.

Fig. 1. Decision tree induced from DB1 in the 43-dimensional space on 7352 strains. N and Y represent respectively the absence and the presence of the current spacer. A square represents the discriminant spacer used for splitting the current sample; a circle represents a leaf of the tree. The number below each clade name is the success rate for classification, whereas the number on the top left of each clade represents the relative percentage of patterns found in that leaf.

Some of the clade or subclass definitions cited above may be considered as having been validated independently by other investigators (e.g. for M. africanum, Haarlem, Beijing, EA-I, M. bovis), whereas the description of others is either new and speculative (e.g. LAM-2, clade X) or remains undefined (clade T). Although the search for knowledge rules and the definition of clades are distinct processes using different software, both processes are interlinked and results in one domain may boost the knowledge in the other one. We will see that this is indeed the case and that data-mining constitutes a powerful approach when handling genotyping databases.
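To make the expert's knowledge base of Section 2.2 concrete, the clade definitions listed above can be sketched as boolean predicates over a 43-spacer binary profile. This is only an illustrative transcription, not the authors' software: the encoding (1 = spacer present, 0 = absent) and every name used here (`make_profile`, `EXPERT_RULES`, `matching_clades`) are assumptions introduced for the example, and the optional remark about spacers 39–42 in clade X is not encoded.

```python
from typing import Callable, Dict, Iterable, List

N_SPACERS = 43  # length of the classical spoligotype used for DB1

Profile = List[int]  # assumed encoding: 1 = spacer present, 0 = spacer absent


def make_profile(absent: Iterable[int] = ()) -> Profile:
    """Hypothetical helper: every spacer present except those listed."""
    missing = set(absent)
    return [0 if k in missing else 1 for k in range(1, N_SPACERS + 1)]


def span(first: int, last: int) -> range:
    """Inclusive range of spacer numbers (spacers are numbered from 1)."""
    return range(first, last + 1)


def all_absent(x: Profile, spacers: Iterable[int]) -> bool:
    return all(x[k - 1] == 0 for k in spacers)


def all_present(x: Profile, spacers: Iterable[int]) -> bool:
    return all(x[k - 1] == 1 for k in spacers)


def any_present(x: Profile, spacers: Iterable[int]) -> bool:
    return any(x[k - 1] == 1 for k in spacers)


# Expert rules transcribed from the clade descriptions of Section 2.2.
EXPERT_RULES: Dict[str, Callable[[Profile], bool]] = {
    "Afri": lambda x: all_absent(x, [8, 9, 39]),
    "T": lambda x: (any_present(x, span(1, 30))
                    and all_absent(x, span(33, 36))
                    and all_present(x, [31])
                    and any_present(x, [9, 10])
                    and any_present(x, span(21, 24))),
    "Beijing": lambda x: all_absent(x, span(1, 34)),
    "EA-I": lambda x: (all_absent(x, [29, 30, 31, 32, 34])
                       and any_present(x, span(1, 30))),
    "Haarlem": lambda x: (all_absent(x, [31, 33, 34, 35, 36])
                          and any_present(x, span(1, 30))),
    "LAM-1": lambda x: (all_absent(x, span(21, 24))
                        and all_absent(x, span(33, 36))
                        and any_present(x, span(1, 30))),
    "LAM-2": lambda x: (all_absent(x, [9, 10])
                        and all_absent(x, span(33, 36))
                        and any_present(x, span(1, 30))),
    # The remark about spacers 39-42 is optional in the text and is not encoded.
    "X": lambda x: all_absent(x, [18]) and all_absent(x, span(33, 36)),
    "M. bovis": lambda x: (all_absent(x, span(39, 43))
                           and all_present(x, span(33, 36))),
}


def matching_clades(x: Profile) -> List[str]:
    """Return every clade whose expert rule is satisfied by profile x."""
    return [name for name, rule in EXPERT_RULES.items() if rule(x)]


if __name__ == "__main__":
    # Spacers 31 and 33-36 absent: only the Haarlem rule fires.
    print(matching_clades(make_profile([31, 33, 34, 35, 36])))  # ['Haarlem']
    # Spacers 18, 21-24 and 33-36 absent: two rules fire at once.
    print(matching_clades(make_profile([18, 21, 22, 23, 24, 33, 34, 35, 36])))  # ['LAM-1', 'X']
```

Returning all matching clades rather than the first one is a deliberate choice for this sketch: it shows that rules written this way are not mutually exclusive, since a single synthetic profile can satisfy both the LAM-1 and the X definitions.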
2.3 Knowledge rules induced by decision trees

In this part, induction algorithms were applied on the original LS in order to generate decision trees. These trees are built by splitting the sample into sub-samples according to an optimized criterion (often an information measure), which in our case selects the best discriminant spacer x^j; each subset is then split according to the same strategy, resulting in two new subtrees, and so on, until no further split yields a sufficient information gain. Once the tree is built, each path from the root to a leaf describes a decision rule (or knowledge rule), useful for understanding the model but also for labeling new examples. This type of model is very useful in fields in which the understandability of the deduced model is important. For instance, while it is not crucial to understand the reasoning of a model specialized for postal code recognition (in this case an artificial neural network can be very efficient), the understandability of an artificial model for a medical diagnosis system is obviously more important. This is the case for our problem. Decision trees are well suited, not only for providing very efficient classification models but also for describing the process from the input vector x (the spacers) to the predicted output label y (the class). In these experiments, we used the SIPINA for Windows software, developed at the ERIC laboratory of the University of Lyon 2 (http://eric.univ-lyon2.fr). This software collects the state-of-the-art tree induction algorithms, such as C4.5 (Quinlan, 1993), CART (Breiman et al., 1984), and SIPINA (Zighed and Rakotomalala, 2000).

Fig. 2. Decision tree induced from the subset of prototypes of DB1 in the 43-dimensional space.

2.4 Accuracy estimation by cross-validation procedure

In order to assess the predictive ability of a model, i.e. the determination of the class of a new spoligotype, a cross-validation procedure is frequently used in the computer science and statistics fields (Efron and Tibshirani, 1993). This method provides an unbiased estimate of the theoretical generalization accuracy of the model on the whole population, using only the information contained in the LS. The principle is as follows: the original sample is divided into f folds. A predictive model (a decision tree, in our case) is built from f − 1 subsets during a learning stage, and tested on the remaining one. The same procedure is repeated for the f permutations. The estimate of the generalization accuracy is the average of the f accuracies computed at each stage of the cross-validation. This estimation method will be used in the following experimental section.

2.5 Prototype selection

Prototype selection was historically used to improve the efficiency of the k-nearest-neighbor classifier (Hart, 1968), which classifies an unknown instance according to a local vote by its k nearest neighbors and according to a given distance function (often the Euclidean). Although its use was widespread and encouraged by early theoretical results linking its generalization error to the Bayes risk, this classifier suffers from several practical limitations (Breiman et al., 1984). First, it is computationally expensive because it stores all the instances in memory. Second, it is intolerant of noisy instances. Third, it is intolerant of irrelevant attributes and sensitive to the chosen distance function. Pioneering works in PS first sought to solve only the first two problems listed above (Hart, 1968; Gates, 1972; Aha et al., 1971). The third problem (intolerance of irrelevant attributes) is a matter of feature selection. PS algorithms usually use the Euclidean distance. This function is appropriate only if the attributes are numeric. Wilson and Martinez (1998) proposed new distance functions to handle numeric and nominal attributes.

In this paper, we used a recently proposed method, Prototype Selection using Relative Certainty Gain (PSRCG; Sebban and Nock, 2000). We investigated PS as an information preserving problem. Rather than optimizing the accuracy of a classifier, we built a statistical information criterion, the Relative Certainty Gain (RCG), based on a quadratic entropy computed from the nearest-neighbor topology. From the neighbors linked by an edge with a given instance, PSRCG computes a quadratic entropy by taking into account the label of each neighbor. From this entropy (which conveys a local uncertainty), it deduces a global uncertainty of the LS. As long as an instance deletion is statistically significant, PSRCG eliminates uninformative examples. Using k-nearest neighbors for building the neighborhood graph, PSRCG has shown a high ability to reduce the database and improve classification performance (Sebban and Nock, 2000). In the following section, we will use PSRCG as a data preprocessing step on the 7352 profiles, before constructing a new decision tree. By eliminating irrelevant strains, we aim at improving the current knowledge base induced for automatically classifying spoligotypes.

3 EXPERIMENTAL RESULTS

3.1 Knowledge discovery from DB1

In this section, we automatically generated knowledge rules and built a predictive model from DB1. We applied the C4.5 induction algorithm to generate the decision tree described in Figure 1 from the 7352 strains. As shown in this figure, each path from the root to a leaf describes a knowledge rule. For the nine defined clades, the problem may be modeled by 14 knowledge rules only. Each of the 7352 strains, a priori labeled by the expert, belongs to only one leaf. Each leaf describes a given class, according to the majority clade represented in the subset of these profiles. Actually, a given strain x_i may belong to a leaf for which the majority clade is not y_i, i.e. the clade of x_i. It means that some leaves may be pure, i.e. they contain examples of a unique clade, whereas others describe a class by a relative vote. In the latter case, the minority examples represent the learning errors of the model. A learning error may either represent misclassification by the expert or be due to biological processes such as recombination or gene flows which may lead to fuzzy classification boundaries. Indeed, a decision rule may be assessed not only according to its accuracy on the LS, but also by the number of examples which follow this rule. The accuracy of the rule is described in Figure 1 by the Confidence Interval (CI) of the proportion of correctly classified strains (value within each circle). It is actually more suitable to describe the accuracy by a CI, which takes into account the number of strains which follow the rule, than by the simple success rate. A CI, with an α risk of 5%, is defined as follows:

$$ CI = p \pm 1.96 \sqrt{\frac{p(1 - p)}{n}} $$

where p is the success rate and n the number of strains. This explains why there is no CI for 100% success rates, because in these cases (1 − p) = 0. The number of examples that follow the rule is represented by the strain proportion of the clade in which this conjunction of spacers is found (value on each circle). In order to compare the performance of this automatic model (decision tree) with the expert's knowledge base previously mentioned in Section 2.2, we decided to characterize the two approaches by the following criteria:

• the number of rules: it describes the complexity of the model. We take into account here the conjunctions of spacers, disjunctions corresponding to different paths in the tree;

• the mean size of the rules (number of spacers from the root to the leaf): note that between two decision trees with the same performance, we usually choose the smallest one;

• the success rate on the LS: the accuracy is obviously the most popular criterion for assessing the quality of a model;

• the number of different spacers used in the decision rules: for our specific problem, we aim at using the smallest number of spacers in order to simplify the experimental processing.

According to these criteria, the expert's knowledge base and the decision rules induced by C4.5 on DB1 are summarized in Table 1. We may also note that:

• The expert's ability to label an unknown strain is real in many cases. However, it may not be so straightforward to generate simple knowledge rules in all cases. Moreover, a knowledge base should respect a set of constraints, particularly the absence of overlap between rules. This is not the case for the rules listed above. This explains the weak average accuracy (76.6%) of the human rules in comparison with the knowledge base automatically deduced from DB1 with SIPINA (97.6%), which increases to 99% after PS. Figure 1 shows that the a priori expert clustering is validated by knowledge rules with strong CIs in most instances. In other cases, and for less well defined clades (e.g. T, X, or LAM2), the existence of secondary or tertiary rules together with new subsets (e.g. X1, X2) suggests that today's global classification of tubercle bacilli clades based on spoligotyping results remains sub-optimal and is likely to evolve in the near future.

• Except for two spacers (37 and 38), the whole information (41 spacers) contained in the spoligotypes was used by the expert to discriminate the nine clades. On the contrary, only 13 spacers are used by the model automatically deduced from DB1 on the 7352 strains, and this with a higher performance. The selected spacers are: 8–10, 17–19, 22, 31, 33, 34, 36, 42, 43. In other words, presence or absence of these spacers may be considered as highly informative in the 43-dimensional space for classifying profiles. Neighborhood relationships may indicate that recombination mechanisms do not all act similarly on each spacer. As an example, spacers 9 and 31 are well known hot spots for IS6110 transposition, an evolutionary driving force that may change the DR locus structure (van Embden et al., 2000; Filliol et al., 2000; Legrand et al., 2001; Benjamin et al., 2001).

• Finally, while an average of 25.4 spacers per rule is used by the expert for finding the label, only 5.8 are used by the induced decision tree. This result may find practical application in the near future to simplify the spoligotyping technique, once the definition of major spoligotyping clades has been improved and only minimal extra genotyping information is required to automatically label an unknown strain.

The learning accuracy on DB1 suggests that it is possible to build a predictive model which would automatically classify any new profile. To assess the generalization accuracy of this model, we carried out a 5-fold cross-validation procedure on DB1. A mean success rate of 98% with a standard deviation of 0.8% was obtained.

3.2 Prototype selection from the 7352 strains

In this section, we used the PSRCG algorithm (for more details see Sebban and Nock, 2000). This PS algorithm is based on the optimization of an information measure (a quadratic entropy). It consists in deleting the irrelevant strains and globally improves the information in the representation space. Practically, the irrelevant strains are those located at the clade boundaries, where the uncertainty is likely to be high. Moreover, strains at the center of clusters (subsets of strains of the same clade in the representation space) are also weakly relevant for classification purposes. PSRCG is particularly suited to such a context, because it is not dependent on a learning algorithm and does not generate an inductive bias. In other words, the resulting subsets of prototypes may then be used by an induction algorithm to build decision trees. Among the 7352 original strains, our algorithm kept 4014 profiles only. From this subset of strains, called DB1_sub, a decision tree was built using SIPINA. The new model is presented in Figure 2. As was done for the knowledge base deduced from DB1, we computed the same criteria as above on DB1_sub. Results are included in Table 1. We note that the deletion of noisy data results in a decrease of the number of rules (11 versus 14), and in an increase of the accuracy (99.0 versus 97.6%). Moreover, the number of spacers is dramatically reduced. Actually, nine spacers are sufficient to totally discriminate the 4014 strains (8, 9, 18, 19, 22, 23, 31, 36, 43), among which eight were already present in the decision tree built from DB1. In the new tree shown in Figure 2, a closer look at the subsets of the LAM clades (designated LAM11, LAM12 and LAM13 in that tree) shows that this splitting represents left (LAM12) or right (LAM11) entire deletions of the DR locus, whereas LAM13 represents 'true' LAM as defined above. The full significance of this result requires further investigation.

Table 1. Results according to four performance criteria

                                     Rules deduced from
Criterion                 Expert     DB1       DB1_sub
Number of rules           17         14        11
Mean size of the rules    25.4       5.8       3.7
Accuracy (%)              76.6       97.6      99.0
Number of spacers         41         13        9

3.3 Contribution of the 25 additional spacers

The database DB2 was constructed by assembling previously published (n = 207) and our own in-house generated experimental profiles (n = 116). Since M. africanum strains were not yet tested for the 25 additional spacers, the results described in this section may unfortunately not be applied to M. africanum. Besides, since DB2 is still limited in size, our aim does not consist in comparing knowledge discovered in DB1 and DB2, but rather in DB2 before and after insertion of the 25 new spacers. Actually, the comparison between the two representation spaces (43 and 68) must be carried out on the same LS. We obtained two new decision trees, which are respectively shown in Figures 3 and 4. In order to facilitate the comparison, we now use as spacer labels their new position in the spoligotype represented by 68 features (for instance, spacer 21 is the 21st value in the 68-dimensional representation, and not the 21st in the 43-dimensional space). The renumbering of the spacers was previously done by van Embden et al. (2000), because of the conserved order of the spacers among different isolates of M. tuberculosis. Consequently, the numbers reflect the genetic organization in the genome. Among the 11 rules, 7 are very relevant, and describe classes Haarlem, M. bovis, EAI1, Beijing, LAM1, T1, LAM2 almost perfectly. The 4 others are either weakly accurate (T2, X) or have weak influence (T3, EAI2). Two spacers, and consequently the related DVRs (31 and 47), have been shown to harbor point mutations in the DR repeat itself (van Embden et al., 2000). This is a striking result that could be explained by the presence/absence of a single nucleotide polymorphism in the various clades described.

Fig. 3. Decision tree induced from DB2 in the 43-dimensional space after removing the 25 additional spacers.

Fig. 4. Decision tree induced from DB2 in the 68-dimensional space by the C4.5 induction algorithm.

Comparing the first and the second decision trees, the following two observations may be made:

(1) The number of knowledge rules is reduced in the 68-dimensional space. 11 rules (with 10 spacers) were inferred in the 43-dimensional space for discriminating the 323 profiles into the 8 classes, whereas 9 spacers and 10 rules were required for reaching the same objective in the 68-dimensional space. Theoretically, a smaller model is desirable because it will not suffer from overfitting of the learning data, and will be more reliable for classifying new unknown individuals.

(2) The reduction of the tree size is explained by the presence of the new spacer 1, which makes it possible to totally discriminate the right side of the tree. On the other hand, spacers 26 and 43 were required in the first decision tree. Thus, the presence of spacer 1 now makes it possible to remove an irrelevant rule.

The contribution of the 25 additional spacers may also be assessed through a statistical study. We performed a principal component analysis on the two datasets, in order to compare the part of the total variance explained by the two main factorial axes. While this percentage of the total variance was around 53% when mapping individuals of DB2 from a 43-dimensional space to a planar representation, the part of the variance explained by the two axes was improved by the addition of the 25 new spacers. Actually, about 60% of the total variance was explained by such a representation, which confirms the contribution of these 25 new spacers. Lastly, by using this data analysis method, a correlation circle between spacers was drawn (see Figure 5). In this figure, the correlation level between two spacers is measured by the value of the angle between two points. A high positive correlation between spacers is shown by an acute angle. Conversely, a negative correlation is shown as an obtuse angle and no correlation between spacers by a right angle. Figure 5 shows a very high negative correlation between spacers 41 and 47. This means that the presence of spacer 41 is almost always accompanied by the absence of spacer 47, and vice versa. This knowledge is refuted only for 12 individuals among 323, which are all characterized by a large deletion in the locus, likely to be linked to IS6110-mediated deletion events (van Embden et al., 2000). In this context, it is striking that the presence of an inverted IS6110 copy in the Beijing clade occurs precisely between these two spacers. This very same DVR 47 also harbors point mutations in the DR repeat in all the M. bovis strains investigated so far (van Embden et al., 2000). We also found a positive correlation between spacers 1, 21, 28, 29, 31 and 35, but these spacers are independent of spacers 41 and 47. The significance of this correlation remains to be investigated.

Fig. 5. Correlation circle from the eight spacers used in the decision tree.

4 CONCLUSION

In this paper, we applied data mining methods to generate knowledge rules for solving classification tasks from existing or specifically created databases. To the best of our knowledge, this study is a first attempt to automatically discover simple knowledge rules from spoligotyping data. Remarkably, the generated rules are different from and simpler than those previously defined by the expert. A possible explanation of this phenomenon may come from the fact that decision trees use pruning to avoid overfitting (which would otherwise result in deep trees), whereas the expert takes into account all data as signals, i.e. without differentiating noisy from noise-free data. This is especially true for spoligotyping, where 2^43 (and now 2^68) combinations are theoretically possible. A direct consequence of these simpler rules may be a change in the laboratory experimental setup by eliminating the use of uninformative spacers. Another consequence may be that uninformative spacers may be downweighted or excluded when dealing with phylogeny reconstruction using parsimony principles. The described process could also make it possible to automatically classify M. tuberculosis stricto sensu into the 8 or 9 validated clades. This would represent a first level of sub-classification for epidemiological purposes, an application that deserves further investigation. Future investigations will study other M. tuberculosis representations, such as the Variable Number of Tandem DNA Repeats (VNTRs) or the Mycobacterial Interspersed Repetitive Units (MIRUs; Frothingham and Meeker-O'Connell, 1998; Supply et al., 2000). Data-mining methods will undoubtedly make it possible to distinguish useful information from noisy data obtained by these markers.

ACKNOWLEDGEMENTS

This work was supported by the Délégation Générale au Réseau International des Instituts Pasteur et Instituts Associés and the Fondation Raoul Follereau. We are also very grateful to Dr C.Mammina, Italy, Dr B.Vishnevsky and Dr T.Otten, Russia and Dr H.Kasai, Japan, for providing some DNAs of M. tuberculosis clinical isolates.

REFERENCES

Aha,D., Kibler,K. and Albert,M. (1971) Instance-based learning algorithms. Mach. Learn., 6, 37–66.

Beggs,M.L., Cave,M.D., Marlowe,C., Cloney,L., Duck,P. and Eisenach,K.D. (1996) Characterization of Mycobacterium tuberculosis complex direct repeat sequences for use in cycling probe reactions. J. Clin. Microbiol., 34, 2985–2989.

Benjamin,W.H. Jr., Lok,K.H., Harris,R., Brook,N., Bond,L., Mulcahy,D., Robinson,N., Pruitt,V., Kirkpatrick,D.P., Kimerling,M.E. and Dunlap,N.E. (2001) Identification of a contaminating Mycobacterium tuberculosis strain with a transposition of an IS6110 insertion element resulting in an altered spoligotype. J. Clin. Microbiol., 39, 1092–1096.

Breiman,L., Friedman,J.H., Olshen,R.A. and Stone,C.J. (1984) Classification and Regression Trees. Chapman and Hall, Wadsworth, CA.

Cole,S.T., Brosch,R., Parkhill,J., Garnier,T., Churcher,C., Harris,D. and Gordon,S.V. et al. (1998) Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature, 393, 537–544.

van Embden,J.D.A. and van Soolingen,D. (2000) Molecular epidemiology of tuberculosis: coming of age. Int. J. Tuberc. Lung. Dis., 4, 285–286.

van Embden,J.D.A., van Gorkom,T., Kremer,K., Jansen,R., Van der Zeijst,B.A.M. and Schouls,L.M. (2000) Genetic variation and evolutionary origin of the direct repeat locus of Mycobacterium tuberculosis complex bacteria. J. Bacteriol., 182, 2393–2401.

Efron,B. and Tibshirani,R. (1993) An Introduction to the Bootstrap. Chapman and Hall, London.

Fang,Z., Doig,C., Kenna,D.T., Smittipat,N., Palittapongarnpim,P., Watt,B. and Forbes,K.J. (1999) IS6110-mediated deletions of wild-type chromosomes of Mycobacterium tuberculosis. J. Bacteriol., 181, 1014–1020.

Filliol,I., Sola,C. and Rastogi,N. (2000) Detection of a previously unamplified spacer within the DR locus of Mycobacterium tuberculosis: epidemiological implications. J. Clin. Microbiol., 38, 1231–1234.

Frothingham,R. and Meeker-O'Connell,W.A. (1998) Genetic diversity in the Mycobacterium tuberculosis complex based on variable numbers of tandem DNA repeats. Microbiol., 144, 1189–1196.

Gates,G.W. (1972) The reduced nearest neighbor rule. IEEE Trans. Inf. Theor., IT13, 431–433.

Grein,T.W., Kamara,K.B., Rodier,G., Plant,A.J., Ryan,M.J., Ohyama,T. and Heymann,D.L. (2000) Rumors of disease in the global village: outbreak verification. Emerg. Infect. Dis., 6, 97–102.

Groenen,P.M.A., Bunschoten,A.E., van Soolingen,D. and van Embden,J.D.A. (1993) Nature of DNA polymorphism in the direct repeat cluster of Mycobacterium tuberculosis. Mol. Microbiol., 10, 1057–1065.

Hart,P.E. (1968) The condensed nearest neighbor rule. IEEE Trans. Inf. Theor., IT14, 515–516.

Hermans,P.W.M., van Soolingen,D., Bik,E.M., de Haas,P.E.W., Dale,J.W. and van Embden,J.D.A. (1991) Insertion element IS987 from Mycobacterium bovis BCG is located in a hot-spot integration region for insertion elements in Mycobacterium tuberculosis complex strains. Infect. Immun., 59, 2695–2705.

Källenius,G., Koivula,T., Ghebremichael,S., Hoffner,S.E., Norberg,R.E., Svensson,E., Dias,F., Marklund,B. and Svenson,S.B. (1999) Evolution and clonal traits of Mycobacterium tuberculosis in Guinea-Bissau. J. Clin. Microbiol., 37, 3872–3878.

Kamerbeek,J., Schouls,L., Kolk,A., van Agterveld,M., van Soolingen,D., Kuijper,S., Bunschoten,A., Molhuizen,H., Shaw,R., Goyal,M. and van Embden,J.D.A. (1997) Simultaneous detection and strain differentiation of Mycobacterium tuberculosis for diagnosis and epidemiology. J. Clin. Microbiol., 35, 907–914.

Kremer,K., van Soolingen,D., Frothingham,R., Haas,W.H., Hermans,P.W.M., Martin,C., Palittapongarnpim,P., Plikaytis,B.B., Riley,L.W., Yakrus,M.A., Musser,J.M. and van Embden,J.D.A. (1999) Comparison of methods based on different molecular epidemiological markers for typing of Mycobacterium tuberculosis strains: interlaboratory study of discriminatory power and reproducibility. J. Clin. Microbiol., 37, 2607–2618.

Legrand,E., Filliol,I., Sola,C. and Rastogi,N. (2001) Use of spoligotyping to study the evolution of the direct repeat locus by IS6110 transposition in Mycobacterium tuberculosis. J. Clin. Microbiol., 39, 1595–1599.

Mojica,F.J., Ferrer,C., Juez,G. and Rodriguez-Valera,F. (1995) Long stretches of short tandem repeats are present in the largest replicons of the Archaea Haloferax mediterranei and Haloferax volcanii and could be involved in replicon partitioning. Mol. Microbiol., 17, 85–93.

Quinlan,J.R. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.

Sebban,M. and Nock,R. (2000) Instance pruning as an information preserving problem. In Proceedings of the 17th International Conference on Machine Learning. Stanford University, pp. 855–862.

Soini,H., Pan,X., Amin,X., Graviss,E.A., Siddiqui,A. and Musser,J.M. (2000) Characterization of Mycobacterium tuberculosis isolates from patients in Houston, Texas, by spoligotyping. J. Clin. Microbiol., 38, 669–676.

Sola,C., Filliol,I., Gutierrez,M.C., Mokrousov,I., Vincent,V. and Rastogi,N. (2001a) An update of the spoligotyping database of Mycobacterium tuberculosis: epidemiological and phylogenetical perspectives. Emerg. Inf. Dis., 9, 390–396.

Sola,C., Filliol,I., Legrand,E., Mokrousov,I. and Rastogi,N. (2001b) Mycobacterium tuberculosis phylogeny reconstruction based on combined numerical analysis with IS1081, IS6110, VNTR and DR-based spoligotyping suggests the existence of two new phylogeographical clades. J. Mol. Evol., 53, 680–689.

van Soolingen,D., Qian,L., de Haas,P.E.W., Douglas,J.T., Traore,H., Portaels,F., Qing,H.Z., Enkhsaikan,D., Nymadawa,P. and van Embden,J.D.A. (1995) Predominance of a single genotype of Mycobacterium tuberculosis in countries of East Asia. J. Clin. Microbiol., 33, 3234–3238.

Supply,P., Mazars,E., Lesjean,S., Vincent,V., Gicquel,B. and Locht,C. (2000) Variable human minisatellite-like regions in the Mycobacterium tuberculosis genome. Mol. Microbiol., 36, 762–771.

Viana-Niero,C., Gutierrez,M.C., Sola,C., Filliol,I., Boulahbal,F., Vincent,V. and Rastogi,N. (2001) Genetic diversity of Mycobacterium africanum clinical isolates based on IS6110-restriction fragment length polymorphism analysis, spoligotyping, and variable number of tandem DNA. J. Clin. Microbiol., 39, 57–65.

Wilson,D. and Martinez,T. (1998) Reduction techniques for instances-based learning algorithms. Mach. Learn., 38, 257–286.

Zighed,D.A. and Rakotomalala,R. (2000) Graphe Induction-Apprentissage et Data Mining. Hermes, Paris.

Journal

Bioinformatics – Oxford University Press

Published: Feb 1, 2002
