Comparing genome versus proteome-based identification of clinical bacterial isolates

Abstract Whole-genome sequencing (WGS) is gaining importance in the analysis of bacterial cultures derived from patients with infectious diseases. Existing computational tools for WGS-based identification have, however, been evaluated on previously defined data, thereby relying uncritically on the available taxonomic information. Here, we newly sequenced 846 clinical gram-negative bacterial isolates representing multiple distinct genera and compared the performance of five tools (CLARK, Kaiju, Kraken, DIAMOND/MEGAN and TUIT). To establish a faithful ‘gold standard’, the expert-driven taxonomy was compared with identifications based on matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometry (MS) analysis. Additionally, the tools were also evaluated using a data set of 200 Staphylococcus aureus isolates. CLARK and Kraken (with k = 31) performed best with 626 (100%) and 193 (99.5%) correct species classifications for the gram-negative and S. aureus isolates, respectively. Moreover, CLARK and Kraken demonstrated the highest mean F-measure values (85.5/87.9% and 94.4/94.7% for the two data sets, respectively) in comparison with DIAMOND/MEGAN (71 and 85.3%), Kaiju (41.8 and 18.9%) and TUIT (34.5 and 86.5%). Finally, CLARK, Kaiju and Kraken outperformed the other tools by a factor of 30 to 170 in terms of runtime. We conclude that the application of nucleotide-based tools using k-mers (e.g. CLARK or Kraken) allows for accurate and fast taxonomic characterization of bacterial isolates from WGS data. Hence, our results suggest WGS-based genotyping to be a promising alternative to the MS-based biotyping in clinical settings. Moreover, we suggest that complementary information should be used for the evaluation of taxonomic classification tools, as public databases may suffer from suboptimal annotations.
bacteria, taxonomy, MALDI-TOF MS, whole-genome next-generation sequencing Introduction In the light of the global increase of antibiotic-resistant microorganisms, rapid and accurate pathogen characterization—i.e. their classification into organism groups—is essential for an effective treatment of infectious diseases [1]. This facilitates patient stratification and personalized therapies. Several approaches have been developed for the taxonomic characterization of bacterial isolates. The classical microbiological approaches are built on a large basis of constantly revised expert knowledge and typically involve Gram staining, analysis of culture growth, phenotype and biochemical reaction patterns [2]. These methods are increasingly augmented by high-throughput molecular methods such as 16S ribosomal RNA (rRNA) gene sequencing [3]. However, the taxonomic resolution based on the 16S rRNA gene alone is limited [3, 4]. Another alternative is taxonomic analysis using matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) where the obtained protein mass spectra are compared against a reference database [5]. This proteome-based approach is characterized by high accuracy [5–7], low operating costs and quick turn-around times [8, 9]. Usually, MALDI-TOF MS-based analysis is applied to pure cultures, but there are studies performing taxonomic classification of mixed microbial communities [10, 11]. The reference spectra are typically not disclosed by the manufacturers, thus hampering the study and expansion of the existing references [12]. However, several attempts to create a publicly available database exist [13, 14] with the database ‘Spectra’ [15], curated by the Public Health Agency of Sweden, providing a sizeable data source containing >5000 spectra of bacteria and fungi. Although various factors might negatively affect the analysis outcome, e.g. 
sample preparation or age, limiting the result comparability between laboratories [7, 12, 16, 17], MS-based pathogen identification is applied in many diagnostic laboratories [1]. In part, driven by decreasing costs and faster turn-around times, whole-genome sequencing (WGS) has gained importance for the identification of pathogens as well as for antimicrobial resistance analyses and outbreak monitoring [2]. The existing sequencing-based taxonomic classification tools can be organized into two groups—tools relying on specific marker genes/sequences (e.g. MetaPhlAn [18] and MetaPhyler [19]) and whole-genome-based tools. The whole-genome-based approaches assign input sequences to taxa using alignments (e.g. DIAMOND [20] and TUIT [21]), k-mer matching (e.g. CLARK [22], Kaiju [23] and Kraken [24]) or alignment-free methods (e.g. Vervier et al. [25], RAIphy [26] and PhyloPythia [27–29]). While the tools’ performances are evaluated in the respective publications, these performance evaluations rely on publicly available data typically generated in independent, earlier experiments. Hence, incomplete or suboptimal annotations, e.g. because of contaminating sequences, are expected to exert negative effects on the evaluations. Importantly, the use of complementary information, such as MALDI-TOF MS-based taxonomic classification, is missing. This is of particular importance in the context of clinically relevant pathogen identification [2]. Here, we newly sequenced 846 pathogenic, gram-negative bacterial clinical isolates [including, among others, Escherichia spp. (22%), Proteus spp. (14%), Klebsiella spp. (16%), Pseudomonas spp. (11%), Enterobacter spp. (6%), Salmonella spp. (8%) and Acinetobacter spp. (6%)] and evaluated the classification performance of a set of WGS-based taxonomic classification tools (CLARK, DIAMOND/MEGAN, Kaiju, Kraken and TUIT). 
The ‘ground truth’ taxonomic assignments were established by confirming the expert-driven taxonomy using a Bruker Biotyper MALDI-TOF MS system. Moreover, we newly sequenced 200 Staphylococcus aureus isolates and performed the same analysis as for the gram-negative bacteria where the ‘ground truth’ comprised only the MS-based taxonomy. Our results demonstrated that certain WGS-based approaches allow for an accurate taxonomic classification, and thus, can be considered as promising alternatives to MS-based biotyping. Moreover, the complementary information of the protein mass spectra is a powerful alternative to relying on existing, yet potentially misleading, publicly available data. Materials and methods Bacterial isolates Our first data set consisted of 846 gram-negative bacterial clinical isolates collected for diagnostic purposes. The isolates were characterized by microbiologists from the respective laboratory according to the institutional guidelines for routine clinical microbiological testing, which was state of the art at the time of testing (Supplementary Table S1). The overview of the taxonomic assignments was created with Krona [30] (Figure 1). The samples are part of the microbiology strain collection of Siemens Healthcare Diagnostics (West Sacramento, CA). For 240 isolates, the data set included the collection date (Supplementary Figure S1), and for 783 isolates, the collection location (country or continent) was provided (Supplementary Table S1).
Figure 1. Taxonomic composition of the 846 gram-negative isolates based on expert-driven taxonomy.
The second data set included 200 S. aureus clinical isolates, which are part of the S. aureus strain collection of Saarland University Medical Center.
For all isolates, the location of the isolation (country) and the isolation year (except for one sample) were provided (Supplementary Table S2, Supplementary Figure S1). DNA extraction Four streaks of each gram-negative bacterial isolate were cultured on trypticase soy agar containing 5% sheep blood, and cell suspensions were made in sterile 1.5 ml collection tubes containing 50 µl Nuclease-Free Water (AM9930, Life Technologies). Bacterial isolate samples were stored at −20 °C until nucleic acid extraction. The Tissue Preparation System (TPS) (096D0382-02_01_B, Siemens) and the VERSANT® Tissue Preparation Reagents (TPR) kit (10632404B, Siemens) were used to extract DNA from these bacterial isolates. TPS for nucleic acid extraction has been described previously [31–33]. Before extraction, the bacterial isolates were thawed at room temperature and were pelleted at 2000g for 5 s. The DNA extraction protocol DNAext was used for complete total nucleic acid extraction of samples. The total nucleic acid eluates were then transferred into 96-well quantitative polymerase chain reaction (qPCR) detection plates (401341, Agilent Technologies) for RNase A digestion, DNA quantitation and plate DNA concentration standardization processes. RNase A (AM2271, Life Technologies), which was diluted in nuclease-free water following the manufacturer’s instructions, was added to 50 µl of the total nucleic acid eluate for a final working concentration of 20 µg/ml. The digestion enzyme and eluate mixture was incubated at 37 °C for 30 min using the Siemens VERSANT® Amplification and Detection instrument. DNA from the RNase-digested eluate was quantitated using the Quant-iT™ PicoGreen dsDNA Assay (P11496, Life Technologies) following the assay kit instructions, and fluorescence was determined on the Siemens VERSANT® Amplification and Detection instrument.
In total, 25 µl of the quantitated DNA eluates were transferred into a new 96-well PCR plate for plate DNA concentration standardization before library preparation. Elution buffer from the TPR kit was used to adjust DNA concentration. The standardized DNA eluate plate was then stored at −80 °C until library preparation. Pure isolates from the S. aureus data set were grown overnight in brain heart infusion liquid culture with regular shaking (3 ml, 150 rpm). In total, 1 ml of overnight culture was centrifuged (10 min at 5000g), and the pellet was resuspended in P1 buffer (Qiagen), supplemented with 4 µl lysostaphin (10 mg/ml, frozen stock solution, Sigma) and incubated at 37 °C (30 min, 900 rpm) for enzymatic digestion of S. aureus cell walls. Proteinase K extraction was performed at 56 °C (30 min) by adding 300 µl lysis buffer and proteinase K solution (Maxwell 16 LEV Blood DNA kit, Promega). Following automated nucleic acid isolation (Promega Maxwell), culture extracts were eluted in 75 µl nuclease-free water. The presence of high-molecular-weight DNA without degradation was confirmed by standard agarose gel electrophoresis. Next-generation sequencing Before library preparation, quality control of isolated bacterial DNA was conducted using a Qubit 2.0 Fluorometer (Qubit dsDNA BR Assay Kit, Life Technologies) and an Agilent 2200 TapeStation (Genomic DNA ScreenTape, Agilent Technologies). Next-generation sequencing libraries were prepared in 96-well format using NexteraXT DNA Sample Preparation Kit and NexteraXT Index Kit for 96 indexes (Illumina) according to the manufacturer’s protocol. The resulting sequencing libraries were quantified in a qPCR-based approach using the KAPA SYBR FAST qPCR MasterMix Kit (Peqlab) on a ViiA 7 Real-Time PCR System (Life Technologies). In total, 96 samples were pooled per lane for paired-end sequencing (2×100 bp) on Illumina HiSeq2000 or HiSeq2500 sequencers using TruSeq PE Cluster v3 and TruSeq SBS v3 sequencing chemistry (Illumina).
Basic sequencing quality parameters were determined using the FastQC quality control tool for high-throughput sequence data [34], and the reports were summarized using MultiQC (version 0.8) [35] for the gram-negative isolates and S. aureus, respectively (Supplementary Tables S3 and S4, Supplementary Figures S2 and S3). A subset of gram-negative samples was resequenced because of low read coverage in the initial run; data of both runs were subsequently merged (Supplementary Table S1). Proteome-based identification Bacterial isolates were cultured on trypticase soy agar containing 5% sheep blood (BD BBL) and incubated at 35 °C for 18–24 h. Isolates were subjected to MALDI-TOF MS analysis using the Bruker Biotyper 3.1.65 (Bruker Daltonics, Bremen, Germany). Isolated colonies were directly smeared onto a polished steel target plate (Bruker Daltonics). Matrix (α-cyano-4-hydroxycinnamic acid, Bruker Daltonics), reconstituted as recommended by the manufacturer (50% acetonitrile, 47.5% water and 2.5% trifluoroacetic acid), was added to the cellular material on the target plate. Following successful calibration with the Bacterial Test Standard (Bruker Daltonics), bacterial isolates were tested on the Bruker Biotyper (flexControl version 3.3.108.0 and flexAnalysis 3.3.80.0) following the manufacturer’s instructions. Mass spectra were obtained, and scores were generated. Scores of  ≥2.0 were considered probable species identifications, scores of ≥1.7 but  <2.0 were considered probable genus identifications and scores  <1.700 were considered not reliable identifications, i.e. as not identified. The used cutoff values were defined according to the manufacturer’s guidelines. From 846 isolates analyzed with the Bruker Biotyper system, 100 samples were retested in a second run because they were not identified or yielded ambiguous classification in the first run. Best hits of the first and the second run were considered as the classification results. 
These included either species- or genus-level assignments with respect to the score cutoff values as described above. For all species assignments, the corresponding genus was determined using the R-package taxize [36] and the NCBI taxonomy database [37] (accessed 4 October 2016). The results of both runs were consolidated as follows: if the runs disagreed on the species level but not on the genus level, then only the genus was saved; if the runs disagreed on the genus level, the sample was considered as unclassified; and if a sample was classified at the species level in one run but only at the genus level in the other run and both genus-level assignments were concordant, then the species-level assignment was saved. Detailed information on each sample can be found in Supplementary Table S1. The identification of the isolates from the S. aureus data set was performed by a MALDI-TOF MS analysis using the standard protocol (Bruker Biotyper, Bruker Daltonics). Genome-based identification of bacteria We applied five tools for whole-genome-based taxonomic classification: the BLAST-based tools DIAMOND [20] (with Lowest Common Ancestor (LCA) assignment using MEGAN [38]) and TUIT [21], and the k-mer-based approaches CLARK [22], Kaiju [23] and Kraken [24]. CLARK: Version 1.1.3 was used, and the database was created with the respective script at the species level using finished bacterial genomes from the NCBI RefSeq database (2 November 2015). The tool was run in default mode, k-mer lengths were set to 21, 25, 29 or 31 and forward and reverse paired-end reads were used as input. Report files were created using ‘getAbundance’ with default parameters. Kaiju: Version 1.4-7 was used with the default database of complete genomes downloaded from the NCBI FTP server (30 June 2016). Paired-end reads were used as input with the default run mode Maximum Exact Matches (MEM). Report files on species level were created using ‘kaijuReport’.
Kraken: Version 0.10.4-beta was used with the default database containing finished genomes from the NCBI RefSeq database (13 January 2015). The k-mer lengths were set to 21, 25, 29 or 31, and forward and reverse paired-end reads were used as input. Report files were created from the raw output using ‘kraken-report’. DIAMOND/MEGAN: As DIAMOND has no direct support for paired-end reads, we used only forward reads as input. Version 0.6.13.48 was used with BLASTX search against the NCBI nonredundant protein sequence database (nr) (27 February 2015) with default parameters. The output was further processed using the LCA method implemented in MEGAN (version 5.10.6, tool blast2lca with default parameters and GenInfo Identifier (GI) number taxon mapping from March 2015) and summarized by counting the number of mapped sequences for each listed taxon. TUIT: As in the case of DIAMOND, only forward reads were used as input. Furthermore, the FASTQ file was converted to FASTA format using FASTX-Toolkit [39] (version 0.0.14, with ‘-Q 33’), and unique reads were collapsed to reduce the number of input sequences. TUIT (version 1.0.3.2) was used with local BLAST search against the NCBI nucleotide collection (nt) (4 April 2014) and default parameters. The output was summarized as described for DIAMOND/MEGAN. Result summaries From the individual tools’ reports, the following information was computed: species taxon with maximal percentage of mapped sequences (first best species hit with respect to all reported sequences), the number of sequences mapped to the best hit species taxon, the number of sequences classified at the species level and the number of sequences mapped to the expected species taxon, i.e. taxa obtained by the merged MS-based analysis results in case of the gram-negative isolates. The total number of sequences was set to the number of input sequences, i.e. 
the number of reads for the CLARK, DIAMOND/MEGAN, Kaiju and Kraken results, and the number of FASTA sequences after converting FASTQ files into FASTA files for the results of TUIT. The summarized results can be found in Supplementary Tables S1 and S2. Performance measures For each WGS-based summary file, the following performance measures were calculated: the sensitivity, precision and F-measure values with respect to the best species hit and expected species taxon (Supplementary Table S1). Sensitivity was defined as the ratio of the number of reads assigned to the species taxon to the total number of reads. Precision was defined as the ratio of the number of reads assigned to the species taxon to the number of reads classified at the species level (i.e. assigned to any species taxon). F-measure was defined as 2 × (sensitivity × precision)/(sensitivity + precision). Runtime analysis To compare the runtimes of the tools, we randomly selected five gram-negative samples whose number of reads was between 100 000 and 1 000 000 to reduce computational cost. For each sample and each tool, the elapsed (wall clock) time was measured three times using ‘GNU time’ (version 1.7). Before measuring the time, the tools were ‘pre-run’ on a single sample. The tools were called in the same way as described above with the following additional settings: the number of threads was set to 30 using the parameter ‘--threads’ for Kraken and DIAMOND, ‘-n’ for CLARK and ‘-z’ for Kaiju. For TUIT, only the number of threads for the BLAST search can be set manually through the parameter ‘NumThreads’ in the supplied property file. This parameter was also set to 30. Furthermore, we used the option ‘--preload’ for Kraken. The analysis was performed on a server with 500 GB RAM and 64 processors [AMD Opteron(tm) Processor 6378] with 1400 MHz. The time spent on downloading and creating the databases was not considered.
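The timing protocol described here (an untimed warm-up run followed by three timed repetitions of wall-clock time) can be sketched as follows. This is a minimal illustration only: the study used ‘GNU time’, whereas the sketch substitutes Python's `time.perf_counter`, and the function name `time_command` and the placeholder command are assumptions made for the example, not part of the original pipeline.

```python
import subprocess
import sys
import time


def time_command(cmd, repetitions=3, warmup=True):
    """Return the wall-clock durations (in seconds) of repeated runs of cmd.

    cmd is an argument list as accepted by subprocess.run.
    """
    def run_once():
        # Discard the tool's output; only the elapsed time matters here.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

    if warmup:
        # Untimed run before measuring, analogous to the 'pre-run' in the text
        # (assumption: one warm-up per call of this function).
        run_once()

    durations = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_once()
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Hypothetical stand-in for a classifier call; replace with a real command line.
    placeholder_cmd = [sys.executable, "-c", "pass"]
    print(time_command(placeholder_cmd))
```

The function returns the individual durations rather than a summary so that any aggregation can be applied afterwards.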
The final runtime per sample was computed as the mean over the three repetitions and normalized by the number of read pairs in the corresponding FASTQ files. Effect of read processing on results of CLARK and Kraken CLARK and Kraken were additionally run on paired reads preprocessed with Trimmomatic [40] as described below. K-mer length was set to 31. All other parameters and output processing were kept as described above. Identification of species contained in the reference databases Whether the expected species were represented in the reference databases of the WGS-based tools was determined as follows: for CLARK, Kaiju, Kraken and TUIT, the GI numbers were extracted from the used nucleotide sequences and mapped to the taxonomy names using the taxonomy mapping files of NCBI; then we checked whether the expected taxa were contained in the set of retrieved taxonomy names; for DIAMOND/MEGAN, the sequence titles were extracted from the used nr database, filtered to retrieve only those with one unique taxonomy name and used to search for the expected taxa. Identification of candidate mixtures To detect samples containing genomic data of more than one organism, we performed a homogeneity analysis based on the WGS data. The raw reads were trimmed using Trimmomatic [40] (version 0.35, command line parameters: PE ILLUMINACLIP:NexteraPE-PE.fa:1:50:30 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36), assembled into scaffolds using SPAdes [41] (version 3.6.2, command line parameters: -k 21,33,55 --careful) and annotated by PROKKA [42] (version 1.11, command line parameters: --gram neg --mincontiglength 200). The homogeneity of individual samples was assessed using a set of ‘essential genes’. These genes occur in single copy and are conserved in 95% of all sequenced bacteria [43]. A sample was considered a candidate mixture if >10 of the essential genes were found in multiple copies (Supplementary Tables S1 and S2).
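The candidate-mixture rule above (more than 10 essential genes present in multiple copies) can be sketched as follows. This is a minimal illustration, assuming the annotation has already been reduced to a flat list of gene names; the function names and the input format are hypothetical and do not reflect the actual scripts used in the study.

```python
from collections import Counter


def count_multicopy_genes(annotated_genes, essential_genes):
    """Count how many essential genes occur more than once in an annotation.

    annotated_genes: iterable of gene names, one entry per annotated copy.
    essential_genes: set of gene names expected in single copy.
    """
    copies = Counter(g for g in annotated_genes if g in essential_genes)
    return sum(1 for n in copies.values() if n > 1)


def is_candidate_mixture(annotated_genes, essential_genes, max_multicopy=10):
    """Flag a sample if more than max_multicopy essential genes are multi-copy."""
    return count_multicopy_genes(annotated_genes, essential_genes) > max_multicopy


if __name__ == "__main__":
    # Hypothetical toy data: 40 essential genes, 12 of them duplicated.
    essential = {f"ess{i}" for i in range(40)}
    annotation = [f"ess{i}" for i in range(40)] + [f"ess{i}" for i in range(12)]
    print(is_candidate_mixture(annotation, essential))
```

The threshold is exposed as a parameter because the cutoff of 10 is a choice of the study rather than a property of the gene set.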
Multilocus sequence typing Multilocus sequence typing (MLST) profiles for each isolate were obtained using a pipeline implemented by Torsten Seemann [44] (version 2.6) and PubMLST schemes [45] (8 August 2016). The scaffold FASTA files obtained by PROKKA (mentioned earlier) were used as input. The output contained the closest PubMLST scheme, the corresponding sequence type (ST; if available) and the allele IDs (Supplementary Table S3). Results containing more than one missing allele were considered as unreliable. Validation set A validation set of gram-negative isolates used to assess the performance of WGS-based results was defined as follows. It included only samples identified at species level by MS-based analysis and whose species taxon was supported by the expert-driven taxonomy. If the expert-driven taxonomy contained two species taxa, then it was sufficient that one of them was the same as the MS-based taxon. If the expert-driven taxonomy included only the genus, then it was required to be the same as the genus of the MS-based species taxon. Furthermore, samples were excluded if their assembly failed or if they had <200 000 reads or if they were identified as candidate mixtures (Supplementary Table S1), resulting in 656 samples in total. For S. aureus, the validation set contained 194 isolates after removing samples because of failed assembly or if they had <200 000 reads or if they were identified as candidate mixtures (Supplementary Table S2). Results Gram-negative isolates Our data set consisted of 846 gram-negative bacterial clinical isolates collected and identified by microbiologists for diagnostic purposes (Figure 1). These isolates were part of the microbiology strain collection at Siemens Healthcare Diagnostics (West Sacramento, CA). We sequenced whole genomes of all 846 isolates using an Illumina HiSeq system and performed WGS-based taxonomic classification using in silico methods.
Additionally, a Bruker Biotyper MALDI-TOF MS system was used for taxonomic characterization where a subset of 100 isolates was reclassified in a second run. For 240 isolates, the date of collection was available covering a time span from 1986 to 2011 (Supplementary Figure S1). Furthermore, for 783 isolates, the data set included the collection location: a majority of samples was collected in North America (738), followed by Japan (23), Europe (21) and Australia (1). In the MS-based analysis of the 846 isolates, eight (0.9%) samples in the first run and four from the 100 reanalyzed samples (4%) were not identified. Only two samples remained unclassified after both runs. In total, 734 of 846 (86.8%) and 86 of 100 (86%) isolates were resolved to the species level in the first and second run, respectively, while 104 of 846 (12.3%) and 10 of 100 (10%) were classified at the genus level only. The score values varied from 1.73 to 2.57 (Supplementary Figure S4). For the 74 samples identified at the species level in both MS runs, 52 (70.3%) had concordant results. From the remaining 22 samples with divergent species taxa, 16 were assigned to the same genus. Among these, 11 samples were identified as Serratia ureilytica in the first run but reclassified as Serratia marcescens in the second run. A stronger concordance of 83 of 90 samples (92.2%) was observed at the genus level. After merging the results of both runs, 723 from 846 samples (85.5%) were classified at the species level, 114 (13.5%) at the genus level only and 9 (1.1%) were considered as unclassified (Supplementary Table S1). Next, we examined the agreement between the MS-based results and the expert-driven taxonomy over all isolates (846). Concordant species assignments (in case of two taxa, one match was sufficient) were observed in 646 (76.4%) cases (Figure 2, Supplementary Figure S5). 
Moreover, the MS-based results were supported by all WGS-based tools (for CLARK and Kraken, k = 31) in 602 (71.2%) cases where 24 (2.8%) assignments were not supported by the expert-driven taxonomy. The validation set for the evaluation of the WGS-based tools was defined comprising 656 isolates with verified species classification (Supplementary Table S1). The taxonomy of the selected isolates was additionally confirmed by the assembly-based MLST results. For species with available MLST schemes in the PubMLST database, for all samples, except one Klebsiella pneumoniae isolate, the closest reported scheme belonged to the expected species taxon (Supplementary Figure S6), and in 354 of 458 cases, a ST could be assigned (Supplementary Table S1). In the following, if not stated otherwise, only the isolates from the validation set were used for the analysis of the WGS-based results.
Figure 2. Euler diagram of the species taxa of the expert-driven and MS-based identifications. In cases where two species taxa were given by the expert-driven taxonomy, a match to one of the taxa was sufficient to report concordance. The third set lying within the MS-based taxa set represents assignments supported by CLARK (31-mers), Kaiju, DIAMOND/MEGAN, Kraken (31-mers) and TUIT. Its difference with the set of the expert-driven species taxa includes three isolates without an expert-based species assignment.
CLARK and Kraken require the k-mer length to be fixed when building the respective reference databases. In both cases, the value was set to 31, the default value of Kraken. In an additional analysis, we confirmed the observation made for Kraken and CLARK that higher k-mer values are associated with higher precision and lower values with higher sensitivity [22, 24] (Supplementary Figure S7). We examined the results obtained from CLARK, DIAMOND/MEGAN, Kaiju, Kraken and TUIT regarding the wrongly assigned species taxa and their presence in the reference data sets of the used tools (Figure 3). The numbers of incorrect classifications per species were comparable among all five tools with some exceptions: 5 Citrobacter koseri and 19 Enterobacter aerogenes samples were misclassified by DIAMOND/MEGAN and TUIT, whereas CLARK, Kaiju and Kraken yielded no misclassifications for these species; TUIT assigned all 17 Klebsiella oxytoca samples to K. pneumoniae but had only two wrong assignments for Proteus vulgaris, whereas the other tools misclassified 11 P. vulgaris samples. For CLARK and Kraken, in all 30 cases of incorrect species classifications, the expected taxon was not contained in the respective reference databases. For Kaiju, this was the case for only 4 of the 18 incorrect assignments. The databases used for DIAMOND and TUIT contained all expected species taxa. To perform a fair comparison of the tools, only isolates belonging to species included in the reference data sets of all five tools (626 samples) were considered if not stated otherwise.
Figure 3. Number of misclassified samples from the validation set per species for CLARK (31-mers), Kaiju, DIAMOND/MEGAN, Kraken (31-mers) and TUIT. Only the expected species taxa for which at least one tool yielded a misclassification were included. The number of misclassifications is provided within each cell; the higher the value, the darker the background color. For CLARK, Kaiju and Kraken, the numbers of species missing in the used databases are printed in bold and are highlighted by a black rectangle. The genus taxa were abbreviated by the first three letters.
The overlap of species assignments between all five tools was 92.3% (578 of 626 samples) (Supplementary Figure S8). CLARK and Kraken obtained the same species assignments for all samples, while the lowest concordance was found between Kaiju and TUIT (92.3%). Next, we compared the results of CLARK, DIAMOND/MEGAN, Kaiju, Kraken and TUIT with respect to the expected species taxa. For each of the five tools, the percentage of correct species- and genus-level assignments was computed (Supplementary Figure S9). The best results were obtained by CLARK and Kraken with 100% of correctly assigned species taxa followed by Kaiju (99.5%), DIAMOND/MEGAN (96%) and TUIT (92.7%). We then considered the mean classification performance computed with respect to the expected species taxa. CLARK and Kraken demonstrated comparable mean sensitivity values of 79.1 and 81.7%, respectively; DIAMOND/MEGAN had a mean sensitivity of 64.6% followed by Kaiju (31.4%) and TUIT (25.8%) (Figure 4). The highest mean precision was achieved by Kraken (96.2%) followed by CLARK (94.5%), TUIT (87.5%), Kaiju (87.2%) and DIAMOND/MEGAN (79.2%).
Regarding the F-measure, the best mean performance was achieved by Kraken (87.9%) followed by CLARK (85.5%), DIAMOND/MEGAN (71%), Kaiju (41.8%) and TUIT (34.5%). Finally, we examined the distribution of the best hit performance values for correctly and wrongly classified isolates (Supplementary Figure S10). We observed that isolates assigned to wrong species taxa tended to have a combination of lower sensitivity and precision values compared with correctly classified isolates. However, there was no clear separation of both groups. Moreover, it should be noted that the pairwise sensitivity and precision values of DIAMOND’s results were often almost identical.
Figure 4. The mean runtime (n=5) per 1 million reads measured using five randomly chosen gram-negative samples, and the mean sensitivity, precision and F-measure percentages computed with respect to the expected species taxa for CLARK (31-mers), DIAMOND/MEGAN, Kaiju, Kraken (31-mers) and TUIT. Only samples from the validation set and with expected species present in all used reference databases were used. The x-axis is square root transformed.
Besides the classification performance, computational runtime is a relevant factor in choosing which software to use, especially when analyzing large-scale data sets. Hence, we compared the runtimes of the tools based on five randomly selected samples (Figure 4). Among the tools used here, TUIT was the slowest with an average of 169 min per 1 million reads followed by DIAMOND (32.6 min).
CLARK, Kaiju and Kraken achieved the best results, requiring <3 min per 1 million reads, with Kaiju being the fastest (<1 min).

Finally, we investigated whether adapter and quality trimming of the raw reads would adversely affect the classification results, focusing on CLARK and Kraken, as they demonstrated comparable results and achieved better performance than the other tools. We compared the results obtained using raw or processed reads (for k=31). Regarding the best species hits of all samples from the validation set (656), all assignments stayed consistent when using CLARK and Kraken. Moreover, we compared the percentages of raw and trimmed reads assigned to the best hit and expected species taxa, respectively. The distribution of the absolute differences (Supplementary Figure S11) was inspected, and the 99th percentile (at the value of 2.6 for CLARK and 2.5 for Kraken) was defined as the cutoff for outlier detection. The numbers of outliers for the best hits and expected species taxa, respectively, were seven for CLARK and Kraken.

Staphylococcus aureus isolates

We whole-genome sequenced 200 S. aureus clinical isolates from the S. aureus strain collection of Saarland University Medical Center. The samples were collected between 1976 and 2014 (Supplementary Figure S1). The majority of the isolates (187) were collected in Germany, 11 were from Mozambique, 1 from Switzerland and 1 from the United States (Supplementary Table S2). In the assembly-based MLST analysis, for all isolates in the validation set except one, the closest reported MLST scheme belonged to S. aureus, and 173 isolates were assigned to known MLST profiles. To compare the WGS-based results, we performed an analysis analogous to that for the gram-negative isolates. The tools were concordant in their assignments for all isolates in the validation set except for one case where Kaiju disagreed with the other tools (Supplementary Figure S12). Only one isolate was not assigned to S. aureus by CLARK, DIAMOND/MEGAN, Kraken and TUIT; two isolates were misclassified by Kaiju (Supplementary Figure S13). CLARK and Kraken achieved high mean sensitivity values of 90.8% and 91.3%, respectively (Supplementary Figure S14). Lower values were observed for TUIT (77.1%), DIAMOND/MEGAN (75.9%) and Kaiju (10.9%). CLARK, DIAMOND/MEGAN, Kraken and TUIT demonstrated high precision values of >97%, whereas Kaiju had a lower value of 85.9%. Accordingly, CLARK and Kraken had the highest F-measure values of 94.4 and 94.7%, respectively, followed by TUIT (86.5%), DIAMOND/MEGAN (85.3%) and Kaiju (18.9%).

Discussion

Rapid and accurate pathogen characterization is essential for effective treatment of infections, facilitating patient stratification and personalized therapies. WGS is gaining importance in the analysis of bacterial cultures derived from patients with infectious diseases. Various computational approaches have been developed to perform taxonomic analysis based on WGS data. However, evaluations using newly sequenced clinical samples and complementary information confirming the taxonomy are missing. Here, we performed WGS-based taxonomic analysis of 846 gram-negative bacterial isolates and validated the results by comparing them with MS-based classifications obtained using a Bruker Biotyper MALDI-TOF MS system and confirmed by expert-driven taxonomy. Our data set included species that are frequently found to be responsible for nosocomial (hospital-acquired) infections, such as Acinetobacter baumannii, Escherichia coli, Klebsiella spp. and Pseudomonas aeruginosa [46]. Additionally, we included a data set of 200 S. aureus isolates.

To determine the concordance between the expert-driven and MS-based taxonomy of the gram-negative isolates, we first analyzed the MS-based results to determine samples with uncertain or ambiguous classifications. In general, possible limitations of MS-based analysis include, but are not restricted to, the limited differentiation of E. 
coli and some Shigella spp. isolates [7, 9, 47–49], and also inaccurate differentiation of species in other groups such as Acinetobacter [48], Citrobacter [50] and Enterobacter cloacae complex [48] and the missing identification of Salmonella isolates below the genus level [51]. In our MS-based analysis, most samples (about 86%) were classified at the species level and the taxa of samples classified only at the genus level included, among others, the genera imposing particular challenges to MS-based biotyping as mentioned above (Supplementary Table S1). We found that both MS runs produced concordant results in 52 of 74 (70.3%) and 84 of 90 (93.3%) cases at the species and at the genus level, respectively. From the 22 discordant species-level assignments, 11 isolates were first identified as S. ureilytica and then reclassified as S. marcescens. S. ureilytica is a relatively new species [52] whose identification by a MALDI-TOF MS system was shown to be challenging [50, 53], potentially explaining the observed ambiguities. In the final taxonomic assignment, the reported inconsistencies were resolved such that 723 samples were classified at the species level, 115 at the genus level only and 8 samples were considered as unclassified because of divergent assignments or failed identification. We then compared the resulting species taxa with the expert-driven results and observed a high concordance of 76.4%. However, in ca. 3% of the cases, all tested WGS-based tools were concordant with the MS-based result, which was not supported by the expert-driven classification. We could assume that in these cases, the expert-driven taxonomy was incomplete (no species taxon) or incorrect, which demonstrates the importance of using complementary information to confirm the identification results. Subsequently, we defined a validation set including only samples with confirmed taxonomy to be used for the evaluation of the WGS-based results. 
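The concordance figures reported above for repeated MS runs (for example, 52 of 74 at the species level) amount to counting isolates for which both runs returned a label and the labels agree. A minimal sketch of that bookkeeping follows; the function names and the four example isolates are hypothetical and are not taken from the study data.

```python
def genus_of(label):
    """Return the genus part of a 'Genus species' label, or None if missing."""
    return label.split()[0] if label else None


def concordance(run_a, run_b):
    """Return (agreeing, compared) over pairs where both runs gave a label."""
    pairs = [(a, b) for a, b in zip(run_a, run_b) if a and b]
    agreeing = sum(1 for a, b in pairs if a == b)
    return agreeing, len(pairs)


# Hypothetical repeated MS identifications for four isolates (illustration only).
first_run = ["Serratia ureilytica", "Escherichia coli", "Klebsiella oxytoca", None]
second_run = ["Serratia marcescens", "Escherichia coli", "Klebsiella oxytoca",
              "Proteus mirabilis"]

species_level = concordance(first_run, second_run)
genus_level = concordance([genus_of(x) for x in first_run],
                          [genus_of(x) for x in second_run])
print("species:", species_level, "genus:", genus_level)
```

Isolates lacking a label in either run are excluded from the denominator, which is one plausible reading of why the species-level and genus-level totals differ in the text; the exact rule used in the study is not stated here.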
For any taxonomic classification approach, the availability and quality of the reference data are crucial for an accurate identification. Sequence-based tools for taxonomic classification generally rely on publicly available data sources. CLARK and Kraken construct their k-mer databases from finished genomes of the NCBI RefSeq database [54, 55], which included 2785 bacterial data sets (containing chromosome and/or plasmid DNA sequences) in this study. DIAMOND and TUIT use the NCBI nonredundant protein (nr) and nucleotide (nt) collections [56], respectively. The NCBI nr collection includes data from GenBank, RefSeq, UniProtKB/Swiss-Prot, PDB and PRF; and the nt collection includes data from RefSeq and GenBank except EST, GSS, STS and HTG [56]. Kaiju can use complete genomes from the NCBI RefSeq database or the nr collection. In this study, Kaiju was run using proteins extracted from 5135 complete genomes. Considering the presence of expected species in the different reference data sets, in only few cases, the availability of the respective species was sufficient for correct classifications (Figure 3). Only TUIT was able to correctly identify most of the P. vulgaris isolates, though this species was also contained in the reference data sets of DIAMOND and Kaiju. For the single P. vulgaris genome used by Kaiju (NZ_CP012675.1), we found that the nucleotide sequence of the rpoB gene (DNA-directed RNA polymerase subunit beta, WP_004246906.1), shown to be appropriate to distinguish Proteus spp. [57], was 100% identical to the rpoB gene from the complete Proteus mirabilis genome (NZ_CP012674.1, ALE23450). Furthermore, the average nucleotide identity value [58] (http://enve-omics.ce.gatech.edu/ani/index) of both genomes was  >99.9%. Based on these findings, we assumed that NZ_CP012675.1 was misclassified explaining why Kaiju was not able to correctly identify the P. vulgaris isolates. 
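To illustrate why the content of the reference database matters so much for k-mer-based tools, the following toy sketch classifies a read by exact k-mer matches against labeled reference sequences. It is a simplified illustration only and is not the implementation of CLARK or Kraken: the reference sequences, taxon names and k=5 are hypothetical, reverse complements are ignored, and k-mers shared between taxa are simply skipped rather than resolved.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references, k):
    """Map each k-mer to the set of taxa whose reference contains it."""
    index = {}
    for taxon, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(taxon)
    return index


def classify(read, index, k):
    """Vote with k-mers found in exactly one taxon; return None if no hit."""
    votes = Counter()
    for kmer in kmers(read, k):
        taxa = index.get(kmer, ())
        if len(taxa) == 1:
            votes[next(iter(taxa))] += 1
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical toy references; real databases hold whole genomes and use k=31.
references = {"taxon_A": "ACGTACGGTCA", "taxon_B": "TTGACCGTAAG"}
index = build_index(references, 5)
print(classify("CGTACGG", index, 5), classify("GGGGGGG", index, 5))
```

In such a scheme, a reference genome filed under the wrong species label contributes its k-mers to the wrong taxon, so reads from correctly identified isolates can be pulled toward an incorrect assignment.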
We hypothesized that similar reasons may be responsible for other incorrect assignments in which the expected species was present in the database used but the isolates were nevertheless misclassified.

Focusing only on species presumably contained in all databases, our analysis demonstrated that WGS-based identification approaches can yield highly accurate results. All tools classified >92% (best result 100%) of the gram-negative samples correctly. CLARK and Kraken demonstrated the best mean sensitivity of about 80%, followed by DIAMOND/MEGAN with 64.4%. Kaiju and TUIT had a comparatively low mean sensitivity (31.4 and 25.8%) but better precision (87.2 and 87.5%) than DIAMOND/MEGAN (79.2%). The highest precision was observed for Kraken (96.2%) and CLARK (94.5%). The low sensitivity of TUIT may be due to missing matches to the reference database or to default TUIT cutoff values that were too strict for the assignment of the LCAs. The authors of TUIT suggest a trial-and-error procedure to adjust the cutoffs [21], but this was precluded by the high computational runtime of the tool. In contrast, the default parameters used for assignment filtering and LCA assignment with MEGAN for the DIAMOND results appeared to be too permissive, as the sensitivity and precision values were close to each other. Furthermore, it should be noted that CLARK, Kaiju and Kraken may have benefited from using paired-end data, while DIAMOND and TUIT were run on forward reads only. Moreover, DIAMOND and Kaiju are only able to classify protein-coding sequences because the reads are matched to protein databases, which affects their sensitivity. Though we observed a tendency toward lower performance values for incorrectly classified isolates, some wrongly and correctly assigned isolates demonstrated comparable sensitivity and precision values.
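The sensitivity, precision and F-measure values discussed above can be related to per-sample read counts as in the following sketch. The exact definitions are given in the Methods, which are not reproduced here, so the sketch assumes that sensitivity is the fraction of all reads assigned to the expected species, that precision is the fraction of classified reads assigned to it, and that the F-measure is their harmonic mean. The function name and the read counts are hypothetical.

```python
def read_metrics(n_expected, n_classified, n_total):
    """Return (sensitivity, precision, F-measure) for one sample.

    Assumed definitions (not confirmed by the text above):
      sensitivity = reads assigned to expected species / all reads
      precision   = reads assigned to expected species / classified reads
    """
    sensitivity = n_expected / n_total if n_total else 0.0
    precision = n_expected / n_classified if n_classified else 0.0
    denom = sensitivity + precision
    f_measure = 2 * sensitivity * precision / denom if denom else 0.0
    return sensitivity, precision, f_measure


# Hypothetical read counts: (expected species, classified, total).
samples = [(600, 800, 1000), (300, 600, 1000)]
per_sample = [read_metrics(*counts) for counts in samples]

# A mean F-measure is taken here as the average of per-sample F-measures,
# which generally differs from the F-measure of the mean sensitivity
# and mean precision.
mean_f = sum(f for _, _, f in per_sample) / len(per_sample)
print(per_sample, mean_f)
```

Under these assumed definitions, a tool that assigns nearly every read to some species has a classified-read count close to the total, so its sensitivity and precision converge, which matches the pattern described for DIAMOND/MEGAN.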
We hypothesized that these cases may include either undetected contaminated isolates or isolates belonging to a species missing in the database but closely related to other species of the same genus with available reference data. These isolates would require a closer examination, e.g. considering all taxa exceeding a reasonable abundance cutoff. The overall results demonstrate the importance of a comprehensive and representative reference database for a successful and precise taxonomic classification, which is even more crucial within a clinical setting.

The performance of the WGS-based tools on the S. aureus data set was comparable with the observations made for the gram-negative isolates: at least 99% of all samples were classified as S. aureus; the exceptions were one sample (ID 191) classified as Enterococcus faecium by Kaiju and one sample (ID 80) identified as Staphylococcus carnosus by all five tools. In the latter case, we could assume that the MS-based taxon was wrong or that a wrong sample was used for WGS. The highest sensitivity was achieved by CLARK and Kraken (>90%), and all tools except Kaiju had a mean precision >97%.

Considering the runtime of the tested WGS-based tools, CLARK, Kaiju and Kraken were substantially faster than DIAMOND and TUIT, requiring only a few minutes to process a million reads. Furthermore, we also evaluated the robustness of CLARK and Kraken with respect to adapter and quality trimming of the raw reads and observed that this procedure had only marginal effects on the classification results.

In summary, the k-mer and exact matching-based tools CLARK and Kraken appear to be the primary choices with respect to classification performance, usability and runtime among the approaches tested here. Kaiju represents an appealing alternative, as it was faster than CLARK and Kraken and requires no parameters to be set to create a reference database. However, it should be kept in mind that the tool can classify only protein-coding sequences.
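The closer examination suggested above for ambiguous isolates, namely listing all taxa that exceed an abundance cutoff instead of only the best hit, could look like the following sketch. The cutoff of 5% is an arbitrary placeholder (the text does not specify a value), and the taxon names and read counts are invented for illustration.

```python
def taxa_above_cutoff(read_counts, min_fraction=0.05):
    """Return (taxon, fraction) pairs at or above min_fraction, largest first.

    min_fraction is a hypothetical default; a suitable cutoff would have to
    be chosen and validated for the data at hand.
    """
    total = sum(read_counts.values())
    if total == 0:
        return []
    hits = [(taxon, count / total) for taxon, count in read_counts.items()
            if count / total >= min_fraction]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical per-taxon read counts for a single isolate.
counts = {"taxon_A": 700, "taxon_B": 270, "taxon_C": 30}
for taxon, fraction in taxa_above_cutoff(counts):
    print(f"{taxon}: {fraction:.1%}")
```

A second taxon with a substantial share of the reads, as in this invented example, would flag the isolate for follow-up as possibly contaminated or as belonging to a species that is absent from the database.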
Overall, taxonomic classification of bacterial isolates based on WGS data provides highly accurate results and thus represents a promising alternative to MS-based biotyping. WGS would also enable further analyses, such as phylogenetic analysis and genotyping, which are mandatory for surveillance of outbreaks and antimicrobial resistance.

However, multiple issues have to be addressed before WGS-based approaches can be applied in clinical settings. The effect of different library preparation and sequencing methods on the quality of the identification results should be investigated and quantified. As the comprehensiveness and the quality of the reference database have a high impact on the reliability of the taxonomic assignments, a careful selection and validation of the reference data would be necessary. This holds especially for organisms represented by only one genome, as seen in the case of the (most likely mislabeled) P. vulgaris genome used by Kaiju. Large-scale efforts, such as the ‘Genomic Encyclopedia of Bacteria and Archaea’ [59] and its pilot studies, and the ‘100K Foodborne Pathogen Genome Project’ [60] are expected to greatly expand the volume and diversity of available reference genomes.

An additional aspect is the genomic variability of bacteria and in particular the differences between pathogenic and nonpathogenic species. The lifestyle of a bacterium influences to a great extent its genome size and variability. Pathogenic species represent specialized organisms leading an allopatric lifestyle and are characterized by a significantly reduced genome compared with species from a sympatric environment [61, 62]. Furthermore, the genome of a bacterial strain can be seen as a composition of ‘core’ genes, conserved among many strains of the same species, and ‘accessory’ genes, which vary between different strains [62]. The set of all genes found in a species is referred to as a pan-genome [63].
The core genome similarity is considered to be a good approach to define bacterial species relevant for humans [62]. However, it has also been proposed to apply pan-genome analysis to redefine bacterial species [64, 65].

Another important point is the fact that many bacterial organisms cannot be grown in the laboratory, which makes their identification challenging [66]. Single-cell sequencing is considered to be a promising solution for this problem [67]. In our analysis, we focused on cultured isolates, as their accurate identification can be seen as a necessary requirement for future, culture-independent studies.

Regarding the underlying concept of the classification tools, we focused in this study on alignment-based methods. However, there also exist alignment-free approaches (e.g. PhyloPythia/S/S+ [27–29], RAIphy [26] and the approach of Vervier et al. [25]), and approaches combining alignment-based and alignment-free similarity measures (e.g. Borozan et al. [68]). Finally, exhaustive testing procedures using high-quality validation data, including relevant human pathogens, should be performed to assess the reliability and accuracy of the implemented method.

Key Points

k-mer-based taxonomic information derived from WGS data allows for accurate and fast classification of bacterial clinical isolates at species level and is thus an appealing alternative to MS-based analysis.
Establishing a high-quality reference database as well as its continuous extension is vital for the correctness of the taxonomic classifications.
The evaluation of taxonomic classification tools should include complementary information to confirm the taxonomy of the underlying data.

Supplementary Data

Supplementary data are available online at http://bib.oxfordjournals.org/.

Valentina Galata is a PhD student at the Chair for Clinical Bioinformatics at Saarland University. Christina Backes is Postdoc at the Chair for Clinical Bioinformatics at Saarland University. 
Cédric Christian Laczny is Postdoc at the Chair for Clinical Bioinformatics at Saarland University. Georg Hemmrich-Stanisak is a research scientist at the ICMB at the Christian-Albrechts-University of Kiel in Germany. Howard Li, PhD, worked as a system development technical lead at Roche Molecular Systems. His expertise includes sample preparation, laboratory automation and medical device R&D. He is a member of American Chemical Society and American Society for Microbiology. Laura Smoot, PhD, is a microbiologist with training and research in molecular microbiology field, and has experience in in vitro diagnostics industry. She worked at Siemens Healthcare, R&D. She is a member of American Society for Microbiology (ASM) and European Society of Clinical Microbiology, and Infectious Diseases (ESCMID). Dr Andreas Emanuel Posch is Senior Key Expert for Bioinformatics and Systems Biology at Siemens Healthcare, In Vitro Diagnostics and Bioscience R&D. Dr Susanne Schmolke is Senior Project Manager Strategy at Siemens Healthcare. Her expertise includes virology, molecular biology and medical device industry. Markus Bischoff is professor and senior scientist at the Institute of Medical Microbiology and Hygiene at Saarland University. Lutz von Müller was vice head of the Institute of Medical Microbiology and Hygiene at the Saarland University. Current position: Head of Institute of Laboratory Medicine, Microbiology and Hygiene, Christophorus Hospitals, Coesfeld, Germany. Dr Achim Plum is Managing Director of Curetis GmbH and molecular geneticist by training. His areas of expertise include precision medicine and companion diagnostics, biomarker discovery and validation, IVD industry and molecular diagnostics. Andre Franke, PhD, is the director of the ICMB at the Christian-Albrechts-University of Kiel in Germany. 
The primary foci of his research are high-throughput analyses, laboratory automation, next generation sequencing, chronic inflammatory diseases, GWAS, and bioinformatics. He is a member of the German Society for Human Genetics (GfH) and the German Society for Internal Medicine (DGIM). Andreas Keller is professor and head of the Chair for Clinical Bioinformatics at Saarland University.

Acknowledgement

The authors would like to thank Siemens Healthcare and Curetis GmbH for the support as well as the provided data set. The authors also thank Mathias Herrmann for comments that greatly improved the manuscript.

Funding

Siemens Healthcare and in parts by the Best Ageing (grant number 306031) from the European Union.

Availability of WGS data

The raw WGS data are available on a reasonable request for noncommercial use after signing a nondisclosure agreement.

References

1 Greatorex J, Ellington MJ, Köser CU, et al. New methods for identifying infectious diseases. Br Med Bull 2014; 112: 27–35. doi: 10.1093/bmb/ldu027
2 Didelot X, Bowden R, Wilson DJ, et al. Transforming clinical microbiology with bacterial genome sequencing. Nat Rev Genet 2012; 13: 601–12. doi: 10.1038/nrg3226
3 Janda JM, Abbott SL. 16S rRNA gene sequencing for bacterial identification in the diagnostic laboratory: pluses, perils, and pitfalls. J Clin Microbiol 2007; 45: 2761–4. doi: 10.1128/JCM.01228-07
4 Rajendhran J, Gunasekaran P. Microbial phylogeny and diversity: small subunit ribosomal RNA sequence analysis and beyond. Microbiol Res 2011; 166: 99–110. doi: 10.1016/j.micres.2010.02.003
5 van Veen SQ, Claas ECJ, Kuijper EJ. High-throughput identification of bacteria and yeast by matrix-assisted laser desorption ionization-time of flight mass spectrometry in conventional medical microbiology laboratories. J Clin Microbiol 2010; 48: 900–7. doi: 10.1128/JCM.02071-09
6 Seng P, Drancourt M, Gouriet F, et al. Ongoing revolution in bacteriology: routine identification of bacteria by matrix-assisted laser desorption ionization time-of-flight mass spectrometry. Clin Infect Dis 2009; 49: 543–51. doi: 10.1086/600885
7 Bizzini A, Durussel C, Bille J, et al. Performance of matrix-assisted laser desorption ionization-time of flight mass spectrometry for identification of bacterial strains routinely isolated in a clinical microbiology laboratory. J Clin Microbiol 2010; 48: 1549–54. doi: 10.1128/JCM.01794-09
8 Köser CU, Ellington MJ, Cartwright EJP, et al. Routine use of microbial whole genome sequencing in diagnostic and public health microbiology. PLoS Pathog 2012; 8: e1002824. doi: 10.1371/journal.ppat.1002824
9 Croxatto A, Prod’hom G, Greub G, et al. Applications of MALDI-TOF mass spectrometry in clinical diagnostic microbiology. FEMS Microbiol Rev 2012; 36: 380–407. doi: 10.1111/j.1574-6976.2011.00298.x
10 Mahé P, Arsac M, Chatellier S, et al. Automatic identification of mixed bacterial species fingerprints in a MALDI-TOF mass-spectrum. Bioinformatics 2014; 30: 1280–6. doi: 10.1093/bioinformatics/btu022
11 Zhang L, Smart S, Sandrin TR. Biomarker- and similarity coefficient-based approaches to bacterial mixture characterization using matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS). Sci Rep 2015; 5: 15834. doi: 10.1038/srep15834
12 Sandrin TR, Demirev PA. Using mass spectrometry to identify and characterize bacteria. Microb 2014; 9: 23–9.
13 Mazzeo MF, Sorrentino A, Gaita M, et al. Matrix-assisted laser desorption ionization-time of flight mass spectrometry for the discrimination of food-borne microorganisms. Appl Environ Microbiol 2006; 72: 1180–9. doi: 10.1128/AEM.72.2.1180-1189.2006
14 Böhme K, Fernández-No IC, Barros-Velázquez J, et al. SpectraBank: an open access tool for rapid microbial identification by MALDI-TOF MS fingerprinting. Electrophoresis 2012; 33: 2138–42. doi: 10.1002/elps.201200074
15 Spectra: Extended spectra database for microorganism identification by MALDI-TOF MS. http://spectra.folkhalsomyndigheten.se/spectra/ (5 April 2016, date last accessed).
16 Nicolau A, Sequeira L, Santos C, Mota M. Matrix-assisted laser desorption/ionisation time-of-flight mass spectrometry (MALDI-TOF MS) applied to diatom identification: influence of culturing age. Aquat Biol 2014; 20: 139–44. doi: 10.3354/ab00548
17 Veloo ACM, Elgersma PE, Friedrich AW, et al. The influence of incubation time, sample preparation and exposure to oxygen on the quality of the MALDI-TOF MS spectrum of anaerobic bacteria. Clin Microbiol Infect 2014; 20: O1091–7. doi: 10.1111/1469-0691.12644
18 Segata N, Waldron L, Ballarini A, et al. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat Methods 2012; 9: 811–4. doi: 10.1038/nmeth.2066
19 Liu B, Gibbons T, Ghodsi M, et al. Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences. BMC Genomics 2011; 12 (Suppl 2): S4. doi: 10.1186/1471-2164-12-S2-S4
20 Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods 2014; 12: 59–60. doi: 10.1038/nmeth.3176
21 Tuzhikov A, Panchin A, Shestopalov VI. TUIT, a BLAST-based tool for taxonomic classification of nucleotide sequences. Biotechniques 2014; 56: 78–84. doi: 10.2144/000114135
22 Ounit R, Wanamaker S, Close TJ, Lonardi S. CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics 2015; 16: 236. doi: 10.1186/s12864-015-1419-2
23 Menzel P, Lee Ng K, Krogh A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat Commun 2016; 7: 11257.
24 Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol 2014; 15: R46. doi: 10.1186/gb-2014-15-3-r46
25 Vervier K, Mahé P, Tournoud M, et al. Large-scale machine learning for metagenomics sequence classification. Bioinformatics 2016; 32: 1023–32. doi: 10.1093/bioinformatics/btv683
26 Nalbantoglu OU, Way SF, Hinrichs SH, et al. RAIphy: phylogenetic classification of metagenomics samples using iterative refinement of relative abundance index profiles. BMC Bioinformatics 2011; 12: 41. doi: 10.1186/1471-2105-12-41
27 McHardy AC, Martín HG, Tsirigos A, et al. Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods 2007; 4: 63–72. doi: 10.1038/nmeth976
28 Patil KR, Roune L, McHardy AC, et al. The PhyloPythiaS web server for taxonomic assignment of metagenome sequences. PLoS One 2012; 7: e38581. doi: 10.1371/journal.pone.0038581
29 Gregor I, Dröge J, Schirmer M, et al. PhyloPythiaS+: a self-training method for the rapid reconstruction of low-ranking taxonomic bins from metagenomes. Peer J 2016; 4: e1603.
30 Ondov BD, Bergman NH, Phillippy AM. Interactive metagenomic visualization in a web browser. BMC Bioinformatics 2011; 12: 385. doi: 10.1186/1471-2105-12-385
31 Vajapey U, Ying A, Hennig G, Li H. Validation of a fully-automated nucleic acid extraction method for fresh frozen tissue using the Siemens tissue preparation solution. Assoc Mol Pathol 2013; 15: 843–945.
32 Guettouche T, Rantus J, Slosek K, et al. A workflow enabling whole exome and whole genome sequencing of formalin fixed paraffin embedded samples with minimal amounts of DNA. Adv Genome Biol Technol 2013. https://www.kapabiosystems.com/assets/Guettouche_A-Workflow-Enabling-Whole-Exome-and-Whole-Genome-Sequencing-of-FFPE-Samples-with-Minimal-Amounts-of-DNA_AGBT_2013.pdf (15 November 2016, date last accessed).
33 van Eijk R, Stevens L, Morreau H, van Wezel T. Assessment of a fully automated high-throughput DNA extraction method from formalin-fixed, paraffin-embedded tissue for KRAS, and BRAF somatic mutation analysis. Exp Mol Pathol 2013; 94: 121–5. doi: 10.1016/j.yexmp.2012.06.004
34 FastQC: a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (31 May 2016, date last accessed).
35 Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 2016; btw354. doi: 10.1093/bioinformatics/btw354
36 Chamberlain SA, Szöcs E. taxize: taxonomic search and retrieval in R. F1000Research 2013; 2: 191. doi: 10.12688/f1000research.2-191.v2
37 Sayers EW, Barrett T, Benson DA, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2009; 37: D5–15. doi: 10.1093/nar/gkn741
38 Huson DH, Auch AF, Qi J, Schuster SC. MEGAN analysis of metagenomic data. Genome Res 2007; 17: 377–86. doi: 10.1101/gr.5969107
39 FASTX-Toolkit. http://hannonlab.cshl.edu/fastx_toolkit/ (27 October 2015, date last accessed).
40 Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014; 30: 2114–20. doi: 10.1093/bioinformatics/btu170
41 Bankevich A, Nurk S, Antipov D, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 2012; 19: 455–77. doi: 10.1089/cmb.2012.0021
42 Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 2014; 30: 2068–9. doi: 10.1093/bioinformatics/btu153
43 Dupont CL, Rusch DB, Yooseph S, et al. Genomic insights to SAR86, an abundant and uncultivated marine bacterial lineage. ISME J 2012; 6: 1186–99. doi: 10.1038/ismej.2011.189
44 Seemann T. MLST pipeline. https://github.com/tseemann/mlst (8 August 2016, date last accessed).
45 PubMLST. www.pubmlst.org (8 August 2016, date last accessed).
46 Peleg AY, Hooper DC. Hospital-acquired infections due to gram-negative bacteria. N Engl J Med 2010; 362: 1804–13. doi: 10.1056/NEJMra0904124
47 Murray PR. What is new in clinical microbiology-microbial identification by MALDI-TOF mass spectrometry: a paper from the 2011 William Beaumont Hospital Symposium on molecular pathology. J Mol Diagn 2012; 14: 419–23. doi: 10.1016/j.jmoldx.2012.03.007
48 Patel R. MALDI-TOF MS for the diagnosis of infectious diseases. Clin Chem 2015; 61: 100–11. doi: 10.1373/clinchem.2014.221770
49 Du Z, Li L, Chen C-F, et al. G-SESAME: web tools for GO-term-based gene similarity analysis and knowledge discovery. Nucleic Acids Res 2009; 37: W345–9.
50 Eigner U, Holfelder M, Oberdorfer K, et al. Performance of a matrix-assisted laser desorption ionization-time-of-flight mass spectrometry system for the identification of bacterial isolates in the clinical routine laboratory. Clin Lab 2009; 55: 289–96.
51 Chen JH, Ho P-L, Kwan GS, et al. Direct bacterial identification in positive blood cultures by use of two commercial matrix-assisted laser desorption ionization-time of flight mass spectrometry systems. J Clin Microbiol 2013; 51: 1733–9. doi: 10.1128/JCM.03259-12
52 Bhadra B, Roy P, Chakraborty R. Serratia ureilytica sp. nov., a novel urea-utilizing species. Int J Syst Evol Microbiol 2005; 55: 2155–8. doi: 10.1099/ijs.0.63674-0
53 Seng P, Abat C, Rolain JM, et al. Identification of rare pathogenic bacteria in a clinical microbiology laboratory: impact of matrix-assisted laser desorption ionization-time of flight mass spectrometry. J Clin Microbiol 2013; 51: 2182–94. doi: 10.1128/JCM.00492-13
54 Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2005; 33: D501–4. doi: 10.1093/nar/gki025
55 Pruitt KD, Tatusova T, Brown GR, Maglott DR. NCBI reference sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res 2011; 40: D130–5. doi: 10.1093/nar/gkr1079
56 NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2013; 41: D8–20. doi: 10.1093/nar/gks1189
57 Giammanco GM, Grimont PA, Grimont F, et al. Phylogenetic analysis of the genera Proteus, Morganella and Providencia by comparison of rpoB gene sequences of type and clinical strains suggests the reclassification of Proteus myxofaciens in a new genus, Cosenzaea gen. nov., as Cosenzaea myxofaciens comb. nov. Int J Syst Evol Microbiol 2011; 61: 1638–44. doi: 10.1099/ijs.0.021964-0
58 Goris J, Konstantinidis KT, Klappenbach JA, et al. DNA-DNA hybridization values and their relationship to whole-genome sequence similarities. Int J Syst Evol Microbiol 2007; 57: 81–91. doi: 10.1099/ijs.0.64483-0
59 Wu D, Hugenholtz P, Mavromatis K, et al. A phylogeny-driven genomic encyclopaedia of bacteria and archaea. Nature 2009; 462: 1056–60. doi: 10.1038/nature08656
60 Timme RE, Allard MW, Luo Y, et al. Draft genome sequences of 21 Salmonella enterica serovar enteritidis strains. J Bacteriol 2012; 194: 5994–5. doi: 10.1128/JB.01289-12
61 Georgiades K, Raoult D. Defining pathogenic bacterial species in the genomic era. Front Microbiol 2010; 1: 151. doi: 10.3389/fmicb.2010.00151
62 Segerman B. The genetic integrity of bacterial species: the core genome and the accessory genome, two different stories. Front Cell Infect Microbiol 2012; 2: 116. doi: 10.3389/fcimb.2012.00116
63 Medini D, Donati C, Tettelin H, et al. The microbial pan-genome. Curr Opin Genet Dev 2005; 15: 589–94. doi: 10.1016/j.gde.2005.09.006
64 Caputo A, Merhej V, Georgiades K, et al. Pan-genomic analysis to redefine species and subspecies based on quantum discontinuous variation: the Klebsiella paradigm. Biol Direct 2015; 10: 55. doi: 10.1186/s13062-015-0085-2
65 Rouli L, Merhej V, Fournier P-E, Raoult D. The bacterial pangenome as a new tool for analysing pathogenic bacteria. New Microbes New Infect 2015; 7: 72–85. doi: 10.1016/j.nmni.2015.06.005
66 Stewart EJ. Growing unculturable bacteria. J Bacteriol 2012; 194: 4151–60. doi: 10.1128/JB.00345-12
67 Gawad C, Koh W, Quake SR. Single-cell genome sequencing: current state of the science. Nat Rev Genet 2016; 17: 175–88. doi: 10.1038/nrg.2015.16
68 Borozan I, Watt S, Ferretti V. Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification. Bioinformatics 2015; 31: 1396–404. doi: 10.1093/bioinformatics/btv006

© The Authors 2016. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)
Briefings in Bioinformatics, Oxford University Press

Publisher
Oxford University Press
ISSN
1467-5463
eISSN
1477-4054
D.O.I.
10.1093/bib/bbw122

Keywords: bacteria, taxonomy, MALDI-TOF MS, whole-genome next-generation sequencing

Introduction

In the light of the global increase of antibiotic-resistant microorganisms, rapid and accurate pathogen characterization (i.e. the classification of pathogens into organism groups) is essential for an effective treatment of infectious diseases [1]. This facilitates patient stratification and personalized therapies. Several approaches have been developed for the taxonomic characterization of bacterial isolates. The classical microbiological approaches are built on a large basis of constantly revised expert knowledge and typically involve Gram staining, analysis of culture growth, phenotype and biochemical reaction patterns [2]. These methods are increasingly augmented by high-throughput molecular methods such as 16S ribosomal RNA (rRNA) gene sequencing [3]. However, the taxonomic resolution based on the 16S rRNA gene alone is limited [3, 4].

Another alternative is taxonomic analysis using matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS), where the obtained protein mass spectra are compared against a reference database [5]. This proteome-based approach is characterized by high accuracy [5–7], low operating costs and quick turn-around times [8, 9]. Usually, MALDI-TOF MS-based analysis is applied to pure cultures, but there are studies performing taxonomic classification of mixed microbial communities [10, 11]. The reference spectra are typically not disclosed by the manufacturers, thus hampering the study and expansion of the existing references [12]. However, several attempts to create a publicly available database exist [13, 14], with the database ‘Spectra’ [15], curated by the Public Health Agency of Sweden, providing a sizeable data source containing >5000 spectra of bacteria and fungi. Although various factors might negatively affect the analysis outcome, e.g.
sample preparation or age, limiting the result comparability between laboratories [7, 12, 16, 17], MS-based pathogen identification is applied in many diagnostic laboratories [1]. In part, driven by decreasing costs and faster turn-around times, whole-genome sequencing (WGS) has gained importance for the identification of pathogens as well as for antimicrobial resistance analyses and outbreak monitoring [2]. The existing sequencing-based taxonomic classification tools can be organized into two groups—tools relying on specific marker genes/sequences (e.g. MetaPhlAn [18] and MetaPhyler [19]) and whole-genome-based tools. The whole-genome-based approaches assign input sequences to taxa using alignments (e.g. DIAMOND [20] and TUIT [21]), k-mer matching (e.g. CLARK [22], Kaiju [23] and Kraken [24]) or alignment-free methods (e.g. Vervier et al. [25], RAIphy [26] and PhyloPythia [27–29]). While the tools’ performances are evaluated in the respective publications, these performance evaluations rely on publicly available data typically generated in independent, earlier experiments. Hence, incomplete or suboptimal annotations, e.g. because of contaminating sequences, are expected to exert negative effects on the evaluations. Importantly, the use of complementary information, such as MALDI-TOF MS-based taxonomic classification, is missing. This is of particular importance in the context of clinically relevant pathogen identification [2]. Here, we newly sequenced 846 pathogenic, gram-negative bacterial clinical isolates [including, among others, Escherichia spp. (22%), Proteus spp. (14%), Klebsiella spp. (16%), Pseudomonas spp. (11%), Enterobacter spp. (6%), Salmonella spp. (8%) and Acinetobacter spp. (6%)] and evaluated the classification performance of a set of WGS-based taxonomic classification tools (CLARK, DIAMOND/MEGAN, Kaiju, Kraken and TUIT). 
The ‘ground truth’ taxonomic assignments were established by confirming the expert-driven taxonomy using a Bruker Biotyper MALDI-TOF MS system. Moreover, we newly sequenced 200 Staphylococcus aureus isolates and performed the same analysis as for the gram-negative bacteria, where the ‘ground truth’ comprised only the MS-based taxonomy. Our results demonstrated that certain WGS-based approaches allow for an accurate taxonomic classification and thus can be considered promising alternatives to MS-based biotyping. Moreover, the complementary information of the protein mass spectra is a powerful alternative to relying on existing, yet potentially misleading, publicly available data.

Materials and methods

Bacterial isolates

Our first data set consisted of 846 gram-negative bacterial clinical isolates collected for diagnostic purposes. The isolates were characterized by microbiologists from the respective laboratory according to the institutional guidelines for routine clinical microbiological testing, which was state of the art at the time of testing (Supplementary Table S1). The overview of the taxonomic assignments was created with Krona [30] (Figure 1). The samples are part of the microbiology strain collection of Siemens Healthcare Diagnostics (West Sacramento, CA). For 240 isolates, the data set included the collection date (Supplementary Figure S1), and for 783 isolates, the collection location (country or continent) was provided (Supplementary Table S1).

Figure 1. Taxonomic composition of the 846 gram-negative isolates based on expert-driven taxonomy.

The second data set included 200 S. aureus clinical isolates, which are part of the S. aureus strain collection of Saarland University Medical Center.
For all isolates, the location of the isolation (country) and the isolation year (except for one sample) were provided (Supplementary Table S2, Supplementary Figure S1).

DNA extraction

Four streaks of each gram-negative bacterial isolate were cultured on trypticase soy agar containing 5% sheep blood, and cell suspensions were made in sterile 1.5 ml collection tubes containing 50 µl Nuclease-Free Water (AM9930, Life Technologies). Bacterial isolate samples were stored at −20 °C until nucleic acid extraction. The Tissue Preparation System (TPS) (096D0382-02_01_B, Siemens) and the VERSANT® Tissue Preparation Reagents (TPR) kit (10632404B, Siemens) were used to extract DNA from these bacterial isolates. TPS for nucleic acid extraction has been described previously [31–33]. Before extraction, the bacterial isolates were thawed at room temperature and pelleted at 2000g for 5 s. The DNA extraction protocol DNAext was used for complete total nucleic acid extraction of samples. The total nucleic acid eluates were then transferred into 96-well quantitative polymerase chain reaction (qPCR) detection plates (401341, Agilent Technologies) for RNase A digestion, DNA quantitation and plate DNA concentration standardization. RNase A (AM2271, Life Technologies), diluted in nuclease-free water following the manufacturer’s instructions, was added to 50 µl of the total nucleic acid eluate for a final working concentration of 20 µg/ml. The enzyme and eluate mixture was incubated at 37 °C for 30 min using the Siemens VERSANT® Amplification and Detection instrument. DNA from the RNase-digested eluate was quantitated using the Quant-iT™ PicoGreen dsDNA Assay (P11496, Life Technologies) following the assay kit instructions, and fluorescence was determined on the Siemens VERSANT® Amplification and Detection instrument.
In total, 25 µl of the quantitated DNA eluates were transferred into a new 96-well PCR plate for plate DNA concentration standardization before library preparation. Elution buffer from the TPR kit was used to adjust DNA concentration. The standardized DNA eluate plate was then stored at −80 °C until library preparation.

Pure isolates from the S. aureus data set were grown overnight in brain heart infusion liquid culture with regular shaking (3 ml, 150 rpm). In total, 1 ml of overnight culture was centrifuged (10 min at 5000g), and the pellet was resuspended in P1 buffer (Qiagen), supplemented with 4 µl lysostaphin (10 mg/ml, frozen stock solution, Sigma) and incubated at 37 °C (30 min, 900 rpm) for enzymatic digestion of S. aureus cell walls. Proteinase K extraction was performed at 56 °C (30 min) by adding 300 µl lysis buffer and proteinase K solution (Maxwell 16 LEV Blood DNA kit, Promega). Following automated nucleic acid isolation (Promega Maxwell), culture extracts were eluted in 75 µl nuclease-free water. The quality of the high-molecular-weight DNA (i.e. the absence of DNA degradation) was confirmed by standard agarose gel electrophoresis.

Next-generation sequencing

Before library preparation, quality control of isolated bacterial DNA was conducted using a Qubit 2.0 Fluorometer (Qubit dsDNA BR Assay Kit, Life Technologies) and an Agilent 2200 TapeStation (Genomic DNA ScreenTape, Agilent Technologies). Next-generation sequencing libraries were prepared in 96-well format using the Nextera XT DNA Sample Preparation Kit and the Nextera XT Index Kit for 96 indexes (Illumina) according to the manufacturer’s protocol. The resulting sequencing libraries were quantified in a qPCR-based approach using the KAPA SYBR FAST qPCR MasterMix Kit (Peqlab) on a ViiA 7 Real-Time PCR System (Life Technologies). In total, 96 samples were pooled per lane for paired-end sequencing (2×100 bp) on Illumina HiSeq 2000 or HiSeq 2500 sequencers using TruSeq PE Cluster v3 and TruSeq SBS v3 sequencing chemistry (Illumina).
Basic sequencing quality parameters were determined using the FastQC quality control tool for high-throughput sequence data [34], and the reports were summarized using MultiQC (version 0.8) [35] for the gram-negative isolates and S. aureus, respectively (Supplementary Tables S3 and S4, Supplementary Figures S2 and S3). A subset of gram-negative samples was resequenced because of low read coverage in the initial run; data of both runs were subsequently merged (Supplementary Table S1).

Proteome-based identification

Bacterial isolates were cultured on trypticase soy agar containing 5% sheep blood (BD BBL) and incubated at 35 °C for 18–24 h. Isolates were subjected to MALDI-TOF MS analysis using the Bruker Biotyper 3.1.65 (Bruker Daltonics, Bremen, Germany). Isolated colonies were directly smeared onto a polished steel target plate (Bruker Daltonics). Matrix (α-cyano-4-hydroxycinnamic acid, Bruker Daltonics), reconstituted as recommended by the manufacturer (50% acetonitrile, 47.5% water and 2.5% trifluoroacetic acid), was added to the cellular material on the target plate. Following successful calibration with the Bacterial Test Standard (Bruker Daltonics), bacterial isolates were tested on the Bruker Biotyper (flexControl version 3.3.108.0 and flexAnalysis 3.3.80.0) following the manufacturer’s instructions. Mass spectra were obtained, and scores were generated. Scores of ≥2.0 were considered probable species identifications, scores of ≥1.7 but <2.0 were considered probable genus identifications and scores of <1.7 were considered not reliable identifications, i.e. as not identified. The cutoff values were defined according to the manufacturer’s guidelines. Of the 846 isolates analyzed with the Bruker Biotyper system, 100 samples were retested in a second run because they were not identified or yielded an ambiguous classification in the first run. Best hits of the first and the second run were considered as the classification results.
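The score cutoffs stated above can be written down as a small function. The following Python sketch is illustrative only: the function name `identification_level` and the returned labels are hypothetical and are not part of the Bruker software or of the original analysis.

```python
def identification_level(score: float) -> str:
    """Map a Biotyper-style score to an identification level.

    Cutoffs follow the text: >=2.0 probable species, >=1.7 but <2.0
    probable genus, <1.7 not reliable (treated as not identified).
    The label strings are an assumption made for this sketch.
    """
    if score >= 2.0:
        return "species"
    if score >= 1.7:
        return "genus"
    return "not identified"


if __name__ == "__main__":
    # Example scores chosen for illustration, not taken from the study data.
    for s in (2.31, 1.85, 1.52):
        print(s, identification_level(s))
```

The boundaries are inclusive on the lower end, so a score of exactly 2.0 counts as a species-level call and exactly 1.7 as a genus-level call.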
These included either species- or genus-level assignments with respect to the score cutoff values as described above. For all species assignments, the corresponding genus was determined using the R-package taxize [36] and the NCBI taxonomy database [37] (accessed 4 October 2016). The results of both runs were consolidated as follows: if the runs disagreed on the species level but not on the genus level, then only the genus was saved; if the runs disagreed on the genus level, the sample was considered as unclassified; and if a sample was classified at the species level in one run but only at the genus level in the other run and both genus-level assignments were concordant, then the species-level assignment was saved. Detailed information on each sample can be found in Supplementary Table S1. The identification of the isolates from the S. aureus data set was performed by MALDI-TOF MS analysis using a standard protocol (Bruker Biotyper, Bruker Daltonics).

Genome-based identification of bacteria

We applied five tools for whole-genome-based taxonomic classification: the BLAST-based tools DIAMOND [20] (with Lowest Common Ancestor (LCA) assignment using MEGAN [38]) and TUIT [21], and the k-mer-based approaches CLARK [22], Kaiju [23] and Kraken [24].

CLARK: Version 1.1.3 was used, and the database was created with the respective script at the species level using finished bacterial genomes from the NCBI RefSeq database (2 November 2015). The tool was run in default mode, k-mer lengths were set to 21, 25, 29 or 31 and forward and reverse paired-end reads were used as input. Report files were created using ‘getAbundance’ with default parameters.

Kaiju: Version 1.4-7 was used with the default database of complete genomes downloaded from the NCBI FTP server (30 June 2016). Paired-end reads were used as input with the default run mode Maximum Exact Matches (MEM). Report files on species level were created using ‘kaijuReport’.
Kraken: Version 0.10.4-beta was used with the default database containing finished genomes from the NCBI RefSeq database (13 January 2015). The k-mer lengths were set to 21, 25, 29 or 31, and forward and reverse paired-end reads were used as input. Report files were created from the raw output using ‘kraken-report’.

DIAMOND/MEGAN: As DIAMOND has no direct support for paired-end reads, we used only forward reads as input. Version 0.6.13.48 was used with BLASTX search against the NCBI nonredundant protein sequence database (nr) (27 February 2015) with default parameters. The output was further processed using the LCA method implemented in MEGAN (version 5.10.6, tool blast2lca with default parameters and GenInfo Identifier (GI) number taxon mapping from March 2015) and summarized by counting the number of mapped sequences for each listed taxon.

TUIT: As in the case of DIAMOND, only forward reads were used as input. Furthermore, the FASTQ file was converted to FASTA format using FASTX-Toolkit [39] (version 0.0.14, with ‘-Q 33’), and unique reads were collapsed to reduce the number of input sequences. TUIT (version 1.0.3.2) was used with local BLAST search against the NCBI nucleotide collection (nt) (4 April 2014) and default parameters. The output was summarized as described for DIAMOND/MEGAN.

Result summaries

From the individual tools’ reports, the following information was computed: the species taxon with the maximal percentage of mapped sequences (first best species hit with respect to all reported sequences), the number of sequences mapped to the best hit species taxon, the number of sequences classified at the species level and the number of sequences mapped to the expected species taxon, i.e. taxa obtained by the merged MS-based analysis results in case of the gram-negative isolates. The total number of sequences was set to the number of input sequences, i.e.
the number of reads for the CLARK, DIAMOND/MEGAN, Kaiju and Kraken results, and the number of FASTA sequences after converting FASTQ files into FASTA files for the results of TUIT. The summarized results can be found in Supplementary Tables S1 and S2.

Performance measures

For each WGS-based summary file, the following performance measures were calculated: the sensitivity, precision and F-measure values with respect to the best species hit and the expected species taxon (Supplementary Table S1). Sensitivity was defined as the ratio of reads assigned to the species taxon and the total number of reads. Precision was defined as the ratio of reads assigned to the species taxon and reads classified at the species level (i.e. assigned to any species taxon). F-measure was defined as 2 × (sensitivity × precision)/(sensitivity + precision).

Runtime analysis

To compare the runtimes of the tools, we randomly selected five gram-negative samples whose number of reads was between 100 000 and 1 000 000 to reduce computational cost. For each sample and each tool, the elapsed (wall clock) time was measured three times using ‘GNU time’ (version 1.7). Before measuring the time, the tools were ‘pre-run’ on a single sample. The tools were called in the same way as described above with the following additional settings: the number of threads was set to 30 using the parameter ‘--threads’ for Kraken and DIAMOND, ‘-n’ for CLARK and ‘-z’ for Kaiju. For TUIT, only the number of threads for the BLAST search can be set manually through the parameter ‘NumThreads’ in the supplied property file. This parameter was also set to 30. Furthermore, we used the option ‘--preload’ for Kraken. The analysis was performed on a server with 500 GB RAM and 64 processors [AMD Opteron(tm) Processor 6378] at 1400 MHz. The time spent on downloading and creating the databases was not considered.
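As a worked illustration of the definitions given under ‘Performance measures’ above, the following Python sketch computes sensitivity, precision and F-measure from three read counts. The function name, argument names and the example counts are hypothetical; returning 0.0 for empty denominators is an assumption of this sketch, not something stated in the text.

```python
def performance_measures(reads_to_species: int,
                         reads_at_species_level: int,
                         total_reads: int) -> tuple[float, float, float]:
    """Return (sensitivity, precision, f_measure) as fractions in [0, 1].

    sensitivity = reads assigned to the species taxon / total reads
    precision   = reads assigned to the species taxon / reads classified
                  at the species level (assigned to any species taxon)
    f_measure   = 2 * (sensitivity * precision) / (sensitivity + precision)
    Zero denominators yield 0.0 (an assumption made for this sketch).
    """
    sensitivity = reads_to_species / total_reads if total_reads else 0.0
    precision = (reads_to_species / reads_at_species_level
                 if reads_at_species_level else 0.0)
    denominator = sensitivity + precision
    f_measure = 2 * sensitivity * precision / denominator if denominator else 0.0
    return sensitivity, precision, f_measure


if __name__ == "__main__":
    # Made-up counts: 800 reads hit the species of interest, 900 reads were
    # classified at the species level, 1000 reads were given as input.
    sens, prec, f = performance_measures(800, 900, 1000)
    print(f"sensitivity={sens:.3f} precision={prec:.3f} F={f:.3f}")
```

With these definitions, unclassified reads lower the sensitivity but not the precision, which is why a tool can combine high precision with a low F-measure.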
The final runtime per sample was computed as the mean over the three repetitions and normalized by the number of read pairs in the corresponding FASTQ files.

Effect of read processing on results of CLARK and Kraken

CLARK and Kraken were additionally run on paired reads preprocessed with Trimmomatic [40] as described below. The k-mer length was set to 31. All other parameters and output processing were kept as described above.

Identification of species contained in the reference databases

Whether the expected species were represented in the reference databases of the WGS-based tools was determined as follows: for CLARK, Kaiju, Kraken and TUIT, the GI numbers were extracted from the nucleotide sequences used and mapped to the taxonomy names using the taxonomy mapping files of NCBI; we then checked whether the expected taxa were contained in the set of retrieved taxonomy names. For DIAMOND/MEGAN, the sequence titles were extracted from the nr database used, filtered to retain only those with one unique taxonomy name and used to search for the expected taxa.

Identification of candidate mixtures

To detect samples containing genomic data of more than one organism, we performed a homogeneity analysis based on the WGS data. The raw reads were trimmed using Trimmomatic [40] (version 0.35, command line parameters: PE ILLUMINACLIP:NexteraPE-PE.fa:1:50:30 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36), assembled into scaffolds using SPAdes [41] (version 3.6.2, command line parameters: -k 21,33,55 --careful) and annotated by PROKKA [42] (version 1.11, command line parameters: --gram neg --mincontiglength 200). The homogeneity of individual samples was assessed using a set of ‘essential genes’. These genes are in single copy and conserved in 95% of all sequenced bacteria [43]. A sample was considered a candidate mixture if >10 of the essential genes were found in multiple copies (Supplementary Tables S1 and S2).
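The candidate-mixture rule above can be sketched in a few lines of Python. The input format (one essential-gene name per annotated hit), the function name and the example gene names are assumptions made for illustration; the text does not say how the copy numbers were extracted from the annotation.

```python
from collections import Counter
from typing import Iterable


def is_candidate_mixture(essential_gene_hits: Iterable[str],
                         max_multicopy_genes: int = 10) -> bool:
    """Flag a sample if more than `max_multicopy_genes` essential genes
    occur in multiple copies.

    `essential_gene_hits` holds one gene name per annotated hit, so a gene
    listed twice was found in two copies (hypothetical input format).
    """
    copies = Counter(essential_gene_hits)
    multicopy = sum(1 for count in copies.values() if count > 1)
    return multicopy > max_multicopy_genes


if __name__ == "__main__":
    # Placeholder gene names: 11 genes duplicated, 20 present once.
    hits = [f"gene{i}" for i in range(31)] + [f"gene{i}" for i in range(11)]
    print(is_candidate_mixture(hits))
```

The threshold is a strict inequality, matching the ‘>10’ in the text: a sample with exactly 10 multi-copy essential genes is not flagged.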
Multilocus sequence typing

Multilocus sequence typing (MLST) profiles for each isolate were obtained using a pipeline implemented by Torsten Seemann [44] (version 2.6) and PubMLST schemes [45] (8 August 2016). The scaffold FASTA files obtained by PROKKA (mentioned earlier) were used as input. The output contained the closest PubMLST scheme, the corresponding sequence type (ST; if available) and the allele IDs (Supplementary Table S3). Results containing more than one missing allele were considered unreliable.

Validation set

A validation set of gram-negative isolates used to assess the performance of the WGS-based results was defined as follows. It included only samples identified at the species level by MS-based analysis and whose species taxon was supported by the expert-driven taxonomy. If the expert-driven taxonomy contained two species taxa, then it was sufficient that one of them was the same as the MS-based taxon. If the expert-driven taxonomy included only the genus, then it was required to be the same as the genus of the MS-based species taxon. Furthermore, samples were excluded if their assembly failed, if they had <200 000 reads or if they were identified as candidate mixtures (Supplementary Table S1), resulting in 656 samples in total. For S. aureus, the validation set contained 194 isolates after removing samples with failed assembly, with <200 000 reads or identified as candidate mixtures (Supplementary Table S2).

Results

Gram-negative isolates

Our data set consisted of 846 gram-negative bacterial clinical isolates collected and identified by microbiologists for diagnostic purposes (Figure 1). These isolates were part of the microbiology strain collection at Siemens Healthcare Diagnostics (West Sacramento, CA). We sequenced whole genomes of all 846 isolates using an Illumina HiSeq system and performed WGS-based taxonomic classification using in silico methods.
Additionally, a Bruker Biotyper MALDI-TOF MS system was used for taxonomic characterization where a subset of 100 isolates was reclassified in a second run. For 240 isolates, the date of collection was available covering a time span from 1986 to 2011 (Supplementary Figure S1). Furthermore, for 783 isolates, the data set included the collection location: a majority of samples was collected in North America (738), followed by Japan (23), Europe (21) and Australia (1). In the MS-based analysis of the 846 isolates, eight (0.9%) samples in the first run and four from the 100 reanalyzed samples (4%) were not identified. Only two samples remained unclassified after both runs. In total, 734 of 846 (86.8%) and 86 of 100 (86%) isolates were resolved to the species level in the first and second run, respectively, while 104 of 846 (12.3%) and 10 of 100 (10%) were classified at the genus level only. The score values varied from 1.73 to 2.57 (Supplementary Figure S4). For the 74 samples identified at the species level in both MS runs, 52 (70.3%) had concordant results. From the remaining 22 samples with divergent species taxa, 16 were assigned to the same genus. Among these, 11 samples were identified as Serratia ureilytica in the first run but reclassified as Serratia marcescens in the second run. A stronger concordance of 83 of 90 samples (92.2%) was observed at the genus level. After merging the results of both runs, 723 from 846 samples (85.5%) were classified at the species level, 114 (13.5%) at the genus level only and 9 (1.1%) were considered as unclassified (Supplementary Table S1). Next, we examined the agreement between the MS-based results and the expert-driven taxonomy over all isolates (846). Concordant species assignments (in case of two taxa, one match was sufficient) were observed in 646 (76.4%) cases (Figure 2, Supplementary Figure S5). 
Moreover, the MS-based results were supported by all WGS-based tools (for CLARK and Kraken, k = 31) in 602 (71.2%) cases, of which 24 (2.8%) assignments were not supported by the expert-driven taxonomy. The validation set for the evaluation of the WGS-based tools was defined comprising 656 isolates with verified species classification (Supplementary Table S1). The taxonomy of the selected isolates was additionally confirmed by the assembly-based MLST results. For species with available MLST schemes in the PubMLST database, the closest reported scheme belonged to the expected species taxon for all samples except one Klebsiella pneumoniae isolate (Supplementary Figure S6), and in 354 of 458 cases, an ST could be assigned (Supplementary Table S1). In the following, if not stated otherwise, only the isolates from the validation set were used for the analysis of the WGS-based results.

Figure 2. Euler diagram of the species taxa of the expert-driven and MS-based identifications. In cases where two species taxa were given by the expert-driven taxonomy, a match to one of the taxa was sufficient to report concordance. The third set lying within the MS-based taxa set represents assignments supported by CLARK (31-mers), Kaiju, DIAMOND/MEGAN, Kraken (31-mers) and TUIT. Its difference with the set of the expert-driven species taxa includes three isolates without an expert-based species assignment.
CLARK and Kraken require the k-mer length to be fixed when building the respective reference databases. In both cases, the value was set to 31, the default value of Kraken. In an additional analysis, we confirmed the observation made for Kraken and CLARK that higher k-mer values are associated with higher precision and lower values with higher sensitivity [22, 24] (Supplementary Figure S7).

We examined the results obtained from CLARK, DIAMOND/MEGAN, Kaiju, Kraken and TUIT regarding the wrongly assigned species taxa and their presence in the reference data sets of the tools (Figure 3). The numbers of incorrect classifications per species were comparable among all five tools with some exceptions: 5 Citrobacter koseri and 19 Enterobacter aerogenes samples were misclassified by DIAMOND/MEGAN and TUIT, whereas CLARK, Kaiju and Kraken yielded no misclassifications for these species; TUIT assigned all 17 Klebsiella oxytoca samples to K. pneumoniae but had only two wrong assignments for Proteus vulgaris, whereas the other tools misclassified 11 P. vulgaris samples. For CLARK and Kraken, in all 30 cases of incorrect species classifications, the expected taxon was not contained in the respective reference databases. For Kaiju, this was the case for only 4 of the 18 incorrect assignments. The databases used for DIAMOND and TUIT contained all expected species taxa. To perform a fair comparison of the tools, only isolates belonging to species included in the reference data sets of all five tools (626 samples) were considered if not stated otherwise.

Figure 3. Number of misclassified samples from the validation set per species for CLARK (31-mers), Kaiju, DIAMOND/MEGAN, Kraken (31-mers) and TUIT. Only the expected species taxa for which at least one tool yielded a misclassification were included. The number of misclassifications is provided within each cell (the higher the value, the darker the background color).
For CLARK, Kaiju and Kraken, the numbers of species missing in the databases used are printed in bold and are highlighted by a black rectangle. The genus taxa were abbreviated by the first three letters.

The overlap of species assignments between all four tools was 92.3% (578 of 626 samples) (Supplementary Figure S8). CLARK and Kraken obtained the same species assignments for all samples, while the lowest concordance was found between Kaiju and TUIT (92.3%). Next, we compared the results of CLARK, DIAMOND/MEGAN, Kaiju, Kraken and TUIT with respect to the expected species taxa. For each of the five tools, the percentage of correct species- and genus-level assignments was computed (Supplementary Figure S9). The best results were obtained by CLARK and Kraken with 100% of correctly assigned species taxa, followed by Kaiju (99.5%), DIAMOND/MEGAN (96%) and TUIT (92.7%).

We then considered the mean classification performance computed with respect to the expected species taxa. CLARK and Kraken demonstrated comparable mean sensitivity values of 79.1 and 81.7%, respectively; DIAMOND/MEGAN had a mean sensitivity of 64.6%, followed by Kaiju (31.4%) and TUIT (25.8%) (Figure 4). The highest mean precision was achieved by Kraken (96.2%), followed by CLARK (94.5%), TUIT (87.5%), Kaiju (87.2%) and DIAMOND/MEGAN (79.2%).
Regarding the F-measure, the best mean performance was achieved by Kraken (87.9%) followed by CLARK (85.5%), DIAMOND/MEGAN (71%), Kaiju (41.8%) and TUIT (34.5%). Finally, we examined the distribution of the best hit performance values for correctly and wrongly classified isolates (Supplementary Figure S10). We observed that isolates assigned to wrong species taxa tended to have a combination of lower sensitivity and precision values compared with correctly classified isolates. However, there was no clear separation of both groups. Moreover, it should be noted that the pairwise sensitivity and precision values of DIAMOND's results were often almost identical.

Figure 4. Mean runtime (n=5) per 1 million reads measured using five randomly chosen gram-negative samples, and the mean sensitivity, precision and F-measure percentages computed with respect to the expected species taxa for CLARK (31-mers), DIAMOND/MEGAN, Kaiju, Kraken (31-mers) and TUIT. Only samples from the validation set and with expected species present in all used reference databases were used. The x-axis is square root transformed.

Besides the classification performance, computational runtime is a relevant factor in choosing which software to use, especially when analyzing large-scale data sets. Hence, we compared the runtimes of the tools based on five randomly selected samples (Figure 4). Among the tools used here, TUIT was the slowest with an average of 169 min per 1 million reads followed by DIAMOND (32.6 min).
CLARK, Kaiju and Kraken achieved the best results, requiring <3 min per 1 million reads, with Kaiju being the fastest (<1 min).

Finally, we investigated whether adapter and quality trimming of the raw reads would adversely affect the classification results, focusing on CLARK and Kraken, as they demonstrated comparable results and achieved better performance than the other tools. We compared the results obtained using raw or processed reads (for k=31). Regarding the best species hits of all samples from the validation set (656), all assignments stayed consistent when using CLARK and Kraken. Moreover, we compared the percentages of raw and trimmed reads assigned to the best hit and expected species taxa, respectively. The distribution of the absolute differences (Supplementary Figure S11) was inspected, and the 99th percentile (at the value of 2.6 for CLARK and 2.5 for Kraken) was defined as the cutoff for outlier detection: the numbers of outliers for the best hits and expected species taxa, respectively, were seven for CLARK and Kraken.

Staphylococcus aureus isolates

We whole genome sequenced 200 S. aureus clinical isolates from the S. aureus strain collection of Saarland University Medical Center. The samples were collected between 1976 and 2014 (Supplementary Figure S1). The majority of the isolates (187) were collected in Germany, 11 were from Mozambique, 1 from Switzerland and 1 from the United States (Supplementary Table S2). In the assembly-based MLST analysis, for all isolates, except one, in the validation set, the closest reported MLST scheme belonged to S. aureus, and 173 isolates were assigned to known MLST profiles. We performed an analogous analysis as for the gram-negative isolates to compare the WGS-based results. The tools were concordant with the assignments of all isolates in the validation set except for one case where Kaiju disagreed with the other tools (Supplementary Figure S12). Only one isolate was not assigned to S.
aureus by CLARK, DIAMOND/MEGAN, Kraken and TUIT; two isolates were misclassified by Kaiju (Supplementary Figure S13). CLARK and Kraken achieved high mean sensitivity values of 90.8% and 91.3%, respectively (Supplementary Figure S14). Lower values were observed for TUIT (77.1%), DIAMOND/MEGAN (75.9%) and Kaiju (10.9%). CLARK, DIAMOND/MEGAN, Kraken and TUIT demonstrated high precision values of >97%, whereas Kaiju had a lower value of 85.9%. Accordingly, CLARK and Kraken had the highest F-measure values of 94.4 and 94.7%, respectively, followed by TUIT (86.5%), DIAMOND/MEGAN (85.3%) and Kaiju (18.9%).

Discussion

Rapid and accurate pathogen characterization is essential for an effective treatment of infections, facilitating patient stratification and personalized therapies. WGS is gaining importance in the analysis of bacterial cultures derived from patients with infectious diseases. Various computational approaches have been developed to perform taxonomic analysis based on WGS data. However, evaluations using newly sequenced clinical samples and complementary information confirming the taxonomy are missing. Here, we performed WGS-based taxonomic analysis of 846 gram-negative bacterial isolates and validated the results by comparing with MS-based classifications obtained using a Bruker Biotyper MALDI-TOF MS system and confirmed by expert-driven taxonomy. Our data set included species which are frequently found to be responsible for nosocomial (hospital-acquired) infections, such as Acinetobacter baumannii, Escherichia coli, Klebsiella spp. and Pseudomonas aeruginosa [46]. Additionally, we included a data set of 200 S. aureus isolates.

To determine the concordance between the expert-driven and MS-based taxonomy of the gram-negative isolates, we first analyzed the MS-based results to determine samples with uncertain or ambiguous classifications. In general, possible limitations of MS-based analysis include, but are not restricted to, the limited differentiation of E.
coli and some Shigella spp. isolates [7, 9, 47–49], and also inaccurate differentiation of species in other groups such as Acinetobacter [48], Citrobacter [50] and Enterobacter cloacae complex [48] and the missing identification of Salmonella isolates below the genus level [51]. In our MS-based analysis, most samples (about 86%) were classified at the species level and the taxa of samples classified only at the genus level included, among others, the genera imposing particular challenges to MS-based biotyping as mentioned above (Supplementary Table S1). We found that both MS runs produced concordant results in 52 of 74 (70.3%) and 84 of 90 (93.3%) cases at the species and at the genus level, respectively. From the 22 discordant species-level assignments, 11 isolates were first identified as S. ureilytica and then reclassified as S. marcescens. S. ureilytica is a relatively new species [52] whose identification by a MALDI-TOF MS system was shown to be challenging [50, 53], potentially explaining the observed ambiguities. In the final taxonomic assignment, the reported inconsistencies were resolved such that 723 samples were classified at the species level, 115 at the genus level only and 8 samples were considered as unclassified because of divergent assignments or failed identification. We then compared the resulting species taxa with the expert-driven results and observed a high concordance of 76.4%. However, in ca. 3% of the cases, all tested WGS-based tools were concordant with the MS-based result, which was not supported by the expert-driven classification. We could assume that in these cases, the expert-driven taxonomy was incomplete (no species taxon) or incorrect, which demonstrates the importance of using complementary information to confirm the identification results. Subsequently, we defined a validation set including only samples with confirmed taxonomy to be used for the evaluation of the WGS-based results. 
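The concordance figures reported above (between the two MS runs, and between the MS-based and expert-driven taxonomy) are agreement rates over samples that both sources labelled at a given rank. A minimal Python sketch under that assumption follows; the function name `concordance` and the example labels are hypothetical and are not part of the study's pipeline.

```python
from typing import Optional, Sequence, Tuple


def concordance(
    first: Sequence[Optional[str]], second: Sequence[Optional[str]]
) -> Tuple[int, int, float]:
    """Return (agreeing, comparable, percent) for two label lists.

    Samples for which either source has no label (None) are excluded,
    so only samples classified by both sources are compared.
    """
    if len(first) != len(second):
        raise ValueError("label lists must have the same length")
    pairs = [(a, b) for a, b in zip(first, second) if a is not None and b is not None]
    agreeing = sum(a == b for a, b in pairs)
    percent = 100.0 * agreeing / len(pairs) if pairs else float("nan")
    return agreeing, len(pairs), percent


# Hypothetical species labels from two identification runs.
run_1 = ["Serratia ureilytica", "Escherichia coli", None, "Klebsiella oxytoca"]
run_2 = ["Serratia marcescens", "Escherichia coli", "Proteus vulgaris", "Klebsiella oxytoca"]
agreeing, comparable, percent = concordance(run_1, run_2)
print(f"{agreeing} of {comparable} ({percent:.1f}%)")

# The proportions quoted in the text can be recomputed the same way.
print(round(100 * 52 / 74, 1), round(100 * 84 / 90, 1))
```

Excluding unlabelled samples from the denominator is a design choice of this sketch; counting them as disagreements would give lower values.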
For any taxonomic classification approach, the availability and quality of the reference data are crucial for an accurate identification. Sequence-based tools for taxonomic classification generally rely on publicly available data sources. CLARK and Kraken construct their k-mer databases from finished genomes of the NCBI RefSeq database [54, 55], which included 2785 bacterial data sets (containing chromosome and/or plasmid DNA sequences) in this study. DIAMOND and TUIT use the NCBI nonredundant protein (nr) and nucleotide (nt) collections [56], respectively. The NCBI nr collection includes data from GenBank, RefSeq, UniProtKB/Swiss-Prot, PDB and PRF; and the nt collection includes data from RefSeq and GenBank except EST, GSS, STS and HTG [56]. Kaiju can use complete genomes from the NCBI RefSeq database or the nr collection. In this study, Kaiju was run using proteins extracted from 5135 complete genomes. Considering the presence of expected species in the different reference data sets, in only few cases, the availability of the respective species was sufficient for correct classifications (Figure 3). Only TUIT was able to correctly identify most of the P. vulgaris isolates, though this species was also contained in the reference data sets of DIAMOND and Kaiju. For the single P. vulgaris genome used by Kaiju (NZ_CP012675.1), we found that the nucleotide sequence of the rpoB gene (DNA-directed RNA polymerase subunit beta, WP_004246906.1), shown to be appropriate to distinguish Proteus spp. [57], was 100% identical to the rpoB gene from the complete Proteus mirabilis genome (NZ_CP012674.1, ALE23450). Furthermore, the average nucleotide identity value [58] (http://enve-omics.ce.gatech.edu/ani/index) of both genomes was  >99.9%. Based on these findings, we assumed that NZ_CP012675.1 was misclassified explaining why Kaiju was not able to correctly identify the P. vulgaris isolates. 
We hypothesized that similar reasons may be responsible for other incorrect assignments where the expected species was present in the used database but whose isolates were nevertheless misclassified. Focusing only on species presumably contained in all databases, our analysis demonstrated that WGS-based identification approaches can yield highly accurate results. All tools classified  >92% (best result 100%) of the gram-negative samples correctly. CLARK and Kraken demonstrated best mean sensitivity about 80% followed by DIAMOND/MEGAN with 64.4%. Kaiju and TUIT had a comparably low mean sensitivity (31.4 and 25.8%) but better precision (87.2 and 87.5%) than DIAMOND/MEGAN (79.2%). The highest precision was observed for Kraken (96.2%) and CLARK (94.5%). The low sensitivity of TUIT may be because of missing matches to the reference database or because the default TUIT cutoff values used during the assignment of the LCAs, were too strict. The authors of TUIT suggest a trial-and-error procedure to adjust the cutoffs [21], but this was prohibited by the high computational runtime of the tool. In contrast, the default parameters used for assignment filtering and LCA assignment with MEGAN for the DIAMOND results appeared to be too permissive, as the sensitivity and precision values were close to each other. Furthermore, it should be noted that CLARK, Kaiju and Kraken may have benefited from using paired-end data, while DIAMOND and TUIT were run on forward reads only. Moreover, DIAMOND and Kaiju are only able to classify protein-coding sequences, as the reads are matched to protein databases affecting their sensitivity. Though we observed a tendency of lower performance values for incorrectly classified isolates, some wrongly and correctly assigned isolates demonstrated comparable sensitivity and precision values. 
We hypothesized that these cases may include either undetected contaminated isolates or isolates belonging to a species missing in the database but closely related to other species of the same genus with available reference data. These isolates would require a closer examination, e.g. considering all taxa exceeding a reasonable abundance cutoff. The overall results demonstrate the importance of a comprehensive and representative reference database for a successful and precise taxonomic classification, which is even more crucial within a clinical setting.

The performance of the WGS-based tools on the S. aureus data set was comparable with the observations made for the gram-negative isolates: at least 99% of all samples were classified as S. aureus; the exceptions were one sample (ID 191) classified as Enterococcus faecium by Kaiju and one sample (ID 80) identified as Staphylococcus carnosus by all five tools. In the latter case, we could assume that the MS-based taxon was wrong or that a wrong sample was used for WGS. The highest sensitivity was achieved by CLARK and Kraken (>90%), and all tools except Kaiju had a mean precision >97%.

Considering the runtime of the tested WGS-based tools, CLARK, Kaiju and Kraken were substantially faster than DIAMOND and TUIT, requiring only a few minutes to process a million reads. Furthermore, we also evaluated the robustness of CLARK and Kraken with respect to adapter and quality trimming of the raw reads and observed that this procedure had only marginal effects on the classification results.

In summary, the k-mer and exact matching-based tools CLARK and Kraken appear to be the primary choices with respect to classification performance, usability and runtime among the approaches tested here. Kaiju represents an appealing alternative, as it was faster than CLARK and Kraken, and requires no parameters to be set to create a reference database. However, it should be kept in mind that the tool can classify only protein-coding sequences.
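To make the idea of k-mer-based exact matching more concrete, the following toy Python sketch assigns a read to the reference that shares the most k-mers unique to it. This is only an illustration loosely inspired by the discriminative k-mer concept [22]; it is not the implementation of CLARK or Kraken, and the sequences, labels and function names are hypothetical. A small k is used so that the example fits short strings, whereas the study used k=31.

```python
from collections import Counter
from typing import Dict, Optional, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all substrings of length k in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references: Dict[str, str], k: int) -> Dict[str, str]:
    """Map each k-mer that occurs in exactly one reference to its label."""
    owner: Dict[str, Optional[str]] = {}
    for label, seq in references.items():
        for kmer in kmers(seq, k):
            # A k-mer seen in more than one reference is marked as shared (None).
            owner[kmer] = label if kmer not in owner else None
    return {kmer: label for kmer, label in owner.items() if label is not None}


def classify(read: str, index: Dict[str, str], k: int) -> Optional[str]:
    """Return the label with the most exact k-mer hits, or None if there are none."""
    hits = Counter(index[kmer] for kmer in kmers(read, k) if kmer in index)
    if not hits:
        return None
    return hits.most_common(1)[0][0]


# Hypothetical toy references and reads.
K = 8
references = {
    "species_A": "ACGTACGTTAGCCGATAGGCTTAACG",
    "species_B": "TTGACCGGTATCGATCCGGAATTCGA",
}
index = build_index(references, K)
print(classify("CGTTAGCCGATAGG", index, K))
print(classify("GGTATCGATCCGG", index, K))
print(classify("AAAAAAAAAA", index, K))
```

In this sketch, a read without any matching k-mer stays unclassified, which mirrors the situation in which the expected species is missing from the reference database.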
Overall, taxonomic classification of bacterial isolates based on WGS data provides highly accurate results, and thus represents a promising alternative to MS-based biotyping. WGS would also enable further analyses, such as phylogenetic analysis and genotyping, which are mandatory for surveillance of outbreaks and antimicrobial resistance. However, multiple issues have to be addressed before WGS-based approaches can be applied in clinical settings. The effect of different library preparation and sequencing methods on the quality of the identification results should be investigated and quantified. As the comprehensiveness and the quality of the reference database have a high impact on the reliability of the taxonomic assignments, a careful selection and validation of the reference data would be necessary. This holds especially for organisms represented by only one genome, as seen in the case of the (most likely mislabeled) P. vulgaris genome used by Kaiju. Large-scale efforts, such as the 'Genomic Encyclopedia of Bacteria and Archaea' [59] and its pilot studies, and the '100K Foodborne Pathogen Genome Project' [60] are expected to greatly expand the volume and diversity of available reference genomes.

An additional aspect is the genomic variability of bacteria and in particular the differences between pathogenic and nonpathogenic species. The lifestyle of a bacterium influences to a great extent its genome size and variability. Pathogenic species represent specialized organisms leading an allopatric lifestyle and are characterized by a significantly reduced genome compared with species from a sympatric environment [61, 62]. Furthermore, the genome of a bacterial strain can be seen as a composition of 'core' genes, conserved among many strains of the same species, and 'accessory' genes, which vary between different strains [62]. The set of all genes found in a species is referred to as a pan-genome [63].
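The relation between pan-genome, core genes and accessory genes can be expressed with set operations. The Python sketch below is a simplification that treats a gene as 'core' only if it is present in every strain, which is stricter than "conserved among many strains"; the strain names and gene contents are hypothetical.

```python
from typing import Dict, Set, Tuple


def pan_core_accessory(strains: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (pan, core, accessory) gene sets for a collection of strains."""
    gene_sets = list(strains.values())
    # Pan-genome: every gene seen in at least one strain.
    pan = set().union(*gene_sets)
    # Strict core: genes present in all strains (assumption of this sketch).
    core = set.intersection(*gene_sets) if gene_sets else set()
    # Accessory: genes present in some, but not all, strains.
    accessory = pan - core
    return pan, core, accessory


# Hypothetical gene content of three strains of one species.
strains = {
    "strain_1": {"rpoB", "gyrA", "geneX"},
    "strain_2": {"rpoB", "gyrA", "geneY"},
    "strain_3": {"rpoB", "gyrA", "geneX", "geneZ"},
}
pan, core, accessory = pan_core_accessory(strains)
print(sorted(pan))
print(sorted(core))
print(sorted(accessory))
```

A relaxed core definition, for example presence in at least a chosen fraction of strains, could be obtained by counting in how many strains each gene occurs instead of intersecting all sets.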
The core genome similarity is considered to be a good approach to define bacterial species relevant for humans [62]. However, it has also been proposed to apply pan-genome analysis to redefine bacterial species [64, 65].

Another important point is the fact that many bacterial organisms cannot be grown in the laboratory, thus challenging their identification [66]. Single-cell sequencing is considered to be a promising solution for this problem [67]. In our analysis, we focused on cultured isolates as their accurate identification can be seen as a necessary requirement for future, culture-independent studies.

Regarding the underlying concept of the classification tools, we focused in this study on alignment-based methods. However, there also exist alignment-free approaches (e.g. PhyloPythia/S/S+ [27–29], RAIphy [26] and the approach of Vervier et al. [25]), and approaches combining alignment-based and alignment-free similarity measures (e.g. Borozan et al. [68]). Finally, exhaustive testing procedures using high-quality validation data, including relevant human pathogens, should be performed to assess the reliability and accuracy of the implemented method.

Key Points

k-mer-based taxonomic information derived from WGS data allows for accurate and fast classification of bacterial clinical isolates at species level and is thus an appealing alternative to MS-based analysis.

Establishing a high-quality reference database as well as its continuous extension is vital for the correctness of the taxonomic classifications.

The evaluation of taxonomic classification tools should include complementary information to confirm the taxonomy of the underlying data.

Supplementary Data

Supplementary data are available online at http://bib.oxfordjournals.org/.

Valentina Galata is PhD student at the Chair of Clinical Bioinformatics at Saarland University. Christina Backes is Postdoc at the Chair for Clinical Bioinformatics at Saarland University.
Cédric Christian Laczny is Postdoc at the Chair for Clinical Bioinformatics at Saarland University. Georg Hemmrich-Stanisak is a research scientist at the ICMB at the Christian-Albrechts-University of Kiel in Germany. Howard Li, PhD, worked as a system development technical lead at Roche Molecular Systems. His expertise includes sample preparation, laboratory automation and medical device R&D. He is a member of American Chemical Society and American Society for Microbiology. Laura Smoot, PhD, is a microbiologist with training and research in molecular microbiology field, and has experience in in vitro diagnostics industry. She worked at Siemens Healthcare, R&D. She is a member of American Society for Microbiology (ASM) and European Society of Clinical Microbiology, and Infectious Diseases (ESCMID). Dr Andreas Emanuel Posch is Senior Key Expert for Bioinformatics and Systems Biology at Siemens Healthcare, In Vitro Diagnostics and Bioscience R&D. Dr Susanne Schmolke is Senior Project Manager Strategy at Siemens Healthcare. Her expertise includes virology, molecular biology and medical device industry. Markus Bischoff is professor and senior scientist at the Institute of Medical Microbiology and Hygiene at Saarland University. Lutz von Müller was vice head of the Institute of Medical Microbiology and Hygiene at the Saarland University. Current position: Head of Institute of Laboratory Medicine, Microbiology and Hygiene, Christophorus Hospitals, Coesfeld, Germany. Dr Achim Plum is Managing Director of Curetis GmbH and molecular geneticist by training. His areas of expertise include precision medicine and companion diagnostics, biomarker discovery and validation, IVD industry and molecular diagnostics. Andre Franke, PhD, is the director of the ICMB at the Christian-Albrechts-University of Kiel in Germany. 
The primary foci of his research are high-throughput analyses, laboratory automation, next generation sequencing, chronic inflammatory diseases, GWAS, and bioinformatics. He is a member of the German Society for Human Genetics (GfH) and the German Society for Internal Medicine (DGIM). Andreas Keller is professor and head of the Chair for Clinical Bioinformatics at Saarland University.

Acknowledgement

The authors would like to thank Siemens Healthcare and Curetis GmbH for the support as well as the provided data set. The authors also thank Mathias Herrmann for comments that greatly improved the manuscript.

Funding

This work was funded by Siemens Healthcare and in part by Best Ageing (grant number 306031) from the European Union.

Availability of WGS data

The raw WGS data are available on reasonable request for noncommercial use after signing a nondisclosure agreement.

References

1 Greatorex J, Ellington MJ, Köser CU, et al. New methods for identifying infectious diseases. Br Med Bull 2014; 112: 27–35. doi: 10.1093/bmb/ldu027
2 Didelot X, Bowden R, Wilson DJ, et al. Transforming clinical microbiology with bacterial genome sequencing. Nat Rev Genet 2012; 13: 601–12. doi: 10.1038/nrg3226
3 Janda JM, Abbott SL. 16S rRNA gene sequencing for bacterial identification in the diagnostic laboratory: pluses, perils, and pitfalls. J Clin Microbiol 2007; 45: 2761–4. doi: 10.1128/JCM.01228-07
4 Rajendhran J, Gunasekaran P. Microbial phylogeny and diversity: small subunit ribosomal RNA sequence analysis and beyond. Microbiol Res 2011; 166: 99–110. doi: 10.1016/j.micres.2010.02.003
5 van Veen SQ, Claas ECJ, Kuijper EJ. High-throughput identification of bacteria and yeast by matrix-assisted laser desorption ionization-time of flight mass spectrometry in conventional medical microbiology laboratories.
J Clin Microbiol 2010; 48: 900–7. doi: 10.1128/JCM.02071-09
6 Seng P, Drancourt M, Gouriet F, et al. Ongoing revolution in bacteriology: routine identification of bacteria by matrix-assisted laser desorption ionization time-of-flight mass spectrometry. Clin Infect Dis 2009; 49: 543–51. doi: 10.1086/600885
7 Bizzini A, Durussel C, Bille J, et al. Performance of matrix-assisted laser desorption ionization-time of flight mass spectrometry for identification of bacterial strains routinely isolated in a clinical microbiology laboratory. J Clin Microbiol 2010; 48: 1549–54. doi: 10.1128/JCM.01794-09
8 Köser CU, Ellington MJ, Cartwright EJP, et al. Routine use of microbial whole genome sequencing in diagnostic and public health microbiology. PLoS Pathog 2012; 8: e1002824. doi: 10.1371/journal.ppat.1002824
9 Croxatto A, Prod’hom G, Greub G, et al. Applications of MALDI-TOF mass spectrometry in clinical diagnostic microbiology. FEMS Microbiol Rev 2012; 36: 380–407. doi: 10.1111/j.1574-6976.2011.00298.x
10 Mahé P, Arsac M, Chatellier S, et al. Automatic identification of mixed bacterial species fingerprints in a MALDI-TOF mass-spectrum. Bioinformatics 2014; 30: 1280–6. doi: 10.1093/bioinformatics/btu022
11 Zhang L, Smart S, Sandrin TR. Biomarker- and similarity coefficient-based approaches to bacterial mixture characterization using matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS). Sci Rep 2015; 5: 15834. doi: 10.1038/srep15834
12 Sandrin TR, Demirev PA. Using mass spectrometry to identify and characterize bacteria. Microb 2014; 9: 23–9.
13 Mazzeo MF, Sorrentino A, Gaita M, et al.
Matrix-assisted laser desorption ionization-time of flight mass spectrometry for the discrimination of food-borne microorganisms. Appl Environ Microbiol 2006; 72: 1180–9. doi: 10.1128/AEM.72.2.1180-1189.2006
14 Böhme K, Fernández-No IC, Barros-Velázquez J, et al. SpectraBank: an open access tool for rapid microbial identification by MALDI-TOF MS fingerprinting. Electrophoresis 2012; 33: 2138–42. doi: 10.1002/elps.201200074
15 Spectra—Extended spectra database for microorganism identification by MALDI-TOF MS. http://spectra.folkhalsomyndigheten.se/spectra/. (5 April 2016, date last accessed)
16 Nicolau A, Sequeira L, Santos C, Mota M. Matrix-assisted laser desorption/ionisation time-of-flight mass spectrometry (MALDI-TOF MS) applied to diatom identification: influence of culturing age. Aquat Biol 2014; 20: 139–44. doi: 10.3354/ab00548
17 Veloo ACM, Elgersma PE, Friedrich AW, et al. The influence of incubation time, sample preparation and exposure to oxygen on the quality of the MALDI-TOF MS spectrum of anaerobic bacteria. Clin Microbiol Infect 2014; 20: O1091–7. doi: 10.1111/1469-0691.12644
18 Segata N, Waldron L, Ballarini A, et al. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat Methods 2012; 9: 811–4. doi: 10.1038/nmeth.2066
19 Liu B, Gibbons T, Ghodsi M, et al. Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences. BMC Genomics 2011; 12 (Suppl 2): S4. doi: 10.1186/1471-2164-12-S2-S4
20 Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods 2014; 12: 59–60. doi: 10.1038/nmeth.3176
21 Tuzhikov A, Panchin A, Shestopalov VI.
TUIT, a BLAST-based tool for taxonomic classification of nucleotide sequences. Biotechniques 2014; 56: 78–84. doi: 10.2144/000114135
22 Ounit R, Wanamaker S, Close TJ, Lonardi S. CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics 2015; 16: 236. doi: 10.1186/s12864-015-1419-2
23 Menzel P, Lee Ng K, Krogh A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat Commun 2016; 7: 11257.
24 Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol 2014; 15: R46. doi: 10.1186/gb-2014-15-3-r46
25 Vervier K, Mahé P, Tournoud M, et al. Large-scale machine learning for metagenomics sequence classification. Bioinformatics 2016; 32: 1023–32. doi: 10.1093/bioinformatics/btv683
26 Nalbantoglu OU, Way SF, Hinrichs SH, et al. RAIphy: phylogenetic classification of metagenomics samples using iterative refinement of relative abundance index profiles. BMC Bioinformatics 2011; 12: 41. doi: 10.1186/1471-2105-12-41
27 McHardy AC, Martín HG, Tsirigos A, et al. Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods 2007; 4: 63–72. doi: 10.1038/nmeth976
28 Patil KR, Roune L, McHardy AC, et al. The PhyloPythiaS web server for taxonomic assignment of metagenome sequences. PLoS One 2012; 7: e38581. doi: 10.1371/journal.pone.0038581
29 Gregor I, Dröge J, Schirmer M, et al. PhyloPythiaS+: a self-training method for the rapid reconstruction of low-ranking taxonomic bins from metagenomes. PeerJ 2016; 4: e1603.
30 Ondov BD, Bergman NH, Phillippy AM. Interactive metagenomic visualization in a web browser. BMC Bioinformatics 2011; 12: 385. doi: 10.1186/1471-2105-12-385
31 Vajapey U, Ying A, Hennig G, Li H. Validation of a fully-automated nucleic acid extraction method for fresh frozen tissue using the Siemens tissue preparation solution. Assoc Mol Pathol 2013; 15: 843–945.
32 Guettouche T, Rantus J, Slosek K, et al. A workflow enabling whole exome and whole genome sequencing of formalin fixed paraffin embedded samples with minimal amounts of DNA. Adv Genome Biol Technol 2013. https://www.kapabiosystems.com/assets/Guettouche_A-Workflow-Enabling-Whole-Exome-and-Whole-Genome-Sequencing-of-FFPE-Samples-with-Minimal-Amounts-of-DNA_AGBT_2013.pdf (15 November 2016, date last accessed).
33 van Eijk R, Stevens L, Morreau H, van Wezel T. Assessment of a fully automated high-throughput DNA extraction method from formalin-fixed, paraffin-embedded tissue for KRAS, and BRAF somatic mutation analysis. Exp Mol Pathol 2013; 94: 121–5. doi: 10.1016/j.yexmp.2012.06.004
34 FastQC: a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. (31 May 2016, date last accessed).
35 Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 2016; btw354. doi: 10.1093/bioinformatics/btw354
36 Chamberlain SA, Szöcs E. taxize: taxonomic search and retrieval in R. F1000Research 2013; 2: 191. doi: 10.12688/f1000research.2-191.v2
37 Sayers EW, Barrett T, Benson DA, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2009; 37: D5–15. doi: 10.1093/nar/gkn741
38 Huson DH, Auch AF, Qi J, Schuster SC.
MEGAN analysis of metagenomic data. Genome Res 2007; 17: 377–86. doi: 10.1101/gr.5969107
39 FASTX-Toolkit. http://hannonlab.cshl.edu/fastx_toolkit/. (27 October 2015, date last accessed).
40 Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014; 30: 2114–20. doi: 10.1093/bioinformatics/btu170
41 Bankevich A, Nurk S, Antipov D, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 2012; 19: 455–77. doi: 10.1089/cmb.2012.0021
42 Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 2014; 30: 2068–9. doi: 10.1093/bioinformatics/btu153
43 Dupont CL, Rusch DB, Yooseph S, et al. Genomic insights to SAR86, an abundant and uncultivated marine bacterial lineage. ISME J 2012; 6: 1186–99. doi: 10.1038/ismej.2011.189
44 Seemann T. MLST pipeline. https://github.com/tseemann/mlst. (8 August 2016, date last accessed).
45 PubMLST. www.pubmlst.org. (8 August 2016, date last accessed).
46 Peleg AY, Hooper DC. Hospital-acquired infections due to gram-negative bacteria. N Engl J Med 2010; 362: 1804–13. doi: 10.1056/NEJMra0904124
47 Murray PR. What is new in clinical microbiology-microbial identification by MALDI-TOF mass spectrometry: a paper from the 2011 William Beaumont Hospital Symposium on molecular pathology. J Mol Diagn 2012; 14: 419–23. doi: 10.1016/j.jmoldx.2012.03.007
48 Patel R. MALDI-TOF MS for the diagnosis of infectious diseases. Clin Chem 2015; 61: 100–11. doi: 10.1373/clinchem.2014.221770
49 Du Z, Li L, Chen C-F, et al.
G-SESAME: web tools for GO-term-based gene similarity analysis and knowledge discovery. Nucleic Acids Res 2009; 37: W345–9.
50 Eigner U, Holfelder M, Oberdorfer K, et al. Performance of a matrix-assisted laser desorption ionization-time-of-flight mass spectrometry system for the identification of bacterial isolates in the clinical routine laboratory. Clin Lab 2009; 55: 289–96.
51 Chen JH, Ho P-L, Kwan GS, et al. Direct bacterial identification in positive blood cultures by use of two commercial matrix-assisted laser desorption ionization-time of flight mass spectrometry systems. J Clin Microbiol 2013; 51: 1733–9. doi: 10.1128/JCM.03259-12
52 Bhadra B, Roy P, Chakraborty R. Serratia ureilytica sp. nov., a novel urea-utilizing species. Int J Syst Evol Microbiol 2005; 55: 2155–8. doi: 10.1099/ijs.0.63674-0
53 Seng P, Abat C, Rolain JM, et al. Identification of rare pathogenic bacteria in a clinical microbiology laboratory: impact of matrix-assisted laser desorption ionization-time of flight mass spectrometry. J Clin Microbiol 2013; 51: 2182–94. doi: 10.1128/JCM.00492-13
54 Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2005; 33: D501–4. doi: 10.1093/nar/gki025
55 Pruitt KD, Tatusova T, Brown GR, Maglott DR. NCBI reference sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res 2011; 40: D130–5. doi: 10.1093/nar/gkr1079
56 NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2013; 41: D8–20.
doi: 10.1093/nar/gks1189 PubMed  57 Giammanco GM, Grimont PA, Grimont F, et al.   Phylogenetic analysis of the genera Proteus, Morganella and Providencia by comparison of rpoB gene sequences of type and clinical strains suggests the reclassification of Proteus myxofaciens in a new genus, Cosenzaea gen. nov., as Cosenzaea myxofaciens comb. nov. Int J Syst Evol Microbiol  2011; 61: 1638– 44. doi: 10.1099/ijs.0.021964-0 Google Scholar CrossRef Search ADS PubMed  58 Goris J, Konstantinidis KT, Klappenbach JA, et al.   DNA-DNA hybridization values and their relationship to whole-genome sequence similarities. Int J Syst Evol Microbiol  2007; 57: 81– 91. doi: 10.1099/ijs.0.64483-0 Google Scholar CrossRef Search ADS PubMed  59 Wu D, Hugenholtz P, Mavromatis K, et al.   A phylogeny-driven genomic encyclopaedia of bacteria and archaea. Nature  2009; 462: 1056– 60. doi: 10.1038/nature08656 Google Scholar CrossRef Search ADS PubMed  60 Timme RE, Allard MW, Luo Y, et al.   Draft genome sequences of 21 Salmonella enterica serovar enteritidis strains. J Bacteriol  2012; 194: 5994– 5. doi: 10.1128/JB.01289-12 Google Scholar CrossRef Search ADS PubMed  61 Georgiades K, Raoult D. Defining pathogenic bacterial species in the genomic era. Front Microbiol  2010; 1: 151. doi: 10.3389/fmicb.2010.00151 Google Scholar PubMed  62 Segerman B. The genetic integrity of bacterial species: the core genome and the accessory genome, two different stories. Front Cell Infect Microbiol  2012; 2: 116. doi: 10.3389/fcimb.2012.00116 Google Scholar CrossRef Search ADS PubMed  63 Medini D, Donati C, Tettelin H, et al.   The microbial pan-genome. Curr Opin Genet Dev  2005; 15: 589– 94. doi: 10.1016/j.gde.2005.09.006 Google Scholar CrossRef Search ADS PubMed  64 Caputo A, Merhej V, Georgiades K, et al.   Pan-genomic analysis to redefine species and subspecies based on quantum discontinuous variation: the Klebsiella paradigm. Biol Direct  2015; 10: 55. 
doi: 10.1186/s13062-015-0085-2 Google Scholar CrossRef Search ADS PubMed  65 Rouli L, Merhej V, Fournier P-E, Raoult D. The bacterial pangenome as a new tool for analysing pathogenic bacteria. New Microbes New Infect  2015; 7: 72– 85. doi: 10.1016/j.nmni.2015.06.005 Google Scholar CrossRef Search ADS PubMed  66 Stewart EJ. Growing unculturable bacteria. J Bacteriol  2012; 194: 4151– 60. doi: 10.1128/JB.00345-12 Google Scholar CrossRef Search ADS PubMed  67 Gawad C, Koh W, Quake SR. Single-cell genome sequencing: current state of the science. Nat Rev Genet  2016; 17: 175– 88. doi: 10.1038/nrg.2015.16 Google Scholar CrossRef Search ADS PubMed  68 Borozan I, Watt S, Ferretti V. Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification. Bioinformatics  2015; 31: 1396– 404. doi: 10.1093/bioinformatics/btv006 Google Scholar CrossRef Search ADS PubMed  © The Authors 2016. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Journal

Briefings in Bioinformatics, Oxford University Press

Published: Dec 22, 2016
