Abstract High-throughput next-generation shotgun sequencing of pathogenic bacteria is growing in clinical relevance, especially for chromosomal DNA-based taxonomic identification and for antibiotic resistance prediction. Genetic exchange is facilitated for extrachromosomal DNA, e.g. plasmid-borne antibiotic resistance genes. Consequently, accurate identification of plasmids from whole-genome sequencing (WGS) data remains one of the major challenges for sequencing-based precision medicine in infectious diseases. Here, we assess the heterogeneity of four state-of-the-art tools (cBar, PlasmidFinder, plasmidSPAdes and Recycler) for the in silico prediction of plasmid-derived sequences from WGS data. Heterogeneity, sensitivity and precision were evaluated by reference-independent and reference-dependent benchmarking using 846 Gram-negative clinical isolates. Interestingly, the majority of predicted sequences were tool-specific, resulting in a pronounced heterogeneity across tools for the reference-independent assessment. In the reference-dependent assessment, sensitivity and precision values were found to substantially vary between tools and across taxa, with cBar exhibiting the highest median sensitivity (87.45%) but a low median precision (27.05%). Furthermore, integrating the individual tools into an ensemble approach showed increased sensitivity (95.55%) while reducing the precision (25.62%). CBar and plasmidSPAdes exhibited the strongest concordance with respect to identified antibiotic resistance factors. Moreover, false-positive plasmid predictions typically contained only few antibiotic resistance factors. In conclusion, while high degrees of heterogeneity and variation in sensitivity and precision were observed across the different tools and taxa, existing tools are valuable for investigating the plasmid-borne resistome. Nevertheless, additional studies on representative clinical data sets will be necessary to translate in silico plasmid prediction approaches from research to clinical application. bacteria, plasmids, prediction, next-generation sequencing Introduction Bacterial plasmids play important roles in the emergence and spread of antibiotic resistance . These genetic elements vary in size, are mostly circular, can replicate independently and often encode resistance- and/or virulence-related genes [1–4]. Moreover, the dissemination of pathogens is facilitated by inter-species plasmid exchange . A prominent example is the plasmid-encoded mcr-1 gene inducing colistin resistance originally reported by Y. Liu and Y. Wang for Enterobacteriaceae samples collected in China . The mcr-1 gene was subsequently found in bacteria collected in Europe, Laos, Thailand and Nigeria . Therefore, plasmid detection and classification are crucial steps for the identification and characterization of plasmid-mediated phenotypes. Polymerase chain reaction-based replicon typing (based on elements of the replication machinery) [8, 9] and MOB typing (based on conserved motifs of the relaxase gene) are frequently used to detect and classify plasmids [10, 11]. Limitations of these approaches are, among others, that the available typing schemes do not cover all plasmids and that the complete genetic repertoire of the plasmid(s) remains unknown, as the focus of these approaches is on a specific set of genes . In contrast, whole-genome sequencing (WGS) indiscriminately resolves the chromosomal and extrachromosomal genetic complements. Subsequent annotation of de novo assembled sequences enables the characterization of chromosome- and plasmid-derived functional potential in addition to taxonomic identification of the studied organism. In a detailed review of plasmid classification within the context of antibiotic resistance epidemiology, Orlek et al.  describe the potential of the in silico analysis of WGS data to address the limitations of replicon and MOB typing. Furthermore, Arredondo-Alonso et al.  reviewed computational solutions for the automated plasmid prediction on a set of 42 reference genomes. The existing in silico approaches can be divided into three main categories: marker-gene search-based approaches, e.g. searching for replicons in the sequences (PlasmidFinder ); approaches based on genomic signatures, e.g. k-mer frequencies, of plasmid-derived and chromosomal DNA (cBar ); and approaches identifying plasmids based on k-mer coverage differences and/or circular paths in the assembly graph (PlasmidSPAdes , Recycler ). However, repetitive regions and/or genes found on multiple genomic units (chromosomes and plasmids) challenge the de novo assembly of short-read sequencing data, resulting in fragmented assemblies and mis-assemblies . In accordance with studies reporting on the improved contiguity of genome assemblies based on or augmented by long reads [18–21], Arredondo-Alonso et al. conclude that long-read sequencing data are expected to greatly assist in the resolution of chromosomal and extrachromosomal sequences. Unarguably, full-length genomic resolution is ultimately desirable, but despite advances in long-read sequencing, short-read-based approaches currently dominate the WGS space and can provide crucial diagnostic information. Therefore, the analysis of a cohort of clinical samples will allow improved assessment of the variance in the predictions across different taxa but also between the individual tools. Here, we analyzed the short-read WGS data of 846 Gram-negative, clinical bacterial isolates using four existing in silico plasmid prediction tools (cBar, PlasmidFinder, plasmidSPAdes and Recycler) and an ensemble approach that integrates the individual tools’ predictions. The heterogeneity between the individual tools was first assessed using reference-independent approaches. Subsequently, an ad hoc ground truth was defined. This was necessary as the herein included isolates were patient-derived and the closest reference genome needed to be identified first. De novo assembled contigs were then aligned against the respective reference chromosome(s) and plasmid(s) to identify plasmid-positive samples. This information was used to evaluate the sensitivity and precision of the individual tools and the ensemble approach. Furthermore, the differences in k-mer coverage of chromosome and plasmid sequences in plasmid-positive samples were compared. Finally, we analyzed the concordance between the predictions and the ground truth with respect to plasmid-borne antibiotic resistance genes. Materials and methods WGS and preprocessing Batches of 96 samples were sequenced per lane for paired-end sequencing (2 × 100 bp) on Illumina Hiseq2000 or Hiseq2500 sequencers using TruSeq PE Cluster v3 and TruSeq SBS v3 sequencing chemistry (Illumina) as previously described in detail . A total of 2 705 458 738 raw reads and a median of 2 987 123 reads per sample were generated. Trimmomatic version 0.35 was used with the command line parameters: ‘PE ILLUMINACLIP:NexteraPE-PE.fa:1:50:30 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36’ . Only the trimmed, paired-end reads were used herein, if not stated otherwise. De novo assembly SPAdes version 3.10.1  was used to assemble the trimmed, paired-end reads with the following parameters: ‘--careful -t 6 -k 21,33,55’. Predicting plasmid sequences plasmidSPAdes: ‘plasmidspades.py’ from SPAdes version 3.10.1 [16, 24] was used to assemble the trimmed, paired-end reads to identify candidate plasmid sequences, with the following parameters: ‘--careful -t 6 -k 21,33,55’. PlasmidFinder: Sequences for the Enterobacteriaceae were downloaded from https://bitbucket.org/genomicepidemiology/plasmidfinder_db/src (commit ID: d5a49e9b01b0) . A BLASTN database was built using ‘makeblastdb’ of ncbi-blast-2.6.0+. cBar: Version 1.2 was used . Recycler: Version 0.62 was used . The required BAM file was generated using bwa-0.7.15. Recycler’s ‘make_fasta_from_fastg.py’ was used to generate the FASTA file (from the ‘assembly_graph.fastg’ file generated by SPAdes) required to build the bwa index [25, 26]. The trimmed paired-end reads were aligned against the resulting index with ‘bwa mem’, and the SAM output was directly converted to BAM format using ‘samtools view -buS -| samtools view -bF 0x0800 -| samtools sort –’ (samtools version 0.1.19-96b5f2294a) . The resulting BAM file was indexed using ‘samtools index’. Finally, Recycler was run with the following options: ‘-g assembly_graph.fastg -k 55 -b assembly_graph.bam -i True’. Ensemble approach: To increase the sensitivity, we implemented a straightforward ensemble approach. The candidate plasmid sequences, as predicted by cBar, plasmidSPAdes, PlasmidFinder and Recycler, were pooled and clustered using ‘cd-hit-est’ from CD-HIT version 4.6.6 and the default parameters [28, 29]. Accompanying code can be found at https://github.com/claczny/2017_plasmid_prediction_review. Pairwise correlation of the individual tools’ predictions Sourmash  version 2.0.0a1 was used to compute signatures for each tool’s predicted sequences (‘compute -k 31 –scaled 50 –track-abundance’). As plasmid-derived sequences are typically much shorter than chromosomal sequences, a small scaling factor was chosen accordingly. Subsequently, for all tools, all predictions within each tool were compared against each other using sourmash’s ‘compare’ function. The resulting similarity matrices per tool were converted to distance matrices (1—similarity value), and all pairwise tool combinations were correlated using the ‘mantel’ function (‘method=“pearson”, permutations = 999, parallel = 30’) in the vegan R package  for samples occurring in both of the predictions of the respective tool pair. The superheat-function in the superheat R package was used for plotting. Complete reference genomes Nucleotide FASTA files of complete bacterial genomes were downloaded from the NCBI RefSeq database (ncbi-genome-download, https://github.com/kblin/ncbi-genome-download, version 0.2.2, parameters: --section refseq --format fasta --assembly-level complete --human-readable --parallel 5 --retries 3 --verbose bacteria, on 24 May 2017). In total, 6901 genomes were retrieved; sequences containing the word ‘plasmid’ in their FASTA header were considered as plasmids resulting in 5611 plasmid and 7415 non-plasmid sequences in total. Defining the ad hoc ground truth data Lacking dedicated, finished genomes for the present clinical, i.e. patient-derived in contrast to reference material, isolates, sourmash version 2.0.0a1 was used to identify the most similar, complete reference genome. Specifically, signatures were first computed for each complete reference genome and the contigs of each successful de novo assembly (‘compute -k 31 --scaled 2000 --track-abundance -o SEQ.sig SEQ.fa’). The reference genomes’ signatures were indexed (‘index REFIDXPREFIX -k 31 --traverse-directory PATH_TO_REF_SIGNATURES’). Subsequently, for each de novo assembly, the index was searched (‘search -k 31 ASSEMBLY.sig REFIDXPREFIX.sbt.json -o ASSEMBLY.best_only_hits.txt --best-only’), and the top hit returned by sourmash was used as the respective reference. For each isolate-reference pair, the isolate’s de novo contigs were aligned against the reference to identify plasmid or chromosome sequences using BLASTN (ncbi-blast-2.6.0+, format: ‘6 std qcovs qcovhsp qlen slen’) . For each query sequence, the subject (reference sequence) with the longest alignment length and highest query-coverage-by-subject was selected. If multiple hits existed, the hit with the highest bitscore was chosen. If multiple hits remained, the first subject representing a plasmid was chosen. Thus, each de novo contig was assigned a label whether it represents a plasmid, and contigs not matching a sequence of the closest reference genome were considered ‘unclassified’. Evaluating the predictions Reference-independent analysis of heterogeneity: Sequences were clustered as described for the ‘Ensemble approach’. The ‘clstr2txt.pl’ script from CD-HIT version 4.6.6 was used to reformat the cluster output. The reformatted output was used to compute the fraction of the cumulative length of the cluster centroids that was represented by one, two, three or all four tools. It should be noted that the length of the cluster centroid was used here as a proxy. However, the actual shared fraction could be lower if cluster members are of shorter length than the cluster centroid. Reference sequence coverage: PlasmidSPAdes and Recycler generate their own set of contigs, whereas cBar and PlasmidFinder directly identify candidate plasmid sequences on the de novo assembled contigs. Thus, to stay consistent between all tools, the predicted sequences were linked with the ad hoc ground truth sequences by using the former as queries and the latter as the subjects in BLASTN (ncbi-blast-2.6.0+, format: ‘6 std qcovs qcovhsp qlen slen’). Similar to the approach used for defining the ground truth data, for each query sequence, the subject (de novo contig) with the longest alignment length and highest query-coverage-by-subject was selected. If multiple hits existed, the hit with the highest bitscore was chosen. If multiple hits remained, the longest subject was chosen. Should still multiple hits remain, the first subject representing a plasmid was chosen. Unclassified sequences were ignored. Based on the resulting prediction-to-ground-truth mapping, the sensitivity and precision were computed using the following definitions: P= cumulative length of ground truth plasmid sequences TP=∑length(subject); if query was predicted plasmid and subject was ground truth plasmid FN= P – TP N= cumulative length of ground truth chromosome sequences FP=∑length(subject); if query was predicted plasmid and subject was ground truth chromosome TN= N – FP Sensitivity= TPP Precision= TPTP+FPThe following edge cases were considered and handled accordingly: The sample contained no plasmids and the tool predicted no plasmids: P=0, TP=0, FN=0, FP=0, TN=N The sample contained plasmids and the tool predicted no plasmids: TP=0, FN=P, FP=0, TN=N The sample contained no plasmids and the tool predicted plasmids: P=0, TP=0, FN=0Antibiotic resistance genes: Prokka version 1.11 was used to annotate the genes of the predicted and ground truth plasmid sequences . Translated coding DNA sequences were searched against the ResFams core database using hmmsearch version 3.1b2 (‘--cut_tc --tblout’) [34, 35]. The counts of ResFams hits per-sample-and-tool were compared with the respective counts of the ground truth for samples common to the respective tool and the ground truth. The Spearman correlation was computed using the ‘cor’ function in R version 3.3.2 . For each comparison, a linear model including confidence intervals was fitted using the ‘geom_smooth’ function from the ggplot package version 2.2.1 in R version 3.3.2 . Results and discussion Cultured isolates of Gram-negative bacteria from 846 clinical samples were sequenced as described in Galata et al. , and de novo assemblies were successfully created for 844 samples. We evaluated the performance of four plasmid-prediction tools: cBar, PlasmidFinder, plasmidSPAdes and Recycler. Moreover, we integrated the individual tools into an ensemble approach by merging and clustering the predictions according to their nucleotide sequence identity to remove redundant sequences. In addition to evaluating the predictions using reference-independent as well as reference-dependent approaches, the concordance between the predictions and the ground truth with respect to plasmid-borne antibiotic resistance genes was analyzed. Reference-independent assessment of plasmid predictions In a first analysis, we were mainly interested in the heterogeneity of the predictions between the individual tools. We thus performed the predictions and compared them with each other. Interestingly, all tested tools substantially varied in their number of predicted plasmid-positive samples, i.e. samples predicted to contain at least one plasmid-derived sequence. CBar predicted all 844 samples to be plasmid-positive, while plasmidSPAdes, PlasmidFinder and Recycler predicted 766, 446 and 375 plasmid-positive samples, respectively. Moreover, the cumulative lengths of the predicted sequences per sample were found to vary markedly: cBar was found to have the largest cumulative lengths, while Recycler had the lowest (Figure 1). Figure 1. View largeDownload slide Cumulative lengths of the predicted plasmid sequences per tool. The y-axis uses a log10 scale. The median values are shown and the boxplots represent the median, two hinges, two whiskers and all outlier points individually. Figure 1. View largeDownload slide Cumulative lengths of the predicted plasmid sequences per tool. The y-axis uses a log10 scale. The median values are shown and the boxplots represent the median, two hinges, two whiskers and all outlier points individually. Moreover, the tools were tested for their pairwise correlations across the predictions. CBar and plasmidSPAdes were found to exhibit the highest correlation value (0.82), suggesting that these two approaches resulted in somewhat related predictions (Figure 2). In contrast, plasmidSPAdes and Recycler were found to have the lowest pairwise correlation (0.3). Based on the clustering of the individual tools’ predictions using the ensemble approach, the tools’ heterogeneity was furthermore evaluated with respect to the shared fraction of the cumulative plasmid lengths. The largest fraction was represented by sequences predicted by a single tool (Figure 3). Conversely, all four tools were infrequently found to show pronounced overlap in their prediction. Furthermore, strong variations were observed in the fraction of cumulative length predicted by one or by two tools. This indicates a distinct heterogeneity between the individual tools’ predictions. Figure 2. View largeDownload slide Correlation of the tools’ predictions. For each tool, a distance matrix with respect to the tool’s predictions was computed. Pairwise distance matrix correlation was computed and is shown in the heatmap. The color indicates the correlation degree and correlation values are shown in each cell. Figure 2. View largeDownload slide Correlation of the tools’ predictions. For each tool, a distance matrix with respect to the tool’s predictions was computed. Pairwise distance matrix correlation was computed and is shown in the heatmap. The color indicates the correlation degree and correlation values are shown in each cell. Figure 3. View largeDownload slide Fraction of cumulative lengths shared by the tested tools. The fraction of the cumulative length is shown on the x-axis, and the number of tools exhibiting overlap of the respective sequence(s) is shown on the y-axis. The lengths of the cluster centroids were taken as proxies. Points are jittered randomly vertically per tool for representation purposes. The boxplots represent the median, two hinges and two whiskers. Figure 3. View largeDownload slide Fraction of cumulative lengths shared by the tested tools. The fraction of the cumulative length is shown on the x-axis, and the number of tools exhibiting overlap of the respective sequence(s) is shown on the y-axis. The lengths of the cluster centroids were taken as proxies. Points are jittered randomly vertically per tool for representation purposes. The boxplots represent the median, two hinges and two whiskers. Reference-dependent assessment of sensitivity and precision A complete reference genome could be identified for 818 of the 846 samples. The ad hoc definition of the ground truth was required because of the lack of dedicated, finished genomes, e.g. using complimentary long-read sequencing data, for the present set of clinical isolates. The median cumulative lengths of chromosome contigs, plasmid contigs and unclassified contigs were 4 907 449, 114 954 and 27 733 bp, respectively (Supplementary Figure S1). Seven samples had >1 Mbp of unclassified contigs and were thus excluded from further analyses, resulting in median lengths of 514.0, 437.5 and 118.0 bp, for chromosome contigs, plasmid contigs and unclassified contigs, respectively (Supplementary Figure S2). A total of 347 samples were considered to be plasmid-positive according to the ad hoc ground truth and were subsequently used to compute the sensitivity and precision of the individual tools and of the ensemble approach (Supplementary Figure S3). CBar was found to be the most sensitive (median sensitivity: 87.45%) among the individual tools, followed by plasmidSPAdes (81.49%) and PlasmidFinder (36.47%) (Figure 4). Recycler’s predictions generally had overall low cumulative lengths (Figure 1), consequently resulting in extremely low sensitivity values (median sensitivity: 0.00%). Importantly, Recycler was designed to recover circular sequences, and the present results suggest that their number was minimal in our de novo assemblies. The ensemble approach resulted in a median sensitivity value of 95.55%. Figure 4. View largeDownload slide Prediction performances of the tested tools and the ensemble approach for plasmid-positive samples based on the ad hoc ground truth. Sensitivity (‘Sens’) and precision (‘Prec’) are shown. The median values are shown and the boxplots represent the median, two hinges and two whiskers. Figure 4. View largeDownload slide Prediction performances of the tested tools and the ensemble approach for plasmid-positive samples based on the ad hoc ground truth. Sensitivity (‘Sens’) and precision (‘Prec’) are shown. The median values are shown and the boxplots represent the median, two hinges and two whiskers. Resolving the prediction performances by genus revealed strong variations, both within and between the individual tools (Supplementary Figure S4). For example, while cBar was found to exhibit overall high sensitivity values, plasmid sequences of Acinetobacter spp. were less well detected. Moreover, the sensitivity of plasmidSPAdes varied strongly for the Citrobacter spp., Enterobacter spp. and Salmonella spp. samples. PlasmidFinder exhibited particularly low sensitivity for Acinetobacter spp., which is to be expected, as this genus is a member of the Moraxellaceae family and, thus, not covered by PlasmidFinder’s Enterobactericeae-specific database. The sensitivity of the ensemble approach was found to be on par or better compared with the individual tools. While cBar had the highest median sensitivity, its median precision (27.05%) was below the median precision of PlasmidFinder (100%) and plasmidSPAdes (52.70%), indicating that cBar frequently misclassifies chromosomal contigs as being plasmid-derived (false positives). The median precision of the ensemble approach was 25.62%. Importantly, the ensemble approach included all the false-positive predictions of the individual tools, which explains the low precision. Similar to the sensitivity results resolved by genus, the precision of the individual tools varied substantially (Supplementary Figure S4). Notably, the highest median precisions were observed for Klebsiella spp. In contrast, the precision was extremely low for Acinetobacter spp., regardless of the approach being reference-dependent (cBar, PlasmidFinder) or reference-independent (plasmidSPAdes, Recycler). Hierarchical clustering of the individual tools and the ensemble approach with respect to their true-positive values revealed cBar and the ensemble approach to be the most similar, followed by plasmidSPAdes (Figure 5). Figure 5. View largeDownload slide Hierarchical clustering of individual tools according to their true-positive values. True-positive values represent the cumulative base pair length correctly covered by the individual tools and were scaled and centered before computing the hierarchical clustering. Figure 5. View largeDownload slide Hierarchical clustering of individual tools according to their true-positive values. True-positive values represent the cumulative base pair length correctly covered by the individual tools and were scaled and centered before computing the hierarchical clustering. Differential coverage between chromosome and plasmid sequences Plasmids can independently replicate [3, 4] and thus can occur in different copy numbers than the bacterial chromosome(s). PlasmidSPAdes uses this information to identify assembly graph components with (substantially) differing coverage, considering these components as candidate plasmids. However, this approach is, by design, challenged by plasmids occurring in similar copy numbers as the chromosome(s) (false negatives), or by components within the graph that exhibit coverage differences despite representing chromosomal sequences (false positives), e.g. because of bacterial cells at different stages in the replication cycle . To study how frequently plasmid sequences significantly differed in their copy numbers from the chromosome sequences, we analyzed the k-mer coverage of the de novo assembled contigs. Of the 811 isolates (818−7 samples with >1 Mbp of unclassified sequences), 28.11% (228 of 811) showed statistically significant results (alpha = 0.05; false discovery rate-adjusted: 185 of 811) when tested for unimodality of the k-mer coverage distributions (Supplementary Figure S5), suggesting that these distributions could be considered mutimodal. However, only 31.70% (110 of 347) of the plasmid-positive samples were likely multimodal in their k-mer coverage distributions. Moreover, 61.10% (212 of 347) of the plasmid-positive samples significantly differed in their k-mer coverages of the plasmid sequences and chromosome sequences (Wilcoxon-Mann-Whitney-test, P < 0.05). It should be noted that plasmidSPAdes median sensitivity was higher (81.29%; Figure 4); yet, this was computed using the sequence coverage of plasmid sequences rather than number of samples. Correctly predicted, long-assembled sequences will increase the true-positive value, thereby leading to higher sensitivity values. Antibiotic resistance factors encoded in plasmid sequences The in silico separation of genomic sequences into ‘chromosome-derived’ or ‘extrachromosome-derived’ has proven to be a challenging task as demonstrated herein as well as in [12, 13]. Nevertheless, the identification of candidate plasmid-derived sequences in fragmented assemblies is relevant. Specifically, the functional potential can thus be assessed for the candidates. To this end, antibiotic resistance genes included in ResFams were identified on the predicted and ground truth plasmid sequences of plasmid-positive samples. The number of ResFams hits was found to vary within and between the individual tools but also for the ground truth (Figure 6). PlasmidFinder and Recycler recovered few of the expected ResFams hits, which is in accordance with the reduced sensitivity observed herein (Figure 4). CBar and plasmidSPAdes were found to more closely represent the ground truth distribution of the ResFams hits. Only plasmidSPAdes exhibited a higher number of hits than found in the ground truth. These extra hits might represent chromosome-borne antibiotic resistance genes. As plasmidSPAdes uses coverage information for its predictions, it could be speculated that the respective chromosomal regions exhibited differential coverage to the remainder of the chromosome. While there are various potential reasons as to why this could occur, e.g. competitive advantage under antibiotic pressure and thus increased replication, the exact reason is currently unknown. Moreover, the ResFams hit counts were compared pairwise between the ground truth and the individual tools, and the respective Spearman correlations were computed (Figure 7). CBar and plasmidSPAdes were found to be the closest to represent the ground truth, with cBar exhibiting a higher correlation (0.68 versus 0.56), likely because of the increased variation toward low or high counts for plasmidSPAdes. Figure 6. View largeDownload slide ResFams hits counts of plasmid-positive samples. The number of samples per tool is shown in parentheses. Only plasmid-positive samples with at least one ResFams hit are shown. Points are jittered randomly horizontally per tool for representation purposes. The boxplots represent the median, two hinges and two whiskers. Figure 6. View largeDownload slide ResFams hits counts of plasmid-positive samples. The number of samples per tool is shown in parentheses. Only plasmid-positive samples with at least one ResFams hit are shown. Points are jittered randomly horizontally per tool for representation purposes. The boxplots represent the median, two hinges and two whiskers. Figure 7. View largeDownload slide Comparison of ResFams hits against the ad hoc ground truth. ResFams hits counts of the four herein tested tools are plotted against the respective counts in the ground truth for paired samples. A linear model was fitted (black line) and confidence intervals are shown (in orange). Moreover, a two-dimensional density estimate is plotted with transparency increasing with decreasing point density. Figure 7. View largeDownload slide Comparison of ResFams hits against the ad hoc ground truth. ResFams hits counts of the four herein tested tools are plotted against the respective counts in the ground truth for paired samples. A linear model was fitted (black line) and confidence intervals are shown (in orange). Moreover, a two-dimensional density estimate is plotted with transparency increasing with decreasing point density. Conclusion The importance of WGS has been repeatedly demonstrated for taxonomic identification of microorganisms, with its application in infectious disease diagnostics and in epidemiological studies providing direct benefits to individuals and the general public [39–41]. Furthermore, the indiscriminate extraction of the entire microbial genomic complement enables concurrent sequencing of chromosomal and extrachromosomal sequences, e.g. plasmids in bacteria. This is especially relevant as plasmid-encoded functions can strongly affect the bacterial phenotype, thus providing crucial information beyond chromosomes and taxonomy [42–45]. To this end, the present study analyzed the performances of four plasmid prediction tools on de novo assemblies of 846 Gram-negative WGS clinical isolates using reference-independent and reference-dependent evaluation approaches. With respect to the latter, the use of patient-derived isolates, in contrast to reference material, required the definition of an ad hoc ground truth. This approach was found to be robust for the plasmid-positive samples, as the cumulative length of unclassified sequences was limited. However, plasmid sequences, in particular if they were recently acquired, might have been missed; yet, this would not negatively affect the present evaluation, as unclassified sequences were ignored. Moreover, plasmid sequences recently introduced in the chromosome(s) or plasmid sequences homologous to chromosome sequences might represent confounding factors in the definition of the ad hoc ground truth. This further highlights the importance of full-length assemblies/reference genomes, which were, however, unavailable for the herein included isolates, and the generation of this complementary data was beyond the scope of the current study. Overall, no single-best approach was identified and pronounced variations in heterogeneity between the tools were observed, with cBar and plasmidSPAdes showing the strongest correlation. Moreover, the diversity of the present samples comprised 11 genera of at least 20 samples and allowed to reveal taxon-dependent variation, both, within tools and between tools. Interestingly, Acinetobacter-borne plasmids were less well detected by cBar, resulting in a low sensitivity, which may be because of a limited representation of this genus in the reference database that was originally used for cBar’s training. Furthermore, the generally low precision for this specific genus suggests that Acinetobacter spp. infections may require dedicated analyses and attention, e.g. in the case of plasmid-carrying, multidrug-resistant Acinetobacter baumannii organisms [46–48]. The taxon-dependent variation in the tools’ performances highlights the importance of concurrent identification of taxonomy and functional potential and the need for reference databases with an increased diversity, e.g. improved coverage of Acinetobacter spp. by cBar and PlasmidFinder. Moreover, we showed that copy numbers of plasmid sequences need not necessarily vary significantly to the copy number of the chromosome(s), thereby limiting coverage-based approaches. Accordingly, the use of complementary approaches that could lend mutual support, e.g. using cBar and plasmidSPAdes, appears sensible. In addition to the individual tools, an ensemble approach integrating the four independent predictions was evaluated. Overall, the sensitivity was found to be increased and less variable. However, the combination of the individual tools also led to reduced precision. Accordingly, the ensemble approach represents an interesting solution if the objective is to maximize the sensitivity, and false positives are acceptable and/or can be removed downstream, e.g. by identifying sequences with exceptionally high or low fold-coverage or by identifying sequences encoding relevant factors, such as antibiotic resistance genes. This approach is, however, not intended to replace the development of improved databases and prediction algorithms in the future. An example of the fast developments in this field is PlasmidTron, which was published, while the present manuscript was in revision . Moreover, PLACNET represents a recently published approach for the plasmid reconstruction from WGS data . It was excluded from the present evaluation of fully automated tools because of a manual pruning step in PLACNET’s workflow. The reconstructed genomic sequences, including the plasmid sequences, remained fragmented in the present study, which is in accordance with the results reported by Arredondo-Alonso et al. . While long-read-based sequencing greatly improves the contiguity of genome assemblies [18, 19, 51], plasmid prediction tools can strongly reduce the search space for short-read-based data. Importantly, despite the frequent prediction of false positives, the accordance in the number of antibiotic resistance genes with respect to the ground truth was found to be high for cBar and plasmidSPAdes. Overall, this is expected to support precision medicine by reducing the time and work burden required for data examination. Furthermore, the present study illustrates that specific objectives are met by specific approaches and, thus, systematic benchmarking on extensive and curated data sets is important for the translation of bioinformatics tools from research to clinical application. Key Points Extrachromosomal DNA in the form of plasmids can carry phenotype-relevant information, e.g. antibiotic resistance factors. Next-generation sequencing of isolates allows linkage of taxonomy and extrachromosomal functional potential via concurrent resolution of chromosomal and extrachromosomal DNA. Existing in silico plasmid-prediction approaches showed limited agreement as well as strong inter- and intra-taxon variability on a set of 846 WGS clinical bacterial isolates. Combining the individual predictions resulted in increased sensitivity while reducing precision. Antibiotic resistance gene counts on predicted plasmid sequences were not strongly affected by false-positive predictions. Supplementary Data Supplementary data are available online at http://bib.oxfordjournals.org/. Andreas E. Posch is a Managing Director of Ares Genetics GmbH. Cedric C. Laczny is a Postdoc at the Chair for Clinical Bioinformatics at Saarland University. Valentina Galata is a PhD student at the Chair of Clinical Bioinformatics at Saarland University. Achim Plum is a Managing Director of Ares Genetics GmbH and Curetis GmbH. Andreas Keller is a professor and head of the Chair for Clinical Bioinformatics at Saarland University. Acknowledgement The authors would like to thank Curetis GmbH and Ares Genetics GmbH for the support as well as the provided data set. Funding In parts by the Best Ageing (grant number 306031) from the European Union. Availability of WGS data The raw WGS data are available on a reasonable request for academic research use only after signing a data transfer agreement. References 1 Carattoli A. Resistance plasmid families in Enterobacteriaceae. Antimicrob Agents Chemother 2009; 53( 6): 2227– 38. doi: 10.1128/AAC.01707-08 Google Scholar CrossRef Search ADS PubMed 2 Frost LS, Leplae R, Summers AO, et al. Mobile genetic elements: the agents of open source evolution. Nat Rev Microbiol 2005; 3( 9): 722– 32. doi: 10.1038/nrmicro1235 Google Scholar CrossRef Search ADS PubMed 3 Scott JR. Regulation of plasmid replication. Microbiol Rev 1984; 48( 1): 1– 23. Google Scholar PubMed 4 del Solar G, Giraldo R, Ruiz-Echevarría MJ, et al. Replication and control of circular bacterial plasmids. Microbiol Mol Biol Rev 1998; 62( 2): 434– 64. doi: 1092-2172/98/$04.0010 Google Scholar PubMed 5 Conlan S, Park M, Deming C, et al. Plasmid dynamics in KPC-positive Klebsiella pneumoniae during long-term patient colonization. MBio 2016: 2: e000085. doi: 10.1128/mBio.00742-16 6 Liu Y-Y, Wang Y, Walsh TR, et al. Emergence of plasmid-mediated colistin resistance mechanism MCR-1 in animals and human beings in China: a microbiological and molecular biological study. Lancet Infect Dis 2016; 16: 161– 8. doi: 10.1016/S1473-3099(15)00424-7 Google Scholar CrossRef Search ADS PubMed 7 Zhi C, Lv L, Yu L-F, et al. Dissemination of the mcr-1 colistin resistance gene. Lancet Infect Dis 2016; 16: 292– 3. doi: 10.1016/S1473-3099(16)00063-3 Google Scholar CrossRef Search ADS PubMed 8 Couturier M, Bex F, Bergquist PL, et al. Identification and classification of bacterial plasmids. Microbiol Rev 1988; 52( 3): 375– 95. Google Scholar PubMed 9 Carattoli A, Bertini A, Villa L, et al. Identification of plasmids by PCR-based replicon typing. J Microbiol Methods 2005; 63( 3): 219– 28. doi: 10.1016/j.mimet.2005.03.018 Google Scholar CrossRef Search ADS PubMed 10 Francia MV, Varsaki A, Garcillán-Barcia MP, et al. A classification scheme for mobilization regions of bacterial plasmids. FEMS Microbiol Rev 2004; 28( 1): 79– 100. doi: 10.1016/j.femsre.2003.09.001 Google Scholar CrossRef Search ADS PubMed 11 Alvarado A, Garcillán-Barcia MP, de la Cruz F. A degenerate primer MOB typing (DPMT) method to classify gamma-proteobacterial plasmids in clinical and environmental settings. PLoS One 2012; 7( 7): e40438. doi: 10.1371/journal.pone.0040438 Google Scholar CrossRef Search ADS PubMed 12 Orlek A, Stoesser N, Anjum MF, et al. Plasmid classification in an era of whole-genome sequencing: application in studies of antibiotic resistance epidemiology. Front Microbiol 2017; 8: 182. doi: 10.3389/fmicb.2017.00182 Google Scholar CrossRef Search ADS PubMed 13 Arredondo-Alonso S, Willems RJ, van Schaik W, et al. On the (im)possibility of reconstructing plasmids from whole-genome short-read sequencing data. Microb Genomics 2017; 1– 18. doi: 10.1099/mgen.0.000128 14 Carattoli A, Zankari E, García-Fernández A, et al. In silico detection and typing of plasmids using PlasmidFinder and plasmid multilocus sequence typing. Antimicrob Agents Chemother 2014; 58( 7): 3895– 903. doi: 10.1128/AAC.02412-14 Google Scholar CrossRef Search ADS PubMed 15 Zhou F, Xu Y. cBar: a computer program to distinguish plasmid-derived from chromosome-derived sequence fragments in metagenomics data. Bioinformatics 2010; 26( 16): 2051– 2. doi: 10.1093/bioinformatics/btq299 Google Scholar CrossRef Search ADS PubMed 16 Antipov D, Hartwick N, Shen M, et al. plasmidSPAdes: assembling plasmids from whole genome sequencing data. Bioinformatics 2016; 32: 3380– 7. doi: 10.1093/bioinformatics/btw493 Google Scholar CrossRef Search ADS PubMed 17 Rozov R, Brown Kav A, Bogumil D, et al. Recycler: an algorithm for detecting plasmids from de novo assembly graphs. Bioinformatics 2016; 33: 475– 82. doi: 10.1093/bioinformatics/btw651 18 English AC, Richards S, Han Y, et al. Mind the gap: upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS One 2012; 7( 11): e47768. doi: 10.1371/journal.pone.0047768 Google Scholar CrossRef Search ADS PubMed 19 Reuter S, Hunt M, Peacock SJ, et al. Comparison of bacterial genome assembly software for MinION data and their applicability to medical microbiology. Microb Genomics 2016; 2( 9): e000085. doi: 10.1099/mgen.0.000085 20 George S, Pankhurst L, Hubbard A, et al. Resolving plasmid structures in Enterobacteriaceae using the MinION nanopore sequencer: assessment of MinION and MinION/Illumina hybrid data assembly approaches. Microb Genomics 2017; 3( 8): 10. http://dx.doi.org/10.1099/mgen.0.000118 21 Wick RR, Judd LM, Gorrie CL, et al. Unicycler: resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol 2017; 13( 6): e1005595. doi: 10.1371/journal.pcbi.1005595 Google Scholar CrossRef Search ADS PubMed 22 Galata V, Backes C, Laczny CC, et al. Comparing genome versus proteome-based identification of clinical bacterial isolates. Brief Bioinform 2016; doi:10.1093/bib/bbw122. doi: 10.1093/bib/bbw122 23 Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014; 30( 15): 2114– 20. doi: 10.1093/bioinformatics/btu170 Google Scholar CrossRef Search ADS PubMed 24 Bankevich A, Nurk S, Antipov D, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 2012; 19( 5): 455– 77. doi: 10.1089/cmb.2012.0021 Google Scholar CrossRef Search ADS PubMed 25 Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009; 25( 14): 1754– 60. doi: 10.1093/bioinformatics/btp324 Google Scholar CrossRef Search ADS PubMed 26 Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv [q-bio.GN] 2013;1303.3997v1. 27 Li H, Handsaker B, Wysoker A, et al. The sequence alignment/map format and SAMtools. Bioinformatics 2009; 25( 16): 2078– 9. doi: 10.1093/bioinformatics/btp352 Google Scholar CrossRef Search ADS PubMed 28 Fu L, Niu B, Zhu Z, et al. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 2012; 28( 23): 3150– 2. doi: 10.1093/bioinformatics/bts565 Google Scholar CrossRef Search ADS PubMed 29 Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006; 22( 13): 1658– 9. doi: 10.1093/bioinformatics/btl158 Google Scholar CrossRef Search ADS PubMed 30 Titus Brown C, Irber L. sourmash: a library for MinHash sketching of DNA. J Open Source Softw 2016; 1( 5): 27. doi: 10.21105/joss.00027 Google Scholar CrossRef Search ADS 31 Oksanen J, Blanchet FG, Friendly M, et al. vegan: Community Ecology Package, 2017. 32 Altschul SF, Gish W, Miller W, et al. Basic local alignment search tool. J Mol Biol 1990; 215( 3): 403– 10. doi: 10.1016/S0022-2836(05)80360-2 Google Scholar CrossRef Search ADS PubMed 33 Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 2014; 30( 14): 2068– 9. doi: 10.1093/bioinformatics/btu153 Google Scholar CrossRef Search ADS PubMed 34 Eddy SR. Profile hidden Markov models. Bioinformatics 1998; 14( 9): 755– 63. http://dx.doi.org/10.1093/bioinformatics/14.9.755 Google Scholar CrossRef Search ADS PubMed 35 Gibson MK, Forsberg KJ, Dantas G. Improved annotation of antibiotic resistance determinants reveals microbial resistomes cluster by ecology. Isme J 2015; 9( 1): 207– 16. doi: 10.1038/ismej.2014.106 Google Scholar CrossRef Search ADS PubMed 36 R Core Team. R: A Language and Environment for Statistical Computing , 2016. 37 Wickham H. ggplot2: Elegant Graphics for Data Analysis . New York, NY: Springer-Verlag, 2009. 38 Korem T, Zeevi D, Suez J, et al. Growth dynamics of gut microbiota in health and disease inferred from single metagenomic samples. Science 2015; 349( 6252): 1101– 6. doi: 10.1126/science.aac4812 Google Scholar CrossRef Search ADS PubMed 39 Grumaz S, Stevens P, Grumaz C, et al. Next-generation sequencing diagnostics of bacteremia in septic patients. Genome Med 2016; 8( 1): 73. doi: 10.1186/s13073-016-0326-8 Google Scholar CrossRef Search ADS PubMed 40 Loman NJ, Constantinidou C, Christner M, et al. A culture-independent sequence-based metagenomics approach to the investigation of an outbreak of Shiga-toxigenic Escherichia coli O104: H4. Jama 2013; 309( 14): 1502– 10. doi: 10.1001/jama.2013.3231 Google Scholar CrossRef Search ADS PubMed 41 Zhou K, Lokate M, Deurenberg RH, et al. Characterization of a CTX-M-15 producing Klebsiella pneumoniae outbreak strain assigned to a novel sequence type (1427). Front Microbiol 2015; 6: 1250. doi: 10.3389/fmicb.2015.01250 Google Scholar PubMed 42 von Wright A, Tynkkynen S. Construction of Streptococcus lactis subsp. lactis strains with a single plasmid associated with mucoid phenotype. Appl Environ Microbiol 1987; 53( 6): 1385– 6. Google Scholar PubMed 43 Matsui H, Bacot CM, Garlington WA, et al. Virulence plasmid-borne spvB and spvC genes can replace the 90-kilobase plasmid in conferring virulence to Salmonella enterica serovar typhimurium in subcutaneously inoculated mice. J Bacteriol 2001; 183( 15): 4652– 8. doi: 10.1128/JB.183.15.4652-4658.2001 Google Scholar CrossRef Search ADS PubMed 44 Hammerl JA, Freytag B, Lanka E, et al. The pYV virulence plasmids of Yersinia pseudotuberculosis and Y. pestis contain a conserved DNA region responsible for the mobilization by the self-transmissible plasmid pYE854. Environ Microbiol Rep 2012; 4( 4): 433– 8. doi: 10.1111/j.1758-2229.2012.00353.x Google Scholar CrossRef Search ADS PubMed 45 Guiney DG, Fang FC, Krause M, et al. Plasmid-mediated virulence genes in non-typhoid Salmonella serovars. FEMS Microbiol Lett 1994; 124( 1): 1– 9. http://dx.doi.org/10.1111/j.1574-6968.1994.tb07253.x Google Scholar CrossRef Search ADS PubMed 46 Huang H, Dong Y, Yang Z-L, et al. Complete sequence of pABTJ2, a plasmid from Acinetobacter baumannii MDR-TJ, carrying many phage-like elements. Genomics Proteomics Bioinformatics 2014; 12( 4): 172– 7. doi: 10.1016/j.gpb.2014.05.001 Google Scholar CrossRef Search ADS PubMed 47 Weber BS, Ly PM, Irwin JN, et al. A multidrug resistance plasmid contains the molecular switch for type VI secretion in Acinetobacter baumannii. Proc Natl Acad Sci USA 2015; 112( 30): 9442– 7. doi: 10.1073/pnas.1502966112 Google Scholar CrossRef Search ADS PubMed 48 Hamidian M, Holt KE, Pickard D, et al. A small Acinetobacter plasmid carrying the tet39 tetracycline resistance determinant. J Antimicrob Chemother 2016; 71( 1): 269– 71. doi: 10.1093/jac/dkv293 Google Scholar CrossRef Search ADS PubMed 49 Page AJ, Wailan A, Shao Y, et al. PlasmidTron: assembling the cause of phenotypes from NGS data. bioRxiv 2017. https://doi.org/10.1101/188920. 50 Lanza VF, de Toro M, Garcillán-Barcia MP, et al. Plasmid flux in Escherichia coli ST131 sublineages, analyzed by plasmid constellation network (PLACNET), a new method for plasmid reconstruction from whole genome sequences. PLoS Genet 2014; 10: e1004766. doi: 10.1371/journal.pgen.1004766 Google Scholar CrossRef Search ADS PubMed 51 Quick J, Quinlan AR, Loman NJ. A reference bacterial genome dataset generated on the MinIONTM portable single-molecule nanopore sequencer. Gigascience 2014; 3( 1): 22. doi: 10.1186/2047-217X-3-22 Google Scholar CrossRef Search ADS PubMed © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: firstname.lastname@example.org
Briefings in Bioinformatics – Oxford University Press
Published: Dec 5, 2017
It’s your single place to instantly
discover and read the research
that matters to you.
Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.
All for just $49/month
Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly
Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.
Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.
Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.
All the latest content is available, no embargo periods.
“Hi guys, I cannot tell you how much I love this resource. Incredible. I really believe you've hit the nail on the head with this site in regards to solving the research-purchase issue.”Daniel C.
“Whoa! It’s like Spotify but for academic articles.”@Phil_Robichaud
“I must say, @deepdyve is a fabulous solution to the independent researcher's problem of #access to #information.”@deepthiw
“My last article couldn't be possible without the platform @deepdyve that makes journal papers cheaper.”@JoseServera