dbOTU3: A new implementation of distribution-based OTU calling
dbOTU3: A new implementation of distribution-based OTU calling
Olesen, Scott W.;Duvallet, Claire;Alm, Eric J.;
2017-05-04 00:00:00
a1111111111 a1111111111 Distribution-based operational taxonomic unit-calling (dbOTU) improves on other a1111111111 approaches by incorporating information about the input sequences' distribution across a1111111111 samples. Previous implementations of dbOTU presented challenges for users. Here we introduce and evaluate a new implementation of dbOTU that is faster and more user- friendly. We show that this new implementation has theoretical and practical improvements over previous implementations of dbOTU, making the algorithm more accessible to micro- OPENACCESS bial ecology and biomedical researchers. Citation: Olesen SW, Duvallet C, Alm EJ (2017) dbOTU3: A new implementation of distribution- based OTU calling. PLoS ONE 12(5): e0176335. https://doi.org/10.1371/journal.pone.0176335 Editor: Jonathan H. Badger, National Cancer Institute, UNITED STATES Introduction Received: December 20, 2016 Preheim et al. [1] formulated the distribution-based OTU-calling (dbOTU) algorithm, an extremely accurate algorithm for grouping DNA sequences from microbial communities into Accepted: April 10, 2017 operational taxonomic units (OTUs) for ecological or biomedical research. Unlike most OTU- Published: May 4, 2017 calling approaches, which group sequences based only on the similarities of the sequences Copyright:© 2017 Olesen et al. This is an open themselves, this algorithm also uses information about the distribution of sequences across access article distributed under the terms of the samples. This allows dbOTU to distinguish ecologically-distinct but sequence-similar organ- Creative Commons Attribution License, which isms or populations. permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. The algorithm Data Availability Statement: All relevant data are Motivation. The algorithm aims to separate genetically-similar sequences that appear to within the paper and its Supporting Information be ecologically distinct (or, conversely, to join less genetically-similar sequences that appear to files. be ecologically identical). For example, if two sequences differ by only one nucleotide, an Funding: This material is based on work supported OTU-calling algorithm would likely group those two sequences into the same OTU. However, by the Center for Microbiome Informatics and if the two sequences never appeared together in the same sample, an observer would probably Therapeutics at MIT (microbiome.mit.edu). The conclude that that one nucleotide difference corresponds to two distinct groups of organisms, funders had no role in study design, data collection and analysis, decision to publish, or preparation of one which lives in one group of samples, the other living in the other. the manuscript. Conversely, if two sequences differed by a few nucleotides, an OTU-calling algorithm would probably place those two sequences into different OTUs. However, if the two sequences Competing interests: The authors have declared that no competing interests exist. appeared in the same ratio in all samples (e.g., sequence 2 was always almost exactly ten times PLOS ONE | https://doi.org/10.1371/journal.pone.0176335 May 4, 2017 1 / 13 Distribution-based OTU calling less abundant than sequence 1), an observer might conclude that the second sequence was either sequencing error or a member of the same ecological population as the first sequence. The original workflow. The workflow of the original dbOTU implementation was: 1. Process 16S rRNA data up to dereplicated sequences. 2. Create a table of sequence counts showing the number of times each sequence appears in each sample. 3. Align the dereplicated sequences. Using the alignment, make a phylogenetic tree and a ªdis- tance matrixº showing the genetic dissimilarity between sequences. 4. Feed the matrix and the table of sequence counts into the algorithm proper, which groups the sequences into OTUs. Overview of the algorithm. The pipeline's last step is the dbOTU algorithm proper and was the main contribution of Preheim et al. [1]. The algorithm is summarized in Fig 2 of that publication. In outline, the algorithm was: 1. Make the most abundant sequence an OTU. 2. For each sequence (in order of decreasing abundance), find the set of OTUs that meet ªabundanceº and ªgeneticº criteria. The abundance criterion requires that the candidate sequence be some fold less abundant than the OTU (e.g., so that it can be considered sequencing error). The genetic criterion requires that the candidate sequence be sufficiently similar to the OTU's sequence (e.g., so that it can be considered sequencing error or part of the same population of organisms). 3. If no OTUs meet these two criteria, make the candidate sequence into a new OTU. 4. If OTUs do meet these criteria, then, starting with the most genetically-similar OTU, check if the candidate sequence is distributed differently among the samples than that OTU. If the distributions are sufficiently similar, merge the candidate sequence into that OTU. Specifi- cally, add the candidate sequence's counts across samples to the OTU's counts. 5. If the candidate sequence does not have a distribution across sample sufficiently similar to an existing OTU, then make this sequence a new OTU. 6. Move on to the next candidate sequence. Note that an OTU's counts change every time a candidate sequence is merged into that OTU, but an OTU's sequence never changes. In other words, an OTU's candidate sequence is the sequence of its most abundant member. Variations on the dbOTU algorithm are characterized mainly by their choice of genetic cri- terion, which determines which sequences are sufficiently genetically similar that they are allowed to merged into the same OTU, and distribution criterion, which determines which sequences are distributed sufficiently similarly across samples to be allowed to be merged. Previous implementations The dbOTU algorithm has been implemented twice. Here we will introduce a third implemen- tation. The implementations vary in terms of: · the exact input files they require, · how they evaluate the genetic (i.e., sequence similarity) criterion, PLOS ONE | https://doi.org/10.1371/journal.pone.0176335 May 4, 2017 2 / 13 Distribution-based OTU calling Table 1. Comparison of the dbOTU implementations. Implementation Programming Required input Genetic criterion Distribution languages criterion dbOTU1 Perl, R Matrices of genetic distances, sequence count Values from input genetic distance Simulatedχ test table matrix* dbOTU2 Python 2, R Unaligned and aligned sequences, sequence Proportion of mismatched sites* Simulatedχ test count table dbOTU3 Python 2/3 Unaligned sequences, sequence count table Levenshtein edit distance Likelihood-ratio test *The ®rst two implementations recommended inputting information about dissimilarity of aligned and unaligned sequences, and the dissimilarity used in the genetic criterion was the minimum of those two dissimilarities for each pair of sequences. https://doi.org/10.1371/journal.pone.0176335.t001 · how they evaluate the distribution (i.e., ecological similarity) criterion, and · the details of the software itself. These differences are summarized in Table 1. The first implementation. The original implementation (github.com/spacocha/ Distribution-based-clustering) which we term ªdbOTU1º, was coded in Perl and R (with accessory shell scripts) and took matrices of genetic dissimilarities and a table of sequence counts as input. The table of sequence counts, similar to an OTU table, countained information about how many times each unique sequence appeared in each sample. The algorithm performed best when it was provided with two dissimilarity matrices, one computed from unaligned sequences and one computed from aligned sequences. To per- form the alignment, the original publication recommended thealign.seqs command in mothur [2]. To compute the dissimilarities matrices, the publication recommended the -makematrix option in FastTree [3]. The values in FastTree's dissimilarity matrix are Jukes- 3 4d Cantor distances log 1 , where d is the proportion of mismatched sites, that is, the 4 3 number of mismatches in the aligned sequences divided by the length of the sequences. (If the sequences are aligned and contain gaps, the gaps are not counted in the denominator.) The genetic criterion in dbOTU1 was articulated as the minimum of the aligned and unaligned Jukes-Cantor distances, which was a work-around for the fact that using the aligned sequences sometimes led to a greater distance between two sequences than would be computed by comparing the unaligned sequences. The distribution criterion was evaluated using the p-value from theχ test of independence as implemented in thechisq.test function in R [4], which was called in a separate process from a Perl script. The p-value was compared against a threshold for determining whether two sequences should be merged into one OTU. If the p-value fell below some treshold, the distri- butions of the two sequences across the samples were considered too distinct for the two sequences to be merged. (Because the p-values were used as an operational threshold for merg- ing OTUsÐrather than as explicit statements about the statistical significance of the similarity of two sequences' distributionsÐthese p-values were never subjected to multiple hypothesis correction.) Many of the comparisons involved sequences with small numbers of counts, for which the analytical, asymptotic calculation of the p-value of aχ test is not accurate. This implementa- tion therefore used a simulated p-value, available through the R command'ssimulate.p. value option, with 10 simulated contingency tables. This empirical calculation was slow and was the principal bottleneck in running dbOTU1. The second implementation. The second implementation (github.com/spacocha/ dbOTUcaller), which we call ªdbOTU2º, was coded in Python 2 and interfaced with R PLOS ONE | https://doi.org/10.1371/journal.pone.0176335 May 4, 2017 3 / 13 Distribution-based OTU calling usingrpy2 (rpy2.bitbucket.org) (with accessory Perl scripts). It took unaligned sequences, aligned sequences, and the sequence count table as input. Like dbOTU1, dbOTU2 used the minimum of the aligned and unaligned genetic dissimi- larities for its genetic criterion. Unlike dbOTU1, which used the Jukes-Cantor distance, dbOTU2 used the simple proportion d of mismatched sites and computed these dissimilarities directly from the sequence files only when required by the algorithm. This selective computa- tion avoided loading the matrix of N pairwise distances. Hereafter we refer to this metric, the minimum of aligned and unaligned proportion of mismatched sites, as the dbOTU2 metric. Like the first implementation, this one used the simulated p-value from R'schisq.test for its distribution criterion. Unlike dbOTU1, dbOTU2 called the test viarpy2 from the Python script. This removed the need for reading and writing temporary files, but it was still slow and required both R and Python. This implementation This implementation, dbOTU3, aims to improve speed and ease of use. It is written in pure Python and is compatible with Python 2 and 3. For the genetic criterion, this implementation uses the Levenshtein edit distance, the number of single-position insertions, deletions, or sub- stitutions required to transform one sequence into another, as an approximation for the sequence dissimilarity. (The Levenshtein distance is implemented in thepython- Levenshtein package;github.com/ztane/python-Levenshtein.) For the distri- bution criterion, this implementation uses a likelihood-ratio test. The merit of these choices for the genetic and distribution criteria are discussed in the Results. Methods New genetic and distribution criteria This implementation evaluates the genetic criterion using the Levenshtein edit distance: a can- didate sequence will not be merged into an OTU if 2E/(ℓ + ℓ ) is greater than some threshold, 1 2 where E is the Levenshtein edit distance, ℓ is the length of the candidate sequence, and ℓ is 1 2 the length of the OTU's sequence. As shown in the Results, this metric is a good approximation of the proportion of mismatched sites in an alignment. This implementation evaluates the distribution criterion using the p-value from a likeli-
i
i hood-ratio test. Define x as the number of counts that the OTU has in sample i and x as the 1 2
i number of counts the candidate sequence has. Define also X x and similarly X . 1 i 1 The alternative hypothesis for this test is that the OTU and candidate sequence are distrib-
i
i uted ªdifferentlyº, that is, that each of the x and x are drawn from different random vari- 1 2 ables, each of which we model as Poisson-distributed [5]. Thus, the alternative hypothesis is
i
i
i
i H : x Poisson l and x Poisson l ;
1 1 2 1 1 2 where there are no constraints on the Poisson parameters. The null model asserts that the OTU and candidate sequence are distributed ªthe sameº, that is, that the candidate sequence's counts in each sample is drawn from a Poisson random variable whose parameter is proportional to the parameter of the OTU's Poisson variables, where the constant of proportionality is the same across samples. Specifically, the null model is
i
i
i
i H : x Poisson l and x Poisson rl ;
2 1 2 We expect that 0 < ρ < 1 because the candidate sequence is overall less abundant than the PLOS ONE | https://doi.org/10.1371/journal.pone.0176335 May 4, 2017 4 / 13 Distribution-based OTU calling OTU. Asserting maximum likelihood for each model shows that
i
i l x
3 1 1
i
i l x
4 r
5
i
i
i 1 l x x
6 1 2 X X 1 2 so that the test statistic is L 2
L L 2f
x x