Rapid Classification and Identification of Multiple Microorganisms with Accurate Statistical Significance via High-Resolution Tandem Mass Spectrometry

Rapid Classification and Identification of Multiple Microorganisms with Accurate Statistical... B The Author(s), 2018 J. Am. Soc. Mass Spectrom. (2018) 29:1721Y1737 DOI: 10.1007/s13361-018-1986-y RESEARCH ARTICLE Rapid Classification and Identification of Multiple Microorganisms with Accurate Statistical Significance via High-Resolution Tandem Mass Spectrometry 1 2 1 3 2 Gelio Alves, Guanghui Wang, Aleksey Y. Ogurtsov, Steven K. Drake, Marjan Gucek, 4 1 David B. Sacks, Yi-Kuo Yu National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA Proteomics Core, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, MD 20892, USA Critical Care Medicine Department, Clinical Center, National Institutes of Health, Bethesda, MD 20892, USA Department of Laboratory Medicine, Clinical Center, National Institutes of Health, Bethesda, MD 20892, USA Abstract. Rapid and accurate identification and classification of microorganisms is of paramount importance to public health and safety. With the advance of mass spectrometry (MS) technology, the speed of identification can be greatly im- proved. However, the increasing number of mi- crobes sequenced is complicating correct micro- bial identification even in a simple sample due to the large number of candidates present. To prop- erly untwine candidate microbes in samples con- taining one or more microbes, one needs to go beyond apparent morphology or simple Bfingerprinting^;to correctly prioritize the candidate microbes, one needs to have accurate statistical significance in microbial identification. We meet these challenges by using peptide-centric representations of microbes to better separate them and by augmenting our earlier analysis method that yields accurate statistical significance. Here, we present an updated analysis workflow that uses tandem MS (MS/MS) spectra for microbial identification or classification. We have demonstrated, using 226 MS/MS publicly available data files (each containing from 2500 to nearly 100,000 MS/MS spectra) and 4000 additional MS/MS data files, that the updated workflow can correctly identify multiple microbes at the genus and often the species level for samples containing more than one microbe. We have also shown that the proposed workflow computes accurate statistical significances, i.e., E values for identified peptides and unified E values for identified microbes. Our updated analysis workflow MiCId, a freely available software for Microorganism Classification and Identification, is available for download at https://www. ncbi.nlm.nih.gov/CBBresearch/Yu/downloads.html. Keywords: Pathogen identification, Microorganism classification, Statistical significance, Mass, Spectrometry, Proteomics Received: 13 November 2017/Revised: 30 March 2018/Accepted: 25 April 2018/Published Online: 5 June 2018 Introduction apid and accurate identification and classification of mi- Electronic supplementary material The online version of this article (https:// Rcroorganisms is of paramount importance to public health doi.org/10.1007/s13361-018-1986-y) contains supplementary material, which and safety [1–3]. Traditional methods for microbial identifica- is available to authorized users. tions target only a limited number of microorganisms [4, 5]and often require 72 h or more to carry out [6–8]. For most Correspondence to: Yi–Kuo Yu; e-mail: yyu@ncbi.nlm.nih.gov 1722 G. Alves et al.: Identification of Microorganisms microorganismal identification protocols, the first step can take cannot be used for identification in a sample containing multi- the longest. This time-consuming step involves preparing a ple microbes [27, 28]. Given that our main goal is to robustly culture of the collected sample in a selected medium, usually and accurately identify multiple microbes in mixed samples, in a blood culture, to test for the presence of any microbes and to our analyses, we use only data collected from high-resolution amplify the concentration of microbes that might be present [7– instruments using high-performance liquid chromatography- 9]. If the prepared culture tests positive for the presence of tandem MS (LC-MS/MS) [29] to mitigate one of the challenges microbes, further tests are required to distinguish and identify (noise in data) that hinders the identification of multiple mi- within the sample each microbe present [7, 8, 10, 11]. crobes. Another challenge appears due to the fast expansion of One of these tests is the analytical profile index (API), database size. In order not to miss identify any known microbe, which consists of a system of 20 biochemical reactions. Among one needs as an input to peptide identification tools a protein the concerns with the API test are that it cannot always identify database that includes all microbes present in the sample. The microbes at the species level and it cannot handle samples ever-increasing number of microbial proteomes implies an composed of multiple microbes [12]. Another frequently used ever-expanding peptide/protein database along with ever- test is the enzyme-linked immunosorbent assay (ELISA), a increasing number of post-translational modifications (PTMs) highly specific yet expensive test, which relies on the use of as well as single amino acid polymorphisms (SAPs) that will antibodies that are specific to antigens of a given species of overwhelm most peptide identification tools [30]. microbe [13, 14]. Traditional polymerase chain reactions In addition to having longer search time, the consequence of (PCR) of 16S rRNA followed by sequencing methods [15] using a much larger database includes reduction of sensitivity. are also utilized in the identification of microbes. One issue in One way to circumvent this is to first construct a peptide- using 16S rRNA sequence information is that it provides centric database without including PTMs or SAPs, sorted ac- reliable identification at the genus level for the majority of cording to peptide masses, then have an interface program that cases, but it cannot always be used to differentiate between extracts from the database, for every precursor ion mass of the closely related species having a 16S rRNA sequence similarity query spectrum, the corresponding peptide set and passes it to greater than 97% [16, 17]. For example, Bacillus globisporus peptide identification tools as the input database. PTMs and and Bacillus psychrophilus have greater than 99.5% sequence SAPs are then allowed only for microbes whose identification homology in their 16S rRNAs but have a DNA-level related- confidences are higher than a specified threshold. Although ness of only 23–50% when measured by hybridization reaction this strategy allows for a higher peptide identification rate, it [18, 19]. Studies using FilmArray multiplex PCR have been should not be confused with the multistage proteomics search shown to be able to early detect single or multiple pathogenic strategies based on the target-decoy statistics [31]that is often microbes present in positive blood cultures, as long as patho- used in the metaproteomics community. For this type of mul- genic microbes are present with high enough concentrations tistage search strategy, in the second step (re-search), the and the FilmArray system has the proper primers for these resulting proportion of false discoveries (or false discovery pathogenic microbes [20]. There are also methods that do not rate) can no longer be estimated correctly [32–34], hence require a blood culture and can identify microbes from whole undermining their validity. When using methods that do not blood samples. Of these methods, PCR amplification for require target-decoy approach to assign accurate statistical electrospray ionization mass spectrometry (PCR/ESI-MS) is significance, however, the difference between the multistage the most promising with reported sensitivity and specificity in and one-pass search strategies becomes that of using a smaller the 1990s [8]. database and a larger one. And the sensitivity advantage of In recent years, next-generation sequencing (NGS) and multistage strategies can be achieved with much less search mass spectrometry (MS) have emerged as reliable technologies time by the stratified database search method [35]. However, for rapid and accurate identification/classification of microbes even if the database size issue can be mitigated by using [21, 22]. Utilizing both the coding and noncoding DNA infor- stratified database and on-the-fly scope expansion of PTMs and SAPs, a major challenge remains in terms of identifying mation, NGS can be used to screen bacteria effectively al- multiple microbes. though it lacks gene expression information. On the other hand, the MS-based technology (the focus of the current manuscript) This major challenge originates from the difficulty in delin- identifies microbes via peptides found thus providing protein eating the microbes based on peptides identified. We meet this expression information. There are different ways in which challenge by developing a method that can delineate or group these technologies can be employed for microorganismal iden- microbes based on their peptidome similarity (counting only tification, and we direct the reader to the review articles of experimentally observed peptides of high confidence) and can Hodkinson and Grice [23] and of Saurce and Klien [22]fora assign accurate statistical significances to microbes identified. survey. In terms of identifying multiple microbes, the aforementioned Both the NGS and MS-based technologies can be routinely challenges are common to both the NGS [36–38] and MS- used to identify single microbes. For example, matrix-assisted based methods [39–45]. However, the extent to which one can laser desorption ionization time-of-flight MS (MALDI-TOF- apply the statistical methods developed in this manuscript to MS) [24, 25] can be employed to identify single microbes NGS-based methods deserves a separate investigation that is quickly and accurately in pure samples [26]. However, it beyond the scope of the current manuscript. In addition to the G. Alves et al.: Identification of Microorganisms 1723 two main challenges outlined above, existing MS-based Materials and Methods methods for microbial identification/classification [39–44]as Downloaded MS/MS Data Files mentioned by Boulund et al. [45] also face other challenges such as being limited to simple samples and needing manual A total of 207 LC-MS/MS data files were downloaded from the intervention during data analysis. The latter makes it hard to ProteomeXchange database at http://www.proteomexchange. automate the data analysis workflow. org/ and from PeptideAtlas at http://www.peptideatlas.org/. Interestingly, identifying multiple microbes is also pursued in Of these LC-MS/MS data files, 194 are from mixture samples areas such as environmental proteomics, also termed of known organisms, with each sample containing one, two, metaproteomics [46–48]. In this area, even though the primary four or nine organisms. The remaining 13 data files are from aim is to understand the functional expression of complex complex samples of the human gut. Supplementary Tables S1, samples (not the microbe classification/identification in particu- S2, S3,and S4 provide the data file (DF) number, the file name, lar), recently the reliability of taxonomic attribution using the and the ProteomeXchange or PeptideAtlas identifier for each bioinformatic tools Unipept [49]and MEGAN[50] has been LC-MS/MS data file downloaded. All the MS/MS spectra assessed using unique (taxon-specific) peptides found experi- described here were acquired on high-resolution mass spec- mentally [51]. This latter endeavor is equivalent to classification/ trometers and further experimental details can be obtained in identification of microbes [47, 48], which happens to be our the ProteomeXchange website. The true positive microbes in main goal. However, unlike Unipept [49]and MEGAN[50], the latter group of data (human gut microbiome) are unknown. our method can identify/classify microbes with accurate statis- For the data from samples in the former group, the true posi- tical significance assignment. Our method can also be applied in tives in each sample were provided along with their mix ratios; terms of functional expression. To achieve this, we identify the many such samples have nearly equal number of cells per peptides, then the microbes, and then the corresponding pro- microbe, and some are from very biased microbe populations. teins. With proteins identified, one may query the term databases To complement the ratio varieties of the downloaded data, we such as gene ontology to obtain functional expression [52]. also generated some in-house MS/MS data from samples with In this manuscript, we present an updated version of MiCId, more biased (but not extreme) ratios. an analysis workflow for rapid and accurate identifications/ classifications of microbes. MiCId was designed to automate In-House Dataset the complete process, from microbial peptide database con- struction to microbial identification and protein identification. Bacterial Culture Preparation Fresh Escherichia coli The first version of MiCId, standing for Microorganism Clas- (ATCC 25922), Pseudomonas aeruginosa (ATCC 27853), sification and Identification, was tested for analysis of samples Streptococcus pneumoniae (ATCC 49619), and containing a single microbe [53]. We have now updated MiCId Staphylococcus aureus (ATCC 25923) plates were used to to specifically handle mixed samples containing multiple mi- inoculate a 2 ml tryptic broth for overnight growth. From each crobes while preserving its speed and its accurate statistical saturated culture, a 10-ml vial was inoculated with 100 μl significance. Using 226 LC-MS/MS data files from a variety of (1:100 dilution) and put in shaker at 37 C. Each culture growth microbial samples, some of whose microbial compositions are was monitored by nephelometer to an optical density value unknown, and 4000 blended MS/MS data files, we have ex- between 0.4 and 0.51 (approximately 10 cells). True CFU tensively evaluated MiCId’s performance in terms of values were achieved by plating diluted samples on sheep identifying/classifying multiple microbes at different taxonom- blood agar plates and counting the resulting colonies. Based ic levels. MiCId utilizes a hierarchical identification strategy on the measured approximate cell count values for each culture, where microbes are identified starting at the phylum level then these cultures were mixed in different ratios to generate mi- descending one level at a time. With the E value cutoff set at crobe mixtures containing 10 cells total. The prepared microbe mixtures with their estimated ratios, DFs 84–102, are listed in 0.01 (effectively control the PFD to be less than 5%), in terms Table S3. These mixtures were added to eppendorf tubes and of microbe identification using blended MS/MS data (BMD-A spun at 14 K rpm for 2 min until all of the samples were in the and BMD-B), MiCId yields an average true positive rate of 0.9813 at the genus level and 0.9550 at the species level. (More eppendorf tube and the supernatants discarded. These pellets details can be found in Table 2.) One should note, however, were washed with 1 ml 70% EtOH and then resuspended in that these numbers were obtained by blending spectra from up 150 μl 70% formic acid. After vortexing, 150 μlacetonitrile to 24 single-species samples. Generalization to complex mi- was added and samples were vortexed and respun. The super- crobiota samples should be taken with a grain of salt. MiCId’s natant was transferred to a clean tube and speed-vacuum dried computed statistical significance is shown to be accurate for on medium heat. To each dried tube, 40 μlof 6 Murea and microbes identified when tested against a decoy database at 50 mM NH HCO were added and the tube was sonicated for 4 3 various taxonomical levels, providing a confidence measure for 50 min with occasional vortexing. Samples were reduced with users to control the proportion of false discoveries. The pro- DTT (4 μl 1 M in water, 37 °C for 60 min), alkylated (20 μl posed workflow has been implemented in MiCId, a freely iodoacetamide 40 mg/ml in water, at room temperature for available software that can be downloaded at http://www. 60 min in the dark), and quenched with DTT (4 μl, 15 min). ncbi.nlm.nih.gov/CBBresearch/Yu/downloads.html. The tubes were diluted with 100 μlof 50 mM NH HCO and 4 3 1724 G. Alves et al.: Identification of Microorganisms 10 μl of 100 mM NH HCO . Trypsin/Lys-c (Promega, 2 μg) specificity; BMD-C was used to evaluate the accuracy of 4 3 was added to each tube. Samples were digested using the CEM MiCId’s statistical significance assignment in terms of microbe Discovery microwave digester (60 min, 50 °C, 50 W, with identifications. cooling). After digestion, samples were stored at − 20 °C until DFs 1–24, which cover 24 species and 21 genera, were used used. to generate BMD-A. BMD-A is composed of five subsets, each corresponding to a fixed sampling percentage of 1, 5, 10, 25, or 50%. Every subset contains 500 blended DFs, each of which Liquid Chromatography-Tandem Mass Spectrometry Acquisi- was generated by sampling a fixed percentage of MS/MS tion Liquid chromatography-tandem mass spectrometry (LC- spectra from every one of the 24 DFs. BMD-B, covering 15 MS-MS) was performed using an Eksigent nanoLC-Ultra 2D species and 15 genera, is made of seven subsets each compos- system (Dublin, CA) coupled to an Orbitrap Elite mass spec- ing 200 blended DFs. Every blended DF in a subset was trometer (Thermo Scientific, San Jose, CA). Peptide samples generated by sampling a fixed percentage p of MS/MS spectra were first loaded onto a Zorbax 300SB-C18 trap column from group 1 DFs and a complement percentage 100-p of MS/ (Agilent, Palo Alto, CA) at a flow rate of 6 μL/min for MS spectra from group 2 DFs. For BMD-B, group 1 contains 10 min, and then separated on a reversed-phase BetaBasic DFs 24–31 and group 2 contains DFs 32–38. The seven dif- C18 PicoFrit analytical column (0.075 × 250 mm, New Objec- ferent subsets are distinguished by their complement pairs of tive, Woburn, MA) using a 90-min linear gradient of 5–35% percentages: (95, 5%), (90, 10%), (75, 25%), (50, 50%), (25, acetonitrile in 0.1% formic acid at a flow rate of 250 nl/min. 75%), (10, 90%), (5, 95%). BMD-C contains 100 blended DFs. Eluted peptides were sprayed into the Orbitrap Elite equipped Every blended DF of BMD-C was generated by sampling 50% with a nano-spray ionization source. Both survey (MS) and of MS/MS spectra from each of the following DFs: 40, 43, 48, product (MS/MS) spectra were acquired in the Orbitrap, and 53, 55, and 61. the FTMS resolution was set at 30,000 and 15,000, respective- ly. Each MS scan was followed by six data-dependent CID MS/MS scans with dynamic exclusion. Other mass spectrom- Peptide-Centric Databases etry settings were as follows: spray voltage, 1.8 kV; full MS Protein sequences were downloaded (on February 16, 2018) mass range, m/z 300 to 2000; normalized collision energy, from the National Center for Biotechnology Information 35%; precursor ion isolation mass width, 3 Da. (NCBI) at ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/. For the current study, five peptide-centric databases (DBs) (DB-1 through DB-5) were constructed. In the genbank.assembly file, Blended MS/MS Dataset an organism’s genome assembly level is labeled as contig, Even though we already have some data files from mixture scaffold, chromosome, or complete genome, in order of in- samples of up to nine microbes, they are limited in number and creasing completeness. The peptide-centric DB-1 (DB-2) in- perhaps in complexity. While we have the data files from the cludes, in addition to Homo sapiens and Mus musculus,all complex samples of the human gut, the true positives within bacteria, archaea, fungi, and virus whose assemblies are at these samples are unknown. To stress test our proposed iden- chromosome (scaffold) level or higher. The other three data- tification method, we need a large dataset made of highly bases were created from metagenomics data that have been complex samples but with true positives known. Absent analyzed and transformed into protein sequence fasta files [55]: existing DFs of this type, we generated blended DFs in silico DB-3 (DB-Human_0_RF_6GB.fasta containing 3,423,708 se- (similar to methods of [54] thathavebeenemployedtoevaluate quences), DB-4 (DB-Human_0_CF_6GB.fasta containing metagenomics analysis workflows). 192,582 sequences), and DB-5 (DB-Human_1+2+3_RF_ Each blended MS/MS data file for this purpose was gener- 6GB.fasta containing 2,866,541 sequences). ated using the following steps: (1) identify a list of data files, All DBs were constructed as follows: downloaded protein each containing MS/MS spectra from a sample of a microbe; sequences were in silico digested following the digestion rule (2) for each data file in the list, a number of MS/MS spectra out for trypsin and lys-c, i.e., cleaving at the carboxyl termini of of the total were randomly sampled according to a pre-specified arginine and lysine, allowing up to two missed cleavage sites. percentage; (3) merge the sampled MS/MS spectra to mimic a In our DBs, only nonredundant tryptic and lys-c peptides with data file from a mixture sample of these microbes; and (4) molecular masses between 660 and 4000 Da were kept. By repeat steps 1–3 to achieve the desired number of blended nonredundant peptide, we mean the following. We keep a copy MS/MS data files. The number of microbes chosen and the and only one copy of every possible peptide (resulted from in specified percentage of MS/MS spectra to sample (from each silico enzyme digestion of the protein database) regardless of microbe’s MS/MS data file) determine the size of a blended whether the peptide is shared by multiple microbes or not. A MS/MS data file. nonredundant peptide therefore can be a unique peptide to a A total of three blended MS/MS datasets (BMDs) were microbe at a given taxonomic level, but may become a shared generated. BMD-A was used for learning the parameters need- peptide when a lower taxonomic level is considered. Never- ed for MiCId’s clustering procedure; BMD-B was used to theless, no peptides will be excluded, and as more microbe evaluate MiCId’s performance in terms of sensitivity and genomes are sequenced, the growth rate of our peptide-centric G. Alves et al.: Identification of Microorganisms 1725 database is expected to be smaller than that of the protein in DNA sequencing technology and a polyphasic approach that databases. utilizes genotypic, chemotypic, and phenotypic information In DB-1 and DB-2, for each peptide, the names of strains, during taxonomic classification [60]. subspecies, species, genera, families, orders, classes, and phyla To provide microbe identification significances, we com- that contain this peptide are also recorded and linked to the pute a unified E value (E ) by combining the spectrum-specific peptide. The sizes of DB-1 and DB-2 are 46 and 200 GB, E values of the confidently identified peptides (CIPs). A pep- respectively. Taxonomic information included in DB-1 and tide is considered a CIP if it is identified with an E value smaller DB-2 was extracted from the taxonomy files downloaded (on than a cutoff (E ). When performing identifications, we might February 16, 2018) from the NCBI (https://www.ncbi.nlm.nih. want to eliminate potential false positives aggressively thus gov/Taxonomy). The 46,838,064 protein sequences setting E low; currently, E is defined to be the minimum of c c (807,574,956 nonredundant tryptic peptides) in DB-1 are from 1 and 100/n (with n denoting the total number of MS/MS s s 23,911 organisms, belonging to 13,072 species, 1890 genera, spectra acquired for a given experiment). (When assessing and 517 families. The 260,931,852 protein sequences statistical accuracy using a random/decoy database, however, (2,507,889,685 nonredundant tryptic peptides) in DB-2 are one is essentially counting false positives and setting E too low from 75,356 organisms, belonging to 26,368 species, 2870 will reduce the number of false positives estimated.) With this genera, and 701 families. Relevant information pertaining to E specified, on average for large n , only 100 false positive c s the NCBI taxonomy identifiers and organism names for the peptides are expected among the CIPs. We next detail an different organisms included in DB-1 and DB-2 can be found important clustering step [53], which is now improved to in Supplementary Table S5. Since DB-3, DB-4, and DB-5 are accommodate identifications of multiple microbes. In our peptide-driven clustering procedure [53], taxa sharing from metagenomics reads/contigs, no taxonomy information is available. significant amounts of CIPs were clustered together. All CIPs were regarded as equally important, and the clustering proce- Software Parameter Values Used dure did not go through further iteration. In this manuscript, we incorporate all peptides with E values less than 1 in the clus- Six databases, DB-1 through DB-5 and the reverse of DB-1, tering procedure; however, we also introduce fractional counts were used in MS/MS data analyses; the first five were used as to give peptides with better (lower) E values more (less) influ- the target DBs while the last as the decoy DB. The other ence than peptides with worse (higher) E values. The updated software parameters are described below. While performing peptide-driven clustering procedure is described below. database searches, the digestion rules of trypsin and lys-c were First, peptides identified with E values ≤ 1 are mapped to the assumed with up to two missed cleavage sites per peptide different taxa in the database. Second, a standardized weighted allowed. Iodocetamide was used as the reduction agent, chang- count (Z) isassignedto eachidentifiedpeptide (whose E value ing the molecular mass of every cysteine from 103.00919 to is E). 160.030647 Da. The mass error tolerance of 10 ppm was set for both precursor and product ions. RAId’s Rscore scoring func- tion, using b and y ions as evidence, was used for scoring peptides. The statistical significance assigned to each peptide ZEðÞ¼ : ð1Þ was given by RAId’s theoretically derived peptide score dis- 1 þ tribution [56]. The largest (cutoff) E value for a peptide to be reported was set to 1. Third, taxa are sorted in decreasing order of their weighted Statistical Method for Microbial Identification number of identified peptides to prepare for the clustering For our statistical method for microbial identification to be procedure. The first taxon entering a cluster is called the head effective, two prerequisites are indispensable: (1) accurate sig- of that taxon cluster, while other taxa the members of that nificance assignments, e.g., E values, at the peptide level must cluster. Starting from the best ranked taxon (a cluster head) in be provided and (2) microbes used for database construction the sorted list, any other lower ranked taxon will cluster to the must have the correct taxonomic classification. The first re- former if a resemblance coefficient of 0.85 or larger is obtained quirement is satisfied because peptide identifications in MiCId between them. The resemblance coefficient is defined as the are done by using RAId’s scoring function and significance proportion of the weighted number of peptides belonging to a assignments which have been shown to yield accurate E values lower ranked taxon that can be explained by identified peptides [56, 57]. associated with a cluster head. Once the worst ranked taxon is As for the second requirement, it is known that the mi- reached, the process will continue with the best ranked, not-yet- crobes’ taxonomic classification is not perfect and sometimes clustered taxon as a cluster head and repeat until all the controversial. For example, some studies recommend that unclustered taxa have been attempted as a cluster head, but Shigella flexneri should be classified as a strain of not more than once. Escherichia coli [58, 59]. However, the microbes’ taxonomic Fourth, after having generated taxon clusters, we further classification is expected to improve thanks to recent advances group these taxon clusters as follows. Starting with the lowest 1726 G. Alves et al.: Identification of Microorganisms ranking taxon cluster head, we denote its resemblance coeffi- compare with, leading to the conditional probability formula cient to others by its proportion of the weighted number of for the product of truncated P values [53, 63]. identified peptides that are shared by all other heads of taxon h  i clusters; when a resemblance coefficient of 0.85 or larger is m ln P −ln ðÞ τ m−1 raw c obtained, the cluster under consideration is merged to its closest P ~τ≤τj m; m ¼ ∑ ; ð4Þ t raw raw P s! s¼0 taxon cluster, i.e., the one whose head shares the largest weighted number of peptides with the current head. The re- where m ≡ ⌈m ⌉ is the smallest integer that is greater than or raw maining number of cluster heads n is used as the Bonferroni equal to m . An example of how to compute the unified P raw correction factor for significance calculation later. There is, value using Eq. (5) can be found in the electronic supplemen- however, an exception to the general clustering rules above. tary material of an earlier publication [53]. When a taxon (taxon cluster) contains three or more CIPs that Finally, P is given by are not shared with any other taxa (taxon clusters), it can only be a cluster head. M ! M−m P ~ τ≤τ ¼ PðÞ 1−P P ~ τ≤τj m; m u c t raw There is also a difference in the identification workflow m!ðÞ M−m ! compared to our earlier method. We considered all genera M M ! M− j ð5Þ þ ∑ PðÞ 1−P (species) together when performing genus (species) level iden- c j!ðÞ M−j ! j¼mþ1 hi tifications [53]. Here, we begin identifications at the phylum j j θ P −τ P ~ τ≤τj j; j þθτ−P ; c c level and then down. The current method has the advantage that one may eliminate from consideration taxa that are unlike- lm ly to be present during the upper level identifications. Below is where M ¼ ∑ wðÞ E≤1 with N being the total number j T j¼1 the condition we employ to select taxa to be retained at each of identified peptides (with E value ≤ 1) mappable to taxon T, identification level. Each cluster head will be considered; any P is the database P value for E , and the θ(x) function takes the c c member with a percentage difference in the weighted number value 1 when x>0and0otherwise. of CIPs to its cluster head less than 15% or having three or Within each cluster, the unified E value of each member more CIPs that are not shared with other taxa will be retained. taxon, cluster head included, is computed; the taxon with the In order to provide statistical significances at various taxo- most significant unified E value becomes the head of the nomic levels, we compute a unified E value E by combining cluster. Note that the starting cluster head (with largest M) the spectrum-specific E values of the CIPs belonging to the remains most significant most of the time, indicating the sta- same taxon. The details of how the E s are computed for bility of our clustering procedure and significance assignment. microbes at different taxonomic levels have been previously After this step, the clusters are finally sorted in ascending order described [53]. Here, we only briefly outline the essential steps of the unified E values of the cluster heads. for computing E . The unified E value E is given by Statistical Method for Protein Identification E ¼ n  P ; ð2Þ u c u In this update of MiCId, we have included the protein where in Eq. (2), the unified P value (P ) is multiplied by the identification capability. Proteomes of confidently identi- Bonferroni correction factor, n , the final number of peptide- fied microbes (cluster heads at the species level) are used driven taxon clusters. The P is obtained by first transforming as the protein database. The implemented statistical the E values (E) of CIPs into database P values (p)[61, 62], p = −E framework for protein identification is found on a rigor- 1 − e .Aweight (w ) is then defined for each peptide π as ously derived general formula [64], which has been 1/C !with C being the total number of taxon clusters contain- π π extensively tested for the application to protein identifi- ing π.Note that w s are computed for all πs identified with E cation [65]. Since the detailed derivations [64]and the values ≤ 1 although only CIPs are used for computing the applications [65] are already available, here we only unified P value. Evidently, C varies by the taxonomic level briefly summarize the statistical method for protein iden- considered. tification in MiCId. For each taxon T, one then combines the weighted product As before, we consider all identified peptides with E values of database P values into a new variable E ≤ 1 and convert each E value of an identified peptide into a n −E database P values (P =1 − e )[61, 62]. These identified pep- τ ¼ ∏p ; ð3Þ tides are then mapped to database proteins that contain them. i¼1 Assume that a given protein contains L identified peptides with where n is the total number of CIPs mappable to taxon T,and P values. Let us group these L peptides, according to the the weight for peptide π is w ¼ 1=C !. The sum of peptide number of database proteins a peptide maps to, into m groups i i π weights m ≡∑ wðÞ E≤E gives the effective number of with 1 ≤ m ≤ L.Within each group k,the n peptide P values raw i c i¼1 k degrees of freedom. This allows one to define a stochastic have equal weight, while peptide P values in different groups variable ~ τ of the same number of degrees of freedom to are weighted differently. G. Alves et al.: Identification of Microorganisms 1727 The weighting enters our formalism through the following plot the species-level peptidome similarity histograms quantities of interest using all three similarity measures (see panel b of Figure 1). The closeness among all three versions of sim- "# "# w w k k n n k k m m ilarity measures for species indicates that the presence of τ≡∏ ∏p and Q≡∏ ∏x ; ð6Þ k; j k; j high peptidome similarity among species is generic, not a k¼1 j¼1 k¼1 j¼1 consequence of asymmetric similarity. However, similar to where each p represents a reported peptide database P value the well-known example of high peptidome similarity be- k; j and each x represents a random variable drawn from a tween E. coli and S. flexneri [58, 59], the exceedingly high k; j uniform, independent distribution over [0, 1]. The quantity of peptidome similarities observed among certain species may interest F(τ) ≡ Prob(Q ≤ τ), representing the protein P value, also be a consequence of incorrect/problematic taxonomy can beobtained[64, 65]. classifications. The question of what to include/exclude from the database m m is definitely important, and the answer probably varies depend- FðÞ τ ¼ ∏ r ∑ ∑ HðÞ −r ln τ; g l k g þ1 l¼1 k¼1 GðÞ k r ing on the research conducted. It has been shown that the 0 1 choice of databases affects microbial identification [55]. As g = n −1 þ g ! j m j j another example, researchers studying the human gut ðÞ −1 @ A ∏  ; ð7Þ n þg j j microbiome may wish to construct a gut-specific database n −1 !g ! j¼1; j≠k r −r j j k [48] by including only human gut microbial species that have been cataloged by the Human Microbiome Project [66]. To where r ≡ 1/w is the number of proteins a group k peptide k k facilitate microbial research of various types, MiCId offers a maps to, ∑ enumerates each set of nonnegative integers GðÞ k simple procedure for constructing a customized database and {g , g , …, g }thatsatisfies the k-dependent constraint 1 2 m using it for microbial identification afterwards. A user only ∑ g ¼ n −1, and the function H is defined as i¼1 i needs to specify a list with the names of species or their NCBI taxonomy identifier, and everything else is handled by MiCId. −x HxðÞ ; n ≡ e ∑ : ð8Þ k! k¼0 E Value Accuracy Evaluation As stated in the BMaterials and Methods^ section, the An example application of Eq. (7) can be found in the statistical method employed to assign statistical signifi- supplementary information of the published study on protein cance to microbes identified requires accurate per- identification [65]. spectrum significances, e.g., E values, at the peptide level. Combining RAId’s accurate per-spectrum statistics [56]to output unified E value (E ) for microbes identified, we Results and Discussion expect E to be accurate and evaluated its accuracy as To investigate the feasibility of microbial identification described below. MS/MS spectra from BMD-C were used based on peptides identified, we examine in silico the to query the reverse [67, 68] of DB-1 while keeping the peptidome similarities among microbes at different taxo- taxonomic assignments untouched. The reverse of DB-1 nomic level in our DB-1. The similarity of taxon X to Y, thus acts as a peptide database from decoy organisms. S ≡∣ X ∩ Y ∣ / ∣ X∣, is defined as the numbered of (Therefore, each decoy organism carries a canonical name, X→ Y shared peptides between the two taxa divided by the num- but its peptide sequences are reverse of those of the true ber of peptides corresponding to taxon X. Panel a of organism of the same name.) The unified E value E of Figure 1 displays histograms of the in silico peptidome each decoy organism can then be calculated using Eqs. (2) similarities computed among the families, genera, and spe- and (5). Obviously, each identified decoy organism is a cies in DB-1. The histograms in Figure 1 show that pep- false positive. The accuracy of assigned statistical signifi- tides are weakly shared among taxa above species level cance (or type I error) can be visualized by a log-log plot having similarity values that are typically much less than [69, 70] of the expected number of errors (decoy organ- 0.6, indicating that microbial identifications at levels isms identified) per query versus the E reported by higher than species should be easier than at the species MiCId. level. Based on these histograms, it appears that unambig- Since we have in total 100 queries (blended DFs) in BMD- uous identification at the species level for certain species C, one would expect 1 decoy microbial cluster head with E ≤ −2 −1 can be challenging. To investigate whether the high 10 , 10 decoy microbial cluster heads with E ≤ 10 , 100 peptidome similarity at the species level is an artifact due microbial cluster heads with E ≤ 10 , and 10,000 microbial to asymmetric similarity measure, we defined in addition cluster heads with E ≤ 10 . The log-log plot of the expected t w o symme tr ize d si mi l a ri ti es : number of errors per query versus E should yield a curve close S (X→ Y)= S (Y→ X)≡∣ X ∩ Y ∣ /min {| X|,| Y|} and to the y = x line. Panels a, b, and c of Figure 2 show that the max max S (X→ Y)= S (Y→ X)≡∣ X ∩ Y ∣ /max {| X|,| Y|} and computed E curves trace very closely the y = x line, indicating min min u 1728 G. Alves et al.: Identification of Microorganisms 10 10 (a) Family (b) Species Genus Species (S ) min Species Species (S ) max 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Figure 1. Histograms of microbial peptidome similarity (ρ). The curves display the peptidome similarity histograms in DB-1. In general, asymmetric peptidome similarity is used: similarity of taxon X to Y, S , is defined by S ≡∣ X ∩ Y ∣ / ∣ X∣, the number X → Y X → Y of shared peptides between the two taxa divided by the number of peptides corresponding to taxon X. Note that both S and X → Y S are included in the histogram. To illustrate that the high frequency of large peptidome similarity among species is generic, we Y → X defined in addition two symmetrized similarities: S (X → Y)= S (Y → X) ≡∣ X ∩ Y ∣ /min {| X|,| Y|} and max max S (X → Y)= S (Y → X)≡∣ X ∩ Y ∣ /max {| X|,| Y| }. These new symmetrized measures are shown only for peptidome similarities min min among species. The closeness among all three versions of similarity measures for species indicates that the presence of high peptidome similarity among species is generic, not a consequence of asymmetric similarity that the computed E is indeed accurate. Two dashed lines, y = Microbial Identification 3x and y = x/3, are also provided as references. As described in Validation of MiCId’s Taxa Identification Using Mixtures of the BStatistical Method for Microbial Identification^ subsec- Known Compositions One important component of our tion, when assessing the accuracy of assigned significance method, as described in the BStatistical Method for Microbial using decoy databases, one should not set E too low and the Identification^ subsection, is the peptide-driven clustering pro- statistical accuracy should remain even with different E .To cedure used to group microbes at different taxonomic levels. test this, we have used three E values, 0.01, 0.1, and 0.5, The parameters needed for the clustering procedure include the corresponding to curves in panels a, b, and c, respectively. resemblance coefficient, the minimum number of unique CIPs These E values, 0.01, 0.1, and 0.5, yield 347, 3183, and required to prevent a taxon (or a cluster of taxa) from being 15,040 CIPs, respectively. further clustered, and the condition for taxa to be selected for Although the statistical formula (7) for protein identifica- the next level identification. These parameter values were tion is better found than the one (5) we used for microbe learned from using BMD-A (as a training dataset) to query identification/classification, we did not switch to the former DB-1; they are selected by maximizing the Btrue positive rate^ in this MiCId update for the following reason. When (TPR) at the species level under a specified E value cutoff. performing microbial identification/classification, one needs Once determined, the parameters are used for all level of taxa to combine hundreds to thousands of peptide database P identifications. When computing the TPR, a cluster of identi- values as opposed to tens of database P values for protein fied microbes contributes only one count; the cluster is viewed identification. For the former case, if one were to use formula as a true positive only if the cluster head is a true positive and as (7), the summation becomes very time-consuming to compute a false positive otherwise. The first half of Table 2 displays, and can considerably slow down the entire workflow. from phylum-level to species-level identifications, the TPR 2 2 2 10 10 10 (a) Family (b) Family (c) Family Genus Genus Genus 1 1 1 10 10 10 Species Species Species 0 0 0 10 10 10 −1 −1 −1 10 10 10 −2 −2 −2 10 10 10 −2 −1 0 1 2 −2 −1 0 1 2 −2 −1 0 1 2 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 E−value E−value E−value Figure 2. Accuracy assessment of the unified E value (E ) for microorganism identifications. Each of the 100 blended DFs in BMD-C was queried in the decoy (made of reversed target peptide sequences) of the target DB-1. Three E s, peptide E value cutoffs, are used: (a) E = 0.01, (b) E =0.1, and (c) E = 0.5. The closer the curve is to the y = x line, the more accurate the reported E c c c u E[Erros Per Query] Frequency E[Erros Per Query] Frequency E[Erros Per Query] G. Alves et al.: Identification of Microorganisms 1729 along with the PFD using the parameters learned (with E value is among the known microbes and a false positive other- cutoff 0.01). More pertinent information at the genus and wise. Although the table headings are explained in the species levels can be found in Supplementary Figure S1. caption, we shall elaborate on few of them here: IF is With the parameters determined, we use BMD-B and the overall identification fraction (proportion of times a mixture sample datasets of known composition to query taxon is identified, be it a cluster head or not, from samples DB-1 in order to test the ability of MiCId in microbial containing it); IF is no larger than IF as it records the 2 1 identification/classification. The resulting TPR and PFD identification fraction of a known microbe that is also the (with E value cutoff 0.01) are displayed in the second half head of the cluster it belongs to; IF cannot be larger than of Table 2. The retrieval curves of taxa and peptides from IF by definition since it reports the identification fraction using BMD-B as queries are plotted in panels a and c of of a known microbe that, in addition to satisfying the Figure 3. Plottedinpanel bofFigure 3 is the PFDs versus conditions for IF , must have taxon-specific (unique) pep- the E values of identified taxa. This panel indicates that tide hits. The fact that sometimes IF is smaller than IF 2 1 using a E value cutoff of 0.01, one can control the PFDs at indicates the room for improvement in our identification the genus and species level at 5 and 2%, respectively. method: it shows occurrences of a true positive microbe Panel d of Figure 3 shows the histogram of peptides not being the cluster head. There are two scenarios that this identified with different E values. BMD-B contains 1400 can happen. First, it is possible that a false positive mi- blended DFs, each of which contains spectra from all 15 crobe having a similar peptidome to the true positive genera/species; hence, the maximum count of identifiable somehow becomes more significant than the true positive genera/species is 21,000. In this assessment, an identified microbe and the new cluster head. Second, we can have cluster is counted as a true positive only if its cluster head 0.14 Genus (b) (a) 0.12 Species 0.10 0.08 0.06 Genus Species 0.04 0.02 0.00 0 −8 −6 −4 −2 0 10 10 10 10 10 0.00 0.02 0.04 0.06 0.08 0.10 E−value PFD 9 8 10 10 (c) 8 (d) 10 3 10 0 0.00 0.02 0.04 0.06 0.08 0.10 −20 −16 −12 −8 −40 PFD log (E−value) Figure 3. Retrieval assessments using the blended MS/MS dataset BMD-B of known microbe compositions to query DB-1 are shown in panels (a) (taxa) and (c) (peptides). (b) The PFDs versus the E values of identified taxa. This panel indicates that using a E value cutoff of 0.01, one can control the PFDs at the genus and species level at 5 and 2%, respectively. (d) The histogram of peptides identified with different E values. BMD-B contains 1400 blended DFs, each of which contains spectra from all 15 genera/species; hence, the maximum count of identifiable genera/species is 21,000. An identified cluster is counted as a true positive only if its cluster head is among the known microbes and a false positive otherwise. The table headings are explained below: SK represents the species key; E[R] is the taxon’s average rank in the identified cluster containing it; IF is the overall identification fraction (proportion of times a taxon is identified, be it a cluster head or not, from samples containing it); IF records the identification fraction of a known microbe that also happens to be the head of the cluster it belongs to; IF reports the identification fraction of a known microbe that not only is the head of the cluster it belongs to but also has unique (taxon-specific) peptide hits; e[NIP] is the average number of identified peptides; e[CS] represents the average cluster size containing the taxon. In the table above, microbial identification was controlled at the 5% PFD. The species keys are 1, Escherichia coli;2, Enterobacter lignolyticus;3, Streptococcus pyogenes;4, Mycobacterium tuberculosis;5, Salmonella enterica;6, Yersinia pestis;7, Shewanella oneidensis;8, Pseudomonas aeruginosa;9, Bacillus subtilis;10, Bordetella pertussis;11, Bartonella henselae;12, Rhodobacter sphaeroides;13, Thermotoga maritima;14, Geobacter bemidjiensis;15, Caulobacter vibrioides Number of Microorganisms Number of Peptides Frequency PFD 1730 G. Alves et al.: Identification of Microorganisms (a) two true positive microbes sharing a larger number of CIPs and are clustered together. One quantity of particular interest is IF -IF . Notethatwith 2 3 IF -IF being zero for all genus-level identifications, we know 2 3 that each microbe present in these DFs has genus-specific peptides identified, indicating separability (or weak correla- tion) among genera. On the other hand, at the species level, we found several cases with IF -IF being nonzero (highlighted 2 3 in orange), indicating that strong similarities exist among cer- tain species. Interestingly, the fact that IF >IF for several 2 3 (b) species reveals that MiCId can correctly identify at times these species without relying on species-specific peptides. Another noteworthy point is that even though DB-1 is a fairly large database, the majority of identified peptides are still quite significant (see panel d). As another test of MiCId’s ability to correctly identify microbes, a series of 85 samples, DFs 39–123, composed of one, two, four, or nine known microbes were used to query DB-1. Panels A and B of Supplementary Figure S2 display the Figure 4. Identification result comparison between MiCId and PFD curves for the genus and the species-level identifications, different workflows by analyzing DF 124 (human stool sample) respectively, from samples containing one microbe and several in two databases DB-3 (a) and DB-4 (b). The workflows com- microbes. Panels C and D show the histograms of peptides pared with MiCId include MetaProteome analyzer (MPA) that identified with different E values, respectively, from samples uses X!Tandem for peptide identification [55], MaxQuant (MQ) containing one microbe and several microbes. The identifica- [55], and proteome discoverer (PD) which uses Sequest-HT and tion performance of MiCId using DFs 39–123 and with the percolator for peptide identification [55]. Plotted in the Venn PFD cutoff at 5% is summarized in the associated table of diagrams are the intersection of nonredundant peptides identi- Supplementary Figure S2. fied at the 1% PFD for the four workflows. For MQ, MPA, and We have also tested if MiCId can identify viruses from PD, genus identifications were done by sending all peptides samples of Calu-3 human lung cancer cells (infected with identified at 1% PFD to Unipept with the filtering strategy rec- ommended by [51] enforced. For MiCId, all peptides identified influenza A, harvested 0, 3, 7, 12, 18, and 24 h post infection with E value ≤ 1 are used for genus identifications, and only with five replicates at each time point). These samples, DFs heads of genus clusters identified with E value ≤ 0.01 are used 137–226, are from cell lysates using multidimensional protein to generate the Venn diagram identification technology (MudPIT) [71] and were not enriched for virus proteins. Using these samples, MiCId was able to correctly identify influenza A as early as 7 h for 2 out of 15 (MQ, MPA, and PD) adapting MEGAN [50]since theauthors samples and could correctly identify influenza A after 12 h for of [51] have shown that Unipept outperforms MEGAN in all of them. Note that with the E value cutoff of 0.01, MiCId terms of pathogen identifications. For MQ, MPA, and PD, does not report any false positive regardless whether true genus identifications were done by sending all peptides identi- positives are reported or not, indicating a robust false positive fied at 1% PFD to Unipept with the filtering strategy recom- control. The results obtained is quite surprising given that the mended by [51] enforced: when the number of unique peptides size of DB-1 (covering 46,838,064 proteins) is so much larger mapped to a taxon is less than 0.5% of the total number of than that of influenza A (11 proteins), see Supplementary unique (taxon-specific) peptides, that taxon is viewed as a false Table S6. positive. For MiCId, all peptides identified with E value ≤ 1are used for genus identifications, and only heads of genus clusters identified with E value ≤ 0.01 are used to generate the Venn Microbial Identification in Complex Mixture of Unknown diagram. Species-level identification comparison was not pos- Composition Using a human stool sample, DF-124, we com- sible because Unipept processes the database differently at this pare the microbial identifications at the genus level of MiCId level [49]. However, more comparisons with the PD workflow with the results from a previous study [55] that includes three adapting Unipept are available in Supplementary Table S7 and workflows: MetaProteome Analyzer (MPA) that uses X!Tan- Supplementary Figure S3. From Supplementary Figure S3,the dem for peptide identification [55], MaxQuant (MQ) [55], and readers will notice that although PD in general identifies more Proteome Discoverer (PD) which uses Sequest-HT and Perco- peptides than MiCId in this complex sample, MiCId neverthe- lator for peptide identification [55]. In Figure 4, two databases, less identifies more genera. Given the rigorous statistical foun- DB-3 and DB-4, are employed for identification comparison, dation of MiCId and the low E value cutoff used, the additional shown as Venn diagrams, among the four workflows. We did genera identified by MiCId are unlikely to be false positives. not compare MiCId with the three aforementioned workflows On the other hand, the large number of peptides identified by G. Alves et al.: Identification of Microorganisms 1731 PD, but not by MiCId, do not yield additional confident genera PFDs) to analyze data from several replicates of the same identifications. This might be because these peptides are pri- sample, how well can the identified taxa overlap given the marily false hits or the aforementioned filtering strategy, due to possible intrinsic variation due to data-dependent acquisition its heuristic nature, sometimes can be too aggressive [47, 51] of typical MS/MS practice? To investigate the size of this and lose true positives. intrinsic variation and its impact on taxa identifications, we Note that in Figure 4, although the peptides identified do not analyze nine MS/MS datasets DF-128 to DF-136, three techni- have strong overlaps, the genera identified seem to overlap cal replicates from human stool samples of three different vol- much more. The discrepancy in peptides identified might be unteers [30]. In Figure 5, we plot in the Venn diagrams the attributed to the difference in terms of how PFDs are estimated. overlaps among triplicates in terms of peptides, genera, species, Leaving this issue aside, however, a more fundamental question and proteins identified by MiCId when searching DB-2. The remains: using a workflow such as MiCId (that yields accurate peptide Venn diagrams, displaying the intersections of AB AB AB 246 590 1443 1526 2909 1747 3254 2173 2999 1512 2115 4429 357 3845 1112 3417 1211 825 1883 3147 1298 C C C AB AB AB 0 0 1 2 2 0 1 2 1 16 15 14 05 11 00 3 2 0 C C C AB AB AB 1 3 3 5 4 1 3 4 3 19 13 13 112 02 03 3 5 1 C C C AB AB AB 299 371 692 455 770 516 1186 776 726 597 1421 942 95 1101 211 894 83 361 694 1426 288 C C C Figure 5. Identification overlaps of peptides, microbes, and proteins among technical replicates. Analyzing nine MS/MS datasets, DFs 128–136, three technical replicates from human stool samples of three different volunteers [30], we plot in the Venn diagrams the overlaps among triplicates in terms of peptides, genera, species, and proteins identified by MiCId when searching DB-2. The peptide Venn diagrams show the intersections of nonredundant peptides identified at the 1% PFD, while other Venn diagrams are constructed using genera, species, and proteins identified with E value ≤ 0.01 1732 G. Alves et al.: Identification of Microorganisms nonredundant peptides identified at the 1% PFD, show a lot database [30]. However, when a more complex sample is used, more variations than the Venn diagrams for genera and species. searching a larger database has the advantage of discovering This indicates that the taxa identified remain largely the same unexpected but perhaps true positive peptides [35, 55]. In other even though there exists nonnegligible intrinsic variations words, even though searching a larger database reduces the among technical replicates in terms of peptides/proteins identi- statistical significances of peptides previously identified in a fied [30]. smaller database, new peptides not contained in the smaller The other problem of interest is to compare the taxa identi- database may be identified with high significance. Hence, the fication performance of various workflows given the same list sensitivity in terms of peptide identification may even increase of input peptides. Evidently, when the input peptides are the when searching a larger database. same, then except for MiCId, the taxa identified by MQ, MPA, To investigate the latter point, we have used data from and PD become identical (as they all are given by Unipept). In complex samples (DFs 128–136) to query DB-1 and DB-2. this context, it becomes the taxon identification comparison The sizes of DB-1 and DB-2 are 46 and 200 GB, respectively. between MiCId and Unipept. To meaningfully compare MiCId The 46,838,064 protein sequences (807,574,956 nonredundant with Unipept (with and without the filtering strategy proposed tryptic peptides) in DB-1 are from 23,911 organisms, belong- in [51]), however, the true positive microbes must be known. ing to 13,072 species, 1890 genera, and 517 families. The We thus used a data set (DF-100 to DF-108) from a mixture of 260,931,852 protein sequences (2,507,889,685 nonredundant four known microbes. The results are summarized in Table 1. tryptic peptides) in DB-2 are from 75,356 organisms, belong- We found that MiCId consistently identify more true positives ing to 26,368 species, 2870 genera, and 701 families. In terms while controlling the false positives at a lower rate when of number of tryptic peptides, DB-2 is about three times the compared to Unipept. One important observation is that the size of DB-1, although in terms of protein sequences it is about heuristic filtering strategy [51] does not always control the 5.6 times the size of DB-1. The peptide retrieval curves for each number of false positives to a low number while our statistical DFs are shown in groups. The first group, containing PFD method does. It is possible that the E value cutoff of 1 in Table 1 curves from DFs 128–130, is displayed in Figure 6. The other may include too many peptides with poor statistical signifi- two groups are shown in Supplementary Figure S4. The con- cance, hence hindering the performance of Unipept. For this sistent trend for each of these DFs (from complex samples) is reason, we have provided in the Supplementary Table S8 the that at a given PFD value, searching the larger database (DB-2) results from controlling the input peptides at 1% PFD using the indeed yields a larger number of identified peptides in compar- MiCId statistics. As one may see in the Supplementary ison with searching the smaller database (DB-1). Table S8, the same trend persists. Protein Identification Sensitivity Does Not Always Decrease as the Database Size Increases Sensitivity drop due to searching a large database MiCId performs protein identifications by extending its taxa has been examined and reported by many groups [30, 32, 35]. identifications in one go. It first queries the MS/MS spectra in The basic fact is that when a simple (single) microbe sample is a peptide-centric database, which is constructed from microbial used, searching a smaller database (containing the proteome of protein sequences and used for microbial identification at vari- that microbe) has a better sensitivity than searching a larger ous taxonomic levels. MiCId then uses the protein sequences Table 1. Genus Identification Comparison Between MiCId and Unipept Genus assignment for data files 100–108 MiCId Unipept DF NPU TP TP FP FP TP TP FP FP f u f u f u f u 100 9229 4 4 0 2 3 4 18 591 101 9865 4 4 0 0 3 4 18 634 102 9666 4 4 0 0 3 4 12 595 103 23,248 4 4 0 3 3 4 11 862 104 22,496 4 4 0 1 3 4 11 858 105 23,105 4 4 0 0 3 4 13 870 106 21,730 4 4 0 0 3 4 12 880 107 24,091 4 4 0 1 3 4 13 892 108 21,093 4 4 0 1 3 4 16 866 DF-100 to DF-108, sample mixtures composed of four bacteria S. pneumoniae, S. aureus, E. coli,and P. aeruginosa, are used to query DB-1. All peptides with E values ≤ 1 are used as input for both MiCId and Unipept. The data file (DF) index is shown in the first column. The number of peptides used (NPU) is displayed in the second column. For Unipept results, the subscript u (or f) of the number of true positive (TP) and the number of false positive (FP) means that the filtering strategy [51] is turned off (or on); for MiCId, the subscript u includes taxa with E value ≤ 1 while f includes only taxa with E value ≤ 0.01 G. Alves et al.: Identification of Microorganisms 1733 DF−128 DF−129 DF−130 30000 40000 40000 (a) DB−1 (b) DB−1 (c) DB−1 35000 35000 DB−2 DB−2 DB−2 30000 30000 25000 25000 15000 20000 20000 15000 15000 10000 10000 5000 5000 0 0 0 0.00 0.05 0.10 0.15 0.20 0.00 0.05 0.10 0.15 0.20 0.00 0.05 0.10 0.15 0.20 PFD PFD PFD Figure 6. Peptide retrieval curves for DFs 128–130 (complex samples) when searching DB-1 (46 GB) and DB-2 (200 GB). At the same PFD value, more peptides are identified when searching the larger database (DB-2) belonging to the confidently identified microbial species to second worst ranking protein and repeats it until one has tested construct a protein database on-the-fly and queries in it the the second best ranking protein. However, there is an excep- MS/MS spectra again for protein identification. This approach tion: a protein will not cluster if it has a unique evidence peptide differs from metaproteomic approaches which identify proteins (a peptide that maps to only one protein) with assigned E value −4 directly from identified peptides without performing microbial less than both 10 and 1/n , with n being the number of s s identifications first [30, 48]. Because each on-the-fly constructed spectra collected per experiment. Specifically, we cluster pro- protein database, no longer covering all microbes, is much teins that share the majority of the peptide evidences when the smaller than the original peptide-centric database, MiCId can information obtained from identified peptides does not allow us consider post-translationally modified peptides and semi- to pin point a specific protein. Because we do not assign enzymatic (semi-tryptic) peptides during protein identification functionality to proteins identified, we do not cluster proteins without adding much computational cost. In a previous study by their homology. In fact, it is possible for two totally dissim- [65], we have shown that the method employed for protein ilar proteins to be clustered together if they share a couple of identification in MiCId assigns accurate statistical significance peptides that happened to be identified with high confidence. to identified proteins and also has a protein retrieval performance that is no worse than that of any other protein identification Execution Time method. Therefore, we only present the protein identification With speed a main consideration, MiCId code was written in results of MiCId from analyzing DFs 128–136, three technical C++ and its routine for database search was written using replicates from human stool samples of three different volunteers parallel programming. This allows users to run jobs with a [30]). In the last row of Figure 5, we plot in the Venn diagrams flexible number of cores. We have measured the execution the overlaps among triplicates in terms of proteins identified by time of MiCId in performing microbe identification when MiCId when searching DB-2. As one may see, there exists querying MS/MS spectra in DB-1 (23,911 organisms and nonnegligible variation among the triplicate results in terms of proteins identified. This may be attributed to the significant variations in the peptides identified, see the Venn diagrams on (a) 15K (DB−1) the first row of Figure 5. On the other hand, from the Venn 15K (DB−2) diagrams on the second row (genus) and the third row (species) 100K (DB−1) of Figure 5, the genus and species identifications appear consis- 2400 tent despite the variations in peptides/proteins identified. 100K (DB−2) When clustering proteins, MiCId does not use protein ho- mology to cluster them. Rather, it uses a peptide-centric clus- tering procedure which is described below: first peptides iden- tified with E values ≤ 1 are mapped to proteins in the database; 1200 second, a standardized weighted count (Z)is assigned to each identified peptide (whose E value is E), ZEðÞ¼ 1= 1 þ , where E is the E value cutoff used to control peptide identifi- cation at a 5% PFD value; third, the sum, W , of the weighted (Z) peptide evidence for every protein p is computed and used 0 4 8 1216 2024 28 32 to sort proteins in order of decreasing W; fourth, one starts with Number of Cores the worst ranking protein p and clusters it to the better ranking Figure 7. MiCId execution time versus number of cores using protein which can explain the highest percentage of W .If that two datasets (one containing 15,000 MS/MS spectra and the percentage is below 95%, p does not cluster into any better other 100,000 MS/MS spectra) to query DB-1 (46 GB) and DB-2 ranking protein; fifth, continue the process in step 4 for the (200 GB) Number of Peptides Number of Peptides Execution Time (s) Number of Peptides 1734 G. Alves et al.: Identification of Microorganisms Table 2. True Positive Rate (TPR) and Proportion of False Discoveries (PFD) at the 0.01 E Value Cutoff Using BMD-A and BMD-B True positive rate and proportion of false discoveries MS/MS data set BMD-A Taxonomical level P C O F G S TPR 100.00% 94.13% 99.99% 99.93% 97.69% 94.10% Ec =0.01 PFD 3.02% 3.47% 2.33% 2.59% 1.49% 1.91% Ec =0.01 MS/MS data set BMD-B Taxonomical level P C O F G S TPR 100.00% 100.00% 100.00% 100% 98.57% 96.90% Ec =0.01 PFD 0.07% 0.00% 0.15% 2.37% 4.98% 1.89% Ec =0.01 The first half of the table displays the TPR along with the PFD (with E value cutoff 0.01), using the parameters learned from BMD-A and apply them to BMD-A, at various taxonomic identification levels: phylum (P), class (C), order (O), family (F), genus (G), and species (S). Displayed in the second half of the table are the microbial identification/classification results using BMD-B to query DB-1 with the parameters learned from using BMD-A 46 GB) and DB-2 (75,356 organisms and 200 GB). Figure 7 Evidently, limitation of our method exists. In fact, inter- shows that for data sets of ≈100,000 MS/MS spectra, microbe pretation of the results using our analysis workflow depends identification using DB-1 can be accomplished in about 15 min on the presence/absence of the correct microorganism in the with 4 cores and reduces to around 6 min with 16 cores. On the database. If one is certain that the correct microorganism is other hand, with ≈100,000 MS/MS spectra, the execution time present in the database, one should interpret the results as for microbe identification using DB-2 ranges from around microbial identification. On the other hand, if one is sure 44 min (with 4 cores) to 14.2 min (with 16 cores). Our results that the correct microorganism is absent from the database, indicate that when the database size increases by a factor of 4.3, one may interpret the results as microbial classification, i.e., the execution time increases only by a factor of 2.4 (using 16 finding the closest relative. cores), suggesting that only a 5.6-fold execution time will be encountered with a 10-fold database size increase. This reflects the scalability of MiCId in handling large databases. Figure 7 Conclusion also shows that the execution time reduction by increasing the number of cores appears to reach a plateau at 16 cores. This is In this study, we have proposed a workflow for microbial because the piece of C++ code performing microbial identifi- classification/identification by processing MS/MS data ac- cation has not yet been parallelized, thus incurring a constant quired in high-resolution mass spectrometers. We have used a time cost. large number of MS/MS data files to assess MiCId’s ability in identifying microbes in mixture samples. Results from our identification assessment show that (at the E value cutoff 0.01 yielding PFD ≤ 5%) MiCId has an average true positive rate of Limitation and Future Direction 0.9813 at the genus level and 0.9550 at the species level. (More We have mentioned that based on Figure 1, it appears that details can be found in Table 2.) One should note, however, unambiguous identification at the species level for certain that these numbers were obtained from blended spectra of species can be challenging. This difficulty worsens when com- single species samples. Generalization to complex microbiota plex samples are concerned and presents the direction for future samples must be taken with a grain of salt. Nevertheless, we efforts. For example, most real environmental samples will not were also able to show, by comparing with published results, be clonal, but might be clades of related organisms. Hence, that MiCId performs comparably to existing methods in mi- even the concept of Bspecies^ can be problematic here. Sepa- crobial identifications. A major advantage of MiCId over rating closely related organisms evolved from a common an- existing methods is that it can assign to microbes and proteins cestor species may not be possible under our current method identified accurate statistical significance, e.g., E values, thus due to lack of proteome information and high degree of prote- providing users a measure suitable for controlling false posi- ome similarities of closely related organisms. Nevertheless, tives (type I errors) and to estimate the PFD via the Sorić identifying the ancestral species can be a realistic long-term equation [72]. In contrast with current metaproteomics goal. methods, MiCId’s protein identification strategy begins with On the other hand, numerous lines of evidence indicate that microbial identifications, allowing users to consider post- individual Bstrains^ might differ in physiology, and thus, de- translationally modified peptides and semi-enzymatic (semi- tection at strain levels might be needed for clinical applications. tryptic) peptides without adding much computational cost. This For this case, however, cultured colonies of bacteria might be feature might be welcomed by researchers interested in prote- possible and progress has been made at the strain-level identi- omics of complex microbial mixtures. Storing only nonredun- fications in simple mixtures [45]. The challenge and future dant peptides from proteins digested in silico, the peptide- direction here pertains to how to minimize the culture time centric database of MiCId has a database size to number of and increase the identification rate. organism ratio that decreases with the increasing number of G. Alves et al.: Identification of Microorganisms 1735 2. Warnock, D.W.: Fungal diseases: an evolving public health challenge. organisms, making database search time increase only Med. Mycol. 44(8), 697–705 (2006) sublinearly with the number of the included organisms. 3. Aluwong, T., Bello, M.: Emerging diseases and implications for millen- MiCId’s workflow is fully automated: users need only specify nium development goals in Africa by 2015—an overview. Vet. Ital. 46(2), 137–145 (2010) a list of microbes to be included in the peptide-centric database 4. Didelot, X., Bowden, R., Wilson, D.J., Peto, T.E., Crook, D.W.: and the parameters to analyze the MS/MS spectra, and every- Transforming clinical microbiology with bacterial genome sequencing. thing is handled internally by MiCId. MiCId can be Nat Rev. Genet. 13(9), 601–612 (2012) 5. Garcia-Prats, J.A., Cooper, T.R., Schneider, V.F., Stager, C.E., Hansen, downloaded freely at http://www.ncbi.nlm.nih.gov/ T.N.: Rapid detection of microorganisms in blood cultures of newborn CBBresearch/Yu/downloads.html. infants utilizing an automated blood culture system. Pediatrics. 105(3 Pt The study presented here focuses mainly on the 1), 523–527 (2000) 6. Cleven, B.E., Palka-Santini, M., Gielen, J., Meembor, S., Kronke, M., classification/identification with accurate statistical signifi- Krut, O.: Identification and characterization of bacterial pathogens caus- cance for microbial mixtures via high-resolution MS/MS spec- ing bloodstream infections by DNA microarray. J. Clin. Microbiol. 44(7), tra. True advances in microbial studies using complex samples, 2389–2397 (2006) however, require multidisciplinary collaboration. Along this 7. Morgenthaler, N.G., Kostrzewa, M.: Rapid identification of pathogens in positive blood culture of patients with sepsis: review and meta-analysis of direction, many innovative approaches have been developed. the performance of the sepsityper kit. Int J Microbiol. 2015, 827,416 (2015) Such examples include but are not limited to faster MS/MS 8. Dubourg, G., Raoult, D.: Emerging methodologies for pathogen identifi- instruments like Orbitrap Fusion Lumos Tribid [73–75]that cation in positive blood culture testing. Expert. Rev. Mol. Diagn. 16(1), 97–111 (2016) can better sample complex mixtures by collecting a greater 9. Frank, U., Malkotsis, D., Mlangeni, D., Daschner, F.D.: Controlled number of MS/MS spectra, improved sample preparation pro- clinical comparison of three commercial blood culture systems. Eur. J. tocols for pathogenic samples [76, 77], and new techniques to Clin. Microbiol. Infect. Dis. 18(4), 248–255 (1999) 10. Puttaswamy, S., Lee, B.D., Sengupta, S.: Novel electrical method for isolate strain-specific peptides from microbes’ membrane pro- early detection of viable bacteria in blood cultures. J. Clin. Microbiol. teins [78, 79]. These advances, together with computational 49(6), 2286–2289 (2011) workflows (such as MiCId) built on firm statistical founda- 11. Kang, D.K., Ali, M.M., Zhang, K., Huang, S.S., Peterson, E., Digman, tions, may potentially further improve microbe identification in M.A., Gratton, E., Zhao, W.: Rapid detection of single bacteria in unpro- cessed blood using integrated comprehensive droplet digital detection. complex samples. Nat. Commun. 5, 5427 (2014) 12. Holmes, B., Willcox, W.R., Lapage, S.P.: Identification of Enterobacte- riaceae by the API 20E system. J. Clin. Pathol. 31(1), 22–30 (1978) Acknowledgements 13. Engvall, E.: Quantitative enzyme immunoassay (ELISA) in microbiolo- gy. Med. Biol. 55(4), 193–200 (1977) We thank the administrative group of the National Institutes of 14. Yolken, R.H.: Enzyme immunoassays for the detection of microbial Health Biowulf Cluster, where all the computational tasks were antigens and prospects for improved assays. Yale J Biol Med. 59(1), carried out. Without the availability of the NIH Biowulf Cluster, 25–31 (1986) 15. Price,L.B.,Liu,C.M.,Melendez, J.H.,Frankel,Y.M., Engelthaler,D.,Aziz, it would be exceedingly difficult to carry out this research due to M.,Bowers,J.,Rattray,R.,Ravel, J.,Kingsley, C.,Keim,P.S., Lazarus, the large sizes and number of data files used. This work was G.S., Zenilman, J.M.: Community analysis of chronic wound bacteria using supported by the Intramural Research Program of the National 16S rRNA gene-based pyrosequencing: impact of diabetes and antibiotics on chronic wound microbiota. PLoS One. 4(7), e6462 (2009) Library of Medicine, the National Heart, Lung, Blood Institute, 16. Stackebrandt, E., Goebel, B.M.: Taxonomic note: a place for dna-dna and the Clinical Center at the National Institutes of Health. reassociation and 16s rrna sequence analysis in the present species definition in bacteriology. Int. J. Syst. Evol. Microbiol. 44(4), 846–849 (1994) 17. Janda, J.M., Abbott, S.L.: 16S rRNA gene sequencing for bacterial Funding Information identification in the diagnostic laboratory: pluses, perils, and pitfalls. J. Clin. Microbiol. 45(9), 2761–2764 (2007) Funding for Open Access publication charges for this article 18. Fox, G.E., Wisotzkey, J.D., Jurtshuk, P., Fox, G.E.: How close is close: was provided by the National Institutes of Health. 16S rRNA sequence identity may not be sufficient to guarantee species identity. Int. J. Syst. Bacteriol. 42(1), 166–170 (1992) 19. Petti, C.A.: Detection and identification of microorganisms by gene amplification and sequencing. Clin. Infect. Dis. 44(8), 1108–1114 (2007) Open Access 20. Altun, O., Almuhayawi, M., Ullberg, M., Ozenci, V.: Rapid identification of microorganisms from sterile body fluids by use of FilmArray. J. Clin. This article is distributed under the terms of the Creative Microbiol. 53(2), 710–712 (2015) Commons Attribution 4.0 International License (http:// 21. Chan, J.Z., Pallen, M.J., Oppenheim, B., Constantinidou, C.: Genome sequenc- ing in clinical microbiology. Nat. Biotechnol. 30(11), 1068–1071 (2012) creativecommons.org/licenses/by/4.0/), which permits unre- 22. Sauer, S., Kliem, M.: Mass spectrometry tools for the classification and stricted use, distribution, and reproduction in any medium, identification of bacteria. Nat. Rev. Microbiol. 8(1), 74–82 (2010) provided you give appropriate credit to the original author(s) 23. Hodkinson, B.P., Grice, E.A.: Next-generation sequencing: a review of and the source, provide a link to the Creative Commons technologies and tools for wound microbiome research. Adv Wound Care (New Rochelle). 4(1), 50–58 (2015) license, and indicate if changes were made. 24. Karas, M., Hillenkamp, F.: Laser desorption ionization of proteins with molecular masses exceeding 10,000 daltons. Anal. Chem. 60(20), 2299– 2301 (1988) 25. Marvin, L.F., Roberts, M.A., Fay, L.B.: Matrix-assisted laser desorption/ References ionization time-of-flight mass spectrometry in clinical chemistry. Clin. Chim. Acta. 337(1–2), 11–21 (2003) 1. Tauxe, R.V.: Emerging foodborne diseases: an evolving public health 26. Stevenson, L.G., Drake, S.K., Murray, P.R.: Rapid identification of challenge. Emerging Infect Dis. 3(4), 425–434 (1997) bacteria in positive blood culture broths by matrix-assisted laser 1736 G. Alves et al.: Identification of Microorganisms desorption ionization-time of flight mass spectrometry. J. Clin. Microbiol. MetaPro-IQ: a universal metaproteomic approach to studying human and 48(2), 444–447 (2010) mouse gut microbiota. Microbiome. 4(1), 31 (2016) 27. Bhatti, M.M., Boonlayangoor, S., Beavis, K.G., Tesic, V.: Evaluation of 49. Mesuere, B., Devreese, B., Debyser, G., Aerts, M., Vandamme, P., FilmArray and Verigene systems for rapid identification of positive blood Dawyndt, P.: Unipept: tryptic peptide-based biodiversity analysis of cultures. J. Clin. Microbiol. 52(9), 3433–3436 (2014) metaproteome samples. J. Proteome Res. 11(12), 5773–5780 (2012) 50. Huson, D.H., Auch, A.F., Qi, J., Schuster, S.C.: MEGAN analysis of 28. Tanner, H., Evans, J.T., Gossain, S., Hussain, A.: Evaluation of three sample preparation methods for the direct identification of bacteria in positive blood metagenomic data. Genome Res. 17(3), 377–386 (2007) cultures by MALDI-TOF. BMC Res Notes. 10(1), 48 (2017) 51. Tanca, A., Palomba, A., Deligios, M., Cubeddu, T., Fraumene, C., Biosa, G., Pagnozzi, D., Addis, M.F., Uzzau, S.: Evaluating the impact of 29. Karlsson, R., Gonzales-Siles, L., Boulund, F., Svensson-Stadler, L., different sequence databases on metaproteome analysis: insights from a Skovbjerg, S., Karlsson, A., Davidson, M., Hulth, S., Kristiansson, E., lab-assembled microbial mixture. PLoS One. 8(12), e82,981 (2013) Moore, E.R.: Proteotyping: proteomic characterization, classification and 52. Stojmirović, A., Yu, Y.K.: Robust and accurate data enrichment statistics identification of microorganisms—a prospectus. Syst. Appl. Microbiol. via distribution function of sum of weights. Bioinformatics. 26(21), 38(4), 246–257 (2015) 2752–2759 (2010) 30. Chatterjee, S., Stupp, G.S., Park, S.K., Ducom, J.C., Yates, J.R., Su, A.I., 53. Alves, G., Wang, G., Ogurtsov, A.Y., Drake, S.K., Gucek, M., Suffredini, Wolan, D.W.: A comprehensive and scalable database search system for A.F., Sacks, D.B., Yu, Y.K.: Identification of microorganisms by high metaproteomics. BMC Genomics. 17(1), 642 (2016) resolution tandem mass spectrometry with accurate statistical signifi- 31. Jagtap, P., Goslinga, J., Kooren, J.A., McGowan, T., Wroblewski, M.S., cance. J. Am. Soc. Mass Spectrom. 27(2), 194–210 (2016) Seymour, S.L., Griffin, T.J.: A two-step database search method improves 54. Byrd, A.L., Perez-Rogers, J.F., Manimaran, S., Castro-Nallar, E., Toma, sensitivity in peptide sequence matches for metaproteomics and I., McCaffrey, T., Siegel, M., Benson, G., Crandall, K.A., Johnson, W.E.: proteogenomics studies. Proteomics. 13(8), 1352–1357 (2013) Clinical PathoScope: rapid alignment and filtration for accurate pathogen 32. Tharakan, R., Edwards, N., Graham, D.R.: Data maximization by identification in clinical samples using unassembled sequencing data. multipass analysis of protein mass spectra. Proteomics. 10(6), 1160– BMC Bioinformatics. 15, 262 (2014) 1171 (2010) 55. Tanca, A., Palomba, A., Fraumene, C., Pagnozzi, D., Manghina, V., 33. Bern, M., Kil, Y.J.: Comment on Bunbiased statistical analysis for multi-stage Deligios, M., Muth, T., Rapp, E., Martens, L., Addis, M.F., Uzzau, S.: proteomic search strategies^.J.ProteomeRes. 10(4), 2123–2127 (2011) The impact of sequence database choice on metaproteomic results in gut 34. Gupta, N., Bandeira, N., Keich, U., Pevzner, P.A.: Target-decoy approach microbiota studies. Microbiome. 4(1), 51 (2016) and false discovery rate: when things may go wrong. J. Am. Soc. Mass 56. Alves, G., Ogurtsov, A.Y., Yu, Y.K.: RAId_DbS: peptide identification Spectrom. 22(7), 1111–1120 (2011) using database searches with realistic statistics. Biol. Direct. 2,25 (2007) 35. Alves, G., Yu, Y.K.: Improving peptide identification sensitivity in shot- 57. Alves, G., Yu, Y.K.: Confidence assignment for mass spectrometry based gun proteomics by stratification of search space. J. Proteome Res. 12(6), peptide identifications via the extreme value distribution. Bioinformatics. 2571–2581 (2013) 32(17), 2642–2649 (2016) 36. Wooley, J.C., Ye, Y.: Metagenomics: facts and artifacts and Computa- 58. Pupo, G.M., Lan, R., Reeves, P.R.: Multiple independent origins of tional Challenges*. J Comput Sci Technol. 25(1), 71–81 (2009) Shigella clones of Escherichia coli and convergent evolution of many of 37. Bazinet, A.L., Cummings, M.P.: A comparative evaluation of sequence their characteristics. Proc. Natl. Acad. Sci. 97(19), 10,567–10,572 (2000) classification programs. BMC Bioinformatics. 13,92 (2012) 59. Jin, Q., Yuan, Z., Xu, J., Wang, Y., Shen, Y., Lu, W., Wang, J., Liu, H., 38. Wang, W.L., Xu, S.Y., Ren, Z.G., Tao, L., Jiang, J.W., Zheng, S.S.: Yang, J., Yang, F., Zhang, X., Zhang, J., Yang, G., Wu, H., Qu, D., Dong, Application of metagenomics in the human gut microbiome. World J. J., Sun, L., Xue, Y., Zhao, A., Gao, Y., Zhu, J., Kan, B., Ding, K., Chen, Gastroenterol. 21(3), 803–814 (2015) S., Cheng, H., Yao, Z., He, B., Chen, R., Ma, D., Qiang, B., Wen, Y., 39. Dworzanski, J.P., Snyder, A.P., Chen, R., Zhang, H., Wishart, D., Li, L.: Hou, Y., Yu, J.: Genome sequence of Shigella flexneri 2a: insights into Identification of bacteria using tandem mass spectrometry combined with pathogenicity through comparison with genomes of Escherichia coli K12 a proteome database and statistical scoring. Anal. Chem. 76(8), 2355– and O157. Nucleic Acids Res. 30(20), 4432–4441 (2002) 2366 (2004) 60. Prakash, O., Verma, M., Sharma, P., Kumar, M., Kumari, K., Singh, A., 40. Dworzanski, J.P., Snyder, A.P.: Classification and identification of bac- Kumari, H., Jit, S., Gupta, S.K., Khanna, M., Lal, R.: Polyphasic ap- teria using mass spectrometry-based proteomics. Expert Rev Proteomics. proach of bacterial classification—an overview of recent advances. Indian 2(6), 863–878 (2005) J. Microbiol. 47(2), 98–108 (2007) 41. Jabbour, R.E., Deshpande, S.V., Wade, M.M., Stanford, M.F., Wick, 61. Yu, Y.K., Gertz, E.M., Agarwala, R., Schaffer, A.A., Altschul, S.F.: C.H., Zulich, A.W., Skowronski, E.W., Snyder, A.P.: Double-blind char- Retrieval accuracy, statistical significance and compositional similarity acterization of non-genome-sequenced bacteria by mass spectrometry- in protein sequence database searches. Nucleic Acids Res. 34(20), 5966– based proteomics. Appl. Environ. Microbiol. 76(11), 3637–3644 (2010) 5973 (2006) 42. Jabbour, R.E., Deshpande, S.V., Stanford, M.F., Wick, C.H., Zulich, 62. Alves, G., Wu, W.W., Wang, G., Shen, R.F., Yu, Y.K.: Enhancing A.W., Snyder, A.P.: A protein processing filter method for bacterial peptide identification confidence by combining search methods. J. Prote- identification by mass spectrometry-based proteomics. J. Proteome Res. ome Res. 7, 3102–3113 (2008) 10(2), 907–912 (2011) 63. Zaykin, D.V., Zhivotovsky, L.A., Westfall, P.H., Weir, B.S.: Truncated 43. Tracz, D.M., McCorrister, S.J., Chong, P.M., Lee, D.M., Corbett, C.R., product method for combining P-values. Genet. Epidemiol. 22(2), 170– Westmacott, G.R.: A simple shotgun proteomics method for rapid bacte- 185 (2002) rial identification. J. Microbiol. Methods. 94(1), 54–57 (2013) 64. Alves, G., Yu, Y.K.: Combining independent, weighted P-values: achiev- 44. Penzlin, A., Lindner, M.S., Doellinger, J., Dabrowski, P.W., Nitsche, A., ing computational stability by a systematic expansion with controllable Renard, B.Y.: Pipasic: similarity and expression correction for strain-level accuracy. PLoS One. 6(8), e22,647 (2011) identification and quantification in metaproteomics. Bioinformatics. 65. Alves, G., Yu, Y.K.: Mass spectrometry-based protein identification with 30(12), i149–i156 (2014) accurate statistical significance assignment. Bioinformatics. 31(5), 699– 45. Boulund, F., Karlsson, R., Gonzales-Siles, L., Johnning, A., Karami, N., 706 (2015) Al-Bayati, O., Ahren, C., Moore, E.R.B., Kristiansson, E.: TCUP: typing 66. Peterson, J., Garges, S., Giovanni, M., McInnes, P., Wang, L., Schloss, and characterization of bacteria using bottom-up tandem mass spectrom- J.A., Bonazzi, V., McEwen, J.E., Wetterstrand, K.A., Deal, C., Baker, etry proteomics. Mol. Cell. Proteomics (2017) C.C., Di Francesco, V., Howcroft, T.K., Karp, R.W., Lunsford, R.D., 46. Hettich, R.L., Pan, C., Chourey, K., Giannone, R.J.: Metaproteomics: Wellington, C.R., Belachew, T., Wright, M., Giblin, C., David, H., Mills, harnessing the power of high performance mass spectrometry to identify M.,Salomon,R.,Mullins,C.,Akolkar,B.,Begg,L.,Davis,C., the suite of proteins that control metabolic activities in microbial com- Grandison, L., Humble, M., Khalsa, J., Little, A.R., Peavy, H., Pontzer, munities. Anal. Chem. 85(9), 4203–4214 (2013) C., Portnoy, M., Sayre, M.H., Starke-Reed, P., Zakhari, S., Read, J., 47. Jagtap, P.D., Blakely, A., Murray, K., Stewart, S., Kooren, J., Johnson, Watson, B., Guyer, M.: The NIH human microbiome project. Genome J.E., Rhodus, N.L., Rudney, J., Griffin, T.J.: Metaproteomic analysis Res. 19(12), 2317–2323 (2009) using the Galaxy framework. Proteomics. 15(20), 3553–3565 (2015) 67. Elias, J.E., Gygi, S.P.: Target-decoy search strategy for increased confi- 48. Zhang, X., Ning, Z., Mayne, J., Moore, J.I., Li, J., Butcher, J., Deeke, dence in large-scale protein identifications by mass spectrometry. Nat. S.A., Chen, R., Chiang, C.K., Wen, M., Mack, D., Stintzi, A., Figeys, D.: Methods. 4(3), 207–214 (2007) G. Alves et al.: Identification of Microorganisms 1737 68. Wang, G., Wu, W.W., Zhang, Z., Masilamani, S., Shen, R.F.: Decoy methods using an Orbitrap Fusion Lumos Tribrid mass spectrometer. methods for assessing false positives and false discovery rates in shotgun Proteomics 17(9) (2017) proteomics. Anal. Chem. 81,146–159 (2009) 76. Mottaz-Brewer, H.M., Norbeck, A.D., Adkins, J.N., Manes, N.P., 69. Michael, J.R.: The stabilized probability plot. Biometrika. 70(1), 11–17 Ansong, C., Shi, L., Rikihisa, Y., Kikuchi, T., Wong, S.W., Estep, (1983) R.D., Heffron, F., Pasa-Tolic, L., Smith, R.D.: Optimization of proteomic 70. Alves, G., Ogurtsov, A.Y., Wu, W.W., Wang, G., Shen, R.F., Yu, Y.K.: sample preparation procedures for comprehensive protein characteriza- Calibrating E-values for MS2 database search methods. Biol. Direct. 2,26 tion of pathogenic systems. J. Biomol. Tech. 19(5), 285–295 (2008) (2007) 77. Tanca, A., Palomba, A., Pisanu, S., Deligios, M., Fraumene, C., 71. Wolters, D.A., Washburn, M.P., Yates, J.R.: An automated multidimen- Manghina, V., Pagnozzi, D., Addis, M.F., Uzzau, S.: A straightforward sional protein identification technology for shotgun proteomics. Anal. and efficient analytical pipeline for metaproteome characterization. Chem. 73(23), 5683–5690 (2001) Microbiome. 2(1), 49 (2014) 72. Sorić, B.: Statistical Bdiscoveries^ and effect-size estimation. J. Am. Stat. 78. Karlsson, R., Davidson, M., Svensson-Stadler, L., Karlsson, A., Olesen, Assoc. 84(406), 608–610 (1989) K., Carlsohn, E., Moore, E.R.: Strain-level typing and identification of 73. Makarov, A.: Electrostatic axially harmonic orbital trapping:? A high- bacteria using mass spectrometry-based proteomics. J. Proteome Res. performance technique of mass analysis. Anal. Chem. 72(6), 1156–1162 11(5), 2710–2720 (2012) (2000) 79. Gonzales-Siles, L., Karlsson, R., Kenny, D., Karlsson, A., Sjoling, A.: 74. Eliuk, S., Makarov, A.: Evolution of Orbitrap mass spectrometry instru- Proteomic analysis of enterotoxigenic Escherichia coli (ETEC) in neutral mentation. Annu Rev Anal Chem (Palo Alto, Calif). 8,61–80 (2015) and alkaline conditions. BMC Microbiol. 17(1), 11 (2017) 75. Espadas, G., Borras, E., Chiva, C., Sabido, E.: Evaluation of different peptide fragmentation types and mass analyzers in data-dependent http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Journal of The American Society for Mass Spectrometry Springer Journals

Rapid Classification and Identification of Multiple Microorganisms with Accurate Statistical Significance via High-Resolution Tandem Mass Spectrometry

Free
17 pages

Loading next page...
 
/lp/springer-journals/rapid-classification-and-identification-of-multiple-microorganisms-E3gc74VIN7
Publisher
Springer Journals
Copyright
Copyright © 2018 by The Author(s)
Subject
Chemistry; Analytical Chemistry; Biotechnology; Organic Chemistry; Proteomics; Bioinformatics
ISSN
1044-0305
eISSN
1879-1123
D.O.I.
10.1007/s13361-018-1986-y
Publisher site
See Article on Publisher Site

Abstract

B The Author(s), 2018 J. Am. Soc. Mass Spectrom. (2018) 29:1721Y1737 DOI: 10.1007/s13361-018-1986-y RESEARCH ARTICLE Rapid Classification and Identification of Multiple Microorganisms with Accurate Statistical Significance via High-Resolution Tandem Mass Spectrometry 1 2 1 3 2 Gelio Alves, Guanghui Wang, Aleksey Y. Ogurtsov, Steven K. Drake, Marjan Gucek, 4 1 David B. Sacks, Yi-Kuo Yu National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA Proteomics Core, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, MD 20892, USA Critical Care Medicine Department, Clinical Center, National Institutes of Health, Bethesda, MD 20892, USA Department of Laboratory Medicine, Clinical Center, National Institutes of Health, Bethesda, MD 20892, USA Abstract. Rapid and accurate identification and classification of microorganisms is of paramount importance to public health and safety. With the advance of mass spectrometry (MS) technology, the speed of identification can be greatly im- proved. However, the increasing number of mi- crobes sequenced is complicating correct micro- bial identification even in a simple sample due to the large number of candidates present. To prop- erly untwine candidate microbes in samples con- taining one or more microbes, one needs to go beyond apparent morphology or simple Bfingerprinting^;to correctly prioritize the candidate microbes, one needs to have accurate statistical significance in microbial identification. We meet these challenges by using peptide-centric representations of microbes to better separate them and by augmenting our earlier analysis method that yields accurate statistical significance. Here, we present an updated analysis workflow that uses tandem MS (MS/MS) spectra for microbial identification or classification. We have demonstrated, using 226 MS/MS publicly available data files (each containing from 2500 to nearly 100,000 MS/MS spectra) and 4000 additional MS/MS data files, that the updated workflow can correctly identify multiple microbes at the genus and often the species level for samples containing more than one microbe. We have also shown that the proposed workflow computes accurate statistical significances, i.e., E values for identified peptides and unified E values for identified microbes. Our updated analysis workflow MiCId, a freely available software for Microorganism Classification and Identification, is available for download at https://www. ncbi.nlm.nih.gov/CBBresearch/Yu/downloads.html. Keywords: Pathogen identification, Microorganism classification, Statistical significance, Mass, Spectrometry, Proteomics Received: 13 November 2017/Revised: 30 March 2018/Accepted: 25 April 2018/Published Online: 5 June 2018 Introduction apid and accurate identification and classification of mi- Electronic supplementary material The online version of this article (https:// Rcroorganisms is of paramount importance to public health doi.org/10.1007/s13361-018-1986-y) contains supplementary material, which and safety [1–3]. Traditional methods for microbial identifica- is available to authorized users. tions target only a limited number of microorganisms [4, 5]and often require 72 h or more to carry out [6–8]. For most Correspondence to: Yi–Kuo Yu; e-mail: yyu@ncbi.nlm.nih.gov 1722 G. Alves et al.: Identification of Microorganisms microorganismal identification protocols, the first step can take cannot be used for identification in a sample containing multi- the longest. This time-consuming step involves preparing a ple microbes [27, 28]. Given that our main goal is to robustly culture of the collected sample in a selected medium, usually and accurately identify multiple microbes in mixed samples, in a blood culture, to test for the presence of any microbes and to our analyses, we use only data collected from high-resolution amplify the concentration of microbes that might be present [7– instruments using high-performance liquid chromatography- 9]. If the prepared culture tests positive for the presence of tandem MS (LC-MS/MS) [29] to mitigate one of the challenges microbes, further tests are required to distinguish and identify (noise in data) that hinders the identification of multiple mi- within the sample each microbe present [7, 8, 10, 11]. crobes. Another challenge appears due to the fast expansion of One of these tests is the analytical profile index (API), database size. In order not to miss identify any known microbe, which consists of a system of 20 biochemical reactions. Among one needs as an input to peptide identification tools a protein the concerns with the API test are that it cannot always identify database that includes all microbes present in the sample. The microbes at the species level and it cannot handle samples ever-increasing number of microbial proteomes implies an composed of multiple microbes [12]. Another frequently used ever-expanding peptide/protein database along with ever- test is the enzyme-linked immunosorbent assay (ELISA), a increasing number of post-translational modifications (PTMs) highly specific yet expensive test, which relies on the use of as well as single amino acid polymorphisms (SAPs) that will antibodies that are specific to antigens of a given species of overwhelm most peptide identification tools [30]. microbe [13, 14]. Traditional polymerase chain reactions In addition to having longer search time, the consequence of (PCR) of 16S rRNA followed by sequencing methods [15] using a much larger database includes reduction of sensitivity. are also utilized in the identification of microbes. One issue in One way to circumvent this is to first construct a peptide- using 16S rRNA sequence information is that it provides centric database without including PTMs or SAPs, sorted ac- reliable identification at the genus level for the majority of cording to peptide masses, then have an interface program that cases, but it cannot always be used to differentiate between extracts from the database, for every precursor ion mass of the closely related species having a 16S rRNA sequence similarity query spectrum, the corresponding peptide set and passes it to greater than 97% [16, 17]. For example, Bacillus globisporus peptide identification tools as the input database. PTMs and and Bacillus psychrophilus have greater than 99.5% sequence SAPs are then allowed only for microbes whose identification homology in their 16S rRNAs but have a DNA-level related- confidences are higher than a specified threshold. Although ness of only 23–50% when measured by hybridization reaction this strategy allows for a higher peptide identification rate, it [18, 19]. Studies using FilmArray multiplex PCR have been should not be confused with the multistage proteomics search shown to be able to early detect single or multiple pathogenic strategies based on the target-decoy statistics [31]that is often microbes present in positive blood cultures, as long as patho- used in the metaproteomics community. For this type of mul- genic microbes are present with high enough concentrations tistage search strategy, in the second step (re-search), the and the FilmArray system has the proper primers for these resulting proportion of false discoveries (or false discovery pathogenic microbes [20]. There are also methods that do not rate) can no longer be estimated correctly [32–34], hence require a blood culture and can identify microbes from whole undermining their validity. When using methods that do not blood samples. Of these methods, PCR amplification for require target-decoy approach to assign accurate statistical electrospray ionization mass spectrometry (PCR/ESI-MS) is significance, however, the difference between the multistage the most promising with reported sensitivity and specificity in and one-pass search strategies becomes that of using a smaller the 1990s [8]. database and a larger one. And the sensitivity advantage of In recent years, next-generation sequencing (NGS) and multistage strategies can be achieved with much less search mass spectrometry (MS) have emerged as reliable technologies time by the stratified database search method [35]. However, for rapid and accurate identification/classification of microbes even if the database size issue can be mitigated by using [21, 22]. Utilizing both the coding and noncoding DNA infor- stratified database and on-the-fly scope expansion of PTMs and SAPs, a major challenge remains in terms of identifying mation, NGS can be used to screen bacteria effectively al- multiple microbes. though it lacks gene expression information. On the other hand, the MS-based technology (the focus of the current manuscript) This major challenge originates from the difficulty in delin- identifies microbes via peptides found thus providing protein eating the microbes based on peptides identified. We meet this expression information. There are different ways in which challenge by developing a method that can delineate or group these technologies can be employed for microorganismal iden- microbes based on their peptidome similarity (counting only tification, and we direct the reader to the review articles of experimentally observed peptides of high confidence) and can Hodkinson and Grice [23] and of Saurce and Klien [22]fora assign accurate statistical significances to microbes identified. survey. In terms of identifying multiple microbes, the aforementioned Both the NGS and MS-based technologies can be routinely challenges are common to both the NGS [36–38] and MS- used to identify single microbes. For example, matrix-assisted based methods [39–45]. However, the extent to which one can laser desorption ionization time-of-flight MS (MALDI-TOF- apply the statistical methods developed in this manuscript to MS) [24, 25] can be employed to identify single microbes NGS-based methods deserves a separate investigation that is quickly and accurately in pure samples [26]. However, it beyond the scope of the current manuscript. In addition to the G. Alves et al.: Identification of Microorganisms 1723 two main challenges outlined above, existing MS-based Materials and Methods methods for microbial identification/classification [39–44]as Downloaded MS/MS Data Files mentioned by Boulund et al. [45] also face other challenges such as being limited to simple samples and needing manual A total of 207 LC-MS/MS data files were downloaded from the intervention during data analysis. The latter makes it hard to ProteomeXchange database at http://www.proteomexchange. automate the data analysis workflow. org/ and from PeptideAtlas at http://www.peptideatlas.org/. Interestingly, identifying multiple microbes is also pursued in Of these LC-MS/MS data files, 194 are from mixture samples areas such as environmental proteomics, also termed of known organisms, with each sample containing one, two, metaproteomics [46–48]. In this area, even though the primary four or nine organisms. The remaining 13 data files are from aim is to understand the functional expression of complex complex samples of the human gut. Supplementary Tables S1, samples (not the microbe classification/identification in particu- S2, S3,and S4 provide the data file (DF) number, the file name, lar), recently the reliability of taxonomic attribution using the and the ProteomeXchange or PeptideAtlas identifier for each bioinformatic tools Unipept [49]and MEGAN[50] has been LC-MS/MS data file downloaded. All the MS/MS spectra assessed using unique (taxon-specific) peptides found experi- described here were acquired on high-resolution mass spec- mentally [51]. This latter endeavor is equivalent to classification/ trometers and further experimental details can be obtained in identification of microbes [47, 48], which happens to be our the ProteomeXchange website. The true positive microbes in main goal. However, unlike Unipept [49]and MEGAN[50], the latter group of data (human gut microbiome) are unknown. our method can identify/classify microbes with accurate statis- For the data from samples in the former group, the true posi- tical significance assignment. Our method can also be applied in tives in each sample were provided along with their mix ratios; terms of functional expression. To achieve this, we identify the many such samples have nearly equal number of cells per peptides, then the microbes, and then the corresponding pro- microbe, and some are from very biased microbe populations. teins. With proteins identified, one may query the term databases To complement the ratio varieties of the downloaded data, we such as gene ontology to obtain functional expression [52]. also generated some in-house MS/MS data from samples with In this manuscript, we present an updated version of MiCId, more biased (but not extreme) ratios. an analysis workflow for rapid and accurate identifications/ classifications of microbes. MiCId was designed to automate In-House Dataset the complete process, from microbial peptide database con- struction to microbial identification and protein identification. Bacterial Culture Preparation Fresh Escherichia coli The first version of MiCId, standing for Microorganism Clas- (ATCC 25922), Pseudomonas aeruginosa (ATCC 27853), sification and Identification, was tested for analysis of samples Streptococcus pneumoniae (ATCC 49619), and containing a single microbe [53]. We have now updated MiCId Staphylococcus aureus (ATCC 25923) plates were used to to specifically handle mixed samples containing multiple mi- inoculate a 2 ml tryptic broth for overnight growth. From each crobes while preserving its speed and its accurate statistical saturated culture, a 10-ml vial was inoculated with 100 μl significance. Using 226 LC-MS/MS data files from a variety of (1:100 dilution) and put in shaker at 37 C. Each culture growth microbial samples, some of whose microbial compositions are was monitored by nephelometer to an optical density value unknown, and 4000 blended MS/MS data files, we have ex- between 0.4 and 0.51 (approximately 10 cells). True CFU tensively evaluated MiCId’s performance in terms of values were achieved by plating diluted samples on sheep identifying/classifying multiple microbes at different taxonom- blood agar plates and counting the resulting colonies. Based ic levels. MiCId utilizes a hierarchical identification strategy on the measured approximate cell count values for each culture, where microbes are identified starting at the phylum level then these cultures were mixed in different ratios to generate mi- descending one level at a time. With the E value cutoff set at crobe mixtures containing 10 cells total. The prepared microbe mixtures with their estimated ratios, DFs 84–102, are listed in 0.01 (effectively control the PFD to be less than 5%), in terms Table S3. These mixtures were added to eppendorf tubes and of microbe identification using blended MS/MS data (BMD-A spun at 14 K rpm for 2 min until all of the samples were in the and BMD-B), MiCId yields an average true positive rate of 0.9813 at the genus level and 0.9550 at the species level. (More eppendorf tube and the supernatants discarded. These pellets details can be found in Table 2.) One should note, however, were washed with 1 ml 70% EtOH and then resuspended in that these numbers were obtained by blending spectra from up 150 μl 70% formic acid. After vortexing, 150 μlacetonitrile to 24 single-species samples. Generalization to complex mi- was added and samples were vortexed and respun. The super- crobiota samples should be taken with a grain of salt. MiCId’s natant was transferred to a clean tube and speed-vacuum dried computed statistical significance is shown to be accurate for on medium heat. To each dried tube, 40 μlof 6 Murea and microbes identified when tested against a decoy database at 50 mM NH HCO were added and the tube was sonicated for 4 3 various taxonomical levels, providing a confidence measure for 50 min with occasional vortexing. Samples were reduced with users to control the proportion of false discoveries. The pro- DTT (4 μl 1 M in water, 37 °C for 60 min), alkylated (20 μl posed workflow has been implemented in MiCId, a freely iodoacetamide 40 mg/ml in water, at room temperature for available software that can be downloaded at http://www. 60 min in the dark), and quenched with DTT (4 μl, 15 min). ncbi.nlm.nih.gov/CBBresearch/Yu/downloads.html. The tubes were diluted with 100 μlof 50 mM NH HCO and 4 3 1724 G. Alves et al.: Identification of Microorganisms 10 μl of 100 mM NH HCO . Trypsin/Lys-c (Promega, 2 μg) specificity; BMD-C was used to evaluate the accuracy of 4 3 was added to each tube. Samples were digested using the CEM MiCId’s statistical significance assignment in terms of microbe Discovery microwave digester (60 min, 50 °C, 50 W, with identifications. cooling). After digestion, samples were stored at − 20 °C until DFs 1–24, which cover 24 species and 21 genera, were used used. to generate BMD-A. BMD-A is composed of five subsets, each corresponding to a fixed sampling percentage of 1, 5, 10, 25, or 50%. Every subset contains 500 blended DFs, each of which Liquid Chromatography-Tandem Mass Spectrometry Acquisi- was generated by sampling a fixed percentage of MS/MS tion Liquid chromatography-tandem mass spectrometry (LC- spectra from every one of the 24 DFs. BMD-B, covering 15 MS-MS) was performed using an Eksigent nanoLC-Ultra 2D species and 15 genera, is made of seven subsets each compos- system (Dublin, CA) coupled to an Orbitrap Elite mass spec- ing 200 blended DFs. Every blended DF in a subset was trometer (Thermo Scientific, San Jose, CA). Peptide samples generated by sampling a fixed percentage p of MS/MS spectra were first loaded onto a Zorbax 300SB-C18 trap column from group 1 DFs and a complement percentage 100-p of MS/ (Agilent, Palo Alto, CA) at a flow rate of 6 μL/min for MS spectra from group 2 DFs. For BMD-B, group 1 contains 10 min, and then separated on a reversed-phase BetaBasic DFs 24–31 and group 2 contains DFs 32–38. The seven dif- C18 PicoFrit analytical column (0.075 × 250 mm, New Objec- ferent subsets are distinguished by their complement pairs of tive, Woburn, MA) using a 90-min linear gradient of 5–35% percentages: (95, 5%), (90, 10%), (75, 25%), (50, 50%), (25, acetonitrile in 0.1% formic acid at a flow rate of 250 nl/min. 75%), (10, 90%), (5, 95%). BMD-C contains 100 blended DFs. Eluted peptides were sprayed into the Orbitrap Elite equipped Every blended DF of BMD-C was generated by sampling 50% with a nano-spray ionization source. Both survey (MS) and of MS/MS spectra from each of the following DFs: 40, 43, 48, product (MS/MS) spectra were acquired in the Orbitrap, and 53, 55, and 61. the FTMS resolution was set at 30,000 and 15,000, respective- ly. Each MS scan was followed by six data-dependent CID MS/MS scans with dynamic exclusion. Other mass spectrom- Peptide-Centric Databases etry settings were as follows: spray voltage, 1.8 kV; full MS Protein sequences were downloaded (on February 16, 2018) mass range, m/z 300 to 2000; normalized collision energy, from the National Center for Biotechnology Information 35%; precursor ion isolation mass width, 3 Da. (NCBI) at ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/. For the current study, five peptide-centric databases (DBs) (DB-1 through DB-5) were constructed. In the genbank.assembly file, Blended MS/MS Dataset an organism’s genome assembly level is labeled as contig, Even though we already have some data files from mixture scaffold, chromosome, or complete genome, in order of in- samples of up to nine microbes, they are limited in number and creasing completeness. The peptide-centric DB-1 (DB-2) in- perhaps in complexity. While we have the data files from the cludes, in addition to Homo sapiens and Mus musculus,all complex samples of the human gut, the true positives within bacteria, archaea, fungi, and virus whose assemblies are at these samples are unknown. To stress test our proposed iden- chromosome (scaffold) level or higher. The other three data- tification method, we need a large dataset made of highly bases were created from metagenomics data that have been complex samples but with true positives known. Absent analyzed and transformed into protein sequence fasta files [55]: existing DFs of this type, we generated blended DFs in silico DB-3 (DB-Human_0_RF_6GB.fasta containing 3,423,708 se- (similar to methods of [54] thathavebeenemployedtoevaluate quences), DB-4 (DB-Human_0_CF_6GB.fasta containing metagenomics analysis workflows). 192,582 sequences), and DB-5 (DB-Human_1+2+3_RF_ Each blended MS/MS data file for this purpose was gener- 6GB.fasta containing 2,866,541 sequences). ated using the following steps: (1) identify a list of data files, All DBs were constructed as follows: downloaded protein each containing MS/MS spectra from a sample of a microbe; sequences were in silico digested following the digestion rule (2) for each data file in the list, a number of MS/MS spectra out for trypsin and lys-c, i.e., cleaving at the carboxyl termini of of the total were randomly sampled according to a pre-specified arginine and lysine, allowing up to two missed cleavage sites. percentage; (3) merge the sampled MS/MS spectra to mimic a In our DBs, only nonredundant tryptic and lys-c peptides with data file from a mixture sample of these microbes; and (4) molecular masses between 660 and 4000 Da were kept. By repeat steps 1–3 to achieve the desired number of blended nonredundant peptide, we mean the following. We keep a copy MS/MS data files. The number of microbes chosen and the and only one copy of every possible peptide (resulted from in specified percentage of MS/MS spectra to sample (from each silico enzyme digestion of the protein database) regardless of microbe’s MS/MS data file) determine the size of a blended whether the peptide is shared by multiple microbes or not. A MS/MS data file. nonredundant peptide therefore can be a unique peptide to a A total of three blended MS/MS datasets (BMDs) were microbe at a given taxonomic level, but may become a shared generated. BMD-A was used for learning the parameters need- peptide when a lower taxonomic level is considered. Never- ed for MiCId’s clustering procedure; BMD-B was used to theless, no peptides will be excluded, and as more microbe evaluate MiCId’s performance in terms of sensitivity and genomes are sequenced, the growth rate of our peptide-centric G. Alves et al.: Identification of Microorganisms 1725 database is expected to be smaller than that of the protein in DNA sequencing technology and a polyphasic approach that databases. utilizes genotypic, chemotypic, and phenotypic information In DB-1 and DB-2, for each peptide, the names of strains, during taxonomic classification [60]. subspecies, species, genera, families, orders, classes, and phyla To provide microbe identification significances, we com- that contain this peptide are also recorded and linked to the pute a unified E value (E ) by combining the spectrum-specific peptide. The sizes of DB-1 and DB-2 are 46 and 200 GB, E values of the confidently identified peptides (CIPs). A pep- respectively. Taxonomic information included in DB-1 and tide is considered a CIP if it is identified with an E value smaller DB-2 was extracted from the taxonomy files downloaded (on than a cutoff (E ). When performing identifications, we might February 16, 2018) from the NCBI (https://www.ncbi.nlm.nih. want to eliminate potential false positives aggressively thus gov/Taxonomy). The 46,838,064 protein sequences setting E low; currently, E is defined to be the minimum of c c (807,574,956 nonredundant tryptic peptides) in DB-1 are from 1 and 100/n (with n denoting the total number of MS/MS s s 23,911 organisms, belonging to 13,072 species, 1890 genera, spectra acquired for a given experiment). (When assessing and 517 families. The 260,931,852 protein sequences statistical accuracy using a random/decoy database, however, (2,507,889,685 nonredundant tryptic peptides) in DB-2 are one is essentially counting false positives and setting E too low from 75,356 organisms, belonging to 26,368 species, 2870 will reduce the number of false positives estimated.) With this genera, and 701 families. Relevant information pertaining to E specified, on average for large n , only 100 false positive c s the NCBI taxonomy identifiers and organism names for the peptides are expected among the CIPs. We next detail an different organisms included in DB-1 and DB-2 can be found important clustering step [53], which is now improved to in Supplementary Table S5. Since DB-3, DB-4, and DB-5 are accommodate identifications of multiple microbes. In our peptide-driven clustering procedure [53], taxa sharing from metagenomics reads/contigs, no taxonomy information is available. significant amounts of CIPs were clustered together. All CIPs were regarded as equally important, and the clustering proce- Software Parameter Values Used dure did not go through further iteration. In this manuscript, we incorporate all peptides with E values less than 1 in the clus- Six databases, DB-1 through DB-5 and the reverse of DB-1, tering procedure; however, we also introduce fractional counts were used in MS/MS data analyses; the first five were used as to give peptides with better (lower) E values more (less) influ- the target DBs while the last as the decoy DB. The other ence than peptides with worse (higher) E values. The updated software parameters are described below. While performing peptide-driven clustering procedure is described below. database searches, the digestion rules of trypsin and lys-c were First, peptides identified with E values ≤ 1 are mapped to the assumed with up to two missed cleavage sites per peptide different taxa in the database. Second, a standardized weighted allowed. Iodocetamide was used as the reduction agent, chang- count (Z) isassignedto eachidentifiedpeptide (whose E value ing the molecular mass of every cysteine from 103.00919 to is E). 160.030647 Da. The mass error tolerance of 10 ppm was set for both precursor and product ions. RAId’s Rscore scoring func- tion, using b and y ions as evidence, was used for scoring peptides. The statistical significance assigned to each peptide ZEðÞ¼ : ð1Þ was given by RAId’s theoretically derived peptide score dis- 1 þ tribution [56]. The largest (cutoff) E value for a peptide to be reported was set to 1. Third, taxa are sorted in decreasing order of their weighted Statistical Method for Microbial Identification number of identified peptides to prepare for the clustering For our statistical method for microbial identification to be procedure. The first taxon entering a cluster is called the head effective, two prerequisites are indispensable: (1) accurate sig- of that taxon cluster, while other taxa the members of that nificance assignments, e.g., E values, at the peptide level must cluster. Starting from the best ranked taxon (a cluster head) in be provided and (2) microbes used for database construction the sorted list, any other lower ranked taxon will cluster to the must have the correct taxonomic classification. The first re- former if a resemblance coefficient of 0.85 or larger is obtained quirement is satisfied because peptide identifications in MiCId between them. The resemblance coefficient is defined as the are done by using RAId’s scoring function and significance proportion of the weighted number of peptides belonging to a assignments which have been shown to yield accurate E values lower ranked taxon that can be explained by identified peptides [56, 57]. associated with a cluster head. Once the worst ranked taxon is As for the second requirement, it is known that the mi- reached, the process will continue with the best ranked, not-yet- crobes’ taxonomic classification is not perfect and sometimes clustered taxon as a cluster head and repeat until all the controversial. For example, some studies recommend that unclustered taxa have been attempted as a cluster head, but Shigella flexneri should be classified as a strain of not more than once. Escherichia coli [58, 59]. However, the microbes’ taxonomic Fourth, after having generated taxon clusters, we further classification is expected to improve thanks to recent advances group these taxon clusters as follows. Starting with the lowest 1726 G. Alves et al.: Identification of Microorganisms ranking taxon cluster head, we denote its resemblance coeffi- compare with, leading to the conditional probability formula cient to others by its proportion of the weighted number of for the product of truncated P values [53, 63]. identified peptides that are shared by all other heads of taxon h  i clusters; when a resemblance coefficient of 0.85 or larger is m ln P −ln ðÞ τ m−1 raw c obtained, the cluster under consideration is merged to its closest P ~τ≤τj m; m ¼ ∑ ; ð4Þ t raw raw P s! s¼0 taxon cluster, i.e., the one whose head shares the largest weighted number of peptides with the current head. The re- where m ≡ ⌈m ⌉ is the smallest integer that is greater than or raw maining number of cluster heads n is used as the Bonferroni equal to m . An example of how to compute the unified P raw correction factor for significance calculation later. There is, value using Eq. (5) can be found in the electronic supplemen- however, an exception to the general clustering rules above. tary material of an earlier publication [53]. When a taxon (taxon cluster) contains three or more CIPs that Finally, P is given by are not shared with any other taxa (taxon clusters), it can only be a cluster head. M ! M−m P ~ τ≤τ ¼ PðÞ 1−P P ~ τ≤τj m; m u c t raw There is also a difference in the identification workflow m!ðÞ M−m ! compared to our earlier method. We considered all genera M M ! M− j ð5Þ þ ∑ PðÞ 1−P (species) together when performing genus (species) level iden- c j!ðÞ M−j ! j¼mþ1 hi tifications [53]. Here, we begin identifications at the phylum j j θ P −τ P ~ τ≤τj j; j þθτ−P ; c c level and then down. The current method has the advantage that one may eliminate from consideration taxa that are unlike- lm ly to be present during the upper level identifications. Below is where M ¼ ∑ wðÞ E≤1 with N being the total number j T j¼1 the condition we employ to select taxa to be retained at each of identified peptides (with E value ≤ 1) mappable to taxon T, identification level. Each cluster head will be considered; any P is the database P value for E , and the θ(x) function takes the c c member with a percentage difference in the weighted number value 1 when x>0and0otherwise. of CIPs to its cluster head less than 15% or having three or Within each cluster, the unified E value of each member more CIPs that are not shared with other taxa will be retained. taxon, cluster head included, is computed; the taxon with the In order to provide statistical significances at various taxo- most significant unified E value becomes the head of the nomic levels, we compute a unified E value E by combining cluster. Note that the starting cluster head (with largest M) the spectrum-specific E values of the CIPs belonging to the remains most significant most of the time, indicating the sta- same taxon. The details of how the E s are computed for bility of our clustering procedure and significance assignment. microbes at different taxonomic levels have been previously After this step, the clusters are finally sorted in ascending order described [53]. Here, we only briefly outline the essential steps of the unified E values of the cluster heads. for computing E . The unified E value E is given by Statistical Method for Protein Identification E ¼ n  P ; ð2Þ u c u In this update of MiCId, we have included the protein where in Eq. (2), the unified P value (P ) is multiplied by the identification capability. Proteomes of confidently identi- Bonferroni correction factor, n , the final number of peptide- fied microbes (cluster heads at the species level) are used driven taxon clusters. The P is obtained by first transforming as the protein database. The implemented statistical the E values (E) of CIPs into database P values (p)[61, 62], p = −E framework for protein identification is found on a rigor- 1 − e .Aweight (w ) is then defined for each peptide π as ously derived general formula [64], which has been 1/C !with C being the total number of taxon clusters contain- π π extensively tested for the application to protein identifi- ing π.Note that w s are computed for all πs identified with E cation [65]. Since the detailed derivations [64]and the values ≤ 1 although only CIPs are used for computing the applications [65] are already available, here we only unified P value. Evidently, C varies by the taxonomic level briefly summarize the statistical method for protein iden- considered. tification in MiCId. For each taxon T, one then combines the weighted product As before, we consider all identified peptides with E values of database P values into a new variable E ≤ 1 and convert each E value of an identified peptide into a n −E database P values (P =1 − e )[61, 62]. These identified pep- τ ¼ ∏p ; ð3Þ tides are then mapped to database proteins that contain them. i¼1 Assume that a given protein contains L identified peptides with where n is the total number of CIPs mappable to taxon T,and P values. Let us group these L peptides, according to the the weight for peptide π is w ¼ 1=C !. The sum of peptide number of database proteins a peptide maps to, into m groups i i π weights m ≡∑ wðÞ E≤E gives the effective number of with 1 ≤ m ≤ L.Within each group k,the n peptide P values raw i c i¼1 k degrees of freedom. This allows one to define a stochastic have equal weight, while peptide P values in different groups variable ~ τ of the same number of degrees of freedom to are weighted differently. G. Alves et al.: Identification of Microorganisms 1727 The weighting enters our formalism through the following plot the species-level peptidome similarity histograms quantities of interest using all three similarity measures (see panel b of Figure 1). The closeness among all three versions of sim- "# "# w w k k n n k k m m ilarity measures for species indicates that the presence of τ≡∏ ∏p and Q≡∏ ∏x ; ð6Þ k; j k; j high peptidome similarity among species is generic, not a k¼1 j¼1 k¼1 j¼1 consequence of asymmetric similarity. However, similar to where each p represents a reported peptide database P value the well-known example of high peptidome similarity be- k; j and each x represents a random variable drawn from a tween E. coli and S. flexneri [58, 59], the exceedingly high k; j uniform, independent distribution over [0, 1]. The quantity of peptidome similarities observed among certain species may interest F(τ) ≡ Prob(Q ≤ τ), representing the protein P value, also be a consequence of incorrect/problematic taxonomy can beobtained[64, 65]. classifications. The question of what to include/exclude from the database m m is definitely important, and the answer probably varies depend- FðÞ τ ¼ ∏ r ∑ ∑ HðÞ −r ln τ; g l k g þ1 l¼1 k¼1 GðÞ k r ing on the research conducted. It has been shown that the 0 1 choice of databases affects microbial identification [55]. As g = n −1 þ g ! j m j j another example, researchers studying the human gut ðÞ −1 @ A ∏  ; ð7Þ n þg j j microbiome may wish to construct a gut-specific database n −1 !g ! j¼1; j≠k r −r j j k [48] by including only human gut microbial species that have been cataloged by the Human Microbiome Project [66]. To where r ≡ 1/w is the number of proteins a group k peptide k k facilitate microbial research of various types, MiCId offers a maps to, ∑ enumerates each set of nonnegative integers GðÞ k simple procedure for constructing a customized database and {g , g , …, g }thatsatisfies the k-dependent constraint 1 2 m using it for microbial identification afterwards. A user only ∑ g ¼ n −1, and the function H is defined as i¼1 i needs to specify a list with the names of species or their NCBI taxonomy identifier, and everything else is handled by MiCId. −x HxðÞ ; n ≡ e ∑ : ð8Þ k! k¼0 E Value Accuracy Evaluation As stated in the BMaterials and Methods^ section, the An example application of Eq. (7) can be found in the statistical method employed to assign statistical signifi- supplementary information of the published study on protein cance to microbes identified requires accurate per- identification [65]. spectrum significances, e.g., E values, at the peptide level. Combining RAId’s accurate per-spectrum statistics [56]to output unified E value (E ) for microbes identified, we Results and Discussion expect E to be accurate and evaluated its accuracy as To investigate the feasibility of microbial identification described below. MS/MS spectra from BMD-C were used based on peptides identified, we examine in silico the to query the reverse [67, 68] of DB-1 while keeping the peptidome similarities among microbes at different taxo- taxonomic assignments untouched. The reverse of DB-1 nomic level in our DB-1. The similarity of taxon X to Y, thus acts as a peptide database from decoy organisms. S ≡∣ X ∩ Y ∣ / ∣ X∣, is defined as the numbered of (Therefore, each decoy organism carries a canonical name, X→ Y shared peptides between the two taxa divided by the num- but its peptide sequences are reverse of those of the true ber of peptides corresponding to taxon X. Panel a of organism of the same name.) The unified E value E of Figure 1 displays histograms of the in silico peptidome each decoy organism can then be calculated using Eqs. (2) similarities computed among the families, genera, and spe- and (5). Obviously, each identified decoy organism is a cies in DB-1. The histograms in Figure 1 show that pep- false positive. The accuracy of assigned statistical signifi- tides are weakly shared among taxa above species level cance (or type I error) can be visualized by a log-log plot having similarity values that are typically much less than [69, 70] of the expected number of errors (decoy organ- 0.6, indicating that microbial identifications at levels isms identified) per query versus the E reported by higher than species should be easier than at the species MiCId. level. Based on these histograms, it appears that unambig- Since we have in total 100 queries (blended DFs) in BMD- uous identification at the species level for certain species C, one would expect 1 decoy microbial cluster head with E ≤ −2 −1 can be challenging. To investigate whether the high 10 , 10 decoy microbial cluster heads with E ≤ 10 , 100 peptidome similarity at the species level is an artifact due microbial cluster heads with E ≤ 10 , and 10,000 microbial to asymmetric similarity measure, we defined in addition cluster heads with E ≤ 10 . The log-log plot of the expected t w o symme tr ize d si mi l a ri ti es : number of errors per query versus E should yield a curve close S (X→ Y)= S (Y→ X)≡∣ X ∩ Y ∣ /min {| X|,| Y|} and to the y = x line. Panels a, b, and c of Figure 2 show that the max max S (X→ Y)= S (Y→ X)≡∣ X ∩ Y ∣ /max {| X|,| Y|} and computed E curves trace very closely the y = x line, indicating min min u 1728 G. Alves et al.: Identification of Microorganisms 10 10 (a) Family (b) Species Genus Species (S ) min Species Species (S ) max 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Figure 1. Histograms of microbial peptidome similarity (ρ). The curves display the peptidome similarity histograms in DB-1. In general, asymmetric peptidome similarity is used: similarity of taxon X to Y, S , is defined by S ≡∣ X ∩ Y ∣ / ∣ X∣, the number X → Y X → Y of shared peptides between the two taxa divided by the number of peptides corresponding to taxon X. Note that both S and X → Y S are included in the histogram. To illustrate that the high frequency of large peptidome similarity among species is generic, we Y → X defined in addition two symmetrized similarities: S (X → Y)= S (Y → X) ≡∣ X ∩ Y ∣ /min {| X|,| Y|} and max max S (X → Y)= S (Y → X)≡∣ X ∩ Y ∣ /max {| X|,| Y| }. These new symmetrized measures are shown only for peptidome similarities min min among species. The closeness among all three versions of similarity measures for species indicates that the presence of high peptidome similarity among species is generic, not a consequence of asymmetric similarity that the computed E is indeed accurate. Two dashed lines, y = Microbial Identification 3x and y = x/3, are also provided as references. As described in Validation of MiCId’s Taxa Identification Using Mixtures of the BStatistical Method for Microbial Identification^ subsec- Known Compositions One important component of our tion, when assessing the accuracy of assigned significance method, as described in the BStatistical Method for Microbial using decoy databases, one should not set E too low and the Identification^ subsection, is the peptide-driven clustering pro- statistical accuracy should remain even with different E .To cedure used to group microbes at different taxonomic levels. test this, we have used three E values, 0.01, 0.1, and 0.5, The parameters needed for the clustering procedure include the corresponding to curves in panels a, b, and c, respectively. resemblance coefficient, the minimum number of unique CIPs These E values, 0.01, 0.1, and 0.5, yield 347, 3183, and required to prevent a taxon (or a cluster of taxa) from being 15,040 CIPs, respectively. further clustered, and the condition for taxa to be selected for Although the statistical formula (7) for protein identifica- the next level identification. These parameter values were tion is better found than the one (5) we used for microbe learned from using BMD-A (as a training dataset) to query identification/classification, we did not switch to the former DB-1; they are selected by maximizing the Btrue positive rate^ in this MiCId update for the following reason. When (TPR) at the species level under a specified E value cutoff. performing microbial identification/classification, one needs Once determined, the parameters are used for all level of taxa to combine hundreds to thousands of peptide database P identifications. When computing the TPR, a cluster of identi- values as opposed to tens of database P values for protein fied microbes contributes only one count; the cluster is viewed identification. For the former case, if one were to use formula as a true positive only if the cluster head is a true positive and as (7), the summation becomes very time-consuming to compute a false positive otherwise. The first half of Table 2 displays, and can considerably slow down the entire workflow. from phylum-level to species-level identifications, the TPR 2 2 2 10 10 10 (a) Family (b) Family (c) Family Genus Genus Genus 1 1 1 10 10 10 Species Species Species 0 0 0 10 10 10 −1 −1 −1 10 10 10 −2 −2 −2 10 10 10 −2 −1 0 1 2 −2 −1 0 1 2 −2 −1 0 1 2 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 E−value E−value E−value Figure 2. Accuracy assessment of the unified E value (E ) for microorganism identifications. Each of the 100 blended DFs in BMD-C was queried in the decoy (made of reversed target peptide sequences) of the target DB-1. Three E s, peptide E value cutoffs, are used: (a) E = 0.01, (b) E =0.1, and (c) E = 0.5. The closer the curve is to the y = x line, the more accurate the reported E c c c u E[Erros Per Query] Frequency E[Erros Per Query] Frequency E[Erros Per Query] G. Alves et al.: Identification of Microorganisms 1729 along with the PFD using the parameters learned (with E value is among the known microbes and a false positive other- cutoff 0.01). More pertinent information at the genus and wise. Although the table headings are explained in the species levels can be found in Supplementary Figure S1. caption, we shall elaborate on few of them here: IF is With the parameters determined, we use BMD-B and the overall identification fraction (proportion of times a mixture sample datasets of known composition to query taxon is identified, be it a cluster head or not, from samples DB-1 in order to test the ability of MiCId in microbial containing it); IF is no larger than IF as it records the 2 1 identification/classification. The resulting TPR and PFD identification fraction of a known microbe that is also the (with E value cutoff 0.01) are displayed in the second half head of the cluster it belongs to; IF cannot be larger than of Table 2. The retrieval curves of taxa and peptides from IF by definition since it reports the identification fraction using BMD-B as queries are plotted in panels a and c of of a known microbe that, in addition to satisfying the Figure 3. Plottedinpanel bofFigure 3 is the PFDs versus conditions for IF , must have taxon-specific (unique) pep- the E values of identified taxa. This panel indicates that tide hits. The fact that sometimes IF is smaller than IF 2 1 using a E value cutoff of 0.01, one can control the PFDs at indicates the room for improvement in our identification the genus and species level at 5 and 2%, respectively. method: it shows occurrences of a true positive microbe Panel d of Figure 3 shows the histogram of peptides not being the cluster head. There are two scenarios that this identified with different E values. BMD-B contains 1400 can happen. First, it is possible that a false positive mi- blended DFs, each of which contains spectra from all 15 crobe having a similar peptidome to the true positive genera/species; hence, the maximum count of identifiable somehow becomes more significant than the true positive genera/species is 21,000. In this assessment, an identified microbe and the new cluster head. Second, we can have cluster is counted as a true positive only if its cluster head 0.14 Genus (b) (a) 0.12 Species 0.10 0.08 0.06 Genus Species 0.04 0.02 0.00 0 −8 −6 −4 −2 0 10 10 10 10 10 0.00 0.02 0.04 0.06 0.08 0.10 E−value PFD 9 8 10 10 (c) 8 (d) 10 3 10 0 0.00 0.02 0.04 0.06 0.08 0.10 −20 −16 −12 −8 −40 PFD log (E−value) Figure 3. Retrieval assessments using the blended MS/MS dataset BMD-B of known microbe compositions to query DB-1 are shown in panels (a) (taxa) and (c) (peptides). (b) The PFDs versus the E values of identified taxa. This panel indicates that using a E value cutoff of 0.01, one can control the PFDs at the genus and species level at 5 and 2%, respectively. (d) The histogram of peptides identified with different E values. BMD-B contains 1400 blended DFs, each of which contains spectra from all 15 genera/species; hence, the maximum count of identifiable genera/species is 21,000. An identified cluster is counted as a true positive only if its cluster head is among the known microbes and a false positive otherwise. The table headings are explained below: SK represents the species key; E[R] is the taxon’s average rank in the identified cluster containing it; IF is the overall identification fraction (proportion of times a taxon is identified, be it a cluster head or not, from samples containing it); IF records the identification fraction of a known microbe that also happens to be the head of the cluster it belongs to; IF reports the identification fraction of a known microbe that not only is the head of the cluster it belongs to but also has unique (taxon-specific) peptide hits; e[NIP] is the average number of identified peptides; e[CS] represents the average cluster size containing the taxon. In the table above, microbial identification was controlled at the 5% PFD. The species keys are 1, Escherichia coli;2, Enterobacter lignolyticus;3, Streptococcus pyogenes;4, Mycobacterium tuberculosis;5, Salmonella enterica;6, Yersinia pestis;7, Shewanella oneidensis;8, Pseudomonas aeruginosa;9, Bacillus subtilis;10, Bordetella pertussis;11, Bartonella henselae;12, Rhodobacter sphaeroides;13, Thermotoga maritima;14, Geobacter bemidjiensis;15, Caulobacter vibrioides Number of Microorganisms Number of Peptides Frequency PFD 1730 G. Alves et al.: Identification of Microorganisms (a) two true positive microbes sharing a larger number of CIPs and are clustered together. One quantity of particular interest is IF -IF . Notethatwith 2 3 IF -IF being zero for all genus-level identifications, we know 2 3 that each microbe present in these DFs has genus-specific peptides identified, indicating separability (or weak correla- tion) among genera. On the other hand, at the species level, we found several cases with IF -IF being nonzero (highlighted 2 3 in orange), indicating that strong similarities exist among cer- tain species. Interestingly, the fact that IF >IF for several 2 3 (b) species reveals that MiCId can correctly identify at times these species without relying on species-specific peptides. Another noteworthy point is that even though DB-1 is a fairly large database, the majority of identified peptides are still quite significant (see panel d). As another test of MiCId’s ability to correctly identify microbes, a series of 85 samples, DFs 39–123, composed of one, two, four, or nine known microbes were used to query DB-1. Panels A and B of Supplementary Figure S2 display the Figure 4. Identification result comparison between MiCId and PFD curves for the genus and the species-level identifications, different workflows by analyzing DF 124 (human stool sample) respectively, from samples containing one microbe and several in two databases DB-3 (a) and DB-4 (b). The workflows com- microbes. Panels C and D show the histograms of peptides pared with MiCId include MetaProteome analyzer (MPA) that identified with different E values, respectively, from samples uses X!Tandem for peptide identification [55], MaxQuant (MQ) containing one microbe and several microbes. The identifica- [55], and proteome discoverer (PD) which uses Sequest-HT and tion performance of MiCId using DFs 39–123 and with the percolator for peptide identification [55]. Plotted in the Venn PFD cutoff at 5% is summarized in the associated table of diagrams are the intersection of nonredundant peptides identi- Supplementary Figure S2. fied at the 1% PFD for the four workflows. For MQ, MPA, and We have also tested if MiCId can identify viruses from PD, genus identifications were done by sending all peptides samples of Calu-3 human lung cancer cells (infected with identified at 1% PFD to Unipept with the filtering strategy rec- ommended by [51] enforced. For MiCId, all peptides identified influenza A, harvested 0, 3, 7, 12, 18, and 24 h post infection with E value ≤ 1 are used for genus identifications, and only with five replicates at each time point). These samples, DFs heads of genus clusters identified with E value ≤ 0.01 are used 137–226, are from cell lysates using multidimensional protein to generate the Venn diagram identification technology (MudPIT) [71] and were not enriched for virus proteins. Using these samples, MiCId was able to correctly identify influenza A as early as 7 h for 2 out of 15 (MQ, MPA, and PD) adapting MEGAN [50]since theauthors samples and could correctly identify influenza A after 12 h for of [51] have shown that Unipept outperforms MEGAN in all of them. Note that with the E value cutoff of 0.01, MiCId terms of pathogen identifications. For MQ, MPA, and PD, does not report any false positive regardless whether true genus identifications were done by sending all peptides identi- positives are reported or not, indicating a robust false positive fied at 1% PFD to Unipept with the filtering strategy recom- control. The results obtained is quite surprising given that the mended by [51] enforced: when the number of unique peptides size of DB-1 (covering 46,838,064 proteins) is so much larger mapped to a taxon is less than 0.5% of the total number of than that of influenza A (11 proteins), see Supplementary unique (taxon-specific) peptides, that taxon is viewed as a false Table S6. positive. For MiCId, all peptides identified with E value ≤ 1are used for genus identifications, and only heads of genus clusters identified with E value ≤ 0.01 are used to generate the Venn Microbial Identification in Complex Mixture of Unknown diagram. Species-level identification comparison was not pos- Composition Using a human stool sample, DF-124, we com- sible because Unipept processes the database differently at this pare the microbial identifications at the genus level of MiCId level [49]. However, more comparisons with the PD workflow with the results from a previous study [55] that includes three adapting Unipept are available in Supplementary Table S7 and workflows: MetaProteome Analyzer (MPA) that uses X!Tan- Supplementary Figure S3. From Supplementary Figure S3,the dem for peptide identification [55], MaxQuant (MQ) [55], and readers will notice that although PD in general identifies more Proteome Discoverer (PD) which uses Sequest-HT and Perco- peptides than MiCId in this complex sample, MiCId neverthe- lator for peptide identification [55]. In Figure 4, two databases, less identifies more genera. Given the rigorous statistical foun- DB-3 and DB-4, are employed for identification comparison, dation of MiCId and the low E value cutoff used, the additional shown as Venn diagrams, among the four workflows. We did genera identified by MiCId are unlikely to be false positives. not compare MiCId with the three aforementioned workflows On the other hand, the large number of peptides identified by G. Alves et al.: Identification of Microorganisms 1731 PD, but not by MiCId, do not yield additional confident genera PFDs) to analyze data from several replicates of the same identifications. This might be because these peptides are pri- sample, how well can the identified taxa overlap given the marily false hits or the aforementioned filtering strategy, due to possible intrinsic variation due to data-dependent acquisition its heuristic nature, sometimes can be too aggressive [47, 51] of typical MS/MS practice? To investigate the size of this and lose true positives. intrinsic variation and its impact on taxa identifications, we Note that in Figure 4, although the peptides identified do not analyze nine MS/MS datasets DF-128 to DF-136, three techni- have strong overlaps, the genera identified seem to overlap cal replicates from human stool samples of three different vol- much more. The discrepancy in peptides identified might be unteers [30]. In Figure 5, we plot in the Venn diagrams the attributed to the difference in terms of how PFDs are estimated. overlaps among triplicates in terms of peptides, genera, species, Leaving this issue aside, however, a more fundamental question and proteins identified by MiCId when searching DB-2. The remains: using a workflow such as MiCId (that yields accurate peptide Venn diagrams, displaying the intersections of AB AB AB 246 590 1443 1526 2909 1747 3254 2173 2999 1512 2115 4429 357 3845 1112 3417 1211 825 1883 3147 1298 C C C AB AB AB 0 0 1 2 2 0 1 2 1 16 15 14 05 11 00 3 2 0 C C C AB AB AB 1 3 3 5 4 1 3 4 3 19 13 13 112 02 03 3 5 1 C C C AB AB AB 299 371 692 455 770 516 1186 776 726 597 1421 942 95 1101 211 894 83 361 694 1426 288 C C C Figure 5. Identification overlaps of peptides, microbes, and proteins among technical replicates. Analyzing nine MS/MS datasets, DFs 128–136, three technical replicates from human stool samples of three different volunteers [30], we plot in the Venn diagrams the overlaps among triplicates in terms of peptides, genera, species, and proteins identified by MiCId when searching DB-2. The peptide Venn diagrams show the intersections of nonredundant peptides identified at the 1% PFD, while other Venn diagrams are constructed using genera, species, and proteins identified with E value ≤ 0.01 1732 G. Alves et al.: Identification of Microorganisms nonredundant peptides identified at the 1% PFD, show a lot database [30]. However, when a more complex sample is used, more variations than the Venn diagrams for genera and species. searching a larger database has the advantage of discovering This indicates that the taxa identified remain largely the same unexpected but perhaps true positive peptides [35, 55]. In other even though there exists nonnegligible intrinsic variations words, even though searching a larger database reduces the among technical replicates in terms of peptides/proteins identi- statistical significances of peptides previously identified in a fied [30]. smaller database, new peptides not contained in the smaller The other problem of interest is to compare the taxa identi- database may be identified with high significance. Hence, the fication performance of various workflows given the same list sensitivity in terms of peptide identification may even increase of input peptides. Evidently, when the input peptides are the when searching a larger database. same, then except for MiCId, the taxa identified by MQ, MPA, To investigate the latter point, we have used data from and PD become identical (as they all are given by Unipept). In complex samples (DFs 128–136) to query DB-1 and DB-2. this context, it becomes the taxon identification comparison The sizes of DB-1 and DB-2 are 46 and 200 GB, respectively. between MiCId and Unipept. To meaningfully compare MiCId The 46,838,064 protein sequences (807,574,956 nonredundant with Unipept (with and without the filtering strategy proposed tryptic peptides) in DB-1 are from 23,911 organisms, belong- in [51]), however, the true positive microbes must be known. ing to 13,072 species, 1890 genera, and 517 families. The We thus used a data set (DF-100 to DF-108) from a mixture of 260,931,852 protein sequences (2,507,889,685 nonredundant four known microbes. The results are summarized in Table 1. tryptic peptides) in DB-2 are from 75,356 organisms, belong- We found that MiCId consistently identify more true positives ing to 26,368 species, 2870 genera, and 701 families. In terms while controlling the false positives at a lower rate when of number of tryptic peptides, DB-2 is about three times the compared to Unipept. One important observation is that the size of DB-1, although in terms of protein sequences it is about heuristic filtering strategy [51] does not always control the 5.6 times the size of DB-1. The peptide retrieval curves for each number of false positives to a low number while our statistical DFs are shown in groups. The first group, containing PFD method does. It is possible that the E value cutoff of 1 in Table 1 curves from DFs 128–130, is displayed in Figure 6. The other may include too many peptides with poor statistical signifi- two groups are shown in Supplementary Figure S4. The con- cance, hence hindering the performance of Unipept. For this sistent trend for each of these DFs (from complex samples) is reason, we have provided in the Supplementary Table S8 the that at a given PFD value, searching the larger database (DB-2) results from controlling the input peptides at 1% PFD using the indeed yields a larger number of identified peptides in compar- MiCId statistics. As one may see in the Supplementary ison with searching the smaller database (DB-1). Table S8, the same trend persists. Protein Identification Sensitivity Does Not Always Decrease as the Database Size Increases Sensitivity drop due to searching a large database MiCId performs protein identifications by extending its taxa has been examined and reported by many groups [30, 32, 35]. identifications in one go. It first queries the MS/MS spectra in The basic fact is that when a simple (single) microbe sample is a peptide-centric database, which is constructed from microbial used, searching a smaller database (containing the proteome of protein sequences and used for microbial identification at vari- that microbe) has a better sensitivity than searching a larger ous taxonomic levels. MiCId then uses the protein sequences Table 1. Genus Identification Comparison Between MiCId and Unipept Genus assignment for data files 100–108 MiCId Unipept DF NPU TP TP FP FP TP TP FP FP f u f u f u f u 100 9229 4 4 0 2 3 4 18 591 101 9865 4 4 0 0 3 4 18 634 102 9666 4 4 0 0 3 4 12 595 103 23,248 4 4 0 3 3 4 11 862 104 22,496 4 4 0 1 3 4 11 858 105 23,105 4 4 0 0 3 4 13 870 106 21,730 4 4 0 0 3 4 12 880 107 24,091 4 4 0 1 3 4 13 892 108 21,093 4 4 0 1 3 4 16 866 DF-100 to DF-108, sample mixtures composed of four bacteria S. pneumoniae, S. aureus, E. coli,and P. aeruginosa, are used to query DB-1. All peptides with E values ≤ 1 are used as input for both MiCId and Unipept. The data file (DF) index is shown in the first column. The number of peptides used (NPU) is displayed in the second column. For Unipept results, the subscript u (or f) of the number of true positive (TP) and the number of false positive (FP) means that the filtering strategy [51] is turned off (or on); for MiCId, the subscript u includes taxa with E value ≤ 1 while f includes only taxa with E value ≤ 0.01 G. Alves et al.: Identification of Microorganisms 1733 DF−128 DF−129 DF−130 30000 40000 40000 (a) DB−1 (b) DB−1 (c) DB−1 35000 35000 DB−2 DB−2 DB−2 30000 30000 25000 25000 15000 20000 20000 15000 15000 10000 10000 5000 5000 0 0 0 0.00 0.05 0.10 0.15 0.20 0.00 0.05 0.10 0.15 0.20 0.00 0.05 0.10 0.15 0.20 PFD PFD PFD Figure 6. Peptide retrieval curves for DFs 128–130 (complex samples) when searching DB-1 (46 GB) and DB-2 (200 GB). At the same PFD value, more peptides are identified when searching the larger database (DB-2) belonging to the confidently identified microbial species to second worst ranking protein and repeats it until one has tested construct a protein database on-the-fly and queries in it the the second best ranking protein. However, there is an excep- MS/MS spectra again for protein identification. This approach tion: a protein will not cluster if it has a unique evidence peptide differs from metaproteomic approaches which identify proteins (a peptide that maps to only one protein) with assigned E value −4 directly from identified peptides without performing microbial less than both 10 and 1/n , with n being the number of s s identifications first [30, 48]. Because each on-the-fly constructed spectra collected per experiment. Specifically, we cluster pro- protein database, no longer covering all microbes, is much teins that share the majority of the peptide evidences when the smaller than the original peptide-centric database, MiCId can information obtained from identified peptides does not allow us consider post-translationally modified peptides and semi- to pin point a specific protein. Because we do not assign enzymatic (semi-tryptic) peptides during protein identification functionality to proteins identified, we do not cluster proteins without adding much computational cost. In a previous study by their homology. In fact, it is possible for two totally dissim- [65], we have shown that the method employed for protein ilar proteins to be clustered together if they share a couple of identification in MiCId assigns accurate statistical significance peptides that happened to be identified with high confidence. to identified proteins and also has a protein retrieval performance that is no worse than that of any other protein identification Execution Time method. Therefore, we only present the protein identification With speed a main consideration, MiCId code was written in results of MiCId from analyzing DFs 128–136, three technical C++ and its routine for database search was written using replicates from human stool samples of three different volunteers parallel programming. This allows users to run jobs with a [30]). In the last row of Figure 5, we plot in the Venn diagrams flexible number of cores. We have measured the execution the overlaps among triplicates in terms of proteins identified by time of MiCId in performing microbe identification when MiCId when searching DB-2. As one may see, there exists querying MS/MS spectra in DB-1 (23,911 organisms and nonnegligible variation among the triplicate results in terms of proteins identified. This may be attributed to the significant variations in the peptides identified, see the Venn diagrams on (a) 15K (DB−1) the first row of Figure 5. On the other hand, from the Venn 15K (DB−2) diagrams on the second row (genus) and the third row (species) 100K (DB−1) of Figure 5, the genus and species identifications appear consis- 2400 tent despite the variations in peptides/proteins identified. 100K (DB−2) When clustering proteins, MiCId does not use protein ho- mology to cluster them. Rather, it uses a peptide-centric clus- tering procedure which is described below: first peptides iden- tified with E values ≤ 1 are mapped to proteins in the database; 1200 second, a standardized weighted count (Z)is assigned to each identified peptide (whose E value is E), ZEðÞ¼ 1= 1 þ , where E is the E value cutoff used to control peptide identifi- cation at a 5% PFD value; third, the sum, W , of the weighted (Z) peptide evidence for every protein p is computed and used 0 4 8 1216 2024 28 32 to sort proteins in order of decreasing W; fourth, one starts with Number of Cores the worst ranking protein p and clusters it to the better ranking Figure 7. MiCId execution time versus number of cores using protein which can explain the highest percentage of W .If that two datasets (one containing 15,000 MS/MS spectra and the percentage is below 95%, p does not cluster into any better other 100,000 MS/MS spectra) to query DB-1 (46 GB) and DB-2 ranking protein; fifth, continue the process in step 4 for the (200 GB) Number of Peptides Number of Peptides Execution Time (s) Number of Peptides 1734 G. Alves et al.: Identification of Microorganisms Table 2. True Positive Rate (TPR) and Proportion of False Discoveries (PFD) at the 0.01 E Value Cutoff Using BMD-A and BMD-B True positive rate and proportion of false discoveries MS/MS data set BMD-A Taxonomical level P C O F G S TPR 100.00% 94.13% 99.99% 99.93% 97.69% 94.10% Ec =0.01 PFD 3.02% 3.47% 2.33% 2.59% 1.49% 1.91% Ec =0.01 MS/MS data set BMD-B Taxonomical level P C O F G S TPR 100.00% 100.00% 100.00% 100% 98.57% 96.90% Ec =0.01 PFD 0.07% 0.00% 0.15% 2.37% 4.98% 1.89% Ec =0.01 The first half of the table displays the TPR along with the PFD (with E value cutoff 0.01), using the parameters learned from BMD-A and apply them to BMD-A, at various taxonomic identification levels: phylum (P), class (C), order (O), family (F), genus (G), and species (S). Displayed in the second half of the table are the microbial identification/classification results using BMD-B to query DB-1 with the parameters learned from using BMD-A 46 GB) and DB-2 (75,356 organisms and 200 GB). Figure 7 Evidently, limitation of our method exists. In fact, inter- shows that for data sets of ≈100,000 MS/MS spectra, microbe pretation of the results using our analysis workflow depends identification using DB-1 can be accomplished in about 15 min on the presence/absence of the correct microorganism in the with 4 cores and reduces to around 6 min with 16 cores. On the database. If one is certain that the correct microorganism is other hand, with ≈100,000 MS/MS spectra, the execution time present in the database, one should interpret the results as for microbe identification using DB-2 ranges from around microbial identification. On the other hand, if one is sure 44 min (with 4 cores) to 14.2 min (with 16 cores). Our results that the correct microorganism is absent from the database, indicate that when the database size increases by a factor of 4.3, one may interpret the results as microbial classification, i.e., the execution time increases only by a factor of 2.4 (using 16 finding the closest relative. cores), suggesting that only a 5.6-fold execution time will be encountered with a 10-fold database size increase. This reflects the scalability of MiCId in handling large databases. Figure 7 Conclusion also shows that the execution time reduction by increasing the number of cores appears to reach a plateau at 16 cores. This is In this study, we have proposed a workflow for microbial because the piece of C++ code performing microbial identifi- classification/identification by processing MS/MS data ac- cation has not yet been parallelized, thus incurring a constant quired in high-resolution mass spectrometers. We have used a time cost. large number of MS/MS data files to assess MiCId’s ability in identifying microbes in mixture samples. Results from our identification assessment show that (at the E value cutoff 0.01 yielding PFD ≤ 5%) MiCId has an average true positive rate of Limitation and Future Direction 0.9813 at the genus level and 0.9550 at the species level. (More We have mentioned that based on Figure 1, it appears that details can be found in Table 2.) One should note, however, unambiguous identification at the species level for certain that these numbers were obtained from blended spectra of species can be challenging. This difficulty worsens when com- single species samples. Generalization to complex microbiota plex samples are concerned and presents the direction for future samples must be taken with a grain of salt. Nevertheless, we efforts. For example, most real environmental samples will not were also able to show, by comparing with published results, be clonal, but might be clades of related organisms. Hence, that MiCId performs comparably to existing methods in mi- even the concept of Bspecies^ can be problematic here. Sepa- crobial identifications. A major advantage of MiCId over rating closely related organisms evolved from a common an- existing methods is that it can assign to microbes and proteins cestor species may not be possible under our current method identified accurate statistical significance, e.g., E values, thus due to lack of proteome information and high degree of prote- providing users a measure suitable for controlling false posi- ome similarities of closely related organisms. Nevertheless, tives (type I errors) and to estimate the PFD via the Sorić identifying the ancestral species can be a realistic long-term equation [72]. In contrast with current metaproteomics goal. methods, MiCId’s protein identification strategy begins with On the other hand, numerous lines of evidence indicate that microbial identifications, allowing users to consider post- individual Bstrains^ might differ in physiology, and thus, de- translationally modified peptides and semi-enzymatic (semi- tection at strain levels might be needed for clinical applications. tryptic) peptides without adding much computational cost. This For this case, however, cultured colonies of bacteria might be feature might be welcomed by researchers interested in prote- possible and progress has been made at the strain-level identi- omics of complex microbial mixtures. Storing only nonredun- fications in simple mixtures [45]. The challenge and future dant peptides from proteins digested in silico, the peptide- direction here pertains to how to minimize the culture time centric database of MiCId has a database size to number of and increase the identification rate. organism ratio that decreases with the increasing number of G. Alves et al.: Identification of Microorganisms 1735 2. Warnock, D.W.: Fungal diseases: an evolving public health challenge. organisms, making database search time increase only Med. Mycol. 44(8), 697–705 (2006) sublinearly with the number of the included organisms. 3. Aluwong, T., Bello, M.: Emerging diseases and implications for millen- MiCId’s workflow is fully automated: users need only specify nium development goals in Africa by 2015—an overview. Vet. Ital. 46(2), 137–145 (2010) a list of microbes to be included in the peptide-centric database 4. Didelot, X., Bowden, R., Wilson, D.J., Peto, T.E., Crook, D.W.: and the parameters to analyze the MS/MS spectra, and every- Transforming clinical microbiology with bacterial genome sequencing. thing is handled internally by MiCId. MiCId can be Nat Rev. Genet. 13(9), 601–612 (2012) 5. Garcia-Prats, J.A., Cooper, T.R., Schneider, V.F., Stager, C.E., Hansen, downloaded freely at http://www.ncbi.nlm.nih.gov/ T.N.: Rapid detection of microorganisms in blood cultures of newborn CBBresearch/Yu/downloads.html. infants utilizing an automated blood culture system. Pediatrics. 105(3 Pt The study presented here focuses mainly on the 1), 523–527 (2000) 6. Cleven, B.E., Palka-Santini, M., Gielen, J., Meembor, S., Kronke, M., classification/identification with accurate statistical signifi- Krut, O.: Identification and characterization of bacterial pathogens caus- cance for microbial mixtures via high-resolution MS/MS spec- ing bloodstream infections by DNA microarray. J. Clin. Microbiol. 44(7), tra. True advances in microbial studies using complex samples, 2389–2397 (2006) however, require multidisciplinary collaboration. Along this 7. Morgenthaler, N.G., Kostrzewa, M.: Rapid identification of pathogens in positive blood culture of patients with sepsis: review and meta-analysis of direction, many innovative approaches have been developed. the performance of the sepsityper kit. Int J Microbiol. 2015, 827,416 (2015) Such examples include but are not limited to faster MS/MS 8. Dubourg, G., Raoult, D.: Emerging methodologies for pathogen identifi- instruments like Orbitrap Fusion Lumos Tribid [73–75]that cation in positive blood culture testing. Expert. Rev. Mol. Diagn. 16(1), 97–111 (2016) can better sample complex mixtures by collecting a greater 9. Frank, U., Malkotsis, D., Mlangeni, D., Daschner, F.D.: Controlled number of MS/MS spectra, improved sample preparation pro- clinical comparison of three commercial blood culture systems. Eur. J. tocols for pathogenic samples [76, 77], and new techniques to Clin. Microbiol. Infect. Dis. 18(4), 248–255 (1999) 10. Puttaswamy, S., Lee, B.D., Sengupta, S.: Novel electrical method for isolate strain-specific peptides from microbes’ membrane pro- early detection of viable bacteria in blood cultures. J. Clin. Microbiol. teins [78, 79]. These advances, together with computational 49(6), 2286–2289 (2011) workflows (such as MiCId) built on firm statistical founda- 11. Kang, D.K., Ali, M.M., Zhang, K., Huang, S.S., Peterson, E., Digman, tions, may potentially further improve microbe identification in M.A., Gratton, E., Zhao, W.: Rapid detection of single bacteria in unpro- cessed blood using integrated comprehensive droplet digital detection. complex samples. Nat. Commun. 5, 5427 (2014) 12. Holmes, B., Willcox, W.R., Lapage, S.P.: Identification of Enterobacte- riaceae by the API 20E system. J. Clin. Pathol. 31(1), 22–30 (1978) Acknowledgements 13. Engvall, E.: Quantitative enzyme immunoassay (ELISA) in microbiolo- gy. Med. Biol. 55(4), 193–200 (1977) We thank the administrative group of the National Institutes of 14. Yolken, R.H.: Enzyme immunoassays for the detection of microbial Health Biowulf Cluster, where all the computational tasks were antigens and prospects for improved assays. Yale J Biol Med. 59(1), carried out. Without the availability of the NIH Biowulf Cluster, 25–31 (1986) 15. Price,L.B.,Liu,C.M.,Melendez, J.H.,Frankel,Y.M., Engelthaler,D.,Aziz, it would be exceedingly difficult to carry out this research due to M.,Bowers,J.,Rattray,R.,Ravel, J.,Kingsley, C.,Keim,P.S., Lazarus, the large sizes and number of data files used. This work was G.S., Zenilman, J.M.: Community analysis of chronic wound bacteria using supported by the Intramural Research Program of the National 16S rRNA gene-based pyrosequencing: impact of diabetes and antibiotics on chronic wound microbiota. PLoS One. 4(7), e6462 (2009) Library of Medicine, the National Heart, Lung, Blood Institute, 16. Stackebrandt, E., Goebel, B.M.: Taxonomic note: a place for dna-dna and the Clinical Center at the National Institutes of Health. reassociation and 16s rrna sequence analysis in the present species definition in bacteriology. Int. J. Syst. Evol. Microbiol. 44(4), 846–849 (1994) 17. Janda, J.M., Abbott, S.L.: 16S rRNA gene sequencing for bacterial Funding Information identification in the diagnostic laboratory: pluses, perils, and pitfalls. J. Clin. Microbiol. 45(9), 2761–2764 (2007) Funding for Open Access publication charges for this article 18. Fox, G.E., Wisotzkey, J.D., Jurtshuk, P., Fox, G.E.: How close is close: was provided by the National Institutes of Health. 16S rRNA sequence identity may not be sufficient to guarantee species identity. Int. J. Syst. Bacteriol. 42(1), 166–170 (1992) 19. Petti, C.A.: Detection and identification of microorganisms by gene amplification and sequencing. Clin. Infect. Dis. 44(8), 1108–1114 (2007) Open Access 20. Altun, O., Almuhayawi, M., Ullberg, M., Ozenci, V.: Rapid identification of microorganisms from sterile body fluids by use of FilmArray. J. Clin. This article is distributed under the terms of the Creative Microbiol. 53(2), 710–712 (2015) Commons Attribution 4.0 International License (http:// 21. Chan, J.Z., Pallen, M.J., Oppenheim, B., Constantinidou, C.: Genome sequenc- ing in clinical microbiology. Nat. Biotechnol. 30(11), 1068–1071 (2012) creativecommons.org/licenses/by/4.0/), which permits unre- 22. Sauer, S., Kliem, M.: Mass spectrometry tools for the classification and stricted use, distribution, and reproduction in any medium, identification of bacteria. Nat. Rev. Microbiol. 8(1), 74–82 (2010) provided you give appropriate credit to the original author(s) 23. Hodkinson, B.P., Grice, E.A.: Next-generation sequencing: a review of and the source, provide a link to the Creative Commons technologies and tools for wound microbiome research. Adv Wound Care (New Rochelle). 4(1), 50–58 (2015) license, and indicate if changes were made. 24. Karas, M., Hillenkamp, F.: Laser desorption ionization of proteins with molecular masses exceeding 10,000 daltons. Anal. Chem. 60(20), 2299– 2301 (1988) 25. Marvin, L.F., Roberts, M.A., Fay, L.B.: Matrix-assisted laser desorption/ References ionization time-of-flight mass spectrometry in clinical chemistry. Clin. Chim. Acta. 337(1–2), 11–21 (2003) 1. Tauxe, R.V.: Emerging foodborne diseases: an evolving public health 26. Stevenson, L.G., Drake, S.K., Murray, P.R.: Rapid identification of challenge. Emerging Infect Dis. 3(4), 425–434 (1997) bacteria in positive blood culture broths by matrix-assisted laser 1736 G. Alves et al.: Identification of Microorganisms desorption ionization-time of flight mass spectrometry. J. Clin. Microbiol. MetaPro-IQ: a universal metaproteomic approach to studying human and 48(2), 444–447 (2010) mouse gut microbiota. Microbiome. 4(1), 31 (2016) 27. Bhatti, M.M., Boonlayangoor, S., Beavis, K.G., Tesic, V.: Evaluation of 49. Mesuere, B., Devreese, B., Debyser, G., Aerts, M., Vandamme, P., FilmArray and Verigene systems for rapid identification of positive blood Dawyndt, P.: Unipept: tryptic peptide-based biodiversity analysis of cultures. J. Clin. Microbiol. 52(9), 3433–3436 (2014) metaproteome samples. J. Proteome Res. 11(12), 5773–5780 (2012) 50. Huson, D.H., Auch, A.F., Qi, J., Schuster, S.C.: MEGAN analysis of 28. Tanner, H., Evans, J.T., Gossain, S., Hussain, A.: Evaluation of three sample preparation methods for the direct identification of bacteria in positive blood metagenomic data. Genome Res. 17(3), 377–386 (2007) cultures by MALDI-TOF. BMC Res Notes. 10(1), 48 (2017) 51. Tanca, A., Palomba, A., Deligios, M., Cubeddu, T., Fraumene, C., Biosa, G., Pagnozzi, D., Addis, M.F., Uzzau, S.: Evaluating the impact of 29. Karlsson, R., Gonzales-Siles, L., Boulund, F., Svensson-Stadler, L., different sequence databases on metaproteome analysis: insights from a Skovbjerg, S., Karlsson, A., Davidson, M., Hulth, S., Kristiansson, E., lab-assembled microbial mixture. PLoS One. 8(12), e82,981 (2013) Moore, E.R.: Proteotyping: proteomic characterization, classification and 52. Stojmirović, A., Yu, Y.K.: Robust and accurate data enrichment statistics identification of microorganisms—a prospectus. Syst. Appl. Microbiol. via distribution function of sum of weights. Bioinformatics. 26(21), 38(4), 246–257 (2015) 2752–2759 (2010) 30. Chatterjee, S., Stupp, G.S., Park, S.K., Ducom, J.C., Yates, J.R., Su, A.I., 53. Alves, G., Wang, G., Ogurtsov, A.Y., Drake, S.K., Gucek, M., Suffredini, Wolan, D.W.: A comprehensive and scalable database search system for A.F., Sacks, D.B., Yu, Y.K.: Identification of microorganisms by high metaproteomics. BMC Genomics. 17(1), 642 (2016) resolution tandem mass spectrometry with accurate statistical signifi- 31. Jagtap, P., Goslinga, J., Kooren, J.A., McGowan, T., Wroblewski, M.S., cance. J. Am. Soc. Mass Spectrom. 27(2), 194–210 (2016) Seymour, S.L., Griffin, T.J.: A two-step database search method improves 54. Byrd, A.L., Perez-Rogers, J.F., Manimaran, S., Castro-Nallar, E., Toma, sensitivity in peptide sequence matches for metaproteomics and I., McCaffrey, T., Siegel, M., Benson, G., Crandall, K.A., Johnson, W.E.: proteogenomics studies. Proteomics. 13(8), 1352–1357 (2013) Clinical PathoScope: rapid alignment and filtration for accurate pathogen 32. Tharakan, R., Edwards, N., Graham, D.R.: Data maximization by identification in clinical samples using unassembled sequencing data. multipass analysis of protein mass spectra. Proteomics. 10(6), 1160– BMC Bioinformatics. 15, 262 (2014) 1171 (2010) 55. Tanca, A., Palomba, A., Fraumene, C., Pagnozzi, D., Manghina, V., 33. Bern, M., Kil, Y.J.: Comment on Bunbiased statistical analysis for multi-stage Deligios, M., Muth, T., Rapp, E., Martens, L., Addis, M.F., Uzzau, S.: proteomic search strategies^.J.ProteomeRes. 10(4), 2123–2127 (2011) The impact of sequence database choice on metaproteomic results in gut 34. Gupta, N., Bandeira, N., Keich, U., Pevzner, P.A.: Target-decoy approach microbiota studies. Microbiome. 4(1), 51 (2016) and false discovery rate: when things may go wrong. J. Am. Soc. Mass 56. Alves, G., Ogurtsov, A.Y., Yu, Y.K.: RAId_DbS: peptide identification Spectrom. 22(7), 1111–1120 (2011) using database searches with realistic statistics. Biol. Direct. 2,25 (2007) 35. Alves, G., Yu, Y.K.: Improving peptide identification sensitivity in shot- 57. Alves, G., Yu, Y.K.: Confidence assignment for mass spectrometry based gun proteomics by stratification of search space. J. Proteome Res. 12(6), peptide identifications via the extreme value distribution. Bioinformatics. 2571–2581 (2013) 32(17), 2642–2649 (2016) 36. Wooley, J.C., Ye, Y.: Metagenomics: facts and artifacts and Computa- 58. Pupo, G.M., Lan, R., Reeves, P.R.: Multiple independent origins of tional Challenges*. J Comput Sci Technol. 25(1), 71–81 (2009) Shigella clones of Escherichia coli and convergent evolution of many of 37. Bazinet, A.L., Cummings, M.P.: A comparative evaluation of sequence their characteristics. Proc. Natl. Acad. Sci. 97(19), 10,567–10,572 (2000) classification programs. BMC Bioinformatics. 13,92 (2012) 59. Jin, Q., Yuan, Z., Xu, J., Wang, Y., Shen, Y., Lu, W., Wang, J., Liu, H., 38. Wang, W.L., Xu, S.Y., Ren, Z.G., Tao, L., Jiang, J.W., Zheng, S.S.: Yang, J., Yang, F., Zhang, X., Zhang, J., Yang, G., Wu, H., Qu, D., Dong, Application of metagenomics in the human gut microbiome. World J. J., Sun, L., Xue, Y., Zhao, A., Gao, Y., Zhu, J., Kan, B., Ding, K., Chen, Gastroenterol. 21(3), 803–814 (2015) S., Cheng, H., Yao, Z., He, B., Chen, R., Ma, D., Qiang, B., Wen, Y., 39. Dworzanski, J.P., Snyder, A.P., Chen, R., Zhang, H., Wishart, D., Li, L.: Hou, Y., Yu, J.: Genome sequence of Shigella flexneri 2a: insights into Identification of bacteria using tandem mass spectrometry combined with pathogenicity through comparison with genomes of Escherichia coli K12 a proteome database and statistical scoring. Anal. Chem. 76(8), 2355– and O157. Nucleic Acids Res. 30(20), 4432–4441 (2002) 2366 (2004) 60. Prakash, O., Verma, M., Sharma, P., Kumar, M., Kumari, K., Singh, A., 40. Dworzanski, J.P., Snyder, A.P.: Classification and identification of bac- Kumari, H., Jit, S., Gupta, S.K., Khanna, M., Lal, R.: Polyphasic ap- teria using mass spectrometry-based proteomics. Expert Rev Proteomics. proach of bacterial classification—an overview of recent advances. Indian 2(6), 863–878 (2005) J. Microbiol. 47(2), 98–108 (2007) 41. Jabbour, R.E., Deshpande, S.V., Wade, M.M., Stanford, M.F., Wick, 61. Yu, Y.K., Gertz, E.M., Agarwala, R., Schaffer, A.A., Altschul, S.F.: C.H., Zulich, A.W., Skowronski, E.W., Snyder, A.P.: Double-blind char- Retrieval accuracy, statistical significance and compositional similarity acterization of non-genome-sequenced bacteria by mass spectrometry- in protein sequence database searches. Nucleic Acids Res. 34(20), 5966– based proteomics. Appl. Environ. Microbiol. 76(11), 3637–3644 (2010) 5973 (2006) 42. Jabbour, R.E., Deshpande, S.V., Stanford, M.F., Wick, C.H., Zulich, 62. Alves, G., Wu, W.W., Wang, G., Shen, R.F., Yu, Y.K.: Enhancing A.W., Snyder, A.P.: A protein processing filter method for bacterial peptide identification confidence by combining search methods. J. Prote- identification by mass spectrometry-based proteomics. J. Proteome Res. ome Res. 7, 3102–3113 (2008) 10(2), 907–912 (2011) 63. Zaykin, D.V., Zhivotovsky, L.A., Westfall, P.H., Weir, B.S.: Truncated 43. Tracz, D.M., McCorrister, S.J., Chong, P.M., Lee, D.M., Corbett, C.R., product method for combining P-values. Genet. Epidemiol. 22(2), 170– Westmacott, G.R.: A simple shotgun proteomics method for rapid bacte- 185 (2002) rial identification. J. Microbiol. Methods. 94(1), 54–57 (2013) 64. Alves, G., Yu, Y.K.: Combining independent, weighted P-values: achiev- 44. Penzlin, A., Lindner, M.S., Doellinger, J., Dabrowski, P.W., Nitsche, A., ing computational stability by a systematic expansion with controllable Renard, B.Y.: Pipasic: similarity and expression correction for strain-level accuracy. PLoS One. 6(8), e22,647 (2011) identification and quantification in metaproteomics. Bioinformatics. 65. Alves, G., Yu, Y.K.: Mass spectrometry-based protein identification with 30(12), i149–i156 (2014) accurate statistical significance assignment. Bioinformatics. 31(5), 699– 45. Boulund, F., Karlsson, R., Gonzales-Siles, L., Johnning, A., Karami, N., 706 (2015) Al-Bayati, O., Ahren, C., Moore, E.R.B., Kristiansson, E.: TCUP: typing 66. Peterson, J., Garges, S., Giovanni, M., McInnes, P., Wang, L., Schloss, and characterization of bacteria using bottom-up tandem mass spectrom- J.A., Bonazzi, V., McEwen, J.E., Wetterstrand, K.A., Deal, C., Baker, etry proteomics. Mol. Cell. Proteomics (2017) C.C., Di Francesco, V., Howcroft, T.K., Karp, R.W., Lunsford, R.D., 46. Hettich, R.L., Pan, C., Chourey, K., Giannone, R.J.: Metaproteomics: Wellington, C.R., Belachew, T., Wright, M., Giblin, C., David, H., Mills, harnessing the power of high performance mass spectrometry to identify M.,Salomon,R.,Mullins,C.,Akolkar,B.,Begg,L.,Davis,C., the suite of proteins that control metabolic activities in microbial com- Grandison, L., Humble, M., Khalsa, J., Little, A.R., Peavy, H., Pontzer, munities. Anal. Chem. 85(9), 4203–4214 (2013) C., Portnoy, M., Sayre, M.H., Starke-Reed, P., Zakhari, S., Read, J., 47. Jagtap, P.D., Blakely, A., Murray, K., Stewart, S., Kooren, J., Johnson, Watson, B., Guyer, M.: The NIH human microbiome project. Genome J.E., Rhodus, N.L., Rudney, J., Griffin, T.J.: Metaproteomic analysis Res. 19(12), 2317–2323 (2009) using the Galaxy framework. Proteomics. 15(20), 3553–3565 (2015) 67. Elias, J.E., Gygi, S.P.: Target-decoy search strategy for increased confi- 48. Zhang, X., Ning, Z., Mayne, J., Moore, J.I., Li, J., Butcher, J., Deeke, dence in large-scale protein identifications by mass spectrometry. Nat. S.A., Chen, R., Chiang, C.K., Wen, M., Mack, D., Stintzi, A., Figeys, D.: Methods. 4(3), 207–214 (2007) G. Alves et al.: Identification of Microorganisms 1737 68. Wang, G., Wu, W.W., Zhang, Z., Masilamani, S., Shen, R.F.: Decoy methods using an Orbitrap Fusion Lumos Tribrid mass spectrometer. methods for assessing false positives and false discovery rates in shotgun Proteomics 17(9) (2017) proteomics. Anal. Chem. 81,146–159 (2009) 76. Mottaz-Brewer, H.M., Norbeck, A.D., Adkins, J.N., Manes, N.P., 69. Michael, J.R.: The stabilized probability plot. Biometrika. 70(1), 11–17 Ansong, C., Shi, L., Rikihisa, Y., Kikuchi, T., Wong, S.W., Estep, (1983) R.D., Heffron, F., Pasa-Tolic, L., Smith, R.D.: Optimization of proteomic 70. Alves, G., Ogurtsov, A.Y., Wu, W.W., Wang, G., Shen, R.F., Yu, Y.K.: sample preparation procedures for comprehensive protein characteriza- Calibrating E-values for MS2 database search methods. Biol. Direct. 2,26 tion of pathogenic systems. J. Biomol. Tech. 19(5), 285–295 (2008) (2007) 77. Tanca, A., Palomba, A., Pisanu, S., Deligios, M., Fraumene, C., 71. Wolters, D.A., Washburn, M.P., Yates, J.R.: An automated multidimen- Manghina, V., Pagnozzi, D., Addis, M.F., Uzzau, S.: A straightforward sional protein identification technology for shotgun proteomics. Anal. and efficient analytical pipeline for metaproteome characterization. Chem. 73(23), 5683–5690 (2001) Microbiome. 2(1), 49 (2014) 72. Sorić, B.: Statistical Bdiscoveries^ and effect-size estimation. J. Am. Stat. 78. Karlsson, R., Davidson, M., Svensson-Stadler, L., Karlsson, A., Olesen, Assoc. 84(406), 608–610 (1989) K., Carlsohn, E., Moore, E.R.: Strain-level typing and identification of 73. Makarov, A.: Electrostatic axially harmonic orbital trapping:? A high- bacteria using mass spectrometry-based proteomics. J. Proteome Res. performance technique of mass analysis. Anal. Chem. 72(6), 1156–1162 11(5), 2710–2720 (2012) (2000) 79. Gonzales-Siles, L., Karlsson, R., Kenny, D., Karlsson, A., Sjoling, A.: 74. Eliuk, S., Makarov, A.: Evolution of Orbitrap mass spectrometry instru- Proteomic analysis of enterotoxigenic Escherichia coli (ETEC) in neutral mentation. Annu Rev Anal Chem (Palo Alto, Calif). 8,61–80 (2015) and alkaline conditions. BMC Microbiol. 17(1), 11 (2017) 75. Espadas, G., Borras, E., Chiva, C., Sabido, E.: Evaluation of different peptide fragmentation types and mass analyzers in data-dependent

Journal

Journal of The American Society for Mass SpectrometrySpringer Journals

Published: Jun 5, 2018

References

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off