TY - JOUR
AU - Wang, Jianxin
AB - Abstract
Proteins are dominant executors of living processes. Compared to genetic variations, changes in the molecular structure and state of a protein (i.e. proteoforms) are more directly related to pathological changes in diseases. Characterizing proteoforms involves identifying and locating primary structure alterations (PSAs) in proteoforms, which is of practical importance for the advancement of the medical profession. With the development of mass spectrometry (MS) technology, the characterization of proteoforms based on top-down MS technology has become possible. This type of method is relatively new and faces many challenges. Since the proteoform identification is the most important process in characterizing proteoforms, we comprehensively review the existing proteoform identification methods in this study. Before identifying proteoforms, the spectra need to be preprocessed, and protein sequence databases can be filtered to speed up the identification. Therefore, we also summarize some popular deconvolution algorithms, various filtering algorithms for improving the proteoform identification performance and various scoring methods for localizing proteoforms. Moreover, commonly used methods were evaluated and compared in this review. We believe our review could help researchers better understand the current state of the development in this field and design new efficient algorithms for the proteoform characterization.
Keywords: top-down mass spectrometry, deconvolution, proteoform identification, proteoform localization, posttranslational modification
Introduction
Proteins are dominant executors of living processes, and various molecular forms of protein products are directly related to pathological disease changes, like heart failure [1] and age-dependent memory impairment [2].
The term ‘proteoform’ [3], common in the top-down (TD) proteomics literature, denotes any of the molecular forms that the protein product of a gene can take as a result of primary structure alterations (PSAs). There are five types of PSAs on proteins: (1) sequence variation, (2) fixed posttranslational modification (PTM), (3) variable PTM, (4) terminal truncation and (5) unknown mass shift. Characterizing proteoforms helps identify essential diagnostic markers and therapeutic targets, which can improve disease diagnosis and treatment [4–8]. Therefore, the characterization of proteoforms is of practical significance for the advancement of medicine. Recently, studies on proteoforms have attracted broad attention in bioinformatics. Proteoform characterization is viewed as a complex computational problem. A single protein carrying multiple PSAs can give rise to a vast number of proteoforms, which leads to a combinatorial explosion. For example, if all possible PTM combinations of the human histone H3.1 protein are enumerated, the number of proteoforms would, in theory, reach up to 40 trillion [3, 9]. Nevertheless, some experiments indicate that the number of proteoforms occurring in real biological samples is much smaller than the number of theoretical combinations [4], and that the types of proteoforms are fewer too. Garcia’s group proposed a mass spectrometry (MS) method for analyzing histone proteoforms, and their experimental results show that the histone proteoforms found by their method are far fewer than the theoretical combinations [10]. Similarly, Coon’s group found only 74 proteoforms of the histone H4 protein [11].
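The combinatorial growth described above can be made concrete with a small calculation. The sketch below assumes that each modifiable site independently takes one of a fixed number of states, with the unmodified state counted as one of them. The function name and the site counts are hypothetical and are not the real histone H3.1 figures.

```python
from math import prod

def count_proteoforms(states_per_site):
    """Number of distinct modification patterns when site i can take
    states_per_site[i] states (the unmodified state counts as one)."""
    return prod(states_per_site)

# Hypothetical protein: 10 sites that are either unmodified or carry one
# of two PTMs (3 states each), plus 5 sites with a single optional PTM.
hypothetical_sites = [3] * 10 + [2] * 5
print(count_proteoforms(hypothetical_sites))  # 3**10 * 2**5 = 1889568
```

Even this small hypothetical case yields almost two million patterns, which illustrates why exhaustive enumeration quickly becomes impractical.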
Because gold-standard data sets of proteoforms in real samples are lacking, there are two possible explanations for the gap between theory and practice: (1) existing methods based on MS techniques can only identify some proteoforms with high abundance, and many low-abundance proteoforms cannot be detected; (2) the proteoforms present in biological samples only cover a small part of theoretical proteoforms. In the future, it is necessary to improve the technology to reveal the real causes [5]. There are three types of MS methods for proteoform analysis: bottom-up (BU) proteomics, middle-down (MD) proteomics and TD proteomics. A decade ago, TD proteomics for analyzing proteoforms faced many challenges in protein sample preparation [12, 13], liquid chromatography (LC) separation, data analysis and so on. Therefore, BU shotgun proteomics was widely employed in protein identification studies, as it identified more proteins in protein mixtures than TD proteomics [14, 15]. However, different gene products, isoforms and proteoforms can contain the same peptide, and BU techniques do not cover complete protein sequences. It is therefore challenging to infer the complete information of proteoforms through BU techniques. Besides, BU techniques are suited to identifying PTMs on individual peptides rather than complex PTMs. MD proteomics [16, 17] identifies longer peptides by digesting protein samples with different biological enzymes. However, long peptides are still distinct from intact proteoforms. At present, the rapid development of protein separation technologies and high-throughput technologies allows researchers to obtain high-precision tandem MS data of an intact protein from biological samples at a low cost, which, in turn, furthers the development of TD proteomics.
TD-based methods identify proteoforms by isolating intact proteins from biological samples with protein separation techniques and then analyzing the intact proteins by LC-tandem MS (LC-MS/MS). Under ideal conditions, TD proteomic techniques can obtain the complete amino acid sequence of a protein together with its PTMs without pre-enzymatic digestion of the sample, which provides a priori knowledge for the identification of proteoforms. Compared to BU shotgun proteomic techniques, TD proteomic techniques have the unique advantage of identifying proteoforms from intact proteins rather than from short peptides [18]. In comparison with separation and MS techniques, data analysis is the most significant obstacle currently hindering the development of TD proteomics [19]. Data analysis in TD refers to developing an algorithm or a software system that processes the data to characterize proteoforms, and it consists of three steps: preparation of TDMS data, deconvolution and characterization of proteoforms. The pipeline for characterizing proteoforms based on TD proteomic techniques is shown in Figure 1.
Figure 1. A pipeline for characterizing proteoforms based on TD proteomic techniques.
In the past decade, researchers have proposed algorithms and software systems based on TDMS techniques, but few reviews have comprehensively evaluated the newly developed methods from the perspective of bioinformatics. To help more researchers quickly understand the state of the art in the field and the limitations of existing methods, this review presents a comprehensive survey of computational methods for proteoform characterization based on TDMS techniques.
Because proteoform identification is the most important part of the protein characterization process, we classify existing solutions for proteoform identification into two groups. Based on TDMS, solutions in the first group can be further divided into two categories: (1) expanding the proteoform database and (2) blind PSA search methods, including spectral alignment-based algorithms and precursor ion-independent algorithm (PIITA) [20]. Solutions in the second group are to list types of PSAs identified by each method. Through the above classifications, researchers in this field can have a deeper understanding of existing technologies for identifying various proteoforms. Before identifying proteoforms, spectra need to be preprocessed, and the protein sequence database should be filtered to speed up the identification. Therefore, this study is a review of the whole process of proteoform characterization based on TDMS including a summary of the deconvolution algorithms, various filtering algorithms for improving the proteoform identification and various scoring methods for locating proteoforms. In addition, a comprehensive evaluation and performance comparison of four proteoform identification methods of the latest years are given. Finally, the future development of proteoform characterization on TDMS is discussed. In order to better understand the whole process of proteoform characterization, we describe some important terms shown in Figure 2. Chromatograms (peaks) are shown in Figure 2A. As shown in Figure 2B, the isotopomer envelopes of protein fragments can be measured by mass spectrometers, which consist of spectral peak sets of fragment ions of the same chemical formula and charge state in MS/MS. Figure 2B also illustrates the overlapping envelopes that means the possible overlaps on the m/z axis between two or more different isotopomer envelopes. 
The theoretical isotopic distribution is a list of theoretical peaks obtained from a model that takes the mass of the peaks and the charge states as inputs. As shown in Figure 2C, the theoretical isotopic distribution can be used for detecting the candidate isotopomer envelopes. Figure 2D shows the monoisotopic peak, which consists of the monoisotopic mass and intensity. The monoisotopic mass is the sum of the atomic masses of the primary isotope of each atom in the molecule and is typically expressed in Daltons (Da).
Figure 2. The description of some important terms. (A) The chromatograms (peaks), (B) The isotopomer envelopes of protein fragments, (C) The theoretical isotopic distribution, (D) The monoisotopic peak.
Deconvolution algorithms
In pretreating MS data of intact proteins, high-resolution masses of protein precursor ions and fragment ions are indispensable for identifying proteoforms based on TDMS. Therefore, it is necessary to convert the raw mass spectral data of proteins into a standard list of peaks (e.g. single charge, monoisotopic mass), which facilitates the subsequent processing. Proteins are macromolecules whose monoisotopic mass may exceed 3000 Da. The accuracy of the deconvolution results affects the downstream analysis in proteoform characterization. The main task of deconvolution in this field is to extract a list of monoisotopic peaks from each TD tandem mass spectrum (MS/MS).
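The two definitions above, the monoisotopic mass and the conversion of a multiply charged peak to a standard (neutral or singly charged) form, can be illustrated with a minimal sketch. It assumes positive-ion mode with protonated ions; the function names are illustrative and do not come from any of the tools reviewed here.

```python
PROTON_MASS = 1.007276466  # Da

# Monoisotopic masses (Da) of the most abundant isotope of each element
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "S": 31.97207100,
}

def monoisotopic_mass(formula):
    """Sum of the monoisotopic atomic masses, e.g. {"C": 2, "H": 5, ...}."""
    return sum(MONOISOTOPIC[element] * count for element, count in formula.items())

def neutral_mass(mz, charge):
    """Neutral mass of a protonated ion observed at m/z with the given charge."""
    return charge * (mz - PROTON_MASS)

def singly_charged_mz(mz, charge):
    """m/z the same species would show if it carried a single proton."""
    return neutral_mass(mz, charge) + PROTON_MASS

glycine = {"C": 2, "H": 5, "N": 1, "O": 2}
print(round(monoisotopic_mass(glycine), 5))  # 75.03203
print(neutral_mass(1001.007276466, 10))      # close to 10000 Da
```

A deconvolved peak list would store one such neutral (or singly charged) monoisotopic mass per detected envelope, together with its intensity.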
In fact, in the MS data preprocessing stage, it is difficult to design a perfect deconvolution algorithm for accurately extracting monoisotopic peaks, because the raw files used as input contain a large number of noise peaks and overlapping peaks, and most monoisotopic peaks have low intensity. To date, several deconvolution strategies have been proposed; they work well for simple spectra but perform poorly on complex spectra because of high charge states and overlapping envelopes. Deconvolution methods are generally divided into three categories: peak assignment algorithms, isotopic algorithms and simulation-based algorithms. In bioinformatics, isotopic algorithms are the ones generally used for deconvolution, so peak assignment algorithms [21–27] and simulation-based algorithms [28–32] are not described in detail. Table 1 lists popular isotopic algorithms and their corresponding URLs.
Table 1. A list of commonly used software tools for TD spectral deconvolution
Software | Website | Algorithm
THRASH [28] | https://omictools.com/thrash-tool | Chi-squared fitting, least-squares fitting
Xtract [48] | http://proteinaceous.net/product/prosightpc-4-0/ | Chi-squared fitting, least-squares fitting
DeconMSn [36] | https://omics.pnl.gov/software/deconmsn | SVM
YADA [31] | http://pcarvalho.com/patternlab/downloads/windows/yada/ | Optimal Bayesian classifier, expectation–maximization (EM) algorithm
DeconTools (Decon2LS) [45, 46] | https://omics.pnl.gov/software/decontools-decon2ls | Chi-squared fitting
MS-Deconv [32] | http://bix.ucsd.edu/projects/msdeconv/ | Dynamic programming algorithm
MASH Suite [33] | http://ge.crb.wisc.edu/software.html | Chi-squared fitting, least-squares fitting
MS-Deconv+/ TopFD [42] | http://proteomics.informatics.iupui.edu/software/toppic/ | Dynamic programming algorithm
UniDec [53] | http://unidec.chem.ox.ac.uk | Bayesian framework
CRAWler [51] | - | Chi-squared fitting, least-squares fitting
pParseTD (pParse) [38, 39] | http://pfind.net/software/pTop/index.html | SVM
ProMex [40] | https://omics.pnl.gov/software/mspathfinder | Bayesian network, Pearson correlation, Wilcoxon rank-sum test, and hypergeometric test
IMTBX [43] | https://github.com/chhh/IMTBX | -
IPPD [34] | https://omictools.com/ippd-tool | Gaussians and exponentially modified Gaussians
As shown in Figure 3, the analysis process of deconvolution algorithms can usually be divided into four steps: (1) generating candidate isotopomer envelopes, (2) generating theoretical isotopomer envelopes, (3) evaluating the match between candidate and theoretical isotopomer envelopes with a scoring function, and (4) extracting the monoisotopic mass from the reported envelope.
Figure 3. The analysis pipeline of deconvolution algorithms.
Generating candidate isotopomer envelopes
In the first step, most deconvolution algorithms extract candidate isotopomer envelopes from experimental raw spectra. Due to the complexity of TDMS spectra of intact proteins, such as noise peaks, higher charge states, overlapping envelopes, and more fragments, it is difficult to automatically generate candidate isotopomer envelopes with high accuracy and speed. Some strategies have been adopted to address those challenges. THRASH [33], a classical automatic deconvolution algorithm for MS data of biomacromolecules, was developed by Horn’s group based on ICR-2LS in 2000. This algorithm has been the core algorithm for preprocessing TDMS data. To reduce noise peaks and identify isotopic signal peaks, deconvolution algorithms address the noise level estimation and baseline correction problems. THRASH proposed a formula for calculating the signal-to-noise ratio (S/N) for noise filtering using a 4 Da window, which mainly takes the intensity of the noise baseline and the width of the noise peak into account. To speed up noise determination, MasSPIKE [34] and AID-MS [35] identify noise in entire spectra.
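Window-based noise filtering of the kind described above can be sketched as follows. This is not THRASH's S/N formula, which the text does not give. As a stand-in, the sketch assumes the local noise level is the median intensity of all peaks inside a window centred on each peak, and the default window width simply mirrors the 4 Da window mentioned above. The function name and thresholds are hypothetical.

```python
from statistics import median

def filter_by_local_sn(peaks, window=4.0, min_sn=3.0):
    """Keep (mz, intensity) peaks whose intensity is at least min_sn times
    the median intensity of the peaks within +/- window/2 of their m/z."""
    kept = []
    for mz, intensity in peaks:
        local = [i for m, i in peaks if abs(m - mz) <= window / 2]
        noise = median(local)  # crude local noise estimate (assumption)
        if noise > 0 and intensity / noise >= min_sn:
            kept.append((mz, intensity))
    return kept

spectrum = [(500.0, 10.0), (500.5, 12.0), (501.0, 100.0), (501.5, 11.0), (502.0, 9.0)]
print(filter_by_local_sn(spectrum))  # only the peak at 501.0 passes
```

The quadratic scan is kept for readability; a sliding window over m/z-sorted peaks would be the natural choice for real spectra.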
YADA [36] uses an algorithm of peak filtering to eliminate noise peaks whose intensity is lower than the threshold calculated by an optimal Bayesian classifier with the expectation–maximization (EM) algorithm. MS-Deconv [37] selects candidate base peaks through a method similar to THRASH, which estimates the noise intensity level from centroided spectra and removes peaks below the noise intensity threshold. In order to detect the low-intensity peaks that are often viewed as noise peaks, MASH Suite [38] adopts a match function for users to modify the threshold of S/N at a certain window. IPPD [39] is an R package for deconvolution, which provides some parameters for users to remove noise peaks. For the identification of isotopomer envelopes, THRASH applies the moving window method and the Fourier/Patterson method [40] to determine the initial charge state and monoisotopic peak information of MS in a 1 or 2 Da window. With the wide applications of THRASH, some subsequent researchers have revealed some limitations in both the determination of overlapping clusters and the processing speed. To identify the overlapping clusters, AID-MS identifies firstly the peak with the highest intensity, generates a 1.2 or 2.7 Da window around this maximum peak, and selects isotopomer envelopes with several developed techniques: (1) a Lorentzian-based peak subtraction algorithm, on which the overlapping clusters in the high peak density region can be resolved robustly by subtracting the identified isotopic clusters (isotopomer envelopes), and (2) the peak selection method, which accelerates the determination of charge states and the identification of the isotopomer envelopes. MasSPIKE utilizes the matched filter method for determining the charge state, and, then, the overlapped isotopic distributions are separated, and each of them is assigned to the state of charges. 
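The charge-state determination discussed above rests on the fact that adjacent isotopic peaks of an ion with charge z are separated by roughly 1.00335/z on the m/z axis. The sketch below uses that spacing directly; it is a deliberately simple stand-in and is not the Fourier/Patterson method of THRASH or the matched filter of MasSPIKE. The function name is hypothetical.

```python
from statistics import median

ISOTOPE_SPACING = 1.00335  # Da, approximate 13C - 12C mass difference

def estimate_charge(envelope_mz):
    """Estimate the charge state from the m/z values of one isotopomer envelope."""
    mzs = sorted(envelope_mz)
    if len(mzs) < 2:
        raise ValueError("need at least two isotopic peaks")
    gaps = [b - a for a, b in zip(mzs, mzs[1:])]
    return max(1, round(ISOTOPE_SPACING / median(gaps)))

# Five isotopic peaks of a hypothetical 5+ ion starting at m/z 1000
envelope = [1000.0 + k * ISOTOPE_SPACING / 5 for k in range(5)]
print(estimate_charge(envelope))  # 5
```

With overlapping envelopes or low-resolution data the gaps become unreliable, which is the situation the subtraction, SVM and graph-based strategies in this section are designed to handle.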
Due to little access to the peak profile information, low-resolution data do not perform well when determining the state of charges. To solve this problem, DeconMSn [41] implements an SVM-based charge detection algorithm, which identifies the most probable charge state of the parent species with the peak characteristics of its fragment mode, and also provides an advantage in shortening the searching time. YADA adopts a peak-finding algorithm to determine the state of charges and detects isotopomer envelopes, which does not adopt the strategy of subtraction but can identify overlapping isotopomer envelopes. THRASH defines the charge state of each set of peaks, while MS-Deconv considers all possible charge states. To handle the isotopic cluster overlapping or peaks sharing problem, MS-Deconv represents the envelopes using a graph-theoretical approach and then selects the heaviest path in the graph through dynamic programming algorithms [42]. The path of a graph is divided into the overlapping region strategy of the envelope, in which the scoring function determines the weight. To identify low-abundance peaks in certain m/z (mass-to-charge ratio) regions, MASH Suite supports modifying S/N threshold parameters. pParseTD [43] is an important preprocessing module proposed in pTop, also a key update version of pParse2.0 [44], which introduces multiple features of precursors including the number of peaks in the isotopic cluster, the length of matched isotopic clusters, etc. into an SVM model for recalling precursors more accurately and identifying overlapping envelopes more effectively. IPPD applies the Gaussians and exponentially modified Gaussians to model peak shapes, which is capable of resolving the overlapping peaks. ProMex [45] is a tool in the open-source software suite Informed-Proteomics (https://github.com/PNNL-Comp-Mass-Spec/Informed-Proteomics). 
The method produces a list of possible masses from the input mass range and tolerance; the experimental isotopomer envelopes are then identified by scanning the MS and grouped according to charge state and elution time. A greedy algorithm is also adopted to aggregate isotopomer envelopes.
Generating theoretical isotopomer envelopes
Once candidate isotopomer envelopes have been obtained, in the second step, the deconvolution algorithms generate the corresponding theoretical isotopic distributions, which are then converted to theoretical isotopomer envelopes. The theoretical isotopic distribution is a list of theoretical peaks obtained from a model that takes the mass of the spectral peaks and the charge states as inputs. Theoretical isotopic distributions are scaled based on peak intensities of experimental isotopomer envelopes. For generating the theoretical isotopic distribution, the ‘averagine’ model [46] is commonly adopted. THRASH, MasSPIKE, AID-MS, MASH Suite, MS-Deconv+ [47], ProMex, IMTBX [48], and IPPD apply the ‘averagine’ model to generate the theoretical isotopic distribution, while MS-Deconv and pParseTD employ the ‘emass’ model [49]. Some deconvolution algorithms, like YADA, are mainly devised to resolve MS data of peptide ions, and the ‘averagine’ model can be adopted to improve the monoisotopic peak detection of macromolecules. Most algorithms scale the theoretical isotopic distributions to obtain the corresponding theoretical isotopomer envelopes, which are matched against experimental isotopomer envelopes in the next step.
Evaluating candidate and theoretical isotopomer envelopes matching by a scoring function
In the third step, candidate and theoretical isotopomer envelopes are evaluated by a scoring function. Envelopes whose scores are above the threshold specified by the scoring function are selected.
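Steps two and three can be sketched together: build a coarse theoretical isotopic distribution from the averagine composition, scale it to the experimental envelope, and score the match. The sketch works at nominal-mass resolution (one peak per extra neutron) with rounded natural abundances, scales to the experimental base peak, and uses cosine similarity as a placeholder score. These are simplifying assumptions, not the procedure of any specific tool above, and all function names are illustrative.

```python
from math import sqrt

# Averagine: average elemental composition of one amino acid residue
AVERAGINE = {"C": 4.9384, "H": 7.7583, "N": 1.3577, "O": 1.4773, "S": 0.0417}
AVERAGINE_MASS = 111.1254  # Da, average mass of one averagine unit

# Approximate natural abundances indexed by number of extra neutrons
ABUNDANCE = {
    "C": [0.9893, 0.0107],
    "H": [0.999885, 0.000115],
    "N": [0.99636, 0.00364],
    "O": [0.99757, 0.00038, 0.00205],
    "S": [0.9499, 0.0075, 0.0425, 0.0, 0.0001],
}

def convolve(a, b, max_len):
    """Polynomial product of two distributions, truncated to max_len terms."""
    out = [0.0] * min(len(a) + len(b) - 1, max_len)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if i + j < len(out):
                out[i + j] += x * y
    return out

def self_convolve(dist, n, max_len):
    """Distribution of n identical atoms, by binary exponentiation."""
    result, base = [1.0], dist
    while n > 0:
        if n & 1:
            result = convolve(result, base, max_len)
        base = convolve(base, base, max_len)
        n >>= 1
    return result

def averagine_distribution(mass, max_peaks=30):
    """Relative abundance of the +0, +1, +2, ... isotopic peaks for a mass."""
    units = mass / AVERAGINE_MASS
    dist = [1.0]
    for element, per_unit in AVERAGINE.items():
        n_atoms = round(per_unit * units)
        dist = convolve(dist, self_convolve(ABUNDANCE[element], n_atoms, max_peaks), max_peaks)
    return dist

def scale_to_base_peak(theoretical, experimental_base_intensity):
    """Scale so the tallest theoretical peak matches the experimental base peak."""
    top = max(theoretical)
    return [p * experimental_base_intensity / top for p in theoretical]

def cosine_score(theoretical, experimental):
    """Placeholder similarity score in [0, 1] for non-negative intensities."""
    dot = sum(t * e for t, e in zip(theoretical, experimental))
    norm = sqrt(sum(t * t for t in theoretical)) * sqrt(sum(e * e for e in experimental))
    return dot / norm if norm else 0.0
```

For a 10 kDa species the most abundant peak of `averagine_distribution(10000.0)` lies several neutrons above the monoisotopic peak, which is why the monoisotopic peak of an intact protein is often weak or missing and has to be inferred from the fitted envelope.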
Most deconvolution algorithms select the envelope set with a simple greedy strategy, which iteratively selects the highest-scoring envelope and then removes it from the spectrum. THRASH uses the least-squares method to fit theoretical isotopic peaks to experimental isotopic peaks and defines a reliability value (RL) as the criterion for real isotopic clusters. Also, when a candidate isotopomer envelope is found in a certain window, THRASH applies ‘subtractive peak finding’ to subtract candidate isotopomer envelopes and iteratively find further possible isotopic clusters. MasSPIKE employs a statistical method that aligns each experimental isotopic distribution with its theoretical one through maximum likelihood estimation to obtain an optimal alignment index. AID-MS calculates correlation coefficients and matching errors as criteria for real isotopic clusters. Decon2LS [50] is a modified version of THRASH with a different scoring strategy for evaluating candidate and theoretical isotopomer envelopes, and it provides a user interface for setting parameters to obtain the monoisotopic mass with high accuracy. Decon2LS is capable of labeling overlapping isotopic patterns. DeconTools [51] is the new version of Decon2LS; it adds the RAPID method [52] to improve on Decon2LS and a ‘ResultValidator’ module to validate monoisotopic peak results. MS-Deconv proposes a scoring strategy for selecting isotopomer envelopes. To filter candidate isotopomer envelopes, each envelope is assigned to a 1 Da window based on the position of its base peak, and up to five envelopes are selected by score in each window. Unlike other methods, the algorithm scores the isotopomer envelope set rather than a single isotopomer envelope. MS-Deconv can identify more peaks, and it runs 33 times and 4 times faster than THRASH and Xtract [53], respectively.
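The difference between greedy selection of single envelopes and scoring a whole envelope set can be illustrated on a toy problem. The sketch below is a strong simplification: envelopes are reduced to scored m/z intervals, overlapping envelopes are simply forbidden, and the best set is found by the classic weighted interval scheduling dynamic program. MS-Deconv's actual envelope graph is more involved and allows peak sharing, so this only conveys the idea of optimizing over sets.

```python
from bisect import bisect_left

def greedy_select(envelopes):
    """Repeatedly take the best-scoring (start, end, score) envelope that
    does not overlap an envelope already chosen."""
    chosen = []
    for env in sorted(envelopes, key=lambda e: e[2], reverse=True):
        if all(env[1] < c[0] or env[0] > c[1] for c in chosen):
            chosen.append(env)
    return sorted(chosen)

def best_set_select(envelopes):
    """Non-overlapping envelope set with maximal total score (dynamic programming)."""
    envs = sorted(envelopes, key=lambda e: e[1])
    ends = [e[1] for e in envs]
    best = [0.0] * (len(envs) + 1)   # best[i]: optimum over the first i envelopes
    take = [False] * len(envs)
    prev = [0] * len(envs)
    for i, (start, _end, score) in enumerate(envs):
        prev[i] = bisect_left(ends, start)  # envelopes ending before this one starts
        with_i = best[prev[i]] + score
        if with_i > best[i]:
            best[i + 1], take[i] = with_i, True
        else:
            best[i + 1] = best[i]
    chosen, i = [], len(envs)
    while i > 0:                      # backtrack through the decisions
        if take[i - 1]:
            chosen.append(envs[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    return sorted(chosen)

a = (1000.0, 1001.0, 5.0)   # one wide envelope with the highest single score
b = (1000.0, 1000.4, 3.0)   # two narrower envelopes that together score more
c = (1000.6, 1001.0, 3.0)
print(greedy_select([a, b, c]))    # [(1000.0, 1001.0, 5.0)]
print(best_set_select([a, b, c]))  # [(1000.0, 1000.4, 3.0), (1000.6, 1001.0, 3.0)]
```

In this toy case the greedy strategy commits to the single best envelope and ends with a total score of 5, whereas optimizing over the set reaches 6.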
A limitation of MS-Deconv is that it does not support the output of co-eluting precursors. MASH Suite adopts the same ‘averagine’ model as THRASH to generate theoretical isotopic distributions. MASH Suite also provides a Dalton shift function, which shifts the generated theoretical isotopic distribution by 1 Da in either direction to better match experimental isotopomer envelopes. MS-Deconv+ proposes a scoring function (L-score) for calculating the similarity score between a pair of experimental and theoretical isotopomer envelopes. L-score can replace the original evaluation function in MS-Deconv. L-score represents each peak p in the isotopomer envelope by a pair (x, y), where x and y are the m/z value and intensity of the peak, respectively. L-score extracts five features of the envelope, namely, m/z value, peak intensity distribution, supporting envelopes, neutral loss envelopes, and missing peak numbers. These five features are then linearly combined to calculate the L-score. MS-Deconv+ can report monoisotopic mass more correctly than MS-Deconv. Considering that low-intensity peaks or overlapping peaks tend to affect the results of the similarity-based comparison between experimental and theoretical isotopomer envelopes, researchers have developed pParseTD by introducing multiple features into an SVM model for identifying the candidate envelopes. ProMex adopts the greedy algorithm to gather peaks into candidate isotopomer envelopes with adjacent charges and elution times. ProMex identifies isotopomer envelopes by using the Pearson correlation similarity scores of experimental and theoretical envelopes and calculates the statistical significance by the Wilcoxon rank-sum test and hypergeometric test. Then some LC-MS-related features are considered, including the isotopomer envelope shape, intensity, charge distribution, and elution profile under different charge states.
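Two of the similarity measures named in this section can be written in a few lines: a least-squares scale factor of the kind used when fitting a theoretical envelope to an experimental one, and the Pearson correlation between the two envelopes. This is a generic sketch with hypothetical intensities; it does not reproduce the exact scoring of THRASH or ProMex, whose additional features and significance tests are not modelled here.

```python
from math import sqrt

def least_squares_scale(theoretical, experimental):
    """Scale factor s minimizing sum((experimental - s * theoretical) ** 2)."""
    denom = sum(t * t for t in theoretical)
    return sum(t * e for t, e in zip(theoretical, experimental)) / denom

def pearson(x, y):
    """Pearson correlation coefficient of two equally long intensity lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

theoretical = [0.2, 0.5, 0.3]           # hypothetical relative abundances
good_match = [400.0, 1000.0, 600.0]     # same shape, 2000 times larger
poor_match = [1000.0, 400.0, 600.0]     # different shape

print(least_squares_scale(theoretical, good_match))
print(pearson(theoretical, good_match), pearson(theoretical, poor_match))
```

The well-matched envelope gives a correlation close to 1 and the mismatched one a negative value, so a threshold on such a score separates plausible envelopes from chance alignments.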
To infer the probability of aggregated isotopomer envelopes, ProMex evaluates the importance of features with a likelihood score function and models the features through a Bayesian network.
Extracting the monoisotopic mass from the reported envelopes
In the final step, a corresponding list of monoisotopic masses is extracted based on each reported isotopomer envelope. THRASH has been implemented in many software systems, like ProSightPC [54], MASH Suite Pro [55], and so on. It has also been extended in Decon2LS, DeconMSn, Xtract, MASH Suite, and CRAWler [56]. Xtract (Thermo Fisher Scientific, San Jose, CA) is commercial software for decharging based on THRASH. The custom version of CRAWler adopts a modified version of THRASH to determine the quality of the precursors and fragments and to create the PUF files needed for ProSightPC. BUPID-Top-Down [57], implemented by Boston University, is a web-based tool for deconvolution of ECD or ETD data sets. UniDec [58] is a general Bayesian deconvolution software system that is fast, flexible, and robust for mass spectral deconvolution. This method is a particular case of the Richardson–Lucy mathematical deconvolution algorithm. It is worth noting that it does not require much user intervention or any prior knowledge. Some frequently used deconvolution algorithms are available as commercial products or embedded in larger software packages from major vendors. The Xtract and ReSpect [59] deconvolution algorithms are embedded in the BioPharma Finder™ software released by Thermo Fisher Scientific. DataAnalysis™, released by Bruker Inc., can automatically locate monoisotopic peaks using the SNAP algorithm [60]. A maximum entropy deconvolution algorithm for intact proteins is embedded in the MassHunter BioConfirm software released by Agilent Inc. [61]. Shimadzu Inc. released the i-PDeA algorithm for the separation of co-eluted peaks [62]. Intact Mass™, developed by Protein Metrics Inc., can be used to analyze mass spectra of intact proteins and offers an algorithmic approach for deconvolution [63]. Polymerix, implemented by Sierra Analytics Inc., can be used for the deconvolution of homopolymers [64]. ProMass, released by Novatia, LLC, is embedded in many other platforms and can determine the uncharged mass of a biomolecule by using an automated biomolecule deconvolution algorithm [65]. RoWinPro uses ProMass for the deconvolution [66].
The characterization of proteoforms based on TDMS
The proteoform characterization based on TDMS techniques requires both the identification and the localization of proteoforms. In the proteoform identification stage, one challenge is that mass spectra and protein sequences are difficult to match precisely, and another is that the matching process is quite time-consuming. As intact proteins contain longer amino acid sequences and may carry a mixture of the five kinds of PSAs, the number of proteoforms is theoretically large enough to cause a combinatorial explosion. In the proteoform localization stage, it is difficult to locate the correct amino acid sites even when the amino acid and modification types have been identified, since the complete protein sequence is long and may contain the same amino acid species at different sites. This section mainly summarizes the popular proteoform identification methods based on TDMS. In addition, several filtering algorithms for improving proteoform identification based on TDMS and various scoring methods for locating PTMs in proteoforms are discussed.
Identification of proteoforms based on TDMS
Proteoform identification algorithms can be classified by different criteria. This study describes 18 methods for proteoform identification based on TDMS. The addresses for downloading these methods are shown in Table 2.
Table 2 Software tools for proteoform identification using TDMS

Methods | Software or algorithm features / Advantages / Limitations | Website
ProSight [49, 62–65] | ✓ Web user interface ✓ Cross platform ✓ Sequence visualization ✓ Variable PTMs ✓ Statistical significance • Unknown PSAs • Performance depends on proteoform database | http://proteinaceous.net/product/prosightpc-4-0/ https://prosightptm.northwestern.edu/ https://bio.tools/prosight_ptm https://ms-utils.org/ https://www.topdownproteomics.org/resources/software/
MascotTD [66] | Mascot Server ✓ Web user interface ✓ Cross platform ✓ API support ✓ Sequence visualization ✓ Similar proteoforms ✓ Internal fragments • Unknown PSAs • Performance depends on proteoform database | http://www.matrixscience.com/
BUPID-Top-Down [52] | ✓ Windows GUI application ✓ Internal fragments ✓ Side chain loss ✓ Neutral loss ✓ Variable PTMs ✓ Unknown PSAs • Performance depends on proteoform database | http://www.bumc.bu.edu/cardiovascularproteomics/cpctools/bupid-top-down/ https://omictools.com/bupid-top-down-tool
ProteinGoggle [67] | ✓ Windows GUI application ✓ Overlapping isotopic envelopes resolving ✓ Mass spectral data interpretation ✓ Variable PTMs • Performance depends on proteoform database | https://proteingoggle.tongji.edu.cn https://www.topdownproteomics.org/resources/software/
MetaMorpheus [68] | ✓ Windows GUI application ✓ Open source ✓ Search speed ✓ Unknown PSAs • Performance depends on proteoform database | http://github.com/smith-chem-wisc/MetaMorpheus/ https://bio.tools/MetaMorpheus https://omictools.com/metamorpheus-tool
TDPortal [69] | ✓ Web user interface ✓ Workflow for high-throughput TD proteomics analysis ✓ Data sharing ✓ Sequence visualization ✓ Terminal truncations ✓ Unknown PSAs • Performance depends on proteoform database | http://nrtdp.northwestern.edu/resource-software/ https://omictools.com/tdportal-tool
MS-TopDown [76] | C++ application Open source ✓ Variable PTMs ✓ Unknown PSAs • Search time • No statistical significance | http://proteomics.bioprojects.org
PIITA [15] | - ✓ Search speed ✓ Unknown PSAs • Requiring b- or y-ion fragments | -
MS-Align+ [77] | ✓ Cross platform ✓ Command line ✓ Open source ✓ Statistical significance ✓ Terminal truncations ✓ Unknown PSAs • >2 unknown mass shifts • False-positive proteoform spectrum matches | http://bix.ucsd.edu/projects/msalign/ https://omictools.com/ms-align-tool https://www.topdownproteomics.org/resources/software/
MS-Align-E [78] | ✓ Cross platform ✓ Command line ✓ Ultra-modified proteins ✓ Variable PTMs ✓ Unknown PSAs • Similar mass shifts • Terminal truncations | http://proteomics.informatics.iupui.edu/software/msaligne/
MASH Suite Pro [50] | ✓ Interface customization ✓ Windows GUI application ✓ Statistical significance ✓ Sequence visualization ✓ Terminal truncations ✓ Unknown PSAs • Same as MS-Align+ | http://crb.wisc.edu/yinglab/software.html https://omictools.com/mash-suite-pro-tool
pTop [38] | ✓ Windows GUI application ✓ Reducing searching space ✓ Search speed ✓ Validation of PrSM ✓ Visualization with pBuild ✓ Variable PTMs ✓ Unknown PSAs • Terminal truncations (the pTop2.0 support) | http://pfind.ict.ac.cn/software/pTop/index https://www.topdownproteomics.org/resources/software/
TopPIC [79] | • Open source • Windows GUI application • Cross platform • Statistical significance • Integrating with filtering algorithm • Terminal truncations • Unknown PSAs • Multiple modifications • Variable PTMs | http://proteomics.informatics.iupui.edu/software/toppic/ https://bio.tools/toppic https://omictools.com/toppic-tool https://ms-utils.org/ https://www.topdownproteomics.org/resources/software/
TopMG [86] | • Open source • Windows GUI application • Cross platform • Variable PTMs • Terminal truncations • Unknown PSAs • Search time | http://proteomics.informatics.iupui.edu/software/topmg/ https://omictools.com/topmg-tool.html
MSPathFinder [89] | • Command line • Open source • Search speed • Statistical significance • Variable PTMs • Unknown PSAs | https://omics.pnl.gov/software/mspathfinder https://omictools.com/lcmsspectator-tool https://www.topdownproteomics.org/resources/software/
HomMTM [94] | • C++ application • Supporting multiplexed tandem MS • Variable PTMs • Unknown PSAs | -
SpectRUM [84] | • MATLAB GUI application • De novo sequencing • Variable PTMs • Terminal truncations • Unknown PSAs • Spectrum calculation • Double-sided truncations | https://github.com/BIRL/SPECTRUM/ https://bio.tools/SPECTRUM
Twister [87] | • Command line • Cross platform • De novo sequencing • Variable PTMs • Terminal truncations • Multiple modifications • Protein mixtures | http://bioinf.spbau.ru/en/twister https://omictools.com/twister-tool

The table lists the software features, advantages, limitations, and web sources, taken either from the original link or from registries and websites of bioinformatics tools (https://www.bio.tools, https://omictools.com/, https://ms-utils.org, www.topdownproteomics.org). The URLs from the original link are made italic.
Figure 4 Classification based on proteoform identification algorithm.
The current proteoform identification methods based on TDMS can be classified into two categories: (1) extended proteoform database methods and (2) blind PSA search methods. The latter can be further divided into three subcategories: spectral alignment-based algorithms, graph model-based algorithms, and precursor ion-independent top-down algorithms (PIITAs). This classification is shown in Figure 4.

The extended proteoform database methods

The extended proteoform database methods essentially enumerate all possible proteoforms automatically according to the annotation information in the database. Because a proteoform database already contains the possible proteoform sequences, these methods match a mass spectrum directly against the proteoform sequences in the database, which allows proteoforms to be identified quickly. Extended proteoform database methods include ProSight [54, 67–70], MascotTD [71], BUPID-Top-Down, and ProteinGoggle [72], among others.

The ProSight team developed a series of proteoform characterization tools based on proteoform databases, such as ProSightPTM, ProSightPC, ProSightPD, and ProSight Lite. Drawing on an offline ‘shotgun annotation’ database strategy, the team built a database of proteoforms containing sequence mutations, alternative splicing, and PTMs. ProSightPTM identifies and characterizes intact proteins by searching this proteoform database. ProSightPTM was extended into a commercial software system, ProSightPC, which allows users to search tandem MS data against a PTM database containing the known biological complexity in UniProt. The matching process is scored through a Poisson distribution probability model as follows: $$\begin{equation} P_{f,n}=\frac{(xf)^{n}\times e^{-xf}}{n!} \end{equation}$$ (1) where |$P_{f,n}$| is the probability of a random protein match, f is the number of input fragment masses, n is the number of random fragment ion hits, and x is the average probability that a fragment ion mass matches a random protein.
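Equation (1) can be evaluated directly. The following is a minimal Python sketch, assuming that Equation (1) is read as a Poisson distribution with mean xf; the function name `random_match_probability` and the example values of f, n, and x are hypothetical and are not taken from ProSight.

```python
import math


def random_match_probability(f: int, n: int, x: float) -> float:
    """Poisson probability of n random fragment hits, as in Equation (1).

    f: number of input fragment masses
    n: number of fragment ion hits
    x: average probability that one fragment mass matches a random protein
    """
    lam = x * f  # expected number of random hits
    if lam == 0.0:
        return 1.0 if n == 0 else 0.0
    # Work in log space so that large n does not overflow n! or lam**n.
    log_p = n * math.log(lam) - lam - math.lgamma(n + 1)
    return math.exp(log_p)


# Illustrative values only: 50 fragment masses, x = 0.02 (mean of 1 random hit).
p_few = random_match_probability(f=50, n=2, x=0.02)
p_many = random_match_probability(f=50, n=10, x=0.02)
print(f"P(2 random hits) = {p_few:.3e}, P(10 random hits) = {p_many:.3e}")
```

The more fragment ions match, the smaller the probability that the match arose by chance, which is why a low |$P_{f,n}$| indicates a reliable identification.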
After the actual number of matching ions is substituted into Equation (1), the probability score |$P_{f,n}$| of a random match between a protein and a spectrum can be calculated. If the mass spectrum matches the corresponding proteoform sequence in the proteoform database during the search, the reliability of the identification is considered very high.

MascotTD is an extended version of Mascot with the same graphical user interface and options as Mascot. Whereas Mascot only searches precursor ions with masses of up to 16 kDa, MascotTD can be used for database searches of precursor ions with larger masses. The proteoform database in this method excludes the most unusual proteoform sequences to keep the database size manageable, so that intact proteins can be identified. A single search engine can thus handle both BU and TD MS/MS data. The scoring method of MascotTD is also probability-based. Its advantages are sensitivity and specificity: it can not only identify proteins in all mammalian subsequences but also distinguish similar proteoforms produced by the same protein. However, compared to ProSight, MascotTD reports low estimated probabilities and high mismatching probabilities. Because MascotTD was initially calibrated empirically on BU data, it requires further adjustment for TD data.

BUPID-Top-Down is an algorithm derived from BUPID to assign product ions in TD MS/MS spectra. BUPID can handle both fixed and variable PTMs. BUPID-Top-Down can be applied to analyze spectra acquired with various fragmentation methods and to identify internal fragments, side chain losses, neutral losses, and PTMs. For unknown proteins, BUPID-Top-Down searches the database for the best-matching protein sequences and speeds up the calculation through a heuristic model.
For biomacromolecules, the mass spectrometer may produce low signals and overlapping peaks, which makes isotopic envelope identification and deisotoping more difficult and the inference of monoisotopic masses more error-prone. To address this issue, ProteinGoggle uses an isotopic m/z and envelope fingerprinting (iMEF) algorithm that identifies proteins directly from raw mass spectral data (the isotopic envelopes of precursors and product ions), without deconvolution. The iMEF algorithm combines the isotopic m/z fingerprinting (iMF) algorithm and the isotopic envelope fingerprinting (iEF) algorithm. The iMF algorithm retrieves precursor or product ion candidates from a pre-constructed database that contains the isotopic envelope information (precursors and product ions) of all proteoforms of the system under study. The iEF algorithm then identifies the matching precursors or product ions. Finally, user-specified thresholds on the total number of matched product ions and on PTM scores are applied for proteoform identification. When the database is customized, ProteinGoggle’s proteoform database includes user-selected PTMs and amino acid variations.

MetaMorpheus [73] is an open-source proteomics database search software system that combines the functionality of Morpheus and enhanced Global PTM Discovery (G-PTM-D). G-PTM-D is a technique for the identification and localization of PTMs on peptides and proteoforms. To address the limitations of G-PTM-D, the enhanced G-PTM-D adds a spectral calibration procedure, which improves the search speed. The enhanced G-PTM-D also expands the list of modifications when the database is expanded, which improves the ability to identify proteoforms.
If, for a proteoform spectrum match (PrSM), the difference between the theoretical mass and the experimental mass equals the mass of a known PTM, the PTM is annotated in the proteoform database, and the annotated database is then searched for proteoform identification.

TDPortal [74] is a TD search tool developed by the National Resource for Translational and Developmental Proteomics (NRTDP) for the measurement of intact proteins. Whereas ProSight is used to analyze only one or a few proteoforms at a time, TDPortal allows for high-throughput top-down proteomics analysis. TDPortal also improves the search time.

The limitations of extended proteoform database methods are as follows: (1) the proteoform database scales poorly and takes up a lot of storage space, since the huge number of proteoforms may lead to a combinatorial explosion; (2) for more complex species, the proteoform database keeps growing, which increases the search time; and (3) a number of mass spectra still cannot be matched to any proteoform in the database. Like all methods that identify proteoforms through a proteoform database, their identification performance is limited by the finite size of the database. Currently, extended proteoform database methods are widely used in BUMS to identify proteoforms [75, 76] and to analyze alternative splicing [77, 78] in proteogenomic research, but they are inefficient in TDMS.

The blind PSA search methods

To overcome the memory limitations of the extended proteoform database methods, several blind PSA search methods have been proposed. A blind PSA search does not need to generate all candidate proteoforms in advance, as the ‘shotgun’ strategy of ProSight does. The main idea of blind PSA search methods is to compare experimental spectra with theoretical sequences.
The blind PSA search methods can not only save memory but also improve the performance of proteoform identification. Although they perform better than the extended proteoform database methods in these respects, their drawback is that they are time-consuming. Because of the high computational complexity of the alignment, many methods use dynamic programming to increase the computational speed. The blind PSA search methods can be further divided into PIITAs, spectral alignment-based algorithms, and graph model-based algorithms.

Precursor ion-independent algorithms

Because liquid-phase fractionation yields only a small number of MS/MS spectra, PIITA uses gas-phase fractionation (GPF) to obtain more MS/MS spectra. This method is a precursor ion-independent TD algorithm. PIITA compares the deconvoluted and deisotoped MS/MS spectrum with all possible theoretical MS/MS spectra of each protein in the database, regardless of the measured precursor ion mass. After the protein is identified, the difference ΔM between the measured and theoretical precursor masses is calculated. The precursor mass of the MS/MS spectrum represents the molecular mass of the proteoform, which can be calculated by subtracting the proton mass (1.007276 Da) from the monoisotopic m/z and multiplying the result by the number of charges. First, PIITA preprocesses all FASTA raw files through the InSilicoSpectro library [79] and creates a new file that contains only the fragment ion masses, which reduces the search time. Then, the fragment ion matching (FIM) score and ΔSc of each MS/MS spectrum are calculated, with a fragment ion mass tolerance of ±15. ΔSc is similar to ΔC of SEQUEST [80] and is defined as: $$\begin{equation} \Delta \mathrm{Sc}=\frac{\mathrm{FIM}_{1\mathrm{st}}-\mathrm{FIM}_{2\mathrm{nd}}}{\mathrm{FIT}} \end{equation}$$ (2) where |$\mathrm{FIM}_{1\mathrm{st}}$| is the highest fragment ion matching score among all proteins, |$\mathrm{FIM}_{2\mathrm{nd}}$| is the second highest, and FIT is the total number of fragment ions.
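The two calculations described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not PIITA's code: the precursor mass uses the standard conversion from m/z to neutral mass, M = (m/z − 1.007276) × z, which is assumed to be what the text intends, and the function names and example numbers are hypothetical.

```python
from typing import Sequence

PROTON_MASS = 1.007276  # Da, the value quoted in the text


def neutral_precursor_mass(mz: float, charge: int) -> float:
    """Neutral monoisotopic mass from a monoisotopic m/z and a charge state."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (mz - PROTON_MASS) * charge


def delta_sc(fim_scores: Sequence[float], total_fragment_ions: int) -> float:
    """Equation (2): gap between the two best FIM scores, normalized by FIT."""
    if len(fim_scores) < 2 or total_fragment_ions <= 0:
        raise ValueError("need at least two FIM scores and a positive FIT")
    ranked = sorted(fim_scores, reverse=True)
    return (ranked[0] - ranked[1]) / total_fragment_ions


# Hypothetical example: FIM scores of three candidate proteins for one spectrum.
mass = neutral_precursor_mass(mz=1001.007276, charge=10)
score_gap = delta_sc([12, 30, 21], total_fragment_ions=60)
print(f"precursor mass = {mass:.3f} Da, delta Sc = {score_gap:.3f}")
```

A large ΔSc means that the best-scoring protein is well separated from the runner-up, in the same spirit as the SEQUEST score difference that the text refers to.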
The limitations of PIITA include the following: (1) when a protein mainly exhibits either b- or y-ion fragments, PIITA cannot identify modifications that appear at the same terminus, and (2) because identification requires at least one b- or y-ion fragment, a protein without such fragments cannot be identified directly.

Spectral alignment-based algorithms

To improve the identification of proteoforms, researchers have proposed a number of spectral alignment-based algorithms. The identification performance of extended proteoform database methods increases with the size of the proteoform search library, so improving their performance requires a lot of memory. Spectral alignment-based algorithms address this problem: they consider all possible PTMs or mutations, only need a protein database for searching, and can score multiple proteoforms simultaneously. Compared with PIITAs, some spectral alignment-based algorithms use a probability-based scoring function, which makes the results more meaningful in practice. The current methods for identifying proteoforms through spectral alignment are as follows.

MS-TopDown [81] applies spectral alignment in TDMS to identify proteoforms with alterations such as mutations, insertions, deletions, and PTMs. MS-TopDown creates a grid from the n peaks of spectrum S and the m prefix masses of protein sequence P and finds the maximum-score alignment path in this grid through a dynamic programming algorithm. An n × m × T array D is filled recursively, and once array D is complete, the optimal alignment score is given by |$D_{nm}(T)$|. T represents the total number of mass shifts, which equals the number of specific mass shifts (Ts) plus the number of general mass shifts (Tg). In constructing the dynamic programming array for the spectral alignment, various combinations of Tg and Ts are used to calculate the match score for the path.
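The grid-based dynamic programming described above can be illustrated with a simplified Python sketch. It is not the MS-TopDown implementation: it assumes integer (nominal) masses that match exactly, a single type of mass shift with an overall limit T, and a score equal to the number of matched grid points. The function name `spectral_alignment`, the running prefix maximum `best_prefix`, and the toy masses are hypothetical.

```python
from typing import Sequence

NEG = float("-inf")


def spectral_alignment(spectrum: Sequence[int], prefix_masses: Sequence[int],
                       max_shifts: int) -> int:
    """Best number of matched points on a path from (a_0, b_0) to (a_n, b_m)
    that uses at most max_shifts mass shifts; 0 if no such path exists.
    Both mass lists are expected to start with mass 0."""
    a = sorted(set(spectrum))
    b = sorted(set(prefix_masses))
    n, m = len(a), len(b)
    prev_prefix = None  # prefix maxima of the table with one shift fewer
    final = NEG
    for t in range(max_shifts + 1):
        last_on_diag = {}  # mass offset -> score of latest point on that diagonal
        best_prefix = [[NEG] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if i == 0 and j == 0:
                    score = 1  # every path starts at (a_0, b_0)
                else:
                    offset = a[i] - b[j]
                    # Continue along the same diagonal: no new mass shift.
                    score = last_on_diag.get(offset, NEG) + 1
                    # Jump from any earlier point: costs one mass shift.
                    if t > 0 and i > 0 and j > 0:
                        score = max(score, prev_prefix[i - 1][j - 1] + 1)
                if score > NEG:
                    last_on_diag[a[i] - b[j]] = score
                up = best_prefix[i - 1][j] if i > 0 else NEG
                left = best_prefix[i][j - 1] if j > 0 else NEG
                best_prefix[i][j] = max(score, up, left)
                if i == n - 1 and j == m - 1:
                    final = score
        prev_prefix = best_prefix
    return int(final) if final > 0 else 0


# Toy protein with nominal residue masses 57, 71, 87, 97, 99 and a proteoform
# carrying an 80 Da shift on the third residue.
protein_prefixes = [0, 57, 128, 215, 312, 411]
observed_prefixes = [0, 57, 128, 295, 392, 491]
print(spectral_alignment(observed_prefixes, protein_prefixes, max_shifts=0))
print(spectral_alignment(observed_prefixes, protein_prefixes, max_shifts=1))
```

Without any allowed shift, the end points of the two mass lists cannot be connected, whereas a single shift lets the path change diagonal once and match all six points. Real spectra additionally require mass tolerances and peak scoring, which this sketch leaves out.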
To account for general and specific types of mass shifts, a dimension is added to the dynamic programming array for spectral alignment, and the n × m × Tg × Ts arrays D and M are constructed as follows: $$\begin{equation} D_{ij}(g,s)=\max\left\{\begin{array}{l} D_{\operatorname{diag}(i,j)}(g,s)+1 \\ M_{i-1,j-1}(g-1,s) \\ D_{\operatorname{diag}_{\delta}(i,j)}(g,s),\quad \delta \in \Delta_{\mathrm{PTMs}} \end{array}\right. \end{equation}$$ (3) $$\begin{equation} M_{ij}(g,s)=\max\left\{\begin{array}{l} D_{ij}(g,s) \\ M_{i-1,j}(g,s) \\ M_{i,j-1}(g,s) \end{array}\right. \end{equation}$$ (4) where |$D_{ij}(g,s)$| is the score of the highest-scoring alignment path from |$(a_0, b_0)$| to |$(a_i, b_j)$| with at most |$g+s$| mass shifts, g is the number of general mass shifts, s is the number of specific mass shifts, and |$\operatorname{diag}(i,j)$| is defined as the largest co-diagonal pair of (i, j). By Equation (4), |$M_{ij}(g,s)$| is the maximum of |$D_{i'j'}(g,s)$| over all |$i'\le i$| and |$j'\le j$|.

The advantage of MS-TopDown is its ability to identify proteoforms with numerous modifications efficiently through spectral alignment. In TDMS, multiple isobaric proteoforms are often fragmented in the same spectrum, and MS-TopDown has also been extended in an attempt to find positional isomers in such mixture spectra. In addition, several practical details of TDMS, such as mass measurement errors and weak fragmentation patterns, are addressed. The algorithm limits the number of unknown mass shifts, and most of the mass shifts it considers belong to a particular set of known modifications (e.g. oxidation, methylation, etc.). Since the running time of MS-TopDown scales linearly with the number of PTMs, searching for large proteins becomes very slow.
Moreover, MS-TopDown uses a simple shared peak count scoring scheme and does not provide the statistical significance of PrSMs, so it is difficult to select reliable PrSMs.

To improve the speed of proteoform identification, MS-Align+ [82] builds on MS-TopDown and defines four types of spectral alignment: complete, prefix, suffix, and internal spectral alignment. The weights of the diagonal lines at −45° crossing the spectral grid are calculated through the spectral convolution method and are used to filter out spectral alignments with low matching weights. The method develops a filtering strategy based on these weights. When the weight of the crossing lines is low, the diagonal scores of the proteins for a given spectrum are sorted, and only the top 20 high-scoring proteins are chosen for further spectral alignment analysis. When the weight of the crossing lines is high, the top 20 crossing lines are selected to find the optimal alignment path of the protein–spectrum pair in the spectral grid. This method filters out a large number of proteins, which significantly reduces the searching space. Unlike MS-TopDown, MS-Align+ also provides the statistical significance of identified proteoforms and proposes a method for calculating E-values with a full spectrum rather than finite parameters. The shortcomings of MS-Align+ are that the algorithm identifies at most two unknown mass shifts and that it may generate some false-positive proteoform–spectrum matches.

To overcome the limit of two unknown mass shifts in MS-Align+, Liu et al. proposed a fast heuristic algorithm, MS-Align-E [83], based on MS-Align+. This algorithm can identify PTMs in ultra-modified proteins such as histones. An ultra-modified protein refers to a protein with multiple PTMs.
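The diagonal weighting used for filtering in MS-Align+ can be illustrated with a short spectral convolution sketch. It counts, for every mass offset, how many (protein prefix mass, spectrum mass) pairs fall on that diagonal and then keeps the heaviest diagonals. The function names, the binning width, and the toy masses are assumptions of this sketch and are not taken from MS-Align+.

```python
from collections import Counter


def diagonal_weights(prot_prefix, spec_masses, bin_width=0.01):
    """Spectral convolution: count grid points on each diagonal (mass offset)."""
    counts = Counter()
    for a in prot_prefix:
        for b in spec_masses:
            # Pairs with the same offset b - a lie on the same diagonal.
            counts[round((b - a) / bin_width)] += 1
    return counts


def top_diagonals(prot_prefix, spec_masses, k=20, bin_width=0.01):
    """Return the k heaviest diagonals as (offset, weight) pairs."""
    counts = diagonal_weights(prot_prefix, spec_masses, bin_width)
    return [(key * bin_width, weight) for key, weight in counts.most_common(k)]


protein_prefix = [0, 10, 25, 45, 70]
spectrum_masses = [0, 10, 30, 50, 75]
print(top_diagonals(protein_prefix, spectrum_masses, k=3))
```

In this toy case the heaviest diagonal sits at an offset of about 5, which corresponds to the mass shift carried by the last three spectrum masses. A protein whose best diagonals all have low weight would be a natural candidate for removal before the full alignment.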
MS-Align-E also uses a dynamic programming algorithm to find the longest path in a spectral grid with a given number of modifications. It can be applied to identify PTMs of both unknown and known proteins with modifications. A deficiency of MS-Align-E is that it sometimes reports incorrect proteoforms, owing to errors in precursor or fragment masses, when identifying multiplexed spectra and PTMs with similar mass shifts. It also has difficulty identifying proteoforms with terminal truncations.

Since then, some tools have emerged that integrate the above spectral alignment-based algorithms. MASH Suite Pro is a GUI-based software package for protein identification, characterization, quantification, and visual validation that also supports interface customization. It is suitable for large-scale TD proteomics data analysis. MASH Suite Pro supports several deconvolution methods for high-resolution MS and MS/MS data, including MS-Deconv, UniDec, and enhanced THRASH, to optimize the protein identification performance. MS-Align+ is applied to identify proteoforms, so proteoforms with terminal truncations, unexpected PTMs, and sequence variations can be identified.

TopPIC [84] combines protein filtration, spectral alignment, and E-value computation to complete proteoform identification. It focuses on the identification of proteoforms with unknown PSAs and the characterization of unknown modifications. First, TopPIC filters the protein database through an index-based filtering algorithm [85] to reduce the number of candidate proteins from thousands to tens, which significantly saves time in subsequent tasks. Then, TopPIC finds the optimal alignment between the spectrum and each candidate protein retained by the filtering algorithm through the spectral alignment-based algorithm MS-Align+. Finally, TopPIC calculates the E-values of candidate PrSMs by extending the generating function method [86] or by using a lookup table and reports the PrSM with the best E-value.
TopPIC also integrates MIScore [87], which adopts a Bayesian model to find the one or two modifications that best explain the unknown mass shifts in PrSMs. It reports a confidence score for each modification and automatically characterizes unknown mass shifts. MIScore is an optional function of TopPIC and is introduced in detail in the section ‘Localization of PTM sites in proteoforms’. SpectroGene [88] has extended TopPIC for genome annotation using tandem TD spectra. TopPIC still has some shortcomings: (1) the filtering algorithm used to speed up the identification relies on fragments without numerous alterations, so TopPIC is not suitable for identifying proteoforms containing multiple modifications; (2) it cannot identify oxidized proteoforms or proteoforms with several other variable PTMs; and (3) the evaluation function is a simple mass count score that does not consider peak intensity.

SPECTRUM [89], implemented in MATLAB, is an open-source tool for TD proteomics analysis. The tool provides a pipeline for intact protein identification, which supports de novo peptide sequencing for tag analysis and propensity-driven PTM characterization by using a blind PTM search and a spectral comparison strategy. SPECTRUM provides a GUI for users to access its functions and view its results.

Graph model-based algorithms

To explore a large proteoform space faster, some researchers have simplified the proteoform space with graph model-based algorithms, which convert the identification of proteoforms into the problem of finding the optimal path in a graph. Some methods for identifying proteoforms based on graph models are as follows.

Before identifying proteoforms, pTop [43] adopts two strategies to limit the searching space.
First, it adopts a sequence tag-based protein identification strategy and searches the protein database with the sequence tags to extract the top 100 candidate proteins. Then, it generates candidate combinatorial modifications from the difference between the precursor mass and the mass of the candidate protein. This strategy reduces the number of combinations, because only those combinations that meet the mass constraint require subsequent scoring. Through these two strategies, the searching space of proteoforms is greatly reduced and the identification is accelerated. pTop designs a scoring function based on dynamic programming to determine the possible modification sites for candidate proteoforms. All combinatorial modifications that satisfy the mass constraints are depicted as a directed acyclic graph, as shown in Figure 5(A). A path is considered valid when it connects the source to the sink. Each valid path in Figure 5(A) represents a proteoform, and the proteoform on the best-weighted path is considered the best match. The problem of determining modification sites and ranking proteoforms is thus simplified to an optimal path-finding problem. pTop uses the algorithm proposed in [90] to find the k best paths. The disadvantage is that proteoforms with terminal truncations cannot be identified in the latest version.

TopMG [91] is based on the idea of a mass graph and overcomes this disadvantage of pTop. Since a mass graph can represent proteoforms with variable PTMs and terminal truncations, TopMG converts the protein identification problem into a mass graph alignment problem between a mass graph that represents candidate proteoforms and a mass graph built from a TD MS/MS spectrum. TopMG defines a proteoform mass graph (Figure 5B1) and a spectral mass graph (Figure 5B2), respectively.
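The second pTop strategy described above, keeping only the modification combinations whose summed mass explains the difference between the precursor mass and the candidate protein mass, can be sketched as follows. This is an illustrative enumeration and not pTop's code: the function name, the small modification table (monoisotopic masses), the limit of three modifications, and the tolerance are assumptions.

```python
from itertools import combinations_with_replacement

# Monoisotopic mass shifts (Da) of a few common modifications.
MOD_MASSES = {
    "acetyl": 42.010565,
    "methyl": 14.015650,
    "oxidation": 15.994915,
    "phospho": 79.966331,
}


def candidate_mod_combinations(delta_mass, mod_masses, max_mods=3, tol=0.01):
    """List modification multisets whose total mass matches delta_mass."""
    hits = []
    names = sorted(mod_masses)
    for k in range(max_mods + 1):
        for combo in combinations_with_replacement(names, k):
            total = sum(mod_masses[name] for name in combo)
            if abs(total - delta_mass) <= tol:
                hits.append(combo)
    return hits


# Precursor mass minus unmodified protein mass, in Da (toy value).
print(candidate_mod_combinations(56.0262, MOD_MASSES))
```

Only the combinations returned here would need to be placed on the sequence and scored, which is the reason the mass constraint shrinks the search space. Note that a tight tolerance matters: trimethylation (42.04695 Da) and acetylation (42.01057 Da) differ by only about 0.036 Da.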
Each node in Figure 5B1 is the prefix residue mass of some possible proteoforms, and the weight of an edge is an amino acid residue mass. A directed red edge in Figure 5B indicates that the corresponding amino acid site is a variable PTM site, and its weight is the residue mass of the amino acid with the PTM. Figure 5B2 shows the conversion of prefix residue masses into a spectral mass graph in ascending order. The mass graph alignment problem is to find a path A in the proteoform mass graph and a consistent path B in the spectral mass graph such that the similarity score Score(A, B) between the two paths is maximized, which gives the best match between them. The similarity score Score(A, B) is defined as the shared mass counting score between paths A and B. Since PTMs may occur at different amino acid sites, different paths may have the same path weight. To reduce the running time, TopMG proposes a dynamic programming algorithm that calculates the r-distance sets and the highest shared mass counting score over all consistent path pairs (A, B), where r is the maximum number of red edges (see Figure 5B) allowed in the paths that define the set of distances. The downside of TopMG is that it takes a lot of computation time to analyze large TDMS data.

Twister [92, 93] is a tool for de novo sequencing of proteins and peptides from TD spectra. The algorithm first generates high-quality sequence tags and then combines them into a set of aggregated paths by using a T-Bruijn graph. The goal of the tool is to find optimal paths in the T-Bruijn graph that spell out a set of amino acid strings.

Compared to TopMG, MSPathFinder [45] effectively reduces the running time with a sequence graph approach, as shown in Figure 5C. MSPathFinder allows users to specify the types of PTMs and the maximal number of modifications in the sequence, which reduces the combinatorial proteoform space.
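The shared mass counting score Score(A, B) used by TopMG can be illustrated with a small sketch that turns the edge weights of a proteoform path into prefix masses and counts how many of them are also found among the spectrum masses. It only scores two given paths and does not implement the TopMG dynamic programming over r-distance sets; the function names, tolerance, and toy values are assumptions.

```python
from itertools import accumulate


def prefix_masses(edge_weights):
    """Prefix masses along a path, given the residue masses on its edges."""
    return list(accumulate(edge_weights))


def shared_mass_count(masses_a, masses_b, tol=0.01):
    """Count masses that occur in both lists within a tolerance (two pointers)."""
    a, b = sorted(masses_a), sorted(masses_b)
    i = j = count = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tol:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count


G, A, L, METHYL = 57.02146, 71.03711, 113.08406, 14.01565
unmodified_path = prefix_masses([G, A, L])           # black edges only
modified_path = prefix_masses([G, A + METHYL, L])    # one red (PTM) edge, toy case
spectrum = [57.021, 142.074, 255.158]

print(shared_mass_count(unmodified_path, spectrum))
print(shared_mass_count(modified_path, spectrum))
```

The path that takes the modified edge shares more masses with the toy spectrum than the unmodified path, so it would be preferred under this score.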
The types of PTMs specified in Figure 5C are oxidized methionine and methylated lysine, and the maximal number of modifications allowed in the sequence is two. A path from the source to the sink represents a proteoform. Each vertex of a sequence graph represents a unique segment, and its elemental composition is equal to the sum of the elemental compositions of all amino acids in the segment. Two vertices are connected by an edge if the difference between their elemental compositions equals the elemental composition of an amino acid, with or without a PTM. Unmodified amino acids, oxidized methionine, and methylated lysine are indicated by black, green, and yellow edges, respectively. The main idea of MSPathFinder is to let the vertices of the sequence graph represent elemental compositions rather than combinations of modifications. Because many proteoforms differ only in their PTM sites, the number of unique elemental compositions is much smaller than the number of possible proteoforms; thus the running time is reduced. MSPathFinder then adopts a spectral alignment-based algorithm [45, 79, 80] similar to MS-Align+ to find the highest-scoring proteoforms in the protein sequences based on a parametric dynamic programming algorithm.

Figure 5 The different strategies of graph model-based algorithms.

Since the identification of terminal truncations further expands the searching space, MSPathFinder also implements a filtering approach similar to a de novo sequencing algorithm [95] to find short amino acid sequences called sequence tags. Once a protein matches a sequence tag, MSPathFinder searches for proteoforms from the two ends of the matching sequence toward the termini (see Figure 5D).
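The saving that comes from letting graph vertices stand for compositions rather than for individual placements of modifications can be illustrated by counting both quantities. The sketch below is a simplification and not MSPathFinder's data structure: it uses the number of each modification type in a prefix as a proxy for elemental composition, and it assumes one possible modification per modifiable residue. All names and the toy sequence are hypothetical.

```python
from math import comb


def count_proteoforms(seq, mod_sites, max_mods):
    """Number of ways to modify at most max_mods of the modifiable residues."""
    n_sites = sum(1 for aa in seq if aa in mod_sites)
    return sum(comb(n_sites, k) for k in range(max_mods + 1))


def count_graph_vertices(seq, mod_sites, max_mods):
    """Number of (prefix length, modification counts) vertices in the graph."""
    kinds = sorted(set(mod_sites.values()))
    states = {tuple(0 for _ in kinds)}   # source vertex: empty prefix
    total = len(states)
    for aa in seq:
        nxt = set(states)                # edge for the unmodified residue
        if aa in mod_sites:
            k = kinds.index(mod_sites[aa])
            for s in states:
                if sum(s) < max_mods:    # edge for the modified residue
                    nxt.add(s[:k] + (s[k] + 1,) + s[k + 1:])
        states = nxt
        total += len(states)
    return total


mods = {"M": "oxidation", "K": "methylation"}
sequence = "MK" * 10                     # toy sequence with 20 modifiable sites
print(count_proteoforms(sequence, mods, max_mods=4))
print(count_graph_vertices(sequence, mods, max_mods=4))
```

For this toy sequence the number of vertices is bounded by the sequence length times the number of distinct modification-count states, while the number of proteoforms grows combinatorially with the number of sites. For very short sequences the graph is not necessarily smaller than the list of proteoforms.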
While this tag-based approach helps limit the searching space, its performance depends on high-quality sequence tags. When the number of continuous fragment ions detected in MS/MS is insufficient, the proteoform identification may be affected. MSPathFinder considers only user-defined modifications, so it is somewhat limited in identifying unexpected PTMs. MSPathFinder is the proteoform identification algorithm of the Informed-Proteomics software suite. It takes the LC-MS feature file generated by ProMex and a set of modifications as input and provides the statistical significance (P-value or E-value) of PrSMs, where the E-value of a PrSM is calculated using the generating function method [96, 97]. MSPathFinder also adopts the target–decoy method to estimate the false discovery rate (FDR) [98].

Many software tools are currently designed to analyze tandem mass spectra of individual proteoforms rather than multiplexed tandem mass spectra. In TD HomMTM (homologous multiplexed tandem mass) spectra, the proteoforms in a spectrum contain the same modifications but different PTM patterns. Recently, Zhu et al. [99] turned the TD HomkMTM spectral identification problem into the minimum error k-splittable flow (MEkSF) problem on a graph. A dynamic programming algorithm is adopted to solve the minimum error two-splittable flow (ME2SF) problem in a layered directed graph, so that proteoforms can be identified and quantified. The disadvantage is that experimental mass spectra miss many fragment peaks and contain many noise peaks, making it difficult to reliably identify more than two proteoforms from HomMTM spectra.

Although the methods in Figure 5 all rely on a graph model, there are certain differences between them. pTop builds a directed acyclic graph, which simplifies the identification of proteoforms into an optimal path-finding problem. Based on the idea of a mass graph, TopMG translates the identification of proteoforms into a mass graph alignment problem.
MSPathFinder identifies proteoforms with a sequence graph method. In these models, pTop uses nodes to represent proteoforms, while TopMG uses edges to represent proteoforms. The representation used by TopMG simplifies the graph and greatly reduces its number of nodes. MSPathFinder represents a path in the graph as the chemical elemental composition of a proteoform, and this design ultimately reduces the running time.

Through the first classification, we divide proteoform identification methods into two categories and then further classify blind PSA search methods, which enables researchers to gain a deeper understanding of existing methods. To help researchers who want to use these methods for proteoform identification quickly find a suitable one, this paper lists the types of PSAs that each method can identify, as shown in Table 3. Since sequence variations and fixed PTMs can be identified by most methods, the classification here covers only the other three PSAs: variable PTMs, terminal truncations, and unknown PSAs.
Table 3 Proteoform identification methods to identify various types of PSA

| Methods | Variable PTMs | Terminal truncations | Unknown PSAs |
|---|---|---|---|
| ProSightPC | • | •/° | ° |
| MascotTD | — | — | ° |
| BUPID-Top-Down | • | — | • |
| ProteinGoggle | • | — | — |
| MetaMorpheus | — | — | • |
| TDPortal | — | • | • |
| MS-TopDown | • | — | • |
| PIITA | — | — | • |
| MS-Align+ | •/° | • | • |
| MS-Align-E | • | ° | • |
| MASH Suite Pro | •/° | • | • |
| pTop | • | ° | • |
| TopPIC | ° | • | • |
| TopMG | • | • | • |
| MSPathFinder | • | •/° | ° |
| HomMTM | • | — | ° |
| SPECTRUM | • | • | • |
| Twister | • | • | — |

In Table 3, ‘•’ indicates that the method can identify proteoforms with the corresponding PSAs, and ‘°’ indicates that the method cannot identify proteoforms with such PSAs. ‘•/°’ indicates that the method identifies proteoforms with such PSAs only under certain conditions.
For example, MS-Align+ treats variable PTMs as unknown PSAs during proteoform identification; ProSightPC is not suitable for unknown PTMs with terminal truncations; MSPathFinder relies on high-quality sequence tags when identifying proteoforms with terminal truncations. ‘—’ indicates that there is no clear explanation in the relevant literature.

Filtering algorithms for the proteoform identification

Although many algorithms have been proposed for identifying proteoforms, in high-throughput proteome-level analysis, millions of spectra need to be aligned with tens of thousands of protein sequences, which makes spectral alignment-based algorithms extremely slow. Therefore, filtering algorithms are essential in proteome-level analysis. Proteoform identification methods combined with efficient protein sequence filtering algorithms can speed up database searches. As a result, the performance of these methods depends heavily on the sensitivity and efficiency of their filtering algorithms. At present, protein sequence filtering methods are mainly divided into three categories.

The first type of method is based on error tolerance [100]. This type of method tolerates a large error between the precursor mass of the experimental spectrum and the molecular mass of the candidate sequence. In TDMS analysis, this method is adopted in the Delta-M mode of ProSightPC. The downside is that when the error tolerance is very large, the filtering method reports many candidates, which significantly increases the running time of the database search.

The second type of method is based on sequence tags. This method was proposed by Mann et al. [101] in 1994. In this method, sequence tags are generated from the experimental spectrum and searched against the database to find hits. Sequence tag methods have been widely and successfully used in BU proteomics analysis [95, 102–107].
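The sequence tag idea described above can be sketched in a few lines: mass differences between peaks that equal an amino acid residue mass are chained into short tags, and the tags are then looked up in the database. This is a minimal illustration rather than the procedure of any specific tool; the function names, the reduced residue table, the tag length, and the toy data are assumptions.

```python
# Monoisotopic residue masses (Da) for a small subset of amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406,
}


def extract_tags(masses, tag_len=3, tol=0.01):
    """Chain peak-to-peak mass differences that match residues into tags."""
    masses = sorted(masses)
    edges = {i: [] for i in range(len(masses))}
    for i in range(len(masses)):
        for j in range(i + 1, len(masses)):
            diff = masses[j] - masses[i]
            for aa, mass in RESIDUE_MASS.items():
                if abs(diff - mass) <= tol:
                    edges[i].append((j, aa))
    tags = set()

    def extend(i, tag):
        if len(tag) == tag_len:
            tags.add(tag)
            return
        for j, aa in edges[i]:
            extend(j, tag + aa)

    for start in range(len(masses)):
        extend(start, "")
    return tags


def filter_by_tags(tags, database):
    """Keep proteins containing any tag, read in either direction."""
    return [name for name, seq in database.items()
            if any(t in seq or t[::-1] in seq for t in tags)]


peaks = [100.0, 157.02146, 228.05857, 327.12698, 428.17466]
db = {"P1": "MKGAVTLL", "P2": "MSSPPLL"}
tags = extract_tags(peaks)
print(sorted(tags), filter_by_tags(tags, db))
```

Tags are matched in both directions because a ladder of peaks may come from either the N-terminal or the C-terminal fragment series. If the spectrum lacks consecutive fragment ions, no edges can be chained and no tag is produced, which is why this kind of filter depends on continuous fragment ions.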
In TD proteomics analysis, sequence tag-based methods are adopted in ProSightPC (in its sequence tagging mode), pTop, and MSPathFinder. The accuracy of sequence tag-based methods depends on whether the experimental spectrum contains continuous fragment ions.

The third type of method is based on unmodified protein fragments (UPFs). Unmodified protein fragments and their matched fragment masses in the experimental spectrum are used to filter the proteins. Because this method is computationally intensive, some index-based algorithms [85, 108, 109] have been proposed. In TD proteomics analysis, this method has been adopted in MS-Align+ and TopPIC. The UPF-based method, which has higher precision than the other methods, consists of two main steps. First, the similarity score between the spectrum S and the protein sequence P is calculated from the unmodified protein fragments. Then, the proteins in the database are ranked according to their similarity scores, and the top t proteins are reported as the filtered result. Two UPF-based methods are applied in TopPIC, called UPF-RESTRICT and UPF-DIAGONAL, respectively.

ASF [110] includes two effective approximate spectrum-based protein filtering algorithms for identifying ultra-modified proteoforms with variable PTMs or with both variable PTMs and unexpected alterations. Currently, ASF can optionally be used as the filter in TopMG. The main principle of the algorithm is as follows: first, an experimental spectrum is converted into a corresponding approximate spectrum, and then the UPF-based method is used to filter the protein database. One limitation of ASF is that it does not apply to proteoforms with only unexpected alterations. Another limitation is that, because ASF needs to generate an approximate spectrum, it is much slower than other filtering methods. Also, the running time is an exponential function of parameter h.
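The two steps of UPF-based filtering, scoring each protein by its matches to unmodified fragments and then reporting the top t proteins, can be illustrated with a simplified sketch. It is not UPF-RESTRICT or UPF-DIAGONAL: it only counts spectrum masses that coincide with unmodified N-terminal prefix masses, and the function names, the toy database of precomputed prefix masses, and the tolerance are assumptions.

```python
def upf_score(spectrum, prefix_masses, tol=0.01):
    """Step 1: count spectrum masses explained by unmodified prefix fragments."""
    return sum(
        1 for s in spectrum
        if any(abs(s - p) <= tol for p in prefix_masses)
    )


def upf_filter(spectrum, database, t=2, tol=0.01):
    """Step 2: rank proteins by score and report the top t names."""
    ranked = sorted(
        database,
        key=lambda name: (-upf_score(spectrum, database[name], tol), name),
    )
    return ranked[:t]


# Toy database: protein name -> prefix masses of the unmodified sequence.
toy_db = {
    "A": [57.0, 128.0, 227.0],
    "B": [71.0, 142.0, 300.0],
    "C": [100.0, 200.0, 300.0],
}
toy_spectrum = [57.0, 128.0, 300.0]
print(upf_filter(toy_spectrum, toy_db, t=2))
```

Ties are broken alphabetically here only to make the ranking deterministic. A production filter would use an index over fragment masses instead of the nested scan, which is the motivation for the index-based algorithms cited above.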
Recently, an unbiased automatic quality control filter for top-down proteomics data sets, named TDGC [111], was developed for selecting high-quality spectra, which can increase identification rates.

Localization of PTM sites in proteoforms

The localization of PTM sites in proteoforms means finding the amino acid site where a protein is modified. Currently, the localization of PTM sites in proteoforms includes manual and automatic methods. In BU proteomics analysis, there are many methods for the automatic identification and localization of PTMs, such as A-score [112], PTM score [113], phosphorylation localization score [114], PTMFinder [115], SloMo [116], Mascot Delta score [117], PhosphoRS [118], iPTMClust [119], and so on. While many methods based on TDMS techniques can identify proteoforms, localizing PTM sites in proteoforms still relies primarily on manual annotation. For example, ProSightPC includes a graphical user interface for manually characterizing proteoforms, but it is inefficient when analyzing high-throughput data. ProSight Lite [70] (http://prosightlite.northwestern.edu/) is a TD proteomics data analysis application for matching a set of experimental mass spectra to a single candidate sequence. It is a simple and intuitive platform that can partially replace ProSightPTM and ProSightPTM2 for characterizing proteoforms, and it also provides a high-resolution graphical fragment map. The downside of ProSight Lite is that it requires prior knowledge of the protein sequence to locate PTMs effectively.

Currently, methods for automatically locating PTMs include PTMcRAWler, C-scores, and MIScore. PTMcRAWler [120] is an automated software program for analyzing protein mass shifts. It traverses the list of THRASH-inferred mass values from cRAWler-Plus (a modified version of the online automation cRAWler). Mass shifts Δm that match a theoretical PTM are found according to a user-defined list.
If the error of a mass shift with the same charge is within 0.1 Da, a match is declared, and the mass shift is added to the PTM list. To limit false positives, PTMcRAWler sets S/N = 10 in THRASH to find PTMs reliably. The advantage of this method is that it rapidly screens for PTMs and can locate not only singly but also multiply modified proteoforms.

LeDuc et al. proposed C-scores based on a Bayesian framework [121]. Proteoforms are identified by calculating the C-scores of candidate proteoforms in the database. The framework allows the injection of prior knowledge into generative models to take full advantage of known properties of proteins and TD analysis systems, such as fragmentation propensity information, ‘off-by-1 Da’ discontinuity errors, and so on. The C-score takes the precursor mass observed in the experiment and a set of neutral fragment ion masses into account. It applies the following Bayesian formula:

$$ \Pr\left(\varnothing_q \mid M_O,\{m_i\}\right)=\frac{\Pr\left(M_O,\{m_i\} \mid \varnothing_q\right)\Pr\left(\varnothing_q\right)}{\Pr\left(M_O,\{m_i\}\right)} \tag{5} $$

$$ C\text{-score}=-10\log_{10}\left(1-\Pr\left(\varnothing_q \mid M_O,\{m_i\}\right)\right) \tag{6} $$

where $M_O$ is the observed precursor mass, $\varnothing_q$ is an arbitrary protein sequence with the precursor mass $M_O$, $m_i$ is the ith mass of the observed fragment ions, $\Pr(\varnothing_q)$ is the prior probability of $\varnothing_q$, $\Pr(M_O,\{m_i\} \mid \varnothing_q)$ is the likelihood, and $\Pr(\varnothing_q \mid M_O,\{m_i\})$ is the posterior probability. A disadvantage of the C-score approach is that proteoforms can be effectively characterized only if the proteoform contains no unknown PSAs and the extended database contains the candidate proteoforms.

To address the shortcomings of the C-score method, Kou et al. [87] proposed the MIScore method, which can not only characterize proteoforms containing unknown PSAs but also speed up the experiment.
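Equations (5) and (6) can be made concrete with a short numerical sketch. It assumes that the evidence term in the denominator is obtained by summing likelihood times prior over a finite list of candidate proteoforms, and the likelihood values are invented for illustration; the function names are hypothetical and this is not the published C-score implementation.

```python
import math


def posteriors(likelihoods, priors):
    """Bayes' rule over a finite candidate set (evidence = sum of joints)."""
    joint = {q: likelihoods[q] * priors[q] for q in likelihoods}
    evidence = sum(joint.values())
    return {q: value / evidence for q, value in joint.items()}


def c_score(posterior):
    """C-score = -10 * log10(1 - posterior); infinite if posterior is 1."""
    if posterior >= 1.0:
        return math.inf
    return -10.0 * math.log10(1.0 - posterior)


# Invented likelihoods Pr(M_O, {m_i} | candidate) and equal priors.
likelihoods = {"proteoform_a": 0.9, "proteoform_b": 0.1}
priors = {"proteoform_a": 0.5, "proteoform_b": 0.5}

post = posteriors(likelihoods, priors)
for name, p in post.items():
    print(name, round(p, 3), round(c_score(p), 2))
```

The logarithmic scale means that a posterior of 0.9 corresponds to a C-score of about 10 and a posterior of 0.99 to about 20, so each additional 10 units reflects a tenfold smaller probability that the candidate is wrong.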
The method also uses a Bayesian model to calculate the confidence score for each candidate modification site, i.e. to estimate the posterior probability of the modification occurring at each amino acid site. PTMs in the proteoforms are therefore identified and located automatically, and the site with the highest confidence score is reported as the localization site. Suppose a TD MS/MS spectrum S is obtained from a proteoform containing m amino acids and one modification. Let P be the unmodified protein sequence of the target protein variant, and let F1, F2,···, Fm denote all proteoforms that can be obtained by modifying P, where Fi is the proteoform modified on the ith amino acid. Then, the posterior probability of proteoform Fi given the query spectrum S, Pr(Fi|S), is calculated as follows:

$$ \Pr\left(F_i \mid S\right)=\frac{\Pr\left(S \mid F_i\right)\Pr\left(F_i\right)}{\Pr(S)}=\frac{\Pr\left(S \mid F_i\right)\Pr\left(F_i\right)}{\sum_{j=1}^{m}\Pr\left(S \mid F_j\right)\Pr\left(F_j\right)} \tag{7} $$

where Pr(S|Fi) is the conditional probability of the spectrum S for a given proteoform Fi and Pr(S) is the probability of S. Pr(S) is calculated as the sum of the prior probabilities Pr(Fj) multiplied by their likelihoods Pr(S|Fj). When the Bayesian model is used to calculate the confidence score of each candidate modification site, the prior probability Pr(Fj) of each site is obtained by averaging. That is, the prior probability of being modified is the same for every site, i.e. Pr(Fj) = 1/m, j = 1, 2,···, m. MIScore can also be used to identify multiple modifications from mass shifts. MIScore describes a dynamic programming algorithm that uses a simple shared mass count score to effectively reduce the time for calculating probabilities. MIScore calculates the confidence score for the potential location of a PTM.
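Equation (7) with the uniform prior Pr(Fj) = 1/m can be evaluated directly once a likelihood is available for every candidate site. The sketch below takes log-likelihoods as input and normalizes them with the log-sum-exp trick for numerical stability; how MIScore itself derives the likelihoods is not reproduced here, so the input values and the function name are hypothetical.

```python
import math


def site_posteriors(log_likelihoods):
    """Posterior Pr(F_i | S) for each site under a uniform prior 1/m."""
    m = len(log_likelihoods)
    log_prior = -math.log(m)
    log_joint = [ll + log_prior for ll in log_likelihoods]
    # log Pr(S) = log sum_j exp(log_joint_j), computed stably.
    peak = max(log_joint)
    log_evidence = peak + math.log(sum(math.exp(x - peak) for x in log_joint))
    return [math.exp(x - log_evidence) for x in log_joint]


# Hypothetical log-likelihoods log Pr(S | F_i) for three candidate sites.
log_liks = [math.log(1.0), math.log(3.0), math.log(6.0)]
post = site_posteriors(log_liks)
best_site = max(range(len(post)), key=post.__getitem__)
print([round(p, 3) for p in post], "best site index:", best_site)
```

Because the prior is uniform, it cancels in the normalization and the posteriors are simply the likelihoods rescaled to sum to 1. The site with the largest posterior is the one reported as the localization site.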
The deficiencies of MIScore are as follows: (1) no more than two modifications can be identified in an unknown mass shift; (2) characterization becomes difficult when a proteoform contains many modifications resulting from alternative splicing; (3) its accuracy depends on the accuracy of the reported precursor mass; (4) unknown mass shifts identified from multiple spectra cannot be accurately characterized; and (5) due to missing peaks, an identified mass shift may result from a combination of multiple modifications.

Kelleher’s research group has released a proteoform characterization tool (beta version, http://pcs.kelleher.northwestern.edu/), which can report a Phrrap score for each proteoform of a multiply modified protein.

Different tools or software packages employ one or more algorithms for proteoform characterization. Table 4 lists some common algorithms, such as dynamic programming algorithms, graph-based algorithms, and probability algorithms, used in the tools or software packages for proteoform characterization.
Table 4 Common algorithms applied in the process of proteoform characterization

| Algorithm | Stage | Tool | Description |
|---|---|---|---|
| Dynamic programming algorithm | Deconvolution methods | MS-Deconv | MS-Deconv uses a graph-theoretical approach to represent the envelopes and then selects the heaviest path in the graph through dynamic programming algorithms |
| Dynamic programming algorithm | Identification methods | MS-TopDown | MS-TopDown applies dynamic programming algorithms to increase the computational speed |
| Dynamic programming algorithm | Identification methods | MS-Align-E | MS-Align-E adopts a dynamic programming algorithm to find the longest path in a spectral grid with a given number of modifications |
| Dynamic programming algorithm | Identification methods | pTop | pTop designs a scoring function based on the dynamic programming to determine the possible modification sites for candidate proteoforms |
| Dynamic programming algorithm | Identification methods | MSPathFinder | MSPathFinder adopts a spectral alignment-based algorithm similar to MS-Align+ to find the highest-scoring proteoforms in the protein sequences based on a parametric dynamic programming algorithm |
| Dynamic programming algorithm | Identification methods | HomkMTM | HomkMTM turns the spectral identification problem into the minimum error k-splittable flow problem on a graph and uses a dynamic programming algorithm to solve the problem of the minimum error two-splittable flow |
| Dynamic programming algorithm | Localization methods | MIScore | MIScore describes a dynamic programming algorithm that uses a simple shared mass count score to reduce the time for calculating probabilities effectively |
| Graph model-based algorithms | Identification methods | pTop | pTop designs a scoring function based on the dynamic programming to determine the possible modification sites for candidate proteoforms. All combinatorial modifications that satisfy the mass constraints are depicted as a directed acyclic graph. The problem of modifying the site and proteoform ordering is simplified to an optimal path-finding problem |
| Graph model-based algorithms | Identification methods | TopMG | TopMG converts the protein identification problem into a mass graph alignment problem |
| Graph model-based algorithms | Identification methods | MSPathFinder | MSPathFinder allows the user to specify the maximal number of modifications allowed in the PTM and sequence. It can explore the combinatorial proteoform space efficiently using a sequence graph-based approach |
| Graph model-based algorithms | Identification methods | HomkMTM | HomkMTM turns a spectral identification problem into the minimum error k-splittable flow problem on a graph |
| Probability methods | Identification methods | ProSightPTM | ProSightPTM is extended into a commercial software system, named ProSightPC, which allows users to search for tandem MS data in a PTM database containing known biological complexity in UniProt. The matching process is scored through the Poisson distribution probability model |
| Probability methods | Localization methods | C-scores | C-scores based on a Bayesian framework allow the injection of prior knowledge into generative models to take full advantage of known properties in proteins and TD analysis systems |
| Probability methods | Localization methods | MIScore | MIScore uses a Bayesian model to calculate the confidence score for each candidate modification site and estimate the posterior probability of the modification occurring at each amino acid site |
modification sites for candidate proteoforms MSPathFinder MSPathFinder adopts a spectral alignment-based algorithm similar to MS-Align+ to find the highest-scoring proteoforms in the protein sequences based on a parametric dynamic programming algorithm HomkMTM HomkMTM turns spectral identification problem into the minimum error k-splittable flow problem on a graph and uses a dynamic programming algorithm to solve the problem of the minimum error two-splittable flow Localization methods MIScore MIScore describes a dynamic programming algorithm that uses a simple shared mass count score to reduce the time for calculating probabilities effectively Graph model-based algorithms Identification methods pTop pTop designs a scoring function based on the dynamic programming to determine the possible modification sites for candidate proteoforms. All combinatorial modifications that satisfy the mass constraints are depicted as a directed acyclic graph. The problem of modifying the site and proteoform ordering is simplified to an optimal path-finding problem TopMG TopMG converts the protein identification problem into a mass graph alignment problem MSPathFinder MSPathFinder allows the user to specify the maximal number of modifications allowed in the PTM and sequence. It can explore the combinatorial proteoform space efficiently using a sequence graph-based approach HomkMTM HomkMTM turns a spectral identification problem into the minimum error k-splittable flow problem on a graph Probability methods Identification methods ProSightPTM ProSightPTM is extended into a commercial software system, named ProSightPC, which allows users to search for tandem MS data in a PTM database containing known biological complexity in UniProt. 
The matching process is scored through the Poisson distribution probability model Localization methods C-scores C-scores based on a Bayesian framework allow the injection of prior knowledge into generative models to take full advantage of known properties in proteins and TD analysis systems MIScore MIScore uses a Bayesian model to calculate the confidence score for each candidate modification site and estimate the posterior probability of the modification occurring at each amino acid site Open in new tab Experiment In order to verify the performance of the existing proteoform identification methods, we select the state-of-the-art methods for comprehensive evaluation and comparison. In this study, we compare four blind PSA methods: pTop (version 1.2.0), TopPIC (version 1.2.2), TopMG (version 1.2.2), and MSPathFinder (version 1.0.7017). We don’t choose ProSightPC to compare with other methods, because it is a commercial software system and its database is encoded in the binary format; the target–decoy approach cannot be applied. Since FDR is currently the most widely used and well-evaluated method for the identification of proteoforms, we used FDR to objectively compare each method. In addition, we performed an overlap analysis of the results and compared the running time of each method. Data sets Since each method differs in the selection of data sets and evaluation criteria, we need to comprehensively evaluate and compare the performance of various methods on the same data set and criteria to objectively evaluate the performance of existing methods. We have performed experiments using the histone H3.1 data set, which is obtained through an LTQ Orbitrap Velos mass spectrometer (Thermo Fisher Scientific, Waltham, MA) [122]. In order to ensure a fair comparison, this study conducts experiments on a subtype of breast cancer (WHIM2-P32) [56]. 
We select five PTMs (acetylation, methylation, dimethylation, trimethylation, and phosphorylation) as variable PTMs for identifying histone proteoforms, and three PTMs (acetylation, oxidation and dehydro) as variable PTMs for identifying proteoforms in WHIM2-P32. The amino acid types of the specific sites for each modification are shown in Supplementary Tables 1 and 2.

Parameter settings

Each method is executed on the same Windows server with a 3.10 GHz CPU (Intel(R) Xeon(R) E3-1220 v3) and 8 GB of memory. Because the methods require different input formats, the original file (.raw) must be converted to the format required by each method. For pTop, pParseTD is used to deconvolute the original spectral file. For TopPIC, TopMG, and MSPathFinder, the original spectral file is converted to the .mzML format via ProteoWizard [123]. For TopPIC and TopMG, the .mzML file is then deconvoluted into the .msalign format using TopFD, which is integrated into TopPIC. For MSPathFinder, ProMex, which is integrated into the open-source package Informed-Proteomics, is used to deconvolute the spectra and generate the .ms1ft file used as input to MSPathFinder. To keep the experiment objective, each system uses its default parameters. The parameter settings of pTop, TopPIC, TopMG, and MSPathFinder are shown in Supplementary Table 3. As shown there, all four methods have two error parameters, the precursor error and the fragment error, where Da is the mass unit of the protein and ppm stands for parts per million. These two error parameters are kept at their default values and are the parameters that most directly affect the accuracy of proteoform identification. The smaller the error parameter setting is, the more accurate the identification is. The maximum mass is set to 50 000 Da, so for a precursor mass of 50 000 Da, TopPIC’s precursor error of 15 ppm corresponds to 50 000 × 15 × 10^-6 = 0.75 Da.
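The conversion between a relative tolerance in ppm and an absolute tolerance in Da can be checked with a few lines of code. The following is a minimal sketch; the function names `ppm_to_da`, `da_to_ppm` and `within_tolerance` are our own illustrative choices and are not taken from any of the evaluated tools.

```python
def ppm_to_da(mass_da: float, tolerance_ppm: float) -> float:
    """Absolute tolerance (Da) that a relative tolerance (ppm) implies at a given mass."""
    return mass_da * tolerance_ppm / 1e6


def da_to_ppm(mass_da: float, tolerance_da: float) -> float:
    """Relative tolerance (ppm) that an absolute tolerance (Da) implies at a given mass."""
    return tolerance_da / mass_da * 1e6


def within_tolerance(observed_da: float, theoretical_da: float, tolerance_ppm: float) -> bool:
    """True if the observed mass lies within the ppm window around the theoretical mass."""
    return abs(observed_da - theoretical_da) <= ppm_to_da(theoretical_da, tolerance_ppm)


if __name__ == "__main__":
    # The case discussed in the text: 15 ppm at the maximum mass of 50 000 Da.
    print(ppm_to_da(50_000.0, 15.0))
    # The same 15 ppm window is narrower, in Da, for a lighter precursor.
    print(ppm_to_da(10_000.0, 15.0))
```

Because a ppm tolerance scales with mass, a fixed tolerance expressed in Da is comparatively loose for light precursors and comparatively tight for heavy ones, which is worth keeping in mind when default settings given in different units are compared.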
Therefore, the default error tolerance of pTop is larger than that of the other methods. The variable modification parameters are supplied to each software package for the identification and localization of PTMs. The maximum number of mass modifications is the maximum number of variable PTM sites allowed in a proteoform.

Validation methods

Validating the results is difficult because there is currently no gold-standard data set of proteoform characterization results. At present, the accuracy of proteoform identification based on TDMS can be evaluated by three indicators: (1) the number of matched fragment ions; (2) the statistical significance, including the P-value and E-value; and (3) the FDR. The number of matched fragment ions is the number of fragment ions in the mass spectrum that match the candidate protein sequence. The more fragment ions are matched, the better the PrSM, so the identification result can be judged by this number. Some methods set a threshold and regard a PrSM as correctly identified if its number of matched fragment ions exceeds the threshold. Statistical significance is assessed by computing the P-value or E-value of each PrSM and applying a threshold to judge the identification result. The E-value describes the number of accidental hits expected when searching a spectrum against a protein database of a given size, so the smaller the E-value is, the more reliable the result. The statistical significance can be calculated through a generating function method [86, 96, 97] or through the Markov chain Monte Carlo (MCMC) method. Recently, Kou et al. [124] proposed TopMCMC for calculating the statistical significance of each PrSM.
To improve accuracy while reducing running time, TopMCMC combines a Markov chain random walk algorithm with a greedy algorithm. The method achieves high accuracy in estimating E-values from TDMS and in estimating the FDR. It is superior to the generating function method in separating correct from incorrect identifications because it estimates probabilities at the protein level. However, TopMCMC cannot report accurate scores when identifying PrSMs with many variable PTMs, which affects the accuracy of the statistical significance.

FDR [98, 125, 126] is the expected fraction of false-positive assignments among the identified PrSMs. As shown in Figure 6, an equal number of decoy sequences is obtained by reversing or shuffling the target protein sequences. The target and decoy sequences are then combined into a target–decoy database, against which the spectra are searched. The FDR, expressed as a percentage, can be estimated as the number of decoy matches divided by the number of target matches:

$$\mathrm{FDR}=100\times \frac{N_{\mathrm{reverse}}}{N_{\mathrm{forward}}} \tag{8}$$

where $N_{\mathrm{reverse}}$ is the number of decoy matches and $N_{\mathrm{forward}}$ is the number of target matches. A selected FDR threshold (such as 1% FDR) is typically used to filter the identified PrSMs.

Figure 6. The target–decoy strategy.

Comparisons between proteoform identification methods

FDR analysis

FDR has become a widely used measure for evaluating PrSMs. The FDR threshold is set to 1% in the experiment. The numbers of PrSMs, proteoforms, and proteins identified on the two data sets by pTop, TopPIC, TopMG, and MSPathFinder are shown in Supplementary Tables 4 and 5, respectively. As shown in Supplementary Table 4, pTop identifies the largest numbers of PrSMs, proteoforms, and proteins.
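For readers who wish to reproduce the filtering step, the target–decoy estimate of Equation (8) and the 1% threshold can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used by any of the four tools: the `PrSM` record, its fields, and the convention that a higher score is better are hypothetical, and real tools may rank by E-value or use other estimators.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PrSM:
    """Hypothetical proteoform-spectrum match record (field names are assumptions)."""
    spectrum_id: int
    protein_id: str
    score: float      # assumed: higher means a better match
    is_decoy: bool    # True if the match is to a decoy sequence


def make_decoy(sequence: str) -> str:
    """Build a decoy sequence by reversing the target sequence."""
    return sequence[::-1]


def fdr_percent(n_decoy: int, n_target: int) -> float:
    """Equation (8): FDR = 100 * N_reverse / N_forward."""
    return 100.0 * n_decoy / n_target if n_target else 0.0


def filter_at_fdr(prsms: List[PrSM], max_fdr_percent: float = 1.0) -> List[PrSM]:
    """Keep target PrSMs above the most permissive score cutoff whose FDR is within the threshold."""
    ranked = sorted(prsms, key=lambda p: p.score, reverse=True)
    n_target = n_decoy = 0
    best_cut = 0
    for i, p in enumerate(ranked, start=1):
        if p.is_decoy:
            n_decoy += 1
        else:
            n_target += 1
        # Remember the deepest position in the ranking that still satisfies the threshold.
        if n_target and fdr_percent(n_decoy, n_target) <= max_fdr_percent:
            best_cut = i
    return [p for p in ranked[:best_cut] if not p.is_decoy]
```

The sketch scans the whole ranked list and keeps the deepest cutoff that still meets the threshold, rather than stopping at the first violation, because the estimated FDR is not monotonic in the score cutoff.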
From the experimental results, we found that TopPIC, TopMG, and MSPathFinder report at most one PrSM per experimental spectrum, whereas pTop can report multiple PrSMs per spectrum. Since each experimental spectrum corresponds to only one real protein sequence, we retain only one PrSM per spectrum in the pTop results, which leaves 1736 PrSMs. Although pTop identifies the most PrSMs, this alone does not prove that it is better than the other methods. As shown in Supplementary Table 3, the precursor error set by pTop is larger than that of the other methods, which leads to a larger error in the identification results. When we reduce the precursor error set by pTop, the number of identified proteoforms decreases. Although MSPathFinder identifies fewer PrSMs, its identification results are more accurate. There are two likely reasons why it identifies fewer. (1) According to the MSPathFinder team, the experiment specifies five PTM types, and each modification applies to 1–3 amino acid types. When all five PTMs are searched together, too many dynamic modifications can occur at the same residue, which produces too many possible candidates to score, decreases the number of identified PrSMs and greatly increases the running time. (2) As shown in Supplementary Table 3, the precursor error set by MSPathFinder is smaller than that of the other methods, so the error of the identification results is small. The MSPathFinder team suggested that we search separately for each of the five PTM types. For acetylation, methylation, dimethylation, trimethylation, and phosphorylation, MSPathFinder identified 2019, 2146, 2164, 2123, and 1973 PrSMs; 895, 1492, 1228, 813 and 495 proteoforms; and 187, 148, 179, 140 and 214 proteins, respectively.
Because searching for each modification type separately yields proteoforms that contain only one modification type, these results cannot be compared with those of the other methods. We therefore run MSPathFinder with the five modifications combined and compare those results with the other methods. We also run experiments on the WHIM2-P32 data set. As shown in Supplementary Table 5, TopPIC identifies the largest numbers of PrSMs, proteoforms, and proteins. Because TopPIC is effective at identifying proteoforms with unknown modifications, most of the proteoforms it identifies contain unknown modifications. Whereas the other methods identify only the modifications given in Supplementary Tables 1 and 2, TopPIC identifies many proteoforms that the other methods do not, and the accuracy of these proteoforms remains to be tested.

Overlap analysis

Since proteoform identification lacks a gold-standard data set, the number of identifications alone is not a sufficient criterion for judging the methods. To better compare the methods, we examine the overlap among the four methods, as shown in Supplementary Figures 1 and 2. Supplementary Figure 1 shows the results on the histone H3.1 data set. Supplementary Figure 1A shows the overlap of PrSMs identified by the four methods, and Supplementary Figure 1B shows the overlap of proteoforms. Because the results of TopPIC and TopMG do not give specific sites for PTMs, we compare the proteoform overlap between TopPIC and TopMG and, separately, between pTop and MSPathFinder. Few proteoforms are co-identified: pTop and MSPathFinder share only one proteoform, and TopPIC and TopMG share only eight, probably because TopPIC’s output contains unknown mass shifts. Therefore, the current state of proteoform characterization is not satisfactory.
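An overlap analysis of this kind reduces to set operations over the identifiers reported by each tool. The sketch below assumes that each tool's results have already been reduced to a set of comparable string keys (for example, a protein accession joined with its modifications); how such keys are built is tool-specific and is not shown. The function names and the toy data are illustrative only and do not reproduce the results in the Supplementary Figures.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def pairwise_overlap(results: Dict[str, Set[str]]) -> Dict[Tuple[str, str], int]:
    """Number of identifiers shared by each pair of tools."""
    return {
        (a, b): len(results[a] & results[b])
        for a, b in combinations(sorted(results), 2)
    }


def common_to_all(results: Dict[str, Set[str]]) -> Set[str]:
    """Identifiers reported by every tool (the centre of the Venn diagram)."""
    sets = list(results.values())
    return set.intersection(*sets) if sets else set()


def unique_to(results: Dict[str, Set[str]], tool: str) -> Set[str]:
    """Identifiers reported by one tool and by no other tool."""
    others = set().union(*(ids for name, ids in results.items() if name != tool))
    return results[tool] - others


if __name__ == "__main__":
    # Toy identifiers, not real search results.
    toy = {
        "toolA": {"P1|Ac@K9", "P1|Me@K4", "P2"},
        "toolB": {"P1|Me@K4", "P2", "P3"},
        "toolC": {"P2", "P4"},
    }
    print(pairwise_overlap(toy))
    print(common_to_all(toy))
    print(unique_to(toy, "toolC"))
```

The choice of key determines what counts as the same proteoform. Tools that do not report modification sites can only be compared with keys that omit the site, which is why the comparison in the text is made separately for the two pairs of tools.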
Supplementary Figures 1C and 1D show the overlaps of the proteins identified by the four methods and of the experimental spectra that they identify, respectively. As shown in Supplementary Figure 1C, the four methods identify four proteins in common. Supplementary Figure 1D illustrates that the methods perform differently on different experimental spectra. The Venn diagrams indicate that the performance of proteoform identification methods based on TDMS technology needs to be further improved. We also perform an overlap analysis of the identification results on the WHIM2-P32 data set. Supplementary Figure 2A shows the overlap of PrSMs identified by the four methods on the WHIM2-P32 data set, Supplementary Figure 2B shows the overlap of proteoforms, and Supplementary Figures 2C and 2D show the overlaps of the identified proteins and of the identified experimental spectra, respectively. Supplementary Figures 1 and 2 both show that the overlap among the methods is not ideal, which further indicates that the performance of proteoform identification methods based on TDMS technology needs to be improved.

Running time analysis

Next, we compare the running time of each method on the histone H3.1 and WHIM2-P32 data sets, as shown in Supplementary Tables 6 and 7, respectively. Supplementary Table 6 shows that TopPIC runs faster than the other methods on the histone H3.1 data set. Since pTop cannot simultaneously identify spectra from collision-induced dissociation (CID) and electron transfer dissociation (ETD), we run two experiments with this method, and the running time of pTop is the sum of the two. Because of the large number of modifications in the histone data set, we divided the modification types into five subclasses and ran a separate MSPathFinder experiment for each.
The five running times were 14 h 40.53 min, 14 h 24.99 min, 13 h 58.19 min, 11 h 1.97 min, and 16 h 42.12 min, respectively, which shows that allowing numerous modifications in an MSPathFinder experiment severely affects the running time. Supplementary Table 7 shows that pTop runs faster than the other methods on the WHIM2-P32 data set. The search space of TopPIC increases exponentially with the number of shifts allowed, whereas the size of the search space in pTop depends on the number of modification combinations that satisfy the precursor mass deviation constraint. For example, a protein sequence of length 100 may produce more than 10 000 candidate proteoforms in TopPIC, while only a few hundred candidate proteoforms are produced in pTop, so the running time of TopPIC is longer than that of pTop.

Challenge and future work

Although separation technology and MS of intact proteins have developed rapidly in TD proteomics, algorithms and software systems are still lacking. Many identification algorithms have been developed in BU proteomics, but only a few software tools can process TD data directly. Therefore, the field badly needs new algorithms to improve the performance of proteoform identification. New software can be developed from four aspects: deconvolution algorithms, identification algorithms, filtering algorithms, and localization scoring algorithms. Although the existing algorithms already address proteoform identification for high-throughput data, many difficulties remain in practice. Four significant problems remain unresolved:
(i) In the MS data preprocessing stage, it is difficult to design a deconvolution algorithm that accurately extracts monoisotopic peaks, because raw files, the input of deconvolution algorithms, contain a large number of noise peaks and overlapping peaks, and most monoisotopic peaks have low intensity.
The accuracy of monoisotopic peak extraction directly affects the performance of proteoform identification.
(ii) Existing methods can identify only a few modifications per proteoform. Currently, the number of modifications handled by TDMS-based proteoform identification methods is limited to a single digit, and for unknown PSAs it is limited to two. In reality, any amino acid site on a protein may be modified, so the number of modifications in a proteoform can be much larger than current technology can identify. To improve accuracy, identification methods should support a larger number of modifications per proteoform.
(iii) Most proteoform identification algorithms are time-consuming. In theory, a protein combined with multiple modifications leads to a combinatorial explosion, which makes proteoform identification a challenging computational problem. Comparing experimental spectra with theoretical sequences raises the computational complexity further. Although many methods have adopted dynamic programming algorithms to improve the computational speed, reducing the computational complexity is still one of the essential issues to be solved in this field.
(iv) Few algorithms can localize proteoforms automatically and accurately. While there are many methods to identify proteoforms based on TDMS techniques, the localization of proteoforms still relies primarily on manual annotation. Moreover, the current methods for automatic localization are generally not highly accurate. For example, the current MIScore method is less than 40% reliable in the experiment.

For future work, we can focus on the following four perspectives:
(i) Although there are many deconvolution methods, noise peaks and overlapping peaks remain a source of poor proteoform identification performance. A dynamic S/N ratio threshold selection algorithm could be used to filter noise peaks more accurately.
At present, most deconvolution algorithms use the ‘averagine’ model to generate theoretical isotopic distributions, which can differ considerably from actual isotopomer envelopes. A better algorithm could be proposed to generate the theoretical isotopic distribution.
(ii) More attention should be paid to combining different identification algorithms. Because different identification algorithms have different advantages, combining them effectively can bring out the benefits of each, thereby improving the performance of proteoform identification. Peak intensities also provide valuable information for the characterization of proteoforms, so adding peak intensity information to the algorithms may improve performance.
(iii) To reduce the running time, existing filtering algorithms can be combined with identification algorithms to reduce the number of comparisons between experimental spectra and theoretical sequences. New filtering algorithms can also be proposed to reduce the search space of protein sequences.
(iv) The accuracy of algorithms that automatically localize proteoforms should be improved. The existing methods use an averaging approach when calculating the prior probability of PTMs at different sites, which assumes that the prior probability is the same at every site, whereas in practice it differs. Therefore, relevant biological knowledge can be applied to determine which types of PTMs different amino acids tend to carry and to estimate the prior probability of PTMs at different sites, thereby improving the accuracy of the confidence score and the localization performance.

The identification of proteoforms based on TDMS is important both theoretically and practically.
At present, studies in this field are still in their infancy, and many challenging problems remain to be addressed. We believe that this review could provide a starting point for those who are interested in this field and help them develop more powerful software tools for proteoform characterization.

Key Points

- The accuracy of deconvolution results affects the subsequent analysis in proteoform characterization, and some popular deconvolution algorithms are summarized.
- Considering that proteoform identification is the most important process in characterizing proteoforms, the different strategies of the existing proteoform identification methods are discussed and classified based on their searching algorithms and computational models.
- Some strategies adopted in characterizing proteoforms are reviewed, including various filtering algorithms for improving the performance of identification and some scoring methods for locating PTM sites in proteoforms.
- The various indicators used to evaluate the performance of proteoform identification based on top-down mass spectrometry are summarized, and four proteoform identification methods proposed in recent years are comprehensively evaluated and compared.

Funding

National Natural Science Foundation of China (Nos. 61832019, 61772197, 61972185); Natural Science Foundation of Hunan Province of China (No. 2018JJ2262); Hunan Provincial Science and Technology Program (No. 2018WK4001); Scientific Research Fund of Hunan Provincial Education Department (Nos. 15CY007, 19A316); Hunan Provincial Innovation Foundation for Postgraduate (No. CX2018B313).

Jiancheng Zhong is an associate professor in the College of Information Science and Engineering, Hunan Normal University, Changsha, Hunan, China. He is a member of Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, 410083, China. His current research interests include computational proteomics and data mining.
Yusui Sun is a master student in the College of Information Science and Engineering, Hunan Normal University, Changsha, Hunan, China. Her current research interests include bioinformatics, data mining and machine learning. Minzhu Xie is a professor in the College of Information Science and Engineering, Hunan Normal University, Changsha, Hunan, China. His primary research interests include bioinformatics and data mining. Wei Peng is a professor in Kunming University of Science and Technology, Kunming, Yunnan, China. Her current research interests include molecular systems biology, biological system identification and data mining. Chushu Zhang is a student in the College of Information Science and Engineering, Hunan Normal University, Changsha, Hunan, China. Her current research interests include bioinformatics, data mining and machine learning. Fang-Xiang Wu is a professor in the College of Engineering and the Department of Computer Science at University of Saskatchewan, Saskatoon, Canada. His current research interests include bioinformatics and artificial intelligence. Jianxin Wang is a professor in Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering at Central South University, Changsha, Hunan, China. His research interests include computational genomics and proteomics. References 1. Dong X , Sumandea CA , Chen YC , et al. Augmented phosphorylation of cardiac troponin i in hypertensive heart failure . J Biol Chem 2012 ; 287 : 848 – 57 . Google Scholar Crossref Search ADS PubMed WorldCat 2. Peleg S , Sananbenesi F , Zovoilis A , et al. Altered histone acetylation is associated with age-dependent memory impairment in mice . Science 2010 ; 328 : 753 – 6 . Google Scholar Crossref Search ADS PubMed WorldCat 3. Smith LM , Kelleher NL . Consortium for top down proteomics. Proteoform: a single term describing protein complexity . Nat Methods 2013 ; 10 : 186 – 7 . Google Scholar Crossref Search ADS PubMed WorldCat 4. 
Aebersold R , Agar JN , Amster IJ , et al. How many human proteoforms are there? Nat Chem Biol 2018 ; 14 : 206 . Google Scholar Crossref Search ADS PubMed WorldCat 5. Smith LM , Kelleher NL . Proteoforms as the next proteomics currency . Science 2018 ; 359 : 1106 – 7 . Google Scholar Crossref Search ADS PubMed WorldCat 6. Yang M , Luo H , Li Y , et al. Overlap matrix completion for predicting drug-associated indications . PLoS Comput Biol 2019 ; 15 : e1007541 . Google Scholar Crossref Search ADS PubMed WorldCat 7. Yan C , Wang J , Ni P , et al. Dnrlmf-mda: predicting microrna-disease associations based on similarities of micrornas and diseases . IEEE/ACM Trans Comput Biol Bioinform 2019 ; 16 : 233 – 43 . Google Scholar Crossref Search ADS PubMed WorldCat 8. Yang M , Luo H , Li Y , et al. Drug repositioning based on bounded nuclear norm regularization . Bioinformatics 2019 ; 35 : i455 – 63 . Google Scholar Crossref Search ADS PubMed WorldCat 9. Moradian A , Kalli A , Sweredoski MJ , et al. The top-down, middle-down, and bottom-up mass spectrometry approaches for characterization of histone variants and their post-translational modifications . Proteomics 2014 ; 14 : 489 – 97 . Google Scholar Crossref Search ADS PubMed WorldCat 10. Yuan ZF , Arnaudo AM , Garcia BA . Mass spectrometric analysis of histone proteoforms . Annu Rev Anal Chem 2014 ; 7 : 113 – 28 . Google Scholar Crossref Search ADS WorldCat 11. Phanstiel D , Brumbaugh J , Berggren WT , et al. Mass spectrometry identifies and quantifies 74 unique histone H4 isoforms in differentiating human embryonic stem cells . Proc Natl Acad Sci U S A 2008 ; 105 : 4093 – 8 . Google Scholar Crossref Search ADS PubMed WorldCat 12. Tran JC , Zamdborg L , Ahlf DR , et al. Mapping intact protein isoforms in discovery mode using top-down proteomics . Nature 2011 ; 480 : 254 – 8 . Google Scholar Crossref Search ADS PubMed WorldCat 13. Chemistry CBT . Mass spectrometry: bottom-up or top-down? Science 2006 ; 314 : 65 – 6 . 
14. Huang T, Wang J, Yu W, et al. Protein inference: a review. Brief Bioinform 2012; 13: 586–614.
15. Zhong J, Wang J, Ding X, et al. Protein inference from the integration of tandem ms data and interactome networks. IEEE/ACM Trans Comput Biol Bioinform 2016; 14: 1399–409.
16. Baliban RC, DiMaggio PA, Plazas-Mayorca MD, et al. A novel approach for untargeted post-translational modification identification using integer linear optimization and tandem mass spectrometry. Mol Cell Proteomics 2010; 9: 764–79.
17. DiMaggio PA, Young NL, Baliban RC, et al. A mixed integer linear optimization framework for the identification and quantification of targeted post-translational modifications of highly modified proteins using multiplexed electron transfer dissociation tandem mass spectrometry. Mol Cell Proteomics 2009; 8: 2527–43.
18. Catherman AD, Skinner OS, Kelleher NL. Top down proteomics: facts and perspectives. Biochem Biophys Res Commun 2014; 445: 683–93.
19. Lanucara F, Eyers CE. Top-down mass spectrometry for the analysis of combinatorial post-translational modifications. Mass Spectrom Rev 2013; 32: 27–42.
20. Tsai YS, Scherl A, Shaw JL, et al. Precursor ion independent algorithm for top-down shotgun proteomics. J Am Soc Mass Spectrom 2009; 20: 2154–66.
21. Łacki MK, Lermyte F, Miasojedow B, et al. Masstodon: a tool for assigning peaks and modeling electron transfer reactions in top-down mass spectrometry. Anal Chem 2019; 91: 1801–7.
22. Mann M, Meng CK, Fenn JB. Interpreting mass spectra of multiply charged ions. Anal Chem 1989; 61: 1702–8.
23. Tseng YH, Uetrecht C, Yang SC, et al. Game-theory-based search engine to automate the mass assignment in complex native electrospray mass spectra. Anal Chem 2013; 85: 11275–83.
24. Zheng H, Ojha PC, McClean S, et al. Heuristic charge assignment for deconvolution of electrospray ionization mass spectra. Rapid Commun Mass Spectrom 2003; 17: 429–36.
25. Reinhold BB, Reinhold VN. Electrospray ionization mass spectrometry: deconvolution by an entropy-based algorithm. J Am Soc Mass Spectrom 1992; 3: 207–15.
26. Hagen JJ, Monnig CA. Method for estimating molecular mass from electrospray spectra. Anal Chem 1994; 66: 1877–83.
27. Zhang Z, Marshall AG. A universal algorithm for fast and automated charge state deconvolution of electrospray mass-to-charge ratio spectra. J Am Soc Mass Spectrom 1998; 9: 225–33.
28. Morgner N, Robinson CV. Massign: an assignment strategy for maximizing information from the mass spectra of heterogeneous protein assemblies. Anal Chem 2012; 84: 2939–48.
29. Stengel F, Baldwin AJ, Bush MF, et al. Dissecting heterogeneous molecular chaperone complexes using a mass spectrum deconvolution approach. Chem Biol 2012; 19: 599–607.
30. Sivalingam GN, Yan J, Sahota H, et al. Amphitrite: a program for processing travelling wave ion mobility mass spectrometry data. Int J Mass Spectrom 2013; 345: 54–62.
31. van Breukelen B, Barendregt A, Heck AJ, et al. Resolving stoichiometries and oligomeric states of glutamate synthase protein complexes with curve fitting and simulation of electrospray mass spectra. Rapid Commun Mass Spectrom 2006; 20: 2490–6.
32. Hilton GR, Hochberg GK, Laganowsky A, et al. C-terminal interactions mediate the quaternary dynamics of αb-crystallin. Philos Trans R Soc B 2013; 368: 20110405.
33. Horn DM, Zubarev RA, McLafferty FW. Automated reduction and interpretation of high resolution electrospray mass spectra of large molecules. J Am Soc Mass Spectrom 2000; 11: 320–32.
34. Kaur P, O’Connor PB. Algorithms for automatic interpretation of high resolution mass spectra. J Am Soc Mass Spectrom 2006; 17: 459–68.
35. Chen L, Sze SK, Yang H. Automated intensity descent algorithm for interpretation of complex high-resolution mass spectra. Anal Chem 2006; 78: 5006–18.
36. Carvalho PC, Xu T, Han X, et al. YADA: a tool for taking the most out of high-resolution spectra. Bioinformatics 2009; 25: 2734–6.
37. Liu X, Inbar Y, Dorrestein PC, et al. Deconvolution and database search of complex tandem mass spectra of intact proteins: a combinatorial approach. Mol Cell Proteomics 2010; 9: 2772–82.
38. Guner H, Close PL, Cai W, et al. MASH suite: a user-friendly and versatile software interface for high-resolution mass spectrometry data interpretation and visualization. J Am Soc Mass Spectrom 2014; 25: 464–70.
39. Slawski M, Hussong R, Tholey A, et al. Isotope pattern deconvolution for peptide mass spectrometry by non-negative least squares/least absolute deviation template matching. BMC Bioinformatics 2012; 13: 291.
40. Senko MW, Beu SC, McLafferty FW. Automated assignment of charge states from resolved isotopic peaks for multiply charged ions. J Am Soc Mass Spectrom 1995; 6: 52–6.
41. Mayampurath AM, Jaitly N, Purvine SO, et al. DeconMSn: a software tool for accurate parent ion monoisotopic mass determination for tandem mass spectra. Bioinformatics 2008; 24: 1021–3.
42. Jones NC, Pevzner PA. An Introduction to Bioinformatics Algorithms. Cambridge, MA: MIT Press, 2004.
43. Sun RX, Luo L, Wu L, et al. Ptop 1.0: a high-accuracy and high-efficiency search engine for intact protein identification. Anal Chem 2016; 88: 3082–90.
44. Yuan ZF, Liu C, Wang HP, et al. pParse: a method for accurate determination of monoisotopic peaks in high-resolution mass spectra. Proteomics 2012; 12: 226–35.
45. Park J, Piehowski PD, Wilkins C, et al. Informed-proteomics: open-source software package for top-down proteomics. Nat Methods 2017; 14: 909.
46. Senko MW, Beu SC, McLafferty FW. Determination of monoisotopic masses and ion populations for large biomolecules from resolved isotopic distributions. J Am Soc Mass Spectrom 1995; 6: 229–33.
47. Kou Q, Wu S, Liu X. A new scoring function for top-down spectral deconvolution. BMC Genomics 2014; 15: 1140.
48. Avtonomov DM, Polasky DA, Ruotolo BT, et al. Imtbx and grppr: software for top-down proteomics utilizing ion mobility-mass spectrometry. Anal Chem 2018; 90: 2369–75.
49. Rockwood AL, Haimi P. Efficient calculation of accurate masses of isotopic peaks. J Am Soc Mass Spectrom 2006; 17: 415–9.
50. Jaitly N, Mayampurath A, Littlefield K, et al. Decon2LS: an open-source software package for automated processing and visualization of high resolution mass spectrometry data. BMC Bioinformatics 2009; 10: 87.
51. Slysz GW, Baker ES, Shah AR, et al. The decontools framework: an application programming interface enabling flexibility in accurate mass and time tag workflows for proteomics and metabolomics. In: Proceedings of the 58th Annual ASMS Conference on Mass Spectrometry and Allied Topics, 2010.
52. Park K, Yoon JY, Lee S, et al. Isotopic peak intensity ratio based algorithm for determination of isotopic clusters and monoisotopic masses of polypeptides from high-resolution mass spectrometric data. Anal Chem 2008; 80: 7294–303.
53. Zabrouskov V, Senko MW, Du Y, et al. New and automated MSn approaches for top-down identification of modified proteins. J Am Soc Mass Spectrom 2005; 16: 2027–38.
54. Zamdborg L, LeDuc RD, Glowacz KJ, et al. Prosight ptm 2.0: improved protein identification and characterization for top down mass spectrometry. Nucleic Acids Res 2007; 35: W701–6.
55. Cai W, Guner H, Gregorich ZR, et al. Mash suite pro: a comprehensive software tool for top-down proteomics. Mol Cell Proteomics 2016; 15: 703–14.
56. Ntai I, LeDuc RD, Fellers RT, et al. Integrated bottom-up and top-down proteomics of patient-derived breast tumor xenografts. Mol Cell Proteomics 2016; 15: 45–56.
57. Tong R, Infusini G, Cui W, et al. Bupid-top-down: database search and assignment of top-down ms/ms data. In: Proceedings of the 57th American Society Conference on Mass Spectrometry and Allied Topics, Philadelphia, PA, 2009.
58. Marty MT, Baldwin AJ, Marklund EG, et al. Bayesian deconvolution of mass and ion mobility spectra: from binary interactions to polydisperse ensembles. Anal Chem 2015; 87: 4370–6.
59. Wohlschlager T, Scheffler K, Forstenlehner IC, et al. Native mass spectrometry combined with enzymatic dissection unravels glycoform heterogeneity of biopharmaceuticals. Nat Commun 2018; 9: 1713.
60. Street JM, Barran PE, Mackay CL, et al. Identification and proteomic profiling of exosomes in human cerebrospinal fluid. J Transl Med 2012; 10: 5.
61. Jiang J, Lazarus MB, Pasquina L, et al. A neutral diphosphate mimic crosslinks the active site of human o-glcnac transferase. Nat Chem Biol 2012; 8: 72.
62. Yanagisawa T. Fractional determination of co-eluted compounds using a new data processing method for photodiode array detector. Shimadzu J 2014; 2: 39–42.
63. Belov AM, Zang L, Sebastiano R, et al. Complementary middle-down and intact monoclonal antibody proteoform characterization by capillary zone electrophoresis–mass spectrometry. Electrophoresis 2018; 39: 2069–82.
64. Mert H, Tunca U, Hizal G. Thiophenol derivatives as a reducing agent for in situ generation of cu(I) species via electron transfer reaction in copper-catalyzed living/controlled radical polymerization of styrene. J Polym Sci A Polym Chem 2006; 44: 5923–32.
65. Dörre K, Olczak M, Wada Y, et al. A new case of UDP-galactose transporter deficiency (SLC35A2-CDG): molecular basis, clinical phenotype, and therapeutic approach. J Inherit Metab Dis 2015; 38: 931–40.
66. Gersch M, Hackl MW, Dubiella C, et al. A mass spectrometry platform for a streamlined investigation of proteasome integrity, posttranslational modifications, and inhibitor binding. Chem Biol 2015; 22: 404–11.
67. Meng F, Cargile BJ, Miller LM, et al. Informatics and multiplexing of intact protein identification in bacteria and the archaea. Nat Biotechnol 2001; 19: 952.
68. Taylor GK, Kim YB, Forbes AJ, et al. Web and database software for identification of intact proteins using “top down” mass spectrometry. Anal Chem 2003; 75: 4081–6.
69. LeDuc RD, Taylor GK, Kim YB, et al. Prosight ptm: an integrated environment for protein identification and characterization by top-down mass spectrometry. Nucleic Acids Res 2004; 32: W340–5.
70. Fellers RT, Greer JB, Early BP, et al. Prosight lite: graphical software to analyze top-down mass spectrometry data. Proteomics 2015; 15: 1235–8.
71. Karabacak NM, Li L, Tiwari A, et al. Sensitive and specific identification of wild type and variant proteins from 8 to 669 kda using top-down mass spectrometry. Mol Cell Proteomics 2009; 8: 846–56.
72. Li L, Tian Z. Interpreting raw biological mass spectra using isotopic mass-to-charge ratio and envelope fingerprinting. Rapid Commun Mass Spectrom 2013; 27: 1267–77.
73. Solntsev SK, Shortreed MR, Frey BL, et al. Enhanced global post-translational modification discovery with metamorpheus. J Proteome Res 2018; 17: 1844–51.
74. Toby TK, Fornelli L, Srzentić K, et al. A comprehensive pipeline for translational top-down proteomics from a single blood draw. Nat Protoc 2019; 14: 119.
75. Yang X, Tschaplinski TJ, Hurst GB, et al. Discovery and annotation of small proteins using genomics, proteomics and computational approaches. Genome Res 2011; 21: 634–41.
76. Castellana NE, Payne SH, Shen Z, et al. Discovery and revision of arabidopsis genes by proteogenomics. Proc Natl Acad Sci 2008; 105: 21034–8.
77. Sheynkman GM, Shortreed MR, Frey BL, et al. Discovery and mass spectrometric analysis of novel splice-junction peptides using rna-seq. Mol Cell Proteomics 2013; 12: 2341–53.
78. Evans VC, Barker G, Heesom KJ, et al. De novo derivation of proteomes from transcriptomes for transcript and protein identification. Nat Methods 2012; 9: 1207.
79. Colinge J, Masselot A, Carbonell P, et al. Insilicospectro: an open-source proteomics library. J Proteome Res 2006; 5: 619–24.
80. Eng JK, McCormack AL, Yates JR. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom 1994; 5: 976–89.
81. Frank AM, Pesavento JJ, Mizzen CA, et al. Interpreting top-down mass spectra using spectral alignment. Anal Chem 2008; 80: 2499–505.
82. Liu X, Sirotkin Y, Shen Y, et al. Protein identification using top-down spectra. Mol Cell Proteomics 2012; 11: M111.008524.
83. Liu X, Hengel S, Wu S, et al. Identification of ultramodified proteins using top-down tandem mass spectra. J Proteome Res 2013; 12: 5830–8.
84. Kou Q, Xun L, Liu X. Toppic: a software tool for top-down mass spectrometry-based proteoform identification and characterization. Bioinformatics 2016; 32: 3495–7.
85. Liu X, Mammana A, Bafna V. Speeding up tandem mass spectral identification using indexes. Bioinformatics 2012; 28: 1692–7.
86. Liu X, Segar MW, Shuai CL, et al. Spectral probabilities of top-down tandem mass spectra. BMC Genomics 2014; 15: S9.
87. Kou Q, Zhu B, Wu S, et al. Characterization of proteoforms with unknown post-translational modifications using the miscore. J Proteome Res 2016; 15: 2422–32.
88. Kolmogorov M, Liu X, Pevzner PA. Spectrogene: a tool for proteogenomic annotations using top-down spectra. J Proteome Res 2015; 15: 144–51.
89. Basharat AR, Iman K, Khalid MF, et al. Spectrum–a matlab toolbox for proteoform identification from top-down proteomics data. Sci Rep 2019; 9: 1–14.
90. Chi H, Chen H, He K, et al. Pnovo+: de novo peptide sequencing using complementary hcd and etd tandem mass spectra. J Proteome Res 2012; 12: 615–25.
91. Kou Q, Wu S, Tolić N, et al. A mass graph-based approach for the identification of modified proteoforms using top-down tandem mass spectra. Bioinformatics 2017; 33: 1309–16.
92. Vyatkina K, Wu S, Dekker LJ, et al. Top-down analysis of protein samples by de novo sequencing techniques. Bioinformatics 2016; 32: 2753–9.
93. Vyatkina K, Wu S, Dekker LJ, et al. De novo sequencing of peptides from top-down tandem mass spectra. J Proteome Res 2015; 14: 4450–62.
94. Sjoberg J. 60th asms conference on mass spectrometry and allied topics. J Am Soc Mass Spectrom 2012; 23: 1–252.
95. Frank A, Tanner S, Bafna V, et al. Peptide sequence tags for fast database search in mass-spectrometry. J Proteome Res 2005; 4: 1287–95.
96. Kim S, Gupta N, Pevzner PA. Spectral probabilities and generating functions of tandem mass spectra: a strike against decoy databases. J Proteome Res 2008; 7: 3354–63.
97. Kim S, Pevzner PA. Ms-gf+ makes progress towards a universal database search tool for proteomics. Nat Commun 2014; 5: 5277.
98. Elias JE, Gygi SP. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat Methods 2007; 4: 207.
99. Zhu K, Liu X. A graph-based approach for proteoform identification and quantification using top-down homogeneous multiplexed tandem mass spectra. BMC Bioinformatics 2018; 19: 161.
100. Chick JM, Kolippakkam D, Nusinow DP, et al. A mass-tolerant database search identifies a large proportion of unassigned spectra in shotgun proteomics as modified peptides. Nat Biotechnol 2015; 33: 743.
101. Mann M, Wilm M. Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal Chem 1994; 66: 4390–9.
102. Tanner S, Shu H, Frank A, et al. Inspect: identification of posttranslationally modified peptides from tandem mass spectra. Anal Chem 2005; 77: 4626–39.
103. Cao X, Nesvizhskii AI. Improved sequence tag generation method for peptide identification in tandem mass spectrometry. J Proteome Res 2008; 7: 4422–34.
104. Tabb DL, Ma ZQ, Martin DB, et al. Directag: accurate sequence tags from peptide ms/ms through statistical scoring. J Proteome Res 2008; 7: 3838–46.
105. Kim S, Gupta N, Bandeira N, et al. Spectral dictionaries: integrating de novo peptide sequencing with database search of tandem mass spectra. Mol Cell Proteomics 2009; 8: 53–69.
106. Jeong K, Kim S, Bandeira N, et al. Gapped spectral dictionaries and their applications for database searches of tandem mass spectra. Mol Cell Proteomics 2011; 10: M110.002220.
107. Deng F, Wang L, Liu X. An efficient algorithm for the blocked pattern matching problem. Bioinformatics 2014; 31: 532–8.
108. Chi H, He K, Yang B, et al. Pfind–alioth: a novel unrestricted database search algorithm to improve the interpretation of high-resolution ms/ms data. J Proteomics 2015; 125: 89–97.
109. Kong AT, Leprevost FV, Avtonomov DM, et al. Msfragger: ultrafast and comprehensive peptide identification in mass spectrometry–based proteomics. Nat Methods 2017; 14: 513.
110. Kou Q, Wu S, Liu X. Systematic evaluation of protein sequence filtering algorithms for proteoform identification using top-down mass spectrometry. Proteomics 2018; 18: 1700306.
111. Lima DB, Silva AR, Dupré M, et al. Top-down garbage collector: a tool for selecting high-quality top-down proteomics mass spectra. Bioinformatics 2019; 35: 3489–90.
112. Beausoleil SA, Villén J, Gerber SA, et al. A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nat Biotechnol 2006; 24: 1285.
113. Olsen JV, Blagoev B, Gnad F, et al. Global, in vivo, and site-specific phosphorylation dynamics in signaling networks. Cell 2006; 127: 635–48.
114. Albuquerque CP, Smolka MB, Payne SH, et al. A multidimensional chromatography technology for in-depth phosphoproteome analysis. Mol Cell Proteomics 2008; 7: 1389–96.
115. Tanner S, Payne SH, Dasari S, et al. Accurate annotation of peptide modifications through unrestrictive database search. J Proteome Res 2007; 7: 170–81.
116. Bailey CM, Sweet SM, Cunningham DL, et al. Slomo: automated site localization of modifications from etd/ecd mass spectra. J Proteome Res 2009; 8: 1965–71.
117. Savitski MM, Lemeer S, Boesche M, et al. Confident phosphorylation site localization using the mascot delta score. Mol Cell Proteomics 2011; 10: M110.003830.
118. Taus T, Köcher T, Pichler P, et al. Universal and confident phosphorylation site localization using phosphors. J Proteome Res 2011; 10: 5354–62.
119. Chung C, Emili A, Frey BJ. Non-parametric bayesian approach to post-translational modification refinement of predictions from tandem mass spectrometry. Bioinformatics 2013; 29: 821–9.
120. Durbin KR, Tran JC, Zamdborg L, et al. Intact mass detection, interpretation, and visualization to automate top-down proteomics on a large scale. Proteomics 2010; 10: 3589–97.
121. LeDuc RD, Fellers RT, Early BP, et al. The c-score: a bayesian framework to sharply improve proteoform scoring in high-throughput top down proteomics. J Proteome Res 2014; 13: 3231–40.
122. Tian Z, Tolić N, Zhao R, et al. Enhanced top-down characterization of histone post-translational modifications. Genome Biol 2012; 13: R86.
123. Chambers MC, Maclean B, Burke R, et al. A cross-platform toolkit for mass spectrometry and proteomics. Nat Biotechnol 2012; 30: 918.
124. Kou Q, Wang Z, Lubeckyj R, et al. A markov chain Monte Carlo method for estimating the statistical significance of proteoform identifications by top-down mass spectrometry. J Proteome Res 2019; 18: 878–89.
125. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B Methodol 1995; 57: 289–300.
126. Storey JD. A direct approach to false discovery rates. J R Stat Soc Series B Stat Methodology 2002; 64: 479–98.
© The Author(s) 2020. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model) TI - Proteoform characterization based on top-down mass spectrometry JF - Briefings in Bioinformatics DO - 10.1093/bib/bbaa015 DA - 2003-10-01 UR - https://www.deepdyve.com/lp/oxford-university-press/proteoform-characterization-based-on-top-down-mass-spectrometry-jAiRXQCOEQ SP - 1 VL - Advance Article IS - DP - DeepDyve ER -