PROMALS: towards accurate multiple sequence alignments of distantly related proteins

Jimin Pei; Nick V. Grishin

doi:10.1093/bioinformatics/btm017

BIOINFORMATICS, Vol. 23 no. 7 2007, pages 802–808
ORIGINAL PAPER: Sequence analysis

Jimin Pei (1,*) and Nick V. Grishin (1,2)
(1) Howard Hughes Medical Institute and (2) Department of Biochemistry, The University of Texas Southwestern Medical Center at Dallas, 6001 Forest Park Road, Dallas, TX 75390-9050, USA
(*) To whom correspondence should be addressed.

Received on December 4, 2006; revised on January 12, 2007; accepted on January 17, 2007
Advance Access publication January 31, 2007
Associate Editor: Alex Bateman

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

ABSTRACT

Motivation: Accurate multiple sequence alignments are essential in protein structure modeling, functional prediction and efficient planning of experiments. Although the alignment problem has attracted considerable attention, preparation of high-quality alignments for distantly related sequences remains a difficult task.

Results: We developed PROMALS, a multiple alignment method that shows promising results for protein homologs with sequence identity below 10%, aligning close to half of the amino acid residues correctly on average. This is about three times more accurate than traditional pairwise sequence alignment methods. The PROMALS algorithm derives its strength from several sources: (i) sequence database searches to retrieve additional homologs; (ii) accurate secondary structure prediction; (iii) a hidden Markov model that uses a novel combined scoring of amino acids and secondary structures; (iv) probabilistic consistency-based scoring applied to progressive alignment of profiles. Compared to the best alignment methods that do not use secondary structure prediction and database searches (e.g. MUMMALS, ProbCons and MAFFT), PROMALS is up to 30% more accurate, with improvement being most prominent for highly divergent homologs. Compared to SPEM and HHalign, which also employ database searches and secondary structure prediction, PROMALS shows an accuracy improvement of several percent.

Availability: The PROMALS web server is available at: http://prodata.swmed.edu/promals/

Contact: [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Multiple sequence alignments have broad applications in sequence similarity searches, structure modeling and phylogenetic analysis (Altschul et al., 1997; Eddy, 1998; Ginalski and Rychlewski, 2003; Phillips et al., 2000). They also aid in experimental design by revealing conserved residues with potential functional importance. A variety of alignment methods that rely on different algorithms and scoring functions have been developed (Edgar and Batzoglou, 2006). A rigorous method that aligns all sequences simultaneously (Lipman et al., 1989) is computationally prohibitive for large sets of sequences. In contrast, a progressive method that aligns pairs of sequences and sequence groups along a tree is algorithmically simpler and much faster, requiring only N − 1 steps of pairwise alignments for N sequences. However, in progressive methods, alignment errors made at each step are propagated to subsequent steps. Many progressive methods use a scoring function called sum-of-pairs, i.e. a sum of amino acid substitution scores for pairs of amino acids between two positions (Edgar and Batzoglou, 2006; Thompson et al., 1994). Such a scoring function yields reasonable alignment quality for closely related sequences (identity above 40%). However, alignment quality drops rapidly with decreasing sequence similarity (Thompson et al., 1999).

Effective construction of multiple alignments with respect to accuracy and speed has been extensively researched in recent years. Refinement and consistency-based scoring are two major techniques to improve classical progressive methods. MUSCLE (Edgar, 2004) and MAFFT (Katoh et al., 2005) represent two recent methods that use extensive refinement to correct errors made in progressive steps. They both implement sum-of-pairs scores, which are easy to compute and offer the advantage of great speed. In T-COFFEE (Notredame et al., 2000), the scoring is derived by finding consistently aligned residue pairs in a library of pairwise alignments. Such consistency-based scoring functions can give better alignment quality than sum-of-pairs scores. Further improvement comes with a probabilistic treatment of consistency via pairwise hidden Markov models (HMMs), as first implemented in ProbCons (Do et al., 2005). MUMMALS (Pei and Grishin, 2006) builds on the success of probabilistic consistency by introducing HMMs with more states that capture local structural information. Consistency transformation requires operations on sequence triplets, and therefore is computationally intensive. By aligning similar sequences with general substitution matrices and aligning divergent sequence groups with profile-based consistency, PCMA (Pei et al., 2003) is able to achieve a balance between alignment accuracy and speed.

Even with refinement and consistency-based scoring, current methods still have difficulty in obtaining high-quality alignments when sequence identity drops below 20%. As homologous proteins can have very low sequence similarity while maintaining similar structures and functions (Murzin, 1998), aligning distantly related sequences is an important task. A recent trend in the multiple alignment field is to recruit various sources of sequence and structural information to improve alignment accuracy (Edgar and Batzoglou, 2006). Such sources include homologs detected in database searches (Katoh et al., 2005; Simossis and Heringa, 2005; Thompson et al., 2000), predicted secondary structure (Simossis and Heringa, 2005; Zhou and Zhou, 2005), and known 3D structures (O’Sullivan et al., 2004). Since additional homologs improve the quality of sequence profiles, and structural features such as secondary structure are generally more conserved than sequences, their usage can lead to improved alignment quality.

Here, we describe PROMALS, a multiple sequence alignment method that combines recent advances in computational approaches to tackle the difficult task of aligning divergent sequences. PROMALS improves probabilistic consistency-based scoring of profiles by utilizing predicted secondary structures and additional homologs found in database searches. To effectively combine these additional data, we developed and implemented a new hidden Markov model for profile-profile comparison, which scores both amino acid similarity and secondary structure similarity, and has local structure-dependent transition and emission probabilities. Like PCMA, PROMALS is made more computationally efficient by treating similar and divergent sequences with different alignment strategies. On several difficult data sets, we show that PROMALS gives the best alignment accuracy among leading methods such as SPEM, HHalign (Soding, 2005), MUMMALS, ProbCons and MAFFT.

2 METHODS

2.1 A hidden Markov model of profile–profile alignment

A classical pairwise HMM for aligning two sequences has three types of hidden states: a match state ‘M’ emitting a residue pair, an ‘X’ state emitting a residue in the first sequence and a ‘Y’ state emitting a residue in the second sequence (Durbin et al., 1998). ‘X’ and ‘Y’ states correspond to insertions or deletions in the two sequences. Our hidden Markov model for aligning two alignments (having profile representations) has the same architecture as a pairwise sequence HMM. In our model, an ‘M’ state emits a pair of positions instead of a pair of residues. For an ‘X’ or ‘Y’ state, a single position in the first alignment or in the second alignment is emitted, respectively. The emitted objects (observations) are amino acid frequency vectors and predicted secondary structure types.

We adopt a representation of amino acid sequence profile similar to the ones in PSI-BLAST (Altschul et al., 1997) and COMPASS (Sadreyev and Grishin, 2003). Two profile components are estimated for a position in an alignment: (i) effective frequencies of amino acids, and (ii) target frequencies of amino acids. The effective frequencies serve as the emitted objects (observations) in a position for the hidden Markov model. They are estimated from the position-specific independent counts (PSIC) of amino acids (Pei and Grishin, 2001; Sunyaev et al., 1999), which is a sequence-weighting scheme that corrects for biased similarities between sequences. If an amino acid is not present in a position, it has an effective frequency of zero. The target frequencies serve as the ‘hidden’ amino acid probabilistic generator for a position. The target frequencies are estimated from the effective frequencies, taking into account prior knowledge of amino acid substitution characteristics. The target frequency is a mixture (weighted average) between effective frequency and the pseudocount frequency (Altschul et al., 1997; Tatusov et al., 1994). Defined in this way, the target frequency of any amino acid, even if it is not present in a position, is always greater than zero. Details on derivation of the two profile components are in Supplementary Data.

For an ‘M’ state, the probability of emitting the observed amino acids for a position pair (i, j) is the product of two probabilities: (i) the probability of generating the effective frequencies of position i using the target frequencies of position j, and (ii) the probability of generating the effective frequencies of position j using the target frequencies of position i. For an ‘X’ or ‘Y’ state, the probability of emitting the observed amino acids in a position k is the probability of generating the effective frequencies of position k using the background amino acid frequencies in insertion regions. Besides amino acids, an ‘M’ state also emits a pair of predicted secondary structures, and an ‘X’ or ‘Y’ state also emits a single predicted secondary structure. The emission probability in a hidden state (‘M’, ‘X’ or ‘Y’) is a weighted product of amino acid emission probability and secondary structure emission probability. The relative weights for the scoring terms of amino acids and predicted secondary structures have been optimized to increase the alignment accuracy of the training sequence pairs. Details on emission probability formulas, parameter estimation and the algorithm for aligning two profiles with optimal posterior probabilities of position matches are described in Supplementary Data.

2.2 PROMALS multiple sequence alignment procedure

PROMALS (PROfile Multiple Alignment with predicted Local Structure) is a progressive method (Fig. 1). The alignment order is set by a tree built using a k-mer count method (Edgar, 2004). Like PCMA (Pei et al., 2003) and MUMMALS (Pei and Grishin, 2006), PROMALS has two alignment stages for easy and difficult alignments. In the first stage, highly similar sequences are progressively aligned in a fast way with a weighted sum-of-pairs measure of BLOSUM62 scores (Henikoff and Henikoff, 1992) (step 2 in Fig. 1). If two neighboring groups on the tree have an average sequence identity higher than a certain threshold (default: 60%), they are aligned in this fast way. The result of the first alignment stage is a set of sequences or pre-aligned groups that are relatively divergent from each other.

In the second alignment stage, one representative sequence (the longest one) is selected from each pre-aligned group. For each representative, PSI-BLAST is used to search for homologs from sequence database UNIREF90 (Wu et al., 2006) with three iterations and an E-value cutoff of 0.001. Hits with <20% identity to the query are removed and up to 300 hits are selected. The PSI-BLAST checkpoint file after three iterations is used to predict secondary structures by PSIPRED (Jones, 1999). For each pair of representatives, profiles are derived from the PSI-BLAST alignments and PSIPRED secondary structure prediction, and a matrix of posterior probabilities of matches between positions is obtained by forward and backward algorithms of the profile-profile HMM (see Supplementary Data for details). These matrices are used to calculate the probabilistic consistency scores as described in Do et al. (2005). The representatives are then aligned progressively according to the consistency-based scoring function, and the pre-aligned groups obtained in the first stage are merged to the multiple alignment of the representatives.

Finally, gap placement is refined to make the gap patterns more realistic. For that, we define a core block as a set of consecutive positions with gap content less than 0.5 at each position. A highly gapped (‘gappy’) region is defined as a set of consecutive positions with gap contents no less than 0.5 at each position. A gappy region is either bound by two adjacent core blocks, or is at the start or the end of the alignment. If there are l amino acid residues in a gappy segment, gap refinement introduces continuous gap characters in between the [l/2]th residue and the (l − [l/2])th residue, with the exceptions for any gappy segment in N- or C-terminus, where a single run of continuous gap characters is introduced at the sequence start or end.

Fig. 1. PROMALS multiple sequence alignment procedure. The gray arrows indicate the two most time-consuming steps: running PSI-BLAST and PSIPRED (step 4) and profile consistency transformation (step 5).
[Flowchart: N input sequences; (1) k-mer counting, giving a UPGMA tree; (2) align similar sequences in a fast way, giving N′ pre-aligned groups (N′ ≤ N); (3) select one sequence from each group, giving N′ representatives; (4) run PSI-BLAST and PSIPRED, giving N′ profiles with predicted secondary structures; (5) build profile-profile HMMs and apply consistency transformation, giving the probabilistic consistency objective function; (6) do progressive alignment based on consistency, giving the alignment of N′ representatives; (7) merge pre-aligned groups and refine gaps, giving the final alignment of N sequences.]

2.3 Assessment of alignment methods

The following methods were tested: SPEM (Zhou and Zhou, 2005), HHalign (Soding, 2005), MUMMALS (Pei and Grishin, 2006), ProbCons (version 1.10) (Do et al., 2005), MAFFT (version 5.667) (Katoh et al., 2005), MUSCLE (version 3.52) (Edgar, 2004) and ClustalW (version 1.83) (Thompson et al., 1994). For MAFFT, we report two alignment options (‘-linsi’ and ‘-ginsi’) that show the best results. HHalign is an enhanced version of HHsearch (Soding, 2005) that performs pairwise profile–profile alignment with predicted secondary structures (J. Soding, personal communication). Several parameters (score shift, secondary structure weight, pseudocount weight) of HHalign were selected that gave optimal performance on SCOP domain pairs with identity <20%.

For pairwise alignment tests, we used divergent SCOP superfamily domain pairs that were divided into three identity bins: below 10%, 10–15% and 15–20%. For multiple alignment tests, we added up to 24 homologs to each sequence in the testing cases of pairwise alignments. Details on construction of these testing data sets were given in our previous work (Pei and Grishin, 2006). Two large benchmark data sets compiled by other researchers were used as well. One is the SABmark database (version 1.65) (Van Walle et al., 2005), which contains two sets of multiple protein domains related at SCOP fold or superfamily level. The other is PREFAB database (version 4.0) (Edgar, 2004), which is based on structural alignments in FSSP database (Holm and Sander, 1998b) and homologous sequences from database searches.

Reference-dependent alignment quality scores (Q-scores) were calculated using the built-in programs in SABmark and PREFAB packages. The Q-score is the number of correctly aligned residue pairs in the test alignment divided by the number of aligned residue pairs in the reference alignment. The value of the Q-score is between 0 and 1. Wilcoxon signed-ranks tests were performed to calculate the statistical significance of comparisons between alignment methods.

In addition to Q-score, we applied reference-independent evaluation of alignment quality to SCOP domain pairs, as described in our previous work (Pei and Grishin, 2006). We calculated several scores reflecting structural similarity of two SCOP domains compared according to aligned residues in a test alignment: DALI Z-score (Holm and Sander, 1998a), GDT-TS score (Zemla et al., 1999), TM-score (Zhang and Skolnick, 2004), 3D-score (Rychlewski et al., 2003) and two LiveBench contact scores (Rychlewski et al., 2003). These scores were scaled by taking into account self-comparison scores, random scores and alignment coverage (scaled scores are no larger than 1 and usually above 0). We also calculated two reference-independent sequence similarity scores: sequence identity and BLOSUM62 scores of aligned positions in a test alignment. These scores were also calculated for DaliLite (Holm and Sander, 1998a) structure-based alignments as a positive control.

3 RESULTS

PROMALS is a progressive multiple alignment method based on probabilistic consistency of profile-profile comparison, with enhanced profile information from homologs detected by PSI-BLAST and secondary structures predicted by PSIPRED (Fig. 1). SPEM and HHalign are comparable methods as they also use these two sources of extra data. While PROMALS and SPEM can align two or more sequences, HHalign performs only pairwise alignments. The other tested methods (MUMMALS, ProbCons, MAFFT, MUSCLE and ClustalW) are stand-alone multiple sequence methods that do not resort to other data sources or programs.

3.1 Reference-dependent evaluation of methods

3.1.1 Tests on weakly similar SCOP domain pairs

We tested our profile-profile HMM on 1207 divergent SCOP domain pairs (Pei and Grishin, 2006) with <20% sequence identity (Table 1, first numbers in columns under ‘SCOP’). The three methods that use extra data (PROMALS, SPEM and HHalign) produce substantially better results than stand-alone methods (MUMMALS, ProbCons, MAFFT, MUSCLE and ClustalW) that align a pair of sequences without using additional homologs or predicted secondary structures. For sequence pairs with identity below 10%, the average Q-score of PROMALS (0.431) is almost three times higher than that of MUMMALS (0.156). For alignments with identity ranges 10–15% and 15–20%, PROMALS also gives substantial accuracy increases over MUMMALS of 0.272 and 0.176, respectively. PROMALS shows about 3–4% accuracy increases over SPEM and HHalign, suggesting that our profile-profile HMM utilizes homologs and predicted secondary structures in a better way.

Table 1. Reference-dependent evaluation of alignment methods

Method        SCOP 0–10%    SCOP 10–15%   SCOP 15–20%   SABmark-twi   SABmark-sup   PREFAB
              (355)         (432)         (420)         (209)         (425)         (1682)
PROMALS       0.435/0.457   0.612/0.619   0.761/0.772   0.391         0.665         0.790
SPEM          0.377/0.411   0.558/0.578   0.727/0.751   0.326         0.628         0.774
HHalign       0.406/–       0.567/–       0.730/–       –             –             0.787
MUMMALS       0.151/0.329   0.335/0.520   0.586/0.732   0.196         0.522         0.731
ProbCons      0.116/0.290   0.294/0.486   0.536/0.701   0.166         0.485         0.716
MAFFT-linsi   0.116/0.301   0.262/0.500   0.495/0.707   0.184         0.510         0.722
MAFFT-ginsi   0.116/0.308   0.265/0.497   0.496/0.714   0.176         0.495         0.715
MUSCLE        0.139/0.262   0.293/0.452   0.507/0.661   0.136         0.433         0.680
ClustalW      0.136/0.210   0.270/0.357   0.482/0.565   0.127         0.390         0.617

Average Q-scores of three testing data sets of ASTRAL SCOP40 superfamily pairs, two SABmark data sets (twi: ‘twilight zone’ set; sup: ‘superfamily’ set) and the PREFAB 4.0 data set are shown. Q-score is the number of correctly aligned residue pairs in the test alignment divided by the total number of aligned residue pairs in the reference alignment. The number of alignments in each testing data set is shown in parentheses. Identity ranges are shown for the three SCOP data sets. The first three methods use extra data from PSI-BLAST and PSIPRED. The other five are stand-alone methods. The option of MUMMALS (modeling secondary structure and solvent accessibility) is set to produce the best results on these data sets. For each data set, PROMALS yields statistically higher accuracy (bold numbers) than any other method (P-value < 0.000001) according to Wilcoxon signed rank test. For tests on the SCOP data sets, there are two numbers in each cell separated by a slash. The first number is the average Q-score in pairwise alignment tests and the second number is the average Q-score in multiple alignment tests. HHalign only performs pairwise profile–profile alignments and does not construct multiple sequence alignments. Thus the values for SCOP multiple alignment tests and SABmark tests are not available. For PREFAB 4.0 data set, the scores of PROMALS, HHalign and SPEM are based on pairwise profile–profile alignments, while the scores for other methods are based on multiple alignments.

We also tested the methods (except HHalign, which is a pairwise alignment program) on data sets of multiple sequences constructed by adding up to 48 homologs to each SCOP domain pair (Table 1, second numbers in columns under ‘SCOP’). With multiple sequences, PROMALS and SPEM both show slight improvement (1–2% for PROMALS and 2–3% for SPEM) over their pairwise profile–profile alignments. PROMALS outperforms SPEM by 2% on multiple sequences. With added homologs, stand-alone methods all yield better accuracies than pairwise sequence alignments, among which MUMMALS is the best method. PROMALS outperforms MUMMALS by 0.13, 0.1, and 0.05 for data sets with identities <10%, 10–15% and 15–20%, respectively.

3.1.2 Tests on SABmark database

SABmark database (version 1.65) has two multiple alignment benchmark sets. The ‘twilight zone’ set contains 209 tests of SCOP (version 1.65) fold-level domains with very low similarity, and the ‘superfamily’ set contains 425 tests of SCOP superfamily-level domains with low to intermediate similarity. PROMALS achieves the best results among all methods for both sets. Its accuracy is 6% and 4% higher than SPEM on ‘twilight zone’ set and ‘superfamily’ set, respectively. For the most difficult ‘twilight zone’ set, PROMALS doubles the accuracy of the best stand-alone method (MUMMALS). Nevertheless, only 40% residues were correctly aligned on average by PROMALS for the ‘twilight zone’ set, suggesting that homology modeling of extremely divergent domains remains a difficult problem with regard to alignment quality.

3.1.3 Tests on PREFAB database

PREFAB 4.0 database consists of 1682 alignments averaging 45.2 sequences per alignment. Each alignment consists of two sequences with known structures and their homologs found by PSI-BLAST database searches. The reference structural alignment in each test is based on the consensus of FSSP (Holm and Sander, 1998b) and CE (Shindyalov and Bourne, 1998) alignments. We have used the performances of pairwise profile–profile alignments of PROMALS and SPEM as an indicator of their multiple alignment performances. The three methods that use additional data (PROMALS, SPEM and HHalign) give similar results, each with an average Q-score above 0.75. Their accuracies are higher than those on the two SCOP data sets with identity <15% and the two SABmark sets, suggesting that PREFAB 4.0 is an easier testing data set. PROMALS, SPEM and HHalign are more accurate than MUMMALS by 4–6%. PROMALS is statistically more accurate (P-value < 0.000001) than SPEM and HHalign despite small differences in their average Q-scores. Results on PREFAB 4.0 confirm that alignment quality differences between methods become smaller on easier tests.

3.2 Reference-independent evaluation of methods

On our data sets of 1207 SCOP domain pairs with identity below 20%, we evaluated alignment quality using reference-independent scores that reflect the similarity between two structures compared according to aligned residue pairs in the test alignment (Pei and Grishin, 2006). These structural similarity scores are DALI Z-score, TM-score, GDT-TS score, 3D-score, and two LiveBench contact scores (Table 2). Consistent with reference-dependent evaluation, PROMALS produces significantly higher average structural similarity scores than other methods. Used as a positive control, structural alignment method DaliLite yields higher structural similarity scores than any sequence-based alignment method (Table 2). Interestingly, DaliLite alignments have the lowest reference-independent sequence similarity scores (sequence identity and BLOSUM62 scores). PROMALS also shows lower sequence similarity scores than several other sequence-based methods. These observations suggest that for distantly related sequences (sequence identity <20%), sequence similarity scores, such as identity or BLOSUM62, may not correlate with alignment quality measured by 3D structural comparison, and maximization of these scores may not improve structural models based on sequence alignments.

Table 2. Reference-independent evaluation on 1207 representative SCOP40 domain pairs with identity <20%

              Structural similarity                                                  Sequence similarity
Method        DALI Z-score  GDT-TS    TM-score  3D-score  LBcona    LBconb    Identity  BLOSUM62
PROMALS       0.1562        0.3079    0.3675    0.3097    0.2692    0.3527    0.0868    0.1555
SPEM          0.1400        0.2886    0.3451    0.2893    0.2521    0.3319    0.0992    0.1724
HHalign       0.1334        0.2914    0.3488    0.2907    0.2469    0.3263    0.0874    0.1535
MUMMALS       0.1231        0.2570    0.3070    0.2563    0.2240    0.2909    0.0932    0.1651
ProbCons      0.1003        0.2324    0.2767    0.2307    0.2060    0.2670    0.0983    0.1719
MAFFT-linsi   0.1135        0.2485    0.2982    0.2467    0.2143    0.2820    0.0923    0.1632
MAFFT-ginsi   0.1126        0.2454    0.2960    0.2429    0.2152    0.2803    0.0972    0.1725
MUSCLE        0.0980        0.2297    0.2777    0.2266    0.1941    0.2535    0.0939    0.1686
ClustalW      0.0723        0.1916    0.2318    0.1876    0.1551    0.2030    0.0733    0.1344
DaliLite      0.4206        0.4936    0.5571    0.5289    0.4087    0.5110    0.0697    0.1268

The first three methods use extra data given by PSI-BLAST and PSIPRED. The last method (DaliLite) produces alignments based on comparison of known 3D structures. The other five are stand-alone methods. All sequence-based methods except HHalign construct multiple sequence alignments for target domain pairs with up to 48 homologs. HHalign constructs pairwise profile–profile alignments. Scores are calculated for pairwise alignments of target domain pairs extracted from multiple sequence alignments. PROMALS yields statistically higher structure-similarity scores (in bold) than other sequence alignment methods (P-value < 0.000001) according to Wilcoxon signed rank test. DaliLite structure-based sequence alignments have the lowest average sequence similarity scores (in bold).

3.3 Pairwise comparisons of alignment methods

To gain further understanding of the differences between alignment methods, we compared their performance on individual domain pairs from the SCOP sets (identity <20%). Table 3 shows the number of pairs, for which one method performs better than another method by a relatively large margin of 0.1 or more (measured by scaled TM-score or Q-score, both scores are between 0 and 1). Although PROMALS clearly leads by a large margin, it does not offer the best alignment in each and every case. For example, PROMALS gives a TM-score increase of 0.1 or more over SPEM on 197 alignments, while producing significantly inferior alignments for 109 pairs. Even stand-alone methods (MUMMALS, ProbCons, MAFFT, MUSCLE and ClustalW) outperform PROMALS by a TM-score of 0.1 or more on a small number of pairs (5%, i.e. 49–67 out of 1207 alignments). These comparisons suggest that alignments constructed by different methods can vary much for divergent sequences, and a method with an overall inferior performance is capable of generating better alignments in some cases. Careful inspection of alignments produced by several programs could help improve alignment quality for divergent sequences.

Table 3. Pairwise comparisons among alignment methods on 1207 SCOP domain pairs with identity <20%

              PROMALS   SPEM      HHalign   MUMMALS   ProbCons  MAFFT-linsi  MAFFT-ginsi  MUSCLE    ClustalW
PROMALS       –         109/196   76/179    67/340    44/458    67/398       61/374       60/464    49/650
SPEM          199/81    –         140/148   108/281   71/389    98/324       99/326       82/400    43/574
HHalign       265/84    196/121   –         77/254    49/368    73/288       78/301       66/393    53/571
MUMMALS       685/286   648/305   627/333   –         38/169    111/138      82/128       82/227    59/431
ProbCons      726/263   693/277   674/303   201/62    –         172/80       162/76       162/169   110/336
MAFFT-linsi   718/276   680/295   662/325   239/128   133/188   –            85/98        93/196    60/387
MAFFT-ginsi   714/271   676/284   664/313   199/117   113/184   111/132      –            90/185    67/395
MUSCLE        783/239   741/255   727/279   401/83    302/138   295/110      327/106      –         75/288
ClustalW      858/193   840/209   819/228   649/55    559/103   585/70       600/76       449/100   –

Each off-diagonal cell has two numbers separated by a slash. The first number is the number of pairs where the alignment score of the method listed to the left is inferior to that of the method listed above (in a column) by 0.1 or more. The second number is the number of pairs where the score of the method listed to the left is better than that of the method listed above by 0.1 or more. The alignment quality scores used for comparison in the lower triangle and the upper triangle are Q-scores and weighted and scaled TM-scores, respectively. These scores are calculated based on results of multiple sequence alignments (target domain pairs plus up to 48 added homologs), with the exception of HHalign alignments, which are pairwise profile–profile alignments. Comparisons of PROMALS with other methods are highlighted in bold.

4 DISCUSSION

Judging by its performance, PROMALS is a definite advance compared to our previous alignment program MUMMALS (Pei and Grishin, 2006). MUMMALS derives probabilistic consistency from pairwise HMMs with built-in local structural information (secondary structure and/or solvent accessibility), and shows slight but significant improvement (a few percent) over other stand-alone methods such as ProbCons (Do et al., 2005) and MAFFT (Katoh et al., 2005). However, since no additional homologs are used, the local structure prediction implicitly performed by MUMMALS is of low accuracy compared to advanced methods such as PSIPRED (Jones, 1999). In contrast, PROMALS incorporates database searches and more accurate secondary structure prediction, and derives probabilistic consistency from profile–profile HMMs. Moreover, the HMM in PROMALS has a two-track structure (Karchin et al., 2003) that treats both amino acids and predicted secondary structures as emitted objects, while MUMMALS HMMs only emit amino acids. Owing to additional data sources and the advanced profile–profile HMM, PROMALS shows significant improvement over MUMMALS and other stand-alone methods, especially for highly divergent sequences.

The HMM in PROMALS adopts a numerical representation of sequence profile (see Supplementary Data for details) that successfully works in other profile-sequence or profile–profile alignment methods such as PSI-BLAST (Altschul et al., 1997) and COMPASS (Sadreyev and Grishin, 2003). A recent comprehensive study also supported the effectiveness of this profile–profile scoring scheme (Wang and Dunbrack, 2004). To adequately use predicted secondary structures, we not only score them as emitted objects, but also use transition and emission probabilities that are dependent on predicted secondary structure types (Supplementary Data). Unlike HHalign, which treats each alignment as a classical profile HMM (Eddy, 1998), our HMM has a simpler structure similar to the classical 3-state pairwise HMM (Durbin et al., 1998). SPEM (Zhou and Zhou, 2005) does not use HMMs, but applies an empirical profile–profile alignment method (SP ) that identifies the optimal alignment path. In contrast, the HMM in PROMALS allows estimation of posterior probabilities of matches between positions. As a result, PROMALS has a probabilistic treatment of consistency similar to the one in ProbCons and MUMMALS, while simple consistency measures are used in SPEM, T-COFFEE (Notredame et al., 2000) and PCMA (Pei et al., 2003). PROMALS performs significantly better than SPEM and HHalign on difficult tests, suggesting the advantages of our profile–profile comparison scheme.

Since PROMALS relies on PSI-BLAST and PSIPRED to collect additional homologs and predicted secondary structures, the speed of PROMALS is considerably slower than that of stand-alone progressive methods. Our strategy for improving speed is to use different algorithms for easy and difficult alignments (Pei et al., 2003). By aligning highly similar sequences in a fast way, the number of sequences subject to the time-consuming steps (running PSI-BLAST, PSIPRED and consistency transformation) could be substantially reduced. For example, for 1207 SCOP domain pairs with up to 48 added homologs, the average number of sequences in an alignment is 41.6. After PROMALS aligns similar sequences with identity above 60% in the first stage, only 24 sequences on average require database searches, secondary structure prediction, and consistency transformation. For these tests, the median CPU time of PROMALS is 30 min per alignment, as compared to 67 min for SPEM (on Redhat Enterprise Linux 3, AMD Opteron 2.0 GHz). The stand-alone methods (MUMMALS, ProbCons, MAFFT, MUSCLE and ClustalW) are much faster, all with a median CPU time <1 min.

As in our previous work (Pei and Grishin, 2006), we demonstrated the effectiveness of reference-independent evaluation of alignment quality in this study. First, we observed a good correlation between reference-dependent and reference-independent evaluations, suggesting that it may not be necessary to spend significant efforts on development of reference alignment databases. Second, reference-independent techniques solve the problem of reference alignment ambiguity, which becomes significant when similarity is low. Third, reference-independent evaluation helps answer general questions such as whether alignments can be further improved for sequences with low similarity, and whether such improvements will help structure modeling. For several structural similarity measures (GDT-TS, 3D-score, TM-score, LB contact scores), the ratio between the average score of PROMALS sequence-based alignment and the average score of DaliLite structure-based alignment is 0.6 on domain pairs with <20% sequence identity (Table 2), suggesting that we are still 40% below what can be achieved with structures in hand. Notably, for these divergent sequences, DaliLite structural alignments have lower sequence similarity scores (identity and BLOSUM62 scores) than alignments produced by any sequence method, suggesting that scoring functions based only on amino acid sequence similarity may not be suitable for aligning divergent sequences for the purpose of homology modeling. This observation further justifies the use of alternative scoring schemes, such as the ones that recruit structural information.

ACKNOWLEDGEMENTS

We would like to thank Bong-Hyun Kim for the reference-independent evaluation routine, and Johannes Soding for providing the HHalign program. We would like to thank Lisa Kinch, Ruslan Sadreyev and James Wrabl for critical reading of the manuscript and helpful comments. This work was supported in part by NIH grant GM67165 to NVG.

Conflict of Interest: none declared.

REFERENCES

Altschul,S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Do,C.B. et al. (2005) ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res., 15, 330–340.
Durbin,R. et al. (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.
Pei,J. et al. (2003) PCMA: fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics, 19, 427–428.
Phillips,A. et al. (2000) Multiple sequence alignment in phylogenetic analysis. Mol. Phylogenet. Evol., 16, 317–330.
Rychlewski,L. et al. (2003) LiveBench-6: large-scale automated evaluation of protein structure prediction servers. Proteins, 53 (Suppl. 6), 542–547.
Sadreyev,R. and Grishin,N. (2003) COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J. Mol. Biol., 326, 317–336.
Shindyalov,I.N. and Bourne,P.E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng., 11, 739–747.
Simossis,V.A. and Heringa,J. (2005) PRALINE: a multiple sequence alignment toolbox that integrates homology-extended and secondary structure informa-
Eddy,S.R.
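As a rough illustration of the probabilistic consistency idea introduced in ProbCons (Do et al., 2005), the sketch below re-estimates the match probabilities between positions of two sequences x and y by averaging over paths through every sequence z in the set, with the z = x and z = y terms reducing to the direct x–y matrix. The function names, the dictionary layout of the posterior matrices and the toy numbers are assumptions made for this example; it is not the ProbCons or PROMALS implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def consistency_transform(post, x, y, names):
    """Relaxed match probabilities between sequences x and y.

    post[(a, b)] holds the posterior match probabilities between
    positions of a (rows) and positions of b (columns).
    """
    # Terms for z = x and z = y both equal the direct x-y matrix.
    total = [[2.0 * v for v in row] for row in post[(x, y)]]
    for z in names:
        if z in (x, y):
            continue
        via = matmul(post[(x, z)], post[(z, y)])  # paths x -> z -> y
        total = [[t + v for t, v in zip(t_row, v_row)]
                 for t_row, v_row in zip(total, via)]
    return [[v / len(names) for v in row] for row in total]


# Toy posteriors for three sequences of length 2 (illustrative values only).
names = ["x", "y", "z"]
post = {
    ("x", "y"): [[0.5, 0.0], [0.0, 0.5]],
    ("x", "z"): [[0.9, 0.0], [0.0, 0.9]],
    ("z", "y"): [[0.9, 0.0], [0.0, 0.9]],
}
relaxed = consistency_transform(post, "x", "y", names)
```

In this toy case the weak direct evidence for the diagonal matches (0.5) is raised because both sequences align confidently to the third sequence, which is the effect the consistency transformation is meant to capture.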
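The target frequency described in Section 2.1 as a mixture (weighted average) of effective and pseudocount frequencies can be sketched as follows. The actual weights and the derivation of pseudocounts are given in the Supplementary Data and are not reproduced here; the weight `alpha`, the four-letter toy alphabet and the uniform pseudocount vector are placeholder assumptions for illustration only.

```python
def target_frequencies(effective, pseudocount, alpha):
    """Weighted average of effective and pseudocount frequencies.

    alpha is a hypothetical weight on the effective frequencies (0 < alpha < 1).
    """
    return [alpha * f + (1.0 - alpha) * g
            for f, g in zip(effective, pseudocount)]


# Toy position over a four-letter alphabet; two letters are never observed,
# so their effective frequency is zero.
effective = [0.7, 0.3, 0.0, 0.0]
pseudocount = [0.25, 0.25, 0.25, 0.25]  # assumed, not derived from a substitution model
target = target_frequencies(effective, pseudocount, alpha=0.8)
```

Because the pseudocount term is positive for every letter, unobserved letters still receive a non-zero target frequency, while the result remains a probability distribution.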
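The core-block and gappy-region definitions used in the gap refinement step of Section 2.2 can be illustrated with a small sketch. It only classifies alignment columns by gap content using the 0.5 cutoff stated in the text and does not perform the refinement itself; the function name and the toy alignment are assumptions made for this example.

```python
def split_regions(alignment, cutoff=0.5):
    """Group columns into runs of 'core' or 'gappy' positions.

    Returns (kind, start, end) tuples with end exclusive. A column is
    'core' if its gap fraction is below the cutoff, otherwise 'gappy'.
    """
    regions = []
    for col in range(len(alignment[0])):
        gaps = sum(1 for row in alignment if row[col] == '-')
        kind = 'core' if gaps / len(alignment) < cutoff else 'gappy'
        if regions and regions[-1][0] == kind:
            # Extend the current run by one column.
            regions[-1] = (kind, regions[-1][1], col + 1)
        else:
            regions.append((kind, col, col + 1))
    return regions


# Toy alignment of four sequences (illustrative only).
alignment = ["MKV--LA",
             "MKVG-LA",
             "MR---LA",
             "MK-G-LS"]
regions = split_regions(alignment)
```

Here columns 2 to 4 form a gappy region bounded by two core blocks, which is the kind of segment in which the refinement step would regroup gap characters.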
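The Q-score defined in Section 2.3 and reported in Table 1 can be sketched for a single pair of sequences as follows. Each alignment is given as two gapped strings of equal length; the helper names and the toy sequences are hypothetical, and the reported numbers were produced by the built-in programs of the benchmark packages, not by this code.

```python
def aligned_pairs(row_a, row_b):
    """Set of (i, j) residue index pairs placed in the same column."""
    pairs, i, j = set(), 0, 0
    for a, b in zip(row_a, row_b):
        if a != '-' and b != '-':
            pairs.add((i, j))
        if a != '-':
            i += 1
        if b != '-':
            j += 1
    return pairs


def q_score(test, reference):
    """Correctly aligned pairs in the test alignment divided by the
    number of aligned pairs in the reference alignment."""
    ref_pairs = aligned_pairs(*reference)
    return len(aligned_pairs(*test) & ref_pairs) / len(ref_pairs)


reference = ("ACDEF", "AC-EF")  # toy reference alignment
test = ("ACDEF", "ACE-F")       # toy test alignment of the same sequences
q = q_score(test, reference)
```

In this example three of the four reference pairs are recovered by the test alignment.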
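Each cell of Table 3 is a pair of counts over the same set of domain pairs. A minimal sketch of how one such cell could be tallied is given below, assuming per-alignment scores are available for both methods; the function name and the five made-up scores are illustrative and are not data from the paper.

```python
def compare(scores_left, scores_above, margin=0.1):
    """Return (worse, better) counts for the method listed to the left.

    worse:  cases where the left method trails the other by >= margin
    better: cases where the left method leads the other by >= margin
    """
    worse = sum(1 for a, b in zip(scores_left, scores_above) if b - a >= margin)
    better = sum(1 for a, b in zip(scores_left, scores_above) if a - b >= margin)
    return worse, better


# Hypothetical scores for five domain pairs.
method_left = [0.50, 0.30, 0.70, 0.20, 0.45]
method_above = [0.25, 0.50, 0.65, 0.20, 0.25]
cell = compare(method_left, method_above)
```

Cases where the two scores differ by less than the margin count toward neither number, which is why the two counts in a cell do not add up to the total number of domain pairs.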
(1998) Profile hidden Markov models. Bioinformatics, 14, 755–763. tion. Nucleic Acids Res., 33, W289–294. Edgar,R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy Soding,J. (2005) Protein homology detection by HMM-HMM comparison. and high throughput. Nucleic Acids Res., 32, 1792–1797. Bioinformatics, 21, 951–960. Edgar,R.C. and Batzoglou,S. (2006) Multiple sequence alignment. Curr. Opin. Sunyaev,S.R. et al. (1999) PSIC: profile extraction from sequence alignments Struct. Biol, 16, 368–373. with position-specific counts of independent observations. Protein Eng., 12, Ginalski,K. and Rychlewski,L. (2003) Detection of reliable and unexpected 387–394. protein fold predictions using 3D-Jury. Nucleic Acids Res., 31, 3291–3292. Tatusov,R.L. et al. (1994) Detection of conserved segments in proteins: iterative Henikoff,S. and Henikoff,J.G. (1992) Amino acid substitution matrices from scanning of sequence databases with alignment blocks. Proc. Natl. Acad. Sci. protein blocks. Proc. Natl. Acad. Sci. USA, 89, 10915–10919. USA, 91, 12091–12095. Holm,L. and Sander,C. (1998a) Dictionary of recurrent domains in protein Thompson,J.D. et al. (1994) CLUSTAL W: improving the sensitivity of structures. Proteins, 33, 88–96. progressive multiple sequence alignment through sequence weighting, Holm,L. and Sander,C. (1998b) Touring protein fold space with Dali/FSSP. position-specific gap penalties and weight matrix choice. Nucleic Acids Res., Nucleic Acids Res., 26, 316–319. 22, 4673–4680. Jones,D.T. (1999) Protein secondary structure prediction based on position- Thompson,J.D. et al. (1999) A comprehensive comparison of multiple sequence specific scoring matrices. J. Mol. Biol., 292, 195–202. alignment programs. Nucleic Acids Res., 27, 2682–2690. Karchin,R. et al. (2003) Hidden Markov models that use predicted local structure Thompson,J.D. et al. (2000) DbClustal: rapid and reliable global multiple for fold recognition: alphabets of backbone geometry. Proteins, 51, 504–514. 
alignments of protein sequences detected by database searches. Nucleic Acids Katoh,K. et al. (2005) MAFFT version 5: improvement in accuracy of multiple Res., 28, 2919–2926. sequence alignment. Nucleic Acids Res., 33, 511–518. Van Walle,I. et al. (2005) SABmark—a benchmark for sequence alignment that Lipman,D.J. et al. (1989) A tool for multiple sequence alignment. Proc. Natl. covers the entire known fold space. Bioinformatics, 21, 1267–1268. Acad. Sci. USA, 86, 4412–4415. Wang,G. and Dunbrack,R.L., Jr. (2004) Scoring profile-to-profile sequence Murzin,A.G. (1998) How far divergent evolution goes in proteins. Curr. Opin. alignments. Protein Sci., 13, 1612–1626. Struct. Biol., 8, 380–387. Wu,C.H. et al. (2006) The Universal Protein Resource (UniProt): an expanding Notredame,C. et al. (2000) T-Coffee: a novel method for fast and accurate universe of protein information. Nucleic Acids Res., 34, D187–191. multiple sequence alignment. J. Mol. Biol., 302, 205–217. Zemla,A. et al. (1999) Processing and analysis of CASP3 protein structure O’Sullivan,O. et al. (2004) 3DCoffee: combining protein sequences and structures predictions. Proteins, (Suppl. 3), 22–29. within multiple sequence alignments. J. Mol. Biol., 340, 385–395. Zhang,Y. and Skolnick,J. (2004) Scoring function for automated assessment of Pei,J. and Grishin,N.V. (2001) AL2CO: calculation of positional conservation in protein structure template quality. Proteins, 57, 702–710. a protein sequence alignment. Bioinformatics, 17, 700–712. Zhou,H. and Zhou,Y. (2005) SPEM: improving multiple sequence alignment Pei,J. and Grishin,N.V. (2006) MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information. with sequence profiles and predicted secondary structures. Bioinformatics, 21, Nucleic Acids Res, 34, 4364–4374. 3615–3621. 
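
The two evaluation quantities used throughout this paper, the Q-score and the pairwise counts reported in Table 3, can be illustrated with a minimal Python sketch. It is an illustration only, not the built-in SABmark or PREFAB programs and not the evaluation routine used by the authors. The function names (aligned_pairs, q_score, tally), the gap character and the small tolerance eps (added to guard against floating-point rounding at the 0.1 margin) are assumptions made for this example.

```python
from typing import Sequence, Set, Tuple


def aligned_pairs(row_a: str, row_b: str, gap: str = "-") -> Set[Tuple[int, int]]:
    """Return the (i, j) residue index pairs that share a column in a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    pairs = set()
    i = j = 0  # running residue indices in the two ungapped sequences
    for a, b in zip(row_a, row_b):
        if a != gap and b != gap:
            pairs.add((i, j))
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs


def q_score(test: Tuple[str, str], reference: Tuple[str, str]) -> float:
    """Correctly aligned pairs in the test alignment / aligned pairs in the reference."""
    ref_pairs = aligned_pairs(*reference)
    if not ref_pairs:
        raise ValueError("reference alignment has no aligned residue pairs")
    return len(aligned_pairs(*test) & ref_pairs) / len(ref_pairs)


def tally(row_scores: Sequence[float], col_scores: Sequence[float],
          margin: float = 0.1, eps: float = 1e-9) -> Tuple[int, int]:
    """Table 3-style cell: (pairs where the row method is worse, pairs where it is better),
    counting only score differences of at least `margin`."""
    worse = sum(1 for r, c in zip(row_scores, col_scores) if c - r >= margin - eps)
    better = sum(1 for r, c in zip(row_scores, col_scores) if r - c >= margin - eps)
    return worse, better


if __name__ == "__main__":
    reference = ("AC-DEF", "ACG-EF")   # toy reference alignment (hypothetical sequences)
    test = ("AC-DEF", "ACGE-F")        # toy test alignment of the same two sequences
    print(q_score(test, reference))    # 0.75: three of the four reference pairs recovered
    print(tally([0.2, 0.5, 0.9, 0.45], [0.5, 0.5, 0.6, 0.5]))  # (1, 1)
```

Representing each alignment as a set of residue index pairs keeps the Q-score independent of how gaps are laid out in columns, so two alignments that pair the same residues receive the same score.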

References (39)

Robert Edgar (2004)
MUSCLE: multiple sequence alignment with high accuracy and high throughput.
Nucleic acids research, 32 5
V. Simossis, J. Heringa (2005)
PRALINE: a multiple sequence alignment toolbox that integrates homology-extended and secondary structure information
Nucleic Acids Research, 33
Chuong Do, Mahathi Mahabhashyam, M. Brudno, S. Batzoglou (2005)
ProbCons: Probabilistic consistency-based multiple sequence alignment.
Genome research, 15 2
R. Durbin, S. Eddy, A. Krogh, G. Mitchison (1998)
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
C. Notredame, D. Higgins, J. Heringa (2000)
T-Coffee: A novel method for fast and accurate multiple sequence alignment.
Journal of molecular biology, 302 1
Yang Zhang, J. Skolnick (2004)
Scoring function for automated assessment of protein structure template quality
Proteins: Structure, 57
S. Sunyaev, F. Eisenhaber, I. Rodchenkov, B. Eisenhaber, V. Tumanyan, E. Kuznetsov (1999)
PSIC: profile extraction from sequence alignments with position-specific counts of independent observations.
Protein engineering, 12 5
S. Henikoff, J. Henikoff (1992)
Amino acid substitution matrices from protein blocks.
Proceedings of the National Academy of Sciences of the United States of America, 89 22
Guoli Wang, Roland Dunbrack (2004)
Scoring profile‐to‐profile sequence alignments
Protein Science, 13
L. Rychlewski, D. Fischer, A. Elofsson (2003)
LiveBench‐6: Large‐scale automated evaluation of protein structure prediction servers
Proteins: Structure, 53
R. Karchin, M. Cline, Y. Mandel-Gutfreund, K. Karplus (2003)
Hidden Markov models that use predicted local structure for fold recognition: Alphabets of backbone geometry
Proteins: Structure, 51
R. Tatusov, S. Altschul, E. Koonin (1994)
Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks.
Proceedings of the National Academy of Sciences of the United States of America, 91 25
R. Sadreyev, N. Grishin (2003)
COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance.
Journal of molecular biology, 326 1
Hongyi Zhou, Yaoqi Zhou (2005)
SPEM: improving multiple sequence alignment with sequence profiles and predicted secondary structures
Bioinformatics, 21 18
J. Thompson, F. Plewniak, J. Thierry, O. Poch (2000)
DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches.
Nucleic acids research, 28 15
I. Van Walle et al. (2005)
SABmark—a benchmark for sequence alignment that covers the entire known fold space
Bioinformatics, 21
Jimin Pei, N. Grishin (2006)
MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information
Nucleic Acids Research, 34
J. Thompson, D. Higgins, T. Gibson (1994)
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.
Nucleic acids research, 22 22
K. Ginalski, L. Rychlewski (2003)
Detection of reliable and unexpected protein fold predictions using 3D-Jury
Nucleic acids research, 31 13
S. Altschul, Thomas Madden, A. Schäffer, Jinghui Zhang, Zheng Zhang, W. Miller, D. Lipman (1997)
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic acids research, 25 17
Jimin Pei, N. Grishin (2001)
AL2CO: calculation of positional conservation in a protein sequence alignment
Bioinformatics, 17 8
L. Holm, C. Sander (1998)
Dictionary of recurrent domains in protein structures
Proteins: Structure, 33
I. Shindyalov, P. Bourne (1998)
Protein structure alignment by incremental combinatorial extension (CE) of the optimal path.
Protein engineering, 11 9
Robert Edgar, S. Batzoglou (2006)
Multiple sequence alignment.
Current opinion in structural biology, 16 3
L. Holm, C. Sander (1998)
Touring protein fold space with Dali/FSSP
Nucleic acids research, 26 1
S. Eddy (1998)
Profile hidden Markov models
Bioinformatics, 14 9
J. Thompson, F. Plewniak, O. Poch (1999)
A comprehensive comparison of multiple sequence alignment programs
Nucleic acids research, 27 13
Jonathan Pevsner (2005)
Multiple Sequence Alignment
Theory and Mathematical Methods for Bioinformatics
A. Phillips, D. Janies, W. Wheeler (2000)
Multiple sequence alignment in phylogenetic analysis.
Molecular phylogenetics and evolution, 16 3
D. Lipman, S. Altschul, J. Kececioglu (1989)
A tool for multiple sequence alignment.
Proceedings of the National Academy of Sciences of the United States of America, 86 12
K. Katoh, K. Kuma, H. Toh, T. Miyata (2005)
MAFFT version 5: improvement in accuracy of multiple sequence alignment
Nucleic Acids Research, 33
David Jones (1999)
Protein secondary structure prediction based on position-specific scoring matrices.
Journal of molecular biology, 292 2
Cathy Wu, R. Apweiler, A. Bairoch, D. Natale, W. Barker, B. Boeckmann, Serenella Ferro, E. Gasteiger, Hongzhan Huang, R. Lopez, M. Magrane, M. Martin, R. Mazumder, C. O’Donovan, Nicole Redaschi, Baris Suzek (2006)
The Universal Protein Resource (UniProt): an expanding universe of protein information
Nucleic Acids Research, 34
Orla O'Sullivan, K. Suhre, C. Abergel, D. Higgins, C. Notredame (2004)
3DCoffee: combining protein sequences and structures within multiple sequence alignments.
Journal of molecular biology, 340 2
Jimin Pei, R. Sadreyev, N. Grishin (2003)
PCMA: fast and accurate multiple sequence alignment based on profile consistency
Bioinformatics, 19 3
A. Murzin (1998)
How far divergent evolution goes in proteins.
Current opinion in structural biology, 8 3
A. Zemla, Č. Venclovas, J. Moult, K. Fidelis (1999)
Processing and analysis of CASP3 protein structure predictions
Proteins: Structure, 37
J. Söding (2005)
Protein homology detection by HMM-HMM comparison
Bioinformatics, 21 7

Publisher: Oxford University Press
Copyright: © The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
eISSN: 1367-4811
DOI: 10.1093/bioinformatics/btm017
pmid: 17267437

Abstract

Vol. 23 no. 7 2007, pages 802–808 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btm017 Sequence analysis PROMALS: towards accurate multiple sequence alignments of distantly related proteins 1, 1,2 Jimin Pei and Nick V. Grishin 1 2 Howard Hughes Medical Institute and Department of Biochemistry, The University of Texas Southwestern Medical Center at Dallas, 6001 Forest Park Road, Dallas, TX 75390-9050, USA Received on December 4, 2006; revised on January 12, 2007; accepted on January 17, 2007 Advance Access publication January 31, 2007 Associate Editor: Alex Bateman (Lipman et al., 1989) is computationally prohibitive for large ABSTRACT sets of sequences. In contrast, a progressive method that aligns Motivation: Accurate multiple sequence alignments are essential pairs of sequences and sequence groups along a tree is in protein structure modeling, functional prediction and efficient algorithmically simpler and much faster, requiring only N1 planning of experiments. Although the alignment problem has steps of pairwise alignments for N sequences. However, in attracted considerable attention, preparation of high-quality progressive methods, alignment errors made at each step are alignments for distantly related sequences remains a difficult task. propagated to subsequent steps. Many progressive methods use Results: We developed PROMALS, a multiple alignment method a scoring function called sum-of-pairs, i.e. a sum of amino acid that shows promising results for protein homologs with sequence substitution scores for pairs of amino acids between two identity below 10%, aligning close to half of the amino acid residues positions (Edgar and Batzoglou, 2006; Thompson et al., 1994). correctly on average. This is about three times more accurate than Such a scoring function yields reasonable alignment quality traditional pairwise sequence alignment methods. PROMALS for closely related sequences (identity above 40%). 
However, algorithm derives its strength from several sources: (i) sequence alignment quality drops rapidly with decreasing sequence database searches to retrieve additional homologs; (ii) accurate similarity (Thompson et al., 1999). secondary structure prediction; (iii) a hidden Markov model that uses Effective construction of multiple alignments with respect to a novel combined scoring of amino acids and secondary structures; accuracy and speed has been extensively researched in recent (iv) probabilistic consistency-based scoring applied to progressive years. Refinement and consistency-based scoring are two major alignment of profiles. Compared to the best alignment methods that techniques to improve classical progressive methods. MUSCLE do not use secondary structure prediction and database searches (Edgar, 2004) and MAFFT (Katoh et al., 2005) represent two (e.g. MUMMALS, ProbCons and MAFFT), PROMALS is up to 30% recent methods that use extensive refinement to correct errors more accurate, with improvement being most prominent for highly made in progressive steps. They both implement sum-of-pairs divergent homologs. Compared to SPEM and HHalign, which also scores, which are easy to compute and offer the advantage of employ database searches and secondary structure prediction, great speed. In T-COFFEE (Notredame et al., 2000), the PROMALS shows an accuracy improvement of several percent. scoring is derived by finding consistently aligned residue pairs Availability: The PROMALS web server is available at: in a library of pairwise alignments. Such consistency-based http://prodata.swmed.edu/promals/ scoring functions can give better alignment quality than Contact: [email protected] sum-of-pairs scores. Further improvement comes with a Supplementary information: Supplementary data are available at probabilistic treatment of consistency via pairwise hidden Bioinformatics online. Markov models (HMMs), as first implemented in ProbCons (Do et al., 2005). 
MUMMALS (Pei and Grishin, 2006) builds on the success of probabilistic consistency by introducing 1 INTRODUCTION HMMs with more states that capture local structural informa- Multiple sequence alignments have broad applications in tion. Consistency transformation requires operations on sequence similarity searches, structure modeling and sequence triplets, and therefore is computationally intensive. phylogenetic analysis (Altschul et al., 1997; Eddy, 1998; By aligning similar sequences with general substitution matrices Ginalski and Rychlewski, 2003; Phillips et al., 2000). They and aligning divergent sequence groups with profile-based also aid in experimental design by revealing conserved residues consistency, PCMA (Pei et al., 2003) is able to achieve a with potential functional importance. A variety of alignment balance between alignment accuracy and speed. methods that rely on different algorithms and scoring Even with refinement and consistency-based scoring, functions have been developed (Edgar and Batzoglou, 2006). current methods still have difficulty in obtaining high-quality A rigorous method that aligns all sequences simultaneously alignments when sequence identity drops below 20%. As homologous proteins can have very low sequence *To whom correspondence should be addressed. similarity while maintaining similar structures and functions 802 The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected] Accurate multiple sequence alignments of proteins (weighted average) between effective frequency and the pseudocount (Murzin, 1998), aligning distantly related sequences is an frequency (Altschul et al., 1997; Tatusov et al., 1994). Defined in this important task. A recent trend in the multiple alignment way, the target frequency of any amino acid, even if it is not present in field is to recruit various sources of sequence and structural a position, is always greater than zero. 
Details on derivation of the two information to improve alignment accuracy (Edgar and profile components are in Supplementary Data. Batzoglou, 2006). Such sources include homologs detected in For an ‘M’ state, the probability of emitting the observed amino acids database searches (Katoh et al., 2005; Simossis and Heringa, for a position pair (i, j) is the product of two probabilities: (i) the 2005; Thompson et al., 2000), predicted secondary structure probability of generating the effective frequencies of position i using (Simossis and Heringa, 2005; Zhou and Zhou, 2005), and the target frequencies of position j, and (ii) the probability of generating known 3D structures (O’Sullivan et al., 2004). Since additional the effective frequencies of position j using the target frequencies of homologs improve the quality of sequence profiles, and position i. For an ‘X’ or ‘Y’ state, the probability of emitting the observed amino acids in a position k is the probability of generating structural features such as secondary structure are generally the effective frequencies of position k using the background amino acid more conserved than sequences, their usage can lead to frequencies in insertion regions. Besides amino acids, an ‘M’ state improved alignment quality. also emits a pair of predicted secondary structures, and an ‘X’ or ‘Y’ Here, we describe PROMALS, a multiple sequence state also emits a single predicted secondary structure. The emission alignment method that combines recent advances in computa- probability in a hidden state (‘M’, ‘X’ or ‘Y’) is a weighted product of tional approaches to tackle the difficult task of aligning amino acid emission probability and secondary structure emission divergent sequences. PROMALS improves probabilistic probability. 
The relative weights for the scoring terms of amino acids consistency-based scoring of profiles by utilizing predicted and predicted secondary structures have been optimized to increase the secondary structures and additional homologs found in alignment accuracy of the training sequence pairs. Details on emission database searches. To effectively combine these additional probability formulas, parameter estimation and the algorithm for aligning two profiles with optimal posterior probabilities of position data, we developed and implemented a new hidden Markov matches are described in Supplementary Data. model for profile-profile comparison, which scores both amino acid similarity and secondary structure similarity, and has local structure-dependent transition and emission probabilities. 2.2 PROMALS multiple sequence alignment procedure Like PCMA, PROMALS is made more computationally PROMALS (PROfile Multiple Alignment with predicted Local efficient by treating similar and divergent sequences with Structure) is a progressive method (Fig. 1). The alignment order is set different alignment strategies. On several difficult data sets, by a tree built using a k-mer count method (Edgar, 2004). Like PCMA we show that PROMALS gives the best alignment accuracy (Pei et al., 2003) and MUMMALS (Pei and Grishin, 2006), PROMALS among leading methods such as SPEM, HHalign (Soding, has two alignment stages for easy and difficult alignments. In the first stage, highly similar sequences are progressively aligned in a fast way 2005), MUMMALS, ProbCons and MAFFT. with a weighted sum-of-pairs measure of BLOSUM62 scores (Henikoff and Henikoff, 1992) (step 2 in Fig. 1). If two neighboring groups on the tree have an average sequence identity higher than a certain threshold 2 METHODS (default: 60%), they are aligned in this fast way. 
The result of the first 2.1 A hidden Markov model of profile–profile alignment alignment stage is a set of sequences or pre-aligned groups that are relatively divergent from each other. In the second alignment stage, A classical pairwise HMM for aligning two sequences has three types of one representative sequence (the longest one) is selected from each hidden states: a match state ‘M’ emitting a residue pair, an ‘X’ state pre-aligned group. For each representative, PSI-BLAST is used to emitting a residue in the first sequence and a ‘Y’ state emitting search for homologs from sequence database UNIREF90 (Wu et al., a residue in the second sequence (Durbin et al., 1998). ‘X’ and ‘Y’ states 2006) with three iterations and an E-value cutoff of 0.001. Hits with correspond to insertions or deletions in the two sequences. Our hidden 520% identity to the query are removed and up to 300 hits are selected. Markov model for aligning two alignments (having profile representa- The PSI-BLAST checkpoint file after three iterations is used to predict tions) has the same architecture as a pairwise sequence HMM. secondary structures by PSIPRED (Jones, 1999). For each pair of In our model, an ‘M’ state emits a pair of positions instead of a pair representatives, profiles are derived from the PSI-BLAST alignments of residues. For an ‘X’ or ‘Y’ state, a single position in the and PSIPRED secondary structure prediction, and a matrix of first alignment or in the second alignment is emitted, respectively. posterior probabilities of matches between positions is obtained by The emitted objects (observations) are amino acid frequency vectors forward and backward algorithms of the profile-profile HMM and predicted secondary structure types. (see Supplementary Data for details). These matrices are used to We adopt a representation of amino acid sequence profile similar calculate the probabilistic consistency scores as described in Do et al. 
to the ones in PSI-BLAST (Altschul et al., 1997) and COMPASS (Sadreyev and Grishin, 2003). Two profile components are estimated (2005). The representatives are then aligned progressively according for a position in an alignment: (i) effective frequencies of amino acids, to the consistency-based scoring function, and the pre-aligned and (ii) target frequencies of amino acids. The effective frequencies groups obtained in the first stage are merged to the multiple alignment serve as the emitted objects (observations) in a position for the hidden of the representatives. Finally, gap placement is refined to make the gap Markov model. They are estimated from the position-specific patterns more realistic. For that, we define a core block as a set of independent counts (PSIC) of amino acids (Pei and Grishin, 2001; consecutive positions with gap content less than 0.5 at each position. Sunyaev et al., 1999), which is a sequence-weighting scheme that A highly gapped (‘gappy’) region is defined as a set of consecutive corrects for biased similarities between sequences. If an amino acid positions with gap contents no less than 0.5 at each position. A gappy is not present in a position, it has an effective frequency of zero. region is either bound by two adjacent core blocks, or is at the start The target frequencies serve as the ‘hidden’ amino acid probabilistic or the end of the alignment. If there are l amino acid residues generator for a position. The target frequencies are estimated from the in a gappy segment, gap refinement introduces continuous gap effective frequencies, taking into account prior knowledge of amino characters in between the [l/2]th residue and the (l[l/2])th residue, acid substitution characteristics. The target frequency is a mixture with the exceptions for any gappy segment in N- or C-terminus, 803 J.Pei and N.V.Grishin 3. Select one 2. Align similar 1. 
k-mer sequence from sequences in a counting each group fast way N input sequences N ′ representatives N ′ pre-aligned UPGMA tree groups (N ′≤N) 4. Run PSI-BLAST and PSIPRED 6. Do progressive 5. Build profile-profile 7. Merge pre- alignment based HMMs; consistency aligned groups; on consistency transformation refine gaps Probabilistic N ′ profiles with predicted Final alignment Alignment of N ′ consistency secondary structures of N sequences representatives objective function Fig. 1. PROMALS multiple sequence alignment procedure. The gray arrows indicate the two most time-consuming steps: running PSI-BLAST and PSIPRED (step 4) and profile consistency transformation (step 5). where a single run of continuous gap characters is introduced at the reflecting structural similarity of two SCOP domains compared according to aligned residues in a test alignment: DALI Z-score sequence start or end. (Holm and Sander, 1998a), GDT-TS score (Zemla et al., 1999), TM-score (Zhang and Skolnick, 2004), 3D-score (Rychlewski et al., 2.3 Assessment of alignment methods 2003) and two LiveBench contact scores (Rychlewski et al., 2003). These scores were scaled by taking into account self-comparison scores, The following methods were tested: SPEM (Zhou and Zhou, 2005), random scores and alignment coverage (scaled scores are no larger HHalign (Soding, 2005), MUMMALS (Pei and Grishin, 2006), than 1 and usually above 0). We also calculated two reference- ProbCons (version 1.10) (Do et al., 2005), MAFFT (version 5.667) independent sequence similarity scores: sequence identity and (Katoh et al., 2005), MUSCLE (version 3.52) (Edgar, 2004) and BLOSUM62 scores of aligned positions in a test alignment. These ClustalW (version 1.83) (Thompson et al., 1994). For MAFFT, we scores were also calculated for DaliLite (Holm and Sander, 1998a) report two alignment options (‘-linsi’ and ‘-ginsi’) that show the best structure-based alignments as a positive control. results. 
HHalign is an enhanced version of HHsearch (Soding, 2005) that performs pairwise profile–profile alignment with predicted secondary structures (J. Soding, personal communication). Several 3 RESULTS parameters (score shift, secondary structure weight, pseudocount weight) of HHalign were selected that gave optimal performance on PROMALS is a progressive multiple alignment method SCOP domain pairs with identity520%. based on probabilistic consistency of profile-profile compar- For pairwise alignment tests, we used divergent SCOP superfamily ison, with enhanced profile information from homologs domain pairs that were divided into three identity bins: below 10%, detected by PSI-BLAST and secondary structures predicted 10–15% and 15–20%. For multiple alignment tests, we added up to by PSIPRED (Fig. 1). SPEM and HHalign are comparable 24 homologs to each sequence in the testing cases of pairwise methods as they also use these two sources of extra data. While alignments. Details on construction of these testing data sets were PROMALS and SPEM can align two or more sequences, given in our previous work (Pei and Grishin, 2006). Two large HHalign performs only pairwise alignments. The other tested benchmark data sets compiled by other researchers were used as well. One is the SABmark database (version 1.65) (Van Walle et al., 2005), methods (MUMMALS, ProbCons, MAFFT, MUSCLE and which contains two sets of multiple protein domains related at SCOP ClustalW) are stand-alone multiple sequence methods that do fold or superfamily level. The other is PREFAB database (version 4.0) not resort to other data sources or programs. (Edgar, 2004), which is based on structural alignments in FSSP database (Holm and Sander, 1998b) and homologous sequences 3.1 Reference-dependent evaluation of methods from database searches. 
Reference-dependent alignment quality scores (Q-scores) were calculated using the built-in programs in SABmark and 3.1.1 Tests on weakly similar SCOP domain pairs We tested PREFAB packages. The Q-score is the number of correctly aligned our profile-profile HMM on 1207 divergent SCOP domain residue pairs in the test alignment divided by the number of aligned pairs (Pei and Grishin, 2006) with 520% sequence identity residue pairs in the reference alignment. The value of the Q-score is (Table 1, first numbers in columns under ‘SCOP’). The three between 0 and 1. Wilcoxon signed-ranks tests were performed to methods that use extra data (PROMALS, SPEM and HHalign) calculate the statistical significance of comparisons between alignment produce substantially better results than stand-alone methods methods. (MUMMALS, ProbCons, MAFFT, MUSCLE and ClustalW) In addition to Q-score, we applied reference-independent evaluation that align a pair of sequences without using additional of alignment quality to SCOP domain pairs, as described in our previous work (Pei and Grishin, 2006). We calculated several scores homologs or predicted secondary structures. For sequence 804 Accurate multiple sequence alignments of proteins Table 1. 
pairs with identity below 10%, the average Q-score of PROMALS (0.435) is almost three times higher than that of MUMMALS (0.151). For alignments with identity ranges 10–15% and 15–20%, PROMALS also gives substantial accuracy increases over MUMMALS of 0.277 and 0.175, respectively. PROMALS shows about 3–6% accuracy increases over SPEM and HHalign, suggesting that our profile–profile HMM utilizes homologs and predicted secondary structures in a better way.

We also tested the methods (except HHalign, which is a pairwise alignment program) on data sets of multiple sequences constructed by adding up to 48 homologs to each SCOP domain pair (Table 1, second numbers in columns under ‘SCOP’). With multiple sequences, PROMALS and SPEM both show slight improvement (1–2% for PROMALS and 2–3% for SPEM) over their pairwise profile–profile alignments. PROMALS outperforms SPEM by 2–5% on multiple sequences. With added homologs, stand-alone methods all yield better accuracies than pairwise sequence alignments, among which MUMMALS is the best method. PROMALS outperforms MUMMALS by 0.13, 0.10 and 0.04 for data sets with identities < 10%, 10–15% and 15–20%, respectively.

Table 1. Reference-dependent evaluation of alignment methods

Method        SCOP 0–10%    SCOP 10–15%   SCOP 15–20%   SABmark-twi   SABmark-sup   PREFAB
              (355)         (432)         (420)         (209)         (425)         (1682)
PROMALS       0.435/0.457   0.612/0.619   0.761/0.772   0.391         0.665         0.790
SPEM          0.377/0.411   0.558/0.578   0.727/0.751   0.326         0.628         0.774
HHalign       0.406/–       0.567/–       0.730/–       –             –             0.787
MUMMALS       0.151/0.329   0.335/0.520   0.586/0.732   0.196         0.522         0.731
ProbCons      0.116/0.290   0.294/0.486   0.536/0.701   0.166         0.485         0.716
MAFFT-linsi   0.116/0.301   0.262/0.500   0.495/0.707   0.184         0.510         0.722
MAFFT-ginsi   0.116/0.308   0.265/0.497   0.496/0.714   0.176         0.495         0.715
MUSCLE        0.139/0.262   0.293/0.452   0.507/0.661   0.136         0.433         0.680
ClustalW      0.136/0.210   0.270/0.357   0.482/0.565   0.127         0.390         0.617

Average Q-scores of three testing data sets of ASTRAL SCOP40 superfamily pairs, two SABmark data sets (twi: ‘twilight zone’ set; sup: ‘superfamily’ set) and the PREFAB 4.0 data set are shown. Q-score is the number of correctly aligned residue pairs in the test alignment divided by the total number of aligned residue pairs in the reference alignment. The number of alignments in each testing data set is shown in parentheses. Identity ranges are shown for the three SCOP data sets. The first three methods use extra data from PSI-BLAST and PSIPRED. The other five are stand-alone methods. The option of MUMMALS (modeling secondary structure and solvent accessibility) is set to produce the best results on these data sets. For each data set, PROMALS yields statistically higher accuracy than any other method (P-value < 0.000001) according to the Wilcoxon signed rank test. For tests on the SCOP data sets, there are two numbers in each cell separated by a slash. The first number is the average Q-score in pairwise alignment tests and the second number is the average Q-score in multiple alignment tests. HHalign only performs pairwise profile–profile alignments and does not construct multiple sequence alignments. Thus the values for SCOP multiple alignment tests and SABmark tests are not available. For the PREFAB 4.0 data set, the scores of PROMALS, HHalign and SPEM are based on pairwise profile–profile alignments, while the scores for other methods are based on multiple alignments.

3.1.2 Tests on SABmark database

SABmark database (version 1.65) has two multiple alignment benchmark sets. The ‘twilight zone’ set contains 209 tests of SCOP (version 1.65) fold-level domains with very low similarity, and the ‘superfamily’ set contains 425 tests of SCOP superfamily-level domains with low to intermediate similarity. PROMALS achieves the best results among all methods for both sets. Its accuracy is 6% and 4% higher than that of SPEM on the ‘twilight zone’ set and the ‘superfamily’ set, respectively. For the most difficult ‘twilight zone’ set, PROMALS doubles the accuracy of the best stand-alone method (MUMMALS). Nevertheless, only about 40% of residues were correctly aligned on average by PROMALS for the ‘twilight zone’ set, suggesting that homology modeling of extremely divergent domains remains a difficult problem with regard to alignment quality.

3.1.3 Tests on PREFAB database

PREFAB 4.0 database consists of 1682 alignments averaging 45.2 sequences per alignment. Each alignment consists of two sequences with known structures and their homologs found by PSI-BLAST database searches. The reference structural alignment in each test is based on the consensus of FSSP (Holm and Sander, 1998b) and CE (Shindyalov and Bourne, 1998) alignments. We have used the performances of pairwise profile–profile alignments of PROMALS and SPEM as an indicator of their multiple alignment performances. The three methods that use additional data (PROMALS, SPEM and HHalign) give similar results, each with an average Q-score above 0.75. Their accuracies are higher than those on the two SCOP data sets with identity < 15% and the two SABmark sets, suggesting that PREFAB 4.0 is an easier testing data set. PROMALS, SPEM and HHalign are more accurate than MUMMALS by 4–6%. PROMALS is statistically more accurate (P-value < 0.000001) than SPEM and HHalign despite small differences in their average Q-scores. Results on PREFAB 4.0 confirm that alignment quality differences between methods become smaller on easier tests.

3.2 Reference-independent evaluation of methods

On our data sets of 1207 SCOP domain pairs with identity below 20%, we evaluated alignment quality using reference-independent scores that reflect the similarity between two structures compared according to aligned residue pairs in the test alignment (Pei and Grishin, 2006). These structural
similarity scores are DALI Z-score, TM-score, GDT-TS score, 3D-score, and two LiveBench contact scores (Table 2). Consistent with reference-dependent evaluation, PROMALS produces significantly higher average structural similarity scores than other methods. Used as a positive control, the structural alignment method DaliLite yields higher structural similarity scores than any sequence-based alignment method (Table 2). Interestingly, DaliLite alignments have the lowest reference-independent sequence similarity scores (sequence identity and BLOSUM62 scores). PROMALS also shows lower sequence similarity scores than several other sequence-based methods. These observations suggest that for distantly related sequences (sequence identity < 20%), sequence similarity scores, such as identity or BLOSUM62, may not correlate with alignment quality measured by 3D structural comparison, and maximization of these scores may not improve structural models based on sequence alignments.

Table 2. Reference-independent evaluation on 1207 representative SCOP40 domain pairs with identity < 20%

              Structural similarity                                                Sequence similarity
Method        DALI Z-score  GDT-TS    TM-score  3D-score  LBcona    LBconb    Identity  BLOSUM62
PROMALS       0.1562        0.3079    0.3675    0.3097    0.2692    0.3527    0.0868    0.1555
SPEM          0.1400        0.2886    0.3451    0.2893    0.2521    0.3319    0.0992    0.1724
HHalign       0.1334        0.2914    0.3488    0.2907    0.2469    0.3263    0.0874    0.1535
MUMMALS       0.1231        0.2570    0.3070    0.2563    0.2240    0.2909    0.0932    0.1651
ProbCons      0.1003        0.2324    0.2767    0.2307    0.2060    0.2670    0.0983    0.1719
MAFFT-linsi   0.1135        0.2485    0.2982    0.2467    0.2143    0.2820    0.0923    0.1632
MAFFT-ginsi   0.1126        0.2454    0.2960    0.2429    0.2152    0.2803    0.0972    0.1725
MUSCLE        0.0980        0.2297    0.2777    0.2266    0.1941    0.2535    0.0939    0.1686
ClustalW      0.0723        0.1916    0.2318    0.1876    0.1551    0.2030    0.0733    0.1344
DaliLite      0.4206        0.4936    0.5571    0.5289    0.4087    0.5110    0.0697    0.1268

The first three methods use extra data given by PSI-BLAST and PSIPRED. The last method (DaliLite) produces alignments based on comparison of known 3D structures. The other five are stand-alone methods. All sequence-based methods except HHalign construct multiple sequence alignments for target domain pairs with up to 48 homologs. HHalign constructs pairwise profile–profile alignments. Scores are calculated for pairwise alignments of target domain pairs extracted from multiple sequence alignments. PROMALS yields statistically higher structure-similarity scores than other sequence alignment methods (P-value < 0.000001) according to the Wilcoxon signed rank test. DaliLite structure-based sequence alignments have the lowest average sequence similarity scores.

3.3 Pairwise comparisons of alignment methods

To gain further understanding of the differences between alignment methods, we compared their performance on individual domain pairs from the SCOP sets (identity < 20%). Table 3 shows the number of pairs for which one method performs better than another method by a relatively large margin of 0.1 or more (measured by scaled TM-score or Q-score; both scores are between 0 and 1). Although PROMALS clearly leads by a large margin, it does not offer the best alignment in each and every case. For example, PROMALS gives a TM-score increase of 0.1 or more over SPEM on 196 alignments, while producing significantly inferior alignments for 109 pairs. Even stand-alone methods (MUMMALS, ProbCons, MAFFT, MUSCLE and ClustalW) outperform PROMALS by a TM-score of 0.1 or more on a small number of pairs (about 5%, i.e. 44–67 out of 1207 alignments). These comparisons suggest that alignments constructed by different methods can vary considerably for divergent sequences, and a method with an overall inferior performance is capable of generating better alignments in some cases. Careful inspection of alignments produced by several programs could help improve alignment quality for divergent sequences.

4 DISCUSSION

Judging by its performance, PROMALS is a definite advance compared to our previous alignment program MUMMALS (Pei and Grishin, 2006). MUMMALS derives probabilistic consistency from pairwise HMMs with built-in local structural information (secondary structure and/or solvent accessibility), and shows slight but significant improvement (a few percent) over other stand-alone methods such as ProbCons (Do et al., 2005) and MAFFT (Katoh et al., 2005). However, since no additional homologs are used, the local structure prediction implicitly performed by MUMMALS is of low accuracy compared to advanced methods such as PSIPRED (Jones, 1999). In contrast, PROMALS incorporates database searches and more accurate secondary structure prediction, and derives probabilistic consistency from profile–profile HMMs. Moreover, the HMM in PROMALS has a two-track structure (Karchin et al., 2003) that treats both amino acids and predicted secondary structures as emitted objects, while MUMMALS HMMs only emit amino acids. Owing to additional data sources and the advanced profile–profile HMM, PROMALS shows significant improvement over MUMMALS and other stand-alone methods, especially for highly divergent sequences.

The HMM in PROMALS adopts a numerical representation of sequence profile (see Supplementary Data for details) that successfully works in other profile-sequence or profile–profile
alignment methods such as PSI-BLAST (Altschul et al., 1997) and COMPASS (Sadreyev and Grishin, 2003). A recent comprehensive study also supported the effectiveness of this profile–profile scoring scheme (Wang and Dunbrack, 2004). To adequately use predicted secondary structures, we not only score them as emitted objects, but also use transition and emission probabilities that are dependent on predicted secondary structure types (Supplementary Data). Unlike HHalign, which treats each alignment as a classical profile HMM (Eddy, 1998), our HMM has a simpler structure similar to the classical 3-state pairwise HMM (Durbin et al., 1998). SPEM (Zhou and Zhou, 2005) does not use HMMs, but applies an empirical profile–profile alignment method (SP2) that identifies the optimal alignment path. In contrast, the HMM in PROMALS allows estimation of posterior probabilities of matches between positions. As a result, PROMALS has a probabilistic treatment of consistency similar to the one in ProbCons and MUMMALS, while simple consistency measures are used in SPEM, T-COFFEE (Notredame et al., 2000) and PCMA (Pei et al., 2003). PROMALS performs significantly better than SPEM and HHalign on difficult tests, suggesting the advantages of our profile–profile comparison scheme.

Table 3. Pairwise comparisons among alignment methods on 1207 SCOP domain pairs with identity < 20%

             PROMALS    SPEM       HHalign    MUMMALS    ProbCons   MAFFT-linsi  MAFFT-ginsi  MUSCLE     ClustalW
PROMALS      –          109/196    76/179     67/340     44/458     67/398       61/374       60/464     49/650
SPEM         199/81     –          140/148    108/281    71/389     98/324       99/326       82/400     43/574
HHalign      265/84     196/121    –          77/254     49/368     73/288       78/301       66/393     53/571
MUMMALS      685/286    648/305    627/333    –          38/169     111/138      82/128       82/227     59/431
ProbCons     726/263    693/277    674/303    201/62     –          172/80       162/76       162/169    110/336
MAFFT-linsi  718/276    680/295    662/325    239/128    133/188    –            85/98        93/196     60/387
MAFFT-ginsi  714/271    676/284    664/313    199/117    113/184    111/132      –            90/185     67/395
MUSCLE       783/239    741/255    727/279    401/83     302/138    295/110      327/106      –          75/288
ClustalW     858/193    840/209    819/228    649/55     559/103    585/70       600/76       449/100    –

Each off-diagonal cell has two numbers separated by a slash. The first number is the number of pairs where the alignment score of the method listed to the left is inferior to that of the method listed above (in a column) by 0.1 or more. The second number is the number of pairs where the score of the method listed to the left is better than that of the method listed above by 0.1 or more. The alignment quality scores used for comparison in the lower triangle and the upper triangle are Q-scores and weighted and scaled TM-scores, respectively. These scores are calculated based on results of multiple sequence alignments (target domain pairs plus up to 48 added homologs), with the exception of HHalign alignments, which are pairwise profile–profile alignments.

Since PROMALS relies on PSI-BLAST and PSIPRED to collect additional homologs and predicted secondary structures, it is considerably slower than stand-alone progressive methods. Our strategy for improving speed is to use different algorithms for easy and difficult alignments (Pei et al., 2003). By aligning highly similar sequences in a fast way, the number of sequences subject to the time-consuming steps (running PSI-BLAST, PSIPRED and consistency transformation) could be substantially reduced. For example, for 1207 SCOP domain pairs with up to 48 added homologs, the average number of sequences in an alignment is 41.6. After PROMALS aligns similar sequences with identity above 60% in the first stage, only 24 sequences on average require database searches, secondary structure prediction, and consistency transformation. For these tests, the median CPU time of PROMALS is 30 min per alignment, as compared to 67 min for SPEM (on Redhat Enterprise Linux 3, AMD Opteron 2.0 GHz). The stand-alone methods (MUMMALS, ProbCons, MAFFT, MUSCLE and ClustalW) are much faster, all with a median CPU time < 1 min.

As in our previous work (Pei and Grishin, 2006), we demonstrated the effectiveness of reference-independent evaluation of alignment quality in this study. First, we observed a good correlation between reference-dependent and reference-independent evaluations, suggesting that it may not be necessary to spend significant efforts on development of reference alignment databases. Second, reference-independent techniques solve the problem of reference alignment ambiguity, which becomes significant when similarity is low. Third, reference-independent evaluation helps answer general questions such as whether alignments can be further improved for sequences with low similarity, and whether such improvements will help structure modeling. For several structural similarity measures (GDT-TS, 3D-score, TM-score, LB contact scores), the ratio between the average score of PROMALS sequence-based alignment and the average score of DaliLite structure-based alignment is about 0.6 on domain pairs with < 20% sequence identity (Table 2), suggesting that we are still about 40% below what can be achieved with structures in hand. Notably, for these divergent sequences, DaliLite structural alignments have lower sequence similarity scores (identity and BLOSUM62 scores) than alignments produced by any sequence method, suggesting that scoring functions based only on amino acid sequence similarity may not be suitable for aligning divergent sequences for the purpose of homology modeling. This observation further justifies the use of alternative scoring schemes, such as the ones that recruit structural information.

ACKNOWLEDGEMENTS

We would like to thank Bong-Hyun Kim for the reference-independent evaluation routine, and Johannes Söding for providing the HHalign program. We would like to thank Lisa Kinch, Ruslan Sadreyev and James Wrabl for critical reading of the manuscript and helpful comments. This work was supported in part by NIH grant GM67165 to NVG.

Conflict of Interest: none declared.

REFERENCES

Altschul,S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Do,C.B. et al. (2005) ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res., 15, 330–340.
Durbin,R. et al. (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.
Eddy,S.R. (1998) Profile hidden Markov models. Bioinformatics, 14, 755–763.
Edgar,R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res., 32, 1792–1797.
Edgar,R.C. and Batzoglou,S. (2006) Multiple sequence alignment. Curr. Opin. Struct. Biol., 16, 368–373.
Ginalski,K. and Rychlewski,L. (2003) Detection of reliable and unexpected protein fold predictions using 3D-Jury. Nucleic Acids Res., 31, 3291–3292.
Henikoff,S. and Henikoff,J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA, 89, 10915–10919.
Holm,L. and Sander,C. (1998a) Dictionary of recurrent domains in protein structures. Proteins, 33, 88–96.
Holm,L. and Sander,C. (1998b) Touring protein fold space with Dali/FSSP. Nucleic Acids Res., 26, 316–319.
Jones,D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 195–202.
Karchin,R. et al. (2003) Hidden Markov models that use predicted local structure for fold recognition: alphabets of backbone geometry. Proteins, 51, 504–514.
Katoh,K. et al. (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res., 33, 511–518.
Lipman,D.J. et al. (1989) A tool for multiple sequence alignment. Proc. Natl. Acad. Sci. USA, 86, 4412–4415.
Murzin,A.G. (1998) How far divergent evolution goes in proteins. Curr. Opin. Struct. Biol., 8, 380–387.
Notredame,C. et al. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205–217.
O’Sullivan,O. et al. (2004) 3DCoffee: combining protein sequences and structures within multiple sequence alignments. J. Mol. Biol., 340, 385–395.
Pei,J. and Grishin,N.V. (2001) AL2CO: calculation of positional conservation in a protein sequence alignment. Bioinformatics, 17, 700–712.
Pei,J. and Grishin,N.V. (2006) MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information. Nucleic Acids Res., 34, 4364–4374.
Pei,J. et al. (2003) PCMA: fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics, 19, 427–428.
Phillips,A. et al. (2000) Multiple sequence alignment in phylogenetic analysis. Mol. Phylogenet. Evol., 16, 317–330.
Rychlewski,L. et al. (2003) LiveBench-6: large-scale automated evaluation of protein structure prediction servers. Proteins, 53 (Suppl. 6), 542–547.
Sadreyev,R. and Grishin,N. (2003) COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J. Mol. Biol., 326, 317–336.
Shindyalov,I.N. and Bourne,P.E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng., 11, 739–747.
Simossis,V.A. and Heringa,J. (2005) PRALINE: a multiple sequence alignment toolbox that integrates homology-extended and secondary structure information. Nucleic Acids Res., 33, W289–294.
Söding,J. (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics, 21, 951–960.
Sunyaev,S.R. et al. (1999) PSIC: profile extraction from sequence alignments with position-specific counts of independent observations. Protein Eng., 12, 387–394.
Tatusov,R.L. et al. (1994) Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks. Proc. Natl. Acad. Sci. USA, 91, 12091–12095.
Thompson,J.D. et al. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.
Thompson,J.D. et al. (1999) A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res., 27, 2682–2690.
Thompson,J.D. et al. (2000) DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Res., 28, 2919–2926.
Van Walle,I. et al. (2005) SABmark—a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics, 21, 1267–1268.
Wang,G. and Dunbrack,R.L., Jr. (2004) Scoring profile-to-profile sequence alignments. Protein Sci., 13, 1612–1626.
Wu,C.H. et al. (2006) The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res., 34, D187–191.
Zemla,A. et al. (1999) Processing and analysis of CASP3 protein structure predictions. Proteins, (Suppl. 3), 22–29.
Zhang,Y. and Skolnick,J. (2004) Scoring function for automated assessment of protein structure template quality. Proteins, 57, 702–710.
Zhou,H. and Zhou,Y. (2005) SPEM: improving multiple sequence alignment with sequence profiles and predicted secondary structures. Bioinformatics, 21, 3615–3621.
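
Illustrative note on the evaluation measures. The Q-score used in the reference-dependent tests (correctly aligned residue pairs in the test alignment divided by aligned residue pairs in the reference alignment) and the 0.1-margin counts reported in Table 3 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the built-in SABmark or PREFAB scoring programs used in the paper: the function names (aligned_pairs, q_score, margin_counts), the gap character and the toy sequences are all hypothetical and chosen only for illustration.

```python
from typing import Iterable, Set, Tuple


def aligned_pairs(row_a: str, row_b: str, gap: str = "-") -> Set[Tuple[int, int]]:
    """Residue-index pairs (i, j) that share a column in two gapped rows.

    Indices count residues only (gaps are skipped), so the same pair is
    comparable between a test alignment and a reference alignment.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    pairs = set()
    i = j = 0
    for char_a, char_b in zip(row_a, row_b):
        a_is_residue = char_a != gap
        b_is_residue = char_b != gap
        if a_is_residue and b_is_residue:
            pairs.add((i, j))
        if a_is_residue:
            i += 1
        if b_is_residue:
            j += 1
    return pairs


def q_score(test: Tuple[str, str], reference: Tuple[str, str]) -> float:
    """Fraction of reference-aligned residue pairs reproduced by the test."""
    ref_pairs = aligned_pairs(*reference)
    if not ref_pairs:
        raise ValueError("reference alignment has no aligned residue pairs")
    return len(aligned_pairs(*test) & ref_pairs) / len(ref_pairs)


def margin_counts(scores_a: Iterable[float], scores_b: Iterable[float],
                  margin: float = 0.1) -> Tuple[int, int]:
    """Table 3 style cell: (cases where A is worse than B by >= margin,
    cases where A is better than B by >= margin)."""
    worse = better = 0
    for a, b in zip(scores_a, scores_b):
        if b - a >= margin:
            worse += 1
        elif a - b >= margin:
            better += 1
    return worse, better


# Toy example (hypothetical sequences): the reference aligns 3 residue
# pairs and the test alignment reproduces 2 of them.
reference = ("ACDE-", "A-DEF")
test = ("ACDE-", "-ADEF")
print(round(q_score(test, reference), 3))
```

Representing each alignment as a set of residue-index pairs keeps the score independent of how gap columns are laid out, which is why only the shared pairs matter in the numerator.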

Journal

Bioinformatics – Oxford University Press

Published: Jan 31, 2007
