MICAN-SQ: a sequential protein structure alignment program that is applicable to monomers and all types of oligomers

Abstract

Motivation
Protein structure alignment is an important tool for understanding the evolutionary processes and physicochemical properties of proteins. Important targets of structure alignment include not only monomeric but also oligomeric proteins, which sometimes involve domain swapping or fusions. Although various protein structure alignment programs have been developed, no method is applicable to any protein pair, regardless of the number of chain components and oligomeric states, while retaining the sequential restriction that structurally equivalent regions must be aligned in the same order along the protein sequences.

Results
In this paper, we introduce a new sequential protein structure alignment algorithm, MICAN-SQ, which is applicable to protein structures in all oligomeric states. In particular, MICAN-SQ allows the complicated structural alignments of proteins with domain swapping or fusion regions. To validate MICAN-SQ, alignment accuracies were evaluated using curated alignments of monomers and examples of domain swapping, and compared with those of pre-existing protein structure alignment programs. The results of this study show that MICAN-SQ has superior accuracy and robustness in comparison with previous programs and requires only modest computation times. We also demonstrate that MICAN-SQ correctly aligns very large complexes and fused proteins. The present computations warrant the consideration of MICAN-SQ for studies of evolutionary and physicochemical properties of monomeric structures and all oligomer types.

Availability and implementation
The MICAN program was implemented in C. The source code and executable file can be freely downloaded from http://www.tbp.cse.nagoya-u.ac.jp/MICAN/.

Supplementary information
Supplementary data are available at Bioinformatics online.

1 Introduction
Analyses of protein structures are highly informative about the evolutionary, functional and physicochemical properties of proteins. In particular, protein structure alignments can be used to identify spatially equivalent residue pairs for given protein pairs, leading to improved understanding of evolutionary (Alva et al., 2015; Orengo et al., 1997) and spatial properties of proteins (Hou et al., 2003; Minami et al., 2014), inferring molecular functions (Standley et al., 2010), and facilitating structure predictions (Joo et al., 2007), de novo design (Huang et al., 2016) and drug discovery (Okuno et al., 2015). Although many programs for protein structure alignment have been developed, further improvements are required to address various limitations. Among these, agreement with gold-standard alignments is one of the important issues. It is known that the results of existing structure alignment programs do not agree well with the gold standard (Mayr et al., 2007; Wang et al., 2013). Because biologically or physically meaningful alignments are key for inferring evolutionary relationships and understanding physicochemical properties of protein structures, structure alignment programs that are highly consistent with gold standards are eagerly awaited. Additional limitations are present in methodologies for aligning proteins comprising multiple chains. A few programs have been developed for structural comparisons of protein complexes, and these programs can be roughly divided into two classes. The first class of programs assigns each subunit in one complex to a subunit in the other, and the resulting subunit–subunit correspondences impose sequential restrictions in which structurally equivalent regions must be aligned in the same order along the protein sequences of each subunit pair [MMalign (Mukherjee and Zhang, 2009), SCPC (Koike and Ota, 2012)].
The second class of programs involves non-sequential alignments [TopMatch (Sippl and Wiederstein, 2012) and MICAN (Minami et al., 2013)], in which structurally equivalent residues can be aligned in differing orders along the protein sequences. Although both approaches identify valid structural relationships, each has practical limitations. Programs of the first class fail to align two proteins if one chain of a protein corresponds structurally with multiple chains of the other (one-to-many correspondences). This limitation is particularly problematic when two or more protein chains form oligomers by exchanging an identical structural element between subunits (domain swapping). Structure alignments of a monomeric protein [the mannose binding protein, Protein Data Bank (PDB) ID: 1msb] and a structurally similar dimer with domain swapping (the platelet activator convulxin: 1umr) are shown in Supplementary Figure S1. In this example, the overall structure of the monomer resembles that of the dimer, and the monomer is associated with both chains of the dimer. Because these programs do not allow structural correspondence across multiple chains, they cannot correctly detect matches of swapped regions. In addition, when two protein complexes include more than two chains, the combinatorial problem of subunit–subunit correspondences leads to long computation times. Programs of the second class employ non-sequential structure alignments. By ignoring chain connectivity, these programs can identify structural correspondences across multiple chains and thus overcome the problem of one-to-many correspondences. However, because chain connectivities are generally maintained in homologous proteins, disregarding chain connectivities can produce odd alignments that are not biologically relevant. Herein, we propose a new class of protein structure alignment program, MICAN-SQ.
Unique features of MICAN-SQ include (i) acceptance of protein structures in all oligomeric states and (ii) imposition of the sequential restriction within the alignment of each chain pair. The algorithm is based on the non-sequential protein structure alignment program MICAN (Minami et al., 2013). Previously, we showed that MICAN reproduces manually curated sequential alignments of reference monomeric proteins better than other programs, although it does not specialize in sequential structure alignments. This result motivated us to introduce the sequential restriction within the alignment of each chain pair into the MICAN algorithm for more accurate alignment. To assess alignment quality, we employed databases of manually curated alignments of monomeric proteins that we used previously [MALIDUP (Cheng et al., 2007) and MALISAM (Cheng et al., 2008)], and a dataset of structure pairs each comprising a monomer and its structurally similar domain-swapped structure [3DSwap-PS (Huang et al., 2012)]. We evaluated consistency between the alignments of MICAN-SQ and those in the databases, and compared accuracies with those of publicly available programs and those of MICAN in its non-sequential modes. We demonstrate that MICAN-SQ has greater accuracy on the benchmark sets than the other programs. These results warrant consideration of MICAN-SQ as a useful protein structure alignment program that can be used to identify biologically or physically relevant alignments of protein pairs in all oligomeric states.

2 Materials and methods

2.1 Overview of the MICAN algorithm
In the previous study, two non-sequential alignment modes were implemented in MICAN (Minami et al., 2013). Firstly, the ‘permitting ReWiring mode (RW)’ allows alignment of structurally equivalent Secondary Structure Elements (SSEs) in different orders along the protein sequences (rewiring of loops) while requiring aligned SSEs to have the same N- to C-terminal direction. MICAN with this mode is referred to as MICAN-RW.
Secondly, in addition to RW, the ‘permitting Rewiring and Reversing mode (RR)’ allows alignment of structurally equivalent SSEs in the reverse direction of N- and C-termini. MICAN with this mode is referred to as MICAN-RR. In this paper, we introduce the ‘restricting SeQuential alignment mode (SQ)’ into MICAN, in which structurally equivalent SSEs must be aligned in the same order along the protein sequences. MICAN with this mode is called MICAN-SQ, in which sequential constraints are imposed only within the alignment of each chain pair. In all three modes, the alignment algorithm comprises the following three steps: (i) alignment of SSEs, (ii) alignment of residues and (iii) ranking of alignments. These three steps are briefly described in Supplementary Note S1 and are illustrated in a flowchart in Supplementary Figure S2. Because MICAN-SQ differs from the other modes only at step (ii), only this step is described below. The optimized score function (SSE match-weighted TMscore; sTMscore) and several quantities reported by the program, including global and local similarity measures, are also described in Supplementary Notes S1 and S2, respectively.

2.2 Alignment of residues in MICAN-SQ
In this step, which starts from the initial superimposition obtained by the alignment of SSEs, one-to-one assignments of Cα atoms are generated in a stepwise manner. We define segments as contiguous pairs of Cα atoms. Assignments start from the segment comprising Cα atoms with the smallest distances and the same geometry in the superimposition. The sequential condition within the alignment of each chain pair is maintained. Here, we assume that the two protein structures for comparison correspond to a query (Q) and a template (T) containing $n_Q$ and $n_T$ chains, respectively.
If the numbers of residues of the $i$-th chain of the query and of the template are $N_Q(i)$ and $N_T(i)$, respectively, then the total numbers of residues of the query ($N_Q$) and the template ($N_T$) are

$$N_Q = \sum_{i=1}^{n_Q} N_Q(i), \qquad N_T = \sum_{i=1}^{n_T} N_T(i).$$

For a given superimposition, an $N_Q \times N_T$ matrix, $M$, is constructed. The matrix element $M_{i,j}$ is designed so that it is large if the distance between the Cα atoms of query residue $i$ and template residue $j$ is small and the two residues have similar local-backbone geometries. The definition of the matrix is given in Supplementary Note S3. According to matrix $M$, we identify segments, which are contiguous residue pairs that are superimposed with small distances and similar local-backbone geometries. A segment of length $l$ is described as a set of contiguous residue pairs

$$\{(\alpha, \beta), (\alpha+1, \beta+1), \ldots, (\alpha+l-1, \beta+l-1)\},$$

where $(\alpha, \beta)$ indicates that residue $\alpha$ of the query is paired with residue $\beta$ of the template. To quantify the similarity of a segment, we introduce a score for the segment $A_k$ of length $l$, where $k$ is the index of the segment. The score $S(A_k)$ is defined as

$$S(A_k) = \sum_{m=0}^{l-1} M_{\alpha+m,\,\beta+m}.$$

To eliminate segments that are too short or too long and are rarely seen among manually curated alignments (Minami et al., 2013), only segments that satisfy the following two conditions are taken into account:

$$M_{\alpha+m,\,\beta+m} > M_{cut} \quad \text{for all } m \in \{0, 1, \ldots, l-1\}, \qquad l \geq 3,$$

where $M_{cut}$ is the cutoff value. Based on the segments defined above, we search for a set of non-overlapping segments that maximizes

$$S_{tot} = \sum_{k=1}^{n} S(A_k)$$

under the sequential constraint within the alignment of each chain pair, where $n$ is the number of segments in the alignment. To find the best solution, a greedy algorithm is used (Supplementary Fig. S3). Firstly, the segment with the highest score among all segments is selected and recorded as $A_1$.
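As an illustration only, the segment definition and the greedy selection of this subsection can be sketched in Python as follows. This is a minimal sketch, not the MICAN-SQ implementation (which is written in C). The function names (`find_segments`, `segment_score`, `greedy_select`) and the per-residue chain labels `q_chain` and `t_chain` are hypothetical. The sketch further assumes that a segment does not cross a chain boundary, that $M_{cut} \geq 0$ so that only maximal diagonal runs need to be considered, and that residues are indexed over the concatenated chains. Entries that overlap a selected segment, or that would break the sequential order within the same chain pair, are set to zero before the next segment is chosen.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[int, int, int]  # (query start, template start, length)


def find_segments(M: List[List[float]], q_chain: Sequence[int],
                  t_chain: Sequence[int], m_cut: float,
                  min_len: int = 3) -> List[Segment]:
    """Enumerate maximal diagonal runs with M > m_cut inside one chain pair."""
    nq, nt = len(M), len(M[0])
    segments = []
    for i in range(nq):
        for j in range(nt):
            if M[i][j] <= m_cut:
                continue
            # Skip cells that merely continue a run started earlier.
            if (i > 0 and j > 0 and M[i - 1][j - 1] > m_cut
                    and q_chain[i - 1] == q_chain[i]
                    and t_chain[j - 1] == t_chain[j]):
                continue
            length = 1
            while (i + length < nq and j + length < nt
                   and M[i + length][j + length] > m_cut
                   and q_chain[i + length] == q_chain[i]
                   and t_chain[j + length] == t_chain[j]):
                length += 1
            if length >= min_len:
                segments.append((i, j, length))
    return segments


def segment_score(M: List[List[float]], seg: Segment) -> float:
    """S(A_k): sum of matrix elements along the segment diagonal."""
    a, b, length = seg
    return sum(M[a + m][b + m] for m in range(length))


def greedy_select(M: List[List[float]], q_chain: Sequence[int],
                  t_chain: Sequence[int], m_cut: float,
                  s_min: float) -> List[Segment]:
    """Pick segments greedily; order is enforced only within a chain pair."""
    M = [row[:] for row in M]  # work on a copy
    chosen: List[Segment] = []
    while True:
        candidates = find_segments(M, q_chain, t_chain, m_cut)
        if not candidates:
            break
        best = max(candidates, key=lambda s: segment_score(M, s))
        if segment_score(M, best) < s_min:
            break
        chosen.append(best)
        a, b, length = best
        a_end, b_end = a + length - 1, b + length - 1
        cq, ct = q_chain[a], t_chain[b]
        for i in range(len(M)):
            for j in range(len(M[0])):
                overlap = a <= i <= a_end or b <= j <= b_end
                same_pair = q_chain[i] == cq and t_chain[j] == ct
                crossing = same_pair and ((i < a and j > b_end)
                                          or (i > a_end and j < b))
                if overlap or crossing:
                    M[i][j] = 0.0
    return chosen


# Toy example: an 8-residue query chain against 8 template residues.
M = [[0.0] * 8 for _ in range(8)]
for m in range(4):
    M[m][4 + m] = 1.0      # query 0-3 matches template 4-7
    M[4 + m][m] = 0.9      # query 4-7 matches template 0-3
query_chains = [0] * 8
dimer = greedy_select(M, query_chains, [0, 0, 0, 0, 1, 1, 1, 1], 0.5, 1.0)
monomer = greedy_select(M, query_chains, [0] * 8, 0.5, 1.0)
```

In the toy example the two matched regions appear in opposite order in the query and the template. When the template residues belong to two chains, both segments are retained because they lie in different chain pairs; when the template is a single chain, the second segment violates the sequential order and is discarded.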
To select a subsequent segment that does not overlap with $A_1$ and satisfies the sequential condition, the matrix $M$ is modified so that matrix elements that overlap with $A_1$ or violate the sequential rule are set to zero. The matrix elements treated as violating the sequential rule are restricted to the chain pair that includes segment $A_1$. Secondly, the segment with the highest score among all segments of the modified matrix is selected and recorded as $A_2$. The matrix is further modified in the same manner to avoid overlap with the selected segments ($A_1$ and $A_2$) and to maintain the sequential restriction. This procedure is repeated until the highest score of all remaining segments on the modified matrix is smaller than $S_{min}$ (the cutoff parameter). Finally, a set of non-overlapping segments that approximately maximizes the total score $S_{tot}$ is obtained.

2.3 Test-sets
One of the main purposes of structure alignment is to obtain biologically or physically relevant alignments. To evaluate alignment accuracy in this sense, we employed the two manually curated datasets MALIDUP (Cheng et al., 2007) and MALISAM (Cheng et al., 2008). MALIDUP contains 241 pairwise structure alignments for distantly related domains that originated from internal duplication within a protein chain. In contrast, MALISAM contains 130 pairwise structure alignments for structural analogs, which were carefully selected based on evidence that the structural similarities are not a consequence of evolution. Because analogy is more difficult to detect than homology, MALISAM is a more difficult benchmark set than MALIDUP. The alignments in these datasets were manually curated, taking evolutionary and functional relationships and geometric similarity of structures into consideration. Accordingly, these alignments are suitable as reference alignments. Note that MALIDUP and MALISAM collect only monomer pairs that are sequentially aligned.
We also evaluated alignment accuracies of protein pairs that include multiple chains and domain swapping using 3DSwap-PS (Huang et al., 2012). 3DSwap-PS is a curated collection of protein structures with domain swapping based on pre-existing datasets and related literature. We used the single-domain group subset of 3DSwap-PS, which is a set of protein pairs each comprising a domain-swapped dimer and its structurally corresponding single-domain monomer. This subset contains 55 bona fide domain-swapped structures and 223 quasi domain-swapped structures.

2.4 Programs to be compared
We compared MICAN-SQ with the popular sequential structure alignment programs DaliLite (Holm and Sander, 1998), TMalign (Zhang and Skolnick, 2005) and DeepAlign (Wang et al., 2013), which are specialized for the alignment of monomer pairs. DaliLite version 3.3 was downloaded from http://www.ebi.ac.uk/Tools/structure/dalilite/, the 06/01/2014 version of TMalign was downloaded from http://zhanglab.ccmb.med.umich.edu/TM-align/ and DeepAlign version 1.13 was downloaded from http://ttic.uchicago.edu/~jinbo/software.htm. We also compared MICAN-SQ with MMalign (Mukherjee and Zhang, 2009) and TopMatch (Sippl and Wiederstein, 2012), which can accept structures of protein complexes. MMalign was downloaded from http://zhanglab.ccmb.med.umich.edu/MM-align/ and TopMatch was downloaded from https://www.came.sbg.ac.at/app_download.php?app=topmatch. In addition, we evaluated MICAN-RW and MICAN-RR to examine the effect of the sequential restraint introduced into MICAN-SQ. The three modes are integrated into MICAN and can be specified by a command line option.

3 Results and discussion

3.1 Performance on MALIDUP and MALISAM
Structure alignments were performed with eight structural alignment methods, including MICAN-SQ, and were assessed with reference to MALIDUP and MALISAM.
For each alignment, we calculated Q-scores as the percentage agreement with reference alignments, defined by the number of correctly aligned residue pairs divided by the total number of aligned residue pairs in a reference alignment multiplied by 100. Averaged Q-scores for each method are shown in Table 1, and the box-plots of Q-scores are shown in Supplementary Figure S4. For both reference cases, MICAN-SQ produced the highest average Q-scores among assessed methods. Scatter plots of Q-scores for all pairs of methods relative to MALIDUP and MALISAM are indicated in Supplementary Figures S5 and S6, respectively. P-values of differences were estimated using Wilcoxon Signed-Rank tests and are also shown in the figures. With the exception of DeepAlign in MALIDUP, MICAN-SQ significantly outperformed the other alignment tools, as indicated by highly significant P-values. With reference to MALISAM, outstanding performance against other methods was clear, suggesting that MICAN-SQ can correctly align analogous protein structures, thus surmounting previous difficulties. These results demonstrate that MICAN-SQ generates structural alignments that are highly consistent with biologically and physically relevant alignments for monomer pairs. Table 1. 
The performance of eight methods on MALIDUP, MALISAM and 3DSwap-PS; ⟨Q-score⟩, ⟨SID-score⟩ and ⟨Nali⟩ represent average Q-scores, SID-scores and numbers of aligned residues over all protein pairs in the dataset, respectively

Dataset    Measure        MICAN-SQ DeepAlign  DaliLite   TMalign  MICAN-RW  MICAN-RR   MMalign  TopMatch
MALIDUP    ⟨Q-score⟩          90.2      88.2      85.3      83.1      88.0      83.3      81.1      76.2
           ⟨Nali⟩             85.4      83.8      87.0      88.3      86.0      87.5      87.1      79.6
           Time (s)           18.3      37.6     129.9       7.4      17.9      18.1      25.7      21.8
MALISAM    ⟨Q-score⟩          76.5      67.3      67.2      58.6      68.6      49.7      58.9      48.0
           ⟨Nali⟩             61.7      59.3      61.0      64.6      62.4      64.3      62.4      57.4
           Time (s)            5.4      11.8      24.5       2.4       5.3       5.4       8.3      11.3
3DSwap-PS  ⟨SID-score⟩        80.5      67.1      61.4      64.6      80.5      79.9      65.7      75.7
           ⟨Nali⟩            132.2     107.8     107.9     111.4     132.6     134.2     112.8     125.5
           Time (s)           64.6      73.5     976.0      13.2      51.5      63.3     294.6      45.0

Note: CPU times for all calculations are listed in the ‘Time (s)’ rows. Running CPU times were measured on an Intel Core i7 3.40 GHz machine. The best results in each row are shown in bold.

As shown in Table 1, Q-scores of MICAN-SQ, RW and RR differed, especially when MALISAM was used as a reference. Because the three modes of MICAN used the same scoring function and alignment algorithm, differences among the three modes were exclusively reflected by the restrictions imposed in each mode.
In particular, large differences in Q-scores between the three modes were found for the protein pair of aldehyde reductase (SCOP ID: d1o02a_) and the hypothetical protein MT938 (SCOP ID: d1ihna1), which were taken from the MALISAM test set and are considered structural analogs. The alignment plots and structure superimpositions obtained by the three modes are shown in Supplementary Figure S7. Reference alignments are indicated by gray plots and contain two α helices and four β strands. For MICAN-SQ, RW and RR, the structurally aligned region covered all six SSEs that are identified in the reference. However, the alignment patterns differed between modes, and agreement with the reference alignment decreased with relaxation of restrictions from SQ to RW and from RW to RR, with Q-scores for SQ, RW and RR of 96.6, 71.2 and 1.7, respectively. Conversely, structural similarity increased from SQ to RW and from RW to RR, with TM-scores for SQ, RW and RR of 0.476, 0.484 and 0.520, respectively. This example illustrates the importance of the sequential restriction for generating biologically or physically relevant alignments of analogous proteins.

3.2 Performance on 3DSwap-PS
We next evaluated the performance of MICAN-SQ in alignments of multimers using 3DSwap-PS, and made comparisons with the other seven alignment programs. Unlike MALIDUP and MALISAM, 3DSwap-PS does not provide reference alignments. Instead, as a measure of alignment accuracy, we used the sequence identity (SID) in the structure alignment produced by each method. Because each protein pair in 3DSwap-PS comprises identical or closely homologous proteins (average SID of 53.9%), sequence identities for the entire alignment region, which may span multiple chains, should be high in successful structure alignments. The SID score is defined as the number of matches in the structure alignment divided by the number of matches in the sequence alignment, multiplied by 100.
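The two agreement measures used in this section, the Q-score and the SID score, can be sketched directly from their definitions. The following Python fragment is a minimal illustration under stated assumptions rather than the evaluation code used in this study: the function names are hypothetical, alignments are represented as collections of (query index, template index) pairs, a "match" in the SID score is taken to mean an aligned pair of identical residue types, and the template sequence is assumed to be indexed consistently with the alignment pairs.

```python
from typing import Iterable, Tuple

Pair = Tuple[int, int]  # (query residue index, template residue index)


def q_score(predicted: Iterable[Pair], reference: Iterable[Pair]) -> float:
    """Percentage of reference residue pairs reproduced by the prediction."""
    ref = set(reference)
    if not ref:
        raise ValueError("reference alignment is empty")
    return 100.0 * len(set(predicted) & ref) / len(ref)


def count_matches(pairs: Iterable[Pair], query_seq: str,
                  template_seq: str) -> int:
    """Number of aligned pairs whose residue types are identical."""
    return sum(1 for i, j in pairs if query_seq[i] == template_seq[j])


def sid_score(structure_pairs: Iterable[Pair], sequence_pairs: Iterable[Pair],
              query_seq: str, template_seq: str) -> float:
    """Matches in the structure alignment relative to the sequence alignment."""
    n_seq = count_matches(sequence_pairs, query_seq, template_seq)
    if n_seq == 0:
        raise ValueError("sequence alignment contains no matches")
    n_struct = count_matches(structure_pairs, query_seq, template_seq)
    return 100.0 * n_struct / n_seq


# Toy data: four reference pairs, three predicted pairs (one misplaced).
reference = [(0, 0), (1, 1), (2, 2), (3, 3)]
predicted = [(0, 0), (1, 1), (2, 3)]
q = q_score(predicted, reference)
sid = sid_score(predicted, reference, "ACDE", "ACDF")
```

Note that the Q-score is normalized by the size of the reference alignment, so aligning additional residues outside the reference neither raises nor lowers it.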
To perform sequence alignments, the two sequences were taken from the monomer and from one chain of the domain-swapped dimer. The Needleman–Wunsch dynamic programming algorithm was used with BLOSUM62, and with gap opening and extension penalties of –11 and –1, respectively. Note that we disregard domain swapping in the sequence alignment. Average SID scores and numbers of aligned residues (Nali) for each method are presented in Table 1, and box plots of SID scores are shown in Supplementary Figure S8. We also present scatter plots of SID scores for all pairs of methods with P-values from Wilcoxon Signed-Rank tests in Supplementary Figure S9. MICAN-SQ produced much higher SID scores than DeepAlign, DaliLite and TMalign, which were specifically designed for alignments of monomers. In particular, the average SID score of MICAN-SQ was about 13 points higher than that of DeepAlign, which performed the best among the three (80.5 versus 67.1; Table 1). Differences in SID scores between MICAN-SQ and these programs are significant (P-value < 0.05), and the weaker performance of the three programs was mostly due to unsuitable alignments of swapped regions. Also, the number of aligned residues produced by the three programs was much smaller than that of MICAN-SQ: differences in Nali are at least 20. These results contrast with those from the MALIDUP and MALISAM computations, in which all programs generated comparable Nali values (Table 1). To demonstrate the advantages of MICAN-SQ, we compared the alignment of a camelid heavy-chain variable domain (PDB ID: 1qd0) and the corresponding domain-swapped dimer (1sjv; Fig. 1) by MICAN-SQ with that by DeepAlign. SID scores of MICAN-SQ and DeepAlign were 91.4 and 75.3, respectively. The alignment by DeepAlign was restricted to 1qd0 and chain A of 1sjv and failed to indicate the structural correspondence between 1qd0 and chain B of 1sjv (orange rectangle in Fig. 1B).
As a result, the alignment does not cover the C-terminal β strands of 1qd0, which is the swapped region in the left panel of Figure 1A. In contrast, the alignment by MICAN-SQ covered almost the entire region of 1qd0, which is associated with chains A and B of 1sjv. The validity of the alignment by MICAN-SQ was supported by higher degrees of sequence matching, as shown in Figure 1C. This example illustrates the advantage of MICAN-SQ over programs that are specific to monomer pairs.

Fig. 1. A protein pair comprising a domain-swapped dimer (1sjv) and its structurally similar monomer (1qd0) in 3DSwap-PS. (A) The structures of 1sjvAB and 1qd0A and their superposition by MICAN-SQ. (B) Alignment plots by MICAN-SQ, DeepAlign and TopMatch. The horizontal and vertical axes represent residue numbers of 1sjvAB and 1qd0, respectively. (C) Alignments of the regions enclosed by the cyan and orange rectangles in (B); colons indicate alignment matches

Compared with MMalign, MICAN-SQ achieved significant improvements (P-value < 0.05) in SID scores. Although MMalign was designed for alignments of multimers, SID scores of MICAN-SQ and MMalign differed by up to 15, and this difference was comparable to that between MICAN-SQ and the programs for monomer pairs. We confirmed that MMalign with default settings fails to generate alignments across multiple chains, and only a special option of MMalign allowed such alignments. Indeed, the Nali obtained by MMalign is almost the same as that of TMalign (Table 1).
MICAN-SQ was superior to TopMatch, with a significant improvement in average SID score of about five points (P-value = $7.0 \times 10^{-19}$; Supplementary Fig. S9). Because TopMatch makes alignments across multiple chains, it is not immediately clear why MICAN-SQ is superior to TopMatch. However, in evaluations of MALIDUP and MALISAM, MICAN-SQ was much more consistent with reference alignments than TopMatch. Hence, the same trend was expected in evaluations using 3DSwap-PS, and it was observed in a comparison of the structure alignments of 1qd0 and 1sjv by MICAN-SQ and TopMatch (Fig. 1). Although the resulting alignment plots (Fig. 1B) are similar, the corresponding SID scores differed substantially and were 91.4 and 75.3 for MICAN-SQ and TopMatch, respectively. We noticed that the alignments differed only in the region enclosed by cyan rectangles in Figure 1B, and the alignments of the corresponding regions are shown in Figure 1C. In this region of the MICAN-SQ alignment, the number of matches was 14 (among 26 sites), whereas the alignment of TopMatch was shifted by one residue from that of MICAN-SQ and thus contained only three matches. The performances of MICAN-SQ, RW and RR were statistically indistinguishable from each other (P-value > 0.05; Supplementary Fig. S9). In evaluations of MALIDUP and MALISAM, greater evolutionary closeness of protein pairs led to increased similarity of the alignments from the three modes of MICAN. The protein pairs in 3DSwap-PS are more closely related: the average sequence identities of MALIDUP, MALISAM and 3DSwap-PS are 18.8%, 8.5% and 53.9%, respectively. With greater evolutionary distance between protein pairs, differences between MICAN-SQ and MICAN-RW/RR become significant. It should be noted that MICAN-SQ is applicable not only to monomer-dimer pairs but also to any protein pairs with domain swapping, regardless of the number of chain components.
Here, we performed exhaustive all-versus-all structure comparisons of biological assemblies in the PDB using MICAN-SQ and found numerous examples beyond monomer-dimer pairs. Figure 2 shows such an example: a structure alignment for a dimer-dimer pair with domain swapping [Methylmalonyl-CoA epimerase (3rmu) and the uncharacterized protein Atu1953 (2pjs)]. This alignment shows that chain A (or B) of 3rmu is structurally associated with chains A and B of 2pjs, and vice versa, with domain swapping across multiple chains. To our knowledge, these structural similarities have not been previously described. We also infer functional and evolutionary aspects of Atu1953 below.

Fig. 2. A protein pair comprising a domain-swapped dimer (2pjs) and its structurally similar non-swapped dimer (3rmu). (A) The structure of 3rmu; chains A and B are colored in red and blue, respectively. (B) The structure of 2pjs; chains A and B are colored in yellow and green, respectively. (C) The structure superimposition by MICAN-SQ. (D) Alignment plot by MICAN-SQ

3.3 Computational speed
Rapid computational speeds are required for large-scale database searches and for structure alignment programs. The CPU times spent to calculate all protein pairs in each dataset are listed in Table 1 and were measured on an Intel Core i7 3.40 GHz machine. TMalign was the fastest program in all cases, and was outstanding on 3DSwap-PS. This is partly because TMalign reads only the first chain in the PDB file.
Not counting the other MICAN modes, MICAN-SQ was the second fastest program on MALIDUP and MALISAM, and the third fastest on 3DSwap-PS, indicating that MICAN-SQ is a relatively fast program among those compared in the present study. Hence, considering both alignment accuracy and computational speed, MICAN-SQ has considerable utility in structural comparisons of monomers and oligomers.

3.4 Examples and applications
In addition to the examples shown above, MICAN-SQ can be applied to multiple types of protein pairs. The first example is the large oligomer pair of bacterioferritin (3bkn) and apoferritin (2za6), illustrated in Figure 3A and B. Both of these oligomers comprise 24 identical chains of 160–170 amino acids and form spherical particles. Previously, TopMatch was used to align this complex pair and produced a structure match with Nali = 3649 and an RMSD of 2.6 Å in superpositions (Sippl and Wiederstein, 2012). As shown in Figure 3D, MICAN-SQ correctly selected the chain combinations and aligned them, with an Nali value of 3696 and an RMSD value of 2.6 Å (TMscore = 0.9, Fig. 3C). This result is slightly better than that of TopMatch.

Fig. 3. Structure alignment for the large oligomer pair apoferritin (2za6) and bacterioferritin (3bkn). (A) The structure of 2za6. (B) The structure of 3bkn. The green regions in (A) and (B) indicate one of the single chains in the oligomers. (C) Structure superimposition by MICAN-SQ. (D) The alignment plot by MICAN-SQ

The second example is a protein pair with domain fusion (or fission).
Shown in Figure 4 are a putative thiamine biosynthesis protein (1vk8, A and B chains) and the protein YKOF of unknown function (1s7hA). The former is a dimeric single-domain protein and the latter is a monomer that corresponds to the fused form of the former protein. In the structure alignments, MICAN-SQ detected structural similarities (Fig. 4D) and showed structural associations of both 1vk8A and 1vk8B with 1s7hA, with a TMscore of 0.72 and an Nali value of 159 (86% of residues in 1s7hA). In contrast, MMalign aligned these proteins with a TMscore of 0.38 and an Nali value of 80 (half of 1s7hA). This example demonstrates that MICAN-SQ successfully performs structure comparisons of proteins with different numbers of chains, and indicates the importance of alignments across multimers. Similarly, a trimeric structure of a single β-propeller domain (2bt9, A, B and C chains) and a corresponding structure of the fused form (1ofz) were aligned and superimposed (Supplementary Fig. S10). Because oligomeric single-domain proteins are considered remnants of ancient proteins (Alva et al., 2015; Blaber et al., 2012; Lupas et al., 2001), their alignments against the fused form provide important evolutionary insights.

Fig. 4. Structure alignment for a dimeric form of a single domain protein and the fused structure. (A) Structure of a putative thiamine biosynthesis protein [1vk8, A (red) and B (blue) chains]. (B) Structure of the protein YKOF of unknown function (1s7hA). (C) Structure superposition by MICAN-SQ. (D) Alignment plot by MICAN-SQ

The final example is the glyoxalase/bleomycin resistance protein (BRP)/dihydroxybiphenyl dioxygenase family, which includes the proteins shown in Figure 2. This protein family contains a wide variety of proteins with diverse functions and low sequence similarity, including glyoxalase I, extradiol dioxygenases and antibiotic resistance proteins (Bergdoll et al., 1998). Many of these proteins are metalloenzymes with three or four essential metal ion-binding residues (commonly His and Glu residues) in the centers of β sheets. However, BRP is a non-metalloenzyme of this family that lacks metal ion-binding residues and binds and sequesters the anti-cancer agent bleomycin. Most members of this family show essentially the same spatial arrangement of four βαβββ sub-domains, but exhibit diversity in chain connectivity and assembly states, reflecting domain fusion and swapping (Suttisansanee et al., 2011). Here, we examined whether MICAN-SQ correctly classifies these widely varied structures and explored previously unknown structures in this family. We performed pairwise structural alignments of representative structures of this family in the Evolutionary Classification of protein Domains database (Cheng et al., 2014) using MICAN-SQ, and generated the dendrogram shown in Figure 5 (a detailed description of the dataset and procedure is presented in Supplementary Note S4). The characterized structures were roughly divided into monomers and dimers and were then further divided into sub-clusters (A–G in Fig. 5), according to chain connectivity and assembly states.
Subsequent visual inspection confirmed that the automatic structural classification by MICAN-SQ was accurate, because all structures belonged to appropriate clades in the dendrogram. These analyses demonstrate the success of MICAN-SQ in classifying protein structures with widely varying assembly states, and domain fusion and swapping.

Fig. 5. A dendrogram of structures in the glyoxalase/BRP/dihydroxybiphenyl dioxygenase family; structure alignments were performed using MICAN-SQ. Seven clusters (A–G) were identified according to chain connectivity and assembly states. PDB IDs are used as operational taxonomic units and are colored according to corresponding clusters. In the bottom panels, schematic representations of chain connectivity and assembly states of each cluster are shown. Gray-shaded arcs represent β sheets. Triangles are β strands, and green and yellow colors indicate chain components. N and C termini are denoted by ‘N’ and ‘C’, respectively. The clusters (A)–(D) are monomers and clusters (E)–(G) are dimers.

It is worth noting that there are previously unknown structure relationships in this family.
The currently known chain connectivities and assembly states are exhibited in structures of clusters A–E and G depicted in the bottom panels of Figure 5 (Suttisansanee et al., 2011). In contrast, the arrangement and structure of Atu1953 (2pjs, cluster F), shown in Figure 2B, have not been previously described. Atu1953 was initially identified in genome analyses of Agrobacterium fabrum, although its function remains unknown. In our analyses, the chain connectivity of Atu1953 was similar to that of proteins in cluster G, but contained a swapped β-strand (around red loops in Supplementary Fig. S11). While cluster G includes metalloenzymes, Atu1953 lacks His and Glu metal-binding residues at corresponding positions of the centers of β sheets and is not likely to be a metalloenzyme. Because cluster G also includes BRP, we hypothesized that the function of Atu1953 would be similar to that of BRP. The ensuing structure alignments of 2pjs and 1ewj (bleomycin-bound structure of BRP) suggest that the flanking loops of the swapped strand (red loops in Supplementary Fig. S11) sterically hinder bleomycin binding (Supplementary Fig. S11C). This observation suggests that Atu1953 cannot bind to bleomycin and may bind to a smaller ligand, and that it has evolved selectivity through strand swapping.

4 Conclusion

In this study, we introduce the novel sequential protein structure alignment algorithm MICAN-SQ. In contrast with other sequential protein structure alignment programs that can align monomer pairs only, MICAN-SQ can compare monomer-oligomer and oligomer-oligomer pairs and retains sequential restrictions within the alignments of each chain pair. To assess the accuracy of MICAN-SQ, we evaluated its consistency with manually-curated alignments of monomer pairs and with the alignments of a domain-swapped dimer and its corresponding monomer.
We then compared the performance of MICAN-SQ with those of seven structure alignment programs that are publicly available, including those for monomer pairs and protein complexes. The present data show that MICAN-SQ is superior to all of the other programs examined. In addition, MICAN-SQ has excellent computational speed. Taken together, our results warrant consideration of MICAN-SQ as a useful program for comparisons of protein structures that is applicable to monomer pairs and multimer complexes.

Funding

This work was supported by Platform for Drug Discovery, Informatics and Structural Life Science from the Japan Agency for Medical Research and Development and by a Grant-in-Aid for Scientific Research (C) (No.16K07315) from the Japan Society for the Promotion of Science. S.M. and K.S. were supported by a Grant-in-Aid for JSPS Fellows.

Conflict of Interest: none declared.

References

Alva V. et al. (2015) A vocabulary of ancient peptides at the origin of folded proteins. eLife, 4, e09410.
Bergdoll M. et al. (1998) All in the family: structural and evolutionary relationships among three modular proteins with diverse functions and variable assembly. Protein Sci., 7, 1661–1670.
Blaber M. et al. (2012) Emergence of symmetric protein architecture from a simple peptide motif: evolutionary models. Cell. Mol. Life Sci., 69, 3999–4006.
Cheng H. et al. (2007) MALIDUP: a database of manually constructed structure alignments for duplicated domain pairs. Proteins, 70, 1162–1166.
Cheng H. et al. (2008) MALISAM: a database of structurally analogous motifs in proteins. Nucleic Acids Res., 36, D211–D217.
Cheng H. et al. (2014) ECOD: an evolutionary classification of protein domains. PLoS Comput. Biol., 10, e1003926.
Holm L., Sander C. (1998) Dictionary of recurrent domains in protein structures. Proteins, 33, 88–96.
Hou J. et al. (2003) A global representation of the protein fold space. Proc. Natl. Acad. Sci. USA, 100, 2386–2390.
Huang P. et al. (2016) De novo design of a four-fold symmetric TIM-barrel protein with atomic-level accuracy. Nat. Chem. Biol., 12, 29–34.
Huang Y. et al. (2012) Three-dimensional domain swapping in the protein structure space. Proteins, 80, 1610–1619.
Joo K. et al. (2007) High accuracy template based modeling by global optimization. Proteins, 69, 83–89.
Koike R., Ota M. (2012) SCPC: a method to structurally compare protein complexes. Bioinformatics, 28, 324–330.
Lupas A. et al. (2001) On the evolution of protein folds: are similar motifs in different protein folds the result of convergence, insertion, or relics of an ancient peptide world? J. Struct. Biol., 134, 191–203.
Mayr G. et al. (2007) Comparative analysis of protein structure alignments. BMC Struct. Biol., 7, 50.
Minami S. et al. (2013) MICAN: a protein structure alignment algorithm that can handle multiple-chains, inverse alignments, Cα only models, alternative alignments, and non-sequential alignments. BMC Bioinformatics, 14, 24.
Minami S. et al. (2014) How a spatial arrangement of secondary structure elements is dispersed in the universe of protein folds. PLoS ONE, 9, e107959.
Mukherjee S., Zhang Y. (2009) MM-align: a quick algorithm for aligning multiple-chain protein complex structures using iterative dynamic programming. Nucleic Acids Res., 37, e83.
Okuno T. et al. (2015) VS-APPLE: a virtual screening algorithm using promiscuous protein-ligand complexes. J. Chem. Inf. Model, 55, 1108–1119.
Orengo C. et al. (1997) CATH–a hierarchic classification of protein domain structures. Structure, 5, 1093–1108.
Sippl M.J., Wiederstein M. (2012) Detection of spatial correlations in protein structures and molecular complexes. Structure, 20, 718–728.
Standley D.M. et al. (2010) SeSAW: balancing sequence and structural information in protein functional mapping. Bioinformatics, 26, 1258–1259.
Suttisansanee U. et al. (2011) Structural variation in bacterial glyoxalase I enzymes. J. Biol. Chem., 286, 38367–38374.
Wang S. et al. (2013) Protein structure alignment beyond spatial proximity. Sci. Rep., 3, 1448.
Zhang Y., Skolnick J. (2005) TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res., 33, 2302–2309.

© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model).

Bioinformatics, Oxford University Press

Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
ISSN
1367-4803
eISSN
1460-2059
D.O.I.
10.1093/bioinformatics/bty369
1 Introduction

Analyses of protein structures are highly informative of evolutionary, functional and physicochemical properties of proteins.
In particular, protein structure alignments can be used to identify spatially equivalent residue pairs for given protein pairs, leading to improved understanding of evolutionary (Alva et al., 2015; Orengo et al., 1997) and spatial properties of proteins (Hou et al., 2003; Minami et al., 2014), inferring molecular functions (Standley et al., 2010), and facilitating structure predictions (Joo et al., 2007), de novo design (Huang et al., 2016) and drug discovery (Okuno et al., 2015). Although many programs for protein structure alignments have been developed, further improvements are required to address various limitations. Among these, agreement with gold-standard alignments is one of the important issues. It is known that the results of existing structure alignment programs do not coincide well with the gold standard (Mayr et al., 2007; Wang et al., 2013). Because biologically or physically meaningful alignments are key for inferring evolutionary relationships and understanding physicochemical properties of protein structures, structure alignment programs that are highly consistent with gold standards are eagerly awaited.

Additional limitations are present in methodologies for aligning proteins comprising multiple chains. A few programs have been developed for structural comparisons of protein complexes and these programs can be roughly divided into two classes. The first class of these programs assigns one subunit in a complex to a subunit in another, and the resulting subunit–subunit correspondences impose sequential restrictions in which structurally equivalent regions must be aligned in the same order along the protein sequences of each subunit pair [MMalign (Mukherjee and Zhang, 2009), SCPC (Koike and Ota, 2012)]. The second class of programs involves non-sequential alignments [TopMatch (Sippl and Wiederstein, 2012) and MICAN (Minami et al., 2013)], in which structurally equivalent residues can be aligned in differing orders along the protein sequences.
Although both approaches identify valid structural relationships, each has practical limitations. Programs of the first class fail to make alignments of two proteins if one chain of a protein corresponds structurally with multiple chains of the other (one-to-many correspondences). This limitation is particularly problematic when two or more protein chains form oligomers by exchanging an identical structural element between subunits (domain swapping). Structure alignments of a monomeric protein [the mannose binding protein, Protein Data Bank (PDB) ID: 1msb] and a structurally similar dimer with domain swapping (the platelet activator convulxin: 1umr) are shown in Supplementary Figure S1. In this example, the overall structure of the monomer resembles that of the dimer, and the monomer is associated with both chains of the dimer. Because these programs do not allow structural correspondence across multiple chains, they cannot correctly detect matches of swapped regions. In addition, when two protein complexes include more than two chains, a combinatorial problem with subunit–subunit correspondences leads to large computational times.

Programs of the second class employ non-sequential structure alignments. By ignoring chain connectivity, these programs can identify structural correspondences across multiple chains and thus overcome the problem of one-to-many correspondences. However, because chain connectivities are generally maintained in homologous proteins, disregarding chain connectivities can produce odd alignments that are not biologically relevant.

Herein, we propose a new class of protein structure alignment program, MICAN-SQ. Unique features of MICAN-SQ include (i) acceptance of protein structures of all oligomeric states and (ii) imposition of sequential restriction within alignments of each chain pair. The ensuing algorithm is based on the non-sequential protein structure alignment program MICAN (Minami et al., 2013).
Previously, we showed that MICAN reproduces manually curated sequential alignments of reference monomeric proteins better than other programs, although it does not specialize in sequential structure alignments. This result motivated us to introduce the sequential restriction within the alignment of each chain pair into the MICAN algorithm for more accurate alignment. To assess alignment quality, we employed the databases of manually-curated alignments of monomeric proteins that we used previously [MALIDUP (Cheng et al., 2007) and MALISAM (Cheng et al., 2008)], and a dataset of structure pairs of a monomer and its structurally similar domain-swapped structure [3DSwap-PS (Huang et al., 2012)]. We evaluated consistency between the alignments of MICAN-SQ and those in the databases, and compared accuracies with those of publicly available programs and those of MICAN with non-sequential modes. We demonstrated that MICAN-SQ has greater accuracy on the benchmark sets than the other programs. These results warrant consideration of MICAN-SQ as a useful protein structure alignment program that can be used to identify biologically or physically relevant alignments of protein pairs in all oligomeric states.

2 Materials and methods

2.1 Overview of the MICAN algorithm

In the previous study, two non-sequential alignment modes were implemented in MICAN (Minami et al., 2013). Firstly, the ‘permitting ReWiring mode (RW)’ allows alignment of structurally equivalent Secondary Structure Elements (SSEs) in different orders in the protein sequences (rewiring loops) while preserving the N- to C-terminal directions of the SSEs. MICAN with this mode is referred to as MICAN-RW. Secondly, in addition to RW, the ‘permitting Rewiring and Reversing mode (RR)’ allows alignment of structurally equivalent SSEs in the reverse direction of N- and C-termini. MICAN with this mode is referred to as MICAN-RR.
In this paper, we introduce the ‘restricting SeQuential alignment mode (SQ)’ into MICAN, in which structurally equivalent SSEs must be aligned in the same order along the protein sequences. MICAN with this mode is called MICAN-SQ, in which sequential constraints are imposed only within the alignment of each chain pair. In all three modes, the alignment algorithm basically comprises the following three steps: (i) alignments of SSEs, (ii) alignments of residues and (iii) ranking of alignments. These three steps are briefly described in Supplementary Note S1 and are illustrated in a flowchart in Supplementary Figure S2. Because MICAN-SQ was modified only at step (ii), only this step is described below. The optimized score function (SSE match-weighted TMscore; sTMscore) and several quantities reported by the program, including global and local similarity measures, are also described in Supplementary Notes S1 and S2, respectively.

2.2 Alignment of residues in MICAN-SQ

In this step, which starts from the initial superimposition obtained by alignments of SSEs, one-to-one assignments of Cα atoms are generated in a stepwise manner. We define segments by using contiguous pairs of Cα atoms. Assignments start from the segment comprising Cα atoms with the smallest distance and the same geometry in the superimposition. The sequential condition within the alignment of each chain pair is maintained. Here, we assume that the two protein structures for comparison correspond with a query (Q) and a template (T) containing $n_Q$ and $n_T$ chains, respectively. If the numbers of residues of the $i$-th chain of the query and of the template are $N_Q(i)$ and $N_T(i)$, respectively, then the total numbers of residues of the query ($N_Q$) and the template ($N_T$) are

$$N_Q = \sum_{i=1}^{n_Q} N_Q(i), \qquad N_T = \sum_{i=1}^{n_T} N_T(i),$$

respectively. For a given superimposition, an $N_Q \times N_T$ matrix, $M$, is constructed.
The matrix element $M_{i,j}$ is designed so that it is large if the distance between the Cα atoms of query residue $i$ and template residue $j$ is small and the two residues have similar local-backbone geometries. The definition of the matrix is given in Supplementary Note S3. According to matrix $M$, we identify segments, which are runs of contiguous residue pairs that are superimposed with small distances and similar local-backbone geometries. A segment of length $l$ is described as a set of contiguous residue pairs $\{(\alpha, \beta), (\alpha+1, \beta+1), \ldots, (\alpha+l-1, \beta+l-1)\}$, where $(\alpha, \beta)$ indicates that residue $\alpha$ of the query is paired with residue $\beta$ of the template. To quantify the similarity of a segment, we introduce a score for the segment $A_k$ with length $l$, where $k$ is the index of the segment, and the score $S(A_k)$ is defined as follows:

$$S(A_k) = \sum_{m=0}^{l-1} M_{\alpha+m,\,\beta+m}.$$

To eliminate segments that are too short or too long and are rarely seen among manually curated alignments (Minami et al., 2013), only segments that satisfy the following two conditions are taken into account:

$$M_{\alpha+m,\,\beta+m} > M_{\mathrm{cut}} \quad \text{for } \forall m \in \{0, 1, \ldots, l-1\}, \qquad l \geq 3,$$

where $M_{\mathrm{cut}}$ is the cutoff value. Based on the segments defined above, a set of non-overlapping segments that maximizes

$$S_{\mathrm{tot}} = \sum_{k=1}^{n} S(A_k)$$

is sought while imposing the sequential constraint within the alignment of each chain pair, where $n$ is the number of segments in the alignment. To find the best solution, a greedy algorithm was used (Supplementary Fig. S3). Firstly, the segment with the highest score among all segments is selected and recorded as $A_1$. To select the subsequent segment that does not overlap with $A_1$ and satisfies the sequential condition, the matrix $M$ is modified so that matrix elements interfering with $A_1$ and violating the sequential rule are set to zero. Matrix elements that violate the sequential rule are restricted only to the chain pair that includes segment $A_1$.
Secondly, the segment with the highest score of all the segments is selected ($A_2$) based on the modified matrix and recorded. The matrix is further modified in the same manner to avoid overlap with the selected segments ($A_1$ and $A_2$) and to maintain the sequential restriction. This procedure is repeated until the highest score of all remaining segments on the modified matrix is smaller than $S_{\min}$ (the cutoff parameter). Finally, a set of non-overlapping segments that approximately maximizes the total score $S_{\mathrm{tot}}$ is obtained.

2.3 Test-sets

One of the significant purposes of structure alignments is to obtain biologically or physically relevant alignments. To evaluate alignment accuracy in this sense, we employed the two manually curated datasets MALIDUP (Cheng et al., 2007) and MALISAM (Cheng et al., 2008). MALIDUP contains 241 pairwise structure alignments for distantly-related domains that originated from internal duplication within a protein chain. In contrast, MALISAM contains 130 pairwise structure alignments for structural analogs, which were carefully selected based on evidence that the structural similarities are not a consequence of evolution. Because analogy is more difficult to detect than homology, MALISAM is a more difficult benchmark set than MALIDUP. The alignments in these datasets were manually curated, taking evolutionary and functional relationships and geometric similarity of structures into consideration. Accordingly, these alignments are suitable as reference alignments. Note that MALIDUP and MALISAM collect only monomer pairs that are sequentially aligned. We also evaluated alignment accuracies of protein pairs that include multiple chains and domain swapping using 3DSwap-PS (Huang et al., 2012). 3DSwap-PS is a curated collection of protein structures with domain swapping based on pre-existing datasets and related literature.
We used the single-domain group subset of 3DSwap-PS, which is a set of protein pairs of a domain-swapped dimer and its structurally corresponding single domain monomer. This subset contains 55 bona fide domain-swapped structures and 223 quasi domain-swapped structures.

2.4 Programs to be compared

We compared MICAN-SQ with the popular sequential structure alignment programs DaliLite (Holm and Sander, 1998), TMalign (Zhang and Skolnick, 2005) and DeepAlign (Wang et al., 2013), which are specialized for the alignment of monomer pairs. DaliLite version 3.3 was downloaded from http://www.ebi.ac.uk/Tools/structure/dalilite/, TMalign of 06/01/2014 version was downloaded from http://zhanglab.ccmb.med.umich.edu/TM-align/ and DeepAlign version 1.13 was downloaded from http://ttic.uchicago.edu/~jinbo/software.htm. Also, we compared MICAN-SQ with MMalign (Mukherjee and Zhang, 2009) and TopMatch (Sippl and Wiederstein, 2012), which can accept structures of protein complexes. MMalign was downloaded from http://zhanglab.ccmb.med.umich.edu/MM-align/ and TopMatch was downloaded from https://www.came.sbg.ac.at/app_download.php?app=topmatch. In addition, we evaluated MICAN-RW and MICAN-RR to examine the effect of the sequential restraint introduced into MICAN-SQ. The three modes were integrated into MICAN, and can be specified by a command line option.

3 Results and discussion

3.1 Performance on MALIDUP and MALISAM

Structure alignments were performed with eight structural alignment methods including MICAN-SQ, and were assessed with reference to MALIDUP and MALISAM. For each alignment, we calculated Q-scores as the percentage agreement with reference alignments, defined by the number of correctly aligned residue pairs divided by the total number of aligned residue pairs in a reference alignment multiplied by 100. Averaged Q-scores for each method are shown in Table 1, and the box-plots of Q-scores are shown in Supplementary Figure S4.
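The Q-score definition above can be illustrated with a short Python sketch. The function name `q_score` and the representation of an alignment as a collection of (query residue, template residue) index pairs are assumptions made for illustration only; they are not taken from the MICAN program or from the benchmark file formats.

```python
from typing import Iterable, Tuple

ResiduePair = Tuple[int, int]  # (query residue index, template residue index)


def q_score(predicted: Iterable[ResiduePair],
            reference: Iterable[ResiduePair]) -> float:
    """Percentage of reference residue pairs reproduced by the prediction."""
    ref = set(reference)
    if not ref:
        raise ValueError("reference alignment contains no residue pairs")
    correct = len(set(predicted) & ref)  # pairs aligned exactly as in the reference
    return 100.0 * correct / len(ref)
```

For instance, if a reference alignment contains four residue pairs and a program reproduces two of them, the Q-score is 50, regardless of how many additional pairs the program aligns.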
For both reference cases, MICAN-SQ produced the highest average Q-scores among assessed methods. Scatter plots of Q-scores for all pairs of methods relative to MALIDUP and MALISAM are indicated in Supplementary Figures S5 and S6, respectively. P-values of differences were estimated using Wilcoxon Signed-Rank tests and are also shown in the figures. With the exception of DeepAlign in MALIDUP, MICAN-SQ significantly outperformed the other alignment tools, as indicated by highly significant P-values. With reference to MALISAM, outstanding performance against other methods was clear, suggesting that MICAN-SQ can correctly align analogous protein structures, thus surmounting previous difficulties. These results demonstrate that MICAN-SQ generates structural alignments that are highly consistent with biologically and physically relevant alignments for monomer pairs.

Table 1. The performance of eight methods on MALIDUP, MALISAM and 3DSwap-PS; ⟨Q-score⟩, ⟨SID score⟩ and ⟨Nali⟩ represent the average Q-score, SID score and number of aligned residues over all protein pairs in the dataset, respectively

| Dataset | Measure | MICAN-SQ | DeepAlign | DaliLite | TMalign | MICAN-RW | MICAN-RR | MMalign | TopMatch |
| MALIDUP | ⟨Q-score⟩ | 90.2 | 88.2 | 85.3 | 83.1 | 88.0 | 83.3 | 81.1 | 76.2 |
| MALIDUP | ⟨Nali⟩ | 85.4 | 83.8 | 87.0 | 88.3 | 86.0 | 87.5 | 87.1 | 79.6 |
| MALIDUP | Time (s) | 18.3 | 37.6 | 129.9 | 7.4 | 17.9 | 18.1 | 25.7 | 21.8 |
| MALISAM | ⟨Q-score⟩ | 76.5 | 67.3 | 67.2 | 58.6 | 68.6 | 49.7 | 58.9 | 48.0 |
| MALISAM | ⟨Nali⟩ | 61.7 | 59.3 | 61.0 | 64.6 | 62.4 | 64.3 | 62.4 | 57.4 |
| MALISAM | Time (s) | 5.4 | 11.8 | 24.5 | 2.4 | 5.3 | 5.4 | 8.3 | 11.3 |
| 3DSwap-PS | ⟨SID score⟩ | 80.5 | 67.1 | 61.4 | 64.6 | 80.5 | 79.9 | 65.7 | 75.7 |
| 3DSwap-PS | ⟨Nali⟩ | 132.2 | 107.8 | 107.9 | 111.4 | 132.6 | 134.2 | 112.8 | 125.5 |
| 3DSwap-PS | Time (s) | 64.6 | 73.5 | 976.0 | 13.2 | 51.5 | 63.3 | 294.6 | 45.0 |

Note: CPU times for all calculations are listed in the ‘Time (s)’ rows. Running CPU times were measured on an Intel Core i7 3.40 GHz machine.
As shown in Table 1, Q-scores of MICAN-SQ, RW and RR differed, especially when MALISAM was used as a reference. Because the three modes of MICAN used the same scoring function and alignment algorithm, differences among the three modes exclusively reflect the restrictions imposed in each mode. In particular, large differences in Q-scores between the three modes were demonstrated for the protein pair of aldehyde reductase (SCOP ID: d1o02a_) and the hypothetical protein MT938 (SCOP ID: d1ihna1), which were taken from the MALISAM test set and were considered structural analogs. The alignment plots and structure superimpositions obtained by the three modes are shown in Supplementary Figure S7. Reference alignments are indicated by gray plots and contain two α helices and four β strands. For MICAN-SQ, RW and RR, the structurally aligned region covered all six SSEs that are identified in the reference. However, the alignment patterns differed between modes, and agreement with the reference alignment decreased with relaxation of restrictions from SQ to RW and from RW to RR, with Q-scores for SQ, RW and RR of 96.6, 71.2 and 1.7, respectively. Conversely, structural similarity increased from SQ to RW and from RW to RR, with TMscores for SQ, RW and RR of 0.476, 0.484 and 0.520, respectively. This example illustrates the importance of the sequential restrictions for generating biologically or physically relevant alignments of analogous proteins.

3.2 Performance on 3DSwap-PS

In further computations, we evaluated the performance of MICAN-SQ in alignments of multimers using 3DSwap-PS, and made comparisons with the other seven alignment programs. Unlike MALIDUP and MALISAM, 3DSwap-PS does not provide reference alignments. Instead, as a measure for evaluating the alignment accuracy, we used sequence identity (SID) in the structure alignment produced by each method.
Because each protein pair in 3DSwap-PS comprises identical or closely homologous proteins (average SID of 53.9%), the sequence identity over the entire aligned region, which may span multiple chains, should be high in a successful structure alignment. Precisely, the SID score is the number of matches in the structure alignment divided by the number of matches in the sequence alignment, multiplied by 100. For the sequence alignments, the two sequences were taken from the monomer and from one chain of the domain-swapped dimer. The Needleman–Wunsch dynamic programming algorithm was used with BLOSUM62 and with gap opening and extension penalties of –11 and –1, respectively. Note that domain swapping is disregarded in the sequence alignment. Average SID scores and numbers of aligned residues (Nali) for each method are presented in Table 1, and box plots of SID scores are shown in Supplementary Figure S8. We also present scatter plots of SID scores for all pairs of methods, with P-values from Wilcoxon signed-rank tests, in Supplementary Figure S9.

MICAN-SQ produced much higher SID scores than DeepAlign, DaliLite and TMalign, which were specifically designed for alignments of monomers. In particular, the average SID score of MICAN-SQ was about 13 points higher than that of DeepAlign (80.5 versus 67.1; Table 1), which performed the best among the three. The differences in SID scores between MICAN-SQ and these programs are significant (P-value < 0.05), and the weaker performance of the three programs was mostly due to unsuitable alignments of swapped regions. In addition, the numbers of residues aligned by the three programs were much smaller than that of MICAN-SQ: the differences in Nali are at least 20. These results contrast with those from the MALIDUP and MALISAM computations, in which all programs generated comparable Nali values (Table 1).
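The SID score defined above can be illustrated with a minimal Python sketch. It is not taken from the MICAN source code: the function names, the representation of each alignment as a pair of equal-length gapped strings, and the toy sequences are all assumptions made for illustration, and a "match" is assumed to mean an aligned pair of identical residues.

```python
GAP = "-"


def count_matches(aligned_a: str, aligned_b: str) -> int:
    """Count aligned columns in which both residues are identical (gaps never match)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned strings must have equal length")
    return sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != GAP)


def sid_score(structure_alignment, sequence_alignment) -> float:
    """SID score: 100 * (matches in structure alignment) / (matches in sequence alignment).

    Each argument is a (gapped_a, gapped_b) tuple; this input format is an
    assumption of the sketch, not something specified by the paper.
    """
    n_struct = count_matches(*structure_alignment)
    n_seq = count_matches(*sequence_alignment)
    if n_seq == 0:
        raise ValueError("sequence alignment contains no matches")
    return 100.0 * n_struct / n_seq


# Toy, made-up sequences (not real proteins): the sequence alignment has
# 8 identical columns, the structure alignment only 6.
structure_alignment = ("ACDEFGHIK-", "ACDEFG-HIR")
sequence_alignment = ("ACDEFGHIK", "ACDEFGHIR")

print(sid_score(structure_alignment, sequence_alignment))  # prints 75.0
```

In this toy case the structure alignment is shifted near the C-terminus and recovers only six of the eight identities found by the sequence alignment, which lowers the score to 75.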
To demonstrate the advantages of MICAN-SQ, we compared its alignment of a camelid heavy-chain variable domain (PDB ID: 1qd0) and the corresponding domain-swapped dimer (1sjv; Fig. 1) with that produced by DeepAlign. The SID scores of MICAN-SQ and DeepAlign were 91.4 and 75.3, respectively. The alignment by DeepAlign was restricted to 1qd0 and chain A of 1sjv and failed to indicate the structural correspondence between 1qd0 and chain B of 1sjv (orange rectangle in Fig. 1B). As a result, the alignment does not cover the C-terminal β strands of 1qd0, which form the swapped region in the left panel of Figure 1A. In contrast, the alignment by MICAN-SQ covered almost the entire region of 1qd0, which is associated with chains A and B of 1sjv. The validity of the alignment by MICAN-SQ was supported by its higher degree of sequence matching, as shown in Figure 1C. This example illustrates the advantage of MICAN-SQ over programs that are specific to monomer pairs.

Fig. 1. A protein pair comprising a domain-swapped dimer (1sjv) and its structurally similar monomer (1qd0) in 3DSwap-PS; (A) The structures of 1sjvAB and 1qd0A and their superposition by MICAN-SQ; (B) Alignment plots by MICAN-SQ, DeepAlign and TopMatch. Horizontal and vertical axes represent residue numbers of 1sjvAB and 1qd0, respectively. (C) Alignments of the regions enclosed by the cyan and orange rectangles in (B); colons indicate alignment matches

Compared with MMalign, MICAN-SQ achieved significant improvements (P-value < 0.05) in SID scores.
Although MMalign was designed for alignments of multimers, the SID scores of MICAN-SQ and MMalign differed by about 15 points, and this difference was comparable to that between MICAN-SQ and the programs for monomer pairs. We confirmed that MMalign with default settings fails to generate alignments across multiple chains, and that only a special option of MMalign allows such alignments. Indeed, the Nali obtained by MMalign is almost the same as that of TMalign (Table 1).

MICAN-SQ was superior to TopMatch, with a significant improvement in average SID score of about five points (P-value = 7.0 × 10⁻¹⁹; Supplementary Fig. S9). Because TopMatch makes alignments across multiple chains, it is unclear why MICAN-SQ is superior to TopMatch. However, in the evaluations on MALIDUP and MALISAM, MICAN-SQ was much more consistent with the reference alignments than TopMatch. Hence, the same trend was expected in evaluations using 3DSwap-PS, and it was indeed seen in a comparison of the structure alignments of 1qd0 and 1sjv by MICAN-SQ and TopMatch (Fig. 1). Although the resulting alignment plots (Fig. 1B) are similar, the corresponding SID scores differed considerably and were 91.4 and 75.3 for MICAN-SQ and TopMatch, respectively. The alignments differed only in the region enclosed by the cyan rectangles in Figure 1B, and the alignments of the corresponding regions are shown in Figure 1C. In this region, the MICAN-SQ alignment contained 14 matches (among 26 sites), whereas the TopMatch alignment was shifted by one residue relative to that of MICAN-SQ and thus contained only three matches.

The performances of MICAN-SQ, RW and RR were statistically indistinguishable from each other (P-value > 0.05; Supplementary Fig. S9). In the evaluations on MALIDUP and MALISAM, greater evolutionary closeness of protein pairs led to increased similarity of the alignments from the three modes of MICAN. The protein pairs in 3DSwap-PS are more closely related.
The average sequence identities of MALIDUP, MALISAM and 3DSwap-PS are 18.8%, 8.5% and 53.9%, respectively. With greater evolutionary distance between protein pairs, the differences between MICAN-SQ and MICAN-RW/RR become significant.

It should be noted that MICAN-SQ is applicable not only to monomer-dimer pairs but also to any protein pair with domain swapping, regardless of the number of chain components. Here, we performed exhaustive all-versus-all structure comparisons of biological assemblies in the PDB using MICAN-SQ and found numerous examples beyond monomer-dimer pairs. Figure 2 shows one such example: a structure alignment for a dimer-dimer pair with domain swapping [methylmalonyl-CoA epimerase (3rmu) and the uncharacterized protein Atu1953 (2pjs)]. This alignment shows that chain A (or B) of 3rmu is structurally associated with chains A and B of 2pjs, and vice versa, with domain swapping across multiple chains. To our knowledge, these structural similarities have not been previously described. We also infer functional and evolutionary aspects of Atu1953 below.

Fig. 2. A protein pair comprising a domain-swapped dimer (2pjs) and its structurally similar non-swapped dimer (3rmu); (A) The structure of 3rmu; chains A and B are colored in red and blue, respectively. (B) The structure of 2pjs; chains A and B are colored in yellow and green, respectively. (C) The structure superimposition by MICAN-SQ; (D) alignment plot by MICAN-SQ
3.3 Computational speed

High computational speed is required of structure alignment programs, particularly for large-scale database searches. The CPU times spent to calculate all protein pairs in each dataset are listed in Table 1 and were measured on an Intel Core i7 3.40 GHz machine. TMalign was the fastest program in all cases, and its speed was outstanding on 3DSwap-PS. This is partly because TMalign reads only the first chain in the PDB file. MICAN-SQ was the second fastest program on MALIDUP and MALISAM, and the third fastest on 3DSwap-PS, indicating that MICAN-SQ is a relatively fast program among those compared in the present study. Hence, considering both alignment accuracy and computational speed, MICAN-SQ has considerable utility in structural comparisons of monomers and oligomers.

3.4 Examples and applications

In addition to the examples shown above, MICAN-SQ can be applied to multiple types of protein pairs. The first example is the large oligomer pair of bacterioferritin (3bkn) and apoferritin (2za6), illustrated in Figure 3A and B. Both of these oligomers comprise 24 identical chains of 160–170 amino acids and form spherical particles. Previously, TopMatch was used to align this complex pair and produced a structure match with Nali = 3649 and an RMSD of 2.6 Å after superposition (Sippl and Wiederstein, 2012). As shown in Figure 3D, MICAN-SQ correctly selected the chain combinations and aligned them, with an Nali value of 3696 and an RMSD of 2.6 Å (TM-score = 0.9, Fig. 3C). This result is slightly better than that of TopMatch.

Fig. 3. Structure alignment for the large oligomer pair apoferritin (2za6) and bacterioferritin (3bkn); (A) the structure of 2za6; (B) the structure of 3bkn. The green regions in (A) and (B) indicate one of the single chains in the oligomers. (C) Structure superimposition by MICAN-SQ. (D) The alignment plot by MICAN-SQ
The second example is a protein pair with domain fusion (or fission). Figure 4 shows a putative thiamine biosynthesis protein (1vk8, chains A and B) and the protein YKOF of unknown function (1s7hA). The former is a dimeric single-domain protein and the latter is a monomer that corresponds to the fused form of the former. In the structure alignment, MICAN-SQ detected the structural similarity (Fig. 4D) and associated both 1vk8A and 1vk8B with 1s7hA, with a TM-score of 0.72 and an Nali value of 159 (86% of the residues in 1s7hA). In contrast, MMalign aligned these proteins with a TM-score of 0.38 and an Nali value of 80 (half of 1s7hA). This example demonstrates that MICAN-SQ successfully performs structure comparisons of proteins with different numbers of chains, and indicates the importance of alignments across multimers. Similarly, a trimeric structure of a single β-propeller domain (2bt9, chains A, B and C) and a corresponding structure of the fused form (1ofz) were aligned and superimposed (Supplementary Fig. S10). Because oligomeric single-domain proteins are considered remnants of ancient proteins (Alva et al., 2015; Blaber et al., 2012; Lupas et al., 2001), their alignments against the fused forms provide important evolutionary insights.

Fig. 4. Structure alignment for a dimeric form of a single-domain protein and the fused structure; (A) Structure of a putative thiamine biosynthesis protein [1vk8, A (red) and B (blue) chains]; (B) structure of the protein YKOF of unknown function (1s7hA); (C) structure superposition by MICAN-SQ; (D) alignment plot by MICAN-SQ
The final example is the glyoxalase/bleomycin resistance protein (BRP)/dihydroxybiphenyl dioxygenase family, which includes the proteins shown in Figure 2. This protein family contains a wide variety of proteins with diverse functions and low sequence similarity, including glyoxalase I, extradiol dioxygenases and antibiotic resistance proteins (Bergdoll et al., 1998). Many of these proteins are metalloenzymes with three or four essential metal ion-binding residues (commonly His and Glu residues) in the centers of β sheets. However, BRP is a non-metalloenzyme member of this family that lacks metal ion-binding residues and instead binds and sequesters the anti-cancer agent bleomycin. Most members of this family show essentially the same spatial arrangement of four βαβββ sub-domains, but exhibit diversity in chain connectivity and assembly states, reflecting domain fusion and swapping (Suttisansanee et al., 2011). Here, we examined whether MICAN-SQ correctly classifies these widely varied structures and explored previously unknown structures in this family. We performed pairwise structural alignments of representative structures of this family in the Evolutionary Classification of protein Domains database (Cheng et al., 2014) using MICAN-SQ, and generated the dendrogram shown in Figure 5 (a detailed description of the dataset and procedure is presented in Supplementary Note S4). The characterized structures were roughly divided into monomers and dimers and were then further divided into sub-clusters (A–G in Fig. 5), according to chain connectivity and assembly states.
Subsequent visual inspection confirmed that the automatic structural classification by MICAN-SQ was accurate, because all structures belonged to the appropriate clades in the dendrogram. These analyses demonstrate the success of MICAN-SQ in classifying protein structures with widely varying assembly states, domain fusion and domain swapping.

Fig. 5. A dendrogram of structures in the glyoxalase/BRP/dihydroxybiphenyl dioxygenase family; structure alignments were performed using MICAN-SQ. Seven clusters (A–G) were identified according to chain connectivity and assembly states. PDB IDs are used as operational taxonomic units and are colored according to the corresponding clusters. In the bottom panels, schematic representations of the chain connectivity and assembly states of each cluster are shown. Gray-shaded arcs represent β sheets. Triangles are β strands, and green and yellow colors indicate chain components. N and C termini are denoted by 'N' and 'C', respectively. Clusters (A)–(D) are monomers and clusters (E)–(G) are dimers

It is worth noting that there are previously unknown structural relationships in this family.
The currently known chain connectivities and assembly states are exhibited by the structures of clusters A–E and G, depicted in the bottom panels of Figure 5 (Suttisansanee et al., 2011). In contrast, the arrangement and structure of Atu1953 (2pjs, cluster F), shown in Figure 2B, have not been previously described. Atu1953 was initially identified in genome analyses of Agrobacterium fabrum, although its function remains unknown. In our analyses, the chain connectivity of Atu1953 was similar to that of proteins in cluster G, but contained a swapped β strand (around the red loops in Supplementary Fig. S11). While cluster G includes metalloenzymes, Atu1953 lacks His and Glu metal-binding residues at the corresponding positions in the centers of β sheets and is not likely to be a metalloenzyme. Because cluster G also includes BRP, we hypothesized that the function of Atu1953 would be similar to that of BRP. The structure alignment of 2pjs and 1ewj (the bleomycin-bound structure of BRP) suggests that the flanking loops of the swapped strand (red loops in Supplementary Fig. S11) sterically hinder bleomycin binding (Supplementary Fig. S11C). This observation suggests that Atu1953 cannot bind bleomycin and may bind a smaller ligand, and that it has evolved selectivity through strand swapping.

4 Conclusion

In this study, we introduced the novel sequential protein structure alignment algorithm MICAN-SQ. In contrast with other sequential protein structure alignment programs, which can align monomer pairs only, MICAN-SQ can compare monomer-oligomer and oligomer-oligomer pairs while retaining sequential restrictions within the alignment of each chain pair. To assess the accuracy of MICAN-SQ, we evaluated its consistency with manually curated alignments of monomer pairs and its alignments of domain-swapped dimers with their corresponding monomers.
We then compared the performance of MICAN-SQ with that of seven publicly available structure alignment programs, including programs for monomer pairs and for protein complexes. The present data show that MICAN-SQ is superior to all of the programs compared. In addition, MICAN-SQ is relatively fast among these programs. Taken together, our results warrant consideration of MICAN-SQ as a useful program for comparing protein structures that is applicable to both monomer pairs and multimeric complexes.

Funding

This work was supported by the Platform for Drug Discovery, Informatics and Structural Life Science from the Japan Agency for Medical Research and Development and by a Grant-in-Aid for Scientific Research (C) (No. 16K07315) from the Japan Society for the Promotion of Science. S.M. and K.S. were supported by a Grant-in-Aid for JSPS Fellows.

Conflict of Interest: none declared.

References

Alva V. et al. (2015) A vocabulary of ancient peptides at the origin of folded proteins. eLife, 4, e09410.
Bergdoll M. et al. (1998) All in the family: structural and evolutionary relationships among three modular proteins with diverse functions and variable assembly. Protein Sci., 7, 1661–1670.
Blaber M. et al. (2012) Emergence of symmetric protein architecture from a simple peptide motif: evolutionary models. Cell. Mol. Life Sci., 69, 3999–4006.
Cheng H. et al. (2007) MALIDUP: a database of manually constructed structure alignments for duplicated domain pairs. Proteins, 70, 1162–1166.
Cheng H. et al. (2008) MALISAM: a database of structurally analogous motifs in proteins. Nucleic Acids Res., 36, D211–D217.
Cheng H. et al. (2014) ECOD: an evolutionary classification of protein domains. PLoS Comput. Biol., 10, e1003926.
Holm L., Sander C. (1998) Dictionary of recurrent domains in protein structures. Proteins, 33, 88–96.
Hou J. et al. (2003) A global representation of the protein fold space. Proc. Natl. Acad. Sci. USA, 100, 2386–2390.
Huang P. et al. (2016) De novo design of a four-fold symmetric TIM-barrel protein with atomic-level accuracy. Nat. Chem. Biol., 12, 29–34.
Huang Y. et al. (2012) Three-dimensional domain swapping in the protein structure space. Proteins, 80, 1610–1619.
Joo K. et al. (2007) High accuracy template based modeling by global optimization. Proteins, 69, 83–89.
Koike R., Ota M. (2012) SCPC: a method to structurally compare protein complexes. Bioinformatics, 28, 324–330.
Lupas A. et al. (2001) On the evolution of protein folds: are similar motifs in different protein folds the result of convergence, insertion, or relics of an ancient peptide world? J. Struct. Biol., 134, 191–203.
Mayr G. et al. (2007) Comparative analysis of protein structure alignments. BMC Struct. Biol., 7, 50.
Minami S. et al. (2013) MICAN: a protein structure alignment algorithm that can handle multiple-chains, inverse alignments, Cα only models, alternative alignments, and non-sequential alignments. BMC Bioinformatics, 14, 24.
Minami S. et al. (2014) How a spatial arrangement of secondary structure elements is dispersed in the universe of protein folds. PLoS ONE, 9, e107959.
Mukherjee S., Zhang Y. (2009) MM-align: a quick algorithm for aligning multiple-chain protein complex structures using iterative dynamic programming. Nucleic Acids Res., 37, e83.
Okuno T. et al. (2015) VS-APPLE: a virtual screening algorithm using promiscuous protein-ligand complexes. J. Chem. Inf. Model., 55, 1108–1119.
Orengo C. et al. (1997) CATH–a hierarchic classification of protein domain structures. Structure, 5, 1093–1108.
Sippl M.J., Wiederstein M. (2012) Detection of spatial correlations in protein structures and molecular complexes. Structure, 20, 718–728.
Standley D.M. et al. (2010) SeSAW: balancing sequence and structural information in protein functional mapping. Bioinformatics, 26, 1258–1259.
Suttisansanee U. et al. (2011) Structural variation in bacterial glyoxalase I enzymes. J. Biol. Chem., 286, 38367–38374.
Wang S. et al. (2013) Protein structure alignment beyond spatial proximity. Sci. Rep., 3, 1448.
Zhang Y., Skolnick J. (2005) TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res., 33, 2302–2309.

© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Journal

Bioinformatics, Oxford University Press

Published: Oct 1, 2018
