Multiple sequence alignment with hierarchical clustering

Corpet, Florence

doi:10.1093/nar/16.22.10881

1988-11-25 00:00:00

Downloaded from https://academic.oup.com/nar/article-abstract/16/22/10881/2378678/ by Ed 'DeepDyve' Gillespie user on 05 October 2019

Volume 16 Number 22 1988, Nucleic Acids Research

Florence Corpet
Laboratoire de Genetique Cellulaire, INRA Toulouse, BP 27, 31326 Castanet Tolosan, France

Received July 7, 1988; Revised and Accepted October 14, 1988

ABSTRACT

An algorithm is presented for the multiple alignment of sequences, either proteins or nucleic acids, that is both accurate and easy to use on microcomputers. The approach is based on the conventional dynamic-programming method of pairwise alignment. Initially, a hierarchical clustering of the sequences is performed using the matrix of the pairwise alignment scores. The closest sequences are aligned, creating groups of aligned sequences. Then close groups are aligned until all sequences are aligned in one group. The pairwise alignments included in the multiple alignment form a new matrix that is used to produce a hierarchical clustering. If it is different from the first one, iteration of the process can be performed. The method is illustrated by an example: a global alignment of 39 sequences of cytochrome c.

INTRODUCTION

Macromolecules, either nucleic acids or proteins, are sequenced by an increasing number of molecular biologists using techniques that are automated and fast. Hence a growing volume of sequences has to be analyzed, which is impossible without the help of computers. It is of particular interest to locate the parts that are common to many sequences of the same family. For example, the homologous regions in the sequences of proteins having the same activity in various living organisms are likely to be the most important from functional and structural points of view. This kind of analysis requires the alignment of the sequences, that is, a representation of the sequences in rows with the homologous regions in the same columns.
When the 3D structures of the macromolecules are known, this problem is easy to solve. But this is a very rare case, and biologists want to align macromolecules when they know only the sequences.

The alignment of two sequences can be performed with any of several programs that have been developed since 1970 (1). However, when there are more than two sequences, the results of pairwise comparisons may not be consistent for certain residues. But simultaneous comparison of all the sequences cannot, in practice, be executed since the number of segment comparisons that must be carried out is of the order of the product of the sequence lengths. Hence, alternative strategies have been proposed: limiting the problem to three short sequences (2) or to closely related sequences (3), using a predetermined evolutionary tree (4), finding common subsequences (5 to 8), or selecting the best pairwise alignments from the scores of all pairwise comparisons (9).

Several multiple sequence alignment programs use the techniques of pairwise alignment, building the final alignment by gradually aligning further sequences according to the basic Needleman-Wunsch procedure (1). Based upon the scores of the initial alignments of all pairs of sequences, different strategies are used to determine in what order the sequences are incorporated into the final alignment. Since the sequences in the already aligned set preserve their relative structure, finding the proper order is crucial. Most algorithms proceed in a sequential way (10 to 12). Feng and Doolittle use a more sophisticated clustering, and the closely related subsets of sequences are prealigned before the final alignment (13).
The algorithm described in this paper also uses clustering, but in a simpler way, and there is no prealignment of clusters.

DESCRIPTION OF THE METHOD

Algorithm for two sequences

Let the two sequences A and B have lengths m and n, and denote by A(i) and B(j) the i-th and j-th elements in the respective sequences. To every pair of elements A(i), B(j), a weight w(i,j) is assigned from a suitable matrix D such as the mutation data matrix of Dayhoff (14) for amino acids (if necessary, a suitable constant must be added to make all matrix entries non-negative):

\[ w(i,j) = D(A(i), B(j)). \]

The values of w are not stored but computed when needed from stored values of the matrix.

The method of computation used, following that of Needleman and Wunsch (1) and of Murata et al. (2), is to work backward from the cell (m,n), calculating the maximum total value S for paths from each cell. Let S(i,j) be the maximum, over all paths from cell (i,j) to the bottom or the right side, of the sums of values w over the cells in the path minus g times the number of gaps in the path. This gap penalty is independent of the gap length, as suggested by the results of Barton and Sternberg (15). Let M(i,j) be the maximum value of S over all the cells (i,k) and (l,j) where j ≤ k ≤ n and i ≤ l ≤ m (as a consequence of its definition, M(i,j) is also the maximum value of S over all cells (l,k) with l ≥ i and k ≥ j).

The following algorithm is used to calculate S and M:

\[ S(i,j) = w(i,j) + \max\bigl( S(i+1,j+1),\; M(i+1,j+1) - g \bigr), \]
\[ M(i,j) = \max\bigl( S(i,j),\; M(i+1,j),\; M(i,j+1) \bigr). \]

Once the matrix S has been calculated, a traceback procedure is performed to find the successive cells of the best path. Its first cell is the one with the maximum value in the first row or the first column. This value is the score of the alignment.
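The recurrences for S and M can be sketched in a few lines. The following is a minimal illustration in Python, not the author's Turbo Pascal program: the function name `best_path_score`, the `weight` callable and the toy scores are assumptions made for the example, sequences are assumed non-empty, and the traceback is omitted so that only the score is returned.

```python
def best_path_score(a, b, weight, g):
    """Score of the best path for two non-empty sequences a and b.

    weight(x, y) is a hypothetical callable returning the non-negative
    entry D(x, y); g is the length-independent gap penalty.
    """
    m, n = len(a), len(b)
    # One extra row and column of zeros past (m, n): a path may stop on the
    # bottom or right side at no cost, so no end gap is ever charged.
    S = [[0] * (n + 2) for _ in range(m + 2)]
    M = [[0] * (n + 2) for _ in range(m + 2)]
    for i in range(m, 0, -1):          # work backward from cell (m, n)
        for j in range(n, 0, -1):
            w = weight(a[i - 1], b[j - 1])
            S[i][j] = w + max(S[i + 1][j + 1], M[i + 1][j + 1] - g)
            M[i][j] = max(S[i][j], M[i + 1][j], M[i][j + 1])
    # The best path starts in the first row or the first column.
    first_row = max(S[1][j] for j in range(1, n + 1))
    first_col = max(S[i][1] for i in range(1, m + 1))
    return max(first_row, first_col)


def toy_weight(x, y):
    # Assumed toy scores, not Dayhoff's matrix: 2 for identity, 0 otherwise.
    return 2 if x == y else 0


print(best_path_score("ACB", "AB", toy_weight, 1))
```

With these toy scores, aligning ACB with AB gives 3: two identities worth 2 each, minus one gap penalty of 1 for skipping C. Because M carries the running maximum of S, each cell is filled in constant time.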
No gap penalty is added at either end of the path.

Alignment of two clusters of aligned sequences

To compare more than two sequences, clusters of aligned sequences are regrouped step by step with an alignment algorithm which is an extension of the preceding one (10). Let B1 ... BP be the sequences of one cluster and C1 ... CQ those of a second cluster. When generating the matrix S in order to align sequences C1 ... CQ with sequences B1 ... BP, a scoring scheme is adopted that includes a contribution from all previously aligned sequences, thus giving more weight to regions already aligned. Let i (resp. j) be the position of an aligned residue in sequences B1 ... BP (resp. C1 ... CQ); then:

\[ w(i,j) = \frac{1}{P\,Q} \sum_{R=1}^{P} \sum_{S=1}^{Q} D\bigl( B_R(i),\, C_S(j) \bigr) \]

where D is the matrix of amino acid pair scores. For example, the weight of (Ala-Val-Leu) aligned with (Ala-Leu) is given by the weight of [ (Ala vs. Ala) + (Ala vs. Leu) + (Val vs. Ala) + (Val vs. Leu) + (Leu vs. Ala) + (Leu vs. Leu) ] x 1/6. The value of D when one of its indices is a gap is set to 0. Matrices S and M are computed as before with the new values of w. One obtains a cluster of P + Q aligned sequences that takes the place of the two clusters of P and Q aligned sequences.

Order of clustering

The order in which the sequences are compared can have an effect on the final multiple alignment, hence a good order must be chosen. The method used here is to perform a hierarchical clustering of the sequences, using the scores of the pairwise comparisons as an index of similarity between sequences. The principle is simple: the hierarchy is built from its base (the set of sequences), creating new clusters by union of the two already created clusters that are the closest (16).

Let A1, A2 ... AN be the N sequences to be aligned.
All pairwise comparisons are performed by a fast algorithm (17) and stored in a matrix T1: T1(I,J) is the score of the alignment of AI with AJ. Then clusters of aligned sequences are defined as follows.

At step 1, there are N clusters, each including one sequence. The best score T1(I,J) is found in the matrix T1. The sequences AI and AJ (whose score is the best) are aligned, and the alignment of the two sequences forms a cluster that takes the place of the I-th sequence. The J-th sequence is suppressed. A new matrix of scores T2 is built; its dimension is (N-1) and it is a copy of T1 where column and row J are deleted and column and row I are built from columns and rows I and J of T1: T2(I,K) is the mean of the two scores T1(I,K) and T1(J,K). T2(I,K) is called the "score" of cluster I vs. cluster K.

At step s (s = 1, 2, ..., N-1), there are N - s + 1 clusters of sequences and the matrix Ts holds the "scores" between these clusters. If the greatest element of Ts is Ts(I,J), clusters I and J are aligned and the resulting alignment forms the new cluster I. Cluster J is deleted. The matrix Ts+1 is built as follows:

\[ T_{s+1}(K,L) = T_s(K,L) \quad \text{if } K, L \neq I, J, \]
\[ T_{s+1}(J,K) \text{ and } T_{s+1}(K,J) \text{ do not exist for any } K, \]
\[ T_{s+1}(I,K) = T_{s+1}(K,I) = \frac{N_I\, T_s(I,K) + N_J\, T_s(J,K)}{N_I + N_J} \quad \text{if } K \neq I, J, \]

where NI (resp. NJ) is the number of sequences in cluster I (resp. J). At every step, the number of clusters decreases by one, and one of the new clusters includes the sequences of two clusters of the preceding step. At step N, there is one cluster including the N aligned sequences.

Presentation of the complete process

0 - initialization: all pairwise comparisons are performed by a fast algorithm (17) and their scores are recorded.
1 - a hierarchical clustering of the sequences is done using these scores.
2 - the hierarchical tree is climbed with the pairwise alignment of clusters to obtain the complete alignment.
3 - the alignment is shown, recorded or printed.
A score is given for the multiple alignment: it is the sum of the scores of all the pairwise alignments included in the multiple one.
4 - a new hierarchical clustering is done with these new scores.
5 - if the new clustering is different from the old one, a new multiple alignment can be done following the new clustering (step 2). This process can be repeated until the clustering of the sequences is unchanged.

[Figure 1: dendrogram. Leaf labels, in the order shown: PC50 RF2C RF2S QF2R QF2P RF2P RF2A RF2V RD2 HU CH BN RF2G QF2M QF2F QFM2 QFF2 PC54 PS41 PS42 SG6 AA6 AU6 BF6 PR6 EG6 ML6 PSVM CF55 PH55 QF2T PS5A PS5S PS5M AV5 PS5F PS5D RFG2 DV53]

Figure 1 - Hierarchical clustering of 39 related bacterial, algal and mitochondrial cytochromes c. PC50: Paracoccus denitrificans c550, RF2S: Rhodopseudomonas sphaeroides c2, RF2C: R. capsulata c2, RF2P: R. palustris c2, RF2A: R. acidophila c2, RF2V: R. viridis c2, RF2G: R. globiformis c2, RFG2: R. gelatinosa c2, HU: Human c, CH: Chicken c, BN: Tuna c, RD2: Rhodomicrobium vannielii c2, QF2R: Rhodospirillum rubrum c2, QF2P: R. photometricum c2, QF2M: R. molischianum c2, iso-1, QF2F: R. fulvum c2, iso-1, QFM2: R. molischianum c2, iso-2, QFF2: R. fulvum c2, iso-2, QF2T: R. tenue c2, PS5A: Pseudomonas aeruginosa c551, PS5F: P. fluorescens biotype C c551, PS5S: P. stutzeri c551, PS5M: P. mendocina c551, PS5D: P. denitrificans c551, PS41: P. aeruginosa c4, 1st half, PS42: P. aeruginosa c4, 2nd half, PSVM: P. mendocina c5, AV5: Azotobacter vinelandii c551, SG6: Spirulina maxima c6, AA6: Anacystis nidulans c6, AU6: Alaria esculenta c6, PR6: Porphyra tenera c6, BF6: Bumilleriopsis filiformis c6, ML6: Monochrysis lutheri c6, EG6: Euglena gracilis c6, PC54: Paracoccus sp. c554, CF55: Chlorobium limicola f.sp. c555, PH55: Prosthecochloris aestuarii c555, DV53: Desulfovibrio vulgaris c553.

[Figure 2: alignment of the 39 sequences, printed in two blocks (column ruler 1 to 60, then 70 to 120), one row per sequence labelled with the codes of Fig. 1.]

Figure 2 - Alignment of cytochrome c sequences in 39 species (see Fig. 1). The sequences are ordered and clustered as in (15). Sequences that were too long have been cut at their extremities: 1 gsd, 2 aas, 3 avtkadv, 4 aa, 5 pvaggea.

Figure 2 (continued) - When the homology is strong inside a family, the residue has been represented with a capital letter.

Technical description

The program is written in the Turbo Pascal language (Borland) and runs on a microcomputer with MS-DOS (Microsoft). Dynamic memory allocation is used throughout, so that the number and size of sequences which can be handled is limited only by hardware and MS-DOS considerations.
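Two pieces of the method described earlier, the weight between two clusters and the update of the score matrix during clustering, lend themselves to a short sketch. The code below is an illustration in Python under stated assumptions, not the distributed Pascal source: the names `cluster_weight`, `clustering_order` and `GAP`, the representation of an alignment column as a string of residues, and the dictionary used for the score matrix are all hypothetical choices made for the example.

```python
from itertools import product

GAP = "-"  # assumed gap symbol


def cluster_weight(col_b, col_c, D):
    """Mean of D over every residue pair between two aligned columns.

    col_b and col_c hold the residues found at position i of cluster B and
    position j of cluster C; a pair involving a gap contributes 0.
    """
    total = sum(0 if GAP in (x, y) else D(x, y)
                for x, y in product(col_b, col_c))
    return total / (len(col_b) * len(col_c))


def clustering_order(T):
    """Return the successive merges for a symmetric score matrix T.

    T[i][j] is the pairwise score of sequences i and j (higher = closer).
    Each merge is a pair of tuples of sequence indices.
    """
    n = len(T)
    scores = {(i, j): T[i][j] for i in range(n) for j in range(i + 1, n)}
    members = {i: (i,) for i in range(n)}
    merges = []
    while len(members) > 1:
        I, J = max(scores, key=scores.get)      # closest pair of clusters
        n_i, n_j = len(members[I]), len(members[J])
        for K in members:
            if K in (I, J):
                continue
            key_i = (min(I, K), max(I, K))
            key_j = (min(J, K), max(J, K))
            # Size-weighted mean of the two old scores; row/column J vanishes.
            scores[key_i] = (n_i * scores[key_i]
                             + n_j * scores.pop(key_j)) / (n_i + n_j)
        del scores[(I, J)]
        merges.append((members[I], members[J]))
        members[I] = members[I] + members.pop(J)
    return merges
```

With an identity score (1 for identical residues, 0 otherwise), `cluster_weight("AVL", "AL", D)` averages the six pairs of the Ala-Val-Leu versus Ala-Leu example. The list returned by `clustering_order` is the order in which clusters would be aligned when climbing the tree; ties between equal scores are broken arbitrarily here, a point the text does not specify.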
Binary and Pascal codes for academic distribution are available from the author. A 5 1/4 or 3 1/2 inch diskette should be sent with the request.

RESULTS AND DISCUSSION

The method has been used to align amino acid sequences of 39 related bacterial, algal and mitochondrial cytochromes c. The sequences were extracted from the NBRF Protein Data Base (Release 15.0, January 1988). The gap penalty was set to 8 and the weight for substitution of one amino acid by another was given by Dayhoff's matrix (adding 8 to all entries). The weight for conservation of the methionine was increased from 14 to 18 because of its known importance in holding the heme. The initial pairwise comparisons were executed by FASTP (17). With their scores, a first clustering and a first alignment were done. Then the scores of the pairwise comparisons included in this alignment were calculated, and the new hierarchical clustering was as in Figure 1. A second iteration gave the alignment of Figure 2. The clustering was the same, so no further iteration was needed. The alignment was produced entirely automatically, without any prealignment of key regions.

The results can be compared with a multiple alignment obtained by Dickerson (18) on evidence from X-ray structural analyses, so that structurally equivalent regions of the chain are aligned. Dickerson defined 7 families on criteria such as length and origin of the bacteria: the hierarchical clustering of the sequences (Fig. 1) gives the same families as Dickerson's, except that P. mendocina c5 is clustered with the cytochromes c555 and not with the cytochromes c4. Both alignments are similar. The residues that hold the heme are aligned in all the sequences (CxxCH near the beginning of the sequence and M near the end), except the methionine in the sequence of D. vulgaris c553 (Fig. 2). The massive deletions in the c551 are located at the folds of the c2 proteins.
Without the three-dimensional structures, the problems of where to place these deletions and of aligning the methionines seem to have been "insurmountable", to quote Dickerson (18), but the present program can do it.

The total score is not a good criterion for assessing the quality of a global multiple alignment, as it is the sum of pairwise scores. The accuracy of an alignment of sequences has been defined as the percentage of residues that are aligned as in a reference alignment, obtained by crystallography, within test zones that are important for the 3D structure (10). The accuracy of the alignment of Fig. 2, calculated on 16 residues and with reference to Dickerson's alignment, is 95.1%, while the accuracy obtained with Barton and Sternberg's algorithm is 92.8%. With their algorithm or with Taylor's (9), the methionines that hold the heme are aligned in every family but not between families.

An advantage of the present algorithm over those that incorporate the sequences one by one into the final alignment is that the sequences are aligned first inside the families; hence, when families are compared, the weight of already aligned residues is great and the best alignment is found when they are aligned. Feng and Doolittle use the same clustering scheme (13). In their method, the sub-cluster including the best scored pair of sequences is not handled by the same algorithm as other clusters. This approach gives a major weight to the best scored pair, while other pairs of sequences could have nearly the same similarity score. Here, all clusters are aligned with the same algorithm.

It seems difficult to prove mathematically the convergence of the iterative process, because the distance between two sequences, used to perform the clustering, is a function of the global alignment.
This convergence, however, has always been observed after one or two iterations, for all the treated examples.

The program can run on computers that use MS-DOS, which are the most widespread in laboratories. On a 10 MHz 80286 AT computer, it took 48 minutes to align the 39 sequences of cytochromes c, the length of which is around 150. The time required for an alignment by this algorithm is approximately proportional to N(N-1)M², where N is the number of sequences and M is the length of the sequences when aligned. This also applies to Barton and Sternberg's method, contrary to what is written in their article (10).

CONCLUSION

The program described here allows one to find an alignment of many related sequences. It can be used either for proteins or for nucleic acids, and it takes account of closer relationships that can exist among some subsets of sequences; hence, it is attractive when there are subgroups in the family of sequences under study. Global alignment of large numbers (50 to 250) of small sequences or smaller numbers of medium-length sequences (150 to 300) can be obtained easily.

ACKNOWLEDGEMENT

I thank Drs Daniel Kahn and Denis Corpet for their many helpful discussions.

REFERENCES

1. NEEDLEMAN, S.B. and WUNSCH, C.D. (1970) J. Mol. Biol. 48, 443-453.
2. MURATA, M., RICHARDSON, J.S. and SUSSMAN, J.L. (1985) Proc. Natl. Acad. Sci. USA 82, 3073-3077.
3. BAINS, W. (1986) Nucl. Acids Res. 14, 159-177.
4. SANKOFF, R.J. and CEDERGREN, G.L. (1976) J. Mol. Evol. 7, 133-149.
5. SOBEL, E. and MARTINEZ, H.M. (1986) Nucl. Acids Res. 14, 363-374.
6. MARTINEZ, H.M. (1988) Nucl. Acids Res. 16, 1683-1691.
7. SANTIBANEZ, M. and ROHDE, K. (1987) CABIOS 3, 111-114.
8. BACON, D.J. and ANDERSON, W.F. (1986) J. Mol. Biol. 191, 153-161.
9. TAYLOR, W.R. (1987) CABIOS 3, 81-88.
10. BARTON, G.J. and STERNBERG, M.J.E. (1987) J. Mol. Biol. 198, 327-337.
11. GRIBSKOV, M., MCLACHLAN, A.D. and EISENBERG, D. (1987) Proc. Natl. Acad. Sci. USA 84, 4355-4358.
12. GRIBSKOV, M., HOMYAK, M., EDENFIELD, J. and EISENBERG, D. (1988) CABIOS 4, 61-66.
13. FENG, D-F. and DOOLITTLE, R.F. (1987) J. Mol. Evol. 25, 351-360.
14. DAYHOFF, M.O. (1978) In DAYHOFF, M.O. (ed), Atlas of Protein Sequence and Structure, National Biomedical Research Foundation, Washington, D.C., Vol 5, suppl. 3, pp 345-358.
15. BARTON, G.J. and STERNBERG, M.J.E. (1987) Protein Eng. 1, 89-94.
16. BENZECRI, J.P. (ed) (1973) L'Analyse Des Donnees, Vol. 1, pp 153-206, Dunod, Paris.
17. LIPMAN, D.J. and PEARSON, W.R. (1985) Science 227, 1435-1441.
18. DICKERSON, R.E. (1980) Sci. Am. 242, 98-110.

Nucleic Acids Research Oxford University Press

Publisher: Oxford University Press
ISSN: 0305-1048
eISSN: 1362-4962
DOI: 10.1093/nar/16.22.10881