Generalized Bootstrap Supports for Phylogenetic Analyses of Protein Sequences Incorporating Alignment Uncertainty

Generalized Bootstrap Supports for Phylogenetic Analyses of Protein Sequences Incorporating... Abstract Phylogenetic reconstructions are essential in genomics data analyses and depend on accurate multiple sequence alignment (MSA) models. We show that all currently available large-scale progressive multiple alignment methods are numerically unstable when dealing with amino-acid sequences. They produce significantly different output when changing sequence input order. We used the HOMFAM protein sequences dataset to show that on datasets larger than 100 sequences, this instability affects on average 21.5% of the aligned residues. The resulting Maximum Likelihood (ML) trees estimated from these MSAs are equally unstable with over 38% of the branches being sensitive to the sequence input order. We established that about two-thirds of this uncertainty stems from the unordered nature of children nodes within the guide trees used to estimate MSAs. To quantify this uncertainty we developed unistrap, a novel approach that estimates the combined effect of alignment uncertainty and site sampling on phylogenetic tree branch supports. Compared with the regular bootstrap procedure, unistrap provides branch support estimates that take into account a larger fraction of the parameters impacting tree instability when processing datasets containing a large number of sequences. Bootstrap analysis, evolutionary analysis, multiple sequence alignment, phylogenetic tree reconstruction Producing accurate and robust analyses of genomic data has become one of the main challenges of modern biology. We show here that this issue is rather problematic when estimating multiple sequence alignments (MSAs) and phylogenies on datasets containing a large number of sequences and that the associated uncertainty can impact the full spectrum of biological data interpretation based on these models. MSA instability and its implication on phylogenetic reconstruction has long been known to biologists (Wong et al. 2008). Yet the lack of MSA robustness reported by Wong et al. (2008) on small datasets, albeit quantifiable and significant, was mostly concentrated on low support branches and therefore relatively easy to deal with. In contrast, the instability we report on very large datasets affects a much larger fraction of the models including seemingly well-supported branches. Instability may therefore prove a major confounding factor when considering the increasing reliance of genomic analysis on trees encompassing anything between a few thousand taxa (Krypotou et al. 2015; Pittis and Gabaldón 2016) up to 5000 (Vandewege et al. 2016) or even 10 000 for the largest (Jetz et al. 2012). In this work, we show how some of the uncertainty associated with these models can be precisely quantified both at MSA and phylogenetic reconstruction levels. MSA instability and its main causes have been extensively surveyed (Redelings and Suchard 2008) and shown to be an intrinsic property of sequence comparison. In most MSAs, alignment portions can be totally undefined or highly dependent on a combination of parameters whose numerical instability reflects some form of under-determination. Statistical methods can be used to assign probabilities to these effects but their high computational requirements restricts them to a hundred or less sequences (Redelings and Suchard 2007). The other main source of instability is related to the reliance on sub-optimal models imposed by the NP-Complete nature of MSAs computation (Wang and Jiang 1994; Notredame et al. 1998). NP-completeness makes it hard to distinguish between the effect of suboptimal optimisation and that of true alignment uncertainty. This is the reason why most MSA robustness measurements, including the ones reported here, reflect the combination of these two factors. In this work, we have focused our analysis on the effect of the progressive alignment framework (Hogeweg and Hesper 1984) onto MSA robustness. The progressive alignment is the most commonly used MSA heuristic algorithm and has so far been the only approach scaling with datasets containing a large number of sequences. Although a large number of variants have been reported (Chatzou et al. 2016) the core principle is always the same and involves incorporating sequences one by one following the order indicated by a pre-computed guide tree. Tree-based consistency procedures (Notredame et al. 2000) have been developed to address the greedy nature of this procedure but they do not scale well over a thousand sequences and current large-scale efforts have therefore focused on efficient guide tree computation (Katoh et al. 2002; Liu et al. 2009; Sievers et al. 2011; Mirarab et al. 2015). Guide trees are not, however, neutral components and their topologies are known to influence that of downstream phylogenetic trees (Lake 1991). In this work, Lake shows how all possible tree topologies associated with a given dataset can potentially lead to a different alignment. Interestingly, Lake wrongly assumed alignment symmetry, a condition that does not hold true when using dynamic programming as an alignment engine. Unfortunately, guide tree estimation is not a very robust process and even minute fluctuations such as the sequence input order can have an effect on their topologies (Takezaki 1998) owing to the existence of ties in the sequences distance matrix they are built upon (Boyce et al. 2015). This effect is hard to control but its influence on MSA stability can be quantified using the GUIDANCE protocol (Penn et al. 2010) in which the bootstrap replicates drawn from a seed MSA are used to estimate alternative guide trees and MSAs. The reason why even minute changes in the guide tree topology can have such an impact on the downstream MSA is a property of the Needleman and Wunsch (NW) (Needleman and Wunsch 1970) pairwise alignment algorithm used to merge pairs of partial MSAs. Given the most common NW implementations, alternative optimal alignments can be produced depending on the input order. This effect can be quantified using the Head or Tail (HoT) protocol (Landan and Graur 2007). The HoT principle is relatively straightforward and involves re-aligning two sequences (or groups) in their HoT (inverted sequences) configuration. Tail alignments provide the equivalent of an NW input order swap. HoT can be applied to entire datasets or at every node of a guide tree and can also be combined with GUIDANCE (Sela et al. 2015) so as to explore and combine guide tree and alignment induced uncertainty when estimating MSAs. While several alternative evaluation schemes have been developed for the purpose of evaluating MSA local reliability (Castresana 2000; Capella-Gutiérrez et al. 2009; Chang et al. 2014), HoT and GUIDANCE are the only ones whose explicit purpose is to quantify the level of agreement between the original sequence data and its MSA modeling. In this report, we show that these quantities can easily be integrated not only at the MSA level but also with respect to their combined influence on phylogenetic tree stability. We show how a simple sequence reordering procedure (input sequences shuffling) can help in untapping a significant amount of alignment numerical instability. Even though the source of this instability bears some similarity with HoT, we show that its effect is clearly distinct and complementary. Most importantly, we show how the progressive alignment instability can be combined with the regular bootstrap procedure so as to yield branch support values reflecting a broader range of data support for the final model. We named this approach unistrap because it has the potential to unify branch support estimates by combining sampling and computational effects. Our analysis shows that on large datasets unistrap makes it possible to avoid overconfidence in a sizeable fraction of the branches. Materials and Methods Unistrap Package The Unistrap software is a pipeline implemented in the Nextflow language (Di Tommaso et al. 2017) (see Supplementary Material Online, Fig. S9 available on Dryad at http://dx.doi.org/10.5061/dryad.bk203) and available along with its documentation from https://github.com/cbcrg/unistrap. Given a dataset made of unaligned amino-acid sequences in Fasta format, a multiple aligner and a tree reconstruction method, unistrap delivers a newick phylogenetic tree along with unistrap support values for each branch inserted in the place of bootstrap support values. In theory any combination of aligners and trees could be used. In practice, unistrap currently support three aligners (ClustalO, Mafft, and PASTA) and one tree reconstruction method (FastTree using ML (Price et al. 2009)). Unistrap support values are obtained by generating shuffle MSA replicates (100 by default) that are then used to generate bootstrap replicates (one for each shuffle MSA replicate by default). Branch support values are then estimated from the collection of trees generated from the replicates. The exact procedure is detailed in the unistrap branch support section below. Datasets MSA instability of large-scale MSAs was carried out on the HOMFAM benchmark dataset, which consists of 94 protein family ranging from 94 to 94 000 sequences. The datasets were originally built by combining PFAM entries (Sievers et al. 2011) with existing structural alignment of sequences corresponding to these families. In their original report no specific attempt was made at balancing the Operational Taxonomic Unit coverage (OTU) whose breakdown was not reported. Potential taxonomic bias are therefore similar to the ones reported in the corresponding PFAM release. We divided these datasets in three categories based on the number of sequences they contain: SMALL 94 to 3000 sequences, MEDIUM from 3001 to 10 000 sequences, and LARGE from 10 001 to 94 000 sequences. All generated trees, alignments, and comparisons are available from http://genome.crg.es/~cnotredame/data/supp/unistrap/. The guidance analyses pipeline and datasets can be found at https://github.com/skptic/shuffle-guidance. MSA Shuffle Replicates Generation For each dataset 100 replicates were generated by randomly reordering the sequences. Three datasets collections were then compiled using three large-scale aligners: Clustal-O, MAFFT with its large-scale parttree option (mafft – anysymbol –parttree input_file $$>$$ output_file), and PASTA (Table 1). Table 1. Tools and methods Tools used Version Availability Clustal-O (HOMFAM Benchmark) 1.2.1 http://www.clustal.org/omega/ MAFFT 6.956 http://mafft.cbrc.jp/alignment/software/ PASTA 1.6.1 https://github.com/smirarab/pasta FastTree 2.1.8 http://meta.microbesonline.org/fasttree/ Fast-Tree Comparison Tools As downloaded Feb. 2015 http://meta.microbesonline.org/fasttree/treecmp.html SEQBOOT 3.68 http://evolution.genetics.washington.edu T-Coffee 11 http://tcoffee.crg.cat GUIDANCE 2.02 http://guidance.tau.ac.il/ver2/source.php Tools used Version Availability Clustal-O (HOMFAM Benchmark) 1.2.1 http://www.clustal.org/omega/ MAFFT 6.956 http://mafft.cbrc.jp/alignment/software/ PASTA 1.6.1 https://github.com/smirarab/pasta FastTree 2.1.8 http://meta.microbesonline.org/fasttree/ Fast-Tree Comparison Tools As downloaded Feb. 2015 http://meta.microbesonline.org/fasttree/treecmp.html SEQBOOT 3.68 http://evolution.genetics.washington.edu T-Coffee 11 http://tcoffee.crg.cat GUIDANCE 2.02 http://guidance.tau.ac.il/ver2/source.php Table 1. Tools and methods Tools used Version Availability Clustal-O (HOMFAM Benchmark) 1.2.1 http://www.clustal.org/omega/ MAFFT 6.956 http://mafft.cbrc.jp/alignment/software/ PASTA 1.6.1 https://github.com/smirarab/pasta FastTree 2.1.8 http://meta.microbesonline.org/fasttree/ Fast-Tree Comparison Tools As downloaded Feb. 2015 http://meta.microbesonline.org/fasttree/treecmp.html SEQBOOT 3.68 http://evolution.genetics.washington.edu T-Coffee 11 http://tcoffee.crg.cat GUIDANCE 2.02 http://guidance.tau.ac.il/ver2/source.php Tools used Version Availability Clustal-O (HOMFAM Benchmark) 1.2.1 http://www.clustal.org/omega/ MAFFT 6.956 http://mafft.cbrc.jp/alignment/software/ PASTA 1.6.1 https://github.com/smirarab/pasta FastTree 2.1.8 http://meta.microbesonline.org/fasttree/ Fast-Tree Comparison Tools As downloaded Feb. 2015 http://meta.microbesonline.org/fasttree/treecmp.html SEQBOOT 3.68 http://evolution.genetics.washington.edu T-Coffee 11 http://tcoffee.crg.cat GUIDANCE 2.02 http://guidance.tau.ac.il/ver2/source.php MSA Identity Average level of identity among aligned residues in a pairwise alignment further averaged across all possible pairwise projections in the target MSA. Measurements were carried out on Clustal-O MSAs using the following T-Coffee command: t_coffee – other_pg seq_reformat –in sample_ aln1.aln –output sim MSA Alignment Stability Using Sampled Sums of Pairs Measurements Given a dataset, MSA stability was measured by running a Sum-of-Pairs analysis (S-o-P) on MSA pairs, and as such estimating S-o-P similarity. The exact S-o-P measure estimates the fraction of pairs-of-residues aligned the same way (i.e. same indexes) between the two MSAs. The final score is: S-o-P (MSA1, MSA2) $$=2\times$$ ((Number of Identically aligned pairs between MSA1 and MSA2)/(Number of aligned pairs in MSA1$$+$$Number of aligned pairs in MSA2)). The S-o-P was obtained by averaging the output of the two following commands: t_coffee - other_pg aln_compare -al1 MSA1 -al2 MSA2 t_coffee - other_pg aln_compare -al1 MSA1 -al2 MSA2 When dealing with datasets larger than 100 sequences, estimating exact S-o-P similarity between pairs of MSAs becomes computationally prohibitive. We therefore replaced the exact approach described above with a sampling approach that involved randomly selecting the same 50 probe sequences in the two MSAs being compared and measuring the S-o-P score of the two projections. This operation was repeated a 100 times and the resulting values averaged to yield the estimated S-o-P similarity between the two compared MSAs. The choice of 50 probe sequences was based on the observation that there is a near perfect correlation between S-o-P measures carried out using 50 or 100 probe sequences (see Supplementary Material Online, Fig. S10 available on Dryad). MSA Accuracy Stability It was estimated by measuring the S-o-P similarity between every replicate and its HOMFAM structure based reference using the HOMFAM provided package. For a given dataset, this procedure involves extracting the projection of the sequences with known 3D structures and comparing it with the corresponding reference MSA using a standard S-o-P comparison. t_coffee - other_pg aln_compare -al1 reference_ MSA -al2 target_MSA Tree Computation Trees were computed using FastTree (for the three largest datasets FastTree was used with the -fastest option) on Clustal-O and MAFFT MSAs. In the case of PASTA, which is both an aligner and a tree estimator, we used the trees directly recovered from the PASTA final output. Unistrap Branch Support We used the FastTree Robinson and Foulds (FTRF) implementation of the Robinson and Foulds (FR) metrics to compare topological similarity between pairs of alternative trees. FTRF provides for each branch both the depth and a count of the presence of that branch across a collection of replicate trees. The shuffle replicate trees were compared all against all using the FTRF metrics. The tree having the highest average topological identity with the rest of the replicates was then selected as the central tree to be used as a reference. Each shuffle MSA was then used to draw one bootstrap replicate using Phylip’s SEQBOOT. The 100 replicates were used to estimate 100 phylogenetic trees whose topologies were used to estimate the branch support of the reference tree using FTRF with the -$$n$$ option (tree collections) and CompareToBootstrap.pl (FastTree suite) to obtain the bootstrap branch support values. For undocumented reasons FastTree was unable to estimate the branch support of a very small number of branches ($$<$$1%). These branches were removed from the analysis across all considered trees. Bootstrap Branch Support Bootstrap support values were estimated on three datasets from different size groups (mmp with 1427 taxa from the smaller-size dataset group, subt with 7517 from the medium-size group and p450 with 21 013 from the large-size group) for the Clustal-O MSAs. Given the original ClustalO MSA estimated from sequences ordered as provided in HOMFAM, we used SEQBOOT to generate 100 bootstrap replicates that were used to estimate branch support values using FTRF-$$n$$ and CompareToBootstrap.pl as described above. PhyML Type Local Branch Support PhyML type local branch support values were estimated by FastTree during tree inference. These FastTree resulting support values are the FastTree implementation of the PhyML 3’s SH-like local supports (Guindon et al., 2010). Tree ML Stability It is important to insure that tree topological variations coincide with significant ML variations. For this purpose, one therefore needs to monitor ML variations across the replicates in a meaningful way, taking into account the difficulty in comparing ML scores against different MSAs. For each dataset, the 10 shuffle trees having the lowest average RF topological similarity with the rest of the replicate trees were selected. Among these the MSA associated with the tree having the lowest average RF topological similarity was then selected and used to evaluate the ML score of the other trees, using the following FastTree procedures: FastTree -gamma -nt -gtr -nome -mllen -intree topology2 -log topology2.log alignment.p $$>$$ topology2.ftlenperl GammaLogToPaup.pl topology1.log topology2.log $$>$$ top12.txt Depth Analyses The depth analyses were carried out by selecting from each tree either the deep branches (defined as having $$>$$ 10 children) or the shallow branches (defined as $$<$$10 children) and averaging the support of the selected branches. The analysis shown in the main text were carried out using ClustalO and similar analyses shown in the supplement were carried out using MAFFT (see Supplementary Material Online, Fig. S5 available on Dryad) and PASTA (see Supplementary Material Online, Fig. S6 available on Dryad). T-Coffee Highlow Mode Dummy alignments shown in Figure 3a were obtained using the sequences and trees displayed on the figure along with the following T-Coffee command: t_coffee –seq $$<$$ seq $$>$$ -usetree $$<$$ tree $$>$$ -mode highlow This mode was implemented so as to show the potentially extreme effect of arbitrary tie breaking when computing a progressive alignment using a canonical progressive algorithm. Under this mode, ties are broken in the following order: insertion in the first sequence (i.e. gap in the second sequence), substitutions, insertion in the second sequence (i.e. gap in the first sequence). This special mode does not use consistency. Tree Node Rotation These rotations were used to show the effect of an inversion of pairwise sequence input order when using a progressive alignment algorithm. The rotations were achieved by reading in newick trees and outputting them while randomly swapping left and right children of the tree data structure. The procedure is implemented in the T-Coffee seq_reformat package: t_coffee –other_pg seq_reformat –input newick –output newick_shuffle $$>$$ output_ shuffled_newick_tree_file For each dataset, a guide tree was generated using the default sequence input order and this tree was used to generate 100 rotated, topologically identical guide trees. The trees were then fed as guide trees to Clustal-O and the resulting MSAs processed as indicated above. HoT $$+$$ Shuffle This combination was carried out in order to show the combined effect of these two procedure when estimating MSAs. The original HoT algorithm has been incorporated in T-Coffee. For each of the 31 unstable Prefab datasets, a ClustalW MSA was estimated and the deepest branch (defined as the one for which the smallest split contains the largest number of sequences among all possible smallest splits) was used to estimate the seven HoT replicates. Each replicate was compared with the original MSA and estimated for its S-o-P similarity. The most diverse among the seven replicates was then kept and was re-computed whilst adding a shuffle step to the HoT step (i.e. before being re-aligned in HoT, the groups of sequences in each split were randomly re-ordered). This procedure has been implemented in the latest version of T-Coffee and can be called using: t_coffee -other_pg seq_reformat -in $\$$ input -action $$+$$hotshot2 GUIDANCE $$+$$ Shuffle For each of the 31 unstable Prefab datasets, GUIDANCE was executed with both default and shuffled input order 10 times using the following command line: perl guidance.pl–seqFile $\$$ {datasetFile}–msaProgram CLUSTALW–seqType aa–outDir $\$${resultsDir}–bootstraps 10–program GUIDANCE– clustalw clustalw2 The resulting 100 GUIDANCE alternative alignments and 100 Shuffled-GUIDANCE alternative alignments were evaluated for stability using S-o-P scores. This procedure was repeated 10 times and average stabilities of GUIDANCE and GUIDANCE$$+$$Shuffle alignments calculated. Software Version and Availability Unless otherwise stated, all third party tools were used with default parameters. The name, version, and source of the software are given in Table 1. Supplementary Material, including data files and/or online-only appendices, can be found at http://datadryad.org/review?doi=doi:10.5061/dryad.hv752 Results and Discussion Effect of Sequence Input Order Shuffling on Large Datasets Containing 100 Amino-Acid Sequences or More We used sequence input shuffling as a means to reveal and quantify MSA construction instability. This procedure has the advantage of being compatible with most aligners without requiring invasive code adaptation. The effect of sequence shuffling onto MSA computation is not entirely obvious, and given a standard progressive alignment procedure, under which sequences are all compared in pairs to estimate a distance matrix and its associated guide-tree, one would naively expect that realigning several replicates of the same dataset in which sequences are shuffled in random orders should have a limited effect—if any at all. It has, however, recently been confirmed that this was not the case (Boyce et al. 2015). We therefore decided to quantify this phenomenon on large datasets and determine its impact on MSA and phylogenetic tree numerical stability. We applied our methodology onto HOMFAM, the most widely used reference dataset for ultra large MSAs (94 datasets ranging between 93 and 93 681 sequences) and on Prefab, a collection of smaller datasets (1683 datasets ranging between 2 and 50 sequences with an average of 45 sequence/dataset) (Edgar 2004). The most relevant characteristic of these datasets is to be composed of real proteins gathered through a realistic database search protocol. There are currently only three methods able to handle all HOMFAM datasets: Clustal-O (Sievers et al. 2011), the phylogeny aware aligner PASTA (Katoh et al. 2002; Mirarab et al. 2015) and MAFFT-fftns, using the PartTree algorithm flag (Katoh et al. 2002) (Table 1). We estimated the effect of input-order shuffling on each method by generating 100 shuffled replicates for each HOMFAM dataset. Replicates produced by a given aligner were then compared to one another using the Sums of Pairs (SoP) in order to estimate the aligners’ sensitivity to the reordering effect. The SoP measure estimates the fraction of residue pairs aligned identically (i.e. same indexes in their respective sequences) between two MSAs. Results (Fig. 1a, Supplementary Material Online, Figs. S1a and S2a available on Dryad, Table 2), indicate a high sensitivity of the three aligners to the reordering effect. On average, 78.4% of residue pairs are identically aligned across pairs of replicates produced with the same aligner (i.e. 21.6% of the pairs are differently aligned across replicates). The lowest figure was measured on MAFFT-fftns whose average stability is 71.29% but drops down to 56.7% when considering the 43 HOMFAM datasets having $$<$$ 30% average identity (Table 2). Figure 1. View largeDownload slide MSAs and trees are sensitive to sequence input order. a) For each HOMFAM protein benchmark dataset, 100 replicates with shuffled sequence input order were generated and processed into MSAs by Clustal-O. For each dataset the average similarity among replicates was estimated using an S-o-P comparison (%) and plotted against the dataset’s average identity (Spearman correlation rs$$=$$ 0.79). b) This instability results in a high MSA structural accuracy variability among shuffle replicates, also correlating with MSA identity (rs$$= -$$0.51). c) Shuffled input sequences result in Robinson and Foulds topological variability among FastTree maximum likelihood trees, which correlates with the number of sequences (rs$$=$$$$-$$0.40). Figure 1. View largeDownload slide MSAs and trees are sensitive to sequence input order. a) For each HOMFAM protein benchmark dataset, 100 replicates with shuffled sequence input order were generated and processed into MSAs by Clustal-O. For each dataset the average similarity among replicates was estimated using an S-o-P comparison (%) and plotted against the dataset’s average identity (Spearman correlation rs$$=$$ 0.79). b) This instability results in a high MSA structural accuracy variability among shuffle replicates, also correlating with MSA identity (rs$$= -$$0.51). c) Shuffled input sequences result in Robinson and Foulds topological variability among FastTree maximum likelihood trees, which correlates with the number of sequences (rs$$=$$$$-$$0.40). Table 2. Comparative readouts for alignment and phylogenetic instability on large datasets Average overall Sequence identity Number of sequences Under 30% Over 30% 94-3K 3K-10K 10K-94K Aligner C P M AVG C P M C P M C P M C P M C P M # datasets 94 43 51 41 33 20 MSA stability (%) 80.00 84.07 71.29 78.45 69.68 77.41 56.70 87.89 89.69 83.59 84.85 86.36 77.89 79.65 84.37 72.65 68.59 78.92 55.53 Unstable residue pairs (%) 20 15.93 28.71 21.55 30.32 22.59 43.3 12.11 10.31 16.41 15.15 13.64 22.11 20.35 15.63 27.35 31.41 21.08 44.47 MSA accuracy stdev 3.29 4.31 5.13 4.24 4.04 5.12 6.13 2.65 3.63 4.29 2.93 4.53 4.88 3.26 3.96 5.15 4.09 4.49 5.61 Tree stability (%) 67.93 64.18 53.84 61.98 67.49 68.65 47.04 68.29 60.41 59.56 72.22 67.17 60.80 68.58 64.03 54.61 58.05 58.3 38.30 Fraction of unstable branches (%) 32.07 35.82 46.16 38.02 32.51 31.35 52.96 31.71 39.59 40.44 27.78 32.83 39.2 31.42 35.97 45.39 41.95 41.7 61.7 Deep branches of low MSA and high local support (%) 43.92 33.68 64.27 47.29 44.67 29.00 72.17 43.29 37.63 57.61 36.71 31.54 54.65 43.24 31.57 63.88 59.85 41.55 84.66 Shallow branches of low MSA and high local support (%) 11.29 5.61 28.34 15.08 12.27 3.58 40.62 10.45 7.31 17.97 6.34 4.21 16.39 9.52 4.64 26.72 24.35 10.05 55.49 Average overall Sequence identity Number of sequences Under 30% Over 30% 94-3K 3K-10K 10K-94K Aligner C P M AVG C P M C P M C P M C P M C P M # datasets 94 43 51 41 33 20 MSA stability (%) 80.00 84.07 71.29 78.45 69.68 77.41 56.70 87.89 89.69 83.59 84.85 86.36 77.89 79.65 84.37 72.65 68.59 78.92 55.53 Unstable residue pairs (%) 20 15.93 28.71 21.55 30.32 22.59 43.3 12.11 10.31 16.41 15.15 13.64 22.11 20.35 15.63 27.35 31.41 21.08 44.47 MSA accuracy stdev 3.29 4.31 5.13 4.24 4.04 5.12 6.13 2.65 3.63 4.29 2.93 4.53 4.88 3.26 3.96 5.15 4.09 4.49 5.61 Tree stability (%) 67.93 64.18 53.84 61.98 67.49 68.65 47.04 68.29 60.41 59.56 72.22 67.17 60.80 68.58 64.03 54.61 58.05 58.3 38.30 Fraction of unstable branches (%) 32.07 35.82 46.16 38.02 32.51 31.35 52.96 31.71 39.59 40.44 27.78 32.83 39.2 31.42 35.97 45.39 41.95 41.7 61.7 Deep branches of low MSA and high local support (%) 43.92 33.68 64.27 47.29 44.67 29.00 72.17 43.29 37.63 57.61 36.71 31.54 54.65 43.24 31.57 63.88 59.85 41.55 84.66 Shallow branches of low MSA and high local support (%) 11.29 5.61 28.34 15.08 12.27 3.58 40.62 10.45 7.31 17.97 6.34 4.21 16.39 9.52 4.64 26.72 24.35 10.05 55.49 Sequence identity indicates the average sequence identity as estimated of the Clustal-O MSAs. Under 30% indicates the number of datasets having a percent identity lower than 30% and Over 30 those having a percent identity higher than 30%. On the Aligner line, C indicates Clustal-O, P PASTA, and M MAFFT and Avg indicates the average readouts measured on the three aligners. # datasets indicates the number of datasets corresponding to each category. MSA stability indicates the average sum of pairs similarity between shuffled MSAs. Unstable residue pairs indicates the average fraction of residue pairs whose alignment changes across replicates. MSA Accuracy Stdev is the variation in HomFam accuracy across MSA replicates. Tree stability is the average RF similarity normalized by the number of branches. Fraction of unstable branches is the average fraction of branches that differ across the trees estimated from the shuffled MSA replicates. Deep branches of low MSA and high local support indicates the fraction of branches having more than 10 children that occur in less than 20% of the replicate trees but have a local support higher than 80%. Shallow branches of low MSA and high local support (%). Table 2. Comparative readouts for alignment and phylogenetic instability on large datasets Average overall Sequence identity Number of sequences Under 30% Over 30% 94-3K 3K-10K 10K-94K Aligner C P M AVG C P M C P M C P M C P M C P M # datasets 94 43 51 41 33 20 MSA stability (%) 80.00 84.07 71.29 78.45 69.68 77.41 56.70 87.89 89.69 83.59 84.85 86.36 77.89 79.65 84.37 72.65 68.59 78.92 55.53 Unstable residue pairs (%) 20 15.93 28.71 21.55 30.32 22.59 43.3 12.11 10.31 16.41 15.15 13.64 22.11 20.35 15.63 27.35 31.41 21.08 44.47 MSA accuracy stdev 3.29 4.31 5.13 4.24 4.04 5.12 6.13 2.65 3.63 4.29 2.93 4.53 4.88 3.26 3.96 5.15 4.09 4.49 5.61 Tree stability (%) 67.93 64.18 53.84 61.98 67.49 68.65 47.04 68.29 60.41 59.56 72.22 67.17 60.80 68.58 64.03 54.61 58.05 58.3 38.30 Fraction of unstable branches (%) 32.07 35.82 46.16 38.02 32.51 31.35 52.96 31.71 39.59 40.44 27.78 32.83 39.2 31.42 35.97 45.39 41.95 41.7 61.7 Deep branches of low MSA and high local support (%) 43.92 33.68 64.27 47.29 44.67 29.00 72.17 43.29 37.63 57.61 36.71 31.54 54.65 43.24 31.57 63.88 59.85 41.55 84.66 Shallow branches of low MSA and high local support (%) 11.29 5.61 28.34 15.08 12.27 3.58 40.62 10.45 7.31 17.97 6.34 4.21 16.39 9.52 4.64 26.72 24.35 10.05 55.49 Average overall Sequence identity Number of sequences Under 30% Over 30% 94-3K 3K-10K 10K-94K Aligner C P M AVG C P M C P M C P M C P M C P M # datasets 94 43 51 41 33 20 MSA stability (%) 80.00 84.07 71.29 78.45 69.68 77.41 56.70 87.89 89.69 83.59 84.85 86.36 77.89 79.65 84.37 72.65 68.59 78.92 55.53 Unstable residue pairs (%) 20 15.93 28.71 21.55 30.32 22.59 43.3 12.11 10.31 16.41 15.15 13.64 22.11 20.35 15.63 27.35 31.41 21.08 44.47 MSA accuracy stdev 3.29 4.31 5.13 4.24 4.04 5.12 6.13 2.65 3.63 4.29 2.93 4.53 4.88 3.26 3.96 5.15 4.09 4.49 5.61 Tree stability (%) 67.93 64.18 53.84 61.98 67.49 68.65 47.04 68.29 60.41 59.56 72.22 67.17 60.80 68.58 64.03 54.61 58.05 58.3 38.30 Fraction of unstable branches (%) 32.07 35.82 46.16 38.02 32.51 31.35 52.96 31.71 39.59 40.44 27.78 32.83 39.2 31.42 35.97 45.39 41.95 41.7 61.7 Deep branches of low MSA and high local support (%) 43.92 33.68 64.27 47.29 44.67 29.00 72.17 43.29 37.63 57.61 36.71 31.54 54.65 43.24 31.57 63.88 59.85 41.55 84.66 Shallow branches of low MSA and high local support (%) 11.29 5.61 28.34 15.08 12.27 3.58 40.62 10.45 7.31 17.97 6.34 4.21 16.39 9.52 4.64 26.72 24.35 10.05 55.49 Sequence identity indicates the average sequence identity as estimated of the Clustal-O MSAs. Under 30% indicates the number of datasets having a percent identity lower than 30% and Over 30 those having a percent identity higher than 30%. On the Aligner line, C indicates Clustal-O, P PASTA, and M MAFFT and Avg indicates the average readouts measured on the three aligners. # datasets indicates the number of datasets corresponding to each category. MSA stability indicates the average sum of pairs similarity between shuffled MSAs. Unstable residue pairs indicates the average fraction of residue pairs whose alignment changes across replicates. MSA Accuracy Stdev is the variation in HomFam accuracy across MSA replicates. Tree stability is the average RF similarity normalized by the number of branches. Fraction of unstable branches is the average fraction of branches that differ across the trees estimated from the shuffled MSA replicates. Deep branches of low MSA and high local support indicates the fraction of branches having more than 10 children that occur in less than 20% of the replicate trees but have a local support higher than 80%. Shallow branches of low MSA and high local support (%). Shuffling Effect on Small Datasets Containing 50 Amino-Acid Sequences or Less The reason why the shuffling effect had so far been mostly ignored probably owes much to its limited impact on the small datasets routinely analyzed in biology. Indeed when using Clustal-O to measure the effect of shuffling input order on the 1683 Prefab datasets we merely found 131 datasets to be unstable with an average SoP similarity of 95.99 % across 100 replicates (Table 3). This confirms that instability is not a major issue on datasets smaller than approximately 100 sequences. We used the 131 unstable Prefab datasets to further explore the relationships between shuffling, HoT, and Guidance, since these two last procedures are computationally too intensive to be applied onto HOMFAM. In order to quantify the relationship between HoT and Shuffle, we reimplemented the HoT protocol and made it possible to combine the sequence inversion with a shuffle step (cf Methods), resulting in a lower stability when sampled with HoT$$+$$Shuffle than when sampled with HoT only. Overall, datasets sampled with the HoT/Shuffle combination are 4% point less stable than when using HoT only (80.44 for HoT and 76.45 for HoT$$+$$Shuffle). This important observation confirms that the shuffle algorithm explores an alternative alignment space at least partly different from the one defined by HoT. It is worth noting that HoT inverts simultaneously all the ties associated with a given set of sequence. This explains why the HoT replicates are on average so much more different from one another than their shuffled counterparts (Table 2). Indeed, shuffling provides a much smoother exploration of the true space of inversions, with each replicate sampling point being anywhere between the original sequence order and a full inversion of all the sequences as provided by HoT. The shuffling algorithm is therefore a powerful way of estimating the average numerical instability of a dataset while the HoT algorithm provides heuristic access to some of the most extreme points of this sampling space. This makes the two strategies complementary. We used a similar approach to explore the effect of a combination between GUIDANCE and shuffle. To do so, we used the GUIDANCE implementation that does not combine guide tree re-estimation with HoT realignments. We measured the shuffling effect by generating 10 GUIDANCE replicates without sequence input order shuffling and 10 replicates with input order shuffling. Results indicate that the addition of a shuffling stage slightly decreases MSA stability, as measured by a SoP comparison between replicates (69.92 for Guidance, 68.99 for Guidance$$+$$Shuffle). Overall 89 out 131 datasets were found less stable when adding the shuffle step. This analysis confirms that the shuffling procedure differs from both HoT or GUIDANCE and thus provides an alternative way of sampling the MSA instability space when doing a progressive alignment. The additive effects of GUIDANCE, Shuffle and HoT also suggest that these methods can be combined together. We finally took advantage of this small dataset to estimate the effect of the aligner accuracy on the shuffling instability. On these small alignments we found no correlation between accuracy and instability. Table 3. Comparative readouts for alignment and phylogenetic instability on small unstable datasets Measure Mafft-linsi (slow) Mafft-fftns (Fast) Clustal W Clustal O Average Accuracy 0.69 0.63 0.56 0.64 0.63 MSA Stability 0.98 0.96 0.96 0.91 0.95 Tree Stability 0.97 0.92 0.93 0.88 0.92 Unistrap 0.66 0.67 0.65 0.65 0.66 Bootstrap 0.66 0.67 0.65 0.66 0.66 HoT — — 80.44 — Hot$$+$$shuffle — — 76.45 (91/131) — Guidance — — 69.92 — Guidance$$+$$shuffle — — 68.99 (89/131) — Measure Mafft-linsi (slow) Mafft-fftns (Fast) Clustal W Clustal O Average Accuracy 0.69 0.63 0.56 0.64 0.63 MSA Stability 0.98 0.96 0.96 0.91 0.95 Tree Stability 0.97 0.92 0.93 0.88 0.92 Unistrap 0.66 0.67 0.65 0.65 0.66 Bootstrap 0.66 0.67 0.65 0.66 0.66 HoT — — 80.44 — Hot$$+$$shuffle — — 76.45 (91/131) — Guidance — — 69.92 — Guidance$$+$$shuffle — — 68.99 (89/131) — Display of various instability metrics collected on 131 unstable Prefab datasets. Accuracy: sums of pairs Accuracy as provided by Prefab Qscore. MSA stability: average sum-of-pairs similarity across 100 MSA replicates. Tree stability: average RF similarity across trees, normalized by the number of branches. Shootsrap: average Shootsrap branch support. Bootstrap: average bootstrap support. HoT average sum-of pair similarity between the two most different HoT replicates drawn on the deepest branch of a ClustalW guide tree. Hot$$+$$shuffle: average sums of pairs of the HoT replicate when adding a shuffle step. The numbers in parenthesis indicate how many times the Hot$$+$$Shuffle replicates were less stable than their HoT counterpart). Guidance: average Guidance S-o-P similarity across Guidance I replicates. Guidance$$+$$Shuffle. Average S-o-P replicate similarity when combining GUIDANCE I with a shuffle step. The numbers in parenthesis indicate how many times the Hot$$+$$Shuffle replicates were less stable than their HoT counterpart). Table 3. Comparative readouts for alignment and phylogenetic instability on small unstable datasets Measure Mafft-linsi (slow) Mafft-fftns (Fast) Clustal W Clustal O Average Accuracy 0.69 0.63 0.56 0.64 0.63 MSA Stability 0.98 0.96 0.96 0.91 0.95 Tree Stability 0.97 0.92 0.93 0.88 0.92 Unistrap 0.66 0.67 0.65 0.65 0.66 Bootstrap 0.66 0.67 0.65 0.66 0.66 HoT — — 80.44 — Hot$$+$$shuffle — — 76.45 (91/131) — Guidance — — 69.92 — Guidance$$+$$shuffle — — 68.99 (89/131) — Measure Mafft-linsi (slow) Mafft-fftns (Fast) Clustal W Clustal O Average Accuracy 0.69 0.63 0.56 0.64 0.63 MSA Stability 0.98 0.96 0.96 0.91 0.95 Tree Stability 0.97 0.92 0.93 0.88 0.92 Unistrap 0.66 0.67 0.65 0.65 0.66 Bootstrap 0.66 0.67 0.65 0.66 0.66 HoT — — 80.44 — Hot$$+$$shuffle — — 76.45 (91/131) — Guidance — — 69.92 — Guidance$$+$$shuffle — — 68.99 (89/131) — Display of various instability metrics collected on 131 unstable Prefab datasets. Accuracy: sums of pairs Accuracy as provided by Prefab Qscore. MSA stability: average sum-of-pairs similarity across 100 MSA replicates. Tree stability: average RF similarity across trees, normalized by the number of branches. Shootsrap: average Shootsrap branch support. Bootstrap: average bootstrap support. HoT average sum-of pair similarity between the two most different HoT replicates drawn on the deepest branch of a ClustalW guide tree. Hot$$+$$shuffle: average sums of pairs of the HoT replicate when adding a shuffle step. The numbers in parenthesis indicate how many times the Hot$$+$$Shuffle replicates were less stable than their HoT counterpart). Guidance: average Guidance S-o-P similarity across Guidance I replicates. Guidance$$+$$Shuffle. Average S-o-P replicate similarity when combining GUIDANCE I with a shuffle step. The numbers in parenthesis indicate how many times the Hot$$+$$Shuffle replicates were less stable than their HoT counterpart). Structural Accuracy Instability of Large Dataset When Shuffling Sequences Input Order Altogether this analysis suggests a very significant effect of sequences input order shuffling on datasets larger than 50 sequences and a more moderate effect on smaller datasets like the one available in Prefab. We therefore focused the rest of our work on large and very large datasets and their associated aligners. We first asked if the alignment numerical instability measured on HOMFAM came along with a similar impact on structural accuracy. This question is relevant because regions used to assess this type of accuracy are structurally homologous and expected to result in a stable alignment. This analysis was therefore useful to insure that numerical instability is not concentrated in structurally ambiguous portions of the alignments. We measured this effect by comparing the fate of embedded sequences with known 3D structures against their HOMSTRAD reference alignments. When doing so, accuracy is defined as the fraction of residue pairs aligned identically in the target and the structural reference alignment. Our analysis revealed important variations in accuracy across replicates (Fig. 1b, Supplementary Material Online, Figs. S1b and S2b available on Dryad), especially for datasets with $$<$$ 30% identity (Table 2). Shuffling Effect onto Phylogenetic Tree Branch Stability With phylogenetic tree reconstruction being one of the most common downstream analysis based on MSA, we decided to assess the impact of shuffling on tree estimations. For this purpose, we used the deterministic mode of FastTree to estimate ML phylogenetic trees derived from the shuffled MSA replicates. The average Robinson and Foulds (RF) congruences measured on these trees were found to be as low as 68%, 64%, and 54% for Clustal-O, PASTA, and MAFFT respectively (Fig. 1c, Table 2, Supplementary Material Online, Figs. S1c and S2c available on Dryad). Overall, an average of 38.05 % of the branches are unstable across replicates. These topological variations are associated with significant ML score fluctuations (see Supplementary Material Online, Fig. S3 available on Dryad), with the largest trees being more predominantly affected. Variation among replicate trees can be used to measure a branch shuffle support index—defined as the fraction of trees, estimated from the MSA shuffled replicates, in which a specific branch occurs. We compared this shuffle support index with the local measure (local branch support) inferred by FastTree (Price et al. 2009) as initially proposed in PhyML (Anisimova and Gascuel 2006) to quantify the nearest-neighbor interchange effect. We found a weak Spearman correlation for both deep (over 10 children) and shallow branches, which drops even further as the number of sequences increases (Fig. 2a, Supplementary Material Online, Figs. S4a–S6a available on Dryad). This low correlation reflects the existence of many branches being highly supported by the local measure, albeit missing from the majority of shuffled replicates. For instance, when considering Clustal-O trees, 42% of the deep branches have a local support higher than 80%, yet among these highly supported branches, 44% occur in less than 20% of the shuffled replicate trees (Table 2, Supplementary Material Online, Figs. S4b–S6b available on Dryad). This important observation indicates that the branch local reliability measure does not reflect MSA robustness. Figure 2. View largeDownload slide Shuffle and bootstrap branch supports are correlated. (a) Local branch support and shuffle branch support measured across Clustal-O datasets from HOMFAM have a low Spearman correlation that decreases as the number of sequences increases, for both deep (more than 10 children) and shallow branches. Similar comparisons made on a large HOMFAM dataset (p450, 21,013 sequences) reveal a low correlation between bootstrap and local support values (rs$$=$$ 0.27) when compiling bootstrap and local branch support on the 100 shuffled replicate (b), but a relatively high correlation between standard bootstrap branch support and shuffle support (rs$$=$$ 0.69) (c). Figure 2. View largeDownload slide Shuffle and bootstrap branch supports are correlated. (a) Local branch support and shuffle branch support measured across Clustal-O datasets from HOMFAM have a low Spearman correlation that decreases as the number of sequences increases, for both deep (more than 10 children) and shallow branches. Similar comparisons made on a large HOMFAM dataset (p450, 21,013 sequences) reveal a low correlation between bootstrap and local support values (rs$$=$$ 0.27) when compiling bootstrap and local branch support on the 100 shuffled replicate (b), but a relatively high correlation between standard bootstrap branch support and shuffle support (rs$$=$$ 0.69) (c). We also compared branch shuffle support with the standard Felsenstein’s bootstrap support method (Felsenstein 1985), which generates replicate trees by re-sampling MSA columns with replacement and uses them to estimate the support of each branch in the reference tree. We restricted this analysis to three datasets, labeled as small, medium and large (1427,7517, and 21 013 taxa). On this subset, we could carry out an exhaustive comparison analysis between shuffle, bootstrap, and local branch supports using Clustal-O. Clustal-O was chosen because its trees are the most stable with respect to shuffling (Table 2). Our analysis do not show any clear correlation between local branch reliability and bootstrap support for either the large (Fig. 2b), the medium or the small datasets (see Supplementary Material Online, Figs. S7 available on Dryad). In contrast, for these same datasets we observed a relatively high correlation between shuffle and bootstrap branch support values (Fig. 2c, $$r_{\mathrm{s}} =$$ 0.69). Effect of Guide Tree Node Rotation onto Phylogenetic Tree Topological Stability We speculated that the major cause of MSA instability was the unordered nature of child nodes in the binary guide tree. Indeed, while the NW algorithm used to do pairwise alignments at every node is guaranteed to return an optimal scoring alignment of the two considered child nodes, this algorithm may deliver different—but equally optimal—pairwise alignments when inverting sequences input order. The reason is that while NW is guaranteed to deliver an optimal scoring pairwise alignment, it has no objective way of handling ties occurring within the score matrix. Depending on how these ties are resolved NW can therefore deliver alternative optimal alignments. This effect is often referred to as low road/high road because of the visual effect of alternative tie breaking on the alignment path. In most implementations, the order under which ties are broken is arbitrary and therefore input order dependent (e.g. substitution first, then gap in first sequence, then gap in second). Progressive MSA assembly is especially sensitive to this phenomenon because partial MSAs associated with sister nodes are assembled independently. As a consequence, the breaking of their ties cannot be synchronized. This leads to the possibility of partial MSAs whose combination may not be compatible with a global optimal MSA. It is worth noting that node input order variations can happen even without changing the guide tree topology. Our toy example (Fig. 3a) illustrates how a standard progressive aligner like T-Coffee (in the high-low mode with consistency turned off, cf Methods) can be led to produce very different MSAs while using guide trees with the same topologies, and by simply rotating child nodes. We systematically quantified this effect by generating for each HOMFAM dataset, 100 randomly rotated but topologically identical Clustal-O guide trees and we used them to generate as many Clustal-O MSAs. The effect of rotation on MSA stability was then estimated by measuring the average S-o-P similarity (as previously done) between the 100 replicate MSAs. The resulting values show that node rotation is capable of explaining the largest fraction of the shuffling effect (Fig. 3b, ratio of explained variance 0.68, $$r =$$ 0.85). We also used an all-against-all RF comparison to measure guide tree stability and found them to be on average 45.1% topologically similar, a variation sufficiently large to account for the remaining fraction of MSA instability (Fig. 3c). This finding confirms the observation by Boyce et al. (2015) of guide tree instability contributing to MSA instability, but it also indicates that the effect reported by Boyce et al. (2015), and attributed by the authors to frequent ties in the k-tup based distance matrices, is not the main cause of MSA instability. Figure 3. View largeDownload slide The majority of instability is explained by unordered nodes in the guide tree. a) Similar guide trees with different node ordering can result in different MSAs when fed to the highlow mode of T-Coffee multiple aligner (see Supplementary Material online, Methods available on Dryad). b) We used this property to recapitulate the shuffling input effect on each dataset, by feeding Clustal-O with a 100 topologically identical guide trees with randomly rotated nodes. When done on all the entire HOMFAM this procedure accounts for the majority of the shuffling effect (explained variance ratio $$=$$ 0.68, rs$$=$$ 0.85). c) The remaining variation can be attributed to the effect of sequence input shuffling on guide tree topologies, as shown by comparing guide tree stability to MSA stability (explained variance ratio $$=$$ 0.28, rs$$=$$ 0.21). Figure 3. View largeDownload slide The majority of instability is explained by unordered nodes in the guide tree. a) Similar guide trees with different node ordering can result in different MSAs when fed to the highlow mode of T-Coffee multiple aligner (see Supplementary Material online, Methods available on Dryad). b) We used this property to recapitulate the shuffling input effect on each dataset, by feeding Clustal-O with a 100 topologically identical guide trees with randomly rotated nodes. When done on all the entire HOMFAM this procedure accounts for the majority of the shuffling effect (explained variance ratio $$=$$ 0.68, rs$$=$$ 0.85). c) The remaining variation can be attributed to the effect of sequence input shuffling on guide tree topologies, as shown by comparing guide tree stability to MSA stability (explained variance ratio $$=$$ 0.28, rs$$=$$ 0.21). Branch Support Estimation Using the Unistrap Procedure No simple solution exists to address the problem of unordered child nodes (i.e. in binary guide tree the left and the right children are arbitrarily assigned and interchangeable), as this would require the existence of a linear sorting procedure capable of unambiguously recapitulating an entire binary tree. An alternative to regular guide trees is the use of simple chained guide trees (i.e. trees where each split has a leaf child), but recent suggestions in this direction (Boyce et al. 2014) have been criticized for their limited capacity of producing biologically realistic MSAs (Boyce et al. 2014; Tan et al. 2015). Summarizing the combined effect of site sampling and unordered child leaves is also a complicated problem because the intermediate level of correlation between the shuffle and bootstrap supports prevents a straightforward agglomeration of the two indexes, as they are neither independent enough to be combined nor correlated enough to be treated as one another’s proxy. The impossibility to disentangle these two measurements requires their simultaneous estimation. For this purpose, we designed the “unistrap” procedure whose name reflects the capacity to provide a unified estimation of the combined effect of site sampling and MSA instability on branch support. Our unistrap implementation starts by generating shuffled MSA replicates. Given each such shuffle replicate, columns are then drawn with re-sampling to generate one or more bootstrap replicates. The resulting collection of bootstrap MSA replicates are then used to estimate ML replicate trees. The rest of the procedure is a standard Felsenstein bootstrap measure in which replicate trees can either be combined into a consensus or be used to estimate branch supports of a reference tree. Comparison Between Unistrap and Felsenstein Bootstrap Support Indexes In this study, we used the central tree (RF comparison) drawn from the shuffle replicates as a reference and we found unistrap branch support to be correlated with both the shuffle and bootstrap supports (Fig. 4, Supplementary Material Online, Fig. S8 available on Dryad), albeit yielding support values on average lower (Table 4). This effect is especially relevant for the branches that are very well supported by either the shuffle or the bootstrap measure, but poorly supported by unistrap. For instance, in the largest dataset that comprises 21 013 taxa, 285 internal branches (out of the total 20 582 with a non-negative local FastTree support) have 80% or more bootstrap support despite their unistrap support being lower than 50% (Table 4). Without the unistrap index these branches would be considered reliable and could end up being used to draw unsupported conclusions. It is noteworthy that on small datasets, for which the shuffling effect is often limited, unistrap and bootstrap supports are more consistent with one another, thus confirming the expectation that our new procedure does not interfere with the standard bootstrap support analysis but rather complements it. An important property of unistrap is to behave exactly like standard bootstrap when the MSAs (or the aligners) are stable. This makes unistrap a very general approach compatible with any aligner or dataset type, regardless of the kind of parameters affecting their numerical robustness. Figure 4. View largeDownload slide Unistrap simultaneously quantifies MSA instability and site sampling. For each of the three considered datasets, 100-shuffle replicate MSAs were generated and used to estimate an equal number of ML trees. The central tree (i.e. the one with the highest average RF similarity with the rest of the set) was then selected as a reference. In parallel, 100 standard bootstrap as well as a 100 unistrap replicate trees were computed, all of which were used to estimate their respective branch support to the central tree. On the mmp dataset, unistrap support was then compared with the shuffle (a, rs$$=$$ 0.87) and the bootstrap (b, rs$$=$$ 0.84) supports accordingly. Similar comparisons were made on both the subt (c, rs$$=$$ 0.89; d, rs$$=$$ 0.95) and the p450 (e, rs$$=$$ 0.95; f, rs$$=$$ 0.90) datasets. These high correlations however are accompanied with unistrap values being significantly lower than their bootstrap or shuffle counterparts (Table 3), especially on the largest dataset for which the average unistrap support is 36.9% compared with the 44% for shuffle support and 43.3% for bootstrap support. Figure 4. View largeDownload slide Unistrap simultaneously quantifies MSA instability and site sampling. For each of the three considered datasets, 100-shuffle replicate MSAs were generated and used to estimate an equal number of ML trees. The central tree (i.e. the one with the highest average RF similarity with the rest of the set) was then selected as a reference. In parallel, 100 standard bootstrap as well as a 100 unistrap replicate trees were computed, all of which were used to estimate their respective branch support to the central tree. On the mmp dataset, unistrap support was then compared with the shuffle (a, rs$$=$$ 0.87) and the bootstrap (b, rs$$=$$ 0.84) supports accordingly. Similar comparisons were made on both the subt (c, rs$$=$$ 0.89; d, rs$$=$$ 0.95) and the p450 (e, rs$$=$$ 0.95; f, rs$$=$$ 0.90) datasets. These high correlations however are accompanied with unistrap values being significantly lower than their bootstrap or shuffle counterparts (Table 3), especially on the largest dataset for which the average unistrap support is 36.9% compared with the 44% for shuffle support and 43.3% for bootstrap support. Table 4. Comparison between bootstrap, shuffle, and unistrap supports SH BS SHO SH $$>$$ 80 SH $$<$$ 50 BS $$>$$ 80 BS $$>$$ 80 Dataset #Taxa #Branches support (%) support (%) support (%) SHO $$<$$ 50 (%) SHO $$>$$ 80 (%) SHO $$<$$ 50 (%) SHO $$<$$ 50 (%) mmp 1427 1283 63.2 52.9 48.3/48.3 6.00/6.00 0.08/0.08 0.08/0.08 0.15/0.15 subt 7517 7026 59.2 54.3 47.7/48.1 3.82/3.49 0.00/0.00 1.52/1.45 0.05/0.06 p450 21 013 20 482 44.0 43.3 36.9/37.0 1.66/1.64 0.00/0.00 1.39/1.35 0.06/0.07 SH BS SHO SH $$>$$ 80 SH $$<$$ 50 BS $$>$$ 80 BS $$>$$ 80 Dataset #Taxa #Branches support (%) support (%) support (%) SHO $$<$$ 50 (%) SHO $$>$$ 80 (%) SHO $$<$$ 50 (%) SHO $$<$$ 50 (%) mmp 1427 1283 63.2 52.9 48.3/48.3 6.00/6.00 0.08/0.08 0.08/0.08 0.15/0.15 subt 7517 7026 59.2 54.3 47.7/48.1 3.82/3.49 0.00/0.00 1.52/1.45 0.05/0.06 p450 21 013 20 482 44.0 43.3 36.9/37.0 1.66/1.64 0.00/0.00 1.39/1.35 0.06/0.07 #Taxa indicates the number of taxons in the considered datasets. #Branches indicates the number of internal branches with a non-negative FastTree local support. Shuffle support indicates the average shuffle branch support measured on the central ML tree (i.e. the one having the highest average RF topological similarity with the rest of the replicates) selected as a reference. Bootstrap support indicates the average bootstrap support measured on the central ML tree selected as a reference, Unistrap support indicates the average unistrap support measured on the central ML tree selected as a reference. Shuffle $$>$$ 80 Unistrap $$<$$ 50 is the fraction of branches whose shuffle support is higher than 80% but whose unistrap support is lower than 50. Such branches could easily be taken as false positives if only considering the Shuffle support. The rest of the columns indicate comparable values and confirm that the unistrap index is more conservative than both the shuffle and the bootstrap indexes. In order to estimate its numerical reproducibility, the bootstrap column sampling step of unistrap was carried out twice. Both readouts are indicated in the corresponding table entry (readout 1/readout 2). Table 4. Comparison between bootstrap, shuffle, and unistrap supports SH BS SHO SH $$>$$ 80 SH $$<$$ 50 BS $$>$$ 80 BS $$>$$ 80 Dataset #Taxa #Branches support (%) support (%) support (%) SHO $$<$$ 50 (%) SHO $$>$$ 80 (%) SHO $$<$$ 50 (%) SHO $$<$$ 50 (%) mmp 1427 1283 63.2 52.9 48.3/48.3 6.00/6.00 0.08/0.08 0.08/0.08 0.15/0.15 subt 7517 7026 59.2 54.3 47.7/48.1 3.82/3.49 0.00/0.00 1.52/1.45 0.05/0.06 p450 21 013 20 482 44.0 43.3 36.9/37.0 1.66/1.64 0.00/0.00 1.39/1.35 0.06/0.07 SH BS SHO SH $$>$$ 80 SH $$<$$ 50 BS $$>$$ 80 BS $$>$$ 80 Dataset #Taxa #Branches support (%) support (%) support (%) SHO $$<$$ 50 (%) SHO $$>$$ 80 (%) SHO $$<$$ 50 (%) SHO $$<$$ 50 (%) mmp 1427 1283 63.2 52.9 48.3/48.3 6.00/6.00 0.08/0.08 0.08/0.08 0.15/0.15 subt 7517 7026 59.2 54.3 47.7/48.1 3.82/3.49 0.00/0.00 1.52/1.45 0.05/0.06 p450 21 013 20 482 44.0 43.3 36.9/37.0 1.66/1.64 0.00/0.00 1.39/1.35 0.06/0.07 #Taxa indicates the number of taxons in the considered datasets. #Branches indicates the number of internal branches with a non-negative FastTree local support. Shuffle support indicates the average shuffle branch support measured on the central ML tree (i.e. the one having the highest average RF topological similarity with the rest of the replicates) selected as a reference. Bootstrap support indicates the average bootstrap support measured on the central ML tree selected as a reference, Unistrap support indicates the average unistrap support measured on the central ML tree selected as a reference. Shuffle $$>$$ 80 Unistrap $$<$$ 50 is the fraction of branches whose shuffle support is higher than 80% but whose unistrap support is lower than 50. Such branches could easily be taken as false positives if only considering the Shuffle support. The rest of the columns indicate comparable values and confirm that the unistrap index is more conservative than both the shuffle and the bootstrap indexes. In order to estimate its numerical reproducibility, the bootstrap column sampling step of unistrap was carried out twice. Both readouts are indicated in the corresponding table entry (readout 1/readout 2). Conclusion MSAs constitute one of the most widely used modeling tools in evolutionary biology. We show that the methods used to compute large-scale models incorporating over 1500 sequences are numerically unstable when estimated using a standard progressive alignment framework. This instability results in equally unstable phylogenetic trees. We have quantified this effect and are proposing unistrap, a novel bootstrap method that simultaneously incorporates into branch support the effects of site sampling and the alignment instability. As such, the unistrap index makes it possible to objectively estimate the support of raw data for a given phylogenetic tree. On its own, unistrap does not provide a guarantee of improved biological correctness, but it offers an objective criterion to estimate the effects of MSA instability on a given phylogeny. It must be stressed that the unistrap protocol presented here is tightly dependent on the progressive alignment framework and its sensitivity on input shuffling. Yet, in general, unistrap could easily be adapted to any procedure involving a numerically instable aligner. On the downside, one should also be aware that unistrap could easily be deceived by any arbitrary procedure used to stabilize the algorithm (i.e. internal sequence sorting or anything similar). Such limitations are not specific to unistrap. Other sampling methods such as HoT can also be deceived by low-level algorithmic decisions. This is one reason why the packages in which these methods have been implemented only support carefully selected aligners. The fact that deceiving practices can result in overestimating the robustness of a model is not per se a problem, but it is clearly something one should be aware of. In order to avoid such issues, algorithms should be as explicitly described as possible and code kept open. In general, analysis like the ones, we present here should encourage the community to treat limited robustness as an opportunity for the measurement of model supports and reliability. In order to avoid such confusion, we recommend to either generate the original MSA with the supported aligners or to apply the default unistrap support procedure (i.e. ClustalO $$+$$ FastTree replicates) onto any user generated phylogenetic tree. Such an approach brings the benefits of a normalized unistrap procedure comparable across experiments. This framework will also make it possible to incorporate and compare the stability of alternative tree reconstruction methods, as well as alternative rooting approaches. The unistrap algorithm could easily be extended further. For instance, the tiebreak effect on which our analysis relies is merely a special case of a more general situation in which dynamic programming provides an efficient way to navigate an optimization strategy among all potential suboptimal alignments. Efficient numerical procedures exist that allow objective comparisons between the optimal score and close sub-optimals, as in Probcons (Roshan 2014) for instance. Such implementation are, however, more demanding than a mere shuffling of sequence input order and they would require a systematic re-engineering of the methods benchmarked in the current analysis. Furthermore, dynamic programming is not the only source of instability and other aligners like UPP (Nguyen et al. 2015) or ClustalW (Thompson et al. 1994) can also be run in an unstable way because of the procedure used to seed the guide tree estimate. In general unistrap is probably a suitable strategy in any situation where alternative MSAs can be generated to estimate a phylogeny. This makes it a useful addition to the arsenal of methods recently developed to capture alignment instability, especially the HoT and the GUIDANCE protocols. Future analysis will also involve closer inspection of the parameters that dominate instability. We show here that MSA identity is a good predictor of instability, but the algorithmic source of the progressive alignment based instability suggests that gap frequencies in the original MSA or the number of OTUs and their evolutionary distances may also be very good predictors. It may it also be worth exploring the effect of multi-loci datasets so as to determine to which extent the modeling of epistatic relationships may affect stability. In that respect the most mature framework would certainly be RNA analysis and the comparison of methods aligning primary sequences with those able to take into account co-evolution through compensated mutations. So far, unistrap has only been validated on protein sequences, but we expect the instability on which it relies to be higher in RNA and DNA sequences whose lower complexity alphabets results in a larger number of alternative optimal alignments. We intend to explore unistrap behavior on nucleic acids as an immediate follow up of the work presented here. Yet, even when applied onto protein alignments, unistrap’s high sensitivity in detecting poorly supported branches gives it the potential to improve many downstream analyses, including attempts to predict the effect of genome variations on molecular functions (Katsonis and Lichtarge 2014). Availability The unistrap software is implemented as a public open source freeware package, available along with its documentation from https://github.com/cbcrg/unistrap. This package is a pipeline implemented in Nextflow (Di Tommaso et al. 2017). Supplementary Material Data available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.bk203. Author Contributions C.N. contributed ideas, some analyses and helped writing the manuscript. O.G. contributed ideas and helped writing the manuscript. M.C. contributed ideas, carried out most of the analyses and helped writing the manuscript. E.F. contributed ideas, carried out analyses and helped writing the manuscript. P.D.T helped with the implementation of the Nextflow unistrap pipeline. Acknowledgements We wish to thank Roderic Guigo and Tony Ferrar for revision and helpful comments. Funding We acknowledge support of the Spanish Ministry of Economy and Competitiveness, ‘Centro de Excelencia Severo Ochoa 2013–2017. We acknowledge the support of the CERCA Programme/Generalitat de Catalunya. References Boyce K., Sievers F., Higgins D.G. 2014 . Simple chained guide trees give high-quality protein multiple sequence alignments. Proc. Natl. Acad. Sci. U.S.A 111 : 10556 – 10561 . Google Scholar CrossRef Search ADS PubMed Boyce K., Sievers F., Higgins D.G. 2015 . Instability in progressive multiple sequence alignment algorithms. Algorithms Mol. Biol. 10 : 26 . Google Scholar CrossRef Search ADS PubMed Capella-Gutiérrez S., Silla-Martínez J.M., Gabaldón T. 2009 . trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25 : 1972 – 1973 . Google Scholar CrossRef Search ADS PubMed Castresana J. 2000 . Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. 17 : 540 – 552 . Google Scholar CrossRef Search ADS PubMed Chang J.-M., Di Tommaso P., Notredame C. 2014 . TCS: a new multiple sequence alignment reliability measure to estimate alignment accuracy and improve phylogenetic tree reconstruction. Mol. Biol. Evol. 31 : 1625 – 1637 . Google Scholar CrossRef Search ADS PubMed Chatzou M., Magis C., Chang J.-M., Kemena C., Bussotti G., Erb I., Notredame C. 2016 . Multiple sequence alignment modeling: methods and applications. Brief. Bioinform. 17 : 1009 – 1023 . Google Scholar CrossRef Search ADS PubMed Di Tommaso P., Chatzou M., Floden E.W., Barja P.P., Palumbo E., Notredame C. 2017 . Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35 : 316 – 319 . Google Scholar CrossRef Search ADS PubMed Edgar R.C. 2004 . MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32 : 1792 – 1797 . Google Scholar CrossRef Search ADS PubMed Guindon S., Dufayard J.-F., Anisimova M., Hordijk W., Lefort V., Gascuel O. 2010 . “New Algorithms and Methods to Estimate Maximum-Likelihood Phylogenies: Assessing the Performance of PhyML 3.0”. Syst Biol. 59 : 307 – 321 . Google Scholar CrossRef Search ADS PubMed Hogeweg P., Hesper B. 1984 . The alignment of sets of sequences and the construction of phyletic trees: An integrated method. J. Mol. Evol. 20 : 175 – 186 . Google Scholar CrossRef Search ADS PubMed Jetz W., Thomas G.H., Joy J.B., Hartmann K., Mooers A.O. 2012 . The global diversity of birds in space and time. Nature 491 : 444 – 448 . Google Scholar CrossRef Search ADS PubMed Katoh K., Misawa K., Kuma K.-I., Miyata T. 2002 . MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30 : 3059 – 3066 . Google Scholar CrossRef Search ADS PubMed Katsonis P., Lichtarge O. 2014 . A formal perturbation equation between genotype and phenotype determines the Evolutionary Action of protein-coding variations on fitness. Genome Res. 24 : 2050 – 2058 . Google Scholar CrossRef Search ADS PubMed Krypotou E., Evangelidis T., Bobonis J., Pittis A.A., Gabaldón T., Scazzocchio C., Mikros E., Diallinas G. 2015 . Origin, diversification and substrate specificity in the family of NCS1/FUR transporters. Mol. Microbiol. 96 : 927 – 950 . Google Scholar CrossRef Search ADS PubMed Lake J.A. 1991 . The order of sequence alignment can bias the selection of tree topology. Mol. Biol. Evol. 8 : 378 – 385 . Google Scholar PubMed Landan G., Graur D. 2007 . Heads or tails: a simple reliability check for multiple sequence alignments. Mol. Biol. Evol. 24 : 1380 – 1383 . Google Scholar CrossRef Search ADS PubMed Liu K., Raghavan S., Nelesen S., Linder C.R., Warnow T. 2009 . Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science 324 : 1561 – 1564 . Google Scholar CrossRef Search ADS PubMed Mirarab S., Nguyen N., Guo S., Wang L.-S., Kim J., Warnow T. 2015 . PASTA: Ultra-large multiple sequence alignment for nucleotide and amino-acid sequences. J. Comput. Biol. 22 : 377 – 386 . Google Scholar CrossRef Search ADS PubMed Needleman S.B., Wunsch C.D. 1970 . A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48 : 443 – 453 . Google Scholar CrossRef Search ADS PubMed Nguyen N.-P.D., Mirarab S., Kumar K., Warnow T. 2015 . Ultra-large alignments using phylogeny-aware profiles. Genome Biol. 16 : 124 . Google Scholar CrossRef Search ADS PubMed Notredame C., Higgins D.G., Heringa J. 2000 . T-coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302 : 205 – 217 . Google Scholar CrossRef Search ADS PubMed Notredame C., Holm L., Higgins D.G. 1998 . COFFEE: an objective function for multiple sequence alignments. Bioinformatics 14 : 407 – 422 . Google Scholar CrossRef Search ADS PubMed Penn O., Privman E., Landan G., Graur D., Pupko T. 2010 . An alignment confidence score capturing robustness to guide tree uncertainty. Mol. Biol. Evol. 27 : 1759 – 1767 . Google Scholar CrossRef Search ADS PubMed Pittis A.A., Gabaldón T. 2016 . Late acquisition of mitochondria by a host with chimaeric prokaryotic ancestry. Nature 531 : 101 – 104 . Google Scholar CrossRef Search ADS PubMed Price M.N., Dehal P.S., Arkin A.P. 2009 . FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol. Biol. Evol. 26 : 1641 – 1650 . Google Scholar CrossRef Search ADS PubMed Redelings B.D., Suchard M.A. 2007 . Incorporating indel information into phylogeny estimation for rapidly emerging pathogens. BMC Evol. Biol. 7 : 40 . Google Scholar CrossRef Search ADS PubMed Redelings B.D., Suchard M.A. 2008 . Robust inferences from ambiguous alignments. In: Rosenberg M.S., editor. Sequence alignment: methods, models, concepts and strategies . Oakland, CA : University of California Press . p. 209 – 270 . Google Scholar CrossRef Search ADS Roshan U. 2014 . Multiple sequence alignment using Probcons and Probalign. Methods Mol. Biol. 1079 : 147 – 153 . Google Scholar CrossRef Search ADS PubMed Sela I., Ashkenazy H., Katoh K., Pupko T. 2015 . GUIDANCE2: accurate detection of unreliable alignment regions accounting for the uncertainty of multiple parameters. Nucleic Acids Res. 43 : W7 – 14 . Google Scholar CrossRef Search ADS PubMed Sievers F., Wilm A., Dineen D., Gibson T.J., Karplus K., Li W., Lopez R., McWilliam H., Remmert M., Söding J., Thompson J.D., Higgins D.G. 2011 . Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7 : 539 . Google Scholar CrossRef Search ADS PubMed Takezaki N. 1998 . Tie trees generated by distance methods of phylogenetic reconstruction. Mol. Biol. Evol. 15 : 727 – 737 . Google Scholar CrossRef Search ADS PubMed Tan G., Gil M., Löytynoja A.P., Goldman N., Dessimoz C. 2015 . Simple chained guide trees give poorer multiple sequence alignments than inferred trees in simulation and phylogenetic benchmarks. Proc. Natl. Acad. Sci. U.S.A. 112 : E99 – 100 . Google Scholar CrossRef Search ADS PubMed Thompson J.D., Higgins D.G., Gibson T.J. 1994 . CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22 : 4673 – 4680 . Google Scholar CrossRef Search ADS PubMed Vandewege M.W., Mangum S.F., Gabaldón T., Castoe T.A., Ray D.A., Hoffmann F.G. 2016 . Contrasting Patterns of Evolutionary Diversification in the Olfactory Repertoires of Reptile and Bird Genomes. Genome Biol. Evol. 8 : 470 – 480 . Google Scholar CrossRef Search ADS PubMed Wang L., Jiang T. 1994 . On the Complexity of Multiple Sequence Alignment. J. Comput. Biol. 1 : 337 – 348 . Google Scholar CrossRef Search ADS PubMed Wong K.M., Suchard M.A., Huelsenbeck J.P. 2008 . Alignment uncertainty and genomic analysis. Science 319 : 473 – 476 . Google Scholar CrossRef Search ADS PubMed © The Author(s) 2018. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For Permissions, please email: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Systematic Biology Oxford University Press

Generalized Bootstrap Supports for Phylogenetic Analyses of Protein Sequences Incorporating Alignment Uncertainty

Loading next page...
 
/lp/ou_press/generalized-bootstrap-supports-for-phylogenetic-analyses-of-protein-Tlv9SLD0Zw
Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For Permissions, please email: journals.permissions@oup.com
ISSN
1063-5157
eISSN
1076-836X
D.O.I.
10.1093/sysbio/syx096
Publisher site
See Article on Publisher Site

Abstract

Abstract Phylogenetic reconstructions are essential in genomics data analyses and depend on accurate multiple sequence alignment (MSA) models. We show that all currently available large-scale progressive multiple alignment methods are numerically unstable when dealing with amino-acid sequences. They produce significantly different output when changing sequence input order. We used the HOMFAM protein sequences dataset to show that on datasets larger than 100 sequences, this instability affects on average 21.5% of the aligned residues. The resulting Maximum Likelihood (ML) trees estimated from these MSAs are equally unstable with over 38% of the branches being sensitive to the sequence input order. We established that about two-thirds of this uncertainty stems from the unordered nature of children nodes within the guide trees used to estimate MSAs. To quantify this uncertainty we developed unistrap, a novel approach that estimates the combined effect of alignment uncertainty and site sampling on phylogenetic tree branch supports. Compared with the regular bootstrap procedure, unistrap provides branch support estimates that take into account a larger fraction of the parameters impacting tree instability when processing datasets containing a large number of sequences. Bootstrap analysis, evolutionary analysis, multiple sequence alignment, phylogenetic tree reconstruction Producing accurate and robust analyses of genomic data has become one of the main challenges of modern biology. We show here that this issue is rather problematic when estimating multiple sequence alignments (MSAs) and phylogenies on datasets containing a large number of sequences and that the associated uncertainty can impact the full spectrum of biological data interpretation based on these models. MSA instability and its implication on phylogenetic reconstruction has long been known to biologists (Wong et al. 2008). Yet the lack of MSA robustness reported by Wong et al. (2008) on small datasets, albeit quantifiable and significant, was mostly concentrated on low support branches and therefore relatively easy to deal with. In contrast, the instability we report on very large datasets affects a much larger fraction of the models including seemingly well-supported branches. Instability may therefore prove a major confounding factor when considering the increasing reliance of genomic analysis on trees encompassing anything between a few thousand taxa (Krypotou et al. 2015; Pittis and Gabaldón 2016) up to 5000 (Vandewege et al. 2016) or even 10 000 for the largest (Jetz et al. 2012). In this work, we show how some of the uncertainty associated with these models can be precisely quantified both at MSA and phylogenetic reconstruction levels. MSA instability and its main causes have been extensively surveyed (Redelings and Suchard 2008) and shown to be an intrinsic property of sequence comparison. In most MSAs, alignment portions can be totally undefined or highly dependent on a combination of parameters whose numerical instability reflects some form of under-determination. Statistical methods can be used to assign probabilities to these effects but their high computational requirements restricts them to a hundred or less sequences (Redelings and Suchard 2007). The other main source of instability is related to the reliance on sub-optimal models imposed by the NP-Complete nature of MSAs computation (Wang and Jiang 1994; Notredame et al. 1998). NP-completeness makes it hard to distinguish between the effect of suboptimal optimisation and that of true alignment uncertainty. This is the reason why most MSA robustness measurements, including the ones reported here, reflect the combination of these two factors. In this work, we have focused our analysis on the effect of the progressive alignment framework (Hogeweg and Hesper 1984) onto MSA robustness. The progressive alignment is the most commonly used MSA heuristic algorithm and has so far been the only approach scaling with datasets containing a large number of sequences. Although a large number of variants have been reported (Chatzou et al. 2016) the core principle is always the same and involves incorporating sequences one by one following the order indicated by a pre-computed guide tree. Tree-based consistency procedures (Notredame et al. 2000) have been developed to address the greedy nature of this procedure but they do not scale well over a thousand sequences and current large-scale efforts have therefore focused on efficient guide tree computation (Katoh et al. 2002; Liu et al. 2009; Sievers et al. 2011; Mirarab et al. 2015). Guide trees are not, however, neutral components and their topologies are known to influence that of downstream phylogenetic trees (Lake 1991). In this work, Lake shows how all possible tree topologies associated with a given dataset can potentially lead to a different alignment. Interestingly, Lake wrongly assumed alignment symmetry, a condition that does not hold true when using dynamic programming as an alignment engine. Unfortunately, guide tree estimation is not a very robust process and even minute fluctuations such as the sequence input order can have an effect on their topologies (Takezaki 1998) owing to the existence of ties in the sequences distance matrix they are built upon (Boyce et al. 2015). This effect is hard to control but its influence on MSA stability can be quantified using the GUIDANCE protocol (Penn et al. 2010) in which the bootstrap replicates drawn from a seed MSA are used to estimate alternative guide trees and MSAs. The reason why even minute changes in the guide tree topology can have such an impact on the downstream MSA is a property of the Needleman and Wunsch (NW) (Needleman and Wunsch 1970) pairwise alignment algorithm used to merge pairs of partial MSAs. Given the most common NW implementations, alternative optimal alignments can be produced depending on the input order. This effect can be quantified using the Head or Tail (HoT) protocol (Landan and Graur 2007). The HoT principle is relatively straightforward and involves re-aligning two sequences (or groups) in their HoT (inverted sequences) configuration. Tail alignments provide the equivalent of an NW input order swap. HoT can be applied to entire datasets or at every node of a guide tree and can also be combined with GUIDANCE (Sela et al. 2015) so as to explore and combine guide tree and alignment induced uncertainty when estimating MSAs. While several alternative evaluation schemes have been developed for the purpose of evaluating MSA local reliability (Castresana 2000; Capella-Gutiérrez et al. 2009; Chang et al. 2014), HoT and GUIDANCE are the only ones whose explicit purpose is to quantify the level of agreement between the original sequence data and its MSA modeling. In this report, we show that these quantities can easily be integrated not only at the MSA level but also with respect to their combined influence on phylogenetic tree stability. We show how a simple sequence reordering procedure (input sequences shuffling) can help in untapping a significant amount of alignment numerical instability. Even though the source of this instability bears some similarity with HoT, we show that its effect is clearly distinct and complementary. Most importantly, we show how the progressive alignment instability can be combined with the regular bootstrap procedure so as to yield branch support values reflecting a broader range of data support for the final model. We named this approach unistrap because it has the potential to unify branch support estimates by combining sampling and computational effects. Our analysis shows that on large datasets unistrap makes it possible to avoid overconfidence in a sizeable fraction of the branches. Materials and Methods Unistrap Package The Unistrap software is a pipeline implemented in the Nextflow language (Di Tommaso et al. 2017) (see Supplementary Material Online, Fig. S9 available on Dryad at http://dx.doi.org/10.5061/dryad.bk203) and available along with its documentation from https://github.com/cbcrg/unistrap. Given a dataset made of unaligned amino-acid sequences in Fasta format, a multiple aligner and a tree reconstruction method, unistrap delivers a newick phylogenetic tree along with unistrap support values for each branch inserted in the place of bootstrap support values. In theory any combination of aligners and trees could be used. In practice, unistrap currently support three aligners (ClustalO, Mafft, and PASTA) and one tree reconstruction method (FastTree using ML (Price et al. 2009)). Unistrap support values are obtained by generating shuffle MSA replicates (100 by default) that are then used to generate bootstrap replicates (one for each shuffle MSA replicate by default). Branch support values are then estimated from the collection of trees generated from the replicates. The exact procedure is detailed in the unistrap branch support section below. Datasets MSA instability of large-scale MSAs was carried out on the HOMFAM benchmark dataset, which consists of 94 protein family ranging from 94 to 94 000 sequences. The datasets were originally built by combining PFAM entries (Sievers et al. 2011) with existing structural alignment of sequences corresponding to these families. In their original report no specific attempt was made at balancing the Operational Taxonomic Unit coverage (OTU) whose breakdown was not reported. Potential taxonomic bias are therefore similar to the ones reported in the corresponding PFAM release. We divided these datasets in three categories based on the number of sequences they contain: SMALL 94 to 3000 sequences, MEDIUM from 3001 to 10 000 sequences, and LARGE from 10 001 to 94 000 sequences. All generated trees, alignments, and comparisons are available from http://genome.crg.es/~cnotredame/data/supp/unistrap/. The guidance analyses pipeline and datasets can be found at https://github.com/skptic/shuffle-guidance. MSA Shuffle Replicates Generation For each dataset 100 replicates were generated by randomly reordering the sequences. Three datasets collections were then compiled using three large-scale aligners: Clustal-O, MAFFT with its large-scale parttree option (mafft – anysymbol –parttree input_file $$>$$ output_file), and PASTA (Table 1). Table 1. Tools and methods Tools used Version Availability Clustal-O (HOMFAM Benchmark) 1.2.1 http://www.clustal.org/omega/ MAFFT 6.956 http://mafft.cbrc.jp/alignment/software/ PASTA 1.6.1 https://github.com/smirarab/pasta FastTree 2.1.8 http://meta.microbesonline.org/fasttree/ Fast-Tree Comparison Tools As downloaded Feb. 2015 http://meta.microbesonline.org/fasttree/treecmp.html SEQBOOT 3.68 http://evolution.genetics.washington.edu T-Coffee 11 http://tcoffee.crg.cat GUIDANCE 2.02 http://guidance.tau.ac.il/ver2/source.php Tools used Version Availability Clustal-O (HOMFAM Benchmark) 1.2.1 http://www.clustal.org/omega/ MAFFT 6.956 http://mafft.cbrc.jp/alignment/software/ PASTA 1.6.1 https://github.com/smirarab/pasta FastTree 2.1.8 http://meta.microbesonline.org/fasttree/ Fast-Tree Comparison Tools As downloaded Feb. 2015 http://meta.microbesonline.org/fasttree/treecmp.html SEQBOOT 3.68 http://evolution.genetics.washington.edu T-Coffee 11 http://tcoffee.crg.cat GUIDANCE 2.02 http://guidance.tau.ac.il/ver2/source.php Table 1. Tools and methods Tools used Version Availability Clustal-O (HOMFAM Benchmark) 1.2.1 http://www.clustal.org/omega/ MAFFT 6.956 http://mafft.cbrc.jp/alignment/software/ PASTA 1.6.1 https://github.com/smirarab/pasta FastTree 2.1.8 http://meta.microbesonline.org/fasttree/ Fast-Tree Comparison Tools As downloaded Feb. 2015 http://meta.microbesonline.org/fasttree/treecmp.html SEQBOOT 3.68 http://evolution.genetics.washington.edu T-Coffee 11 http://tcoffee.crg.cat GUIDANCE 2.02 http://guidance.tau.ac.il/ver2/source.php Tools used Version Availability Clustal-O (HOMFAM Benchmark) 1.2.1 http://www.clustal.org/omega/ MAFFT 6.956 http://mafft.cbrc.jp/alignment/software/ PASTA 1.6.1 https://github.com/smirarab/pasta FastTree 2.1.8 http://meta.microbesonline.org/fasttree/ Fast-Tree Comparison Tools As downloaded Feb. 2015 http://meta.microbesonline.org/fasttree/treecmp.html SEQBOOT 3.68 http://evolution.genetics.washington.edu T-Coffee 11 http://tcoffee.crg.cat GUIDANCE 2.02 http://guidance.tau.ac.il/ver2/source.php MSA Identity Average level of identity among aligned residues in a pairwise alignment further averaged across all possible pairwise projections in the target MSA. Measurements were carried out on Clustal-O MSAs using the following T-Coffee command: t_coffee – other_pg seq_reformat –in sample_ aln1.aln –output sim MSA Alignment Stability Using Sampled Sums of Pairs Measurements Given a dataset, MSA stability was measured by running a Sum-of-Pairs analysis (S-o-P) on MSA pairs, and as such estimating S-o-P similarity. The exact S-o-P measure estimates the fraction of pairs-of-residues aligned the same way (i.e. same indexes) between the two MSAs. The final score is: S-o-P (MSA1, MSA2) $$=2\times$$ ((Number of Identically aligned pairs between MSA1 and MSA2)/(Number of aligned pairs in MSA1$$+$$Number of aligned pairs in MSA2)). The S-o-P was obtained by averaging the output of the two following commands: t_coffee - other_pg aln_compare -al1 MSA1 -al2 MSA2 t_coffee - other_pg aln_compare -al1 MSA1 -al2 MSA2 When dealing with datasets larger than 100 sequences, estimating exact S-o-P similarity between pairs of MSAs becomes computationally prohibitive. We therefore replaced the exact approach described above with a sampling approach that involved randomly selecting the same 50 probe sequences in the two MSAs being compared and measuring the S-o-P score of the two projections. This operation was repeated a 100 times and the resulting values averaged to yield the estimated S-o-P similarity between the two compared MSAs. The choice of 50 probe sequences was based on the observation that there is a near perfect correlation between S-o-P measures carried out using 50 or 100 probe sequences (see Supplementary Material Online, Fig. S10 available on Dryad). MSA Accuracy Stability It was estimated by measuring the S-o-P similarity between every replicate and its HOMFAM structure based reference using the HOMFAM provided package. For a given dataset, this procedure involves extracting the projection of the sequences with known 3D structures and comparing it with the corresponding reference MSA using a standard S-o-P comparison. t_coffee - other_pg aln_compare -al1 reference_ MSA -al2 target_MSA Tree Computation Trees were computed using FastTree (for the three largest datasets FastTree was used with the -fastest option) on Clustal-O and MAFFT MSAs. In the case of PASTA, which is both an aligner and a tree estimator, we used the trees directly recovered from the PASTA final output. Unistrap Branch Support We used the FastTree Robinson and Foulds (FTRF) implementation of the Robinson and Foulds (FR) metrics to compare topological similarity between pairs of alternative trees. FTRF provides for each branch both the depth and a count of the presence of that branch across a collection of replicate trees. The shuffle replicate trees were compared all against all using the FTRF metrics. The tree having the highest average topological identity with the rest of the replicates was then selected as the central tree to be used as a reference. Each shuffle MSA was then used to draw one bootstrap replicate using Phylip’s SEQBOOT. The 100 replicates were used to estimate 100 phylogenetic trees whose topologies were used to estimate the branch support of the reference tree using FTRF with the -$$n$$ option (tree collections) and CompareToBootstrap.pl (FastTree suite) to obtain the bootstrap branch support values. For undocumented reasons FastTree was unable to estimate the branch support of a very small number of branches ($$<$$1%). These branches were removed from the analysis across all considered trees. Bootstrap Branch Support Bootstrap support values were estimated on three datasets from different size groups (mmp with 1427 taxa from the smaller-size dataset group, subt with 7517 from the medium-size group and p450 with 21 013 from the large-size group) for the Clustal-O MSAs. Given the original ClustalO MSA estimated from sequences ordered as provided in HOMFAM, we used SEQBOOT to generate 100 bootstrap replicates that were used to estimate branch support values using FTRF-$$n$$ and CompareToBootstrap.pl as described above. PhyML Type Local Branch Support PhyML type local branch support values were estimated by FastTree during tree inference. These FastTree resulting support values are the FastTree implementation of the PhyML 3’s SH-like local supports (Guindon et al., 2010). Tree ML Stability It is important to insure that tree topological variations coincide with significant ML variations. For this purpose, one therefore needs to monitor ML variations across the replicates in a meaningful way, taking into account the difficulty in comparing ML scores against different MSAs. For each dataset, the 10 shuffle trees having the lowest average RF topological similarity with the rest of the replicate trees were selected. Among these the MSA associated with the tree having the lowest average RF topological similarity was then selected and used to evaluate the ML score of the other trees, using the following FastTree procedures: FastTree -gamma -nt -gtr -nome -mllen -intree topology2 -log topology2.log alignment.p $$>$$ topology2.ftlenperl GammaLogToPaup.pl topology1.log topology2.log $$>$$ top12.txt Depth Analyses The depth analyses were carried out by selecting from each tree either the deep branches (defined as having $$>$$ 10 children) or the shallow branches (defined as $$<$$10 children) and averaging the support of the selected branches. The analysis shown in the main text were carried out using ClustalO and similar analyses shown in the supplement were carried out using MAFFT (see Supplementary Material Online, Fig. S5 available on Dryad) and PASTA (see Supplementary Material Online, Fig. S6 available on Dryad). T-Coffee Highlow Mode Dummy alignments shown in Figure 3a were obtained using the sequences and trees displayed on the figure along with the following T-Coffee command: t_coffee –seq $$<$$ seq $$>$$ -usetree $$<$$ tree $$>$$ -mode highlow This mode was implemented so as to show the potentially extreme effect of arbitrary tie breaking when computing a progressive alignment using a canonical progressive algorithm. Under this mode, ties are broken in the following order: insertion in the first sequence (i.e. gap in the second sequence), substitutions, insertion in the second sequence (i.e. gap in the first sequence). This special mode does not use consistency. Tree Node Rotation These rotations were used to show the effect of an inversion of pairwise sequence input order when using a progressive alignment algorithm. The rotations were achieved by reading in newick trees and outputting them while randomly swapping left and right children of the tree data structure. The procedure is implemented in the T-Coffee seq_reformat package: t_coffee –other_pg seq_reformat –input newick –output newick_shuffle $$>$$ output_ shuffled_newick_tree_file For each dataset, a guide tree was generated using the default sequence input order and this tree was used to generate 100 rotated, topologically identical guide trees. The trees were then fed as guide trees to Clustal-O and the resulting MSAs processed as indicated above. HoT $$+$$ Shuffle This combination was carried out in order to show the combined effect of these two procedure when estimating MSAs. The original HoT algorithm has been incorporated in T-Coffee. For each of the 31 unstable Prefab datasets, a ClustalW MSA was estimated and the deepest branch (defined as the one for which the smallest split contains the largest number of sequences among all possible smallest splits) was used to estimate the seven HoT replicates. Each replicate was compared with the original MSA and estimated for its S-o-P similarity. The most diverse among the seven replicates was then kept and was re-computed whilst adding a shuffle step to the HoT step (i.e. before being re-aligned in HoT, the groups of sequences in each split were randomly re-ordered). This procedure has been implemented in the latest version of T-Coffee and can be called using: t_coffee -other_pg seq_reformat -in $\$$ input -action $$+$$hotshot2 GUIDANCE $$+$$ Shuffle For each of the 31 unstable Prefab datasets, GUIDANCE was executed with both default and shuffled input order 10 times using the following command line: perl guidance.pl–seqFile $\$$ {datasetFile}–msaProgram CLUSTALW–seqType aa–outDir $\$${resultsDir}–bootstraps 10–program GUIDANCE– clustalw clustalw2 The resulting 100 GUIDANCE alternative alignments and 100 Shuffled-GUIDANCE alternative alignments were evaluated for stability using S-o-P scores. This procedure was repeated 10 times and average stabilities of GUIDANCE and GUIDANCE$$+$$Shuffle alignments calculated. Software Version and Availability Unless otherwise stated, all third party tools were used with default parameters. The name, version, and source of the software are given in Table 1. Supplementary Material, including data files and/or online-only appendices, can be found at http://datadryad.org/review?doi=doi:10.5061/dryad.hv752 Results and Discussion Effect of Sequence Input Order Shuffling on Large Datasets Containing 100 Amino-Acid Sequences or More We used sequence input shuffling as a means to reveal and quantify MSA construction instability. This procedure has the advantage of being compatible with most aligners without requiring invasive code adaptation. The effect of sequence shuffling onto MSA computation is not entirely obvious, and given a standard progressive alignment procedure, under which sequences are all compared in pairs to estimate a distance matrix and its associated guide-tree, one would naively expect that realigning several replicates of the same dataset in which sequences are shuffled in random orders should have a limited effect—if any at all. It has, however, recently been confirmed that this was not the case (Boyce et al. 2015). We therefore decided to quantify this phenomenon on large datasets and determine its impact on MSA and phylogenetic tree numerical stability. We applied our methodology onto HOMFAM, the most widely used reference dataset for ultra large MSAs (94 datasets ranging between 93 and 93 681 sequences) and on Prefab, a collection of smaller datasets (1683 datasets ranging between 2 and 50 sequences with an average of 45 sequence/dataset) (Edgar 2004). The most relevant characteristic of these datasets is to be composed of real proteins gathered through a realistic database search protocol. There are currently only three methods able to handle all HOMFAM datasets: Clustal-O (Sievers et al. 2011), the phylogeny aware aligner PASTA (Katoh et al. 2002; Mirarab et al. 2015) and MAFFT-fftns, using the PartTree algorithm flag (Katoh et al. 2002) (Table 1). We estimated the effect of input-order shuffling on each method by generating 100 shuffled replicates for each HOMFAM dataset. Replicates produced by a given aligner were then compared to one another using the Sums of Pairs (SoP) in order to estimate the aligners’ sensitivity to the reordering effect. The SoP measure estimates the fraction of residue pairs aligned identically (i.e. same indexes in their respective sequences) between two MSAs. Results (Fig. 1a, Supplementary Material Online, Figs. S1a and S2a available on Dryad, Table 2), indicate a high sensitivity of the three aligners to the reordering effect. On average, 78.4% of residue pairs are identically aligned across pairs of replicates produced with the same aligner (i.e. 21.6% of the pairs are differently aligned across replicates). The lowest figure was measured on MAFFT-fftns whose average stability is 71.29% but drops down to 56.7% when considering the 43 HOMFAM datasets having $$<$$ 30% average identity (Table 2). Figure 1. View largeDownload slide MSAs and trees are sensitive to sequence input order. a) For each HOMFAM protein benchmark dataset, 100 replicates with shuffled sequence input order were generated and processed into MSAs by Clustal-O. For each dataset the average similarity among replicates was estimated using an S-o-P comparison (%) and plotted against the dataset’s average identity (Spearman correlation rs$$=$$ 0.79). b) This instability results in a high MSA structural accuracy variability among shuffle replicates, also correlating with MSA identity (rs$$= -$$0.51). c) Shuffled input sequences result in Robinson and Foulds topological variability among FastTree maximum likelihood trees, which correlates with the number of sequences (rs$$=$$$$-$$0.40). Figure 1. View largeDownload slide MSAs and trees are sensitive to sequence input order. a) For each HOMFAM protein benchmark dataset, 100 replicates with shuffled sequence input order were generated and processed into MSAs by Clustal-O. For each dataset the average similarity among replicates was estimated using an S-o-P comparison (%) and plotted against the dataset’s average identity (Spearman correlation rs$$=$$ 0.79). b) This instability results in a high MSA structural accuracy variability among shuffle replicates, also correlating with MSA identity (rs$$= -$$0.51). c) Shuffled input sequences result in Robinson and Foulds topological variability among FastTree maximum likelihood trees, which correlates with the number of sequences (rs$$=$$$$-$$0.40). Table 2. Comparative readouts for alignment and phylogenetic instability on large datasets Average overall Sequence identity Number of sequences Under 30% Over 30% 94-3K 3K-10K 10K-94K Aligner C P M AVG C P M C P M C P M C P M C P M # datasets 94 43 51 41 33 20 MSA stability (%) 80.00 84.07 71.29 78.45 69.68 77.41 56.70 87.89 89.69 83.59 84.85 86.36 77.89 79.65 84.37 72.65 68.59 78.92 55.53 Unstable residue pairs (%) 20 15.93 28.71 21.55 30.32 22.59 43.3 12.11 10.31 16.41 15.15 13.64 22.11 20.35 15.63 27.35 31.41 21.08 44.47 MSA accuracy stdev 3.29 4.31 5.13 4.24 4.04 5.12 6.13 2.65 3.63 4.29 2.93 4.53 4.88 3.26 3.96 5.15 4.09 4.49 5.61 Tree stability (%) 67.93 64.18 53.84 61.98 67.49 68.65 47.04 68.29 60.41 59.56 72.22 67.17 60.80 68.58 64.03 54.61 58.05 58.3 38.30 Fraction of unstable branches (%) 32.07 35.82 46.16 38.02 32.51 31.35 52.96 31.71 39.59 40.44 27.78 32.83 39.2 31.42 35.97 45.39 41.95 41.7 61.7 Deep branches of low MSA and high local support (%) 43.92 33.68 64.27 47.29 44.67 29.00 72.17 43.29 37.63 57.61 36.71 31.54 54.65 43.24 31.57 63.88 59.85 41.55 84.66 Shallow branches of low MSA and high local support (%) 11.29 5.61 28.34 15.08 12.27 3.58 40.62 10.45 7.31 17.97 6.34 4.21 16.39 9.52 4.64 26.72 24.35 10.05 55.49 Average overall Sequence identity Number of sequences Under 30% Over 30% 94-3K 3K-10K 10K-94K Aligner C P M AVG C P M C P M C P M C P M C P M # datasets 94 43 51 41 33 20 MSA stability (%) 80.00 84.07 71.29 78.45 69.68 77.41 56.70 87.89 89.69 83.59 84.85 86.36 77.89 79.65 84.37 72.65 68.59 78.92 55.53 Unstable residue pairs (%) 20 15.93 28.71 21.55 30.32 22.59 43.3 12.11 10.31 16.41 15.15 13.64 22.11 20.35 15.63 27.35 31.41 21.08 44.47 MSA accuracy stdev 3.29 4.31 5.13 4.24 4.04 5.12 6.13 2.65 3.63 4.29 2.93 4.53 4.88 3.26 3.96 5.15 4.09 4.49 5.61 Tree stability (%) 67.93 64.18 53.84 61.98 67.49 68.65 47.04 68.29 60.41 59.56 72.22 67.17 60.80 68.58 64.03 54.61 58.05 58.3 38.30 Fraction of unstable branches (%) 32.07 35.82 46.16 38.02 32.51 31.35 52.96 31.71 39.59 40.44 27.78 32.83 39.2 31.42 35.97 45.39 41.95 41.7 61.7 Deep branches of low MSA and high local support (%) 43.92 33.68 64.27 47.29 44.67 29.00 72.17 43.29 37.63 57.61 36.71 31.54 54.65 43.24 31.57 63.88 59.85 41.55 84.66 Shallow branches of low MSA and high local support (%) 11.29 5.61 28.34 15.08 12.27 3.58 40.62 10.45 7.31 17.97 6.34 4.21 16.39 9.52 4.64 26.72 24.35 10.05 55.49 Sequence identity indicates the average sequence identity as estimated of the Clustal-O MSAs. Under 30% indicates the number of datasets having a percent identity lower than 30% and Over 30 those having a percent identity higher than 30%. On the Aligner line, C indicates Clustal-O, P PASTA, and M MAFFT and Avg indicates the average readouts measured on the three aligners. # datasets indicates the number of datasets corresponding to each category. MSA stability indicates the average sum of pairs similarity between shuffled MSAs. Unstable residue pairs indicates the average fraction of residue pairs whose alignment changes across replicates. MSA Accuracy Stdev is the variation in HomFam accuracy across MSA replicates. Tree stability is the average RF similarity normalized by the number of branches. Fraction of unstable branches is the average fraction of branches that differ across the trees estimated from the shuffled MSA replicates. Deep branches of low MSA and high local support indicates the fraction of branches having more than 10 children that occur in less than 20% of the replicate trees but have a local support higher than 80%. Shallow branches of low MSA and high local support (%). Table 2. Comparative readouts for alignment and phylogenetic instability on large datasets Average overall Sequence identity Number of sequences Under 30% Over 30% 94-3K 3K-10K 10K-94K Aligner C P M AVG C P M C P M C P M C P M C P M # datasets 94 43 51 41 33 20 MSA stability (%) 80.00 84.07 71.29 78.45 69.68 77.41 56.70 87.89 89.69 83.59 84.85 86.36 77.89 79.65 84.37 72.65 68.59 78.92 55.53 Unstable residue pairs (%) 20 15.93 28.71 21.55 30.32 22.59 43.3 12.11 10.31 16.41 15.15 13.64 22.11 20.35 15.63 27.35 31.41 21.08 44.47 MSA accuracy stdev 3.29 4.31 5.13 4.24 4.04 5.12 6.13 2.65 3.63 4.29 2.93 4.53 4.88 3.26 3.96 5.15 4.09 4.49 5.61 Tree stability (%) 67.93 64.18 53.84 61.98 67.49 68.65 47.04 68.29 60.41 59.56 72.22 67.17 60.80 68.58 64.03 54.61 58.05 58.3 38.30 Fraction of unstable branches (%) 32.07 35.82 46.16 38.02 32.51 31.35 52.96 31.71 39.59 40.44 27.78 32.83 39.2 31.42 35.97 45.39 41.95 41.7 61.7 Deep branches of low MSA and high local support (%) 43.92 33.68 64.27 47.29 44.67 29.00 72.17 43.29 37.63 57.61 36.71 31.54 54.65 43.24 31.57 63.88 59.85 41.55 84.66 Shallow branches of low MSA and high local support (%) 11.29 5.61 28.34 15.08 12.27 3.58 40.62 10.45 7.31 17.97 6.34 4.21 16.39 9.52 4.64 26.72 24.35 10.05 55.49 Average overall Sequence identity Number of sequences Under 30% Over 30% 94-3K 3K-10K 10K-94K Aligner C P M AVG C P M C P M C P M C P M C P M # datasets 94 43 51 41 33 20 MSA stability (%) 80.00 84.07 71.29 78.45 69.68 77.41 56.70 87.89 89.69 83.59 84.85 86.36 77.89 79.65 84.37 72.65 68.59 78.92 55.53 Unstable residue pairs (%) 20 15.93 28.71 21.55 30.32 22.59 43.3 12.11 10.31 16.41 15.15 13.64 22.11 20.35 15.63 27.35 31.41 21.08 44.47 MSA accuracy stdev 3.29 4.31 5.13 4.24 4.04 5.12 6.13 2.65 3.63 4.29 2.93 4.53 4.88 3.26 3.96 5.15 4.09 4.49 5.61 Tree stability (%) 67.93 64.18 53.84 61.98 67.49 68.65 47.04 68.29 60.41 59.56 72.22 67.17 60.80 68.58 64.03 54.61 58.05 58.3 38.30 Fraction of unstable branches (%) 32.07 35.82 46.16 38.02 32.51 31.35 52.96 31.71 39.59 40.44 27.78 32.83 39.2 31.42 35.97 45.39 41.95 41.7 61.7 Deep branches of low MSA and high local support (%) 43.92 33.68 64.27 47.29 44.67 29.00 72.17 43.29 37.63 57.61 36.71 31.54 54.65 43.24 31.57 63.88 59.85 41.55 84.66 Shallow branches of low MSA and high local support (%) 11.29 5.61 28.34 15.08 12.27 3.58 40.62 10.45 7.31 17.97 6.34 4.21 16.39 9.52 4.64 26.72 24.35 10.05 55.49 Sequence identity indicates the average sequence identity as estimated of the Clustal-O MSAs. Under 30% indicates the number of datasets having a percent identity lower than 30% and Over 30 those having a percent identity higher than 30%. On the Aligner line, C indicates Clustal-O, P PASTA, and M MAFFT and Avg indicates the average readouts measured on the three aligners. # datasets indicates the number of datasets corresponding to each category. MSA stability indicates the average sum of pairs similarity between shuffled MSAs. Unstable residue pairs indicates the average fraction of residue pairs whose alignment changes across replicates. MSA Accuracy Stdev is the variation in HomFam accuracy across MSA replicates. Tree stability is the average RF similarity normalized by the number of branches. Fraction of unstable branches is the average fraction of branches that differ across the trees estimated from the shuffled MSA replicates. Deep branches of low MSA and high local support indicates the fraction of branches having more than 10 children that occur in less than 20% of the replicate trees but have a local support higher than 80%. Shallow branches of low MSA and high local support (%). Shuffling Effect on Small Datasets Containing 50 Amino-Acid Sequences or Less The reason why the shuffling effect had so far been mostly ignored probably owes much to its limited impact on the small datasets routinely analyzed in biology. Indeed when using Clustal-O to measure the effect of shuffling input order on the 1683 Prefab datasets we merely found 131 datasets to be unstable with an average SoP similarity of 95.99 % across 100 replicates (Table 3). This confirms that instability is not a major issue on datasets smaller than approximately 100 sequences. We used the 131 unstable Prefab datasets to further explore the relationships between shuffling, HoT, and Guidance, since these two last procedures are computationally too intensive to be applied onto HOMFAM. In order to quantify the relationship between HoT and Shuffle, we reimplemented the HoT protocol and made it possible to combine the sequence inversion with a shuffle step (cf Methods), resulting in a lower stability when sampled with HoT$$+$$Shuffle than when sampled with HoT only. Overall, datasets sampled with the HoT/Shuffle combination are 4% point less stable than when using HoT only (80.44 for HoT and 76.45 for HoT$$+$$Shuffle). This important observation confirms that the shuffle algorithm explores an alternative alignment space at least partly different from the one defined by HoT. It is worth noting that HoT inverts simultaneously all the ties associated with a given set of sequence. This explains why the HoT replicates are on average so much more different from one another than their shuffled counterparts (Table 2). Indeed, shuffling provides a much smoother exploration of the true space of inversions, with each replicate sampling point being anywhere between the original sequence order and a full inversion of all the sequences as provided by HoT. The shuffling algorithm is therefore a powerful way of estimating the average numerical instability of a dataset while the HoT algorithm provides heuristic access to some of the most extreme points of this sampling space. This makes the two strategies complementary. We used a similar approach to explore the effect of a combination between GUIDANCE and shuffle. To do so, we used the GUIDANCE implementation that does not combine guide tree re-estimation with HoT realignments. We measured the shuffling effect by generating 10 GUIDANCE replicates without sequence input order shuffling and 10 replicates with input order shuffling. Results indicate that the addition of a shuffling stage slightly decreases MSA stability, as measured by a SoP comparison between replicates (69.92 for Guidance, 68.99 for Guidance$$+$$Shuffle). Overall 89 out 131 datasets were found less stable when adding the shuffle step. This analysis confirms that the shuffling procedure differs from both HoT or GUIDANCE and thus provides an alternative way of sampling the MSA instability space when doing a progressive alignment. The additive effects of GUIDANCE, Shuffle and HoT also suggest that these methods can be combined together. We finally took advantage of this small dataset to estimate the effect of the aligner accuracy on the shuffling instability. On these small alignments we found no correlation between accuracy and instability. Table 3. Comparative readouts for alignment and phylogenetic instability on small unstable datasets Measure Mafft-linsi (slow) Mafft-fftns (Fast) Clustal W Clustal O Average Accuracy 0.69 0.63 0.56 0.64 0.63 MSA Stability 0.98 0.96 0.96 0.91 0.95 Tree Stability 0.97 0.92 0.93 0.88 0.92 Unistrap 0.66 0.67 0.65 0.65 0.66 Bootstrap 0.66 0.67 0.65 0.66 0.66 HoT — — 80.44 — Hot$$+$$shuffle — — 76.45 (91/131) — Guidance — — 69.92 — Guidance$$+$$shuffle — — 68.99 (89/131) — Measure Mafft-linsi (slow) Mafft-fftns (Fast) Clustal W Clustal O Average Accuracy 0.69 0.63 0.56 0.64 0.63 MSA Stability 0.98 0.96 0.96 0.91 0.95 Tree Stability 0.97 0.92 0.93 0.88 0.92 Unistrap 0.66 0.67 0.65 0.65 0.66 Bootstrap 0.66 0.67 0.65 0.66 0.66 HoT — — 80.44 — Hot$$+$$shuffle — — 76.45 (91/131) — Guidance — — 69.92 — Guidance$$+$$shuffle — — 68.99 (89/131) — Display of various instability metrics collected on 131 unstable Prefab datasets. Accuracy: sums of pairs Accuracy as provided by Prefab Qscore. MSA stability: average sum-of-pairs similarity across 100 MSA replicates. Tree stability: average RF similarity across trees, normalized by the number of branches. Shootsrap: average Shootsrap branch support. Bootstrap: average bootstrap support. HoT average sum-of pair similarity between the two most different HoT replicates drawn on the deepest branch of a ClustalW guide tree. Hot$$+$$shuffle: average sums of pairs of the HoT replicate when adding a shuffle step. The numbers in parenthesis indicate how many times the Hot$$+$$Shuffle replicates were less stable than their HoT counterpart). Guidance: average Guidance S-o-P similarity across Guidance I replicates. Guidance$$+$$Shuffle. Average S-o-P replicate similarity when combining GUIDANCE I with a shuffle step. The numbers in parenthesis indicate how many times the Hot$$+$$Shuffle replicates were less stable than their HoT counterpart). Table 3. Comparative readouts for alignment and phylogenetic instability on small unstable datasets Measure Mafft-linsi (slow) Mafft-fftns (Fast) Clustal W Clustal O Average Accuracy 0.69 0.63 0.56 0.64 0.63 MSA Stability 0.98 0.96 0.96 0.91 0.95 Tree Stability 0.97 0.92 0.93 0.88 0.92 Unistrap 0.66 0.67 0.65 0.65 0.66 Bootstrap 0.66 0.67 0.65 0.66 0.66 HoT — — 80.44 — Hot$$+$$shuffle — — 76.45 (91/131) — Guidance — — 69.92 — Guidance$$+$$shuffle — — 68.99 (89/131) — Measure Mafft-linsi (slow) Mafft-fftns (Fast) Clustal W Clustal O Average Accuracy 0.69 0.63 0.56 0.64 0.63 MSA Stability 0.98 0.96 0.96 0.91 0.95 Tree Stability 0.97 0.92 0.93 0.88 0.92 Unistrap 0.66 0.67 0.65 0.65 0.66 Bootstrap 0.66 0.67 0.65 0.66 0.66 HoT — — 80.44 — Hot$$+$$shuffle — — 76.45 (91/131) — Guidance — — 69.92 — Guidance$$+$$shuffle — — 68.99 (89/131) — Display of various instability metrics collected on 131 unstable Prefab datasets. Accuracy: sums of pairs Accuracy as provided by Prefab Qscore. MSA stability: average sum-of-pairs similarity across 100 MSA replicates. Tree stability: average RF similarity across trees, normalized by the number of branches. Shootsrap: average Shootsrap branch support. Bootstrap: average bootstrap support. HoT average sum-of pair similarity between the two most different HoT replicates drawn on the deepest branch of a ClustalW guide tree. Hot$$+$$shuffle: average sums of pairs of the HoT replicate when adding a shuffle step. The numbers in parenthesis indicate how many times the Hot$$+$$Shuffle replicates were less stable than their HoT counterpart). Guidance: average Guidance S-o-P similarity across Guidance I replicates. Guidance$$+$$Shuffle. Average S-o-P replicate similarity when combining GUIDANCE I with a shuffle step. The numbers in parenthesis indicate how many times the Hot$$+$$Shuffle replicates were less stable than their HoT counterpart). Structural Accuracy Instability of Large Dataset When Shuffling Sequences Input Order Altogether this analysis suggests a very significant effect of sequences input order shuffling on datasets larger than 50 sequences and a more moderate effect on smaller datasets like the one available in Prefab. We therefore focused the rest of our work on large and very large datasets and their associated aligners. We first asked if the alignment numerical instability measured on HOMFAM came along with a similar impact on structural accuracy. This question is relevant because regions used to assess this type of accuracy are structurally homologous and expected to result in a stable alignment. This analysis was therefore useful to insure that numerical instability is not concentrated in structurally ambiguous portions of the alignments. We measured this effect by comparing the fate of embedded sequences with known 3D structures against their HOMSTRAD reference alignments. When doing so, accuracy is defined as the fraction of residue pairs aligned identically in the target and the structural reference alignment. Our analysis revealed important variations in accuracy across replicates (Fig. 1b, Supplementary Material Online, Figs. S1b and S2b available on Dryad), especially for datasets with $$<$$ 30% identity (Table 2). Shuffling Effect onto Phylogenetic Tree Branch Stability With phylogenetic tree reconstruction being one of the most common downstream analysis based on MSA, we decided to assess the impact of shuffling on tree estimations. For this purpose, we used the deterministic mode of FastTree to estimate ML phylogenetic trees derived from the shuffled MSA replicates. The average Robinson and Foulds (RF) congruences measured on these trees were found to be as low as 68%, 64%, and 54% for Clustal-O, PASTA, and MAFFT respectively (Fig. 1c, Table 2, Supplementary Material Online, Figs. S1c and S2c available on Dryad). Overall, an average of 38.05 % of the branches are unstable across replicates. These topological variations are associated with significant ML score fluctuations (see Supplementary Material Online, Fig. S3 available on Dryad), with the largest trees being more predominantly affected. Variation among replicate trees can be used to measure a branch shuffle support index—defined as the fraction of trees, estimated from the MSA shuffled replicates, in which a specific branch occurs. We compared this shuffle support index with the local measure (local branch support) inferred by FastTree (Price et al. 2009) as initially proposed in PhyML (Anisimova and Gascuel 2006) to quantify the nearest-neighbor interchange effect. We found a weak Spearman correlation for both deep (over 10 children) and shallow branches, which drops even further as the number of sequences increases (Fig. 2a, Supplementary Material Online, Figs. S4a–S6a available on Dryad). This low correlation reflects the existence of many branches being highly supported by the local measure, albeit missing from the majority of shuffled replicates. For instance, when considering Clustal-O trees, 42% of the deep branches have a local support higher than 80%, yet among these highly supported branches, 44% occur in less than 20% of the shuffled replicate trees (Table 2, Supplementary Material Online, Figs. S4b–S6b available on Dryad). This important observation indicates that the branch local reliability measure does not reflect MSA robustness. Figure 2. View largeDownload slide Shuffle and bootstrap branch supports are correlated. (a) Local branch support and shuffle branch support measured across Clustal-O datasets from HOMFAM have a low Spearman correlation that decreases as the number of sequences increases, for both deep (more than 10 children) and shallow branches. Similar comparisons made on a large HOMFAM dataset (p450, 21,013 sequences) reveal a low correlation between bootstrap and local support values (rs$$=$$ 0.27) when compiling bootstrap and local branch support on the 100 shuffled replicate (b), but a relatively high correlation between standard bootstrap branch support and shuffle support (rs$$=$$ 0.69) (c). Figure 2. View largeDownload slide Shuffle and bootstrap branch supports are correlated. (a) Local branch support and shuffle branch support measured across Clustal-O datasets from HOMFAM have a low Spearman correlation that decreases as the number of sequences increases, for both deep (more than 10 children) and shallow branches. Similar comparisons made on a large HOMFAM dataset (p450, 21,013 sequences) reveal a low correlation between bootstrap and local support values (rs$$=$$ 0.27) when compiling bootstrap and local branch support on the 100 shuffled replicate (b), but a relatively high correlation between standard bootstrap branch support and shuffle support (rs$$=$$ 0.69) (c). We also compared branch shuffle support with the standard Felsenstein’s bootstrap support method (Felsenstein 1985), which generates replicate trees by re-sampling MSA columns with replacement and uses them to estimate the support of each branch in the reference tree. We restricted this analysis to three datasets, labeled as small, medium and large (1427,7517, and 21 013 taxa). On this subset, we could carry out an exhaustive comparison analysis between shuffle, bootstrap, and local branch supports using Clustal-O. Clustal-O was chosen because its trees are the most stable with respect to shuffling (Table 2). Our analysis do not show any clear correlation between local branch reliability and bootstrap support for either the large (Fig. 2b), the medium or the small datasets (see Supplementary Material Online, Figs. S7 available on Dryad). In contrast, for these same datasets we observed a relatively high correlation between shuffle and bootstrap branch support values (Fig. 2c, $$r_{\mathrm{s}} =$$ 0.69). Effect of Guide Tree Node Rotation onto Phylogenetic Tree Topological Stability We speculated that the major cause of MSA instability was the unordered nature of child nodes in the binary guide tree. Indeed, while the NW algorithm used to do pairwise alignments at every node is guaranteed to return an optimal scoring alignment of the two considered child nodes, this algorithm may deliver different—but equally optimal—pairwise alignments when inverting sequences input order. The reason is that while NW is guaranteed to deliver an optimal scoring pairwise alignment, it has no objective way of handling ties occurring within the score matrix. Depending on how these ties are resolved NW can therefore deliver alternative optimal alignments. This effect is often referred to as low road/high road because of the visual effect of alternative tie breaking on the alignment path. In most implementations, the order under which ties are broken is arbitrary and therefore input order dependent (e.g. substitution first, then gap in first sequence, then gap in second). Progressive MSA assembly is especially sensitive to this phenomenon because partial MSAs associated with sister nodes are assembled independently. As a consequence, the breaking of their ties cannot be synchronized. This leads to the possibility of partial MSAs whose combination may not be compatible with a global optimal MSA. It is worth noting that node input order variations can happen even without changing the guide tree topology. Our toy example (Fig. 3a) illustrates how a standard progressive aligner like T-Coffee (in the high-low mode with consistency turned off, cf Methods) can be led to produce very different MSAs while using guide trees with the same topologies, and by simply rotating child nodes. We systematically quantified this effect by generating for each HOMFAM dataset, 100 randomly rotated but topologically identical Clustal-O guide trees and we used them to generate as many Clustal-O MSAs. The effect of rotation on MSA stability was then estimated by measuring the average S-o-P similarity (as previously done) between the 100 replicate MSAs. The resulting values show that node rotation is capable of explaining the largest fraction of the shuffling effect (Fig. 3b, ratio of explained variance 0.68, $$r =$$ 0.85). We also used an all-against-all RF comparison to measure guide tree stability and found them to be on average 45.1% topologically similar, a variation sufficiently large to account for the remaining fraction of MSA instability (Fig. 3c). This finding confirms the observation by Boyce et al. (2015) of guide tree instability contributing to MSA instability, but it also indicates that the effect reported by Boyce et al. (2015), and attributed by the authors to frequent ties in the k-tup based distance matrices, is not the main cause of MSA instability. Figure 3. View largeDownload slide The majority of instability is explained by unordered nodes in the guide tree. a) Similar guide trees with different node ordering can result in different MSAs when fed to the highlow mode of T-Coffee multiple aligner (see Supplementary Material online, Methods available on Dryad). b) We used this property to recapitulate the shuffling input effect on each dataset, by feeding Clustal-O with a 100 topologically identical guide trees with randomly rotated nodes. When done on all the entire HOMFAM this procedure accounts for the majority of the shuffling effect (explained variance ratio $$=$$ 0.68, rs$$=$$ 0.85). c) The remaining variation can be attributed to the effect of sequence input shuffling on guide tree topologies, as shown by comparing guide tree stability to MSA stability (explained variance ratio $$=$$ 0.28, rs$$=$$ 0.21). Figure 3. View largeDownload slide The majority of instability is explained by unordered nodes in the guide tree. a) Similar guide trees with different node ordering can result in different MSAs when fed to the highlow mode of T-Coffee multiple aligner (see Supplementary Material online, Methods available on Dryad). b) We used this property to recapitulate the shuffling input effect on each dataset, by feeding Clustal-O with a 100 topologically identical guide trees with randomly rotated nodes. When done on all the entire HOMFAM this procedure accounts for the majority of the shuffling effect (explained variance ratio $$=$$ 0.68, rs$$=$$ 0.85). c) The remaining variation can be attributed to the effect of sequence input shuffling on guide tree topologies, as shown by comparing guide tree stability to MSA stability (explained variance ratio $$=$$ 0.28, rs$$=$$ 0.21). Branch Support Estimation Using the Unistrap Procedure No simple solution exists to address the problem of unordered child nodes (i.e. in binary guide tree the left and the right children are arbitrarily assigned and interchangeable), as this would require the existence of a linear sorting procedure capable of unambiguously recapitulating an entire binary tree. An alternative to regular guide trees is the use of simple chained guide trees (i.e. trees where each split has a leaf child), but recent suggestions in this direction (Boyce et al. 2014) have been criticized for their limited capacity of producing biologically realistic MSAs (Boyce et al. 2014; Tan et al. 2015). Summarizing the combined effect of site sampling and unordered child leaves is also a complicated problem because the intermediate level of correlation between the shuffle and bootstrap supports prevents a straightforward agglomeration of the two indexes, as they are neither independent enough to be combined nor correlated enough to be treated as one another’s proxy. The impossibility to disentangle these two measurements requires their simultaneous estimation. For this purpose, we designed the “unistrap” procedure whose name reflects the capacity to provide a unified estimation of the combined effect of site sampling and MSA instability on branch support. Our unistrap implementation starts by generating shuffled MSA replicates. Given each such shuffle replicate, columns are then drawn with re-sampling to generate one or more bootstrap replicates. The resulting collection of bootstrap MSA replicates are then used to estimate ML replicate trees. The rest of the procedure is a standard Felsenstein bootstrap measure in which replicate trees can either be combined into a consensus or be used to estimate branch supports of a reference tree. Comparison Between Unistrap and Felsenstein Bootstrap Support Indexes In this study, we used the central tree (RF comparison) drawn from the shuffle replicates as a reference and we found unistrap branch support to be correlated with both the shuffle and bootstrap supports (Fig. 4, Supplementary Material Online, Fig. S8 available on Dryad), albeit yielding support values on average lower (Table 4). This effect is especially relevant for the branches that are very well supported by either the shuffle or the bootstrap measure, but poorly supported by unistrap. For instance, in the largest dataset that comprises 21 013 taxa, 285 internal branches (out of the total 20 582 with a non-negative local FastTree support) have 80% or more bootstrap support despite their unistrap support being lower than 50% (Table 4). Without the unistrap index these branches would be considered reliable and could end up being used to draw unsupported conclusions. It is noteworthy that on small datasets, for which the shuffling effect is often limited, unistrap and bootstrap supports are more consistent with one another, thus confirming the expectation that our new procedure does not interfere with the standard bootstrap support analysis but rather complements it. An important property of unistrap is to behave exactly like standard bootstrap when the MSAs (or the aligners) are stable. This makes unistrap a very general approach compatible with any aligner or dataset type, regardless of the kind of parameters affecting their numerical robustness. Figure 4. View largeDownload slide Unistrap simultaneously quantifies MSA instability and site sampling. For each of the three considered datasets, 100-shuffle replicate MSAs were generated and used to estimate an equal number of ML trees. The central tree (i.e. the one with the highest average RF similarity with the rest of the set) was then selected as a reference. In parallel, 100 standard bootstrap as well as a 100 unistrap replicate trees were computed, all of which were used to estimate their respective branch support to the central tree. On the mmp dataset, unistrap support was then compared with the shuffle (a, rs$$=$$ 0.87) and the bootstrap (b, rs$$=$$ 0.84) supports accordingly. Similar comparisons were made on both the subt (c, rs$$=$$ 0.89; d, rs$$=$$ 0.95) and the p450 (e, rs$$=$$ 0.95; f, rs$$=$$ 0.90) datasets. These high correlations however are accompanied with unistrap values being significantly lower than their bootstrap or shuffle counterparts (Table 3), especially on the largest dataset for which the average unistrap support is 36.9% compared with the 44% for shuffle support and 43.3% for bootstrap support. Figure 4. View largeDownload slide Unistrap simultaneously quantifies MSA instability and site sampling. For each of the three considered datasets, 100-shuffle replicate MSAs were generated and used to estimate an equal number of ML trees. The central tree (i.e. the one with the highest average RF similarity with the rest of the set) was then selected as a reference. In parallel, 100 standard bootstrap as well as a 100 unistrap replicate trees were computed, all of which were used to estimate their respective branch support to the central tree. On the mmp dataset, unistrap support was then compared with the shuffle (a, rs$$=$$ 0.87) and the bootstrap (b, rs$$=$$ 0.84) supports accordingly. Similar comparisons were made on both the subt (c, rs$$=$$ 0.89; d, rs$$=$$ 0.95) and the p450 (e, rs$$=$$ 0.95; f, rs$$=$$ 0.90) datasets. These high correlations however are accompanied with unistrap values being significantly lower than their bootstrap or shuffle counterparts (Table 3), especially on the largest dataset for which the average unistrap support is 36.9% compared with the 44% for shuffle support and 43.3% for bootstrap support. Table 4. Comparison between bootstrap, shuffle, and unistrap supports SH BS SHO SH $$>$$ 80 SH $$<$$ 50 BS $$>$$ 80 BS $$>$$ 80 Dataset #Taxa #Branches support (%) support (%) support (%) SHO $$<$$ 50 (%) SHO $$>$$ 80 (%) SHO $$<$$ 50 (%) SHO $$<$$ 50 (%) mmp 1427 1283 63.2 52.9 48.3/48.3 6.00/6.00 0.08/0.08 0.08/0.08 0.15/0.15 subt 7517 7026 59.2 54.3 47.7/48.1 3.82/3.49 0.00/0.00 1.52/1.45 0.05/0.06 p450 21 013 20 482 44.0 43.3 36.9/37.0 1.66/1.64 0.00/0.00 1.39/1.35 0.06/0.07 SH BS SHO SH $$>$$ 80 SH $$<$$ 50 BS $$>$$ 80 BS $$>$$ 80 Dataset #Taxa #Branches support (%) support (%) support (%) SHO $$<$$ 50 (%) SHO $$>$$ 80 (%) SHO $$<$$ 50 (%) SHO $$<$$ 50 (%) mmp 1427 1283 63.2 52.9 48.3/48.3 6.00/6.00 0.08/0.08 0.08/0.08 0.15/0.15 subt 7517 7026 59.2 54.3 47.7/48.1 3.82/3.49 0.00/0.00 1.52/1.45 0.05/0.06 p450 21 013 20 482 44.0 43.3 36.9/37.0 1.66/1.64 0.00/0.00 1.39/1.35 0.06/0.07 #Taxa indicates the number of taxons in the considered datasets. #Branches indicates the number of internal branches with a non-negative FastTree local support. Shuffle support indicates the average shuffle branch support measured on the central ML tree (i.e. the one having the highest average RF topological similarity with the rest of the replicates) selected as a reference. Bootstrap support indicates the average bootstrap support measured on the central ML tree selected as a reference, Unistrap support indicates the average unistrap support measured on the central ML tree selected as a reference. Shuffle $$>$$ 80 Unistrap $$<$$ 50 is the fraction of branches whose shuffle support is higher than 80% but whose unistrap support is lower than 50. Such branches could easily be taken as false positives if only considering the Shuffle support. The rest of the columns indicate comparable values and confirm that the unistrap index is more conservative than both the shuffle and the bootstrap indexes. In order to estimate its numerical reproducibility, the bootstrap column sampling step of unistrap was carried out twice. Both readouts are indicated in the corresponding table entry (readout 1/readout 2). Table 4. Comparison between bootstrap, shuffle, and unistrap supports SH BS SHO SH $$>$$ 80 SH $$<$$ 50 BS $$>$$ 80 BS $$>$$ 80 Dataset #Taxa #Branches support (%) support (%) support (%) SHO $$<$$ 50 (%) SHO $$>$$ 80 (%) SHO $$<$$ 50 (%) SHO $$<$$ 50 (%) mmp 1427 1283 63.2 52.9 48.3/48.3 6.00/6.00 0.08/0.08 0.08/0.08 0.15/0.15 subt 7517 7026 59.2 54.3 47.7/48.1 3.82/3.49 0.00/0.00 1.52/1.45 0.05/0.06 p450 21 013 20 482 44.0 43.3 36.9/37.0 1.66/1.64 0.00/0.00 1.39/1.35 0.06/0.07 SH BS SHO SH $$>$$ 80 SH $$<$$ 50 BS $$>$$ 80 BS $$>$$ 80 Dataset #Taxa #Branches support (%) support (%) support (%) SHO $$<$$ 50 (%) SHO $$>$$ 80 (%) SHO $$<$$ 50 (%) SHO $$<$$ 50 (%) mmp 1427 1283 63.2 52.9 48.3/48.3 6.00/6.00 0.08/0.08 0.08/0.08 0.15/0.15 subt 7517 7026 59.2 54.3 47.7/48.1 3.82/3.49 0.00/0.00 1.52/1.45 0.05/0.06 p450 21 013 20 482 44.0 43.3 36.9/37.0 1.66/1.64 0.00/0.00 1.39/1.35 0.06/0.07 #Taxa indicates the number of taxons in the considered datasets. #Branches indicates the number of internal branches with a non-negative FastTree local support. Shuffle support indicates the average shuffle branch support measured on the central ML tree (i.e. the one having the highest average RF topological similarity with the rest of the replicates) selected as a reference. Bootstrap support indicates the average bootstrap support measured on the central ML tree selected as a reference, Unistrap support indicates the average unistrap support measured on the central ML tree selected as a reference. Shuffle $$>$$ 80 Unistrap $$<$$ 50 is the fraction of branches whose shuffle support is higher than 80% but whose unistrap support is lower than 50. Such branches could easily be taken as false positives if only considering the Shuffle support. The rest of the columns indicate comparable values and confirm that the unistrap index is more conservative than both the shuffle and the bootstrap indexes. In order to estimate its numerical reproducibility, the bootstrap column sampling step of unistrap was carried out twice. Both readouts are indicated in the corresponding table entry (readout 1/readout 2). Conclusion MSAs constitute one of the most widely used modeling tools in evolutionary biology. We show that the methods used to compute large-scale models incorporating over 1500 sequences are numerically unstable when estimated using a standard progressive alignment framework. This instability results in equally unstable phylogenetic trees. We have quantified this effect and are proposing unistrap, a novel bootstrap method that simultaneously incorporates into branch support the effects of site sampling and the alignment instability. As such, the unistrap index makes it possible to objectively estimate the support of raw data for a given phylogenetic tree. On its own, unistrap does not provide a guarantee of improved biological correctness, but it offers an objective criterion to estimate the effects of MSA instability on a given phylogeny. It must be stressed that the unistrap protocol presented here is tightly dependent on the progressive alignment framework and its sensitivity on input shuffling. Yet, in general, unistrap could easily be adapted to any procedure involving a numerically instable aligner. On the downside, one should also be aware that unistrap could easily be deceived by any arbitrary procedure used to stabilize the algorithm (i.e. internal sequence sorting or anything similar). Such limitations are not specific to unistrap. Other sampling methods such as HoT can also be deceived by low-level algorithmic decisions. This is one reason why the packages in which these methods have been implemented only support carefully selected aligners. The fact that deceiving practices can result in overestimating the robustness of a model is not per se a problem, but it is clearly something one should be aware of. In order to avoid such issues, algorithms should be as explicitly described as possible and code kept open. In general, analysis like the ones, we present here should encourage the community to treat limited robustness as an opportunity for the measurement of model supports and reliability. In order to avoid such confusion, we recommend to either generate the original MSA with the supported aligners or to apply the default unistrap support procedure (i.e. ClustalO $$+$$ FastTree replicates) onto any user generated phylogenetic tree. Such an approach brings the benefits of a normalized unistrap procedure comparable across experiments. This framework will also make it possible to incorporate and compare the stability of alternative tree reconstruction methods, as well as alternative rooting approaches. The unistrap algorithm could easily be extended further. For instance, the tiebreak effect on which our analysis relies is merely a special case of a more general situation in which dynamic programming provides an efficient way to navigate an optimization strategy among all potential suboptimal alignments. Efficient numerical procedures exist that allow objective comparisons between the optimal score and close sub-optimals, as in Probcons (Roshan 2014) for instance. Such implementation are, however, more demanding than a mere shuffling of sequence input order and they would require a systematic re-engineering of the methods benchmarked in the current analysis. Furthermore, dynamic programming is not the only source of instability and other aligners like UPP (Nguyen et al. 2015) or ClustalW (Thompson et al. 1994) can also be run in an unstable way because of the procedure used to seed the guide tree estimate. In general unistrap is probably a suitable strategy in any situation where alternative MSAs can be generated to estimate a phylogeny. This makes it a useful addition to the arsenal of methods recently developed to capture alignment instability, especially the HoT and the GUIDANCE protocols. Future analysis will also involve closer inspection of the parameters that dominate instability. We show here that MSA identity is a good predictor of instability, but the algorithmic source of the progressive alignment based instability suggests that gap frequencies in the original MSA or the number of OTUs and their evolutionary distances may also be very good predictors. It may it also be worth exploring the effect of multi-loci datasets so as to determine to which extent the modeling of epistatic relationships may affect stability. In that respect the most mature framework would certainly be RNA analysis and the comparison of methods aligning primary sequences with those able to take into account co-evolution through compensated mutations. So far, unistrap has only been validated on protein sequences, but we expect the instability on which it relies to be higher in RNA and DNA sequences whose lower complexity alphabets results in a larger number of alternative optimal alignments. We intend to explore unistrap behavior on nucleic acids as an immediate follow up of the work presented here. Yet, even when applied onto protein alignments, unistrap’s high sensitivity in detecting poorly supported branches gives it the potential to improve many downstream analyses, including attempts to predict the effect of genome variations on molecular functions (Katsonis and Lichtarge 2014). Availability The unistrap software is implemented as a public open source freeware package, available along with its documentation from https://github.com/cbcrg/unistrap. This package is a pipeline implemented in Nextflow (Di Tommaso et al. 2017). Supplementary Material Data available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.bk203. Author Contributions C.N. contributed ideas, some analyses and helped writing the manuscript. O.G. contributed ideas and helped writing the manuscript. M.C. contributed ideas, carried out most of the analyses and helped writing the manuscript. E.F. contributed ideas, carried out analyses and helped writing the manuscript. P.D.T helped with the implementation of the Nextflow unistrap pipeline. Acknowledgements We wish to thank Roderic Guigo and Tony Ferrar for revision and helpful comments. Funding We acknowledge support of the Spanish Ministry of Economy and Competitiveness, ‘Centro de Excelencia Severo Ochoa 2013–2017. We acknowledge the support of the CERCA Programme/Generalitat de Catalunya. References Boyce K., Sievers F., Higgins D.G. 2014 . Simple chained guide trees give high-quality protein multiple sequence alignments. Proc. Natl. Acad. Sci. U.S.A 111 : 10556 – 10561 . Google Scholar CrossRef Search ADS PubMed Boyce K., Sievers F., Higgins D.G. 2015 . Instability in progressive multiple sequence alignment algorithms. Algorithms Mol. Biol. 10 : 26 . Google Scholar CrossRef Search ADS PubMed Capella-Gutiérrez S., Silla-Martínez J.M., Gabaldón T. 2009 . trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25 : 1972 – 1973 . Google Scholar CrossRef Search ADS PubMed Castresana J. 2000 . Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. 17 : 540 – 552 . Google Scholar CrossRef Search ADS PubMed Chang J.-M., Di Tommaso P., Notredame C. 2014 . TCS: a new multiple sequence alignment reliability measure to estimate alignment accuracy and improve phylogenetic tree reconstruction. Mol. Biol. Evol. 31 : 1625 – 1637 . Google Scholar CrossRef Search ADS PubMed Chatzou M., Magis C., Chang J.-M., Kemena C., Bussotti G., Erb I., Notredame C. 2016 . Multiple sequence alignment modeling: methods and applications. Brief. Bioinform. 17 : 1009 – 1023 . Google Scholar CrossRef Search ADS PubMed Di Tommaso P., Chatzou M., Floden E.W., Barja P.P., Palumbo E., Notredame C. 2017 . Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35 : 316 – 319 . Google Scholar CrossRef Search ADS PubMed Edgar R.C. 2004 . MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32 : 1792 – 1797 . Google Scholar CrossRef Search ADS PubMed Guindon S., Dufayard J.-F., Anisimova M., Hordijk W., Lefort V., Gascuel O. 2010 . “New Algorithms and Methods to Estimate Maximum-Likelihood Phylogenies: Assessing the Performance of PhyML 3.0”. Syst Biol. 59 : 307 – 321 . Google Scholar CrossRef Search ADS PubMed Hogeweg P., Hesper B. 1984 . The alignment of sets of sequences and the construction of phyletic trees: An integrated method. J. Mol. Evol. 20 : 175 – 186 . Google Scholar CrossRef Search ADS PubMed Jetz W., Thomas G.H., Joy J.B., Hartmann K., Mooers A.O. 2012 . The global diversity of birds in space and time. Nature 491 : 444 – 448 . Google Scholar CrossRef Search ADS PubMed Katoh K., Misawa K., Kuma K.-I., Miyata T. 2002 . MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30 : 3059 – 3066 . Google Scholar CrossRef Search ADS PubMed Katsonis P., Lichtarge O. 2014 . A formal perturbation equation between genotype and phenotype determines the Evolutionary Action of protein-coding variations on fitness. Genome Res. 24 : 2050 – 2058 . Google Scholar CrossRef Search ADS PubMed Krypotou E., Evangelidis T., Bobonis J., Pittis A.A., Gabaldón T., Scazzocchio C., Mikros E., Diallinas G. 2015 . Origin, diversification and substrate specificity in the family of NCS1/FUR transporters. Mol. Microbiol. 96 : 927 – 950 . Google Scholar CrossRef Search ADS PubMed Lake J.A. 1991 . The order of sequence alignment can bias the selection of tree topology. Mol. Biol. Evol. 8 : 378 – 385 . Google Scholar PubMed Landan G., Graur D. 2007 . Heads or tails: a simple reliability check for multiple sequence alignments. Mol. Biol. Evol. 24 : 1380 – 1383 . Google Scholar CrossRef Search ADS PubMed Liu K., Raghavan S., Nelesen S., Linder C.R., Warnow T. 2009 . Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science 324 : 1561 – 1564 . Google Scholar CrossRef Search ADS PubMed Mirarab S., Nguyen N., Guo S., Wang L.-S., Kim J., Warnow T. 2015 . PASTA: Ultra-large multiple sequence alignment for nucleotide and amino-acid sequences. J. Comput. Biol. 22 : 377 – 386 . Google Scholar CrossRef Search ADS PubMed Needleman S.B., Wunsch C.D. 1970 . A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48 : 443 – 453 . Google Scholar CrossRef Search ADS PubMed Nguyen N.-P.D., Mirarab S., Kumar K., Warnow T. 2015 . Ultra-large alignments using phylogeny-aware profiles. Genome Biol. 16 : 124 . Google Scholar CrossRef Search ADS PubMed Notredame C., Higgins D.G., Heringa J. 2000 . T-coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302 : 205 – 217 . Google Scholar CrossRef Search ADS PubMed Notredame C., Holm L., Higgins D.G. 1998 . COFFEE: an objective function for multiple sequence alignments. Bioinformatics 14 : 407 – 422 . Google Scholar CrossRef Search ADS PubMed Penn O., Privman E., Landan G., Graur D., Pupko T. 2010 . An alignment confidence score capturing robustness to guide tree uncertainty. Mol. Biol. Evol. 27 : 1759 – 1767 . Google Scholar CrossRef Search ADS PubMed Pittis A.A., Gabaldón T. 2016 . Late acquisition of mitochondria by a host with chimaeric prokaryotic ancestry. Nature 531 : 101 – 104 . Google Scholar CrossRef Search ADS PubMed Price M.N., Dehal P.S., Arkin A.P. 2009 . FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol. Biol. Evol. 26 : 1641 – 1650 . Google Scholar CrossRef Search ADS PubMed Redelings B.D., Suchard M.A. 2007 . Incorporating indel information into phylogeny estimation for rapidly emerging pathogens. BMC Evol. Biol. 7 : 40 . Google Scholar CrossRef Search ADS PubMed Redelings B.D., Suchard M.A. 2008 . Robust inferences from ambiguous alignments. In: Rosenberg M.S., editor. Sequence alignment: methods, models, concepts and strategies . Oakland, CA : University of California Press . p. 209 – 270 . Google Scholar CrossRef Search ADS Roshan U. 2014 . Multiple sequence alignment using Probcons and Probalign. Methods Mol. Biol. 1079 : 147 – 153 . Google Scholar CrossRef Search ADS PubMed Sela I., Ashkenazy H., Katoh K., Pupko T. 2015 . GUIDANCE2: accurate detection of unreliable alignment regions accounting for the uncertainty of multiple parameters. Nucleic Acids Res. 43 : W7 – 14 . Google Scholar CrossRef Search ADS PubMed Sievers F., Wilm A., Dineen D., Gibson T.J., Karplus K., Li W., Lopez R., McWilliam H., Remmert M., Söding J., Thompson J.D., Higgins D.G. 2011 . Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7 : 539 . Google Scholar CrossRef Search ADS PubMed Takezaki N. 1998 . Tie trees generated by distance methods of phylogenetic reconstruction. Mol. Biol. Evol. 15 : 727 – 737 . Google Scholar CrossRef Search ADS PubMed Tan G., Gil M., Löytynoja A.P., Goldman N., Dessimoz C. 2015 . Simple chained guide trees give poorer multiple sequence alignments than inferred trees in simulation and phylogenetic benchmarks. Proc. Natl. Acad. Sci. U.S.A. 112 : E99 – 100 . Google Scholar CrossRef Search ADS PubMed Thompson J.D., Higgins D.G., Gibson T.J. 1994 . CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22 : 4673 – 4680 . Google Scholar CrossRef Search ADS PubMed Vandewege M.W., Mangum S.F., Gabaldón T., Castoe T.A., Ray D.A., Hoffmann F.G. 2016 . Contrasting Patterns of Evolutionary Diversification in the Olfactory Repertoires of Reptile and Bird Genomes. Genome Biol. Evol. 8 : 470 – 480 . Google Scholar CrossRef Search ADS PubMed Wang L., Jiang T. 1994 . On the Complexity of Multiple Sequence Alignment. J. Comput. Biol. 1 : 337 – 348 . Google Scholar CrossRef Search ADS PubMed Wong K.M., Suchard M.A., Huelsenbeck J.P. 2008 . Alignment uncertainty and genomic analysis. Science 319 : 473 – 476 . Google Scholar CrossRef Search ADS PubMed © The Author(s) 2018. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For Permissions, please email: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Journal

Systematic BiologyOxford University Press

Published: Mar 29, 2018

There are no references for this article.

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off