DAMBE7: New and Improved Tools for Data Analysis in Molecular Biology and Evolution

Abstract

DAMBE is a comprehensive software package for genomic and phylogenetic data analysis on Windows, Linux, and Macintosh computers. New functions include imputing missing distances and phylogeny simultaneously (paving the way to build large phage and transposon trees), new bootstrapping/jackknifing methods for PhyPA (phylogenetics from pairwise alignments), and an improved function for fast and accurate estimation of the shape parameter of the gamma distribution for fitting rate heterogeneity over sites. The previous method corrects multiple hits for each site independently. DAMBE’s new method uses all sites simultaneously for correction. DAMBE, featuring a user-friendly graphic interface, is freely available from http://dambe.bio.uottawa.ca (last accessed April 17, 2018).

Keywords: phylogenetics, bioinformatics, missing distance imputation, rate heterogeneity over sites

DAMBE is a package for descriptive and comparative sequence analysis (Xia 2013, 2017). It features a graphic, user-friendly, and intuitive interface and is available free for Windows, Linux, and Macintosh computers at dambe.bio.uottawa.ca. DAMBE7 represents a major upgrade with many new functions, including new sets of significance tests for position weight matrix and Gibbs sampler for de novo characterization of sequence motifs. I outline three functions most relevant to molecular evolution and phylogenetics. A supplemental file (Using_New_Functions.docx) is included in Supplementary Material online.

Imputing Missing Distance and Phylogeny Simultaneously

This function is implemented for building large trees of phages, which often 1) are too diverged to build a multiple sequence alignment (MSA), and 2) do not share homologous genes/sites (e.g., S3 and S4 in fig. 1a). This is also true for many transposons from which one cannot get a meaningful MSA, so that researchers are limited to aligning the sequences against the consensus (Gallus et al. 2015). One can do pairwise alignment among most of the sequences and compute their distances, but some sequence pairs do not share homologous sites and need to have their distances imputed from the computable distances. This allows one to build trees and will likely revolutionize phage taxonomy, which is currently not based on phylogeny.

Fig. 1. Illustration of distance imputation and estimation of the shape parameter in gamma distribution. (a) A sequence data set with S3 and S4 sharing no homologous sites to estimate distance. (b) Distance matrix with two shaded distances missing. (c) Tree reconstructed from the distance matrix in (b). (d) A case with nonunique solution for a missing distance between bonobo and chimpanzee. (e) Tree reconstructed from a multiple alignment with one site mapped to the leaves, together with one of several possible reconstructions of internal nodes. (f) Counting changes between neighboring nodes and correction for multiple hits. (g) Transitions and transversions at three sites illustrating independently estimated distance ($D_{IE}$) and simultaneously estimated distance ($D_{SE}$).

This distance-imputation function is currently missing in other software.
MEGA (Kumar et al. 2016) does not impute missing distances, and neither does PHYLIP’s DNADIST (Felsenstein 2014). The Fitch and Kitsch programs can estimate missing distances if a user tree is provided. For a distance matrix with N missing distances (parameters), DAMBE searches the tree space and parameter space to find a tree with the N parameters that minimizes

$$RSS = \sum \frac{\left[D_{ij} - E(D_{ij})\right]^2}{D_{ij}^{m}} \tag{1}$$

where $D_{ij}$ and $E(D_{ij})$ are the observed and patristic distances, respectively, and m is typically 0, 1, or 2. Figure 1c is the phylogenetic tree reconstructed from the distance matrix in figure 1b with two shaded distances missing. For sequences such as those in figure 1a, DAMBE will compute all computable distances and impute the missing distances. When bootstrapping/jackknifing is used, distance imputation and phylogeny inference are done for each resampled data set. One may also have unaligned sequence data and use PhyPA (Xia 2016) to build phylogenetic trees and obtain bootstrap/jackknife support.

There are cases where a unique solution cannot be obtained. For example, when a missing distance is for two sister taxa (e.g., bonobo and chimpanzee in fig. 1b and c), we can find the minimum RSS, but the solution for the missing $D_{ij}$ is not unique: different values for the missing $D_{ij}$ result in the same minimum RSS. The patristic distances $D_{p.bonobo.i}$ and $D_{p.chimpanzee.i}$, where i stands for any other species, do not change when x1 changes to x2 (fig. 1d), and so neither does RSS in equation (1). DAMBE uses the midpoint distance in such cases.

Bootstrap/Jackknife Support for PhyPA

For each pair of sequences, we can obtain a vector S of 10 $N_{ij}$ values (number of pairs with nucleotide i in one sequence and j in another). With 10 sequences and 45 pairwise comparisons, there are 45 S vectors from which we can compute the 45 pairwise distances.
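
The S vector of a sequence pair can be illustrated with a minimal Python sketch. This is not DAMBE's code: the function names (`s_vector`, `bootstrap_s_vector`), the ordering of the 10 unordered nucleotide pairs, and the decision to skip columns containing gaps or ambiguous codes are all assumptions made for illustration. The resampling helper assumes that bootstrapping a pair amounts to drawing aligned columns with replacement, which is equivalent to drawing pair categories in proportion to their counts.

```python
import random
from collections import Counter
from itertools import combinations_with_replacement

NUCS = "ACGT"
# The 10 unordered nucleotide pairs: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT
PAIRS = ["".join(p) for p in combinations_with_replacement(NUCS, 2)]


def s_vector(seq1, seq2):
    """Count the 10 unordered nucleotide pairs over the aligned columns.

    Assumption: columns with a gap or ambiguous code in either sequence
    are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("pairwise-aligned sequences must have equal length")
    counts = Counter()
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a in NUCS and b in NUCS:
            counts["".join(sorted((a, b)))] += 1
    return [counts[p] for p in PAIRS]


def bootstrap_s_vector(s, rng):
    """Resample sum(s) columns with replacement, in proportion to s."""
    n = sum(s)
    if n == 0:
        return list(s)
    resampled = [0] * len(s)
    for idx in rng.choices(range(len(s)), weights=s, k=n):
        resampled[idx] += 1
    return resampled


if __name__ == "__main__":
    s = s_vector("ACGTAC-A", "ACGTGTAA")
    print(dict(zip(PAIRS, s)))       # counts per unordered pair
    print(bootstrap_s_vector(s, random.Random(1)))
```

Working on the counts rather than on the aligned columns keeps each replicate cheap, since a pairwise distance needs only the 10 counts.
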
For bootstrapping/jackknifing, we simply resample each pair to generate an S vector and use the 45 S vectors to produce a new set of 45 pairwise distances from which a tree can be reconstructed. This function complements the function of phylogenetics with imputed missing distances.

An Improved Method for Estimating the Shape Parameter of Gamma Distribution

Substitution rate varies over sites, and the variation is particularly pronounced in protein-coding genes (Xia 1998). The method by Gu and Zhang (1997) uses the following probability density function (Johnson and Kotz 1969) to estimate α:

$$f(k) = \frac{\Gamma(\alpha + k)}{\Gamma(k + 1)\,\Gamma(\alpha)} \left(\frac{\bar{k}}{\bar{k} + \alpha}\right)^{k} \left(\frac{\alpha}{\bar{k} + \alpha}\right)^{\alpha} \tag{2}$$

where k, instead of being an integer, is replaced by the estimated number of substitutions per site, and $\bar{k}$ is the mean of k. The method’s accuracy depends on the accuracy of the estimated k, which comes from a multiple alignment in two steps (fig. 1e and f): 1) construct a phylogenetic tree from the aligned sequences and reconstruct ancestral sequences at internal nodes (fig. 1e, showing one of several possible reconstructions for one site with nucleotides mapped to the leaves), and 2) perform pairwise comparisons between the two nodes on each side of a branch to obtain the observed number of substitutions per site, and apply correction for multiple hits to get k (fig. 1f).

DAMBE improves this estimation in two ways. First, it uses simultaneous estimation (SE). Take the K80 model for example. At each site,

$$E(P_i) = \frac{1}{4} + \frac{1}{4} e^{-\frac{4 D_i}{\kappa + 2}} - \frac{1}{2} e^{-\frac{2 D_i (\kappa + 1)}{\kappa + 2}} \tag{3}$$

$$E(Q_i) = \frac{1}{2} - \frac{1}{2} e^{-\frac{4 D_i}{\kappa + 2}} \tag{4}$$

where $D_i$ is the K80 distance and κ is the transition/transversion ratio, not to be confused with k in equation (2), which is the estimated number of substitutions for a site. Applying equations (3) and (4) to data from the three sites (fig. 1g) independently will generate one inapplicable case for site 2 (under $D_{IE}$ in fig. 1g, with IE for independent estimation).
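
How an inapplicable case arises under independent estimation can be seen in a short Python sketch. It assumes the standard closed-form K80 distance, $D = -\frac{1}{2}\ln(1 - 2P - Q) - \frac{1}{4}\ln(1 - 2Q)$, which follows from inverting equations (3) and (4). The function names are hypothetical and the counts in the example are made up; they are not the values shown in fig. 1g.

```python
import math


def k80_expected(d, kappa):
    """Expected transition (P) and transversion (Q) proportions,
    following equations (3) and (4)."""
    e1 = math.exp(-4.0 * d / (kappa + 2.0))
    e2 = math.exp(-2.0 * d * (kappa + 1.0) / (kappa + 2.0))
    return 0.25 + 0.25 * e1 - 0.5 * e2, 0.5 - 0.5 * e1


def k80_distance(n_s, n_v, n_i):
    """Independent K80 estimate for one site from its counts of
    transitions (n_s), transversions (n_v), and identities (n_i).

    Returns None when the estimate is inapplicable, i.e. when a
    logarithm argument is not positive.
    """
    n = n_s + n_v + n_i
    if n == 0:
        return None
    p, q = n_s / n, n_v / n
    a = 1.0 - 2.0 * p - q
    b = 1.0 - 2.0 * q
    if a <= 0.0 or b <= 0.0:
        return None
    return -0.5 * math.log(a) - 0.25 * math.log(b)


if __name__ == "__main__":
    # Hypothetical counts for two sites (not taken from fig. 1g).
    print(k80_distance(1, 1, 8))   # a finite distance
    print(k80_distance(3, 1, 1))   # None: 1 - 2P - Q is negative
```

With only a handful of branch comparisons per site, the observed proportions can easily push a logarithm argument to zero or below, which is the kind of case the text calls inapplicable.
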
We can estimate all $D_i$ and κ simultaneously by maximizing the following log-likelihood:

$$\ln L = \sum_{i=1}^{N} \left\{ N_{s.i} \ln E(P_i) + N_{v.i} \ln E(Q_i) + N_{I.i} \ln\left[1 - E(P_i) - E(Q_i)\right] \right\} \tag{5}$$

where N is the number of sites, and $N_{s.i}$, $N_{v.i}$, and $N_{I.i}$ are the recorded numbers of transitional differences, transversional differences, and no differences, respectively, from pairwise comparisons along the tree between nodes on each side of each branch at site i. SE generates no inapplicable cases ($D_{SE}$ in fig. 1g) and enables the second improvement: using more realistic models such as F84 or TN93 instead of the K80 correction in GZ-gamma (Gu and Zhang 1997). SE distance is used in MEGA (Tamura et al. 2004) and DAMBE (Xia 2009), which includes MLCompositeF84 and MLCompositeTN93 for the F84 and TN93 models, respectively, but has never been used in estimating the shape parameter.

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.

Acknowledgments

This work was supported by Discovery Grant RGPIN-2018-03878 from the Natural Sciences and Engineering Research Council of Canada.

Literature Cited

Felsenstein J. 2014. PHYLIP 3.695 (phylogeny inference package). Seattle: Department of Genetics, University of Washington.
Gallus S, Hallström BM, Kumar V, Dodt WG, Janke A, Schumann GG, Nilsson MA. 2015. Evolutionary histories of transposable elements in the genome of the largest living marsupial carnivore, the Tasmanian devil. Mol Biol Evol. 32(5): 1268–1283.
Gu X, Zhang J. 1997. A simple method for estimating the parameter of substitution rate variation among sites. Mol Biol Evol. 14(11): 1106–1113.
Johnson NL, Kotz S. 1969. Discrete distributions. Boston: Houghton Mifflin.
Kumar S, Stecher G, Tamura K. 2016. MEGA7: molecular evolutionary genetics analysis version 7.0 for bigger datasets. Mol Biol Evol. 33(7): 1870–1874.
Tamura K, Nei M, Kumar S. 2004. Prospects for inferring very large phylogenies by using the neighbor-joining method. Proc Natl Acad Sci U S A. 101(30): 11030–11035.
Xia X. 1998. The rate heterogeneity of nonsynonymous substitutions in mammalian mitochondrial genes. Mol Biol Evol. 15(3): 336–344.
Xia X. 2009. Information-theoretic indices and an approximate significance test for testing the molecular clock hypothesis with genetic distances. Mol Phylogenet Evol. 52(3): 665–676.
Xia X. 2013. DAMBE5: a comprehensive software package for data analysis in molecular biology and evolution. Mol Biol Evol. 30(7): 1720–1728.
Xia X. 2016. PhyPA: phylogenetic method with pairwise sequence alignment outperforms likelihood methods in phylogenetics involving highly diverged sequences. Mol Phylogenet Evol. 102: 331–343.
Xia X. 2017. DAMBE6: new tools for microbial genomics, phylogenetics, and molecular evolution. J Hered. 108(4): 431–437.

© The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com

Publisher: Oxford University Press
Copyright: © The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
ISSN: 0737-4038
eISSN: 1537-1719
DOI: 10.1093/molbev/msy073


Journal: Molecular Biology and Evolution, Oxford University Press

Published: Apr 14, 2018
