Molecular Biology and Evolution, Volume Advance Article (6) – Apr 14, 2018

3 pages


- Publisher: Oxford University Press
- Copyright: © The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
- ISSN: 0737-4038
- eISSN: 1537-1719
- DOI: 10.1093/molbev/msy073

Abstract

DAMBE is a comprehensive software package for genomic and phylogenetic data analysis on Windows, Linux, and Macintosh computers. New functions include imputing missing distances and phylogeny simultaneously (paving the way to build large phage and transposon trees), new bootstrapping/jackknifing methods for PhyPA (phylogenetics from pairwise alignments), and an improved function for fast and accurate estimation of the shape parameter of the gamma distribution for fitting rate heterogeneity over sites. The previous method corrects multiple hits for each site independently, whereas DAMBE’s new method uses all sites simultaneously for correction. DAMBE, featuring a user-friendly graphic interface, is freely available from http://dambe.bio.uottawa.ca (last accessed April 17, 2018).

Keywords: phylogenetics, bioinformatics, missing distance imputation, rate heterogeneity over sites

DAMBE is a package for descriptive and comparative sequence analysis (Xia 2013, 2017). It features a graphic, user-friendly, and intuitive interface and is available free for Windows, Linux, and Macintosh computers at dambe.bio.uottawa.ca. DAMBE7 represents a major upgrade with many new functions, including new sets of significance tests for position weight matrices and a Gibbs sampler for de novo characterization of sequence motifs. I outline the three functions most relevant to molecular evolution and phylogenetics. A supplemental file (Using_New_Functions.docx) is included in Supplementary Material online.

Imputing Missing Distance and Phylogeny Simultaneously

This function is implemented for building large trees of phages, which often 1) are too diverged to build a multiple sequence alignment (MSA), and 2) do not share homologous genes/sites (e.g., S3 and S4 in fig. 1a). This is also true for many transposons from which one cannot get a meaningful MSA, so researchers are limited to aligning the sequences against the consensus (Gallus et al. 2015).
One can do pairwise alignment among most of the sequences and compute their distances, but some sequence pairs do not share homologous sites and need to have their distances imputed from the computable distances. This allows one to build trees and will likely revolutionize phage taxonomy, which is currently not based on phylogeny.

Fig. 1. Illustration of distance imputation and estimation of the shape parameter in gamma distribution. (a) A sequence data set with S3 and S4 sharing no homologous sites to estimate distance. (b) Distance matrix with two shaded distances missing. (c) Tree reconstructed from the distance matrix in (b). (d) A case with a nonunique solution for a missing distance between bonobo and chimpanzee. (e) Tree reconstructed from a multiple alignment with one site mapped to the leaves, together with one of several possible reconstructions of internal nodes. (f) Counting changes between neighboring nodes and correction for multiple hits. (g) Transitions and transversions at three sites illustrating independently estimated distance ($D_{IE}$) and simultaneously estimated distance ($D_{SE}$).

This kind of distance-imputation function is currently missing from widely used phylogenetics software.
MEGA (Kumar et al. 2016) does not impute missing distances, and neither does PHYLIP’s DNADIST (Felsenstein 2014). The Fitch and Kitsch programs in PHYLIP can estimate missing distances if a user tree is provided. For a distance matrix with N missing distances (parameters), DAMBE searches the tree space and parameter space to find a tree with the N parameters that minimizes

$$\mathrm{RSS} = \sum \frac{\left[D_{ij} - E(D_{ij})\right]^{2}}{D_{ij}^{m}} \tag{1}$$

where $D_{ij}$ and $E(D_{ij})$ are the observed and patristic distances, respectively, and m is typically 0, 1, or 2. Figure 1c is the phylogenetic tree reconstructed from the distance matrix in figure 1b with two shaded distances missing. For sequences such as those in figure 1a, DAMBE will compute all computable distances and impute the missing distances. When bootstrapping/jackknifing is used, distance imputation and phylogeny inference are done for each resampled data set. One may also have unaligned sequence data and use PhyPA (Xia 2016) to build phylogenetic trees and obtain bootstrap/jackknife support.

There are cases where a unique solution cannot be obtained. For example, when a missing distance is for two sister taxa (e.g., bonobo and chimpanzee in fig. 1b and c), we can find the minimum RSS, but the solution for the missing $D_{ij}$ is not unique: different values for the missing $D_{ij}$ result in the same minimum RSS. The patristic distances $D_{p.\mathrm{bonobo}.i}$ and $D_{p.\mathrm{chimpanzee}.i}$, where i stands for any other species, do not change when x1 changes to x2 (fig. 1d), and so neither does the RSS in equation (1). DAMBE uses the midpoint distance in such cases.

Bootstrap/Jackknife Support for PhyPA

For each pair of sequences, we can obtain a vector S of 10 $N_{ij}$ values (the number of pairs with nucleotide i in one sequence and j in the other). With 10 sequences and 45 pairwise comparisons, there are 45 S vectors from which we can compute the 45 pairwise distances.
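The S vector described above can be illustrated with a minimal Python sketch. It is not DAMBE's implementation: the function names (`pair_count_vector`, `p_distance`), the ordering of the 10 pair types, the choice to skip columns containing gaps or ambiguous symbols, and the toy sequences are all assumptions made for illustration. The sketch treats the 10 values as counts of unordered nucleotide pairs (AA, AC, ..., TT) over the aligned columns of one pairwise alignment.

```python
from itertools import combinations, combinations_with_replacement

NUCLEOTIDES = "ACGT"
# The 10 unordered nucleotide pairs: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT.
PAIR_TYPES = ["".join(p) for p in combinations_with_replacement(NUCLEOTIDES, 2)]


def pair_count_vector(seq_a, seq_b):
    """Count unordered nucleotide pairs over the columns of a pairwise alignment.

    Columns with a gap or an ambiguous symbol are skipped (an assumption).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    counts = dict.fromkeys(PAIR_TYPES, 0)
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in NUCLEOTIDES and b in NUCLEOTIDES:
            counts["".join(sorted((a, b)))] += 1
    return [counts[p] for p in PAIR_TYPES]


def p_distance(s_vector):
    """Proportion of differing sites, computed from an S vector alone."""
    total = sum(s_vector)
    same = sum(n for p, n in zip(PAIR_TYPES, s_vector) if p[0] == p[1])
    return (total - same) / total


# Hypothetical toy alignment of two sequences (the last column holds a gap).
s = pair_count_vector("ACGTACGTAC", "ACGTTCGAA-")
d = p_distance(s)

# With 10 sequences there is one S vector per unordered pair of sequences.
n_pairs = len(list(combinations(range(10), 2)))
```

Because a distance can be computed from S alone, any distance formula that depends only on these pair counts can be applied to each of the 45 vectors in turn; the simple p-distance is used here only to keep the sketch short.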
For bootstrapping/jackknifing, we simply resample each pair to generate an S vector and use the 45 S vectors to produce a new set of 45 pairwise distances from which a tree can be reconstructed. This function complements the function of phylogenetics with imputed missing distances.

An Improved Method for Estimating the Shape Parameter of Gamma Distribution

Substitution rate varies over sites, and the variation is particularly pronounced in protein-coding genes (Xia 1998). The method by Gu and Zhang (1997) uses the following probability density function (Johnson and Kotz 1969) to estimate α:

$$f(k) = \frac{\Gamma(\alpha + k)}{\Gamma(k + 1)\,\Gamma(\alpha)} \left(\frac{\bar{k}}{\bar{k} + \alpha}\right)^{k} \left(\frac{\alpha}{\bar{k} + \alpha}\right)^{\alpha} \tag{2}$$

where k, instead of being an integer, is replaced by the estimated number of substitutions per site, and $\bar{k}$ is the mean of k. The method’s accuracy depends on the accuracy of the estimated k, which comes from a multiple alignment in two steps (fig. 1e and f): 1) construct a phylogenetic tree from the aligned sequences and reconstruct ancestral sequences at internal nodes (fig. 1e, showing one of several possible reconstructions for one site with nucleotides mapped to the leaves), and 2) perform pairwise comparisons between the two nodes on each side of a branch to obtain the observed number of substitutions per site, and apply a correction for multiple hits to get k (fig. 1f).

DAMBE improves this estimation in two ways. First, it uses simultaneous estimation (SE). Take the K80 model for example. At each site,

$$E(P_i) = \frac{1}{4} + \frac{1}{4} e^{-\frac{4 D_i}{\kappa + 2}} - \frac{1}{2} e^{-\frac{2 D_i (\kappa + 1)}{\kappa + 2}} \tag{3}$$

$$E(Q_i) = \frac{1}{2} - \frac{1}{2} e^{-\frac{4 D_i}{\kappa + 2}} \tag{4}$$

where $D_i$ is the K80 distance and κ is the transition/transversion ratio, not to be confused with k in equation (2), which is the estimated number of substitutions for a site. Applying equations (3) and (4) to data from the three sites (fig. 1g) independently generates one inapplicable case for site 2 (under $D_{IE}$ in fig. 1g, with IE for independent estimation).
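To make the inapplicable case concrete, the following Python sketch evaluates equations (3) and (4) and applies the standard closed-form K80 distance to a single pair of observed proportions. It is an illustration under stated assumptions, not DAMBE's code: the function names are hypothetical, and the proportions used to trigger the inapplicable case are made-up numbers, not the values in fig. 1g.

```python
import math


def k80_expected(d, kappa):
    """Expected transition (P) and transversion (Q) proportions under K80,
    following equations (3) and (4)."""
    e_q = 0.5 - 0.5 * math.exp(-4.0 * d / (kappa + 2.0))
    e_p = (0.25
           + 0.25 * math.exp(-4.0 * d / (kappa + 2.0))
           - 0.5 * math.exp(-2.0 * d * (kappa + 1.0) / (kappa + 2.0)))
    return e_p, e_q


def k80_distance_independent(p, q):
    """Closed-form K80 distance from observed proportions P and Q.

    Returns None when a logarithm's argument is not positive, which is
    the inapplicable case that arises with independent estimation.
    """
    a = 1.0 - 2.0 * p - q
    b = 1.0 - 2.0 * q
    if a <= 0.0 or b <= 0.0:
        return None
    return -0.5 * math.log(a) - 0.25 * math.log(b)


# Round trip: expected proportions at D = 0.3, kappa = 2 recover D.
e_p, e_q = k80_expected(0.3, 2.0)
recovered = k80_distance_independent(e_p, e_q)

# Hypothetical observed proportions for which the formula is inapplicable,
# because 1 - 2P - Q is negative.
inapplicable = k80_distance_independent(0.5, 0.1)
```

With few comparisons per site, the observed P and Q are noisy, so a non-positive logarithm argument is easy to hit when each site is corrected on its own.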
We can estimate all $D_i$ and κ simultaneously by maximizing the following log-likelihood:

$$\ln L = \sum_{i=1}^{N} \left\{ N_{s.i} \ln E(P_i) + N_{v.i} \ln E(Q_i) + N_{I.i} \ln\left[1 - E(P_i) - E(Q_i)\right] \right\} \tag{5}$$

where N is the number of sites, and $N_{s.i}$, $N_{v.i}$, and $N_{I.i}$ are the recorded numbers of transitional differences, transversional differences, and identical pairs, respectively, from pairwise comparisons along the tree between the nodes on each side of each branch at site i. SE generates no inapplicable cases ($D_{SE}$ in fig. 1g) and enables the second improvement: using more realistic models such as F84 or TN93 instead of the K80 correction in GZ-gamma (Gu and Zhang 1997). SE distances are used in MEGA (Tamura et al. 2004) and in DAMBE (Xia 2009), which includes MLCompositeF84 and MLCompositeTN93 for the F84 and TN93 models, respectively, but they have never been used in estimating the shape parameter.

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.

Acknowledgments

This work was supported by Discovery Grant RGPIN-2018-03878 from the Natural Sciences and Engineering Research Council of Canada.

Literature Cited

Felsenstein J. 2014. PHYLIP 3.695 (phylogeny inference package). Seattle: Department of Genetics, University of Washington.
Gallus S, Hallström BM, Kumar V, Dodt WG, Janke A, Schumann GG, Nilsson MA. 2015. Evolutionary histories of transposable elements in the genome of the largest living marsupial carnivore, the Tasmanian devil. Mol Biol Evol. 32(5):1268–1283.
Gu X, Zhang J. 1997. A simple method for estimating the parameter of substitution rate variation among sites. Mol Biol Evol. 14(11):1106–1113.
Johnson NL, Kotz S. 1969. Discrete distributions. Boston: Houghton Mifflin.
Kumar S, Stecher G, Tamura K. 2016. MEGA7: molecular evolutionary genetics analysis version 7.0 for bigger datasets. Mol Biol Evol. 33(7):1870–1874.
Tamura K, Nei M, Kumar S. 2004. Prospects for inferring very large phylogenies by using the neighbor-joining method. Proc Natl Acad Sci U S A. 101(30):11030–11035.
Xia X. 1998. The rate heterogeneity of nonsynonymous substitutions in mammalian mitochondrial genes. Mol Biol Evol. 15(3):336–344.
Xia X. 2009. Information-theoretic indices and an approximate significance test for testing the molecular clock hypothesis with genetic distances. Mol Phylogenet Evol. 52(3):665–676.
Xia X. 2013. DAMBE5: a comprehensive software package for data analysis in molecular biology and evolution. Mol Biol Evol. 30(7):1720–1728.
Xia X. 2016. PhyPA: phylogenetic method with pairwise sequence alignment outperforms likelihood methods in phylogenetics involving highly diverged sequences. Mol Phylogenet Evol. 102:331–343.
Xia X. 2017. DAMBE6: new tools for microbial genomics, phylogenetics, and molecular evolution. J Hered. 108(4):431–437.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com

