Datamonkey 2010: a suite of phylogenetic analysis tools for evolutionary biology

BIOINFORMATICS APPLICATIONS NOTE
Vol. 26 no. 19 2010, pages 2455–2457
doi:10.1093/bioinformatics/btq429
Sequence analysis
Advance Access publication July 29, 2010

Wayne Delport 1, Art F. Y. Poon 2, Simon D. W. Frost 3 and Sergei L. Kosakovsky Pond 4,∗

1 Department of Pathology, Antiviral Research Center, University of California, San Diego, CA, USA, 2 British Columbia Centre for Excellence in HIV/AIDS, Vancouver, British Columbia, Canada, 3 Department of Veterinary Medicine, University of Cambridge, Cambridge, UK and 4 Department of Medicine, Antiviral Research Center, University of California, San Diego, CA, USA

Associate Editor: David Posada

∗ To whom correspondence should be addressed.

© The Author 2010. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

ABSTRACT

Summary: Datamonkey is a popular web-based suite of phylogenetic analysis tools for use in evolutionary biology. Since the original release in 2005, we have expanded the analysis options to include recently developed algorithmic methods for recombination detection, evolutionary fingerprinting of genes, codon model selection, co-evolution between sites, identification of sites that rapidly escape host immune pressure and HIV-1 subtype assignment. The traditional selection tools have also been augmented to include recent developments in the field. Here, we summarize the analysis options currently available on Datamonkey, and provide guidelines for their use in evolutionary biology.

Availability and documentation: http://www.datamonkey.org

Contact: spond@ucsd.edu

Received and revised on June 15, 2010; accepted on July 20, 2010

1 INTRODUCTION

Recent developments of high-throughput sequencing technologies have accelerated the rate at which genomic data are accumulating by orders of magnitude. Concurrent commoditization of cheap parallel computer systems (clusters, GPUs and multi-core systems) and rapid development of algorithmic, statistical and bioinformatics techniques have made it possible to analyze these genomic data with models of increased biological realism. To make such models developed by ourselves and other groups immediately useful to the life sciences community, we deployed a public web service to screen alignments of homologous sequences for signatures of natural selection using three different phylogenetic methods (Kosakovsky Pond and Frost, 2005a; Kosakovsky Pond and Frost, 2005c) on a 40-processor cluster in 2005. The server proved to be popular, processing over 100 000 submitted jobs, many of which would require days or weeks of desktop CPU time. Since the original release, we have completely redesigned the user interface, upgraded our cluster to one with 356 CPU cores, and implemented 12 new analytical modules and a plethora of result processing and visualization features. Improvements to core algorithms in the HyPhy package (Kosakovsky Pond et al., 2005) have resulted in significant (up to 10×) speedups and allowed us to increase the sizes of alignments that can be submitted.

2 METHODS

2.1 Natural selection

2.1.1 Diversifying and purifying selection acting on sites
Datamonkey was originally designed to provide a front end to an implementation of three approaches (SLAC, FEL and REL; Kosakovsky Pond and Frost, 2005a; Kosakovsky Pond and Frost, 2005c) to finding the sites in a multiple sequence alignment that may have been affected by purifying or diversifying selection. These and nearly all other methods have been upgraded to correct for the confounding effect of recombination using the partitioning approach, whereby the alignment is partitioned (computationally, e.g. using GARD) into non-recombinant fragments, and each one of those is endowed with a separate phylogeny (Scheffler et al., 2006). Result processing allows users to visualize and report the distribution of inferred substitutions on a site-by-site basis. The new PARRIS module furnishes a likelihood ratio test (LRT) for non-neutral evolution that is analogous to the original test of Nielsen and Yang (1998), but corrects for the confounding effect of recombination and permits synonymous substitution rates to vary from site to site.

2.1.2 'Population level' selection using iFEL
When one is interested in selective pressures that are restricted to interior branches of the tree, e.g. as described in the context of population-level HIV-1 evolution in Kosakovsky Pond et al. (2006a), the iFEL (internal branches FEL) method is appropriate.

2.1.3 Lineage specific selection using GABranch
This component executes a genetic algorithm (GA) search for lineages that are subject to differing mean selective pressures (Kosakovsky Pond and Frost, 2005b). Instead of addressing the question 'where in the gene has selection acted?', which the previous tools are designed for, this analysis answers the question 'when in the past has selection acted?', assuming that selection acts uniformly across sites.

2.1.4 Directional evolution of protein sequences using DEPS
In Kosakovsky Pond et al. (2008), we proposed a model-based test for directional evolution in protein sequences, capable of identifying such frequency changes, or, more generally, deviations from the 'background' substitution patterns that favor substitutions towards a particular residue. Given an amino-acid alignment and a rooted phylogenetic tree, DEPS reports whether or not there is evidence that a proportion of sites are evolving towards each of the 20 amino-acid residues. For those 'target' residues that pass this test, DEPS carries out an empirical Bayes analysis to pinpoint which sites may be directionally evolving towards a given residue, along with a heuristic interpretation of the type of selection that could have caused the inferred pattern of substitutions.

2.1.5 'Toggling' selection using TOGGLE
The best example of toggling selection can be found in HIV-1 sequences, which can acquire mutations in one host, e.g. in response to immune selection or drug therapy, and revert these mutations following subsequent transmissions to hosts that are not on treatment or do not raise the selecting immune response. Using the approach of Delport et al. (2008), TOGGLE searches a subset of sites identified by the user (e.g. based on inferred substitution patterns or association with immune targets) for evidence of elevated rates of substitution away from and back to the unknown wildtype residue. At every analyzed site, all possible wildtype residues are examined (several different wildtype residues can be consistent with an evolutionary history of a site), and those which return a (corrected) LRT P < 0.05 are reported. Visualization tools are available to assist in interpreting the patterns of substitutions at a site, and the evolutionary pathways between residues.

2.2 Recombination detection: GARD and SCUEAL

GAs for recombination detection (GARD) are a highly sensitive and accurate approach for screening alignments for evidence of phylogenetically incongruent segments (Kosakovsky Pond et al., 2006b). Since the original release (Kosakovsky Pond et al., 2006c), the GARD module in Datamonkey has been significantly upgraded, e.g. to automatically perform Kishino–Hasegawa tests for topological incongruence and compute Robinson–Foulds distances between conflicting topologies. This step helps tease apart the two most common causes of phylogenetic incongruence: recombination and heterotachy. A specialized refinement of GARD can be used for detecting recombination in a single sequence by screening against a reference alignment with a precomputed phylogeny (Kosakovsky Pond et al., 2009). This type of analysis is most commonly used to infer the recombination or reassortment history of HIV-1 or Influenza A virus strains, and forms the basis of genetically delineated viral subtypes. The SCUEAL module currently implements HIV-1 subtyping based on the most frequently sequenced pol gene and is capable of processing several hundred sequences per hour.

2.3 Model selection

2.3.1 Protein model selection
We have implemented a simple model selection procedure to rank 14 empirical amino acid substitution models (this list is regularly updated) using AIC, AICc and BIC, similar to the ideas of ProtTest (Abascal et al., 2005). For each model, a version with published stationary frequencies and another (+F) with frequencies tabulated from the alignment under consideration are evaluated.

2.3.2 Codon model selection using CMS
The problem of properly modeling mechanistic (synonymous versus non-synonymous) and empirical (the dependence of substitution rates on the amino acids encoded by the source and target codons) components of codon-based evolution is computationally challenging, as there are combinatorially many possible codon models. In Delport et al. (2010), we have described a statistical approach to partition all pairwise substitution rates into groups, akin to how, for example, the HKY85 (Hasegawa et al., 1985) model partitions nucleotide substitutions into transitions and transversions, and to search for well-fitting models of this type using a computationally feasible and accurate GA. The CMS analysis reports the number and membership of non-synonymous rate classes. Using multi-model based inference, CMS generates substitution rate profiles for each residue pair, determines the confidence with which each pair is allocated to a rate class and computes correlations between substitution rates and physico-chemical properties. We are currently developing a database with thousands of gene- and organism-specific codon evolutionary models to assist the users in selecting an appropriate evolutionary model for their alignments.

2.4 Evolutionary fingerprinting using EVOBLAST

The EVOBLAST module provides an implementation of the gene evolutionary fingerprinting procedure described in Kosakovsky Pond et al. (2010). It first fits a flexible general bivariate distribution of synonymous and non-synonymous substitution rates to a coding sequence alignment (the appropriate number of rate classes is determined automatically). An approximate posterior sample of the inferred rates is obtained and converted to an evolutionary fingerprint. Site-by-site inference of positive selection using this posterior sample is analogous to the Bayes Empirical Bayes (Yang et al., 2005) approach that attempts to account for the errors in estimated model parameters. However, the primary purpose of EVOBLAST is to enable the comparison of inferred evolutionary properties between genes using Evolutionary Selection Distance, as described in the methodology paper. We are currently developing the functionality to compare users' alignments against a database of annotated (e.g. taxonomically and functionally) fingerprints and permit the users to add their own alignments to the database. In this fashion, it may be possible to create a large database of evolutionary properties of many genes sampled from different taxonomic levels to power quantitative comparisons of non-homologous sequence data.

2.5 Ancestral state reconstruction (ASR)

The ASR module accepts a partitioned alignment, provided, e.g. by GARD. Three different likelihood-based methods are used to recover ancestral sequences. First, the joint likelihood method finds the assignment of ancestral characters to maximize the likelihood over all such assignments (Pupko et al., 2000). Second, for each site and ancestral sequence, the marginal likelihood method computes posterior weights for each ancestral character by marginalizing over all other ancestral characters (Yang et al., 1995). Third, 100 samples are drawn from the joint posterior distribution of ancestral characters (Nielsen, 2002). Three ancestral sequences (one for each method) or present in the strict consensus tree of all ancestral segments are returned, together with a report highlighting agreement and discrepancies between the methods.

2.6 Co-evolution between sites: Spidermonkey

The Spidermonkey module (Poon et al., 2008) uses Bayesian network techniques and is geared towards identifying networks of interacting sites in an alignment, based upon the assumption that co-evolving sites will tend to acquire mutations along the same set of branches. Repeated inference with ancestral states sampled from the posterior distribution is useful to evaluate robustness.

3 IMPLEMENTATION

Datamonkey is implemented as a collection of Perl, HyPhy batch language and R scripts, with GnuPlot, GraphViz and GhostScript used for visualization. Data upload, CGI processing, SLAC analyses and result visualization are handled by a dedicated Mac OS X server, while all the other analyses are executed on a 356-core Linux Beowulf (SCYLD) cluster, either as serial or MPI jobs. There are method-group FIFO queues to schedule submissions. Communication between the two systems is performed via SSH tunneling.

4 DISCUSSION

The ever-accelerating pace of methodological research and development places a premium on resources that provide computational and evolutionary biologists and bioinformaticians with fast, maintained and documented modern tools with a consistent and easy-to-use interface. As evidenced by the popularity of the original Datamonkey server, our approach of providing a web-based front end for running computationally intensive statistical sequence analysis tools on a large computer cluster continues to be well-received by the community and we fully intend to develop and extend the functionality of the service as new procedures and analyses are introduced.

Funding: Joint Division of Mathematical Sciences/National Institute of General Medical Sciences Mathematical Biology Initiative through Grant NSF-0714991; National Institutes of Health (AI43638, AI47745 and AI57167); the University of California University wide AIDS Research Program (grant number IS02-SD-701); University of California, San Diego Center for AIDS Research/NIAID Developmental Award (AI36214 to S.D.W.F., S.L.K.P. and W.D.); Royal Society Wolfson Research Merit Award (in part to S.D.W.F.); Canadian Institutes of Health Research (CIHR) Fellowships Award in HIV/AIDS Research (200802HFE) (to A.F.Y.P.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Conflict of Interest: none declared.

REFERENCES

Abascal,F. et al. (2005) ProtTest: selection of best-fit models of protein evolution. Bioinformatics, 21, 2104–2105.
Delport,W. et al. (2008) Frequent toggling between alternative amino acids is driven by selection in HIV-1. PLoS Pathog., 4, e1000242.
Delport,W. et al. (2010) CodonTest: modeling amino-acid substitution preferences in coding sequences. PLoS Comput. Biol., 6, e1000885.
Hasegawa,M. et al. (1985) Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol., 22, 160–174.
Kosakovsky Pond,S.L. and Frost,S.D.W. (2005a) Datamonkey: rapid detection of selective pressure on individual sites of codon alignments. Bioinformatics, 21, 2531–2533.
Kosakovsky Pond,S.L. and Frost,S.D.W. (2005b) A genetic algorithm approach to detecting lineage-specific variation in selection pressure. Mol. Biol. Evol., 22, 478–485.
Kosakovsky Pond,S.L. and Frost,S.D.W. (2005c) Not so different after all: a comparison of methods for detecting amino acid sites under selection. Mol. Biol. Evol., 22, 1208–1222.
Kosakovsky Pond,S.L. et al. (2005) HyPhy: hypothesis testing using phylogenies. Bioinformatics, 21, 676–679.
Kosakovsky Pond,S.L. et al. (2006a) Adaptation to different human populations by HIV-1 revealed by codon-based analyses. PLoS Comput. Biol., 2, e62.
Kosakovsky Pond,S.L. et al. (2006b) Automated phylogenetic detection of recombination using a genetic algorithm. Mol. Biol. Evol., 23, 1891–1901.
Kosakovsky Pond,S.L. et al. (2006c) GARD: a genetic algorithm for recombination detection. Bioinformatics, 22, 3096–3098.
Kosakovsky Pond,S.L. et al. (2008) A maximum likelihood method for detecting directional evolution in protein sequences and its application to influenza A virus. Mol. Biol. Evol., 25, 1809–1824.
Kosakovsky Pond,S.L. et al. (2009) An evolutionary model-based algorithm for accurate phylogenetic breakpoint mapping and subtype prediction in HIV-1. PLoS Comput. Biol., 5, e1000581.
Kosakovsky Pond,S.L. et al. (2010) Evolutionary fingerprinting of genes. Mol. Biol. Evol., 27, 520–536.
Nielsen,R. (2002) Mapping mutations on phylogenies. Syst. Biol., 51, 729–739.
Nielsen,R. and Yang,Z.H. (1998) Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics, 148, 929–936.
Poon,A.F.Y. et al. (2008) Spidermonkey: rapid detection of co-evolving sites using Bayesian graphical models. Bioinformatics, 24, 1949–1950.
Pupko,T. et al. (2000) A fast algorithm for joint reconstruction of ancestral amino acid sequences. Mol. Biol. Evol., 17, 890–896.
Scheffler,K. et al. (2006) Robust inference of positive selection from recombining coding sequences. Bioinformatics, 22, 2493–2499.
Yang,Z. et al. (1995) A new method of inference of ancestral nucleotide and amino acid sequences. Genetics, 141, 1641–1650.
Yang,Z. et al. (2005) Bayes Empirical Bayes inference of amino acid sites under positive selection. Mol. Biol. Evol., 22, 1107–1118.
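The ranking step of the protein model selection procedure (Section 2.3.1) can be illustrated with a short sketch. The AIC, AICc and BIC formulas are the standard ones, and JTT and WAG are real empirical models. The log-likelihoods, parameter counts and sample size below are made-up values for illustration only, and the function names are hypothetical rather than part of Datamonkey or HyPhy. Using the number of alignment columns as the sample size is also an assumption.

```python
import math


def information_criteria(log_l, k, n):
    """Return (AIC, AICc, BIC) for a fit with log-likelihood log_l,
    k free parameters and n observations (assumed: alignment columns)."""
    aic = 2 * k - 2 * log_l
    # Small-sample correction; only defined when n > k + 1.
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)
    bic = k * math.log(n) - 2 * log_l
    return aic, aicc, bic


def rank_models(fits, n, criterion="AICc"):
    """Sort models from best (lowest score) to worst under one criterion.

    fits maps a model name to (log-likelihood, number of free parameters).
    """
    index = {"AIC": 0, "AICc": 1, "BIC": 2}[criterion]
    scored = [
        (name, information_criteria(log_l, k, n)[index])
        for name, (log_l, k) in fits.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical fits. The +F variants carry 19 extra free parameters
# for amino acid frequencies estimated from the alignment.
example_fits = {
    "JTT": (-5230.4, 37),
    "JTT+F": (-5201.9, 56),
    "WAG": (-5248.7, 37),
    "WAG+F": (-5215.3, 56),
}

if __name__ == "__main__":
    for criterion in ("AIC", "AICc", "BIC"):
        ranking = rank_models(example_fits, n=300, criterion=criterion)
        print(criterion, [name for name, _ in ranking])
```

With these invented numbers, AIC ranks JTT+F first while BIC ranks JTT first, because BIC penalizes the 19 additional frequency parameters more heavily at this sample size. This is one reason to report all three criteria side by side.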

Publisher
Oxford University Press
Copyright
© The Author 2010. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/btq429
pmid
20671151


Journal

Bioinformatics, Oxford University Press

Published: Jul 29, 2010