
Datamonkey: rapid detection of selective pressure on individual sites of codon alignments

Vol. 21 no. 10 2005, pages 2531–2533
BIOINFORMATICS APPLICATIONS NOTE
doi:10.1093/bioinformatics/bti320
Phylogenetics

Sergei L. Kosakovsky Pond and Simon D. W. Frost
Antiviral Research Center, University of California, San Diego, CA, 92103, USA
Received on April 14, 2004; revised on January 28, 2005; accepted on February 8, 2005
Advance Access publication February 15, 2005

ABSTRACT
Summary: Datamonkey is a web interface to a suite of cutting edge maximum likelihood-based tools for identification of sites subject to positive or negative selection. The methods range from very fast data exploration to some of the most complex models available in public domain software, and are implemented to run in parallel on a cluster of computers.
Availability: http://www.datamonkey.org. In the future, we plan to expand the collection of available analytic tools, and provide a package for installation on other systems.
Contact: spond@ucsd.edu

1 INTRODUCTION
The detection and quantification of evolutionary pressures that have contributed to genetic variation has been an active area of recent research (Yang, 2002) and has become an accepted part of a statistical toolbox used in sequence analysis. There are several popular statistical methods for the identification of rapidly evolving and unusually conserved sites in regions of protein coding sequences, which rely upon estimating site-specific synonymous (dS) and non-synonymous (dN) substitution rate parameters, and performing statistical tests to determine whether dS = dN. Two widely used methods that make use of the evolutionary history of sampled sequences are the likelihood-based approach described by Nielsen and Yang (1998) and implemented in the popular PAML package, developed by Yang (1997), and a parsimony-based counting method of Suzuki and Gojobori (1999) implemented in the ADAPTSITE program, written by Suzuki et al. (2001). Kosakovsky Pond and Frost (2005b) proposed an integrative approach, combining the strengths of both of the above approaches and offering several new algorithms as well. Datamonkey is a web-based gateway to the suite of these algorithms, executed by HyPhy, a molecular evolution analysis platform (Kosakovsky Pond et al., 2004), running analyses in parallel on a cluster of computers, with a streamlined and easy-to-use interface.

2 METHODS
Datamonkey implements three complementary methods for detecting sites under selection. All theoretical and technical aspects of the methods and performance comparison can be found in Kosakovsky Pond and Frost (2005b).

Single likelihood ancestor counting (SLAC) is a heavily modified and improved derivative of the Suzuki–Gojobori counting approach. SLAC can process an alignment with 100 sequences and 400 codons in about a minute, using likelihood-based branch lengths, nucleotide and codon substitution parameters and ancestral sequence reconstructions. SLAC has good power to detect non-neutral evolution in large (>50 sequences) alignments.

Fixed effects likelihood (FEL) is a new likelihood-based and statistically rigorous method to fit an independent dN and dS to every site in the context of codon substitution models and test whether dN = dS. This method has been parallelized to run quickly on an MPI cluster and tends to be less conservative than SLAC on datasets of intermediate size (20–50 sequences).

Random effects likelihood (REL) is an improved variant of the Nielsen–Yang approach, which uses flexible but not overly parameter-rich rate distributions (Kosakovsky Pond and Frost, 2005a) and allows both dS and dN to vary across sites independently. Kosakovsky Pond and Muse (2005) suggest that accounting for nucleotide substitution biases and synonymous site-to-site variation helps reduce Type I errors. This method has been parallelized to run on an MPI cluster, and while it is the most powerful of the three methods, REL is somewhat susceptible to Type I errors, especially for small datasets, where parameter estimates are likely to have large associated errors.

3 IMPLEMENTATION
The interface has been constructed using open source, public domain components such as Apache Web server (http://www.apache.org) with custom Perl CGI and HyPhy batch language scripts which perform pre-processing of uploaded alignments and post-processing of analysis results (Fig. 1) and HyPhy scripts for executing the analyses. HyPhy runs complex analyses in parallel on clusters of computers which support the MPICH (http://www-unix.mcs.anl.gov/mpi/mpich/) implementation of the message passing interface (MPI) protocol. Presently, the analyses are hosted on a Linux cluster of eight dual processor Athlon MP 1.4 GHz nodes. The implementation is completely self-contained and allows users, among other things, the following:

(1) Upload an alignment in one of the several standard data formats, such as NEXUS, PHYLIP, MEGA or FASTA. The alignment is checked for validity, including the presence of stop codons.
(2) Run a locally hosted BLAST (Altschul et al., 1997) search on the sequences to classify the organisms.
(3) Perform phylogenetic reconstruction using an efficient implementation of the neighbor joining method (Saitou and Nei, 1987) and render high-quality PDF phylograms.
(4) Invoke a model selection procedure proposed in Kosakovsky Pond and Frost (2005a) to quickly decide which evolutionary model is appropriate for their alignments; this procedure is unique, as it explores 203 time reversible models, rather than a limited subset of 'named' models.
(5) Detect which sites in the alignment evolve adaptively and those which are functionally constrained. Our methods are orders of magnitude faster than popular existing methods, running essentially interactively, while offering a more statistically robust framework for estimating confidence in inferred results (Kosakovsky Pond and Frost, 2005b). The user can then run: SLAC (up to 150 sequences), FEL (up to 50 sequences) or REL (up to 25 sequences) to locate sites undergoing adaptive or purifying evolution. All methods provide progress updates and intermediate results.
(6) For SLAC analyses, the user can view inferred mutations for each site and, optionally, map them onto a phylogeny.
(7) Generate charts to visualize the distribution of selective pressure, and other quantities, along sequences. This feature utilizes an open source plotting package, GNUPLOT, available at http://www.gnuplot.info.
(8) If all three methods are run on a given dataset, a comparative analysis integrating the three methods can be performed.

Analysis results can be downloaded or accessed on our server for up to 96 hours.

[Figure 1: flowchart. Box labels: Upload and Validate a Codon Alignment (FASTA, NEXUS, PHYLIP); Phylogeny Reconstruction (NJ Tree, or User Tree Included with the Alignment); Substitution Model Selection (Automatic Model Selection, or User Chooses Model); Selection Detection Method (SLAC, FEL or REL Selection Analysis); Result Presentation and Charting (1. Sites Under Selection, 2. dN/dS Plots, 3. Mapped Mutations (SLAC), 4. Many more...); Integrated Selection Results (requires all three selection methods to have been run). Legend: Serial Process, MPI Process.]
Fig. 1. Structure of www.datamonkey.org.

4 DISCUSSION
We believe that the availability of fast and statistically sound methods is critical to enabling sophisticated large scale analyses of sequence evolution. Datamonkey is linked to a cluster of computers so that analyses which would take a long time to run on a desktop computer can be run quickly. The use of a website allows the tools to be kept up-to-date centrally, without the need for the researcher to install and maintain the considerable hardware resources required to run computationally intensive analyses. In the future, we intend to distribute a complete package of components needed to install and configure Datamonkey on a POSIX compliant web server, with or without an SSH interface to MPI computer clusters.

ACKNOWLEDGEMENTS
This research was supported by the National Institutes of Health (AI47745, AI43638 and AI57167), the University of California Universitywide AIDS Research Program (grant number IS02-SD-701) and by a University of California, San Diego Center for AIDS Research/NIAID Developmental Award to S.D.W.F. (AI36214).

REFERENCES
Altschul,S. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Kosakovsky Pond,S.L. and Frost,S.D. (2005a) A simple hierarchical approach to modeling distributions of substitution rates. Mol. Biol. Evol., 22, 223–234.
Kosakovsky Pond,S.L. and Frost,S.D. (2005b) Not so different after all: a comparison of methods for detecting amino-acid sites under selection. Mol. Biol. Evol., Advance Access published February 9, 2005, doi:10.1093/molbev/msi105.
Kosakovsky Pond,S.L. and Muse,S.V. (2005) Site-to-site variation of synonymous substitution rates. Mol. Biol. Evol., in revision.
Kosakovsky Pond,S.L. et al. (2004) HyPhy: hypothesis testing using phylogenies. Bioinformatics, 21, 676–679.
Nielsen,R. and Yang,Z.H. (1998) Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics, 148, 929–936.
Saitou,N. and Nei,M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4, 406–425.
Suzuki,Y. and Gojobori,T. (1999) A method for detecting positive selection at single amino acid sites. Mol. Biol. Evol., 16, 1315–1328.
Suzuki,Y. et al. (2001) ADAPTSITE: detecting natural selection at single amino acid sites. Bioinformatics, 17, 660–661.
Yang,Z. (2002) Inference of selection from multiple species alignments. Curr. Opin. Genet. Develop., 12, 688–694.
Yang,Z.H. (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci., 13, 555–556.

To whom correspondence should be addressed.
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org
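A note on the count of 203 time reversible models quoted in the Implementation section: the number is consistent with the number of ways to sort the six nucleotide substitution rates (AC, AG, AT, CG, CT, GT) into groups that share a single rate parameter, which is the Bell number B6 = 203. The Python sketch below enumerates those groupings. It illustrates the counting argument only and is not the model selection procedure of Kosakovsky Pond and Frost (2005a); the names RATES and rate_class_assignments are hypothetical and chosen for this example.

```python
# Illustrative sketch only: counts the ways to group the six substitution
# rates of a time-reversible nucleotide model into shared rate classes.
# RATES and rate_class_assignments are hypothetical names for this example.

RATES = ("AC", "AG", "AT", "CG", "CT", "GT")


def rate_class_assignments(n):
    """Yield every assignment of n rates to rate classes, once each.

    Each assignment is a tuple whose i-th entry is the class index of the
    i-th rate. A new class index may exceed the largest index used so far
    by at most one (a "restricted growth string"), so relabelled copies of
    the same grouping are never produced twice.
    """
    def extend(prefix, max_used):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        for label in range(max_used + 2):
            yield from extend(prefix + [label], max(max_used, label))

    yield from extend([], -1)


models = list(rate_class_assignments(len(RATES)))
print(len(models))  # 203

# Two familiar special cases, written in the order given by RATES:
single_rate = (0, 0, 0, 0, 0, 0)    # all six rates equal
hky85_like = (0, 1, 0, 0, 1, 0)     # transitions (AG, CT) vs transversions
general = (0, 1, 2, 3, 4, 5)        # six distinct rates (general reversible)
print(all(m in models for m in (single_rate, hky85_like, general)))
```

The 'named' models mentioned in the text correspond to a handful of these 203 assignments, such as the three listed at the end of the sketch.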


Bioinformatics, Volume 21 (10): 3 – Feb 15, 2005



Publisher
Oxford University Press
Copyright
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/bti320
pmid
15713735


