ProtTest 3: fast selection of best-fit models of protein evolution

Bioinformatics, Vol. 27 no. 8 2011, pages 1164–1165
APPLICATIONS NOTE
doi:10.1093/bioinformatics/btr088
Phylogenetics
Advance Access publication February 17, 2011

Diego Darriba (1,2), Guillermo L. Taboada (2), Ramón Doallo (2) and David Posada (1,*)

(1) Department of Biochemistry, Genetics and Immunology, University of Vigo, 36310 Vigo and (2) Department of Electronics and Systems, Computer Architecture Group, University of A Coruña, 15071 A Coruña, Spain
(*) To whom correspondence should be addressed.

Associate Editor: Martin Bishop

© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

ABSTRACT

Summary: We have implemented a high-performance computing (HPC) version of ProtTest that can be executed in parallel in multicore desktops and clusters. This version, called ProtTest 3, includes new features and extended capabilities.

Availability: ProtTest 3 source code and binaries are freely available under GNU license for download from http://darwin.uvigo.es/software/prottest3, linked to a Mercurial repository at Bitbucket (https://bitbucket.org/).

Contact: dposada@uvigo.es

Supplementary information: Supplementary data are available at Bioinformatics online.

Received on December 29, 2010; revised on February 10, 2011; accepted on February 11, 2011

1 INTRODUCTION

Recent advances in modern sequencing technologies have resulted in an increasing capability for gathering large datasets. Long sequence alignments with hundreds or thousands of sequences are not rare these days, but their analysis implies access to large computing infrastructures and/or the use of simpler and faster methods. In this regard, high-performance computing (HPC) becomes essential for the feasibility of more sophisticated (and often more accurate) analyses. Indeed, in recent years HPC facilities have become part of the general services provided by many universities and research centers. Besides, multicore desktops are now standard.

The program ProtTest (Abascal et al., 2007) is one of the most popular tools for selecting models of amino acid replacement, a routine step in phylogenetic analysis. ProtTest is written in Java and uses the program PhyML (Guindon and Gascuel, 2003) for the maximum likelihood (ML) estimation of model parameters and phylogenetic trees, and the PAL library (Drummond and Strimmer, 2001) to handle alignments and trees. Statistical model selection can be a very intensive task when the alignments are large and include divergent sequences, highlighting the need for new bioinformatic tools capable of exploiting the available computational resources. Here we describe a new version of ProtTest, ProtTest 3, that has been completely redesigned to take advantage of HPC environments and desktop multicore processors, significantly reducing the execution time for model selection in large protein alignments.

2 PROTTEST 3

The general structure and the Java code of ProtTest have been completely redesigned from a computer engineering point of view. We implemented several parallel strategies as distinct execution modes in order to make efficient use of the different computer architectures that a user might encounter:

(1) A Java thread-based concurrency for shared memory architectures (e.g. a multicore desktop computer or a multicore cluster node). This version also includes a new and richer graphical user interface (GUI) to facilitate its use.
(2) An MPJ (Shafi et al., 2009) parallelism for distributed memory architectures (e.g. HPC clusters).
(3) A hybrid MPJ-OpenMP implementation (Dagum and Menon, 1998) to obtain maximum scalability in architectures with both shared and distributed memory (e.g. multicore HPC clusters).

Moreover, ProtTest 3 includes a number of new and more comprehensive features: (i) more flexible support for different input alignment formats through the use of the ALTER library (Glez-Peña et al., 2010): ALN, FASTA, GDE, MSF, NEXUS, PHYLIP and PIR; (ii) up to 120 candidate models of protein evolution; (iii) four strategies for the calculation of likelihood scores: fixed BIONJ, BIONJ, ML or user defined; (iv) four information criteria: AIC, BIC, AICc and DT (see Sullivan and Joyce, 2005); (v) reconstruction of model-averaged phylogenetic trees (Posada and Buckley, 2004); (vi) fault tolerance with checkpointing; and (vii) automatic logging of the user activity.

3 PERFORMANCE EVALUATION

In order to benchmark the performance of ProtTest 3, we computed the running times for the estimation of the likelihood scores of all 120 candidate models from several real and simulated protein alignments (Table 1).

Table 1. Real and simulated alignments analyzed

Dataset   Protein                   Size (N × L)   Base tree     Seq. exec. time
RIB       Ribosomal protein         21 × 113       Fixed BIONJ   5.5 min
RIBML     Ribosomal protein         21 × 113       ML tree       28 min
COX       Cytochrome C oxidase II   28 × 113       Fixed BIONJ   9.5 min
COXML     Cytochrome C oxidase II   28 × 113       ML tree       55 min
HIV       HIV polymerase            36 × 1,034     Fixed BIONJ   44 min
HIVML     HIV polymerase            36 × 1,034     ML tree       160 min
10K       Simulated alignment       50 × 10K       Fixed BIONJ   9.2 h
20K       Simulated alignment       50 × 20K       Fixed BIONJ   24.5 h
100K      Simulated alignment       50 × 100K      Fixed BIONJ   80 h

N indicates the number of sequences and L the length of the alignment. Base tree is the tree used for likelihood optimization and Seq. exec. time is the time required to calculate the likelihood scores using the sequential version (i.e. a single thread).

When these data were executed in a system with shared memory, e.g. a multicore desktop, the scalability was almost linear as long as there was enough memory to satisfy the requirements. For example, in a shared memory execution in a 24-core node the speedup was almost linear with up to 8 cores, also scaling well with datasets with medium complexity, like HIVML or COXML (Fig. 1). In a system with distributed memory like a cluster, the application scaled well up to 56 processors (Fig. 2). With more processors, a theoretical scalability limit exists due to the heterogeneous nature of the optimization times, from a few seconds for the simplest models to up to several hours for the models that include rate variation among sites (+G). This problem was solved with the hybrid memory approach. In this case, the scalability went beyond the previous limit, reaching speedups of up to 150 in the most complex cases with 8-core nodes (Fig. 3).

Fig. 1. Speed-ups obtained with the shared memory version of ProtTest 3 according to the number of threads used in a 24-core shared memory node (4 hexa-core Intel Xeon E7450 processors) with 12 GB memory.

Fig. 2. Speed-ups obtained with the distributed memory version of ProtTest 3 according to the number of cores used in a 32-node cluster with 2 quad-core Intel Harpertown processors and 8 GB memory per node. Up to 4 processes were executed per node because of the memory requirements of the largest datasets (10K, 20K, 100K).

Fig. 3. Speed-ups obtained with the hybrid memory version of ProtTest 3 according to the number of cores used in the same 32-node cluster as Fig. 2. Up to 4 MPJ Express processes per node and at least 2 OpenMP threads for each ML optimization were executed.

4 CONCLUSIONS

ProtTest 3 can be executed in parallel in HPC environments as: (i) a GUI-based desktop version that uses multicore processors; (ii) a cluster-based version that distributes the computational load among nodes; and (iii) a hybrid multicore cluster version that achieves speed through the distribution of tasks among nodes while taking advantage of multicore processors within nodes. The new version has been completely redesigned and includes new capabilities like checkpointing, additional amino acid replacement matrices, new model selection criteria and the possibility of computing model-averaged phylogenetic trees. The use of ProtTest 3 results in significant performance gains, with observed speedups of up to 150 on a high performance cluster. In this way, statistical model selection for large protein alignments becomes feasible, not only for cluster users but also for the owners of standard multicore desktop computers. Moreover, the flexible design of ProtTest-HPC will allow developers to extend future functionalities, whereas third-party projects will be able to easily adapt its capabilities to their requirements.

ACKNOWLEDGEMENTS

Special thanks to Stephane Guindon and to Federico Abascal for their help.

Funding: This work was financially supported by the European Research Council (ERC-2007-Stg 203161-PHYGENOM to D.P.); the Spanish Ministry of Science and Education (BFU2009-08611 to D.P.); Xunta de Galicia (Galician Thematic Networks RGB 2010/90 to D.P. and GHPC2 2010/53 to R.D.).

Conflict of Interest: none declared.

REFERENCES

Abascal,F. et al. (2007) ProtTest: selection of best-fit models of protein evolution. Bioinformatics, 24, 1104–1105.
Dagum,L. and Menon,R. (1998) OpenMP: an industry-standard API for shared-memory programming. IEEE Comput. Sci. Eng., 5, 46–55.
Drummond,A. and Strimmer,K. (2001) Pal: an object-oriented programming library for molecular evolution and phylogenetics. Bioinformatics, 17, 662–663.
Glez-Peña,D. et al. (2010) ALTER: program-oriented conversion of DNA and protein alignments. Nucleic Acids Res., 38 (Suppl. 2), W14–W18.
Guindon,S. and Gascuel,O. (2003) A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol., 52, 696–704.
Posada,D. and Buckley,T.R. (2004) Model selection and model averaging in phylogenetics: advantages of akaike information criterion and bayesian approaches over likelihood ratio tests. Syst. Biol., 53, 793–808.
Shafi,A. et al. (2009) Nested parallelism for multi-core HPC systems using Java. J. Parallel Distr. Com., 69, 532–545.
Sullivan,J. and Joyce,P. (2005) Model selection in phylogenetics. Annu. Rev. Ecol. Evol. S, 36, 445–466.
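
As a rough illustration of the information criteria listed in Section 2, the sketch below computes AIC, BIC and AICc from a maximized log-likelihood using their standard textbook definitions, and turns AIC scores into Akaike weights of the kind used for model averaging. It is not taken from the ProtTest 3 source code: the function names, the use of alignment length as the sample size, and every numeric value are assumptions made up for the example, and the DT criterion is omitted.

```python
import math


def information_criteria(log_likelihood, num_params, sample_size):
    """Return AIC, BIC and AICc for one fitted model.

    log_likelihood: maximized log-likelihood (ln L) of the model
    num_params:     number of free parameters, K
    sample_size:    n (assumed here to be the alignment length)
    """
    if sample_size - num_params - 1 <= 0:
        raise ValueError("AICc is undefined when n <= K + 1")
    aic = -2.0 * log_likelihood + 2.0 * num_params
    bic = -2.0 * log_likelihood + num_params * math.log(sample_size)
    # AICc adds a small-sample correction to AIC
    aicc = aic + (2.0 * num_params * (num_params + 1)) / (sample_size - num_params - 1)
    return {"AIC": aic, "BIC": bic, "AICc": aicc}


def akaike_weights(scores):
    """Convert {model: criterion score} into normalized relative weights."""
    best = min(scores.values())
    relative = {m: math.exp(-0.5 * (s - best)) for m, s in scores.items()}
    total = sum(relative.values())
    return {m: r / total for m, r in relative.items()}


# Invented log-likelihoods and parameter counts, for illustration only.
candidates = {
    "JTT": (-5230.4, 39),
    "WAG+G": (-5101.7, 40),
    "LG+G": (-5098.2, 40),
}
alignment_length = 1034  # assumed sample size

aic_scores = {
    name: information_criteria(lnl, k, alignment_length)["AIC"]
    for name, (lnl, k) in candidates.items()
}
weights = akaike_weights(aic_scores)
best_model = min(aic_scores, key=aic_scores.get)
print(best_model, weights)
```

The model with the lowest score is preferred under each criterion; the weights sum to one and can serve as the per-model contributions when averaging over candidate models.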

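
The scalability limit described in Section 3 can be illustrated with a toy scheduling model. When each model optimization is an indivisible task, the run can never finish sooner than its longest task, so the speedup is capped at (total time) / (longest task) no matter how many processors are added. The sketch below assumes a greedy longest-task-first scheduler and invented task durations; it is not ProtTest 3's actual scheduler or timing data, and all function names are hypothetical.

```python
import heapq


def makespan(task_times, workers):
    """Finish time of the busiest worker under greedy longest-first scheduling."""
    loads = [0.0] * workers
    heapq.heapify(loads)
    for t in sorted(task_times, reverse=True):
        lightest = heapq.heappop(loads)  # worker that becomes free first
        heapq.heappush(loads, lightest + t)
    return max(loads)


def speedup(task_times, workers):
    """Sequential time divided by parallel finish time."""
    return sum(task_times) / makespan(task_times, workers)


def speedup_ceiling(task_times):
    """Upper bound on speedup when tasks cannot be split."""
    return sum(task_times) / max(task_times)


# Invented workload: 100 quick models (1 minute each) and
# 20 slow models (120 minutes each).
tasks = [1.0] * 100 + [120.0] * 20

for n in (8, 56, 200):
    print(n, "workers -> speedup", round(speedup(tasks, n), 2))
print("ceiling:", round(speedup_ceiling(tasks), 2))
```

In this toy workload the speedup stops growing once every slow task has its own worker. Splitting each slow optimization across several threads, as the hybrid mode does, shortens the longest task and therefore raises the ceiling.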

Publisher
Oxford University Press
Copyright
© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
eISSN
1367-4811
DOI
10.1093/bioinformatics/btr088
pmid
21335321


Journal

Bioinformatics, Oxford University Press

Published: Feb 17, 2011
