DnaSP v5: a software for comprehensive analysis of DNA polymorphism data

BIOINFORMATICS APPLICATIONS NOTE
Vol. 25 no. 11 2009, pages 1451–1452
doi:10.1093/bioinformatics/btp187

Genetics and population analysis

P. Librado (1,2) and J. Rozas (1,2,∗)

(1) Departament de Genètica, Facultat de Biologia and (2) Institut de Recerca de la Biodiversitat, Universitat de Barcelona, Diagonal 645, 08028 Barcelona, Spain

Received on February 10, 2009; revised and accepted on April 2, 2009
Advance Access publication April 3, 2009
Associate Editor: Martin Bishop

∗ To whom correspondence should be addressed.

© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

ABSTRACT

Motivation: DnaSP is a software package for a comprehensive analysis of DNA polymorphism data. Version 5 implements a number of new features and analytical methods allowing extensive DNA polymorphism analyses on large datasets. Among other features, the newly implemented methods allow for: (i) analyses on multiple data files; (ii) haplotype phasing; (iii) analyses on insertion/deletion polymorphism data; (iv) visualizing sliding window results integrated with available genome annotations in the UCSC browser.

Availability: Freely available to academic users from: http://www.ub.edu/dnasp

Contact: jrozas@ub.edu

1 INTRODUCTION

The analysis of DNA polymorphisms is a powerful approach to understand the evolutionary process and to establish the functional significance of particular genomic regions (Begun et al., 2007; Nielsen, 2005; Rosenberg and Nordborg, 2002). In this context, estimating the impact of natural selection (both positive and negative) is of major interest. Furthermore, DNA polymorphisms are relevant as a tool for a broad range of life science disciplines. Consequently, many high-throughput sequencing, genotyping and polymorphism detection systems have been developed and are currently publicly available (Shendure and Ji, 2008). These new technologies are generating massive amounts of data that need to be processed, analyzed and transformed effectively into knowledge.

These technological advances have largely stimulated the development of both analytical methods and computer applications. Population genetic methods, and particularly those based on coalescent theory (Hudson, 1990; Wakeley, 2009), are used at an increasing rate, but need to be adapted to the particularities of the data (massive amounts of data, missing data, genotypes, insertion/deletion (indels) polymorphisms, etc.). Furthermore, new computer applications and algorithms need to be developed for processing massive datasets (Excoffier and Heckel, 2006), and more specifically computer visualization tools for the representation of DNA variation patterns. DnaSP (DNA Sequence Polymorphism) is a software package that allows for extensive DNA polymorphism analyses using a friendly graphical user interface (GUI) (Rozas et al., 2003). Version 5 extends the capabilities of the software, allowing comprehensive DNA polymorphism analyses on multiple data files and on large datasets. Altogether, the present version of DnaSP has the appropriate features for exhaustive exploratory analyses using high-throughput DNA polymorphism data.

2 FEATURES

DnaSP v5 incorporates major improvements. The new version currently allows for the handling and analysis of multiple data files in batch, and implements new algorithms and methods; among other things (see below) includes a new module to identify conserved DNA regions, this feature might be useful for phylogenetic footprinting-based analyses (Vingron et al., 2009). DnaSP provides a convenient GUI facilitating all data management and analytical tasks; the results can be visualized graphically as well as in a text report. DnaSP accepts multiple DNA sequence alignment file formats (Rozas et al., 2003), including NEXUS (Maddison et al., 1997), and HapMap3 files with phased haplotypes (The International HapMap Consortium, 2003). The software allows exhaustive DNA polymorphism analyses, including those based on coalescent theory (Rozas et al., 2003; Wakeley, 2009).

2.1 Haplotype reconstruction

Haplotype reconstruction aims at resolving haplotype phase given genotypic information. DnaSP implements statistical methods to infer haplotype phase, and prepares adequately the phased data for subsequent analyses. The input data (unphased genotype data) are required in FASTA format using IUPAC nucleotide ambiguity codes to represent heterozygous sites. DnaSP reconstructs the phase by applying various algorithms (PHASE v2.1, fastPHASE v1.1 and HAPAR) differing in the underlying population genetic assumptions. PHASE (Stephens and Donnelly, 2003; Stephens et al., 2001) assumes Hardy–Weinberg equilibrium and uses a coalescent-based Bayesian method to infer haplotypes. fastPHASE (Scheet and Stephens, 2006) implements a modification of the PHASE algorithm taking into account the patterns of linkage disequilibrium and its gradual decline with physical distance. This algorithm is faster and allows for the handling of larger datasets than PHASE, while being slightly less accurate. HAPAR (Wang and Xu, 2003) infers haplotype phase by maximum parsimony, i.e. attempts to find the minimum number of haplotypes explaining the genotype sample.

2.2 Deletion/insertion polymorphisms

Deletion/insertion polymorphisms (DIPs) analysis can provide insights into the evolutionary forces acting on DNA. This information, however, has been rarely used. One obstacle has been the difficulty of defining clearly homologous states (Young and Healy, 2003). DnaSP incorporates an algorithm for treating indels related to the ‘simple indel coding’ method of Simmons and Ochoterena (2000). Specifically, only indels with the same 5′ and 3′ termini are considered homologous (resulted from a single event), and indels of different lengths (even in the same position of the alignment) are treated as different events. DnaSP, nevertheless, uses a slightly different method for coding completely overlapping gaps, and allows the user to choose the level of overlap to be coded. Subsequently, DnaSP estimates a number of DIP summary statistics, such as the average indel length, indel diversity, as well as Tajima’s D (Tajima, 1989) based on indel information. Additionally, it exports the recoded data in the NEXUS format file.

2.3 Analysis of multiple data files

DnaSP can automatically read and analyze multiple data files sequentially (in batch mode). These data files may contain a varying number of sequences (from within one species, or from one species as well as one outgroup), or represent diverse genomic regions. The program estimates the most common DNA polymorphism and divergence summary statistics (such as the nucleotide and haplotype diversity, the population mutation parameter, the number of nucleotide substitutions per site, etc.), and neutrality tests (such as Tajima’s, Fu and Li’s and Fu’s tests).

2.4 Sliding window results visualization

The sliding window technique is a useful tool for exploratory DNA polymorphism data analysis (Hutter et al., 2006; Rozas et al., 2003; Vilella et al., 2005). The current version of DnaSP permits visualizing results of the sliding window (for example, nucleotide diversity or Tajima’s D values along the DNA sequence) integrating available genome annotations in the UCSC browser (Kent et al., 2002). This feature can greatly facilitate the interpretation of the results; for instance, it is possible to identify the relevant genome annotations (genes, intergenic regions, conserved regions, etc.), which are adjacent to regions with atypical patterns of nucleotide variation.

3 IMPLEMENTATION

DnaSP version 5 has been developed in Microsoft Visual Basic v6.0, C and C++, and it runs under Microsoft Windows operating systems (2000/XP/Vista). With the use of Windows emulators, DnaSP can also run on Apple Macintosh platforms, Linux and Unix-based operating systems. The software has been tested in all three platforms.

ACKNOWLEDGEMENTS

We acknowledge Sergios-Orestis Kolokotronis for helpful comments on the manuscript. Special thanks to the numerous users who tested the software with their data, and particularly to all members of the Molecular Evolutionary Genetics group at the Departament de Genètica, Universitat de Barcelona.

Funding: Spanish Dirección General de Investigación Científica y Técnica (grants BFU2004-02253 and BFU2007-62927); the Catalonian Comissió Interdepartamental de Recerca i Innovació Tecnològica (grant 2005SGR00166).

Conflict of Interest: none declared.

REFERENCES

Begun,D.J. et al. (2007) Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biol., 6, e310.
Excoffier,L. and Heckel,G. (2006) Computer programs for population genetics data analysis: a survival guide. Nat. Rev. Genet., 7, 745–758.
Hudson,R.R. (1990) Gene genealogies and the coalescent process. Oxf. Surv. Evol. Biol., 7, 1–44.
Hutter,S. et al. (2006) Genome-wide DNA polymorphism analyses using VariScan. BMC Bioinformatics, 7, 409.
Kent,W.J. et al. (2002) The Human Genome Browser at UCSC. Genome Res., 12, 996–1006.
Maddison,W.P. et al. (1997) NEXUS: an extendible file format for systematic information. Syst. Biol., 46, 590–621.
Nielsen,R. (2005) Molecular signatures of natural selection. Annu. Rev. Genet., 39, 197–218.
Rosenberg,N.A. and Nordborg,M. (2002) Genealogical trees, coalescent theory, and the analysis of genetic polymorphisms. Nat. Rev. Genet., 3, 380–390.
Rozas,J. et al. (2003) DnaSP, DNA polymorphism analyses by the coalescent and other methods. Bioinformatics, 19, 2496–2497.
Scheet,P. and Stephens,M. (2006) A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet., 78, 629–644.
Shendure,J. and Ji,H. (2008) Next-generation DNA sequencing. Nat. Biotechnol., 26, 1135–1145.
Simmons,M.P. and Ochoterena,H. (2000) Gaps as characters in sequence-based phylogenetic analyses. Syst. Biol., 49, 369–381.
Stephens,M. and Donnelly,P. (2003) A comparison of Bayesian methods for haplotype reconstruction from population genotype data. Am. J. Hum. Genet., 73, 1162–1169.
Stephens,M. et al. (2001) A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet., 68, 978–989.
Tajima,F. (1989) Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics, 123, 585–595.
The International HapMap Consortium (2003) The International HapMap Project. Nature, 426, 789–796.
Vilella,A.J. et al. (2005) VariScan: analysis of evolutionary patterns from large-scale DNA sequence polymorphism data. Bioinformatics, 21, 2791–2793.
Vingron,M. et al. (2009) Integrating sequence, evolution and functional genomics in regulatory genomics. Genome Biol., 10, 202.
Wang,L. and Xu,Y. (2003) Haplotype inference by maximum parsimony. Bioinformatics, 19, 1773–1780.
Wakeley,J. (2009) Coalescent Theory. An Introduction. Roberts and Company Publishers. Greenwood Village.
Young,N.D. and Healy,J. (2003) GapCoder automates the use of indel characters in phylogenetic analysis. BMC Bioinformatics, 4, 6.
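
Illustrative sketch (not part of the original article). The sliding window analysis of Section 2.4 and the summary statistics of Section 2.3 can be illustrated with a short Python example. This is a minimal sketch under stated assumptions, not DnaSP's implementation: it assumes equal-length, gap-free aligned sequences with no missing data, and the function names, the `window` and `step` parameters and the tuple layout of the result are all hypothetical. Tajima's D follows the standard formula of Tajima (1989) and is reported as `None` where it is undefined.

```python
from itertools import combinations
from math import sqrt


def pairwise_differences(seqs):
    """Mean number of pairwise nucleotide differences (k); needs >= 2 sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return total / len(pairs)


def segregating_sites(seqs):
    """Number of alignment columns showing more than one nucleotide (S)."""
    return sum(len(set(column)) > 1 for column in zip(*seqs))


def tajimas_d(seqs):
    """Tajima's D, or None when it is undefined (n < 4 or no segregating sites)."""
    n = len(seqs)
    s = segregating_sites(seqs)
    if n < 4 or s == 0:
        return None
    k = pairwise_differences(seqs)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    if variance <= 0:
        return None
    return (k - s / a1) / sqrt(variance)


def sliding_window(seqs, window, step):
    """Return (start, end, pi per site, Tajima's D) for each full window.

    Coordinates are 0-based and half-open; trailing sites that do not fill
    a complete window are ignored in this sketch.
    """
    length = len(seqs[0])
    results = []
    for start in range(0, length - window + 1, step):
        chunk = [s[start:start + window] for s in seqs]
        pi = pairwise_differences(chunk) / window
        results.append((start, start + window, pi, tajimas_d(chunk)))
    return results
```

Calling `sliding_window(alignment, window=100, step=25)` on a list of aligned sequence strings would yield one tuple per window; such per-window values are the kind of result that can then be plotted along the sequence or compared against genome annotations.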

Bioinformatics, Volume 25 (11): 2 – Apr 3, 2009


References (25)

Publisher
Oxford University Press
Copyright
© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/btp187
pmid
19346325


Journal

Bioinformatics, Oxford University Press

Published: Apr 3, 2009
