Biology Needs Evolutionary Software Tools
Nekrutenko et al.
Mol. Biol. Evol. 35(6):1372–1375 doi:10.1093/molbev/msy084 Advance Access publication April 23, 2018
Downloaded from https://academic.oup.com/mbe/article/35/6/1372/4983859 by DeepDyve user on 19 July 2022

The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Research in population genetics and evolutionary biology has always provided a computational backbone for life sciences as a whole. Today evolutionary and population biology reasoning is essential for interpreting the large, complex datasets that characterize all domains of today's life sciences, from cancer biology to microbial ecology. This situation makes algorithms and software tools developed by our community more important than ever before. It also means that we, developers of software tools for molecular evolutionary analyses, now have a shared responsibility to make these tools accessible using modern technological developments and to provide adequate documentation and training.

Key words: software, evolutionary biology, computational biology.

When Lewontin and Hubby (Hubby and Lewontin 1966; Lewontin and Hubby 1966) demonstrated that genetic variation in natural populations can be observed directly at high resolution (i.e., protein-level), they provided evolutionary and population biology with the ability to generate much more interesting and insightful datasets. Initially, these datasets were small by today's standards. For example, their two classical Genetics papers contained all (!) data in the main text along with all calculations. The development of recombinant DNA and sequence determination techniques in the 1970s allowed for generation of larger datasets, such as the sequencing of an entire alcohol dehydrogenase gene from several populations of Drosophila in the early 1980s (Kreitman 1983). That same period of the late 1970s and early 1980s also saw the emergence of personal computers and the development of the first evolutionary analysis toolkit, PHYLIP (Felsenstein 1993), the oldest continuously maintained software in our field. In the mid-2000s the development of next-generation sequencing techniques brought low-cost, high-output data generation capacity to all areas of life sciences. A by-product of this data explosion was that a number of biomedical domains that were traditionally distant from evolutionary thinking found themselves in a situation where data interpretation should be performed in the evolutionary context. For example, analyses of infectious diseases such as AIDS and influenza, proliferation of malignant tumors, emergence of antibiotic resistance, and many other types of problems can only be fully understood in the context of evolutionary analyses. Truly, Dobzhansky (1973) was right in saying that "Nothing in biology makes sense except in the light of evolution". These unique circumstances position evolutionary and population biology at the center of life sciences, a place well deserved. But this also places a special responsibility on us, practitioners of this field, to make our software tools useful and comprehensible to the broad life sciences community. Below, we examine recent developments that would make this possible.

To be usable, software tools should minimally be accessible and (well) documented. To gauge these parameters within recently published molecular evolution software tools we examined all Methods and Resource articles published in MBE between January 2017 and March 2018. We only looked at articles that were either freely available (outside the paywall) or had a clearly specified URL pointing to the software within the abstract. This is because readers from other biological domains are unlikely to be subscribed to MBE. There were 23 papers describing new software tools (see supplementary table S1). Twenty-two had source code deposited in GitHub or the R archive (The Comprehensive R Archive Network [CRAN]), a testament to the openness of the field. We then looked at how easily these tools can be used in practice. This presented a less exciting picture: only three tools contained enough information (documentation and/or tutorials) to actually be easily installed and used. Two additional tools have been added to Bioconda (see below), greatly improving their usability. Thus the conclusion so far is that the community is open (the absolute majority of tools are in the open source domain) but lags significantly in making tools truly usable. Below we summarize technological developments that can significantly improve the usability of our software while putting minimal strain on developers. Specifically, we discuss advances in package and environment management for installing tools, software containerization for isolating tools and dependencies, and integrative frameworks that provide access to a wide range of tools through a single user interface (UI) (fig. 1).

FIG. 1. Examples of different deployment strategies for a single tool, IQ-Tree (Nguyen et al. 2015). (A) Compiling from the source code on Linux (installation instructions specific to MacOS and Windows are described on the IQ-Tree website). These instructions do not include installation of a compiler and cmake, or environment configuration (e.g., the PATH variable). (B) Because IQ-Tree is available from Bioconda (https://bioconda.github.io/recipes/iqtree, last accessed March 2018) it can be installed with much less effort. Here we first create an isolated virtual environment (conda create), switch to that environment (source activate), and finally install IQ-Tree itself (conda install). In contrast to (A), this takes care of all dependencies and environment configuration, making the package immediately ready for use. Because Bioconda automatically creates containers, the tool can be run from within a container (docker run). Note that in these cases (Conda and Docker) we explicitly specify the version of IQ-Tree (1.5.5). The ability to specify software versions is essential for making analyses transparent and reproducible. (C) Finally, because IQ-Tree is already in Conda it is trivial to incorporate it into Galaxy, an integrative environment (this screenshot is from http://usegalaxy.eu, last accessed March 2018). This provides users with a consistent interface and the ability to combine IQ-Tree with other tools within Galaxy such as, for example, tools for generation of multiple alignments.

Tools for Packaging and Distributing Software

Several recent developments promise to make software distribution significantly less of a burden on developers. The first of these developments is the evolution of frameworks for management of tool dependencies and runtime environments. The biggest problem for many (especially naïve) users is installation, especially when the tool needs to be compiled from the source code and properly installed (fig. 1A). The situation is often aggravated by dependencies such as external libraries required for successful building of executables within a multitude of operating systems (OSs) and local configurations. Conda (https://conda.io, last accessed March 2018) represents the latest generation of open source package and environment managers developed specifically to mitigate this issue (fig. 1B). With Conda, tools and their dependencies can be easily installed through a simple, one-step command. Conda also works across programming languages and OSs, making it widely useful. Leveraging Conda, Bioconda (https://bioconda.github.io, last accessed March 2018) is a community project dedicated to data analysis in life sciences that contains over 3,700 tool packages with contributions by more than 400 authors (Dale et al. 2017). Despite the fact that Bioconda is one of the most recent package managers dedicated to biomedical tools, it contains by far the largest number of software tools, underscoring its rapid uptake by the community (fig. 2 in Dale et al. 2017). Bioconda packages are well maintained and include a testing system to ensure their quality.

Another transformative development is software containerization platforms (or, simply, containers) represented by Docker (https://www.docker.com, last accessed March 2018), Singularity (Kurtzer et al. 2017), and rkt (https://coreos.com/rkt, last accessed March 2018). Containers run within the host OS's kernel but "containerize" every other aspect of the runtime environment, providing higher isolation from the local environment than Conda virtual environments can provide (these can still be influenced by the host system; Beaulieu-Jones and Greene 2017). Because containers share a kernel with the host environment, the impact on execution performance is negligible, while they enable computational reproducibility akin to full virtual machines. Containers are straightforward to create and they are automatically generated for every tool included in Bioconda (Dale et al. 2017).

Integrative Frameworks

Integrative frameworks are systems where heterogeneous tools can be applied to a variety of datasets within a single, unified UI. There are many advantages to such systems: they provide users with ready-to-use tools that can be combined into complete workflows, support data storage and computational needs, automatically convert between file formats, and provide capabilities for reproducing and sharing analyses. Galaxy (Blankenberg et al. 2007; Goecks et al. 2010; Afgan et al. 2016) is the most widely used of these platforms (http://bit.ly/gxyTSstats, last accessed March 2018). It provides access to hundreds of tools used in a wide variety of analysis scenarios (e.g., through American http://usegalaxy.org [last accessed March 2018], European http://usegalaxy.eu [last accessed March 2018], and Australian http://usegalaxy.org.au [last accessed March 2018] server instances). It features a web-based UI while automatically and transparently managing underlying computation details. In addition to public servers, it can be deployed on a personal computer, heterogeneous computer clusters, as well as computation systems provided by Amazon, Microsoft, Google, and other clouds. It is an open, community-driven project, which ensures its sustainability and allows it to be adapted for use in a wide variety of research domains from genomics to image analysis to natural language processing.

The advantage of integrative frameworks is that they provide multiple tools under the umbrella of a single system. This means that a user can perform complex, multi-step analyses in one place. For example, a researcher studying evolution of antibiotic resistance can start from the very beginning by assessing the quality and mapping of, say, Illumina data, calling and filtering variants, and identifying sites under selection, all within one system, without ever leaving it and never needing to install anything. One can argue that such a statement (performing everything within a single system) is unrealistic because 1) one cannot always assess the full complexity of a given analysis a priori and 2) systems like Galaxy can never include all possible tools. This is true and this is why we have developed Interactive Environments (IEs) within Galaxy (Grüning et al. 2016). Using IEs, one can start a Jupyter or RStudio session directly within Galaxy (using its robust computational infrastructure) and perform any type of ad hoc analysis such as a statistical test or creating a custom visualization.

Any web-based or command-line tool can be integrated into Galaxy. However, tools that are already registered with Conda (or Bioconda) are especially easy to add because all dependency resolution issues are already solved by the package manager or software containers (fig. 1C). Depending on its configuration, Galaxy can create a dedicated Conda environment for tool execution or "pull" (download) a Docker container that was automatically created by Bioconda and invoke the tool within this container.

Training Is Key

Expansion of areas that can directly benefit from software tools developed within evolutionary biology means that numerous researchers unfamiliar with these types of analyses will need to be trained. This means that 1) a framework for distribution and management of educational materials should be developed and 2) community-sourced tutorials need to be produced.

To achieve the first goal, we and the Galaxy community have built an infrastructure for creation and delivery of training materials that enables transparent peer-review and curation to guarantee high-quality and current content. In doing this we took inspiration from the Software and Data Carpentry (SDC) (Wilson 2014) projects, where materials are openly reviewed and iteratively developed on GitHub (https://github.com/, last accessed March 2018) to capture the breadth of community expertise. SDC delivers training via online tutorials with hands-on sections, which offer better training support than videos because trainees who are actively participating learn more (Dollar et al. 2007). The content of these web pages is easy to edit, thus reducing the contribution barrier. The tutorials are developed in Markdown, a plain text markup language, which is automatically transformed into web-browser accessible pages. Using these strategies, we created a GitHub repository (https://github.com/galaxyproject/training-material, last accessed March 2018) to collect, manage, and distribute training materials. This infrastructure has been developed in accordance with the FAIR (Findable, Accessible, Interoperable, Reusable) principles (Wilkinson et al. 2016). Using the framework described above, we relaunched the Galaxy Training Network (GTN; https://galaxyproject.org/teach/gtn, last accessed March 2018). This growing network currently consists of 33 scientific groups (https://galaxyproject.org/teach/trainers, last accessed March 2018) invested in Galaxy-based training. The GTN regularly organizes training events worldwide and offers best practices for developing Galaxy-based training material, advice on the choice of compute platform for training, and a catalog of existing training resources for Galaxy. There is currently a paucity of tutorials targeting evolutionary- and population-biology types of analyses. We hope that this report will precipitate their development.

Going Forward

In conclusion, we would like to introduce a short set of recommendations that can potentially widen the impact of the software produced within the field of evolutionary biology.

(1) Use modern software distribution practices. Using systems like Conda dramatically simplifies installation of software tools for end users. The importance of this cannot be overemphasized. Many readers will recall the "horrors" of source code not compiling properly or of searching for the right version of a needed software library. For a naïve user such a situation is the end of an attempt to ever try the software. Using Conda reduces this unnecessary complexity to simply using conda install, which will automatically retrieve dependencies and install needed components. This benefits not only the user but the software developer as well. After all, the "fitness" of software is directly proportional to the number of users, and these approaches will increase the number of users.

(2) Use integrative environments because stand-alone web applications have limited utility. It is often tempting to develop a web server for a single tool or a collection of tools. However, single-purpose web servers usually do not have all the tools necessary for performing a complete from-data-to-publication type of analysis. For example, a website implementing a tree reconstruction algorithm (such as PhyML; Guindon et al. 2010) will use sequence alignments in a particular format (e.g., PHYLIP) as the input. But these alignments need to be generated somehow and converted to an appropriate format, a set of manipulations the website is unlikely to provide. On the other hand, incorporating the tool into a system like Galaxy empowers users to combine the tool in novel ways with hundreds of other utilities as well as with interactive computing environments such as Jupyter and RStudio. This also frees developers from website development: significant time that can be spent, well, wrapping tools in Conda and Galaxy and developing tutorials.

(3) Documentation and training efforts always pay off. It is redundant to say that documentation is key to everything. Tutorial development is hard work because one needs to design analyses using specially tailored minimal datasets that will produce meaningful results and tell an engaging story. However, only domain specialists can produce quality educational materials and so we appeal to all readers of this piece: if you have ever developed an analysis tool, make a tutorial to showcase what your tool can do. Ultimately (as we mentioned above) this will only increase the "fitness" of your software.

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.

Acknowledgments

The authors are grateful to the Bioconda, BioContainers, and Galaxy communities, as without these resources this work would not be possible. This project was supported by NIH grants U41 HG006620 and R01 AI134384-01 as well as NSF grant 1661497 to J.T. and A.N.

References

Afgan E, Baker D, van den Beek M, Blankenberg D, Bouvier D, Cech M, Chilton J, Clements D, Coraor N, Eberhard C, et al. 2016. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update. Nucleic Acids Res. 44(W1):W3.
Beaulieu-Jones BK, Greene CS. 2017. Reproducibility of computational workflows is automated using continuous analysis. Nat. Biotechnol. 35(4):342–346.
Blankenberg D, Taylor J, Schenck I, He J, Zhang Y, Ghent M, Veeraraghavan N, Albert I, Miller W, Makova KD, et al. 2007. A framework for collaborative analysis of ENCODE data: making large-scale analyses biologist-friendly. Genome Res. 17(6):960–964.
Dale R, Grüning B, Sjödin A, Rowe J, Chapman BA, Tomkins-Tinch CH, Valieris R, The Bioconda Team, Köster J. 2017. Bioconda: a sustainable and comprehensive software distribution for the life sciences. bioRxiv [Internet] 207092. Available from: https://www.biorxiv.org/content/early/2017/10/21/207092
Dobzhansky T. 1973. Nothing in biology makes sense except in the light of evolution. Am Biol Teach. 35(3):125–129.
Dollar A, Steif PS, Strader R. 2007. Enhancing traditional classroom instruction with web-based Statics course. In: 2007 37th annual frontiers in education conference - global engineering: knowledge without borders, opportunities without passports. Available from: http://dx.doi.org/10.1109/fie.2007.4417892
Felsenstein J. 1993. PHYLIP 3.5. Seattle: University of Washington.
Goecks J, Nekrutenko A, Taylor J, Team G. 2010. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 11(8):R86.
Grüning BA, Rasche E, Rebolledo-Jaramillo B, Eberhard C, Chilton J, Coraor N, Backofen R, Taylor J. 2016. Enhancing pre-defined workflows with ad hoc analytics using Galaxy, Docker and Jupyter. Available from: http://dx.doi.org/10.1101/075457
Guindon S, Dufayard J-F, Lefort V, Anisimova M, Hordijk W, Gascuel O. 2010. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59(3):307–321.
Hubby JL, Lewontin RC. 1966. A molecular approach to the study of genic heterozygosity in natural populations. I. The number of alleles at different loci in Drosophila pseudoobscura. Genetics 54:577–594.
Kreitman M. 1983. Nucleotide polymorphism at the alcohol dehydrogenase locus of Drosophila melanogaster. Nature 304(5925):412–417.
Kurtzer GM, Sochat V, Bauer MW. 2017. Singularity: scientific containers for mobility of compute. PLoS One 12(5):e0177459.
Lewontin RC, Hubby JL. 1966. A molecular approach to the study of genic heterozygosity in natural populations. II. Amount of variation and degree of heterozygosity in natural populations of Drosophila pseudoobscura. Genetics 54:595–609.
Wilkinson MD, Dumontier M, Aalbersberg IJJ, Appleton G, Axton M, Baak A, Blomberg N, Boiten J-W, da Silva Santos LB, Bourne PE, et al. 2016. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3:160018.
Wilson G. 2014. Software Carpentry: lessons learned. F1000Res. 3:62.
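
Illustrative sketch (not part of the original article). The Conda and Docker deployment routes described in the fig. 1 caption and in recommendation (1) both depend on pinning an explicit tool version. The Python sketch below only assembles the corresponding command lines and prints them; it does not run anything. Several things here are assumptions made for illustration: the helper names, the environment name `iqtree-env`, the channel order, the input file `alignment.phy`, and the container image reference, whose tag is left as a placeholder because the article does not give one. The sketch also installs into the environment by name rather than activating it first, which is a small departure from the conda create / source activate / conda install sequence in the caption.

```python
import shlex


def conda_install_commands(tool, version, env_name,
                           channels=("conda-forge", "bioconda")):
    """Build (but do not run) the commands that create an isolated Conda
    environment and install a pinned version of `tool` into it."""
    channel_args = []
    for channel in channels:
        channel_args += ["--channel", channel]
    create = ["conda", "create", "--yes", "--name", env_name]
    install = ["conda", "install", "--yes", "--name", env_name,
               *channel_args, f"{tool}={version}"]
    return [create, install]


def docker_run_command(image, tool_args):
    """Build (but do not run) a `docker run` command that invokes a tool
    inside a container image supplied by the caller."""
    return ["docker", "run", "--rm", image, *tool_args]


# Pin the same IQ-Tree version that the figure caption mentions (1.5.5).
commands = conda_install_commands("iqtree", "1.5.5", env_name="iqtree-env")

# Placeholder image reference: the real tag must be looked up, it is not
# given in the article.
commands.append(
    docker_run_command("quay.io/biocontainers/iqtree:<tag>",
                       ["iqtree", "-s", "alignment.phy"])
)

for command in commands:
    print(shlex.join(command))
```

Keeping the version in one place (the `version` argument) is the point of the sketch: the same pinned value can be recorded alongside an analysis so that the environment can be recreated later.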
Molecular Biology and Evolution – Oxford University Press
Published: Jun 1, 2018