NGScloud: RNA-seq analysis of non-model species using cloud computing

F. Mora-Márquez et al., Bioinformatics, pp. 3405–3407

Summary: RNA-seq analysis usually requires large computing infrastructures. NGScloud is a bioinformatic system developed to analyze RNA-seq data using the cloud computing services of Amazon, which give access to ad hoc computing infrastructure scaled according to the complexity of the experiment, so that its costs and times can be optimized. The application provides a user-friendly front-end to operate Amazon's hardware resources and to control a workflow of RNA-seq analysis oriented to non-model species. It incorporates the cluster concept, which allows parallel runs of common RNA-seq analysis programs in several virtual machines for faster analysis.

Availability and implementation: NGScloud is freely available at https://github.com/GGFHF/NGScloud/. A manual detailing installation and how-to-use instructions is available with the distribution.

Contact: [email protected]

1 Introduction

RNA-seq experiments often yield huge amounts of data, especially when several NGS libraries are involved. The algorithms used in the bioinformatic analyses are very complex, particularly those related to the assembly of reads (Miller et al., 2010). Thus, the hardware requirements to run RNA-seq analysis are very high in terms of CPUs and GiB of RAM, and computing infrastructure that fulfills such requirements is not always available in small research centers. In such cases, cloud computing is a solution that provides resizable computing capacity and therefore allows the hardware to be fitted to the nature of the experimental data. One of the main cloud computing solutions is the Elastic Compute Cloud (EC2), a service of Amazon Web Services (AWS). EC2 has a wide range of scalable instances that allow the experiment costs to be optimized, because the user only pays for the time the resources are in use. EC2 also provides immediacy, since a virtual machine can be booted in only a few minutes.

We present NGScloud, a bioinformatic system developed to analyze RNA-seq data using the cloud computing offered by EC2. NGScloud is oriented to non-model species whose reference genomes are not available, and it implements parallel runs in several virtual machines for faster analysis. The application aims to ease the use of EC2 resources and the performance of RNA-seq analysis for the researcher.

2 Materials and methods

2.1 Software

NGScloud was programmed in Python3, and it runs on any computer with an OS that supports Python3: Linux, Microsoft Windows, Mac OS X and other platforms. To work properly, NGScloud has the following dependencies on the local computer of the user: (i) StarCluster, an open source cluster-computing toolkit for EC2 (http://star.mit.edu/cluster/); (ii) Boto3, the AWS SDK for Python (https://boto3.readthedocs.io/); (iii) Paramiko, an implementation of the SSHv2 protocol in Python (http://www.paramiko.org/) and (iv) AWS CLI (https://aws.amazon.com/cli/).

NGScloud offers a user-friendly front-end to operate the EC2 resources, to control the implemented RNA-seq workflow and to handle the data. NGScloud runs in graphical mode using the graphical user interface (GUI) by default, but it can also be run in console mode on server machines without a GUI installed. In addition, several free bioinformatic applications commonly used in RNA-seq workflows are easily set up from the front-end (Section 2.3).

2.2 Cloud computing

The NGScloud philosophy is based on the cluster concept. A cluster is a set of virtual machines of an AWS instance type. Each instance type has its own hardware features (machine type, CPU number, memory amount, etc.). Data volumes allow data to be saved and kept even when no cluster exists. NGScloud uses Amazon's EBS volumes to hold applications, read files, references, databases and results of analysis.

Through the NGScloud front-end, the user can easily: (i) create and terminate clusters; (ii) show the cluster composition; (iii) add and remove nodes dynamically in a cluster; (iv) create, remove, mount and unmount volumes; (v) submit jobs to the RNA-seq workflow and kill them; (vi) show the status of the batch jobs; (vii) view the log of every batch job to check that the program operated correctly and (viii) upload, download, compress, decompress and remove datasets.

When a cluster is created, it has only one virtual machine, named the master node. After the master node has been created, subsidiary nodes can be added if necessary to run some processes in parallel. In this case, a new job will run in the node determined according to the workload.

Fig. 1. NGScloud architecture. NGScloud operates EC2 resources, submits workflow and manages datasets from RNA-seq experiments

2.3 RNA-seq workflow

The RNA-seq workflow implemented has the standard steps of an RNA-seq analysis in non-model species (López de Heredia and Vázquez-Poletti, 2016), including: (i) pre-processing: read quality assessment with FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc), trimming with Trimmomatic (Bolger et al., 2014) and the insilico-read-normalization procedure of Trinity; (ii) de novo assembly with SOAPdenovo-Trans (Xie et al., 2014), Trans-Abyss (Robertson et al., 2010) and Trinity (Grabherr et al., 2011; Haas et al., 2013), and reference-based assembly with STAR (Dobin et al., 2013); (iii) assessment of the assembly quality and transcript quantification with Transrate (Smith-Unna et al., 2016), GMAP (Wu and Watanabe, 2005), BUSCO (Waterhouse et al., 2017), QUAST (Gurevich et al., 2013), rnaQUAST (Bushmanova et al., 2016) and RSEM-EVAL, included in the DETONATE package (Li et al., 2014); (iv) post-filtering with CD-HIT-EST (Li and Godzik, 2006) and transcript-filter, included in the NGShelper package, which also has some other tools to perform the RNA-seq analysis workflow and (v) annotation with transcriptome-blast, which encapsulates blastx runs in several nodes and is included in NGShelper. The results may be downloaded for downstream analysis (Fig. 1).

The workflow steps are run separately by the user, with each step requiring the setup of the cloud resources. NGScloud is configured to read the output generated in each step as the required input file(s) of subsequent steps, or to download the output to a local machine. For instance, to perform an assembly, reads are uploaded, the assembly program runs in the cluster and the output is downloaded or submitted to the next step of the RNA-seq workflow. Running bioinformatic applications is easy, since the researcher is guided in the choice of the input files and the parameters to be used, which encapsulates the complexity of the command line. Multiple runs of applications can be executed in parallel by creating nodes. In addition, the annotation step supports parallelization, so it can use several nodes to increase the run speed.

3 Conclusions

NGScloud provides a user-friendly front-end to operate the EC2 resources and to control, in a modular way, the workflow of RNA-seq experiments oriented to non-model species. The application allows the cost-efficiency ratio of RNA-seq experiments to be optimized when appropriate computational facilities are not available.

Funding

This work has been supported by the projects SPIP2014-01093 (Spanish National Parks Agency, Ministry of Agriculture), AGL2015-67495-C2-2-R and FedCloudNet (MINECO TIN2015-65469-P) (Spanish Ministry of Economy and Competitiveness), and by an Amazon Research Grant.

Conflict of Interest: none declared.

References

Bolger,A.M. et al. (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30, 2114–2120.
Bushmanova,E. et al. (2016) rnaQUAST: a quality assessment tool for de novo transcriptome assemblies. Bioinformatics, 32, 2210–2212.
Dobin,A. et al. (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29, 15–21.
Grabherr,M.G. et al. (2011) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol., 29, 644–652.
Gurevich,A. et al. (2013) QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29, 1072–1075.
Haas,B.J. et al. (2013) De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat. Protoc., 8, 1494–1512.
Li,B. et al. (2014) Evaluation of de novo transcriptome assemblies from RNA-Seq data. Genome Biol., 15, 553.
Li,W. and Godzik,A. (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22, 1658–1659.
López de Heredia,U. and Vázquez-Poletti,J.L. (2016) RNA-seq analysis in forest tree species: bioinformatic problems and solutions. Tree Genet. Genomes, 12, 30.
Miller,J.R. et al. (2010) Assembly algorithms for next-generation sequencing data. Genomics, 95, 315–327.
Robertson,G. et al. (2010) De novo assembly and analysis of RNA-seq data. Nat. Methods, 7, 909–912.
Smith-Unna,R. et al. (2016) TransRate: reference-free quality assessment of de novo transcriptome assemblies. Genome Res., 26, 1134–1144.
Waterhouse,R.M. et al. (2017) BUSCO applications from quality assessments to gene prediction and phylogenomics. Mol. Biol. Evol., 35, 1–6.
Wu,T.D. and Watanabe,C.K. (2005) GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics, 21, 1859–1875.
Xie,Y. et al. (2014) SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads. Bioinformatics, 30, 1660–1666.
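Illustrative sketch. The cluster concept described under "2.2 Cloud computing" (a cluster starts with only a master node, subsidiary nodes can be added, and a new job runs in the node chosen according to the workload) can be pictured with a small Python model. This is not NGScloud code and it does not call StarCluster or Boto3. The class names, the node naming scheme and the example instance type are hypothetical, and measuring workload as the number of running jobs is an assumption, since the article does not say how the workload is evaluated.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One virtual machine of the cluster (hypothetical model)."""
    name: str
    running_jobs: list = field(default_factory=list)


class Cluster:
    """A set of virtual machines that share one instance type."""

    def __init__(self, instance_type: str):
        self.instance_type = instance_type
        # A newly created cluster holds only the master node.
        self.nodes = [Node("master")]

    def add_node(self) -> Node:
        """Add a subsidiary node so that more jobs can run in parallel."""
        node = Node(f"node{len(self.nodes):03d}")  # hypothetical naming scheme
        self.nodes.append(node)
        return node

    def submit(self, job: str) -> str:
        """Assign the job to the least loaded node and return its name.

        Assumption: workload is the number of running jobs. On a tie,
        min() keeps the node that was created first.
        """
        target = min(self.nodes, key=lambda n: len(n.running_jobs))
        target.running_jobs.append(job)
        return target.name


if __name__ == "__main__":
    cluster = Cluster("m4.large")  # example instance type, chosen arbitrarily
    print(cluster.submit("fastqc"))
    cluster.add_node()
    print(cluster.submit("trimmomatic"))
```

In this model the first job necessarily lands on the master node, and the next job goes to the new, idle subsidiary node. A real scheduler would more likely weigh CPU and memory use than count jobs.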

Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected]
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/bty363


Journal

Bioinformatics, Oxford University Press

Published: May 3, 2018
