J. Miller, S. Koren, G. Sutton (2010)
Assembly algorithms for next-generation sequencing data. Genomics, 95(6)
U. Heredia, J. Vázquez-Poletti (2016)
RNA-seq analysis in forest tree species: bioinformatic problems and solutions. Tree Genetics & Genomes, 12
Anthony Bolger, M. Lohse, B. Usadel (2014)
Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30
Yinlong Xie, Gengxiong Wu, Jingbo Tang, Ruibang Luo, Jordan Patterson, Shanlin Liu, Weihua Huang, Guangzhu He, Shengchang Gu, Sheng-qiang Li, Xin Zhou, T. Lam, Yingrui Li, Xun Xu, G. Wong, Jun Wang (2013)
SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads. Bioinformatics, 30(12)
Weizhong Li, A. Godzik (2006)
Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13)
Richard Smith-Unna, Chris Boursnell, Robert Patro, J. Hibberd, S. Kelly (2015)
TransRate: reference-free quality assessment of de novo transcriptome assemblies. Genome Research, 26
M.G. Grabherr (2011)
Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol., 29
Sudhir Kumar, Heather Rowe (2018)
MBE Citation Classics (2018 Edition). Molecular Biology and Evolution, 35(1)
B. Haas, A. Papanicolaou, M. Yassour, M. Grabherr, Philip Blood, Joshua Bowden, M. Couger, David Eccles, Bo Li, Matthias Lieber, M. MacManes, M. Ott, Joshua Orvis, N. Pochet, F. Strozzi, N. Weeks, R. Westerman, T. William, Colin Dewey, R. Henschel, Richard LeDuc, N. Friedman, A. Regev (2013)
De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature Protocols, 8
Gordon Robertson, J. Schein, Readman Chiu, R. Corbett, M. Field, S. Jackman, K. Mungall, Sam Lee, H. Okada, Jenny Qian, M. Griffith, Anthony Raymond, N. Thiessen, T. Cezard, Y. Butterfield, R. Newsome, S. Chan, R. She, R. Varhol, Baljit Kamoh, A. Prabhu, Angela Tam, Yongjun Zhao, Richard Moore, M. Hirst, M. Marra, Steven Jones, P. Hoodless, I. Birol (2010)
De novo assembly and analysis of RNA-seq data. Nature Methods, 7
A. Gurevich, V. Saveliev, Nikolay Vyahhi, G. Tesler (2013)
QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29(8)
M. Grabherr, B. Haas, M. Yassour, J. Levin, Dawn Thompson, I. Amit, X. Adiconis, Lin Fan, R. Raychowdhury, Qiandong Zeng, Zehua Chen, E. Mauceli, N. Hacohen, A. Gnirke, N. Rhind, F. Palma, B. Birren, C. Nusbaum, K. Lindblad-Toh, N. Friedman, A. Regev (2011)
Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature Biotechnology, 29
R. Waterhouse, Mathieu Seppey, Felipe Simão, M. Manni, P. Ioannidis, G. Klioutchnikov, E. Kriventseva, E. Zdobnov (2017)
BUSCO Applications from Quality Assessments to Gene Prediction and Phylogenomics. Molecular Biology and Evolution, 35
Elena Bushmanova, D. Antipov, A. Lapidus, Vladimir Suvorov, A. Prjibelski (2016)
rnaQUAST: a quality assessment tool for de novo transcriptome assemblies. Bioinformatics, 32(14)
Bo Li, N. Fillmore, Yongsheng Bai, Michael Collins, J. Thomson, R. Stewart, Colin Dewey (2014)
Evaluation of de novo transcriptome assemblies from RNA-Seq data. Genome Biology, 15
Alexander Dobin, C. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, M. Chaisson, Thomas Gingeras (2013)
STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1)
Thomas Wu, C. Watanabe (2005)
GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics, 21(9)
F. Mora-Márquez et al.

Summary: RNA-seq analysis usually requires large computing infrastructures. NGScloud is a bioinformatic system developed to analyze RNA-seq data using the cloud computing services of Amazon, which give access to ad hoc computing infrastructure scaled to the complexity of the experiment, so that costs and run times can be optimized. The application provides a user-friendly front-end to operate Amazon's hardware resources and to control an RNA-seq analysis workflow oriented to non-model species. It incorporates the cluster concept, which allows common RNA-seq analysis programs to run in parallel on several virtual machines for faster analysis.
Availability and implementation: NGScloud is freely available at https://github.com/GGFHF/NGScloud/. A manual detailing installation and how-to-use instructions is available with the distribution.
Contact: [email protected]

© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected]

1 Introduction

RNA-seq experiments often yield huge amounts of data, especially when several NGS libraries are involved. The algorithms used in the bioinformatic analyses are very complex, particularly those related to read assembly (Miller et al., 2010). Thus, the hardware requirements for RNA-seq analysis are very high in terms of CPUs and GiB of RAM, and computing infrastructure that fulfills such requirements is not always available in small research centers. In such cases, cloud computing is a solution that provides resizable computing capacity and therefore allows the hardware to be fitted to the nature of the experimental data. One of the main cloud computing solutions is the Elastic Compute Cloud (EC2), a service of Amazon Web Services (AWS). EC2 has a wide range of scalable instances that allow the experiment costs to be optimized, because the user pays only for the time the resources are in use. EC2 also provides immediacy, since a virtual machine can be booted in only a few minutes.

We present NGScloud, a bioinformatic system developed to analyze RNA-seq data using the cloud computing offered by EC2. NGScloud is oriented to non-model species whose reference genomes are not available, and it implements parallel runs on several virtual machines for faster analysis. The application aims to make it easier for researchers to use EC2 resources and to perform RNA-seq analysis.

2 Materials and methods

2.1 Software

NGScloud was programmed in Python3, and it runs on any computer whose OS supports Python3: Linux, Microsoft Windows, Mac OS X and other platforms. To work properly, NGScloud has the following dependencies on the user's local computer: (i) StarCluster, an open source cluster-computing toolkit for EC2 (http://star.mit.edu/cluster/); (ii) Boto3, the AWS SDK for Python (https://boto3.readthedocs.io/); (iii) Paramiko, an implementation of the SSHv2 protocol in Python (http://www.paramiko.org/) and (iv) AWS CLI (https://aws.amazon.com/cli/).

NGScloud offers a user-friendly front-end to operate the EC2 resources, to control the implemented RNA-seq workflow and to handle the data. NGScloud runs in graphical mode using the graphical user interface (GUI) by default, but it can also be run in console mode on server machines without a GUI installed. In addition, several free bioinformatic applications commonly used in RNA-seq workflows are easily set up from the front-end (Point 2.3).

2.2 Cloud computing

The NGScloud philosophy is based on the cluster concept. A cluster is a set of virtual machines of one AWS instance type. Each instance type has its own hardware features (machine type, CPU number, memory amount, etc.). Data volumes store data persistently, even when no cluster exists. NGScloud uses Amazon's EBS volumes to hold applications, read files, references, databases and results of analysis.

Through the NGScloud front-end, the user can easily: (i) create and terminate clusters; (ii) show the cluster composition; (iii) add and remove nodes dynamically in a cluster; (iv) create, remove, mount and unmount volumes; (v) submit jobs to the RNA-seq workflow and kill them; (vi) show the status of the batch job; (vii) view the log of every batch job to check that the program operated correctly and (viii) upload, download, compress, decompress and remove datasets.

When a cluster is created, it contains a single virtual machine, the master node. After the master node is created, subsidiary nodes can be added if necessary to run some processes in parallel. In this case, a new job runs on the node determined according to the workload.

2.3 RNA-seq workflow

The implemented RNA-seq workflow has the standard steps of an RNA-seq analysis in non-model species (López de Heredia and Vázquez-Poletti, 2016), including: (i) pre-processing: read quality assessment with FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc), trimming with Trimmomatic (Bolger et al., 2014) and the insilico-read-normalization procedure of Trinity; (ii) de novo assembly with SOAPdenovo-Trans (Xie et al., 2014), Trans-ABySS (Robertson et al., 2010) and Trinity (Grabherr et al., 2011; Haas et al., 2013), and reference-based assembly with STAR (Dobin et al., 2013); (iii) assessment of the assembly quality and transcript quantification with TransRate (Smith-Unna et al., 2016), GMAP (Wu and Watanabe, 2005), BUSCO (Waterhouse et al., 2017), QUAST (Gurevich et al., 2013), rnaQUAST (Bushmanova et al., 2016) and RSEM-EVAL, included in the DETONATE package (Li et al., 2014); (iv) post-filtering with CD-HIT-EST (Li and Godzik, 2006) and transcript-filter, included in the NGShelper package, which also has some other tools to perform the RNA-seq analysis workflow and (v) annotation with transcriptome-blast, which encapsulates blastx runs on several nodes and is included in NGShelper. The results may be downloaded for downstream analysis (Fig. 1).

Fig. 1. NGScloud architecture. NGScloud operates EC2 resources, submits workflow and manages datasets from RNA-seq experiments

The workflow steps are run separately by the user, with each step requiring the setup of the cloud resources. NGScloud is configured to read the output generated in each step as the required input file(s) of subsequent steps, or to download the output to a local machine. For instance, to perform an assembly, reads are uploaded, the assembly program runs in the cluster and the output is either downloaded or submitted to the next step of the RNA-seq workflow. Running bioinformatic applications is easy because the researcher is guided in the choice of the input files and the parameters to be used, which encapsulates the complexity of the command line. Multiple application runs can be executed in parallel by creating additional nodes. In addition, the annotation step supports parallelization, so it can use several nodes to increase the run speed.

3 Conclusions

NGScloud provides a user-friendly front-end to operate the EC2 resources and to control, in a modular way, the workflow of RNA-seq experiments oriented to non-model species. The application makes it possible to optimize the cost-efficiency ratio of RNA-seq experiments when appropriate computational facilities are not available.

Funding

This work has been supported by the projects SPIP2014-01093 (Spanish National Parks Agency, Ministry of Agriculture and AGL2015-67495-C2-2-R and FedCloudNet) (MINECO TIN2015-65469-P) (Spanish Ministry of Economy and Competitiveness) and by an Amazon Research Grant.

Conflict of Interest: none declared.

References

Bolger,A.M. et al. (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30, 2114–2120.
Bushmanova,E. et al. (2016) rnaQUAST: a quality assessment tool for de novo transcriptome assemblies. Bioinformatics, 32, 2210–2212.
Dobin,A. et al. (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29, 15–21.
Grabherr,M.G. et al. (2011) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol., 29, 644–652.
Gurevich,A. et al. (2013) QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29, 1072–1075.
Haas,B.J. et al. (2013) De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat. Protoc., 8, 1494–1512.
Li,B. et al. (2014) Evaluation of de novo transcriptome assemblies from RNA-Seq data. Genome Biol., 15, 553.
Li,W. and Godzik,A. (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22, 1658–1659.
López de Heredia,U. and Vázquez-Poletti,J.L. (2016) RNA-seq analysis in forest tree species: bioinformatic problems and solutions. Tree Genet. Genomes, 12, 30.
Miller,J.R. et al. (2010) Assembly algorithms for next-generation sequencing data. Genomics, 95, 315–327.
Robertson,G. et al. (2010) De novo assembly and analysis of RNA-seq data. Nat. Methods, 7, 909–912.
Smith-Unna,R. et al. (2016) TransRate: reference-free quality assessment of de novo transcriptome assemblies. Genome Res., 26, 1134–1144.
Waterhouse,R.M. et al. (2017) BUSCO applications from quality assessments to gene prediction and phylogenomics. Mol. Biol. Evol., 35, 1–6.
Wu,T.D. and Watanabe,C.K. (2005) GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics, 21, 1859–1875.
Xie,Y. et al. (2014) SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads. Bioinformatics, 30, 1660–1666.
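
Illustrative sketch (not part of NGScloud). Section 2.2 states that a cluster starts with only a master node, that subsidiary nodes can be added, and that a new job runs on the node chosen according to the workload. The paper does not say how workload is measured, so the toy Python model below assumes it is simply the number of jobs already assigned to each node. The names Node, Cluster, add_node and submit are hypothetical and do not come from the NGScloud code base; in a real cluster this decision would be taken by the cluster's job scheduler.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One virtual machine of the cluster (hypothetical model)."""
    name: str
    running_jobs: list = field(default_factory=list)


class Cluster:
    """Toy cluster: it is created with a single master node."""

    def __init__(self):
        self.nodes = [Node("master")]

    def add_node(self, name):
        # Subsidiary nodes are added on demand to run jobs in parallel.
        self.nodes.append(Node(name))

    def submit(self, job):
        # Assumption: workload = number of jobs assigned to the node.
        # min() returns the first minimum, so ties go to the oldest node.
        target = min(self.nodes, key=lambda node: len(node.running_jobs))
        target.running_jobs.append(job)
        return target.name


cluster = Cluster()
cluster.add_node("node001")
cluster.add_node("node002")

jobs = ("trinity", "soapdenovo-trans", "trans-abyss", "busco")
placements = [cluster.submit(job) for job in jobs]
print(placements)
```

With three nodes, the first three jobs each land on an idle node and the fourth returns to the master node, which is the behaviour the text describes for parallel runs of several assemblers.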
Bioinformatics – Oxford University Press
Published: May 3, 2018