Gene set analysis in the cloud

BIOINFORMATICS APPLICATIONS NOTE
Vol. 28 no. 2 2012, pages 294–295
doi:10.1093/bioinformatics/btr630
Systems biology
Advance Access publication November 13, 2011

Lu Zhang (1), Shengchang Gu (2), Yuan Liu (2), Bingqiang Wang (2) and Francisco Azuaje (1,*)

(1) Laboratory of Cardiovascular research, CRP-Santé, Luxembourg L-1150, Luxembourg and (2) Bioinformatics Center, BGI, Shenzhen 518083, China

(*) To whom correspondence should be addressed.

Associate Editor: Trey Ideker

ABSTRACT

Summary: Cloud computing offers low cost and highly flexible opportunities in bioinformatics. Its potential has already been demonstrated in high-throughput sequence data analysis. Pathway-based or gene set analysis of expression data has received relatively less attention. We developed a gene set analysis algorithm for biomarker identification in the cloud. The resulting tool, YunBe, is ready to use on Amazon Web Services. Moreover, here we compare its performance to those obtained with desktop and computing cluster solutions.

Availability and implementation: YunBe is open-source and freely accessible within the Amazon Elastic MapReduce service at s3n://lrcv-crp-sante/app/yunbe.jar. Source code and user’s guidelines can be downloaded from http://tinyurl.com/yunbedownload.

Contact: francisco.azuaje@crp-sante.lu

Received and revised on September 22, 2011; accepted on November 7, 2011

1 INTRODUCTION

The complexity and cost associated with recent advances in high-throughput ‘omic’ technologies have made cloud computing a cost-effective and powerful resource for bioinformatics. Existing platforms, such as those offered by Amazon Web Services (AWS), provide the computing environment, including CPUs, storage, processing memory, networking and operating systems, required to deploy computationally expensive algorithms and applications. Such environments allow users to configure and exploit resources on a ‘pay as you use’ basis (Fusaro et al., 2011). Although cloud computing applications are increasingly being made available for high-throughput DNA sequencing data, there is a need for publicly available algorithms that can enable other translational biomedical research applications, such as large-scale gene set analysis of expression data (Dudley et al., 2010). In this context, expression data of thousands of genes are mapped to biologically relevant sets of genes, e.g. curated biological pathways, and differential expression of such gene sets is estimated across phenotypes. The objective of our research is 2-fold: (i) to develop a cloud compute version of a published gene set analysis algorithm (Azuaje et al., 2010); and (ii) to perform a comparative analysis of performance across different computing platforms.

2 ALGORITHM

In our original gene set analysis algorithm, kipuMarkers, there are two main processing steps (Fig. 1A): expression data overlay and pathway scoring. In the overlay task the aim is to map expression measurements from samples (e.g. patients) onto gene sets (e.g. molecular pathways) provided by the user. In this step ‘activity levels’ for each sample-specific pathway are calculated. In the case considered here, an activity level refers to the mean expression value observed in a pathway. This is followed by the computation of a ‘perturbation score’ for each gene set, which is based on the comparison of differential activity levels across two sample groups defined in the data (e.g. disease classification).

We developed a cloud compute version of this method, YunBe, which is written in Java using the MapReduce framework (Fig. 1B). The overlay step corresponds to a matrix multiplication task: gene expression data matrix (M) is multiplied by a pathway matrix (N) to produce a new matrix (K) with samples matched to the pathways available. Matrix M is uploaded to the Hadoop Distributed File System (HDFS), while the matrix N is uploaded to a distributed cache system upon execution. Thus, in each processing iteration, sample data are ‘multiplied’ by (i.e. mapped to) the entire pathway collection, one by one. Hence, a mapping task processes two inputs: samples (from matrix M) and pathways (from matrix N) to produce a key/value pair, where key is a pathway ID and value is an ‘activity value’. The ‘reduce’ phase connects all values associated with the same key (pathway ID), followed by the calculation of ‘perturbation scores’ and P-values for each pathway as reported in (Azuaje et al., 2010).

3 DATA AND IMPLEMENTATION

To test YunBe we analyzed published and simulated gene expression datasets: a human liver gene expression dataset (Schadt et al., 2008) and synthetic datasets of varying sizes. The liver dataset includes 466 samples from 31 842 transcripts and grouped by gender, i.e. two-class phenotype. To generate the simulated data, we first selected transcript names lists from the Agilent Whole Human Genome Oligo Microarray (i.e. 19 634 transcript names). We then randomly grouped 1000 samples and computed (normally distributed) values for each transcript in the sample. As gene sets, we used a canonical pathway list with 880 gene sets from the Molecular Signatures DataBase (MSigDB) (Subramanian et al., 2005).

Gene expression data, pathways and YunBe Jar files were uploaded to an Amazon S3 bucket. A job flow was created with Amazon Elastic MapReduce service (Fig. 1B) with m1.large instance type. A m1.large instance represented a 64-bit platform with two virtual cores. Each virtual core has two EC2 Compute Units that individually equals the CPU capacity of a 1.0-1.2 GHz 2007 Xeon processor. We compared YunBe’s execution speeds with a program version running on a computing cluster, which consisted of dual socket quad-core Intel E5430 Harpertown CPUs. In this analysis, 1, 2, 4 and 8 EC2 m1.large instances were compared with 2, 4, 8 and 16 cluster cores running at BGI. We also made comparisons with a desktop program running on a duo-core Intel E7500 Wolfdale CPU.

Fig. 1. Gene set analysis in the cloud. (A) Analysis pipeline. (B) MapReduce implementation on Amazon EC2. (C) Performance of desktop program, Amazon EC2-based YunBe and BGI cluster on liver and simulated data.

4 COMPARATIVE ANALYSIS RESULTS

The desktop computer version of our algorithm required 120 and 173 min (wall clock time) to perform gene set analysis of the liver and simulated datasets, respectively. In both datasets, YunBe reduced the execution time from hours to minutes (Fig. 1C). The execution time was significantly reduced even when using only two cores. In the case of the liver dataset and in relation to the desktop implementation, speedups of 10.9 and 24.1 times were obtained with the Amazon EC2 and BGI cluster, respectively. Major execution time improvements were also observed on the simulated dataset: 8.6 and 16.4 faster with Amazon EC2 and BGI cluster, respectively. Overall, the BGI platform produced faster results than AWS. This result may be expected due to overheads incurred by the cloud’s virtualization layer. Another factor to take into account is that speed also depends on the specific hardware utilized for execution. Note that in our analyses, we equalized the bit size of computer architecture (64-bit) and the number of cores between Amazon EC2 and BGI cluster. Nevertheless, other factors, such as memory and I/O performance, may have influenced our comparison. Moreover, differences in networking hardware, inter-node communication and geographical distance should be considered when interpreting observed differences in speed.

YunBe’s running time scales with nearly linear speedup over the desktop program performance as the number of cores increases (Fig. 1C). For instance, on Amazon EC2 and liver data, we obtained a speedup of 11 and 20 for 2 and 4 virtual cores, respectively. For the simulated data, the speedup was of 8 and 14 for 2 and 4 cores, respectively. However, such proportional increases were not observed above eight cores, more significantly on Amazon EC2. In theory, MapReduce computations are independent, and therefore the (wall clock) running time should scale linearly with the number of processor cores available. In practice, the speedup of a program running on multiple processors may be limited by serial processing overheads, as described by Amdahl’s law. YunBe analyses on the AWS are relatively inexpensive. For example, a full analysis of the liver dataset requiring eight virtual cores was completed for about US$ 1.7 (∼EUR 1.2).

In conclusion, we offer YunBe, a new open-source gene set analysis tool for the cloud. YunBe is freely available and ready to run on AWS. We showed how, in comparison to a desktop implementation, YunBe significantly improves execution times. YunBe can accelerate pathway-based biomarker identification through inexpensive and secure distributed computing.

ACKNOWLEDGEMENTS

We thank Y. Devaux, D. Wagner, J.C. Schmit, L. Fang and G. Cao for supporting this Luxembourg-China partnership.

Funding: National Research Fund (FNR) of Luxembourg (AM2c Programme).

Conflict of Interest: none declared.

REFERENCES

Azuaje,F. et al. (2010) Integrative pathway-centric modeling of ventricular dysfunction after myocardial infarction. PLoS One, 5, e9661.

Dudley,J.T. et al. (2010) Translational bioinformatics in the cloud: an affordable alternative. Genome Med., 2, 51.

Fusaro,V.A. et al. (2011) Biomedical cloud computing with Amazon web services. PLoS Comput. Biol., 7, e1002147.

Schadt,E.E. et al. (2008) Mapping the genetic architecture of gene expression in human liver. PLoS Biol., 6, e107.

Subramanian,A. et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA, 102, 15545–15550.
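
The map and reduce phases described in the ALGORITHM section can be illustrated with a minimal, single-process sketch. This is not the YunBe implementation, which is written in Java on the MapReduce framework; it only mimics the key/value flow in plain Python. The function names (`map_sample`, `group_by_key`, `reduce_pathway`, `score_pathways`), the toy data and the group labels are all hypothetical. The activity level follows the text (mean expression of a pathway's genes in a sample), but the score computed in the reduce step is a simple stand-in, the difference of group means, and is an assumption: the actual perturbation score and P-value are defined in Azuaje et al. (2010) and are not reproduced here.

```python
from collections import defaultdict
from statistics import mean


def map_sample(sample_id, expression, pathways):
    """Map phase: emit (pathway_id, (sample_id, activity)) for one sample.

    The activity level is the mean expression of the pathway's genes
    that were measured in this sample.
    """
    for pathway_id, genes in pathways.items():
        values = [expression[g] for g in genes if g in expression]
        if values:  # skip pathways with no measured genes
            yield pathway_id, (sample_id, mean(values))


def group_by_key(pairs):
    """Shuffle: collect all values emitted under the same pathway ID."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_pathway(activities, groups, case, control):
    """Reduce phase for one pathway.

    ASSUMPTION: difference of group mean activity is a stand-in score,
    not the perturbation score or P-value of Azuaje et al. (2010).
    Both groups are assumed to contain at least one sample.
    """
    case_vals = [a for s, a in activities if groups[s] == case]
    control_vals = [a for s, a in activities if groups[s] == control]
    return mean(case_vals) - mean(control_vals)


def score_pathways(samples, pathways, groups, case, control):
    """Run map, shuffle and reduce over all samples and pathways."""
    pairs = (kv for sid, expr in samples.items()
             for kv in map_sample(sid, expr, pathways))
    return {pid: reduce_pathway(acts, groups, case, control)
            for pid, acts in group_by_key(pairs).items()}


if __name__ == "__main__":
    # Hypothetical toy data: 4 samples, 3 genes, 2 pathways, 2 groups.
    samples = {
        "s1": {"g1": 1.0, "g2": 3.0, "g3": 5.0},
        "s2": {"g1": 2.0, "g2": 4.0, "g3": 6.0},
        "s3": {"g1": 5.0, "g2": 7.0, "g3": 1.0},
        "s4": {"g1": 6.0, "g2": 8.0, "g3": 2.0},
    }
    pathways = {"P1": ["g1", "g2"], "P2": ["g2", "g3"]}
    groups = {"s1": "F", "s2": "F", "s3": "M", "s4": "M"}
    print(score_pathways(samples, pathways, groups, case="M", control="F"))
    # prints {'P1': 4.0, 'P2': 0.0}
```

In matrix terms, each `map_sample` call computes one row of K from one row of M and the whole of N, which is why the pathway collection is the input that has to be available to every mapper.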

Publisher
Oxford University Press
Copyright
© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/btr630
pmid
22084254

Journal

Bioinformatics, Oxford University Press

Published: Nov 13, 2011
