RnaSeqSampleSize: real data based sample size estimation for RNA sequencing

Abstract

Background: One of the most important and often neglected components of a successful RNA sequencing (RNA-Seq) experiment is sample size estimation. A few negative binomial model-based methods have been developed to estimate sample size based on the parameters of a single gene. However, thousands of genes are quantified and tested for differential expression simultaneously in RNA-Seq experiments. Thus, additional issues should be carefully addressed, including the false discovery rate for multiple statistical tests and the widely distributed read counts and dispersions of different genes.

Results: To solve these issues, we developed a sample size and power estimation method named RnaSeqSampleSize, based on the distributions of gene average read counts and dispersions estimated from real RNA-Seq data. Datasets from previous, similar experiments such as The Cancer Genome Atlas (TCGA) can be used as a point of reference. Read counts and their dispersions were estimated from the reference's distribution; using that information, we estimated and summarized the power and sample size. RnaSeqSampleSize is implemented in the R language and can be installed from the Bioconductor website. A user-friendly web graphic interface is provided at http://cqs.mc.vanderbilt.edu/shiny/RnaSeqSampleSize/.

Conclusions: RnaSeqSampleSize provides a convenient and powerful way to estimate power and sample size for an RNA-Seq experiment. It is also equipped with several unique features, including estimation for genes or pathways of interest, power curve visualization, and parameter optimization.

Keywords: RNA-Seq, Sample size, Power analysis, Simulation

* Correspondence: shyr.yu@vanderbilt.edu. Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37232, USA. Full list of author information is available at the end of the article.

© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Zhao et al. BMC Bioinformatics (2018) 19:191

Background

RNA sequencing is a powerful NGS tool that has been widely used in differential gene expression studies [1]. One of the most important steps in designing an RNA sequencing experiment is selecting the optimal number of biological replicates to achieve a desired statistical power (sample size estimation), or estimating the likelihood of successfully finding statistical significance in the dataset (power estimation). An insufficient number of replicates may lead to unreliable conclusions, whereas too many replicates may result in a waste of time and resources. The tradeoff between cost and study power needs to be carefully balanced. To address this issue, several attempts have been made to estimate power and sample size for RNA-Seq experiments.

Sample size and power analysis have been well established for traditional biological studies such as genome-wide association studies (GWAS) and microarray gene expression studies [2, 3]. In earlier RNA-Seq studies, the analysis was based on the Poisson distribution, because RNA-Seq data can be represented as read counts [4–6]. It was discovered, however, that the Poisson distribution does not fit the empirical data because of over-dispersion, mainly caused by natural biological variation [7, 8]. To address this issue, a few negative binomial distribution-based methods have been developed. These methods provide researchers with more flexibility in assigning between-sample variations [9–13]. Hart et al. [14] proposed a power analysis method based on the score test for single-gene differential expression analysis. This method has been implemented in Bioconductor as RNASeqPower. To handle multiple gene comparisons, Li et al. [15] proposed a power analysis method that controls the false discovery rate. To incorporate the experiment's budget into the power analysis, Wu et al. [16] introduced the concepts of stratified power by coverage or biological variation and the cost of false discovery, then proposed a simulation-based method for power analysis. The method was implemented as the Bioconductor package PROPER.

However, the majority of the previous methods have several limitations: they do not properly account for the average read counts and dispersions of different genes; they lack proper reference data; and they lack easy, user-friendly interfaces. The average read counts of genes are distributed over a range of more than four orders of magnitude, and their dispersions are highly dependent on their gene expression level [17, 18]. Previous estimation methods were not designed for these distributions, so they often used one value chosen conservatively or by experience [9, 10], which often resulted in an over-estimated sample size. Yu et al. [18] have introduced a simulation-based procedure which considers the dependence between gene expression level and dispersion, but this method has not been made into easy-to-use software. Additionally, it is computationally expensive to apply these methods to every gene in the dataset, because the individual power analysis for the exact test involves infinite sums, and the study's overall power is estimated from a summation of the individual powers. A proper approach is to provide reference data with distributions similar to those of the current experiment. We acknowledge that such data may not be available for every project type and that a significant amount of programming and data processing effort is needed to use them.

To address the aforementioned problems, we used previous methods [15] as the foundation for developing the RnaSeqSampleSize package, which controls the false discovery rate (FDR) of multiple testing and uses the average read count and dispersion distributions from real data to estimate a more reliable sample size. The package is also optimized for running efficiency and provides additional features, which we demonstrate using real RNA-Seq data.

Results and discussion

The detailed feature list of the RnaSeqSampleSize package is shown in Fig. 1.

Fig. 1 RnaSeqSampleSize package workflow

Sample size estimation with single average read count and dispersion

RnaSeqSampleSize was developed based on the sample size and power estimation methods described in the previous study [10], and it greatly improves on the compatibility and efficiency of older methods. In this new implementation, a minimal average read count and a maximal dispersion are used to represent all genes in the RNA sequencing experiment, so that a conservative sample size or power can be estimated. More importantly, RnaSeqSampleSize is compatible with large average read counts and dispersions, supporting average read counts as high as 2000. We reduced the running time of the method from 40 min to two seconds for most of the widely used parameters (Additional file 1: Table S1).

Sample size estimation with real data

As previously stated, average read counts and dispersions for genes have wide distributions within a single RNA sequencing experiment. A tiny fluctuation in the average read count or dispersion will greatly influence the estimated power or sample size (Fig. 2). For example, in the TCGA Rectum adenocarcinoma (READ) dataset, the genes have dispersions from 0 to 10, and the average read counts range from one to several thousand (Fig. 2a). In such a scenario, sample size estimation from a single value is inaccurate. We computed that the estimated sample size increased from 10 to 302 when the minimal average read count changed from one to 30 and the maximal dispersion changed from 0.1 to three (Fig. 2b).

Instead of relying on guesses about the future data's distribution, RnaSeqSampleSize uses data parameters from previous studies. To demonstrate this feature, we compared the usual sample size estimation approach to RnaSeqSampleSize's empirical approach. We used the datasets from TCGA as the reference data to estimate the true data distribution. TCGA is a cancer consortium data set considered to be the most representative dataset for cancer RNA-Seq. For the usual approach, following common standards, we set the minimum average read count to one or 10 and the maximum dispersion to the 95% quantile of all gene dispersions, while the empirical data-based method used the empirical average read count and dispersion distributions computed from TCGA (Additional file 1: Table S2 and S3).

The estimated sample size obtained using the empirical data-based method was smaller in all parameter combinations (Additional file 1: Table S2 and S3). For example, the estimated sample size for the TCGA READ dataset was 168 when the following parameters were used: fold change of 2; desired power of 0.8; FDR of 0.05; minimal read count of 10; dispersion of 2.0 (Additional file 1: Table S2, in boldface). This estimated sample size is larger than the sample size actually required, because we used the lowest read count and highest dispersion to represent all genes, even though most genes have far less extreme values (red lines and green line in Fig. 2a).

In the empirical data-based method, genes were randomly selected from the reference dataset, and the power was estimated for each of the selected genes (Fig. 3). With the same desired power and FDR, the estimated sample size was 42 (Additional file 1: Table S3, in boldface). This result is a better representation of the genes in an RNA-Seq experiment, and it is substantially smaller than the result obtained using the traditional method. More importantly, the empirical data-based method can reflect the gene expression patterns of different datasets. For example, genes in the TCGA Breast Invasive Carcinoma (BRCA) and READ datasets have similar read count distributions (Fig. 3a), but genes in the BRCA dataset have higher dispersions than those in READ (Fig. 3b). Thus, when analyzing genes in the READ dataset, we obtain a higher power distribution (Fig. 3c and d), suggesting that fewer samples are needed to analyze rectum adenocarcinoma samples with a desired power.

Sample size estimation for interested genes or pathways

In certain situations, researchers may be interested in a subset of genes defined by certain features such as shared pathways or gene ontology categories, rather than the entire gene set. In such scenarios, the sample size estimation method needs to be adjusted, because the gene subsets of interest may have distinct expression patterns compared to other genes [19]. RnaSeqSampleSize was designed to handle sample size and power analysis in such experiment designs by allowing users to provide a list of genes of interest or a KEGG pathway ID; this ensures that only the read count and dispersion distributions of the genes of interest, or of the genes in the selected pathway, are used for estimation.
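To make the contrast between the single-value approach and the empirical approach concrete, the following Python sketch compares the two on a made-up reference. It is only an illustration under stated assumptions: the per-gene power uses a normal (Wald) approximation on the log fold change as a stand-in for the exact-test power used by RnaSeqSampleSize, the significance level is treated as already adjusted for FDR, and the reference genes, their values, and all function names are hypothetical rather than taken from the package or from TCGA.

```python
import math
import random
from statistics import NormalDist, mean

_NORMAL = NormalDist()


def approx_power(n, mu, phi, rho=2.0, alpha=0.05):
    """Approximate two-sided power for one gene with n samples per group.

    Normal (Wald) approximation on the log fold change, used here as a
    stand-in for the exact-test power; alpha is taken as already adjusted.
    """
    # Delta-method variance of the estimated log fold change.
    variance = (1.0 / mu + phi) / n + (1.0 / (rho * mu) + phi) / n
    shift = abs(math.log(rho)) / math.sqrt(variance)
    z = _NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return _NORMAL.cdf(shift - z) + _NORMAL.cdf(-shift - z)


def smallest_n(power_at, target=0.8, n_max=10000):
    """Smallest per-group n whose power reaches the target (linear scan)."""
    for n in range(2, n_max + 1):
        if power_at(n) >= target:
            return n
    raise ValueError("target power not reached within n_max")


def make_toy_reference(n_genes=500, seed=7):
    """Hypothetical reference: gene id -> (mean count, dispersion)."""
    rng = random.Random(seed)
    reference = {}
    for i in range(n_genes):
        mu = 2.0 ** rng.uniform(4.0, 14.0)
        phi = 0.1 + 1.0 / math.sqrt(mu)  # lower counts get higher dispersion
        reference[f"gene{i}"] = (mu, phi)
    return reference


def empirical_power(n, reference, gene_ids=None):
    """Mean power over all reference genes, or over a subset of interest."""
    ids = list(reference) if gene_ids is None else list(gene_ids)
    return mean(approx_power(n, *reference[g]) for g in ids)


reference = make_toy_reference()
means = [mu for mu, _ in reference.values()]
dispersions = sorted(phi for _, phi in reference.values())
q95 = dispersions[int(0.95 * (len(dispersions) - 1))]

# Single-value approach: lowest count and 95% dispersion quantile for all genes.
conservative_n = smallest_n(lambda n: approx_power(n, min(means), q95))
# Empirical approach: average power over the whole reference distribution.
empirical_n = smallest_n(lambda n: empirical_power(n, reference))

print("conservative n per group:", conservative_n)
print("empirical n per group:", empirical_n)
```

Passing a list of gene identifiers as `gene_ids` restricts the average to a subset, which mirrors the idea of estimating power only for the genes of a selected pathway.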
As illustrated in Fig. 4a and b, genes in the Proteasome (KEGG pathway 03050), Calcium Signaling (KEGG pathway 04020) and Pathways in Cancer (KEGG pathway 05200) pathways have distinguishable read count and dispersion distributions in the TCGA READ dataset. The genes in the Proteasome pathway have very high read counts and low dispersions, whereas genes in the Calcium Signaling pathway have low read counts and high dispersions, which may be a reflection of their functions related to rectum adenocarcinoma (Fig. 4c and d, Additional file 1: Table S4).

Furthermore, we demonstrated that sample size estimates for the KEGG pathway "Pathways in Cancer" differ across TCGA datasets (Additional file 1: Table S5). RnaSeqSampleSize estimated a sample size of 45 for "Pathways in Cancer" genes (Additional file 1: Table S5, in boldface) vs 42 for all genes (Additional file 1: Table S3, in boldface) in the READ dataset when the same parameters as before are used.

Fig. 2 Read counts and dispersion distribution greatly influence the estimated sample size and power. a The read counts and dispersion distribution for all genes from the TCGA Rectum adenocarcinoma (READ) dataset. The red lines indicate read counts equal to one and 10, and the green line indicates the 95% quantile of all gene dispersions. b The estimated sample size required to achieve 0.8 power for different combinations of read counts and dispersions

Fig. 3 Sample size estimation with real data. a The read counts distribution for all genes from the TCGA Breast Invasive Carcinoma (BRCA) and Rectum adenocarcinoma (READ) datasets. b The dispersion distribution for all genes from the TCGA BRCA and READ datasets. c The power distribution based on the count and dispersion distributions in the TCGA BRCA dataset when sample size equals 71. The red lines indicate the mean value of the power distribution. d The power distribution based on the count and dispersion distributions in the TCGA READ dataset when sample size equals 71. The red lines indicate the mean value of the power distribution

Power curve visualization for different parameters

Power curves are widely used to analyze and compare sample size estimation results. To demonstrate the power curve visualization feature in RnaSeqSampleSize, we produced three power curves based on different scenarios. As displayed in Fig. 5a, the X-axis indicates the total number of samples used in the two groups, and the Y-axis indicates the estimated power. There are three types of sample allocation design: 1:1 sample size in the two groups (red curve); 2:1 sample size in the two groups (blue curve); 3:1 sample size in the two groups (purple curve). The relationship between power and the number of samples can be easily visualized. In the example displayed in Fig. 5a, the power curves indicate that the balanced (sample size 1:1) experiment design (red curve) will achieve the highest power when the same total number of samples is used.
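The effect of sample allocation on power can be reproduced qualitatively with a short Python sketch. This is not the package's plotting code: the power function is a normal (Wald) approximation on the log fold change rather than the exact test, and the mean count, dispersion, fold change, significance level, and function names are assumptions chosen only for illustration.

```python
import math
from statistics import NormalDist

_NORMAL = NormalDist()


def approx_power(n_control, n_treated, mu=50.0, phi=0.5, rho=2.0, alpha=0.05):
    """Approximate two-sided power for possibly unequal group sizes.

    Wald approximation on the log fold change (an assumption of this
    sketch, not the exact test); group sizes may be fractional.
    """
    variance = (1.0 / mu + phi) / n_control + (1.0 / (rho * mu) + phi) / n_treated
    shift = abs(math.log(rho)) / math.sqrt(variance)
    z = _NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return _NORMAL.cdf(shift - z) + _NORMAL.cdf(-shift - z)


def power_curve(totals, ratio):
    """Power at each total sample count when groups are split ratio:1."""
    curve = []
    for total in totals:
        n_control = total * ratio / (ratio + 1.0)
        n_treated = total / (ratio + 1.0)
        curve.append(approx_power(n_control, n_treated))
    return curve


totals = list(range(12, 241, 12))
curves = {f"{ratio}:1": power_curve(totals, ratio) for ratio in (1, 2, 3)}

# Print the three curves side by side, one row per total sample count.
print("total   1:1    2:1    3:1")
for i, total in enumerate(totals):
    row = "  ".join(f"{curves[key][i]:.3f}" for key in ("1:1", "2:1", "3:1"))
    print(f"{total:5d}  {row}")
```

Under this approximation the variance of the estimated log fold change is a/n0 + b/n1, with a and b the per-group terms, so for a fixed total the best split is n0/n1 = sqrt(a/b). When dispersion dominates the two terms, a and b are nearly equal and the balanced design comes out ahead, in line with the pattern described for Fig. 5a.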
Parameter optimization for experiment design

RNA-Seq experiment design is often limited by the available budget. The optimization feature in RnaSeqSampleSize can be used to identify the parameters that achieve the highest power while staying under the budget. To demonstrate the parameter optimization feature, we optimized the number of samples and the read count while fixing all other parameters (fold change: 2; dispersion: 1; FDR: 0.05) by generating a power matrix (Fig. 5b). The estimated power was less than 0.1 when 16 samples were used, even if the read count was as high as 96. When the number of samples increased to 96, however, the estimated power increased to 0.8, even when the read count was as low as eight. This matrix indicates that the number of samples plays a more significant role in determining the power than the read count, which is consistent with a previous report [20].

Fig. 4 Sample size estimation for interested genes. a The read counts distribution for genes in three KEGG pathways in the TCGA READ dataset. b The dispersion distribution for genes in three KEGG pathways in the TCGA READ dataset. c The power distribution based on the count and dispersion distributions in the TCGA READ dataset for genes in the Calcium signaling pathway when sample size equals 71. The red lines indicate the mean value of the power distribution. d The power distribution based on the count and dispersion distributions in the TCGA READ dataset for genes in the Proteasome pathway when sample size equals 71. The red lines indicate the mean value of the power distribution

Fig. 5 Power curve visualization and parameter optimization by RnaSeqSampleSize. a Power curves for balanced (same sample size in two groups) and unbalanced (different sample size in two groups) experiment designs. The power curves indicate that the balanced experiment design (red line) will achieve the highest power with the same total number of samples. b Optimization of parameters in sample size estimation. The dispersion and fold change were set as 0.5 and two, respectively. A power matrix with different pairs of numbers of samples and read counts was generated. The power distribution indicates that the number of samples plays a more significant role in determining the power, and suggests that at least 96 samples should be used in RNA-Seq experiments with these parameters to get 0.8 power

Material and methods

Software development and data acquisition

RnaSeqSampleSize was developed in the R language [21] and compiled into a software package following the guidelines of Bioconductor [22].

The web interface of RnaSeqSampleSize was developed using the Shiny package (http://shiny.rstudio.com/) in the R language [21].

The TCGA data used as real data examples in RnaSeqSampleSize were downloaded from the Genomic Data Commons Data Portal (https://gdc-portal.nci.nih.gov/). The gene expression data of 13 types of cancers were downloaded and included as references in the RnaSeqSampleSizeData package in Bioconductor [22].

Algorithm

In previous research, we reported a sample size calculation method based on the exact test for a single-gene comparison [10]. In this method, we used the concept of pseudo counts [23, 24].
Because the question of interest is to identify differential gene expression between two groups, the corresponding testing hypothesis is

$$H_0: \gamma_0 = \gamma_1 \quad \text{vs} \quad H_1: \gamma_0 \neq \gamma_1,$$

where $\gamma_i$ represents the gene expression level of group $i$ ($i = 0, 1$). In order to perform sample size calculations, it is necessary to construct a power function for the test described above. The power of a test is the probability that the null hypothesis is rejected when the alternative hypothesis is true. For a given marginal type I error level $\alpha$ to reject the null hypothesis, the power can be expressed as

$$\varepsilon(n, \rho, \mu_g, \varphi_g, \omega, \alpha) = \sum_{y_1=0}^{\infty} \sum_{y_0=0}^{\infty} f\left(n\omega\rho\mu_g, \frac{\varphi_g}{n}\right) f\left(n\mu_g, \frac{\varphi_g}{n}\right) I\left(p(y_1, y_0) < \alpha\right) \qquad (1)$$

where $g$ is the single gene in comparison; $n$ is the number of samples in each group; $\rho$ is the fold change between the two groups; $\mu_g$ is the average read count for gene $g$ in the control group; $\varphi_g$ is the dispersion parameter for gene $g$ in the control group; $\omega$ is the geometric mean of normalization factors between the two groups; $\alpha$ is the type I error rate; $y_1$ and $y_0$ are pseudo counts [25] in the two groups; $f(\mu, \varphi)$ is the probability mass function of the negative binomial distribution with mean $\mu$ and dispersion $\varphi$ (evaluated at $y_1$ and $y_0$, respectively, in the two factors); and $I(p(y_1, y_0) < \alpha)$ denotes the indicator function for the p value of the exact test [25].

In reality, thousands of genes are examined in an RNA-Seq experiment, and those genes are tested simultaneously for significance of differential expression. In such cases, the multiple testing problem should be considered. In previous research [10], we proposed a method that controls the false discovery rate (FDR), which is defined as the expected proportion of false discoveries among rejected null hypotheses. In this method, the marginal type I error level $\alpha$ is adjusted to $\alpha^*$ to guarantee the expected number of true rejections at a given FDR.

To calculate the sample size, we need to pre-specify the parameters estimated from the differentially expressed genes. However, we may not be able to know or determine which genes are differentially expressed in a real dataset. To deal with this issue, we assume that the distributions of the average read count in the control group ($\mu_g$ in formula 2) and of the dispersion ($\varphi_g$ in formula 2) for differentially expressed genes are the same as for all genes. We then randomly select genes from the real data set and treat them as differentially expressed. Functions in the edgeR package were wrapped and used to estimate the average read count and dispersion distributions. If $M_1$ is the number of differentially expressed genes in the dataset, we randomly select $M_1$ genes from all genes and use their average read counts and dispersions from the distribution to represent the differential genes. As a result, the power of detecting these $M_1$ differential genes can be calculated as

$$\text{Power} = \frac{\sum_{g=1}^{M_1} \varepsilon(n, \rho, \mu_g, \varphi_g, \omega, \alpha)}{M_1} \qquad (2)$$

The value of the power in formula (2) is highly dependent on the selected differential genes. When the number of differentially expressed genes is small, different genes will be selected in each replication, which results in considerable variation in the power across replications. Motivated by the ensemble method in machine learning, we average all the powers calculated from the replications to obtain a robust estimation of power.
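The averaging in formula (2), combined with repeated random selection of genes, can be sketched in Python as follows. This is a minimal sketch under explicit assumptions: `gene_power` replaces the exact-test power of formula (1) with a normal (Wald) approximation on the log fold change, the significance level is treated as the already adjusted one, and the reference list and all names are hypothetical and not part of RnaSeqSampleSize or edgeR.

```python
import math
import random
from statistics import NormalDist, mean, pstdev

_NORMAL = NormalDist()


def gene_power(n, mu, phi, rho=2.0, alpha=0.05):
    """Stand-in for the per-gene power of formula (1).

    Wald approximation on the log fold change with n samples per group;
    this is an assumption of the sketch, not the exact-test power.
    """
    variance = (1.0 / mu + phi) / n + (1.0 / (rho * mu) + phi) / n
    shift = abs(math.log(rho)) / math.sqrt(variance)
    z = _NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return _NORMAL.cdf(shift - z) + _NORMAL.cdf(-shift - z)


def replicated_power(n, reference, m1, replications=300, seed=1):
    """Average formula (2) over repeated random draws of m1 genes.

    Returns the overall mean and the list of per-replication powers.
    """
    rng = random.Random(seed)
    per_replication = []
    for _ in range(replications):
        chosen = rng.sample(reference, m1)  # genes treated as differential
        powers = [gene_power(n, mu, phi) for mu, phi in chosen]
        per_replication.append(mean(powers))  # formula (2) for this draw
    return mean(per_replication), per_replication


# Hypothetical reference distribution of (mean count, dispersion) pairs.
_rng = random.Random(11)
reference = []
for _ in range(1000):
    mu = 2.0 ** _rng.uniform(3.0, 13.0)
    reference.append((mu, 0.1 + 1.0 / math.sqrt(mu)))

mean_small, powers_small = replicated_power(10, reference, m1=5)
mean_large, powers_large = replicated_power(10, reference, m1=200)

print("M1 = 5:   mean power", round(mean_small, 3),
      "spread", round(pstdev(powers_small), 3))
print("M1 = 200: mean power", round(mean_large, 3),
      "spread", round(pstdev(powers_large), 3))
```

With few selected genes the per-replication powers scatter widely, while their average stays close to the value obtained with many genes, which is the motivation for summarizing over replications.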
This re-sampling process was repeated several times (1000 by default) to get a power distribution, and the power distribution was summarized (averaged by default) to obtain a robust estimation of power. Then, the RnaSeqSampleSize package uses a numerical approach to find the n at which the robust estimation of power is equal to the desired level.

Conclusion

Sample size estimation is a critical step in RNA sequencing experimental design. It provides an important solution for balancing the number of samples and the statistical power. Here, we presented the power and sample size estimation software RnaSeqSampleSize to overcome the current limitations and provide a less conservative yet more accurate and reliable result. RnaSeqSampleSize provides more efficient computations compared to previous methods; additionally, it provides several novel visualization and optimization features as well as a much desired graphical user web interface, which allows investigators without a background in programming to easily conduct sample size calculation (Additional file 1: Figure S1).

What separates RnaSeqSampleSize from the other RNA-Seq power analysis tools is its usage of reference data, which can help generate reliable read count and dispersion distributions. We preloaded the TCGA datasets for users without reference data. The TCGA dataset provides a comprehensive reference for cancer tissue samples, but reference datasets for non-cancer or non-tissue samples are not currently included. As more and more RNA sequencing datasets become publicly available, we will continually update the reference dataset.

Availability

Home page: http://www.bioconductor.org/packages/release/bioc/html/RnaSeqSampleSize.html
Web interface: http://cqs.mc.vanderbilt.edu/shiny/RnaSeqSampleSize/

Additional file

Additional file 1: Table S1. The improvement in efficiency in RnaSeqSampleSize package. Table S2. Estimated sample size for RNA-Seq experiments in different cancer types by single parameter method. Table S3. Estimated sample size for RNA-Seq experiments in different cancer types by real data distribution based method. For each cancer type, we used the related TCGA dataset to estimate the read count and dispersion distribution. Table S4. Estimated sample size for RNA-Seq experiments in different cancer types by real data distribution based method, only the genes in the interested KEGG pathway were considered. Table S5. Estimated sample size for RNA-Seq experiments in different cancer types by real data distribution based method, only the genes in KEGG pathway ID 05200 (Pathways in Cancer) were considered. Figure S1. A screen shot of the user interface of RnaSeqSampleSize package. (DOCX 217 kb)

Acknowledgements

The authors wish to thank Michael Smith for providing editorial work and William Gray for providing website support for this manuscript.

Funding

This work was supported by grants P30CA068485, P50CA095103, P50CA098131, and U24CA163056.

Availability of data and materials

The datasets used in the study are available in the RnaSeqSampleSizeData package in Bioconductor (http://master.bioconductor.org/packages/release/data/experiment/html/RnaSeqSampleSizeData.html).

Authors' contributions

SZ improved the algorithm, developed the package and wrote the manuscript. CL developed the algorithm. YG improved the package and the manuscript. QS participated in improving the package. YS designed and guided the project. All authors read and approved the final manuscript.

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author details

Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37232, USA. Department of Statistics, National Cheng Kung University, Tainan 70101, Taiwan. Department of Internal Medicine, University of New Mexico, Albuquerque, NM 87131, USA. Key Laboratory of Resource Biology and Biotechnology in Western China, School of Life Sciences, Northwest University, Xi'an 710069, Shanxi, China.

Received: 8 August 2017 Accepted: 7 May 2018

References

1. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57–63.
2. Jung SH, Bang H, Young S. Sample size calculation for multiple testing in microarray data analysis. Biostatistics. 2005;6(1):157–69.
3. Müller P, Parmigiani G, Robert C, Rousseau J. Optimal sample size for multiple testing: the case of gene expression microarrays. J Am Stat Assoc. 2004;99(468):990–1001.
4. Busby MA, Stewart C, Miller CA, Grzeda KR, Marth GT. Scotty: a web tool for designing RNA-Seq experiments to measure differential gene expression. Bioinformatics. 2013;29(5):656–7.
5. Chen Z, Liu J, Ng HK, Nadarajah S, Kaufman HL, Yang JY, Deng Y. Statistical methods on detecting differentially expressed genes for RNA-seq data. BMC Syst Biol. 2011;5(Suppl 3):S1.
6. Fang Z, Cui X. Design and validation issues in RNA-seq experiments. Brief Bioinform. 2011;12(3):280–7.
7. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010;11(3):R25.
8. Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11(10):R106.
9. Hart SN, Therneau TM, Zhang Y, Poland GA, Kocher JP. Calculating sample size estimates for RNA sequencing data. J Comput Biol. 2013;20(12):970–8.
10. Li CI, Su PF, Shyr Y. Sample size calculation based on exact test for assessing differential expression analysis in RNA-seq data. BMC Bioinformatics. 2013;14:357.
11. Liu Y, Zhou J, White KP. RNA-seq differential expression studies: more sequence or more replication? Bioinformatics. 2014;30(3):301–4.
12. Ching T, Huang S, Garmire LX. Power analysis and sample size estimation for RNA-Seq differential expression. RNA. 2014;20(11):1684–96.
13. Li CI, Samuels DC, Zhao YY, Shyr Y, Guo Y. Power and sample size calculations for high-throughput sequencing-based experiments. Brief Bioinform. 2017; https://www.ncbi.nlm.nih.gov/pubmed/28605403.
14. Therneau TM, Hart SN, Kocher JP. RNASeqPower: Calculating samples Size estimates for RNA Seq studies. R package version 1.18.0. 2013.
15. Guo Y, Li J, Li CI, Shyr Y, Samuels DC. MitoSeek: extracting mitochondria information and performing high-throughput mitochondria sequencing analysis. Bioinformatics. 2013;29(9):1210–1.
16. Wu H, Wang C, Wu ZJ. PROPER: comprehensive power evaluation for differential expression using RNA-seq. Bioinformatics. 2015;31(2):233–41.
17. Zhou X, Lindsay H, Robinson MD. Robustly detecting differential expression in RNA sequencing data using observation weights. Nucleic Acids Res. 2014;42(11):e91.
18. Yu L, Fernandez S, Brock G. Power analysis for RNA-Seq differential expression studies. BMC Bioinformatics. 2017;18(1):234.
19. Croft D, O'Kelly G, Wu G, Haw R, Gillespie M, Matthews L, Caudy M, Garapati P, Gopinath G, Jassal B, et al. Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res. 2011;39(Database issue):D691–7.
20. Rapaport F, Khanin R, Liang Y, Pirun M, Krek A, Zumbo P, Mason CE, Socci ND, Betel D. Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data. Genome Biol. 2013;14(9):R95.
21. R Core Team. R: a language and environment for statistical computing. In: R foundation for statistical computing; 2016. https://www.R-project.org/.
22. Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, Bravo HC, Davis S, Gatto L, Girke T, et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods. 2015;12(2):115–21.
23. Robinson MD, Smyth GK. Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics. 2008;9(2):321–32.
24. Robinson MD, Smyth GK. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007;23(21):2881–7.
25. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40.

Publisher
Springer Journals
Copyright
Copyright © 2018 by The Author(s).
Subject
Life Sciences; Bioinformatics; Microarrays; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Algorithms
eISSN
1471-2105
D.O.I.
10.1186/s12859-018-2191-5
Publisher site
See Article on Publisher Site

Abstract

Background: One of the most important and often neglected components of a successful RNA sequencing (RNA-Seq) experiment is sample size estimation. A few negative binomial model-based methods have been developed to estimate sample size based on the parameters of a single gene. However, thousands of genes are quantified and tested for differential expression simultaneously in RNA-Seq experiments. Thus, additional issues should be carefully addressed, including the false discovery rate for multiple statistical tests and the widely distributed read counts and dispersions of different genes.

Results: To solve these issues, we developed a sample size and power estimation method named RnaSeqSampleSize, based on the distributions of gene average read counts and dispersions estimated from real RNA-seq data. Datasets from previous, similar experiments such as the Cancer Genome Atlas (TCGA) can be used as a point of reference. Read counts and their dispersions were estimated from the reference's distribution; using that information, we estimated and summarized the power and sample size. RnaSeqSampleSize is implemented in the R language and can be installed from the Bioconductor website. A user-friendly web graphic interface is provided at http://cqs.mc.vanderbilt.edu/shiny/RnaSeqSampleSize/.

Conclusions: RnaSeqSampleSize provides a convenient and powerful way to estimate power and sample size for an RNA-seq experiment. It is also equipped with several unique features, including estimation for genes or pathways of interest, power curve visualization, and parameter optimization.

Keywords: RNA-Seq, Sample size, Power analysis, Simulation

* Correspondence: shyr.yu@vanderbilt.edu. Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37232, USA. Full list of author information is available at the end of the article.

© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Zhao et al. BMC Bioinformatics (2018) 19:191

Background

RNA sequencing is a powerful NGS tool that has been widely used in differential gene expression studies [1]. One of the most important steps in designing an RNA sequencing experiment is selecting the optimal number of biological replicates to achieve a desired statistical power (sample size estimation), or estimating the likelihood of successfully finding statistical significance in the dataset (power estimation). An insufficient number of replicates may lead to unreliable conclusions, whereas too many replicates may result in a waste of time and resources. The tradeoff between cost and study power needs to be carefully balanced. To address this issue, several attempts have been made to estimate power and sample size for RNA-seq experiments.

Sample size and power analysis have been well established for traditional biological studies such as genome wide association studies (GWAS) and microarray gene expression studies [2, 3]. In earlier RNA-Seq studies, the analysis was based on the Poisson distribution, because RNA-Seq data can be represented as read counts [4–6]. It was discovered, however, that the Poisson distribution does not fit the empirical data because of over-dispersion, mainly caused by natural biological variation [7, 8]. To address this issue, a few negative binomial distribution-based methods have been developed. These methods provide researchers with more flexibility in assigning between-sample variations [9–13]. Hart et al. [14] proposed a power analysis method based on the score test for single-gene differential expression analysis. This method has been implemented in Bioconductor as RNASeqPower. To handle multiple gene comparisons, Li et al. [15] proposed a power analysis method that controls the false discovery rate. To incorporate the experiment's budget into the power analysis, Wu et al. [16] introduced the concepts of stratified power by coverage or biological variation and the cost of false discovery, then proposed a simulation-based method for power analysis. The method was implemented as the Bioconductor package PROPER.

However, the majority of the previous methods have several limitations: they do not properly account for the average read counts and dispersions of different genes; they lack proper reference data; and they lack easy, user-friendly interfaces. The average read counts of genes are distributed over a range of more than four orders of magnitude, and their dispersions are highly dependent on their gene expression level [17, 18]. Previous estimation methods were not designed for these distributions, so they often utilized one value chosen conservatively or by experience [9, 10], which often resulted in an over-estimated sample size. Yu et al. [18] introduced a simulation-based procedure which considers the dependence between gene expression level and its dispersion, but this method has not been made into easy-to-use software. Additionally, it is computationally expensive to apply these methods to every gene in the dataset, because the individual power analysis for the exact test involves infinite sums, and the study's overall power is estimated from a summation of the individual powers. A proper approach is to provide reference data with distributions similar to those of the current experiment. We acknowledge that such data may not be available for every project type and that a significant amount of programming and data processing effort is needed to utilize them.

To address the aforementioned problems, we used previous methods [15] as the foundation for developing the RnaSeqSampleSize package, which controls the false discovery rate (FDR) of multiple testing and utilizes the average read count and dispersion distributions from real data to estimate a more reliable sample size. The package is also optimized for running efficiency and provides additional features, which we demonstrate using real RNA-Seq data.

Results and discussion

The features of the RnaSeqSampleSize package are summarized in Fig. 1.

Fig. 1 RnaSeqSampleSize package workflow

Sample size estimation with single average read count and dispersion

RnaSeqSampleSize was developed based on the sample size and power estimation methods described in the previous study [10], and it greatly improves on the compatibility and efficiency of the older methods. In this new implementation, a minimal average read count and a maximal dispersion are used to represent all genes in the RNA sequencing experiment, so that a conservative sample size or power can be estimated. More importantly, RnaSeqSampleSize is compatible with large average read counts and dispersions, supporting average read counts as high as 2000. We reduced the running time of the method from 40 min to two seconds for most of the widely used parameters (Additional file 1: Table S1).

Sample size estimation with real data

As previously stated, average read counts and dispersions for genes have wide distributions within a single RNA sequencing experiment. A tiny fluctuation in the average read count or dispersion will greatly influence the estimated power or sample size (Fig. 2). For example, in the TCGA Rectum adenocarcinoma (READ) dataset, the genes have dispersions from 0 to 10, and the average read counts range from one to several thousand (Fig. 2a). In such a scenario, the sample size estimation from a single value is inaccurate. We computed that the estimated sample size increased from 10 to 302 when the minimal average read count changed from one to 30 and the maximal dispersion changed from 0.1 to three (Fig. 2b).

Fig. 2 Read counts and dispersion distribution greatly influence the estimated sample size and power. a The read counts and dispersion distribution for all genes from the TCGA Rectum adenocarcinoma (READ) dataset. The red lines indicate read counts equal to one and 10, and the green line indicates the 95% quantile of all gene dispersions. b The estimated sample size required to achieve 0.8 power in different combinations of read counts and dispersions

Instead of relying on guesses about the future data's distribution, RnaSeqSampleSize uses data parameters from previous studies. To demonstrate this feature, we compared the usual sample size estimation approach to RnaSeqSampleSize's empirical approach. We used the datasets from TCGA as the reference data to estimate the true data distribution. TCGA is a cancer consortium data set considered to be the most representative dataset for cancer RNA-seq. For the usual approach, following common standards, we set the minimum average read count to one or 10 and the maximum dispersion to the 95% quantile of all gene dispersions, while the empirical data-based method utilized the empirical average read count and dispersion distribution computed from TCGA (Additional file 1: Table S2 and S3). The estimated sample size obtained using the empirical data-based method was smaller in all parameter combinations (Additional file 1: Table S2 and S3). For example, the estimated sample size for the TCGA Rectum adenocarcinoma (READ) dataset was 168 when the following parameters were used: fold change at 2; desired power at 0.8; FDR at 0.05; minimal read count at 10; dispersion at 2.0 (Additional file 1: Table S2, in boldface). This estimated sample size is larger than the real sample size, because we used the lowest read count and highest dispersion to represent all genes, even though most genes do not have such extreme values (red line and green line in Fig. 2a).

In the empirical data-based method, genes in the reference dataset were randomly selected, and the power was estimated for each of these genes (Fig. 3). With the same desired power and FDR, the estimated sample size was 42 (Additional file 1: Table S3, in boldface). This result is a better representation of the genes in an RNA-seq experiment, and it is substantially less than the result obtained using the traditional method. More importantly, the empirical data-based method can reflect the differential gene expression pattern of a particular dataset. For example, genes in the TCGA Breast Invasive Carcinoma (BRCA) and READ datasets have similar read count distributions (Fig. 3a), but genes in the BRCA dataset have a higher dispersion than in READ (Fig. 3b). Thus, when analyzing genes in the READ dataset, we have a higher power distribution (Fig. 3c and d), suggesting that fewer samples are needed to analyze rectum adenocarcinoma samples with a desired power.

Fig. 3 Sample size estimation with real data. a The read counts distribution (log2 read counts) for all genes from the TCGA Breast Invasive Carcinoma (BRCA) and Rectum adenocarcinoma (READ) datasets. b The dispersion distribution for all genes from the TCGA BRCA and READ datasets. c The power distribution based on the count and dispersion distributions in the TCGA BRCA dataset when sample size equals 71. The red lines indicate the mean value of the power distribution. d The power distribution based on the count and dispersion distributions in the TCGA READ dataset when sample size equals 71. The red lines indicate the mean value of the power distribution

Sample size estimation for interested genes or pathways

In certain situations, researchers may be interested in a subset of genes defined by certain features such as shared pathways or gene ontology categories, rather than the entire gene set. In such scenarios, the sample size estimation method needs to be adjusted, because the gene subsets of interest may have distinct expression patterns compared to other genes [19]. RnaSeqSampleSize was designed to handle sample size and power analysis in such experiment designs by allowing users to provide a list of genes of interest or a KEGG pathway ID; this ensures that only the read count and dispersion distribution of the genes of interest, or of the genes in the selected pathway, will be used for estimation.

As illustrated in Fig. 4a and b, genes in Proteasome (KEGG pathway 03050), Calcium Signaling (KEGG pathway 04020) and Pathways in Cancer (KEGG pathway 05200) have distinguishable read count and dispersion distributions in the TCGA READ dataset. The genes in the Proteasome pathway have very high read counts and a low dispersion, whereas genes in the Calcium Signaling pathway have low read counts and a high dispersion, which may be a reflection of their functions related to rectum adenocarcinoma (Fig. 4c and d, Additional file 1: Table S4). Furthermore, we demonstrated that the sample size estimated for the KEGG pathway "Pathways in Cancer" differs between TCGA datasets (Additional file 1: Table S5). RnaSeqSampleSize estimated a sample size of 45 for "Pathways in Cancer" genes (Additional file 1: Table S5, in boldface) vs 42 for all genes (Additional file 1: Table S3, in boldface) in the READ dataset when the same parameters as before were used.

Fig. 4 Sample size estimation for interested genes. a The read counts distribution (log2 read counts) for genes in three KEGG pathways (Proteasome, Calcium signaling, Pathways in cancer) in the TCGA READ dataset. b The dispersion distribution for genes in three KEGG pathways in the TCGA READ dataset. c The power distribution based on the count and dispersion distributions in the TCGA READ dataset for genes in the Calcium signaling pathway when sample size equals 71. The red lines indicate the mean value of the power distribution. d The power distribution based on the count and dispersion distributions in the TCGA READ dataset for genes in the Proteasome pathway when sample size equals 71. The red lines indicate the mean value of the power distribution

Power curve visualization for different parameters

Power curves are widely used to analyze and compare sample size estimation results. To demonstrate the power curve visualization feature in RnaSeqSampleSize, we produced three power curves based on different scenarios. As displayed in Fig. 5a, the X-axis indicates the total number of samples used in two groups, and the Y-axis indicates the estimated power. There are three types of sample allocation design: 1:1 sample size in two groups (red curve); 2:1 sample size in two groups (blue curve); 3:1 sample size in two groups (purple curve). The relationship between power and the number of samples can be easily visualized. In the example displayed in Fig. 5a, the power curves indicate that the balanced (sample size 1:1) experiment design (red curve) will achieve the highest power when the same total number of samples is used.

Parameter optimization for experiment design

The RNA-seq experiment design is often limited by the available budget. The optimization feature in RnaSeqSampleSize can be used to identify the optimal parameters that will achieve the highest power while staying under the budget. To demonstrate the parameter optimization feature, we attempted to optimize the number of samples and the read count while fixing all other parameters (fold change: 2; dispersion: 1; FDR: 0.05) by generating a power matrix (Fig. 5b). The estimated power was less than 0.1 when 16 samples were used, even if the read count was as high as 96. When the number of samples increased to 96, however, the estimated power increased to 0.8, even when the read count was as low as eight. This matrix indicates that the number of samples plays a more significant role in determining the power than the read count, which is consistent with the previous report [20].

Fig. 5 Power curve visualization and parameter optimization by RnaSeqSampleSize. a Power curves for balanced (same sample size in two groups) and unbalanced (different sample size in two groups) experiment designs. The power curves indicate that the balanced experiment design (red line) will achieve the highest power with the same total number of samples. b Optimization of parameters in sample size estimation. The dispersion and fold change were set as 0.5 and two, respectively. A power matrix with different pairs of numbers of samples and read counts was generated. The power distribution indicates that the number of samples plays a more significant role in determining the power, and suggests that at least 96 samples should be used in RNA-Seq experiments with these parameters to get 0.8 power

Material and methods

Software development and data acquisition

RnaSeqSampleSize was developed in the R language [21] and compiled into a software package following the guidelines of Bioconductor [22].

The web interface of RnaSeqSampleSize was developed using the Shiny package (http://shiny.rstudio.com/) in the R language [21].

The TCGA data used as real data examples in RnaSeqSampleSize were downloaded from the Genomic Data Commons Data Portal (https://gdc-portal.nci.nih.gov/). The gene expression data of 13 types of cancers were downloaded and included as references in the RnaSeqSampleSizeData package in Bioconductor [22].

Algorithm

In previous research, we reported a sample size calculation method based on the exact test for a single-gene comparison [10]. In this method, we used the concept of pseudo counts [23, 24]. Because the question of interest is to identify the differential gene expression between two groups, the corresponding testing hypothesis is

$$H_0: \gamma_0 = \gamma_1 \quad \text{vs} \quad H_1: \gamma_0 \neq \gamma_1,$$

where $\gamma_i$ represents the gene expression level of group $i$ ($i = 0, 1$). In order to perform sample size calculations, it is necessary to construct a power function for the testing described above. The power of a test is the probability that the null hypothesis is rejected when the alternative hypothesis is true. For a given marginal type I error level $\alpha$ to reject the null hypothesis, the power can be expressed as

$$\varepsilon(n, \rho, \mu_g, \varphi_g, \omega, \alpha) = \sum_{y_1=0}^{\infty} \sum_{y_0=0}^{\infty} f\!\left(n\omega\rho\mu_g, \frac{\varphi_g}{n}\right) f\!\left(n\mu_g, \frac{\varphi_g}{n}\right) I\big(p(y_1, y_0) < \alpha\big) \qquad (1)$$

where $g$ is the single gene in comparison; $n$ is the number of samples in each group; $\rho$ is the fold change between the two groups; $\mu_g$ is the average read count for gene $g$ in the control group; $\varphi_g$ is the dispersion parameter for gene $g$ in the control group; $\omega$ is the geometric mean of normalization factors between the two groups; $\alpha$ is the type I error rate; $y_1$ and $y_0$ are pseudo counts [25] in the two groups; $f(\mu, \varphi)$ is the probability mass function of the negative binomial distribution with mean $\mu$ and dispersion $\varphi$ (the first factor is evaluated at $y_1$ and the second at $y_0$); and $I(p(y_1, y_0) < \alpha)$ denotes the indicator function for the p value of the exact test [25].

In reality, thousands of genes are examined in an RNA-seq experiment, and those genes are tested simultaneously for significance of differential expression. In such cases, the multiple testing problem should be considered. In previous research [10], we proposed a method that controls the false discovery rate (FDR), which is defined as the expected proportion of false discoveries among rejected null hypotheses. In this method, the marginal type I error level $\alpha$ is adjusted to $\alpha^*$ to guarantee the expected number of true rejections at a given FDR.

To calculate the sample size, we need to pre-specify the parameters estimated from the differentially expressed genes. However, we may not be able to know or determine which genes were differentially expressed in a real dataset. To deal with this issue, we assume that the distributions of the average read count in the control group ($\mu_g$ in formula 2) and of the dispersion ($\varphi_g$ in formula 2) for differentially expressed genes are the same as for all genes. We then randomly selected genes from the real data set and treated them as differentially expressed. Functions in the edgeR package were wrapped and used to estimate the average read count and dispersion distributions. If $M_1$ is the number of differentially expressed genes in the dataset, we randomly selected $M_1$ genes from all genes and used their average read counts and dispersions from the distribution to represent the differential genes. As a result, the power of detecting these $M_1$ differential genes can be calculated:

$$\text{Power} = \frac{\sum_{g=1}^{M_1} \varepsilon(n, \rho, \mu_g, \varphi_g, \omega, \alpha)}{M_1} \qquad (2)$$

The value of power in formula (2) is highly dependent on the selected differential genes. When the number of differentially expressed genes is small, different genes will be selected in each replication, which results in considerable variation in power between replications. Motivated by the ensemble method in machine learning, we average all the powers calculated from the replications to obtain a robust estimation of power.

This re-sampling process was repeated several times (1000 by default) to get a power distribution, and the power distribution was summarized (averaged by default) to obtain a robust estimation of power. The RnaSeqSampleSize package then uses a numerical approach to find the n at which the robust estimation of power is equal to the desired level.

Conclusion

Sample size estimation is a critical step in RNA sequencing experimental design. It provides an important solution for balancing the number of samples and the statistical power. Here, we presented the power and sample size estimation software RnaSeqSampleSize to overcome the current limitations and provide a less conservative yet more accurate and reliable result. RnaSeqSampleSize provides more efficient computations compared to previous methods; additionally, it provides several novel visualization and optimization features as well as a much desired graphical web user interface, which allows investigators without a background in programming to easily conduct sample size calculations (Additional file 1: Figure S1).

What separates RnaSeqSampleSize from the other RNA-Seq power analysis tools is its usage of reference data, which can help generate a reliable read count and dispersion distribution. We preloaded the TCGA datasets for users without reference data. The TCGA dataset provides a comprehensive reference for cancer tissue samples, but reference datasets for non-cancer or non-tissue samples are not currently included. As more and more RNA sequencing datasets become publicly available, we will continually update the reference dataset.

Availability

Home page: http://www.bioconductor.org/packages/release/bioc/html/RnaSeqSampleSize.html

Web interface: http://cqs.mc.vanderbilt.edu/shiny/RnaSeqSampleSize/

Additional file

Additional file 1: Table S1. The improvement in efficiency in RnaSeqSampleSize package. Table S2. Estimated sample size for RNA-Seq experiments in different cancer types by single parameter method. Table S3. Estimated sample size for RNA-Seq experiments in different cancer types by real data distribution based method. For each cancer type, we used the related TCGA dataset to estimate the read count and dispersion distribution. Table S4. Estimated sample size for RNA-Seq experiments in different cancer types by real data distribution based method, only the genes in the interested KEGG pathway were considered. Table S5. Estimated sample size for RNA-Seq experiments in different cancer types by real data distribution based method, only the genes in KEGG pathway ID 05200 (Pathways in Cancer) were considered. Figure S1. A screen shot of user interface of RnaSeqSampleSize package. (DOCX 217 kb)

Acknowledgements

The authors wish to thank Michael Smith for providing editorial work and William Gray for providing website support for this manuscript.

Funding

This work was supported by grants P30CA068485, P50CA095103, P50CA098131, and U24CA163056.

Availability of data and materials

The datasets used in the study were available in RnaSeqSampleSizeData package in Bioconductor (http://master.bioconductor.org/packages/release/data/experiment/html/RnaSeqSampleSizeData.html).

Authors' contributions

SZ improved the algorithm, developed the package and wrote the manuscript. CL developed the algorithm. YG improved the package and the manuscript. QS participated in improving the package. YS designed and guided the project. All authors read and approved the final manuscript.

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author details

Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37232, USA. Department of Statistics, National Cheng Kung University, Tainan 70101, Taiwan. Department of Internal Medicine, University of New Mexico, Albuquerque, NM 87131, USA. Key Laboratory of Resource Biology and Biotechnology in Western China, School of Life Sciences, Northwest University, Xi'an 710069, Shanxi, China.

Received: 8 August 2017 Accepted: 7 May 2018

References

1. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57–63.
2. Jung SH, Bang H, Young S. Sample size calculation for multiple testing in microarray data analysis. Biostatistics. 2005;6(1):157–69.
3. Müller P, Parmigiani G, Robert C, Rousseau J. Optimal sample size for multiple testing: the case of gene expression microarrays. J Am Stat Assoc. 2004;99(468):990–1001.
4. Busby MA, Stewart C, Miller CA, Grzeda KR, Marth GT. Scotty: a web tool for designing RNA-Seq experiments to measure differential gene expression. Bioinformatics. 2013;29(5):656–7.
5. Chen Z, Liu J, Ng HK, Nadarajah S, Kaufman HL, Yang JY, Deng Y. Statistical methods on detecting differentially expressed genes for RNA-seq data. BMC Syst Biol. 2011;5(Suppl 3):S1.
6. Fang Z, Cui X. Design and validation issues in RNA-seq experiments. Brief Bioinform. 2011;12(3):280–7.
7. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010;11(3):R25.
8. Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11(10):R106.
9. Hart SN, Therneau TM, Zhang Y, Poland GA, Kocher JP. Calculating sample size estimates for RNA sequencing data. J Comput Biol. 2013;20(12):970–8.
10. Li CI, Su PF, Shyr Y. Sample size calculation based on exact test for assessing differential expression analysis in RNA-seq data. BMC Bioinformatics. 2013;14:357.
11. Liu Y, Zhou J, White KP. RNA-seq differential expression studies: more sequence or more replication? Bioinformatics. 2014;30(3):301–4.
12. Ching T, Huang S, Garmire LX. Power analysis and sample size estimation for RNA-Seq differential expression. RNA. 2014;20(11):1684–96.
13. Li CI, Samuels DC, Zhao YY, Shyr Y, Guo Y. Power and sample size calculations for high-throughput sequencing-based experiments. Brief Bioinform. 2017; https://www.ncbi.nlm.nih.gov/pubmed/28605403.
14. Therneau TM, Hart SN, Kocher JP. RNASeqPower: Calculating samples Size estimates for RNA Seq studies. R package version 1.18.0. 2013.
15. Guo Y, Li J, Li CI, Shyr Y, Samuels DC. MitoSeek: extracting mitochondria information and performing high-throughput mitochondria sequencing analysis. Bioinformatics. 2013;29(9):1210–1.
16. Wu H, Wang C, Wu ZJ. PROPER: comprehensive power evaluation for differential expression using RNA-seq. Bioinformatics. 2015;31(2):233–41.
17. Zhou X, Lindsay H, Robinson MD. Robustly detecting differential expression in RNA sequencing data using observation weights. Nucleic Acids Res. 2014;42(11):e91.
18. Yu L, Fernandez S, Brock G. Power analysis for RNA-Seq differential expression studies. BMC Bioinformatics. 2017;18(1):234.
19. Croft D, O'Kelly G, Wu G, Haw R, Gillespie M, Matthews L, Caudy M, Garapati P, Gopinath G, Jassal B, et al. Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res. 2011;39(Database issue):D691–7.
20. Rapaport F, Khanin R, Liang Y, Pirun M, Krek A, Zumbo P, Mason CE, Socci ND, Betel D. Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data. Genome Biol. 2013;14(9):R95.
21. R Core Team. R: a language and environment for statistical computing. In: R foundation for statistical computing; 2016. https://www.R-project.org/.
22. Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, Bravo HC, Davis S, Gatto L, Girke T, et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods. 2015;12(2):115–21.
23. Robinson MD, Smyth GK. Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics. 2008;9(2):321–32.
24. Robinson MD, Smyth GK. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007;23(21):2881–7.
25. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40.
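
The re-sampling procedure described in the Algorithm section (formula 2, averaged over replications, followed by a numerical search for n) can be illustrated with a short Python sketch. It is not the package's implementation, and it rests on several stated assumptions: the per-gene power uses a normal approximation to the log fold change test instead of the exact test of formula (1); the normalization factor ω is taken as 1; the adjusted level α* is written as r1·f / (m0·(1 − f)), the form used in the multiple-testing sample size literature such as ref. [2], which we assume is the adjustment intended here; and all function names and the toy reference data are hypothetical.

```python
import math
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def gene_power(n, rho, mu, phi, alpha):
    """Approximate two-sided power for one gene with n samples per group.

    Assumption: normal approximation on the log fold change, swapped in for
    the exact test of formula (1). mu is the control-group mean count, phi the
    dispersion, rho the fold change.
    """
    # Each group contributes (1/mean + dispersion)/n to the variance of log(rho).
    var = (1.0 / mu + phi + 1.0 / (rho * mu) + phi) / n
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = abs(math.log(rho)) / math.sqrt(var)
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def adjusted_alpha(m, m1, power, fdr):
    """alpha* = r1*f / (m0*(1-f)), with r1 = m1*power expected true rejections."""
    return (m1 * power) * fdr / ((m - m1) * (1.0 - fdr))


def robust_power(n, genes, m1, rho, alpha, repeats=100, rng=None):
    """Formula (2) averaged over `repeats` random draws of m1 reference genes."""
    rng = rng or random.Random(0)  # fixed seed: same draws for every n
    reps = []
    for _ in range(repeats):
        chosen = rng.sample(genes, m1)
        reps.append(sum(gene_power(n, rho, mu, phi, alpha) for mu, phi in chosen) / m1)
    return sum(reps) / len(reps)


def sample_size(genes, m, m1, rho, fdr, target_power, n_max=2000, repeats=100):
    """Smallest n per group whose robust power reaches target_power (bisection)."""
    alpha_star = adjusted_alpha(m, m1, target_power, fdr)

    def reached(n):
        return robust_power(n, genes, m1, rho, alpha_star, repeats) >= target_power

    lo, hi = 2, n_max
    if not reached(hi):
        raise ValueError("target power not reached within n_max")
    if reached(lo):
        return lo
    while hi - lo > 1:  # invariant: not reached(lo), reached(hi)
        mid = (lo + hi) // 2
        if reached(mid):
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    # Hypothetical stand-in for a reference dataset: (mean count, dispersion) pairs.
    toy = random.Random(1)
    reference = [(10 ** toy.uniform(0, 3.5), toy.uniform(0.05, 1.5)) for _ in range(2000)]
    print(sample_size(reference, m=2000, m1=50, rho=2.0, fdr=0.05, target_power=0.8))
```

Because the same seeded gene draws are reused for every candidate n, the averaged power is monotone in n and bisection is safe; with fresh random draws per evaluation, a more tolerant search would be needed. Numbers from this sketch are not expected to match the tables in Additional file 1, since the per-gene power function is an approximation.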
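
The power matrix used for parameter optimization (number of samples against read count, with the other parameters fixed) can be sketched in the same spirit. Again this is only an illustration under assumptions: the power function is a normal approximation rather than the exact test, the type I error level is passed in directly, and the budget model (a fixed cost per sample plus a cost proportional to the average read count) together with every function and parameter name is hypothetical and not taken from the package.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def approx_power(n, mu, phi, rho, alpha):
    """Two-sided power for n samples per group (normal approximation, assumed)."""
    var = (1.0 / mu + phi + 1.0 / (rho * mu) + phi) / n
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = abs(math.log(rho)) / math.sqrt(var)
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def power_matrix(sample_sizes, read_counts, phi, rho, alpha):
    """Rows follow sample_sizes, columns follow read_counts."""
    return [[approx_power(n, mu, phi, rho, alpha) for mu in read_counts]
            for n in sample_sizes]


def best_design(sample_sizes, read_counts, phi, rho, alpha,
                budget, cost_per_sample, cost_per_read_unit):
    """Return (n, mu, power) with the highest power that fits the budget.

    Hypothetical cost model: two groups of n samples, each sample costing
    cost_per_sample + mu * cost_per_read_unit. Returns None if nothing fits.
    """
    best = None
    for n in sample_sizes:
        for mu in read_counts:
            cost = 2 * n * (cost_per_sample + mu * cost_per_read_unit)
            if cost > budget:
                continue
            power = approx_power(n, mu, phi, rho, alpha)
            if best is None or power > best[2]:
                best = (n, mu, power)
    return best


if __name__ == "__main__":
    sizes = [16, 32, 64, 96]
    counts = [8, 16, 32, 96]
    for n, row in zip(sizes, power_matrix(sizes, counts, phi=1.0, rho=2.0, alpha=0.001)):
        print(n, [round(p, 3) for p in row])
```

In this approximation the dispersion term does not shrink with read depth, so once counts are moderate, extra samples raise power more than extra reads do. That is the same qualitative pattern the power matrix in Fig. 5b is reported to show, although the numerical values here will differ from the figure.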

Journal

BMC Bioinformatics (Springer Journals)

Published: May 30, 2018
