UMI-count modeling and differential expression analysis for single-cell RNA sequencing

UMI-count modeling and differential expression analysis for single-cell RNA sequencing Read counting and unique molecular identifier (UMI) counting are the principal gene expression quantification schemes used in single-cell RNA-sequencing (scRNA-seq) analysis. By using multiple scRNA-seq datasets, we reveal distinct distribution differences between these schemes and conclude that the negative binomial model is a good approximation for UMI counts, even in heterogeneous populations. We further propose a novel differential expression analysis algorithm based on a negative binomial model with independent dispersions in each group (NBID). Our results show that this properly controls the FDR and achieves better power for UMI counts when compared to other recently developed packages for scRNA-seq analysis. Keywords: Unique molecular identifier, Negative binomial, Differential expression analysis Background missing value whereby the expression of a gene is detected Single-cell RNA-sequencing (scRNA-seq) technology pro- at a moderate or high level in a subset of cells but is not vides transcriptome profiles of individual cells, enabling detected in other cells [5]. the dissection of the heterogeneity of different cell popula- Read counts and transcript counts are two categories tions and tissues [1]. Although scRNA-seq protocols share of quantification schemes commonly employed in common principles of single-cell isolation, cell lysis, tran- scRNA-seq. Although the read count-based scheme is script capture, complementary DNA (cDNA) conversion similar to the common approaches used for bulk and amplification, library preparation, and sequencing, the RNA-seq, the miniscule quantity of transcripts captured methodologies differ. Multiple methods for transcript from a single cell requires cDNA amplification for quantification with differing levels of accuracy and sensi- library construction; this inevitably results in large amp- tivity have been employed in scRNA-seq analysis [2]. lification bias [6]. To mitigate this bias, several recent However, the paucity of starting material for reverse tran- scRNA-seq protocols have employed an additional step scription remains an inherent limitation of scRNA-seq in which individual transcripts are barcoded with unique protocols and contributes to the relatively low rate at molecular identifiers (UMIs) before amplification, result- which messenger RNA (mRNA) molecules in individual ing in a more accurate quantification of the transcript cells are converted to cDNA molecules that can be cap- count [7, 8]. tured and sequenced [3, 4]. Coupled with the stochastic Although the fast-evolving experimental protocols for nature of gene expression, scRNA-seq protocols generally scRNA-seq have given rise to numerous studies employ- produce single-cell transcriptome measurements with low ing scRNA-seq techniques, statistical characterizations of signal-to-noise ratios, exemplified by the high abundance scRNA-seq data continue to lag. Most published studies of zeroes in the expression matrix and so-called dropout have focused primarily on either read counts [5, 9, 10]or events. In this context, dropout refers to a special type of UMI counts [3, 7]. Although a few studies that used both read-count and UMI-count schemes have suggested that employing UMIs in expression measurement globally * Correspondence: xiang.chen@stjude.org reduces the technical noises and that the data generally fit Department of Computational Biology, St. Jude Children’s Research Hospital, 262 Danny Thomas Pl, Memphis, TN 38105, USA into simpler statistical models compared to read counts Full list of author information is available at the end of the article © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Chen et al. Genome Biology (2018) 19:70 Page 2 of 17 [3, 11–13], a popular perspective held by the field is that application of NBID in biomarker identification after dropout events result in more zeroes than expected unsupervised clustering of scRNA-seq data. in scRNA-seq data and these events need to be expli- citly modeled using zero inflated/bimodality models Results [5, 10, 14–17]. This study investigated the necessity Model comparison for UMI counts and read counts and effectiveness of zero-inflated models in modeling the We used a unique dataset produced by Ziegenhain et al. UMI-count distribution among cells by directly comparing [12] to determine the difference between read counts the statistical modeling of UMI counts and read counts. and UMI counts. A homogeneous population of mouse A closely related application of scRNA-seq count model- embryonic stem cells was derived by two inhibitors/ ing is single-cell differential expression (DE) analysis. leukemia inhibitory factors and used to evaluate six Several software packages have been developed specifically different scRNA-seq protocols, including four UMI for scRNA-seq DE analysis, such as SCDE [5], MAST [9], count-based protocols and two read count-based proto- ROTS [18], Monocle2 [11], and Seurat [19]. However, there cols. Furthermore, the read counts before conversion to have been no systematic evaluations of these methods with UMIs were also evaluated for the four UMI based proto- respect to UMI count-based scRNA-seq data. cols, which provided an excellent opportunity to exam- In this study, we first conducted a comprehensive ana- ine the differences between the UMI count and read lysis of the modeling UMI counts and read counts in count for the same data [12]. We first examined scatter scRNA-seq data. Based on that analysis, we proposed a plots for cell pairs with similar total read/UMI counts. method using the Negative Binomial model with Inde- Figure 1 shows the representative pattern for two cells. pendent Dispersions (NBID) and compared its false dis- This general pattern holds for most cell pairs in the covery rate (FDR) control and power to those of other dataset. A large density mass was focused on at (0, 0) commonly used methods. We also illustrate a practical (the origin point) in all three results (Fig. 1a, c, and e), Smart−Seq2 read count CEL−Seq2/C1 read count CEL−Seq2/C1 UMI count a c e 01 2345 012345 01 2345 cell 1: log10 count cell 1: log10 count cell 1: log10 count b d f x−axis y−axis 12345678 9 count 012345 01 2345 012345 012345 012345 012345 log10 count log10 count log10 count Fig. 1 Scatter plots of two cells with similar read counts or UMI counts. a, b Read counts for Smart−Seq2. c, d Read counts for CEL − Seq2/C1. e, f UMI counts for CEL − Seq2/C1. a, c, e The scatter plot with color-coded density, the highest density at the origin. The left and middle panels,which are based on the read counts, show very different patterns from the right panel, which is based on the UMI counts. b, d, f The density plot along the x- and y-axes of (a), (c), and (e), excluding the origin. For all plots, we kept the genes that were detected in at least five cells among all cells Density cell 2: log10 count 0.0 0.0 0.5 0.5 1.0 1.0 1.5 1.5 012345 Density cell 2: log10 count 0.0 0.0 0.5 0.5 1.0 1.0 1.5 1.5 03 12 4 5 Density cell 2: log10 count 02468 02468 1100 0 12345 proportion 0.0 0.2 0.4 0.6 Chen et al. Genome Biology (2018) 19:70 Page 3 of 17 which was consistent with the notion that transcripts for different distributions. We employed a backward selection most genes were not captured by either cell in strategy on three candidate models that are commonly scRNA-seq protocols (supported by abundance of zeros used in scRNA-seq studies [3, 5, 7]: a Poisson model with in the scRNA-seq data) [10]. Read-count measurements one parameter defining both the mean and the variance; a produced results (Fig. 1a–d) similar to those of Kharchenko NB model with two parameters defining the mean and et al. [5], which were used to illustrate dropout events. The variance; and a ZINB model with three parameters, of overall UMI-count measurements (Fig. 1e and f)showed which two were the same as those in the NB model. The less divergence when compared to their read-count coun- additional parameter in the ZINB model defines the prob- terparts in the same cell pair (Fig. 1c and d). Specifically, ability of a count being zero or being distributed as an NB quantifications for genes with dropout events (i.e. when distribution. These three models are nested with increas- transcripts/reads were captured in one cell in the pair but ing complexity, i.e. the Poisson model is a special case of not the other) showed a distinct bi-modal pattern in the the NB model and the NB model is a special case of the read counts (Fig. 1b and d) but a unimodal distribution in ZINB model. Our goal was to decide on the proper the UMI counts (Fig. 1f). Consequently, it is prudent to complexity for fitting the scRNA-seq data. We started by model gene expression by using a zero-inflated model (i.e. a testing whether the ZINB model was significantly better zero-inflated negative binomial [ZINB] model [5]orhurdle than the NB model for modeling the counts. Among those model [9]) for the read counts: one component for the zero genes that did not reject the NB model, we further tested counts and the other component for the non-zero counts. whether that model was significantly better than the Pois- However, the fast attenuation of density along the axes sug- son model. The model selection results are summarized in gests that a unimodal distribution (e.g. a Poisson or nega- Table 1 (see also Additional file 1:Figure S1).We analyzed tive binomial [NB] distribution) may be sufficient for UMI both the UMI counts and read counts (before UMI con- counts. version) form four UMI-based protocols. Although no To further capture the quantitative difference between genes measured in UMI counts preferred the ZINB model read counts and UMI counts, we modeled them with over the NB model at an FDR level of 0.05, the results for Table 1 Number of genes with selected models for different protocols from Ziegenhain et al. [12] NB vs ZINB Poisson vs NB Protocol Cells used (n) Genes detected (n) Genes tested (n) Genes converged (n) ZINB NB Poisson Poisson (%) UMI count CEL-Seq2/C1(A) 34 16,690 11,345 11,345 0 2968 8377 73.84 CEL-Seq2/C1(B) 37 17,229 12,190 12,189 0 3794 8395 68.87 Drop-Seq(A) 42 16,579 10,702 10,702 0 4277 6425 60.04 Drop-Seq(B) 34 15,469 9288 9288 0 3659 5629 60.61 MARS-Seq(A) 29 14,551 8266 8266 0 4592 3674 44.45 MARS-Seq(B) 36 15,406 9644 9644 0 5848 3796 39.36 SCRB-Seq(A) 39 16,411 12,955 12,955 0 1214 11,741 90.63 SCRB-Seq(B) 45 16,944 13,212 13,212 0 2115 11,097 83.99 CEL-Seq2/C1(A) 34 16,690 11,345 10,679 3679 6385 615 5.76 CEL-Seq2/C1(B) 37 17,229 12,190 12,155 1174 10,443 538 4.43 Drop-Seq(A) 42 16,579 10,702 10,690 121 9601 968 9.06 Drop-Seq(B) 34 15,469 9288 9278 91 8329 858 9.25 MARS-Seq(A) 29 14,551 8266 8132 761 7161 210 2.58 MARS-Seq(B) 36 15,406 9644 9531 1333 7974 224 2.35 SCRB-Seq(A) 39 16,411 12,955 12,954 0 11,814 1140 8.80 SCRB-Seq(B) 45 16,944 13,212 13,212 0 11,964 1248 9.45 Read count Smart-Seq2(A) 80 21,076 15,294 15,098 7905 5795 1398 9.26 Smart-Seq2(B) 77 20,861 15,224 15,152 6456 7244 1452 9.58 Smart-Seq/C1(A) 69 19,699 13,518 13,513 16 12,761 736 5.45 Smart-Seq/C1(B) 61 19,100 12,949 12,947 0 11,888 1059 8.18 Chen et al. Genome Biology (2018) 19:70 Page 4 of 17 CEL-Seq2/C1 and MARS-Seq showed a significant per- reported by Ziegenhain et al. [12] (Table 2). At an FDR centage of genes (9.4–34.5%) rejecting the NB model in level of 0.05, only 0.1% (range = 0–0.4%) of converged favor of the ZINB model when measured in read counts. genes rejected the NB model for UMI counts. This per- Moreover, for UMI counts, a large proportion of genes centage was significantly increased to 14.2% (range = (39.4–84.0%) selected the simple Poisson distribution. By 1.1–35.3%, p = 0.0078, the Wilcoxon signed rank test) contrast, read-count measurements resulted in a sharp for read counts from the same datasets, indicating that a drop in the proportions of Poisson models (2.4–9.5%, high-level noise was introduced by cDNA amplification. p = 0.0078, the Wilcoxon signed rank test) across all We further examined the proportion of genes that could platforms evaluated. Read-count only protocols (Smart-- be modeled by a Poisson model. As expected, the percent- Seq and Smart-Seq2) show comparable patterns to the age of genes with an adequate Poisson fit (FDR > 0.05) read counts from UMI protocols. Overall, our analysis im- dropped sharply from 80.2% (range = 65.7–95.1%) for plies that while ZINB is necessary for a significant fraction UMI counts to 2.6% (range = 1.0–4.1%, p = 0.0078, the of read counts, it is not needed for UMI counts. Wilcoxon signed rank test) for read counts measured in the same datasets. The goodness of fit of both the Poisson Negative binomial model for UMI counts and NB models supports the conclusion that UMI counts A model selection strategy always selects a “best” model can be modeled by simpler models when compared to among the specified candidates even though the chosen read counts. model may fit the underlying data poorly. Therefore, we evaluated the goodness of fit for these selected models. Modeling and goodness of fit for UMI counts in large scale Because a Poisson model can be modeled as a special scRNA-seq datasets scenario of the NB model, we began by measuring the Although the datasets of Ziegenhain et al. [12] provided goodness of fit of the NB model for various datasets an unparalleled opportunity to evaluate the difference Table 2 Goodness of fit test for the Poisson and NB models for different protocols from Ziegenhain et al. [12] Protocol Cells Used (n) Genes tested (n) Genes reject Poisson (n) Genes reject NB (n) Accept Poisson (%) Reject NB (%) UMI count CEL-Seq2/C1(A) 30 3357 660 1 80.34 0.03 CEL-Seq2/C1(B) 33 5601 1082 3 80.68 0.05 Drop-Seq(A) 37 2311 548 2 76.29 0.09 Drop-Seq(B) 30 1690 414 0 75.50 0.00 MARS-Seq(A) 26 1162 317 2 72.72 0.17 MARS-Seq(B) 32 2184 750 8 65.66 0.37 SCRB-Seq(A) 35 4218 214 1 94.93 0.02 SCRB-Seq(B) 40 4360 213 0 95.11 0.00 Read count before converting to UMI CEL-Seq2/C1(A) 30 6012 5954 622 0.96 10.35 CEL-Seq2/C1(B) 33 7993 7897 90 1.20 1.13 Drop-Seq(A) 37 4574 4386 430 4.11 9.40 Drop-Seq(B) 30 2867 2781 512 3.00 17.86 MARS-Seq(A) 26 2830 2743 152 3.07 5.37 MARS-Seq(B) 32 4248 4168 265 1.88 6.24 SCRB-Seq(A) 35 7392 7194 2065 2.68 27.94 SCRB-Seq(B) 40 7112 6855 2507 3.61 35.25 Read count Smart-Seq2(A) 72 10,880 10,692 2696 1.73 24.78 Smart-Seq2(B) 69 10,684 10,469 1790 2.01 16.75 Smart-Seq/C1(A) 62 9342 9249 87 1.00 0.93 Smart-Seq/C1(B) 55 7990 7893 75 1.21 0.94 A model is rejected if FDR < 0.05 among all genes tested; otherwise it is accepted NB negative binomial Chen et al. Genome Biology (2018) 19:70 Page 5 of 17 between read counts and UMI counts, the number of statistically significant lack of fit, even though the departure cells captured was relatively small (range = 29–80). We from the specified distributions may be very small and un- extended our analysis to additional datasets generated by important [24]. Therefore, we compared the empirical different platforms [7, 20–23] to evaluate whether the probability mass function (pmf) and the cumulative distri- same pattern generally held for other datasets. Despite bution function (cdf) with the fitted negative binomial technical differences among protocols and heterogeneity model to evaluate visually the difference between them for within cell populations, overall, the model selection and genes rejecting the NB model (Fig. 2, Additional file 1: goodness-of-fit analysis for these datasets supported our Figures S3 and S4). Even though these genes rejected the conclusion that UMI counts can be modeled by simpler NB model at an FDR level of 0.05, the fitted pmf and cdf models when compared to read counts (Additional file 2: curves were good approximations of their empirical coun- Tables S1A and S1B). terparts. Importantly, among the 23 to 282 genes that Since 2016, several Drop-seq UMI based platforms have rejected the NB model, only few (3–17) were ad- appeared with the capability to process thousands of cells equately approximated by the ZINB model (Additional in a single experiment [2, 8]. Consequently, we studied file 2: Table S3). Therefore, we conclude that the NB whether the same pattern held for such large-scale data- model is a good approximation model for UMI counts, sets. We applied the described model-selection strategy even for large-scale scRNA-seq data with evidence of and goodness-of-fit test to the following datasets: (1) CD4 heterogeneity. + naïve T cells (9850 cells); and (2) CD4+ memory T cells (9578 cells), both of which were generated on the Gem- scRNA-seq differential expression analysis Code platform (10× Genomics, Pleasanton, CA, USA) [8], A direct consequence of properly modeling scRNA-seq and 3) Rh41 cells, a human PAX3-FOXO1 positive alveolar counts is the power to accurately conduct differential rhabdomyosarcoma (ARMS) cell line (6875 cells) prepared expression analyses. Based on the knowledge derived in-house on the Chromium platform (10× Genomics). from UMI-count modeling, we proposed a NB-based al- Rh41 cells contained two distinct subpopulations based gorithm for differential expression analysis of large-scale on unsupervised clustering analysis (Additional file 1: UMI-based scRNA-seq data. We extended the general Figure S2) and were included to evaluate the effects of NB-based models by allowing independent dispersion strong heterogeneity on model selection and fitting parameters in each biological condition, resulting in the (Table 3). Although few genes (4–7, 0.04–0.06%) preferred NBID method. This approach is analogous to the t-test, the ZINB model in the relatively homogeneous T-cell pop- which allows different variances between groups when ulations, the percentage of genes selecting the ZINB testing the equivalence of means. The rationale stems model in Rh41 cells was slightly elevated, albeit still low from the apparent variations in dispersion even at the (39 genes, 0.21%). The expression of these genes differed same average expression level [3, 7]. Because the number significantly between the two clusters (FDR < 0.05, the of cells in each condition is generally sufficient in Wilcoxon rank sum test; see also Additional file 2:Table large-scale datasets, we derive separate dispersion esti- S2), suggesting that the fraction of genes preferring the mates for each condition; these are used in the subse- ZINB model correlates with the level of heterogeneity. quent NB-based test against the null hypothesis that Compared to the datasets of Ziegenhain et al. [12], the T different conditions have the same average expression. and Rh41 datasets displayed a lack of statistical fit for sim- We compared the proposed method with other com- pler models (Table 4). Specifically, the genes modeled by monly used methods (Additional file 2: Table S4): the Poisson model dropped to 61.1% (range = 51.9–67.6%) Monocle2 [11]; SCDE [5]; ROTS [18]; MAST [9]; and and the percentage of genes that rejected the NB model Seurat [19]. Although both SCDE and MAST were increased to 5.3% (range = 3.4–8.4%). In addition to ele- developed for read counts, their authors claim that they vated heterogeneity in the Rh41 cells, the sample size of can be applied to UMI data. To handle the apparent these datasets (range = 6875–9850) also played an import- zero inflation, SCDE employs a mixture of a NB model ant role in the increased lack of model fitness. It has been and a Poisson model, while MAST uses a hurdle model documented that very large samples invariably produce with the non-zero component modeled with a Gaussian Table 3 Number of genes with selected models for large-scale datasets on the GemCode and Chromium platforms NB vs ZINB Poisson vs NB Data Cells used (n) Genes detected (n) Genes tested (n) Genes converged (n) ZINB NB Poisson Poisson (%) Naive T cells (Gemcode) 9850 32,738 11,978 11,977 7 5336 6634 55.39 Memory T cells (Gemcode) 9578 32,738 12,569 12,567 4 6336 6227 49.55 Rh41 (Chromium) 6875 33,416 18,435 18,435 39 9387 9009 48.87 Chen et al. Genome Biology (2018) 19:70 Page 6 of 17 Table 4 Goodness of fit test for the Poisson and NB models for large-scale datasets on the GemCode and Chromium platforms Data Cells used (n) Genes tested (n) Genes reject Poisson (n) Genes reject NB (n) Accept Poisson (%) Reject NB (%) Naive T cells (Gemcode) 7332 776 403 65 48.07 8.38 Memory T cells (Gemcode) 8622 836 533 28 36.24 3.35 Rh41 (Chromium) 6187 6853 4630 295 32.44 4.30 A model is rejected if FDR < 0.05 among all genes tested; otherwise it is accepted NB negative binomial The 1st gene wit h (p −value=0.0405) The 1st gene with (p−value=0.0405) FDR>0.2 FDR>0.2 ab Empirical Theoretical Empirical Theoretical 0246 0246 cd The 1st gene with FDR<.05 (p−value=0.004) The 1st gene with FDR<.05 (p−value=0.004) Empirical Theoretical Empirical Theoretical The gene with the worst FDR (p-value=0) The gene with the worst FDR (p-value=0) ef Empirical Theoretical Empirical Theoretical 010 20 30 40 010 20 30 40 Fig. 2 Goodness of fit using the negative binomial distribution on the naïve T-cell data (Tn). a The empirical and theoretical probability mass function (pmf) for the first gene with FDR > 0.2. b The empirical and theoretical cumulative distribution function (cdf) for the first gene with FDR > 0.2. c, d The same pmf and cdf plots for the first gene with FDR < 0.05. e, f The same pmf and cdf plots for the gene with the worst FDR Density Density Density 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.0 0.2 0.4 0.6 0.8 0.0 0.2 0.4 0.6 0.8 CDF CDF CDF 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Chen et al. Genome Biology (2018) 19:70 Page 7 of 17 distribution. Monocle2 [11] and Seurat [19] provide UMIs retained). Because NBID assumes a sufficiently large NB-based differential expression analysis (among other number of cells in each group, we evaluate its robustness models) for UMI counts. We also included ROTS based in common scenarios for scRNA-Seq experiments with on a recent comparison of scRNA-seq differential expres- different number of cells (60, 300, or 1000 cells, approxi- sion analysis [18]. Recently, several scRNA-seq-tailored mating samples collected from 96-well plates, 384-well normalization schemes have been proposed [16]. We plates, and by droplet methods, respectively). evaluate their contributions by integrating NBID with In extensive simulations of 300 or 1000 cells (two scran, a state-of-the-art generic normalization method groups combined) with different simulated fold change [25], by using the cell-specific size factor estimated by from multiple datasets obtained by different protocols, scran in NBID (NBID_scran). Monocle2, SCDE, and Seurat consistently inflated the FDR and the number of false positives increased with FDR and power comparison for differential expression the level of expected UMI difference between groups analysis of UMI-based scRNA-seq data (Table 5, Additional file 2: Tables S5–S8). Similar to the We first evaluated the FDR control for all methods by comparisons result derived from read counts [27], SCDE using simulated data. Instead of generating artificial generally detected fewer DE genes compared to other datasets from a theoretical distribution, we simulated methods in UMI-count scRNA-seq data. However, it groups of cells with differentially expressed genes from often produced relatively high number of false positives, publicly available datasets collected from different which resulted in severely inflated FDRs in the simula- protocols (data from memory T cells obtained by Gem- tions. Due to the severely inflated FDRs (with or without Code [8], from whole-intestinal organoids obtained by expected group difference in the total UMI counts) in CEL-Seq [20], and from heterogeneous dendritic cells many scenarios, both SCDE and Seurat were excluded obtained by MARS-Seq [21]). We began with randomly from subsequent analyses. ROTS controlled FDR with- generating two distinct groups of cells by swapping the out UMI difference, but severely inflated the FDR in UMI counts for two sets of genes in the second group. scenarios with expected group differences in UMIs. Here, the first group represented cells collected under a While MAST controlled the FDR without and with mild reference condition and the second group contained group difference in UMIs, it also shows inflated the FDR cells under the testing condition with simulated differen- with intermediate differences in total UMIs. NBID and tial expressions. The two equal-sized sets of genes had NBID_scran were the only methods to achieve proper different average expression levels in the full dataset FDR control under all three scenarios (Table 5). before swapping. This strategy generated artificially In the simulation of 60 cells combined, all the methods separated groups of cells while retaining specific charac- except for MAST yielded various degrees of FDR infla- teristics of the scRNA-seq counts for each cell. The dis- tion (Additional file 2: Table S9), indicating that > 60 tribution of the total number of UMIs captured in a cell cells are needed for a robust DE analysis in scRNA-seq. is an important characteristic for UMI-based scRNA-seq Nevertheless, when we focused on genes with high ex- experiments. Although biological differences (such as pression (with TPM ≥ 50 in at least one group, similar to the physical cell size, proliferation status, and cell-cycle the threshold employed in reference [28]), NBID and stages) may affect the absolute number of transcripts in NBID_scran approached the desired FDR control in all the cells, technical (non-biological) variations, such as three scenarios. the cell-to-cell variations in the conversion factor be- We used precision-recall curves to evaluate the power tween transcripts and captured UMIs and variations in of the methods (Fig. 3 and Additional file 1: Figures S5– the sequencing depth, have substantial influence on the S9). Measured by the area under the curve (AUC), NBID number of UMIs captured for each cell. Moreover, the and NBID_scran robustly outperformed other methods effects of total UMI variations are disproportionally in different simulation scenarios. Although ROTS had a biased towards the gene with lower expression [17, 26] slight edge without group difference in UMIs, it was and the disparity in the number of UMIs is further exac- highly sensitive to the group difference: even a mild dif- erbated in scenarios in which two groups of cells being ference dropped the AUC to 0 in simulations with 1000 compared are captured and sequenced separately. To cells. NBID and NBID_scran achieved similar results, evaluate the robustness of performance against the com- suggesting that the total UMI count is a good estimator monly observed difference in the total UMIs captured for the cell-specific size factor in the simulations. per cell, we simulated three scenarios in terms of ex- pected group difference in the total UMIs captured by Differential expression analysis of naive T cells and memory sub-sampling UMIs in the second group of cells: no dif- Tcells ference (100% UMIs retained); mild difference (80–90% We evaluated three algorithms (MAST, ROTS, and NBID) UMIs retained); and intermediate difference (50–60% for their ability to identify DE genes in naïve T cells and Chen et al. Genome Biology (2018) 19:70 Page 8 of 17 Table 5 FDRs of evaluated methods a b c 1 0.8–0.9 0.5–0.6 Method FDR False (n) DE (n) FDR False (n) DE (n) FDR False (n) DE (n) Monocle2 0.069 5.9 83.9 0.089 7 79.1 0.276 22.1 79 SCDE 0.299 2.6 8.3 0.34 3.7 9.5 0.848 123.5 145.2 MAST 0.001 0 29.5 0.003 0.1 28.2 0.193 3.4 19.5 ROTS 0.045 4.4 97.5 0.497 71.6 145.9 0.835 272.4 323.9 Seurat_ttest 0.244 31.5 128.3 0.441 69.6 156.1 0.927 653.6 704.5 Seurat_bimod 0.154 17.6 112.5 0.655 172 258.7 0.928 924.5 996.3 Seurat_tobit 0.248 32.2 129 0.45 72 158.5 0.873 351.7 402.8 Seurat_poisson 0.208 25.7 122.6 0.188 20.1 106.1 0.573 67.4 116.2 Seurat_negbinom 0.197 23.9 120.7 0.164 16.9 102.4 0.5 47.7 93.8 NBID_scran 0.038 3.4 86.9 0.035 2.8 80.1 0.039 2.5 62.4 NBID 0.033 2.8 85.7 0.032 2.7 81.4 0.03 1.8 58.9 No sub-sampling The sub-sampling ratio in Group 2 was 0.8–0.9 The sub-sampling ratio in Group 2 was 0.5–0.6. Bold values indicate FDR > 0.05. Bold and underlined values indicate FDR > 0.1. The nominal FDR was 0.05. Simulation based on the Memory T-cell data [8], 500 cells in each group, results are averaged over 96 replicates (see Additional file 2: Tables S5–S9 for results for other simulation scenarios). NBID_scran used the size factor computed by scran as the offset instead of the total UMI counts memory T cells (Fig. 4 and Additional file 2:Tables S10 three genes missed by NBID (LY96, STAM,and TOX)had and S11). NBID detected more DE genes than did MAST very low expression in the dataset (average UMI count or ROTS (Fig. 4a), consistent with the simulation results from the large group: 0.002, 0.007, and 0.007, and TPM: showing better power with NBID. Because the true DE 2.7, 9.4 and 10.7, with an average of approximately 850 genes in the groups were unknown, we compared the in- UMIs being captured per cell), leading to insufficient de- ferred DE genes against a published list of DE genes for tection power for DE genes. Consequently, none of the naïve T cells and memory T cells (Table 1 in reference evaluated algorithms classified the three genes as DE [29]). Since both T-cell datasets were derived from the genes. Additional file 1: Figure S10 shows density plots of CD4+ population [8], CD8+ specific genes were ignored. selected genes. Of the 37 true positives, NBID, MAST, and ROTS recov- We carried out additional in-silico validation of pre- ered 34 (92%), 30 (81%), and 24 (65%), respectively. The dicted DE genes by NBID and MAST. We assumed that No UMI differences Mild UMI differences c Intermediate UMI differences Precision−Recall − P: 9700, N: 1532309 Precision−Recall − P: 9700, N: 1532309 Precision−Recall − P: 10000, N: 1579700 Monocle2 (AUC 0.725) Monocle2 (AUC 0.621) Monocle2 (AUC 0.36) MAST (AUC 0.72) MAST (AUC 0.69) MAST (AUC 0.018) ROTS (AUC 0.942) ROTS (AUC 0) ROTS (AUC 0) NBID_scran (AUC 0.876) NBID_scran (AUC 0.847) NBID_scran (AUC 0.683) NBID (AUC 0.879) NBID (AUC 0.848) NBID (AUC 0.687) 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Recall Recall Recall Fig. 3 Precision-recall curves for selected methods. a The precision-recall curve without UMI differences between two groups. b The precision- recall curve with mild UMI differences between two groups. c The precision-recall curve with intermediate UMI differences between two groups. For each scenario, some methods failed to run on few replicates but at least 97 replicates were used to calculate the precision and recall rate. P true DE genes, N true non-DE genes Precision 0.80 0.85 0.90 0.95 1.00 Precision 0.80 0.85 0.90 0.95 1.00 Precision 0.80 0.85 0.90 0.95 1.00 Chen et al. Genome Biology (2018) 19:70 Page 9 of 17 a Precision−Recall − P: 28740, N: 265660 NBID (AUC 0.514) MAST (AUC 0.461) 0.0 0.2 0.4 0.6 0.8 1.0 Recall c Precision−Recall − P: 28740, N: 265660 Precision−Recall − P: 28740, N: 265660 NBID (AUC 0.724) MAST (AUC 0.601) NBID (AUC 0.929) MAST (AUC 0.834) 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Recall Recall Fig. 4 Comparison of detected genes in naïve T cells and memory T cells. a The Venn diagram of DE genes detected by NBID, ROTS, and MAST. b The precision-recall curves obtained using a subset of 1000 cells of each cell type. For the power calculation, we chose the DE genes detected in both NBID and MAST as the true DE genes and the genes not detected as DE genes in either NBID or MAST as the true non-DE genes. c, d The same as (b) except using subsets of 2000 and 5000 cells, respectively the DE genes detected by both algorithms were true posi- Differential expression analysis for biomarker identifications tive and that genes not detected as DE genes by either scRNA-seq has been widely used to reveal the subpopu- algorithm were true negatives. We then randomly sam- lation structure in heterogeneous cell populations pled subsets of cells from each population (1000, 2000, or through unsupervised clustering approaches [1]. Differ- 5000 cells) and compared the recovery of these genes in ential expression analysis of identified cell subpopula- ten subsampled replicates. NBID outperformed MAST, tions can further characterize their functional differences having a higher AUC in all three settings (Fig. 4b–d). and identify potential biomarkers for experimental valid- In this real-data analysis, NBID_scran again achieved ation and subpopulation separation. Consequently, we similar results as NBID. Specifically, 5728 DE genes were applied NBID and MAST to detect DE genes in the two detected by both methods, accounting for 95.5% and 94.3% subpopulations inferred to be present in Rh41 cells. of all DE genes by NBID (5997) and NBID_scran (6076), Among the expressed genes (with TPM ≥ 3 in at least respectively. Together with additional evidences from simu- one group [30]), NBID and MAST revealed 1019 and lation studies, we conclude that the default normalization 448 DE genes between the two clusters with a fold scheme employed by NBID generally achieved comparable change > 2 between the two clusters (Additional file 2: performance with scran, a state-of-the-art normalization Tables S12, S13), respectively. We ranked the potential scheme. of DE genes to be robust biomarkers based on the test Precision 0.80 0.85 0.90 0.95 1.00 Precision Precision 0.80 0.85 0.90 0.95 1.00 0.80 0.85 0.90 0.95 1.00 Chen et al. Genome Biology (2018) 19:70 Page 10 of 17 FDR values, their relative fold changes, and their overall Evaluation and control of batch effects expression levels. The CD44 gene, which encodes a Differential gene expression analysis of scRNA-seq data fre- commonly used cell surface marker, appeared at the top quently involves data generated in separate batches (e.g. in of the list (Fig. 5a). FACS sorting confirmed the presence different lanes or plates in single-cell library construction). of two subpopulations with different CD44 protein levels This can introduce batch effects (systematic inter-group high low (CD44 and CD44 ) in Rh41 cells (Fig. 5b). Being technical variations that are not relevant to the biological both a receptor for extracellular matrix components and hypothesis being evaluated), which pose a major challenge a co-factor for growth factors and cytokines, CD44 is a in high-throughput data analyses [34]. Controlling the well-established cancer stem cell marker with great batch effects is, therefore, important in order to distinguish prognostic and therapeutic potentials [31, 32]. true biological differences from technical artifacts [26, 35]. We performed three replicates of FACS sorting on Rh41 We evaluated batch effects in the two replicates of the four high low cells and collected both CD44 and CD44 subpopula- UMI-based protocols used by Ziegenhain et al. [12](Fig. 6). tions for bulk RNA-seq. Of the 1019 DE genes identified Although various numbers of DE genes (596–5156, Fig. by NBID in the inferred clusters in scRNA-seq, 699 6a–d) were detected by these protocols, only seven were (68.6%) were also detected as DE genes with the same common across all protocols (Fig. 6e), consistent with the direction in the two subpopulations from the bulk hypothesis that most apparent DE genes were the result of RNA-seq analysis, thus validating CD44 as a cell-surface technical noise. Among the four protocols, CEL-Seq2 and marker that could be used to separate the two endogenous SCRB-Seq had relatively stronger batch effects when com- Rh41 subpopulations. Although MAST identified fewer pared to DROP-Seq and MARS-Seq; these stronger effects (448) DE genes, a lower percentage of DE genes (226, were potentially associated with the higher UMIs captured 50.4%) were validated in the bulk DE analysis (Additional per cell. file 2: Table S13), which demonstrated the superior accur- All the evaluated methods except for ROTS allow explicit acy and power of NBID in revealing true DE genes. More- modeling of technical variations (such as differences in over, among the four established surrogate molecular cell-cycle stage and batch effects) as covariates. We evalu- markers for fusion status in rhabdomyosarcoma samples, ated the performance of batch-effect removal by simulating namely the upregulation of TFAP2B, MYOG,and NOS1, group differences mixed with apparent differences arising coupled with the repression of HMGA2 in fusion positive from the batch composition, using data generated from ARMS [33], both bulk DE analysis of sorted subpopula- CEL-Seq2 (Table 6,Fig. 7) and SCRB-Seq (Additional file 1: tions and NBID analysis of subpopulations inferred from Figure S11, Additional file 2: Table S14) by Ziegenhain et al. scRNA-seq revealed repression of TFAP2B and MYOG as [12]. Becauseofthe limitedsamplesizeinthesetwodata- high well as upregulation of HMGA2 in the CD44 subpopu- sets, we focused on highly expressed genes (with TPM ≥ 50 lation (Additional file 2: Table S12), suggesting that the in at least one group). Without explicitly modeling the high CD44 subpopulation represents a less differentiated, batch effects, all methods showed various levels of FDR stem-like cell subpopulation. The exact mechanism by inflation. Most of the tested methods (except MAST) which the distinct subpopulations develop warrants fur- reduced the FDR after modeling batch information as a ther investigation. covariate. NBID outperformed Monocle2 and MAST by CD44 cluster1 cluster2 Fig. 5 Differential expression of CD44 in two clusters in the Rh41 cell line. a Violin plot of the gene expression among cells in the two clusters, the TPM is in log10 scale after adding a small value 1. b The CD44 count distribution when using CD44 to sort single cells, indicating two clusters of cells with different levels of CD44 expression log10TPM 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Chen et al. Genome Biology (2018) 19:70 Page 11 of 17 CEL−Seq2/C1 Drop−Seq FDR<0.05 (3627) FDR<0.05 (596) FDR >= 0.05 (14979) FDR >= 0.05 (17439) −1 0 1 2 3 4 5 6 −1 0 1 2 3 4 5 6 log10TPM in the higher group log10TPM in the higher group c MARS−Seq SCRB−Seq FDR<0.05 (763) FDR<0.05 (5156) FDR >= 0.05 (16039) FDR >= 0.05 (13407) −1 0 1 2 3 4 5 6 −1 0 1 2 3 4 5 6 log10TPM in the higher group log10TPM in the higher group MARS−Seq SCRB−Seq CEL−Seq2/C1 Drop−Seq 356 3745 206 136 2255 79 10 260 1041 14 64 18 Fig. 6 Differential expression analysis of two replicates from Ziegenhain et al. [12]. a–d The log2 fold change vs the maximal gene log TPM for the two biological replicates. NBID was used for the differential expression analysis of two replicates of each of four UMI-based protocols. The red dots indicate genes with FDR < 0.05. e Venn diagram of DE genes from four UMI-based protocols demonstrating better recovery of true positives with prop- differences between the read counts and the UMI counts erly controlled FDR. for the scRNA-seq data. Our analysis suggests that, com- pared to read counts, UMI counts can be modeled by a Discussion simpler distribution. Specifically, the NB model is an ad- In the present study, we performed extensive model se- equate model for UMI-count data in the absence of an lection and goodness-of-fit analyses using multiple explicit need to account for dropout events by using scRNA-seq datasets and revealed intrinsic distributional zero-inflated models. Data derived from the Smart-Seq log2FC log2FC −10 −5 0 5 10 −10 −5 0 5 10 log2FC log2FC −10 −5 0 5 10 −10 −50 5 10 Chen et al. Genome Biology (2018) 19:70 Page 12 of 17 Table 6 FDRs with and without controlling batch covariates No filtering TPM ≥ 50 Method FDR False (n) DE (n) FDR False (n) DE (n) Monocle2 0.279 37.7 132.4 0.202 24.3 118.65 Monocle2_plateCov 0.100 10.25 101.3 0.054 5.3 96.2 MAST 0.264 28.75 74.65 0.264 28.75 74.65 MAST_plateCov 0.268 28.65 72.85 0.268 28.65 72.85 ROTS 0.258 36.2 132.1 0.206 26.75 122.1 NBID 0.326 50.35 146.85 0.183 22.9 119.05 NBID_scran 0.328 50.45 146.95 0.200 25.3 121.45 NBID_plateCov 0.124 13.5 108.1 0.048 4.75 99.05 NBID_scran_plateCov 0.124 13.45 108 0.048 4.75 98.9 Simulation based on data: CEL-Seq2 (both batch A and batch B) from Ziegenhain et al. [12]. Sample size was 60 (30 cells in each group). In total, Replicate A had 30 cells and Replicate B had 35 cells after QC. Group 1 had 9 cells from Replicate A and 21 cells from Replicate B. Group 2 had 18 cells from Replicate A and 12 cells from Replicate B. Method names with plateCov indicate adjusting the batch covariates. NBID_scran and NBID_scran_plateCov used the size factor computed by scran as the offset instead of the total UMI counts Bold values indicate FDR > 0.05 Bold and underlined values indicate FDR > 0.1 protocol in reference [12] deviated slightly from other that also explains the differences between UMI-count- and read-based data, with fewer genes preferring the ZINB read-count-based scatter plots (Fig. 1). The PCR amplifica- model and a lower proportion of NB rejection in the tion step produces a sharp contrast between the read goodness-of-fit test. Although the exact cause of the ob- counts and UMI counts. Whereas the UMI counts follow a servation is unknown, our analysis of a different Poisson/NB distribution, the read counts—even with a con- Smart-Seq dataset [23] (Additional file 2: Table S1) re- stant multiplication factor (i.e. with no amplification sulted in a pattern similar to that seen with other read biases)—no longer follow the same Poisson/NB model (see count-based protocols. the “Methods” section for more details). The uneven ampli- Based on the result of our analysis, we propose a hypo- fication bias (i.e. with transcripts being amplified at differ- thetical model linking the UMI counts and read counts (see ent levels) introduces extra deviations from the underneath Additional file 1: Supplementary methods and Figure S12) (simpler) distribution of the UMI counts. Consistent with the hypothesized model, recent studies have shown that in- ferring approximate transcriptcounts fromthe read-count Precision−Recall − P: 2000, N: 370120 data can significantly improve the analysis efficiency [11]. A few published studies have suggested that NB models are often to be preferred for UMI-based scRNA-seq data [3, 13]. Although we reached the same conclusion, we be- lieve that our design controlled potential technical noises and allowed us to draw a stronger and more valuable con- clusion from the extensive evaluation. Grun et al. evalu- ated the technical noise in the read counts and UMI counts in a relatively small dataset (74 cells) generated by CEL-Seq and concluded that, when compared to the nor- mal and log-normal models, a NB model explained the distribution of more genes [3]. However, the captured UMIs were converted to theoretical transcript-counts (based on the estimated conversion rate) before model Monocle2_plateCov (AUC 0.621) fitting. This process could be approximated by the sce- MAST_plateCov (AUC 0) NBID_scran_plateCov (AUC 0.947) nario of amplification without biases in our hypothetical NBID_plateCov (AUC 0.946) model (Additional file 1: Figure S12) and converted 0.0 0.2 0.4 0.6 0.8 1.0 transcript counts (in theory) no longer follow a NB model. Recall Consequently, although a NB model explained the distri- Fig. 7 Precision-recall curves for selected methods on simulated bution of more genes than did the normal and log-normal datasets after adjusting batch variables. P true DE genes, N true models, it only accounted for a small fraction of the 11,555 non-DE genes genes analyzed. Recently, Vieth et al. [13] carried out a Precision 0.80 0.85 0.90 0.95 1.00 Chen et al. Genome Biology (2018) 19:70 Page 13 of 17 study that estimated characteristics in 18 UMI-based and real DE genes when compared to previously developed 20 read-based scRNA-seq datasets, including those of Zie- methods for single-cell analysis. Even though only pairwise genhain et al. [12], that were extensively evaluated in the analyses were considered in the current study, the general present study. However, the evaluation of Vieth et al. was form of NBID allows multiple groups to be tested simul- based on modeling read counts and UMI counts collected taneously, as in the generalized linear model framework. in separate experiments, which inevitably introduced un- Differential expression analysis can be used to reveal controlled differences between the experiments. differences among samples run on separate lanes or Our design directly compared read counts (before plates. However, it should be pointed out that batch ef- conversion to UMIs) and corresponding UMI counts fects are expected to overlap with biological differences collected from the same set of cells, enabling us to dir- in this setting. It is better to account for batch effects by ectly evaluate the effects of PCR amplification in statis- proper experimental design, such as by including multiple tical modeling. Moreover, our analysis revealed the biological replicates for each group. Another typical appli- necessity of controlling batch differences. Combining the cation of differential expression analysis is to identify poten- knowledge derived from the extensive modeling with the tial biomarkers for inferred cell subpopulations. One expected large sample size, we proposed NBID, a novel potential caveat in this setting is that cells are usually clus- differential expression analysis algorithm designed for tered from the same data. Therefore, p values or FDR use with UMI-based scRNA-seq data. NBID is based on values derived from the differential expression analysis the negative binomial generalized linear regression might be overly optimistic. However, the result is still useful (GLM) framework, thus shared similarities with those for prioritizing potential biomarkers for further validation. employed in bulk RNA-seq analysis [36, 37]. The major difference compared to those originally proposed in bulk Conclusions RNA-seq analysis is that we allow independent We have conducted an extensive analysis of multiple group-specific dispersions for each gene based on obser- scRNA-seq datasets and have concluded that, unlike vations that genes of the same expression level might read counts, UMI counts can be modeled appropriately have different dispersion parameters [3, 7]. Because of with the negative binomial model. More complex the sample size limitation, algorithms proposed in bulk models, such as zero-inflated negative binomial models, RNA-seq analysis [36, 37] typically pool genes with simi- provide no extra gain. Based on the above conclusion, lar expressions for a robust and smooth estimate of dis- we have proposed a differential expression analysis algo- persions and assume identical dispersion between rithm that allows independent estimations of dispersion groups. NBID exploits the direct benefit of the large for individual genes within each group. Compared to sample size in scRNA-seq, which allows group-specific other recently developed methods, our proposed algo- estimates of dispersion for each gene. This difference is rithm achieves proper FDR control and better power for analogous to the difference between a t-test assuming detecting differentially expressed genes in large-scale equal variance and an unequal variance t-test. Several UMI-count scRNA-seq datasets. studies have shown that an unequal variance t-test per- forms equally well when the underlying group variances Methods are identical but outperforms a t-test assuming equal Model selection and testing variance when the group variance are different [38, 39]. We first checked whether the ZINB model was neces- Although we focused on comparison to algorithms de- sary for the UMI counts. This was done by a statistical signed for scRNA-seq in this study, we believe that it test comparing the NB and ZINB models for each gene will inspire future in-depth evaluations (under various with the null hypothesis that NB fitted the data well. technical scenarios) with additional methods, including The likelihood ratio statistic was used. We used an FDR those originally proposed for bulk RNA-seq analysis. level of 0.05 to control the false positives because of the Technical variations (e.g. batch effects and variations large number of genes tested. For those genes that ac- in the total UMIs captured) are common in scRNA-seq cepted the NB model, we then checked whether the NB experiments; accounting for these variations is critical to model was necessary by testing the Poisson model ver- revealing true biological differences in differential ex- sus the NB model, with an FDR level of 0.05. For both pression analysis. As shown in our simulation, many NB versus ZINB and Poisson versus NB comparisons, scRNA-seq analysis packages yielded inflated FDRs with the parameter being tested was on the boundary, and technical variations (such as small differences in the the log likelihood-ratio test statistic follows an equal total UMIs for the groups), which might result in ele- mixture of 0 mass and a chi-square distribution with 1 vated false positives and/or true positives being masked. degree of freedom under the null hypothesis [40, 41]. In contrast, our analysis indicates that NBID achieved The p value was calculated based on this mixture distri- both proper FDR control and better power in revealing bution, and the FDR was calculated using the Benjamini Chen et al. Genome Biology (2018) 19:70 Page 14 of 17 and Hochberg’s method [42] . To ensure the conver- package fitdistrplus to plot the empirical pmf/cdf versus gence of fitted NB and ZINB models, we keep only those the theoretical ones [45]. genes that satisfy L(Poisson) ≤ L(NB) + δ and L(NB) ≤ L(ZINB) + δ, where L(M) is the log likelihood of fitted Differential expression analysis using the NB model with model M, and we set δ to 0.5 to allow some numerical independent dispersions (NBID) variations in likelihood maximizing. The Poisson model To simplify the notation, we focus here on one gene. Let was fitted using the glm function in R. Two methods us denote the count in cell i by y ;then y  NBðn μ ; ϕ Þ, i g i i were used to fit a NB model and the one with the higher where n is the total number of counts for cell i and μ is i i likelihood was used in the model comparison. The first the proportion of the gene counts in cell i. ϕ is the method was implemented using glm.nb in the R package dispersion for cell i with group label g , for example, g = i i [−8, −7, …,4] MASS [43]. A grid of initial values 10 for θ 0 or 1 for two groups. As in generalized linear models, we (the reciprocal of the dispersion) was tried, and the lar- link the mean proportion to explanatory variables such as gest likelihood was used. The second method was to first group labels; and other potential covariates. Specifically, fit other parameters related to the mean with an initial for two groups, the full model is: dispersion, and then search for the optimal dispersion value to maximize the likelihood given the estimated logðÞ n μ ¼ β þ β g þ γ x ; i i i 0 1 i mean. That method iterated between these two steps where β is the intercept, β is the group effect size in 0 1 until a maximal number of iteration was reached or the the log scale, and γ is a vector of coefficient for the other change in likelihood was small enough. The ZINB model covariate vector x . The likelihood of the observed data was fitted using the function zeroinf from the R package under the full model is pscl [44]. To increase the convergence rate, we first fit- ted a NB model and then used parameters from the NB L ¼ fy jn μ ; ϕ ; model as the initial values. For all model comparison, we i i i g i¼1 restricted the comparisons to genes with at least five non-zero cells among all the cells to ensure meaningful where m is the number of cells and f ðy jn μ ; ϕ Þ is the i i expression pattern. probability of y assuming a NB distribution with mean −1 Γðϕ þyÞ n μ and dispersion ϕ . Specifically, f ðyjμ; ϕÞ¼ i i −1 Goodness-of-fit test g y!Γðϕ Þ −1 We first down-sampled each cell to the 10% quantile of −1 μ y ϕ ð Þ ð Þ : −1 −1 ϕ þμ ϕ þμ the total UMI among all the cells so that the gene-count We note that, here, n serves as a normalization factor values for each gene would be comparable among cells. i or size factor, similar to those used in edgeR [36] and The cells corresponding to the lower 10% quantile were DESeq [37]. Alternatively, NBID can accept size factors not used. The down-sampling was performed by sampling estimated by other methods, such as scran [25]. the transcript without replacement, which follows a multi- We compute the maximum likelihood estimate of the variate hypergeometric distribution. After down-sampling, dispersion parameters ϕ and the coefficients related to only genes with a nonzero count in more than five cells g were kept. Then the count values were assigned to differ- the mean by using the R package nloptr. To test whether ent intervals (bins). First each unique count value itself there is a difference between groups, we also fit the null forms its own bin and the number of cells falling into each model log(n μ )= β + γ x with dispersions estimated i i 0 bin was recorded. Staring from the bin of the largest count from the full model. Finally, a likelihood ratio test is value, bins with no more than five cells were combined used to compare the reduced model and the full model, with next bin. The degree of freedom for the Chi-square which follows a chi-square distribution with one degree goodness-of-fit test is k − p − 1, where k is the number of of freedom. bins and p is the number of parameters of the model used. For example, the degree of freedom for the NB model is k Computing time for NBID in large scale datasets – 3 and that for the Poisson model is k – 2. This proced- NBID took 7.5 h in the analysis of naïve T cells versus ure filters out genes with expression levels that are too memory T cell datasets (9850 and 9578 cells, respect- low. For example, genes with count values of only 0 or 1 ively) on an Intel Xeon processor (E5-2670) running Red (two bins) will not be included for testing. However, for Hat Enterprise Linux 6 operating system and R 3.3.1. these genes, the Poisson or NB model will often result in a very good fit due to the simplicity of the data. In this Methods evaluated study, the maximum likelihood estimate of the model pa- The methods evaluated and additional details are listed rameters were estimated first and then the theoretical in Additional file 2: Table S4; unless specifically stated counts for individual bins were calculated. We used the R otherwise, the default options for each method were Chen et al. Genome Biology (2018) 19:70 Page 15 of 17 used in the evaluation. When there was a need to con- the power for detecting DE genes, we plotted the vert a count value x to the log2 scale, log2(x +1) was precision-recall curve and used the area under the used for the conversion. The FDR was calculated based curve (AUC) as the criterion; this was calculated based on p values by Benjamini and Hochberg’s method [42], on all the applicable replicates. Because only the top except for SCDE and ROTS. For SCDE, an adjusted p genes with relatively small estimated FDRs are of value was used based on the output corrected z-score interest in a real data analysis, we restricted the com- and assuming a standard normal distribution. For ROTS, parison to the region where the precision was above we used the FDR output from the package, which was 0.8, i.e. the region with FDR ≤ 0.2. This approach was calculated based on the bootstrap resampling. more reasonable than using the full range of the precision-recall curve, even though the result patterns Data simulation were often similar. This method is also better than To simulate data for differentially expressed analysis, we using the receiver operating characteristic (ROC) sampled 1000, 300, or 60 cells from the real UMI-count curve as used in some published papers for power matrix from memory T cells obtained by GemCode [8], comparison because the true negative genes are often from heterogeneous dendritic cells obtained by the majority; therefore, only the region with very high MARS-Seq [21], and from whole-intestinal organoids specificity (so that the FDR can be low) is of interest obtained by CEL-Seq [20], respectively. Cells were ran- but the cut-off is not easy to determine with the ROC domly split into two groups. To create differentially curve because the specificity is not directly related to expressed genes, we first ranked genes based on the the FDR. average count in the second group and chose 50 genes starting with the one having an average UMI count just Rh41 single-cell dataset above t. Denoting the fold change by FC, we selected The human alveolar rhabdomyosarcoma cell line, Rh41, another 50 genes starting with the average count just was cultured in a 5% CO incubator in a 75-cm vented above FC × t. We then swapped these two sets of genes flask containing DMEM media supplemented with 10% in their count matrix in the second group. This simula- FBS and 2× glutamine until the cells reached 75% con- tion kept the distribution pattern of the UMI counts un- fluence at approximately 3.6×10 cells. The cells were changed and created differentially expressed genes with detached from the flask with 7 mL of 1× citrate saline to certain fold-change levels. In our simulation, we set FC which 7 mL of DPBS was added followed by centrifuga- and t so that the precision-recall curves (power) were in tion at 300×G for 7 min. The cells pellet was resus- a good range. pended in 300 uL of blocking buffer (Rat IgG/PBS) and To simulate datasets with known batch variables, we incubated on ice for 30 min. A total of 50 uL of the cells sampled different proportions of cells in each replicate in blocking buffer were transferred to a separate tube for to form two groups. Specifically, we sampled nine cells the isotype control. The cells were washed with 1 mL of from replicate A and 21 cells from replicate B to form staining buffer (5% BSA/PBS) and centrifuged at 300×G the reference group, 18 cells from replicate A and 22 for 5 min. The pellet containing approximately 3×10 cells from replicate B to form the other group. We se- cells was incubated with Rat IgG2B anti-CD44-Alexa lected DE genes which were not influenced by the ap- 488 antibody (R&D systems) in staining buffer (15 uL parent difference between replicates. Therefore, when antibody + 135 uL of staining buffer) for 30 min on ice. these true DE genes were detected, it was not due to the For the isotype control ~ 600,000 cells were incubated detection of simulated batch effects. Specifically, we se- with 5 uL of Rat IgG2B-Alexa488 (R&D systems) + 45 lected DE genes with p value > 0.5 from the DE analysis uL of staining buffer for 30 min on ices. After the incu- results between the two replicates. We used two plat- bation, both sets of cells were pelleted and washed with forms in this simulation: CEL-Seq2 and SCRB-Seq, 1 mL of staining buffer as described above and resus- which showed strong batch effects in the DE analysis be- pended in staining buffer, followed by flow cytometric tween two replicates. analysis to identify the fraction of CD44 positive and negative populations. Evaluating FDR and power by using the precision-recall For the single cell experiment, Rh41 cells were cultured curve and harvested and washed in DPBS, as described above, We simulated 100 or 20 replicates for each down-sampling and resuspended in PBS/0.2%BSA at a concentration of setting. The FDR was calculated for each replicate and then 1×10 cells/mL. The 10× Genomics single-cell platform averaged across the replicates to generate the mean FDR. performs 3′ gene expression profiling by poly-A selection Because a few datasets had running problems with selected of mRNA within a single cell, which utilizes a cell barcode competitive methods, replicates on which all methods ran and UMIs for each transcript. Single-cell suspensions were successfully were used in the final analysis. To obtain loaded onto the Chromium Controller according to their Chen et al. Genome Biology (2018) 19:70 Page 16 of 17 respective cell counts to generate approximately 6000 par- were ligated to the fragments followed by 12 cycles of PCR titioned single-cell GEMs (Gel Bead-In-EMulsions). The amplification on a C1000 (bio-rad). Paired end sequencing library was prepared using the Chromium Single Cell 3′ was performed (151 bases per read) on a HiSeq 4000 (Illu- v2 Library and Gel Bead Kit (10× Genomics) according to mina). Three replicates were generated. HTSeq [46]was the manufacturers protocol. The cDNA content of each used to produce the count data. edgeR [36] was used for sample after cDNA amplification of 12 cycles was quanti- the DE analysis with TMM normalization. Each repli- high low fied and quality checked by High-Sensitivity DNA chip on cate was coded as a pair of CD44 and CD44 in a 2100 Bioanalyzer (Agilent Technologies) at a dilution of the analysis. 1:6. This quantification was used to determine final library amplification cycles in the protocol, which was calculated Additional files at 12 cycles. After library quantification and quality check by DNA 1000 chip (Agilent Technologies), samples were Additional file 1: This file includes: (1) supplementary methods describing details in single cell quality control and preprocessing, diluted to 3.5 nM for loading onto the HiSeq 4000 (Illu- application details of other DE methods, and a statistical model linking mina) with a 2 × 75 Paired-end kit using the following UMI and read counts; (2) all supplementary figures. (PDF 2338 kb) read length: 26 bp Read1 (10× cell barcode and UMI), Additional file 2: This file includes all supplementary tables. (XLSX 1530 kb) 8 bp i7 Index (sample index), and 98 bp Read2 (insert). An average of 400,000,000 reads per sample was obtained, Acknowledgements which translated to roughly 80,000 mean reads per cell, We thank Christoph Ziegenhain for sharing the count matrix of UMI-based per sample. The Cell Ranger 2.0.1 Single-Cell Software protocols and Keith A. Laycock and Xiaotu Ma for editing the manuscript. Suite (10× Genomics) was implemented to process the Funding raw sequencing data from the Illumina HiSeq run. This This study was also supported in part by the National Cancer Institute of the pipeline performed de-multiplexing, alignment (GRCh38/ National Institutes of Health under Award Number P30CA021765 and by STAR), and barcode processing to generate gene-cell ALSAC. matrices used for downstream analysis. Availability of data and materials After matrix generation the ribosomal and mitochon- high The Rh41 scRNA-seq dataset and the bulk RNA-seq data for sorted CD44 drial related genes were filtered. The subpopulation low and CD44 subpopulations generated in this study have been deposited in structure in Rh41 cells was inferred using a novel clus- GEO under the accession number GSE113660 [47]. The functions used for the data analysis are included in the NBID package under a MIT license, tering algorithm developed in house for analyzing which can be installed from Bitbucket (https://bitbucket.org/Wenan/nbid) large-scale scRNA-seq data (manuscript in preparation). [48]. The source code is also uploaded with DOI URL: https://doi.org/10.5281/ Briefly, it first used singular value decomposition (SVD) zenodo.1225670 [49]. The codes for data QC and DE analysis using other packages can be downloaded from https://bitbucket.org/Wenan/ to derive latent cellular states from the expression scrna_qc_de [50]. The public datasets we use in this paper are from Ziegen- matrix for individual cells. The number of significant hain et al. [12], Zheng et al. [8], Grun et al. [20], Jatin et al. [21], Klein et al. [7], cellular states was determined using the Tracy-Widom Islam et al. [22], and Scialdone et al. [23]. test on eigenvalues. A modified version of spectral clus- Authors’ contributions tering was performed on the significant cellular states of WC and XC proposed the study design and method development. WC individual cells (cellular states explained by total UMIs performed model comparison using hypothesis testing, differential expression were ignored) with a different number of clusters (2– analysis method implementation, and data analysis. YL performed the goodness of fit analysis. XC performed clustering analysis on Rh41 cells. JE generated scRNA- 30). The final two-subpopulation structure was deter- seq data from Rh41 cells, CD44 based cell sorting and RNA-seq on sorted subpop- mined by the silhouette measure for solutions with dif- ulations. DF analyzed RNA-seq data from sorted Rh41 subpopulations. GW ferent number of clusters. proposed visualization schemes of analysis results. WC and XC drafted the manuscript. All authors revised and approved the final manuscript. NBID was used to identify DE genes in the two sub- populations. We further filtered the DE genes by using Ethics approval and consent to participate two thresholds: the average expression level with TPM ≥ Not applicable. 3 in at least one cluster [30], and log2 fold-change ≥ 1. Competing interests The authors declare that they have no competing interests. Rh41 bulk RNA-seq dataset RNA was isolated from the sorted subpopulations using Trizol (Thermo Fisher Scientific) following the manufac- Publisher’sNote Springer Nature remains neutral with regard to jurisdictional claims in ture recommendations. RNA libraries were prepared using published maps and institutional affiliations. the Kapa RNA HyperPrep with Riboerase RNA kit (Roche) using the recommended conditions. Briefly, 200 ng of total Author details Department of Computational Biology, St. Jude Children’s Research Hospital, RNA was used as input for fragmentation, reverse tran- 262 Danny Thomas Pl, Memphis, TN 38105, USA. Division of Biostatistics, scription, and second strand synthesis. After clean up, end School of Public Health, University of Minnesota Twin Cities, Mayo Building, repair, and A tailing, Nextflex adapters (Bioo Scientific) Minneapolis, MN 55455, USA. Chen et al. Genome Biology (2018) 19:70 Page 17 of 17 Received: 1 December 2017 Accepted: 30 April 2018 26. Tung PY, Blischak JD, Hsiao CJ, Knowles DA, Burnett JE, Pritchard JK, et al. Batch effects and the effective design of single-cell gene expression studies. Sci Rep. 2017;7:39921. 27. Jaakkola MK, Seyednasrollah F, Mehmood A, Elo LL. Comparison of methods to detect differentially expressed genes between single-cell populations. References Brief Bioinform. 2017;18:735–43. 1. Liu S, Trapnell C. Single-cell transcriptome sequencing: recent advances and 28. Patel AP, Tirosh I, Trombetta JJ, Shalek AK, Gillespie SM, Wakimoto H, et al. remaining challenges [version 1; referees: 2 approved]. 2016;5(F1000 Faculty Single-cell RNA-seq highlights intratumoral heterogeneity in primary Rev):182. https://doi.org/10.12688/f1000research.7223.1. glioblastoma. Science. 2014;344:1396–401. 2. Svensson V, Natarajan KN, Ly LH, Miragaia RJ, Labalette C, Macaulay IC, et al. 29. Weng NP, Araki Y, Subedi K. The molecular basis of the memory T cell Power analysis of single-cell RNA-sequencing experiments. Nat Methods. response: differential gene expression and its epigenetic regulation. Nat Rev 2017;14:381–7. Immunol. 2012;12:306–15. 3. Grun D, Kester L, van Oudenaarden A. Validation of noise models for single- 30. Wagner GP, Kin K, Lynch VJ. A model based criterion for gene expression cell transcriptomics. Nat Methods. 2014;11:637–40. calls using RNA-seq data. Theory Biosci. 2013;132:159–64. 4. Marinov GK, Williams BA, McCue K, Schroth GP, Gertz J, Myers RM, et al. 31. Li F, Tiede B, Massague J, Kang Y. Beyond tumorigenesis: cancer stem cells From single-cell to cell-pool transcriptomes: stochasticity in gene expression in metastasis. Cell Res. 2007;17:3–14. and RNA splicing. Genome Res. 2014;24:496–510. 32. Yan Y, Zuo X, Wei D. Concise review: emerging role of CD44 in cancer stem 5. Kharchenko PV, Silberstein L, Scadden DT. Bayesian approach to single-cell cells: a promising biomarker and therapeutic target. Stem Cells Transl Med. differential expression analysis. Nat Methods. 2014;11:740–2. 2015;4:1033–43. 6. Wang Y, Navin NE. Advances and applications of single-cell sequencing 33. Rudzinski ER, Anderson JR, Lyden ER, Bridge JA, Barr FG, Gastier-Foster JM, et al. technologies. Mol Cell. 2015;58:598–609. Myogenin, AP2beta, NOS-1, and HMGA2 are surrogate markers of fusion status 7. Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, et al. Droplet in rhabdomyosarcoma: a report from the Soft Tissue Sarcoma Committee of barcoding for single-cell transcriptomics applied to embryonic stem cells. the Children's Oncology Group. Am J Surg Pathol. 2014;38:654–9. Cell. 2015;161:1187–201. 34. Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, et al. 8. Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, et al. Tackling the widespread and critical impact of batch effects in high- Massively parallel digital transcriptional profiling of single cells. Nat throughput data. Nat Rev Genet. 2010;11:733–9. Commun. 2017;8:14049. 35. Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson 9. Finak G, McDavid A, Yajima M, Deng J, Gersuk V, Shalek AK, et al. MAST: a A, et al. A survey of best practices for RNA-seq data analysis. Genome Biol. flexible statistical framework for assessing transcriptional changes and 2016;17:13. characterizing heterogeneity in single-cell RNA sequencing data. Genome 36. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for Biol. 2015;16:278. differential expression analysis of digital gene expression data. 10. Bacher R, Kendziorski C. Design and computational analysis of single-cell Bioinformatics. 2010;26:139–40. RNA-sequencing experiments. Genome Biol. 2016;17:63. 37. Anders S, Huber W. Differential expression analysis for sequence count data. 11. Qiu X, Hill A, Packer J, Lin D, Ma YA, Trapnell C. Single-cell mRNA quantification Genome Biol. 2010;11:R106. and differential analysis with Census. Nat Methods. 2017;14:309–15. 38. Ruxton GD. The unequal variance t-test is an underused alternative to 12. Ziegenhain C, Vieth B, Parekh S, Reinius B, Guillaumet-Adkins A, Smets M, et Student's t-test and the Mann-Whitney U test. Behav Ecol. 2006;17:688–90. al. Comparative analysis of single-cell RNA sequencing methods. Mol Cell. 39. Rasch D, Kubinger KD, Moder K. The two-sample t test: pre-testing its 2017;65:631–43. e634 assumptions does not pay off. Stat Pap. 2011;52:219–31. 13. Vieth B, Ziegenhain C, Parekh S, Enard W, Hellmann I. powsimR: power 40. Jansakul N, Hinde J. Score tests for extra-zero models in zero-inflated analysis for bulk and single cell RNA-seq experiments. Bioinformatics. 2017; negative binomial models. Commun Stat Simul Comput. 2009;38:92–108. 33:3486–8. 41. Cameron AC, Trivedi PK. Regression analysis of count data. Cambridge: 14. Wagner A, Regev A, Yosef N. Revealing the vectors of cellular identity with Cambridge University Press; 1998. single-cell genomics. Nat Biotechnol. 2016;34:1145–60. 42. Benjamini Y, Hochberg Y. Controlling the false discovery rate - a practical 15. Pierson E, Yau C. ZIFA: Dimensionality reduction for zero-inflated single-cell and powerful approach to multiple testing. J R Stat Soc Ser B Methodol. gene expression analysis. Genome Biol. 2015;16:241. 1995;57:289–300. 16. Vallejos CA, Risso D, Scialdone A, Dudoit S, Marioni JC. Normalizing single- 43. Venables WN, Ripley BD. Modern applied statistics with S. 4th ed: New York: cell RNA sequencing data: challenges and opportunities. Nat Methods. 2017; Springer; 2010. 14:565–71. 44. Zeileis A, Kleiber C, Jackman S. Regression models for count data in R. J Stat 17. Hicks SC, Townes FW, Teng M, Irizarry RA. Missing data and technical Softw. 2008;27:1–25. variability in single-cell RNA-sequencing experiments. Biostatistics. 2017. 45. Delignette-Muller ML, Dutang C. fitdistrplus: An R Package for fitting https://doi.org/10.1093/biostatistics/kxx053. distributions. J Stat Softw. 2015;64:1–34. 18. Jaakkola MK, Seyednasrollah F, Mehmood A, Elo LL. Comparison of methods 46. Anders S, Pyl PT, Huber W. HTSeq–a Python framework to work with high- to detect differentially expressed genes between single-cell populations. throughput sequencing data. Bioinformatics. 2015;31:166–9. Brief Bioinform. 2017;18(5):735–43. 47. Chen W, Li Y, Easton J, Finkelstein D, Wu G, Chen X. UMI-count modeling 19. Satija R, Farrell JA, Gennert D, Schier AF, Regev A. Spatial reconstruction of and differential expression analysis for single cell RNA sequencing. Datasets. single-cell gene expression data. Nat Biotechnol. 2015;33:495–502. Gene Expression Omnibus. https://www.ncbi.nlm.nih.gov/geo/query/acc. 20. Grun D, Lyubimova A, Kester L, Wiebrands K, Basak O, Sasaki N, et al. Single- cgi?acc=GSE113660. cell messenger RNA sequencing reveals rare intestinal cell types. Nature. 48. Chen W, Li Y, Easton J, Finkelstein D, Wu G, Chen X. UMI-count modeling 2015;525:251–5. and differential expression analysis for single cell RNA sequencing. 21. Jaitin DA, Kenigsberg E, Keren-Shaul H, Elefant N, Paul F, Zaretsky I, et al. Bitbucket. https://bitbucket.org/Wenan/nbid. Massively parallel single-cell RNA-seq for marker-free decomposition of 49. Chen W, Li Y, Easton J, Finkelstein D, Wu G, Chen X. UMI-count modeling tissues into cell types. Science. 2014;343:776–9. and differential expression analysis for single cell RNA sequencing. zenodo. 22. Islam S, Kjallquist U, Moliner A, Zajac P, Fan JB, Lonnerberg P, et al. https://doi.org/10.5281/zenodo.1225670. Characterization of the single-cell transcriptional landscape by highly 50. Chen W, Li Y, Easton J, Finkelstein D, Wu G, Chen X. UMI-count modeling and multiplex RNA-seq. Genome Res. 2011;21:1160–7. differential expression analysis for single cell RNA sequencing. Bitbucket. 23. Scialdone A, Natarajan KN, Saraiva LR, Proserpio V, Teichmann SA, Stegle O, https://bitbucket.org/Wenan/scrna_qc_de. et al. Computational assignment of cell-cycle stage from single-cell transcriptome data. Methods. 2015;85:54–61. 24. Johnson RA, Wichern DW. Applied multivariate statistical analysis. 3rd ed. Prentice Hall: Englewood Cliffs, NJ; 1992. 25. Lun AT, Bach K, Marioni JC. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 2016;17:75. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Genome Biology Springer Journals

UMI-count modeling and differential expression analysis for single-cell RNA sequencing

Free
17 pages

Loading next page...
 
/lp/springer_journal/umi-count-modeling-and-differential-expression-analysis-for-single-yjjib77jxZ
Publisher
BioMed Central
Copyright
Copyright © 2018 by The Author(s).
Subject
Life Sciences; Animal Genetics and Genomics; Human Genetics; Plant Genetics and Genomics; Microbial Genetics and Genomics; Bioinformatics; Evolutionary Biology
eISSN
1474-760X
D.O.I.
10.1186/s13059-018-1438-9
Publisher site
See Article on Publisher Site

Abstract

Read counting and unique molecular identifier (UMI) counting are the principal gene expression quantification schemes used in single-cell RNA-sequencing (scRNA-seq) analysis. By using multiple scRNA-seq datasets, we reveal distinct distribution differences between these schemes and conclude that the negative binomial model is a good approximation for UMI counts, even in heterogeneous populations. We further propose a novel differential expression analysis algorithm based on a negative binomial model with independent dispersions in each group (NBID). Our results show that this properly controls the FDR and achieves better power for UMI counts when compared to other recently developed packages for scRNA-seq analysis. Keywords: Unique molecular identifier, Negative binomial, Differential expression analysis Background missing value whereby the expression of a gene is detected Single-cell RNA-sequencing (scRNA-seq) technology pro- at a moderate or high level in a subset of cells but is not vides transcriptome profiles of individual cells, enabling detected in other cells [5]. the dissection of the heterogeneity of different cell popula- Read counts and transcript counts are two categories tions and tissues [1]. Although scRNA-seq protocols share of quantification schemes commonly employed in common principles of single-cell isolation, cell lysis, tran- scRNA-seq. Although the read count-based scheme is script capture, complementary DNA (cDNA) conversion similar to the common approaches used for bulk and amplification, library preparation, and sequencing, the RNA-seq, the miniscule quantity of transcripts captured methodologies differ. Multiple methods for transcript from a single cell requires cDNA amplification for quantification with differing levels of accuracy and sensi- library construction; this inevitably results in large amp- tivity have been employed in scRNA-seq analysis [2]. lification bias [6]. To mitigate this bias, several recent However, the paucity of starting material for reverse tran- scRNA-seq protocols have employed an additional step scription remains an inherent limitation of scRNA-seq in which individual transcripts are barcoded with unique protocols and contributes to the relatively low rate at molecular identifiers (UMIs) before amplification, result- which messenger RNA (mRNA) molecules in individual ing in a more accurate quantification of the transcript cells are converted to cDNA molecules that can be cap- count [7, 8]. tured and sequenced [3, 4]. Coupled with the stochastic Although the fast-evolving experimental protocols for nature of gene expression, scRNA-seq protocols generally scRNA-seq have given rise to numerous studies employ- produce single-cell transcriptome measurements with low ing scRNA-seq techniques, statistical characterizations of signal-to-noise ratios, exemplified by the high abundance scRNA-seq data continue to lag. Most published studies of zeroes in the expression matrix and so-called dropout have focused primarily on either read counts [5, 9, 10]or events. In this context, dropout refers to a special type of UMI counts [3, 7]. Although a few studies that used both read-count and UMI-count schemes have suggested that employing UMIs in expression measurement globally * Correspondence: xiang.chen@stjude.org reduces the technical noises and that the data generally fit Department of Computational Biology, St. Jude Children’s Research Hospital, 262 Danny Thomas Pl, Memphis, TN 38105, USA into simpler statistical models compared to read counts Full list of author information is available at the end of the article © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Chen et al. Genome Biology (2018) 19:70 Page 2 of 17 [3, 11–13], a popular perspective held by the field is that application of NBID in biomarker identification after dropout events result in more zeroes than expected unsupervised clustering of scRNA-seq data. in scRNA-seq data and these events need to be expli- citly modeled using zero inflated/bimodality models Results [5, 10, 14–17]. This study investigated the necessity Model comparison for UMI counts and read counts and effectiveness of zero-inflated models in modeling the We used a unique dataset produced by Ziegenhain et al. UMI-count distribution among cells by directly comparing [12] to determine the difference between read counts the statistical modeling of UMI counts and read counts. and UMI counts. A homogeneous population of mouse A closely related application of scRNA-seq count model- embryonic stem cells was derived by two inhibitors/ ing is single-cell differential expression (DE) analysis. leukemia inhibitory factors and used to evaluate six Several software packages have been developed specifically different scRNA-seq protocols, including four UMI for scRNA-seq DE analysis, such as SCDE [5], MAST [9], count-based protocols and two read count-based proto- ROTS [18], Monocle2 [11], and Seurat [19]. However, there cols. Furthermore, the read counts before conversion to have been no systematic evaluations of these methods with UMIs were also evaluated for the four UMI based proto- respect to UMI count-based scRNA-seq data. cols, which provided an excellent opportunity to exam- In this study, we first conducted a comprehensive ana- ine the differences between the UMI count and read lysis of the modeling UMI counts and read counts in count for the same data [12]. We first examined scatter scRNA-seq data. Based on that analysis, we proposed a plots for cell pairs with similar total read/UMI counts. method using the Negative Binomial model with Inde- Figure 1 shows the representative pattern for two cells. pendent Dispersions (NBID) and compared its false dis- This general pattern holds for most cell pairs in the covery rate (FDR) control and power to those of other dataset. A large density mass was focused on at (0, 0) commonly used methods. We also illustrate a practical (the origin point) in all three results (Fig. 1a, c, and e), Smart−Seq2 read count CEL−Seq2/C1 read count CEL−Seq2/C1 UMI count a c e 01 2345 012345 01 2345 cell 1: log10 count cell 1: log10 count cell 1: log10 count b d f x−axis y−axis 12345678 9 count 012345 01 2345 012345 012345 012345 012345 log10 count log10 count log10 count Fig. 1 Scatter plots of two cells with similar read counts or UMI counts. a, b Read counts for Smart−Seq2. c, d Read counts for CEL − Seq2/C1. e, f UMI counts for CEL − Seq2/C1. a, c, e The scatter plot with color-coded density, the highest density at the origin. The left and middle panels,which are based on the read counts, show very different patterns from the right panel, which is based on the UMI counts. b, d, f The density plot along the x- and y-axes of (a), (c), and (e), excluding the origin. For all plots, we kept the genes that were detected in at least five cells among all cells Density cell 2: log10 count 0.0 0.0 0.5 0.5 1.0 1.0 1.5 1.5 012345 Density cell 2: log10 count 0.0 0.0 0.5 0.5 1.0 1.0 1.5 1.5 03 12 4 5 Density cell 2: log10 count 02468 02468 1100 0 12345 proportion 0.0 0.2 0.4 0.6 Chen et al. Genome Biology (2018) 19:70 Page 3 of 17 which was consistent with the notion that transcripts for different distributions. We employed a backward selection most genes were not captured by either cell in strategy on three candidate models that are commonly scRNA-seq protocols (supported by abundance of zeros used in scRNA-seq studies [3, 5, 7]: a Poisson model with in the scRNA-seq data) [10]. Read-count measurements one parameter defining both the mean and the variance; a produced results (Fig. 1a–d) similar to those of Kharchenko NB model with two parameters defining the mean and et al. [5], which were used to illustrate dropout events. The variance; and a ZINB model with three parameters, of overall UMI-count measurements (Fig. 1e and f)showed which two were the same as those in the NB model. The less divergence when compared to their read-count coun- additional parameter in the ZINB model defines the prob- terparts in the same cell pair (Fig. 1c and d). Specifically, ability of a count being zero or being distributed as an NB quantifications for genes with dropout events (i.e. when distribution. These three models are nested with increas- transcripts/reads were captured in one cell in the pair but ing complexity, i.e. the Poisson model is a special case of not the other) showed a distinct bi-modal pattern in the the NB model and the NB model is a special case of the read counts (Fig. 1b and d) but a unimodal distribution in ZINB model. Our goal was to decide on the proper the UMI counts (Fig. 1f). Consequently, it is prudent to complexity for fitting the scRNA-seq data. We started by model gene expression by using a zero-inflated model (i.e. a testing whether the ZINB model was significantly better zero-inflated negative binomial [ZINB] model [5]orhurdle than the NB model for modeling the counts. Among those model [9]) for the read counts: one component for the zero genes that did not reject the NB model, we further tested counts and the other component for the non-zero counts. whether that model was significantly better than the Pois- However, the fast attenuation of density along the axes sug- son model. The model selection results are summarized in gests that a unimodal distribution (e.g. a Poisson or nega- Table 1 (see also Additional file 1:Figure S1).We analyzed tive binomial [NB] distribution) may be sufficient for UMI both the UMI counts and read counts (before UMI con- counts. version) form four UMI-based protocols. Although no To further capture the quantitative difference between genes measured in UMI counts preferred the ZINB model read counts and UMI counts, we modeled them with over the NB model at an FDR level of 0.05, the results for Table 1 Number of genes with selected models for different protocols from Ziegenhain et al. [12] NB vs ZINB Poisson vs NB Protocol Cells used (n) Genes detected (n) Genes tested (n) Genes converged (n) ZINB NB Poisson Poisson (%) UMI count CEL-Seq2/C1(A) 34 16,690 11,345 11,345 0 2968 8377 73.84 CEL-Seq2/C1(B) 37 17,229 12,190 12,189 0 3794 8395 68.87 Drop-Seq(A) 42 16,579 10,702 10,702 0 4277 6425 60.04 Drop-Seq(B) 34 15,469 9288 9288 0 3659 5629 60.61 MARS-Seq(A) 29 14,551 8266 8266 0 4592 3674 44.45 MARS-Seq(B) 36 15,406 9644 9644 0 5848 3796 39.36 SCRB-Seq(A) 39 16,411 12,955 12,955 0 1214 11,741 90.63 SCRB-Seq(B) 45 16,944 13,212 13,212 0 2115 11,097 83.99 CEL-Seq2/C1(A) 34 16,690 11,345 10,679 3679 6385 615 5.76 CEL-Seq2/C1(B) 37 17,229 12,190 12,155 1174 10,443 538 4.43 Drop-Seq(A) 42 16,579 10,702 10,690 121 9601 968 9.06 Drop-Seq(B) 34 15,469 9288 9278 91 8329 858 9.25 MARS-Seq(A) 29 14,551 8266 8132 761 7161 210 2.58 MARS-Seq(B) 36 15,406 9644 9531 1333 7974 224 2.35 SCRB-Seq(A) 39 16,411 12,955 12,954 0 11,814 1140 8.80 SCRB-Seq(B) 45 16,944 13,212 13,212 0 11,964 1248 9.45 Read count Smart-Seq2(A) 80 21,076 15,294 15,098 7905 5795 1398 9.26 Smart-Seq2(B) 77 20,861 15,224 15,152 6456 7244 1452 9.58 Smart-Seq/C1(A) 69 19,699 13,518 13,513 16 12,761 736 5.45 Smart-Seq/C1(B) 61 19,100 12,949 12,947 0 11,888 1059 8.18 Chen et al. Genome Biology (2018) 19:70 Page 4 of 17 CEL-Seq2/C1 and MARS-Seq showed a significant per- reported by Ziegenhain et al. [12] (Table 2). At an FDR centage of genes (9.4–34.5%) rejecting the NB model in level of 0.05, only 0.1% (range = 0–0.4%) of converged favor of the ZINB model when measured in read counts. genes rejected the NB model for UMI counts. This per- Moreover, for UMI counts, a large proportion of genes centage was significantly increased to 14.2% (range = (39.4–84.0%) selected the simple Poisson distribution. By 1.1–35.3%, p = 0.0078, the Wilcoxon signed rank test) contrast, read-count measurements resulted in a sharp for read counts from the same datasets, indicating that a drop in the proportions of Poisson models (2.4–9.5%, high-level noise was introduced by cDNA amplification. p = 0.0078, the Wilcoxon signed rank test) across all We further examined the proportion of genes that could platforms evaluated. Read-count only protocols (Smart-- be modeled by a Poisson model. As expected, the percent- Seq and Smart-Seq2) show comparable patterns to the age of genes with an adequate Poisson fit (FDR > 0.05) read counts from UMI protocols. Overall, our analysis im- dropped sharply from 80.2% (range = 65.7–95.1%) for plies that while ZINB is necessary for a significant fraction UMI counts to 2.6% (range = 1.0–4.1%, p = 0.0078, the of read counts, it is not needed for UMI counts. Wilcoxon signed rank test) for read counts measured in the same datasets. The goodness of fit of both the Poisson Negative binomial model for UMI counts and NB models supports the conclusion that UMI counts A model selection strategy always selects a “best” model can be modeled by simpler models when compared to among the specified candidates even though the chosen read counts. model may fit the underlying data poorly. Therefore, we evaluated the goodness of fit for these selected models. Modeling and goodness of fit for UMI counts in large scale Because a Poisson model can be modeled as a special scRNA-seq datasets scenario of the NB model, we began by measuring the Although the datasets of Ziegenhain et al. [12] provided goodness of fit of the NB model for various datasets an unparalleled opportunity to evaluate the difference Table 2 Goodness of fit test for the Poisson and NB models for different protocols from Ziegenhain et al. [12] Protocol Cells Used (n) Genes tested (n) Genes reject Poisson (n) Genes reject NB (n) Accept Poisson (%) Reject NB (%) UMI count CEL-Seq2/C1(A) 30 3357 660 1 80.34 0.03 CEL-Seq2/C1(B) 33 5601 1082 3 80.68 0.05 Drop-Seq(A) 37 2311 548 2 76.29 0.09 Drop-Seq(B) 30 1690 414 0 75.50 0.00 MARS-Seq(A) 26 1162 317 2 72.72 0.17 MARS-Seq(B) 32 2184 750 8 65.66 0.37 SCRB-Seq(A) 35 4218 214 1 94.93 0.02 SCRB-Seq(B) 40 4360 213 0 95.11 0.00 Read count before converting to UMI CEL-Seq2/C1(A) 30 6012 5954 622 0.96 10.35 CEL-Seq2/C1(B) 33 7993 7897 90 1.20 1.13 Drop-Seq(A) 37 4574 4386 430 4.11 9.40 Drop-Seq(B) 30 2867 2781 512 3.00 17.86 MARS-Seq(A) 26 2830 2743 152 3.07 5.37 MARS-Seq(B) 32 4248 4168 265 1.88 6.24 SCRB-Seq(A) 35 7392 7194 2065 2.68 27.94 SCRB-Seq(B) 40 7112 6855 2507 3.61 35.25 Read count Smart-Seq2(A) 72 10,880 10,692 2696 1.73 24.78 Smart-Seq2(B) 69 10,684 10,469 1790 2.01 16.75 Smart-Seq/C1(A) 62 9342 9249 87 1.00 0.93 Smart-Seq/C1(B) 55 7990 7893 75 1.21 0.94 A model is rejected if FDR < 0.05 among all genes tested; otherwise it is accepted NB negative binomial Chen et al. Genome Biology (2018) 19:70 Page 5 of 17 between read counts and UMI counts, the number of statistically significant lack of fit, even though the departure cells captured was relatively small (range = 29–80). We from the specified distributions may be very small and un- extended our analysis to additional datasets generated by important [24]. Therefore, we compared the empirical different platforms [7, 20–23] to evaluate whether the probability mass function (pmf) and the cumulative distri- same pattern generally held for other datasets. Despite bution function (cdf) with the fitted negative binomial technical differences among protocols and heterogeneity model to evaluate visually the difference between them for within cell populations, overall, the model selection and genes rejecting the NB model (Fig. 2, Additional file 1: goodness-of-fit analysis for these datasets supported our Figures S3 and S4). Even though these genes rejected the conclusion that UMI counts can be modeled by simpler NB model at an FDR level of 0.05, the fitted pmf and cdf models when compared to read counts (Additional file 2: curves were good approximations of their empirical coun- Tables S1A and S1B). terparts. Importantly, among the 23 to 282 genes that Since 2016, several Drop-seq UMI based platforms have rejected the NB model, only few (3–17) were ad- appeared with the capability to process thousands of cells equately approximated by the ZINB model (Additional in a single experiment [2, 8]. Consequently, we studied file 2: Table S3). Therefore, we conclude that the NB whether the same pattern held for such large-scale data- model is a good approximation model for UMI counts, sets. We applied the described model-selection strategy even for large-scale scRNA-seq data with evidence of and goodness-of-fit test to the following datasets: (1) CD4 heterogeneity. + naïve T cells (9850 cells); and (2) CD4+ memory T cells (9578 cells), both of which were generated on the Gem- scRNA-seq differential expression analysis Code platform (10× Genomics, Pleasanton, CA, USA) [8], A direct consequence of properly modeling scRNA-seq and 3) Rh41 cells, a human PAX3-FOXO1 positive alveolar counts is the power to accurately conduct differential rhabdomyosarcoma (ARMS) cell line (6875 cells) prepared expression analyses. Based on the knowledge derived in-house on the Chromium platform (10× Genomics). from UMI-count modeling, we proposed a NB-based al- Rh41 cells contained two distinct subpopulations based gorithm for differential expression analysis of large-scale on unsupervised clustering analysis (Additional file 1: UMI-based scRNA-seq data. We extended the general Figure S2) and were included to evaluate the effects of NB-based models by allowing independent dispersion strong heterogeneity on model selection and fitting parameters in each biological condition, resulting in the (Table 3). Although few genes (4–7, 0.04–0.06%) preferred NBID method. This approach is analogous to the t-test, the ZINB model in the relatively homogeneous T-cell pop- which allows different variances between groups when ulations, the percentage of genes selecting the ZINB testing the equivalence of means. The rationale stems model in Rh41 cells was slightly elevated, albeit still low from the apparent variations in dispersion even at the (39 genes, 0.21%). The expression of these genes differed same average expression level [3, 7]. Because the number significantly between the two clusters (FDR < 0.05, the of cells in each condition is generally sufficient in Wilcoxon rank sum test; see also Additional file 2:Table large-scale datasets, we derive separate dispersion esti- S2), suggesting that the fraction of genes preferring the mates for each condition; these are used in the subse- ZINB model correlates with the level of heterogeneity. quent NB-based test against the null hypothesis that Compared to the datasets of Ziegenhain et al. [12], the T different conditions have the same average expression. and Rh41 datasets displayed a lack of statistical fit for sim- We compared the proposed method with other com- pler models (Table 4). Specifically, the genes modeled by monly used methods (Additional file 2: Table S4): the Poisson model dropped to 61.1% (range = 51.9–67.6%) Monocle2 [11]; SCDE [5]; ROTS [18]; MAST [9]; and and the percentage of genes that rejected the NB model Seurat [19]. Although both SCDE and MAST were increased to 5.3% (range = 3.4–8.4%). In addition to ele- developed for read counts, their authors claim that they vated heterogeneity in the Rh41 cells, the sample size of can be applied to UMI data. To handle the apparent these datasets (range = 6875–9850) also played an import- zero inflation, SCDE employs a mixture of a NB model ant role in the increased lack of model fitness. It has been and a Poisson model, while MAST uses a hurdle model documented that very large samples invariably produce with the non-zero component modeled with a Gaussian Table 3 Number of genes with selected models for large-scale datasets on the GemCode and Chromium platforms NB vs ZINB Poisson vs NB Data Cells used (n) Genes detected (n) Genes tested (n) Genes converged (n) ZINB NB Poisson Poisson (%) Naive T cells (Gemcode) 9850 32,738 11,978 11,977 7 5336 6634 55.39 Memory T cells (Gemcode) 9578 32,738 12,569 12,567 4 6336 6227 49.55 Rh41 (Chromium) 6875 33,416 18,435 18,435 39 9387 9009 48.87 Chen et al. Genome Biology (2018) 19:70 Page 6 of 17 Table 4 Goodness of fit test for the Poisson and NB models for large-scale datasets on the GemCode and Chromium platforms Data Cells used (n) Genes tested (n) Genes reject Poisson (n) Genes reject NB (n) Accept Poisson (%) Reject NB (%) Naive T cells (Gemcode) 7332 776 403 65 48.07 8.38 Memory T cells (Gemcode) 8622 836 533 28 36.24 3.35 Rh41 (Chromium) 6187 6853 4630 295 32.44 4.30 A model is rejected if FDR < 0.05 among all genes tested; otherwise it is accepted NB negative binomial The 1st gene wit h (p −value=0.0405) The 1st gene with (p−value=0.0405) FDR>0.2 FDR>0.2 ab Empirical Theoretical Empirical Theoretical 0246 0246 cd The 1st gene with FDR<.05 (p−value=0.004) The 1st gene with FDR<.05 (p−value=0.004) Empirical Theoretical Empirical Theoretical The gene with the worst FDR (p-value=0) The gene with the worst FDR (p-value=0) ef Empirical Theoretical Empirical Theoretical 010 20 30 40 010 20 30 40 Fig. 2 Goodness of fit using the negative binomial distribution on the naïve T-cell data (Tn). a The empirical and theoretical probability mass function (pmf) for the first gene with FDR > 0.2. b The empirical and theoretical cumulative distribution function (cdf) for the first gene with FDR > 0.2. c, d The same pmf and cdf plots for the first gene with FDR < 0.05. e, f The same pmf and cdf plots for the gene with the worst FDR Density Density Density 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.0 0.2 0.4 0.6 0.8 0.0 0.2 0.4 0.6 0.8 CDF CDF CDF 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Chen et al. Genome Biology (2018) 19:70 Page 7 of 17 distribution. Monocle2 [11] and Seurat [19] provide UMIs retained). Because NBID assumes a sufficiently large NB-based differential expression analysis (among other number of cells in each group, we evaluate its robustness models) for UMI counts. We also included ROTS based in common scenarios for scRNA-Seq experiments with on a recent comparison of scRNA-seq differential expres- different number of cells (60, 300, or 1000 cells, approxi- sion analysis [18]. Recently, several scRNA-seq-tailored mating samples collected from 96-well plates, 384-well normalization schemes have been proposed [16]. We plates, and by droplet methods, respectively). evaluate their contributions by integrating NBID with In extensive simulations of 300 or 1000 cells (two scran, a state-of-the-art generic normalization method groups combined) with different simulated fold change [25], by using the cell-specific size factor estimated by from multiple datasets obtained by different protocols, scran in NBID (NBID_scran). Monocle2, SCDE, and Seurat consistently inflated the FDR and the number of false positives increased with FDR and power comparison for differential expression the level of expected UMI difference between groups analysis of UMI-based scRNA-seq data (Table 5, Additional file 2: Tables S5–S8). Similar to the We first evaluated the FDR control for all methods by comparisons result derived from read counts [27], SCDE using simulated data. Instead of generating artificial generally detected fewer DE genes compared to other datasets from a theoretical distribution, we simulated methods in UMI-count scRNA-seq data. However, it groups of cells with differentially expressed genes from often produced relatively high number of false positives, publicly available datasets collected from different which resulted in severely inflated FDRs in the simula- protocols (data from memory T cells obtained by Gem- tions. Due to the severely inflated FDRs (with or without Code [8], from whole-intestinal organoids obtained by expected group difference in the total UMI counts) in CEL-Seq [20], and from heterogeneous dendritic cells many scenarios, both SCDE and Seurat were excluded obtained by MARS-Seq [21]). We began with randomly from subsequent analyses. ROTS controlled FDR with- generating two distinct groups of cells by swapping the out UMI difference, but severely inflated the FDR in UMI counts for two sets of genes in the second group. scenarios with expected group differences in UMIs. Here, the first group represented cells collected under a While MAST controlled the FDR without and with mild reference condition and the second group contained group difference in UMIs, it also shows inflated the FDR cells under the testing condition with simulated differen- with intermediate differences in total UMIs. NBID and tial expressions. The two equal-sized sets of genes had NBID_scran were the only methods to achieve proper different average expression levels in the full dataset FDR control under all three scenarios (Table 5). before swapping. This strategy generated artificially In the simulation of 60 cells combined, all the methods separated groups of cells while retaining specific charac- except for MAST yielded various degrees of FDR infla- teristics of the scRNA-seq counts for each cell. The dis- tion (Additional file 2: Table S9), indicating that > 60 tribution of the total number of UMIs captured in a cell cells are needed for a robust DE analysis in scRNA-seq. is an important characteristic for UMI-based scRNA-seq Nevertheless, when we focused on genes with high ex- experiments. Although biological differences (such as pression (with TPM ≥ 50 in at least one group, similar to the physical cell size, proliferation status, and cell-cycle the threshold employed in reference [28]), NBID and stages) may affect the absolute number of transcripts in NBID_scran approached the desired FDR control in all the cells, technical (non-biological) variations, such as three scenarios. the cell-to-cell variations in the conversion factor be- We used precision-recall curves to evaluate the power tween transcripts and captured UMIs and variations in of the methods (Fig. 3 and Additional file 1: Figures S5– the sequencing depth, have substantial influence on the S9). Measured by the area under the curve (AUC), NBID number of UMIs captured for each cell. Moreover, the and NBID_scran robustly outperformed other methods effects of total UMI variations are disproportionally in different simulation scenarios. Although ROTS had a biased towards the gene with lower expression [17, 26] slight edge without group difference in UMIs, it was and the disparity in the number of UMIs is further exac- highly sensitive to the group difference: even a mild dif- erbated in scenarios in which two groups of cells being ference dropped the AUC to 0 in simulations with 1000 compared are captured and sequenced separately. To cells. NBID and NBID_scran achieved similar results, evaluate the robustness of performance against the com- suggesting that the total UMI count is a good estimator monly observed difference in the total UMIs captured for the cell-specific size factor in the simulations. per cell, we simulated three scenarios in terms of ex- pected group difference in the total UMIs captured by Differential expression analysis of naive T cells and memory sub-sampling UMIs in the second group of cells: no dif- Tcells ference (100% UMIs retained); mild difference (80–90% We evaluated three algorithms (MAST, ROTS, and NBID) UMIs retained); and intermediate difference (50–60% for their ability to identify DE genes in naïve T cells and Chen et al. Genome Biology (2018) 19:70 Page 8 of 17 Table 5 FDRs of evaluated methods a b c 1 0.8–0.9 0.5–0.6 Method FDR False (n) DE (n) FDR False (n) DE (n) FDR False (n) DE (n) Monocle2 0.069 5.9 83.9 0.089 7 79.1 0.276 22.1 79 SCDE 0.299 2.6 8.3 0.34 3.7 9.5 0.848 123.5 145.2 MAST 0.001 0 29.5 0.003 0.1 28.2 0.193 3.4 19.5 ROTS 0.045 4.4 97.5 0.497 71.6 145.9 0.835 272.4 323.9 Seurat_ttest 0.244 31.5 128.3 0.441 69.6 156.1 0.927 653.6 704.5 Seurat_bimod 0.154 17.6 112.5 0.655 172 258.7 0.928 924.5 996.3 Seurat_tobit 0.248 32.2 129 0.45 72 158.5 0.873 351.7 402.8 Seurat_poisson 0.208 25.7 122.6 0.188 20.1 106.1 0.573 67.4 116.2 Seurat_negbinom 0.197 23.9 120.7 0.164 16.9 102.4 0.5 47.7 93.8 NBID_scran 0.038 3.4 86.9 0.035 2.8 80.1 0.039 2.5 62.4 NBID 0.033 2.8 85.7 0.032 2.7 81.4 0.03 1.8 58.9 No sub-sampling The sub-sampling ratio in Group 2 was 0.8–0.9 The sub-sampling ratio in Group 2 was 0.5–0.6. Bold values indicate FDR > 0.05. Bold and underlined values indicate FDR > 0.1. The nominal FDR was 0.05. Simulation based on the Memory T-cell data [8], 500 cells in each group, results are averaged over 96 replicates (see Additional file 2: Tables S5–S9 for results for other simulation scenarios). NBID_scran used the size factor computed by scran as the offset instead of the total UMI counts memory T cells (Fig. 4 and Additional file 2:Tables S10 three genes missed by NBID (LY96, STAM,and TOX)had and S11). NBID detected more DE genes than did MAST very low expression in the dataset (average UMI count or ROTS (Fig. 4a), consistent with the simulation results from the large group: 0.002, 0.007, and 0.007, and TPM: showing better power with NBID. Because the true DE 2.7, 9.4 and 10.7, with an average of approximately 850 genes in the groups were unknown, we compared the in- UMIs being captured per cell), leading to insufficient de- ferred DE genes against a published list of DE genes for tection power for DE genes. Consequently, none of the naïve T cells and memory T cells (Table 1 in reference evaluated algorithms classified the three genes as DE [29]). Since both T-cell datasets were derived from the genes. Additional file 1: Figure S10 shows density plots of CD4+ population [8], CD8+ specific genes were ignored. selected genes. Of the 37 true positives, NBID, MAST, and ROTS recov- We carried out additional in-silico validation of pre- ered 34 (92%), 30 (81%), and 24 (65%), respectively. The dicted DE genes by NBID and MAST. We assumed that No UMI differences Mild UMI differences c Intermediate UMI differences Precision−Recall − P: 9700, N: 1532309 Precision−Recall − P: 9700, N: 1532309 Precision−Recall − P: 10000, N: 1579700 Monocle2 (AUC 0.725) Monocle2 (AUC 0.621) Monocle2 (AUC 0.36) MAST (AUC 0.72) MAST (AUC 0.69) MAST (AUC 0.018) ROTS (AUC 0.942) ROTS (AUC 0) ROTS (AUC 0) NBID_scran (AUC 0.876) NBID_scran (AUC 0.847) NBID_scran (AUC 0.683) NBID (AUC 0.879) NBID (AUC 0.848) NBID (AUC 0.687) 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Recall Recall Recall Fig. 3 Precision-recall curves for selected methods. a The precision-recall curve without UMI differences between two groups. b The precision- recall curve with mild UMI differences between two groups. c The precision-recall curve with intermediate UMI differences between two groups. For each scenario, some methods failed to run on few replicates but at least 97 replicates were used to calculate the precision and recall rate. P true DE genes, N true non-DE genes Precision 0.80 0.85 0.90 0.95 1.00 Precision 0.80 0.85 0.90 0.95 1.00 Precision 0.80 0.85 0.90 0.95 1.00 Chen et al. Genome Biology (2018) 19:70 Page 9 of 17 a Precision−Recall − P: 28740, N: 265660 NBID (AUC 0.514) MAST (AUC 0.461) 0.0 0.2 0.4 0.6 0.8 1.0 Recall c Precision−Recall − P: 28740, N: 265660 Precision−Recall − P: 28740, N: 265660 NBID (AUC 0.724) MAST (AUC 0.601) NBID (AUC 0.929) MAST (AUC 0.834) 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Recall Recall Fig. 4 Comparison of detected genes in naïve T cells and memory T cells. a The Venn diagram of DE genes detected by NBID, ROTS, and MAST. b The precision-recall curves obtained using a subset of 1000 cells of each cell type. For the power calculation, we chose the DE genes detected in both NBID and MAST as the true DE genes and the genes not detected as DE genes in either NBID or MAST as the true non-DE genes. c, d The same as (b) except using subsets of 2000 and 5000 cells, respectively the DE genes detected by both algorithms were true posi- Differential expression analysis for biomarker identifications tive and that genes not detected as DE genes by either scRNA-seq has been widely used to reveal the subpopu- algorithm were true negatives. We then randomly sam- lation structure in heterogeneous cell populations pled subsets of cells from each population (1000, 2000, or through unsupervised clustering approaches [1]. Differ- 5000 cells) and compared the recovery of these genes in ential expression analysis of identified cell subpopula- ten subsampled replicates. NBID outperformed MAST, tions can further characterize their functional differences having a higher AUC in all three settings (Fig. 4b–d). and identify potential biomarkers for experimental valid- In this real-data analysis, NBID_scran again achieved ation and subpopulation separation. Consequently, we similar results as NBID. Specifically, 5728 DE genes were applied NBID and MAST to detect DE genes in the two detected by both methods, accounting for 95.5% and 94.3% subpopulations inferred to be present in Rh41 cells. of all DE genes by NBID (5997) and NBID_scran (6076), Among the expressed genes (with TPM ≥ 3 in at least respectively. Together with additional evidences from simu- one group [30]), NBID and MAST revealed 1019 and lation studies, we conclude that the default normalization 448 DE genes between the two clusters with a fold scheme employed by NBID generally achieved comparable change > 2 between the two clusters (Additional file 2: performance with scran, a state-of-the-art normalization Tables S12, S13), respectively. We ranked the potential scheme. of DE genes to be robust biomarkers based on the test Precision 0.80 0.85 0.90 0.95 1.00 Precision Precision 0.80 0.85 0.90 0.95 1.00 0.80 0.85 0.90 0.95 1.00 Chen et al. Genome Biology (2018) 19:70 Page 10 of 17 FDR values, their relative fold changes, and their overall Evaluation and control of batch effects expression levels. The CD44 gene, which encodes a Differential gene expression analysis of scRNA-seq data fre- commonly used cell surface marker, appeared at the top quently involves data generated in separate batches (e.g. in of the list (Fig. 5a). FACS sorting confirmed the presence different lanes or plates in single-cell library construction). of two subpopulations with different CD44 protein levels This can introduce batch effects (systematic inter-group high low (CD44 and CD44 ) in Rh41 cells (Fig. 5b). Being technical variations that are not relevant to the biological both a receptor for extracellular matrix components and hypothesis being evaluated), which pose a major challenge a co-factor for growth factors and cytokines, CD44 is a in high-throughput data analyses [34]. Controlling the well-established cancer stem cell marker with great batch effects is, therefore, important in order to distinguish prognostic and therapeutic potentials [31, 32]. true biological differences from technical artifacts [26, 35]. We performed three replicates of FACS sorting on Rh41 We evaluated batch effects in the two replicates of the four high low cells and collected both CD44 and CD44 subpopula- UMI-based protocols used by Ziegenhain et al. [12](Fig. 6). tions for bulk RNA-seq. Of the 1019 DE genes identified Although various numbers of DE genes (596–5156, Fig. by NBID in the inferred clusters in scRNA-seq, 699 6a–d) were detected by these protocols, only seven were (68.6%) were also detected as DE genes with the same common across all protocols (Fig. 6e), consistent with the direction in the two subpopulations from the bulk hypothesis that most apparent DE genes were the result of RNA-seq analysis, thus validating CD44 as a cell-surface technical noise. Among the four protocols, CEL-Seq2 and marker that could be used to separate the two endogenous SCRB-Seq had relatively stronger batch effects when com- Rh41 subpopulations. Although MAST identified fewer pared to DROP-Seq and MARS-Seq; these stronger effects (448) DE genes, a lower percentage of DE genes (226, were potentially associated with the higher UMIs captured 50.4%) were validated in the bulk DE analysis (Additional per cell. file 2: Table S13), which demonstrated the superior accur- All the evaluated methods except for ROTS allow explicit acy and power of NBID in revealing true DE genes. More- modeling of technical variations (such as differences in over, among the four established surrogate molecular cell-cycle stage and batch effects) as covariates. We evalu- markers for fusion status in rhabdomyosarcoma samples, ated the performance of batch-effect removal by simulating namely the upregulation of TFAP2B, MYOG,and NOS1, group differences mixed with apparent differences arising coupled with the repression of HMGA2 in fusion positive from the batch composition, using data generated from ARMS [33], both bulk DE analysis of sorted subpopula- CEL-Seq2 (Table 6,Fig. 7) and SCRB-Seq (Additional file 1: tions and NBID analysis of subpopulations inferred from Figure S11, Additional file 2: Table S14) by Ziegenhain et al. scRNA-seq revealed repression of TFAP2B and MYOG as [12]. Becauseofthe limitedsamplesizeinthesetwodata- high well as upregulation of HMGA2 in the CD44 subpopu- sets, we focused on highly expressed genes (with TPM ≥ 50 lation (Additional file 2: Table S12), suggesting that the in at least one group). Without explicitly modeling the high CD44 subpopulation represents a less differentiated, batch effects, all methods showed various levels of FDR stem-like cell subpopulation. The exact mechanism by inflation. Most of the tested methods (except MAST) which the distinct subpopulations develop warrants fur- reduced the FDR after modeling batch information as a ther investigation. covariate. NBID outperformed Monocle2 and MAST by CD44 cluster1 cluster2 Fig. 5 Differential expression of CD44 in two clusters in the Rh41 cell line. a Violin plot of the gene expression among cells in the two clusters, the TPM is in log10 scale after adding a small value 1. b The CD44 count distribution when using CD44 to sort single cells, indicating two clusters of cells with different levels of CD44 expression log10TPM 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Chen et al. Genome Biology (2018) 19:70 Page 11 of 17 CEL−Seq2/C1 Drop−Seq FDR<0.05 (3627) FDR<0.05 (596) FDR >= 0.05 (14979) FDR >= 0.05 (17439) −1 0 1 2 3 4 5 6 −1 0 1 2 3 4 5 6 log10TPM in the higher group log10TPM in the higher group c MARS−Seq SCRB−Seq FDR<0.05 (763) FDR<0.05 (5156) FDR >= 0.05 (16039) FDR >= 0.05 (13407) −1 0 1 2 3 4 5 6 −1 0 1 2 3 4 5 6 log10TPM in the higher group log10TPM in the higher group MARS−Seq SCRB−Seq CEL−Seq2/C1 Drop−Seq 356 3745 206 136 2255 79 10 260 1041 14 64 18 Fig. 6 Differential expression analysis of two replicates from Ziegenhain et al. [12]. a–d The log2 fold change vs the maximal gene log TPM for the two biological replicates. NBID was used for the differential expression analysis of two replicates of each of four UMI-based protocols. The red dots indicate genes with FDR < 0.05. e Venn diagram of DE genes from four UMI-based protocols demonstrating better recovery of true positives with prop- differences between the read counts and the UMI counts erly controlled FDR. for the scRNA-seq data. Our analysis suggests that, com- pared to read counts, UMI counts can be modeled by a Discussion simpler distribution. Specifically, the NB model is an ad- In the present study, we performed extensive model se- equate model for UMI-count data in the absence of an lection and goodness-of-fit analyses using multiple explicit need to account for dropout events by using scRNA-seq datasets and revealed intrinsic distributional zero-inflated models. Data derived from the Smart-Seq log2FC log2FC −10 −5 0 5 10 −10 −5 0 5 10 log2FC log2FC −10 −5 0 5 10 −10 −50 5 10 Chen et al. Genome Biology (2018) 19:70 Page 12 of 17 Table 6 FDRs with and without controlling batch covariates No filtering TPM ≥ 50 Method FDR False (n) DE (n) FDR False (n) DE (n) Monocle2 0.279 37.7 132.4 0.202 24.3 118.65 Monocle2_plateCov 0.100 10.25 101.3 0.054 5.3 96.2 MAST 0.264 28.75 74.65 0.264 28.75 74.65 MAST_plateCov 0.268 28.65 72.85 0.268 28.65 72.85 ROTS 0.258 36.2 132.1 0.206 26.75 122.1 NBID 0.326 50.35 146.85 0.183 22.9 119.05 NBID_scran 0.328 50.45 146.95 0.200 25.3 121.45 NBID_plateCov 0.124 13.5 108.1 0.048 4.75 99.05 NBID_scran_plateCov 0.124 13.45 108 0.048 4.75 98.9 Simulation based on data: CEL-Seq2 (both batch A and batch B) from Ziegenhain et al. [12]. Sample size was 60 (30 cells in each group). In total, Replicate A had 30 cells and Replicate B had 35 cells after QC. Group 1 had 9 cells from Replicate A and 21 cells from Replicate B. Group 2 had 18 cells from Replicate A and 12 cells from Replicate B. Method names with plateCov indicate adjusting the batch covariates. NBID_scran and NBID_scran_plateCov used the size factor computed by scran as the offset instead of the total UMI counts Bold values indicate FDR > 0.05 Bold and underlined values indicate FDR > 0.1 protocol in reference [12] deviated slightly from other that also explains the differences between UMI-count- and read-based data, with fewer genes preferring the ZINB read-count-based scatter plots (Fig. 1). The PCR amplifica- model and a lower proportion of NB rejection in the tion step produces a sharp contrast between the read goodness-of-fit test. Although the exact cause of the ob- counts and UMI counts. Whereas the UMI counts follow a servation is unknown, our analysis of a different Poisson/NB distribution, the read counts—even with a con- Smart-Seq dataset [23] (Additional file 2: Table S1) re- stant multiplication factor (i.e. with no amplification sulted in a pattern similar to that seen with other read biases)—no longer follow the same Poisson/NB model (see count-based protocols. the “Methods” section for more details). The uneven ampli- Based on the result of our analysis, we propose a hypo- fication bias (i.e. with transcripts being amplified at differ- thetical model linking the UMI counts and read counts (see ent levels) introduces extra deviations from the underneath Additional file 1: Supplementary methods and Figure S12) (simpler) distribution of the UMI counts. Consistent with the hypothesized model, recent studies have shown that in- ferring approximate transcriptcounts fromthe read-count Precision−Recall − P: 2000, N: 370120 data can significantly improve the analysis efficiency [11]. A few published studies have suggested that NB models are often to be preferred for UMI-based scRNA-seq data [3, 13]. Although we reached the same conclusion, we be- lieve that our design controlled potential technical noises and allowed us to draw a stronger and more valuable con- clusion from the extensive evaluation. Grun et al. evalu- ated the technical noise in the read counts and UMI counts in a relatively small dataset (74 cells) generated by CEL-Seq and concluded that, when compared to the nor- mal and log-normal models, a NB model explained the distribution of more genes [3]. However, the captured UMIs were converted to theoretical transcript-counts (based on the estimated conversion rate) before model Monocle2_plateCov (AUC 0.621) fitting. This process could be approximated by the sce- MAST_plateCov (AUC 0) NBID_scran_plateCov (AUC 0.947) nario of amplification without biases in our hypothetical NBID_plateCov (AUC 0.946) model (Additional file 1: Figure S12) and converted 0.0 0.2 0.4 0.6 0.8 1.0 transcript counts (in theory) no longer follow a NB model. Recall Consequently, although a NB model explained the distri- Fig. 7 Precision-recall curves for selected methods on simulated bution of more genes than did the normal and log-normal datasets after adjusting batch variables. P true DE genes, N true models, it only accounted for a small fraction of the 11,555 non-DE genes genes analyzed. Recently, Vieth et al. [13] carried out a Precision 0.80 0.85 0.90 0.95 1.00 Chen et al. Genome Biology (2018) 19:70 Page 13 of 17 study that estimated characteristics in 18 UMI-based and real DE genes when compared to previously developed 20 read-based scRNA-seq datasets, including those of Zie- methods for single-cell analysis. Even though only pairwise genhain et al. [12], that were extensively evaluated in the analyses were considered in the current study, the general present study. However, the evaluation of Vieth et al. was form of NBID allows multiple groups to be tested simul- based on modeling read counts and UMI counts collected taneously, as in the generalized linear model framework. in separate experiments, which inevitably introduced un- Differential expression analysis can be used to reveal controlled differences between the experiments. differences among samples run on separate lanes or Our design directly compared read counts (before plates. However, it should be pointed out that batch ef- conversion to UMIs) and corresponding UMI counts fects are expected to overlap with biological differences collected from the same set of cells, enabling us to dir- in this setting. It is better to account for batch effects by ectly evaluate the effects of PCR amplification in statis- proper experimental design, such as by including multiple tical modeling. Moreover, our analysis revealed the biological replicates for each group. Another typical appli- necessity of controlling batch differences. Combining the cation of differential expression analysis is to identify poten- knowledge derived from the extensive modeling with the tial biomarkers for inferred cell subpopulations. One expected large sample size, we proposed NBID, a novel potential caveat in this setting is that cells are usually clus- differential expression analysis algorithm designed for tered from the same data. Therefore, p values or FDR use with UMI-based scRNA-seq data. NBID is based on values derived from the differential expression analysis the negative binomial generalized linear regression might be overly optimistic. However, the result is still useful (GLM) framework, thus shared similarities with those for prioritizing potential biomarkers for further validation. employed in bulk RNA-seq analysis [36, 37]. The major difference compared to those originally proposed in bulk Conclusions RNA-seq analysis is that we allow independent We have conducted an extensive analysis of multiple group-specific dispersions for each gene based on obser- scRNA-seq datasets and have concluded that, unlike vations that genes of the same expression level might read counts, UMI counts can be modeled appropriately have different dispersion parameters [3, 7]. Because of with the negative binomial model. More complex the sample size limitation, algorithms proposed in bulk models, such as zero-inflated negative binomial models, RNA-seq analysis [36, 37] typically pool genes with simi- provide no extra gain. Based on the above conclusion, lar expressions for a robust and smooth estimate of dis- we have proposed a differential expression analysis algo- persions and assume identical dispersion between rithm that allows independent estimations of dispersion groups. NBID exploits the direct benefit of the large for individual genes within each group. Compared to sample size in scRNA-seq, which allows group-specific other recently developed methods, our proposed algo- estimates of dispersion for each gene. This difference is rithm achieves proper FDR control and better power for analogous to the difference between a t-test assuming detecting differentially expressed genes in large-scale equal variance and an unequal variance t-test. Several UMI-count scRNA-seq datasets. studies have shown that an unequal variance t-test per- forms equally well when the underlying group variances Methods are identical but outperforms a t-test assuming equal Model selection and testing variance when the group variance are different [38, 39]. We first checked whether the ZINB model was neces- Although we focused on comparison to algorithms de- sary for the UMI counts. This was done by a statistical signed for scRNA-seq in this study, we believe that it test comparing the NB and ZINB models for each gene will inspire future in-depth evaluations (under various with the null hypothesis that NB fitted the data well. technical scenarios) with additional methods, including The likelihood ratio statistic was used. We used an FDR those originally proposed for bulk RNA-seq analysis. level of 0.05 to control the false positives because of the Technical variations (e.g. batch effects and variations large number of genes tested. For those genes that ac- in the total UMIs captured) are common in scRNA-seq cepted the NB model, we then checked whether the NB experiments; accounting for these variations is critical to model was necessary by testing the Poisson model ver- revealing true biological differences in differential ex- sus the NB model, with an FDR level of 0.05. For both pression analysis. As shown in our simulation, many NB versus ZINB and Poisson versus NB comparisons, scRNA-seq analysis packages yielded inflated FDRs with the parameter being tested was on the boundary, and technical variations (such as small differences in the the log likelihood-ratio test statistic follows an equal total UMIs for the groups), which might result in ele- mixture of 0 mass and a chi-square distribution with 1 vated false positives and/or true positives being masked. degree of freedom under the null hypothesis [40, 41]. In contrast, our analysis indicates that NBID achieved The p value was calculated based on this mixture distri- both proper FDR control and better power in revealing bution, and the FDR was calculated using the Benjamini Chen et al. Genome Biology (2018) 19:70 Page 14 of 17 and Hochberg’s method [42] . To ensure the conver- package fitdistrplus to plot the empirical pmf/cdf versus gence of fitted NB and ZINB models, we keep only those the theoretical ones [45]. genes that satisfy L(Poisson) ≤ L(NB) + δ and L(NB) ≤ L(ZINB) + δ, where L(M) is the log likelihood of fitted Differential expression analysis using the NB model with model M, and we set δ to 0.5 to allow some numerical independent dispersions (NBID) variations in likelihood maximizing. The Poisson model To simplify the notation, we focus here on one gene. Let was fitted using the glm function in R. Two methods us denote the count in cell i by y ;then y  NBðn μ ; ϕ Þ, i g i i were used to fit a NB model and the one with the higher where n is the total number of counts for cell i and μ is i i likelihood was used in the model comparison. The first the proportion of the gene counts in cell i. ϕ is the method was implemented using glm.nb in the R package dispersion for cell i with group label g , for example, g = i i [−8, −7, …,4] MASS [43]. A grid of initial values 10 for θ 0 or 1 for two groups. As in generalized linear models, we (the reciprocal of the dispersion) was tried, and the lar- link the mean proportion to explanatory variables such as gest likelihood was used. The second method was to first group labels; and other potential covariates. Specifically, fit other parameters related to the mean with an initial for two groups, the full model is: dispersion, and then search for the optimal dispersion value to maximize the likelihood given the estimated logðÞ n μ ¼ β þ β g þ γ x ; i i i 0 1 i mean. That method iterated between these two steps where β is the intercept, β is the group effect size in 0 1 until a maximal number of iteration was reached or the the log scale, and γ is a vector of coefficient for the other change in likelihood was small enough. The ZINB model covariate vector x . The likelihood of the observed data was fitted using the function zeroinf from the R package under the full model is pscl [44]. To increase the convergence rate, we first fit- ted a NB model and then used parameters from the NB L ¼ fy jn μ ; ϕ ; model as the initial values. For all model comparison, we i i i g i¼1 restricted the comparisons to genes with at least five non-zero cells among all the cells to ensure meaningful where m is the number of cells and f ðy jn μ ; ϕ Þ is the i i expression pattern. probability of y assuming a NB distribution with mean −1 Γðϕ þyÞ n μ and dispersion ϕ . Specifically, f ðyjμ; ϕÞ¼ i i −1 Goodness-of-fit test g y!Γðϕ Þ −1 We first down-sampled each cell to the 10% quantile of −1 μ y ϕ ð Þ ð Þ : −1 −1 ϕ þμ ϕ þμ the total UMI among all the cells so that the gene-count We note that, here, n serves as a normalization factor values for each gene would be comparable among cells. i or size factor, similar to those used in edgeR [36] and The cells corresponding to the lower 10% quantile were DESeq [37]. Alternatively, NBID can accept size factors not used. The down-sampling was performed by sampling estimated by other methods, such as scran [25]. the transcript without replacement, which follows a multi- We compute the maximum likelihood estimate of the variate hypergeometric distribution. After down-sampling, dispersion parameters ϕ and the coefficients related to only genes with a nonzero count in more than five cells g were kept. Then the count values were assigned to differ- the mean by using the R package nloptr. To test whether ent intervals (bins). First each unique count value itself there is a difference between groups, we also fit the null forms its own bin and the number of cells falling into each model log(n μ )= β + γ x with dispersions estimated i i 0 bin was recorded. Staring from the bin of the largest count from the full model. Finally, a likelihood ratio test is value, bins with no more than five cells were combined used to compare the reduced model and the full model, with next bin. The degree of freedom for the Chi-square which follows a chi-square distribution with one degree goodness-of-fit test is k − p − 1, where k is the number of of freedom. bins and p is the number of parameters of the model used. For example, the degree of freedom for the NB model is k Computing time for NBID in large scale datasets – 3 and that for the Poisson model is k – 2. This proced- NBID took 7.5 h in the analysis of naïve T cells versus ure filters out genes with expression levels that are too memory T cell datasets (9850 and 9578 cells, respect- low. For example, genes with count values of only 0 or 1 ively) on an Intel Xeon processor (E5-2670) running Red (two bins) will not be included for testing. However, for Hat Enterprise Linux 6 operating system and R 3.3.1. these genes, the Poisson or NB model will often result in a very good fit due to the simplicity of the data. In this Methods evaluated study, the maximum likelihood estimate of the model pa- The methods evaluated and additional details are listed rameters were estimated first and then the theoretical in Additional file 2: Table S4; unless specifically stated counts for individual bins were calculated. We used the R otherwise, the default options for each method were Chen et al. Genome Biology (2018) 19:70 Page 15 of 17 used in the evaluation. When there was a need to con- the power for detecting DE genes, we plotted the vert a count value x to the log2 scale, log2(x +1) was precision-recall curve and used the area under the used for the conversion. The FDR was calculated based curve (AUC) as the criterion; this was calculated based on p values by Benjamini and Hochberg’s method [42], on all the applicable replicates. Because only the top except for SCDE and ROTS. For SCDE, an adjusted p genes with relatively small estimated FDRs are of value was used based on the output corrected z-score interest in a real data analysis, we restricted the com- and assuming a standard normal distribution. For ROTS, parison to the region where the precision was above we used the FDR output from the package, which was 0.8, i.e. the region with FDR ≤ 0.2. This approach was calculated based on the bootstrap resampling. more reasonable than using the full range of the precision-recall curve, even though the result patterns Data simulation were often similar. This method is also better than To simulate data for differentially expressed analysis, we using the receiver operating characteristic (ROC) sampled 1000, 300, or 60 cells from the real UMI-count curve as used in some published papers for power matrix from memory T cells obtained by GemCode [8], comparison because the true negative genes are often from heterogeneous dendritic cells obtained by the majority; therefore, only the region with very high MARS-Seq [21], and from whole-intestinal organoids specificity (so that the FDR can be low) is of interest obtained by CEL-Seq [20], respectively. Cells were ran- but the cut-off is not easy to determine with the ROC domly split into two groups. To create differentially curve because the specificity is not directly related to expressed genes, we first ranked genes based on the the FDR. average count in the second group and chose 50 genes starting with the one having an average UMI count just Rh41 single-cell dataset above t. Denoting the fold change by FC, we selected The human alveolar rhabdomyosarcoma cell line, Rh41, another 50 genes starting with the average count just was cultured in a 5% CO incubator in a 75-cm vented above FC × t. We then swapped these two sets of genes flask containing DMEM media supplemented with 10% in their count matrix in the second group. This simula- FBS and 2× glutamine until the cells reached 75% con- tion kept the distribution pattern of the UMI counts un- fluence at approximately 3.6×10 cells. The cells were changed and created differentially expressed genes with detached from the flask with 7 mL of 1× citrate saline to certain fold-change levels. In our simulation, we set FC which 7 mL of DPBS was added followed by centrifuga- and t so that the precision-recall curves (power) were in tion at 300×G for 7 min. The cells pellet was resus- a good range. pended in 300 uL of blocking buffer (Rat IgG/PBS) and To simulate datasets with known batch variables, we incubated on ice for 30 min. A total of 50 uL of the cells sampled different proportions of cells in each replicate in blocking buffer were transferred to a separate tube for to form two groups. Specifically, we sampled nine cells the isotype control. The cells were washed with 1 mL of from replicate A and 21 cells from replicate B to form staining buffer (5% BSA/PBS) and centrifuged at 300×G the reference group, 18 cells from replicate A and 22 for 5 min. The pellet containing approximately 3×10 cells from replicate B to form the other group. We se- cells was incubated with Rat IgG2B anti-CD44-Alexa lected DE genes which were not influenced by the ap- 488 antibody (R&D systems) in staining buffer (15 uL parent difference between replicates. Therefore, when antibody + 135 uL of staining buffer) for 30 min on ice. these true DE genes were detected, it was not due to the For the isotype control ~ 600,000 cells were incubated detection of simulated batch effects. Specifically, we se- with 5 uL of Rat IgG2B-Alexa488 (R&D systems) + 45 lected DE genes with p value > 0.5 from the DE analysis uL of staining buffer for 30 min on ices. After the incu- results between the two replicates. We used two plat- bation, both sets of cells were pelleted and washed with forms in this simulation: CEL-Seq2 and SCRB-Seq, 1 mL of staining buffer as described above and resus- which showed strong batch effects in the DE analysis be- pended in staining buffer, followed by flow cytometric tween two replicates. analysis to identify the fraction of CD44 positive and negative populations. Evaluating FDR and power by using the precision-recall For the single cell experiment, Rh41 cells were cultured curve and harvested and washed in DPBS, as described above, We simulated 100 or 20 replicates for each down-sampling and resuspended in PBS/0.2%BSA at a concentration of setting. The FDR was calculated for each replicate and then 1×10 cells/mL. The 10× Genomics single-cell platform averaged across the replicates to generate the mean FDR. performs 3′ gene expression profiling by poly-A selection Because a few datasets had running problems with selected of mRNA within a single cell, which utilizes a cell barcode competitive methods, replicates on which all methods ran and UMIs for each transcript. Single-cell suspensions were successfully were used in the final analysis. To obtain loaded onto the Chromium Controller according to their Chen et al. Genome Biology (2018) 19:70 Page 16 of 17 respective cell counts to generate approximately 6000 par- were ligated to the fragments followed by 12 cycles of PCR titioned single-cell GEMs (Gel Bead-In-EMulsions). The amplification on a C1000 (bio-rad). Paired end sequencing library was prepared using the Chromium Single Cell 3′ was performed (151 bases per read) on a HiSeq 4000 (Illu- v2 Library and Gel Bead Kit (10× Genomics) according to mina). Three replicates were generated. HTSeq [46]was the manufacturers protocol. The cDNA content of each used to produce the count data. edgeR [36] was used for sample after cDNA amplification of 12 cycles was quanti- the DE analysis with TMM normalization. Each repli- high low fied and quality checked by High-Sensitivity DNA chip on cate was coded as a pair of CD44 and CD44 in a 2100 Bioanalyzer (Agilent Technologies) at a dilution of the analysis. 1:6. This quantification was used to determine final library amplification cycles in the protocol, which was calculated Additional files at 12 cycles. After library quantification and quality check by DNA 1000 chip (Agilent Technologies), samples were Additional file 1: This file includes: (1) supplementary methods describing details in single cell quality control and preprocessing, diluted to 3.5 nM for loading onto the HiSeq 4000 (Illu- application details of other DE methods, and a statistical model linking mina) with a 2 × 75 Paired-end kit using the following UMI and read counts; (2) all supplementary figures. (PDF 2338 kb) read length: 26 bp Read1 (10× cell barcode and UMI), Additional file 2: This file includes all supplementary tables. (XLSX 1530 kb) 8 bp i7 Index (sample index), and 98 bp Read2 (insert). An average of 400,000,000 reads per sample was obtained, Acknowledgements which translated to roughly 80,000 mean reads per cell, We thank Christoph Ziegenhain for sharing the count matrix of UMI-based per sample. The Cell Ranger 2.0.1 Single-Cell Software protocols and Keith A. Laycock and Xiaotu Ma for editing the manuscript. Suite (10× Genomics) was implemented to process the Funding raw sequencing data from the Illumina HiSeq run. This This study was also supported in part by the National Cancer Institute of the pipeline performed de-multiplexing, alignment (GRCh38/ National Institutes of Health under Award Number P30CA021765 and by STAR), and barcode processing to generate gene-cell ALSAC. matrices used for downstream analysis. Availability of data and materials After matrix generation the ribosomal and mitochon- high The Rh41 scRNA-seq dataset and the bulk RNA-seq data for sorted CD44 drial related genes were filtered. The subpopulation low and CD44 subpopulations generated in this study have been deposited in structure in Rh41 cells was inferred using a novel clus- GEO under the accession number GSE113660 [47]. The functions used for the data analysis are included in the NBID package under a MIT license, tering algorithm developed in house for analyzing which can be installed from Bitbucket (https://bitbucket.org/Wenan/nbid) large-scale scRNA-seq data (manuscript in preparation). [48]. The source code is also uploaded with DOI URL: https://doi.org/10.5281/ Briefly, it first used singular value decomposition (SVD) zenodo.1225670 [49]. The codes for data QC and DE analysis using other packages can be downloaded from https://bitbucket.org/Wenan/ to derive latent cellular states from the expression scrna_qc_de [50]. The public datasets we use in this paper are from Ziegen- matrix for individual cells. The number of significant hain et al. [12], Zheng et al. [8], Grun et al. [20], Jatin et al. [21], Klein et al. [7], cellular states was determined using the Tracy-Widom Islam et al. [22], and Scialdone et al. [23]. test on eigenvalues. A modified version of spectral clus- Authors’ contributions tering was performed on the significant cellular states of WC and XC proposed the study design and method development. WC individual cells (cellular states explained by total UMIs performed model comparison using hypothesis testing, differential expression were ignored) with a different number of clusters (2– analysis method implementation, and data analysis. YL performed the goodness of fit analysis. XC performed clustering analysis on Rh41 cells. JE generated scRNA- 30). The final two-subpopulation structure was deter- seq data from Rh41 cells, CD44 based cell sorting and RNA-seq on sorted subpop- mined by the silhouette measure for solutions with dif- ulations. DF analyzed RNA-seq data from sorted Rh41 subpopulations. GW ferent number of clusters. proposed visualization schemes of analysis results. WC and XC drafted the manuscript. All authors revised and approved the final manuscript. NBID was used to identify DE genes in the two sub- populations. We further filtered the DE genes by using Ethics approval and consent to participate two thresholds: the average expression level with TPM ≥ Not applicable. 3 in at least one cluster [30], and log2 fold-change ≥ 1. Competing interests The authors declare that they have no competing interests. Rh41 bulk RNA-seq dataset RNA was isolated from the sorted subpopulations using Trizol (Thermo Fisher Scientific) following the manufac- Publisher’sNote Springer Nature remains neutral with regard to jurisdictional claims in ture recommendations. RNA libraries were prepared using published maps and institutional affiliations. the Kapa RNA HyperPrep with Riboerase RNA kit (Roche) using the recommended conditions. Briefly, 200 ng of total Author details Department of Computational Biology, St. Jude Children’s Research Hospital, RNA was used as input for fragmentation, reverse tran- 262 Danny Thomas Pl, Memphis, TN 38105, USA. Division of Biostatistics, scription, and second strand synthesis. After clean up, end School of Public Health, University of Minnesota Twin Cities, Mayo Building, repair, and A tailing, Nextflex adapters (Bioo Scientific) Minneapolis, MN 55455, USA. Chen et al. Genome Biology (2018) 19:70 Page 17 of 17 Received: 1 December 2017 Accepted: 30 April 2018 26. Tung PY, Blischak JD, Hsiao CJ, Knowles DA, Burnett JE, Pritchard JK, et al. Batch effects and the effective design of single-cell gene expression studies. Sci Rep. 2017;7:39921. 27. Jaakkola MK, Seyednasrollah F, Mehmood A, Elo LL. Comparison of methods to detect differentially expressed genes between single-cell populations. References Brief Bioinform. 2017;18:735–43. 1. Liu S, Trapnell C. Single-cell transcriptome sequencing: recent advances and 28. Patel AP, Tirosh I, Trombetta JJ, Shalek AK, Gillespie SM, Wakimoto H, et al. remaining challenges [version 1; referees: 2 approved]. 2016;5(F1000 Faculty Single-cell RNA-seq highlights intratumoral heterogeneity in primary Rev):182. https://doi.org/10.12688/f1000research.7223.1. glioblastoma. Science. 2014;344:1396–401. 2. Svensson V, Natarajan KN, Ly LH, Miragaia RJ, Labalette C, Macaulay IC, et al. 29. Weng NP, Araki Y, Subedi K. The molecular basis of the memory T cell Power analysis of single-cell RNA-sequencing experiments. Nat Methods. response: differential gene expression and its epigenetic regulation. Nat Rev 2017;14:381–7. Immunol. 2012;12:306–15. 3. Grun D, Kester L, van Oudenaarden A. Validation of noise models for single- 30. Wagner GP, Kin K, Lynch VJ. A model based criterion for gene expression cell transcriptomics. Nat Methods. 2014;11:637–40. calls using RNA-seq data. Theory Biosci. 2013;132:159–64. 4. Marinov GK, Williams BA, McCue K, Schroth GP, Gertz J, Myers RM, et al. 31. Li F, Tiede B, Massague J, Kang Y. Beyond tumorigenesis: cancer stem cells From single-cell to cell-pool transcriptomes: stochasticity in gene expression in metastasis. Cell Res. 2007;17:3–14. and RNA splicing. Genome Res. 2014;24:496–510. 32. Yan Y, Zuo X, Wei D. Concise review: emerging role of CD44 in cancer stem 5. Kharchenko PV, Silberstein L, Scadden DT. Bayesian approach to single-cell cells: a promising biomarker and therapeutic target. Stem Cells Transl Med. differential expression analysis. Nat Methods. 2014;11:740–2. 2015;4:1033–43. 6. Wang Y, Navin NE. Advances and applications of single-cell sequencing 33. Rudzinski ER, Anderson JR, Lyden ER, Bridge JA, Barr FG, Gastier-Foster JM, et al. technologies. Mol Cell. 2015;58:598–609. Myogenin, AP2beta, NOS-1, and HMGA2 are surrogate markers of fusion status 7. Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, et al. Droplet in rhabdomyosarcoma: a report from the Soft Tissue Sarcoma Committee of barcoding for single-cell transcriptomics applied to embryonic stem cells. the Children's Oncology Group. Am J Surg Pathol. 2014;38:654–9. Cell. 2015;161:1187–201. 34. Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, et al. 8. Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, et al. Tackling the widespread and critical impact of batch effects in high- Massively parallel digital transcriptional profiling of single cells. Nat throughput data. Nat Rev Genet. 2010;11:733–9. Commun. 2017;8:14049. 35. Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson 9. Finak G, McDavid A, Yajima M, Deng J, Gersuk V, Shalek AK, et al. MAST: a A, et al. A survey of best practices for RNA-seq data analysis. Genome Biol. flexible statistical framework for assessing transcriptional changes and 2016;17:13. characterizing heterogeneity in single-cell RNA sequencing data. Genome 36. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for Biol. 2015;16:278. differential expression analysis of digital gene expression data. 10. Bacher R, Kendziorski C. Design and computational analysis of single-cell Bioinformatics. 2010;26:139–40. RNA-sequencing experiments. Genome Biol. 2016;17:63. 37. Anders S, Huber W. Differential expression analysis for sequence count data. 11. Qiu X, Hill A, Packer J, Lin D, Ma YA, Trapnell C. Single-cell mRNA quantification Genome Biol. 2010;11:R106. and differential analysis with Census. Nat Methods. 2017;14:309–15. 38. Ruxton GD. The unequal variance t-test is an underused alternative to 12. Ziegenhain C, Vieth B, Parekh S, Reinius B, Guillaumet-Adkins A, Smets M, et Student's t-test and the Mann-Whitney U test. Behav Ecol. 2006;17:688–90. al. Comparative analysis of single-cell RNA sequencing methods. Mol Cell. 39. Rasch D, Kubinger KD, Moder K. The two-sample t test: pre-testing its 2017;65:631–43. e634 assumptions does not pay off. Stat Pap. 2011;52:219–31. 13. Vieth B, Ziegenhain C, Parekh S, Enard W, Hellmann I. powsimR: power 40. Jansakul N, Hinde J. Score tests for extra-zero models in zero-inflated analysis for bulk and single cell RNA-seq experiments. Bioinformatics. 2017; negative binomial models. Commun Stat Simul Comput. 2009;38:92–108. 33:3486–8. 41. Cameron AC, Trivedi PK. Regression analysis of count data. Cambridge: 14. Wagner A, Regev A, Yosef N. Revealing the vectors of cellular identity with Cambridge University Press; 1998. single-cell genomics. Nat Biotechnol. 2016;34:1145–60. 42. Benjamini Y, Hochberg Y. Controlling the false discovery rate - a practical 15. Pierson E, Yau C. ZIFA: Dimensionality reduction for zero-inflated single-cell and powerful approach to multiple testing. J R Stat Soc Ser B Methodol. gene expression analysis. Genome Biol. 2015;16:241. 1995;57:289–300. 16. Vallejos CA, Risso D, Scialdone A, Dudoit S, Marioni JC. Normalizing single- 43. Venables WN, Ripley BD. Modern applied statistics with S. 4th ed: New York: cell RNA sequencing data: challenges and opportunities. Nat Methods. 2017; Springer; 2010. 14:565–71. 44. Zeileis A, Kleiber C, Jackman S. Regression models for count data in R. J Stat 17. Hicks SC, Townes FW, Teng M, Irizarry RA. Missing data and technical Softw. 2008;27:1–25. variability in single-cell RNA-sequencing experiments. Biostatistics. 2017. 45. Delignette-Muller ML, Dutang C. fitdistrplus: An R Package for fitting https://doi.org/10.1093/biostatistics/kxx053. distributions. J Stat Softw. 2015;64:1–34. 18. Jaakkola MK, Seyednasrollah F, Mehmood A, Elo LL. Comparison of methods 46. Anders S, Pyl PT, Huber W. HTSeq–a Python framework to work with high- to detect differentially expressed genes between single-cell populations. throughput sequencing data. Bioinformatics. 2015;31:166–9. Brief Bioinform. 2017;18(5):735–43. 47. Chen W, Li Y, Easton J, Finkelstein D, Wu G, Chen X. UMI-count modeling 19. Satija R, Farrell JA, Gennert D, Schier AF, Regev A. Spatial reconstruction of and differential expression analysis for single cell RNA sequencing. Datasets. single-cell gene expression data. Nat Biotechnol. 2015;33:495–502. Gene Expression Omnibus. https://www.ncbi.nlm.nih.gov/geo/query/acc. 20. Grun D, Lyubimova A, Kester L, Wiebrands K, Basak O, Sasaki N, et al. Single- cgi?acc=GSE113660. cell messenger RNA sequencing reveals rare intestinal cell types. Nature. 48. Chen W, Li Y, Easton J, Finkelstein D, Wu G, Chen X. UMI-count modeling 2015;525:251–5. and differential expression analysis for single cell RNA sequencing. 21. Jaitin DA, Kenigsberg E, Keren-Shaul H, Elefant N, Paul F, Zaretsky I, et al. Bitbucket. https://bitbucket.org/Wenan/nbid. Massively parallel single-cell RNA-seq for marker-free decomposition of 49. Chen W, Li Y, Easton J, Finkelstein D, Wu G, Chen X. UMI-count modeling tissues into cell types. Science. 2014;343:776–9. and differential expression analysis for single cell RNA sequencing. zenodo. 22. Islam S, Kjallquist U, Moliner A, Zajac P, Fan JB, Lonnerberg P, et al. https://doi.org/10.5281/zenodo.1225670. Characterization of the single-cell transcriptional landscape by highly 50. Chen W, Li Y, Easton J, Finkelstein D, Wu G, Chen X. UMI-count modeling and multiplex RNA-seq. Genome Res. 2011;21:1160–7. differential expression analysis for single cell RNA sequencing. Bitbucket. 23. Scialdone A, Natarajan KN, Saraiva LR, Proserpio V, Teichmann SA, Stegle O, https://bitbucket.org/Wenan/scrna_qc_de. et al. Computational assignment of cell-cycle stage from single-cell transcriptome data. Methods. 2015;85:54–61. 24. Johnson RA, Wichern DW. Applied multivariate statistical analysis. 3rd ed. Prentice Hall: Englewood Cliffs, NJ; 1992. 25. Lun AT, Bach K, Marioni JC. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 2016;17:75.

Journal

Genome BiologySpringer Journals

Published: May 31, 2018

References

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off