Interpretable dimensionality reduction of single cell transcriptome data with deep generative models

ARTICLE | DOI: 10.1038/s41467-018-04368-5 | OPEN

Jiarui Ding [1,2,3,4], Anne Condon [1] & Sohrab P. Shah [1,2,3,5]

Single-cell RNA-sequencing has great potential to discover cell types, identify cell states, trace development lineages, and reconstruct the spatial organization of cells. However, dimension reduction to interpret structure in single-cell sequencing data remains a challenge. Existing algorithms are either not able to uncover the clustering structures in the data or lose global information such as groups of clusters that are close to each other. We present a robust statistical model, scvis, to capture and visualize the low-dimensional structures in single-cell gene expression data. Simulation results demonstrate that low-dimensional representations learned by scvis preserve both the local and global neighbor structures in the data. In addition, scvis is robust to the number of data points and learns a probabilistic parametric mapping function to add new data points to an existing embedding. We then use scvis to analyze four single-cell RNA-sequencing datasets, exemplifying interpretable two-dimensional representations of the high-dimensional single-cell RNA-sequencing data.

[1] Department of Computer Science, University of British Columbia, Vancouver, BC V6T 1Z4, Canada. [2] Department of Molecular Oncology, BC Cancer Agency, Vancouver, BC V5Z 1L3, Canada. [3] Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, BC V6T 2B5, Canada. [4] Present address: Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA. [5] Present address: Memorial Sloan Kettering Cancer Center, 1275 York Avenue, New York, NY 10065, USA. Correspondence and requests for materials should be addressed to J.D. (email: jding@broadinstitute.org) or to S.P.S. (email: sshah@bccrc.ca).

Categorizing cell types comprising a specific organ or disease tissue is critical for the comprehensive study of tissue development and function[1]. For example, in cancer, identifying constituent cell types in the tumor microenvironment together with malignant cell populations will improve understanding of cancer initialization, progression, and treatment response[2,3]. Technical developments have made it possible to measure the DNA and/or RNA molecules in single cells by single-cell sequencing[4–15] or protein content by flow or mass cytometry[16,17]. The data generated by these technologies enable us to quantify cell types, identify cell states, trace development lineages, and reconstruct the spatial organization of cells[18,19]. An unsolved challenge is to develop robust computational methods to analyze large-scale single-cell data, measuring from the expression of dozens of protein markers up to all the mRNA expression in tens of thousands to millions of cells, in order to distill single-cell biology[20–23].

Single-cell datasets are typically high dimensional in large numbers of measured cells. For example, single-cell RNA-sequencing (scRNA-seq)[19,24–26] can theoretically measure the expression of all the genes in tens of thousands of cells in a single experiment[9,10,14,15]. For analysis, dimensionality reduction, projecting high-dimensional data into a low-dimensional space (typically two or three dimensions) to visualize the cluster structures[27–29] and development trajectories[30–33], is commonly used. Linear projection methods such as principal component analysis (PCA) typically cannot represent the complex structures of single-cell data in low-dimensional spaces. Nonlinear dimension reduction, such as the t-distributed stochastic neighbor embedding algorithm (t-SNE)[34–39], has shown reasonable results for many applications and has been widely used in single-cell data processing[1,40,41]. However, t-SNE has several limitations[42]. First, unlike PCA, it is a non-parametric method that does not learn a parametric mapping; therefore, it is not natural to add new data to an existing t-SNE embedding, and we typically need to combine all the data together and rerun t-SNE. Second, as a non-parametric method, the algorithm is sensitive to hyperparameter settings. Third, t-SNE is not scalable to large datasets because it has a time complexity of O(N^2 D) and a space complexity of O(N^2), where N is the number of cells and D is the number of expressed genes in the case of scRNA-seq data. Fourth, t-SNE only outputs the low-dimensional coordinates, without any uncertainties of the embedding. Finally, t-SNE typically preserves the local clustering structures very well given proper hyperparameters, but more global structures, such as a group of subclusters that form a big cluster, are missed in the low-dimensional embedding.

In this paper, we introduce a robust latent variable model, scvis, to capture underlying low-dimensional structures in scRNA-seq data. As a probabilistic generative model, our method learns a parametric mapping from the high-dimensional space to a low-dimensional embedding; therefore, new data points can be directly added to an existing embedding by the mapping function. Moreover, scvis estimates the uncertainty of mapping a high-dimensional point to a low-dimensional space, which adds rich capacity to interpret results. We show that scvis has superior distance-preserving properties in its low-dimensional projections, leading to robust identification of cell types in the presence of noise or ambiguous measurements. We extensively tested our method on simulated data and several scRNA-seq datasets in both normal and malignant tissues to demonstrate its robustness.

Results

Modeling and visualizing scRNA-seq data. Although scRNA-seq datasets have high dimensionality, their intrinsic dimensionalities are typically much lower. For example, factors such as cell type and patient origin explain much of the variation in a study of metastatic melanoma[3]. We therefore assume that for a high-dimensional scRNA-seq dataset D = {x_n}_{n=1}^{N} with N cells, where x_n is the expression vector of cell n, the x_n distribution is governed by a latent low-dimensional random vector z_n (Fig. 1a). For visualization purposes, the dimensionality d of z_n is typically two or three. We assume that z_n is distributed according to a prior, with the joint distribution of the whole model as p(z_n | θ) p(x_n | z_n, θ). For simplicity, we can choose a factorized standard normal distribution for the prior, p(z_n | θ) = ∏_{i=1}^{d} N(z_{n,i} | 0, 1). The distribution p(x_n | θ) = ∫ p(z_n | θ) p(x_n | z_n, θ) dz_n can be a complex multimodal high-dimensional distribution. To represent complex high-dimensional distributions, we assume that p(x_n | z_n, θ) is a location-scale family distribution with location parameter μ_θ(z_n) and scale parameter σ_θ(z_n); both are functions of z_n parameterized by a neural network with parameter θ. The inference problem is to compute the posterior distribution p(z_n | x_n, θ), which is, however, intractable to compute. We therefore use a variational distribution q(z_n | x_n, ϕ) to approximate the posterior (Fig. 1b). Here q(z_n | x_n, ϕ) is a multivariate normal distribution with mean μ_ϕ(x_n) and standard deviation σ_ϕ(x_n); both parameters are (continuous) functions of x_n parameterized by a neural network with parameter ϕ. To model the data distribution well (with a high likelihood of ∫ p(z_n | θ) p(x_n | z_n, θ) dz_n), the model tends to assign similar posterior distributions p(z_n | x_n, θ) to cells with similar expression profiles. To explicitly encourage cells with similar expression profiles to be proximal (and those with dissimilar profiles to be distal) in the latent space, we add the t-SNE objective function on the latent z distribution as a constraint. More details about the model and the inference algorithms are presented in the Methods section. The scvis model is implemented in Python using Tensorflow with a command-line interface and is freely available from https://bitbucket.org/jerry00/scvis-dev.
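To make the generative assumption concrete, the following minimal numpy sketch draws a latent point from the standard normal prior and maps it through placeholder location/scale functions to a sample in data space. This is an illustration under stated assumptions, not the authors' TensorFlow implementation: the affine maps `W_mu`/`W_s` stand in for the model neural network, and the fixed Student's t degrees of freedom is a placeholder (the Methods section states ν is learned from data).

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 2, 9           # latent and data dimensionalities (D = 9 as a toy choice)

# Placeholder location/scale functions standing in for the model network;
# in scvis these are neural networks with parameters theta.
W_mu = rng.normal(0.0, 1.0, (d, D))
W_s = rng.normal(0.0, 0.1, (d, D))

def mu_theta(z):
    return z @ W_mu

def sigma_theta(z):
    return np.exp(z @ W_s)        # exponentiate to keep the scale positive

# Generative model: z_n ~ N(0, I_d); x_n ~ location-scale family given z_n.
z_n = rng.standard_normal(d)
x_n = mu_theta(z_n) + sigma_theta(z_n) * rng.standard_t(3.0, size=D)
```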
Single-cell datasets. We analyzed four scRNA-seq datasets in this study[1,3,9,44]. Data were mostly downloaded from the single-cell portal. Two of these datasets were originally used to study intratumor heterogeneity and the tumor microenvironment in metastatic melanoma[3] and oligodendroglioma[44], respectively. One dataset was used to categorize the mouse bipolar cell populations of the retina[9], and one dataset was used to categorize all cell types in the mouse retina[1]. For all the scRNA-seq datasets, we used PCA (as a noise-reduction preprocessing step[1,19]) to project the cells into a 100-dimensional space and used the projected coordinates in the 100-dimensional space as inputs to scvis. We also used two mass cytometry (CyTOF) datasets consisting of bone marrow mononuclear cells from two healthy adult donors, H1 and H2. For the CyTOF data, since their dimensionality (32) is relatively low, we directly used these data as inputs to scvis.

Experimental setting and implementation. The variational inference neural network has three hidden layers (l_1, l_2, and l_3) with 128, 64, and 32 hidden units each, and the model neural network has five hidden layers (l'_1, l'_2, l'_3, l'_4, and l'_5) with 32, 32, 32, 64, and 128 units each. We use the exponential linear unit activation function, as it has been shown to speed up the convergence of optimization, and the Adam stochastic optimization algorithm with a learning rate of 0.01. Details about the influence of these hyperparameters on results are presented in the Methods section. The time complexity to compute the t-SNE loss is quadratic in the number of data points; consequently, we use mini-batch optimization and set the mini-batch size to 512 (cells), as in the sketch below.
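A hedged sketch of this configuration follows: the layer sizes, activation, and batch size mirror the stated settings, while `make_net`, `forward`, and the random placeholder weights are our own illustrative code, not part of the scvis package.

```python
import numpy as np

rng = np.random.default_rng(0)

def elu(a):
    # Exponential linear unit activation, as used by scvis.
    return np.where(a > 0, a, np.exp(a) - 1.0)

def make_net(sizes):
    # Random placeholder weights for a feedforward net with the given sizes.
    return [(rng.normal(0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(h, net):
    for i, (W, b) in enumerate(net):
        h = h @ W + b
        if i < len(net) - 1:       # hidden layers use ELU; output is linear
            h = elu(h)
    return h

D, d = 100, 2                      # 100 PCs in, two-dimensional latent out
inference_net = make_net([D, 128, 64, 32, 2 * d])      # -> mu_phi, log sigma_phi
model_net = make_net([d, 32, 32, 32, 64, 128, 2 * D])  # -> mu_theta, log sigma_theta

batch = rng.standard_normal((512, D))   # one mini-batch of 512 cells
params = forward(batch, inference_net)  # shape (512, 4): mean and log-scale of q
```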
We expect that a larger batch of data could be better at estimating the high-dimensional data manifold; however, we found that 512 cells worked accurately and efficiently in practice. We run the Adam stochastic gradient descent algorithm for 500 epochs for each dataset, with at least 3000 iterations, by default. For large datasets, running 500 epochs is computationally expensive; we therefore run the Adam algorithm for a maximum of 30,000 iterations or two epochs (whichever is larger). We use an L2 regularizer of 0.001 on the weights of the neural networks to prevent overfitting.

Fig. 1 Overview of the scvis method. a scvis model assumptions: given a low-dimensional point drawn from a simple distribution, e.g., a two-dimensional standard normal distribution, a high-dimensional gene expression vector of a cell can be generated by drawing a sample from the distribution p(x | z, θ). The heatmap represents a cell–gene expression matrix, where each row is a cell and each column is a gene. Color encodes the expression levels of genes in cells. The data-point-specific parameters θ are determined by a model neural network. The model neural network (a feedforward neural network) consists of an input layer, several hidden layers, and an output layer. The output layer outputs the parameters θ of p(x | z, θ). b scvis inference: given a high-dimensional gene expression vector of a cell (a row of the heatmap), scvis obtains its low-dimensional representation by sampling from the conditional distribution q(z | x, ϕ). The data-point-specific parameters ϕ are determined by a variational inference neural network. The inference neural network is also a feedforward neural network, and its output layer outputs the parameters ϕ of q(z | x, ϕ). Again, the heatmap represents a cell–gene expression matrix. The scatter plot shows samples drawn from the variational posterior distributions q(z | x, ϕ).

Benchmarking scvis against t-SNE on simulated data. To demonstrate that scvis can robustly learn a low-dimensional representation of the input data, we first simulated data in a two-dimensional space (for easy visualization) as in Fig. 2a. The big cluster on the left consisted of 1000 points, and the five small clusters on the right each had 200 points. The five small clusters were very close to each other and could roughly be considered a single big cluster. There were 200 uniformly distributed outliers around these six clusters. For each two-dimensional data point with coordinates (x, y), we then mapped it into a nine-dimensional space by the transformation (x + y, x − y, xy, x^2, y^2, x^2 y, x y^2, x^3, y^3). Each of the nine features was then divided by its corresponding maximum absolute value.

Although t-SNE[34] (with the default parameter setting; we used the efficient Barnes-Hut t-SNE R wrapper package[48]) uncovered the six clusters in this dataset, it was still challenging to infer the overall layout of the six clusters (Fig. 2b). t-SNE by design preserves the local structure of the high-dimensional data, but the "global" structure is not reliable. Moreover, for the uniformly distributed outliers, t-SNE put them into several compact clusters, which were adjacent to other genuine clusters.

The scvis results, on the other hand, better preserved the overall structure of the original data (Fig. 2c): (1) the five small clusters were on one side and the big cluster was on the other side, and the relative positions of the clusters were also preserved; (2) outliers were scattered around the genuine clusters as in the original data.
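For concreteness, the synthetic dataset can be regenerated approximately as follows. The cluster centers and spreads below are our own assumptions (the paper does not give them), but the nine-feature polynomial lift and the max-absolute-value scaling follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# One big cluster (1000 points), five small clusters (200 points each),
# and 200 uniformly distributed outliers, roughly as described in the text.
centers = np.array([[-6.0, 0.0], [3.0, 2.0], [4.5, 2.0],
                    [3.0, 0.5], [4.5, 0.5], [3.75, -1.0]])
sizes = [1000, 200, 200, 200, 200, 200]
points = np.vstack([rng.normal(c, 0.3, (m, 2)) for c, m in zip(centers, sizes)])
outliers = rng.uniform(points.min(0) - 1, points.max(0) + 1, (200, 2))
data2d = np.vstack([points, outliers])

def lift_to_9d(p):
    # Map (x, y) to the nine features used in the paper, then divide each
    # feature by its maximum absolute value.
    x, y = p[:, 0], p[:, 1]
    f = np.column_stack([x + y, x - y, x * y, x**2, y**2,
                         x**2 * y, x * y**2, x**3, y**3])
    return f / np.abs(f).max(axis=0)

data9d = lift_to_9d(data2d)
```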
Fig. 2 Benchmarking scvis against t-SNE on synthetic data. a The original 2200 two-dimensional synthetic data points (points are colored by cluster labels; the randomly distributed outliers are also colored in a distinct color, same for b, c), b t-SNE results on the transformed nine-dimensional dataset with the default perplexity parameter of 30, c scvis results, d coloring points based on their log-likelihoods from scvis, e the kernel density estimates of the scvis results, where the density contours are represented by lines, f average K-nearest neighbor preservations across ten runs for different Ks, and g the average K-nearest neighbor preservations (K = 10) for different numbers of subsampled data points from ten repeated runs. The numbers at the top are the adjusted p-values (FDR, one-sided Welch's t-test) comparing the average Knn preservations from scvis with those from t-SNE. Boxplots in f, g denote the medians and the interquartile ranges (IQRs). The whiskers of a boxplot are the lowest datum still within 1.5 IQR of the lower quartile and the highest datum still within 1.5 IQR of the upper quartile.

In addition, as a probabilistic generative model, scvis not only learned a low-dimensional representation of the input data but also provided a way to quantify the uncertainty of the low-dimensional mapping of each input data point by its log-likelihood. We colored the low-dimensional embedding of each data point by its log-likelihood (Fig. 2d). We can see that scvis generally put most of its modeling power into modeling the five compact clusters, while the outliers far from the five compact clusters tended to have lower log-likelihoods. Thus, by combining the log-likelihoods and the low-dimensional density information (Fig. 2e), we can better interpret the structure in the original data and the uncertainty over the projection.
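The paper does not spell out how the per-point log-likelihoods used for coloring (Fig. 2d) are estimated; one standard choice for a variational autoencoder is the importance-sampling estimator sketched below, with q(z | x) as the proposal. The `encode` and `decode_logpdf` callables are placeholders for trained networks, so treat this as an assumption rather than the authors' exact estimator.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def log_likelihood(x, encode, decode_logpdf, n_samples=100,
                   rng=np.random.default_rng(0)):
    """Importance-sampling estimate of log p(x):
    log p(x) ~= logmeanexp_l [ log p(x | z_l) + log p(z_l) - log q(z_l | x) ].
    `encode(x)` returns (mu, sigma) of q(z | x); `decode_logpdf(x, z)`
    returns log p(x | z). Both stand in for trained networks."""
    mu, sigma = encode(x)
    z = mu + sigma * rng.standard_normal((n_samples, mu.size))
    log_p_xz = np.array([decode_logpdf(x, z_l) for z_l in z])
    log_prior = norm.logpdf(z, 0.0, 1.0).sum(axis=1)
    log_q = norm.logpdf(z, mu, sigma).sum(axis=1)
    return logsumexp(log_p_xz + log_prior - log_q) - np.log(n_samples)
```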
The low-dimensional representation may change across runs because the scvis objective function can have different local maxima. To test the stability of the low-dimensional representations, we ran scvis ten times. Generally, the two-dimensional representations from the ten runs (Supplementary Fig. 1a–j) showed similar patterns as in Fig. 2c. As a comparison, we also ran t-SNE ten times, and the results (Supplementary Fig. 1k–t) showed that the layouts of the clusters were less preserved, e.g., the relative positions of the clusters changed from run to run. To quantitatively compare scvis and t-SNE results, we computed the average K-nearest neighbor (Knn) preservations across runs for K ∈ {10, 30, 50, 100, 150}. Specifically, for the low-dimensional representation from each run, we constructed Knn graphs for different Ks. We then computed the Knn graph from the high-dimensional data for a specific K. Finally, we compared the average overlap of the Knn graphs from the low-dimensional representations with the Knn graph from the high-dimensional data. For scvis, the median Knn preservation monotonically increased from 86.7% for K = 10 to 90.9% for K = 150 (Fig. 2f). For t-SNE, the median Knn preservation first decreased from 82.7% for K = 10 to 82.1% for K = 50 (consistent with t-SNE preserving local structures) and then increased to 84.0% for K = 150. Thus scvis preserved Knns more effectively than t-SNE.

To test how scvis performs on smaller datasets, we subsampled the nine-dimensional synthetic dataset. Specifically, we subsampled 100, 200, 300, 500, 700, 1000, 1500, and 2000 points from the original dataset and ran scvis 11 times on each subsampled dataset. We then computed the Knn preservations (K = 10) and found that the Knn preservations from the scvis results were significantly higher than those from the t-SNE results (false discovery rate (FDR) < 0.01 for all the subsampled datasets, one-sided Welch's t-test, Fig. 2g). scvis performed very well on all the subsampled datasets (Supplementary Fig. 2a–h). Even with just 100 data points, the two-dimensional representation (Supplementary Fig. 2a) preserved much of the structure in the data. The log-likelihoods estimated from the subsampled data also recapitulated the log-likelihoods from the original 2200 data points (Supplementary Fig. 3a–h). The t-SNE results on the subsampled datasets (Supplementary Fig. 2i–p) generally revealed the clustering structures; however, the relative positions of the five clusters and the big cluster were largely inaccurate.

To test the performance of scvis when adding new data to an existing embedding, we increased by tenfold the number of points in each cluster and the number of outliers (for a total of 22,000 points) using a different random seed. The embedding (Fig. 3a, b) was very similar to that of the 2200 training data points in Fig. 2c, d. We trained Knn classifiers on the embedding of the 2200 training data for K ∈ {5, 9, 17, 33, 65} and used the trained classifiers to classify the embedding of the 22,000 points, repeating 11 times. Median accuracy (the proportion of points correctly assigned to their corresponding clusters) was 98.1% for K = 5 and 94.8% for K = 65. The performance decreased mainly because, for a larger K, the outliers were wrongly assigned to the six genuine clusters.
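The Knn preservation score can be computed directly from the two point sets, as in this small brute-force sketch (the exact tie-breaking and graph construction in the paper may differ).

```python
import numpy as np

def knn_sets(X, k):
    # Indices of the k nearest neighbors of every point (excluding itself).
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def knn_preservation(high, low, k):
    # Average fraction of each point's high-dimensional k nearest neighbors
    # that are also among its k nearest neighbors in the embedding.
    nn_high, nn_low = knn_sets(high, k), knn_sets(low, k)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_high, nn_low)]
    return float(np.mean(overlaps))
```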
Fig. 3 Benchmarking scvis against GPLVM, parametric t-SNE, and PCA to embed 22,000 synthetic out-of-sample data. a scvis mapping 22,000 new data points based on the learned probabilistic mapping function from the 2200 training data points, b the estimated log-likelihoods, and c the average K-nearest neighbor classification accuracies for different Ks across 11 runs; the classifiers were trained on the 11 embeddings from the 2200 points. The numbers at the top are the FDR (one-sided Mann–Whitney U-test) comparing the K-nearest neighbor classification accuracy from scvis with those from GPLVM (orange, bottom) and those from parametric t-SNE (golden, top). Notice that, for GPLVM, two runs produced bad results and were not plotted in the figure. Boxplots denote the medians and the interquartile ranges (IQRs). The whiskers of a boxplot are the lowest datum still within 1.5 IQR of the lower quartile and the highest datum still within 1.5 IQR of the upper quartile. d scvis results on the larger dataset with the same perplexity parameter as used in Fig. 2; e scvis log-likelihoods on the larger dataset; and f t-SNE results on the larger dataset.

We then benchmarked scvis against the Gaussian process latent variable model (GPLVM[49], implemented in the GPy[50] package), parametric t-SNE (pt-SNE), and PCA on embedding the 22,000 out-of-sample data points. We used the 11 scvis models trained on the small nine-dimensional synthetic dataset with 2200 data points to embed the larger nine-dimensional synthetic dataset with 22,000 data points. Similarly, we trained 11 GPLVM models and pt-SNE models on the small nine-dimensional synthetic dataset and applied these models to the bigger synthetic dataset. To compare the abilities of the trained models to embed unseen data, we trained Knn classifiers on the two-dimensional representations (of the small 2200 data points) output by the different algorithms. These Knn classifiers were used to classify the two-dimensional coordinates of the 22,000 data points output by the different algorithms. scvis was significantly better than GPLVM and pt-SNE for different Ks (Fig. 3c; two runs of GPLVM produced bad results and were not plotted in the figure; FDR < 0.05, one-sided Mann–Whitney U-test). For PCA, because the model is unique for a given dataset, we generated unique two-dimensional coordinates for the 22,000 out-of-sample data points. The Knn classifiers trained on the PCA coordinates were worse than those from scvis, GPLVM, and pt-SNE in terms of the mean classification accuracies for different Ks.

As a non-parametric dimension reduction method, t-SNE was sensitive to hyperparameter settings, especially the perplexity parameter (the effective number of neighbors; see the Methods section for details). The optimal perplexity parameter increased as the total number of data points increased. In contrast, as we adopted mini-batches for training scvis by subsampling, e.g., 512 cells each time, scvis was less sensitive to the perplexity parameter as we increased the total number of training data points, because the number of cells is fixed at 512 at each training step. Therefore, scvis performed well on an approximately order-of-magnitude larger dataset (Fig. 3d, e) without changing the perplexity parameter. For this larger dataset, the t-SNE results (Fig. 3f) were difficult to interpret without the ground-truth cluster information, because it was already difficult to see how many clusters were in this dataset, not to mention to uncover the overall structure of the data. Although the performance of t-SNE became better when increasing the perplexity parameter (Supplementary Fig. 4), the outliers still formed distinct clusters, and it remains difficult to set this parameter in practice.
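The out-of-sample comparison in Fig. 3c reduces to training a K-nearest neighbor classifier on one embedding and scoring it on another. A minimal sketch using scikit-learn follows; the library choice is ours, not necessarily the authors'.

```python
from sklearn.neighbors import KNeighborsClassifier

def out_of_sample_accuracy(z_train, y_train, z_test, y_test,
                           ks=(5, 9, 17, 33, 65)):
    # Train a Knn classifier on the training embedding and score it on the
    # embedding of held-out points, for each K.
    return {k: KNeighborsClassifier(n_neighbors=k)
                 .fit(z_train, y_train)
                 .score(z_test, y_test)
            for k in ks}
```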
Fig. 4 Learning a probabilistic mapping function from the bipolar data and applying the function to the independently generated mouse retina dataset. a scvis learned two-dimensional representations of the bipolar dataset, b coloring each point by the estimated log-likelihood, c the whole mouse retina dataset was directly projected to a two-dimensional space by the probabilistic mapping function learned from the bipolar data, and d coloring each point from the retina dataset by the estimated log-likelihood.

Learning a parametric mapping for a single-cell dataset. We next analyzed the scvis-learned probabilistic mapping from a training single-cell dataset and tested how it performed on unseen data. We first trained a model on the mouse bipolar cells of the retina dataset[9] and then used the learned model to map the independently generated mouse retina dataset[1]. The two-dimensional coordinates from the bipolar dataset captured much information in this dataset (Fig. 4a). For example, non-bipolar cells such as amacrine cells, Mueller glia, and photoreceptors were at the bottom, the rod bipolar cells were in the middle, and the cone bipolar cells were on the top left around the rod bipolar cells. Moreover, the "OFF" cone bipolar cells (BC1A, BC1B, BC2, BC3A, BC3B, BC4) were on the left and close to each other, and the "ON" cone bipolar cells (BC5A-D, BC6, BC7, BC8/9) were at the top. Cell doublets and contaminants (accounting for 2.43% of the cells and comprising eight clusters, shown with distinct color and symbol combinations in Fig. 4a but not labeled) were rare in the bipolar dataset, and they were mapped to low-density regions in the low-dimensional plots (Fig. 4a).

Consistent with the synthetic data (Fig. 2), t-SNE put the "outlier" cell doublets and contaminants into very distinct compact clusters (Supplementary Fig. 5a; t-SNE coordinates from Shekhar et al.). In addition, although t-SNE mapped cells from different cell populations into distinct regions, more global organizations of clusters of cells were missed in the t-SNE embedding: the "ON" cone bipolar cell clusters, the "OFF" cone bipolar cell clusters, and other non-bipolar cell clusters were mixed together in the t-SNE results.

The bipolar cells tended to have higher log-likelihoods than non-bipolar cells such as amacrine cells, Mueller glia, and photoreceptors (Fig. 4b), suggesting that the model used most of its power to model the bipolar cells, while other cell types were not modeled as well. The embedded figure at the top right corner shows the histogram of the log-likelihoods. The majority of the points exhibited high log-likelihoods (with a median of 292.4). The bipolar cells had significantly higher log-likelihoods (median log-likelihood of 298.4) relative to non-bipolar cells, including amacrine cells, Mueller glia, and rod and cone photoreceptors (median log-likelihood of 223.6; one-sided Mann–Whitney U-test FDR < 0.001; Supplementary Fig. 5b). Among the non-bipolar cells, the median log-likelihoods for amacrine cells, Mueller glia, and rod and cone photoreceptors were 226.4, 187.3, 222.7, and 205.4, respectively (Supplementary Fig. 5b).
Cell doublets and contaminants (accounting log-likelihood of 298.4) relative to non-bipolar cells (including for 2.43% of the cells comprised eight clusters , with distinct color amacrine cells, Mueller glia, rod and cone photoreceptors) and symbol combinations in Fig. 4a but not labeled) were rare in (median log-likelihood of 223.6; one-sided Mann–Whitney U- the bipolar datasets, and they were mapped to low-density regions test FDR < 0.001; Supplementary Fig. 5b). The amacrine cells had in the low-dimensional plots (Fig. 4a). the lowest median log-likelihood (median log-likelihood for Consistent with the synthetic data (Fig. 2), t-SNE put the amacrine cells, Mueller glia, rod and cone photoreceptors were “outlier” cell doublets and contaminants into very distinct 226.4, 187.3, 222.7, and 205.4, respectively; Supplementary compact clusters (Supplementary Fig. 5a, t-SNE coordinates Fig. 5b). 6 NATURE COMMUNICATIONS (2018) 9:2002 DOI: 10.1038/s41467-018-04368-5 www.nature.com/naturecommunications | | | z coordinate two Counts 0 4000 8000 Counts 2000 4000 6000 0 NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04368-5 ARTICLE We benchmarked scvis against GPLVM, pt-SNE, and PCA on distribution of the log-likelihoods. The “Other” cells types embedding out-of-sample scRNA-seq data, performing a five-fold (horizontal cells, retina ganglion cells, microglia cells, etc) that cross-validation analysis on the bipolar dataset. Specifically, we were only in the retina dataset had the lowest log-likelihoods partitioned the bipolar dataset into five roughly equal size (median log-likelihoods of 181.7, Supplementary Fig. 5d). subsamples and held out one subsample as out-of-sample It is straightforward to project scRNA-seq to a higher than evaluation data, using the remaining four subsamples as training two-dimensional space. To evaluate how scvis performs on data to learn different models. We then trained Knn classifiers on higher-dimensional maps, we projected the bipolar data to a the two-dimensional representations of the training data and then three-dimensional space. We obtained better average log- used the Knn classifiers to classify the two-dimensional likelihood per data point, i.e., 255.1 versus 253.3 (from the last representations of the out-of-sample evaluation data. The process 100 iterations) by projecting the data to a three-dimensional was repeated five times with each of the five subsamples used space compared to projecting the data to a two-dimensional space exactly once as the out-of-sample validation data. scvis was (Supplementary Fig. 7). In addition, the average KL divergence significantly better than pt-SNE, GPLVM, and PCA on embed- was smaller (2.7 versus 4.1 from the last 100 iterations) by ding the out-of-samples (Supplementary Fig. 6a, b, FDR < 0.05, projecting the data to a three-dimensional space. one-sided Welch’s t-test). Finally, to demonstrate that scvis can be used for other types of We used the learned probabilistic mapping from the bipolar single-cell data, we learned a parametric mapping from the cells to map the independent whole-retina dataset .We first CyTOF data H2 and then directly used the mapping to project the projected the retina dataset to the subspace spanned by the first CyTOF data H1 to a two-dimensional space. As can be seen from 100 principal direction vectors of the bipolar dataset and then Supplementary Fig. 
Interestingly, there was a cluster just above the rod photoreceptors (Fig. 4c) consisting of different subtypes of bipolar cells. In the bipolar dataset, cell doublets or contaminants were mapped to this region (Fig. 4a). We used densitycut to cluster the two-dimensional mapping of all the bipolar cells from the retina dataset to detect this mixture bipolar cell cluster (Supplementary Fig. 5c, where the 1535 high-density points in this cluster are labeled with red circles). To test whether this mixture cell population was an artifact of the projection, we randomly drew the same number of data points from each bipolar subtype as in the mixture cluster and computed the Knns of each data point (here K was set to log2(1535) ≈ 11). We found that the 11 nearest neighbors of the points from the mixture cluster were also mostly from the mixture cluster (median of 11 and mean of 10.8), while for the randomly selected points from the bipolar cells, a relatively small number of their 11 nearest neighbors (median of 0 and mean of 0.2) were from the mixture cluster. These results suggest that the bipolar cells in the mixture cluster were substantially different from other bipolar cells. Finally, this mixture of bipolar cells had significantly lower log-likelihoods compared with other bipolar cells (one-sided Mann–Whitney U-test p-value < 0.001, Supplementary Fig. 5d).

Non-bipolar cells, especially Mueller glia cells, were mapped to the corresponding regions as in the bipolar dataset (Fig. 4c). Photoreceptors (rod and cone photoreceptors, accounting for 65.6 and 4.2% of all the cells from the retina) were also mapped to their corresponding regions as in the bipolar dataset (Supplementary Fig. 5e). The amacrine cells (consisting of 21 clusters) together with horizontal cells and retinal ganglion cells were mapped to the bottom right region (Supplementary Fig. 5f); all the amacrine cells were assigned the same label and the same color.

As in the training bipolar data, the bipolar cells in the retina dataset also tended to have high log-likelihoods, and other cells tended to have relatively lower log-likelihoods (Fig. 4d). The embedded plot on the top right corner shows a bimodal distribution of the log-likelihoods. The "Other" cell types (horizontal cells, retina ganglion cells, microglia cells, etc.) that were only in the retina dataset had the lowest log-likelihoods (median log-likelihood of 181.7, Supplementary Fig. 5d).

It is straightforward to project scRNA-seq data to a higher than two-dimensional space. To evaluate how scvis performs on higher-dimensional maps, we projected the bipolar data to a three-dimensional space. We obtained a better average log-likelihood per data point, i.e., 255.1 versus 253.3 (from the last 100 iterations), by projecting the data to a three-dimensional space compared to projecting the data to a two-dimensional space (Supplementary Fig. 7). In addition, the average KL divergence was smaller (2.7 versus 4.1 from the last 100 iterations) when projecting the data to a three-dimensional space.

Finally, to demonstrate that scvis can be used for other types of single-cell data, we learned a parametric mapping from the CyTOF data H2 and then directly used the mapping to project the CyTOF data H1 to a two-dimensional space. As can be seen from Supplementary Fig. 8a, all the 14 cell types were separated (although CD16+ and CD16− NK cells had some overlaps), and the CD4 T cell and CD8 T cell clusters were adjacent to each other. Moreover, the high quality of the mapping carried over to the CyTOF data H1 (72,463 cells, Supplementary Fig. 8a, b).
Tumor microenvironments and intratumor heterogeneity. We next used scvis to analyze tumor microenvironments and intratumor heterogeneity. The oligodendroglioma dataset consists of mostly malignant cells (Supplementary Fig. 9a). We used densitycut to cluster the two-dimensional coordinates to produce 15 clusters (Supplementary Fig. 9b). The non-malignant cells (microglia/macrophages and oligodendrocytes) formed two small clusters on the left, each consisting of cells from different patients. We therefore computed the entropy of each cluster based on the cells of origin (enclosed bar plot). As expected, the non-malignant clusters (cluster one and cluster five) had high entropies. Cluster 12 (cells mostly from MGH53 and MGH54) and cluster 14 (cells from MGH93 and MGH94) also had high entropies (Fig. 5a). The cells in these two clusters consisted of mostly astrocytes (Fig. 5b; the oligodendroglioma cells could roughly be classified as oligodendrocyte, astrocyte, or stem-like cells). Interestingly, cluster 15 had the highest entropy, and these cells had significantly higher stem-like scores (one-sided Welch's t-test p-value < 10^−12). We also colored cells by the cell-cycle scores (G1/S scores, Supplementary Fig. 9c; G2/M scores, Supplementary Fig. 9d) and found that these cells also had significantly higher G1/S scores (one-sided Welch's t-test p-value < 10^−12) and G2/M scores (one-sided Welch's t-test p-value < 10^−9). Therefore, cluster 15 cells consisted of mostly stem-like cells, and these cells were cycling.
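The cluster entropy used here is simply the entropy of the patient-of-origin distribution within a cluster. A minimal sketch with illustrative (invented) label counts:

```python
import numpy as np
from collections import Counter

def cluster_entropy(patients_in_cluster):
    # Entropy (in bits) of the patient-of-origin labels within one cluster;
    # high entropy means the cluster mixes cells from several patients.
    counts = np.array(list(Counter(patients_in_cluster).values()), float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Example: a cluster dominated by one patient has low entropy.
print(cluster_entropy(["MGH36"] * 95 + ["MGH53"] * 5))   # ~0.29 bits
print(cluster_entropy(["MGH93"] * 50 + ["MGH94"] * 50))  # 1.0 bit
```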
Malignant cells formed distinct clusters even when they were from the same patient (Fig. 5a). We next colored each malignant cell by its lineage score (Fig. 5b). The cells in some clusters highly expressed the astrocyte gene markers or the oligodendrocyte gene markers. The stem-like cells tended to be rare, and they could link "outliers" connecting oligodendrocyte and astrocyte cells in the two-dimensional scatter plots (Fig. 5b). In addition, some clusters consisted of mixtures of cells (e.g., both oligodendrocyte and stem-like cells), suggesting that other factors such as genetic mutations and epigenetic measurements would be required to fully interpret the clustering structures in the dataset.

For the melanoma dataset, the authors profiled both malignant cells and non-malignant cells[3]. The malignant cells originating from different patients were mapped to the bottom left region (Fig. 5c). These malignant cells were further subdivided by the patients of origin (Fig. 5d). Similar to the oligodendroglioma dataset, non-malignant immune cells such as T cells, B cells, and macrophages, even from different patients, tended to be grouped together by cell types instead of the patients of origin of the cells (Fig. 5c, d), although for some patients (e.g., 75, 58, and 67, Fig. 5d), their immune cells showed patient-specific bias. We did a differential expression analysis of patient 75 T cells and other patient T cells using limma. Most of the top 100 differentially expressed genes were ribosome genes (Supplementary Fig. 10a), suggesting that batch effects could be detectable between patient 75 T cells and other patient T cells.

Fig. 5 scvis learned low-dimensional representations. a The oligodendroglioma dataset, each cell is colored by its patient of origin, b the oligodendroglioma dataset, each cell is colored by its lineage score from Tirosh et al., c the melanoma dataset, each cell is colored by its cell type, and d the melanoma dataset, each cell is colored by its patient of origin.

Interestingly, as non-malignant cells, cancer-associated fibroblasts (CAFs) were mapped to the region adjacent to the malignant cells. The endothelial cells were just above the CAFs (Fig. 5d). To test whether these cells were truly more similar to the malignant cells than to immune cells, we first computed the average principal component values in each type of cells and did a hierarchical clustering analysis (Supplementary Fig. 10b). Generally, there were two clusters: one cluster consisted of the immune cells and the "Unresolved normal" cells, while the other cluster consisted of CAFs, endothelial cells, malignant cells, and the "Unresolved" cells, indicating that CAFs and endothelial cells were more similar to malignant cells (they had high PC1 values) than to the immune cells.

Discussion
We have developed a novel method, scvis, for modeling and reducing dimensionality of single-cell gene expression data. We demonstrated that scvis can robustly preserve the structures in high-dimensional datasets, including in datasets with small numbers of data points.

Our contribution has several important implications for the field. As a probabilistic generative model, scvis provides not only the low-dimensional coordinates for a given data point but also the log-likelihood as a measure of the quality of the embedding. The log-likelihoods could potentially be used for outlier detection; e.g., for the bipolar cells in Fig. 4b, the log-likelihood histogram shows a long tail of data points with relatively low log-likelihoods, suggesting some outliers in this dataset (the non-bipolar cells). The log-likelihoods could also be useful in mapping new data. For example, although horizontal cells and retinal ganglion cells were mapped to the region adjacent to or overlapping the region occupied by amacrine cells, these cells exhibited low log-likelihoods, suggesting that further analyses were required to elucidate these cell types/subtypes.

scvis preserves the "global" structure in a dataset, greatly enhancing interpretation of projected structures in scRNA-seq data. For example, in the bipolar dataset, the "ON" bipolar cells were close to each other in the two-dimensional representation in Fig. 4a, and similarly, the "OFF" bipolar cells were close to each other. For the oligodendroglioma dataset, the cells can first be divided into normal cells and malignant cells. The normal cells formed two clusters, with each cluster consisting of cells from multiple patients. The malignant cells, although from the same patient, formed multiple clusters, with cell clusters from the same patient adjacent to each other. Adjacent malignant cell clusters from different patients tended to selectively express the oligodendrocyte marker genes or the astrocyte marker genes. For the metastatic melanoma dataset, malignant cells from different patients, although mapped to the same region, formed clusters based on the patient origin of the cells, while immune cells from different patients tended to be clustered together by cell types. From the low-dimensional representations, we can hypothesize that the CAFs were more "similar" to the malignant cells than to the immune cells.

Other methods, e.g., the SIMLR algorithm, improve the t-SNE algorithm by learning a similarity matrix between cells, and the similarity matrix is used as the input of t-SNE for dimension reduction. However, SIMLR is computationally expensive because its objective function involves large matrix multiplications (an N × N kernel matrix multiplying an N × N similarity matrix, where N is the number of cells). In addition, although the learned similarity matrix could help clustering analyses, it may distort the manifold structure, as demonstrated in the t-SNE plots on the learned similarity matrix, because the SIMLR objective function encourages forming clusters. The DeepCyTOF framework has a component that uses a denoising autoencoder (trained on the cells with few or no zero events) to filter CyTOF data to minimize the influence of dropout noise in single-cell data. The purpose of DeepCyTOF is quite different from that of scvis, which models and visualizes the low-dimensional structures in high-dimensional single-cell data. The most similar approach to scvis may be the parametric t-SNE algorithm, which uses a neural network to learn a parametric mapping from the high-dimensional space to a low-dimensional space. However, parametric t-SNE is not a probabilistic model, the learned low-dimensional embedding is difficult to interpret, and there are no likelihoods to quantify the uncertainty of each mapping.

In conclusion, the scvis algorithm provides a computational framework to compute low-dimensional embeddings of scRNA-seq data while preserving the global structure of the high-dimensional measurements. We expect scvis to model and visualize structures in scRNA-seq data while providing new means to biologically interpretable results. As technical advances to profile the transcriptomes of large numbers of single cells further mature, we envisage that scvis will be of great value for routine analysis of large-scale, high-resolution mapping of cell populations.

Methods
A latent variable model of single-cell data. We assume that the gene expression vector x_n of cell n is a random vector and is governed by a low-dimensional latent vector z_n. The graphical model representation of this latent variable model (with N cells) is shown in Fig. 6a. The x_n distribution could be a complex high-dimensional distribution.
We assume that it follows a Student's t-distribution given z_n:

$$
p(x_n \mid z_n, \theta) = \mathcal{T}\left(x_n \mid \mu_\theta(z_n), \sigma_\theta(z_n), \nu\right) \qquad (1)
$$

where both μ_θ(·) and σ_θ(·) are functions of z_n given by a neural network with parameter θ, and ν is the degree-of-freedom parameter, learned from data. The marginal distribution p(x_n | θ) = ∫ p(x_n | z_n, θ) p(z_n | θ) dz_n can model a complex high-dimensional distribution.

We are interested in the posterior distribution of the low-dimensional latent variable given data, p(z_n | x_n, θ), which is intractable to compute. To approximate the posterior, we use the variational distribution q(z_n | x_n, ϕ) = N(μ_ϕ(x_n), diag(σ_ϕ(x_n))) (Fig. 6b). Both μ_ϕ(·) and σ_ϕ(·) are functions of x_n through a neural network with parameter ϕ. Although the number of latent variables grows with the number of cells, these latent variables are governed by a neural network with a fixed set of parameters ϕ. Therefore, even for datasets with large numbers of cells, we can still efficiently infer the posterior distributions of the latent variables. The model coupled with the variational inference is called the variational autoencoder[56,57].

Fig. 6 The scvis directed probabilistic graphical model and the variational approximation of its posterior. Circles represent random variables. Squares represent deterministic parameters. Shaded nodes are observed, and unshaded nodes are hidden. Here we use the plate notation, i.e., nodes inside each box will get repeated when the node is unpacked (the number of repeats is on the bottom right corner of each box). Each node and its parents constitute a family. Given the parents, a random variable is independent of the ancestors. Therefore, the joint distribution of all the random variables is the product of the family conditional distributions. a The generative model to generate data x_n, and b the variational approximation q(z_n | x_n, ϕ) to the posterior p(z_n | x_n, θ).
Now the problem is to find the variational parameter ϕ such that the approximation q(z_n | x_n, ϕ) is as close as possible to the true posterior distribution p(z_n | x_n, θ). The quality of the approximation is measured by the Kullback–Leibler (KL) divergence:

$$
\begin{aligned}
\mathrm{KL}\left(q(z_n \mid x_n, \phi)\,\|\,p(z_n \mid x_n, \theta)\right)
&= \int q(z_n \mid x_n, \phi)\,\log\frac{q(z_n \mid x_n, \phi)}{p(z_n \mid x_n, \theta)}\,\mathrm{d}z_n \\
&= \int q(z_n \mid x_n, \phi)\,\log\frac{q(z_n \mid x_n, \phi)\,p(x_n \mid \theta)}{p(z_n, x_n \mid \theta)}\,\mathrm{d}z_n \\
&= \mathbb{E}_{q(z_n \mid x_n, \phi)}\left[\log q(z_n \mid x_n, \phi)\right]
 - \mathbb{E}_{q(z_n \mid x_n, \phi)}\left[\log p(z_n, x_n \mid \theta)\right]
 + \log p(x_n \mid \theta) && (2) \\
&= \mathrm{KL}\left(q(z_n \mid x_n, \phi)\,\|\,p(z_n \mid \theta)\right)
 - \mathbb{E}_{q(z_n \mid x_n, \phi)}\left[\log p(x_n \mid z_n, \theta)\right]
 + \log p(x_n \mid \theta) && (3)
\end{aligned}
$$

The term E_{q(z_n | x_n, ϕ)}[log p(z_n, x_n | θ)] − E_{q(z_n | x_n, ϕ)}[log q(z_n | x_n, ϕ)] in Eq. (2) is the evidence lower bound (ELBO), because it is a lower bound of log p(x_n | θ), as the KL divergence on the left-hand side is non-negative. We therefore can do maximum-likelihood estimation of both θ and ϕ by maximizing the ELBO. Notice that in the Bayesian setting, the ELBO is a lower bound of the evidence log p(x_n), as the parameters θ are also latent random variables.
Both the prior p(z_n | θ) and the variational distribution q(z_n | x_n, ϕ) in the ELBO of the form in Eq. (3) are distributions of z_n. In our case, we can compute the KL term analytically because the prior is a multivariate normal distribution, and the variational distribution is also a multivariate normal distribution given x_n. However, typically there is no closed-form expression for the integration E_{q(z_n | x_n, ϕ)}[log p(x_n | z_n, θ)], because we should integrate out z_n, and the parameters of the model μ_θ(z_n) and diag(σ_θ(z_n)) are functions of z_n. Instead, we can use Monte Carlo integration and obtain the estimated evidence lower bound for the nth cell:

$$
\mathrm{ELBO}_n = -\mathrm{KL}\left(q(z_n \mid x_n, \phi)\,\|\,p(z_n \mid \theta)\right)
 + \frac{1}{L}\sum_{l=1}^{L}\log p(x_n \mid z_{n,l}, \theta) \qquad (4)
$$

where z_{n,l} is sampled from q(z_n | x_n, ϕ) and L is the number of samples. We want to take the partial derivatives of the ELBO w.r.t. the variational parameter ϕ and the generative model parameter θ to find a local maximum of the ELBO. However, if we directly sample points from q(z_n | x_n, ϕ), it is impossible to use the chain rule to take the partial derivative of the second term of Eq. (4) w.r.t. ϕ, because z_{n,l} is a number (the sampled value carries no functional dependence on ϕ). To use gradient-based methods for optimization, we indirectly sample data from q(z_n | x_n, ϕ) using the "reparameterization trick"[56,57]. Specifically, we first sample ε_l from an easy-to-sample distribution ε_l ∼ p(ε | α), e.g., a standard multivariate Gaussian distribution in our case. Next we pass ε_l through a continuous function g_ϕ(ε, x_n) to get a sample from q(z_n | x_n, ϕ). For our case, if q(z_n | x_n, ϕ) = N(μ_ϕ(x_n), diag(σ_ϕ(x_n))), then g_ϕ(ε, x_n) = μ_ϕ(x_n) + diag(σ_ϕ(x_n)) × ε.
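Equation (4) and the reparameterization trick combine into a few lines. In the sketch below, the Gaussian-versus-standard-normal KL term is written in its closed form, and `encode`/`decode_logpdf` are placeholders for the trained inference and model networks.

```python
import numpy as np

def elbo(x, encode, decode_logpdf, n_samples=1, rng=np.random.default_rng(0)):
    """Monte Carlo ELBO of Eq. (4) with the reparameterization trick.
    The KL term between q(z | x) = N(mu, diag(sigma^2)) and the standard
    normal prior is available in closed form."""
    mu, sigma = encode(x)
    kl = 0.5 * np.sum(mu**2 + sigma**2 - 2.0 * np.log(sigma) - 1.0)
    recon = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.size)
        z = mu + sigma * eps       # z = g_phi(eps, x), differentiable in phi
        recon += decode_logpdf(x, z)
    return recon / n_samples - kl
```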
Adding regularizers on the latent variables. Given i.i.d. data D = {x_n}_{n=1}^{N}, by maximizing the ELBOs {ELBO_n}_{n=1}^{N}, we can do maximum-likelihood estimation of the model parameters θ and the variational distribution parameters ϕ. Although p(z_n | θ) p(x_n | z_n, θ) may model the data distribution very well, the variational distribution q(z_n | x_n, ϕ) is not necessarily good for visualization purposes. Specifically, it is possible that there are no very clear gaps among the points from different clusters. In fact, to model the data distribution well, the low-dimensional z space tends to be filled, such that all of the z space is used in modeling the data distribution. To better visualize the manifold structure of a dataset, we need to add regularizers to the objective function in Eq. (4) to encourage forming gaps between clusters while at the same time keeping nearby points in the high-dimensional space nearby in the low-dimensional space[34–39]. Here we use the non-symmetrized t-SNE objective function.

The t-SNE algorithm preserves the local structure of the high-dimensional space after dimension reduction. To measure the "localness" of a pairwise distance, for a data point i in the high-dimensional space, the pairwise distance between i and another data point j is transformed to a conditional distribution by centering an isotropic univariate Gaussian distribution at i:

$$
p_{j \mid i} = \frac{\exp\left(-\lVert x_i - x_j\rVert^2 / 2\sigma_i^2\right)}
{\sum_{k \neq i}\exp\left(-\lVert x_i - x_k\rVert^2 / 2\sigma_i^2\right)} \qquad (5)
$$

The point-specific standard deviation σ_i is a parameter that is computed automatically, in such a way that the perplexity 2^{−Σ_j p_{j|i} log2 p_{j|i}} of the conditional distribution p_{j|i} equals a user-defined hyperparameter (e.g., typically 30).
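In standard t-SNE practice, σ_i is found by a binary search on the precision so that the perplexity of p_{·|i} matches its target; the paper does not reproduce this routine, so the tolerances and iteration cap below are our assumptions.

```python
import numpy as np

def conditional_p(dist2_i, perplexity, tol=1e-5, n_iter=50):
    """Binary-search the precision beta = 1 / (2 * sigma_i^2) so that the
    conditional distribution p_{j|i} of Eq. (5) has the target perplexity
    2^H(p). `dist2_i` holds squared distances from point i to all OTHER
    points (the self term p_{i|i} = 0 is excluded)."""
    lo, hi, beta = 0.0, np.inf, 1.0
    target = np.log2(perplexity)
    for _ in range(n_iter):
        p = np.exp(-beta * dist2_i)
        p /= p.sum()
        entropy = -(p * np.log2(p + 1e-12)).sum()
        if abs(entropy - target) < tol:
            break
        if entropy > target:   # distribution too flat: increase the precision
            lo, beta = beta, beta * 2.0 if hi == np.inf else (beta + hi) / 2.0
        else:                  # distribution too peaked: decrease the precision
            hi, beta = beta, (lo + beta) / 2.0
    return p
```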
We set p_{i|i} = 0 because only pairwise similarities are of interest. In the low-dimensional space, the conditional distribution q_{j|i} is defined similarly, and q_{i|i} is set to 0. The only difference is that an unscaled univariate Student's t-distribution is used instead of an isotropic univariate Gaussian distribution as in the high-dimensional space. Because in the high-dimensional space more points can be close to each other than in the low-dimensional space (e.g., only two points can be mutually equidistant on a line, three points in a two-dimensional plane, and four points in a three-dimensional space), it is impossible to faithfully preserve the high-dimensional pairwise distance information in the low-dimensional space if the intrinsic dimensionality of the data is bigger than that of the low-dimensional space. A heavy-tailed Student's t-distribution allows moderate distances in the high-dimensional space to be modeled by much larger distances in the low-dimensional space, to prevent crushing different clusters together in the low-dimensional space[34].

The low-dimensional embedding coordinates {z_i}_{i=1}^{N} are obtained by minimizing the sum of the KL divergences between the conditional distributions:

$$
\begin{aligned}
\sum_{i=1}^{N}\mathrm{KL}\left(p_{\cdot \mid i}\,\|\,q_{\cdot \mid i}\right)
&= \sum_{i=1}^{N}\sum_{j \neq i} p_{j \mid i}\log\frac{p_{j \mid i}}{q_{j \mid i}}
 = \sum_{i=1}^{N}\sum_{j \neq i} p_{j \mid i}\log p_{j \mid i}
 - \sum_{i=1}^{N}\sum_{j \neq i} p_{j \mid i}\log q_{j \mid i} \\
&\propto -\sum_{i=1}^{N}\sum_{j \neq i} p_{j \mid i}\log
 \frac{\left(1 + \lVert z_i - z_j\rVert^2/\nu\right)^{-\frac{\nu+1}{2}}}
 {\sum_{k \neq i}\left(1 + \lVert z_i - z_k\rVert^2/\nu\right)^{-\frac{\nu+1}{2}}} \\
&\propto \frac{\nu+1}{2}\sum_{i=1}^{N}\sum_{j \neq i} p_{j \mid i}\log\left(1 + \lVert z_i - z_j\rVert^2/\nu\right) && (6) \\
&\quad + \sum_{i=1}^{N}\log\sum_{k \neq i}\left(1 + \lVert z_i - z_k\rVert^2/\nu\right)^{-\frac{\nu+1}{2}} && (7)
\end{aligned}
$$

Here ν is the degree of freedom of the Student's t-distribution, which is typically set to one (the standard Cauchy distribution) or learned from data. Equation (6) is a data-dependent term (depending on the high-dimensional data) that keeps nearby data points in the high-dimensional data nearby in the low-dimensional space. Equation (7) is a data-independent term that pushes data points in the low-dimensional space apart from each other. Notice that the t-SNE objective function minimizes the KL divergence of the joint distributions, defined as the symmetrized conditional distributions p_{i,j} = (p_{i|j} + p_{j|i})/(2 × N) and q_{i,j} = (q_{i|j} + q_{j|i})/(2 × N). t-SNE has shown excellent results on many visualization tasks, such as visualizing scRNA-seq data and CyTOF data.

The final objective function is a weighted combination of the ELBO of the latent variable model and the above asymmetric t-SNE objective function:

$$
\operatorname*{arg\,min}_{\theta,\phi}\;\sum_{n=1}^{N} -\mathrm{ELBO}_n
 + \alpha\sum_{n=1}^{N}\mathrm{KL}\left(p_{\cdot \mid n}\,\|\,q_{\cdot \mid n}\right) \qquad (8)
$$

The parameter α is set to the dimensionality of the input high-dimensional data, because the magnitude of the log-likelihood term in the ELBO scales with the dimensionality of the input data. The perplexity parameter is set to ten for scvis.
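A direct transcription of the regularizer of Eqs. (6)–(7) follows, dropping the constant Σ p log p term; P is assumed to be precomputed with the perplexity calibration sketched above.

```python
import numpy as np

def tsne_regularizer(P, Z, nu=1.0):
    """Asymmetric t-SNE regularizer, sum_i KL(p_.|i || q_.|i), up to the
    constant sum_i sum_j p_{j|i} log p_{j|i}. P holds the high-dimensional
    conditional probabilities p_{j|i} (rows sum to one, zero diagonal) and
    Z holds the low-dimensional coordinates."""
    n = Z.shape[0]
    d2 = np.square(Z[:, None, :] - Z[None, :, :]).sum(-1)
    log_t = -(nu + 1.0) / 2.0 * np.log1p(d2 / nu)   # log unnormalized q_{j|i}
    np.fill_diagonal(log_t, -np.inf)                # enforce q_{i|i} = 0
    log_q = log_t - np.log(np.exp(log_t).sum(axis=1, keepdims=True))
    off_diag = ~np.eye(n, dtype=bool)
    return -(P[off_diag] * log_q[off_diag]).sum()
```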
Sensitivity of scvis on hyperparameters. scvis has several hyperparameters, such as the numbers of layers in the variational inference and model neural networks, the layer sizes (the number of units in each layer), and the α parameter in Eq. (8). We established that scvis is typically robust to these hyperparameters.

We first tested the influence of the number of layers using batch six of the bipolar dataset, learning scvis models with different numbers of layers from the 3110 subsampled data points. Specifically, we tested these variational inference network and model network layer combinations (the first number is the number of layers in the variational inference network, the second the number of layers in the model network): (10, 1), (10, 3), (10, 5), (10, 7), (10, 10), (7, 10), (5, 10), (3, 10), and (1, 10). These models all performed reasonably well, in that cells from the same cluster were close to each other in the two-dimensional spaces (Supplementary Fig. 14). When the number of layers in the variational inference network was fixed at ten, the two-dimensional embeddings of some types of cells were close to each other and formed curve-like structures (Supplementary Fig. 14e). The reason for this phenomenon could be that the variational inference network underestimated the variances of the latent z posterior distributions, or that the optimization had not converged. In contrast, when the inference networks had fewer layers (<10), we did not see these curve structures (Supplementary Fig. 14f-i). The out-of-sample mapping results in Supplementary Fig. 15 show similar behavior.

We computed the Knn classification accuracies of the out-of-samples. As before, the Knn classifiers were trained on the two-dimensional coordinates of the subsampled training data and then used to classify the two-dimensional coordinates of the out-of-sample data. The layer-combination setting did influence scvis performance: for each K ∈ {5, 9, 17, 33, 65, 129, 257}, the trained scvis models performed significantly differently (Supplementary Fig. 16, FDR < 0.05, one-way analysis of variance (ANOVA) test). To find out which parameter combinations led to inferior or superior performance, we compared the classification accuracies of each model with those of the most complex model, with ten layers in both the variational inference network and the model network. The FDR values (two-sided Welch's t-test) at the top of each subfigure of Supplementary Fig. 16 show that, except for K = 257, all the models with a one-layer variational inference network did significantly worse than the most complex model (FDR < 0.05). Similarly, the models with three-layer variational inference networks did significantly worse than the most complex model for K ∈ {5, 9, 17, 33, 65}. The performances of the other models were not statistically different from those of the most complex model.

We next examined the influence of the layer sizes of the neural networks. The number of layers was fixed at ten for both the variational inference and model networks, and the number of units in each layer was set to 8, 16, 32, 64, or 128; all layers of the inference and model networks had the same size. All models successfully embedded both the training data and the out-of-sample test data (Supplementary Fig. 17). However, the layer size did influence scvis performance: the Knn classifiers on the out-of-sample data performed significantly differently (Supplementary Fig. 18, FDR < 0.05, one-way ANOVA test). The FDR values (two-sided Mann-Whitney U-test) at the top of each subfigure of Supplementary Fig. 18 show that all models with a layer size of eight did significantly worse than the most complex model with 128 units (FDR < 0.05). Similarly, the models with a layer size of 16 did significantly worse than the most complex model for K ∈ {5, 9, 17, 33, 65, 129} (FDR < 0.05), while the performances of the other models were not statistically different from those of the most complex model. Notice that, at a layer size of 64, the mapping functions from one run were worse than the others at embedding the out-of-sample data; however, there was no significant difference in the log-likelihoods across the ten repeated runs (Supplementary Fig. 19, one-way ANOVA p-value = 0.741).

For the α weight parameter in Eq. (8), we set α relative to the dimensionality of the input data, at 0, 0.5, 1.0, 1.5, 2.0, 10.0, and inf times the input dimensionality. When α = inf, the trained models did significantly worse than the models trained with the default α (equal to the dimensionality of the input data) for K ∈ {5, 9, 17, 33, 65} (FDR ≤ 0.05, two-sided Welch's t-test, Supplementary Figs. 20, 21 and 22). Also, when α = 0, the trained models were significantly worse than the models trained with the default α for all Ks (FDR ≤ 0.05, two-sided Welch's t-test, Supplementary Fig. 22). For α = 0, we performed an extra comparison using the synthetic nine-dimensional data, showing that when K was large (≥65), setting α = 9 (the dimensionality of the input data) did significantly better than setting α = 0 (Supplementary Fig. 23, FDR ≤ 0.05, one-sided Welch's t-test).
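Eq. (8) itself is outside this excerpt, so the sketch below is only schematic: it assumes the t-SNE term enters the loss weighted by α against the variational (ELBO) term, and expresses the α grid above as multipliers of the input dimensionality D. All names are ours and hypothetical:

```python
# Multipliers of the input dimensionality D tested in the sensitivity
# analysis above (inf collapses the objective to the t-SNE term alone).
ALPHA_MULTIPLIERS = [0.0, 0.5, 1.0, 1.5, 2.0, 10.0, float("inf")]

def combined_loss(neg_elbo, tsne_loss, input_dim, multiplier=1.0):
    """Schematic scvis-style objective: negative ELBO plus an
    alpha-weighted t-SNE regularizer, with alpha = multiplier * D.
    multiplier = 1.0 corresponds to the default alpha = D setting."""
    alpha = multiplier * input_dim
    return neg_elbo + alpha * tsne_loss
```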
Computational complexity analysis. The scvis objective function involves the asymmetric t-SNE objective function. The most time-consuming part is computing the pairwise distances between cells in a mini-batch, which takes $O(TN^2D + TN^2d)$ time, where N is the mini-batch size (we use N = 512 in this study), D is the dimensionality of the input data, d is the dimensionality of the low-dimensional latent variables (e.g., d = 2 for most cases), and T is the number of iterations. In our case, we first use PCA to project the scRNA-seq data to a 100-dimensional space, so D = 100. For a feedforward neural network with L hidden layers, where the number of neurons in layer l is $n_l$, the time complexity to train the network is $O\left(NT \sum_{l=0}^{L} n_{l+1} n_l\right)$, where $n_0 = D$ and $n_{L+1}$ is the size of the output layer. For the model neural network, we use a feedforward network with five hidden layers (of sizes 32, 32, 32, 64, and 128); its input layer size is d and its output layer size is D. For the variational inference network, we use a feedforward network with three hidden layers (of sizes 128, 64, and 32); its input layer size is D and its output layer size is d.

For space complexity, we need to store the weights and biases of each neuron, $O\left(\sum_{l=0}^{L} n_{l+1} n_l\right)$, as well as the $O(N^2)$ pairwise distances and the data of size $O(ND)$ in a mini-batch.

The original t-SNE algorithm is not scalable to large datasets (tens of thousands to millions of cells) because it needs to compute the pairwise distances between every two cells, taking $O(M^2D + M^2T)$ time and $O(M^2)$ space, where M is the total number of cells and T is the number of iterations. Approximate t-SNE algorithms are typically more scalable in both time and space. For example, Barnes-Hut (BH) t-SNE only computes the distances between a cell and its Knns, so it takes $O(M \log(M))$ time and $O(M \log(M))$ space, where we assume K is on the order of $O(\log(M))$.

We next experimentally compared the scalability of scvis and BH t-SNE (the widely used Rtsne package, ref. 48) using the 1.3 million cells from 10X Genomics (ref. 59). BH t-SNE did not finish in 24 h and we terminated it. In contrast, scvis produced visually reasonable results in <33 min (after 3000 mini-batches of training, Supplementary Fig. 24a). Therefore, scvis can be much more scalable than BH t-SNE for very large datasets. As we increased the number of training mini-batches, we saw slightly better separation of clusters (Supplementary Fig. 24b-f). The time used to train scvis increased linearly with the number of training mini-batches (Supplementary Fig. 24h). However, for small datasets, BH t-SNE can be more efficient than scvis. For example, for the melanoma dataset, with only 4645 cells, scvis still took 24 min to run 3000 mini-batches, while BH t-SNE finished in only 28.9 s. All experiments were conducted on a Mac computer with 32 GB of RAM and a 4.2 GHz four-core Intel i7 processor with 8 MB cache.

Finally, once the mapping function is trained, mapping new cells takes only $O\left(M \sum_{l=0}^{L} n_{l+1} n_l\right)$ time, where M is the number of input cells. Moreover, because each data point can be mapped independently, the space complexity can be as low as $O\left(\sum_{l=0}^{L} n_{l+1} n_l\right)$. As an example, it took only 1.5 s for a trained scvis model to map the entire 1.3 million cells from 10X Genomics.
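These cost expressions are easy to evaluate for the specific architecture used here. A small sketch of our own (output sizes simplified: the real output layers emit both location and scale parameters):

```python
def linear_cost(layer_sizes):
    """sum_l n_{l+1} * n_l: per-example multiply count (and, up to bias
    terms, the parameter count) of a feedforward network."""
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

D, d, N = 100, 2, 512  # input PCs, latent dimension, mini-batch size

# Layer sizes as described above (outputs simplified to d and D).
inference_net = [D, 128, 64, 32, d]
model_net = [d, 32, 32, 32, 64, 128, D]

pairwise_per_batch = N * N * (D + d)  # the O(N^2 D + N^2 d) distance term
network_per_batch = N * (linear_cost(inference_net) + linear_cost(model_net))
print(pairwise_per_batch, network_per_batch)
```

For this configuration the two per-iteration terms come out comparable in magnitude (roughly 27 million versus 25 million multiplies per mini-batch).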
Datasets. The oligodendroglioma dataset measures the expression of 23,686 genes in 4347 cells from six IDH1 or IDH2 mutant human oligodendroglioma patients (ref. 44). The expression of each gene is quantified as log2(TPM/10 + 1), where "TPM" stands for "transcripts per million" (ref. 60). Through copy number estimation from these scRNA-seq measurements, 303 cells without detectable copy number alterations were classified as normal cells. These normal cells can be further grouped into microglia and oligodendrocytes based on a set of marker genes they express. Two patients show subclonal copy number alterations.

The melanoma dataset is from sequencing 4645 cells isolated from 19 metastatic melanoma patients (ref. 3). The cDNAs from each cell were sequenced on an Illumina NextSeq 500 instrument to 30 bp paired-end reads with a median of ~150,000 reads per cell. The expression of each gene (23,686 genes in total) is quantified as log2(TPM/10 + 1). In addition to malignant cells, the authors also profiled immune cells, stromal cells, and endothelial cells to study the whole-tumor multicellular ecosystem.

The bipolar dataset consists of low-coverage (median depth of 8200 mapped reads per cell) Drop-seq sequencing of 27,499 mouse retinal bipolar neural cells from a transgenic mouse (ref. 1). In total, 26 putative cell types were identified by clustering the first 37 principal components of all 27,499 cells. Fourteen clusters can be assigned to bipolar cells, and another major cluster is composed of Mueller glia cells; these 15 clusters account for about 96% of all 27,499 cells. The remaining 11 clusters (comprising only 1060 cells) include rod photoreceptors, cone photoreceptors, amacrine cells, and cell doublets and contaminants (ref. 1).

The retina dataset consists of low-coverage Drop-seq sequencing of 44,808 cells from the retinas of 14-day-old mice (ref. 9). By clustering the two-dimensional t-SNE embedding using DBSCAN (ref. 61), a density-based clustering algorithm, the authors identified 39 clusters after merging clusters without enough differentially expressed genes between them.

The 10X Genomics neural cell dataset consists of 1,306,127 cells from the cortex, hippocampus, and subventricular zones of two E18 C57BL/6 mice. The cells were sequenced on 11 Illumina HiSeq 4000 machines to produce 98 bp reads (ref. 59).

For the mass cytometry dataset H1 (ref. 17), manual gating assigned 72,463 cells to 14 cell types based on 32 measured surface protein markers. Manual gating assigned 31,721 cells from H2 to the same 14 cell populations based on the same 32 surface protein markers.
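For concreteness, the quantification used by both tumor datasets is a one-line transform (a sketch of our own; `tpm` stands for any cells-by-genes TPM matrix):

```python
import numpy as np

def log_tpm(tpm):
    """log2(TPM/10 + 1), the gene-expression quantification used for the
    oligodendroglioma and melanoma datasets."""
    return np.log2(np.asarray(tpm, dtype=float) / 10.0 + 1.0)
```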
Statistical analysis. All statistical analyses were performed using the R statistical software package, version 3.4.3. Boxplots denote the medians and the interquartile ranges (IQRs); the whiskers of a boxplot are the lowest datum still within 1.5 IQR of the lower quartile and the highest datum still within 1.5 IQR of the upper quartile. The full datasets were superimposed on the boxplots. For datasets with non-normal distributions (e.g., with outliers), non-parametric tests were used. To account for unequal variances, Welch's t-test was used for pairwise data comparisons. Adjusted p-values <0.05 (FDR, the Benjamini-Hochberg procedure, ref. 62) were considered significant.

Code availability. The scvis v0.1.0 Python package is freely available from bitbucket: https://bitbucket.org/jerry00/scvis-dev.

Data availability. The scRNA-seq data that support the findings of this study are available in the Gene Expression Omnibus with the identifiers (bipolar: GSE81905, retina: GSE63473, oligodendroglioma: GSE70630, metastatic melanoma: GSE72056, E18 mouse neural cells: GSE93421). The E18 mouse neural cells are freely available from 10X Genomics (ref. 59). The other scRNA-seq data are publicly available from the single-cell portal (ref. 45). The mass cytometry data can be downloaded from Cytobank (ref. 63). The synthetic data used in this study are available from the bitbucket repo: https://bitbucket.org/jerry00/scvis-dev.

Received: 16 July 2017. Accepted: 25 April 2018.
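The corrections above were performed in R 3.4.3 as stated; for readers reproducing the analyses in Python, a minimal NumPy implementation of the Benjamini-Hochberg step-up procedure (ref. 62) is sketched below (equivalently, statsmodels' `multipletests` with `method="fdr_bh"` can be used):

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (FDR)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Step-up: enforce monotonicity from the largest p-value downwards.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.minimum(adjusted, 1.0)
    return out
```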
References
1. Shekhar, K. et al. Comprehensive classification of retinal bipolar neurons by single-cell transcriptomics. Cell 166, 1308–1323 (2016).
2. Patel, A. P. et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344, 1396–1401 (2014).
3. Tirosh, I. et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science 352, 189–196 (2016).
4. Navin, N. et al. Tumor evolution inferred by single cell sequencing. Nature 472, 90–94 (2011).
5. Picelli, S. et al. Full-length RNA-seq from single cells using Smart-seq2. Nat. Protoc. 9, 171–181 (2014).
6. Jaitin, D. A. et al. Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types. Science 343, 776–779 (2014).
7. Islam, S. et al. Quantitative single-cell RNA-seq with unique molecular identifiers. Nat. Methods 11, 163–166 (2014).
8. Macaulay, I. C. et al. G&T-seq: parallel sequencing of single-cell genomes and transcriptomes. Nat. Methods 12, 519–522 (2015).
9. Macosko, E. Z. et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).
10. Klein, A. M. et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201 (2015).
11. Hashimshony, T. et al. CEL-Seq2: sensitive highly-multiplexed single-cell RNA-seq. Genome Biol. 17, 77 (2016).
12. Gierahn, T. M. et al. Seq-Well: portable, low-cost RNA sequencing of single cells at high throughput. Nat. Methods 14, 395–398 (2017).
13. Zheng, G. X. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
14. Cao, J. et al. Comprehensive single-cell transcriptional profiling of a multicellular organism. Science 357, 661–667 (2017).
15. Rosenberg, A. B. et al. Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. Science 360, 176–182 (2018).
16. Bendall, S. C. et al. Single-cell mass cytometry of differential immune and drug responses across a human hematopoietic continuum. Science 332, 687–696 (2011).
17. Levine, J. H. et al. Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell 162, 184–197 (2015).
18. Regev, A. et al. The human cell atlas. eLife https://doi.org/10.7554/eLife.27041 (2017).
19. Wagner, A., Regev, A. & Yosef, N. Revealing the vectors of cellular identity with single-cell genomics. Nat. Biotechnol. 34, 1145–1160 (2016).
20. Buettner, F. et al. Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat. Biotechnol. 33, 155–160 (2015).
21. Vallejos, C. A., Risso, D., Scialdone, A., Dudoit, S. & Marioni, J. C. Normalizing single-cell RNA sequencing data: challenges and opportunities. Nat. Methods 14, 565–571 (2017).
22. Bacher, R. et al. SCnorm: robust normalization of single-cell RNA-seq data. Nat. Methods 14, 584–586 (2017).
23. Qiu, X. et al. Single-cell mRNA quantification and differential analysis with Census. Nat. Methods 14, 309–315 (2017).
24. Svensson, V. et al. Power analysis of single-cell RNA-sequencing experiments. Nat. Methods 14, 381–387 (2017).
25. Ziegenhain, C. et al. Comparative analysis of single-cell RNA sequencing methods. Mol. Cell 65, 631–643 (2017).
26. Stegle, O., Teichmann, S. A. & Marioni, J. C. Computational and analytical challenges in single-cell transcriptomics. Nat. Rev. Genet. 16, 133–145 (2015).
27. Angerer, P. et al. destiny: diffusion maps for large-scale single-cell data in R. Bioinformatics 32, 1241–1243 (2015).
28. Pierson, E. & Yau, C. ZIFA: dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol. 16, 241 (2015).
29. DeTomaso, D. & Yosef, N. FastProject: a tool for low-dimensional analysis of single-cell RNA-seq data. BMC Bioinformatics 17, 315 (2016).
30. Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 32, 381–386 (2014).
31. Setty, M. et al. Wishbone identifies bifurcating developmental trajectories from single-cell data. Nat. Biotechnol. 34, 637–645 (2016).
32. Campbell, K. R. & Yau, C. Probabilistic modeling of bifurcations in single-cell gene expression data using a Bayesian mixture of factor analyzers. Wellcome Open Res. 2, 19 (2017).
33. Street, K. et al. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. bioRxiv https://doi.org/10.1101/128843 (2017).
34. Maaten, L. v. d. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
35. Hinton, G. E. & Roweis, S. T. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems 15 (eds Becker, S., Thrun, S. & Obermayer, K.) 857–864 (MIT Press, Cambridge, 2003).
36. Cook, J., Sutskever, I., Mnih, A. & Hinton, G. E. Visualizing similarity data with a mixture of maps. In Proc. Eleventh International Conference on Artificial Intelligence and Statistics, Vol. 2 of Proceedings of Machine Learning Research (eds Meila, M. & Shen, X.) 67–74 (PMLR, San Juan, Puerto Rico, 2007).
37. Carreira-Perpinán, M. A. The elastic embedding algorithm for dimensionality reduction. In Proc. 27th International Conference on Machine Learning 167–174 (Haifa, Israel, 2010).
38. Yang, Z., Peltonen, J. & Kaski, S. Scalable optimization of neighbor embedding for visualization. In Proc. 30th International Conference on Machine Learning (eds Dasgupta, S. & McAllester, D.) 127–135 (PMLR, Atlanta, Georgia, 2013).
39. Maaten, L. v. d. Accelerating t-SNE using tree-based algorithms. J. Mach. Learn. Res. 15, 3221–3245 (2014).
40. Amir, E.-a. D. et al. viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia. Nat. Biotechnol. 31, 545–552 (2013).
41. Zurauskiene, J. & Yau, C. pcaReduce: hierarchical clustering of single cell transcriptional profiles. BMC Bioinformatics 17, 140 (2016).
42. Wattenberg, M., Viégas, F. & Johnson, I. How to use t-SNE effectively. Distill http://distill.pub/2016/misread-tsne (2016).
43. Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous systems. Preprint at https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45166.pdf (2015).
44. Tirosh, I. et al. Single-cell RNA-seq supports a developmental hierarchy in human oligodendroglioma. Nature 539, 309–313 (2016).
45. Tickle, T. et al. Single cell portal. https://portals.broadinstitute.org/single_cell (2017).
46. Clevert, D.-A., Unterthiner, T. & Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). In 4th International Conference on Learning Representations (San Juan, Puerto Rico, 2016).
47. Kingma, D. P. & Ba, J. L. Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations (San Diego, CA, 2015).
48. Krijthe, J. H. Rtsne: t-distributed stochastic neighbor embedding using Barnes-Hut implementation. https://github.com/jkrijthe/Rtsne, R package version 0.13 (2015).
49. Lawrence, N. D. Gaussian process latent variable models for visualisation of high dimensional data. In Advances in Neural Information Processing Systems 16 (eds Thrun, S., Saul, L. K. & Schölkopf, B.) 329–336 (MIT Press, Cambridge, 2004).
50. GPy. GPy: a Gaussian process framework in Python. http://github.com/SheffieldML/GPy (2012).
51. Maaten, L. Learning a parametric embedding by preserving local structure. In Proc. Twelfth International Conference on Artificial Intelligence and Statistics, Vol. 5 of Proceedings of Machine Learning Research (eds van Dyk, D. & Welling, M.) 384–391 (PMLR, Clearwater Beach, Florida, 2009).
52. Ding, J., Shah, S. & Condon, A. densityCut: an efficient and versatile topological approach for automatic clustering of biological data. Bioinformatics 32, 2567–2576 (2016).
53. Smyth, G. Limma: linear models for microarray data. In Bioinformatics and Computational Biology Solutions Using R and Bioconductor (eds Gentleman, R., Carey, V. J., Huber, W., Irizarry, R. A. & Dudoit, S.) 397–420 (Springer, New York, 2005).
54. Wang, B., Zhu, J., Pierson, E. & Batzoglou, S. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat. Methods 14, 414–416 (2017).
55. Li, H. et al. Gating mass cytometry data by deep learning. Bioinformatics 33, 3423–3430 (2017).
56. Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. In Proc. 2nd International Conference on Learning Representations (Banff, Alberta, 2014).
57. Rezende, D. J., Mohamed, S. & Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Proc. 31st International Conference on Machine Learning (eds Xing, E. P. & Jebara, T.) 1278–1286 (PMLR, Beijing, 2014).
58. Kullback, S. & Leibler, R. A. On information and sufficiency. Ann. Math. Stat. 22, 79–86 (1951).
59. 10X Genomics. 1.3 million brain cells from E18 mice. https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.3.0/1M_neurons (2017).
60. Wagner, G. P., Kin, K. & Lynch, V. J. Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci. 131, 281–285 (2012).
61. Ester, M. et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD'96 Proc. Second International Conference on Knowledge Discovery and Data Mining (eds Simoudis, E., Han, J. & Fayyad, U.) 226–231 (AAAI Press, Portland, Oregon, 1996).
62. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B (Methodol.) 57, 289–300 (1995).
63. Levine, J. H. et al. Phenograph. https://www.cytobank.org/nolanlab/reports/Levine2015.html (2015).

Acknowledgements
This work was supported by a Discovery Frontiers project grant, "The Cancer Genome Collaboratory", jointly sponsored by the Natural Sciences and Engineering Research Council (NSERC), Genome Canada (GC), the Canadian Institutes of Health Research (CIHR), and the Canada Foundation for Innovation (CFI) to S.P.S. In addition, we acknowledge generous long-term funding support from the BC Cancer Foundation. The S.P.S. group receives operating funds from the Canadian Breast Cancer Foundation, the Canadian Cancer Society Research Institute (impact grant 701584 to S.P.S.), the Terry Fox Research Institute (grant 1021, The Terry Fox New Frontiers Program Project Grant in the Genomics of Forme Fruste Tumours: New Vistas on Cancer Biology and Treatment, and grant 1061, The Terry Fox New Frontiers Program Project Grant in Overcoming Treatment Failure in Lymphoid Cancers), CIHR (grant MOP-115170 to S.P.S.), and CIHR Foundation (grant FDN-143246 to S.P.S.). S.P.S. is supported by Canada Research Chairs. S.P.S. is a Michael Smith Foundation for Health Research scholar. S.P.S. is a Susan G. Komen Scholar.

Author contributions
J.D.: project conception, software implementation, and data analysis; J.D., A.C., and S.P.S.: algorithm development and manuscript writing; S.P.S.: project conception, oversight, and senior responsible author.

Additional information
Supplementary Information accompanies this paper at https://doi.org/10.1038/s41467-018-04368-5.

Competing interests: S.P.S. is a shareholder of Contextual Genomics Inc. The remaining authors declare no competing interests.

Reprints and permission information is available online at http://npg.nature.com/reprintsandpermissions/

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2018
Two of these datasets were originally used to study where N is the number of cells and D is the number of expressed intratumor heterogeneity and the tumor microenvironment in genes in the case of scRNA-seq data. Fourth, t-SNE only outputs 3 44 metastatic melanoma and oligodendroglioma , respectively. the low-dimensional coordinates but without any uncertainties of One dataset was used to categorize the mouse bipolar cell the embedding. Finally, t-SNE typically preserves the local clus- populations of the retina , and one dataset was used to categorize tering structures very well given proper hyperparameters, but all cell types in the mouse retina . For all the scRNA-seq datasets, more global structures such as a group of subclusters that form a 1, 19 we used PCA (as a noise-reduction preprocessing step )to big cluster are missed in the low-dimensional embedding. project the cells into a 100-dimensional space and used the In this paper, we introduce a robust latent variable model, projected coordinates in the 100-dimensional spaces as inputs to scvis, to capture underlying low-dimensional structures in scvis. We also used two mass cytometry (CyTOF) datasets con- scRNA-seq data. As a probabilistic generative model, our method sisting of bone marrow mononuclear cells from two healthy adult learns a parametric mapping from the high-dimensional space to donors H1 and H2 . For CyTOF data, since their dimensionality a low-dimensional embedding. Therefore, new data points can be (32) is relatively low, we directly used these data as inputs to scvis. directly added to an existing embedding by the mapping function. Moreover, scvis estimates the uncertainty of mapping a high- dimensional point to a low-dimensional space that adds rich Experimental setting and implementation. The variational capacity to interpret results. We show that scvis has superior approximation neural network has three hidden layers (l , l , and 1 2 distance preserving properties in its low-dimensional projections l ) with 128, 64, and 32 hidden units each, and the model neural ′ ′ ′ ′ ′ leading to robust identification of cell types in the presence of network has five hidden layers l ; l ; l ; l ; and l with 32, 32, 32, 1 2 3 4 5 noise or ambiguous measurements. We extensively tested our 64, and 128 units each. We use the exponential linear unit acti- method on simulated data and several scRNA-seq datasets in vation function as it has been shown to speed up the convergence both normal and malignant tissues to demonstrate the robustness of optimization and the Adam stochastic optimization algo- of our method. rithm with a learning rate of 0.01 . Details about the influence of these hyperparameters on results are presented in the Methods section. The time complexity to compute the t-SNE loss is Results quadratic in terms of the number of data points. Consequently, Modeling and visualizing scRNA-seq data. Although scRNA- we use mini-batch optimization and set the mini-batch size to 512 seq datasets have high dimensionality, their intrinsic (cells). We expect that a large batch of data could be better in 2 NATURE COMMUNICATIONS (2018) 9:2002 DOI: 10.1038/s41467-018-04368-5 www.nature.com/naturecommunications | | | NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04368-5 ARTICLE Gene –2 2 –2 Input Hidden Hidden Output z ~p (z ) layer layer 1 layer 2 layer ⎮  x ~ p(x z, ) Input 1 Input 2 Ouput parameters  of p (x z, ) or  of q(z x, ) Input 3 b Input 4 Gene –5 –10 –15 –20 –15 –10 –5 0 5 10 15 x is measured z ~ q (z x, ) Fig. 
1 Overview of the scvis method. a scvis model assumptions: given a low-dimensional point drawn from a simple distribution, e.g., a two-dimensional standard normal distribution, a high-dimensional gene expression vector of a cell can be generated by drawing a sample from the distribution p(x | z, θ). The heatmap represents a cell–gene expression matrix, where each row is a cell and each column is a gene. Color encodes the expression levels of genes in cells. The data-point-specific parameters θ are determined by a model neural network. The model neural network (a feedforward neural network) consists of an input layer, several hidden layers, and an output layer. The output layer outputs the parameters θ of p(x | z, θ). b scvis inference: given a high- dimensional gene expression vector of a cell (a row of the heatmap), scvis obtains its low-dimensional representation by sampling from the conditional distribution q(z | x, ϕ). The data-point-specific parameters ϕ are determined by a variational inference neural network. The inference neural network is also a feedforward neural network and its output layer outputs the parameters ϕ of q(z | x, ϕ). Again, the heatmap represents a cell–gene expression matrix. The scatter plot shows samples drawn from the variational posterior distributions q(z | x, ϕ) 2 2 2 3 3 estimating the high-dimensional data manifold, however we y , x y, xy , x , y ). Each of the nine features was then divided by found that 512 cells work accurately and efficiently in practice. its corresponding maximum absolute value. We run the Adam stochastic gradient descent algorithm for 500 Although t-SNE (with default parameter setting, we used the 34 48 epochs for each dataset with at least 3000 iterations by default. efficient Barnes-Hut t-SNE R wrapper package ) uncovered For large datasets, running 500 epochs is computationally the six clusters in this dataset, it was still challenging to infer the expensive, we therefore run the Adam algorithm for a maximum overall layout of the six clusters (Fig. 2b). t-SNE by design of 30,000 iteration or two epochs (which ever is larger). We use preserves local structure of the high-dimensional data, but the an L2 regularizer of 0.001 on the weights of the neural networks “global” structure is not reliable. Moreover, for the uniformly to prevent overfitting. distributed outliers, t-SNE put them into several compact clusters, which were adjacent to other genuine clusters. The scvis results, on the other hand, better preserved the overall structure of the original data (Fig. 2c): (1) The five small Benchmarking scvis against t-SNE on simulated data.To clusters were on one side, and the big cluster was on the other demonstrate that scvis can robustly learn a low-dimensional side. The relative positions of the clusters were also preserved. (2) representation of the input data, we first simulated data in a two- Outliers were scattered around the genuine clusters as in the dimensional space (for easy visualization) as in Fig. 2a. The big original data. In addition, as a probabilistic generative model, cluster on the left consisted of 1000 points and the five small scvis not only learned a low-dimensional representation of the clusters on the right each had 200 points. The five small clusters input data but also provided a way to quantify the uncertainty of were very close to each other and could roughly be considered as the low-dimensional mapping of each input data point by its log- a single big cluster. 
There were 200 uniformly distributed likelihood. We colored the low-dimensional embedding of each outliers around these six clusters. For each two-dimensional data point by its log-likelihood (Fig. 2d). We can see that data point with coordinates (x, y), we then mapped it into a generally scvis put most of its modeling power to model the five nine-dimensional space by the transformation (x+y, x−y, xy, x , NATURE COMMUNICATIONS (2018) 9:2002 DOI: 10.1038/s41467-018-04368-5 www.nature.com/naturecommunications 3 | | | Cell Cell ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04368-5 a b c d Original data t-SNE scvis Log-likelihoods 35.68 15 15 10 10 22.84 5 5 10.01 0 0 −20 −5 −5 −2 −2.81 −10 −10 −40 −4 −15.67 −8 −6 −4 −2 0 2 4 −40 −20 0 20 −15 −10 −5 0 5 10 −15 −10 −5 0 5 10 FDR FDR e f g Density contour 3e–07 1e–09 8e–11 6e–12 9e–12 5e–03 3e–07 6e–06 2e–06 1e–10 3e–07 4e–07 1e–06 scvis scvis t−SNE t−SNE −5 86 −10 −15 −10 −5 0 5 10 10 30 50 100 150 100 200 300 500 700 1000 1500 2000 K # points Fig. 2 Benchmarking scvis against t-SNE on synthetic data. a The original 2200 two-dimensional synthetic data points (points are colored by cluster labels. The randomly distributed outliers are also colored in a distinct color, same for b, c), b t-SNE results on the transformed nine-dimensional dataset with default perplexity parameter of 30, c scvis results, d coloring points based on their log-likelihoods from scvis, e the kernel density estimates of the scvis results, where the density contours are represented by lines, f average K-nearest neighbor preservations across ten runs for different Ks, and g the average K-nearest neighbor preservations (K = 10) for different numbers of subsampled data points from ten repeated runs. The numbers at the top are the adjusted p-values (FDR, one-sided Welch’s t-test) comparing the average Knn preservations from scvis with those from t-SNE. Boxplots in f, g denote the medians and the interquartile ranges (IQRs). The whiskers of a boxplot are the lowest datum still within 1.5 IQR of the lower quartile and the highest datum still within 1.5 IQR of the upper quartile compact clusters, while the outliers far from the five compact To test how scvis performs on smaller datasets, we subsampled clusters tended to have lower log-likelihoods. Thus, by combining the nine-dimensional synthetic dataset. Specifically, we sub- the log-likelihoods and the low-dimensional density information sampled 100, 200, 300, 500, 700, 1000, 1500, and 2000 points (Fig. 2e), we can better interpret the structure in the original data from the original dataset and ran scvis 11 times on each and uncertainty over the projection. subsampled dataset. We then computed the Knn preservations The low-dimensional representation may change for different (K = 10) and found that the Knn preservations from the scvis runs because the scvis objective function can have different local results were significantly higher than those from t-SNE results maxima. To test the stability of the low-dimensional representa- (false discovery rate (FDR) <0.01 for all the subsampled datasets, tions, we ran scvis ten times. Generally, the two-dimensional one-sided Welch’s t-test, Fig. 2g). scvis performs very well on all representations from the ten runs (Supplementary Fig. 1a–j) the subsampled datasets (Supplementary Fig. 2a–h). Even with showed similar patterns as in Fig. 2c. As a comparison, we also just 100 data points, the two-dimensional representation ran t-SNE ten times, and the results (Supplementary Fig. 
1k–t) (Supplementary Fig. 2a) preserved much of the structure in the showed that the layouts of the clusters were less preserved, e.g., data. The log-likelihoods estimated from the subsampled data the relative positions of the clusters changed from run to run. To also recapitulated the log-likelihoods from the original 2200 data quantitatively compare scvis and t-SNE results, we computed the points (Supplementary Fig. 3a–h). The t-SNE results on the average K-nearest neighbor (Knn) preservations across runs for subsampled datasets (Supplementary Fig. 2i–p) generally revealed K ∈ {10, 30, 50, 100, 150}. Specifically, for the low-dimensional the clustering structures. However, the relative positions of the representation from each run, we constructed Knn graphs for five clusters and the big cluster were largely inaccurate. different Ks. We then computed the Knn graph from the high- To test the performance of scvis when adding new data to an dimensional data for a specific K. Finally, we compared the existing embedding, we increased by tenfold the number of points average overlap of the Knn graphs from the low-dimensional in each cluster and the number of outliers (for a total of 22,000 representations with the Knn graph from the high-dimensional points) using a different random seed. The embedding (Fig. 3a, b) data for a specific K. For scvis, the median Knn preservations was very similar to that of the 2200 training data points in Fig. 2c, monotonically increased from 86.7% for K = 10, to 90.9% for K d. We trained Knn classifiers on the embedding of the 2200 = 150 (Fig. 2f). For t-SNE, the median Knn preservations first training data for K ∈ {5, 9, 17, 33, 65} and used the trained decreased from 82.7% for K = 10 to 82.1% for K = 50 (consistent classifiers to classify the embedding of the 22,000 points, with t-SNE preserving local structures) and then increased to repeating 11 times. Median accuracy (the proportion of points 84.0% for K = 150. Thus scvis preserved Knn more effectively correctly assigned to their corresponding clusters) was 98.1% for than t-SNE. K = 5 and 94.8% for K = 65. The performance decreased mainly 4 NATURE COMMUNICATIONS (2018) 9:2002 DOI: 10.1038/s41467-018-04368-5 www.nature.com/naturecommunications | | | Average Knn preservation (%) Average Knn preservation (%) NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04368-5 ARTICLE a b c Knn classification accuracy 2e−04 8e−06 2e−08 4e−11 3e−07 scvis mapping new data scvis log-likelihoods 1e−01 1e−01 1e−01 1e−01 1e−01 38.91 scvis 15 15 GPLVM pt−SNE PCA 10 10 17.67 88.8 88.3 96 5 5 87.4 −3.56 85.3 −5 −5 −23.72 −10 −10 82.5 −46.02 −15 −10 −5 0 5 10 5 9 17 33 65 −15 −10 −5 05 10 d e f scvis trained on new data scvis log-likelihoods t−SNE 38.91 10 10 5 5 17.67 −3.56 −5 −5 −20 −23.72 −10 −10 −40 −15 −60 −15 −46.02 −15 −10 −5 0 5 10 15 −15 −10 −5 0 5 10 15 −60 −40 −20 0 20 40 60 Fig. 3 Benchmarking scvis against GPLVM, parametric t-SNE, and PCA to embed 22,000 synthetic out-of-sample data. a scvis mapping 22,000 new data points based on the learned probabilistic mapping function from the 2200 training data points, b the estimated log-likelihoods, and c the average K-nearest neighbor classification accuracies for different Ks across 11 runs, the classifiers were trained on the 11 embeddings from the 2200 points. The numbers at the top are the FDR (one-sided Mann–Whitney U-test) comparing the K-nearest neighbor classification accuracy from scvis with those from GPLVM (orange, bottom) and those from parametric t-SNE (golden, top). 
Notice that, for GPLVM, two runs produced bad results and were not plotted in the figure. Boxplots denote the medians and the interquartile ranges (IQR). The whiskers of a boxplot are the lowest datum still within 1.5 IQR of the lower quartile and the highest datum still within 1.5 IQR of the upper quartile. d scvis results on the larger dataset with the same perplexity parameter as used in Fig. 2; e scvis log-likelihoods on the larger dataset; and f t-SNE results on the larger dataset because, for a larger K, the outliers were wrongly assigned to the than those from scvis, GPLVM, and pt-SNE in terms of the mean six genuine clusters. classification accuracies for different Ks. We then benchmarked scvis against Gaussian process latent As a non-parametric dimension reduction method, t-SNE was 49 50 variate model (GPLVM, implemented in the GPy package), sensitive to hyperparameter setting, especially the perplexity parametric t-SNE (pt-SNE), and PCA on embedding the 22,000 parameter (the effective number of neighbors, see the Methods out-of-sample data points. We used the 11 scvis models trained section for details). The optimal perplexity parameter increased as on the small nine-dimensional synthetic dataset with 2200 data the total number of data points increased. In contrast, as we points to embed the larger nine-dimensional synthetic data with adopted mini-batch for training scvis by subsampling, e.g., 512 22,000 data points. Similarly, we trained 11 GPLVM models and cells each time, scvis was less sensitive to the perplexity parameter pt-SNE models on the small nine-dimensional synthetic dataset as we increase the total number of training data points because and applied these models to the bigger synthetic dataset. To the number of cell is fixed at 512 at each training step. Therefore, compare the abilities of the trained models to embed unseen data, scvis performed well on approximately an order of magnitude we trained Knn classifiers on the two-dimensional representations larger dataset (Fig. 3d, e), without changing the perplexity (of the small 2200 data points) outputted from different parameter for scvis. For this larger dataset, the t-SNE results algorithms. These Knn classifiers were used to classify the two- (Fig. 3f) were difficult to interpret without the ground-truth dimensional coordinates of the 22,000 data points outputted from cluster information, because it was already different algorithms. scvis was significantly better than GPLVM difficult to see how many clusters in this dataset, not to mention and pt-SNE for different Ks (Fig. 3c, two runs of GPLVM to uncover the overall structure of the data. Although by produced bad results and were not plotted in the figure, FDR < increasing the perplexity parameter, the performance of t-SNE 0.05, one-sided Mann–Whitney U-test). For PCA, because the became better (Supplementary Fig. 4), the outliers still formed model is unique for a given dataset, we generated unique two- distinct clusters, and it remains difficult to set this parameter in dimensional coordinates for the 22,000 out-of-sample data points. practice. 
The Knn classifiers trained on the PCA coordinates were worse NATURE COMMUNICATIONS (2018) 9:2002 DOI: 10.1038/s41467-018-04368-5 www.nature.com/naturecommunications 5 | | | Accuracy (%) ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04368-5 a b Bipolar cell types Bipolar log-likelihoods 354.5 BC5A (cone bipolar cell 5A) BC5D 10 10 BC7 (cone bipolar cell 7) 153.4 BC5B BC3B BC5C −400 0 200 BC8/9 (mixture of BC8 and BC9) Log-likelihoods BC4 BC3A BC6 BC2 −47.4 RBC (rod bipolar cell) BC1A BC1B AC (amacrine cell) −165.5 Cone photoreceptors Rod photoreceptors −10 −10 MG (mueller glia) −20 −20 −449.7 −10 0 10 20 −10 0 10 20 c d Retina cell clusters Retina log-likelihoods 351.2 15 15 10 10 273.5 50 150 250 350 27 5 Log-likelihoods 195.9 −5 −5 −10 −10 118.1 −15 −15 40.5 −15 −10 −5 0 5 10 15 20 −15 −10 −5 0 5 10 15 20 z coordinate one Fig. 4 Learning a probabilistic mapping function from the bipolar data and applying the function to the independently generated mouse retina dataset. a scvis learned two-dimensional representations of the bipolar dataset, b coloring each point by the estimated log-likelihood, c the whole mouse retina dataset was directly projected to a two-dimensional space by the probabilistic mapping function learned from the bipolar data, and d coloring each point from the retina dataset by the estimated log-likelihood Learning a parametric mapping for a single-cell dataset.We from Shekhar et al. ). In addition, although t-SNE mapped cells next analyzed the scvis learned probabilistic mapping from a from different cell populations into distinct regions, more global training single-cell dataset and tested how it performed on unseen organizations of clusters of cells were missed in the t-SNE data. We first trained a model on the mouse bipolar cell of the embedding. The “ON” cone bipolar cell clusters, the “OFF” cone retina dataset and then used the learned model to map the bipolar cell clusters, and other non-bipolar cell clusters were independently generated mouse retina dataset . The two- mixed together in the t-SNE results. dimensional coordinates from the bipolar dataset captured The bipolar cells tended to have higher log-likelihoods than much information in this dataset (Fig. 4a). For example, non- non-bipolar cells such as amacrine cells, Mueller glia, and bipolar cells such as amacrine cells, Mueller glia, and photo- photoreceptors (Fig. 4b), suggesting that the model used most of receptors were at the bottom, the rod bipolar cells were in the its power to model the bipolar cells, while other cell types were middle, and the cone bipolar cells were on the top left around the not modeled as well. The embedded figure at the top right corner rod bipolar cells. Moreover, the “OFF” cone bipolar cells (BC1A, shows the histogram of the log-likelihoods. The majority of the BC1B, BC2, BC3A, BC3B, BC4) were on the left and close to each points exhibited high log-likelihoods (with a median of 292.4). other, and the “ON” cone bipolar cells (BC5A-D, BC6, BC7, BC8/ The bipolar cells had significantly higher log-likelihoods (median 9) were at the top. Cell doublets and contaminants (accounting log-likelihood of 298.4) relative to non-bipolar cells (including for 2.43% of the cells comprised eight clusters , with distinct color amacrine cells, Mueller glia, rod and cone photoreceptors) and symbol combinations in Fig. 
4a but not labeled) were rare in (median log-likelihood of 223.6; one-sided Mann–Whitney U- the bipolar datasets, and they were mapped to low-density regions test FDR < 0.001; Supplementary Fig. 5b). The amacrine cells had in the low-dimensional plots (Fig. 4a). the lowest median log-likelihood (median log-likelihood for Consistent with the synthetic data (Fig. 2), t-SNE put the amacrine cells, Mueller glia, rod and cone photoreceptors were “outlier” cell doublets and contaminants into very distinct 226.4, 187.3, 222.7, and 205.4, respectively; Supplementary compact clusters (Supplementary Fig. 5a, t-SNE coordinates Fig. 5b). 6 NATURE COMMUNICATIONS (2018) 9:2002 DOI: 10.1038/s41467-018-04368-5 www.nature.com/naturecommunications | | | z coordinate two Counts 0 4000 8000 Counts 2000 4000 6000 0 NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04368-5 ARTICLE We benchmarked scvis against GPLVM, pt-SNE, and PCA on distribution of the log-likelihoods. The “Other” cells types embedding out-of-sample scRNA-seq data, performing a five-fold (horizontal cells, retina ganglion cells, microglia cells, etc) that cross-validation analysis on the bipolar dataset. Specifically, we were only in the retina dataset had the lowest log-likelihoods partitioned the bipolar dataset into five roughly equal size (median log-likelihoods of 181.7, Supplementary Fig. 5d). subsamples and held out one subsample as out-of-sample It is straightforward to project scRNA-seq to a higher than evaluation data, using the remaining four subsamples as training two-dimensional space. To evaluate how scvis performs on data to learn different models. We then trained Knn classifiers on higher-dimensional maps, we projected the bipolar data to a the two-dimensional representations of the training data and then three-dimensional space. We obtained better average log- used the Knn classifiers to classify the two-dimensional likelihood per data point, i.e., 255.1 versus 253.3 (from the last representations of the out-of-sample evaluation data. The process 100 iterations) by projecting the data to a three-dimensional was repeated five times with each of the five subsamples used space compared to projecting the data to a two-dimensional space exactly once as the out-of-sample validation data. scvis was (Supplementary Fig. 7). In addition, the average KL divergence significantly better than pt-SNE, GPLVM, and PCA on embed- was smaller (2.7 versus 4.1 from the last 100 iterations) by ding the out-of-samples (Supplementary Fig. 6a, b, FDR < 0.05, projecting the data to a three-dimensional space. one-sided Welch’s t-test). Finally, to demonstrate that scvis can be used for other types of We used the learned probabilistic mapping from the bipolar single-cell data, we learned a parametric mapping from the cells to map the independent whole-retina dataset .We first CyTOF data H2 and then directly used the mapping to project the projected the retina dataset to the subspace spanned by the first CyTOF data H1 to a two-dimensional space. As can be seen from 100 principal direction vectors of the bipolar dataset and then Supplementary Fig. 8a, all the 14 cell types were separated mapped each 100-dimensional vector to a two-dimensional space (although CD16+ and CD16− NK cells have some overlaps), and based on the learned scvis model from the bipolar dataset. The CD4 T cells and CD8 T cells clusters are adjacent to each other. 
bipolar cell clusters in the retina dataset identified in the original Moreover, the high quality of the mapping carried over to the study (clusters 26–33) tended to be mapped to the correspond- CyTOF data H1 (72,463 cells, Supplementary Fig. 8a, b). ing bipolar cell subtype regions discovered in the study (Fig. 4c). Although Macosko et al. only identified eight subtypes of bipolar cells, all the recently identified 14 subtypes of bipolar cells were Tumor microenvironments and intratumor heterogeneity.We possibly present in the retina dataset as can be seen from Fig. 4c, next used scvis to analyze tumor microenvironments and intra- i.e., cluster 27 (BC3B and BC4), cluster 28 (BC2 and BC3A), tumor heterogeneity. The oligodendroglioma dataset consists of cluster 29 (BC1A and BC1B), cluster 30 (BC5A and BC5D), mostly malignant cells (Supplementary Fig. 9a). We used densi- cluster 31 (BC5B and BC5C), and cluster 33 (BC6 and BC8/9). tycut to cluster the two-dimensional coordinates to produce 15 Interestingly, there was a cluster just above the rod photo- clusters (Supplementary Fig. 9b). The non-malignant cells receptors (Fig. 4c) consisting of different subtypes of bipolar cells. (microglia/macrophage and oligodendrocytes) formed two small In the bipolar dataset, cell doublets or contaminants were mapped clusters on the left and each consisted of cells from different to this region (Fig. 4a). We used densitycut to cluster the two- patients. We therefore computed the entropy of each cluster dimensional mapping of all the bipolar cells from the retina based on the cells of origin (enclosed bar plot). As expected, the dataset to detect this mixture of bipolar cell cluster (Supplemen- non-malignant clusters (cluster one and cluster five) had high tary Fig. 5c, where the 1535 high-density points in this cluster entropies. Cluster 12 (cells mostly from MGH53 and MGH54) were labeled with red circles). To test whether this mixture cell and cluster 14 (cells from MGH93 and MGH94) also had high population was an artifact of the projection, we randomly drew entropies (Fig. 5a). The cells in these two clusters consisted of the same number of data points from each bipolar subtype as in mostly astrocytes (Fig. 5b; the oligodendroglioma cells could the mixture cluster and computed the Knns of each data point roughly be classified as oligodendrocyte, astrocyte, or stem-like (here K was set to log (1535) = 11). We found that the 11 nearest cells.) Interestingly, cluster 15 had the highest entropy, and these neighbors of the points from the mixture clusters were also cells had significant higher stem-like scores (one-sided Welch’s t- −12 mostly from the mixture cluster (median of 11 and mean of 10.8), test p-value <10 ). We also colored cells by the cell-cycle scores while for the randomly selected points from the bipolar cells, a (G1/S scores, Supplementary Fig. 9c; G2/M scores, Supplemen- relatively small number of points of their 11 nearest neighbors tary Fig. 9d) and found that these cells also had significantly −12 (median of 0 and mean of 0.2) were from the mixture cluster. The higher G1/S scores (one-sided Welch’s t-test p-value <10 ) and −9 results suggest that the bipolar cells in the mixture cluster were G2/M scores (one-sided Welch’s t-test p-value <10 ). Therefore, substantially different from other bipolar cells. Finally, this cluster 15 cells consisted of mostly stem-like cells, and these cells mixture of bipolar cells had significantly lower log-likelihoods were cycling. 
compared with the other bipolar cells (one-sided Mann–Whitney U-test p-value <0.001, Supplementary Fig. 5d).

Non-bipolar cells, especially Mueller glia cells, were mapped to the corresponding regions as in the bipolar dataset (Fig. 4c). Photoreceptors (rod and cone photoreceptors, accounting for 65.6 and 4.2% of all the cells from the retina) were also mapped to their corresponding regions as in the bipolar dataset (Supplementary Fig. 5e). The amacrine cells (consisting of 21 clusters) together with horizontal cells and retinal ganglion cells were mapped to the bottom right region (Supplementary Fig. 5f); all the amacrine cells were assigned the same label and the same color.

As in the training bipolar data, the bipolar cells in the retina dataset also tended to have high log-likelihoods, and other cells tended to have relatively lower log-likelihoods (Fig. 4d). The embedded plot in the top right corner shows a bimodal distribution of the log-likelihoods. The "Other" cell types (horizontal cells, retina ganglion cells, microglia cells, etc.) that were only in the retina dataset had the lowest log-likelihoods (median log-likelihood of 181.7, Supplementary Fig. 5d).

It is straightforward to project scRNA-seq data to a space of more than two dimensions. To evaluate how scvis performs on higher-dimensional maps, we projected the bipolar data to a three-dimensional space. We obtained a better average log-likelihood per data point, 255.1 versus 253.3 (from the last 100 iterations), by projecting the data to a three-dimensional space instead of a two-dimensional space (Supplementary Fig. 7). In addition, the average KL divergence was smaller (2.7 versus 4.1 from the last 100 iterations) when projecting the data to a three-dimensional space.

Finally, to demonstrate that scvis can be used for other types of single-cell data, we learned a parametric mapping from the CyTOF data H2 and then directly used this mapping to project the CyTOF data H1 to a two-dimensional space. As can be seen from Supplementary Fig. 8a, all 14 cell types were separated (although the CD16+ and CD16− NK cells have some overlap), and the CD4 T-cell and CD8 T-cell clusters are adjacent to each other. Moreover, the high quality of the mapping carried over to the CyTOF data H1 (72,463 cells, Supplementary Fig. 8a, b).

Tumor microenvironments and intratumor heterogeneity. We next used scvis to analyze tumor microenvironments and intratumor heterogeneity. The oligodendroglioma dataset consists of mostly malignant cells (Supplementary Fig. 9a). We used densitycut to cluster the two-dimensional coordinates, producing 15 clusters (Supplementary Fig. 9b). The non-malignant cells (microglia/macrophages and oligodendrocytes) formed two small clusters on the left, each consisting of cells from different patients. We therefore computed the entropy of each cluster based on the cells of origin (enclosed bar plot). As expected, the non-malignant clusters (cluster one and cluster five) had high entropies. Cluster 12 (cells mostly from MGH53 and MGH54) and cluster 14 (cells from MGH93 and MGH94) also had high entropies (Fig. 5a). The cells in these two clusters consisted of mostly astrocytes (Fig. 5b; the oligodendroglioma cells can roughly be classified as oligodendrocyte, astrocyte, or stem-like cells). Interestingly, cluster 15 had the highest entropy, and these cells had significantly higher stem-like scores (one-sided Welch's t-test p-value <10^-12). We also colored cells by the cell-cycle scores (G1/S scores, Supplementary Fig. 9c; G2/M scores, Supplementary Fig. 9d) and found that these cells also had significantly higher G1/S scores (one-sided Welch's t-test p-value <10^-12) and G2/M scores (one-sided Welch's t-test p-value <10^-9). Therefore, cluster 15 consisted of mostly stem-like cells, and these cells were cycling.
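The per-cluster entropy used above is the Shannon entropy of each cluster's patient-of-origin composition. A minimal sketch, assuming `clusters` and `patients` are equal-length label vectors:

```python
import numpy as np
import pandas as pd

def cluster_entropies(clusters, patients):
    """Shannon entropy (in bits) of the patient-of-origin composition of each
    cluster; high entropy means the cluster mixes cells from many patients."""
    df = pd.DataFrame({"cluster": clusters, "patient": patients})
    def entropy(labels):
        p = labels.value_counts(normalize=True).to_numpy()
        return float(-(p * np.log2(p)).sum())
    return df.groupby("cluster")["patient"].apply(entropy)
```

Under this measure, a patient-specific malignant cluster scores near zero, while a cluster shared evenly across patients (e.g., normal cell clusters) scores near log2 of the number of patients.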
Malignant cells formed distinct clusters even when they were from the same patient (Fig. 5a). We next colored each malignant cell by its lineage score (Fig. 5b). The cells in some clusters highly expressed the astrocyte gene markers or the oligodendrocyte gene markers. The stem-like cells tended to be rare, and they appeared as "outliers" linking the oligodendrocyte and astrocyte cells in the two-dimensional scatter plots (Fig. 5b). In addition, some clusters consisted of mixtures of cells (e.g., both oligodendrocyte and stem-like cells), suggesting that other factors, such as genetic mutations and epigenetic measurements, would be required to fully interpret the clustering structures in this dataset.

For the melanoma dataset, the authors profiled both malignant cells and non-malignant cells. The malignant cells originating from different patients were mapped to the bottom left region (Fig. 5c) and were further subdivided by patient of origin (Fig. 5d). Similar to the oligodendroglioma dataset, non-malignant immune cells such as T cells, B cells, and macrophages, even from different patients, tended to be grouped together by cell type instead of by patient of origin (Fig. 5c, d), although for some patients (e.g., 75, 58, and 67, Fig. 5d) the immune cells showed patient-specific bias. We did a differential expression analysis of patient 75 T cells and the other patients' T cells using limma. Most of the top 100 differentially expressed genes were ribosome genes (Supplementary Fig. 10a), suggesting that batch effects could be detectable between patient 75 T cells and the other patients' T cells.

Fig. 5 scvis learned low-dimensional representations. a The oligodendroglioma dataset, each cell colored by its patient of origin; b the oligodendroglioma dataset, each cell colored by its lineage score from Tirosh et al.; c the melanoma dataset, each cell colored by its cell type; and d the melanoma dataset, each cell colored by its patient of origin. (Axes: z coordinate one versus z coordinate two.)

Interestingly, although non-malignant, the cancer-associated fibroblasts (CAFs) were mapped to the region adjacent to the malignant cells, and the endothelial cells were just above the CAFs (Fig. 5d). To test whether these cells were truly more similar to the malignant cells than to the immune cells, we computed the average principal component values for each type of cell and did a hierarchical clustering analysis (Supplementary Fig. 10b). Generally, there were two clusters: one consisted of the immune cells and the "Unresolved normal" cells, while the other consisted of the CAFs, endothelial cells, malignant cells, and "Unresolved" cells, indicating that the CAFs and endothelial cells were more similar to the malignant cells (they had high PC1 values) than to the immune cells.
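The cell-type-level clustering behind Supplementary Fig. 10b is easy to sketch. Assuming `pcs` is the cells-by-PCs matrix and `cell_types` the annotation vector (the linkage method and metric here are illustrative choices, not taken from the paper):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

def cluster_cell_types_by_mean_pcs(pcs, cell_types):
    """Average the principal component values within each annotated cell type,
    then hierarchically cluster the resulting cell-type centroids."""
    labels = np.unique(cell_types)
    centroids = np.vstack([pcs[cell_types == t].mean(axis=0) for t in labels])
    tree = linkage(centroids, method="average", metric="euclidean")
    return labels, tree  # plot with dendrogram(tree, labels=list(labels))
```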
Discussion

We have developed a novel method, scvis, for modeling and reducing the dimensionality of single-cell gene expression data. We demonstrated that scvis can robustly preserve the structures in high-dimensional datasets, including datasets with small numbers of data points.

Our contribution has several important implications for the field. As a probabilistic generative model, scvis provides not only the low-dimensional coordinates for a given data point but also the log-likelihood as a measure of the quality of the embedding. The log-likelihoods could potentially be used for outlier detection; e.g., for the bipolar cells in Fig. 4b, the log-likelihood histogram shows a long tail of data points with relatively low log-likelihoods, suggesting some outliers in this dataset (the non-bipolar cells). The log-likelihoods could also be useful in mapping new data. For example, although horizontal cells and retinal ganglion cells were mapped to the region adjacent to, or overlapping, the region occupied by amacrine cells, these cells exhibited low log-likelihoods, suggesting that further analyses were required to elucidate these cell types and subtypes.

scvis preserves the "global" structure in a dataset, greatly enhancing interpretation of projected structures in scRNA-seq data. For example, in the bipolar dataset, the "ON" bipolar cells were close to each other in the two-dimensional representation in Fig. 4a, and similarly, the "OFF" bipolar cells were close to each other. For the oligodendroglioma dataset, the cells can first be divided into normal cells and malignant cells. The normal cells formed two clusters, each consisting of cells from multiple patients. The malignant cells, although from the same patient, formed multiple clusters, with cell clusters from the same patient adjacent to each other. Adjacent malignant cell clusters from different patients tended to selectively express the oligodendrocyte marker genes or the astrocyte marker genes. For the metastatic melanoma dataset, malignant cells from different patients, although mapped to the same region, formed clusters based on the patient of origin, while immune cells from different patients tended to cluster together by cell type. From the low-dimensional representations, we can hypothesize that the CAFs were more "similar" to the malignant cells than to the immune cells.

Other methods, e.g., the SIMLR algorithm, improve the t-SNE algorithm by learning a similarity matrix between cells, which is then used as the input of t-SNE for dimension reduction. However, SIMLR is computationally expensive because its objective function involves large matrix multiplications (an N × N kernel matrix multiplying an N × N similarity matrix, where N is the number of cells). In addition, although the learned similarity matrix could help clustering analyses, it may distort the manifold structure, as demonstrated in the t-SNE plots on the learned similarity matrix, because the SIMLR objective function encourages forming clusters. The DeepCyTOF framework has a component that uses a denoising autoencoder (trained on the cells with few or no zero events) to filter CyTOF data, minimizing the influence of dropout noise in single-cell data. The purpose of DeepCyTOF is thus quite different from that of scvis, which models and visualizes the low-dimensional structures in high-dimensional single-cell data. The most similar approach to scvis may be the parametric t-SNE algorithm, which uses a neural network to learn a parametric mapping from the high-dimensional space to a low-dimensional space. However, parametric t-SNE is not a probabilistic model; the learned low-dimensional embedding is difficult to interpret, and there are no likelihoods to quantify the uncertainty of each mapping.

In conclusion, the scvis algorithm provides a computational framework to compute low-dimensional embeddings of scRNA-seq data while preserving the global structure of the high-dimensional measurements. We expect scvis to model and visualize structures in scRNA-seq data while providing biologically interpretable results. As technical advances to profile the transcriptomes of large numbers of single cells further mature, we envisage that scvis will be of great value for routine analysis of large-scale, high-resolution mapping of cell populations.
Methods

A latent variable model of single-cell data. We assume that the gene expression vector x_n of cell n is a random vector governed by a low-dimensional latent vector z_n. The graphical model representation of this latent variable model (with N cells) is shown in Fig. 6a. The x_n distribution could be a complex high-dimensional distribution. We assume that it follows a Student's t-distribution given z_n:

$$p(x_n \mid z_n, \theta) = T\big(x_n \mid \mu_\theta(z_n), \sigma_\theta(z_n), \nu\big), \qquad (1)$$

where both $\mu_\theta(\cdot)$ and $\sigma_\theta(\cdot)$ are functions of z_n given by a neural network with parameter θ, and ν is the degree-of-freedom parameter, learned from the data. The marginal distribution $p(x_n \mid \theta) = \int p(x_n \mid z_n, \theta)\, p(z_n \mid \theta)\, dz_n$ can model a complex high-dimensional distribution.

We are interested in the posterior distribution of the low-dimensional latent variable given the data, p(z_n | x_n, θ), which is intractable to compute. To approximate the posterior, we use the variational distribution $q(z_n \mid x_n, \phi) = N\big(\mu_\phi(x_n), \mathrm{diag}(\sigma_\phi(x_n))\big)$ (Fig. 6b). Both $\mu_\phi(\cdot)$ and $\sigma_\phi(\cdot)$ are functions of x_n through a neural network with parameter ϕ. Although the number of latent variables grows with the number of cells, these latent variables are governed by a neural network with a fixed set of parameters ϕ. Therefore, even for datasets with large numbers of cells, we can still efficiently infer the posterior distributions of the latent variables. The model coupled with this variational inference scheme is called the variational autoencoder.

Fig. 6 The scvis directed probabilistic graphical model and the variational approximation of its posterior. Circles represent random variables; squares represent deterministic parameters. Shaded nodes are observed, and unshaded nodes are hidden. Here we use plate notation, i.e., the nodes inside each box are repeated when the node is unpacked (the number of repeats is at the bottom right corner of each box). Each node and its parents constitute a family; given its parents, a random variable is independent of its ancestors. Therefore, the joint distribution of all the random variables is the product of the family conditional distributions. a The generative model to generate data x_n, and b the variational approximation q(z_n | x_n, ϕ) to the posterior p(z_n | x_n, θ).

Now the problem is to find the variational parameter ϕ such that the approximation q(z_n | x_n, ϕ) is as close as possible to the true posterior p(z_n | x_n, θ). The quality of the approximation is measured by the Kullback–Leibler (KL) divergence:

$$\begin{aligned} \mathrm{KL}\big(q(z_n \mid x_n, \phi)\,\|\,p(z_n \mid x_n, \theta)\big) &= \int q(z_n \mid x_n, \phi)\,\log \frac{q(z_n \mid x_n, \phi)}{p(z_n \mid x_n, \theta)}\, dz_n \\ &= \int q(z_n \mid x_n, \phi)\,\log \frac{q(z_n \mid x_n, \phi)\, p(x_n \mid \theta)}{p(z_n, x_n \mid \theta)}\, dz_n \\ &= \mathbb{E}_{q(z_n \mid x_n, \phi)}[\log q(z_n \mid x_n, \phi)] - \mathbb{E}_{q(z_n \mid x_n, \phi)}[\log p(z_n, x_n \mid \theta)] + \log p(x_n \mid \theta) \end{aligned} \qquad (2)$$

$$= \mathrm{KL}\big(q(z_n \mid x_n, \phi)\,\|\,p(z_n \mid \theta)\big) - \mathbb{E}_{q(z_n \mid x_n, \phi)}[\log p(x_n \mid z_n, \theta)] + \log p(x_n \mid \theta). \qquad (3)$$

The term $\mathbb{E}_{q}[\log p(z_n, x_n \mid \theta)] - \mathbb{E}_{q}[\log q(z_n \mid x_n, \phi)]$ in Eq. (2) is the evidence lower bound (ELBO), because it is a lower bound of log p(x_n | θ), the KL divergence on the left-hand side being non-negative. We can therefore do maximum-likelihood estimation of both θ and ϕ by maximizing the ELBO. Notice that in the Bayesian setting, the ELBO is a lower bound of the evidence log p(x_n), as the parameters θ are also latent random variables.

Both the prior p(z_n | θ) and the variational distribution q(z_n | x_n, ϕ) in the ELBO of the form in Eq. (3) are distributions of z_n. In our case, we can compute the KL term analytically because both the prior and the variational distribution given x_n are multivariate normal distributions. However, there is typically no closed-form expression for $\mathbb{E}_{q(z_n \mid x_n, \phi)}[\log p(x_n \mid z_n, \theta)]$, because we must integrate out z_n, and the model parameters $\mu_\theta(z_n)$ and $\mathrm{diag}(\sigma_\theta(z_n))$ are functions of z_n.
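For the diagonal-Gaussian variational posterior above and a standard normal prior, the KL term of Eq. (3) has a closed form. A minimal sketch, per cell:

```python
import numpy as np

def analytic_kl(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ): the analytic KL term of Eq. (3).
    `mu` and `sigma` are the d-dimensional outputs of the inference network."""
    return 0.5 * np.sum(mu**2 + sigma**2 - 2.0 * np.log(sigma) - 1.0)
```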
Instead, we can use Monte Carlo integration to obtain an estimated evidence lower bound for the nth cell:

$$\mathrm{ELBO}_n = -\mathrm{KL}\big(q(z_n \mid x_n, \phi)\,\|\,p(z_n \mid \theta)\big) + \frac{1}{L}\sum_{l=1}^{L}\log p(x_n \mid z_{n,l}, \theta), \qquad (4)$$

where $z_{n,l}$ is sampled from q(z_n | x_n, ϕ) and L is the number of samples. We want to take the partial derivatives of the ELBO with respect to the variational parameter ϕ and the generative model parameter θ to find a local maximum of the ELBO. However, if we directly sample points from q(z_n | x_n, ϕ), it is impossible to use the chain rule to take the partial derivative of the second term of Eq. (4) with respect to ϕ, because $z_{n,l}$ is a number. To use gradient-based methods for optimization, we indirectly sample data from q(z_n | x_n, ϕ) using the "reparameterization trick". Specifically, we first sample $\epsilon_l$ from an easy-to-sample distribution $\epsilon_l \sim p(\epsilon \mid \alpha)$, e.g., a standard multivariate Gaussian distribution in our case. Next, we pass $\epsilon_l$ through a continuous function $g_\phi(\epsilon, x_n)$ to get a sample from q(z_n | x_n, ϕ). In our case, with $q(z_n \mid x_n, \phi) = N\big(\mu_\phi(x_n), \mathrm{diag}(\sigma_\phi(x_n))\big)$, we have $g_\phi(\epsilon, x_n) = \mu_\phi(x_n) + \mathrm{diag}\big(\sigma_\phi(x_n)\big) \times \epsilon$.
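A sketch of the reparameterized Monte Carlo estimate of the reconstruction term in Eq. (4), with hypothetical decoder callables `mu_theta` and `sigma_theta` standing in for the model neural network:

```python
import numpy as np
from scipy.stats import t as student_t

def mc_reconstruction(x, mu_phi, sigma_phi, mu_theta, sigma_theta, nu=2.0, L=1, seed=0):
    """Monte Carlo estimate of E_q[log p(x | z, theta)] in Eq. (4), using the
    reparameterization z = mu_phi + sigma_phi * eps with eps ~ N(0, I), so the
    sample is a differentiable function of the variational parameters."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(L):
        eps = rng.standard_normal(mu_phi.shape)
        z = mu_phi + sigma_phi * eps   # a sample from q(z | x), via g_phi(eps, x)
        total += student_t.logpdf(x, df=nu, loc=mu_theta(z), scale=sigma_theta(z)).sum()
    return total / L
```

In an autodiff framework such as TensorFlow, writing the sample this way is precisely what lets gradients flow through mu_phi and sigma_phi.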
Adding regularizers on the latent variables. Given i.i.d. data $D = \{x_n\}_{n=1}^N$, by maximizing the ELBO $\sum_{n=1}^N \mathrm{ELBO}_n$ we can do maximum-likelihood estimation of the model parameters θ and the variational distribution parameters ϕ. Although p(z_n | θ)p(x_n | z_n, θ) may model the data distribution very well, the variational distribution q(z_n | x_n, ϕ) is not necessarily good for visualization purposes. Specifically, it is possible that there are no clear gaps between the points from different clusters: to model the data distribution well, the low-dimensional z space tends to be filled, such that all of the z space is used in modeling the data distribution. To better visualize the manifold structure of a dataset, we add regularizers to the objective function in Eq. (4) that encourage gaps between clusters while keeping points that are nearby in the high-dimensional space nearby in the low-dimensional space. Here we use the non-symmetrized t-SNE objective function.

The t-SNE algorithm preserves the local structure of the high-dimensional space after dimension reduction. To measure the "localness" of a pairwise distance, for a data point i in the high-dimensional space, the pairwise distance between i and another data point j is transformed into a conditional distribution by centering an isotropic univariate Gaussian distribution at i:

$$p_{j|i} = \frac{\exp\big(-\|x_i - x_j\|^2 / 2\sigma_i^2\big)}{\sum_{k \neq i}\exp\big(-\|x_i - x_k\|^2 / 2\sigma_i^2\big)}. \qquad (5)$$

The point-specific standard deviation $\sigma_i$ is computed automatically in such a way that the perplexity $2^{-\sum_j p_{j|i} \log_2 p_{j|i}}$ of the conditional distribution $p_{\cdot|i}$ equals a user-defined hyperparameter (typically 30). We set $p_{i|i} = 0$ because only pairwise similarities are of interest.

In the low-dimensional space, the conditional distribution $q_{j|i}$ is defined similarly, and $q_{i|i}$ is set to 0. The only difference is that an unscaled univariate Student's t-distribution is used instead of an isotropic univariate Gaussian distribution. Because in the high-dimensional space more points can be close to each other than in the low-dimensional space (e.g., only two points can be mutually equidistant on a line, three points in a two-dimensional plane, and four points in a three-dimensional space), it is impossible to faithfully preserve the high-dimensional pairwise distances in the low-dimensional space when the intrinsic dimensionality of the data is bigger than that of the low-dimensional space. A heavy-tailed Student's t-distribution allows moderate distances in the high-dimensional space to be modeled by much larger distances in the low-dimensional space, preventing different clusters from being crushed together.

The low-dimensional embedding coordinates $\{z_i\}_{i=1}^N$ are obtained by minimizing the KL divergence between the conditional distributions:

$$\sum_{i=1}^{N}\mathrm{KL}\big(p_{\cdot|i}\,\|\,q_{\cdot|i}\big) = \sum_{i=1}^{N}\sum_{j \neq i} p_{j|i}\log\frac{p_{j|i}}{q_{j|i}} = \sum_{i=1}^{N}\sum_{j \neq i} p_{j|i}\log p_{j|i} - \sum_{i=1}^{N}\sum_{j \neq i} p_{j|i}\log q_{j|i},$$

which, dropping the constant $\sum p\log p$ term and expanding $q_{j|i}$, decomposes into an attractive term

$$\frac{\nu+1}{2}\sum_{i=1}^{N}\sum_{j \neq i} p_{j|i}\,\log\big(1 + \|z_i - z_j\|^2/\nu\big) \qquad (6)$$

and a repulsive term

$$\sum_{i=1}^{N}\log\sum_{k \neq i}\big(1 + \|z_i - z_k\|^2/\nu\big)^{-\frac{\nu+1}{2}}. \qquad (7)$$

Here ν is the degree of freedom of the Student's t-distribution, which is typically set to one (the standard Cauchy distribution) or learned from the data. Equation (6) is a data-dependent term (depending on the high-dimensional data) that keeps points that are nearby in the high-dimensional space nearby in the low-dimensional space. Equation (7) is a data-independent term that pushes the data points in the low-dimensional space apart from each other. Notice that the t-SNE objective function minimizes the KL divergence of the joint distributions, defined from the symmetrized conditional distributions $p_{i,j} = (p_{i|j} + p_{j|i})/(2N)$ and $q_{i,j} = (q_{i|j} + q_{j|i})/(2N)$. t-SNE has shown excellent results on many visualization tasks, such as visualizing scRNA-seq data and CyTOF data.

The final objective function is a weighted combination of the ELBO of the latent variable model and the above asymmetric t-SNE objective function:

$$\arg\min_{\theta,\phi}\; \sum_{n=1}^{N} -\mathrm{ELBO}_n \;+\; \alpha\sum_{n=1}^{N}\mathrm{KL}\big(p_{\cdot|n}\,\|\,q_{\cdot|n}\big). \qquad (8)$$

The parameter α is set to the dimensionality of the input high-dimensional data, because the magnitude of the log-likelihood term in the ELBO scales with the dimensionality of the input data. The perplexity parameter is set to ten for scvis.
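The perplexity calibration in Eq. (5) is typically implemented as a binary search over the Gaussian precision. A minimal sketch for one cell, where `dist2_i` holds its squared distances to all other cells (so $p_{i|i} = 0$ by construction):

```python
import numpy as np

def conditional_p(dist2_i, target_perplexity=30.0, tol=1e-5, max_iter=50):
    """Compute p_{j|i} of Eq. (5) for one cell i, binary-searching the
    precision beta = 1/(2 sigma_i^2) until the perplexity
    2**(-sum_j p_{j|i} log2 p_{j|i}) matches the target."""
    lo, hi, beta = 0.0, np.inf, 1.0
    for _ in range(max_iter):
        w = np.exp(-beta * dist2_i)
        p = w / w.sum()
        perplexity = 2.0 ** (-(p * np.log2(p + 1e-12)).sum())
        if abs(perplexity - target_perplexity) < tol:
            break
        if perplexity > target_perplexity:    # distribution too flat: sharpen it
            lo = beta
            beta = beta * 2.0 if hi == np.inf else (lo + hi) / 2.0
        else:                                 # too peaked: flatten it
            hi = beta
            beta = (lo + hi) / 2.0
    return p
```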
Sensitivity of scvis on cell numbers. To test the performance of scvis on scRNA-seq datasets with small numbers of cells, we ran scvis on subsampled data from the bipolar dataset. The bipolar dataset consists of six batches; to remove batch effects, we only used the cells from batch six (6221 cells in total after removing cell doublets and contaminants). Specifically, we subsampled 1, 2, 3, 5, 10, 20, 30, and 50% of the cells from batch six (62, 124, 187, 311, 622, 1244, 1866, and 3110 cells, respectively). We then computed the principal components of the subsampled data and ran scvis using the top 100 PCs (for the cases with M < 100 cells, we used the top M PCs). For each subsampled dataset, we ran scvis ten times with different seeds, using exactly the same parameter setting for all datasets. Therefore, except for the models trained on the cases with <100 cells, all the models have the same number of parameters.

When the number of training data points was small (e.g., 62, 124, or 187 cells), only the large clusters such as cluster one and cluster two were distinct from the rest (Supplementary Figs. 11 and 12). As we increased the number of subsampled data points, some small clusters of cells could be recovered; at 622 cells, many cell clusters could be recovered. The Knn classification accuracies in Supplementary Fig. 13 (trained on the two-dimensional representations of the subsampled data and tested on the two-dimensional representations of the remaining cells from batch six) show relatively high mean accuracies of 84.4, 84.5, 84.1, and 81.2% for K equal to 5, 9, 17, and 33. When we subsampled 622 cells as in Supplementary Fig. 11e, the cells in cluster 22 were not present among the 622 cells. However, when we used the model trained on these 622 cells to embed the remaining 5599 cells (6221 minus 622), the cluster 22 cells were mapped to the "correct" region adjacent to the cluster 20 cells, bridging cluster 20 and other clusters as in Supplementary Fig. 11d, f–h. Interestingly, with smaller numbers of cells (cell numbers ≤622), Knn classifiers trained on the two-dimensional scvis coordinates were better than those trained on the 100 principal components (one-sample t-test FDR < 0.05; Supplementary Fig. 12, where the red triangles represent the Knn accuracy using the original 100 PCs).

There seems to be no noticeable overfitting even with small numbers of cells, as can be seen from the two-dimensional plots in Supplementary Figs. 11 and 12 and the Knn classification accuracies in Supplementary Fig. 13. To decrease the possibility of overfitting, we used Student's t-distributions instead of Gaussian distributions for the model neural networks. In addition, we used relatively small neural networks (a three-layer inference network with 128, 64, and 32 units and a five-layer model network with 32, 32, 32, 64, and 128 units), which may decrease the chances of overfitting.
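The subsampling protocol is simple to reproduce. A sketch, assuming `X` is the batch-six expression matrix and `run_scvis` a hypothetical wrapper around the scvis command-line interface:

```python
import numpy as np
from sklearn.decomposition import PCA

def subsample_and_embed(X, fraction, run_scvis, seed=0):
    """Subsample a fraction of cells, reduce to at most the top-100 PCs
    (top-M PCs when fewer than 100 cells remain), and embed with scvis."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=int(round(fraction * len(X))), replace=False)
    n_pcs = min(100, len(idx))
    pcs = PCA(n_components=n_pcs).fit_transform(X[idx])
    return idx, run_scvis(pcs, seed=seed)  # indices and 2-D coordinates of the subsample
```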
Sensitivity of scvis on hyperparameters. scvis has several hyperparameters, such as the number of layers in the variational inference and model neural networks, the layer sizes (the number of units in each layer), and the α parameter in Eq. (8). We established that scvis is typically robust to these hyperparameters. We first tested the influence of the number of layers using batch six of the bipolar dataset, learning scvis models with different numbers of layers from the 3110 subsampled data points. Specifically, we tested these variational inference network and model network layer combinations (the first number is the number of layers in the variational inference network, the second the number of layers in the model network): (10, 1), (10, 3), (10, 5), (10, 7), (10, 10), (7, 10), (5, 10), (3, 10), and (1, 10). These models performed reasonably well, in that cells from the same cluster were close to each other in the two-dimensional spaces (Supplementary Fig. 14). When the number of layers in the variational inference network was fixed at ten, the two-dimensional embeddings of some types of cells were close to each other and formed curve-like structures (Supplementary Fig. 14e). The reason for this phenomenon could be that the variational inference network underestimated the variances of the latent z posterior distributions, or that the optimization had not converged. On the contrary, when the inference networks had fewer layers (<10), we did not see these curve structures (Supplementary Fig. 14f–i). The out-of-sample mapping results in Supplementary Fig. 15 show similar behavior.

We computed the Knn classification accuracies of the out-of-samples. As before, the Knn classifiers were trained on the two-dimensional coordinates of the training subsampled data and used to classify the two-dimensional coordinates of the out-of-sample data. The parameter setting did influence scvis performance: for each K ∈ {5, 9, 17, 33, 65, 129, 257}, the trained scvis models performed significantly differently (Supplementary Fig. 16, FDR < 0.05, one-way analysis of variance (ANOVA) test). To find out which parameter combinations led to inferior or superior performance, we then compared the classification accuracies of each model with those of the most complex model, with ten layers in both the variational inference and model networks. The FDR values (two-sided Welch's t-test) at the top of each subfigure of Supplementary Fig. 16 show that, except for K = 257, all the models with one layer of variational inference networks did significantly worse than the most complex model (FDR < 0.05). Similarly, the models with three layers of variational inference networks did significantly worse when K ∈ {5, 9, 17, 33, 65}. For the other models, performance was not statistically different from that of the most complex model.

We next examined the influence of the layer sizes. The number of layers was fixed at ten for both the variational inference and model networks; the number of units in each layer was set to 8, 16, 32, 64, or 128, with all layers of the inference and model networks having the same size. All models successfully embedded both the training data and the out-of-sample test data (Supplementary Fig. 17). However, the layer size did influence scvis performance: the Knn classifiers on the out-of-sample data performed significantly differently (Supplementary Fig. 18, FDR < 0.05, one-way ANOVA test). The FDR values (two-sided Mann–Whitney U-test) at the top of each subfigure of Supplementary Fig. 18 show that all models with a layer size of eight did significantly worse than the most complex model with 128 units (FDR < 0.05). Similarly, the models with a layer size of 16 did significantly worse when K ∈ {5, 9, 17, 33, 65, 129} (FDR < 0.05). For the other models, performance was not statistically different from that of the most complex models. Notice that, at a layer size of 64, the mapping functions from one run were worse than the others at embedding the out-of-sample data; however, there was no significant difference in the log-likelihoods across the ten repeated runs (Supplementary Fig. 19, one-way ANOVA p-value = 0.741).

For the α weight parameter in Eq. (8), we set α relative to the dimensionality of the input data, at 0, 0.5, 1.0, 1.5, 2.0, 10.0, and ∞ times the input dimensionality. When α = ∞, the trained models did significantly worse than the models trained with the default α (equal to the dimensionality of the input data) for K ∈ {5, 9, 17, 33, 65} (FDR ≤ 0.05, two-sided Welch's t-test, Supplementary Figs. 20, 21, and 22). Also, when α = 0, the trained models were significantly worse than those with the default α for all values of K (FDR ≤ 0.05, two-sided Welch's t-test, Supplementary Fig. 22). For α = 0, we performed an extra comparison using the synthetic nine-dimensional data, showing that when K was large (≥65), setting α = 9 (the dimensionality of the input data) did significantly better than setting α = 0 (Supplementary Fig. 23, FDR ≤ 0.05, one-sided Welch's t-test).
Computational complexity analysis. The scvis objective function includes the asymmetric t-SNE objective function. The most time-consuming part is computing the pairwise distances between cells in a mini-batch, which takes $O(TN^2D + TN^2d)$ time, where N is the mini-batch size (we use N = 512 in this study), D is the dimensionality of the input data, d is the dimensionality of the low-dimensional latent variables (d = 2 for most cases), and T is the number of iterations. In our case, we first use PCA to project the scRNA-seq data to a 100-dimensional space, so D = 100. For a feedforward neural network with L hidden layers, where the number of neurons in layer l is $n_l$, the time complexity to train the network is $O\big(NT\sum_{l=0}^{L} n_{l+1} n_l\big)$, where $n_0 = D$ and $n_{L+1}$ is the size of the output layer. For the model neural network, we use a five-hidden-layer (layer sizes 32, 32, 32, 64, and 128) feedforward network; its input layer size is d and its output layer size is D. For the variational inference network, we use a three-hidden-layer (layer sizes 128, 64, and 32) feedforward network; its input layer size is D and its output layer size is d. For space complexity, we need to store the weights and biases of each neuron, $O\big(\sum_{l=0}^{L} n_{l+1} n_l\big)$, plus the $O(N^2)$ pairwise distances and the $O(ND)$ data of a mini-batch.

The original t-SNE algorithm is not scalable to large datasets (tens of thousands to millions of cells) because it computes the pairwise distances between all cells, taking $O(M^2D + M^2T)$ time and $O(M^2)$ space, where M is the total number of cells and T is the number of iterations. Approximate t-SNE algorithms are typically more scalable in both time and space. For example, BH t-SNE only computes the distance between a cell and its Knns; it therefore takes $O(M\log(M))$ time and $O(M\log(M))$ space, assuming K is on the order of $O(\log(M))$.

We next experimentally compared the scalability of scvis and BH t-SNE (the widely used Rtsne package) using the 1.3 million cells from 10X Genomics. BH t-SNE did not finish in 24 h, and we terminated it. On the contrary, scvis produced visually reasonable results in <33 min (after 3000 mini-batches of training, Supplementary Fig. 24a). Therefore, scvis can be much more scalable than BH t-SNE for very large datasets. As we increased the number of training mini-batches, we saw slightly better separation of clusters (Supplementary Fig. 24b–f). The time used to train scvis increased linearly with the number of training mini-batches (Supplementary Fig. 24h). However, for small datasets, BH t-SNE can be more efficient than scvis: for the melanoma dataset with only 4645 cells, scvis still took 24 min to run 3000 mini-batches, while BH t-SNE finished in only 28.9 s. All experiments were conducted on a Mac computer with 32 GB of RAM and a 4.2 GHz four-core Intel i7 processor with an 8 MB cache.

Finally, once the mapping function is trained, mapping new cells takes only $O\big(M\sum_{l=0}^{L+1} n_{l+1} n_l\big)$ time, where M here is the number of input cells. Also, because each data point can be mapped independently, the space complexity can be as low as $O\big(\sum_{l=0}^{L+1} n_{l+1} n_l\big)$. As an example, it took only 1.5 s for a trained scvis model to map the entire 1.3 million cells from 10X Genomics.
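The quadratic mini-batch term comes from the dense pairwise-distance computation. A vectorized sketch of that $O(N^2D)$ step:

```python
import numpy as np

def pairwise_sq_dists(X):
    """X is an (N, D) mini-batch; returns the (N, N) squared Euclidean
    distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, i.e., one
    N x D by D x N matrix multiplication."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.maximum(d2, 0.0)  # clip tiny negatives from floating-point error
```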
Datasets. The oligodendroglioma dataset measures the expression of 23,686 genes in 4347 cells from six IDH1 or IDH2 mutant human oligodendroglioma patients. The expression of each gene is quantified as log2(TPM/10 + 1), where "TPM" stands for "transcripts per million". Through copy number estimations from these scRNA-seq measurements, 303 cells without detectable copy number alterations were classified as normal cells. These normal cells can be further grouped into microglia and oligodendrocytes based on a set of marker genes they express. Two patients show subclonal copy number alterations.

The melanoma dataset is from sequencing 4645 cells isolated from 19 metastatic melanoma patients. The cDNAs from each cell were sequenced on an Illumina NextSeq 500 instrument to 30-bp paired-end reads, with a median of ~150,000 reads per cell. The expression of each gene (23,686 genes in total) is quantified by log2(TPM/10 + 1). In addition to malignant cells, the authors also profiled immune cells, stromal cells, and endothelial cells to study the whole-tumor multi-cellular ecosystem.

The bipolar dataset consists of low-coverage (median depth of 8200 mapped reads per cell) Drop-seq sequencing of 27,499 mouse retinal bipolar neural cells from a transgenic mouse. In total, 26 putative cell types were identified by clustering the first 37 principal components of all the 27,499 cells. Fourteen clusters can be assigned to bipolar cells, and another major cluster is composed of Mueller glia cells. These 15 clusters account for about 96% of all the 27,499 cells. The remaining 11 clusters (comprising only 1060 cells) include rod photoreceptors, cone photoreceptors, amacrine cells, and cell doublets and contaminants.

The retina dataset consists of low-coverage Drop-seq sequencing of 44,808 cells from the retinas of 14-day-old mice. By clustering the two-dimensional t-SNE embedding using DBSCAN, a density-based clustering algorithm, the authors identified 39 clusters after merging the clusters without enough differentially expressed genes between any two clusters.

The 10X Genomics neural cell dataset consists of 1,306,127 cells from the cortex, hippocampus, and subventricular zones of two E18 C57BL/6 mice. The cells were sequenced on 11 Illumina HiSeq 4000 machines to produce 98-bp reads.

For the mass cytometry dataset H1, manual gating assigned 72,463 cells to 14 cell types based on 32 measured surface protein markers. For H2, manual gating assigned 31,721 cells to the same 14 cell populations based on the same 32 surface protein markers.
Statistical analysis. All statistical analyses were performed using the R statistical software package, version 3.4.3. Boxplots denote the medians and the interquartile ranges (IQRs); the whiskers of a boxplot are the lowest datum still within 1.5 IQR of the lower quartile and the highest datum still within 1.5 IQR of the upper quartile. The full datasets were superimposed on the boxplots. For datasets with non-normal distributions (e.g., with outliers), non-parametric tests were used. To account for unequal variances, Welch's t-test was used for pairwise data comparisons. Adjusted p-values <0.05 (FDR, the Benjamini–Hochberg procedure) were considered significant.
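The Benjamini–Hochberg adjustment referenced above is only a few lines. A minimal sketch:

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values (q-values) for an array of raw p-values:
    q_(i) = min_{k >= i} p_(k) * m / k over the sorted p-values."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]  # enforce monotonicity
    out = np.empty(m)
    out[order] = np.minimum(adjusted, 1.0)
    return out
```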
Code availability. The scvis v0.1.0 Python package is freely available from bitbucket: https://bitbucket.org/jerry00/scvis-dev.

Data availability. The scRNA-seq data that support the findings of this study are available in the Gene Expression Omnibus with the identifiers bipolar: GSE81905, retina: GSE63473, oligodendroglioma: GSE70630, metastatic melanoma: GSE72056, and E18 mouse neural cells: GSE93421. The E18 mouse neural cells are freely available from 10X Genomics. The other scRNA-seq data are publicly available from the single-cell portal. The mass cytometry data can be downloaded from Cytobank. The synthetic data used in this study are available from the bitbucket repo: https://bitbucket.org/jerry00/scvis-dev.

Received: 16 July 2017. Accepted: 25 April 2018.

References
1. Shekhar, K. et al. Comprehensive classification of retinal bipolar neurons by single-cell transcriptomics. Cell 166, 1308–1323 (2016).
2. Patel, A. P. et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344, 1396–1401 (2014).
3. Tirosh, I. et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science 352, 189–196 (2016).
4. Navin, N. et al. Tumor evolution inferred by single cell sequencing. Nature 472, 90–94 (2011).
5. Picelli, S. et al. Full-length RNA-seq from single cells using Smart-seq2. Nat. Protoc. 9, 171–181 (2014).
6. Jaitin, D. A. et al. Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types. Science 343, 776–779 (2014).
7. Islam, S. et al. Quantitative single-cell RNA-seq with unique molecular identifiers. Nat. Methods 11, 163–166 (2014).
8. Macaulay, I. C. et al. G&T-seq: parallel sequencing of single-cell genomes and transcriptomes. Nat. Methods 12, 519–522 (2015).
9. Macosko, E. Z. et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).
10. Klein, A. M. et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201 (2015).
11. Hashimshony, T. et al. CEL-Seq2: sensitive highly-multiplexed single-cell RNA-seq. Genome Biol. 17, 77 (2016).
12. Gierahn, T. M. et al. Seq-Well: portable, low-cost RNA sequencing of single cells at high throughput. Nat. Methods 14, 395–398 (2017).
13. Zheng, G. X. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
14. Cao, J. et al. Comprehensive single-cell transcriptional profiling of a multicellular organism. Science 357, 661–667 (2017).
15. Rosenberg, A. B. et al. Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. Science 360, 176–182 (2018).
16. Bendall, S. C. et al. Single-cell mass cytometry of differential immune and drug responses across a human hematopoietic continuum. Science 332, 687–696 (2011).
17. Levine, J. H. et al. Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell 162, 184–197 (2015).
18. Regev, A. et al. The human cell atlas. eLife https://doi.org/10.7554/eLife.27041 (2017).
19. Wagner, A., Regev, A. & Yosef, N. Revealing the vectors of cellular identity with single-cell genomics. Nat. Biotechnol. 34, 1145–1160 (2016).
20. Buettner, F. et al. Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat. Biotechnol. 33, 155–160 (2015).
21. Vallejos, C. A., Risso, D., Scialdone, A., Dudoit, S. & Marioni, J. C. Normalizing single-cell RNA sequencing data: challenges and opportunities. Nat. Methods 14, 565–571 (2017).
22. Bacher, R. et al. SCnorm: robust normalization of single-cell RNA-seq data. Nat. Methods 14, 584–586 (2017).
23. Qiu, X. et al. Single-cell mRNA quantification and differential analysis with Census. Nat. Methods 14, 309–315 (2017).
24. Svensson, V. et al. Power analysis of single-cell RNA-sequencing experiments. Nat. Methods 14, 381–387 (2017).
25. Ziegenhain, C. et al. Comparative analysis of single-cell RNA sequencing methods. Mol. Cell 65, 631–643 (2017).
26. Stegle, O., Teichmann, S. A. & Marioni, J. C. Computational and analytical challenges in single-cell transcriptomics. Nat. Rev. Genet. 16, 133–145 (2015).
27. Angerer, P. et al. destiny: diffusion maps for large-scale single-cell data in R. Bioinformatics 32, 1241–1243 (2015).
28. Pierson, E. & Yau, C. ZIFA: dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol. 16, 241 (2015).
29. DeTomaso, D. & Yosef, N. FastProject: a tool for low-dimensional analysis of single-cell RNA-seq data. BMC Bioinformatics 17, 315 (2016).
30. Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 32, 381–386 (2014).
31. Setty, M. et al. Wishbone identifies bifurcating developmental trajectories from single-cell data. Nat. Biotechnol. 34, 637–645 (2016).
32. Campbell, K. R. & Yau, C. Probabilistic modeling of bifurcations in single-cell gene expression data using a Bayesian mixture of factor analyzers. Wellcome Open Res. 2, 19 (2017).
33. Street, K. et al. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. bioRxiv https://doi.org/10.1101/128843 (2017).
34. Maaten, L. v. d. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
35. Hinton, G. E. & Roweis, S. T. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems 15 (eds Becker, S., Thrun, S. & Obermayer, K.) 857–864 (MIT Press, Cambridge, 2003).
36. Cook, J., Sutskever, I., Mnih, A. & Hinton, G. E. Visualizing similarity data with a mixture of maps. In Proc. Eleventh International Conference on Artificial Intelligence and Statistics, Vol. 2 of Proceedings of Machine Learning Research (eds Meila, M. & Shen, X.) 67–74 (PMLR, San Juan, Puerto Rico, 2007).
37. Carreira-Perpinán, M. A. The elastic embedding algorithm for dimensionality reduction. In Proc. 27th International Conference on Machine Learning 167–174 (Haifa, Israel, 2010).
38. Yang, Z., Peltonen, J. & Kaski, S. Scalable optimization of neighbor embedding for visualization. In Proc. 30th International Conference on Machine Learning (eds Dasgupta, S. & McAllester, D.) 127–135 (PMLR, Atlanta, Georgia, 2013).
39. Maaten, L. v. d. Accelerating t-SNE using tree-based algorithms. J. Mach. Learn. Res. 15, 3221–3245 (2014).
40. Amir, E.-a. D. et al. viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia. Nat. Biotechnol. 31, 545–552 (2013).
41. Zurauskiene, J. & Yau, C. pcaReduce: hierarchical clustering of single cell transcriptional profiles. BMC Bioinformatics 17, 140 (2016).
42. Wattenberg, M., Viégas, F. & Johnson, I. How to use t-SNE effectively. Distill http://distill.pub/2016/misread-tsne (2016).
43. Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous systems. Preprint at https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45166.pdf (2015).
44. Tirosh, I. et al. Single-cell RNA-seq supports a developmental hierarchy in human oligodendroglioma. Nature 539, 309–313 (2016).
45. Tickle, T. et al. Single cell portal. https://portals.broadinstitute.org/single_cell (2017).
46. Clevert, D.-A., Unterthiner, T. & Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). In 4th International Conference on Learning Representations (San Juan, Puerto Rico, 2016).
47. Kingma, D. P. & Ba, J. L. Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations (San Diego, CA, 2015).
48. Krijthe, J. H. Rtsne: t-distributed stochastic neighbor embedding using Barnes-Hut implementation. https://github.com/jkrijthe/Rtsne, R package version 0.13 (2015).
49. Lawrence, N. D. Gaussian process latent variable models for visualisation of high dimensional data. In Advances in Neural Information Processing Systems 16 (eds Thrun, S., Saul, L. K. & Schölkopf, B.) 329–336 (MIT Press, Cambridge, 2004).
50. GPy. GPy: a Gaussian process framework in Python. http://github.com/SheffieldML/GPy (2012).
51. Maaten, L. v. d. Learning a parametric embedding by preserving local structure. In Proc. Twelfth International Conference on Artificial Intelligence and Statistics, Vol. 5 of Proceedings of Machine Learning Research (eds van Dyk, D. & Welling, M.) 384–391 (PMLR, Clearwater Beach, Florida, 2009).
52. Ding, J., Shah, S. & Condon, A. densityCut: an efficient and versatile topological approach for automatic clustering of biological data. Bioinformatics 32, 2567–2576 (2016).
53. Smyth, G. Limma: linear models for microarray data. In Bioinformatics and Computational Biology Solutions Using R and Bioconductor (eds Gentleman, R., Carey, V. J., Huber, W., Irizarry, R. A. & Dudoit, S.) 397–420 (Springer, New York, 2005).
54. Wang, B., Zhu, J., Pierson, E. & Batzoglou, S. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat. Methods 14, 414–416 (2017).
55. Li, H. et al. Gating mass cytometry data by deep learning. Bioinformatics 33, 3423–3430 (2017).
56. Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. In Proc. 2nd International Conference on Learning Representations (Banff, Alberta, 2014).
57. Rezende, D. J., Mohamed, S. & Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Proc. 31st International Conference on Machine Learning (eds Xing, E. P. & Jebara, T.) 1278–1286 (PMLR, Beijing, 2014).
58. Kullback, S. & Leibler, R. A. On information and sufficiency. Ann. Math. Stat. 22, 79–86 (1951).
59. 10X Genomics. 1.3 million brain cells from E18 mice. https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.3.0/1M_neurons (2017).
60. Wagner, G. P., Kin, K. & Lynch, V. J. Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci. 131, 281–285 (2012).
61. Ester, M. et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD'96 Proc. Second International Conference on Knowledge Discovery and Data Mining (eds Simoudis, E., Han, J. & Fayyad, U.) 226–231 (AAAI Press, Portland, Oregon, 1996).
62. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B (Methodol.) 57, 289–300 (1995).
63. Levine, J. H. et al. Phenograph. https://www.cytobank.org/nolanlab/reports/Levine2015.html (2015).
Acknowledgements
This work was supported by a Discovery Frontiers project grant, "The Cancer Genome Collaboratory", jointly sponsored by the Natural Sciences and Engineering Research Council (NSERC), Genome Canada (GC), the Canadian Institutes of Health Research (CIHR), and the Canada Foundation for Innovation (CFI) to S.P.S. In addition, we acknowledge generous long-term funding support from the BC Cancer Foundation. The S.P.S. group receives operating funds from the Canadian Breast Cancer Foundation, the Canadian Cancer Society Research Institute (impact grant 701584 to S.P.S.), the Terry Fox Research Institute (grant 1021, The Terry Fox New Frontiers Program Project Grant in the Genomics of Forme Fruste Tumours: New Vistas on Cancer Biology and Treatment, and grant 1061, The Terry Fox New Frontiers Program Project Grant in Overcoming Treatment Failure in Lymphoid Cancers), CIHR (grant MOP-115170 to S.P.S.), and the CIHR Foundation (grant FDN-143246 to S.P.S.). S.P.S. is supported by Canada Research Chairs, is a Michael Smith Foundation for Health Research scholar, and is a Susan G. Komen Scholar.

Author contributions
J.D.: project conception, software implementation, and data analysis; J.D., A.C., and S.P.S.: algorithm development and manuscript writing; S.P.S.: project conception, oversight, and senior responsible author.

Additional information
Supplementary Information accompanies this paper at https://doi.org/10.1038/s41467-018-04368-5.

Competing interests: S.P.S. is a shareholder of Contextual Genomics Inc. The remaining authors declare no competing interests.

Reprints and permission information is available online at http://npg.nature.com/reprintsandpermissions/

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2018
