TY - JOUR AB - Introduction Recent advances allow high-throughput measurement of biological information from individual cells [1–12]. This is an improvement over standard experiments which mask the range of states in the population because they average over millions of cells. Therefore, it is expected that single-cell technologies can reveal new biology, such as the diversity of states of cells in a tissue [13–21]. These experiments portray each cell as a point in a high-dimensional space whose axes are the expression level of each gene, or other molecular parameters. The geometry of how cells are distributed in gene expression space is an open question. One possibility is that each cell type forms a tight cluster, and that these clusters are well separated from each other. This assumption is at the heart of clustering analyses of gene expression data [22,23]. The tight-cluster picture relates to the idea of discrete cell types, which is supported by the existence of marker genes that are mutually exclusive between cells. When considering a set of many genes, in contrast to only marker genes, it is possible that cells also form more continuous distributions in gene expression space, and that clusters are more difficult to define. Such distributed states have been suggested in studies on T-cells [24–26] and embryonic stem cells [27]. In such cases, an open question is whether there is meaningful geometry to this continuum of cell states. The question of geometry in gene expression space was recently addressed in the context of a theory on evolutionary tradeoffs [28]. The theory suggests that cells that need to perform multiple tasks are arranged in a simple, low dimensional polygons or polyhedra in gene expression space. The vertices of these shapes, called archetypes, are the expression profiles optimal for each task. Thus, two tasks correspond to data on a line, three tasks to data on a triangle, four tasks to a tetrahedron, and so on. These shapes are generally called polytopes: the generalization of polygons and polyhedral to any number of dimensions. These polytopes represent the optimal tradeoffs between the tasks, in the sense that for any point outside the polytope there is a point inside it that performs equally or better at all tasks. This corresponds to the concept of Pareto optimality, where the polytopes are Pareto fronts for the system [28–32]. Pareto optimality has been applied to a variety of biological datasets [30,33–36], but not to single-cell data. Here, we study single-cell expression from several tissues, collected with different single-cell technologies (Fig 1). We analyze single-cell qPCR data on human and mouse colonic crypts from Dalerba et al. and Rothenberg et al. [2,37], single-cell mass cytometry data on individual human bone marrow cells from Bendall et al [13,25], and single-cell RNA-Seq on mouse dendritic cells from Jaitin et al [3]. We test whether each dataset is well-described by a low dimensional polytope. We fit the data to a series of polytopes (line, triangle, tetrahedron, etc.), finding the best fit polytope, and assess its statistical significance. We then analyze the gene expression profiles at the cells closest to the vertices (archetypes), to test if they correspond to specific biological tasks. This offers a way to discover potential tasks of cells in a tissue from single-cell data. Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 1. Overview of Pareto archetype analysis of single-cell datasets to discover polytopes in gene expression space and infer tasks. Datasets from different human and mouse tissues analyzed by different groups with different technologies were analyzed by Pareto archetype analysis. Best fit polytopes and their significance were found. Tasks were inferred from the genes maximally enriched in the cells closest to each vertex of the polytope. https://doi.org/10.1371/journal.pcbi.1004224.g001 We find evidence for polytopes and enriched tasks in all of the datasets we analyzed. The well-studied human and mouse colonic crypt system [38,39] includes stem cells at the bottom of the crypt, which differentiate into enterocytes that absorb nutrients, and secretory cells, mainly goblet cells, which secrete mucus. We find that single human intestinal cells are arranged in gene expression space, to a good approximation, in a tetrahedron. At its vertices are expression profiles of pure cell types—enterocytes, goblet cells, putative stem cells and a new category of nodal-expressing cells. We further find that when analyzing the intestinal progenitor cells alone, they fill out a distinct tetrahedron in gene expression space, in contrast to other cell types. The vertices of this tetrahedron correspond to four progenitor cell tasks. In addition, we find that the polytopes found in crypt data from human and mice are strikingly similar. Using the same approach, we find that human bone marrow cell data is arranged in a five-vertex simplex (four-dimensional simplex) whose vertices correspond to five major cell types, and that mouse dendritic cells uniformly fill a tetrahedron suggesting four immune tasks including maturation, pathogen sensing and communication with lymphocytes. Pareto analysis can thus be useful to understand the geometry of single-cell gene expression, and to infer the tasks of single cells in a tissue. Results Pareto analysis on human crypt single-cell expression suggests four archetypes We begin with analyzing the single-cell gene expression dataset of human colon crypt cells obtained by Dalerba et al [2]. The dataset included 407 individual cells, each analyzed by single-cell qPCR for 83 selected genes in a Fluidigm microfluidic system [5]. We normalized the data by subtracting the mean of each gene as described in Methods. To address the effects of outliers and all/none bimodality in expression data ([11,12]), we removed 10% of the cells which had the lowest expression levels across genes, and 10% of the genes which had the lowest expression across cells, resulting in a dataset of 368 cells and 76 genes (S1 Dataset). Removing more cells or genes, up to 25% of the lowest expressing cells or lowest expressed genes, leaves the results essentially unchanged (S1 Fig and S1A Text). Based on theory on evolutionary tradeoffs between tasks [28], we expect that cells should fall in a low-dimensional polytope whose vertices are the points optimal in each task alone. We therefore asked whether gene expression data is enclosed within a low dimensional polytope (e.g. line, triangle, tetrahedron and so on). We calculated the best fit polytopes that enclose the data, and asked how well they describe the data compared to randomized datasets. We considered polytopes with k vertices: we tested k = 2 (line), k = 3 (triangle), k = 4 (tetrahedron) and so on up to 11 vertices. The polytopes were found using the PCHA algorithm [40]. This algorithm seeks k points on the convex hull of the data that define a polytope that encloses as much of the data as possible (see Methods: Archetype detection using the PCHA algorithm [40]). For each polytope, we calculated the deviation of the data from the polytope (explained variance, Methods: Determining the number of archetypes). k = 4 archetypes explain 45% of the variance, whereas adding a fifth archetype added only 4% additional explained variance (Fig 2a). A tetrahedron explained the data variance better than when applying the same algorithm to shuffled data (p = 0.01, Methods: Statistical significance of best fit polytopes). This suggests that a 4-vertex polyhedron, namely a tetrahedron, is a reasonable description of the data. Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 2. Human colon crypt cells fall in a tetrahedron in gene expression space. (a) For k = 2–11 we found the k-polytope that best fit the data using PCHA algorithm, considering all 76 dimensions. Explained variance of best fit polytopes with k = 2–11 vertices begins to saturate at k = 4 or k = 5 vertices. (b) Comparison between the variance explained by the first k principal components of the data to the variance explained by the k principal components of shuffled data suggests that effective data dimensionality is three or four. Blue line: variance explained by PCA of intestinal data. Green line: variance explained by PCA of shuffled data. Points represent mean values. Error bars, representing 5%-95% variation intervals, are smaller than line width. Points for which the real data EV is higher than the randomized data EV are marked with *. (c) Data displayed in first 3 PCs axes resembles a tetrahedron, and its projections on principal planes (d)-(f) resemble triangles. Archetypes and their variation upon data resampling (bootstrapping) are shown as colored ellipses (see S1B Text). Thin lines—tetrahedron edges. https://doi.org/10.1371/journal.pcbi.1004224.g002 The tetrahedron was found in the full 76-dimensional space. To display the data, it is helpful to use principal component analysis (PCA). PCA indicated that 3 principal components (3 PCs) explain most of the variance in the data (47%), and that 4 PCs (52%) are significant compared to shuffled data (Fig 2b, Methods: Determining the number of archetypes), indicating that the data is indeed low-dimensional. The tetrahedrality of the data is highlighted by the fact that 96% of the variance explained by the first 3PCs is explained also by the much more stringent description of a tetrahedron whose vertices are on the convex hull of the data. Plotting the data, with each cell represented by a point in the space spanned by the first 3PCs of gene expression space, suggests a tetrahedron-like shape (Fig 2c). The projections of the data on the three principal planes are roughly triangular (Fig 2d–2f), and show well-defined linear edges which meet at pointy vertices. We found that the archetype positions were robust to data sampling, with errors on the order of a few percent in bootstrapping tests in which data is resampled with replacement (S1B Text and S1a Table and Fig 2c). We note that clustering methods, such as k-means or hierarchical clustering, are much more sensitive to sampling errors in this dataset: the continuous distribution of the data makes the cluster boundaries somewhat arbitrary and thus on the order of 20% of the data points are classified to different clusters upon bootstrapping (S1C Text and S2–S4 Figs). Each archetype is enriched with markers for a major crypt cell type Each of the four vertices of the best-fit tetrahedron is a point in the 76-dimensional gene expression space. Within Pareto theory, each vertex is an archetype that can be thought of as an optimal gene expression profile, extrapolated from the data, which best performs the archetype's task. The gene profiles for the archetypes are shown in Fig 3a. We find that each of the four archetypes shows high expression of a set of markers for a specific crypt cell type (Fig 3b and 3c and S5 Fig). Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 3. Expression profiles of the four colon crypt archetypes are each enriched for markers of specific cell types. (a) The expression profiles of the four archetypes, with enriched genes colored. Enriched genes were determined by leave-1-out enrichment analysis, binning the cells according to distance from each archetype and seeking when average expression in the bin closest to the archetype is maximal, as described in Methods: 1D Gene enrichment at archetypes (See full enriched genes list in S2 Table). Light blue—enterocyte archetype, yellow—Nodal archetype, green—stem cells archetype, red—goblet cell archetype. Genes that are not enriched, or enriched in more than one archetype, are in dark blue. Zero level represents the average expression of each gene in the dataset. (b) Leave-1-out enrichment plot: expression of a gene (SLC26A3—an enterocyte marker) as a function of distance from archetype in equal mass bins of cells (Methods: 1D Gene enrichment at archetypes), line color indicates archetype. This gene is maximally enriched only at the enterocyte archetype (blue line). For enrichment plots for additional genes see S5 Fig. (c) A two dimensional enrichment plot of SLC26A3, in which its expression is plotted on the plane of the first 2PCs of the data, indicating expression is maximal in the cells closest to the enterocyte archetype. Contours are expression density estimated using a Gaussian kernel (Methods: 2D Gene enrichment at archetypes). Archetype positions and PCs were calculated without the tested gene. https://doi.org/10.1371/journal.pcbi.1004224.g003 Archetype 1 shows high expression of enterocyte markers (AQP8, SLC26A3, MS4A12, KRT20 [41–44]). The cells closest to this archetype show the maximal expression of these markers in the entire dataset (Methods: 1D Gene enrichment at archetypes). In cells closest to archetype 2, putative stem cells markers are maximally expressed (including LGR5, ASCL2, AXIN2, c-MYC, CDK6, OLFM4 [45–49]). In cells closest to archetype 3, markers for goblet cells are maximally expressed (MUC2, TFF3, SPDEF and others [38,50,51]). The complete list of enriched genes is shown in S2 Table. Thus, the first three archetypes correspond each to one of the three main crypt cell types. The fourth archetype is discussed below. We also evaluated the physical position of each cell in the crypt using a proxy for height in the crypt, Axin2 expression [52]. We find that the stem cell archetype has highest Axin2, followed by the goblet archetype. The enterocyte archetype has lowest Axin2. This matches the known arrangement of cells in the crypt, with stem cells at the bottom of the crypt, and enterocytes at its top (S6a Fig) [37]. One archetype represents a novel Nodal cell class which may be an intermediate between stem cells and enterocytes The fourth archetype is enriched with a specific set of genes related to development and embryonic patterning (NODAL, CFC1, TDGF1 [53–56]), a transcriptional repressor that has a role in development (PCGF6 [57]), and an enzyme involved in hormone secretion (UGT1A1 [58]), and a member of the claudin family CLDN8 [59]. We call these cells Nodal cells. Their position in the crypt, based on Axin2 levels, is intermediate between the bottom and top (S6a Fig). To better understand Nodal cells, we ordered the cells according to a pseudo-temporal order inferred from their gene expression using the Wanderlust algorithm [60]. NODAL cells seem to lie on the developmental axis between stem cells and enterocytes (S7a Fig). This tentatively suggests NODAL cells as a differentiation step between precursors and enterocytes. In summary, Pareto analysis finds four archetypes which correspond to gene expression profiles. Three of these define the three main cell types in the crypt. The fourth may indicate a step between stem cells and enterocytes. The archetypes can be interpreted as an idealized gene profile for each of the cell types, and cells of a given type are arranged in proximity to the corresponding archetype in gene expression space. Progenitor cells uniformly fill a tetrahedron with archetypes for stemness and three differentiation tasks We next zoom in on each inferred cell type to examine the variation between cells within a type. We repeated the Pareto analysis on each class of cells separately (S1D Text). We find that the three non-progenitor cell types (enterocytes, goblet cells, nodal cells) cannot be explained by a statistically significant polytope with 5 or less vertices (see S3 Table, all p-values >0.15). Thus, the expression of these cells seems to form a cloud in gene expression space with no easily discernible vertices (Fig 4a). This may hint that other effects dominate the structure of the data, or that not enough cells or not enough relevant genes were measured, see S1E Text and S8 Fig. Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 4. Progenitor crypt cells fall in a tetrahedron. (a) Enterocytes, goblet cells and nodal cells analyzed separately do not form significant polytopes. Cells are color coded by type in the tetrahedron of Fig 3, and each cell class is plotted in its own first 3PC. (b) Progenitor cells analyzed separately fall uniformly in a tetrahedron. The best fit tetrahedron is shown (PCHA delta = 0.5). Arrow represents direction of development according to Axin2 levels, see S7b Fig. Also shown are projections on the principal planes, which resemble triangles or quadrangles. Archetypes and their variation upon data resampling (bootstrapping) are shown as gray ellipses. (c) Explained variance as a function of polytope order k or number of PCs D both suggest a tetrahedron (k = 4, D = 3). https://doi.org/10.1371/journal.pcbi.1004224.g004 In contrast, the progenitor cells were well-described by a tetrahedron (p = 0.01, Methods: Statistical significance of best fit polytopes, Fig 4b showing tetrahedron and projections, Fig 4c shows explained variance curves, S2 Dataset). The progenitor cells seem to uniformly fill out this tetrahedron, suggesting that precursors span a continuum of gene expression states, with some cells coming close to one of four archetypal precursor profiles, and others showing a more even mixture of the archetypes. We examined the expression profiles of the cells closest to each archetype (Methods: 1D Gene enrichment at archetypes). Archetype 1 is enriched with stem markers LGR5 and ASCL2, and lies physically at the lowest point in the crypt according to Axin2 levels (S7b Fig inset). It has proliferation markers (MYC, Cdk6), and telomerase (TERT). These genes are characteristic of dividing stem cells [37,46,48,49]. The other three archetypes all display the progenitor marker OLFM4 [61], but also have characteristics more similar to the differentiated cells. Archetype 2 includes enrichment in enterocyte and goblet markers. Archetype 3 is enriched in division inhibitor CDKN1A [62]. Archetype 4 has low expression of all genes. We hypothesize that these three archetypes represent three tasks needed for differentiation: (i) expression of effector cell-specific genes (ii) inhibition of cell division (iii) reduction in global gene expression. For a complete list of enriched genes see S4 Table, archetype gene expression profiles are shown in S9 Fig. The progenitor cells fill out the tetrahedron quite uniformly. According to Axin2 expression, as they move up the crypt they move away from the stem archetype and parallel to the plane defined by the other three archetypes. Pseudo-temporal order derived by the Wanderlust algorithm suggests similar conclusion (S7b Fig). This may suggest multiple temporal paths between the three tasks, such that each progenitor is a different weighted average of the archetypes. We also asked how the progenitor tetrahedron relates to the rest of the cells in the crypt. We find that the fully differentiated cells types (enterocytes, goblet cells) are closest to archetype 4, because they all have lower overall gene expression than the progenitors. Similar polytopes for mouse and human intestinal crypt single-cell data Up to now we analyzed a human crypt dataset. We now compare it to a mouse crypt dataset, presented by Rothenberg et al. [37] using qPCR (S3 Dataset). We used the same Pareto analysis approach. We removed the lowest 10% of cells and genes, remaining with 161 cells and 41 genes. The two datasets overlap in 24 genes. The mouse cells in the dataset were only from the bottom of the crypt: they were harvested by cell sorting (FACS) using the markers CD66 and CD44. We find that the mouse single-cell data is well-described by a triangle (p = 0.003, 2PCs explain 43% of the variance). To compare the mouse data to human data described in the previous sections, we analyzed human cells in the lower part of the crypt (defined in [2], by FACS sort for high and low cells using CD66 and CD44 markers). This dataset (213 cells) is also well described by a triangle (p = 0.001, 2PCs describe 35% variance). The archetypes in both the datasets correspond to stem cells, goblet cells and enterocytes. The nodal cells are on the edge connecting progenitors and enterocytes (Fig 5b), representing the projection of the nodal tetrahedral archetype on the triangle that describes the crypt-bottom cells. In the mouse dataset 5 out of the 6 genes that are enriched in the Nodal archetype were not measured; however the Nodal co-receptor TDGF1 [27] is highly expressed on the edge that connects stem cells and enterocytes. Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 5. Mouse and human colon lower crypt cells fall on similar triangles and show a similar distribution within the triangle. (a) Mouse colon lower crypt cells dataset by [37], plotted on its first 2PCs plane. Inset: PCA explained variance analysis suggest k = 3 vertices and D = 2 dimensions, namely a triangle. (b) Human lower crypt dataset plotted on its first 2PC plane. Inset: explained PCA explained variance analysis. The 24 genes common to the two datasets were used. Arrow indicates projection of the nodal cell archetype on the triangle. https://doi.org/10.1371/journal.pcbi.1004224.g005 The geometry of mouse and human datasets are strikingly similar (Fig 5). The three mouse archetypes are very similar to the three corresponding human archetypes in their overlapping genes (R2 = 0.65, p<10–9, S10 Fig, enriched genes are shown in Fig 5), and signify stem cells, goblet cells and enterocytes. Moreover, the distribution of points on the triangle suggests a gradual differentiation process from stem cells to enterocytes, compared to more abrupt temporal switch in the case of differentiation to goblets, a difference shared by both species. We also analyzed single-cell data by Dalerba et al [2] for human colon cancer xenograft in mouse, derived from a single cancer cell (S11 Fig and S1F Text and S4 Dataset). We compared this data to the triangle found when analyzing human crypt-bottom normal tissue cells (Fig 5b). We find that the cancer cells lie in a similar triangle, with a density distribution that peaks near the stem cells, enterocyte and goblet archetypes. This hints that the human cancer cells undergo differentiation similar to the normal mouse and human tissues [2]. Analysis of single-cell mass-cytometry of bone marrow cells and RNA-Seq of dendritic cells reveals polytopes and tasks Finally, we asked whether this approach can be used with other experimental techniques and other tissue types. We studied a single-cell mass cytometry (Cytof) dataset on human bone marrow by Bendall et al [13,25] (S5 Dataset). Here, 10,000 cells were each characterized by the expression of 31 proteins detected using antibodies. We find that this dataset is well-described by a four-dimensional simplex (a polytope with five vertices, p = 0.005, Fig 6 top row and S12 Fig). The five archetypes are each enriched with genes that clearly define a specific cell type (CD4 T cells, CD8 T cells, monocytes/ macrophages, B cells and non-leukocytes) (see S1G Text and S5 Table and S13 and S14 Figs). Cells density peaks near each vertex. A sizable set of cells, which formed an unidentified cluster in the viSNE analysis of [13], lies in the middle of the tetrahedron (S15 Fig), possibly indicating cells whose protein expression is intermediate between the classical cell type profiles. Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 6. Different tissues analyzed by different single-cell technologies show polytopes and tasks. (a) Human bone marrow cells analyzed by single-cell mass cytometry [13,25] in which proteins are detected using mass-tagged antibodies is well described by a 4D simplex (a polytope with 5 vertices). The simplex is shown projected on the first 3PCs, for other projections see S12c Fig. The archetypes correspond to cell types as indicated. Cell density peaks near each archetype. (b) Mouse spleen LPS stimulated dendritic cells analyzed by single-cell RNA-Seq [3] are well described by a tetrahedron. Archetypes are labeled with functions inferred from genes maximally enriched in cells near each archetype. For more details see S1G and S1H Text. https://doi.org/10.1371/journal.pcbi.1004224.g006 We also analyzed a single-cell RNA-Seq dataset from mouse spleen by Jaitin et al [3] (S6 Dataset). Each cell is characterized by 20,091 gene expression counts, based on sampling a fraction of the cell’s mRNA pool. This data was classified by Jaitin et al into seven groups of cells using a probabilistic mixture model. One group of cells, however, seemed to defy clear clustering (class VII in [3]). These are the dendritic cells in the spleen, known to carry out a wide range of immune functions including detection of pathogens and activation of lymphocytes [63–65]. We find that LPS treated dendritic cells in this dataset are well-described by a tetrahedron (312 cells, p<0.001, Fig 6 bottom row). Several functional gene groups [66] (see S1H Text) are highly and maximally enriched in the cells closest to each of the four archetypes (S6 Table). This enrichment indicates four key immune tasks: Archetype 1- response to virus (cytoplasmic DNA response and interferon pathways) [7,67]; Archetype 2- Dendritic cell maturation and formation of cytoskeletal features [64,68,69]; Archetype 3- Stimulation of lymphocytes (cytokine secretion, antigen presentation) [7,70,71]; Archetype 4- putative cell-death pathways [72,73]. We compared the tissues in terms of how uniformly they fill out their polyhedra or polytopes. We therefore computed the ratio ρ between the mean local density of the data and the local density expected in a uniform distribution [74] (S1I Text and S16 Fig). The closer this ratio is to one, the more uniform the distribution of points; a high ratio means the data is clumped into clusters. We find that bone marrow cells are the most clustered or clumped among the datasets (ρ = 4.67), in line with the classic view of hierarchical hematopoietic lineages [75,76]. Intestinal cells are less clustered (ρ = 2.93). The closest to uniform distribution inside their respective polyhedra are dendritic cells (ρ = 2.42) and intestinal progenitors (ρ = 1.06). We conclude that the present approach can describe the geometry and potential tasks of single-cell data from diverse tissues and different technologies. Pareto analysis on human crypt single-cell expression suggests four archetypes We begin with analyzing the single-cell gene expression dataset of human colon crypt cells obtained by Dalerba et al [2]. The dataset included 407 individual cells, each analyzed by single-cell qPCR for 83 selected genes in a Fluidigm microfluidic system [5]. We normalized the data by subtracting the mean of each gene as described in Methods. To address the effects of outliers and all/none bimodality in expression data ([11,12]), we removed 10% of the cells which had the lowest expression levels across genes, and 10% of the genes which had the lowest expression across cells, resulting in a dataset of 368 cells and 76 genes (S1 Dataset). Removing more cells or genes, up to 25% of the lowest expressing cells or lowest expressed genes, leaves the results essentially unchanged (S1 Fig and S1A Text). Based on theory on evolutionary tradeoffs between tasks [28], we expect that cells should fall in a low-dimensional polytope whose vertices are the points optimal in each task alone. We therefore asked whether gene expression data is enclosed within a low dimensional polytope (e.g. line, triangle, tetrahedron and so on). We calculated the best fit polytopes that enclose the data, and asked how well they describe the data compared to randomized datasets. We considered polytopes with k vertices: we tested k = 2 (line), k = 3 (triangle), k = 4 (tetrahedron) and so on up to 11 vertices. The polytopes were found using the PCHA algorithm [40]. This algorithm seeks k points on the convex hull of the data that define a polytope that encloses as much of the data as possible (see Methods: Archetype detection using the PCHA algorithm [40]). For each polytope, we calculated the deviation of the data from the polytope (explained variance, Methods: Determining the number of archetypes). k = 4 archetypes explain 45% of the variance, whereas adding a fifth archetype added only 4% additional explained variance (Fig 2a). A tetrahedron explained the data variance better than when applying the same algorithm to shuffled data (p = 0.01, Methods: Statistical significance of best fit polytopes). This suggests that a 4-vertex polyhedron, namely a tetrahedron, is a reasonable description of the data. Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 2. Human colon crypt cells fall in a tetrahedron in gene expression space. (a) For k = 2–11 we found the k-polytope that best fit the data using PCHA algorithm, considering all 76 dimensions. Explained variance of best fit polytopes with k = 2–11 vertices begins to saturate at k = 4 or k = 5 vertices. (b) Comparison between the variance explained by the first k principal components of the data to the variance explained by the k principal components of shuffled data suggests that effective data dimensionality is three or four. Blue line: variance explained by PCA of intestinal data. Green line: variance explained by PCA of shuffled data. Points represent mean values. Error bars, representing 5%-95% variation intervals, are smaller than line width. Points for which the real data EV is higher than the randomized data EV are marked with *. (c) Data displayed in first 3 PCs axes resembles a tetrahedron, and its projections on principal planes (d)-(f) resemble triangles. Archetypes and their variation upon data resampling (bootstrapping) are shown as colored ellipses (see S1B Text). Thin lines—tetrahedron edges. https://doi.org/10.1371/journal.pcbi.1004224.g002 The tetrahedron was found in the full 76-dimensional space. To display the data, it is helpful to use principal component analysis (PCA). PCA indicated that 3 principal components (3 PCs) explain most of the variance in the data (47%), and that 4 PCs (52%) are significant compared to shuffled data (Fig 2b, Methods: Determining the number of archetypes), indicating that the data is indeed low-dimensional. The tetrahedrality of the data is highlighted by the fact that 96% of the variance explained by the first 3PCs is explained also by the much more stringent description of a tetrahedron whose vertices are on the convex hull of the data. Plotting the data, with each cell represented by a point in the space spanned by the first 3PCs of gene expression space, suggests a tetrahedron-like shape (Fig 2c). The projections of the data on the three principal planes are roughly triangular (Fig 2d–2f), and show well-defined linear edges which meet at pointy vertices. We found that the archetype positions were robust to data sampling, with errors on the order of a few percent in bootstrapping tests in which data is resampled with replacement (S1B Text and S1a Table and Fig 2c). We note that clustering methods, such as k-means or hierarchical clustering, are much more sensitive to sampling errors in this dataset: the continuous distribution of the data makes the cluster boundaries somewhat arbitrary and thus on the order of 20% of the data points are classified to different clusters upon bootstrapping (S1C Text and S2–S4 Figs). Each archetype is enriched with markers for a major crypt cell type Each of the four vertices of the best-fit tetrahedron is a point in the 76-dimensional gene expression space. Within Pareto theory, each vertex is an archetype that can be thought of as an optimal gene expression profile, extrapolated from the data, which best performs the archetype's task. The gene profiles for the archetypes are shown in Fig 3a. We find that each of the four archetypes shows high expression of a set of markers for a specific crypt cell type (Fig 3b and 3c and S5 Fig). Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 3. Expression profiles of the four colon crypt archetypes are each enriched for markers of specific cell types. (a) The expression profiles of the four archetypes, with enriched genes colored. Enriched genes were determined by leave-1-out enrichment analysis, binning the cells according to distance from each archetype and seeking when average expression in the bin closest to the archetype is maximal, as described in Methods: 1D Gene enrichment at archetypes (See full enriched genes list in S2 Table). Light blue—enterocyte archetype, yellow—Nodal archetype, green—stem cells archetype, red—goblet cell archetype. Genes that are not enriched, or enriched in more than one archetype, are in dark blue. Zero level represents the average expression of each gene in the dataset. (b) Leave-1-out enrichment plot: expression of a gene (SLC26A3—an enterocyte marker) as a function of distance from archetype in equal mass bins of cells (Methods: 1D Gene enrichment at archetypes), line color indicates archetype. This gene is maximally enriched only at the enterocyte archetype (blue line). For enrichment plots for additional genes see S5 Fig. (c) A two dimensional enrichment plot of SLC26A3, in which its expression is plotted on the plane of the first 2PCs of the data, indicating expression is maximal in the cells closest to the enterocyte archetype. Contours are expression density estimated using a Gaussian kernel (Methods: 2D Gene enrichment at archetypes). Archetype positions and PCs were calculated without the tested gene. https://doi.org/10.1371/journal.pcbi.1004224.g003 Archetype 1 shows high expression of enterocyte markers (AQP8, SLC26A3, MS4A12, KRT20 [41–44]). The cells closest to this archetype show the maximal expression of these markers in the entire dataset (Methods: 1D Gene enrichment at archetypes). In cells closest to archetype 2, putative stem cells markers are maximally expressed (including LGR5, ASCL2, AXIN2, c-MYC, CDK6, OLFM4 [45–49]). In cells closest to archetype 3, markers for goblet cells are maximally expressed (MUC2, TFF3, SPDEF and others [38,50,51]). The complete list of enriched genes is shown in S2 Table. Thus, the first three archetypes correspond each to one of the three main crypt cell types. The fourth archetype is discussed below. We also evaluated the physical position of each cell in the crypt using a proxy for height in the crypt, Axin2 expression [52]. We find that the stem cell archetype has highest Axin2, followed by the goblet archetype. The enterocyte archetype has lowest Axin2. This matches the known arrangement of cells in the crypt, with stem cells at the bottom of the crypt, and enterocytes at its top (S6a Fig) [37]. One archetype represents a novel Nodal cell class which may be an intermediate between stem cells and enterocytes The fourth archetype is enriched with a specific set of genes related to development and embryonic patterning (NODAL, CFC1, TDGF1 [53–56]), a transcriptional repressor that has a role in development (PCGF6 [57]), and an enzyme involved in hormone secretion (UGT1A1 [58]), and a member of the claudin family CLDN8 [59]. We call these cells Nodal cells. Their position in the crypt, based on Axin2 levels, is intermediate between the bottom and top (S6a Fig). To better understand Nodal cells, we ordered the cells according to a pseudo-temporal order inferred from their gene expression using the Wanderlust algorithm [60]. NODAL cells seem to lie on the developmental axis between stem cells and enterocytes (S7a Fig). This tentatively suggests NODAL cells as a differentiation step between precursors and enterocytes. In summary, Pareto analysis finds four archetypes which correspond to gene expression profiles. Three of these define the three main cell types in the crypt. The fourth may indicate a step between stem cells and enterocytes. The archetypes can be interpreted as an idealized gene profile for each of the cell types, and cells of a given type are arranged in proximity to the corresponding archetype in gene expression space. Progenitor cells uniformly fill a tetrahedron with archetypes for stemness and three differentiation tasks We next zoom in on each inferred cell type to examine the variation between cells within a type. We repeated the Pareto analysis on each class of cells separately (S1D Text). We find that the three non-progenitor cell types (enterocytes, goblet cells, nodal cells) cannot be explained by a statistically significant polytope with 5 or less vertices (see S3 Table, all p-values >0.15). Thus, the expression of these cells seems to form a cloud in gene expression space with no easily discernible vertices (Fig 4a). This may hint that other effects dominate the structure of the data, or that not enough cells or not enough relevant genes were measured, see S1E Text and S8 Fig. Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 4. Progenitor crypt cells fall in a tetrahedron. (a) Enterocytes, goblet cells and nodal cells analyzed separately do not form significant polytopes. Cells are color coded by type in the tetrahedron of Fig 3, and each cell class is plotted in its own first 3PC. (b) Progenitor cells analyzed separately fall uniformly in a tetrahedron. The best fit tetrahedron is shown (PCHA delta = 0.5). Arrow represents direction of development according to Axin2 levels, see S7b Fig. Also shown are projections on the principal planes, which resemble triangles or quadrangles. Archetypes and their variation upon data resampling (bootstrapping) are shown as gray ellipses. (c) Explained variance as a function of polytope order k or number of PCs D both suggest a tetrahedron (k = 4, D = 3). https://doi.org/10.1371/journal.pcbi.1004224.g004 In contrast, the progenitor cells were well-described by a tetrahedron (p = 0.01, Methods: Statistical significance of best fit polytopes, Fig 4b showing tetrahedron and projections, Fig 4c shows explained variance curves, S2 Dataset). The progenitor cells seem to uniformly fill out this tetrahedron, suggesting that precursors span a continuum of gene expression states, with some cells coming close to one of four archetypal precursor profiles, and others showing a more even mixture of the archetypes. We examined the expression profiles of the cells closest to each archetype (Methods: 1D Gene enrichment at archetypes). Archetype 1 is enriched with stem markers LGR5 and ASCL2, and lies physically at the lowest point in the crypt according to Axin2 levels (S7b Fig inset). It has proliferation markers (MYC, Cdk6), and telomerase (TERT). These genes are characteristic of dividing stem cells [37,46,48,49]. The other three archetypes all display the progenitor marker OLFM4 [61], but also have characteristics more similar to the differentiated cells. Archetype 2 includes enrichment in enterocyte and goblet markers. Archetype 3 is enriched in division inhibitor CDKN1A [62]. Archetype 4 has low expression of all genes. We hypothesize that these three archetypes represent three tasks needed for differentiation: (i) expression of effector cell-specific genes (ii) inhibition of cell division (iii) reduction in global gene expression. For a complete list of enriched genes see S4 Table, archetype gene expression profiles are shown in S9 Fig. The progenitor cells fill out the tetrahedron quite uniformly. According to Axin2 expression, as they move up the crypt they move away from the stem archetype and parallel to the plane defined by the other three archetypes. Pseudo-temporal order derived by the Wanderlust algorithm suggests similar conclusion (S7b Fig). This may suggest multiple temporal paths between the three tasks, such that each progenitor is a different weighted average of the archetypes. We also asked how the progenitor tetrahedron relates to the rest of the cells in the crypt. We find that the fully differentiated cells types (enterocytes, goblet cells) are closest to archetype 4, because they all have lower overall gene expression than the progenitors. Similar polytopes for mouse and human intestinal crypt single-cell data Up to now we analyzed a human crypt dataset. We now compare it to a mouse crypt dataset, presented by Rothenberg et al. [37] using qPCR (S3 Dataset). We used the same Pareto analysis approach. We removed the lowest 10% of cells and genes, remaining with 161 cells and 41 genes. The two datasets overlap in 24 genes. The mouse cells in the dataset were only from the bottom of the crypt: they were harvested by cell sorting (FACS) using the markers CD66 and CD44. We find that the mouse single-cell data is well-described by a triangle (p = 0.003, 2PCs explain 43% of the variance). To compare the mouse data to human data described in the previous sections, we analyzed human cells in the lower part of the crypt (defined in [2], by FACS sort for high and low cells using CD66 and CD44 markers). This dataset (213 cells) is also well described by a triangle (p = 0.001, 2PCs describe 35% variance). The archetypes in both the datasets correspond to stem cells, goblet cells and enterocytes. The nodal cells are on the edge connecting progenitors and enterocytes (Fig 5b), representing the projection of the nodal tetrahedral archetype on the triangle that describes the crypt-bottom cells. In the mouse dataset 5 out of the 6 genes that are enriched in the Nodal archetype were not measured; however the Nodal co-receptor TDGF1 [27] is highly expressed on the edge that connects stem cells and enterocytes. Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 5. Mouse and human colon lower crypt cells fall on similar triangles and show a similar distribution within the triangle. (a) Mouse colon lower crypt cells dataset by [37], plotted on its first 2PCs plane. Inset: PCA explained variance analysis suggest k = 3 vertices and D = 2 dimensions, namely a triangle. (b) Human lower crypt dataset plotted on its first 2PC plane. Inset: explained PCA explained variance analysis. The 24 genes common to the two datasets were used. Arrow indicates projection of the nodal cell archetype on the triangle. https://doi.org/10.1371/journal.pcbi.1004224.g005 The geometry of mouse and human datasets are strikingly similar (Fig 5). The three mouse archetypes are very similar to the three corresponding human archetypes in their overlapping genes (R2 = 0.65, p<10–9, S10 Fig, enriched genes are shown in Fig 5), and signify stem cells, goblet cells and enterocytes. Moreover, the distribution of points on the triangle suggests a gradual differentiation process from stem cells to enterocytes, compared to more abrupt temporal switch in the case of differentiation to goblets, a difference shared by both species. We also analyzed single-cell data by Dalerba et al [2] for human colon cancer xenograft in mouse, derived from a single cancer cell (S11 Fig and S1F Text and S4 Dataset). We compared this data to the triangle found when analyzing human crypt-bottom normal tissue cells (Fig 5b). We find that the cancer cells lie in a similar triangle, with a density distribution that peaks near the stem cells, enterocyte and goblet archetypes. This hints that the human cancer cells undergo differentiation similar to the normal mouse and human tissues [2]. Analysis of single-cell mass-cytometry of bone marrow cells and RNA-Seq of dendritic cells reveals polytopes and tasks Finally, we asked whether this approach can be used with other experimental techniques and other tissue types. We studied a single-cell mass cytometry (Cytof) dataset on human bone marrow by Bendall et al [13,25] (S5 Dataset). Here, 10,000 cells were each characterized by the expression of 31 proteins detected using antibodies. We find that this dataset is well-described by a four-dimensional simplex (a polytope with five vertices, p = 0.005, Fig 6 top row and S12 Fig). The five archetypes are each enriched with genes that clearly define a specific cell type (CD4 T cells, CD8 T cells, monocytes/ macrophages, B cells and non-leukocytes) (see S1G Text and S5 Table and S13 and S14 Figs). Cells density peaks near each vertex. A sizable set of cells, which formed an unidentified cluster in the viSNE analysis of [13], lies in the middle of the tetrahedron (S15 Fig), possibly indicating cells whose protein expression is intermediate between the classical cell type profiles. Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 6. Different tissues analyzed by different single-cell technologies show polytopes and tasks. (a) Human bone marrow cells analyzed by single-cell mass cytometry [13,25] in which proteins are detected using mass-tagged antibodies is well described by a 4D simplex (a polytope with 5 vertices). The simplex is shown projected on the first 3PCs, for other projections see S12c Fig. The archetypes correspond to cell types as indicated. Cell density peaks near each archetype. (b) Mouse spleen LPS stimulated dendritic cells analyzed by single-cell RNA-Seq [3] are well described by a tetrahedron. Archetypes are labeled with functions inferred from genes maximally enriched in cells near each archetype. For more details see S1G and S1H Text. https://doi.org/10.1371/journal.pcbi.1004224.g006 We also analyzed a single-cell RNA-Seq dataset from mouse spleen by Jaitin et al [3] (S6 Dataset). Each cell is characterized by 20,091 gene expression counts, based on sampling a fraction of the cell’s mRNA pool. This data was classified by Jaitin et al into seven groups of cells using a probabilistic mixture model. One group of cells, however, seemed to defy clear clustering (class VII in [3]). These are the dendritic cells in the spleen, known to carry out a wide range of immune functions including detection of pathogens and activation of lymphocytes [63–65]. We find that LPS treated dendritic cells in this dataset are well-described by a tetrahedron (312 cells, p<0.001, Fig 6 bottom row). Several functional gene groups [66] (see S1H Text) are highly and maximally enriched in the cells closest to each of the four archetypes (S6 Table). This enrichment indicates four key immune tasks: Archetype 1- response to virus (cytoplasmic DNA response and interferon pathways) [7,67]; Archetype 2- Dendritic cell maturation and formation of cytoskeletal features [64,68,69]; Archetype 3- Stimulation of lymphocytes (cytokine secretion, antigen presentation) [7,70,71]; Archetype 4- putative cell-death pathways [72,73]. We compared the tissues in terms of how uniformly they fill out their polyhedra or polytopes. We therefore computed the ratio ρ between the mean local density of the data and the local density expected in a uniform distribution [74] (S1I Text and S16 Fig). The closer this ratio is to one, the more uniform the distribution of points; a high ratio means the data is clumped into clusters. We find that bone marrow cells are the most clustered or clumped among the datasets (ρ = 4.67), in line with the classic view of hierarchical hematopoietic lineages [75,76]. Intestinal cells are less clustered (ρ = 2.93). The closest to uniform distribution inside their respective polyhedra are dendritic cells (ρ = 2.42) and intestinal progenitors (ρ = 1.06). We conclude that the present approach can describe the geometry and potential tasks of single-cell data from diverse tissues and different technologies. Discussion We studied the geometry of single cells in gene expression space using Pareto archetype analysis. We used data from different single-cell technologies employing qPCR, RNA-Seq or mass cytometry, and different tissues including intestinal crypt, bone marrow and spleen. We find that single-cell data fall in low dimensional shapes with well-defined vertices, such as tetrahedrons. The cells closest to each vertex are enriched with genes that reveal key biological functions relevant to each tissue. Some datasets fall into distinct clusters, with one cluster near each vertex, and thus support a picture of distinct and well-separated cell types. Other contexts, such as intestinal progenitor cells and spleen dendritic cells, show a continuum of gene expression states which uniformly fills the tetrahedron, supporting a picture of a continuous range of cell states that carry out mixtures of the biological functions defined by the vertices. These findings expand the concept of cell type, by demonstrating the possibility of a polyhedral continuum of expression states: cells can range between being task specialists near the vertices of the polyhedron, and generalists suitable for multiple tasks near the center. It is interesting to ask when is it better to design distinct cell types with separated biological functions, and when to design a continuum of cell expression states. Distinct cell types have the advantage of being specialists at a given task, with optimal function. However, if the proportions of the tasks needed in the tissue changes more rapidly than the ability to make new cells or to adjust their protein composition, a continuum of states may have an advantage. It allows a reserve of cells (cells in the middle of the polyhedron) that can perform multiple functions, albeit less optimally than specialists, and can therefore be recruited to each task in times of need. This type of reasoning was used to explain why ant morphologies in a nest tend to show a continuous distribution rather than distinct clusters: Intermediate morphology ants can be recruited to defense, foraging or nursing tasks according to changing needs [77]. Other factors that may influence clustering versus continuum include the biochemical feasibility of multiple functions to co-exist in the same cell [78,79] and the existence of a continuous range of spatial and temporal niches in a tissue related for example to migration, differentiation processes [17,80], or to distances from blood vessels or tissue boundaries [81]. We find a continuous filling of a tetrahedron in the context of progenitor cells in the intestine. The progenitor tetrahedron suggests four tasks—one related to stemness and stem cell renewal, and the other three archetypes related to potential tasks required for early stages of differentiation: stopping division, expressing effector genes, and down regulating global expression [82]. The uniform filling of the progenitor tetrahedron suggests that there is not one temporal path to differentiation, but rather many paths with different ordering of the tasks, each taken with more or less equal probability. This relates to the idea that stem cells show heterogeneity [15,27,83], in which different molecular states confer functional biases to individual cells, contributing to their overall regulation [84], and aligns with recent findings about their plasticity [85,86]. With more data, one may be able to infer temporal paths from static data, as was done in the context of the cell cycle [87] and cell differentiation [17,60]. Similarly, the continuum observed between intestinal progenitors, nodal cells and enterocytes (Figs 2c and 5b) suggests a gradual differentiation process, with the nodal cells—a new class of cells defined in the present study—possibly an intermediate station between progenitors and enterocytes. The polyhedral continuum of states we find in stimulated dendritic cells (DCs) may likewise suggest spatiotemporal trajectories in the spleen, with external DCs active in pathogen detection, followed by migration into the central spleen for lymphocyte activation [64,65]. The present Pareto analysis is a new way of looking at single-cell data that emphasizes the geometric contours that enclose the data. This approach was used recently also in other biological contexts, to understand C. elegans foraging behavior [34], bird toe-bone proportions [35], and bacterial [88] and tumor [33] population-level gene expression. It is useful to compare the present approach to other methods of analyzing high-dimensional data, such as clustering or self-organizing maps [88–92]. If the data is arranged in separated non-overlapping clouds, all methods can reveal its structure. If the data, in contrast, is spread more along a continuum, clustering methods can lead to arbitrary classification, because it is not possible to tell where one cluster ends and another begins. For example, in the current crypt dataset, cluster assignment for 20% of the crypt cells changed upon data resampling with returns (bootstrapping, see S1C Text). In contrast, archetypes varied by only a few percent upon bootstrapping, and data description as a weighted average of archetypes was likewise much less sensitive to data resampling than clustering analysis (S2–S4 Figs and S1 Table). The present approach also differs from principal component analysis (PCA, see S1J Text): whereas PCA finds the axes of the space which describes most of the data variance, our approach finds a polytope within which the data resides and is thus much more restrictive (S17 Fig); moreover the vertices of the polytope are different from the principal components (e.g. a tetrahedron resides in a 3D space but has four vertices, S18 and S19 Figs), and our findings support their interpretation as archetypal profiles that reveal clear biological functions. PCA and other dimensionality reduction techniques are not needed to find the polytopes, but can help to visualize them. Major challenges remain, related to the high dimensionality of gene expression data, which can be in the many thousands. Fitting such data to a low-dimensional polytope can capture major trends. However, if there are small sets of genes that vary independently of the others, in a biologically important way, the current implementation of Pareto analysis will miss them because they make a small contribution to the overall shape of the data. Future work can develop ways to detect such groups of genes, separate the data and analyze each subgroup independently. This may lead to a presentation of the data as a collection (outer product) of Pareto fronts, each with a subset of genes related to a distinct set of tasks. Similarly, Pareto analysis can be used as a microscope to zoom in on a subset of cells or genes—in the present study, progenitor cells when considered separately without the rest of the crypt cells, revealed an informative tetrahedron of their own. Finally, one can analyze polytopes with increasing number of vertices and in this way observe the splitting of archetypes into finer distinctions (S1K Text and S20 Fig). The polyhedra found here seem to be a distinct feature of the dataset, especially the straight edges and faces apparent in plots of the data distribution. However, it is possible that the observed polytopic-like structure results from other (unknown) reasons, such as systematic experimental errors or biological phenomena. Further application of Pareto analysis, with different tissues and different single-cell technologies that have different noise and biases, is needed to test the generality of the present conclusions. With these caveats in mind, the present approach of using Pareto analysis for single-cell data can be generally used to understand the geometry of single-cell data and to infer the tasks of individual cells in a tissue. More generally, this study indicates that the concept of cell type may be expanded. In addition to separated clusters in gene-expression space, we suggest a new possibility: a continuum of states within a polyhedron, in which the vertices represent specialists at key tasks, with generalist cells lying in the middle. Methods Preprocessing and normalization of single-cell qPCR data Single-cell gene expression of primary human colon crypt cells obtained by Dalerba et al [2] included 407 individual cells, each analyzed by single-cell qPCR for 83 selected genes in a Fluidigm microfluidic system [5] (S7 Dataset). Note that the present dataset contains 34 genes not included in the original publication, kindly provided by Dalerba et al. These genes, including CFC1, UGT1A1, CLDN8, NODAL, TDGF1, and PCGF6, allow identification of the nodal cell type. Some genes were measured by multiple primers, see S1A Text and S21 Fig. Cells were sorted by FACS so that they belong to one of two populations: EpCAMhigh/CD44+, which corresponds to the bottom of the crypt, and EpCAM+/CD44-/CD66high, which corresponds the top of the crypt. The measured genes were selected using publicly available gene-expression array data sets from human colon epithelia, using a bioinformatics approach designed to identify developmentally regulated genes [2]. Gene expression was measured in Cycle threshold units (Ct)—the number of amplification cycles it takes to detect the gene—which correspond to the logarithm of the fold change in expression. The data was bi-modal, as seen in other applications of this technology [11,12], with some cells ranging from 2.5Ct to 30Ct, and another set of cells with Ct>40. Since there were no values larger than 30Ct, we treated this as the minimal threshold level of detection and set these values to 30, and then linearly transformed the data so that minimal CT is zero (30-data). Finally, we subtracted the mean of each gene. Cells and genes with low expression were removed as described in Results. The processed data can be found in S1 Dataset. The same normalization was used for the cancer xenograft data [2] (S4 Dataset) and for the mouse intestinal cells data (S3 Dataset) [37]. Preprocessing and normalization of single-cell mass cytometry data We used data by Bendall et al [25] and Amir et al [13], of human bone marrow samples from healthy donors. The cells were analyzed using single-cell mass cytometry as described in [25], resulting in the antibody-detected expression of 31 proteins in 10,000 cells. Multiple detections of one protein—CD3—were united by taking their mean for each cell. Following the original publication, we transformed the data by applying a hyperbolic arcsin transformation with a cofactor of five. Processed data can be found in S5 Dataset. Preprocessing and normalization of single-cell RNA-Seq data We used data from [3], of mice spleen CD11c+ cells that were exposed to lipopolysaccharide (LPS) for 2 hours. We focused on cells that were classified in the original paper as dendritic cells (group VII). We used the likelihood model of [3] in order to choose cells which have the highest likelihood to belong to this group (rather than to other groups). We didn’t consider 16 genes with high batch-to-batch variability (as in [3]), and removed measures of control ERCC molecules. Since the data is very sparse (97% of the matrix entries are 0), we considered the 500 genes with highest standard deviation across samples. Results are similar for choosing genes by highest mean or median. Following Jaitin et al, we normalized the data by down-sampling: Defining a target number of molecules N, and then sampling from each cell having m> = N molecules precisely N molecules without replacement. Cells with m40. Since there were no values larger than 30Ct, we treated this as the minimal threshold level of detection and set these values to 30, and then linearly transformed the data so that minimal CT is zero (30-data). Finally, we subtracted the mean of each gene. Cells and genes with low expression were removed as described in Results. The processed data can be found in S1 Dataset. The same normalization was used for the cancer xenograft data [2] (S4 Dataset) and for the mouse intestinal cells data (S3 Dataset) [37]. Preprocessing and normalization of single-cell mass cytometry data We used data by Bendall et al [25] and Amir et al [13], of human bone marrow samples from healthy donors. The cells were analyzed using single-cell mass cytometry as described in [25], resulting in the antibody-detected expression of 31 proteins in 10,000 cells. Multiple detections of one protein—CD3—were united by taking their mean for each cell. Following the original publication, we transformed the data by applying a hyperbolic arcsin transformation with a cofactor of five. Processed data can be found in S5 Dataset. Preprocessing and normalization of single-cell RNA-Seq data We used data from [3], of mice spleen CD11c+ cells that were exposed to lipopolysaccharide (LPS) for 2 hours. We focused on cells that were classified in the original paper as dendritic cells (group VII). We used the likelihood model of [3] in order to choose cells which have the highest likelihood to belong to this group (rather than to other groups). We didn’t consider 16 genes with high batch-to-batch variability (as in [3]), and removed measures of control ERCC molecules. Since the data is very sparse (97% of the matrix entries are 0), we considered the 500 genes with highest standard deviation across samples. Results are similar for choosing genes by highest mean or median. Following Jaitin et al, we normalized the data by down-sampling: Defining a target number of molecules N, and then sampling from each cell having m> = N molecules precisely N molecules without replacement. Cells with m0.15), while progenitor cells fall in significant tetrahedron and triangle but not in a polytope with 5 vertices. P-values were computed as described in Methods: Statistical significance of best fit polytopes. https://doi.org/10.1371/journal.pcbi.1004224.s027 (DOCX) S4 Table. Intestinal progenitor cells archetypes are enriched with specific sets of genes. Results of a leave-1-out enrichment analysis, carried on intestinal progenitor cells tetrahedron as described in Methods: 1D enrichment at archetypes, using 5 bins and demanding p-value < 0.001 using Wilcoxon rank-sum statistical test. https://doi.org/10.1371/journal.pcbi.1004224.s028 (DOCX) S5 Table. Bone marrow cells leave-1-out enrichment reveals enriched genes at archetypes. Results of a leave-1-out enrichment analysis, carried on human bone marrow cells protein expression data, acquired by single-cell mass cytometry. Enrichment was computed as described in Methods: 1D enrichment at archetypes, using 10 bins and demanding p-value < 0.001 using Wilcoxon rank-sum statistical test. https://doi.org/10.1371/journal.pcbi.1004224.s029 (DOCX) S6 Table. Dendritic cells archetypes are enriched with robust specific functional gene groups. Functional gene groups (MSigDB gene sets [66]) enriched at archetypes and their significance levels. The functional categories that appear in this list were found to be robust to bootstrapping—they were enriched in the same archetype in 80% or more of 100 bootstrapped datasets (created by resampling with replacement of the original data, archetypes were computed again for each bootstrapped dataset). The p-value threshold to define significance was set using Benjamini-Hochberg test (FDR<0.1) to prevent multi-hypothesis testing error. https://doi.org/10.1371/journal.pcbi.1004224.s030 (XLSX) S7 Table. Enriched genes at archetypes were used to identify the splitting archetypes in S20 Fig. Leave-1-out enrichment results (Methods: 1D enrichment at archetypes, bin size = 0.1, demanding p-value lower than 0.001 using Rank-sum test) when looking for k = 2–6 archetypes for intestinal cells data. Archetypes titles in S20 Fig were given based on these results. https://doi.org/10.1371/journal.pcbi.1004224.s031 (XLSX) S1 Text. Extended methods and results. Estimation of technical noise and bimodality in the intestinal cell dataset Robustness of archetypes to the sampling of the data, intestinal dataset Comparison of archetypal analysis to clustering methods Definition of cell types in the intestinal dataset Effect of sample size on the statistical significance of polytopes Analysis of a single-cell qPCR dataset of a human colon cancer xenograft from a single cancer cell Analysis of a single-cell mass cytometry dataset from human bone marrow Analysis of single-cell RNA-Seq data for stimulated mouse spleen dendritic cell Uniformity of the distribution of cell states varies between tissues Comparison of archetypal analysis to principal component analysis Increasing the number of archetypes may reveal subtle trends https://doi.org/10.1371/journal.pcbi.1004224.s032 (PDF) S1 Dataset. Human intestinal cells gene expression obtained by single-cell qPCR, from Dalerba et al 2011, after processing and normalization as described in Methods: Preprocessing and normalization of single-cell qPCR data. https://doi.org/10.1371/journal.pcbi.1004224.s033 (CSV) S2 Dataset. Human intestinal cell progenitors gene expression obtained by single-cell qPCR, from Dalerba et al 2011, after processing and normalization as described in Methods: Preprocessing and normalization of single-cell qPCR data. The cells analyzed here are a subset of the cells presented in S1 Dataset, chosen as described in S1D Text. https://doi.org/10.1371/journal.pcbi.1004224.s034 (CSV) S3 Dataset. Mouse intestinal cells gene expression obtained by single-cell qPCR, from Rothenberg et al 2012, after processing and normalization as described in Methods: Preprocessing and normalization of single-cell qPCR data. https://doi.org/10.1371/journal.pcbi.1004224.s035 (CSV) S4 Dataset. Human cancer xenograft in mouse gene expression obtained by single-cell qPCR, from Dalerba et al 2011, after processing and normalization as described in Methods: Preprocessing and normalization of single-cell qPCR data, S1F Text. https://doi.org/10.1371/journal.pcbi.1004224.s036 (CSV) S5 Dataset. Human bone marrow gene expression obtained by single-cell mass cytometry, from Bendall et al 2011, Amir et al 2013, after processing and normalization as described in Methods: Preprocessing and normalization of single-cell mass cytometry data, S1G Text. https://doi.org/10.1371/journal.pcbi.1004224.s037 (CSV) S6 Dataset. LPS stimulated dendritic cells gene expression obtained by single-cell RNA-Seq, from Jaitin et al 2014, after processing and normalization as described in Methods: Preprocessing and normalization of single-cell RNA-Seq data, S1H Text. https://doi.org/10.1371/journal.pcbi.1004224.s038 (CSV) S7 Dataset. Human intestinal cells gene expression obtained by single-cell qPCR, from Dalerba et al 2011, no further processing and normalization. This dataset includes 34 genes not included in the original publication, kindly provided by Dalerba et al. https://doi.org/10.1371/journal.pcbi.1004224.s039 (XLSX) Acknowledgments We thank Shalev Itzkovitz, Steffen Jung, Caterina Curato, Nir Drayman, Benjamin Towbin, Shlomit Reich-Zeliger, Amos Tanay, Ido Amit, Ephraim Kenigsberg, Piero Dalerba, Dana Pe’er and Xiling Shen for discussions and for helpful remarks on earlier versions of this manuscript. We thank Piero Dalerba, Michael F Clarke and Stephen R Quake for sharing unpublished data. TI - Geometry of the Gene Expression Space of Individual Cells JO - PLoS Computational Biology DO - 10.1371/journal.pcbi.1004224 DA - 2015-07-10 UR - https://www.deepdyve.com/lp/public-library-of-science-plos-journal/geometry-of-the-gene-expression-space-of-individual-cells-M019yYZDc8 SP - e1004224 VL - 11 IS - 7 DP - DeepDyve ER -