Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

BGP: identifying gene-specific branching dynamics from single-cell data with a branching Gaussian process

BGP: identifying gene-specific branching dynamics from single-cell data with a branching Gaussian... High-throughput single-cell gene expression experiments can be used to uncover branching dynamics in cell populations undergoing differentiation through pseudotime methods. We develop the branching Gaussian process (BGP), a non-parametric model that is able to identify branching dynamics for individual genes and provide an estimate of branching times for each gene with an associated credible region. We demonstrate the effectiveness of our method on simulated data, a single-cell RNA-seq haematopoiesis study and mouse embryonic stem cells generated using droplet barcoding. The method is robust to high levels of technical variation and dropout, which are common in single-cell data. Keywords: Single cell RNA-seq, Gaussian process, Branching dynamics Background assess the evidence for branching and to provide a pos- Single-cell gene expression data can be used to uncover terior estimate of the branching time. The posterior dis- cellular progression through different states of a temporal tribution over branching time can be used to identify the transformation, e.g. during development, differentiation most likely branching time for each gene as well as an or disease. As single-cell protocols improve, a flurry of associated credible region capturing our uncertainty in methods have been proposed to model branching of cel- the estimate. lular trajectories to alternative cell fates [1–4]. In these Our approach is based on Gaussian processes (GPs), and similar methods, pseudotime is estimated and a global which are a class of flexible non-parametric probabilistic branching structure is inferred. Our focus in this paper is models. GPs have a long history in temporal and spatial on a downstream analysis method that can subsequently statistics and have gained popularity in many areas of be used to model branching gene expression dynamics for machine learning, including multivariate regression, individual genes. We are interested in discovering which classification and dimensionality reduction [5]. GPs have genes follow the global cellular branching dynamics and been used for dimensionality reduction of single-cell whether these genes branch early or late with respect expression data [6, 7] and more recently for pseudotime to the global cellular branching time. Recently, [3]pro- estimation where the effect of uncertainty in the inferred posed the branch expression analysis modelling (BEAM) pseudotime can be quantified [8]and capturetimecan approach, which uses penalised splines to infer the indi- be included as prior information [9, 10]. GP-based meth- vidual gene branching time. Here, we propose an alter- ods have also been used for modelling global cellular native non-parametric method to model gene expression branching dynamics from single-cell data after assigning branching dynamics. We develop a probabilistic genera- pseudotime to cells [11]. tive model of branching dynamics that can be used to Here, we build on the work of [12], whodeveloped aGP model to identify when two gene expression time-course *Correspondence: alexis.boukouvalas@manchester.ac.uk; data sets first diverge from one another. They defined a Magnus.Rattray@manchester.ac.uk novelGPcovariancefunctionthatconstrainstwo func- Division of Informatics, Imaging and Data Sciences, Faculty of Biology, Medicine and Health, University of Manchester, Oxford Road, Manchester, UK tions to intersect at a single point. The divergence time Full list of author information is available at the end of the article © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Boukouvalas et al. Genome Biology (2018) 19:65 Page 2 of 15 is inferred by numerically approximating the posterior Specifically for N cells, naive covariance inversion scales using a simple histogram approach. The model identifies as O N while under sparse inference with k inducing when a gene first becomes differentially expressed in time- points, it scales as O k N where typically k  N.Sec- course gene expression data under control and perturbed ondly, we provide an open-source implementation that conditions. In their approach, all data points are labelled leverages the GPflow library [16], which both simplifies with the branch that generated them and the ordering the implementation due to automatic symbolic differenti- of time points is assumed known. Although similar to ation and allows for the necessary matrix operations to be the problem of modelling branching in single-cell data computed in parallel across many CPU nodes or GPUs. after pseudotime is inferred, this two-sample time-series The paper is organised as follows. We first give an method cannot be applied directly to our problem because overview of the BGP model, including how inference is we have to allow for uncertainty regarding which branch performed in a scalable manner. We discuss how uncer- each cell belongs to. tainty is quantified and represented in a full posterior Also closely related to the present work, the overlap- distribution on the branching time for each gene. We con- ping mixture of GPs (OMGP) [13] is a mixture model for trast the performance of the BGP method to two recently time-series data where the mixture components are GP published methods, the mixture of factor analysers (MFA) functions and the data at any time can be assigned to any [17] and the BEAM approach [3], in a synthetic study of the components. For single-cell data, after pseudotime across a variety of simulated scenarios. We apply the is assigned to each cell, then the OMGP model can be BGP model to data from a haematopoiesis study [18]and used to assign cells to different trajectories. The cell to single-cell mouse embryonic stem cell data generated labels do not have to be known in advance and can be using droplet barcoding [19]. We conclude with a sum- inferred through fitting the model to data. However, in mary of our findings and a discussion of possible future OMGP models, the cellular trajectories are independent research directions. rather than branching. The OMGP model has been applied to single-cell data to infer global cell branching Results and discussion times [11] but as OMGP assumes the latent functions are Overview of BGP independent without any branching, a heuristic based Before applying the BGP method, we require the pseu- on a piecewise linear fit of the log likelihood surface is dotime for each cell and the global branching pattern of proposed to identify the most likely branching times. This the cells. In our experiments, we used results from the is problematic, since OMGP does not provide a proper Monocle reversed graph embedding approach, termed generative model of branching dynamics and therefore, DDRTree [3], and the diffusion pseudotime (DPT) [1] it is not clear how to compute the posterior distribution and the wishbone [2] approaches. However, BGP can be over the branching time. based on any method that estimates pseudotime and a Our main methodological contribution here is to gene-wide cell branching association. For a recent review, generalise the OMGP model to account explicitly for see [20]. dependence between the functions in the mixture model. OMGP [13] is a mixture model of independent GPs that Specifically, we consider that the functions branch, as is able to associate an observation with the generating in [12]. This allowed us to develop a probabilistic model GP. The authors term this the association problem and over branching cellular trajectories where the assignment derive a variational inference algorithm for independent of cells to branches is not known in advance. Our new GPs. Our work extends the OMGP model in two direc- branching GP (BGP) model allows us to calculate the tions: firstly, we remove the assumption of latent func- posterior distribution over branching time for each gene tion independence and allow dependent GPs as required while allowing for uncertainty in the branch labels for by a branching model. Secondly, we provide a sparse each cell. This uncertainty is especially important for inducing point approximation that allows for scalable early-branching genes, since cells are not labelled with a inference. branch prior to the global cellular branching time, which Let F be a BGP evaluated for N data points (cells) with N ×M we assume is known. M latent functions. Z ∈ {0, 1} indicates which branch A naive implementation of GP models scales cubically each cell comes from. The number of latent functions for with the size of the data. As increasing numbers of cells a single branching point is M = 3, as we have separate can be profiled in new single-cell protocols, we ensure latent functions for the trunk and each branch. The like- the scalability of our approach by employing two com- lihood is p (Y |F, Z) = N Y |ZF, σ I and as in [13], we plementary approaches. Firstly, we use sparse inference place a categorical prior on the indicator matrix p(Z) = N M [Z] nm [14, 15], which allows model fitting to scale with the [] . We place a GP prior on the latent n=1 m=1 n,m number of inducing points. The latter is a user-defined functions p (F|t ) = GP (0, K |t ), which constrains the b b value that trades off model accuracy and training time. latent functions to branch at pseudotime t . Note that the b Boukouvalas et al. Genome Biology (2018) 19:65 Page 3 of 15 latter does not factorise as in [13], as the latent functions The marginal likelihood of the model can also be used to are dependent. Further details of the model derivation calculate the Bayes factor of branching versus not branch- and inference scheme used are provided in ‘Methods’ ing (assuming equal priors). This is used to rank genes by and a detailed derivation, including the inducing point how likely it is that their expression exhibits branching. By approximation, is provided in the supplementary material numerically integrating out the uncertainty of the branch- (Additional file 1: Section 2). ing location, the logged Bayes factor r includes the effect Global branching labels, such as those provided by of posterior uncertainty in the branching location: DDRTree, can provide an informative prior p(Z) for all genes. The prior before the global branching point is P(0 < t < 1|Y ) r = log uninformative, as no global assignment is available. This g P(t →∞)|Y ) is relevant for early-branching genes, which may start ⎡ ⎤ branching earlier than the global cellular branching. After ⎣ ⎦ = log p (Y |t = i) − log p (Y |t →∞)) , b b the global branching point, the prior favours an increased i∈S assignment probability to the globally assigned branch. (3) If the prior places significant mass on the alternative assignment, the resulting assignment may differ from the global allocation given enough evidence from the likeli- where t =∞ specifies that the model does not branch hood term. This allows the model to correct mislabelled and we have assumed equal prior probabilities for branch- cells as well as account for sources of noise in the data, ing and not branching. such as dropout for lowly expressed genes. However, in An example of the BGP model fit is shown in Fig. 1b. our single-cell case studies, we use a strong prior assign- The uncertainty in the cell branch association is shown ment probability of 0.99 to avoid cell reassignment after in conjunction with the posterior on the branching times. the global branching time. This simplifies the interpre- For visualisation, the cell assignment to the top branch tation of the results, as cells do not switch their global is shown. We see that most cells away from the branch- branch assignment. Nevertheless, other constructions are ing point are assigned with high confidence to one of possible with this model, e.g. the distance of the cells the branches. However, cells that are equidistant from from the global branching time may be used to adjust the both branches have high assignment uncertainty (0.5). associated prior uncertainty. This is also the case for cells close to the branching loca- The model hyperparameters are fitted by maximising tion where the two branches are in close proximity. In a bound on the log-likelihood. The log-likelihood is not the bottom panel of Fig. 1b, the posterior on the branch- analytically tractable, as it involves integrating out the ing location shows there is significant uncertainty on the indicator matrix Z and therefore, we use a variational precise branching location. This is reflected in Fig. 1a approximation. A lower bound is available using Jensen’s in the branching time uncertainty (magenta). The cell inequality: assignment uncertainty is incorporated into the branch- ing time posterior. If the branches separate quickly, the posterior branching time uncertainty is likely to be small. log p (Y |F) ≥ E log p (Y |F, Z) −KL q(Z)||p(Z) , q(Z) This reflects one of the main benefits of employing a (1) probabilistic model to identify branching dynamics as the assignment uncertainty is considered when calculating the whereweuse amean-fieldapproximation q (Z, F) = branching time posterior. The cell assignment is inferred q(Z)q(F) with the latent functions F independent of the in the BGP model, in contrast to the model in [12]where association indicators Z and q(Z) = φ .The φ nm nm nm the assignment is assumed known. approximates the posterior probability of cell n belonging Additional biological insights can be gleaned from the to branch m. The latter is either the trunk state or one of BGP method by inferring a branch order network using the two branches for the single branching considered in the posterior for each branching gene. The probability of the applications here. Then, F can be integrated out to get agene A branching before time t can be calculated using the marginal likelihood p(Y ). S samples from the branching posterior, The branching time posterior probability is calculated using the approximate marginal likelihood evaluated at a set of candidate branching points S of size N .The B b P (Br(A)< t)) = I (s < t),(4) posterior for a candidate branching time c is s=1 p(Y |t = c) where Br(A) is the branching time of gene A and s are p(t = c|Y ) =  .(2) p(Y |t = i) b the posterior branching time samples. The probability of a i∈S B Boukouvalas et al. Genome Biology (2018) 19:65 Page 4 of 15 (a) DDRTree prior assignment (b) Cell assignment Fig. 1 Haematopoiesis gene expression, showing the BGP fit for the MPO gene. a The Wishbone branching assignment is shown for each cell along with the global branching time (black dashed line), the most likely branching time (blue solid line) and posterior branching time uncertainty (magenta background). The sample of cells used to fit the BGP model is shown with larger markers. b The posterior cell assignment is shown in the top subpanel. In the bottom subpanel, the posterior branching time is shown. Pseudotime is shown on the horizontal axis of all plots. a, b Gene expression is depicted on the vertical axis. c The posterior branching probability BGP branching Gaussian process gene A branching before gene B can be calculated similarly functions cross after the branching point were rejected from each branching posterior, since these may be difficult for other methods, e.g. BEAM identifies the last crossing location for its fitted splines S and may, therefore, identify the wrong point in a time P (Br(A)< Br(B)) = I (s < s ).(5) series that crosses after branching. We address this issue A B s=1 in the real-data study considered in the next section but do not consider it in the synthetic benchmark. We generate We can infer a branch order network by computing this N = 150 data points (cells) with D = 40 genes and pseu- probability for all pairs of branching genes at the desired dotime in a unit interval [0, 1].The genesare separated confidence level. In the single-cell applications we discuss into three groups depending on their branching behaviour later, we use this approach to construct a directed network and time (Table 1). graph of gene branching times. All methods were run with default parameter settings, We can also calculate a posterior rank for each gene so it may be possible to improve on their performance with an associated confidence interval. Using samples of by tuning these parameters. For example, as in [17], we the posterior branching time, we can estimate quantiles of found the performance of the algorithm depended on the the rank distribution for each gene. This is a simple way initialisation used. We contrast the performance of the to infer which genes branch early and which late through BGP model both without and with an informative prior a probabilistic ranking. Unlike the previous approach, this (80% prior probability) on cell assignment derived from does not allow pairs of genes to be compared but can pro- the global Monocle assignment. vide an overall summary of the branch ordering without We first compare the pseudotime estimation accu- the need for a network analysis. racy of Monocle and MFA. Both methods achieve good Synthetic study Table 1 Synthetic gene groups We evaluate three methods, MFA [17], the BEAM Group Branching time G approach [3] and the BGP model, on synthetically gener- ated data. For the synthetic study, we use Gaussian noise Early 0.2 10 and therefore, we use the BEAM algorithm with a Gaus- Late 0.8 20 sian likelihood function. MFA also assumes a Gaussian No branching NA 10 likelihood function. Data are generated from branching Branching times and number of genes G for each group. All scenarios use N = 150 GPs with signal variance σ = 2, length scale λ = 1.2 cells and a total of D = 40 genes and a range of noise levels (Table 2). Samples where the NA not applicable Boukouvalas et al. Genome Biology (2018) 19:65 Page 5 of 15 Table 3 Synthetic study: Area under the curve for detecting performance as measured by the rank correlation of the branching genes estimated pseudotime to the ground truth (Table 2). The Bayes factor of the branching GP can be used Noise MFA BEAM BGP to rank the evidence of branching for each gene. Simi- No prior prior lar measures exist for MFA and the BEAM method. We 0.001 0.30 1.00 1.00 1.00 first compare the three methods on their ability to dis- 0.01 0.65 1.00 1.00 1.00 criminate branching from non-branching genes (Table 3). 0.03 0.77 0.96 0.98 0.99 Themetricweuse is theareaunder thecurve,which 0.08 0.82 0.93 0.86 0.92 provides a reasonable measure when the numbers of pos- itives and negatives in the ground truth are balanced. 0.1 0.77 0.82 0.81 0.96 Both BEAM and BGP have higher accuracy than MFA, 0.2 0.77 0.74 0.86 0.84 whose performance varies significantly. The inclusion of BEAM branch expression analysis modelling, BGP branching Gaussian process, MFA an informative prior improves the performance of the mixture of factor analysers BGP model, resulting in consistently high performance for all noise levels. The performance of BEAM decreases with increased noise level, which is also the case for the BGP More generally, the spline approach taken in BEAM suf- model to a lesser extent. fers from a consistent bias in branching time estimation, We also examine the error in identifying the branch- which pulls all estimates towards the global branching ing time. As MFA does not provide such an estimate, we time. To demonstrate this effect clearly, we examine an consider only the BEAM and BGP methods. The error in additional synthetic example with three genes branch- estimating branching time for the BEAM and BGP meth- ing very early (0.1), 27 genes branching late (0.7) and 10 ods is given in Table 4. The error for the BGP method is genes not branching. We select a low noise level (0.001). consistently lower than that for the BEAM method. The Since there are many late-branching genes and few early- informative prior allows for more consistent performance branching genes, the global branching time is late (Fig. 3), of the BGP method. This method has substantial increases which clearly demonstrates the bias effect. As can be seen in accuracy in some scenarios, especially for the highest in Fig. 3a, the estimates for the BEAM method are biased noise level (0.2), where the error is reduced from 0.15 towards the global branching time. The underestimation to 0.08. The lack of robustness of the BEAM approach of branching times in BEAM for genes that branch later to high noise is demonstrated in Fig. 2.Inthe low-noise than the global branching time is most likely due to the scenario (Fig. 2a, b), both BEAM and BGP are able to spline regularisation employed by BEAM, which tends to recover the gene expression branching dynamics. In the over-smooth the spline fit. The overestimation of branch- high-noise scenario (Fig. 2c, d), the global branching time ing times for early-branching genes is due to the arbitrary is early due to the early-branching genes in the data. The assignment of cells prior to the global branching time, as later-branching gene depicted has a branching time of no labels are provided by the global algorithm and no esti- b = 0.8 and the global assignment correctly separates the mation is performed by the spline-fitting algorithm. See two branches. However, due to the high noise in the data, Fig. 3c for an illustrative example. The former could pos- the spline is unable to identify the correct branching time sibly be rectified by tuning the regularisation approach and significantly underestimates it (Fig. 2c). In contrast, employed, but the latter is a fundamental restriction of the BGP model correctly identifies the late-branching the BEAM approach, which does not directly estimate nature of the gene, despite the early global branching time and the high-noise level of the data (Fig. 2d). Table 4 Synthetic study: root-mean-squared error for branching time estimation Table 2 Synthetic study: pseudotime rank correlation to the true Noise BEAM BGP time for both MFA and Monocle under both scenarios No prior prior Noise MFA Monocle 0.001 0.12 0.03 0.03 0.001 −0.96 1.0 0.01 0.11 0.04 0.05 0.01 0.98 1.0 0.03 0.13 0.06 0.07 0.03 −0.93 1.0 0.08 0.20 0.14 0.09 0.08 −0.97 1.0 0.1 0.23 0.12 0.09 0.1 −0.97 1.0 0.2 0.23 0.15 0.08 0.2 0.91 1.0 Only performed on branching genes MFA mixture of factor analysers BEAM branch expression analysis modelling, BGP branching Gaussian process Boukouvalas et al. Genome Biology (2018) 19:65 Page 6 of 15 (a) BEAM Low noise (b) BGP Low noise (c) BEAM High noise (d) BGP High noise Fig. 2 Synthetic data: example BEAM and BGP model predictions for late-branching genes. These branch at b = 0.8. Pseudotime is shown on the horizontal axis and the gene expression is depicted on the vertical axis. The global branching time (vertical black bar) and the posterior branching time uncertainty (magenta background) are shown. The vertical red bar is the BEAM branching time estimate and the vertical blue bar the BGP estimate. Cells have been coloured by the global Monocle assignment. a BEAM low noise. b BGP low noise. c BEAM high noise. d BGP high noise. BEAM branch expression analysis modelling, BGP branching Gaussian process branching assignments but uses only the globally derived The state estimation correctly identifies a single branch- label estimates. The BGP approach (Fig. 3b)doesnot ing point but the majority of cells are assigned to one of suffer from this deficiency, as the branch assignment is the branches (red). As one of the global branches (red) performed gene by gene at the cost of increased computa- spans both gene expression branches, it is unsurprising tion time. However, the task is easily parallelisable, as each that the spline approach fails to identify the branch loca- gene is treated independently. tion correctly and in fact overestimates the true branch- Lastly, we examine the effect of poor state estimation on ing time of t = 0.2 (Fig. 2a). The corresponding BGP theBEAMand BGPmethods (Fig. 4). The generated data inference (Fig. 4f) overcomes the errors in global state sets are identical except for the level of Gaussian observa- estimation and the confidence interval includes the true tion noise, whose variance is increased from σ = 0.001 branching time. (Fig. 4a)to σ = 0.03 (Fig. 4d). In Figs. 4a–c,the Mon- The robustness of the BGP model can be understood ocle state estimation accurately identifies the underlying in terms of the probabilistic nature of the model. Prior branching dynamics and both BGP and BEAM correctly global state information is considered as well as a like- estimate the branching dynamics. In Figs. 4d–f,weshow lihood term that fits a branching process. Therefore, an example of the effect of poor state estimation. the BGP prior model incorporates the best of both Boukouvalas et al. Genome Biology (2018) 19:65 Page 7 of 15 (a) Posterior (b) BGP (c) BEAM Fig. 3 Synthetic data: fitting BGP and BEAM. The horizontal axis depicts the pseudotime. a The true branching times (black dots), BEAM times (red crosses), BGP mean (blue dots) and 98% credible regions for all 40 synthetic genes. b,c BGP and BEAM estimates for the same early-branching gene. Gene expression is depicted on the vertical axis. The vertical grey bar is the global branching time. The vertical red dashed bar is the BEAM branching time estimate and the vertical magenta bar the BGP estimate. Cells have been coloured by the global assignment. BEAM branch expression analysis modelling, BGP branching Gaussian process worlds: inclusion of global assignment information and gene and required approximately 2 minutes of CPU time assessment of cell assignment based on individual gene per gene . expression. In the supplementary material (Additional A probabilistic model is an appropriate choice for early file 1: Section 3), we also show the robustness of the model haematopoiesis, which has been described as a cellu- to non-Gaussian data, using synthetic data generated lar continuum of low-primed HSCs [22]. The continuum using a single-cell RNA-seq data simulator [21] to gener- contains transitory states rather than discrete progenitor ate zero-inflated count data across a range of library sizes cell types. Some cell state transitions and lineage com- and dropout rates. We also find the model robust to data binations are more likely to occur than others. A proba- downsampling (Additional file 1: Section 4), which we use bilistic model such as BGP better reflects the probabilistic to speed up inference in the studies of experimental data nature of lineage selection highlighted in [22]. In the BGP that follow. model in particular, each cell is associated with an allo- cation probability for each branch. The branching point Haematopoiesis and single-cell RNA-seq can be interpreted as the earliest pseudotime from which We apply the BGP model on single-cell RNA-seq of probabilistic biases in lineage selection can be detected. haematopoietic stem cells (HSCs), which differentiate into We find 839 genes out of a possible 1072 that show myeloid and erythroid precursors [18]. Following the evidence of branching based on the Bayes factor. The pos- same quality control as [2], the data consists of 2312 terior branching times for all branching genes are shown genes and 4423 cells. We use M = 30 inducing points in Fig. 5a. Only 76 genes have a log likelihood over 50, in our sparse approximation and to speed up the com- with the majority of the genes close to the 0 threshold. In putations further, we randomly subsample the data down Fig. 5, we also show the branching times for 10 marker to 467 cells. In [2], an analysis of this data set was per- genes that have been found to show significant evidence formed and we use their estimation of pseudotime and of branching. Marker genes for megakaryocyte erythroid cell trajectory assignment labels using the Wishbone algo- progenitors (MEPs) and granulocyte macrophage progen- rithm. We found similar pseudotime and global branching itors (GMPs) are shown. For all GMP markers, the same assignments with Monocle 2 (vs 2.1), and those results branch is upregulated (magenta), whereas for the MEP are given in the supplementary material (Additional file 1: markers, the alternative branch is upregulated (brown). The confidence interval of branching times is also shown Section 7). To reduce the computation required, we apply a for each marker gene, which can be used in drawing infer- t-statistic on the end states of the two branches to fil- ences. For example, the CTSG GMP marker is predicted ter the genes that are most likely branching. We apply to branch earlier than the global branching time with high the BGP algorithm on the resulting set of 1072 genes. confidence, whereas the branching time of the CEBPA The BGP inference was performed in parallel for each marker is highly uncertain. Boukouvalas et al. Genome Biology (2018) 19:65 Page 8 of 15 (a) Monocle global state: (b) Gene branching estima- (c) Gene branching estima- Accurate state estimation tion using BEAM tion using BGP (d) Monocle global state: (e) Gene branching estima- (f) Gene branching estima- Inaccurate state estimation tion using BEAM tion using BGP Fig. 4 Synthetic data: effect of Monocle global state estimation on BEAM and BGP model predictions for early-branching genes. b = 0.2. Two different examples are shown, corresponding to accurate and inaccurate state estimation by Monocle. b, c, e, f Gene expression (vertical axis) vs pseudotime (horizontal axis). The vertical black bar is the global branching time. The vertical red bar is the BEAM branching time estimate and the vertical blue bar the BGP estimate. Cells have been coloured by the global Monocle assignment. a, d The Monocle-DDRTree latent space. BEAM branch expression analysis modelling, BGP branching Gaussian process When examining the model fit for the APOE marker time order relationships are denoted by directed edges. gene (Fig. 6), we observe a transitory gene expression for The posterior confidence cut-off used for the latter is 95%. one of the branches. In particular, the expression initially For instance, both PRTN3 and MPO are found to branch increases after the branch point, peaks and decreases to before CAR1 but only the former is branching before the level of the other branch. The spline approach used ELANE; this can be understood by the higher uncertainty in BEAM erroneously identifies the last intersection point in the branching posterior of MPO (Fig. 8c). In the net- as the branching point, as we show in the supplementary work we have groups genes in three distinct modules. The material (Additional file 1: Section 7), whereas the BGP PRTN3 and CTSG genes (red module) branched before approach computes a posterior over the branch locations, all other genes except CALR, i.e. ELANE, CAR1, CAR2 which quantifies the likelihood of both intersection points and GSTM1. The other group of early-branching genes, as branching locations. MPO and CALR (yellow module) branch before CAR1, We show the branching time network and gene expres- CAR2, GSTM1 but not ELANE. Finally, the later branch- sion profiles for the highest evidence branching genes in ing group consisting of ELANE and GSTM1, branch Figs. 7 and 8. These eight genes show very strong evidence before CAR1. of branching (r > 200). When considering more genes, looking at the entire net- In the network (Fig. 7), each gene is annotated with its work may be cumbersome and if the interest is solely on most likely branching time and the pairwise branching identifying the earliest branching genes, we can estimate Boukouvalas et al. Genome Biology (2018) 19:65 Page 9 of 15 (a) Posterior branching times (b) GMP (c) MEP Fig. 5 Haematopoiesis data. The horizontal axis is pseudotime in all plots. a Posterior branching times for 839 genes identified as branching (log Bayes factor >0) ordered by branching location. The global branching time is shown as a vertical grey bar (b = 0.11). The vertical axis is the gene index. b, c Branching times for marker genes that have been found to show significant evidence of branching for megakaryocyte erythroid progenitors (MEPs) and granulocyte macrophage progenitors (GMPs). The colour scheme reflects which branch is upregulated after the most likely branching time. The arrows associate each gene with its most likely branching time (dot) and posterior branching confidence interval (blue region). The percentile rank in terms of the log Bayes factor is shown in parentheses; e.g. 100 means ranked in the first percentile of all 1072 genes examined. GMP granulocyte macrophage progenitor, MEP megakaryocyte erythroid progenitor the posterior rank for each gene as described in the BGP Droplet-based single-cell RNA-seq overview section. When using a lower threshold on the We further demonstrate the effectiveness of the BGP branching evidence (log Bayes factor > 50), we find 76 model by applying it to single-cell RNA-seq data gen- genes are the earliest and latest branching genes, which erated using droplet barcoding [19]. Klein et al. [19] are listed in the supplementary material (Additional file 1: monitored the transcriptomic profiles and heterogene- Section 5). ity in the differentiation of mouse embryonic stem cells This analysis demonstrates the richness of the BGP after leukaemia inhibitory factor withdrawal. A total of model and the range of downstream analyses afforded by 2717 cells were profiled. Altogether, 24 175 transcripts the probabilistic nature of the model. were observed with cells captured at t = 0, 2, 4 and Fig. 7 Haematopoiesis gene expression. Network of most significantly branching genes (log Bayes factor >200). The most likely branching Fig. 6 Haematopoiesis data: example of transitory gene expression for time is given for each gene. The directed edges denote the gene the APOE gene. Pseudotime is shown on the horizontal axis. The cell branching order with a 95% confidence cut-off. The edge colours are assignment uncertainty (top, vertical axis) and posterior branching used to group genes that have identical later branching genes. The time posterior (bottom, vertical axis) is shown for the BGP method. horizontal placement of each gene is based on its most likely BGP branching Gaussian process branching time Boukouvalas et al. Genome Biology (2018) 19:65 Page 10 of 15 (a) PRTN3 b = 0.05 (b) CTSG b = 0.05 (c) MPO b = 0.05 (d) CALR b = 0.08 (e) ELANE b = 0.14 (f) CAR2 b = 0.21 (g) GSTM1 b = 0.14 (h) CAR1 b = 0.30 Fig. 8 Haematopoiesis gene expression. Gene expression for most significantly branching genes (log Bayes factor >200). The bottom panel for each gene denotes the posterior branching time over the discrete set considered. a PRTN3 b = 0.05. b CTSG b = 0.05. c MPO b = 0.05. d CALR b = 0.08. e ELANE b = 0.14. f CAR2 b = 0.21. g GSTM1 b = 0.14. h CAR1 b = 0.30 7 days. As for the haematopoiesis data, BGP inference branching early in pseudotime and are upregulated in the was performed in parallel for each gene, which required main branch (brown). They are also very highly ranked. approximately 2 minutes of CPU time per gene. The effect Krt8 and Krt18 are both in the top percentile and Krt19 of cell cycle was removed using the scLVM approach [7] is in the 96th percentile. The gene expression profiles as in [1], who used this data set for their analysis of DPT. (Fig. 10) show clear early-branching times for the epi- We use the pseudotime estimated using the DPT blast markers as well as for genes where the alternative method of [1]. The prior on the branching labels was branch is upregulated (e.g. Ccdc36). The branching time also derived from the DPT method with a prior confi- uncertainty is low for most marker genes, including the dence of 99%. We examine the first dominant branching epiblast markers. We also show an example of transitory event reported by [1]. Here 915 cells were assigned to the gene expression (Perp), where the BGP method selects trunk, and 1662 and 114 cells to each branch. To allow the earliest intersection point as the most likely branching for fast computation for all genes, we subsampled the time and this point has a higher posterior branching time gene expression data down from 2717 to 335 cells with uncertainty. 81 assigned to the trunk, and 159 and 95 to each branch. We use the branching time posterior for each gene to This ensures the branches have roughly the same number estimate a branch order network of genes. For ease of of points. For faster computation, we analyse the top 998 presentation, we examine only the seven branching genes genes according to the method of [23], which is available with the strongest evidence of branching (log Bayes fac- in the ScanPy software library [24]. tor > 500). All posterior rankings with confidence greater A summary of the BGP findings is shown in Fig. 9.A than 95% are included in the network. We show the gene total of 337 genes show evidence of branching and 661 branch order network in Fig. 11 and the corresponding do not. There is a continuum of early to late posterior gene expression profiles in Fig. 10. The earliest branch- branching times (Fig. 9a). ing gene, Ccdc36, was found to branch before all other We also show the branching times for a selection of the genes in the network within the 95% confidence thresh- genes found to be differentially expressed in [1](Fig. 9). old. The Actg1 and Ccdc36 genes branch before the later The percentile rank in terms of the log Bayes factor is branching genes Krt8, Krt18, Bc1 and Hsp90aa1. The epi- also shown for each gene. All three epiblast markers con- blast markers Krt8 and Krt18 and the Bc1 gene branch sidered (Krt8, Krt18 and Krt19) show strong evidence of before Hsp90aa1, which has the latest branching time Boukouvalas et al. Genome Biology (2018) 19:65 Page 11 of 15 (a) Posterior branching times (b) Marker genes with rank shown Fig. 9 Summary plots for mouse embryonic stem cell droplet data. The horizontal axis denotes pseudotime and the vertical axis is the gene index. a The most likely branching time (diamonds) and confidence interval (blue region) for all genes found to show evidence of branching. b The posterior for selected marked genes is coloured by the branch that is enriched (brown or magenta) (0.36). The earliest branching genes can also be found accurately identify branching times earlier than the global by directly computing the posterior rank for each (see branching time. We found that branching time estimates section ‘Overview of BGP’). These genes are listed in the from this spline-based approach were generally biased supplementary material (Additional file 1: Section 6). towards the global branching time. In contrast, the BGP method can robustly identify branching times, as it esti- Conclusion mates the cell branch association for each gene indepen- We have presented a flexible non-parametric probabilistic dently while accounting for cell assignment uncertainty approach for robustly identifying individual gene branch- in the posterior branching times. We also found the BGP ing times. For scalability, our model uses sparse varia- approach to be robust to global state estimation errors tional inference implemented using the GPflow package and high noise. The BGP branching time uncertainty can [16]. The probabilistic nature of our model allows for also be used in a downstream analysis of the individual well-defined parameter estimation via maximisation of a gene branching times, for example, ranking genes in bound on the marginal likelihood. terms of their most likely or minimum branching times. The spline model used by BEAM uses global branch In the BGP model, a separate assignment of cells to assignments for each cell and is, therefore, unable to branches is performed for each gene, since they are treated independently. This can lead to potentially misleading results, as cells may be assigned to different branches for different genes. Therefore, to achieve good results, we use an informative prior based on the cell assign- ment of a global method such as Monocle [3], DPT [1]or Wishbone [2]. This gives a high prior confidence of cell assignments after the global branching point. By using this strongly informative prior, we avoid the issue of incon- sistent assignments for cells with pseudotime after the global branching point. However, cell assignments can differ for genes branching prior to the global branching time and therefore, care must be taken when interpret- Fig. 10 Mouse embryonic stem cell droplet data. Expression profiles ing the results for such genes. This is an improvement for genes appearing in the branching order network (Fig. 11)and an over methods such as BEAM that randomly assign cells example of a gene with higher branching uncertainty (Perp). The bottom panel in each gene expression denotes the posterior prior to the global branching time. In future work, we branching time over the discrete set considered. a Ccdc36 b = 0.05. plan to extend the BGP model to share the cell assignment b Actg1 b = 0.08. c Actb b = 0.08. d Krt8 b = 0.11. e Krt18 b = 0.11. across all genes, therefore avoiding such inconsistencies f Bc1 b = 0.11. g Hsp90aa1 b = 0.36. h Perp b = 0.20 and simplifying any downstream analysis. Boukouvalas et al. Genome Biology (2018) 19:65 Page 12 of 15 (a) Ccdc36 b = 0.05 (b) Actg1 b = 0.08 (c) Actb b = 0.08 (d) Krt8 b = 0.11 (e) Krt18 b = 0.11 (f) Bc1 b = 0.11 (g) Hsp90aa1 b = 0.36 (h) Perp b = 0.20 Fig. 11 Mouse embryonic stem cell droplet data. Gene network of branching times of most significantly branching genes (log Bayes factor >500). A directed edge A → B denotes that gene A branches before gene B with a 95% probability. The edge colours are used to group genes that have identical later branching genes. The horizontal placement of each gene is based on its most likely branching time We have also included in our comparison a probabilis- The probabilistic nature of the BGP model allows for tic linear method [17]. The linearity allows for an effi- additional biological insights to be gained by constructing cient joint estimation of both the pseudotime and global a gene branch order network and identifying early lineage branching structure. Although this method does not esti- priming. The former is accomplished by computing a pair- mate gene bifurcation times, a probabilistic estimate of wise gene probability that assesses the likelihood of a gene whether an individual gene exhibits branching behaviour branching before another. The latter can be inferred by is available. However, in our synthetic study, we found examining the gene network and identifying the earliest that the pseudotime estimation was not robust and this branching genes. We demonstrated this approach on both reduces the effectiveness of the method. single-cell data sets, identifying early-branching genes and The application of the BGP method to the haema- confident orderings of gene branching events. topoiesis data revealed the importance of modelling tran- Concurrent with our study, [25]usedchangepoint ker- sitory gene expression, which has the potential to confuse nels to develop similar branching GPs to identify bifur- non-probabilistic methods. The model was able to select cations in single-cell transcriptional data sets. They use automatically the most likely branching location, even in a Markov chain Monte Carlo approach to estimate cell the presence of multiple crossing points in gene expres- branch association and branching times. Their approach sion without the need for any post-processing heuristics also explicitly models recombination, in which individual such as those included in the BEAM package. branches are merged, and they can jointly estimate pseu- We also demonstrated the flexibility of our approach dotime. However, the computational complexity of their by applying it to droplet-based single-cell data using the method would make application to genome-wide infer- pseudotime and branching association derived from the ence of branching times from unlabelled data challenging DPT method [1]. As the BGP approach does not rely on a and that is the motivation for our sparse inducing point particular method to estimate pseudotime and branching variational approach. association, a modular approach is possible in which the A number of extensions of the BGP model are possible best method for a given study is used. The prior uncer- that would increase the range of possible applications. The tainty specification on the branching association allows branching kernel could be adapted to detect changepoints the BGP user to quantify the expected accuracy of the in time series. Whereas we have modelled a branch- global branching method. ing event as the intersection of three latent functions, a Boukouvalas et al. Genome Biology (2018) 19:65 Page 13 of 15 changepoint would require only two latent functions. We The resulting covariance between any two latent func- would also like to extend our model to non-Gaussian like- tions f and g constrained to cross at t is lihoods, which would more accurately describe single-cell K K ff fg data. This would increase inference complexity but could K K gg gf provide better calibrated uncertainty estimates. Another ⎛ ⎞ K(T,t )K(t ,T) p p useful extension would be to jointly infer pseudotime and K (T, T) K t ,t ( ) p p ⎝ ⎠ branching behaviour, which would also improve uncer- = .(7) K T,t K t ,T ( p) ( p ) K (T, T) tainty estimation as the uncertainty arising from the esti- K t ,t ( p p) mation of the former would be included in the posterior where K (T, T), K (T, t ) and K (t , t ) are the kernel func- p p p branching uncertainty. Extending our model to multi- tions evaluated between all training data pseudotimes T, ple branching points is straightforward from a modelling between the training data and branching point, and solely standpoint but presents a more challenging optimisation at the branching, point respectively. problem, for which a tree prior on the branching struc- In [12], only two latent functions were specified, a con- ture may prove helpful [26]. This extension would allow us trol and perturbation condition where the former spanned to address the problem of selecting the correct number of the branching point. In our modified formulation, three branches in the global cellular branching dynamics. functions are used, allowing for a discontinuity in the gra- dient between the trunk and both branch latent functions. Methods As an extension of our model, the derivatives of the latent Branching model derivation outline functions could also be constrained to intersect at the We present an overview of the probabilistic model for branch point to allow for differentiable paths. BGP. The full model description and derivation of the variational inference lower bound is given in the supple- Full GP inference mentary material (Additional file 1: Section 2). Let Y ∈ R be thedataofinterestand let M be the num- First, we define the Gaussian process kernel describ- ber of functions that are dependent. We specify a set of ing a branching structure. We then derive a lower bound latent functions F for each data point using an expanded on the model likelihood using variational inference. In representation of size M × 1where M = NM .Let Z ∈ the supplementary material (Additional file 1:Section N ×M {0, 1} be the binary indicator matrix on the expanded 2.2), we present a formulation of a sparse inducing point representation that describes the association of each data approximation that allows the application of the model point to a latent function. Each row of Z has only one to large data sets. How to perform prediction on the full non-zero entry. The model likelihood is and sparse models is also presented in the supplementary material (Additional file 1: Section 2.3). p (Y |F, Z) = N Y |ZF, σ I.(8) Branching kernel The extension to multiple independent outputs is To model the branching process, we specify a branch- straightforward as the likelihood factorises: ing kernel that constrains the latent branching functions to intersect at the branching point. We use a modified p (Y |F, Z) = N Y |ZF , σ I,(9) d d version of the kernel proposed in [12]. The trunk f and d=1 branch kernel functions g and h are constrained to cross at the branching point t . We place Gaussian process priors where Y denotes the N ×1 column vector of observations p d on all three functions and constrain them to intersect: for output d and similarly F denotes the M × 1 column vector of latent function values. We omit the multiple f ∼ GP (0, K ), output case from the derivation below for clarity. g ∼ GP (0, K ), As in [13], we place a categorical prior on the indicator (6) h ∼ GP (0, K ), matrix Z and a GP prior on the latent functions F.Note f t = g t = h t . that the latter does not factorise as in [13], as we assume p p p the latent functions are dependent: For simplicity, the same kernel is used for all three functions although it would be straightforward to extend N M [Z] nm our framework to specify different kernels for each latent p(Z) = [] , (10) n,m function. This would allow for instance, one branch to be n=1 m=1 modelled as a periodic function and the others as non- p(F) = N 0, K , (11) ( ) periodic. The extra flexibility would come at the cost where for the multinomial distribution we have of increasing the number of parameters that need to be [] = 1and K is the GP kernel . nm estimated. m=1 Boukouvalas et al. Genome Biology (2018) 19:65 Page 14 of 15 The log likelihood is not analytically tractable, as it Our bound is, therefore, involves integrating out the indicator matrix Z: log p (Y |F) ≥ L , log p (Y |F) = log p (Y , Z|F) dZ. (12) wherewehavedefined We proceed to compute a lower bound using Jensen’s 2 L  − log 2πσ − KL q(Z) || p(Z) inequality: (1) T T T T − Y Y + F AF − 2F  Y . q(Z) 2σ log p (Y |F) = log p (Y , Z|F) dZ q(Z) (19) = E log p (Y , Z|F) − E log q(Z) . q(Z) q(Z) We proceed to integrate out the latent functions F to (13) obtain the variational collapsed bound: From the mean-field assumption, the latent functions F log p(Y ) = log p (Y |F) p(F)dF are independent of the association indicators Z: (20) q (Z, F) = q(Z)q(F). (14) ≥ log exp [L ] p(F)dF. The log likelihood term is This bound holds because L is a bound to log p (Y |F) N N and the exponent function is monotonic. More details can 2 2 log N Y |ZF, σ I =− log(2π) − log σ 2 2 be found in [27]. Setting the prior on the latent function as a GP, − (Y − ZF) (Y − ZF) . 2σ log p(F) = log N (F|0, K ), and substituting (19)into(20) (15) results in the collapsed bound N 1 1 Taking the expectation with respect to the variational 2 T L  − log(2πσ ) − Y Y − log |K | distribution q(Z): 2 2σ 2 −2 −1 N N − log Aσ + K 2 2 E log N Y |ZF, σ I =− log(2π) − log σ q(Z) 2 (21) 2 2 −1 1 −4 T −2 −1 T T T T T + σ Y  Aσ + K  Y − Y Y + F AF − 2F  Y , 2σ (16) − KL q(Z) || p(Z) . wherewehavedefined We can also derive this bound when using an inducing point approximation. This allows the algorithm to scale E (Z), q(Z) up to larger data sets. The full derivation is given in the A  E Z Z , supplementary material (Additional file 1: Section 2). q(Z) and the variational approximation is Endnotes n,m Time measured on a MacBook Pro with 2.2 GHz quad- q(Z) =  . (17) n,m n,m core Intel Core i7 and 16GB of DDR3L onboard memory. This expanded representation allows for efficient This encodes the mean-field assumption as we assume the posterior indicators factorise. recomputation of the marginal likelihood for different The second-order expectation for A is branching times. For simplicity, we assume the same kernel for every A = E z z i,j q(Z) n,i n,j output and latent trajectory function. Removing this restriction does not affect the derivation but will increase =  δ , (18) the inference complexity. n,i i,j where z the N × 1 indicator vector for latent function Additional file i = m. The KL divergence term is computable as Additional file 1: Supplementary material for BGP: identifying gene-specific branching dynamics from single-cell data with a branching n,m Gaussian process. This also contains additional results for both the KL q(Z) || p(Z) =  log . n,m synthetic and single-cell data sets. (PDF 1206 kb) [] n,m n,m Boukouvalas et al. Genome Biology (2018) 19:65 Page 15 of 15 Acknowledgments 9. Reid JE, Wernisch L. Pseudotime estimation: deconfounding single cell We wish to thank Xiaojie Qiu for his help with the BEAM Monocle 2 code and time series. Bioinformatics. 2016;32(19):2973–80. Kieran Campbell with running the MFA model. 10. Ahmed S, Rattray M, Boukouvalas A. GrandPrix: scaling up the Bayesian GPLVM for single-cell data. 2017. bioarxiv preprint: https://doi.org/10. Funding 1101/227843. MR and AB were supported by Medical Research Council award 11. Lönnberg T, Svensson V, James KR, Fernandez-Ruiz D, Sebina I, MR/M008908/1. JH was supported by Medical Research Council fellowship Montandon R, Soon MSF, Fogg LG, Nair AS, Liligeto UN, Stubbington MR/K022016/2. MJT, Ly L-H, Bagger FO, Zwiessele M, Lawrence ND, Souza-Fonseca-Guimaraes F, Bunn PT, Engwerda CR, Heath WR, Billker Availability of data and materials O, Stegle O, Haque A, Teichmann SA. Single-cell RNA-Seq and An open source Python implementation of BGP is available at https://github. computational analysis using temporal mixture modelling resolves com/ManchesterBioinference/BranchedGP [28], released under the Apache th1/tfh fate bifurcation in malaria. Sci Immunol. 2017;2(9). http:// 2.0 licence. Version 1.1 with DOI https://doi.org/10.5281/zenodo.1235962 was immunology.sciencemag.org/content/2/9/eaal2192. used to generate the results of the paper. Documentation and examples are 12. Yang J, Penfold CA, Grant MR, Rattray M. Inferring the perturbation time available on the GitHub page. from biological time course data. Bioinformatics. 2016;32(19):2956–64. All data used in the paper have previously been published and are freely 13. Lázaro-Gredilla M, Van Vaerenbergh S, Lawrence ND. Overlapping available. The haematopoiesis data were previously analysed in [2]and are mixtures of Gaussian processes for the data association problem. Pattern available from https://github.com/ManuSetty/wishbone. The dropseq data Recognit. 2012;45(4):1386–95. were previously analysed in [1] and are available from https://www.helmholtz- 14. Quiñonero-Candela J, Rasmussen CE. A unifying view of sparse muenchen.de/icb/research/groups/machine-learning/projects/dpt/index. approximate Gaussian process regression. J Mach Learn Res. 2005;6(Dec): html. 1939–59. 15. de Garis Matthews AG. Scalable Gaussian process inference using Authors’ contributions variational methods. Department of Engineering, University of AB, JH and MR designed the algorithm. AB and JH implemented the code and Cambridge; 2016. AB performed the numerical experiments and analysis. AB and MR wrote the 16. Matthews AGdG, van der Wilk M, Nickson T, Fujii K, Boukouvalas A, paper. All authors read and approved the final manuscript. León-Villagrá P, et al. GPflow: a Gaussian process library using Tensorflow. J Mach Learn Res. 2017;18(40):1–6. Ethics approval and consent to participate 17. Campbell K, Yau C. Probabilistic modeling of bifurcations in single-cell Not applicable. gene expression data using a Bayesian mixture of factor analyzers. Wellcome Open Res. 2017;2–19. https://doi.org/10.12688/ wellcomeopenres.11087.1. Competing interests 18. Paul F, Arkin Y, Giladi A, Jaitin DA, Kenigsberg E, Keren-Shaul H, et al. The authors declare that they have no competing interests. Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell. 2015;163(7):1663–77. Publisher’s Note 19. Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, et al. Springer Nature remains neutral with regard to jurisdictional claims in Droplet barcoding for single-cell transcriptomics applied to embryonic published maps and institutional affiliations. stem cells. Cell. 2015;161(5):1187–201. 20. Saelens W, Cannoodt R, Todorov H, Saeys Y. A comparison of single-cell Author details trajectory inference methods: towards more accurate and robust tools. Division of Informatics, Imaging and Data Sciences, Faculty of Biology, 2018276907. bioarxiv preprint: https://doi.org/10.1101/276907. Medicine and Health, University of Manchester, Oxford Road, Manchester, UK. 21. Zappia L, Phipson B, Oshlack A. Splatter: simulation of single-cell RNA Prowler.io, Cambridge, UK. sequencing data. Genome Biol. 2017;18(1):174. 22. Velten L, Haas SF, Raffel S, Blaszkiewicz S, Islam S, Hennig BP, et al. Received: 30 October 2017 Accepted: 1 May 2018 Human haematopoietic stem cell lineage commitment is a continuous process. Nat Cell Biol. 2017;19(4):271–81. 23. Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, et al. Massively parallel digital transcriptional profiling of single cells. Nat References Commun. 2017;8:14049. 1. Haghverdi L, Buettner M, Wolf FA, Buettner F, Theis FJ. Diffusion 24. Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene pseudotime robustly reconstructs lineage branching. Nat Methods. expression data analysis. Genome Biology. 19(1):15. https://doi.org/10. 2016;13(10):845–8. 1186/s13059-017-1382-0. 2. Setty M, Tadmor MD, Reich-Zeliger S, Angel O, Salame TM, Kathail P, 25. Penfold CA, Sybirna A, Reid J, Huang Y, Wernisch L, Ghahramani Z, et al. et al. Wishbone identifies bifurcating developmental trajectories from Nonparametric Bayesian inference of transcriptional branching and single-cell data. Nat Biotechnol. 2016;34(6):637–45. recombination identifies regulators of early human germ cell 3. Qiu X, Mao Q, Tang Y, Wang L, Chawla R, Pliner HA, et al. Reversed development. 2017. bioarxiv preprint: https://doi.org/10.1101/167684. graph embedding resolves complex single-cell trajectories. Nat Methods. 26. Simek K, Palanivelu R, Barnard K. Branching Gaussian processes with 2017;14(10):979. applications to spatiotemporal reconstruction of 3d trees. In: Leibe B, 4. Street K, Risso D, Fletcher RB, Das D, Ngai J, Yosef N, et al. Slingshot: cell Matas J, Sebe N, Welling M, editors. Computer Vision – ECCV 2016: 14th lineage and pseudotime inference for single-cell transcriptomics. 2017. European Conference, Amsterdam, The Netherlands, October 11–14, bioarxiv preprint: https://doi.org/10.1101/128843. 2016, Proceedings, Part VIII. New York: Springer International Publishing; 5. Rasmussen CE, Williams CKI. Gaussian processes for machine learning. 2016. p. 177–93. https://doi.org/10.1007/978-3-319-46484-8_11. London: The MIT Press; 2006. 27. King NJ, Lawrence ND. Fast variational inference for Gaussian process 6. Buettner F, Theis FJ. A novel approach for resolving differences in models through KL-correction. In: European Conference on Machine single-cell gene expression patterns from zygote to blastocyst. Learning. New York: Springer International Publishing; 2006. p. 270–81. Bioinformatics. 2012;28(18):i626–32. 28. Boukouvalas A, Hensman J, Magnus R. BGP: identifying gene-specific 7. Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, branching dynamics from single cell data with a branching Gaussian et al. Computational analysis of cell-to-cell heterogeneity in single-cell process. 2018. https://github.com/ManchesterBioinference/BranchedGP. RNA-sequencing data reveals hidden subpopulations of cells. Nat Biotechnol. 2015;33(2):155–60. 8. Campbell KR, Yau C. Order under uncertainty: robust differential expression analysis using probabilistic models for pseudotime inference. PLOS Comput Biol. 2016;12(11):1–20. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Genome Biology Springer Journals

BGP: identifying gene-specific branching dynamics from single-cell data with a branching Gaussian process

Loading next page...
 
/lp/springer_journal/bgp-identifying-gene-specific-branching-dynamics-from-single-cell-data-PeRWcxUX0S
Publisher
Springer Journals
Copyright
Copyright © 2018 by The Author(s)
Subject
Life Sciences; Animal Genetics and Genomics; Human Genetics; Plant Genetics and Genomics; Microbial Genetics and Genomics; Bioinformatics; Evolutionary Biology
eISSN
1474-760X
DOI
10.1186/s13059-018-1440-2
pmid
29843817
Publisher site
See Article on Publisher Site

Abstract

High-throughput single-cell gene expression experiments can be used to uncover branching dynamics in cell populations undergoing differentiation through pseudotime methods. We develop the branching Gaussian process (BGP), a non-parametric model that is able to identify branching dynamics for individual genes and provide an estimate of branching times for each gene with an associated credible region. We demonstrate the effectiveness of our method on simulated data, a single-cell RNA-seq haematopoiesis study and mouse embryonic stem cells generated using droplet barcoding. The method is robust to high levels of technical variation and dropout, which are common in single-cell data. Keywords: Single cell RNA-seq, Gaussian process, Branching dynamics Background assess the evidence for branching and to provide a pos- Single-cell gene expression data can be used to uncover terior estimate of the branching time. The posterior dis- cellular progression through different states of a temporal tribution over branching time can be used to identify the transformation, e.g. during development, differentiation most likely branching time for each gene as well as an or disease. As single-cell protocols improve, a flurry of associated credible region capturing our uncertainty in methods have been proposed to model branching of cel- the estimate. lular trajectories to alternative cell fates [1–4]. In these Our approach is based on Gaussian processes (GPs), and similar methods, pseudotime is estimated and a global which are a class of flexible non-parametric probabilistic branching structure is inferred. Our focus in this paper is models. GPs have a long history in temporal and spatial on a downstream analysis method that can subsequently statistics and have gained popularity in many areas of be used to model branching gene expression dynamics for machine learning, including multivariate regression, individual genes. We are interested in discovering which classification and dimensionality reduction [5]. GPs have genes follow the global cellular branching dynamics and been used for dimensionality reduction of single-cell whether these genes branch early or late with respect expression data [6, 7] and more recently for pseudotime to the global cellular branching time. Recently, [3]pro- estimation where the effect of uncertainty in the inferred posed the branch expression analysis modelling (BEAM) pseudotime can be quantified [8]and capturetimecan approach, which uses penalised splines to infer the indi- be included as prior information [9, 10]. GP-based meth- vidual gene branching time. Here, we propose an alter- ods have also been used for modelling global cellular native non-parametric method to model gene expression branching dynamics from single-cell data after assigning branching dynamics. We develop a probabilistic genera- pseudotime to cells [11]. tive model of branching dynamics that can be used to Here, we build on the work of [12], whodeveloped aGP model to identify when two gene expression time-course *Correspondence: alexis.boukouvalas@manchester.ac.uk; data sets first diverge from one another. They defined a Magnus.Rattray@manchester.ac.uk novelGPcovariancefunctionthatconstrainstwo func- Division of Informatics, Imaging and Data Sciences, Faculty of Biology, Medicine and Health, University of Manchester, Oxford Road, Manchester, UK tions to intersect at a single point. The divergence time Full list of author information is available at the end of the article © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Boukouvalas et al. Genome Biology (2018) 19:65 Page 2 of 15 is inferred by numerically approximating the posterior Specifically for N cells, naive covariance inversion scales using a simple histogram approach. The model identifies as O N while under sparse inference with k inducing when a gene first becomes differentially expressed in time- points, it scales as O k N where typically k  N.Sec- course gene expression data under control and perturbed ondly, we provide an open-source implementation that conditions. In their approach, all data points are labelled leverages the GPflow library [16], which both simplifies with the branch that generated them and the ordering the implementation due to automatic symbolic differenti- of time points is assumed known. Although similar to ation and allows for the necessary matrix operations to be the problem of modelling branching in single-cell data computed in parallel across many CPU nodes or GPUs. after pseudotime is inferred, this two-sample time-series The paper is organised as follows. We first give an method cannot be applied directly to our problem because overview of the BGP model, including how inference is we have to allow for uncertainty regarding which branch performed in a scalable manner. We discuss how uncer- each cell belongs to. tainty is quantified and represented in a full posterior Also closely related to the present work, the overlap- distribution on the branching time for each gene. We con- ping mixture of GPs (OMGP) [13] is a mixture model for trast the performance of the BGP method to two recently time-series data where the mixture components are GP published methods, the mixture of factor analysers (MFA) functions and the data at any time can be assigned to any [17] and the BEAM approach [3], in a synthetic study of the components. For single-cell data, after pseudotime across a variety of simulated scenarios. We apply the is assigned to each cell, then the OMGP model can be BGP model to data from a haematopoiesis study [18]and used to assign cells to different trajectories. The cell to single-cell mouse embryonic stem cell data generated labels do not have to be known in advance and can be using droplet barcoding [19]. We conclude with a sum- inferred through fitting the model to data. However, in mary of our findings and a discussion of possible future OMGP models, the cellular trajectories are independent research directions. rather than branching. The OMGP model has been applied to single-cell data to infer global cell branching Results and discussion times [11] but as OMGP assumes the latent functions are Overview of BGP independent without any branching, a heuristic based Before applying the BGP method, we require the pseu- on a piecewise linear fit of the log likelihood surface is dotime for each cell and the global branching pattern of proposed to identify the most likely branching times. This the cells. In our experiments, we used results from the is problematic, since OMGP does not provide a proper Monocle reversed graph embedding approach, termed generative model of branching dynamics and therefore, DDRTree [3], and the diffusion pseudotime (DPT) [1] it is not clear how to compute the posterior distribution and the wishbone [2] approaches. However, BGP can be over the branching time. based on any method that estimates pseudotime and a Our main methodological contribution here is to gene-wide cell branching association. For a recent review, generalise the OMGP model to account explicitly for see [20]. dependence between the functions in the mixture model. OMGP [13] is a mixture model of independent GPs that Specifically, we consider that the functions branch, as is able to associate an observation with the generating in [12]. This allowed us to develop a probabilistic model GP. The authors term this the association problem and over branching cellular trajectories where the assignment derive a variational inference algorithm for independent of cells to branches is not known in advance. Our new GPs. Our work extends the OMGP model in two direc- branching GP (BGP) model allows us to calculate the tions: firstly, we remove the assumption of latent func- posterior distribution over branching time for each gene tion independence and allow dependent GPs as required while allowing for uncertainty in the branch labels for by a branching model. Secondly, we provide a sparse each cell. This uncertainty is especially important for inducing point approximation that allows for scalable early-branching genes, since cells are not labelled with a inference. branch prior to the global cellular branching time, which Let F be a BGP evaluated for N data points (cells) with N ×M we assume is known. M latent functions. Z ∈ {0, 1} indicates which branch A naive implementation of GP models scales cubically each cell comes from. The number of latent functions for with the size of the data. As increasing numbers of cells a single branching point is M = 3, as we have separate can be profiled in new single-cell protocols, we ensure latent functions for the trunk and each branch. The like- the scalability of our approach by employing two com- lihood is p (Y |F, Z) = N Y |ZF, σ I and as in [13], we plementary approaches. Firstly, we use sparse inference place a categorical prior on the indicator matrix p(Z) = N M [Z] nm [14, 15], which allows model fitting to scale with the [] . We place a GP prior on the latent n=1 m=1 n,m number of inducing points. The latter is a user-defined functions p (F|t ) = GP (0, K |t ), which constrains the b b value that trades off model accuracy and training time. latent functions to branch at pseudotime t . Note that the b Boukouvalas et al. Genome Biology (2018) 19:65 Page 3 of 15 latter does not factorise as in [13], as the latent functions The marginal likelihood of the model can also be used to are dependent. Further details of the model derivation calculate the Bayes factor of branching versus not branch- and inference scheme used are provided in ‘Methods’ ing (assuming equal priors). This is used to rank genes by and a detailed derivation, including the inducing point how likely it is that their expression exhibits branching. By approximation, is provided in the supplementary material numerically integrating out the uncertainty of the branch- (Additional file 1: Section 2). ing location, the logged Bayes factor r includes the effect Global branching labels, such as those provided by of posterior uncertainty in the branching location: DDRTree, can provide an informative prior p(Z) for all genes. The prior before the global branching point is P(0 < t < 1|Y ) r = log uninformative, as no global assignment is available. This g P(t →∞)|Y ) is relevant for early-branching genes, which may start ⎡ ⎤ branching earlier than the global cellular branching. After ⎣ ⎦ = log p (Y |t = i) − log p (Y |t →∞)) , b b the global branching point, the prior favours an increased i∈S assignment probability to the globally assigned branch. (3) If the prior places significant mass on the alternative assignment, the resulting assignment may differ from the global allocation given enough evidence from the likeli- where t =∞ specifies that the model does not branch hood term. This allows the model to correct mislabelled and we have assumed equal prior probabilities for branch- cells as well as account for sources of noise in the data, ing and not branching. such as dropout for lowly expressed genes. However, in An example of the BGP model fit is shown in Fig. 1b. our single-cell case studies, we use a strong prior assign- The uncertainty in the cell branch association is shown ment probability of 0.99 to avoid cell reassignment after in conjunction with the posterior on the branching times. the global branching time. This simplifies the interpre- For visualisation, the cell assignment to the top branch tation of the results, as cells do not switch their global is shown. We see that most cells away from the branch- branch assignment. Nevertheless, other constructions are ing point are assigned with high confidence to one of possible with this model, e.g. the distance of the cells the branches. However, cells that are equidistant from from the global branching time may be used to adjust the both branches have high assignment uncertainty (0.5). associated prior uncertainty. This is also the case for cells close to the branching loca- The model hyperparameters are fitted by maximising tion where the two branches are in close proximity. In a bound on the log-likelihood. The log-likelihood is not the bottom panel of Fig. 1b, the posterior on the branch- analytically tractable, as it involves integrating out the ing location shows there is significant uncertainty on the indicator matrix Z and therefore, we use a variational precise branching location. This is reflected in Fig. 1a approximation. A lower bound is available using Jensen’s in the branching time uncertainty (magenta). The cell inequality: assignment uncertainty is incorporated into the branch- ing time posterior. If the branches separate quickly, the posterior branching time uncertainty is likely to be small. log p (Y |F) ≥ E log p (Y |F, Z) −KL q(Z)||p(Z) , q(Z) This reflects one of the main benefits of employing a (1) probabilistic model to identify branching dynamics as the assignment uncertainty is considered when calculating the whereweuse amean-fieldapproximation q (Z, F) = branching time posterior. The cell assignment is inferred q(Z)q(F) with the latent functions F independent of the in the BGP model, in contrast to the model in [12]where association indicators Z and q(Z) = φ .The φ nm nm nm the assignment is assumed known. approximates the posterior probability of cell n belonging Additional biological insights can be gleaned from the to branch m. The latter is either the trunk state or one of BGP method by inferring a branch order network using the two branches for the single branching considered in the posterior for each branching gene. The probability of the applications here. Then, F can be integrated out to get agene A branching before time t can be calculated using the marginal likelihood p(Y ). S samples from the branching posterior, The branching time posterior probability is calculated using the approximate marginal likelihood evaluated at a set of candidate branching points S of size N .The B b P (Br(A)< t)) = I (s < t),(4) posterior for a candidate branching time c is s=1 p(Y |t = c) where Br(A) is the branching time of gene A and s are p(t = c|Y ) =  .(2) p(Y |t = i) b the posterior branching time samples. The probability of a i∈S B Boukouvalas et al. Genome Biology (2018) 19:65 Page 4 of 15 (a) DDRTree prior assignment (b) Cell assignment Fig. 1 Haematopoiesis gene expression, showing the BGP fit for the MPO gene. a The Wishbone branching assignment is shown for each cell along with the global branching time (black dashed line), the most likely branching time (blue solid line) and posterior branching time uncertainty (magenta background). The sample of cells used to fit the BGP model is shown with larger markers. b The posterior cell assignment is shown in the top subpanel. In the bottom subpanel, the posterior branching time is shown. Pseudotime is shown on the horizontal axis of all plots. a, b Gene expression is depicted on the vertical axis. c The posterior branching probability BGP branching Gaussian process gene A branching before gene B can be calculated similarly functions cross after the branching point were rejected from each branching posterior, since these may be difficult for other methods, e.g. BEAM identifies the last crossing location for its fitted splines S and may, therefore, identify the wrong point in a time P (Br(A)< Br(B)) = I (s < s ).(5) series that crosses after branching. We address this issue A B s=1 in the real-data study considered in the next section but do not consider it in the synthetic benchmark. We generate We can infer a branch order network by computing this N = 150 data points (cells) with D = 40 genes and pseu- probability for all pairs of branching genes at the desired dotime in a unit interval [0, 1].The genesare separated confidence level. In the single-cell applications we discuss into three groups depending on their branching behaviour later, we use this approach to construct a directed network and time (Table 1). graph of gene branching times. All methods were run with default parameter settings, We can also calculate a posterior rank for each gene so it may be possible to improve on their performance with an associated confidence interval. Using samples of by tuning these parameters. For example, as in [17], we the posterior branching time, we can estimate quantiles of found the performance of the algorithm depended on the the rank distribution for each gene. This is a simple way initialisation used. We contrast the performance of the to infer which genes branch early and which late through BGP model both without and with an informative prior a probabilistic ranking. Unlike the previous approach, this (80% prior probability) on cell assignment derived from does not allow pairs of genes to be compared but can pro- the global Monocle assignment. vide an overall summary of the branch ordering without We first compare the pseudotime estimation accu- the need for a network analysis. racy of Monocle and MFA. Both methods achieve good Synthetic study Table 1 Synthetic gene groups We evaluate three methods, MFA [17], the BEAM Group Branching time G approach [3] and the BGP model, on synthetically gener- ated data. For the synthetic study, we use Gaussian noise Early 0.2 10 and therefore, we use the BEAM algorithm with a Gaus- Late 0.8 20 sian likelihood function. MFA also assumes a Gaussian No branching NA 10 likelihood function. Data are generated from branching Branching times and number of genes G for each group. All scenarios use N = 150 GPs with signal variance σ = 2, length scale λ = 1.2 cells and a total of D = 40 genes and a range of noise levels (Table 2). Samples where the NA not applicable Boukouvalas et al. Genome Biology (2018) 19:65 Page 5 of 15 Table 3 Synthetic study: Area under the curve for detecting performance as measured by the rank correlation of the branching genes estimated pseudotime to the ground truth (Table 2). The Bayes factor of the branching GP can be used Noise MFA BEAM BGP to rank the evidence of branching for each gene. Simi- No prior prior lar measures exist for MFA and the BEAM method. We 0.001 0.30 1.00 1.00 1.00 first compare the three methods on their ability to dis- 0.01 0.65 1.00 1.00 1.00 criminate branching from non-branching genes (Table 3). 0.03 0.77 0.96 0.98 0.99 Themetricweuse is theareaunder thecurve,which 0.08 0.82 0.93 0.86 0.92 provides a reasonable measure when the numbers of pos- itives and negatives in the ground truth are balanced. 0.1 0.77 0.82 0.81 0.96 Both BEAM and BGP have higher accuracy than MFA, 0.2 0.77 0.74 0.86 0.84 whose performance varies significantly. The inclusion of BEAM branch expression analysis modelling, BGP branching Gaussian process, MFA an informative prior improves the performance of the mixture of factor analysers BGP model, resulting in consistently high performance for all noise levels. The performance of BEAM decreases with increased noise level, which is also the case for the BGP More generally, the spline approach taken in BEAM suf- model to a lesser extent. fers from a consistent bias in branching time estimation, We also examine the error in identifying the branch- which pulls all estimates towards the global branching ing time. As MFA does not provide such an estimate, we time. To demonstrate this effect clearly, we examine an consider only the BEAM and BGP methods. The error in additional synthetic example with three genes branch- estimating branching time for the BEAM and BGP meth- ing very early (0.1), 27 genes branching late (0.7) and 10 ods is given in Table 4. The error for the BGP method is genes not branching. We select a low noise level (0.001). consistently lower than that for the BEAM method. The Since there are many late-branching genes and few early- informative prior allows for more consistent performance branching genes, the global branching time is late (Fig. 3), of the BGP method. This method has substantial increases which clearly demonstrates the bias effect. As can be seen in accuracy in some scenarios, especially for the highest in Fig. 3a, the estimates for the BEAM method are biased noise level (0.2), where the error is reduced from 0.15 towards the global branching time. The underestimation to 0.08. The lack of robustness of the BEAM approach of branching times in BEAM for genes that branch later to high noise is demonstrated in Fig. 2.Inthe low-noise than the global branching time is most likely due to the scenario (Fig. 2a, b), both BEAM and BGP are able to spline regularisation employed by BEAM, which tends to recover the gene expression branching dynamics. In the over-smooth the spline fit. The overestimation of branch- high-noise scenario (Fig. 2c, d), the global branching time ing times for early-branching genes is due to the arbitrary is early due to the early-branching genes in the data. The assignment of cells prior to the global branching time, as later-branching gene depicted has a branching time of no labels are provided by the global algorithm and no esti- b = 0.8 and the global assignment correctly separates the mation is performed by the spline-fitting algorithm. See two branches. However, due to the high noise in the data, Fig. 3c for an illustrative example. The former could pos- the spline is unable to identify the correct branching time sibly be rectified by tuning the regularisation approach and significantly underestimates it (Fig. 2c). In contrast, employed, but the latter is a fundamental restriction of the BGP model correctly identifies the late-branching the BEAM approach, which does not directly estimate nature of the gene, despite the early global branching time and the high-noise level of the data (Fig. 2d). Table 4 Synthetic study: root-mean-squared error for branching time estimation Table 2 Synthetic study: pseudotime rank correlation to the true Noise BEAM BGP time for both MFA and Monocle under both scenarios No prior prior Noise MFA Monocle 0.001 0.12 0.03 0.03 0.001 −0.96 1.0 0.01 0.11 0.04 0.05 0.01 0.98 1.0 0.03 0.13 0.06 0.07 0.03 −0.93 1.0 0.08 0.20 0.14 0.09 0.08 −0.97 1.0 0.1 0.23 0.12 0.09 0.1 −0.97 1.0 0.2 0.23 0.15 0.08 0.2 0.91 1.0 Only performed on branching genes MFA mixture of factor analysers BEAM branch expression analysis modelling, BGP branching Gaussian process Boukouvalas et al. Genome Biology (2018) 19:65 Page 6 of 15 (a) BEAM Low noise (b) BGP Low noise (c) BEAM High noise (d) BGP High noise Fig. 2 Synthetic data: example BEAM and BGP model predictions for late-branching genes. These branch at b = 0.8. Pseudotime is shown on the horizontal axis and the gene expression is depicted on the vertical axis. The global branching time (vertical black bar) and the posterior branching time uncertainty (magenta background) are shown. The vertical red bar is the BEAM branching time estimate and the vertical blue bar the BGP estimate. Cells have been coloured by the global Monocle assignment. a BEAM low noise. b BGP low noise. c BEAM high noise. d BGP high noise. BEAM branch expression analysis modelling, BGP branching Gaussian process branching assignments but uses only the globally derived The state estimation correctly identifies a single branch- label estimates. The BGP approach (Fig. 3b)doesnot ing point but the majority of cells are assigned to one of suffer from this deficiency, as the branch assignment is the branches (red). As one of the global branches (red) performed gene by gene at the cost of increased computa- spans both gene expression branches, it is unsurprising tion time. However, the task is easily parallelisable, as each that the spline approach fails to identify the branch loca- gene is treated independently. tion correctly and in fact overestimates the true branch- Lastly, we examine the effect of poor state estimation on ing time of t = 0.2 (Fig. 2a). The corresponding BGP theBEAMand BGPmethods (Fig. 4). The generated data inference (Fig. 4f) overcomes the errors in global state sets are identical except for the level of Gaussian observa- estimation and the confidence interval includes the true tion noise, whose variance is increased from σ = 0.001 branching time. (Fig. 4a)to σ = 0.03 (Fig. 4d). In Figs. 4a–c,the Mon- The robustness of the BGP model can be understood ocle state estimation accurately identifies the underlying in terms of the probabilistic nature of the model. Prior branching dynamics and both BGP and BEAM correctly global state information is considered as well as a like- estimate the branching dynamics. In Figs. 4d–f,weshow lihood term that fits a branching process. Therefore, an example of the effect of poor state estimation. the BGP prior model incorporates the best of both Boukouvalas et al. Genome Biology (2018) 19:65 Page 7 of 15 (a) Posterior (b) BGP (c) BEAM Fig. 3 Synthetic data: fitting BGP and BEAM. The horizontal axis depicts the pseudotime. a The true branching times (black dots), BEAM times (red crosses), BGP mean (blue dots) and 98% credible regions for all 40 synthetic genes. b,c BGP and BEAM estimates for the same early-branching gene. Gene expression is depicted on the vertical axis. The vertical grey bar is the global branching time. The vertical red dashed bar is the BEAM branching time estimate and the vertical magenta bar the BGP estimate. Cells have been coloured by the global assignment. BEAM branch expression analysis modelling, BGP branching Gaussian process worlds: inclusion of global assignment information and gene and required approximately 2 minutes of CPU time assessment of cell assignment based on individual gene per gene . expression. In the supplementary material (Additional A probabilistic model is an appropriate choice for early file 1: Section 3), we also show the robustness of the model haematopoiesis, which has been described as a cellu- to non-Gaussian data, using synthetic data generated lar continuum of low-primed HSCs [22]. The continuum using a single-cell RNA-seq data simulator [21] to gener- contains transitory states rather than discrete progenitor ate zero-inflated count data across a range of library sizes cell types. Some cell state transitions and lineage com- and dropout rates. We also find the model robust to data binations are more likely to occur than others. A proba- downsampling (Additional file 1: Section 4), which we use bilistic model such as BGP better reflects the probabilistic to speed up inference in the studies of experimental data nature of lineage selection highlighted in [22]. In the BGP that follow. model in particular, each cell is associated with an allo- cation probability for each branch. The branching point Haematopoiesis and single-cell RNA-seq can be interpreted as the earliest pseudotime from which We apply the BGP model on single-cell RNA-seq of probabilistic biases in lineage selection can be detected. haematopoietic stem cells (HSCs), which differentiate into We find 839 genes out of a possible 1072 that show myeloid and erythroid precursors [18]. Following the evidence of branching based on the Bayes factor. The pos- same quality control as [2], the data consists of 2312 terior branching times for all branching genes are shown genes and 4423 cells. We use M = 30 inducing points in Fig. 5a. Only 76 genes have a log likelihood over 50, in our sparse approximation and to speed up the com- with the majority of the genes close to the 0 threshold. In putations further, we randomly subsample the data down Fig. 5, we also show the branching times for 10 marker to 467 cells. In [2], an analysis of this data set was per- genes that have been found to show significant evidence formed and we use their estimation of pseudotime and of branching. Marker genes for megakaryocyte erythroid cell trajectory assignment labels using the Wishbone algo- progenitors (MEPs) and granulocyte macrophage progen- rithm. We found similar pseudotime and global branching itors (GMPs) are shown. For all GMP markers, the same assignments with Monocle 2 (vs 2.1), and those results branch is upregulated (magenta), whereas for the MEP are given in the supplementary material (Additional file 1: markers, the alternative branch is upregulated (brown). The confidence interval of branching times is also shown Section 7). To reduce the computation required, we apply a for each marker gene, which can be used in drawing infer- t-statistic on the end states of the two branches to fil- ences. For example, the CTSG GMP marker is predicted ter the genes that are most likely branching. We apply to branch earlier than the global branching time with high the BGP algorithm on the resulting set of 1072 genes. confidence, whereas the branching time of the CEBPA The BGP inference was performed in parallel for each marker is highly uncertain. Boukouvalas et al. Genome Biology (2018) 19:65 Page 8 of 15 (a) Monocle global state: (b) Gene branching estima- (c) Gene branching estima- Accurate state estimation tion using BEAM tion using BGP (d) Monocle global state: (e) Gene branching estima- (f) Gene branching estima- Inaccurate state estimation tion using BEAM tion using BGP Fig. 4 Synthetic data: effect of Monocle global state estimation on BEAM and BGP model predictions for early-branching genes. b = 0.2. Two different examples are shown, corresponding to accurate and inaccurate state estimation by Monocle. b, c, e, f Gene expression (vertical axis) vs pseudotime (horizontal axis). The vertical black bar is the global branching time. The vertical red bar is the BEAM branching time estimate and the vertical blue bar the BGP estimate. Cells have been coloured by the global Monocle assignment. a, d The Monocle-DDRTree latent space. BEAM branch expression analysis modelling, BGP branching Gaussian process When examining the model fit for the APOE marker time order relationships are denoted by directed edges. gene (Fig. 6), we observe a transitory gene expression for The posterior confidence cut-off used for the latter is 95%. one of the branches. In particular, the expression initially For instance, both PRTN3 and MPO are found to branch increases after the branch point, peaks and decreases to before CAR1 but only the former is branching before the level of the other branch. The spline approach used ELANE; this can be understood by the higher uncertainty in BEAM erroneously identifies the last intersection point in the branching posterior of MPO (Fig. 8c). In the net- as the branching point, as we show in the supplementary work we have groups genes in three distinct modules. The material (Additional file 1: Section 7), whereas the BGP PRTN3 and CTSG genes (red module) branched before approach computes a posterior over the branch locations, all other genes except CALR, i.e. ELANE, CAR1, CAR2 which quantifies the likelihood of both intersection points and GSTM1. The other group of early-branching genes, as branching locations. MPO and CALR (yellow module) branch before CAR1, We show the branching time network and gene expres- CAR2, GSTM1 but not ELANE. Finally, the later branch- sion profiles for the highest evidence branching genes in ing group consisting of ELANE and GSTM1, branch Figs. 7 and 8. These eight genes show very strong evidence before CAR1. of branching (r > 200). When considering more genes, looking at the entire net- In the network (Fig. 7), each gene is annotated with its work may be cumbersome and if the interest is solely on most likely branching time and the pairwise branching identifying the earliest branching genes, we can estimate Boukouvalas et al. Genome Biology (2018) 19:65 Page 9 of 15 (a) Posterior branching times (b) GMP (c) MEP Fig. 5 Haematopoiesis data. The horizontal axis is pseudotime in all plots. a Posterior branching times for 839 genes identified as branching (log Bayes factor >0) ordered by branching location. The global branching time is shown as a vertical grey bar (b = 0.11). The vertical axis is the gene index. b, c Branching times for marker genes that have been found to show significant evidence of branching for megakaryocyte erythroid progenitors (MEPs) and granulocyte macrophage progenitors (GMPs). The colour scheme reflects which branch is upregulated after the most likely branching time. The arrows associate each gene with its most likely branching time (dot) and posterior branching confidence interval (blue region). The percentile rank in terms of the log Bayes factor is shown in parentheses; e.g. 100 means ranked in the first percentile of all 1072 genes examined. GMP granulocyte macrophage progenitor, MEP megakaryocyte erythroid progenitor the posterior rank for each gene as described in the BGP Droplet-based single-cell RNA-seq overview section. When using a lower threshold on the We further demonstrate the effectiveness of the BGP branching evidence (log Bayes factor > 50), we find 76 model by applying it to single-cell RNA-seq data gen- genes are the earliest and latest branching genes, which erated using droplet barcoding [19]. Klein et al. [19] are listed in the supplementary material (Additional file 1: monitored the transcriptomic profiles and heterogene- Section 5). ity in the differentiation of mouse embryonic stem cells This analysis demonstrates the richness of the BGP after leukaemia inhibitory factor withdrawal. A total of model and the range of downstream analyses afforded by 2717 cells were profiled. Altogether, 24 175 transcripts the probabilistic nature of the model. were observed with cells captured at t = 0, 2, 4 and Fig. 7 Haematopoiesis gene expression. Network of most significantly branching genes (log Bayes factor >200). The most likely branching Fig. 6 Haematopoiesis data: example of transitory gene expression for time is given for each gene. The directed edges denote the gene the APOE gene. Pseudotime is shown on the horizontal axis. The cell branching order with a 95% confidence cut-off. The edge colours are assignment uncertainty (top, vertical axis) and posterior branching used to group genes that have identical later branching genes. The time posterior (bottom, vertical axis) is shown for the BGP method. horizontal placement of each gene is based on its most likely BGP branching Gaussian process branching time Boukouvalas et al. Genome Biology (2018) 19:65 Page 10 of 15 (a) PRTN3 b = 0.05 (b) CTSG b = 0.05 (c) MPO b = 0.05 (d) CALR b = 0.08 (e) ELANE b = 0.14 (f) CAR2 b = 0.21 (g) GSTM1 b = 0.14 (h) CAR1 b = 0.30 Fig. 8 Haematopoiesis gene expression. Gene expression for most significantly branching genes (log Bayes factor >200). The bottom panel for each gene denotes the posterior branching time over the discrete set considered. a PRTN3 b = 0.05. b CTSG b = 0.05. c MPO b = 0.05. d CALR b = 0.08. e ELANE b = 0.14. f CAR2 b = 0.21. g GSTM1 b = 0.14. h CAR1 b = 0.30 7 days. As for the haematopoiesis data, BGP inference branching early in pseudotime and are upregulated in the was performed in parallel for each gene, which required main branch (brown). They are also very highly ranked. approximately 2 minutes of CPU time per gene. The effect Krt8 and Krt18 are both in the top percentile and Krt19 of cell cycle was removed using the scLVM approach [7] is in the 96th percentile. The gene expression profiles as in [1], who used this data set for their analysis of DPT. (Fig. 10) show clear early-branching times for the epi- We use the pseudotime estimated using the DPT blast markers as well as for genes where the alternative method of [1]. The prior on the branching labels was branch is upregulated (e.g. Ccdc36). The branching time also derived from the DPT method with a prior confi- uncertainty is low for most marker genes, including the dence of 99%. We examine the first dominant branching epiblast markers. We also show an example of transitory event reported by [1]. Here 915 cells were assigned to the gene expression (Perp), where the BGP method selects trunk, and 1662 and 114 cells to each branch. To allow the earliest intersection point as the most likely branching for fast computation for all genes, we subsampled the time and this point has a higher posterior branching time gene expression data down from 2717 to 335 cells with uncertainty. 81 assigned to the trunk, and 159 and 95 to each branch. We use the branching time posterior for each gene to This ensures the branches have roughly the same number estimate a branch order network of genes. For ease of of points. For faster computation, we analyse the top 998 presentation, we examine only the seven branching genes genes according to the method of [23], which is available with the strongest evidence of branching (log Bayes fac- in the ScanPy software library [24]. tor > 500). All posterior rankings with confidence greater A summary of the BGP findings is shown in Fig. 9.A than 95% are included in the network. We show the gene total of 337 genes show evidence of branching and 661 branch order network in Fig. 11 and the corresponding do not. There is a continuum of early to late posterior gene expression profiles in Fig. 10. The earliest branch- branching times (Fig. 9a). ing gene, Ccdc36, was found to branch before all other We also show the branching times for a selection of the genes in the network within the 95% confidence thresh- genes found to be differentially expressed in [1](Fig. 9). old. The Actg1 and Ccdc36 genes branch before the later The percentile rank in terms of the log Bayes factor is branching genes Krt8, Krt18, Bc1 and Hsp90aa1. The epi- also shown for each gene. All three epiblast markers con- blast markers Krt8 and Krt18 and the Bc1 gene branch sidered (Krt8, Krt18 and Krt19) show strong evidence of before Hsp90aa1, which has the latest branching time Boukouvalas et al. Genome Biology (2018) 19:65 Page 11 of 15 (a) Posterior branching times (b) Marker genes with rank shown Fig. 9 Summary plots for mouse embryonic stem cell droplet data. The horizontal axis denotes pseudotime and the vertical axis is the gene index. a The most likely branching time (diamonds) and confidence interval (blue region) for all genes found to show evidence of branching. b The posterior for selected marked genes is coloured by the branch that is enriched (brown or magenta) (0.36). The earliest branching genes can also be found accurately identify branching times earlier than the global by directly computing the posterior rank for each (see branching time. We found that branching time estimates section ‘Overview of BGP’). These genes are listed in the from this spline-based approach were generally biased supplementary material (Additional file 1: Section 6). towards the global branching time. In contrast, the BGP method can robustly identify branching times, as it esti- Conclusion mates the cell branch association for each gene indepen- We have presented a flexible non-parametric probabilistic dently while accounting for cell assignment uncertainty approach for robustly identifying individual gene branch- in the posterior branching times. We also found the BGP ing times. For scalability, our model uses sparse varia- approach to be robust to global state estimation errors tional inference implemented using the GPflow package and high noise. The BGP branching time uncertainty can [16]. The probabilistic nature of our model allows for also be used in a downstream analysis of the individual well-defined parameter estimation via maximisation of a gene branching times, for example, ranking genes in bound on the marginal likelihood. terms of their most likely or minimum branching times. The spline model used by BEAM uses global branch In the BGP model, a separate assignment of cells to assignments for each cell and is, therefore, unable to branches is performed for each gene, since they are treated independently. This can lead to potentially misleading results, as cells may be assigned to different branches for different genes. Therefore, to achieve good results, we use an informative prior based on the cell assign- ment of a global method such as Monocle [3], DPT [1]or Wishbone [2]. This gives a high prior confidence of cell assignments after the global branching point. By using this strongly informative prior, we avoid the issue of incon- sistent assignments for cells with pseudotime after the global branching point. However, cell assignments can differ for genes branching prior to the global branching time and therefore, care must be taken when interpret- Fig. 10 Mouse embryonic stem cell droplet data. Expression profiles ing the results for such genes. This is an improvement for genes appearing in the branching order network (Fig. 11)and an over methods such as BEAM that randomly assign cells example of a gene with higher branching uncertainty (Perp). The bottom panel in each gene expression denotes the posterior prior to the global branching time. In future work, we branching time over the discrete set considered. a Ccdc36 b = 0.05. plan to extend the BGP model to share the cell assignment b Actg1 b = 0.08. c Actb b = 0.08. d Krt8 b = 0.11. e Krt18 b = 0.11. across all genes, therefore avoiding such inconsistencies f Bc1 b = 0.11. g Hsp90aa1 b = 0.36. h Perp b = 0.20 and simplifying any downstream analysis. Boukouvalas et al. Genome Biology (2018) 19:65 Page 12 of 15 (a) Ccdc36 b = 0.05 (b) Actg1 b = 0.08 (c) Actb b = 0.08 (d) Krt8 b = 0.11 (e) Krt18 b = 0.11 (f) Bc1 b = 0.11 (g) Hsp90aa1 b = 0.36 (h) Perp b = 0.20 Fig. 11 Mouse embryonic stem cell droplet data. Gene network of branching times of most significantly branching genes (log Bayes factor >500). A directed edge A → B denotes that gene A branches before gene B with a 95% probability. The edge colours are used to group genes that have identical later branching genes. The horizontal placement of each gene is based on its most likely branching time We have also included in our comparison a probabilis- The probabilistic nature of the BGP model allows for tic linear method [17]. The linearity allows for an effi- additional biological insights to be gained by constructing cient joint estimation of both the pseudotime and global a gene branch order network and identifying early lineage branching structure. Although this method does not esti- priming. The former is accomplished by computing a pair- mate gene bifurcation times, a probabilistic estimate of wise gene probability that assesses the likelihood of a gene whether an individual gene exhibits branching behaviour branching before another. The latter can be inferred by is available. However, in our synthetic study, we found examining the gene network and identifying the earliest that the pseudotime estimation was not robust and this branching genes. We demonstrated this approach on both reduces the effectiveness of the method. single-cell data sets, identifying early-branching genes and The application of the BGP method to the haema- confident orderings of gene branching events. topoiesis data revealed the importance of modelling tran- Concurrent with our study, [25]usedchangepoint ker- sitory gene expression, which has the potential to confuse nels to develop similar branching GPs to identify bifur- non-probabilistic methods. The model was able to select cations in single-cell transcriptional data sets. They use automatically the most likely branching location, even in a Markov chain Monte Carlo approach to estimate cell the presence of multiple crossing points in gene expres- branch association and branching times. Their approach sion without the need for any post-processing heuristics also explicitly models recombination, in which individual such as those included in the BEAM package. branches are merged, and they can jointly estimate pseu- We also demonstrated the flexibility of our approach dotime. However, the computational complexity of their by applying it to droplet-based single-cell data using the method would make application to genome-wide infer- pseudotime and branching association derived from the ence of branching times from unlabelled data challenging DPT method [1]. As the BGP approach does not rely on a and that is the motivation for our sparse inducing point particular method to estimate pseudotime and branching variational approach. association, a modular approach is possible in which the A number of extensions of the BGP model are possible best method for a given study is used. The prior uncer- that would increase the range of possible applications. The tainty specification on the branching association allows branching kernel could be adapted to detect changepoints the BGP user to quantify the expected accuracy of the in time series. Whereas we have modelled a branch- global branching method. ing event as the intersection of three latent functions, a Boukouvalas et al. Genome Biology (2018) 19:65 Page 13 of 15 changepoint would require only two latent functions. We The resulting covariance between any two latent func- would also like to extend our model to non-Gaussian like- tions f and g constrained to cross at t is lihoods, which would more accurately describe single-cell K K ff fg data. This would increase inference complexity but could K K gg gf provide better calibrated uncertainty estimates. Another ⎛ ⎞ K(T,t )K(t ,T) p p useful extension would be to jointly infer pseudotime and K (T, T) K t ,t ( ) p p ⎝ ⎠ branching behaviour, which would also improve uncer- = .(7) K T,t K t ,T ( p) ( p ) K (T, T) tainty estimation as the uncertainty arising from the esti- K t ,t ( p p) mation of the former would be included in the posterior where K (T, T), K (T, t ) and K (t , t ) are the kernel func- p p p branching uncertainty. Extending our model to multi- tions evaluated between all training data pseudotimes T, ple branching points is straightforward from a modelling between the training data and branching point, and solely standpoint but presents a more challenging optimisation at the branching, point respectively. problem, for which a tree prior on the branching struc- In [12], only two latent functions were specified, a con- ture may prove helpful [26]. This extension would allow us trol and perturbation condition where the former spanned to address the problem of selecting the correct number of the branching point. In our modified formulation, three branches in the global cellular branching dynamics. functions are used, allowing for a discontinuity in the gra- dient between the trunk and both branch latent functions. Methods As an extension of our model, the derivatives of the latent Branching model derivation outline functions could also be constrained to intersect at the We present an overview of the probabilistic model for branch point to allow for differentiable paths. BGP. The full model description and derivation of the variational inference lower bound is given in the supple- Full GP inference mentary material (Additional file 1: Section 2). Let Y ∈ R be thedataofinterestand let M be the num- First, we define the Gaussian process kernel describ- ber of functions that are dependent. We specify a set of ing a branching structure. We then derive a lower bound latent functions F for each data point using an expanded on the model likelihood using variational inference. In representation of size M × 1where M = NM .Let Z ∈ the supplementary material (Additional file 1:Section N ×M {0, 1} be the binary indicator matrix on the expanded 2.2), we present a formulation of a sparse inducing point representation that describes the association of each data approximation that allows the application of the model point to a latent function. Each row of Z has only one to large data sets. How to perform prediction on the full non-zero entry. The model likelihood is and sparse models is also presented in the supplementary material (Additional file 1: Section 2.3). p (Y |F, Z) = N Y |ZF, σ I.(8) Branching kernel The extension to multiple independent outputs is To model the branching process, we specify a branch- straightforward as the likelihood factorises: ing kernel that constrains the latent branching functions to intersect at the branching point. We use a modified p (Y |F, Z) = N Y |ZF , σ I,(9) d d version of the kernel proposed in [12]. The trunk f and d=1 branch kernel functions g and h are constrained to cross at the branching point t . We place Gaussian process priors where Y denotes the N ×1 column vector of observations p d on all three functions and constrain them to intersect: for output d and similarly F denotes the M × 1 column vector of latent function values. We omit the multiple f ∼ GP (0, K ), output case from the derivation below for clarity. g ∼ GP (0, K ), As in [13], we place a categorical prior on the indicator (6) h ∼ GP (0, K ), matrix Z and a GP prior on the latent functions F.Note f t = g t = h t . that the latter does not factorise as in [13], as we assume p p p the latent functions are dependent: For simplicity, the same kernel is used for all three functions although it would be straightforward to extend N M [Z] nm our framework to specify different kernels for each latent p(Z) = [] , (10) n,m function. This would allow for instance, one branch to be n=1 m=1 modelled as a periodic function and the others as non- p(F) = N 0, K , (11) ( ) periodic. The extra flexibility would come at the cost where for the multinomial distribution we have of increasing the number of parameters that need to be [] = 1and K is the GP kernel . nm estimated. m=1 Boukouvalas et al. Genome Biology (2018) 19:65 Page 14 of 15 The log likelihood is not analytically tractable, as it Our bound is, therefore, involves integrating out the indicator matrix Z: log p (Y |F) ≥ L , log p (Y |F) = log p (Y , Z|F) dZ. (12) wherewehavedefined We proceed to compute a lower bound using Jensen’s 2 L  − log 2πσ − KL q(Z) || p(Z) inequality: (1) T T T T − Y Y + F AF − 2F  Y . q(Z) 2σ log p (Y |F) = log p (Y , Z|F) dZ q(Z) (19) = E log p (Y , Z|F) − E log q(Z) . q(Z) q(Z) We proceed to integrate out the latent functions F to (13) obtain the variational collapsed bound: From the mean-field assumption, the latent functions F log p(Y ) = log p (Y |F) p(F)dF are independent of the association indicators Z: (20) q (Z, F) = q(Z)q(F). (14) ≥ log exp [L ] p(F)dF. The log likelihood term is This bound holds because L is a bound to log p (Y |F) N N and the exponent function is monotonic. More details can 2 2 log N Y |ZF, σ I =− log(2π) − log σ 2 2 be found in [27]. Setting the prior on the latent function as a GP, − (Y − ZF) (Y − ZF) . 2σ log p(F) = log N (F|0, K ), and substituting (19)into(20) (15) results in the collapsed bound N 1 1 Taking the expectation with respect to the variational 2 T L  − log(2πσ ) − Y Y − log |K | distribution q(Z): 2 2σ 2 −2 −1 N N − log Aσ + K 2 2 E log N Y |ZF, σ I =− log(2π) − log σ q(Z) 2 (21) 2 2 −1 1 −4 T −2 −1 T T T T T + σ Y  Aσ + K  Y − Y Y + F AF − 2F  Y , 2σ (16) − KL q(Z) || p(Z) . wherewehavedefined We can also derive this bound when using an inducing point approximation. This allows the algorithm to scale E (Z), q(Z) up to larger data sets. The full derivation is given in the A  E Z Z , supplementary material (Additional file 1: Section 2). q(Z) and the variational approximation is Endnotes n,m Time measured on a MacBook Pro with 2.2 GHz quad- q(Z) =  . (17) n,m n,m core Intel Core i7 and 16GB of DDR3L onboard memory. This expanded representation allows for efficient This encodes the mean-field assumption as we assume the posterior indicators factorise. recomputation of the marginal likelihood for different The second-order expectation for A is branching times. For simplicity, we assume the same kernel for every A = E z z i,j q(Z) n,i n,j output and latent trajectory function. Removing this restriction does not affect the derivation but will increase =  δ , (18) the inference complexity. n,i i,j where z the N × 1 indicator vector for latent function Additional file i = m. The KL divergence term is computable as Additional file 1: Supplementary material for BGP: identifying gene-specific branching dynamics from single-cell data with a branching n,m Gaussian process. This also contains additional results for both the KL q(Z) || p(Z) =  log . n,m synthetic and single-cell data sets. (PDF 1206 kb) [] n,m n,m Boukouvalas et al. Genome Biology (2018) 19:65 Page 15 of 15 Acknowledgments 9. Reid JE, Wernisch L. Pseudotime estimation: deconfounding single cell We wish to thank Xiaojie Qiu for his help with the BEAM Monocle 2 code and time series. Bioinformatics. 2016;32(19):2973–80. Kieran Campbell with running the MFA model. 10. Ahmed S, Rattray M, Boukouvalas A. GrandPrix: scaling up the Bayesian GPLVM for single-cell data. 2017. bioarxiv preprint: https://doi.org/10. Funding 1101/227843. MR and AB were supported by Medical Research Council award 11. Lönnberg T, Svensson V, James KR, Fernandez-Ruiz D, Sebina I, MR/M008908/1. JH was supported by Medical Research Council fellowship Montandon R, Soon MSF, Fogg LG, Nair AS, Liligeto UN, Stubbington MR/K022016/2. MJT, Ly L-H, Bagger FO, Zwiessele M, Lawrence ND, Souza-Fonseca-Guimaraes F, Bunn PT, Engwerda CR, Heath WR, Billker Availability of data and materials O, Stegle O, Haque A, Teichmann SA. Single-cell RNA-Seq and An open source Python implementation of BGP is available at https://github. computational analysis using temporal mixture modelling resolves com/ManchesterBioinference/BranchedGP [28], released under the Apache th1/tfh fate bifurcation in malaria. Sci Immunol. 2017;2(9). http:// 2.0 licence. Version 1.1 with DOI https://doi.org/10.5281/zenodo.1235962 was immunology.sciencemag.org/content/2/9/eaal2192. used to generate the results of the paper. Documentation and examples are 12. Yang J, Penfold CA, Grant MR, Rattray M. Inferring the perturbation time available on the GitHub page. from biological time course data. Bioinformatics. 2016;32(19):2956–64. All data used in the paper have previously been published and are freely 13. Lázaro-Gredilla M, Van Vaerenbergh S, Lawrence ND. Overlapping available. The haematopoiesis data were previously analysed in [2]and are mixtures of Gaussian processes for the data association problem. Pattern available from https://github.com/ManuSetty/wishbone. The dropseq data Recognit. 2012;45(4):1386–95. were previously analysed in [1] and are available from https://www.helmholtz- 14. Quiñonero-Candela J, Rasmussen CE. A unifying view of sparse muenchen.de/icb/research/groups/machine-learning/projects/dpt/index. approximate Gaussian process regression. J Mach Learn Res. 2005;6(Dec): html. 1939–59. 15. de Garis Matthews AG. Scalable Gaussian process inference using Authors’ contributions variational methods. Department of Engineering, University of AB, JH and MR designed the algorithm. AB and JH implemented the code and Cambridge; 2016. AB performed the numerical experiments and analysis. AB and MR wrote the 16. Matthews AGdG, van der Wilk M, Nickson T, Fujii K, Boukouvalas A, paper. All authors read and approved the final manuscript. León-Villagrá P, et al. GPflow: a Gaussian process library using Tensorflow. J Mach Learn Res. 2017;18(40):1–6. Ethics approval and consent to participate 17. Campbell K, Yau C. Probabilistic modeling of bifurcations in single-cell Not applicable. gene expression data using a Bayesian mixture of factor analyzers. Wellcome Open Res. 2017;2–19. https://doi.org/10.12688/ wellcomeopenres.11087.1. Competing interests 18. Paul F, Arkin Y, Giladi A, Jaitin DA, Kenigsberg E, Keren-Shaul H, et al. The authors declare that they have no competing interests. Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell. 2015;163(7):1663–77. Publisher’s Note 19. Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, et al. Springer Nature remains neutral with regard to jurisdictional claims in Droplet barcoding for single-cell transcriptomics applied to embryonic published maps and institutional affiliations. stem cells. Cell. 2015;161(5):1187–201. 20. Saelens W, Cannoodt R, Todorov H, Saeys Y. A comparison of single-cell Author details trajectory inference methods: towards more accurate and robust tools. Division of Informatics, Imaging and Data Sciences, Faculty of Biology, 2018276907. bioarxiv preprint: https://doi.org/10.1101/276907. Medicine and Health, University of Manchester, Oxford Road, Manchester, UK. 21. Zappia L, Phipson B, Oshlack A. Splatter: simulation of single-cell RNA Prowler.io, Cambridge, UK. sequencing data. Genome Biol. 2017;18(1):174. 22. Velten L, Haas SF, Raffel S, Blaszkiewicz S, Islam S, Hennig BP, et al. Received: 30 October 2017 Accepted: 1 May 2018 Human haematopoietic stem cell lineage commitment is a continuous process. Nat Cell Biol. 2017;19(4):271–81. 23. Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, et al. Massively parallel digital transcriptional profiling of single cells. Nat References Commun. 2017;8:14049. 1. Haghverdi L, Buettner M, Wolf FA, Buettner F, Theis FJ. Diffusion 24. Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene pseudotime robustly reconstructs lineage branching. Nat Methods. expression data analysis. Genome Biology. 19(1):15. https://doi.org/10. 2016;13(10):845–8. 1186/s13059-017-1382-0. 2. Setty M, Tadmor MD, Reich-Zeliger S, Angel O, Salame TM, Kathail P, 25. Penfold CA, Sybirna A, Reid J, Huang Y, Wernisch L, Ghahramani Z, et al. et al. Wishbone identifies bifurcating developmental trajectories from Nonparametric Bayesian inference of transcriptional branching and single-cell data. Nat Biotechnol. 2016;34(6):637–45. recombination identifies regulators of early human germ cell 3. Qiu X, Mao Q, Tang Y, Wang L, Chawla R, Pliner HA, et al. Reversed development. 2017. bioarxiv preprint: https://doi.org/10.1101/167684. graph embedding resolves complex single-cell trajectories. Nat Methods. 26. Simek K, Palanivelu R, Barnard K. Branching Gaussian processes with 2017;14(10):979. applications to spatiotemporal reconstruction of 3d trees. In: Leibe B, 4. Street K, Risso D, Fletcher RB, Das D, Ngai J, Yosef N, et al. Slingshot: cell Matas J, Sebe N, Welling M, editors. Computer Vision – ECCV 2016: 14th lineage and pseudotime inference for single-cell transcriptomics. 2017. European Conference, Amsterdam, The Netherlands, October 11–14, bioarxiv preprint: https://doi.org/10.1101/128843. 2016, Proceedings, Part VIII. New York: Springer International Publishing; 5. Rasmussen CE, Williams CKI. Gaussian processes for machine learning. 2016. p. 177–93. https://doi.org/10.1007/978-3-319-46484-8_11. London: The MIT Press; 2006. 27. King NJ, Lawrence ND. Fast variational inference for Gaussian process 6. Buettner F, Theis FJ. A novel approach for resolving differences in models through KL-correction. In: European Conference on Machine single-cell gene expression patterns from zygote to blastocyst. Learning. New York: Springer International Publishing; 2006. p. 270–81. Bioinformatics. 2012;28(18):i626–32. 28. Boukouvalas A, Hensman J, Magnus R. BGP: identifying gene-specific 7. Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, branching dynamics from single cell data with a branching Gaussian et al. Computational analysis of cell-to-cell heterogeneity in single-cell process. 2018. https://github.com/ManchesterBioinference/BranchedGP. RNA-sequencing data reveals hidden subpopulations of cells. Nat Biotechnol. 2015;33(2):155–60. 8. Campbell KR, Yau C. Order under uncertainty: robust differential expression analysis using probabilistic models for pseudotime inference. PLOS Comput Biol. 2016;12(11):1–20.

Journal

Genome BiologySpringer Journals

Published: May 29, 2018

References