Transformation and model choice for RNA-seq co-expression analysis

Transformation and model choice for RNA-seq co-expression analysis Abstract Although a large number of clustering algorithms have been proposed to identify groups of co-expressed genes from microarray data, the question of if and how such methods may be applied to RNA sequencing (RNA-seq) data remains unaddressed. In this work, we investigate the use of data transformations in conjunction with Gaussian mixture models for RNA-seq co-expression analyses, as well as a penalized model selection criterion to select both an appropriate transformation and number of clusters present in the data. This approach has the advantage of accounting for per-cluster correlation structures among samples, which can be strong in RNA-seq data. In addition, it provides a rigorous statistical framework for parameter estimation, an objective assessment of data transformations and number of clusters and the possibility of performing diagnostic checks on the quality and homogeneity of the identified clusters. We analyze four varied RNA-seq data sets to illustrate the use of transformations and model selection in conjunction with Gaussian mixture models. Finally, we propose a Bioconductor package coseq (co-expression of RNA-seq data) to facilitate implementation and visualization of the recommended RNA-seq co-expression analyses. RNA-seq, co-expression, mixture models, data transformation Introduction Increasingly complex studies of transcriptome dynamics are now routinely carried out using high-throughput sequencing of reverse-transcribed RNA molecules [i.e. complementary DNA (cDNA) molecules], called RNA sequencing (RNA-seq). By quantifying and comparing transcriptomes among different types of tissues, developmental stages or experimental conditions, researchers have gained a deeper understanding of how changes in transcriptional activity reflect specific cell types and contribute to phenotypic differences. Identifying groups of co-expressed genes may help target gene modules that are involved in similar biological processes [1, 2] or that are candidates for co-regulation. Thus, by identifying clusters of co-expressed genes, we aim both to identify co-regulated genes and to characterize potential biological functions for orphan genes (namely, those whose biological function is unknown). A great deal of clustering algorithms has been proposed for microarray data, raising the question of their applicability to RNA-seq data. In particular, after normalization, background correction and  log⁡2 transformation of microarray data, hybridization intensities are typically modeled by Gaussian distributions [3]. RNA-seq data, on the other hand, are made up of read counts [4, 5] or pseudocounts [6, 7] for each biological entity or feature (e.g. a gene) after either alignment to a genome reference sequence or de novo assembly. These data are characterized by (1) highly skewed values with a large dynamic range, often covering several orders of magnitude; (2) positive correlation between feature size (e.g. gene length) and read counts [8]; and (3) variable sequencing depth (i.e. library size) and coverage among experiments [9]. The presence of overdispersion (i.e. variance larger than the mean) among biological replicates for a given feature is also a typical feature of these data, leading to the use of negative binomial models [10, 11] for RNA-seq differential analyses. Statistically speaking, the goal of clustering approaches is to discover structures (clusters) within data. Many clustering methods exist and roughly fall into two categories: (1) methods based on dissimilarity distances, including tree-based hierarchical clustering [12] as well as methods like the K-means algorithm [13]; and (2) model-based methods [14], which consist of defining a clustering model and optimizing the fit between the data and the model. For the latter class of models, each cluster is represented by a distinct parametric distribution, and the entire data set is thus modeled as a mixture of these distributions; a notable advantage of model-based clustering is that it provides a rigorous framework to assess the appropriate number of clusters and the quality of clusters obtained. Presently, most proposals for clustering RNA-seq data have focused on the question of grouping biological samples rather than features, for example using hierarchical clustering with a modified loglikelihood ratio statistic based on a Poisson loglinear model as a distance measure [15] or the Euclidean distance of samples following a variance-stabilizing transformation (VST) [16]. In recent work [17], we proposed the use of Poisson mixture models to cluster RNA-seq expression profiles. This method has the advantage of directly modeling the count nature of RNA-seq data, accounting for variable library sizes among experiments and providing easily interpretable clusterings based on the profiles of variation around average expression of each gene. However, there are several serious limitations to this approach: (1) the assumption of conditional independence among samples, given the clustering group, is likely to be unrealistic for the vast majority of RNA-seq data sets; (2) per-cluster correlation structures cannot be included in the model; and (3) the Poisson distribution is likely overly restrictive, as it imposes an assumption of equal means and variances. In addition, classical asymptotic model selection criteria, such as the Bayesian information criterion (BIC) [18] and integrated completed likelihood (ICL) criterion [19], were observed to have poor behavior for the Poisson mixture model in many cases. As such, Rau et al. [17] proposed the use of a non-asymptotic penalized model selection criterion calibrated by the slope heuristics (SH) [20, 21], requiring a collection of mixture models to be fit for a wide range of cluster numbers K; for large K, this can imply significant computational time as well as practical difficulties for parameter initialization and estimation. We note that a related approach based on a hybrid-hierarchical clustering of negative binomial mixtures was proposed by Si et al. [22]; as with the work of Rau et al. [17], this method cannot account for correlation structures among samples. To address the aforementioned limitations of the Poisson mixture model, in this work, we investigate appropriate transformations to facilitate the use of Gaussian mixture models for RNA-seq co-expression analysis. This strategy has the notable advantage of enabling the estimation of per-cluster correlation structures, as well as drawing on the extensive theoretical justifications of Gaussian mixture models [14]. We note that Law et al. [23] used a related strategy for the differential analyses of RNA-seq data by transforming data, estimating precision weights for each feature and using the limma empirical Bayes analysis pipeline [24]. The identification of an ‘appropriate’ transformation for RNA-seq co-expression is not necessarily straightforward, and depends strongly on the desired interpretability of the resulting clusters as well as the model assumptions. Several transformations of read counts or pseudocounts have been proposed in the context of exploratory or differential analyses, but most largely seek to render the data homoskedastic or to reduce skewness. In this work, rather than grouping together genes with similar absolute (transformed) read abundances, we propose the use of normalized expression profiles for each feature, that is the proportion of normalized counts observed for a given feature. Owing to the compositional nature of these profiles (i.e. the sum for each feature equals 1), an additional transformation is required before fitting the Gaussian mixture model, as discussed below. The remainder of the article is organized as follows. In the Methods section, we introduce some notation, discuss appropriate data transformation for RNA-seq co-expression analyses and briefly review Gaussian mixture models, including parameter estimation and model selection. In the Results section, we describe several RNA-seq data sets and illustrate co-expression analyses on each using Gaussian mixture models on transformed data using the coseq R package. Finally, in the Discussion we provide some concluding remarks and recommendations for RNA-seq co-expression analyses in practice, as well as some opportunities for future work. Methods For the remainder of this work, let Yij be a random variable, with corresponding observed value yij, representing the raw read count (or pseudocount) for biological entity i ( i=1,…,n) of biological sample j ( j=1,…,q). For simplicity, in this work, we typically refer to the entities i as genes, although the generality of the following discussion holds for other entities of interest (exons, etc.). Each sample is typically associated with one or more experimental conditions (e.g. tissue, treatment, and time); to reflect this, let C(j) correspond to the experimental group for sample j. Finally, let y be the (n×q) matrix of read counts for all genes and samples, and let yi be the q-dimensional vector of raw count values across all biological samples for gene i. In the following, we use dot notation to indicate summations over a particular index, e.g. yi·=∑jyij. Data transformations for RNA-seq co-expression A feature common to many RNA-seq data transformations is the incorporation of sample-specific normalization factors, often referred to as library size normalization. These normalization factors account for the fact that the number of reads expected to map to a particular gene depends not only on its own expression level, but also (1) on the total number of mapped reads (also referred to as library size) in the sample, and (2) on the overall composition of the RNA population being sampled. Although several library size normalization factors have been proposed since the introduction of RNA-seq, the median ratio [11] and trimmed mean of M-values normalization (TMM; [25]) methods have been found to be robust and effective, and are now widely used [26] in the context of differential analysis. Without loss of generality, we note t=(tj) as the scaling normalization factors for raw library sizes calculated using the TMM normalization method; ℓj=y·jtj is then the corresponding normalized library size for sample j, and sj=ℓj∑m=1qℓm/q (1) is the associated normalization scaling factor by which raw counts yij are divided. Several data transformations have been suggested for RNA-seq data, most often in the context of exploratory or differential analyses. These include a  log⁡2 transformation (where a small constant is typically added to read counts to avoid 0’s), a VST [16, 27, 28], moderated log counts per million (CPM; [23]) and a regularized log (rlog) transformation [11]; see the Supplementary Materials for more details about each. As previously noted, each of these transformations was proposed with the objective of rendering the data homoskedastic (in the case of the VST or rlog transformations) or to reduce the orders of magnitude spanned by untransformed RNA-seq data. Rather than making use of these transformations, we propose calculating the normalized expression profiles for each feature, that is the proportion of normalized reads observed for gene i with respect to the total observed for gene i across all samples: pij=yij/sj+1∑ℓyiℓ/sℓ+1, where sj are the scaling normalization factors for raw counts [Equation (1)]. To illustrate the interest of using these normalized expression profiles for co-expression analysis, we plot yij/sj,  log ⁡(yij/sj+1), and the normalized expression profiles pij in Figure 1 for a subset of genes from the mouse RNA-seq data studied by Fietz et al. [29]. In particular, we consider 10 genes that are most representative (as measured by Euclidean distance) of four distinct groups: non-differentially expressed (NDE) genes across all samples (Group 1); genes expressed only in the last experimental condition (samples 11–15, Group 2); genes expressed only in the first experimental condition (samples 1–5, Group 3); and genes expressed only in the second experimental condition (samples 6–10, Group 4). It may clearly be seen that the large differences in magnitude that are dominant for normalized counts (Figure 1A) are greatly reduced by a log transformation (Figure 1B), although a certain amount of spread remains between very highly and weakly expressed genes. This spread can be notably reduced by considering the normalized expression profiles pij (Figure 1C). This example is thus instructive in illustrating the importance in co-expression analyses of considering a measure that is independent of the absolute expression level of the genes, as is the case for the normalized profiles. Figure 1 View largeDownload slide Normalized counts (A), log normalized counts  +1 (B) and normalized expression profiles pij (C) for a subset of the Fietz et al. [29] mouse RNA-seq data. The subset of genes includes NDE genes across all samples (Group 1); genes expressed only in the last experimental condition (samples 11–15, Group 2); genes expressed only in the first experimental condition (samples 1–5, Group 3); and genes expressed only in the second experimental condition (samples 6–10, Group 4). Transparent gray boxes delimit the replicates in each of the three experimental groups. Figure 1 View largeDownload slide Normalized counts (A), log normalized counts  +1 (B) and normalized expression profiles pij (C) for a subset of the Fietz et al. [29] mouse RNA-seq data. The subset of genes includes NDE genes across all samples (Group 1); genes expressed only in the last experimental condition (samples 11–15, Group 2); genes expressed only in the first experimental condition (samples 1–5, Group 3); and genes expressed only in the second experimental condition (samples 6–10, Group 4). Transparent gray boxes delimit the replicates in each of the three experimental groups. It is important to note that the profile for gene i, pi=(pij), represents compositional data [30], as it is a q-tuple of nonnegative numbers whose sum is 1. This means that the vector of values pi is linearly dependent, which imposes constraints on the covariance matrices Σk that are problematic for the general Gaussian mixture model (and indeed for most standard statistical approaches). For this reason, we consider two separate transformations of the profiles pij to break the sum constraint, the logit and the arcsine (also referred to as the arcsine square root, or angular) transformations: garcsin(pij)=arcsin(pij)∈[0,π/2], and  (2) glogit(pij)=log2(pij1−pij)∈(−∞,∞). (3) Over a broad range of intermediate values of the proportions, the logit and arcsine transformations are roughly linearly related to one another (Supplementary Figure S1A in the Supplementary Materials). However, although both transformations tend to pull out the ends of the distribution of pij values, this effect is more marked for the logit transformation, meaning that it is more affected by smaller differences at the ends of the scale (Supplementary Figure S1B). Gaussian mixture models Model-based clustering consists of assuming that the expression data come from several separately modeled subpopulations, where the full population of genes is a mixture of these subpopulations. Thus, observations are assumed to be a sample from an unknown probability distribution with density f, which is estimated by a finite mixture: f(.|θK)=∑Kk=1πkfk(.|αk), where θK=(π,α1,…,αK) and π=(π1,…,πK) are the mixing proportions, with πk∈(0,1) for all k and ∑k=1Kπk=1. The density fk(.|αk) of the kth subpopulation must be chosen according to the nature of the gene expression measures; in the following, we consider the special case of Gaussian mixture models. A collection of Gaussian mixture models can be defined as (Sm)m∈M=(S(K,v))(K,v)∈M, where S(K,v)={f(.|θ(K,v))=∑Kk=1πk,vϕ(.|μk,Σk,v)}, (4) with ϕ(.|μk,Σk,v) denoting the q-dimensional Gaussian density with mean μk and covariance matrix Σk,v. The index v denotes one of the Gaussian mixture shapes obtained by constraining one or more of the parameters in the following decomposition of each mixture component variance matrix: Σk=λkDk′AkDk, (5) where λk=|Σk|1/q, Dk is the eigenvector matrix of Σk, and Ak is the diagonal matrix of normalized eigenvalues of Σk. Various constraints on these parameters, respectively, control the volume, orientation and shape of the kth cluster [31]; by additionally allowing the proportions πk to vary according to cluster or be equal for all clusters, we may define a collection of 28 parsimonious and interpretable mixture models, available in the Rmixmod R package [32]. Without loss of generality, for simplicity of notation, we will consider here only the most general model form, with variable proportions, volume, orientation and shape (referred to as the [pkLkCk] in Rmixmod); as such, the model collection is defined solely over a range of numbers of clusters, (SK)K∈M. The parameters of each model SK in the collection defined in [Equation (4)] may be estimated using an expectation–maximization-type algorithm [33]. After solving the density estimation problem, for each model in the collection f is estimated by f^K=f(.|θ^K), and the data are clustered using the maximum a posteriori rule: for each i=1,…,n and each k=1,…,K, z^ik={1 if τik(θ^K)>τiℓ(θ^K) ∀ℓ≠k0 otherwise ., where τik(θ^K) is the conditional probability that observation i arises from the kth component mixture f(.|θ^K). Model choice for RNA-seq co-expression In the mixture model framework, the number of clusters K is typically chosen from the model collection using a penalized selection criterion such as the BIC, [18], ICL [19] or a non-asymptotic penalized criterion whose penalty is calibrated using the SH principle [34]: BIC(K)=−ℓ(.|θ^K)+νK2nln⁡(n). (6) ICL(K)=−ℓ(.|θ^K)+νK2nln⁡(n)−1n∑ni=1∑Kk=1τik(θ^K)ln⁡(τik(θ^K)) =−ℓ(.|θ^K)+νK2nln⁡(n)+entropy. (7) SH (K)=−ℓ(.|θ^K)+κνK, (8) where ℓ(.|θ^K) is the loglikelihood evaluated at θ^K (the maximum likelihood estimate of θK), νK represents the number of free parameters for the mixtures in model SK, n is the number of observations, κ is a multiplicative constant that must be calibrated using the capushe R package [21] and τik(θ^K) is as defined in the previous section. The selected model K^ corresponds to the number of clusters K that minimizes the chosen criterion among Equations (6–8). Although rarely done in practice, penalized criteria like the BIC and ICL may also be used to select among different models or transformations, as was suggested in a different context by Thomas et al. [35] and more recently for RNA-seq data by Gallopin [36]. This is of great interest, as it removes the need for an arbitrary choice of data transformation by using the framework of formal model selection. We illustrate this principle for the choice of number of clusters K and data transformation; in a more general case, a similar procedure could be used to additionally choose among the different forms of Gaussian mixture models described in Equation (5) or among different parametric forms of models. Let g(x) represent an arbitrary monotonic transformation of a data set x. If the new sample g(x) is assumed to arise from an i.i.d. Gaussian mixture density, f(.|θK), then the initial data x is an i.i.d. sample from density fg(.|θK), which is a transformation of f(.|θK) and thus not necessarily a Gaussian mixture density. If Jg denotes the Jacobian of the transformation g and θ^(K,g) the maximum likelihood estimate obtained for the model with K clusters and transformation g, we select the pair (K, g) leading to the minimum of the corrected BIC or ICL criteria: BIC*(K,g)=−ℓ(.|θ^(K,g))+νK2nln⁡(n)−ln⁡[det⁡(Jg)], ICL*(K,g)=−ℓ(.|θ^(K,g))+νK2nln⁡(n)+entropy−ln⁡[det⁡(Jg)]. (9) Note that in these expressions, the number of parameters νK does not depend on the transformation g. For the purposes of this work, we make use of the corrected ICL criterion defined in Equation (9) to compare between the logit and arcsine transformations in Equations (2) and (3) applied to the expression profiles p=(pij). In particular, we use the following: ICLarcsin*(K,garcsin)=−ℓ(.|θ^(K,garcsin))+νK2nln⁡(n)+entropy+nqln⁡(2)+12∑ni=1∑qj=1ln⁡[pij(1−pij)],  (10) ICLlogit*(K,glogit)=−ℓ(.|θ^(K,glogit))+νK2nln⁡(n)+entropy+nqln⁡[ln⁡(2)]+∑ni=1∑qj=1ln⁡[pij(1−pij)]. (11) The values of ICLarcsin*(K,garcsin) and ICLlogit*(K,glogit) can thus be directly compared to choose between the two transformations. coseq Bioconductor package To facilitate co-expression analyses of RNA-seq data using Gaussian mixture models and an appropriate data transformation, we have created the R package coseq (co-expression of RNA-seq data), freely available on Bioconductor; in this section, we briefly describe some of the options available in this package. The package is first installed and loaded using the following commands: > source(“https://bioconductor.org/biocLite.R”) > biocLite(“coseq”) > library(coseq) A typical call to coseq to fit a Gaussian mixture model on arcsine- or logit-transformed normalized profiles takes the following form: > run_arcsin <- coseq(counts, K = 2:10, model=“Normal”, transformation=“arcsin”) > run_logit <- coseq(counts, K = 2:10, model=“Normal”, transformation=“logit”) where counts represent a (n×q) matrix or data frame of read counts for n genes in q samples and K=2:10 provides the desired range of numbers of clusters (here, 2–10). We note that this function directly calls the Rmixmod R package to fit Gaussian mixture models [32]. For backward compatibility with our previous method [17], a similar function call may be used to fit a Poisson mixture model on raw counts using the HTSCluster package: > run_pois <- coseq(counts, conds, K = 2:10, model= “Poisson”) where a vector conds is additionally provided to identify the experimental condition associated with each column in counts. In both cases, the output of the coseq function is an S4 object of class coseqResults (an extension of the RangedSummarizedExperiment Bioconductor S4 class) on which standard plot and summary functions can be directly applied; the former uses functionalities from the ggplot2 package [37]. Several examples of the standard plot commands can be seen in the Results section of this work, as well as in the reproducible Rmarkdown document included in the Supplementary Materials. The option of parallelization via the BiocParallel Bioconductor package is also provided. In addition to the choice of mixture model and transformation to be used, the coseq function provides flexibility to the user to filter normalized read counts according to their mean value if desired, specify library size normalization method (TMM, median ratio, upper quartile or user-provided normalization factors) and modify Rmixmod options (number of iterations, etc.). For the specific case of arcsine- and logit-transformed normalized profiles, we provide a convenience function compareICL to calculate and plot the corrected ICL model selection criteria defined in Equations (10) and (11). Finally, as RNA-seq expression analyses are often performed on a subset of genes identified as differentially expressed, the coseq function can also be directly called on an DESeqResults S4 object or integrated with DGELRT S4 objects, respectively, corresponding to output from the DESeq2 [11] and edgeR [10] Bioconductor packages for RNA-seq differential analyses. For more details and examples, see the full package vignette provided with coseq. Results In the following, we illustrate co-expression analyses using Gaussian mixture models in conjunction with the proposed transformations on normalized expression profiles for several RNA-seq data sets. The data were selected to represent several different organisms (pig, mouse, human and fly) in studies for which co-expression is of particular interest (across tissues or across time); additional details on how data were obtained and preprocessed may be found in the Supplementary Materials. Description of RNA-seq data Porcine small intestine: Mach et al. [38] used RNA-seq to study site-specific gene expression along the gastrointestinal tract of four healthy 70-day-old male large white piglets. Samples were collected in three sites along the proximal–distal axis of the small intestine (duodenum, jejunum and ileum), as well as the ileal Peyer’s patch (a lymphoid tissue localized in direct contact with the epithelial intestinal tissue). Complete information regarding sample preparation, sequencing, quality control and preprocessing is available in the original article [38]. Raw reads are available at NCBI’s Short Read Archive (SRA) repository (PRJNA221286 BioProject; accessions SRR1006118–SRR1006133); in the current work, read counts for genes sharing a common gene symbol or Ensembl gene ID were summed. Embryonic mouse neocortex: Fietz et al. [29] studied the expansion of the neocortex in five embryonic (day 14.5) mice by analyzing the transcriptome of the ventricular zone (VZ), subventricular zone (SVZ) and cortical plate (CP) using RNA-seq. Laser capture microdissection, RNA isolation and cDNA library preparation and RNA sequencing and quantification are described in the Supplementary Materials of Fietz et al. [29]. In our work, raw read counts for this study were downloaded on 23 December 2015 from the Digital Expression Explorer (DEE) [39] using associated SRA accession number SRP013825, and run information was downloaded using the SRA Run Selector. Additional information about the DEE processing pipeline may be found in the Supplementary Materials. Fetal human neocortex: In the aforementioned study, Fietz et al. [29] also included samples from six (13–16-week postconception) human fetuses taken from four neocortex regions: CP, VZ and inner and outer subventricular zone. Raw counts were obtained in the same manner as described above. Dynamic expression in embryonic flies: As part of the modENCODE project to annotate functional elements of the Drosophila melanogaster genome, Graveley et al. [40] characterized the expression dynamics of the fly using RNA-seq over 27 distinct stages of development, from early embryo to ageing male and female adults. As in our previous co-expression work [17], we focus on a subset of these data from 12 embryonic samples that were collected at 2 h intervals for 24 h, with one biological replicate for each time point. Phenotype tables and raw read counts were obtained from the ReCount online resource [41]. Results on RNA-seq data We used the coseq package described in the previous section to fit Gaussian mixture models to the arcsine- and logit-transformed normalized profiles for each of the four data sets described above for K=2,…,40 clusters (with the exception of the D. melanogaster data, for which a maximum value of K = 60 was used), using the TMM library size normalization, filtering genes with mean normalized count <50 and otherwise using default values for parameters. Concerning the filtering step, screening using either a differential analysis or a threshold on normalized means or coefficients of variation is often applied in practice before co-expression analyses to remove features that contribute noise. We note that for some studies (e.g. those in which some genes may be completely switched off in some conditions), a less stringent filtering threshold may be desired; in such cases, the meanFilterCutoff argument of coseq may be omitted (corresponding to no filter) or set to a smaller value. In all cases, we calculated the corrected ICL values from Equations (10) and (11) to compare between the arcsine and logit transformations; the number of clusters K^ identified for each transformation and the preferred model–transformation pair chosen via the corrected ICL, are shown for each data set in Table 1. The corrected ICL values across a range of numbers of clusters K are shown in Figure 2 for the Graveley et al. [40] fly and Fietz et al. [29] mouse data; for clarity, we focus our discussion in the main text on these two data sets, but complete and reproducible results (in the form of an Rmarkdown document) for all four RNA-seq data sets may be found in the Supplementary Materials. Table 1 Summary of results for Gaussian mixture models fit on transformed normalized RNA-seq profiles Organism Reference Conditions (replicates) K^arcsin K^logit Pig Mach et al. [38] Four tissues (four replicates each) 14 11 Mouse Fietz et al. [29] Three tissues (five replicates each) 12 15 Human Fietz et al. [29] Four tissues (six replicates each) 6 8 Fly Graveley et al. [40] 12 time points (one replicate each) 28 23 Organism Reference Conditions (replicates) K^arcsin K^logit Pig Mach et al. [38] Four tissues (four replicates each) 14 11 Mouse Fietz et al. [29] Three tissues (five replicates each) 12 15 Human Fietz et al. [29] Four tissues (six replicates each) 6 8 Fly Graveley et al. [40] 12 time points (one replicate each) 28 23 Note. For each data set, the organism, associated reference, experimental conditions C(j) and number of biological replicates in each, and number of clusters selected via ICL for the arcsine and logit transformation are provided. Boldface values indicated the final model selected via the corrected ICL. Table 1 Summary of results for Gaussian mixture models fit on transformed normalized RNA-seq profiles Organism Reference Conditions (replicates) K^arcsin K^logit Pig Mach et al. [38] Four tissues (four replicates each) 14 11 Mouse Fietz et al. [29] Three tissues (five replicates each) 12 15 Human Fietz et al. [29] Four tissues (six replicates each) 6 8 Fly Graveley et al. [40] 12 time points (one replicate each) 28 23 Organism Reference Conditions (replicates) K^arcsin K^logit Pig Mach et al. [38] Four tissues (four replicates each) 14 11 Mouse Fietz et al. [29] Three tissues (five replicates each) 12 15 Human Fietz et al. [29] Four tissues (six replicates each) 6 8 Fly Graveley et al. [40] 12 time points (one replicate each) 28 23 Note. For each data set, the organism, associated reference, experimental conditions C(j) and number of biological replicates in each, and number of clusters selected via ICL for the arcsine and logit transformation are provided. Boldface values indicated the final model selected via the corrected ICL. Figure 2 View largeDownload slide Corrected ICL values for the arcsine-transformed and logit-transformed normalized expression profiles over a range of numbers of clusters K for the Graveley et al. [40] fly and Fietz et al. [29] mouse data (A and B, respectively). Figure 2 View largeDownload slide Corrected ICL values for the arcsine-transformed and logit-transformed normalized expression profiles over a range of numbers of clusters K for the Graveley et al. [40] fly and Fietz et al. [29] mouse data (A and B, respectively). For the remainder of the article, the results presented correspond to the model selected via the corrected ICL. It is of interest to investigate the per-cluster covariance structures estimated for the selected models for each of the RNA-seq data sets. As an example, the per-cluster correlation matrices estimated by coseq for two selected clusters from the Graveley et al. [40] and Fietz et al. [29] mouse data are shown in Figure 3. It is interesting to note that although the Gaussian mixture model does not explicitly incorporate the experimental condition labels C(j), the estimated models include large cluster-specific correlations among close time points (Figure 3A and B) or among replicates within each tissue (Figure 3C and D). In addition, cluster-specific correlation structures among regions may be clearly seen; for example, in the Fietz et al. [29] mouse data, Cluster 2 is characterized by large negative correlations between the CP and SVZ/VZ regions, while Cluster 3 instead has a strong negative correlation between the VZ and CP/SVZ regions. This strongly suggests that in these data, the assumption of conditional independence among samples assumed by the Poisson mixture model described in Rau et al. [17] is indeed unrealistic. Figure 3 View largeDownload slide Per-cluster correlation matrices for Clusters 25 (A) and 27 (B) from the Graveley et al. [40] fly data and for Clusters 2 (C) and 3 (D) from the Fietz et al. [29] mouse data. Right- and left-leaning ellipses represent correlations close to 1 and −1, respectively, and ellipse areas correspond to the absolute value of correlation coefficients. Correlation matrices are visualized using the corrplot R package. Figure 3 View largeDownload slide Per-cluster correlation matrices for Clusters 25 (A) and 27 (B) from the Graveley et al. [40] fly data and for Clusters 2 (C) and 3 (D) from the Fietz et al. [29] mouse data. Right- and left-leaning ellipses represent correlations close to 1 and −1, respectively, and ellipse areas correspond to the absolute value of correlation coefficients. Correlation matrices are visualized using the corrplot R package. There are several ways in which per-cluster expression profiles can be represented graphically, depending on the type of data plotted (normalized counts, normalized expression profiles or transformed normalized profiles), the type of plot (e.g. line plots or boxplots) and whether replicates within experimental conditions are averaged or plotted independently. Regarding the latter point, note that the Gaussian mixture model is fit on the entirety of the data, and replicate averaging is proposed to simplify the visualization of cluster-specific expression. Although the coseq package facilitates the implementation of any combination of these three graphical options, our recommendations for visualizing co-expression results are as follows: (1) although ‘tighter’ profiles are observed when plotting the transformed normalized profiles (as these are the data used to fit the model), interpretation of profiles is improved by instead using the untransformed normalized profiles; (2) boxplots are generally preferable when experimental conditions C(j) represent distinct groups, although line plots can be useful for time-course experiments; (3) averaging replicates before plotting often provides clearer distinctions among cluster-specific profiles. Following these recommendations, the cluster-specific profiles identified for the Graveley et al. [40] and Fietz et al. [29] mouse data are shown in Figures 4 and 5. Figure 4 View largeDownload slide Per-cluster expression profiles for the Graveley et al. [40] data. Clusters have been sorted, so that those with similar mean vectors (as measured by the Euclidean distance) are plotted next to one another. Lines correspond to genes with maximum conditional probability τmax⁡(i) of cluster membership >0.8. Figure 4 View largeDownload slide Per-cluster expression profiles for the Graveley et al. [40] data. Clusters have been sorted, so that those with similar mean vectors (as measured by the Euclidean distance) are plotted next to one another. Lines correspond to genes with maximum conditional probability τmax⁡(i) of cluster membership >0.8. Figure 5 View largeDownload slide Per-cluster expression profiles for the Fietz et al. [29] data. Clusters have been sorted, so that those with similar mean vectors (as measured by the Euclidean distance) are plotted next to one another. Connected lines correspond to the mean expression profile for each group. Figure 5 View largeDownload slide Per-cluster expression profiles for the Fietz et al. [29] data. Clusters have been sorted, so that those with similar mean vectors (as measured by the Euclidean distance) are plotted next to one another. Connected lines correspond to the mean expression profile for each group. An additional advantage of model-based clustering approaches is that they facilitate an evaluation of the clustering quality of the selected model by examining the maximum conditional probabilities of cluster membership for each gene: τmax⁡(i)=max⁡1≤k≤K^τik(θ^K^),i=1,…,n. Boxplots of the maximum conditional probabilities τmax⁡(i) per cluster for the Graveley et al. [40] and Fietz et al. [29] mouse data are presented in Figure 6. It may be seen that across clusters, the majority of genes in both data sets have a large value (i.e. close to 1) for τmax⁡(i); the number of genes with τmax⁡(i)>0.8 is 7822 (82.1%) and 7382 (82.4%) for the Graveley et al. [40] and Fietz et al. [29] mouse data, respectively. However, the boxplots also illustrate that some genes have a τmax⁡(i) less than this threshold, in some cases as low as 0.4; this indicates that for a small number of genes, the cluster assignment is fairly ambiguous and assignment to a single cluster is questionable (the gene with the smallest τmax⁡(i) in the Fietz et al. [29] mouse data had a conditional probability of 24.8, 32.2, 13.0 and 30.0% of belonging to clusters 1, 4, 6 and 12, respectively). In such cases, it may be prudent to focus attention on genes with highly confident cluster assignments (e.g. those with τmax⁡(i)> 0.8). Figure 6 View largeDownload slide Evaluation of clustering quality for the Graveley et al. [40] (top) and Fietz et al. [29] mouse (bottom) data. Maximum conditional probabilities τmax⁡(i) for each cluster, sorted in decreasing order by the cluster median (left). Barplots of cluster sizes, according to τmax⁡(i) >0.8 or < 0.8, sorted according to the number of genes with τmax⁡(i)>0.8 (right). Figure 6 View largeDownload slide Evaluation of clustering quality for the Graveley et al. [40] (top) and Fietz et al. [29] mouse (bottom) data. Maximum conditional probabilities τmax⁡(i) for each cluster, sorted in decreasing order by the cluster median (left). Barplots of cluster sizes, according to τmax⁡(i) >0.8 or < 0.8, sorted according to the number of genes with τmax⁡(i)>0.8 (right). Finally, examining the distribution of τmax⁡(i) values within each cluster provides information about the homogeneity and relevance of each cluster. For both data sets, in all cases, clusters are primarily made up of genes with highly confident τmax⁡(i) values; however, some clusters (e.g. Clusters 1 and 4 in the Graveley et al. [40] data) appear to be more homogeneous and well-formed than others (e.g. Clusters 16 and 3 in the same data). These conclusions align with the general observations made about the per-cluster normalized profiles in Figure 4, where it may be seen that Clusters 16 and 3 have similar profiles (suggesting that unambiguous assignment to one of these clusters is more difficult). Discussion and recommendations In this work, we have primarily addressed the choice of data to be clustered (transformed normalized profiles rather than raw counts) for RNA-seq co-expression analysis under the framework of Gaussian mixture models. In the following section, we provide some additional discussion of the choice of transformation and the use of Gaussian mixture models for co-expression analysis, as well as additional remarks about the practical application of co-expression analyses. Choice of transformation for co-expression analysis We have focused the majority of our discussion here on the use of (arcsine- or logit-transformed) normalized profiles to identify groups of co-expressed genes. As previously noted, a variety of well-established candidates for transformations have already been proposed for RNA-seq data, including the log ⁡(·+c) (for a constant c), VST, CPM and rlog. As suggested by a reviewer, we could consider the alternative approach of clustering RNA-seq data after applying one of these transformations and mean-centering per gene; this latter step would remove the spread in values observed, for example in the log-transformed (and uncentered) data in Figure 1B and ensure that the data to be clustered are independent of the absolute expression levels. To investigate this idea, we fit Gaussian mixture models to the centered log ⁡, VST, CPM and rlog transformed data for the Fietz et al. [29] mouse data with K=2,…,40 clusters using the same parameters as for the coseq analysis. For all four mean-centered transformations across all clusters, the most general covariance model form (the so-called [pkLkCk] model, with variable proportions, volume, orientation and shape) systematically resulted in estimation errors via the Rmixmod package. When considering a larger variety of general covariance model forms (the [pkLCk], [pkLkC], [pkLC] forms, which all have variable proportions and, respectively, feature equal volume, equal orientation/shape and equal volume/orientation/shape), the same estimation errors were observed for all centered transformations across all clusters. We were in fact only able to estimate the Gaussian mixture model for all four centered transformations when restricting the covariance model forms to be spherical or diagonal (which roughly corresponds to applying a K-means type algorithm); as such, this implies that when using these alternative transformations, complex covariance structures among samples (such as those observed in Figure 3) must be assumed to be negligible (i.e. conditional independence given the clustering component). Gaussian mixture models for co-expression analysis We have illustrated several advantages in using Gaussian mixture models in conjunction with appropriately defined transformations to identify groups of co-expressed genes from normalized RNA-seq gene expression profiles. First, mixture models in general have the advantage of providing a rigorous statistical framework for parameter estimation, an objective assessment of the number of clusters present in the data through the use of penalized criteria and the possibility of performing diagnostic checks on the quality and homogeneity of the resulting clusters. In particular, diagnostic plots on the maximum conditional probabilities of cluster membership provide a global overview of the clustering and an objective explanation of the quality of cluster assignments. As only a subset of genes are expected to be assigned to biologically interpretable groups, these diagnostic plots help provide a basis for discussion about the choice of genes for follow-up study. However, it should be noted that such assessments of cluster stability are not unique to Gaussian mixture models. In recent years, several resampling-based methods for assessing cluster stability and membership have been proposed, including the clue R package [42] and ConsensusClusterPlus Bioconductor package [43]; for example, Ohnishi et al. [44] applied K-means clustering to resampled microarray data, constructed a consensus clustering using clue and compared each of the resampled clusterings with the consensus to identify stable and robust clusters. Gaussian mixtures in particular represent a rich, flexible and well-characterized class of models that have been successfully implemented in a large variety of theoretical and applied research contexts. For RNA-seq data, this means that the model may directly account for per-cluster correlation structures among samples, which can be strong in RNA-seq data. In this work, we considered a single form of Gaussian covariance matrices (the [pkLkCk] form), but any or all of the 28 forms of Gaussian mixture models could be used in practice. We have also discussed the use of penalized criteria like the ICL and BIC to objectively compare results between different transformations, and potentially among different forms of Gaussian covariance matrices or among different models. For the four data sets considered here, the arcsine transformation of normalized expression profiles was consistently preferred to the logit transformation; as previously mentioned, this is likely owing to the sensitivity of the latter to small pij values. An interesting further direction of research would be to consider approaches able to directly model the compositional nature of normalized profiles pij without the need to apply an arcsine or logit transformation. Practical issues for co-expression analysis It is worth noting that the concept of gene co-expression is alternatively used to refer to two broad types of analyses [45]: (1) clustering gene expression patterns to explore shared function and co-regulation (our focus in this work); and (2) network inference, which aims to construct a model of the network of regulatory interactions between genes. For the latter, a popular and widely used method is the weighted correlation network analysis (WGCNA) approach [46], which seeks to identify modules of highly interconnected (both positively and negatively correlated) genes. Although this approach was first proposed for use with microarray data, the WGCNA online Frequently Asked Questions (FAQ) page suggests it may be used for normalized RNA-seq data following a variance-stabilizing or log transformation; however, as WGCNA is based on estimates of pairwise correlation among genes, the authors recommend at least 15–20 samples in practice (as a reminder, the number of samples in the four data sets considered here ranged from 12 to 24). As such, because of both the difference in analysis objectives (clustering versus network inference) and the relatively small sample sizes of the four data sets, we have not included a more detailed discussion of WGCNA in the current work. Many alternative clustering strategies exist based on different algorithms (e.g. K-means and hierarchical clustering), distance measures calculated among pairs of genes (e.g. Euclidean distance, correlation, etc.) and techniques for identifying the number of clusters (e.g. the Dynamic Tree Cut method for dendrograms [46]). The difficulty of comparing clusterings arising from different approaches is well-known, and it is rarely straightforward to establish the circumstances under which a given strategy may be preferred. One possibility that may be of interest in practice is to analyze cluster ensembles arising from a set of different methods to assess the agreement or dissimilarity among partitions and obtain a consensus clustering [42]. In addition to the choice of clustering method, several practical issues should be considered in co-expression analyses. First, a common question is whether genes should be screened before the analysis (e.g. via an upstream differential analysis or filter based on the mean expression or coefficient of variation for each gene). Such a screening step is often used in practice, as genes contributing noise but little biological signal of interest can adversely affect clustering results. A second common question pertains to whether replicates within a given experimental group should be modeled independently or summed or averaged before the co-expression analysis. Although technical replicates in RNA-seq data are typically summed before analysis, in this work, we fit Gaussian mixture models on the full data including all biological replicates; subsequently to visualize clustering results, replicate profiles are averaged for improved clarity of cluster profiles. Following a co-expression analysis, it is notoriously difficult to validate the results of a clustering algorithm on transcriptomic data, and such results can be evaluated based on either statistical criteria (e.g. between-group and within-cluster inertia measures) or external biological criteria. In practice, groups of co-expressed genes are further characterized by analyzing and integrating various resources, such as functional annotation or pathway membership information from databases like the Gene Ontology Consortium. Such functional analyses can be useful for providing interpretation and context for the identified clusters. Key Points After applying an appropriate transformation, Gaussian mixture models represent a rich, flexible and well-characterized class of models to identify groups of co-expressed genes from RNA-seq data. In particular, they directly account for per-cluster correlation structures among samples, which are observed to be strong in typical RNA-seq data. Normalized expression profiles, rather than raw counts, are recommended for co-expression analyses of RNA-seq data. Because these data are compositional in nature, an additional transformation (e.g. arcsine or logit) is required before fitting a Gaussian mixture model. Penalized model selection criteria like the BIC or ICL can be used to select both the number of clusters present and the appropriate transformation to use; in the latter case, an additional term based on the Jacobian of the transformation is added to the criterion, yielding a corrected BIC or ICL that can be used to directly compare two transformations. Supplementary data Supplementary data are available online at http://bib.oxfordjournals.org/. Andrea Rau is a Research Scientist in the Animal Genetics and Integrative Biology research unit at the French National Institute for Agronomical Research (INRA) in Jouy en Josas, France. Her research focuses on developing statistical methodology and open-source software for the analysis of genomic and transcriptomic data. Cathy Maugis-Rabusseau is an Associate Professor at the French National Institute of Applied Sciences and the French Institute of Mathematics in Toulouse, France. Her research is centered on theoretical and methodological developments for model-based clustering methods, variable selection and hypothesis testing approaches. Acknowledgements The authors would like to thank Gilles Celeux, Sandrine Laguerre, Béatrice Laurent, Clément Marteau and Marie-Laure Martin-Magniette for helpful discussions, and also thank the three reviewers for their constructive comments that helped to improve this work. Funding The French Agence Nationale de la Recherche (ANR), under grant MixStatSeq (grant number ANR-13-JS01-0001-01). References 1 Eisen MB , Spellman PT , Brown PO , et al. Cluster analysis and display of genome-wide expression patterns . PNAS 1998 ; 95 ( 25 ): 14863 – 8 . Google Scholar CrossRef Search ADS PubMed 2 Jiang D , Tang C , Zhang A. Cluster analysis for gene expression data: a survey . IEEE Trans Knowl Data Eng 2004 ; 16 ( 11 ): 1370 – 86 . Google Scholar CrossRef Search ADS 3 Yeung KY , Fraley C , Murua A , et al. Model-based clustering and data transformations for gene expression data . Bioinformatics 2001 ; 17 ( 10 ): 977 – 87 . Google Scholar CrossRef Search ADS PubMed 4 Anders S , Pyl PT , Huber W. A Python framework to work with high-throughput sequencing data . Bioinformatics 2015 ; 31 ( 2 ): 166 – 9 . Google Scholar CrossRef Search ADS PubMed 5 Liao Y , Smyth GK , Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features . Bioinformatics 2014 ; 30 ( 7 ): 923 – 30 . Google Scholar CrossRef Search ADS PubMed 6 Li B , Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome . BMC Bioinformatics 2011 ; 12 : 323. Google Scholar CrossRef Search ADS PubMed 7 Bray NL , Pimentel H , Melsted P , et al. Near-optimal probabilistic RNA-seq quantification . Nat Biotechnol 2016 ; 34 : 525 – 7 . Google Scholar CrossRef Search ADS PubMed 8 Oshlack A , Wakefield MJ. Transcript length bias in RNA-seq data confounds systems biology . Biol Direct 2009 ; 4 : 14. Google Scholar CrossRef Search ADS PubMed 9 Łabaj PP , Leparc GG , Linggi BE , et al. Characterization and improvement of RNA-Seq precision in quantitative transcript expression profiling . Bioinformatics 2011 ; 27 : i383 – 91 . Google Scholar CrossRef Search ADS PubMed 10 Robinson MD , McCarthy DJ , Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data . Bioinformatics 2010 ; 26 : 139 – 40 . Google Scholar CrossRef Search ADS PubMed 11 Love MI , Huber W , Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 . Genome Biol 2014 ; 15 ( 12 ): 550 Google Scholar CrossRef Search ADS PubMed 12 Ward JH. Hierarchical grouping to optimize an objective function . J Am Stat Assoc 1963 ; 58 ( 301 ): 236 – 44 . Google Scholar CrossRef Search ADS 13 MacQueen JB Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability , 1967 , pp. 281 – 97 . University of California Press , Berkeley . 14 McLachlan G , Peel D. Finite Mixture Models . Wiley-Interscience , 2000 . Google Scholar CrossRef Search ADS 15 Witten DM. Classification and clustering of sequencing data using a Poisson model . Ann Appl Stat 2011 ; 5 ( 4 ): 2493 – 518 . Google Scholar CrossRef Search ADS 16 Anders S , Huber W. Differential expression analysis for sequence count data . Genome Biol 2010 ; 11 ( R106 ): 1 – 28 . 17 Rau A , Maugis-Rabusseau C , Martin-Magniette ML , et al. Co-expression analysis of high-throughput transcriptome sequencing data with poisson mixture models . Bioinformatics 2015 ; 31 : 1420 – 7 . Google Scholar CrossRef Search ADS PubMed 18 Schwarz G. Estimating the dimension of a model . Ann Stat 1978 ; 6 ( 2 ): 461 – 4 . Google Scholar CrossRef Search ADS 19 Biernacki C , Celeux G , Govaert G. Assessing a mixture model for clustering with the integrated completed likelihood . IEEE Trans Pattern Anal Mach Intell 2000 ; 22 ( 7 ): 719 – 25 . Google Scholar CrossRef Search ADS 20 Birgé L , Massart P. Gaussian model selection . J Eur Math Soc 2001 ; 3 : 203 – 68 . Google Scholar CrossRef Search ADS 21 Baudry JP , Maugis C , Michel B. Slope heuristics: overview and implementation . Stat Comput 2012 ; 22 : 455 – 70 . Google Scholar CrossRef Search ADS 22 Si Y , Liu P , Li P , et al. Model-based clustering for RNA-seq data . Bioinformatics 2014 ; 30 ( 2 ): 197 – 205 . Google Scholar CrossRef Search ADS PubMed 23 Law CW , Chen Y , Shi W , et al. voom: precision weights unlock linear model analysis tools for RNA-seq read counts . Genome Biol 2014 ; 15 ( 2 ): R29 . Google Scholar CrossRef Search ADS PubMed 24 Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments . Stat Appl Genet Mol Biol 2004 ; 1 ( 3 ): 1 – 26 . Google Scholar CrossRef Search ADS 25 Robinson MD , Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data . Genome Biol 2010 ; 11 : R25. Google Scholar CrossRef Search ADS PubMed 26 Dillies MA , Rau A , Aubert J , et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis . Brief Bioinform 2013 ; 14 ( 6 ): 671 – 83 . Google Scholar CrossRef Search ADS PubMed 27 Tibshirani R. Estimating transformations for regression via additivity and variance stabilization . J Am Stat Assoc 1988 ; 83 : 394 – 405 . Google Scholar CrossRef Search ADS 28 Huber W , von Heydebreck A , Sueltmann H , et al. Parameter estimation for the calibration and variance stabilization of microarray data . Stat Appl Genet Mol Biol 2003 ; 2 ( 1 ). 29 Fietz SA , Lachmann R , Brandl H , et al. Transcriptomes of germinal zones of human and mouse fetal neocortex suggest a role of extracellular matrix in progenitor self-renewal . PNAS 2012 ; 109 ( 29 ): 11836 – 41 . Google Scholar CrossRef Search ADS PubMed 30 Aitchison J. The Statistical Analysis of Compositional Data . Chapman & Hall , 1986 . Google Scholar CrossRef Search ADS 31 Celeux G , Govaert G. Gaussian parsimonious clustering models . Pattern Recogn 1995 ; 28 ( 5 ): 781 – 93 . Google Scholar CrossRef Search ADS 32 Lebret R , Iovleff S , Langrognet F , et al. Rmixmod: the R Package of the model-based unsupervised, supervised, and semi-supervised classification Mixmod library . J Stat Softw 2015 ; 67 ( 6 ): 1 – 29 . Google Scholar CrossRef Search ADS 33 Dempster AP , Laird NM , Rubin DB. Maximum likelihood from incomplete data via the EM algorithm . J R Stat Soc Ser B 1977 ; 39 ( 1 ): 1 – 38 . 34 Birgé L , Massart P. Minimal penalties for Gaussian model selection . Probab Theory Relat Fields 2007 ; 138 : 33 – 73 . Google Scholar CrossRef Search ADS 35 Thomas I , Frankhauser P , Biernacki C. The fractal morphology of the built-up landscape . Landsc Urban Plan 2008 ; 84 ( 2 ): 99 – 115 . Google Scholar CrossRef Search ADS 36 Gallopin M. Classification et inférence de réseaux pour les données RNA-seq . PhD thesis, Université Paris-Saclay , 2015 . 37 Hadley W. ggplot2: Elegant Graphics for Data Analysis . New York : Springer-Verlag , 2009 . http://ggplot2.org. 38 Mach N , Berri M , Esquerré D , et al. Extensive expression differences along porcine small intestine evidenced by transcriptome sequencing . PLoS One 2014 ; 9 ( 2 ): 1 – 12 . Google Scholar CrossRef Search ADS 39 Ziemann M , Kaspi A , Lazarus R , et al. Digital expression explorer: a user-friendly repository of uniformly processed RNA-seq data. ComBio2015, volume POS-TUE-099, Melbourne, 2015 . doi: 10.13140/RG.2.1.1707.5926. 40 Graveley BR , Brooks AN , Carlson JW , et al. The development transcriptome of Drosophila melanogaster . Nature 2011 ; 471 : 473 – 9 . Google Scholar CrossRef Search ADS PubMed 41 Frazee AC , Langmead B , Leek JT. ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets . BMC Bioinformatics 2011 ; 12 :449. 42 Hornik K. A CLUE for CLUster Ensembles . J Stat Softw 2005 ; 14 ( 12 ). 43 Wilkerson MD , Hayes DN. ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking . Bioinformatics 2010 ; 26 ( 12 ): 1572 – 3 . Google Scholar CrossRef Search ADS PubMed 44 Ohnishi Y , Huber W , Tsumura A , et al. Cell-to-cell expression variability followed by signal reinforcement progressively segregates early mouse lineages . Nat Cell Biol 2014 ; 16 : 27 – 37 . Google Scholar CrossRef Search ADS PubMed 45 D’haeseleer P , Liang S , Somogyi R. Genetic network inference: from co-expression clustering to reverse engineering . Bioinformatics 2000 ; 16 ( 8 ): 707 – 26 . Google Scholar CrossRef Search ADS PubMed 46 Langfelder P , Horvath S. WGCNA: an R package for weighted correlation network analysis . BMC Bioinformatics 2008 ; 9 : 559 . Google Scholar CrossRef Search ADS PubMed © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Briefings in Bioinformatics Oxford University Press

Transformation and model choice for RNA-seq co-expression analysis

Loading next page...
 
/lp/ou_press/transformation-and-model-choice-for-rna-seq-co-expression-analysis-0O8rYDhoO3
Publisher
Oxford University Press
Copyright
© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
ISSN
1467-5463
eISSN
1477-4054
D.O.I.
10.1093/bib/bbw128
Publisher site
See Article on Publisher Site

Abstract

Abstract Although a large number of clustering algorithms have been proposed to identify groups of co-expressed genes from microarray data, the question of if and how such methods may be applied to RNA sequencing (RNA-seq) data remains unaddressed. In this work, we investigate the use of data transformations in conjunction with Gaussian mixture models for RNA-seq co-expression analyses, as well as a penalized model selection criterion to select both an appropriate transformation and number of clusters present in the data. This approach has the advantage of accounting for per-cluster correlation structures among samples, which can be strong in RNA-seq data. In addition, it provides a rigorous statistical framework for parameter estimation, an objective assessment of data transformations and number of clusters and the possibility of performing diagnostic checks on the quality and homogeneity of the identified clusters. We analyze four varied RNA-seq data sets to illustrate the use of transformations and model selection in conjunction with Gaussian mixture models. Finally, we propose a Bioconductor package coseq (co-expression of RNA-seq data) to facilitate implementation and visualization of the recommended RNA-seq co-expression analyses. RNA-seq, co-expression, mixture models, data transformation Introduction Increasingly complex studies of transcriptome dynamics are now routinely carried out using high-throughput sequencing of reverse-transcribed RNA molecules [i.e. complementary DNA (cDNA) molecules], called RNA sequencing (RNA-seq). By quantifying and comparing transcriptomes among different types of tissues, developmental stages or experimental conditions, researchers have gained a deeper understanding of how changes in transcriptional activity reflect specific cell types and contribute to phenotypic differences. Identifying groups of co-expressed genes may help target gene modules that are involved in similar biological processes [1, 2] or that are candidates for co-regulation. Thus, by identifying clusters of co-expressed genes, we aim both to identify co-regulated genes and to characterize potential biological functions for orphan genes (namely, those whose biological function is unknown). A great deal of clustering algorithms has been proposed for microarray data, raising the question of their applicability to RNA-seq data. In particular, after normalization, background correction and  log⁡2 transformation of microarray data, hybridization intensities are typically modeled by Gaussian distributions [3]. RNA-seq data, on the other hand, are made up of read counts [4, 5] or pseudocounts [6, 7] for each biological entity or feature (e.g. a gene) after either alignment to a genome reference sequence or de novo assembly. These data are characterized by (1) highly skewed values with a large dynamic range, often covering several orders of magnitude; (2) positive correlation between feature size (e.g. gene length) and read counts [8]; and (3) variable sequencing depth (i.e. library size) and coverage among experiments [9]. The presence of overdispersion (i.e. variance larger than the mean) among biological replicates for a given feature is also a typical feature of these data, leading to the use of negative binomial models [10, 11] for RNA-seq differential analyses. Statistically speaking, the goal of clustering approaches is to discover structures (clusters) within data. Many clustering methods exist and roughly fall into two categories: (1) methods based on dissimilarity distances, including tree-based hierarchical clustering [12] as well as methods like the K-means algorithm [13]; and (2) model-based methods [14], which consist of defining a clustering model and optimizing the fit between the data and the model. For the latter class of models, each cluster is represented by a distinct parametric distribution, and the entire data set is thus modeled as a mixture of these distributions; a notable advantage of model-based clustering is that it provides a rigorous framework to assess the appropriate number of clusters and the quality of clusters obtained. Presently, most proposals for clustering RNA-seq data have focused on the question of grouping biological samples rather than features, for example using hierarchical clustering with a modified loglikelihood ratio statistic based on a Poisson loglinear model as a distance measure [15] or the Euclidean distance of samples following a variance-stabilizing transformation (VST) [16]. In recent work [17], we proposed the use of Poisson mixture models to cluster RNA-seq expression profiles. This method has the advantage of directly modeling the count nature of RNA-seq data, accounting for variable library sizes among experiments and providing easily interpretable clusterings based on the profiles of variation around average expression of each gene. However, there are several serious limitations to this approach: (1) the assumption of conditional independence among samples, given the clustering group, is likely to be unrealistic for the vast majority of RNA-seq data sets; (2) per-cluster correlation structures cannot be included in the model; and (3) the Poisson distribution is likely overly restrictive, as it imposes an assumption of equal means and variances. In addition, classical asymptotic model selection criteria, such as the Bayesian information criterion (BIC) [18] and integrated completed likelihood (ICL) criterion [19], were observed to have poor behavior for the Poisson mixture model in many cases. As such, Rau et al. [17] proposed the use of a non-asymptotic penalized model selection criterion calibrated by the slope heuristics (SH) [20, 21], requiring a collection of mixture models to be fit for a wide range of cluster numbers K; for large K, this can imply significant computational time as well as practical difficulties for parameter initialization and estimation. We note that a related approach based on a hybrid-hierarchical clustering of negative binomial mixtures was proposed by Si et al. [22]; as with the work of Rau et al. [17], this method cannot account for correlation structures among samples. To address the aforementioned limitations of the Poisson mixture model, in this work, we investigate appropriate transformations to facilitate the use of Gaussian mixture models for RNA-seq co-expression analysis. This strategy has the notable advantage of enabling the estimation of per-cluster correlation structures, as well as drawing on the extensive theoretical justifications of Gaussian mixture models [14]. We note that Law et al. [23] used a related strategy for the differential analyses of RNA-seq data by transforming data, estimating precision weights for each feature and using the limma empirical Bayes analysis pipeline [24]. The identification of an ‘appropriate’ transformation for RNA-seq co-expression is not necessarily straightforward, and depends strongly on the desired interpretability of the resulting clusters as well as the model assumptions. Several transformations of read counts or pseudocounts have been proposed in the context of exploratory or differential analyses, but most largely seek to render the data homoskedastic or to reduce skewness. In this work, rather than grouping together genes with similar absolute (transformed) read abundances, we propose the use of normalized expression profiles for each feature, that is the proportion of normalized counts observed for a given feature. Owing to the compositional nature of these profiles (i.e. the sum for each feature equals 1), an additional transformation is required before fitting the Gaussian mixture model, as discussed below. The remainder of the article is organized as follows. In the Methods section, we introduce some notation, discuss appropriate data transformation for RNA-seq co-expression analyses and briefly review Gaussian mixture models, including parameter estimation and model selection. In the Results section, we describe several RNA-seq data sets and illustrate co-expression analyses on each using Gaussian mixture models on transformed data using the coseq R package. Finally, in the Discussion we provide some concluding remarks and recommendations for RNA-seq co-expression analyses in practice, as well as some opportunities for future work. Methods For the remainder of this work, let Yij be a random variable, with corresponding observed value yij, representing the raw read count (or pseudocount) for biological entity i ( i=1,…,n) of biological sample j ( j=1,…,q). For simplicity, in this work, we typically refer to the entities i as genes, although the generality of the following discussion holds for other entities of interest (exons, etc.). Each sample is typically associated with one or more experimental conditions (e.g. tissue, treatment, and time); to reflect this, let C(j) correspond to the experimental group for sample j. Finally, let y be the (n×q) matrix of read counts for all genes and samples, and let yi be the q-dimensional vector of raw count values across all biological samples for gene i. In the following, we use dot notation to indicate summations over a particular index, e.g. yi·=∑jyij. Data transformations for RNA-seq co-expression A feature common to many RNA-seq data transformations is the incorporation of sample-specific normalization factors, often referred to as library size normalization. These normalization factors account for the fact that the number of reads expected to map to a particular gene depends not only on its own expression level, but also (1) on the total number of mapped reads (also referred to as library size) in the sample, and (2) on the overall composition of the RNA population being sampled. Although several library size normalization factors have been proposed since the introduction of RNA-seq, the median ratio [11] and trimmed mean of M-values normalization (TMM; [25]) methods have been found to be robust and effective, and are now widely used [26] in the context of differential analysis. Without loss of generality, we note t=(tj) as the scaling normalization factors for raw library sizes calculated using the TMM normalization method; ℓj=y·jtj is then the corresponding normalized library size for sample j, and sj=ℓj∑m=1qℓm/q (1) is the associated normalization scaling factor by which raw counts yij are divided. Several data transformations have been suggested for RNA-seq data, most often in the context of exploratory or differential analyses. These include a  log⁡2 transformation (where a small constant is typically added to read counts to avoid 0’s), a VST [16, 27, 28], moderated log counts per million (CPM; [23]) and a regularized log (rlog) transformation [11]; see the Supplementary Materials for more details about each. As previously noted, each of these transformations was proposed with the objective of rendering the data homoskedastic (in the case of the VST or rlog transformations) or to reduce the orders of magnitude spanned by untransformed RNA-seq data. Rather than making use of these transformations, we propose calculating the normalized expression profiles for each feature, that is the proportion of normalized reads observed for gene i with respect to the total observed for gene i across all samples: pij=yij/sj+1∑ℓyiℓ/sℓ+1, where sj are the scaling normalization factors for raw counts [Equation (1)]. To illustrate the interest of using these normalized expression profiles for co-expression analysis, we plot yij/sj,  log ⁡(yij/sj+1), and the normalized expression profiles pij in Figure 1 for a subset of genes from the mouse RNA-seq data studied by Fietz et al. [29]. In particular, we consider 10 genes that are most representative (as measured by Euclidean distance) of four distinct groups: non-differentially expressed (NDE) genes across all samples (Group 1); genes expressed only in the last experimental condition (samples 11–15, Group 2); genes expressed only in the first experimental condition (samples 1–5, Group 3); and genes expressed only in the second experimental condition (samples 6–10, Group 4). It may clearly be seen that the large differences in magnitude that are dominant for normalized counts (Figure 1A) are greatly reduced by a log transformation (Figure 1B), although a certain amount of spread remains between very highly and weakly expressed genes. This spread can be notably reduced by considering the normalized expression profiles pij (Figure 1C). This example is thus instructive in illustrating the importance in co-expression analyses of considering a measure that is independent of the absolute expression level of the genes, as is the case for the normalized profiles. Figure 1 View largeDownload slide Normalized counts (A), log normalized counts  +1 (B) and normalized expression profiles pij (C) for a subset of the Fietz et al. [29] mouse RNA-seq data. The subset of genes includes NDE genes across all samples (Group 1); genes expressed only in the last experimental condition (samples 11–15, Group 2); genes expressed only in the first experimental condition (samples 1–5, Group 3); and genes expressed only in the second experimental condition (samples 6–10, Group 4). Transparent gray boxes delimit the replicates in each of the three experimental groups. Figure 1 View largeDownload slide Normalized counts (A), log normalized counts  +1 (B) and normalized expression profiles pij (C) for a subset of the Fietz et al. [29] mouse RNA-seq data. The subset of genes includes NDE genes across all samples (Group 1); genes expressed only in the last experimental condition (samples 11–15, Group 2); genes expressed only in the first experimental condition (samples 1–5, Group 3); and genes expressed only in the second experimental condition (samples 6–10, Group 4). Transparent gray boxes delimit the replicates in each of the three experimental groups. It is important to note that the profile for gene i, pi=(pij), represents compositional data [30], as it is a q-tuple of nonnegative numbers whose sum is 1. This means that the vector of values pi is linearly dependent, which imposes constraints on the covariance matrices Σk that are problematic for the general Gaussian mixture model (and indeed for most standard statistical approaches). For this reason, we consider two separate transformations of the profiles pij to break the sum constraint, the logit and the arcsine (also referred to as the arcsine square root, or angular) transformations: garcsin(pij)=arcsin(pij)∈[0,π/2], and  (2) glogit(pij)=log2(pij1−pij)∈(−∞,∞). (3) Over a broad range of intermediate values of the proportions, the logit and arcsine transformations are roughly linearly related to one another (Supplementary Figure S1A in the Supplementary Materials). However, although both transformations tend to pull out the ends of the distribution of pij values, this effect is more marked for the logit transformation, meaning that it is more affected by smaller differences at the ends of the scale (Supplementary Figure S1B). Gaussian mixture models Model-based clustering consists of assuming that the expression data come from several separately modeled subpopulations, where the full population of genes is a mixture of these subpopulations. Thus, observations are assumed to be a sample from an unknown probability distribution with density f, which is estimated by a finite mixture: f(.|θK)=∑Kk=1πkfk(.|αk), where θK=(π,α1,…,αK) and π=(π1,…,πK) are the mixing proportions, with πk∈(0,1) for all k and ∑k=1Kπk=1. The density fk(.|αk) of the kth subpopulation must be chosen according to the nature of the gene expression measures; in the following, we consider the special case of Gaussian mixture models. A collection of Gaussian mixture models can be defined as (Sm)m∈M=(S(K,v))(K,v)∈M, where S(K,v)={f(.|θ(K,v))=∑Kk=1πk,vϕ(.|μk,Σk,v)}, (4) with ϕ(.|μk,Σk,v) denoting the q-dimensional Gaussian density with mean μk and covariance matrix Σk,v. The index v denotes one of the Gaussian mixture shapes obtained by constraining one or more of the parameters in the following decomposition of each mixture component variance matrix: Σk=λkDk′AkDk, (5) where λk=|Σk|1/q, Dk is the eigenvector matrix of Σk, and Ak is the diagonal matrix of normalized eigenvalues of Σk. Various constraints on these parameters, respectively, control the volume, orientation and shape of the kth cluster [31]; by additionally allowing the proportions πk to vary according to cluster or be equal for all clusters, we may define a collection of 28 parsimonious and interpretable mixture models, available in the Rmixmod R package [32]. Without loss of generality, for simplicity of notation, we will consider here only the most general model form, with variable proportions, volume, orientation and shape (referred to as the [pkLkCk] in Rmixmod); as such, the model collection is defined solely over a range of numbers of clusters, (SK)K∈M. The parameters of each model SK in the collection defined in [Equation (4)] may be estimated using an expectation–maximization-type algorithm [33]. After solving the density estimation problem, for each model in the collection f is estimated by f^K=f(.|θ^K), and the data are clustered using the maximum a posteriori rule: for each i=1,…,n and each k=1,…,K, z^ik={1 if τik(θ^K)>τiℓ(θ^K) ∀ℓ≠k0 otherwise ., where τik(θ^K) is the conditional probability that observation i arises from the kth component mixture f(.|θ^K). Model choice for RNA-seq co-expression In the mixture model framework, the number of clusters K is typically chosen from the model collection using a penalized selection criterion such as the BIC, [18], ICL [19] or a non-asymptotic penalized criterion whose penalty is calibrated using the SH principle [34]: BIC(K)=−ℓ(.|θ^K)+νK2nln⁡(n). (6) ICL(K)=−ℓ(.|θ^K)+νK2nln⁡(n)−1n∑ni=1∑Kk=1τik(θ^K)ln⁡(τik(θ^K)) =−ℓ(.|θ^K)+νK2nln⁡(n)+entropy. (7) SH (K)=−ℓ(.|θ^K)+κνK, (8) where ℓ(.|θ^K) is the loglikelihood evaluated at θ^K (the maximum likelihood estimate of θK), νK represents the number of free parameters for the mixtures in model SK, n is the number of observations, κ is a multiplicative constant that must be calibrated using the capushe R package [21] and τik(θ^K) is as defined in the previous section. The selected model K^ corresponds to the number of clusters K that minimizes the chosen criterion among Equations (6–8). Although rarely done in practice, penalized criteria like the BIC and ICL may also be used to select among different models or transformations, as was suggested in a different context by Thomas et al. [35] and more recently for RNA-seq data by Gallopin [36]. This is of great interest, as it removes the need for an arbitrary choice of data transformation by using the framework of formal model selection. We illustrate this principle for the choice of number of clusters K and data transformation; in a more general case, a similar procedure could be used to additionally choose among the different forms of Gaussian mixture models described in Equation (5) or among different parametric forms of models. Let g(x) represent an arbitrary monotonic transformation of a data set x. If the new sample g(x) is assumed to arise from an i.i.d. Gaussian mixture density, f(.|θK), then the initial data x is an i.i.d. sample from density fg(.|θK), which is a transformation of f(.|θK) and thus not necessarily a Gaussian mixture density. If Jg denotes the Jacobian of the transformation g and θ^(K,g) the maximum likelihood estimate obtained for the model with K clusters and transformation g, we select the pair (K, g) leading to the minimum of the corrected BIC or ICL criteria: BIC*(K,g)=−ℓ(.|θ^(K,g))+νK2nln⁡(n)−ln⁡[det⁡(Jg)], ICL*(K,g)=−ℓ(.|θ^(K,g))+νK2nln⁡(n)+entropy−ln⁡[det⁡(Jg)]. (9) Note that in these expressions, the number of parameters νK does not depend on the transformation g. For the purposes of this work, we make use of the corrected ICL criterion defined in Equation (9) to compare between the logit and arcsine transformations in Equations (2) and (3) applied to the expression profiles p=(pij). In particular, we use the following: ICLarcsin*(K,garcsin)=−ℓ(.|θ^(K,garcsin))+νK2nln⁡(n)+entropy+nqln⁡(2)+12∑ni=1∑qj=1ln⁡[pij(1−pij)],  (10) ICLlogit*(K,glogit)=−ℓ(.|θ^(K,glogit))+νK2nln⁡(n)+entropy+nqln⁡[ln⁡(2)]+∑ni=1∑qj=1ln⁡[pij(1−pij)]. (11) The values of ICLarcsin*(K,garcsin) and ICLlogit*(K,glogit) can thus be directly compared to choose between the two transformations. coseq Bioconductor package To facilitate co-expression analyses of RNA-seq data using Gaussian mixture models and an appropriate data transformation, we have created the R package coseq (co-expression of RNA-seq data), freely available on Bioconductor; in this section, we briefly describe some of the options available in this package. The package is first installed and loaded using the following commands: > source(“https://bioconductor.org/biocLite.R”) > biocLite(“coseq”) > library(coseq) A typical call to coseq to fit a Gaussian mixture model on arcsine- or logit-transformed normalized profiles takes the following form: > run_arcsin <- coseq(counts, K = 2:10, model=“Normal”, transformation=“arcsin”) > run_logit <- coseq(counts, K = 2:10, model=“Normal”, transformation=“logit”) where counts represent a (n×q) matrix or data frame of read counts for n genes in q samples and K=2:10 provides the desired range of numbers of clusters (here, 2–10). We note that this function directly calls the Rmixmod R package to fit Gaussian mixture models [32]. For backward compatibility with our previous method [17], a similar function call may be used to fit a Poisson mixture model on raw counts using the HTSCluster package: > run_pois <- coseq(counts, conds, K = 2:10, model= “Poisson”) where a vector conds is additionally provided to identify the experimental condition associated with each column in counts. In both cases, the output of the coseq function is an S4 object of class coseqResults (an extension of the RangedSummarizedExperiment Bioconductor S4 class) on which standard plot and summary functions can be directly applied; the former uses functionalities from the ggplot2 package [37]. Several examples of the standard plot commands can be seen in the Results section of this work, as well as in the reproducible Rmarkdown document included in the Supplementary Materials. The option of parallelization via the BiocParallel Bioconductor package is also provided. In addition to the choice of mixture model and transformation to be used, the coseq function provides flexibility to the user to filter normalized read counts according to their mean value if desired, specify library size normalization method (TMM, median ratio, upper quartile or user-provided normalization factors) and modify Rmixmod options (number of iterations, etc.). For the specific case of arcsine- and logit-transformed normalized profiles, we provide a convenience function compareICL to calculate and plot the corrected ICL model selection criteria defined in Equations (10) and (11). Finally, as RNA-seq expression analyses are often performed on a subset of genes identified as differentially expressed, the coseq function can also be directly called on an DESeqResults S4 object or integrated with DGELRT S4 objects, respectively, corresponding to output from the DESeq2 [11] and edgeR [10] Bioconductor packages for RNA-seq differential analyses. For more details and examples, see the full package vignette provided with coseq. Results In the following, we illustrate co-expression analyses using Gaussian mixture models in conjunction with the proposed transformations on normalized expression profiles for several RNA-seq data sets. The data were selected to represent several different organisms (pig, mouse, human and fly) in studies for which co-expression is of particular interest (across tissues or across time); additional details on how data were obtained and preprocessed may be found in the Supplementary Materials. Description of RNA-seq data Porcine small intestine: Mach et al. [38] used RNA-seq to study site-specific gene expression along the gastrointestinal tract of four healthy 70-day-old male large white piglets. Samples were collected in three sites along the proximal–distal axis of the small intestine (duodenum, jejunum and ileum), as well as the ileal Peyer’s patch (a lymphoid tissue localized in direct contact with the epithelial intestinal tissue). Complete information regarding sample preparation, sequencing, quality control and preprocessing is available in the original article [38]. Raw reads are available at NCBI’s Short Read Archive (SRA) repository (PRJNA221286 BioProject; accessions SRR1006118–SRR1006133); in the current work, read counts for genes sharing a common gene symbol or Ensembl gene ID were summed. Embryonic mouse neocortex: Fietz et al. [29] studied the expansion of the neocortex in five embryonic (day 14.5) mice by analyzing the transcriptome of the ventricular zone (VZ), subventricular zone (SVZ) and cortical plate (CP) using RNA-seq. Laser capture microdissection, RNA isolation and cDNA library preparation and RNA sequencing and quantification are described in the Supplementary Materials of Fietz et al. [29]. In our work, raw read counts for this study were downloaded on 23 December 2015 from the Digital Expression Explorer (DEE) [39] using associated SRA accession number SRP013825, and run information was downloaded using the SRA Run Selector. Additional information about the DEE processing pipeline may be found in the Supplementary Materials. Fetal human neocortex: In the aforementioned study, Fietz et al. [29] also included samples from six (13–16-week postconception) human fetuses taken from four neocortex regions: CP, VZ and inner and outer subventricular zone. Raw counts were obtained in the same manner as described above. Dynamic expression in embryonic flies: As part of the modENCODE project to annotate functional elements of the Drosophila melanogaster genome, Graveley et al. [40] characterized the expression dynamics of the fly using RNA-seq over 27 distinct stages of development, from early embryo to ageing male and female adults. As in our previous co-expression work [17], we focus on a subset of these data from 12 embryonic samples that were collected at 2 h intervals for 24 h, with one biological replicate for each time point. Phenotype tables and raw read counts were obtained from the ReCount online resource [41]. Results on RNA-seq data We used the coseq package described in the previous section to fit Gaussian mixture models to the arcsine- and logit-transformed normalized profiles for each of the four data sets described above for K=2,…,40 clusters (with the exception of the D. melanogaster data, for which a maximum value of K = 60 was used), using the TMM library size normalization, filtering genes with mean normalized count <50 and otherwise using default values for parameters. Concerning the filtering step, screening using either a differential analysis or a threshold on normalized means or coefficients of variation is often applied in practice before co-expression analyses to remove features that contribute noise. We note that for some studies (e.g. those in which some genes may be completely switched off in some conditions), a less stringent filtering threshold may be desired; in such cases, the meanFilterCutoff argument of coseq may be omitted (corresponding to no filter) or set to a smaller value. In all cases, we calculated the corrected ICL values from Equations (10) and (11) to compare between the arcsine and logit transformations; the number of clusters K^ identified for each transformation and the preferred model–transformation pair chosen via the corrected ICL, are shown for each data set in Table 1. The corrected ICL values across a range of numbers of clusters K are shown in Figure 2 for the Graveley et al. [40] fly and Fietz et al. [29] mouse data; for clarity, we focus our discussion in the main text on these two data sets, but complete and reproducible results (in the form of an Rmarkdown document) for all four RNA-seq data sets may be found in the Supplementary Materials. Table 1 Summary of results for Gaussian mixture models fit on transformed normalized RNA-seq profiles Organism Reference Conditions (replicates) K^arcsin K^logit Pig Mach et al. [38] Four tissues (four replicates each) 14 11 Mouse Fietz et al. [29] Three tissues (five replicates each) 12 15 Human Fietz et al. [29] Four tissues (six replicates each) 6 8 Fly Graveley et al. [40] 12 time points (one replicate each) 28 23 Organism Reference Conditions (replicates) K^arcsin K^logit Pig Mach et al. [38] Four tissues (four replicates each) 14 11 Mouse Fietz et al. [29] Three tissues (five replicates each) 12 15 Human Fietz et al. [29] Four tissues (six replicates each) 6 8 Fly Graveley et al. [40] 12 time points (one replicate each) 28 23 Note. For each data set, the organism, associated reference, experimental conditions C(j) and number of biological replicates in each, and number of clusters selected via ICL for the arcsine and logit transformation are provided. Boldface values indicated the final model selected via the corrected ICL. Table 1 Summary of results for Gaussian mixture models fit on transformed normalized RNA-seq profiles Organism Reference Conditions (replicates) K^arcsin K^logit Pig Mach et al. [38] Four tissues (four replicates each) 14 11 Mouse Fietz et al. [29] Three tissues (five replicates each) 12 15 Human Fietz et al. [29] Four tissues (six replicates each) 6 8 Fly Graveley et al. [40] 12 time points (one replicate each) 28 23 Organism Reference Conditions (replicates) K^arcsin K^logit Pig Mach et al. [38] Four tissues (four replicates each) 14 11 Mouse Fietz et al. [29] Three tissues (five replicates each) 12 15 Human Fietz et al. [29] Four tissues (six replicates each) 6 8 Fly Graveley et al. [40] 12 time points (one replicate each) 28 23 Note. For each data set, the organism, associated reference, experimental conditions C(j) and number of biological replicates in each, and number of clusters selected via ICL for the arcsine and logit transformation are provided. Boldface values indicated the final model selected via the corrected ICL. Figure 2 View largeDownload slide Corrected ICL values for the arcsine-transformed and logit-transformed normalized expression profiles over a range of numbers of clusters K for the Graveley et al. [40] fly and Fietz et al. [29] mouse data (A and B, respectively). Figure 2 View largeDownload slide Corrected ICL values for the arcsine-transformed and logit-transformed normalized expression profiles over a range of numbers of clusters K for the Graveley et al. [40] fly and Fietz et al. [29] mouse data (A and B, respectively). For the remainder of the article, the results presented correspond to the model selected via the corrected ICL. It is of interest to investigate the per-cluster covariance structures estimated for the selected models for each of the RNA-seq data sets. As an example, the per-cluster correlation matrices estimated by coseq for two selected clusters from the Graveley et al. [40] and Fietz et al. [29] mouse data are shown in Figure 3. It is interesting to note that although the Gaussian mixture model does not explicitly incorporate the experimental condition labels C(j), the estimated models include large cluster-specific correlations among close time points (Figure 3A and B) or among replicates within each tissue (Figure 3C and D). In addition, cluster-specific correlation structures among regions may be clearly seen; for example, in the Fietz et al. [29] mouse data, Cluster 2 is characterized by large negative correlations between the CP and SVZ/VZ regions, while Cluster 3 instead has a strong negative correlation between the VZ and CP/SVZ regions. This strongly suggests that in these data, the assumption of conditional independence among samples assumed by the Poisson mixture model described in Rau et al. [17] is indeed unrealistic. Figure 3 View largeDownload slide Per-cluster correlation matrices for Clusters 25 (A) and 27 (B) from the Graveley et al. [40] fly data and for Clusters 2 (C) and 3 (D) from the Fietz et al. [29] mouse data. Right- and left-leaning ellipses represent correlations close to 1 and −1, respectively, and ellipse areas correspond to the absolute value of correlation coefficients. Correlation matrices are visualized using the corrplot R package. Figure 3 View largeDownload slide Per-cluster correlation matrices for Clusters 25 (A) and 27 (B) from the Graveley et al. [40] fly data and for Clusters 2 (C) and 3 (D) from the Fietz et al. [29] mouse data. Right- and left-leaning ellipses represent correlations close to 1 and −1, respectively, and ellipse areas correspond to the absolute value of correlation coefficients. Correlation matrices are visualized using the corrplot R package. There are several ways in which per-cluster expression profiles can be represented graphically, depending on the type of data plotted (normalized counts, normalized expression profiles or transformed normalized profiles), the type of plot (e.g. line plots or boxplots) and whether replicates within experimental conditions are averaged or plotted independently. Regarding the latter point, note that the Gaussian mixture model is fit on the entirety of the data, and replicate averaging is proposed to simplify the visualization of cluster-specific expression. Although the coseq package facilitates the implementation of any combination of these three graphical options, our recommendations for visualizing co-expression results are as follows: (1) although ‘tighter’ profiles are observed when plotting the transformed normalized profiles (as these are the data used to fit the model), interpretation of profiles is improved by instead using the untransformed normalized profiles; (2) boxplots are generally preferable when experimental conditions C(j) represent distinct groups, although line plots can be useful for time-course experiments; (3) averaging replicates before plotting often provides clearer distinctions among cluster-specific profiles. Following these recommendations, the cluster-specific profiles identified for the Graveley et al. [40] and Fietz et al. [29] mouse data are shown in Figures 4 and 5. Figure 4 View largeDownload slide Per-cluster expression profiles for the Graveley et al. [40] data. Clusters have been sorted, so that those with similar mean vectors (as measured by the Euclidean distance) are plotted next to one another. Lines correspond to genes with maximum conditional probability τmax⁡(i) of cluster membership >0.8. Figure 4 View largeDownload slide Per-cluster expression profiles for the Graveley et al. [40] data. Clusters have been sorted, so that those with similar mean vectors (as measured by the Euclidean distance) are plotted next to one another. Lines correspond to genes with maximum conditional probability τmax⁡(i) of cluster membership >0.8. Figure 5 View largeDownload slide Per-cluster expression profiles for the Fietz et al. [29] data. Clusters have been sorted, so that those with similar mean vectors (as measured by the Euclidean distance) are plotted next to one another. Connected lines correspond to the mean expression profile for each group. Figure 5 View largeDownload slide Per-cluster expression profiles for the Fietz et al. [29] data. Clusters have been sorted, so that those with similar mean vectors (as measured by the Euclidean distance) are plotted next to one another. Connected lines correspond to the mean expression profile for each group. An additional advantage of model-based clustering approaches is that they facilitate an evaluation of the clustering quality of the selected model by examining the maximum conditional probabilities of cluster membership for each gene: τmax⁡(i)=max⁡1≤k≤K^τik(θ^K^),i=1,…,n. Boxplots of the maximum conditional probabilities τmax⁡(i) per cluster for the Graveley et al. [40] and Fietz et al. [29] mouse data are presented in Figure 6. It may be seen that across clusters, the majority of genes in both data sets have a large value (i.e. close to 1) for τmax⁡(i); the number of genes with τmax⁡(i)>0.8 is 7822 (82.1%) and 7382 (82.4%) for the Graveley et al. [40] and Fietz et al. [29] mouse data, respectively. However, the boxplots also illustrate that some genes have a τmax⁡(i) less than this threshold, in some cases as low as 0.4; this indicates that for a small number of genes, the cluster assignment is fairly ambiguous and assignment to a single cluster is questionable (the gene with the smallest τmax⁡(i) in the Fietz et al. [29] mouse data had a conditional probability of 24.8, 32.2, 13.0 and 30.0% of belonging to clusters 1, 4, 6 and 12, respectively). In such cases, it may be prudent to focus attention on genes with highly confident cluster assignments (e.g. those with τmax⁡(i)> 0.8). Figure 6 View largeDownload slide Evaluation of clustering quality for the Graveley et al. [40] (top) and Fietz et al. [29] mouse (bottom) data. Maximum conditional probabilities τmax⁡(i) for each cluster, sorted in decreasing order by the cluster median (left). Barplots of cluster sizes, according to τmax⁡(i) >0.8 or < 0.8, sorted according to the number of genes with τmax⁡(i)>0.8 (right). Figure 6 View largeDownload slide Evaluation of clustering quality for the Graveley et al. [40] (top) and Fietz et al. [29] mouse (bottom) data. Maximum conditional probabilities τmax⁡(i) for each cluster, sorted in decreasing order by the cluster median (left). Barplots of cluster sizes, according to τmax⁡(i) >0.8 or < 0.8, sorted according to the number of genes with τmax⁡(i)>0.8 (right). Finally, examining the distribution of τmax⁡(i) values within each cluster provides information about the homogeneity and relevance of each cluster. For both data sets, in all cases, clusters are primarily made up of genes with highly confident τmax⁡(i) values; however, some clusters (e.g. Clusters 1 and 4 in the Graveley et al. [40] data) appear to be more homogeneous and well-formed than others (e.g. Clusters 16 and 3 in the same data). These conclusions align with the general observations made about the per-cluster normalized profiles in Figure 4, where it may be seen that Clusters 16 and 3 have similar profiles (suggesting that unambiguous assignment to one of these clusters is more difficult). Discussion and recommendations In this work, we have primarily addressed the choice of data to be clustered (transformed normalized profiles rather than raw counts) for RNA-seq co-expression analysis under the framework of Gaussian mixture models. In the following section, we provide some additional discussion of the choice of transformation and the use of Gaussian mixture models for co-expression analysis, as well as additional remarks about the practical application of co-expression analyses. Choice of transformation for co-expression analysis We have focused the majority of our discussion here on the use of (arcsine- or logit-transformed) normalized profiles to identify groups of co-expressed genes. As previously noted, a variety of well-established candidates for transformations have already been proposed for RNA-seq data, including the log ⁡(·+c) (for a constant c), VST, CPM and rlog. As suggested by a reviewer, we could consider the alternative approach of clustering RNA-seq data after applying one of these transformations and mean-centering per gene; this latter step would remove the spread in values observed, for example in the log-transformed (and uncentered) data in Figure 1B and ensure that the data to be clustered are independent of the absolute expression levels. To investigate this idea, we fit Gaussian mixture models to the centered log ⁡, VST, CPM and rlog transformed data for the Fietz et al. [29] mouse data with K=2,…,40 clusters using the same parameters as for the coseq analysis. For all four mean-centered transformations across all clusters, the most general covariance model form (the so-called [pkLkCk] model, with variable proportions, volume, orientation and shape) systematically resulted in estimation errors via the Rmixmod package. When considering a larger variety of general covariance model forms (the [pkLCk], [pkLkC], [pkLC] forms, which all have variable proportions and, respectively, feature equal volume, equal orientation/shape and equal volume/orientation/shape), the same estimation errors were observed for all centered transformations across all clusters. We were in fact only able to estimate the Gaussian mixture model for all four centered transformations when restricting the covariance model forms to be spherical or diagonal (which roughly corresponds to applying a K-means type algorithm); as such, this implies that when using these alternative transformations, complex covariance structures among samples (such as those observed in Figure 3) must be assumed to be negligible (i.e. conditional independence given the clustering component). Gaussian mixture models for co-expression analysis We have illustrated several advantages in using Gaussian mixture models in conjunction with appropriately defined transformations to identify groups of co-expressed genes from normalized RNA-seq gene expression profiles. First, mixture models in general have the advantage of providing a rigorous statistical framework for parameter estimation, an objective assessment of the number of clusters present in the data through the use of penalized criteria and the possibility of performing diagnostic checks on the quality and homogeneity of the resulting clusters. In particular, diagnostic plots on the maximum conditional probabilities of cluster membership provide a global overview of the clustering and an objective explanation of the quality of cluster assignments. As only a subset of genes are expected to be assigned to biologically interpretable groups, these diagnostic plots help provide a basis for discussion about the choice of genes for follow-up study. However, it should be noted that such assessments of cluster stability are not unique to Gaussian mixture models. In recent years, several resampling-based methods for assessing cluster stability and membership have been proposed, including the clue R package [42] and ConsensusClusterPlus Bioconductor package [43]; for example, Ohnishi et al. [44] applied K-means clustering to resampled microarray data, constructed a consensus clustering using clue and compared each of the resampled clusterings with the consensus to identify stable and robust clusters. Gaussian mixtures in particular represent a rich, flexible and well-characterized class of models that have been successfully implemented in a large variety of theoretical and applied research contexts. For RNA-seq data, this means that the model may directly account for per-cluster correlation structures among samples, which can be strong in RNA-seq data. In this work, we considered a single form of Gaussian covariance matrices (the [pkLkCk] form), but any or all of the 28 forms of Gaussian mixture models could be used in practice. We have also discussed the use of penalized criteria like the ICL and BIC to objectively compare results between different transformations, and potentially among different forms of Gaussian covariance matrices or among different models. For the four data sets considered here, the arcsine transformation of normalized expression profiles was consistently preferred to the logit transformation; as previously mentioned, this is likely owing to the sensitivity of the latter to small pij values. An interesting further direction of research would be to consider approaches able to directly model the compositional nature of normalized profiles pij without the need to apply an arcsine or logit transformation. Practical issues for co-expression analysis It is worth noting that the concept of gene co-expression is alternatively used to refer to two broad types of analyses [45]: (1) clustering gene expression patterns to explore shared function and co-regulation (our focus in this work); and (2) network inference, which aims to construct a model of the network of regulatory interactions between genes. For the latter, a popular and widely used method is the weighted correlation network analysis (WGCNA) approach [46], which seeks to identify modules of highly interconnected (both positively and negatively correlated) genes. Although this approach was first proposed for use with microarray data, the WGCNA online Frequently Asked Questions (FAQ) page suggests it may be used for normalized RNA-seq data following a variance-stabilizing or log transformation; however, as WGCNA is based on estimates of pairwise correlation among genes, the authors recommend at least 15–20 samples in practice (as a reminder, the number of samples in the four data sets considered here ranged from 12 to 24). As such, because of both the difference in analysis objectives (clustering versus network inference) and the relatively small sample sizes of the four data sets, we have not included a more detailed discussion of WGCNA in the current work. Many alternative clustering strategies exist based on different algorithms (e.g. K-means and hierarchical clustering), distance measures calculated among pairs of genes (e.g. Euclidean distance, correlation, etc.) and techniques for identifying the number of clusters (e.g. the Dynamic Tree Cut method for dendrograms [46]). The difficulty of comparing clusterings arising from different approaches is well-known, and it is rarely straightforward to establish the circumstances under which a given strategy may be preferred. One possibility that may be of interest in practice is to analyze cluster ensembles arising from a set of different methods to assess the agreement or dissimilarity among partitions and obtain a consensus clustering [42]. In addition to the choice of clustering method, several practical issues should be considered in co-expression analyses. First, a common question is whether genes should be screened before the analysis (e.g. via an upstream differential analysis or filter based on the mean expression or coefficient of variation for each gene). Such a screening step is often used in practice, as genes contributing noise but little biological signal of interest can adversely affect clustering results. A second common question pertains to whether replicates within a given experimental group should be modeled independently or summed or averaged before the co-expression analysis. Although technical replicates in RNA-seq data are typically summed before analysis, in this work, we fit Gaussian mixture models on the full data including all biological replicates; subsequently to visualize clustering results, replicate profiles are averaged for improved clarity of cluster profiles. Following a co-expression analysis, it is notoriously difficult to validate the results of a clustering algorithm on transcriptomic data, and such results can be evaluated based on either statistical criteria (e.g. between-group and within-cluster inertia measures) or external biological criteria. In practice, groups of co-expressed genes are further characterized by analyzing and integrating various resources, such as functional annotation or pathway membership information from databases like the Gene Ontology Consortium. Such functional analyses can be useful for providing interpretation and context for the identified clusters. Key Points After applying an appropriate transformation, Gaussian mixture models represent a rich, flexible and well-characterized class of models to identify groups of co-expressed genes from RNA-seq data. In particular, they directly account for per-cluster correlation structures among samples, which are observed to be strong in typical RNA-seq data. Normalized expression profiles, rather than raw counts, are recommended for co-expression analyses of RNA-seq data. Because these data are compositional in nature, an additional transformation (e.g. arcsine or logit) is required before fitting a Gaussian mixture model. Penalized model selection criteria like the BIC or ICL can be used to select both the number of clusters present and the appropriate transformation to use; in the latter case, an additional term based on the Jacobian of the transformation is added to the criterion, yielding a corrected BIC or ICL that can be used to directly compare two transformations. Supplementary data Supplementary data are available online at http://bib.oxfordjournals.org/. Andrea Rau is a Research Scientist in the Animal Genetics and Integrative Biology research unit at the French National Institute for Agronomical Research (INRA) in Jouy en Josas, France. Her research focuses on developing statistical methodology and open-source software for the analysis of genomic and transcriptomic data. Cathy Maugis-Rabusseau is an Associate Professor at the French National Institute of Applied Sciences and the French Institute of Mathematics in Toulouse, France. Her research is centered on theoretical and methodological developments for model-based clustering methods, variable selection and hypothesis testing approaches. Acknowledgements The authors would like to thank Gilles Celeux, Sandrine Laguerre, Béatrice Laurent, Clément Marteau and Marie-Laure Martin-Magniette for helpful discussions, and also thank the three reviewers for their constructive comments that helped to improve this work. Funding The French Agence Nationale de la Recherche (ANR), under grant MixStatSeq (grant number ANR-13-JS01-0001-01). References 1 Eisen MB , Spellman PT , Brown PO , et al. Cluster analysis and display of genome-wide expression patterns . PNAS 1998 ; 95 ( 25 ): 14863 – 8 . Google Scholar CrossRef Search ADS PubMed 2 Jiang D , Tang C , Zhang A. Cluster analysis for gene expression data: a survey . IEEE Trans Knowl Data Eng 2004 ; 16 ( 11 ): 1370 – 86 . Google Scholar CrossRef Search ADS 3 Yeung KY , Fraley C , Murua A , et al. Model-based clustering and data transformations for gene expression data . Bioinformatics 2001 ; 17 ( 10 ): 977 – 87 . Google Scholar CrossRef Search ADS PubMed 4 Anders S , Pyl PT , Huber W. A Python framework to work with high-throughput sequencing data . Bioinformatics 2015 ; 31 ( 2 ): 166 – 9 . Google Scholar CrossRef Search ADS PubMed 5 Liao Y , Smyth GK , Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features . Bioinformatics 2014 ; 30 ( 7 ): 923 – 30 . Google Scholar CrossRef Search ADS PubMed 6 Li B , Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome . BMC Bioinformatics 2011 ; 12 : 323. Google Scholar CrossRef Search ADS PubMed 7 Bray NL , Pimentel H , Melsted P , et al. Near-optimal probabilistic RNA-seq quantification . Nat Biotechnol 2016 ; 34 : 525 – 7 . Google Scholar CrossRef Search ADS PubMed 8 Oshlack A , Wakefield MJ. Transcript length bias in RNA-seq data confounds systems biology . Biol Direct 2009 ; 4 : 14. Google Scholar CrossRef Search ADS PubMed 9 Łabaj PP , Leparc GG , Linggi BE , et al. Characterization and improvement of RNA-Seq precision in quantitative transcript expression profiling . Bioinformatics 2011 ; 27 : i383 – 91 . Google Scholar CrossRef Search ADS PubMed 10 Robinson MD , McCarthy DJ , Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data . Bioinformatics 2010 ; 26 : 139 – 40 . Google Scholar CrossRef Search ADS PubMed 11 Love MI , Huber W , Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 . Genome Biol 2014 ; 15 ( 12 ): 550 Google Scholar CrossRef Search ADS PubMed 12 Ward JH. Hierarchical grouping to optimize an objective function . J Am Stat Assoc 1963 ; 58 ( 301 ): 236 – 44 . Google Scholar CrossRef Search ADS 13 MacQueen JB Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability , 1967 , pp. 281 – 97 . University of California Press , Berkeley . 14 McLachlan G , Peel D. Finite Mixture Models . Wiley-Interscience , 2000 . Google Scholar CrossRef Search ADS 15 Witten DM. Classification and clustering of sequencing data using a Poisson model . Ann Appl Stat 2011 ; 5 ( 4 ): 2493 – 518 . Google Scholar CrossRef Search ADS 16 Anders S , Huber W. Differential expression analysis for sequence count data . Genome Biol 2010 ; 11 ( R106 ): 1 – 28 . 17 Rau A , Maugis-Rabusseau C , Martin-Magniette ML , et al. Co-expression analysis of high-throughput transcriptome sequencing data with poisson mixture models . Bioinformatics 2015 ; 31 : 1420 – 7 . Google Scholar CrossRef Search ADS PubMed 18 Schwarz G. Estimating the dimension of a model . Ann Stat 1978 ; 6 ( 2 ): 461 – 4 . Google Scholar CrossRef Search ADS 19 Biernacki C , Celeux G , Govaert G. Assessing a mixture model for clustering with the integrated completed likelihood . IEEE Trans Pattern Anal Mach Intell 2000 ; 22 ( 7 ): 719 – 25 . Google Scholar CrossRef Search ADS 20 Birgé L , Massart P. Gaussian model selection . J Eur Math Soc 2001 ; 3 : 203 – 68 . Google Scholar CrossRef Search ADS 21 Baudry JP , Maugis C , Michel B. Slope heuristics: overview and implementation . Stat Comput 2012 ; 22 : 455 – 70 . Google Scholar CrossRef Search ADS 22 Si Y , Liu P , Li P , et al. Model-based clustering for RNA-seq data . Bioinformatics 2014 ; 30 ( 2 ): 197 – 205 . Google Scholar CrossRef Search ADS PubMed 23 Law CW , Chen Y , Shi W , et al. voom: precision weights unlock linear model analysis tools for RNA-seq read counts . Genome Biol 2014 ; 15 ( 2 ): R29 . Google Scholar CrossRef Search ADS PubMed 24 Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments . Stat Appl Genet Mol Biol 2004 ; 1 ( 3 ): 1 – 26 . Google Scholar CrossRef Search ADS 25 Robinson MD , Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data . Genome Biol 2010 ; 11 : R25. Google Scholar CrossRef Search ADS PubMed 26 Dillies MA , Rau A , Aubert J , et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis . Brief Bioinform 2013 ; 14 ( 6 ): 671 – 83 . Google Scholar CrossRef Search ADS PubMed 27 Tibshirani R. Estimating transformations for regression via additivity and variance stabilization . J Am Stat Assoc 1988 ; 83 : 394 – 405 . Google Scholar CrossRef Search ADS 28 Huber W , von Heydebreck A , Sueltmann H , et al. Parameter estimation for the calibration and variance stabilization of microarray data . Stat Appl Genet Mol Biol 2003 ; 2 ( 1 ). 29 Fietz SA , Lachmann R , Brandl H , et al. Transcriptomes of germinal zones of human and mouse fetal neocortex suggest a role of extracellular matrix in progenitor self-renewal . PNAS 2012 ; 109 ( 29 ): 11836 – 41 . Google Scholar CrossRef Search ADS PubMed 30 Aitchison J. The Statistical Analysis of Compositional Data . Chapman & Hall , 1986 . Google Scholar CrossRef Search ADS 31 Celeux G , Govaert G. Gaussian parsimonious clustering models . Pattern Recogn 1995 ; 28 ( 5 ): 781 – 93 . Google Scholar CrossRef Search ADS 32 Lebret R , Iovleff S , Langrognet F , et al. Rmixmod: the R Package of the model-based unsupervised, supervised, and semi-supervised classification Mixmod library . J Stat Softw 2015 ; 67 ( 6 ): 1 – 29 . Google Scholar CrossRef Search ADS 33 Dempster AP , Laird NM , Rubin DB. Maximum likelihood from incomplete data via the EM algorithm . J R Stat Soc Ser B 1977 ; 39 ( 1 ): 1 – 38 . 34 Birgé L , Massart P. Minimal penalties for Gaussian model selection . Probab Theory Relat Fields 2007 ; 138 : 33 – 73 . Google Scholar CrossRef Search ADS 35 Thomas I , Frankhauser P , Biernacki C. The fractal morphology of the built-up landscape . Landsc Urban Plan 2008 ; 84 ( 2 ): 99 – 115 . Google Scholar CrossRef Search ADS 36 Gallopin M. Classification et inférence de réseaux pour les données RNA-seq . PhD thesis, Université Paris-Saclay , 2015 . 37 Hadley W. ggplot2: Elegant Graphics for Data Analysis . New York : Springer-Verlag , 2009 . http://ggplot2.org. 38 Mach N , Berri M , Esquerré D , et al. Extensive expression differences along porcine small intestine evidenced by transcriptome sequencing . PLoS One 2014 ; 9 ( 2 ): 1 – 12 . Google Scholar CrossRef Search ADS 39 Ziemann M , Kaspi A , Lazarus R , et al. Digital expression explorer: a user-friendly repository of uniformly processed RNA-seq data. ComBio2015, volume POS-TUE-099, Melbourne, 2015 . doi: 10.13140/RG.2.1.1707.5926. 40 Graveley BR , Brooks AN , Carlson JW , et al. The development transcriptome of Drosophila melanogaster . Nature 2011 ; 471 : 473 – 9 . Google Scholar CrossRef Search ADS PubMed 41 Frazee AC , Langmead B , Leek JT. ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets . BMC Bioinformatics 2011 ; 12 :449. 42 Hornik K. A CLUE for CLUster Ensembles . J Stat Softw 2005 ; 14 ( 12 ). 43 Wilkerson MD , Hayes DN. ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking . Bioinformatics 2010 ; 26 ( 12 ): 1572 – 3 . Google Scholar CrossRef Search ADS PubMed 44 Ohnishi Y , Huber W , Tsumura A , et al. Cell-to-cell expression variability followed by signal reinforcement progressively segregates early mouse lineages . Nat Cell Biol 2014 ; 16 : 27 – 37 . Google Scholar CrossRef Search ADS PubMed 45 D’haeseleer P , Liang S , Somogyi R. Genetic network inference: from co-expression clustering to reverse engineering . Bioinformatics 2000 ; 16 ( 8 ): 707 – 26 . Google Scholar CrossRef Search ADS PubMed 46 Langfelder P , Horvath S. WGCNA: an R package for weighted correlation network analysis . BMC Bioinformatics 2008 ; 9 : 559 . Google Scholar CrossRef Search ADS PubMed © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Journal

Briefings in BioinformaticsOxford University Press

Published: Jan 8, 2017

There are no references for this article.

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off