The evolution and diversiﬁcation of cell types is a key means by which animal complexity evolves. Recently, hierarchical clustering and phylogenetic methods have been applied to RNA-seq data to infer cell type evolutionary history and homology. A major challenge for interpreting this data is that cell type transcriptomes may not evolve independently due to correlated changes in gene expression. This nonindependence can arise for several reasons, such as common regulatory sequences for genes expressed in multiple tissues, that is, pleiotropic effects of mutations. We develop a model to estimate the level of correlated transcriptome evolution (LCE) and apply it to different data sets. The results reveal pervasive correlated transcriptome evolution among different cell and tissue types. In general, tissues related by morphology or developmental lineage exhibit higher LCE than more distantly related tissues. Analyzing new data collected from bird skin appendages suggests that LCE decreases with the phylogenetic age of tissues compared, with recently evolved tissues exhibiting the highest LCE. Furthermore, we show correlated evolution can alter patterns of hierarchical clustering, causing different tissue types from the same species to cluster together. To identify genes that most strongly contribute to the correlated evolution signal, we performed a gene-wise estimation of LCE on a data set with ten species. Removing genes with high LCE allows for accurate reconstruction of evolutionary relationships among tissue types. Our study provides a statistical method to measure and account for correlated gene expression evolution when interpreting comparative transcriptome data. Key words: comparative transcriptomincs, cell type evolution, gene expression evolution, correlated evolution. Introduction morphologically simplest free-living animal, has ﬁve to six Complex organisms, such as birds and mammals, generally cell types (Grell and Ruthmann 1991; Syed and Schierwater have a much higher number of cell types than anatomically 2002), and in sponges only 10–18 cell types have been “simple” organisms, like sponges and Trichoplax. For exam- recognized (Simpson 1984; Valentine et al. 1994; ple, there are >400 cell types recognized in humans Ereskovsky 2010). Viewed through the lens of animal phylog- (Vickaryous and Hall 2006). In contrast, Trichoplax,the eny, this observation suggests that animal complexity The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact email@example.com 538 Genome Biol. Evol. 10(2):538–552. doi:10.1093/gbe/evy016 Advance Access publication January 23, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/538/4823540 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Pervasive Correlated Evolution in Gene Expression GBE FIG.1.—Correlated evolution of cell transcriptomes. (A) Cell type history embedded within species history. Cell types A and B originate from ancestral cell type O prior to split of species 1 and 2. Cell type lineages show homology relationships for the two cell types in the descendant species (subscripts indicate species). (B) Cell type history without species phylogeny illustrates cell type homology. Gray squiggly lines indicate nonindependence of transcriptome change. Correlated transcriptome evolution leads to an increase of transcriptome similarities in cell types of the same species relative to their homologous counterparts in other species. (C) Correlated evolution causes the accumulation of species-speciﬁc gene expression similarities between tissues in the same organism, potentially resulting in different tissues within a species being more similar to each other than to their homologous counterparts in another species. increased, in part, via the evolution of new cell types (Arendt cell and tissue speciﬁc gene expression independently. 2008; Wagner 2014; Arendt et al. 2016). Thus, understand- However, many RNA-seq studies ﬁnd that most genes are ing the evolution of animal complexity requires investigating not uniquely expressed in a single cell or tissue type. For in- cell type evolutionary history and homology. stance, different neuronal cell types express synaptic genes A promising approach for inferring cell or tissue type ho- (Sieburth et al. 2005; Ruvinsky et al. 2007; Stefanakis et al. mology and evolutionary history is the use of hierarchical clus- 2015), different muscle cell types share expression of contrac- tering or phylogenetic methods with RNA-seq data to tile genes (Steinmetz et al. 2012; Brunet et al. 2016), and all reconstruct cell type trees (ﬁg. 1A and B; we used cell types tissues employ “housekeeping genes” (Eisenberg and and tissue types interchangeably in this ﬁgure, to represent Levanon 2013). Thus, it is likely that some evolutionary the samples that were frequently collected in comparative changes will alter the expression of some genes simulta- RNA-seq studies) (Pu and Brady 2010; Wang et al. 2011; neously across multiple tissues, resulting in correlated patterns Pankey et al. 2014; Tschopp et al. 2014; Achim et al. 2015; of gene expression evolution. This has the potential to inﬂu- Kin et al. 2015; Musser and Wagner 2015). These are analo- ence the relationships inferred in cell type trees by generating gous to a gene or species trees, but depict relationships similarities among cell types from the same species, causing among different cell or tissue types. Homologous cell and them to cluster by species, rather than homologous tissue tissue types are expected to be more closely related to each type (ﬁg. 1B and C)(Musser and Wagner 2015). Correlated other than to other tissues (ﬁg. 1A and B). This pattern has evolution of gene expression was recently shown to be im- been observed in analyses of mature tissues with distinct func- portant in understanding transcriptome structure using phy- tions (Brawand et al. 2011; Merkin et al. 2012; Sudmantetal. logenetic networks (Gu et al. 2017). 2015). However, the opposite pattern has also been docu- To address the issue discussed earlier, the level of correlated mented (Pankey et al. 2014; Tschopp et al. 2014), in which gene expression (transcriptome) evolution (LCE), and its inﬂu- distinct cell/tissue types cluster according to species of origin, ence on tree reconstruction, needs to be evaluated. rather than by homology, even when the tissue types are Estimating correlations among quantitative phenotypic traits phylogenetically older than the species (ﬁg. 1C). Recent anal- along a known phylogeny has been extensively studied yses of this phenomenon have suggested the clustering pat- (Felsenstein 1985; Felsenstein 1988; Martins and Garland tern is related to both tissue similarity and the divergence 1991; Pagel 1994; Diaz-Uriarte and Garland 1996; times between the species under comparison (Sudmant Felsenstein 2004; Pagel and Meade 2006). These approaches et al. 2015; Gu 2016). However, how and why these factors analyze the evolution of a few traits over a large number of are related to the shape of reconstructed trees are currently species. In contrast, transcriptomic studies deal with many unclear. traits (expression levels of genes or transcripts) over compar- Phylogenetic inference using tree-building methods atively fewer species, where traditional methods do not di- assumes that the individuals under comparison (in this case rectly apply. For this reason, we focus on estimating the cell or tissue types) evolve independently of each other, thus average correlation over all genes to obtain an estimate for forming distinct historical lineages. Cell and tissue types main- the overall strength of correlated evolution affecting tain distinct gene expression programs, and thus may evolve transcriptomes. Genome Biol. Evol. 10(2):538–552 doi:10.1093/gbe/evy016 Advance Access publication January 23, 2018 539 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/538/4823540 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Liang et al. GBE In this study, we propose a statistical method to measure leghorn chicken and emu. Fertile chicken eggs, obtained from LCE by incorporating a parameter for correlated evolution in the University of Connecticut Poultry Farm, and emu eggs, both Brownian and Ornstein–Uhlenbeck (OU) models of gene obtained from Songline Emu Farm in Gill, MA, were incu- expression evolution. Using simulations of gene expression bated in a standard egg incubator at 37.8 and 35.5 C, re- evolution, we show that our method accurately estimates spectively. Skin appendages in birds develop at different times the level of correlated evolution among transcriptomes. across the embryo. We dissected embryonic placode skin Applying this method to published data, we ﬁnd that corre- from dorsal tract feathers between Hamburger and lated evolution is pervasive, and varies in accordance with Hamilton (1951; H&H) stages 33 and 34, asymmetric (scutate) developmental and evolutionary relatedness. We show that scales from the dorsal hindlimb tarsometatarsus at H&H stage tissues of more recent origin display higher LCE compared 37, symmetric (reticulate) scales from the hindlimb ventral with more distantly related tissues, suggesting that the evo- metatarsal footpad at H&H stage 39, and claws from the lutionary independence, or transcriptomic individuation, of tips of the hindlimb digits at H&H stage 36. Dissected skin tissues may increase over time. Further, a theoretical study was treated with 10 mg/ml dispase for 12 h at 4 Cto allow of our stochastic model of gene expression reveals how LCE for complete separation of epidermis and dermis. RNA was and species divergence time inﬂuence transcriptome similari- extracted from epidermal tissue using a Qiagen Rneasy kit. ties, and thus affect the hierarchical clustering patterns. Using Stand-speciﬁc polyadenylated RNA libraries were prepared for a data set with sufﬁcient species sampling, we quantify LCE sequencing by the Yale Center for Genome Analysis using an for individual genes. By excluding genes with high LCE, we in-house protocol. For each individual tissue sample, we se- were able to reconstruct trees that recover expected patterns quenced 50 million reads (one-quarter of a lane) on an of tissue type homology. Illumina Hiseq 2000. We sequenced two biological replicates for each tissue, with the exception of emu symmetric scales and claws, for which we were only able to obtain one repli- Materials and Methods cate due to limitations in the number of eggs we could Publicly Available Transcriptome Data Processing acquire. Reads were mapped to the respective species genome Transcript read counts in Merkin et al. (2012) were down- available at the time: WASHUC2 for chicken (Ensembl release loaded from GSE41637 (http://www.ncbi.nlm.nih.gov/geo/ 68 downloaded October 4, 2012) and a preliminary version of query/acc.cgi?acc¼GSE41637; last accessed January 25, the Emu genome generated by the laboratory of Allan Baker 2018). Nine tissues (brain, colon, heart, kidney, liver, lung, at the Royal Ontario Museum and University of Toronto muscle, spleen, and testes) in ﬁve species (chicken, cow, (mapped on December 5, 2012). Reads were mapped with mouse, rat, and rhesus macaque) are proﬁled in this work. Tophat2 v2.0.6 using the –GTF option. Mapped reads were We followed methods in Merkin et al. (2012) to map each assigned to genes using the program HTSeq, requiring that transcript to respective genomes (musmus9, rhemac2, rat- reads map to the correct strand, and using the “intersection- nor4, bostau4, galgal3) and extract FPKM values for each nonempty” option for dealing with reads that mapped to gene. We then normalized FPKM to TPM (Transcript per mil- more than one feature. lion mapped transcripts) by the sum (Wagner et al. 2012). Normalized gene expression RPKM from Brawand et al. (2011) was downloaded from the supplementary tables of Estimating the Level of Correlated Transcriptome Evolution the original publication. According to Brawand et al., RPKM For all data sets, we calculated TPM (transcripts per million were normalized using the median value of the most con- transcripts) as the measure of gene expression level (Wagner served a thousand genes among samples. Six tissues (cerebel- et al. 2012). We also used normalized RPKM from Brawand lum, brain without cerebellum, heart, kidney, liver, and testis) et al. (2011) to compare the effect of different quantiﬁcation in nine mammalian species and chicken are proﬁled in this methods. All transcriptome data were further processed to work. facilitate cross-species comparison, with the exception of the Transcript read counts from Tschopp et al. (2014) was down- normalized RPKM of Brawand data set which was already loaded from GSE60373 (http://www.ncbi.nlm.nih.gov/geo/ normalized between species using a median-scaling proce- query/acc.cgi?acc¼GSE60373; last accessed January 25, dure on highly conserved genes. For the other data sets, we 2018). Tail bud, forelimb bud, hindlimb bud, and genital bud followed the normalization method of Musser and Wagner in early development stage from mouse and anolis were proﬁled (2015), using one-to-one orthologs. We then used the square in this work. Read counts were also normalized to TPM values. root transformation to correct for the heteroscedasticity of transcriptome data (Musser and Wagner 2015). However, Skin Appendage at Placode Stage Transcriptome Data even after the square root transformation, we found some To collect skin appendage placode transcriptome data, we very highly expressed genes, such as mitochondrial genes, dissected developing skin from embryos of single comb white that still exhibited high variance relative to other genes. 540 Genome Biol. Evol. 10(2):538–552 doi:10.1093/gbe/evy016 Advance Access publication January 23, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/538/4823540 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Pervasive Correlated Evolution in Gene Expression GBE These were removed to prevent overestimation of correlated 1985) for all tissue pairs in the Brawand data set. We tested pﬃﬃﬃﬃﬃﬃﬃ c^ n2 evolution. Finally, we averaged sqrt(TPM) values across repli- i pﬃﬃﬃﬃﬃﬃﬃﬃﬃ its signiﬁcance using the test statistic t ¼ . According 1c cates to generate a representative transcriptome for each tis- to the property of the sample correlation coefﬁcient, t is dis- sue ineachspecies. tributed as a t-distribution with n 2 degrees of freedom We estimated LCE (c) for all combinations of tissue and under the null hypothesis c ¼ 0. Here, n is the number of species in all data sets by calculating the Pearson correlation of independent contrasts. We calculated the P value of per-gene their contrast vectors. We tested the signiﬁcance of correlated LCE using the probability of t > t under the null hypothesis. evolution by calculating a P value from a null distribution of We calculated the FDR and the fraction of genes under the randomly permutated gene expression contrast vectors (de- true null hypothesis (p , i.e., the proportion of genes do not scribed below). We conducted 1, 000 permutations to gen- have correlated evolution between two tissues) using erate a null distribution of the LCE c. Actual estimates of Bioconductor package “qvalue” (Benjamini–Hochberg proce- correlated evolution were considered signiﬁcantly higher dure). We performed gene ontology (GO) analysis using than expected by chance if they fell in the top 5% of the DAVID bioinformatics tools (v6.8). null distribution. We conducted ANOVA tests in R to deter- mine whether correlated evolution values varied depending on time since the last common ancestor of the species used Results for comparisons. The Welch’s t-test was used to determine Stochastic Modeling of Correlated Transcriptome whether the mean c values estimated from the Brawand and Evolution Merkin data sets for the same tissue pairs were signiﬁcantly different. To model correlated transcriptome evolution, we initially con- One challenge of estimating the permutation P value is that sider two cell types, A and B, in two species, 1 and 2 (ﬁg. 1A the degree of freedom would be overestimated using the and B). We assume that the two cell types are more ancient whole transcriptome, because gene expression evolution be- than the last common ancestor of species 1 and 2, which we tween genes is not independent (Oakley et al. 2005; refer to as species 0. We use vectors A , A , B , B to repre- 1 2 1 2 Ghanbarian and Hurst 2015), for instance because genes sent the transcriptome proﬁles of cell types A and B in species can be coregulated by the same transcription factors 1 and 2, respectively. Similarly, we use vectors A , B to rep- 0 0 (Pavlicev and Widder 2015). Thus, a low P value is always resent the unknown transcriptome proﬁles in species 0. For achieved if a whole transcriptome permutation test is per- each gene i, we model the evolution of its expression level in formed. To account for this, we subsampled 50 genes for the four cell types as a stochastic process: n o n o each permutation, which represents a lower bound for the X ¼ A ; A ; B ; B (1) i;t 1;i;t 2;i;t 1;i;t 2;i;t number of independently varying gene clusters identiﬁed in single cell transcriptome data. The lower bound of indepen- dently varying gene clusters is estimated using the approach Here, X ’s are multivariate random variables indexed by i;t of Wagner et al. (2008). We randomly sampled different sets the continuous time variable t, representing the expression of n genes. For each set of n genes, we calculated a correla- level of gene i in cell types A and B from two species. We tion matrix using gene expression values from the single-cell consider the time interval from the speciation event (t ¼ 0) to transcriptome data set of Pavlicev et al. (2017).We then cal- present (t ¼ t ). The initial state of the stochastic process culated the eigenvalues of the correlation matrix, and the X is X ¼ðA ; A ; B ; B Þ .Gene expression values i;t i;0 0;i 0;i 0;i 0;i variance of the eigenvalues. According to (Wagner et al. are only measured at the present time t ¼ t . We studied the 2008), the effective number of independent genes equals stochastic process X with the Brownian model and the i;t to n minus the variance of eigenvalues of the correlation ma- Ornstein–Uhlenbeck (OU) model, and developed an estimator trix. We observed that the variance of eigenvalues is always of LCE (eqs. 6 and 8). <1when n ¼ 50. That is to say, if we randomly draw 50 genes, their gene expression can be considered as indepen- Brownian Model dent events. Further, we estimated the theoretical P value for a c . Based on the characteristic of Pearson correlation coef- We ﬁrst study the case when X is a Brownian process, a i;t pﬃﬃﬃﬃﬃﬃﬃ c^ n2 c model for neutral transcriptome evolution. The Brownian pﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ ﬁcient, the test statistic t ¼ is distributed as a t-distri- 1c model is adequate when gene expression changes can be bution with n2 degrees of freedom. According to the sufﬁciently explained by random drift (Khaitovich et al. discussion above, the lower bound of n is 50 for the whole 2004; Yanai et al. 2004; Yang et al. 2017). The accumulated transcriptome. Thus, we ﬁnd the minimal c to get a theoret- variances in gene expression are proportional to the diver- ical P value <0.05 is 0.235 (ﬁg. 4, dashed line). gence time in this model. We describe the multivariate We estimated the per-gene LCE (c^ ) using the correlation Brownian process X with the following stochastic differ- i;t between phylogenetic independent contrasts (Felsenstein ential equation: Genome Biol. Evol. 10(2):538–552 doi:10.1093/gbe/evy016 Advance Access publication January 23, 2018 541 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/538/4823540 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Liang et al. GBE dX ¼ S dD (2) ^c ¼ corrðA A ; B B Þ (6) i;t t 1 2 1 2 Although we assumed a universal parameter c for all genes in Here, D ¼ D ; D ; D ; D is a vector of four indepen- t 1;t 2;t 3;t 4;t the derivation above, our simulation analysis (see below) dent Brownian processes (Here, we use D for Brownian pro- showed that when c varies among genes, ^c estimated the cesses to avoid the confusion with cell type B). At time t,they average LCE among all genes. are independent and identically normally distributed with 2 2 mean 0 and variance r t. r is thevariancein geneexpression that is expected to accumulate per unit time in each cell type. Ornstein–Uhlenbeck Model LCE is modeled by parameter c in the correlation matrix R of We also studied when gene expression evolution is subject to this process: stabilizing selection (Gilad et al. 2006; Fay and Wittkopp 2008; 0 1 10 c 0 Bedford and Hartl 2009; Metzger et al. 2017). In this case, X i;t B C is a multivariate Ornstein–Uhlenbeck process. This process is de- B C 01 0 c B C R ¼ SS ¼ B C (3) scribed by the following stochastic differential equation: B C c 01 0 @ A dX ¼k X l dt þ S dD (7) i;t i;t i t 0 c 01 Here, S and D are deﬁned as the same as in the Brownian c, ranging in [1, 1], is the correlation between the gene model (eq. 3). k represents the strength of stabilizing selection expression changes in the two cell types: corrðDA ; DB Þ 1;i;t 1;i;t and is always positive. (Brownian model is the limiting case and corrðDA DB Þ.If c > 0, cell types A and B will un- 2;i;t; 2;i;t when k ¼ 0.) l ¼ l ; l ; l ; l is the ﬁtness opti- i A;i A;i B;i B;i dergo correlated gene expression evolution. We generally ex- mum. For a gene i, gene expression optima are the same in pect that c > 0, although negative correlations are also homologous tissues, whereas they may differ between tissues mathematically possible. This is because gene expression levels in the same species. Assuming that tissues A and B have may change toward the same direction if they are coregulated evolved long before the speciation event, we study the sta- in twocelltypes. Onthe other hand, if c ¼ 0, gene expression tionary solution for the OU process. changes are uncorrelated in tissues A and B, and no correlated We took an analogous approach to that we utilized for the evolution is happening. Brownian model to estimate LCE parameter c using data from We developed a method to estimate LCE parameter c us- current time. First, as shown in (Meucci 2009), the stationary ing data from current time, t ¼ t . For Brownian models, it is solution for equation (7) is a multivariate normal distribution known that gene expression vector, X ¼ r i;t with mean l and covariance matrix R. Second, we got the T i 2k A ; A ; B ; B , is distributed as a multivariate 1;i;t 2;i;t 1;i;t 2;i;t 0 0 0 0 gene expression contrast vector Y through the linear trans- i;t normal random vector with mean X ¼ A ; A ; i;t 1;i;t¼0 2;i;t¼0 formation (eqs. 4 and 5), and found that Y is distributed i;t B ; B Þ ¼ A ; A ; B ; B , and covariance ma- 1;i;t¼0 2;i;t¼0 0;i 0;i 0;i 0;i as bivariate normal distribution with mean 0 and covariance trix: r t R. Its gene expression contrast vector, 1 c matrix , regardless of the evolutionary optimum. c 1 Y ¼ A A ; B B (4) i;t 1;i;t 2;i;t 1;i;t 2;i;t 0 0 0 0 0 Again, we ﬁnd that c is the Pearson correlation coefﬁcient of the gene expression contrasts between cell type A and B. As a is a linear transformation of X : i;t ! result, we arrived at the same estimator of LCE parameter c as 1 10 0 in the Brownian model: Y ¼ M X ; where M ¼ (5) i;t i;t 0 0 001 1 ^c ¼ corrðA A ; B B Þ (8) 1 2 1 2 Because a linear transformation of a normally distributed This formula indicates that our estimation of LCE is invari- random variable is still normally distributed, Y is distributed i;t 0 ant to some transformations of gene expression measures. as bivariate normal distribution with mean MX ¼ðÞ 0; 0 i;t 0 For example, if gene expression measure increases by an ad- 1 c ditive constant c in one species, ^c ¼ corrðA A c; B 1 2 1 2 T 2 and covariance matrix: M ðr t RÞ M ¼ 2r t , 0 0 B cÞ remains unchanged. In fact the measure is invariant c 1 under any afﬁne transformation: y¼ axþ c. regardless of the initial state of the Brownian process. Thus, we ﬁnd that c is the Pearson correlation Estimation of Correlated Evolution Is Accurate in Simulated coefﬁcient of the gene expression contrasts between Data cell types A and B. As a result, we estimate c using the We simulated transcriptome evolution data to evaluate our correlation coefﬁcient of the observed transcriptome LCE estimator ^c under various conditions. A total of 6,119 contrasts: 542 Genome Biol. Evol. 10(2):538–552 doi:10.1093/gbe/evy016 Advance Access publication January 23, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/538/4823540 by Ed 'DeepDyve' Gillespie user on 16 March 2018 g(γ): mixture distribution g(γ): beta distribution Pervasive Correlated Evolution in Gene Expression GBE f(λ): zeros f(λ): gamma distribution f(λ): mixture distribution 1.00 0.75 0.50 0.25 0.00 1.00 0.75 0.50 0.25 0.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 Average LCE among genes (mean γ) FIG.2.—Estimation of LCE in simulated transcriptome data. We simulated 1, 000 transcriptome evolutionary trajectories with varied distributions of stabilizing force k and LCE paratmer c. The distributions of k are: k ¼ 0 (Brownian model), a gamma distribution (Ornstein–Uhlenbeck model), and a mixture distribution of zeros and gamma distribution (mixture of Brownian and OU model). The distributions of c are: a beta distribution, and a mixture distribution of beta distribution and zeros. In all conditions, we observed a high correlation between the estimated LCE (^c) and the mean of true c among genes (R > 0:95). Blue and orange colors are simulations with highly (lung–spleen) and lowly (brain–testes) correlated gene expression optima, respectively. We ﬁnd that estimation of LCE are not inﬂuenced by the correlation between gene expression optima or the initial state of the stochastic process. We used square root transformed TPMs from Merkin data as gene expression optima and the initial states in the simulation. A total of 600 data points are randomly sampled to be plotted in this ﬁgure. one-to-one ortholog genes among four mammals and 1) zeros (Brownian model), 2) a gamma distribution, and 3) chicken are considered as in Merkin et. al (2012). For each a mixture distribution of gamma distribution and zeros. The gene, we simulated its gene expression evolutionary trajectory mixture distribution simulates the scenario where genes are fX g by iteratively updating X forashorttimeperiod dt constrained by varied levels of stabilizing selection with a frac- i;t i;t according to equation (7) (k ¼ 0 for Brownian model; k > 0 tion of genes evolves neutrally. We ﬁnd that our estimates of for OU model). Negative X were suppressed to zero in each LCE are robust to various model parameters. We observed a i;t iteration to avoid negative expression values in the simulation. high correlation (R > 0:95) between estimated LCE (c)and We used square root transformed gene expression TPMs from the mean of true c among genes in all cases (ﬁg. 2). Merkin et al. as selective optima l and their initial states X Meanwhile, as predicted by the theoretical analysis, the cor- i;0 (see Materials and Methods). TPMs from two pairs of relation in gene expression optima and the initial state of the tissues were used to represent both highly correlated tran- stochastic process does not inﬂuence the estimation of LCE, ^c scriptome proﬁles (corrðÞ lung; spleen¼ 0:74, ﬁg. 2 blue (different colors in ﬁg. 2). In summary, c estimated from phy- points) and lowly correlated transcriptome proﬁles logenetic contrasts accurately estimates the average LCE (corrðÞ brain; testes¼ 0:24, ﬁg. 2 orange points). We gener- among all genes under the Brownian and OU process. ated LCE parameter c and stabilizing selection parameter k from various distributions to simulate different evolutionary Correlated Evolution Estimates Are Consistent across Data conditions. c were drawn from two distributions: 1) a beta Sets distribution, and 2) a mixture distribution of beta distribution and zeros. The mixture distribution simulates the scenario that To document LCE among various tissues, we applied our a fraction of the genes experience varied levels of correlated model to three previously published comparative transcrip- evolution, whereas the rest of the genes evolve indepen- tome data sets: 1) Merkin et al. (2012), containing nine ma- dently. Similarly, k were drawn from three distributions: ture tissues from four mammals and chicken (ﬁg. 3A), Genome Biol. Evol. 10(2):538–552 doi:10.1093/gbe/evy016 Advance Access publication January 23, 2018 543 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/538/4823540 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Estimated LCE ( γ ) Liang et al. GBE Macaque A B Champanzee Bonobo Chicken Human Gorilla Cow Orangutan Rhesus Mouse Opposum Rat Platypus Chicken Mouse 300 250 200 150 100 50 0 mya 300 250 200 150 100 50 0 mya 0.8 C D E 0.8 ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** R^2=0.986 − − − − − − − − − − − − − − − − − − − − − − − − − 0.6 0.6 0.4 − − − − − − − − − − − − − − − − − − − − − − − − − Dataset 0.4 Brawand 0.4 Merkin 0.2 0.2 0.2 0.0 0.0 0.0 0.0 0.2 0.4 0.6 0.8 0 100 200 300 brain.heart kidney.testes gamma(Normalized RPKM) Split Time (mya) FIG.3.—Effect of normalization methods, species divergence time, and data set on estimates of LCE. (A) The time tree for species analyzed in Merkin data set. (B) The time tree for species analyzed in Brawand data set. All tissues considered here are much older than the species compared in these two data sets. (C) The scatterplot of estimated LCE using two normalization methods for Brawand data set: normalized RPKM according to most consistently expressed genes as in Brawand et al. (2011), and normalized TPM according to one-to-one orthologs as in Musser et al. (2015). The normalization methods of gene expression levels do not affect the estimation of LCE (R ¼ 0.986). (D)Estimates of LCE (cÞ from brain and heart sampled in both Merkin and Brawand data sets using independent contrasts. The horizontal axis is the species divergence time in million years at internal nodes as shown in (A)and (B). ^ ^ The vertical axis is the estimated LCE, c. The estimated LCE (cÞ is not inﬂuenced by species divergence time (ANOVA P value > 0.1). The dashed lines indicate themean valueof c with divergence time > 50 Ma (blue for Merkin data, orange for Brawand data). (E) Boxplot of estimated LCE for two shared tissue comparisons between Brawand and Merkin data set (see all shared tissue comparisons in supplementary ﬁg. S1, Supplementary Material online). Estimates from two data sets do not statistically differ (Welch’s t-test, “-“: P > 0.05, “*”: P < 0.05; “**” P < 0.01). We conclude that our estimates of LCE are consistent if estimated from data sets produced by different laboratories. 2) Brawand et al. (2011), containing six mature tissues in ten measures, the species compared, and data set used. To en- species of mammals and chicken (ﬁg. 3B), and 3) Tschopp et al. sure that observations are independent, we calculated ^c using (2014), reporting data on limb, genital-, and tail buds at early the correlation between phylogenetic independent contrast developmental stages in mouse and Anolis. These data sets vectors (Felsenstein 1985) of gene expression levels in Merkin have three advantages. First, all samples for each data set and Brawand data sets. We found that ^c is consistent with were generated in the same lab and with the same protocol, different gene expression normalization methods: normalized reducing batch effects. Second, these data sets include a RPKM as in Brawand et al. (2011), or normalized TPM as diverse set of both mature and developing tissues, enabling proposed in Musser and Wagner (2015) (ﬁg. 3C, us to determine the extent to which correlated evolution varies R ¼ 0:986, supplementary ﬁg. S1A, Supplementary depending on tissue relationships. Third, the Brawand and Material online). We found consistent ^c regardless of the di- Merkin data sets sampled ﬁve tissues in common (“brain”, heart, vergence time based on estimates by Hedges et al. (2006) for liver, kidney, and testes), allowing us to determine whether esti- mature tissues studied in both Brawand and Merkin (ﬁg. 3D, mates of LCE are consistent across data sets and species. for all tissue pairs ANOVA P values for dependence of ^c on We explored the extent to which our estimates of LCE, ^c, divergence time > 0.1). However, ^c estimated from recently depends on the normalization method of gene expression diverged species pairs (< 50 Myr) tended to be lower than 544 Genome Biol. Evol. 10(2):538–552 doi:10.1093/gbe/evy016 Advance Access publication January 23, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/538/4823540 by Ed 'DeepDyve' Gillespie user on 16 March 2018 gamma(Normalized TPM) gamma gamma Pervasive Correlated Evolution in Gene Expression GBE 1.00 0.75 0.50 6 78 9 0.25 0.00 020 40 60 rank FIG.4.—Estimates of LCE in various tissue pairs. LCE estimates for Tschopp, Merkin, and Brawand data sets. Tschopp estimates (c;points in red)are for early forelimb-, hindlimb-, genital-, and tail buds from mouse and Anolis. Merkin and Brawand estimates (c) are from 10 different mature tissues in 12 species of mammal and chicken. Estimates of LCE range from 0 to 1, with higher values indicating stronger correlated evolution. Dotted line indicates cutoff for estimates of c signiﬁcantly greater than expected by chance under a model without correlated evolution (see Materials and Methods). Gray colors identify LCE estimates with testes. Numbers identify LCE estimates for other tissues with highest and lowest LCE (points in blue): 1) cerebellum and forebrain,2) spleen and lung, 3) spleen and colon, 4) colon and lung, 5) heart and skeletal muscle, 6) brain and liver (Brawand data set), 7) heart and liver, 8) cerebellum and liver, 9) brain and liver (Merkin data set), and 10) liver and skeletal muscle. Black data points are the remaining tissue comparisons from Merkin and Brawand data set (supplementary table S4, Supplementary Material online). Of note, brain and liver comparison were tested in both Merkin and Brawand data set, and LCE estimates from two data sets do not signiﬁcantly differ from each other (t-test P value > 0.5). from more distantly related species for the same tissues, con- LCE were from embryonic appendage buds (ﬁg. 4; points in tributing to relatively large SD for some tissue pairs (supple- red), where c ranged from 0.81 (early hindlimb and tail buds) mentary tables S1–S3, Supplementary Material online). One to 0.89 (early hindlimb and genital buds) consistent with the possible explanation of this ﬁnding is that recently diverged conclusion of the authors that genital and limb buds are de- species have not accumulated enough gene expression velopmentally highly related. Correlated evolution estimates changes to allow for accurate estimation of correlated evolu- among the differentiated tissues of the Brawand and Merkin tion. Thus, comparisons from recently diverged species may data sets were lower. For each pair of tissues in the Brawand be dominated by observational noise which likely is uncorre- and Merkin data sets, we calculated c,the mean value of c lated, resulting in underestimation of c: from independent contrasts with split time > 50 Myr (supple- Notably, we found consistent c for the same tissue pairs mentary tables S1 and S2, Supplementary Material online). across different data sets (ﬁg. 3E and supplementary ﬁg. 1B, Mean estimates ranged from 0.17 (liver and testes) to 0.71 Supplementary Material online). The average c values for the (forebrain and cerebellum), indicating substantial variation in same tissue pair, calculated for Merkin and Brawand data sets correlated evolution among differentiated tissues (ﬁg. 4). separately do not differ statistically (Welch’s t-test P val- Most tissues exhibited at least some degree of correlated evo- ues > 0.1), indicating that estimates of the relative LCE are lution, with 71% of individual c signiﬁcantly greater than robust to variations in data acquisition. As a result, the fact expected by chance (permutation P value < 0.05). Notable that our estimation of LCE is consistent with different normal- exceptions were tissue comparisons with testes, which dom- ization methods, species, as well as data sets suggests our inated the lower end of LCE estimates (ﬁg. 4, data points in results reﬂect biological properties of these tissues, rather gray), and could not be statistically distinguished from a than technical artifacts introduced by data acquisition or model in which gene expression evolved independently be- manipulation. tween tissues. Correlated Evolution Is Pervasive and Varies in Degree Level of Correlated Evolution Differs with Phylogenetic across Diverse Tissues Age of Tissues Our results show that correlated evolution is pervasive and The observation that tissues exhibit different LCE indicates strong, and varies in degree depending on the tissues under that tissues’ evolutionary independence may itself evolve. comparison (ﬁg. 4 and supplementary tables S1 and S2, For instance, do cell and tissue types of recent origin exhibit Supplementary Material online). The highest estimates of higher LCE relative to other tissues? To address this question, Genome Biol. Evol. 10(2):538–552 doi:10.1093/gbe/evy016 Advance Access publication January 23, 2018 545 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/538/4823540 by Ed 'DeepDyve' Gillespie user on 16 March 2018 gamma Liang et al. GBE Table 1 we collected transcriptomes from developing bird feathers, Estimated LCE among Avian Skin Appendages two different avian scale types, and claws from two avian species, chicken and emu. Feathers are an evolutionary inno- Feather Scutate Scale Reticulate Scale Claw vation that evolved in the bird stem lineage, replacing scales Feather 1 0.88 0.79 0.71 across much of the body (Prum and Brush 2002). Scales are Scutate Scale 1 0.86 0.72 Reticulate Scale 1 0.71 found in all reptiles, and also on the feet of birds. Bird scales Claw 1 include large, asymmetric “scutate” scales, found on the top of the foot, and small, symmetric “reticulate” scales on the NOTE.—Estimates were obtained from transcriptomes of early skin appendage development (placodes) in chicken and emu embryos. For each skin appendage type sides and plantar surface. Birds also have claws, which de- in each species, we averaged gene expression across replicates, and calculated ^c using velop from distal toe skin, immediately adjacent to both avian gene expression contrasts <10 to avoid the inﬂuence of highly expressed structural genes. scale types. Claws are shared by reptiles and mammals, indi- cating that they are phylogenetically older than feathers and scales. We sampled epidermis during early development of these between tissues from the same species (corrðÞ A ; B and 1 1 skin appendages in chicken and emu. LCE estimates (table 1 corrðA ; B Þ). Otherwise they group by species. We studied 2 2 and supplementary table S5, Supplementary Material online) the formula of their transcriptome correlation matrix to pre- rangedfrom 0.71(feathers andclaws) to0.88(feathers and dict the hierarchical clustering pattern under various condi- scutate scales). Claws, the phylogenetically most ancient skin tions (ﬁg. 5). appendage in this sample, have a lower LCE compared with Under the Brownian model, the correlation matrix of the other skin appendages. Feathers, which evolved most re- four samples is determined by three parameters (supplemen- cently, are highly correlated with scutate scales (c ¼ 0:88), tary methods,eqs.S1–S4, Supplementary Material online): but exhibit less correlated evolution with reticulate scales LCE (c ), the correlation of gene expression proﬁles between (c ¼ 0:79, t-test on bootstrapping P¼ 910 )or claws the ancestral cell types (r ¼ corrðÞ A ; B ), andthe ratioof B 0 0 (c ¼ 0:72, t-test on bootstrapping P¼ 310 ). Notably, accumulated random walk variances and the variances in feathers and scutate scales undergo more correlated evolu- gene expression levels in the ancestral cell types r t tion than is found between the two avian scale types, which (a ¼ / t ). In the parameter space of r and c ,the B 0 B varðÞ A existed prior to the origin of feathers (c ¼ 0:86). These results phase transition boundary between clustering by species and indicate that LCE is not determined by the shared anatomical clustering by tissue is a straight line with positive intersections location of scales and claws on the distal hindlimb. Rather, our on both axis (ﬁg. 5A and supplementary methods,eqs.S1–S4, data show that the more recently evolved tissue, feathers, Supplementary Material online). We ﬁnd that samples cluster evolves in concert with its sister tissue type, scutate scales. by tissuetypewith small r and c , whereas samples cluster by This is consistent with the hypotheses that feathers and scu- species when two tissue types are similar to each other an- tate scales are serially homologous in early development, and cestrally (large r ) and have high LCE (large c ). are both derived from ancestral archosaur scales (Prum and We arrived at a similar conclusion under the OU model. Williamson 2001; Harris et al. 2002; Musser et al. 2015; Di-Poı¨ The correlation matrix of four cell types is determined by LCE and Milinkovitch 2016). Thus, our results suggest that the (c ), the correlation of gene expression optima in two tissues OU degree of genetic individuation may increase over evolution- (r ¼ corrðÞ l ; l ), and the ratio of stochastic variances and OU A B ary time, with tissues of more recent origin undergoing rela- the variances in the expression optima (a ¼ )(sup- OU 2kvarðÞ l tively higher LCE. plementary methods,eq. S5, Supplementary Material online). The phase transition boundary between two hierarchical clus- tering patterns in the parameter space of r and c is also a OU OU Theoretical Analysis of Transcriptome Clustering Patterns straight line as in the Brownian model (ﬁg. 5B). We ﬁnd again Hierarchical clustering of transcriptomes has been utilized as a that samples cluster by species when two tissue types are discovery tool for cell and tissue type homologies across spe- similar to each other (large r ) and experience high LCE OU cies (Tschopp et al. 2014), and for reconstructing the phylog- (large c ). In both models, when there is no correlated evo- OU eny of cell type diversiﬁcation (Arendt 2008; Kin et al. 2015). lution (c ¼ 0), it is expected that samples group by tissue type. However, it is unclear how correlated evolution may inﬂuence As a result, the hierarchical clustering pattern is shaped by the result of hierarchical clustering. To explore this, we ﬁrst evolutionary changes as well as historical residuals (ﬁg. 5). In conducted a theoretical analysis of how LCE inﬂuences the both models, the correlation between the same tissues from hierarchical clustering pattern of two tissues in two species two species reﬂects, in part, the residual historical signature of using our stochastic models of transcriptome evolution. Using homology. However, this historical signal decreases over time Pearson correlation as the similarity measure, samples group due to the variation introduced by gene expression evolution. by tissue type when the correlation between the same tissue On the other hand, the correlation between tissues from the (corrðÞ A ; A and corrðB ; B Þ) is higher than the correlation same species is inﬂuenced by LCE, as well as their correlation 1 2 1 2 546 Genome Biol. Evol. 10(2):538–552 doi:10.1093/gbe/evy016 Advance Access publication January 23, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/538/4823540 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Pervasive Correlated Evolution in Gene Expression GBE 1 1 Group by species Group by species B γ OU decrease α decrease α B OU Group by homology Group by homology 01 01 corr( A , B ) corr( μ , μ ) 0 0 A B AB Brownian model Orenstein-Uhlenbeck model FIG.5.—Theoretical hierarchical clustering pattern varies with model parameters. (A) Brownian model. (B) Ornstein–Uhlenbeck model. The hierarchical clustering pattern of four cell types, A ,A ,B ,and B , is determined by the correlation matrix of their transcriptome proﬁles (supplementary eq. S2, 1 2 1 2 Supplementary Material online). Homology signal (horizontal axis; corrðA ; B Þ or corrðl ; l )) and correlated evolution signal (vertical axis; c) shape cell 0 0 A B type transcriptome similarities together. The phase transition condition between clustering by homology and clustering by species is a straight line (supple- mentary eqs. S4 and S5, Supplementary Material online). In both models, no clustering by species pattern is observed without correlated evolution (c ¼ 0). a and a are parameters that are related to the random walk variance, r , in Brownian and OU model, respectively (see supplementary methods, OU Supplementary Material online). The dashed arrow indicates how the phase transition boundary changes with parameters a and a . If the random B OU walk variance r decreases, there is higher chance to see group-by-homology pattern. in the ancestor (Brownian model) or correlation in their ﬁtness Merkin, and Tschopp data sets. We found that the majority of optima (OU model). It is worthwhile to note that LCE (value on mature tissues cluster by tissue type, whereas tissues with y axis) and the correlation in the ancestor or ﬁtness optima high LCE cluster by species (ﬁg. 6B; symbols in red, supple- (value on x axis) may be positively related. As a result, the mentary table S4, Supplementary Material online). Further, biological observations are likely near the diagonal of ﬁgure 5. the more distantly related the species, the more likely tissues Together, these factors affect the hierarchical clustering pat- clustered by species rather than homology. Tissues collected tern of cell type transcriptomes and need to be considered from species with divergence times <50 Myr always clustered when using comparative transcriptomes in cell type evolution- by tissue homology. However, with >50 Myr of divergence, ary study. nearly all tissue pairs with c > 0.5 clustered by species, rather than homologous tissue. Phylogenetic divergence time also affects clustering because greater divergence time allows for Correlated Evolution Shapes Transcriptome Similarities the accumulation of more species-speciﬁc similarities via cor- Our transcriptome evolution model suggests that LCE affects related evolution, whereas similarity due to homology decays. the hierarchical clustering pattern to various extents (ﬁg. 5). Thus, transcriptomes reﬂect both the homology of cell and This is consistent with the previous observation that tissue tissue types, as well as the effect of correlated evolution within transcriptomes can cluster by species or by tissue homology each lineage. under different conditions (Sudmant et al. 2015; Gu 2016). Clustering by species can occur even when the tissues are Clustering by Tissue Type Is Recovered by Excluding Genes more ancient than the tissues under comparison, as has with High Correlated Evolution been shown in a number of cases (Lin et al. 2014; Tschopp et al. 2014; Sudmant et al. 2015; Gu 2016). Similarly, we The observation above suggest that the effect of correlated observed the species clustering pattern in the reanalysis of evolution needs to be carefully considered when using hier- three data sets described earlier, and the pattern persists archical clustering or phylogenetic reconstruction methods with different gene expression metrics, including with both with comparative RNA-seq data. To dissect the contribution normalized TPM and RPKM relative expression values. For ex- of each gene to correlated transcriptome evolution, we esti- ample, lung and spleen evolved prior to the common ancestor mated per-gene LCE (c ) for each one-to-one ortholog i,by of birds and mammals, yet we found lung and spleen samples calculating the correlation between its expression contrasts in from chicken and mouse clustered by species, rather than two tissues along the phylogenetic tree (Felsenstein 1985). A tissue identity (ﬁg. 6A). limitation for estimating per-gene LCE is species number. As a To explore the prevalence of clustering by species, and its result, we calculated per-gene LCE estimates only for the association with LCE in real data, we determined the cluster- Brawand data set, which contained the largest number of ing pattern for every pair of tissue and species in the Brawand, species (ﬁg. 3B). However, we only included gene expression Genome Biol. Evol. 10(2):538–552 doi:10.1093/gbe/evy016 Advance Access publication January 23, 2018 547 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/538/4823540 by Ed 'DeepDyve' Gillespie user on 16 March 2018 A B C D E Liang et al. GBE aubp brain vs cerebellum 0.75 brain/cerebelum vs liver 98 98 Brawand 0.50 Merkin 1000 Tschopp 0.25 testes comparisons By Species By Tissue 0.00 0 100 200 300 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 SplitTime primates opossum mouse platypus primates opossum mouse platypus chicken chicken forebrain cerebellum forebrain cerebellum 1.0 1.0 0.9 0.9 0.8 0.8 0.7 0.7 0.6 0.6 0.5 0.5 FIG.6.—Clustering by homology can be recovered by excluding genes with high LCE. (A) Hierarchical clustering of mouse and chicken lung and spleen illustrating examples of tissues clustering by species. Numbers at nodes are conﬁdence values (%) calculated by the package using two different methods. (B) LCE versus clustering pattern of tetrads with samples from two species and two homologous tissue types. Higher LCE are associated with larger chance of clustering by species, as is the age of the lineage split. With higher LCE and long evolution, correlated evolution leads to the accumulation of species speciﬁc similarities among tissues and thus to clustering by species. (C) Upper: Bar-plot of the number of correlated (high LCE) genes identiﬁed in all tissue comparisons in Brawand data set. The criteria for highly correlated genes are: q value < 0.05 (BH correction) and ^c > 0:5. The x-axis represents different tissue comparisons. Lower: Histogram of how many times each one-to-one orthologs are identiﬁed as correlated genes in all tissue pairs of Brawand dataset. Only a minority of genes have high LCE in a majority of tissue pairs. (D) Heatmap and hierarchical clustering of forebrain and cerebellum samples from Brawand data using all one-to-one orthologs. Outside the primates samples cluster by species. (E) Samples cluster by tissue types when genes with high LCE are excluded in the hierarchical clustering analysis. contrasts outside the primate group to avoid biases from forebrain–cerebellum comparison. We then identiﬁed individ- dense sampling in one clade and short branch lengths, which ual genes with high LCE for each tissue pair using FDR < 0.05 could result in inaccurate estimation of variance between spe- (Benjamini–Hochberg procedure) and with effect size cies gene expression (Garland et al. 1992; Ackerly 2000). c^ > 0:5. The comparison of forebrain and cerebellum yielded We then tested whether our estimate of per-gene LCE is the largest number of highly correlated genes (n ¼ 4; 099), signiﬁcant for all tissue pairs (see Materials and Methods). We whereas tissue comparisons with testes yielded limited num- estimated the portion of genes that belong to the true null ber of highly correlated genes (average n ¼ 24; ﬁg. 6C). The hypothesis (p ), that is, do not show correlated evolution, top enriched GO terms in forebrain–cerebellum speciﬁc cor- using the P value distribution of all analyzed genes. In related genes (n ¼ 631) are related to pan-neuronal functions Brawand data, p varies from 0.08 (forebrain–cerebellum) such as ion transmembrane transport and chemical synaptic to 0.64 (liver–testes) (supplementary table S6, transmission (supplementary table S7, Supplementary Supplementary Material online), indicating that in this analy- Material online). We also identiﬁed 1,190 genes were highly sis, a large fraction of one-to-one orthologous genes are inﬂu- correlated in at least ﬁve tissue comparisons. GO analysis of enced by correlated evolution. This included 36% of the these genes showed enrichment for cell cycle, metabolic, and genes in the liver–testes comparison and 92% in the RNA processing related GO terms (supplementary table S8, 548 Genome Biol. Evol. 10(2):538–552 doi:10.1093/gbe/evy016 Advance Access publication January 23, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/538/4823540 by Ed 'DeepDyve' Gillespie user on 16 March 2018 chicken_lung chicken_spleen mouse_lung mouse_spleen gamma Freq correlated genes Pervasive Correlated Evolution in Gene Expression GBE Supplementary Material online). Finally, we reﬁned the hier- inﬂuences the degree of correlated variation in differentiated archical clustering analysis of cell type transcriptomes by ex- tissue. Lung and colon (c¼0.50), both endodermal deriva- cluding genes with high LCE. Whereas forebrain and tives, were highly correlated relative to other differentiated cerebellum samples cluster by species outside the primates tissues despite their very distinct functions. These ﬁndings using all one-to-one orthologous genes (ﬁg. 6D), their clus- along with the fact that our estimates are consistent across tering pattern reveals tissue homology after removing genes various data sets and consistent with respect to gene expres- with high LCE (ﬁg. 6E). Hence, identiﬁcation and elimination sion normalization methods (ﬁg. 3) suggest that estimated of genes with high LCE can be used to recover the phyloge- LCE reﬂect the biological nature of the tissues under compar- netic history of cell and tissue types. ison, rather than artifacts of genome quality, transcripts an- notation, or batch effects. The overall pattern of gene expression correlations described here would require a quite Discussion peculiar set of artifacts to generate the biologically plausible In this study, we analyzed patterns of transcriptome variation differences. among different cell and tissue types and species using sto- The second possible explanation for the correlated gene chastic models of gene expression evolution. We found strong expression variation is what has been called “concerted” evo- signals of correlated gene expression variation among differ- lution of gene expression (Musser and Wagner 2015). ent cell and tissue types. This pattern can arise for a number of Concerted gene expression evolution occurs when cell types reasons. The principal ones are: 1) Artifacts in measuring gene share parts of their gene regulatory networks. Thus, muta- expression in different species causing species speciﬁc differ- tions in certain cis- or trans-regulatory factors can affect gene ences in estimated expression level (Gilad 2015; Sudmant expression in more than one cell type. This is a special case of et al. 2015); 2) mechanistic reasons, where either gene regu- pleiotropic effects, a widely recognized pattern for all kinds of latory communalities among cell types leads to “concerted” characters (Stearns 2010; Wagner and Zhang 2011) and thus evolution of gene expression due to pleiotropic effects of plausibly also for gene expression in different cell types. A mutations (Musser and Wagner 2015) or species speciﬁc related explanation is that correlated evolution is expected if physiological differences affecting gene expression (Lin et al. species differ in their physiological states in a way that affects 2014) (anonymous reviewer). We address these possible gene expression in the two cell types studied. For instance, causes of correlated gene expression variation below. differences in hormone levels, diet, or preferred temperature As with any quantitative method, RNA-seq can be subject can lead to species-speciﬁc differences in gene expression af- to artifacts, and some sources of artifacts have been identiﬁed fecting multiple cells. Correlated evolutionary changes caused (Marioni et al. 2008; Seqc Maqc-Iii Consortium 2014; Gilad by any of these molecular and physiological factors can be 2015). For instance, batch effects can lead to systematic dif- called concerted evolution. The term “concerted evolution” ferences between data sets and thus affect downstream anal- originated in the study of nucleotide sequence evolution ysis. However, we consider it unlikely that the main result of (Dover 1993; Elder and Turner 1995; Liao 1999), but more this study, pervasive and strong correlated variation in gene generally we use the term here for the “sharing” of muta- expression, is caused solely by known artifacts. There are a tional effects among different cell types, tissues, and organs number of features of our results that are hard to explain as (Musser and Wagner 2015). This interpretation predicts that artifacts. One is that the same tissues from different species tissue speciﬁc genes do not contribute to correlated evolution. and data from different laboratories leads to statistically indis- Consistent with this, we observed signiﬁcantly decreased LCE tinguishable results with respect to the estimated LCE (ﬁg. 3D when analyzing tissue speciﬁc genes only (supplementary ﬁg. and E). Secondly, artifacts due to different genome quality S2, Supplementary Material online, Welch’s t-test P and transcript annotations should affect estimates of different value < 10 ). cell and tissues in the same way, resulting in statistically indis- One obvious reason for correlated evolution is the fact that tinguishable estimates of LCE between different tissue pairs. nearly all cell types share certain metabolic and structural In contrast, we ﬁnd that the LCE estimates varied across dif- “housekeeping” genes, and thus evolutionary changes in ferent tissue pairs, and is generally higher between tissues their expression would contribute to correlated gene expres- with similar cell biology (ﬁg. 4). For instance, our highest esti- sion variation. However, our ﬁnding that different tissue types mates were among early developing limb, genital, and tail from the same set of species (e.g., human and mouse) can bud tissue. At this early stage of development, these anlagen have very different LCE (ﬁgs. 3E and 4) suggests that corre- largely consist of undifferentiated mesenchymal cells. lated evolution does not solely originate from globally shared Moreover, two of the highest measures of correlated evolu- “housekeeping” genes. Instead, the correlated evolution sig- tion were for forebrain and cerebellum (c¼0.71), and heart nals are heavily inﬂuenced by the nature of the tissues and and skeletal muscle (c¼0.47). Both tissue pairs express com- cells compared, suggesting that the structure of the gene mon sets of pan-neuronal and contractile genes, respectively. regulatory networks active in these cells affects the strength Furthermore, we found evidence that developmental origin of correlated evolution. For instance, our ﬁnding of relatively Genome Biol. Evol. 10(2):538–552 doi:10.1093/gbe/evy016 Advance Access publication January 23, 2018 549 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/538/4823540 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Liang et al. GBE high LCE between forebrain and cerebellum is consistent with Correcting for correlated gene expression evolution is chal- ﬁndings in other animals that common transcription factors lenging because correlated evolution is occurring at different regulate pan-neuronal gene expression across different neu- levels in different tissues. In this study, we outline a potential ronal cell types (Ruvinsky et al. 2007; Stefanakis et al. 2015). solution to this problem. We show that correlations of phylo- Previous studies have shown that gene expression in testes genetic contrasts (Felsenstein 1985), with large enough taxon has a high degree of tissue-speciﬁcity (Lin et al. 2014; Yue samples, can identify genes with high contributions to the et al. 2014), is highly divergent among mammals relative to correlated evolution signal. Excluding those genes from a hi- other tissues (Brawand et al. 2011), and is under directional erarchical clustering analysis leads to a result reﬂecting tissue selection (Khaitovich et al. 2005, 2006). Our results suggest homology (ﬁg. 6D and E). This highlights the importance of that a key factor in explaining the distinct evolutionary history sampling more taxa in evolutionary studies of cell or tissue of testes is its ability to evolve gene expression independently types. Genes that are less vulnerable to correlated evolution in of other organs, that is, gene regulatory individuality. tissues of interest can be identiﬁed from a group of species Here, we acknowledge that most samples in our analysis with known tissue homology. These genes allow for the clas- contain more than one cell type, and cell types may be shared siﬁcation of tissues with uncertain homology using unsuper- among different tissues. These may contribute to LCE esti- vised learning methods, such as PCA and hierarchical mated for tissue types. Never the less, tissue transcriptomes clustering (ﬁg. 6D and E), thus facilitating the study of the can still serve as a surrogate of gene expression proﬁle of origination of novel cell and tissue types. It will be of value to major cell types in that tissue type. Particularly, although testes develop analytical tools to systematically correct for the effects share epithelial and mesenchymal cell types with other tissues, of correlated gene expression evolution and thus enable the no signiﬁcant LCE was observed between testes and other broad application of the comparative method with respect to tissues. In addition, genes with strongest LCE signal between cell and tissue transcriptomes. cerebellum and forebrain are highly enriched for synaptic It will also be useful to extend phylogenetic independent functions rather than structural or supportive functions by contrasts (PIC) to models other than Brownian motion. sharing some glial cells. Previous simulation results suggest that PIC outperforms other Overall, it is important to note that the empirical results we methods under OU model but show decreased accuracy report here identify a statistical pattern, correlated gene ex- (Martins and Garland 1991; Diaz-Uriarte and Garland 1996). pression variation between cell and tissue types, which can Moreover, there is currently a debate regarding whether have a variety of explanations. In this study, we did not spe- Brownian motion is adequate for modeling gene expression ciﬁcally investigate the actual cause for these patterns, which evolution (Gilad et al. 2006; Fay and Wittkopp 2008). Recently needs to be done on a case by case basis. However, our results diverged taxa, such as primates, may obey assumptions in the indicate that this pattern is less likely to be caused by artifacts Brownian motion model (Khaitovich et al. 2004; Khaitovich et al. or some global biological factor. Further empirical investiga- 2006; Yang et al. 2017). In other cases, transcriptome diver- tions are needed to reveal the mechanistic basis for this gence has been observed to saturate (Bedford and Hartl 2009; phenomenon. Kin et al. 2016; Metzger et al. 2017) (Brawand data, supplemen- tary ﬁg. S3, Supplementary Material online), which may be ad- equately modeled with an OU model (supplementary methods, Modeling Correlated Evolution Is Essential for eqs. S6 and S7, Supplementary Material online). Thus, a theo- Reconstructing Cell Type Phylogeny retical study of comparative methods under the OU model may Our study raises several issues of practical importance for advance our understanding of correlated evolution. comparative transcriptomics. It is clear that many evolutionary changes in gene expression are not limited to a single cell Supplementary Material type, making cross species comparisons difﬁcult. This fact Supplementary data are available at Genome Biology and has hampered attempts to use gene expression data to assess Evolution online. the homology of cells and tissues across species, because straight forward clustering of gene expression proﬁles from different species can lead to obviously wrong results. For ex- Acknowledgments ample, Tschopp et al. (2014) showed that clustering of tran- scriptomes is unable to reproduce the fact that limb buds of We are thankful for the contributions of Allan Baker, principal mice and lizards are homologous and thus also prevented the investigator of the Emu Genome Project, who died unexpect- assessment of the hypothesized serial homology of limbs and edly in November 2014. Allan leaves behind a long legacy in external genitalia (penis/clitoris). Hence, it would be desirable the ﬁelds of ornithology, evolutionary biology, and avian con- to correct for the effects of correlated variation when servation. He will be greatly missed. We would like to thank attempting to reconstruct the evolutionary history of cell Thomas A. Stewart, Oliver W. Grifﬁth, and Arun R. Chavan and tissue types from comparative transcriptome data. for their thoughtful and helpful review of the manuscript. CL 550 Genome Biol. Evol. 10(2):538–552 doi:10.1093/gbe/evy016 Advance Access publication January 23, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/538/4823540 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Pervasive Correlated Evolution in Gene Expression GBE Grell KG, Ruthmann A. 1991. Placozoa. In: Harrison FW, Ruppert EE, gratefully acknowledges the receipt of a graduate fellowship editors. Microscopic anatomy of invertebrates. New York: Wiley-Liss. from China Scholarship Council and the support of the p. 1–5, 7, 9, 10, 12–15. Raymond and Beverly Sackler Institute. JM gratefully acknowl- Gu X. 2016. Understanding tissue expression evolution: from expression edges funding from the Dillon and Mary Ripley Graduate phylogeny to phylogenetic network. Brief Bioinform 17(2):249–254. Fellowship Fund at Yale University and the National Science Gu X, Ruan H, Su Z, Zou Y. 2017. Brownian model of transcriptome evolution and phylogenetic network visualization between tissues. Foundation Graduate Research Fellowship under grant No. Mol Phylogenet Evol. 114:34–39. DGE-1122492. The research was supported by the W. R. Hamburger V, Hamilton HL. 1951. A series of normal stages in the devel- Coe Fund at Yale University, the John Templeton opment of the chick embryo. J Morphol. 88:49–92. Foundation grant #54860, NSF grant # 14-001273, and the Harris MP, Fallon JF, Prum RO. 2002. Shh-Bmp2 signaling module and the Yale University Science Development Fund. The opinions evolutionary origin and diversiﬁcation of feathers. J Exp Zool. 294(2):160–176. expressed in this article are not those of the John Hedges SB, Dudley J, Kumar S. 2006. TimeTree: a public knowledge-base of Templeton Foundation. RNAseq data for bird skin appen- divergence times among organisms. Bioinformatics 22(23):2971–2972. dages was produced in the Yale Center of Genome Analysis. Khaitovich P, Enard W, Lachmann M, Paabo S. 2006. Evolution of primate gene expression. Nat Rev Genet. 7(9):693–702. Khaitovich P, et al. 2004. A neutral model of transcriptome evolution. PLoS Literature Cited Biol. 2(5):E132. Achim K, et al. 2015. High-throughput spatial mapping of single-cell RNA- Khaitovich P, et al. 2005. Parallel patterns of evolution in the genomes and seq data to tissue of origin. Nat Biotechnol. 33(5):503–509. transcriptomes of humans and chimpanzees. Science Ackerly DD. 2000. Taxon sampling, correlated evolution, and independent 309(5742):1850–1854. contrasts. Evolution 54(5):1480–1492. Kin K, et al. 2016. The transcriptomic evolution of mammalian pregnancy: Arendt D. 2008. The evolution of cell types in animals: emerging principles gene expression innovations in endometrial stromal ﬁbroblasts. from molecular studies. Nat Rev Genet. 9(11):868–882. Genome Biol Evol. 8:2459–2473. Arendt D, et al. 2016. The origin and evolution of cell types. Nat Rev Kin K, Nnamani MC, Lynch VJ, Michaelides E, Wagner GP. 2015. Cell-type Genet. 17:744–757. phylogenetics and the origin of endometrial stromal cells. Cell Rep. Bedford T, Hartl DL. 2009. Optimization of gene expression by natural 10(8):1398–1409. selection. Proc Natl Acad Sci U S A. 106(4):1133–1138. Liao D. 1999. Concerted evolution: molecular mechanism and biological Brawand D, et al. 2011. The evolution of gene expression levels in mam- implications. Am J Hum Genet. 64(1):24–30. malian organs. Nature 478(7369):343–348. Lin S, et al. 2014. Comparison of the transcriptional landscapes between Brunet T, et al. 2016. The evolutionary origin of bilaterian smooth and human and mouse tissues. Proc Natl Acad Sci U S A. striated myocytes. Elife 5. doi: 10.7554/eLife.19607. 111(48):17224–17229. Diaz-Uriarte R, Garland T. 1996. Testing hypotheses of correlated evolu- Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. 2008. RNA-seq: tion using phylogenetically independent contrasts: sensitivity to devia- an assessment of technical reproducibility and comparison with gene tions from Brownian motion. Syst Biol. 45:27–47. expression arrays. Genome Res. 18(9):1509–1517. Di-Poı¨ N, Milinkovitch MC. 2016. The anatomical placode in reptile scale Martins EP, Garland T. 1991. Phylogenetic analyses of the correlated evo- morphogenesis indicates shared ancestry among skin appendages in lution of continuous characters – a simulation study. Evolution amniotes. Sci Adv. 2:e1600708. 45(3):534–557. Dover GA. 1993. Evolution of genetic redundancy for advanced players. Merkin J, Russell C, Chen P, Burge CB. 2012. Evolutionary dynamics of Curr Opin Genet Dev. 3(6):902–910. gene and isoform regulation in Mammalian tissues. Science Eisenberg E, Levanon EY. 2013. Human housekeeping genes, revisited. 338(6114):1593–1599. Trends Genet. 29(10):569–574. Metzger BPH, Wittkopp PJ, Coolon JD. 2017. Evolutionary dynamics of Elder JF, Jr, Turner BJ. 1995. Concerted evolution of repetitive DNA regulatory changes underlying gene expression divergence among sequences in eukaryotes. Q Rev Biol. 70(3):297–320. Saccharomyces species. Genome Biol Evol. 9(4):843–854. Ereskovsky AV. 2010. Comparative Embryology of Sponges. New York: Meucci A. 2009. Review of statistical arbitrage, cointegration, and multi- Springer Publishing. variate Ornstein-Uhlenbeck. Available at SSRN: https://ssrn.com/ Fay JC, Wittkopp PJ. 2008. Evaluating the role of natural selection in the abstract¼1404905.or http://dx.doi.org/10.2139/ssrn.1404905;last evolution of gene regulation. Heredity 100(2):191–199. accessed January 24, 2018. Felsenstein J. 1985. Phylogenies and the comparative method. Am Nat. Musser JM, Wagner GP. 2015. Character trees from transcriptome data: 125(1):1–15. origin and individuation of morphological characters and the so-called Felsenstein J. 1988. Phylogenies and quantitative characters. Annu Rev “species signal”. J Exp Zool B Mol Dev Evol. 324:588–604. Ecol Syst. 19(1):445–471. Musser JM, Wagner GP, Prum RO. 2015. Nuclear beta-catenin localization Felsenstein J. 2004. Inferring phylogenies. Sunderland (MA): Sinauer supports homology of feathers, avian scutate scales, and alligator Associates. scales in early development. Evol Dev. 17(3):185–194. Garland T, Harvey PH, Ives AR. 1992. Procedures for the analysis of com- Oakley TH, Gu Z, Abouheif E, Patel NH, Li WH. 2005. Comparative meth- parative data using phylogenetically independent contrasts. Syst Biol. ods for the analysis of gene-expression evolution: an example using 41(1):18–32. yeast functional genomic data. Mol Biol Evol. 22(1):40–50. Ghanbarian AT, Hurst LD. 2015. Neighboring genes show correlated evo- Pagel M. 1994. Detecting correlated evolution on phylogenies – a general- lution in gene expression. Mol Biol Evol. 32(7):1748–1766. method for the comparative-analysis of discrete characters. Proc R Soc Gilad Y, Oshlack A, Rifkin SA. 2006. Natural selection on gene expression. B Biol Sci. 255(1342):37–45. Trends Genet. 22(8):456–461. Pagel M, Meade A. 2006. Bayesian analysis of correlated evolution of Gilad YM-MO. 2015. A reanalysis of mouse ENCODE comparative gene discrete characters by reversible-jump Markov chain Monte Carlo. expression data. F1000Res. 4:121. Am Nat. 167(6):808–825. Genome Biol. Evol. 10(2):538–552 doi:10.1093/gbe/evy016 Advance Access publication January 23, 2018 551 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/538/4823540 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Liang et al. GBE Pankey MS, Minin VN, Imholte GC, Suchard MA, Oakley TH. 2014. Syed T, Schierwater B. 2002. Trichoplax adhaerens: discovered as a missing Predictable transcriptome evolution in the convergent and complex link, forgotten as a hydrozoan, re-discovered as a key to metazoan bioluminescent organs of squid. Proc Natl Acad Sci U S A. evolution. Vie Milieu-Life Environ. 52:177–187. 111(44):E4736–E4742. Tschopp P, et al. 2014. A relative shift in cloacal location repositions ex- Pavlicev M, et al. 2017. Single-cell transcriptomics of the human placenta: ternal genitalia in amniote evolution. Nature 516(7531):391–394. inferring the cell communication network of the maternal-fetal inter- Valentine JW, Collins AG, Meyer CP. 1994. Morphological complexity in- face. Genome Res. 27(3):349–361. crease in metazoans. Paleobiology 20(02):131–142. Pavlicev M, Widder S. 2015. Wiring for independence: positive feedback Vickaryous MK, Hall BK. 2006. Human cell type diversity, evolution, motifs facilitate individuation of traits in development and evolution. J development, and classiﬁcation with special reference to cells Exp Zool B Mol Dev Evol. 324(2):104–113. derived from the neural crest. Biol Rev Camb Philos Soc. 81(3): Prum RO, Brush AH. 2002. The evolutionary origin and diversiﬁcation of 425–455. feathers. Q Rev Biol. 77(3):261–295. Wagner GP. 2014. Homology, genes, and evolutionary innovation. Prum RO, Williamson S. 2001. Theory of the growth and evolution of Princeton, New Jersey: Princeton University Press. feather shape. J Exp Zool 291(1):30–57. Wagner GP, et al. 2008. Pleiotropic scaling of gene effects and the ‘cost of Pu L, Brady S. 2010. Systems biology update: cell type-speciﬁc transcrip- complexity’. Nature 452(7186):470–472. tional regulatory networks. Plant Physiol. 152(2):411–419. Wagner GP, Kin K, Lynch VJ. 2012. Measurement of mRNA abundance Ruvinsky I, Ohler U, Burge CB, Ruvkun G. 2007. Detection of using RNA-seq data: RPKM measure is inconsistent among samples. broadly expressed neuronal genes in C. elegans.Dev Biol. Theory Biosci. 131(4):281–285. 302(2):617–626. Wagner GP, Zhang J. 2011. The pleiotropic structure of the genotype- Seqc Maqc-Iii Consortium 2014. A comprehensive assessment of RNA-seq phenotype map: the evolvability of complex organisms. Nat Rev accuracy, reproducibility and information content by the Sequencing Genet. 12(3):204–213. Quality Control Consortium. Nat Biotechnol. 32(9):903–914. Wang Z, Young RL, Xue H, Wagner GP. 2011. Transcriptomic analysis of Sieburth D, et al. 2005. Systematic analysis of genes required for synapse avian digits reveals conserved and derived digit identities in birds. structure and function. Nature 436(7050):510–517. Nature 477(7366):583–586. Simpson TL. 1984. The cell biology of sponges. New York: Springer New Yanai I, Graur D, Ophir R. 2004. Incongruent expression proﬁles between York. p. 1 online resource. human and mouse orthologous genes suggest widespread neutral Stearns FW. 2010. One hundred years of pleiotropy: a retrospective. evolution of transcription control. Omics 8(1):15–24. Genetics 186(3):767–773. Yang JR, Maclean CJ, Park C, Zhao H, Zhang J. 2017. Intra- and inter- Stefanakis N, Carrera I, Hobert O. 2015. Regulatory logic of Pan-neuronal speciﬁc variations of gene expression levels in yeast are largely neutral: gene expression in C. elegans. Neuron 87(4):733–750. (Nei Lecture, SMBE 2016, Gold Coast). Mol Biol Evol. 34:2125–2139. Steinmetz PR, et al. 2012. Independent evolution of striated muscles in Yue F, et al. 2014. A comparative encyclopedia of DNA elements in the cnidarians and bilaterians. Nature 487(7406):231–234. mouse genome. Nature 515(7527):355–364. Sudmant PH, Alexis MS, Burge CB. 2015. Meta-analysis of RNA-seq expression data across species, tissues and studies. Genome Biol. 16:287. Associate editor: George Zhang 552 Genome Biol. Evol. 10(2):538–552 doi:10.1093/gbe/evy016 Advance Access publication January 23, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/538/4823540 by Ed 'DeepDyve' Gillespie user on 16 March 2018
Genome Biology and Evolution – Oxford University Press
Published: Feb 1, 2018
It’s your single place to instantly
discover and read the research
that matters to you.
Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.
All for just $49/month
Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly
Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.
Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.
Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.
All the latest content is available, no embargo periods.
“Hi guys, I cannot tell you how much I love this resource. Incredible. I really believe you've hit the nail on the head with this site in regards to solving the research-purchase issue.”Daniel C.
“Whoa! It’s like Spotify but for academic articles.”@Phil_Robichaud
“I must say, @deepdyve is a fabulous solution to the independent researcher's problem of #access to #information.”@deepthiw
“My last article couldn't be possible without the platform @deepdyve that makes journal papers cheaper.”@JoseServera