Influence of Effective Population Size on Genes under Varying Levels of Selection Pressure

Influence of Effective Population Size on Genes under Varying Levels of Selection Pressure The ratio of diversities at amino acid changing (nonsynonymous) and neutral (synonymous) sites (x ¼ p /p ) is routinely used to N S measure the intensity of selection pressure. It is well known that this ratio is influenced by the effective population size (N )and selection coefficient (s). Here, we examined the effects of effective population size on x by comparing protein-coding genes from Mus musculus castaneus and Mus musculus musculus—two mouse subspecies with different N . Our results revealed a positive relationship between the magnitude of selection intensity and the x estimated for genes. For genes under high selective constraints, the x estimated for the subspecies with small N (M. m. musculus) was three times higher than that observed for that with large N e e (M. m. castaneus). However, this difference was only 18% for genes under relaxed selective constraints. We showed that the observed relationship is qualitatively similar to the theoretical predictions. We also showed that, for highly expressed genes, the x of M. m. musculus was 2.1 times higher than that of M.m. castaneus and this difference was only 27% for genes with low expression levels. These results suggest that the effect of effective population size is more pronounced in genes under high purifying selection. Hence the choice of genes is important when x is used to infer the effective size of a population. Key words: population size effect, deleterious mutations, amino acid diversity, gene expression and population genetics theory. Introduction 2012; Tsagkogeorga et al. 2012; Gayral et al. 2013; Harrang The ratio of diversities (x ¼ p /p ) at nonsynonymous (p )and et al. 2013; Romiguier et al. 2014; James et al. 2017). In N S N synonymous (p ) sites of protein-coding genes reveal the in- contrast, x estimates also vary significantly between genes tensity of natural selection on genes (Li 1997). Furthermore, x of a genome, which reflects the magnitude of selection on suggests the fraction of nonsynonymous single nucleotide them (Bustamante et al. 2005). However, it is unclear how variations (SNVs) segregating in a population with respect to and to what extent the difference in effective population sizes synonymous SNVs. This also suggests that a fraction of non- influences the x of various genes that are under different synonymous SNVs has been eliminated from a population levels of selection constraints. owing to their deleterious nature when x < 1. It is well To examine this, we assembled the genome-wide SNV known that x is dictated by the product of effective popula- data from two subspecies of mouse: Mus musculus castaneus tion size (N ) and selection coefficient (s)(Kimura 1983). and Mus musculus musculus. These two species were esti- Population genetic theories predict that x estimated for large mated to be diverged 500,000 years ago (Geraldes et al. populations tend to be smaller than those obtained for small 2008; Duvaux et al. 2011) and have lived in reproductive iso- populations (Ohta 1993). This is because a much higher frac- lation (Good et al. 2008). The effective population sizes were tion of deleterious SNVs is removed in large populations com- estimated to be 580,000 (200,000–733,000) and 76,000 pared with small populations owing to the difference in the (25,000–120,000) (Salcedo et al. 2007; Geraldes et al. efficacy of selection between them. Therefore, the overall 2008; Halligan et al. 2010). Since these species diverged variation in x observed for different populations suggest only recently and live in similar habitats (commensal with the potential difference in their effective population sizes. humans) but differ only in N our results are not confounded Hence x is routinely used to infer the difference in N be- by the difference in other factors such as physiology, genetics, tween populations (Strasburg et al. 2011; Phifer-Rixey et al. and ecology between the two groups compared. We first The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non- commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com 756 Genome Biol. Evol. 10(3):756–762. doi:10.1093/gbe/evy047 Advance Access publication February 21, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/3/756/4898785 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Population Size Effect on Gene Evolution GBE examined the theoretical relationship between s and x for mapped reads (FPKM) for each gene from 19 tissues were different effective population sizes. Using genome-wide averaged and a log-transformed mean was used for further SNVs from the two mice we then investigated the empirical analysis. We used only the genes that were expressed in all relationship by using the proportion of constrained sites and tissues. This data set included 6,733 mouse genes, which level of gene expression as proxies for the magnitude of se- were sorted based on their FPKM values that represent the lection (s). level of gene expression. These genes were then grouped by taking 500 genes in each of the 12 categories with FPKM values ranging: >25, 15–25, 11–15, 9–11, 7.4–9, 6.3–7.4, Materials and Methods 5.4–6.3, 4.6–5.4, 3.9–4.6, 3.2–3.9, 2.6–3.2, 2.0–2.6, and Whole genome genotype data for 19 autosomes of Mus the 13th category contains the remaining 733 genes musculus castaneus and Musmusculusmusculus were down- with <2.0 FPKM. loaded from the data repository (http://wwwuser.gwdg.de/ To determine the significance of positive selection we ex- evolbio/evolgen/wildmouse/) of a previous study (Harr amined whether the number of nonsynonymous SNVs per et al. 2016). The genome-wide SNVs data were available for nonsynonymous site (pN) is significantly higher than the num- 8(n ¼ 16) and 10 (n ¼ 20) individuals of M.m. musculus and ber of synonymous SNVs per synonymous site (pS)for a gene. M.m. castaneus, respectively. Using the software program For this purpose, we obtained the Jukes–Cantor variance for SNPEff (http://snpeff.sourceforge.net/) functional annotations the two measures and used a Z-test to examine the signifi- were inferred (Cingolani et al. 2012). We then extracted only cance. We used the nonparametric Spearman rank correla- SNVs affecting coding regions and introns. We also down- tion to test the strength and significance of observed loaded the mouse reference genome (GRCm38/mm10) se- correlations. quence and the annotation file (refGene.txt) from the UCSC For mutation rate analysis, we used the synonymous sub- genome browser data resource (https://genome.ucsc.edu/). stitution rate as the proxy for mutation rate of each gene and Using the annotations, we extracted the protein-coding sorted 16,285 genes based on this. We separated top and gene sequences and their chromosomal locations. The num- bottom 30% of the genes (4,886 in each category) and des- ber of synonymous, nonsynonymous, and intron positions ignated them as slow and fast evolving genes, respectively. were also extracted from the annotation file. Using the above, Within each group, we further separated the genes with- we estimated the diversity at nonsynonymous, synonymous, >50% and <10% constrained sites as constrained and re- and intron sites. For this purpose, we use the mean pairwise laxed genes. For recombination analysis, we obtained the differences per site (p)(Tajima 1983). We also downloaded fine-scale map of recombination rates created for the mouse the mouse-rat genome alignment from the UCSC genome genome by a previous study (Brunschwig et al. 2012), which browser and extracted the alignments for each protein- could be mapped to 8,205 genes of our data set. We calcu- coding gene. We estimated synonymous divergence for lated the mean recombination rate for each protein-coding each gene using the likelihood based method implemented gene and sorted the genes based on this. We then separated in the codeml program of the PAML package (Yang 2007). the top and bottom 30% of the genes (2,461 in each cate- We obtained the basewise conservation scores (PhyloP) gory) and termed them as low (with a recombination based on 59 vertebrate genomes with mouse (http://hgdown- rate <0.005 N r/kb) andhigh(>0.029 N r/kb) recombining e e load.soe.ucsc.edu/goldenPath/mm10/ phyloP60way/) (Siepel genes, respectively. Within each group, we separated con- et al. 2006). The score was available for each position of strained and relaxed genes as mentioned earlier. the mouse chromosomes. The PhyloP scores were then We examined the theoretical relationship between N s and mapped to on to the protein-coding genes and we desig- x usingKimura’sequation [3.18inpage 45of (Kimura nated any position of a gene with a PhyloP score >2 as con- 1983)]: strained. Based on this criterion, we estimated the proportion 2ðÞ S  1 þ e of constrained sites in each protein-coding gene. Our final x H =H ¼ ; ðÞ 1 T T; 0 data set included 16,285 reference protein-coding genes. ½ SðÞ 1  e These genes were grouped into 13 categories based on the fraction of constrained positions as (number of genes): <10 where S ¼ 4N s. In the above equation, H is the sum of e T (1,219), 10–15 (1,295), 15–20 (1,727), 20–25 (1,952), 25–30 heterozygotes involving a mutant allele over all generations (2,097), 30–35 (1,998), 35–40 (1,799), 40–45 (1,425), 45–50 until either fixation or loss and H denotes mutant with s!0. T,0 (1,094), 50–55 (755), 55–60 (493), 60–65 (277) and >65% Hence H denotes neutral and nonneutral mutations and H T T,0 (154). indicates only neutral mutations. For empirical data, nonsy- We obtained the RNA-seq expression data (http://chromo- nonymous and synonymous mutations represent H and H T T,0 some.sdsc.edu/mouse/download.html) from a previous large- of equation 1. Because a nonsynonymous mutation could be scale study using 19 mouse tissues (Dunham et al. 2012). The deleterious or neutral but a synonymous or intron mutation is values of Fragments Per Kilobase of transcript per million largely neutral in nature. Therefore, we used the ratio of Genome Biol. Evol. 10(3):756–762 doi:10.1093/gbe/evy047 Advance Access publication February 21, 2018 757 Downloaded from https://academic.oup.com/gbe/article-abstract/10/3/756/4898785 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Subramanian GBE heterozygosity or diversity of nonsynonymous and synony- mous mutations as the empirical equivalence of equation 1. Note that equation 1 assumes that the effective population is equal to the actual population (N ¼ N). The difference between the xs from two populations was quantified as: x x x x A B mus cas d ¼ or d ¼ ðÞ 2 x x x x A mus where x and x is estimated for the population with small A B and large effective population sizes (eg. M.m. musculus and M.m. castaneus), respectively. Results To understand the theoretical relationship between s andx for different N , let us assume that we compare two populations A and B with 76,000 and 580,000 as their N ,respectively (to represent the N of M. m. musculus and M. m. castaneus,re- spectively) and s is the mean selection coefficient on the non- synonymous sites of a gene or a collection of genes. The theoretical relationship has been derived by Kimura (1983), which is shown equation 1. Using this formula, we assigned values for s ranging from 0 to 1.7  10 with an increment of 1.3  10 and computed x and x for the hypothetical A B populations with two different N mentioned earlier. Figure 1A shows that the difference between x and x is large when A B negative selection is high and this difference disappears when s approaches zero. When s ¼ –1.7  10 , x estimated for the population A with small effective population size (N s ¼ 1.3) was 3.9 times higher than that estimated for the large popu- lationB(N s ¼ 10). However, this fraction becomes equal be- tween the populations when s is very close to zero (s!0). This is FIG.1.—Theoretical relationship between the ratio of diversity at se- further clear from the positive relationship between selection lected and neutral sites (x ¼ H /H ), effective population size (N )and T T,0 e coefficient and the magnitude of difference between the xs selection coefficient (s). (A) Using equation 1 (see Materials and Methods), (d )(fig. 1B). x estimated for 13 different values of s ranging from 0 to 1.7 10 To examine the influence of population size on a genome (increment of 1.3 10 ) for two populations (A and B) with effective scale, we estimated the nucleotide diversities using the whole population sizes of 76,000 (x ) and 580,000 (x ) to represent those of A B Musmusculusmusculus and M.m. castaneus, respectively. Note when genome data of M.m. castaneus and M.m. musculus popula- s¼ 0 the equation becomes undefined (0/0) and therefore the first two tions. The genome-wide estimates are given in table 1.The columns were based on the assumption of s being infinitesimally small effective population size difference between the two subspe- (s!0), which results in x very close to 1 (x!1). (B) Relationship between cies is evident from the difference in the number of SNVs. selection coefficient (s) and the normalized difference between x and x A B Although nonsynonymous diversity of M.m. castaneus was (d ). 2.3 times higher than that of M.m. musculus, synonymous and intron diversities estimated were 3.3 and 3.6 times higher for the former than the latter. The x estimated for the whole used to represent evolution at constrained and neutral sites, genomes of M.m. musculus population was 31% and 37% which forms the empirical ratio x (p /p ). We grouped genes N S (based on synonymous sites and introns, respectively) higher based on the proportion of constrained sties into 13 catego- than those obtained for M.m. castaneus population. To ex- ries and obtained the mean x for the genes belonging to each amine the empirical relationship between selection intensity category. Cleary, the patterns of relationships in figure 2 are and x, we first computed the proportion of constrained sites similar to those shown in figure 1, suggesting that the ge- for each protein-coding gene as described in methods and nome data provide the empirical proof for the theoretical re- this was used as a proxy for selection intensity (s). The diver- lationship. For highly constrained genes (with >65% sities at nonsynonymous (p ) and synonymous sites (p )were constrained sites) x estimated for M.m. musculus (x ) N S mus 758 Genome Biol. Evol. 10(3):756–762 doi:10.1093/gbe/evy047 Advance Access publication February 21, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/3/756/4898785 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Population Size Effect on Gene Evolution GBE Table 1 Summary Statistics Mus musculus M.m. musculus castaneus SNVs Nonsynonymous 103763 33676 Synonymous 229655 55760 Intron 14831843 3266566 Diversity 6 6 Nonsynonymous (p ) 0.00076 (66.1 10 ) 0.00033 (64.0 10 ) 5 5 Synonymous (p)0.0057(62.7 10 )0.0017(61.5 10 ) 6 6 Intron (p)0.0037(62.1 10 )0.0010(61.1 10 ) p /p 0.13 (0.0012) 0.19 (0.0028) N S p /p 0.20 (0.0016) 0.32 (0.0039) N I Difference between x - d Using synonymous 0.31 (0.008) — P<0.0001 sites Using intron 0.37 (0.011) — P<0.0001 was 0.079 (6 0.017), which was three times higher than that estimated for M.m. castaneus [x ¼ 0.026 (6 0.004)] and cas the difference between the xs was significantly >0 (P ¼ 0.0015, one-tailed Z-test). Whereas, this difference be- tween the xs was only 18% [0.470 (6 0.027) vs. 0.388 (6 0.014)] for genes under relaxed selection pressure (<10% constrained sites) and it was statistically significant (P ¼ 0.0031). This result is further supported by the positive relationship (q ¼ 0.97, P < 0.0001) between the proportion of constrained sites and the normalized difference between x and x (d )estimated for M.m. castaneus and M.m. mus cas x musculus (fig. 2B). FIG.2.—Empirical relationship between the proportion of constrained A number of earlier studies showed that highly expressed positions and x estimated for Musmusculusmusculus (x )and mus genes are under high selection pressure and expression level is M.m. castaneus (x ) using nonsynonymous and synonymous sites. (A) cas a major determinant of protein evolution (Subramanian and Protein-coding genes were grouped into 13 categories based on the propor- Kumar 2004; Drummond et al. 2005; Yang et al. 2012). tion of constrained sites (see Materials and Methods) and average x com- puted for genes belonging to each category are shown. We used the fraction Based on this rationale, we used the level of gene expression of constrained sites as a proxy for selection intensity (s). Error bars shows the as an independent proxy for selection intensity (s) and exam- standard error of the mean. The difference between x and x was mus cas ined its relationship with x. For this purpose, we obtained statistically significant for all categories (at least P< 0.01). (B) Relationship RNA-seq data from a previous study (Dunham et al. 2012). between the proportion of constrained sites in protein-coding genes and Based on the level of expression (in FPKM units) genes were normalized difference betweenx andx (d ) is shown. This relationship mus cas x grouped into 13 categories and the average x was computed is highly significant (q ¼ 0.97, P< 0.0001). Best fitting regression line is for genes belonging to each category. Our results based on shown. expression levels were very similar to those obtained for the proportion of constrained sites (fig. 3). For highly expressed (q ¼ 0.95, P < 0.0001) between expression levels and the genes x estimated for M.m. musculus [x ¼ 0.130 mus normalized difference between x and x (d )provides mus cas x (6 0.014)] was 2.1 times higher than that estimated for confirmatory support for our results (fig. 3B). M.m. castaneus [x ¼ 0.062 (6 0.006)] and the difference cas between the xs was statistically significant (P < 0.0001). Discussion Whereas this difference between the xs was only 27% for Our results suggest that the influence of effective population the genes with low expression levels [0.238 (6 0.015) vs. size is more pronounced in genes under high selection inten- 0.174 (6 0.006)] and it was statistically significant sity. In this study, we first used the proportion of constrained (P ¼ 0.0073). A highly significant positive correlation Genome Biol. Evol. 10(3):756–762 doi:10.1093/gbe/evy047 Advance Access publication February 21, 2018 759 Downloaded from https://academic.oup.com/gbe/article-abstract/10/3/756/4898785 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Subramanian GBE FIG.4.—Theoretical relationship between selection coefficient (s)and normalized difference between x and x (d ) is shown for higher values A B x of s. Note that d changes only when the values of s fall between the nearly neutral range of 0 to 2. M.m. musculus than M.m. castaneus suggests a greater frac- tion of deleterious mutations segregating in the former. This is due to the fact that selection is not efficient in purging dele- terious mutations in small populations. We used synonymous sites as a proxy for neutral evolution. However, previous stud- ies suggested that a fraction of synonymous sites could be under selective constraints (Chamary et al. 2006). If negative selection is assumed in synonymous sites the magnitude of this effect will be more pronounced in populations with large effective sizes and in this case in M.m. castaneus.This will in FIG.3.—Relationship between gene expression levels and x esti- turn expected to further increase the difference between xs mated for Musmusculusmusculus (x )and M.m. castaneus (x )using mus cas (d ) from the two mouse populations. Hence this assumption nonsynonymous and synonymous sites. (A) Genes were grouped into 13 will make our results more conservative. However, we categories based on their expression levels represented by the Fragments addressed this issue by replacing introns for synonymous sites Per Kilobase of transcript per million mapped reads (FPKM) units (see to estimate neutral diversities and found similar results (sup- Materials and Methods) and the average x estimated for genes belonging plementary figs. S1 and S2, Supplementary Material online). to each category are shown. Here, we used the level of gene expression as In figure 1, the theoretical relationship was shown only for an independent proxy for selection intensity (s). Error bars shows the stan- small values of s (0 to 1.7 10 ). However, the corre- dard error of the mean. The difference between x and x was sta- mus cas sponding N s values are 10 and 1.3 for M.m. castaneus tistically significant for all categories (at least P< 0.01). (B) Relationship and M.m. musculus, respectively. Previous studies on the dis- between the gene expression levels and normalized difference between x and x (d ) is shown (q ¼ 0.95, P< 0.0001). Best fitting regression tribution of the fitness effects of mutations in M.m. castaneus mus cas x line is shown. populations suggested that almost 77–80% of the mutations with N s > 10 were lethal or highly deleterious and 20% of sites as a proxy for selection intensity (s), which is straightfor- them (N s < 10) were nearly neutral in nature (Halligan et al. ward. Since the level of gene expression is known to correlate 2010, 2013; Kousathanas and Keightley 2013). It is well with selection intensity (Subramanian and Kumar 2004; known that only nearly neutral or slightly deleterious muta- Drummond et al. 2005; Yang et al. 2012)we used this as tions are influenced by N and both neutral and highly dele- an independent proxy for s. However, both measures pro- terious mutations are independent and are not modulated by duced almost identical patterns of relationships with x.The effective population sizes. This is very clear from figure 4, pattern of our population diversity based results is similar to whichshowsa plateauor anasymptote for d when that reported based on divergence between species s > –2.0  10 . Due to this reason, we chose to show only (Subramanian 2013). Overall, the higher x observed for the nearly neutral range in figure 1. Because the mutations/ 760 Genome Biol. Evol. 10(3):756–762 doi:10.1093/gbe/evy047 Advance Access publication February 21, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/3/756/4898785 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Population Size Effect on Gene Evolution GBE constraints (see Materials and Methods). Similar estimates were obtained for the bottom 30% of the genes with fastest evolutionary rate. The difference between xsof M.m. mus- culus and M.m. castaneus (d ) estimated for slow-evolving constrained genes was 3.2 times higher than that obtained for slow-evolving relaxed genes (0.46 vs. 0.14) (fig. 5A). Similarly, this difference for fast-evolving constrained genes was 2.5 times higher than that estimated for fast-evolving relaxed genes (0.438 vs. 0.176). Similar results were obtained when diversity at introns (instead of synonymous sites) were used to estimate x (supplementary fig. S3A, Supplementary Material online). These results revealed that the effect of ef- fective population size was more pronounced in constrained than in relaxed genes and the magnitude of this effect was largely similar in the fast and slow mutating genes. Therefore, difference in mutation rate between genes is unlikely to affect the main results of this study. Next, to examine the effects of recombination we obtained the fine-scale map of recombination rates from a previous study (Brunschwig et al. 2012) and computed the mean recom- bination rate for each mouse protein-coding gene. Similar to the previous analysis we sorted genes based on their recombi- nation rates and obtained the top and bottom 30% of the genes with low and high recombination rates, respectively. Thedifferencebetween xs(d )of M.m. musculus and M.m. castaneus estimated for low-recombining constrained genes was 2 times higher than that obtained for low-recombining relaxed genes (0.4 vs. 0.2) (fig. 5B). Similarly, this difference for high-recombining constrained genes was also two times higher than that estimated for high-recombining relaxed genes (0.5 vs. 0.25). Comparable results were also obtained when diversity at introns was used to estimate x (supplementary fig. S3B, Supplementary Material online). The above results sug- gest that the magnitude of the effects of N on x (or d )was e x FIG.5.—Normalized difference between x and x (d ) estimated mus cas x similar for genes located in high and low recombining regions. for genes under high and low selective constraints using nonsynonymous Therefore, variation in the rate of recombination between and synonymous sites. (A) Genes evolving under high and low substitution genes do not affect the findings and conclusions of this study. rates (see Materials and Methods). (B) Genes present in high and low The results of this study are under the assumption that the recombining regions. All differences between d of genes under high fraction of adaptive nonsynonymous segregating variations is and low selective constraints were statistically significant at least P< 0.0001. Error bars show the standard error of the mean. negligible. This is because when adaptive sweep occurs non- synonymous SNVs will be quickly fixed and do not contribute variations associated with the observed difference in the xs to the segregating variation. We examined this using our data estimated for M.m. castaneus and M.m. musculus were pre- and found that 147 and 362 genes (M.m. castaneus and dominantly nearly neutral or slightly deleterious in nature. M.m. musculus, respectively) had more number of nonsynon- Theoretical relationship shown in figure 1 was also based ymous SNVs per nonsynonymous site than synonymous SNVs on simple assumptions and did not consider the influence of per synonymous site (or x > 1). . However, the difference was other factors such as mutation and recombination rate differ- statistically significant only for 4 and 5 genes, respectively ence between genes, which might result in different N for (using a Z test—see Materials and Methods). Finally, the the- genes. To examine this effect, first we used the rate of sub- oretical relationship shown in this study (fig. 1)is based on the stitution at synonymous sites as the proxy for mutation rate assumption of no dominance and independence between and sorted genes based on the synonymous divergence be- mutations. However, empirical data include the effects of tween mouse and rat. We then obtained the top 30% of the dominant mutations and interactions/epistasis between genes with slowest evolutionary rate and within this category mutations. Hence these caveats should be noted while infer- we estimated d for the genes under high and low selective ring the empirical results of this study. Genome Biol. Evol. 10(3):756–762 doi:10.1093/gbe/evy047 Advance Access publication February 21, 2018 761 Downloaded from https://academic.oup.com/gbe/article-abstract/10/3/756/4898785 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Subramanian GBE Good JM, Handel MA, Nachman MW. 2008. Asymmetry and polymor- Based on previous theoretical and empirical predictions, phism of hybrid male sterility during the early stages of speciation in comparative population genetic studies generally assume house mice. Evolution 62(1):50–65. the difference in x observed between two populations to Halligan DL, et al. 2013. Contributions of protein-coding and regulatory reflect the variations in their effective population sizes change to adaptive molecular evolution in murid rodents. PLoS Genet. (Strasburg et al. 2011; Phifer-Rixey et al. 2012; Gayral et al. 9(12):e1003995. Halligan DL, Oliver F, Eyre-Walker A, Harr B, Keightley PD. 2010. Evidence 2013; Romiguier et al. 2014; James et al. 2017). However, our for pervasive adaptive protein evolution in wild mice. PLoS Genet. results suggest that the magnitude of this difference is depen- 6(1):e1000825. dent upon the genes being compared. As we have shown Harr B, et al. 2016. Genomic resources for wild populations of the house that comparing genes under relaxed selective constraints will mouse, Mus musculus and its close relative Mus spretus.Sci Data underestimate the actual difference in the effective popula- 3:160075. Harrang E, Lapegue S, Morga B, Bierne N. 2013. A high load of non- tion sizes. Therefore, it is important to consider the selection neutral amino-acid polymorphisms explains high protein diversity de- intensity on the genes when comparing x between popula- spite moderate effective population size in a marine bivalve with tions to infer effective population size. Although our results sweepstakes reproduction. G3 (Bethesda) 3:333–341. are based on protein-coding genes the findings will hold true James J, Castellano D, Eyre-Walker A. 2017. DNA sequence diversity and 0 0 for other constrained noncoding regions such as 3 and 5 the efficiency of natural selection in animal mitochondrial DNA. Heredity (Edinb) 118(1):88–95. untranslated regions, splice sites, up-, and downstream regu- Kimura M. 1983. The neutral theory of molecular evolution. Cambridge: latory elements. Cambridge University Press. Kousathanas A, Keightley PD. 2013. A comparison of models to infer the distribution of fitness effects of new mutations. Genetics Supplementary Material 193(4):1197–1208. Supplementary data areavailableat Genome Biology and Li W-H. 1997. Molecular evolution. Chicago: Sinauer Associates Inc. Evolution online. Ohta T. 1993. An examination of the generation-time effect on molecular evolution. Proc Natl Acad Sci U S A. 90(22):10676–10680. Phifer-Rixey M, et al. 2012. Adaptive evolution and effective population Acknowledgments size in wild house mice. Mol Biol Evol. 29(10):2949–2955. Romiguier J, et al. 2014. Population genomics of eusocial insects: the costs The author acknowledges the support from the Australian of a vertebrate-like effective population size. J Evol Biol. Research Council (LP160100594) and the University of the 27(3):593–603. Sunshine Coast. Salcedo T, Geraldes A, Nachman MW. 2007. Nucleotide variation in wild and inbred mice. Genetics 177(4):2277–2291. Siepel A, Pollard KS, Haussler D. 2006. New methods for detecting Literature Cited lineage-specific selection. In: Apostolico A, Guerra C, Istrail S, Pevzner PA, Waterman M, editors. Research in computational molec- Brunschwig H, et al. 2012. Fine-scale maps of recombination rates and ular biology. RECOMB 2006. Lecture Notes in Computer Science. hotspots in the mouse genome. Genetics 191(3):757–764. Berlin: Springer. p. 190–205. Bustamante CD, et al. 2005. Natural selection on protein-coding genes in Strasburg JL, et al. 2011. Effective population size is positively correlated the human genome. Nature 437(7062):1153–1157. with levels of adaptive divergence among annual sunflowers. Mol Biol Chamary JV, Parmley JL, Hurst LD. 2006. Hearing silence: non-neutral Evol. 28(5):1569–1580. evolution at synonymous sites in mammals. Nat Rev Genet. Subramanian S. 2013. Significance of population size on the fixation of 7(2):98–108. nonsynonymous mutations in genes under varying levels of selection Cingolani P, et al. 2012. A program for annotating and predicting the pressure. Genetics 193(3):995–1002. effects of single nucleotide polymorphisms, SnpEff: sNPs in the ge- Subramanian S, Kumar S. 2004. Gene expression intensity shapes evolu- nome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly tionary rates of the proteins encoded by the vertebrate genome. (Austin) 6(2):80–92. Genetics 168(1):373–381. Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. 2005. Why Tajima F. 1983. Evolutionary relationship of DNA sequences in finite pop- highly expressed proteins evolve slowly. Proc Natl Acad Sci U S A. ulations. Genetics 105(2):437–460. 102(40):14338–14343. Tsagkogeorga G, Cahais V, Galtier N. 2012. The population genomics of a Dunham I, et al. 2012. An integrated encyclopedia of DNA elements in the fast evolver: high levels of diversity, functional constraint, and molec- human genome. Nature 489(7414):57–74. ular adaptation in the tunicate Ciona intestinalis. Genome Biol Evol. Duvaux L, Belkhir K, Boulesteix M, Boursot P. 2011. Isolation and gene 4(8):740–749. flow: inferring the speciation history of European house mice. Mol Yang JR, Liao BY, Zhuang SM, Zhang J. 2012. Protein misinteraction avoid- Ecol. 20(24):5248–5264. ance causes highly expressed proteins to evolve slowly. Proc Natl Acad Gayral P, et al. 2013. Reference-free population genomics from next- Sci U S A. 109(14):E831–E840. generation transcriptome data and the vertebrate-invertebrate gap. Yang Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Mol PLoS Genet. 9(4):e1003457. Biol Evol. 24(8):1586–1591. Geraldes A, et al. 2008. Inferring the history of speciation in house mice from autosomal, X-linked, Y-linked and mitochondrial genes. Mol Ecol. 17(24):5349–5363. Associate editor:George Zhang 762 Genome Biol. Evol. 10(3):756–762 doi:10.1093/gbe/evy047 Advance Access publication February 21, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/3/756/4898785 by Ed 'DeepDyve' Gillespie user on 16 March 2018 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Genome Biology and Evolution Oxford University Press

Influence of Effective Population Size on Genes under Varying Levels of Selection Pressure

Free
7 pages

Loading next page...
 
/lp/ou_press/influence-of-effective-population-size-on-genes-under-varying-levels-LZEtAa22Qy
Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
ISSN
1759-6653
eISSN
1759-6653
D.O.I.
10.1093/gbe/evy047
Publisher site
See Article on Publisher Site

Abstract

The ratio of diversities at amino acid changing (nonsynonymous) and neutral (synonymous) sites (x ¼ p /p ) is routinely used to N S measure the intensity of selection pressure. It is well known that this ratio is influenced by the effective population size (N )and selection coefficient (s). Here, we examined the effects of effective population size on x by comparing protein-coding genes from Mus musculus castaneus and Mus musculus musculus—two mouse subspecies with different N . Our results revealed a positive relationship between the magnitude of selection intensity and the x estimated for genes. For genes under high selective constraints, the x estimated for the subspecies with small N (M. m. musculus) was three times higher than that observed for that with large N e e (M. m. castaneus). However, this difference was only 18% for genes under relaxed selective constraints. We showed that the observed relationship is qualitatively similar to the theoretical predictions. We also showed that, for highly expressed genes, the x of M. m. musculus was 2.1 times higher than that of M.m. castaneus and this difference was only 27% for genes with low expression levels. These results suggest that the effect of effective population size is more pronounced in genes under high purifying selection. Hence the choice of genes is important when x is used to infer the effective size of a population. Key words: population size effect, deleterious mutations, amino acid diversity, gene expression and population genetics theory. Introduction 2012; Tsagkogeorga et al. 2012; Gayral et al. 2013; Harrang The ratio of diversities (x ¼ p /p ) at nonsynonymous (p )and et al. 2013; Romiguier et al. 2014; James et al. 2017). In N S N synonymous (p ) sites of protein-coding genes reveal the in- contrast, x estimates also vary significantly between genes tensity of natural selection on genes (Li 1997). Furthermore, x of a genome, which reflects the magnitude of selection on suggests the fraction of nonsynonymous single nucleotide them (Bustamante et al. 2005). However, it is unclear how variations (SNVs) segregating in a population with respect to and to what extent the difference in effective population sizes synonymous SNVs. This also suggests that a fraction of non- influences the x of various genes that are under different synonymous SNVs has been eliminated from a population levels of selection constraints. owing to their deleterious nature when x < 1. It is well To examine this, we assembled the genome-wide SNV known that x is dictated by the product of effective popula- data from two subspecies of mouse: Mus musculus castaneus tion size (N ) and selection coefficient (s)(Kimura 1983). and Mus musculus musculus. These two species were esti- Population genetic theories predict that x estimated for large mated to be diverged 500,000 years ago (Geraldes et al. populations tend to be smaller than those obtained for small 2008; Duvaux et al. 2011) and have lived in reproductive iso- populations (Ohta 1993). This is because a much higher frac- lation (Good et al. 2008). The effective population sizes were tion of deleterious SNVs is removed in large populations com- estimated to be 580,000 (200,000–733,000) and 76,000 pared with small populations owing to the difference in the (25,000–120,000) (Salcedo et al. 2007; Geraldes et al. efficacy of selection between them. Therefore, the overall 2008; Halligan et al. 2010). Since these species diverged variation in x observed for different populations suggest only recently and live in similar habitats (commensal with the potential difference in their effective population sizes. humans) but differ only in N our results are not confounded Hence x is routinely used to infer the difference in N be- by the difference in other factors such as physiology, genetics, tween populations (Strasburg et al. 2011; Phifer-Rixey et al. and ecology between the two groups compared. We first The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non- commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com 756 Genome Biol. Evol. 10(3):756–762. doi:10.1093/gbe/evy047 Advance Access publication February 21, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/3/756/4898785 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Population Size Effect on Gene Evolution GBE examined the theoretical relationship between s and x for mapped reads (FPKM) for each gene from 19 tissues were different effective population sizes. Using genome-wide averaged and a log-transformed mean was used for further SNVs from the two mice we then investigated the empirical analysis. We used only the genes that were expressed in all relationship by using the proportion of constrained sites and tissues. This data set included 6,733 mouse genes, which level of gene expression as proxies for the magnitude of se- were sorted based on their FPKM values that represent the lection (s). level of gene expression. These genes were then grouped by taking 500 genes in each of the 12 categories with FPKM values ranging: >25, 15–25, 11–15, 9–11, 7.4–9, 6.3–7.4, Materials and Methods 5.4–6.3, 4.6–5.4, 3.9–4.6, 3.2–3.9, 2.6–3.2, 2.0–2.6, and Whole genome genotype data for 19 autosomes of Mus the 13th category contains the remaining 733 genes musculus castaneus and Musmusculusmusculus were down- with <2.0 FPKM. loaded from the data repository (http://wwwuser.gwdg.de/ To determine the significance of positive selection we ex- evolbio/evolgen/wildmouse/) of a previous study (Harr amined whether the number of nonsynonymous SNVs per et al. 2016). The genome-wide SNVs data were available for nonsynonymous site (pN) is significantly higher than the num- 8(n ¼ 16) and 10 (n ¼ 20) individuals of M.m. musculus and ber of synonymous SNVs per synonymous site (pS)for a gene. M.m. castaneus, respectively. Using the software program For this purpose, we obtained the Jukes–Cantor variance for SNPEff (http://snpeff.sourceforge.net/) functional annotations the two measures and used a Z-test to examine the signifi- were inferred (Cingolani et al. 2012). We then extracted only cance. We used the nonparametric Spearman rank correla- SNVs affecting coding regions and introns. We also down- tion to test the strength and significance of observed loaded the mouse reference genome (GRCm38/mm10) se- correlations. quence and the annotation file (refGene.txt) from the UCSC For mutation rate analysis, we used the synonymous sub- genome browser data resource (https://genome.ucsc.edu/). stitution rate as the proxy for mutation rate of each gene and Using the annotations, we extracted the protein-coding sorted 16,285 genes based on this. We separated top and gene sequences and their chromosomal locations. The num- bottom 30% of the genes (4,886 in each category) and des- ber of synonymous, nonsynonymous, and intron positions ignated them as slow and fast evolving genes, respectively. were also extracted from the annotation file. Using the above, Within each group, we further separated the genes with- we estimated the diversity at nonsynonymous, synonymous, >50% and <10% constrained sites as constrained and re- and intron sites. For this purpose, we use the mean pairwise laxed genes. For recombination analysis, we obtained the differences per site (p)(Tajima 1983). We also downloaded fine-scale map of recombination rates created for the mouse the mouse-rat genome alignment from the UCSC genome genome by a previous study (Brunschwig et al. 2012), which browser and extracted the alignments for each protein- could be mapped to 8,205 genes of our data set. We calcu- coding gene. We estimated synonymous divergence for lated the mean recombination rate for each protein-coding each gene using the likelihood based method implemented gene and sorted the genes based on this. We then separated in the codeml program of the PAML package (Yang 2007). the top and bottom 30% of the genes (2,461 in each cate- We obtained the basewise conservation scores (PhyloP) gory) and termed them as low (with a recombination based on 59 vertebrate genomes with mouse (http://hgdown- rate <0.005 N r/kb) andhigh(>0.029 N r/kb) recombining e e load.soe.ucsc.edu/goldenPath/mm10/ phyloP60way/) (Siepel genes, respectively. Within each group, we separated con- et al. 2006). The score was available for each position of strained and relaxed genes as mentioned earlier. the mouse chromosomes. The PhyloP scores were then We examined the theoretical relationship between N s and mapped to on to the protein-coding genes and we desig- x usingKimura’sequation [3.18inpage 45of (Kimura nated any position of a gene with a PhyloP score >2 as con- 1983)]: strained. Based on this criterion, we estimated the proportion 2ðÞ S  1 þ e of constrained sites in each protein-coding gene. Our final x H =H ¼ ; ðÞ 1 T T; 0 data set included 16,285 reference protein-coding genes. ½ SðÞ 1  e These genes were grouped into 13 categories based on the fraction of constrained positions as (number of genes): <10 where S ¼ 4N s. In the above equation, H is the sum of e T (1,219), 10–15 (1,295), 15–20 (1,727), 20–25 (1,952), 25–30 heterozygotes involving a mutant allele over all generations (2,097), 30–35 (1,998), 35–40 (1,799), 40–45 (1,425), 45–50 until either fixation or loss and H denotes mutant with s!0. T,0 (1,094), 50–55 (755), 55–60 (493), 60–65 (277) and >65% Hence H denotes neutral and nonneutral mutations and H T T,0 (154). indicates only neutral mutations. For empirical data, nonsy- We obtained the RNA-seq expression data (http://chromo- nonymous and synonymous mutations represent H and H T T,0 some.sdsc.edu/mouse/download.html) from a previous large- of equation 1. Because a nonsynonymous mutation could be scale study using 19 mouse tissues (Dunham et al. 2012). The deleterious or neutral but a synonymous or intron mutation is values of Fragments Per Kilobase of transcript per million largely neutral in nature. Therefore, we used the ratio of Genome Biol. Evol. 10(3):756–762 doi:10.1093/gbe/evy047 Advance Access publication February 21, 2018 757 Downloaded from https://academic.oup.com/gbe/article-abstract/10/3/756/4898785 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Subramanian GBE heterozygosity or diversity of nonsynonymous and synony- mous mutations as the empirical equivalence of equation 1. Note that equation 1 assumes that the effective population is equal to the actual population (N ¼ N). The difference between the xs from two populations was quantified as: x x x x A B mus cas d ¼ or d ¼ ðÞ 2 x x x x A mus where x and x is estimated for the population with small A B and large effective population sizes (eg. M.m. musculus and M.m. castaneus), respectively. Results To understand the theoretical relationship between s andx for different N , let us assume that we compare two populations A and B with 76,000 and 580,000 as their N ,respectively (to represent the N of M. m. musculus and M. m. castaneus,re- spectively) and s is the mean selection coefficient on the non- synonymous sites of a gene or a collection of genes. The theoretical relationship has been derived by Kimura (1983), which is shown equation 1. Using this formula, we assigned values for s ranging from 0 to 1.7  10 with an increment of 1.3  10 and computed x and x for the hypothetical A B populations with two different N mentioned earlier. Figure 1A shows that the difference between x and x is large when A B negative selection is high and this difference disappears when s approaches zero. When s ¼ –1.7  10 , x estimated for the population A with small effective population size (N s ¼ 1.3) was 3.9 times higher than that estimated for the large popu- lationB(N s ¼ 10). However, this fraction becomes equal be- tween the populations when s is very close to zero (s!0). This is FIG.1.—Theoretical relationship between the ratio of diversity at se- further clear from the positive relationship between selection lected and neutral sites (x ¼ H /H ), effective population size (N )and T T,0 e coefficient and the magnitude of difference between the xs selection coefficient (s). (A) Using equation 1 (see Materials and Methods), (d )(fig. 1B). x estimated for 13 different values of s ranging from 0 to 1.7 10 To examine the influence of population size on a genome (increment of 1.3 10 ) for two populations (A and B) with effective scale, we estimated the nucleotide diversities using the whole population sizes of 76,000 (x ) and 580,000 (x ) to represent those of A B Musmusculusmusculus and M.m. castaneus, respectively. Note when genome data of M.m. castaneus and M.m. musculus popula- s¼ 0 the equation becomes undefined (0/0) and therefore the first two tions. The genome-wide estimates are given in table 1.The columns were based on the assumption of s being infinitesimally small effective population size difference between the two subspe- (s!0), which results in x very close to 1 (x!1). (B) Relationship between cies is evident from the difference in the number of SNVs. selection coefficient (s) and the normalized difference between x and x A B Although nonsynonymous diversity of M.m. castaneus was (d ). 2.3 times higher than that of M.m. musculus, synonymous and intron diversities estimated were 3.3 and 3.6 times higher for the former than the latter. The x estimated for the whole used to represent evolution at constrained and neutral sites, genomes of M.m. musculus population was 31% and 37% which forms the empirical ratio x (p /p ). We grouped genes N S (based on synonymous sites and introns, respectively) higher based on the proportion of constrained sties into 13 catego- than those obtained for M.m. castaneus population. To ex- ries and obtained the mean x for the genes belonging to each amine the empirical relationship between selection intensity category. Cleary, the patterns of relationships in figure 2 are and x, we first computed the proportion of constrained sites similar to those shown in figure 1, suggesting that the ge- for each protein-coding gene as described in methods and nome data provide the empirical proof for the theoretical re- this was used as a proxy for selection intensity (s). The diver- lationship. For highly constrained genes (with >65% sities at nonsynonymous (p ) and synonymous sites (p )were constrained sites) x estimated for M.m. musculus (x ) N S mus 758 Genome Biol. Evol. 10(3):756–762 doi:10.1093/gbe/evy047 Advance Access publication February 21, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/3/756/4898785 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Population Size Effect on Gene Evolution GBE Table 1 Summary Statistics Mus musculus M.m. musculus castaneus SNVs Nonsynonymous 103763 33676 Synonymous 229655 55760 Intron 14831843 3266566 Diversity 6 6 Nonsynonymous (p ) 0.00076 (66.1 10 ) 0.00033 (64.0 10 ) 5 5 Synonymous (p)0.0057(62.7 10 )0.0017(61.5 10 ) 6 6 Intron (p)0.0037(62.1 10 )0.0010(61.1 10 ) p /p 0.13 (0.0012) 0.19 (0.0028) N S p /p 0.20 (0.0016) 0.32 (0.0039) N I Difference between x - d Using synonymous 0.31 (0.008) — P<0.0001 sites Using intron 0.37 (0.011) — P<0.0001 was 0.079 (6 0.017), which was three times higher than that estimated for M.m. castaneus [x ¼ 0.026 (6 0.004)] and cas the difference between the xs was significantly >0 (P ¼ 0.0015, one-tailed Z-test). Whereas, this difference be- tween the xs was only 18% [0.470 (6 0.027) vs. 0.388 (6 0.014)] for genes under relaxed selection pressure (<10% constrained sites) and it was statistically significant (P ¼ 0.0031). This result is further supported by the positive relationship (q ¼ 0.97, P < 0.0001) between the proportion of constrained sites and the normalized difference between x and x (d )estimated for M.m. castaneus and M.m. mus cas x musculus (fig. 2B). FIG.2.—Empirical relationship between the proportion of constrained A number of earlier studies showed that highly expressed positions and x estimated for Musmusculusmusculus (x )and mus genes are under high selection pressure and expression level is M.m. castaneus (x ) using nonsynonymous and synonymous sites. (A) cas a major determinant of protein evolution (Subramanian and Protein-coding genes were grouped into 13 categories based on the propor- Kumar 2004; Drummond et al. 2005; Yang et al. 2012). tion of constrained sites (see Materials and Methods) and average x com- puted for genes belonging to each category are shown. We used the fraction Based on this rationale, we used the level of gene expression of constrained sites as a proxy for selection intensity (s). Error bars shows the as an independent proxy for selection intensity (s) and exam- standard error of the mean. The difference between x and x was mus cas ined its relationship with x. For this purpose, we obtained statistically significant for all categories (at least P< 0.01). (B) Relationship RNA-seq data from a previous study (Dunham et al. 2012). between the proportion of constrained sites in protein-coding genes and Based on the level of expression (in FPKM units) genes were normalized difference betweenx andx (d ) is shown. This relationship mus cas x grouped into 13 categories and the average x was computed is highly significant (q ¼ 0.97, P< 0.0001). Best fitting regression line is for genes belonging to each category. Our results based on shown. expression levels were very similar to those obtained for the proportion of constrained sites (fig. 3). For highly expressed (q ¼ 0.95, P < 0.0001) between expression levels and the genes x estimated for M.m. musculus [x ¼ 0.130 mus normalized difference between x and x (d )provides mus cas x (6 0.014)] was 2.1 times higher than that estimated for confirmatory support for our results (fig. 3B). M.m. castaneus [x ¼ 0.062 (6 0.006)] and the difference cas between the xs was statistically significant (P < 0.0001). Discussion Whereas this difference between the xs was only 27% for Our results suggest that the influence of effective population the genes with low expression levels [0.238 (6 0.015) vs. size is more pronounced in genes under high selection inten- 0.174 (6 0.006)] and it was statistically significant sity. In this study, we first used the proportion of constrained (P ¼ 0.0073). A highly significant positive correlation Genome Biol. Evol. 10(3):756–762 doi:10.1093/gbe/evy047 Advance Access publication February 21, 2018 759 Downloaded from https://academic.oup.com/gbe/article-abstract/10/3/756/4898785 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Subramanian GBE FIG.4.—Theoretical relationship between selection coefficient (s)and normalized difference between x and x (d ) is shown for higher values A B x of s. Note that d changes only when the values of s fall between the nearly neutral range of 0 to 2. M.m. musculus than M.m. castaneus suggests a greater frac- tion of deleterious mutations segregating in the former. This is due to the fact that selection is not efficient in purging dele- terious mutations in small populations. We used synonymous sites as a proxy for neutral evolution. However, previous stud- ies suggested that a fraction of synonymous sites could be under selective constraints (Chamary et al. 2006). If negative selection is assumed in synonymous sites the magnitude of this effect will be more pronounced in populations with large effective sizes and in this case in M.m. castaneus.This will in FIG.3.—Relationship between gene expression levels and x esti- turn expected to further increase the difference between xs mated for Musmusculusmusculus (x )and M.m. castaneus (x )using mus cas (d ) from the two mouse populations. Hence this assumption nonsynonymous and synonymous sites. (A) Genes were grouped into 13 will make our results more conservative. However, we categories based on their expression levels represented by the Fragments addressed this issue by replacing introns for synonymous sites Per Kilobase of transcript per million mapped reads (FPKM) units (see to estimate neutral diversities and found similar results (sup- Materials and Methods) and the average x estimated for genes belonging plementary figs. S1 and S2, Supplementary Material online). to each category are shown. Here, we used the level of gene expression as In figure 1, the theoretical relationship was shown only for an independent proxy for selection intensity (s). Error bars shows the stan- small values of s (0 to 1.7 10 ). However, the corre- dard error of the mean. The difference between x and x was sta- mus cas sponding N s values are 10 and 1.3 for M.m. castaneus tistically significant for all categories (at least P< 0.01). (B) Relationship and M.m. musculus, respectively. Previous studies on the dis- between the gene expression levels and normalized difference between x and x (d ) is shown (q ¼ 0.95, P< 0.0001). Best fitting regression tribution of the fitness effects of mutations in M.m. castaneus mus cas x line is shown. populations suggested that almost 77–80% of the mutations with N s > 10 were lethal or highly deleterious and 20% of sites as a proxy for selection intensity (s), which is straightfor- them (N s < 10) were nearly neutral in nature (Halligan et al. ward. Since the level of gene expression is known to correlate 2010, 2013; Kousathanas and Keightley 2013). It is well with selection intensity (Subramanian and Kumar 2004; known that only nearly neutral or slightly deleterious muta- Drummond et al. 2005; Yang et al. 2012)we used this as tions are influenced by N and both neutral and highly dele- an independent proxy for s. However, both measures pro- terious mutations are independent and are not modulated by duced almost identical patterns of relationships with x.The effective population sizes. This is very clear from figure 4, pattern of our population diversity based results is similar to whichshowsa plateauor anasymptote for d when that reported based on divergence between species s > –2.0  10 . Due to this reason, we chose to show only (Subramanian 2013). Overall, the higher x observed for the nearly neutral range in figure 1. Because the mutations/ 760 Genome Biol. Evol. 10(3):756–762 doi:10.1093/gbe/evy047 Advance Access publication February 21, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/3/756/4898785 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Population Size Effect on Gene Evolution GBE constraints (see Materials and Methods). Similar estimates were obtained for the bottom 30% of the genes with fastest evolutionary rate. The difference between xsof M.m. mus- culus and M.m. castaneus (d ) estimated for slow-evolving constrained genes was 3.2 times higher than that obtained for slow-evolving relaxed genes (0.46 vs. 0.14) (fig. 5A). Similarly, this difference for fast-evolving constrained genes was 2.5 times higher than that estimated for fast-evolving relaxed genes (0.438 vs. 0.176). Similar results were obtained when diversity at introns (instead of synonymous sites) were used to estimate x (supplementary fig. S3A, Supplementary Material online). These results revealed that the effect of ef- fective population size was more pronounced in constrained than in relaxed genes and the magnitude of this effect was largely similar in the fast and slow mutating genes. Therefore, difference in mutation rate between genes is unlikely to affect the main results of this study. Next, to examine the effects of recombination we obtained the fine-scale map of recombination rates from a previous study (Brunschwig et al. 2012) and computed the mean recom- bination rate for each mouse protein-coding gene. Similar to the previous analysis we sorted genes based on their recombi- nation rates and obtained the top and bottom 30% of the genes with low and high recombination rates, respectively. Thedifferencebetween xs(d )of M.m. musculus and M.m. castaneus estimated for low-recombining constrained genes was 2 times higher than that obtained for low-recombining relaxed genes (0.4 vs. 0.2) (fig. 5B). Similarly, this difference for high-recombining constrained genes was also two times higher than that estimated for high-recombining relaxed genes (0.5 vs. 0.25). Comparable results were also obtained when diversity at introns was used to estimate x (supplementary fig. S3B, Supplementary Material online). The above results sug- gest that the magnitude of the effects of N on x (or d )was e x FIG.5.—Normalized difference between x and x (d ) estimated mus cas x similar for genes located in high and low recombining regions. for genes under high and low selective constraints using nonsynonymous Therefore, variation in the rate of recombination between and synonymous sites. (A) Genes evolving under high and low substitution genes do not affect the findings and conclusions of this study. rates (see Materials and Methods). (B) Genes present in high and low The results of this study are under the assumption that the recombining regions. All differences between d of genes under high fraction of adaptive nonsynonymous segregating variations is and low selective constraints were statistically significant at least P< 0.0001. Error bars show the standard error of the mean. negligible. This is because when adaptive sweep occurs non- synonymous SNVs will be quickly fixed and do not contribute variations associated with the observed difference in the xs to the segregating variation. We examined this using our data estimated for M.m. castaneus and M.m. musculus were pre- and found that 147 and 362 genes (M.m. castaneus and dominantly nearly neutral or slightly deleterious in nature. M.m. musculus, respectively) had more number of nonsynon- Theoretical relationship shown in figure 1 was also based ymous SNVs per nonsynonymous site than synonymous SNVs on simple assumptions and did not consider the influence of per synonymous site (or x > 1). . However, the difference was other factors such as mutation and recombination rate differ- statistically significant only for 4 and 5 genes, respectively ence between genes, which might result in different N for (using a Z test—see Materials and Methods). Finally, the the- genes. To examine this effect, first we used the rate of sub- oretical relationship shown in this study (fig. 1)is based on the stitution at synonymous sites as the proxy for mutation rate assumption of no dominance and independence between and sorted genes based on the synonymous divergence be- mutations. However, empirical data include the effects of tween mouse and rat. We then obtained the top 30% of the dominant mutations and interactions/epistasis between genes with slowest evolutionary rate and within this category mutations. Hence these caveats should be noted while infer- we estimated d for the genes under high and low selective ring the empirical results of this study. Genome Biol. Evol. 10(3):756–762 doi:10.1093/gbe/evy047 Advance Access publication February 21, 2018 761 Downloaded from https://academic.oup.com/gbe/article-abstract/10/3/756/4898785 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Subramanian GBE Good JM, Handel MA, Nachman MW. 2008. Asymmetry and polymor- Based on previous theoretical and empirical predictions, phism of hybrid male sterility during the early stages of speciation in comparative population genetic studies generally assume house mice. Evolution 62(1):50–65. the difference in x observed between two populations to Halligan DL, et al. 2013. Contributions of protein-coding and regulatory reflect the variations in their effective population sizes change to adaptive molecular evolution in murid rodents. PLoS Genet. (Strasburg et al. 2011; Phifer-Rixey et al. 2012; Gayral et al. 9(12):e1003995. Halligan DL, Oliver F, Eyre-Walker A, Harr B, Keightley PD. 2010. Evidence 2013; Romiguier et al. 2014; James et al. 2017). However, our for pervasive adaptive protein evolution in wild mice. PLoS Genet. results suggest that the magnitude of this difference is depen- 6(1):e1000825. dent upon the genes being compared. As we have shown Harr B, et al. 2016. Genomic resources for wild populations of the house that comparing genes under relaxed selective constraints will mouse, Mus musculus and its close relative Mus spretus.Sci Data underestimate the actual difference in the effective popula- 3:160075. Harrang E, Lapegue S, Morga B, Bierne N. 2013. A high load of non- tion sizes. Therefore, it is important to consider the selection neutral amino-acid polymorphisms explains high protein diversity de- intensity on the genes when comparing x between popula- spite moderate effective population size in a marine bivalve with tions to infer effective population size. Although our results sweepstakes reproduction. G3 (Bethesda) 3:333–341. are based on protein-coding genes the findings will hold true James J, Castellano D, Eyre-Walker A. 2017. DNA sequence diversity and 0 0 for other constrained noncoding regions such as 3 and 5 the efficiency of natural selection in animal mitochondrial DNA. Heredity (Edinb) 118(1):88–95. untranslated regions, splice sites, up-, and downstream regu- Kimura M. 1983. The neutral theory of molecular evolution. Cambridge: latory elements. Cambridge University Press. Kousathanas A, Keightley PD. 2013. A comparison of models to infer the distribution of fitness effects of new mutations. Genetics Supplementary Material 193(4):1197–1208. Supplementary data areavailableat Genome Biology and Li W-H. 1997. Molecular evolution. Chicago: Sinauer Associates Inc. Evolution online. Ohta T. 1993. An examination of the generation-time effect on molecular evolution. Proc Natl Acad Sci U S A. 90(22):10676–10680. Phifer-Rixey M, et al. 2012. Adaptive evolution and effective population Acknowledgments size in wild house mice. Mol Biol Evol. 29(10):2949–2955. Romiguier J, et al. 2014. Population genomics of eusocial insects: the costs The author acknowledges the support from the Australian of a vertebrate-like effective population size. J Evol Biol. Research Council (LP160100594) and the University of the 27(3):593–603. Sunshine Coast. Salcedo T, Geraldes A, Nachman MW. 2007. Nucleotide variation in wild and inbred mice. Genetics 177(4):2277–2291. Siepel A, Pollard KS, Haussler D. 2006. New methods for detecting Literature Cited lineage-specific selection. In: Apostolico A, Guerra C, Istrail S, Pevzner PA, Waterman M, editors. Research in computational molec- Brunschwig H, et al. 2012. Fine-scale maps of recombination rates and ular biology. RECOMB 2006. Lecture Notes in Computer Science. hotspots in the mouse genome. Genetics 191(3):757–764. Berlin: Springer. p. 190–205. Bustamante CD, et al. 2005. Natural selection on protein-coding genes in Strasburg JL, et al. 2011. Effective population size is positively correlated the human genome. Nature 437(7062):1153–1157. with levels of adaptive divergence among annual sunflowers. Mol Biol Chamary JV, Parmley JL, Hurst LD. 2006. Hearing silence: non-neutral Evol. 28(5):1569–1580. evolution at synonymous sites in mammals. Nat Rev Genet. Subramanian S. 2013. Significance of population size on the fixation of 7(2):98–108. nonsynonymous mutations in genes under varying levels of selection Cingolani P, et al. 2012. A program for annotating and predicting the pressure. Genetics 193(3):995–1002. effects of single nucleotide polymorphisms, SnpEff: sNPs in the ge- Subramanian S, Kumar S. 2004. Gene expression intensity shapes evolu- nome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly tionary rates of the proteins encoded by the vertebrate genome. (Austin) 6(2):80–92. Genetics 168(1):373–381. Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. 2005. Why Tajima F. 1983. Evolutionary relationship of DNA sequences in finite pop- highly expressed proteins evolve slowly. Proc Natl Acad Sci U S A. ulations. Genetics 105(2):437–460. 102(40):14338–14343. Tsagkogeorga G, Cahais V, Galtier N. 2012. The population genomics of a Dunham I, et al. 2012. An integrated encyclopedia of DNA elements in the fast evolver: high levels of diversity, functional constraint, and molec- human genome. Nature 489(7414):57–74. ular adaptation in the tunicate Ciona intestinalis. Genome Biol Evol. Duvaux L, Belkhir K, Boulesteix M, Boursot P. 2011. Isolation and gene 4(8):740–749. flow: inferring the speciation history of European house mice. Mol Yang JR, Liao BY, Zhuang SM, Zhang J. 2012. Protein misinteraction avoid- Ecol. 20(24):5248–5264. ance causes highly expressed proteins to evolve slowly. Proc Natl Acad Gayral P, et al. 2013. Reference-free population genomics from next- Sci U S A. 109(14):E831–E840. generation transcriptome data and the vertebrate-invertebrate gap. Yang Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Mol PLoS Genet. 9(4):e1003457. Biol Evol. 24(8):1586–1591. Geraldes A, et al. 2008. Inferring the history of speciation in house mice from autosomal, X-linked, Y-linked and mitochondrial genes. Mol Ecol. 17(24):5349–5363. Associate editor:George Zhang 762 Genome Biol. Evol. 10(3):756–762 doi:10.1093/gbe/evy047 Advance Access publication February 21, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/3/756/4898785 by Ed 'DeepDyve' Gillespie user on 16 March 2018

Journal

Genome Biology and EvolutionOxford University Press

Published: Mar 1, 2018

There are no references for this article.

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off