Amino acid usage in a proteome depends mostly on its taxonomy, as does codon usage in transcriptomes. Here, we explore the level of variation in the codon usage of a specific amino acid, glutamine, in relation to the number of consecutive glutamine residues. We show that CAG triplets are consistently more abundant in short glutamine homorepeats (polyQ, four to eight residues) than in shorter glutamine stretches (one to three residues), leading to the evolutionary growth of the repeat region in a CAG-dependent manner. The length of orthologous polyQ regions is mostly stable in primates, particularly for the short ones. Interestingly, given a short polyQ, the CAG usage is higher in unstable-in-length orthologous polyQ regions. This indicates that CAG triplets produce the necessary instability for a glutamine stretch to grow. Proteins related to polyQ-associated diseases behave in a more extreme way, with longer glutamine stretches in human and evolutionarily closer nonhuman primates, and an overall higher CAG usage. In the light of our results, we suggest an evolutionary model to explain the glutamine codon usage in polyQ regions.

Key words: homorepeat, glutamine stretch, codon usage, polyQ-associated diseases.

© The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Mier and Andrade-Navarro. Glutamine Codon Usage and polyQ Evolution in Primates Depend on the Q Stretch Length. Genome Biol. Evol. 10(3):816–825. doi:10.1093/gbe/evy046. Advance Access publication March 1, 2018.

Introduction

Homorepeats are defined as runs of the same amino acid in a protein sequence. Given the repeated amino acid X, its homorepeat is known as polyX, X-AAR (amino acid repeat), or X-homopeptide (Zhou et al. 2011). The prevalence and functions of a polyX vary in proteomes depending on known factors (natural selection, taxonomy, length, GC content) (Faux et al. 2005; Mularoni et al. 2010; Zhou et al. 2011; Mier et al. 2017) and unknown ones (e.g., the case of polyasparagines in the amoeba Dictyostelium discoideum and the protozoan Plasmodium falciparum) (Eichinger et al. 2005; Muralidharan and Goldberg 2013).

From a purely anthropocentric point of view, the most interesting homorepeats are the polyglutamines (polyQ). Besides being one of the most prevalent homorepeats in eukaryotes (Faux et al. 2005; Mier et al. 2017), glutamine tracts are associated, through their abnormal expansion (via CAG trinucleotide repeats), with at least nine inherited neurodegenerative diseases (Fan et al. 2014; Den Dunnen 2017). None of these diseases is curable or effectively treatable so far, despite the many attempts to fathom the role of the extended polyQ in the progression of the disorder and the development of potential therapeutic approaches (Hughes and Olson 2001; Robertson and Bottomley 2010; Margulis et al. 2013; Fan et al. 2014; Takeuchi et al. 2014; Takeuchi and Nagai 2017). Intrinsically, the presence of CAG and other CNG repeats affects mRNA stability and structure (Broda et al. 2005), and their abnormal expansion in disease can influence splicing (Neueder et al. 2017). On the other hand, at the protein level the length of a polyQ region correlates with its propensity to aggregate (Barton et al. 2007), and is a critical determinant of age-of-disease onset (Nagai et al. 2000). These facts underline the importance of a better comprehension of the evolutionary perspective of the growth of glutamine tracts and their codon usage.

Although polyQ is a commonly accepted term for stretches of consecutive glutamine residues, thresholds of a minimum of four out of five (UniProt, http://www.uniprot.org/help/compbias; last accessed February 26, 2018), four out of six (Mier and Andrade-Navarro 2017), five (Alba and Guigo 2004; Jorda and Kajava 2010; Chavali et al. 2017), six (Lobanov and Galzitskaya 2012), and eight out of ten (Schaefer et al. 2012; Mier and Andrade-Navarro 2016) glutamine residues have been used so far to refer to polyQ regions. Although it has been demonstrated that for human and metazoan proteomes a stretch of five consecutive glutamines is not a random feature and thus can be considered a polyQ (Lobanov et al. 2016), we showed in previous research (Totzeck et al. 2017) that a protein sequence with a minimum of four glutamines in a window of six amino acids already possesses characteristic features of a polyQ region. To study the maximum number of glutamine stretches but also to account for the functional and structural implications of a polyQ, we consider that: a glutamine stretch may be considered for Q ≥ 1, a polyQ region may be considered for Q ≥ 4, a short polyQ has a defined length of 4 ≤ Q ≤ 8, and a long polyQ is Q ≥ 9 residues long. We note that these thresholds are likely specific to polyQ and might not apply to other homorepeats, as the properties of homorepeats are highly influenced by the repeated residue type (Bernacki and Murphy 2011; Lu and Murphy 2015).

Glutamine is encoded by the synonymous codons CAA and CAG. Codon usage biases are organism- or taxa-specific and are affected by natural selection (Lynn et al. 2002; Athey et al. 2017). Codon optimality derived from these biases is a major determinant of mRNA stability (Presnyak et al. 2015) and controls mRNA translation (Saikia et al. 2016). In primates, CAG is about 2.6 times as frequent as CAA (35.28 and 13.66 per 1,000 codons, respectively) (Athey et al. 2017), so that 72.09% of glutamines are encoded by CAG (71.85% in human). These numbers do not consider any additional feature of the coded glutamine, such as whether it is influenced by the presence of adjacent glutamine residues.

In this work, we characterize the length-dependent codon usage of glutamine in glutamine stretches from complete proteomes of diverse taxonomic lineages. Focusing on orthologous Q stretches from primates, we assess their length differences and the codon usage of stable- versus unstable-in-length stretches. We also show how glutamine stretches in proteins related to polyQ-associated diseases deviate from the expected proteome-wide codon usage, and propose an evolutionary model to explain the glutamine codon usage in polyQ regions.

Materials and Methods

Data Retrieval

We downloaded all coding and peptide sequences from protein coding genes from the human data set GRCh38.p10 using Ensembl/Biomart version 90 (Yates et al. 2016). Similar information was retrieved for all nonhuman primates for which Ensembl provides information about orthology relationships with human sequences: Pan troglodytes (ptr, CHIMP2.1.4), Gorilla gorilla gorilla (ggo, gorGor3.1), Pongo abelii (pab, PPYG2), Nomascus leucogenys (nle, Nleu1.0), Macaca mulatta (mmul, Mmul8.0.1), Chlorocebus sabaeus (csa, ChlSab1.1), Papio anubis (pan, PanAnu2.0), Callithrix jacchus (cja, C_jacchus3.2.1), Carlito syrichta (csy, tarSyr1), Otolemur garnettii (oga, OtoGar3), and Microcebus murinus (mmur, Mmur2.0).

The downloaded data were complemented with coding and peptide sequences from protein coding genes of model organisms from different taxonomic groups available in Ensembl/Biomart version 90: Mus musculus (mmu, GRCm38.p5), Rattus norvegicus (rno, Rnor6.0), Sus scrofa (ssc, Sscrofa11.1), Monodelphis domestica (mdo, monDom5), Gallus gallus (gga, Gallus_gallus-5.0), Taeniopygia guttata (tgu, taeGut3.2.4), Xenopus tropicalis (xtr, JGI 4.2), Latimeria chalumnae (lch, LatCha1), Danio rerio (dre, GRCz10), Takifugu rubripes (tru, FUGU 4.0), Ciona intestinalis (cin, KH), Drosophila melanogaster (dme, BDGP6), Caenorhabditis elegans (cel, WBcel235), and Saccharomyces cerevisiae (sce, R64-1-1).

We considered the downloaded data sets from Ensembl as reference, and accounted neither for intraspecies polymorphic variation nor for the quality of the genome assemblies.

Glutamine Stretches and Codon Usage

We calculated the glutamine codon usage in all the retrieved data sets from Ensembl by counting the number of CAA and CAG triplets in pure Q stretches. The length of a Q stretch was taken as the number of consecutive glutamines, in a non-nested way (e.g., “QQQQ” was considered to be of length four glutamines, and not one time “QQQQ”, two times “QQ”, and four times “Q”).

The orthology information obtained from Ensembl was integrated to generate sets of orthologs per human protein. We took into account only the sets in which all nonhuman primates had at least one ortholog to the human protein. From them, we considered solely the sets in which at least one sequence had at least one region with four or more consecutive glutamines (supplementary file 1, Supplementary Material online). PolyQ regions from proteins with more than one glutamine stretch were analyzed independently. All the regions meeting this condition were manually verified, and were compared with the different aligned orthologous sequences.

To study the length of the Q stretches in the orthologs, we aligned them in UGENE v1.9.8 (Okonechnikov et al. 2012) using the T-Coffee algorithm with default parameters. To determine the length of a Q stretch, we counted the number of consecutive glutamines. An exception was made when two different Q stretches would have been considered in one sequence, and only one in an aligned orthologous region (e.g., “QQQQPQQQQ” in one protein aligned with “QQQQQQQQQ”). In that case, we counted the total number of glutamines in the aligned region, and not just the pure Q stretches (e.g., “QQQQPQQQQ” is considered to be of length eight glutamines and “QQQQQQQQQ” of nine glutamines); we did not analyze further the identity of the different amino acids present within the polyQ region. To study the glutamine codon usage in the orthologs, we followed the same procedure, counting the number of CAA and CAG triplets forming the Q stretches. In this case, we used the standalone version of TranslatorX (Abascal et al. 2010), with default parameters, to easily visualize the nucleotide alignments separated by codons.

The length of a Q stretch was considered to be stable if at least half of the orthologs had the same length; otherwise, it was considered unstable. In the latter case, we took as the length of the unstable-in-length Q stretch the most frequent length among the orthologs. When two or more lengths were equally frequent, we took as the unstable length the most frequent one closest in evolution to human.

Reported P values are the result of a nonparametric Mann–Whitney U statistical test.

Phylogenetic Relationships between Species

To assess the pairwise divergence time for each species and human, we obtained the estimated divergence time in million years given by the TimeTree database (Kumar et al. 2017). We used the phyloT tool version 2017.7 (http://phylot.biobyte.de/; last accessed February 26, 2018) to generate a phylogenetic tree to relate the organisms based on NCBI taxonomy.

Proteins Related to polyQ-Associated Diseases

The amino acid and nucleotide sequences from the nine human proteins related to polyQ-associated diseases (Fan et al. 2014) (supplementary file 2, Supplementary Material online) were extracted from the downloaded data sets. Similarly, the orthologous sequences from the nonhuman primates were used. To complement the information about orthologs to those nine human proteins that were not defined by Ensembl, we conducted an additional procedure. We performed a BLAST search using the human protein as query versus the proteomes of the nonhuman primates with no defined ortholog (one search per human protein), using default parameters and low complexity filter off. As our only purpose here is to evaluate the length and codon usage of the one glutamine stretch associated with the disease, we considered a sequence as orthologous to the human query if their alignment covered the coordinates of the human disease-associated Q stretch. Fragments of orthologs not containing that Q stretch were thus not considered.

We followed the strategy explained above to evaluate the Q stretch length and codon usage of the full set of available orthologs to the nine human proteins related to polyQ-associated diseases.

Results

Glutamine Codon Usage Is Enriched in CAG Triplets in Longer Q Stretches

Amino acid codon usage has varied throughout evolution, and depends mostly on taxonomy. Here, we want to assess whether it is also influenced by the context of the surrounding codons. Focusing on the amino acid glutamine, Q, we calculated the frequency of glutamine stretches of different lengths in a set of 26 organisms representing major taxonomic groups, from the yeast Saccharomyces cerevisiae to the human proteome. For each stretch of consecutive glutamines in a proteome, we computed its length and the number of both CAA and CAG codons coding for the stretch (see Methods for details).

Glutamine stretches of short length are present in approximately similar numbers in all proteomes, and their proportion is extremely stable in primates (fig. 1a). On average, there are more than ten “Q”, around one “QQ” and 0.1 “QQQ” per protein. It has been reported before that 72% of glutamines are encoded by CAG in human (Athey et al. 2017); our results confirm it (fig. 1b, hsa), but solely in short glutamine stretches (1–3 Q). Glutamine stretches longer than three glutamines are coded by a higher CAG percentage. The larger proportion of 1–3 glutamine stretches (fig. 1a) biases the direct calculation of the glutamine codon usage. Species more distant in evolution behave differently, with lower CAG percentages in 4–8 Q than in 1–3 Q in the zebrafish Danio rerio (dre) and the tunicate Ciona intestinalis (cin) (fig. 1b). Comparing the percentage of CAG codons in 4–8 Q stretches with that in 1–3 Q stretches, that is, short polyQ versus non-polyQ, shows a high correlation between these values across the 26 species (fig. 1c). Most species cluster around 70% CAG for 1–3 Q stretches and 80% CAG for 4–8 Q stretches, except S. cerevisiae (sce), Caenorhabditis elegans (cel), and C. intestinalis (cin). These three species are distant from human in evolution (676–1,105 Myr), which suggests that the length-dependent glutamine codon usage was fixed after their speciation event.

Human and the nonhuman primates show similar proportions of glutamine stretches per protein, and also similar length-dependent CAG percentages, as described earlier. When itemizing the glutamine lengths from one to eight glutamines, and more than eight glutamines (fig. 1d), the triplet CAG is used in primates preferably in small polyQ (from a length of 1–3 to 4–8), whereas it is less abundant in grown homorepeats (>8 Q), for which positive selection for CAG codons might disappear. The trend in the CAG percentage values is clear, and shows two well-defined groups of values (1–3 Q and >3 Q, P = 2.2E-16) consistent with our initial definition of what should be considered a polyQ.

FIG. 1.—Characterization of glutamine stretches in complete proteomes. (a) Number of glutamine stretches per number of proteins per proteome, depending on the stretch length. (b) Percentage of CAG triplets in glutamine stretches per proteome, depending on the stretch length. (c) Percentage of CAG triplets in glutamine stretches of lengths 4–8 compared with lengths 1–3; the result for each proteome is colored depending on the pairwise divergence time with human. The discontinuous line represents x = y values. (d) Overall CAG percentages in primates in glutamine stretches of varying lengths.

PolyQ Regions in Primates Are of Similar Length

No proteome-wide set of one-to-one orthologous sequences is available for a set of model organisms including human and other nonhuman primates. We built it by focusing on the set of primates studied in the previous section with information provided by Ensembl (see Methods for details); we used sets of orthologs in which all proteomes had at least one ortholog. A total of 8,539 sets of orthologous sequences was initially obtained and then filtered to work only with those in which at least one protein from any of the organisms had at least one polyQ of length four or more, which resulted in 347 sets. From these, we identified 461 independent orthologous regions.

We counted the maximum number of consecutive glutamines in all the orthologous independent regions within the described data set, to account for both already-formed and emerging/fading polyQ regions. This procedure allows us to characterize the length of a Q stretch at several points in evolution. As we are working with the full set of available completely sequenced primates, we are able to describe the evolutionary drift of glutamine stretches in the last 74 million years in a comprehensive way.

For a given organism, we took all of its Q stretches as reference, and calculated the difference between their length and that of the rest of its orthologous regions. We repeated the procedure with the 12 primates, and split the results depending on the length of the reference Q stretch: 0–3 glutamines, 4–8 glutamines, and more than 8 glutamines (fig. 2). Both short Q stretches and polyQ regions (fig. 2a and b, respectively) show a general length similarity in all the species, with a very narrow length difference, especially in short polyQ. Short Q stretches are present in the results because at least one of their orthologous regions contains a polyQ, and thus logically they are generally either similar in length or shorter, meaning that either a few of the orthologous regions are a polyQ, or many of them, respectively. Short polyQ regions appear to be mostly stable-in-length. On the other hand, long polyQ regions are more unstable-in-length, although equally dissimilar in all organisms (fig. 2c). Glutamine homorepeats are therefore not significantly longer in human than in the nonhuman primates.

Most of the 461 orthologous polyQ regions are encoded by pure CAG codon stretches when short (fig. 3a) and mixed with CAA codons when long (fig. 3b). There are almost no pure CAA regions coding for a polyQ stretch (fig. 3d), and interruptions by different codons are also not frequent (fig. 3c). Finally, we calculated the longest run of consecutive CAG codons, and similarly of CAA codons, in stretches encoded by more than one different triplet. Consecutive CAA codon runs are always shorter than consecutive CAG stretches (fig. 3e and f). Both results hint at the use of CAA codons to disrupt long consecutive CAG stretches.

FIG. 2.—Length differences of glutamine stretches between primates. Length differences from the reference glutamine stretch to the rest of the orthologous glutamine stretches, when the length of the reference stretch is (a) 0–3, (b) 4–8, and (c) >8.

FIG. 3.—Codon purity of glutamine stretches in primates. Percentage of glutamine stretches per length encoded by (a) only CAG codons, (b) a mix of CAG and CAA codons, (c) a mix of CAG, CAA, and other interrupting codons, and (d) only CAA codons. Considering only glutamine stretches encoded by more than one different triplet, maximum number of consecutive (e) CAG, and (f) CAA per length.

Glutamine Codon Usage Is Enriched in CAG Triplets in Shorter Unstable-in-Length polyQ

The length stability of glutamine stretches was already briefly referred to in the previous section, by comparing the overall length differences in the full set of 461 independent orthologous regions. However, a one-by-one study of these regions is needed to assess their length-dependent stability and codon usage. We will not focus on polyQ growth or decrease, but on the stability of their length in the available data set. Both length growth and decrease would be influenced by the proteome taken as reference; however, the depiction of the length stability of glutamine stretches among primates is a property that takes into account all proteomes considered.

We considered a glutamine stretch to be stable-in-length if it had the same length in at least half of the orthologous regions. In that case, the stretch is labelled as stable and its length is taken as that of the majority of them. If the region is unstable-in-length, its length is taken as the most frequent among the orthologs (see Methods for details). As previously described (fig. 2), shorter Q stretches are more stable-in-length than longer ones (fig. 4). Stretches with more than ten glutamines (28/461 stretches) are rarely stable-in-length (21% of them). On the other hand, stretches of four consecutive glutamines (178/461 stretches) are almost always stable-in-length (97%). These results suggest that short polyQ are generally held back within a controlled length range. They are most probably long enough to be functional, while not at risk of an unexpected expansion that could lead to instability and disease.

Separate codon usage calculations in stable-in-length/unstable-in-length and short/long polyQ regions show that CAG is more frequent in short and unstable-in-length polyQ (“4-8U”) in all studied primates (fig. 5, P = 8.94E-08). This result suggests that CAG codons destabilize the glutamine stretch, probably assisting in the growth of the region.

FIG. 4.—Length-stability of glutamine stretches. Percentage of orthologous glutamine stretch regions with a stable length in at least half of the orthologs.

FIG. 5.—Glutamine codon usage in stable- versus unstable-in-length polyQ. Codon usage calculated in stable-in-length (S) and unstable-in-length (U), short (4–8 Q), and long (>8 Q) polyQ stretches, in 12 primates.

A Closer Look into the Proteins Related to polyQ-Associated Diseases

There are nine human proteins associated with diseases produced by the abnormal elongation of their polyQ regions. Following a procedure similar to the one explained before, we checked both the polyQ length and glutamine codon usage in these proteins in 12 primates. The proteins under study become pathological when their polyQ exceeds a length threshold specific to each protein. For example, the normal length of the glutamine repeat in human protein Huntingtin (EnsemblID: ENSP00000347184) is described to be 6–35, and in its pathological version 36–121 (Fan et al. 2014). It is important to notice that in this study we refer to the length of the polyQ region in the sequence obtained from the Ensembl database, which we take as reference; for Huntingtin, the sequence version present in Ensembl is 21 glutamines long.

All the proteins related to polyQ-associated diseases contain one polyQ region, except the androgen receptor (EnsemblID: ENSP00000363822), which contains three, with 23, 6, and 5 glutamines (in coordinates 58–80, 86–91, and 195–199, respectively). As the pathological stretch is the first of them, for the purpose of this work, we did not consider the second and the third regions.

Not every analyzed organism contains all of these nine proteins. The protein absences may be due to problems in the orthology mapping given by Ensembl, erroneous genome sequencing or protein-coding gene annotation, or simply a gene loss event. We complemented the Ensembl orthology mapping with a manual strategy based on BLAST searches in the 12 proteomes to fill as much as possible the sets of orthologs for each protein (see Methods for details).

The overall length of the glutamine stretches in these proteins shows that the human ones are generally longer (fig. 6, in blue); in fact, in seven out of nine sets of orthologous proteins, the human polyQ region is the longest one (not in Ataxin-2 and Ataxin-3). Nonhuman primates evolutionarily closer to human also have longer polyQ regions than more distant species. This is a deviation from the expected results: we have shown before that polyQ regions in primates are generally of similar length (fig. 2).

The overall proportion of CAG codons in those regions is unexpectedly high, with a mean value of >90% CAG in almost all species (fig. 6, in red). The CAG codon usage when computing all polyQ regions with more than eight glutamines was calculated to be between 75% and 80% for all primates (fig. 5). The extreme CAG codon presence in the polyQ regions of proteins related to polyQ-associated diseases may produce an instability that boosts the CAG-dependent polyQ growth in evolution, explaining the growth pattern of the polyQ regions in these proteins from evolutionarily distant nonhuman primates to human (supplementary fig. 1, Supplementary Material online, P value = 0.007). The differences at the level of nucleotides (higher CAG triplet proportion than the background) and amino acids (longer glutamine stretches in human and evolutionarily closer nonhuman primates) may explain the association of these nine proteins with human diseases.

FIG. 6.—PolyQ lengths and glutamine codon usage in proteins related to polyQ-expansion diseases. Divergence time for each organism and human is measured in million years (Myr). The tree on top relates the species based on their NCBI taxonomy. Each protein is appended with its related disease. The overall Q lengths (in blue) and CAG percentage prevalence (in red) plots take into account the results per species of the nine proteins shown above.

Discussion

This work presents a comprehensive evolutionary characterization of homoglutamine repeats in both amino acid and nucleotide contexts. We have shown that for all studied species the glutamine codon usage depends on the number of consecutive glutamines in a stretch, being in most of the species enriched in CAG triplets in longer Q stretches (fig. 1). Primates present a direct correlation between the number of consecutive glutamines in a stretch and the percentage of CAG triplets coding them, covering glutamine stretches with lengths 1–3 and short polyQ with lengths 4–8. Once the polyQ region is established and long enough, the presence of CAG is not required anymore. Our results suggest the greater importance of CAG triplets in generating the polyQ region than in elongating it once it reaches a certain length threshold. This result is supported by the fact that orthologous short unstable-in-length polyQ regions in primates are enriched in CAG (fig. 5, data labels “4-8U”).

Orthologous glutamine stretches in primates are generally of a similar length (fig. 2). The length range of regions orthologous to stretches of length 4–8 is very narrow, which is confirmed by the greater length stability of shorter glutamine stretches (fig. 4). In the same way, longer glutamine stretches are more unstable-in-length, but they are not significantly longer in any species. Contrary to this result, polyQ regions of proteins related to polyQ-associated diseases are unexpectedly longer in human and evolutionarily closer nonhuman primates (fig. 6). They also deviate from the proteome-wide codon usage of glutamine stretches, showing an overall higher CAG proportion in almost all species.

CAA codons serve as disruptors of long pure CAG stretches (fig. 3), which may be selected for to avoid the uncontrollable growth of these regions produced by CAG expansion through slippage-related mechanisms (Kraus-Perrotta and Lagalwar 2016; Ciesiolka et al. 2017). The smaller number of CAA triplets encoding polyQ regions associated with polyQ diseases suggests a role for CAA codons as phenotype modulators. Even though the frequency of codons other than CAA interrupting consecutive CAG stretches is low, a role has previously been reported for these interruptions in evading homologous DNA recombination (Barik 2017), slowing the aggregation rates of polyQ regions, decreasing fiber formation rates, increasing oligomer stability (Menon et al. 2013), and preventing CAG expansion (Ciesiolka et al. 2017). Whether the phenotypic outcome of an interruption in a long CAG stretch differs when it is produced by a CAA (silent mutation) or by another codon (missense mutation) remains to be deciphered.

Our interest in the proteins related to polyQ-associated diseases is anthropocentric, as they are associated with neurological diseases described in human; there are probably more proteins associated with neurodegenerative diseases in nonhuman primates that we do not know of because they are not pathogenic in human.

To suggest an evolutionary model to explain the glutamine codon usage in polyQ regions in primates, we point to the following observations. First, the percentage of CAG triplets coding for glutamine stretches depends on the number of consecutive glutamines (fig. 1d): lower percentages for 1–3 Q, higher percentages for 4–8 Q, and medium percentages for >8 Q. Second, polyQ length is generally stable across orthologs (fig. 2). Third, shorter polyQ are more stable-in-length than longer ones (fig. 4). Fourth, CAG codons are associated with length instability (fig. 5, “4-8S” vs “4-8U”). Fifth, much higher percentages of CAG than expected for their length are present in polyQ regions of proteins related to polyQ-associated diseases (fig. 6 “Overall %CAG” vs fig. 1d “>8”), and they present an overall polyQ length that is longer in human and evolutionarily related nonhuman primates and shorter in species more distant in evolution (fig. 6). We propose that the observations summarized above collectively suggest the following evolutionary model for polyQ in primates: 1) CAG triplets are positively selected in evolution to generate a short polyQ region from a short glutamine stretch; 2) a short polyQ region is probably long enough to be functional, and thus its growth is no longer selected; 3) as a mechanism to stop longer polyQ from growing further and to reduce instability, CAG triplets are either actively counterselected or in neutral evolution; 4) longer polyQ regions escaping this blockage may grow uncontrollably and be involved in the development of polyQ-associated diseases.

The validity of the presented model needs to be tested in vivo. Even if the model is meant to explain the glutamine codon usage in polyQ regions in primates, a more distant species with a shorter lifespan could be used to test our hypothesis, for example, by integrating in its genome polyQ tracts with various CAG percentages, and checking in successive generations whether the glutamine stretches grow in a CAG-dependent way. The yeast S. cerevisiae has already been used to express fragments of Huntingtin with polyQ expansions to study polyglutamine toxicity (Krobitsch and Lindquist 2000; Duennwald et al. 2006); therefore, we propose it as a potential model organism to test our model. Furthermore, future gene therapies may induce point mutations in polyQ regions to transform CAG codons into CAAs, which could stop abnormal CAG-mediated expansions of glutamine tracts.

With this work, we hope to raise awareness of the usefulness of studying homorepeat evolution. Gathering information from a data set of complete proteomes, we could show that, while in primates polyglutamines are rather stable-in-length, they evolve in a CAG-dependent manner. Further efforts should be made to research the evolution of other homorepeats, taking advantage of the growing collection of complete proteomes available in public databases.

Supplementary Material

Supplementary data are available at Genome Biology and Evolution online.

Acknowledgments

This work was supported by the Center for Computational Sciences Mainz (CSM, Johannes Gutenberg University of Mainz, Germany).

Authors' Contributions

P.M. and M.A.A.N. conceived the project. P.M. designed, implemented, and carried out the experiments. M.A.A.N. supervised the research. P.M. wrote the manuscript, incorporating comments, contributions, and corrections from M.A.A.N. All authors read and approved the final manuscript.

Literature Cited

Abascal F, Zardoya R, Telford MJ. 2010. TranslatorX: multiple alignment of nucleotide sequences guided by amino acid translations. Nucleic Acids Res. 38(Suppl_2):W7–13.
Alba MM, Guigo R. 2004. Comparative analysis of amino acid repeats in rodents and humans. Genome Res. 14(4):549–554.
Athey J, et al. 2017. A new and updated resource for codon usage tables. BMC Bioinformatics 18(1).
Barik S. 2017. Amino acid repeats avert mRNA folding through conservative substitutions and synonymous codons, regardless of codon bias. Heliyon 3:12.
Barton S, Jacak R, Khare SD, Ding F, Dokholyan NV. 2007. The length dependence of the polyQ-mediated protein aggregation. J Biol Chem. 282(35):25487–25492.
Bernacki JP, Murphy RM. 2011. Length-dependent aggregation of uninterrupted polyalanine peptides. Biochemistry 50(43):9200–9211.
Broda M, Kierzek E, Gdaniec Z, Kulinski T, Kierzek R. 2005.
Kumar S, Stecher G, Suleski M, Hedges SB. 2017. TimeTree: a resource for timelines, timetrees, and divergence times. Mol Biol Evol. 34(7):1812–1819.
Lobanov MY, Galzitskaya OV. 2012. Occurrence of disordered patterns and homorepeats in eukaryotic and bacterial proteomes. Mol Biosyst. 8(1):327–337.
Lobanov MY, Klus P, Sokolovsky IV, Tartaglia GG, Galzitskaya OV. 2016. Non-random distribution of homo-repeats: links with biological functions and human diseases. Sci Rep. 6:26941.
Lu X, Murphy RM. 2015. Asparagine repeat peptides: aggregation kinetics and comparison with glutamine repeats. Biochemistry 54(31):4784–4794.
Lynn DJ, Singer GA, Hickey DA. 2002. Synonymous codon usage is subject to selection in thermophilic bacteria. Nucleic Acids Res. 30(19):4272–4277.
Margulis BA, Vigont V, Lazarev VF, Kaznacheyeva EV, Guzhova IV. 2013. Pharmacological protein targets in polyglutamine diseases: mutant polypeptides and their interactors. FEBS Lett. 587(13):1997–2007.
Menon RP, et al. 2013. The role of interruptions in polyQ in the pathology of SCA1. PLoS Genet. 9(7):e1003648.
Mier P, Alanis-Lobato G, Andrade-Navarro MA. 2017. Context characterization of amino acid homorepeats using evolution, position, and order. Proteins 85(4):709–719.
Mier P, Andrade-Navarro MA. 2016. FastaHerder2: four ways to research protein function and evolution with clustering and clustered databases. J Comput Biol. 23(4):270–278.
Mier P, Andrade-Navarro MA. 2017. dAPE: a web server to detect homorepeats and follow their evolution. Bioinformatics 33(8):1221–1223.
Mularoni L, Ledda A, Toll-Riera M, Alba MM. 2010. Natural selection drives the accumulation of amino acid tandem repeats in human proteins. Genome Res. 20(6):745–754.
Muralidharan V, Goldberg DE. 2013. Asparagine repeats in Plasmodium falciparum proteins: good for nothing? PLoS Pathog. 9(8):e1003488.
Nagai Y, et al. 2000. Inhibition of polyglutamine protein aggregation and cell death by novel peptides identified by phage display screening.
J Thermodynamic stability of RNA structures formed by CNG trinucleo- Biol Chem. 275(14):10437–10442. tide repeats. Implication for prediction of RNA structure. Biochemistry Neueder A, et al. 2017. The pathogenic exon 1 HTT protein is produced by 44(32):10873–10882. incomplete splicing in Huntington’s disease patients. Sci Rep. 7(1):1307. Chavali S, et al. 2017. Constraints and consequences of the emergence of Okonechnikov K, Golosova O, Fursov M, UGENE Team. 2012. Unipro UGENE: aminoacidrepeats ineukaryotic proteins. Nat Struct Mol Biol. a uniﬁed bioinformatics toolkit. Bioinformatics 28(8):1166–1167. 24(9):765–777. Presnyak V, et al. 2015. Codon optimality is a major determinant of mRNA Ciesiolka A, Jazurek M, Drazkowska K, Krzyzosiak WJ. 2017. Structural stability. Cell 160(6):1111–1124. characteristics of simple RNA repeats associated with disease and their Robertson AL, Bottomley SP. 2010. Towards the treatment of polyglut- deleterious protein interactions. Front Cell Neurosci. 11:97. amine diseases: the modulatory role of protein context. Curr Med Den Dunnen WFA. 2017. Trinucleotide repeat disorders. Handb Clin Chem. 17(27):3058–3068. Neurol. 145:383–391. Saikia M, et al. 2016. Codon optimality controls differential mRNA trans- Duennwald ML, Jagadish S, Giorgini F, Muchowski PJ, Lindquist S. 2006. A lation during amino acid starvation. RNA 22(11):1719–1727. network of protein interactions determines polyglutamine toxicity. Schaefer MH, Wanker EE, Andrade-Navarro MA. 2012. Evolution and Proc Natl Acad Sci U S A. 103(29):11051–11056. function of CAG/polyglutamine repeats in protein-protein interaction Eichinger L, et al. 2005. The genome of the social amoeba Dictyostelium networks. Nucleic Acids Res. 40(10):4273–4287. discoideum. Nature 435(7038):43–57. Takeuchi T, Nagai Y. 2017. Protein misfolding and aggregation as a ther- Fan HC, et al. 2014. Polyglutamine (PolyQ) diseases: genetics to treat- apeutic target for polyglutamine diseases. Brain Sci. 7(12):128. 
ments. Cell Transplant. 23(4–5):441–458. Takeuchi T, Popiel HA, Futaki S, Wada K, Nagai Y. 2014. Peptide-based Faux NG, et al. 2005. Functional insights from the distribution and role of therapeutic approaches for treatment of the polyglutamine diseases. homopeptide repeat-containing proteins. Genome Res. Curr Med Chem. 21(23):2575–2582. 15(4):537–551. Totzeck F, Andrade-Navarro MA, Mier P. 2017. The protein structure con- Hughes RE, Olson JM. 2001. Therapeutic opportunities in polyglutamine text of PolyQ regions. PLoS One 12(1):e0170801. disease. Nat Med. 7(4):419–423. Yates A, et al. 2016. Ensembl 2016. Nucleic Acids Res. Jorda J, Kajava AV. 2010. Protein homorepeats sequences, structures, 44(D1):D710–D716. evolution, and functions. Adv Protein Chem Struct Biol. 79:59–88. Zhou Y, Liu J, Han L, Li ZG, Zhang Z. 2011. Comprehensive analysis of Kraus-Perrotta C, Lagalwar S. 2016. Expansion, mosaicism and interrup- tandem amino acid repeats from ten angiosperm genomes. BMC tion: mechanisms of the CAG repeat mutation in spinocerebellar Genomics 12(1). ataxia type 1. Cerebellum Ataxias 3:20. Krobitsch S, Lindquist S. 2000. Aggregation of huntingtin in yeast varies with the length of the polyglutamine expansion and the expression of chaperone proteins. Proc Natl Acad Sci U S A. 97(4):1589–1594. Associate editor:Mar Alba Genome Biol. Evol. 10(3):816–825 doi:10.1093/gbe/evy046 Advance Access publication March 1, 2018 825 Downloaded from https://academic.oup.com/gbe/article-abstract/10/3/816/4916091 by Ed 'DeepDyve' Gillespie user on 16 March 2018
Genome Biology and Evolution – Oxford University Press
Published: Mar 1, 2018