Conserved noncoding sequences (CNSs) are evolutionarily conserved DNA sequences that do not encode proteins but may have potential regulatory roles in gene expression. CNS in crop genomes could be linked to many important agronomic traits and ecological adaptations. Compared with the relatively mature exon annotation protocols, efﬁcient methods are lacking to predict the location of noncoding sequences in the plant genomes. We implemented a computational pipeline that is tailored to the comparisons of plant genomes, yielding a large number of conserved sequences using rice genome as the reference. In this study, we used 17 published grass genomes, along with ﬁve monocot genomes as well as the basal angiosperm genome of Amborella trichopoda. Genome alignments among these genomes suggest that at least 12.05% of the rice genome appears to be evolving under constraints in the Poaceae lineage, with close to half of the evolutionarily constrained sequences located outside protein-coding regions. We found evidence for purifying selection acting on the conserved sequences by analyzing segregating SNPs within the rice population. Furthermore, we found that known func- tional motifs were signiﬁcantly enriched within CNS, with many motifs associated with the preferred binding of ubiquitous transcription factors. The conserved elements that we have curated are accessible through our public database and the JBrowse server. In-depth functional annotations and evolutionary dynamics of the identiﬁed conserved sequences provide a solid foundation for studying gene regulation, genome evolution, as well as to inform gene isolation for cereal biologists. Key words: conserved noncoding sequences, synteny, purifying selection, phylogenetic footprinting, comparative genomics. Introduction Patterns of sequence conservation among closely related spe- Comparative genomics is an important tool to identify both cies may be used to identify “footprints” of the noncoding coding and regulatory DNA elements in the genome—based regulatory elements to infer their functions (Nelson and on the premise that these elements are under purifying (neg- Wardle 2013). The related methods are collectively called ative) selection that resist mutations during evolution (Tagle “phylogenetic footprinting” (Boffelli et al. 2003). et al. 1988; Duret and Bucher 1997; Freeling and Many conserved noncoding sequences (CNSs) are known Subramaniam 2009). Nonfunctional or neutrally evolving to function as cis-regulatory elements which are involved in sequences are expected to diverge faster than the sequences the regulation of transcription and modulation of chromatin under selective constraints (Carginale et al. 2004; Clauss and structure (Venkataram and Fay 2010; Raatz et al. 2011; Mitchell-Olds 2004; Zhang et al. 2004; Reineke et al. 2011). Zhang et al. 2012; Zhang, Yuan, et al. 2016). CNSs are found The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Genome Biol. Evol. 10(2):473–488. doi:10.1093/gbe/evy006 Advance Access publication January 25, 2018 473 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/473/4824837 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Liang et al. GBE to function as regulatory regions—such as the transcription disruption of DNA-protein stoichiometry (Schnable and factor binding sites and enhancers (Hardison 2000; Turco Freeling 2011). et al. 2012). For example, bilateral conserved regulatory ele- In recent years, plummeting cost of sequencing has led ments (Bicores)(Clarke et al. 2012) are involved as enhancers to a wealth of sequenced genomes, which has accelerated in the vertebrate central nervous system, and are known to the development of sophisticated methods and software drive expression of transcriptional factors like Atf.Those CNSs to identify the CNSs by genome-wide comparison of act like switches of expressed genes, which can turn “on” and closely related genomes and to predict the regulatory “off” of the expression of particular genes at speciﬁc devel- functions of those CNSs. Sequence conservation is a use- opment stages or in speciﬁc tissues. Moreover, they may be ful metric to identify functional coding as well as noncod- involved in the regulation of posttranscriptional process as ing regions of the genome (Loots et al. 2000; Woolfe et al. sequences of microRNAs or small nucleolar RNAs. For in- 2005; Wang et al. 2009). Although there are many studies stance, microRNAs can suppress translation of target genes related to conserved sequences of protein-coding regions by binding to their mRNA and the bipartite coupling of which include various conserved protein domains (Haudry microRNA-target can be preserved over evolutionary time et al. 2013; Hupalo and Kern 2013), study of noncoding (Bentwich et al. 2005; Friedman et al. 2009). Accelerated sequences is still daunting task and mostly understudied, evolution of CNSs, which occurs disproportionately near the especially in plants. genes with particular biological functions, may contribute to The grass family (Poaceae) are the ﬁfth largest plant fam- the divergence of species (Prabhakar et al. 2006). Deletions of ily with 12,000 species of monocotyledonous ﬂowering CNSs could also lead to the divergence of species, with one plants (Soreng et al. 2015). It is also one of the most eco- example of such deletions involves a forebrain subventricular nomically important plant families, including many grain zone enhancer near the tumor suppressor gene that detains crops such as rice, wheat, barley, maize, millet for human growth and DNA-damage-inducible gamma (GADD45G) and forage consumption, and building materials such as (Zerbini et al. 2004; McLean et al. 2011). The deletion of bamboo. Grass genomes share lots of similarities in synteny this noncoding enhancer sequence is closely associated with and collinearity, and could be considered a “single genetic the expansion of speciﬁc brain regions in humans (McLean system” (Freeling 2001). Studies have reported that critical et al. 2011). mutational sites that are of agronomic signiﬁcance of many In contrast to the many examples in animal studies, speciﬁc cereal crops occurred in the conserved regions during do- functions of most of the CNSs found in plants are still largely mestication (Tang et al. 2010). One such example is the unknown to date due to relatively slow progress in annotation gene that controls the seed shattering trait in several grain of such sequences (Freeling and Subramaniam 2009). In both crops (Houston et al. 2013; Tang et al. 2013; Wang et al. plants and mammals, regulatory genes tend to have higher 2015). Identiﬁcation of CNSs through comparative geno- association with CNSs, such as genes that are enriched with mics is a valued resource, which will provide additional func- various transcription factor binding motifs, than other classes tionally sequence sites that have the potential to become of genes (Kaplinsky et al. 2002; Buchanan et al. 2004; King future targets for crop engineering. et al. 2005; Siepel et al. 2005). Regulatory genes in plant Recently, it has become possible to perform the compara- genomestendto be associatedwithfewer andshorter tive genomics in Poaceae on a large scale as the number of CNSs when compared with the mammalian genes at similar sequenced genomes in Poaceae have greatly increased. divergence level and also tend to degrade faster than mam- Whole genome comparisons of another important crop clade, malian CNSs over evolutionary time (Inada et al. 2003; including many of the crucifer genomes, identiﬁed and char- Reineke et al. 2011). acterized over 90,000 CNSs with a large proportion of CNSs Despite the differences in lengths, CNSs present in plant predicted to be involved in transcriptional and posttranscrip- and animal genomes still share functional characteristics tional regulation (Haudry et al. 2013). (Strahle and Rastegar 2008; Burgess and Freeling 2014). In prior studies, several methods were developed to Since the overall structure of most mammalian genomes identify functional elements through the scoring of se- are more conserved, comparative genomics on mammal quence conservation, such as GERP (Cooper et al. 2005) have made substantial progress. It has been estimated which is primarily a column-by-column method, where that 3.5% of the human genome is presumed to be com- column represents an aligned base in the multiple se- prised of noncoding sequences involved in various gene quence alignments. Such scoring method was shown to regulatory processes (Drake et al. 2006). There are many be effective for mammalian genome sequences and con- examples of genes enriched with CNSs that are more likely ceptually similar to PhyloP (Pollard et al. 2010). Another to be retained under evolutionary selection. Genes enriched method, PhastCons (Hubisz et al. 2011), models the ge- with CNSs are more likely to be subfunctionalized following nome as one of the two states, one state of conserved gene duplication (Force et al. 1999) or selection against region and one state of nonconserved region. Each state 474 Genome Biol. Evol. 10(2):473–488 doi:10.1093/gbe/evy006 Advance Access publication January 25, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/473/4824837 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Evolutionary Constraints and Annotation of Conserved Elements across Major Grass Genomes GBE has different substitution rate parameters where Python scripts for performing individual steps and is available PhastCons seeks to estimate while simultaneously consid- on GitHub: https://github.com/liangpingping/CNSpipeline, ering phylogenetic relationships and sequence similarity last accessed January 13, 2018. using a Hidden Markov Model (HMM). Herein, we report the genome-wide high-resolution atlas Calculation of Conservation Score for Every Base and of noncoding regions under selection in the publicly available Identiﬁcation of CNSs in Each Clade grass genomes and other related nongrass monocot genomes We estimated a simple “conservation score” of every single serving as “outgroups.” To our knowledge, this is the largest base. The score is based on the number of species matched to genome-scale CNS mining efforts conducted across a set of this site of reference genome, divided by the total number of plant genomes to date. We identiﬁed and characterized CNSs species that are being compared. For example, a score of 1 in different major clades, providing several “tiers” of conser- suggests that the site is ultraconserved across all species, while vation at various divergence levels including the family, order, 0 suggests that this site is not seen in other species. or the clade level. Finally, we have released the curated CNS Additionally, we aggregated the average conservative score elements through our public database and interactive ge- across each gene. CNSs were identiﬁed as fragments located nome browser tools. beyond the coding sequences in the reference genome that showed score of at least 0.7. Lastly, merged fragments having distance within 3 bp, and we then removed the fragments Materials and Methods that are shorter than 6 bp. Whole-Genome Alignments These parameters were selected based on comparisons of empirical base pair coverage for coding DNA sequence In this study, we used 17 grass genomes, 5 monocotyledons regions, and also based on the studies in Gumucio et al. of nongrass monocot genomes, and Amborella trichopoda,all (1992, 1993). There are other, more sophisticated scoring of which has relatively high quality full genomes sequences schemes such as PhastCons (Siepel et al. 2005), which may available. The genome sequences information is listed in sup- not be as easy to interpret and could be highly variable across plementary table S1, Supplementary Material online. Coding different loci which tend to be problematic for plant genome sequences (CDS) align refers to the overlap of alignment with comparisons. In short, we deﬁned our CNSs as stretches of existing rice (O. sativa ssp. japonica) CDS annotation as deter- sequences that are contained within a signiﬁcant ortholo- mined by intersections using BEDTOOLS (Quinlan and Hall gous, nonrepetitive segment that are conserved in at least 2016). Due to polyploidization in plants that created a large 70% of the species in comparison and at least 6 bp long in number of regions that are paralogous, we found an optimal length. set of local alignments that aim at retrieving orthologous alignments that are descended from the same sequence in last common ancestor of the genomes, through rigorous Rice Expression Data ﬁltering of all sequence alignments. RNA-Seq data on the expression of rice genes were obtained Each genome was split by chromosome or contig sequen- from NCBI Sequence Read Archive, we selected three distinct ces for parallel computation and aligned to O. sativa ssp. tissues in different periods, including 20-day leaves, emerging japonica (reference genome) using LAST v759 (Frith and inﬂorescence, four-leaf stage seedlings (SRA accessions: Kawaguchi 2015). The LAST parameters that we used SRP008821, SRP008821, SRP001787). For each gene by map- were: lastdb option “uMAM8,” and lastal options ping reads to same rice reference genome using the spliced “pHOXD70 e4000 C 2 m100.” Simple repeats read aligner TopHat v2.1.1 (Trapnell et al. 2009), we calcu- were identiﬁed using tantan (Frith 2011) which was built in lated the expression in Fragments Per Kilobase of transcript LAST in order to ﬁnding orthologous sequences more accu- per Million mapped reads (FPKM) using Cufﬂinks rately. Alignments generated by LAST were linked into longer v2.2.1(Trapnell et al. 2010). chains using axtChain (Kent et al. 2003), with chains scored <1,000 removed to retain only the signiﬁcant alignments. GO Term Enrichment The long chains were then assembled into longer stretches of syntenybychainNet (Kent et al. 2003). We then extracted All enrichment and GO terms reported in this paper were the sequences based on the coordinates indicated in the chain calculated using agriGO (Du et al. 2010). The GO annotation and net ﬁles. After all the pairwise alignments between each ﬁle was retrieved through the Rice Genome Annotation of the 22 nonreference genomes against the rice reference, Project (Kawahara et al. 2013). Enrichment was determined we used ROAST (reference dependent multiple alignment using the Fisher’s exact test with correction for multiple test- tool) to join the pairwise alignments of each nonreference ing. Results were considered signiﬁcant at False Discover Rate genome to the rice reference genome according to the tree (q value) <0.01. For simplicity, we only focused on the ontol- topology in ﬁgure 2. Our CNSpipeline is consisted of a set of ogy of Biological Processes (BP). Genome Biol. Evol. 10(2):473–488 doi:10.1093/gbe/evy006 Advance Access publication January 25, 2018 475 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/473/4824837 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Liang et al. GBE well separated. However, differences were observed, such Results as the relationship among the Triticinae species (Aegilops Computational Pipeline for Whole Genome Alignments tauschii, Triticum urartu,and Triticum aestivum) and be- An overview of the pipeline is in ﬁgure 1. We compiled a set tween Ananas comosus and Poaceae. The difference in the of 17 published grass (Poaceae) genomes, 5 nongrass mono- Triticinae species may be caused by several polyploidy cot genomes and Amborella trichopoda, for a total of 23 events in their evolutionary history. In our species tree con- genomes available for comparisons (supplementary table structed from our whole genome alignments, Ananas S1, Supplementary Material online). A set of 23-way genome comosus is closer to other monocots than to Poaceae, al- alignments were generated using rice (O. sativa ssp. japonica) though it is considered in the same Poales order as the as the “reference” genome against which all other genomes grasses (supplementary ﬁg. S1, Supplementary Material (“nonreference” genomes) were aligned. Due to the recur- online). Additionally, shared sequences were divided into ring nature of the ancient polyploidization in plants, there are coding sequences and noncoding sequences according to many regions from nonreference genomes that map to the the annotation of reference genome and the noncoding same reference region. In order to enrich for the sequence tree showed a similar trend, indicating the overall consis- matches that are likely to be orthologous, that is, which are tency between the phylogenetic information in the coding descended from the same sequence in last common ancestor versus noncoding sequences. of the genomes, we only allow a single best region from a nonreference genome to align to a single region of the refer- Variation in the Distribution of Conserved Elements ence genome. The limitation of a single region per matching species is necessary to remove much noise from the single Results showed that different Poaceae genomes vary greatly gene duplications and transposon activities, but largely in terms of genome size, complexity, assembly quality, and ignores the paralogous regions derived after the cereal radia- phylogenetic distance from the rice reference (ﬁg. 2). For in- tion. Some species, like maize or wheat, are known to have stance, genome size was observed to vary from 271 Mb in B. had subsequent polyploidy events follow their divergence distachyon, to 6,483 Mb in T. aestivum (supplementary table with reference genome. Since we mostly searched for CNSs S1, Supplementary Material online). We calculated base pair from the perspective of the reference genome (rice), it does coverage, or total number of base pairs, for both coding and not matter since either region could indicate conservation in a noncoding sequences (ﬁg. 2A). Protein-coding sequences species affected by recent polyploidy. across Poaceae genomes show higher level of conservation, Additionally, we only retained local pairwise alignment with coverage of conserved sequences staying over 55% of all blocks that belonged to the longest sets of collinear blocks, coding sequences (CDS) in rice (ﬁg. 2B). In contrast, the as deﬁned by “chains” and “nets” (Kent et al. 2003). In short, protein-coding sequence coverage for nongrass monocots each genome was aligned to the reference genome using and Amborella trichopoda dropped signiﬁcantly to <40% tuned parameters, followed by chaining, netting, and post- due to their large evolutionary distance from rice. processing into pairwise alignments. Additionally, coverage score of noncoding sequences was three times lower than the CDSs, and the coverage score 0 0 for 5 -UTR was slightly higher than the 3 -UTR (40% and Phylogenetic Tree Construction Based on Whole Genome 48%, respectively). The lowest coverage score was observed Alignments in intergenic regions. It is clear that the coverage of conser- Phylogenetic trees were constructed covering all the included vation mostly decreased with the increasing phylogenetic dis- species in this study to merge pairwise alignments into multi- tance from the reference genome, as expected. ple alignments (Blanchette et al. 2004). Sinceacommon ref- In order to predict conserved regions in different clades, we erence is used, the alignments can be stitched together to calculated a “conservation score” for each base that is equal form multiple sequence alignment blocks in a straightforward to the number of species that contain sequences aligned to manner. For the Poaceae clade, there are a total of 3,759,732 this site of the reference genome. This conservation score is sequence alignment blocks with an average size is 90 bp, with more straightforward to interpret than alternative scoring an average of eight species per alignment block. methods such as PhastCons (Siepel et al. 2005). To deﬁne Building on the multiple alignments, a phylogenetic tree “conserved” sequences, we chose a score threshold based derived from shared sequences can be built (supplementary on the amount of sequence overlapping with the coding ﬁg. S1, Supplementary Material online). The overall topology sequences in the reference genome, as the coding sequences of the tree, constructed from the shared sequences in were largely expected to be conserved. We observed that the genome-wide comparisons, conﬁrmed the expected taxa overlap decreased with the increase of the conservation score relationships with a few minor differences. The clades of (ﬁg. 3A). Based on the base pair coverage for coding DNA monocots, Poales order, Poaceae family, and sequence regions (ﬁg. 2B), any base with a conservation score Bambusoideae–Ehrhartoideae–Pooideae (BEP) subclades are 0.7isdeﬁnedasa conservedsite, whichstrikes a balance 476 Genome Biol. Evol. 10(2):473–488 doi:10.1093/gbe/evy006 Advance Access publication January 25, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/473/4824837 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Evolutionary Constraints and Annotation of Conserved Elements across Major Grass Genomes GBE FIG.1.—Our CNSpipeline to identify conserved sequences through genome comparisons. The steps include: (A) pairwise genome comparisons followed by masking and chaining; (B) merge pairwise alignments into multiple alignments; (C) calculation of conservation score of each base; (D) prediction of CNSs based on our criteria; (E) annotations of CNSs against various genomic elements including known motifs, microRNA, and lncRNA when possible. between sensitivity and speciﬁcity of distinguishing coding PhastCons. Qualitatively, our scoring scheme shows higher versus noncoding regions in our experiment. sensitivity in both coding and noncoding regions (supplemen- Finally, we deﬁne “conserved sequences” separately for tary table S2, Supplementary Material online). In our visual ﬁve different clades, BEP, Poaceae, Poales, monocots, and prooﬁng, PhastCons is inadequate in some regions, such as all, in increasing species coverage and divergence level from showing no conservation in some genic regions, or showing the rice reference genome. The threshold of 0.7 of conserva- uneven conservation scores within known exons, which are tion scorewereused across all clades. inconsistent with the underlying multiple sequence align- ments. For example, the exon in gene LOC_0s01g19340 Comparisons between Our CNSpipeline and PhastCons has extremely low conservation score (supplementary ﬁg. As a form of validation of our method, we have provided S3, Supplementary Material online) while and the conserva- detailed analysis to compare our method with PhastCons. tion scores for the exons in gene LOC_0s01g57870 are quite Overall, the number of CDS shared between our method uneven (supplementary ﬁg. S4, Supplementary Material on- and PhastCons has a reasonably high degree of overlaps line), in both cases contradicting the underlying sequence (60%) in each clade (supplementary ﬁg. S2, alignments. Supplementary Material online). Our method shows higher We chose not to include the comparison to GERP in this coverage of coding sequence in all clades relative to the study since GERP did not appear to scale with the number of Genome Biol. Evol. 10(2):473–488 doi:10.1093/gbe/evy006 Advance Access publication January 25, 2018 477 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/473/4824837 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Liang et al. GBE A B 0.6 0.8 1.0 0 0.2 0.4 T.aestivum A.tauschii T.urartu H.vulgare B.distachyon BEP 100 P.heterocycla O.brachyantha 100 O.indica Poaceae 100 O.sativa E.tef O.thomaeum S.bicolor Z.mays Poales S.viridis S.italica P.hallii P.virgatum A.comosus Monocot M.acuminata 100 E.guineensis All P.dactylifera S.polyrhiza A.trichopoda Fraction of matched coding bases in O.sativa Fraction of matched noncoding bases in O.sativa 0.050 Fraction of matched 5’UTR bases in O.sativa Fraction of matched 3’UTR bases in O.sativa Fraction of matched Intron bases in O.sativa Fraction of matched Intergenic bases in O.sativa FIG.2.—Phylogenetic tree and base pair coverage based on conserved sequences. (A) The phylogenetic tree obtained using shared sequences that are aligned across all genomes. (B) Base pair coverage for coding sequence (CDS) and noncoding regions based on the annotated rice gene models. taxa and sites. In addition, results from GERP scores from a introns. The proportion of sites under selection was particu- 5,000 test set showed an unacceptably low number of con- larly high in introns (22%) and intergenic (16%) (ﬁg. 3C and served domains that could be predicted when compared side- supplementary table S3, Supplementary Material online). We by-side with our conservation score, suggesting that GERP also observed similar trends in other clades (ﬁg. 3). Finally, we was signiﬁcantly underpowered when a large number of spe- recovered a total of 21.12 Mb of CNSs in Poaceae after merg- cies are included, at least when tested empirically in the set of ing close fragments that are within 3 bp from one another, genomes that we selected. and removing the fragments with length <6bp (table 1). The CNSs have an average size ranging from 31 to 55 bp, which is longer than previous studies (Freeling and Subramaniam Identiﬁcation and Distribution of CNSs 2009)(table 1). Analysis of sites under constraints based on distinct clades showed that, at least 12.05% of the rice genome sequence Identiﬁcation of Conserved Genes and Functional (45 Mb) have been evolving under constraint in the Poaceae Enrichment Analysis of Highly Conserved Genes clade, and approximately half of these sequences (20.64 Mb) are located outside the protein-coding regions and hence To determine whether conserved genes in various different considered noncoding (ﬁg. 3C and supplementary table S3, clades showed enrichment of particular functions, we calcu- Supplementary Material online). We divided the constrained latedthe conservedscore for eachgene based on the average noncoding sites into two categories, intergenic sites and gene conservation score of every base within the gene. The distri- 0 0 space that is further consisted of 5 -UTR, 3 -UTR regions, and bution of gene conservation scores showed similar trends 478 Genome Biol. Evol. 10(2):473–488 doi:10.1093/gbe/evy006 Advance Access publication January 25, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/473/4824837 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Evolutionary Constraints and Annotation of Conserved Elements across Major Grass Genomes GBE FIG.3.—Estimation of the fraction of sites under selection in the rice genome. (A) The fraction of how many species in each clade matched to reference aligned to CDS. Coverage decreased with the increase of the conservation score, but it became stable when the conservation score is close to 0.7. (B–F) Breakdown of sites under selection between coding (CDS) and noncoding regions and among different types of noncoding regions in (B) BEP, (C) Poaceae, (D)Poales, (E) monocots, and (F) all selected genomes. Table 1 Summary of CNS Distribution in Different Clades CNS Attributes BEP Poaceae Poales Monocot All Total number 600,632 527,895 533,982 377,266 300,395 Mean length 55 bp 40 bp 40 bp 33 bp 31 bp Median length 33 bp 24 bp 24 bp 20 bp 18 bp Total length 33,026,856 21,117,687 21,608,465 12,633,501 9,336,098 Percentage (%) of CNS in intergenic 40.76 35.25 35.83 31.21 30.31 Percentage (%) of CNS in 5 -UTR 5.10 6.56 6.51 6.44 6.07 Percentage (%) of CNS in 3 -UTR 9.17 10.92 10.73 11.26 10.23 Percentage (%) of CNS in intron 44.96 47.28 46.93 51.09 53.39 among each clade (ﬁg. 4A). The distribution is very heavy both (ﬁg. 4B). Those common genes involved in basic cellular func- close to the “no-conservation” (score¼ 0) and close to “full- tions, catalytic activity, and binding (ﬁg. 4C). Except for com- conservation” (score¼ 1), that is, the score almost qualita- mon genes, there are many speciﬁc conserved genes in each tively separates the two extremal classes of genes. Further clade contrary to other clades, especially for BEP and Poaceae analyses suggested that when we selected the score above, (1,095 and 594 conserved genes, respectively) (ﬁg. 4B). Many say 0.8, for example, the Gene Ontology (GO) results showed of the genes extremely conserved within Poaceae are not so similar enriched classes—so our results do not appear to be conserved through the monocot clade with the distance of too sensitive to the exact cutoffs that we chose, unless we evolution. For example, waxy gene, grain number, and cell chose an unreasonably low or high cutoff, as determined by wall invertase gene related to the production yield and quality the underlying score distribution. appear to be speciﬁc to the Poaceae. We inferred a set of highly conserved genes (with average conservative score of at least 0.9, that is, showing conserva- Conservation Score in Gene Structure and Intergenic tion in 90% of the species that were compared). We also Regions obtained the Gene Ontology (GO) terms of the highly con- served genes among clades. The number of highly conserved Sequences under selective constraints are expected to diverge genes in different clades vary between 5,215 and 12,023 much slower than the nonfunctional sequences over Genome Biol. Evol. 10(2):473–488 doi:10.1093/gbe/evy006 Advance Access publication January 25, 2018 479 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/473/4824837 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Liang et al. GBE A B Conservation Score BEP Poaceae Poales Monocot All FIG.4.—Highly conserved genes among different clades. (A) The distribution of gene conservation scores in different clades. (B) Regions are labeled with their respective clade and number of genes that are considered highly conserved. (C) Functional categories of highly conserved genes. evolutionary time, therefore, we expect the conservation Intronic and intergenic bases showed similar trends in dis- score of the functional sites to be higher than the nonfunc- tinct clades, and bases located near the TSS or TES seem to be tional sites. In order to test this hypothesis, we calculated the under stronger selective pressure than other intronic or inter- score distribution in different types within the gene, upstream genic bases (supplementary ﬁg. S5B and C, Supplementary 1 kb of the transcription start site (TSS), downstream 1 kb of Material online). Additionally, compared with other exons, the the transcription end site (TES) (Alexandrov et al. 2015), and average score of the ﬁrst exon and the last exon are low, and intergenic regions (supplementary ﬁg. S5, Supplementary the average score of UTR is lower than other exons (supple- Material online). The results showed that the distribution of mentary ﬁg. S5D, Supplementary Material online). The the conservation score tend to show similar trends through unevenness of the conservation score reﬂects the functional different clades. Outside protein-coding transcripts, the score landscape of the regions surrounding the coding genes. decreased with the increasing physical distance from the TSS (supplementary ﬁg. S5A, Supplementary Material online). Evidence of Purifying Selection on Conserved Sequences at However, the score was particularly high within protein- the Population Level coding transcripts and the UTR regions which are located near coding sequences in both BEP and Poaceae. However, The conserved sequences were identiﬁed through cross- the sharply increased score within 350 bp of the TSS sug- species comparisons, and were determined to be under gested that most of the regulatory elements are effectively strong purifying selection over evolutionary time scale. We located within those regions in the rice genome. further asked the question whether these sequences continue 480 Genome Biol. Evol. 10(2):473–488 doi:10.1093/gbe/evy006 Advance Access publication January 25, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/473/4824837 by Ed 'DeepDyve' Gillespie user on 16 March 2018 The number of gene Evolutionary Constraints and Annotation of Conserved Elements across Major Grass Genomes GBE to be under strong selection within the rice population. The Relationship between Intraspeciﬁc and Interspeciﬁc We compared our conserved sequences against the SNP- Variation Seek database of single-nucleotide polymorphisms (SNPs) We counted the number of substitutions of each gene on the derived from 3,000 rice genomes (Alexandrov et al. intraspeciﬁc (based on population genetics data) as well as on 2015). the interspeciﬁc level (based on cross-species comparisons). Constrained sequences show a depletion in SNPs fre- The results show a strong positive correlation (Pearson’s quency, which is 1.36-fold lower rate than the genome R¼ 0.43, P value¼ 0) between intraspeciﬁc and interspeciﬁc average. Evidence for purifying selection acting on con- variationinall studied clades (supplementary ﬁg. S6, served sequences was also found in the minor allele fre- Supplementary Material online). quency (MAF). MAF indicates frequency of these SNPs Despite the overall concordance between the level of intra- occurred within the rice population. Extremely low MAF and interspeciﬁc variation, we found that some sites show indicate rare SNPs, which are considered to be less toler- abnormally high level of intraspeciﬁc variation that appear ant to mutations than sites with higher MAF. These SNPs to have only occurred in rice. We found a total of 645 SNPs were binned according to varying MAF categories sites with extremely high conservation score (identical base in (ﬁg. 5A). Additionally, we extracted three types of sites every species compared) but high MAF (high level of mutation based on the conservation: sites that are nonconservative frequency) in rice. They are the sites that have “relaxed (conservation score from 0.0 to 0.2) (ﬁg. 5B); sites that are selection” in rice. For example, one SNP in OsValRS2 caused conservative (conservation score from 0.7 to 0.9) (ﬁg. 5C); virescent to albino phenotypes in seedlings and white panicles sites that are extremely conservative (conservation score at heading but show very strong conservation in other from 0.9 to 1.0) (ﬁg. 5D). This forms the basis of the study genomes at this site (Wang et al. 2016). of the degree of overlaps between cross-species and intra- Other genes that show high intra- and low interspeciﬁc species variation. variations have most of their putative functions related to Previous studies on maize (Yang et al. 2017) and poplar the disease resistance and receptor-like protein kinase. (Zhang, Zhou, et al. 2016) indicated the negative correlation Genes that show high intraspeciﬁc variations may be posi- between the deleteriousness and minor alleles frequency, and tively selected in the rice lineage since their functions may results showed low alleles frequency suggesting purifying se- be related to the ecological adaptations for rice. Although lection, which were enriched within regions showing evi- these genes have high level of intraspeciﬁc variation, they dence of selection and regions of low recombination. Much are conserved in cross-species comparisons, suggesting the of the selective constraints are expected to persist regardless selection on those genes have only recently been relaxed. of the evolutionary scale, as our results showed. The popula- As an example, we illustrate a portion of a rice gene tion frequency of the rare SNPs is clearly correlated with the LOC_Os02g40130 (ﬁg. 6). In LOC_Os02g40130,a total of level of conservation over evolutionary time scale, thus further 555 of SNPs are observed, however, 93 of them show no supporting the functional roles of these sequences, both at variations in interspeciﬁc multiple alignment. These results the level of different species as well as within the rice suggest that those intraspeciﬁc SNPs in rice were resulted diversiﬁcation. from relatively recent positive selection, while being still sub- The most extreme cases of purifying selection are the ject to strong negative selection in other genomes. “invariant sites,” which could be identiﬁed as sites that shared Conversely, there are 106 genes conserved only in the rice identical bases in the cross-species multiple sequence align- lineage but are not conserved in other genomes, with their ment blocks but referred as SNPs in previous studies. For ex- main functions mostly involved in transposon proteins, SCP- ample, SNPs located in gene LOC_Os02g40130 show the like extracellular proteins, and pollen allergen. These genes cross-species “invariant” SNPs that surprisingly show varia- might have a relatively recent origin but somehow obtained tions in the population although mostly in low frequencies new functions and hence under placed under purifying selec- with MAF <1% (ﬁg. 6). We divided the invariant sites into tion. In-depth studies of those “outlier” genes that have ab- four categories according to their level of impacts on gene normal level of intraspeciﬁc variations may ultimately reveal function—high, low, moderate, and modiﬁer, based on interesting rice biology as well as functional innovations that SnpEff (Cingolani et al. 2012). Results showed that 2.14% have occurred in the rice lineage. (1,479/23,811) of total invariant sites have large impacts (moderate to modiﬁer) on gene function (table 2). We also Functional Annotation of CNSs calculated the percentage effects of those invariant sites according to their types and regions where they are present. Many CNSs remained conserved over millions of years of evo- We found that exons, downstream, and upstream regions lution, suggesting that they play vital roles in regulating bio- have high concentration of rare SNPs, mostly matching the logical processes such as growth and development. To distribution of the conserved elements that we predicted understand and study the functions of these CNSs, we studied (table 2). 5,761 rice genes that are located in the 1-kb downstream of Genome Biol. Evol. 10(2):473–488 doi:10.1093/gbe/evy006 Advance Access publication January 25, 2018 481 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/473/4824837 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Liang et al. GBE FIG.5.—Evidence of selection on conserved sites in the rice population. Minor allele frequency (MAF) distribution is shown at single polymorphic sites (SNPs) present in 3,000 rice genomes. When summarized at different clade level, they show similar levels of variation in different conserved categories, where the most conserved sites appear to be under the strongest selective pressure. 2 SNPs with MAF < 0.01 SNPs with MAF = 0.01-0.5 Invariant SNPs LOC_Os02g40130 (gene fragment) FIG.6.—Sequence logo of one fragment of an exemplar gene harboring SNPs positions that illustrate sites with high intraspeciﬁc variation yet low interspeciﬁc variation. Blue stars represent the rare SNPs of MAF<0.01. Red stars represent SNPs of MAF>0.01 but<0.5. Black stars represent cross-species “invariant” SNPs that are consistent throughout the multiple sequence alignment blocks across species. This fragment is located in Chr2: 24,293,421– 24,293,580 of the rice genome. the identiﬁed CNSs. GO enrichment analysis reveals that these transcription factors are identiﬁed in these rice genes, which genes are likely to mediate the biological functions, mostly are highly enriched in the CNS-containing genes. These CNSs related to a variety of regulatory mechanisms (supplementary potentially act as cis-regulators of the genes, often containing table S4, Supplementary Material online), consistent with pre- motifs that are transcription factor binding sites (TFBS) or vious studies reports in vertebrates (Bejerano et al. 2004)and enhancers, that is, providing mechanisms for plants (Inada et al. 2003; Thomas et al. 2007). A total of 634 pretranscriptional regulation. In addition, we identiﬁed 482 Genome Biol. Evol. 10(2):473–488 doi:10.1093/gbe/evy006 Advance Access publication January 25, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/473/4824837 by Ed 'DeepDyve' Gillespie user on 16 March 2018 bits bits Evolutionary Constraints and Annotation of Conserved Elements across Major Grass Genomes GBE Table 2 Cis-Regulating CNSs Are Enriched for Speciﬁc Sequence Number of Effects by Type in the Rice Genomes Based on the Invariant Motifs Sites The study of CNSs enabled the direct identiﬁcation of various Types Count Percentage (%) regulatory motifs for different transcription factors by 3 -UTR variant 915 1.29 highlighting the genomic regions with putative regulatory 5 -UTR premature start codon gain variant 91 0.13 function. We have observed a large number of known motifs 5 -UTR variant 643 0.91 that are conserved in BEP (384), Poaceae (376), Poales (376), Downstream gene variant 19,370 27.37 and monocot (369), respectively (supplementary table S6, Intergenic region 1,054 1.49 Supplementary Material online). We found that motifs were Intron variant 1,262 1.78 signiﬁcantly (chi-squared test, P< 0.01) enriched in the CNS Missense variant 26,790 37.86 Splice acceptor variant 228 0.32 regions versus their occurrences across the entire genome. Splice donor variant 313 0.44 We computed the enrichment of motifs in different types Splice region variant 1,124 1.59 of CNS based on z-score. We found that each type of CNS is Start lost 13 0.02 enriched with at least some motifs registered in the PLACE Stop gained 916 1.29 database (Higo et al. 1998)(ﬁg. 7). Most motifs were associ- Stop lost 9 0.01 ated with the binding preferences of ubiquitous transcription Stop retained variant 1 0.00 factors. For example, the ABRE3 motif contains a G-box and a Synonymous variant 4,201 5.94 novel cis-acting element, the beta-phaseolin promoter activity Upstream gene variant 13,835 19.55 is regulated by the TATA-box motifs (Grace et al. 2004), site-II NOTE.—The invariant sites are the columns with no mismatches based on the elements (SITEIIATCYTC) found in the promoter regions of multiple sequence alignments across the selected genomes that we reconstructed in this study. cytochrome which regulate the phosphorylation (OxPhos)ma- chinery (Welchen and Gonzalez 2005, 2006) and site-IIa (SITEIIAOSPCNA) elements which are involved in meristematic 2,785 bigfoot genes (genes with >4 kb of gene spaces and at tissue-speciﬁc promotor region of auxin-regulated genes least have six CNSs) with most of them containing nucleic acid (Kosugi and Ohashi 1997). All of these motifs are highly binding function (GO: 0005488). enriched in our CNS set. Functional validation is still necessary There are 20% of CNSs identiﬁed in Poaceae family that to follow up on many of the enriched motifs in the CNS are overlapping with repetitive sequences, including at least regions, which have at least indicated putative functions on 51% of those CNSs contain LTRs (40.67%) and DNA trans- the basis of sequence conservations. Since we have identiﬁed poson elements (11.17%). This is a bit unusual as transposon the CNS set through cross-species comparisons, such motifs sequences are generally expected to have a faster substitution are likely to carry out critical functions across a wide range of rate. These transposons or transposon fragments may be on taxa, making them the ideal targets for genetic engineering in the path to become “domesticated” (Sinzelle et al. 2009). a broad range of crops. We identiﬁed 116 known miRNAs that are part of or over- lap our identiﬁed CNSs, indicating that some CNSs were also Visualization of Conserved Elements directly involved in posttranscriptional regulation. For exam- ple, we identiﬁed that a miRNA (MIMAT0022865), which In order to visualize the alignments and the associated func- overlapped with a 55-bp CNS that is conserved across all tional and diversity data directly, we have set up a JBrowse targeted genomes, and had high level of sequence similarity server to host the alignments and conservation tracks with MIR396E in B. distachyon (supplementary ﬁg. S7, obtained in this study (http://angiosperm.org/jsp/ping.html; Supplementary Material online), which was involved in last accessed January 13, 2018) (ﬁg. 8). The JBrowse instance reprogramming leaf growth during drought stress (Mecchia includes separate tracks including the rice reference genome, et al. 2013). rice gene models, a total of 22 pairwise alignments of non- In addition to being located in the vicinity of protein-coding reference genomes against reference genome, SNP variations genes, we found that some CNSs have overlapped with pub- in the 3,000 rice genomes, and conservation score of each lished lncRNAs (supplementary table S5, Supplementary clade that are resulted from this study. The conservation track Material online), and three of them were speciﬁcally (wiggle plot) displays conservation score of each base and the expressed during reproduction (Zhang 2014). For example, conservation score ranges from 0 (no conservation) to 1 (con- lncRNA XLOC_024266 is conserved across 22 monocot served across all genomes compared). For example, gene genomes, with the exception of Spirodela polyrrhiza (supple- LOC_0s02g03220 and LOC_0s02g03230 showed high con- mentary ﬁg. S8, Supplementary Material online). Taken to- servation score across the coding region, but they also show gether, a signiﬁcant portion of the identiﬁed CNSs can be high conservation score in the intron and intergenic region of annotated as components of pre-and posttranscriptional ma- the genes indicating the presence of CNSs (ﬁg. 8). We also chineries in the plant cells. highlight a case of lncRNA that is involved in the sexual Genome Biol. Evol. 10(2):473–488 doi:10.1093/gbe/evy006 Advance Access publication January 25, 2018 483 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/473/4824837 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Liang et al. GBE of the protein. Available methods are lacking for the identiﬁ- cation of CNSs, especially in plants where divergence rate of noncoding sequences is much higher than in the animal genomes. Phylogenetic footprinting and genome-wide com- parisons have been proven to be one of the most general approaches to identify the conserved regulatory sequences in genomes of closely related species. Due to the subfunction- alization following recurring whole genome duplication (WGD) events and massive transposon activities in the grass genomes, it is not straightforward to apply the mainstream algorithm used in the comparisons of mammalian genomes. To investigate the functions of CNSs in grasses, we have developed a computational approach, called CNSpipeline, to exhaustively and accurately identify noncoding DNA elements through the comparisons of whole genomes across species. We have compared our CNSpipeline with PhastCons. Our method is simplistic in that it only counts the multiplicity of the concordant sites in the species included in the compari- son. PhastCons takes an alternative approach to the detection of conservation—instead of scoring individual bases, they al- low information to be aggregated across adjacent sites using a Hidden Markov Model (Hubisz et al. 2011). Overall, our method showed higher sensitivity of genic regions and also in nongenic regions when compared with the PhastCons. In our visual prooﬁng with JBrowse, PhastCons appears inade- quate in many regions, such as underestimating level of con- servation in some exons and shows very uneven conservation score within known conserved exons, when compared side- by-side with our method. Genome-wide comparison of ﬂowering plants has facili- tated the annotation of sequences based on the patterns of conservation and to ﬁnd the novel features that may indicate potential regulatory sequences. In our study, we relied on FIG.7.—List of sequence motifs that are enriched in the CNSs. whole genome alignments across 23 angiosperm genomes Enrichment is deﬁned as z-score >2. Colors in the heat map correspond that focus mostly on the grass family, and estimated conser- to different level of fold enrichment. Sequence motif annotations were vative scores every base in several major clades in monocots. based on the PLACE database (Higo et al. 1998). We successfully extracted a large number of CNSs in BEP (231,263) followed by the Poaceae (184,247), Poales (105,244), monocots (30,441), and angiosperms (24,536) reproductive process located in the conserved intergenic re- with average conservation score >0.9 across all sites. The gion (ﬁg. 8). conservations scores showed similar trends of distribution among each clade where we would not expect much varia- Discussion tion for a qualitative assessment. The main shift appears to Conserved DNA elements still retain high levels of similarity have only involved bins of extremal conservation values for following millions of years of evolution, suggesting that they deep comparisons such as across monocots or across are subject to strong purifying selection. With the increasing angiosperm. availability of sequenced genomes, many tools and methods We discovered CNSs located in distal, upstream, 5 -UTR, have been developed to predict and annotate the protein intragenic, 3 -UTR, and downstream. Many CNSs are associ- coding sequences, using both ab initio predictions, protein ated with various known motifs, miRNA and lncRNA and con- homology as well as transcript evidences (Cantarel et al. sequently participate actively in the pre- and 2008). The coding space, while important, reveal only a por- posttranscriptional machineries. Compared with the methods tion of the gene after all. To better understand all functional that are dependent on experiments, for example, chromatin features of a gene, we need to better characterize and iden- immunoprecipitation (ChIP) and enhancer assay, our in silico tify the cis-regulatory sequences that mediate the expression method is inexpensive and yet still powerful to predict these 484 Genome Biol. Evol. 10(2):473–488 doi:10.1093/gbe/evy006 Advance Access publication January 25, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/473/4824837 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Evolutionary Constraints and Annotation of Conserved Elements across Major Grass Genomes GBE FIG.8.—JBrowse server of the CNS discovery pipeline. Each track represents the alignment between rice (reference) and another query genome. Rectangles represent high-scoring segment pair (HSP) by LAST. The extent of the rectangle indicates the boundaries of the HSP. Chained HSPs outside the coding regions are good candidates of CNSs. As examples, gene LOC_0s02g03220 and LOC_0s02g03230 both show high conservation score across the coding region, but they also show high conservation score in the intron (which contains noncoding CNS) and the intergenic regions suggesting the presence of the noncoding CNSs. potential cis-elements from several species at the whole ge- also indicated the important role of deleterious alleles for in- nome level. Among the inferred CNSs, our results have high complete dominance in explaining heterosis (Yang et al. level of concordance with previously reported binding sites 2017). Our results suggested that, the speciﬁc elements that (Gina et al. 2013). About 90% of the CNSs were identiﬁed we identiﬁed as “outliers”—variant in rice and invariant in by the CNS discovery pipeline we built, suggesting that this other genomes, or conversely, invariant in rice and variant method is effective and accurate. In addition, our functional in other genomes—are particularly interesting targets that il- enrichment analysis reveals that highly conserved genes are lustrate unique aspects in the evolutionary history of the rice involved in a range of house-keeping functions, especially in genome and could have potential application in population basic cellular functions, catalytic activity, and binding, rein- studies and plant breeding. forcing the view that these basic cellular functions are under CNSs conserved across Poaceae differ from animals CNSs the strongest purifying selection. due to their association with putative target genes and distal To investigate the evolutionary history of CNSs in rice, we CNSs are relatively in low numbers and much shorter in length analyzed the SNPs from 3,000 rice genomes and evolutionary in Poaceae than in animals (Margulies et al. 2007). These dynamic analysis of allele frequency showed that there is a differences in the distribution of CNSs across various regions high level of correlation between conservation score and in- may reﬂect the structural differences of introns and exons variant sites within the population, indicating that these con- between plant and animals. Further functional analysis served elements consistently play important adaptive roles showed that genes with CNSs in their upstream 1 kb were within the rice lineage. Similarly, Zhang, Zhou, et al. (2016) enriched in GO categories of regulation, and represent a ma- also showed that deleterious alleles had signiﬁcant effects on jority of the transcription factors in the genome. This result the population dynamics of poplar as mostly challenged by suggests that CNSs are able to rewire existing regulatory net- the anthropogenic climatic changes. Another study on maize works via active or inactive regulatory genes. In addition, Genome Biol. Evol. 10(2):473–488 doi:10.1093/gbe/evy006 Advance Access publication January 25, 2018 485 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/473/4824837 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Liang et al. GBE many of these CNSs overlap with a number of noncoding accessedonGitHub; https://github.com/liangpingping/ RNAs with known function, such as lncRNA and micro-RNA, CNSpipeline. that are currently annotated with important functions in plant genomes. Literature Cited Recent studies of genome wide comparison of rice sug- Alexandrov N, et al. 2015. SNP-Seek database of SNPs derived from 3000 gested a strong positive correlation between the presence of rice genomes. Nucleic Acids Res. 43(D1):1023–1027. CNSs and open chromatin (Zhang et al. 2012). Another work Bejerano G, et al. 2004. Ultraconserved elements in the human genome. showed that Arabidopsis homeologs enriched with 5 -CNSs Science 304(5675):1321–1325. showed lower expression than genes with less CNSs (Spangler Bentwich I, et al. 2005. Identiﬁcation of hundreds of conserved and non- conserved human microRNAs. Nat Genet. 37(7):766–770. et al. 2012). Wang et al. (2015) also found that the OsSPL16- Blanchette M, et al. 2004. Aligning multiple genomic sequences with the GW7 regulatory module, which is located in the upstream of Threaded Blockset Aligner. Genome Res. 14(4):708–715. GW7 gene, affected grain shape and quality, but not the Boffelli D, et al. 2003. Phylogenetic shadowing of primate sequences to production. Furthermore, research on the barley genome ﬁnd functional regions of the human genome. Science identiﬁed a microRNA binding site of HvAP2 gene that af- 299(5611):1391–1394. Buchanan CD, Klein PE, Mullet JE. 2004. Phylogenetic analysis of 5 -non- fected the shape and size of the spike, showing that once coding regions from the ABA-responsive rab16/17 gene family of sor- mutation occurs in the binding site of microRNA, it will ghum, maize and rice provides insight into the composition, change the binding ability with microRNA172 and ultimately organization and function of cis-regulatory modules. Genetics alter the expression of HvAP2 gene to generate more grains 168(3):1639–1654. (Houston et al. 2013). We also identiﬁed these studied bind- Burgess D, Freeling M. 2014. The most deeply conserved noncoding sequences in plants serve similar functions to those in vertebrates de- ing sites as present in our CNS set, conﬁrming the exhaustive- spite large differences in evolutionary rates. Plant Cell 26(3):946–961. ness of our CNS set. Cantarel BL, et al. 2008. MAKER: an easy-to-use annotation pipeline Our large exhaustive CNS data set has provided a valuable designed for emerging model organism genomes. Genome Res. set of resources to associate the constrained potions of the 18(1):188. noncoding genome with biological functions. Speciﬁcally, the Carginale V, et al. 2004. Adaptive evolution and functional divergence of pepsin gene family. Gene 333:81–90. CNSs annotated in this study could help researchers to isolate Cingolani P, et al. 2012. A program for annotating and predicting the and elucidate the regulation of genes for agronomic traits effects of single nucleotide polymorphisms, SnpEff: SNPs in the ge- from many cereal crops with genomes ranging from relatively nome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly small genome (e.g., rice) to large and complex genomes (e.g., (Austin) 6(2):80–92. maize and wheat). Clarke SL, et al. 2012. Human developmental enhancers conserved be- tween deuterostomes and protostomes. PLoS Genet. 8(8):e1002852. Clauss M, Mitchell-Olds T. 2004. Functional divergence in tandemly dupli- Supplementary Material cated Arabidopsis thaliana trypsin inhibitor genes. Genetics 166(3):1419–1436. Supplementary data areavailableat Genome Biology and Cooper GM, et al. 2005. Distribution and intensity of constraint in mam- Evolution online. malian genomic sequence. Genome Res. 15(7):901–913. Drake JA, et al. 2006. Conserved noncoding sequences are selectively constrained and not mutation cold spots. Nat Genet. 38(2):223–227. Du Z, Zhou X, Ling Y, Zhang Z, Su Z. 2010. agriGO: a GO analysis toolkit Acknowledgments for the agricultural community. Nucleic Acids Res. 38(Web Server issue):W64–W70. We thank the Fujian provincial government for a Fujian “100 Duret L, Bucher P. 1997. Searching for regulatory elements in human Talent Plan” award to H.T. This work was supported by the noncoding sequences. Curr Opin Struct Biol. 7(3):399–406. National Key Research and Development Program of China Force A, et al. 1999. Preservation of duplicate genes by complementary, (2016YFD0100305). degenerative mutations. Genetics 151(4):1531–1545. Freeling M. 2001. Grasses as a single genetic system: reassessment 2001. Plant Physiol. 125(3):1191–1197. Authors’ Contribution Freeling M, Subramaniam S. 2009. Conserved noncoding sequences (CNSs) in higher plants. Curr Opin Plant Biol. 12(2):126–132. P.L. and H.T. designed the software. P.L., H.S.A.S., and X.Z. Friedman RC, Farh KKH, Burge CB, Bartel DP. 2009. Most mammalian analyzed results. P.L., H.S.A.S., X.Z., and H.T. wrote the man- mRNAs are conserved targets of microRNAs. Genome Res. uscript. All authors have read and approved the ﬁnal 19:92–105. manuscript. Frith M, Kawaguchi R. 2015. Split-alignment of genomes ﬁnds orthologies more accurately. Genome Biol. 16:106–106. Frith MC. 2011. A new repeat-masking method enables speciﬁc detection Additional Information of homologous sequences. Nucleic Acids Res. 39(4):e23–e23. Gina T, Schnable JC, Brent P, Michael F. 2013. Automated conserved All generated conserved sequence data and annotations can non-coding sequence (CNS) discovery reveals differences in gene con- be accessed by using following link: http://angiosperm.org/jsp/ tent and promoter evolution among grasses. Front Plant Sci. ping.html. The scripts for driving the CNS pipeline can be 4:170–170. 486 Genome Biol. Evol. 10(2):473–488 doi:10.1093/gbe/evy006 Advance Access publication January 25, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/473/4824837 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Evolutionary Constraints and Annotation of Conserved Elements across Major Grass Genomes GBE Grace ML, Chandrasekharan MB, Hall TC, Crowe AJ. 2004. Sequence and Raatz B, et al. 2011. Speciﬁc expression of LATERAL SUPPRESSOR is con- spacing of TATA box elements are critical for accurate initiation from trolled by an evolutionarily conserved 3 enhancer. Plant J. the b-phaseolin promoter. J Biol Chem. 279(9):8102–8110. 68(3):400–412. Gumucio DL, et al. 1992. Phylogenetic footprinting reveals a nuclear pro- Reineke AR, Bornberg-Bauer E, Gu J. 2011. Evolutionary divergence and tein which binds to silencer sequences in the human c and e globin limits of conserved non-coding sequence detection in plant genomes. genes. Mol Cell Biol. 12(11):4919–4929. Nucleic Acids Res. 39(14):6029–6043. Gumucio DL, Shelton DA, Bailey WJ, Slightom JL, Goodman M. 1993. Schnable JC, Freeling M. 2011. Genes identiﬁed by visible mutant pheno- Phylogenetic footprinting reveals unexpected complexity in trans fac- types show increased bias toward one of two subgenomes of maize. tor binding upstream from the e-globin gene. Proc Natl Acad Sci. PLoS One 6(3):e17855. 90(13):6018–6022. Siepel A, et al. 2005. Evolutionarily conserved elements in vertebrate, in- Hardison RC. 2000. Conserved noncoding sequences are reliable guides to sect, worm, and yeast genomes. Genome Res. 15(8):1034–1050. regulatory elements. Trends Genet. 16(9):369–372. Sinzelle L, Izsvak Z, Ivics Z. 2009. Molecular domestication of transposable Haudry A, et al. 2013. An atlas of over 90,000 conserved noncoding elements: from detrimental parasites to useful host genes. Cell Mol sequences provides insight into crucifer regulatory regions. Nat Life Sci. 66(6):1073–1093. Genet. 45(8):891–898. Soreng RJ, et al. 2015. A worldwide phylogenetic classiﬁcation of the Higo K, Ugawa Y, Iwamoto M,Higo H. 1998. PLACE: a database of plant Poaceae (Gramineae). J Syst Evol. 53(2):117–137. cis-acting regulatory DNA elements. Nucleic Acids Res. 26:358. Spangler JB, Subramaniam S, Freeling M, Feltus FA. 2012. Evidence of Houston K, et al. 2013. Variation in the interaction between alleles of function for conserved noncoding sequences in Arabidopsis thaliana. HvAPETALA2 and microRNA172 determines the density of grains on New Phytol. 193(1):241–252. the barley inﬂorescence. Proc Natl Acad Sci U S A. Str€ ahle U, Rastegar S. 2008. Conserved non-coding sequences and tran- 110(41):16675–16680. scriptional regulation. Brain Res Bull. 75(2–4):225–230. Hubisz MJ, Pollard KS, Siepel A. 2011. PHAST and RPHAST: phylogenetic Tagle DA, et al. 1988. Embryonic e and c globin genes of a prosimian analysis with space/time models. Brief Bioinformatics 12(1):41–51. primate (Galago crassicaudatus): nucleotide and amino acid sequen- Hupalo D, Kern AD. 2013. Conservation and functional element discovery ces, developmental regulation and phylogenetic footprints. J Mol Biol. in 20 angiosperm plant genomes. Mol Biol Evol. 30(7):1729–1744. 203(2):439–455. Inada DC, et al. 2003. Conserved noncoding sequences in the grasses4. Tang H, et al. 2013. Seed shattering in a wild sorghum is conferred by a Genome Res. 13(9):2030–2041. locus unrelated to domestication. Proc Natl Acad Sci U S A. Kaplinsky NJ, Braun DM, Penterman J, Goff SA, Freeling M. 2002. Utility 110(39):15824–15829. and distribution of conserved noncoding sequences in the grasses. Tang H, Sezen U, Paterson AH. 2010. Domestication and plant genomes. Proc Natl Acad Sci U S A. 99(9):6147–6151. Curr Opin Plant Biol. 13(2):160–166. Kawahara Y, et al. 2013. Improvement of the Oryza sativa Nipponbare Thomas BC, Rapaka L, Lyons E, Pedersen B, Freeling M. 2007. Arabidopsis reference genome using next generation sequence and optical map intragenomic conserved noncoding sequence. Proc Natl Acad Sci U S data. Rice 6(1):1–10. A. 104(9):3348–3353. Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. 2003. Evolution’s Trapnell C, et al. 2010. Transcript assembly and quantiﬁcation by RNA-Seq Cauldron: duplication, deletion, and rearrangement in the mouse and reveals unannotated transcripts and isoform switching during cell dif- human genomes. Proc Natl Acad Sci U S A. 100(20):11484. ferentiation. Nat Biotechnol. 28(5):511–515. King DC, et al. 2005. Evaluation of regulatory potential and conservation Trapnell C, Pachter L, Salzberg SL. 2009. TopHat: discovering splice junc- scores for detecting cis-regulatory modules in aligned mammalian ge- tions with RNA-Seq. Bioinformatics 25(9):1105–1111. nome sequences. Genome Res. 15(8):1051–1060. Turco G, Schnable JC, Pedersen B, Freeling M. 2012. Automated con- Kosugi S, Ohashi Y. 1997. PCF1 and PCF2 speciﬁcally bind to cis elements served non-coding sequence (CNS) discovery reveals differences in in the rice proliferating cell nuclear antigen gene. Plant Cell gene content and promoter evolution among grasses. Front Plant 9(9):1607–1619. Sci.4:170–170. Loots GG, et al. 2000. Identiﬁcation of a coordinate regulator of interleu- Venkataram S, Fay JC. 2010. Is transcription factor binding site turnover a kins 4, 13, and 5 by cross-species sequence comparisons. Science sufﬁcient explanation for cis-regulatory sequence divergence? 288(5463):136–140. Genome Biol Evol. 2:851–858. Margulies EH, et al. 2007. Analyses of deep mammalian sequence align- Wang S, et al. 2015. The OsSPL16—GW7 regulatory module determines ments and constraint predictions for 1% of the human genome. grain shape and simultaneously improves rice yield and grain quality. Genome Research 17(6):760–774. Nat Genet. 47(8):949–954. McLean CY, et al. 2011. Human-speciﬁc loss of regulatory DNA and the Wang X, Haberer G, Mayer KF. 2009. Discovery of cis-elements between evolution of human-speciﬁc traits. Nature 471(7337):216–219. sorghum and rice using co-expression and evolutionary conservation. Mecchia MA, Debernardi JM, Rodriguez RE, Schommer C, Palatnik JF. BMC Genomics 10(1):1. 2013. MicroRNA miR396 and RDR6 synergistically regulate leaf devel- Wang Y, et al. 2016. WHITE PANICLE1, a Val-tRNA synthetase regulating opment. Mech Dev. 130(1):2–13. chloroplast ribosome biogenesis in rice, is essential for early chloroplast Nelson AC, Wardle FC. 2013. Conserved non-coding elements and cis development. Plant Physiol. 170(4):2110. regulation: actions speak louder than words. Development Welchen E, Gonzalez DH. 2005. Differential expression of the Arabidopsis 140(7):1385–1395. cytochrome c genes Cytc-1 and Cytc-2. Evidence for the involvement Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. 2010. Detection of non- of TCP-domain protein-binding elements in anther-and meristem-spe- neutral substitution rates on mammalian phylogenies. Genome Res. ciﬁc expression of the Cytc-1 gene. Plant Physiol. 139(1):88–100. 20(1):110–121. Welchen E, Gonzalez DH. 2006. Overrepresentation of elements recog- Prabhakar S, Noonan JP, P€ a€ abo S, Rubin EM. 2006. Accelerated evolution nized by TCP-domain transcription factors in the upstream regions of of conserved noncoding sequences in humans. Science nuclear genes encoding components of the mitochondrial oxidative 314(5800):786–786. phosphorylation machinery. Plant Physiol. 141(2):540–545. Quinlan AR, Hall IM. 2016. BEDTools. Current Protocols in Bioinformatics Woolfe A, et al. 2005. Highly conserved non-coding sequences are asso- 47:11.12.11. ciated with vertebrate development. PLoS Biol. 3(1):e7. Genome Biol. Evol. 10(2):473–488 doi:10.1093/gbe/evy006 Advance Access publication January 25, 2018 487 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/473/4824837 by Ed 'DeepDyve' Gillespie user on 16 March 2018 Liang et al. GBE Yang J, et al. 2017. Incomplete dominance of deleterious alleles contrib- Zhang W, et al. 2012. High-resolution mapping of open chromatin in the utes substantially to trait variation and heterosis in maize. PLoS Genet. rice genome. Genome Res. 22(1):151–162. 13(9):e1007019. Zhang Y-C, et al. 2014. Genome-wide screening and functional Zerbini LF, et al. 2004. NF-kappa B-mediated repression of growth arrest- analysis identify a large number of long noncoding RNAs in- and DNA-damage-inducible proteins 45alpha and gamma is essential volved in the sexual reproduction of rice. Genome Biol. for cancer cell survival. Proc Natl Acad Sci U S A. 101(37):13618–13623. 15(12):1–16. Zhang J, Yuan T, et al. 2016. Cis-regulatory elements determine germline Zhang Z, Gu J, Gu X. 2004. How much expression divergence after yeast speciﬁcity and expression level of an isopentenyltransferase gene in gene duplication could be explained by regulatory motif evolution? sperm cells of arabidopsis. Plant Physiol. 170:1524–1534. Trends Genet. 20(9):403–407. Zhang M, Zhou L, Bawa R, Suren H, Holliday JA. 2016. Recombination rate variation, hitchhiking, and demographic history shape deleterious load in poplar. Mol Biol Evol. 33:2899–2910. Associate editor: Yves Van De Peer 488 Genome Biol. Evol. 10(2):473–488 doi:10.1093/gbe/evy006 Advance Access publication January 25, 2018 Downloaded from https://academic.oup.com/gbe/article-abstract/10/2/473/4824837 by Ed 'DeepDyve' Gillespie user on 16 March 2018
Genome Biology and Evolution – Oxford University Press
Published: Feb 1, 2018
It’s your single place to instantly
discover and read the research
that matters to you.
Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.
All for just $49/month
Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly
Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.
Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.
Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.
All the latest content is available, no embargo periods.
“Hi guys, I cannot tell you how much I love this resource. Incredible. I really believe you've hit the nail on the head with this site in regards to solving the research-purchase issue.”Daniel C.
“Whoa! It’s like Spotify but for academic articles.”@Phil_Robichaud
“I must say, @deepdyve is a fabulous solution to the independent researcher's problem of #access to #information.”@deepthiw
“My last article couldn't be possible without the platform @deepdyve that makes journal papers cheaper.”@JoseServera