Analysis of the Draft Genome of the Red Seaweed Gracilariopsis chorda Provides Insights into Genome Size Evolution in Rhodophyta

Analysis of the Draft Genome of the Red Seaweed Gracilariopsis chorda Provides Insights into... Abstract Red algae (Rhodophyta) underwent two phases of large-scale genome reduction during their early evolution. The red seaweeds did not attain genome sizes or gene inventories typical of other multicellular eukaryotes. We generated a high-quality 92.1 Mb draft genome assembly from the red seaweed Gracilariopsis chorda, including methylation and small (s)RNA data. We analyzed these and other Archaeplastida genomes to address three questions: 1) What is the role of repeats and transposable elements (TEs) in explaining Rhodophyta genome size variation, 2) what is the history of genome duplication and gene family expansion/reduction in these taxa, and 3) is there evidence for TE suppression in red algae? We find that the number of predicted genes in red algae is relatively small (4,803–13,125 genes), particularly when compared with land plants, with no evidence of polyploidization. Genome size variation is primarily explained by TE expansion with the red seaweeds having the largest genomes. Long terminal repeat elements and DNA repeats are the major contributors to genome size growth. About 8.3% of the G. chorda genome undergoes cytosine methylation among gene bodies, promoters, and TEs, and 71.5% of TEs contain methylated-DNA with 57% of these regions associated with sRNAs. These latter results suggest a role for TE-associated sRNAs in RNA-dependent DNA methylation to facilitate silencing. We postulate that the evolution of genome size in red algae is the result of the combined action of TE spread and the concomitant emergence of its epigenetic suppression, together with other important factors such as changes in population size. Rhodophyta, Gracilariopsis chorda, DNA methylation, transposable element suppression, small RNAs Introduction How genome size evolves over evolutionary timescales (i.e., hundreds of millions of years) is a fundamental question posed by many genome biologists. The advent of next-generation DNA sequencing has led to the accumulation of vast amounts of genome data that reveal enormous variation in genome size and gene number among eukaryotes. This variation is however not strictly correlated with developmental or morphological complexity. Once known as the “C-value paradox,” it has become clear that genome size expansion primarily reflects the proliferation of noncoding DNA such as transposable elements (TEs), not the growth in organismal complexity. Relatively “simple” organisms such as unicellular dinoflagellates can have massive genomes, several gigabases in size, when compared with a “complex” plant such as Arabidopsis thaliana (120 Mb genome). In addition, the expansion of cis-acting regulatory elements (e.g., number of transcription factors) is a more important predictor of organismal complexity than the number of protein-coding genes (Hahn and Wray 2002; Greilhuber et al. 2005). In photosynthetic eukaryotes, genome size varies from several megabase pairs (Mb) to 148 Gb (the flowering plant Paris japonica) (Pellicer et al. 2010; Garcia et al. 2014; Hidalgo et al. 2017). In many land plants, polyploidization is one of the major contributors to genome expansion through gene/genome duplication, which is estimated to have occurred in 30–80% of angiosperms (Blanc and Wolfe 2004). Another important mechanism in genome evolution is TE-derived genome expansion. In land plants, this process significantly impacts the size and structure of genomes, as well as the mutation rate, extent of gene movement, and gene regulation (Bennetzen and Wang 2014), in the absence of polyploidization (Hawkins et al. 2006; Piegu et al. 2006). Genome-size expansion via the accumulation of TEs has been reported in many eukaryotes, including the multicellular red seaweed Chondrus crispus (Collén et al. 2013). TEs are classified on the basis of their transposition intermediate: RNA (retrotransposons, class I) and DNA (DNA transposons, class II) that carry out copy-and-paste and cut-and-paste activities, respectively, which may in some cases affect genome stability (e.g., via double-strand DNA breaks) with the generation of flanking repeated DNA elements (Slotkin and Martienssen 2007). Previously, TEs were regarded primarily as “parasites” that engage the host in an evolutionary arms race, whereby the host tries to control TE activity through mechanisms, such as chromatin remodeling and silencing through RNA interference (RNAi) (McClintock 1950; McClintock 1984; Okamoto and Hirochika 2001; Vaucheret and Fagard 2001; Slotkin and Martienssen 2007; Bennetzen and Wang 2014; Goodier 2016). With the accumulation of genome data, it has become clear however that the insertion and/or recombination of TEs may have positive impacts by providing a source of evolutionary novelty. Such beneficial outcomes include alternative splicing, epigenetic control of gene expression, transduction, duplication, and recombination (Hua-Van et al. 2011; Galindo-González et al. 2017). Countermeasures to TE activity in plant genomes are known as posttranscriptional gene silencing (PTGS) and transcriptional gene silencing (TGS) (Hirochika et al. 2000; Okamoto and Hirochika 2001; Vaucheret and Fagard 2001; Verbsky and Richards 2001; Slotkin and Martienssen 2007). PTGS is a small RNA (sRNA)-dependent process that leads to the degradation or repression of translation of target mRNAs in the cytoplasm. TGS is also sRNA-dependent, but induced by chromatin remodeling. These epigenetic modifications involve the RNA-dependent DNA methylation (RdDM) pathway (Matzke and Mosher 2014; Matzke et al. 2015). RdDM is not only related to TE silencing but also plays several crucial roles in the cell cycle (e.g., reproduction) (Garcia-Aguilar et al. 2010; Olmedo-Monfil et al. 2010; Singh et al. 2011), stress responses (Ou et al. 2012; Tricker et al. 2012; Popova et al. 2013), and intracellular interactions (Melnyk et al. 2011) through the control of gene expression (Matzke and Mosher 2014). Despite these important roles in genome complexity and evolution, DNA methylation has not yet been studied in red algae (Rhodophyta). The Rhodophyta are members of the supergroup Archaeplastida that also includes the Glaucophyta and Viridiplantae (green algae plus land plants). The ancestor of this supergroup putatively captured and retained a cyanobacterial cell through primary endosymbiosis that eventually gave rise to the canonical plastid (Bhattacharya et al. 2004; Keeling 2010). Despite having a common ancestor, the three “primary” (i.e., Archaeplastida) algal lineages have quite different gene inventories. For instance, all available red algal genome data indicate a relatively modest (i.e., ∼5K–13K) gene inventory in both morphologically “simple” unicellular mesophilic, extremophilic, and filamentous taxa as well as in large, multicellular seaweeds (Matsuzaki et al. 2004; Bhattacharya et al. 2013; Collén et al. 2013; Schönknecht et al. 2013; Brawley et al. 2017). In contrast, the Viridiplantae shows high variability in gene number from green algae (7K–16K genes) to land plants (20K–90K genes) (Arabidopsis Genome Initiative 2000; Derelle et al. 2006; Merchant et al. 2007; Rensing et al. 2008; Worden et al. 2009; Hori et al. 2014). The single, available glaucophyte genome from the unicellular Cyanophora paradoxa encodes ∼30K genes (Price et al. 2012). It has been proposed that the reduced gene inventory in red algae is the result of two ancient (>1 Ga) phases of genome reduction: The first occurred in the stem lineage of Rhodophyta and the second in the ancestor of the extremophilic Cyanidiophytina (Qiu et al. 2015,, 2017). However, the history of gene inventory evolution in red algae vis-à-vis genome size-change remains poorly understood, particularly for the multicellular red seaweeds (the Florideophyceae) for which limited genome data are available. To fill this gap in our knowledge, we sequenced and analyzed the genome of the economically important agar-producing multicellular red seaweed, Gracilariopsis chorda (Rhodophyta, Florideophyceae). With these data in hand, we asked three questions about the evolutionary history of red algal genomes: 1) What role do repeats and TEs play in explaining genome size variation, 2) what is the history of genome duplication and gene family expansion/reduction in Rhodophyta, and 3) as the focus of our study, is there evidence for TE suppression in red algae? We found no evidence of polyploidization in Rhodophyta, and gene families have generally remained constrained in size. Assessment of DNA methylation in G. chorda using bisulfite sequencing (Li and Tollefsbol 2011) shows that 8.3% (7.6 Mb) of DNA is methylated and that 96% (10,470/10,806 genes) of coding sequences (CDSs) contain methylated cytosines. We predicted a total of 489 genes that contain methylated CpG islands. We also found evidence of sRNA-dependent TE suppression in G. chorda. Our analyses provide insights into the evolutionary trajectory of red algal genomes, demonstrating that they are characterized by constrained gene inventories and a TE suppression pathway that appears to be typical for eukaryotes. Results and Discussion Repeat and TE Content in Red Algae The primary genome assembly of G. chorda was generated using single-molecule real-time (SMRT) sequencing (PacBio) with bases corrected using Illumina data (Hi-Seq 2000) (see Materials and Methods). The G. chorda genome is ∼92.1 Mb in size (49.26% G + C-content, comprised 1,211 contigs, N50 = 220.2 kb; supplementary table S1, Supplementary Material online) and encodes a predicted inventory of 10,806 protein-coding genes (number of exons: 13,274; avg. exon/coding genes: 1.23; number of introns: 2,468; supplementary table S2, Supplementary Material online). Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis identified 87.6% of the eukaryotic BUSCO conserved gene set (supplementary table S3, Supplementary Material online). On the basis of the comparative analysis of all available red algal genomic data, we find large differences in genome sizes between unicellular red algae (13.7–19.4 Mb) and multicellular red seaweeds (i.e., G. chorda: 92.1 Mb, C. crispus: 104.8 Mb, and Porphyra umbilicalis: 87.7 Mb) (Matsuzaki et al. 2004; Bhattacharya et al. 2013; Collén et al. 2013; Schönknecht et al. 2013; Brawley et al. 2017). A total of 56.4 Mb (61.2%) of the G. chorda genome comprised TEs, as predicted by the RepeatModeler pipeline (see Materials and Methods), which is comparable with 61.9 Mb (59.0%) in C. crispus, but higher than the 38.2 Mb (43.5%) of TEs in the P. umbilicalis genome (supplementary table S4, Supplementary Material online). In contrast to multicellular seaweeds, unicellular red algae contain a smaller proportion of TEs in their genomes: That is, Galdieria sulphuraria (2.2 Mb; 16.9%), Cyanidioschyzon merolae (3.4 Mb; 21.8%), and Porphyridium purpureum (0.7 Mb; 3.6%) (supplementary table S4, Supplementary Material online). On the basis of these results, we surmise that TE had a major impact on genome size expansion in multicellular red algae (i.e., the two florideophycean taxa, G. chorda and C. crispus) when compared with unicellular lineages. Among the florideophycean TEs, long terminal repeat (LTR) elements (60–62%) and DNA repeats (10–22%) are the two major components. Non-LTR retrotransposon sequences (e.g., Long interspersed nuclear elements and Short interspersed nuclear elements; LINEs and SINEs) are also present, but at lower frequencies (LINEs: ≤1.9% and SINEs: ≤0.08%; supplementary table S4, Supplementary Material online). LTRs are derived from RNA through the “copy-and-paste” activity initiated by the gag (group antigen) and pol (reverse transcriptase) genes (Slotkin and Martienssen 2007). These transposon families may have increased red algal genome size through duplication and/or transposition, as reported in land plants (supplementary table S5, Supplementary Material online) (Hawkins et al. 2006; Piegu et al. 2006; Slotkin and Martienssen 2007; Collén et al. 2013). To estimate the relative divergence values for individual TEs, Kimura evolutionary distances (Kimura 1980) were calculated between consensus and individual TE sequences. Because mutations accumulate over time, the genetic distances between TE families can represent their evolutionary history (i.e., recently or more anciently diverged TE). Except for the case of C. merolae, there are recently diverged major peaks in the different TE families (fig. 1A). For instance, the florideophycean-specific LTRs contributed >25% of the total TE sequences (G. chorda: 25.6% and C. crispus: 25.3%) within the range of recent divergence (Kimura distance = 0–5). This high proportion of LTRs is distinct from other red algal species (P. umbilicalis: 12.4%, P. purpureum: 0.7%, C. merolae: 0.0%, and G. sulphuraria: 6.2%). In contrast, the pattern of florideophycean TE expansion is different from its sister taxa such as P. umbilicalis (Bangiophyceae) that contains a large proportion of “unknown” repeats. These data support the idea that red algae underwent different pattern of TE sequence divergence, with strong evidence of LTR-derived genome expansion in the Florideophyceae. Fig. 1. View largeDownload slide Key features of red algal genomes. (A) Frequency and distribution of Kimura distances based on comparisons of repeated elements. (B) Genome size, repeated element, and gene number. The acronyms are as follows: Cyanid. is the extremophilic Cyanidiophytina, Porph. is Porphyridiophyceae, Bang. is Bangiophyceae, and Florid. is Florideophyceae. (C) Structural feature of gene density (number of genes/genome size) and gene clustering (average intergenic distance/median intergenic distance). (D) Heatmap of BUSCO genesets in red algae with their average copy numbers. (E) OGFs in red algae. GAL, Galdieria sulphuraria; CYS, Cyanidioschyzon merolae; POP, Porphyridium purpureum; PUM, Porphyra umbilicalis; CHC, Chondrus crispus; GRC, Gracilariopsis chorda. Fig. 1. View largeDownload slide Key features of red algal genomes. (A) Frequency and distribution of Kimura distances based on comparisons of repeated elements. (B) Genome size, repeated element, and gene number. The acronyms are as follows: Cyanid. is the extremophilic Cyanidiophytina, Porph. is Porphyridiophyceae, Bang. is Bangiophyceae, and Florid. is Florideophyceae. (C) Structural feature of gene density (number of genes/genome size) and gene clustering (average intergenic distance/median intergenic distance). (D) Heatmap of BUSCO genesets in red algae with their average copy numbers. (E) OGFs in red algae. GAL, Galdieria sulphuraria; CYS, Cyanidioschyzon merolae; POP, Porphyridium purpureum; PUM, Porphyra umbilicalis; CHC, Chondrus crispus; GRC, Gracilariopsis chorda. Although limited genome data are available, we compared shared TEs based on their phylogenetic relationships. We used repeat prediction (i.e., RepeatModeler pipeline) with the combined gnomes from each phylogenetic internode. For example, all six red algal genomes were analyzed with their species tag to determine the distribution of shared TEs (supplementary fig. S1A, Supplementary Material online). These results show that florideophycean TEs have recently expanded, although some families (e.g., LTR elements) existed in their common ancestor (supplementary fig. S1B, Supplementary Material online). Analysis of the Gene Inventory in Red Algae and Other Archaeplastida The gene inventory is thought to evolve slowly and conservatively when compared with genome size (Eddy 2012). The gene inventory is thought to evolve slowly and conservatively when compared with genome size (Eddy 2012). The number of predicted genes ranges from 4,803 to 13,125 in available red algal genomes (fig. 1B). Although the genome sizes of multicellular species are 5-fold larger than in unicellular species, they only encode 1.1- to 2.7-fold more genes (i.e., C. crispus: 9,603, G. chorda: 10,806, and P. umbilicalis: 13,125 genes) than do unicellular taxa (i.e., C. merolae: 4,803, G. sulphuraria: 7,174, and P. purpureum: 8,355 genes). The limited gene number in red algae becomes more apparent than their sister taxa, the Viridiplantae. Gene numbers in Viridiplantae genomes gradually increase from 7.8K to 14.4K in unicellular chlorophytes through 16.1K in a charophyte alga to 20–59K in land plants along with dramatic genome size expansions (i.e., from 12–120 to 120–761 Mb, respectively; supplementary fig. S2A, Supplementary Material online). In comparison, the genome of the glaucophyte Cyanophora paradoxa encodes ∼30K genes (Price et al. 2012). A novel PacBio assembly of this species shows that the predicted gene numbers do not change but the genome size is larger than previously estimated and is ∼110 Mb (Price DC and Bhattacharya D., unpublished data). Therefore, although there is still overall limited taxon sampling of Archaeplastida, only red algae appear to encode a relatively low gene number. Overall, significant differences exist with respect to both genome size (P-value = 0.019) and gene number (P-value = 0.007) between red algae and Viridiplantae (Wilcoxon rank sum test). Red algal genomes are also depauperate in terms of spliceosomal introns with C. merolae, G. sulphuraria, P. purpureum, P. umbilicalis, C. crispus, and G. chorda having intron densities of 1.24, 0.02, 0.009, 0.24, 0.19, and 0.026 introns/kb, respectively. Although intron density is relatively high in the unicellular red alga C. merolae and relatively low in the unicellular green alga O. tauri, this value over all available red algal genomes is statistically smaller than in Viridiplantae (P-value = 0.029; Wilcoxon rank sum test): O. tauri (0.18), C. reinhardtii (1.70), K. flaccidum (0.93), P. patens (1.64), A. trichopoda (1.17), A. thaliana (3.63), G. raimondii (0.55), and O. sativa (0.28). We compared gene density among available red algal genomes (fig. 1B). The G. chorda genome shows a relatively low gene density (number of genes per kb = 0.11), similar to other macroalgal species, C. crispus (0.09) and P. umbilicalis (0.15), which contain a large proportion of TEs (fig. 1A and B). Among Viridiplantae, the moss (P. patens), the shrub (A. trichopoda), rice (O. sativa), and cotton (G. raimondii) also contain a large proportion of TEs and a relatively low gene density. In contrast, the unicellular green alga C. reinhardtii shows a relatively small proportion of TEs and low gene density similar to florideophycean red algae (fig. 1C and supplementary fig. S2B, Supplementary Material online). Interestingly, gene density is not significantly different between both of these Archaeplastida lineages (P-value = 0.282; Wilcoxon rank sum test), which differs from the significant differences (P-value < 0.05) in genome size and gene number. In red algae, there is a negative correlation between the proportion of TEs and gene density (linear regression P-value = 0.015 and R2 = −0.8048) that differs from Viridiplantae (P-value = 0.066 and R2 = 0.4561). The negative correlation observed for red algal genomes may be explained by the relatively uniform (i.e., low) gene number combined with genome size growth. Because highly clustered genes were first reported in the red seaweed C. crispus when compared with the unicellular red alga C. merolae genome (Collén et al. 2013), we reevaluated this feature using our larger collection of red algal genomes. A higher gene clustering value (average/median intergenic distances) reflects shorter intergenic regions between genes (i.e., a right-skewed distribution). The G. chorda genome shows a relatively low degree of gene clustering (2.78) among red algae, together with C. merolae (1.54), P. purpureum (1.72), and P. umbilicalis (4.1). In contrast, highly clustered gene distributions are present in C. crispus (8.25) and G. sulphuraria (13.0) that result from a larger amount of short intergenic regions (see Collén et al. 2013). This suggests that the high degree of gene clustering found in the C. crispus and G. sulphuraria genome were independently derived rather than being an ancestral shared feature. The gene clustering values in red algae, however, are not significantly different from those in Viridiplantae (P-value = 0.228; Wilcoxon rank sum test) but the degree of gene clustering in the latter is more uniform (avg. = 2.13, SD = 0.37) than in red algae (avg. = 5.23, SD = 4.53) (supplementary fig. S2B, Supplementary Material online). Then, we focused on the conserved eukaryotic BUSCO gene set in eukaryotes (Simão et al. 2015). BUSCO genes generally do not form complex gene families (i.e., paralogs); therefore, they provide a useful marker for assessing the extent of genome-wide gene conservation and duplication. Including the gene inventory of G. chorda, we mapped all proteins from the three Archaeplastida lineages onto the 429-gene BUSCO data set (Simão et al. 2015). The average (avg.) and maximum (max) duplicated copy number of BUSCO gene sets are 1.32–1.69 and 9–25, respectively, in red algae (fig. 1D and supplementary table S6, Supplementary Material online), which is similar to that in C. paradoxa (Glaucophyta) (avg. = 1.37, max = 17). In contrast, the genomes of two lineages of Viridiplantae show different patterns: Green algae and charophytes (both aquatic) exhibit a similar duplicated copy number of BUSCO gene set as red algae (avg. = 1.37–1.79, max = 9–26); however, land plants contain a higher copy number of gene sets (avg. = 2.18–5.66, max = 39–133) (supplementary fig. S2C and table S7, Supplementary Material online). The patterns of duplicated BUSCO gene number are significantly different in aquatic algae when compared with land plants (P-values < 0.001; Wilcoxon rank sum test). These results suggest that the aquatic algae (e.g., red algae, glaucophytes, green algae, and charophytes) retain more conserved copy numbers of BUSCO genes than do land plants. The latter underwent significant expansion of the core gene family through duplications (i.e., after their split from green algae and charophytes). We extended the gene inventory analysis to all protein-coding regions using OrthoFinder (Emms and Kelly 2015). The numbers of orthologous gene families (OGFs) shared by all taxa were counted, and the common OGFs were regarded as the ancestral gene families. The six red algal species share 1,771 OGFs with an additional 1.1- to 1.4-fold duplicated genes for each species (1,993–2,548 genes; fig. 1E and supplementary table S8, Supplementary Material online). In the case of the eight Viridiplantae species that shared 2,732 OGFs, green algae and charophytes show 1.1- to 1.4-fold duplicated OGFs (i.e., 3,210–3,899 genes), whereas land plants contain 2.3- to 6.0-fold duplications (i.e., 6,559–16,562 genes) (supplementary fig. S2D and table S9, Supplementary Material online). From an extended OGF analysis including red, green, and glaucophyte species, we found 1,121 Archaeplastida OGFs (supplementary tables S10 and S11, Supplementary Material online). For these OGFs, the lower copy numbers are present in red algae (1.39–1.89 ± 0.88–3.12; ± SD) and other algal groups (green algae and charophytes: 1.5–2.2 ± 1.43–4.66, and glaucophytes: 2.48 ± 4.96) when compared with land plants (4.00–10.41 ± 9.46–33.66; supplementary fig. S3A, Supplementary Material online). Paralogy rates within Archaeplastida OGFs (supplementary fig. S3A, Supplementary Material online) were significantly different between land plants and algae (P-values ≤ 0.001; Wilcoxon rank sum test). Another result that is in agreement with the OGF work is the gene family gain and loss analysis of Archaeplastida using OrthoFinder results (Dollo parsimony analysis performed by the Count program; Csűös 2010). The red algal groups show lower gene family gains when we compare net gene gain and loss (supplementary fig. S3B, Supplementary Material online): That is, land plants have gained several thousand gene families (2,004–3,840), whereas multicellular red algae have lost between 188 and 1,683 gene families, similar to unicellular red algae (651–1683 gene families). In the case of green algae, these taxa have lost −984 gene families or gained +230 gene families. Interestingly, there is a larger gene family gain in the charophyte alga K. flaccidum (+1,900), comparable with the aggregated number of gene families (6,872) in one land plant species (i.e., P. patens 6,976; supplementary fig. S3B, Supplementary Material online). It was reported that K. flaccidum contains many land plant-associated gene families that encode fundamental functions present in the latter (Hori et al. 2014). We also identified 1,136 gene families present in all land plants and in K. flaccidum but absent in green algae (supplementary fig. S2D, Supplementary Material online). Although K. flaccidum provides an exceptional case, the aggregated gene numbers in land plants (6,976–8,812) and in aquatic algae (red, glaucophyte, green algae, and charophytes; 3,289–6,872) are significantly different (P-value < 0.001, Wilcoxon rank sum test; supplementary fig. S3B, Supplementary Material online). Moreover, the aggregated values within red algae are significantly different from that in Viridiplantae (P-value < 0.005; Wilcoxon rank sum test), although one bacterium-sized green alga, O. tauri, contains a relatively small number of gene families (3,988). Taken together, these data indicate that red algal lineages contain a less modified (i.e., more conserved) gene inventory than their sister lineages. We recognize however that future genome sequencing projects will undoubtedly change (i.e., likely increase) the size of the gene inventory in red algae because of improved gene predictions and annotations as the database grows for this anciently diverged eukaryotic lineage. In particular, generation of RNA-seq data from different taxa grown under a variety of culture conditions will expand the inventory of validated coding regions in Rhodophyta and provide insights into potential gene functions. Insights into G. chorda-Specific Gene Duplications Lineage-specific gene duplication is one of the major driving forces in eukaryotic genome evolution and is well described in land plants and vertebrates (Glasauer and Neuhauss 2014; Rensing 2014). To better understand genome evolution in G. chorda, we analyzed lineage-specific gene duplications. Using a k-mer analysis (Jellyfish program with 17-mer; Marçais and Kingsford 2011; supplementary fig. S4, Supplementary Material online), we were unable to find evidence of genome-wide duplications (e.g., polyploidy), consistent with the BUSCO and OGF analyses. Although one potential polyploidization event was proposed to have occurred in coralline red algae (based on DNA fluorescence using DAPI staining and the RBC method; Kapraun 2005), and the existence of multiple nuclei in cells has been reported (Goff and Coleman 1986; Cole 1990; Goff and Coleman 1990), there are no known cases of polyploidization-derived genes or cases of genome-level massive gene duplications in red algae. To achieve a more fine-grained perspective on G. chorda-specific gene duplications, we designed a strict screening process (supplementary figs. S5 and S6 and tables S12 and S13, Supplementary Material online; see Materials and Methods). Through this approach, we found two highly duplicated G. chorda-specific genes (i.e., six copies of OG172 and three paralogous copies of OG1241), the high-light-inducible protein (HLIP) family, because these genes contain the same domain (cl14543: light-harvesting-like protein 3). HLIPs have been previously characterized in Pyropia yezoensis (Bangiophyceae) and show similarity to light-harvesting chlorophyll a/b-binding proteins (Montané et al. 1999; Kong et al. 2012). On the basis of phylogenetic analysis and conserved domain evidence (this study; Kong et al. 2012), HLIP genes originated from cyanobacteria and acquired a transit peptide for plastid targeting. This family is highly duplicated (nine copies) in G. chorda (fig. 2A and B and supplementary fig. S6, Supplementary Material online) (Lee et al. 2016) due to a recent expansion, because sequences of the duplicated flanking regions are nearly identical across members (fig. 2C). To understand why this family was expanded, we surveyed their gene expression profiles (supplementary fig. S7, Supplementary Material online). As expected, HLIPs in G. chorda are highly expressed under several stress conditions (i.e., high temperature, osmotic stress, the light-dark transition, and cold stress; supplementary fig. S7, Supplementary Material online). Cyanobacterial HLIPs are upregulated in high light as well as under cold stress, osmotic stress, salt stress, UV-B treatment, and hydrogen peroxide treatment (Funk and Vermaas 1999; Bhaya et al. 2002; Teramoto et al. 2004; Jantaro et al. 2006). Therefore, the duplicated HLIPs of G. chorda may play a role in dealing with abiotic stress. Fig. 2. View largeDownload slide Duplicated genes in the Gracilariopsis chorda genome that encode HLIPs. (A) HLIP copy number in the nuclear genomes of red algae (tree topology is based on reference; Lee et al. 2016). (B) Red algal HLIP alignment and identified domain regions. Asterisks indicate proteins lacking a transit peptide. (C) Flanking regions of G. chorda HLIP genes. Fig. 2. View largeDownload slide Duplicated genes in the Gracilariopsis chorda genome that encode HLIPs. (A) HLIP copy number in the nuclear genomes of red algae (tree topology is based on reference; Lee et al. 2016). (B) Red algal HLIP alignment and identified domain regions. Asterisks indicate proteins lacking a transit peptide. (C) Flanking regions of G. chorda HLIP genes. DNA Methylation Patterns in G. chorda DNA methylation (cytosine methylation; 5-methylcytosine; 5-meC) is widely studied in land plants, animals, fungi, and bacteria and is involved not only in epigenetic regulation of gene expression but also genome structure and integrity vis-à-vis chromatin remodeling (Martienssen and Colot 2001; Zilberman et al. 2007; Feng et al. 2010; Law and Jacobsen 2010; Zemach et al. 2010; Capuano et al. 2014; Huff and Zilberman 2014). We generated methylome data from G. chorda using bisulfite sequencing (see Materials and Methods) (Li and Tollefsbol 2011). A total of 7.6-Mb methylated DNA sites were detected (8.3% of a total 92.1-Mb nucleotide sequences; 33.4% of a total 22.7-Mb cytosine sequences) that comprised CHH (3.9 Mb; 51.3% of 7.6 Mb), CHG (1.9 Mb; 25.7%), and CG (1.7 Mb; 23.0%) methylation (H = A, C, or T; fig. 3A and supplementary table S14, Supplementary Material online). The proportion of DNA methylation can vary in different species even in different tissues (Banyushin 2006). The genome-wide proportion of cytosine methylation in G. chorda (8.3% of 92 Mb genome) is lower than that in Arabidopsis thaliana (14% of 120 Mb genome) but higher than Mus musculus (7.6% of 2.8 Gb genome), Drosophila melanogaster (<0.1% of 140 Mb genome), and Escherichia coli (2.3% of 4.6 Mb genome) (Capuano et al. 2014). Interestingly, around 50% of methylated cytosines are present in TEs (3.7 Mb; 48.64% of 7.6 Mb), and the remainder is located in gene bodies (1.7 Mb; 22.34%) and other genomic regions, including promoters (2.2 Mb; 29.01%) (supplementary fig. S8, Supplementary Material online). Most protein-coding sequences (i.e., 10,470/10,806 CDSs; 96.8%) contain methylated cytosines. Among these, the majority includes 10–20% methylated cytosines, although there are 336 unmethylated CDSs (bar graph in fig. 3B and supplementary table S15, Supplementary Material online). To test the consistency of the signal from varied data sets, we sorted randomly sampled CDSs (20% of the total CDSs) and plotted their frequencies using 1,000 iterations (red line in fig. 3B). The majority of randomly sampled CDSs also comprises 10–20% methylated CDS bodies. The pattern of upstream and gene body methylation is similar across the three types of targeted DNAs, with CHH having the highest frequency (colored lines in fig. 3C). The distribution of DNA methylation level (%) is about 3–5% in gene bodies (dot plots in fig. 3C). DNA methylation level (%) of the CHG type (green dots) is about 4–5% higher than the CHH and CG types (fig. 3C). In the case of land plants, a specific gene family (i.e., chromomethylase proteins) is closely related to gene body methylation, although the general functions of gene body methylation are still poorly understood (Bewick and Schmitz 2017; Bewick et al. 2017). Our study indicates that DNA methylation in the red seaweed G. chorda follows a pattern that is similar to other eukaryotes, and gene family-specific gene body methylation in red algae may have a function that is analogous to that in land plants. Fig. 3. View largeDownload slide DNA (cytosine) methylation patterns in the Gracilariopsis chorda genome. (A) Circular heatmap of DNA methylation (CHH, CHG, and CG methylation) in which the red color indicates a higher methylation level and the blue, a lower level. (B) Coverage of methylation sites in CDSs and results of the random sampling-based methylation site distribution analysis. (C) Comparison of methylation frequencies in gene bodies and 1-kb upstream regions, and the methylation levels of gene bodies. Fig. 3. View largeDownload slide DNA (cytosine) methylation patterns in the Gracilariopsis chorda genome. (A) Circular heatmap of DNA methylation (CHH, CHG, and CG methylation) in which the red color indicates a higher methylation level and the blue, a lower level. (B) Coverage of methylation sites in CDSs and results of the random sampling-based methylation site distribution analysis. (C) Comparison of methylation frequencies in gene bodies and 1-kb upstream regions, and the methylation levels of gene bodies. DNA Methylation-Dependent Transcriptional Regulation in G. chorda One well-known function of DNA methylation is transcriptional inactivation (i.e., gene silencing) associated with promoter regions (Mohn et al. 2008; Payer and Lee 2008; Deaton and Bird 2011). In general, the promoter regions overlap with CpG islands (a high density of CG sites), suggesting that DNA methylation of CpG islands is linked with transcriptional repression, even though the CG sites in the CpG islands are not always methylated (Illingworth and Bird 2009; Deaton and Bird 2011). To understand DNA methylation-dependent gene regulation, we analyzed promoters and CpG islands in the G. chorda genome. We focused on promoters within 1 kb upstream of G. chorda genes that were predicted using the TSSPlant promoter database (prediction of plant polymerase II promoter) (Shahmuradov et al. 2017). As a result, 8,022 predicted promoters were comprised TATA-boxes (4925) and TATA-less Transcription Start Sites (TSS; 3097) (supplementary table S16, Supplementary Material online). These are primarily distributed within 50–400 bp upstream regions of genes (fig. 4A). The dominant nucleotide sequences are “TATAAAAATA” in TATA-boxes and cytosines in TATA-less promoters (fig. 4B and supplementary table S17, Supplementary Material online). The TATA-box promoter-binding regions are methylated primarily in the proximal and distal regions (fig. 4B), whereas the TATA-less TSS regions are more uniformly methylated (supplementary fig. S9, Supplementary Material online). Fig. 4. View largeDownload slide CpG island-dependent regulated genes in the Gracilariopsis chorda genome. (A) Promoter distribution within 1-kb region upstream of genes. (B) Amino acid proportions of TATA-box promoter regions and their methylation. (C) COG categories of CpG island-dependent regulated genes in G. chorda. L = replication, recombination and repair; A = RNA processing and modification; P = inorganic ion transport and metabolism; H = coenzyme transport and metabolism; I = lipid transport and metabolism; D = cell cycle control, cell division, chromosome partitioning; Q = secondary metabolites biosynthesis, transport, and catabolism; B = chromatin structure and dynamics; M = cell wall/membrane/envelope biogenesis; Z = cytoskeleton; V = defense mechanisms; F = nucleotide transport and metabolism. (D) Enriched GO terms of CpG island-dependent regulated genes in G. chorda. Fig. 4. View largeDownload slide CpG island-dependent regulated genes in the Gracilariopsis chorda genome. (A) Promoter distribution within 1-kb region upstream of genes. (B) Amino acid proportions of TATA-box promoter regions and their methylation. (C) COG categories of CpG island-dependent regulated genes in G. chorda. L = replication, recombination and repair; A = RNA processing and modification; P = inorganic ion transport and metabolism; H = coenzyme transport and metabolism; I = lipid transport and metabolism; D = cell cycle control, cell division, chromosome partitioning; Q = secondary metabolites biosynthesis, transport, and catabolism; B = chromatin structure and dynamics; M = cell wall/membrane/envelope biogenesis; Z = cytoskeleton; V = defense mechanisms; F = nucleotide transport and metabolism. (D) Enriched GO terms of CpG island-dependent regulated genes in G. chorda. A total of 52,008 CpG islands were predicted using a Perl script (Takai and Jones 2002), and 97% (50,484) of the CpG islands are smaller than 3,000 bp (median size = 300 bp), comparable with other organisms (supplementary fig. S10, Supplementary Material online) (Carlberg and Molnár 2014). However, there are 1,524 unusually large CpG islands (median size = 4,000 bp). In both cases, two-thirds of the CpG islands contain methylated CG dinucleotides, which encompass <15% of these regions (supplementary fig. S10, Supplementary Material online). Interestingly, there was a strong correlation between the length of contigs and the number of CpG islands (linear regression R2=0.97 and P-value < 2.2e-16; supplementary fig. S11A, Supplementary Material online), suggesting a random distribution of CpG islands on a genome-wide scale. On average, 0.62/kb (median = 0.58) of CpG islands is present in the genome (supplementary fig. S11B, Supplementary Material online). However, there are no correlations between GC richness and CpG islands with CDSs and/or promoters (linear regression R2 ≤ 0.45 and P-value < 2.2e-16; supplementary fig. S11C, Supplementary Material online). In general, CpG island-dependent regulated genes are defined by the features related to the presence of DNA methylations in CpG island-located promoter regions (Mohn et al. 2008; Payer and Lee 2008; Illingworth and Bird 2009; Deaton and Bird 2011). To analyze CpG island-dependent regulated genes in G. chorda, we analyzed CpG island-located promoters, of which only about one-half (3,720) had an overlap with the total number of predicted promoters (8,022) in the genome (supplementary table S18, Supplementary Material online). The genes associated with these CpG island-located promoters were regarded as candidates for potential regulation by DNA methylation. We focused on these genes knowing that other transcription regulatory factors (e.g., enhancer as cis-regulatory sequences; Tippens et al. 2018) could exist in the G. chorda genome. These promoter regions overlapped with the DNA methylation sites, including differential gene expression results (fragments per kilobase million (fpkm) > 0 in at least two RNA-seq conditions with a maximum fold-change, fpkm > 3, compared with expression of the housekeeping gene GAPDH; gene id = GRC0024GENE2684). Using these criteria, we selected 489 candidates for CpG island-dependent regulated genes. To analyze functions of these candidates, we analyzed functional categories of these genes using the eggNOG-mapper (Huerta-Cepas et al. 2016) that resulted in 470 COG (Clusters of Orthologous Groups) categories and 298 GO (Gene Ontology) terms (supplementary table S19, Supplementary Material online). After excluding unknown hits, the most frequent COG category is “posttranslational modification, protein turnover, chaperones” followed by “signal transduction mechanism” (fig. 4C). Moreover, from the enriched GO terms (togGO R package; Fisher test, P-value < 0.05; supplementary fig. S12, Supplementary Material online), there are not only the regulators (i.e., cell cycle and development in Biological Process category) but also the repressors (i.e., chromatin silencing, negative regulation of metabolic process, response ER stress and viral process in Biological Process, and transcriptional repression activity in Molecular Function category) (fig. 4D). On the basis of these functional descriptions using eggNOG, we identified the potential CpG island-dependent regulatory genes involved in the regulation of the cell cycle, the development and stress response as well as biological repression processes (e.g., chromatin silencing) in G. chorda genome. In other words, these functions might be regulated by methylations in G. chorda. TEs and Putative Epigenetic Regulation in G. chorda As discussed earlier, florideophycean species contain more TEs than other red algae, which is a major factor in their genome size expansion [i.e., G. chorda, this study, and C. crispus (Collén et al. 2013)]. However, TEs generally lead to negative impacts on the host genome, for example, by causing DNA double-strand breaks (McClintock 1950, 1984; Slotkin and Martienssen 2007). Therefore, TE activity is controlled by DNA methylation and histone modification in the host (Hirochika et al. 2000; Okamoto and Hirochika 2001; Vaucheret and Fagard 2001; Verbsky and Richards 2001; Slotkin and Martienssen 2007). About 71.2% of TEs contain methylated cytosines (47,459/66,575 TEs) in the G. chorda genome (supplementary fig. S8, Supplementary Material online). The number and level (percentage of methylated/unmethylated cytosines) of methylated TE sites were plotted for each type of TEs (supplementary fig. S13, Supplementary Material online). The major type of DNA methylation in TEs, based on their frequencies, is CHH methylation (red lines in supplementary fig. S13, Supplementary Material online). The distribution of DNA methylation level (%) is about 4–5% in each TE type (i.e., LTR: 4–6%, DNA: 4–7%, RC: 3–5%, LINE: 3–7%, SINE: 2–10%, and the others: 4–5%; dot plots in supplementary fig. S13, Supplementary Material online). DNA methylation level (%) of CHG type (green dots) is about 5% in all TE types (supplementary table S20 and fig. S13, Supplementary Material online). These patterns of methylation level in TEs (4–5%) are significantly different from that found in gene bodies (3–4%) for all three types of methylations (P-value < 0.05, Wilcoxon rank sum test; supplementary table S20, Supplementary Material online) except for CHG methylation of SINE elements (P-value = 0.298). On the basis of the presence of methylated DNA in TEs (47,608/66,575 TEs), we postulate that TEs are regulated by a DNA methylation-associated mechanism in G. chorda, similar to the DNA methylation-based posttranscriptional/transcriptional gene silencing (PTGS and TGS mechanisms) in land plants (Okamoto and Hirochika 2001; Verbsky and Richards 2001; Matzke et al. 2009, 2015; Saze et al. 2012). In general, PTGS is sRNA-dependent and induced by degradation or translational repression of target mRNA in the cytoplasm. TGS is also an RNA-dependent translational repression process, but is induced by chromatin remodeling (Matzke and Mosher 2014; Matzke et al. 2015). To elucidate how many sRNAs are associated with TE regions, we produced sRNA data from G. chorda and sorted sequences that are 18–30 nt in length based on their mapping outcome (criterion: >100 reads depth with a perfect match; see Materials and Methods). A total of 40,925 sRNAs were predicted and classified in various categories but the majority is unclassified (supplementary table S21, Supplementary Material online). In many cases, small interfering RNAs (siRNAs in land plant genomes) and PIWI-interacting RNAs (P-element induced wimpy testis RNAs; piRNAs in animal genomes) are involved in TE silencing (Brennecke et al. 2008; Slotkin et al. 2009; Saito and Siomi 2010; Ji and Chen 2012; Marí-Ordóñez et al. 2013), but these small RNA categories were difficult to define in the G. chorda genome largely due to the uncharacterized status of sRNAs in red algae. However, we expect that there are many unknown (unclassified) red algal sRNAs that may be involved in TE silencing. Therefore, we mapped sRNAs onto the TEs and found that 57% (23,315/40,925) are located in these regions. These TE-located sRNAs could potentially be involved in the methylation of TE regions (PTGS) or in the degradation of expressed TE sequences (TGS) in G. chorda. As an example, the unmethylated LINE element (GRC1278) contained 49 complementary sRNAs sites where their mapping depths range from 102 to 12,430 sRNA reads (fig. 5A and supplementary table S22, Supplementary Material online). These unmethylated TE-located sRNAs might be potential targets of sRNA for degradation or translational suppression of TEs, which is similar to TGS in land plants. In general, the TGS mechanism is implicated in the regulation of expressed TEs rather than the methylation of TE regions (Okamoto and Hirochika 2001; Verbsky and Richards 2001; Matzke et al. 2009, 2015; Saze et al. 2012). To test the association between sRNA and TE methylation, we compared the proportion (%) of methylated TEs with the sRNA-binding sites. Not surprisingly, no correlation exists between sRNA density (number of sRNAs per kb) and methylation density (methylation sites per kb) in TE regions (P-value = 0.15, Spearman rank correlation). In addition, there was no significant difference in terms of whether methylated TEs contain complementary sRNAs or not (P-value > 0.6, Chi-squared test) (proportion of methylated TEs with mapped sRNA: 74% [9,667/12,957] and methylated TEs without mapped sRNA: 70% [37,792/53,618]). In contrast, a decrease in the number of methylated TEs depends on sRNA density in TE regions (negative correlation, rho = −0.96, P-value < 2.2e-16, Spearman rank correlation). On the basis of these results, we postulate that a large proportion of sRNAs in G. chorda genome are primarily related to the translational TE suppression (i.e., TGS-like) rather than the methylation of TE regions (i.e., PTGS-like). Interestingly, the largest proportion of the TE-located sRNAs is of size 28 nt, although 27, 26, and 20 nt sRNAs are also well represented (supplementary fig. S14, Supplementary Material online). To understand better the size composition of the major sRNA (i.e., of lengths, 20, 26, 27, and 28 nt) in TE regions, we randomly resampled TEs 1,000 times (20% of the total sRNA-associated TEs) and then plotted the sRNA frequencies (supplementary fig. S14, Supplementary Material online). This result shows a nearly identical pattern (linear regression R2 = 0.999) between the total number of TE-located sRNAs and the randomly resampled data. In addition, these major sRNAs (20, 26, 27, and 28 nt) show higher CHH at both termini of the sRNAs (supplementary fig. S15, Supplementary Material online). Fig. 5. View largeDownload slide TE located sRNAs and methylation sites. (A) Distribution of unmethylated TE-associated sRNAs from an example of LINE elements. (B) Distribution of methylated TE-associated sRNAs and methylated sites from examples of LTR elements. Fig. 5. View largeDownload slide TE located sRNAs and methylation sites. (A) Distribution of unmethylated TE-associated sRNAs from an example of LINE elements. (B) Distribution of methylated TE-associated sRNAs and methylated sites from examples of LTR elements. As other examples, LTR elements contain a large proportion of methylation sites with several complementary sRNA sites (fig. 5B). On the basis of several lines of evidences, we suggest that sRNA-associated TE methylation mechanisms (i.e., PTGS-like) exist in G. chorda (supplementary table S22, Supplementary Material online). As a TE methylation mechanism, the RdMD pathway was suggested to operate in land plants (Matzke and Mosher 2014; Matzke et al. 2015). The machinery is related to the DCL (Dicer-like protein that processes long double-stranded RNAs into small interfering RNAs) and AGO (Argonaute protein that is responsible for sRNA-guided chromatin modification) pathway although these proteins also play general roles in the generation and functioning of sRNAs (Liu et al. 2009; Hutvagner and Simard 2008). We found these DCL and AGO homologs in mesophilic red algae using BlastP and domain searches (supplementary table S23, Supplementary Material online). These genes are highly expressed in G. chorda (supplementary table S24, Supplementary Material online). On the basis of these results, we postulate that the DCL and AGO proteins in red algae (at least G. chorda genome) play important roles in the small RNA-related process, even though the detailed classification of these protein families is challenging (alignment using MAFFT with default option; maximum likelihood tree using IQ-tree with 1,000 replications after choosing the best-fit evolutionary model; supplementary fig. S16, Supplementary Material online; Minh et al. 2013; Flouri et al. 2015; Nguyen et al. 2015; Yamada et al. 2016). In addition, most RdDM pathways in land plants are polymerase-dependent (pol IV and V) (Slotkin and Martienssen 2007; Matzke and Mosher 2014; Matzke et al. 2015); however, these enzymes are absent from red algal genomes. Taken together, it is highly likely that red algal genomes contain a putative TE suppression pathway because of presence of cytosine methylation of TEs, TE-associated sRNAs, and the conserved proteins of DCL and AGO although our genome-wide study has the limitation that we cannot determine how sRNAs directly regulate DNA methylation with the associated components. In summary, based on our genome-wide analysis of methylation and sRNAs in G. chorda, we suggest that sRNA-mediated TE suppression present as a form of epigenetic modification in red algae. Conclusions Analysis of genomes from the multicellular red algal class Florideophyceae demonstrates genome size expansion via the accumulation of TEs, when compared with its sister class, the Bangiophyceae and other early diverged, unicellular red algae (Cyanidiophyceae and Porphyridiophyceae). In land plants, TE activity has contributed to genome size growth, DNA rearrangement and gene mutation, modification, movement, and regulation (Bennetzen and Wang 2014). It is intriguing that sexual reproduction in angiosperms may increase the diversity and activity of TE families (Wright and Finnegan 2001; Arkhipova 2005; Steinemann and Steinemann 2005; Michael 2014; Li et al. 2016), which has been experimentally validated in yeast (Zeyl and Bell 1996). Although multicellular Florideophyceae and Bangiophyceae have sexual life cycle and large proportions of TEs in their genomes, additional evidence is required to establish a connection between sexual reproduction and TE-derived genome expansion in red algae. Beyond such adaptation-based scenarios, another factor that potentially explains genome growth in red seaweeds is population size. Although Floridiophyceae (∼6,700 spp.) are ubiquitous in nearshore marine environments, they are likely to be far less abundant than unicellular red algal species. It has been previously argued under the drift-barrier hypothesis for mutation-rate evolution (Lynch et al. 2016) that effective population size and genetic drift govern the strength of selection on trait evolution. Smaller populations are more prone to genetic drift and elevated genome-wide deleterious mutation rates, with larger populations behaving in the opposite direction with selection removing deleterious traits more effectively (Sung et al. 2012). Under this scenario, the larger genomes and higher TE loads of red seaweeds may be explained by their relatively smaller population sizes when compared with unicellular lineages. The genetic drift among red seaweeds has been active for at least 600 My when the Florideophyceae radiated (Yang et al. 2016). Therefore, TE-derived genome size growth of multicellular red algae might reflect the long-term evolutionary consequence of reduced population sizes. This hypothesis does not argue that all unicellular Rhodophyta to be analyzed in the future will have small genomes, but rather, that their genome size evolution may be better understood under a neutral model that reflects population level processes. The apparent absence of polyploidization in red algae provides a potential explanation for why the gene inventory has not expanded massively in red seaweeds, despite genome size growth, in sharp contrast to many land plants. Interestingly, all available red algae have genome sizes at or smaller than ∼100 Mb. This contrasts with land plants whose genomes can be several hundred megabase pairs in size with TE-derived genome size growth occurring independent of polyploidization (Hawkins et al. 2006; Piegu et al. 2006). TE activity in land plants has likely provided novel sources of evolutionary variability in these lineages (Hua-Van et al. 2011; Galindo-González et al. 2017). Testing this latter scenario in Rhodophyta will require additional red algal genome data and genetic tools that await development. Our epigenomic analysis shows that the potential exists for CpG island-dependent regulated genes in G. chora. The impacted genes have regulatory functions such as “cell cycle and development,” “chromatin silencing,” and “transcriptional repression activity.” Therefore, we suggest that DNA methylation may play a role in regulating the most complex aspects of red seaweed biology such as sexual reproduction, that is akin to regulation of plant gametogenesis (Saze et al. 2003) and embryogenesis (Xiao et al. 2006). This and other hypotheses regarding the impact of epigenomic regulation in red algal development and adaptation remain to be investigated using the genetic toolkit described in this study. Materials and Methods Sample Preparation and Genome Sequencing Samples of isomorphic gametophytes (n) and tetrasporophytes (2n) of G. chorda cultivars were collected from a coastal farm Jangheung, Jeonnam, Korea (34°28ʹ18″N, 126°56ʹ28″E) on February 2, 2013. Total genomic DNA was extracted from a fresh sample using a modified CTAB protocol and the DNeasy Plant Maxi Kit (Qiagen, Germany) with clean-up steps. PacBio Library Construction and Sequencing Using the Covaris G-tube, we generated 20-kb fragments by shearing genomic DNA according to the manufacturer’s protocol. Using the AMpureXP bead purification system to remove the small fragments, a total of 5 µg from each sample was used as input for library preparation. The library was constructed using the SMRTbell Template Prep Kit 1.0 (PN 100-259-100). Using the BluePippin Size selection system, we removed small fragments for the large-insert library. After a sequencing primer was annealed to the SMRTbell template, DNA polymerase was bound to the complex (DNA/Polymerase Binding kit P6v2). Following the polymerase binding reaction, the MagBead Kit was used to bind the library complex with MagBeads before sequencing. MagBead bound complexes provide for more reads per SMRT Cell. This polymerase-SMRTbell-adaptor complex was loaded into zero-mode waveguides. The SMRTbell library was sequenced using 17 SMRT cells (Pacific Biosciences) using C4 chemistry (DNA sequencing Reagent 4.0). This procedure generated 6.52 Gb of total data (supplementary table S25, Supplementary Material online). The PacBio reads were assembled using SMRTmake, based on the Celera Assembler. To correct errors in the PacBio assembly, the short-read genome sequencing platform from Illumina (HiSeq2000) was used to generate data. A whole-genome sequencing library was prepared according to the manufacturer’s instructions. Genomic DNA was randomly sheared to yield fragments of 200–300 bp in size using the Covaris S2 system. The fragments were ligated with index adapters using the Truseq DNA Sample prep kit (Illumina, San Diego, CA) and AMPure XP Beads purification kit (Beckman Coulter Genomics, Danvers, MA). Ligated products were purified and size-selected for 300–400 bp using the Pippin prep electrophoresis platform (Sage Science, Beverly, MA). Size-selected fragments were amplified with adapter-specific primer sets. The quality of the amplified libraries was assessed on a 2100 BioAnlayzer (Agilent Technologies, Santa Clara, CA) and then sequenced with Illumina HiSeq2000 using 100-bp paired-end reagents. A total of 46.5 Gb of data was generated (supplementary table S25, Supplementary Material online). De Novo Assembly and Scaffolding De novo assembly was conducted using the hierarchical genome assembly process (HGAP) workflow (Chin et al. 2013) on SMRT portal (v2.3). HGAP starts with the filtering step for the SMRT reads and then performs an error correction on long reads using relatively short reads with 30× target coverage and 90 Mb expected genome size parameters. The error-corrected reads were used in Celera Assembler to generate the draft assembly. As the final HGAP step, polishing was performed through consensus calling using Quiver. The “polished” contigs were then further scaffolded using SSPACE (v3.0) with Illumina paired-end reads (Boetzer et al. 2011). To correct errors in the PacBio assembly, Illumina (HiSeq 2000) paired-end reads were mapped to the long-read data and the sequences corrected using bwa (v0.7.15; Li 2012). Analysis of Repeated DNA Elements We searched for repeated DNA elements using RepeatModeler (v1.0.10; http://www.repeatmasker.org/RepeatModeler/; de novo repeat family identification and modeling package). This package includes two de novo repeat finding programs: RECON (v1.08) (Bao and Eddy 2002) and RepeatScout (v1.0.5) (Price et al. 2005). We chose options as default l-mer size and filtered out low-complexity and tandem repeats (Tandem Repeats Finder) (Benson 1999). Repeat libraries were downloaded (http://www.girinst.org) and used in the classification step. Each genome was masked with its corresponding repeat library using RepeatMasker (v4.0.7, http://www.repeatmasker.org/) under sensitive mode (-s); without masking low-complexity and simple repeats (-nolow) and without checking for bacterial insertion sequences (-no_is). An in-house Python script was used to parse the results that the frequency of Kimura distance between copies of repeated elements (TEs) identified in the genome using the library. To analyze shared TEs (supplementary fig. S1, Supplementary Material online), all six red algal genomes were merged into a single file, with species tags included in the contig names, as input for the RepeatModeler pipeline. If the same consensus repeat families were present at phylogenetic internodes, the individual repeats in each red algal genome were regarded as shared TEs within the groups (gray bars in supplementary fig. S1A, Supplementary Material online). From each divergence point, we plotted the Kimura values that were calculated between the consensus and individual TE sequences (supplementary fig. S1A, Supplementary Material online; Kimura 1980). The classification and amount of shared TEs are shown in supplementary figure S1B, Supplementary Material online. Archaeplastida genome data were downloaded and their repeats reanalyzed using the above pipelines, including data from Cyanophora paradoxa (Price et al. 2012), Galdieria sulphuraria (Schönknecht et al. 2013), Cyanidioschyzon merolae (Matsuzaki et al. 2004), Porphyridium purpureum (Bhattacharya et al. 2013), Chondrus crispus (Collén et al. 2013), Porphyra umbilicalis (Brawley et al. 2017), Ostreococcus tauri (Derelle et al. 2006), Chlamydomonas reinhardtii (Merchant et al. 2007), Physcomitrella patens (Rensing et al. 2008), Amborella trichopoda (Amborella Genome Consortium 2013), and Arabidopsis thaliana (Arabidopsis Genome Initiative 2000). RNA Sequencing To collect diverse RNA sequencing data, we used G. chorda thalli from various developmental stages in several culture conditions. One female thallus with carpospores and one vegetative thallus without carpospores (i.e., young stage of gametophyte or tetrasporophyte) were cleaned using sterilized seawater under the dissecting microscope and kept in a culture chamber. To increase the recovery of expressed genes, we extracted total RNAs under the following seven conditions: 1) Cleaned nonfemale thallus, kept under the light (2 h) and room temperature (RT; 20 °C), 2) female thallus, kept under the light and RT, 3) nonfemale thallus, kept under the light and high temperature (30 °C, 2 h), 4) nonfemale thallus under the light and half salinity condition (DW: seawater = 1:1), 5) nonfemale thallus, kept under the dark (2 h) and RT, 6) nonfemale thallus exposed the light (30 min) after dark treatment (7 h), and 7) nonfemale thallus kept under the light and low temperature (4 °C) for 7 days. Total RNA was extracted using QIAGEN RNeasy Plant Mini Kit (Qiagen) based on the manufacturer’s instruction with DNase treatment. The RNA sequencing (RNA-seq) library was prepared according to the manufacturer’s instructions using the Illumina Truseq RNA Prep kit v2. The mRNA was purified and fragmented from total RNA (1 µg) using poly-T oligo-attached magnetic beads using two rounds of purification. Cleaved RNA fragments primed with random hexamers were reverse transcribed into first strand cDNA using reverse transcriptase and random primers. The RNA template was then removed prior to the synthesis of a replacement strand to generate double-strand cDNA. End repair, A-tailing, adaptor ligation, cDNA template purification, and enrichment of the purified cDNA templates were then performed using PCR. The quality of the amplified libraries was assessed on a 2100 Bioanalyzer (Agilent Technologies) and sequenced using 100-bp paired-end reagents with an Illumina HiSeq2000. All RNA-seq data were used to predict the gene models and their expression level analysis of G. chorda. Gene Prediction, Annotation, and Contaminant Verification To predict the gene models, all RNA-seq data were aligned to G. chorda genome (supplementary table S26, Supplementary Material online) using TopHat2 aligner (v2.1.1) (Kim et al. 2013). Splice junctions based on mapped RNA-seq reads were identified using Bowtie2 (v2.2.6) (Langmead and Salzberg 2012). The RNA-seq data were assembled de novo using Trinity (v2.4.0) (Haas et al. 2013) with the default option and these assembled transcripts (data not shown here) were also aligned to the G. chorda genome using TopHat2. Based on these mapped RNA-seq data, the gene models were predicted by BRAKER1 pipeline (v1.9) (Hoff et al. 2016), which includes implementation of GeneMark-ET (Lomsadze et al. 2014) and AUGUSTUS (Stanke et al. 2006). Additionally, the homology-based method was also conducted using protein sequences from taxonomically diverse genomes including Chondrus crispus, Cyanidioschyzone merolae, Cyanophora paradoxa, Porphyridium purpureum, Galdieria sulphuraria, Pyropia yezoensis, Arabidopsis thaliana, and Homo sapiens; these protein sequences were aligned to G. chorda genome using TBlastN (e-value ≤ 1e-04). The homologous regions of the genome sequences were then aligned to matched proteins using Genewise (Birney et al. 2004) to predict the accurate spliced alignments. EvidenceModeler, EVM (Haas et al. 2008) was used to integrate the homology-based predicted gene models to consensus gene model. To run EVM, all gene models were converted to EVM-compatible GFF3 format. All RNA-seq-based and homology-based gene models were combined into final gene set for G. chorda genome prior to the RNA-seq-based gene models. To choose the correct gene models, the generated protein models were checked manually using customized python scripts based on BlastN/P results using the nt/nr database (NCBI). The organelle-derived (mitochondria and plastid) and contaminant contigs were identified and removed. A total of 10,806 genes were predicted from the 92.1 Mb G. chorda genome. Of these, 7,139 genes were annotated with functional information from UniProt (http://www.uniprot.org/; 5,378 genes), conserved domains (Marchler-Bauer et al. 2017) (CD-search; 7,018 genes), and KEGG blast (3,089 genes; http://www.genome.jp/tools/blast/). The promoters of G. chorda proteins were predicted from 1-kb upstream regions using TSSPlant (prediction of plant polymerase II promoter) (Shahmuradov et al. 2017). Gene Family and Duplication Analysis To compare the ancestrally duplicated genes in Archaeplastida, all proteins from the three Archaeplastida lineages were mapped onto the 429 BUSCO data set (Simão et al. 2015). The average numbers of duplicated BUSCO genes in each species were calculated from BUSCO gene numbers (supplementary table S6, Supplementary Material online). Clustered/orthologous gene families were analyzed using the OrthoFinder pipeline (blast algorithm; inflation parameter for mcl = 1.5; Emms and Kelly 2015). Three separate runs of ortholog clustering were performed: 1) The six red algal species (fig. 1E), 2) eight Viridiplantae species (supplementary fig. S2D, Supplementary Material online), and 3) 15 Archaeplastida species (supplementary fig. S3, Supplementary Material online). Based on the clustered proteins, the clusters composed of all used species were sorted and defined as OGFs (fig. 1E). Other clustered gene families were indicated as same gene family. The numbers of shared OGFs in each group were summarized as bar graphs with rest of gene numbers (fig. 1E). To analyze G. chorda-specific gene duplications, a strict process was implemented (supplementary fig. S5, Supplementary Material online). First, all available red algal proteins (from 54,101 genes) were clustered into homologous groups (i.e., gene families) using the OrthoFinder pipeline (supplementary fig. S5A, Supplementary Material online) (Emms and Kelly 2015). Second, we analyzed conserved domain structure and number from all red algal proteins using the web-based CD-search (Marchler-Bauer et al. 2017), and then searched the candidates for species-specific and highly duplicated genes (criteria: > 2-fold domain duplication when compared with other taxa; supplementary fig. S5B and C, Supplementary Material online). Using this procedure, we found 13 G. chorda-specific duplicated gene families (supplementary table S12, Supplementary Material online). This third step resulted in five domain-families that were duplicated when compared with other red algal taxa excluding fragmentally duplicated domain structures (supplementary fig. S5D and table S12, Supplementary Material online). These duplicated domains are located in six gene families (47 genes) in G. chorda (see gene expression and putative functions in supplementary table S13, Supplementary Material online). As the fourth step, these six gene families were validated as functionally expressed genes using RNA-seq data (supplementary fig. S5D, Supplementary Material online), because some cases of duplicated genes had undergone purifying selection (loss of function) during the duplication process (Zhang 2003; Innan and Kondrashov 2010). From this step, six gene families were identified as being functionally expressed and highly duplicated (i.e., OG172, OG1241, OG78, OG329, OG16, and OG1129; supplementary table S13, Supplementary Material online). After filtering out two gene families (i.e., OG16 and OG1129) that show insufficient expression levels, four gene families were confirmed as duplicated functional genes involving three functional domains (cl14543, cl05795, and pfam04096). Phylogenetic analysis of these candidates was done using an extended taxon data set that incorporated EST data (fig. 2A, supplementary fig. S5E, Supplementary Material online). Putative homologs for each target protein were identified using BlastP search (e-value ≤ 1e-05) against RefSeq database and were aligned using MAFFT v7.309 (default option: –auto; Yamada et al. 2016). Through this process, two gene families (i.e., OG78 and OG329) were removed because their copy numbers were not significant when compared with other homologs in red algae from the extended data set (criteria: >2-fold duplications when compared with other taxa). Finally, the combined alignment of cl14543 domain contained gene families (OG172 and OG1241) was used as input for IQ-tree to infer a maximum likelihood tree with 1,000 replications (-bb 1000) after the model test step (-m TEST; supplementary fig. S5E, Supplementary Material online) (Minh et al. 2013; Flouri et al. 2015; Nguyen et al. 2015). Bisulfite Sequencing Library Preparation (Illumina TruSeq) and DNA Methylation Analysis We checked the quality of DNA using 1% agarose gel electrophoresis and the PicoGreen dsDNA Assay (Invitrogen, Waltham, MA). Bisulfite conversion of genomic DNA (1 μg) was performed with the EZ DNA Methylation-Gold kit (Zymo Research, Irvine, CA). The methyl-seq libraries were prepared using the TruSeq DNA Methylation kit (Illumina). The quality of the libraries was verified by capillary electrophoresis (Bioanalyzer; Agilent Technologies). Sequencing of 2×100-bp paired-end reads was carried out on the Illumina HiSeq 2500 platform. We produced the 29 Gb (289 million reads; 91.0% ≥Q30) of bisulfite sequencing data and aligned them to the G. chorda genome using bowtie2 (Langmead and Salzberg 2012). Removal of duplicated reads and methylation calling were done using bismark (v0.16.3) (Krueger and Andrews 2011). Methylation sites in gene bodies, promoters and TEs were identified using customized Python scripts based on G. chorda gene models and the genome sequence. The CpG islands were predicted using a customized Perl script (options: GC% ≥ 50, ObsCpG/ExpCpG ≥ 0.65, and sequence length ≥ 200 bp; Takai and Jones 2002). Small RNA Sequencing Library Preparation (Illumina TruSeq) and Data Analysis Small RNA was extracted using the Hybrid-R miRNA kit (GeneAll, Seoul, Korea). The extracted RNA purity was determined by assaying 1 µl of total RNA on a NanoDrop8000 spectrophotometer (Thermo Fisher Scientific, Waltham, MA). Total RNA integrity was checked using an Agilent Technologies 2100 Bioanalyzer (Agilent Technologies) with the RNA integrity number value. Small RNA sequencing libraries were prepared according to the manufacturer’s instructions (Illumina Truseq small RNA Prep kit). The resulting cDNAs were purified using the Sage Science Pippin prep electrophoresis platform. The quality of libraries was verified by capillary electrophoresis (Bioanalyzer; Agilent Technologies). Sequencing of 1×51-bp paired-end reads was carried out on the Illumina HiSeq 2500 platform. We produced 13 Gb (255 million reads; 96.5% ≥Q30) of sRNA sequencing data. Adapter sequence and quality trimming were done using cutadapt (v1.1) (Martin 2011). The trimmed sequences were aligned against the G. chorda genome sequence at 100% identity, using bowtie2 (Langmead and Salzberg 2012). We kept only 1.6 Gb (74 million reads) of mapped sRNAs of lengths 18–30 nt. The mapped sRNA sequences were classified based on Rfam (v12.2) (Nawrocki et al. 2015) database using bowtie2 (Langmead and Salzberg 2012). We analyzed sRNA coverage on the TEs and methylation sites for each sRNA category using CoverageBED (Quinlan and Hall 2010). Supplementary Material Supplementary data are available at Molecular Biology and Evolution online. Author Contributions H.S.Y., S.M.B., E.C.Y., and J.-H.L. designed the genome sequencing project. H.S.Y. and D.B. conceived the research in collaboration with J.L. and J.H.Y., E.C.Y., G.H.B., K.M.K., H.-S.Y., and E.C.Y. prepared samples from field and culture. E.C.Y. and K.M.K. did the transcriptome library construction. J.L., Y.S., M.J., and S.J.L. did the genome assembly and gene prediction. J.L. led the genome analysis. J.L., S.J.L., L.G., H.Q., U.Z.Z., A.P.M.W., C.X.C., T.G.S., and E.C.Y. analyzed the gene ontology, gene duplications, repeats, gene families, and DNA methylation. J.L., D.B., and H.S.Y. wrote the paper in collaboration with the coauthors. H.S.Y. supervised the project. All authors read and approved the final manuscript. Acknowledgements We thank to Sung Bae Lee for kindly providing the Gracilariopsis chorda tissue from his sea farm that was used in this study. D.B., H.Q., and U.Z.Z. acknowledge the generous support from the School of Environmental and Biological Sciences at Rutgers University for the SEBS Genome Cooperative. C.X.C. and T.G.S. acknowledge support from the University of Queensland. This research was supported by the Collaborative Genome Program of the Korea Institute of Marine Science and Technology Promotion (KIMST) funded by the Ministry of Oceans and Fisheries (MOF) (No. 20140428), the National Research Foundation of Korea (NRF-2017R1A2B3001923), and the Next-generation BioGreen21 Program (PJ01389003) from the Rural Development Administration, Korea. The authors declare no competing financial interests. References Amborella Genome Consortium . 2013 . The complete nuclear genome of Amborella trichopoda: an evolutionary reference genome for the angiosperms . Science 342 : 6165 . Arabidopsis Genome Initiative. 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana . Nature 408 ( 6814 ): 796 – 815 . CrossRef Search ADS PubMed Arkhipova IR. 2005 . Mobile genetic elements and sexual reproduction . Cytogenet Genome Res . 110 ( 1–4 ): 372 – 382 . Google Scholar CrossRef Search ADS PubMed Banyushin BF. 2006 . DNA methylation in plants. In: Doerfler W , Bohm P , editors. DNA methylation: basic mechanisms . Berlin (Germany ): Springer-Verlag Press . p. 67 – 122 . Google Scholar CrossRef Search ADS Bao Z , Eddy SR. 2002 . Automated de novo identification of repeat sequence families in sequenced genomes . Genome Res . 12 ( 8 ): 1269 – 1276 . Google Scholar CrossRef Search ADS PubMed Bennetzen JL , Wang H. 2014 . The contributions of transposable elements to the structure, function, and evolution of plant genomes . Annu Rev Plant Biol . 65 : 505 – 530 . Google Scholar CrossRef Search ADS PubMed Benson G. 1999 . Tandem repeats finder: a program to analyze DNA sequences . Nucleic Acids Res . 27 ( 2 ): 573 – 580 . Google Scholar CrossRef Search ADS PubMed Bewick AJ , Niederhuth CE , Ji L , Rohr NA , Griffin PT , Leebens-Mack J , Schmitz RJ. 2017 . The evolution of CHROMOMETHYLASES and gene body DNA methylation in plants . Genom Biol . 18 ( 1 ): 65. Google Scholar CrossRef Search ADS Bewick AJ , Schmitz RJ. 2017 . Gene body DNA methylation in plants . Curr Opin Plant Biol . 36 : 103 – 110 . Google Scholar CrossRef Search ADS PubMed Bhattacharya D , Price DC , Chan CX , Qiu H , Rose N , Ball S , Weber APM , Arias MC , Henrissat B , Coutinho PM , et al. 2013 . Genome of the red alga Porphyridium purpureum . Nat Commun . 4 : 1941. Google Scholar CrossRef Search ADS PubMed Bhattacharya D , Yoon HS , Hackett JD. 2004 . Photosynthetic eukaryotes unite: endosymbiosis connects the dots . BioEssays 26 ( 1 ): 50 – 60 . Google Scholar CrossRef Search ADS PubMed Bhaya D , Dufresne A , Vaulot D , Grossman A. 2002 . Analysis of the hli gene family in marine and freshwater cyanobacteria . FEMS Microbiol Lett . 215 ( 2 ): 209 – 219 . Google Scholar CrossRef Search ADS PubMed Birney E , Clamp M , Durbin R. 2004 . GeneWise and genomewise . Genome Res . 14 ( 5 ): 988 – 995 . Google Scholar CrossRef Search ADS PubMed Blanc G , Wolfe KH. 2004 . Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution . Plant Cell 16 ( 7 ): 1679 – 1691 . Google Scholar CrossRef Search ADS PubMed Boetzer M , Henkel CV , Jansen HJ , Butler D , Pirovano W. 2011 . Scaffolding pre-assembled contigs using SSPACE . Bioinformatics 27 ( 4 ): 578 – 579 . Google Scholar CrossRef Search ADS PubMed Brawley SH , Blouin NA , Ficko-Blean E , Wheeler GL , Lohr M , Goodson HV , Jenkins JW , Blaby-Haas CE , Helliwell KE , Chan CX , et al. 2017 . Insights into the red algae and eukaryotic evolution from the genome of Porphyra umbilicalis (Bangiophyceae, Rhodophyta) . Proc Natl Acad Sci U S A . 114 ( 31 ): E6361 – E6370 . Google Scholar CrossRef Search ADS PubMed Brennecke J , Malone CD , Aravin AA , Sachidanandam R , Stark A , Hannon GJ. 2008 . An epigenetic role for maternally inherited piRNAs in transposon silencing . Science 322 ( 5906 ): 1387 – 1392 . Google Scholar CrossRef Search ADS PubMed Capuano F , Mülleder M , Kok R , Blom HJ , Ralser M. 2014 . Cytosine DNA methylation is found in Drosophila melanogaster but absent in Saccharomyces cerevisiae, Schizosaccharomyces pombe, and other yeast species . Anal Chem . 86 ( 8 ): 3697 – 3702 . Google Scholar CrossRef Search ADS PubMed Carlberg C , Molnár F. 2014 . Mechanisms of gene regulation. Dordrecht (The Netherlands ): Springer Science+Business Media . Google Scholar CrossRef Search ADS Chin CS , Alexander DH , Marks P , Klammer AA , Drake J , Heiner C , Clum A , Copeland A , Huddleston J , Eichler EE , et al. 2013 . Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data . Nat Methods . 10 ( 6 ): 563 – 569 . Google Scholar CrossRef Search ADS PubMed Cole KM. 1990 . Chromosomes. In: Cole KM , Sheath RG , editors. Biology of the red algae . Cambridge : Cambridge University Press . p. 73 – 102 . Collén J , Porcel B , Carré W , Ball SG , Chaparro C , Tonon T , Barbeyron T , Michel G , Noel B , Valentin K et al. , 2013 . Genome structure and metabolic features in the red seaweed Chondrus crispus shed light on evolution of the Archaeplastida . Proc Natl Acad Sci U S A . 110 ( 13 ): 5247 – 5252 . Google Scholar CrossRef Search ADS PubMed Csűös M. 2010 . Count: evolutionary analysis of phylogenetic profiles with parsimony and likelihood . Bioinformatics 26 ( 15 ): 1910 – 1912 . Google Scholar CrossRef Search ADS PubMed Deaton AM , Bird A. 2011 . CpG islands and the regulation of transcription . Genes Dev . 25 ( 10 ): 1010 – 1022 . Google Scholar CrossRef Search ADS PubMed Derelle E , Ferraz C , Rombauts S , Rouzé P , Worden AZ , Robbens S , Partensky F , Degroeve S , Echeynié S , Cooke R et al. , 2006 . Genome analysis of the smallest free-living eukaryote Ostreococcus tauri unveils many unique features . Proc Natl Acad Sci U S A . 103 ( 31 ): 11647 – 11652 . Google Scholar CrossRef Search ADS PubMed Eddy SR. 2012 . The C-value paradox, junk DNA and ENCDE . Curr Biol . 22 ( 21 ): R898 – R899 . Google Scholar CrossRef Search ADS PubMed Emms D , Kelly S. 2015 . OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy . Genome Biol . 16 : 157. Google Scholar CrossRef Search ADS PubMed Feng S , Cokus SJ , Zhang X , Chen P-Y , Bostick M , Goll MG , Hetzel J , Jain J , Strauss SH , Halpern ME et al. , 2010 . Conservation and divergence of methylation patterning in plants and animals . Proc Natl Acad Sci U S A . 107 ( 19 ): 8689 – 8694 . Google Scholar CrossRef Search ADS PubMed Flouri T , Izquierdo-Carrasco F , Darriba D , Aberer AJ , Nguyen LT , Minh BQ , Von Haeseler A , Stamatakis A. 2015 . The phylogenetic likelihood library . Syst Biol . 64 ( 2 ): 356 – 362 . Google Scholar CrossRef Search ADS PubMed Funk C , Vermaas W. 1999 . A cyanobacterial gene family coding for single-helix proteins resembling part of the light-harvesting proteins from higher plants . Biochemistry 38 ( 29 ): 9397 – 9404 . Google Scholar CrossRef Search ADS PubMed Galindo-González L , Mhiri C , Deyholos MK , Grandbastien MA. 2017 . LTR-retrotransposons in plants: engines of evolution . Gene 626 : 14 – 25 . Google Scholar CrossRef Search ADS PubMed Garcia S , Leitch IJ , Anadon-Rosell A , Canela MÁ , Gálvez F , Garnatje T , Gras A , Hidalgo O , Johnston E , Mas de Xaxars G et al. , 2014 . Recent updates and developments to plant genome size databases . Nucleic Acids Res. 42 ( D1 ): D1159 – D1166 . Google Scholar CrossRef Search ADS PubMed Garcia-Aguilar M , Michaud C , Leblanc O , Grimanelli D. 2010 . Inactivation of a DNA methylation pathway in maize reproductive organs results in apomixis-like phenotypes . Plant Cell 22 ( 10 ): 3249 – 3267 . Google Scholar CrossRef Search ADS PubMed Glasauer SMK , Neuhauss SCF. 2014 . Whole-genome duplication in teleost fishes and its evolutionary consequences . Mol Genet Genomics . 289 ( 6 ): 1045 – 1060 . Google Scholar CrossRef Search ADS PubMed Goff LJ , Coleman AW. 1986 . A novel pattern of apical cell polyploidy, sequential polyploidy reduction and intercellular nuclear transfer in the red alga Polysiphonia . Am J Bot . 73 ( 8 ): 1109 – 1130 . Google Scholar CrossRef Search ADS Goff LJ , Coleman AW. 1990 . DNA: micro-spectrohotometric studies. In: Cole KM , Sheath RG , editors. Biology of the red algae . Cambridge : Cambridge University Press . p. 43 – 72 . Goodier JL. 2016 . Restricting retrotransposons: a review . Mob DNA . 7 : 16. Google Scholar CrossRef Search ADS PubMed Greilhuber J , Doležel J , Lysák M , Bennett MD. 2005 . The origin, evolution and proposed stabilization of the terms ‘genome size’ and ‘C-value’ to describe nuclear DNA contents . Ann Bot . 95 ( 1 ): 255 – 260 . Google Scholar CrossRef Search ADS PubMed Haas BJ , Papanicolaou A , Yassour M , Grabherr M , Blood PD , Bowden J , Couger MB , Eccles D , Lieber M , MacManes MD et al. , 2013 . De novo transcript sequence reconstruction from RNA-seq: reference generation and analysis with Trinity . Nat Protoc . 8 ( 8 ): 1494 – 1512 . Google Scholar CrossRef Search ADS PubMed Haas BJ , Salzberg SL , Zhu W , Pertea M , Allen JE , Orvis J , White O , Buell CR , Wortman JR. 2008 . Automated eukaryotic gene structure annotation using EVidenceModeler and the program to assemble spliced alignments . Genome Biol . 9 ( 1 ): R7. Google Scholar CrossRef Search ADS PubMed Hahn MW , Wray GA. 2002 . The g-value paradox . Evol Dev . 4 ( 2 ): 73 – 75 . Google Scholar CrossRef Search ADS PubMed Hawkins JS , Kim HR , Nason JD , Wing RA , Wendel JF. 2006 . Differential lineage-specific amplification of transposable elements is responsible for genome size variation in Gossypium . Genome Res . 16 ( 10 ): 1252 – 1261 . Google Scholar CrossRef Search ADS PubMed Hidalgo O , Pellicer J , Christenhusz M , Schneider H , Leitch AR , Leitch IJ. 2017 . Is there an upper limit to genome size? Trends Plant Sci . 22 ( 7 ): 567 – 573 . Google Scholar CrossRef Search ADS PubMed Hirochika H , Okamoto H , Kakutani T. 2000 . Silencing of retrotransposons in Arabidopsis and reactivation by the ddm1 mutation . Plant Cell 12 ( 3 ): 357 – 368 . Google Scholar CrossRef Search ADS PubMed Hoff KJ , Lange S , Lomsadze A , Borodovsky M , Stanke M. 2016 . BRAKER1: unsupervised RNA-seq-based genome annotation with GeneMark-ET and AUGUSTUS . Bioinformatics 32 ( 5 ): 767 – 769 . Google Scholar CrossRef Search ADS PubMed Hori K , Maruyama F , Fujisawa T , Togashi T , Yamamoto N , Seo M , Sato S , Yamada T , Mori H , Yajima N et al. , 2014 . Klebsormidium flaccidum genome reveals primary factors for plant terrestrial adaptation . Nat Commun . 5 : 3978 . Google Scholar CrossRef Search ADS PubMed Hua-Van A , Le Rouzic A , Boutin TS , Filée J , Capy P. 2011 . The struggle for life of the genome’s selfish architects . Biol Direct . 6 ( 1 ): 19. Google Scholar CrossRef Search ADS PubMed Huerta-Cepas J , Szklarczyk D , Forslund K , Cook H , Heller D , Walter MC , Rattei T , Mende DR , Sunagawa S , Kuhn M et al. , 2016 . eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences . Nucleic Acids Res. 44 ( D1 ): D286 – D293 . Google Scholar CrossRef Search ADS PubMed Huff JT , Zilberman D. 2014 . Dnmt1-independent CG methylation contributes to nucleosome positioning in diverse eukaryotes . Cell 156 ( 6 ): 1286 – 1297 . Google Scholar CrossRef Search ADS PubMed Hutvagner G , Simard MJ. 2008 . Argonaute proteins: key players in RNA silencing . Nat Rev Mol Cell Biol . 9 ( 1 ): 22 – 32 . Google Scholar CrossRef Search ADS PubMed Illingworth RS , Bird AP. 2009 . CpG islands—‘a rough guide’ . FEBS Lett . 583 ( 11 ): 1713 – 1720 . Google Scholar CrossRef Search ADS PubMed Innan H , Kondrashov F. 2010 . The evolution of gene duplications: classifying and distinguishing between models . Nat Rev Genet . 11 ( 2 ): 97 – 108 . Google Scholar CrossRef Search ADS PubMed Jantaro S , Ali Q , Lone S , He Q. 2006 . Suppression of the lethality of high light to a quadruple HLI mutant by the inactivation of the regulatory protein PfsR in Synechocystis PCC 6803 . J Biol Chem . 281 ( 41 ): 30865 – 30874 . Google Scholar CrossRef Search ADS PubMed Ji L , Chen X. 2012 . Regulation of small RNA stability: methylation and beyond . Cell Res . 22 ( 4 ): 624 – 636 . Google Scholar CrossRef Search ADS PubMed Kapraun DF. 2005 . Nuclear DNA content estimates in multicellular green, red and brown algae: phylogenetic considerations . Ann Bot . 95 ( 1 ): 7 – 44 . Google Scholar CrossRef Search ADS PubMed Keeling PJ. 2010 . The endosymbiotic origin, diversification and fate of plastids . Philos Trans R Soc B . 365 ( 1541 ): 729 – 748 . Google Scholar CrossRef Search ADS Kim D , Pertea G , Trapnell C , Pimentel H , Kelley R , Salzberg SL. 2013 . TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions . Genome Biol. 14 ( 4 ): R36. Google Scholar CrossRef Search ADS PubMed Kimura M. 1980 . A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences . J Mol Evol . 16 ( 2 ): 111 – 120 . Google Scholar CrossRef Search ADS PubMed Kong F , Mao Y , Yang H , Wang L , Liu L. 2012 . Cloning and characterization of the HLIP gene encoding high light-inducible protein from Porphyra yezoensis . J Appl Phycol . 24 ( 4 ): 685 – 692 . Google Scholar CrossRef Search ADS Krueger F , Andrews SR. 2011 . Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications . Bioinformatics 27 ( 11 ): 1571 – 1572 . Google Scholar CrossRef Search ADS PubMed Langmead B , Salzberg SL. 2012 . Fast gapped-read alignment with Bowtie 2 . Nat Methods . 9 ( 4 ): 357 – 359 . Google Scholar CrossRef Search ADS PubMed Law JA , Jacobsen SE. 2010 . Establishing, maintaining and modifying DNA methylation patterns in plants and animals . Nat Rev Genet . 11 ( 3 ): 204 – 220 . Google Scholar CrossRef Search ADS PubMed Lee JM , Cho CH , Park SI , Choi JW , Song HS , West JA , Bhattacharya D , Yoon HS. 2016 . Parallel evolution of highly conserved plastid genome architecture in red seaweeds and seed plants . BMC Biol . 14 ( 1 ): 75. Google Scholar CrossRef Search ADS PubMed Li H. 2012 . Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly . Bioinformatics 28 ( 14 ): 1838 – 1844 . Google Scholar CrossRef Search ADS PubMed Li SF , Zhang GJ , Yuan JH , Deng CL , Gao WJ. 2016 . Repetitive sequences and epigenetic modification: inseparable partners play important roles in the evolution of plant sex chromosomes . Planta 243 ( 5 ): 1083 – 1095 . Google Scholar CrossRef Search ADS PubMed Li Y , Tollefsbol TO. 2011 . DNA methylation detection: bisulfite genomic sequencing analysis . Methods Mol Biol . 791 : 11 – 21 . Google Scholar CrossRef Search ADS PubMed Liu Q , Feng Y , Zhu Z. 2009 . Dicer-like (DCL) proteins in plants . Funct Integr Genomics . 9 ( 3 ): 277 – 286 . Google Scholar CrossRef Search ADS PubMed Lomsadze A , Burns PD , Borodovsky M. 2014 . Integration of mapped RNA-seq reads into automatic training of eukaryotic gene finding algorithm . Nucleic Acids Res . 42 ( 15 ): e119. Google Scholar CrossRef Search ADS PubMed Lynch M , Ackerman MS , Gout JF , Long H , Sung W , Thomas WK , Foster PL. 2016 . Genetic drift, selection and the evolution of the mutation rate . Nat Rev Genet . 17 ( 11 ): 704 – 714 . Google Scholar CrossRef Search ADS PubMed Marçais G , Kingsford C. 2011 . A fast, lock-free approach for efficient parallel counting of occurrences of k-mers . Bioinformatics 27 ( 6 ): 764 – 770 . Google Scholar CrossRef Search ADS PubMed Marchler-Bauer A , Bo Y , Han L , He J , Lanczycki CJ , Lu S , Chitsaz F , Derbyshire MK , Geer RC , Gonzales NR et al. , 2017 . CDD/SPARCLE: functional classification of proteins via subfamily domain architectures . Nucleic Acids Res. 45 ( D1 ): D200 – D203 . Google Scholar CrossRef Search ADS PubMed Marí-Ordóñez A , Marchais A , Etcheverry M , Martin A , Colot V , Voinnet O. 2013 . Reconstructing de novo silencing of an active plant retrotransposon . Nat Genet . 45 ( 9 ): 1029 – 1039 . Google Scholar CrossRef Search ADS PubMed Martienssen RA , Colot V. 2001 . DNA methylation and epigenetic inheritance in plants and filamentous fungi . Science 293 ( 5532 ): 1070 – 1074 . Google Scholar CrossRef Search ADS PubMed Martin M. 2011 . Cutadapt removes adapter sequences from high-throughput sequencing reads . EMBnet.Journal 17 ( 1 ): 10 – 12 . Google Scholar CrossRef Search ADS Matsuzaki M , Misumi O , Shin-i T , Maruyama S , Takahara M , Miyagishima S , Mori T , Nishida K , Yagisawa F , Nishida K et al. , 2004 . Genome sequence of the ultrasmall unicellular red alga Cyanidioschyzon merolae 10D . Nature 428 ( 6983 ): 653 – 657 . Google Scholar CrossRef Search ADS PubMed Matzke M , Kanno T , Daxinger L , Huettel B , Matzke AJM. 2009 . RNA-mediated chromatin-based silencing in plants . Curr Opin Cell Biol . 21 ( 3 ): 367 – 376 . Google Scholar CrossRef Search ADS PubMed Matzke MA , Kanno T , Matzke AJM. 2015 . RNA-directed DNA methylation: the evolution of a complex epigenetic pathway in flowering plants . Annu Rev Plant Biol . 66 : 243 – 267 . Google Scholar CrossRef Search ADS PubMed Matzke MA , Mosher RA. 2014 . RNA-directed DNA methylation: an epigenetic pathway of increasing complexity . Net Rev Genet . 15 ( 6 ): 394 – 408 . Google Scholar CrossRef Search ADS McClintock B. 1950 . The origin and behavior of mutable loci in maize . Proc Natl Acad Sci U S A . 36 ( 6 ): 344 – 355 . Google Scholar CrossRef Search ADS PubMed McClintock B. 1984 . The significance of responses of the genome to challenge . Science 226 ( 4676 ): 792 – 801 . Google Scholar CrossRef Search ADS PubMed Melnyk CW , Molnar A , Bassett A , Baulcombe DC. 2011 . Mobile 24 nt small RNAs direct transcriptional gene silencing in the root meristems of Arabidopsis thaliana . Curr Biol. 21 ( 19 ): 1678 – 1683 . Google Scholar CrossRef Search ADS PubMed Merchant SS , Prochnik SE , Vallon O , Harris EH , Karpowicz SJ , Witman GB , Terry A , Salamov A , Fritz-Laylin LK , Maréchal-Drouard L et al. , 2007 . The Chlamydomonas genome reveals the evolution of key animal and plant functions . Science 318 ( 5848 ): 245 – 250 . Google Scholar CrossRef Search ADS PubMed Michael TP. 2014 . Plant genome size variation: bloating and purging DNA . Brief Funct Genomics . 13 ( 4 ): 308 – 317 . Google Scholar CrossRef Search ADS PubMed Minh BQ , Nguyen MAT , Von Haeseler A. 2013 . Ultrafast approximation for phylogenetic bootstrap . Mol Biol Evol . 30 ( 5 ): 1188 – 1195 . Google Scholar CrossRef Search ADS PubMed Mohn F , Weber M , Rebhan M , Roloff TC , Richter J , Stadler MB , Bibel M , Schübeler D. 2008 . Linage-specific polycomb targets and de novo DNA methylation define restriction and potential of neuronal progenitors . Mol Cell . 30 ( 6 ): 755 – 766 . Google Scholar CrossRef Search ADS PubMed Montané MH , Petzold B , Kloppstech K. 1999 . Formation of early-light-inducible-protein complexes and status of xanthophyll levels under high light and cold stress in barley (Hordeum vulgare L.) . Planta 208 ( 4 ): 519 – 527 . Google Scholar CrossRef Search ADS Nawrocki EP , Burge SW , Bateman A , Daub J , Eberhardt RY , Eddy SR , Floden EW , Gardner PP , Jones TA , Tate J et al. , 2015 . Rfam 12.0: updates to the RNA families database . Nucleic Acids Res. 43 ( Database issue ): D130 – D137 . Google Scholar CrossRef Search ADS PubMed Nguyen LT , Schmidt HA , Von Haeseler A , Minh BQ. 2015 . IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies . Mol Biol Evol . 32 ( 1 ): 268 – 274 . Google Scholar CrossRef Search ADS PubMed Okamoto H , Hirochika H. 2001 . Silencing of transposable elements in plants . Trends Plant Sci . 6 ( 11 ): 527 – 534 . Google Scholar CrossRef Search ADS PubMed Olmedo-Monfil V , Durán-Figueroa N , Arteaga-Vázquez M , Demesa-Arévalo E , Autran D , Grimanelli D , Slotkin RK , Martienssen RA , Vielle-Calzada J-P. 2010 . Control of female gamete formation by a small RNA pathway in Arabidopsis . Nature 464 ( 7288 ): 628 – 632 . Google Scholar CrossRef Search ADS PubMed Ou X , Zhang Y , Xu C , Lin X , Zang Q , Zhuang T , Jiang L , von Wettstein D , Liu B. 2012 . Transgenerational inheritance of modified DNA methylation patterns and enhanced tolerance induced by heavy metal stress in rice (Oryza sativa L.) . PLoS One 7 ( 9 ): e41143. Google Scholar CrossRef Search ADS PubMed Payer B , Lee JT. 2008 . X chromosome dosage compensation: how mammals keep the balance . Annu Rev Genet . 42 : 733 – 772 . Google Scholar CrossRef Search ADS PubMed Pellicer J , Fay MF , Leitch IJ. 2010 . The largest eukaryotic genome of them all? Bot J Linn Soc . 164 ( 1 ): 10 – 15 . Google Scholar CrossRef Search ADS Piegu B , Guyot R , Picault N , Roulin A , Sanyal A , Saniyal A , Kim H , Collura K , Brar DS , Jackson S et al. , 2006 . Doubling genome size without polyploidization: dynamics of retrotransposition-driven genomic expansions in Oryza australiensis, a wild relative of rice . Genome Res . 16 ( 10 ): 1262 – 1269 . Google Scholar CrossRef Search ADS PubMed Popova OV , Dinh HQ , Aufsatz W , Jonak C. 2013 . The RdDM pathway is required for basal heat tolerance in Arabidopsis . Mol Plant. 6 ( 2 ): 396 – 410 . Google Scholar CrossRef Search ADS PubMed Price AL , Jones NC , Pevzner PA. 2005 . De novo identification of repeat families in large genomes . Bioinformatics 21 ( Suppl 1 ): i351 – i358 . Google Scholar CrossRef Search ADS PubMed Price DC , Chan CX , Yoon HS , Yang EC , Qiu H , Weber APM , Schwacke R , Gross J , Blouin NA , Lane C et al. , 2012 . Cyanophora paradoxa genome elucidates origin of photosynthesis in algae and plants . Science 335 ( 6070 ): 843 – 847 . Google Scholar CrossRef Search ADS PubMed Qiu H , Lee J , Yoon HS , Bhattacharya D. 2017 . Hypothesis: gene-rich plastid genomes in red algae may be an outcome of nuclear genome reduction . J Phycol . 53 ( 3 ): 715 – 719 . Google Scholar CrossRef Search ADS PubMed Qiu H , Price DC , Yang EC , Yoon HS , Bhattacharya D. 2015 . Evidence of ancient genome reduction in red algae (Rhodophyta) . J Phycol . 51 ( 4 ): 624 – 636 . Google Scholar CrossRef Search ADS PubMed Quinlan AR , Hall IM. 2010 . BEDTools: a flexible suite of utilities for comparing genomic features . Bioinformatics 26 ( 6 ): 841 – 842 . Google Scholar CrossRef Search ADS PubMed Rensing SA. 2014 . Gene duplication as a driver of plant morphogenetic evolution . Curr Opin Plant Biol . 17 : 43 – 48 . Google Scholar CrossRef Search ADS PubMed Rensing SA , Lang D , Zimmer AD , Terry A , Salamov A , Shapiro H , Nishiyama T , Perroud P-F , Lindquist EA , Kamisugi Y et al. , 2008 . The Physcomitrella genome reveals evolutionary insights into the conquest of land by plants . Science 319 ( 5859 ): 64 – 69 . Google Scholar CrossRef Search ADS PubMed Saito K , Siomi MC. 2010 . Small RNA-mediated quiescence of transposable elements in animals . Dev Cell . 19 ( 5 ): 687 – 697 . Google Scholar CrossRef Search ADS PubMed Saze H , Scheid OR , Paszkowski J. 2003 . Maintenance of CpG methylation is essential for epigenetic inheritance during plant gametogenesis . Nat Genet . 34 ( 1 ): 65 – 69 . Google Scholar CrossRef Search ADS PubMed Saze H , Tsugane K , Kanno T , Nishimura T. 2012 . DNA methylation in plants: relationship to small RNAs and histone modifications, and functions in transposon inactivation . Plant Cell Physiol . 53 ( 5 ): 766 – 784 . Google Scholar CrossRef Search ADS PubMed Schönknecht G , Chen W , Ternes CM , Barbier GG , Shrestha RP , Stanke M , Bräutigam A , Baker BJ , Banfield JF , Garavito RM et al. , 2013 . Gene transfer from bacteria and archaea facilitated evolution of an extremophilic eukaryote . Science 339 ( 6124 ): 1207 – 1210 . Google Scholar CrossRef Search ADS PubMed Shahmuradov IA , Umarov RK , Solovyev VV. 2017 . TSSPlant: a new tool for prediction of plant Pol II promoters . Nucleic Acids Res . 45 ( 8 ): e65. Google Scholar PubMed Simão FA , Waterhouse RM , Ioannidis P , Kriventseva EV , Zdobnov EM. 2015 . BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs . Bioinformatics 31 ( 19 ): 3210 – 3212 . Google Scholar CrossRef Search ADS PubMed Singh M , Goel S , Meeley RB , Dantec C , Parrinello H , Michaud C , Leblanc O , Grimanelli D. 2011 . Production of viable gametes without meiosis in maize deficient for an ARGONAUTE protein . Plant Cell 23 ( 2 ): 443 – 458 . Google Scholar CrossRef Search ADS PubMed Slotkin RK , Martienssen R. 2007 . Transposable elements and the epigenetic regulation of the genome . Nat Rev Genet . 8 ( 4 ): 272 – 285 . Google Scholar CrossRef Search ADS PubMed Slotkin RK , Vaughn M , Borges F , Tanurdžić M , Becker JD , Feijó JA , Martienssen R. 2009 . Epigenetic reprogramming and small RNA silencing of transposable elements in pollen . Cell 136 ( 3 ): 461 – 472 . Google Scholar CrossRef Search ADS PubMed Stanke M , Keller O , Gunduz I , Hayes A , Waack S , Morgenstern B. 2006 . AUGUSTUS: ab initio prediction of alternative transcripts . Nucleic Acids Res. 34 ( Web Server ): W435 – W439 . Google Scholar CrossRef Search ADS PubMed Steinemann S , Steinemann M. 2005 . Retroelements: tools for sex chromosome evolution . Cytogenet Genome Res . 110 ( 1–4 ): 134 – 143 . Google Scholar CrossRef Search ADS PubMed Sung W , Ackerman MS , Miller SF , Doak TG , Lynch M. 2012 . Drift-barrier hypothesis and mutation-rate evolution . Proc Natl Acad Sci U S A . 109 ( 45 ): 18488 – 18492 . Google Scholar CrossRef Search ADS PubMed Takai D , Jones PA. 2002 . Comprehensive analysis of CpG islands in human chromosomes 21 and 22 . Proc Natl Acad Sci U S A . 99 ( 6 ): 3740 – 3745 . Google Scholar CrossRef Search ADS PubMed Teramoto H , Itoh T , Ono TA. 2004 . High-intensity-light-dependent and transient expression of new genes encoding distant relatives of light-harvesting chlorophyll-a/b proteins in Chlamydomonas reinhardtii . Plant Cell Physiol . 45 ( 9 ): 1221 – 1232 . Google Scholar CrossRef Search ADS PubMed Tippens ND , Vihervaara A , Lis JT. 2018 . Enhancer transcription: what, where, when, and why? Genes Dev . 32 ( 1 ): 1 . Google Scholar CrossRef Search ADS PubMed Tricker PJ , Gibbings JG , Rodríguez López CM , Hadley P , Wilkinson MJ. 2012 . Low relative humidity triggers RNA-directed de novo DNA methylation and suppression of genes controlling stomatal development . J Exp Bot . 63 ( 10 ): 3799 – 3813 . Google Scholar CrossRef Search ADS PubMed Vaucheret H , Fagard M. 2001 . Transcriptional gene silencing in plants: targets, inducers and regulators . Trends Genet . 17 ( 1 ): 29 – 35 . Google Scholar CrossRef Search ADS PubMed Verbsky ML , Richards EJ. 2001 . Chromatin remodeling in plants . Curr Opin Plant Biol . 4 : 494 – 500 . Google Scholar CrossRef Search ADS PubMed Worden AZ , Lee J-H , Mock T , Rouzé P , Simmons MP , Aerts AL , Allen AE , Cuvelier ML , Derelle E , Everett MV et al. , 2009 . Green evolution and dynamic adaptations revealed by genomes of the marine picoeukaryotes Micromonas . Science 324 ( 5924 ): 268 – 272 . Google Scholar CrossRef Search ADS PubMed Wright S , Finnegan D. 2001 . Genome evolution: sex and the transposable element . Curr Biol . 11 ( 8 ): R296 – R299 . Google Scholar CrossRef Search ADS PubMed Xiao W , Custard KD , Brown RC , Lemmon BE , Harada JJ , Goldberg RB , Fischer RL. 2006 . DNA methylation is critical for Arabidopsis embryogenesis and seed viability . Plant Cell 18 ( 4 ): 805 – 814 . Google Scholar CrossRef Search ADS PubMed Yamada KD , Tomii K , Katoh K. 2016 . Application of the MAFFT sequence alignment program to large data—reexamination of the usefulness of chained guide trees . Bioinformatics 32 ( 21 ): 3246 – 3251 . Google Scholar CrossRef Search ADS PubMed Yang EC , Boo SM , Bhattacharya D , Saunders GW , Knoll AH , Fredericq S , Graf L , Yoon HS. 2016 . Divergence time estimates and the evolution of major lineages in the florideophyte red algae . Sci Rep . 6 : 21361. Google Scholar CrossRef Search ADS PubMed Zemach A , McDaniel IE , Silva P , Zilberman D. 2010 . Genome-wide evolutionary analysis of eukaryotic DNA methylation . Science 328 ( 5980 ): 916 – 919 . Google Scholar CrossRef Search ADS PubMed Zeyl C , Bell G. 1996 . Symbiotic DNA in eukaryotic genomes . Trends Ecol Evol . 11 ( 1 ): 10 – 15 . Google Scholar CrossRef Search ADS PubMed Zhang J. 2003 . Evolution by gene duplication: an update . Trends Ecol Evol . 18 ( 6 ): 292 – 298 . Google Scholar CrossRef Search ADS Zilberman D , Gehring M , Tran RK , Ballinger T , Henikoff S. 2007 . Genome-wide analysis of Arabidopsis thaliana DNA methylation uncovers an interdependence between methylation and transcription . Nat Genet . 39 ( 1 ): 61 – 69 . Google Scholar CrossRef Search ADS PubMed © The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Molecular Biology and Evolution Oxford University Press

Loading next page...
 
/lp/ou_press/analysis-of-the-draft-genome-of-the-red-seaweed-gracilariopsis-chorda-mUcE980GRm
Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
ISSN
0737-4038
eISSN
1537-1719
D.O.I.
10.1093/molbev/msy081
Publisher site
See Article on Publisher Site

Abstract

Abstract Red algae (Rhodophyta) underwent two phases of large-scale genome reduction during their early evolution. The red seaweeds did not attain genome sizes or gene inventories typical of other multicellular eukaryotes. We generated a high-quality 92.1 Mb draft genome assembly from the red seaweed Gracilariopsis chorda, including methylation and small (s)RNA data. We analyzed these and other Archaeplastida genomes to address three questions: 1) What is the role of repeats and transposable elements (TEs) in explaining Rhodophyta genome size variation, 2) what is the history of genome duplication and gene family expansion/reduction in these taxa, and 3) is there evidence for TE suppression in red algae? We find that the number of predicted genes in red algae is relatively small (4,803–13,125 genes), particularly when compared with land plants, with no evidence of polyploidization. Genome size variation is primarily explained by TE expansion with the red seaweeds having the largest genomes. Long terminal repeat elements and DNA repeats are the major contributors to genome size growth. About 8.3% of the G. chorda genome undergoes cytosine methylation among gene bodies, promoters, and TEs, and 71.5% of TEs contain methylated-DNA with 57% of these regions associated with sRNAs. These latter results suggest a role for TE-associated sRNAs in RNA-dependent DNA methylation to facilitate silencing. We postulate that the evolution of genome size in red algae is the result of the combined action of TE spread and the concomitant emergence of its epigenetic suppression, together with other important factors such as changes in population size. Rhodophyta, Gracilariopsis chorda, DNA methylation, transposable element suppression, small RNAs Introduction How genome size evolves over evolutionary timescales (i.e., hundreds of millions of years) is a fundamental question posed by many genome biologists. The advent of next-generation DNA sequencing has led to the accumulation of vast amounts of genome data that reveal enormous variation in genome size and gene number among eukaryotes. This variation is however not strictly correlated with developmental or morphological complexity. Once known as the “C-value paradox,” it has become clear that genome size expansion primarily reflects the proliferation of noncoding DNA such as transposable elements (TEs), not the growth in organismal complexity. Relatively “simple” organisms such as unicellular dinoflagellates can have massive genomes, several gigabases in size, when compared with a “complex” plant such as Arabidopsis thaliana (120 Mb genome). In addition, the expansion of cis-acting regulatory elements (e.g., number of transcription factors) is a more important predictor of organismal complexity than the number of protein-coding genes (Hahn and Wray 2002; Greilhuber et al. 2005). In photosynthetic eukaryotes, genome size varies from several megabase pairs (Mb) to 148 Gb (the flowering plant Paris japonica) (Pellicer et al. 2010; Garcia et al. 2014; Hidalgo et al. 2017). In many land plants, polyploidization is one of the major contributors to genome expansion through gene/genome duplication, which is estimated to have occurred in 30–80% of angiosperms (Blanc and Wolfe 2004). Another important mechanism in genome evolution is TE-derived genome expansion. In land plants, this process significantly impacts the size and structure of genomes, as well as the mutation rate, extent of gene movement, and gene regulation (Bennetzen and Wang 2014), in the absence of polyploidization (Hawkins et al. 2006; Piegu et al. 2006). Genome-size expansion via the accumulation of TEs has been reported in many eukaryotes, including the multicellular red seaweed Chondrus crispus (Collén et al. 2013). TEs are classified on the basis of their transposition intermediate: RNA (retrotransposons, class I) and DNA (DNA transposons, class II) that carry out copy-and-paste and cut-and-paste activities, respectively, which may in some cases affect genome stability (e.g., via double-strand DNA breaks) with the generation of flanking repeated DNA elements (Slotkin and Martienssen 2007). Previously, TEs were regarded primarily as “parasites” that engage the host in an evolutionary arms race, whereby the host tries to control TE activity through mechanisms, such as chromatin remodeling and silencing through RNA interference (RNAi) (McClintock 1950; McClintock 1984; Okamoto and Hirochika 2001; Vaucheret and Fagard 2001; Slotkin and Martienssen 2007; Bennetzen and Wang 2014; Goodier 2016). With the accumulation of genome data, it has become clear however that the insertion and/or recombination of TEs may have positive impacts by providing a source of evolutionary novelty. Such beneficial outcomes include alternative splicing, epigenetic control of gene expression, transduction, duplication, and recombination (Hua-Van et al. 2011; Galindo-González et al. 2017). Countermeasures to TE activity in plant genomes are known as posttranscriptional gene silencing (PTGS) and transcriptional gene silencing (TGS) (Hirochika et al. 2000; Okamoto and Hirochika 2001; Vaucheret and Fagard 2001; Verbsky and Richards 2001; Slotkin and Martienssen 2007). PTGS is a small RNA (sRNA)-dependent process that leads to the degradation or repression of translation of target mRNAs in the cytoplasm. TGS is also sRNA-dependent, but induced by chromatin remodeling. These epigenetic modifications involve the RNA-dependent DNA methylation (RdDM) pathway (Matzke and Mosher 2014; Matzke et al. 2015). RdDM is not only related to TE silencing but also plays several crucial roles in the cell cycle (e.g., reproduction) (Garcia-Aguilar et al. 2010; Olmedo-Monfil et al. 2010; Singh et al. 2011), stress responses (Ou et al. 2012; Tricker et al. 2012; Popova et al. 2013), and intracellular interactions (Melnyk et al. 2011) through the control of gene expression (Matzke and Mosher 2014). Despite these important roles in genome complexity and evolution, DNA methylation has not yet been studied in red algae (Rhodophyta). The Rhodophyta are members of the supergroup Archaeplastida that also includes the Glaucophyta and Viridiplantae (green algae plus land plants). The ancestor of this supergroup putatively captured and retained a cyanobacterial cell through primary endosymbiosis that eventually gave rise to the canonical plastid (Bhattacharya et al. 2004; Keeling 2010). Despite having a common ancestor, the three “primary” (i.e., Archaeplastida) algal lineages have quite different gene inventories. For instance, all available red algal genome data indicate a relatively modest (i.e., ∼5K–13K) gene inventory in both morphologically “simple” unicellular mesophilic, extremophilic, and filamentous taxa as well as in large, multicellular seaweeds (Matsuzaki et al. 2004; Bhattacharya et al. 2013; Collén et al. 2013; Schönknecht et al. 2013; Brawley et al. 2017). In contrast, the Viridiplantae shows high variability in gene number from green algae (7K–16K genes) to land plants (20K–90K genes) (Arabidopsis Genome Initiative 2000; Derelle et al. 2006; Merchant et al. 2007; Rensing et al. 2008; Worden et al. 2009; Hori et al. 2014). The single, available glaucophyte genome from the unicellular Cyanophora paradoxa encodes ∼30K genes (Price et al. 2012). It has been proposed that the reduced gene inventory in red algae is the result of two ancient (>1 Ga) phases of genome reduction: The first occurred in the stem lineage of Rhodophyta and the second in the ancestor of the extremophilic Cyanidiophytina (Qiu et al. 2015,, 2017). However, the history of gene inventory evolution in red algae vis-à-vis genome size-change remains poorly understood, particularly for the multicellular red seaweeds (the Florideophyceae) for which limited genome data are available. To fill this gap in our knowledge, we sequenced and analyzed the genome of the economically important agar-producing multicellular red seaweed, Gracilariopsis chorda (Rhodophyta, Florideophyceae). With these data in hand, we asked three questions about the evolutionary history of red algal genomes: 1) What role do repeats and TEs play in explaining genome size variation, 2) what is the history of genome duplication and gene family expansion/reduction in Rhodophyta, and 3) as the focus of our study, is there evidence for TE suppression in red algae? We found no evidence of polyploidization in Rhodophyta, and gene families have generally remained constrained in size. Assessment of DNA methylation in G. chorda using bisulfite sequencing (Li and Tollefsbol 2011) shows that 8.3% (7.6 Mb) of DNA is methylated and that 96% (10,470/10,806 genes) of coding sequences (CDSs) contain methylated cytosines. We predicted a total of 489 genes that contain methylated CpG islands. We also found evidence of sRNA-dependent TE suppression in G. chorda. Our analyses provide insights into the evolutionary trajectory of red algal genomes, demonstrating that they are characterized by constrained gene inventories and a TE suppression pathway that appears to be typical for eukaryotes. Results and Discussion Repeat and TE Content in Red Algae The primary genome assembly of G. chorda was generated using single-molecule real-time (SMRT) sequencing (PacBio) with bases corrected using Illumina data (Hi-Seq 2000) (see Materials and Methods). The G. chorda genome is ∼92.1 Mb in size (49.26% G + C-content, comprised 1,211 contigs, N50 = 220.2 kb; supplementary table S1, Supplementary Material online) and encodes a predicted inventory of 10,806 protein-coding genes (number of exons: 13,274; avg. exon/coding genes: 1.23; number of introns: 2,468; supplementary table S2, Supplementary Material online). Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis identified 87.6% of the eukaryotic BUSCO conserved gene set (supplementary table S3, Supplementary Material online). On the basis of the comparative analysis of all available red algal genomic data, we find large differences in genome sizes between unicellular red algae (13.7–19.4 Mb) and multicellular red seaweeds (i.e., G. chorda: 92.1 Mb, C. crispus: 104.8 Mb, and Porphyra umbilicalis: 87.7 Mb) (Matsuzaki et al. 2004; Bhattacharya et al. 2013; Collén et al. 2013; Schönknecht et al. 2013; Brawley et al. 2017). A total of 56.4 Mb (61.2%) of the G. chorda genome comprised TEs, as predicted by the RepeatModeler pipeline (see Materials and Methods), which is comparable with 61.9 Mb (59.0%) in C. crispus, but higher than the 38.2 Mb (43.5%) of TEs in the P. umbilicalis genome (supplementary table S4, Supplementary Material online). In contrast to multicellular seaweeds, unicellular red algae contain a smaller proportion of TEs in their genomes: That is, Galdieria sulphuraria (2.2 Mb; 16.9%), Cyanidioschyzon merolae (3.4 Mb; 21.8%), and Porphyridium purpureum (0.7 Mb; 3.6%) (supplementary table S4, Supplementary Material online). On the basis of these results, we surmise that TE had a major impact on genome size expansion in multicellular red algae (i.e., the two florideophycean taxa, G. chorda and C. crispus) when compared with unicellular lineages. Among the florideophycean TEs, long terminal repeat (LTR) elements (60–62%) and DNA repeats (10–22%) are the two major components. Non-LTR retrotransposon sequences (e.g., Long interspersed nuclear elements and Short interspersed nuclear elements; LINEs and SINEs) are also present, but at lower frequencies (LINEs: ≤1.9% and SINEs: ≤0.08%; supplementary table S4, Supplementary Material online). LTRs are derived from RNA through the “copy-and-paste” activity initiated by the gag (group antigen) and pol (reverse transcriptase) genes (Slotkin and Martienssen 2007). These transposon families may have increased red algal genome size through duplication and/or transposition, as reported in land plants (supplementary table S5, Supplementary Material online) (Hawkins et al. 2006; Piegu et al. 2006; Slotkin and Martienssen 2007; Collén et al. 2013). To estimate the relative divergence values for individual TEs, Kimura evolutionary distances (Kimura 1980) were calculated between consensus and individual TE sequences. Because mutations accumulate over time, the genetic distances between TE families can represent their evolutionary history (i.e., recently or more anciently diverged TE). Except for the case of C. merolae, there are recently diverged major peaks in the different TE families (fig. 1A). For instance, the florideophycean-specific LTRs contributed >25% of the total TE sequences (G. chorda: 25.6% and C. crispus: 25.3%) within the range of recent divergence (Kimura distance = 0–5). This high proportion of LTRs is distinct from other red algal species (P. umbilicalis: 12.4%, P. purpureum: 0.7%, C. merolae: 0.0%, and G. sulphuraria: 6.2%). In contrast, the pattern of florideophycean TE expansion is different from its sister taxa such as P. umbilicalis (Bangiophyceae) that contains a large proportion of “unknown” repeats. These data support the idea that red algae underwent different pattern of TE sequence divergence, with strong evidence of LTR-derived genome expansion in the Florideophyceae. Fig. 1. View largeDownload slide Key features of red algal genomes. (A) Frequency and distribution of Kimura distances based on comparisons of repeated elements. (B) Genome size, repeated element, and gene number. The acronyms are as follows: Cyanid. is the extremophilic Cyanidiophytina, Porph. is Porphyridiophyceae, Bang. is Bangiophyceae, and Florid. is Florideophyceae. (C) Structural feature of gene density (number of genes/genome size) and gene clustering (average intergenic distance/median intergenic distance). (D) Heatmap of BUSCO genesets in red algae with their average copy numbers. (E) OGFs in red algae. GAL, Galdieria sulphuraria; CYS, Cyanidioschyzon merolae; POP, Porphyridium purpureum; PUM, Porphyra umbilicalis; CHC, Chondrus crispus; GRC, Gracilariopsis chorda. Fig. 1. View largeDownload slide Key features of red algal genomes. (A) Frequency and distribution of Kimura distances based on comparisons of repeated elements. (B) Genome size, repeated element, and gene number. The acronyms are as follows: Cyanid. is the extremophilic Cyanidiophytina, Porph. is Porphyridiophyceae, Bang. is Bangiophyceae, and Florid. is Florideophyceae. (C) Structural feature of gene density (number of genes/genome size) and gene clustering (average intergenic distance/median intergenic distance). (D) Heatmap of BUSCO genesets in red algae with their average copy numbers. (E) OGFs in red algae. GAL, Galdieria sulphuraria; CYS, Cyanidioschyzon merolae; POP, Porphyridium purpureum; PUM, Porphyra umbilicalis; CHC, Chondrus crispus; GRC, Gracilariopsis chorda. Although limited genome data are available, we compared shared TEs based on their phylogenetic relationships. We used repeat prediction (i.e., RepeatModeler pipeline) with the combined gnomes from each phylogenetic internode. For example, all six red algal genomes were analyzed with their species tag to determine the distribution of shared TEs (supplementary fig. S1A, Supplementary Material online). These results show that florideophycean TEs have recently expanded, although some families (e.g., LTR elements) existed in their common ancestor (supplementary fig. S1B, Supplementary Material online). Analysis of the Gene Inventory in Red Algae and Other Archaeplastida The gene inventory is thought to evolve slowly and conservatively when compared with genome size (Eddy 2012). The gene inventory is thought to evolve slowly and conservatively when compared with genome size (Eddy 2012). The number of predicted genes ranges from 4,803 to 13,125 in available red algal genomes (fig. 1B). Although the genome sizes of multicellular species are 5-fold larger than in unicellular species, they only encode 1.1- to 2.7-fold more genes (i.e., C. crispus: 9,603, G. chorda: 10,806, and P. umbilicalis: 13,125 genes) than do unicellular taxa (i.e., C. merolae: 4,803, G. sulphuraria: 7,174, and P. purpureum: 8,355 genes). The limited gene number in red algae becomes more apparent than their sister taxa, the Viridiplantae. Gene numbers in Viridiplantae genomes gradually increase from 7.8K to 14.4K in unicellular chlorophytes through 16.1K in a charophyte alga to 20–59K in land plants along with dramatic genome size expansions (i.e., from 12–120 to 120–761 Mb, respectively; supplementary fig. S2A, Supplementary Material online). In comparison, the genome of the glaucophyte Cyanophora paradoxa encodes ∼30K genes (Price et al. 2012). A novel PacBio assembly of this species shows that the predicted gene numbers do not change but the genome size is larger than previously estimated and is ∼110 Mb (Price DC and Bhattacharya D., unpublished data). Therefore, although there is still overall limited taxon sampling of Archaeplastida, only red algae appear to encode a relatively low gene number. Overall, significant differences exist with respect to both genome size (P-value = 0.019) and gene number (P-value = 0.007) between red algae and Viridiplantae (Wilcoxon rank sum test). Red algal genomes are also depauperate in terms of spliceosomal introns with C. merolae, G. sulphuraria, P. purpureum, P. umbilicalis, C. crispus, and G. chorda having intron densities of 1.24, 0.02, 0.009, 0.24, 0.19, and 0.026 introns/kb, respectively. Although intron density is relatively high in the unicellular red alga C. merolae and relatively low in the unicellular green alga O. tauri, this value over all available red algal genomes is statistically smaller than in Viridiplantae (P-value = 0.029; Wilcoxon rank sum test): O. tauri (0.18), C. reinhardtii (1.70), K. flaccidum (0.93), P. patens (1.64), A. trichopoda (1.17), A. thaliana (3.63), G. raimondii (0.55), and O. sativa (0.28). We compared gene density among available red algal genomes (fig. 1B). The G. chorda genome shows a relatively low gene density (number of genes per kb = 0.11), similar to other macroalgal species, C. crispus (0.09) and P. umbilicalis (0.15), which contain a large proportion of TEs (fig. 1A and B). Among Viridiplantae, the moss (P. patens), the shrub (A. trichopoda), rice (O. sativa), and cotton (G. raimondii) also contain a large proportion of TEs and a relatively low gene density. In contrast, the unicellular green alga C. reinhardtii shows a relatively small proportion of TEs and low gene density similar to florideophycean red algae (fig. 1C and supplementary fig. S2B, Supplementary Material online). Interestingly, gene density is not significantly different between both of these Archaeplastida lineages (P-value = 0.282; Wilcoxon rank sum test), which differs from the significant differences (P-value < 0.05) in genome size and gene number. In red algae, there is a negative correlation between the proportion of TEs and gene density (linear regression P-value = 0.015 and R2 = −0.8048) that differs from Viridiplantae (P-value = 0.066 and R2 = 0.4561). The negative correlation observed for red algal genomes may be explained by the relatively uniform (i.e., low) gene number combined with genome size growth. Because highly clustered genes were first reported in the red seaweed C. crispus when compared with the unicellular red alga C. merolae genome (Collén et al. 2013), we reevaluated this feature using our larger collection of red algal genomes. A higher gene clustering value (average/median intergenic distances) reflects shorter intergenic regions between genes (i.e., a right-skewed distribution). The G. chorda genome shows a relatively low degree of gene clustering (2.78) among red algae, together with C. merolae (1.54), P. purpureum (1.72), and P. umbilicalis (4.1). In contrast, highly clustered gene distributions are present in C. crispus (8.25) and G. sulphuraria (13.0) that result from a larger amount of short intergenic regions (see Collén et al. 2013). This suggests that the high degree of gene clustering found in the C. crispus and G. sulphuraria genome were independently derived rather than being an ancestral shared feature. The gene clustering values in red algae, however, are not significantly different from those in Viridiplantae (P-value = 0.228; Wilcoxon rank sum test) but the degree of gene clustering in the latter is more uniform (avg. = 2.13, SD = 0.37) than in red algae (avg. = 5.23, SD = 4.53) (supplementary fig. S2B, Supplementary Material online). Then, we focused on the conserved eukaryotic BUSCO gene set in eukaryotes (Simão et al. 2015). BUSCO genes generally do not form complex gene families (i.e., paralogs); therefore, they provide a useful marker for assessing the extent of genome-wide gene conservation and duplication. Including the gene inventory of G. chorda, we mapped all proteins from the three Archaeplastida lineages onto the 429-gene BUSCO data set (Simão et al. 2015). The average (avg.) and maximum (max) duplicated copy number of BUSCO gene sets are 1.32–1.69 and 9–25, respectively, in red algae (fig. 1D and supplementary table S6, Supplementary Material online), which is similar to that in C. paradoxa (Glaucophyta) (avg. = 1.37, max = 17). In contrast, the genomes of two lineages of Viridiplantae show different patterns: Green algae and charophytes (both aquatic) exhibit a similar duplicated copy number of BUSCO gene set as red algae (avg. = 1.37–1.79, max = 9–26); however, land plants contain a higher copy number of gene sets (avg. = 2.18–5.66, max = 39–133) (supplementary fig. S2C and table S7, Supplementary Material online). The patterns of duplicated BUSCO gene number are significantly different in aquatic algae when compared with land plants (P-values < 0.001; Wilcoxon rank sum test). These results suggest that the aquatic algae (e.g., red algae, glaucophytes, green algae, and charophytes) retain more conserved copy numbers of BUSCO genes than do land plants. The latter underwent significant expansion of the core gene family through duplications (i.e., after their split from green algae and charophytes). We extended the gene inventory analysis to all protein-coding regions using OrthoFinder (Emms and Kelly 2015). The numbers of orthologous gene families (OGFs) shared by all taxa were counted, and the common OGFs were regarded as the ancestral gene families. The six red algal species share 1,771 OGFs with an additional 1.1- to 1.4-fold duplicated genes for each species (1,993–2,548 genes; fig. 1E and supplementary table S8, Supplementary Material online). In the case of the eight Viridiplantae species that shared 2,732 OGFs, green algae and charophytes show 1.1- to 1.4-fold duplicated OGFs (i.e., 3,210–3,899 genes), whereas land plants contain 2.3- to 6.0-fold duplications (i.e., 6,559–16,562 genes) (supplementary fig. S2D and table S9, Supplementary Material online). From an extended OGF analysis including red, green, and glaucophyte species, we found 1,121 Archaeplastida OGFs (supplementary tables S10 and S11, Supplementary Material online). For these OGFs, the lower copy numbers are present in red algae (1.39–1.89 ± 0.88–3.12; ± SD) and other algal groups (green algae and charophytes: 1.5–2.2 ± 1.43–4.66, and glaucophytes: 2.48 ± 4.96) when compared with land plants (4.00–10.41 ± 9.46–33.66; supplementary fig. S3A, Supplementary Material online). Paralogy rates within Archaeplastida OGFs (supplementary fig. S3A, Supplementary Material online) were significantly different between land plants and algae (P-values ≤ 0.001; Wilcoxon rank sum test). Another result that is in agreement with the OGF work is the gene family gain and loss analysis of Archaeplastida using OrthoFinder results (Dollo parsimony analysis performed by the Count program; Csűös 2010). The red algal groups show lower gene family gains when we compare net gene gain and loss (supplementary fig. S3B, Supplementary Material online): That is, land plants have gained several thousand gene families (2,004–3,840), whereas multicellular red algae have lost between 188 and 1,683 gene families, similar to unicellular red algae (651–1683 gene families). In the case of green algae, these taxa have lost −984 gene families or gained +230 gene families. Interestingly, there is a larger gene family gain in the charophyte alga K. flaccidum (+1,900), comparable with the aggregated number of gene families (6,872) in one land plant species (i.e., P. patens 6,976; supplementary fig. S3B, Supplementary Material online). It was reported that K. flaccidum contains many land plant-associated gene families that encode fundamental functions present in the latter (Hori et al. 2014). We also identified 1,136 gene families present in all land plants and in K. flaccidum but absent in green algae (supplementary fig. S2D, Supplementary Material online). Although K. flaccidum provides an exceptional case, the aggregated gene numbers in land plants (6,976–8,812) and in aquatic algae (red, glaucophyte, green algae, and charophytes; 3,289–6,872) are significantly different (P-value < 0.001, Wilcoxon rank sum test; supplementary fig. S3B, Supplementary Material online). Moreover, the aggregated values within red algae are significantly different from that in Viridiplantae (P-value < 0.005; Wilcoxon rank sum test), although one bacterium-sized green alga, O. tauri, contains a relatively small number of gene families (3,988). Taken together, these data indicate that red algal lineages contain a less modified (i.e., more conserved) gene inventory than their sister lineages. We recognize however that future genome sequencing projects will undoubtedly change (i.e., likely increase) the size of the gene inventory in red algae because of improved gene predictions and annotations as the database grows for this anciently diverged eukaryotic lineage. In particular, generation of RNA-seq data from different taxa grown under a variety of culture conditions will expand the inventory of validated coding regions in Rhodophyta and provide insights into potential gene functions. Insights into G. chorda-Specific Gene Duplications Lineage-specific gene duplication is one of the major driving forces in eukaryotic genome evolution and is well described in land plants and vertebrates (Glasauer and Neuhauss 2014; Rensing 2014). To better understand genome evolution in G. chorda, we analyzed lineage-specific gene duplications. Using a k-mer analysis (Jellyfish program with 17-mer; Marçais and Kingsford 2011; supplementary fig. S4, Supplementary Material online), we were unable to find evidence of genome-wide duplications (e.g., polyploidy), consistent with the BUSCO and OGF analyses. Although one potential polyploidization event was proposed to have occurred in coralline red algae (based on DNA fluorescence using DAPI staining and the RBC method; Kapraun 2005), and the existence of multiple nuclei in cells has been reported (Goff and Coleman 1986; Cole 1990; Goff and Coleman 1990), there are no known cases of polyploidization-derived genes or cases of genome-level massive gene duplications in red algae. To achieve a more fine-grained perspective on G. chorda-specific gene duplications, we designed a strict screening process (supplementary figs. S5 and S6 and tables S12 and S13, Supplementary Material online; see Materials and Methods). Through this approach, we found two highly duplicated G. chorda-specific genes (i.e., six copies of OG172 and three paralogous copies of OG1241), the high-light-inducible protein (HLIP) family, because these genes contain the same domain (cl14543: light-harvesting-like protein 3). HLIPs have been previously characterized in Pyropia yezoensis (Bangiophyceae) and show similarity to light-harvesting chlorophyll a/b-binding proteins (Montané et al. 1999; Kong et al. 2012). On the basis of phylogenetic analysis and conserved domain evidence (this study; Kong et al. 2012), HLIP genes originated from cyanobacteria and acquired a transit peptide for plastid targeting. This family is highly duplicated (nine copies) in G. chorda (fig. 2A and B and supplementary fig. S6, Supplementary Material online) (Lee et al. 2016) due to a recent expansion, because sequences of the duplicated flanking regions are nearly identical across members (fig. 2C). To understand why this family was expanded, we surveyed their gene expression profiles (supplementary fig. S7, Supplementary Material online). As expected, HLIPs in G. chorda are highly expressed under several stress conditions (i.e., high temperature, osmotic stress, the light-dark transition, and cold stress; supplementary fig. S7, Supplementary Material online). Cyanobacterial HLIPs are upregulated in high light as well as under cold stress, osmotic stress, salt stress, UV-B treatment, and hydrogen peroxide treatment (Funk and Vermaas 1999; Bhaya et al. 2002; Teramoto et al. 2004; Jantaro et al. 2006). Therefore, the duplicated HLIPs of G. chorda may play a role in dealing with abiotic stress. Fig. 2. View largeDownload slide Duplicated genes in the Gracilariopsis chorda genome that encode HLIPs. (A) HLIP copy number in the nuclear genomes of red algae (tree topology is based on reference; Lee et al. 2016). (B) Red algal HLIP alignment and identified domain regions. Asterisks indicate proteins lacking a transit peptide. (C) Flanking regions of G. chorda HLIP genes. Fig. 2. View largeDownload slide Duplicated genes in the Gracilariopsis chorda genome that encode HLIPs. (A) HLIP copy number in the nuclear genomes of red algae (tree topology is based on reference; Lee et al. 2016). (B) Red algal HLIP alignment and identified domain regions. Asterisks indicate proteins lacking a transit peptide. (C) Flanking regions of G. chorda HLIP genes. DNA Methylation Patterns in G. chorda DNA methylation (cytosine methylation; 5-methylcytosine; 5-meC) is widely studied in land plants, animals, fungi, and bacteria and is involved not only in epigenetic regulation of gene expression but also genome structure and integrity vis-à-vis chromatin remodeling (Martienssen and Colot 2001; Zilberman et al. 2007; Feng et al. 2010; Law and Jacobsen 2010; Zemach et al. 2010; Capuano et al. 2014; Huff and Zilberman 2014). We generated methylome data from G. chorda using bisulfite sequencing (see Materials and Methods) (Li and Tollefsbol 2011). A total of 7.6-Mb methylated DNA sites were detected (8.3% of a total 92.1-Mb nucleotide sequences; 33.4% of a total 22.7-Mb cytosine sequences) that comprised CHH (3.9 Mb; 51.3% of 7.6 Mb), CHG (1.9 Mb; 25.7%), and CG (1.7 Mb; 23.0%) methylation (H = A, C, or T; fig. 3A and supplementary table S14, Supplementary Material online). The proportion of DNA methylation can vary in different species even in different tissues (Banyushin 2006). The genome-wide proportion of cytosine methylation in G. chorda (8.3% of 92 Mb genome) is lower than that in Arabidopsis thaliana (14% of 120 Mb genome) but higher than Mus musculus (7.6% of 2.8 Gb genome), Drosophila melanogaster (<0.1% of 140 Mb genome), and Escherichia coli (2.3% of 4.6 Mb genome) (Capuano et al. 2014). Interestingly, around 50% of methylated cytosines are present in TEs (3.7 Mb; 48.64% of 7.6 Mb), and the remainder is located in gene bodies (1.7 Mb; 22.34%) and other genomic regions, including promoters (2.2 Mb; 29.01%) (supplementary fig. S8, Supplementary Material online). Most protein-coding sequences (i.e., 10,470/10,806 CDSs; 96.8%) contain methylated cytosines. Among these, the majority includes 10–20% methylated cytosines, although there are 336 unmethylated CDSs (bar graph in fig. 3B and supplementary table S15, Supplementary Material online). To test the consistency of the signal from varied data sets, we sorted randomly sampled CDSs (20% of the total CDSs) and plotted their frequencies using 1,000 iterations (red line in fig. 3B). The majority of randomly sampled CDSs also comprises 10–20% methylated CDS bodies. The pattern of upstream and gene body methylation is similar across the three types of targeted DNAs, with CHH having the highest frequency (colored lines in fig. 3C). The distribution of DNA methylation level (%) is about 3–5% in gene bodies (dot plots in fig. 3C). DNA methylation level (%) of the CHG type (green dots) is about 4–5% higher than the CHH and CG types (fig. 3C). In the case of land plants, a specific gene family (i.e., chromomethylase proteins) is closely related to gene body methylation, although the general functions of gene body methylation are still poorly understood (Bewick and Schmitz 2017; Bewick et al. 2017). Our study indicates that DNA methylation in the red seaweed G. chorda follows a pattern that is similar to other eukaryotes, and gene family-specific gene body methylation in red algae may have a function that is analogous to that in land plants. Fig. 3. View largeDownload slide DNA (cytosine) methylation patterns in the Gracilariopsis chorda genome. (A) Circular heatmap of DNA methylation (CHH, CHG, and CG methylation) in which the red color indicates a higher methylation level and the blue, a lower level. (B) Coverage of methylation sites in CDSs and results of the random sampling-based methylation site distribution analysis. (C) Comparison of methylation frequencies in gene bodies and 1-kb upstream regions, and the methylation levels of gene bodies. Fig. 3. View largeDownload slide DNA (cytosine) methylation patterns in the Gracilariopsis chorda genome. (A) Circular heatmap of DNA methylation (CHH, CHG, and CG methylation) in which the red color indicates a higher methylation level and the blue, a lower level. (B) Coverage of methylation sites in CDSs and results of the random sampling-based methylation site distribution analysis. (C) Comparison of methylation frequencies in gene bodies and 1-kb upstream regions, and the methylation levels of gene bodies. DNA Methylation-Dependent Transcriptional Regulation in G. chorda One well-known function of DNA methylation is transcriptional inactivation (i.e., gene silencing) associated with promoter regions (Mohn et al. 2008; Payer and Lee 2008; Deaton and Bird 2011). In general, the promoter regions overlap with CpG islands (a high density of CG sites), suggesting that DNA methylation of CpG islands is linked with transcriptional repression, even though the CG sites in the CpG islands are not always methylated (Illingworth and Bird 2009; Deaton and Bird 2011). To understand DNA methylation-dependent gene regulation, we analyzed promoters and CpG islands in the G. chorda genome. We focused on promoters within 1 kb upstream of G. chorda genes that were predicted using the TSSPlant promoter database (prediction of plant polymerase II promoter) (Shahmuradov et al. 2017). As a result, 8,022 predicted promoters were comprised TATA-boxes (4925) and TATA-less Transcription Start Sites (TSS; 3097) (supplementary table S16, Supplementary Material online). These are primarily distributed within 50–400 bp upstream regions of genes (fig. 4A). The dominant nucleotide sequences are “TATAAAAATA” in TATA-boxes and cytosines in TATA-less promoters (fig. 4B and supplementary table S17, Supplementary Material online). The TATA-box promoter-binding regions are methylated primarily in the proximal and distal regions (fig. 4B), whereas the TATA-less TSS regions are more uniformly methylated (supplementary fig. S9, Supplementary Material online). Fig. 4. View largeDownload slide CpG island-dependent regulated genes in the Gracilariopsis chorda genome. (A) Promoter distribution within 1-kb region upstream of genes. (B) Amino acid proportions of TATA-box promoter regions and their methylation. (C) COG categories of CpG island-dependent regulated genes in G. chorda. L = replication, recombination and repair; A = RNA processing and modification; P = inorganic ion transport and metabolism; H = coenzyme transport and metabolism; I = lipid transport and metabolism; D = cell cycle control, cell division, chromosome partitioning; Q = secondary metabolites biosynthesis, transport, and catabolism; B = chromatin structure and dynamics; M = cell wall/membrane/envelope biogenesis; Z = cytoskeleton; V = defense mechanisms; F = nucleotide transport and metabolism. (D) Enriched GO terms of CpG island-dependent regulated genes in G. chorda. Fig. 4. View largeDownload slide CpG island-dependent regulated genes in the Gracilariopsis chorda genome. (A) Promoter distribution within 1-kb region upstream of genes. (B) Amino acid proportions of TATA-box promoter regions and their methylation. (C) COG categories of CpG island-dependent regulated genes in G. chorda. L = replication, recombination and repair; A = RNA processing and modification; P = inorganic ion transport and metabolism; H = coenzyme transport and metabolism; I = lipid transport and metabolism; D = cell cycle control, cell division, chromosome partitioning; Q = secondary metabolites biosynthesis, transport, and catabolism; B = chromatin structure and dynamics; M = cell wall/membrane/envelope biogenesis; Z = cytoskeleton; V = defense mechanisms; F = nucleotide transport and metabolism. (D) Enriched GO terms of CpG island-dependent regulated genes in G. chorda. A total of 52,008 CpG islands were predicted using a Perl script (Takai and Jones 2002), and 97% (50,484) of the CpG islands are smaller than 3,000 bp (median size = 300 bp), comparable with other organisms (supplementary fig. S10, Supplementary Material online) (Carlberg and Molnár 2014). However, there are 1,524 unusually large CpG islands (median size = 4,000 bp). In both cases, two-thirds of the CpG islands contain methylated CG dinucleotides, which encompass <15% of these regions (supplementary fig. S10, Supplementary Material online). Interestingly, there was a strong correlation between the length of contigs and the number of CpG islands (linear regression R2=0.97 and P-value < 2.2e-16; supplementary fig. S11A, Supplementary Material online), suggesting a random distribution of CpG islands on a genome-wide scale. On average, 0.62/kb (median = 0.58) of CpG islands is present in the genome (supplementary fig. S11B, Supplementary Material online). However, there are no correlations between GC richness and CpG islands with CDSs and/or promoters (linear regression R2 ≤ 0.45 and P-value < 2.2e-16; supplementary fig. S11C, Supplementary Material online). In general, CpG island-dependent regulated genes are defined by the features related to the presence of DNA methylations in CpG island-located promoter regions (Mohn et al. 2008; Payer and Lee 2008; Illingworth and Bird 2009; Deaton and Bird 2011). To analyze CpG island-dependent regulated genes in G. chorda, we analyzed CpG island-located promoters, of which only about one-half (3,720) had an overlap with the total number of predicted promoters (8,022) in the genome (supplementary table S18, Supplementary Material online). The genes associated with these CpG island-located promoters were regarded as candidates for potential regulation by DNA methylation. We focused on these genes knowing that other transcription regulatory factors (e.g., enhancer as cis-regulatory sequences; Tippens et al. 2018) could exist in the G. chorda genome. These promoter regions overlapped with the DNA methylation sites, including differential gene expression results (fragments per kilobase million (fpkm) > 0 in at least two RNA-seq conditions with a maximum fold-change, fpkm > 3, compared with expression of the housekeeping gene GAPDH; gene id = GRC0024GENE2684). Using these criteria, we selected 489 candidates for CpG island-dependent regulated genes. To analyze functions of these candidates, we analyzed functional categories of these genes using the eggNOG-mapper (Huerta-Cepas et al. 2016) that resulted in 470 COG (Clusters of Orthologous Groups) categories and 298 GO (Gene Ontology) terms (supplementary table S19, Supplementary Material online). After excluding unknown hits, the most frequent COG category is “posttranslational modification, protein turnover, chaperones” followed by “signal transduction mechanism” (fig. 4C). Moreover, from the enriched GO terms (togGO R package; Fisher test, P-value < 0.05; supplementary fig. S12, Supplementary Material online), there are not only the regulators (i.e., cell cycle and development in Biological Process category) but also the repressors (i.e., chromatin silencing, negative regulation of metabolic process, response ER stress and viral process in Biological Process, and transcriptional repression activity in Molecular Function category) (fig. 4D). On the basis of these functional descriptions using eggNOG, we identified the potential CpG island-dependent regulatory genes involved in the regulation of the cell cycle, the development and stress response as well as biological repression processes (e.g., chromatin silencing) in G. chorda genome. In other words, these functions might be regulated by methylations in G. chorda. TEs and Putative Epigenetic Regulation in G. chorda As discussed earlier, florideophycean species contain more TEs than other red algae, which is a major factor in their genome size expansion [i.e., G. chorda, this study, and C. crispus (Collén et al. 2013)]. However, TEs generally lead to negative impacts on the host genome, for example, by causing DNA double-strand breaks (McClintock 1950, 1984; Slotkin and Martienssen 2007). Therefore, TE activity is controlled by DNA methylation and histone modification in the host (Hirochika et al. 2000; Okamoto and Hirochika 2001; Vaucheret and Fagard 2001; Verbsky and Richards 2001; Slotkin and Martienssen 2007). About 71.2% of TEs contain methylated cytosines (47,459/66,575 TEs) in the G. chorda genome (supplementary fig. S8, Supplementary Material online). The number and level (percentage of methylated/unmethylated cytosines) of methylated TE sites were plotted for each type of TEs (supplementary fig. S13, Supplementary Material online). The major type of DNA methylation in TEs, based on their frequencies, is CHH methylation (red lines in supplementary fig. S13, Supplementary Material online). The distribution of DNA methylation level (%) is about 4–5% in each TE type (i.e., LTR: 4–6%, DNA: 4–7%, RC: 3–5%, LINE: 3–7%, SINE: 2–10%, and the others: 4–5%; dot plots in supplementary fig. S13, Supplementary Material online). DNA methylation level (%) of CHG type (green dots) is about 5% in all TE types (supplementary table S20 and fig. S13, Supplementary Material online). These patterns of methylation level in TEs (4–5%) are significantly different from that found in gene bodies (3–4%) for all three types of methylations (P-value < 0.05, Wilcoxon rank sum test; supplementary table S20, Supplementary Material online) except for CHG methylation of SINE elements (P-value = 0.298). On the basis of the presence of methylated DNA in TEs (47,608/66,575 TEs), we postulate that TEs are regulated by a DNA methylation-associated mechanism in G. chorda, similar to the DNA methylation-based posttranscriptional/transcriptional gene silencing (PTGS and TGS mechanisms) in land plants (Okamoto and Hirochika 2001; Verbsky and Richards 2001; Matzke et al. 2009, 2015; Saze et al. 2012). In general, PTGS is sRNA-dependent and induced by degradation or translational repression of target mRNA in the cytoplasm. TGS is also an RNA-dependent translational repression process, but is induced by chromatin remodeling (Matzke and Mosher 2014; Matzke et al. 2015). To elucidate how many sRNAs are associated with TE regions, we produced sRNA data from G. chorda and sorted sequences that are 18–30 nt in length based on their mapping outcome (criterion: >100 reads depth with a perfect match; see Materials and Methods). A total of 40,925 sRNAs were predicted and classified in various categories but the majority is unclassified (supplementary table S21, Supplementary Material online). In many cases, small interfering RNAs (siRNAs in land plant genomes) and PIWI-interacting RNAs (P-element induced wimpy testis RNAs; piRNAs in animal genomes) are involved in TE silencing (Brennecke et al. 2008; Slotkin et al. 2009; Saito and Siomi 2010; Ji and Chen 2012; Marí-Ordóñez et al. 2013), but these small RNA categories were difficult to define in the G. chorda genome largely due to the uncharacterized status of sRNAs in red algae. However, we expect that there are many unknown (unclassified) red algal sRNAs that may be involved in TE silencing. Therefore, we mapped sRNAs onto the TEs and found that 57% (23,315/40,925) are located in these regions. These TE-located sRNAs could potentially be involved in the methylation of TE regions (PTGS) or in the degradation of expressed TE sequences (TGS) in G. chorda. As an example, the unmethylated LINE element (GRC1278) contained 49 complementary sRNAs sites where their mapping depths range from 102 to 12,430 sRNA reads (fig. 5A and supplementary table S22, Supplementary Material online). These unmethylated TE-located sRNAs might be potential targets of sRNA for degradation or translational suppression of TEs, which is similar to TGS in land plants. In general, the TGS mechanism is implicated in the regulation of expressed TEs rather than the methylation of TE regions (Okamoto and Hirochika 2001; Verbsky and Richards 2001; Matzke et al. 2009, 2015; Saze et al. 2012). To test the association between sRNA and TE methylation, we compared the proportion (%) of methylated TEs with the sRNA-binding sites. Not surprisingly, no correlation exists between sRNA density (number of sRNAs per kb) and methylation density (methylation sites per kb) in TE regions (P-value = 0.15, Spearman rank correlation). In addition, there was no significant difference in terms of whether methylated TEs contain complementary sRNAs or not (P-value > 0.6, Chi-squared test) (proportion of methylated TEs with mapped sRNA: 74% [9,667/12,957] and methylated TEs without mapped sRNA: 70% [37,792/53,618]). In contrast, a decrease in the number of methylated TEs depends on sRNA density in TE regions (negative correlation, rho = −0.96, P-value < 2.2e-16, Spearman rank correlation). On the basis of these results, we postulate that a large proportion of sRNAs in G. chorda genome are primarily related to the translational TE suppression (i.e., TGS-like) rather than the methylation of TE regions (i.e., PTGS-like). Interestingly, the largest proportion of the TE-located sRNAs is of size 28 nt, although 27, 26, and 20 nt sRNAs are also well represented (supplementary fig. S14, Supplementary Material online). To understand better the size composition of the major sRNA (i.e., of lengths, 20, 26, 27, and 28 nt) in TE regions, we randomly resampled TEs 1,000 times (20% of the total sRNA-associated TEs) and then plotted the sRNA frequencies (supplementary fig. S14, Supplementary Material online). This result shows a nearly identical pattern (linear regression R2 = 0.999) between the total number of TE-located sRNAs and the randomly resampled data. In addition, these major sRNAs (20, 26, 27, and 28 nt) show higher CHH at both termini of the sRNAs (supplementary fig. S15, Supplementary Material online). Fig. 5. View largeDownload slide TE located sRNAs and methylation sites. (A) Distribution of unmethylated TE-associated sRNAs from an example of LINE elements. (B) Distribution of methylated TE-associated sRNAs and methylated sites from examples of LTR elements. Fig. 5. View largeDownload slide TE located sRNAs and methylation sites. (A) Distribution of unmethylated TE-associated sRNAs from an example of LINE elements. (B) Distribution of methylated TE-associated sRNAs and methylated sites from examples of LTR elements. As other examples, LTR elements contain a large proportion of methylation sites with several complementary sRNA sites (fig. 5B). On the basis of several lines of evidences, we suggest that sRNA-associated TE methylation mechanisms (i.e., PTGS-like) exist in G. chorda (supplementary table S22, Supplementary Material online). As a TE methylation mechanism, the RdMD pathway was suggested to operate in land plants (Matzke and Mosher 2014; Matzke et al. 2015). The machinery is related to the DCL (Dicer-like protein that processes long double-stranded RNAs into small interfering RNAs) and AGO (Argonaute protein that is responsible for sRNA-guided chromatin modification) pathway although these proteins also play general roles in the generation and functioning of sRNAs (Liu et al. 2009; Hutvagner and Simard 2008). We found these DCL and AGO homologs in mesophilic red algae using BlastP and domain searches (supplementary table S23, Supplementary Material online). These genes are highly expressed in G. chorda (supplementary table S24, Supplementary Material online). On the basis of these results, we postulate that the DCL and AGO proteins in red algae (at least G. chorda genome) play important roles in the small RNA-related process, even though the detailed classification of these protein families is challenging (alignment using MAFFT with default option; maximum likelihood tree using IQ-tree with 1,000 replications after choosing the best-fit evolutionary model; supplementary fig. S16, Supplementary Material online; Minh et al. 2013; Flouri et al. 2015; Nguyen et al. 2015; Yamada et al. 2016). In addition, most RdDM pathways in land plants are polymerase-dependent (pol IV and V) (Slotkin and Martienssen 2007; Matzke and Mosher 2014; Matzke et al. 2015); however, these enzymes are absent from red algal genomes. Taken together, it is highly likely that red algal genomes contain a putative TE suppression pathway because of presence of cytosine methylation of TEs, TE-associated sRNAs, and the conserved proteins of DCL and AGO although our genome-wide study has the limitation that we cannot determine how sRNAs directly regulate DNA methylation with the associated components. In summary, based on our genome-wide analysis of methylation and sRNAs in G. chorda, we suggest that sRNA-mediated TE suppression present as a form of epigenetic modification in red algae. Conclusions Analysis of genomes from the multicellular red algal class Florideophyceae demonstrates genome size expansion via the accumulation of TEs, when compared with its sister class, the Bangiophyceae and other early diverged, unicellular red algae (Cyanidiophyceae and Porphyridiophyceae). In land plants, TE activity has contributed to genome size growth, DNA rearrangement and gene mutation, modification, movement, and regulation (Bennetzen and Wang 2014). It is intriguing that sexual reproduction in angiosperms may increase the diversity and activity of TE families (Wright and Finnegan 2001; Arkhipova 2005; Steinemann and Steinemann 2005; Michael 2014; Li et al. 2016), which has been experimentally validated in yeast (Zeyl and Bell 1996). Although multicellular Florideophyceae and Bangiophyceae have sexual life cycle and large proportions of TEs in their genomes, additional evidence is required to establish a connection between sexual reproduction and TE-derived genome expansion in red algae. Beyond such adaptation-based scenarios, another factor that potentially explains genome growth in red seaweeds is population size. Although Floridiophyceae (∼6,700 spp.) are ubiquitous in nearshore marine environments, they are likely to be far less abundant than unicellular red algal species. It has been previously argued under the drift-barrier hypothesis for mutation-rate evolution (Lynch et al. 2016) that effective population size and genetic drift govern the strength of selection on trait evolution. Smaller populations are more prone to genetic drift and elevated genome-wide deleterious mutation rates, with larger populations behaving in the opposite direction with selection removing deleterious traits more effectively (Sung et al. 2012). Under this scenario, the larger genomes and higher TE loads of red seaweeds may be explained by their relatively smaller population sizes when compared with unicellular lineages. The genetic drift among red seaweeds has been active for at least 600 My when the Florideophyceae radiated (Yang et al. 2016). Therefore, TE-derived genome size growth of multicellular red algae might reflect the long-term evolutionary consequence of reduced population sizes. This hypothesis does not argue that all unicellular Rhodophyta to be analyzed in the future will have small genomes, but rather, that their genome size evolution may be better understood under a neutral model that reflects population level processes. The apparent absence of polyploidization in red algae provides a potential explanation for why the gene inventory has not expanded massively in red seaweeds, despite genome size growth, in sharp contrast to many land plants. Interestingly, all available red algae have genome sizes at or smaller than ∼100 Mb. This contrasts with land plants whose genomes can be several hundred megabase pairs in size with TE-derived genome size growth occurring independent of polyploidization (Hawkins et al. 2006; Piegu et al. 2006). TE activity in land plants has likely provided novel sources of evolutionary variability in these lineages (Hua-Van et al. 2011; Galindo-González et al. 2017). Testing this latter scenario in Rhodophyta will require additional red algal genome data and genetic tools that await development. Our epigenomic analysis shows that the potential exists for CpG island-dependent regulated genes in G. chora. The impacted genes have regulatory functions such as “cell cycle and development,” “chromatin silencing,” and “transcriptional repression activity.” Therefore, we suggest that DNA methylation may play a role in regulating the most complex aspects of red seaweed biology such as sexual reproduction, that is akin to regulation of plant gametogenesis (Saze et al. 2003) and embryogenesis (Xiao et al. 2006). This and other hypotheses regarding the impact of epigenomic regulation in red algal development and adaptation remain to be investigated using the genetic toolkit described in this study. Materials and Methods Sample Preparation and Genome Sequencing Samples of isomorphic gametophytes (n) and tetrasporophytes (2n) of G. chorda cultivars were collected from a coastal farm Jangheung, Jeonnam, Korea (34°28ʹ18″N, 126°56ʹ28″E) on February 2, 2013. Total genomic DNA was extracted from a fresh sample using a modified CTAB protocol and the DNeasy Plant Maxi Kit (Qiagen, Germany) with clean-up steps. PacBio Library Construction and Sequencing Using the Covaris G-tube, we generated 20-kb fragments by shearing genomic DNA according to the manufacturer’s protocol. Using the AMpureXP bead purification system to remove the small fragments, a total of 5 µg from each sample was used as input for library preparation. The library was constructed using the SMRTbell Template Prep Kit 1.0 (PN 100-259-100). Using the BluePippin Size selection system, we removed small fragments for the large-insert library. After a sequencing primer was annealed to the SMRTbell template, DNA polymerase was bound to the complex (DNA/Polymerase Binding kit P6v2). Following the polymerase binding reaction, the MagBead Kit was used to bind the library complex with MagBeads before sequencing. MagBead bound complexes provide for more reads per SMRT Cell. This polymerase-SMRTbell-adaptor complex was loaded into zero-mode waveguides. The SMRTbell library was sequenced using 17 SMRT cells (Pacific Biosciences) using C4 chemistry (DNA sequencing Reagent 4.0). This procedure generated 6.52 Gb of total data (supplementary table S25, Supplementary Material online). The PacBio reads were assembled using SMRTmake, based on the Celera Assembler. To correct errors in the PacBio assembly, the short-read genome sequencing platform from Illumina (HiSeq2000) was used to generate data. A whole-genome sequencing library was prepared according to the manufacturer’s instructions. Genomic DNA was randomly sheared to yield fragments of 200–300 bp in size using the Covaris S2 system. The fragments were ligated with index adapters using the Truseq DNA Sample prep kit (Illumina, San Diego, CA) and AMPure XP Beads purification kit (Beckman Coulter Genomics, Danvers, MA). Ligated products were purified and size-selected for 300–400 bp using the Pippin prep electrophoresis platform (Sage Science, Beverly, MA). Size-selected fragments were amplified with adapter-specific primer sets. The quality of the amplified libraries was assessed on a 2100 BioAnlayzer (Agilent Technologies, Santa Clara, CA) and then sequenced with Illumina HiSeq2000 using 100-bp paired-end reagents. A total of 46.5 Gb of data was generated (supplementary table S25, Supplementary Material online). De Novo Assembly and Scaffolding De novo assembly was conducted using the hierarchical genome assembly process (HGAP) workflow (Chin et al. 2013) on SMRT portal (v2.3). HGAP starts with the filtering step for the SMRT reads and then performs an error correction on long reads using relatively short reads with 30× target coverage and 90 Mb expected genome size parameters. The error-corrected reads were used in Celera Assembler to generate the draft assembly. As the final HGAP step, polishing was performed through consensus calling using Quiver. The “polished” contigs were then further scaffolded using SSPACE (v3.0) with Illumina paired-end reads (Boetzer et al. 2011). To correct errors in the PacBio assembly, Illumina (HiSeq 2000) paired-end reads were mapped to the long-read data and the sequences corrected using bwa (v0.7.15; Li 2012). Analysis of Repeated DNA Elements We searched for repeated DNA elements using RepeatModeler (v1.0.10; http://www.repeatmasker.org/RepeatModeler/; de novo repeat family identification and modeling package). This package includes two de novo repeat finding programs: RECON (v1.08) (Bao and Eddy 2002) and RepeatScout (v1.0.5) (Price et al. 2005). We chose options as default l-mer size and filtered out low-complexity and tandem repeats (Tandem Repeats Finder) (Benson 1999). Repeat libraries were downloaded (http://www.girinst.org) and used in the classification step. Each genome was masked with its corresponding repeat library using RepeatMasker (v4.0.7, http://www.repeatmasker.org/) under sensitive mode (-s); without masking low-complexity and simple repeats (-nolow) and without checking for bacterial insertion sequences (-no_is). An in-house Python script was used to parse the results that the frequency of Kimura distance between copies of repeated elements (TEs) identified in the genome using the library. To analyze shared TEs (supplementary fig. S1, Supplementary Material online), all six red algal genomes were merged into a single file, with species tags included in the contig names, as input for the RepeatModeler pipeline. If the same consensus repeat families were present at phylogenetic internodes, the individual repeats in each red algal genome were regarded as shared TEs within the groups (gray bars in supplementary fig. S1A, Supplementary Material online). From each divergence point, we plotted the Kimura values that were calculated between the consensus and individual TE sequences (supplementary fig. S1A, Supplementary Material online; Kimura 1980). The classification and amount of shared TEs are shown in supplementary figure S1B, Supplementary Material online. Archaeplastida genome data were downloaded and their repeats reanalyzed using the above pipelines, including data from Cyanophora paradoxa (Price et al. 2012), Galdieria sulphuraria (Schönknecht et al. 2013), Cyanidioschyzon merolae (Matsuzaki et al. 2004), Porphyridium purpureum (Bhattacharya et al. 2013), Chondrus crispus (Collén et al. 2013), Porphyra umbilicalis (Brawley et al. 2017), Ostreococcus tauri (Derelle et al. 2006), Chlamydomonas reinhardtii (Merchant et al. 2007), Physcomitrella patens (Rensing et al. 2008), Amborella trichopoda (Amborella Genome Consortium 2013), and Arabidopsis thaliana (Arabidopsis Genome Initiative 2000). RNA Sequencing To collect diverse RNA sequencing data, we used G. chorda thalli from various developmental stages in several culture conditions. One female thallus with carpospores and one vegetative thallus without carpospores (i.e., young stage of gametophyte or tetrasporophyte) were cleaned using sterilized seawater under the dissecting microscope and kept in a culture chamber. To increase the recovery of expressed genes, we extracted total RNAs under the following seven conditions: 1) Cleaned nonfemale thallus, kept under the light (2 h) and room temperature (RT; 20 °C), 2) female thallus, kept under the light and RT, 3) nonfemale thallus, kept under the light and high temperature (30 °C, 2 h), 4) nonfemale thallus under the light and half salinity condition (DW: seawater = 1:1), 5) nonfemale thallus, kept under the dark (2 h) and RT, 6) nonfemale thallus exposed the light (30 min) after dark treatment (7 h), and 7) nonfemale thallus kept under the light and low temperature (4 °C) for 7 days. Total RNA was extracted using QIAGEN RNeasy Plant Mini Kit (Qiagen) based on the manufacturer’s instruction with DNase treatment. The RNA sequencing (RNA-seq) library was prepared according to the manufacturer’s instructions using the Illumina Truseq RNA Prep kit v2. The mRNA was purified and fragmented from total RNA (1 µg) using poly-T oligo-attached magnetic beads using two rounds of purification. Cleaved RNA fragments primed with random hexamers were reverse transcribed into first strand cDNA using reverse transcriptase and random primers. The RNA template was then removed prior to the synthesis of a replacement strand to generate double-strand cDNA. End repair, A-tailing, adaptor ligation, cDNA template purification, and enrichment of the purified cDNA templates were then performed using PCR. The quality of the amplified libraries was assessed on a 2100 Bioanalyzer (Agilent Technologies) and sequenced using 100-bp paired-end reagents with an Illumina HiSeq2000. All RNA-seq data were used to predict the gene models and their expression level analysis of G. chorda. Gene Prediction, Annotation, and Contaminant Verification To predict the gene models, all RNA-seq data were aligned to G. chorda genome (supplementary table S26, Supplementary Material online) using TopHat2 aligner (v2.1.1) (Kim et al. 2013). Splice junctions based on mapped RNA-seq reads were identified using Bowtie2 (v2.2.6) (Langmead and Salzberg 2012). The RNA-seq data were assembled de novo using Trinity (v2.4.0) (Haas et al. 2013) with the default option and these assembled transcripts (data not shown here) were also aligned to the G. chorda genome using TopHat2. Based on these mapped RNA-seq data, the gene models were predicted by BRAKER1 pipeline (v1.9) (Hoff et al. 2016), which includes implementation of GeneMark-ET (Lomsadze et al. 2014) and AUGUSTUS (Stanke et al. 2006). Additionally, the homology-based method was also conducted using protein sequences from taxonomically diverse genomes including Chondrus crispus, Cyanidioschyzone merolae, Cyanophora paradoxa, Porphyridium purpureum, Galdieria sulphuraria, Pyropia yezoensis, Arabidopsis thaliana, and Homo sapiens; these protein sequences were aligned to G. chorda genome using TBlastN (e-value ≤ 1e-04). The homologous regions of the genome sequences were then aligned to matched proteins using Genewise (Birney et al. 2004) to predict the accurate spliced alignments. EvidenceModeler, EVM (Haas et al. 2008) was used to integrate the homology-based predicted gene models to consensus gene model. To run EVM, all gene models were converted to EVM-compatible GFF3 format. All RNA-seq-based and homology-based gene models were combined into final gene set for G. chorda genome prior to the RNA-seq-based gene models. To choose the correct gene models, the generated protein models were checked manually using customized python scripts based on BlastN/P results using the nt/nr database (NCBI). The organelle-derived (mitochondria and plastid) and contaminant contigs were identified and removed. A total of 10,806 genes were predicted from the 92.1 Mb G. chorda genome. Of these, 7,139 genes were annotated with functional information from UniProt (http://www.uniprot.org/; 5,378 genes), conserved domains (Marchler-Bauer et al. 2017) (CD-search; 7,018 genes), and KEGG blast (3,089 genes; http://www.genome.jp/tools/blast/). The promoters of G. chorda proteins were predicted from 1-kb upstream regions using TSSPlant (prediction of plant polymerase II promoter) (Shahmuradov et al. 2017). Gene Family and Duplication Analysis To compare the ancestrally duplicated genes in Archaeplastida, all proteins from the three Archaeplastida lineages were mapped onto the 429 BUSCO data set (Simão et al. 2015). The average numbers of duplicated BUSCO genes in each species were calculated from BUSCO gene numbers (supplementary table S6, Supplementary Material online). Clustered/orthologous gene families were analyzed using the OrthoFinder pipeline (blast algorithm; inflation parameter for mcl = 1.5; Emms and Kelly 2015). Three separate runs of ortholog clustering were performed: 1) The six red algal species (fig. 1E), 2) eight Viridiplantae species (supplementary fig. S2D, Supplementary Material online), and 3) 15 Archaeplastida species (supplementary fig. S3, Supplementary Material online). Based on the clustered proteins, the clusters composed of all used species were sorted and defined as OGFs (fig. 1E). Other clustered gene families were indicated as same gene family. The numbers of shared OGFs in each group were summarized as bar graphs with rest of gene numbers (fig. 1E). To analyze G. chorda-specific gene duplications, a strict process was implemented (supplementary fig. S5, Supplementary Material online). First, all available red algal proteins (from 54,101 genes) were clustered into homologous groups (i.e., gene families) using the OrthoFinder pipeline (supplementary fig. S5A, Supplementary Material online) (Emms and Kelly 2015). Second, we analyzed conserved domain structure and number from all red algal proteins using the web-based CD-search (Marchler-Bauer et al. 2017), and then searched the candidates for species-specific and highly duplicated genes (criteria: > 2-fold domain duplication when compared with other taxa; supplementary fig. S5B and C, Supplementary Material online). Using this procedure, we found 13 G. chorda-specific duplicated gene families (supplementary table S12, Supplementary Material online). This third step resulted in five domain-families that were duplicated when compared with other red algal taxa excluding fragmentally duplicated domain structures (supplementary fig. S5D and table S12, Supplementary Material online). These duplicated domains are located in six gene families (47 genes) in G. chorda (see gene expression and putative functions in supplementary table S13, Supplementary Material online). As the fourth step, these six gene families were validated as functionally expressed genes using RNA-seq data (supplementary fig. S5D, Supplementary Material online), because some cases of duplicated genes had undergone purifying selection (loss of function) during the duplication process (Zhang 2003; Innan and Kondrashov 2010). From this step, six gene families were identified as being functionally expressed and highly duplicated (i.e., OG172, OG1241, OG78, OG329, OG16, and OG1129; supplementary table S13, Supplementary Material online). After filtering out two gene families (i.e., OG16 and OG1129) that show insufficient expression levels, four gene families were confirmed as duplicated functional genes involving three functional domains (cl14543, cl05795, and pfam04096). Phylogenetic analysis of these candidates was done using an extended taxon data set that incorporated EST data (fig. 2A, supplementary fig. S5E, Supplementary Material online). Putative homologs for each target protein were identified using BlastP search (e-value ≤ 1e-05) against RefSeq database and were aligned using MAFFT v7.309 (default option: –auto; Yamada et al. 2016). Through this process, two gene families (i.e., OG78 and OG329) were removed because their copy numbers were not significant when compared with other homologs in red algae from the extended data set (criteria: >2-fold duplications when compared with other taxa). Finally, the combined alignment of cl14543 domain contained gene families (OG172 and OG1241) was used as input for IQ-tree to infer a maximum likelihood tree with 1,000 replications (-bb 1000) after the model test step (-m TEST; supplementary fig. S5E, Supplementary Material online) (Minh et al. 2013; Flouri et al. 2015; Nguyen et al. 2015). Bisulfite Sequencing Library Preparation (Illumina TruSeq) and DNA Methylation Analysis We checked the quality of DNA using 1% agarose gel electrophoresis and the PicoGreen dsDNA Assay (Invitrogen, Waltham, MA). Bisulfite conversion of genomic DNA (1 μg) was performed with the EZ DNA Methylation-Gold kit (Zymo Research, Irvine, CA). The methyl-seq libraries were prepared using the TruSeq DNA Methylation kit (Illumina). The quality of the libraries was verified by capillary electrophoresis (Bioanalyzer; Agilent Technologies). Sequencing of 2×100-bp paired-end reads was carried out on the Illumina HiSeq 2500 platform. We produced the 29 Gb (289 million reads; 91.0% ≥Q30) of bisulfite sequencing data and aligned them to the G. chorda genome using bowtie2 (Langmead and Salzberg 2012). Removal of duplicated reads and methylation calling were done using bismark (v0.16.3) (Krueger and Andrews 2011). Methylation sites in gene bodies, promoters and TEs were identified using customized Python scripts based on G. chorda gene models and the genome sequence. The CpG islands were predicted using a customized Perl script (options: GC% ≥ 50, ObsCpG/ExpCpG ≥ 0.65, and sequence length ≥ 200 bp; Takai and Jones 2002). Small RNA Sequencing Library Preparation (Illumina TruSeq) and Data Analysis Small RNA was extracted using the Hybrid-R miRNA kit (GeneAll, Seoul, Korea). The extracted RNA purity was determined by assaying 1 µl of total RNA on a NanoDrop8000 spectrophotometer (Thermo Fisher Scientific, Waltham, MA). Total RNA integrity was checked using an Agilent Technologies 2100 Bioanalyzer (Agilent Technologies) with the RNA integrity number value. Small RNA sequencing libraries were prepared according to the manufacturer’s instructions (Illumina Truseq small RNA Prep kit). The resulting cDNAs were purified using the Sage Science Pippin prep electrophoresis platform. The quality of libraries was verified by capillary electrophoresis (Bioanalyzer; Agilent Technologies). Sequencing of 1×51-bp paired-end reads was carried out on the Illumina HiSeq 2500 platform. We produced 13 Gb (255 million reads; 96.5% ≥Q30) of sRNA sequencing data. Adapter sequence and quality trimming were done using cutadapt (v1.1) (Martin 2011). The trimmed sequences were aligned against the G. chorda genome sequence at 100% identity, using bowtie2 (Langmead and Salzberg 2012). We kept only 1.6 Gb (74 million reads) of mapped sRNAs of lengths 18–30 nt. The mapped sRNA sequences were classified based on Rfam (v12.2) (Nawrocki et al. 2015) database using bowtie2 (Langmead and Salzberg 2012). We analyzed sRNA coverage on the TEs and methylation sites for each sRNA category using CoverageBED (Quinlan and Hall 2010). Supplementary Material Supplementary data are available at Molecular Biology and Evolution online. Author Contributions H.S.Y., S.M.B., E.C.Y., and J.-H.L. designed the genome sequencing project. H.S.Y. and D.B. conceived the research in collaboration with J.L. and J.H.Y., E.C.Y., G.H.B., K.M.K., H.-S.Y., and E.C.Y. prepared samples from field and culture. E.C.Y. and K.M.K. did the transcriptome library construction. J.L., Y.S., M.J., and S.J.L. did the genome assembly and gene prediction. J.L. led the genome analysis. J.L., S.J.L., L.G., H.Q., U.Z.Z., A.P.M.W., C.X.C., T.G.S., and E.C.Y. analyzed the gene ontology, gene duplications, repeats, gene families, and DNA methylation. J.L., D.B., and H.S.Y. wrote the paper in collaboration with the coauthors. H.S.Y. supervised the project. All authors read and approved the final manuscript. Acknowledgements We thank to Sung Bae Lee for kindly providing the Gracilariopsis chorda tissue from his sea farm that was used in this study. D.B., H.Q., and U.Z.Z. acknowledge the generous support from the School of Environmental and Biological Sciences at Rutgers University for the SEBS Genome Cooperative. C.X.C. and T.G.S. acknowledge support from the University of Queensland. This research was supported by the Collaborative Genome Program of the Korea Institute of Marine Science and Technology Promotion (KIMST) funded by the Ministry of Oceans and Fisheries (MOF) (No. 20140428), the National Research Foundation of Korea (NRF-2017R1A2B3001923), and the Next-generation BioGreen21 Program (PJ01389003) from the Rural Development Administration, Korea. The authors declare no competing financial interests. References Amborella Genome Consortium . 2013 . The complete nuclear genome of Amborella trichopoda: an evolutionary reference genome for the angiosperms . Science 342 : 6165 . Arabidopsis Genome Initiative. 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana . Nature 408 ( 6814 ): 796 – 815 . CrossRef Search ADS PubMed Arkhipova IR. 2005 . Mobile genetic elements and sexual reproduction . Cytogenet Genome Res . 110 ( 1–4 ): 372 – 382 . Google Scholar CrossRef Search ADS PubMed Banyushin BF. 2006 . DNA methylation in plants. In: Doerfler W , Bohm P , editors. DNA methylation: basic mechanisms . Berlin (Germany ): Springer-Verlag Press . p. 67 – 122 . Google Scholar CrossRef Search ADS Bao Z , Eddy SR. 2002 . Automated de novo identification of repeat sequence families in sequenced genomes . Genome Res . 12 ( 8 ): 1269 – 1276 . Google Scholar CrossRef Search ADS PubMed Bennetzen JL , Wang H. 2014 . The contributions of transposable elements to the structure, function, and evolution of plant genomes . Annu Rev Plant Biol . 65 : 505 – 530 . Google Scholar CrossRef Search ADS PubMed Benson G. 1999 . Tandem repeats finder: a program to analyze DNA sequences . Nucleic Acids Res . 27 ( 2 ): 573 – 580 . Google Scholar CrossRef Search ADS PubMed Bewick AJ , Niederhuth CE , Ji L , Rohr NA , Griffin PT , Leebens-Mack J , Schmitz RJ. 2017 . The evolution of CHROMOMETHYLASES and gene body DNA methylation in plants . Genom Biol . 18 ( 1 ): 65. Google Scholar CrossRef Search ADS Bewick AJ , Schmitz RJ. 2017 . Gene body DNA methylation in plants . Curr Opin Plant Biol . 36 : 103 – 110 . Google Scholar CrossRef Search ADS PubMed Bhattacharya D , Price DC , Chan CX , Qiu H , Rose N , Ball S , Weber APM , Arias MC , Henrissat B , Coutinho PM , et al. 2013 . Genome of the red alga Porphyridium purpureum . Nat Commun . 4 : 1941. Google Scholar CrossRef Search ADS PubMed Bhattacharya D , Yoon HS , Hackett JD. 2004 . Photosynthetic eukaryotes unite: endosymbiosis connects the dots . BioEssays 26 ( 1 ): 50 – 60 . Google Scholar CrossRef Search ADS PubMed Bhaya D , Dufresne A , Vaulot D , Grossman A. 2002 . Analysis of the hli gene family in marine and freshwater cyanobacteria . FEMS Microbiol Lett . 215 ( 2 ): 209 – 219 . Google Scholar CrossRef Search ADS PubMed Birney E , Clamp M , Durbin R. 2004 . GeneWise and genomewise . Genome Res . 14 ( 5 ): 988 – 995 . Google Scholar CrossRef Search ADS PubMed Blanc G , Wolfe KH. 2004 . Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution . Plant Cell 16 ( 7 ): 1679 – 1691 . Google Scholar CrossRef Search ADS PubMed Boetzer M , Henkel CV , Jansen HJ , Butler D , Pirovano W. 2011 . Scaffolding pre-assembled contigs using SSPACE . Bioinformatics 27 ( 4 ): 578 – 579 . Google Scholar CrossRef Search ADS PubMed Brawley SH , Blouin NA , Ficko-Blean E , Wheeler GL , Lohr M , Goodson HV , Jenkins JW , Blaby-Haas CE , Helliwell KE , Chan CX , et al. 2017 . Insights into the red algae and eukaryotic evolution from the genome of Porphyra umbilicalis (Bangiophyceae, Rhodophyta) . Proc Natl Acad Sci U S A . 114 ( 31 ): E6361 – E6370 . Google Scholar CrossRef Search ADS PubMed Brennecke J , Malone CD , Aravin AA , Sachidanandam R , Stark A , Hannon GJ. 2008 . An epigenetic role for maternally inherited piRNAs in transposon silencing . Science 322 ( 5906 ): 1387 – 1392 . Google Scholar CrossRef Search ADS PubMed Capuano F , Mülleder M , Kok R , Blom HJ , Ralser M. 2014 . Cytosine DNA methylation is found in Drosophila melanogaster but absent in Saccharomyces cerevisiae, Schizosaccharomyces pombe, and other yeast species . Anal Chem . 86 ( 8 ): 3697 – 3702 . Google Scholar CrossRef Search ADS PubMed Carlberg C , Molnár F. 2014 . Mechanisms of gene regulation. Dordrecht (The Netherlands ): Springer Science+Business Media . Google Scholar CrossRef Search ADS Chin CS , Alexander DH , Marks P , Klammer AA , Drake J , Heiner C , Clum A , Copeland A , Huddleston J , Eichler EE , et al. 2013 . Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data . Nat Methods . 10 ( 6 ): 563 – 569 . Google Scholar CrossRef Search ADS PubMed Cole KM. 1990 . Chromosomes. In: Cole KM , Sheath RG , editors. Biology of the red algae . Cambridge : Cambridge University Press . p. 73 – 102 . Collén J , Porcel B , Carré W , Ball SG , Chaparro C , Tonon T , Barbeyron T , Michel G , Noel B , Valentin K et al. , 2013 . Genome structure and metabolic features in the red seaweed Chondrus crispus shed light on evolution of the Archaeplastida . Proc Natl Acad Sci U S A . 110 ( 13 ): 5247 – 5252 . Google Scholar CrossRef Search ADS PubMed Csűös M. 2010 . Count: evolutionary analysis of phylogenetic profiles with parsimony and likelihood . Bioinformatics 26 ( 15 ): 1910 – 1912 . Google Scholar CrossRef Search ADS PubMed Deaton AM , Bird A. 2011 . CpG islands and the regulation of transcription . Genes Dev . 25 ( 10 ): 1010 – 1022 . Google Scholar CrossRef Search ADS PubMed Derelle E , Ferraz C , Rombauts S , Rouzé P , Worden AZ , Robbens S , Partensky F , Degroeve S , Echeynié S , Cooke R et al. , 2006 . Genome analysis of the smallest free-living eukaryote Ostreococcus tauri unveils many unique features . Proc Natl Acad Sci U S A . 103 ( 31 ): 11647 – 11652 . Google Scholar CrossRef Search ADS PubMed Eddy SR. 2012 . The C-value paradox, junk DNA and ENCDE . Curr Biol . 22 ( 21 ): R898 – R899 . Google Scholar CrossRef Search ADS PubMed Emms D , Kelly S. 2015 . OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy . Genome Biol . 16 : 157. Google Scholar CrossRef Search ADS PubMed Feng S , Cokus SJ , Zhang X , Chen P-Y , Bostick M , Goll MG , Hetzel J , Jain J , Strauss SH , Halpern ME et al. , 2010 . Conservation and divergence of methylation patterning in plants and animals . Proc Natl Acad Sci U S A . 107 ( 19 ): 8689 – 8694 . Google Scholar CrossRef Search ADS PubMed Flouri T , Izquierdo-Carrasco F , Darriba D , Aberer AJ , Nguyen LT , Minh BQ , Von Haeseler A , Stamatakis A. 2015 . The phylogenetic likelihood library . Syst Biol . 64 ( 2 ): 356 – 362 . Google Scholar CrossRef Search ADS PubMed Funk C , Vermaas W. 1999 . A cyanobacterial gene family coding for single-helix proteins resembling part of the light-harvesting proteins from higher plants . Biochemistry 38 ( 29 ): 9397 – 9404 . Google Scholar CrossRef Search ADS PubMed Galindo-González L , Mhiri C , Deyholos MK , Grandbastien MA. 2017 . LTR-retrotransposons in plants: engines of evolution . Gene 626 : 14 – 25 . Google Scholar CrossRef Search ADS PubMed Garcia S , Leitch IJ , Anadon-Rosell A , Canela MÁ , Gálvez F , Garnatje T , Gras A , Hidalgo O , Johnston E , Mas de Xaxars G et al. , 2014 . Recent updates and developments to plant genome size databases . Nucleic Acids Res. 42 ( D1 ): D1159 – D1166 . Google Scholar CrossRef Search ADS PubMed Garcia-Aguilar M , Michaud C , Leblanc O , Grimanelli D. 2010 . Inactivation of a DNA methylation pathway in maize reproductive organs results in apomixis-like phenotypes . Plant Cell 22 ( 10 ): 3249 – 3267 . Google Scholar CrossRef Search ADS PubMed Glasauer SMK , Neuhauss SCF. 2014 . Whole-genome duplication in teleost fishes and its evolutionary consequences . Mol Genet Genomics . 289 ( 6 ): 1045 – 1060 . Google Scholar CrossRef Search ADS PubMed Goff LJ , Coleman AW. 1986 . A novel pattern of apical cell polyploidy, sequential polyploidy reduction and intercellular nuclear transfer in the red alga Polysiphonia . Am J Bot . 73 ( 8 ): 1109 – 1130 . Google Scholar CrossRef Search ADS Goff LJ , Coleman AW. 1990 . DNA: micro-spectrohotometric studies. In: Cole KM , Sheath RG , editors. Biology of the red algae . Cambridge : Cambridge University Press . p. 43 – 72 . Goodier JL. 2016 . Restricting retrotransposons: a review . Mob DNA . 7 : 16. Google Scholar CrossRef Search ADS PubMed Greilhuber J , Doležel J , Lysák M , Bennett MD. 2005 . The origin, evolution and proposed stabilization of the terms ‘genome size’ and ‘C-value’ to describe nuclear DNA contents . Ann Bot . 95 ( 1 ): 255 – 260 . Google Scholar CrossRef Search ADS PubMed Haas BJ , Papanicolaou A , Yassour M , Grabherr M , Blood PD , Bowden J , Couger MB , Eccles D , Lieber M , MacManes MD et al. , 2013 . De novo transcript sequence reconstruction from RNA-seq: reference generation and analysis with Trinity . Nat Protoc . 8 ( 8 ): 1494 – 1512 . Google Scholar CrossRef Search ADS PubMed Haas BJ , Salzberg SL , Zhu W , Pertea M , Allen JE , Orvis J , White O , Buell CR , Wortman JR. 2008 . Automated eukaryotic gene structure annotation using EVidenceModeler and the program to assemble spliced alignments . Genome Biol . 9 ( 1 ): R7. Google Scholar CrossRef Search ADS PubMed Hahn MW , Wray GA. 2002 . The g-value paradox . Evol Dev . 4 ( 2 ): 73 – 75 . Google Scholar CrossRef Search ADS PubMed Hawkins JS , Kim HR , Nason JD , Wing RA , Wendel JF. 2006 . Differential lineage-specific amplification of transposable elements is responsible for genome size variation in Gossypium . Genome Res . 16 ( 10 ): 1252 – 1261 . Google Scholar CrossRef Search ADS PubMed Hidalgo O , Pellicer J , Christenhusz M , Schneider H , Leitch AR , Leitch IJ. 2017 . Is there an upper limit to genome size? Trends Plant Sci . 22 ( 7 ): 567 – 573 . Google Scholar CrossRef Search ADS PubMed Hirochika H , Okamoto H , Kakutani T. 2000 . Silencing of retrotransposons in Arabidopsis and reactivation by the ddm1 mutation . Plant Cell 12 ( 3 ): 357 – 368 . Google Scholar CrossRef Search ADS PubMed Hoff KJ , Lange S , Lomsadze A , Borodovsky M , Stanke M. 2016 . BRAKER1: unsupervised RNA-seq-based genome annotation with GeneMark-ET and AUGUSTUS . Bioinformatics 32 ( 5 ): 767 – 769 . Google Scholar CrossRef Search ADS PubMed Hori K , Maruyama F , Fujisawa T , Togashi T , Yamamoto N , Seo M , Sato S , Yamada T , Mori H , Yajima N et al. , 2014 . Klebsormidium flaccidum genome reveals primary factors for plant terrestrial adaptation . Nat Commun . 5 : 3978 . Google Scholar CrossRef Search ADS PubMed Hua-Van A , Le Rouzic A , Boutin TS , Filée J , Capy P. 2011 . The struggle for life of the genome’s selfish architects . Biol Direct . 6 ( 1 ): 19. Google Scholar CrossRef Search ADS PubMed Huerta-Cepas J , Szklarczyk D , Forslund K , Cook H , Heller D , Walter MC , Rattei T , Mende DR , Sunagawa S , Kuhn M et al. , 2016 . eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences . Nucleic Acids Res. 44 ( D1 ): D286 – D293 . Google Scholar CrossRef Search ADS PubMed Huff JT , Zilberman D. 2014 . Dnmt1-independent CG methylation contributes to nucleosome positioning in diverse eukaryotes . Cell 156 ( 6 ): 1286 – 1297 . Google Scholar CrossRef Search ADS PubMed Hutvagner G , Simard MJ. 2008 . Argonaute proteins: key players in RNA silencing . Nat Rev Mol Cell Biol . 9 ( 1 ): 22 – 32 . Google Scholar CrossRef Search ADS PubMed Illingworth RS , Bird AP. 2009 . CpG islands—‘a rough guide’ . FEBS Lett . 583 ( 11 ): 1713 – 1720 . Google Scholar CrossRef Search ADS PubMed Innan H , Kondrashov F. 2010 . The evolution of gene duplications: classifying and distinguishing between models . Nat Rev Genet . 11 ( 2 ): 97 – 108 . Google Scholar CrossRef Search ADS PubMed Jantaro S , Ali Q , Lone S , He Q. 2006 . Suppression of the lethality of high light to a quadruple HLI mutant by the inactivation of the regulatory protein PfsR in Synechocystis PCC 6803 . J Biol Chem . 281 ( 41 ): 30865 – 30874 . Google Scholar CrossRef Search ADS PubMed Ji L , Chen X. 2012 . Regulation of small RNA stability: methylation and beyond . Cell Res . 22 ( 4 ): 624 – 636 . Google Scholar CrossRef Search ADS PubMed Kapraun DF. 2005 . Nuclear DNA content estimates in multicellular green, red and brown algae: phylogenetic considerations . Ann Bot . 95 ( 1 ): 7 – 44 . Google Scholar CrossRef Search ADS PubMed Keeling PJ. 2010 . The endosymbiotic origin, diversification and fate of plastids . Philos Trans R Soc B . 365 ( 1541 ): 729 – 748 . Google Scholar CrossRef Search ADS Kim D , Pertea G , Trapnell C , Pimentel H , Kelley R , Salzberg SL. 2013 . TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions . Genome Biol. 14 ( 4 ): R36. Google Scholar CrossRef Search ADS PubMed Kimura M. 1980 . A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences . J Mol Evol . 16 ( 2 ): 111 – 120 . Google Scholar CrossRef Search ADS PubMed Kong F , Mao Y , Yang H , Wang L , Liu L. 2012 . Cloning and characterization of the HLIP gene encoding high light-inducible protein from Porphyra yezoensis . J Appl Phycol . 24 ( 4 ): 685 – 692 . Google Scholar CrossRef Search ADS Krueger F , Andrews SR. 2011 . Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications . Bioinformatics 27 ( 11 ): 1571 – 1572 . Google Scholar CrossRef Search ADS PubMed Langmead B , Salzberg SL. 2012 . Fast gapped-read alignment with Bowtie 2 . Nat Methods . 9 ( 4 ): 357 – 359 . Google Scholar CrossRef Search ADS PubMed Law JA , Jacobsen SE. 2010 . Establishing, maintaining and modifying DNA methylation patterns in plants and animals . Nat Rev Genet . 11 ( 3 ): 204 – 220 . Google Scholar CrossRef Search ADS PubMed Lee JM , Cho CH , Park SI , Choi JW , Song HS , West JA , Bhattacharya D , Yoon HS. 2016 . Parallel evolution of highly conserved plastid genome architecture in red seaweeds and seed plants . BMC Biol . 14 ( 1 ): 75. Google Scholar CrossRef Search ADS PubMed Li H. 2012 . Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly . Bioinformatics 28 ( 14 ): 1838 – 1844 . Google Scholar CrossRef Search ADS PubMed Li SF , Zhang GJ , Yuan JH , Deng CL , Gao WJ. 2016 . Repetitive sequences and epigenetic modification: inseparable partners play important roles in the evolution of plant sex chromosomes . Planta 243 ( 5 ): 1083 – 1095 . Google Scholar CrossRef Search ADS PubMed Li Y , Tollefsbol TO. 2011 . DNA methylation detection: bisulfite genomic sequencing analysis . Methods Mol Biol . 791 : 11 – 21 . Google Scholar CrossRef Search ADS PubMed Liu Q , Feng Y , Zhu Z. 2009 . Dicer-like (DCL) proteins in plants . Funct Integr Genomics . 9 ( 3 ): 277 – 286 . Google Scholar CrossRef Search ADS PubMed Lomsadze A , Burns PD , Borodovsky M. 2014 . Integration of mapped RNA-seq reads into automatic training of eukaryotic gene finding algorithm . Nucleic Acids Res . 42 ( 15 ): e119. Google Scholar CrossRef Search ADS PubMed Lynch M , Ackerman MS , Gout JF , Long H , Sung W , Thomas WK , Foster PL. 2016 . Genetic drift, selection and the evolution of the mutation rate . Nat Rev Genet . 17 ( 11 ): 704 – 714 . Google Scholar CrossRef Search ADS PubMed Marçais G , Kingsford C. 2011 . A fast, lock-free approach for efficient parallel counting of occurrences of k-mers . Bioinformatics 27 ( 6 ): 764 – 770 . Google Scholar CrossRef Search ADS PubMed Marchler-Bauer A , Bo Y , Han L , He J , Lanczycki CJ , Lu S , Chitsaz F , Derbyshire MK , Geer RC , Gonzales NR et al. , 2017 . CDD/SPARCLE: functional classification of proteins via subfamily domain architectures . Nucleic Acids Res. 45 ( D1 ): D200 – D203 . Google Scholar CrossRef Search ADS PubMed Marí-Ordóñez A , Marchais A , Etcheverry M , Martin A , Colot V , Voinnet O. 2013 . Reconstructing de novo silencing of an active plant retrotransposon . Nat Genet . 45 ( 9 ): 1029 – 1039 . Google Scholar CrossRef Search ADS PubMed Martienssen RA , Colot V. 2001 . DNA methylation and epigenetic inheritance in plants and filamentous fungi . Science 293 ( 5532 ): 1070 – 1074 . Google Scholar CrossRef Search ADS PubMed Martin M. 2011 . Cutadapt removes adapter sequences from high-throughput sequencing reads . EMBnet.Journal 17 ( 1 ): 10 – 12 . Google Scholar CrossRef Search ADS Matsuzaki M , Misumi O , Shin-i T , Maruyama S , Takahara M , Miyagishima S , Mori T , Nishida K , Yagisawa F , Nishida K et al. , 2004 . Genome sequence of the ultrasmall unicellular red alga Cyanidioschyzon merolae 10D . Nature 428 ( 6983 ): 653 – 657 . Google Scholar CrossRef Search ADS PubMed Matzke M , Kanno T , Daxinger L , Huettel B , Matzke AJM. 2009 . RNA-mediated chromatin-based silencing in plants . Curr Opin Cell Biol . 21 ( 3 ): 367 – 376 . Google Scholar CrossRef Search ADS PubMed Matzke MA , Kanno T , Matzke AJM. 2015 . RNA-directed DNA methylation: the evolution of a complex epigenetic pathway in flowering plants . Annu Rev Plant Biol . 66 : 243 – 267 . Google Scholar CrossRef Search ADS PubMed Matzke MA , Mosher RA. 2014 . RNA-directed DNA methylation: an epigenetic pathway of increasing complexity . Net Rev Genet . 15 ( 6 ): 394 – 408 . Google Scholar CrossRef Search ADS McClintock B. 1950 . The origin and behavior of mutable loci in maize . Proc Natl Acad Sci U S A . 36 ( 6 ): 344 – 355 . Google Scholar CrossRef Search ADS PubMed McClintock B. 1984 . The significance of responses of the genome to challenge . Science 226 ( 4676 ): 792 – 801 . Google Scholar CrossRef Search ADS PubMed Melnyk CW , Molnar A , Bassett A , Baulcombe DC. 2011 . Mobile 24 nt small RNAs direct transcriptional gene silencing in the root meristems of Arabidopsis thaliana . Curr Biol. 21 ( 19 ): 1678 – 1683 . Google Scholar CrossRef Search ADS PubMed Merchant SS , Prochnik SE , Vallon O , Harris EH , Karpowicz SJ , Witman GB , Terry A , Salamov A , Fritz-Laylin LK , Maréchal-Drouard L et al. , 2007 . The Chlamydomonas genome reveals the evolution of key animal and plant functions . Science 318 ( 5848 ): 245 – 250 . Google Scholar CrossRef Search ADS PubMed Michael TP. 2014 . Plant genome size variation: bloating and purging DNA . Brief Funct Genomics . 13 ( 4 ): 308 – 317 . Google Scholar CrossRef Search ADS PubMed Minh BQ , Nguyen MAT , Von Haeseler A. 2013 . Ultrafast approximation for phylogenetic bootstrap . Mol Biol Evol . 30 ( 5 ): 1188 – 1195 . Google Scholar CrossRef Search ADS PubMed Mohn F , Weber M , Rebhan M , Roloff TC , Richter J , Stadler MB , Bibel M , Schübeler D. 2008 . Linage-specific polycomb targets and de novo DNA methylation define restriction and potential of neuronal progenitors . Mol Cell . 30 ( 6 ): 755 – 766 . Google Scholar CrossRef Search ADS PubMed Montané MH , Petzold B , Kloppstech K. 1999 . Formation of early-light-inducible-protein complexes and status of xanthophyll levels under high light and cold stress in barley (Hordeum vulgare L.) . Planta 208 ( 4 ): 519 – 527 . Google Scholar CrossRef Search ADS Nawrocki EP , Burge SW , Bateman A , Daub J , Eberhardt RY , Eddy SR , Floden EW , Gardner PP , Jones TA , Tate J et al. , 2015 . Rfam 12.0: updates to the RNA families database . Nucleic Acids Res. 43 ( Database issue ): D130 – D137 . Google Scholar CrossRef Search ADS PubMed Nguyen LT , Schmidt HA , Von Haeseler A , Minh BQ. 2015 . IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies . Mol Biol Evol . 32 ( 1 ): 268 – 274 . Google Scholar CrossRef Search ADS PubMed Okamoto H , Hirochika H. 2001 . Silencing of transposable elements in plants . Trends Plant Sci . 6 ( 11 ): 527 – 534 . Google Scholar CrossRef Search ADS PubMed Olmedo-Monfil V , Durán-Figueroa N , Arteaga-Vázquez M , Demesa-Arévalo E , Autran D , Grimanelli D , Slotkin RK , Martienssen RA , Vielle-Calzada J-P. 2010 . Control of female gamete formation by a small RNA pathway in Arabidopsis . Nature 464 ( 7288 ): 628 – 632 . Google Scholar CrossRef Search ADS PubMed Ou X , Zhang Y , Xu C , Lin X , Zang Q , Zhuang T , Jiang L , von Wettstein D , Liu B. 2012 . Transgenerational inheritance of modified DNA methylation patterns and enhanced tolerance induced by heavy metal stress in rice (Oryza sativa L.) . PLoS One 7 ( 9 ): e41143. Google Scholar CrossRef Search ADS PubMed Payer B , Lee JT. 2008 . X chromosome dosage compensation: how mammals keep the balance . Annu Rev Genet . 42 : 733 – 772 . Google Scholar CrossRef Search ADS PubMed Pellicer J , Fay MF , Leitch IJ. 2010 . The largest eukaryotic genome of them all? Bot J Linn Soc . 164 ( 1 ): 10 – 15 . Google Scholar CrossRef Search ADS Piegu B , Guyot R , Picault N , Roulin A , Sanyal A , Saniyal A , Kim H , Collura K , Brar DS , Jackson S et al. , 2006 . Doubling genome size without polyploidization: dynamics of retrotransposition-driven genomic expansions in Oryza australiensis, a wild relative of rice . Genome Res . 16 ( 10 ): 1262 – 1269 . Google Scholar CrossRef Search ADS PubMed Popova OV , Dinh HQ , Aufsatz W , Jonak C. 2013 . The RdDM pathway is required for basal heat tolerance in Arabidopsis . Mol Plant. 6 ( 2 ): 396 – 410 . Google Scholar CrossRef Search ADS PubMed Price AL , Jones NC , Pevzner PA. 2005 . De novo identification of repeat families in large genomes . Bioinformatics 21 ( Suppl 1 ): i351 – i358 . Google Scholar CrossRef Search ADS PubMed Price DC , Chan CX , Yoon HS , Yang EC , Qiu H , Weber APM , Schwacke R , Gross J , Blouin NA , Lane C et al. , 2012 . Cyanophora paradoxa genome elucidates origin of photosynthesis in algae and plants . Science 335 ( 6070 ): 843 – 847 . Google Scholar CrossRef Search ADS PubMed Qiu H , Lee J , Yoon HS , Bhattacharya D. 2017 . Hypothesis: gene-rich plastid genomes in red algae may be an outcome of nuclear genome reduction . J Phycol . 53 ( 3 ): 715 – 719 . Google Scholar CrossRef Search ADS PubMed Qiu H , Price DC , Yang EC , Yoon HS , Bhattacharya D. 2015 . Evidence of ancient genome reduction in red algae (Rhodophyta) . J Phycol . 51 ( 4 ): 624 – 636 . Google Scholar CrossRef Search ADS PubMed Quinlan AR , Hall IM. 2010 . BEDTools: a flexible suite of utilities for comparing genomic features . Bioinformatics 26 ( 6 ): 841 – 842 . Google Scholar CrossRef Search ADS PubMed Rensing SA. 2014 . Gene duplication as a driver of plant morphogenetic evolution . Curr Opin Plant Biol . 17 : 43 – 48 . Google Scholar CrossRef Search ADS PubMed Rensing SA , Lang D , Zimmer AD , Terry A , Salamov A , Shapiro H , Nishiyama T , Perroud P-F , Lindquist EA , Kamisugi Y et al. , 2008 . The Physcomitrella genome reveals evolutionary insights into the conquest of land by plants . Science 319 ( 5859 ): 64 – 69 . Google Scholar CrossRef Search ADS PubMed Saito K , Siomi MC. 2010 . Small RNA-mediated quiescence of transposable elements in animals . Dev Cell . 19 ( 5 ): 687 – 697 . Google Scholar CrossRef Search ADS PubMed Saze H , Scheid OR , Paszkowski J. 2003 . Maintenance of CpG methylation is essential for epigenetic inheritance during plant gametogenesis . Nat Genet . 34 ( 1 ): 65 – 69 . Google Scholar CrossRef Search ADS PubMed Saze H , Tsugane K , Kanno T , Nishimura T. 2012 . DNA methylation in plants: relationship to small RNAs and histone modifications, and functions in transposon inactivation . Plant Cell Physiol . 53 ( 5 ): 766 – 784 . Google Scholar CrossRef Search ADS PubMed Schönknecht G , Chen W , Ternes CM , Barbier GG , Shrestha RP , Stanke M , Bräutigam A , Baker BJ , Banfield JF , Garavito RM et al. , 2013 . Gene transfer from bacteria and archaea facilitated evolution of an extremophilic eukaryote . Science 339 ( 6124 ): 1207 – 1210 . Google Scholar CrossRef Search ADS PubMed Shahmuradov IA , Umarov RK , Solovyev VV. 2017 . TSSPlant: a new tool for prediction of plant Pol II promoters . Nucleic Acids Res . 45 ( 8 ): e65. Google Scholar PubMed Simão FA , Waterhouse RM , Ioannidis P , Kriventseva EV , Zdobnov EM. 2015 . BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs . Bioinformatics 31 ( 19 ): 3210 – 3212 . Google Scholar CrossRef Search ADS PubMed Singh M , Goel S , Meeley RB , Dantec C , Parrinello H , Michaud C , Leblanc O , Grimanelli D. 2011 . Production of viable gametes without meiosis in maize deficient for an ARGONAUTE protein . Plant Cell 23 ( 2 ): 443 – 458 . Google Scholar CrossRef Search ADS PubMed Slotkin RK , Martienssen R. 2007 . Transposable elements and the epigenetic regulation of the genome . Nat Rev Genet . 8 ( 4 ): 272 – 285 . Google Scholar CrossRef Search ADS PubMed Slotkin RK , Vaughn M , Borges F , Tanurdžić M , Becker JD , Feijó JA , Martienssen R. 2009 . Epigenetic reprogramming and small RNA silencing of transposable elements in pollen . Cell 136 ( 3 ): 461 – 472 . Google Scholar CrossRef Search ADS PubMed Stanke M , Keller O , Gunduz I , Hayes A , Waack S , Morgenstern B. 2006 . AUGUSTUS: ab initio prediction of alternative transcripts . Nucleic Acids Res. 34 ( Web Server ): W435 – W439 . Google Scholar CrossRef Search ADS PubMed Steinemann S , Steinemann M. 2005 . Retroelements: tools for sex chromosome evolution . Cytogenet Genome Res . 110 ( 1–4 ): 134 – 143 . Google Scholar CrossRef Search ADS PubMed Sung W , Ackerman MS , Miller SF , Doak TG , Lynch M. 2012 . Drift-barrier hypothesis and mutation-rate evolution . Proc Natl Acad Sci U S A . 109 ( 45 ): 18488 – 18492 . Google Scholar CrossRef Search ADS PubMed Takai D , Jones PA. 2002 . Comprehensive analysis of CpG islands in human chromosomes 21 and 22 . Proc Natl Acad Sci U S A . 99 ( 6 ): 3740 – 3745 . Google Scholar CrossRef Search ADS PubMed Teramoto H , Itoh T , Ono TA. 2004 . High-intensity-light-dependent and transient expression of new genes encoding distant relatives of light-harvesting chlorophyll-a/b proteins in Chlamydomonas reinhardtii . Plant Cell Physiol . 45 ( 9 ): 1221 – 1232 . Google Scholar CrossRef Search ADS PubMed Tippens ND , Vihervaara A , Lis JT. 2018 . Enhancer transcription: what, where, when, and why? Genes Dev . 32 ( 1 ): 1 . Google Scholar CrossRef Search ADS PubMed Tricker PJ , Gibbings JG , Rodríguez López CM , Hadley P , Wilkinson MJ. 2012 . Low relative humidity triggers RNA-directed de novo DNA methylation and suppression of genes controlling stomatal development . J Exp Bot . 63 ( 10 ): 3799 – 3813 . Google Scholar CrossRef Search ADS PubMed Vaucheret H , Fagard M. 2001 . Transcriptional gene silencing in plants: targets, inducers and regulators . Trends Genet . 17 ( 1 ): 29 – 35 . Google Scholar CrossRef Search ADS PubMed Verbsky ML , Richards EJ. 2001 . Chromatin remodeling in plants . Curr Opin Plant Biol . 4 : 494 – 500 . Google Scholar CrossRef Search ADS PubMed Worden AZ , Lee J-H , Mock T , Rouzé P , Simmons MP , Aerts AL , Allen AE , Cuvelier ML , Derelle E , Everett MV et al. , 2009 . Green evolution and dynamic adaptations revealed by genomes of the marine picoeukaryotes Micromonas . Science 324 ( 5924 ): 268 – 272 . Google Scholar CrossRef Search ADS PubMed Wright S , Finnegan D. 2001 . Genome evolution: sex and the transposable element . Curr Biol . 11 ( 8 ): R296 – R299 . Google Scholar CrossRef Search ADS PubMed Xiao W , Custard KD , Brown RC , Lemmon BE , Harada JJ , Goldberg RB , Fischer RL. 2006 . DNA methylation is critical for Arabidopsis embryogenesis and seed viability . Plant Cell 18 ( 4 ): 805 – 814 . Google Scholar CrossRef Search ADS PubMed Yamada KD , Tomii K , Katoh K. 2016 . Application of the MAFFT sequence alignment program to large data—reexamination of the usefulness of chained guide trees . Bioinformatics 32 ( 21 ): 3246 – 3251 . Google Scholar CrossRef Search ADS PubMed Yang EC , Boo SM , Bhattacharya D , Saunders GW , Knoll AH , Fredericq S , Graf L , Yoon HS. 2016 . Divergence time estimates and the evolution of major lineages in the florideophyte red algae . Sci Rep . 6 : 21361. Google Scholar CrossRef Search ADS PubMed Zemach A , McDaniel IE , Silva P , Zilberman D. 2010 . Genome-wide evolutionary analysis of eukaryotic DNA methylation . Science 328 ( 5980 ): 916 – 919 . Google Scholar CrossRef Search ADS PubMed Zeyl C , Bell G. 1996 . Symbiotic DNA in eukaryotic genomes . Trends Ecol Evol . 11 ( 1 ): 10 – 15 . Google Scholar CrossRef Search ADS PubMed Zhang J. 2003 . Evolution by gene duplication: an update . Trends Ecol Evol . 18 ( 6 ): 292 – 298 . Google Scholar CrossRef Search ADS Zilberman D , Gehring M , Tran RK , Ballinger T , Henikoff S. 2007 . Genome-wide analysis of Arabidopsis thaliana DNA methylation uncovers an interdependence between methylation and transcription . Nat Genet . 39 ( 1 ): 61 – 69 . Google Scholar CrossRef Search ADS PubMed © The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Journal

Molecular Biology and EvolutionOxford University Press

Published: Apr 23, 2018

There are no references for this article.

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off