Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Revealing the novel complexity of plant long non-coding RNA by strand-specific and whole transcriptome sequencing for evolutionarily representative plant species

Revealing the novel complexity of plant long non-coding RNA by strand-specific and whole... Background: Previous studies on plant long noncoding RNAs (lncRNAs) lacked consistency and suffered from many factors like heterogeneous data sources and experimental protocols, different plant tissues, inconsistent bioinformat - ics pipelines, etc. For example, the sequencing of RNAs with poly(A) tails excluded a large portion of lncRNAs without poly(A), and use of regular RNA-sequencing technique did not distinguish transcripts’ direction for lncRNAs. The current study was designed to systematically discover and analyze lncRNAs across eight evolutionarily representative plant species, using strand-specific (directional) and whole transcriptome sequencing (RiboMinus) technique. Results: A total of 39,945 lncRNAs (25,350 lincRNAs and 14,595 lncNATs) were identified, which showed molecular features of lncRNAs that are consistent across divergent plant species but different from those of mRNA. Further, transposable elements ( TEs) were found to play key roles in the origination of lncRNA, as significantly large number of lncRNAs were found to contain TEs in gene body and promoter region, and transcription of many lncRNAs was driven by TE promoters. The lncRNA sequences were divergent even in closely related species, and most plant lncRNAs were genus/species-specific, amid rapid turnover in evolution. Evaluated with PhastCons scores, plant lncRNAs showed similar conservation level to that of intergenic sequences, suggesting that most lincRNAs were young and with short evolutionary age. INDUCED BY PHOSPHATE STARVATION (IPS) was found so far to be the only plant lncRNA group with conserved motifs, which may play important roles in the adaptation of terrestrial life during migration from aquatic to terrestrial. Most highly and specially expressed lncRNAs formed co-expression network with coding genes, and their functions were believed to be closely related to their co-expression genes. Yan Zhu and Longxian Chen have contributed equally to this work. *Correspondence: lixuan@cemps.ac.cn Key Laboratory of Synthetic Biology, Center for Excellence in Molecular Plant Sciences/Institute of Plant Physiology and Ecology, Chinese Academy of Sciences, Shanghai, China Full list of author information is available at the end of the article © The Author(s) 2022. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/. The Creative Commons Public Domain Dedication waiver (http:// creat iveco mmons. org/ publi cdoma in/ zero/1. 0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. Zhu et al. BMC Genomics (2022) 23:381 Page 2 of 15 Conclusion: The study revealed novel features and complexity of lncRNAs in plants through systematic analysis, providing important insights into the origination and evolution of plant lncRNAs. Keywords: lncRNA, lincRNA, lncNAT, Plant, Strand-specific Background lncRNAs, more research is needed to understand the Long noncoding RNAs (lncRNAs) were found to account roles of TEs in the origination of lncRNAs in plant. for a large part of the whole transcriptome in plants. LncRNAs had a fast evolutionary rate and a high Most of lncRNAs identified in plants were lincRNAs, degree of sequence diversity in animals [18–20]. Some which were transcribed from intergenic regions. Another ancient lncRNAs were highly conserved in tetrapod, and lncRNAs transcribed from intragenic regions were lnc- they were preserved in evolution and may have impor- NATs, which were antisense to coding genes. Both types tant functions [20]. Plant lncRNAs were highly divergent of lncRNAs have been identified in many plant species, at the nucleotide level [21]. LncRNAs showed highly among which a few were found to play important roles in divergency on sequences between rice and maize, and plant development and stress resistance [1–4]. Benefiting between Brassicaceae and Cleomaceae [22, 23]. Besides, from the high-throughput sequencing technology, bulks molecular features of plant lncRNAs did not follow the of RNA-sequencing data were deposited to public data- classic evolutionary patterns, and they showed little evi- bases. Databases of plant lncRNAs have been developed dence of phylogenetic relationships [24]. A comprehen- in the past several years, such as PLncDB, CANTATAdb, sive design is needed to systematically study the plant GreeNC, NONCODE, et al. [5–8]. More than one million lncRNA evolution and feature divergency. lncRNAs were recorded in these databases, which cov- The current study was designed to comprehensively ered more than 80 plant species. However, a systematic discover and analyze lncRNAs in multiple plant species analysis on plant lncRNA evolution has not been con- across the evolutionary landscape using strand-specific ducted due to many difficulties. The heterogeneous data and whole transcriptome sequencing data. We employed sources and protocols, like inconsistent plant tissues, eight plant species from the primitive to higher taxa in ways of sequencing library construction, pipelines for plant, which were representatives at key position of plant bioinformatics analysis, were cited as some of the factors evolution. Using RiboMinus and strand-specific RNA- [9–12]. Most of the previous studies sequenced RNAs sequencing techniques in combination with consistent with poly(A) tails by oligo(dT) selection, which would plant tissues, we have expanded the lncRNA landscape, exclude a large portion of lncRNAs without poly(A) [13, including long noncoding natural antisense RNAs (lnc- 14]. In addition, the regular RNA-sequencing technique NATs) and lncRNAs without poly(A). Our results showed does not distinguish transcripts’ direction, which would that both lincRNAs and lncNATs were widely distributed lack a large part of lncRNAs antisense to coding genes. in plants. TEs played important roles in the origination Therefore, to comprehensively analyze the lncRNAs in and transcription of lncRNAs. The molecular features divergent plant species and their evolution requires 1) of lncRNAs were found to be conserved in plants, but unified study design with consistent plant sources, and sequences were non-conserved even between evolution- 2) the whole transcriptome RNA-sequencing technology arily close species. Most highly and specially expressing with strand-specificity. lncRNAs participated in the co-expressional network To date, important question about how plants lncRNAs with coding genes, which may function as regulators in originated and evolved remains largely un-answered. the network. Several ways for the origin of lncRNAs have been docu- mented, among which transposable elements (TEs) are Results an important cause. TE was thought as an important Identification of plant lncRNAs with new lncNAT species factor to drive the transcription of lncRNAs [15, 16]. from strand‑specific RiboMinus transcriptome sequencing In humans, about 75% lncRNAs contained at least one data exon deriving from TEs [15]. In plants, several lncRNAs To systematically study the molecular features and evo- were found to originate from TEs [16, 17]. lncRNA-314 lutionary profiles of lncRNAs in plants, we selected eight was originated from a full-length LTR retrotransposon representative plant species across the phylogenetic land- inserting into downstream of a promoter in the ances- scape for analysis, including Chlamydomonas reinhardtii tral tomato genome [16]. In Arabidopsis, lincRNA11195 (C. reinhardtii), Selaginella moellendoriffi (S. moellen - contained a LTR retrotransposon and was activated dori ffi ), Zea mays (maize), Oryza sativa subsp. Japon- under abiotic stresses [17]. In addition to these two plant ica (rice), Arabidopsis lyrata (A. lyrata), Arabidopsis Zhu  et al. BMC Genomics (2022) 23:381 Page 3 of 15 thaliana (A. thaliana), Populus trichocarpa (poplar) and 0.5 for single-exon transcripts and 0.1 for multiple-exon Solanum lycopersicum (tomato) (Fig.  1). These species transcripts) to resolve the transcripts with low expres- can represent single-cell alga, ferns, monocotyledon and sional abundance. For species with annotations including dicotyledon, spanned an evolutionary history of 116 mil- lncRNAs, we added them in our result if those lncRNAs lion years. Whole-genome sequences and annotations of satisfied our criterion. these species were available from public databases (Addi- A total of 39,945 lncRNAs were obtained in all the eight tional file 1: Table 1). species, for which the transcriptional orientation of each Different from previous studies, we acquired only lncRNA was also confirmed. LncRNAs identified in this RiboMinus and strand-specific RNA-seq (ssRNA-seq) study were categorized into two types: lncRNA in inter- data for multiple plant tissues, either collected from pub- genic region (lincRNA: 25,350) and lncRNA antisense lic database or sequenced by ourselves when they are not to one or more exons of coding genes (lncNAT: 14,595) available (Additional file  2). This critical approach was (Fig.  1, Additional file  1: Table 2, Additional file  3). Both designed to include lncRNA species either with poly(A) of lincRNAs and lncNATs dispersed widely in all of the or without poly(A), which can also identify the tran- plant clades from 777, the least number in C. reinhardtii scriptional orientation of lncRNAs, and their location on to 8321, the most in maize. By comparing lncRNAs iden- sense- or antisense- strand of coding gene regions. A con- tified in this study with lncRNAs from the databases of sensus and efficient lncRNA identification workflow were PLncDB, CANTATAdb, GreeNC, and NONCODE, a constructed based on our previous study [25]. To reduce total of 22,313 new lncRNAs (12,838 lincRNAs and 9475 the variation emerging from the tissue difference among lncNATs) were found, and their distribution in each spe- species, we selected data of five tissues: root, stem, leaf, cies were listed in “Additional file 1: Table 2”. flower, and seed/fruit. For lower-level plant species with - Previous studies on plant lncRNAs often missed out differentiated tissues, such as C. reinhardtii and S. out lncNATs due to lack of orientation in their RNA moellendori ffi , we used all available data. Considering the sequencing data (thus could not made accurate distinc- batch effects coming from PCR artifacts, sequence depth tion on lncNATs from coding gene transcripts). Here and gene expressional abundance, etc. among different we searched lncNATs using strand specific RNA-seq data sources, we eliminated PCR artifacts and low-qual- data, and found that lncNATs dispersed widely in plant ity sequencing data by pre-processing and mapping high- (Fig. 1). LncNATs took a large part in plant lncRNAs, and quality reads to genome in our procedure. We required even more than lincRNA in several species, such as A. the mapped data size for each tissue to be above 10X thaliana and C. reinhardtii (Fig.  1). There are about 20% genomic coverage depth in order to retain transcripts coding genes with lncNATs transcribing in A. thaliana. with low-level expression (“Additional file  2” listed the Coding genes can be expressed with their counterpart data size of each sample for all tissues of the eight spe- lncNATs simultaneously (Additional file  4: Fig. S1A). cies). We normalized the expression of transcripts using Compared with coding genes without lncNATs, their FPKM, which reduced or eliminated the discrepancy expressional levels were not significantly affected by the from sequencing depth and transcript length. At last, presence of lncNATs (Additional file  4: Fig. S1B). The lack we removed those low-expressional transcripts (FPKM of correlation between the expression of coding genes Fig. 1 Phylogenetic relationship of plant species in this study and lncRNAs identified in each species. The phylogenetic tree was drawn using TimeTree (http:// www. timet ree. org/). Myr: million years Zhu et al. BMC Genomics (2022) 23:381 Page 4 of 15 and their paired lncNATs implied the lncNATs may not In current study, we revealed comprehensive char- be directly involved in the regulation of coding genes’ acteristics of lncRNAs in multiple species using unified expression. sequencing data and bioinformatic analysis procedure. We found both lncNATs and lincRNAs have more sim- The consistent molecular features of lncRNAs ple structures than mRNAs. LncRNAs were shorter than among divergent plant species mRNAs (median length of lncNAT: 964, lincRNA: 817, The characteristics of lncRNAs, such as transcript length, mRNA: 1422, both p-value < 2.2e-16, t-test) (Fig. 2A), and AT content, exon number, and so on, were explored and lncRNAs have fewer exons compared to mRNAs (Mean compared to coding genes in individual plant species of lncNAT: 1.4, lincRNA: 1.7, mRNA: 5.2, both p-value [11, 26, 27]. However, if these features were consistent in < 2.2e-16, t-test) (Fig.  2B). These two features were con - plants were not clear. In addition to the most important served in plant lncRNAs (Fig. 2A, B). LncRNAs existed in differences between lncRNA and mRNA that was cod - plant genomes mainly as single-exon transcripts. A total ing products, other features remained to be investigated of 65% lincRNAs and 82% lncNATs were single-exon, between lncRNA and mRNA in plants. while in mRNAs this number was 22%. This phenomenon Fig. 2 Novel conserved features of lncRNAs in plant. A, B Length distribution and Exon-number of lincRNA, lncNAT and mRNA in each species. C Splicing ratio of lincRNA, lncNAT and mRNA in each species. D GC content of lincRNA, lncNAT, CDS/5’UTR/3’UTR of coding genes and intergenic sequences in each species. E SNP frequency of lincRNA, lncNAT and CDS of coding genes in each species Zhu  et al. BMC Genomics (2022) 23:381 Page 5 of 15 was observed in all plant species of this study, and was in TE for lncRNAs, we first established a reference library consistent with results of other plant species in previous of TEs for each genome using RepeatMasker [34]. As a studies [27]. result, up to 59.7% lincRNAs can be found contain TEs Fewer multiple-exon transcripts meant few require- in their sequences (Fig.  3A). Besides, we found a large ments for splicing by lncRNA. In a previous study, proportion of TEs inserted into the genomic loci of lncR- researchers found lncRNAs were rarely spliced and NAs and coding genes (lincRNA: 3.9% ~ 59.7%, lncNAT: mainly non-polyadenylated [28]. The splicing efficiency in 2.4% ~ 32%, coding gene: 3.1% ~ 43.9%). But, when we lim- human and mouse showed that inefficient splicing might ited the inserting regions to exons of lncRNAs and CDS be a common feature of lncRNAs across species [29]. The of coding genes, the ratio of transcripts overlapping with ratio of splicing lncRNAs was significantly lower than TEs decreased sharply in the latter (1.8% ~ 13.1%), while those for mRNAs (Wilcoxon test, both p-value = 7.6e-06, in lncRNAs the ratio remained (lincRNA: 3.1% ~ 58.2%, Additional file  4: Fig. S2A) in this study. Splicing lincR- lncNAT: 0.5% ~ 31.8%) (Fig.  3A). Significantly, sequences NAs ranged from 13.7% (poplar) to 48.8% (S. moellen- of TEs were remained in lncNATs though they shared dori ffi ), compared to the ratio of mRNAs from 65.3% sequences with coding genes. Decrease of TEs in protein (rice) to 92.6% (C. reinhardtii) (Fig. 2C). Inefficient splic - coding sequences indicated that TEs inserting into CDS ing of lncRNAs was not just a feature in animals but is were eliminated in evolution since they may change gene also common across plant species. Next, we wanted to products, and resulted in serious consequences. In cod- investigate if the splicing sites in lncRNAs were same ing genes, TE insertion was concentrated in regions with with mRNAs. Further analysis of the base distribution limited effects on gene products, such as UTR, intron. upstream and downstream of splicing sites showed that On the contrary, lncRNAs were not influenced by inser - sequences of splice sites in lncRNAs were conserved tion of TEs in their exons for lack of coding products, across plant species. They have almost the same sequence so these TE sequences can be kept and accompanied context between lncRNAs and mRNAs, which suggested by lncRNAs in evolution. This suggested that lncRNAs that they probably used the same splicing mechanism may originate from TE insertion. The close relationship (Additional file 4: Fig. S2B). between lncRNA and TEs revealed the key roles of TEs Both of lincRNAs and lncNATs have conserved nucleo- played in the origin of lncRNA. tide constitution among multiple species. LncRNAs have TEs were classified into two types: Class I TEs or retro - lower GC content than coding sequences (CDS) of coding transposons, and Class II TEs or DNA transposons. They genes, while which was higher than intergenic sequences are both present in the plant genome. However, we found (Fig. 2D). This feature was also consistent across multiple that Class II TEs favored transcribed regions, includ- species in plant. High GC content was usually associated ing lncNAT and lincRNA, whereas Class I TEs favored with coding sequences [30]. Though lncNATs and coding regions of non-transcribed sequences (Fig. 3B). The ratio genes shared overlapping sequences, the GC content of between Classes I and II TEs deviating from that the lncNATs was still lower than CDS in most plant species entire genome indicated that the TEs, Classes I and II, (Fig.  2D). The bias of nucleotide in lncRNA meant the in lncRNAs and coding sequences were under different selective pressure on their sequences. At the same time, selection pressure. mutants accumulated in both lincRNAs and lncNATs TEs play important roles on the origination of lncR- were much higher than coding genes (Fig. 2E, Additional NAs. Benefitting from the strand-specific sequencing file  1: Table  3). Mutants accumulated in coding regions data, we can confirm the positions of promoters. We with the possibility to change amino acid and lead to the found about 38 ± 21% lincRNAs had promoters origi- change of protein products. While the absence of coding nating from TEs, which was higher than the percent- products for lncRNAs made them tolerate more mutants. age of coding genes (29 ± 20%) and lncNATs (27 ± 21%) Although lncNATs shared sequences with coding genes, (Fig. 3C). The ratio of lncRNAs with promoters originat - their sequences accumulated more mutants and fewer ing from TEs was positively related to whole genomic TE GC bases, implied they may experience different paths to content in each species (Pearson’s correlation, lincRNA: coding genes in evolution. 0.63, lncNAT: 0.83). This suggested that TEs randomly inserted into genome was one of the major ways to gen- New discovery of lncRNA origination by inserting erate lncRNAs. of transposable elements (TEs) Further, we found TEs initiated the transcription of TEs were widely existed in eukaryotic genomes and lncRNAs by supplying TF binding sequences for lncR- acted as key factors for gene and genome evolution [31, NAs (Fig.  3D). Combining with ChIP-seq data from 32]. Insertion of TEs was considered as one of many nine transcript factors (TFs) in A. thaliana, A. lyrata, ways for origin of lncRNA [15, 33]. To study the roles of rice and maize, we identified 250 lncRNAs, which Zhu et al. BMC Genomics (2022) 23:381 Page 6 of 15 Fig. 3 TEs regulate the transcription of lncRNAs. A Genomic loci of lncRNAs and coding genes overlapping with TEs (up). Exonic sequences of lncRNAs and coding genes overlapping with TEs (down). The black dashed lines represent the ratio of TE sequences in whole genome; the red solid lines represent the ratio of coding genes overlapping with TEs; and the black and grey bars represent the ratio of lincRNA and lncNAT overlapping with TEs, respectively. B The distribution of Class I and Class II TEs in genome, lncRNAs and coding genes. C The ratio of lncRNAs with TE promoters in each species. D TE-lincRNA Osat_00007032 in rice (subspecies: Japonica). The up black line represents genome of Japonica, while the down one represents genome of Indica. The grey block between Japonica and Indica was the syntenic area between the two genomes. The locus of Osat_00007032 was signed with the red and blue bar, while the syntenic area in Indica was not found with lncRNAs or other elements. The TE upstream Osat_0000732 belonged to MERMITEJ sub-family, and was a type of DNA/Gypsy TE contained TF binding sites originating from TEs, and for transcript factor MADS29 was confirmed (Fig.  3D). these lncRNAs accounted for 11.7% of lncRNAs with MADS29 belongs to MADS-box transcription factors, TF biding sites being confirmed (Additional file  1: which were critical regulators for rice reproductive devel- Table  4, Additional file  5). As an example, we found a opment [35, 36]. Surprisingly, we cannot detect lncRNAs lincRNA Osat_00007032, which was upstream of cod- or TEs in the syntenic region of genome for Indica, which ing gene OS01G0166800 in rice (subsp. Japonica), and was another subspecies of rice. This indicated that lin - expressed in anther, pistil, leaf, root, stem and seed, cRNA Osat_00007032 was specific for Japonica, and was but especially high in pistil (Additional file  4: Fig. S3). evolved recently by TE insertion. This novel discovery A TE belonging to MERMITEJ subfamily covered the that TE brought promoters containing TF binding sites upstream of Osat_00007032, and provided promoter to the upstream exonic locus, revealed the relationship for Osat_00007032. In the promoter, binding sequences between TE and lncRNA different from that previously Zhu  et al. BMC Genomics (2022) 23:381 Page 7 of 15 found in tomato lncRNA, for which a TE inserted into rice were homologous with those in maize (Fig.  4C). In the downstream of a promoter [16]. genus such as Arabidopsis, half of lincRNAs were found having homologs in other species within the same genus Most plant lincRNAs only conserved within genus (Fig.  4C). Hence, we found that most of the homologous indicating rapid turnover in evolution lincRNAs were genus- or species-specific in plants, sug - Sequence conservation provides insight into evolution- gesting that lincRNAs have a rapid turnover in evolution. ary process of functional elements in genome. However, most lncRNAs in plants were found with poor conserva- IPS was found to be the only plant lncRNA group tion, including the lncRNAs whose functions were well with conserved sequence motifs studied [37, 38]. In this study, we attempted to explore Some lincRNAs were found to be functional via short- the evolution of lncRNAs in plants by analyzing the con- conserved patches, though the whole sequences were servation of lncRNAs with two methods. First, we evalu- divergent [40, 41]. To know if conserved patches existed ated the sequence conservation using PhastCons score in plant lncRNAs, we developed a method using sliding [39]. PhastCons score was calculated based on the result windows to calculate the mean PhastCons score for each of seven-way whole genomic alignment for A. thaliana, patch. We found most lincRNAs have no apparently con- A. lyrata, soybean, rice, poplar, tomato and S. moellen- served patches in the eight plant species (Fig.  4D). We dori ffi . We compared the sequence conservation of lincR - then looked into the plant lncRNAs in lncrnadb [42], and NAs, lncNATs, CDS of coding genes, 5’UTR and 3’UTR found IPS was the only lncRNA with conserved patches. of coding genes, and used sequences from intergenic IPS lncRNA family participated in phosphorus (P ) regions as control. As expected, CDS exhibited the high- homeostasis in plants, including AtIPS1 and AtIPS2/At4 est sequence conservation, while the intergenic sequence in Arabidopsis, OsIPS1 and OsIPS2 in rice, and TPSI1 was the lowest (Fig.  4A). Considering that lncNATs and in tomato [43–45]. They were found to be accumulated coding genes shared some common sequences, they under Pi starved condition. IPS contained conserved showed similar sequence conservation (Fig.  4A, Addi- 24-nt nucleotides motif in several species [43, 45]. The tional file  4: Fig. S4A). The conservation of lincRNAs, motif contained two highly conserved regions which however, was lower than 5’UTR, 3’UTR and introns of were separated by three nucleotides, and played roles coding genes (Fig. 4A). The proportion of plant lincRNAs as the target mimicry for miR399 [46]. Strikingly, our with PhastCons scores higher than 0.6 was only 0.2%, and study found all species contained this motif except algae this number in coding genes was 14.4%. About 71% plant (Fig. 4E). The absence of IPS in algae may be explained by lincRNAs scored 0, indicating very low conservation their lack of vascular tissues, whereas IPS was expressed (Additional file  4: Fig. S4B). In placental mammals, old in vascular tissues when lacking of P . IPS first appeared (minimum age 90 Myr) and young lncRNAs (minimum in S. moellendori ffi , a representative of primitive vascular age 25 Myr) were defined basing on the phylogenetic dis - plant (Fig. 4E, F). But only one of the two IPS motifs was tribution of species, and they found the conservation of found in S. moellendori ffi , whereas two motifs were often old lncRNAs were close to CDS, while young lncRNAs found in high plants (Fig.  4E, F). Different from most were even lower than intergenic regions [20]. In this lncRNAs with short evolutionary histories, IPS under- study, we found plant lincRNAs showed similar conser- went a long evolutionary history with conserved function vation to intergenic sequences, and which suggested that in P homeostasis, suggesting plant lncRNAs may play most plant lincRNAs were young and with short evolu- important roles in the adaptation of terrestrial life during tionary age (Fig. 4A, Additional file 4: Fig. S4A). migration from aquatic to terrestrial. Further, we evaluated the conservation of plant lincR- NAs by identifying the homologous lncRNAs across the Specific expression of lncRNAs revealing sophisticated eight plant species. We mapped lincRNAs in one species transcription regulation to genomes of all other species. LncNATs were excluded Previously several lncRNA studies showed that most since the existence of overlapped regions shared with lncRNAs were expressed at low levels, and often coding genes. While strong conversation of coding genes expressed in specific tissues or conditions in plants [11, existed between evolutionarily remote species in the 26]. In this study, we attempted to profile the expression plant kingdom, the conservation of lincRNAs retained of lincRNAs and lncNATs in more species. Both lincRNA only in the evolutionarily close groups (Fig. 4B, C). Previ- and lncNAT showed lower expression levels and higher ous studies of lncRNAs showed that the lncRNA homol- tissue specificity compared to coding genes (Fig.  5A, ogy mainly existed within a family in both plants and B). LincRNA and lncNAT showed similar expressional animals [20, 21]. However, we found conserved lincR- levels (average: 35.2 vs 22.9), while the expressional NAs were rare in plant family, as only 5% lincRNAs from level of mRNA was significantly higher (average: 124.8, Zhu et al. BMC Genomics (2022) 23:381 Page 8 of 15 Fig. 4 The conservation of lncRNAs in plant. A The PhastCons scores of coding genes (including CDS, 5’UTR, 3’UTR, and Intron), lncRNAs (lincRNA and lncNAT ) and intergenic sequences. B, C Conserved lincRNAs and coding genes in each species. LincRNAs were aligned to genome of another species using blastn, while amino acid sequences of coding genes were aligned to another species’ amino acid sequences using blastp. The numbers of legend were the ratio of transcripts with homologous sequences in the genome of other species. The labels were abbreviations of each species. D The cumulative frequency of PhastCons scores in coding genes and lncRNAs. The lincRNA-patch meant a window (12 bp) sliding from the first base to the last one, and then the highest window score was selected to stand for the score of the lincRNA. E Highly conserved IPS motifs in all plant species in this study except algae. The marked sequences with orange and turquoise were regions combining to miRNA. F The phylogenetic relationships of terrestrial plant was constructed according to the nucleotide sequences of IPS using MEGA7 (https:// www. megas oftwa re. net/) Kolmogorov-Smirnov test, both p-value < 2.2e-16) score of 0 meant the transcript expressed in all tissues. (Fig.  5A). Similar trends were observed in all plant spe- Both lincRNA and lncNAT exhibited significantly higher cies (Additional file  4: Fig. S5A). Low expressional level JS score, i.e., higher tissue specificity, than mRNA (Kol - was a common feature for plant lncRNAs. mogorov-Smirnov test, both p-value < 2.2e-16) (Fig.  5B, Tissue specificity was evaluated using Jensen-Shan - Additional file 4: Fig. S5B). non (JS) score in all of the plant species, which returned The expressional patterns of lncRNAs evolved rap - a score between 0 and 1 [47]. JS score of 1 represented idly in plant, which formed sharp contrast to the cod- the transcript expressed uniquely in one tissue, while JS ing genes. We defined tissue-specific (TS) transcripts as Zhu  et al. BMC Genomics (2022) 23:381 Page 9 of 15 Fig. 5 The expressional patterns of lncRNAs in plant. A, B Expressional levels (FPKM) and tissue specificity (JS score) of lincRNA, lncNAT and mRNA. C Tissue distribution of TS (tissue-specific) transcripts in several plant species D Tissue distribution of all expressed transcripts in A. thaliana. The legend numbers were calculated using log(FPKM+ 1). E Transcript factor binding frequency of lncRNA and mRNA were significantly different among tissues (Fig.  5D). In transcript with JS score > = 0.9. We found TS lncRNAs A. thaliana, lncNATs were preferentially expressed in and TS mRNAs were distributed to tissues differently embryo, while lincRNAs were in both endosperm and in the same species (Fig.  5C). Also, among the differ - embryo. LncRNAs had a narrow expression profile ent plants, TS lncRNAs of tomato and maize were more compared to coding genes. likely to express in leaf and root, while TS lncRNAs LncRNAs showed similar TF binding rate compared were found mostly in the pistil of rice, and the seed of A. to that of coding genes, though they had low expres- lyrata. sional levels and unstable expression patterns. We used Previous studies in rice and maize showed that ChIP-seq data of nine transcript factors (TFs) from A. lncRNAs were more specially expressed in reproduc- thaliana, A. lyrata, rice and maize to predict the bind- tive tissues [11, 26]. By comparing the expression of ing sites in promoter (Additional file  5). The frequency all transcripts, we found mRNAs were more evenly of lncRNA promoters binding with TFs was compa- expressed in all tissues in A. thaliana, while lncRNAs rable to coding genes (Fig.  5E). TFs with low binding Zhu et al. BMC Genomics (2022) 23:381 Page 10 of 15 rate in coding genes also had low binding efficiency in this network, lncRNAs and coding genes were clustered lncRNA, e.g., LF YGR , NLP7, P1, a-myc, HASF1b. into 24 modules according to their expressional charac- teristics (Fig.  6A). In these modules, a total of 689 lncR- Plant lncRNAs forming co‑expression network with coding NAs (155 lincRNAs and 534 lncNATs) were found to be genes co-expressed with coding genes. Especially, lncNATs in Expression is often closely related to function, especially the network were not directly related to their antisense highly and specifically expressed lncRNAs [48, 49]. We coding genes, which referred they may not regulate the selected lncRNAs that were highly expressed in one or transcription of antisense genes. GO enrichment analy- two tissues in A. thaliana to construct WGCNA net- sis was performed for the coding genes in each module. work. A total of 178 lincRNAs, 555 lncNATs and 20,729 Functions of the coding genes were summarized as fol- coding genes were selected to construct the network. In lows: I) Basic metabolism: kinase activity, lipid metabolic Fig. 6 The WGCNA network of lncRNAs and coding genes, and function prediction of lncRNAs co-expressed with coding genes. A Correlation matrix between modules and tissues of A. thaliana. B Correlational network of the module 22. The red and blue circles represent lncNAT and lincRNA respectively, while the other circles represent coding genes. C Functional description of modules containing lncRNAs. The area of the pie was positively correlated to the number of transcripts. The numbers in the brackets correspond to the modules in (A) Zhu  et al. BMC Genomics (2022) 23:381 Page 11 of 15 process, and hydrolase activity; II) Response to stress and Although several lncNATs were reported to regulate response to abiotic stimulus; III) Reproduction (Fig. 6C). the expression of antisense coding genes [51, 52], we LncRNAs involved in the network may play roles in these found most lncNATs were not correlated to the expres- functions. sion of antisense coding genes (Additional file  4: Fig. The correlation between modules and samples S1B). This observation implied the function of most showed that the expressional patterns of lncRNAs were lncNATs may not be related to their complimentary highly correlated to specific tissues (Fig.  6A). In our coding genes. result, lncRNAs of A. thaliana were highly and spe- In plants, insertion of TEs is one of the important cially expressed in embryo and endosperm (Fig.  5D). mechanisms for lncRNA origin [16, 53]. We found many In WGCNA network, the embryo was strongly corre- plant lncRNAs from all the studied species contained lated with module 01, and the endosperm was strongly sequence elements of TEs. In A. thaliana, rice and maize, correlated to module 22 (Fig.  6A, B). In module 01, we the proportion of TE-related lincRNAs in total lincRNAs found a myriad of coding genes associating with embryo were reported to range from 22.9 to 51.5% [17], similar development. Functions of these genes can be catego- to our results (18.2% ~ 58.2%) on these species. TEs were rized as: I) Involving in embryo development directly; mainly inserted into the exons of plant lncRNAs, avoid- II) Overexpression, mutation or knocking down caused ing exons with coding genes (Fig.  3A). Particularly, we embryo death; III) Enrichment of protein products found an example that TE insertion caused initiating new during embryonic development. Co-expressional rela- transcription, which has not been reported in plants. tionship was found between these coding genes and This finding provided further evidence that TEs play key lncRNA, and lncRNAs in this network may play roles in roles in the origin of plant lncRNAs. Note several differ - embryo development. In module 22, two radial networks ent ways for the origin of lncRNAs were documented, were constituted of four lncNATs and seven lincRNAs including coding genes losing coding capacity, genomic with coding genes (Fig.  6B). In these coding genes, we sequences devoid of exons and insertion of TEs [54]. Our found four genes encoding transcript factors played research added new details for lncRNAs origination with roles in the endosperm development. As an example, novel complexity. gene AT1G02580 encoded transcript factor MEDEA/ Some lncRNAs were found to be conserved in animals MEA, a polycomb group gene that was imprinted in the with the species were distantly separated in evolution [20, endosperm [50]. All of the 11 lncRNAs in the network 55]. These ‘old’ lncRNAs that were preserved in animals, were found co-expressed with it, which indicating that were suggested to have important functions [20]. How- these lncRNAs may play roles in endosperm imprint. ever, plant lncRNAs were highly divergent, and almost all Coincidently, researchers found some lncRNAs express- plant lncRNAs were species/genus-specific. Differently ing in castor bean seeds were involved in genomic from animals, we did not find any plant lncRNAs that imprint, and several of them comprised the imprinted are conserved along the evolution. Our results agree with cluster with imprinted coding genes [27]. others’ studies that did not detect highly conserved plant lncRNAs [21, 22]. Conserved sequence patches within lncRNAs were Discussion previously reported in some lncRNAs [40]. IPS was a Benefiting from the advancement of high-throughput lncRNA participating in Pi regulation in several plant sequencing technology, we have discovered more lncR- species, and was induced expressing in the vascular tis- NAs in representative plant species that are associated sues of root and shoot [41, 45]. However, we have found with novel complexity. Most of the previous studies the highly conserved motifs of IPS crossed all tracheo- sequenced RNAs with poly(A) tails by oligo(dT) selec- phyte (Fig.  4E, F). The emergency of IPS in all terrestrial tion, which would exclude a large part of lncRNAs plants indicated that IPS lncRNAs may play important without poly(A) [14]. In addition, lncNATs was much roles in the adaptative evolution of plant migrating from more difficult to study with the non-strand specific aquatic environments to terrestrial ones. RNA-sequencing technology, whose distribution in To date, the functions of a large number of lncRNAs plant was largely not clear. Here we studied lncRNAs in plants remain unknown. Our finding that most highly using RiboMinus and strand-specific RNA-sequencing and specially expressed lncRNAs formed co-expression data. We expanded the plant lncRNA landscape, and network with coding genes suggested they may play roles found the lncNATs dispersed widely in plant. The num - in the plant development, stress response, etc. For exam- bers of lncNATs were even greater than lincRNAs in ple, they may participate in embryo development and several species, such as A. thaliana and C. reinhardtii. endosperm imprint. However, within the large amount Zhu et al. BMC Genomics (2022) 23:381 Page 12 of 15 of lncRNAs identified in plant species, only a small por - (Additional file  2). Data downloaded from public data- tion of them were found in potential functional networks. bases can be found by the accession numbers record- The large numbers of novel plant lncRNAs found in our ing in “Additional file  2”. Samples been sequenced in study open an avenue to systematically characterize and this study were dealt with following processes: materials explore the functions and evolution of lncRNAs in plant. were taken from at least two plant individuals, and then been mixed for RNA isolation. Total RNA was extracted, Conclusions and rRNA was removed from the purified RNA using We performed RiboMinus combining with strand-spe- Ribo-Zero rRNA Removal Kit. Strand-specific RNA-seq cific RNA-sequencing technique on eight evolutionarily libraries were constructed with TruSeq Stranded mRNA representative plant species. We identified 39,945 lncR - LT Sample Prep Kit. Libraries were applied to Illumina NAs, which including 14,595 lncNATs, and expounded Hiseq2500 platform. the distribution of lncNATs in plants. Further, we found TEs played key roles in the origination of plant lncRNAs. Bioinformatics pipeline for identification of lncRNAs Combing with ChIP-seq data, we found a new way for Sequencing reads were filtered according to the quality lncRNA originating from TE insertion. In addition, we score using sickle (version 1.210), and the quality thresh- found plant lncRNAs were not conserved in evolutionar- old was set as 20. Reads shorter than 50 bp or with ambig- ily distant species, and most of them were species/genus- uous nucleotides were removed after quality checking. specific. However, conserved motifs of IPS were found Genomes and gene sets of the eight plant species were in lncRNA of all terrestrial plants, and IPS may played downloaded from public databases, and the versions important roles in the adaptive evolution of plant. The of them were shown in “Additional file  1: Table  1”. High study revealed novel features and complexity of lncRNAs quality reads were aligned to the genomes of each spe- in plants through systematic analysis, providing impor- cies using TopHat2 (version 2.1.0) [56]. Cufflinks (version tant insights into the origination and evolution of plant 2.2.1) was used to construct transcripts of each sample lncRNAs. according to the mapping results [57]. The expressional levels of transcripts were evaluated by FPKM using Cuf- flinks software suite. JS score was employed to represent Methods the tissue specificity, and which was calculated using R Plant materials library cummeRbund (version 2.7.2). We defined a tran - The seeds of A. thaliana and A. lyrata were buy from script as lncRNA if it satisfied the following criteria: I) it ABRC (https:// abrc. osu. edu/ users/ sign_ in? redir ect_ must be longer than 200 bp; II) it was noncoding by the to=% 2Ford ers% 2Fnew), and then cultivated in phyto- result of CPC and PLEK [58, 59]; III) it had no homologs tron respectively. Leaves, stems, roots and flowers of in databases of Swiss-Prot [60], Pfam [61] or Rfam [62]; A. thaliana and A. lyrata were collected from mature IV) the FPKM of it was more than 0.5 for single-exon plants. Siliques of A. thaliana were collected one week transcript or more than 0.1 for multiple-exon transcript. after flowering, and seeds were collected when siliques were full but epidermis were still green. The seedlings of Characterization of lncRNAs and coding genes poplar were provided by Cuiting Wang, CAS Center for SNP calling Excellence in Molecular Plant Sciences. Leaves, stems, All clean reads were mapped to genome using bwa (ver- roots and shoot tips of poplar were collected from seed- sion 0.7.17-r1188) [63]. Samtools (version 1.9) and lings with height of ~ 1 m. The mature plants of rice bcftools (version 1.9) were used to call SNPs [64]. All (Japonica) were obtained from the lab of Zuhua He, CAS SNPs from various tissues or libraries were merged, then Center for Excellence in Molecular Plant Sciences (http:// SNPs with quality more than 10 and depth more than 5 sippe. ac. cn/ zuhua he/). Leaves and stems of rice were col- were retained. At last, high-quality SNPs in the region lected before heading period. Samples of S. moellendori ffi of lncRNAs and coding genes were retained using local were described in our previous study [25]. scripts. Transcript factor binding site identification RNA sequencing and data acquisition ChIP-seq data from A. thaliana, A. lyrata, rice and maize Our dataset contained more than 700G read pairs from were downloaded from SRA database, and their acces- eight plant species, of which 490G were previously pub- sion numbers were listed in “Additional file  5”. After qual- lished data and 210G were generated by RNA sequencing ity filtering using sickle (version 1.210), clean reads were Zhu  et al. BMC Genomics (2022) 23:381 Page 13 of 15 mapped to genome using bowtie2 (version 2.2.6) [65], Supplementary Information and the parameter was set as -N 1. Peak calling was sub- The online version contains supplementary material available at https:// doi. org/ 10. 1186/ s12864- 022- 08602-9. jected to MACS (version 2.1.2) with module “callpeak” (version 2.1.2), and the parameters were set as --keep- Additional file 1. dup 1 and -q 0.01 [66]. Additional file 2. RNA-seq datasets. Additional file 3. lncRNA-gff3. TE annotation Additional file 4: Figure S1. (A) Expressional levels of lncNATs and their Whole genomic repeats and TEs of the eight plant spe- antisense coding genes. The expressional levels were evaluated by log cies in this study were identified using RepeatMasker (FPKM+1). (B) Expressional levels of lncNAT, antisense coding genes, and coding genes without lncNATs expressing at their antisense strand. 3.3.0 with parameters setting as -xsmall, and the Rep- Figure S2. (A) Ratio of splicing and alternative splicing transcripts. (B) Base20.02 were used as reference library. TE-lncRNAs Base distribution of splicing acceptor and donor for lncRNAs and mRNAs. were defined as lncRNAs with exon sequence overlap - Figure S3. Expressional levels of lncRNA Osat_00007032 in rice. Figure S4. The distribution of PhastCons score in lncRNAs and coding genes. (A) ping at least 5 bp to TE sequences. The distribution of PhastCons score in lncRNAs, intergenic sequences and coding sequences of coding genes. (B) Percentage of lncRNAs and mRNAs with different PhastCons scores in total lncRNAs and mRNAs, respectively. PhastCons score Figure S5. Expressional patterns of lncRNAs and mRNAs in divergent Lastz (version 1.02.00) was used for inter-species align- species. (A) Expressional levels of lncRNAs and mRNAs in each species, and ments, then TBA (version v12) was used for final integra - the numbers of y-axes were calculated by log (FPKM). (B) Tissue-specific- ity of lncRNAs and mRNAs in each species with multiple tissues, and the tion to obtain genome-wide alignment files for all eight y-axes stands for JS score. species [67]. PhastCons score (PCs) were calculated using Additional file 5. ChIP-seq datasets. phast (version 1.3) according to the results of genome- wide alignment [39]. Conserved patches in lncRNAs Acknowledgements were defined as short patches (12 bp) with PCs more than The authors thank Cuiting Wang for supplying the materials of poplar. 0.6, and meanwhile PCs of the whole lncRNAs were less than 0.3. About this supplement This article has been published as part of BMC Genomics Volume 23 Supple- ment 4, 2022: Selected articles from the International Conference on Intel- WGCNA network construction ligent Biology and Medicine (ICIBM 2021): genomics. The full contents of the supplement are available online at https:// bmcge nomics. biome dcent ral. com/ The WGCNA co-expressional network was constructed artic les/ suppl ements/ volume- 23- suppl ement-4. with R package WGCNA (version 1.68) using the FPKMs from all the 12 tissues of selected genes in A. thaliana Authors’ contributions X L conceived, designed the study, and modified the manuscript. Y Z collected [68]. Genes complying with the following criteria were data, performed analysis and drafted the manuscript. LX C collected samples selected to construct the network: 1) lncNATs express- and performed experiments. XN H took part in some experiments. H S drew ing in one or two tissues, and their expression levels were several figures. All authors have read and approved the final manuscript. among the top 10%; 2) lincRNAs expressing in one or two Funding tissues, and with which the maximum FPKM was larger This work was in part supported by grand from the National Key Research than 1; 3) mRNAs with FPKM larger than 1. The param - and Development Program of China (2019YFA0904601) and the National Natural Science Foundation of China (31701137, 31972881, 31900470). The eters of the key function “blockwiseModules” were set as: plant materials collecting, cultivating, and the RNA-seq experiments were power = 16, mergeCutHeight = 0.25, and reassignThresh - supported by the National Natural Science Foundation of China (31701137, old = 0. The network of each module was showed by 31972881, 31900470). The data Publication costs are funded by the National Key Research and Development Program of China (2019YFA0904601). Cytoscape (Version 3.6.1) [69]. GO enrichment analysis was subjected by BiNGO (version 3.0.3), and the param- Availability of data and materials eters were set as: the multiple testing correlation - “FDR The raw sequence data reported in this paper have been deposited in the Genome Sequence Archive [71] in National Genomics Data Center [72], Beijing correlation”, ontology file - “GOSlim_Plants”, and the Institute of Genomics (China National Center for Bioinformation), Chinese organism annotation - “Arabidopsis thaliana” [70]. Academy of Sciences, under accession number CRA003591 that are publicly accessible at https:// bigd. big. ac. cn/ gsa. The other RNA-seq raw data used in this study were downloaded from NCBI public database, and the accession Abbreviations numbers were listed in ‘Additional file 2’. lncRNA: long noncoding RNA; lincRNA: long intergenic noncoding RNA; lncNAT: long noncoding natural antisense RNA; NAT: Natural antisense Declarations transcript; ncRNA: noncoding RNA; AS: Alternative splicing; TE: Transposable element; CDS: Coding sequence; TF: Transcript factor; PCs: PhastCons score; Ethics approval and consent to participate Pi: Phosphorus; FPKM: Fragment per kilobase per million mapped fragments; Not applicable. JS score: Jensen-Shannon score; TS: Tissue-special; WGCNA: Weighted cor- relation network analysis; GO: Gene ontology; Atha: A. thaliana; Alyr: A. lyrata; Consent for publication Ptri: P. trichocarpa; Slyc: S. lycopersicum; Zmay: Z. mays; Osat: O. sativa; Smoe: S. Not applicable. moellendorffii; Crei: C. reinhardtii. Zhu et al. BMC Genomics (2022) 23:381 Page 14 of 15 Competing interests 19. Lopez-Ezquerra A, Harrison MC, Bornberg-Bauer E. Comparative analysis The authors declare that they have no competing interests. of lincRNA in insect species. BMC Evol Biol. 2017;17(1):155. 20. Necsulea A, Soumillon M, Warnefors M, Liechti A, Daish T, Zeller U, et al. Author details The evolution of lncRNA repertoires and expression patterns in tetrapods. Key Laboratory of Synthetic Biology, Center for Excellence in Molecular Plant Nature. 2014;505(7485):635–40. Sciences/Institute of Plant Physiology and Ecology, Chinese Academy of Sci- 21. Deng P, Liu S, Nie X, Weining S, Wu L. Conservation analysis of long non- ences, Shanghai, China. University of Chinese Academy of Sciences, Beijing, coding RNAs in plants. Sci China Life Sci. 2018;61(2):190–8. China. Henan University, Kaifeng, China. 22. Mohammadin S, Edger PP, Pires JC, Schranz ME. Positionally-conserved but sequence-diverged: identification of long non-coding RNAs in the Received: 30 April 2022 Accepted: 3 May 2022 Brassicaceae and Cleomaceae. BMC Plant Biol. 2015;15:217. 23. Wang H, Niu QW, Wu HW, Liu J, Ye J, Yu N, et al. Analysis of non-coding transcriptome in rice and maize uncovers roles of conserved lncRNAs associated with agriculture traits. Plant J. 2015;84(2):404–16. 24. Simopoulos CMA, Weretilnyk EA, Golding GB. Molecular Traits of Long Non-protein Coding RNAs from Diverse Plant Species Show Little Evi- References dence of Phylogenetic Relationships. G3 (Bethesda). 2019;9(8):2511–20. 1. Liu X, Hao L, Li D, Zhu L, Hu S. Long non-coding RNAs and their biological 25. Zhu Y, Chen L, Zhang C, Hao P, Jing X, Li X. Global transcriptome roles in plants. Genomics Proteomics Bioinformatics. 2015;13(3):137–47. analysis reveals extensive gene remodeling, alternative splicing and 2. Seo JS, Sun HX, Park BS, Huang CH, Yeh SD, Jung C, et al. ELF18-INDUCED differential transcription profiles in non-seed vascular plant Selaginella LONG-NONCODING RNA associates with mediator to enhance expres- moellendorffii. BMC Genomics. 2017;18(Suppl 1):1042. sion of innate immune response genes in Arabidopsis. Plant Cell. 26. Zhang YC, Liao JY, Li ZY, Yu Y, Zhang JP, Li QF, et al. Genome-wide 2017;29(5):1024–38. screening and functional analysis identify a large number of long 3. Zhao X, Li J, Lian B, Gu H, Li Y, Qi Y. Global identification of Arabidopsis noncoding RNAs involved in the sexual reproduction of rice. Genome lncRNAs reveals the regulation of MAF4 by a natural antisense RNA. Nat Biol. 2014;15(12):512. Commun. 2018;9(1):5056. 27. Xu W, Yang T, Wang B, Han B, Zhou H, Wang Y, et al. Differential expres- 4. Sun Y, Hao P, Lv X, Tian J, Wang Y, Zhang X, et al. A long non-coding apple sion networks and inheritance patterns of long non-coding RNAs in RNA, MSTRG.85814.11, acts as a transcriptional enhancer of SAUR32 and castor bean seeds. Plant J. 2018;95(2):324–40. contributes to the Fe-deficiency response. Plant J. 2020;103(1):53–7. 28. Schlackow M, Nojima T, Gomes T, Dhir A, Carmo-Fonseca M, Proudfoot 5. Paytuvi Gallart A, Hermoso Pulido A, Anzar Martinez de Lagran I, San- NJ. Distinctive Patterns of Transcription and RNA Processing for Human severino W, Aiese Cigliano R. GREENC: a Wiki-based database of plant lincRNAs. Mol Cell. 2017;65(1):25–38. lncRNAs. Nucleic Acids Res. 2016;44(D1):D1161–6. 29. Mele M, Mattioli K, Mallard W, Shechner DM, Gerhardinger C, Rinn JL. 6. Fang S, Zhang L, Guo J, Niu Y, Wu Y, Li H, et al. NONCODEV5: a compre- Chromatin environment, transcriptional regulation, and splicing distin- hensive annotation database for long non-coding RNAs. Nucleic Acids guish lincRNAs and mRNAs. Genome Res. 2017;27(1):27–37. Res. 2018;46(D1):D308–14. 30. Pozzoli U, Menozzi G, Fumagalli M, Cereda M, Comi GP, Cagliani R, et al. 7. Jin J, Lu P, Xu Y, Li Z, Yu S, Liu J, et al. PLncDB V2.0: a comprehensive Both selective and neutral processes drive GC content evolution in the encyclopedia of plant long noncoding RNAs. Nucleic Acids Res. human genome. BMC Evol Biol. 2008;8:99. 2021;49(D1):D1489–95. 31. Bennetzen JL, Wang H. The contributions of transposable elements 8. Szczesniak MW, Rosikiewicz W, Makalowska I. CANTATAdb: A Collection of to the structure, function, and evolution of plant genomes. Annu Rev Plant Long Non-Coding RNAs. Plant Cell Physiol. 2016;57(1):e8. Plant Biol. 2014;65:505–30. 9. Di C, Yuan J, Wu Y, Li J, Lin H, Hu L, et al. Characterization of stress- 32. Slotkin RK, Martienssen R. Transposable elements and the epigenetic responsive lncRNAs in Arabidopsis thaliana by integrating expression, regulation of the genome. Nat Rev Genet. 2007;8(4):272–85. epigenetic and structural features. Plant J. 2014;80(5):848–61. 33. Cho J. Transposon-Derived Non-coding RNAs and Their Function in 10. Shumayla S, Taneja M, Tyagi S, Singh K, Upadhyay SK. Survey of High Plants. Front Plant Sci. 2018;9:600. Throughput RNA-Seq Data Reveals Potential Roles for lncRNAs during 34. Smit AF, Hubley R, Green P. Green: RepeatMasker Open-4.0. 2013-2015 Development and Stress Response in Bread Wheat. Front Plant Sci. <http:// www. repea tmask er. org>. 2017;8:1019. 35. Yin LL, Xue HW. The MADS29 transcription factor regulates the deg- 11. Li L, Eichten SR, Shimizu R, Petsch K, Yeh CT, Wu W, et al. Genome-wide radation of the nucellus and the nucellar projection during rice seed discovery and characterization of maize long non-coding RNAs. Genome development. Plant Cell. 2012;24(3):1049–65. Biol. 2014;15(2):R40. 36. Nayar S, Sharma R, Tyagi AK, Kapoor S. Functional delineation of rice 12. Golicz AA, Singh MB, Bhalla PL. The Long Intergenic Noncoding MADS29 reveals its role in embryo and endosperm development by RNA (LincRNA) Landscape of the Soybean Genome. Plant Physiol. affecting hormone homeostasis. J Exp Bot. 2013;64(14):4239–53. 2018;176(3):2133–47. 37. Heo JB, Sung S. Vernalization-mediated epigenetic silencing by a long 13. Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A, et al. intronic noncoding RNA. Science. 2011;331(6013):76–9. Landscape of transcription in human cells. Nature. 2012;489(7414):101–8. 38. Ding J, Lu Q, Ouyang Y, Mao H, Zhang P, Yao J, et al. A long noncoding 14. Yang L, Duff MO, Graveley BR, Carmichael GG, Chen LL. Genom- RNA regulates photoperiod-sensitive male sterility, an essential com- ewide characterization of non-polyadenylated RNAs. Genome Biol. ponent of hybrid rice. Proc Natl Acad Sci U S A. 2012;109(7):2654–9. 2011;12(2):R16. 39. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, 15. Kapusta A, Kronenberg Z, Lynch VJ, Zhuo X, Ramsay L, Bourque G, et al. et al. Evolutionarily conserved elements in vertebrate, insect, worm, Transposable elements are major contributors to the origin, diversifica- and yeast genomes. Genome Res. 2005;15(8):1034–50. tion, and regulation of vertebrate long noncoding RNAs. PLoS Genet. 40. Lin N, Chang KY, Li Z, Gates K, Rana ZA, Dang J, et al. An evolutionar- 2013;9(4):e1003470. ily conserved long noncoding RNA TUNA controls pluripotency and 16. Wang X, Ai G, Zhang C, Cui L, Wang J, Li H, et al. Expression and diver- neural lineage commitment. Mol Cell. 2014;53(6):1005–19. sification analysis reveals transposable elements play important roles 41. Huang CY, Shirley N, Genc Y, Shi B, Langridge P. Phosphate utiliza- in the origin of Lycopersicon-specific lncRNAs in tomato. New Phytol. tion efficiency correlates with expression of low-affinity phosphate 2016;209(4):1442–55. transporters and noncoding RNA, IPS1, in barley. Plant Physiol. 17. Wang D, Qu Z, Yang L, Zhang Q, Liu ZH, Do T, et al. Transposable elements 2011;156(3):1217–29. ( TEs) contribute to stress-related long intergenic noncoding RNAs in 42. Quek XC, Thomson DW, Maag JLV, Bartonicek N, Signal B, Clark MB, plants. Plant J. 2017;90(1):133–46. et al. lncRNAdb v2.0: expanding the reference database for func- 18. Hezroni H, Koppstein D, Schwartz MG, Avrutin A, Bartel DP, Ulitsky I. Princi- tional long noncoding RNAs. Nuncleic Acids Res. 2015;43(Database ples of long noncoding RNA evolution derived from direct comparison of issue):D168–73. transcriptomes in 17 species. Cell Rep. 2015;11(7):1110–22. Zhu  et al. BMC Genomics (2022) 23:381 Page 15 of 15 43. Hou XL, Wu P, Jiao FC, Jia QJ, Chen HM, Yu J, et al. Regulation of the 67. Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, et al. expression of OsIPS1 and OsIPS2 in rice via systemic and local Pi signalling Aligning multiple genomic sequences with the threaded blockset aligner. and hormones. Plant Cell Environ. 2005;28(3):353–64. Genome Res. 2004;14(4):708–15. 44. Liu C, Muchhal US. Differential expression of TPS11, a phosphate 68. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation starvation-induced gene in tomato. Plant Mol Biol. 1997;33:867–74. network analysis. BMC Bioinformatics. 2008;9:559. 45. Shin H, Shin HS, Chen R, Harrison MJ. Loss of At4 function impacts phos- 69. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. phate distribution between the roots and the shoots during phosphate Cytoscape: a software environment for integrated models of biomolecu- starvation. Plant J. 2006;45(5):712–26. lar interaction networks. Genome Res. 2003;13(11):2498–504. 46. Franco-Zorrilla JM, Valli A, Todesco M, Mateos I, Puga MI, Rubio-Somoza 70. Maere S, Heymans K, Kuiper M. BiNGO: a Cytoscape plugin to assess I, et al. Target mimicry provides a new mechanism for regulation of micro- overrepresentation of gene ontology categories in biological networks. RNA activity. Nat Genet. 2007;39(8):1033–7. Bioinformatics. 2005;21(16):3448–9. 47. Cabili MN, Trapnell C, Goff L, Koziol M, Tazon-Vega B, Regev A, 71. Wang Y, Song F, Zhu J, Zhang S, Yang Y, Chen T, et al. GSA: Genome et al. Integrative annotation of human large intergenic noncoding Sequence Archive. Genomics Proteomics Bioinformatics. 2017;15(1):14–8. RNAs reveals global properties and specific subclasses. Genes Dev. 72. National Genomics Data Center Members and Partners. Database 2011;25(18):1915–27. Resources of the National Genomics Data Center in 2020. Nucleic Acids 48. Cui J, Luan Y, Jiang N, Bao H, Meng J. Comparative transcriptome analysis Res. 2020;48(D1):D24–33. between resistant and susceptible tomato allows the identification of lncRNA16397 conferring resistance to Phytophthora infestans by co- Publisher’s Note expressing glutaredoxin. Plant J. 2017;89(3):577–89. Springer Nature remains neutral with regard to jurisdictional claims in pub- 49. Zhang G, Duan A, Zhang J, He C. Genome-wide analysis of long non-cod- lished maps and institutional affiliations. ing RNAs at the mature stage of sea buckthorn (Hippophae rhamnoides Linn) fruit. Gene. 2017;596:130–6. 50. Song Q, Ando A, Jiang N, Ikeda Y, Chen ZJ. Single-cell RNA-seq analysis reveals ploidy-dependent and cell-specific transcriptome changes in Arabidopsis female gametophytes. Genome Biol. 2020;21(1):178. 51. Yuan JH, Liu XN, Wang TT, Pan W, Tao QF, Zhou WP, et al. The MBNL3 splicing factor promotes hepatocellular carcinoma by increasing PXN expression through the alternative splicing of lncRNA-PXN-AS1. Nat Cell Biol. 2017;19(7):820–32. 52. Yue H, Zhu J, Xie S, Li F, Xu Q. MDC1-AS, an antisense long noncod- ing RNA, regulates cell proliferation of glioma. Biomed Pharmacother. 2016;81:203–9. 53. Cho J, Paszkowski J. Regulation of rice root development by a retrotrans- poson acting as a microRNA sponge. Elife. 2017;6:e30038. 54. Ponting CP, Oliver PL, Reik W. Evolution and functions of long noncoding RNAs. Cell. 2009;136(4):629–41. 55. Washietl S, Kellis M, Garber M. Evolutionary dynamics and tissue specific- ity of human long noncoding RNAs in six mammals. Genome Res. 2014;24(4):616–28. 56. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, dele- tions and gene fusions. Genome Biol. 2013;14(4):R36. 57. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotech- nol. 2010;28(5):511–5. 58. Kong L, Zhang Y, Ye ZQ, Liu XQ, Zhao SQ, Wei L, et al. CPC: assess the protein-coding potential of transcripts using sequence features and sup- port vector machine. Nucleic Acids Res. 2007;35(suppl_2):W345–9. 59. Li A, Zhang J, Zhou Z. PLEK a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme. BMC Bioin- formatics. 2014;15:311. 60. UniProt C. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 2019;47(D1):D506–15. 61. El-Gebali S, Mistry J, Bateman A, Eddy SR, Luciani A, Potter SC, et al. The Pfam protein families database in 2019. Nucleic Acids Res. 2019;47(D1):D427–32. Re Read ady y to to submit y submit your our re researc search h ? Choose BMC and benefit fr ? Choose BMC and benefit from om: : 62. Kalvari I, Argasinska J, Quinones-Olvera N, Nawrocki EP, Rivas E, Eddy SR, et al. Rfam 13.0: shifting to a genome-centric resource for non-coding fast, convenient online submission RNA families. Nucleic Acids Res. 2018;46(D1):D335–42. thorough peer review by experienced researchers in your field 63. Li H, Durbin R. Fast and accurate short read alignment with Burrows- Wheeler transform. Bioinformatics. 2009;25(14):1754–60. rapid publication on acceptance 64. Li H. A statistical framework for SNP calling, mutation discovery, associa- support for research data, including large and complex data types tion mapping and population genetical parameter estimation from • gold Open Access which fosters wider collaboration and increased citations sequencing data. Bioinformatics. 2011;27(21):2987–93. 65. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat maximum visibility for your research: over 100M website views per year Methods. 2012;9(4):357–9. 66. Feng J, Liu T, Qin B, Zhang Y, Liu XS. Identifying ChIP-seq enrichment At BMC, research is always in progress. using MACS. Nat Protoc. 2012;7(9):1728–40. Learn more biomedcentral.com/submissions http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png BMC Genomics Springer Journals

Revealing the novel complexity of plant long non-coding RNA by strand-specific and whole transcriptome sequencing for evolutionarily representative plant species

BMC Genomics , Volume 23 (Suppl 4) – May 19, 2022

Loading next page...
 
/lp/springer-journals/revealing-the-novel-complexity-of-plant-long-non-coding-rna-by-strand-lvY0GnQ5v5

References (78)

Publisher
Springer Journals
Copyright
Copyright © The Author(s) 2022
eISSN
1471-2164
DOI
10.1186/s12864-022-08602-9
Publisher site
See Article on Publisher Site

Abstract

Background: Previous studies on plant long noncoding RNAs (lncRNAs) lacked consistency and suffered from many factors like heterogeneous data sources and experimental protocols, different plant tissues, inconsistent bioinformat - ics pipelines, etc. For example, the sequencing of RNAs with poly(A) tails excluded a large portion of lncRNAs without poly(A), and use of regular RNA-sequencing technique did not distinguish transcripts’ direction for lncRNAs. The current study was designed to systematically discover and analyze lncRNAs across eight evolutionarily representative plant species, using strand-specific (directional) and whole transcriptome sequencing (RiboMinus) technique. Results: A total of 39,945 lncRNAs (25,350 lincRNAs and 14,595 lncNATs) were identified, which showed molecular features of lncRNAs that are consistent across divergent plant species but different from those of mRNA. Further, transposable elements ( TEs) were found to play key roles in the origination of lncRNA, as significantly large number of lncRNAs were found to contain TEs in gene body and promoter region, and transcription of many lncRNAs was driven by TE promoters. The lncRNA sequences were divergent even in closely related species, and most plant lncRNAs were genus/species-specific, amid rapid turnover in evolution. Evaluated with PhastCons scores, plant lncRNAs showed similar conservation level to that of intergenic sequences, suggesting that most lincRNAs were young and with short evolutionary age. INDUCED BY PHOSPHATE STARVATION (IPS) was found so far to be the only plant lncRNA group with conserved motifs, which may play important roles in the adaptation of terrestrial life during migration from aquatic to terrestrial. Most highly and specially expressed lncRNAs formed co-expression network with coding genes, and their functions were believed to be closely related to their co-expression genes. Yan Zhu and Longxian Chen have contributed equally to this work. *Correspondence: lixuan@cemps.ac.cn Key Laboratory of Synthetic Biology, Center for Excellence in Molecular Plant Sciences/Institute of Plant Physiology and Ecology, Chinese Academy of Sciences, Shanghai, China Full list of author information is available at the end of the article © The Author(s) 2022. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/. The Creative Commons Public Domain Dedication waiver (http:// creat iveco mmons. org/ publi cdoma in/ zero/1. 0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. Zhu et al. BMC Genomics (2022) 23:381 Page 2 of 15 Conclusion: The study revealed novel features and complexity of lncRNAs in plants through systematic analysis, providing important insights into the origination and evolution of plant lncRNAs. Keywords: lncRNA, lincRNA, lncNAT, Plant, Strand-specific Background lncRNAs, more research is needed to understand the Long noncoding RNAs (lncRNAs) were found to account roles of TEs in the origination of lncRNAs in plant. for a large part of the whole transcriptome in plants. LncRNAs had a fast evolutionary rate and a high Most of lncRNAs identified in plants were lincRNAs, degree of sequence diversity in animals [18–20]. Some which were transcribed from intergenic regions. Another ancient lncRNAs were highly conserved in tetrapod, and lncRNAs transcribed from intragenic regions were lnc- they were preserved in evolution and may have impor- NATs, which were antisense to coding genes. Both types tant functions [20]. Plant lncRNAs were highly divergent of lncRNAs have been identified in many plant species, at the nucleotide level [21]. LncRNAs showed highly among which a few were found to play important roles in divergency on sequences between rice and maize, and plant development and stress resistance [1–4]. Benefiting between Brassicaceae and Cleomaceae [22, 23]. Besides, from the high-throughput sequencing technology, bulks molecular features of plant lncRNAs did not follow the of RNA-sequencing data were deposited to public data- classic evolutionary patterns, and they showed little evi- bases. Databases of plant lncRNAs have been developed dence of phylogenetic relationships [24]. A comprehen- in the past several years, such as PLncDB, CANTATAdb, sive design is needed to systematically study the plant GreeNC, NONCODE, et al. [5–8]. More than one million lncRNA evolution and feature divergency. lncRNAs were recorded in these databases, which cov- The current study was designed to comprehensively ered more than 80 plant species. However, a systematic discover and analyze lncRNAs in multiple plant species analysis on plant lncRNA evolution has not been con- across the evolutionary landscape using strand-specific ducted due to many difficulties. The heterogeneous data and whole transcriptome sequencing data. We employed sources and protocols, like inconsistent plant tissues, eight plant species from the primitive to higher taxa in ways of sequencing library construction, pipelines for plant, which were representatives at key position of plant bioinformatics analysis, were cited as some of the factors evolution. Using RiboMinus and strand-specific RNA- [9–12]. Most of the previous studies sequenced RNAs sequencing techniques in combination with consistent with poly(A) tails by oligo(dT) selection, which would plant tissues, we have expanded the lncRNA landscape, exclude a large portion of lncRNAs without poly(A) [13, including long noncoding natural antisense RNAs (lnc- 14]. In addition, the regular RNA-sequencing technique NATs) and lncRNAs without poly(A). Our results showed does not distinguish transcripts’ direction, which would that both lincRNAs and lncNATs were widely distributed lack a large part of lncRNAs antisense to coding genes. in plants. TEs played important roles in the origination Therefore, to comprehensively analyze the lncRNAs in and transcription of lncRNAs. The molecular features divergent plant species and their evolution requires 1) of lncRNAs were found to be conserved in plants, but unified study design with consistent plant sources, and sequences were non-conserved even between evolution- 2) the whole transcriptome RNA-sequencing technology arily close species. Most highly and specially expressing with strand-specificity. lncRNAs participated in the co-expressional network To date, important question about how plants lncRNAs with coding genes, which may function as regulators in originated and evolved remains largely un-answered. the network. Several ways for the origin of lncRNAs have been docu- mented, among which transposable elements (TEs) are Results an important cause. TE was thought as an important Identification of plant lncRNAs with new lncNAT species factor to drive the transcription of lncRNAs [15, 16]. from strand‑specific RiboMinus transcriptome sequencing In humans, about 75% lncRNAs contained at least one data exon deriving from TEs [15]. In plants, several lncRNAs To systematically study the molecular features and evo- were found to originate from TEs [16, 17]. lncRNA-314 lutionary profiles of lncRNAs in plants, we selected eight was originated from a full-length LTR retrotransposon representative plant species across the phylogenetic land- inserting into downstream of a promoter in the ances- scape for analysis, including Chlamydomonas reinhardtii tral tomato genome [16]. In Arabidopsis, lincRNA11195 (C. reinhardtii), Selaginella moellendoriffi (S. moellen - contained a LTR retrotransposon and was activated dori ffi ), Zea mays (maize), Oryza sativa subsp. Japon- under abiotic stresses [17]. In addition to these two plant ica (rice), Arabidopsis lyrata (A. lyrata), Arabidopsis Zhu  et al. BMC Genomics (2022) 23:381 Page 3 of 15 thaliana (A. thaliana), Populus trichocarpa (poplar) and 0.5 for single-exon transcripts and 0.1 for multiple-exon Solanum lycopersicum (tomato) (Fig.  1). These species transcripts) to resolve the transcripts with low expres- can represent single-cell alga, ferns, monocotyledon and sional abundance. For species with annotations including dicotyledon, spanned an evolutionary history of 116 mil- lncRNAs, we added them in our result if those lncRNAs lion years. Whole-genome sequences and annotations of satisfied our criterion. these species were available from public databases (Addi- A total of 39,945 lncRNAs were obtained in all the eight tional file 1: Table 1). species, for which the transcriptional orientation of each Different from previous studies, we acquired only lncRNA was also confirmed. LncRNAs identified in this RiboMinus and strand-specific RNA-seq (ssRNA-seq) study were categorized into two types: lncRNA in inter- data for multiple plant tissues, either collected from pub- genic region (lincRNA: 25,350) and lncRNA antisense lic database or sequenced by ourselves when they are not to one or more exons of coding genes (lncNAT: 14,595) available (Additional file  2). This critical approach was (Fig.  1, Additional file  1: Table 2, Additional file  3). Both designed to include lncRNA species either with poly(A) of lincRNAs and lncNATs dispersed widely in all of the or without poly(A), which can also identify the tran- plant clades from 777, the least number in C. reinhardtii scriptional orientation of lncRNAs, and their location on to 8321, the most in maize. By comparing lncRNAs iden- sense- or antisense- strand of coding gene regions. A con- tified in this study with lncRNAs from the databases of sensus and efficient lncRNA identification workflow were PLncDB, CANTATAdb, GreeNC, and NONCODE, a constructed based on our previous study [25]. To reduce total of 22,313 new lncRNAs (12,838 lincRNAs and 9475 the variation emerging from the tissue difference among lncNATs) were found, and their distribution in each spe- species, we selected data of five tissues: root, stem, leaf, cies were listed in “Additional file 1: Table 2”. flower, and seed/fruit. For lower-level plant species with - Previous studies on plant lncRNAs often missed out differentiated tissues, such as C. reinhardtii and S. out lncNATs due to lack of orientation in their RNA moellendori ffi , we used all available data. Considering the sequencing data (thus could not made accurate distinc- batch effects coming from PCR artifacts, sequence depth tion on lncNATs from coding gene transcripts). Here and gene expressional abundance, etc. among different we searched lncNATs using strand specific RNA-seq data sources, we eliminated PCR artifacts and low-qual- data, and found that lncNATs dispersed widely in plant ity sequencing data by pre-processing and mapping high- (Fig. 1). LncNATs took a large part in plant lncRNAs, and quality reads to genome in our procedure. We required even more than lincRNA in several species, such as A. the mapped data size for each tissue to be above 10X thaliana and C. reinhardtii (Fig.  1). There are about 20% genomic coverage depth in order to retain transcripts coding genes with lncNATs transcribing in A. thaliana. with low-level expression (“Additional file  2” listed the Coding genes can be expressed with their counterpart data size of each sample for all tissues of the eight spe- lncNATs simultaneously (Additional file  4: Fig. S1A). cies). We normalized the expression of transcripts using Compared with coding genes without lncNATs, their FPKM, which reduced or eliminated the discrepancy expressional levels were not significantly affected by the from sequencing depth and transcript length. At last, presence of lncNATs (Additional file  4: Fig. S1B). The lack we removed those low-expressional transcripts (FPKM of correlation between the expression of coding genes Fig. 1 Phylogenetic relationship of plant species in this study and lncRNAs identified in each species. The phylogenetic tree was drawn using TimeTree (http:// www. timet ree. org/). Myr: million years Zhu et al. BMC Genomics (2022) 23:381 Page 4 of 15 and their paired lncNATs implied the lncNATs may not In current study, we revealed comprehensive char- be directly involved in the regulation of coding genes’ acteristics of lncRNAs in multiple species using unified expression. sequencing data and bioinformatic analysis procedure. We found both lncNATs and lincRNAs have more sim- The consistent molecular features of lncRNAs ple structures than mRNAs. LncRNAs were shorter than among divergent plant species mRNAs (median length of lncNAT: 964, lincRNA: 817, The characteristics of lncRNAs, such as transcript length, mRNA: 1422, both p-value < 2.2e-16, t-test) (Fig. 2A), and AT content, exon number, and so on, were explored and lncRNAs have fewer exons compared to mRNAs (Mean compared to coding genes in individual plant species of lncNAT: 1.4, lincRNA: 1.7, mRNA: 5.2, both p-value [11, 26, 27]. However, if these features were consistent in < 2.2e-16, t-test) (Fig.  2B). These two features were con - plants were not clear. In addition to the most important served in plant lncRNAs (Fig. 2A, B). LncRNAs existed in differences between lncRNA and mRNA that was cod - plant genomes mainly as single-exon transcripts. A total ing products, other features remained to be investigated of 65% lincRNAs and 82% lncNATs were single-exon, between lncRNA and mRNA in plants. while in mRNAs this number was 22%. This phenomenon Fig. 2 Novel conserved features of lncRNAs in plant. A, B Length distribution and Exon-number of lincRNA, lncNAT and mRNA in each species. C Splicing ratio of lincRNA, lncNAT and mRNA in each species. D GC content of lincRNA, lncNAT, CDS/5’UTR/3’UTR of coding genes and intergenic sequences in each species. E SNP frequency of lincRNA, lncNAT and CDS of coding genes in each species Zhu  et al. BMC Genomics (2022) 23:381 Page 5 of 15 was observed in all plant species of this study, and was in TE for lncRNAs, we first established a reference library consistent with results of other plant species in previous of TEs for each genome using RepeatMasker [34]. As a studies [27]. result, up to 59.7% lincRNAs can be found contain TEs Fewer multiple-exon transcripts meant few require- in their sequences (Fig.  3A). Besides, we found a large ments for splicing by lncRNA. In a previous study, proportion of TEs inserted into the genomic loci of lncR- researchers found lncRNAs were rarely spliced and NAs and coding genes (lincRNA: 3.9% ~ 59.7%, lncNAT: mainly non-polyadenylated [28]. The splicing efficiency in 2.4% ~ 32%, coding gene: 3.1% ~ 43.9%). But, when we lim- human and mouse showed that inefficient splicing might ited the inserting regions to exons of lncRNAs and CDS be a common feature of lncRNAs across species [29]. The of coding genes, the ratio of transcripts overlapping with ratio of splicing lncRNAs was significantly lower than TEs decreased sharply in the latter (1.8% ~ 13.1%), while those for mRNAs (Wilcoxon test, both p-value = 7.6e-06, in lncRNAs the ratio remained (lincRNA: 3.1% ~ 58.2%, Additional file  4: Fig. S2A) in this study. Splicing lincR- lncNAT: 0.5% ~ 31.8%) (Fig.  3A). Significantly, sequences NAs ranged from 13.7% (poplar) to 48.8% (S. moellen- of TEs were remained in lncNATs though they shared dori ffi ), compared to the ratio of mRNAs from 65.3% sequences with coding genes. Decrease of TEs in protein (rice) to 92.6% (C. reinhardtii) (Fig. 2C). Inefficient splic - coding sequences indicated that TEs inserting into CDS ing of lncRNAs was not just a feature in animals but is were eliminated in evolution since they may change gene also common across plant species. Next, we wanted to products, and resulted in serious consequences. In cod- investigate if the splicing sites in lncRNAs were same ing genes, TE insertion was concentrated in regions with with mRNAs. Further analysis of the base distribution limited effects on gene products, such as UTR, intron. upstream and downstream of splicing sites showed that On the contrary, lncRNAs were not influenced by inser - sequences of splice sites in lncRNAs were conserved tion of TEs in their exons for lack of coding products, across plant species. They have almost the same sequence so these TE sequences can be kept and accompanied context between lncRNAs and mRNAs, which suggested by lncRNAs in evolution. This suggested that lncRNAs that they probably used the same splicing mechanism may originate from TE insertion. The close relationship (Additional file 4: Fig. S2B). between lncRNA and TEs revealed the key roles of TEs Both of lincRNAs and lncNATs have conserved nucleo- played in the origin of lncRNA. tide constitution among multiple species. LncRNAs have TEs were classified into two types: Class I TEs or retro - lower GC content than coding sequences (CDS) of coding transposons, and Class II TEs or DNA transposons. They genes, while which was higher than intergenic sequences are both present in the plant genome. However, we found (Fig. 2D). This feature was also consistent across multiple that Class II TEs favored transcribed regions, includ- species in plant. High GC content was usually associated ing lncNAT and lincRNA, whereas Class I TEs favored with coding sequences [30]. Though lncNATs and coding regions of non-transcribed sequences (Fig. 3B). The ratio genes shared overlapping sequences, the GC content of between Classes I and II TEs deviating from that the lncNATs was still lower than CDS in most plant species entire genome indicated that the TEs, Classes I and II, (Fig.  2D). The bias of nucleotide in lncRNA meant the in lncRNAs and coding sequences were under different selective pressure on their sequences. At the same time, selection pressure. mutants accumulated in both lincRNAs and lncNATs TEs play important roles on the origination of lncR- were much higher than coding genes (Fig. 2E, Additional NAs. Benefitting from the strand-specific sequencing file  1: Table  3). Mutants accumulated in coding regions data, we can confirm the positions of promoters. We with the possibility to change amino acid and lead to the found about 38 ± 21% lincRNAs had promoters origi- change of protein products. While the absence of coding nating from TEs, which was higher than the percent- products for lncRNAs made them tolerate more mutants. age of coding genes (29 ± 20%) and lncNATs (27 ± 21%) Although lncNATs shared sequences with coding genes, (Fig. 3C). The ratio of lncRNAs with promoters originat - their sequences accumulated more mutants and fewer ing from TEs was positively related to whole genomic TE GC bases, implied they may experience different paths to content in each species (Pearson’s correlation, lincRNA: coding genes in evolution. 0.63, lncNAT: 0.83). This suggested that TEs randomly inserted into genome was one of the major ways to gen- New discovery of lncRNA origination by inserting erate lncRNAs. of transposable elements (TEs) Further, we found TEs initiated the transcription of TEs were widely existed in eukaryotic genomes and lncRNAs by supplying TF binding sequences for lncR- acted as key factors for gene and genome evolution [31, NAs (Fig.  3D). Combining with ChIP-seq data from 32]. Insertion of TEs was considered as one of many nine transcript factors (TFs) in A. thaliana, A. lyrata, ways for origin of lncRNA [15, 33]. To study the roles of rice and maize, we identified 250 lncRNAs, which Zhu et al. BMC Genomics (2022) 23:381 Page 6 of 15 Fig. 3 TEs regulate the transcription of lncRNAs. A Genomic loci of lncRNAs and coding genes overlapping with TEs (up). Exonic sequences of lncRNAs and coding genes overlapping with TEs (down). The black dashed lines represent the ratio of TE sequences in whole genome; the red solid lines represent the ratio of coding genes overlapping with TEs; and the black and grey bars represent the ratio of lincRNA and lncNAT overlapping with TEs, respectively. B The distribution of Class I and Class II TEs in genome, lncRNAs and coding genes. C The ratio of lncRNAs with TE promoters in each species. D TE-lincRNA Osat_00007032 in rice (subspecies: Japonica). The up black line represents genome of Japonica, while the down one represents genome of Indica. The grey block between Japonica and Indica was the syntenic area between the two genomes. The locus of Osat_00007032 was signed with the red and blue bar, while the syntenic area in Indica was not found with lncRNAs or other elements. The TE upstream Osat_0000732 belonged to MERMITEJ sub-family, and was a type of DNA/Gypsy TE contained TF binding sites originating from TEs, and for transcript factor MADS29 was confirmed (Fig.  3D). these lncRNAs accounted for 11.7% of lncRNAs with MADS29 belongs to MADS-box transcription factors, TF biding sites being confirmed (Additional file  1: which were critical regulators for rice reproductive devel- Table  4, Additional file  5). As an example, we found a opment [35, 36]. Surprisingly, we cannot detect lncRNAs lincRNA Osat_00007032, which was upstream of cod- or TEs in the syntenic region of genome for Indica, which ing gene OS01G0166800 in rice (subsp. Japonica), and was another subspecies of rice. This indicated that lin - expressed in anther, pistil, leaf, root, stem and seed, cRNA Osat_00007032 was specific for Japonica, and was but especially high in pistil (Additional file  4: Fig. S3). evolved recently by TE insertion. This novel discovery A TE belonging to MERMITEJ subfamily covered the that TE brought promoters containing TF binding sites upstream of Osat_00007032, and provided promoter to the upstream exonic locus, revealed the relationship for Osat_00007032. In the promoter, binding sequences between TE and lncRNA different from that previously Zhu  et al. BMC Genomics (2022) 23:381 Page 7 of 15 found in tomato lncRNA, for which a TE inserted into rice were homologous with those in maize (Fig.  4C). In the downstream of a promoter [16]. genus such as Arabidopsis, half of lincRNAs were found having homologs in other species within the same genus Most plant lincRNAs only conserved within genus (Fig.  4C). Hence, we found that most of the homologous indicating rapid turnover in evolution lincRNAs were genus- or species-specific in plants, sug - Sequence conservation provides insight into evolution- gesting that lincRNAs have a rapid turnover in evolution. ary process of functional elements in genome. However, most lncRNAs in plants were found with poor conserva- IPS was found to be the only plant lncRNA group tion, including the lncRNAs whose functions were well with conserved sequence motifs studied [37, 38]. In this study, we attempted to explore Some lincRNAs were found to be functional via short- the evolution of lncRNAs in plants by analyzing the con- conserved patches, though the whole sequences were servation of lncRNAs with two methods. First, we evalu- divergent [40, 41]. To know if conserved patches existed ated the sequence conservation using PhastCons score in plant lncRNAs, we developed a method using sliding [39]. PhastCons score was calculated based on the result windows to calculate the mean PhastCons score for each of seven-way whole genomic alignment for A. thaliana, patch. We found most lincRNAs have no apparently con- A. lyrata, soybean, rice, poplar, tomato and S. moellen- served patches in the eight plant species (Fig.  4D). We dori ffi . We compared the sequence conservation of lincR - then looked into the plant lncRNAs in lncrnadb [42], and NAs, lncNATs, CDS of coding genes, 5’UTR and 3’UTR found IPS was the only lncRNA with conserved patches. of coding genes, and used sequences from intergenic IPS lncRNA family participated in phosphorus (P ) regions as control. As expected, CDS exhibited the high- homeostasis in plants, including AtIPS1 and AtIPS2/At4 est sequence conservation, while the intergenic sequence in Arabidopsis, OsIPS1 and OsIPS2 in rice, and TPSI1 was the lowest (Fig.  4A). Considering that lncNATs and in tomato [43–45]. They were found to be accumulated coding genes shared some common sequences, they under Pi starved condition. IPS contained conserved showed similar sequence conservation (Fig.  4A, Addi- 24-nt nucleotides motif in several species [43, 45]. The tional file  4: Fig. S4A). The conservation of lincRNAs, motif contained two highly conserved regions which however, was lower than 5’UTR, 3’UTR and introns of were separated by three nucleotides, and played roles coding genes (Fig. 4A). The proportion of plant lincRNAs as the target mimicry for miR399 [46]. Strikingly, our with PhastCons scores higher than 0.6 was only 0.2%, and study found all species contained this motif except algae this number in coding genes was 14.4%. About 71% plant (Fig. 4E). The absence of IPS in algae may be explained by lincRNAs scored 0, indicating very low conservation their lack of vascular tissues, whereas IPS was expressed (Additional file  4: Fig. S4B). In placental mammals, old in vascular tissues when lacking of P . IPS first appeared (minimum age 90 Myr) and young lncRNAs (minimum in S. moellendori ffi , a representative of primitive vascular age 25 Myr) were defined basing on the phylogenetic dis - plant (Fig. 4E, F). But only one of the two IPS motifs was tribution of species, and they found the conservation of found in S. moellendori ffi , whereas two motifs were often old lncRNAs were close to CDS, while young lncRNAs found in high plants (Fig.  4E, F). Different from most were even lower than intergenic regions [20]. In this lncRNAs with short evolutionary histories, IPS under- study, we found plant lincRNAs showed similar conser- went a long evolutionary history with conserved function vation to intergenic sequences, and which suggested that in P homeostasis, suggesting plant lncRNAs may play most plant lincRNAs were young and with short evolu- important roles in the adaptation of terrestrial life during tionary age (Fig. 4A, Additional file 4: Fig. S4A). migration from aquatic to terrestrial. Further, we evaluated the conservation of plant lincR- NAs by identifying the homologous lncRNAs across the Specific expression of lncRNAs revealing sophisticated eight plant species. We mapped lincRNAs in one species transcription regulation to genomes of all other species. LncNATs were excluded Previously several lncRNA studies showed that most since the existence of overlapped regions shared with lncRNAs were expressed at low levels, and often coding genes. While strong conversation of coding genes expressed in specific tissues or conditions in plants [11, existed between evolutionarily remote species in the 26]. In this study, we attempted to profile the expression plant kingdom, the conservation of lincRNAs retained of lincRNAs and lncNATs in more species. Both lincRNA only in the evolutionarily close groups (Fig. 4B, C). Previ- and lncNAT showed lower expression levels and higher ous studies of lncRNAs showed that the lncRNA homol- tissue specificity compared to coding genes (Fig.  5A, ogy mainly existed within a family in both plants and B). LincRNA and lncNAT showed similar expressional animals [20, 21]. However, we found conserved lincR- levels (average: 35.2 vs 22.9), while the expressional NAs were rare in plant family, as only 5% lincRNAs from level of mRNA was significantly higher (average: 124.8, Zhu et al. BMC Genomics (2022) 23:381 Page 8 of 15 Fig. 4 The conservation of lncRNAs in plant. A The PhastCons scores of coding genes (including CDS, 5’UTR, 3’UTR, and Intron), lncRNAs (lincRNA and lncNAT ) and intergenic sequences. B, C Conserved lincRNAs and coding genes in each species. LincRNAs were aligned to genome of another species using blastn, while amino acid sequences of coding genes were aligned to another species’ amino acid sequences using blastp. The numbers of legend were the ratio of transcripts with homologous sequences in the genome of other species. The labels were abbreviations of each species. D The cumulative frequency of PhastCons scores in coding genes and lncRNAs. The lincRNA-patch meant a window (12 bp) sliding from the first base to the last one, and then the highest window score was selected to stand for the score of the lincRNA. E Highly conserved IPS motifs in all plant species in this study except algae. The marked sequences with orange and turquoise were regions combining to miRNA. F The phylogenetic relationships of terrestrial plant was constructed according to the nucleotide sequences of IPS using MEGA7 (https:// www. megas oftwa re. net/) Kolmogorov-Smirnov test, both p-value < 2.2e-16) score of 0 meant the transcript expressed in all tissues. (Fig.  5A). Similar trends were observed in all plant spe- Both lincRNA and lncNAT exhibited significantly higher cies (Additional file  4: Fig. S5A). Low expressional level JS score, i.e., higher tissue specificity, than mRNA (Kol - was a common feature for plant lncRNAs. mogorov-Smirnov test, both p-value < 2.2e-16) (Fig.  5B, Tissue specificity was evaluated using Jensen-Shan - Additional file 4: Fig. S5B). non (JS) score in all of the plant species, which returned The expressional patterns of lncRNAs evolved rap - a score between 0 and 1 [47]. JS score of 1 represented idly in plant, which formed sharp contrast to the cod- the transcript expressed uniquely in one tissue, while JS ing genes. We defined tissue-specific (TS) transcripts as Zhu  et al. BMC Genomics (2022) 23:381 Page 9 of 15 Fig. 5 The expressional patterns of lncRNAs in plant. A, B Expressional levels (FPKM) and tissue specificity (JS score) of lincRNA, lncNAT and mRNA. C Tissue distribution of TS (tissue-specific) transcripts in several plant species D Tissue distribution of all expressed transcripts in A. thaliana. The legend numbers were calculated using log(FPKM+ 1). E Transcript factor binding frequency of lncRNA and mRNA were significantly different among tissues (Fig.  5D). In transcript with JS score > = 0.9. We found TS lncRNAs A. thaliana, lncNATs were preferentially expressed in and TS mRNAs were distributed to tissues differently embryo, while lincRNAs were in both endosperm and in the same species (Fig.  5C). Also, among the differ - embryo. LncRNAs had a narrow expression profile ent plants, TS lncRNAs of tomato and maize were more compared to coding genes. likely to express in leaf and root, while TS lncRNAs LncRNAs showed similar TF binding rate compared were found mostly in the pistil of rice, and the seed of A. to that of coding genes, though they had low expres- lyrata. sional levels and unstable expression patterns. We used Previous studies in rice and maize showed that ChIP-seq data of nine transcript factors (TFs) from A. lncRNAs were more specially expressed in reproduc- thaliana, A. lyrata, rice and maize to predict the bind- tive tissues [11, 26]. By comparing the expression of ing sites in promoter (Additional file  5). The frequency all transcripts, we found mRNAs were more evenly of lncRNA promoters binding with TFs was compa- expressed in all tissues in A. thaliana, while lncRNAs rable to coding genes (Fig.  5E). TFs with low binding Zhu et al. BMC Genomics (2022) 23:381 Page 10 of 15 rate in coding genes also had low binding efficiency in this network, lncRNAs and coding genes were clustered lncRNA, e.g., LF YGR , NLP7, P1, a-myc, HASF1b. into 24 modules according to their expressional charac- teristics (Fig.  6A). In these modules, a total of 689 lncR- Plant lncRNAs forming co‑expression network with coding NAs (155 lincRNAs and 534 lncNATs) were found to be genes co-expressed with coding genes. Especially, lncNATs in Expression is often closely related to function, especially the network were not directly related to their antisense highly and specifically expressed lncRNAs [48, 49]. We coding genes, which referred they may not regulate the selected lncRNAs that were highly expressed in one or transcription of antisense genes. GO enrichment analy- two tissues in A. thaliana to construct WGCNA net- sis was performed for the coding genes in each module. work. A total of 178 lincRNAs, 555 lncNATs and 20,729 Functions of the coding genes were summarized as fol- coding genes were selected to construct the network. In lows: I) Basic metabolism: kinase activity, lipid metabolic Fig. 6 The WGCNA network of lncRNAs and coding genes, and function prediction of lncRNAs co-expressed with coding genes. A Correlation matrix between modules and tissues of A. thaliana. B Correlational network of the module 22. The red and blue circles represent lncNAT and lincRNA respectively, while the other circles represent coding genes. C Functional description of modules containing lncRNAs. The area of the pie was positively correlated to the number of transcripts. The numbers in the brackets correspond to the modules in (A) Zhu  et al. BMC Genomics (2022) 23:381 Page 11 of 15 process, and hydrolase activity; II) Response to stress and Although several lncNATs were reported to regulate response to abiotic stimulus; III) Reproduction (Fig. 6C). the expression of antisense coding genes [51, 52], we LncRNAs involved in the network may play roles in these found most lncNATs were not correlated to the expres- functions. sion of antisense coding genes (Additional file  4: Fig. The correlation between modules and samples S1B). This observation implied the function of most showed that the expressional patterns of lncRNAs were lncNATs may not be related to their complimentary highly correlated to specific tissues (Fig.  6A). In our coding genes. result, lncRNAs of A. thaliana were highly and spe- In plants, insertion of TEs is one of the important cially expressed in embryo and endosperm (Fig.  5D). mechanisms for lncRNA origin [16, 53]. We found many In WGCNA network, the embryo was strongly corre- plant lncRNAs from all the studied species contained lated with module 01, and the endosperm was strongly sequence elements of TEs. In A. thaliana, rice and maize, correlated to module 22 (Fig.  6A, B). In module 01, we the proportion of TE-related lincRNAs in total lincRNAs found a myriad of coding genes associating with embryo were reported to range from 22.9 to 51.5% [17], similar development. Functions of these genes can be catego- to our results (18.2% ~ 58.2%) on these species. TEs were rized as: I) Involving in embryo development directly; mainly inserted into the exons of plant lncRNAs, avoid- II) Overexpression, mutation or knocking down caused ing exons with coding genes (Fig.  3A). Particularly, we embryo death; III) Enrichment of protein products found an example that TE insertion caused initiating new during embryonic development. Co-expressional rela- transcription, which has not been reported in plants. tionship was found between these coding genes and This finding provided further evidence that TEs play key lncRNA, and lncRNAs in this network may play roles in roles in the origin of plant lncRNAs. Note several differ - embryo development. In module 22, two radial networks ent ways for the origin of lncRNAs were documented, were constituted of four lncNATs and seven lincRNAs including coding genes losing coding capacity, genomic with coding genes (Fig.  6B). In these coding genes, we sequences devoid of exons and insertion of TEs [54]. Our found four genes encoding transcript factors played research added new details for lncRNAs origination with roles in the endosperm development. As an example, novel complexity. gene AT1G02580 encoded transcript factor MEDEA/ Some lncRNAs were found to be conserved in animals MEA, a polycomb group gene that was imprinted in the with the species were distantly separated in evolution [20, endosperm [50]. All of the 11 lncRNAs in the network 55]. These ‘old’ lncRNAs that were preserved in animals, were found co-expressed with it, which indicating that were suggested to have important functions [20]. How- these lncRNAs may play roles in endosperm imprint. ever, plant lncRNAs were highly divergent, and almost all Coincidently, researchers found some lncRNAs express- plant lncRNAs were species/genus-specific. Differently ing in castor bean seeds were involved in genomic from animals, we did not find any plant lncRNAs that imprint, and several of them comprised the imprinted are conserved along the evolution. Our results agree with cluster with imprinted coding genes [27]. others’ studies that did not detect highly conserved plant lncRNAs [21, 22]. Conserved sequence patches within lncRNAs were Discussion previously reported in some lncRNAs [40]. IPS was a Benefiting from the advancement of high-throughput lncRNA participating in Pi regulation in several plant sequencing technology, we have discovered more lncR- species, and was induced expressing in the vascular tis- NAs in representative plant species that are associated sues of root and shoot [41, 45]. However, we have found with novel complexity. Most of the previous studies the highly conserved motifs of IPS crossed all tracheo- sequenced RNAs with poly(A) tails by oligo(dT) selec- phyte (Fig.  4E, F). The emergency of IPS in all terrestrial tion, which would exclude a large part of lncRNAs plants indicated that IPS lncRNAs may play important without poly(A) [14]. In addition, lncNATs was much roles in the adaptative evolution of plant migrating from more difficult to study with the non-strand specific aquatic environments to terrestrial ones. RNA-sequencing technology, whose distribution in To date, the functions of a large number of lncRNAs plant was largely not clear. Here we studied lncRNAs in plants remain unknown. Our finding that most highly using RiboMinus and strand-specific RNA-sequencing and specially expressed lncRNAs formed co-expression data. We expanded the plant lncRNA landscape, and network with coding genes suggested they may play roles found the lncNATs dispersed widely in plant. The num - in the plant development, stress response, etc. For exam- bers of lncNATs were even greater than lincRNAs in ple, they may participate in embryo development and several species, such as A. thaliana and C. reinhardtii. endosperm imprint. However, within the large amount Zhu et al. BMC Genomics (2022) 23:381 Page 12 of 15 of lncRNAs identified in plant species, only a small por - (Additional file  2). Data downloaded from public data- tion of them were found in potential functional networks. bases can be found by the accession numbers record- The large numbers of novel plant lncRNAs found in our ing in “Additional file  2”. Samples been sequenced in study open an avenue to systematically characterize and this study were dealt with following processes: materials explore the functions and evolution of lncRNAs in plant. were taken from at least two plant individuals, and then been mixed for RNA isolation. Total RNA was extracted, Conclusions and rRNA was removed from the purified RNA using We performed RiboMinus combining with strand-spe- Ribo-Zero rRNA Removal Kit. Strand-specific RNA-seq cific RNA-sequencing technique on eight evolutionarily libraries were constructed with TruSeq Stranded mRNA representative plant species. We identified 39,945 lncR - LT Sample Prep Kit. Libraries were applied to Illumina NAs, which including 14,595 lncNATs, and expounded Hiseq2500 platform. the distribution of lncNATs in plants. Further, we found TEs played key roles in the origination of plant lncRNAs. Bioinformatics pipeline for identification of lncRNAs Combing with ChIP-seq data, we found a new way for Sequencing reads were filtered according to the quality lncRNA originating from TE insertion. In addition, we score using sickle (version 1.210), and the quality thresh- found plant lncRNAs were not conserved in evolutionar- old was set as 20. Reads shorter than 50 bp or with ambig- ily distant species, and most of them were species/genus- uous nucleotides were removed after quality checking. specific. However, conserved motifs of IPS were found Genomes and gene sets of the eight plant species were in lncRNA of all terrestrial plants, and IPS may played downloaded from public databases, and the versions important roles in the adaptive evolution of plant. The of them were shown in “Additional file  1: Table  1”. High study revealed novel features and complexity of lncRNAs quality reads were aligned to the genomes of each spe- in plants through systematic analysis, providing impor- cies using TopHat2 (version 2.1.0) [56]. Cufflinks (version tant insights into the origination and evolution of plant 2.2.1) was used to construct transcripts of each sample lncRNAs. according to the mapping results [57]. The expressional levels of transcripts were evaluated by FPKM using Cuf- flinks software suite. JS score was employed to represent Methods the tissue specificity, and which was calculated using R Plant materials library cummeRbund (version 2.7.2). We defined a tran - The seeds of A. thaliana and A. lyrata were buy from script as lncRNA if it satisfied the following criteria: I) it ABRC (https:// abrc. osu. edu/ users/ sign_ in? redir ect_ must be longer than 200 bp; II) it was noncoding by the to=% 2Ford ers% 2Fnew), and then cultivated in phyto- result of CPC and PLEK [58, 59]; III) it had no homologs tron respectively. Leaves, stems, roots and flowers of in databases of Swiss-Prot [60], Pfam [61] or Rfam [62]; A. thaliana and A. lyrata were collected from mature IV) the FPKM of it was more than 0.5 for single-exon plants. Siliques of A. thaliana were collected one week transcript or more than 0.1 for multiple-exon transcript. after flowering, and seeds were collected when siliques were full but epidermis were still green. The seedlings of Characterization of lncRNAs and coding genes poplar were provided by Cuiting Wang, CAS Center for SNP calling Excellence in Molecular Plant Sciences. Leaves, stems, All clean reads were mapped to genome using bwa (ver- roots and shoot tips of poplar were collected from seed- sion 0.7.17-r1188) [63]. Samtools (version 1.9) and lings with height of ~ 1 m. The mature plants of rice bcftools (version 1.9) were used to call SNPs [64]. All (Japonica) were obtained from the lab of Zuhua He, CAS SNPs from various tissues or libraries were merged, then Center for Excellence in Molecular Plant Sciences (http:// SNPs with quality more than 10 and depth more than 5 sippe. ac. cn/ zuhua he/). Leaves and stems of rice were col- were retained. At last, high-quality SNPs in the region lected before heading period. Samples of S. moellendori ffi of lncRNAs and coding genes were retained using local were described in our previous study [25]. scripts. Transcript factor binding site identification RNA sequencing and data acquisition ChIP-seq data from A. thaliana, A. lyrata, rice and maize Our dataset contained more than 700G read pairs from were downloaded from SRA database, and their acces- eight plant species, of which 490G were previously pub- sion numbers were listed in “Additional file  5”. After qual- lished data and 210G were generated by RNA sequencing ity filtering using sickle (version 1.210), clean reads were Zhu  et al. BMC Genomics (2022) 23:381 Page 13 of 15 mapped to genome using bowtie2 (version 2.2.6) [65], Supplementary Information and the parameter was set as -N 1. Peak calling was sub- The online version contains supplementary material available at https:// doi. org/ 10. 1186/ s12864- 022- 08602-9. jected to MACS (version 2.1.2) with module “callpeak” (version 2.1.2), and the parameters were set as --keep- Additional file 1. dup 1 and -q 0.01 [66]. Additional file 2. RNA-seq datasets. Additional file 3. lncRNA-gff3. TE annotation Additional file 4: Figure S1. (A) Expressional levels of lncNATs and their Whole genomic repeats and TEs of the eight plant spe- antisense coding genes. The expressional levels were evaluated by log cies in this study were identified using RepeatMasker (FPKM+1). (B) Expressional levels of lncNAT, antisense coding genes, and coding genes without lncNATs expressing at their antisense strand. 3.3.0 with parameters setting as -xsmall, and the Rep- Figure S2. (A) Ratio of splicing and alternative splicing transcripts. (B) Base20.02 were used as reference library. TE-lncRNAs Base distribution of splicing acceptor and donor for lncRNAs and mRNAs. were defined as lncRNAs with exon sequence overlap - Figure S3. Expressional levels of lncRNA Osat_00007032 in rice. Figure S4. The distribution of PhastCons score in lncRNAs and coding genes. (A) ping at least 5 bp to TE sequences. The distribution of PhastCons score in lncRNAs, intergenic sequences and coding sequences of coding genes. (B) Percentage of lncRNAs and mRNAs with different PhastCons scores in total lncRNAs and mRNAs, respectively. PhastCons score Figure S5. Expressional patterns of lncRNAs and mRNAs in divergent Lastz (version 1.02.00) was used for inter-species align- species. (A) Expressional levels of lncRNAs and mRNAs in each species, and ments, then TBA (version v12) was used for final integra - the numbers of y-axes were calculated by log (FPKM). (B) Tissue-specific- ity of lncRNAs and mRNAs in each species with multiple tissues, and the tion to obtain genome-wide alignment files for all eight y-axes stands for JS score. species [67]. PhastCons score (PCs) were calculated using Additional file 5. ChIP-seq datasets. phast (version 1.3) according to the results of genome- wide alignment [39]. Conserved patches in lncRNAs Acknowledgements were defined as short patches (12 bp) with PCs more than The authors thank Cuiting Wang for supplying the materials of poplar. 0.6, and meanwhile PCs of the whole lncRNAs were less than 0.3. About this supplement This article has been published as part of BMC Genomics Volume 23 Supple- ment 4, 2022: Selected articles from the International Conference on Intel- WGCNA network construction ligent Biology and Medicine (ICIBM 2021): genomics. The full contents of the supplement are available online at https:// bmcge nomics. biome dcent ral. com/ The WGCNA co-expressional network was constructed artic les/ suppl ements/ volume- 23- suppl ement-4. with R package WGCNA (version 1.68) using the FPKMs from all the 12 tissues of selected genes in A. thaliana Authors’ contributions X L conceived, designed the study, and modified the manuscript. Y Z collected [68]. Genes complying with the following criteria were data, performed analysis and drafted the manuscript. LX C collected samples selected to construct the network: 1) lncNATs express- and performed experiments. XN H took part in some experiments. H S drew ing in one or two tissues, and their expression levels were several figures. All authors have read and approved the final manuscript. among the top 10%; 2) lincRNAs expressing in one or two Funding tissues, and with which the maximum FPKM was larger This work was in part supported by grand from the National Key Research than 1; 3) mRNAs with FPKM larger than 1. The param - and Development Program of China (2019YFA0904601) and the National Natural Science Foundation of China (31701137, 31972881, 31900470). The eters of the key function “blockwiseModules” were set as: plant materials collecting, cultivating, and the RNA-seq experiments were power = 16, mergeCutHeight = 0.25, and reassignThresh - supported by the National Natural Science Foundation of China (31701137, old = 0. The network of each module was showed by 31972881, 31900470). The data Publication costs are funded by the National Key Research and Development Program of China (2019YFA0904601). Cytoscape (Version 3.6.1) [69]. GO enrichment analysis was subjected by BiNGO (version 3.0.3), and the param- Availability of data and materials eters were set as: the multiple testing correlation - “FDR The raw sequence data reported in this paper have been deposited in the Genome Sequence Archive [71] in National Genomics Data Center [72], Beijing correlation”, ontology file - “GOSlim_Plants”, and the Institute of Genomics (China National Center for Bioinformation), Chinese organism annotation - “Arabidopsis thaliana” [70]. Academy of Sciences, under accession number CRA003591 that are publicly accessible at https:// bigd. big. ac. cn/ gsa. The other RNA-seq raw data used in this study were downloaded from NCBI public database, and the accession Abbreviations numbers were listed in ‘Additional file 2’. lncRNA: long noncoding RNA; lincRNA: long intergenic noncoding RNA; lncNAT: long noncoding natural antisense RNA; NAT: Natural antisense Declarations transcript; ncRNA: noncoding RNA; AS: Alternative splicing; TE: Transposable element; CDS: Coding sequence; TF: Transcript factor; PCs: PhastCons score; Ethics approval and consent to participate Pi: Phosphorus; FPKM: Fragment per kilobase per million mapped fragments; Not applicable. JS score: Jensen-Shannon score; TS: Tissue-special; WGCNA: Weighted cor- relation network analysis; GO: Gene ontology; Atha: A. thaliana; Alyr: A. lyrata; Consent for publication Ptri: P. trichocarpa; Slyc: S. lycopersicum; Zmay: Z. mays; Osat: O. sativa; Smoe: S. Not applicable. moellendorffii; Crei: C. reinhardtii. Zhu et al. BMC Genomics (2022) 23:381 Page 14 of 15 Competing interests 19. Lopez-Ezquerra A, Harrison MC, Bornberg-Bauer E. Comparative analysis The authors declare that they have no competing interests. of lincRNA in insect species. BMC Evol Biol. 2017;17(1):155. 20. Necsulea A, Soumillon M, Warnefors M, Liechti A, Daish T, Zeller U, et al. Author details The evolution of lncRNA repertoires and expression patterns in tetrapods. Key Laboratory of Synthetic Biology, Center for Excellence in Molecular Plant Nature. 2014;505(7485):635–40. Sciences/Institute of Plant Physiology and Ecology, Chinese Academy of Sci- 21. Deng P, Liu S, Nie X, Weining S, Wu L. Conservation analysis of long non- ences, Shanghai, China. University of Chinese Academy of Sciences, Beijing, coding RNAs in plants. Sci China Life Sci. 2018;61(2):190–8. China. Henan University, Kaifeng, China. 22. Mohammadin S, Edger PP, Pires JC, Schranz ME. Positionally-conserved but sequence-diverged: identification of long non-coding RNAs in the Received: 30 April 2022 Accepted: 3 May 2022 Brassicaceae and Cleomaceae. BMC Plant Biol. 2015;15:217. 23. Wang H, Niu QW, Wu HW, Liu J, Ye J, Yu N, et al. Analysis of non-coding transcriptome in rice and maize uncovers roles of conserved lncRNAs associated with agriculture traits. Plant J. 2015;84(2):404–16. 24. Simopoulos CMA, Weretilnyk EA, Golding GB. Molecular Traits of Long Non-protein Coding RNAs from Diverse Plant Species Show Little Evi- References dence of Phylogenetic Relationships. G3 (Bethesda). 2019;9(8):2511–20. 1. Liu X, Hao L, Li D, Zhu L, Hu S. Long non-coding RNAs and their biological 25. Zhu Y, Chen L, Zhang C, Hao P, Jing X, Li X. Global transcriptome roles in plants. Genomics Proteomics Bioinformatics. 2015;13(3):137–47. analysis reveals extensive gene remodeling, alternative splicing and 2. Seo JS, Sun HX, Park BS, Huang CH, Yeh SD, Jung C, et al. ELF18-INDUCED differential transcription profiles in non-seed vascular plant Selaginella LONG-NONCODING RNA associates with mediator to enhance expres- moellendorffii. BMC Genomics. 2017;18(Suppl 1):1042. sion of innate immune response genes in Arabidopsis. Plant Cell. 26. Zhang YC, Liao JY, Li ZY, Yu Y, Zhang JP, Li QF, et al. Genome-wide 2017;29(5):1024–38. screening and functional analysis identify a large number of long 3. Zhao X, Li J, Lian B, Gu H, Li Y, Qi Y. Global identification of Arabidopsis noncoding RNAs involved in the sexual reproduction of rice. Genome lncRNAs reveals the regulation of MAF4 by a natural antisense RNA. Nat Biol. 2014;15(12):512. Commun. 2018;9(1):5056. 27. Xu W, Yang T, Wang B, Han B, Zhou H, Wang Y, et al. Differential expres- 4. Sun Y, Hao P, Lv X, Tian J, Wang Y, Zhang X, et al. A long non-coding apple sion networks and inheritance patterns of long non-coding RNAs in RNA, MSTRG.85814.11, acts as a transcriptional enhancer of SAUR32 and castor bean seeds. Plant J. 2018;95(2):324–40. contributes to the Fe-deficiency response. Plant J. 2020;103(1):53–7. 28. Schlackow M, Nojima T, Gomes T, Dhir A, Carmo-Fonseca M, Proudfoot 5. Paytuvi Gallart A, Hermoso Pulido A, Anzar Martinez de Lagran I, San- NJ. Distinctive Patterns of Transcription and RNA Processing for Human severino W, Aiese Cigliano R. GREENC: a Wiki-based database of plant lincRNAs. Mol Cell. 2017;65(1):25–38. lncRNAs. Nucleic Acids Res. 2016;44(D1):D1161–6. 29. Mele M, Mattioli K, Mallard W, Shechner DM, Gerhardinger C, Rinn JL. 6. Fang S, Zhang L, Guo J, Niu Y, Wu Y, Li H, et al. NONCODEV5: a compre- Chromatin environment, transcriptional regulation, and splicing distin- hensive annotation database for long non-coding RNAs. Nucleic Acids guish lincRNAs and mRNAs. Genome Res. 2017;27(1):27–37. Res. 2018;46(D1):D308–14. 30. Pozzoli U, Menozzi G, Fumagalli M, Cereda M, Comi GP, Cagliani R, et al. 7. Jin J, Lu P, Xu Y, Li Z, Yu S, Liu J, et al. PLncDB V2.0: a comprehensive Both selective and neutral processes drive GC content evolution in the encyclopedia of plant long noncoding RNAs. Nucleic Acids Res. human genome. BMC Evol Biol. 2008;8:99. 2021;49(D1):D1489–95. 31. Bennetzen JL, Wang H. The contributions of transposable elements 8. Szczesniak MW, Rosikiewicz W, Makalowska I. CANTATAdb: A Collection of to the structure, function, and evolution of plant genomes. Annu Rev Plant Long Non-Coding RNAs. Plant Cell Physiol. 2016;57(1):e8. Plant Biol. 2014;65:505–30. 9. Di C, Yuan J, Wu Y, Li J, Lin H, Hu L, et al. Characterization of stress- 32. Slotkin RK, Martienssen R. Transposable elements and the epigenetic responsive lncRNAs in Arabidopsis thaliana by integrating expression, regulation of the genome. Nat Rev Genet. 2007;8(4):272–85. epigenetic and structural features. Plant J. 2014;80(5):848–61. 33. Cho J. Transposon-Derived Non-coding RNAs and Their Function in 10. Shumayla S, Taneja M, Tyagi S, Singh K, Upadhyay SK. Survey of High Plants. Front Plant Sci. 2018;9:600. Throughput RNA-Seq Data Reveals Potential Roles for lncRNAs during 34. Smit AF, Hubley R, Green P. Green: RepeatMasker Open-4.0. 2013-2015 Development and Stress Response in Bread Wheat. Front Plant Sci. <http:// www. repea tmask er. org>. 2017;8:1019. 35. Yin LL, Xue HW. The MADS29 transcription factor regulates the deg- 11. Li L, Eichten SR, Shimizu R, Petsch K, Yeh CT, Wu W, et al. Genome-wide radation of the nucellus and the nucellar projection during rice seed discovery and characterization of maize long non-coding RNAs. Genome development. Plant Cell. 2012;24(3):1049–65. Biol. 2014;15(2):R40. 36. Nayar S, Sharma R, Tyagi AK, Kapoor S. Functional delineation of rice 12. Golicz AA, Singh MB, Bhalla PL. The Long Intergenic Noncoding MADS29 reveals its role in embryo and endosperm development by RNA (LincRNA) Landscape of the Soybean Genome. Plant Physiol. affecting hormone homeostasis. J Exp Bot. 2013;64(14):4239–53. 2018;176(3):2133–47. 37. Heo JB, Sung S. Vernalization-mediated epigenetic silencing by a long 13. Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A, et al. intronic noncoding RNA. Science. 2011;331(6013):76–9. Landscape of transcription in human cells. Nature. 2012;489(7414):101–8. 38. Ding J, Lu Q, Ouyang Y, Mao H, Zhang P, Yao J, et al. A long noncoding 14. Yang L, Duff MO, Graveley BR, Carmichael GG, Chen LL. Genom- RNA regulates photoperiod-sensitive male sterility, an essential com- ewide characterization of non-polyadenylated RNAs. Genome Biol. ponent of hybrid rice. Proc Natl Acad Sci U S A. 2012;109(7):2654–9. 2011;12(2):R16. 39. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, 15. Kapusta A, Kronenberg Z, Lynch VJ, Zhuo X, Ramsay L, Bourque G, et al. et al. Evolutionarily conserved elements in vertebrate, insect, worm, Transposable elements are major contributors to the origin, diversifica- and yeast genomes. Genome Res. 2005;15(8):1034–50. tion, and regulation of vertebrate long noncoding RNAs. PLoS Genet. 40. Lin N, Chang KY, Li Z, Gates K, Rana ZA, Dang J, et al. An evolutionar- 2013;9(4):e1003470. ily conserved long noncoding RNA TUNA controls pluripotency and 16. Wang X, Ai G, Zhang C, Cui L, Wang J, Li H, et al. Expression and diver- neural lineage commitment. Mol Cell. 2014;53(6):1005–19. sification analysis reveals transposable elements play important roles 41. Huang CY, Shirley N, Genc Y, Shi B, Langridge P. Phosphate utiliza- in the origin of Lycopersicon-specific lncRNAs in tomato. New Phytol. tion efficiency correlates with expression of low-affinity phosphate 2016;209(4):1442–55. transporters and noncoding RNA, IPS1, in barley. Plant Physiol. 17. Wang D, Qu Z, Yang L, Zhang Q, Liu ZH, Do T, et al. Transposable elements 2011;156(3):1217–29. ( TEs) contribute to stress-related long intergenic noncoding RNAs in 42. Quek XC, Thomson DW, Maag JLV, Bartonicek N, Signal B, Clark MB, plants. Plant J. 2017;90(1):133–46. et al. lncRNAdb v2.0: expanding the reference database for func- 18. Hezroni H, Koppstein D, Schwartz MG, Avrutin A, Bartel DP, Ulitsky I. Princi- tional long noncoding RNAs. Nuncleic Acids Res. 2015;43(Database ples of long noncoding RNA evolution derived from direct comparison of issue):D168–73. transcriptomes in 17 species. Cell Rep. 2015;11(7):1110–22. Zhu  et al. BMC Genomics (2022) 23:381 Page 15 of 15 43. Hou XL, Wu P, Jiao FC, Jia QJ, Chen HM, Yu J, et al. Regulation of the 67. Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, et al. expression of OsIPS1 and OsIPS2 in rice via systemic and local Pi signalling Aligning multiple genomic sequences with the threaded blockset aligner. and hormones. Plant Cell Environ. 2005;28(3):353–64. Genome Res. 2004;14(4):708–15. 44. Liu C, Muchhal US. Differential expression of TPS11, a phosphate 68. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation starvation-induced gene in tomato. Plant Mol Biol. 1997;33:867–74. network analysis. BMC Bioinformatics. 2008;9:559. 45. Shin H, Shin HS, Chen R, Harrison MJ. Loss of At4 function impacts phos- 69. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. phate distribution between the roots and the shoots during phosphate Cytoscape: a software environment for integrated models of biomolecu- starvation. Plant J. 2006;45(5):712–26. lar interaction networks. Genome Res. 2003;13(11):2498–504. 46. Franco-Zorrilla JM, Valli A, Todesco M, Mateos I, Puga MI, Rubio-Somoza 70. Maere S, Heymans K, Kuiper M. BiNGO: a Cytoscape plugin to assess I, et al. Target mimicry provides a new mechanism for regulation of micro- overrepresentation of gene ontology categories in biological networks. RNA activity. Nat Genet. 2007;39(8):1033–7. Bioinformatics. 2005;21(16):3448–9. 47. Cabili MN, Trapnell C, Goff L, Koziol M, Tazon-Vega B, Regev A, 71. Wang Y, Song F, Zhu J, Zhang S, Yang Y, Chen T, et al. GSA: Genome et al. Integrative annotation of human large intergenic noncoding Sequence Archive. Genomics Proteomics Bioinformatics. 2017;15(1):14–8. RNAs reveals global properties and specific subclasses. Genes Dev. 72. National Genomics Data Center Members and Partners. Database 2011;25(18):1915–27. Resources of the National Genomics Data Center in 2020. Nucleic Acids 48. Cui J, Luan Y, Jiang N, Bao H, Meng J. Comparative transcriptome analysis Res. 2020;48(D1):D24–33. between resistant and susceptible tomato allows the identification of lncRNA16397 conferring resistance to Phytophthora infestans by co- Publisher’s Note expressing glutaredoxin. Plant J. 2017;89(3):577–89. Springer Nature remains neutral with regard to jurisdictional claims in pub- 49. Zhang G, Duan A, Zhang J, He C. Genome-wide analysis of long non-cod- lished maps and institutional affiliations. ing RNAs at the mature stage of sea buckthorn (Hippophae rhamnoides Linn) fruit. Gene. 2017;596:130–6. 50. Song Q, Ando A, Jiang N, Ikeda Y, Chen ZJ. Single-cell RNA-seq analysis reveals ploidy-dependent and cell-specific transcriptome changes in Arabidopsis female gametophytes. Genome Biol. 2020;21(1):178. 51. Yuan JH, Liu XN, Wang TT, Pan W, Tao QF, Zhou WP, et al. The MBNL3 splicing factor promotes hepatocellular carcinoma by increasing PXN expression through the alternative splicing of lncRNA-PXN-AS1. Nat Cell Biol. 2017;19(7):820–32. 52. Yue H, Zhu J, Xie S, Li F, Xu Q. MDC1-AS, an antisense long noncod- ing RNA, regulates cell proliferation of glioma. Biomed Pharmacother. 2016;81:203–9. 53. Cho J, Paszkowski J. Regulation of rice root development by a retrotrans- poson acting as a microRNA sponge. Elife. 2017;6:e30038. 54. Ponting CP, Oliver PL, Reik W. Evolution and functions of long noncoding RNAs. Cell. 2009;136(4):629–41. 55. Washietl S, Kellis M, Garber M. Evolutionary dynamics and tissue specific- ity of human long noncoding RNAs in six mammals. Genome Res. 2014;24(4):616–28. 56. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, dele- tions and gene fusions. Genome Biol. 2013;14(4):R36. 57. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotech- nol. 2010;28(5):511–5. 58. Kong L, Zhang Y, Ye ZQ, Liu XQ, Zhao SQ, Wei L, et al. CPC: assess the protein-coding potential of transcripts using sequence features and sup- port vector machine. Nucleic Acids Res. 2007;35(suppl_2):W345–9. 59. Li A, Zhang J, Zhou Z. PLEK a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme. BMC Bioin- formatics. 2014;15:311. 60. UniProt C. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 2019;47(D1):D506–15. 61. El-Gebali S, Mistry J, Bateman A, Eddy SR, Luciani A, Potter SC, et al. The Pfam protein families database in 2019. Nucleic Acids Res. 2019;47(D1):D427–32. Re Read ady y to to submit y submit your our re researc search h ? Choose BMC and benefit fr ? Choose BMC and benefit from om: : 62. Kalvari I, Argasinska J, Quinones-Olvera N, Nawrocki EP, Rivas E, Eddy SR, et al. Rfam 13.0: shifting to a genome-centric resource for non-coding fast, convenient online submission RNA families. Nucleic Acids Res. 2018;46(D1):D335–42. thorough peer review by experienced researchers in your field 63. Li H, Durbin R. Fast and accurate short read alignment with Burrows- Wheeler transform. Bioinformatics. 2009;25(14):1754–60. rapid publication on acceptance 64. Li H. A statistical framework for SNP calling, mutation discovery, associa- support for research data, including large and complex data types tion mapping and population genetical parameter estimation from • gold Open Access which fosters wider collaboration and increased citations sequencing data. Bioinformatics. 2011;27(21):2987–93. 65. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat maximum visibility for your research: over 100M website views per year Methods. 2012;9(4):357–9. 66. Feng J, Liu T, Qin B, Zhang Y, Liu XS. Identifying ChIP-seq enrichment At BMC, research is always in progress. using MACS. Nat Protoc. 2012;7(9):1728–40. Learn more biomedcentral.com/submissions

Journal

BMC GenomicsSpringer Journals

Published: May 19, 2022

Keywords: lncRNA; lincRNA; lncNAT; Plant; Strand-specific

There are no references for this article.