Alternative splicing (AS) and fusion transcripts produce a vast expansion of transcriptomes and proteomes diversity. However, the reliability of these events and the extend of epigenetic mech- anisms have not been adequately addressed due to its limitation of uncertainties about the complete structure of mRNA. Here we combined single-molecule real-time sequencing, Illumina RNA-seq and DNA methylation data to characterize the landscapes of DNA methylation on AS, fusion isoforms formation and lncRNA feature and further to unveil the transcriptome complex- ity of pig. Our analysis identified an unprecedented scale of high-quality full-length isoforms with over 28,127 novel isoforms from 26,881 novel genes. More than 92,000 novel AS events were detected and intron retention predominated in AS model, followed by exon skipping. Interestingly, we found that DNA methylation played an important role in generating various AS isoforms by regulating splicing sites, promoter regions and first exons. Furthermore, we identified a large of fusion transcripts and novel lncRNAs, and found that DNA methylation of the promoter and gene body could regulate lncRNA expression. Our results significantly im- proved existed gene models of pig and unveiled that pig AS and epigenetic modify were more complex than previously thought. Key words: single-molecule sequencing, full-length, novel gene, alternative splicing, methylation 1. Introduction Thus, reference assembly and gene annotations require reﬁnement. Domestic pig (Sus scrofa) is an agriculturally important species and Obtained through short-read sequencing, the sequence data of an attractive biomedical model because of its anatomical, physiologi- several species have been accumulated in recent years. But the knowl- 1,2 cal, pathological and genomic similarities to humans. However, edge on full-length (FL) sequences of mRNAs remains scarce. increasing number of studies have shown that reference genomes Furthermore, in some cases, low-quality transcripts derived from 3,4 5 are often incomplete and has annotation and structural defects. short-read sequencing can result in incorrect annotations. V C The Author(s) 2018. Published by Oxford University Press on behalf of Kazusa DNA Research Institute. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. 1 Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy014/5020793 by Ed 'DeepDyve' Gillespie user on 08 June 2018 2 Y. Li et al. FL transcripts can signiﬁcantly increase the accuracy of genome an- and tongue] from adult sow, and 17 tissues [heart, liver, spleen, lung, notation and transcriptome characterization. Several expressed se- stomach, duodenum, inguinal lymph nodes, precaval vein blood, 6,7 quence tags (ESTs) from FL studies have been performed in pigs, uterus, thymus, skin (dorsum), subcutaneous fat, longissimus muscle, which improved genome annotation and was beneﬁcial to down- psoas muscle, soleus muscle, EDL and tongue] from one-day-old sow stream analysis such as expression quantiﬁcation and alternative and one organ (26-day-old embryo). For each tissue, total RNA was splicing (AS) identiﬁcation. extracted using TRIzol reagent (REF15596026, Invitrogen) and AS increases the variability of the cells and tissues proteome processed following the manufacturer’s protocol. according to different splice modes in a single animal, thereby chang- ing the composition of transcribed genes without massively increas- 8 9 ing the number of genes. Since AS was discovered in 1977, a large 2.2. PacBio library construction and sequencing number of AS events were identiﬁed in the reference genomes of hu- Equimolar rations of 38 samples were pooled together. Total RNA 10,11 TM man and other animals. In humans, 20% of multi-exon genes (1 lg) was reverse-transcribed into cDNA using the SMARTer are tissue speciﬁc and 95% of multi-exon genes are alternatively PCR cDNA synthesis kit (Takara Biotechnology, Dalian, China) and spliced. In porcine, 30% of the genes undergo AS, and 31% of optimized to prepare high-quality and FL cDNAs. Subsequently, size the identiﬁed splice events appear to be species speciﬁc. AS is an im- fractionation (0.6–1, 1–2 and>2 kb) was conducted using the TM portant regulatory mechanism involved in gene expression and pro- BluePippin Size-Selection System (Sage Science, Beverly, MA). teome diversity in individual and related to many diseases such as Another ampliﬁcation was performed using 12–14 PCR cycles. cancer and chemoresistance. Depending on a speciﬁc AS switch, Large-scale PCR products were puriﬁed with AMPure PB magnetic enhancing the speciﬁc exon inclusion has potential as a clinically beads. Each SMRTbell library was constructed using selected cDNA compatible therapeutic target. Accordingly, a more comprehensive (500 ng) with the Paciﬁc Biosciences DNA Template Prep Kit 2.0. and accurate identiﬁcation of AS event will facilitate further cogni- The SMRTbell templates were bound to polymerases using the tion of the AS regulatory mechanism, scholars in medicine, genetics, DNA/Polymerase Binding Kit P6 and v2 primers. The polymerase- bioinformatics and other ﬁelds. bound template was bound to zero-mode waveguide using Diverse classes of epigenetic regulation, ranging from ncRNA to Magbeadbingding kit (part 100-133-600). A total of 20 SMRT cells, methylation, have emerged as key regulators of gene expression, ge- composed of three SMRTbell libraries (0.6–1 kb: 7 cells; 1–2 kb: 7 nome stability and defence against foreign genetic elements. cells;>2kb: 6 cells), were prepared on the Paciﬁc Bioscience RS II Epigenetics can mediate disease aetiology through isoform variations platform by Frasergen Inc. (Wuhan, China) using C4 reagents with of cytosine modiﬁcation-speciﬁc transcript which are attributable to 17 240 min movies. AS. Recently, it is revealed that CHG methylation could repress AS while CG methylation promoted AS in plant. However, the com- prehensive relationship between DNA methylation and AS or 2.3 Illumina RNA-seq library construction lncRNA remains unclear in animals. Because uncertainties about the In parallel, eight tissues (subcutaneous back fat, soleus muscle, EDL complete structure of mRNA transcripts limited the identiﬁcation of and endometria from adult and one-day-old sows) from the 38 tissues splice sites and lncRNA. were sequenced respectively using PE125 sequencing on the Illumina Single-molecule real-time (SMRT) sequencing carried out in HiSeq 2500 platform to quantify gene/isoform expression. HiSeq li- Paciﬁc Bioscience RS (PacBio, http://www.paciﬁcbiosciences.com/) brary was constructed using NEB kit. Brieﬂy, poly(A)þ RNA transcript provides a third-generation sequencing platform widely used in ge- was isolated from the total RNA (1 lg). Libraries were prepared nome sequencing because of its long reads (average: 12 kb). SMRT using the NEBNext Ultra RNA Library Prep Kit for Illumina (NEB technology without assembling sequencing read provides direct evi- #E7530). dence for comprehensive analysis of splice isoforms of each gene and 20–22 can improve the annotation of existing gene models. Recently, Iso-Seq is used to analyse FL splice isoforms in humans and chick- 22 2.4 Subread processing and error correction ens, and indicates that identiﬁcation of genes and splice isoforms Effective subreads were obtained using the P_Fetch and P_Filter are far from being complete even in a highly characterized transcrip- function (parameters: minSubReadLength¼50, readScore¼0.75 and tome. In this study, we combined SMRT sequencing and short-read minLength¼50) in the SMRT Analysis Software v2.3 Suite (http:// next-generation sequencing technology to generate a more complete www.pacb.com/devnet/). The FL transcript sequence was obtained FL porcine transcriptome further to analyse features of AS, fusion using ToFU pipeline (Fig. 1a). Brieﬂy, circular consensus (CCS) isoforms and lncRNAs. Accordingly, this study highlights the splice read was obtained from the P_CCS module using the parameter isoforms and transcriptome diversity and dynamics, provides a valu- able resource for further investigation of genome annotation and MinFullPasses¼2 and MinPredictedAccuracy¼0. After examining 0 0 increases our understanding of the porcine transcriptome. for poly(A) signal, 5 and 3 adaptors, only the CCS with all three sig- nals was considered a FL non-chimetric (FLNC) read. To improve consensus accuracy, we used an isoform-level clustering algorithm, 2. Materials and methods namely, iterative clustering for error correction (ICE), and polished FL consensus sequences from ICE using Quiver with the following 2.1 Animal materials cut-off criteria: isoform length>200; high-quality>0.99. Additional A total of 38 porcine tissues of Large White sow were collected, in- nucleotide errors in FLNC reads were corrected using the Illumina cluding 20 tissues [heart, liver, spleen, lung, kidney, stomach, duode- RNA-seq data with the software Proovread using the parameter num, cecum, inguinal lymph nodes, precaval vein blood, ovary, coverage of 127. The untrimmed sequence was regarded as the result uterus, corpus luteum, inner ear, subcutaneous fat, longissimus mus- cle, psoas muscle, soleus muscle, extensor digitorum longus (EDL) of error correction. Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy014/5020793 by Ed 'DeepDyve' Gillespie user on 08 June 2018 Transcriptome complexity in Sus scrofa 3 0 0 Figure 1. Illustration of methods. (a) Iso-Seq workflow for data processing (ToFU). (b) Mapping of PacBio data. (c) Identification of full-length. Same 5 and 3 0 0 ends: the first and last splice sites of FLNC were as same as the reference annotation transcripts. With 5 and 3 ends: the first and last splice sites of the refer- 0 0 ence transcripts were presented in the splice sites of the FLNC sequence. Same 5 end: The first splice sites of 5 end was identical between annotated transcript 0 0 and FLNC sequence. With 5 end: the first splice site of the 5 end of the annotated transcript was presented in the FLNC sequence. The criteria of full-length were the type ‘with 5 end’. (d) Criteria for assessing a new isoform. (e) Fusion transcript identification from PacBio sequences. sequence can be identiﬁed as the same isoform when all the splicing 2.5 Mapping of PacBio data sites of the multiple-exon sequences were identical. The redundant The error rectiﬁed FLNC reads were mapped to the pig genome se- and false positive gene structure was removed as follows: (i) the miss- quence (Sus_scrofa.10.2.84) from ENSEMBL databases (Release 84) ing 5 end was removed; the sequence structure was a subset of other using GMAP with the options—no-chimeras-n 20 (Fig. 1b). The sequences (sequence structure refers to the ordered sets of all remain- best mapped locus was selected for each FLNC read based on both ing cleavage sites, excluding the initiation and termination sites); the identity and coverage values. Genome mapping results of FLNC 27 0 5 last exon spanning the intron region was determined, and the se- reads were visualized using the Integrative Genome Viewer. The quence was retained when it spanned the intron region; (ii) region high percent of identity (PID) aligned FLNC reads were used to an- notate loci and isoform. For loci, two sequences, which overlapped PID<99: each transcript model kept at least two PacBio sequences; 20% and at least one overlapping exon to more than 20%, were otherwise, all junctions of this sequence were annotated or supported by the junction of second-generation sequencing and (iii) the longest identiﬁed as the same loci transcript. For isoform, single-exon se- quence with overlap was determined as the same isoform. Such one was retained when the structure of two sequences was the same. Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy014/5020793 by Ed 'DeepDyve' Gillespie user on 08 June 2018 4 Y. Li et al. discovered lincRNAs of pig in ALDB under a criterion of e-value 2.6 Novel isoform of1e10, min-identity of 90% and min-coverage of 85%. The gene structure annotation results were compared with those of reference annotation to determine the new gene following these crite- ria: (1) results showed no overlap or overlapped by less than 20% of 2.11 Identification of tissue-specific and period-specific the annotated gene site, or (2) the gene overlap was more than 20%, PacBio isoforms by Illumina data but the gene direction was not consistent. The criteria used for a sin- Isoforms detected by PacBio was used as the template combined with gle transcript to identify novel isoform were as follows: (i) the ﬁnal the Illumina data to detect the junction of different tissue and period splice site of 3 end changed and (ii) new intron or new exon emerged for multi-exon isoforms of Iso-Seq data. We used a new modiﬁed (Fig. 1d). GFF ﬁle to estimate the expression level of each isoforms. The new GFF ﬁle contained junction positions of each isoform which were extracted from the GMAP mapping data. The expressed junction in- 2.7 Alternative splicing classification formation was obtained as follows: the Illumina raw reads were un- The relative importance of the main models of AS and the compre- der the quality control by ng_QC (in-house software of Frasergen hensive distribution of AS structure in pig transcriptome were ascer- Bioinformatics Inc., Wuhan, China) with parameters read_len 125 tained using Astalavista. Astalavista was also used for the and default settings for other parameters. Then, the clean paired classiﬁcation analysis of splice type for the gene model after remov- reads were aligned to the pig genome sequence using TopHat2.1.0 ing redundancy, and the simpliﬁed model constructed by IBS was following these options: G—library-type fr-ﬁrst strand. When all the visualized. junctions of one isoform were supported by the Illumina data, it was deﬁned as an expressed isoform (with ‘1’) in the corresponding pe- 2.8 Methylation data analysis riod or tissue. Conversely, when no Illumina data supported the Methylome data from our research group (the GEO accession is junction of the Iso-Seq data, or the junction was only partially sup- GSE92417) were remapped to the pig genome assembly, and the ported by the Illumina data, it was deﬁned as an unexpressed iso- methylation levels were calculated as previously reported. Brieﬂy, form (with ‘0’) in the corresponding period or tissue. In addition, the all splice junctions from PacBio Iso-Seq transcripts were stacked read counts for expressed genes were calculated with the HTSeq- (50 bp exonþ50 bp intron for donor and 50 bp intronþ50 bp exon count software and the differential expression level of genes were for acceptor). The methylation level of each base pair was calculated determined using GFOLD with the default parameters. as C/(C þ T). The methylation of the donor site was calculated from the ﬁrst nucleotide of both strands on 5 end of the intron as C/ 2.12 Validation of isoforms by RT-PCR (C þ T). The methylation of the acceptor site was calculated from We randomly selected 13 isoforms including AS, novel gene, the last nucleotide of both strands on the 3 end of the intron using lncRNA and fusion gene for experimental validation. Transcript- the same formula. For lncRNA and non-lncRNA methylations, three speciﬁc primers were designed to span the predicted splicing events regions were used for methylation study: 5 kb upstream transcription based on FL sequences using Primer-BLAST (https://www.ncbi.nlm. start site (TSS), transcript body and 5 kb downstream transcription nih.gov/tools/primer-blast/; Supplementary Table S13). PCR ampliﬁ- termination site (TTS). Each region was divided into 100 bins. Each cation was monitored on 1.5% agarose gel and followed by Sanger methylation ratio was calculated from the corresponding bins from sequencing. The 18srRNA was ampliﬁed as an endogenous control. all genes. 2.9 Fusion transcript identification 3. Results The criteria used to identify candidate fusion transcripts for a single 3.1 Pig transcriptome by PacBio Iso-Seq transcript were as follows (Fig. 1e): (i) FLNC transcript mapped to Short-read sequencing from the Illumina platform is effective in qual- two or more annotation loci in the genome; (ii) each mapped locus ifying gene expression and detecting AS events. However, its capacity must align with at least 10% of the transcript; (iii) the combined to accurately detect FL splice variants of genes is limited. To avoid alignment coverage must be at least 99%; (iv) each mapped locus underestimating isoform diversity, pig transcriptomes were se- must be at least 10 kb apart and (v) a certain amount of Illumina quenced using the PacBio Iso-Seq platform. This platform can pro- reads should support the fusion regions. vide long reads often up to several transcript lengths, thus the acquisition of accurately reconstructed FL splice variants is possible. 2.10 LncRNA identification from PacBio data To identify as many transcripts as possible, high-quality total RNA LncRNA identiﬁcation was performed as previously reported. The was extracted from 38 tissues at two different developmental stages. known high-conﬁdence 27,692 long non-coding RNA transcript The method for detecting FLNC, new AS and fusion transcripts was sequences and 94,359 protein-coding transcript sequences of human depicted in the ﬂow chart shown in Figure 1. downloaded from GENCODE (Release 25, GRCh38.p7) were used SMRT bell libraries were constructed and sequenced on the to build the model using PLEK. All PacBio isoforms were predicted PacBio RSII using the latest P6-C4 chemistry with 20 SMRT cells. In on the basis of the model. The open reading frames (ORFs) of candi- total, we detected 1,898,155 polymerase reads representing more date lncRNAs were predicted by EMBOSS (http://emboss.bioinfor than 36 G bases, with a mean length of 12 kb (Supplementary matics.nl/). The transcripts encoding ORFs that were longer than Table S1). After processing raw data by ToFU pipeline (Fig. 1a), we 100 amino acids were ﬁltered. The remaining transcripts were fur- obtained 14,868,653 ﬁltered subread data with a mean length of ther screened by BLASTX (e-value1e10) against protein sequen- 2,596 bp (Supplementary Table S2) and 1,300,544 CCS reads with ces of all species from NR database. Finally, Basic Local Alignment average depth of 6–10 passes in three libraries (median, Fig. 2a). Search Tool (BLASTN) was used to eliminate the previously Then CCS reads were classiﬁed into ﬁve types as follows: with 5 Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy014/5020793 by Ed 'DeepDyve' Gillespie user on 08 June 2018 Transcriptome complexity in Sus scrofa 5 Figure 2. Length analysis with Iso-seq reads. (a) Mean number of passes of CCS read. (b) Comparison of length between reference annotation and FLNC. adaptor, with 3 adaptor, with poly-A tail, FL and FLNC (Fig. 1b 3.3 Novel genes identification 0 0 and c). We detected 517,462 FL reads [containing 5 and 3 cDNA Compared with reference annotation, over 69% (193,318/277,842) synthesis primers and a distinct poly(A) tail], of which 99.46% of high-quality aligned FLNC had the same initial or terminated sites (514,659/517,462) were deﬁned as FLNC (Table 1). Interestingly, as mRNAs in annotated database, implying a relatively high integrity we found that only 1.10% (2,254/206,756) FLNCs were <1kb in in structure. Considering the preferable integrity of the FLNC at the length in 0.6–1 kb library (Supplementary Table S3), indicating that 3 end based on the characteristics of library construction, gene struc- swine FL transcripts might be more than 1 kb in length while short ture integrity of FLNC was only estimated at the 5 end of the se- quence. After removing redundancies and false positives, a total of fragments mainly consisted of non-coding RNAs with a single exon. 237,580 FLNC (85.51%) contained the same initial splice site Thus, future research on FL transcript can mainly focus on longer sequences with the reference annotation, covering 29,036 isoforms than 0.6 kb reads. By analysing the distribution of transcripts’ (68.34%, Supplementary Table S6). Moreover, a total of 14,792 length, we found that PacBio data set could retrieve much longer FLNC isoforms from 6,105 genes maintained the same structures as transcripts than those described in the current SSC10.2 reference annotation, which is similar to previous report that 1,4000 FL per- annotations (Fig. 2b). sonal transcripts from complement of a pooled set of 20 human Considering a high base error rate of SMRT sequencing technol- organs and tissues in a single-molecule level. ogy, we used high-quality Illumina short reads to correct erroneous The published pig genome annotation contains 25,322 gene SMRT long reads by Proovread software. The FLNC sequences be- models with 30,585 isoforms. In PacBio data set, 26,881 unique fore and after correction were respectively aligned to pig genome se- transcript clusters did not overlap with any annotated gene, which quence through GMAP. After correction, we obtained 389,781 high- likely originated from novel genes (Supplementary Table S7). Even quality FLNC for further study (Supplementary Table S4). so, about 13,000 known loci (32.70%) could be mapped to 12,278 reference annotated genes (48.49% of the 25,322 gene). Thus, Iso- 3.2 Isoform detection and characterization Seq data displayed a good example that one gene was annotated to To evaluate the density and length of isoforms, we compared the loci produce a single transcript but was found to generate four splice var- coverage of PacBio FLNC and swine SSC 10.2 annotation. In PacBio iants (Supplementary Fig. S1a). Novel annotation of Iso-Seq could data set, a total of 389,781 high-quality FLNCs covered 77,075 iso- rectify the incorrect position annotation of exon of the reference forms and were allocated to 39,940 loci (Supplementary Table S5). (Supplementary Fig. S1b). Novel genes have emerged as a new About 96.65% (38,604/39,940) gene transcripts were 1kb in structure and ﬁlled the blank position without annotation length. In reference annotation, 30,585 isoforms covered 25,322 loci (Supplementary Fig. S1c). and only 58.72% (14,872/25,322) gene transcripts were 1 kb. Our Gene annotation for the 26,881 new loci showed that 23,712 loci unique isoforms covered about 51.57% (13,059/25,322) of reference were single-exon loci and 3,169 were multiple-exon loci annotation loci. In addition, out of 77,075 isoforms, 29,992 (Supplementary Table S7). To validate the unannotated novel loci, (38.91%) were single-exon isoforms and 47,083 (61.09%) were we searched these 26,881 new genes in NR, KOG, KO and GO data- multiple-exon isoforms. Approximately 8,830 loci could produce bases using BLASTX (e-value1e5). It showed that 10,299 (38.31%), 800 (2.98%), 1,567 (5.83%) and 2,803 (10.43%) of more than one transcript, accounting for a total of 45,695 isoforms 26,881 new genes could be found in NR, KOG, KO and GO data- (Fig. 3a). In contrary, in reference annotation, around half (51.57%, bases, respectively. A total of 417 novel genes had signiﬁcant hits in 13,059/25,322) could be detected in Iso-Seq reads, of which only the four databases (Fig. 4). At the same time, a large number of 8,062 genes (61.74%) could produce at least two splice isoforms unannotated single-exon genes might contain some non-coding (Fig. 3b). The gene ACTA1 (gene ID: 14.1882, chr14: 65, 236, 181- RNAs, because coding genes usually had multiple exons which were 65, 239 and 183) had the largest number of isoforms 337, which 10,38 overwhelmingly alternatively spliced. played a vital role in the development of skeletal muscle myoﬁbrils in pig. Thus, PacBio FLNC data set provided higher isoform density and longer isoform length than SSC 10.2 annotation, which would 3.4 Various types of alternative splicing be propitious to unveil the comprehensive assessment of the true AS is an important mechanism of generating regulatory function for 20 39 complexity of the transcriptome for the gene structure annotation. trait expression of eukaryotes. Through alternative recognition of Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy014/5020793 by Ed 'DeepDyve' Gillespie user on 08 June 2018 6 Y. Li et al. Table 1. Sequence summary of PacBio CCS reads Library Cell CCS 5’ 3’ poly-A FL FLNC Average FLNC read length (bp) 0.6–1 kb 7 456,861 240,085 286,237 279,184 207,721 206,756 1,560 1–2 kb 7 472,850 225,442 268,959 261,714 189,761 188,735 2,087 >2 kb 6 370,833 155,580 192,158 185,582 119,980 119,168 2,853 Total 20 1,300,544 621,107 747,354 726,480 517,462 514,659 Figure 3. Distribution of genes that produce splice isoform and AS event with different models. (a) The total number of loci according to isoforms of FLNC in PacBio Iso-deq data. (b) Annotated loci and isoforms were based on FLNC mapped to reference genome. (c) Five basic models of AS: exon skipping, intron re- tention, alternative donor site, alternative acceptor site and mutually exclusive exons. (d) The total number of AS events in genes based on Iso-seq data com- pared with the annotated gene models. exon and splice site during splicing, a single gene can generate func- example, IR event could appear in different sites within a gene, AA tionally distinct mRNA and diverse protein isoforms. AS formation event might be more prone to the 3 terminal exon and ES event has ﬁve basic types as follows (Fig. 3c): exon skipping (ES), intron re- would appear in different combinations (Supplementary Fig. S1d–f). On whole-genome level, 1,150 AS models in reference annotation tention (IR), mutually exclusive exons (MEE), alternative donor site (AD) and alternative acceptor site (AA). Previously, 30% of por- and 7,132 models in Iso-Seq reads were detected (Fig. 5). In contrary, cine genes were reported to undergo AS using ESTs. In this study, by removing the duplication among different chromosomes, 97,727 we explored a more comprehensive formation and distribution of AS AS events with 2,637 models in PacBio Iso-Seq data were identiﬁed, event in pig. In reference annotation, a total of 7.77% (1,967/ of which more than 92,000 were new AS events with 2,100 modes (Fig. 3d). In detail, GANS gene has the largest number of splice var- 25,322) genes corresponding to 17.45% (5,336/30,585) isoforms ex- perienced 5,417 AS events (2.71 isoforms and 2.75 AS in each gene). iants (23 transcripts) in present reference annotation. However, we found 35 novel transcripts for GANS gene in Iso-Seq data, which im- In contrary, 17.66% loci (7,053/39,940 loci) corresponding to 42.38% (32,662/77,075) isoforms underwent 97,727 AS events in plied a more complex imprinted expression pattern closely related to PacBio data set (4.63 isoforms and 13.86 AS in each gene, Fig. 3d, tumour therapy. Moreover, ACTA1 gene expressed only one anno- Supplementary Table S8). This result indicated that the prevalence of tated transcript in reference but exhibited 337 splice variants in our AS event was much higher than previously thought. Interestingly, analysis. Its speciﬁc expression in muscle provides therapeutic strate- gies for thin ﬁlament myopathy patients and a method for improv- ﬁve basic AS models only accounted for 21.42% (20,930/97,727) of AS events. IR predominated, accounting for 10.16% of alternative ing muscle mass in livestock. These results suggested that transcripts, while MEE (0.34%) was the least frequent. More AS AS events were severely underestimated in present pig genome events evolved from combinations of ﬁve basic patterns. For annotations. Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy014/5020793 by Ed 'DeepDyve' Gillespie user on 08 June 2018 Transcriptome complexity in Sus scrofa 7 antisense strand with 2 bp [both EI and intron–exon (IE)] (Fig. 7a). The tremendous changes might be closely related to AS. Previous research reported that the CG locus in the genome was highly methylated and the methylation of CpG islands in the pro- moter region of the gene could regulate gene expression. To deter- mine whether the methylation level of gene was related to the number of isoforms, we divided the isoforms into three groups as fol- lows: type a, genes with only one isoform; type b, genes with two to ten isoforms and type c, genes with more than 10 isoforms. Gene methylation was divided into upstream region (promoters encompass 2 kb upstream of the transcriptional start site), gene body and down- stream region. Interestingly, CG methylation of promoter was nega- tively correlated with number of isoforms, suggesting that a high level of CG methylation at promoter region repressed AS and gene expression (Fig. 7b). Some evidence proved that the region near the promoters of many active genes tended to be unmethylated. We also found that the downward trend of methylation level of exon and promoter was insigniﬁcant when isoforms were more than 10. To determine whether the ﬁrst exon was affected by CG methylation, Figure 4. Novel genes identified in Iso-seq data. The number represents we analysed the relationship between CG methylation of the ﬁrst novel genes identified in NR, KOG, KO and GO databases. exon and the number of isoforms. These results showed that AS was negatively correlated with CG methylation of the ﬁrst exon, suggest- Interestingly, we found that AS events (23,440 with 555 models) ing that a low level of CG methylation at ﬁrst exon could promote predominantly took place in chromosome 14. To detect whether AS formations (Fig. 7c). Furthermore, we determined whether the such a situation existed in other species, we downloaded reference number of isoforms was affected by the gene length. The results annotations of six species, including horse, cattle, sheep, human, showed that the longer the gene, the more isoforms were produced, mouse and pig, from ensemble (release-84, ftp://ftp.ensembl.org/pub/ but the produced isoforms would be stable when the number of iso- release-84/gtf/) for AS analysis (Supplementary Fig. S2). In other ﬁve forms arrived at a certain level (Fig. 7d). The results showed that AS species, correlation coefﬁcient between the number of AS events and was not correlated with the number of genes on chromosome in pig gene numbers was very high (r ¼ 0.83–0.94). However, the correla- but was closely linked to gene length and promoter methylation. tion coefﬁcient in pig from reference annotation was fairly low (r ¼ 0.35), which was similar to the results in Iso-Seq data 3.6 Fusion isoform identification and characterization (r ¼ 0.49). The number of AS events did not signiﬁcantly grow with A fusion gene was an aberrant gene formed by the concatenation of the increase of the number of genes on chromosome in pig, indicating two separate genes. Long-read alignments can actually determine the that a species speciﬁcity was presented on the pig. EI structure of fusion genes. In this study, we identiﬁed 711 isoforms and they could simultaneously cover two or more annotated genes 3.5 DNA methylation regulates alternative splicing (Supplementary Table S9). The 711 fusion isoforms corresponded to variation 269 FLNC fusion loci and were involved in 622 genes in the refer- 43,44 Given that exons are often more highly methylated than introns, ence annotation. Distribution of fusion isoforms revealed that most DNA methylation is often considered as a strong marker for exon– of the fusion isoforms were present on chr7, followed by chr14 and 45,46 intron (EI) boundaries during splicing. The large number of iso- chr6 (Fig. 5). In 269 FLNC fusion loci, 31 chimeric genes forms identiﬁed via long-read sequencing in this work provided a (11.52%) corresponding to 90 fusion isoforms (12.66%) origi- good opportunity to investigate the relationship between DNA meth- nated from at least three candidate genes, whereas the remaining ylation and AS. To determine whether methylation level is related to from two candidate ones. Previous studies showed that fusion events AS, DNA methylation level of various isoforms was investigated. We were mostly comprised of two genes. However, our results stacked all splice junctions of the PacBio isoforms and measured the revealed that fusion event was not limited to two genes and could in- methylation levels of three types of cytosines on both strands in each clude more than two genes, or even ncRNAs. For example, MT.3.1 methylation context at donor and acceptor sites. The results revealed originated from 10 candidates (MT-CO2; MT_tRNA; MT-ATP8; that CHG methylation was enriched in acceptor sites, whereas CG MT-ATP6; MT-CO3; MT-tRNA; MT-ND3; MT-tRNA; MT- methylation was elevated at donor sites (Fig. 6a–c). The CHH and ND4L; MT-ND4;). MT-CO2 is usually unexpressed and has the CHG methylation levels were extremely low, whereas CG methyla- characteristic of inducible expression. Once the transcript originates tion predominated in pig. This result was consistent with a previous from MT-CO2 expression, it can induce tumour occurrence and study that three types of methylated motif existed in plants, whereas increase the probability of inducing tumour in females. Moreover, more CG methylation existed in vertebrates. Simultaneously, we fusion event could not only span two distinct genes but also produce found that methylation levels of the longest transcripts were consis- variant isoforms with AS events. For example, fusion isoform 11.112 tent with the methylation of all transcripts. The methylation levels at originated form two genes (EXOSC8-ENSSSCG00000026613) and splice sites did not change signiﬁcantly with the increase of the num- acquired another two novel exons (Supplementary Fig. S1g). This re- ber of isoforms (Fig. 6d–f). However, we discovered that CG methyl- sult was consistent with previously reported results that FL fusion ation levels dropped sharply (at least 10 times) at splice sites and could produce fusion genes that acquired novel genes or exons, sug- then quickly increased in the range of sense strand with 3 bp and gested their isoforms independent of a reference annotation library. Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy014/5020793 by Ed 'DeepDyve' Gillespie user on 08 June 2018 8 Y. Li et al. Figure 5. Distributed visualization using Circos of different data at chromosome level. The total length of the half circle corresponding to each label is the sum of all the values corresponding to the label. The connection between the different half circles indicates the value expressed by the two tags. A and B represent the number distribution of AS events in reference annotation and Iso-seq data, respectively. C and D correspond to number distribution of AS models in refer- ence annotation and Iso-seq data, respectively. E and F exhibit number of gene in annotation and Iso-seq data, respectively. G and H denote fusion gene and fu- sion isoform in Iso-seq data, respectively. Chr represents chromosome; JH and GL represent the scaffolds that have not assembled on the chromosome. Thus, fusion isoforms identiﬁed by Iso-Seq would provide more di- 539,73.65%) were formed from the combinations of separate pre- rect evidence for further studies of aberrant transcription which were cursor transcripts in a same gene family. For example, 58 isoforms from trans-splicing of distinct genes or splicing of fusion genes were from different duplicates of the Myosin heavy chain gene family formed by from genome rearrangements. (MYH), 18 isoforms from DEAD-box helicase family (DDX) and 15 To test whether these fusion events were formed stochastically by isoforms from interferon induced protein with tetratricopeptide abnormal AS or necessarily by functional requirement, we analysed repeats family (IFIT). Thus, fusion isoforms from a same gene family the functional relationship between precursor genes and each fusion implied that the formation of fusing events were likely due to special isoform (Supplementary Table S9). After excluding unannotated biological functions such as functional conserved between conserved genes, we obtained 539 fusion isoforms and 352 precursor genes. duplicates or functional complementation between subfunctional Interestingly, we found that a majority of fusion isoforms (397/ duplicates. For other fusion isoforms that originated from different Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy014/5020793 by Ed 'DeepDyve' Gillespie user on 08 June 2018 Transcriptome complexity in Sus scrofa 9 Figure 6. Level of DNA methylation at splice sites (exon–intron, intron–exon). (a)–(c): For each row, level of DNA methylation, combing sense and antisense strand, on the sense strand, on the antisense strand. (d)–(f): For each row, level of DNA methylation in genes with only one isoform, in all isoforms of genes with 2–10 isoforms, in all isoforms of genes with more than 10 isoforms. Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy014/5020793 by Ed 'DeepDyve' Gillespie user on 08 June 2018 10 Y. Li et al. Figure 7. CG methylation and isoform. (a) Level of CG methylation on the sense strand and antisense strand at splice sites. (b) The relationship between methyl- ation of gene and isoform. (c) Methylation of exon. (exon1: the first exon; exon: other exons excluding the first exon). (d) The relationship between gene length and number of isoform. In (b), (c) and (d), type a: one isoform; type b: 2–10 isoforms; type c: more than 10 isoforms. family genes, precursor transcripts of which were probably function- mapping data. We obtained more than 4 G bases clean paired data ally related. For example, a fusion gene 7.1323 originated from two each tissue, which was adequate for quantify gene expression precursor transcripts (ATP6V1G2 and DDX39B). DDX39B enco- (Supplementary Table S11). Additionally, rarefaction analysis revealed des a member of the DEAD box family of RNA-dependent ATPases that sequencing depth had reached saturation of gene and isoform dis- that mediates adenosine triphosphate (ATP) hydrolysis during pre- covery in each tissue (Supplementary Fig. S3a and b). STEM expres- mRNA splicing. The fusion gene 7.1323 could produce 17 fusion sion analysis for the 39,940 loci revealed that 25,018 loci were isoforms in pigs while it only produced two transcripts in humans expressed in eight tissues. Moreover, about 46.25% (12,432/26,881) of and might induce immunologic and infectious diseases. novel gene loci and 48.45% (4,282/8838) of lncRNA could be vali- Moreover, we perform a more precise BLAST analysis using 711 dated from these eight tissues (Supplementary Table S12). fusion isoforms against NCBI Reference RNA Sequences database to For detecting the tissue-speciﬁc expressed isoform, a total of validate fusion isoforms. Results showed that, 270 fusion genes 47,083 multi-exon isoforms were analysed in different tissues and (38%) are supported by at least one Reference RNA Sequences in periods (Fig. 8). Uniquely expressed isoforms were 451 in subcutane- pig (with identity>95% and overlap>90%). In addition, we ous fat of back (SFB), 416 in SM, 341 in EDL and 1,337 in EN re- searched these fusion genes in human. Interestingly, 86 are supported spectively, and 16,501 isoforms were simultaneously expressed in by at least one Reference RNA Sequences (with identity>88% and the four tissues (Fig. 9a). For the 2,545 isoforms uniquely expressed overlap>90%). And 84 fusion genes were both supported in these in each tissue, GO terms were primarily (top six) related to cellular two databases, suggested that fusion gene may exist in different spe- and single-organism processes, cell, cell part, metabolic process, cies and evolve conserved functions (Supplementary Table S9). binding (Fig. 9b). Function enrichments of pathway (Fig. 9c) varied slightly in each tissue such as catabolism and carbohydrate metabo- 3.7 Tissue-specific and period-specific isoform lism in SFB; digestive system and lipid metabolism in SM; sorting and degradation and digestive system in EDL and endocrine system, Identiﬁcation of speciﬁc isoform is among the primary analysis accom- plished through Illumina RNA-seq data of the same eight tissues. We immune system and transport and catabolism in EN. These results used a new modiﬁed GFF ﬁle to estimate the expression level of each showed that genes played the same biological functions in different tissues by the same or unique isoforms and the isoforms with tissue isoforms (Supplementary Table S10). The new GFF ﬁle contained junc- tion positions of each isoform which were extracted from the GMAP speciﬁcity might lead to various roles via different pathways. Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy014/5020793 by Ed 'DeepDyve' Gillespie user on 08 June 2018 Transcriptome complexity in Sus scrofa 11 Figure 8. Heatmap of tissue-specific and period-specific isoforms. (a) Tissue-specific isoforms. (b) Period-specific isoforms. SFB: subcutaneous fat of back, SM: soleus muscle, EDL: extensor digitorum longus, EN: endometria, A: adult, B: birth (one day). The coloured box on the left means the cluster analysis of whole- gene expression levels in eight tissues. expression abundance could promote varieties of protein, although Table 2. Analysis of period-specific isoforms GO terms were similar (Supplementary Figs S4b and c and S5b and Isoform 1 day Adult c). For example, isoform 12.550.1 (TUSC5) showed speciﬁc and high expression in SF adult period, which could regulate insulin- Total multiple-exon isoform 47,083 47,083 mediated adipose tissue glucose uptake by modulation of GLUT4 Expressed in at least one tissue 22,116 21,986 recycling. Isoform 3.161.1 (COX6A2), 12.685.1(MYH13) and Expressed in 4 tissues 14,681 12,700 13.197.8 (XIRP1), with speciﬁc and high expression in SM and EDL Expressed only in single tissue 2,675 3,314 Unexpressed only in single tissue 2,810 2,937 adult period, could play important roles in skeletal muscle ﬁbre type 56 57 58 Unexpressed in 4 tissues 24,967 25,097 switch, immune-mediated myositis and muscle development, respectively. 3.8 LncRNA identification For the period-speciﬁc expressed isoforms, 47,083 multi-exon iso- A classiﬁcation model of a high-conﬁdence set of known lncRNAs forms were classiﬁed into ﬁve categories (Table 2). A total of 14,706 was built using PLEK to identify lncRNAs in the PacBio data. and 12,766 isoforms simultaneously expressed in four tissues in one- Among the 77,075 PacBio isoforms, 15,049 candidate lncRNAs day-old and adult period, respectively (Supplementary Figs S4a and were obtained via PLEK model. To obtain a high-conﬁdence set of S5a). The number of isoforms expressed only in single tissue between lncRNA genes, we eliminated transcripts with ORFs exceeding 100 one-day-old and adult also maintained high consistency (r ¼ 0.974). codons and used BLASTX to screen the 15,049 candidates for ho- Results from GO and Kyoto Encyclopaedia of Genes and Genomes mology with proteins of all species in NR data, thereby obtaining (KEGG) analyses of the isoforms expressed only in single tissue be- 8,838 lncRNAs with a mean length of 2 kb. Of the 8,838 lncRNAs, tween one-day-old and adult revealed that the tissue diversity of the including 7,928 single-exon lncRNAs (89.7%) and 910 multi-exon Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy014/5020793 by Ed 'DeepDyve' Gillespie user on 08 June 2018 12 Y. Li et al. lncRNAs (10.3%), 5,058 lincRNAs existed. BLASTN found that with 77,075 isoforms covering 39,940 loci, 97,727 AS events corre- 664 of the 5,058 candidates corresponded to previously reported sponding to 2,637 models, 711 fusion isoforms and 4,394 novel lincRNA of pig in ALDB. The remaining 4,394 novel lincRNAs lincRNAs that were not previously annotated in pig. The new re- had a mean length of 2.024 kb (Fig. 10a). LncRNAs were classiﬁed source and transcriptional information would be of great value to into ﬁve groups according to their biogenesis positions relative to improve pig genome annotation and livestock transcriptome protein-coding genes of SSC10.2.84 annotations: 57.23% of them research. were generated from intergenic regions, 31.81% from the intronic This experimental design aimed at maximizing transcript diversity regions, 7.24% from the sense strand, 3.33% from the antisense and investigating comprehensive splice isoforms by broadly sampling strand and 0.40% belong to bidirectional lncRNA (Fig. 10b). In ad- 38 tissues and organs. Compared with previous studies, this experi- dition, majority (89.7%) of the lncRNAs were single exon, and this ment by far used the most number of tissues to comprehensively re- percentage was signiﬁcantly higher than the non-lncRNAs (P < 0.01, search transcriptome isoform via SMRT methodology. Twenty Fig. 10c). cells in our study might not well uncover more low-abundance iso- A total of 388 lncRNAs were simultaneously expressed in the forms, but the data depth was adequate to explore as many FL splice eight tissues. Endometria at adult expressed the most speciﬁc isoforms as possible when depending on the porcine genome size and lncRNAs (132), whereas adult soleus muscle expressed the fewest (9) comparing with previous similar studies of SMRT tran- 18,20,22 (Fig. 10d). Heatmap of lncRNA and non-lncRNA expression con- scriptomes. In this study, only 0.6% of FLNCs was excavated ﬁrmed that both lncRNA and non-lncRNA unfolded tissue-speciﬁc in the 0–600 bp library, indicated that swine FL transcript might be expression, particularly in SFB_A, EN_A and EN_B (Fig. 10e and f). more than 1 kb in length and short fragments mainly consisted of Comparison of overall expression between lncRNA and non- non-coding RNAs with a single exon. Thus, future research on FL lncRNA showed that lncRNA were signiﬁcantly less expressed than transcript can mainly focus on longer than 0.6 kb reads. non-lncRNA (P < 0.01, Fig. 10g). Comparison revealed that multi- AS plays important roles in regulating molecular, cellular, physio- exon lncRNAs expressed a higher level than single-exon lncRNAs logical and developmental processes/pathways in eukaryotes. (Fig. 10h). Furthermore, we monitored the level of CG methylation However, the difﬁculty of identifying the combinations of splice-site within and surrounding (5 kb upstream and 5 kb downstream) usage by Illumina short reads limits gene model prediction. Previous lncRNA and non-lncRNA isoforms (Fig. 10i). DNA methylation de- studies show that 30% of genes were alternatively spliced in pig, creased near the TSS and TTS of both lncRNA and non-lncRNA compared with 68% in humans, 57% in mouse and 21% in 7,11 genes; DNA methylation was also particularly signiﬁcant for the TSS bovine. Our analysis shows that 17.66% loci (7,053/39,940) of non-lncRNA genes (P < 0.01). Non-lncRNA genes presented are alternatively spliced, and 42.38% (32,662/77,075) transcripts higher CG methylation within the gene body than lncRNAs, whereas are associated with AS in Iso-Seq data, while only 7.77% (1,967/ lncRNAs showed higher methylation level at upstream and down- 25,322) genes corresponding to 17.45% (5,336/30,585) isoforms stream regions than non-lncRNAs. Finally, we discovered that multi- are in pig reference annotation. Thus, a large of AS gene identiﬁed in exon lncRNAs exhibited higher CG methylation within the gene this study will greatly improve reference annotation. Relatively, gene body and the lower level at upstream and downstream regions than proportion of AS in Iso-Seq data is low, owing to slightly high single- single-exon lncRNAs. DNA methylation of the promoter and gene exon isoform with little coding capacity and thus may represent body was strongly correlated with lncRNA expression, which lncRNAs. Furthermore, 7,051 multi-exon genes (99.97%) are al- might reﬂect the diverse expression levels of non-lncRNA genes and ternatively spliced in Iso-Seq data, which was consistent with previ- lncRNAs, as well as that of single-exon and multi-exon isoforms. 10 ous reports on humans, conﬁrming that almost all multi-exon genes undergo AS to increase transcriptome diversity. 3.9 Validation of isoforms by RT-PCR Hypothetically, variations in the EI methylation dynamics may re- To verify the expression of novel isoforms and genes in eight tissues, sult in ES. In fact, alternative exon recognition mechanisms may we randomly selected six genes (with or without exons and with or have evolved in genes with equal exon-to-intron DNA methylation without AS) and performed reverse transcription (RT) PCR follow- ratios, and this phenomenon is one of the biological explanations 61,62 given for the variation in AS patterns between species. ing by Sanger sequencing (primers listed in Supplementary Table S13). As shown in Figure 11, expression and splicing were conﬁrmed Interestingly, DNA methylation is enriched in AS sites and splicing in our expression analysis. In addition, two fusion transcripts regulatory motifs. We found that DNA methylation enriching in (7.168.3 and 1.2355.7) and two transcripts of novel lncRNAs AS sites varied from acceptor to donor sites and from sense strand to (1.1171.1 and x.631.1) were randomly selected and experimentally antisense strand. The trend of CG methylation is not as signiﬁcant validated. The results conﬁrmed the authenticity of two chimeric like that in plant, revealing that CG methylation promoted AS. RNA and tow lncRNAs. The validation above also showed that AS The sharp decline of cytosine coverage of EI and IE structure sites in- and fusion isoforms had differential expression and tissue and period dicated that the change of cytosine content at 2–3 bp around the AS speciﬁcity, thus increasing diversity of gene regulation as well as sites caused the change of methylation level and affected the occur- complexity of transcriptome. rence of AS events. This ﬁnding provides evidence for the methyla- tion of AS events depending on the structure of the AS sites. Moreover, our results revealed that CG methylation in promoter re- 4. Discussion gion repressed AS and regulated the ﬁrst exon and that methylation In this work, we applied PacBio sequencing technique to investigate of ﬁrst exon also repressed AS in pig. Alternative ﬁrst exon affected by the usage of alternative promoter was found to result in mRNA transcripts, provided the ﬁrst comprehensive view of splice variants isoforms with distinct 5 UTRs. This phenomenon indicates that in pig and illustrated the advantage of Iso-Seq in identifying FL splice isoforms. Following the latest methodologies in analysing PacBio promoter methylation can enhance the ﬁrst exon to repress the AS transcriptome data, we obtained 389,781 high-quality FLNC reads, events in animals. Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy014/5020793 by Ed 'DeepDyve' Gillespie user on 08 June 2018 Transcriptome complexity in Sus scrofa 13 Figure 9. Function annotation of tissue-specific isoforms. (a) Tissue-specific isoforms. (b) Classification of gene ontology annotation for isoform uniquely expressed in the four tissues. (c) Classification of KEGG pathways annotation for isoform uniquely expressed in the four tissues. Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy014/5020793 by Ed 'DeepDyve' Gillespie user on 08 June 2018 14 Y. Li et al. Figure 10. Characters of identified lncRNAs. (a) Comparison of lengths of previously reported lincRNAs with novel lincRNAs identified in our study. (b) Proportions of five kinds of lncRNAs, classified according to their position relative to protein-coding genes. (c) Number of exons in lncRNAs and non-lncRNAs. (d) Overlap of lncRNAs among tissue and period [SFB: subcutaneous fat of back, SM: soleus muscle, EDL: extensor digitorum longus, EN: endometria, A: adult, B: birth (one day)]. (e) Heatmap of lncRNA expression in eight tissues. (f) Heatmap of non-lncRNA expression in eight tissues. (g) Comparison of overall expres- sion between lncRNAs and non-lncRNAs. (h) Comparison of overall expression between single-exon lncRNAs and multi-exon lncRNAs. (i) Comparison of DNA methylation level on lncRNAs and non-lncRNAs. Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy014/5020793 by Ed 'DeepDyve' Gillespie user on 08 June 2018 Transcriptome complexity in Sus scrofa 15 Figure 11. Validation of differential expressed isoforms. (a) Heatmap of differential expressed isoforms in eight tissues. (b) Validation of differential expressed isoforms by RT-PCR. The isoform symbol was explained, for example, 8.1373.1 represented the first transcript of loci 8.1373 on the chromosome 8. 8.1373.1, 14.1634.2 and 15.158.2 belonged to the transcripts of novel loci; 5.203.1, 4.1478.1 and 1.1932.4 were known isoforms of annotated genes of SSC10.2 reference; 13.681.3, 1.3863.6 and 11.632.1 were new isoforms came from the annotated genes of SSC10.2 reference; 1.1177.1 and x.631.1 were novel isoforms of novel lncRNAs; 1.2355.7 and 7.168.3 were fusion transcripts and the genes below them represented their precursor genes. In this study, we identiﬁed 8,838 high-conﬁdence lncRNA, includ- Overall, our study demonstrates that long-read sequencing com- ing 4,394 novel lincRNA in pig. Owing to the lack of a good data- plements short-read sequencing for cataloguing and quantifying eu- base, we constructed the PLEK model using human data, which karyotic transcripts. Based on FL transcripts, a large number of AS included the most complete lncRNAs. Relatively, media length of models and isoforms provide a more comprehensive foundation to lncRNA in pig (2,024 bp) was only half of that of human (4,096 bp). explore transcriptome diversity. Our results revealed the unique spe- 90% of the lncRNAs belonging to the single exon conﬁrmed that cies speciﬁcity of AS, the rule of fusion event, the speciﬁcity of iso- candidate genes with few introns showed little coding capacity and form, the difference of lncRNA, the DNA methylation regulation on thus might represent lncRNAs, and isoforms with more introns cor- AS and lncRNA. These results not only signiﬁcantly improve existing responded to known (mostly protein coding) genes. The expression gene models of pig, but have revealed important rules and generated of lncRNA is usually low, and we speculate that one reason is that novel resource and information with positive implications for agri- lncRNAs with little exons rarely occur AS events. Consistent with cultural production or disease prevention. previous work, we uncovered a high degree of tissue speciﬁcity 64,65 among lncRNAs, a feature shared by other animals. In addition, we discovered that multi-exon lncRNAs expressed a higher level 4.1 Availability of data and material than single-exon lncRNAs. Methylation levels of the lncRNA geno- The sequence data reported in this paper have been deposited in the mic regions were signiﬁcantly higher than that for the mRNA gen- Genome Sequence Archive (GSA; http://gsa.big.ac.cn/) of Beijing es. Meanwhile, our work found that methylation levels of Institute of Genomics, Chinese Academy of Sciences, under accession lncRNAs were higher than non-lncRNAs in the upstream region of number PRJCA000349. TSS and downstream region of TTS, whereas non-lncRNAs with higher methylation levels existed in gene body. This phenomenon also appeared in single-exon and multi-exon lncRNAs, indicating Acknowledgements that gene with high expression or coding capacity corresponds with low methylation level in TSS and TTS. However, high methylation Many thanks to Frasergen Inc. (http://www.frasergen.com/) for providing se- level in gene body needs future research. quencing platform and technical assistance in data analyses. Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy014/5020793 by Ed 'DeepDyve' Gillespie user on 08 June 2018 16 Y. Li et al. 17. Zhang, X. and Zhang, W. 2016, Transcript isoform variation associated Conflict of interest with cytosine modiﬁcation in human lymphoblastoid cell lines, Genetics, The authors declare no competing ﬁnancial interests. 203, 985–95. 18. Wang, B., Tseng, E., Regulski, M., et al. 2016, Unveiling the complexity of the maize transcriptome by single-molecule long-read sequencing, Nat. Comms., 7, 11708. Funding 19. Vembar, S.S., Seetin, M., Lambert, C., et al. 2016, Complete This work was supported by National Natural Science Foundation of China telomere-to-telomere de novo assembly of the Plasmodium falciparum ge- (NSFC, 31472076), the Science Fund for Distinguished Young Scholars of nome through long-read (>11 kb), single molecule, real-time sequencing, Hubei Province of China (Grant No. 2014CFA024) and the Research Project DNA Res., 23, 339–51. of Chinese Ministry of Education (Grant No. 113048A). 20. Sharon, D., Tilgner, H., Grubert, F. and Snyder, M. 2013, A single-molecule long-read survey of the human transcriptome, Nat. Biotechnol., 31, 1009–14. 21. Tilgner, H., Jahanbani, F., Blauwkamp, T., et al. 2015, Comprehensive Supplementary data transcriptome analysis using synthetic long-read sequencing reveals mo- Supplementary data are available at DNARES online. lecular co-association of distant splicing events, Nat. Biotechnol., 33, 736–42. 22. Thomas, S., Underwood, J.G., Tseng, E., Holloway, A.K. and Bench To Basinet CvDC I.S. 2014, Long-read sequencing of chicken transcripts and References identiﬁcation of new transcript isoforms, PLoS One, 9, e94650. 1. Critser, J.K., Laughlin, M.H., Prather, R.S. and Riley, L.K. 2009, 23. Gordon, S.P., Tseng, E., Salamov, A., et al. 2015, Widespread polycis- Proceedings of the conference on swine in biomedical research, ILAR J., tronic transcripts in fungi revealed by single-molecule mRNA sequencing, 50, 89–94. PLoS One, 10, e0132628. 2. Gonzalez-Bulnes, A., Astiz, S., Ovilo, C., et al. 2016, Developmental ori- 24. Minoche, A.E., Dohm, J.C., Schneider, J., et al. 2015, Exploiting gins of health and disease in swine: implications for animal production single-molecule transcript sequencing for eukaryotic gene prediction, and biomedical research, Theriogenology, 86, 110–9. Genome Biol., 16, 184. 3. Choi, J.W., Chung, W.H., Lee, K.T., et al. 2015, Whole-genome rese- 25. Hackl, T., Hedrich, R., Schultz, J. and Forster, F. 2014, proovread: quencing analyses of ﬁve pig breeds, including Korean wild and native, large-scale high-accuracy PacBio correction through iterative short read and three European origin breeds, DNA Res., 22, 259–67. consensus, Bioinformatics, 30, 3004–11. 4. Shen-Gunther, J., Wang, C.M., Poage, G.M., et al. 2016, Molecular Pap 26. Wu, T.D. and Watanabe, C.K. 2005, GMAP: a genomic mapping and smear: hPV genotype and DNA methylation of ADCY8, CDH8, and alignment program for mRNA and EST sequences, Bioinformatics, 21, ZNF582 as an integrated biomarker for high-grade cervical cytology, 1859–75. Clin. Epigenetics, 8, 96. 27. Robinson, J.T., Thorvaldsdottir, H., Winckler, W., et al. 2011, Integrative 5. Au, K.F., Sebastiano, V., Afshar, P.T., et al. 2013, Characterization of the genomics viewer, Nat. Biotechnol., 29, 24–6. 28. Ast, G. 2004, How did alternative splicing evolve? Nat. Rev. Genet., 5, human ESC transcriptome by hybrid sequencing, Proc. Natl. Acad. Sci. 773–82. US A, 110, E4821–30. 29. Foissac, S. and Sammeth, M. 2007, ASTALAVISTA: dynamic and ﬂexible 6. Lim, D., Cho, Y.M., Lee, K.T., et al. 2009, The Pig Genome Database analysis of alternative splicing events in custom gene datasets, Nucleic (PiGenome): an integrated database for pig genome research, Mamm. Acids Res., 35, W297–9. Genome., 20, 60–6. 30. Liu, W., Xie, Y., Ma, J., et al. 2015, IBS: an illustrator for the presentation 7. Nygard, A.B., Cirera, S., Gilchrist, M.J., Gorodkin, J., Jorgensen, C.B. and visualization of biological sequences, Bioinformatics, 31, 3359–61. and Fredholm, M. 2010, A study of alternative splicing in the pig, BMC 31. Li, A., Zhang, J. and Zhou, Z. 2014, PLEK: a tool for predicting long Res. Notes, 3, 123. non-coding RNAs and messenger RNAs based on an improved k-mer 8. Wang, E.T., Sandberg, R., Luo, S., et al. 2008, Alternative isoform regula- scheme, BMC Bioinformatics, 15, 311. tion in human tissue transcriptomes, Nature, 456, 470–6. 32. Li, A., Zhang, J., Zhou, Z., Wang, L., Liu, Y. and Liu, Y. 2015, ALDB: a 9. Chow, L.T., Gelinas, R.E., Broker, T.R. and Roberts, R.J. 1977, An amaz- domestic-animal long noncoding RNA database, PLoS One, 10, ing sequence arrangement at the 5’ ends of adenovirus 2 messenger RNA, e0124003. Cell, 12, 1–8. 33. Trapnell, C., Pachter, L. and Salzberg, S.L. 2009, TopHat: discovering 10. Pan, Q., Shai, O., Lee, L.J., Frey, B.J. and Blencowe, B.J. 2008, Deep sur- splice junctions with RNA-Seq, Bioinformatics, 25, 1105–11. veying of alternative splicing complexity in the human transcriptome by 34. Anders, S., Pyl, P.T. and Huber, W. 2015, HTSeq–a Python framework to high-throughput sequencing, Nat. Genet., 40, 1413–5. work with high-throughput sequencing data, Bioinformatics, 31, 166–9. 11. Chacko, E. and Ranganathan, S. 2009, Genome-wide analysis of alterna- 35. Feng, J., Meyer, C.A., Wang, Q., Liu, J.S., Shirley Liu, X. and Zhang, Y. tive splicing in cow: implications in bovine as a model for human diseases, 2012, GFOLD: a generalized fold change for ranking differentially BMC Genomics, 10, S11. expressed genes from RNA-seq data, Bioinformatics, 28, 2782–8. 12. Modrek, B. and Lee, C. 2002, A genomic view of alternative splicing, 36. Steijger, T., Abril, J.F., Engstrom, P.G., et al. 2013, Assessment of tran- Nat. Genet., 30, 13–9. script reconstruction methods for RNA-seq, Nat. Methods, 10, 1177–84. 13. Paronetto, M.P., Passacantilli, I. and Sette, C. 2016, Alternative splicing 37. Reecy, J.M., Bidwell, C.A., Briley, G.P. and Grant, A.L. 1996, Structure and cell survival: from tissue homeostasis to disease, Cell Death Differ., and regulation of the porcine skeletal alpha-actin-encoding gene, Gene, 23, 1919–29. 180, 23–8. 14. Dewaele, M., Tabaglio, T., Willekens, K., et al. 2015, Antisense 38. Abdel-Ghany, S.E., Hamilton, M., Jacobi, J.L., et al. 2016, A survey of oligonucleotide-mediated MDM4 exon 6 skipping impairs tumor growth, the sorghum transcriptome using single-molecule long reads, Nat. J. Clin. Invest., 126, 68–84. Comms., 7, 11706. 15. Min, F., Wang, S. and Zhang, L. 2015, Survey of programs used to detect 39. Nilsen, T.W. and Graveley, B.R. 2010, Expansion of the eukaryotic prote- alternative splicing isoforms from deep sequencing data in silico, Biomed. ome by alternative splicing, Nature, 463, 457–63. Res. Int., 2015,1. 40. Le Guennec, L., Roos-Weil, D., Mokhtari, K., et al. 2013, Granulomatous 16. Holoch, D. and Moazed, D. 2015, RNA-mediated epigenetic regulation angiitis of the CNS revealing a Hodgkin lymphoma, Neurology, 80, of gene expression, Nat. Rev. Genet., 16, 71–84. 323–4. Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy014/5020793 by Ed 'DeepDyve' Gillespie user on 08 June 2018 Transcriptome complexity in Sus scrofa 17 41. Winter, J.M., Joureau, B., Lee, E.J., et al. 2016, Mutation-speciﬁc effects 54. Ernst, J. and Bar-Joseph, Z. 2006, STEM: a tool for the analysis of short on thin ﬁlament length in thin ﬁlament myopathy, Ann. Neurol., 79, time series gene expression data, BMC Bioinformatics, 7, 191. 55. Beaton, N., Rudigier, C., Moest, H., et al. 2015, TUSC5 regulates 959–69. insulin-mediated adipose tissue glucose uptake by modulation of GLUT4 42. Hu, Q., Tong, H., Zhao, D., et al. 2015, Generation of an efﬁcient recycling, Mol. Metab., 4, 795–810. artiﬁcial promoter of bovine skeletal muscle alpha-actin gene 56. Quintens, R., Singh, S., Lemaire, K., et al. 2013, Mice deﬁcient in the re- (ACTA1) through addition of cis-acting element, Cell Mol. Biol. Lett., 20, spiratory chain gene Cox6a2 are protected against high-fat diet-induced 160–76. obesity and insulin resistance, PLoS One, 8, e56719. 43. Gelfman, S., Cohen, N., Yearim, A. and Ast, G. 2013, DNA-methylation 57. Finno, C.J., Gianino, G., Perumbakkam, S., et al. 2018, A missense muta- effect on cotranscriptional splicing is dependent on GC architecture of the tion in MYH1 is associated with susceptibility to immune-mediated myo- exon-intron structure, Genome Res., 23, 789–99. sitis in Quarter Horses, Skelet. Muscle, 8,7. 44. Lister, R., Pelizzola, M., Dowen, R.H., et al. 2009, Human DNA methyl- 58. Eulitz, S., Sauer, F., Pelissier, M.C., et al. 2013, Identiﬁcation of omes at base resolution show widespread epigenomic differences, Nature, Xin-repeat proteins as novel ligands of the SH3 domains of nebulin and 462, 315–22. nebulette and analysis of their interaction during myoﬁbril formation and 45. Laurent, L., Wong, E., Li, G., et al. 2010, Dynamic changes in the human remodeling, Mol. Biol. Cell, 24, 3215–26. methylome during differentiation, Genome Res., 20, 320–31. 59. Lee, S.T., Xiao, Y., Muench, M.O., et al. 2012, A global DNA methyla- 46. Gelfman, S. and Ast, G. 2013, When epigenetics meets alternative splic- tion and gene expression analysis of early human B-cell development ing: the roles of DNA methylation and GC architecture, Epigenomics, 5, reveals a demethylation signature and transcription factor network, 351–3. Nucleic Acids Res., 40, 11339–51. 47. Feng, S., Cokus, S.J., Zhang, X., et al. 2010, Conservation and divergence 60. Kalsotra, A. and Cooper, T.A. 2011, Functional consequences of develop- of methylation patterning in plants and animals, Proc. Natl. Acad. Sci. mentally regulated alternative splicing, Nat. Rev. Genet., 12, 715–29. US A, 107, 8689–94. 61. Merkin, J., Russell, C., Chen, P. and Burge, C.B. 2012, Evolutionary dy- 48. Suzuki, M.M. and Bird, A. 2008, DNA methylation landscapes: provoca- namics of gene and isoform regulation in Mammalian tissues, Science, tive insights from epigenomics, Nat. Rev. Genet., 9, 465–76. 338, 1593–9. 49. Weirather, J.L., Afshar, P.T., Clark, T.A., et al. 2015, Characterization of 62. Barbosa-Morais, N.L., Irimia, M., Pan, Q., et al. 2012, The evolutionary fusion genes and the signiﬁcantly expressed fusion isoforms in breast can- landscape of alternative splicing in vertebrate species, Science, 338, cer by hybrid sequencing, Nucleic Acids Res., 43, e116. 1587–93. 50. Li, H., Jin, F., Jiang, K., et al. 2016, mTORC1-mediated downregulation 63. Anastasiadou, C., Malousi, A., Maglaveras, N. and Kouidou, S. 2011, Human of COX2 restrains tumor growth caused by TSC2 deﬁciency, Oncotarget, epigenome data reveal increased CpG methylation in alternatively spliced sites 7, 28435–47. and putative exonic splicing enhancers, DNA Cell Biol., 30, 267–75. 51. Cherukuri, D.P., Ishikawa, T.O., Chun, P., et al. 2014, Targeted Cox2 64. Bakhtiarizadeh, M.R., Hosseinpour, B., Arefnezhad, B., Shamabadi, N. gene deletion in intestinal epithelial cells decreases tumorigenesis in fe- and Salami, S.A. 2016, In silico prediction of long intergenic non-coding male, but not male, ApcMin/þ mice, Mol. Oncol., 8, 169–77. RNAs in sheep, Genome, 59, 263–75. 52. Kota, K.P., Wagner, S.R., Huerta, E., Underwood, J.M. and Nickerson, 65. Koufariotis, L.T., Chen, Y.P., Chamberlain, A., Vander Jagt, C. and J.A. 2008, Binding of ATP to UAP56 is necessary for mRNA export, Hayes, B.J. 2015, A catalogue of novel bovine long noncoding RNA J. Cell Sci., 121, 1526–37. across 18 tissues, PLoS One, 10, e0141225. 53. Merino, A.M., Zhang, K., Kaslow, R.A. and Aissani, B. 2013, Structure 66. Zhou, Z.Y., Li, A., Wang, L.G., et al. 2015, DNA methylation signatures of tumor necrosis factor-alpha haploblocks in European populations, of long intergenic noncoding RNAs in porcine adipose and muscle tissues, Immunogenetics, 65, 543–52. Sci. Rep., 5, 15435. Downloaded from https://academic.oup.com/dnaresearch/advance-article-abstract/doi/10.1093/dnares/dsy014/5020793 by Ed 'DeepDyve' Gillespie user on 08 June 2018
DNA Research – Oxford University Press
Published: May 29, 2018
It’s your single place to instantly
discover and read the research
that matters to you.
Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.
All for just $49/month
Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly
Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.
Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.
Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.
All the latest content is available, no embargo periods.
“Hi guys, I cannot tell you how much I love this resource. Incredible. I really believe you've hit the nail on the head with this site in regards to solving the research-purchase issue.”Daniel C.
“Whoa! It’s like Spotify but for academic articles.”@Phil_Robichaud
“I must say, @deepdyve is a fabulous solution to the independent researcher's problem of #access to #information.”@deepthiw
“My last article couldn't be possible without the platform @deepdyve that makes journal papers cheaper.”@JoseServera