APAtrap: identification and quantification of alternative polyadenylation sites from RNA-seq data

APAtrap: identification and quantification of alternative polyadenylation sites from RNA-seq data Abstract Motivation Alternative polyadenylation (APA) has been increasingly recognized as a crucial mechanism that contributes to transcriptome diversity and gene expression regulation. As RNA-seq has become a routine protocol for transcriptome analysis, it is of great interest to leverage such unprecedented collection of RNA-seq data by new computational methods to extract and quantify APA dynamics in these transcriptomes. However, research progress in this area has been relatively limited. Conventional methods rely on either transcript assembly to determine transcript 3′ ends or annotated poly(A) sites. Moreover, they can neither identify more than two poly(A) sites in a gene nor detect dynamic APA site usage considering more than two poly(A) sites. Results We developed an approach called APAtrap based on the mean squared error model to identify and quantify APA sites from RNA-seq data. APAtrap is capable of identifying novel 3′ UTRs and 3′ UTR extensions, which contributes to locating potential poly(A) sites in previously overlooked regions and improving genome annotations. APAtrap also aims to tally all potential poly(A) sites and detect genes with differential APA site usages between conditions. Extensive comparisons of APAtrap with two other latest methods, ChangePoint and DaPars, using various RNA-seq datasets from simulation studies, human and Arabidopsis demonstrate the efficacy and flexibility of APAtrap for any organisms with an annotated genome. Availability and implementation Freely available for download at https://apatrap.sourceforge.io. Contact liqq@xmu.edu.cn or xhuister@xmu.edu.cn Supplementary information Supplementary data are available at Bioinformatics online. 1 Introduction Pre-mRNA cleavage and polyadenylation plays a key role in transcription termination, cytoplasmic localization, RNA stability and translation (Tian and Manley, 2017; Wu and Bartel, 2017). Recent transcriptome-wide studies have revealed that most eukaryotic genes produce distinct transcript isoforms through the usage of different polyadenylation sites, a mechanism called alternative polyadenylation (APA) (Di Giammartino et al., 2011; Tian and Manley, 2013). APA has been increasingly recognized as a crucial mechanism that contributes to the transcriptome complexity and gene expression regulation. The dominant class of APA events involves in the same terminal exon, leading to isoforms with variable 3′ UTR (3′ untranslated region) length. A small fraction of APA events occurs within internal introns/exons, producing mRNAs encoding different proteins. Diverse 3′ UTRs generated by APA modulate the availability of different cis-elements, which can regulate mRNA stability, translation efficiency and subcellular localization (Mayr, 2016; Tian and Manley, 2013). Importantly, dynamic usage of APA sites has been implicated in a variety of biological processes and diseases, including cell activation, growth, proliferation, cancer and neurodegenerative disorders (Batra et al., 2015; Berkovits and Mayr, 2015; Blazie et al., 2015; Di Giammartino et al., 2011; Gruber et al., 2014; Gupta et al., 2014; Han and Kim, 2014; Oktaba et al., 2015; Tian and Manley, 2013; Xia et al., 2014). Given the functional importance of APA, identification and quantification of APA site usages is critical and may be the first step towards deciphering the underlying mechanism of 3′ UTR-mediated gene regulation. Efforts has been made to capture 3′ ends of mRNAs by developing experimental protocols for profiling genome-wide APA sites [reviewed in Ji et al. (2015)], such as PAT-seq (Harrison et al., 2015), 3′ T-fill (Wilkening et al., 2013), 3′ READs (Hoque et al., 2013), TAIL-seq (Chang et al., 2014; Park et al., 2016) and mTAIL-seq (Lim et al., 2016). Although these approaches are powerful in providing the precise locations of poly(A) sites, they are costly, technically demanding, and most of them are hampered by inherent technical issues, such as internal priming artifacts. Limited number of poly(A) site databases are currently available, such as PolyA_DB 2 (Lee et al., 2007), APASdb (You et al., 2014), PlantAPA (Wu et al., 2016) and PolyAsite (Gruber et al., 2016), covering only a few species. In contrast, RNA-seq has become the standard technology for transcriptome profiling and has been applied in almost every large-scale genomics studies. Birol et al. (Birol et al., 2015) and Bonfert et al. (Bonfert and Friedel, 2017) developed methods for characterizing cleavage sites on 3′ UTRs in assembled RNA-seq data, by incorporating untemplated poly(A) sequence at their 3′ ends. Although RNA-seq reads containing poly(A) stretch can be directly used to identify poly(A) sites, the coverage of poly(A) tails by reads is very poor (Birol et al., 2015; Pickrell et al., 2010). Some tools are also available to identify novel 3′ UTRs by transcriptome assembly using RNA-seq data, such as Cufflinks (Trapnell et al., 2012), 3USS (Le Pera et al., 2015) and MISO (Katz et al., 2010). However, these tools may lead to truncation of long 3′ UTRs due to variable read depth and discontinuous coverage in large processed exons. Moreover, isoforms with short 3′ UTRs are frequently overlooked by transcript assembly tools like Cufflinks because short 3′ UTRs are often embedded within longer ones (Miura et al., 2013; Xia et al., 2014). MISO (Katz et al., 2010) can detect tandem 3′ UTRs, however, it relies on prior knowledge of annotated poly(A) sites, failing to identify novel poly(A) sites beyond existing poly(A) site databases. Therefore, these approaches that rely heavily on annotated poly(A) sites and/or transcriptome assembly results may not be precise or powerful because of incomplete information of known poly(A) sites and inherent limitations of transcript assembly tools. RNA-seq data for transcriptome profiling continue to accumulate in an unprecedented pace, being released for the benefit of all. Several tools are currently in progress to infer differential APA site usage between two conditions by detecting changes in RNA-seq read density. Wang et al. (Wang et al., 2014) proposed a change-point model (hereinafter referred as ChangePoint) based on a likelihood ratio test for detecting 3′ UTR switching. Xia et al. (Xia et al., 2014) developed a tool called DarPas to analyze dynamic regulation of APA through the comparison of RNA-seq read coverage between two conditions. However, each of these existing methods has its own limitations. For instance, ChangePoint is only applicable for detecting one change point in annotated 3′ UTR and relies on the annotated distal poly(A) sites to infer the proximal ones. Moreover, ChangePoint only supports pair-wise comparison between two samples, failing to aggregate results from more than two samples. DaPars can predict only one proximal poly(A) site based on a given annotated 3′ UTR and considers only the expression levels of the proximal and distal poly(A) sites for significance test in detecting genes with differential APA site usage. Consequently, it fails to detect other potential poly(A) sites and discards the information about the coordinates of poly(A) sites (or the length of alternative 3′ UTRs), which might reduce the power of significance assessment of dynamic APA events (Xing et al., 2015). Obviously, the main drawback of these two methods is that they can neither predict multiple proximal poly(A) sites nor detect dynamic APA site usage considering more than two poly(A) sites. Recent studies using 3′ end sequencing with increasing depth have revealed that more and more genes contain more than two poly(A) sites (Fu et al., 2011, 2016; Ji et al., 2015; Lianoglou et al., 2013; Ulitsky et al., 2012; Wu et al., 2014). Despite the availability of various methods, detecting all potential poly(A) sites and differential APA site usage remains challenging and there is still lot of scope to improve the accuracy of dynamic APA event identification and the ability to fully exploit the power of RNA-seq for profiling dynamics of APA. In this study, we have developed a tool called APAtrap to identify APA sites and detect changes in APA site usage by leveraging the resolution and huge abundance of RNA-seq data. APAtrap is capable of identifying novel 3′ UTRs or 3′ UTR extensions and can predict all potential poly(A) sites based on the refined 3′ UTRs. Furthermore, APAtrap can explore APA events with differential usage patterns among conditions by incorporating both coordinates and expression levels of all poly(A) sites in a gene. We compared APAtrap with two other latest methods, ChangePoint and DaPars, using both simulated data and real RNA-seq data from different species. Results show that APAtrap can identify APA sites and differential APA site usages with higher accuracy than the competing methods and can be adapted to genomes with different sizes by tuning model parameters. 2 Materials and methods 2.1 Identification of novel 3′ UTRs and 3′ UTR extensions A strategy similar to the previous study (Miura et al., 2013) was adopted to identify novel 3′ UTRs or 3′ UTR extensions (Supplementary Fig. S1a). First, the 3′ end regions were extracted from the genome annotation and extended by a pre-defined length. Then RNA-seq read enriched 3′ regions are identified by sliding a window in 1 bp increment starting from the 5′ start of the extended 3′ regions. The following criteria are utilized to identify precise 3′ end: (i) depths of the current window containing the candidate distal 3′ end and the following window should be less than the predefined coverage depth; (ii) coverage of the first base of the current window should be less than the predefined coverage depth, etc. The first base of the current window that meets these criteria is considered as the 3′ end. Finally, newly identified 3′ UTR regions are compared with the original genome annotation to obtain novel 3′ UTRs or 3′ UTR extensions. More details are described in Supplementary Methods. 2.2 Identification of APA sites Based on the refined 3′ UTRs identified in the previous step, distal poly(A) sites are then defined as the 3′ ends of these 3′ UTRs. First, a sliding window strategy is used to define potential number of poly(A) sites (details in Supplementary Methods). Then a mean squared error model was proposed to identify the precise positions of poly(A) sites for each gene, which seeks the minimum difference between the estimated read coverage and the real read coverage in the 3′ UTR from all samples. The loss function is, arg min⁡eik≥0∑i=1N∑j=1L(Ci,j−∑k=1MeikIjk)2/N×L (1) Here, M is the number of poly(A) sites in the whole 3′ UTR region in a given gene, which is the number of merged regions. L is the length of the longest 3′ UTR; N is the number of samples; Cij is the observed coverage depth of position j in sample i; Ijk =1 or 0 indicates whether the position j belongs to the kth 3′ UTR or not; eik is the estimated expression level of the kth 3′ UTR in sample i. Let Pk be the location of the kth poly(A) site, which is also the length of the kth 3′ UTR. The expression level of the longest 3′ UTR can be estimated as the mean read coverage of the fragment from PM - 1 to PM as this fragment only belongs to the longest 3′ UTR. eiM=∑j=PM−1+1LCji/(L−PM−1) (2) Because the kth (1 < k < M) 3′ UTR is shared by longer 3′ UTRs, its expression level is denoted as the mean read coverage of the fragment from Pk - 1 to Pk subtracted by the sum of expression levels of the (k + 1)th to the Mth 3′ UTRs. eik1<k<M=∑j=Pk−1+1Pk(Cji−∑m=k+1Meim)/(Pk−Pk−1) (3) Similarly, the expression level of the first 3′ UTR can be represented as, ei1=∑j=1P1(Cji−∑m=2Meim)/P1 (4) 2.3 Identification of dynamic APA site usage between two samples We incorporated a difference index and a linear trend test (Fu et al., 2011, 2016) to investigate the dynamic APA site usage between two samples (Supplementary Fig. S1c). First, the index, termed PD (percentage difference), is used to quantify the difference of APA site usage between two samples in a given gene, which is defined as following: PD=∑k=1M|pak−pbk|2 (PD∈[0,1]) (5) Here, M is the number of poly(A) sites in a gene; a and b denote the two samples; pak represents the fraction of read coverage of poly(A) site k over all poly(A) sites of the gene in sample a. Here the read coverage for each poly(A) site is the estimated expression level of the corresponding 3′ UTR (Supplementary Fig. S1b). Next, the linear trend test based on the Pearson product moment correlation coefficient (Fu et al., 2011, 2016) was adopted to examine significant changes of APA site usage in a gene between two samples. Benjamini-Hochberg method (Benjamini and Hochberg, 1995) is used to estimate the FDR (false discovery rate). By default, we consider genes with PD higher than a threshold and with FDR below a cutoff as genes with differential APA site usage. More details are described in Supplementary Methods. 2.4 Validation We used both simulated data and real RNA-seq data from human and Arabidopsis to evaluate the results of APAtrap. The simulated RNA-seq data of 50 bp paired-end reads with a 150 bp fragment size were generated by Flux Simulator (Griebel et al., 2012) based on the human genome annotation and a predefined library size. Genes containing different number of isoforms with different 3′ UTR lengths were simulated. For real data analysis in human, we used RNA-seq data from the Human Brain Reference (hereinafter Brain) and the Universal Human Reference (UHR) MAQC samples (Bullard et al., 2010). The polyA-seq data from the same samples collected from UCSC browser (http://genome.ucsc.edu/) were used to detect genes with differential APA site usage as the true reference for evaluation. Annotated poly(A) sites from multiple resources were compiled to form a large reference dataset for verification, including NCBI Refseq database (https://www.ncbi.nlm.nih.gov/refseq/), Ensemble database (http://www.ensembl.org/Homo_sapiens/Info/Index), UCSC genome browser (http://genome.ucsc.edu/), PolyAsite database (http://www.polyasite.unibas.ch/), GENCODE poly(A) datasets (http://www.gencodegenes.org/releases/19.html) and PolyA_DB 2 (Lee et al., 2007). For Arabidopsis, two RNA-Seq samples subjected to control and mild drought conditions were used (Clauw et al., 2015). Annotated poly(A) sites in Arabidopsis were downloaded from PlantAPA database (Wu et al., 2016), which were collected from leaf, seed and root tissues of wild type (WT) and mutant oxt6. For the evaluation of performance of differential APA detection, we calculated the true index [PD or ODD (Wang et al., 2014)] from the test data and compared it with the value estimated by each tool. Smaller difference between the predicted index and the true index indicates better prediction and a differential APA event with both true index and estimated index above a given cutoff is considered as a true prediction. More details about the validation are described in Supplementary Methods. 3 Results 3.1 Evaluation of APAtrap using simulated data We conducted a simulation study to assess the performance of APAtrap using simulated RNA-seq data with predefined APA events. Since in many APA studies only proximal and distal poly(A) sites of a gene were considered when inferring dynamic APA site usage, we first simulated 1000 human genes with exact two isoforms. One isoform is from the annotated genome and the other is 1000 bp longer. The expression level of each isoform is randomly generated by Flux Simulator (Griebel et al., 2012), therefore the PD values between two samples of all the 1000 genes are uniformly distributed in [0, 1]. Using these simulated RNA-seq reads as inputs, we compared APAtrap with two other tools also based on RNA-seq read coverage change, ChangePoint (Wang et al., 2014) and DaPars (Xia et al., 2014). First, we calculated the correlation of the real index (PD or ODD) and the estimated index from each tool (Supplementary Fig. S2). APAtrap obtained the best correlation followed by DaPars, while ChangePoint presented much lower correlation, suggesting better ability of APAtrap and DaPars to recover the true difference in the dataset. Next, we evaluated the APA site prediction accuracy of different tools. Apparently, APAtrap and DaPars recovered significantly higher number of true poly(A) sites than ChangePoint, while APAtrap performed slightly better than DaPars (Fig. 1a). We also compared the performance of these three tools in the detection of dynamic APA events between two conditions. The accuracy and ROC (receiver-operating characteristic) curve at different cutoffs from 0 to 1 in an increment of 0.05 for each tool were reported. The accuracy of APAtrap was the highest no matter which cutoff was applied (Fig. 1b). APAtrap and DaPars provided more stable results than ChangePoint regardless of the cutoff, indicating the robustness of these two methods and the desirability of the PD metric. APAtrap also provided the best performance with the highest areas under ROC curve (AUC: APAtrap = 0.95, DaPars = 0.90 and ChangePoint = 0.67) among the three tools (Fig. 1c). Apparently, ChangePoint performed the worst with considerably higher false positive rate (FPR) but lower true positive rate (TPR). In contrast, APAtrap provided the best TPR with low FPR especially at small cutoff while the TPR of DaPars at small cutoff was slighter lower than that of APAtrap. Fig. 1. View largeDownload slide Evaluation of performance of APAtrap using simulated data. (a) Fraction of recovered poly(A) sites using different methods in the two-isoform study. If a predicted poly(A) site is located within a predefined distance from any annotated poly(A) site, then the prediction is considered as a true prediction. Cutoffs from 0 bp to 100 bp in an increment of 10 bp were set. (b) Accuracy of detecting differential APA site usage using different methods in the two-isoform study. A differential APA event with both true index and estimated index above a given cutoff is considered as a true prediction. Cutoffs from 0 to 1 in an increment of 0.05 were set. (c) ROC curves of dynamic APA event detection using different methods in the two-isoform study. (d–f) As in a–c, except that genes with one to four isoforms were used Fig. 1. View largeDownload slide Evaluation of performance of APAtrap using simulated data. (a) Fraction of recovered poly(A) sites using different methods in the two-isoform study. If a predicted poly(A) site is located within a predefined distance from any annotated poly(A) site, then the prediction is considered as a true prediction. Cutoffs from 0 bp to 100 bp in an increment of 10 bp were set. (b) Accuracy of detecting differential APA site usage using different methods in the two-isoform study. A differential APA event with both true index and estimated index above a given cutoff is considered as a true prediction. Cutoffs from 0 to 1 in an increment of 0.05 were set. (c) ROC curves of dynamic APA event detection using different methods in the two-isoform study. (d–f) As in a–c, except that genes with one to four isoforms were used Recent studies with deeper sequencing coverage have uncovered increasing number of poly(A) sites than expected. Here we conducted another simulation study that is not limited for genes with exact two isoforms but for genes with one to four isoforms. It should be noted that DaPars considers the end of the annotated 3′ UTR as the distal poly(A) site and can only predict one proximal poly(A) site upstream of the distal site. Consequently, even for genes with more than two isoforms, DaPars can detect only one proximal poly(A) site rather than all potential poly(A) sites and then infer differential APA site usage based on the proximal and distal poly(A) sites. Unlike the correlation results for genes with only two isoforms for which both DaPars and APAtrap obtained high and comparable results (Supplementary Fig. S2a, b), APAtrap provided significantly higher correlation of true PD and estimated PD than DaPars (0.91 versus 0.54) (Supplementary Fig. S2d, e). Still, the ODD calculated by ChangePoint had poor correlation with true PD. Surprisingly, APAtrap obtained considerably high correlation in both simulation cases while the correlation of DaPars dropped substantially from 0.84 to 0.54 (Supplementary Fig. S2b, e), suggesting that APAtrap can achieve comparable results regardless of the number of poly(A) sites to predict. For APA site prediction, APAtrap recovered much more true poly(A) sites than DaPars and ChangePoint (Fig. 1d), demonstrating the efficiency of APAtrap in identifying multiple proximal poly(A) sites in 3′ UTRs. Regarding the metrics of ROC curve and accuracy, APAtrap was obviously superior to DaPars (Fig. 1d, e; AUC: APAtrap = 0.94, DaPars = 0.78). Particularly, compared to the performance in the previous simulation study on genes with exact two isoforms, the performance of DaPars decreased dramatically in identifying APA sites or differential APA events for genes with varied number of isoforms. In contrast, the loss in performance of APAtrap was not remarkable, which indicates the importance of using all potential poly(A) sites rather than only proximal and distal ones for effectively detecting dynamic APA usage. These results demonstrate that APAtrap outperforms the competing methods ChangePoint and DaPars. Moreover, APAtrap is powerful in identifying APA sites and dynamic APA site usages between samples for genes with different number of poly(A) sites. 3.2 Application of APAtrap to RNA-seq data in human Next, we evaluated the performance of APAtrap using experimental RNA-seq data in human. APAtrap and DaPars identified 161 and 111 genes with differential APA site usage based on the same cutoff of FDR and PD (FDR ≤ 0.05, PD ≥ 0.2), respectively (Fig. 2a). Among them, 51 genes were predicted by both methods. Using polyA-seq data, 205 genes with differential APA site usage were identified. We noted that only a small number of genes predicted by both methods were validated by polyA-seq (30 from APAtrap, 22 from DaPars). This may be because that the real poly(A) site dataset from polyA-seq is far from complete and different sequencing strategies or sequencing depth were used to generate the RNA-seq data and polyA-seq data. This result also indicates that there is still lot of scope to improve the accuracy of dynamic APA event prediction from RNA-seq data. Up to 84.5% (136/161) genes identified by APAtrap and 78.4% (87/111) genes identified by DaPars prefer using longer 3′ UTR in the Brain sample, respectively (Fig. 2b, c). This result is consistent with recent reports that brain tissues normally have the longest 3′ UTRs (Miura et al., 2013). Overall, APAtrap identified more dynamic APA events and ‘longer events’ than DaPars, demonstrating the efficiency of APAtrap and the importance of considering all potential poly(A) sites for the detection of differential APA site usage. Fig. 2. View largeDownload slide Evaluation of APAtrap using RNA-seq data in human. (a) Overlapping of genes with differential APA site usage detected by APAtrap and DaPars or using real polyA-seq data. (b) Dynamic APA events between the Brain and UHR sample detected by APAtrap. Pearson product moment correlation coefficient is plotted against the log2 fold change between Brain and UHR. Genes with significant switching to longer 3′ UTRs in Brain and UHR are colored in blue and red, respectively. The inset bar plot denotes the number of genes with significantly longer 3′ UTRs in the respective sample. (c) As in b, except that DaPars was used. (d) Comparison of APAtrap and DaPars in recovering true poly(A) sites in genes with differential APA site usage. ‘APAtrap-distal PA’ means the numbers of identified and verified distal poly(A) sites inferred by APAtrap. ‘APAtrap-proximal PA’ denotes the numbers of genes with predicted and verified proximal poly(A) sites identified by APAtrap. Genes with at least one verified proximal poly(A) site are considered as verified. ‘DaPars-proximal PA’ is as in ‘APAtrap-proximal PA’, except that DaPars was used. (e) Overlapping of proximal poly(A) sites in genes with differential APA site usage identified by APAtrap and DaPars. (f) Difference of read coverage between upstream 100 bp and downstream 100 bp around proximal poly(A) sites identified by APAtrap and DaPars. The ratio of read coverage of downstream 100 bp to the upstream 100 bp for each poly(A) site was calculated Fig. 2. View largeDownload slide Evaluation of APAtrap using RNA-seq data in human. (a) Overlapping of genes with differential APA site usage detected by APAtrap and DaPars or using real polyA-seq data. (b) Dynamic APA events between the Brain and UHR sample detected by APAtrap. Pearson product moment correlation coefficient is plotted against the log2 fold change between Brain and UHR. Genes with significant switching to longer 3′ UTRs in Brain and UHR are colored in blue and red, respectively. The inset bar plot denotes the number of genes with significantly longer 3′ UTRs in the respective sample. (c) As in b, except that DaPars was used. (d) Comparison of APAtrap and DaPars in recovering true poly(A) sites in genes with differential APA site usage. ‘APAtrap-distal PA’ means the numbers of identified and verified distal poly(A) sites inferred by APAtrap. ‘APAtrap-proximal PA’ denotes the numbers of genes with predicted and verified proximal poly(A) sites identified by APAtrap. Genes with at least one verified proximal poly(A) site are considered as verified. ‘DaPars-proximal PA’ is as in ‘APAtrap-proximal PA’, except that DaPars was used. (e) Overlapping of proximal poly(A) sites in genes with differential APA site usage identified by APAtrap and DaPars. (f) Difference of read coverage between upstream 100 bp and downstream 100 bp around proximal poly(A) sites identified by APAtrap and DaPars. The ratio of read coverage of downstream 100 bp to the upstream 100 bp for each poly(A) site was calculated Based on these genes with differential APA site usage identified by both methods, we next compared the ability of APAtrap and DaPars in recovering true poly(A) sites. APAtrap predicted 454 poly(A) sites for the 161 genes, including 293 proximal poly(A) sites and 161 distal ones (Fig. 2d). Up to 86% (139/161) distal poly(A) sites were verified by the annotated poly(A) dataset. It should be noted that DaPars only predicts one proximal poly(A) site for each gene based on the annotated 3′ end, therefore no predicted distal poly(A) site is available from DaPars. Since APAtrap can predict multiple proximal poly(A) sites while DaPars identifies only one proximal site, for fair comparison of proximal poly(A) site prediction, we compared the number of genes with accurately identified proximal poly(A) sites. APAtrap identifies 161 genes with proximal poly(A) sites and 64% (103) of them had at least one accurately predicted proximal site (Fig. 2d). In contrast, DaPars identified 111 proximal poly(A) sites and 63% of them were verified by annotated poly(A) sites (Fig. 2d). APAtrap predicted higher number of true proximal sites than DaPars (129 versus 70) (Fig. 2e), which due to that APAtrap is capable of detecting multiple proximal sites. Meanwhile, APAtrap identified 157 proximal poly(A) sites that were not verified by annotated poly(A) site database. This is not surprising because the current poly(A) site database is far from complete (Lee et al., 2007; Wu et al., 2016) and the specific samples used in this study may not be presented in the database. What is surprising, however, is that only 32 proximal poly(A) sites were predicted by both methods, indicating the importance of using different tools for poly(A) site prediction. The significantly higher number of proximal poly(A) sites or genes with differential APA site usage identified by APAtrap than DaPars may be due to the following reasons. First, APAtrap can automatically refine annotated 3′ UTRs by the sliding window strategy to identify shortened or lengthened 3′ UTRs and novel 3′ UTRs before predicting poly(A) sites. Among the proximal poly(A) sites detected by APAtrap, 14 were located in newly identified 3′ UTR regions. Some examples are shown in Supplementary Figure S3. Second, APAtrap is more effective than DaPars in identifying all potential poly(A) sites rather than only the proximal one, which has been demonstrated in the simulation study (Fig. 1). APAtrap identified 230 proximal poly(A) sites which were missed by DaPars. Some examples are shown in Supplementary Figure S4. Third, APAtrap could estimate a more accurate PD than DaPars (Supplementary Fig. S2), which will better reflect the true difference between samples and boost up the sensitivity of the prediction. Some examples are shown in Supplementary Figure S5. There are 64 poly(A) sites that were only identified by DaPars. By close inspection of these cases, we found some interesting observations. First, DaPars identified poly(A) sites located in 3′ UTRs that were overlapped with other exons (Supplementary Fig. S6), while such cases would be discarded in APAtrap because of potential read mapping errors in overlapped regions. Second, the large gap of RNA-seq read coverage in some 3′ UTRs may lead to identifying shortened 3′ UTR by APAtrap, while DaPars uses the annotated 3′ UTR regardless of the read coverage in the whole 3′ UTR region (Supplementary Fig. S7). Although some poly(A) sites located beyond the shortened 3′ UTRs would be neglected by APAtrap, we prefer to use the refined 3′ UTR to increase the confidence of predicted poly(A) sites and using RNA-seq data with higher sequencing depth will also solve this problem. Third, APAtrap uses a different and more stringent criterion than DaPars to assess the statistical significance of differential APA site usage, which may lead to some poly(A) sites with high PD fail to pass the FDR cutoff. Some cases are shown in Supplementary Figure S8. However, in practical application, users can freely choose different combinations of PD and FDR cutoffs to select desirable number of genes with differential APA site usages. By manual inspection, we also found some special cases that some poly(A) sites predicted by DaPars were located near authentic ones while no significant read coverage change was observed around the predicted poly(A) sites (Supplementary Fig. S9). In contrast, there is an obvious read coverage drop-off around the poly(A) site predicted by APAtrap. Indeed, the ratio of read coverage between downstream 100 bp and upstream 100 bp around poly(A) sites predicted by APAtrap was much lower than that by DaPars (Wilcoxon test, P-value = 3.77 × 10−4, Fig. 2f). Despite of different models used in DaPars and APAtrap, both methods infer poly(A) sites by detecting significant changes of read coverage, therefore we suspected that this kind of poly(A) sites predicted by DaPars without significant change of read coverage were actually false predictions even they were located near true poly(A) sites by chance. 3.3 Flexibility of using APAtrap on different species To demonstrate the flexibility of using APAtrap on different species, next we identified APA sites and differential APA events in the model plant Arabidopsis thaliana using RNA-seq data from control and mild drought conditions (Clauw et al., 2015). Parameters of APAtrap could be fine-tuned to adapt to species with different genome size and 3′ UTR length. Here we set both the window size in the sliding window strategy for detecting distal poly(A) site and the minimum distance between two poly(A) sites as 50 bp. Using the RNA-seq data as an input, APAtrap identified 224 genes with differential APA site usage, while DaPars identified 152 genes (FDR ≤ 0.05, PD ≥ 0.4) (Fig. 3a). A moderate number of genes (62) were detected by both methods. APAtrap predicted 450 proximal poly(A) sites from the 224 genes. Among them, ∼41% (184) poly(A) sites were located within 50 bp of annotated poly(A) sites, and ∼62% (104) genes had at least one accurately predicted proximal poly(A) site (Fig. 3b). In contrast, ∼37% proximal poly(A) sites predicted by DaPars were verified by annotated poly(A) sites. Generally, APAtrap identified more dynamic APA events than DaPars and more proximal poly(A) sites predicted by APAtrap were verified by annotated poly(A) site database, which demonstrates the efficiency of APAtrap for different species. Surprisingly, the gene ontology analysis showed that genes with differential APA site usage identified by APAtrap and DaPars were enriched in distinct groups of functions (Fig. 3c). However, gene ontology results from APAtrap such as peroxisome organization, lipid oxidation, lipid modification and fatty acid beta-oxidation have been demonstrated to be related to drought resistance in previous studies (Campo et al., 2014; Hameed et al., 2012). Again, this result demonstrates the effectiveness of APAtrap in identifying function related genes that are regulated by APA and respond to specific biological conditions. Fig. 3. View largeDownload slide Evaluation of APAtrap using RNA-seq data in Arabidopsis. (a) Overlapping of genes with differential APA site usage between control and mild drought conditions identified by APAtrap and DaPars. (b) Comparison of APAtrap and DaPars in recovering true proximal poly(A) sites in genes with differential APA site usage. ‘APAtrap-proximal PA’ means the numbers of identified and verified proximal poly(A) sites inferred by APAtrap. ‘APAtrap-proximal PA/Gene’ denotes the numbers of genes with predicted and verified proximal poly(A) sites identified by APAtrap. Genes with at least one verified proximal poly(A) site are considered as verified. ‘DaPars-proximal PA’ is as in ‘APAtrap-proximal PA/Gene’, except that DaPars was used. (c) Gene ontology (GO) enrichment analysis using genes with differential APA site usage identified by APAtrap and DaPars. Shown are the enriched GO terms in the biological process (BP) and cellular component (CC) categories Fig. 3. View largeDownload slide Evaluation of APAtrap using RNA-seq data in Arabidopsis. (a) Overlapping of genes with differential APA site usage between control and mild drought conditions identified by APAtrap and DaPars. (b) Comparison of APAtrap and DaPars in recovering true proximal poly(A) sites in genes with differential APA site usage. ‘APAtrap-proximal PA’ means the numbers of identified and verified proximal poly(A) sites inferred by APAtrap. ‘APAtrap-proximal PA/Gene’ denotes the numbers of genes with predicted and verified proximal poly(A) sites identified by APAtrap. Genes with at least one verified proximal poly(A) site are considered as verified. ‘DaPars-proximal PA’ is as in ‘APAtrap-proximal PA/Gene’, except that DaPars was used. (c) Gene ontology (GO) enrichment analysis using genes with differential APA site usage identified by APAtrap and DaPars. Shown are the enriched GO terms in the biological process (BP) and cellular component (CC) categories 3.4 Using APAtrap to identify novel 3′ UTRs and 3′ UTR extensions One feature of APAtrap is the identification of novel 3′ UTRs and 3′ UTR extensions, which could be useful in improving 3′ UTR annotations in genomes lacking complete information of 3′ UTRs. Here we refined 3′ UTRs in Arabidopsis using the same RNA-seq data in the above study. To focus on novel 3′ UTRs, a minimum length change of 50 bp is required. A total of 1417 confident 3′ UTRs, including 548 novel 3′ UTRs and 869 3′ UTR extensions, were identified, comprising ∼0.24 Mb of unannotated sequence. The average length is 276 nt (Fig. 4a), far above the average 3′ UTR length of annotated genes (216 nt). These 3′ UTRs are covered by an average of 39 RNA-seq reads depth, demonstrating their high quality. Up to 1694 annotated poly(A) sites supported by an average of 28 reads depth which are located in 1417 newly identified 3′ UTR portions. Particularly, seven 3′ UTRs extended more than 1000 nt, which is similar to the observation in the previous study that some exceptionally long 3′ UTRs (many >10 kb and some >18 kb in length) were found in mammals (Miura et al., 2013). As we used limited RNA-seq data in specific samples and discarded 3′ UTRs that failed to meet our stringent criteria, the scope of 3′ UTRs or 3′ UTR extensions is likely even larger than we currently annotated. Poly(A) sites that originated from novel 3′ UTR regions bore single nucleotide profiles that were similar to those located in annotated 3′ UTRs (Fig. 4b), demonstrating the confidence of the newly identified 3′ UTRs. Fig. 4. View largeDownload slide Novel 3′ UTRs and 3′ UTR extensions in Arabidopsis identified by APAtrap. (a) Length distribution of novel 3′ UTRs and 3′ UTR extensions. (b) Nucleotide compositions of the sequences surrounding poly(A) sites located in novel 3′ UTRs. Y-axis denotes the fractional nucleotide content at each position. On the X-axis, ‘3′ denotes the poly(A) site. (c) Number of genes varies with changes in poly(A) site number after 3′ UTR extension. (d) Comparison of the number of genes with different number of poly(A) sites before or after 3′ UTR extension Fig. 4. View largeDownload slide Novel 3′ UTRs and 3′ UTR extensions in Arabidopsis identified by APAtrap. (a) Length distribution of novel 3′ UTRs and 3′ UTR extensions. (b) Nucleotide compositions of the sequences surrounding poly(A) sites located in novel 3′ UTRs. Y-axis denotes the fractional nucleotide content at each position. On the X-axis, ‘3′ denotes the poly(A) site. (c) Number of genes varies with changes in poly(A) site number after 3′ UTR extension. (d) Comparison of the number of genes with different number of poly(A) sites before or after 3′ UTR extension Previous studies (Seaver et al., 2014; Wu et al., 2014) extended each 3′ UTR for an arbitrary length according to the average length of annotated 3′ UTRs to estimate the fraction of genes with multiple poly(A) sites. Consequently, the APA extent highly depends on the newly assigned 3′ UTRs to count the occurrence of poly(A) sites within 3′ UTRs, which may lead to misestimate of the APA extent because of the deviation of real 3′ UTRs and assigned 3′ UTRs. Here we recruited the 3′ UTRs identified by APAtrap to compare the APA extent between original annotation and extended annotation in Arabidopsis. To this end, we used 1417 genes that have newly discovered 3′ UTRs as candidate dataset. Based on the original TAIR10 annotation of the Arabidopsis genome, 931 genes exhibit 2539 poly(A) sites. Using new 3′ UTRs, 463 genes remained unchanged number of poly(A) sites while 954 genes exhibited 1694 additional poly(A) sites in their new 3′ UTR regions (Fig. 4c). The number of genes with poly(A) sites after assigning new 3′ UTR rises from 931 to 1287. Without new 3′ UTRs, 363 and 568 genes possess one or more than one poly(A) site, respectively. In contrast, 344 genes have a single poly(A) site, 943 genes have multiple poly(A) sites with the addition of 3′ UTRs (Fig. 4d). Utilizing new 3′ UTRs, the APA extent is increased from ∼40 to ∼67%, and 954 genes show the increase of the number of poly(A) sites. If genes were added 3′ UTR with a fixed length as in the previous study (Wu et al., 2014), the APA extent became 57%. It is interesting to note that the fraction of APA genes showed significant difference between using new 3′ UTRs identified by our protocol and using 3′ UTRs extended with a fixed length (Chi-squared test, P-value = 9.4e-7). Therefore, with increasing availability of RNA-seq data, using 3′ UTRs identified by our protocol would fully exploit the power of RNA-seq for a better estimate of the occurrence of poly(A) sites within each putative 3′ UTR. 4 Discussion Transcriptional profiling via deep sequencing has underscored the complexity and dynamics of APA across tissues and different stages of cells. Computational identification of APA sites has been a major goal in the field of mRNA 3′ processing. Although approaches based on 3′ end sequencing are capable of locating precise poly(A) sites, they are costly and technically demanding, limiting their application in wider range of species. As RNA-seq has become a routine protocol for transcriptome analysis, it is of great importance to leverage newer computational methods to quantify APA dynamics based on RNA-seq data. However, research progress in this area has been relatively limited and only a few tools are available for the identification of poly(A) sites from RNA-seq data. Cufflinks (Trapnell et al., 2012) and 3USS (Le Pera et al., 2015) are methods that use transcript assembly to determine transcript 3′ ends, however, many distal poly(A) sites are neglected by methods based on transcriptome assembly (Miura et al., 2013). ContextMap 2 (Bonfert and Friedel, 2017) and Kleat (Birol et al., 2015) can identify poly(A) sites by mapping poly(A) reads from RNA-seq. Although using poly(A) read mapping provides direct evidence for polyadenylation, a very small fraction of RNA-seq reads contain poly(A) tails, leading to low sensitivity of these methods in identifying APA events. In contrast to methods that employ poly(A) reads, other methods, like DaPars (Xia et al., 2014) and ChangePoint (Wang et al., 2014), identify APA sites based on the change of RNA-seq read coverage in terminal exons. Although these approaches have achieved reasonable performance in identifying APA events from RNA-seq data, most of them can neither identify more than two poly(A) sites in a gene nor detect APA-site switching genes considering more than two poly(A) sites. Moreover, they are not suitable for identifying APA events from more than two samples as comparison of only two samples are allowed. As more data are being sequenced and increasing number of genes are found to have more than two poly(A) sites, more sophisticated approaches that are flexible in identifying all potential APA sites and dynamic APA events in more than two samples are in high demand. To overcome the limitations associated with these pioneering studies, we proposed an approach termed APAtrap which is devoted to the accurate identification of APA sites, novel 3′ UTRs or extensions, and differential APA site usage from RNA-seq data. The proposed method has several desirable properties. First, APAtrap is capable of identifying novel 3′ UTRs and 3′ UTR extensions, which contributes to locating potential poly(A) sites in previously overlooked regions and improving the genome annotation. Second, unlike many other methods limiting in identifying only one proximal poly(A) site in each annotated 3′ UTR, APAtrap aims to predict all potential poly(A) sites, which represents a major step towards a more comprehensive profiling of APA from RNA-seq data. Third, extensive analyses of various RNA-seq datasets from human and Arabidopsis demonstrate the flexibility of APAtrap for different organisms by tuning model parameters. Moreover, APAtrap can handle batch samples (>2) in one time, which facilitates large-scale data analysis and avoids the inconsistency of results from different pair-wise comparisons. In addition, APAtrap uses generic files (e.g. SAM format) from standard read mapping, researchers can easily integrate it into their analysis pipeline for detecting poly(A) sites or refining 3′ UTRs. APAtrap could also be combined with other tools, like Cufflinks, MISO, or Kleat, which can provide additional or consensus evidence for 3′ UTR annotations or APA events identified by different approaches. We comprehensively evaluated the performance of APAtrap using both simulated and real data and compared it to other competing approaches, including ChangePoint (Wang et al., 2014) and DaPars (Xia et al., 2014). The results showed that APAtrap is powerful in identifying functionally important APA events and dynamic APA usage in different biological conditions and achieves considerably higher accuracy than the competing approaches. The better performance of APAtrap may be attributed to the following reasons. First, the sliding window strategy adopted in APAtrap with stringent conditions enables accurately finding potential poly(A) site regions with substantial change of read coverage and estimating the number of poly(A) sites. The subsequent mean squared error model used in APAtrap for poly(A) site prediction is less complex but more efficient than models used in the competing methods, such as the regression model in DaPars and the likelihood ratio test in ChangePoint, demonstrating bright prospect in using APAtrap for large-scale analysis. On a practical note, ChangePoint is much more time consuming than APAtrap and DaPars for relatively large data, which limits its application in large-scale analysis. Second, the proposed PD index considers the effect of 3′ UTR length and is capable of quantifying the difference of APA usage in genes with more than two poly(A) sites. In contrast, the ΔPDUI used in DaPars and ODD used in ChangePoint considers only the proximal and distal poly(A) sites. We need to point out that ODD is not an appropriate metric because it uses the absolute read count in 3′ UTR regions while fails to exclude the effect of 3′ UTR length as longer 3′ UTRs tend to have more reads. Another disadvantage of ODD is that it is too sensitive to read count and slight difference of the read count in different poly(A) regions would result in a relatively large ODD. Results of the simulation studies clearly indicate the importance of using an appropriate metric in detecting differential APA site usage (Fig. 1). Third, as demonstrated in the simulation study for one to four poly(A) sites, incorporating information of additional poly(A) sites rather than only the proximal and distal ones would considerably improve the accuracy of dynamic APA usage detection. The ability of APAtrap to identify all potential poly(A) sites in refined 3′ UTRs would definitely contribute to the better performance. Fourth, the linear trend test used in APAtrap for testing significance in difference of APA usage between samples has an advantage over independent test such as the Fisher’s test in DaPars. The linear trend test allows more than two poly(A) sites and considers both the length and the expression level of each 3′ UTR. In contrast, independent test used in DaPars only considers the proximal and distal poly(A) sites and neglects the length of 3′ UTRs. We demonstrated the usefulness and applicability of APAtrap by analyzing experimental RNA-seq data in Arabidopsis and human. As expected, APAtrap identifies more poly(A) sites and dynamic APA events than DaPars. Poly(A) sites only predicted by APAtrap are shown to be highly confident in that many of them are overlapping with annotated poly(A) sites. Despite that there are some predicted poly(A) sites not supported by the annotated poly(A) database, these sites may be sample specific which are exclusively presented in specific tissues or conditions and have not been collected due to the incompleteness of current annotated database. Although continuous efforts to curate the genome annotation in different species are being made, the genome annotation for many species is far from complete. Particularly, the genome-wide annotation of 3′ UTRs received much less attention than coding regions, where genes in many genome assemblies exhibit no annotated 3′ UTR. Moreover, it has also been noted that the terminal exon annotations inferred from RNA-seq data by different transcriptome assembly tools can be inaccurate and incomplete (Miura et al., 2013; Shenker et al., 2015). Recent genomic studies also documented widespread 3′ UTR extensions. For example, data from RNA-seq revealed 2035 mouse and 1847 human genes that utilize substantially distal novel 3′ UTRs (many >10 kb) (Miura et al., 2013). Neural specific 3′ UTR extensions have also been revealed in Drosophila (Smibert et al., 2012) and zebrafish (Ulitsky et al., 2012), which may contribute to post transcriptional regulation underlying specific neuronal functions (Oktaba et al., 2015). With the tsunami of new genomes and RNA-seq data being sequenced, it would be highly desirable to improve the accuracy of 3′ UTR annotations from RNA-seq data. Using APAtrap to refine annotated 3′ UTRs and identify novel 3′ UTR extensions might benefit the cause. The novel 3′ UTRs in Arabidopsis inferred by APAtrap were verified to share the same characteristics as annotated 3′ UTRs, suggesting the effectiveness of APAtrap in 3′ UTR identification. Moreover, in recent genomic studies on polyadenylation, it is common to extend annotated 3′ UTR or gene model by a fixed length to capture uncharacterized poly(A) sites or transcript extensions, e.g. 5 or 1 kb for mammals (Derti et al., 2012; Lianoglou et al., 2013), 500 bp for rice (Shen et al., 2008), 218 bp for Arabidopsis (Wu et al., 2011) and 600 bp for Medicago truncatula (Wu et al., 2014). Despite its simplicity and straightforwardness in improving the ‘recovery’ of poly(A) sites that may fall in authentic 3′ UTRs, the same extended length was set for all annotated genes, neglecting the variation of 3′ UTR length in different genes. In order to get more accurate estimate of annotated poly(A) sites for each gene, a better approach is to revise the 3′ UTR regions of individual genes with new methods as the one described in this study. The 3′ UTRs of mRNAs contain regulatory elements such as microRNA-binding sites, which could affect mRNA stability or translation efficiency. Inclusion of refined 3′ UTRs detected by APAtrap could provide a more extensive reference database for the searching of potential cis-elements binding to specific 3′ UTRs, which would help fully understand the functional roles of 3′ UTRs. Funding This work was supported by the National Natural Science Foundation of China (61673323 to X.W.), Fundamental Research Funds for the Central Universities in China from Xiamen University (20720170076, to C.Y.), Natural Science Foundation of Fujian Province of China (2017J01068 to X.W.), scholarship from China Scholarship Council (201606315010 to X.W.), U.S. NSF (IOS-154173 to Q.Q.L.), Chinese Ministry of Science and Technology (2016YFE0108800 to Q.Q.L.). Conflict of Interest: none declared. References Batra R. et al. . ( 2015 ) Global insights into alternative polyadenylation regulation . RNA Biol ., 12 , 597 – 602 . Google Scholar CrossRef Search ADS PubMed Benjamini Y. , Hochberg Y. ( 1995 ) Controlling the false discovery rate: a practical and powerful approach to multiple testing . J. R. Stat. Soc. Ser. B (Methodological) , 57 , 289 – 300 . Berkovits B.D. , Mayr C. ( 2015 ) Alternative 3′ UTRs act as scaffolds to regulate membrane protein localization . Nature , 522 , 363 – 367 . Google Scholar CrossRef Search ADS PubMed Birol I. et al. . ( 2015 ) Kleat: cleavage site analysis of transcriptomes . Pac. Symp. Biocomput ., 347 – 358 . Blazie S.M. et al. . ( 2015 ) Comparative RNA-Seq analysis reveals pervasive tissue-specific alternative polyadenylation in Caenorhabditis elegans intestine and muscles . BMC Biol ., 13 , 4 . Google Scholar CrossRef Search ADS PubMed Bonfert T. , Friedel C.C. ( 2017 ) Prediction of poly(A) sites by poly(A) read mapping . Plos One , 12 , e0170914. Google Scholar CrossRef Search ADS PubMed Bullard J.H. et al. . ( 2010 ) Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments . BMC Bioinformatics , 11 , 94 . Google Scholar CrossRef Search ADS PubMed Campo S. et al. . ( 2014 ) Overexpression of a calcium-dependent protein kinase confers salt and drought tolerance in rice by preventing membrane lipid peroxidation . Plant Physiol ., 165 , 688 – 704 . Google Scholar CrossRef Search ADS PubMed Chang H. et al. . ( 2014 ) TAIL-seq: genome-wide determination of poly(A) tail length and 3′ end modifications . Mol. Cell ., 53 , 1044 – 1052 . Google Scholar CrossRef Search ADS PubMed Clauw P. et al. . ( 2015 ) Leaf responses to mild drought stress in natural variants of Arabidopsis . Plant Physiol ., 167 , 800 – 816 . Google Scholar CrossRef Search ADS PubMed Derti A. et al. . ( 2012 ) A quantitative atlas of polyadenylation in five mammals . Genome Res ., 22 , 1173 – 1183 . Google Scholar CrossRef Search ADS PubMed Di Giammartino D.C. et al. . ( 2011 ) Mechanisms and consequences of alternative polyadenylation . Mol. Cell , 43 , 853 – 866 . Google Scholar CrossRef Search ADS PubMed Fu H. et al. . ( 2016 ) Genome-wide dynamics of alternative polyadenylation in rice . Genome Res ., 26 , 1753 – 1760 . Google Scholar CrossRef Search ADS PubMed Fu Y. et al. . ( 2011 ) Differential genome-wide profiling of tandem 3′ UTRs among human breast cancer and normal cells by high-throughput sequencing . Genome Res ., 21 , 741 – 747 . Google Scholar CrossRef Search ADS PubMed Griebel T. et al. . ( 2012 ) Modelling and simulating generic RNA-Seq experiments with the flux simulator . Nucleic Acids Res ., 40 , 10073 – 10083 . Google Scholar CrossRef Search ADS PubMed Gruber A.J. et al. . ( 2016 ) A comprehensive analysis of 3′ end sequencing data sets reveals novel polyadenylation signals and the repressive role of heterogeneous ribonucleoprotein C on cleavage and polyadenylation . Genome Res ., 26 , 1145 – 1159 . Google Scholar CrossRef Search ADS PubMed Gruber A.R. et al. . ( 2014 ) Global 3′ UTR shortening has a limited effect on protein abundance in proliferating T cells . Nat. Commun ., 5 , 5465 . Google Scholar CrossRef Search ADS PubMed Gupta I. et al. . ( 2014 ) Alternative polyadenylation diversifies post-transcriptional regulation by selective RNA-protein interactions . Mol. Syst. Biol ., 10 , 719. Google Scholar CrossRef Search ADS PubMed Hameed A. et al. . ( 2012 ) Drought induced programmed cell death and associated changes in antioxidants, proteases, and lipid peroxidation in wheat leaves . Biol. Plant , 57 , 370 – 374 . Google Scholar CrossRef Search ADS Han T. , Kim J.K. ( 2014 ) Driving glioblastoma growth by alternative polyadenylation . Cell Res ., 24 , 1023 – 1024 . Google Scholar CrossRef Search ADS PubMed Harrison P.F. et al. . ( 2015 ) PAT-seq: a method to study the integration of 3′-UTR dynamics with gene expression in the eukaryotic transcriptome . RNA , 21 , 1502 – 1510 . Google Scholar CrossRef Search ADS PubMed Hoque M. et al. . ( 2013 ) Analysis of alternative cleavage and polyadenylation by 3′ region extraction and deep sequencing . Nat. Methods , 10 , 133 – 139 . Google Scholar CrossRef Search ADS PubMed Ji G. et al. . ( 2015 ) Genome-wide identification and predictive modeling of polyadenylation sites in eukaryotes . Brief. Bioinf ., 16 , 304 – 313 . Google Scholar CrossRef Search ADS Katz Y. et al. . ( 2010 ) Analysis and design of RNA sequencing experiments for identifying isoform regulation . Nat. Methods , 7 , 1009 – 1015 . Google Scholar CrossRef Search ADS PubMed Le Pera L. et al. . ( 2015 ) 3USS: a web server for detecting alternative 3′ UTRs from RNA-seq experiments . Bioinformatics , 31 , 1845 – 1847 . Google Scholar CrossRef Search ADS PubMed Lee J.Y. et al. . ( 2007 ) PolyA_DB 2: mRNA polyadenylation sites in vertebrate genes . Nucleic Acids Res ., 35 , D165 – D168 . Google Scholar CrossRef Search ADS PubMed Lianoglou S. et al. . ( 2013 ) Ubiquitously transcribed genes use alternative polyadenylation to achieve tissue-specific expression . Genes Dev ., 27 , 2380 – 2396 . Google Scholar CrossRef Search ADS PubMed Lim J. et al. . ( 2016 ) mTAIL-seq reveals dynamic poly(A) tail regulation in oocyte-to-embryo development . Genes Dev ., 30 , 1671 – 1682 . Google Scholar CrossRef Search ADS PubMed Mayr C. ( 2016 ) Evolution and biological roles of alternative 3′ UTRs . Trends Cell Biol ., 26 , 227 – 237 . Google Scholar CrossRef Search ADS PubMed Miura P. et al. . ( 2013 ) Widespread and extensive lengthening of 3′ UTRs in the mammalian brain . Genome Res ., 23 , 812 – 825 . Google Scholar CrossRef Search ADS PubMed Oktaba K. et al. . ( 2015 ) ELAV links paused Pol II to alternative polyadenylation in the Drosophila nervous system . Mol. Cell , 57 , 341 – 348 . Google Scholar CrossRef Search ADS PubMed Park J.E. et al. . ( 2016 ) Regulation of poly(A) tail and translation during the somatic cell cycle . Mol. Cell , 62 , 462 – 471 . Google Scholar CrossRef Search ADS PubMed Pickrell J.K. et al. . ( 2010 ) Understanding mechanisms underlying human gene expression variation with RNA sequencing . Nature , 464 , 768 – 772 . Google Scholar CrossRef Search ADS PubMed Seaver S.M.D. et al. . ( 2014 ) High-throughput comparison, functional annotation, and metabolic modeling of plant genomes using the PlantSEED resource . Proc. Natl. Acad. Sci. USA , 111 , 9645 – 9650 . Google Scholar CrossRef Search ADS Shen Y. et al. . ( 2008 ) Genome level analysis of rice mRNA 3′-end processing signals and alternative polyadenylation . Nucleic Acids Res ., 36 , 3150 – 3161 . Google Scholar CrossRef Search ADS PubMed Shenker S. et al. . ( 2015 ) IsoSCM: improved and alternative 3′ UTR annotation using multiple change-point inference . RNA , 21 , 14 – 27 . Google Scholar CrossRef Search ADS PubMed Smibert P. et al. . ( 2012 ) Global patterns of tissue-specific alternative polyadenylation in Drosophila . Cell Rep ., 1 , 277 – 289 . Google Scholar CrossRef Search ADS PubMed Tian B. , Manley J.L. ( 2013 ) Alternative cleavage and polyadenylation: the long and short of it . Trends Biochem. Sci ., 38 , 312 – 320 . Google Scholar CrossRef Search ADS PubMed Tian B. , Manley J.L. ( 2017 ) Alternative polyadenylation of mRNA precursors . Nat. Rev. Mol. Cell Biol ., 18 , 18 – 30 . Google Scholar CrossRef Search ADS PubMed Trapnell C. et al. . ( 2012 ) Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks . Nat. Protoc ., 7 , 562 – 578 . Google Scholar CrossRef Search ADS PubMed Ulitsky I. et al. . ( 2012 ) Extensive alternative polyadenylation during zebrafish development . Genome Res ., 22 , 2054 – 2066 . Google Scholar CrossRef Search ADS PubMed Wang W. et al. . ( 2014 ) A change-point model for identifying 3′ UTR switching by next-generation RNA sequencing . Bioinformatics , 30 , 2162 – 2170 . Google Scholar CrossRef Search ADS PubMed Wilkening S. et al. . ( 2013 ) An efficient method for genome-wide polyadenylation site mapping and RNA quantification . Nucleic Acids Res ., 41 , e65. Google Scholar CrossRef Search ADS PubMed Wu X. , Bartel D.P. ( 2017 ) Widespread influence of 3′-end structures on mammalian mRNA processing and stability . Cell , 169 , 905 – 917 . e911. Google Scholar CrossRef Search ADS PubMed Wu X. et al. . ( 2014 ) Genome-wide determination of poly(A) sites in Medicago truncatula: evolutionary conservation of alternative poly(A) site choice . BMC Genomics , 15 , 615 . Google Scholar CrossRef Search ADS PubMed Wu X. et al. . ( 2011 ) Genome-wide landscape of polyadenylation in Arabidopsis provides evidence for extensive alternative polyadenylation . Proc. Natl. Acad. Sci. USA , 108 , 12533 – 12538 . Google Scholar CrossRef Search ADS Wu X. et al. . ( 2016 ) PlantAPA: a portal for visualization and analysis of alternative polyadenylation in plants . Front. Plant Sci ., 7 , 1 – 14 . Google Scholar PubMed Xia Z. et al. . ( 2014 ) Dynamic analyses of alternative polyadenylation from RNA-seq reveal a 3′-UTR landscape across seven tumour types . Nat. Commun ., 5 , 1 – 13 . Xing Y. et al. . ( 2015 ) Evaluation of two statistical methods provides insights into the complex patterns of alternative polyadenylation site switching . PloS One , 10 , e0124324 . Google Scholar CrossRef Search ADS PubMed You L. et al. . ( 2014 ) APASdb: a database describing alternative poly(A) sites and selection of heterogeneous cleavage sites downstream of poly(A) signals . Nucleic Acids Res ., 43 , D59 – D67 . Google Scholar CrossRef Search ADS PubMed © The Author(s) 2018. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press

APAtrap: identification and quantification of alternative polyadenylation sites from RNA-seq data

Loading next page...
 
/lp/ou_press/apatrap-identification-and-quantification-of-alternative-T67Gea6KrS
Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
ISSN
1367-4803
eISSN
1460-2059
D.O.I.
10.1093/bioinformatics/bty029
Publisher site
See Article on Publisher Site

Abstract

Abstract Motivation Alternative polyadenylation (APA) has been increasingly recognized as a crucial mechanism that contributes to transcriptome diversity and gene expression regulation. As RNA-seq has become a routine protocol for transcriptome analysis, it is of great interest to leverage such unprecedented collection of RNA-seq data by new computational methods to extract and quantify APA dynamics in these transcriptomes. However, research progress in this area has been relatively limited. Conventional methods rely on either transcript assembly to determine transcript 3′ ends or annotated poly(A) sites. Moreover, they can neither identify more than two poly(A) sites in a gene nor detect dynamic APA site usage considering more than two poly(A) sites. Results We developed an approach called APAtrap based on the mean squared error model to identify and quantify APA sites from RNA-seq data. APAtrap is capable of identifying novel 3′ UTRs and 3′ UTR extensions, which contributes to locating potential poly(A) sites in previously overlooked regions and improving genome annotations. APAtrap also aims to tally all potential poly(A) sites and detect genes with differential APA site usages between conditions. Extensive comparisons of APAtrap with two other latest methods, ChangePoint and DaPars, using various RNA-seq datasets from simulation studies, human and Arabidopsis demonstrate the efficacy and flexibility of APAtrap for any organisms with an annotated genome. Availability and implementation Freely available for download at https://apatrap.sourceforge.io. Contact liqq@xmu.edu.cn or xhuister@xmu.edu.cn Supplementary information Supplementary data are available at Bioinformatics online. 1 Introduction Pre-mRNA cleavage and polyadenylation plays a key role in transcription termination, cytoplasmic localization, RNA stability and translation (Tian and Manley, 2017; Wu and Bartel, 2017). Recent transcriptome-wide studies have revealed that most eukaryotic genes produce distinct transcript isoforms through the usage of different polyadenylation sites, a mechanism called alternative polyadenylation (APA) (Di Giammartino et al., 2011; Tian and Manley, 2013). APA has been increasingly recognized as a crucial mechanism that contributes to the transcriptome complexity and gene expression regulation. The dominant class of APA events involves in the same terminal exon, leading to isoforms with variable 3′ UTR (3′ untranslated region) length. A small fraction of APA events occurs within internal introns/exons, producing mRNAs encoding different proteins. Diverse 3′ UTRs generated by APA modulate the availability of different cis-elements, which can regulate mRNA stability, translation efficiency and subcellular localization (Mayr, 2016; Tian and Manley, 2013). Importantly, dynamic usage of APA sites has been implicated in a variety of biological processes and diseases, including cell activation, growth, proliferation, cancer and neurodegenerative disorders (Batra et al., 2015; Berkovits and Mayr, 2015; Blazie et al., 2015; Di Giammartino et al., 2011; Gruber et al., 2014; Gupta et al., 2014; Han and Kim, 2014; Oktaba et al., 2015; Tian and Manley, 2013; Xia et al., 2014). Given the functional importance of APA, identification and quantification of APA site usages is critical and may be the first step towards deciphering the underlying mechanism of 3′ UTR-mediated gene regulation. Efforts has been made to capture 3′ ends of mRNAs by developing experimental protocols for profiling genome-wide APA sites [reviewed in Ji et al. (2015)], such as PAT-seq (Harrison et al., 2015), 3′ T-fill (Wilkening et al., 2013), 3′ READs (Hoque et al., 2013), TAIL-seq (Chang et al., 2014; Park et al., 2016) and mTAIL-seq (Lim et al., 2016). Although these approaches are powerful in providing the precise locations of poly(A) sites, they are costly, technically demanding, and most of them are hampered by inherent technical issues, such as internal priming artifacts. Limited number of poly(A) site databases are currently available, such as PolyA_DB 2 (Lee et al., 2007), APASdb (You et al., 2014), PlantAPA (Wu et al., 2016) and PolyAsite (Gruber et al., 2016), covering only a few species. In contrast, RNA-seq has become the standard technology for transcriptome profiling and has been applied in almost every large-scale genomics studies. Birol et al. (Birol et al., 2015) and Bonfert et al. (Bonfert and Friedel, 2017) developed methods for characterizing cleavage sites on 3′ UTRs in assembled RNA-seq data, by incorporating untemplated poly(A) sequence at their 3′ ends. Although RNA-seq reads containing poly(A) stretch can be directly used to identify poly(A) sites, the coverage of poly(A) tails by reads is very poor (Birol et al., 2015; Pickrell et al., 2010). Some tools are also available to identify novel 3′ UTRs by transcriptome assembly using RNA-seq data, such as Cufflinks (Trapnell et al., 2012), 3USS (Le Pera et al., 2015) and MISO (Katz et al., 2010). However, these tools may lead to truncation of long 3′ UTRs due to variable read depth and discontinuous coverage in large processed exons. Moreover, isoforms with short 3′ UTRs are frequently overlooked by transcript assembly tools like Cufflinks because short 3′ UTRs are often embedded within longer ones (Miura et al., 2013; Xia et al., 2014). MISO (Katz et al., 2010) can detect tandem 3′ UTRs, however, it relies on prior knowledge of annotated poly(A) sites, failing to identify novel poly(A) sites beyond existing poly(A) site databases. Therefore, these approaches that rely heavily on annotated poly(A) sites and/or transcriptome assembly results may not be precise or powerful because of incomplete information of known poly(A) sites and inherent limitations of transcript assembly tools. RNA-seq data for transcriptome profiling continue to accumulate in an unprecedented pace, being released for the benefit of all. Several tools are currently in progress to infer differential APA site usage between two conditions by detecting changes in RNA-seq read density. Wang et al. (Wang et al., 2014) proposed a change-point model (hereinafter referred as ChangePoint) based on a likelihood ratio test for detecting 3′ UTR switching. Xia et al. (Xia et al., 2014) developed a tool called DarPas to analyze dynamic regulation of APA through the comparison of RNA-seq read coverage between two conditions. However, each of these existing methods has its own limitations. For instance, ChangePoint is only applicable for detecting one change point in annotated 3′ UTR and relies on the annotated distal poly(A) sites to infer the proximal ones. Moreover, ChangePoint only supports pair-wise comparison between two samples, failing to aggregate results from more than two samples. DaPars can predict only one proximal poly(A) site based on a given annotated 3′ UTR and considers only the expression levels of the proximal and distal poly(A) sites for significance test in detecting genes with differential APA site usage. Consequently, it fails to detect other potential poly(A) sites and discards the information about the coordinates of poly(A) sites (or the length of alternative 3′ UTRs), which might reduce the power of significance assessment of dynamic APA events (Xing et al., 2015). Obviously, the main drawback of these two methods is that they can neither predict multiple proximal poly(A) sites nor detect dynamic APA site usage considering more than two poly(A) sites. Recent studies using 3′ end sequencing with increasing depth have revealed that more and more genes contain more than two poly(A) sites (Fu et al., 2011, 2016; Ji et al., 2015; Lianoglou et al., 2013; Ulitsky et al., 2012; Wu et al., 2014). Despite the availability of various methods, detecting all potential poly(A) sites and differential APA site usage remains challenging and there is still lot of scope to improve the accuracy of dynamic APA event identification and the ability to fully exploit the power of RNA-seq for profiling dynamics of APA. In this study, we have developed a tool called APAtrap to identify APA sites and detect changes in APA site usage by leveraging the resolution and huge abundance of RNA-seq data. APAtrap is capable of identifying novel 3′ UTRs or 3′ UTR extensions and can predict all potential poly(A) sites based on the refined 3′ UTRs. Furthermore, APAtrap can explore APA events with differential usage patterns among conditions by incorporating both coordinates and expression levels of all poly(A) sites in a gene. We compared APAtrap with two other latest methods, ChangePoint and DaPars, using both simulated data and real RNA-seq data from different species. Results show that APAtrap can identify APA sites and differential APA site usages with higher accuracy than the competing methods and can be adapted to genomes with different sizes by tuning model parameters. 2 Materials and methods 2.1 Identification of novel 3′ UTRs and 3′ UTR extensions A strategy similar to the previous study (Miura et al., 2013) was adopted to identify novel 3′ UTRs or 3′ UTR extensions (Supplementary Fig. S1a). First, the 3′ end regions were extracted from the genome annotation and extended by a pre-defined length. Then RNA-seq read enriched 3′ regions are identified by sliding a window in 1 bp increment starting from the 5′ start of the extended 3′ regions. The following criteria are utilized to identify precise 3′ end: (i) depths of the current window containing the candidate distal 3′ end and the following window should be less than the predefined coverage depth; (ii) coverage of the first base of the current window should be less than the predefined coverage depth, etc. The first base of the current window that meets these criteria is considered as the 3′ end. Finally, newly identified 3′ UTR regions are compared with the original genome annotation to obtain novel 3′ UTRs or 3′ UTR extensions. More details are described in Supplementary Methods. 2.2 Identification of APA sites Based on the refined 3′ UTRs identified in the previous step, distal poly(A) sites are then defined as the 3′ ends of these 3′ UTRs. First, a sliding window strategy is used to define potential number of poly(A) sites (details in Supplementary Methods). Then a mean squared error model was proposed to identify the precise positions of poly(A) sites for each gene, which seeks the minimum difference between the estimated read coverage and the real read coverage in the 3′ UTR from all samples. The loss function is, arg min⁡eik≥0∑i=1N∑j=1L(Ci,j−∑k=1MeikIjk)2/N×L (1) Here, M is the number of poly(A) sites in the whole 3′ UTR region in a given gene, which is the number of merged regions. L is the length of the longest 3′ UTR; N is the number of samples; Cij is the observed coverage depth of position j in sample i; Ijk =1 or 0 indicates whether the position j belongs to the kth 3′ UTR or not; eik is the estimated expression level of the kth 3′ UTR in sample i. Let Pk be the location of the kth poly(A) site, which is also the length of the kth 3′ UTR. The expression level of the longest 3′ UTR can be estimated as the mean read coverage of the fragment from PM - 1 to PM as this fragment only belongs to the longest 3′ UTR. eiM=∑j=PM−1+1LCji/(L−PM−1) (2) Because the kth (1 < k < M) 3′ UTR is shared by longer 3′ UTRs, its expression level is denoted as the mean read coverage of the fragment from Pk - 1 to Pk subtracted by the sum of expression levels of the (k + 1)th to the Mth 3′ UTRs. eik1<k<M=∑j=Pk−1+1Pk(Cji−∑m=k+1Meim)/(Pk−Pk−1) (3) Similarly, the expression level of the first 3′ UTR can be represented as, ei1=∑j=1P1(Cji−∑m=2Meim)/P1 (4) 2.3 Identification of dynamic APA site usage between two samples We incorporated a difference index and a linear trend test (Fu et al., 2011, 2016) to investigate the dynamic APA site usage between two samples (Supplementary Fig. S1c). First, the index, termed PD (percentage difference), is used to quantify the difference of APA site usage between two samples in a given gene, which is defined as following: PD=∑k=1M|pak−pbk|2 (PD∈[0,1]) (5) Here, M is the number of poly(A) sites in a gene; a and b denote the two samples; pak represents the fraction of read coverage of poly(A) site k over all poly(A) sites of the gene in sample a. Here the read coverage for each poly(A) site is the estimated expression level of the corresponding 3′ UTR (Supplementary Fig. S1b). Next, the linear trend test based on the Pearson product moment correlation coefficient (Fu et al., 2011, 2016) was adopted to examine significant changes of APA site usage in a gene between two samples. Benjamini-Hochberg method (Benjamini and Hochberg, 1995) is used to estimate the FDR (false discovery rate). By default, we consider genes with PD higher than a threshold and with FDR below a cutoff as genes with differential APA site usage. More details are described in Supplementary Methods. 2.4 Validation We used both simulated data and real RNA-seq data from human and Arabidopsis to evaluate the results of APAtrap. The simulated RNA-seq data of 50 bp paired-end reads with a 150 bp fragment size were generated by Flux Simulator (Griebel et al., 2012) based on the human genome annotation and a predefined library size. Genes containing different number of isoforms with different 3′ UTR lengths were simulated. For real data analysis in human, we used RNA-seq data from the Human Brain Reference (hereinafter Brain) and the Universal Human Reference (UHR) MAQC samples (Bullard et al., 2010). The polyA-seq data from the same samples collected from UCSC browser (http://genome.ucsc.edu/) were used to detect genes with differential APA site usage as the true reference for evaluation. Annotated poly(A) sites from multiple resources were compiled to form a large reference dataset for verification, including NCBI Refseq database (https://www.ncbi.nlm.nih.gov/refseq/), Ensemble database (http://www.ensembl.org/Homo_sapiens/Info/Index), UCSC genome browser (http://genome.ucsc.edu/), PolyAsite database (http://www.polyasite.unibas.ch/), GENCODE poly(A) datasets (http://www.gencodegenes.org/releases/19.html) and PolyA_DB 2 (Lee et al., 2007). For Arabidopsis, two RNA-Seq samples subjected to control and mild drought conditions were used (Clauw et al., 2015). Annotated poly(A) sites in Arabidopsis were downloaded from PlantAPA database (Wu et al., 2016), which were collected from leaf, seed and root tissues of wild type (WT) and mutant oxt6. For the evaluation of performance of differential APA detection, we calculated the true index [PD or ODD (Wang et al., 2014)] from the test data and compared it with the value estimated by each tool. Smaller difference between the predicted index and the true index indicates better prediction and a differential APA event with both true index and estimated index above a given cutoff is considered as a true prediction. More details about the validation are described in Supplementary Methods. 3 Results 3.1 Evaluation of APAtrap using simulated data We conducted a simulation study to assess the performance of APAtrap using simulated RNA-seq data with predefined APA events. Since in many APA studies only proximal and distal poly(A) sites of a gene were considered when inferring dynamic APA site usage, we first simulated 1000 human genes with exact two isoforms. One isoform is from the annotated genome and the other is 1000 bp longer. The expression level of each isoform is randomly generated by Flux Simulator (Griebel et al., 2012), therefore the PD values between two samples of all the 1000 genes are uniformly distributed in [0, 1]. Using these simulated RNA-seq reads as inputs, we compared APAtrap with two other tools also based on RNA-seq read coverage change, ChangePoint (Wang et al., 2014) and DaPars (Xia et al., 2014). First, we calculated the correlation of the real index (PD or ODD) and the estimated index from each tool (Supplementary Fig. S2). APAtrap obtained the best correlation followed by DaPars, while ChangePoint presented much lower correlation, suggesting better ability of APAtrap and DaPars to recover the true difference in the dataset. Next, we evaluated the APA site prediction accuracy of different tools. Apparently, APAtrap and DaPars recovered significantly higher number of true poly(A) sites than ChangePoint, while APAtrap performed slightly better than DaPars (Fig. 1a). We also compared the performance of these three tools in the detection of dynamic APA events between two conditions. The accuracy and ROC (receiver-operating characteristic) curve at different cutoffs from 0 to 1 in an increment of 0.05 for each tool were reported. The accuracy of APAtrap was the highest no matter which cutoff was applied (Fig. 1b). APAtrap and DaPars provided more stable results than ChangePoint regardless of the cutoff, indicating the robustness of these two methods and the desirability of the PD metric. APAtrap also provided the best performance with the highest areas under ROC curve (AUC: APAtrap = 0.95, DaPars = 0.90 and ChangePoint = 0.67) among the three tools (Fig. 1c). Apparently, ChangePoint performed the worst with considerably higher false positive rate (FPR) but lower true positive rate (TPR). In contrast, APAtrap provided the best TPR with low FPR especially at small cutoff while the TPR of DaPars at small cutoff was slighter lower than that of APAtrap. Fig. 1. View largeDownload slide Evaluation of performance of APAtrap using simulated data. (a) Fraction of recovered poly(A) sites using different methods in the two-isoform study. If a predicted poly(A) site is located within a predefined distance from any annotated poly(A) site, then the prediction is considered as a true prediction. Cutoffs from 0 bp to 100 bp in an increment of 10 bp were set. (b) Accuracy of detecting differential APA site usage using different methods in the two-isoform study. A differential APA event with both true index and estimated index above a given cutoff is considered as a true prediction. Cutoffs from 0 to 1 in an increment of 0.05 were set. (c) ROC curves of dynamic APA event detection using different methods in the two-isoform study. (d–f) As in a–c, except that genes with one to four isoforms were used Fig. 1. View largeDownload slide Evaluation of performance of APAtrap using simulated data. (a) Fraction of recovered poly(A) sites using different methods in the two-isoform study. If a predicted poly(A) site is located within a predefined distance from any annotated poly(A) site, then the prediction is considered as a true prediction. Cutoffs from 0 bp to 100 bp in an increment of 10 bp were set. (b) Accuracy of detecting differential APA site usage using different methods in the two-isoform study. A differential APA event with both true index and estimated index above a given cutoff is considered as a true prediction. Cutoffs from 0 to 1 in an increment of 0.05 were set. (c) ROC curves of dynamic APA event detection using different methods in the two-isoform study. (d–f) As in a–c, except that genes with one to four isoforms were used Recent studies with deeper sequencing coverage have uncovered increasing number of poly(A) sites than expected. Here we conducted another simulation study that is not limited for genes with exact two isoforms but for genes with one to four isoforms. It should be noted that DaPars considers the end of the annotated 3′ UTR as the distal poly(A) site and can only predict one proximal poly(A) site upstream of the distal site. Consequently, even for genes with more than two isoforms, DaPars can detect only one proximal poly(A) site rather than all potential poly(A) sites and then infer differential APA site usage based on the proximal and distal poly(A) sites. Unlike the correlation results for genes with only two isoforms for which both DaPars and APAtrap obtained high and comparable results (Supplementary Fig. S2a, b), APAtrap provided significantly higher correlation of true PD and estimated PD than DaPars (0.91 versus 0.54) (Supplementary Fig. S2d, e). Still, the ODD calculated by ChangePoint had poor correlation with true PD. Surprisingly, APAtrap obtained considerably high correlation in both simulation cases while the correlation of DaPars dropped substantially from 0.84 to 0.54 (Supplementary Fig. S2b, e), suggesting that APAtrap can achieve comparable results regardless of the number of poly(A) sites to predict. For APA site prediction, APAtrap recovered much more true poly(A) sites than DaPars and ChangePoint (Fig. 1d), demonstrating the efficiency of APAtrap in identifying multiple proximal poly(A) sites in 3′ UTRs. Regarding the metrics of ROC curve and accuracy, APAtrap was obviously superior to DaPars (Fig. 1d, e; AUC: APAtrap = 0.94, DaPars = 0.78). Particularly, compared to the performance in the previous simulation study on genes with exact two isoforms, the performance of DaPars decreased dramatically in identifying APA sites or differential APA events for genes with varied number of isoforms. In contrast, the loss in performance of APAtrap was not remarkable, which indicates the importance of using all potential poly(A) sites rather than only proximal and distal ones for effectively detecting dynamic APA usage. These results demonstrate that APAtrap outperforms the competing methods ChangePoint and DaPars. Moreover, APAtrap is powerful in identifying APA sites and dynamic APA site usages between samples for genes with different number of poly(A) sites. 3.2 Application of APAtrap to RNA-seq data in human Next, we evaluated the performance of APAtrap using experimental RNA-seq data in human. APAtrap and DaPars identified 161 and 111 genes with differential APA site usage based on the same cutoff of FDR and PD (FDR ≤ 0.05, PD ≥ 0.2), respectively (Fig. 2a). Among them, 51 genes were predicted by both methods. Using polyA-seq data, 205 genes with differential APA site usage were identified. We noted that only a small number of genes predicted by both methods were validated by polyA-seq (30 from APAtrap, 22 from DaPars). This may be because that the real poly(A) site dataset from polyA-seq is far from complete and different sequencing strategies or sequencing depth were used to generate the RNA-seq data and polyA-seq data. This result also indicates that there is still lot of scope to improve the accuracy of dynamic APA event prediction from RNA-seq data. Up to 84.5% (136/161) genes identified by APAtrap and 78.4% (87/111) genes identified by DaPars prefer using longer 3′ UTR in the Brain sample, respectively (Fig. 2b, c). This result is consistent with recent reports that brain tissues normally have the longest 3′ UTRs (Miura et al., 2013). Overall, APAtrap identified more dynamic APA events and ‘longer events’ than DaPars, demonstrating the efficiency of APAtrap and the importance of considering all potential poly(A) sites for the detection of differential APA site usage. Fig. 2. View largeDownload slide Evaluation of APAtrap using RNA-seq data in human. (a) Overlapping of genes with differential APA site usage detected by APAtrap and DaPars or using real polyA-seq data. (b) Dynamic APA events between the Brain and UHR sample detected by APAtrap. Pearson product moment correlation coefficient is plotted against the log2 fold change between Brain and UHR. Genes with significant switching to longer 3′ UTRs in Brain and UHR are colored in blue and red, respectively. The inset bar plot denotes the number of genes with significantly longer 3′ UTRs in the respective sample. (c) As in b, except that DaPars was used. (d) Comparison of APAtrap and DaPars in recovering true poly(A) sites in genes with differential APA site usage. ‘APAtrap-distal PA’ means the numbers of identified and verified distal poly(A) sites inferred by APAtrap. ‘APAtrap-proximal PA’ denotes the numbers of genes with predicted and verified proximal poly(A) sites identified by APAtrap. Genes with at least one verified proximal poly(A) site are considered as verified. ‘DaPars-proximal PA’ is as in ‘APAtrap-proximal PA’, except that DaPars was used. (e) Overlapping of proximal poly(A) sites in genes with differential APA site usage identified by APAtrap and DaPars. (f) Difference of read coverage between upstream 100 bp and downstream 100 bp around proximal poly(A) sites identified by APAtrap and DaPars. The ratio of read coverage of downstream 100 bp to the upstream 100 bp for each poly(A) site was calculated Fig. 2. View largeDownload slide Evaluation of APAtrap using RNA-seq data in human. (a) Overlapping of genes with differential APA site usage detected by APAtrap and DaPars or using real polyA-seq data. (b) Dynamic APA events between the Brain and UHR sample detected by APAtrap. Pearson product moment correlation coefficient is plotted against the log2 fold change between Brain and UHR. Genes with significant switching to longer 3′ UTRs in Brain and UHR are colored in blue and red, respectively. The inset bar plot denotes the number of genes with significantly longer 3′ UTRs in the respective sample. (c) As in b, except that DaPars was used. (d) Comparison of APAtrap and DaPars in recovering true poly(A) sites in genes with differential APA site usage. ‘APAtrap-distal PA’ means the numbers of identified and verified distal poly(A) sites inferred by APAtrap. ‘APAtrap-proximal PA’ denotes the numbers of genes with predicted and verified proximal poly(A) sites identified by APAtrap. Genes with at least one verified proximal poly(A) site are considered as verified. ‘DaPars-proximal PA’ is as in ‘APAtrap-proximal PA’, except that DaPars was used. (e) Overlapping of proximal poly(A) sites in genes with differential APA site usage identified by APAtrap and DaPars. (f) Difference of read coverage between upstream 100 bp and downstream 100 bp around proximal poly(A) sites identified by APAtrap and DaPars. The ratio of read coverage of downstream 100 bp to the upstream 100 bp for each poly(A) site was calculated Based on these genes with differential APA site usage identified by both methods, we next compared the ability of APAtrap and DaPars in recovering true poly(A) sites. APAtrap predicted 454 poly(A) sites for the 161 genes, including 293 proximal poly(A) sites and 161 distal ones (Fig. 2d). Up to 86% (139/161) distal poly(A) sites were verified by the annotated poly(A) dataset. It should be noted that DaPars only predicts one proximal poly(A) site for each gene based on the annotated 3′ end, therefore no predicted distal poly(A) site is available from DaPars. Since APAtrap can predict multiple proximal poly(A) sites while DaPars identifies only one proximal site, for fair comparison of proximal poly(A) site prediction, we compared the number of genes with accurately identified proximal poly(A) sites. APAtrap identifies 161 genes with proximal poly(A) sites and 64% (103) of them had at least one accurately predicted proximal site (Fig. 2d). In contrast, DaPars identified 111 proximal poly(A) sites and 63% of them were verified by annotated poly(A) sites (Fig. 2d). APAtrap predicted higher number of true proximal sites than DaPars (129 versus 70) (Fig. 2e), which due to that APAtrap is capable of detecting multiple proximal sites. Meanwhile, APAtrap identified 157 proximal poly(A) sites that were not verified by annotated poly(A) site database. This is not surprising because the current poly(A) site database is far from complete (Lee et al., 2007; Wu et al., 2016) and the specific samples used in this study may not be presented in the database. What is surprising, however, is that only 32 proximal poly(A) sites were predicted by both methods, indicating the importance of using different tools for poly(A) site prediction. The significantly higher number of proximal poly(A) sites or genes with differential APA site usage identified by APAtrap than DaPars may be due to the following reasons. First, APAtrap can automatically refine annotated 3′ UTRs by the sliding window strategy to identify shortened or lengthened 3′ UTRs and novel 3′ UTRs before predicting poly(A) sites. Among the proximal poly(A) sites detected by APAtrap, 14 were located in newly identified 3′ UTR regions. Some examples are shown in Supplementary Figure S3. Second, APAtrap is more effective than DaPars in identifying all potential poly(A) sites rather than only the proximal one, which has been demonstrated in the simulation study (Fig. 1). APAtrap identified 230 proximal poly(A) sites which were missed by DaPars. Some examples are shown in Supplementary Figure S4. Third, APAtrap could estimate a more accurate PD than DaPars (Supplementary Fig. S2), which will better reflect the true difference between samples and boost up the sensitivity of the prediction. Some examples are shown in Supplementary Figure S5. There are 64 poly(A) sites that were only identified by DaPars. By close inspection of these cases, we found some interesting observations. First, DaPars identified poly(A) sites located in 3′ UTRs that were overlapped with other exons (Supplementary Fig. S6), while such cases would be discarded in APAtrap because of potential read mapping errors in overlapped regions. Second, the large gap of RNA-seq read coverage in some 3′ UTRs may lead to identifying shortened 3′ UTR by APAtrap, while DaPars uses the annotated 3′ UTR regardless of the read coverage in the whole 3′ UTR region (Supplementary Fig. S7). Although some poly(A) sites located beyond the shortened 3′ UTRs would be neglected by APAtrap, we prefer to use the refined 3′ UTR to increase the confidence of predicted poly(A) sites and using RNA-seq data with higher sequencing depth will also solve this problem. Third, APAtrap uses a different and more stringent criterion than DaPars to assess the statistical significance of differential APA site usage, which may lead to some poly(A) sites with high PD fail to pass the FDR cutoff. Some cases are shown in Supplementary Figure S8. However, in practical application, users can freely choose different combinations of PD and FDR cutoffs to select desirable number of genes with differential APA site usages. By manual inspection, we also found some special cases that some poly(A) sites predicted by DaPars were located near authentic ones while no significant read coverage change was observed around the predicted poly(A) sites (Supplementary Fig. S9). In contrast, there is an obvious read coverage drop-off around the poly(A) site predicted by APAtrap. Indeed, the ratio of read coverage between downstream 100 bp and upstream 100 bp around poly(A) sites predicted by APAtrap was much lower than that by DaPars (Wilcoxon test, P-value = 3.77 × 10−4, Fig. 2f). Despite of different models used in DaPars and APAtrap, both methods infer poly(A) sites by detecting significant changes of read coverage, therefore we suspected that this kind of poly(A) sites predicted by DaPars without significant change of read coverage were actually false predictions even they were located near true poly(A) sites by chance. 3.3 Flexibility of using APAtrap on different species To demonstrate the flexibility of using APAtrap on different species, next we identified APA sites and differential APA events in the model plant Arabidopsis thaliana using RNA-seq data from control and mild drought conditions (Clauw et al., 2015). Parameters of APAtrap could be fine-tuned to adapt to species with different genome size and 3′ UTR length. Here we set both the window size in the sliding window strategy for detecting distal poly(A) site and the minimum distance between two poly(A) sites as 50 bp. Using the RNA-seq data as an input, APAtrap identified 224 genes with differential APA site usage, while DaPars identified 152 genes (FDR ≤ 0.05, PD ≥ 0.4) (Fig. 3a). A moderate number of genes (62) were detected by both methods. APAtrap predicted 450 proximal poly(A) sites from the 224 genes. Among them, ∼41% (184) poly(A) sites were located within 50 bp of annotated poly(A) sites, and ∼62% (104) genes had at least one accurately predicted proximal poly(A) site (Fig. 3b). In contrast, ∼37% proximal poly(A) sites predicted by DaPars were verified by annotated poly(A) sites. Generally, APAtrap identified more dynamic APA events than DaPars and more proximal poly(A) sites predicted by APAtrap were verified by annotated poly(A) site database, which demonstrates the efficiency of APAtrap for different species. Surprisingly, the gene ontology analysis showed that genes with differential APA site usage identified by APAtrap and DaPars were enriched in distinct groups of functions (Fig. 3c). However, gene ontology results from APAtrap such as peroxisome organization, lipid oxidation, lipid modification and fatty acid beta-oxidation have been demonstrated to be related to drought resistance in previous studies (Campo et al., 2014; Hameed et al., 2012). Again, this result demonstrates the effectiveness of APAtrap in identifying function related genes that are regulated by APA and respond to specific biological conditions. Fig. 3. View largeDownload slide Evaluation of APAtrap using RNA-seq data in Arabidopsis. (a) Overlapping of genes with differential APA site usage between control and mild drought conditions identified by APAtrap and DaPars. (b) Comparison of APAtrap and DaPars in recovering true proximal poly(A) sites in genes with differential APA site usage. ‘APAtrap-proximal PA’ means the numbers of identified and verified proximal poly(A) sites inferred by APAtrap. ‘APAtrap-proximal PA/Gene’ denotes the numbers of genes with predicted and verified proximal poly(A) sites identified by APAtrap. Genes with at least one verified proximal poly(A) site are considered as verified. ‘DaPars-proximal PA’ is as in ‘APAtrap-proximal PA/Gene’, except that DaPars was used. (c) Gene ontology (GO) enrichment analysis using genes with differential APA site usage identified by APAtrap and DaPars. Shown are the enriched GO terms in the biological process (BP) and cellular component (CC) categories Fig. 3. View largeDownload slide Evaluation of APAtrap using RNA-seq data in Arabidopsis. (a) Overlapping of genes with differential APA site usage between control and mild drought conditions identified by APAtrap and DaPars. (b) Comparison of APAtrap and DaPars in recovering true proximal poly(A) sites in genes with differential APA site usage. ‘APAtrap-proximal PA’ means the numbers of identified and verified proximal poly(A) sites inferred by APAtrap. ‘APAtrap-proximal PA/Gene’ denotes the numbers of genes with predicted and verified proximal poly(A) sites identified by APAtrap. Genes with at least one verified proximal poly(A) site are considered as verified. ‘DaPars-proximal PA’ is as in ‘APAtrap-proximal PA/Gene’, except that DaPars was used. (c) Gene ontology (GO) enrichment analysis using genes with differential APA site usage identified by APAtrap and DaPars. Shown are the enriched GO terms in the biological process (BP) and cellular component (CC) categories 3.4 Using APAtrap to identify novel 3′ UTRs and 3′ UTR extensions One feature of APAtrap is the identification of novel 3′ UTRs and 3′ UTR extensions, which could be useful in improving 3′ UTR annotations in genomes lacking complete information of 3′ UTRs. Here we refined 3′ UTRs in Arabidopsis using the same RNA-seq data in the above study. To focus on novel 3′ UTRs, a minimum length change of 50 bp is required. A total of 1417 confident 3′ UTRs, including 548 novel 3′ UTRs and 869 3′ UTR extensions, were identified, comprising ∼0.24 Mb of unannotated sequence. The average length is 276 nt (Fig. 4a), far above the average 3′ UTR length of annotated genes (216 nt). These 3′ UTRs are covered by an average of 39 RNA-seq reads depth, demonstrating their high quality. Up to 1694 annotated poly(A) sites supported by an average of 28 reads depth which are located in 1417 newly identified 3′ UTR portions. Particularly, seven 3′ UTRs extended more than 1000 nt, which is similar to the observation in the previous study that some exceptionally long 3′ UTRs (many >10 kb and some >18 kb in length) were found in mammals (Miura et al., 2013). As we used limited RNA-seq data in specific samples and discarded 3′ UTRs that failed to meet our stringent criteria, the scope of 3′ UTRs or 3′ UTR extensions is likely even larger than we currently annotated. Poly(A) sites that originated from novel 3′ UTR regions bore single nucleotide profiles that were similar to those located in annotated 3′ UTRs (Fig. 4b), demonstrating the confidence of the newly identified 3′ UTRs. Fig. 4. View largeDownload slide Novel 3′ UTRs and 3′ UTR extensions in Arabidopsis identified by APAtrap. (a) Length distribution of novel 3′ UTRs and 3′ UTR extensions. (b) Nucleotide compositions of the sequences surrounding poly(A) sites located in novel 3′ UTRs. Y-axis denotes the fractional nucleotide content at each position. On the X-axis, ‘3′ denotes the poly(A) site. (c) Number of genes varies with changes in poly(A) site number after 3′ UTR extension. (d) Comparison of the number of genes with different number of poly(A) sites before or after 3′ UTR extension Fig. 4. View largeDownload slide Novel 3′ UTRs and 3′ UTR extensions in Arabidopsis identified by APAtrap. (a) Length distribution of novel 3′ UTRs and 3′ UTR extensions. (b) Nucleotide compositions of the sequences surrounding poly(A) sites located in novel 3′ UTRs. Y-axis denotes the fractional nucleotide content at each position. On the X-axis, ‘3′ denotes the poly(A) site. (c) Number of genes varies with changes in poly(A) site number after 3′ UTR extension. (d) Comparison of the number of genes with different number of poly(A) sites before or after 3′ UTR extension Previous studies (Seaver et al., 2014; Wu et al., 2014) extended each 3′ UTR for an arbitrary length according to the average length of annotated 3′ UTRs to estimate the fraction of genes with multiple poly(A) sites. Consequently, the APA extent highly depends on the newly assigned 3′ UTRs to count the occurrence of poly(A) sites within 3′ UTRs, which may lead to misestimate of the APA extent because of the deviation of real 3′ UTRs and assigned 3′ UTRs. Here we recruited the 3′ UTRs identified by APAtrap to compare the APA extent between original annotation and extended annotation in Arabidopsis. To this end, we used 1417 genes that have newly discovered 3′ UTRs as candidate dataset. Based on the original TAIR10 annotation of the Arabidopsis genome, 931 genes exhibit 2539 poly(A) sites. Using new 3′ UTRs, 463 genes remained unchanged number of poly(A) sites while 954 genes exhibited 1694 additional poly(A) sites in their new 3′ UTR regions (Fig. 4c). The number of genes with poly(A) sites after assigning new 3′ UTR rises from 931 to 1287. Without new 3′ UTRs, 363 and 568 genes possess one or more than one poly(A) site, respectively. In contrast, 344 genes have a single poly(A) site, 943 genes have multiple poly(A) sites with the addition of 3′ UTRs (Fig. 4d). Utilizing new 3′ UTRs, the APA extent is increased from ∼40 to ∼67%, and 954 genes show the increase of the number of poly(A) sites. If genes were added 3′ UTR with a fixed length as in the previous study (Wu et al., 2014), the APA extent became 57%. It is interesting to note that the fraction of APA genes showed significant difference between using new 3′ UTRs identified by our protocol and using 3′ UTRs extended with a fixed length (Chi-squared test, P-value = 9.4e-7). Therefore, with increasing availability of RNA-seq data, using 3′ UTRs identified by our protocol would fully exploit the power of RNA-seq for a better estimate of the occurrence of poly(A) sites within each putative 3′ UTR. 4 Discussion Transcriptional profiling via deep sequencing has underscored the complexity and dynamics of APA across tissues and different stages of cells. Computational identification of APA sites has been a major goal in the field of mRNA 3′ processing. Although approaches based on 3′ end sequencing are capable of locating precise poly(A) sites, they are costly and technically demanding, limiting their application in wider range of species. As RNA-seq has become a routine protocol for transcriptome analysis, it is of great importance to leverage newer computational methods to quantify APA dynamics based on RNA-seq data. However, research progress in this area has been relatively limited and only a few tools are available for the identification of poly(A) sites from RNA-seq data. Cufflinks (Trapnell et al., 2012) and 3USS (Le Pera et al., 2015) are methods that use transcript assembly to determine transcript 3′ ends, however, many distal poly(A) sites are neglected by methods based on transcriptome assembly (Miura et al., 2013). ContextMap 2 (Bonfert and Friedel, 2017) and Kleat (Birol et al., 2015) can identify poly(A) sites by mapping poly(A) reads from RNA-seq. Although using poly(A) read mapping provides direct evidence for polyadenylation, a very small fraction of RNA-seq reads contain poly(A) tails, leading to low sensitivity of these methods in identifying APA events. In contrast to methods that employ poly(A) reads, other methods, like DaPars (Xia et al., 2014) and ChangePoint (Wang et al., 2014), identify APA sites based on the change of RNA-seq read coverage in terminal exons. Although these approaches have achieved reasonable performance in identifying APA events from RNA-seq data, most of them can neither identify more than two poly(A) sites in a gene nor detect APA-site switching genes considering more than two poly(A) sites. Moreover, they are not suitable for identifying APA events from more than two samples as comparison of only two samples are allowed. As more data are being sequenced and increasing number of genes are found to have more than two poly(A) sites, more sophisticated approaches that are flexible in identifying all potential APA sites and dynamic APA events in more than two samples are in high demand. To overcome the limitations associated with these pioneering studies, we proposed an approach termed APAtrap which is devoted to the accurate identification of APA sites, novel 3′ UTRs or extensions, and differential APA site usage from RNA-seq data. The proposed method has several desirable properties. First, APAtrap is capable of identifying novel 3′ UTRs and 3′ UTR extensions, which contributes to locating potential poly(A) sites in previously overlooked regions and improving the genome annotation. Second, unlike many other methods limiting in identifying only one proximal poly(A) site in each annotated 3′ UTR, APAtrap aims to predict all potential poly(A) sites, which represents a major step towards a more comprehensive profiling of APA from RNA-seq data. Third, extensive analyses of various RNA-seq datasets from human and Arabidopsis demonstrate the flexibility of APAtrap for different organisms by tuning model parameters. Moreover, APAtrap can handle batch samples (>2) in one time, which facilitates large-scale data analysis and avoids the inconsistency of results from different pair-wise comparisons. In addition, APAtrap uses generic files (e.g. SAM format) from standard read mapping, researchers can easily integrate it into their analysis pipeline for detecting poly(A) sites or refining 3′ UTRs. APAtrap could also be combined with other tools, like Cufflinks, MISO, or Kleat, which can provide additional or consensus evidence for 3′ UTR annotations or APA events identified by different approaches. We comprehensively evaluated the performance of APAtrap using both simulated and real data and compared it to other competing approaches, including ChangePoint (Wang et al., 2014) and DaPars (Xia et al., 2014). The results showed that APAtrap is powerful in identifying functionally important APA events and dynamic APA usage in different biological conditions and achieves considerably higher accuracy than the competing approaches. The better performance of APAtrap may be attributed to the following reasons. First, the sliding window strategy adopted in APAtrap with stringent conditions enables accurately finding potential poly(A) site regions with substantial change of read coverage and estimating the number of poly(A) sites. The subsequent mean squared error model used in APAtrap for poly(A) site prediction is less complex but more efficient than models used in the competing methods, such as the regression model in DaPars and the likelihood ratio test in ChangePoint, demonstrating bright prospect in using APAtrap for large-scale analysis. On a practical note, ChangePoint is much more time consuming than APAtrap and DaPars for relatively large data, which limits its application in large-scale analysis. Second, the proposed PD index considers the effect of 3′ UTR length and is capable of quantifying the difference of APA usage in genes with more than two poly(A) sites. In contrast, the ΔPDUI used in DaPars and ODD used in ChangePoint considers only the proximal and distal poly(A) sites. We need to point out that ODD is not an appropriate metric because it uses the absolute read count in 3′ UTR regions while fails to exclude the effect of 3′ UTR length as longer 3′ UTRs tend to have more reads. Another disadvantage of ODD is that it is too sensitive to read count and slight difference of the read count in different poly(A) regions would result in a relatively large ODD. Results of the simulation studies clearly indicate the importance of using an appropriate metric in detecting differential APA site usage (Fig. 1). Third, as demonstrated in the simulation study for one to four poly(A) sites, incorporating information of additional poly(A) sites rather than only the proximal and distal ones would considerably improve the accuracy of dynamic APA usage detection. The ability of APAtrap to identify all potential poly(A) sites in refined 3′ UTRs would definitely contribute to the better performance. Fourth, the linear trend test used in APAtrap for testing significance in difference of APA usage between samples has an advantage over independent test such as the Fisher’s test in DaPars. The linear trend test allows more than two poly(A) sites and considers both the length and the expression level of each 3′ UTR. In contrast, independent test used in DaPars only considers the proximal and distal poly(A) sites and neglects the length of 3′ UTRs. We demonstrated the usefulness and applicability of APAtrap by analyzing experimental RNA-seq data in Arabidopsis and human. As expected, APAtrap identifies more poly(A) sites and dynamic APA events than DaPars. Poly(A) sites only predicted by APAtrap are shown to be highly confident in that many of them are overlapping with annotated poly(A) sites. Despite that there are some predicted poly(A) sites not supported by the annotated poly(A) database, these sites may be sample specific which are exclusively presented in specific tissues or conditions and have not been collected due to the incompleteness of current annotated database. Although continuous efforts to curate the genome annotation in different species are being made, the genome annotation for many species is far from complete. Particularly, the genome-wide annotation of 3′ UTRs received much less attention than coding regions, where genes in many genome assemblies exhibit no annotated 3′ UTR. Moreover, it has also been noted that the terminal exon annotations inferred from RNA-seq data by different transcriptome assembly tools can be inaccurate and incomplete (Miura et al., 2013; Shenker et al., 2015). Recent genomic studies also documented widespread 3′ UTR extensions. For example, data from RNA-seq revealed 2035 mouse and 1847 human genes that utilize substantially distal novel 3′ UTRs (many >10 kb) (Miura et al., 2013). Neural specific 3′ UTR extensions have also been revealed in Drosophila (Smibert et al., 2012) and zebrafish (Ulitsky et al., 2012), which may contribute to post transcriptional regulation underlying specific neuronal functions (Oktaba et al., 2015). With the tsunami of new genomes and RNA-seq data being sequenced, it would be highly desirable to improve the accuracy of 3′ UTR annotations from RNA-seq data. Using APAtrap to refine annotated 3′ UTRs and identify novel 3′ UTR extensions might benefit the cause. The novel 3′ UTRs in Arabidopsis inferred by APAtrap were verified to share the same characteristics as annotated 3′ UTRs, suggesting the effectiveness of APAtrap in 3′ UTR identification. Moreover, in recent genomic studies on polyadenylation, it is common to extend annotated 3′ UTR or gene model by a fixed length to capture uncharacterized poly(A) sites or transcript extensions, e.g. 5 or 1 kb for mammals (Derti et al., 2012; Lianoglou et al., 2013), 500 bp for rice (Shen et al., 2008), 218 bp for Arabidopsis (Wu et al., 2011) and 600 bp for Medicago truncatula (Wu et al., 2014). Despite its simplicity and straightforwardness in improving the ‘recovery’ of poly(A) sites that may fall in authentic 3′ UTRs, the same extended length was set for all annotated genes, neglecting the variation of 3′ UTR length in different genes. In order to get more accurate estimate of annotated poly(A) sites for each gene, a better approach is to revise the 3′ UTR regions of individual genes with new methods as the one described in this study. The 3′ UTRs of mRNAs contain regulatory elements such as microRNA-binding sites, which could affect mRNA stability or translation efficiency. Inclusion of refined 3′ UTRs detected by APAtrap could provide a more extensive reference database for the searching of potential cis-elements binding to specific 3′ UTRs, which would help fully understand the functional roles of 3′ UTRs. Funding This work was supported by the National Natural Science Foundation of China (61673323 to X.W.), Fundamental Research Funds for the Central Universities in China from Xiamen University (20720170076, to C.Y.), Natural Science Foundation of Fujian Province of China (2017J01068 to X.W.), scholarship from China Scholarship Council (201606315010 to X.W.), U.S. NSF (IOS-154173 to Q.Q.L.), Chinese Ministry of Science and Technology (2016YFE0108800 to Q.Q.L.). Conflict of Interest: none declared. References Batra R. et al. . ( 2015 ) Global insights into alternative polyadenylation regulation . RNA Biol ., 12 , 597 – 602 . Google Scholar CrossRef Search ADS PubMed Benjamini Y. , Hochberg Y. ( 1995 ) Controlling the false discovery rate: a practical and powerful approach to multiple testing . J. R. Stat. Soc. Ser. B (Methodological) , 57 , 289 – 300 . Berkovits B.D. , Mayr C. ( 2015 ) Alternative 3′ UTRs act as scaffolds to regulate membrane protein localization . Nature , 522 , 363 – 367 . Google Scholar CrossRef Search ADS PubMed Birol I. et al. . ( 2015 ) Kleat: cleavage site analysis of transcriptomes . Pac. Symp. Biocomput ., 347 – 358 . Blazie S.M. et al. . ( 2015 ) Comparative RNA-Seq analysis reveals pervasive tissue-specific alternative polyadenylation in Caenorhabditis elegans intestine and muscles . BMC Biol ., 13 , 4 . Google Scholar CrossRef Search ADS PubMed Bonfert T. , Friedel C.C. ( 2017 ) Prediction of poly(A) sites by poly(A) read mapping . Plos One , 12 , e0170914. Google Scholar CrossRef Search ADS PubMed Bullard J.H. et al. . ( 2010 ) Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments . BMC Bioinformatics , 11 , 94 . Google Scholar CrossRef Search ADS PubMed Campo S. et al. . ( 2014 ) Overexpression of a calcium-dependent protein kinase confers salt and drought tolerance in rice by preventing membrane lipid peroxidation . Plant Physiol ., 165 , 688 – 704 . Google Scholar CrossRef Search ADS PubMed Chang H. et al. . ( 2014 ) TAIL-seq: genome-wide determination of poly(A) tail length and 3′ end modifications . Mol. Cell ., 53 , 1044 – 1052 . Google Scholar CrossRef Search ADS PubMed Clauw P. et al. . ( 2015 ) Leaf responses to mild drought stress in natural variants of Arabidopsis . Plant Physiol ., 167 , 800 – 816 . Google Scholar CrossRef Search ADS PubMed Derti A. et al. . ( 2012 ) A quantitative atlas of polyadenylation in five mammals . Genome Res ., 22 , 1173 – 1183 . Google Scholar CrossRef Search ADS PubMed Di Giammartino D.C. et al. . ( 2011 ) Mechanisms and consequences of alternative polyadenylation . Mol. Cell , 43 , 853 – 866 . Google Scholar CrossRef Search ADS PubMed Fu H. et al. . ( 2016 ) Genome-wide dynamics of alternative polyadenylation in rice . Genome Res ., 26 , 1753 – 1760 . Google Scholar CrossRef Search ADS PubMed Fu Y. et al. . ( 2011 ) Differential genome-wide profiling of tandem 3′ UTRs among human breast cancer and normal cells by high-throughput sequencing . Genome Res ., 21 , 741 – 747 . Google Scholar CrossRef Search ADS PubMed Griebel T. et al. . ( 2012 ) Modelling and simulating generic RNA-Seq experiments with the flux simulator . Nucleic Acids Res ., 40 , 10073 – 10083 . Google Scholar CrossRef Search ADS PubMed Gruber A.J. et al. . ( 2016 ) A comprehensive analysis of 3′ end sequencing data sets reveals novel polyadenylation signals and the repressive role of heterogeneous ribonucleoprotein C on cleavage and polyadenylation . Genome Res ., 26 , 1145 – 1159 . Google Scholar CrossRef Search ADS PubMed Gruber A.R. et al. . ( 2014 ) Global 3′ UTR shortening has a limited effect on protein abundance in proliferating T cells . Nat. Commun ., 5 , 5465 . Google Scholar CrossRef Search ADS PubMed Gupta I. et al. . ( 2014 ) Alternative polyadenylation diversifies post-transcriptional regulation by selective RNA-protein interactions . Mol. Syst. Biol ., 10 , 719. Google Scholar CrossRef Search ADS PubMed Hameed A. et al. . ( 2012 ) Drought induced programmed cell death and associated changes in antioxidants, proteases, and lipid peroxidation in wheat leaves . Biol. Plant , 57 , 370 – 374 . Google Scholar CrossRef Search ADS Han T. , Kim J.K. ( 2014 ) Driving glioblastoma growth by alternative polyadenylation . Cell Res ., 24 , 1023 – 1024 . Google Scholar CrossRef Search ADS PubMed Harrison P.F. et al. . ( 2015 ) PAT-seq: a method to study the integration of 3′-UTR dynamics with gene expression in the eukaryotic transcriptome . RNA , 21 , 1502 – 1510 . Google Scholar CrossRef Search ADS PubMed Hoque M. et al. . ( 2013 ) Analysis of alternative cleavage and polyadenylation by 3′ region extraction and deep sequencing . Nat. Methods , 10 , 133 – 139 . Google Scholar CrossRef Search ADS PubMed Ji G. et al. . ( 2015 ) Genome-wide identification and predictive modeling of polyadenylation sites in eukaryotes . Brief. Bioinf ., 16 , 304 – 313 . Google Scholar CrossRef Search ADS Katz Y. et al. . ( 2010 ) Analysis and design of RNA sequencing experiments for identifying isoform regulation . Nat. Methods , 7 , 1009 – 1015 . Google Scholar CrossRef Search ADS PubMed Le Pera L. et al. . ( 2015 ) 3USS: a web server for detecting alternative 3′ UTRs from RNA-seq experiments . Bioinformatics , 31 , 1845 – 1847 . Google Scholar CrossRef Search ADS PubMed Lee J.Y. et al. . ( 2007 ) PolyA_DB 2: mRNA polyadenylation sites in vertebrate genes . Nucleic Acids Res ., 35 , D165 – D168 . Google Scholar CrossRef Search ADS PubMed Lianoglou S. et al. . ( 2013 ) Ubiquitously transcribed genes use alternative polyadenylation to achieve tissue-specific expression . Genes Dev ., 27 , 2380 – 2396 . Google Scholar CrossRef Search ADS PubMed Lim J. et al. . ( 2016 ) mTAIL-seq reveals dynamic poly(A) tail regulation in oocyte-to-embryo development . Genes Dev ., 30 , 1671 – 1682 . Google Scholar CrossRef Search ADS PubMed Mayr C. ( 2016 ) Evolution and biological roles of alternative 3′ UTRs . Trends Cell Biol ., 26 , 227 – 237 . Google Scholar CrossRef Search ADS PubMed Miura P. et al. . ( 2013 ) Widespread and extensive lengthening of 3′ UTRs in the mammalian brain . Genome Res ., 23 , 812 – 825 . Google Scholar CrossRef Search ADS PubMed Oktaba K. et al. . ( 2015 ) ELAV links paused Pol II to alternative polyadenylation in the Drosophila nervous system . Mol. Cell , 57 , 341 – 348 . Google Scholar CrossRef Search ADS PubMed Park J.E. et al. . ( 2016 ) Regulation of poly(A) tail and translation during the somatic cell cycle . Mol. Cell , 62 , 462 – 471 . Google Scholar CrossRef Search ADS PubMed Pickrell J.K. et al. . ( 2010 ) Understanding mechanisms underlying human gene expression variation with RNA sequencing . Nature , 464 , 768 – 772 . Google Scholar CrossRef Search ADS PubMed Seaver S.M.D. et al. . ( 2014 ) High-throughput comparison, functional annotation, and metabolic modeling of plant genomes using the PlantSEED resource . Proc. Natl. Acad. Sci. USA , 111 , 9645 – 9650 . Google Scholar CrossRef Search ADS Shen Y. et al. . ( 2008 ) Genome level analysis of rice mRNA 3′-end processing signals and alternative polyadenylation . Nucleic Acids Res ., 36 , 3150 – 3161 . Google Scholar CrossRef Search ADS PubMed Shenker S. et al. . ( 2015 ) IsoSCM: improved and alternative 3′ UTR annotation using multiple change-point inference . RNA , 21 , 14 – 27 . Google Scholar CrossRef Search ADS PubMed Smibert P. et al. . ( 2012 ) Global patterns of tissue-specific alternative polyadenylation in Drosophila . Cell Rep ., 1 , 277 – 289 . Google Scholar CrossRef Search ADS PubMed Tian B. , Manley J.L. ( 2013 ) Alternative cleavage and polyadenylation: the long and short of it . Trends Biochem. Sci ., 38 , 312 – 320 . Google Scholar CrossRef Search ADS PubMed Tian B. , Manley J.L. ( 2017 ) Alternative polyadenylation of mRNA precursors . Nat. Rev. Mol. Cell Biol ., 18 , 18 – 30 . Google Scholar CrossRef Search ADS PubMed Trapnell C. et al. . ( 2012 ) Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks . Nat. Protoc ., 7 , 562 – 578 . Google Scholar CrossRef Search ADS PubMed Ulitsky I. et al. . ( 2012 ) Extensive alternative polyadenylation during zebrafish development . Genome Res ., 22 , 2054 – 2066 . Google Scholar CrossRef Search ADS PubMed Wang W. et al. . ( 2014 ) A change-point model for identifying 3′ UTR switching by next-generation RNA sequencing . Bioinformatics , 30 , 2162 – 2170 . Google Scholar CrossRef Search ADS PubMed Wilkening S. et al. . ( 2013 ) An efficient method for genome-wide polyadenylation site mapping and RNA quantification . Nucleic Acids Res ., 41 , e65. Google Scholar CrossRef Search ADS PubMed Wu X. , Bartel D.P. ( 2017 ) Widespread influence of 3′-end structures on mammalian mRNA processing and stability . Cell , 169 , 905 – 917 . e911. Google Scholar CrossRef Search ADS PubMed Wu X. et al. . ( 2014 ) Genome-wide determination of poly(A) sites in Medicago truncatula: evolutionary conservation of alternative poly(A) site choice . BMC Genomics , 15 , 615 . Google Scholar CrossRef Search ADS PubMed Wu X. et al. . ( 2011 ) Genome-wide landscape of polyadenylation in Arabidopsis provides evidence for extensive alternative polyadenylation . Proc. Natl. Acad. Sci. USA , 108 , 12533 – 12538 . Google Scholar CrossRef Search ADS Wu X. et al. . ( 2016 ) PlantAPA: a portal for visualization and analysis of alternative polyadenylation in plants . Front. Plant Sci ., 7 , 1 – 14 . Google Scholar PubMed Xia Z. et al. . ( 2014 ) Dynamic analyses of alternative polyadenylation from RNA-seq reveal a 3′-UTR landscape across seven tumour types . Nat. Commun ., 5 , 1 – 13 . Xing Y. et al. . ( 2015 ) Evaluation of two statistical methods provides insights into the complex patterns of alternative polyadenylation site switching . PloS One , 10 , e0124324 . Google Scholar CrossRef Search ADS PubMed You L. et al. . ( 2014 ) APASdb: a database describing alternative poly(A) sites and selection of heterogeneous cleavage sites downstream of poly(A) signals . Nucleic Acids Res ., 43 , D59 – D67 . Google Scholar CrossRef Search ADS PubMed © The Author(s) 2018. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Journal

BioinformaticsOxford University Press

Published: Jan 18, 2018

There are no references for this article.

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off