Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi

Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi Background: Genome annotation is of key importance in many research questions. The identification of protein-coding genes is often based on transcriptome sequencing data, ab-initio or homology-based prediction. Recently, it was demonstrated that intron position conservation improves homology-based gene prediction, and that experimental data improves ab-initio gene prediction. Results: Here, we present an extension of the gene prediction program GeMoMa that utilizes amino acid sequence conservation, intron position conservation and optionally RNA-seq data for homology-based gene prediction. We show on published benchmark data for plants, animals and fungi that GeMoMa performs better than the gene prediction programs BRAKER1, MAKER2, and CodingQuarry, and purely RNA-seq-based pipelines for transcript identification. In addition, we demonstrate that using multiple reference organisms may help to further improve the performance of GeMoMa. Finally, we apply GeMoMa to four nematode species and to the recently published barley reference genome indicating that current annotations of protein-coding genes may be refined using GeMoMa predictions. Conclusions: GeMoMa might be of great utility for annotating newly sequenced genomes but also for finding homologs of a specific gene or gene family. GeMoMa has been published under GNU GPL3 and is freely available at http://www.jstacs.de/index.php/GeMoMa. Keywords: Homology-based gene prediction, RNA-seq, Genome annotation Background Genome annotation pipelines utilize three main sources The annotation of protein-coding genes is of critical of information, namely evidence from wet-lab transcrip- importance for many fields of biological research includ- tome studies [1, 2], ab-initio gene prediction based on ing, for instance, comparative genomics, functional pro- general features of (protein-coding) genes [3, 4], and teomics, gene targeting, genome editing, phylogenetics, homology-based gene prediction relying on gene models transcriptomics, and phylostratigraphy. The process of of (closely) related, well-annotated species [5–7]. annotating protein-coding genes to an existing genome Experimental data allow for inferring coverage of gene (assembly) can be described as specifying the exact predictions and splice sites bordering their exons, which genomic location of genes comprising all (partially) cod- may assist computational ab-initio or homology-based ing exons. A difficulty in gene annotation is distinc- approaches. Due to the progress in the field of next gener- tion between protein-coding genes, transposons and ation sequencing, RNA-seq has revolutionized transcrip- pseudogenes. tomics [8]. Today, RNA-seq data is available for a wide range of organisms, tissues and environmental conditions, and can be utilized for genome annotation pipelines. *Correspondence: jens.keilwagen@julius-kuehn.de In recent years, several programs have been developed Institute for Biosafety in Plant Biotechnology, Julius Kühn-Institut (JKI) - that combine multiple sources allowing for a more accu- Federal Research Centre for Cultivated Plants, D-06484, Quedlinburg, Germany rate prediction of protein-coding genes [9–11]. MAKER2 Full list of author information is available at the end of the article © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Keilwagen et al. BMC Bioinformatics (2018) 19:189 Page 2 of 12 is a pipeline that integrates support of different resources mapped RNA-seq data. Introns passing this filter define including ab-initio gene predictors and RNA-seq data donor and acceptor splice sites, which are treated inde- [9]. CodingQuarry is a pipeline for RNA-Seq assembly- pendently within GeMoMa. If splice sites with experimen- supported training and gene prediction, which is only tal evidence have been detected in a genomic region with a recommended for application to fungi [10]. Recently, [11] good match to an exon of a reference transcript, these are collected for further use. If no splice sites with experimen- published BRAKER1 a pipeline for unsupervised RNA- seq-based genome annotation that combines the advan- tal evidence have been detected in a genomic region with a tages of GeneMark-ET [12] and AUGUSTUS [4]. good match to an exon of a reference transcript, GeMoMa Here, we present an extension of GeMoMa [7]thatuti- resorts to conserved dinucleotides allowing to identify lizes RNA-seq data in addition to amino acid sequence gene models that are not covered by RNA-seq data due to, and intron position conservation. We investigate the per- e.g., very specifically or lowly expressed transcripts. Com- formance of GeMoMa on publicly available benchmark bining two potential exons, all in-frame combinations data [11] and compare it with state-of-the-art competitors using the collected donor and acceptor splice sites are [9–11]. tested and scored according to the reference transcript. Subsequently, we demonstrate how combining homology- The best combination is used for the prediction. based predictions based on gene models from multiple Based on this experimental evidence, the improved reference organisms can be used to improve the perfor- version of GeMoMa provides several new properties mance of GeMoMa. Finally, we apply GeMoMa to four reported for gene predictions. The most prominent fea- nematode species provided by Wormbase [13] and to the tures are transcript intron evidence (tie) and transcript recently published barley reference genome [14], where percentage coverage (tpc). The tie of a transcript varies GeMoMa predictions will be included into future versions between 0 and 1, and corresponds to the fraction of of the corresponding genome annotations. introns (i.e., splice sites of two neighboring exons) that are supported by split reads in the mapped RNA-seq Methods data. In case of transcripts comprising a single coding In this section, we describe recent extensions of GeMoMa exon, NA is reported. The tpc of a transcript also varies to make use of evidence from RNA-seq data, the RNA-seq between 0 and 1, and corresponds to the fraction of (cod- pipelines used and the data considered in the benchmark ing) bases of a predicted transcript that are also covered by mapped reads in the RNA-seq data. Further properties and application studies. reported by GeMoMa are i) tae and tde,the percentages of acceptor and donor sites, respectively, with RNA-seq GeMoMa using RNA-seq evidence, ii) minCov and avgCov, the minimum and aver- GeMoMa predicts protein-coding genes utilizing the gen- eral conservation of protein-coding genes on the level of age coverage, respectively, of the predicted transcript, and their amino acid sequence and on the level of their intron iii) minSplitReads, the minimum number of split reads positions, i.e., the locations of exon-exon boundaries in supporting any of the predicted introns of a transcript. CDSs [7]. To this end, sequences of (partially) protein- Optionally, GeMoMa reports pAA and iAA,the percent- coding exons are extracted from well-annotated reference age of positive-scoring and identical amino acids in a genomes. Individual exons are then matched to loci on the pairwise alignment, if the reference protein is provided target genome using tblastn [15], matches are adjusted for as input. proper splice sites, start codons and stop codons, respec- GeMoMa allows for computing and ranking multiple tively, and joined to full, protein-coding genes models. In predictions per reference transcript, but does not fil- this process, the conserved dinucleotides GT and GC for ter these predictions. Predictions of different reference donor splice sites, and AG for acceptor splice sites have transcripts might be highly overlapping or even identi- been used for the identification of splice sites border- cal, especially if the reference transcripts are from the ing matches to the (partially) protein-coding exons of the same gene family. Since GeMoMa 1.4, the default param- reference transcripts. The improved version of GeMoMa eters for number of predictions and contig threshold have may now also include experimental splice site evidence been changed which might lead to an increased number extracted from mapped RNA-seq data to improve the of highly overlapping or identical predictions. In addi- accuracy of splice site and, hence, exon annotation. We tion, it might be beneficial to run GeMoMa starting from visualize the extended GeMoMa pipeline in Fig. 1. multiple reference species to broaden the scope of tran- scripts covered by the predictions. However, these may Starting from mapped RNA-seq data, the module also result in redundant predictions for, e.g., orthologs or Extract RNA-seq evidence (ERE) allows for extracting paralogs stemming from the different reference species introns and, if user-specified, read coverage of genomic regions. GeMoMa filters these introns using a user- considered. To handle such situations, the new module specified minimal number of split reads within the GeMoMa annotation filter (GAF) of the improved version Keilwagen et al. BMC Bioinformatics (2018) 19:189 Page 3 of 12 Fig. 1 GeMoMa workflow. Blue items represent input data sets, green boxes represent GeMoMa modules, while grey boxes represent external modules. The GeMoMa Annotation Filter allows to combine predictions from different reference species and produces the final output. RNA-seq data is optional of GeMoMa now allows for joining and reducing such pre- genes in the cluster looking for discarded predictions that dictions using various filters. Filtering criteria comprise do not overlap with any selected prediction, which are the relative GeMoMa score of a predicted transcript, fil- recovered. In the benchmark studies comparing GeMoMa tering for complete predictions (starting with start codon with state-of-the-art competitors, we directly use the GAF and ending with stop codon), and filtering for evidence results without any further filters on attributes reported from multiple reference organisms. In addition, GAF also by the GeMoMa pipeline. joins duplicate predictions that originate from different In addition to the modules for annotating a genome reference transcripts. (assembly) described above, we also provide two addi- Initially, GAF filters predictions based on their rela- tional modules in GeMoMa for analyzing and comparing tive GeMoMa score, i.e., the GeMoMa score divided by to prediction to a reference annotation. The module Com- the length of the predicted protein. This filter removes pareTranscripts determines that CDS of the reference spurious predictions. Subsequently, the predictions are annotation with the largest overlap with the prediction clustered based on their genomic location. Overlapping utilizing the F measure as objective function [7]. The predictions on the same strand yield a common cluster. module AnnotationEvidence computes tie and tpc of all For each cluster, the prediction with the highest GeMoMa CDSs of a given annotation. Hence, these two modules score is selected. Non-identical predictions overlapping can be used to determine, whether a prediction is known, the high-scoring prediction with at least a user-specified partially known or new and whether the overlapping percentage of borders (i.e., splice sites, start and stop annotation has good RNA-seq support. codon, cf. common border filter) are treated as alterna- tive transcripts. Predictions that have completely identical MAKER2 predictions borders to any previously selected prediction are removed Recently, we have shown that GeMoMa outperforms and only listed in the GFF attribute field alternative.All state-of-the-art homology-based gene predictors [7]. We filtered predictions of a cluster are assigned to one gene are not aware of any homology-based gene predic- with a generic gene name. Finally, GAF checks for nested tion program that allows for incorporating of RNA-seq Keilwagen et al. BMC Bioinformatics (2018) 19:189 Page 4 of 12 data. Hence, we provide predictions of MAKER2 using with the previous benchmark study, we use the manually the same reference proteins as GeMoMa for a mini- curated gene set of Wormbase. mal comparison. Internally, MAKER2 uses exonerate [5] For the analysis of barley, we use the latest genome for homology-based gene prediction. We run MAKER2 assembly and gene annotation [14]. As reference species, with default parameters except protein2genome=1, we choose A. thaliana [22], B. distachyon [23], O. sativa [24], and S. italica [25] (Additional file 1:Table S2). and genome and protein set to the respective input files. In addition, we run MAKER2 using (i) RNA-seq In addition to genome assembly and gene annotation, data in form of Trinity 2.4 transcripts (-jaccard_clip) [16], we also used RNA-seq data from four different public (ii) homology in form of proteins of one related refer- available data sets (ERP015182, ERP015986, SRP063318, ence species, and (iii) ab-initio gene prediction in form SRP071745). Reads were mapped and assembled using of Augustus 3.3 [4]. In this case, we run MAKER2 with Hisat2 and StringTie [26]. As reference annotation, we default parameters except genome, est, protein,and used the union of high and low confidence annotation. augustus_species,whichhavebeensetto thecorre- As independent evidence for validating GeMoMa pre- sponding species. For comparison, we run Maker2 with dictions in the nematode species and barley, we use ESTs the same parameter settings but using the GeMoMa pre- and cDNAs. While Wormbase provides coordinates for dictions for protein_gff instead of using protein. best BLAT matches, we adapt the pipeline and download all available EST from NCBI and map them to the genome RNA-seq pipelines using BLAT [27]. Computational pipelines have been used to infer gene annotation from RNA-seq data produced by next genera- Results and discussion tion sequencing methods. Dozens of tools and tool com- Benchmark binations have been proposed. Here, we focus on the short The comparison of different software pipelines is often read mapper TopHat2 [17], the transcript assemblers critical as a) specific parameters settings might be cru- Cufflinks [1] and StringTie [2], and the coding sequence cial for good results and b) different input might be used. predictor TransDecoder [16]. Based on the transcript For these reasons, we designed the benchmark as follows. assemblers, we build two RNA-seq pipelines following the First, we use publicly available gene predictions results. instructions in [11]. Second,welimit thenumberofreference speciestoone in the initial study. Data We used GeMoMa for predicting the gene annota- For the benchmark studies, we consider target species tions of A. thaliana, C. elegans, D. melanogaster,and S. pombe.InTable 1, we summarize the performance of and their genome versions as specified in the BRAKER1 supplement. For the homology-based prediction by BRAKER1, MAKER2, and CodingQuarry as reported in GeMoMa, we choose one closely related reference species Hoff et al. [11], as well as the performance of GeMoMa per target species that are sequenced and annotated with and without RNA-seq evidence, purely RNA-seq- [13, 18–20]. For these species, we consider the latest based pipelines and various MAKER2 predictions. The genome versions available (Additional file 1:Table S1). results of CodingQuarry reported by Hoff et al. [11]devi- For the analysis of C. elegans,weusetheman- ate substantially from those originally reported by Testa ually curated gene set of C. briggsae provided by et al. [10]. We find that the performance of CodingQuarry Wormbase. In addition, we use the experimental evi- is highly sensitive to RNA-seq processing, whereas the dence from RNA-seq data referenced in the BRAKER1 performance of GeMoMa is barely affected (Additional publication. file 1: Table S5). For all comparisons, we provide sen- For the analysis of the four nematode species, sitivity (Sn) and specificity (Sp) for the categories gene, C. brenneri, C. briggsae, C. japonica,and C. remanei, transcript, and exon, respectively [28]. In addition, we we use the genome assembly and gene annotation of compare CodingQuarry with GeMoMa for S. cerevisiae Wormbase WS257 [13]. We choose the model organism (Additional file 1:TableS6). C. elegans as reference species (Additional file 1:Table S2). First, we compare the two purely homology-based In addition to genome assembly and gene annotation, we predictions, namely on the one hand side MAKER2 using also use publicly available RNA-seq data of these four exonerate and on the other hand side GeMoMa without nematode species, which have been mapped by Worm- RNA-seq data. In all cases, we use the same reference base using STAR [21]. We used a minimum intron size of species and reference proteins. We find that MAKER2 25 bp, a maximum intron size of 15Kb, specify that only using only homologous proteins has a higher exon speci- reads mapping once or twice on the genome are reported, ficity than GeMoMa without RNA-seq data for C. elegans, and alignments are reported only if their ratio of mis- while the opposite is true for all other categories and matches to mapped length is less than 0.02. In accordance target species. Keilwagen et al. BMC Bioinformatics (2018) 19:189 Page 5 of 12 Table 1 Benchmark results on the BRAKER1 test sets + + + ∗ ∗ ∗ + + MAKER2 GeMoMa without GeMoMa with RNAseq- RNAseq- BRAKER1 MAKER2 CodingQuarry MAKER2 (exonerate, MAKER2 (exonerate) RNA-seq data RNA-seq data Cufflinks StringTie Trinity, Augustus) (GeMoMa, Trinity, Augustus) Arabidopsis thaliana (ref. A. lyrata) Gene Sn 44.0 61.3 66.5 28.9 35.9 64.4 51.3 NA 56.9 57.9 Gene Sp 47.8 65.7 71.3 47.9 59.1 52.0 52.5 NA 65.7 67.8 Transcript Sn 37.5 52.2 57.2 26.6 33.7 55.0 43.5 NA 48.3 49.1 Transcript Sp 47.8 65.7 65.3 35.6 48.3 50.9 52.5 NA 65.7 67.8 Exon Sn 70.0 79.3 80.6 58.1 60.8 82.9 76.1 NA 81.8 82.1 Exon Sp 81.9 86.6 87.5 81.9 87.1 79.0 76.1 NA 87.5 88.6 Caenorhabditis elegans (ref. C. briggsae) Gene Sn 26.2 39.6 49.1 18.7 22.6 55.0 41.0 NA 40.5 47.3 Gene Sp 38.0 49.9 63.8 29.1 36.1 55.2 30.8 NA 51.5 56.4 Transcript Sn 21.0 30.7 39.8 16.2 20.0 43.0 31.3 NA 31.4 36.2 Transcript Sp 38.0 49.9 58.7 24.1 30.1 53.2 30.8 NA 51.5 56.4 Exon Sn 50.3 64.2 67.1 54.4 59.1 80.2 69.4 NA 70.5 75.2 Exon Sp 82.6 81.5 87.5 81.3 84.1 85.3 62.3 NA 85.6 86.7 Drosophila melanogaster (ref. D. simulans) Gene Sn 64.3 78.2 83.1 55.7 55.2 64.9 55.2 NA 61.5 64.0 Gene Sp 69.2 81.6 87.1 71.3 73.5 59.4 46.3 NA 69.6 71.9 Transcript Sn 44.1 52.9 65.0 48.7 49.0 46.1 38.5 NA 42.7 44.3 Transcript Sp 69.2 81.6 81.2 60.1 65.7 57.9 46.3 NA 69.6 71.9 Exon Sn 69.0 76.3 80.0 67.8 66.2 75.0 66.5 NA 74.3 76.3 Exon Sp 89.1 92.0 93.3 85.4 88.3 81.7 66.9 NA 88.0 89.1 Schizosaccharomyces pombe (ref. S. octosporus) Gene Sn 49.2 76.4 79.2 69.0 65.8 77.4 42.8 79.7 71.6 74.6 Gene Sp 59.9 84.6 88.0 93.8 92.5 80.5 68.7 72.6 88.1 89.1 Transcript Sn 49.2 76.4 79.2 69.0 65.8 77.4 42.8 79.7 71.6 74.6 Transcript Sp 59.9 84.6 87.6 80.5 71.3 76.5 68.7 72.6 88.1 89.1 Exon Sn 56.1 81.6 83.1 77.2 77.7 83.2 50.1 79.6 79.2 81.2 Exon Sp 73.3 88.6 91.9 87.6 81.7 83.2 71.4 81.7 92.0 92.6 The target species are given in multi-column rows. The same reference species, which is given in brackets, is used for all tools using homology-based gene prediction indicated by plus. The asterisks indicates that the performance of BRAKER1, MAKER2 and CodingQuarry is given as reported in [11]. The highest value per line is depicted in bold-face Keilwagen et al. BMC Bioinformatics (2018) 19:189 Page 6 of 12 Second, we additionally consider RNA-seq data. Hence, we compare GeMoMa to combined gene pre- MAKER2 does not allow for combining RNA-seq evi- diction approaches. Specifically, we compare the perfor- dence and homology-based predictions without using mance of GeMoMa using RNA-seq evidence to BRAKER1 any ab-initio gene predictor. In contrast, GeMoMa allows in Fig. 2, which provides the best overall performance for additionally using intron position conservation and in [11]. We find that GeMoMa performs better than BRAKER1 for the categories gene and transcript with RNA-seq data. For this reason, we compare the perfor- mance of GeMoMa with and without RNA-seq evidence the exception of gene and transcript sensitivity for (Table 1). We find that sensitivity and specificity in C. elegans. Interestingly, we find the biggest improvements almost all cases increases by up to 13.9 with only two for D. melanogaster where gene/transcript sensitivity and exceptions for transcript specificity of A. thaliana and specificity increases between 18.2 and 27.7. For the exon D. melanogaster which decreases by at most 0.4. Hence, category, we find a less clear picture. In total, we observe we summarize that RNA-seq evidence improves the sen- the worst results for C. elegans where the sensitivity for sitivity and specificity of GeMoMa and should be used if all three categories decreases between 3.2 and 13.2, while available. the specificity increases only between 2.2 and 8.6. Notably, Third, we compare the performance of GeMoMa using we generally find the worst gene/transcript sensitivity and RNA-seq evidence to that of purely RNA-seq-based specificity for C. elegans compared with the other target pipelines, namely Cufflinks and StringTie (Table 1). We species considering the best performance of all tools. find for all four species that GeMoMa using RNA-seq In summary, we find that the gene predictors MAKER2, evidence outperforms purely RNA-seq-based pipelines. BRAKER1, CodingQuarry and GeMoMa, and the tran- Interestingly, purely RNA-seq-based pipelines also yield script assemblers Cufflinks and StringTie often perform the worst gene/transcript sensitivity and specificity for quite well on exon level. The main difference becomes C. elegans. Comparing the results based on different evident on transcript and gene level, where exons need transcript assemblers, we find that the results based on to be combined correctly (Table 1) as reported earlier StringTie are better than those based on Cufflinks for [29, 30]. Homology-based gene predictors might benefit A. thaliana and C. elegans, while the opposite is true from experimentally validated and manually curated ref- for S. pombe.For D. melanogaster, both pipelines per- erence transcripts guiding the prediction of transcripts in form comparably. Additional RNA-seq reads increasing the target organism. Although GeMoMa performed well, it is not able to pre- the coverage might improve the performance of purely RNA-seq-based pipelines but could also improve the per- dict genes that do not show any homology to a protein formance of GeMoMa. in the reference species, while ab-initio gene predictors Summarizing these three observations, we find that mightfailinother cases. Asbothtypesofapproaches have GeMoMa performs better than purely homology-based their specific advantages, users will probably use combi- or purely RNA-seq-based pipelines and that including nations of different gene predictors in practice to obtain a RNA-seq data improves the performance of GeMoMa. comprehensive gene annotation. Gene sensitivity Transcript sensitivity Exon sensitivity Gene specificity Transcript specificity Exon specificity A. thaliana C. elegans D. melanogaster S. pombe Fig. 2 Benchmark results. The y-axis depicts the difference between the GeMoMa with RNA-seq data and the BRAKER1 performance Difference to result of BRAKER1 −10 0 10 20 30 Keilwagen et al. BMC Bioinformatics (2018) 19:189 Page 7 of 12 In addition, we performed a small runtime study for the the inclusion of all available evidence and that per- two main time-consuming steps of the pipeline to demon- formance is decreased if some important evidence is strate that GeMoMa is reasonably fast (Additional file 1: missed [9]. Table S7). Furthermore, we compare GeMoMa using RNA- seq evidence with MAKER2 using RNA-seq evidence, Combined gene prediction pipelines homology-based and ab-initio gene prediction. In some cases, it is hard to compare these results as sensitivity Combined gene prediction pipelines, as for instance MAKER2, use RNA-seq evidence, homology-based and of one tool is higher than the sensitivity of the other ab-initio methods for predicting final gene models. tool and the opposite is true for specificity. In machine MAKER2 uses exonerate by default for homology-based learning, recall, also known as sensitivity, and precision, gene prediction. However, MAKER2 also provides the which is called specificity in the context of gene prediction possibility to use other homology-based gene predictors evaluation [31], are combined into a single scalar value instead of exonerate (cf. parameter protein_gff). For this called F1 measure [32] that can be compared more easily. reason, we compare the performance of MAKER2 using We combined sensitivity and specificity resulting in an either exonerate or GeMoMa for homology based gene F1 measure for each evaluation level gene, transcript prediction (Table 1). In addition, we use Augustus as ab- and exon (Additional file 1 – Table S4) We find that in initio gene prediction program and Trinity transcripts many cases GeMoMa using RNA-seq evidence outper- in MAKER2. We find that MAKER2 using GeMoMa forms MAKER2. The reason for this observation might performs better than MAKER2 using exonerate for all be that RNA-seq data and homology based gene predic- species and all measure. The improvement varies between tion is used in MAKER2 to train ab-initio gene predictors, 0.3% and 6.8% with clearly the biggest improvement for in this case Augustus. With the recommended parameter C. elegans. setting, homology-based gene predictions are not directly In addition, we find that the MAKER2 performance is used for the final prediction and doing so might further substantially improved compared to the performance of improve performance. the the previously reported MAKER2 predictions, either purely based on proteins (cf. Table 1, column MAKER2 Influence of reference species (exonerate)) or as reported in [11] (cf. Maker2 ). These Utilizing different fly species from FlyBase [33], we scru- other predictions do not utilize all available sources of tinize the influence of different or multiple reference information as they either ignore RNA-seq data and species on the performance of GeMoMa using RNA-seq ab-initio gene prediction or homology to proteins of data (Additional file 1:TableS8).In Fig. 3,wedepictgene related species. Based on this observation, we agree sensitivity and gene specificity for eight different reference that combined gene prediction pipelines benefit from species indicated by points. We find that performance Fig. 3 Gene sensitivity and specificity for D. melanogaster using different or multiple reference species in GeMoMa. The points correspond to the eight reference species. In addition, the dashed line indicates the usage of multiple reference species. Using multiple reference species allows for filtering identical predictions from several reference as indicated by the numbers Keilwagen et al. BMC Bioinformatics (2018) 19:189 Page 8 of 12 varies with the reference species. In this specific case, supported by at least two of the eight reference species. D. sechellia and D. persimilis yield the worst results for sin- In addition, we find 9 single-coding-exon predictions that gle reference-based predictions. This observation might do not overlap with any annotated transcript, have a tpc be related to the fact that genome assembly of D. sechellia of 1 and are supported by at least two of the five reference and D. persimilis is of lower quality [34], while the genome species. In summary, those genes supported by multiple reference organisms or additional RNA-seq data might be of D. simulans has been updated [35] later. Besides these two outliers, the performance of the different fly species promising candidates for extending the existing genome as reference species for D. melanogaster in GeMoMa cor- annotation of D. melanogaster. relates with their evolutionary distance [36]. Generally speaking, the closer a reference species is related to the Analysis of nematode species target species D. melanogaster, the better is the perfor- The relatively poor results for C. elegans in the bench- mance in terms of gene sensitivity and specificity. Hence, mark study, might be due to insufficiencies in the current we speculate that two requirements must be met to have C. briggsae annotation. Hence, we decided to scruti- a good reference species. First, the evolutionary distance nize the Wormbase annotation of four nematode species between reference and target species should be small comprising C. brenneri, C. briggsae, C. japonica,and and second, the genome assembly and annotation of the C. remanei based on the model organism C. elegans. reference species should be comprehensive and of high We compare GeMoMa predictions with manually curated quality. CDS from Wormbase. Based on RNA-seq evidence, we The new GAF module of GeMoMa allows for com- collect multi-coding-exon predictions of GeMoMa with bining the predictions based on different reference tie=1 and compare these to the annotation as depicted in organisms. The combined predictions may be filtered by Fig. 4. number of reference species with perfect support (#evi- In summary, we find between 6 749 differences for dence), as indicated by the dashed line. We find that C. briggsae and 12 903 for C. brenneri (cf. Fig. 4). The combining multiple reference organisms improves predic- most interesting category are new multi-coding-exon pre- tion performance and stability. Depending on the number dictions, which vary between 53 for C. briggsae and 1 974 of supporting reference organisms required, gene speci- for C. brenneri. The largest category are GeMoMa pre- ficity and gene sensitivity may be balanced according to dictions that missed exons compared to annotated CDSs, the needs of a specific application. We observe that (i) which vary between 2 340 for C. japonica and 4 220 for gene sensitivity increases but specificity decreases when C. remanei. requiring support from at least one reference organ- We additionally filter the transcripts showing differ- ism, whereas (ii) gene specificity increases but sensitivity ences to obtain a smaller, more conservative set of high- decreases severely filtering for perfect support from all confidence predictions. First, we filter new multi-coding eight reference species. In summary, the inclusion of mul- exon GeMoMa predictions for tpc=1 obtaining between tiple reference species may yield an improved prediction 39 and 996 for C. briggsae and C. brenneri,respectively. performance for GeMoMa using the GAF module, where Second, we filter GeMoMa predictions that have differ- we suggest to filter predictions for support by at least two ent splice sites compared to highly overlapping annotated but not necessarily all reference species. transcripts, contain new exons, have missing exons, or Furthermore, we check whether GeMoMa allows for have new and missing exons for tie<1 of the overlapping identifying new transcripts in D. melanogaster that do annotation. We obtain between 100 and 1 079 predic- not overlap with any annotated transcript but are sup- tions with different splice-site, between 42 and 786 predic- ported by RNA-seq data. First, we check whether we tions containing new exons, between 548 and 1 431 could identify transcripts based on the GeMoMa predic- predictions with missing exons, and between 284 and tions using D. simulans as reference organism. We find 1 191 predictions with new and missing exons. Finally, 35 multi-coding-exon predictions that do not overlap with for GeMoMa predictions that differ in the start codon any annotated transcript but have a tie of 1, i.e., all introns compared to the annotation, we filter for tpc=1ofthe are supported by split reads in the RNA-seq data (see GeMoMa prediction and tpc<1 for the annotation obtain- “Methods”). In addition, we find 15 single-coding-exon ing between 14 and 149 for C. brenneri and C. remanei, predictions that do not overlap with any annotated tran- respectively. In summary, we obtain between 1 065 pre- script but have a tpc of 1, i.e., that are fully covered dictions differing from the annotation for C. briggsae and by mapped RNA-seq reads. Second, we check whether 4 735 predictions for C. brenneri, respectively (cf. Fig. 4) we could identify transcripts that are supported by at using these strict criteria. Despite the overall reduction of least two of the eight reference species (cf. above). We transcripts considered, GeMoMa predictions that missed find 14 multi-coding-exon predictions that do not over- exons compared to annotated CDSs are the largest cate- lap with any annotated transcript, obtain a tie of 1 and are gory for all four nematode species. Keilwagen et al. BMC Bioinformatics (2018) 19:189 Page 9 of 12 Fig. 4 Summary of difference for GeMoMa predictions with tie=1. The relaxed evaluation (left panel) depicts differences between GeMoMa predictions and annotation without any filter on the annotation, while the conservative evaluation (right panel) applies additional filters for the annotation (cf. main text). Predictions that do not overlap with any annotated CDS are depicted in yellow, predictions that differ from annotated CDSs only in splice sites are depicted in green, predictions that have additional exons compared to annotated CDSs are depicted in turquoise, predictions that missed some exons compared to annotated CDSs are depicted in blue, predictions with additional and missing exons compared to annotated CDSs are depicted in pink, predictions that only differ in the start of the CDS compared to annotated CDS are depicted in red, and any other category is depicted in gray For both evaluations, we find that the predictions for in finding isoforms missed by traditional prediction C. briggsae are in better accordance with the annotation methods. than the predictions of the remaining three nematode species. One possible explanation might be that the anno- Analysis of barley tation of C. briggsae has recently been updated using Complementary to the studies in animals in the last sub- RNA-seq data (Gary Williams, personal communication), section, we used GeMoMa to predict the annotation of while the annotation of C. japonica is based on Augustus protein-coding genes in barley (Hordeum vulgare). Based (Erich Schwartz, personal communication) and the anno- on the benchmark results for D. melanogaster,weused tation of the other two nematodes are NGASP sets from several reference organisms to predict the gene annota- multiple ab-initio gene prediction programs [37]. For tion using GeMoMa and GAF and finally obtain 75 484 C. japonica, we find the second best results, although transcript predictions. Most of the predictions showed C. japonica is phylogenetically more distantly related to a good overlap with the annotation (F ≥ 0.8). Never- C. elegans than the remaining two nematodes [38]. This is theless, 27 204 out of these 75 484 predictions had little additional evidence that the annotation pipeline employed (F <0.8) or no overlap with high or low confidence has a decisive influence on the quality and completeness gene annotations. However, thousands of the transcripts of the annotation. contained in the official annotation do not have start or In addition, we checked for C. brenneri whether the stop codons [14], which renders an exact comparison of GeMoMa predictions partially overlap with cDNAs or predictions with perfect or at least very good overlap ESTs mapped to the C. brenneri genome. In 472 cases, unreasonable. the prediction overlaps with a cDNA or EST, but not Hence, we focus on 19 619 predictions with no overlap with the annotation. In 364 out of these 472 cases, the with any annotated transcript (Table 2). Scrutinizing these prediction has tie=1. To evaluate the predictions, we man- predictions, we find 1 729 single-coding-exon predictions ually checked about 9% (43) of the predicted missing that are completely covered by RNA-seq reads (tpc=1) but genes with tie=1. Based on RNA-seq data, protein homol- that are not contained in the annotation. Out of these, 367 ogy, cDNA/ESTs and manual curation, 95% were genuine are partially supported by best BLAT matches of ESTs to new isoforms which have been missed in the original the genome. In addition, we analyzed multi-coding-exon C. brenneri gene set. This shows that GeMoMa is valuable predictions and find 2 821 predictions that obtain tie=1, Keilwagen et al. BMC Bioinformatics (2018) 19:189 Page 10 of 12 Table 2 Predictions that do not overlap with any high or low We also find that the performance depends on the evo- confidence annotation lutionary distance betwen reference and target organism. a) Single-coding-exon predictions However, prediction performance also depends on sev- eral further aspects including i) the quality of the target #evidence tpc =00 < tpc < 1 tpc = 1 genome (assembly), ii) the number of reference organisms 1 1 971 (11) 878 (14) 1 005 (137) available and iii) especially the quality of the reference 2 204 (19) 158 (8) 299 (55) annotation(s) itself. Hence, we recommend to balance 3 200 (16) 126 (5) 257 (92) between evolutionary distance and (expected) quality of 4 91 (17) 43 (9) 168 (83) the reference annotation when selecting reference species 2 466 (63) 1 205 (36) 1 729 (367) for GeMoMa. The integration of RNA-seq data into GeMoMa might b) Multi-coding-exon predictions help to overcome wrongly annotated splice sites in the #evidence tie =00 < tie <1tie = 1 reference species in some cases. However, missing or 1 9 671 (287) 942 (211) 1 681 (775) wrongly additional annotated exons in the reference anno- 2 283 (36) 86 (32) 456 (196) tation might still lead to partially wrong gene model 3 155 (31) 64 (43) 382 (223) predictions in the target species. The benefit of RNA-seq 4 142 (57) 55 (37) 302 (196) data, however, also depends on the quality and amount of sequenced reads, on the diversity (tissues, conditions) 10 251 (411) 1 147 (323) 2 821 (1 390) of the sequenced samples, and on the library type, where The numbers in parenthesis depict those predictions that are partially supported by stranded libraries should be more informative than non- any best BLAT hit of ESTs stranded ones. In addition, GeMoMa uses RNA-seq data currently only to refine homologous genes models and stating that each predicted intron is supported by at least not to identify transcribed gene models that do not show one split read from mapped RNA-seq data. Out of these, any homology. Hence, GeMoMa should be used in com- 1 390 are partially supported by best BLAT matches of bination with other gene predictors allowing for purely ESTs to the genome. RNA-seq-based or ab-initio gene predcitions. Exemplar- Besides predictions that are well supported by RNA-seq ily, we demonstrate that GeMoMa helps to improve the performance of combined gene predictor pipelines as for data, we also observe thousands of predictions that are not instance MAKER2. (tpc = 0ortie = 0) or only partially (0 < tpc < 1or0 < Notably, model organisms have been used as target tie < 1) supported by RNA-seq. Despite no or only partial organisms in this benchmark, whereas they would typi- RNA-seq support, we find that 833 are partially supported cally be used as reference organisms in real applications. by best BLAT matches of ESTs to the genome. Hence, the performance of homology-based gene pre- Alternatively, we can utilize the number of reference diction programs might be underestimated. In summary, organisms that support a prediction (#evidence) to fil- we recommend to use homology-based gene prediction ter the predictions as noted for D. melanogaster.This approach will decrease sensitivity, but increase specificity using RNA-seq data as implemented in GeMoMa when- obtaining predictions with a high confidence. Although, ever high-quality gene annotations of related species are we find the most predictions with #evidence = 1, we also available. find about 3 500 predictions with #evidence > 1, more Interestingly, we find that GeMoMa works especially than 1 100 of these predictions are additionally supported well for D. melanogaster in the benchmark study com- by RNA-seq data or ESTs. pared to the performance of its competitors. One possible reason could be that Flybase used homology and RNA-seq Conclusions data besides other evidence to infer the gene annotation Summarizing the methods and results, we present an [19]. In contrast, we find the worst results in C. elegans in extension of GeMoMa that allows for the incorporation the benchmark study, which might be related to the fact of RNA-seq data into homology-based gene prediction that the C. elegans gene set contains many rare isoform utilizing intron position conservation. Comparing the community submissions whereas C. briggsae was anno- performance of GeMoMa with and without RNA-seq evi- tated by a large scale gene predictions effort based on dence, we demonstrate for all four organism included in RNA-seq. Scrutinizing the annotation in Wormbase, we pre- the benchmark that RNA-seq evidence improves the per- dicted protein-coding transcripts for four nematode formance of GeMoMa. GeMoMa performs equally well species based on the annotation of the model organism or better than BRAKER1, MAKER2, CodingQuarry, and C. elegans. We find that a substantial part of the GeMoMa purely RNA-seq-based pipelines on the benchmark data predictions is either missing, marked as modification sets including plants, animals and fungi. Keilwagen et al. BMC Bioinformatics (2018) 19:189 Page 11 of 12 of annotated transcripts or alternative transcripts. Espe- Ethics approval and consent to participate Not applicable. cially for the three nematodes, C. brenneri, C. japonica and C. remanei, that are annotated solely using ab-initio Competing interests gene prediction, we find a large part of the annotation The authors declare that they have no competing interests. that is marked as questionable or missing. This may give Publisher’s Note an indication, why homology-based gene prediction for Springer Nature remains neutral with regard to jurisdictional claims in C. elegans shows less good performance in the benchmark published maps and institutional affiliations. study. The GeMoMa predictions of the four nematodes Author details will be included in Wormbase in the upcoming releases. 1 Institute for Biosafety in Plant Biotechnology, Julius Kühn-Institut (JKI) - Furthermore, GeMoMa will be included in the WormBase Federal Research Centre for Cultivated Plants, D-06484, Quedlinburg, Germany. European Molecular Biology Laboratory, European Bioinformatics gene curation process and trialled for strain annotation. Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SD, Cambridge, Furthermore, we predicted protein-coding transcripts 3 UK. Plant Genome and Systems Biology, Helmholtz Center Munich - German for barley using four reference species and find several Research Center for Environmental Health, D-85764, Neuherberg, Germany. Institute of Computer Science, Martin Luther University Halle–Wittenberg, hundreds of predictions that are not included in the refer- D-06120, Halle (Saale), Germany. ence annotation but are supported by RNA-seq data, ESTs Received: 20 November 2017 Accepted: 14 May 2018 or multiple reference species. Hence, we conclude that these are valuable predictions harboring additional barley genes. These predictions will be incorporated in the new References barley annotation. 1. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, GeMoMa provides a user-friendly documentation and Salzberg SL, Wold BJ, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching can be use as command line tool or through the Galaxy during cell differentiation,. Nat Biotechnol. 2010;28(5):511–5. https://doi. workflow management system [39] providing its own org/10.1038/nbt.1621. Galaxy integration (Fig. 1). GeMoMa is freely avail- 2. Pertea M, Pertea GM, Antonescu CM, Chang T-C, Mendell JT, Salzberg SL. StringTie enables improved reconstruction of a transcriptome from able under GNU GPL3 at http://www.jstacs.de/index.php/ RNA-seq reads. Nat Biotech. 2015;33(3):290–5. https://doi.org/10.1038/ GeMoMa. nbt.3122. 3. Solovyev V, Kosarev P, Seledsov I, Vorobyev D. Automatic annotation of Additional files eukaryotic genes, pseudogenes and promoters. Genome Biol. 2006;7(1):10. https://doi.org/10.1186/gb-2006-7-s1-s10. 4. Stanke M, Diekhans M, Baertsch R, Haussler D. Using native and Additional file 1: Supplementary Tables and Figures. Table S1:Data syntenically mapped cDNA alignments to improve de novo gene finding. used for the BRAKER1 benchmark. Table S2: Data for Wormbase study. Bioinformatics. 2008;24(5):637. https://doi.org/10.1093/bioinformatics/ Table S3: Data for barley annotation. Table S4: F1-measure on the btn013. BRAKER1 test sets. Table S5: CodingQuarry and GeMoMa results for 5. Slater G, Birney E. Automated generation of heuristics for biological S. pombe using the original read mappings from [10, 11]. Table S6: sequence comparison. BMC Bioinformatics. 2005;6(1):31. https://doi.org/ CodingQuarry and GeMoMa results for S. cerevisiae. Table S7: GeMoMa 10.1186/1471-2105-6-31. runtime. Table S8: GeMoMa performance for D.melanogaster.(PDF147 kb) 6. She R, Chu JS-C, Uyar B, Wang J, Wang K, Chen N. genBlastG: using BLAST searches to build homologous gene models. Bioinformatics. 2011;27(15):2141–3. https://doi.org/10.1093/bioinformatics/btr342. Abbreviations http://bioinformatics.oxfordjournals.org/content/27/15/2141.full.pdf+ avgCov: Average coverage; cDNA: Complementary DNA; CDS: Coding html. sequence; ERE: Extract RNA-seq evidence; EST: Expressed sequence tag; GAF: 7. Keilwagen J, Wenk M, Erickson JL, Schattat MH, Grau J, Hartung F. Using GeMoMa annotation filter; GeMoMa: Gene Model Mapper; iAA: Identical intron position conservation for homology-based gene prediction. amino acids; tae: Transcript acceptor evidence; tde: Transcript donor evidence; Nucleic Acids Res. 2016;44(9):89. https://doi.org/10.1093/nar/gkw092. tie: Transcript intron evidence; tpc: Transcript percentage coverage; minCov: 8. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for Minimal coverage; minSplitReads: Minimum number of split reads; pAA: transcriptomics. Nat Rev Genet. 2009;10(1):57–63. https://doi.org/10. Positive-scoring amino acids 1038/nrg2484. 9. Holt C, Yandell M. MAKER2: an annotation pipeline and Acknowledgements genome-database management tool for second-generation genome We thank Katharina Hoff for providing the BRAKER1 benchmark data sets, projects. BMC Bioinformatics. 2011;12(1):491. https://doi.org/10.1186/ Carson Holt for assisting the MAKER2 comparison, Gil dos Santos for his 1471-2105-12-491. comments on the quality of the Drosophila genome assemblies, Erich Schwartz 10. Testa AC, Hane JK, Ellwood SR, Oliver RP. CodingQuarry: highly accurate for his comments on C. japonica, Alison Testa for her help with CodingQuarry, hidden Markov model gene prediction in fungal genomes using RNA-seq Gary Williams for his comments on C. briggsae, and Thomas Berner for transcripts. BMC Genomics. 2015;16(1):170. https://doi.org/10.1186/ technical assistance. We thank the open access publication fund of Martin s12864-015-1344-4. Luther University Halle–Wittenberg for funding article-processing charges. 11. Hoff KJ, Lange S, Lomsadze A, Borodovsky M, Stanke M. BRAKER1: Unsupervised RNA-Seq-Based Genome Annotation with GeneMark-ET Availability of data and materials and AUGUSTUS. Bioinformatics. 2016;32(5):767. https://doi.org/10.1093/ Not applicable. Accession numbers of publicly available data used in this study bioinformatics/btv661. are listed in Additional file 1. 12. Lomsadze A, Burns PD, Borodovsky M. Integration of mapped rna-seq reads into automatic training of eukaryotic gene finding algorithm. Authors’ contributions Nucleic Acids Res. 2014;42(15):119. https://doi.org/10.1093/nar/gku557. JK and JG implemented the software, designed and performed benchmark 13. Howe KL, Bolt BJ, Cain S, Chan J, Chen WJ, Davis P, Done J, Down T, studies. JK, FH, MP, SOT and JG contributed to data analysis and interpretation, Gao S, Grove C, Harris TW, Kishore R, Lee R, Lomax J, Li Y, Muller H-M, and to writing the manuscript. All authors read and approved the final Nakamura C, Nuin P, Paulini M, Raciti D, Schindelman G, Stanley E, manuscript. Keilwagen et al. BMC Bioinformatics (2018) 19:189 Page 12 of 12 Tuli MA, Van Auken K, Wang D, Wang X, Williams G, Wright A, Yook K, 25. Bennetzen JL, Schmutz J, Wang H, Percifield R, Hawkins J, Pontaroli AC, Berriman M, Kersey P, Schedl T, Stein L, Sternberg PW. Wormbase 2016: Estep M, Feng L, Vaughn JN, Grimwood J, Jenkins J, Barry K, Lindquist E, expanding to enable helminth genomic research. Nucleic Acids Res. HellstenU,DeshpandeS,WangX,WuX,MitrosT,TriplettJ,YangX, 2016;44(D1):774. https://doi.org/10.1093/nar/gkv1217. Ye C-Y, Mauro-Herrera M, Wang L, Li P, Sharma M, Sharma R, Ronald 14. Mascher M, Gundlach H, Himmelbach A, Beier S, Twardziok SO, Wicker T, PC, Panaud O, Kellogg EA, Brutnell TP, Doust AN, Tuskan GA, Rokhsar Radchuk V, Dockter C, Hedley PE, Russell J, Bayer M, Ramsay L, Liu H, D, Devos KM. Reference genome sequence of the model plant Setaria. Haberer G, Zhang X-Q, Zhang Q, Barrero RA, Li L, Taudien S, Groth M, Nat Biotechnol. 2012;30(6):555–61. https://doi.org/10.1038/nbt.2196. Felder M, Hastie A, Šimková H, Stanková ˇ H, Vrána J, Chan S, 26. Pertea M, Kim D, Pertea GM, Leek JT, Salzberg SL. Transcript-level MuñozAmatriaín M, Ounit R, Wanamaker S, Bolser D, Colmsee C, expression analysis of RNA-seq experiments with HISAT, StringTie and Schmutzer T, Aliyeva-Schnorr L, Grasso S, Tanskanen J, Chailyan A, Ballgown. Nat Protocols. 2016;11(9):1650–67. https://doi.org/10.1038/ Sampath D, Heavens D, Clissold L, Cao S, Chapman B, Dai F, Han Y, Li H, nprot.2016.095. Li X, Lin C, McCooke JK, Tan C, Wang P, Wang S, Yin S, Zhou G, Poland JA, 27. Kent WJ. BLAT–the BLAST-like alignment tool. Genome Res. 2002;12(4): Bellgard MI, Borisjuk L, Houben A, Doležel J, Ayling S, Lonardi S, Kersey P, 656–64. https://doi.org/10.1101/gr.229202. Article published online Langridge P, Muehlbauer GJ, Clark MD, Caccamo M, Schulman AH, before March 2002. Mayer KFX, Platzer M, Close TJ, Scholz U, Hansson M, Zhang G, 28. Keibler E, Brent MR. Eval: A software package for analysis of genome Braumann I, Spannagl M, Li C, Waugh R, Stein N. A chromosome annotations. BMC Bioinformatics. 2003;4(1):50. https://doi.org/10.1186/ conformation capture ordered sequence of the barley genome. Nature. 1471-2105-4-50. 2017;544(7651):427–33. https://doi.org/10.1038/nature22043. 29. Steijger T, Abril JF, Engström PG, Kokocinski F, Hubbard TJ, Guigó R, 15. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local Harrow J, Bertone P, Consortium R, et al. Assessment of transcript alignment search tool. J Mol Biol. 1990;215(3):403–10. https://doi.org/10. reconstruction methods for RNA-seq. Nat Methods. 2013;10(12):1177–84. 1016/S0022-2836(05)80360-2. 30. Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson A, Szczesniak ´ MW, Gaffney DJ, Elo LL, Zhang X, Mortazavi A. 16. Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, A survey of best practices for RNA-seq data analysis. Genome Biol. Couger MB, Eccles D, Li B, Lieber M, MacManes MD, Ott M, Orvis J, 2016;17(1):13. https://doi.org/10.1186/s13059-016-0881-8. Pochet N, Strozzi F, Weeks N, Westerman R, William T, Dewey CN, 31. Burset M, Guigó R. Evaluation of gene structure prediction programs. Henschel R, LeDuc RD, Friedman N, Regev A. De novo transcript Genomics. 1996;34(3):353–67. https://doi.org/10.1006/geno.1996.0298. sequence reconstruction from RNA-seq using the Trinity platform for 32. Powers DMW. Evaluation: From precision, recall and F-measure to ROC, reference generation and analysis. Nat Protocols. 2013;8(8):1494–512. informedness, markedness & correlation. J Mach Learn Technol. https://doi.org/10.1038/nprot.2013.084. 2011;2(1):37–63. 17. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: 33. Gramates LS, Marygold SJ, Santos Gd, Urbano J-M, Antonazzo G, accurate alignment of transcriptomes in the presence of insertions, Matthews BB, Rey AJ, Tabone CJ, Crosby MA, Emmert DB, Falls K, deletions and gene fusions. Genome Biol. 2013;14(4):R36. https://doi.org/ Goodman JL, Hu Y, Ponting L, Schroeder AJ, Strelets VB, Thurmond J, 10.1186/gb-2013-14-4-r36. Zhou P. FlyBase at 25: looking to the future. Nucleic Acids Res. 18. Rawat V, Abdelsamad A, Pietzenuk B, Seymour DK, Koenig D, Weigel D, 2017;45(D1):663–71. https://doi.org/10.1093/nar/gkw1016. Pecinka A, Schneeberger K. Improving the annotation of arabidopsis 34. Clark AG, Eisen MB, Smith DR, Bergman CM, Oliver B, Markow TA, lyrata using rna-seq data. PLOS ONE. 2015;10(9):1–12. https://doi.org/10. Kaufman TC, Kellis M, Gelbart W, Iyer VN, et al. Evolution of genes and 1371/journal.pone.0137391. genomes on the Drosophila phylogeny. Nature. 2007;450(7167):203–18. 19. Matthews BB, dos Santos G, Crosby MA, Emmert DB, St Pierre SE, 35. Hu TT, Eisen MB, Thornton KR, Andolfatto P. A second-generation Gramates LS, Zhou P, Schroeder AJ, Falls K, Strelets V, Russo SM, assembly of the Drosophila simulans genome provides new insights into Gelbart WM, The FlyBase Consortium. Gene model annotations for patterns of lineage-specific divergence. Genome Res. 2013;23(1):89–98. drosophila melanogaster: Impact of high-throughput data. G3: Genes https://doi.org/10.1101/gr.141689.112. http://genome.cshlp.org/content/ Genomes Genet. 2015;5(8):1721–36. https://doi.org/10.1534/g3.115. 23/1/89.full.pdf+html. 018929. http://www.g3journal.org/content/5/8/1721.full.pdf. 36. Singh ND, Larracuente AM, Sackton TB, Clark AG. Comparative genomics 20. Rhind N, Chen Z, Yassour M, Thompson DA, Haas BJ, Habib N, Wapinski I, on the drosophila phylogenetic tree. Annu Rev Ecol Evol Syst. 2009;40(1): Roy S, Lin MF, Heiman DI, Young SK, Furuya K, Guo Y, Pidoux A, Chen HM, 459–80. https://doi.org/10.1146/annurev.ecolsys.110308.120214. Robbertse B, Goldberg JM, Aoki K, Bayne EH, Berlin AM, Desjardins CA, 37. Coghlan A, Fiedler TJ, McKay SJ, Flicek P, Harris TW, Blasiar D, nGASP Dobbs E, Dukaj L, Fan L, FitzGerald MG, French C, Gujja S, Hansen K, Consortium, Stein LD. ngasp–the nematode genome annotation Keifenheim D, Levin JZ, Mosher RA, Müller CA, Pfiffner J, Priest M, Russ C, assessment project. BMC Bioinformatics. 2008;9:549. https://doi.org/10. Smialowska A, Swoboda P, Sykes SM, Vaughn M, Vengrova S, Yoder R, 1186/1471-2105-9-549. Zeng Q, Allshire R, Baulcombe D, Birren BW, Brown W, Ekwall K, Kellis M, 38. Kiontke KC, Félix M-A, Ailion M, Rockman MV, Braendle C, Pénigault J-B, Leatherwood J, Levin H, Margalit H, Martienssen R, Nieduszynski CA, Fitch DH. A phylogeny and molecular barcodes for caenorhabditis, with Spatafora JW, Friedman N, Dalgaard JZ, Baumann P, Niki H, Regev A, numerous new species from rotting fruits. BMC Evol Biol. 2011;11(1):339. Nusbaum C. Comparative functional genomics of the fission yeasts. https://doi.org/10.1186/1471-2148-11-339. Science. 2011;332(6032):930–6. https://doi.org/10.1126/science.1203357. 39. Afgan E, Baker D, van den Beek M, Blankenberg D, Bouvier D, Cech M, http://science.sciencemag.org/content/332/6032/930.full.pdf. Chilton J, Clements D, Coraor N, Eberhard C, Grüning B, Guerler A, 21. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Hillman-Jackson J, Von Kuster G, Rasche E, Soranzo N, Turaga N, Taylor J, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Nekrutenko A, Goecks J. The galaxy platform for accessible, reproducible Bioinformatics. 2013;29(1):15. https://doi.org/10.1093/bioinformatics/ and collaborative biomedical analyses: 2016 update. Nucleic Acids Res. bts635. 2016;44(W1):3. https://doi.org/10.1093/nar/gkw343. 22. Lamesch P, Berardini TZ, Li D, Swarbreck D, Wilks C, Sasidharan R, Muller R, Dreher K, Alexander DL, Garcia-Hernandez M, Karthikeyan AS, Lee CH, Nelson WD, Ploetz L, Singh S, Wensel A, Huala E. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res. 2012;40(D1):1202. https://doi.org/10.1093/nar/gkr1090. 23. International Brachypodium Initiative. Genome sequencing and analysis of the model grass Brachypodium distachyon,. Nature. 2010;463(5):763–8. https://doi.org/10.1038/nature08747. 24. Ouyang S, Zhu W, Hamilton J, Lin H, Campbell M, Childs K, Thibaud-Nissen F, Malek R. L, Lee Y, Zheng L, Orvis J, Haas B, Wortman J, Buell CR. The tigr rice genome annotation resource: improvements and new features. Nucleic Acids Res. 2007;35(suppl_1):883. https://doi.org/10. 1093/nar/gkl976. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png BMC Bioinformatics Springer Journals

Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi

Free
12 pages
Loading next page...
 
/lp/springer_journal/combining-rna-seq-data-and-homology-based-gene-prediction-for-plants-JkG3koVG0s
Publisher
BioMed Central
Copyright
Copyright © 2018 by The Author(s)
Subject
Life Sciences; Bioinformatics; Microarrays; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Algorithms
eISSN
1471-2105
D.O.I.
10.1186/s12859-018-2203-5
Publisher site
See Article on Publisher Site

Abstract

Background: Genome annotation is of key importance in many research questions. The identification of protein-coding genes is often based on transcriptome sequencing data, ab-initio or homology-based prediction. Recently, it was demonstrated that intron position conservation improves homology-based gene prediction, and that experimental data improves ab-initio gene prediction. Results: Here, we present an extension of the gene prediction program GeMoMa that utilizes amino acid sequence conservation, intron position conservation and optionally RNA-seq data for homology-based gene prediction. We show on published benchmark data for plants, animals and fungi that GeMoMa performs better than the gene prediction programs BRAKER1, MAKER2, and CodingQuarry, and purely RNA-seq-based pipelines for transcript identification. In addition, we demonstrate that using multiple reference organisms may help to further improve the performance of GeMoMa. Finally, we apply GeMoMa to four nematode species and to the recently published barley reference genome indicating that current annotations of protein-coding genes may be refined using GeMoMa predictions. Conclusions: GeMoMa might be of great utility for annotating newly sequenced genomes but also for finding homologs of a specific gene or gene family. GeMoMa has been published under GNU GPL3 and is freely available at http://www.jstacs.de/index.php/GeMoMa. Keywords: Homology-based gene prediction, RNA-seq, Genome annotation Background Genome annotation pipelines utilize three main sources The annotation of protein-coding genes is of critical of information, namely evidence from wet-lab transcrip- importance for many fields of biological research includ- tome studies [1, 2], ab-initio gene prediction based on ing, for instance, comparative genomics, functional pro- general features of (protein-coding) genes [3, 4], and teomics, gene targeting, genome editing, phylogenetics, homology-based gene prediction relying on gene models transcriptomics, and phylostratigraphy. The process of of (closely) related, well-annotated species [5–7]. annotating protein-coding genes to an existing genome Experimental data allow for inferring coverage of gene (assembly) can be described as specifying the exact predictions and splice sites bordering their exons, which genomic location of genes comprising all (partially) cod- may assist computational ab-initio or homology-based ing exons. A difficulty in gene annotation is distinc- approaches. Due to the progress in the field of next gener- tion between protein-coding genes, transposons and ation sequencing, RNA-seq has revolutionized transcrip- pseudogenes. tomics [8]. Today, RNA-seq data is available for a wide range of organisms, tissues and environmental conditions, and can be utilized for genome annotation pipelines. *Correspondence: jens.keilwagen@julius-kuehn.de In recent years, several programs have been developed Institute for Biosafety in Plant Biotechnology, Julius Kühn-Institut (JKI) - that combine multiple sources allowing for a more accu- Federal Research Centre for Cultivated Plants, D-06484, Quedlinburg, Germany rate prediction of protein-coding genes [9–11]. MAKER2 Full list of author information is available at the end of the article © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Keilwagen et al. BMC Bioinformatics (2018) 19:189 Page 2 of 12 is a pipeline that integrates support of different resources mapped RNA-seq data. Introns passing this filter define including ab-initio gene predictors and RNA-seq data donor and acceptor splice sites, which are treated inde- [9]. CodingQuarry is a pipeline for RNA-Seq assembly- pendently within GeMoMa. If splice sites with experimen- supported training and gene prediction, which is only tal evidence have been detected in a genomic region with a recommended for application to fungi [10]. Recently, [11] good match to an exon of a reference transcript, these are collected for further use. If no splice sites with experimen- published BRAKER1 a pipeline for unsupervised RNA- seq-based genome annotation that combines the advan- tal evidence have been detected in a genomic region with a tages of GeneMark-ET [12] and AUGUSTUS [4]. good match to an exon of a reference transcript, GeMoMa Here, we present an extension of GeMoMa [7]thatuti- resorts to conserved dinucleotides allowing to identify lizes RNA-seq data in addition to amino acid sequence gene models that are not covered by RNA-seq data due to, and intron position conservation. We investigate the per- e.g., very specifically or lowly expressed transcripts. Com- formance of GeMoMa on publicly available benchmark bining two potential exons, all in-frame combinations data [11] and compare it with state-of-the-art competitors using the collected donor and acceptor splice sites are [9–11]. tested and scored according to the reference transcript. Subsequently, we demonstrate how combining homology- The best combination is used for the prediction. based predictions based on gene models from multiple Based on this experimental evidence, the improved reference organisms can be used to improve the perfor- version of GeMoMa provides several new properties mance of GeMoMa. Finally, we apply GeMoMa to four reported for gene predictions. The most prominent fea- nematode species provided by Wormbase [13] and to the tures are transcript intron evidence (tie) and transcript recently published barley reference genome [14], where percentage coverage (tpc). The tie of a transcript varies GeMoMa predictions will be included into future versions between 0 and 1, and corresponds to the fraction of of the corresponding genome annotations. introns (i.e., splice sites of two neighboring exons) that are supported by split reads in the mapped RNA-seq Methods data. In case of transcripts comprising a single coding In this section, we describe recent extensions of GeMoMa exon, NA is reported. The tpc of a transcript also varies to make use of evidence from RNA-seq data, the RNA-seq between 0 and 1, and corresponds to the fraction of (cod- pipelines used and the data considered in the benchmark ing) bases of a predicted transcript that are also covered by mapped reads in the RNA-seq data. Further properties and application studies. reported by GeMoMa are i) tae and tde,the percentages of acceptor and donor sites, respectively, with RNA-seq GeMoMa using RNA-seq evidence, ii) minCov and avgCov, the minimum and aver- GeMoMa predicts protein-coding genes utilizing the gen- eral conservation of protein-coding genes on the level of age coverage, respectively, of the predicted transcript, and their amino acid sequence and on the level of their intron iii) minSplitReads, the minimum number of split reads positions, i.e., the locations of exon-exon boundaries in supporting any of the predicted introns of a transcript. CDSs [7]. To this end, sequences of (partially) protein- Optionally, GeMoMa reports pAA and iAA,the percent- coding exons are extracted from well-annotated reference age of positive-scoring and identical amino acids in a genomes. Individual exons are then matched to loci on the pairwise alignment, if the reference protein is provided target genome using tblastn [15], matches are adjusted for as input. proper splice sites, start codons and stop codons, respec- GeMoMa allows for computing and ranking multiple tively, and joined to full, protein-coding genes models. In predictions per reference transcript, but does not fil- this process, the conserved dinucleotides GT and GC for ter these predictions. Predictions of different reference donor splice sites, and AG for acceptor splice sites have transcripts might be highly overlapping or even identi- been used for the identification of splice sites border- cal, especially if the reference transcripts are from the ing matches to the (partially) protein-coding exons of the same gene family. Since GeMoMa 1.4, the default param- reference transcripts. The improved version of GeMoMa eters for number of predictions and contig threshold have may now also include experimental splice site evidence been changed which might lead to an increased number extracted from mapped RNA-seq data to improve the of highly overlapping or identical predictions. In addi- accuracy of splice site and, hence, exon annotation. We tion, it might be beneficial to run GeMoMa starting from visualize the extended GeMoMa pipeline in Fig. 1. multiple reference species to broaden the scope of tran- scripts covered by the predictions. However, these may Starting from mapped RNA-seq data, the module also result in redundant predictions for, e.g., orthologs or Extract RNA-seq evidence (ERE) allows for extracting paralogs stemming from the different reference species introns and, if user-specified, read coverage of genomic regions. GeMoMa filters these introns using a user- considered. To handle such situations, the new module specified minimal number of split reads within the GeMoMa annotation filter (GAF) of the improved version Keilwagen et al. BMC Bioinformatics (2018) 19:189 Page 3 of 12 Fig. 1 GeMoMa workflow. Blue items represent input data sets, green boxes represent GeMoMa modules, while grey boxes represent external modules. The GeMoMa Annotation Filter allows to combine predictions from different reference species and produces the final output. RNA-seq data is optional of GeMoMa now allows for joining and reducing such pre- genes in the cluster looking for discarded predictions that dictions using various filters. Filtering criteria comprise do not overlap with any selected prediction, which are the relative GeMoMa score of a predicted transcript, fil- recovered. In the benchmark studies comparing GeMoMa tering for complete predictions (starting with start codon with state-of-the-art competitors, we directly use the GAF and ending with stop codon), and filtering for evidence results without any further filters on attributes reported from multiple reference organisms. In addition, GAF also by the GeMoMa pipeline. joins duplicate predictions that originate from different In addition to the modules for annotating a genome reference transcripts. (assembly) described above, we also provide two addi- Initially, GAF filters predictions based on their rela- tional modules in GeMoMa for analyzing and comparing tive GeMoMa score, i.e., the GeMoMa score divided by to prediction to a reference annotation. The module Com- the length of the predicted protein. This filter removes pareTranscripts determines that CDS of the reference spurious predictions. Subsequently, the predictions are annotation with the largest overlap with the prediction clustered based on their genomic location. Overlapping utilizing the F measure as objective function [7]. The predictions on the same strand yield a common cluster. module AnnotationEvidence computes tie and tpc of all For each cluster, the prediction with the highest GeMoMa CDSs of a given annotation. Hence, these two modules score is selected. Non-identical predictions overlapping can be used to determine, whether a prediction is known, the high-scoring prediction with at least a user-specified partially known or new and whether the overlapping percentage of borders (i.e., splice sites, start and stop annotation has good RNA-seq support. codon, cf. common border filter) are treated as alterna- tive transcripts. Predictions that have completely identical MAKER2 predictions borders to any previously selected prediction are removed Recently, we have shown that GeMoMa outperforms and only listed in the GFF attribute field alternative.All state-of-the-art homology-based gene predictors [7]. We filtered predictions of a cluster are assigned to one gene are not aware of any homology-based gene predic- with a generic gene name. Finally, GAF checks for nested tion program that allows for incorporating of RNA-seq Keilwagen et al. BMC Bioinformatics (2018) 19:189 Page 4 of 12 data. Hence, we provide predictions of MAKER2 using with the previous benchmark study, we use the manually the same reference proteins as GeMoMa for a mini- curated gene set of Wormbase. mal comparison. Internally, MAKER2 uses exonerate [5] For the analysis of barley, we use the latest genome for homology-based gene prediction. We run MAKER2 assembly and gene annotation [14]. As reference species, with default parameters except protein2genome=1, we choose A. thaliana [22], B. distachyon [23], O. sativa [24], and S. italica [25] (Additional file 1:Table S2). and genome and protein set to the respective input files. In addition, we run MAKER2 using (i) RNA-seq In addition to genome assembly and gene annotation, data in form of Trinity 2.4 transcripts (-jaccard_clip) [16], we also used RNA-seq data from four different public (ii) homology in form of proteins of one related refer- available data sets (ERP015182, ERP015986, SRP063318, ence species, and (iii) ab-initio gene prediction in form SRP071745). Reads were mapped and assembled using of Augustus 3.3 [4]. In this case, we run MAKER2 with Hisat2 and StringTie [26]. As reference annotation, we default parameters except genome, est, protein,and used the union of high and low confidence annotation. augustus_species,whichhavebeensetto thecorre- As independent evidence for validating GeMoMa pre- sponding species. For comparison, we run Maker2 with dictions in the nematode species and barley, we use ESTs the same parameter settings but using the GeMoMa pre- and cDNAs. While Wormbase provides coordinates for dictions for protein_gff instead of using protein. best BLAT matches, we adapt the pipeline and download all available EST from NCBI and map them to the genome RNA-seq pipelines using BLAT [27]. Computational pipelines have been used to infer gene annotation from RNA-seq data produced by next genera- Results and discussion tion sequencing methods. Dozens of tools and tool com- Benchmark binations have been proposed. Here, we focus on the short The comparison of different software pipelines is often read mapper TopHat2 [17], the transcript assemblers critical as a) specific parameters settings might be cru- Cufflinks [1] and StringTie [2], and the coding sequence cial for good results and b) different input might be used. predictor TransDecoder [16]. Based on the transcript For these reasons, we designed the benchmark as follows. assemblers, we build two RNA-seq pipelines following the First, we use publicly available gene predictions results. instructions in [11]. Second,welimit thenumberofreference speciestoone in the initial study. Data We used GeMoMa for predicting the gene annota- For the benchmark studies, we consider target species tions of A. thaliana, C. elegans, D. melanogaster,and S. pombe.InTable 1, we summarize the performance of and their genome versions as specified in the BRAKER1 supplement. For the homology-based prediction by BRAKER1, MAKER2, and CodingQuarry as reported in GeMoMa, we choose one closely related reference species Hoff et al. [11], as well as the performance of GeMoMa per target species that are sequenced and annotated with and without RNA-seq evidence, purely RNA-seq- [13, 18–20]. For these species, we consider the latest based pipelines and various MAKER2 predictions. The genome versions available (Additional file 1:Table S1). results of CodingQuarry reported by Hoff et al. [11]devi- For the analysis of C. elegans,weusetheman- ate substantially from those originally reported by Testa ually curated gene set of C. briggsae provided by et al. [10]. We find that the performance of CodingQuarry Wormbase. In addition, we use the experimental evi- is highly sensitive to RNA-seq processing, whereas the dence from RNA-seq data referenced in the BRAKER1 performance of GeMoMa is barely affected (Additional publication. file 1: Table S5). For all comparisons, we provide sen- For the analysis of the four nematode species, sitivity (Sn) and specificity (Sp) for the categories gene, C. brenneri, C. briggsae, C. japonica,and C. remanei, transcript, and exon, respectively [28]. In addition, we we use the genome assembly and gene annotation of compare CodingQuarry with GeMoMa for S. cerevisiae Wormbase WS257 [13]. We choose the model organism (Additional file 1:TableS6). C. elegans as reference species (Additional file 1:Table S2). First, we compare the two purely homology-based In addition to genome assembly and gene annotation, we predictions, namely on the one hand side MAKER2 using also use publicly available RNA-seq data of these four exonerate and on the other hand side GeMoMa without nematode species, which have been mapped by Worm- RNA-seq data. In all cases, we use the same reference base using STAR [21]. We used a minimum intron size of species and reference proteins. We find that MAKER2 25 bp, a maximum intron size of 15Kb, specify that only using only homologous proteins has a higher exon speci- reads mapping once or twice on the genome are reported, ficity than GeMoMa without RNA-seq data for C. elegans, and alignments are reported only if their ratio of mis- while the opposite is true for all other categories and matches to mapped length is less than 0.02. In accordance target species. Keilwagen et al. BMC Bioinformatics (2018) 19:189 Page 5 of 12 Table 1 Benchmark results on the BRAKER1 test sets + + + ∗ ∗ ∗ + + MAKER2 GeMoMa without GeMoMa with RNAseq- RNAseq- BRAKER1 MAKER2 CodingQuarry MAKER2 (exonerate, MAKER2 (exonerate) RNA-seq data RNA-seq data Cufflinks StringTie Trinity, Augustus) (GeMoMa, Trinity, Augustus) Arabidopsis thaliana (ref. A. lyrata) Gene Sn 44.0 61.3 66.5 28.9 35.9 64.4 51.3 NA 56.9 57.9 Gene Sp 47.8 65.7 71.3 47.9 59.1 52.0 52.5 NA 65.7 67.8 Transcript Sn 37.5 52.2 57.2 26.6 33.7 55.0 43.5 NA 48.3 49.1 Transcript Sp 47.8 65.7 65.3 35.6 48.3 50.9 52.5 NA 65.7 67.8 Exon Sn 70.0 79.3 80.6 58.1 60.8 82.9 76.1 NA 81.8 82.1 Exon Sp 81.9 86.6 87.5 81.9 87.1 79.0 76.1 NA 87.5 88.6 Caenorhabditis elegans (ref. C. briggsae) Gene Sn 26.2 39.6 49.1 18.7 22.6 55.0 41.0 NA 40.5 47.3 Gene Sp 38.0 49.9 63.8 29.1 36.1 55.2 30.8 NA 51.5 56.4 Transcript Sn 21.0 30.7 39.8 16.2 20.0 43.0 31.3 NA 31.4 36.2 Transcript Sp 38.0 49.9 58.7 24.1 30.1 53.2 30.8 NA 51.5 56.4 Exon Sn 50.3 64.2 67.1 54.4 59.1 80.2 69.4 NA 70.5 75.2 Exon Sp 82.6 81.5 87.5 81.3 84.1 85.3 62.3 NA 85.6 86.7 Drosophila melanogaster (ref. D. simulans) Gene Sn 64.3 78.2 83.1 55.7 55.2 64.9 55.2 NA 61.5 64.0 Gene Sp 69.2 81.6 87.1 71.3 73.5 59.4 46.3 NA 69.6 71.9 Transcript Sn 44.1 52.9 65.0 48.7 49.0 46.1 38.5 NA 42.7 44.3 Transcript Sp 69.2 81.6 81.2 60.1 65.7 57.9 46.3 NA 69.6 71.9 Exon Sn 69.0 76.3 80.0 67.8 66.2 75.0 66.5 NA 74.3 76.3 Exon Sp 89.1 92.0 93.3 85.4 88.3 81.7 66.9 NA 88.0 89.1 Schizosaccharomyces pombe (ref. S. octosporus) Gene Sn 49.2 76.4 79.2 69.0 65.8 77.4 42.8 79.7 71.6 74.6 Gene Sp 59.9 84.6 88.0 93.8 92.5 80.5 68.7 72.6 88.1 89.1 Transcript Sn 49.2 76.4 79.2 69.0 65.8 77.4 42.8 79.7 71.6 74.6 Transcript Sp 59.9 84.6 87.6 80.5 71.3 76.5 68.7 72.6 88.1 89.1 Exon Sn 56.1 81.6 83.1 77.2 77.7 83.2 50.1 79.6 79.2 81.2 Exon Sp 73.3 88.6 91.9 87.6 81.7 83.2 71.4 81.7 92.0 92.6 The target species are given in multi-column rows. The same reference species, which is given in brackets, is used for all tools using homology-based gene prediction indicated by plus. The asterisks indicates that the performance of BRAKER1, MAKER2 and CodingQuarry is given as reported in [11]. The highest value per line is depicted in bold-face Keilwagen et al. BMC Bioinformatics (2018) 19:189 Page 6 of 12 Second, we additionally consider RNA-seq data. Hence, we compare GeMoMa to combined gene pre- MAKER2 does not allow for combining RNA-seq evi- diction approaches. Specifically, we compare the perfor- dence and homology-based predictions without using mance of GeMoMa using RNA-seq evidence to BRAKER1 any ab-initio gene predictor. In contrast, GeMoMa allows in Fig. 2, which provides the best overall performance for additionally using intron position conservation and in [11]. We find that GeMoMa performs better than BRAKER1 for the categories gene and transcript with RNA-seq data. For this reason, we compare the perfor- mance of GeMoMa with and without RNA-seq evidence the exception of gene and transcript sensitivity for (Table 1). We find that sensitivity and specificity in C. elegans. Interestingly, we find the biggest improvements almost all cases increases by up to 13.9 with only two for D. melanogaster where gene/transcript sensitivity and exceptions for transcript specificity of A. thaliana and specificity increases between 18.2 and 27.7. For the exon D. melanogaster which decreases by at most 0.4. Hence, category, we find a less clear picture. In total, we observe we summarize that RNA-seq evidence improves the sen- the worst results for C. elegans where the sensitivity for sitivity and specificity of GeMoMa and should be used if all three categories decreases between 3.2 and 13.2, while available. the specificity increases only between 2.2 and 8.6. Notably, Third, we compare the performance of GeMoMa using we generally find the worst gene/transcript sensitivity and RNA-seq evidence to that of purely RNA-seq-based specificity for C. elegans compared with the other target pipelines, namely Cufflinks and StringTie (Table 1). We species considering the best performance of all tools. find for all four species that GeMoMa using RNA-seq In summary, we find that the gene predictors MAKER2, evidence outperforms purely RNA-seq-based pipelines. BRAKER1, CodingQuarry and GeMoMa, and the tran- Interestingly, purely RNA-seq-based pipelines also yield script assemblers Cufflinks and StringTie often perform the worst gene/transcript sensitivity and specificity for quite well on exon level. The main difference becomes C. elegans. Comparing the results based on different evident on transcript and gene level, where exons need transcript assemblers, we find that the results based on to be combined correctly (Table 1) as reported earlier StringTie are better than those based on Cufflinks for [29, 30]. Homology-based gene predictors might benefit A. thaliana and C. elegans, while the opposite is true from experimentally validated and manually curated ref- for S. pombe.For D. melanogaster, both pipelines per- erence transcripts guiding the prediction of transcripts in form comparably. Additional RNA-seq reads increasing the target organism. Although GeMoMa performed well, it is not able to pre- the coverage might improve the performance of purely RNA-seq-based pipelines but could also improve the per- dict genes that do not show any homology to a protein formance of GeMoMa. in the reference species, while ab-initio gene predictors Summarizing these three observations, we find that mightfailinother cases. Asbothtypesofapproaches have GeMoMa performs better than purely homology-based their specific advantages, users will probably use combi- or purely RNA-seq-based pipelines and that including nations of different gene predictors in practice to obtain a RNA-seq data improves the performance of GeMoMa. comprehensive gene annotation. Gene sensitivity Transcript sensitivity Exon sensitivity Gene specificity Transcript specificity Exon specificity A. thaliana C. elegans D. melanogaster S. pombe Fig. 2 Benchmark results. The y-axis depicts the difference between the GeMoMa with RNA-seq data and the BRAKER1 performance Difference to result of BRAKER1 −10 0 10 20 30 Keilwagen et al. BMC Bioinformatics (2018) 19:189 Page 7 of 12 In addition, we performed a small runtime study for the the inclusion of all available evidence and that per- two main time-consuming steps of the pipeline to demon- formance is decreased if some important evidence is strate that GeMoMa is reasonably fast (Additional file 1: missed [9]. Table S7). Furthermore, we compare GeMoMa using RNA- seq evidence with MAKER2 using RNA-seq evidence, Combined gene prediction pipelines homology-based and ab-initio gene prediction. In some cases, it is hard to compare these results as sensitivity Combined gene prediction pipelines, as for instance MAKER2, use RNA-seq evidence, homology-based and of one tool is higher than the sensitivity of the other ab-initio methods for predicting final gene models. tool and the opposite is true for specificity. In machine MAKER2 uses exonerate by default for homology-based learning, recall, also known as sensitivity, and precision, gene prediction. However, MAKER2 also provides the which is called specificity in the context of gene prediction possibility to use other homology-based gene predictors evaluation [31], are combined into a single scalar value instead of exonerate (cf. parameter protein_gff). For this called F1 measure [32] that can be compared more easily. reason, we compare the performance of MAKER2 using We combined sensitivity and specificity resulting in an either exonerate or GeMoMa for homology based gene F1 measure for each evaluation level gene, transcript prediction (Table 1). In addition, we use Augustus as ab- and exon (Additional file 1 – Table S4) We find that in initio gene prediction program and Trinity transcripts many cases GeMoMa using RNA-seq evidence outper- in MAKER2. We find that MAKER2 using GeMoMa forms MAKER2. The reason for this observation might performs better than MAKER2 using exonerate for all be that RNA-seq data and homology based gene predic- species and all measure. The improvement varies between tion is used in MAKER2 to train ab-initio gene predictors, 0.3% and 6.8% with clearly the biggest improvement for in this case Augustus. With the recommended parameter C. elegans. setting, homology-based gene predictions are not directly In addition, we find that the MAKER2 performance is used for the final prediction and doing so might further substantially improved compared to the performance of improve performance. the the previously reported MAKER2 predictions, either purely based on proteins (cf. Table 1, column MAKER2 Influence of reference species (exonerate)) or as reported in [11] (cf. Maker2 ). These Utilizing different fly species from FlyBase [33], we scru- other predictions do not utilize all available sources of tinize the influence of different or multiple reference information as they either ignore RNA-seq data and species on the performance of GeMoMa using RNA-seq ab-initio gene prediction or homology to proteins of data (Additional file 1:TableS8).In Fig. 3,wedepictgene related species. Based on this observation, we agree sensitivity and gene specificity for eight different reference that combined gene prediction pipelines benefit from species indicated by points. We find that performance Fig. 3 Gene sensitivity and specificity for D. melanogaster using different or multiple reference species in GeMoMa. The points correspond to the eight reference species. In addition, the dashed line indicates the usage of multiple reference species. Using multiple reference species allows for filtering identical predictions from several reference as indicated by the numbers Keilwagen et al. BMC Bioinformatics (2018) 19:189 Page 8 of 12 varies with the reference species. In this specific case, supported by at least two of the eight reference species. D. sechellia and D. persimilis yield the worst results for sin- In addition, we find 9 single-coding-exon predictions that gle reference-based predictions. This observation might do not overlap with any annotated transcript, have a tpc be related to the fact that genome assembly of D. sechellia of 1 and are supported by at least two of the five reference and D. persimilis is of lower quality [34], while the genome species. In summary, those genes supported by multiple reference organisms or additional RNA-seq data might be of D. simulans has been updated [35] later. Besides these two outliers, the performance of the different fly species promising candidates for extending the existing genome as reference species for D. melanogaster in GeMoMa cor- annotation of D. melanogaster. relates with their evolutionary distance [36]. Generally speaking, the closer a reference species is related to the Analysis of nematode species target species D. melanogaster, the better is the perfor- The relatively poor results for C. elegans in the bench- mance in terms of gene sensitivity and specificity. Hence, mark study, might be due to insufficiencies in the current we speculate that two requirements must be met to have C. briggsae annotation. Hence, we decided to scruti- a good reference species. First, the evolutionary distance nize the Wormbase annotation of four nematode species between reference and target species should be small comprising C. brenneri, C. briggsae, C. japonica,and and second, the genome assembly and annotation of the C. remanei based on the model organism C. elegans. reference species should be comprehensive and of high We compare GeMoMa predictions with manually curated quality. CDS from Wormbase. Based on RNA-seq evidence, we The new GAF module of GeMoMa allows for com- collect multi-coding-exon predictions of GeMoMa with bining the predictions based on different reference tie=1 and compare these to the annotation as depicted in organisms. The combined predictions may be filtered by Fig. 4. number of reference species with perfect support (#evi- In summary, we find between 6 749 differences for dence), as indicated by the dashed line. We find that C. briggsae and 12 903 for C. brenneri (cf. Fig. 4). The combining multiple reference organisms improves predic- most interesting category are new multi-coding-exon pre- tion performance and stability. Depending on the number dictions, which vary between 53 for C. briggsae and 1 974 of supporting reference organisms required, gene speci- for C. brenneri. The largest category are GeMoMa pre- ficity and gene sensitivity may be balanced according to dictions that missed exons compared to annotated CDSs, the needs of a specific application. We observe that (i) which vary between 2 340 for C. japonica and 4 220 for gene sensitivity increases but specificity decreases when C. remanei. requiring support from at least one reference organ- We additionally filter the transcripts showing differ- ism, whereas (ii) gene specificity increases but sensitivity ences to obtain a smaller, more conservative set of high- decreases severely filtering for perfect support from all confidence predictions. First, we filter new multi-coding eight reference species. In summary, the inclusion of mul- exon GeMoMa predictions for tpc=1 obtaining between tiple reference species may yield an improved prediction 39 and 996 for C. briggsae and C. brenneri,respectively. performance for GeMoMa using the GAF module, where Second, we filter GeMoMa predictions that have differ- we suggest to filter predictions for support by at least two ent splice sites compared to highly overlapping annotated but not necessarily all reference species. transcripts, contain new exons, have missing exons, or Furthermore, we check whether GeMoMa allows for have new and missing exons for tie<1 of the overlapping identifying new transcripts in D. melanogaster that do annotation. We obtain between 100 and 1 079 predic- not overlap with any annotated transcript but are sup- tions with different splice-site, between 42 and 786 predic- ported by RNA-seq data. First, we check whether we tions containing new exons, between 548 and 1 431 could identify transcripts based on the GeMoMa predic- predictions with missing exons, and between 284 and tions using D. simulans as reference organism. We find 1 191 predictions with new and missing exons. Finally, 35 multi-coding-exon predictions that do not overlap with for GeMoMa predictions that differ in the start codon any annotated transcript but have a tie of 1, i.e., all introns compared to the annotation, we filter for tpc=1ofthe are supported by split reads in the RNA-seq data (see GeMoMa prediction and tpc<1 for the annotation obtain- “Methods”). In addition, we find 15 single-coding-exon ing between 14 and 149 for C. brenneri and C. remanei, predictions that do not overlap with any annotated tran- respectively. In summary, we obtain between 1 065 pre- script but have a tpc of 1, i.e., that are fully covered dictions differing from the annotation for C. briggsae and by mapped RNA-seq reads. Second, we check whether 4 735 predictions for C. brenneri, respectively (cf. Fig. 4) we could identify transcripts that are supported by at using these strict criteria. Despite the overall reduction of least two of the eight reference species (cf. above). We transcripts considered, GeMoMa predictions that missed find 14 multi-coding-exon predictions that do not over- exons compared to annotated CDSs are the largest cate- lap with any annotated transcript, obtain a tie of 1 and are gory for all four nematode species. Keilwagen et al. BMC Bioinformatics (2018) 19:189 Page 9 of 12 Fig. 4 Summary of difference for GeMoMa predictions with tie=1. The relaxed evaluation (left panel) depicts differences between GeMoMa predictions and annotation without any filter on the annotation, while the conservative evaluation (right panel) applies additional filters for the annotation (cf. main text). Predictions that do not overlap with any annotated CDS are depicted in yellow, predictions that differ from annotated CDSs only in splice sites are depicted in green, predictions that have additional exons compared to annotated CDSs are depicted in turquoise, predictions that missed some exons compared to annotated CDSs are depicted in blue, predictions with additional and missing exons compared to annotated CDSs are depicted in pink, predictions that only differ in the start of the CDS compared to annotated CDS are depicted in red, and any other category is depicted in gray For both evaluations, we find that the predictions for in finding isoforms missed by traditional prediction C. briggsae are in better accordance with the annotation methods. than the predictions of the remaining three nematode species. One possible explanation might be that the anno- Analysis of barley tation of C. briggsae has recently been updated using Complementary to the studies in animals in the last sub- RNA-seq data (Gary Williams, personal communication), section, we used GeMoMa to predict the annotation of while the annotation of C. japonica is based on Augustus protein-coding genes in barley (Hordeum vulgare). Based (Erich Schwartz, personal communication) and the anno- on the benchmark results for D. melanogaster,weused tation of the other two nematodes are NGASP sets from several reference organisms to predict the gene annota- multiple ab-initio gene prediction programs [37]. For tion using GeMoMa and GAF and finally obtain 75 484 C. japonica, we find the second best results, although transcript predictions. Most of the predictions showed C. japonica is phylogenetically more distantly related to a good overlap with the annotation (F ≥ 0.8). Never- C. elegans than the remaining two nematodes [38]. This is theless, 27 204 out of these 75 484 predictions had little additional evidence that the annotation pipeline employed (F <0.8) or no overlap with high or low confidence has a decisive influence on the quality and completeness gene annotations. However, thousands of the transcripts of the annotation. contained in the official annotation do not have start or In addition, we checked for C. brenneri whether the stop codons [14], which renders an exact comparison of GeMoMa predictions partially overlap with cDNAs or predictions with perfect or at least very good overlap ESTs mapped to the C. brenneri genome. In 472 cases, unreasonable. the prediction overlaps with a cDNA or EST, but not Hence, we focus on 19 619 predictions with no overlap with the annotation. In 364 out of these 472 cases, the with any annotated transcript (Table 2). Scrutinizing these prediction has tie=1. To evaluate the predictions, we man- predictions, we find 1 729 single-coding-exon predictions ually checked about 9% (43) of the predicted missing that are completely covered by RNA-seq reads (tpc=1) but genes with tie=1. Based on RNA-seq data, protein homol- that are not contained in the annotation. Out of these, 367 ogy, cDNA/ESTs and manual curation, 95% were genuine are partially supported by best BLAT matches of ESTs to new isoforms which have been missed in the original the genome. In addition, we analyzed multi-coding-exon C. brenneri gene set. This shows that GeMoMa is valuable predictions and find 2 821 predictions that obtain tie=1, Keilwagen et al. BMC Bioinformatics (2018) 19:189 Page 10 of 12 Table 2 Predictions that do not overlap with any high or low We also find that the performance depends on the evo- confidence annotation lutionary distance betwen reference and target organism. a) Single-coding-exon predictions However, prediction performance also depends on sev- eral further aspects including i) the quality of the target #evidence tpc =00 < tpc < 1 tpc = 1 genome (assembly), ii) the number of reference organisms 1 1 971 (11) 878 (14) 1 005 (137) available and iii) especially the quality of the reference 2 204 (19) 158 (8) 299 (55) annotation(s) itself. Hence, we recommend to balance 3 200 (16) 126 (5) 257 (92) between evolutionary distance and (expected) quality of 4 91 (17) 43 (9) 168 (83) the reference annotation when selecting reference species 2 466 (63) 1 205 (36) 1 729 (367) for GeMoMa. The integration of RNA-seq data into GeMoMa might b) Multi-coding-exon predictions help to overcome wrongly annotated splice sites in the #evidence tie =00 < tie <1tie = 1 reference species in some cases. However, missing or 1 9 671 (287) 942 (211) 1 681 (775) wrongly additional annotated exons in the reference anno- 2 283 (36) 86 (32) 456 (196) tation might still lead to partially wrong gene model 3 155 (31) 64 (43) 382 (223) predictions in the target species. The benefit of RNA-seq 4 142 (57) 55 (37) 302 (196) data, however, also depends on the quality and amount of sequenced reads, on the diversity (tissues, conditions) 10 251 (411) 1 147 (323) 2 821 (1 390) of the sequenced samples, and on the library type, where The numbers in parenthesis depict those predictions that are partially supported by stranded libraries should be more informative than non- any best BLAT hit of ESTs stranded ones. In addition, GeMoMa uses RNA-seq data currently only to refine homologous genes models and stating that each predicted intron is supported by at least not to identify transcribed gene models that do not show one split read from mapped RNA-seq data. Out of these, any homology. Hence, GeMoMa should be used in com- 1 390 are partially supported by best BLAT matches of bination with other gene predictors allowing for purely ESTs to the genome. RNA-seq-based or ab-initio gene predcitions. Exemplar- Besides predictions that are well supported by RNA-seq ily, we demonstrate that GeMoMa helps to improve the performance of combined gene predictor pipelines as for data, we also observe thousands of predictions that are not instance MAKER2. (tpc = 0ortie = 0) or only partially (0 < tpc < 1or0 < Notably, model organisms have been used as target tie < 1) supported by RNA-seq. Despite no or only partial organisms in this benchmark, whereas they would typi- RNA-seq support, we find that 833 are partially supported cally be used as reference organisms in real applications. by best BLAT matches of ESTs to the genome. Hence, the performance of homology-based gene pre- Alternatively, we can utilize the number of reference diction programs might be underestimated. In summary, organisms that support a prediction (#evidence) to fil- we recommend to use homology-based gene prediction ter the predictions as noted for D. melanogaster.This approach will decrease sensitivity, but increase specificity using RNA-seq data as implemented in GeMoMa when- obtaining predictions with a high confidence. Although, ever high-quality gene annotations of related species are we find the most predictions with #evidence = 1, we also available. find about 3 500 predictions with #evidence > 1, more Interestingly, we find that GeMoMa works especially than 1 100 of these predictions are additionally supported well for D. melanogaster in the benchmark study com- by RNA-seq data or ESTs. pared to the performance of its competitors. One possible reason could be that Flybase used homology and RNA-seq Conclusions data besides other evidence to infer the gene annotation Summarizing the methods and results, we present an [19]. In contrast, we find the worst results in C. elegans in extension of GeMoMa that allows for the incorporation the benchmark study, which might be related to the fact of RNA-seq data into homology-based gene prediction that the C. elegans gene set contains many rare isoform utilizing intron position conservation. Comparing the community submissions whereas C. briggsae was anno- performance of GeMoMa with and without RNA-seq evi- tated by a large scale gene predictions effort based on dence, we demonstrate for all four organism included in RNA-seq. Scrutinizing the annotation in Wormbase, we pre- the benchmark that RNA-seq evidence improves the per- dicted protein-coding transcripts for four nematode formance of GeMoMa. GeMoMa performs equally well species based on the annotation of the model organism or better than BRAKER1, MAKER2, CodingQuarry, and C. elegans. We find that a substantial part of the GeMoMa purely RNA-seq-based pipelines on the benchmark data predictions is either missing, marked as modification sets including plants, animals and fungi. Keilwagen et al. BMC Bioinformatics (2018) 19:189 Page 11 of 12 of annotated transcripts or alternative transcripts. Espe- Ethics approval and consent to participate Not applicable. cially for the three nematodes, C. brenneri, C. japonica and C. remanei, that are annotated solely using ab-initio Competing interests gene prediction, we find a large part of the annotation The authors declare that they have no competing interests. that is marked as questionable or missing. This may give Publisher’s Note an indication, why homology-based gene prediction for Springer Nature remains neutral with regard to jurisdictional claims in C. elegans shows less good performance in the benchmark published maps and institutional affiliations. study. The GeMoMa predictions of the four nematodes Author details will be included in Wormbase in the upcoming releases. 1 Institute for Biosafety in Plant Biotechnology, Julius Kühn-Institut (JKI) - Furthermore, GeMoMa will be included in the WormBase Federal Research Centre for Cultivated Plants, D-06484, Quedlinburg, Germany. European Molecular Biology Laboratory, European Bioinformatics gene curation process and trialled for strain annotation. Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SD, Cambridge, Furthermore, we predicted protein-coding transcripts 3 UK. Plant Genome and Systems Biology, Helmholtz Center Munich - German for barley using four reference species and find several Research Center for Environmental Health, D-85764, Neuherberg, Germany. Institute of Computer Science, Martin Luther University Halle–Wittenberg, hundreds of predictions that are not included in the refer- D-06120, Halle (Saale), Germany. ence annotation but are supported by RNA-seq data, ESTs Received: 20 November 2017 Accepted: 14 May 2018 or multiple reference species. Hence, we conclude that these are valuable predictions harboring additional barley genes. These predictions will be incorporated in the new References barley annotation. 1. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, GeMoMa provides a user-friendly documentation and Salzberg SL, Wold BJ, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching can be use as command line tool or through the Galaxy during cell differentiation,. Nat Biotechnol. 2010;28(5):511–5. https://doi. workflow management system [39] providing its own org/10.1038/nbt.1621. Galaxy integration (Fig. 1). GeMoMa is freely avail- 2. Pertea M, Pertea GM, Antonescu CM, Chang T-C, Mendell JT, Salzberg SL. StringTie enables improved reconstruction of a transcriptome from able under GNU GPL3 at http://www.jstacs.de/index.php/ RNA-seq reads. Nat Biotech. 2015;33(3):290–5. https://doi.org/10.1038/ GeMoMa. nbt.3122. 3. Solovyev V, Kosarev P, Seledsov I, Vorobyev D. Automatic annotation of Additional files eukaryotic genes, pseudogenes and promoters. Genome Biol. 2006;7(1):10. https://doi.org/10.1186/gb-2006-7-s1-s10. 4. Stanke M, Diekhans M, Baertsch R, Haussler D. Using native and Additional file 1: Supplementary Tables and Figures. Table S1:Data syntenically mapped cDNA alignments to improve de novo gene finding. used for the BRAKER1 benchmark. Table S2: Data for Wormbase study. Bioinformatics. 2008;24(5):637. https://doi.org/10.1093/bioinformatics/ Table S3: Data for barley annotation. Table S4: F1-measure on the btn013. BRAKER1 test sets. Table S5: CodingQuarry and GeMoMa results for 5. Slater G, Birney E. Automated generation of heuristics for biological S. pombe using the original read mappings from [10, 11]. Table S6: sequence comparison. BMC Bioinformatics. 2005;6(1):31. https://doi.org/ CodingQuarry and GeMoMa results for S. cerevisiae. Table S7: GeMoMa 10.1186/1471-2105-6-31. runtime. Table S8: GeMoMa performance for D.melanogaster.(PDF147 kb) 6. She R, Chu JS-C, Uyar B, Wang J, Wang K, Chen N. genBlastG: using BLAST searches to build homologous gene models. Bioinformatics. 2011;27(15):2141–3. https://doi.org/10.1093/bioinformatics/btr342. Abbreviations http://bioinformatics.oxfordjournals.org/content/27/15/2141.full.pdf+ avgCov: Average coverage; cDNA: Complementary DNA; CDS: Coding html. sequence; ERE: Extract RNA-seq evidence; EST: Expressed sequence tag; GAF: 7. Keilwagen J, Wenk M, Erickson JL, Schattat MH, Grau J, Hartung F. Using GeMoMa annotation filter; GeMoMa: Gene Model Mapper; iAA: Identical intron position conservation for homology-based gene prediction. amino acids; tae: Transcript acceptor evidence; tde: Transcript donor evidence; Nucleic Acids Res. 2016;44(9):89. https://doi.org/10.1093/nar/gkw092. tie: Transcript intron evidence; tpc: Transcript percentage coverage; minCov: 8. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for Minimal coverage; minSplitReads: Minimum number of split reads; pAA: transcriptomics. Nat Rev Genet. 2009;10(1):57–63. https://doi.org/10. Positive-scoring amino acids 1038/nrg2484. 9. Holt C, Yandell M. MAKER2: an annotation pipeline and Acknowledgements genome-database management tool for second-generation genome We thank Katharina Hoff for providing the BRAKER1 benchmark data sets, projects. BMC Bioinformatics. 2011;12(1):491. https://doi.org/10.1186/ Carson Holt for assisting the MAKER2 comparison, Gil dos Santos for his 1471-2105-12-491. comments on the quality of the Drosophila genome assemblies, Erich Schwartz 10. Testa AC, Hane JK, Ellwood SR, Oliver RP. CodingQuarry: highly accurate for his comments on C. japonica, Alison Testa for her help with CodingQuarry, hidden Markov model gene prediction in fungal genomes using RNA-seq Gary Williams for his comments on C. briggsae, and Thomas Berner for transcripts. BMC Genomics. 2015;16(1):170. https://doi.org/10.1186/ technical assistance. We thank the open access publication fund of Martin s12864-015-1344-4. Luther University Halle–Wittenberg for funding article-processing charges. 11. Hoff KJ, Lange S, Lomsadze A, Borodovsky M, Stanke M. BRAKER1: Unsupervised RNA-Seq-Based Genome Annotation with GeneMark-ET Availability of data and materials and AUGUSTUS. Bioinformatics. 2016;32(5):767. https://doi.org/10.1093/ Not applicable. Accession numbers of publicly available data used in this study bioinformatics/btv661. are listed in Additional file 1. 12. Lomsadze A, Burns PD, Borodovsky M. Integration of mapped rna-seq reads into automatic training of eukaryotic gene finding algorithm. Authors’ contributions Nucleic Acids Res. 2014;42(15):119. https://doi.org/10.1093/nar/gku557. JK and JG implemented the software, designed and performed benchmark 13. Howe KL, Bolt BJ, Cain S, Chan J, Chen WJ, Davis P, Done J, Down T, studies. JK, FH, MP, SOT and JG contributed to data analysis and interpretation, Gao S, Grove C, Harris TW, Kishore R, Lee R, Lomax J, Li Y, Muller H-M, and to writing the manuscript. All authors read and approved the final Nakamura C, Nuin P, Paulini M, Raciti D, Schindelman G, Stanley E, manuscript. Keilwagen et al. BMC Bioinformatics (2018) 19:189 Page 12 of 12 Tuli MA, Van Auken K, Wang D, Wang X, Williams G, Wright A, Yook K, 25. Bennetzen JL, Schmutz J, Wang H, Percifield R, Hawkins J, Pontaroli AC, Berriman M, Kersey P, Schedl T, Stein L, Sternberg PW. Wormbase 2016: Estep M, Feng L, Vaughn JN, Grimwood J, Jenkins J, Barry K, Lindquist E, expanding to enable helminth genomic research. Nucleic Acids Res. HellstenU,DeshpandeS,WangX,WuX,MitrosT,TriplettJ,YangX, 2016;44(D1):774. https://doi.org/10.1093/nar/gkv1217. Ye C-Y, Mauro-Herrera M, Wang L, Li P, Sharma M, Sharma R, Ronald 14. Mascher M, Gundlach H, Himmelbach A, Beier S, Twardziok SO, Wicker T, PC, Panaud O, Kellogg EA, Brutnell TP, Doust AN, Tuskan GA, Rokhsar Radchuk V, Dockter C, Hedley PE, Russell J, Bayer M, Ramsay L, Liu H, D, Devos KM. Reference genome sequence of the model plant Setaria. Haberer G, Zhang X-Q, Zhang Q, Barrero RA, Li L, Taudien S, Groth M, Nat Biotechnol. 2012;30(6):555–61. https://doi.org/10.1038/nbt.2196. Felder M, Hastie A, Šimková H, Stanková ˇ H, Vrána J, Chan S, 26. Pertea M, Kim D, Pertea GM, Leek JT, Salzberg SL. Transcript-level MuñozAmatriaín M, Ounit R, Wanamaker S, Bolser D, Colmsee C, expression analysis of RNA-seq experiments with HISAT, StringTie and Schmutzer T, Aliyeva-Schnorr L, Grasso S, Tanskanen J, Chailyan A, Ballgown. Nat Protocols. 2016;11(9):1650–67. https://doi.org/10.1038/ Sampath D, Heavens D, Clissold L, Cao S, Chapman B, Dai F, Han Y, Li H, nprot.2016.095. Li X, Lin C, McCooke JK, Tan C, Wang P, Wang S, Yin S, Zhou G, Poland JA, 27. Kent WJ. BLAT–the BLAST-like alignment tool. Genome Res. 2002;12(4): Bellgard MI, Borisjuk L, Houben A, Doležel J, Ayling S, Lonardi S, Kersey P, 656–64. https://doi.org/10.1101/gr.229202. Article published online Langridge P, Muehlbauer GJ, Clark MD, Caccamo M, Schulman AH, before March 2002. Mayer KFX, Platzer M, Close TJ, Scholz U, Hansson M, Zhang G, 28. Keibler E, Brent MR. Eval: A software package for analysis of genome Braumann I, Spannagl M, Li C, Waugh R, Stein N. A chromosome annotations. BMC Bioinformatics. 2003;4(1):50. https://doi.org/10.1186/ conformation capture ordered sequence of the barley genome. Nature. 1471-2105-4-50. 2017;544(7651):427–33. https://doi.org/10.1038/nature22043. 29. Steijger T, Abril JF, Engström PG, Kokocinski F, Hubbard TJ, Guigó R, 15. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local Harrow J, Bertone P, Consortium R, et al. Assessment of transcript alignment search tool. J Mol Biol. 1990;215(3):403–10. https://doi.org/10. reconstruction methods for RNA-seq. Nat Methods. 2013;10(12):1177–84. 1016/S0022-2836(05)80360-2. 30. Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson A, Szczesniak ´ MW, Gaffney DJ, Elo LL, Zhang X, Mortazavi A. 16. Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, A survey of best practices for RNA-seq data analysis. Genome Biol. Couger MB, Eccles D, Li B, Lieber M, MacManes MD, Ott M, Orvis J, 2016;17(1):13. https://doi.org/10.1186/s13059-016-0881-8. Pochet N, Strozzi F, Weeks N, Westerman R, William T, Dewey CN, 31. Burset M, Guigó R. Evaluation of gene structure prediction programs. Henschel R, LeDuc RD, Friedman N, Regev A. De novo transcript Genomics. 1996;34(3):353–67. https://doi.org/10.1006/geno.1996.0298. sequence reconstruction from RNA-seq using the Trinity platform for 32. Powers DMW. Evaluation: From precision, recall and F-measure to ROC, reference generation and analysis. Nat Protocols. 2013;8(8):1494–512. informedness, markedness & correlation. J Mach Learn Technol. https://doi.org/10.1038/nprot.2013.084. 2011;2(1):37–63. 17. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: 33. Gramates LS, Marygold SJ, Santos Gd, Urbano J-M, Antonazzo G, accurate alignment of transcriptomes in the presence of insertions, Matthews BB, Rey AJ, Tabone CJ, Crosby MA, Emmert DB, Falls K, deletions and gene fusions. Genome Biol. 2013;14(4):R36. https://doi.org/ Goodman JL, Hu Y, Ponting L, Schroeder AJ, Strelets VB, Thurmond J, 10.1186/gb-2013-14-4-r36. Zhou P. FlyBase at 25: looking to the future. Nucleic Acids Res. 18. Rawat V, Abdelsamad A, Pietzenuk B, Seymour DK, Koenig D, Weigel D, 2017;45(D1):663–71. https://doi.org/10.1093/nar/gkw1016. Pecinka A, Schneeberger K. Improving the annotation of arabidopsis 34. Clark AG, Eisen MB, Smith DR, Bergman CM, Oliver B, Markow TA, lyrata using rna-seq data. PLOS ONE. 2015;10(9):1–12. https://doi.org/10. Kaufman TC, Kellis M, Gelbart W, Iyer VN, et al. Evolution of genes and 1371/journal.pone.0137391. genomes on the Drosophila phylogeny. Nature. 2007;450(7167):203–18. 19. Matthews BB, dos Santos G, Crosby MA, Emmert DB, St Pierre SE, 35. Hu TT, Eisen MB, Thornton KR, Andolfatto P. A second-generation Gramates LS, Zhou P, Schroeder AJ, Falls K, Strelets V, Russo SM, assembly of the Drosophila simulans genome provides new insights into Gelbart WM, The FlyBase Consortium. Gene model annotations for patterns of lineage-specific divergence. Genome Res. 2013;23(1):89–98. drosophila melanogaster: Impact of high-throughput data. G3: Genes https://doi.org/10.1101/gr.141689.112. http://genome.cshlp.org/content/ Genomes Genet. 2015;5(8):1721–36. https://doi.org/10.1534/g3.115. 23/1/89.full.pdf+html. 018929. http://www.g3journal.org/content/5/8/1721.full.pdf. 36. Singh ND, Larracuente AM, Sackton TB, Clark AG. Comparative genomics 20. Rhind N, Chen Z, Yassour M, Thompson DA, Haas BJ, Habib N, Wapinski I, on the drosophila phylogenetic tree. Annu Rev Ecol Evol Syst. 2009;40(1): Roy S, Lin MF, Heiman DI, Young SK, Furuya K, Guo Y, Pidoux A, Chen HM, 459–80. https://doi.org/10.1146/annurev.ecolsys.110308.120214. Robbertse B, Goldberg JM, Aoki K, Bayne EH, Berlin AM, Desjardins CA, 37. Coghlan A, Fiedler TJ, McKay SJ, Flicek P, Harris TW, Blasiar D, nGASP Dobbs E, Dukaj L, Fan L, FitzGerald MG, French C, Gujja S, Hansen K, Consortium, Stein LD. ngasp–the nematode genome annotation Keifenheim D, Levin JZ, Mosher RA, Müller CA, Pfiffner J, Priest M, Russ C, assessment project. BMC Bioinformatics. 2008;9:549. https://doi.org/10. Smialowska A, Swoboda P, Sykes SM, Vaughn M, Vengrova S, Yoder R, 1186/1471-2105-9-549. Zeng Q, Allshire R, Baulcombe D, Birren BW, Brown W, Ekwall K, Kellis M, 38. Kiontke KC, Félix M-A, Ailion M, Rockman MV, Braendle C, Pénigault J-B, Leatherwood J, Levin H, Margalit H, Martienssen R, Nieduszynski CA, Fitch DH. A phylogeny and molecular barcodes for caenorhabditis, with Spatafora JW, Friedman N, Dalgaard JZ, Baumann P, Niki H, Regev A, numerous new species from rotting fruits. BMC Evol Biol. 2011;11(1):339. Nusbaum C. Comparative functional genomics of the fission yeasts. https://doi.org/10.1186/1471-2148-11-339. Science. 2011;332(6032):930–6. https://doi.org/10.1126/science.1203357. 39. Afgan E, Baker D, van den Beek M, Blankenberg D, Bouvier D, Cech M, http://science.sciencemag.org/content/332/6032/930.full.pdf. Chilton J, Clements D, Coraor N, Eberhard C, Grüning B, Guerler A, 21. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Hillman-Jackson J, Von Kuster G, Rasche E, Soranzo N, Turaga N, Taylor J, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Nekrutenko A, Goecks J. The galaxy platform for accessible, reproducible Bioinformatics. 2013;29(1):15. https://doi.org/10.1093/bioinformatics/ and collaborative biomedical analyses: 2016 update. Nucleic Acids Res. bts635. 2016;44(W1):3. https://doi.org/10.1093/nar/gkw343. 22. Lamesch P, Berardini TZ, Li D, Swarbreck D, Wilks C, Sasidharan R, Muller R, Dreher K, Alexander DL, Garcia-Hernandez M, Karthikeyan AS, Lee CH, Nelson WD, Ploetz L, Singh S, Wensel A, Huala E. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res. 2012;40(D1):1202. https://doi.org/10.1093/nar/gkr1090. 23. International Brachypodium Initiative. Genome sequencing and analysis of the model grass Brachypodium distachyon,. Nature. 2010;463(5):763–8. https://doi.org/10.1038/nature08747. 24. Ouyang S, Zhu W, Hamilton J, Lin H, Campbell M, Childs K, Thibaud-Nissen F, Malek R. L, Lee Y, Zheng L, Orvis J, Haas B, Wortman J, Buell CR. The tigr rice genome annotation resource: improvements and new features. Nucleic Acids Res. 2007;35(suppl_1):883. https://doi.org/10. 1093/nar/gkl976.

Journal

BMC BioinformaticsSpringer Journals

Published: May 30, 2018

References

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off