TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders

W. H. Majoros; M. Pertea; S. L. Salzberg

doi:10.1093/bioinformatics/bth315

2004-05-14

BIOINFORMATICS APPLICATIONS NOTE
Vol. 20 no. 16 2004, pages 2878–2879
doi:10.1093/bioinformatics/bth315

W. H. Majoros*, M. Pertea and S. L. Salzberg
Bioinformatics Department, The Institute for Genomic Research, Rockville, MD 20850, USA

Received on April 1, 2004; revised on April 28, 2004; accepted on May 3, 2004
Advance Access publication May 14, 2004

*To whom correspondence should be addressed.

Bioinformatics vol. 20 issue 16 © Oxford University Press 2004; all rights reserved.

ABSTRACT

Summary: We describe two new Generalized Hidden Markov Model implementations for ab initio eukaryotic gene prediction. The C/C++ source code for both is available as open source and is highly reusable due to their modular and extensible architectures. Unlike most of the currently available gene-finders, the programs are re-trainable by the end user. They are also re-configurable and include several types of probabilistic submodels which can be independently combined, such as Maximal Dependence Decomposition trees and interpolated Markov models. Both programs have been used at TIGR for the annotation of the Aspergillus fumigatus and Toxoplasma gondii genomes.

Availability: Source code and documentation are available under the open source Artistic License from http://www.tigr.org/software/pirate.

Contact: [email protected]

INTRODUCTION

With the increased availability of raw genomic sequence data has come an increase in the number of gene-finder programs available for predicting the protein-coding genes in these data. Unfortunately, the vast majority of these programs cannot easily be retrained by end users, because these packages rarely include retraining software, and in most cases the source code is not available, which also limits modification and reuse of these programs for functionally different annotation tasks.

We describe two new gene finders, GlimmerHMM and TigrScan, which are based on the same class of models as Genscan (Burge, 1997) and Genie (Kulp et al., 1996), namely, a Generalized Hidden Markov Model (GHMM). GHMMs offer the advantage of providing a probabilistically rigorous framework in which alternative gene-finding strategies can be readily explored. Furthermore, since our source code is available as open source, and because the programs are written in a highly modular C/C++ style, reusing portions of the programs for novel annotation tasks is made quite feasible.

METHODS

A Hidden Markov Model (HMM) is a state-based generative model which transitions stochastically from state to state, emitting a single symbol from each state according to that state's emission probabilities. A GHMM generalizes this process by emitting complete gene features, or subsequences, in each state. Because each state can be associated with a different gene feature type (e.g. donor, exon, etc.), a GHMM provides an intuitive and flexible framework for exploring alternative gene-finding approaches. For example, feature states can be independently retrained, and different types of submodels (e.g. Markov models, weight matrices, etc.) can be used at each state.

Predicting gene models with a GHMM involves finding the most probable path, $\phi$, through the GHMM topology given the sequence, $S$; i.e. maximizing $P(\phi \mid S)$. Bayes' theorem and the invariance of the marginal probability $P(S)$ with respect to individual paths $\phi$ gives:

$$\operatorname*{argmax}_{\phi} P(\phi \mid S) = \operatorname*{argmax}_{\phi} \frac{P(\phi \wedge S)}{P(S)} = \operatorname*{argmax}_{\phi} P(\phi \wedge S) = \operatorname*{argmax}_{\phi} P(S \mid \phi)\,P(\phi).$$

Because the GHMM allows explicit modeling of state duration (feature length) $d_i$ for each state $q_i$ in parse $\phi$, this can be factored into

$$\operatorname*{argmax}_{\phi} \prod_{q_i \in \phi} P(S_i \mid q_i \wedge d_i)\,P(q_i \mid q_{i-1})\,P(d_i),$$

where $P(q_i \mid q_{i-1})$ is the probability of transitioning from state $q_{i-1}$ to $q_i$, $S_i$ is the subsequence emitted by state $q_i$, and $P(d_i)$ is the probability of state $q_i$ emitting a feature of length $d_i$. These can all be estimated from training data through various well-known means (e.g. Salzberg et al., 1998). This optimization step can be efficiently evaluated using a dynamic programming approach.

Though both TigrScan and GlimmerHMM conform to the overall mathematical framework of a GHMM, they differ significantly from each other and from our previous gene finders in the details of their implementation, specifically in the statistical methods employed at the submodel level and in their overall software architecture. Whereas TigrScan utilizes several types of weight matrices and Markov chains, GlimmerHMM additionally incorporates splice site models adapted from the GeneSplicer program (Pertea et al., 2001) and a decision tree adapted from GlimmerM (Salzberg et al., 1999). Both programs utilize interpolated Markov models (Salzberg et al., 1999) as well as the Maximal Dependence Decomposition technique for improving specificity in splice site identification (Burge, 1997). Currently, TigrScan's GHMM structure includes introns of each phase, intergenic regions, 5′- and 3′-untranslated regions (5′- and 3′-UTRs), and four types of exons (initial, internal, final, and single). GlimmerHMM includes states for exons, introns and intergenic regions.

TigrScan also provides as an optional feature the construction of a graph-theoretic representation of all high-scoring open reading frames. Such graphs have been found to be useful in several ongoing research projects, including a homology-based gene finder as well as two other projects which explore unconventional approaches to genomic annotation. TigrScan can also read and score an arbitrary gene model provided in GFF format. These features allow us to dynamically explore the immense space of suboptimal gene models in ways that are simply not possible with most other gene finders.

RESULTS

Both programs performed well in tests when compared with Genscan+ (Table 1). Of the three gene finders, TigrScan was found to perform most competitively on the A.thaliana test set for three of the four reported measures, whereas GlimmerHMM was found to perform best on the A.fumigatus test set for three of the measures. The greater difference in accuracy between our gene finders and Genscan+ on the A.fumigatus set demonstrates the value of being able to retrain the gene finders for specific organisms.

Table 1. Results on a set of 800 full-length Arabidopsis thaliana cDNAs (a.t.) and on 360 curated Aspergillus fumigatus CDSs (a.f.)

              % Nucl. accuracy   % Exon sensitivity   % Exon specificity   % Exact genes
              a.t.   a.f.        a.t.   a.f.          a.t.   a.f.          a.t.   a.f.
TigrScan      96     90          77     37            81     47            43     19
GlimmerHMM    96     91          71     36            79     49            33     21
Genscan+      95     87          75     23            82     4             35     11

Exon sensitivity = TP/(TP+FN), where TP stands for true positives and FN for false negatives, with a TP indicating that both exon coordinates were correct. Exon specificity = TP/(TP+FP), where FP stands for false positives. Exact genes is the percentage of the test CDSs for which the predictions were entirely correct. Genscan+ is an A.thaliana specific version of Genscan provided to us by C. Burge.

Time and memory requirements of both programs increase linearly with the length of the input sequence, though the two programs make different trade-offs between speed and space, as can be seen from Table 2. TigrScan successfully processed a 5.6 Mb contig in 5 min 32 s using 105 Mb of RAM on a 1.6 GHz Pentium IV, illustrating that long sequences can be processed even on machines with relatively limited memory.

Table 2. Memory and time requirements on a 922 kb A.fumigatus contig

              Memory (Mb)   Time (min:s)
GlimmerHMM    84            0:17
TigrScan      29            1:28
Genscan+      445           2:57

By offering both these programs to the community as open source, we hope to facilitate more studies comparing the suitability of alternate gene-finding strategies.

ACKNOWLEDGEMENTS

We would like to thank C. Burge for kindly providing a version of Genscan+ trained for A.thaliana, and the anonymous reviewers for their constructive comments which improved this paper. This work was supported in part by NIH grant R01 LM06845 and NSF grant MCB-0114792.

REFERENCES

Burge,C. (1997) Identification of genes in human genomic DNA. PhD Thesis, Stanford University, CA.
Kulp,D., Haussler,D., Reese,M. and Eeckman,F. (1996) A generalized hidden Markov model for the recognition of human genes in DNA. Proc. Int. Conf. Intell. Syst. Mol. Biol., 4, 134–142.
Pertea,M., Lin,X. and Salzberg,S.L. (2001) GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Res., 29, 1185–1190.
Salzberg,S.L., Searls,D.B. and Kasif,S. (eds) (1998) Computational Methods in Molecular Biology. Elsevier, Amsterdam, The Netherlands.
Salzberg,S.L., Pertea,M., Delcher,A.L., Gardner,M.J. and Tettelin,H. (1999) Interpolated Markov models for eukaryotic gene finding. Genomics, 59, 24–31.
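ILLUSTRATIVE SKETCH: GHMM DECODING BY DYNAMIC PROGRAMMING

The METHODS section states that the most probable parse can be found efficiently by dynamic programming. The following Python sketch shows one way such a recurrence might look for a toy model. It is not taken from TigrScan or GlimmerHMM: the function name ghmm_decode, the two states (AT_rich, GC_rich), the uniform duration distribution, the per-base emission probabilities and the absence of self-transitions are all assumptions chosen only to keep the example small. For each end position j and state q, it keeps the best log-probability of any parse whose last feature ends at j and was emitted by q, trying every feature length d up to a maximum.

```python
import math

NEG_INF = float("-inf")


def ghmm_decode(seq, states, log_init, log_trans, log_dur, log_emit, max_len):
    """Find the most probable parse of seq under a toy GHMM.

    Returns (log_prob, parse), where parse is a list of (state, start, end)
    tuples with end exclusive. All model arguments are hypothetical
    illustrations, not the TigrScan or GlimmerHMM interfaces.
    """
    n = len(seq)
    # best[j][q]: best log-probability over parses of seq[:j] whose last
    # feature ends at position j and was emitted by state q.
    best = [{q: NEG_INF for q in states} for _ in range(n + 1)]
    back = [{q: None for q in states} for _ in range(n + 1)]

    for j in range(1, n + 1):
        for q in states:
            for d in range(1, min(max_len, j) + 1):
                i = j - d
                # log P(d_i) + log P(S_i | q_i, d_i) for the candidate feature
                local = log_dur(q, d) + log_emit(q, seq[i:j])
                if i == 0:
                    # first feature of the parse: use the initial distribution
                    candidates = [(log_init[q] + local, None)]
                else:
                    # extend the best parse ending at i in each previous state p
                    candidates = [
                        (best[i][p] + log_trans[p].get(q, NEG_INF) + local, p)
                        for p in states
                    ]
                for score, p in candidates:
                    if score > best[j][q]:
                        best[j][q] = score
                        back[j][q] = (i, p)

    last = max(states, key=lambda q: best[n][q])
    if best[n][last] == NEG_INF:
        return NEG_INF, []  # no parse with non-zero probability
    # Trace back from the end of the sequence to recover the features.
    parse, j, q = [], n, last
    while q is not None:
        i, p = back[j][q]
        parse.append((q, i, j))
        j, q = i, p
    parse.reverse()
    return best[n][last], parse


# Toy model (assumed values, for illustration only).
STATES = ["AT_rich", "GC_rich"]
LOG_INIT = {q: math.log(0.5) for q in STATES}
# No self-transitions: feature length is handled by the duration term.
LOG_TRANS = {"AT_rich": {"GC_rich": 0.0}, "GC_rich": {"AT_rich": 0.0}}
MAX_LEN = 8
FAVOURED = {"AT_rich": "AT", "GC_rich": "GC"}


def log_dur(q, d):
    # Uniform duration distribution over lengths 1..MAX_LEN.
    return math.log(1.0 / MAX_LEN)


def log_emit(q, sub):
    # Independent bases: favoured bases have probability 0.4, others 0.1.
    return sum(math.log(0.4 if b in FAVOURED[q] else 0.1) for b in sub)


score, parse = ghmm_decode(
    "AAAATTTTGGGGCCCC", STATES, LOG_INIT, LOG_TRANS, log_dur, log_emit, MAX_LEN
)
print(parse)
```

Working in log space replaces the product in the factored argmax by a sum and avoids numerical underflow on long sequences. This naive version costs time proportional to sequence length times max_len times the square of the number of states; a production gene finder would need additional optimizations to reach the linear scaling reported in RESULTS, and those are not shown here.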


Bioinformatics Oxford University Press

Publisher: Oxford University Press
ISSN: 1367-4803
eISSN: 1460-2059
DOI: 10.1093/bioinformatics/bth315
pmid: 15145805