BIOINFORMATICS APPLICATIONS NOTE
Vol. 20 no. 16 2004, pages 2878–2879
doi:10.1093/bioinformatics/bth315
© Oxford University Press 2004; all rights reserved.

TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders

W. H. Majoros*, M. Pertea and S. L. Salzberg
Bioinformatics Department, The Institute for Genomic Research, Rockville, MD 20850, USA

Received on April 1, 2004; revised on April 28, 2004; accepted on May 3, 2004
Advance Access publication May 14, 2004

*To whom correspondence should be addressed.

ABSTRACT

Summary: We describe two new Generalized Hidden Markov Model implementations for ab initio eukaryotic gene prediction. The C/C++ source code for both is available as open source and is highly reusable due to their modular and extensible architectures. Unlike most of the currently available gene-finders, the programs are re-trainable by the end user. They are also re-configurable and include several types of probabilistic submodels which can be independently combined, such as Maximal Dependence Decomposition trees and interpolated Markov models. Both programs have been used at TIGR for the annotation of the Aspergillus fumigatus and Toxoplasma gondii genomes.

Availability: Source code and documentation are available under the open source Artistic License from http://www.tigr.org/software/pirate.

Contact: [email protected]

INTRODUCTION

With the increased availability of raw genomic sequence data has come an increase in the number of gene-finder programs available for predicting the protein-coding genes in these data. Unfortunately, the vast majority of these programs cannot easily be retrained by end users, because these packages rarely include retraining software, and in most cases the source code is not available, which also limits modification and reuse of these programs for functionally different annotation tasks.

We describe two new gene finders, GlimmerHMM and TigrScan, which are based on the same class of models as Genscan (Burge, 1997) and Genie (Kulp et al., 1996), namely, a Generalized Hidden Markov Model (GHMM). GHMMs offer the advantage of providing a probabilistically rigorous framework in which alternative gene-finding strategies can be readily explored. Furthermore, since our source code is available as open source, and because the programs are written in a highly modular C/C++ style, reusing portions of the programs for novel annotation tasks is made quite feasible.

METHODS

A Hidden Markov Model (HMM) is a state-based generative model which transitions stochastically from state to state, emitting a single symbol from each state according to that state's emission probabilities. A GHMM generalizes this process by emitting complete gene features, or subsequences, in each state. Because each state can be associated with a different gene feature type (e.g. donor, exon, etc.), a GHMM provides an intuitive and flexible framework for exploring alternative gene-finding approaches. For example, feature states can be independently retrained, and different types of submodels (e.g. Markov models, weight matrices, etc.) can be used at each state.

Predicting gene models with a GHMM involves finding the most probable path, $\phi$, through the GHMM topology given the sequence, $S$; i.e. maximizing $P(\phi \mid S)$. Bayes' theorem and the invariance of the marginal probability $P(S)$ with respect to individual paths $\phi$ give:

$$\operatorname*{argmax}_{\phi} P(\phi \mid S) = \operatorname*{argmax}_{\phi} \frac{P(\phi \wedge S)}{P(S)} = \operatorname*{argmax}_{\phi} P(\phi \wedge S) = \operatorname*{argmax}_{\phi} P(S \mid \phi)\,P(\phi).$$

Because the GHMM allows explicit modeling of state duration (feature length) $d_i$ for each state $q_i$ in parse $\phi$, this can be factored into

$$\operatorname*{argmax}_{\phi} \prod_{q_i \in \phi} P(S_i \mid q_i \wedge d_i)\,P(q_i \mid q_{i-1})\,P(d_i),$$

where $P(q_i \mid q_{i-1})$ is the probability of transitioning from state $q_{i-1}$ to $q_i$, $S_i$ is the subsequence emitted by state $q_i$, and $P(d_i)$ is the probability of state $q_i$ emitting a feature of length $d_i$. These can all be estimated from training data through various well-known means (e.g. Salzberg et al., 1998). This optimization step can be efficiently evaluated using a dynamic programming approach.

Though both TigrScan and GlimmerHMM conform to the overall mathematical framework of a GHMM, they differ significantly from each other and from our previous gene finders in the details of their implementation, specifically in the statistical methods employed at the submodel level and in their overall software architecture. Whereas TigrScan utilizes several types of weight matrices and Markov chains, GlimmerHMM additionally incorporates splice site models adapted from the GeneSplicer program (Pertea et al., 2001) and a decision tree adapted from GlimmerM (Salzberg et al., 1999). Both programs utilize interpolated Markov models (Salzberg et al., 1999) as well as the Maximal Dependence Decomposition technique for improving specificity in splice site identification (Burge, 1997). Currently, TigrScan's GHMM structure includes introns of each phase, intergenic regions, 5′- and 3′-untranslated regions (5′- and 3′-UTRs), and four types of exons (initial, internal, final, and single). GlimmerHMM includes states for exons, introns and intergenic regions.

TigrScan also provides as an optional feature the construction of a graph-theoretic representation of all high-scoring open reading frames. Such graphs have been found to be useful in several ongoing research projects, including a homology-based gene finder as well as two other projects which explore unconventional approaches to genomic annotation. TigrScan can also read and score an arbitrary gene model provided in GFF format. These features allow us to dynamically explore the immense space of suboptimal gene models in ways that are simply not possible with most other gene finders.

RESULTS

Both programs performed well in tests when compared with Genscan+ (Table 1). Of the three gene finders, TigrScan was found to perform most competitively on the A.thaliana test set for three of the four reported measures, whereas GlimmerHMM was found to perform best on the A.fumigatus test set for three of the measures. The greater difference in accuracy between our gene finders and Genscan+ on the A.fumigatus set demonstrates the value of being able to retrain the gene finders for specific organisms.

Table 1. Results on a set of 800 full-length Arabidopsis thaliana cDNAs (a.t.) and on 360 curated Aspergillus fumigatus CDSs (a.f.)

              % Nucl. accuracy   % Exon sensitivity   % Exon specificity   % Exact genes
              a.t.    a.f.       a.t.    a.f.         a.t.    a.f.         a.t.    a.f.
TigrScan      96      90         77      37           81      47           43      19
GlimmerHMM    96      91         71      36           79      49           33      21
Genscan+      95      87         75      23           82      4            35      11

Exon sensitivity = TP/(TP+FN), where TP stands for true positives and FN for false negatives, with a TP indicating that both exon coordinates were correct. Exon specificity = TP/(TP+FP), where FP stands for false positives. Exact genes is the percentage of the test CDSs for which the predictions were entirely correct. Genscan+ is an A.thaliana specific version of Genscan provided to us by C. Burge.

Time and memory requirements of both programs increase linearly with the length of the input sequence, though the two programs make different trade-offs between speed and space, as can be seen from Table 2. TigrScan successfully processed a 5.6 Mb contig in 5 min 32 s using 105 Mb of RAM on a 1.6 GHz Pentium IV, illustrating that long sequences can be processed even on machines with relatively limited memory.

Table 2. Memory and time requirements on a 922 kb A.fumigatus contig

              Memory (Mb)   Time (min)
GlimmerHMM    84            0:17
TigrScan      29            1:28
Genscan+      445           2:57

By offering both these programs to the community as open source, we hope to facilitate more studies comparing the suitability of alternate gene-finding strategies.

ACKNOWLEDGEMENTS

We would like to thank C. Burge for kindly providing a version of Genscan+ trained for A.thaliana, and the anonymous reviewers for their constructive comments which improved this paper. This work was supported in part by NIH grant R01 LM06845 and NSF grant MCB-0114792.

REFERENCES

Burge,C. (1997) Identification of genes in human genomic DNA. PhD Thesis, Stanford University, CA.
Kulp,D., Haussler,D., Reese,M. and Eeckman,F. (1996) A generalized hidden Markov model for the recognition of human genes in DNA. Proc. Int. Conf. Intell. Syst. Mol. Biol., 4, 134–142.
Pertea,M., Lin,X. and Salzberg,S.L. (2001) GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Res., 29, 1185–1190.
Salzberg,S.L., Searls,D.B. and Kasif,S. (eds) (1998) Computational Methods in Molecular Biology. Elsevier, Amsterdam, The Netherlands.
Salzberg,S.L., Pertea,M., Delcher,A.L., Gardner,M.J. and Tettelin,H. (1999) Interpolated Markov models for eukaryotic gene finding. Genomics, 59, 24–31.
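
ILLUSTRATIVE SKETCH: GHMM DECODING

The METHODS section states that the most probable parse, factored into emission, transition and duration terms, can be found by dynamic programming. The Python sketch below illustrates that idea on a toy two-state model. It is not the TigrScan or GlimmerHMM implementation (both are written in C/C++), and everything in it is an assumption made for illustration: the function name ghmm_viterbi, the state labels "intergenic" and "exon", the per-base emission probabilities, the uniform duration distribution and the max_dur cap. Scores are kept in log space, so the product in the factorization becomes a sum.

```python
import math

NEG_INF = -math.inf

def ghmm_viterbi(seq, states, log_init, log_trans, log_dur, log_emit, max_dur):
    """Return (log score, parse) for the most probable segmentation of seq.

    The parse is a list of (state, start, end) segments with end exclusive.
    log_emit(q, sub) plays the role of log P(S_i | q_i, d_i),
    log_trans[p][q] of log P(q_i | q_{i-1}), and log_dur(q, d) of log P(d_i).
    """
    n = len(seq)
    # best[t][q]: best log score of a parse of seq[:t] whose last segment is in q
    best = [{q: NEG_INF for q in states} for _ in range(n + 1)]
    back = [{q: None for q in states} for _ in range(n + 1)]
    for end in range(1, n + 1):
        for q in states:
            for d in range(1, min(max_dur, end) + 1):
                start = end - d
                seg = log_emit(q, seq[start:end]) + log_dur(q, d)
                if start == 0:
                    # First segment of the parse: use the initial distribution.
                    candidates = [(log_init[q] + seg, None)]
                else:
                    # Missing transitions (including self-loops) score -inf.
                    candidates = [
                        (best[start][p] + log_trans[p].get(q, NEG_INF) + seg, p)
                        for p in states
                    ]
                for score, prev in candidates:
                    if score > best[end][q]:
                        best[end][q] = score
                        back[end][q] = (start, prev)
    last = max(states, key=lambda s: best[n][s])
    score = best[n][last]
    if score == NEG_INF:
        return score, []  # no valid parse (e.g. empty sequence)
    # Trace back from the end of the sequence to recover the segments.
    parse, q, end = [], last, n
    while q is not None:
        start, prev = back[end][q]
        parse.append((q, start, end))
        q, end = prev, start
    parse.reverse()
    return score, parse

# Toy model (hypothetical numbers): AT-rich "intergenic", GC-rich "exon".
MAX_DUR = 10
BASE_P = {
    "intergenic": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "exon": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}

def log_emit(q, sub):
    return sum(math.log(BASE_P[q][b]) for b in sub)

def log_dur(q, d):
    return math.log(1.0 / MAX_DUR)  # uniform over lengths 1..MAX_DUR

log_init = {"intergenic": math.log(0.5), "exon": math.log(0.5)}
log_trans = {"intergenic": {"exon": 0.0}, "exon": {"intergenic": 0.0}}

score, parse = ghmm_viterbi(
    "ATATATGCGCGCGCATATAT", ["intergenic", "exon"],
    log_init, log_trans, log_dur, log_emit, MAX_DUR,
)
print(parse)
```

Each cell of the table considers every allowed duration and predecessor state, so this sketch costs on the order of len(seq) × max_dur × (number of states)² segment scorings, and it rescans each candidate segment when scoring it. Production gene finders avoid that cost, for example by precomputing cumulative scores and by only ending features at plausible signals; none of those optimizations are shown here.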
Bioinformatics – Oxford University Press
Published: May 14, 2004