HMMoC—a compiler for hidden Markov models

BIOINFORMATICS APPLICATIONS NOTE
Vol. 23 no. 18 2007, pages 2485–2487
doi:10.1093/bioinformatics/btm350

Sequence analysis

Gerton Lunter
MRC Functional Genetics Unit, Department of Physiology, Anatomy and Genetics, University of Oxford, South Parks Road, Oxford OX1 3TG, UK

Received on May 30, 2007; revised on June 26, 2007; accepted on June 27, 2007
Advance Access publication July 10, 2007
Associate Editor: Alex Bateman

ABSTRACT

Summary: Hidden Markov models are widely applied within computational biology. The large data sets and complex models involved demand optimized implementations, while efficient exploration of model space requires rapid prototyping. These requirements are not met by existing solutions, and hand-coding is time-consuming and error-prone. Here, I present a compiler that takes over the mechanical process of implementing HMM algorithms, by translating high-level XML descriptions into efficient C++ implementations. The compiler is highly customizable, produces efficient and bug-free code, and includes several optimizations.
Availability: http://genserv.anat.ox.ac.uk/software
Contact: gerton.lunter@dpag.ox.ac.uk

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

1 INTRODUCTION

Hidden Markov models (HMMs) are very suitable for sequence analysis, and have found wide application in computational biology. These applications include gene finding (Burge and Karlin, 1997), motif discovery and searching (Bailey and Elkan, 1995; Baldi et al., 1994; Bateman et al., 2002; Eddy, 1998), identification of CpG islands (Durbin et al., 1998), genotyping (Scheet and Stephens, 2006), detection of recombination events (Hobolth et al., 2007), sequence alignment (Holmes, 2003) and many more.

Applications in computational biology typically involve large data sets, making high-level toolkits such as R or MatLab less suitable. Another common approach, of which GHMM (Schliep et al., 2003) is but one example, is to use a library implementation that operates on HMMs defined by some data structure. In practice, either the flexibility or the efficiency of such libraries often falls short of practical requirements, especially for pair- and higher-dimensional HMMs. For these reasons, researchers often resort to hand-coding their algorithms. While straightforward, hand-coding remains tedious and error-prone, particularly when models become more complex and optimized code is required.

Code generation could, in principle, combine flexibility, usability and efficiency. An early effort in this direction is the PROLOG-based language GenLang, which is capable of handling context-free grammars as well as HMMs (Searls, 1993). Another approach was taken by the dynamic-programming code generator Dynamite (Birney and Durbin, 1997), which produces highly efficient C code for Viterbi-like algorithms, at the cost of some flexibility. Dynamite later developed into the gene annotation toolset Exonerate, which is based on the code generator C4 (Slater and Birney, 2005). These code generators, being somewhat tied to their original application domain, lack direct support for probabilistic algorithms. Here I present an HMM compiler, HMMoC, that aims to fill this gap. HMMoC is both efficient and flexible, and provides support for all standard HMM algorithms. The input to the compiler consists of a succinct XML file that defines the topology of the HMM. The probabilities associated with transitions and emissions are defined in the same file, using arbitrarily parameterized C code fragments. From these, the compiler produces C++ header and source files that implement the required HMM algorithms.

2 OVERVIEW

2.1 Supported algorithms

The compiler supports the Forward and Backward algorithms, which compute posterior probabilities conditional on one or more input sequences. These algorithms can be configured to return the dynamic programming (DP) table if desired. Baum–Welch parameter estimation is also implemented, when required, as part of either the Forward or Backward algorithm. Parameter estimates are computed in two steps, by first computing a standard Forward DP table followed by a Baum–Welch backward iteration (or vice versa). For efficiency, either or both of the transition and emission parameters can be computed.

The Viterbi algorithm is implemented as two separate procedures, which compute the Viterbi table (and likelihood) and the most likely path, respectively. Finally, a sampling algorithm may be generated, which samples paths from the posterior distribution. Unconditional sampling from the HMM is not directly supported, but can be implemented by defining a ‘silent’ HMM producing no output, and computing a conditional sample from that.

2.2 Supported HMM architectures

HMMoC supports HMMs with any number of outputs, including pair HMMs, triple HMMs and phylogenetic HMMs. Emissions, from a user-defined alphabet, may be associated with states (‘Moore machines’) or with transitions (‘Mealy machines’), and mixtures are also possible. Higher-order Markov chains (where probabilities depend on a limited number of previously emitted symbols) are also supported, and the order of states may vary over the network. HMMoC includes support for inhomogeneous Markov chains, allowing probabilities to depend on the position within the sequence. This can be used to implement, e.g. inference algorithms for continuous-time Markov chains with variable intervals between measurements, or may be used to incorporate prior knowledge. Finally, HMMoC supports any number of silent states, and the required transition probabilities are computed by a matrix inversion step (‘wing retraction’; Eddy, 1998) whenever necessary.
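To make the Forward recursion of Section 2.1 concrete, the following is a minimal hand-written sketch for a two-state, single-output HMM in the style of the ‘dishonest casino’ of Durbin et al. (1998). It illustrates the algorithm for which HMMoC generates code; it is not HMMoC output, and the function name, transition and emission probabilities are invented for the example.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <string>

// Forward algorithm for a two-state HMM over the die alphabet '1'..'6'.
// States: 0 = fair die, 1 = loaded die.  Returns the total likelihood
// P(sequence), summed over all hidden state paths.
double forwardLikelihood(const std::string& rolls) {
    if (rolls.empty()) return 1.0;                       // empty sequence
    const double init[2] = {0.5, 0.5};                   // initial state distribution
    const double trans[2][2] = {{0.95, 0.05},            // fair->fair, fair->loaded
                                {0.10, 0.90}};           // loaded->fair, loaded->loaded
    auto emit = [](int state, char roll) {
        if (state == 0) return 1.0 / 6.0;                // fair die: uniform
        return (roll == '6') ? 0.5 : 0.1;                // loaded die favours '6'
    };

    std::array<double, 2> f{}, fNext{};
    for (int k = 0; k < 2; ++k)                          // initialization column
        f[k] = init[k] * emit(k, rolls[0]);

    for (size_t i = 1; i < rolls.size(); ++i) {          // recursion over the sequence
        for (int l = 0; l < 2; ++l) {
            double s = 0.0;
            for (int k = 0; k < 2; ++k)                  // sum over predecessor states
                s += f[k] * trans[k][l];
            fNext[l] = s * emit(l, rolls[i]);
        }
        f = fNext;
    }
    return f[0] + f[1];                                  // termination: sum over end states
}
```

In compiler-generated code the inner loops over states and transitions would be unrolled, and the plain double accumulators replaced by an extended-exponent type so that long sequences do not underflow (Section 2.3).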
Fig. 1. Execution times for performing a database search using HMM profiles of the TK and ZnF_C2HC domains (38 and 56 states, respectively), in a database of 4.84 million amino acids on a desktop computer (2.33 GHz E5345 Xeon, 4 MB cache). For this comparison, HMMER and HMMoC performed a Viterbi and a Forward recursion. The HMMoC-generated algorithm uses logspace real numbers for the Viterbi recursion, and BFloats for the Forward recursion.

2.3 Extended-exponent real-number type

Underflows are a frequent problem for HMM implementations and large data sets. Working in log space only partially solves this problem, since most algorithms use addition as well as multiplication, necessitating slow conversions to and from log space. HMMoC provides an 8-byte extended-exponent real type, termed BFloats (for ‘buoyant floats’), which combines the precision of a float with an essentially unbounded exponent. An efficient template library implementation provides fast in-line code, resulting in comparable running times for algorithms that use doubles or BFloats. In addition, a logspace type is provided, which is slightly more efficient than BFloats when used in the Viterbi algorithm.

2.4 Efficiency and optimizations

For efficiency, states can be grouped into cliques, over which the HMM induces an acyclic graph. This structure is used by HMMoC to efficiently allocate memory for the DP table. For instance, the start and end states may always form their own clique, and need not be represented in the main DP table. Some algorithm-specific optimizations are provided; for instance, when no DP table is required as output from a Forward or Backward iteration, a lower-dimensional DP table is allocated.

Several time optimizations are implemented as well. HMMoC interprets the emission patterns to determine the range of coordinates at which states may be visited, and uses this information to limit access to accessible states only. In the main loops, HMMoC chooses an order of computation that minimizes accesses to DP table entries, resulting in considerable improvements, particularly for sparse DP tables where such accesses are relatively expensive (see below). In addition, constant transition probabilities are pre-computed outside the main loop whenever possible. Finally, all loops over transitions and states are unrolled, reducing the number of run-time decisions and lookups. As a result of these optimizations, HMMoC produces code approaching the efficiency of the hand-optimized HMMER package (Eddy, 1998); see Figure 1.

2.5 Banding

Banding refers to the technique of traversing only a user-defined portion of a DP table. When used correctly, this can result in dramatic improvements in time and memory usage, with small to negligible loss of accuracy. HMMoC allows the user to specify an arbitrary iterator to traverse the DP table. When banding is used, HMMoC uses a sparse DP table, backed by a hash map storing only entries that are visited. All DP tables provide accessor functions to hide implementation details from the user.

2.6 Robustness

HMMoC provides several consistency checks, both at compile and run time. For instance, HMMoC checks the consistency of state orders, by insisting that orders can increase by one only following an emission, and the induced graph on cliques is checked for cycles. At run time, warnings are issued when iterators follow an inconsistent or out-of-bounds path, and when negative probabilities are encountered. Care has been taken to produce clean and relatively readable C++ code which should compile without warnings, allowing potential problems to be easily identified.

2.7 Examples and documentation

The package includes four examples demonstrating the use of HMMoC: the classic ‘dishonest casino’ HMM (Durbin et al., 1998), a CpG-island finder, a probabilistic global pairwise aligner with Baum–Welch parameter optimization and a conversion script for HMMER profile HMMs that was used to produce the test results mentioned before. Efforts are underway to develop a graphical programming interface abstracting the XML layer to simplify model design (Dombai and Miklós, personal communication); a prototype web implementation is available at http://dbalazs.web.elte.hu/Phyl4/.

ACKNOWLEDGEMENTS

HMMoC’s design was inspired by Ian Holmes’ Telegraph, an XML-based language to define HMMs. Helpful discussions with Jotun Hein and István Miklós are gratefully acknowledged.

Conflict of Interest: none declared.

REFERENCES

Bailey,T.L. and Elkan,C. (1995) The value of prior knowledge in discovering motifs with MEME. Proc. Int. Conf. Intell. Syst. Mol. Biol., 3, 21–29.
Baldi,P. et al. (1994) Hidden Markov models of biological primary sequence information. Proc. Natl Acad. Sci. USA, 91, 1059–1063.
Bateman,A. et al. (2002) The Pfam protein families database. Nucleic Acids Res., 30, 276–280.
Birney,E. and Durbin,R. (1997) Dynamite: a flexible code generating language for dynamic programming methods used in sequence comparison. In Fifth International Conference on Intelligent Systems in Molecular Biology. AAAI Press, Menlo Park, pp. 56–64.
Burge,C. and Karlin,S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 268, 78–94.
Durbin,R. et al. (1998) Biological Sequence Analysis. Cambridge University Press, Cambridge.
Eddy,S.R. (1998) Profile hidden Markov models. Bioinformatics, 14, 755–763.
Hobolth,A. et al. (2007) Genomic relationships and speciation times of human, chimpanzee, and gorilla inferred from a coalescent hidden Markov model. PLoS Genet., 3, e7.
Holmes,I. (2003) Using guide trees to construct multiple-sequence evolutionary HMMs. Bioinformatics, 19 (Suppl. 1), i147–i157.
Scheet,P. and Stephens,M. (2006) A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet., 78, 629–644.
Schliep,A. et al. (2003) Using hidden Markov models to analyze gene expression time course data. Bioinformatics, 19 (Suppl. 1), i255–i263.
Searls,D.B. (1993) String variable grammar: a logic grammar formalism for the biological language of DNA. J. Log. Program., 24, 73–102.
Slater,G.S. and Birney,E. (2005) Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics, 6, 31.
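As a supplementary illustration of the extended-exponent idea described in Section 2.3, the sketch below pairs a floating-point mantissa with a wide integer exponent and renormalizes after each operation. This is a simplified toy, not HMMoC’s actual BFloat template library; the type name and layout are invented for the example.

```cpp
#include <cassert>
#include <cmath>

// A toy extended-exponent real: value = mantissa * 2^exponent, with the
// mantissa kept in [0.5, 1) (or exactly zero).  The 64-bit exponent makes
// underflow practically impossible, while the double mantissa preserves
// precision; products of many small probabilities stay representable.
struct ExtReal {
    double mantissa = 0.0;
    long long exponent = 0;

    static ExtReal fromDouble(double x) { return normalize(x, 0); }

    // Re-establish the invariant mantissa in [0.5, 1) after an operation.
    static ExtReal normalize(double m, long long e) {
        if (m == 0.0) return ExtReal{};
        int shift;
        m = std::frexp(m, &shift);       // m in [0.5, 1), input = m * 2^shift
        return ExtReal{m, e + shift};
    }

    ExtReal operator*(const ExtReal& o) const {
        return normalize(mantissa * o.mantissa, exponent + o.exponent);
    }

    ExtReal operator+(const ExtReal& o) const {
        if (mantissa == 0.0) return o;
        if (o.mantissa == 0.0) return *this;
        // Align the smaller operand to the larger exponent before adding;
        // a huge exponent gap simply flushes the smaller operand to zero.
        // (The cast to int is safe here only for moderate gaps.)
        const ExtReal& big   = (exponent >= o.exponent) ? *this : o;
        const ExtReal& small = (exponent >= o.exponent) ? o : *this;
        double aligned = std::ldexp(small.mantissa,
                                    (int)(small.exponent - big.exponent));
        return normalize(big.mantissa + aligned, big.exponent);
    }

    double toDouble() const { return std::ldexp(mantissa, (int)exponent); }
};
```

Multiplying 10 000 probabilities of 0.1 each is exactly the situation where a plain double underflows (its smallest positive subnormal is about 5e-324) but an extended-exponent type does not, which is why a Forward recursion over long sequences needs such a type.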


Publisher
Oxford University Press
eISSN
1367-4811
DOI
10.1093/bioinformatics/btm350
PMID
17623703

