Profile hidden Markov models.

S R Eddy

doi:10.1093/bioinformatics/14.9.755

Profile hidden Markov models.

Eddy, S R 1998-01-01 00:00:00 &# %& BIOINFORMATICS REVIEW % - '(*$%* & %*") )!"% *&% %",()"*- !&&# & ""% &** ,%+ * &+") gapped pairwise alignment scores can be calculated analyti- Abstract cally, and the significance of gapped alignment scores can be Summary: The recent literature on profile hidden Markov calculated by simple empirical procedures (Altschul and model (profile HMM) methods and software is reviewed. Gish, 1996; Altschul et al., 1997). In contrast, profile Profile HMMs turn a multiple sequence alignment into a methods have historically used ad hoc scoring systems. position-specific scoring system suitable for searching Some mathematical theory was desirable for the meaning databases for remotely homologous sequences. Profile and derivation of the scores in a model as complex as a pro- HMM analyses complement standard pairwise comparison file (Henikoff, 1996). methods for large-scale sequence analysis. Several software Hidden Markov models (HMMs) now provide a coherent implementations and two large libraries of profile HMMs of theory for profile methods. HMMs are a class of probabilistic common protein domains are available. HMM methods models that are generally applicable to time series or linear performed comparably to threading methods in the CASP2 sequences. HMMs have been most widely applied to recog- structure prediction exercise. nizing words in digitized sequences of the acoustics of Contact: eddy@genetics.wustl.edu human speech (Rabiner, 1989). HMMs were introduced into computational biology in the late 1980s (Churchill, 1989), Introduction and for use as profile models just a few years ago (Krogh et al., 1994a). Proteins, RNAs and other features in genomes can usually be Here, the recent literature on profile HMM methods and classified into families of related sequences and structures related methods for modeling sequence families is reviewed. (Henikoff et al., 1997). Different residues in a functional se- Preference is given to papers appearing in the past 2 years, quence are subject to different selective pressures. Multiple since my last review of the field (Eddy, 1996). There seem alignments of a sequence family reveal this in their pattern to be three principal advances. First, motif-based HMMs of conservation. Some positions are more conserved than have been introduced as an alternative to the original Krogh/ others, and some regions of a multiple alignment seem to Haussler profile HMM architecture (Grundy et al., 1997; tolerate insertions and deletions more than other regions. Neuwald et al., 1997). Second, large libraries of profile Intuitively, it seems desirable to use position-specific in- HMMs and multiple alignments have become available, as formation from multiple alignments when searching data- well as compute servers to search query sequences against bases for homologous sequences. ‘Profile’ methods for these resources (Sonnhammer et al., 1998). Third, there has building position-specific scoring models from multiple been an increasing incursion of profile HMM methods into alignments were introduced for this purpose (Taylor, 1986; the area of protein structure prediction by fold recognition Gribskov et al., 1987; Barton, 1990; Henikoff, 1996). How- (Levitt, 1997). ever, profiles have been less used than pairwise methods like Because of space limitations, some of the background I BLAST (Altschul et al., 1990, 1997) and FASTA (Pearson give is terse. A satisfactory introduction to HMMs and pro- and Lipman, 1988), with the most notable exceptions being babilistic models is beyond the scope of this review. Tutorial the popular BLOCKS database (Henikoff et al., 1998) and introductions to HMMs are available (Rabiner, 1989), in- the skilled use of profiles by a small band of professional cluding introductions that specifically include profile HMM protein domain hunters (Bork and Gibson, 1996). methods (Krogh, 1998). Two recent books describe proba- In part, this is because the residue scoring systems used by bilistic modeling methods for biological sequence analysis in pairwise alignment methods are supported by a significant detail (Baldi and Brunak, 1998; Durbin et al., 1998). body of statistical theory (Altschul and Gish, 1996). The pro- babilistic ‘meaning’ of position-independent pairwise align- Hidden Markov models ment scoring matrices is well understood (Altschul, 1991), allowing powerful scoring matrices to be derived (Henikoff There are now various kinds of profile HMMs and related and Henikoff, 1992). The statistical significance of un- models, all based on HMM theory. It is useful to understand Oxford University Press 755 S.R.Eddy the generality and relative simplicity of HMM theory before considering the special case of profile HMMs. An HMM de- scribes a probability distribution over a potentially infinite number of sequences. Because a probability distribution must sum to one, the ‘scores’ that an HMM assigns to se- quences are constrained. The probability of one sequence cannot be increased without decreasing the probability of one or more other sequences. It is this fundamental constraint of probabilistic modeling (Jaynes, 1998) that allows the parameters in an HMM to have non-trivial optima. An example of a simple HMM that models sequences composed of two letters (a, b) is shown in Figure 1. This toy HMM would be an appropriate model for a problem in which we thought sequences started with one residue composition Fig. 1. A toy HMM, modeling sequences of as and bs as two regions (a-rich, perhaps), then switched once to a different residue of potentially different residue composition. The model is drawn composition (b-rich, perhaps). The HMM consists of two (top) with circles for states and arrows for state transitions. A states connected by state transitions. Each state has a symbol possible state sequence generated from the model is shown, followed emission probability distribution for generating (matching) by a possible symbol sequence. The joint probability P(x,π|HMM) of the symbol sequence and the state sequence is a product of all the a symbol in the alphabet. It is convenient to think of an HMM transition and emission probabilities. Notice that another state as a model that generates sequences. Starting in an initial sequence (1-2-2) could have generated the same symbol sequence, state, we choose a new state with some transition probability though probably with a different total probability. This is the (either staying in state 1 with transition probability t , , or 1 1 distinction between HMMs and a standard Markov model with moving to state 2 with transition probability t , ); then we 1 2 nothing to hide: in an HMM, the state sequence (e.g. the biologically generate a residue with an emission probability specific to meaningful alignment) is not uniquely determined by the observed that state [e.g. choosing an a with p (a)]. We repeat the transi- 1 symbol sequence, but must be inferred probabilistically from it. tion/emission process until we reach an end state. At the end of this process, we have a hidden state sequence that we do not observe, and a symbol sequence that we do observe. The name ‘hidden Markov model’ comes from the fact that ward can also be implemented (Hughey and Krogh, 1996; the state sequence is a first-order Markov chain, but only the Tarnas and Hughey, 1998). symbol sequence is directly observed. The states of the Parameters can be set for an HMM in two ways. An HMM HMM are often associated with meaningful biological la- can be trained from initially unaligned (unlabeled) se- bels, such as ‘structural position 42’. In our toy HMM, for quences. Alternatively, an HMM can be built from pre- instance, states 1 and 2 correspond to a biological notion of aligned (pre-labeled) sequences (i.e. where the state paths are two sequence regions with differing residue composition. In- assumed to be known). In the latter case, the parameter es- ferring the alignment of the observed protein or DNA se- timation problem is simply a matter of converting observed quence to the hidden state sequence is like labeling the se- counts of symbol emissions and state transitions into prob- quence with relevant biological information. abilities. In building a profile HMM, an existing multiple Once an HMM is drawn, regardless of its complexity, the alignment is given as input. In contrast, training a profile same standard dynamic programming algorithms can be HMM is analogous to running a multiple alignment program used for aligning and scoring sequences with the model before building the model, and thus is a harder problem. (Durbin et al., 1998). These algorithms, called Forward (for Training algorithms are of interest because we may not yet scoring) and Viterbi (for alignment), have a worst-case algo- know a plausible alignment for the sequences in question. rithmic complexity of O(NM ) in time and O(NM) in space The standard HMM training algorithms are Baum–Welch for a sequence of length N and an HMM of M states. For expectation maximization or gradient descent algorithms. profile HMMs that have a constant number of state transi- Gibbs sampling, simulated annealing and genetic algorithm tions per state rather than the vector of M transitions per state training methods seem better at avoiding spurious local opti- in fully connected HMMs, both algorithms run in O(NM) ma in training HMMs and HMM-like models (Eddy, 1996; time and O(NM) space—not coincidentally, identical to Neuwald et al., 1997; Durbin et al., 1998). Most training al- other sequence alignment dynamic programming algo- gorithms seek relatively simple maximum likelihood (or rithms. For a modest (constant) penalty in time, very mem- maximum a posteriori) optimization targets. More sophisti- 1.5 ory-efficient O(M) and O(M ) versions of Viterbi and For- cated optimization targets are used to compensate for non- 756 Profile hidden Markov models independence of example sequences (e.g. biased representa- tion) (Eddy, 1996; Bruno, 1996; Durbin et al., 1998; Karchin and Hughey, 1998; Sunyaev et al., 1998), or to maximize the ability of a model to discriminate a set of true positive example sequences from a set of true negative training examples (Mamitsuka, 1996). However, since HMM training algorithms are local optim- izers, it pays to build HMMs on pre-aligned data whenever possible. Especially for complicated HMMs, the parameter space may be complex, with many spurious local optima that can trap a training algorithm. In contrast to parameter estimation, a suitable HMM archi- Fig. 2. A small profile HMM (right) representing a short multiple tecture (the number of states, and how they are connected by alignment of five sequences (left) with three consensus columns. state transitions) must usually be designed by hand. A maxi- The three columns are modeled by three match states (squares mum likelihood architecture construction algorithm exists labeled m1, m2 and m3), each of which has 20 residue emission for the special case of building profile HMMs from multiple probabilities, shown with black bars. Insert states (diamonds labeled alignments (Durbin et al., 1998). Efforts have been made to i0–i3) also have 20 emission probabilities each. Delete states (circles develop architecture learning algorithms for general HMMs labeled d1–d3) are ‘mute’ states that have no emission probabilities. (Yada et al., 1996). One can also train fully connected A begin and end state are included (b,e). State transition probabilities are shown as arrows. HMMs and prune low-probability transitions at the end of training (Mamitsuka, 1996). More or less formal probabilistic models are increasingly state at each column allow for insertion of one or more resi- important in biological analysis, particularly in complicated dues between that column and the next, or for deleting the analysis problems with many model parameters. Because consensus residue. Profile HMMs are strongly linear, left– many problems in computational biology reduce to some right models, unlike the general HMM case. Figure 2 shows sort of linear ‘sequence’ analysis, probabilistic models based a small profile HMM corresponding to a short multiple se- on HMMs have been applied to many problems. Other bio- quence alignment. logical applications of HMMs include gene finding (Krogh The probability parameters in a profile HMM are usually et al., 1994b; Kulp et al., 1996; Burge and Karlin, 1997; Hen- converted to additive log-odds scores before aligning and derson et al., 1997; Krogh, 1997; Lukashin and Borodovsky, scoring a query sequence (Barrett et al., 1997). The scores for 1998), radiation hybrid mapping (Slonim et al., 1997), gen- aligning a residue to a profile match state are therefore com- etic linkage mapping (Kruglyak et al., 1996), phylogenetic parable to the derivation of BLAST or FASTA scores: if the analysis (Felsenstein and Churchill, 1996; Thorne et al., probability of the match state emitting residue x is p , and the 1996) and protein secondary structure prediction (Asai et al., expected background frequency of residue x in the sequence 1993; Goldman et al., 1996). In general, the more a problem database is f , the score for residue x at this match state is log resembles a linear sequence analysis problem—i.e. the less p /f . x x it depends on correlations between ‘observables’ (e.g. resi- For other scores, profile HMM treatment diverges from dues)—the more useful HMM approaches will be. Profile standard sequence alignment scoring. In traditional gapped HMMs and HMM-based gene finders have probably been alignment, an insert of x residues is typically scored with an the most successful applications of HMMs in computational affine gap penalty, a + b(x – 1), where a is the score for the biology. On the other hand, protein secondary structure pre- first residue and b is the score for each subsequent residue in diction is an area in which the state of the art is neural net the insertion. In a profile HMM, for an insertion of length x methods that outperform HMM methods by using extensive there is a state transition into an insert state which costs log local correlation information that is not necessarily easy to t (where t is the state transition probability for moving MI MI model in an HMM (Rost and Sander, 1993). from the match state to the insert state), (x – 1) state transi- tions for each subsequent insert state that cost log t , and a II state transition for leaving the insert state that costs log t . Profile HMMs IM This is akin to the traditional affine gap penalty, with the gap Krogh et al. (1994a) introduced an HMM architecture that open cost as a = log t + log t , and the gap extend cost as MI IM was well suited for representing profiles of multiple se- b = log t . II quence alignments. For each consensus column of the mul- However, in a profile HMM, these gap costs are not arbit- tiple alignment, a ‘match’ state models the distribution of rary numbers. This is an example of why probabilistic mo- residues allowed in the column. An ‘insert’ state and ‘delete’ dels have useful and non-trivial optima. Imagine that we 757 S.R.Eddy were trying to optimize the gap parameters of a model by maximizing the score of the model on a training set of example sequences. In a profile with ad hoc gap costs, we could trivially maximize the scores just by setting all gap costs to zero, but the alignments produced by a profile with no gap penalties would be terrible. In the profile HMM, in contrast, the probability of a transition to an insert is linked to the probability of transition to a match and not inserting; profile HMMs have a cost for the match state to match state transition that has no counterpart in standard alignment. As we lower the gap cost by raising the transition probability t MI towards 1.0, the probability of the match–match transition t falls towards zero, and thus the cost for sequences with- MM out an insertion approaches negative infinity. There is, there- fore, a trade-off point in choosing the state transition prob- abilities where the cost for the sequences that do have an in- sertion is balanced against the cost for the sequences that do not. Additionally, the inserted residues are associated with in- sert state emission probabilities in the HMM. If these emission probabilities are the same as the background amino acid frequency, then the score of inserted residues is log Fig. 3. Different model architectures used in current methods. State transitions are shown as arrows and emission distributions are not f /f = 0. In traditional alignment, inserted residues also have x x represented. Numbered squares indicate ‘match states’. Diamonds no cost besides the affine gap penalty. The profile HMM for- indicate ‘insert states’. Match and insert states each have emission malism forces us to see that this zero cost corresponds to an distributions over 4 or 20 possible nucleic or amino acid symbols. assumption that unconserved insertions in protein structures Circles indicate non-emitting delete states and other special non- have the same residue distribution as proteins in general. emitting states such as begin and end states. From top to bottom: However, the assumption is usually wrong. Insertions tend BLOCKS-style ungapped motifs, represented as an HMM; the to be seen most often in surface loops of protein structures, multiple motif model in META-MEME; the original profile HMM and so have a bias towards hydrophilic residues. Profile of Krogh et al.; and the ‘Plan 7’ architecture of HMMER 2, HMMs can capture this information in the insert state representative of the new generation of profile HMM software in emission distributions. SAM, HMMER and PFTOOLS. Profile HMM software augmented that simple model to deal with multiple domains, Several available software packages implement profile sequence fragments and local alignments, as illustrated by HMMs or HMM-like models (Table 1). One important dif- the HMMER 2.0 ‘Plan 7’ model architecture in Figure 3. ference between these packages is the model architecture Thus, local versus global alignment is not necessarily in- they adopt (Figure 3). The philosophical divide is between trinsic to the algorithm (as is usually thought, for instance, in ‘profile’ models and ‘motif’ models. By ‘profile’ models, I the distinction between the global ‘Needleman/Wunsch’ and mean models with an insert and delete state associated with local ‘Smith/Waterman’ algorithms), but can be dealt with each match state, allowing insertion and deletion anywhere probabilistically as part of the model architecture. Local in a target sequence. By ‘motif’ models, I mean models alignments with respect to the model are allowed by non- dominated by strings of match states (modeling ungapped zero state transition probabilities from a begin state to inter- blocks of sequence consensus) separated by a small number nal match states, and from internal match states to an end of insert states modeling the spaces between ungapped state (dotted lines in Figure 3). Local alignments with respect blocks. to the sequence are allowed by non-zero state transitions on SAM (Hughey, 1996), HMMER (S.R.Eddy, unpublished), the flanking insert states (shaded in the Plan 7 architecture in PFTOOLS (Bucher et al., 1996) and HMMpro (Baldi et al., Figure 3). More than one hit to the HMM per sequence is 1994) implement models based at least in part on the original allowed by a cycle of non-zero transitions through a third profile HMMs of Krogh et al. (1994a). These packages have special insert state. 758 Profile hidden Markov models Table 1. Internet sources for obtaining some of the existing profile HMM the current protein database starting with single randomly and HMM-like software packages selected query sequences, with impressive results (Neuwald et al., 1997). Software URL GENEWISE is a sophisticated ‘framesearch’ application SAM http://www.cse.ucsc.edu/research/compbio/sam.html that can take a HMMER protein model and search it against HMMER http://hmmer.wustl.edu/ EST or genomic DNA, allowing for frameshifts, introns and sequencing errors (Birney and Durbin, 1997). PFTOOLS http://ulrec3.unil.ch:80/profile/ PSI-BLAST (Altschul et al., 1997) is not an HMM ap- HMMpro http://www.netid.com/ plication per se, but it uses some principles of full probabilis- GENEWISE http://www.sanger.ac.uk/Software/Wise2/ tic modeling to build HMM-like models from multiple align- PROBE ftp://ncbi.nlm.nih.gov/pub/neuwald/probe1.0/ ments. Like the use of PROBE (Neuwald et al., 1997), PSI- META-MEME http://www.cse.ucsd.edu/users/bgrundy/metameme.1.0.html BLAST starts from a single query sequence and collects homologous sequences by BLAST search. These homo- BLOCKS http://www.blocks.fhcrc.org/ logues are aligned to the query. An HMM-like search model PSI-BLAST http://www.ncbi.nlm.nih.gov/BLAST/newblast.html is built from the multiple alignment. The model is searched against the database, new homologues are discovered and These profile HMMs are rather general, allowing inser- added to the alignment, and a new model is built. The process tions and deletions anywhere in a sequence relative to the is iterated until no new homologues are discovered. PROBE consensus model. Intuitively, they should be more sensitive and PSI-BLAST both illustrate the power of automating it- than ungapped models. However, in practice, there is a trade- erative profile searches. The remarkable speed of PSI- off between increasing the descriptive power of the model BLAST also demonstrates that the fast BLAST algorithm and the difficulty in determining an increasingly large can be applied to position-specific scoring systems and number of free parameters. A complex model is more prone gapped alignments, and hence to profile HMMs. to overfitting the training data and failing to generalize to With the exception of PSI-BLAST, profile HMM search other sequences. SAM and HMMER use mixture Dirichlet algorithms are computationally demanding. Fast hardware priors on most distributions to help avoid overfitting and to implementations of Gribskov profile searches (Gribskov et limit the effective number of free parameters (Sjolander, al., 1987) are available from several manufacturers, includ- 1996). It is possible to reduce the effective number of free ing Compugen and Time Logic. These systems are currently parameters even further by adopting hybrid HMM/neural being revised to accommodate profile HMMs and the exist- network techniques (Baldi and Chauvin, 1996). Nonethe- ing PROSITE and PFAM HMM libraries. HMM approaches less, this relatively unconstrained freedom to insert and de- are also readily parallelized (Grundy et al., 1996; Hughey, lete anywhere makes these models somewhat difficult to 1996). Even more esoteric speed-ups are also possible. For train from initially unaligned sequences. HMMER and instance, Intel Corporation has made a white paper available PFTOOLS are used primarily to build database search mo- on using MMX assembly instructions to parallelize the Viter- dels from pre-existing alignments, such as those in the Pfam bi algorithm and get about a 2-fold speed increase on Intel and PROSITE Profiles databases (see below). hardware (http://developer.intel.com/drg/mmx/AppNotes/ PROBE (Neuwald et al., 1997), META-MEME (with its AP569.HTM). This could be significant, since some of the brethren MEME and MAST) (Grundy et al., 1997) and WWW-based HMM servers are backed by Intel processor BLOCKS (Henikoff et al., 1998) assume quite different farms running Linux or FreeBSD, such as the ISREC/Prosite ‘motif’ models. In these models, alignments consist of one INSECT farm (Jongeneel et al., 1998). or more ungapped blocks, separated by intervening se- quences that are assumed to be random (Figure 3). The handling of these gaps in BLOCKS is ad hoc. PROBE and Profile HMM libraries META-MEME adopt probabilistic models for the gaps. META-MEME, interestingly, fits its models into HMMER Profile HMM software is well suited for modeling a particular format. The motif models can therefore be viewed as special sequence family of interest and finding additional remote homo- cases of profile HMMs; indeed, HMMER, SAM and logues in a sequence database. Suppose instead that I have a PFTOOLS have various options for creating motif-like mo- query sequence of interest, and I am interested in whether this dels. The strength here is that by limiting the freedom of the sequence contains one or more known domains. This problem model a priori, the HMM training problem is made more arises especially in high-throughput genome sequence analysis, tractable. These approaches can be very powerful for dis- where standard ‘top hit’ BLAST analyses can be confused by covering conserved motifs in initially unaligned sets of se- proteins with several distinct domains. Now I need to search the quences. PROBE, for instance, has been turned loose on a single query sequence against a library of profile HMMs, rather fully automated exercise in identifying domain families in than a single profile HMM against a database of sequences. 759 S.R.Eddy Building a library of profile HMMs in turn requires a large 806 models in Pfam 3.0 recognize ~ 42% (S.R.E. unpublished number of multiple alignments of common protein domains. data). Thus, an ~ 5-fold increase in Pfam database size (175 to 806) resulted in only about a 50% increase in the number of A database of annotated multiple alignments and pre-built sequences recognized with significant scores. On the bright profile HMMs becomes desirable. side, the number of C.elegans sequences annotated by one or Two large collections of annotated profile HMMs are cur- more Pfam models is starting to approach the number that is rently available: the Pfam database (Sonnhammer et al., 1997, hit by one or more informative BLAST similarities to the non- 1998) and the PROSITE Profiles database (Bairoch et al., redundant sequence database (42% compared to ~ 55%). 1997). The PROSITE Profiles database is a supplement to the None of the profile servers is mature. Both profile software widely used PROSITE motifs database; for families that can- and profile databases are rapidly improving and changing. In not be recognized by simple PROSITE motif patterns (regular particular, profile databases typically include domain models expressions which either match a sequence or do not), more that other databases may not yet have. Users are well advised sensitive profile HMMs are developed. Both databases are to search several domain annotation servers. The Interpro col- available via WWW servers, including on-line analysis laboration is expected to be extremely valuable as the various servers for submitting protein sequence queries (Table 2). A database teams begin actively sharing alignment and annota- new European Union funded initiative, called Interpro, has tion data. established a collaboration among several sites interested in effective protein domain annotation, including the Pfam, HMMs for fold recognition PROSITE and PRINTS development teams as well as the SWISS-PROT/TREMBL team. Profile HMMs are sometimes viewed as ‘mere sequence mo- The current pre-release of the PROSITE Profiles database dels’. However, profile scores can be calculated from struc- contains profiles for 290 protein domains, and the current tural data instead of sequences, e.g. ‘3D/1D profiles’ (Bowie Pfam 3.1 release contains 1313 profiles. There is substantial et al., 1991; Luthy et al., 1992). These structural profile ap- overlap between the two collections. It is not meaningful to try proaches can readily be put into a full probabilistic, HMM- to estimate how complete these databases are, because the based framework (Stultz et al., 1993; White et al., 1994). Di number of protein families in nature is unknown and probably Francesco and colleagues have used profile HMMs to model very large. Although there is much discussion of how many secondary structure symbol sequences by modifying the protein families there are—the number 1000 is often cited SAM code to emit an alphabet of protein secondary structure (Chothia, 1992)—such estimates typically make a false as- symbols, training models on known secondary structures, sumption that all families have approximately equal numbers and aligning these models to secondary structure predictions of members (Orengo et al., 1994). However, a small number of new protein sequences (Di Francesco et al., 1997a,b). of families (such as protein kinases, G-protein coupled recep- The pejorative appellation of ‘mere sequence models’ tors and immunoglobulin superfamily domains) account for seems to be applied to HMMs based on a misunderstanding a disproportionate number of sequences. The two databases of the central assumption of position independence in are therefore seeing diminishing returns as models of less HMMs. Obviously, neighboring three-dimensional struc- populous families are developed. For example, the 175 mo- tural contacts influence the types of residue that will be ac- dels in Pfam 1.0 recognize one or more domains in ~ 27% of cepted at any given position in a protein structure. How can predicted proteins from the Caenorhabditis elegans genome HMMs that explicitly assume position independence hope to project, the 527 models in Pfam 2.0 recognize ~ 35% and the be a realistic model of protein structure? Table 2. WWW analysis servers for analyzing protein sequences for known domains Profile HMM libraries: Pfam (Sonnhammer et al., 1998) http://www.sanger.ac.uk/Pfam/ PROSITE profiles (Bairoch et al., 1997) http://ulrec3.unil.ch/software/PFSCAN_form.html HMM-like methods: BLOCKS (Henikoff et al., 1998) http://www.blocks.fhcrc.org/ Other protein domain family classification servers: PRINTS (Attwood et al., 1998) http://www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS/ ProClass (Wu et al., 1996) http://diana.uthct.edu/proclass.html PRODOM (Corpet et al., 1998) http://www.toulouse.inra.fr/prodom.html SBASE (Fabian et al., 1997) http://base.icgeb.trieste.it/sbase/ 760 Profile hidden Markov models The assumption of position independence only means that quence annotation is so difficult that some people almost when an HMM state scores a residue in a sequence, it does seem ready to give up on it (Wheelan and Boguski, 1998). so independently of the rest of that sequence’s alignment. The development of robust methods for automated sequence However, nothing says that the emission probability distribu- classification and annotation is imperative. Our hope in de- tion at that state cannot be determined in the first place from veloping profile HMM methods is that we can provide a sec- complex three-dimensional structural knowledge of the ond tier of solid, sensitive, statistically based analysis tools training set. If I know that a residue is buried by spatially that complement current BLAST and FASTA analyses. The neighboring hydrophobic residues, and this environment is combination of powerful new HMM software and large se- approximately constant among related structures in the pro- quence alignment databases of conserved protein domains tein family, I can build that knowledge into my model. What should help make this hope a reality. HMMs cannot deal with efficiently are long-distance cor- relations between residues, as is seen in RNA structural Acknowledgements alignments, where the complementarity of a pair of distant sequence positions is more important than the identity of Work on profile HMMs and Pfam in my laboratory is sup- either position by itself (Durbin et al., 1998). (Short-distance ported by NIH/NHGRI R01 HG01363, Monsanto and Eli correlation can be built into HMMs without much difficulty; Lilly. I thank D.States for pointing out the Intel paper on for example, gene-finding HMMs typically model the prob- MMX Viterbi implementations; K.Karplus, R.Hughey and ability of coding hexamers instead of probabilities of single A.Neuwald for providing pre-publication results; and residues.) C.Eddy, S.Johnson, my research group and three anonymous Many current fold recognition methods are not cast as reviewers for their useful criticism of the manuscript. I also HMMs, but instead as sequence/structure ‘threading’ algo- thank the many people in the HMM community with whom rithms with relatively ad hoc scores. However, any threading I have discussed these issues, especially A.Krogh, P.Bucher, scoring system for which a dynamic programming algorithm A.Neuwald, B.Grundy, G.Mitchison, the other members of can be used to find optimal sequence/structure alignments can the Pfam consortium (the R.Durbin and E.Sonnhammer be recast as a full probabilistic HMM. This includes ‘frozen groups), and the remarkable UC Santa Cruz HMM group. approximation’ methods (Godzik et al., 1992), for instance. The fold recognition section of the CASP (Current Asses- sment of Structure Prediction) exercise (Moult et al., 1997) References is one of the most interesting anecdotal benchmarks of how Altschul,S.F. (1991) Amino acid substitution matrices from an HMM techniques perform. In CASP, the sequences of pro- information theoretic perspective. J. Mol. Biol., 219, 555–565. tein ‘prediction targets’ whose structures are soon to be Altschul,S.F. and Gish,W. (1996) Local alignment statistics. Methods solved by crystallography or NMR are made available to Enzymol., 266, 460–480. computational structure prediction groups. After the struc- Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) tures become available, the success of the fold predictions is Basic local alignment search tool. J. Mol. Biol., 215, 403–410. evaluated. Ranking the performance of different methods in Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., CASP is difficult and somewhat subjective (Levitt, 1997). Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI- Also, there is usually a variable and sometimes substantial BLAST: A new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402. degree of expert human interpretation added to the auto- Asai,K., Hayamizu,S. and Handa,K. (1993) Prediction of protein mated methods (Murzin and Bateman, 1997). Nonetheless, secondary structure by the hidden Markov model. Comput. Applic. CASP has been a lively venue to explore the strengths and Biosci., 9, 141–146. weaknesses of fold recognition methods. At CASP2 last Attwood,T.K., Beck,M.E., Flower,D.R., Scordis,P. and Selley,J.N. year, HMM-based methods were among the techniques used (1998) The PRINTS protein fingerprint database in its fifth year. by several of the most successful prediction groups (Di Fran- Nucleic Acids Res., 26, 304–308. cesco et al., 1997; Karplus et al., 1997; Levitt, 1997; Murzin Bairoch,A., Bucher,P. and Hofmann,K. (1997) The PROSITE data- and Bateman, 1997). Indeed, Murzin and Bateman (1997) base, its status in 1997. Nucleic Acids Res., 25, 217–221. correctly predicted the folds of all six proteins they at- Baldi,P. and Brunak,S. (1998) Bioinformatics: The Machine Learning tempted, using a combination of profile HMMs, secondary Approach. MIT Press, Boston. structure prediction and expert knowledge. Baldi,P. and Chauvin,Y. (1996) Hybrid modeling, HMM/NN architec- tures and protein applications. Neural Comput., 8, 1541–1565. Baldi,P., Chauvin,Y., Hunkapiller,T. and McClure,M.A. (1994) Conclusion Hidden Markov models of biological primary sequence informa- tion. Proc. Natl Acad. Sci. USA, 91, 1059–1063. The human genome project threatens to overwhelm us in a Barrett,C., Hughey,R. and Karplus,K. (1997) Scoring hidden Markov deluge of raw sequence data. Successful large-scale se- models. Comput. Applic. Biosci., 13, 191–199. 761 S.R.Eddy Barton,G.J. (1990) Protein multiple sequence alignment and flexible Grundy,W.N., Bailey,T.L., Elkan,C.P. and Baker,M.E. (1997) Meta- pattern matching. Methods Enzymol., 183, 403–427. MEME: Motif-based hidden Markov models of protein families. Birney,E. and Durbin,R. (1997) Dynamite: A flexible code generating Comput. Applic. Biosci., 13, 397–406. language for dynamic programming methods used in sequence Henderson,J., Salzberg,S. and Fasman,K. (1997) Finding genes in comparison. In Proceedings of the Fifth International Conference human DNA with a hidden Markov model. J. Comput. Biol., 4, on Intelligent Systems in Molecular Biology, 5, 56–64. AAAI Press, 127–141. Menlo Park. Henikoff,S. (1996) Scores for sequence searches and alignments. Curr. Bork,P. and Gibson,T.J. (1996) Applying motif and profile searches. Opin. Struct. Biol., 6, 353–360. Methods Enzymol., 266, 162–184. Henikoff,S. and Henikoff,J.G. (1992) Amino acid substitution ma- Bowie,J.U., Luthy,R. and Eisenberg,D. (1991) A method to identify trices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 10915–10919. protein sequences that fold into a known three-dimensional structure. Science, 253, 164–170. Henikoff,S., Greene,E.A., Pietrokovski,S., Bork,P., Attwood,T.K. and Bruno,W.J. (1996) Modeling residue usage in aligned protein se- Hood,L. (1997) Gene families: The taxonomy of protein paralogs quences via maximum likelihood. Mol. Biol. Evol., 13, 1368–1374. and chimeras. Science, 278, 609–614. Bucher,P., Karplus,K., Moeri,N. and Hofmann,K. (1996) A flexible Henikoff,S., Pietrokovski,S. and Henikoff,J.G. (1998) Superior per- motif search technique based on generalized profiles. Comput. formance in protein homology detection with the Blocks database Chem., 20, 3–23. servers. Nucleic Acids Res., 26, 309–312. Burge,C. and Karlin,S. (1997) Prediction of complete gene structures Hughey,R. (1996) Parallel hardware for sequence comparison and in human genomic DNA. J. Mol. Biol., 268, 78–94. alignment. Comput. Applic. Biosci., 12, 473–479. Chothia,C. (1992) One thousand families for the molecular biologist. Hughey,R. and Krogh,A. (1996) Hidden Markov models for sequence Nature, 357, 543–544. analysis: Extension and analysis of the basic method. Comput. Churchill,G.A. (1989) Stochastic models for heterogeneous DNA Applic. Biosci., 12, 95–107. sequences. Bull. Math. Biol., 51, 79–94. Jaynes,E.T. (1998) Probability Theory: The Logic of Science. Corpet,F., Gouzy,J. and Kahn,D. (1998) The ProDom database of Available from http://bayes.wustl.edu. protein domain families. Nucleic Acids Res., 26, 323–326. Jongeneel,V., Junier,T., Iseli,C., Hofmann,K. and Bucher,P. (1998) Di Francesco,V., Garnier,J. and Munson,P.J. (1997a) Protein topology INSECT and MOLLUSCS—supercomputing on the cheap. Avail- recognition from secondary structure sequences: Application of the able from http:// cmpteam4.unil.ch/biocomputing/mollusc/ IN- SECT_and_MOLLUSCS.html. hidden Markov models to the alpha class proteins. J. Mol. Biol., 267, 446–463. Karchin,R. and Hughey,R. (1998) Weighting hidden Markov models Di Francesco,V., Geetha,V., Garnier,J. and Munson,P.J. (1997b) Fold for maximum discrimination. Bioinformatics, in press. recognition using predicted secondary structure sequences and Karplus,K., Sjolander,K., Barrett,C., Cline,M., Haussler,D., hidden Markov models of protein folds. Proteins, 1(Suppl.), Hughey,R., Holm,L. and Sander,C. (1997) Predicting protein 123–128. structure using hidden Markov models. Proteins, 1(Suppl.), Durbin,R., Eddy,S.R., Krogh,A. and Mitchison,G.J. (1998) Biological 134–139. Sequence Analysis: Probabilistic Models of Proteins and Nucleic Krogh,A. (1997) Two methods for improving performance of an Acids. Cambridge University Press, Cambridge, UK. HMM and their application for gene finding. In Proceedings of the Eddy,S.R. (1996) Hidden Markov models. Curr. Opin. Struct. Biol., 6, Fifth International Conference on Intelligent Systems in Molecular 361–365. Biology, 5, 179–186. AAAI Press, Menlo Park. Fabian,P., Murvai,J., Vlahovicek,K., Hegyi,H. and Pongor,S. (1997) Krogh,A. (1998) An introduction to hidden Markov models for The SBASE protein domain library, release 5.0: A collection of biological sequences. In Salzberg,S., Searls,D. and Kasif,S. (eds), annotated protein sequence segments. Nucleic Acids Res., 25, Computational Methods in Molecular Biology. Elsevier, New York. 240–243. pp. 45–63. Felsenstein,J. and Churchill,G. (1996) A hidden Markov model Krogh,A., Brown,M., Mian,I.S., Sjolander,K. and Haussler,D. (1994a) approach to variation among sites in rate of evolution. Mol. Biol. Hidden Markov models in computational biology: Applications to Evol., 13, 93–104. protein modeling. J. Mol. Biol., 235, 1501–1531. Godzik,A., Kolinski,A. and Skolnick,J. (1992) Topology fingerprint Krogh,A., Mian,I.S. and Haussler,D. (1994b) A hidden Markov model approach to the inverse protein folding problem. J. Mol. Biol., 227, that finds genes in E.coli DNA. Nucleic Acids Res., 22, 4768–4778. 227–238. Kruglyak,L., Daly,M.J., Reeve-Daly,M.P. and Lander,E.S. (1996) Goldman,N., Thorne,J.L. and Jones,D.T. (1996) Using evolutionary Parametric and nonparametric linkage analysis: A unified multi- trees in protein secondary structure prediction and other compara- point approach. Am. J. Hum. Genet., 58, 1347–1363. tive sequence analyses. J. Mol. Biol., 263, 196–208. Kulp,D., Haussler,D., Reese,M.G. and Eeckman,F.H. (1996) A Gribskov,M., McLachlan,A.D. and Eisenberg,D. (1987) Profile analy- generalized hidden Markov model for the recognition of human sis: Detection of distantly related proteins. Proc. Natl Acad. Sci. genes in DNA. In Proceedings of the Fourth International USA, 84, 4355–4358. Conference on Intelligent Systems in Molecular Biology, 4, Grundy,W.N., Bailey,T.L. and Elkan,C.P. (1996) ParaMEME: A 134–141. AAAI Press, Menlo Park. parallel implementation and a web interface for a DNA and protein Levitt,M. (1997) Competitive assessment of protein fold recognition motif discovery tool. Comput. Applic. Biosci., 12, 303–310. and alignment accuracy. Proteins, 1(Suppl.), 92–104. 762 Profile hidden Markov models Lukashin,A.V. and Borodovsky,M. (1998) GeneMark.hmm: New Sonnhammer,E.L., Eddy,S.R. and Durbin,R. (1997) Pfam: A com- solutions for gene finding. Nucleic Acids Res., 26, 1107–1115. prehensive database of protein families based on seed alignments. Luthy,R., Bowie,J.U. and Eisenberg,D. (1992) Assessment of protein Proteins, 28, 405–420. models with three-dimensional profiles. Nature, 356, 83–85. Sonnhammer,E.L.L., Eddy,S.R., Birney,E., Bateman,A. and Durbin,R. Mamitsuka,H. (1996) A learning method of hidden Markov models for (1998) Pfam: Multiple sequence alignments and HMM-profiles of sequence discrimination. J. Comput. Biol., 3, 361–373. protein domains. Nucleic Acids Res., 26, 320–322. Moult,J., Hubbard,T., Bryant,S.H., Fidelis,K. and Pedersen,J.T. (1997) Stultz,C.M., White,J.V. and Smith,T.F. (1993) Structural analysis Critical assessment of methods of protein structure prediction based on state-space modeling. Protein Sci., 2, 305–314. (CASP): Round II. Proteins, 1(Suppl.), 2–6. Sunyaev,S.R., Rodchenkov,I.V., Eisenhaber,F. and Kuznetsov,E.N. Murzin,A.G. and Bateman,A. (1997) Distant homology recognition (1998) Analysis of the position dependent amino acid probabilities using structural classification of proteins. Proteins, 1(Suppl.), and its application to the search for remote homologues. In 105–112. RECOMB ’98, pp. 258–265. Neuwald,A.F., Liu,J.S., Lipman,D.J. and Lawrence,C.E. (1997) Ex- Tarnas,C. and Hughey,R. (1998) Reduced space hidden Markov model tracting protein alignment models from the sequence database. training. Bioinformatics, in press. Nucleic Acids Res., 25, 1665–1677. Taylor,W.R. (1986) Identification of protein sequence homology by Orengo,C., Jones,D.T. and Thornton,J.M. (1994) Protein superfamilies consensus template alignment. J. Mol. Biol., 188, 233–258. and domain superfolds. Nature, 372, 631–634. Thorne,J.L., Goldman,N. and Jones,D.T. (1996) Combining protein Pearson,W. and Lipman,D. (1988) Improved tools for biological evolution and secondary structure. Mol. Biol. Evol., 13, 666–673. sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448. Wheelan,S.J. and Boguski,M.S. (1998) Late-night thoughts on the Rabiner,L.R. (1989) A tutorial on hidden Markov models and selected sequence annotation problem. Genome Res., 8, 168–169. applications in speech recognition. Proc. IEEE, 77, 257–286. White,J.V., Stultz,C.M. and Smith,T.F. (1994) Protein classification by Rost,B. and Sander,C. (1993) Prediction of protein secondary structure stochastic modeling and optimal filtering of amino-acid sequences. at better than 70% accuracy. J. Mol. Biol., 232, 584–599. Math. Biosci., 119, 35–75. Slonim,D., Kruglyak,L., Stein,L. and Lander,E. (1997) Building Wu,C.H., Zhao,S. and Chen,H.L. (1996) A protein class database human genome maps with radiation hybrids. J. Comput. Biol., 4, organized with ProSite, protein groups and PIR, superfamilies. J. 487–504. Comput. Biol., 3, 547–561. Sjölander,K., Karplus,K., Brown,M., Hughey,R., Krogh,A., Mian,I.S. Yada,T., Ishikawa,M., Tanaka,H. and Asai,K. (1996) Extraction of and Haussler,D. (1996) Dirichlet mixtures: A method for improving hidden Markov model representations of signal patterns in DNA detection of weak but significant protein sequence homology. sequences. Pac. Symp. Biocomput., World Scientific, Singapore, pp. Comput. Applic. Biosci., 12, 327–345. 686–696. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press http://www.deepdyve.com/lp/oxford-university-press/profile-hidden-markov-models-6HHQEDZ4zf

Loading next page...

References (74)

C. Tarnas, R. Hughey (1998)
Reduced space hidden Markov model training
Bioinformatics, 14 5
S. Henikoff, S. Pietrokovski, J. Henikoff (1998)
Superior performance in protein homology detection with the Blocks Database servers
Nucleic acids research, 26 1
P. Baldi, Y. Chauvin, T. Hunkapiller, M. McClure (1994)
Hidden Markov models of biological primary sequence information.
Proceedings of the National Academy of Sciences of the United States of America, 91
A. Bairoch, P. Bucher, K. Hofmann (1997)
The PROSITE database, its status in 1997
Nucleic acids research, 25 1
R. Karchin, R. Hughey (1998)
Weighting hidden Markov models for maximum discrimination
Bioinformatics, 14 9
E. Sonnhammer, S. Eddy, E. Birney, A. Bateman, R. Durbin (1998)
Pfam: multiple sequence alignments and HMM-profiles of protein domains
Nucleic acids research, 26 1
P. Bucher, K. Karplus, N. Moeri, K. Hofmann (1996)
A Flexible Motif Search Technique Based on Generalized Profiles
Computers & chemistry, 20 1
P. Fabian, J. Murvai, Z. Hátsági, K. Vlahoviček, H. Hegyi, S. Pongor (1997)
The SBASE protein domain library, release 5.0: a collection of annotated protein sequence segments
Nucleic acids research, 25 1
W. Bruno (1996)
Modeling residue usage in aligned protein sequences via maximum likelihood.
Molecular biology and evolution, 13 10
Nick Goldman, Nick Goldman, Nick Goldman, J. Thorne, Jeffrey Thorne, Jeffrey Thorne, David Jones, David Jones, David Jones (1996)
Using evolutionary trees in protein secondary structure prediction and other comparative sequence analyses.
Journal of molecular biology, 263 2
A. Murzin, A. Bateman (1997)
Distant homology recognition using structural classification of proteins
Proteins: Structure, 29
J. Felsenstein, G. Churchill (1996)
A Hidden Markov Model approach to variation among sites in rate of evolution.
Molecular biology and evolution, 13 1
G. Churchill (1989)
Stochastic models for heterogeneous DNA sequences.
Bulletin of mathematical biology, 51 1
D. Slonim, L. Kruglyak, L. Stein, E. Lander (1997)
Building human genome maps with radiation hybrids
Journal of computational biology : a journal of computational molecular cell biology, 4 4
A. Krogh (1998)
Chapter 4 - An introduction to hidden Markov models for biological sequences
New Comprehensive Biochemistry, 32
(1998)
) INSECT and MOLLUSCS—supercomputing on the cheap. Available from http
J. Bowie, R. Lüthy, D. Eisenberg (1991)
A method to identify protein sequences that fold into a known three-dimensional structure.
Science, 253 5016
K. Asai, S. Hayamizu, Ken'ichi Handa (1993)
Prediction of protein secondary structure by the hidden Markov model
Computer applications in the biosciences : CABIOS, 9 2
William Taylor (1986)
Identification of protein sequence homology by consensus template alignment.
Journal of molecular biology, 188 2
P. Bork, T. Gibson (1996)
Applying motif and profile searches.
Methods in enzymology, 266
C. Burge, S. Karlin (1997)
Prediction of complete gene structures in human genomic DNA.
Journal of molecular biology, 268 1
S. Altschul, W. Gish, W. Miller, E. Myers, D. Lipman (1990)
Basic local alignment search tool.
Journal of molecular biology, 215 3
F. Corpet, J. Gouzy, D. Kahn (1998)
The ProDom database of protein domain families
Nucleic acids research, 26 1
W. Grundy, T. Bailey, C. Elkan (1996)
ParaMEME: a parallel implementation and a web interface for a DNA and protein motif discovery tool
Computer applications in the biosciences : CABIOS, 12 4
Hiroshi Mamitsuka (1996)
A Learning Method of Hidden Markov Models for Sequence Discrimination
Journal of computational biology : a journal of computational molecular cell biology, 3 3
L. Rabiner (1989)
A tutorial on hidden Markov models and selected applications in speech recognition
Proc. IEEE, 77
S. Henikoff, E. Greene, S. Pietrokovski, P. Bork, T. Attwood, Leroy Hood (1997)
Gene families: the taxonomy of protein paralogs and chimeras.
Science, 278 5338
R. Hughey, A. Krogh (1996)
Hidden Markov models for sequence analysis: extension and analysis of the basic method
Computer applications in the biosciences : CABIOS, 12 2
R. Hughey (1996)
Parallel hardware for sequence comparison and alignment
Computer applications in the biosciences : CABIOS, 12 6
L. Rabiner, B. Juang (1986)
An introduction to hidden Markov models
IEEE ASSP Magazine, 3
A. Neuwald, Jun Liu, D. Lipman, C. Lawrence (1997)
Extracting protein alignment models from the sequence database.
Nucleic acids research, 25 9
W. Pearson, D. Lipman (1988)
Improved tools for biological sequence comparison.
Proceedings of the National Academy of Sciences of the United States of America, 85 8
M. Levitt (1997)
Competitive assessment of protein fold recognition and alignment accuracy
Proteins: Structure, 29
J. Thorne, N. Goldman, David Jones (1996)
Combining protein evolution and secondary structure.
Molecular biology and evolution, 13 5
A. Krogh (1997)
Two Methods for Improving Performance of a HMM and their Application for Gene Finding
Proceedings. International Conference on Intelligent Systems for Molecular Biology, 5
S. Altschul, W. Gish (1996)
Local alignment statistics.
Methods in enzymology, 266
B. Rost, C. Sander (1993)
Prediction of protein secondary structure at better than 70% accuracy.
Journal of molecular biology, 232 2
R. Lüthy, J. Bowie, D. Eisenberg (1992)
Assessment of protein models with three-dimensional profiles
Nature, 356
J. Moult, T. Hubbard, S. Bryant, K. Fidelis, J. Pedersen (1997)
Critical assessment of methods of protein structure prediction (CASP): Round II
Proteins: Structure, 29
R. Durbin, S. Eddy, A. Krogh, G. Mitchison (1998)
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
S. Eddy (1996)
Hidden Markov models.
Current opinion in structural biology, 6 3
A. Krogh, I. Mian, D. Haussler, K. Rudd (1994)
A hidden Markov model that finds genes in E. coli DNA.
Nucleic acids research, 22 22
L. Kruglyak, M. Daly, Mary Reeve-Daly, E. Lander (1996)
Parametric and nonparametric linkage analysis: a unified multipoint approach.
American journal of human genetics, 58 6
Steven Henikoff (1996)
Scores for sequence searches and alignments.
Current opinion in structural biology, 6 3
S. Altschul (1991)
Amino acid substitution matrices from an information theoretic perspective
Journal of Molecular Biology, 219
G. Barton (1990)
Protein multiple sequence alignment and flexible pattern matching.
Methods in enzymology, 183
S. Sunyaev, I. Rodchenkov, F. Eisenhaber, E. Kuznetsov (1998)
Analysis of the position dependent amino acid probabilities and its application to the search for remote homologues
V. Francesco, V. Geetha, Jean Garnier, P. Munson (1997)
Fold recognition using predicted secondary structure sequences and hidden Markov models of protein folds
Proteins: Structure, 29
James White, Collin Stultz, Temple Smith (1994)
Protein classification by stochastic modeling and optimal filtering of amino-acid sequences.
Mathematical biosciences, 119 1
S. Henikoff, J. Henikoff (1992)
Amino acid substitution matrices from protein blocks.
Proceedings of the National Academy of Sciences of the United States of America, 89 22
T. Yada, M. Ishikawa, H. Tanaka, K. Asai (1996)
Extraction of hidden Markov model representations of signal patterns in DNA sequences.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
A. Lukashin, M. Borodovsky (1998)
GeneMark.hmm: new solutions for gene finding.
Nucleic acids research, 26 4
D. Kulp, D. Haussler, M. Reese, F. Eeckman (1996)
A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA
Proceedings. International Conference on Intelligent Systems for Molecular Biology, 4
C. Barrett, R. Hughey, K. Karplus (1997)
Scoring hidden Markov models
Computer applications in the biosciences : CABIOS, 13 2
T. Attwood, M. Beck, D. Flower, P. Scordis, J. Selley (1998)
The PRINTS protein fingerprint database in its fifth year
Nucleic acids research, 26 1
E. Sonnhammer, S. Eddy, R. Durbin (1997)
Pfam: A comprehensive database of protein domain families based on seed alignments
Proteins: Structure, 28
G. Grant (2000)
Bioinformatics - The Machine Learning Approach
Comput. Chem., 24
V. Francesco, J. Garnier, P. Munson (1997)
Protein topology recognition from secondary structure sequences: application of the hidden Markov models to the alpha class proteins.
Journal of molecular biology, 267 2
S. Wheelan, M. Boguski (1998)
Late-night thoughts on the sequence annotation problem.
Genome research, 8 3
S. Altschul, Thomas Madden, A. Schäffer, Jinghui Zhang, Zheng Zhang, W. Miller, D. Lipman (1997)
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic acids research, 25 17
Cathy Wu, Sheng Zhao, Hsi-Lien Chen (1996)
A Protein Class Database Organized with ProSite Protein Groups and PIR Superfamilies
Journal of computational biology : a journal of computational molecular cell biology, 3 4
Kimmen Sjölander, K. Karplus, Michael Brown, R. Hughey, A. Krogh, I. Mian, D. Haussler (1996)
Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology
Computer applications in the biosciences : CABIOS, 12 4
P. Baldi, Y. Chauvin (1996)
Hybrid Modeling, HMM/NN Architectures, and Protein Applications
Neural Computation, 8
C. Chothia (1992)
One thousand families for the molecular biologist
Nature, 357
J. Murvai, A. Gabrielian, P. Fabian, Z. Hátsági, K. Degtyarenko, H. Hegyi, S. Pongor (1993)
The SBASE protein domain library, Release 4.0: a collection of annotated protein sequence segments
Nucleic acids research, 24 1
E. Birney, R. Durbin (1997)
Dynamite: A Flexible Code Generating Language for Dynamic Programming Methods Used in Sequence Comparison
Proceedings. International Conference on Intelligent Systems for Molecular Biology, 5
John Henderson, S. Salzberg, K. Fasman (1996)
Finding Genes in Human DNA with a Hidden Markov Model
A. Godzik, A. Kolinski, A. Kolinski, J. Skolnick (1992)
Topology fingerprint approach to the inverse protein folding problem.
Journal of molecular biology, 227 1
K. Karplus, Kimmen Sjölander, C. Barrett, M. Cline, D. Haussler, R. Hughey, L. Holm, C. Sander (1997)
Predicting protein structure using hidden Markov models
Proteins: Structure, 29
Anders Krogh, Michael Brown, I. Mian, Kimmen Sjölander, David Haussler (1993)
Hidden Markov models in computational biology. Applications to protein modeling.
Journal of molecular biology, 235 5
C. Orengo, David Jones, J. Thornton (1994)
Protein superfamilles and domain superfolds
Nature, 372
M. Gribskov, Andrew MCLACHLANt, D. Eisenberg (1987)
Profile analysis: detection of distantly related proteins.
Proceedings of the National Academy of Sciences of the United States of America, 84 13
Collin Stultz, J. White, Temple Smith (1993)
Structural analysis based on state‐space modeling
Protein Science, 2
W. Grundy, T. Bailey, C. Elkan, M. Baker (1997)
Meta-MEME: motif-based hidden Markov models of protein families
Computer applications in the biosciences : CABIOS, 13 4

Publisher: Oxford University Press
Copyright: © Published by Oxford University Press.
ISSN: 1367-4803
eISSN: 1460-2059
DOI: 10.1093/bioinformatics/14.9.755
Publisher site: See Article on Publisher Site

Abstract

&# %& BIOINFORMATICS REVIEW % - '(*$%* & %*") )!"% *&% %",()"*- !&&# & ""% &** ,%+ * &+") gapped pairwise alignment scores can be calculated analyti- Abstract cally, and the significance of gapped alignment scores can be Summary: The recent literature on profile hidden Markov calculated by simple empirical procedures (Altschul and model (profile HMM) methods and software is reviewed. Gish, 1996; Altschul et al., 1997). In contrast, profile Profile HMMs turn a multiple sequence alignment into a methods have historically used ad hoc scoring systems. position-specific scoring system suitable for searching Some mathematical theory was desirable for the meaning databases for remotely homologous sequences. Profile and derivation of the scores in a model as complex as a pro- HMM analyses complement standard pairwise comparison file (Henikoff, 1996). methods for large-scale sequence analysis. Several software Hidden Markov models (HMMs) now provide a coherent implementations and two large libraries of profile HMMs of theory for profile methods. HMMs are a class of probabilistic common protein domains are available. HMM methods models that are generally applicable to time series or linear performed comparably to threading methods in the CASP2 sequences. HMMs have been most widely applied to recog- structure prediction exercise. nizing words in digitized sequences of the acoustics of Contact: eddy@genetics.wustl.edu human speech (Rabiner, 1989). HMMs were introduced into computational biology in the late 1980s (Churchill, 1989), Introduction and for use as profile models just a few years ago (Krogh et al., 1994a). Proteins, RNAs and other features in genomes can usually be Here, the recent literature on profile HMM methods and classified into families of related sequences and structures related methods for modeling sequence families is reviewed. (Henikoff et al., 1997). Different residues in a functional se- Preference is given to papers appearing in the past 2 years, quence are subject to different selective pressures. Multiple since my last review of the field (Eddy, 1996). There seem alignments of a sequence family reveal this in their pattern to be three principal advances. First, motif-based HMMs of conservation. Some positions are more conserved than have been introduced as an alternative to the original Krogh/ others, and some regions of a multiple alignment seem to Haussler profile HMM architecture (Grundy et al., 1997; tolerate insertions and deletions more than other regions. Neuwald et al., 1997). Second, large libraries of profile Intuitively, it seems desirable to use position-specific in- HMMs and multiple alignments have become available, as formation from multiple alignments when searching data- well as compute servers to search query sequences against bases for homologous sequences. ‘Profile’ methods for these resources (Sonnhammer et al., 1998). Third, there has building position-specific scoring models from multiple been an increasing incursion of profile HMM methods into alignments were introduced for this purpose (Taylor, 1986; the area of protein structure prediction by fold recognition Gribskov et al., 1987; Barton, 1990; Henikoff, 1996). How- (Levitt, 1997). ever, profiles have been less used than pairwise methods like Because of space limitations, some of the background I BLAST (Altschul et al., 1990, 1997) and FASTA (Pearson give is terse. A satisfactory introduction to HMMs and pro- and Lipman, 1988), with the most notable exceptions being babilistic models is beyond the scope of this review. Tutorial the popular BLOCKS database (Henikoff et al., 1998) and introductions to HMMs are available (Rabiner, 1989), in- the skilled use of profiles by a small band of professional cluding introductions that specifically include profile HMM protein domain hunters (Bork and Gibson, 1996). methods (Krogh, 1998). Two recent books describe proba- In part, this is because the residue scoring systems used by bilistic modeling methods for biological sequence analysis in pairwise alignment methods are supported by a significant detail (Baldi and Brunak, 1998; Durbin et al., 1998). body of statistical theory (Altschul and Gish, 1996). The pro- babilistic ‘meaning’ of position-independent pairwise align- Hidden Markov models ment scoring matrices is well understood (Altschul, 1991), allowing powerful scoring matrices to be derived (Henikoff There are now various kinds of profile HMMs and related and Henikoff, 1992). The statistical significance of un- models, all based on HMM theory. It is useful to understand Oxford University Press 755 S.R.Eddy the generality and relative simplicity of HMM theory before considering the special case of profile HMMs. An HMM de- scribes a probability distribution over a potentially infinite number of sequences. Because a probability distribution must sum to one, the ‘scores’ that an HMM assigns to se- quences are constrained. The probability of one sequence cannot be increased without decreasing the probability of one or more other sequences. It is this fundamental constraint of probabilistic modeling (Jaynes, 1998) that allows the parameters in an HMM to have non-trivial optima. An example of a simple HMM that models sequences composed of two letters (a, b) is shown in Figure 1. This toy HMM would be an appropriate model for a problem in which we thought sequences started with one residue composition Fig. 1. A toy HMM, modeling sequences of as and bs as two regions (a-rich, perhaps), then switched once to a different residue of potentially different residue composition. The model is drawn composition (b-rich, perhaps). The HMM consists of two (top) with circles for states and arrows for state transitions. A states connected by state transitions. Each state has a symbol possible state sequence generated from the model is shown, followed emission probability distribution for generating (matching) by a possible symbol sequence. The joint probability P(x,π|HMM) of the symbol sequence and the state sequence is a product of all the a symbol in the alphabet. It is convenient to think of an HMM transition and emission probabilities. Notice that another state as a model that generates sequences. Starting in an initial sequence (1-2-2) could have generated the same symbol sequence, state, we choose a new state with some transition probability though probably with a different total probability. This is the (either staying in state 1 with transition probability t , , or 1 1 distinction between HMMs and a standard Markov model with moving to state 2 with transition probability t , ); then we 1 2 nothing to hide: in an HMM, the state sequence (e.g. the biologically generate a residue with an emission probability specific to meaningful alignment) is not uniquely determined by the observed that state [e.g. choosing an a with p (a)]. We repeat the transi- 1 symbol sequence, but must be inferred probabilistically from it. tion/emission process until we reach an end state. At the end of this process, we have a hidden state sequence that we do not observe, and a symbol sequence that we do observe. The name ‘hidden Markov model’ comes from the fact that ward can also be implemented (Hughey and Krogh, 1996; the state sequence is a first-order Markov chain, but only the Tarnas and Hughey, 1998). symbol sequence is directly observed. The states of the Parameters can be set for an HMM in two ways. An HMM HMM are often associated with meaningful biological la- can be trained from initially unaligned (unlabeled) se- bels, such as ‘structural position 42’. In our toy HMM, for quences. Alternatively, an HMM can be built from pre- instance, states 1 and 2 correspond to a biological notion of aligned (pre-labeled) sequences (i.e. where the state paths are two sequence regions with differing residue composition. In- assumed to be known). In the latter case, the parameter es- ferring the alignment of the observed protein or DNA se- timation problem is simply a matter of converting observed quence to the hidden state sequence is like labeling the se- counts of symbol emissions and state transitions into prob- quence with relevant biological information. abilities. In building a profile HMM, an existing multiple Once an HMM is drawn, regardless of its complexity, the alignment is given as input. In contrast, training a profile same standard dynamic programming algorithms can be HMM is analogous to running a multiple alignment program used for aligning and scoring sequences with the model before building the model, and thus is a harder problem. (Durbin et al., 1998). These algorithms, called Forward (for Training algorithms are of interest because we may not yet scoring) and Viterbi (for alignment), have a worst-case algo- know a plausible alignment for the sequences in question. rithmic complexity of O(NM ) in time and O(NM) in space The standard HMM training algorithms are Baum–Welch for a sequence of length N and an HMM of M states. For expectation maximization or gradient descent algorithms. profile HMMs that have a constant number of state transi- Gibbs sampling, simulated annealing and genetic algorithm tions per state rather than the vector of M transitions per state training methods seem better at avoiding spurious local opti- in fully connected HMMs, both algorithms run in O(NM) ma in training HMMs and HMM-like models (Eddy, 1996; time and O(NM) space—not coincidentally, identical to Neuwald et al., 1997; Durbin et al., 1998). Most training al- other sequence alignment dynamic programming algo- gorithms seek relatively simple maximum likelihood (or rithms. For a modest (constant) penalty in time, very mem- maximum a posteriori) optimization targets. More sophisti- 1.5 ory-efficient O(M) and O(M ) versions of Viterbi and For- cated optimization targets are used to compensate for non- 756 Profile hidden Markov models independence of example sequences (e.g. biased representa- tion) (Eddy, 1996; Bruno, 1996; Durbin et al., 1998; Karchin and Hughey, 1998; Sunyaev et al., 1998), or to maximize the ability of a model to discriminate a set of true positive example sequences from a set of true negative training examples (Mamitsuka, 1996). However, since HMM training algorithms are local optim- izers, it pays to build HMMs on pre-aligned data whenever possible. Especially for complicated HMMs, the parameter space may be complex, with many spurious local optima that can trap a training algorithm. In contrast to parameter estimation, a suitable HMM archi- Fig. 2. A small profile HMM (right) representing a short multiple tecture (the number of states, and how they are connected by alignment of five sequences (left) with three consensus columns. state transitions) must usually be designed by hand. A maxi- The three columns are modeled by three match states (squares mum likelihood architecture construction algorithm exists labeled m1, m2 and m3), each of which has 20 residue emission for the special case of building profile HMMs from multiple probabilities, shown with black bars. Insert states (diamonds labeled alignments (Durbin et al., 1998). Efforts have been made to i0–i3) also have 20 emission probabilities each. Delete states (circles develop architecture learning algorithms for general HMMs labeled d1–d3) are ‘mute’ states that have no emission probabilities. (Yada et al., 1996). One can also train fully connected A begin and end state are included (b,e). State transition probabilities are shown as arrows. HMMs and prune low-probability transitions at the end of training (Mamitsuka, 1996). More or less formal probabilistic models are increasingly state at each column allow for insertion of one or more resi- important in biological analysis, particularly in complicated dues between that column and the next, or for deleting the analysis problems with many model parameters. Because consensus residue. Profile HMMs are strongly linear, left– many problems in computational biology reduce to some right models, unlike the general HMM case. Figure 2 shows sort of linear ‘sequence’ analysis, probabilistic models based a small profile HMM corresponding to a short multiple se- on HMMs have been applied to many problems. Other bio- quence alignment. logical applications of HMMs include gene finding (Krogh The probability parameters in a profile HMM are usually et al., 1994b; Kulp et al., 1996; Burge and Karlin, 1997; Hen- converted to additive log-odds scores before aligning and derson et al., 1997; Krogh, 1997; Lukashin and Borodovsky, scoring a query sequence (Barrett et al., 1997). The scores for 1998), radiation hybrid mapping (Slonim et al., 1997), gen- aligning a residue to a profile match state are therefore com- etic linkage mapping (Kruglyak et al., 1996), phylogenetic parable to the derivation of BLAST or FASTA scores: if the analysis (Felsenstein and Churchill, 1996; Thorne et al., probability of the match state emitting residue x is p , and the 1996) and protein secondary structure prediction (Asai et al., expected background frequency of residue x in the sequence 1993; Goldman et al., 1996). In general, the more a problem database is f , the score for residue x at this match state is log resembles a linear sequence analysis problem—i.e. the less p /f . x x it depends on correlations between ‘observables’ (e.g. resi- For other scores, profile HMM treatment diverges from dues)—the more useful HMM approaches will be. Profile standard sequence alignment scoring. In traditional gapped HMMs and HMM-based gene finders have probably been alignment, an insert of x residues is typically scored with an the most successful applications of HMMs in computational affine gap penalty, a + b(x – 1), where a is the score for the biology. On the other hand, protein secondary structure pre- first residue and b is the score for each subsequent residue in diction is an area in which the state of the art is neural net the insertion. In a profile HMM, for an insertion of length x methods that outperform HMM methods by using extensive there is a state transition into an insert state which costs log local correlation information that is not necessarily easy to t (where t is the state transition probability for moving MI MI model in an HMM (Rost and Sander, 1993). from the match state to the insert state), (x – 1) state transi- tions for each subsequent insert state that cost log t , and a II state transition for leaving the insert state that costs log t . Profile HMMs IM This is akin to the traditional affine gap penalty, with the gap Krogh et al. (1994a) introduced an HMM architecture that open cost as a = log t + log t , and the gap extend cost as MI IM was well suited for representing profiles of multiple se- b = log t . II quence alignments. For each consensus column of the mul- However, in a profile HMM, these gap costs are not arbit- tiple alignment, a ‘match’ state models the distribution of rary numbers. This is an example of why probabilistic mo- residues allowed in the column. An ‘insert’ state and ‘delete’ dels have useful and non-trivial optima. Imagine that we 757 S.R.Eddy were trying to optimize the gap parameters of a model by maximizing the score of the model on a training set of example sequences. In a profile with ad hoc gap costs, we could trivially maximize the scores just by setting all gap costs to zero, but the alignments produced by a profile with no gap penalties would be terrible. In the profile HMM, in contrast, the probability of a transition to an insert is linked to the probability of transition to a match and not inserting; profile HMMs have a cost for the match state to match state transition that has no counterpart in standard alignment. As we lower the gap cost by raising the transition probability t MI towards 1.0, the probability of the match–match transition t falls towards zero, and thus the cost for sequences with- MM out an insertion approaches negative infinity. There is, there- fore, a trade-off point in choosing the state transition prob- abilities where the cost for the sequences that do have an in- sertion is balanced against the cost for the sequences that do not. Additionally, the inserted residues are associated with in- sert state emission probabilities in the HMM. If these emission probabilities are the same as the background amino acid frequency, then the score of inserted residues is log Fig. 3. Different model architectures used in current methods. State transitions are shown as arrows and emission distributions are not f /f = 0. In traditional alignment, inserted residues also have x x represented. Numbered squares indicate ‘match states’. Diamonds no cost besides the affine gap penalty. The profile HMM for- indicate ‘insert states’. Match and insert states each have emission malism forces us to see that this zero cost corresponds to an distributions over 4 or 20 possible nucleic or amino acid symbols. assumption that unconserved insertions in protein structures Circles indicate non-emitting delete states and other special non- have the same residue distribution as proteins in general. emitting states such as begin and end states. From top to bottom: However, the assumption is usually wrong. Insertions tend BLOCKS-style ungapped motifs, represented as an HMM; the to be seen most often in surface loops of protein structures, multiple motif model in META-MEME; the original profile HMM and so have a bias towards hydrophilic residues. Profile of Krogh et al.; and the ‘Plan 7’ architecture of HMMER 2, HMMs can capture this information in the insert state representative of the new generation of profile HMM software in emission distributions. SAM, HMMER and PFTOOLS. Profile HMM software augmented that simple model to deal with multiple domains, Several available software packages implement profile sequence fragments and local alignments, as illustrated by HMMs or HMM-like models (Table 1). One important dif- the HMMER 2.0 ‘Plan 7’ model architecture in Figure 3. ference between these packages is the model architecture Thus, local versus global alignment is not necessarily in- they adopt (Figure 3). The philosophical divide is between trinsic to the algorithm (as is usually thought, for instance, in ‘profile’ models and ‘motif’ models. By ‘profile’ models, I the distinction between the global ‘Needleman/Wunsch’ and mean models with an insert and delete state associated with local ‘Smith/Waterman’ algorithms), but can be dealt with each match state, allowing insertion and deletion anywhere probabilistically as part of the model architecture. Local in a target sequence. By ‘motif’ models, I mean models alignments with respect to the model are allowed by non- dominated by strings of match states (modeling ungapped zero state transition probabilities from a begin state to inter- blocks of sequence consensus) separated by a small number nal match states, and from internal match states to an end of insert states modeling the spaces between ungapped state (dotted lines in Figure 3). Local alignments with respect blocks. to the sequence are allowed by non-zero state transitions on SAM (Hughey, 1996), HMMER (S.R.Eddy, unpublished), the flanking insert states (shaded in the Plan 7 architecture in PFTOOLS (Bucher et al., 1996) and HMMpro (Baldi et al., Figure 3). More than one hit to the HMM per sequence is 1994) implement models based at least in part on the original allowed by a cycle of non-zero transitions through a third profile HMMs of Krogh et al. (1994a). These packages have special insert state. 758 Profile hidden Markov models Table 1. Internet sources for obtaining some of the existing profile HMM the current protein database starting with single randomly and HMM-like software packages selected query sequences, with impressive results (Neuwald et al., 1997). Software URL GENEWISE is a sophisticated ‘framesearch’ application SAM http://www.cse.ucsc.edu/research/compbio/sam.html that can take a HMMER protein model and search it against HMMER http://hmmer.wustl.edu/ EST or genomic DNA, allowing for frameshifts, introns and sequencing errors (Birney and Durbin, 1997). PFTOOLS http://ulrec3.unil.ch:80/profile/ PSI-BLAST (Altschul et al., 1997) is not an HMM ap- HMMpro http://www.netid.com/ plication per se, but it uses some principles of full probabilis- GENEWISE http://www.sanger.ac.uk/Software/Wise2/ tic modeling to build HMM-like models from multiple align- PROBE ftp://ncbi.nlm.nih.gov/pub/neuwald/probe1.0/ ments. Like the use of PROBE (Neuwald et al., 1997), PSI- META-MEME http://www.cse.ucsd.edu/users/bgrundy/metameme.1.0.html BLAST starts from a single query sequence and collects homologous sequences by BLAST search. These homo- BLOCKS http://www.blocks.fhcrc.org/ logues are aligned to the query. An HMM-like search model PSI-BLAST http://www.ncbi.nlm.nih.gov/BLAST/newblast.html is built from the multiple alignment. The model is searched against the database, new homologues are discovered and These profile HMMs are rather general, allowing inser- added to the alignment, and a new model is built. The process tions and deletions anywhere in a sequence relative to the is iterated until no new homologues are discovered. PROBE consensus model. Intuitively, they should be more sensitive and PSI-BLAST both illustrate the power of automating it- than ungapped models. However, in practice, there is a trade- erative profile searches. The remarkable speed of PSI- off between increasing the descriptive power of the model BLAST also demonstrates that the fast BLAST algorithm and the difficulty in determining an increasingly large can be applied to position-specific scoring systems and number of free parameters. A complex model is more prone gapped alignments, and hence to profile HMMs. to overfitting the training data and failing to generalize to With the exception of PSI-BLAST, profile HMM search other sequences. SAM and HMMER use mixture Dirichlet algorithms are computationally demanding. Fast hardware priors on most distributions to help avoid overfitting and to implementations of Gribskov profile searches (Gribskov et limit the effective number of free parameters (Sjolander, al., 1987) are available from several manufacturers, includ- 1996). It is possible to reduce the effective number of free ing Compugen and Time Logic. These systems are currently parameters even further by adopting hybrid HMM/neural being revised to accommodate profile HMMs and the exist- network techniques (Baldi and Chauvin, 1996). Nonethe- ing PROSITE and PFAM HMM libraries. HMM approaches less, this relatively unconstrained freedom to insert and de- are also readily parallelized (Grundy et al., 1996; Hughey, lete anywhere makes these models somewhat difficult to 1996). Even more esoteric speed-ups are also possible. For train from initially unaligned sequences. HMMER and instance, Intel Corporation has made a white paper available PFTOOLS are used primarily to build database search mo- on using MMX assembly instructions to parallelize the Viter- dels from pre-existing alignments, such as those in the Pfam bi algorithm and get about a 2-fold speed increase on Intel and PROSITE Profiles databases (see below). hardware (http://developer.intel.com/drg/mmx/AppNotes/ PROBE (Neuwald et al., 1997), META-MEME (with its AP569.HTM). This could be significant, since some of the brethren MEME and MAST) (Grundy et al., 1997) and WWW-based HMM servers are backed by Intel processor BLOCKS (Henikoff et al., 1998) assume quite different farms running Linux or FreeBSD, such as the ISREC/Prosite ‘motif’ models. In these models, alignments consist of one INSECT farm (Jongeneel et al., 1998). or more ungapped blocks, separated by intervening se- quences that are assumed to be random (Figure 3). The handling of these gaps in BLOCKS is ad hoc. PROBE and Profile HMM libraries META-MEME adopt probabilistic models for the gaps. META-MEME, interestingly, fits its models into HMMER Profile HMM software is well suited for modeling a particular format. The motif models can therefore be viewed as special sequence family of interest and finding additional remote homo- cases of profile HMMs; indeed, HMMER, SAM and logues in a sequence database. Suppose instead that I have a PFTOOLS have various options for creating motif-like mo- query sequence of interest, and I am interested in whether this dels. The strength here is that by limiting the freedom of the sequence contains one or more known domains. This problem model a priori, the HMM training problem is made more arises especially in high-throughput genome sequence analysis, tractable. These approaches can be very powerful for dis- where standard ‘top hit’ BLAST analyses can be confused by covering conserved motifs in initially unaligned sets of se- proteins with several distinct domains. Now I need to search the quences. PROBE, for instance, has been turned loose on a single query sequence against a library of profile HMMs, rather fully automated exercise in identifying domain families in than a single profile HMM against a database of sequences. 759 S.R.Eddy Building a library of profile HMMs in turn requires a large 806 models in Pfam 3.0 recognize ~ 42% (S.R.E. unpublished number of multiple alignments of common protein domains. data). Thus, an ~ 5-fold increase in Pfam database size (175 to 806) resulted in only about a 50% increase in the number of A database of annotated multiple alignments and pre-built sequences recognized with significant scores. On the bright profile HMMs becomes desirable. side, the number of C.elegans sequences annotated by one or Two large collections of annotated profile HMMs are cur- more Pfam models is starting to approach the number that is rently available: the Pfam database (Sonnhammer et al., 1997, hit by one or more informative BLAST similarities to the non- 1998) and the PROSITE Profiles database (Bairoch et al., redundant sequence database (42% compared to ~ 55%). 1997). The PROSITE Profiles database is a supplement to the None of the profile servers is mature. Both profile software widely used PROSITE motifs database; for families that can- and profile databases are rapidly improving and changing. In not be recognized by simple PROSITE motif patterns (regular particular, profile databases typically include domain models expressions which either match a sequence or do not), more that other databases may not yet have. Users are well advised sensitive profile HMMs are developed. Both databases are to search several domain annotation servers. The Interpro col- available via WWW servers, including on-line analysis laboration is expected to be extremely valuable as the various servers for submitting protein sequence queries (Table 2). A database teams begin actively sharing alignment and annota- new European Union funded initiative, called Interpro, has tion data. established a collaboration among several sites interested in effective protein domain annotation, including the Pfam, HMMs for fold recognition PROSITE and PRINTS development teams as well as the SWISS-PROT/TREMBL team. Profile HMMs are sometimes viewed as ‘mere sequence mo- The current pre-release of the PROSITE Profiles database dels’. However, profile scores can be calculated from struc- contains profiles for 290 protein domains, and the current tural data instead of sequences, e.g. ‘3D/1D profiles’ (Bowie Pfam 3.1 release contains 1313 profiles. There is substantial et al., 1991; Luthy et al., 1992). These structural profile ap- overlap between the two collections. It is not meaningful to try proaches can readily be put into a full probabilistic, HMM- to estimate how complete these databases are, because the based framework (Stultz et al., 1993; White et al., 1994). Di number of protein families in nature is unknown and probably Francesco and colleagues have used profile HMMs to model very large. Although there is much discussion of how many secondary structure symbol sequences by modifying the protein families there are—the number 1000 is often cited SAM code to emit an alphabet of protein secondary structure (Chothia, 1992)—such estimates typically make a false as- symbols, training models on known secondary structures, sumption that all families have approximately equal numbers and aligning these models to secondary structure predictions of members (Orengo et al., 1994). However, a small number of new protein sequences (Di Francesco et al., 1997a,b). of families (such as protein kinases, G-protein coupled recep- The pejorative appellation of ‘mere sequence models’ tors and immunoglobulin superfamily domains) account for seems to be applied to HMMs based on a misunderstanding a disproportionate number of sequences. The two databases of the central assumption of position independence in are therefore seeing diminishing returns as models of less HMMs. Obviously, neighboring three-dimensional struc- populous families are developed. For example, the 175 mo- tural contacts influence the types of residue that will be ac- dels in Pfam 1.0 recognize one or more domains in ~ 27% of cepted at any given position in a protein structure. How can predicted proteins from the Caenorhabditis elegans genome HMMs that explicitly assume position independence hope to project, the 527 models in Pfam 2.0 recognize ~ 35% and the be a realistic model of protein structure? Table 2. WWW analysis servers for analyzing protein sequences for known domains Profile HMM libraries: Pfam (Sonnhammer et al., 1998) http://www.sanger.ac.uk/Pfam/ PROSITE profiles (Bairoch et al., 1997) http://ulrec3.unil.ch/software/PFSCAN_form.html HMM-like methods: BLOCKS (Henikoff et al., 1998) http://www.blocks.fhcrc.org/ Other protein domain family classification servers: PRINTS (Attwood et al., 1998) http://www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS/ ProClass (Wu et al., 1996) http://diana.uthct.edu/proclass.html PRODOM (Corpet et al., 1998) http://www.toulouse.inra.fr/prodom.html SBASE (Fabian et al., 1997) http://base.icgeb.trieste.it/sbase/ 760 Profile hidden Markov models The assumption of position independence only means that quence annotation is so difficult that some people almost when an HMM state scores a residue in a sequence, it does seem ready to give up on it (Wheelan and Boguski, 1998). so independently of the rest of that sequence’s alignment. The development of robust methods for automated sequence However, nothing says that the emission probability distribu- classification and annotation is imperative. Our hope in de- tion at that state cannot be determined in the first place from veloping profile HMM methods is that we can provide a sec- complex three-dimensional structural knowledge of the ond tier of solid, sensitive, statistically based analysis tools training set. If I know that a residue is buried by spatially that complement current BLAST and FASTA analyses. The neighboring hydrophobic residues, and this environment is combination of powerful new HMM software and large se- approximately constant among related structures in the pro- quence alignment databases of conserved protein domains tein family, I can build that knowledge into my model. What should help make this hope a reality. HMMs cannot deal with efficiently are long-distance cor- relations between residues, as is seen in RNA structural Acknowledgements alignments, where the complementarity of a pair of distant sequence positions is more important than the identity of Work on profile HMMs and Pfam in my laboratory is sup- either position by itself (Durbin et al., 1998). (Short-distance ported by NIH/NHGRI R01 HG01363, Monsanto and Eli correlation can be built into HMMs without much difficulty; Lilly. I thank D.States for pointing out the Intel paper on for example, gene-finding HMMs typically model the prob- MMX Viterbi implementations; K.Karplus, R.Hughey and ability of coding hexamers instead of probabilities of single A.Neuwald for providing pre-publication results; and residues.) C.Eddy, S.Johnson, my research group and three anonymous Many current fold recognition methods are not cast as reviewers for their useful criticism of the manuscript. I also HMMs, but instead as sequence/structure ‘threading’ algo- thank the many people in the HMM community with whom rithms with relatively ad hoc scores. However, any threading I have discussed these issues, especially A.Krogh, P.Bucher, scoring system for which a dynamic programming algorithm A.Neuwald, B.Grundy, G.Mitchison, the other members of can be used to find optimal sequence/structure alignments can the Pfam consortium (the R.Durbin and E.Sonnhammer be recast as a full probabilistic HMM. This includes ‘frozen groups), and the remarkable UC Santa Cruz HMM group. approximation’ methods (Godzik et al., 1992), for instance. The fold recognition section of the CASP (Current Asses- sment of Structure Prediction) exercise (Moult et al., 1997) References is one of the most interesting anecdotal benchmarks of how Altschul,S.F. (1991) Amino acid substitution matrices from an HMM techniques perform. In CASP, the sequences of pro- information theoretic perspective. J. Mol. Biol., 219, 555–565. tein ‘prediction targets’ whose structures are soon to be Altschul,S.F. and Gish,W. (1996) Local alignment statistics. Methods solved by crystallography or NMR are made available to Enzymol., 266, 460–480. computational structure prediction groups. After the struc- Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) tures become available, the success of the fold predictions is Basic local alignment search tool. J. Mol. Biol., 215, 403–410. evaluated. Ranking the performance of different methods in Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., CASP is difficult and somewhat subjective (Levitt, 1997). Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI- Also, there is usually a variable and sometimes substantial BLAST: A new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402. degree of expert human interpretation added to the auto- Asai,K., Hayamizu,S. and Handa,K. (1993) Prediction of protein mated methods (Murzin and Bateman, 1997). Nonetheless, secondary structure by the hidden Markov model. Comput. Applic. CASP has been a lively venue to explore the strengths and Biosci., 9, 141–146. weaknesses of fold recognition methods. At CASP2 last Attwood,T.K., Beck,M.E., Flower,D.R., Scordis,P. and Selley,J.N. year, HMM-based methods were among the techniques used (1998) The PRINTS protein fingerprint database in its fifth year. by several of the most successful prediction groups (Di Fran- Nucleic Acids Res., 26, 304–308. cesco et al., 1997; Karplus et al., 1997; Levitt, 1997; Murzin Bairoch,A., Bucher,P. and Hofmann,K. (1997) The PROSITE data- and Bateman, 1997). Indeed, Murzin and Bateman (1997) base, its status in 1997. Nucleic Acids Res., 25, 217–221. correctly predicted the folds of all six proteins they at- Baldi,P. and Brunak,S. (1998) Bioinformatics: The Machine Learning tempted, using a combination of profile HMMs, secondary Approach. MIT Press, Boston. structure prediction and expert knowledge. Baldi,P. and Chauvin,Y. (1996) Hybrid modeling, HMM/NN architec- tures and protein applications. Neural Comput., 8, 1541–1565. Baldi,P., Chauvin,Y., Hunkapiller,T. and McClure,M.A. (1994) Conclusion Hidden Markov models of biological primary sequence informa- tion. Proc. Natl Acad. Sci. USA, 91, 1059–1063. The human genome project threatens to overwhelm us in a Barrett,C., Hughey,R. and Karplus,K. (1997) Scoring hidden Markov deluge of raw sequence data. Successful large-scale se- models. Comput. Applic. Biosci., 13, 191–199. 761 S.R.Eddy Barton,G.J. (1990) Protein multiple sequence alignment and flexible Grundy,W.N., Bailey,T.L., Elkan,C.P. and Baker,M.E. (1997) Meta- pattern matching. Methods Enzymol., 183, 403–427. MEME: Motif-based hidden Markov models of protein families. Birney,E. and Durbin,R. (1997) Dynamite: A flexible code generating Comput. Applic. Biosci., 13, 397–406. language for dynamic programming methods used in sequence Henderson,J., Salzberg,S. and Fasman,K. (1997) Finding genes in comparison. In Proceedings of the Fifth International Conference human DNA with a hidden Markov model. J. Comput. Biol., 4, on Intelligent Systems in Molecular Biology, 5, 56–64. AAAI Press, 127–141. Menlo Park. Henikoff,S. (1996) Scores for sequence searches and alignments. Curr. Bork,P. and Gibson,T.J. (1996) Applying motif and profile searches. Opin. Struct. Biol., 6, 353–360. Methods Enzymol., 266, 162–184. Henikoff,S. and Henikoff,J.G. (1992) Amino acid substitution ma- Bowie,J.U., Luthy,R. and Eisenberg,D. (1991) A method to identify trices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 10915–10919. protein sequences that fold into a known three-dimensional structure. Science, 253, 164–170. Henikoff,S., Greene,E.A., Pietrokovski,S., Bork,P., Attwood,T.K. and Bruno,W.J. (1996) Modeling residue usage in aligned protein se- Hood,L. (1997) Gene families: The taxonomy of protein paralogs quences via maximum likelihood. Mol. Biol. Evol., 13, 1368–1374. and chimeras. Science, 278, 609–614. Bucher,P., Karplus,K., Moeri,N. and Hofmann,K. (1996) A flexible Henikoff,S., Pietrokovski,S. and Henikoff,J.G. (1998) Superior per- motif search technique based on generalized profiles. Comput. formance in protein homology detection with the Blocks database Chem., 20, 3–23. servers. Nucleic Acids Res., 26, 309–312. Burge,C. and Karlin,S. (1997) Prediction of complete gene structures Hughey,R. (1996) Parallel hardware for sequence comparison and in human genomic DNA. J. Mol. Biol., 268, 78–94. alignment. Comput. Applic. Biosci., 12, 473–479. Chothia,C. (1992) One thousand families for the molecular biologist. Hughey,R. and Krogh,A. (1996) Hidden Markov models for sequence Nature, 357, 543–544. analysis: Extension and analysis of the basic method. Comput. Churchill,G.A. (1989) Stochastic models for heterogeneous DNA Applic. Biosci., 12, 95–107. sequences. Bull. Math. Biol., 51, 79–94. Jaynes,E.T. (1998) Probability Theory: The Logic of Science. Corpet,F., Gouzy,J. and Kahn,D. (1998) The ProDom database of Available from http://bayes.wustl.edu. protein domain families. Nucleic Acids Res., 26, 323–326. Jongeneel,V., Junier,T., Iseli,C., Hofmann,K. and Bucher,P. (1998) Di Francesco,V., Garnier,J. and Munson,P.J. (1997a) Protein topology INSECT and MOLLUSCS—supercomputing on the cheap. Avail- recognition from secondary structure sequences: Application of the able from http:// cmpteam4.unil.ch/biocomputing/mollusc/ IN- SECT_and_MOLLUSCS.html. hidden Markov models to the alpha class proteins. J. Mol. Biol., 267, 446–463. Karchin,R. and Hughey,R. (1998) Weighting hidden Markov models Di Francesco,V., Geetha,V., Garnier,J. and Munson,P.J. (1997b) Fold for maximum discrimination. Bioinformatics, in press. recognition using predicted secondary structure sequences and Karplus,K., Sjolander,K., Barrett,C., Cline,M., Haussler,D., hidden Markov models of protein folds. Proteins, 1(Suppl.), Hughey,R., Holm,L. and Sander,C. (1997) Predicting protein 123–128. structure using hidden Markov models. Proteins, 1(Suppl.), Durbin,R., Eddy,S.R., Krogh,A. and Mitchison,G.J. (1998) Biological 134–139. Sequence Analysis: Probabilistic Models of Proteins and Nucleic Krogh,A. (1997) Two methods for improving performance of an Acids. Cambridge University Press, Cambridge, UK. HMM and their application for gene finding. In Proceedings of the Eddy,S.R. (1996) Hidden Markov models. Curr. Opin. Struct. Biol., 6, Fifth International Conference on Intelligent Systems in Molecular 361–365. Biology, 5, 179–186. AAAI Press, Menlo Park. Fabian,P., Murvai,J., Vlahovicek,K., Hegyi,H. and Pongor,S. (1997) Krogh,A. (1998) An introduction to hidden Markov models for The SBASE protein domain library, release 5.0: A collection of biological sequences. In Salzberg,S., Searls,D. and Kasif,S. (eds), annotated protein sequence segments. Nucleic Acids Res., 25, Computational Methods in Molecular Biology. Elsevier, New York. 240–243. pp. 45–63. Felsenstein,J. and Churchill,G. (1996) A hidden Markov model Krogh,A., Brown,M., Mian,I.S., Sjolander,K. and Haussler,D. (1994a) approach to variation among sites in rate of evolution. Mol. Biol. Hidden Markov models in computational biology: Applications to Evol., 13, 93–104. protein modeling. J. Mol. Biol., 235, 1501–1531. Godzik,A., Kolinski,A. and Skolnick,J. (1992) Topology fingerprint Krogh,A., Mian,I.S. and Haussler,D. (1994b) A hidden Markov model approach to the inverse protein folding problem. J. Mol. Biol., 227, that finds genes in E.coli DNA. Nucleic Acids Res., 22, 4768–4778. 227–238. Kruglyak,L., Daly,M.J., Reeve-Daly,M.P. and Lander,E.S. (1996) Goldman,N., Thorne,J.L. and Jones,D.T. (1996) Using evolutionary Parametric and nonparametric linkage analysis: A unified multi- trees in protein secondary structure prediction and other compara- point approach. Am. J. Hum. Genet., 58, 1347–1363. tive sequence analyses. J. Mol. Biol., 263, 196–208. Kulp,D., Haussler,D., Reese,M.G. and Eeckman,F.H. (1996) A Gribskov,M., McLachlan,A.D. and Eisenberg,D. (1987) Profile analy- generalized hidden Markov model for the recognition of human sis: Detection of distantly related proteins. Proc. Natl Acad. Sci. genes in DNA. In Proceedings of the Fourth International USA, 84, 4355–4358. Conference on Intelligent Systems in Molecular Biology, 4, Grundy,W.N., Bailey,T.L. and Elkan,C.P. (1996) ParaMEME: A 134–141. AAAI Press, Menlo Park. parallel implementation and a web interface for a DNA and protein Levitt,M. (1997) Competitive assessment of protein fold recognition motif discovery tool. Comput. Applic. Biosci., 12, 303–310. and alignment accuracy. Proteins, 1(Suppl.), 92–104. 762 Profile hidden Markov models Lukashin,A.V. and Borodovsky,M. (1998) GeneMark.hmm: New Sonnhammer,E.L., Eddy,S.R. and Durbin,R. (1997) Pfam: A com- solutions for gene finding. Nucleic Acids Res., 26, 1107–1115. prehensive database of protein families based on seed alignments. Luthy,R., Bowie,J.U. and Eisenberg,D. (1992) Assessment of protein Proteins, 28, 405–420. models with three-dimensional profiles. Nature, 356, 83–85. Sonnhammer,E.L.L., Eddy,S.R., Birney,E., Bateman,A. and Durbin,R. Mamitsuka,H. (1996) A learning method of hidden Markov models for (1998) Pfam: Multiple sequence alignments and HMM-profiles of sequence discrimination. J. Comput. Biol., 3, 361–373. protein domains. Nucleic Acids Res., 26, 320–322. Moult,J., Hubbard,T., Bryant,S.H., Fidelis,K. and Pedersen,J.T. (1997) Stultz,C.M., White,J.V. and Smith,T.F. (1993) Structural analysis Critical assessment of methods of protein structure prediction based on state-space modeling. Protein Sci., 2, 305–314. (CASP): Round II. Proteins, 1(Suppl.), 2–6. Sunyaev,S.R., Rodchenkov,I.V., Eisenhaber,F. and Kuznetsov,E.N. Murzin,A.G. and Bateman,A. (1997) Distant homology recognition (1998) Analysis of the position dependent amino acid probabilities using structural classification of proteins. Proteins, 1(Suppl.), and its application to the search for remote homologues. In 105–112. RECOMB ’98, pp. 258–265. Neuwald,A.F., Liu,J.S., Lipman,D.J. and Lawrence,C.E. (1997) Ex- Tarnas,C. and Hughey,R. (1998) Reduced space hidden Markov model tracting protein alignment models from the sequence database. training. Bioinformatics, in press. Nucleic Acids Res., 25, 1665–1677. Taylor,W.R. (1986) Identification of protein sequence homology by Orengo,C., Jones,D.T. and Thornton,J.M. (1994) Protein superfamilies consensus template alignment. J. Mol. Biol., 188, 233–258. and domain superfolds. Nature, 372, 631–634. Thorne,J.L., Goldman,N. and Jones,D.T. (1996) Combining protein Pearson,W. and Lipman,D. (1988) Improved tools for biological evolution and secondary structure. Mol. Biol. Evol., 13, 666–673. sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448. Wheelan,S.J. and Boguski,M.S. (1998) Late-night thoughts on the Rabiner,L.R. (1989) A tutorial on hidden Markov models and selected sequence annotation problem. Genome Res., 8, 168–169. applications in speech recognition. Proc. IEEE, 77, 257–286. White,J.V., Stultz,C.M. and Smith,T.F. (1994) Protein classification by Rost,B. and Sander,C. (1993) Prediction of protein secondary structure stochastic modeling and optimal filtering of amino-acid sequences. at better than 70% accuracy. J. Mol. Biol., 232, 584–599. Math. Biosci., 119, 35–75. Slonim,D., Kruglyak,L., Stein,L. and Lander,E. (1997) Building Wu,C.H., Zhao,S. and Chen,H.L. (1996) A protein class database human genome maps with radiation hybrids. J. Comput. Biol., 4, organized with ProSite, protein groups and PIR, superfamilies. J. 487–504. Comput. Biol., 3, 547–561. Sjölander,K., Karplus,K., Brown,M., Hughey,R., Krogh,A., Mian,I.S. Yada,T., Ishikawa,M., Tanaka,H. and Asai,K. (1996) Extraction of and Haussler,D. (1996) Dirichlet mixtures: A method for improving hidden Markov model representations of signal patterns in DNA detection of weak but significant protein sequence homology. sequences. Pac. Symp. Biocomput., World Scientific, Singapore, pp. Comput. Applic. Biosci., 12, 327–345. 686–696.

Journal

Bioinformatics – Oxford University Press

Published: Jan 1, 1998

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Profile hidden Markov models.

Profile hidden Markov models.

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Profile hidden Markov models.

Profile hidden Markov models.

References (74)

Abstract

Journal

Recommended Articles

There are no references for this article.

Our policy towards the use of cookies