H. Berman, K. Henrick, Haruki Nakamura, J. Markley (2006)
The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic Acids Research, 35
Robert Miller (1968)
Response time in man-computer conversational transactions. Proceedings of the December 9-11, 1968, fall joint computer conference, part I
Z. Dosztányi, V. Csizmok, P. Tompa, I. Simon (2005)
IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics, 21(16)
Cathy Wu, A. Nikolskaya, Hongzhan Huang, L. Yeh, D. Natale, C. Vinayaka, Zhang-Zhi Hu, R. Mazumder, Sandeep Kumar, P. Kourtesis, R. Ledley, Baris Suzek, L. Arminski, Yongxing Chen, Jian Zhang, Jorge Cardenas, Sehee Chung, Jorge Castro-Alvear, Georgi Dinkov, W. Barker (2004)
PIRSF: family classification system at the Protein Information Resource. Nucleic Acids Research, 32 (Database issue)
S. Altschul (1991)
Amino acid substitution matrices from an information theoretic perspective. Journal of Molecular Biology, 219
Matt Oates, Jonathan Stahlhacke, Dimitrios Vavoulis, B. Smithers, O. Rackham, Adam Sardar, J. Zaucha, Natalie Thurlby, Hai Fang, J. Gough (2014)
The SUPERFAMILY 1.75 database in 2014: a doubling of data. Nucleic Acids Research, 43
S. Henikoff, J. Henikoff (1992)
Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the United States of America, 89(22)
David Wheeler, D. Church, Ron Edgar, S. Federhen, W. Helmberg, Thomas Madden, J. Pontius, G. Schuler, L. Schriml, Edwin Sequeira, Tugba Suzek, T. Tatusova, L. Wagner (2004)
Database resources of the National Center for Biotechnology Information: update. Nucleic Acids Research, 32
F. Nah (2004)
A study on tolerable waiting time: how long are Web users willing to wait? Behaviour & Information Technology, 23
P. Rose, A. Prlić, Chunxiao Bi, Wolfgang Bluhm, Cole Christie, Shuchismita Dutta, Rachel Green, D. Goodsell, J. Westbrook, Jesse Woo, Jasmine Young, C. Zardecki, H. Berman, P. Bourne, S. Burley (2014)
The RCSB Protein Data Bank: views of structural biology for basic and applied research and education. Nucleic Acids Research, 43
S. Card, G. Robertson, J. Mackinlay (1991)
The information visualizer, an information workspace. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
Chuming Chen, D. Natale, R. Finn, Hongzhan Huang, Jian Zhang, Cathy Wu, R. Mazumder (2011)
Representative Proteomes: A Stable, Scalable and Unbiased Proteome Set for Sequence Analysis and Functional Annotation. PLoS ONE, 6
The Consortium (2014)
UniProt: a hub for protein information. Nucleic Acids Research, 43
Z. Dosztányi, V. Csizmok, P. Tompa, I. Simon (2005)
The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins. Journal of Molecular Biology, 347(4)
S. Eddy (2011)
Accelerated Profile HMM Searches. PLoS Computational Biology, 7
S. Altschul, Thomas Madden, A. Schäffer, Jinghui Zhang, Zheng Zhang, W. Miller, D. Lipman (1997)
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17)
I. Sillitoe, Tony Lewis, Alison Cuff, Sayoni Das, P. Ashford, N. Dawson, Nicholas Furnham, R. Laskowski, David Lee, J. Lees, Sonja Lehtinen, R. Studer, J. Thornton, C. Orengo (2014)
CATH: comprehensive structural and functional annotations for genome sequences. Nucleic Acids Research, 43
(2015)
Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 43
S. Henikoff, J. Henikoff (2000)
Amino acid substitution matrices. Advances in Protein Chemistry, 54
(2014)
PDBe: Protein Data Bank in Europe. Nucleic Acids Research, 42
S. Velankar, Younes Alhroub, Anaëlle Alili, C. Best, H. Boutselakis, Ségolène Caboche, M. Conroy, J. Dana, G. Ginkel, A. Golovin, S. Gore, A. Gutmanas, Pauline Haslam, M. Hirshberg, M. John, Ingvar Lagerstedt, Saqib Mir, L. Newman, T. Oldfield, C. Penkett, Jorge Pineda-Castillo, Luana Rinaldi, Gaurav Sahni, G. Sawka, Sanchayita Sen, Robert Slowley, Alan Silva, A. Suarez-Uruena, Jawahar Swaminathan, M. Symmons, W. Vranken, Michael Wainwright, G. Kleywegt (2010)
PDBe: Protein Data Bank in Europe. Nucleic Acids Research, 39
D. Haft, J. Selengut, R. Richter, D. Harkins, M. Basu, Erin Beck (2012)
TIGRFAMs and Genome Properties in 2013. Nucleic Acids Research, 41
L. Käll, A. Krogh, E. Sonnhammer (2004)
A combined transmembrane topology and signal peptide prediction method. Journal of Molecular Biology, 338(5)
A. Lupas, M. Dyke, J. Stock (1991)
Predicting coiled coils from protein sequences. Science, 252
S. Eddy (1998)
Profile hidden Markov models. Bioinformatics, 14(9)
R. Finn, J. Clements, S. Eddy (2011)
HMMER web server: interactive sequence similarity searching. Nucleic Acids Research, 39
I. Aradaib, Mohamed Mohamed, M. Abdalla, A. Karrar, A. Majid, R. Omer, S. Elamin, Mohammed Salih, S. Idris (2003)
Molecular Biology Laboratory
A. Bateman, M. Martin, C. O’Donovan, M. Magrane, R. Apweiler, E. Alpi, R. Antunes, J. Arganiska, B. Bely, M. Bingley, C. Bonilla, R. Britto, Borisas Bursteinas, G. Chavali, Elena Cibrián-Uhalte, A. Silva, M. Giorgi, Tunca Dogan, F. Fazzini, P. Gane, Lg Castro, Penelope Garmiri, E. Hatton-Ellis, R. Hieta, R. Huntley, D. Legge, W. Liu, J. Luo, Alistair MacDougall, P. Mutowo, Andrew Nightingale, S. Orchard, K. Pichler, D. Poggioli, S. Pundir, L. Pureza, G. Qi, S. Rosanoff, Rabie Saidi, T. Sawford, A. Shypitsyna, E. Turner, Volynkin, T. Wardell, X. Watkins, H. Zellner, A. Cowley, L. Figueira, Weizhong Li, Hamish McWilliam, R. Lopez, I. Xenarios, L. Bougueleret, A. Bridge, S. Poux, Nicole Redaschi, L. Aimo, Ghislaine Argoud-Puy, A. Auchincloss, K. Axelsen, Parit Bansal, Delphine Baratin, M. Blatter, B. Boeckmann, Jerven Bolleman, E. Boutet, L. Breuza, C. Casal-Casas, E. Castro, E. Coudert, Béatrice Cuche, M. Doche, D. Dornevil, S. Duvaud, A. Estreicher, L. Famiglietti, M. Feuermann, E. Gasteiger, S. Gehant, Gerritsen, A. Gos, N. Gruaz-Gumowski, U. Hinz, C. Hulo, F. Jungo, G. Keller, Lara, P. Lemercier, D. Lieberherr, T. Lombardot, X. Martin, P. Masson, A. Morgat, T. Neto, N. Nouspikel, S. Paesano, I. Pedruzzi, S. Pilbout, Monica Pozzato, Manuela Pruess, C. Rivoire, B. Roechert, Michel Schneider, Christian Sigrist, K. Sonesson, S. Staehli, A. Stutz, S. Sundaram, M. Tognolli, L. Verbregue, A. Veuthey, Cathy Wu, C. Arighi, L. Arminski, Chuming Chen, Youhai Chen, J. Garavelli, Hongzhan Huang, K. Laiho, P. McGarvey, D. Natale, Baris Suzek, C. Vinayaka, Q. Wang, Y. Wang, L. Yeh, Yerramalla, J. Zhang (2015)
UniProt: A hub for protein information
Anders Krogh, Michael Brown, I. Mian, Kimmen Sjölander, David Haussler (1993)
Hidden Markov models in computational biology. Applications to protein modeling. Journal of Molecular Biology, 235(5)
Robert Finn, Jaina Mistry, John Tate, Penny Coggill, A. Heger, Joanne Pollington, O. Gavin, Prasad Gunasekaran, Goran Ceric, Kristoffer Forslund, Liisa Holm, Erik Sonnhammer, Sean Eddy, Alex Bateman (2007)
The Pfam protein families database. Nucleic Acids Research, 38
(2014)
Pfam: the protein families database. Nucleic Acids Research, 42
R. Finn, S. Griffiths-Jones, A. Bateman (2003)
Identifying Protein Domains with the Pfam Database. Current Protocols in Bioinformatics, 1
Steve Johnson, Sean Eddy, Elon Portugaly (2010)
Hidden Markov model speed heuristic and iterative HMM search procedure. BMC Bioinformatics, 11
Nucleic Acids Research, 2015, Vol. 43, Web Server issue, W30–W38
Published online 05 May 2015
doi: 10.1093/nar/gkv397

Robert D. Finn [1,2,*], Jody Clements [2], William Arndt [2], Benjamin L. Miller [2], Travis J. Wheeler [2,3], Fabian Schreiber [1], Alex Bateman [1] and Sean R. Eddy [2]

[1] European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
[2] HHMI Janelia Research Campus, 19700 Helix Drive, Ashburn, VA 20147, USA
[3] Department of Computer Science, University of Montana, Social Sciences Building Room 412, Missoula MT 59812, USA
[*] To whom correspondence should be addressed. Tel: +44 1223 494 481; Fax: +44 1223 494 468; Email: [email protected]

Received February 12, 2015; Revised April 10, 2015; Accepted April 15, 2015

© The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

The HMMER website, available at http://www.ebi.ac.uk/Tools/hmmer/, provides access to the protein homology search algorithms found in the HMMER software suite. Since the first release of the website in 2011, the search repertoire has been expanded to include the iterative search algorithm, jackhmmer. The continued growth of the target sequence databases means that traditional tabular representations of significant sequence hits can be overwhelming to the user. Consequently, additional ways of presenting homology search results have been developed, allowing them to be summarised according to taxonomic distribution or domain architecture. The taxonomy and domain architecture representations can be used in combination to filter the results according to the needs of a user. Searches can also be restricted prior to submission using a new taxonomic filter, which not only ensures that the results are specific to the requested taxonomic group, but also improves search performance. The repertoire of profile hidden Markov model libraries, which are used for annotation of query sequences with protein families and domains, has been expanded to include the libraries from CATH-Gene3D, PIRSF, Superfamily and TIGRFAMs. Finally, we discuss the relocation of the HMMER webserver to the European Bioinformatics Institute and the potential impact that this will have.

INTRODUCTION

Homology searches are widely used within molecular biology, facilitating the transfer of annotation from a functionally characterised sequence or region to a corresponding region in another sequence. When searching against sequence databases, the HMMER software uses profile hidden Markov models (HMMs) to represent the query (1,2), which can take the form of a single protein sequence or a multiple sequence alignment. In the case of a multiple sequence alignment, the observed amino acid frequencies in each column are converted to position-specific probabilities, with per-position probabilities for both insertions and deletions determined from the input alignment (1,2). For single sequence searches, a profile HMM is constructed from the sequence using position-independent affine gap open and extension probabilities (defaults: 0.02 and 0.4) and emission probabilities obtained from the inferred probabilistic basis of a standard substitution matrix (default: BLOSUM62) (3,4).

In early 2011, the functionality of the website hosting the HMMER software (http://hmmer.org) was expanded to allow online searches of protein sequences against either a protein sequence database or an HMM library (5). This search service took advantage not only of the speed improvements of the HMMER3 software (6), but also of hardware, the latest approaches to website design and other technical implementations (e.g. in-memory databases and use of NoSQL). The combination of these four aspects allowed the searching of sequences against large sequence databases such as UniProtKB at near interactive speeds. Websites with minimal loading times (<10 s) are recognised for not interrupting the user's train of thought, and hence increase user productivity (7–9).

Since the initial release, the popularity of online HMMER searches has grown, with millions of sequence searches performed per year (averaging over 5200 searches per day or one search every 6 s, search statistics for http://hmmer.org only). These searches are split between requests coming from a browser (20% of searches) and requests made via programmatic access using the RESTful application program interface (API, 80% of searches). For example, RCSB PDB (10) uses the API to annotate newly deposited structures with Pfam annotations.

The initial release of the website provided the following three search algorithms:

phmmer: single protein sequence against protein sequence database
hmmscan: single protein sequence against profile HMM library (Pfam)
hmmsearch: either multiple sequence alignment or profile HMM against protein sequence database

In this article, we describe the recent developments of the website, which include iterative searches with jackhmmer, an expanded repertoire of HMM libraries and a variety of result visualisations that allow rapid interrogation and interpretation of the results. Such developments are set against a constant background of target database growth, which for the large sequence databases results in a constant increase in the dynamic range of hits that will be returned from a query. We demonstrate different ways that searches can be limited and how the results can be progressively dissected according to taxonomy and/or domain architecture (the order of the domain(s) on a sequence).

HMMER WEBSERVER DEVELOPMENTS

The HMMER webserver has been upgraded to the latest version of the HMMER software, version 3.1b2. This software version includes minor bug fixes, but more importantly has additional performance improvements over the previous version, version 3.0. While this has made the searches faster, the underlying sequence databases have grown substantially. UniProtKB (11) currently contains 91 408 504 sequences (release 2015_02), compared with 13 593 921 in January 2011 (release 2011_01), an increase of 570%. One of the major speed optimisations utilised by the site requires the caching of the sequence database in memory. Due to exponential growth of the databases, it is no longer possible to support both the NCBI non-redundant protein sequence database and UniProtKB. Future software developments will ultimately allow the sharding of the databases to allow future scalability, but this has yet to be implemented within the HMMER daemon (hmmpgmd). Consequently, the website now focuses primarily on UniProtKB (11), as it represents the world's pre-eminent protein database, with sequences annotated either by expert curation or by the application of expert curated rules for automated annotation. While searches remain fast, returning in a matter of a few seconds, subsets of UniProtKB have been included to provide either the highest quality annotations (UniProtKB/Swiss-Prot), representative sets that give good coverage of sequence space while reducing the number of potential matches (UniProt Reference Proteomes (11) and Representative Proteome sets (12)), or a set for curation purposes (pfamseq, Pfam's underlying sequence database). While these subsets do not change the amount of data cached in memory, the smaller target databases increase search performance and make results more manageable. We also include sequences from known structures that have been deposited in the Protein Data Bank (13).

The growth of UniProtKB has been substantial, but it is important to remember that increasing fractions of the new sequences are either identical or nearly identical (>95% identity) to a sequence that already exists in the database (11). Given the nature of this sequence database growth, it is unsurprising that the number of sequences that a query may match from a homology search has equally grown. Thus, many of the developments of the website have focused on trying to improve results visualisation, from summaries to alternative representations to filtering.

Expanded results visualisations

User experience testing and web usage statistics indicate that it is very difficult to predict what a user is trying to achieve from a homology search. A user's purpose for the search can range from functional annotation to establishing taxonomy distribution to understanding residue conservation in a collection of aligned sequences. The developments described in the following sections have all been designed to enhance access and understanding of the results, whilst catering to a wide range of use cases or enquiries.

Sequence matches and features. When performing a phmmer search, the single sequence query is also automatically searched against the Pfam (14) profile HMM library using hmmscan, to identify the presence of any Pfam families on the query sequence. The result of the Pfam search is displayed at the top of the page, within the 'Sequence Matches and Features' section. The graphical representation of Pfam domains remains unchanged: for Pfam entries that are considered type 'domain' or 'family', a rectangular shape with curved ends represents a full-length match position graphically. However, when a hit does not match the first and/or last state of a profile HMM, a jagged end represents the N-terminal and/or C-terminal end of the match. While it is informative to know that a match is not full-length, this does not tell the user how incomplete the match is compared to the profile HMM. As HMMER uses a local-local match strategy (a match can be anywhere in the sequence or the HMM), partial matches are common. For all Pfam matches, information on the completeness of the match is now provided in the tool-tip (revealed by placing the mouse cursor over the graphical representation of the domain), where the profile HMM is represented by a black bar and the region matched in the profile HMM is indicated by an overlaid coloured rectangle. As shown in Figure 1A, this gives an immediate impression of the length of the match between the sequence and the profile HMM, even when it is not full length.

The original 'Sequence Matches and Features' view has been advanced further to include other types of annotation and to provide a summary of the phmmer search results (Figure 1B, C). The protein sequence is also now analysed for the presence of other features: disordered regions using IUPred (15,16), signal peptides and transmembrane regions using Phobius (17) and coiled-coil regions (18). When a sequence contains one or more matches against one of these three algorithms, a graphical representation showing the positional information from each algorithm is dynamically inserted under the Pfam domain graphic. If a sequence does not contain any matches, a graphic is not displayed. However, the successful execution of the different feature algorithms is shown below, within the bottom border of the bounding box (green check mark on success, red x mark on failure). Figure 1 shows an example of the interplay between these annotation tools. In this example, there is a large region of sequence between the N-terminal MacB_PCD (Pfam accession: PF12704) and the FtsX domains (Pfam accession: PF02687) that is not currently represented by a Pfam domain. Inspection of these other sequence features indicates that this is a region that is expected to be largely disordered and contains three coiled-coil motifs. While neither feature type precludes a Pfam entry from being present, such features are typically less tractable as they are often poorly conserved at the amino acid level.

Upon completion of the associated phmmer search, all of the matches are aligned and used to display two additional tracks in the 'Sequence Matches and Features' section, the hit coverage and the hit similarity. Both tracks use a heat map style to represent the two hit metrics. The hit coverage indicates the regions of the query sequence that have been matched by the sequences in the target database. As matches can be anywhere between the query and target, the presence of a ubiquitous domain or motif in the query can result in many sequences matching the query that may be overall functionally distinct yet share the common homologous domain. Figure 1B shows an example of how the hit distribution varies across the sequence, from pale yellow (little coverage) to red where there is more coverage. The second track below shows the relative sequence similarity of the sequences aligned at each position. This track clearly indicates that the sequence similarity fluctuates across the sequence, but patches of high similarity can be identified that align with the transmembrane regions in the C-terminus of the sequence. A more detailed view of the information contained in both tracks can be obtained by clicking on the icon to their right. This reveals a graph that plots the relative hit similarity, relative hit identity and the percentage occupancy of the column in the alignment (match positions). Moving the mouse cursor over the graph reveals a moving line, which allows the position of the graph/alignment to be more readily determined. Overall, these additional developments allow a rapid understanding of the domains, sequence features and conservation profile of the hits found in the phmmer search.

Figure 1. Results of searching the Efflux ABC transporter permease protein from Enterococcus casseliflavus ATCC 12755 (UniProtKB accession F0EMD7) against the Reference Proteome database using phmmer with default search options. (A) The tool tip associated with the partial C-terminal MacB_PCD domain match. The model match line indicates the region of the HMM to which the sequence has been aligned (alignment region). While the match is incomplete, in this particular case, >90% of the model positions have been matched. (B) The Pfam matches on the query and other sequence features. The hit coverage and similarity are shown in a condensed heat map style view below the sequence features. These can be expanded using the red icon to their right. (C) The hit similarity and coverage graph, summarising the phmmer matches.

Viewing results in different formats

When performing any search against a sequence database (i.e. phmmer, hmmsearch or jackhmmer), the default result view is a paginated tabular scores output, with matching target sequences ranked according to bit score (high to low), which corresponds to an ordering by expectation value (E-value, ordered low to high, most significant to least). The histogram above the results table, termed 'Hit graph', summarises the distribution of hits according to both E-value and taxonomy. The x-axis of the histogram is divided into 30 E-value bins, ordered from least to most significant, with the total height of each column (or bin) proportional to the number of hits that fall within the bin. Each column in the histogram is further subdivided according to the major taxonomic group, with the size of the bar proportional to the number of hits in that group. Clicking on a column in the histogram takes the user to the row in the results table of the most significant hit in the bin represented by that column.

The results table is now customisable to allow the user to include a range of additional data fields that provide information on the hit sequences and nature of the match. Figure 2 shows an example of the 'Score' results table, where three additional columns (taxonomic classification of the organism to which each matched sequence belongs, number of significant hits, and a graphical display of the position of the hit(s) between the query and target) have been added using the 'Customize' button found in the header of the table. The highlighted example in Figure 2 illustrates the two hit regions between the query and target, demonstrating that there has been a re-arrangement of the hit regions in the query sequence compared to the target sequence.

Figure 2. Example of the expanded results table, showing the kingdom and species, number of significant hits, and the hit positions between the query and the target sequences after searching the UniProtKB sequence accession P00519 (amino acids 57 to 218) against the UniProtKB reference proteomes sequences (2014_10 release). The customise button in the top-right of the table header can be used to switch on different columns in that table (row count, secondary accessions, description, species, kingdom, known structure, number of identical sequences, number of hits, number of significant hits, bit score and graphical representation of the hit position). An expanded view of the hit position graphic is shown below the table. The enlarged view indicates where the two regions of similarity, or hits, in the query sequence match the target sequence. Each distinct hit of the query sequence is shown as a coloured box, and the corresponding aligned region is represented by a box of the same colour. The two sequences in each row are drawn proportionally to each other, with the sequence represented as a grey line. The two sequences are drawn left-justified (i.e. unaligned), with the query sequence always shown above the target. In this particular case, the order of the hits is reversed between the query and target sequences. A similar representation is used for queries with a profile HMM, with the top image (the query) representing the length of the profile HMM. The hit graphic quickly allows the identification of sequence rearrangements and repeated regions (where a hit/coloured box in the query is duplicated multiple times in the target sequence).

While the score view is a more typical way of viewing results, we have developed two alternative ways of visualising the results: (1) by taxonomy and (2) by domain. These views apply to the results of phmmer, hmmsearch and jackhmmer searches.

Taxonomy view

The 'Taxonomy' view (Figure 3A) shows the taxonomic distribution of matches according to a species tree. The species tree is derived from the NCBI taxonomy (19) and drawn from left (root) to right. More often than not, the taxonomic distribution of the matches is too broad to display the entire taxonomic tree. By default, the tree is shown with the top four taxonomic levels found, but the user can click on the tree to focus on a specific lineage, allowing them to browse the most relevant clades and organisms while temporarily hiding other parts of the tree (Figure 3A). Each node in the displayed tree corresponds to a taxonomic level and shows a sparkline version of the 'hit graph' for that level to indicate the number of hits and their E-value distribution for that particular taxonomic level. The arrow(s) on the right side of the tree indicate the number of species that match below that level. Clicking on any of the nodes of the tree will re-focus the tree, such that this node appears on the left side of the tree representation and children nodes (names) are shown to the right (as many as exist, or are permitted by available space). Once the root of the taxonomic tree (All) is no longer visible, a breadcrumb trail of the viewed branch back to the root of the taxonomic tree is displayed above the tree. Either the breadcrumb trail or the back arrow on the left can be used to move back up the taxonomic tree (Figure 3A). In addition to changing the graphical tree, re-focusing the tree to different nodes causes the species listed in the table below the tree to be updated to show just the species and the number of hits for the taxa found below the visible root.

Domain architecture view

A typical homology search against a large sequence database will return hundreds to thousands of hits. An alternative to clustering hits by taxonomy is to cluster them by the domain architecture of the hit sequences (the ordered collection of domain(s) found across the entire sequence). The 'Domain' view lists all of the unique domain architectures found in the set of matched sequences, with the domain architectures defined according to Pfam domains. All hit sequences containing exactly the same domain architecture are grouped into a single row of the table. The number of sequences containing a given architecture is indicated on the left, and the scores of just this set of sequences can be viewed using the link on the right of the row. The table is ordered according to the frequency with which each domain architecture occurs in the result set. If the query is a single sequence, then the row containing the same architecture as the query is highlighted (Figure 3B). Typically, this view of the results can represent over 75% of all the results in the first page, providing a rapid understanding of domain diversity of the matched sequences.

Figure 3. Two different results views from searching the human S-adenosylmethionine synthase sequence (UniProtKB accession Q00266) against UniProtKB (2014_10 release). (A) The taxonomic distribution of the archaeal homologs in the results. Below each taxonomic name is a sparkline version of the hit graphic showing the hit distribution of all sequences belonging to that taxonomic clade. The numbers in brackets denote the number of sequences matched, while the numbers in the right-hand arrows indicate the number of species. (B) The same results as in (A), but grouped according to domain architecture. In this example, 20 799 out of the 21 695 matched sequences have the same domain architecture as the query (as indicated by the yellow background). The remaining domain architectures appear to be subsets of the dominant domain architecture, arising from sequence fragments found in the database.

Filtering search results using different views

With the number of sequences deposited in the UniProtKB databases growing at unprecedented speed, an average query sequence might return an overwhelming number of hits. Classically, hits have been presented as raw tables, but in doing this it can be hard to find the most informative matches buried deep in the results. While taxonomy and domain architecture provide alternative views of the data, ordering results by E-value remains an important way to prioritise matches. Consequently, the results interface has been developed so that both the domain architecture and taxonomy views can be used to filter the results, e.g. by selecting hits belonging to a taxonomic clade and subsequently filtering the subset by domain architecture, or vice versa. For example, using the default example query for a phmmer search, a user may wish to identify all Caenorhabditis sequences that contain the domain architecture 'SH3_1', 'SH2' followed by 'Pkinase_Tyr'. To do this they would select 'View scores' for this architecture, then select the taxonomy view, navigate to Caenorhabditis and select 'Show scores for all', which reveals the 6 matching sequences from the 6170 matches in a few simple clicks (Figure 4).

Figure 4. An example of filtered search results using both domain architecture and taxonomic filters (described in the text). The box above the table shows the filtering steps, first restricting by the domain architecture 'SH3_1 SH2 Pkinase_Tyr' then by a taxonomy filter. The user can click the filter labels in the breadcrumb string ('All Results') in the filter section to reverse any of the steps to the right, or all filters can be cancelled by clicking the cancel button.

Taxonomy-restricted searches

While the previous section describes filtering of results once they have been calculated, an alternative way of restricting the results is to reduce the initial search space. While alternative target databases offer one such mechanism, another approach (which can be used in combination with any of the sequence databases) is to restrict the search to sets of sequences belonging to one or more taxonomic clades using the 'Restrict by taxonomy' option on the search submission page. This can be performed by either entering valid taxonomic levels (species, phylum) or checking taxonomic levels in a representative taxonomy tree provided on the website. Note that when entering different taxonomic terms, the look-up tool is aware of the taxonomic tree. For example, if a user wants to search all sequences from Chordates except human, they would not want to have to select species individually. To enable the rapid construction of such queries, the user would first enter 'Chordata' followed by 'Homo sapiens'. As the first term has already selected 'Homo sapiens' (as it is part of 'Chordata'), the query builder assumes that the user wants to remove 'Homo sapiens' from the set of sequences to search. As taxonomic terms are added to the query, the interpretation of the terms by the query builder is displayed below the input field. Results from taxonomically restricted searches will be presented as described above, with the same score, taxonomy and domain architecture views, while also clearly indicating that the search space has been restricted. It is important to note that the E-values are calculated as if the entire target sequence database had been searched. Using such restrictions on the search improves search speeds, as well as improving result visualisation by focusing matches on the desired taxonomic range.

Multiple HMM databases

The hmmscan algorithm takes a single protein sequence and searches it against a profile HMM library. The first profile HMM database to be incorporated into the site was the Pfam library. This initial display has been expanded to provide the disorder, coiled-coil, signal peptide and transmembrane annotations described earlier in the article. Furthermore, the HMMER3 based protein family databases CATH-Gene3D (20), PIRSF (21), Superfamily (22) and TIGRFAMs (23) have been incorporated into the hmmscan search as alternative target HMM databases. While Pfam and TIGRFAMs use the domain boundaries assigned by HMMER directly, CATH-Gene3D, PIRSF and Superfamily employ alternative post-processing methods for domain assignments. This is primarily because a family/domain may be represented by more than one profile HMM, or may have to meet additional criteria specified by the database (e.g. length), and in the case of structural domains, the domain may not be contiguous on a protein sequence. Consequently, the standard hmmscan thresholds are disabled for these three databases and the significance thresholds/criteria provided by each database are applied. The E-value or bit score threshold can still be defined for either Pfam or TIGRFAMs.

Unlike in the selection of target sequence databases, it is now possible for more than one profile HMM database to be selected, allowing the different protein family assignments to be compared in a single search submission. As each annotation returns, it is inserted into the results page and shown both graphically and as a table where appropriate. If no matches are found for a particular protein family database, this will be indicated in the list of tables below the graphical summary.

Iterative searching

The initial release of the HMMER website included the search algorithms phmmer, hmmsearch and hmmscan. The HMMER software package also includes jackhmmer, which on the command line allows a single sequence to be searched iteratively against a target sequence database, similar to PSI-BLAST (24) functionality. Iterative sequence searching is often able to identify similarities to functionally characterised proteins that are not detected with single sequence searching (25), as the residue conservation from a set of related sequences is used to determine position-specific amino acid, insert and delete probabilities. This iterative search functionality has now been implemented in the HMMER website, but, unlike the command line version of jackhmmer, which only accepts a single protein sequence as a query, the website implementation allows jackhmmer to be initiated with a single sequence, a profile HMM or a multiple sequence alignment against a target sequence database (as with phmmer). When starting with a single sequence, the first round of jackhmmer is equivalent to phmmer; otherwise it is equivalent to hmmsearch, with the first results page reflecting the nature of the search method. In the case of a single sequence, the result page is shown with the 'Sequence Matches and Features' information. A further notable difference between the command line version of jackhmmer and the website implementation is that the website allows the user to interact with the results from one search iteration before starting the next, by either including or excluding sequences (Figure 5A, B, C).

by a green background in the cell containing the sequence accession. Sequences that have been dropped will be below threshold and be indicated by a pink background in the sequence accession cell.

When running jackhmmer searches interactively, the user can keep iterating the search until the search converges (i.e. no new hits are found and no hits are either dropped or lost) or they deem that no further iterations are necessary. It is also possible to replicate the command line functionality
Under the menu for the different of jackhmmer where multiple iterations are performed se- result visualisations (Figure 5B), an ‘iteration count’ box in- quentially without intervention, by using the batch search dicates the current iteration and contains links that allow option under the ‘advanced’ options on the search submis- the user to either jump to the hit at the inclusion threshold sion page. In this mode, all hits scoring above the inclusion of the search, or show the results summary of the differ- threshold will be used in the subsequent iteration. Similarly ent iterations (2 or more iterations) and start the next iter- to the command line version, the user can choose to iter- ation. In the scores table, the user can use the check boxes ate automatically for up to 5 iterations (or until conver- (located in the right most column, by default all hits above gence). As with the other searches against a target sequence the threshold are preselected) to either include or exclude database, the sequence search space can be restricted ac- sequences from the search results; a checked box indicates cording to taxonomy, and this restriction will be applied in that a sequence will be included in the next round, even if it each successive iteration. However, if the results have been currently falls below threshold from the preceding search. filtered according to domain architecture and /or taxonomy, When a sequence is removed that is currently scoring above all significantly scoring sequences are used in the subse- threshold, the row will be shown in grey (Figure 5C). quent rounds. After the first iteration, regardless of the initial input, The interactive iterative searching is analogous to the the HMMER web server builds a profile HMM from the approach adopted during Pfam curation (26) that is per- selected sequences and searches it against the sequence formed using command-line tools. The inclusion of jackhm- database, equivalent to an hmmsearch search. 
Rather than mer in the HMMER site and the provision of pfamseq as immediately going to the results page, a jackhmmer sum- a target database now provides a parallel platform to the mary table is presented to the user, comparing the results of Pfam curation system. As this system is public, it will po- each iteration to the previous round (Figure 5A). The table tentially enable Pfam curation to be distributed within the lists the iteration, links to the results and lists the number of scientific community. new sequences found, the number lost, the number dropped and the total number of sequence matched in that round. Sequences that have been ‘dropped’ are those that are still DISCUSSION found in the results, but fall below the inclusion thresh- This article described the recent developments to the HM- old where they had previously been included (taking into MER website search interface since the first release was account any user intervention). Some sequences that were published. Notably, all of the HMMER command-line pro- once significant can completely disappear from the results tein search algorithms now have an equivalent that is acces- file, and these are considered ‘lost’. After clicking through sible via the web. In addition to the new algorithm and ad- to the results, the iteration summary box includes a link to ditional target databases, substantial effort has been made list all ‘lost’ sequences, if appropriate. It will also include a to provide different results visualisation, which can also be link that allows the user to ‘Jump to the first new match’. used to filter the results according to taxonomy and /or do- The presence of a new match is indicated in the result table main architecture. The ability to select subdivisions of the Nucleic Acids Research, 2015, Vol. 43, Web Server issue W37 Figure 5. Examples of the jackhmmer user interface. (A) This shows the summary table of a jackhmmer search that has been iterated to convergence. 
Each iteration is compared to the previous stage and shows the number of new sequences found compared to the previous iteration, the number of sequences lost (see text for details), the number of sequences that were dropped and the total number of sequences. The results job identifier in the second column provides a link through to the results table for that iteration. At the top of the results page for a specific iteration, there is an ‘iteration’ box ( B). This provides information about the iteration and a series of links to navigate to the summary page, or previous or next iteration results, to re-run iterations or to navigate the results. If any sequences have been lost, a link to a table listing those sequences is provided. (C) Shows the results on either side of the inclusion threshold (red horizontal line). The rows containing sequence accessions with a green background indicate new sequences that were not previously above threshold. The row containing a sequence accession with a pink background is a sequence that is no longer significant, but was in the previous iteration, i.e. dropp ed. The grey rows indicate the sequences that have been manually de-selected by the user and will not be used in the subsequent iteration. target database (either according to predefined groups such the European Bioinformatics Institute (EMBL-EBI, http: as reference proteomes or by using taxonomic restrictions) //www.ebi.ac.uk/Tools/hmmer). While the responsibility of is a complementary approach to achieving the same goal, the algorithm development will remain in the US, the search the improved navigability of results. The HMMER devel- infrastructure and website development group have transi- opment team remains committed to improving both search tioned to the UK. We anticipate that the infrastructure run- strategies and the presentation of results that scale well with ning at Janelia Research Campus will be decommissioned the ever-increasing target sequence databases. 
However, the during 2015. Heavy users of the API are encouraged to up- user interface has now reached a certain degree of stability, date their software to connect to the EMBL-EBI site. The and what started a feasibility pilot project has now turned two sites will continue to inter-operate seamlessly, with the into a widely used informatics resource. HMMER source code and binaries being made available At the time of writing, the HMMER website is running via hmmer.org, and the search functionality provided by the at both Janelia Research Campus (http://hmmer.org)and EMBL-EBI site. Both sites will adopt a common branding W38 Nucleic Acids Research, 2015, Vol. 43, Web Server issue that is now displayed at the UK site, giving a uniform look 10. Rose,P.W., Prlic,A., ´ Bi,C., Bluhm,W.F., Christie,C.H., Dutta,S., Green,R.K., Goodsell,D.S., Westbrook,J.D., Woo,J. et al. (2015) The as a user switches from one site to another. RCSB Protein Data Bank: views of structural biology for basic and While any change to the organisation of web services applied research and education. Nucleic Acids Res., 43, D345–D356. can be irksome, there are many advantages to locating the 11. UniProt Consortium. (2015) UniProt: a hub for protein information. web based HMMER homology searches at EMBL-EBI. Nucleic Acids Res., 43, D204–D212. 12. Chen,C., Natale,D.A., Finn,R.D., Huang,H., Zhang,J., Wu,C.H. and Primarily, EMBL-EBI has the infrastructure to sustain Mazumder,R. (2011) Representative proteomes: a stable, scalable and the user-base growth, while maintaining scalability of the unbiased proteome set for sequence analysis and functional searches against the background of ever growing sequence annotation. PLoS One, 6, e18910. databases. Being co-located with the source of many of 13. Berman,H., Henrick,K., Nakamura,H. and Markley,J.L. (2007) The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform the target databases (UniProtKB, PDBe (27), Pfam) brings archive of PDB data. 
Nucleic Acids Res., 35, D301–D303. many benefits, as updates to the HMMER target databases 14. Finn,R.D., Bateman,A., Clements,J., Coggill,P., Eberhardt,R.Y., will be more closely synchronised with new source database Eddy,S.R., Heger,A., Hetherington,K., Holm,L., Mistry,J. et al. releases. Also, as the homology search system becomes es- (2014) Pfam: the protein families database. Nucleic Acids Res., 42, tablished at the EMBL-EBI, there will be better cross link- D222–D230. 15. Dosztan ´ yi,Z., Csizmok,V., Tompa,P. and Simon,I. (2005) IUPred: ing to relevant databases at the EMBL-EBI and use of the web server for the prediction of intrinsically unstructured regions of search infrastructure by EMBL-EBI resources. proteins based on estimated energy content. Bioinformatics, 21, 3433–3434. ACKNOWLEDGEMENTS 16. Dosztan ´ yi,Z., Csizmok,V., Tompa,P. and Simon,I. (2005) The pairwise energy content estimated from amino acid composition We thank the EMBL-EBI’s Technical Service teams and discriminates between folded and intrinsically unstructured proteins. Janelia’s High Performance Computing group for their on- J. Mol. Biol., 347, 827–839. 17. Kall,L., ¨ Krogh,A. and Sonnhammer,E.L.L. (2004) A combined going computational support. transmembrane topology and signal peptide prediction method. J. Mol. Biol., 338, 1027–1036. FUNDING 18. Lupas,A., Van Dyke,M. and Stock,J. (1991) Predicting coiled coils from protein sequences. Science, 252, 1162–1164. European Molecular Biology Laboratory, European Bioin- 19. NCBI Resource Coordinators. (2015) Database resources of the formatics Institute (EMBL-EBI) [to R.D.F., F.S. and A.B.]; National Center for Biotechnology Information. Nucleic Acids Res., Howard Hughes Medical Institute [to R.D.F, J.C., W.A., 43, D6–D17. 20. Sillitoe,I., Lewis,T.E., Cuff,A., Das,S., Ashford,P., Dawson,N.L., B.L.M., T.J.W. and S.R.E.]. Funding for open access charge: Furnham,N., Laskowski,R.A., Lee,D., Lees,J.G. et al. 
(2015) CATH: European Bioinformatics Institute (EMBL-EBI). comprehensive structural and functional annotations for genome Conflict of interest statement. None declared. sequences. Nucleic Acids Res., 43, D376–D381. 21. Wu,C.H., Nikolskaya,A., Huang,H., Yeh,L.-S.L., Natale,D.A., Vinayaka,C.R., Hu,Z.-Z., Mazumder,R., Kumar,S., Kourtesis,P. REFERENCES et al. (2004) PIRSF: family classification system at the Protein 1. Krogh,A., Brown,M., Mian,I.S., Sjolander,K. ¨ and Haussler,D. Information Resource. Nucleic Acids Res., 32, D112–D114. (1994) Hidden Markov models in computational biology: 22. Oates,M.E., Stahlhacke,J., Vavoulis,D.V., Smithers,B., Applications to protein modeling. J. Mol. Biol., 235, 1501–1531. Rackham,O.J.L., Sardar,A.J., Zaucha,J., Thurlby,N., Fang,H. and 2. Eddy,S.R. (1998) Profile hidden Markov models. Bioinformatics, 14, Gough,J. (2015) The SUPERFAMILY 1.75 database in 2014: a 755–763. doubling of data. Nucleic Acids Res., 43, D227–D33. 3. Altschul,S.F. (1991) Amino acid substitution matrices from an 23. Haft,D.H., Selengut,J.D., Richter,R.A., Harkins,D., Basu,M.K. and information theoretic perspective. J. Mol. Biol., 219, 555–565. Beck,E. (2013) TIGRFAMs and Genome Properties in 2013. Nucleic 4. Henikoff,S. and Henikoff,J.G. (1992) Amino acid substitution Acids Res., 41, D387–D395. matrices from protein blocks. Proc. Natl. Acad. Sci. U.S.A., 89, 24. Altschul,S.F., Madden,T.L., Schaf ¨ fer,A.A., Zhang,J., Zhang,Z., 10915–10919. Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: 5. Finn,R.D., Clements,J. and Eddy,S.R. (2011) HMMER web server: a new generation of protein database search programs. Nucleic Acids interactive sequence similarity searching. Nucleic Acids Res., 39, Res., 25, 3389–3402. W29–W37. 25. Johnson,L.S., Eddy,S.R. and Portugaly,E. (2010) Hidden Markov 6. Eddy,S.R. (2011) Accelerated profile HMM searches. PLoS Comput. model speed heuristic and iterative HMM search procedure. BMC Biol., 7, e1002195. Bioinformatics, 11, 431. 
7. Miller,R.B. (1968) Response time in man-computer conversational 26. Coggill,P., Finn,R.D. and Bateman,A. (2008) Identifying protein transactions. AFIPS ’68 (Fall, part I) Proceedings of the December domains with the Pfam database. Curr. Protoc. Bioinform., Chapter 2, 9-11, 1968, fall joint computer conference. I, pp. 267–277. Unit 2.5. 8. Card,S.K., Robertson,G.G. and Mackinlay,J.D. (1991) The 27. Gutmanas,A., Alhroub,Y., Battle,G.M., Berrisford,J.M., Bochet,E., information visualizer, an information workspace. ProceedingCHI ’91 Conroy,M.J., Dana,J.M., Fernandez Montecelo,M.A., van Proceedings of the SIGCHI Conference on Human Factors in Ginkel,G., Gore,S.P. et al. (2014) PDBe: Protein Data Bank in Computing Systems. pp. 181–186. Europe. Nucleic Acids Res., 42, D285–D291. 9. Nah,F. (2004) A study on tolerable waiting time: how long are Web users willing to wait? Behaviour & Information Technology 23 , 153–163.
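As a supplementary illustration of the jackhmmer summary table described under ‘Iterative searching’, the bookkeeping of ‘new’, ‘dropped’ and ‘lost’ sequences and the convergence test can be sketched as follows. This is a minimal sketch rather than the web server's code: the function and class names are hypothetical, a single bit-score cut-off stands in for the inclusion threshold, and the ‘total’ is assumed to count every hit reported in the round.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IterationSummary:
    """One row of a jackhmmer-style summary table (hypothetical structure)."""
    new: frozenset      # above threshold now, not included in the previous round
    dropped: frozenset  # still reported, but fell below the inclusion threshold
    lost: frozenset     # previously included, no longer reported at all
    total: int          # assumption: all hits reported in this round

    @property
    def converged(self) -> bool:
        # Convergence as described in the text: no new hits, none dropped or lost.
        return not (self.new or self.dropped or self.lost)


def summarise_iteration(previous_included, current_scores, threshold):
    """Compare one search round with the sequences included in the previous one.

    previous_included: set of accessions included in the previous round
    current_scores:    dict mapping accession -> bit score for every reported hit
    threshold:         inclusion threshold as a bit score (a simplification)
    """
    previous_included = set(previous_included)
    reported = set(current_scores)
    above = {acc for acc, score in current_scores.items() if score >= threshold}
    return IterationSummary(
        new=frozenset(above - previous_included),
        dropped=frozenset((previous_included & reported) - above),
        lost=frozenset(previous_included - reported),
        total=len(reported),
    )


if __name__ == "__main__":
    # Toy accessions and scores, invented purely for illustration.
    round_2 = summarise_iteration({"A", "B", "C"},
                                  {"A": 50.0, "B": 10.0, "D": 40.0}, 25.0)
    print(sorted(round_2.new), sorted(round_2.dropped), sorted(round_2.lost))
    round_3 = summarise_iteration({"A", "D"},
                                  {"A": 50.0, "B": 10.0, "D": 40.0}, 25.0)
    print(round_3.converged)
```

In the interactive mode described in the text, the set passed as `previous_included` would also reflect any sequences the user had manually included or excluded before starting the next iteration.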
Nucleic Acids Research – Oxford University Press
Published: Jul 1, 2015