Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Infrastructure for the life sciences: design and implementation of the UniProt website

Infrastructure for the life sciences: design and implementation of the UniProt website Background: The UniProt consortium was formed in 2002 by groups from the Swiss Institute of Bioinformatics (SIB), the European Bioinformatics Institute (EBI) and the Protein Information Resource (PIR) at Georgetown University, and soon afterwards the website http://www.uniprot.org was set up as a central entry point to UniProt resources. Requests to this address were redirected to one of the three organisations' websites. While these sites shared a set of static pages with general information about UniProt, their pages for searching and viewing data were different. To provide users with a consistent view and to cut the cost of maintaining three separate sites, the consortium decided to develop a common website for UniProt. Following several years of intense development and a year of public beta testing, the http://www.uniprot.org domain was switched to the newly developed site described in this paper in July 2008. Description: The UniProt consortium is the main provider of protein sequence and annotation data for much of the life sciences community. The http://www.uniprot.org website is the primary access point to this data and to documentation and basic tools for the data. These tools include full text and field-based text search, similarity search, multiple sequence alignment, batch retrieval and database identifier mapping. This paper discusses the design and implementation of the new website, which was released in July 2008, and shows how it improves data access for users with different levels of experience, as well as to machines for programmatic access. http://www.uniprot.org/ is open for both academic and commercial use. The site was built with open source tools and libraries. Feedback is very welcome and should be sent to [email protected]. Conclusion: The new UniProt website makes accessing and understanding UniProt easier than ever. The two main lessons learned are that getting the basics right for such a data provider website has huge benefits, but is not trivial and easy to underestimate, and that there is no substitute for using empirical data throughout the development process to decide on what is and what is not working for your users. Page 1 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 However, a careful review of the archived help desk ques- Background The UniProt consortium[1] was formed in 2002 by groups tions and web server request logs, collected over several from the Swiss Institute of Bioinformatics (SIB), the Euro- years from the existing sites, revealed the following: pean Bioinformatics Institute (EBI) and the Protein Infor- mation Resource (PIR) at Georgetown University, and � The majority of the queries consisted of nothing soon afterwards the website http://www.uniprot.org was more than a protein or gene name, sometimes com- set up as a central entry point to UniProt resources. bined with an organism name. Some of these queries Requests to this address were redirected to one of the three did not yield useful results, because of the lack of a organisations' websites (http://www.expasy.uniprot.org, good scoring algorithm (e.g. searching for "human http://www.ebi.uniprot.org and http://www.pir.uni insulin" could require scrolling through hundreds of prot.org). While these sites shared a set of static pages with results before finding the most relevant entries, such as general information about UniProt, their pages for search- INS_HUMAN). ing and viewing data were different: The SIB was redirect- ing such requests to the ExPASy website, where some of � Some queries yielded no results because people mis- the data and tools had been available since 1993, while spelled terms or did not use the same conventions as the EBI and PIR both developed their own sites for Uni- UniProt (e.g. American vs English spelling, Roman vs Prot, with a similar appearance, but different code and Arabic numbers in protein names, dashes vs separated functionality. Though the redirection was done according words) or chose the wrong field in an "advanced" to the geographic location of the client, it happened occa- search form, etc. Some of this was documented, but sionally that users were confronted with a site that looked the documentation was not accessed much. and worked differently from the one they were used to. To provide users with a consistent view and to cut the cost of � The majority of requests came from web crawlers maintaining three separate sites, the consortium decided and other automated applications (many of which to develop a common website for UniProt. Following sev- made valid use of our data). Referrals from search eral years of intense development and a year of public engines made up a substantial part of the visits, there- beta testing, the http://www.uniprot.org domain was fore we did not want to block web crawlers either, yet switched to the newly developed site described in this this was putting quite a bit of a load on our servers. paper in July 2008. Ensuring that these issues would be resolved by the new Requirements site, along with all the basic requirements, was therefore The essential functionality that the website (like its prede- made a priority [2]. cessors) had to provide was: Construction, content and utility � Retrieval of individual database entries by identifier. What data is available on the site? The UniProt web site provides access to the data sets pre- � Retrieval of sets of entries based on simple search cri- sented in Table 1. teria such as organism, keyword or free text matches. How is the site structured? The pattern for URL templates shown in Table 2 is used � Display of data in a human readable manner. not only for the main data sets, but also for the various � Download of data in all official formats. "ontologies", for documentation and even running or completed jobs. � Basic tools for identifier mapping, sequence align- ments and similarity searches. There are no special search pages. The search function and other tools can be accessed directly through a tool bar that � Access to documentation and controlled vocabular- appears at the top of every page. Depending on the current ies. context, some of the tool forms are pre-filled. For exam- ple, when viewing a UniProtKB entry, the sequence search An additional wish was that each consortium member form is pre-filled with the sequence of the entry, and the should be able to host a mirror of the website without too alignment form is pre-filled with all alternative products much effort, and that the technology on which the web- of the entry, if any. site was to be built should be familiar enough to allow all consortium members to contribute to the development. How to get people started? Beyond that there was no shortage of ideas for bells and Important information is often overlooked on home whistles, such as data mining and visualization tools. pages with a lot of content. The new UniProt home page Page 2 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 Table 1: Overview of the UniProt data sets Data set Description References Entries Path Formats UniProtKB Protein sequence and UniRef, UniParc, Literature 6.4 M /uniprot/ Plain text, FASTA, (GFF), annotation data citations, Taxonomy, Keywords XML, RDF UniRef Clusters of proteins with UniProtKB, UniParc, Taxonomy 12.3 M /uniref/ FASTA, XML, RDF similar sequences UniParc Protein sequence archive UniProtKB, Taxonomy 17.0 M /uniparc/ FASTA, XML, RDF Literature citations Literature cited in UniProtKB 0.4 M /citations/ RDF (based on PubMed) Taxonomy Taxonomy data 0.5 M /taxonomy/ RDF, (Tab-delimited) (based on NCBI taxonomy) Keywords Keywords used in UniProtKB 1K /keywords/ RDF, (OBO) Subcellular locations Subcellular location terms used 375 /locations/ RDF, (OBO) in UniProtKB (see Figure 1) features a prominent tools bar, that is � How often a search term occurs in an entry (without present on every page and serves as a basic site map with normalizing by document size, as this would benefit links to common entry points. The site contains a lot of poorly annotated documents). small, useful features that are documented in the on-line help; however, people in general appear to be reluctant to � What fields in an entry a term occurs in (e.g. matches invest a lot of time into reading documentation. To in a protein name are more relevant than in the title of address this issue, we recorded a "site tour" [3] that is a referenced publication). accessible from the home page. � Whether an entry has been reviewed (reviewed How to get the text search function right? entries are more likely to contain correct and relevant The text search function is the most used feature on the information). website. Considerable effort was therefore invested into making all common and less common searches not only � How comprehensively annotated an entry is (all else possible, but also simple and convenient to use for people being equal, we want to have a bias towards well- without a detailed understanding of UniProt data. One of annotated entries). the most obvious problems with the old sites had been the lack of good relevance scoring of search results. Scor- The exact scoring scheme differs for each data set and ing is essential for queries that are meant to locate specific requires ongoing fine-tuning. entries, but that contain terms that appear in a large number of entries (e.g. the "human insulin" example In order to allow people to see quickly the entries with e.g. quoted above). The main factors that influence the score the longest or shortest sequences, or to page through the of an entry for a given query on the new website are: results one organism at a time, certain fields were made sortable. This turned out to be not trivial to implement as the underlying search engine library had no support for sorting results efficiently on anything but their score. Table 2: URL templates Template Description Example http://www.uniprot.org/{dataset}/ Overview page for a data set, may contain a http://www.uniprot.org/uniprot/ description of the data set along with various entry points, or just list all database items (equivalent to searching for *). http://www.uniprot.org/{dataset}/ Filters the data set with the specific query. Other http://www.uniprot.org/uniprot/?query=green ?query={query} parameters are "offset" (index of first result), "limit" (number of results to return), "format" (e.g. "tab" for tab-delimited or "rdf") and "compress" ("yes" to gzip results when downloading). http://www.uniprot.org/{dataset}/{id} Displays a specific database entry. http://www.uniprot.org/uniprot/P00750 http://www.uniprot.org/{dataset}/{id}.{format} Returns a database entry in the specified format. http://www.uniprot.org/uniprot/P00750.rdf Page 3 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 H Figure 1 ome page at http://www.uniprot.org/ Home page at http://www.uniprot.org/. Page 4 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 Therefore, special sort indexes are now built when the � "Did you mean" spelling suggestions (if there are no data is loaded, at the cost of slowing down incremental or few results and the index contains a similar word). updates. Figure 2 shows the result of a query in Uni- ProtKB, sorted by length descending. � Restrict a term to field (listing only fields in which a term occurs). The traditional approach of having two separate forms for "basic" and "advanced" queries has several issues: Based � Quote terms (if they frequently appear together in on our observations, few people start out with the inten- the index). tion of using an "advanced" form. Even if they have a good understanding of the data and search function, they � Filter out unreviewed or obsolete entries (if the often first try to obtain the results through a simple search results contain such entries). form, as these are quicker to fill in. If the basic search does not yield the expected result, or too many results, the � Replace a field with a more stringent field (if this query has to be redone in an advanced search form where helps reduce the number of results). further constraints can be applied. � Restrict the range of values in a field (if results are Another problem with "advanced" search forms is that being sorted on this field). they often do not take into account that most nontrivial queries appear to be built iteratively: People start out with � ...and others, depending on the context. one or two terms, and then add or modify (e.g. constrain) terms until they see the desired results or give up. If mul- This approach allows people to move seamlessly from a tiple complex constraints are specified at once and the basic to an advanced query without prior knowledge of query produces no results, it can be time-consuming to the fields used to store data in UniProt. Clicking on a sug- figure out which (if any) of the constraints was used incor- gestion requires less mouse clicks than selecting a field in rectly. We therefore opted for a "fail early" approach: A an "advanced" search form. It is also more effective, simple full text search is the fastest and most effective way because only the fields in which a term occurs are listed – to determine whether or not a term even appears in a data- such a filtering is difficult to accomplish with traditional base, as you can skip the step of scrolling through and "advanced" search forms. selecting a field, and then having to wonder if the term might have appeared in another field. Figure 4 shows suggestions for a simple query, http:// www.uniprot.org/uniprot/?query=insulin. For these reasons, we opted for a single search form(see Figure 3). People start by searching for one or two terms. Each step in the query building process updates the query The results page shows the matches for these terms and, string and is reflected in the URL, so it can be book- for people who are not familiar with our search fields, marked, or undone by hitting the back button. The step- clickable suggestions such as: by-step process does not preclude expert users entering complex queries directly, which can be faster and more powerful (e.g. Boolean operators) than using an "advanced" query form. Table 3 provides an overview of the query syntax. A frequent cause of failed queries in past implementations were trivial differences such as the use of dashes (e.g. "CapZ-Beta" vs "CapZ Beta") or Roman vs Arabic num- bers in names (e.g. "protein IV" vs "protein 4"). Such cases are now treated as equivalent. Many search engines stem words, for example they would treat the search terms: "inhibit", "inhibits" and "inhibiting" as equivalent. How- ever, given that most of the queries consisted of names, such as protein, gene or organism names, where such stemming is dangerous, and that there is no way to know whether or not an entered term should be stemmed, this UniProtKB search Figure 2 results, sorted by length descending was left out. UniProtKB search results, sorted by length descend- ing. Page 5 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 Basi Figure 3 c search form Basic search form. One advantage of "advanced" search forms is that they Figure 6 shows a screenshot of the "Customize display" allow to present fields that have a limited number of pos- option for UniProtKB search results. sible values as drop-down lists or offer auto-completion, How to support download of custom data sets? if the field contains values from a medium- to large-sized ontology. This functionality can, however, also be inte- We receive frequent demands to provide various down- grated into a "simple" search form: We chose to provide loadable entry sets, such as all reviewed human entries in the possibility to search in specific fields of a data set by FASTA format. While some of the most frequently adding one field search constraint at a time. The user requested files can be distributed through our FTP server, clicks on "Fields >>", selects the desired field and enters a doing so is obviously not feasible for many requests (espe- value and then clicks "Add & Search" to execute the query. cially for incremental updates such as all reviewed, Further search constraints can be added to refine the query human entries in FASTA format added or updated since iteratively until the desired results are obtained (see Figure the beginning of this year). Such sets can now be obtained 5). from the website, which no longer imposes any download limits. However, large downloads are given low priority in Certain data sets reference each other: This can be used to order to ensure that they do not interfere with interactive do subqueries, e.g. while searching UniRef you can add a queries, and they can therefore be slow compared to constraint "uniprot:(keyword:antigen organism:9606)" downloads from the UniProt FTP server. to show only UniRef entries that reference a UniProt entry with the specified keyword and organism. This function- How to support browsing? ality can sometimes also be accessed from search results, The two main modes of looking for data are 1. with direct e.g. while searching UniProtKB there may be a "Reduce searches and 2. by browsing, i.e. following links through sequence redundancy" link that converts the current a hierarchical organization. The new website makes use of query into a subquery in UniRef. various ontologies (Taxonomy, Keywords, Subcellular locations, Enzyme, Gene Ontology, UniPathway) to allow The search result table of most data sets can be customized users to browse the data or combine searching with in two ways: The number of rows shown per page can be browsing (e.g. search for keyword:Antigen and then changed, and different columns can be selected. browse by taxonomy, see Figure 7). Note that the choice of columns is preserved when down- loading the results in tab-delimited format. Suggestions for a simp Figure 4 le query, http://www.uniprot.org/uniprot/?query=insulin Suggestions for a simple query, http://www.uniprot.org/uniprot/?query=insulin. Page 6 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 Table 3: Query syntax overview Query Returns human antigen All entries containing both terms human AND antigen "human antigen" All entries containing both terms in the same order anti* All entries containing terms starting with anti. To search for a term that contains an actual asterisk, escape the asterisk with a backslash (anti\*). Asterisks can be used within and at the end of terms. human-antigen All entries containing the term human but not antigen human NOT antigen human OR antigen All entries containing either term antigen (human OR pig) Using brackets to override Boolean precedence rules author:Tiger* All entries with a citation that has an author whose name starts with Tiger. Note the field prefix author; had we left it out, there would have been a large amount of unwanted results. gene:L\(1\)2CB All entries with the specified gene name. Note how the backslash is used to escape the brackets, which would otherwise be interpreted as part of a Boolean query. Other characters that must be escaped are: []{}?:~* gene:* All entries that have a gene name. How to allow selection of multiple items? or "Align", and can be cleared with a single click. As Using a list of results will often imply performing further shown in Figure 8, the cart also allows to select items action, such as downloading all or selected items, or align- across multiple data sets. ing corresponding sequences. The simplest solution would be to add check boxes next to the items and enclose How to show complex entries? them in a form that also contains a list of tools to which The most important data on this site can be found in Uni- the items can be submitted. The problem with this ProtKB, in particular in the reviewed UniProtKB/Swiss- approach is that it can result in some redundancy in the Prot entries. These entries often contain a large amount of user interface: when adding a tool, it is necessary to add it information that needs to be shown in a way that allows everywhere where items can be selected. Moreover, this easy scanning and reading: approach does not allow selection of items across multi- ple pages (e.g. when paging through search results) or � Names and origin across different queries or data sets. The solution that was implemented was to provide a general selection mecha- � Protein attributes nism that stores items in a "cart". The contents of the cart are stored as a cookie in the web browser (so it does not � General annotation (Comments) require any state to be stored on the server side). The cart itself has certain actions attached to it such as "Retrieve" � Ontologies (Keywords and Gene Ontology) Using Figure 5 the query builder to add a constraint Using the query builder to add a constraint. Page 7 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 "Customize display" option Figure 6 for UniProtKB search results "Customize display" option for UniProtKB search results. Using hierarchical collection Figure 7 s to browse search results Using hierarchical collections to browse search results. Page 8 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 Using the cart Figure 8 to select items across multiple data sets Using the cart to select items across multiple data sets. � Binary interactions Parts of two sections from the UniProtKB entry view of human tissue-type plasminogen activator (P00750) are � Alternative products shown in Figure 9. � Sequence annotation (Features) How to integrate sequence similarity searches? In addition to text searches, sequence similarity searches � Sequences are a commonly used way to search in UniProt. They can be launched by submitting a sequence in FASTA format, � References or a UniProt identifier, in the "Blast" form of the tools bar. Note that this form is pre-filled with the current sequence � Web resources (Links to Wikipedia and other online when viewing a UniProtKB, UniRef or UniParc entry. Fig- resources) ure 10 shows the sequence similarity search form and results. � Cross-references How to integrate multiple sequence alignments? � Entry information (Meta data including release dates The purpose of the integrated "Align" tool is to allow sim- and version numbers) ple and convenient sequence alignments. ClustalW is used because it is still the most widely used tool, though � Relevant documents (List of documents that refer- it may no longer be the best-performing tool in all cases. ence an entry) The form can be submitted with a set of sequences in FASTA format or a list of UniProt identifiers, or more Describing the information found in UniProtKB/Swiss- likely through the built-in "cart". The form is pre-filled Prot [4] is outside the scope of this paper. Here are some with a list of sequences when viewing a UniProtKB entry improvements that were made over previous attempts to with alternative products or a UniRef cluster. For complex show this data: alignments that require specific options or a specific tool, the sequences can easily be exported into FASTA format � Features and cross-references are categorized. for use with an external alignment tool. Figure 11 shows the multiple sequence alignment form and results. � Features have a simple graphical representation to How to integrate identifier mapping functionality? facilitate a comparison of their locations and extents. There is an identifier mapping tool that takes a list of Uni- � Secondary structure features are collapsed into a sin- Prot identifiers as input and maps them to identifiers in a gle graphic. database referenced from UniProt, or vice versa. An addi- tional supported data set that can be mapped is NCBI GI � Alternative products are listed explicitly in the numbers. Figure 12 shows how RefSeq identifiers can be "sequences" section. mapped to UniProtKB. How to retrieve UniProt entries in batch? � Sections can be reordered or hidden (and these changes are remembered). The batch retrieval tool allows to specify or upload a list of UniProt identifiers to retrieve the corresponding entries (see Figure 13). The available download formats are the greatest common denominator: For example, if the batch Page 9 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 Figure 9 Parts of two sections from the UniProtKB entry view shown at http://www.uniprot.org/uniprot/P00750 Parts of two sections from the UniProtKB entry view shown at http://www.uniprot.org/uniprot/P00750. How to handle job submissions? retrieval request contains both UniProtKB and UniParc identifiers, neither plain text (only available for Uni- The website is a read-only, stateless application, with the ProtKB) nor XML (available for both, but with different exception of the job handling system. When e.g. a data- schemas) will be available, only FASTA and RDF. The set base mapping job is submitted, a new "job" resource is of entries retrieved by their identifiers can then optionally created, and the user is redirected to this job page (which be queried further, using the search engine described pre- will initially just show the job status, later the results). A viously. consequence of this is that if a web server receives a request for a job that it does not have, it needs to ask all other mirrors if they have this job (and if yes, transfer it). Page 10 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 Seq Figure 10 uence similarity search form and results Sequence similarity search form and results. Jobs have unique identifiers, which (depending on the job scale, a robots.txt file [6] is used, and links that should not type) can be used in queries (e.g. to get the intersection of be followed were marked with a "nofollow" "rel" attribute two sequence similarity searches). Recent jobs run by the [7]. The retrieval performance of documents in large col- current user can be listed using the URL http://www.uni lections was optimized until we felt confident that even prot.org/jobs/ (see Figure 14). rapid crawling would not impact the overall responsive- ness of the site too much. Such documents also return a How to deal with web crawlers? "Last-modified" date header when requested. Certain web Search engines are an important source of traffic, perhaps crawlers (e.g. Googlebot) will then on their next visit issue because people now tend to "google" search terms before a conditional "If-Modified-Since" request, so there is no trying more specialized sites. Therefore it is important to need to resend unchanged documents. Since each ensure that search engines are able to index (and keep up resource can now be accessed through one URL only, to date) as much content of the site as possible. However, there is no more redundant crawling of resources (as used web crawlers have trouble finding pages, such as database to be the case with multiple mirrors with different entries, that are part of large collections and are not linked addresses). The request logs and "Google Webmaster from main navigation pages, and when they do, this can Tools" site [8] are used to monitor the behavior of web put a significant load on the site. To ensure that web crawlers on this site. As of July 2008, over 4 M pages from crawlers find all content that was meant to be indexed, the the new site are indexed in Google. content was linked from multiple sources, including over- view documents and machine-readable site maps [5]. To How to avoid breaking links? keep web crawlers away from content that is either not This site publishes a large number of resources (several worth indexing or too expensive to retrieve on a large million) on the Web. These resources are linked from a lot Page 11 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 Multip Figure 11 le sequence alignment form and results Multiple sequence alignment form and results. of other life sciences databases as well as scientific papers. but keep their own web page (e.g. http://www.uni Since tracking down and getting all such links updated is prot.org/uniprot/P00001), with a link to a list of (retriev- not practical, and keeping legacy URL redirection schemes able) previous versions (e.g. http://www.uniprot.org/uni- in place for a long time can be tedious, it is worth invest- prot/P00001?version=*). Specific versions can also be ing some effort into reducing the likelihood that large sets referenced directly (e.g. http://www.uniprot.org/uniprot/ of URLs will have to be changed in future [9]. Technology P00001.txt?version=48), and used in the tool forms (e.g. artifacts, such as "/cgi-bin/" or ".do", are avoided in URLs P00001.48). [10]. Official and stable URLs are no good if they are not used. A lesson learned from the previous sites was that the How to support user-defined customizations? URLs that end up being used are those that are shown in Some simple customizations, such as being able to choose the browser (i.e. mirror-site specific URLs). The new site the columns shown in search results, make the site a lot avoids this problem by having exactly one URL for each more convenient to use. However, we did not want to resource. Another issue is how to deal with individual compromise the statelessness of the application (which is resources that are removed. Obsolete entries, or entries important for keeping the application distributable and that were merged with other entries, in the main data set scalable) by having each request depend on centralized (UniProtKB) no longer disappear from the web interface, user profile data. This was possible by storing basic set- Page 12 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 Mappi Figure 12 ng RefSeq identifiers to UniProtKB Mapping RefSeq identifiers to UniProtKB. tings in client-side cookies. The drawback of this solution How to enable programmatic access to the site? is that such settings are lost when cookies are cleared in People need to be able to retrieve individual entries or sets of entries in various formats and use our tools simply and the browser or when the user switches to another machine. The amount of data that can be stored this way efficiently from within basic scripts or complex applica- is also limited. On the other hand, the customizations are tions. We want to encourage people to build applications simple and easy to redo. This solution also does not that are tailored towards certain user communities on top require people to sign up, which some may be reluctant to of our data – customization options on our site can go do as this introduces some privacy issues. only so far, and anything we build will always be focused on our data. How to let people tailor the web site to their needs? The ideal site would be a "one stop shop" that solves all Early versions of the new site had a complete SOAP [11] our users' needs. However, the life sciences community interface built with Apache Axis [12]. Unfortunately this has very diverse needs, and our data is often just one small interface had poor performance (which necessitated intro- part of these needs. The best we can do is make it as easy ducing limitations such as the maximum number of as possible to retrieve data from this site programmati- entries that could be retrieved in one go), and simple cally, in order to facilitate the development of applica- operations such as retrieving an entry in FASTA format tions on top of our data. ended up being more complex than could be justified. For example, in order to retrieve the data from a Perl script, a special module (SOAP::Lite) had to be installed and Page 13 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 Batch retrieval of a Figure 13 set of UniProtKB entries Batch retrieval of a set of UniProtKB entries. patched, due to quirks with the support for SOAP attach- � Ensures that invalid pages stay out of search engine ments. Meanwhile, there are better SOAP libraries, but indexes they are still more complicated and less efficient to use than doing direct HTTP requests. � Simplifies error handling for people doing program- matic access (no need to have fragile checks for error To ensure that such "RESTful" [13] access is as simple and message strings) robust as possible, the site has a simple and consistent URL schema (explained in a previous section) and returns � Helps detect common problems when analyzing appropriate content type headers (e.g. application/xml for request logs. XML resources) and response codes. Returning appropri- ate HTTP status codes instead of returning 200 OK for all Table 4 lists all response codes that the site might return. requests, even if they fail, has several benefits: The most important distinction are the 4xx response codes, which indicate that there is some problem with the Page 14 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 Figure 14 Recent jobs run by the current user, shown at http://www.uniprot.org/jobs/ Recent jobs run by the current user, shown at http://www.uniprot.org/jobs/. request itself and sending the same request again will the human taxon. This URL can be resolved to either likely fail again, and the 5xx response codes, which indi- http://www.uniprot.org/taxonomy/9606 (human reada- cate a problem with the server. ble representation; returned e.g. when entered in a browser) or http://www.uniprot.org/taxonomy/9606.rdf In addition to the data set-dependent formats, all search (machine-readable representation; returned if the request results can be retrieved as OpenSearch [14] RSS feeds for contains an "Accept: application/rdf+xml" header). integration with external tools such as news feed readers What is the architecture of the site? or Yahoo Pipes [15]. The website was implemented as a pure Java web applica- How to handle complex data? tion that can either be run standalone using an embedded While the needs of most people are met by being able to web server, Jetty [19], or deployed on any Servlet 2.4 com- obtain data in FASTA, tab-delimited or even plain text for- pliant [20] web application server. The components in the mat, some people need to obtain and work with the com- application are configured and connected together using plete structure of the data. This is complicated by the fact the Spring Application Framework [21]. Struts [22] coor- that the data model is complex and changes a lot. RDF dinates the request handling, and pages are rendered as (part of the W3C's Semantic Web initiative [16]) provides XHTML using JSP (2.0) templates. Database entries are a generic graph-like data model that can help address stored in Berkeley DB JE [23] for fast retrieval. Searching these issues [17]. All UniProt data is available in RDF as was implemented with help of the Lucene text search well as XML formats both on our FTP servers (for bulk library [24]. downloads) and on the site. Spring was introduced to remove hard-coded dependen- Resources in RDF are identified with URIs. UniProt uses cies (or hard-coded service lookups) from the code, as this PURLs [18]. For example, http://purl.uniprot.org/taxon was hampering our ability to unit-test the code. Struts was omy/9606 is used to reference and identify the concept of chosen among the plethora of available web application Table 4: Listing of response codes that the site might return Code Description 200 The request was processed successfully. 301 Moved (permanently). Use the new address for future requests 302 Moved (temporarily) 400 Bad request. There is a problem with your input. 404 Not found. The resource you requested doesn't exist. 410 Gone. The resource you requested was removed. 500 Internal server error. Most likely a temporary problem, but if the problem persists please contact the site operators. 503 Service not available. The server is being updated, try again later. Page 15 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 frameworks because it provided some conveniences, such must use to resolve addresses often keep track of what as the automatic population of objects from request resolving name server responds faster (and therefore is parameters, but did not attempt to abstract too much (e.g. most likely the nearest) for a given domain. The tests we needed access to the HTTP request and response showed that this worked to some degree, at least for the objects in order to read and set certain HTTP headers). most frequent users. The drawback of this solution is that Struts is (or was) also a de facto standard and simple to with only two mirrors, no failover is possible when one of learn. Using Berkeley DB JE to store serialized Java objects the sites happens to become unreachable. Given that net- using custom serialization code was by far the most effi- work delays between the U.S. and Europe are not too bad, cient solution for retrieving data that we tested. Extracting reliability was seen as more important. This may change if data from uncompressed text files using stored offsets one or more additional mirrors are set up in more remote might be faster for returning data in specific formats. places. However, the size of the uncompressed files and number of different databases and formats make this approach The current mirror sites deploy the web application on less practical than generating the various representations Tomcat [26] and use Apache [27] as a reverse proxy [28], on the fly. Minimizing the amount of data that is stored as well as for request logging, caching and compressing on disk also ensures that the application can benefit more responses (the latter can have a huge impact on page load from increasing disk cache sizes. times for clients on slow connections). If the application is not available at one site (e.g. while it is being updated), The website application is self-contained and can be run Apache automatically sends requests to another available out of the box with zero configuration for development mirror. The web application has a special health-check and test purposes. Non-Java and compute-intensive tools page that is monitored from local scripts (which notify the such as sequence similarity searches (BLAST), multiple local site administrators if there is a problem), as well as sequence alignments (ClustalW) and database identifier from a commercial monitoring service. This service can mappings are run on external servers. To minimize the also detect network problems and keeps statistics on the footprint of the application further, historical data such as overall reliability and responsiveness of each mirror. entry versions from UniSave [25] is also retrieved Application-level warnings and errors are handled by remotely on demand. Even so, the data including indexes Log4j [29]. Errors trigger notification messages that go occupies almost 140 GB (as of release 14.5). For develop- directly to an e-mail account that is monitored by the ment and testing, a smaller, internally consistent, data set developers (unless an error was triggered by a serious (~2 GB) is generated by issuing queries to a site that has operational issue, it usually indicates a bug in the code). the complete data loaded. Finally, there is a JMX interface that supplements JVM- level information, such as memory use, and Tomcat-sup- How to deploy the site on distributed mirrors? plied information, such as the number of open HTTP con- With the previous setup, each "mirror" site had its own nections, with application information, such as the hit public address. Requests to http://www.uniprot.org were ratios of specific object caches. redirected to http://www.ebi.uniprot.org (EBI), http:// www.pir.uniprot.org (PIR) or http://www.expasy.uni How to manage data and application updates? prot.org (SIB), respectively. The redirection was done with Data can be loaded into the web application simply by client-side HTTP redirects and was based on the top level dropping a data set, or partial data set for incremental domain (TLD) revealed by a reverse DNS lookup of the IP updates, in RDF format into a special directory and wait- address from which a request originated. One major prob- ing for the application to pick up and load the data. How- lem with this setup was that people and web crawlers ever, to save resources, we load all the data on a single bookmarked and linked to the mirror they had been redi- staging server and then distribute the zipped data to the rected to rather than the main site. Tracking down such mirror sites, usually along with the latest version of the links and getting them corrected turned out to be a Sisy- web application. Updates occur every three weeks, in sync phean task. One consequence was that people more often with the UniProt releases. than not ended up neither on the nearest mirror, nor did they get the benefits of failover. The new setup makes use How to reduce the risk that bugs are introduced into the of the fact that you can attach multiple IP addresses (A code? records) to a domain name. Clients will connect more or All code changes risk introducing bugs, which can be less randomly to one address that is reachable. We also time-consuming to fix, especially when not detected right tested whether it was possible to achieve geographic affin- away. To minimize this risk, automated tests need to be ity by having different name servers return different IP set up. The lowest-level testing is done in the form of addresses depending on which mirror was nearer. This "unit" tests. The goal of unit tests is to cover each execu- can work because the "caching" name servers that clients tion path in the code using different input. Single classes Page 16 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 are tested in isolation. Such tests were written using the and 2. by analyzing the request logs (which include the JUnit testing framework [30]. The initial test coverage was duration of each request) at the end of each month. While quite low, as there was no simple way to untangle classes the former allows immediate action to be taken, the latter for isolated tests. This was improved by introducing a can help to detect more subtle performance issues. "dependency injection" framework [21] to remove hard- coded dependencies. Unit tests are complemented with How to ensure that people will know how use the site? "functional" tests. We created a set of test scripts with the For all but the most trivial functions it is difficult to pre- open source tool Selenium [31]. These tests can be played dict whether people will be able to figure out how to use back in any recent browser and simulate typical user inter- them. The most effective way to answer such questions is actions with the site. Reducing the playback speed allows to do "usability" tests [35]. We managed to recruit a dozen semi-automatic testing, where a person watches the test or so volunteers who let us watch them use the site to execution to catch layout glitches that would be difficult accomplish certain tasks. Most of them had some kind of to catch with fully automated tests. In addition to testing life sciences background, however, not all of them had the site prior to releases, these tests can be used for been working with our data on a regular basis. The tests browser compatibility testing. We attempted to ensure took on the following form: Two people, one to ask ques- that the site works well with the most popular browser tions, the other to take notes, went to the volunteer's and operating system combinations (e.g. Internet workplace (if possible; this ensured that people had the Explorer 6 and 7 on Windows, Firefox 2 on Linux) and setup they were used to and felt comfortable). Some brief acceptably with other, recent browsers (e.g. Safari on Mac background questions were asked to establish how famil- OS X). Another major "functional" test is loading all data iar the person already was with our data, what services and into the site. Given that this is a long procedure, a smaller tools they had used in the past, etc. Based on this informa- test data set is often used instead to verify that the import tion they would then be asked to accomplish certain tasks procedure is working. Other one-off tests included setting on the site. We tried as best as possible to avoid putting up and using a tool to compare search results returned by the user under pressure. The other difficulty was avoiding the new and the old sites. phrasing tasks in terms of the concepts we were using. For example, when asked to fill out the "contact form", people How to ensure that the site will have adequate would immediately find the "contact" link. But if we performance? phrased the question in different terms, such as "send Basic performance goals were established based on num- feedback", success was less guaranteed. Testing sessions bers obtained from the request logs of the previous sites. were between half an hour and one hour and helped settle Even though the application is stateless (i.e. no state is and open up quite a few "design" discussions. stored on the server) and can therefore be scaled out hor- izontally (i.e. by buying more machines), the stated goal How to keep track of what is being used? was to be able to support the full load on a single, power- In order to know how the site is being used, and which ful, but "off-the shelf" machine. Following are some load parts of the site are working well and which are not, it is tests that were performed to identify potential issues and essential to collect data on the site's usage. Usability test- help build confidence in the application: ing is invaluable and can help pinpoint certain issues, but it is time-consuming and therefore unsuitable to collect � Retrieve random database entries large sample sizes. Fortunately, some basic information on user interactions with the site can be recorded through � Execute random queries the web server log files. In addition to the standard infor- mation logged for each request (such as the exact time and � Download large result sets path of the request, the response code, the IP address, referrer and user agent), the web server was configured to � Simulate initial requests to the home page including also record the duration of the request, the number of all required static resources search results returned (if the request was a query) and the content-type of the response (this could often be inferred The tests were performed with a combination of shell from the extension of the requested resource, but not scripts and the httperf tool [32]. Performance numbers, always). The request logs are collected from all mirror sites such as response times, were analyzed with R [33]. Once a once a month, cleaned up a bit and loaded into a simple performance issue was found, problematic code was star schema in a relational database. This allows us not tracked down with the help of a commercial profiling tool only to get general usage statistics (e.g. total number of [34]. Performance issues on the production servers are requests for different resources, showing the percentage of caught 1. by an external monitoring tool that records automated requests), but also to look for problems (e.g. response times for requests to a general health-check page, queries that do not produce any results or fail) and even Page 17 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 help to set annotation priorities (e.g. by looking at what plementing the conventional feedback forms with a com- are the most frequently requested entries that have not menting system. Other more complex approaches, such as been reviewed yet, or not updated in a long time). using wiki software, are under investigation as well [37]. Another complementary tool that is used to gather statis- Looking beyond UniProt: Much of the development effort tics is Google Analytics [36], which records data through was spent on issues that are not UniProt-specific, ranging JavaScript embedded in the web pages. The advantage and from handling identifiers in a stable way to most of the drawback of this approach is that it does not record auto- code for the search engine. These issues are likely to be rel- mated requests, such as those issued by web crawlers, or evant for other life sciences databases as well. Had there requests to non-HTML resources. While Google Analytics been some kind of framework that provided these fea- was left enabled during all of the beta phase, it is now only tures, the development time could have been reduced sig- enabled from time to time. We use it to check the (less nificantly. This seems especially important for smaller accurate) number of non-robot requests reported via the databases that may not have resources to reinvent the request log analysis procedure, which relies on user agent wheel. It may therefore be worth incorporating some of strings matches and the patterns need to be updated from the solutions we have come up with here into a frame- time to time. It also helps us to get an idea of the browsers work (or add them to existing frameworks). and screen resolutions people are using, information which is less accurate or impossible to get via the web Conclusion server request logs. Google Analytics can provide fast and The new UniProt website makes accessing and under- convenient feedback on the impact of certain changes, as standing UniProt easier than ever. The two main lessons data is updated at least once a day, but it can also slow learned are that 1. getting the basics right for such a data down the perceived page loading time and aggravate cer- provider website (and likely others as well) has huge ben- tain privacy concerns. efits, but is not trivial and easy to underestimate, and that 2. there is no substitute for using empirical data through- How to "go live" with minimal casualties? out the development process to decide on what is and Despite all the load and usability testing, there was no what is not working for your users. We hope to encourage guarantee that switching over all the old sites to the new more people in the life sciences community to resist the site in one go would not swamp us with more technical temptation to spend time adding bells and whistles to an issues and irate users than we could possibly handle at application before getting the basics right, and to put in once. Having a prolonged "beta" period allowed us to get place rigorous procedures for assessing whether or not the feedback from many users and to ramp up the number of site is serving its users well. people (and web crawlers, etc) using the new site gradu- ally by taking these steps (moving to the next step when- Availability and requirements ever we felt comfortable enough to do so): http://www.uniprot.org/ is open for both academic and commercial use. The site was built with open source tools 1. Sending out invitations to certain people and and libraries. groups to use the new site. Abbreviations HTTP: Hyper-Text Transfer Protocol; JMX: Java Manage- 2. Getting the site indexed in Google. ment Extensions; JVM: Java Virtual Machine; RDF: 3. Linking the old sites to the new site. Resource Description Framework; REST: REpresentational State Transfer; RSS: Really Simple Syndication Format; 4. Switching over the old sites one by one. TBD: To Be Done; TLD: Top-Level Domain; URI: Uniform Resource Locator; W3C: World Wide Web Consortium; XHTML: eXtensible HTML. Discussion Collecting data and optimizing the use and performance of the website is an ongoing process. Authors' contributions EJ carried out most of the design and development work One of the biggest challenges UniProt is facing is getting and drafted the manuscript. IP and SD participated in the more community involvement to help cope with the development. EG, NR, AB, MJM, PM and BS coordinated increasing amount and complexity of data. Simply having design, requirements and specifications. EG, IP and NR more people give feedback when they see incorrect or helped to draft and critically revised the manuscript. All missing data in UniProt would already be a huge improve- authors read and approved the final manuscript. ment. One possible approach under investigation is to make such feedback more rewarding by replacing or sup- Page 18 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 Acknowledgements UniProt is mainly supported by the National Institutes of Health (NIH) grant 2 U01 HG002712-04. Swiss-Prot activities at the SIB are supported by the Swiss Federal Government through the Federal Office of Education and Science, and the European Commission contract FELICS (EU no. 021902). Special thanks to the many people within and outside of the UniProt con- sortium who spent time providing early feedback on the site, or consented to participating in one of the usability tests. References 1. The UniProt Consortium: The Universal Protein Resource (UniProt). Nucleic Acids Res 2008, 36:D190-D195. 2. Hoekman R Jr: Designing the Obvious: A Common Sense Approach to Web Application Design Berkeley: New Riders Press; 2006. 3. Wink [http://www.debugmode.com/wink/] 4. Bairoch A, Boeckmann B, Ferro S, Gasteiger E: Swiss-Prot: Juggling between evolution and stability. Brief Bioinform 2004, 5:39-55. 5. Sitemap Protocol 0.9 [http://www.sitemaps.org/] 6. Robots Exclusion Standard [http://www.robotstxt.org/] 7. rel-nofollow Microformat [http://microformats.org/wiki/rel- nofollow] 8. Google Webmaster Central [http://www.google.com/webmas ters/] 9. Berners-Lee T: Cool URIs don't change. [http://www.w3.org/ Provider/Style/URI]. 10. urlrewrite [http://code.google.com/p/urlrewritefilter/] 11. Simple Object Access Protocol (SOAP) [http://www.w3.org/ TR/soap/] 12. Apache Axis [http://ws.apache.org/axis/] 13. Fielding R: Architectural Styles and the Design of Network- based Software Architectures. In PhD thesis University of Cali- fornia, Irvine, Information and Computer Science; 2000. 14. OpenSearch [http://www.opensearch.org/] 15. Yahoo Pipes [http://pipes.yahoo.com/pipes/] 16. W3C Semantic Web Activity [http://www.w3.org/2001/sw/] 17. UniProt RDF [http://dev.isb-sib.ch/projects/uniprot-rdf/] 18. OCLC PURL [http://purl.org/] 19. Jetty [http://www.mortbay.org/jetty-6/] 20. Java Servlet Technology [http://java.sun.com/products/servlet/] 21. Spring Application Framework [http://www.springframe work.org/] 22. Struts [http://struts.apache.org/] 23. Berkeley DB JE [http://www.oracle.com/database/berkeley-db/je/] 24. Lucene [http://lucene.apache.org/java/] 25. Leinonen R, Nardone F, Zhu W, Apweiler R: UniSave: the Uni- ProtKB Sequence/Annotation Version database. Bioinformat- ics 2006, 22:1284-1285. 26. Apache Tomcat [http://tomcat.apache.org/] 27. The Apache HTTP Server Project [http://httpd.apache.org/] 28. Apache Module mod_proxy [http://httpd.apache.org/docs/2.0/ mod/mod_proxy.html] 29. Apache Log4j [http://logging.apache.org/log4j/] 30. JUnit [http://junit.org/] 31. Selenium Web Application Testing System [http://sele nium.openqa.org/] 32. httperf [http://www.hpl.hp.com/research/linux/httperf/] Publish with Bio Med Central and every 33. The R Project for Statistical Computing [http://www.r- scientist can read your work free of charge project.org/] 34. JProfiler [http://www.ej-technologies.com/] "BioMed Central will be the most significant development for 35. Dumas JS, Redish JC: A Practical Guide to Usability Testing Westport: disseminating the results of biomedical researc h in our lifetime." Greenwood Publishing Group Inc; 1999. Sir Paul Nurse, Cancer Research UK 36. Google Analytics [http://www.google.com/analytics/] 37. Mons B, Ashburner M, Chichester C, van Mulligen E, Weeber M, den Your research papers will be: Dunnen J, van Ommen GJ, Musen M, Cockerill M, Hermjakob H, available free of charge to the entire biomedical community Mons A, Packer A, Pacheco R, Lewis S, Berkeley A, Melton W, Barris N, Wales J, Meijssen G, Moeller E, Roes PJ, Borner K, Bairoch A: peer reviewed and published immediately upon acceptance Calling on a million minds for community annotation in cited in PubMed and archived on PubMed Central WikiProteins. Genome Biology 2008, 9:R89. yours — you keep the copyright BioMedcentral Submit your manuscript here: http://www.biomedcentral.com/info/publishing_adv.asp Page 19 of 19 (page number not for citation purposes) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png BMC Bioinformatics Springer Journals

Infrastructure for the life sciences: design and implementation of the UniProt website

Loading next page...
 
/lp/springer-journals/infrastructure-for-the-life-sciences-design-and-implementation-of-the-kA07EKwT1C

References (38)

Publisher
Springer Journals
Copyright
Copyright © 2009 by Jain et al; licensee BioMed Central Ltd.
Subject
Life Sciences; Bioinformatics; Microarrays; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Combinatorial Libraries; Algorithms
eISSN
1471-2105
DOI
10.1186/1471-2105-10-136
pmid
19426475
Publisher site
See Article on Publisher Site

Abstract

Background: The UniProt consortium was formed in 2002 by groups from the Swiss Institute of Bioinformatics (SIB), the European Bioinformatics Institute (EBI) and the Protein Information Resource (PIR) at Georgetown University, and soon afterwards the website http://www.uniprot.org was set up as a central entry point to UniProt resources. Requests to this address were redirected to one of the three organisations' websites. While these sites shared a set of static pages with general information about UniProt, their pages for searching and viewing data were different. To provide users with a consistent view and to cut the cost of maintaining three separate sites, the consortium decided to develop a common website for UniProt. Following several years of intense development and a year of public beta testing, the http://www.uniprot.org domain was switched to the newly developed site described in this paper in July 2008. Description: The UniProt consortium is the main provider of protein sequence and annotation data for much of the life sciences community. The http://www.uniprot.org website is the primary access point to this data and to documentation and basic tools for the data. These tools include full text and field-based text search, similarity search, multiple sequence alignment, batch retrieval and database identifier mapping. This paper discusses the design and implementation of the new website, which was released in July 2008, and shows how it improves data access for users with different levels of experience, as well as to machines for programmatic access. http://www.uniprot.org/ is open for both academic and commercial use. The site was built with open source tools and libraries. Feedback is very welcome and should be sent to [email protected]. Conclusion: The new UniProt website makes accessing and understanding UniProt easier than ever. The two main lessons learned are that getting the basics right for such a data provider website has huge benefits, but is not trivial and easy to underestimate, and that there is no substitute for using empirical data throughout the development process to decide on what is and what is not working for your users. Page 1 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 However, a careful review of the archived help desk ques- Background The UniProt consortium[1] was formed in 2002 by groups tions and web server request logs, collected over several from the Swiss Institute of Bioinformatics (SIB), the Euro- years from the existing sites, revealed the following: pean Bioinformatics Institute (EBI) and the Protein Infor- mation Resource (PIR) at Georgetown University, and � The majority of the queries consisted of nothing soon afterwards the website http://www.uniprot.org was more than a protein or gene name, sometimes com- set up as a central entry point to UniProt resources. bined with an organism name. Some of these queries Requests to this address were redirected to one of the three did not yield useful results, because of the lack of a organisations' websites (http://www.expasy.uniprot.org, good scoring algorithm (e.g. searching for "human http://www.ebi.uniprot.org and http://www.pir.uni insulin" could require scrolling through hundreds of prot.org). While these sites shared a set of static pages with results before finding the most relevant entries, such as general information about UniProt, their pages for search- INS_HUMAN). ing and viewing data were different: The SIB was redirect- ing such requests to the ExPASy website, where some of � Some queries yielded no results because people mis- the data and tools had been available since 1993, while spelled terms or did not use the same conventions as the EBI and PIR both developed their own sites for Uni- UniProt (e.g. American vs English spelling, Roman vs Prot, with a similar appearance, but different code and Arabic numbers in protein names, dashes vs separated functionality. Though the redirection was done according words) or chose the wrong field in an "advanced" to the geographic location of the client, it happened occa- search form, etc. Some of this was documented, but sionally that users were confronted with a site that looked the documentation was not accessed much. and worked differently from the one they were used to. To provide users with a consistent view and to cut the cost of � The majority of requests came from web crawlers maintaining three separate sites, the consortium decided and other automated applications (many of which to develop a common website for UniProt. Following sev- made valid use of our data). Referrals from search eral years of intense development and a year of public engines made up a substantial part of the visits, there- beta testing, the http://www.uniprot.org domain was fore we did not want to block web crawlers either, yet switched to the newly developed site described in this this was putting quite a bit of a load on our servers. paper in July 2008. Ensuring that these issues would be resolved by the new Requirements site, along with all the basic requirements, was therefore The essential functionality that the website (like its prede- made a priority [2]. cessors) had to provide was: Construction, content and utility � Retrieval of individual database entries by identifier. What data is available on the site? The UniProt web site provides access to the data sets pre- � Retrieval of sets of entries based on simple search cri- sented in Table 1. teria such as organism, keyword or free text matches. How is the site structured? The pattern for URL templates shown in Table 2 is used � Display of data in a human readable manner. not only for the main data sets, but also for the various � Download of data in all official formats. "ontologies", for documentation and even running or completed jobs. � Basic tools for identifier mapping, sequence align- ments and similarity searches. There are no special search pages. The search function and other tools can be accessed directly through a tool bar that � Access to documentation and controlled vocabular- appears at the top of every page. Depending on the current ies. context, some of the tool forms are pre-filled. For exam- ple, when viewing a UniProtKB entry, the sequence search An additional wish was that each consortium member form is pre-filled with the sequence of the entry, and the should be able to host a mirror of the website without too alignment form is pre-filled with all alternative products much effort, and that the technology on which the web- of the entry, if any. site was to be built should be familiar enough to allow all consortium members to contribute to the development. How to get people started? Beyond that there was no shortage of ideas for bells and Important information is often overlooked on home whistles, such as data mining and visualization tools. pages with a lot of content. The new UniProt home page Page 2 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 Table 1: Overview of the UniProt data sets Data set Description References Entries Path Formats UniProtKB Protein sequence and UniRef, UniParc, Literature 6.4 M /uniprot/ Plain text, FASTA, (GFF), annotation data citations, Taxonomy, Keywords XML, RDF UniRef Clusters of proteins with UniProtKB, UniParc, Taxonomy 12.3 M /uniref/ FASTA, XML, RDF similar sequences UniParc Protein sequence archive UniProtKB, Taxonomy 17.0 M /uniparc/ FASTA, XML, RDF Literature citations Literature cited in UniProtKB 0.4 M /citations/ RDF (based on PubMed) Taxonomy Taxonomy data 0.5 M /taxonomy/ RDF, (Tab-delimited) (based on NCBI taxonomy) Keywords Keywords used in UniProtKB 1K /keywords/ RDF, (OBO) Subcellular locations Subcellular location terms used 375 /locations/ RDF, (OBO) in UniProtKB (see Figure 1) features a prominent tools bar, that is � How often a search term occurs in an entry (without present on every page and serves as a basic site map with normalizing by document size, as this would benefit links to common entry points. The site contains a lot of poorly annotated documents). small, useful features that are documented in the on-line help; however, people in general appear to be reluctant to � What fields in an entry a term occurs in (e.g. matches invest a lot of time into reading documentation. To in a protein name are more relevant than in the title of address this issue, we recorded a "site tour" [3] that is a referenced publication). accessible from the home page. � Whether an entry has been reviewed (reviewed How to get the text search function right? entries are more likely to contain correct and relevant The text search function is the most used feature on the information). website. Considerable effort was therefore invested into making all common and less common searches not only � How comprehensively annotated an entry is (all else possible, but also simple and convenient to use for people being equal, we want to have a bias towards well- without a detailed understanding of UniProt data. One of annotated entries). the most obvious problems with the old sites had been the lack of good relevance scoring of search results. Scor- The exact scoring scheme differs for each data set and ing is essential for queries that are meant to locate specific requires ongoing fine-tuning. entries, but that contain terms that appear in a large number of entries (e.g. the "human insulin" example In order to allow people to see quickly the entries with e.g. quoted above). The main factors that influence the score the longest or shortest sequences, or to page through the of an entry for a given query on the new website are: results one organism at a time, certain fields were made sortable. This turned out to be not trivial to implement as the underlying search engine library had no support for sorting results efficiently on anything but their score. Table 2: URL templates Template Description Example http://www.uniprot.org/{dataset}/ Overview page for a data set, may contain a http://www.uniprot.org/uniprot/ description of the data set along with various entry points, or just list all database items (equivalent to searching for *). http://www.uniprot.org/{dataset}/ Filters the data set with the specific query. Other http://www.uniprot.org/uniprot/?query=green ?query={query} parameters are "offset" (index of first result), "limit" (number of results to return), "format" (e.g. "tab" for tab-delimited or "rdf") and "compress" ("yes" to gzip results when downloading). http://www.uniprot.org/{dataset}/{id} Displays a specific database entry. http://www.uniprot.org/uniprot/P00750 http://www.uniprot.org/{dataset}/{id}.{format} Returns a database entry in the specified format. http://www.uniprot.org/uniprot/P00750.rdf Page 3 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 H Figure 1 ome page at http://www.uniprot.org/ Home page at http://www.uniprot.org/. Page 4 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 Therefore, special sort indexes are now built when the � "Did you mean" spelling suggestions (if there are no data is loaded, at the cost of slowing down incremental or few results and the index contains a similar word). updates. Figure 2 shows the result of a query in Uni- ProtKB, sorted by length descending. � Restrict a term to field (listing only fields in which a term occurs). The traditional approach of having two separate forms for "basic" and "advanced" queries has several issues: Based � Quote terms (if they frequently appear together in on our observations, few people start out with the inten- the index). tion of using an "advanced" form. Even if they have a good understanding of the data and search function, they � Filter out unreviewed or obsolete entries (if the often first try to obtain the results through a simple search results contain such entries). form, as these are quicker to fill in. If the basic search does not yield the expected result, or too many results, the � Replace a field with a more stringent field (if this query has to be redone in an advanced search form where helps reduce the number of results). further constraints can be applied. � Restrict the range of values in a field (if results are Another problem with "advanced" search forms is that being sorted on this field). they often do not take into account that most nontrivial queries appear to be built iteratively: People start out with � ...and others, depending on the context. one or two terms, and then add or modify (e.g. constrain) terms until they see the desired results or give up. If mul- This approach allows people to move seamlessly from a tiple complex constraints are specified at once and the basic to an advanced query without prior knowledge of query produces no results, it can be time-consuming to the fields used to store data in UniProt. Clicking on a sug- figure out which (if any) of the constraints was used incor- gestion requires less mouse clicks than selecting a field in rectly. We therefore opted for a "fail early" approach: A an "advanced" search form. It is also more effective, simple full text search is the fastest and most effective way because only the fields in which a term occurs are listed – to determine whether or not a term even appears in a data- such a filtering is difficult to accomplish with traditional base, as you can skip the step of scrolling through and "advanced" search forms. selecting a field, and then having to wonder if the term might have appeared in another field. Figure 4 shows suggestions for a simple query, http:// www.uniprot.org/uniprot/?query=insulin. For these reasons, we opted for a single search form(see Figure 3). People start by searching for one or two terms. Each step in the query building process updates the query The results page shows the matches for these terms and, string and is reflected in the URL, so it can be book- for people who are not familiar with our search fields, marked, or undone by hitting the back button. The step- clickable suggestions such as: by-step process does not preclude expert users entering complex queries directly, which can be faster and more powerful (e.g. Boolean operators) than using an "advanced" query form. Table 3 provides an overview of the query syntax. A frequent cause of failed queries in past implementations were trivial differences such as the use of dashes (e.g. "CapZ-Beta" vs "CapZ Beta") or Roman vs Arabic num- bers in names (e.g. "protein IV" vs "protein 4"). Such cases are now treated as equivalent. Many search engines stem words, for example they would treat the search terms: "inhibit", "inhibits" and "inhibiting" as equivalent. How- ever, given that most of the queries consisted of names, such as protein, gene or organism names, where such stemming is dangerous, and that there is no way to know whether or not an entered term should be stemmed, this UniProtKB search Figure 2 results, sorted by length descending was left out. UniProtKB search results, sorted by length descend- ing. Page 5 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 Basi Figure 3 c search form Basic search form. One advantage of "advanced" search forms is that they Figure 6 shows a screenshot of the "Customize display" allow to present fields that have a limited number of pos- option for UniProtKB search results. sible values as drop-down lists or offer auto-completion, How to support download of custom data sets? if the field contains values from a medium- to large-sized ontology. This functionality can, however, also be inte- We receive frequent demands to provide various down- grated into a "simple" search form: We chose to provide loadable entry sets, such as all reviewed human entries in the possibility to search in specific fields of a data set by FASTA format. While some of the most frequently adding one field search constraint at a time. The user requested files can be distributed through our FTP server, clicks on "Fields >>", selects the desired field and enters a doing so is obviously not feasible for many requests (espe- value and then clicks "Add & Search" to execute the query. cially for incremental updates such as all reviewed, Further search constraints can be added to refine the query human entries in FASTA format added or updated since iteratively until the desired results are obtained (see Figure the beginning of this year). Such sets can now be obtained 5). from the website, which no longer imposes any download limits. However, large downloads are given low priority in Certain data sets reference each other: This can be used to order to ensure that they do not interfere with interactive do subqueries, e.g. while searching UniRef you can add a queries, and they can therefore be slow compared to constraint "uniprot:(keyword:antigen organism:9606)" downloads from the UniProt FTP server. to show only UniRef entries that reference a UniProt entry with the specified keyword and organism. This function- How to support browsing? ality can sometimes also be accessed from search results, The two main modes of looking for data are 1. with direct e.g. while searching UniProtKB there may be a "Reduce searches and 2. by browsing, i.e. following links through sequence redundancy" link that converts the current a hierarchical organization. The new website makes use of query into a subquery in UniRef. various ontologies (Taxonomy, Keywords, Subcellular locations, Enzyme, Gene Ontology, UniPathway) to allow The search result table of most data sets can be customized users to browse the data or combine searching with in two ways: The number of rows shown per page can be browsing (e.g. search for keyword:Antigen and then changed, and different columns can be selected. browse by taxonomy, see Figure 7). Note that the choice of columns is preserved when down- loading the results in tab-delimited format. Suggestions for a simp Figure 4 le query, http://www.uniprot.org/uniprot/?query=insulin Suggestions for a simple query, http://www.uniprot.org/uniprot/?query=insulin. Page 6 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 Table 3: Query syntax overview Query Returns human antigen All entries containing both terms human AND antigen "human antigen" All entries containing both terms in the same order anti* All entries containing terms starting with anti. To search for a term that contains an actual asterisk, escape the asterisk with a backslash (anti\*). Asterisks can be used within and at the end of terms. human-antigen All entries containing the term human but not antigen human NOT antigen human OR antigen All entries containing either term antigen (human OR pig) Using brackets to override Boolean precedence rules author:Tiger* All entries with a citation that has an author whose name starts with Tiger. Note the field prefix author; had we left it out, there would have been a large amount of unwanted results. gene:L\(1\)2CB All entries with the specified gene name. Note how the backslash is used to escape the brackets, which would otherwise be interpreted as part of a Boolean query. Other characters that must be escaped are: []{}?:~* gene:* All entries that have a gene name. How to allow selection of multiple items? or "Align", and can be cleared with a single click. As Using a list of results will often imply performing further shown in Figure 8, the cart also allows to select items action, such as downloading all or selected items, or align- across multiple data sets. ing corresponding sequences. The simplest solution would be to add check boxes next to the items and enclose How to show complex entries? them in a form that also contains a list of tools to which The most important data on this site can be found in Uni- the items can be submitted. The problem with this ProtKB, in particular in the reviewed UniProtKB/Swiss- approach is that it can result in some redundancy in the Prot entries. These entries often contain a large amount of user interface: when adding a tool, it is necessary to add it information that needs to be shown in a way that allows everywhere where items can be selected. Moreover, this easy scanning and reading: approach does not allow selection of items across multi- ple pages (e.g. when paging through search results) or � Names and origin across different queries or data sets. The solution that was implemented was to provide a general selection mecha- � Protein attributes nism that stores items in a "cart". The contents of the cart are stored as a cookie in the web browser (so it does not � General annotation (Comments) require any state to be stored on the server side). The cart itself has certain actions attached to it such as "Retrieve" � Ontologies (Keywords and Gene Ontology) Using Figure 5 the query builder to add a constraint Using the query builder to add a constraint. Page 7 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 "Customize display" option Figure 6 for UniProtKB search results "Customize display" option for UniProtKB search results. Using hierarchical collection Figure 7 s to browse search results Using hierarchical collections to browse search results. Page 8 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 Using the cart Figure 8 to select items across multiple data sets Using the cart to select items across multiple data sets. � Binary interactions Parts of two sections from the UniProtKB entry view of human tissue-type plasminogen activator (P00750) are � Alternative products shown in Figure 9. � Sequence annotation (Features) How to integrate sequence similarity searches? In addition to text searches, sequence similarity searches � Sequences are a commonly used way to search in UniProt. They can be launched by submitting a sequence in FASTA format, � References or a UniProt identifier, in the "Blast" form of the tools bar. Note that this form is pre-filled with the current sequence � Web resources (Links to Wikipedia and other online when viewing a UniProtKB, UniRef or UniParc entry. Fig- resources) ure 10 shows the sequence similarity search form and results. � Cross-references How to integrate multiple sequence alignments? � Entry information (Meta data including release dates The purpose of the integrated "Align" tool is to allow sim- and version numbers) ple and convenient sequence alignments. ClustalW is used because it is still the most widely used tool, though � Relevant documents (List of documents that refer- it may no longer be the best-performing tool in all cases. ence an entry) The form can be submitted with a set of sequences in FASTA format or a list of UniProt identifiers, or more Describing the information found in UniProtKB/Swiss- likely through the built-in "cart". The form is pre-filled Prot [4] is outside the scope of this paper. Here are some with a list of sequences when viewing a UniProtKB entry improvements that were made over previous attempts to with alternative products or a UniRef cluster. For complex show this data: alignments that require specific options or a specific tool, the sequences can easily be exported into FASTA format � Features and cross-references are categorized. for use with an external alignment tool. Figure 11 shows the multiple sequence alignment form and results. � Features have a simple graphical representation to How to integrate identifier mapping functionality? facilitate a comparison of their locations and extents. There is an identifier mapping tool that takes a list of Uni- � Secondary structure features are collapsed into a sin- Prot identifiers as input and maps them to identifiers in a gle graphic. database referenced from UniProt, or vice versa. An addi- tional supported data set that can be mapped is NCBI GI � Alternative products are listed explicitly in the numbers. Figure 12 shows how RefSeq identifiers can be "sequences" section. mapped to UniProtKB. How to retrieve UniProt entries in batch? � Sections can be reordered or hidden (and these changes are remembered). The batch retrieval tool allows to specify or upload a list of UniProt identifiers to retrieve the corresponding entries (see Figure 13). The available download formats are the greatest common denominator: For example, if the batch Page 9 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 Figure 9 Parts of two sections from the UniProtKB entry view shown at http://www.uniprot.org/uniprot/P00750 Parts of two sections from the UniProtKB entry view shown at http://www.uniprot.org/uniprot/P00750. How to handle job submissions? retrieval request contains both UniProtKB and UniParc identifiers, neither plain text (only available for Uni- The website is a read-only, stateless application, with the ProtKB) nor XML (available for both, but with different exception of the job handling system. When e.g. a data- schemas) will be available, only FASTA and RDF. The set base mapping job is submitted, a new "job" resource is of entries retrieved by their identifiers can then optionally created, and the user is redirected to this job page (which be queried further, using the search engine described pre- will initially just show the job status, later the results). A viously. consequence of this is that if a web server receives a request for a job that it does not have, it needs to ask all other mirrors if they have this job (and if yes, transfer it). Page 10 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 Seq Figure 10 uence similarity search form and results Sequence similarity search form and results. Jobs have unique identifiers, which (depending on the job scale, a robots.txt file [6] is used, and links that should not type) can be used in queries (e.g. to get the intersection of be followed were marked with a "nofollow" "rel" attribute two sequence similarity searches). Recent jobs run by the [7]. The retrieval performance of documents in large col- current user can be listed using the URL http://www.uni lections was optimized until we felt confident that even prot.org/jobs/ (see Figure 14). rapid crawling would not impact the overall responsive- ness of the site too much. Such documents also return a How to deal with web crawlers? "Last-modified" date header when requested. Certain web Search engines are an important source of traffic, perhaps crawlers (e.g. Googlebot) will then on their next visit issue because people now tend to "google" search terms before a conditional "If-Modified-Since" request, so there is no trying more specialized sites. Therefore it is important to need to resend unchanged documents. Since each ensure that search engines are able to index (and keep up resource can now be accessed through one URL only, to date) as much content of the site as possible. However, there is no more redundant crawling of resources (as used web crawlers have trouble finding pages, such as database to be the case with multiple mirrors with different entries, that are part of large collections and are not linked addresses). The request logs and "Google Webmaster from main navigation pages, and when they do, this can Tools" site [8] are used to monitor the behavior of web put a significant load on the site. To ensure that web crawlers on this site. As of July 2008, over 4 M pages from crawlers find all content that was meant to be indexed, the the new site are indexed in Google. content was linked from multiple sources, including over- view documents and machine-readable site maps [5]. To How to avoid breaking links? keep web crawlers away from content that is either not This site publishes a large number of resources (several worth indexing or too expensive to retrieve on a large million) on the Web. These resources are linked from a lot Page 11 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 Multip Figure 11 le sequence alignment form and results Multiple sequence alignment form and results. of other life sciences databases as well as scientific papers. but keep their own web page (e.g. http://www.uni Since tracking down and getting all such links updated is prot.org/uniprot/P00001), with a link to a list of (retriev- not practical, and keeping legacy URL redirection schemes able) previous versions (e.g. http://www.uniprot.org/uni- in place for a long time can be tedious, it is worth invest- prot/P00001?version=*). Specific versions can also be ing some effort into reducing the likelihood that large sets referenced directly (e.g. http://www.uniprot.org/uniprot/ of URLs will have to be changed in future [9]. Technology P00001.txt?version=48), and used in the tool forms (e.g. artifacts, such as "/cgi-bin/" or ".do", are avoided in URLs P00001.48). [10]. Official and stable URLs are no good if they are not used. A lesson learned from the previous sites was that the How to support user-defined customizations? URLs that end up being used are those that are shown in Some simple customizations, such as being able to choose the browser (i.e. mirror-site specific URLs). The new site the columns shown in search results, make the site a lot avoids this problem by having exactly one URL for each more convenient to use. However, we did not want to resource. Another issue is how to deal with individual compromise the statelessness of the application (which is resources that are removed. Obsolete entries, or entries important for keeping the application distributable and that were merged with other entries, in the main data set scalable) by having each request depend on centralized (UniProtKB) no longer disappear from the web interface, user profile data. This was possible by storing basic set- Page 12 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 Mappi Figure 12 ng RefSeq identifiers to UniProtKB Mapping RefSeq identifiers to UniProtKB. tings in client-side cookies. The drawback of this solution How to enable programmatic access to the site? is that such settings are lost when cookies are cleared in People need to be able to retrieve individual entries or sets of entries in various formats and use our tools simply and the browser or when the user switches to another machine. The amount of data that can be stored this way efficiently from within basic scripts or complex applica- is also limited. On the other hand, the customizations are tions. We want to encourage people to build applications simple and easy to redo. This solution also does not that are tailored towards certain user communities on top require people to sign up, which some may be reluctant to of our data – customization options on our site can go do as this introduces some privacy issues. only so far, and anything we build will always be focused on our data. How to let people tailor the web site to their needs? The ideal site would be a "one stop shop" that solves all Early versions of the new site had a complete SOAP [11] our users' needs. However, the life sciences community interface built with Apache Axis [12]. Unfortunately this has very diverse needs, and our data is often just one small interface had poor performance (which necessitated intro- part of these needs. The best we can do is make it as easy ducing limitations such as the maximum number of as possible to retrieve data from this site programmati- entries that could be retrieved in one go), and simple cally, in order to facilitate the development of applica- operations such as retrieving an entry in FASTA format tions on top of our data. ended up being more complex than could be justified. For example, in order to retrieve the data from a Perl script, a special module (SOAP::Lite) had to be installed and Page 13 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 Batch retrieval of a Figure 13 set of UniProtKB entries Batch retrieval of a set of UniProtKB entries. patched, due to quirks with the support for SOAP attach- � Ensures that invalid pages stay out of search engine ments. Meanwhile, there are better SOAP libraries, but indexes they are still more complicated and less efficient to use than doing direct HTTP requests. � Simplifies error handling for people doing program- matic access (no need to have fragile checks for error To ensure that such "RESTful" [13] access is as simple and message strings) robust as possible, the site has a simple and consistent URL schema (explained in a previous section) and returns � Helps detect common problems when analyzing appropriate content type headers (e.g. application/xml for request logs. XML resources) and response codes. Returning appropri- ate HTTP status codes instead of returning 200 OK for all Table 4 lists all response codes that the site might return. requests, even if they fail, has several benefits: The most important distinction are the 4xx response codes, which indicate that there is some problem with the Page 14 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 Figure 14 Recent jobs run by the current user, shown at http://www.uniprot.org/jobs/ Recent jobs run by the current user, shown at http://www.uniprot.org/jobs/. request itself and sending the same request again will the human taxon. This URL can be resolved to either likely fail again, and the 5xx response codes, which indi- http://www.uniprot.org/taxonomy/9606 (human reada- cate a problem with the server. ble representation; returned e.g. when entered in a browser) or http://www.uniprot.org/taxonomy/9606.rdf In addition to the data set-dependent formats, all search (machine-readable representation; returned if the request results can be retrieved as OpenSearch [14] RSS feeds for contains an "Accept: application/rdf+xml" header). integration with external tools such as news feed readers What is the architecture of the site? or Yahoo Pipes [15]. The website was implemented as a pure Java web applica- How to handle complex data? tion that can either be run standalone using an embedded While the needs of most people are met by being able to web server, Jetty [19], or deployed on any Servlet 2.4 com- obtain data in FASTA, tab-delimited or even plain text for- pliant [20] web application server. The components in the mat, some people need to obtain and work with the com- application are configured and connected together using plete structure of the data. This is complicated by the fact the Spring Application Framework [21]. Struts [22] coor- that the data model is complex and changes a lot. RDF dinates the request handling, and pages are rendered as (part of the W3C's Semantic Web initiative [16]) provides XHTML using JSP (2.0) templates. Database entries are a generic graph-like data model that can help address stored in Berkeley DB JE [23] for fast retrieval. Searching these issues [17]. All UniProt data is available in RDF as was implemented with help of the Lucene text search well as XML formats both on our FTP servers (for bulk library [24]. downloads) and on the site. Spring was introduced to remove hard-coded dependen- Resources in RDF are identified with URIs. UniProt uses cies (or hard-coded service lookups) from the code, as this PURLs [18]. For example, http://purl.uniprot.org/taxon was hampering our ability to unit-test the code. Struts was omy/9606 is used to reference and identify the concept of chosen among the plethora of available web application Table 4: Listing of response codes that the site might return Code Description 200 The request was processed successfully. 301 Moved (permanently). Use the new address for future requests 302 Moved (temporarily) 400 Bad request. There is a problem with your input. 404 Not found. The resource you requested doesn't exist. 410 Gone. The resource you requested was removed. 500 Internal server error. Most likely a temporary problem, but if the problem persists please contact the site operators. 503 Service not available. The server is being updated, try again later. Page 15 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 frameworks because it provided some conveniences, such must use to resolve addresses often keep track of what as the automatic population of objects from request resolving name server responds faster (and therefore is parameters, but did not attempt to abstract too much (e.g. most likely the nearest) for a given domain. The tests we needed access to the HTTP request and response showed that this worked to some degree, at least for the objects in order to read and set certain HTTP headers). most frequent users. The drawback of this solution is that Struts is (or was) also a de facto standard and simple to with only two mirrors, no failover is possible when one of learn. Using Berkeley DB JE to store serialized Java objects the sites happens to become unreachable. Given that net- using custom serialization code was by far the most effi- work delays between the U.S. and Europe are not too bad, cient solution for retrieving data that we tested. Extracting reliability was seen as more important. This may change if data from uncompressed text files using stored offsets one or more additional mirrors are set up in more remote might be faster for returning data in specific formats. places. However, the size of the uncompressed files and number of different databases and formats make this approach The current mirror sites deploy the web application on less practical than generating the various representations Tomcat [26] and use Apache [27] as a reverse proxy [28], on the fly. Minimizing the amount of data that is stored as well as for request logging, caching and compressing on disk also ensures that the application can benefit more responses (the latter can have a huge impact on page load from increasing disk cache sizes. times for clients on slow connections). If the application is not available at one site (e.g. while it is being updated), The website application is self-contained and can be run Apache automatically sends requests to another available out of the box with zero configuration for development mirror. The web application has a special health-check and test purposes. Non-Java and compute-intensive tools page that is monitored from local scripts (which notify the such as sequence similarity searches (BLAST), multiple local site administrators if there is a problem), as well as sequence alignments (ClustalW) and database identifier from a commercial monitoring service. This service can mappings are run on external servers. To minimize the also detect network problems and keeps statistics on the footprint of the application further, historical data such as overall reliability and responsiveness of each mirror. entry versions from UniSave [25] is also retrieved Application-level warnings and errors are handled by remotely on demand. Even so, the data including indexes Log4j [29]. Errors trigger notification messages that go occupies almost 140 GB (as of release 14.5). For develop- directly to an e-mail account that is monitored by the ment and testing, a smaller, internally consistent, data set developers (unless an error was triggered by a serious (~2 GB) is generated by issuing queries to a site that has operational issue, it usually indicates a bug in the code). the complete data loaded. Finally, there is a JMX interface that supplements JVM- level information, such as memory use, and Tomcat-sup- How to deploy the site on distributed mirrors? plied information, such as the number of open HTTP con- With the previous setup, each "mirror" site had its own nections, with application information, such as the hit public address. Requests to http://www.uniprot.org were ratios of specific object caches. redirected to http://www.ebi.uniprot.org (EBI), http:// www.pir.uniprot.org (PIR) or http://www.expasy.uni How to manage data and application updates? prot.org (SIB), respectively. The redirection was done with Data can be loaded into the web application simply by client-side HTTP redirects and was based on the top level dropping a data set, or partial data set for incremental domain (TLD) revealed by a reverse DNS lookup of the IP updates, in RDF format into a special directory and wait- address from which a request originated. One major prob- ing for the application to pick up and load the data. How- lem with this setup was that people and web crawlers ever, to save resources, we load all the data on a single bookmarked and linked to the mirror they had been redi- staging server and then distribute the zipped data to the rected to rather than the main site. Tracking down such mirror sites, usually along with the latest version of the links and getting them corrected turned out to be a Sisy- web application. Updates occur every three weeks, in sync phean task. One consequence was that people more often with the UniProt releases. than not ended up neither on the nearest mirror, nor did they get the benefits of failover. The new setup makes use How to reduce the risk that bugs are introduced into the of the fact that you can attach multiple IP addresses (A code? records) to a domain name. Clients will connect more or All code changes risk introducing bugs, which can be less randomly to one address that is reachable. We also time-consuming to fix, especially when not detected right tested whether it was possible to achieve geographic affin- away. To minimize this risk, automated tests need to be ity by having different name servers return different IP set up. The lowest-level testing is done in the form of addresses depending on which mirror was nearer. This "unit" tests. The goal of unit tests is to cover each execu- can work because the "caching" name servers that clients tion path in the code using different input. Single classes Page 16 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 are tested in isolation. Such tests were written using the and 2. by analyzing the request logs (which include the JUnit testing framework [30]. The initial test coverage was duration of each request) at the end of each month. While quite low, as there was no simple way to untangle classes the former allows immediate action to be taken, the latter for isolated tests. This was improved by introducing a can help to detect more subtle performance issues. "dependency injection" framework [21] to remove hard- coded dependencies. Unit tests are complemented with How to ensure that people will know how use the site? "functional" tests. We created a set of test scripts with the For all but the most trivial functions it is difficult to pre- open source tool Selenium [31]. These tests can be played dict whether people will be able to figure out how to use back in any recent browser and simulate typical user inter- them. The most effective way to answer such questions is actions with the site. Reducing the playback speed allows to do "usability" tests [35]. We managed to recruit a dozen semi-automatic testing, where a person watches the test or so volunteers who let us watch them use the site to execution to catch layout glitches that would be difficult accomplish certain tasks. Most of them had some kind of to catch with fully automated tests. In addition to testing life sciences background, however, not all of them had the site prior to releases, these tests can be used for been working with our data on a regular basis. The tests browser compatibility testing. We attempted to ensure took on the following form: Two people, one to ask ques- that the site works well with the most popular browser tions, the other to take notes, went to the volunteer's and operating system combinations (e.g. Internet workplace (if possible; this ensured that people had the Explorer 6 and 7 on Windows, Firefox 2 on Linux) and setup they were used to and felt comfortable). Some brief acceptably with other, recent browsers (e.g. Safari on Mac background questions were asked to establish how famil- OS X). Another major "functional" test is loading all data iar the person already was with our data, what services and into the site. Given that this is a long procedure, a smaller tools they had used in the past, etc. Based on this informa- test data set is often used instead to verify that the import tion they would then be asked to accomplish certain tasks procedure is working. Other one-off tests included setting on the site. We tried as best as possible to avoid putting up and using a tool to compare search results returned by the user under pressure. The other difficulty was avoiding the new and the old sites. phrasing tasks in terms of the concepts we were using. For example, when asked to fill out the "contact form", people How to ensure that the site will have adequate would immediately find the "contact" link. But if we performance? phrased the question in different terms, such as "send Basic performance goals were established based on num- feedback", success was less guaranteed. Testing sessions bers obtained from the request logs of the previous sites. were between half an hour and one hour and helped settle Even though the application is stateless (i.e. no state is and open up quite a few "design" discussions. stored on the server) and can therefore be scaled out hor- izontally (i.e. by buying more machines), the stated goal How to keep track of what is being used? was to be able to support the full load on a single, power- In order to know how the site is being used, and which ful, but "off-the shelf" machine. Following are some load parts of the site are working well and which are not, it is tests that were performed to identify potential issues and essential to collect data on the site's usage. Usability test- help build confidence in the application: ing is invaluable and can help pinpoint certain issues, but it is time-consuming and therefore unsuitable to collect � Retrieve random database entries large sample sizes. Fortunately, some basic information on user interactions with the site can be recorded through � Execute random queries the web server log files. In addition to the standard infor- mation logged for each request (such as the exact time and � Download large result sets path of the request, the response code, the IP address, referrer and user agent), the web server was configured to � Simulate initial requests to the home page including also record the duration of the request, the number of all required static resources search results returned (if the request was a query) and the content-type of the response (this could often be inferred The tests were performed with a combination of shell from the extension of the requested resource, but not scripts and the httperf tool [32]. Performance numbers, always). The request logs are collected from all mirror sites such as response times, were analyzed with R [33]. Once a once a month, cleaned up a bit and loaded into a simple performance issue was found, problematic code was star schema in a relational database. This allows us not tracked down with the help of a commercial profiling tool only to get general usage statistics (e.g. total number of [34]. Performance issues on the production servers are requests for different resources, showing the percentage of caught 1. by an external monitoring tool that records automated requests), but also to look for problems (e.g. response times for requests to a general health-check page, queries that do not produce any results or fail) and even Page 17 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 help to set annotation priorities (e.g. by looking at what plementing the conventional feedback forms with a com- are the most frequently requested entries that have not menting system. Other more complex approaches, such as been reviewed yet, or not updated in a long time). using wiki software, are under investigation as well [37]. Another complementary tool that is used to gather statis- Looking beyond UniProt: Much of the development effort tics is Google Analytics [36], which records data through was spent on issues that are not UniProt-specific, ranging JavaScript embedded in the web pages. The advantage and from handling identifiers in a stable way to most of the drawback of this approach is that it does not record auto- code for the search engine. These issues are likely to be rel- mated requests, such as those issued by web crawlers, or evant for other life sciences databases as well. Had there requests to non-HTML resources. While Google Analytics been some kind of framework that provided these fea- was left enabled during all of the beta phase, it is now only tures, the development time could have been reduced sig- enabled from time to time. We use it to check the (less nificantly. This seems especially important for smaller accurate) number of non-robot requests reported via the databases that may not have resources to reinvent the request log analysis procedure, which relies on user agent wheel. It may therefore be worth incorporating some of strings matches and the patterns need to be updated from the solutions we have come up with here into a frame- time to time. It also helps us to get an idea of the browsers work (or add them to existing frameworks). and screen resolutions people are using, information which is less accurate or impossible to get via the web Conclusion server request logs. Google Analytics can provide fast and The new UniProt website makes accessing and under- convenient feedback on the impact of certain changes, as standing UniProt easier than ever. The two main lessons data is updated at least once a day, but it can also slow learned are that 1. getting the basics right for such a data down the perceived page loading time and aggravate cer- provider website (and likely others as well) has huge ben- tain privacy concerns. efits, but is not trivial and easy to underestimate, and that 2. there is no substitute for using empirical data through- How to "go live" with minimal casualties? out the development process to decide on what is and Despite all the load and usability testing, there was no what is not working for your users. We hope to encourage guarantee that switching over all the old sites to the new more people in the life sciences community to resist the site in one go would not swamp us with more technical temptation to spend time adding bells and whistles to an issues and irate users than we could possibly handle at application before getting the basics right, and to put in once. Having a prolonged "beta" period allowed us to get place rigorous procedures for assessing whether or not the feedback from many users and to ramp up the number of site is serving its users well. people (and web crawlers, etc) using the new site gradu- ally by taking these steps (moving to the next step when- Availability and requirements ever we felt comfortable enough to do so): http://www.uniprot.org/ is open for both academic and commercial use. The site was built with open source tools 1. Sending out invitations to certain people and and libraries. groups to use the new site. Abbreviations HTTP: Hyper-Text Transfer Protocol; JMX: Java Manage- 2. Getting the site indexed in Google. ment Extensions; JVM: Java Virtual Machine; RDF: 3. Linking the old sites to the new site. Resource Description Framework; REST: REpresentational State Transfer; RSS: Really Simple Syndication Format; 4. Switching over the old sites one by one. TBD: To Be Done; TLD: Top-Level Domain; URI: Uniform Resource Locator; W3C: World Wide Web Consortium; XHTML: eXtensible HTML. Discussion Collecting data and optimizing the use and performance of the website is an ongoing process. Authors' contributions EJ carried out most of the design and development work One of the biggest challenges UniProt is facing is getting and drafted the manuscript. IP and SD participated in the more community involvement to help cope with the development. EG, NR, AB, MJM, PM and BS coordinated increasing amount and complexity of data. Simply having design, requirements and specifications. EG, IP and NR more people give feedback when they see incorrect or helped to draft and critically revised the manuscript. All missing data in UniProt would already be a huge improve- authors read and approved the final manuscript. ment. One possible approach under investigation is to make such feedback more rewarding by replacing or sup- Page 18 of 19 (page number not for citation purposes) BMC Bioinformatics 2009, 10:136 http://www.biomedcentral.com/1471-2105/10/136 Acknowledgements UniProt is mainly supported by the National Institutes of Health (NIH) grant 2 U01 HG002712-04. Swiss-Prot activities at the SIB are supported by the Swiss Federal Government through the Federal Office of Education and Science, and the European Commission contract FELICS (EU no. 021902). Special thanks to the many people within and outside of the UniProt con- sortium who spent time providing early feedback on the site, or consented to participating in one of the usability tests. References 1. The UniProt Consortium: The Universal Protein Resource (UniProt). Nucleic Acids Res 2008, 36:D190-D195. 2. Hoekman R Jr: Designing the Obvious: A Common Sense Approach to Web Application Design Berkeley: New Riders Press; 2006. 3. Wink [http://www.debugmode.com/wink/] 4. Bairoch A, Boeckmann B, Ferro S, Gasteiger E: Swiss-Prot: Juggling between evolution and stability. Brief Bioinform 2004, 5:39-55. 5. Sitemap Protocol 0.9 [http://www.sitemaps.org/] 6. Robots Exclusion Standard [http://www.robotstxt.org/] 7. rel-nofollow Microformat [http://microformats.org/wiki/rel- nofollow] 8. Google Webmaster Central [http://www.google.com/webmas ters/] 9. Berners-Lee T: Cool URIs don't change. [http://www.w3.org/ Provider/Style/URI]. 10. urlrewrite [http://code.google.com/p/urlrewritefilter/] 11. Simple Object Access Protocol (SOAP) [http://www.w3.org/ TR/soap/] 12. Apache Axis [http://ws.apache.org/axis/] 13. Fielding R: Architectural Styles and the Design of Network- based Software Architectures. In PhD thesis University of Cali- fornia, Irvine, Information and Computer Science; 2000. 14. OpenSearch [http://www.opensearch.org/] 15. Yahoo Pipes [http://pipes.yahoo.com/pipes/] 16. W3C Semantic Web Activity [http://www.w3.org/2001/sw/] 17. UniProt RDF [http://dev.isb-sib.ch/projects/uniprot-rdf/] 18. OCLC PURL [http://purl.org/] 19. Jetty [http://www.mortbay.org/jetty-6/] 20. Java Servlet Technology [http://java.sun.com/products/servlet/] 21. Spring Application Framework [http://www.springframe work.org/] 22. Struts [http://struts.apache.org/] 23. Berkeley DB JE [http://www.oracle.com/database/berkeley-db/je/] 24. Lucene [http://lucene.apache.org/java/] 25. Leinonen R, Nardone F, Zhu W, Apweiler R: UniSave: the Uni- ProtKB Sequence/Annotation Version database. Bioinformat- ics 2006, 22:1284-1285. 26. Apache Tomcat [http://tomcat.apache.org/] 27. The Apache HTTP Server Project [http://httpd.apache.org/] 28. Apache Module mod_proxy [http://httpd.apache.org/docs/2.0/ mod/mod_proxy.html] 29. Apache Log4j [http://logging.apache.org/log4j/] 30. JUnit [http://junit.org/] 31. Selenium Web Application Testing System [http://sele nium.openqa.org/] 32. httperf [http://www.hpl.hp.com/research/linux/httperf/] Publish with Bio Med Central and every 33. The R Project for Statistical Computing [http://www.r- scientist can read your work free of charge project.org/] 34. JProfiler [http://www.ej-technologies.com/] "BioMed Central will be the most significant development for 35. Dumas JS, Redish JC: A Practical Guide to Usability Testing Westport: disseminating the results of biomedical researc h in our lifetime." Greenwood Publishing Group Inc; 1999. Sir Paul Nurse, Cancer Research UK 36. Google Analytics [http://www.google.com/analytics/] 37. Mons B, Ashburner M, Chichester C, van Mulligen E, Weeber M, den Your research papers will be: Dunnen J, van Ommen GJ, Musen M, Cockerill M, Hermjakob H, available free of charge to the entire biomedical community Mons A, Packer A, Pacheco R, Lewis S, Berkeley A, Melton W, Barris N, Wales J, Meijssen G, Moeller E, Roes PJ, Borner K, Bairoch A: peer reviewed and published immediately upon acceptance Calling on a million minds for community annotation in cited in PubMed and archived on PubMed Central WikiProteins. Genome Biology 2008, 9:R89. yours — you keep the copyright BioMedcentral Submit your manuscript here: http://www.biomedcentral.com/info/publishing_adv.asp Page 19 of 19 (page number not for citation purposes)

Journal

BMC BioinformaticsSpringer Journals

Published: May 8, 2009

There are no references for this article.