P2P proteomics -- data sharing for enhanced protein identification

P2P proteomics -- data sharing for enhanced protein identification Background: In order to tackle the important and challenging problem in proteomics of identifying known and new protein sequences using high-throughput methods, we propose a data-sharing platform that uses fully distributed P2P technologies to share specifications of peer-interaction protocols and service components. By using such a platform, information to be searched is no longer centralised in a few repositories but gathered from experiments in peer proteomics laboratories, which can subsequently be searched by fellow researchers. Methods: The system distributively runs a data-sharing protocol specified in the Lightweight Communication Calculus underlying the system through which researchers interact via message passing. For this, researchers interact with the system through particular components that link to database querying systems based on BLAST and/or OMSSA and GUI-based visualisation environments. We have tested the proposed platform with data drawn from preexisting MS/MS data reservoirs from the 2006 ABRF (Association of Biomolecular Resource Facilities) test sample, which was extensively tested during the ABRF Proteomics Standards Research Group 2006 worldwide survey. In particular we have taken the data available from a subset of proteomics laboratories of Spain’s National Institute for Proteomics, ProteoRed, a network for the coordination, integration and development of the Spanish proteomics facilities. Results and Discussion: We performed queries against nine databases including seven ProteoRed proteomics laboratories, the NCBI Swiss-Prot database and the local database of the CSIC/UAB Proteomics Laboratory. A detailed analysis of the results indicated the presence of a protein that was supported by other NCBI matches and highly scored matches in several proteomics labs. 
The analysis clearly indicated that the protein was a relatively high concentrated contaminant that could be present in the ABRF sample. This fact is evident from the information that could be derived from the proposed P2P proteomics system, however it is not straightforward to arrive to the same conclusion by conventional means as it is difficult to discard organic contamination of samples. The actual presence of this contaminant was only stated after the ABRF study of all the identifications reported by the laboratories. Background changes constantly through its biochemical interactions Proteomics studies the quantitative changes occurring in with the genome and the environment, while the gen- a proteome and its application for disease diagnostics ome of an organism is rather constant. and therapy, and drug development. It examines pro- Proteins are large linear chains of amino-acids (resi- teins at different levels, including their sequences, struc- dues). Thesequenceofamino-acids inaproteinis tures and functionalities, and it is considered the next directly translated from the information encoded in the step in the study of biological systems, after genomics. It genome. However, a proteome is more complex than a is much more complicated than genomics mostly genome. One organism has radically different protein because the proteome differs from cell to cell and expression in different parts of its body, different stages of its life cycle and different environmental conditions * Correspondence: marco@iiia.csic.es (e.g., in humans there are about 20,500 identified genes Artificial Intelligence Research Institute, IIIA-CSIC, Spain but an estimate of more than 500,000 proteins that are Full list of author information is available at the end of the article © 2012 Schorlemmer et al.; licensee BioMed Central Ltd. 
This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Schorlemmer et al. Automated Experimentation 2012, 4:1 Page 2 of 17 http://www.aejournal.net/content/4/1/1 derived from these genes [1]). This is mainly caused by algorithms cannot become a solution of the problem mRNA alternative splicing processes and by the possibi- because of intrinsic technical limitations. Once a protein lity of residues in a protein being chemically altered in has been sequenced de novo, one can look for similar post-translational modification (PTM), either as part of proteins in a GTDB using a matching algorithm such as the protein maturation processes before the protein BLAST [2] or FASTA [3]; or, alternatively, one can use takes part in the cell’s functionalities, or as part of con- an algorithm such as OMSSA [4] to match spectra trol mechanisms. The discrepancy implies that protein directly to sequences of a GTDB. Mass spectra identification is usually carried out by diversity cannot be fully characterized by gene expres- mixing and combining these two techniques. However, sion analysis. Thus, proteomics is necessary for a better characterization of cells and tissues, and for manufactur- among other factors, the following issues complicate ing improved drugs and medicines. this task: the number of possible PTMs can multiply the amount of results to be analysed; bad quality and noise Protein Identification in Proteomics in mass spectra increase the uncertainty of interpreta- One important and challenging task in proteomics is the tion; and database errors in sequence annotations can identification of proteins, that is, the recognition of the lead to misunderstandings in the identification. 
Conse- sequenced protein if the protein is known, or its discov- quently, we get a huge amount of apparently useless ery if it is unknown. For this, protein sequences are data (for instance, non-matching mass spectra or low- stored in public databases (such as nrNCBI, UniProt,or scoring de novo interpreted sequences), which most of Genpept). However, they are mostly produced by the the times are simply discarded. As a result, this data is direct translation of gene sequences. This means that seldom accessible to other groups involved in the identi- neither proteins with post-translation modifications fication of the same or homologous proteins. Our con- (PTM) nor proteins whose genomes have not been viction is that we can benefit from this kind of data sequenced would find exact matches in such databases. making it available as searchable repositories for other A key experimental technique for the identification of laboratories. If we compared data coming from different proteins is mass spectrometry (MS). Mass spectra pro- laboratories then we would be able to eventually dis- vide very detailed fingerprints of the proteins contained cover new matches. The discovery of matches would in a given sample. In the so called shotgun approach, contribute to further discriminate between really waste MS is often combined with cutting-edge separation data and possibly good data. We envision many advan- technologies to allow large-scale analysis of proteomes. tages with this new methodology, as other laboratories could provide the missing information for an incomplete For this, proteins are extracted from cells and tissues, enzymatically digested, andthe resultingpeptides spectrum or sequence, making a proteine identification (shorter amino-acids chains) separated by multidimen- process succeed; or even more, matches could help to sional liquid chromatography techniques. As the pep- recognize new proteins or identify PTMs. 
tides are separated, they are on-line injected into the mass spectrometer, where they are ionized, fragmented P2P Networks for Proteomics and these fragments mass-monitored to produce a spe- We propose a new scenario where the information to be cific sequence fingerprint. searched is no longer centralised in a few repositories, Identification of the huge amount of spectra produced but where information gathered from experiments in by current state-of-the-art high-throughput analysis is peer proteomics laboratories can be searched by fellow one of the major tasks for proteomics laboratories. researchers. To avoid centralising all data into a single Mainly two popular bioinformatics techniques are repository –with all the problems that such centralisa- involved in this effort. The first one takes advantage of tion would entail–, it is better to maintain the informa- public genome-translated databases (GTDB) that can be tion locally at each of the proteomics laboratories. As a accessed through data-mining software (search engines), result, this decentralised data storage needs a decentra- which directly relates mass spectra with database lised search mechanism. The use of peer-to-peer (P2P) sequences. Most of these search engines (Mascot, X! technologies fits our needs. Tandem, SEQUEST, OMSSA) are available both as A P2P network provides methods for accessing dis- stand-alone programs that consult a local copy of a tributed resources with minimal maintenance cost. It GTDB, or as web-services connected to online GTDBs. also provides scalable techniques to search through The limitations, once again, lie in their capability of large amounts of resources scattered through the net- identifying missing PTMs or unsequenced genomes. work. Furthermore, joining or leaving the network The latter case is addressed applying de novo interpreta- becomes a simple task. 
These properties of P2P net- tion algorithms that yield a sequence for a given mass works make the technology an ideal candidate to imple- spectrum, thus avoiding any database search. But these ment a distributed search mechanism in a network of Schorlemmer et al. Automated Experimentation 2012, 4:1 Page 3 of 17 http://www.aejournal.net/content/4/1/1 proteomics labs. Other distributed storage systems such execute them on their local machines. At this point the as distributed databases or federated storage services application is said to be implemented. have been developed with efficiency in mind, and the After the application is implemented, it can be exe- maintenance and joining costs for these solutions are cuted on top of the OK system. For this purpose, the very high. users wanting to interact as specified in the given peer- A proteomics laboratory acting as a peer in a P2P net- interaction protocol by playing one of the roles will sub- work would be able to share its complete or partial data scribe the appropriate OKCs to it. The discovery service is in charge of managing these subscriptions, and when repository –e.g., mass spectra and de novo interpreted it gathers enough of these to satisfy all the necessary sequences– so that other peers can benefit from it. In addition, in order to find matches among data coming roles in the protocol, it sends this information to a from different peers, the interacting peers of such a P2P designated peer acting as the protocol coordinator who network would need also to validate and cross check the will start managing the peer interaction by asking each consistency of the information obtained by fellow peers. of the components to provide the services when In this article, we describe an approach that imple- required by the interaction protocol. 
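The subscription step of the OK workflow (peers subscribe to roles, and the discovery service hands the interaction over to a coordinator once every role is filled) can be illustrated with a small sketch. This is not the OpenKnowledge API: the class name `DiscoveryService`, the method `subscribe`, and the peer identifiers are hypothetical, and a plain in-memory dictionary stands in for the distributed hash table.

```python
from collections import defaultdict


class DiscoveryService:
    """Toy, in-memory stand-in for the discovery service (hypothetical API)."""

    def __init__(self, required_roles):
        # Roles that must all be filled before an interaction can start.
        self.required_roles = set(required_roles)
        # Role name -> list of subscribed peer identifiers.
        self.subscriptions = defaultdict(list)

    def subscribe(self, role, peer_id):
        """Record a subscription; return the role assignment once all roles are filled."""
        if role not in self.required_roles:
            raise ValueError(f"unknown role: {role}")
        self.subscriptions[role].append(peer_id)
        if all(self.subscriptions[r] for r in self.required_roles):
            # This is the information that would be passed to the coordinator.
            return {r: list(self.subscriptions[r]) for r in sorted(self.required_roles)}
        return None


service = DiscoveryService(["researcher", "omicslab"])
first = service.subscribe("omicslab", "lab-1")      # researcher role still empty
second = service.subscribe("omicslab", "lab-2")     # researcher role still empty
ready = service.subscribe("researcher", "peer-A")   # every role now has a subscriber
```

In this sketch `first` and `second` are `None`, and only the third subscription yields a complete role assignment, mirroring the point at which a coordinator would take over.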
ments such a P2P network on top of the OpenKnow- ledge (OK) system [5,6], which was developed in the The Lightweight Coordination Calculus scope of the European OpenKnowledge project [7]. For the case at hand, the developer has to specify a pro- tocol of the peer interaction defining the roles each per- The OpenKnowledge System ticipating peer has to play, the sort of messages sent The OpenKnowledge (OK) system is a fully distributed amongst them, and the particular constraints to be system that uses P2P technologies in order to share solved by the OKCs enacting these roles. Several model- peer-interaction protocols and service components ling languages such as those reviewed in [8] could have across the network. For this, a kernel module – the OK been chosen. Our aim, however, is to use the most kernel– needs to be installed in each machine that is to easily applied formal language for this engineering task be connected to the system. We shall call the protocols that we could conceive and for which an executable and service components to be shared generically Open- peer-to-peer environment already exists, choosing thus Knowledge Components (OKCs). Furthermore, these ser- the Lightweight Coordination Calculus (LCC) [9]. vices are executed and coordinated using the same set LCC is the executable interaction modelling language of tools. In the Methods section below we will show underlying the OK system. It is used to constrain inter- how the tools of the OK system are used to implement actions between distributed components and is neutral the proteomics P2P application. The OK system consists to the infrastructure used for message passing between of three main services which can be executed by any components, although for the purposes of this paper we computer running the OK kernel: assume components are peers in some form of peer-to- � a discovery service consisting of a distributed hash peer network. 
table (DHT), by which peer-interaction protocols and For example, Figure 1 shows the specification in LCC other OKCs are stored, so that they can be located and of the protocol for sequenced MS spectra sharing that downloaded by users; we will describe in detail later in the Methods section. It � a coordination service, which manages the peer is based on a simple query-answering protocol between interactions between OKCs; and one inquirer and many repliers. � an execution service, which is capable of executing An LCC specification describes (in the style of a pro- the offered service by means of the OK kernel at the cess calculus) a protocol for interaction between peers local machine. in order to achieve a collaborative task. The nature of The workflow for implementing a new application on this task is described through definitions of roles, with top of the OK platform is as follows. First, a specifica- each role being defined as a separate LCC clause. The tion defining the interaction protocol linking different set of these clauses forms the LCC interaction model. services has to be defined. This specification is pub- An interaction model provides a context for each mes- lished to the discovery service so that other users can sage that is sent between peers by describing the current find it and can execute OKCs capable of playing the state of the interaction (not of the peer) at the time of roles specified in the peer-interaction protocol. A devel- message passing. Coordination is achieved between oper, not necessarily the one that specified the protocol peers by communicating this state along with the appro- originally, will develop the OKCs that are to play the priate messages. Since roles are independently defined roles defined in the protocol specification. 
Some of within an interaction model, it is possible to distribute these OKCs may be shared across the network by pub- the computation to peers performing roles indepen- lishing them to the discovery service, so others can also dently, with synchronisation occurring only through Schorlemmer et al. Automated Experimentation 2012, 4:1 Page 4 of 17 http://www.aejournal.net/content/4/1/1 Figure 1 LCC specification of the protocol for sequenced MS spectra sharing. message passing. Should the application demand it, however, LCC can also be used in more centralised, ser- ver-based style. Figure 2 shows the main definitions of LCC’ssyntax. A detailed discussion of LCC, its semantics, and the mechanisms used to deploy it, lies outside the scope of this paper. For these, the reader is referred to [9]. In this paper, though, we explain enough of LCC to Figure 2 Syntax of LCC. demonstrate how to represent interactions. Schorlemmer et al. Automated Experimentation 2012, 4:1 Page 5 of 17 http://www.aejournal.net/content/4/1/1 An interaction model in LCC is a set of clauses, each sharing. There is no restriction in the OK system to pre- of which defines how a role in the interaction must be vent locally produced OKCs from being published and performed. Roles are described in the head of each downloaded by other users. clause by the type of role (and its parameters) and an identifier for the individual peer undertaking that role. Methods Clauses may require subroles to be undertaken as part To show the viability of a P2P-based data-sharing envir- of the completion of a role. The definition of perfor- onment for the task of protein identification in proteo- mance of a role is constructed using combinations of mics we first specify in LCC a protocol for sharing the sequence operator (’then’)or choice operator (’or’)to sequenced MS spectra among peer laboratories. Then connect messages and changes of role. 
Messages are we describe the OpenKnowledge components (OKCs) terms, and are either outgoing to another peer in a that we have implemented to play the roles specified in given role (’ => ‘) or incoming from another peer in a the protocol, and finally we recount an actual experi- given role (’ <= ‘). Message input/output or change of ment carried out with the OK system. The aim of the role can be governed by a constraint to be solved before experiment is to serve as proofofconcept for applying (when at the right of ‘ <-’) or after (when at the left of ‘ P2P technology to the task of protein and peptide iden- <-’) message passing or role change. Constraints are tification. As such, we do not claim that the experiment defined using the normal logical operators for conjunc- proves that the OK system for P2P-based data-sharing tion, disjunction, and negation. If they are subject to fail, significantly improves all current standard protein and the interaction may proceed along alternative paths (e.g., peptide identification protocols based on centralised those specified with operator ‘or’). Notice that there is database search. The data available for the experiment is no commitment to the system of logic through which insufficient in order to come to such conclusion. How- constraints are solved –on the contrary we expect differ- ever, we do show in the Results and Discussion section ent peers to operate different constraint solvers. below that by using a P2P-based data-sharing environ- A protocolliketheoneinFigure1is genericinthe ment such as the one proposed in this article, research- sense that it gives different interactions depending on ers gain valuable information that allows them to raise how the variables (starting with a capital letter) in the the confidence of their identification task. 
clauses are bound at run time –this depending on the For an enhanced selection of those peer laboratories choices made by peers when satisfying the constraints that are to participate in the data-sharing protocol, we within these clauses. have added a confidence evaluation mechanisms that varies over time, and which is based on the expected answer accuracy of peer laboratories. OpenKnowledge Components To complete the application, we need also an imple- mentation of the OKCs enacting each of the roles. For Protocol Specification the protocol specified in Figure 1, this means two Figure 1 shows an LCC specification of a protocol that OKCs. One has to enact the researcher role as specified guides peer laboratories in their search of each other’s in the first two clauses, and another one has to enact locally stored proteomic data files. This is only one of the omicslab role as specified in the third clause. As a many possible protocols of this kind. LCC protocols are result each OKC will need to be able to solve the con- declarative specifications, and as such they are neutral straints occurring in their respective role specification. to the specifics of a protocol execution. The only For instance, for the omicslab role, the relevant OKC requirement is that all peers in the network that are to must be able to solve the constraint findHit(...).There- interact by means of a given interaction protocol should fore, its implementation must provide at least a findHit be capable of doing so. This capability amounts to (a) method. This method should search the local database running a local copy of the OK kernel, and (b) having a for data that matches a given query. Obviously, this local implementation of an OKC capable of resolving implementation will be tightly coupled to the local the constraints relevant to the role a peer is playing in machinery, the file format used for storing this informa- the protocol. 
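As a rough illustration of what "a local implementation capable of resolving the constraints of a role" might look like for the omicslab role, the sketch below implements a findHit-style lookup over an in-memory list of sequences. Everything in it is an assumption made for illustration: the class and field names are hypothetical, and plain substring matching replaces the real search engines (BLAST or OMSSA) that an actual omicslab OKC would call against the laboratory's own storage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SequenceRecord:
    identifier: str   # hypothetical local accession
    sequence: str     # amino-acid sequence in one-letter code


class ToyOmicsLab:
    """Illustrative replier; not the OpenKnowledge component interface."""

    def __init__(self, records):
        self.records = list(records)

    def find_hit(self, query):
        """Return all local records whose sequence contains the query string."""
        query = query.strip().upper()
        if not query:
            return []
        return [r for r in self.records if query in r.sequence.upper()]


lab = ToyOmicsLab([
    SequenceRecord("seq-001", "MKTAYIAKQR"),
    SequenceRecord("seq-002", "GAVLIPFMW"),
    SequenceRecord("seq-003", "TAYIAK"),
])
hits = lab.find_hit("tayia")
hit_ids = [h.identifier for h in hits]
```

Because the matching rule lives entirely inside `find_hit`, each laboratory could swap in its own storage and search back end without changing how the role is played, which is the portability concern raised in the text.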
tion, and the type of storage system from where it has For our proof of concept we have specified a protocol to be retrieved. This is an obstacle for the portability of –and implemented the required OKCs– for sharing OKCs across different laboratories. Consequently, it is sequenced MS spectra among peer laboratories. Ideally, advisable that each laboratory develops its own particu- to obtain the advantages of peer-based MS spectra shar- larOKC forthe omicslab role to be played, adjusted to ing as outlined in the Background section above, we its own system requirements. However, standard OKCs should ultimately aim at querying and sharing MS spec- for the most common formats and mass spectrometers tra directly. However, for our proof of concept for vali- could be made publically availabe for dwonload and dating the potential gain of peer-based proteomic data Schorlemmer et al. Automated Experimentation 2012, 4:1 Page 6 of 17 http://www.aejournal.net/content/4/1/1 sharing, we have first targeted the implementation of a response it receives in the message from the labora- system for sequenced MS spectra sharing. Since the pro- tory, and finally runs a recursive call to the rest of tocol models a simple query-answering interaction the list. When the list of laboratories is empty it between one inquirer and many repliers, its application returns an empty list of results. We could have to MS spectra sharing will depend on the availability of decided alternatively to specify that queries ought to OKCs that implement searching based on spectrum-to- be sent out in parallel to all selected laboratories. This would be the obvious choice to speed up the spectrum matching. The OKCs we have implemented so querying process, but this is not relevant for the far for our proof of concept, allow searching based on objectives of this article. sequence-to-sequence matching (by means of BLAST) and spectra-to-sequence matching (by means of OMSSA). 
Constraints of the researcher role that require user For the actual specification of the protocol, two main input –such as selecting candidate laboratories or writ- roles are needed, one for the inquirer, which in the pro- ing the query– and generate output to the user –such tocol specification has been termed researcher (the as displaying results– aredoneviaso call visual con- clause headed by a(researcher, Researcher)::), and straints. That is, these constraints are annotated in the another for the replier called omicslab, which will be LCC specification to be solved by means of domain-spe- replying to the queries (the clause headed by a(omicslab, cific GUIs. In our case here they are specially tailored OmicsLab)::). We will start explaining the latter role for sequenced MS spectra sharing. first, which is simpler. OpenKnowledge Components (OKCs) � omicslab: A peer in this role waits for a message with In the following we describe the implementation of a query from a peer playing the researcher role, then OKCs that ground the enactment of the protocol of Fig- solves this query by executing the findHit constraint ure 1. As mentioned above, our implementation so far that finds all matching hits in its local database, and allows for searching that is based on sequence-to- finally sends these hits back to the researcher peer via sequence matching (by means of BLAST) and spectra- another message. This is specified as a conditional to-sequence matching (by means of OMSSA). With this message-passing action that is only carried out when initial implementation we are capable to run our experi- the findHit constraint can be satisfied. ment that serves as proof of concept of the proposed � researcher: A peer in this role acts as the inquirer, P2P proteomics data sharing environment. and the role makes use of a researcher subrole that The researcher OKC includes additional parameters. 
A peer in the main This OKC implements the constraints relevant for a researcher role (the one without parameters) asks the peer that wants to participate in the peer interaction user for a query. It does so by launching a input GUI playing the researcher role. Hence, the OKC’s main task with the constraint getQuery and then iterating is to ask the user for the proteomic query to solve, for- through all the selected proteomics labs participating ward it to the laboratories, fetch the results, and present in the peer interaction (obtained via constraints getO- them back to the user. micsLabRole, getPeers and selectLabs). It aggregates The getOmicsLabRole(RoleName) and getPeers(Role- all the different results and displays them to the user Name, LabList) constraints are used to get the list of through an output GUI that is launched by solving omics lab peers participating during a particular enact- the constraint showResults.Noticethatall thesecon- ment of the protocol. The selectLabs(LabList, Selecte- straints are conditions of an empty message-passing dLabList) filters those laboratories to which users want action labelled null. (This is syntactical requirement to sent the proteomic query. It shows a GUI (Figure 3) of LCC: constraints always go together with a mes- by which users identify and select the desired peer sage passing action, which can be the empty one.) laboratories. The iteration through all the omics laboratories is The getQuery(SearchType, SearchArguments, Input- currently specified to be done via a recursive helper Format,Input)constraintasks the user for the proteo- subrole, which receives the query and a list of omics mic query to solve. It requires four arguments that need laboratory. 
This role change is executed after getting to be provided by the user: all the identifiers of the peers playing the omicslab role by means of the getPeers constraint, which is � SearchType: the type of search to be performed executed by the protocol coordinator, who is the (BLAST or OMSSA). peer holding this information. In this subrole, the � SearchArguments: the parameters to be used by the peer first sends a message containing the query to laboratories when executing their locally installed the first laboratory in the list, then aggregates the search engines. Schorlemmer et al. Automated Experimentation 2012, 4:1 Page 7 of 17 http://www.aejournal.net/content/4/1/1 Figure 3 GUI for selecting peer laboratories for the data- sharing protocol. Figure 5 GUI for building OMSSA queries. � InputFormat: the proteomic search engines (BLAST and OMSSA) allow different input formats, (LabList, SearchType, SearchArguments, InputFormat, this argument is used to inform the search engines Input) subrole. This role iterates the list given to LabList about the format used in the input. using recursion; at each iteration the message query � Input: the proteomic sequences (if BLAST is used) (SearchType, SearchArguments, InputFormat, Input) is or mass spectra (if OMSSA is used) that constitute sent to a laboratory of the list, a(omicslab, H), and the the input to the search engines. researcher peer waits for the laboratory peer’sresponse message answer(Result, ResultInfo) to aggregate it. To solve the getQuery(SearchType, SearchArguments, When all the various results from the laboratories InputFormat, Input) constraint, a custom visualisation have been collected, the processResults(End) constraint (Figures 4 and 5) is shown to the user. With these GUIs is invoked. 
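The recursive researcher subrole (send the query to the first laboratory, aggregate its answer, recurse on the rest, return an empty list when no laboratories remain) can be sketched in a few lines. This is only an analogy in Python, not LCC and not the authors' code: the `Query` tuple mirrors the four getQuery arguments, each laboratory is modelled as a plain function, and the lab names and argument values are made up.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class Query(NamedTuple):
    search_type: str        # "BLAST" or "OMSSA"
    search_arguments: str   # parameters handed to the lab's search engine
    input_format: str       # format of the input payload
    input_data: str         # sequences (BLAST) or mass spectra (OMSSA)


# A lab is modelled as a function from a query to (result, result_info).
Lab = Callable[[Query], Tuple[str, str]]


def query_labs(lab_names: List[str], labs: Dict[str, Lab],
               query: Query) -> List[Tuple[str, str, str]]:
    """Query each lab in turn, recursively, and aggregate the answers."""
    if not lab_names:                  # empty list: empty list of results
        return []
    head, *tail = lab_names
    result, info = labs[head](query)   # send query(...), wait for answer(...)
    return [(head, result, info)] + query_labs(tail, labs, query)


def make_lab(name: str) -> Lab:
    # Hypothetical lab that simply reports what it was asked to run.
    return lambda q: (f"{name} ran {q.search_type}", f"format={q.input_format}")


labs = {name: make_lab(name) for name in ("lab-1", "lab-2")}
q = Query("BLAST", "default", "fasta", ">q1\nTAYIAK")
answers = query_labs(["lab-1", "lab-2"], labs, q)
```

The sketch is sequential, like the protocol as specified; the parallel variant mentioned in the text would replace the recursion with concurrent requests.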
This constraint launches another custom the user can easily build the proteomic query to be sub- visualisation GUI for human users (see Figures 6, 7, 8, mitted to the system by writing or selecting the argu- and 9), by which they can examine the different results ments of the constraint. returned by the laboratories. As soon as researchers have built their query, it can be From an architectural point of view, the researcher sent to each one of the laboratories that are in the fil- OKC has been divided into three main components: the tered laboratory list. This task is done by the researcher OKC layer,the researcher kernel and the visualisation component. With this division, if the protocol specifica- tion or the visualisation requirements are modified, the Figure 4 GUI for building BLAST queries. Figure 6 BLAST result window with answers from labs. Schorlemmer et al. Automated Experimentation 2012, 4:1 Page 8 of 17 http://www.aejournal.net/content/4/1/1 Figure 7 OMSSA result window with answers from labs. Figure 9 OMSSA mass spectrum view. corresponding changes to the OKC can be applied quickly. The OKC layer acts as a thin interface between From an architectural point of view, as with the the OK P2P network and the researcher kernel, translat- researcher OKC, the omicslab OKC has been split into ing incoming constraints into researcher kernel method three main components: the OKC layer,the omicslab invocations. The researcher kernel contains all the utili- kernel and the search engine wrapper component.The ties to build the proteomics queries and to parse the OKC layer function is to act as interface between the results provided by the laboratories, and it invokes the OK P2P network and the omicslab kernel, mapping con- visualisations when needed. A schematic view of the straints into omicslab kernel functions. The kernel task architecture is shown in Figure 10. 
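The idea of an OKC layer that acts as a thin interface and translates incoming constraints into kernel method invocations can be pictured as a small dispatch table. The following is a minimal sketch under stated assumptions: the class names, the two kernel methods, and their return values are invented for illustration and do not reproduce the actual OKC code.

```python
class ResearcherKernel:
    """Hypothetical kernel holding the query-building and result-handling logic."""

    def build_query(self, search_type, payload):
        return {"search_type": search_type, "input": payload}

    def summarise(self, answers):
        return f"{len(answers)} answer(s) received"


class OKCLayer:
    """Thin adapter: maps constraint names onto kernel method calls."""

    def __init__(self, kernel):
        self._dispatch = {
            "getQuery": kernel.build_query,
            "processResults": kernel.summarise,
        }

    def solve(self, constraint, *args):
        try:
            handler = self._dispatch[constraint]
        except KeyError:
            raise ValueError(f"no handler for constraint {constraint!r}") from None
        return handler(*args)


layer = OKCLayer(ResearcherKernel())
built = layer.solve("getQuery", "BLAST", "TAYIAK")
summary = layer.solve("processResults", ["answer-1", "answer-2"])
```

Keeping the mapping in one place is what allows a change in the protocol's constraint names to be absorbed by the layer without touching the kernel.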
is to identify the incoming query and to send it to the The omicslab OKC search engines through the search engine wrapper com- This OKC implements the only constraint relevant for a ponent. This latter component executes the locally peer that participates in the peer interaction playing the installed proteomic search engine and returns the result omicslab role, namely findHit(SearchType, SearchArgu- to the omicslab kernel. A schematic view of the architec- ments, InputFormat, Input, Result, ResultInfo). This ture is shown in Figure 11. constraint is to solve the query received from a peer in In order to connect a peer playing the omicslab role the researcher role and to return the result. The Search- into our OK P2P network it is also necessary to install Type, SearchArguments, InputFormat and Input argu- the BLAST and OMSSA proteomic search engines, to ments are supplied by the researcher peer, and they are setupthe peer’s proteomic database, and to configure used as input to solve the query. The Result and Resul- tInfo arguments are instantiated by the omicslab peer. Result contains the result of the execution of the proteo- mic search engine and ResultInfo contains additional metadata about the result. Figure 8 OMSSA peptide detail view. Figure 10 Schematic view of the researcher OKC architecture. Schorlemmer et al. Automated Experimentation 2012, 4:1 Page 9 of 17 http://www.aejournal.net/content/4/1/1 FASTA files to a binary BLAST formatted database (by means of the formatdb program provided in the BLAST package). Finally, to also pull existing online proteomic databases into our P2P network we also set up omicslab OKCs whose proteomic databases were downloaded from institutions such as NCBI. � Configuration: To configure a peer to use the search engines, each machine acting as omicslab peer contains a configuration file. 
By reading this configuration file the omicslab peer knows where the search engines are locally installed, what database it should be using for each search, and the default parameters to use with the search engines. A frag- ment of a configuration file can be seen in Figure 12. Experimentation For the experimentation we have drawn from real data obtained from the ProteoRed scientific community. In the following we briefly describe this community, the test data employed, how the experimentation has been set up, and the concrete data-sharing peer interaction launched. The ProteoRed Scientific Community The National Institute for Proteomics, ProteoRed, is a Figure 11 Schematic view of the omicslab OKC architecture. network for the coordination, integration and develop- ment of the Spanish proteomics facilities providing the peer to use the search engines over its database. � Search engines: We decided not to include the search engines as part of the omicslab OKC, to make it platform- and search-engine-independent, as BLAST and OMSSA can be freely downloaded from NCBI, the National Center for Biotechnology Infor- mation (http://www.ncbi.nlm.nih.gov). It is required to install them locally in every machine acting as an omicslab. � Databases: For setting up the protein sequence database with the mass-spectra data returned by the lab’s local mass spectrometers we processed each set of mgf files (a common format to collect mass-spec- tra) using the de novo interpreter tool PEAKS, which was available to all ProteoRed lab members (see Experimentation section below) to obtain a corre- sponding set of amino-acids sequences. Before build- ing the database of sequences we applied a filter over de novo results discarding short sequences (less than 4 bases) and duplicates. 
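The filtering of de novo results described for the database set-up can be illustrated with a minimal Python sketch. It assumes the de novo interpretations are already available as (spectrum id, sequence, score) tuples; that input shape, the function names and the FASTA header layout are hypothetical and are not taken from the PEAKS export format or from the authors' tooling. Only the rules themselves (keep the highest-scoring interpretation per spectrum, discard sequences shorter than 4, discard duplicates, keep the score as header information) follow the text.

```python
from typing import Iterable, List, Tuple

# Hypothetical record layout: (spectrum_id, peptide_sequence, de_novo_score).
DeNovoHit = Tuple[str, str, float]

MIN_LENGTH = 4  # sequences shorter than this are discarded, as in the text


def select_sequences(hits: Iterable[DeNovoHit]) -> List[DeNovoHit]:
    """Keep the best-scoring interpretation per spectrum, then drop
    short sequences and duplicate sequences."""
    best = {}
    for spectrum_id, seq, score in hits:
        current = best.get(spectrum_id)
        if current is None or score > current[2]:
            best[spectrum_id] = (spectrum_id, seq, score)

    seen = set()
    kept = []
    for spectrum_id, seq, score in best.values():
        if len(seq) < MIN_LENGTH or seq in seen:
            continue
        seen.add(seq)
        kept.append((spectrum_id, seq, score))
    return kept


def to_fasta(records: Iterable[DeNovoHit]) -> str:
    """Render plain-text FASTA; the de novo score survives only in the header."""
    return "".join(
        f">{spectrum_id} denovo_score={score}\n{seq}\n"
        for spectrum_id, seq, score in records
    )
```

The resulting plain-text FASTA would then be converted to a binary BLAST database with formatdb, as the text states; that step is deliberately left out here because the exact invocation used by the authors is not given.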
Experimentation

For the experimentation we have drawn from real data obtained from the ProteoRed scientific community. In the following we briefly describe this community, the test data employed, how the experimentation has been set up, and the concrete data-sharing peer interaction launched.

The ProteoRed Scientific Community

The National Institute for Proteomics, ProteoRed, is a network for the coordination, integration and development of the Spanish proteomics facilities providing services to support Spanish researchers in the field of proteomics. As of 2009 ProteoRed integrated 19 well-established proteomics facilities giving services all over Spain and abroad.

ProteoRed offers major services necessary in all stages of the protein analysis process, and its main objective is to increase the specialisation and competitiveness of proteomics facilities, considering the type of technologies and equipment available, and the type of customers, their expertise and their geographical situation. Customers are research groups from universities, the CSIC, hospitals, or other public institutions, as well as private companies (biotech and pharmaceutical companies). ProteoRed also has the objective of testing new technological developments to provide new proteomics methodologies and equipment to the Spanish proteomics facilities. It also establishes open channels with customers of these proteomics services to learn about their technological needs, data accuracy, quality requirements, price scales, and new services needed for the future. ProteoRed also takes care of the coordination of courses, workshops and meetings to promote and enhance the quality of proteomics knowledge throughout the scientific community, ProteoRed technicians and governmental agencies.

The Test Data

For our test data we have decided to use preexisting MS/MS data repositories from the 2006 ABRF (Association of Biomolecular Resource Facilities) test sample. It consists of a mixture of 48 purified and recombinant proteins (plus an unknown number of protein contaminants) extensively tested during the ABRF Proteomics Standards Research Group 2006 worldwide survey.

Seventy-eight laboratories participated in the analysis of these mixtures, some of them members of the ProteoRed network. Among these, only 35% could correctly identify more than 40 protein components. Thus, the sample, while relatively handy for the purpose of testing the OK system, is still complex enough to be a challenge for most proteomics laboratories.

This sample was prepared by combining five picomole aliquots of each protein. For this purpose, individual proteins were previously purified to assure a purity >95%, and the protein concentration was determined by amino-acid analysis. The combined sample was lyophilized in 1 mL polypropylene tubes for storage before analysis. The presence of low levels of impurities in the mixture represented an additional challenge to this analysis. Thus, in addition to the 48 standard proteins, most laboratories reported the identification of many other proteins. These identifications could be due either to real contaminants or to false positive identifications. Ascertaining which was the case requires a careful analysis of the full data obtained by a laboratory.

This ambiguity could be rapidly resolved by querying the OK system and searching for other laboratories reporting the same unexpected identifications, as we show in the test experiment. This case is an example of a more general situation in which a laboratory needs to evaluate the confidence of results that cannot be supported by other means (such as a high-confidence match in a database) by checking if the same data has been obtained by a number of independent partners working with similar biological samples.

Experiment Setup

To test whether the OK system could be used by the ProteoRed community in order to speed up the protein detection process, we simulated an environment in which several peers of the OK system were emulating real proteomics laboratories. Through this environment a researcher could query these peer laboratories to retrieve data from their local databases.

Since we did not have access to the entire MS/MS repository of the 2006 ABRF test sample, we set up a P2P network of those 9 laboratories of the ProteoRed community that made their data available for our experimentation. Each of these peers managed its own database containing protein data extracted from the ABRF test sample. To serve as an interface between the ABRF database and the OK system we implemented an omicslab OKC for each peer and subscribed it to play the omicslab role in charge of replying to incoming proteomic queries as specified in the MS spectra-sharing protocol of Figure 1.

In addition, we gave proteomics researchers a tool that allowed them to search for proteomic information through the OK P2P network by sending queries to omicslab peers and retrieving data from their databases. For this, researchers had to set up their access to the OK system by executing the following set of steps:

1. Installing the OK kernel. All researchers need to link into the OK system by installing the OK kernel [6] on a computer with an internet connection, in any operating system, with the only requirement that it has the Java 1.5 suite installed.
2. Searching for the protocol specification. The OK system lets different peers interact according to peer-interaction protocols. These protocols are to be specified and made public in the OK system. Users of the OK system can then search for protocol specifications that define the type of interaction according to which they would like to interact with other peers. In our current scenario researchers will use the browsing facility provided by the user interface of the OK kernel application to search for the appropriate protocol specifications. Searching is achieved by sending queries with keywords to a discovery service, which is in charge of storing all of the published OKCs and protocols. (The discovery service is not itself centralised, but a completely distributed service that follows the decentralised approach taken by P2P networks.) The discovery service then retrieves all the protocol specifications whose metadata matches the query. The researchers can then read the descriptions associated with the retrieved protocol specifications in order to select the one that suits them best. If no specification suits them, they can refine the query by using different keywords.
3. Installing the researcher OKC. Recall that the protocol specified for our experiment defines two roles: the omicslab, which is in charge of replying to queries, and the researcher, which is the one sending out queries (see the Protocol Specification section above). Actual researchers that want to query peer laboratories will have to do so by enacting the researcher role as specified in the protocol. For this they must have previously installed an OKC capable of solving the constraints attached to that role (see the OpenKnowledge Components (OKCs) section above). OKCs can also be published in the OK system so that it is easy for users to find them and install them locally on their computers. Downloading and installing an OKC from the OK system can be achieved directly from the kernel's user interface. Once both the protocol specification and the role that a user wants to play have been chosen, one needs to search for existing OKC implementations for the given role, and then download and install them. At this point a researcher would be ready to start launching proteomic queries. Although this is the simplest way to install the required OKC, an advanced user may also find or develop an OKC through other means and plug it into the kernel.
4. Subscribing to the researcher role. Once the actual researchers have installed the OKC needed for playing the researcher role, they can start the peer interaction through a subscription. Users select the protocol specification and role they want to play and then run the subscription command from the user interface. This command sends the subscription information to the discovery service, which will define a peer (the researcher or another peer in the network) that will act as a coordinator of the peer interaction, following the protocol that will govern the peer interaction between the researcher and those peers that have subscribed to play the other roles defined in the protocol. In our case these will be the laboratory peers subscribed to play the omicslab role.

Protocol Enactment

When the protocol and OKCs are in place, and sufficient peers have subscribed to the required roles of the protocol, the protocol itself can be enacted.

1. Selecting the laboratories. When the interaction starts, the protocol initiator peer receives the list of all the peers that have subscribed to the omicslab role. This list is shown to the user, who has to select the subset of the labs that he or she wants to query.
2. Building the proteomic query. Having selected the laboratories to which the query will be sent, the user will then have to create the query. This is also done through a user interface providing users with a form through which they can specify the query. This form consists of:
   • an item asking to select the type of search (BLAST or OMSSA);
   • a text box in which to write the proteomic text query or import it from a file;
   • another text box where the researcher can add meta-data annotations that are used if the confidence in the returned results is to be determined (see the next subsection); and
   • a subform where the user can enter custom search arguments to be used by the search engines.
   Once the query has been entered by the researcher it is sent to all the selected omicslab peers so that they can process it and reply with the set of matching proteomic data.
3. Showing the results from laboratories. Every time an omicslab peer replies to the query, the researcher OKC stores the results. Once all the omicslab peers to which the query was sent have replied, the results are shown to the users via a custom visualisation. Through this custom GUI researchers can browse through the different results and compare them. This is the final step of the protocol execution; if the researchers want to make another query they simply start another interaction.
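The three enactment steps (select laboratories, build the query, collect the replies) can be sketched in a few lines of Python. This is only an illustration of the control flow under strong simplifications: each laboratory is modelled as a plain local callable rather than a peer reached by message passing, and all names (ProteomicQuery, enact, the field names) are hypothetical, not part of the OpenKnowledge kernel API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ProteomicQuery:
    """Mirrors the four parts of the query form described in the text."""
    search_type: str                       # "BLAST" or "OMSSA"
    query_text: str                        # typed in or imported from a file
    annotations: Dict[str, str] = field(default_factory=dict)       # metadata
    search_arguments: Dict[str, str] = field(default_factory=dict)  # custom args

    def __post_init__(self) -> None:
        if self.search_type not in ("BLAST", "OMSSA"):
            raise ValueError(f"unsupported search type: {self.search_type}")


# Stand-in for an omicslab peer: takes a query, returns its matching hits.
LabPeer = Callable[[ProteomicQuery], List[str]]


def enact(query: ProteomicQuery,
          subscribed: Dict[str, LabPeer],
          selected: List[str]) -> Dict[str, List[str]]:
    """Send the query to every selected lab and return the results only
    once all of them have replied (here: once every call has returned)."""
    unknown = [name for name in selected if name not in subscribed]
    if unknown:
        raise KeyError(f"labs not subscribed to the omicslab role: {unknown}")

    results: Dict[str, List[str]] = {}
    for name in selected:
        results[name] = subscribed[name](query)  # store each reply as it arrives
    return results
```

In the real system the replies arrive asynchronously over the P2P network; the sequential loop above merely stands in for "store each reply, display when all have arrived".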
Confidence in Peers

In our scenario, researchers send queries to and receive answers from a set of peer proteomics laboratories. For these researchers it is important to have a mechanism helping them to distinguish which laboratories return more significant and relevant answers to their queries. This can be achieved by measuring the confidence that researchers have in each laboratory, a confidence that is built during successive queries launched by the researcher to their peer laboratories.

In [10], the trust in a peer (a proteomics lab in our scenario) is defined as the overall satisfaction with previous experiences with that peer. In our case the satisfaction measure of each particular experience with a laboratory is based on the similarity between the query and the answer obtained from the laboratory. This similarity, however, is not only the similarity between amino-acid sequences; it also takes into account additional information such as the enzyme chosen for digestion during sample preparation, the type of mass spectrometer used, or the kind of organism that the sample was taken from, among other information.

The similarity between a query and a laboratory hit (the basic building block of the confidence calculation in this application domain) is then defined as the product of two factors:

spectrum similarity = spectrum significance × protocol similarity

where the value of spectrum significance represents how significant the matched spectrum is with respect to the query spectrum, and the value of protocol similarity represents how similar the spectrographic protocols are that were followed by the researcher when obtaining the spectrum in the query and by the laboratory when obtaining the spectra in its database. Let us explain these two similarity measures in more detail.

Spectrum significance

To calculate the significance of a spectrum (or of its associated sequence of amino-acid characters), search engines that work over databases of amino-acid sequences usually report a score $S$ together with a probabilistic value, referred to as the P-value. The score $S$ is a measure of the similarity of the query to the sequence matched, and the P-value is a measure of the reliability of this score. It is the probability that, due to chance, there is at least another match with score greater than or equal to $S$. (Here chance means the comparison of (i) real but non-homologous sequences; (ii) real sequences that are shuffled to preserve compositional properties; or (iii) sequences that are generated randomly based upon a DNA or protein sequence model [11].) But instead of determining P-values directly, search engines such as BLAST or OMSSA report the so-called E-values, which are easier to interpret. The E-value is the number of times a sequence with a score greater than $S$ may occur by chance in the database. That is, the closer the E-value to zero, the better the match. But as E-values are not normalised in $[0, 1]$, we take the P-value for computing spectrum significance, which can be derived from the reported E-value as follows:

$$P = 1 - e^{-E}$$

The second factor that we consider is the number of significant spectra in the set $M$ of matched sequences reported by a peer laboratory with respect to the score distribution of the sequences of $M$. Thus, given a query sequence, the overall significance of the set of matched sequences provided by a peer laboratory is given by

$$\frac{\sum_{n \in N} S_n}{|N|}, \quad \text{where } N = \{\, m \in M \mid S_m > \sigma_M \,\},$$

$\sigma_M$ being the standard deviation of scores in the population $M$.

Protocol Similarity

In its graphical representation a mass spectrum is a plot of the mass-to-charge values of the detected ions versus their corresponding intensity. Fragmentation spectra from peptides show profiles that are characteristic of the peptide sequence and that contain different types of ions [12]. In addition to the returned spectra, researchers need to have some confidence that the way in which the spectrum in the query has been obtained is comparable to the way in which the hits are obtained. This is because, although the protocol followed by a laboratory may be well defined, the protocol itself admits certain variations that will produce spectra with different ion types. (Bear in mind that this is not the peer-interaction protocol specified in LCC for data sharing in our P2P network.) These variations include the enzymes used to modify and to digest the amino-acids, and the type of mass spectrometer used to produce the spectra.

Another important factor is the organism from which the protein has been obtained. All this information is provided as metadata. We define the protocol similarity of a hit between a query and a database entry as

protocol similarity = organism similarity × modification similarity × digestion similarity × mass-spectrometer similarity

Organism similarity: The semantic similarity among organisms $o_1$ and $o_2$ used for our confidence evaluation is based on the organism taxonomy tree according to the NCBI lineage. Figure 13 shows a fragment of this taxonomy.

To define a similarity measure we have used the one described in [13] and repeated here:

$$\mathrm{Sim}(o_1, o_2) = \begin{cases} 1 & \text{if } o_1 = o_2 \\ e^{-\kappa_1 l} \cdot \dfrac{e^{\kappa_2 h} - e^{-\kappa_2 h}}{e^{\kappa_2 h} + e^{-\kappa_2 h}} & \text{otherwise} \end{cases}$$

where $l$ is the length (i.e., number of edges) of the shortest path between nodes, $h$ is the depth of the deepest node subsuming both nodes, and $\kappa_1$ and $\kappa_2$ are parameters balancing the contribution of shortest path length and depth respectively.
For example, the path to humans is:

cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Euarchontoglires; Primates; Haplorrhini; Simiiformes; Catarrhini; Hominoidea; Hominidae; Homo/Pan/Gorilla group; Homo; Homo sapiens

and the paths to the rat, the sole, and the rhesus macaque are, respectively:

• cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Euarchontoglires; Glires; Rodentia; Sciurognathi; Muroidea; Muridae; Murinae; Rattus
• cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Actinopterygii; Actinopteri; Neopterygii; Teleostei; Elopocephala; Clupeocephala; Euteleostei; Neognathi; Neoteleostei; Eurypterygii; Ctenosquamata; Acanthomorpha; Euacanthomorpha; Holacanthopterygii; Acanthopterygii; Euacanthopterygii; Percomorpha; Pleuronectiformes; Soleoidei; Soleidae; Solea; Solea senegalensis
• cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Euarchontoglires; Primates; Haplorrhini; Simiiformes; Catarrhini; Cercopithecoidea; Cercopithecidae; Cercopithecinae; Macaca; Macaca mulatta

Taking as parameter values $\kappa_1 = 0.02$ and $\kappa_2 = 0.6$ we get the following similarities:

Sim(homo sapiens; solea) = 0.46
Sim(homo sapiens; rattus) = 0.72
Sim(homo sapiens; macaca mulatta) = 0.81

Figure 13 Fragment of the organism lineage tree.

The NCBI database of organisms and taxonomies is dynamic and very large (currently there are more than 300,000 organisms), but it provides a REST web service to get taxonomic information without the need to download the entire database. (REST is a method to make web service queries where the query is written as a URL over HTTP.)

Modification similarity: To calculate the similarity of modification terms, rather than a tree distance we have used a binary similarity table (see Table 1). This is because there are situations that cannot be compared. For example, a peptide modified with oxidation of an amino-acid cannot be compared with another peptide that has not been modified at all.

Table 1 Similarity table for peptide modification

modification code   -1    1    2    3    5   31   32   89   90
-1                   1
1                    1    1
2                    1    0    1
3                    1    0    1    1
5                    1    0    1    1    1
31                   1    0    1    1    1    1
32                   1    0    1    1    1    1    1
89                   1    1    0    0    0    0    0    1
90                   1    1    0    0    0    0    0    0    1

-1 = nothing; 1 = oxidation of M; 2 = carboxymethyl C; 3 = carbamidomethyl C; 5 = propionamide C; 31 = carbamylation of K; 32 = carbamylation of N-term peptide; 89 = oxidation of H; 90 = oxidation of W. For the remaining codes the similarity is 1 if they are equal and 0 if they are not equal.

Digestion similarity: To calculate the similarity of the digestion terms we assign a similarity value of 1 between identical enzymes, and also between Trypsin and LysC, but only if the peptide ends with K. In all other cases we assign a similarity value of 0.

Mass-spectrometer similarity: Finally, the similarity of the mass spectrometers is calculated based on the taxonomy tree depicted in Figure 14 (see Table 2 for acronyms). The tree classifies the spectrometers according to the type of ion fragmentation profiles that are typically produced. The main classification parameters are the ionization method (MALDI or ESI ionization), which determines the type of precursor ions formed, and the collision energy (high- or low-energy collision).
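To show how the individual factors fit together, here is a minimal Python sketch of the table-based and rule-based similarities and of the two products defined in this section. The function names are hypothetical, enzyme names are compared as exact strings (an assumption), and the organism and mass-spectrometer similarities are passed in as plain numbers because they come from tree-based measures. The text derives the P-value from the E-value but does not spell out how the P-value is turned into the spectrum significance, so that value is likewise left as an input rather than guessed.

```python
import math

# Lower-triangular rows of Table 1, keyed by modification code.
_MOD_CODES = (-1, 1, 2, 3, 5, 31, 32, 89, 90)
_MOD_ROWS = {
    -1: (1,),
    1: (1, 1),
    2: (1, 0, 1),
    3: (1, 0, 1, 1),
    5: (1, 0, 1, 1, 1),
    31: (1, 0, 1, 1, 1, 1),
    32: (1, 0, 1, 1, 1, 1, 1),
    89: (1, 1, 0, 0, 0, 0, 0, 1),
    90: (1, 1, 0, 0, 0, 0, 0, 0, 1),
}


def p_value_from_e_value(e_value: float) -> float:
    """P = 1 - e^(-E); expm1 keeps precision for very small E-values."""
    return -math.expm1(-e_value)


def modification_similarity(code_a: int, code_b: int) -> float:
    """Table 1 lookup (symmetric); codes outside the table match only if equal."""
    if code_a in _MOD_ROWS and code_b in _MOD_ROWS:
        i, j = _MOD_CODES.index(code_a), _MOD_CODES.index(code_b)
        row, col = max(i, j), min(i, j)
        return float(_MOD_ROWS[_MOD_CODES[row]][col])
    return 1.0 if code_a == code_b else 0.0


def digestion_similarity(enzyme_a: str, enzyme_b: str, peptide: str) -> float:
    """1 for identical enzymes, or Trypsin/LysC when the peptide ends with K."""
    if enzyme_a == enzyme_b:
        return 1.0
    if {enzyme_a, enzyme_b} == {"Trypsin", "LysC"} and peptide.endswith("K"):
        return 1.0
    return 0.0


def protocol_similarity(organism: float, modification: float,
                        digestion: float, spectrometer: float) -> float:
    """Product of the four metadata-based factors."""
    return organism * modification * digestion * spectrometer


def spectrum_similarity(spectrum_significance: float,
                        protocol_sim: float) -> float:
    """spectrum similarity = spectrum significance x protocol similarity."""
    return spectrum_significance * protocol_sim
```

Because every factor enters as a product, a single zero (for example, incompatible modifications or enzymes) sets the whole protocol similarity, and hence the spectrum similarity, to zero.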
Figure 14 Similarity tree for different mass spectrometers. Node labels recovered from the figure: Spectrometer; Tandem Spectrometer; Type I, Type II, Type III, Type IV; Default, ETD MS, ECD MS, ESI-QIT, ESI-LIT, ESI-QqLIT, ESI LIT-FT (HCD), MALDI-LIT, ESI-QTOF, ESI-TSQ, MALDI-TOF/TOF, ESI-TOF/TOF.

Table 2 Mass spectrometer acronyms

acronym        mass spectrometer
ESI            electrospray
MALDI          matrix assisted laser desorption
QIT            quadrupole ion trap
QqLIT          hybrid quadrupole-linear ion trap
TSQ            triple stage quadrupole
QTOF           hybrid quadrupole-time of flight
TOF            time of flight
LIT-FT(HCD)    hybrid LIT-Orbitrap with high collision dissociation
ETD            fragmentation by electron-transfer dissociation
ECD            fragmentation by electron-capture dissociation

Results and Discussion

To play the researcher role we randomly selected 38 peptide sequences obtained by PEAKS de novo analysis of the mass spectrometric analytical data obtained by the LP CSIC/UAB proteomics laboratory from the ABRF sample. Additionally we included data derived from spectra that during the original analysis matched to proteins not included in the standard ABRF list.

The mass spectrometric data set from LP CSIC/UAB was obtained by LC-MS/MS analysis of the tryptic digest of the protein mixture in the ABRF sample. This data set included 2000 spectra from which 48 of the 49 proteins in the ABRF standard could be identified by conventional proteomic data analysis. Each protein was identified from the sequence of one or more of its tryptic peptides. Queries were performed against 9 databases, including 7 proteomics labs, the NCBI Swiss-Prot database, and the database of the researcher itself. Making the researcher laboratory play both roles in the peer interaction (researcher and omicslab) served as a true positive control to check for failures in the functionality of our system.

To evaluate the evolution of the confidence of reported answers to queries, the 48 sequences were divided into 4 groups to be able to mimic the building of a query history. Although this model does not produce a valid history (as sequences were grouped randomly), it allows us to evaluate the functionality of the confidence calculation.

Each group was queried with the same parameters (Figure 15) and the results analysed in the researcher OKC prospector window (Figure 16). As expected, the search in the researcher's database (column labelled with uab) always generated full coincidences. In contrast, other proteomics labs and the NCBI Swiss-Prot database (labelled with ncbi) produced more diverse results. Most of the queries produced high percentage identity values in the ncbi search. These hits give direct information about the identity of the peptide and the source protein ('id' and 'des' text windows in Figure 17). One of the queries in Figure 17 (Query 10) produced a 100% coincidence in the NCBI Swiss-Prot database. The expectation values for this match indicated that it was not due to chance. The protein that had been tentatively identified, P20160 (azurocidin precursor), was, however, not included in the list of components of the standard ABRF sample.

Figure 15 Query window and BLAST search parameters used for this study. The sequences shown in the image correspond to the first group of queries.
Figure 16 OK-omics Prospector window. Responses to the first group of queries (% coincidences).
Figure 17 Match data from ncbi database for query 10.

The analysis of the answers from other proteomics labs for this sequence showed that other laboratories found identical (see, e.g., cpif) or highly homologous sequences (see, e.g., ucm). This fact indicated that several laboratories had observed the presence of the same component in their samples and supported the fact that the queried sequence was not the result of noise or lab-specific sample preparation artifacts. More detailed analysis of the results indicated that the presence of protein P20160 was supported by other NCBI matches (see, e.g., Query 11 in Figure 18) and that the corresponding query also had produced highly scored matches in many proteomics labs. This analysis clearly showed that P20160 was a relatively highly concentrated contaminant (several laboratories detected several of its tryptic peptides) that could have been present in the ABRF sample. This fact is evident from the information that can be derived from the OK system. However, it is not straightforward to arrive at the same conclusion by conventional means, as it is difficult to discard organic contamination of the samples. The actual presence of this artifact was only established after the ABRF study of all the identifications reported by the laboratories.

Figure 18 Match data from ncbi database for query 11.

The confidence in the result of each proteomics lab was evaluated for the 4 queries performed (Figure 19). Confidence values for the different laboratories are in the range from near 0 to near 1, indicating different efficiencies in sending back high score matches for the queried sequences. No improvements in the quality of the information derived from the OK-omics system were observed by selecting 2-3 of the more trusted labs for these queries. Due to the small size of the databases, an important fraction of the processing time was due to the public NCBI database search. Selecting a few laboratories of high trust could however increase the performance when a higher number of peers are involved in the interaction. As expected from the origin of the data (sequences randomly taken from the LP CSIC/UAB data set), trust values are stable over the experiments.

Figure 19 Confidence evaluation.

Conclusions

We have presented a new form of data sharing for expression proteomics with the aim of (1) augmenting significantly the percentage of peptides and proteins to be sequenced and identified by means of mass-spectrometry-based analysis, and (2) reducing significantly the sequencing and identification time needed. For this we have combined current bioinformatics techniques for proteomics with a novel multiagent system architecture and a distributed knowledge coordination mechanism in peer-to-peer networks, which have been developed in the context of the OpenKnowledge EU project.

In this article we have specified the data-sharing peer-interaction protocol for P2P proteomics, implemented the P2P data-sharing system using the OpenKnowledge system, and carried out a feasibility experiment with test data from preexisting MS/MS data repositories from the 2006 ABRF test sample provided by different laboratories for the ProteoRed scientific community.

We conclude that by using the proposed P2P data-sharing system and protocol a researcher is capable of deriving information from the test data that is not straightforward to obtain by conventional means. This in turn shows that P2P data-sharing in proteomics can indeed lead to enhanced protein identification.

Acknowledgements

This research has been supported by the OpenKnowledge STREP (FP6-027253) funded by the European Commission; by grants BIO2009-11735, CONSOLIDER-INGENIO 2010 Agreement Technologies (CSD2007-0022) and CBIT (TIN2010-16306) funded by Spain's Ministerio de Ciencia e Innovación; and by the Generalitat de Catalunya (2009-SGR-1434). The LP-CSIC/UAB is a member of ProteoRed (http://www.proteored.org), funded by Genoma Spain, and follows the quality criteria set up by ProteoRed standards. We are also grateful for the test data provided by the proteomics facilities from Centro de Investigación Príncipe Felipe, CIC bioGUNE, Centro Nacional de Biotecnología (CSIC), Hospital Universitari Vall d'Hebron, Universidad de Alicante, Universidad de Córdoba, Universidad Complutense de Madrid, and Universidad del País Vasco - Euskal Herriko Unibertsitatea.

Author details

1 Artificial Intelligence Research Institute, IIIA-CSIC, Spain. 2 CSIC/UAB Proteomics Laboratory, IIBB-CSIC, IDIBAPS, Spain.

Authors' contributions

MS, JA, and CS conceived the P2P-based spectra-sharing method and experimentation, and they specified the peer-interaction protocol. DC and LB implemented OKCs and GUIs, and made the computational setup of the experiment, which was validated and interpreted by JA. CS and EJ designed the confidence evaluation module, which EJ further implemented and tested. AP and DC adjusted the OK system to the spectra-sharing platform. MS wrote the initial versions of the article based on contributions of AP, MA, LB, DC and EJ; and prepared the final version. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Received: 30 December 2011 Accepted: 31 January 2012 Published: 31 January 2012

References

1. Pertea M, Salzberg SL: Between a chicken and a grape: estimating the number of human genes. Genome Biology 2010, 11:206.
2. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic Local Alignment Search Tool. Journal of Molecular Biology 1990, 215:403-410.
3. Pearson WR: Rapid and Sensitive Sequence Comparison with FASTP and FASTA. In Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences, Volume 183 of Methods in Enzymology. Edited by: Doolittle R. Academic Press; 1990:63-98.
4. Geer L, Markey S, Kowalak J, Wagner L, Xu M, Maynard D, Yang X, Shi W, Bryant S: Open mass spectrometry search algorithm. Journal of Proteome Research 2004, 3:958-964.
5. Siebes R, Dupplaw D, Kotoulas S, Perreau de Pinninck A, van Harmelen F, Robertson D: The OpenKnowledge System: An Interaction-Centered Approach to Knowledge Sharing. In On the Move to Meaningful Internet Systems 2007: CoopIS, DOA, ODBASE, GADA, and IS. OTM Confederated International Conferences CoopIS, DOA, ODBASE, GADA, and IS 2007, Vilamoura, Portugal, November 25-30, 2007, Proceedings, Part I, Volume 4803 of Lecture Notes in Computer Science. Edited by: Meersman R, Tari Z. Springer; 2007:381-390.
6. Perreau de Pinninck A, Dupplaw D, Kotoulas S, Siebes R: The OpenKnowledge Kernel. International Journal of Applied Mathematics and Computer Sciences 2007, 4(3):162-167.
7. Robertson D, Giunchiglia F, van Harmelen F, Marchese M, Sabou M, Schorlemmer M, Shadbolt N, Siebes R, Sierra C, Walton C, Dasmahapatra S, Dupplaw D, Lewis P, Yatskevich M, Kotoulas S, Perreau de Pinninck A, Loizou A: Open Knowledge – Coordinating Knowledge Sharing Through Peer-to-Peer Interaction. In Languages, Methodologies and Development Tools for Multi-Agent Systems. First International Workshop, LADS 2007. Durham, UK, September 4-6, 2007. Revised Selected and Invited Papers, Volume 5118 of Lecture Notes in Artificial Intelligence. Edited by: Dastani M, El Fallah Seghrouchni A, Leite J, Torroni P. Springer; 2008:1-18.
8. Miller T, McGinnis J: Amongst First-Class Protocols. In Engineering Societies in the Agents World VIII, 8th International Workshop, ESAW 2007, Athens, Greece, October 22-24, 2007, Revised Selected Papers, Volume 4995 of Lecture Notes in Computer Science. Edited by: Artikis A, O'Hare GMP, Stathis K, Vouros GA. Springer; 2008:208-223.
9. Robertson D: Multi-agent Coordination as Distributed Logic Programming. In Logic Programming. 20th International Conference, ICLP 2004, Volume 3132 of Lecture Notes in Computer Science. Edited by: Demoen B, Lifschitz V. Springer; 2004:416-430.
10. Giunchiglia F, Sierra C, McNeill F, Osman N, Siebes R: Good Enough Answer Algorithms. Deliverable D4.5, OpenKnowledge 2007.
11. The Statistics of Sequence Similarity Scores. Retrieved from http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html on November 19, 2008. n.d.
12. Johnson RS, Martin SA, Biemann K, Stults JT, Watson JT: Novel Fragmentation Process of Peptides by Collision-Induced Decomposition in a Tandem Mass Spectrometer: Differentiation of Leucine and Isoleucine. Analytical Chemistry 1987, 59:2621-2625.
13. Li Y, Bandar ZA, McLean D: An approach for measuring semantic similarity between words using multiple information sources. IEEE Transactions on Knowledge and Data Engineering 2003, 15:871-882.

doi:10.1186/1759-4499-4-1
Cite this article as: Schorlemmer et al.: P2P proteomics – data sharing for enhanced protein identification. Automated Experimentation 2012 4:1.

Publisher
Springer Journals
Copyright
Copyright © 2012 by Schorlemmer et al.; licensee BioMed Central Ltd.
Subject
Life Sciences; Computer Appl. in Life Sciences; Laboratory Medicine; Computational Biology/Bioinformatics
eISSN
1759-4499
DOI
10.1186/1759-4499-4-1
pmid
22293032

Abstract

Background: In order to tackle the important and challenging problem in proteomics of identifying known and new protein sequences using high-throughput methods, we propose a data-sharing platform that uses fully distributed P2P technologies to share specifications of peer-interaction protocols and service components. By using such a platform, information to be searched is no longer centralised in a few repositories but gathered from experiments in peer proteomics laboratories, which can subsequently be searched by fellow researchers. Methods: The system runs, in a distributed fashion, a data-sharing protocol specified in the Lightweight Coordination Calculus underlying the system, through which researchers interact via message passing. For this, researchers interact with the system through particular components that link to database querying systems based on BLAST and/or OMSSA and to GUI-based visualisation environments. We have tested the proposed platform with data drawn from preexisting MS/MS data reservoirs from the 2006 ABRF (Association of Biomolecular Resource Facilities) test sample, which was extensively tested during the ABRF Proteomics Standards Research Group 2006 worldwide survey. In particular we have taken the data available from a subset of proteomics laboratories of Spain’s National Institute for Proteomics, ProteoRed, a network for the coordination, integration and development of the Spanish proteomics facilities. Results and Discussion: We performed queries against nine databases including seven ProteoRed proteomics laboratories, the NCBI Swiss-Prot database and the local database of the CSIC/UAB Proteomics Laboratory. A detailed analysis of the results indicated the presence of a protein that was supported by other NCBI matches and high-scoring matches in several proteomics labs. The analysis clearly indicated that the protein was a relatively highly concentrated contaminant that could be present in the ABRF sample.
This fact is evident from the information that could be derived from the proposed P2P proteomics system; however, it is not straightforward to arrive at the same conclusion by conventional means, as it is difficult to discard organic contamination of samples. The actual presence of this contaminant was only stated after the ABRF study of all the identifications reported by the laboratories.

* Correspondence: marco@iiia.csic.es
Artificial Intelligence Research Institute, IIIA-CSIC, Spain
Full list of author information is available at the end of the article

© 2012 Schorlemmer et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Proteomics studies the quantitative changes occurring in a proteome and its application to disease diagnostics and therapy, and to drug development. It examines proteins at different levels, including their sequences, structures and functionalities, and it is considered the next step in the study of biological systems, after genomics. It is much more complicated than genomics, mostly because the proteome differs from cell to cell and changes constantly through its biochemical interactions with the genome and the environment, while the genome of an organism is rather constant.

Proteins are large linear chains of amino acids (residues). The sequence of amino acids in a protein is directly translated from the information encoded in the genome. However, a proteome is more complex than a genome. One organism has radically different protein expression in different parts of its body, at different stages of its life cycle and under different environmental conditions (e.g., in humans there are about 20,500 identified genes but an estimate of more than 500,000 proteins that are derived from these genes [1]). This is mainly caused by mRNA alternative splicing processes and by the possibility of residues in a protein being chemically altered in post-translational modification (PTM), either as part of the protein maturation processes before the protein takes part in the cell’s functionalities, or as part of control mechanisms. The discrepancy implies that protein diversity cannot be fully characterized by gene expression analysis. Thus, proteomics is necessary for a better characterization of cells and tissues, and for manufacturing improved drugs and medicines.

Protein Identification in Proteomics

One important and challenging task in proteomics is the identification of proteins, that is, the recognition of the sequenced protein if the protein is known, or its discovery if it is unknown. For this, protein sequences are stored in public databases (such as nrNCBI, UniProt, or Genpept). However, they are mostly produced by the direct translation of gene sequences. This means that neither proteins with post-translational modifications (PTMs) nor proteins whose genomes have not been sequenced would find exact matches in such databases.

A key experimental technique for the identification of proteins is mass spectrometry (MS). Mass spectra provide very detailed fingerprints of the proteins contained in a given sample. In the so-called shotgun approach, MS is often combined with cutting-edge separation technologies to allow large-scale analysis of proteomes. For this, proteins are extracted from cells and tissues and enzymatically digested, and the resulting peptides (shorter amino acid chains) are separated by multidimensional liquid chromatography techniques. As the peptides are separated, they are injected on-line into the mass spectrometer, where they are ionized and fragmented, and these fragments are mass-monitored to produce a specific sequence fingerprint.

Identification of the huge amount of spectra produced by current state-of-the-art high-throughput analysis is one of the major tasks for proteomics laboratories. Two popular bioinformatics techniques are mainly involved in this effort. The first one takes advantage of public genome-translated databases (GTDB) that can be accessed through data-mining software (search engines), which directly relate mass spectra to database sequences. Most of these search engines (Mascot, X! Tandem, SEQUEST, OMSSA) are available both as stand-alone programs that consult a local copy of a GTDB and as web services connected to online GTDBs. Once again, these engines are limited in their capability to identify proteins with missing PTMs or from unsequenced genomes. The latter case is addressed by applying de novo interpretation algorithms that yield a sequence for a given mass spectrum, thus avoiding any database search. But these algorithms cannot fully solve the problem because of intrinsic technical limitations. Once a protein has been sequenced de novo, one can look for similar proteins in a GTDB using a matching algorithm such as BLAST [2] or FASTA [3]; or, alternatively, one can use an algorithm such as OMSSA [4] to match spectra directly to sequences of a GTDB.

Mass spectra identification is usually carried out by mixing and combining these two techniques. However, among other factors, the following issues complicate this task: the number of possible PTMs can multiply the amount of results to be analysed; bad quality and noise in mass spectra increase the uncertainty of interpretation; and database errors in sequence annotations can lead to misunderstandings in the identification. Consequently, we get a huge amount of apparently useless data (for instance, non-matching mass spectra or low-scoring de novo interpreted sequences), which most of the time is simply discarded. As a result, this data is seldom accessible to other groups involved in the identification of the same or homologous proteins. Our conviction is that we can benefit from this kind of data by making it available as searchable repositories for other laboratories. If we compared data coming from different laboratories, we would eventually be able to discover new matches. The discovery of matches would help to discriminate further between genuinely useless data and possibly good data. We envision many advantages with this new methodology, as other laboratories could provide the missing information for an incomplete spectrum or sequence, making a protein identification process succeed; furthermore, matches could help to recognize new proteins or identify PTMs.

P2P Networks for Proteomics

We propose a new scenario where the information to be searched is no longer centralised in a few repositories, but where information gathered from experiments in peer proteomics laboratories can be searched by fellow researchers. To avoid centralising all data into a single repository (with all the problems that such centralisation would entail), it is better to maintain the information locally at each of the proteomics laboratories. As a result, this decentralised data storage needs a decentralised search mechanism. The use of peer-to-peer (P2P) technologies fits our needs.

A P2P network provides methods for accessing distributed resources with minimal maintenance cost. It also provides scalable techniques to search through large amounts of resources scattered through the network. Furthermore, joining or leaving the network becomes a simple task. These properties of P2P networks make the technology an ideal candidate to implement a distributed search mechanism in a network of proteomics labs.
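The idea of keeping data local while still searching across laboratories can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system described in this article: the names `LabPeer`, `find_hits` and `search_network` are hypothetical, the peers are plain in-process objects rather than networked machines, and exact substring matching stands in for a real search engine such as BLAST or OMSSA.

```python
from dataclasses import dataclass, field


@dataclass
class LabPeer:
    """Hypothetical laboratory peer that keeps its sequences local."""
    name: str
    sequences: list[str] = field(default_factory=list)

    def find_hits(self, query: str) -> list[str]:
        # Placeholder matcher (assumption): exact substring containment.
        # A real peer would delegate to its locally installed search engine.
        return [s for s in self.sequences if query in s]


def search_network(peers: list[LabPeer], query: str) -> dict[str, list[str]]:
    """Ask every peer in turn and collect the non-empty answers per lab."""
    results: dict[str, list[str]] = {}
    for peer in peers:
        hits = peer.find_hits(query)
        if hits:
            results[peer.name] = hits
    return results


# Made-up example data; no real laboratory results are implied.
labs = [
    LabPeer("lab_a", ["MKTAYIAKQR", "GAVLIPFMW"]),
    LabPeer("lab_b", ["TTAYIAKQ", "PEPTIDE"]),
    LabPeer("lab_c", ["QQQQ"]),
]
print(search_network(labs, "AYIAK"))
```

Each peer answers only from its own repository, so no central copy of the data is needed; the inquirer merely aggregates the per-lab answers.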
Other distributed storage systems such as distributed databases or federated storage services have been developed with efficiency in mind, and the maintenance and joining costs for these solutions are very high.

A proteomics laboratory acting as a peer in a P2P network would be able to share its complete or partial data repository (e.g., mass spectra and de novo interpreted sequences) so that other peers can benefit from it. In addition, in order to find matches among data coming from different peers, the interacting peers of such a P2P network would also need to validate and cross-check the consistency of the information obtained by fellow peers.

In this article, we describe an approach that implements such a P2P network on top of the OpenKnowledge (OK) system [5,6], which was developed in the scope of the European OpenKnowledge project [7].

The OpenKnowledge System

The OpenKnowledge (OK) system is a fully distributed system that uses P2P technologies in order to share peer-interaction protocols and service components across the network. For this, a kernel module, the OK kernel, needs to be installed in each machine that is to be connected to the system. We shall call the protocols and service components to be shared generically OpenKnowledge Components (OKCs). Furthermore, these services are executed and coordinated using the same set of tools. In the Methods section below we will show how the tools of the OK system are used to implement the proteomics P2P application. The OK system consists of three main services which can be executed by any computer running the OK kernel:

• a discovery service consisting of a distributed hash table (DHT), by which peer-interaction protocols and other OKCs are stored, so that they can be located and downloaded by users;
• a coordination service, which manages the peer interactions between OKCs; and
• an execution service, which is capable of executing the offered service by means of the OK kernel at the local machine.

The workflow for implementing a new application on top of the OK platform is as follows. First, a specification defining the interaction protocol linking different services has to be defined. This specification is published to the discovery service so that other users can find it and can execute OKCs capable of playing the roles specified in the peer-interaction protocol. A developer, not necessarily the one that specified the protocol originally, will develop the OKCs that are to play the roles defined in the protocol specification. Some of these OKCs may be shared across the network by publishing them to the discovery service, so others can also execute them on their local machines. At this point the application is said to be implemented.

After the application is implemented, it can be executed on top of the OK system. For this purpose, the users wanting to interact as specified in the given peer-interaction protocol by playing one of the roles will subscribe the appropriate OKCs to it. The discovery service is in charge of managing these subscriptions, and when it gathers enough of these to satisfy all the necessary roles in the protocol, it sends this information to a designated peer acting as the protocol coordinator, who will start managing the peer interaction by asking each of the components to provide the services when required by the interaction protocol.

The Lightweight Coordination Calculus

For the case at hand, the developer has to specify a protocol of the peer interaction defining the roles each participating peer has to play, the sort of messages sent amongst them, and the particular constraints to be solved by the OKCs enacting these roles. Several modelling languages such as those reviewed in [8] could have been chosen. Our aim, however, is to use the most easily applied formal language for this engineering task that we could conceive and for which an executable peer-to-peer environment already exists, choosing thus the Lightweight Coordination Calculus (LCC) [9].

LCC is the executable interaction modelling language underlying the OK system. It is used to constrain interactions between distributed components and is neutral to the infrastructure used for message passing between components, although for the purposes of this paper we assume components are peers in some form of peer-to-peer network.

For example, Figure 1 shows the specification in LCC of the protocol for sequenced MS spectra sharing that we will describe in detail later in the Methods section. It is based on a simple query-answering protocol between one inquirer and many repliers.

Figure 1 LCC specification of the protocol for sequenced MS spectra sharing.

An LCC specification describes (in the style of a process calculus) a protocol for interaction between peers in order to achieve a collaborative task. The nature of this task is described through definitions of roles, with each role being defined as a separate LCC clause. The set of these clauses forms the LCC interaction model. An interaction model provides a context for each message that is sent between peers by describing the current state of the interaction (not of the peer) at the time of message passing. Coordination is achieved between peers by communicating this state along with the appropriate messages. Since roles are independently defined within an interaction model, it is possible to distribute the computation to peers performing roles independently, with synchronisation occurring only through message passing. Should the application demand it, however, LCC can also be used in a more centralised, server-based style.

Figure 2 shows the main definitions of LCC’s syntax. A detailed discussion of LCC, its semantics, and the mechanisms used to deploy it lies outside the scope of this paper. For these, the reader is referred to [9]. In this paper, though, we explain enough of LCC to demonstrate how to represent interactions.

Figure 2 Syntax of LCC.

An interaction model in LCC is a set of clauses, each of which defines how a role in the interaction must be performed. Roles are described in the head of each clause by the type of role (and its parameters) and an identifier for the individual peer undertaking that role. Clauses may require subroles to be undertaken as part of the completion of a role. The definition of performance of a role is constructed using combinations of the sequence operator (’then’) or choice operator (’or’) to connect messages and changes of role. Messages are terms, and are either outgoing to another peer in a given role (’=>’) or incoming from another peer in a given role (’<=’). Message input/output or change of role can be governed by a constraint to be solved before (when at the right of ’<-’) or after (when at the left of ’<-’) message passing or role change. Constraints are defined using the normal logical operators for conjunction, disjunction, and negation. If they are subject to fail, the interaction may proceed along alternative paths (e.g., those specified with operator ’or’). Notice that there is no commitment to the system of logic through which constraints are solved; on the contrary, we expect different peers to operate different constraint solvers.

A protocol like the one in Figure 1 is generic in the sense that it gives different interactions depending on how the variables (starting with a capital letter) in the clauses are bound at run time, this depending on the choices made by peers when satisfying the constraints within these clauses.

OpenKnowledge Components

To complete the application, we also need an implementation of the OKCs enacting each of the roles. For the protocol specified in Figure 1, this means two OKCs. One has to enact the researcher role as specified in the first two clauses, and another one has to enact the omicslab role as specified in the third clause. As a result, each OKC will need to be able to solve the constraints occurring in its respective role specification. For instance, for the omicslab role, the relevant OKC must be able to solve the constraint findHit(...). Therefore, its implementation must provide at least a findHit method. This method should search the local database for data that matches a given query. Obviously, this implementation will be tightly coupled to the local machinery, the file format used for storing this information, and the type of storage system from where it has to be retrieved. This is an obstacle to the portability of OKCs across different laboratories. Consequently, it is advisable that each laboratory develops its own particular OKC for the omicslab role to be played, adjusted to its own system requirements. However, standard OKCs for the most common formats and mass spectrometers could be made publicly available for download and sharing. There is no restriction in the OK system to prevent locally produced OKCs from being published and downloaded by other users.

Methods

To show the viability of a P2P-based data-sharing environment for the task of protein identification in proteomics we first specify in LCC a protocol for sharing sequenced MS spectra among peer laboratories. Then we describe the OpenKnowledge components (OKCs) that we have implemented to play the roles specified in the protocol, and finally we recount an actual experiment carried out with the OK system. The aim of the experiment is to serve as proof of concept for applying P2P technology to the task of protein and peptide identification. As such, we do not claim that the experiment proves that the OK system for P2P-based data-sharing significantly improves all current standard protein and peptide identification protocols based on centralised database search. The data available for the experiment is insufficient to come to such a conclusion. However, we do show in the Results and Discussion section below that by using a P2P-based data-sharing environment such as the one proposed in this article, researchers gain valuable information that allows them to raise the confidence of their identification task.

For an enhanced selection of those peer laboratories that are to participate in the data-sharing protocol, we have added a confidence evaluation mechanism that varies over time, and which is based on the expected answer accuracy of peer laboratories.

Protocol Specification

Figure 1 shows an LCC specification of a protocol that guides peer laboratories in their search of each other’s locally stored proteomic data files. This is only one of many possible protocols of this kind. LCC protocols are declarative specifications, and as such they are neutral to the specifics of a protocol execution. The only requirement is that all peers in the network that are to interact by means of a given interaction protocol should be capable of doing so. This capability amounts to (a) running a local copy of the OK kernel, and (b) having a local implementation of an OKC capable of resolving the constraints relevant to the role a peer is playing in the protocol.

For our proof of concept we have specified a protocol (and implemented the required OKCs) for sharing sequenced MS spectra among peer laboratories. Ideally, to obtain the advantages of peer-based MS spectra sharing as outlined in the Background section above, we should ultimately aim at querying and sharing MS spectra directly. However, for our proof of concept for validating the potential gain of peer-based proteomic data sharing, we have first targeted the implementation of a system for sequenced MS spectra sharing. Since the protocol models a simple query-answering interaction between one inquirer and many repliers, its application to MS spectra sharing will depend on the availability of OKCs that implement searching based on spectrum-to-spectrum matching. The OKCs we have implemented so far for our proof of concept allow searching based on sequence-to-sequence matching (by means of BLAST) and spectra-to-sequence matching (by means of OMSSA).

For the actual specification of the protocol, two main roles are needed, one for the inquirer, which in the protocol specification has been termed researcher (the clause headed by a(researcher, Researcher)::), and another for the replier called omicslab, which will be replying to the queries (the clause headed by a(omicslab, OmicsLab)::). We will start by explaining the latter role, which is simpler.

• omicslab: A peer in this role waits for a message with a query from a peer playing the researcher role, then solves this query by executing the findHit constraint that finds all matching hits in its local database, and finally sends these hits back to the researcher peer via another message. This is specified as a conditional message-passing action that is only carried out when the findHit constraint can be satisfied.
• researcher: A peer in this role acts as the inquirer, and the role makes use of a researcher subrole that includes additional parameters. A peer in the main researcher role (the one without parameters) asks the user for a query. It does so by launching an input GUI with the constraint getQuery and then iterating through all the selected proteomics labs participating in the peer interaction (obtained via constraints getOmicsLabRole, getPeers and selectLabs). It aggregates all the different results and displays them to the user through an output GUI that is launched by solving the constraint showResults. Notice that all these constraints are conditions of an empty message-passing action labelled null. (This is a syntactic requirement of LCC: constraints always go together with a message-passing action, which can be the empty one.) The iteration through all the omics laboratories is currently specified to be done via a recursive helper subrole, which receives the query and a list of omics laboratories. This role change is executed after getting all the identifiers of the peers playing the omicslab role by means of the getPeers constraint, which is executed by the protocol coordinator, who is the peer holding this information. In this subrole, the peer first sends a message containing the query to the first laboratory in the list, then aggregates the response it receives in the message from the laboratory, and finally runs a recursive call to the rest of the list. When the list of laboratories is empty it returns an empty list of results. We could have decided alternatively to specify that queries ought to be sent out in parallel to all selected laboratories. This would be the obvious choice to speed up the querying process, but this is not relevant for the objectives of this article.

Constraints of the researcher role that require user input (such as selecting candidate laboratories or writing the query) and generate output to the user (such as displaying results) are handled via so-called visual constraints. That is, these constraints are annotated in the LCC specification to be solved by means of domain-specific GUIs. In our case they are specially tailored for sequenced MS spectra sharing.

OpenKnowledge Components (OKCs)

In the following we describe the implementation of OKCs that ground the enactment of the protocol of Figure 1. As mentioned above, our implementation so far allows for searching that is based on sequence-to-sequence matching (by means of BLAST) and spectra-to-sequence matching (by means of OMSSA). With this initial implementation we are able to run our experiment that serves as proof of concept of the proposed P2P proteomics data sharing environment.

The researcher OKC

This OKC implements the constraints relevant for a peer that wants to participate in the peer interaction playing the researcher role. Hence, the OKC’s main task is to ask the user for the proteomic query to solve, forward it to the laboratories, fetch the results, and present them back to the user.

The getOmicsLabRole(RoleName) and getPeers(RoleName, LabList) constraints are used to get the list of omics lab peers participating during a particular enactment of the protocol. The selectLabs(LabList, SelectedLabList) constraint filters those laboratories to which users want to send the proteomic query. It shows a GUI (Figure 3) by which users identify and select the desired peer laboratories.

Figure 3 GUI for selecting peer laboratories for the data-sharing protocol.

The getQuery(SearchType, SearchArguments, InputFormat, Input) constraint asks the user for the proteomic query to solve. It requires four arguments that need to be provided by the user:

• SearchType: the type of search to be performed (BLAST or OMSSA).
• SearchArguments: the parameters to be used by the laboratories when executing their locally installed search engines.
• InputFormat: the proteomic search engines (BLAST and OMSSA) allow different input formats; this argument is used to inform the search engines about the format used in the input.
• Input: the proteomic sequences (if BLAST is used) or mass spectra (if OMSSA is used) that constitute the input to the search engines.

To solve the getQuery(SearchType, SearchArguments, InputFormat, Input) constraint, a custom visualisation (Figures 4 and 5) is shown to the user. With these GUIs the user can easily build the proteomic query to be submitted to the system by writing or selecting the arguments of the constraint.

Figure 4 GUI for building BLAST queries.

Figure 5 GUI for building OMSSA queries.

As soon as researchers have built their query, it can be sent to each one of the laboratories that are in the filtered laboratory list. This task is done by the researcher(LabList, SearchType, SearchArguments, InputFormat, Input) subrole. This role iterates over the list given to LabList using recursion; at each iteration the message query(SearchType, SearchArguments, InputFormat, Input) is sent to a laboratory of the list, a(omicslab, H), and the researcher peer waits for the laboratory peer’s response message answer(Result, ResultInfo) to aggregate it.

When all the various results from the laboratories have been collected, the processResults(End) constraint is invoked. This constraint launches another custom visualisation GUI for human users (see Figures 6, 7, 8, and 9), by which they can examine the different results returned by the laboratories.

Figure 6 BLAST result window with answers from labs.

Figure 7 OMSSA result window with answers from labs.

Figure 8 OMSSA peptide detail view.

Figure 9 OMSSA mass spectrum view.

From an architectural point of view, the researcher OKC has been divided into three main components: the OKC layer, the researcher kernel and the visualisation component. With this division, if the protocol specification or the visualisation requirements are modified, the corresponding changes to the OKC can be applied quickly. The OKC layer acts as a thin interface between the OK P2P network and the researcher kernel, translating incoming constraints into researcher kernel method invocations. The researcher kernel contains all the utilities to build the proteomics queries and to parse the results provided by the laboratories, and it invokes the visualisations when needed. A schematic view of the architecture is shown in Figure 10.

Figure 10 Schematic view of the researcher OKC architecture.

The omicslab OKC

This OKC implements the only constraint relevant for a peer that participates in the peer interaction playing the omicslab role, namely findHit(SearchType, SearchArguments, InputFormat, Input, Result, ResultInfo). This constraint is to solve the query received from a peer in the researcher role and to return the result. The SearchType, SearchArguments, InputFormat and Input arguments are supplied by the researcher peer, and they are used as input to solve the query. The Result and ResultInfo arguments are instantiated by the omicslab peer. Result contains the result of the execution of the proteomic search engine and ResultInfo contains additional metadata about the result.

From an architectural point of view, as with the researcher OKC, the omicslab OKC has been split into three main components: the OKC layer, the omicslab kernel and the search engine wrapper component. The function of the OKC layer is to act as an interface between the OK P2P network and the omicslab kernel, mapping constraints into omicslab kernel functions. The task of the kernel is to identify the incoming query and to send it to the search engines through the search engine wrapper component. This latter component executes the locally installed proteomic search engine and returns the result to the omicslab kernel. A schematic view of the architecture is shown in Figure 11.

Figure 11 Schematic view of the omicslab OKC architecture.

In order to connect a peer playing the omicslab role into our OK P2P network it is also necessary to install the BLAST and OMSSA proteomic search engines, to set up the peer’s proteomic database, and to configure the peer to use the search engines over its database.

• Search engines: We decided not to include the search engines as part of the omicslab OKC, to make it platform- and search-engine-independent, as BLAST and OMSSA can be freely downloaded from NCBI, the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov). They must be installed locally on every machine acting as an omicslab.
• Databases: For setting up the protein sequence database with the mass-spectra data returned by the lab’s local mass spectrometers we processed each set of mgf files (a common format to collect mass spectra) using the de novo interpreter tool PEAKS, which was available to all ProteoRed lab members (see Experimentation section below), to obtain a corresponding set of amino acid sequences. Before building the database of sequences we applied a filter over de novo results discarding short sequences (less than 4 bases) and duplicates. [...] FASTA files to a binary BLAST formatted database (by means of the formatdb program provided in the BLAST package). Finally, to also pull existing online proteomic databases into our P2P network we also set up omicslab OKCs whose proteomic databases were downloaded from institutions such as NCBI.
• Configuration: To configure a peer to use the search engines, each machine acting as omicslab peer contains a configuration file. By reading this configuration file the omicslab peer knows where the search engines are locally installed, what database it should be using for each search, and the default parameters to use with the search engines. A fragment of a configuration file can be seen in Figure 12.

Experimentation

For the experimentation we have drawn from real data obtained from the ProteoRed scientific community. In the following we briefly describe this community, the test data employed, how the experimentation has been set up, and the concrete data-sharing peer interaction launched.

The ProteoRed Scientific Community

The National Institute for Proteomics, ProteoRed, is a network for the coordination, integration and development of the Spanish proteomics facilities providing
We assume the first (highest score) sequence to be the best de novo interpretation; after that, the de novo score is not taken into account anymore, although its value is Figure 12 Fragment of the configuration file used by the annotated in the database as header information. omicslab peers. The final step consists of formatting these plain text Schorlemmer et al. Automated Experimentation 2012, 4:1 Page 10 of 17 http://www.aejournal.net/content/4/1/1 services to support Spanish researchers in the field of This ambiguity could be rapidly solved querying the proteomics. As of 2009 ProteoRed integrated 19 well- OK system and searching for other laboratories report- established proteomics facilities giving services all over ing the same unexpected identifications as we show in Spain and abroad. the test experiment. This case is an example of a more ProteoRed offers major services necessary in all stages general situation when a laboratory needs to evaluate of the protein analysis process, and its main objective is the confidence of results that cannot be supported by to increase the specialisation and competitiveness of other means –such as ahighconfidencematch ina database– by checking if the same data has been proteomics facilities, considering the type of technolo- obtained by a number of independent partners working gies and equipment available, and the type of customers, their expertise and their geographical situation. Custo- with similar biological samples. mers are research groups from universities, the CSIC, Experiment Setup hospitals, or other public institutions, as well as private To test whether the OK system could be used by the companies (biotech and pharmaceutical companies). 
ProteoRed community in order to speed up the protein ProteoRed also has the objective of testing new techno- detection process, we simulated an environment in logical developments to provide new proteomics meth- which several peers of the OK system were emulating odologies and equipment to the Spanish proteomics real proteomics laboratories. Through this environment facilities. It also establishes open channels with custo- a researcher could query these peer laboratories to mers of these proteomics services to know their techno- retrieve data from their local databases. logical needs, data accuracy, quality requirements, price Sincewedid nothaveaccesstothe entire MS/MS scales, and new services needed for the future. Pro- repository of the 2006 ABRF test sample, we set up a teoRed also takes care of the coordination of courses, P2P ntework of those 9 laboratories of the ProteoRed workshops and meetings to promote and enhance the community that made their data available for our quality of proteomics knowledge through the scientific experimentation. Each of these peers managed its own community, ProteoRed technicians and governmental database containing protein data extracted from the agencies. ABRF test sample. To serve as an interface between the The Test Data ABRF database and the OK system we implemented an For our test data we have decided to use preexisting omicslab OKC for each peer and subscribed it to play MS/MS data repositories from the 2006 ABRF (Associa- the omicslab role in charge of replying to incoming pro- tion of Biomolecular Resource Facilities) test sample. It teomic queries as specified in the MS spectra-sharing consists of a mixture of 48 purified and recombinant protocol of Figure 1. 
In addition, we gave proteomics researchers a tool that allowed them to search for pro- proteins (plus an unknown number of protein contami- nants) extensively tested during the ABRF Proteomics teomic information through the OK P2P network by Standards Research Group 2006 worldwide survey. sending queries to omicslab peers and retrieving data Seventy-eight laboratories participated in the analysis from their databases. For this, researchers had to set up of these mixtures, some of them members of the Pro- their access to the OK system by executing the follow- teoRed network. Among these, only 35% could correctly ing set of steps: identify more than 40 protein components. Thus, the sample, being relatively handy for the purpose of testing 1. Installing the OK kernel. All researchers need to the OK system, is still of enough complexity to become link into the OK system by installing the OK kernel a challenge for most proteomics laboratories. [6] on a computer with an internet connection, in This sample was prepared by combining five picomole any operating system, with the only requirement aliquots of each protein. For this purpose, individual that it has the Java 1.5 suite installed. proteins were previously purified to assure a purity 2. Searching for the protocol specification.The OK >95%, and the protein concentration determined by system supports that different peers interact accord- amino-acid analysis. The combined sample was lyophi- ing to peer-interaction protocols. These protocols lized in 1 mL polypropylene tubes for storage before are to be specified and made public in the OK sys- analysis. The presence of low levels of impurities in the tem. Users of the OK system can then search for mixture represented an additional challenge to this ana- protocol specifications that define the type of inter- lysis. 
Thus, in addition to the 48 standard proteins, action according to which they would like to interact most laboratories reported the identification of many with other peers. In our current scenario researchers other proteins. These identifications could be either due will use the browsing facility provided by the user to real contaminants or to false positive identifications. interface of the OK kernel application to search for the appropriate protocol specifications. Searching is To ascertain which was the case requires a careful ana- lysis of the full data obtained by a laboratory. achieved by sending queries with keywords to a Schorlemmer et al. Automated Experimentation 2012, 4:1 Page 11 of 17 http://www.aejournal.net/content/4/1/1 discovery service, which is in charge of storing all of Protocol Enactment the published OKCs and protocols. (The discovery When the protocol and OKCs are in place, and suffi- service is itself not a centralised, but a completely cient peers have subscribed to the required roles of the protocol, the protocol itself can be enacted. distributed service that follows the decentralised approach taken by P2P networks.) The discovery ser- 1. Selecting the laboratories. When the interaction vice then retrieves all the protocol specifications starts, the protocol iniciator peer recevies the list of whose metadata matches the query. The researchers all the peers that have subscribed to the omicslab can then read the descriptions associated to the role is received by the peer. This list is shown to the retrieved protocol specifications in order to select the one that suits them best. If no specification suits user which has to select the subset of the labs that them, they can refine the query by using different he or she wants to query. keywords. 2. Building the proteomic query. Having selected the 3. Installing the researcher OKC. 
Recall that the pro- laboratories to which the query will be sent, the user tocol specified for our experiment defines two roles: will then have to create the query. This is also done the omicslab, which is in charge of replying to through a user interface providing users with a form queries, and the researcher, which is the one sending through which they can specify the query. This form out queries (see the Protocol Specification section consists of: above). Actual researchers that want to query peer � an item asking to select the type of search laboratories will have to do so by enacting the (BLAST or OMSSA); researcher role as specified in the protocol. For this � a text box in which to write the proteomic text they must have previously installed an OKC capable query or import it from a file; of solving the constraints attached to that role (see � another text box where the researcher can add the OpenKnowledge Components (OKCs) section meta-data annotations that are used if the confi- above). OKCs can also be published in the OK sys- dence on the returned results is to be deter- mined (see the next subsection); and tem so that it is easy for users to find them and � a subform where the user can enter custom install them locally on their computers. Downloading search arguments to be used by the search and installing an OKC from the OK system can be engines. achieved directly from the kernel’s user interface. Once both the protocol specification and the role Once the query has been introduced by the researcher that a user wants to play have been chosen, one needs to search for existing OKC implementations it is sent to all the selected omicslab peers so that they for the given role, and then download and install can process it and reply with the set of matching pro- them. At this point a researcher would be ready to teomic data. start launching proteomic queries. Although this is the simplest way to install the required OKC, an 3. 
Showing the results from laboratories. Every time advanced user may also find or develop an OKC an omicslab peer replies to the query, the researcher through other means and plug it into the kernel. OKC stores the results. Once all the omicslab peers 4. Subscribing to the researcher role.Once the actual to which the query was sent have replied, the results researchers have installed the OKC needed for play- are shown to the users via a custom visualisation. ing the researcher role, they can start the peer inter- Through this custom GUI researchers can browse action through a subscription. Users select the through the different results and compare them. protocol specification and role they want to play and This is the final step of the protocol execution, if the then run the subscription command from the user researchers want to make another query they simply interface. This command sends the subscription start another interaction. information to the discovery service, which will define a peer (the researcher or another peer in the network) that will act as a coordinator of the peer Confidence in Peers In our scenario, researchers send queries to and receive interaction, following the protocol that will govern answers from a set of peer proteomics laboratories. For the peer interaction between the researcher and these researchers it is important to have a mechanism those peers that have subscribed to play the other helping them to distinguish which laboratories return roles defined in the protocol. In our case these will more significant and relevant answers to their queries. be the laboratory peers subscribed to play the omic- slab role. This can be achieved by measuring the confidence that Schorlemmer et al. 
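As a rough illustration of the query round behind steps 1 to 3 of the protocol enactment, the following Python sketch walks a laboratory list recursively, sends the same query to each selected peer, and collects the answers. It is a minimal sketch under stated assumptions, not the OpenKnowledge implementation: LCC message passing is replaced by plain function calls, and the names `Query`, `researcher` and `peers` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Query:
    """Arguments of the query message (field names are illustrative)."""
    search_type: str                  # "BLAST" or "OMSSA"
    search_arguments: Dict[str, str]  # custom search-engine parameters
    input_format: str                 # format of the input data
    input_data: str                   # sequences (BLAST) or spectra (OMSSA)


# An answer pairs the search result with metadata: (Result, ResultInfo).
Answer = Tuple[str, Dict[str, str]]
# Hypothetical stand-in for a remote omicslab peer solving findHit.
FindHit = Callable[[Query], Answer]


def researcher(lab_list: List[str], query: Query, peers: Dict[str, FindHit],
               collected: Optional[Dict[str, Answer]] = None) -> Dict[str, Answer]:
    """Recurse over the laboratory list, one query/answer exchange per step."""
    collected = {} if collected is None else collected
    if not lab_list:
        # Empty list: all answers are in; this is where the results
        # would be handed over for visualisation.
        return collected
    head, *tail = lab_list
    # Send the query to the head laboratory and store its answer.
    collected[head] = peers[head](query)
    return researcher(tail, query, peers, collected)
```

In a real deployment the call `peers[head](query)` would be a message exchange over the P2P network, and a slow or failing laboratory would need handling that this sketch leaves out.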
This confidence is built during successive queries launched by the researcher to their peer laboratories.

In [10], the trust on a peer (a proteomics lab in our scenario) is defined as the overall satisfaction with previous experiences with that peer. In our case the satisfaction measure of each particular experience with a laboratory is based on the similarity between the query and the answer obtained from the laboratory. This similarity, however, is not only the similarity between amino-acid sequences; it also takes into account additional information such as the enzyme chosen for digestion during sample preparation, the type of mass spectrometer used, or the kind of organism that the sample was taken from, among other information.

The similarity between a query and a laboratory hit (the basic building block of the confidence calculation in this application domain) is then defined as the product of two factors:

$$\text{spectrum similarity} = \text{spectrum significance} \times \text{protocol similarity}$$

where the value of spectrum significance represents how significant the matched spectrum is with respect to the query spectrum, and the value of protocol similarity represents how similar the spectrographic protocols are that were followed by the researcher when obtaining the spectrum in the query and by the laboratory when obtaining the spectra in its database. Let us explain these two similarity measures in more detail.

Spectrum significance
To calculate the significance of a spectrum (or of its associated sequence of amino-acid characters), search engines that work over databases of amino-acid sequences usually report a score S together with a probabilistic value, referred to as P-value. The score S is a measure of the similarity of the query to the sequence matched, and the P-value is a measure of the reliability of this score. It is the probability due to chance that there is at least another match with score greater than or equal to S. (Here chance means the comparison of (i) real but non-homologous sequences; (ii) real sequences that are shuffled to preserve compositional properties; or (iii) sequences that are generated randomly based upon a DNA or protein sequence model [11].) But instead of determining P-values directly, search engines such as BLAST or OMSSA report the so-called E-values, which are easier to interpret. The E-value is the number of times a sequence with a score greater than S may occur by chance in the database. That is, the closer the E-value to zero, the better the match. But as E-values are not normalised in [0,1], we take the P-value for computing spectrum significance, which can be derived from the reported E-value as follows: $P = 1 - e^{-E}$.

The second factor that we consider is the number of significant spectra in the set M of matched sequences reported by a peer laboratory with respect to the score distribution of the sequences of M. Thus, given a query sequence, the overall significance of the set of matched sequences provided by a peer laboratory is given by

$$\frac{\sum_{n \in N} S_n}{|N|}, \quad \text{where } N = \{m \in M \mid S_m > \sigma_M\},$$

$\sigma_M$ being the standard deviation of scores in the population M.

Protocol Similarity
In its graphical representation a mass spectrum is a plot of the mass to charge values of the detected ions versus their corresponding intensity. Fragmentation spectra from peptides show profiles that are characteristic of the peptide sequence and that contain different types of ions [12]. In addition to the returned spectra, researchers need to have some confidence that the way in which the spectrum in the query has been obtained is comparable to the way in which the hits are obtained. This is because, although the protocol followed by a laboratory may be well defined, the protocol itself admits certain variations that will produce spectra with different ion types. (Bear in mind that this is not the peer-interaction protocol specified in LCC for data sharing in our P2P network.) These variations include the enzymes used to modify and to digest the amino-acids, and the type of mass spectrometer used to produce the spectra.

Another important factor is the organism from which the protein has been obtained. All this information is provided as metadata. We define the protocol similarity of a hit between a query and a database entry as

$$\text{protocol similarity} = \text{organism similarity} \times \text{modification similarity} \times \text{digestion similarity} \times \text{mass-spectrometer similarity}$$

Organism similarity. The semantic similarity among organisms $o_1$ and $o_2$ used for our confidence evaluation is based on the organism taxonomy tree according to the NCBI lineage. Figure 13 shows a fragment of this taxonomy.

To define a similarity measure we have used the one described in [13] and repeated here:

$$\mathrm{Sim}(o_1, o_2) = \begin{cases} 1 & \text{if } o_1 = o_2 \\ e^{-\kappa_1 l} \cdot \dfrac{e^{\kappa_2 h} - e^{-\kappa_2 h}}{e^{\kappa_2 h} + e^{-\kappa_2 h}} & \text{otherwise} \end{cases}$$

where $l$ is the length (i.e., number of edges) of the shortest path between nodes, $h$ is the depth of the deepest node subsuming both nodes, and $\kappa_1$ and $\kappa_2$ are parameters balancing the contribution of these two quantities.
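The formulas of this section can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: each organism is represented by its root-to-organism lineage as a list of names, the root is taken to have depth 0, and the function names are hypothetical. The default values of `kappa1` and `kappa2` are the parameter values reported in this section.

```python
import math
from typing import Sequence


def p_value_from_e_value(e_value: float) -> float:
    """P = 1 - e^(-E); expm1 keeps precision when E is close to zero."""
    if e_value < 0:
        raise ValueError("E-values are non-negative")
    return -math.expm1(-e_value)


def organism_similarity(lineage_a: Sequence[str], lineage_b: Sequence[str],
                        kappa1: float = 0.02, kappa2: float = 0.6) -> float:
    """Sim(o1, o2) computed from two lineages that start at the same root."""
    if list(lineage_a) == list(lineage_b):
        return 1.0
    # Count the nodes shared from the root downwards.
    shared = 0
    for node_a, node_b in zip(lineage_a, lineage_b):
        if node_a != node_b:
            break
        shared += 1
    if shared == 0:
        raise ValueError("lineages must start at the same root")
    depth = shared - 1  # h: depth of the deepest common node (root = 0, assumed)
    # l: number of edges from each organism up to the deepest common node.
    path = (len(lineage_a) - shared) + (len(lineage_b) - shared)
    # tanh(x) equals (e^x - e^-x) / (e^x + e^-x), the second factor of Sim.
    return math.exp(-kappa1 * path) * math.tanh(kappa2 * depth)


def protocol_similarity(organism: float, modification: float,
                        digestion: float, spectrometer: float) -> float:
    """Product of the four metadata similarities."""
    return organism * modification * digestion * spectrometer


def spectrum_similarity(significance: float, protocol: float) -> float:
    """Spectrum significance weighted by protocol similarity."""
    return significance * protocol
```

With the human and rhesus macaque lineages listed in this section and the default parameters, `organism_similarity` evaluates to about 0.82, close to the reported value of 0.81.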
Figure 13 Fragment of the organism lineage tree.

The parameters $\kappa_1$ and $\kappa_2$ balance the contribution of shortest path length and depth respectively. For example, the path to humans is:

cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Euarchontoglires; Primates; Haplorrhini; Simiiformes; Catarrhini; Hominoidea; Hominidae; Homo/Pan/Gorilla group; Homo; Homo sapiens

and the paths to the rat, the sole, and the rhesus macaque are, respectively:

• cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Euarchontoglires; Glires; Rodentia; Sciurognathi; Muroidea; Muridae; Murinae; Rattus
• cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Actinopterygii; Actinopteri; Neopterygii; Teleostei; Elopocephala; Clupeocephala; Euteleostei; Neognathi; Neoteleostei; Eurypterygii; Ctenosquamata; Acanthomorpha; Euacanthomorpha; Holacanthopterygii; Acanthopterygii; Euacanthopterygii; Percomorpha; Pleuronectiformes; Soleoidei; Soleidae; Solea; Solea senegalensis
• cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; Eumetazoa; Bilateria; Coelomata; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Euarchontoglires; Primates; Haplorrhini; Simiiformes; Catarrhini; Cercopithecoidea; Cercopithecidae; Cercopithecinae; Macaca; Macaca mulatta

Taking as parameter values $\kappa_1 = 0.02$ and $\kappa_2 = 0.6$ we get the following similarities:

Sim(homo sapiens; solea) = 0.46
Sim(homo sapiens; rattus) = 0.72
Sim(homo sapiens; macaca mulatta) = 0.81

The NCBI database of organisms and taxonomies is dynamic and very large (currently there are more than 300,000 organisms), but it provides a REST web service to get taxonomic information without the need to download the entire database. (REST is a method to make web service queries where the query is written as a URL over HTTP.)

Modification similarity. To calculate the similarity of modification terms, rather than a tree distance we have used a binary similarity table (see Table 1). This is because there are situations that cannot be compared. For example, a peptide modified with oxidation of an amino-acid cannot be compared with another peptide that has not been modified at all.

Table 1 Similarity table for peptide modification
modification code   -1    1    2    3    5   31   32   89   90
-1                   1
1                    1    1
2                    1    0    1
3                    1    0    1    1
5                    1    0    1    1    1
31                   1    0    1    1    1    1
32                   1    0    1    1    1    1    1
89                   1    1    0    0    0    0    0    1
90                   1    1    0    0    0    0    0    0    1
-1 = nothing; 1 = oxidation of M; 2 = carboxymethyl C; 3 = carbamidomethyl C; 5 = propionamide C; 31 = carbamylation of K; 32 = carbamylation of N-term peptide; 89 = oxidation of H; 90 = oxidation of W. For the remaining codes the similarity is 1 if they are equal and 0 if they are not equal.

Digestion similarity. To calculate the similarity of the digestion terms we assign a similarity value of 1 between identical enzymes, and also between Trypsin and LysC, but only if the peptide ends with K. In all other cases we assign a similarity value of 0.

Mass-spectrometer similarity. Finally, the similarity of the mass spectrometers is calculated based on the taxonomy tree depicted in Figure 14 (see Table 2 for acronyms). The tree classifies the spectrometers according to the type of ion fragmentation profiles that are typically produced. The main classification parameters are the ionization method (MALDI or ESI ionization), which determines the type of precursor ions formed, and the collision energy (high- or low-energy collision).

Figure 14 Similarity tree for different mass spectrometers. (Node labels: Spectrometer; Tandem Spectrometer; Type I, Type II, Type III, Type IV; Default; ETD MS; ECD MS; ESI-QIT; ESI-LIT; ESI-QqLIT; ESI-TSQ; ESI-QTOF; ESI-TOF/TOF; ESI LIT-FT (HCD); MALDI-LIT; MALDI-TOF/TOF.)

Table 2 Mass spectrometer acronyms
acronym        mass spectrometer
ESI            electrospray
MALDI          matrix assisted laser desorption
QIT            quadrupole ion trap
QqLIT          hybrid quadrupole-linear ion trap
TSQ            triple stage quadrupole
QTOF           hybrid quadrupole-time of flight
TOF            time of flight
LIT-FT(HCD)    hybrid LIT-Orbitrap with high collision dissociation
ETD            fragmentation by electron-transfer dissociation
ECD            fragmentation by electron-capture dissociation

Results and Discussion
To play the researcher role we randomly selected 38 peptide sequences obtained by PEAKS de novo analysis of the mass spectrometric analytical data obtained by the LP CSIC/UAB proteomics laboratory from the ABRF sample. Additionally we included data derived from spectra that during the original analysis matched to proteins not included in the standard ABRF list.

The mass spectrometric data set from LP CSIC/UAB was obtained by LC-MS/MS analysis of the tryptic digest of the protein mixture in the ABRF sample. This data set included 2000 spectra from which 48 of the 49 proteins in the ABRF standard could be identified by conventional proteomic data analysis. Each protein was identified from the sequence of one or more of its tryptic peptides. Queries were performed against 9 databases, including 7 proteomics labs, the NCBI Swiss-Prot database, and the database of the researcher itself. Making the researcher laboratory play both roles in the peer interaction (researcher and omicslab) served as a true positive control to check for failures in the functionality of our system.

To evaluate the evolution of the confidence of reported answers to queries, the 48 sequences were divided into 4 groups to be able to mimic the building of a query history. Although this model does not produce a valid history (as sequences were grouped randomly), it allows the functionality of the confidence calculation to be evaluated.

Figure 15 Query window and BLAST search parameters used for this study. The sequences shown in the image correspond to the first group of queries.
Figure 16 OK-omics Prospector window. Responses to the first group of queries (% coincidences).

Each group was queried with the same parameters (Figure 15) and the results analysed in the researcher OKC prospector window (Figure 16). As expected, the search in the researcher's database (column labelled with uab) always produced full coincidences. In contrast, other proteomics labs and the NCBI Swiss-Prot database (labelled with ncbi) produced more diverse results. Most of the queries produced high percentage identity values in the ncbi search. These hits give direct information about the identity of the peptide and the source protein ('id' and 'des' text windows in Figure 17). One of the queries in Figure 17 (Query 10) produced a 100% coincidence in the NCBI Swiss-Prot database. The expectation values for this match indicated that it was not due to chance. The protein that had been tentatively identified, P20160 (azurocidin precursor), was, however, not included in the list of components of the standard ABRF sample.

Figure 17 Match data from ncbi database for query 10.
Figure 18 Match data from ncbi database for query 11.

The analysis of the answers from other proteomics labs for this sequence showed that other laboratories found identical (see, e.g., cpif) or highly homologous sequences (see, e.g., ucm). This fact indicated that several laboratories had observed the presence of the same component in their samples and supported the fact that the queried sequence was not the result of noise or lab-specific sample preparation artifacts. More detailed analysis of the results indicated that the presence of protein P20160 was supported by other NCBI matches (see, e.g., Query 11 in Figure 18) and that the corresponding query also had produced highly scored matches in many proteomics labs. This analysis clearly showed that P20160 was a contaminant at relatively high concentration (several laboratories detected several of its tryptic peptides) that could have been present in the ABRF sample. This fact is evident from the information that can be derived from the OK system. However, it is not straightforward to arrive at the same conclusion by conventional means, as it is difficult to rule out organic contamination of the samples. The actual presence of this artifact was only established after the ABRF study of all the identifications reported by the laboratories.

Figure 19 Confidence evaluation.

The confidence on the result of each proteomics lab was evaluated for the 4 queries performed (Figure 19). Confidence values for the different laboratories are in the range from near 0 to near 1, indicating different efficiencies in sending back high-scoring matches for the queried sequences. No improvements in the quality of the information derived from the OK-omics system were observed by selecting 2-3 of the more trusted labs for these queries. Due to the small size of the databases, an important fraction of the processing time was due to the public NCBI database search. Selecting a few laboratories of high trust could however increase the performance when a higher number of peers are involved in the interaction. As expected from the origin of the data (sequences randomly taken from the LP CSIC/UAB data set), trust values are stable over the experiments.

Conclusions
We have presented a new form of data sharing for expression proteomics with the aim of (1) augmenting significantly the percentage of peptides and proteins to be sequenced and identified by means of mass-spectrometry-based analysis, and (2) reducing significantly the sequencing and identification time needed. For this we have combined current bioinformatics techniques for proteomics with a novel multiagent system architecture and a distributed knowledge coordination mechanism in peer-to-peer networks, which have been developed in the context of the OpenKnowledge EU project.

In this article we have specified the data-sharing peer-interaction protocol for P2P proteomics, implemented the P2P data-sharing system using the OpenKnowledge system, and carried out a feasibility experiment with test data from preexisting MS/MS data repositories from the 2006 ABRF test sample provided by different laboratories for the ProteoRed scientific community.

We conclude that by using the proposed P2P data-sharing system and protocol a researcher is capable of deriving information from the test data that is not straightforward to obtain by conventional means. This in turn shows that P2P data-sharing in proteomics can indeed lead to enhanced protein identification.

Acknowledgements
This research has been supported by the OpenKnowledge STREP (FP6-027253) funded by the European Commission; by grants BIO2009-11735, CONSOLIDER-INGENIO 2010 Agreement Technologies (CSD2007-0022) and CBIT (TIN2010-16306) funded by Spain's Ministerio de Ciencia e Innovación; and by the Generalitat de Catalunya (2009-SGR-1434). The LP-CSIC/UAB is a member of ProteoRed (http://www.proteored.org), funded by Genoma Spain, and follows the quality criteria set up by ProteoRed standards. We are also grateful for the test data provided by the proteomics facilities from Centro de Investigación Príncipe Felipe, CIC bioGUNE, Centro Nacional de Biotecnología (CSIC), Hospital Universitari Vall d'Hebron, Universidad de Alicante, Universidad de Córdoba, Universidad Complutense de Madrid, and Universidad del País Vasco - Euskal Herriko Unibertsitatea.

Author details
1 Artificial Intelligence Research Institute, IIIA-CSIC, Spain. 2 CSIC/UAB Proteomics Laboratory, IIBB-CSIC, IDIBAPS, Spain.

Authors' contributions
MS, JA, and CS conceived the P2P-based spectra-sharing method and experimentation, and they specified the peer-interaction protocol. DC and LB implemented OKCs and GUIs, and made the computational setup of the experiment, which was validated and interpreted by JA. CS and EJ designed the confidence evaluation module, which EJ further implemented and tested. AP and DC adjusted the OK system to the spectra-sharing platform. MS wrote the initial versions of the article based on contributions of AP, MA, LB, DC and EJ; and prepared the final version. All authors read and approved the final manuscript.

Competing interests
The authors declare that they have no competing interests.

Received: 30 December 2011 Accepted: 31 January 2012 Published: 31 January 2012

References
1. Pertea M, Salzberg SL: Between a chicken and a grape: estimating the number of human genes. Genome Biology 2010, 11:206.
2. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic Local Alignment Search Tool. Journal of Molecular Biology 1990, 215:403-410.
3. Pearson WR: Rapid and Sensitive Sequence Comparison with FASTP and FASTA. In Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences, Volume 183 of Methods in Enzymology. Edited by: Doolittle R. Academic Press; 1990:63-98.
4. Geer L, Markey S, Kowalak J, Wagner L, Xu M, Maynard D, Yang X, Shi W, Bryant S: Open mass spectrometry search algorithm. Journal of Proteome Research 2004, 3:958-964.
5. Siebes R, Dupplaw D, Kotoulas S, Perreau de Pinninck A, van Harmelen F, Robertson D: The OpenKnowledge System: An Interaction-Centered Approach to Knowledge Sharing. In On the Move to Meaningful Internet Systems 2007: CoopIS, DOA, ODBASE, GADA, and IS. OTM Confederated International Conferences CoopIS, DOA, ODBASE, GADA, and IS 2007, Vilamoura, Portugal, November 25-30, 2007, Proceedings, Part I, Volume 4803 of Lecture Notes in Computer Science. Edited by: Meersman R, Tari Z. Springer; 2007:381-390.
6. Perreau de Pinninck A, Dupplaw D, Kotoulas S, Siebes R: The OpenKnowledge Kernel. International Journal of Applied Mathematics and Computer Sciences 2007, 4(3):162-167.
7. Robertson D, Giunchiglia F, van Harmelen F, Marchese M, Sabou M, Schorlemmer M, Shadbolt N, Siebes R, Sierra C, Walton C, Dasmahapatra S, Dupplaw D, Lewis P, Yatskevich M, Kotoulas S, Perreau de Pinninck A, Loizou A: Open Knowledge – Coordinating Knowledge Sharing Through Peer-to-Peer Interaction. In Languages, Methodologies and Development Tools for Multi-Agent Systems. First International Workshop, LADS 2007. Durham, UK, September 4-6, 2007. Revised Selected and Invited Papers, Volume 5118 of Lecture Notes in Artificial Intelligence. Edited by: Dastani M, El Fallah Seghrouchni A, Leite J, Torroni P. Springer; 2008:1-18.
8. Miller T, McGinnis J: Amongst First-Class Protocols. In Engineering Societies in the Agents World VIII, 8th International Workshop, ESAW 2007, Athens, Greece, October 22-24, 2007, Revised Selected Papers, Volume 4995 of Lecture Notes in Computer Science. Edited by: Artikis A, O'Hare GMP, Stathis K, Vouros GA. Springer; 2008:208-223.
9. Robertson D: Multi-agent Coordination as Distributed Logic Programming. In Logic Programming. 20th International Conference, ICLP 2004, Volume 3132 of Lecture Notes in Computer Science. Edited by: Demoen B, Lifschitz V. Springer; 2004:416-430.
10. Giunchiglia F, Sierra C, McNeill F, Osman N, Siebes R: Good Enough Answer Algorithms. Deliverable D4.5, OpenKnowledge 2007.
11. The Statistics of Sequence Similarity Scores. Retrieved from http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html on November 19, 2008. n.d.
12. Johnson RS, Martin SA, Biemann K, Stults JT, Watson JT: Novel Fragmentation Process of Peptides by Collision-Induced Decomposition in a Tandem Mass Spectrometer: Differentiation of Leucine and Isoleucine. Analytical Chemistry 1987, 59:2621-2625.
13. Li Y, Bandar ZA, McLean D: An approach for measuring semantic similarity between words using multiple information sources. IEEE Transactions on Knowledge and Data Engineering 2003, 15:871-882.

doi:10.1186/1759-4499-4-1
Cite this article as: Schorlemmer et al.: P2P proteomics – data sharing for enhanced protein identification. Automated Experimentation 2012 4:1.

Journal

Automated Experimentation, Springer Journals

Published: Jan 31, 2012

References