Nucleic Acids Research, 2005, Vol. 33, Database issue D29–D33
doi:10.1093/nar/gki098

Carola Kanz*, Philippe Aldebert, Nicola Althorpe, Wendy Baker, Alastair Baldwin, Kirsty Bates, Paul Browne, Alexandra van den Broek, Matias Castro, Guy Cochrane, Karyn Duggan, Ruth Eberhardt, Nadeem Faruque, John Gamble, Federico Garcia Diez, Nicola Harte, Tamara Kulikova, Quan Lin, Vincent Lombard, Rodrigo Lopez, Renato Mancuso, Michelle McHale, Francesco Nardone, Ville Silventoinen, Siamak Sobhany, Peter Stoehr, Mary Ann Tuli, Katerina Tzouvara, Robert Vaughan, Dan Wu, Weimin Zhu and Rolf Apweiler

EMBL Outstation, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

Received September 14, 2004; Revised October 6, 2004; Accepted October 14, 2004

*To whom correspondence should be addressed. Tel: +44 1223 494453; Fax: +44 1223 494468; Email: [email protected]

The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use permissions, please contact [email protected].

© 2005, the authors
Nucleic Acids Research, Vol. 33, Database issue © Oxford University Press 2005; all rights reserved

ABSTRACT

The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl), maintained at the European Bioinformatics Institute (EBI) near Cambridge, UK, is a comprehensive collection of nucleotide sequences and annotation from available public sources. The database is part of an international collaboration with DDBJ (Japan) and GenBank (USA). Data are exchanged daily between the collaborating institutes to achieve swift synchrony. Webin is the preferred tool for individual submissions of nucleotide sequences, including Third Party Annotation (TPA) and alignments. Automated procedures are provided for submissions from large-scale sequencing projects and data from the European Patent Office. New and updated data records are distributed daily and the whole EMBL Nucleotide Sequence Database is released four times a year. Access to the sequence data is provided via ftp and several WWW interfaces. With the web-based Sequence Retrieval System (SRS) it is also possible to link nucleotide data to other specialist molecular biology databases maintained at the EBI. Other tools are available for sequence similarity searching (e.g. FASTA and BLAST). Changes over the past year include the removal of the sequence length limit, the launch of the EMBLCDSs dataset, extension of the Sequence Version Archive functionality and the revision of quality rules for TPA data.

INTRODUCTION

The European Bioinformatics Institute (EBI) is an outstation of the European Molecular Biology Laboratory (EMBL) in Heidelberg, Germany. It is located on the Wellcome Trust Genome Campus near Cambridge, UK.

The mission of the Service Programme at the EBI is the building, maintenance and provision of biological databases and other information services to support data deposition and free access by the scientific community (1).

The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/) is Europe's primary nucleotide sequence resource. This database is the European part of an international collaboration with DDBJ (Japan) (2) and GenBank (USA) (3) (INSDC, International Nucleotide Sequence Database Collaboration). Data are exchanged on a daily basis between the collaborating institutes. The data in the EMBL Nucleotide Sequence Database originate from a combination of large-scale genome sequencing projects, direct submissions from individual scientists and the European Patent Office. There is a quarterly release of the whole database and new and updated records are distributed daily.

Over the last year, the size of the EMBL Nucleotide Sequence Database has increased from 27.2 million entries in Release 76, September 2003 to 42.3 million entries in Release 80, September 2004, of which 4.4 million entries are WGS (Whole Genome Shotgun) data. There are now over 185 000 organisms represented in the database.

In 2004, the limit on sequence length was dropped, the EMBLCDSs dataset containing all coding sequences annotated in the EMBL Nucleotide Sequence Database was launched, the data collection rules for Third Party Annotation (TPA) data were revised and the functionality of the Sequence Version Archive was extended further.

Other databases provided by the EBI include the protein resource UniProt (4), InterPro, a database of protein families, domains and functional sites (5), the Macromolecular Structure Database E-MSD (6), the automatic genome annotation database Ensembl (7), Genome Reviews, curated versions of complete Genomes from the EMBL Database, the Enzyme database IntEnz (8) and the database for protein interaction data, IntAct (9).

SUBMISSIONS TO THE EMBL NUCLEOTIDE SEQUENCE DATABASE

Why is it essential to submit new sequence?

Printing sequence data as part of a publication is neither sensible nor manageable, hence journals prefer to cite only the accession number assigned by the INSD Collaboration. Most journals have a mandatory submission procedure such that papers will only be accepted if they have an accession number. The nucleotide sequence is considered part of the publication and therefore almost all nucleotide sequences are publicly available. Having your sequence in the database means it is readily available to the scientific user community. A repository of primary nucleotide sequence data that is freely accessible is essential for computational analysis and genome research.

How to submit new sequences to the EMBL Nucleotide Sequence Database?

The primary tool for submission of nucleotide sequence data is Webin. For alignment data, it is Webin-Align. Projects with large-scale submissions can open a project account allowing direct updates.

Information for submitters can be found here: http://www.ebi.ac.uk/embl/Documentation/information_for_submitters.html. For submission guidelines please see http://www.ebi.ac.uk/embl/Submission/.

Webin

Webin is the preferred submission tool for nucleotide sequences and biological information. It should also be used for TPA submissions. Webin allows fast submissions of single, multiple and very large numbers of sequences (bulk submissions) and is available at http://www.ebi.ac.uk/embl/Submission/webin.html.

Genome project submissions

Large-scale sequencing projects can open a project account to deposit and update data directly using email or ftp. Groups producing large volumes of sequence data are advised to contact the database at [email protected]. More information is available at http://www.ebi.ac.uk/embl/Submission/genomes.html.

Alignment submissions

Webin-Align (10) is the dedicated submission tool for multiple nucleotide and protein alignments. It accepts all common alignment formats and is available at http://www.ebi.ac.uk/embl/Submission/align_top.html.

WGS submissions

WGS data submission is not a continuous process: WGS datasets are normally not updated more often than once every few months. Therefore email or ftp accounts are not opened for the submission of WGS data, but submissions are dealt with on a one-by-one basis. Potential submitters are advised to contact the EMBL database at [email protected].

How to update entries in the EMBL Nucleotide Sequence Database?

The editorial rights to an entry in the EMBL Nucleotide Sequence Database remain with the original submitter(s). The EBI team adds value to entries, e.g. via cross-references, but the data itself is archival and is not updated by the EBI. Submitters are advised to update their own entries via the update form (http://www.ebi.ac.uk/embl/webin/update.html).

DATA IN THE EMBL NUCLEOTIDE SEQUENCE DATABASE

Data in the EMBL Nucleotide Sequence Database are grouped into divisions, according to either the methodology used in their generation (e.g. EST and HTG divisions) or taxonomic origin of the sequence source (e.g. HUM and PRO divisions). There are also some specialized entry types.

Whole Genome Shotgun (WGS) data

Methods using WGS data are used to gain a large amount of genome coverage for an organism. The sequences of all contigs originating from one experiment are grouped in a set. WGS entries have the standard EMBL format, with accession numbers clearly distinct from those of non-WGS entries. The accession numbers of all entries in each WGS set share the same prefix.

Third Party Annotation (TPA) data

The Third Party Annotation data set was launched in response to requests from the research community to submit entries that include either re-annotation of existing data, or combinations of novel sequence, existing primary sequence, trace archive and WGS data.

To distinguish TPA entries from primary data, the abbreviation 'TPA' appears at the beginning of each description (DE) line and in the keyword list. The link to the primary data information is given in the linetypes AH and AS that have been created for TPA entries. The following flatfile extract is taken from entry BN000024:

    AH   TPA_SPAN       PRIMARY_IDENTIFIER   PRIMARY_SPAN   COMP
    AS   1-251          BE529226.1           1-251
    AS   68-450         BE524624.1           1-383
    AS   394-1086       AJ420881.1           1-693
    AS   826-1211       AV561543.1           1-386          c

Constructed (CON) entries and expanded CONs

CON entries do not contain a sequence but an assembly of contigs, i.e. the sequence is to be constructed from segments of smaller sequences.

The format of a CON entry is similar to that of a standard entry, with the additional CO linetype to accommodate the assembly information. A CON entry does not have any annotation apart from source features.

The following example of an assembly is taken from entry BX470249:

    CO   join(BX640423.1:1..348251,BX640424.1:51..349146,BX640425.1:51..348257,
    CO   BX640426.1:51..348866,BX640427.1:51..348997,BX640428.1:51..348525,
    CO   BX640429.1:51..344321,BX640430.1:51..348014,BX640431.1:51..347894,
    CO   BX640432.1:51..346301,BX640433.1:51..349305,BX640434.1:51..344805,
    CO   BX640435.1:51..346259,BX640436.1:51..255260)

Recently, the expanded forms of CON entries (CONFF) have been made available via SRS and ftp. In this format, the sequence defined by the assembly and the annotation of the segments are imposed onto the constructed sequence.

EMBLCDSs dataset

Following requests from database users, a new subset of EMBL data, the EMBLCDSs database, has been created during the last year. Every CDS (coding sequence) feature annotated in EMBL entries is displayed as a single entry. More details are provided in the New Developments section below.

ACCESSING THE EMBL NUCLEOTIDE SEQUENCE DATABASE

The EMBL Nucleotide Sequence Database is available from the EBI via various WWW interfaces, ftp and email (for more information see http://www.ebi.ac.uk/embl/Access).

Sequence Retrieval System (SRS)

The EMBL Nucleotide Sequence Database can be accessed via the EBI SRS server (11,12) at http://srs.ebi.ac.uk/. In SRS, the data are available in the libraries shown in Table 1.

Table 1. SRS data libraries

    Library                          Content
    EMBL                             Entire EMBL Nucleotide Sequence Database apart
                                     from Contig and expanded Contig data
    EMBL (Release)                   The latest public release of the EMBL Nucleotide
                                     Sequence Database
    EMBL (Updates)                   All entries that are new or updated since the
                                     latest public release
    EMBL (Third Party Annotation)    TPA data
    EMBL (Contig)                    CON entries
    EMBL (Contigs Expanded)          Expanded CON entries
    EMBL (Coding Sequences)          CDS data
    EMBLALIGN (under 'Nucleotide     Alignment data
    related databases')

WGS data are no longer represented in a separate library, but are part of EMBL (Release) and EMBL (Updates). WGS entries can be identified via the keyword 'WGS'.

SRS also links to other databases, with cross-references to UniProt and publications available online, for example.

FTP Server

Release data, daily updates and cumulative files of all data types can be freely obtained from the ftp server at ftp://ftp.ebi.ac.uk/pub/databases/embl/. Please see the README file for further information.

To create and maintain a local copy of the cumulative file, the syncron tool (ftp://ftp.ebi.ac.uk/pub/software/unix/listtools/) can be used to automatically download newly available incremental data files from the ftp site and to merge them locally.

Dbfetch

Dbfetch (database fetch) is a tool for simple sequence retrieval via http. It can be used to retrieve up to 50 entries from various databases. Dbfetch can be found at http://www.ebi.ac.uk/cgi-bin/dbfetch.

Wsdbfetch provides programmatic access to the Dbfetch functionality. The service is described using Web Services Description Language (WSDL) and uses the Simple Object Access Protocol (SOAP) to communicate with other systems. For further information on Wsdbfetch please see http://www.ebi.ac.uk/Tools/webservices/WSDbfetch.html.

EMBL Sequence Version Archive

The EMBL Sequence Version Archive (SVA) (13) is a repository of all versions of any entry that have been distributed to the public from the EMBL Nucleotide Sequence Database. An interactive web-based interface to the SVA can be accessed at http://www.ebi.ac.uk/cgi-bin/sva/sva.pl.

Entries from the SVA can also be retrieved using dbfetch.

Completed genome sequences

Direct access to completely sequenced genomic components is available via the EBI Genomes server at http://www.ebi.ac.uk/genomes/. At the time of writing (September 2004) there are 162 completed genomes of bacteria, 19 archaea, 36 eukaryota, 540 organelles, 136 phages, 204 plasmids, 903 viruses and 36 viroids available.

Sequence searching

A comprehensive set of sequence analysis and database search algorithms is available at http://www.ebi.ac.uk/Tools/. The most commonly used algorithms available are FASTA (14) and WU-BLAST (15), permitting comparisons between query sequences and the nucleotide, translated nucleotide and protein databases.

Sequence similarity searches are available interactively over the WWW as well as by email. Instructions for email searches can be obtained by sending a message with the word HELP in its body to [email protected].

Access via email

Data can also be retrieved by email using netserv (netserv@ebi.ac.uk). To get started send an email to [email protected] with 'HELP' in the message body.

NEW DEVELOPMENTS

Sequence length limit

In the past, the sequence length of a database record was limited to 350 000 bp. In June 2004, this restriction was lifted and entries of any length are now permitted in the database. Complete genomic units such as entire chromosomes can now be represented in a single entry. To represent unsequenced gaps, the new 'gap' feature is used. Some genomes that were split in the past in order to comply with the 350 000 bp limit have now been updated into single entries, e.g. AE000516.

Third Party Annotations: new rules

Following a decision taken at the 2004 Collaborative Meeting, the INSD Collaboration has increased the stringency for acceptance of data into the TPA dataset. The aim is to ensure that the TPA dataset includes the highest quality sequence and biological annotation.

To achieve this aim, the similarity between the TPA sequence and the contributing primary sequences is checked at the time of submission. We aim to achieve a similarity of at least 90%. In addition, there can be no more than 50 bp of TPA sequence that does not correspond to primary entry(ies). All TPA records are manually curated and checked prior to public release.

To be released into the public TPA dataset, entries must also meet the following requirements:

(i) The study must have been published in a peer-reviewed journal.
(ii) The study must be supported by biological experimental evidence.

Further details may be found at: http://www.ebi.ac.uk/embl/Documentation/third_party_annotation_dataset.html and http://www.ebi.ac.uk/webin/webin_help.html.

EMBL Sequence Version Archive: extended functionality

In February 2004, a new 'batch retrieval' functionality was added to the SVA. Multiple entries can now be retrieved by supplying a list of accession numbers with either entry version number, sequence version number (user-indicated in the interface) or no version details for the most recent entry.

By the end of 2004, expanded CON entries will be included in the SVA.

A warning has been added to report the suppression date for entries that have been suppressed in the database.

EMBLCDSs dataset

Following requests from database users, a new subset of EMBL data, the EMBLCDSs database, has been created during the year. Every CDS (coding sequence) feature annotated in EMBL entries is displayed as a single entry.

Entries are presented in an EMBL-like flatfile format, with the addition of new line types (Figure 1).

    ID   CAD19988 standard; genomic DNA; FUN; 1839 BP.
    XX
    IV   CAD19988.1
    XX
    PA   AJ426417.1
    XX
    DE   Gibberella fujikuroi carotene cyclase
    XX
    OS   Gibberella fujikuroi
    OC   Eukaryota; Fungi; Ascomycota; Pezizomycotina; Sordariomycetes;
    OC   Hypocreomycetidae; Hypocreales; Nectriaceae; Gibberella;
    OC   Gibberella fujikuroi complex.
    OX   NCBI_TaxID=5127;
    XX
    FH   Key             Location/Qualifiers
    FH
    FT   CDS             join(AJ426417.1:801..823,AJ426417.1:874..1416,
    FT                   AJ426417.1:1470..1643,AJ426417.1:1692..2790)
    FT                   /db_xref="GOA:Q8X0Z1"
    FT                   /db_xref="UniProt/TrEMBL:Q8X0Z1"
    FT                   /gene="carRA"
    FT                   /product="carotene cyclase"
    FT                   /function="essential role in carotenoid biosynthesis"
    FT                   /protein_id="CAD19988.1"
    FT                   /translation="MGWEYAQVHLKYTIPFGVVLAAVYRPLMSRLDVFKLVFLITVSFF
    ...
    XX
    SQ   Sequence 1839 BP; 444 A; 433 C; 424 G; 538 T; 0 other;
         atgggctggg aatatgccca agtgcacctg aaatacacga taccgtttgg tgttgttttg        60
         gcggcggttt acagaccgtt gatgtcacgg ctggatgttt ttaagcttgt gtttttgata       120
    ...
    //

Figure 1. A sample entry from the EMBLCDSs dataset.

The primary identifier of the entry given in the ID line is the protein_id of the CDS feature; the IV (identifier version) line gives protein_id and version. The accession number and sequence version of the parent EMBL entry can be found in the PA line. The DE line is created automatically and comprises the organism and product names. The taxonomic information is taken from the parent entry. The CDS annotation itself contains all qualifiers that belong to the feature, nucleotide locations being given in relation to the parent entry(ies). The nucleotide sequence of the feature is shown last in the entry.

The EMBLCDSs dataset is available via SRS [library: EMBL (Coding Sequences)] and ftp (ftp://ftp.ebi.ac.uk/pub/databases/embl/cds).

Finishing whole genome shotgun sets

Data from the WGS projects where the sequencing and assembling process is finished are moved into the main section of the database. At the time of writing only 5 out of 120 relatively small projects have been finished (example: Nanoarchaeum equitans Kin4-M, WGS project prefix: AACL, newly created entry in the main section: AE017199). In all cases, accession numbers of the WGS entries are added as secondary accession numbers to newly created entries in the main section to help track the data.

XML format

The International Nucleotide Sequence Database Collaboration (INSDC) has adopted a first draft for a common XML format for nucleotide data. The DTD can be found at http://www.ebi.ac.uk/embl/Documentation/DTD/INSDSeq_v1.dtd.txt.

CITING THE EMBL NUCLEOTIDE SEQUENCE DATABASE

The preferred form for citation of the EMBL Nucleotide Sequence Database is: Kanz,C. et al. (2005) The EMBL Nucleotide Sequence Database. Nucleic Acids Res., 33, D29–D33.

CONTACTING THE EMBL DATABASE

Contact by email: data submissions: [email protected]; other enquiries: [email protected]; data updates/publication notifications: [email protected].

Postal address: EMBL Nucleotide Sequence Database, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.

Telephone: data submissions, +44 1223 494499; general, +44 1223 494444.

Fax: general, +44 1223 494468.

REFERENCES

1. Brooksbank,C., Camon,E., Harris,M.A., Magrane,M., Martin,M., Mulder,N., O'Donovan,C., Parkinson,H., Tuli,M., Apweiler,R. et al. (2003) The European Bioinformatics Institute's data resources. Nucleic Acids Res., 31, 43–50.
2. Miyazaki,S., Sugawara,H., Ikeo,K., Gojobori,T. and Tateno,Y. (2004) DDBJ in the stream of various biological data. Nucleic Acids Res., 32, D31–D34.
3. Benson,D.A., Karsch-Mizrachi,I., Lipman,D.J., Ostell,J. and Wheeler,D.L. (2004) GenBank: update. Nucleic Acids Res., 32, D23–D26.
4. Bairoch,A., Apweiler,R., Wu,C.H., Barker,W.C., Boeckmann,B., Ferro,S., Gasteiger,E., Huang,H., Lopez,R., Magrane,M. et al. (2005) The Universal Protein Resource (UniProt). Nucleic Acids Res., 33, D154–D159.
5. Mulder,N.J., Apweiler,R., Attwood,T.K., Bairoch,A., Barrell,D., Bateman,A., Binns,D., Biswas,M., Bradley,P., Bork,P. et al. (2003) The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res., 31, 315–318.
6. Golovin,A., Oldfield,T.J., Tate,J.G., Velankar,S., Barton,G.J., Boutselakis,H., Dimitropoulos,D., Fillon,J., Hussain,A., Henrick,K. et al. (2004) E-MSD: an integrated data resource for bioinformatics. Nucleic Acids Res., 32, D211–D216.
7. Clamp,M., Andrews,D., Barker,D., Bevan,P., Cameron,G., Chen,Y., Clark,L., Cox,T., Cuff,J., Curwen,V. et al. (2003) Ensembl 2002: accommodating comparative genomics. Nucleic Acids Res., 31, 38–42.
8. Fleischmann,A., Darsow,M., Degtyarenko,K., Fleischmann,W., Boyce,S., Axelsen,K.B., Bairoch,A., Schomburg,D., Tipton,K.F. and Apweiler,R. (2004) IntEnz, the integrated relational enzyme database. Nucleic Acids Res., 32, D434–D437.
9. Hermjakob,H., Montecchi-Palazzi,L., Lewington,C., Mudai,S., Kerrien,S., Orchard,S., Vingron,M., Roechert,B., Roepstorff,P. and Apweiler,R. (2004) IntAct: an open source molecular interaction database. Nucleic Acids Res., 32, D452–D455.
10. Lombard,V., Camon,E.B., Parkinson,H.E., Hingamp,P., Stoesser,G. and Redaschi,N. (2002) EMBL-Align: a new public nucleotide and amino acid multiple sequence alignment database. Bioinformatics, 18, 763–764.
11. Zdobnov,E.M., Lopez,R., Apweiler,R. and Etzold,T. (2002) The EBI SRS server: new features. Bioinformatics, 18, 1149–1150.
12. Zdobnov,E.M., Lopez,R., Apweiler,R. and Etzold,T. (2002) The EBI SRS server: recent developments. Bioinformatics, 18, 368–373.
13. Leinonen,R., Nardone,F., Oyewole,O., Redaschi,N. and Stoehr,P. (2003) The EMBL sequence version archive. Bioinformatics, 19, 1861–1862.
14. Pearson,W.R. (1994) Using the FASTA program to search protein and DNA sequence databases. Methods Mol. Biol., 24, 307–331.
15. Lopez,R., Silventoinen,V., Robinson,S., Kibria,A. and Gish,W. (2003) WU-Blast2 server at the European Bioinformatics Institute. Nucleic Acids Res., 31, 3795–3798.
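As an illustration of the AS linetype described under Third Party Annotation (TPA) data, the following Python sketch reads the span information from the BN000024 extract quoted in the article. It is not an official EBI parser. It assumes that the fields are separated by whitespace, that spans are written as start-end, and that a trailing 'c' in the COMP column marks a primary span on the complementary strand. The names ASLine, parse_span and parse_as_lines are hypothetical and chosen only for this example.

```python
from typing import List, NamedTuple, Tuple


class ASLine(NamedTuple):
    """One AS line of a TPA entry (field names are illustrative only)."""
    tpa_start: int
    tpa_end: int
    primary_id: str
    primary_start: int
    primary_end: int
    complement: bool


def parse_span(span: str) -> Tuple[int, int]:
    """Split a span written as 'start-end' into two integers."""
    start, end = span.split("-")
    return int(start), int(end)


def parse_as_lines(flatfile_text: str) -> List[ASLine]:
    """Collect all AS lines, assuming whitespace-separated columns."""
    records = []
    for line in flatfile_text.splitlines():
        fields = line.split()
        # Skip the AH header line and anything that is not a full AS line.
        if len(fields) < 4 or fields[0] != "AS":
            continue
        tpa_start, tpa_end = parse_span(fields[1])
        primary_start, primary_end = parse_span(fields[3])
        # Assumption: a fifth column containing 'c' means complement.
        complement = len(fields) > 4 and fields[4].lower() == "c"
        records.append(
            ASLine(tpa_start, tpa_end, fields[2],
                   primary_start, primary_end, complement)
        )
    return records


# Extract from entry BN000024, as quoted in the article.
EXTRACT = """\
AH   TPA_SPAN       PRIMARY_IDENTIFIER   PRIMARY_SPAN   COMP
AS   1-251          BE529226.1           1-251
AS   68-450         BE524624.1           1-383
AS   394-1086       AJ420881.1           1-693
AS   826-1211       AV561543.1           1-386          c
"""

if __name__ == "__main__":
    for record in parse_as_lines(EXTRACT):
        print(record)
```

Each record pairs a span of the TPA sequence with the primary entry and span that supports it, which is the link to primary data that the AH and AS linetypes were introduced to carry.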
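The CO linetype described under Constructed (CON) entries can be read in a similar way. The sketch below is a rough illustration under stated assumptions: it handles only plain join(...) assemblies made of accession.version:start..end segments, as in the BX470249 example, and ignores any other location operators a real CO line might contain. The function names parse_co_lines and assembled_length are hypothetical, and the shortened EXAMPLE string reuses only the first three segments of the published extract.

```python
import re
from typing import List, Tuple

# Matches segments such as 'BX640424.1:51..349146'.
SEGMENT_RE = re.compile(r"([A-Z]+\d+\.\d+):(\d+)\.\.(\d+)")


def parse_co_lines(flatfile_text: str) -> List[Tuple[str, int, int]]:
    """Return (accession.version, start, end) for each segment in the CO lines."""
    # Concatenate the content of all CO lines so that a join(...) expression
    # wrapped over several lines is seen as one string.
    co_text = "".join(
        line[2:].strip()
        for line in flatfile_text.splitlines()
        if line.startswith("CO")
    )
    return [
        (accession, int(start), int(end))
        for accession, start, end in SEGMENT_RE.findall(co_text)
    ]


def assembled_length(segments: List[Tuple[str, int, int]]) -> int:
    """Sum of segment lengths; assumes a simple join with no gaps."""
    return sum(end - start + 1 for _, start, end in segments)


# First three segments of the assembly in entry BX470249 (shortened).
EXAMPLE = """\
CO   join(BX640423.1:1..348251,BX640424.1:51..349146,
CO   BX640425.1:51..348257)
"""

if __name__ == "__main__":
    segments = parse_co_lines(EXAMPLE)
    for segment in segments:
        print(segment)
    print("assembled length:", assembled_length(segments))
```

Because a CON entry stores only the assembly, a reader of the flatfile has to resolve each segment against the referenced entries to obtain the constructed sequence; the expanded CON format mentioned in the article makes that step unnecessary.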
Nucleic Acids Research – Oxford University Press
Published: Jan 1, 2005