New tools and methods for direct programmatic access to the dbSNP relational database

Scott F. Saccone; Jiaxi Quan; Gaurang Mehta; Raphael Bolze; Prasanth Thomas; Ewa Deelman; Jay A. Tischfield; John P. Rice

doi:10.1093/nar/gkq1054

Nucleic Acids Research, 2011, Vol. 39, Database issue D901–D907 (2011-01-30). Published online 29 October 2010.

Scott F. Saccone (1,*), Jiaxi Quan (1), Gaurang Mehta (2), Raphael Bolze (2), Prasanth Thomas (2), Ewa Deelman (2), Jay A. Tischfield (3) and John P. Rice (1,4)

(1) Department of Psychiatry, Washington University, (2) Information Sciences Institute, University of Southern California, (3) Department of Genetics, Rutgers University and (4) Department of Genetics, Washington University, USA

Received August 7, 2010; Revised October 1, 2010; Accepted October 13, 2010

*To whom correspondence should be addressed. Tel: +1 314 286 2581; Fax: +1 314 286 2577; Email: [email protected]

© The Author(s) 2010. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

Genome-wide association studies often incorporate information from public biological databases in order to provide a biological reference for interpreting the results. The dbSNP database is an extensive source of information on single nucleotide polymorphisms (SNPs) for many different organisms, including humans. We have developed free software that will download and install a local MySQL implementation of the dbSNP relational database for a specified organism. We have also designed a system for classifying dbSNP tables in terms of common tasks we wish to accomplish using the database. For each task we have designed a small set of custom tables that facilitate task-related queries and provide entity-relationship diagrams for each task composed from the relevant dbSNP tables. In order to expose these concepts and methods to a wider audience we have developed web tools for querying the database and browsing documentation on the tables and columns to clarify the relevant relational structure. All web tools and software are freely available to the public at http://cgsmd.isi.edu/dbsnpq. Resources such as these for programmatically querying biological databases are essential for viably integrating biological information into genetic association experiments on a genome-wide scale.

INTRODUCTION

Integrating information from biological databases into high-throughput experiments, such as genome-wide association studies (GWAS), requires a database management system (DBMS) that is capable of handling very high volume and that is equipped with resources for dealing with the complex relational structure commonly seen in these databases. For example, one strategy that can be used after a GWAS when selecting single nucleotide polymorphisms (SNPs) for further research is to preferentially target SNPs with evidence of biological relevance (1,2). If a SNP resides in a gene from a pathway theorized to be relevant to the phenotype or if there is evidence that the SNP has a non-neutral effect on gene expression, this biological information may increase the priority for further study of the SNP, such as additional genotyping in a replication sample. The implementation of such a strategy requires (i) direct programmatic access to biological databases in order to efficiently and viably implement the strategy on a genome-wide scale and (ii) a systematic method of isolating the relevant information within the complex network of objects and relationships within these databases. Ideally, we should also require (iii) methods for identifying the specific sequence of experiments that produced this information and for tracing these experiments back to their core biologics and original organisms in order to establish the credibility of the database and viably assess the reliability of the information being retrieved. This work is a description of our efforts to develop tools for utilizing the dbSNP relational database that meet these criteria.

The dbSNP database (3,4) is a repository that accepts submissions of data for SNPs and other structural variation such as short deletions and insertions for a multitude of organisms. It provides mapping data onto a number of conventional genomes, such as the human reference genome GRCh37 (http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human), as well as mapping and functional information for gene transcripts. The database has a complex relational structure and, through the tracking of submitted data, users can retrieve detailed information on the experiments that led to the discovery of the variants, as well as genotype data in certain populations. dbSNP has recently been substantially expanded by data from the 1000 Genomes project (5).

Any algorithm incorporating information from a biological database on a genome-wide scale will benefit substantially from the implementation of the database using a high performance DBMS such as MySQL (http://www.mysql.com), PostgreSQL (http://www.postgresql.org/), MSSQL (http://www.microsoft.com) or Oracle (http://www.oracle.com). Direct programmatic access to a high quality DBMS provides critical support for efficiently executed complex queries to very large databases, where a properly constructed index can mean the difference between a query taking days and one taking minutes. A conventional query language, such as SQL (6), should be used to maintain consistency across independent databases and facilitate cross-database integration, which is common practice when integrating biological information into a GWAS.

Many of the common biological databases found on the internet, such as dbSNP (3,4,7) and the University of California at Santa Cruz (UCSC) genome browser database (8), allow users to download the data in a format conducive to creating a local version of the relational database. dbSNP provides the data in an MSSQL (Microsoft) format and UCSC provides schemas compatible with MySQL. UCSC offers a small number of tables related to dbSNP containing only summary information, excluding some key information such as details on samples used and experimental methodology. It allows users to directly query their MySQL database over the internet, while dbSNP does not.

Querying these databases over the internet is not a viable solution due to the quantity of data involved. The transfer of such large amounts of data over the internet would substantially increase query execution time and place a very heavy burden on public servers. We have developed tools for downloading and implementing a local copy of the dbSNP relational database using the freely available MySQL DBMS. That is, the local version exists on the user's machine so that queries to the database are done locally rather than over the internet. While dbSNP provides numerous online tools for querying and visualizing the database, as well as a download facility for retrieving the database in Microsoft MSSQL format, we have supplemented these tools with our own software for downloading and constructing a local MySQL relational database implementation of dbSNP for a specified organism. Because converting from MSSQL to MySQL is not straightforward, and the file system used in the dbSNP download facility is quite complex, our software greatly simplifies the task of implementing the relational database on a local machine. Our software was developed and tested on the Linux platform. Because it was developed using the Perl programming language, it should be readily portable to other platforms such as Microsoft Windows. This suite of tools and the conventional and freely available MySQL DBMS will allow programmers to implement complex algorithms using a fast, structured interface to the database, making the execution of these algorithms on a genome-wide scale much more viable.

While many public biological databases provide programmers with a means of creating their own local copy, this method of usage is often overshadowed by other means of data access, such as web-based query and visualization tools and proprietary application programmatic interfaces (APIs) such as those offered by Ensembl (9). The preference for these alternatives can result in less documentation and fewer resources being provided for recreating the internal relational database, making it more difficult for programmers to understand the nature of the objects and their relationships in the database. The objects of particular importance are the biologics, the experiments and the results of these experiments. The relational model is well suited to represent the basic flow of information resulting from these experiments, thereby making it easier to assess the quality and reliability of the information being retrieved. More sophisticated models may be necessary to capture the entire spectrum of experimental metadata (10), although this is not necessary for some applications such as integrating biological information into a GWAS.

We have embraced the relational model as a means of elucidating the complex relational structure of the dbSNP database. We have divided the database into groups of tables corresponding to specific tasks we wish to accomplish and provide entity-relationship (ER) diagrams in the supplementary data for each task. The task model clarifies key parts of the database and suggests specific queries that should be used to accomplish each task. The ER diagrams illustrate the relationships between task-related tables and aid in the construction of queries. The task model and ER diagrams also provide the user with a clearer view of the sequence of experiments that led to the data and a means of querying specific information on the nature of those experiments.

To make these resources available to a wider audience we have developed a simple web-based query tool (http://cgsmd.isi.edu/dbsnpq) that is integrated with a database documentation tool (http://cgsmd.isi.edu/dbdoc), allowing users with a wide range of backgrounds to perform a variety of tasks with the dbSNP relational database. All tools and software are freely available to the public.

METHODS AND TOOLS

A task-based representation of dbSNP

Table 1 describes the tasks we wish to accomplish with the dbSNP relational database. We use the task concept because in practice the goal is not necessarily to retrieve the full spectrum of information available, but is instead to retrieve only the portion relevant to a specific application, such as cross referencing the results of a GWAS. Figure 1 shows the relationship between our tasks and the tables in the dbSNP database (see Supplementary Table S1 for descriptions of the dbSNP tables) and illustrates some of the relational structure among the tables. The tables were downloaded from the dbSNP FTP server (ftp://ftp.ncbi.nlm.nih.gov) and correspond to build 131 of the human component of the database. In the Supplementary data we have provided a more detailed description of each task, including ER diagrams showing the schemas and relationships for the relevant tables and sample queries with output. We attempted to define the tasks so that the number of related tables is manageable and leads to interpretable ER diagrams. While we developed these methods and tools for human applications, because the database structure is consistent across different organisms, our system should translate to the other organisms maintained by dbSNP with minor modifications.

Table 1. Descriptions of the tasks used in our classification scheme for the tables in the dbSNP database

Task: Description
Submission: Determine the source of the submission such as specific laboratories or researchers, the populations used, any associated publications and how the submissions cluster into 'reference SNP' identification numbers (a group of 'ss' SNP IDs correspond to a unique 'rs' ID via the table SNPSubSNPLink).
Experimental methods: Determine the experimental methods used to produce the data, such as direct DNA sequencing, DNA hybridization and DHPLC (denaturing high pressure liquid chromatography).
Validation: Assess the reliability of the information and evaluate whether or not a reported variant is truly a genetic polymorphism or is just an experimental artifact. Methods include determining if there are multiple submissions with at least one non-computational observation and confirmation by observation of positive frequency in a genotyped sample.
Classification: Determine if the variants are classified as being a true SNP, insertion, deletion and so on.
Sample information: Retrieve information on the biological samples used, such as ethnicity and the number of samples used for a submission.
Alleles and frequency data: Retrieve the alleles observed for the variant, which DNA strand was used and the frequencies of the alleles and genotypes in various populations.
Genome mapping: Retrieve information on how the variants map to various reference genomes, such as the physical mapping coordinates and the quality of the alignments.
Genes and function: Retrieve information on relationships between the variants and genes, such as SNP/gene transcript functional properties (missense mutations, frameshifts, UTR regions and so on).
Flanking sequence: Retrieve the flanking DNA sequences used to define the variant. This can be useful when conducting custom genotyping experiments for variants not represented by commercial SNP microarrays.
Individual genotyping: Retrieve submitted individual genotypes.
Summary information: Retrieve summary information for a reference SNP ID; an amalgamation of the tasks above.

Figure 1. The tasks and corresponding dbSNP tables from our classification scheme. A tree structure is used to partially represent the relationships between the tables. All tables except those listed under 'Local Tables' are directly from dbSNP. Tables with asterisks have names in the dbSNP database that are prefixed by the dbSNP build and suffixed by the representative genome build, such as b131_ContigInfo_37_1 in build 131 of the human database.

For each task we designed a small set of tables which we refer to as 'local tables' (Supplementary Table S3). We derived these from the original dbSNP tables in order to facilitate the tasks by providing the relevant information in a more conducive form. These local tables are often denormalized in the sense that certain numeric foreign keys are replaced by their text values and duplicate keys are replaced by comma-separated lists. While this may lead to inefficient storage, the benefit is faster access to the information, which can be crucial when using the database for genome-wide analysis. Another option is to use views instead of physical tables. A view acts like a physical table but is really a query to one or more existing tables and therefore occupies no additional space. Some of our local tables have too complex a derivation to be implemented as a view, and for others the queries to a view take significantly longer than to a physical table, particularly when the physical table is implemented with special indexes that improve the performance of the query. We believe the performance increase outweighs the issue of increased storage and therefore tend to use physical local tables rather than views.

A routine application of the dbSNP database to a GWAS experiment is to retrieve basic annotation, such as from our table _loc_snp_summary, for an entire commercial SNP microarray (see the section 'Summary Information' in the Supplementary data for details on the table _loc_snp_summary). For dbSNP human build 131 and the Illumina 1M SNP microarray (http://www.illumina.com), which contains one million SNPs, this results in 143 million bytes of data. This is an unwieldy amount of information to query over the internet and illustrates the need for direct access to this information on a local machine. A more substantial query would be to retrieve dbSNP allele frequency data for the Illumina 1M array via our local table _loc_allele_freqs, which results in 799 million bytes of data.

The examples in the Supplementary data provide a snapshot of the flow of information in the dbSNP database and the precise types of data being stored. These examples follow the tasks in Figure 1, starting with Submission and proceeding clockwise. These represent a typical cycle of experiments and applications, from the discovery of a variant to confirmation by genotyping in multiple samples, followed by genome mapping properties and gene transcript analysis and ending with representative flanking sequence to be used for additional genotyping experiments such as disease mapping studies. Our examples convey the relational structure of the database through ER diagrams and provide a detailed view of the data through example queries with output tables.

Software for downloading and implementing dbSNP

We have developed software to automate the process of downloading data and schema files from dbSNP, converting the dbSNP MSSQL schema files to MySQL and loading the data into a local MySQL server. The software is written in the Perl programming language and is executed via a UNIX command line using the syntax dbsnp.pl [command] [options]. Supplementary Table S4 lists the commands and their descriptions (see the section titled 'Download Script' in the Supplementary data). The script is freely available to the public as part of the 'dbSNP Downloader' package, which can be obtained at http://cgsmd.isi.edu/dbsnpq/downloads.php. We also provide MySQL 'dumps' of the human dbSNP database either as a single file containing the entire database or as a separate file for each table. These files are easily imported into a local MySQL database. The set of tables for which we provide downloads is not comprehensive; the remaining tables can be implemented using dbsnp.pl.

A web-based query tool

We have developed a web-based tool, dbSNP-Q (http://cgsmd.isi.edu/dbsnpq), for querying the human data from our MySQL implementation of the dbSNP relational database. This tool has a straightforward interface for entering SNPs and selecting a simple predefined query or entering a custom MySQL query. This allows users with a wide range of backgrounds to accomplish a variety of tasks. In addition to SNPs, users may enter genes and genomic regions in order to retrieve all the corresponding SNPs. There is also an option to look up and add all SNPs in linkage disequilibrium (LD) with the entered SNPs using one of eleven HapMap Phase III populations. Queries may be viewed directly in the browser or downloaded in Excel or tab-delimited text format. The jQuery JavaScript framework (http://www.jquery.com) ensures the interface is browser-independent and is a true interactive web-based application rather than a continuously re-loaded web page. dbSNP-Q is a model-view-controller (MVC) application: it uses Ajax technology as implemented by the jQuery plug-in jqGrid (http://www.trirand.net) to enable users to viably work with tens of millions of SNPs because the core logic, such as sorting and paging through the results of the query, is handled by our server. We have also implemented a vertical view technique that is useful for tables with many columns: clicking on a row immediately displays a separate table showing the contents of only that row as a two-column table.

A web-based database documentation tool

We have developed a web-based tool, dbDoc (http://cgsmd.isi.edu/dbdoc), that provides documentation for any MySQL database. The tool provides documentation and summary information on databases, individual tables, individual columns and, when available, relationships between tables and columns. It uses a web-based hierarchical navigation system to explore the database, starting with a complete list of tables which can then be navigated down to documentation at the individual column level. It also provides support for grouping tables into categories identified by 'tags'; in the case of dbSNP we use the tasks described in Table 1 (http://cgsmd.isi.edu/dbdoc/db.php?db=dbsnp_human_131#tags). In comparison, the dbSNP online data dictionary (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_db_list_table.cgi) requires users to search for tables and columns. This excludes the display of a complete list of the tables, which we believe is very useful for understanding the content of the database. dbDoc includes a search tool equipped with a word completion feature, making it easier for users to find documentation.

dbDoc is integrated into our query tool dbSNP-Q. When a user selects a predefined query in dbSNP-Q, a table showing documentation on each of the columns is displayed beneath the query results. The column descriptions are linked back to dbDoc so the user may browse additional documentation.

DISCUSSION

One of the themes in the development of these tools is the ability to establish the credibility of a biological database and assess the reliability of the information being retrieved from it by providing a means of tracing the information through the sequence of experiments that generated the information, ideally back to the original biologics and organisms. We feel it is important to obtain this information if the data is being used to guide other experiments, such as post-GWAS prioritization of follow up studies. We also believe the information should be provided in such a way that it can be programmatically incorporated into an application such as GWAS prioritization through a DBMS such as MySQL that uses a conventional query language and table relationship paradigm and is supported by a wide variety of programming languages such as Perl and Hypertext Preprocessor (PHP).

A case in point is the dbSNP relational database, which contains information on the labs and scientists that performed the experiments, the technology and the samples used in those experiments and the methods used to map the variants to the reference genomes and determine any relationships to known gene transcripts. dbSNP provides the data and schemas for all the information in their database, as well as an online data dictionary documenting the tables (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_db_list_table.cgi) and an online Handbook (http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=handbook&part=ch5). While this is a tremendous resource to investigators using SNP data, we have developed tools that enhance this resource and support our goal of establishing a programmatic interface to biological databases and promoting credibility and reliability. The task of implementing a MySQL copy of dbSNP is difficult to accomplish manually because the MSSQL conversion is not straightforward and the dbSNP storage system for schemas and data is complex. Furthermore, our task-based table classification system with ER diagrams and custom local tables facilitates usage of the database. dbSNP does provide an ER diagram of their database at ftp://ftp.ncbi.nih.gov/snp/database/erd_dbSNP.pdf. At the time of writing this diagram was last updated in 2005 and some tables in the diagram are obsolete. While it does divide the tables into different subject areas, the result is still quite complex. We believe that focusing on more narrowly defined tasks leads to a more manageable set of tables and therefore a more viable resource for programmers.

The UCSC (8) and Ensembl (9) sites are two other popular sources of biological information that provide data and schemas in MySQL format as well as direct MySQL access. The tables from the UCSC human database relevant to dbSNP, for example snp130 and snp130CodingDbSnp for build 130, provide summary information similar to our _loc_snp_summary table. The Ensembl database offers additional information (http://pre.ensembl.org/info/docs/api/variation), but both sources lack data on submitters and technological methods, which we believe detracts from the credibility of the database.

A limitation in our methods is determining when there are errors in the dbSNP data (7,11), as we provide no resources to specifically address this issue. Our goal is to first clarify the structure of the database in terms of identifying what it is we intend to do with the information, hence the task-based representation. The next steps in developing tools that utilize databases such as dbSNP should be to determine whether it is necessary to incorporate additional validation mechanisms and then implement these mechanisms in such a way that any quality-control procedures that affect the data are clearly traceable, in the same way we trace the experiments that produced the data.

An additional service that may be provided in future implementations of these tools is the provision of probe sequence data from commercial SNP microarrays such as those offered by Affymetrix (http://affymetrix.com) and Illumina (http://illumina.com). While we do provide the dbSNP flanking sequence data that was originally reported by submitters, it may be useful to obtain probe sequence data directly from the manufacturers and provide this data to investigators using similar relational database methods. One application would be the resolution of strand-ambiguous genotype calls.

We have developed a systematic workflow for retrieving and processing data from dbSNP and implementing a local MySQL relational database. This workflow will provide a robust solution for working with future releases of dbSNP. The primary utility for automating this workflow is our dbSNP Downloader package that is freely available to the public at http://cgsmd.isi.edu/dbsnpq/downloads.php. Future releases may involve changes to the format and structure of the files on the dbSNP download site, such as changes to dbSNP database schema files, and we will modify our software to accommodate these changes. We have found that these changes may sometimes be subtle and yet have serious consequences, such as a text file that suddenly uses Windows line delimiters when the format was UNIX in the previous download from dbSNP. This highlights the importance of having a robust system for downloading and processing data. Our dbSNP Downloader package will enhance our ability to ensure the dbSNP-Q web tool provides the most recent build of human dbSNP data.

In order to allow these resources to broaden and remain useful to the general scientific community rather than only to programmers, we will expand our dbSNP-Q and dbDoc web tools and accompanying software packages such as dbSNP Downloader to include experimental genomic data from sources other than dbSNP. This expansion would be subject to the limitations dictated by our goal of establishing credible relational databases of information that can be traced back through specific experiments and processes to the original subjects and biologics. For example, it is not clear if it is practical to develop new utilities for implementing the complete UCSC and Ensembl databases. One reason is that these sites in particular already provide resources for implementing local versions of their databases using software such as MySQL. Another reason is that the information provided by these sites is often derived from other sources such as dbSNP. Our goal, as in the case of dbSNP, is to develop databases for specific experimental sources using the traceability criteria to establish credibility. Nevertheless, integrating these extensive genomic databases into a common resource that includes our dbSNP tools could improve the ability of researchers to integrate a vast array of genomic experimental data using conventional programming methods and therefore this undertaking will be explored. Another resource is the HapMap database, which is already integrated indirectly into the dbSNP-Q web tool for the purpose of querying data for all SNPs in LD with specified SNPs. We are currently designing a HapMap relational database similar to our dbSNP implementation. One difficulty is that the HapMap site (http://hapmap.ncbi.nlm.nih.gov) does not provide any form of relational database schema to model such aspects as population data and the different technologies used to produce the genotype data. Therefore these relational database models must be developed independently.

A key theme in our work is programmatic access to the database, by which we mean a DBMS that is supported by conventional programming languages so that complex algorithms can be implemented using a fast access paradigm to the database. For example, we prefer the widely used Perl programming language (http://www.perl.org) along with the Perl DBI module (http://dbi.perl.org) for MySQL support. Web software development can take advantage of the Linux-Apache-MySQL-PHP (LAMP) and Ajax programming paradigms, where MySQL-friendly PHP scripts can be used to develop web applications, such as our own SNP prioritization tool (https://spot.cgsmd.isi.edu) (2) and the dbSNP-Q query tool (http://cgsmd.isi.edu/dbsnpq). MySQL is a very widely used DBMS with freely available client and server software and numerous professional grade development tools (http://dev.mysql.com). The ability to implement local MySQL copies of large biological databases will be conducive to the development of public tools for integrating this information into other biological experiments such as GWAS.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS

The authors are grateful to the following individuals for testing our software: Andrew Schrage, Richard McEachin and Sharon Ryan. The authors are also very grateful for the assistance received from the dbSNP administrators. The authors thank the reviewers for their comments and suggestions, which helped to improve the quality of the article. Finally, the authors thank the editor, Dr Michael Galperin, for the suggestion of making this work accessible to a wider scientific audience, which led to the development of the dbSNP-Q web tool.

FUNDING

National Institute on Drug Abuse (K01 DA024722 to S.F.S.); American Cancer Society (IRG 5801050 to S.F.S.); National Institute of Mental Health (U24 MH068457 to J.A.T.). Funding for open access charge: National Institutes of Health grants [DA024722 (50%) and MH068457 (50%)].

Conflict of interest statement. None declared.

REFERENCES

1. Saccone,S.F., Saccone,N.L., Swan,G.E., Madden,P.A., Goate,A.M., Rice,J.P. and Bierut,L.J. (2008) Systematic biological prioritization after a genome-wide association study: an application to nicotine dependence. Bioinformatics, 24, 1805–1811.
2. Saccone,S.F., Bolze,R., Thomas,P., Quan,J., Mehta,G., Deelman,E., Tischfield,J.A. and Rice,J.P. (2010) SPOT: a web-based tool for using biological databases to prioritize SNPs after a genome-wide association study. Nucleic Acids Res., 38(Suppl.), W201–209.
3. Sayers,E.W., Barrett,T., Benson,D.A., Bolton,E., Bryant,S.H., Canese,K., Chetvernin,V., Church,D.M., Dicuccio,M., Federhen,S. et al. (2010) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 38(Database issue), D5–D16.
4. Sherry,S.T., Ward,M.H., Kholodov,M., Baker,J., Phan,L., Smigielski,E.M. and Sirotkin,K. (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res., 29, 308–311.
5. Kuehn,B.M. (2008) 1000 Genomes Project promises closer look at variation in human genome. JAMA, 300, 2715.
6. Jamison,D.C. (2003) Structured Query Language (SQL) fundamentals. Curr. Protoc. Bioinformatics, Chapter 9, Unit 9.2.
7. Day,I.N. (2009) dbSNP in the detail and copy number complexities. Hum. Mutat., 31, 2–4.
8. Rhead,B., Karolchik,D., Kuhn,R.M., Hinrichs,A.S., Zweig,A.S., Fujita,P.A., Diekhans,M., Smith,K.E., Rosenbloom,K.R., Raney,B.J. et al. (2010) The UCSC genome browser database: update 2010. Nucleic Acids Res., 38, D613–619.
9. Flicek,P., Aken,B.L., Ballester,B., Beal,K., Bragin,E., Brent,S., Chen,Y., Clapham,P., Coates,G., Fairley,S. et al. (2010) Ensembl's 10th year. Nucleic Acids Res., 38, D557–562.
10. Jones,A.R. and Lister,A.L. (2009) Managing experimental data using FuGE. Methods Mol. Biol., 604, 333–343.
11. Musumeci,L., Arthur,J.W., Cheung,F.S., Hoque,A., Lippman,S. and Reichardt,J.K. (2010) Single Nucleotide Differences (SNDs) in the dbSNP Database May Lead to Errors in Genotyping and Haplotyping Studies. Hum. Mutat., 31, 67–73.



Publisher: Oxford University Press
Copyright: The Author(s) 2010. Published by Oxford University Press.
ISSN: 0305-1048
eISSN: 1362-4962
DOI: 10.1093/nar/gkq1054
pmid: 21037260

Flicek,P., Aken,B.L., Ballester,B., Beal,K., Bragin,E., Brent,S., web-based tool for using biological databases to prioritize SNPs Chen,Y., Clapham,P., Coates,G., Fairley,S. et al. (2010) after a genome-wide association study. Nucleic Acids Res., Ensembl’s 10th year. Nucleic Acids Res., 38, D557–562. 38(Suppl.), W201–209. 10. Jones,A.R. and Lister,A.L. (2009) Managing experimental data 3. Sayers,E.W., Barrett,T., Benson,D.A., Bolton,E., Bryant,S.H., using FuGE. Methods Mol. Biol., 604, 333–343. Canese,K., Chetvernin,V., Church,D.M., Dicuccio,M., Federhen,S. 11. Musumeci,L., Arthur,J.W., Cheung,F.S., Hoque,A., Lippman,S. et al. (2010) Database resources of the National Center for and Reichardt,J.K. (2010) Single Nucleotide Differences (SNDs) Biotechnology Information. Nucleic Acids Res., in the dbSNP Database May Lead to Errors in Genotyping and 38(Database issue), D5–D16. Haplotyping Studies. Hum. Mutat., 31, 67–73.
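
The line-delimiter problem noted in the discussion, where a dbSNP text file switches from UNIX to Windows line endings between downloads, could be caught by a small check run before the files are loaded into a local database. The following is a minimal sketch in Python and is not taken from the dbSNP Downloader package. The function names, the sample identifiers and the two-column tab-delimited layout are hypothetical and serve only as illustration.

```python
def detect_newline_style(data: bytes) -> str:
    """Classify the line delimiters found in the raw bytes of a text file.

    Returns "windows" (CRLF only), "unix" (LF only), "old-mac" (CR only),
    "mixed" (more than one kind present) or "none" (no delimiters at all).
    """
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # LF bytes that are not part of a CRLF pair
    cr = data.count(b"\r") - crlf   # CR bytes that are not part of a CRLF pair
    found = [name for name, n in
             (("windows", crlf), ("unix", lf), ("old-mac", cr)) if n > 0]
    if not found:
        return "none"
    return found[0] if len(found) == 1 else "mixed"


def normalize_newlines(data: bytes) -> bytes:
    """Convert every CRLF pair and every lone CR to a single LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Hypothetical sample: two tab-delimited rows saved with Windows delimiters.
sample = b"rs0001\tA/G\r\nrs0002\tC/T\r\n"

style = detect_newline_style(sample)
if style != "unix":
    # Record the detected style first so the conversion remains traceable.
    print(f"unexpected line delimiters ({style}); normalizing to UNIX")
    sample = normalize_newlines(sample)

# After normalization the last field of each row carries no stray "\r".
rows = [line.split(b"\t") for line in sample.split(b"\n") if line]
```

Reporting the detected style before converting, rather than converting silently, is one way to keep such a processing step visible, in keeping with the aim that procedures affecting the data remain traceable.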

Journal

Nucleic Acids Research – Oxford University Press

Published: Jan 30, 2011
