N. Mulder, R. Apweiler, T. Attwood, A. Bairoch, D. Barrell, A. Bateman, David Binns, M. Biswas, Paul Bradley, P. Bork, P. Bucher, R. Copley, E. Courcelle, Ujjwal Das, R. Durbin, L. Falquet, W. Fleischmann, S. Griffiths-Jones, D. Haft, Nicola Harte, N. Hulo, D. Kahn, Alexander Kanapin, Maria Krestyaninova, R. Lopez, Ivica Letunic, D. Lonsdale, Ville Silventoinen, S. Orchard, M. Pagni, David Peyruc, C. Ponting, J. Selengut, F. Servant, Christian Sigrist, Robert Vaughan, E. Zdobnov (2003)
The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Research, 31(1)
M. Clamp, T. Andrews, D. Barker, Paul Bevan, G. Cameron, Yuan Chen, Laura Clarke, Tony Cox, James Cuff, V. Curwen, T. Down, R. Durbin, E. Eyras, J. Gilbert, M. Hammond, T. Hubbard, A. Kasprzyk, Damian Keefe, H. Lehväslaiho, V. Iyer, Craig Melsopp, Emmanuel Mongin, Roger Pettett, Simon Potter, A. Rust, E. Schmidt, S. Searle, G. Slater, James Smith, W. Spooner, Arne Stabenau, J. Stalker, E. Stupka, A. Ureta-Vidal, Imre Vastrik, E. Birney (2003)
Ensembl 2002: accommodating comparative genomics. Nucleic Acids Research, 31(1)
Hai Yan, Weishi Yuan, V. Velculescu, B. Vogelstein, K. Kinzler (2002)
Allelic Variation in Human Gene Expression. Science, 297
F. Collins, E. Green, A. Guttmacher, M. Guyer (2003)
A vision for the future of genomics research. Nature, 422
A. Kel, E. Gößling, I. Reuter, E. Cheremushkin, O. Kel-Margoulis, E. Wingender (2003)
MATCH: A tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Research, 31(13)
P. Stenson, E. Ball, M. Mort, A. Phillips, Jacqueline Shiel, N. Thomas, S. Abeysinghe, M. Krawczak, D. Cooper (2003)
Human Gene Mutation Database (HGMD®): 2003 update. Human Mutation, 21
A. Brookes, H. Lehväslaiho, M. Siegfried, Jana Boehm, Yan Yuan, C. Sarkar, P. Bork, F. Ortigao (2000)
HGBASE: a database of SNPs and other variations in and around human genes. Nucleic Acids Research, 28(1)
D. Botstein, N. Risch (2003)
Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nature Genetics, 33 (Suppl. 1)
B. Hoogendoorn, S. Coleman, C. Guy, Kaye Smith, T. Bowen, P. Buckland, M. O’Donovan (2003)
Functional analysis of human promoter polymorphisms. Human Molecular Genetics, 12(18)
M. Ashburner, C. Ball, J. Blake, D. Botstein, Heather Butler, J. Cherry, A. Davis, K. Dolinski, S. Dwight, J. Eppig, M. Harris, D. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. Matese, J. Richardson, M. Ringwald, G. Rubin, G. Sherlock (2000)
Gene Ontology: tool for the unification of biology. Nature Genetics, 25
S. Harendza, D. Lovett, U. Panzer, Z. Lukács, P. Kühnl, R. Stahl (2003)
Linked Common Polymorphisms in the Gelatinase A Promoter Are Associated with Diminished Transcriptional Response to Estrogen and Genetic Fitness. Journal of Biological Chemistry, 278
J. Amiel, Valérie Raclin, J. Jouannic, N. Morichon, Hélène Hoffman-Radvanyi, M. Dommergues, J. Feingold, A. Munnich, J. Bonnefont (2001)
Trinucleotide repeat contraction: a pitfall in prenatal diagnosis of myotonic dystrophy. Journal of Medical Genetics, 38
L. Cartegni, Jinhua Wang, Zhengwei Zhu, Michael Zhang, A. Krainer (2003)
ESEfinder: a web resource to identify exonic splicing enhancers. Nucleic Acids Research, 31(13)
T. Hudson (2003)
Wanted: regulatory SNPs. Nature Genetics, 33
L. Prokunina, C. Castillejo-López, F. Öberg, I. Gunnarsson, L. Berg, V. Magnusson, A. Brookes, D. Tentler, H. Kristjánsdóttir, G. Gröndal, A. Bolstad, E. Svenungsson, I. Lundberg, G. Sturfelt, Andreas Jönssen, L. Truedsson, G. Lima, J. Alcocer-Varela, R. Jonsson, U. Gyllensten, J. Harley, D. Alarcon-segovia, K. Steinsson, M. Alarcón-Riquelme (2002)
A regulatory polymorphism in PDCD1 is associated with susceptibility to systemic lupus erythematosus in humans. Nature Genetics, 32
W. Strittmatter, A. Saunders, D. Schmechel, M. Pericak-Vance, J. Enghild, G. Salvesen, A. Roses (1993)
Apolipoprotein E: high-avidity binding to beta-amyloid and increased frequency of type 4 allele in late-onset familial Alzheimer disease. Proceedings of the National Academy of Sciences of the United States of America, 90
José Badano, N. Katsanis (2002)
Human genetics and disease: Beyond Mendel: an evolving view of human genetic disease transmission. Nature Reviews Genetics, 3
P. Ng, S. Henikoff (2001)
Predicting deleterious amino acid substitutions. Genome Research, 11(5)
A. Hamosh, A. Scott, J. Amberger, C. Bocchini, David Valle, V. McKusick (2004)
Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Research, 33
Mark Miller, Sudhir Kumar (2001)
Understanding human disease mutations through the use of interspecific genetic variation. Human Molecular Genetics, 10(21)
D. Chasman, R. Adams (2001)
Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assessment of amino acid variation. Journal of Molecular Biology, 307(2)
S. Sunyaev, V. Ramensky, I. Koch, Warren Lathe, A. Kondrashov, P. Bork (2001)
Prediction of deleterious human alleles. Human Molecular Genetics, 10(6)
T. Schaal, T. Maniatis (1999)
Multiple Distinct Splicing Enhancers in the Protein-Coding Sequences of a Constitutively Spliced Pre-mRNA. Molecular and Cellular Biology, 19
Carles Ferrer-Costa, M. Orozco, X. Cruz (2002)
Characterization of disease-associated single amino acid polymorphisms in terms of sequence and structure properties. Journal of Molecular Biology, 315(4)
Hong Liu, Michael Zhang, A. Krainer (1998)
Identification of functional exonic splicing enhancer motifs recognized by individual SR proteins. Genes & Development, 12(13)
M. Krawczak, J. Reiss, D. Cooper (1992)
The mutational spectrum of single base-pair substitutions in mRNA splice junctions of human genes: Causes and consequences. Human Genetics, 90
F. Al-Shahrour, R. Díaz-Uriarte, J. Dopazo (2004)
FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics, 20(4)
J. Hugot, M. Chamaillard, H. Zouali, S. Lesage, J. Cézard, J. Belaiche, S. Almér, C. Tysk, C. O'Morain, M. Gassull, V. Binder, Y. Finkel, A. Cortot, R. Modigliani, P. Laurent-Puig, C. Gower-Rousseau, J. Macry, J. Colombel, M. Sahbatou, G. Thomas (2001)
Association of NOD2 leucine-rich repeat variants with susceptibility to Crohn's disease. Nature, 411
Hangil Chang, Toshiro Fujita (2001)
PicSNP: a browsable catalog of nonsynonymous single nucleotide polymorphisms in the human genome. Biochemical and Biophysical Research Communications, 287(1)
E. Wingender, Xin Chen, R. Hehl, H. Karas, I. Liebich, V. Matys, T. Meinhardt, M. Prüß, I. Reuter, F. Schacherer (2000)
TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Research, 28(1)
N. Risch (2000)
Searching for genetic determinants in the new millennium. Nature, 405
L. Cartegni, S. Chew, A. Krainer (2002)
Listening to silence and understanding nonsense: exonic mutations that affect splicing. Nature Reviews Genetics, 3
Jinghui Zhang, William Rowe, J. Struewing, K. Buetow (2002)
HapScope: a software system for automated and visual analysis of functionally annotated haplotypes. Nucleic Acids Research, 30(23)
R. Guérois, J. Nielsen, L. Serrano (2002)
Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. Journal of Molecular Biology, 320(2)
Nucleic Acids Research, 2004, Vol. 32, Web Server issue W242–W248
DOI: 10.1093/nar/gkh438

PupaSNP Finder: a web tool for finding SNPs with putative effect at transcriptional level

Lucía Conde, Juan M. Vaquerizas, Javier Santoyo, Fátima Al-Shahrour, Sergio Ruiz-Llorente¹, Mercedes Robledo¹ and Joaquín Dopazo*

Bioinformatics Unit and Hereditary Endocrine Cancer Group, Centro Nacional de Investigaciones Oncológicas (CNIO), Madrid, Spain

Received February 3, 2004; Revised and Accepted April 15, 2004

*To whom correspondence should be addressed. Tel: +34 912246919; Fax: +34 912246972; Email: [email protected]

The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated.

© 2004, the authors. Nucleic Acids Research, Vol. 32, Web Server issue © Oxford University Press 2004; all rights reserved.

ABSTRACT

We have developed a web tool, PupaSNP Finder (PupaSNP for short), for high-throughput searching for single nucleotide polymorphisms (SNPs) with potential phenotypic effect. PupaSNP takes as its input lists of genes (or generates them from chromosomal coordinates) and retrieves SNPs that could affect the conserved regions that the cellular machinery uses for the correct processing of genes (intron/exon boundaries or exonic splicing enhancers), predicted transcription factor binding sites (TFBS) and changes in amino acids in the proteins. The program uses the mapping of SNPs in the genome provided by Ensembl. Additionally, user-defined SNPs (not yet mapped in the genome) can be easily provided to the program. Also, additional functional information from Gene Ontology, OMIM and homologies in other model organisms is provided. In contrast to other programs already available, which focus only on SNPs with possible effect in the protein, PupaSNP includes SNPs with possible transcriptional effect. PupaSNP will be of significant help in studies of multifactorial disorders, where the use of functional SNPs will increase the sensitivity of identification of the genes responsible for the disease. The PupaSNP web interface is accessible through http://pupasnp.bioinfo.cnio.es.

INTRODUCTION

Single nucleotide polymorphisms (SNPs) are the simplest and most frequent type of DNA sequence variation among individuals and they represent one of the most powerful tools for the analysis of genomes (1). Owing to their widespread distribution, SNPs are particularly valuable as genetic markers in the search for disease susceptibility genes, drug response-determining genes, and so on. In the past decades, linkage analysis has been very successful in the identification of genes responsible for mendelian diseases. Nevertheless, direct application of linkage analysis to the case of complex diseases, in which several genes with weaker genotype–phenotype correlations are involved, has resulted in more modest success (2). Now, it is believed that improved genotyping methods in combination with the proper design strategies could bring the genetics of complex diseases to a point of success comparable to where mendelian genetics now firmly resides (3).

There are examples documented in which alleles of more than one gene contribute to the same disease. It is generally believed that multigenic diseases reflect disruptions in the proteins that participate in a protein complex or a pathway (4). Typically, SNPs have been used as markers; that is, the real determinant of the disease was not the SNP itself but some other mutation in linkage disequilibrium with it. The use of functional SNPs could be an important factor for increasing significantly the sensitivity of association tests. In fact, several complex genetic disorders such as Alzheimer's disease (5) and Crohn's disease (6) have been associated with functional SNPs, lending credence to strategies giving priority to candidate markers based on predictable function. The latest build of NCBI's dbSNP (http://www.ncbi.nlm.nih.gov/SNP/snp_summary.cgi) contains 5 772 564 SNPs, with 2 356 957 of them validated. This means that human variation has been screened to an average resolution of 1 SNP for every 566 nt. There is also curated information on SNPs in HGVbase (7). These figures suggest that the possibility of finding the real determinant of a disease among the characterized SNPs can be seriously considered. In fact, dbSNP build 117 contains 24 483 SNPs located in coding regions that produce amino acid change, affecting a total of 9791 different genes. Several estimates suggest that, overall, only 20% of them could damage the protein (8).

Much attention has been focused on the possible phenotypic effects of SNPs that cause amino acid changes. The volume of available information together with the development of more sophisticated methods of protein structure prediction has led to different attempts to relate the effect of amino acid changes to structural distortions and, consequently, possible phenotypic effect. Following this, two main different approaches have been taken: on the one hand is the study of conservation of residues in homologous proteins (9) including more sophisticated approaches taking into account the phylogenetic history (10) and, on the other hand, there is the study of changes in the stability (11,12) and other properties of the protein due to changes of amino acids (8,13).

Nevertheless, there are different ways in which the functionality of a gene product can be affected without requiring an amino acid change in the protein. There is increasing evidence that many human disease genes harbour exonic or non-coding mutations that affect pre-mRNA splicing (14). Alternative splicing produced by mutations in intron/exon junctions, or in distinct binding motifs, such as exonic splicing enhancers (ESEs), to which different proteins involved in splicing bind, is the basis of different diseases. In fact, it has been estimated that 15% of point mutations that result in human genetic diseases cause RNA splicing defects (15). For example, a silent mutation in exon 14 of the APC gene is associated with exon skipping in a Familial Adenomatous Polyposis (FAP) family (16), and there are many more examples [see Table 2 in (14)].

Also, alterations in the level of expression of gene products can cause diseases. Different SNPs are associated with alterations in gene expression (17) and, in some cases, it is known that they alter some regulatory sequence motif. For example, a regulatory polymorphism in the programmed cell death 1 gene (PDCD1), which alters a binding site for the runt-related transcription factor 1 (RUNX1) located in an intronic enhancer, is associated with susceptibility to systemic lupus erythematosus in humans (18). It has also been reported that polymorphisms in the gelatinase A promoter region are associated with diminished transcriptional response to estrogen and genetic fitness (19). A recent large-scale screening over a set of 16 chromosomes found SNPs in the promoter regions of 35% of the genes, and experimental evidence suggested that around one-third of promoter variants may alter gene expression to a functionally relevant extent (20). Therefore, the inclusion of other possible causes of loss of functionality in gene products, beyond the simple estimation of the possible phenotypic effect of an amino acid change, increases considerably the number of SNPs with potential phenotypic effect to be considered for the design of experiments.

Classical statistical linkage tests need a large number of cases if the number of genes to be tested is high. It has only recently been recognized that reliable identification of genetic variants that affect gene regulation is still a challenge in genomics and is expected to play an important role in the molecular characterization of complex traits (21). Another important consideration when analysing multigenic traits is the information available on the genes. Information allows a more targeted approach, by focusing initially on genes whose functionality is related to the disease studied.

Genome surveys based on the information contained in dbSNP show that there are 361 SNPs mapped in splice sites of introns, 1 387 506 in introns and 242 842 in untranslated regions, affecting 336, 16 306 and 14 198 genes, respectively. A number of these SNPs could be disease determinants.

With the idea of extracting as much information as possible from SNPs with putative phenotypic effect, we have developed PupaSNP Finder (Putative Phenotypic Alterations caused by SNPs; PupaSNP for short). This tool retrieves all the SNPs present in a set of genes of interest that potentially affect the functionality of the gene product. This list is combined with functional information obtained from Gene Ontology (GO) annotations (22). Genes can be directly retrieved from genomic locations or, alternatively, can be taken from a list provided by the user. This corresponds to two typical problems: (i) traits mapped to a given chromosomal region or (ii) traits associated with a given class of genes (e.g. a signalling pathway). Genome coordinates of genes and SNPs are taken from the Ensembl annotation (23).

METHODS

Finding SNPs with potential phenotypic effect

PupaSNP operates with a collection of entries from dbSNP mapped to the Golden Path genome assembly, as implemented in the human section of Ensembl (http://www.ensembl.org). As previously mentioned, PupaSNP uses a list of genes and generates a report in which all the SNPs with possible phenotypic effect are listed. The genes can be selected directly by their location in a region of the genome, or just provided as a list (e.g. genes belonging to a given pathway, involved in a particular biological function). Genomic regions can be selected either by defining a range of chromosome coordinates or by directly choosing the cytoband of interest. The engine finds all the genes located within the specified region as well as their promoter regions using Ensembl APIs. In the case of a user-defined list, Ensembl is used to extract their complete intron/exon structure as well as the promoter regions.

The potential effects on the phenotype taken into account are at both transcriptional and gene product levels. These include alterations in (i) transcription factor binding sites, (ii) intron/exon border consensus sequences, (iii) ESE sequences, which are the binding sites for specific serine/arginine-rich (SR) proteins involved in the splicing machinery (24,25) and (iv) the exons that cause an amino acid change. Additionally, the GO terms (22) associated with the genes can be obtained. This is very useful in the case of looking for genes in a chromosomal region, because it can help to discard genes definitively not involved in the disease studied, based on the annotations.

Transcription factor binding sites. In the search for SNPs with potential phenotypic effect, 10 000 bp upstream of the genes, belonging to the promoter region of each gene in the list, are scanned for the presence of possible transcription factor binding sites (TFBSs). The program Match™ (26), version 1.10, from the Transfac database (27), version professional 7.3, was used for this purpose. SNPs located within these motifs are considered to have a putative phenotypic effect in the expression of the gene. The options used for the program Match™ were (i) group of matrices: vertebrates, (ii) use high quality matrices only and (iii) cutoff selection for matrix group: to minimize false positives. This cutoff was obtained by exploring the third exon sequences with the weight matrices and was chosen to reduce the number of random putative sites found by the program (26). Although the scan is done in a region 10 000 bp upstream from the start of the gene, the number of bases to be taken into account in the study is customizable. Obviously, the closer to the start of the gene, the more likely the binding site is to be authentic.

Intron–exon boundaries. Ensembl APIs were used to extract the intron/exon organization of the genes and the corresponding sequences. The two conserved nucleotides at each side of the splicing point, which constitute the splicing signal (14), were then located and all the SNPs altering these signals are recorded.

Exonic splicing enhancers. Mutations that deactivate or activate exonic splicing enhancer sequences may result in exon skipping, malformation, and so on. ESEs also appear to be important in exons that normally undergo alternative splicing. Different classes of ESE consensus motifs have been described, but they are not always easily identified. We have developed a script that scans exon sequences to identify putative ESEs responsive to the human SR proteins SF2/ASF, SC35, SRp40 and SRp55, by using the weight matrices available for them (28). A score is obtained related to the likelihood that the site found is a real ESE. Only ESE sites with scores over the threshold [see (28) and http://exon.cshl.org/ESE/ESEmatrix.html for details] are taken into account in the analysis. Threshold values, above which a score for a given sequence is considered to be significant, are set as the median of the highest score for each sequence in a set of 30 randomly chosen 20 nt sequences (from the starting pool used for functional assays for ESE identification; see http://exon.cshl.org/ESE/ESEmatrix.html). If an SNP disrupts one of these sequences, the new score, corresponding to the mutated sequence, is also calculated. Strong differences between the two score values suggest more drastic effects caused by the SNP.

Changes at amino acid level and functional implications. SNPs that result in a change of amino acid are likely to cause some phenotypic effect and, consequently, are all listed. Since the main purpose of the tool is to cover possible transcriptional effects of the SNPs and there are a number of tools already available for the prediction of phenotypic effects due to mutations in amino acids (see Introduction), PupaSNP only lists them. To help in the identification of possible effects we label SNPs that disrupt any functional motif as listed in Interpro (29), a resource that compiles information on protein families, domains and functional sites. The coordinates of the Interpro motifs within the exons of the genes are extracted from Ensembl and cross-referenced with the SNP coordinates.

Additional functional information. Since PupaSNP Finder works with lists of genes in order to select the best SNP candidates for further use in association analysis, it is very helpful to have functional annotations of the genes. This allows the assignment of priorities based also on the information available on the genes. Information is obtained from (i) Gene Ontology annotations, obtained through the FatiGO engine (30) (available at http://fatigo.bioinfo.cnio.es), (ii) OMIM (Online Mendelian Inheritance in Man), which constitutes a comprehensive, authoritative and timely knowledge base of human genes and genetic disorders (31) and (iii) homologies to other organisms, obtained directly from Ensembl.

Gene Ontology is a hierarchical structure (a directed acyclic graph) in which terms describing three fundamental ontologies (molecular function, biological process and cellular component) have descendants with more detailed descriptions. Thus, descending the hierarchy of GO implies moving towards terms with more detailed descriptions of the ontologies, but, at the same time, there are fewer genes with annotations at such detail. FatiGO works by climbing up the hierarchy to a selected parent level (30) to optimize the number of genes with annotation and the detail of the annotation. Thus, the identification of common parent functions or processes is easier. In this way, the consideration of the SNPs in a functional context can help to understand the potential biological implications of the SNPs and genes studied.

RESULTS

SNPs with possible phenotypic effect

We analysed a total of 24 037 human genes corresponding to the annotations in Ensembl build 34 (version 18.34.1), which contains the mapping of dbSNP 117. By scanning with the Match™ program the 10 000 bp upstream promoter regions of the genes, 2 587 478 transcription factor binding sites, corresponding to 330 different Transfac weight matrices (27), were found. After mapping the SNPs in the promoter regions, 71 444 TFBSs were found to be disrupted by a total of 57 412 SNPs (some SNPs affect more than one TFBS at the same time). A total of 19 010 genes presented at least 1 predicted TFBS disrupted by a SNP, which constitutes a considerable proportion of the total number of genes. The coverage in terms of both SNPs and TFBS predictions was good: only for 54 genes was no single SNP found in the 10 000 bp 5′-upstream region, and only for 2 genes could no predicted TFBS be found (ENSG00000116119, or KV2A_HUMAN, which is the IG KAPPA CHAIN V-II REGION CUM, and ENSG00000174994, or AK057375, which seems to be a DNA binding protein). In a number of cases, SNPs affect overlapping TFBSs, which could have a stronger effect still in the phenotype. There are even 2 SNPs that simultaneously affect 15 TFBSs.

The four conserved bases that define intron–exon boundaries were mutated by 844 SNPs, affecting a total of 598 genes.

Over eight million ESE motifs were found, covering all the genes studied. A total of 138 746 SNPs were found to disrupt ESE sequences. These SNPs affect a total of 17 312 genes.

These results suggest that, in the search for SNPs with potential phenotypic effects, regulatory SNPs or SNPs affecting splicing should not be neglected.

The web interface

Input data. PupaSNP has been designed for high-throughput screening of functional SNPs. Thus, the input consists of a list of genes. The list can be directly provided as a collection of gene identifiers (Ensembl IDs, or external IDs, which include GenBank, Swissprot/TrEMBL and other gene IDs supported by Ensembl) or can be specified by means of a chromosomal location (cytobands or chromosomal coordinates). In the latter case, PupaSNP extracts all the genes contained in the specified location. Ensembl coordinates are used to extract the genes. Only Ensembl annotated genes, but not predictions, are extracted.

User-defined SNPs. Alternatively, the user can input SNPs not in the database in a very straightforward manner and take advantage of the tools for predicting their potential phenotypic effect. A text file containing the descriptions of the SNPs must be generated. Each line describes one unique SNP with the following tab-delimited data: SNP name, gene (Ensembl ID or external ID), position with respect to the start of the translation and alleles, e.g.

MySNP01 ENSG00000000003 -1830 A/G
MySNP02 ENSG00000157873 421 C/G

This describes two SNPs: the first in the gene ENSG00000000003 (tetraspanin 6, or TSPAN6), 1830 bp upstream of the transcription start point, with polymorphisms consisting of a change of an A for a G; and the second in gene ENSG00000157873 (tumor necrosis factor receptor-like 2, TNFRSF14), 421 bp within the transcribed region, which corresponds to the first exon of the gene.

The web interface. A web interface to PupaSNP is available at http://pupas.bioinfo.cnio.es/. Lists of genes can be defined by chromosome position, which can be specified in terms of cytoband units or in absolute chromosomal position (as mapped in the corresponding Ensembl assembly). The upstream region setting refers to the number of bases upstream in which TFBSs will be searched for (with an upper limit of 10 000 bp). Also, lists of genes can be uploaded or just pasted into the box. PupaSNP finds all the SNPs mapping to locations that might cause a loss of functionality in the genes. Functional information for the genes can also be obtained from OMIM and from Gene Ontology. Information on homologous genes can also be retrieved. Finally, SNPs do not need to be annotated in the genome to be included in the query tool. The user can specify a list of SNPs using a gene as reference. In this way the use of absolute coordinates, which can easily change between assembly versions, is avoided in favour of the use of coordinates relative to genes, which tend to be more stable. Results include SNPs in the promoter region of the genes, SNPs located at intron boundaries, SNPs located at exonic splicing enhancers and coding SNPs located at Interpro domains. Figure 1 shows part of the results provided by the program for the SNPs with possible phenotypic effect on genes in the p36.33 cytoband of chromosome 1. Figure 1C is especially interesting because it shows how the scores obtained by the motif scanning method can be used to assess the possible impact of the polymorphism on the recognition of the ESE motif by the cellular machinery.

Both the SNPs and the genes found are linked to the Ensembl Genome Browser.

Figure 1. A selection of results from PupaSNP. (A) List of genes and the corresponding transcripts with the SNPs mapping to the different regions, which include coding and 5′- and 3′-untranslated regions. For coding SNPs, the position within the transcript and the change produced (if any) is reported. (B) SNPs located in the promoter regions (in the example, a limit of 4000 bp was chosen). Disruptions of predicted TFBSs are listed. The validation status of the SNPs ('no-info', 'by-submitter', 'by-frequency', 'by-cluster'; see dbSNP web page) is also provided. (C) SNPs located at exonic splice enhancers. The scores make reference to the closeness of the site to the motif. If the polymorphism gives a site with a worse score, this would, generally speaking, probably imply worse recognition of the site by the cellular machinery and, consequently, a putative alteration in the normal splicing process. When the cursor is over the gene name, additional information is displayed.

Experimental validation

The validation status of the SNPs is, in some cases, a much more important factor for their selection than their possible functional role. Such information is scarce: 2 359 534 out of 5 798 183 SNPs in dbSNP build 118 have been validated, which constitutes 40%. However, only 160 466 have estimates of population frequencies and only 94 867 have a phenotype associated. To obtain a sense of the reliability of the SNPs annotated with 'no-info', a set of SNPs was sought for a list of candidate modifier genes related to a phenotype exhibited by MEN2 (Multiple endocrine neoplasia, type IIA) patients (OMIM, #171400), all of them RET mutation carriers. MEN2 is an autosomal dominant syndrome of multiple endocrine neoplasms, with variable clinical expression even between members of the same family. This fact cannot be explained only by a mutation in a major susceptibility gene, but suggests a role for genetic modifiers, which may also work through quantitative effect.

In most cases, it was necessary to validate the putative SNPs identified by PupaSNP because there was no information about validation status. To validate SNPs and estimate their allele frequency, 48 non-related individuals from the Spanish population were used. The specific primers used to amplify the fragments of interest by PCR (polymerase chain reaction) were designed using the OLIGO 4.1 program. When possible, the primers were selected and designed to amplify a fragment (200–500 bp) that allowed us to investigate several SNPs at the same time. As a denaturing high-performance liquid chromatograph (dHPLC) system (WAVE, Transgenomics Limited, Crewe, UK) was used for the initial SNP screening, the fragments of interest had a homogeneous GC content across different domains from the DNA fragment to obtain a consistent melting profile. The Navigator software was used for data handling and optimization of the dHPLC system. After normalization, each PCR product that exhibited a change in the chromatogram profile was characterized by sequence analysis. These PCR products were purified using an E.Z.N.A. Cycle-Pure Kit (Omega Bio-tek, USA) according to the manufacturer's instructions, and sequenced using an automatic sequencer ABI PRISM™ 3700 (Applied Biosystems, Perkin Elmer, USA). The reaction was carried out in 4 µl of a Big Dye terminator cycle sequencing Kit (Perkin Elmer, USA), 10 pmol of the sense/antisense primer, 5% DMSO and 6–12 ng of amplified DNA. Although the results obtained here do not claim to be capable of general extrapolation to the entire database, we have found that 24 out of 28 SNPs assayed (about 86%) proved to be authentic and polymorphic in the Spanish population, which constitutes a good rate.

DISCUSSION

Typically, SNPs have been used as markers to search for the real determinant of a disease in linkage disequilibrium with it. As previously mentioned, the use of functional SNPs, which may be the real disease determinants, could be an important factor in increasing the sensitivity of association tests.

Despite the obvious importance that alterations in the regulation, expression level or splicing of genes can have for the phenotype, these have long been ignored in the most common approaches to finding functional SNPs, which have instead focused more on the possible effect of polymorphisms causing amino acid changes. Apart from the databases mentioned above (dbSNP and HGVbase), there are a number of resources available over the net collecting information on phenotypes associated with SNPs, such as The Human Gene Mutation Database (http://www.hgmd.org) at the University of Wales, which classifies SNPs according to the lesion they cause (missense substitutions, splice variants, and so on) (32) and PicSNP, a catalogue of non-synonymous SNPs obtained from the human genome assembly (33). However, these are mainly specialized catalogues collecting information on SNPs rather than tools for their selection.

PupaSNP constitutes a tool for selecting SNPs with putative phenotypic effects designed for high-throughput experiments. It deals with lists of genes, instead of focusing on individual genes. In addition, more information on different possible motifs with regulatory function has been included. For example, SNPs in ESE had never previously been included in any catalogue.

Multigenic diseases are generally associated with disruptions in proteins that participate in a protein complex or a pathway (4). The inclusion in PupaSNP of information regarding the participation of genes in signalling cascades or in pathways or in protein complexes will be considered in the near future. Databases containing protein interaction data, such as DIP and BIND (see http://www.hgmp.mrc.ac.uk/GenomeWeb/prot-interaction.html), can be an important source of information to be considered in the search for SNPs affecting multigenic traits.

Despite the fact that PupaSNP is more focused on SNPs with possible effects at transcriptional level, the inclusion of an algorithm for improving the predictions of the effect of SNPs in the proteins, such as FoldX (12), would provide, within the same framework, both types of result.

Minimum SNP set selection allows the user to optimize the number of SNPs required to represent haplotype diversity, thus reducing the cost of genotyping by assaying the minimum number of SNPs required. The inclusion of information on linkage disequilibrium or on haplotype blocks can assist in a more efficient selection of SNPs. Some programs, such as HapScope (34), include information on haplotypes and use them to select minimum subsets of SNPs. Another important issue is the reliability of the SNPs. As previously mentioned, only 40% of the SNPs in dbSNP have been validated, and only for 5% are population frequencies available. This means that most of the SNPs found in any kind of selection will lack information on their possible presence in the population of interest as a manageable polymorphism. Even though our results suggest a high rate of authenticity, even for the SNPs labeled as 'no-info', they must be treated carefully and cannot be directly extrapolated to the entire database. As population frequencies are included in the database, these data could be of interest for use as part of the selection process of SNPs.

PupaSNP will be the tool used in the first step of the pipeline for the study of polymorphisms at the Spanish National Genotyping Centre (CeGen). For this reason it has been deve-

12. Guerois,R., Nielsen,J.E. and Serrano,L. (2002) Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. J. Mol. Biol., 320, 369–387.
13. Ferrer-Costa,C., Orozco,M. and de la Cruz,X. (2002) Characterization of disease-associated single amino acid polymorphisms in terms of sequence and structure properties. J. Mol. Biol., 315, 771–786.
14. Cartegni,L., Chew,S.L. and Krainer,A.R. (2002) Listening to silence and understanding nonsense: exonic mutations that affect splicing. Nature Rev.
Genet., 3, 285–298. loped to cope with high-throughput experimental designs. 15. Krawczak,M., Reiss,J. and Cooper,D.N. (1992) The mutational spectrum PupaSNP takes as input lists of genes (or generates them of single base-pair substitutions in mRNA splice junctions of human from chromosomal coordinates) and provides results which genes: causes and consequences. Hum. Genet., 90, 41–54. 16. Montera,M., Piaggio,F., Marchese,C., Gismondi,V., Stella,A., Resta,N., integrate all the information available as well as obtained Varesco,L., Guanti,G. and Mareni,C. (2001) A silent mutation in by means of predictions of SNPs with possible functional exon 14 of the APC gene is associated with exon skipping in a FAP family. consequences. J. Med. Genet., 38, 863–867. 17. Yan,H., Yuan, W., Velculescu,V.E., Vogelstein,B. and Kinzler,K.W. (2002) Allelic variation in human gene expression. Science, 297, 1143. ACKNOWLEDGEMENTS 18. Prokunina,L., Castillejo-Lopez,C., Oberg,F., Gunnarsson,I., Berg,L., Magnusson,V., Brookes,A.J., Tentler,D., Kristjansdottir,H., Grondal,G., L.C. and this work are supported by grant PI020919 from the Bolstad,A.I., Svenungsson,E., Lundberg,I., Sturfelt,G., Jonssen,A., Fondo de Investigaciones Sanitarias. F.A.-S. is supported by Truedsson,L., Lima,G., Alcocer-Varela,J., Jonsson,R., Gyllensten,U.B., ´ Harley,J.B., Alarcon-Segovia,D., Steinsson,K. and grant BIO2001-0068 from Ministerio de Ciencia y Tecnologıa. Alarcon-Riquelme,M.E. (2002) A regulatory polymorphism in This work is also partly supported by a grant from Fundacio La PDCD1 is associated with susceptibility to systemic lupus Caixa and by the Spanish National Genotyping Centre erythematosus in humans. Nature Genet., 32, 666–669. (CeGen), funded by Genoma Espanna, which is using this pro- 19. Harendza,S., Lovett,D.H., Panzer,U., Lukacs,Z., Kuhnl,P. and Stahl,R.A. gram for high-throughput SNP selection. 
(2003) Linked common polymorphisms in the gelatinase a promoter are associated with diminished transcriptional response to estrogen and genetic fitness. J. Biol. Chem., 278, 20490–20499. 20. Hoogendoorn,B., Coleman,S.L., Guy,C.A., Smith,K., Bowen,T., Buckland,P.R. and O’Donovan,M.C. (2003) Functional analysis of REFERENCES human promoter polymorphisms. Hum. Mol. Genet., 12, 1. Collins,F.S., Green,E.D., Guttmacher,A.E. and Guyer,M.S. (2003) 2249–2254. A vision for the future of genomics research. Nature, 422, 21. Hudson,T.J. (2003) Wanted: regulatory SNPs. Nature Genet., 33, 835–847. 439–440. 2. Risch,N.J. (2000) Searching for genetic determinants in the new 22. Ashburner,M., Ball,C.A., Blake,J.A., Botstein,D., Butler,H., millennium. Nature, 405, 847–856. Cherry,J.M., Davis,A.P., Dolinski,K., Dwight,S.S., Eppig,J.T., 3. Botstein,D. and Risch,N. (2003) Discovering genotypes underlying Harris,M.A., Hill,D.P., Issel-Tarver,L., Kasarskis,A., Lewis,S., human phenotypes: past successes for mendelian disease, Matese,J.C.,Richardson,J.E.,Ringwald,M.,Rubin,G.M.andSherlock,G. future approaches for complex disease. Nature Genet., (2000) Gene ontology: tool for the unification of biology. The Gene 33, 228–237. Ontology Consortium. Nature Genet., 25, 25–29. 4. Badano,J.L. and Katsanis,N. (2002) Human genetics and disease: beyond 23. Clamp,M., Andrews,D., Barker,D., Bevan,P., Cameron,G., Chen,Y., Mendel: an evolving view of human genetic disease transmission. Clark,L., Cox,T., Cuff,J., Curwen,V., Down,T., Durbin,R., Eyras,E., Nature Rev. Genet., 3, 779–789. Gilbert,J., Hammond,M., Hubbard,T., Kasprzyk,A., Keefe,D., 5. Strittmatter,W.J., Saunders,A.M., Schmechel,D., Pericak-Vance,M., Lehvaslaiho,H., Iyer,V., Melsopp,C., Mongin,E., Pettett,R., Potter,S., Enghild,J., Salvesen,G.S. and Roses,A.D. 
(1993) Apolipoprotein E: Rust,A., Schmidt,E., Searle,S., Slater,G., Smith,J., Spooner,W., high-avidity binding to beta-amyloid and increased frequency Stabenau,A., Stalker,J., Stupka,E., Ureta-Vidal,A., Vastrik,I. and of type 4 allele in late-onset familial Alzheimer’s disease. Birney,E. (2003) Ensembl 2002: accommodating comparative genomics. Proc. Natl Acad. Sci. USA, 90, 1977–1981. Nucleic Acids Res., 31, 38–42. 6. Hugot,J.P., Chamaillard,M., Zouali,H., Lesage,S., Cezard,J.P., 24. Liu,H.X., Zhang,M. and Krainer, A.R. (1998) Identification of functional Belaiche,J., Almer,S., Tysk,C., O’Morain,C.A., Gassull,M., Binder,V., exonic splicing enhancer motifs recognized by individual SR proteins. Finkel,Y., Cortot,A., Modigliani,R., Laurent-Puig,P., Genes Dev., 12, 1998–2012. Gower-Rousseau,C., Macry,J., Colombel,J.F., Sahbatou,M. and 25. Schaal,T.D. and Maniatis,T. (1999) Multiple distinct splicing enhancers Thomas,G. (2001) Association of NOD2 leucine-rich repeat in the protein-coding sequences of a constitutively spliced pre-mRNA. variants with susceptibility to Crohn’s disease. Nature, 411, Mol. Cell Biol., 19, 261–273. 599–603. 26. Kel,A.E., Goßling,E., Reuter,I., Cheremushkin,E., Kel-Margoulis,O.V. 7. Brookes,A.J., Lehvaslaiho,H., Siegfried,M., Boehm,J.G., Yuan,Y.P., and Wingender,E. (2003) MATCHTM: a tool for searching Sarkar,C.M., Bork,P. and Ortigao,F. (2000) HGBASE: a database transcription factor binding sites in DNA sequences Nucleic Acids Res., 31, 3576–3579. of SNPs and other variations in and around human genes. Nucleic Acids Res., 28, 356–360. 27. Wingender,E., Chen,X., Hehl,R., Karas,H., Liebich,I., Matys,V., 8. Sunyaev,S., Ramensky,V., Koch,I., Lathe,W., Kondrashov,A.S. and Meinhardt,T., Pr€uuß,M., Reuter,I. and Schacherer,F. (2000) Bork,P. (2000) Prediction of deleterious human alleles. Hum. Mol. TRANSFAC: an integrated system for gene expression regulation. Genet., 10, 591–597. Nucleic Acids Res., 28, 316–319. 9. Ng,P.C. and Henikoff,S. 
(2001) Predicting deleterious amino acid 28. Cartegni,L., Wang,J., Zhu,Z., Zhang,M.Q. and Krainer,A.R. (2003) substitutions. Genome Res., 11, 863–874. ESEfinder: a web resource to identify exonic splicing enhancers. Nucleic 10. Miller,M.P. and Kumar,S. (2001) Understanding human disease Acids Res., 31, 3568–3571. mutations through the use of interspecific genetic variation. Hum. Mol. 29. Mulder,N.J., Apweiler,R., Attwood,T.K., Bairoch,A., Barrell,D., Genet., 10, 2319–2328. Bateman,A., Binns,D., Biswas,M., Bradley,P., Bork,P., Bucher,P., 11. Chasman,D. and Adams,R.M. (2001) Predicting functional Copley,R.R., Courcelle,E., Das,U., Durbin,R., Falquet,L., consequences of non-synonymous single nucleotide polymorphisms: Fleischmann,W., Griffiths-Jones,S., Haft,D., Harte,N., Hulo,N., structure-based assessment of amino acid variation. J. Mol. Biol., Kahn,D., Kanapin,A., Krestyaninova,M., Lopez,R., Letunic,I., 307, 683–706. Lonsdale,D., Silventoinen,V., Orchard,S.E., Pagni,M., Peyruc,D., W248 Nucleic Acids Research, 2004, Vol. 32, Web Server issue Ponting,C.P., Selengut,J.D., Servant,F., Sigrist,C.J., Vaughan,R. and 32. Stenson,P.D., Ball,E.V., Mort,M., Phillips,A.D., Shiel,J.A., Zdobnov,E.M. (2003) The InterPro Database brings Thomas,N.S., Abeysinghe,S., Krawczak,M. and Cooper,D.N. (2003) increased coverage and new features Nucleic Acids Res., 31, Human Gene Mutation Database (HGMD): 2003 update. Hum. Mutat., 315–318. 21, 577–581. 30. Al-Shahrour D´ıaz-Uriarte,R. and Dopazo,J. (2004) FatiGO: a web tool for 33. Chang,H. and Fujita,T. (2001) PicSNP: a browsable catalog of finding significant associations of gene ontology terms with groups nonsynonymous single nucleotide polymorphisms in the human genome. of genes. Bioinformatics, 20, 578–580. Biochem. Biophys. Res. Commun., 287, 288–291. 31. Hamosh,A., Scott,A.F., Amberger,J., Bocchini,C., Valle,D. and 34. Zhang,J., Rowe,W.L., Struewing,J.P. and Buetow,K.H. (2002) McKusick,V.A. 
(2002) Online Mendelian inheritance in man (OMIM), HapScope: a software system for automated and visual analysis a knowledgebase of human genes and genetic disorders Nucleic of functionally annotated haplotypes Nucleic Acids Res., Acids. Res., 30, 52–55. 30, 5213–5221.
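
The text notes that user-specified SNPs are located with coordinates relative to a gene rather than absolute coordinates, because the latter can change between assembly versions. The following is a minimal sketch of that idea, not PupaSNP's actual implementation. The Gene record, its fields and both function names are hypothetical, and the convention that offset 0 is the first transcribed base (negative offsets upstream) is an assumption made only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record; the fields are illustrative only."""
    name: str
    chrom: str
    start: int   # 1-based absolute position of the leftmost base
    end: int     # 1-based absolute position of the rightmost base
    strand: int  # +1 for the forward strand, -1 for the reverse strand


def relative_to_absolute(gene: Gene, offset: int) -> int:
    """Map a gene-relative offset to an absolute chromosome position.

    Assumed convention: offset 0 is the first transcribed base, negative
    offsets are upstream and positive offsets downstream, both measured
    in the direction of transcription.
    """
    if gene.strand == 1:
        return gene.start + offset
    if gene.strand == -1:
        return gene.end - offset
    raise ValueError("strand must be +1 or -1")


def absolute_to_relative(gene: Gene, position: int) -> int:
    """Inverse of relative_to_absolute under the same assumed convention."""
    if gene.strand == 1:
        return position - gene.start
    if gene.strand == -1:
        return gene.end - position
    raise ValueError("strand must be +1 or -1")
```

Under these assumptions, a SNP recorded 50 bp upstream of a gene keeps the offset -50 even if a new assembly shifts the gene; only the Gene record needs updating, after which relative_to_absolute gives the new absolute position.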
Nucleic Acids Research – Oxford University Press
Published: Jul 1, 2004