Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

Ensembl 2013

Ensembl 2013 D48–D55 Nucleic Acids Research, 2013, Vol. 41, Database issue Published online 30 November 2012 doi:10.1093/nar/gks1236 1,2, 1 2 2 1 Paul Flicek *, Ikhlak Ahmed , M. Ridwan Amode , Daniel Barrell , Kathryn Beal , 2 1 2 2 2 Simon Brent , Denise Carvalho-Silva , Peter Clapham , Guy Coates , Susan Fairley , 1 1 2 1 2 Stephen Fitzgerald , Laurent Gil , Carlos Garcı´a-Giro ´ n , Leo Gordon , Thibaut Hourlier , 1 1 2 1 Sarah Hunt , Thomas Juettemann , Andreas K. Ka ¨ ha ¨ ri , Stephen Keenan , 1 1 1 1 Monika Komorowska , Eugene Kulesha , Ian Longden , Thomas Maurel , 1 1 2 1 1 William M. McLaren , Matthieu Muffato , Rishi Nag , Bert Overduin , Miguel Pignatelli , 2 1 2 1 Bethan Pritchard , Emily Pritchard , Harpreet Singh Riat , Graham R. S. Ritchie , 1 1 2 1 1 Magali Ruffier , Michael Schuster , Daniel Sheppard , Daniel Sobral , Kieron Taylor , 1 2 2 1 Anja Thormann , Stephen Trevanion , Simon White , Steven P. Wilder , 2 1 1 1 2 Bronwen L. Aken , Ewan Birney , Fiona Cunningham , Ian Dunham , Jennifer Harrow , 1 2 1 1 2 Javier Herrero , Tim J. P. Hubbard , Nathan Johnson , Rhoda Kinsella , Anne Parker , 1 1 2 2 Giulietta Spudich , Andy Yates , Amonida Zadissa and Stephen M. J. Searle European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton Cambridge CB10 1SD, UK and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK Received October 11, 2012; Revised October 31, 2012; Accepted November 1, 2012 web interface to advanced bioinformatics programmers ABSTRACT looking to do complex analysis or build new tools that The Ensembl project (http://www.ensembl.org) leverage the Ensembl infrastructure. As such, we provide provides genome information for sequenced chord- all of the Ensembl source code freely under an ate genomes with a particular focus on human, Apache-style license and release all of our data without mouse, zebrafish and rat. Our resources include restriction. Ensembl data are distributed from our genome evidenced-based gene sets for all supported spe- browser at http://www.ensembl.org as well as via cies; large-scale whole genome multiple species BioMart, the Ensembl Application Programming Interface (API), direct MySQL access, Amazon Web alignments across vertebrates and clade-specific Services Public data sets (http://www.ensembl.org/info/ alignments for eutherian mammals, primates, birds data/amazon_aws.html) and via full data download. and fish; variation data resources for 17 species and Ensembl aims to be a hub of genome information by regulation annotations based on ENCODE and other linking identifiers and information between external bio- data sets. Ensembl data are accessible through the logical resources and data within Ensembl or importing genome browser at http://www.ensembl.org and essential information from other resources so that it can through other tools and programmatic interfaces. be found within Ensembl and linked back to the original resource as necessary. For example, we provide up to date external database references to gene names from the INTRODUCTION HUGO Gene Nomenclature Committee (HGNC) (1), Ensembl (http://www.ensembl.org) collects, creates, or- the Universal Protein Resource (UniProt) (2), Orphanet ganizes and distributes data resources in support of portal for rare diseases and orphan drugs (3), the Online research into the genetics and genomics of chordates. Mendelian Inheritance in Man (OMIM) database (4), the We currently support 70 species with a focus on human RefSeq collection of Reference Sequences from NCBI (5), in additional to agricultural animals and major vertebrate the UCSC Genome Browser (6), the Protein Data Bank model organisms such as mouse, zebrafish and rat. We (PDB) repository for biological macromolecular support a full range of researchers in genomics from structures (7) and many other resources. bench biologists interested in looking up specific details We participate in or work closely with a number of about their genes or loci of interest using a graphical large-scale international projects including the 1000 *To whom correspondence should be addressed. Tel: +44 1223 492581; Fax: +44 1223 494494; Email: fl[email protected] The Author(s) 2012. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial reuse, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]. Nucleic Acids Research, 2013, Vol. 41, Database issue D49 Genomes Project (8), ENCODE (9), the International (Melopsittacus undulates), Chinese hamster CHO cell line Cancer Genome Consortium (ICGC) (10) and the (Cricetulus griseus), painted turtle (Chrysemys picta bellii), BLUEPRINT epigenome mapping project (11). spotted gar (Lepisosteus oculatus), collared flycatcher Participation in these efforts helps ensure that we (Ficedula albicollis) and squirrel monkey (Saimiri produce timely and valuable resources through direct sci- boliviensis boliviensis). Ensembl Pre! sites provide entific engagement with the communities that we are BLAST and genome visualization, but do not provide a trying to serve. In addition, we actively develop and complete gene build. For specific genomes, we also provide key pieces of large-scale bioinformatics infrastruc- provide downloadable data on the preview site. ture including the eHive workflow management system for We update the human gene set for every Ensembl genomic analysis (12). release via a merge of the Ensembl evidence-based auto- Full incorporation of the data types resulting from the matic annotation and Havana manual annotation (14) to myriad of experimental assays now leveraging next gener- produce an updated GENCODE gene set (9,15). This set ation sequencing technology remains an important area of also includes all current human Consensus Coding development for the project. During the past year, we have Sequence (CCDS) gene models (16). Manual annotation made considerable progress in a number of ways including from Havana is also incorporated into our gene sets on a greater incorporation of RNA-seq data into our gene alternate releases for mouse and zebrafish. In addition, pig annotations and ChIP-seq data into our regulatory anno- now includes manual annotation from Havana on selected tations. In general, we believe that the most useful re- regions of the genome. sources provide integrated summary information that The human genome assembly is updated regularly by transforms the raw sequencing data into biological know- the Genome Reference Consortium (GRC) to include al- ledge that can provide a foundation for further biological ternate sequences in the form of ‘fix’ and ‘novel’ assembly research. Thus, we believe that the display of the called patches (17), and we continue to include these additional variants from the 1000 Genomes Project or regulatory alternate sequences and annotate them with genes and region annotations supported by specific histone modifi- other features as appropriate. Ensembl release 69 cation or transcription factor (TF) binding sites are more (October 2012) included GRCh37.p8 (i.e. the eighth useful as resources for the community than a display of patch release of the GRCh37 assembly). The mouse the raw aligned sequence reads. However, Ensembl does genome annotation, which also incorporates all current support the upload and visualization of read alignment mouse CCDS models, was updated for Ensembl release data (e.g. alignment files in BAM format) and provides 68 (July 2012) to reflect the new GRCm38 assembly. signal files for our ChIP-seq and alignment files for Other species previously available on our website also RNA-seq data within the browser for those users saw updates in the past year including new primary needing direct access to the supporting data. Indeed, assemblies and gene sets for chimpanzee, dog, pig, Ensembl’s API development this year included increasing ground squirrel, bushbaby and Ciona intestinalis. The support for file-based data access to enable integration of gene sets for orang-utan, opossum and platypus were very large BAM and other file-based data sets into the also updated using RNA-seq data. browser. The whole genome multiple and pairwise alignments This report highlights the new data we have released have been re-run in conjunction with the incorporation and the new mechanisms of data access that we have of new or updated genomes. In addition to cross-species deployed during the past year since our previous report alignments, we now provide self-alignments for the human (13). We describe how these new features extend the genome and also use the Ensembl comparative genomics existing capabilities of the project, which will be explained infrastructure for the comparison of fix and novel patches as appropriate. alongside the reference human genome (Figure 1). Supported species Gene annotation As of release 69 (October 2012), Ensembl supports 70 The year 2012 has seen the inclusion of RNA-seq data species including 61 species fully supported on our main provided by several different groups (18–20) as supporting site. Of these, we have created full gene annotations for 58 evidence for our gene annotations. Thirteen species cur- chordates (43 with high-coverage genome sequences and rently incorporate RNA-seq data including zebrafish, 15 with low-coverage) and have imported annotation data chimpanzee, Nile tilapia, dog, Chinese softshell turtle, for three non-chordate model organisms (Saccharomyces pig, ferret, platyfish, coelacanth, Tasmanian devil, cerevisiae, Caenorhabditis elegans and Drosophila orang-utan, opossum and platypus. For some of these melanogaster) to facilitate comparative analysis. Five species, the RNA-seq data were added after a standard new species were included during the past year with full gene annotation process (21), whereas for other species, support: Atlantic cod (Gadus morhua), coelacanth the data were added as an integral part of the genebuild (Latimeria chalumnae), ferret (Mustela putorius furo), process. Some species also include tissue-specific RNA-seq Nile tilapia (Oreochromis niloticus) and Chinese softshell data that enables the exploration of tissue-specific expres- turtle (Pelodiscus sinensis). An additional nine species are currently available with limited support on the Ensembl sion. In addition, the Illumina Human BodyMap 2.0 Pre! site (http://pre.ensembl.org) including the following, data (http://www.ebi.ac.uk/arrayexpress/experiments/ which were newly added in the past year: budgerigar E-MTAB-513) have been re-processed using our enhanced D50 Nucleic Acids Research, 2013, Vol. 41, Database issue Figure 1. A region of the GRCh37 human assembly showing the complete APBA1 gene. The top panel displays the GRCh37 reference sequence as originally released, and the bottom panel displays the region after the inclusion of the novel patch HSCHR9_1_CTG35. The region of difference is highlighted and marked by the ‘Assembly exception’ track, whereas the pink regions of LASTZ self-alignment provide more details about what has changed in the patch including the addition of new sequence that was missing in the originally released assembly. The green areas show the mapping between the original and the alternative sequences and demonstrate a corrected inversion at the left hand side of the patch. The patch changes the annotation such that the RNA gene RP11-548B3.3 (in purple) moves from 5 of the APBA1 gene to within the second intron. As can be seen in the right hand side of the figure, the existence of the patch does not alter the annotation downstream of the change. Figure based on http://e68.ensembl.org/Homo_sapiens/Location/Multi?db=core;r=9:72019177-72298831;r1=HSCHR9_1_CTG35:72019384-72307679;s1=Homo_ sapiens--HSCHR9_1_CTG35. pipeline to produce updated gene models and new BAM particularly effective for species that are distantly related files. to the well-annotated mammals and those with little RNA-seq data are now routinely used in gene annota- species-specific sequence data available at the time of tion in a number of ways, and we anticipate that RNA-seq initial annotation. Specific improvements from the data will be used in almost all gene annotation projects for RNA-seq update pipeline include lengthening truncated the foreseeable future. Briefly, our current procedure genes, merging adjacent gene fragments and splitting starts with raw-sequencing reads that are aligned to the artificially merged genes. RNA-seq-based data are also genome and processed to produce RNA-seq-based gene useful for higher primate species that have previously models, BAM files and intron features that are supported relied largely on human sequence data for annotation, as by intron-spanning reads. Intron-supporting evidence it allows for the identification of non-human primate- helps to quantify intron predictions in RNA-seq transcript specific gene expression. sets. The intron features and RNA-seq-based gene models are used alongside cDNA and EST alignments to compare Variation resources and filter the preliminary set of protein-coding models We create variation resources for 17 species by importing against a set of highly supported splice sites. In addition, and merging data from many different sources through the RNA-seq-based gene models are used to provide al- our pipeline (22). The current list of variation data is ternate isoforms and fill in gaps between models identified provided at http://www.ensembl.org/info/docs/variation/ by the standard Similarity Genewise component of our sources_documentation.html. Most of our SNP and annotation system, which aligns protein sequences to the in-del data (rsIDs, locations, allele frequencies and geno- genome, and to add untranslated regions to the protein types) come from dbSNP (23). This year, we have updated coding models. the Ensembl Variation databases for human, rat, chim- We have also developed an RNA-seq update pipeline panzee, orang-utan, zebrafish, pig, dog and macaque. that allows an existing Ensembl gene set to be updated We have also remapped the variation data for mouse through incorporation of new RNA-seq data. The onto the new GRCm38 assembly before updated RNA-seq update pipeline takes in the results of the GRCm38 mappings were provided by dbSNP and standard Ensembl gene annotation method and also RNA-seq-based models produced by the pipeline previ- provided the same update for new dog assembly. ously described (20). The two sets of input models are Available structural variation data have increased consid- compared and merged to produce an updated gene set. erably, and we have data for human, mouse, horse, This new method was used to improve the existing zebrafish, cow and macaque largely provided by the opossum, platypus and orang-tuan gene sets for DGVa database of copy number and structural variation Ensembl release 69 (October 2012). The method is (24). The human structural variation data are more Nucleic Acids Research, 2013, Vol. 41, Database issue D51 comprehensive than all other species combined and Ensembl web interface include >6 million variants of which 5624 are somatic. During the past year, development on the Ensembl web The variation database infrastructure storing geno- interface has continued a combined strategy of small in- types has also been redeveloped to improve the respon- cremental improvements on the website while making sub- siveness of our displays and to support non-diploid stantial progress on a number of major infrastructure-level genomes. projects. The human variation data also include genotypes On the data display front, we are now able to show imported from the 1000 Genomes Project and the alignments of human assembly patches to the reference NHLBI Exome Sequencing Project (25), 79 000 assembly (Figure 1) and have renamed the ‘Multi-species mutation data locations provided by HGMD (26), view’ as ‘Region comparison’ to reflect its wider applic- clinical variants on LRGs (27) and >135 000 somatic ability. We have also added a transcript variation page, mutation positions from COSMIC (28). We have also similar to the gene variation page but showing only one added mitochondrial variants, information on clinical transcript at a time, which is particularly helpful in the significance and global minor allele frequencies from case of large, well-annotated genes that are challenging dbSNP, as well as phenotype data for >287 000 to display quickly or interpret easily due to their data variants from OMIM (4), the European Genome- density. Other additions to the user interface include a phenome Archive (EGA) and the NHGRI GWAS new online tool, Region Report, which provides graphical catalog (29). We denote those variants present on access to the API script of the same name to export three Affymetrix genotyping chips (GeneChip 100 K sequence, genes and other annotation from one or more Array, GeneChip 500 K Array, GenomeWideSNP_6.0) regions. We have also re-introduced the ability to save and nine Illumina chips (CytoSNP12v1, Human660- configurations on images: users can turn their choice of W-quad, Human1M-duoV3, CardioMetaboChip, tracks on and off and then save this selection in either the HumanOmni1-Quad, HumanHap650, HumanHap550, browser session or their personal accounts and then HumanOmni2.5 and Human610_Quad), and also quickly return to the same layout at a later time. These indicate those variants curated by UniProt (2). configurations can also be grouped into sets (e.g. to For all species, we calculate the effect of each variant combine a set of favourite variation tracks with a set of allele on overlapping Ensembl transcripts and whether the gene tracks) for even quicker reconfiguration of images. variant falls within an Ensembl regulatory feature, TF We have started to refresh the look and feel of the binding motif or a high information position within the website. For example, our icon set was previously motif. Our consequence annotation now uses defined created from various sources and has now been replaced Sequence Ontology (SO) terms (30) for all descriptions, with a single matching set. We have adapted the layout which enable querying of ontological relationships in and colour scheme for increased readability, and we are BioMart. More detailed consequence information is also continuing the process of replacing text-heavy pages with provided for SNPs and in-dels in specific genomic loca- simpler, more user-friendly layouts where appropriate. tions such as splice sites. These SO terms have also been Finally, major projects nearing completion and adopted by both the UCSC genome browser and ICGC scheduled for release by the end of 2012 include a Javascript-based scrollable genome browser called providing a standard to enable easy comparison of vari- Genoverse that will be incorporated into our location ation annotation. displays for Ensembl release 69 (October 2012) and Other resources supporting human variation include support for UCSC-style datahubs, which can contain calculated linkage disequilibrium values and tag SNPs, sets of preconfigured tracks or a user-supplied collection in addition to SIFT (31) and PolyPhen (32) predictions of remote resources. Additional work underway includes a for amino acid changes. This year we have switched to top-to-bottom rewrite of our BLAST/BLAT search using using the Ensembl comparative genomics pipeline to the Ensembl eHive job management system supporting a provide the ancestral alleles of SNPs and short deletions new web frontend, which will be tested on our beta site for human, orang-utan, chimpanzee and macaque (previ- (http://beta.ensembl.org) before rolling out into a major ously this was imported from dbSNP). We have also ex- Ensembl release in 2013. tensively improved our quality control (QC) procedures, which leverage the eHive software and have been extended Regulation to include structural variations. As a result of our effort to provide the most useful During the past year, we have significantly updated and possible summaries of large data sets to our users, we increased the amount of data available from the Ensembl have added new tracks for 1000 Genomes Project regulation database. As of Ensembl release 69 (October common variants and also tracks for each global 1000 2012), there are 532 ChIP-seq and DNase-seq data sets Genomes population. Additionally, appropriate pheno- from 13 human and five mouse cell lines. In total, these type data have been collected into a dedicated section on data sets represent information about the genomic loca- the Ensembl gene pages. Finally, the documentation tions of 49 different histone modification types and the section of the website has also been extended and binding regions of 113 different TFs. Forty of these TFs improved for all areas of Ensembl Variation especially have binding matrices available through the JASPAR for the Variant Effect Predictor (VEP), SO consequences, database (33), and we have incorporated these motif data QC pipeline and API diagrams. as positions of high probability TF-binding sites (5% False D52 Nucleic Acids Research, 2013, Vol. 41, Database issue Discovery Rate) within the binding regions. We have also of new taxonomic groups. These species define additional created a dedicated experimental summary page providing branching points in the phylogenetic trees, enable splitting information on individual experimental details and long branches and provide us with more taxonomic power summary metadata, such as references to the raw sequences to better resolve the gene trees. Further information on the reads available in the European Nucleotide Archive (34). evolution of the gene families is now provided by supple- The data underlying the Ensembl Regulatory Build cur- menting our phylogenetic analysis with a calculated as- rently include experiments in 13 cell lines. Regulatory Build sessment on the possible expansions and contractions in coverage has increased by 15% in the past year and now each family using the CAFE tool (39). annotates 270 Mb of the human genome in 518 020 regula- Our data model for gene trees has been modified to tory features. In Ensembl release 65 (December 2011), we handle both protein and ncRNA gene trees. During that introduced the combined Segway (35) and ChromHMM process, we also improved our support for protein (36) segmentation analyses developed for ENCODE (9), super-trees, which are used in the resolution of very which classifies the genome into regions based on 12 large protein families. These are split in sub-families, specific assays to obtain a single-track summary of the and the super-protein tree represents the relationship functional architecture of the human genome. The segmen- between these sub-families. We have developed a better tation tracks are currently available for six human cell lines: identification and annotation of split genes that usually GM12878, K562, H1-hESC, HepG2, HeLa-S3 and arise because of assembly errors (40). In our current im- HUVEC. The segmentation tracks are displayed with plementation, the enhanced gene tree pipeline (41) detects specific views available from the ‘Regulation’ configuration gene split events after building the protein multiple align- in the Ensembl browser (Figure 2). ment, and the resulting nodes of the tree can be annotated The Ensembl Regulation database and web views as gene split events when they relate to partial proteins continue to provide various other data resources including that could be concatenated to form a full gene. the following: mapping of probe sets for all the common microarray platforms, DNA methylation from various Ensembl tools and software projects including ENCODE, high profile externally During the past year, we have made significant improve- curated data sets such as cisRED motifs (37) and an ment to the Ensembl VEP (42) and launched a beta im- updated VISTA enhancer set (38). plementation of a new Ensembl REST API. The VEP provides comprehensive analysis of SNP, in-del or struc- Comparative genomics tural variation data including reports of which gene, tran- New species added in the past year such as coelacanth and script, protein or regulatory region overlap the variants of lamprey have provided our gene trees with representatives interest and if there is any change in amino acid sequence. Figure 2. Combined Segway and ChromHMM segmentation analyses within Ensembl in the region around the SLC18B1 gene on human chromo- some 6. The combination process results in seven annotated segments: CTCF enriched, Predicted Weak Enhancer/Cis-reg element, Predicted Transcribed Region, Predicted Enhancer, Predicated Promoter Flank, Predicted Repressed/Low Activity or Predicted Promoter with TSS. Six of the seven segment types are shown with variability in predicted enhancer activity between the assayed cell lines. Figure based on http://e68.ensembl. org/Homo_sapiens/Location/View?r=6:133088392-133123741. Nucleic Acids Research, 2013, Vol. 41, Database issue D53 It also includes information about SIFT and PolyPHEN dataset has been added containing data from COSMIC predictions in human, protein domains, exon/intron (28). The ability to search multiple chromosomal regions numbers, minor allele frequencies and other information. at once has been added to the Ensembl Regulation mart. The VEP works with many different file formats and can In addition to this, users can query human regulatory seg- in fact convert variant positions between different coord- mentation features using the newly added regulatory segments filter section and attribute page. inate systems (Ensembl, RefSeq, LRG and HGVS). We have also written plugins to report on degree of conserva- User training and support tion, presence of the variant in an LOVD database in a Locus Specific Database (LSDB) using the Leiden Open Ensembl supports new and existing users in a variety of Variation Database (LOVD) software (43) and other ways from a strong and increasing on-line presence to capabilities. Our VEP plugins are present in the direct face-to-face training at universities and other insti- ensembl-variation github repository (https://github.com/ tutions worldwide. This year, we held one-day workshops ensembl-variation/VEP_plugins), and we encourage users on five continents and launched new virtual initiatives to share their own plugins. available to all including those further afield or without The REST API web service was released as a beta ap- the means to host a one-day workshop. plication this year at http://beta.rest.ensembl.org. We provide extensive free and user-driven tutorials via Although we have a fully supported Perl API to all of the Ensembl YouTube (http://www.youtube.com/user/ the Ensembl data (44), the REST API addresses those EnsemblHelpdesk) and YouKu (http://i.youku.com/u/ users who wish to access Ensembl data in a language- id_UMzM1NjkzMTI0) channels and e-learning course agnostic manner. The web service is built using the Perl (http://www.ebi.ac.uk/training/online/course/ensembl- web framework Catalyst, Catalyst::Action::REST and our browsing-chordate-genomes). The Ensembl YouTube existing Perl API providing a rapid development environ- channel has >165 subscribers and >91 000 video views, ment and lowering the cost of creating new endpoints. now hosts >20 videos including navigation ‘how-to’ Output is a combination of bioinformatics and program- guides. This year, we have added more advanced videos matically relevant formats such as FASTA and JSON. We covering subjects such as patches and haplotypes on the provide access to sequences, assembly mapping, homo- human assembly, API installation and how RNA-seq data logues and integration of the VEP with support for are used in the genebuild. In 2012, the top 20 countries genomic features. The REST service, like all Ensembl accessing our on-line training reflect a worldwide audience software, is free to download from our CVS server from the USA, Europe, India, Japan, Australia, Pakistan, allowing users to deploy over their local Ensembl Taiwan, Mexico, South Korea and Brazil, and our most databases. popular videos have been viewed hundreds or thousands of times. Data access and data mining We communicate more informally and highlight updates and new features using the Ensembl blog Each Ensembl release provides a full rebuild of seven (http://www.ensembl.info/), Facebook page (http://www. BioMart (45,46) databases. Four of these BioMart data- facebook.com/Ensembl.org) and Twitter account (http:// bases (Ensembl Gene, Ensembl Variation, Ensembl twitter.com/ensembl). Our Helpdesk (helpdesk@ Regulation and VEGA) are visible on the Ensembl ensembl.org) continues to provide email support for BioMart interface, and the remaining three BioMart data- >100 questions monthly, and we are exploring webinars bases are hidden from view but are accessed through fed- as a vehicle for more interactive long-distance learning eration with visible BioMart databases to provide and plan to offer more of these events in 2013. ontology, sequence and genomic feature data. Performing a complete rebuild each release ensures the availability most up to date integrated data from across ACKNOWLEDGEMENTS the Ensembl project. Users can access these data via the The authors are consistently grateful to their users and MartView (web interface) and MartService (BioMart Perl especially to those who take the time to contact us API, DAS server, SOAP, REST, BioConductor biomaRt through our mailing lists, blog and other avenues. They package). acknowledge those researchers, organizations and Each Ensembl BioMart release includes the addition of large-scale projects that have provided data to Ensembl any new species, updated assemblies, updates to the before publication under the understandings of the Fort germline and somatic variation and structural variation Lauderdale meeting discussing Community Resource data sets as well as updates to the regulation data. One Projects and the Toronto meeting on pre-publication can now obtain our SIFT and PolyPhen predictions and data sharing. scores from the Ensembl variation BioMart and from the variation ‘filter’ and ‘attribute’ sections of the Ensembl gene BioMart. It is also possible to select specific mouse FUNDING strain information from the mouse structural variation data set, and one can filter on the source and study acces- The Wellcome Trust provides majority funding for the sion of interest in the structural variation data sets avail- Ensembl project [WT062023 and WT079643] with add- able for cow, zebrafish, horse, human, mouse and itional funding from the National Human Genome macaque. A new human somatic structural variation Research Institute [U01HG004695, U54HG004563 and D54 Nucleic Acids Research, 2013, Vol. 41, Database issue 12. Severin,J., Beal,K., Vilella,A.J., Fitzgerald,S., Schuster,M., U41HG006104] the BBSRC [BB/I025506/1], and the Gordon,L., Ureta-Vidal,A., Flicek,P. and Herrero,J. (2010) European Molecular Biology Laboratory. Additional eHive: an artificial intelligence workflow system for genomic support for specific project components as specified: analysis. BMC Bioinformatics, 11, 240. Funded by the European Commission under SLING, 13. Flicek,P., Amode,M.R., Barrell,D., Beal,K., Brent,S., Carvalho- grant agreement number 226073 (Integrating Activity) Silva,D., Clapham,P., Coates,G., Fairley,S. et al. (2012) Ensembl 2012. Nucleic Acids Res., 40, D84–D90. within Research Infrastructures of the FP7 Capacities 14. Wilming,L.G., Gilbert,J.G., Howe,K., Trevanion,S., Hubbard,T. Specific Programme; The research leading to these and Harrow,J.L. (2008) The vertebrate genome annotation (Vega) results has received funding from the European database. Nucleic Acids Res., 36, D753–D760. Community’s Seventh Framework Programme (FP7/ 15. Harrow,J., Denoeud,F., Frankish,A., Reymond,A., Chen,C.K., 2007-2013) under grant agreement n 222664. Chrast,J., Lagarde,J., Gilbert,J.G., Storey,R. et al. (2006) GENCODE: producing a reference annotation for ENCODE. (‘‘Quantomics’’). This Publication reflects only the Genome Biol., 7(Suppl.1), S4.1–S4.9. author’s views and the European Community is not 16. Harte,R.A., Farrell,C.M., Loveland,J.E., Suner,M.M., Wilming,L., liable for any use that may be made of the information Aken,B., Barrell,D., Frankish,A., Wallin,C. et al. (2012) Tracking contained herein; The research leading to these results has and coordinating an international curation effort for the CCDS Project. Database (Oxford), 2012, bas008. received funding from the European Community’s 17. Church,D.M., Schneider,V.A., Graves,T., Auger,K., Seventh Framework Programme (FP7/2007-2013) under Cunningham,F., Bouk,N., Chen,H.C., Agarwala,R., grant agreement number 200754 – the GEN2PHEN McLaren,W.M. et al. (2011) Modernizing reference genome project; The research leading to these results has assemblies. PLoS Biol., 9, e1001091. received funding from the European Community’s 18. Brawand,D., Soumillon,M., Necsulea,A., Julien,P., Csa´ rdi,G., Harrigan,P., Weier,M., Liechti,A., Aximu-Petri,A. et al. (2011) Seventh Framework Programme (FP7/ 2007-2013) under The evolution of gene expression levels in mammalian organs. the grant agreement n 223210 CISSTEM; The research Nature, 478, 343–348. leading to these results has received funding from the 19. Murchison,E.P., Schulz-Trieglaff,O.B., Ning,Z., Alexandrov,L.B., European Union’s Seventh Framework Programme Bauer,M.J., Fu,B., Hims,M., Ding,Z., Ivakhno,S. et al. (2012) (FP7/2007-2013) under grant agreement n 282510 – Genome sequencing and analysis of the tasmanian devil and its transmissible cancer. Cell, 148, 780–791. BLUEPRINT. Funding for open access charge: The 20. Collins,J.E., White,S., Searle,S.M. and Stemple,D.L. (2012) Wellcome Trust. Incorporating RNA-seq data into the zebrafish Ensembl genebuild. Genome Res., 22, 2067–2078. Conflict of interest statement. None declared. 21. Curwen,V., Eyras,E., Andrews,T.D., Clarke,L., Mongin,E., Searle,S.M. and Clamp,M. (2004) The Ensembl automatic gene annotation system. Genome Res., 14, 942–950. REFERENCES 22. Chen,Y., Cunningham,F., Rios,D., McLaren,W.M., Smith,J., Pritchard,B., Spudich,G.M., Brent,S., Kulesha,E. et al. (2010) 1. Seal,R.L., Gordon,S.M., Lush,M.J., Wright,M.W. and Ensembl variation resources. BMC Genomics, 11, 293. Bruford,E.A. (2011) genenames.org: the HGNC resources in 2011. 23. Foelo,M.L. and Sherry,S.T. (2007) NCBI dbSNP Database: Nucleic Acids Res., 39, D514–D519. content and searching. In: Weiner,M.P., Gabriel,S.B. and 2. UniProt Consortium. (2012) Reorganizing the protein space at Stephens,J.C. (eds), Genetic Variation: A Laboratory Manual. the Universal Protein Resource (UniProt). Nucleic Acids Res., 40, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, D71–D75. pp. 41–61. 3. Rath,A., Olry,A., Dhombres,F., Brandt,M.M., Urbero,B. and 24. Church,D.M., Lappalainen,I., Sneddon,T.P., Hinton,J., Ayme,S. (2012) Representation of rare diseases in health Maguire,M., Lopez,J., Garner,J., Paschall,J., Dicuccio,M. et al. information systems: the Orphanet approach to serve a wide (2010) Public data archives for genomic structural variation. Nat. range of end users. Hum. Mutat., 33, 803–808. Genet., 42, 813–814. 4. Amberger,J., Bocchini,C. and Hamosh,A. (2011) A new face and 25. Tennessen,J.A., Bigham,A.W., O’Connor,T.D., Fu,W., new challenges for Online Mendelian Inheritance in Man Kenny,E.E., Gravel,S., McGee,S., Do,R., Liu,X. et al. (2012) (OMIM( )). Hum. Mutat., 32, 564–567. Evolution and functional impact of rare coding variation from 5. Pruitt,K.D., Tatusova,T., Brown,G.R. and Maglott,D.R. (2012) deep sequencing of human exomes. Science, 337, 64–69. NCBI Reference Sequences (RefSeq): current status, new features 26. Stenson,P.D., Ball,E.V., Mort,M., Phillips,A.D., Shiel,J.A., and genome annotation policy. Nucleic Acids Res., 40, Thomas,N.S., Abeysinghe,S., Krawczak,M. and Cooper,D.N. D130–D135. (2003) Human gene mutation database (HGMD): 2003 update. 6. Dreszer,T.R., Karolchik,D., Zweig,A.S., Hinrichs,A.S., Hum. Mutat., 21, 577–581. Raney,B.J., Kuhn,R.M., Meyer,L.R., Wong,M., Sloan,C.A. et al. 27. Dalgleish,R., Flicek,P., Cunningham,F., Astashyn,A., Tully,R.E., (2012) The UCSC Genome Browser database: extensions and Proctor,G., Chen,Y., McLaren,W.M., Larsson,P. et al. (2010) updates 2011. Nucleic Acids Res., 40, D918–D923. Locus Reference Genomic sequences: an improved basis for 7. Velankar,S., Alhroub,Y., Best,C., Caboche,S., Conroy,M.J., describing human DNA variants. Genome Med., 2, 24. Dana,J.M., Fernandez Montecelo,M.A., van Ginkel,G., 28. Forbes,S.A., Bindal,N., Bamford,S., Cole,C., Kok,C.Y., Beare,D., Golovin,A. et al. (2012) PDBe: Protein Data Bank in Europe. Jia,M., Shepherd,R., Leung,K. et al. (2011) COSMIC: mining Nucleic Acids Res., 40, D445–D452. complete cancer genomes in the Catalogue of Somatic Mutations 8. 1000 Genomes Project Consortium. (2010) A map of human in Cancer. Nucleic Acids Res., 39, D945–D950. genome variation from population-scale sequencing. Nature, 467, 29. Hindorff,L.A., Sethupathy,P., Junkins,H.A., Ramos,E.M., 1061–1073. Mehta,J.P., Collins,F.S. and Manolio,T.A. (2009) Potential 9. ENCODE Project Consortium. (2012) An integrated encyclopedia etiologic and functional implications of genome-wide association of DNA elements in the human genome. Nature, 489, 57–74. loci for human diseases and traits. Proc. Natl Acad. Sci. USA, 10. International Cancer Genome Consortium. (2010) International 106, 9362–9367. network of cancer genome projects. Nature, 464, 993–998. 30. Eilbeck,K., Lewis,S.E., Mungall,C.J., Yandell,M., Stein,L., 11. Adams,D., Altucci,L., Antonarakis,S.E., Ballesteros,J., Beck,S., Durbin,R. and Ashburner,M. (2005) The sequence ontology: a Bird,A., Bock,C., Boehm,B., Campo,E. et al. (2012) tool for the unification of genome annotations. Genome Biol., 6, BLUEPRINT to decode the epigenetic signature written in blood. R44. Nat. Biotechnol., 30, 224–226. Nucleic Acids Research, 2013, Vol. 41, Database issue D55 31. Kumar,P., Henikoff,S. and Ng,P.C. (2009) Predicting the effects 39. De Bie,T., Cristianini,N., Demuth,J.P. and Hahn,M.W. (2006) of coding non-synonymous variants on protein function using the CAFE: a computational tool for the study of gene family SIFT algorithm. Nat. Protoc., 4, 1073–1081. evolution. Bioinformatics, 22, 1269–1271. 32. Adzhubei,I.A., Schmidt,S., Peshkin,L., Ramensky,V.E., 40. Dessimoz,C., Zoller,S., Manousaki,T., Qiu,H., Meyer,A. and Gerasimova,A., Bork,P., Kondrashov,A.S. and Sunyaev,S.R. Kuraku,S. (2011) Comparative genomics approach to detecting (2010) A method and server for predicting damaging missense split-coding regions in a low-coverage genome: lessons from the mutations. Nat. Methods, 7, 248–249. chimaera Callorhinchus milii (Holocephali, Chondrichthyes). Brief 33. Portales-Casamar,E., Thongjuea,S., Kwon,A.T., Arenillas,D., Bioinform., 12, 474–484. Zhao,X., Valen,E., Yusuf,D., Lenhard,B., Wasserman,W.W. and 41. Vilella,A.J., Severin,J., Ureta-Vidal,A., Heng,L., Durbin,R. and Sandelin,A. (2010) JASPAR 2010: the greatly expanded Birney,E. (2009) EnsemblCompara GeneTrees: complete, open-access database of transcription factor binding profiles. duplication-aware phylogenetic trees in vertebrates. Genome Res., 19, Nucleic Acids Res., 38, D105–D110. 327–335. 34. Amid,C., Birney,E., Bower,L., Cerden˜ o-Ta´ rraga,A., Cheng,Y., 42. McLaren,W., Pritchard,B., Rios,D., Chen,Y., Flicek,P. and Cleland,I., Faruque,N., Gibson,R., Goodgame,N. et al. (2012) Cunningham,F. (2010) Deriving the consequences of genomic Major submissions tool developments at the European Nucleotide variants with the Ensembl API and SNP Effect Predictor. Archive. Nucleic Acids Res., 40, D43–D47. Bioinformatics, 26, 2069–2070. 35. Hoffman,M.M., Buske,O.J., Wang,J., Weng,Z., Bilmes,J.A. and 43. Fokkema,I.F., Taschner,P.E., Schaafsma,G.C., Celli,J., Laros,J.F. Noble,W.S. (2012) Unsupervised pattern discovery in human and den Dunnen,J.T. (2011) LOVD v.2.0: the next generation in chromatin structure through genomic segmentation. Nat. Methods, gene variant databases. Hum. Mutat., 32, 557–563. 9, 473–476. 44. Stabenau,A., McVicker,G., Melsopp,C., Proctor,G., Clamp,M. 36. Ernst,J. and Kellis,M. (2012) ChromHMM: automating chromatin- and Birney,E. (2004) The Ensembl core software libraries. state discovery and characterization. Nat. Methods, 9, 215–216. Genome Res., 14, 929–933. 37. Robertson,G., Bilenky,M., Lin,K., He,A., Yuen,W., Dagpinar,M., 45. Smedley,D., Haider,S., Ballester,B., Holland,R., London,D., Varhol,R., Teague,K., Griffith,O.L. et al. (2006) cisRED: a Thorisson,G. and Kasprzyk,A. (2009) BioMart–biological queries database system for genome-scale computational discovery of made easy. BMC Genomics, 10, 22. regulatory elements. Nucleic Acids Res., 34, D68–D73. 46. Kinsella,R.J., Ka¨ ha¨ ri,A., Haider,S., Zamora,J., Proctor,G., 38. Visel,A., Minovitsky,S., Dubchak,I. and Pennacchio,L.A. (2007) Spudich,G., Almeida-King,J., Staines,D., Derwent,P. et al. (2011) VISTA Enhancer Browser–a database of tissue-specific human Ensembl BioMarts: a hub for data retrieval across taxonomic enhancers. Nucleic Acids Res., 35, D88–D92. space. Database, 2011, bar030. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Nucleic Acids Research Oxford University Press

Loading next page...
 
/lp/oxford-university-press/ensembl-2013-gAEmDM0Jd3

References (94)

Publisher
Oxford University Press
Copyright
The Author(s) 2012. Published by Oxford University Press.
ISSN
0305-1048
eISSN
1362-4962
DOI
10.1093/nar/gks1236
pmid
23203987
Publisher site
See Article on Publisher Site

Abstract

D48–D55 Nucleic Acids Research, 2013, Vol. 41, Database issue Published online 30 November 2012 doi:10.1093/nar/gks1236 1,2, 1 2 2 1 Paul Flicek *, Ikhlak Ahmed , M. Ridwan Amode , Daniel Barrell , Kathryn Beal , 2 1 2 2 2 Simon Brent , Denise Carvalho-Silva , Peter Clapham , Guy Coates , Susan Fairley , 1 1 2 1 2 Stephen Fitzgerald , Laurent Gil , Carlos Garcı´a-Giro ´ n , Leo Gordon , Thibaut Hourlier , 1 1 2 1 Sarah Hunt , Thomas Juettemann , Andreas K. Ka ¨ ha ¨ ri , Stephen Keenan , 1 1 1 1 Monika Komorowska , Eugene Kulesha , Ian Longden , Thomas Maurel , 1 1 2 1 1 William M. McLaren , Matthieu Muffato , Rishi Nag , Bert Overduin , Miguel Pignatelli , 2 1 2 1 Bethan Pritchard , Emily Pritchard , Harpreet Singh Riat , Graham R. S. Ritchie , 1 1 2 1 1 Magali Ruffier , Michael Schuster , Daniel Sheppard , Daniel Sobral , Kieron Taylor , 1 2 2 1 Anja Thormann , Stephen Trevanion , Simon White , Steven P. Wilder , 2 1 1 1 2 Bronwen L. Aken , Ewan Birney , Fiona Cunningham , Ian Dunham , Jennifer Harrow , 1 2 1 1 2 Javier Herrero , Tim J. P. Hubbard , Nathan Johnson , Rhoda Kinsella , Anne Parker , 1 1 2 2 Giulietta Spudich , Andy Yates , Amonida Zadissa and Stephen M. J. Searle European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton Cambridge CB10 1SD, UK and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK Received October 11, 2012; Revised October 31, 2012; Accepted November 1, 2012 web interface to advanced bioinformatics programmers ABSTRACT looking to do complex analysis or build new tools that The Ensembl project (http://www.ensembl.org) leverage the Ensembl infrastructure. As such, we provide provides genome information for sequenced chord- all of the Ensembl source code freely under an ate genomes with a particular focus on human, Apache-style license and release all of our data without mouse, zebrafish and rat. Our resources include restriction. Ensembl data are distributed from our genome evidenced-based gene sets for all supported spe- browser at http://www.ensembl.org as well as via cies; large-scale whole genome multiple species BioMart, the Ensembl Application Programming Interface (API), direct MySQL access, Amazon Web alignments across vertebrates and clade-specific Services Public data sets (http://www.ensembl.org/info/ alignments for eutherian mammals, primates, birds data/amazon_aws.html) and via full data download. and fish; variation data resources for 17 species and Ensembl aims to be a hub of genome information by regulation annotations based on ENCODE and other linking identifiers and information between external bio- data sets. Ensembl data are accessible through the logical resources and data within Ensembl or importing genome browser at http://www.ensembl.org and essential information from other resources so that it can through other tools and programmatic interfaces. be found within Ensembl and linked back to the original resource as necessary. For example, we provide up to date external database references to gene names from the INTRODUCTION HUGO Gene Nomenclature Committee (HGNC) (1), Ensembl (http://www.ensembl.org) collects, creates, or- the Universal Protein Resource (UniProt) (2), Orphanet ganizes and distributes data resources in support of portal for rare diseases and orphan drugs (3), the Online research into the genetics and genomics of chordates. Mendelian Inheritance in Man (OMIM) database (4), the We currently support 70 species with a focus on human RefSeq collection of Reference Sequences from NCBI (5), in additional to agricultural animals and major vertebrate the UCSC Genome Browser (6), the Protein Data Bank model organisms such as mouse, zebrafish and rat. We (PDB) repository for biological macromolecular support a full range of researchers in genomics from structures (7) and many other resources. bench biologists interested in looking up specific details We participate in or work closely with a number of about their genes or loci of interest using a graphical large-scale international projects including the 1000 *To whom correspondence should be addressed. Tel: +44 1223 492581; Fax: +44 1223 494494; Email: fl[email protected] The Author(s) 2012. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial reuse, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]. Nucleic Acids Research, 2013, Vol. 41, Database issue D49 Genomes Project (8), ENCODE (9), the International (Melopsittacus undulates), Chinese hamster CHO cell line Cancer Genome Consortium (ICGC) (10) and the (Cricetulus griseus), painted turtle (Chrysemys picta bellii), BLUEPRINT epigenome mapping project (11). spotted gar (Lepisosteus oculatus), collared flycatcher Participation in these efforts helps ensure that we (Ficedula albicollis) and squirrel monkey (Saimiri produce timely and valuable resources through direct sci- boliviensis boliviensis). Ensembl Pre! sites provide entific engagement with the communities that we are BLAST and genome visualization, but do not provide a trying to serve. In addition, we actively develop and complete gene build. For specific genomes, we also provide key pieces of large-scale bioinformatics infrastruc- provide downloadable data on the preview site. ture including the eHive workflow management system for We update the human gene set for every Ensembl genomic analysis (12). release via a merge of the Ensembl evidence-based auto- Full incorporation of the data types resulting from the matic annotation and Havana manual annotation (14) to myriad of experimental assays now leveraging next gener- produce an updated GENCODE gene set (9,15). This set ation sequencing technology remains an important area of also includes all current human Consensus Coding development for the project. During the past year, we have Sequence (CCDS) gene models (16). Manual annotation made considerable progress in a number of ways including from Havana is also incorporated into our gene sets on a greater incorporation of RNA-seq data into our gene alternate releases for mouse and zebrafish. In addition, pig annotations and ChIP-seq data into our regulatory anno- now includes manual annotation from Havana on selected tations. In general, we believe that the most useful re- regions of the genome. sources provide integrated summary information that The human genome assembly is updated regularly by transforms the raw sequencing data into biological know- the Genome Reference Consortium (GRC) to include al- ledge that can provide a foundation for further biological ternate sequences in the form of ‘fix’ and ‘novel’ assembly research. Thus, we believe that the display of the called patches (17), and we continue to include these additional variants from the 1000 Genomes Project or regulatory alternate sequences and annotate them with genes and region annotations supported by specific histone modifi- other features as appropriate. Ensembl release 69 cation or transcription factor (TF) binding sites are more (October 2012) included GRCh37.p8 (i.e. the eighth useful as resources for the community than a display of patch release of the GRCh37 assembly). The mouse the raw aligned sequence reads. However, Ensembl does genome annotation, which also incorporates all current support the upload and visualization of read alignment mouse CCDS models, was updated for Ensembl release data (e.g. alignment files in BAM format) and provides 68 (July 2012) to reflect the new GRCm38 assembly. signal files for our ChIP-seq and alignment files for Other species previously available on our website also RNA-seq data within the browser for those users saw updates in the past year including new primary needing direct access to the supporting data. Indeed, assemblies and gene sets for chimpanzee, dog, pig, Ensembl’s API development this year included increasing ground squirrel, bushbaby and Ciona intestinalis. The support for file-based data access to enable integration of gene sets for orang-utan, opossum and platypus were very large BAM and other file-based data sets into the also updated using RNA-seq data. browser. The whole genome multiple and pairwise alignments This report highlights the new data we have released have been re-run in conjunction with the incorporation and the new mechanisms of data access that we have of new or updated genomes. In addition to cross-species deployed during the past year since our previous report alignments, we now provide self-alignments for the human (13). We describe how these new features extend the genome and also use the Ensembl comparative genomics existing capabilities of the project, which will be explained infrastructure for the comparison of fix and novel patches as appropriate. alongside the reference human genome (Figure 1). Supported species Gene annotation As of release 69 (October 2012), Ensembl supports 70 The year 2012 has seen the inclusion of RNA-seq data species including 61 species fully supported on our main provided by several different groups (18–20) as supporting site. Of these, we have created full gene annotations for 58 evidence for our gene annotations. Thirteen species cur- chordates (43 with high-coverage genome sequences and rently incorporate RNA-seq data including zebrafish, 15 with low-coverage) and have imported annotation data chimpanzee, Nile tilapia, dog, Chinese softshell turtle, for three non-chordate model organisms (Saccharomyces pig, ferret, platyfish, coelacanth, Tasmanian devil, cerevisiae, Caenorhabditis elegans and Drosophila orang-utan, opossum and platypus. For some of these melanogaster) to facilitate comparative analysis. Five species, the RNA-seq data were added after a standard new species were included during the past year with full gene annotation process (21), whereas for other species, support: Atlantic cod (Gadus morhua), coelacanth the data were added as an integral part of the genebuild (Latimeria chalumnae), ferret (Mustela putorius furo), process. Some species also include tissue-specific RNA-seq Nile tilapia (Oreochromis niloticus) and Chinese softshell data that enables the exploration of tissue-specific expres- turtle (Pelodiscus sinensis). An additional nine species are currently available with limited support on the Ensembl sion. In addition, the Illumina Human BodyMap 2.0 Pre! site (http://pre.ensembl.org) including the following, data (http://www.ebi.ac.uk/arrayexpress/experiments/ which were newly added in the past year: budgerigar E-MTAB-513) have been re-processed using our enhanced D50 Nucleic Acids Research, 2013, Vol. 41, Database issue Figure 1. A region of the GRCh37 human assembly showing the complete APBA1 gene. The top panel displays the GRCh37 reference sequence as originally released, and the bottom panel displays the region after the inclusion of the novel patch HSCHR9_1_CTG35. The region of difference is highlighted and marked by the ‘Assembly exception’ track, whereas the pink regions of LASTZ self-alignment provide more details about what has changed in the patch including the addition of new sequence that was missing in the originally released assembly. The green areas show the mapping between the original and the alternative sequences and demonstrate a corrected inversion at the left hand side of the patch. The patch changes the annotation such that the RNA gene RP11-548B3.3 (in purple) moves from 5 of the APBA1 gene to within the second intron. As can be seen in the right hand side of the figure, the existence of the patch does not alter the annotation downstream of the change. Figure based on http://e68.ensembl.org/Homo_sapiens/Location/Multi?db=core;r=9:72019177-72298831;r1=HSCHR9_1_CTG35:72019384-72307679;s1=Homo_ sapiens--HSCHR9_1_CTG35. pipeline to produce updated gene models and new BAM particularly effective for species that are distantly related files. to the well-annotated mammals and those with little RNA-seq data are now routinely used in gene annota- species-specific sequence data available at the time of tion in a number of ways, and we anticipate that RNA-seq initial annotation. Specific improvements from the data will be used in almost all gene annotation projects for RNA-seq update pipeline include lengthening truncated the foreseeable future. Briefly, our current procedure genes, merging adjacent gene fragments and splitting starts with raw-sequencing reads that are aligned to the artificially merged genes. RNA-seq-based data are also genome and processed to produce RNA-seq-based gene useful for higher primate species that have previously models, BAM files and intron features that are supported relied largely on human sequence data for annotation, as by intron-spanning reads. Intron-supporting evidence it allows for the identification of non-human primate- helps to quantify intron predictions in RNA-seq transcript specific gene expression. sets. The intron features and RNA-seq-based gene models are used alongside cDNA and EST alignments to compare Variation resources and filter the preliminary set of protein-coding models We create variation resources for 17 species by importing against a set of highly supported splice sites. In addition, and merging data from many different sources through the RNA-seq-based gene models are used to provide al- our pipeline (22). The current list of variation data is ternate isoforms and fill in gaps between models identified provided at http://www.ensembl.org/info/docs/variation/ by the standard Similarity Genewise component of our sources_documentation.html. Most of our SNP and annotation system, which aligns protein sequences to the in-del data (rsIDs, locations, allele frequencies and geno- genome, and to add untranslated regions to the protein types) come from dbSNP (23). This year, we have updated coding models. the Ensembl Variation databases for human, rat, chim- We have also developed an RNA-seq update pipeline panzee, orang-utan, zebrafish, pig, dog and macaque. that allows an existing Ensembl gene set to be updated We have also remapped the variation data for mouse through incorporation of new RNA-seq data. The onto the new GRCm38 assembly before updated RNA-seq update pipeline takes in the results of the GRCm38 mappings were provided by dbSNP and standard Ensembl gene annotation method and also RNA-seq-based models produced by the pipeline previ- provided the same update for new dog assembly. ously described (20). The two sets of input models are Available structural variation data have increased consid- compared and merged to produce an updated gene set. erably, and we have data for human, mouse, horse, This new method was used to improve the existing zebrafish, cow and macaque largely provided by the opossum, platypus and orang-tuan gene sets for DGVa database of copy number and structural variation Ensembl release 69 (October 2012). The method is (24). The human structural variation data are more Nucleic Acids Research, 2013, Vol. 41, Database issue D51 comprehensive than all other species combined and Ensembl web interface include >6 million variants of which 5624 are somatic. During the past year, development on the Ensembl web The variation database infrastructure storing geno- interface has continued a combined strategy of small in- types has also been redeveloped to improve the respon- cremental improvements on the website while making sub- siveness of our displays and to support non-diploid stantial progress on a number of major infrastructure-level genomes. projects. The human variation data also include genotypes On the data display front, we are now able to show imported from the 1000 Genomes Project and the alignments of human assembly patches to the reference NHLBI Exome Sequencing Project (25), 79 000 assembly (Figure 1) and have renamed the ‘Multi-species mutation data locations provided by HGMD (26), view’ as ‘Region comparison’ to reflect its wider applic- clinical variants on LRGs (27) and >135 000 somatic ability. We have also added a transcript variation page, mutation positions from COSMIC (28). We have also similar to the gene variation page but showing only one added mitochondrial variants, information on clinical transcript at a time, which is particularly helpful in the significance and global minor allele frequencies from case of large, well-annotated genes that are challenging dbSNP, as well as phenotype data for >287 000 to display quickly or interpret easily due to their data variants from OMIM (4), the European Genome- density. Other additions to the user interface include a phenome Archive (EGA) and the NHGRI GWAS new online tool, Region Report, which provides graphical catalog (29). We denote those variants present on access to the API script of the same name to export three Affymetrix genotyping chips (GeneChip 100 K sequence, genes and other annotation from one or more Array, GeneChip 500 K Array, GenomeWideSNP_6.0) regions. We have also re-introduced the ability to save and nine Illumina chips (CytoSNP12v1, Human660- configurations on images: users can turn their choice of W-quad, Human1M-duoV3, CardioMetaboChip, tracks on and off and then save this selection in either the HumanOmni1-Quad, HumanHap650, HumanHap550, browser session or their personal accounts and then HumanOmni2.5 and Human610_Quad), and also quickly return to the same layout at a later time. These indicate those variants curated by UniProt (2). configurations can also be grouped into sets (e.g. to For all species, we calculate the effect of each variant combine a set of favourite variation tracks with a set of allele on overlapping Ensembl transcripts and whether the gene tracks) for even quicker reconfiguration of images. variant falls within an Ensembl regulatory feature, TF We have started to refresh the look and feel of the binding motif or a high information position within the website. For example, our icon set was previously motif. Our consequence annotation now uses defined created from various sources and has now been replaced Sequence Ontology (SO) terms (30) for all descriptions, with a single matching set. We have adapted the layout which enable querying of ontological relationships in and colour scheme for increased readability, and we are BioMart. More detailed consequence information is also continuing the process of replacing text-heavy pages with provided for SNPs and in-dels in specific genomic loca- simpler, more user-friendly layouts where appropriate. tions such as splice sites. These SO terms have also been Finally, major projects nearing completion and adopted by both the UCSC genome browser and ICGC scheduled for release by the end of 2012 include a Javascript-based scrollable genome browser called providing a standard to enable easy comparison of vari- Genoverse that will be incorporated into our location ation annotation. displays for Ensembl release 69 (October 2012) and Other resources supporting human variation include support for UCSC-style datahubs, which can contain calculated linkage disequilibrium values and tag SNPs, sets of preconfigured tracks or a user-supplied collection in addition to SIFT (31) and PolyPhen (32) predictions of remote resources. Additional work underway includes a for amino acid changes. This year we have switched to top-to-bottom rewrite of our BLAST/BLAT search using using the Ensembl comparative genomics pipeline to the Ensembl eHive job management system supporting a provide the ancestral alleles of SNPs and short deletions new web frontend, which will be tested on our beta site for human, orang-utan, chimpanzee and macaque (previ- (http://beta.ensembl.org) before rolling out into a major ously this was imported from dbSNP). We have also ex- Ensembl release in 2013. tensively improved our quality control (QC) procedures, which leverage the eHive software and have been extended Regulation to include structural variations. As a result of our effort to provide the most useful During the past year, we have significantly updated and possible summaries of large data sets to our users, we increased the amount of data available from the Ensembl have added new tracks for 1000 Genomes Project regulation database. As of Ensembl release 69 (October common variants and also tracks for each global 1000 2012), there are 532 ChIP-seq and DNase-seq data sets Genomes population. Additionally, appropriate pheno- from 13 human and five mouse cell lines. In total, these type data have been collected into a dedicated section on data sets represent information about the genomic loca- the Ensembl gene pages. Finally, the documentation tions of 49 different histone modification types and the section of the website has also been extended and binding regions of 113 different TFs. Forty of these TFs improved for all areas of Ensembl Variation especially have binding matrices available through the JASPAR for the Variant Effect Predictor (VEP), SO consequences, database (33), and we have incorporated these motif data QC pipeline and API diagrams. as positions of high probability TF-binding sites (5% False D52 Nucleic Acids Research, 2013, Vol. 41, Database issue Discovery Rate) within the binding regions. We have also of new taxonomic groups. These species define additional created a dedicated experimental summary page providing branching points in the phylogenetic trees, enable splitting information on individual experimental details and long branches and provide us with more taxonomic power summary metadata, such as references to the raw sequences to better resolve the gene trees. Further information on the reads available in the European Nucleotide Archive (34). evolution of the gene families is now provided by supple- The data underlying the Ensembl Regulatory Build cur- menting our phylogenetic analysis with a calculated as- rently include experiments in 13 cell lines. Regulatory Build sessment on the possible expansions and contractions in coverage has increased by 15% in the past year and now each family using the CAFE tool (39). annotates 270 Mb of the human genome in 518 020 regula- Our data model for gene trees has been modified to tory features. In Ensembl release 65 (December 2011), we handle both protein and ncRNA gene trees. During that introduced the combined Segway (35) and ChromHMM process, we also improved our support for protein (36) segmentation analyses developed for ENCODE (9), super-trees, which are used in the resolution of very which classifies the genome into regions based on 12 large protein families. These are split in sub-families, specific assays to obtain a single-track summary of the and the super-protein tree represents the relationship functional architecture of the human genome. The segmen- between these sub-families. We have developed a better tation tracks are currently available for six human cell lines: identification and annotation of split genes that usually GM12878, K562, H1-hESC, HepG2, HeLa-S3 and arise because of assembly errors (40). In our current im- HUVEC. The segmentation tracks are displayed with plementation, the enhanced gene tree pipeline (41) detects specific views available from the ‘Regulation’ configuration gene split events after building the protein multiple align- in the Ensembl browser (Figure 2). ment, and the resulting nodes of the tree can be annotated The Ensembl Regulation database and web views as gene split events when they relate to partial proteins continue to provide various other data resources including that could be concatenated to form a full gene. the following: mapping of probe sets for all the common microarray platforms, DNA methylation from various Ensembl tools and software projects including ENCODE, high profile externally During the past year, we have made significant improve- curated data sets such as cisRED motifs (37) and an ment to the Ensembl VEP (42) and launched a beta im- updated VISTA enhancer set (38). plementation of a new Ensembl REST API. The VEP provides comprehensive analysis of SNP, in-del or struc- Comparative genomics tural variation data including reports of which gene, tran- New species added in the past year such as coelacanth and script, protein or regulatory region overlap the variants of lamprey have provided our gene trees with representatives interest and if there is any change in amino acid sequence. Figure 2. Combined Segway and ChromHMM segmentation analyses within Ensembl in the region around the SLC18B1 gene on human chromo- some 6. The combination process results in seven annotated segments: CTCF enriched, Predicted Weak Enhancer/Cis-reg element, Predicted Transcribed Region, Predicted Enhancer, Predicated Promoter Flank, Predicted Repressed/Low Activity or Predicted Promoter with TSS. Six of the seven segment types are shown with variability in predicted enhancer activity between the assayed cell lines. Figure based on http://e68.ensembl. org/Homo_sapiens/Location/View?r=6:133088392-133123741. Nucleic Acids Research, 2013, Vol. 41, Database issue D53 It also includes information about SIFT and PolyPHEN dataset has been added containing data from COSMIC predictions in human, protein domains, exon/intron (28). The ability to search multiple chromosomal regions numbers, minor allele frequencies and other information. at once has been added to the Ensembl Regulation mart. The VEP works with many different file formats and can In addition to this, users can query human regulatory seg- in fact convert variant positions between different coord- mentation features using the newly added regulatory segments filter section and attribute page. inate systems (Ensembl, RefSeq, LRG and HGVS). We have also written plugins to report on degree of conserva- User training and support tion, presence of the variant in an LOVD database in a Locus Specific Database (LSDB) using the Leiden Open Ensembl supports new and existing users in a variety of Variation Database (LOVD) software (43) and other ways from a strong and increasing on-line presence to capabilities. Our VEP plugins are present in the direct face-to-face training at universities and other insti- ensembl-variation github repository (https://github.com/ tutions worldwide. This year, we held one-day workshops ensembl-variation/VEP_plugins), and we encourage users on five continents and launched new virtual initiatives to share their own plugins. available to all including those further afield or without The REST API web service was released as a beta ap- the means to host a one-day workshop. plication this year at http://beta.rest.ensembl.org. We provide extensive free and user-driven tutorials via Although we have a fully supported Perl API to all of the Ensembl YouTube (http://www.youtube.com/user/ the Ensembl data (44), the REST API addresses those EnsemblHelpdesk) and YouKu (http://i.youku.com/u/ users who wish to access Ensembl data in a language- id_UMzM1NjkzMTI0) channels and e-learning course agnostic manner. The web service is built using the Perl (http://www.ebi.ac.uk/training/online/course/ensembl- web framework Catalyst, Catalyst::Action::REST and our browsing-chordate-genomes). The Ensembl YouTube existing Perl API providing a rapid development environ- channel has >165 subscribers and >91 000 video views, ment and lowering the cost of creating new endpoints. now hosts >20 videos including navigation ‘how-to’ Output is a combination of bioinformatics and program- guides. This year, we have added more advanced videos matically relevant formats such as FASTA and JSON. We covering subjects such as patches and haplotypes on the provide access to sequences, assembly mapping, homo- human assembly, API installation and how RNA-seq data logues and integration of the VEP with support for are used in the genebuild. In 2012, the top 20 countries genomic features. The REST service, like all Ensembl accessing our on-line training reflect a worldwide audience software, is free to download from our CVS server from the USA, Europe, India, Japan, Australia, Pakistan, allowing users to deploy over their local Ensembl Taiwan, Mexico, South Korea and Brazil, and our most databases. popular videos have been viewed hundreds or thousands of times. Data access and data mining We communicate more informally and highlight updates and new features using the Ensembl blog Each Ensembl release provides a full rebuild of seven (http://www.ensembl.info/), Facebook page (http://www. BioMart (45,46) databases. Four of these BioMart data- facebook.com/Ensembl.org) and Twitter account (http:// bases (Ensembl Gene, Ensembl Variation, Ensembl twitter.com/ensembl). Our Helpdesk (helpdesk@ Regulation and VEGA) are visible on the Ensembl ensembl.org) continues to provide email support for BioMart interface, and the remaining three BioMart data- >100 questions monthly, and we are exploring webinars bases are hidden from view but are accessed through fed- as a vehicle for more interactive long-distance learning eration with visible BioMart databases to provide and plan to offer more of these events in 2013. ontology, sequence and genomic feature data. Performing a complete rebuild each release ensures the availability most up to date integrated data from across ACKNOWLEDGEMENTS the Ensembl project. Users can access these data via the The authors are consistently grateful to their users and MartView (web interface) and MartService (BioMart Perl especially to those who take the time to contact us API, DAS server, SOAP, REST, BioConductor biomaRt through our mailing lists, blog and other avenues. They package). acknowledge those researchers, organizations and Each Ensembl BioMart release includes the addition of large-scale projects that have provided data to Ensembl any new species, updated assemblies, updates to the before publication under the understandings of the Fort germline and somatic variation and structural variation Lauderdale meeting discussing Community Resource data sets as well as updates to the regulation data. One Projects and the Toronto meeting on pre-publication can now obtain our SIFT and PolyPhen predictions and data sharing. scores from the Ensembl variation BioMart and from the variation ‘filter’ and ‘attribute’ sections of the Ensembl gene BioMart. It is also possible to select specific mouse FUNDING strain information from the mouse structural variation data set, and one can filter on the source and study acces- The Wellcome Trust provides majority funding for the sion of interest in the structural variation data sets avail- Ensembl project [WT062023 and WT079643] with add- able for cow, zebrafish, horse, human, mouse and itional funding from the National Human Genome macaque. A new human somatic structural variation Research Institute [U01HG004695, U54HG004563 and D54 Nucleic Acids Research, 2013, Vol. 41, Database issue 12. Severin,J., Beal,K., Vilella,A.J., Fitzgerald,S., Schuster,M., U41HG006104] the BBSRC [BB/I025506/1], and the Gordon,L., Ureta-Vidal,A., Flicek,P. and Herrero,J. (2010) European Molecular Biology Laboratory. Additional eHive: an artificial intelligence workflow system for genomic support for specific project components as specified: analysis. BMC Bioinformatics, 11, 240. Funded by the European Commission under SLING, 13. Flicek,P., Amode,M.R., Barrell,D., Beal,K., Brent,S., Carvalho- grant agreement number 226073 (Integrating Activity) Silva,D., Clapham,P., Coates,G., Fairley,S. et al. (2012) Ensembl 2012. Nucleic Acids Res., 40, D84–D90. within Research Infrastructures of the FP7 Capacities 14. Wilming,L.G., Gilbert,J.G., Howe,K., Trevanion,S., Hubbard,T. Specific Programme; The research leading to these and Harrow,J.L. (2008) The vertebrate genome annotation (Vega) results has received funding from the European database. Nucleic Acids Res., 36, D753–D760. Community’s Seventh Framework Programme (FP7/ 15. Harrow,J., Denoeud,F., Frankish,A., Reymond,A., Chen,C.K., 2007-2013) under grant agreement n 222664. Chrast,J., Lagarde,J., Gilbert,J.G., Storey,R. et al. (2006) GENCODE: producing a reference annotation for ENCODE. (‘‘Quantomics’’). This Publication reflects only the Genome Biol., 7(Suppl.1), S4.1–S4.9. author’s views and the European Community is not 16. Harte,R.A., Farrell,C.M., Loveland,J.E., Suner,M.M., Wilming,L., liable for any use that may be made of the information Aken,B., Barrell,D., Frankish,A., Wallin,C. et al. (2012) Tracking contained herein; The research leading to these results has and coordinating an international curation effort for the CCDS Project. Database (Oxford), 2012, bas008. received funding from the European Community’s 17. Church,D.M., Schneider,V.A., Graves,T., Auger,K., Seventh Framework Programme (FP7/2007-2013) under Cunningham,F., Bouk,N., Chen,H.C., Agarwala,R., grant agreement number 200754 – the GEN2PHEN McLaren,W.M. et al. (2011) Modernizing reference genome project; The research leading to these results has assemblies. PLoS Biol., 9, e1001091. received funding from the European Community’s 18. Brawand,D., Soumillon,M., Necsulea,A., Julien,P., Csa´ rdi,G., Harrigan,P., Weier,M., Liechti,A., Aximu-Petri,A. et al. (2011) Seventh Framework Programme (FP7/ 2007-2013) under The evolution of gene expression levels in mammalian organs. the grant agreement n 223210 CISSTEM; The research Nature, 478, 343–348. leading to these results has received funding from the 19. Murchison,E.P., Schulz-Trieglaff,O.B., Ning,Z., Alexandrov,L.B., European Union’s Seventh Framework Programme Bauer,M.J., Fu,B., Hims,M., Ding,Z., Ivakhno,S. et al. (2012) (FP7/2007-2013) under grant agreement n 282510 – Genome sequencing and analysis of the tasmanian devil and its transmissible cancer. Cell, 148, 780–791. BLUEPRINT. Funding for open access charge: The 20. Collins,J.E., White,S., Searle,S.M. and Stemple,D.L. (2012) Wellcome Trust. Incorporating RNA-seq data into the zebrafish Ensembl genebuild. Genome Res., 22, 2067–2078. Conflict of interest statement. None declared. 21. Curwen,V., Eyras,E., Andrews,T.D., Clarke,L., Mongin,E., Searle,S.M. and Clamp,M. (2004) The Ensembl automatic gene annotation system. Genome Res., 14, 942–950. REFERENCES 22. Chen,Y., Cunningham,F., Rios,D., McLaren,W.M., Smith,J., Pritchard,B., Spudich,G.M., Brent,S., Kulesha,E. et al. (2010) 1. Seal,R.L., Gordon,S.M., Lush,M.J., Wright,M.W. and Ensembl variation resources. BMC Genomics, 11, 293. Bruford,E.A. (2011) genenames.org: the HGNC resources in 2011. 23. Foelo,M.L. and Sherry,S.T. (2007) NCBI dbSNP Database: Nucleic Acids Res., 39, D514–D519. content and searching. In: Weiner,M.P., Gabriel,S.B. and 2. UniProt Consortium. (2012) Reorganizing the protein space at Stephens,J.C. (eds), Genetic Variation: A Laboratory Manual. the Universal Protein Resource (UniProt). Nucleic Acids Res., 40, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, D71–D75. pp. 41–61. 3. Rath,A., Olry,A., Dhombres,F., Brandt,M.M., Urbero,B. and 24. Church,D.M., Lappalainen,I., Sneddon,T.P., Hinton,J., Ayme,S. (2012) Representation of rare diseases in health Maguire,M., Lopez,J., Garner,J., Paschall,J., Dicuccio,M. et al. information systems: the Orphanet approach to serve a wide (2010) Public data archives for genomic structural variation. Nat. range of end users. Hum. Mutat., 33, 803–808. Genet., 42, 813–814. 4. Amberger,J., Bocchini,C. and Hamosh,A. (2011) A new face and 25. Tennessen,J.A., Bigham,A.W., O’Connor,T.D., Fu,W., new challenges for Online Mendelian Inheritance in Man Kenny,E.E., Gravel,S., McGee,S., Do,R., Liu,X. et al. (2012) (OMIM( )). Hum. Mutat., 32, 564–567. Evolution and functional impact of rare coding variation from 5. Pruitt,K.D., Tatusova,T., Brown,G.R. and Maglott,D.R. (2012) deep sequencing of human exomes. Science, 337, 64–69. NCBI Reference Sequences (RefSeq): current status, new features 26. Stenson,P.D., Ball,E.V., Mort,M., Phillips,A.D., Shiel,J.A., and genome annotation policy. Nucleic Acids Res., 40, Thomas,N.S., Abeysinghe,S., Krawczak,M. and Cooper,D.N. D130–D135. (2003) Human gene mutation database (HGMD): 2003 update. 6. Dreszer,T.R., Karolchik,D., Zweig,A.S., Hinrichs,A.S., Hum. Mutat., 21, 577–581. Raney,B.J., Kuhn,R.M., Meyer,L.R., Wong,M., Sloan,C.A. et al. 27. Dalgleish,R., Flicek,P., Cunningham,F., Astashyn,A., Tully,R.E., (2012) The UCSC Genome Browser database: extensions and Proctor,G., Chen,Y., McLaren,W.M., Larsson,P. et al. (2010) updates 2011. Nucleic Acids Res., 40, D918–D923. Locus Reference Genomic sequences: an improved basis for 7. Velankar,S., Alhroub,Y., Best,C., Caboche,S., Conroy,M.J., describing human DNA variants. Genome Med., 2, 24. Dana,J.M., Fernandez Montecelo,M.A., van Ginkel,G., 28. Forbes,S.A., Bindal,N., Bamford,S., Cole,C., Kok,C.Y., Beare,D., Golovin,A. et al. (2012) PDBe: Protein Data Bank in Europe. Jia,M., Shepherd,R., Leung,K. et al. (2011) COSMIC: mining Nucleic Acids Res., 40, D445–D452. complete cancer genomes in the Catalogue of Somatic Mutations 8. 1000 Genomes Project Consortium. (2010) A map of human in Cancer. Nucleic Acids Res., 39, D945–D950. genome variation from population-scale sequencing. Nature, 467, 29. Hindorff,L.A., Sethupathy,P., Junkins,H.A., Ramos,E.M., 1061–1073. Mehta,J.P., Collins,F.S. and Manolio,T.A. (2009) Potential 9. ENCODE Project Consortium. (2012) An integrated encyclopedia etiologic and functional implications of genome-wide association of DNA elements in the human genome. Nature, 489, 57–74. loci for human diseases and traits. Proc. Natl Acad. Sci. USA, 10. International Cancer Genome Consortium. (2010) International 106, 9362–9367. network of cancer genome projects. Nature, 464, 993–998. 30. Eilbeck,K., Lewis,S.E., Mungall,C.J., Yandell,M., Stein,L., 11. Adams,D., Altucci,L., Antonarakis,S.E., Ballesteros,J., Beck,S., Durbin,R. and Ashburner,M. (2005) The sequence ontology: a Bird,A., Bock,C., Boehm,B., Campo,E. et al. (2012) tool for the unification of genome annotations. Genome Biol., 6, BLUEPRINT to decode the epigenetic signature written in blood. R44. Nat. Biotechnol., 30, 224–226. Nucleic Acids Research, 2013, Vol. 41, Database issue D55 31. Kumar,P., Henikoff,S. and Ng,P.C. (2009) Predicting the effects 39. De Bie,T., Cristianini,N., Demuth,J.P. and Hahn,M.W. (2006) of coding non-synonymous variants on protein function using the CAFE: a computational tool for the study of gene family SIFT algorithm. Nat. Protoc., 4, 1073–1081. evolution. Bioinformatics, 22, 1269–1271. 32. Adzhubei,I.A., Schmidt,S., Peshkin,L., Ramensky,V.E., 40. Dessimoz,C., Zoller,S., Manousaki,T., Qiu,H., Meyer,A. and Gerasimova,A., Bork,P., Kondrashov,A.S. and Sunyaev,S.R. Kuraku,S. (2011) Comparative genomics approach to detecting (2010) A method and server for predicting damaging missense split-coding regions in a low-coverage genome: lessons from the mutations. Nat. Methods, 7, 248–249. chimaera Callorhinchus milii (Holocephali, Chondrichthyes). Brief 33. Portales-Casamar,E., Thongjuea,S., Kwon,A.T., Arenillas,D., Bioinform., 12, 474–484. Zhao,X., Valen,E., Yusuf,D., Lenhard,B., Wasserman,W.W. and 41. Vilella,A.J., Severin,J., Ureta-Vidal,A., Heng,L., Durbin,R. and Sandelin,A. (2010) JASPAR 2010: the greatly expanded Birney,E. (2009) EnsemblCompara GeneTrees: complete, open-access database of transcription factor binding profiles. duplication-aware phylogenetic trees in vertebrates. Genome Res., 19, Nucleic Acids Res., 38, D105–D110. 327–335. 34. Amid,C., Birney,E., Bower,L., Cerden˜ o-Ta´ rraga,A., Cheng,Y., 42. McLaren,W., Pritchard,B., Rios,D., Chen,Y., Flicek,P. and Cleland,I., Faruque,N., Gibson,R., Goodgame,N. et al. (2012) Cunningham,F. (2010) Deriving the consequences of genomic Major submissions tool developments at the European Nucleotide variants with the Ensembl API and SNP Effect Predictor. Archive. Nucleic Acids Res., 40, D43–D47. Bioinformatics, 26, 2069–2070. 35. Hoffman,M.M., Buske,O.J., Wang,J., Weng,Z., Bilmes,J.A. and 43. Fokkema,I.F., Taschner,P.E., Schaafsma,G.C., Celli,J., Laros,J.F. Noble,W.S. (2012) Unsupervised pattern discovery in human and den Dunnen,J.T. (2011) LOVD v.2.0: the next generation in chromatin structure through genomic segmentation. Nat. Methods, gene variant databases. Hum. Mutat., 32, 557–563. 9, 473–476. 44. Stabenau,A., McVicker,G., Melsopp,C., Proctor,G., Clamp,M. 36. Ernst,J. and Kellis,M. (2012) ChromHMM: automating chromatin- and Birney,E. (2004) The Ensembl core software libraries. state discovery and characterization. Nat. Methods, 9, 215–216. Genome Res., 14, 929–933. 37. Robertson,G., Bilenky,M., Lin,K., He,A., Yuen,W., Dagpinar,M., 45. Smedley,D., Haider,S., Ballester,B., Holland,R., London,D., Varhol,R., Teague,K., Griffith,O.L. et al. (2006) cisRED: a Thorisson,G. and Kasprzyk,A. (2009) BioMart–biological queries database system for genome-scale computational discovery of made easy. BMC Genomics, 10, 22. regulatory elements. Nucleic Acids Res., 34, D68–D73. 46. Kinsella,R.J., Ka¨ ha¨ ri,A., Haider,S., Zamora,J., Proctor,G., 38. Visel,A., Minovitsky,S., Dubchak,I. and Pennacchio,L.A. (2007) Spudich,G., Almeida-King,J., Staines,D., Derwent,P. et al. (2011) VISTA Enhancer Browser–a database of tissue-specific human Ensembl BioMarts: a hub for data retrieval across taxonomic enhancers. Nucleic Acids Res., 35, D88–D92. space. Database, 2011, bar030.

Journal

Nucleic Acids ResearchOxford University Press

Published: Jan 30, 2013

There are no references for this article.