G2S: a web-service for annotating genomic variants on 3D protein structures

G2S: a web-service for annotating genomic variants on 3D protein structures Abstract Motivation Accurately mapping and annotating genomic locations on 3D protein structures is a key step in structure-based analysis of genomic variants detected by recent large-scale sequencing efforts. There are several mapping resources currently available, but none of them provides a web API (Application Programming Interface) that supports programmatic access. Results We present G2S, a real-time web API that provides automated mapping of genomic variants on 3D protein structures. G2S can align genomic locations of variants, protein locations, or protein sequences to protein structures and retrieve the mapped residues from structures. G2S API uses REST-inspired design and it can be used by various clients such as web browsers, command terminals, programming languages and other bioinformatics tools for bringing 3D structures into genomic variant analysis. Availability and implementation The webserver and source codes are freely available at https://g2s.genomenexus.org. Contact g2s@genomenexus.org Supplementary information Supplementary data are available at Bioinformatics online. 1 Introduction With extensive recent large-scale genome sequencing projects such as The Cancer Genome Atlas (TCGA) (Weinstein et al., 2013), and 1000 Genomes Project (Auton et al., 2015), a great number of germline and somatic genomic variants are being detected. Protein structure changes due to genome variation in protein-coding regions are interesting for genetic marker discovery and interpretation of disease mechanisms. Mapping genomic variants onto the specific 3D protein structures is the first critical step to analyze the variants in the context of protein structures. Several resources have been developed to address the need of mapping protein positions to protein structures. SIFTS (Velankar et al., 2013) maps UniProt (Bateman et al., 2017) entries to Protein Data Bank (PDB) (Berman et al., 2000) entries and provides XML and flat files for download. PDB utilizes SIFTS and provided a user web interface for accessing the mapping (http://www.rcsb.org/pdb/chromosome.do) (Berman et al., 2000; Prlic et al., 2016). G23D (Solomon et al., 2016) also provides a user web interface for mapping genomic variants onto 3D protein structures. However, none of them provides a web API for programmatic access of the sequence alignments and residue mapping. Here, we present G2S, a web API that supports programmatic mapping and annotation of genomic variants on 3D protein structures. The following functionalities were implemented: (i) retrieving protein structure chains aligned to a primary protein sequence (a UniProt/Ensembl entry or a user-defined sequence); (ii) retrieving mapping between genomic positions and structural positions; and (iii) retrieving mapping between amino acid positions and structural positions. G2S provides a RESTful API. The pre-computed alignments are automatically updated weekly to keep up to date with the PDB structure archive. G2S API and source codes are publicly available at https://g2s.genomenexus.org/. 2 G2S pipeline The G2S backend pipeline collects and aligns UniProt and Ensembl protein sequences along with carefully parsed PDB sequences using BioJava (Holland et al., 2008). Raw protein sequences were retrieved from the atom records of PDB files directly to avoid flaws and inconsistences from SEQRES and DBREF records in PDB files. The alignments of the protein sequences against the PDB sequences by BLASTP (Altschul, 1997) were stored in a relational database. The pipeline updates the pre-computed alignments weekly as new PDB structures added to RCSB PDB. The workflow and architecture of G2S API are shown in Supplementary Figure S1. 3 G2S API The G2S API accepts UniProt names, UniProt Isoform names, Human Ensembl names, genomic positions (GRCh37 and 38) and user-defined protein sequences. G2S API returns high confidence pre-calculated mapping of the protein/residue aligned to the protein structures. For user-provided protein sequences, G2S API calculates alignments against structure sequences on the fly. Several alignment quality metrics as E-value and bit-score can be used as parameters in the API request to refine alignment results. G2S API is a RESTful service and all API endpoints provide fast real-time responses in JSON format. The details of API endpoints, use cases and additional design details are provided on the web site (https://g2s.genomenexus.org) and in Supplementary Material. 4 Use cases: cBioPortal and 3DHotspots The cBioPortal for Cancer Genomics (http://cbioportal.org), a widely used resource for studying cancer genomics (Cerami et al., 2012; Gao et al., 2013), utilizes the G2S API to retrieve updated sequence-structure alignments and residue mapping to visualize cancer mutations in protein structures (see Fig. 1 for an example). The G2S API is also being used for detecting 3D mutational hotspots in cancer https://github.com/knowledgesystems/mutationhotspots (Gao et al., 2017). Fig. 1. View largeDownload slide Protein structure visualization of SPOP mutations in the MSK-IMPACT study in the cBioPortal. The mutations were plotted along the primary sequence (up left); alignments from primary sequence to PDB chains were plotted underneath (bottom left); the protein structure of the selected alignment was displayed with mutations highlighted in the structure (right) Fig. 1. View largeDownload slide Protein structure visualization of SPOP mutations in the MSK-IMPACT study in the cBioPortal. The mutations were plotted along the primary sequence (up left); alignments from primary sequence to PDB chains were plotted underneath (bottom left); the protein structure of the selected alignment was displayed with mutations highlighted in the structure (right) 5 Discussion The G2S API provides an auto-updated real-time resource for retrieving residue-level sequence-structure alignments. It fills the critical gap that no existing resources provide programmatic access to up-to-date protein structure alignments and residue mapping. The API was designed with high performance and real-time access in mind so that third-party tools such as cBioPortal can achieve smooth user experience when mapping their variants against up-to-date protein structures, and supporting visualization and analysis of variants in the context of protein structures. Funding This work has been supported by Google Summer of Code 2016 [JW], National Institutes of Health Grant (R33-GM078601 and R01-GM100701) [DX], the Marie-Josée and Henry R. Kravis Center for Molecular Oncology [NS, JG], a National Cancer Institute Cancer Center Core Grant (P30-CA008748), the Fund for Innovation in Cancer Informatics from the Brown Performance Group (www.BrownPerformance.com/ici) [NS, JG] and the Robertson Foundation [NS]. Conflict of Interest: none declared. References Altschul S.F. ( 1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res ., 25, 3389– 3402. Google Scholar CrossRef Search ADS PubMed  Auton A. et al.  . ( 2015) A global reference for human genetic variation. Nature , 526, 68– 74. Google Scholar CrossRef Search ADS PubMed  Bateman A. et al.  . ( 2017) UniProt: the universal protein knowledgebase. Nucleic Acids Res ., 45, D158– D169. Google Scholar CrossRef Search ADS PubMed  Berman H.M. et al.  . ( 2000) The Protein Data Bank. Nucleic Acids Res ., 28, 235– 242. Google Scholar CrossRef Search ADS PubMed  Cerami E. et al.  . ( 2012) The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov ., 2, 401– 404. Google Scholar CrossRef Search ADS PubMed  Gao J. et al.  . ( 2013) Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci. Signal ., 6, pl1. Google Scholar CrossRef Search ADS PubMed  Gao J. et al.  . ( 2017) 3D clusters of somatic mutations in cancer reveal numerous rare mutations as functional targets. Genome Med ., 9, 4. Google Scholar CrossRef Search ADS PubMed  Holland R.C. et al.  . ( 2008) BioJava: an open-source framework for bioinformatics. Bioinformatics (Oxford, England) , 24, 2096– 2097. Google Scholar CrossRef Search ADS PubMed  Prlic A. et al.  . ( 2016) Integrating genomic information with protein sequence and 3D atomic level structure at the RCSB protein data bank. Bioinformatics (Oxford, England) , 32, 3833– 3835. Google Scholar CrossRef Search ADS PubMed  Solomon O. et al.  . ( 2016) G23D: online tool for mapping and visualization of genomic variants on 3D protein structures. BMC Genomics , 17, 681. Google Scholar CrossRef Search ADS PubMed  Velankar S. et al.  . ( 2013) SIFTS: Structure Integration with Function, Taxonomy and Sequences resource. Nucleic Acids Res ., 41, D483– D489. Google Scholar CrossRef Search ADS PubMed  Weinstein J.N. et al.  . ( 2013) The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet ., 45, 1113– 1120. Google Scholar CrossRef Search ADS PubMed  © The Author(s) 2018. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press

G2S: a web-service for annotating genomic variants on 3D protein structures

Loading next page...
 
/lp/ou_press/g2s-a-web-service-for-annotating-genomic-variants-on-3d-protein-rHH9f50ICs
Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
ISSN
1367-4803
eISSN
1460-2059
D.O.I.
10.1093/bioinformatics/bty047
Publisher site
See Article on Publisher Site

Abstract

Abstract Motivation Accurately mapping and annotating genomic locations on 3D protein structures is a key step in structure-based analysis of genomic variants detected by recent large-scale sequencing efforts. There are several mapping resources currently available, but none of them provides a web API (Application Programming Interface) that supports programmatic access. Results We present G2S, a real-time web API that provides automated mapping of genomic variants on 3D protein structures. G2S can align genomic locations of variants, protein locations, or protein sequences to protein structures and retrieve the mapped residues from structures. G2S API uses REST-inspired design and it can be used by various clients such as web browsers, command terminals, programming languages and other bioinformatics tools for bringing 3D structures into genomic variant analysis. Availability and implementation The webserver and source codes are freely available at https://g2s.genomenexus.org. Contact g2s@genomenexus.org Supplementary information Supplementary data are available at Bioinformatics online. 1 Introduction With extensive recent large-scale genome sequencing projects such as The Cancer Genome Atlas (TCGA) (Weinstein et al., 2013), and 1000 Genomes Project (Auton et al., 2015), a great number of germline and somatic genomic variants are being detected. Protein structure changes due to genome variation in protein-coding regions are interesting for genetic marker discovery and interpretation of disease mechanisms. Mapping genomic variants onto the specific 3D protein structures is the first critical step to analyze the variants in the context of protein structures. Several resources have been developed to address the need of mapping protein positions to protein structures. SIFTS (Velankar et al., 2013) maps UniProt (Bateman et al., 2017) entries to Protein Data Bank (PDB) (Berman et al., 2000) entries and provides XML and flat files for download. PDB utilizes SIFTS and provided a user web interface for accessing the mapping (http://www.rcsb.org/pdb/chromosome.do) (Berman et al., 2000; Prlic et al., 2016). G23D (Solomon et al., 2016) also provides a user web interface for mapping genomic variants onto 3D protein structures. However, none of them provides a web API for programmatic access of the sequence alignments and residue mapping. Here, we present G2S, a web API that supports programmatic mapping and annotation of genomic variants on 3D protein structures. The following functionalities were implemented: (i) retrieving protein structure chains aligned to a primary protein sequence (a UniProt/Ensembl entry or a user-defined sequence); (ii) retrieving mapping between genomic positions and structural positions; and (iii) retrieving mapping between amino acid positions and structural positions. G2S provides a RESTful API. The pre-computed alignments are automatically updated weekly to keep up to date with the PDB structure archive. G2S API and source codes are publicly available at https://g2s.genomenexus.org/. 2 G2S pipeline The G2S backend pipeline collects and aligns UniProt and Ensembl protein sequences along with carefully parsed PDB sequences using BioJava (Holland et al., 2008). Raw protein sequences were retrieved from the atom records of PDB files directly to avoid flaws and inconsistences from SEQRES and DBREF records in PDB files. The alignments of the protein sequences against the PDB sequences by BLASTP (Altschul, 1997) were stored in a relational database. The pipeline updates the pre-computed alignments weekly as new PDB structures added to RCSB PDB. The workflow and architecture of G2S API are shown in Supplementary Figure S1. 3 G2S API The G2S API accepts UniProt names, UniProt Isoform names, Human Ensembl names, genomic positions (GRCh37 and 38) and user-defined protein sequences. G2S API returns high confidence pre-calculated mapping of the protein/residue aligned to the protein structures. For user-provided protein sequences, G2S API calculates alignments against structure sequences on the fly. Several alignment quality metrics as E-value and bit-score can be used as parameters in the API request to refine alignment results. G2S API is a RESTful service and all API endpoints provide fast real-time responses in JSON format. The details of API endpoints, use cases and additional design details are provided on the web site (https://g2s.genomenexus.org) and in Supplementary Material. 4 Use cases: cBioPortal and 3DHotspots The cBioPortal for Cancer Genomics (http://cbioportal.org), a widely used resource for studying cancer genomics (Cerami et al., 2012; Gao et al., 2013), utilizes the G2S API to retrieve updated sequence-structure alignments and residue mapping to visualize cancer mutations in protein structures (see Fig. 1 for an example). The G2S API is also being used for detecting 3D mutational hotspots in cancer https://github.com/knowledgesystems/mutationhotspots (Gao et al., 2017). Fig. 1. View largeDownload slide Protein structure visualization of SPOP mutations in the MSK-IMPACT study in the cBioPortal. The mutations were plotted along the primary sequence (up left); alignments from primary sequence to PDB chains were plotted underneath (bottom left); the protein structure of the selected alignment was displayed with mutations highlighted in the structure (right) Fig. 1. View largeDownload slide Protein structure visualization of SPOP mutations in the MSK-IMPACT study in the cBioPortal. The mutations were plotted along the primary sequence (up left); alignments from primary sequence to PDB chains were plotted underneath (bottom left); the protein structure of the selected alignment was displayed with mutations highlighted in the structure (right) 5 Discussion The G2S API provides an auto-updated real-time resource for retrieving residue-level sequence-structure alignments. It fills the critical gap that no existing resources provide programmatic access to up-to-date protein structure alignments and residue mapping. The API was designed with high performance and real-time access in mind so that third-party tools such as cBioPortal can achieve smooth user experience when mapping their variants against up-to-date protein structures, and supporting visualization and analysis of variants in the context of protein structures. Funding This work has been supported by Google Summer of Code 2016 [JW], National Institutes of Health Grant (R33-GM078601 and R01-GM100701) [DX], the Marie-Josée and Henry R. Kravis Center for Molecular Oncology [NS, JG], a National Cancer Institute Cancer Center Core Grant (P30-CA008748), the Fund for Innovation in Cancer Informatics from the Brown Performance Group (www.BrownPerformance.com/ici) [NS, JG] and the Robertson Foundation [NS]. Conflict of Interest: none declared. References Altschul S.F. ( 1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res ., 25, 3389– 3402. Google Scholar CrossRef Search ADS PubMed  Auton A. et al.  . ( 2015) A global reference for human genetic variation. Nature , 526, 68– 74. Google Scholar CrossRef Search ADS PubMed  Bateman A. et al.  . ( 2017) UniProt: the universal protein knowledgebase. Nucleic Acids Res ., 45, D158– D169. Google Scholar CrossRef Search ADS PubMed  Berman H.M. et al.  . ( 2000) The Protein Data Bank. Nucleic Acids Res ., 28, 235– 242. Google Scholar CrossRef Search ADS PubMed  Cerami E. et al.  . ( 2012) The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov ., 2, 401– 404. Google Scholar CrossRef Search ADS PubMed  Gao J. et al.  . ( 2013) Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci. Signal ., 6, pl1. Google Scholar CrossRef Search ADS PubMed  Gao J. et al.  . ( 2017) 3D clusters of somatic mutations in cancer reveal numerous rare mutations as functional targets. Genome Med ., 9, 4. Google Scholar CrossRef Search ADS PubMed  Holland R.C. et al.  . ( 2008) BioJava: an open-source framework for bioinformatics. Bioinformatics (Oxford, England) , 24, 2096– 2097. Google Scholar CrossRef Search ADS PubMed  Prlic A. et al.  . ( 2016) Integrating genomic information with protein sequence and 3D atomic level structure at the RCSB protein data bank. Bioinformatics (Oxford, England) , 32, 3833– 3835. Google Scholar CrossRef Search ADS PubMed  Solomon O. et al.  . ( 2016) G23D: online tool for mapping and visualization of genomic variants on 3D protein structures. BMC Genomics , 17, 681. Google Scholar CrossRef Search ADS PubMed  Velankar S. et al.  . ( 2013) SIFTS: Structure Integration with Function, Taxonomy and Sequences resource. Nucleic Acids Res ., 41, D483– D489. Google Scholar CrossRef Search ADS PubMed  Weinstein J.N. et al.  . ( 2013) The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet ., 45, 1113– 1120. Google Scholar CrossRef Search ADS PubMed  © The Author(s) 2018. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Journal

BioinformaticsOxford University Press

Published: Jan 27, 2018

There are no references for this article.

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off