Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

ProteomeVis: a web app for exploration of protein properties from structure to sequence evolution across organisms’ proteomes

ProteomeVis: a web app for exploration of protein properties from structure to sequence evolution... Motivation: Protein evolution spans time scales and its effects span the length of an organism. A web app named ProteomeVis is developed to provide a comprehensive view of protein evolution in the Saccharomyces cerevisiae and Escherichia coli proteomes. ProteomeVis interactively creates protein chain graphs, where edges between nodes represent structure and sequence similarities within user-defined ranges, to study the long time scale effects of protein structure evolution. The short time scale effects of protein sequence evolution are studied by sequence evolutionary rate (ER) correlation analyses with protein properties that span from the molecular to the organismal level. Results: We demonstrate the utility and versatility of ProteomeVis by investigating the distribution of edges per node in organismal protein chain universe graphs (oPCUGs) and putative ER determi- nants. S.cerevisiae and E.coli oPCUGs are scale-free with scaling constants of 1.79 and 1.56, respectively. Both scaling constants can be explained by a previously reported theoretical model describing protein structure evolution. Protein abundance most strongly correlates with ER among properties in ProteomeVis, with Spearman correlations of –0.49 (P-value < 10 ) and –0.46 (P-value < 10 ) for S.cerevisiae and E.coli, respectively. This result is consistent with previous reports that found protein expression to be the most important ER determinant. Availability and implementation: ProteomeVis is freely accessible at http://proteomevis.chem.har vard.edu. Contact: shakhnovich@chemistry.harvard.edu Supplementary information: Supplementary data are available at Bioinformatics online. 1 Introduction from a protein’s length and stability, to its copy number and interac- Protein evolution spans time scales, from the billions of years during tions with other proteins, and to its influence on organismal fitness which the universe of extant protein structures emerged to the frac- (Bershtein et al., 2017; Sikosek and Chan, 2014; Zeldovich tion of microseconds in which protein sequences mutate (Caetano- and Shakhnovich, 2008). Accordingly, protein evolution is studied Anolle ´s et al., 2009). Protein evolution also spans length scales, through lines of inquiry usually separated by scale. V The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com 3557 3558 R.M.Razban et al. One approach takes a long time scale view, inferring evolution- positively with ER in Escherichia coli, Saccharomyces cerevisiae ary dynamics of proteins by ordering extant proteins into a graph and Drosophila melanogaster (Bloom et al., 2006; Zhou et al., based on relationships between their structures and sequences 2008), while the number of protein–protein interaction partners (Deeds et al., 2004; Dokholyan et al., 2002; Koonin et al., 2002; (PPI degree) was reported to correlate negatively with ER in Qian et al., 2001). Dokholyan et al. (2002) constructed the graph by S.cerevisiae (Fraser et al., 2003). However, the evidence is not incon- representing protein domains as nodes and connecting pairs of nodes trovertible. In S.cerevisiae, a significant negative correlation was with edges only if their structure similarities, measured by DALI Z- found between contact density and ER (Shakhnovich, 2006), and no scores (Holm and Sander, 1993), exceed 9 and their sequence correlation was found between PPI degree and ER when carefully identities (SIDs) are less than 0.25. A Z-score of 9 was found to controlling for protein expression (Bloom and Adami, 2003). differentiate protein pairs with similar (Z-score > 9) or different By juxtaposing protein structure graphs and ER correlation anal- (Z-score < 9) structures (Dokholyan et al., 2002). A SID of 0.25 yses, insights into one time scale of protein evolution can be gained originates from numerous studies finding that two aligned protein by information from the other. For example, not only do proteins sequences sharing greater than a SID value ranging from 0.2 to 0.4, with high contact densities tend to evolve more quickly at the se- are likely to share the same structure (Chothia and Lesk, 1986; quence level than their less compact counterparts according to Zhou Chung and Subbiah, 1996; Rost, 1999; Sander and Schneider, et al. (2008), they also tend to have larger structure neighborhoods 1991). Such sequence pairs are called homologous. (Shakhnovich, 2005), defined as the number of different sequences Strikingly, the distribution of edges per node in this non- encoding a structure. This is because selection for protein stability, homologous protein domain graph, dubbed the protein domain uni- roughly approximated by contact density, biases evolution towards 1:6 verse graph (PDUG), scales as k , where k stands for degree exploring closely related structures (Gilson et al., 2017). An example (number of edges per node) and 1.6 is the scaling constant that demonstrates long time scale structural effects explaining (Dokholyan et al., 2002). Scale-free graphs were first shown to form sequence evolution constraints: proteins with related structures by a process of preferential attachment, but models of this type pre- (Lukatsky et al., 2007) or sequences (Wright et al., 2005) have a dict scaling constants greater than 2 (Albert and Baraba ´ si, 2000, statistically enhanced likelihood of forming stable PPIs. This is most 2002; Krapivsky et al., 2000). For the PDUG, an alternative model evident by the prevalence of homodimers and superfamily of dimers called Big Bang was developed, where new structures originate from in nature (Ispolatov et al., 2005). gene duplication events followed by significant divergence in se- In this work, protein chain graphs are merged with correlation quence. Simulations of the Big Bang model generated scale-free de- analyses of ER determinants into an interactive visualization tool gree distributions with scaling constants ranging from 1.4 to 1.9. called ProteomeVis. ProteomeVis’ graph-theoretical approach takes The average over 100 independent simulations yielded a scaling con- a global view of protein evolution over long time scales to capture stant of 1.6, matching that observed in nature (Dokholyan et al., structure evolution, while ER correlation analyses span relatively 2002). Further studies found PDUG sub-graphs built from proteins short time scales. ProteomeVis joins the growing list of visualization belonging to 59 prokaryotic organisms’ proteomes are also scale- tools developed to facilitate understanding of complex and interwo- free (Deeds et al., 2004; Roland and Shakhnovich, 2007). These ven experimental data (Vizcaı ´no et al., 2015). What makes sub-graphs, called organismal PDUGs (oPDUGs), have scaling con- ProteomeVis unique is the disparate data it integrates. A tool already stants ranging from 1.3 to 1.9, closely matching the extent of the Big exists that enables users to interactively create protein graphs based Bang model (Deeds et al., 2004). The Big Bang model’s successes in on sequence and structure similarities (Nepomnyachiy et al., 2015). predicting scaling constants, indicate that the universe of extant pro- However, no tool, to the best of our knowledge, includes protein tein structures emerged by a process of divergent evolution (Deeds properties from the molecular to the organismal level, as well as se- and Shakhnovich, 2010). quence and structure similarities to create protein graphs, providing A complementary approach to protein evolution takes a shorter a comprehensive overview of protein evolution. time scale view, inferring determinants of sequence evolutionary The roadmap for the remainder of the text is as follows. After rate (ER) by correlating ER with other protein properties (Zhang describing how we collect the underlying data and incorporate it in and Yang, 2015). A complex cellular environment determines the the web app (‘Materials and methods’ section), we demonstrate the fitness effects of mutations that arise randomly in genes. Diverse and utility of ProteomeVis by recapitulating and expanding upon previ- sometimes countervailing selection pressures channel the course of a ous literature results regarding protein universe graphs and ER protein’s evolution (Bershtein et al., 2017; Sikosek and Chan, 2014; determinants (Results). The Results section highlights ProteomeVis’ Zeldovich and Shakhnovich, 2008). Over the years in which organ- versatility, illustrating how it facilitates evolutionary studies of pro- isms diverge from their common ancestor, selection pressures aggre- tein structure and sequence. gate in each protein’s ER. ERs span three orders of magnitude (Zhang and Yang, 2015) and by examining which properties correl- ate with ER, previous work attempted to extract the major con- 2 Materials and methods straints of protein evolution. 2.1 Data curation Protein expression was found to be the most prominent con- To access the in-house Python scripts that manage ProteomeVis’ straint. Quantified by mRNA abundance because it was more tract- data and create the web app’s static files, please refer to https:// able to measure than protein abundance (Vogel and Marcotte, github.com/rrazban/proteomevis_scripts. ProteomeVis currently 2012), protein expression correlates negatively with ER across all contains S.cerevisiae and E.coli proteomes. These two organisms three kingdoms of life, with correlation coefficients exceeding those have the largest number of protein structures deposited in the observed for other quantities (Drummond et al., 2005, 2006; Protein Data Bank (PDB) (Berman et al., 2000), accessed in March Drummond and Wilke, 2008; Pa ´l et al., 2001; Zhang and Yang, 2015). Much effort has been expended to identify other protein 2018, per proteome size. In addition, S.cerevisiae and E.coli have a properties that correlate with ER, independent of protein expres- plethora of experimentally measured protein property data since sion. Protein length and contact density were found to correlate they are model organisms (Cooper, 2000). Future iterations of ProteomeVis: a web app for comprehensive protein evolution exploration 3559 ProteomeVis will include more proteomes, such as those from 2.1.3 Length and contact density Homo sapiens and Mus musculus. A protein’s length is taken from UniProt. The contact density is the Data are collected at the level of protein chains. Although total number of residue-residue contacts in the PDB structure file, normalized by its length, as determined by the number of residues most structure evolution studies and some sequence evolution considered in the calculation of contacts. Two residues are defined studies analyze protein domains, the advantage of chains is that to be in contact if any non-hydrogen atom of the residue is within they correspond directly to genes. Protein chain identification is 4.5 Angstroms (A) from another non-hydrogen atom of a non- straightforward, while protein domain identification is dependent adjacent residue (Bloom et al., 2006; Zhou et al., 2008). In Results on the subjective definition chosen (Ingolfsson and Yona, 2008). 3.2, we also explore defining contacts as non-adjacent residue pairs In addition, the protein chain is the smallest level at which having C (C for glycine) distances <7.5A(Shakhnovich, 2006). b a cellular-wide protein properties such as abundance, protein– The Biopython module (Cock et al., 2009) is employed to parse PDB protein interaction and dosage tolerance can be measured. For the files and calculate contact density. remainder of the text, the word ‘protein’ is used to mean ‘protein chain’. 2.1.4 Protein abundance Protein abundance per cell was measured in Ghaemmaghami et al. 2.1.1 Protein structure identification (2003) and Arike et al. (2012) for S.cerevisiae and E.coli K-12 The S.cerevisiae S288c proteome has 6049 proteins and the E.coli MG1655, respectively. Abundance data are downloaded from the K-12 proteome has 4306, manually verified by the Universal Protein Protein Abundances Across Organisms database (PaxDb) (Wang Resource (UniProt) (The UniProt Consortium, 2017) and last modi- et al., 2012). They cover 64% and 95% of the S.cerevisiae and fied on February 19, 2017 and January 5, 2018, respectively. We de- E.coli K-12 UniProt proteomes. They are the largest and most accur- velop a procedure that draws upon existing resources to match ate single-sourced datasets for their respective proteome (PaxDb S.cerevisiae S288c and E.coli K-12 sequences in UniProt with accessed in March 2018). To allow for fair comparison between S.cerevisiae and E.coli structures in the PDB. First, the Structure organisms, abundance data in ProteomeVis are presented as protein Integration with Function, Taxonomy and Sequences resource abundance per cell divided by the total protein abundance per cell (SIFTS) (Velankar et al., 2013) is employed to identify protein for the respective organism. sequences with potentially matching structures. Briefly, SIFTS assigns PDB structures to UniProt sequences with a minimum SID of 2.1.5 Protein–protein interaction 0.9 and the source organism being zero, one or two levels up the spe- PPI partners are downloaded from the IntAct database (Orchard cies level in the taxonomy tree (Velankar et al., 2013). If multiple et al., 2014) via the Bioservices Python module (Cokelaer et al., structures are found for a given sequence, the structure with the larg- 2013). IntAct has the largest number of experimentally determined est sequence coverage is chosen. If multiple structures have the larg- PPIs among PPI databases (Szklarczyk and Jensen, 2015). Only est coverage, the structure solved at the lowest resolution is chosen. interactions identified by affinity purification methods for Second, protein structures are downloaded from the PDB and their S.cerevisiae S288c and E.coli K-12 are included. Affinity purifica- tion methods capture more stable interactions, rather than transient lengths are calculated. Structures are kept if their lengths are at least interactions that can vary greatly between independent experiments 80% of that listed for protein chains in UniProt. This step filters out (Bergga ˚rd et al., 2007). those protein structures that align well with only a small portion of the protein chain, e.g. a single domain from a multidomain protein. Running through the outlined steps yields 778 S.cerevisiae and 2.1.6 Dosage tolerance Dosage tolerance was measured in Douglas et al. (2012) for 1099 E.coli proteins. ProteomeVis includes protein property data S.cerevisiae and in Kitagawa et al. (2005) for E.coli K-12. Douglas for only these proteins. Monthly updates are performed to incorpor- et al. (2012) reported dosage tolerance, called overexpression tox- ate modifications to reported proteomes in UniProt and additional icity in their paper, as a logarithmic base 2 (log2) ratio between cell structures deposited in the PDB. growth when the gene of interest is overexpressed compared to that when the gene is expressed at wild-type levels. Dosage sensitive 2.1.2 Structure similarity and sequence identity genes are defined as those with log2 ratios less than -1. Kitagawa The template modeling score (TM-score) is used as a metric for et al. (2005) reported dosage tolerance, called growth inhibitory structure similarity. TM-score and SID for all intra-organism pro- effects in their paper, as an integer value of 1, 2, or 3. One signifies tein pairs of S.cerevisiae and E.coli are calculated by the TM-align no growth, two signifies some growth and three signifies wild-type software (Zhang and Skolnick, 2005), freely downloadable from growth when the gene is overexpressed. All studied genes, regardless Yang Zhang’s Research Group website. Briefly, TM-align iteratively of their wild-type expressions, are overexpressed to the same ele- aligns the two protein structures until its TM-score is maximized vated copy number for both studies. The most recent and least inva- and remains stable between iterations (Zhang and Skolnick, 2005). sive dosage tolerance measurement for S.cerevisiae is from Douglas Both TM-score and SID vary from 0 to 1, where 0 signifies complete et al. (2012). Kitagawa et al. (2005) is the only known study that inequality and 1 signifies identity. TM-score is proportional to measured dosage tolerance for E.coli. the sum over aligned residue pair distance-dependent weights. TM-score is independent of alignment length, an advantage over 2.1.7 Evolutionary rate other structure similarity scores (Zhang and Skolnick, 2004). SID is ER data for S.cerevisiae S288c and E.coli K-12 MG1655 are taken the fraction of aligned residues determined by the TM-align software from Zhang and Yang (2015). Briefly, orthologous proteins for that have the same amino acid identity. There are N ðN  1Þ=2 S.cerevisiae S288c were identified in Saccharomyces baynus and for unique pairs, equating to 302253 S.cerevisiae and 603351 E.coli pro- E.coli K-12 MG1655, in Salmonella typhimurium LT2. ER was cal- tein pairs in ProteomeVis. culated as the SID between orthologous proteins aligned by 3560 R.M.Razban et al. sequence. This approximates nonsynonymous substitutions per Strip is set up such that the S.cerevisiae proteome is visualized. nonsynonymous site (d ), a more popular metric for ER. Indeed, Protein pairs with TM-score greater than 0.5 and SID less than 0.25 the correlation between ER from Zhang and Yang (2015) and d are connected by an edge and nodes are differentially colored by de- from Wall et al. (2005) for S.cerevisiae is very strong—Spearman gree in the PCG. correlation ¼ 0.98, P-value<10 (Supplementary Fig. S1). We will add d and d data in future iterations of ProteomeVis. N S 2.2.2 Edge filtering The Edge Filtering panel (Fig. 1B) displays the TM-score versus SID 2.2 Web app scatter plot for all protein pairs in the designated proteome. As an The ProteomeVis web app can be accessed through any web browser alternative to entering values into the TM-score and SID fields in the at http://proteomevis.chem.harvard.edu. To access the source code, Control Strip, users can define TM-score and SID ranges by brush- refer to https://github.com/rrazban/proteomevis. ProteomeVis is ing over the TM-score versus SID plot. Each point corresponds to a implemented in the Django web framework and hosted on Amazon potential edge in the PCG. All edges within the brushed region are Web Services. If users wish to scrutinize their own data using rendered in the PCG panel below the Edge Filtering panel. In add- ProteomeVis, they can do so by altering the data files from which ition, datapoints colored green denote protein pairs that physically the database is created and deploying the web app locally (see interact according to the IntAct database. GitHub repository for more details). The ProteomeVis web app layout comprises a top bar—called 2.2.3 Protein chain graph the Control Strip (Fig. 1A), two left panels—Edge Filtering (1B) Upon setting TM-score and SID ranges, nodes and edges appear in and Protein Chain Graph (PCG) (1C), one center panel— the PCG panel (Fig. 1C). A node represents an individual protein, Protein Inspection (1D) and one right panel—Scatter Plot Matrix while an edge signifies that the two proteins it connects have a TM- (SPloM) (1E). score and SID within the user-defined range. Only nodes with at least one edge are shown in the PCG for visual clarity. The horizon- 2.2.1 Control strip tal bar at the top of the panel allows users to gauge the fraction of The Control Strip (Fig. 1A) controls the overall data presented. nodes displayed (in black). As in the Edge Filtering panel, edges col- From left to right, the user can (i) select which organism’s proteome ored green denote two physically interacting proteins. To aid in to visualize (ii) enter the TM-score and SID ranges that define a tracking new edges upon successive definition of TM-score and SID graph in the PCG panel (iii) select the differentially colored property ranges, edges that do not appear in the first set of defined ranges for in the PCG panel in log10 units (except for dosage tolerance which that organism appear in red. has no additional log transformation) (iv) click the download button The node’s color is given by the gray scale shown in the Control to obtain spreadsheets of the underlying data and (v) click the help Strip. In Figure 1, nodes are colored by degree. Nodes in small clus- button for a quick tour of the web app. In Figure 1, the Control ters tend to be light in color as they have low degrees, while nodes in Fig. 1. The ProteomeVis web app after exploration has occurred. Hovering over asterisks located in the top-right corner of each panel opens an informational win- dow. Pressing the help button (question mark symbol) located to the right of the Control Strip (A) leads to a quick tour. Alphabetical labels (A–E) are added to each component of ProteomeVis in this figure to introduce the tool in the text ProteomeVis: a web app for comprehensive protein evolution exploration 3561 large clusters are darker because they are typically connected to mul- Not all proteins in ProteomeVis have recorded values for all its tiple other nodes. When the user hovers over a node, the protein properties. The last four in the list are obtained from external sources as structure and PDB identifier appear at the top left of the panel. described in ‘Materials and methods’ 2.1, and some have incomplete When the user hovers over an edge, the TM-score, SID and align- overlap with proteins covered in ProteomeVis (Supplementary Fig. S4). ment length appear at the top left of the panel. If the user clicks on Each of the 8  7=2 ¼ 28 graphs in the panel, except for those the node, detailed information about the protein appears in the involving dosage tolerance, is a log-log scatter plot of one trait Protein Inspection panel. against another. Values of 0 are approximated by an order of magni- tude smaller than the second smallest possible value for the respect- ive protein property. For example, the second smallest value for 2.2.4 Protein inspection degree is 1, therefore 0 is approximated by 0.1 in the log scale. The The Protein Inspection panel (Fig. 1D) invites users to examine the scatter plots only involving the last six properties in the list are properties of any protein in ProteomeVis. Clicking on a node in static, i.e. no matter the TM-score and SID ranges chosen, the plots the PCG or searching for a protein in the search bar populates the remain the same. Only plots involving degree and weighted degree Individual tab of the Protein Inspection panel with the correspond- change each time a new TM-score and SID selection is made, such ing protein complex and its chains, PDB identifier and associated that they always reflect the current PCG. As described above, nodes genes. Clicking the plus button provides information on the protein complex’ function, as well as PubMed identifiers and Evidence codes with no edges, called orphans, are not displayed in the PCG. (ECOs) (Chibucos et al., 2014) for tracking references, if available. However, data associated with orphans are always shown in the The PDB identifier and gene name are clickable links to its PDB and SPloM, including their degree (i.e. zero, approximated by 0.1 in the UniProt web page, respectively. Clicking the image of the protein log scale) and weighted degree (i.e. their abundance). structure opens a window with the image enlarged and values of All plots in the SPloM have a background color determined by protein properties below it. the P-value color bar at the top of the panel. Red corresponds to a The Cluster tab in the Protein Inspection panel makes it easy for P-value < 0.01; black, P-value ¼ 0.05; and white, P-value > 0.1. If users to peruse all proteins in a connected component in the PCG the Spearman correlation coefficient is less than 0.15, no matter (Supplementary Fig. S2). The cluster size is the number of nodes how small the P-value, the plot will be colored white. Results from comprising the cluster in the PCG. The tab displays a histogram of correlation analysis can be obtained by hovering over a plot in the cluster sizes, for all cluster sizes larger than 2. In the histogram, each SPloM. Clicking on a specific scatter plot enlarges it in the bottom cluster is represented by a square. The gray scale color of each right corner of the SPloM panel for closer inspection and presents its square in the histogram denotes the average degree (default) or other corresponding correlation analysis results to its left. In Figure 1, the property defined by the user in the Control Strip. All proteins evolutionary rate versus abundance plot is enlarged. belonging to a cluster can be displayed by clicking on the square rep- resenting the cluster, which displays a header below, and then click- 2.2.6 Data download ing this header (Supplementary Fig. S2). At any point, users can download the data making up ProteomeVis by Protein chains identified in the Individual and Cluster tabs are clicking the download button located to the right of the Control Strip. given a color, which allows them and data associated with them to As shown in Supplementary Figure S5, there are three main options. be identified in the PCG and SPloM panels, respectively. In these First, users can download a spreadsheet of individual protein property tabs, the color is displayed in the protein chain letter underneath the values, selecting which of the eight properties they desire. Second, they image of the protein structure. In Figure 1, 2m80 chain A is colored can choose to download data on the relationships between protein pairs, orange. Its representative node in the PCG panel and its associated either the edges that fulfill the set TM-score and SID ranges (‘Edges in datapoints in the SPloM panel are also orange. If the default color current range’) or all edges. Edge files include TM-scores and SIDs for assigned is not appealing, the user can cycle through different colors each protein pair, as well as the length of the alignment and whether by clicking the protein chain letter. they form a PPI. Third, in the Correlations tab, users can download matrices of correlation analyses values presented in the SPloM panel. 2.2.5 Scatter plot matrix In addition to downloading from the Control Strip, ProteomeVis The SPloM panel (Fig. 1E) allows users to quickly identify statistic- offers a fourth option that enables users to download data associ- ally significant correlations between pairs of protein properties. ated with specific protein clusters. In the Cluster tab of the Protein ProteomeVis includes eight protein properties. Inspection panel, clicking on a cluster, hovering over its header line and clicking the download icon that appears to the left, downloads 1. Degree: number of proteins sharing an edge in the PCG, e.g. four the data. In Supplementary Figure S2, the download button is not for the gray node in Supplementary Figure S3A. seen but if the user was to hover over the header line with their cur- 2. Weighted degree: its abundance plus the sum of protein abun- sor, the download button would appear to the left of the 11 label. dances of proteins for which it shares an edge with in the PCG, We encourage users to download data to rigorously test any e.g. 12 for the gray node in Supplementary Figure S3. hypotheses. Simply scanning the SPloM panel for significant correla- 3. Length: number of residues. tions is prone to the multiple testing problem and may lead to incor- 4. Contact density: average number of contacts per residue in the rect conclusions (Yoav and Hochberg, 1995). PDB structure. 5. Abundance: normalized protein copy number per cell in parts per million. 3 Results 6. PPI degree: number of stable protein–protein interaction 3.1 Degree distribution of the organismal protein chain partners. 7. Dosage tolerance: relative insensitivity of the cell’s growth rate universe graph to the overexpression of a protein. A characteristic behavior seen in the PDUG compiled across 8. Evolutionary rate: SID between orthologous proteins. (Dokholyan et al., 2002) as well as within organisms (oPDUGs) 3562 R.M.Razban et al. (Deeds et al., 2004; Roland and Shakhnovich, 2007) is that they are scale-free, meaning that the probability to find a node with degree k scales as a power law of k with a scaling constant, c: PðkÞ k . This behavior, especially for oPDUGs, provides support that protein structures evolutionarily diverge rather than converge. In Figure 2, ProteomeVis reproduces the scale-free behavior for the S.cerevisiae and E.coli oPCUGs. The oPCUGs are composed of protein pairs from the respective organism with TM-scores greater than 0.5 and SIDs less than 0.25. For an indepth discussion on threshold and cutoff values, and the scaling constant’s robustness, refer to Supplementary Material 3.1a. To create Figure 2, the TM-score and SID ranges are entered in the Control Strip of the ProteomeVis web app and then the degree data is downloaded via option 1 as described in ‘Materials and methods’ 2.2.6. The slope of the best-fit line in a log-log plot is the additive Fig. 2. Degree distribution of the organismal protein chain universe graph inverse of the scaling constant. Both organisms exhibit highly (oPCUG) with its best-fit line on a log-log plot. The slope of the best-fit line significant scale-free behavior, with Pearson correlation coefficients < (scaling constant, c) with its standard error and the Pearson correlation coeffi- –0.9 and corresponding P-values<10 (Fig. 2). E.coli’soPCUG cient (r) with its P-value in parenthesis are calculated by Python SciPy’s linear regression function. Less S.cerevisiae datapoints are seen because its max- scaling constant is 1:5660:10, which matches Deeds et al. (2004)’s imal degree is less than E.coli’s maximal degree. All degrees are horizontally measurement of 1.56 for E.coli’s oPDUG. We hypothesize that this shifted by 1 unit to allow viewing of orphans (degree ¼ 0) and be consistent agreement results from protein domains within protein chain pairs with Deeds et al. (2004) in evaluating the scaling constant dominating alignments and leave it for a future study. S.cerevisiae’s scaling constant is 1:7960:18. No corresponding protein universe Our results expand the scope of previous protein expression studies graph result is present in the literature for S.cerevisiae. Qian et al. by demonstrating that a protein level property, protein abundance, (2001) and Koonin et al. (2002) demonstrated that the distribution of strongly correlates with ER for both S.cerevisiae and E.coli. This re- S.cerevisiae domains exhibiting the same structure classified by SCOP sult is nontrivial because the correlation between mRNA and pro- is scale-free. However, Dokholyan et al. (2002) showed that this dis- tein abundance is not very strong (Greenbaum et al., 2003; Vogel tribution is one of cluster sizes, making its scale-free behavior simply and Marcotte, 2012). a result of random graphs and not indicative of any unique mechan- The second strongest ER determinant in ProteomeVis is PPI de- ism behind protein structure evolution. gree. In agreement with Fraser et al., 2003, we report a significant Both organisms exhibit oPCUG scaling constants that fall within negative correlation between PPI degree and ER for S.cerevisiae the 1.4–1.9 range of scaling constants output by Big Bang model (Fig. 3). Fraser et al. (2003) hypothesized that proteins with more simulations (Dokholyan and Shakhnovich, 2001), described in the interaction partners are under stronger negative selection to keep Introduction. This result has interesting implications for the general- their sequences unchanged to avoid detrimental loss of interaction ity of protein structure evolution. Although S.cerevisiae and E.coli partners. Bloom and Adami (2003) argued that the negative correl- belong to different kingdoms of life, their oPCUG scaling constants ation between PPI degree and ER is confounded by protein expres- are captured by the Big Bang model, indicating that the mechanism sion. In the dataset that Fraser et al. (2003) used, Bloom and Adami behind protein structure evolution may be inherent to the last uni- (2003) found that proteins with more PPIs were more likely to have versal common ancestor. larger protein expression measurements. When controlling for pro- tein expression by partial correlation analysis, Bloom and Adami 3.2 Putative evolutionary rate determinants (2003) found no significant correlation between PPI degree and ER. Multiple studies have determined protein expression to be the stron- To see if this correlation as revealed by ProteomeVis could be con- gest predictor of ER (Drummond et al., 2005, 2006; Drummond founded by protein abundance, the PPI degree versus abundance and Wilke, 2008; Pa ´l et al., 2001; Zhang and Yang, 2015). plot is explored in the SPloM panel for S.cerevisiae. Indeed, there is Quantified by mRNA abundance, protein expression correlates a significant positive correlation between them in ProteomeVis— negatively with ER. There has been continuous debate as to whether Spearman correlation¼ 0.19, P-value<10 , whose absolute value other properties impact ER. Figure 3 summarizes the correlations is similar to that of the PPI degree versus ER correlation—Spearman between ER and all five non-PCG properties in ProteomeVis, along- correlation ¼ –0.22, P-value<10 for S.cerevisiae. side their corresponding correlations in the literature, if available. Focusing on the correlation between contact density and ER, The scatter plots from which correlation coefficients and P-values ProteomeVis can partially explain previously conflicting reports for are calculated, can be seen in the SPloM panel of the ProteomeVis S.cerevisiae. Mathematical relationships predict that the number of web app. Protein abundance has the largest magnitude correlation sequences stably folding into a native structure is approximated by coefficient at –0.49 for S.cerevisiae and –0.46 for E.coli, with the contact density (Choi et al., 2017; England and Shakhnovich, 2003). lowest P-values (<10 ), in agreement with protein expression Therefore, proteins with larger contact densities are expected to results. Previous studies have mostly focused on the correlation have larger ERs because more mutations can be accommodated between mRNA abundance and ER because measuring protein while still stably folding. In agreement, Zhou et al. (2008) reported abundance across the proteome was more laborious (Vogel and a significant positive correlation between contact density and ER for Marcotte, 2012). Just one other study examined the protein S.cerevisiae and E.coli. However, Shakhnovich (2006) found a cor- abundance-ER relationship in S.cerevisiae, and they reported a responding correlation for S.cerevisiae that is significantly negative. strong negative correlation (Drummond et al., 2006). To our In ProteomeVis, corresponding positive correlations are reported knowledge, no other study has performed this analysis for E.coli. that is significant (P-value<10 ) for S.cerevisiae and slightly ProteomeVis: a web app for comprehensive protein evolution exploration 3563 Fig. 3. Spearman correlation coefficients (q) between evolutionary rate (ER) and five protein properties (x) in ProteomeVis and the literature. X marks and aster- 100 50 10 isks above or below bar plots denote P-value ranges: xxx P -value < 10 ,xx < 10 ,x < 10 , *** < 0.001, ** < 0.01, * < 0.05. Length (Zhou et al., 2008), contact density (Zhou et al., 2008) and (mRNA) abundance (Zhang and Yang, 2015) correlations with evolutionary rate in the literature for both organisms are pre- sented. PPI degree and dosage tolerance correlations with evolutionary rate in the literature for S.cerevisiae are from Fraser et al. (2003) and Vavouri et al. (2009), respectively; for E.coli, not available (N/A) as of March 2018 significant (P-value < 0.01) for E.coli. To probe whether In this paper, we applied ProteomeVis to investigate two import- ProteomeVis’ consistency with Zhou et al. (2008) is due to having ant problems in evolutionary protein biology: the scale-free degree dis- the same contact definition, we define contacts as in Shakhnovich tribution of oPCUGs and the correlations between ER and protein (2006) (‘Materials and methods’ 2.1.3). The corresponding properties. These two topics span time and length scales in protein correlation for S.cerevisiae becomes insignificant—Spearman evolution, and we successfully recapitulated and extended previous correlation ¼ 0.00, P-value ¼ 0.99. Although we do not recapitulate studies. We report the previously unknown scaling constant for the significant negative correlation found in Shakhnovich (2006), S.cerevisiae’s protein universe graph. Interestingly, its oPCUG scaling we find that the significant positive correlation between contact constant fits the range predicted by the Big Bang model, which had density and ER in ProteomeVis is contingent on the employed con- only been applied to prokaryotic organisms (Deeds et al.,2004; tact definition. Dokholyan et al.,2002). This observation points to the mechanism of As seen in Figure 3, all statistically significant correlations in protein structure evolution being common among organisms since the ProteomeVis agree with those available in the literature, except last universal common ancestor. In our studies of potential ER deter- for the correlation between length and ER for E.coli. Although minants, protein abundance is the strongest correlate among proper- ProteomeVis agrees with Zhou et al. (2008) regarding correlations ties in ProteomeVis. Previously, the importance of mRNA abundance between contact density and ER, they vary regarding hadbeenhighlighted (Zhang and Yang, 2015), but our work joins the correlation between protein length and ER for E.coli. Zhou only one other study to demonstrate the importance of protein abun- et al. (2008) reported a significant positive correlation, while dance influencing ER (Drummond et al.,2006). ProteomeVis reports an insignificant negative correlation for E.coli. ProteomeVis captures 13% of the S.cerevisiae and 25% of Methodological differences do not explain the discrepancy. Zhou the E.coli proteomes as of March 2018. These percentages will grow et al. (2008) applied the same filter based on length to identify struc- as monthly updates are performed to incorporate newly deposited PDB tures in the respective organism’s proteome as in ProteomeVis. As structures. We plan to further expand ProteomeVis in future iterations expected, the length versus ER correlation for S.cerevisiae in by including more proteomes, such as those from Homo sapiens and ProteomeVis and Zhou et al. (2008) are consistent. Further studies Mus musculus, and adding genomic data, such as d and d . N S are required to elucidate why inconsistent length-ER correlation are seen in E.coli but not S.cerevisiae. Acknowledgements RMR acknowledges fruitful email communications with Professor Brenda 4 Conclusion Andrews about dosage tolerance data in Sopko et al. (2006), and Professor Proteomic data visualization tools are increasingly developed to fa- Jianzhi Zhang and Professor Jian-Rong Yang about evolutionary rate data in Zhang and Yang (2015). RMR would like to thank Will Jacobs for helpful cilitate understanding of complex and interwoven experimental data discussions surrounding data curation. All figures in the text and (Vizcaı ´no et al., 2015). Here, we present a novel web app called Supplementary Material, except for those of the ProteomeVis web app and ProteomeVis, accessible at http://proteomevis.chem.harvard.edu, to Supplementary Figure S3, were created using Python’s Matplotlib package. study protein structure and sequence evolution simultaneously across organisms’ proteomes. Data downloaded from three pub- lished papers, programmatically accessed from four databases and Funding generated by running three software packages are organized into This work was supported by National Institutes of Health [2R01GM068670-13 ProteomeVis’ four panels. Interacting with the panels gives users to E.I.S.]; and the National Science Foundation Graduate Research Fellowships quick insight into the underlying data. Once more detailed hypothe- Program [2013139945 to A.I.G.]. ses are formulated throughout their web sessions, users can freely download the data to further probe on their own. Conflict of Interest: none declared. 3564 R.M.Razban et al. Greenbaum,D. et al. (2003) Comparing protein abundance and mRNA ex- References pression levels on a genomic scale. Genome Biol., 4, 117. Albert,R. and Baraba ´ si, A.L. (2000) Topology of evolving networks: local Holm,L. and Sander,C. (1993) Protein structure comparison by alignment of events and universality. Phys. Rev. Lett., 85, 5234–5237. distance matrices. J. Mol. Biol., 233, 123–138. Albert, R. and Baraba ´ si, A.L. (2002) Statistical mechanics of complex net- Ingolfsson,H. and Yona,G. (2008). Protein domain prediction. In: Kobe,B. works. Rev. Mod. Phys., 74, 47–97. et al. (eds.) Struct. Proteomics High-Throughput Methods, vol. 426, chapter Arike, L. et al. (2012) Comparison and applications of label-free absolute 7. Humana Press, Totowa, NJ, pp. 117–43. proteome quantification methods on Escherichia coli. J. Proteomics, 75, Ispolatov,I. et al. (2005) Binding properties and evolution of homodimers 5437–5448. in protein–protein interaction networks. Nucleic Acids Res., 33, Bergga ˚ rd,T. et al. (2007) Methods for the detection and analysis of protein– 3629–3635. protein interactions. Proteomics, 7, 2833–2842. Kitagawa,M. et al. (2005) Complete set of ORF clones of Escherichia coli Berman,H.M. et al. (2000) The protein data bank. Nucleic Acids Res., 28, ASKA library (A complete set of E. coli K-12 ORF archive): unique resour- 235–242. ces for biological research. DNA Res., 12, 291–299. Bershtein,S. et al. (2017) Bridging the physical scales in evolutionary biology: Koonin,E.V. et al. (2002) The structure of the protein universe and genome from protein sequence space to fitness of organisms and populations. Curr. evolution. Nature, 420, 218–223. Opin. Struct. Biol., 42, 31–40. Krapivsky,P.L. et al. (2000) Connectivity of growing random networks. Phys. Bloom,J.D. and Adami,C. (2003) Apparent dependence of protein evolution- Rev. Lett., 85, 4629–4632. ary rate on number of interactions is linked to biases in protein–protein Lukatsky,D.B. et al. (2007) Structural similarity enhances interaction propen- interactions data sets. BMC Evol. Biol., 3, 21. sity of proteins. J. Mol. Biol., 365, 1596–1606. Bloom,J.D. et al. (2006) Structural determinants of the rate of protein evolu- Nepomnyachiy,S. et al. (2015) CyToStruct: augmenting the network visualiza- tion in yeast. Mol. Biol. Evol., 23, 1751–1761. tion of CyToStruct with the power of molecular viewers. Structure, 23, Caetano-Anolle ´ s,G. et al. (2009) The origin, evolution and structure of the 941–948. protein world. Biochem. J., 417, 621–637. Orchard,S. et al. (2014) The MIntAct project – IntAct as a common curation Chibucos,M.C. et al. (2014) Standardized description of scientific evidence platform for 11 molecular interaction databases. Nucleic Acids Res., 42, using the Evidence Ontology (ECO). Database J. Biol. Databases Curation, D358–D363. 2014, 1–11. Pa ´ l,C. et al. (2001) Highly expressed genes in yeast evolve slowly. Genet. Soc. Choi,J.M. et al. (2017) Graph’s topology and free energy of a spin model on Am., 158, 927–931. the graph. Phys. Rev. Lett., 118, 1–5. Qian,J. et al. (2001) Protein family and fold occurrence in genomes: Chothia,C. and Lesk,A.M. (1986) The relation between the divergence of se- power-law behaviour and evolutionary model. J. Mol. Biol., 313, 673–681. quence and structure in proteins. EMBO J., 5, 823–826. Roland,C.B. and Shakhnovich,E.I. (2007) Divergent evolution of a structural Chung,S.Y. and Subbiah,S. (1996) A structural explanation for the twilight proteome: phenomenological models. Biophys. J., 92, 701–716. zone of protein sequence homology. Structure, 4, 1123–1127. Rost,B. (1999) Twilight zone of protein sequence alignments. Protein Eng. Cock,P.J.A. et al. (2009) Biopython: freely available Python tools for compu- Des. Sel., 12, 85–94. tational molecular biology and bioinformatics. Bioinformatics, 25, Sander,C. and Schneider,R. (1991) Database of homology-derived protein 1422–1423. structures and the structural meaning of sequence alignment. Proteins Cokelaer,T. et al. (2013) BioServices: a common Python package to access bio- Struct. Funct. Bioinf., 9, 56–68. logical web services programmatically. Bioinformatics, 29, 3241–3242. Shakhnovich,B.E. (2006) Relative contributions of structural designability Cooper,G.M. (2000) Cells as experimental models. In: Cell: A Molecular and functional diversity in molecular evolution of duplicates. Approach, chapter 1, 2nd edn. Sinauer Associates, Sunderland, MA. Bioinformatics, 22, e440–e445. Deeds,E.J. and Shakhnovich,E.I. (2010) A structure-centric view of protein Shakhnovich,B.E. (2005) Protein structure and evolutionary history determine evolution, design, and adaptation. In: Toone,E.J. (ed.) Advances in sequence space topology. Genome Res., 15, 385–392. Enzymology and Related Areas of Molecular Biology: Protein Evolution, Sikosek,T. and Chan,H.S. (2014) Biophysics of protein evolution and evolu- vol. 75. John Wiley and Sons, Inc., Hoboken, NJ, pp. 133–191. tionary protein biophysics. J. R. Soc. Interface, 11, 20140419. Deeds,E.J. et al. (2004) Proteomic traces of speciation. J. Mol. Biol., 336, Sopko,R. et al. (2006) Mapping pathways and phenotypes by systematic gene 695–706. overexpression. Mol. Cell, 21, 319–330. Dokholyan,N.V. and Shakhnovich,E.I. (2001) Understanding hierarchical Szklarczyk,D. and Jensen,L.J. (2015) Protein–protein interaction databases. protein evolution from first principles. J. Mol. Biol., 312, 289–307. In: Meyerkord,C.L. and Fu,H. (eds.) Protein-Protein Interactions, vol. Dokholyan,N.V. et al. (2002) Expanding protein universe and its origin from 1278. Humana Press, New York, NY, pp. 39–56. the biological Big Bang. Proc. Natl. Acad. Sci. USA, 99, 14132–14136. The UniProt Consortium (2017) UniProt: the universal protein knowledge- Douglas,A.C. et al. (2012) Functional analysis with a barcoder yeast gene base. Nucleic Acids Res., 45, D158–D169. overexpression system. G3 Genes Genomes Genet., 2, 1279–1289. Vavouri,T. et al. (2009) Intrinsic protein disorder and interaction promiscuity Drummond,D.A. and Wilke,C.O. (2008) Mistranslation-induced protein mis- are widely associated with dosage sensitivity. Cell, 138, 198–208. folding as a dominant constraint on coding-sequence evolution. Cell, 134, Velankar,S. et al. (2013) SIFTS: structure Integration with Function, 341–352. Taxonomy and Sequences resource. Nucleic Acids Res., 41, D483–D489. Drummond,D.A. et al. (2005) Why highly expressed proteins evolve slowly. Vizcaı´no,J.A. et al. (2015) Proteomics Data Visualisation [Special Issue]. Proc. Natl. Acad. Sci. USA, 102, 14338–14343. Proteomics, 15, 1339–1456. Drummond,D.A. et al. (2006) A single determinant dominates the rate of yeast Vogel,C. and Marcotte,E.M. (2012) Insights into the regulation of protein protein evolution. Mol. Biol. Evol., 23, 327–337. abundance from proteomic and transcriptomic analyses. Nat. Rev. Genet., England,J.L. and Shakhnovich,E.I. (2003) Structural determinant of protein designability. Phys. Rev. Lett., 90, 218101. 13, 227–232. Fraser,H.B. et al. (2003) A simple dependence between protein evolution rate Wall,D.P. et al. (2005) Functional genomic analysis of the rates of protein evo- lution. Proc. Natl. Acad. Sci. USA, 102, 5483–5488. and the number of protein–protein interactions. BMC Evol. Biol., 3,6. Wang,M. et al. (2012) PaxDb, a database of protein abundance averages Ghaemmaghami,S. et al. (2003) Global analysis of protein expression in yeast. Nature, 425, 737–741. across all three domains of life. Mol. Cell. Proteomics, 11, 492–500. Gilson,A.I. et al. (2017) The role of evolutionary selection in the dynamics of Wright,C.F. et al. (2005) The importance of sequence diversity in the aggrega- protein structure evolution. Biophys. J., 112, 1350–1365. tion and evolution. Nature, 438, 878–881. ProteomeVis: a web app for comprehensive protein evolution exploration 3565 Yoav,B. and Hochberg,Y. (1995) Controlling the false discovery rate: a practical Zhang,Y. and Skolnick,J. (2004) Scoring function for automated assessment and powerful approach to multiple testing. J. R. Stat. Soc., 57, 289–300. of protein structure template quality. Proteins Struct. Funct. Genet., 57, Zeldovich,K.B. and Shakhnovich,E.I. (2008) Understanding protein evolution: 702–710. from protein physics to Darwinian selection. Annu. Rev. Phys. Chem., 59, Zhang,Y. and Skolnick,J. (2005) TM-align: a protein structure alignment 105–127. algorithm based on the TM-score. Nucleic Acids Res., 33, 2302–2309. Zhang,J. and Yang,J.R. (2015) Determinants of the rate of protein sequence Zhou,T. et al. (2008) Contact density affects protein evolutionary rate from evolution. Nat. Rev. Genet., 16, 409–420. bacteria to animals. J. Mol. Evol., 66, 395–404. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press

ProteomeVis: a web app for exploration of protein properties from structure to sequence evolution across organisms’ proteomes

Loading next page...
 
/lp/ou_press/proteomevis-a-web-app-for-exploration-of-protein-properties-from-w1ZNXVuBaE
Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/bty370
Publisher site
See Article on Publisher Site

Abstract

Motivation: Protein evolution spans time scales and its effects span the length of an organism. A web app named ProteomeVis is developed to provide a comprehensive view of protein evolution in the Saccharomyces cerevisiae and Escherichia coli proteomes. ProteomeVis interactively creates protein chain graphs, where edges between nodes represent structure and sequence similarities within user-defined ranges, to study the long time scale effects of protein structure evolution. The short time scale effects of protein sequence evolution are studied by sequence evolutionary rate (ER) correlation analyses with protein properties that span from the molecular to the organismal level. Results: We demonstrate the utility and versatility of ProteomeVis by investigating the distribution of edges per node in organismal protein chain universe graphs (oPCUGs) and putative ER determi- nants. S.cerevisiae and E.coli oPCUGs are scale-free with scaling constants of 1.79 and 1.56, respectively. Both scaling constants can be explained by a previously reported theoretical model describing protein structure evolution. Protein abundance most strongly correlates with ER among properties in ProteomeVis, with Spearman correlations of –0.49 (P-value < 10 ) and –0.46 (P-value < 10 ) for S.cerevisiae and E.coli, respectively. This result is consistent with previous reports that found protein expression to be the most important ER determinant. Availability and implementation: ProteomeVis is freely accessible at http://proteomevis.chem.har vard.edu. Contact: shakhnovich@chemistry.harvard.edu Supplementary information: Supplementary data are available at Bioinformatics online. 1 Introduction from a protein’s length and stability, to its copy number and interac- Protein evolution spans time scales, from the billions of years during tions with other proteins, and to its influence on organismal fitness which the universe of extant protein structures emerged to the frac- (Bershtein et al., 2017; Sikosek and Chan, 2014; Zeldovich tion of microseconds in which protein sequences mutate (Caetano- and Shakhnovich, 2008). Accordingly, protein evolution is studied Anolle ´s et al., 2009). Protein evolution also spans length scales, through lines of inquiry usually separated by scale. V The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com 3557 3558 R.M.Razban et al. One approach takes a long time scale view, inferring evolution- positively with ER in Escherichia coli, Saccharomyces cerevisiae ary dynamics of proteins by ordering extant proteins into a graph and Drosophila melanogaster (Bloom et al., 2006; Zhou et al., based on relationships between their structures and sequences 2008), while the number of protein–protein interaction partners (Deeds et al., 2004; Dokholyan et al., 2002; Koonin et al., 2002; (PPI degree) was reported to correlate negatively with ER in Qian et al., 2001). Dokholyan et al. (2002) constructed the graph by S.cerevisiae (Fraser et al., 2003). However, the evidence is not incon- representing protein domains as nodes and connecting pairs of nodes trovertible. In S.cerevisiae, a significant negative correlation was with edges only if their structure similarities, measured by DALI Z- found between contact density and ER (Shakhnovich, 2006), and no scores (Holm and Sander, 1993), exceed 9 and their sequence correlation was found between PPI degree and ER when carefully identities (SIDs) are less than 0.25. A Z-score of 9 was found to controlling for protein expression (Bloom and Adami, 2003). differentiate protein pairs with similar (Z-score > 9) or different By juxtaposing protein structure graphs and ER correlation anal- (Z-score < 9) structures (Dokholyan et al., 2002). A SID of 0.25 yses, insights into one time scale of protein evolution can be gained originates from numerous studies finding that two aligned protein by information from the other. For example, not only do proteins sequences sharing greater than a SID value ranging from 0.2 to 0.4, with high contact densities tend to evolve more quickly at the se- are likely to share the same structure (Chothia and Lesk, 1986; quence level than their less compact counterparts according to Zhou Chung and Subbiah, 1996; Rost, 1999; Sander and Schneider, et al. (2008), they also tend to have larger structure neighborhoods 1991). Such sequence pairs are called homologous. (Shakhnovich, 2005), defined as the number of different sequences Strikingly, the distribution of edges per node in this non- encoding a structure. This is because selection for protein stability, homologous protein domain graph, dubbed the protein domain uni- roughly approximated by contact density, biases evolution towards 1:6 verse graph (PDUG), scales as k , where k stands for degree exploring closely related structures (Gilson et al., 2017). An example (number of edges per node) and 1.6 is the scaling constant that demonstrates long time scale structural effects explaining (Dokholyan et al., 2002). Scale-free graphs were first shown to form sequence evolution constraints: proteins with related structures by a process of preferential attachment, but models of this type pre- (Lukatsky et al., 2007) or sequences (Wright et al., 2005) have a dict scaling constants greater than 2 (Albert and Baraba ´ si, 2000, statistically enhanced likelihood of forming stable PPIs. This is most 2002; Krapivsky et al., 2000). For the PDUG, an alternative model evident by the prevalence of homodimers and superfamily of dimers called Big Bang was developed, where new structures originate from in nature (Ispolatov et al., 2005). gene duplication events followed by significant divergence in se- In this work, protein chain graphs are merged with correlation quence. Simulations of the Big Bang model generated scale-free de- analyses of ER determinants into an interactive visualization tool gree distributions with scaling constants ranging from 1.4 to 1.9. called ProteomeVis. ProteomeVis’ graph-theoretical approach takes The average over 100 independent simulations yielded a scaling con- a global view of protein evolution over long time scales to capture stant of 1.6, matching that observed in nature (Dokholyan et al., structure evolution, while ER correlation analyses span relatively 2002). Further studies found PDUG sub-graphs built from proteins short time scales. ProteomeVis joins the growing list of visualization belonging to 59 prokaryotic organisms’ proteomes are also scale- tools developed to facilitate understanding of complex and interwo- free (Deeds et al., 2004; Roland and Shakhnovich, 2007). These ven experimental data (Vizcaı ´no et al., 2015). What makes sub-graphs, called organismal PDUGs (oPDUGs), have scaling con- ProteomeVis unique is the disparate data it integrates. A tool already stants ranging from 1.3 to 1.9, closely matching the extent of the Big exists that enables users to interactively create protein graphs based Bang model (Deeds et al., 2004). The Big Bang model’s successes in on sequence and structure similarities (Nepomnyachiy et al., 2015). predicting scaling constants, indicate that the universe of extant pro- However, no tool, to the best of our knowledge, includes protein tein structures emerged by a process of divergent evolution (Deeds properties from the molecular to the organismal level, as well as se- and Shakhnovich, 2010). quence and structure similarities to create protein graphs, providing A complementary approach to protein evolution takes a shorter a comprehensive overview of protein evolution. time scale view, inferring determinants of sequence evolutionary The roadmap for the remainder of the text is as follows. After rate (ER) by correlating ER with other protein properties (Zhang describing how we collect the underlying data and incorporate it in and Yang, 2015). A complex cellular environment determines the the web app (‘Materials and methods’ section), we demonstrate the fitness effects of mutations that arise randomly in genes. Diverse and utility of ProteomeVis by recapitulating and expanding upon previ- sometimes countervailing selection pressures channel the course of a ous literature results regarding protein universe graphs and ER protein’s evolution (Bershtein et al., 2017; Sikosek and Chan, 2014; determinants (Results). The Results section highlights ProteomeVis’ Zeldovich and Shakhnovich, 2008). Over the years in which organ- versatility, illustrating how it facilitates evolutionary studies of pro- isms diverge from their common ancestor, selection pressures aggre- tein structure and sequence. gate in each protein’s ER. ERs span three orders of magnitude (Zhang and Yang, 2015) and by examining which properties correl- ate with ER, previous work attempted to extract the major con- 2 Materials and methods straints of protein evolution. 2.1 Data curation Protein expression was found to be the most prominent con- To access the in-house Python scripts that manage ProteomeVis’ straint. Quantified by mRNA abundance because it was more tract- data and create the web app’s static files, please refer to https:// able to measure than protein abundance (Vogel and Marcotte, github.com/rrazban/proteomevis_scripts. ProteomeVis currently 2012), protein expression correlates negatively with ER across all contains S.cerevisiae and E.coli proteomes. These two organisms three kingdoms of life, with correlation coefficients exceeding those have the largest number of protein structures deposited in the observed for other quantities (Drummond et al., 2005, 2006; Protein Data Bank (PDB) (Berman et al., 2000), accessed in March Drummond and Wilke, 2008; Pa ´l et al., 2001; Zhang and Yang, 2015). Much effort has been expended to identify other protein 2018, per proteome size. In addition, S.cerevisiae and E.coli have a properties that correlate with ER, independent of protein expres- plethora of experimentally measured protein property data since sion. Protein length and contact density were found to correlate they are model organisms (Cooper, 2000). Future iterations of ProteomeVis: a web app for comprehensive protein evolution exploration 3559 ProteomeVis will include more proteomes, such as those from 2.1.3 Length and contact density Homo sapiens and Mus musculus. A protein’s length is taken from UniProt. The contact density is the Data are collected at the level of protein chains. Although total number of residue-residue contacts in the PDB structure file, normalized by its length, as determined by the number of residues most structure evolution studies and some sequence evolution considered in the calculation of contacts. Two residues are defined studies analyze protein domains, the advantage of chains is that to be in contact if any non-hydrogen atom of the residue is within they correspond directly to genes. Protein chain identification is 4.5 Angstroms (A) from another non-hydrogen atom of a non- straightforward, while protein domain identification is dependent adjacent residue (Bloom et al., 2006; Zhou et al., 2008). In Results on the subjective definition chosen (Ingolfsson and Yona, 2008). 3.2, we also explore defining contacts as non-adjacent residue pairs In addition, the protein chain is the smallest level at which having C (C for glycine) distances <7.5A(Shakhnovich, 2006). b a cellular-wide protein properties such as abundance, protein– The Biopython module (Cock et al., 2009) is employed to parse PDB protein interaction and dosage tolerance can be measured. For the files and calculate contact density. remainder of the text, the word ‘protein’ is used to mean ‘protein chain’. 2.1.4 Protein abundance Protein abundance per cell was measured in Ghaemmaghami et al. 2.1.1 Protein structure identification (2003) and Arike et al. (2012) for S.cerevisiae and E.coli K-12 The S.cerevisiae S288c proteome has 6049 proteins and the E.coli MG1655, respectively. Abundance data are downloaded from the K-12 proteome has 4306, manually verified by the Universal Protein Protein Abundances Across Organisms database (PaxDb) (Wang Resource (UniProt) (The UniProt Consortium, 2017) and last modi- et al., 2012). They cover 64% and 95% of the S.cerevisiae and fied on February 19, 2017 and January 5, 2018, respectively. We de- E.coli K-12 UniProt proteomes. They are the largest and most accur- velop a procedure that draws upon existing resources to match ate single-sourced datasets for their respective proteome (PaxDb S.cerevisiae S288c and E.coli K-12 sequences in UniProt with accessed in March 2018). To allow for fair comparison between S.cerevisiae and E.coli structures in the PDB. First, the Structure organisms, abundance data in ProteomeVis are presented as protein Integration with Function, Taxonomy and Sequences resource abundance per cell divided by the total protein abundance per cell (SIFTS) (Velankar et al., 2013) is employed to identify protein for the respective organism. sequences with potentially matching structures. Briefly, SIFTS assigns PDB structures to UniProt sequences with a minimum SID of 2.1.5 Protein–protein interaction 0.9 and the source organism being zero, one or two levels up the spe- PPI partners are downloaded from the IntAct database (Orchard cies level in the taxonomy tree (Velankar et al., 2013). If multiple et al., 2014) via the Bioservices Python module (Cokelaer et al., structures are found for a given sequence, the structure with the larg- 2013). IntAct has the largest number of experimentally determined est sequence coverage is chosen. If multiple structures have the larg- PPIs among PPI databases (Szklarczyk and Jensen, 2015). Only est coverage, the structure solved at the lowest resolution is chosen. interactions identified by affinity purification methods for Second, protein structures are downloaded from the PDB and their S.cerevisiae S288c and E.coli K-12 are included. Affinity purifica- tion methods capture more stable interactions, rather than transient lengths are calculated. Structures are kept if their lengths are at least interactions that can vary greatly between independent experiments 80% of that listed for protein chains in UniProt. This step filters out (Bergga ˚rd et al., 2007). those protein structures that align well with only a small portion of the protein chain, e.g. a single domain from a multidomain protein. Running through the outlined steps yields 778 S.cerevisiae and 2.1.6 Dosage tolerance Dosage tolerance was measured in Douglas et al. (2012) for 1099 E.coli proteins. ProteomeVis includes protein property data S.cerevisiae and in Kitagawa et al. (2005) for E.coli K-12. Douglas for only these proteins. Monthly updates are performed to incorpor- et al. (2012) reported dosage tolerance, called overexpression tox- ate modifications to reported proteomes in UniProt and additional icity in their paper, as a logarithmic base 2 (log2) ratio between cell structures deposited in the PDB. growth when the gene of interest is overexpressed compared to that when the gene is expressed at wild-type levels. Dosage sensitive 2.1.2 Structure similarity and sequence identity genes are defined as those with log2 ratios less than -1. Kitagawa The template modeling score (TM-score) is used as a metric for et al. (2005) reported dosage tolerance, called growth inhibitory structure similarity. TM-score and SID for all intra-organism pro- effects in their paper, as an integer value of 1, 2, or 3. One signifies tein pairs of S.cerevisiae and E.coli are calculated by the TM-align no growth, two signifies some growth and three signifies wild-type software (Zhang and Skolnick, 2005), freely downloadable from growth when the gene is overexpressed. All studied genes, regardless Yang Zhang’s Research Group website. Briefly, TM-align iteratively of their wild-type expressions, are overexpressed to the same ele- aligns the two protein structures until its TM-score is maximized vated copy number for both studies. The most recent and least inva- and remains stable between iterations (Zhang and Skolnick, 2005). sive dosage tolerance measurement for S.cerevisiae is from Douglas Both TM-score and SID vary from 0 to 1, where 0 signifies complete et al. (2012). Kitagawa et al. (2005) is the only known study that inequality and 1 signifies identity. TM-score is proportional to measured dosage tolerance for E.coli. the sum over aligned residue pair distance-dependent weights. TM-score is independent of alignment length, an advantage over 2.1.7 Evolutionary rate other structure similarity scores (Zhang and Skolnick, 2004). SID is ER data for S.cerevisiae S288c and E.coli K-12 MG1655 are taken the fraction of aligned residues determined by the TM-align software from Zhang and Yang (2015). Briefly, orthologous proteins for that have the same amino acid identity. There are N ðN  1Þ=2 S.cerevisiae S288c were identified in Saccharomyces baynus and for unique pairs, equating to 302253 S.cerevisiae and 603351 E.coli pro- E.coli K-12 MG1655, in Salmonella typhimurium LT2. ER was cal- tein pairs in ProteomeVis. culated as the SID between orthologous proteins aligned by 3560 R.M.Razban et al. sequence. This approximates nonsynonymous substitutions per Strip is set up such that the S.cerevisiae proteome is visualized. nonsynonymous site (d ), a more popular metric for ER. Indeed, Protein pairs with TM-score greater than 0.5 and SID less than 0.25 the correlation between ER from Zhang and Yang (2015) and d are connected by an edge and nodes are differentially colored by de- from Wall et al. (2005) for S.cerevisiae is very strong—Spearman gree in the PCG. correlation ¼ 0.98, P-value<10 (Supplementary Fig. S1). We will add d and d data in future iterations of ProteomeVis. N S 2.2.2 Edge filtering The Edge Filtering panel (Fig. 1B) displays the TM-score versus SID 2.2 Web app scatter plot for all protein pairs in the designated proteome. As an The ProteomeVis web app can be accessed through any web browser alternative to entering values into the TM-score and SID fields in the at http://proteomevis.chem.harvard.edu. To access the source code, Control Strip, users can define TM-score and SID ranges by brush- refer to https://github.com/rrazban/proteomevis. ProteomeVis is ing over the TM-score versus SID plot. Each point corresponds to a implemented in the Django web framework and hosted on Amazon potential edge in the PCG. All edges within the brushed region are Web Services. If users wish to scrutinize their own data using rendered in the PCG panel below the Edge Filtering panel. In add- ProteomeVis, they can do so by altering the data files from which ition, datapoints colored green denote protein pairs that physically the database is created and deploying the web app locally (see interact according to the IntAct database. GitHub repository for more details). The ProteomeVis web app layout comprises a top bar—called 2.2.3 Protein chain graph the Control Strip (Fig. 1A), two left panels—Edge Filtering (1B) Upon setting TM-score and SID ranges, nodes and edges appear in and Protein Chain Graph (PCG) (1C), one center panel— the PCG panel (Fig. 1C). A node represents an individual protein, Protein Inspection (1D) and one right panel—Scatter Plot Matrix while an edge signifies that the two proteins it connects have a TM- (SPloM) (1E). score and SID within the user-defined range. Only nodes with at least one edge are shown in the PCG for visual clarity. The horizon- 2.2.1 Control strip tal bar at the top of the panel allows users to gauge the fraction of The Control Strip (Fig. 1A) controls the overall data presented. nodes displayed (in black). As in the Edge Filtering panel, edges col- From left to right, the user can (i) select which organism’s proteome ored green denote two physically interacting proteins. To aid in to visualize (ii) enter the TM-score and SID ranges that define a tracking new edges upon successive definition of TM-score and SID graph in the PCG panel (iii) select the differentially colored property ranges, edges that do not appear in the first set of defined ranges for in the PCG panel in log10 units (except for dosage tolerance which that organism appear in red. has no additional log transformation) (iv) click the download button The node’s color is given by the gray scale shown in the Control to obtain spreadsheets of the underlying data and (v) click the help Strip. In Figure 1, nodes are colored by degree. Nodes in small clus- button for a quick tour of the web app. In Figure 1, the Control ters tend to be light in color as they have low degrees, while nodes in Fig. 1. The ProteomeVis web app after exploration has occurred. Hovering over asterisks located in the top-right corner of each panel opens an informational win- dow. Pressing the help button (question mark symbol) located to the right of the Control Strip (A) leads to a quick tour. Alphabetical labels (A–E) are added to each component of ProteomeVis in this figure to introduce the tool in the text ProteomeVis: a web app for comprehensive protein evolution exploration 3561 large clusters are darker because they are typically connected to mul- Not all proteins in ProteomeVis have recorded values for all its tiple other nodes. When the user hovers over a node, the protein properties. The last four in the list are obtained from external sources as structure and PDB identifier appear at the top left of the panel. described in ‘Materials and methods’ 2.1, and some have incomplete When the user hovers over an edge, the TM-score, SID and align- overlap with proteins covered in ProteomeVis (Supplementary Fig. S4). ment length appear at the top left of the panel. If the user clicks on Each of the 8  7=2 ¼ 28 graphs in the panel, except for those the node, detailed information about the protein appears in the involving dosage tolerance, is a log-log scatter plot of one trait Protein Inspection panel. against another. Values of 0 are approximated by an order of magni- tude smaller than the second smallest possible value for the respect- ive protein property. For example, the second smallest value for 2.2.4 Protein inspection degree is 1, therefore 0 is approximated by 0.1 in the log scale. The The Protein Inspection panel (Fig. 1D) invites users to examine the scatter plots only involving the last six properties in the list are properties of any protein in ProteomeVis. Clicking on a node in static, i.e. no matter the TM-score and SID ranges chosen, the plots the PCG or searching for a protein in the search bar populates the remain the same. Only plots involving degree and weighted degree Individual tab of the Protein Inspection panel with the correspond- change each time a new TM-score and SID selection is made, such ing protein complex and its chains, PDB identifier and associated that they always reflect the current PCG. As described above, nodes genes. Clicking the plus button provides information on the protein complex’ function, as well as PubMed identifiers and Evidence codes with no edges, called orphans, are not displayed in the PCG. (ECOs) (Chibucos et al., 2014) for tracking references, if available. However, data associated with orphans are always shown in the The PDB identifier and gene name are clickable links to its PDB and SPloM, including their degree (i.e. zero, approximated by 0.1 in the UniProt web page, respectively. Clicking the image of the protein log scale) and weighted degree (i.e. their abundance). structure opens a window with the image enlarged and values of All plots in the SPloM have a background color determined by protein properties below it. the P-value color bar at the top of the panel. Red corresponds to a The Cluster tab in the Protein Inspection panel makes it easy for P-value < 0.01; black, P-value ¼ 0.05; and white, P-value > 0.1. If users to peruse all proteins in a connected component in the PCG the Spearman correlation coefficient is less than 0.15, no matter (Supplementary Fig. S2). The cluster size is the number of nodes how small the P-value, the plot will be colored white. Results from comprising the cluster in the PCG. The tab displays a histogram of correlation analysis can be obtained by hovering over a plot in the cluster sizes, for all cluster sizes larger than 2. In the histogram, each SPloM. Clicking on a specific scatter plot enlarges it in the bottom cluster is represented by a square. The gray scale color of each right corner of the SPloM panel for closer inspection and presents its square in the histogram denotes the average degree (default) or other corresponding correlation analysis results to its left. In Figure 1, the property defined by the user in the Control Strip. All proteins evolutionary rate versus abundance plot is enlarged. belonging to a cluster can be displayed by clicking on the square rep- resenting the cluster, which displays a header below, and then click- 2.2.6 Data download ing this header (Supplementary Fig. S2). At any point, users can download the data making up ProteomeVis by Protein chains identified in the Individual and Cluster tabs are clicking the download button located to the right of the Control Strip. given a color, which allows them and data associated with them to As shown in Supplementary Figure S5, there are three main options. be identified in the PCG and SPloM panels, respectively. In these First, users can download a spreadsheet of individual protein property tabs, the color is displayed in the protein chain letter underneath the values, selecting which of the eight properties they desire. Second, they image of the protein structure. In Figure 1, 2m80 chain A is colored can choose to download data on the relationships between protein pairs, orange. Its representative node in the PCG panel and its associated either the edges that fulfill the set TM-score and SID ranges (‘Edges in datapoints in the SPloM panel are also orange. If the default color current range’) or all edges. Edge files include TM-scores and SIDs for assigned is not appealing, the user can cycle through different colors each protein pair, as well as the length of the alignment and whether by clicking the protein chain letter. they form a PPI. Third, in the Correlations tab, users can download matrices of correlation analyses values presented in the SPloM panel. 2.2.5 Scatter plot matrix In addition to downloading from the Control Strip, ProteomeVis The SPloM panel (Fig. 1E) allows users to quickly identify statistic- offers a fourth option that enables users to download data associ- ally significant correlations between pairs of protein properties. ated with specific protein clusters. In the Cluster tab of the Protein ProteomeVis includes eight protein properties. Inspection panel, clicking on a cluster, hovering over its header line and clicking the download icon that appears to the left, downloads 1. Degree: number of proteins sharing an edge in the PCG, e.g. four the data. In Supplementary Figure S2, the download button is not for the gray node in Supplementary Figure S3A. seen but if the user was to hover over the header line with their cur- 2. Weighted degree: its abundance plus the sum of protein abun- sor, the download button would appear to the left of the 11 label. dances of proteins for which it shares an edge with in the PCG, We encourage users to download data to rigorously test any e.g. 12 for the gray node in Supplementary Figure S3. hypotheses. Simply scanning the SPloM panel for significant correla- 3. Length: number of residues. tions is prone to the multiple testing problem and may lead to incor- 4. Contact density: average number of contacts per residue in the rect conclusions (Yoav and Hochberg, 1995). PDB structure. 5. Abundance: normalized protein copy number per cell in parts per million. 3 Results 6. PPI degree: number of stable protein–protein interaction 3.1 Degree distribution of the organismal protein chain partners. 7. Dosage tolerance: relative insensitivity of the cell’s growth rate universe graph to the overexpression of a protein. A characteristic behavior seen in the PDUG compiled across 8. Evolutionary rate: SID between orthologous proteins. (Dokholyan et al., 2002) as well as within organisms (oPDUGs) 3562 R.M.Razban et al. (Deeds et al., 2004; Roland and Shakhnovich, 2007) is that they are scale-free, meaning that the probability to find a node with degree k scales as a power law of k with a scaling constant, c: PðkÞ k . This behavior, especially for oPDUGs, provides support that protein structures evolutionarily diverge rather than converge. In Figure 2, ProteomeVis reproduces the scale-free behavior for the S.cerevisiae and E.coli oPCUGs. The oPCUGs are composed of protein pairs from the respective organism with TM-scores greater than 0.5 and SIDs less than 0.25. For an indepth discussion on threshold and cutoff values, and the scaling constant’s robustness, refer to Supplementary Material 3.1a. To create Figure 2, the TM-score and SID ranges are entered in the Control Strip of the ProteomeVis web app and then the degree data is downloaded via option 1 as described in ‘Materials and methods’ 2.2.6. The slope of the best-fit line in a log-log plot is the additive Fig. 2. Degree distribution of the organismal protein chain universe graph inverse of the scaling constant. Both organisms exhibit highly (oPCUG) with its best-fit line on a log-log plot. The slope of the best-fit line significant scale-free behavior, with Pearson correlation coefficients < (scaling constant, c) with its standard error and the Pearson correlation coeffi- –0.9 and corresponding P-values<10 (Fig. 2). E.coli’soPCUG cient (r) with its P-value in parenthesis are calculated by Python SciPy’s linear regression function. Less S.cerevisiae datapoints are seen because its max- scaling constant is 1:5660:10, which matches Deeds et al. (2004)’s imal degree is less than E.coli’s maximal degree. All degrees are horizontally measurement of 1.56 for E.coli’s oPDUG. We hypothesize that this shifted by 1 unit to allow viewing of orphans (degree ¼ 0) and be consistent agreement results from protein domains within protein chain pairs with Deeds et al. (2004) in evaluating the scaling constant dominating alignments and leave it for a future study. S.cerevisiae’s scaling constant is 1:7960:18. No corresponding protein universe Our results expand the scope of previous protein expression studies graph result is present in the literature for S.cerevisiae. Qian et al. by demonstrating that a protein level property, protein abundance, (2001) and Koonin et al. (2002) demonstrated that the distribution of strongly correlates with ER for both S.cerevisiae and E.coli. This re- S.cerevisiae domains exhibiting the same structure classified by SCOP sult is nontrivial because the correlation between mRNA and pro- is scale-free. However, Dokholyan et al. (2002) showed that this dis- tein abundance is not very strong (Greenbaum et al., 2003; Vogel tribution is one of cluster sizes, making its scale-free behavior simply and Marcotte, 2012). a result of random graphs and not indicative of any unique mechan- The second strongest ER determinant in ProteomeVis is PPI de- ism behind protein structure evolution. gree. In agreement with Fraser et al., 2003, we report a significant Both organisms exhibit oPCUG scaling constants that fall within negative correlation between PPI degree and ER for S.cerevisiae the 1.4–1.9 range of scaling constants output by Big Bang model (Fig. 3). Fraser et al. (2003) hypothesized that proteins with more simulations (Dokholyan and Shakhnovich, 2001), described in the interaction partners are under stronger negative selection to keep Introduction. This result has interesting implications for the general- their sequences unchanged to avoid detrimental loss of interaction ity of protein structure evolution. Although S.cerevisiae and E.coli partners. Bloom and Adami (2003) argued that the negative correl- belong to different kingdoms of life, their oPCUG scaling constants ation between PPI degree and ER is confounded by protein expres- are captured by the Big Bang model, indicating that the mechanism sion. In the dataset that Fraser et al. (2003) used, Bloom and Adami behind protein structure evolution may be inherent to the last uni- (2003) found that proteins with more PPIs were more likely to have versal common ancestor. larger protein expression measurements. When controlling for pro- tein expression by partial correlation analysis, Bloom and Adami 3.2 Putative evolutionary rate determinants (2003) found no significant correlation between PPI degree and ER. Multiple studies have determined protein expression to be the stron- To see if this correlation as revealed by ProteomeVis could be con- gest predictor of ER (Drummond et al., 2005, 2006; Drummond founded by protein abundance, the PPI degree versus abundance and Wilke, 2008; Pa ´l et al., 2001; Zhang and Yang, 2015). plot is explored in the SPloM panel for S.cerevisiae. Indeed, there is Quantified by mRNA abundance, protein expression correlates a significant positive correlation between them in ProteomeVis— negatively with ER. There has been continuous debate as to whether Spearman correlation¼ 0.19, P-value<10 , whose absolute value other properties impact ER. Figure 3 summarizes the correlations is similar to that of the PPI degree versus ER correlation—Spearman between ER and all five non-PCG properties in ProteomeVis, along- correlation ¼ –0.22, P-value<10 for S.cerevisiae. side their corresponding correlations in the literature, if available. Focusing on the correlation between contact density and ER, The scatter plots from which correlation coefficients and P-values ProteomeVis can partially explain previously conflicting reports for are calculated, can be seen in the SPloM panel of the ProteomeVis S.cerevisiae. Mathematical relationships predict that the number of web app. Protein abundance has the largest magnitude correlation sequences stably folding into a native structure is approximated by coefficient at –0.49 for S.cerevisiae and –0.46 for E.coli, with the contact density (Choi et al., 2017; England and Shakhnovich, 2003). lowest P-values (<10 ), in agreement with protein expression Therefore, proteins with larger contact densities are expected to results. Previous studies have mostly focused on the correlation have larger ERs because more mutations can be accommodated between mRNA abundance and ER because measuring protein while still stably folding. In agreement, Zhou et al. (2008) reported abundance across the proteome was more laborious (Vogel and a significant positive correlation between contact density and ER for Marcotte, 2012). Just one other study examined the protein S.cerevisiae and E.coli. However, Shakhnovich (2006) found a cor- abundance-ER relationship in S.cerevisiae, and they reported a responding correlation for S.cerevisiae that is significantly negative. strong negative correlation (Drummond et al., 2006). To our In ProteomeVis, corresponding positive correlations are reported knowledge, no other study has performed this analysis for E.coli. that is significant (P-value<10 ) for S.cerevisiae and slightly ProteomeVis: a web app for comprehensive protein evolution exploration 3563 Fig. 3. Spearman correlation coefficients (q) between evolutionary rate (ER) and five protein properties (x) in ProteomeVis and the literature. X marks and aster- 100 50 10 isks above or below bar plots denote P-value ranges: xxx P -value < 10 ,xx < 10 ,x < 10 , *** < 0.001, ** < 0.01, * < 0.05. Length (Zhou et al., 2008), contact density (Zhou et al., 2008) and (mRNA) abundance (Zhang and Yang, 2015) correlations with evolutionary rate in the literature for both organisms are pre- sented. PPI degree and dosage tolerance correlations with evolutionary rate in the literature for S.cerevisiae are from Fraser et al. (2003) and Vavouri et al. (2009), respectively; for E.coli, not available (N/A) as of March 2018 significant (P-value < 0.01) for E.coli. To probe whether In this paper, we applied ProteomeVis to investigate two import- ProteomeVis’ consistency with Zhou et al. (2008) is due to having ant problems in evolutionary protein biology: the scale-free degree dis- the same contact definition, we define contacts as in Shakhnovich tribution of oPCUGs and the correlations between ER and protein (2006) (‘Materials and methods’ 2.1.3). The corresponding properties. These two topics span time and length scales in protein correlation for S.cerevisiae becomes insignificant—Spearman evolution, and we successfully recapitulated and extended previous correlation ¼ 0.00, P-value ¼ 0.99. Although we do not recapitulate studies. We report the previously unknown scaling constant for the significant negative correlation found in Shakhnovich (2006), S.cerevisiae’s protein universe graph. Interestingly, its oPCUG scaling we find that the significant positive correlation between contact constant fits the range predicted by the Big Bang model, which had density and ER in ProteomeVis is contingent on the employed con- only been applied to prokaryotic organisms (Deeds et al.,2004; tact definition. Dokholyan et al.,2002). This observation points to the mechanism of As seen in Figure 3, all statistically significant correlations in protein structure evolution being common among organisms since the ProteomeVis agree with those available in the literature, except last universal common ancestor. In our studies of potential ER deter- for the correlation between length and ER for E.coli. Although minants, protein abundance is the strongest correlate among proper- ProteomeVis agrees with Zhou et al. (2008) regarding correlations ties in ProteomeVis. Previously, the importance of mRNA abundance between contact density and ER, they vary regarding hadbeenhighlighted (Zhang and Yang, 2015), but our work joins the correlation between protein length and ER for E.coli. Zhou only one other study to demonstrate the importance of protein abun- et al. (2008) reported a significant positive correlation, while dance influencing ER (Drummond et al.,2006). ProteomeVis reports an insignificant negative correlation for E.coli. ProteomeVis captures 13% of the S.cerevisiae and 25% of Methodological differences do not explain the discrepancy. Zhou the E.coli proteomes as of March 2018. These percentages will grow et al. (2008) applied the same filter based on length to identify struc- as monthly updates are performed to incorporate newly deposited PDB tures in the respective organism’s proteome as in ProteomeVis. As structures. We plan to further expand ProteomeVis in future iterations expected, the length versus ER correlation for S.cerevisiae in by including more proteomes, such as those from Homo sapiens and ProteomeVis and Zhou et al. (2008) are consistent. Further studies Mus musculus, and adding genomic data, such as d and d . N S are required to elucidate why inconsistent length-ER correlation are seen in E.coli but not S.cerevisiae. Acknowledgements RMR acknowledges fruitful email communications with Professor Brenda 4 Conclusion Andrews about dosage tolerance data in Sopko et al. (2006), and Professor Proteomic data visualization tools are increasingly developed to fa- Jianzhi Zhang and Professor Jian-Rong Yang about evolutionary rate data in Zhang and Yang (2015). RMR would like to thank Will Jacobs for helpful cilitate understanding of complex and interwoven experimental data discussions surrounding data curation. All figures in the text and (Vizcaı ´no et al., 2015). Here, we present a novel web app called Supplementary Material, except for those of the ProteomeVis web app and ProteomeVis, accessible at http://proteomevis.chem.harvard.edu, to Supplementary Figure S3, were created using Python’s Matplotlib package. study protein structure and sequence evolution simultaneously across organisms’ proteomes. Data downloaded from three pub- lished papers, programmatically accessed from four databases and Funding generated by running three software packages are organized into This work was supported by National Institutes of Health [2R01GM068670-13 ProteomeVis’ four panels. Interacting with the panels gives users to E.I.S.]; and the National Science Foundation Graduate Research Fellowships quick insight into the underlying data. Once more detailed hypothe- Program [2013139945 to A.I.G.]. ses are formulated throughout their web sessions, users can freely download the data to further probe on their own. Conflict of Interest: none declared. 3564 R.M.Razban et al. Greenbaum,D. et al. (2003) Comparing protein abundance and mRNA ex- References pression levels on a genomic scale. Genome Biol., 4, 117. Albert,R. and Baraba ´ si, A.L. (2000) Topology of evolving networks: local Holm,L. and Sander,C. (1993) Protein structure comparison by alignment of events and universality. Phys. Rev. Lett., 85, 5234–5237. distance matrices. J. Mol. Biol., 233, 123–138. Albert, R. and Baraba ´ si, A.L. (2002) Statistical mechanics of complex net- Ingolfsson,H. and Yona,G. (2008). Protein domain prediction. In: Kobe,B. works. Rev. Mod. Phys., 74, 47–97. et al. (eds.) Struct. Proteomics High-Throughput Methods, vol. 426, chapter Arike, L. et al. (2012) Comparison and applications of label-free absolute 7. Humana Press, Totowa, NJ, pp. 117–43. proteome quantification methods on Escherichia coli. J. Proteomics, 75, Ispolatov,I. et al. (2005) Binding properties and evolution of homodimers 5437–5448. in protein–protein interaction networks. Nucleic Acids Res., 33, Bergga ˚ rd,T. et al. (2007) Methods for the detection and analysis of protein– 3629–3635. protein interactions. Proteomics, 7, 2833–2842. Kitagawa,M. et al. (2005) Complete set of ORF clones of Escherichia coli Berman,H.M. et al. (2000) The protein data bank. Nucleic Acids Res., 28, ASKA library (A complete set of E. coli K-12 ORF archive): unique resour- 235–242. ces for biological research. DNA Res., 12, 291–299. Bershtein,S. et al. (2017) Bridging the physical scales in evolutionary biology: Koonin,E.V. et al. (2002) The structure of the protein universe and genome from protein sequence space to fitness of organisms and populations. Curr. evolution. Nature, 420, 218–223. Opin. Struct. Biol., 42, 31–40. Krapivsky,P.L. et al. (2000) Connectivity of growing random networks. Phys. Bloom,J.D. and Adami,C. (2003) Apparent dependence of protein evolution- Rev. Lett., 85, 4629–4632. ary rate on number of interactions is linked to biases in protein–protein Lukatsky,D.B. et al. (2007) Structural similarity enhances interaction propen- interactions data sets. BMC Evol. Biol., 3, 21. sity of proteins. J. Mol. Biol., 365, 1596–1606. Bloom,J.D. et al. (2006) Structural determinants of the rate of protein evolu- Nepomnyachiy,S. et al. (2015) CyToStruct: augmenting the network visualiza- tion in yeast. Mol. Biol. Evol., 23, 1751–1761. tion of CyToStruct with the power of molecular viewers. Structure, 23, Caetano-Anolle ´ s,G. et al. (2009) The origin, evolution and structure of the 941–948. protein world. Biochem. J., 417, 621–637. Orchard,S. et al. (2014) The MIntAct project – IntAct as a common curation Chibucos,M.C. et al. (2014) Standardized description of scientific evidence platform for 11 molecular interaction databases. Nucleic Acids Res., 42, using the Evidence Ontology (ECO). Database J. Biol. Databases Curation, D358–D363. 2014, 1–11. Pa ´ l,C. et al. (2001) Highly expressed genes in yeast evolve slowly. Genet. Soc. Choi,J.M. et al. (2017) Graph’s topology and free energy of a spin model on Am., 158, 927–931. the graph. Phys. Rev. Lett., 118, 1–5. Qian,J. et al. (2001) Protein family and fold occurrence in genomes: Chothia,C. and Lesk,A.M. (1986) The relation between the divergence of se- power-law behaviour and evolutionary model. J. Mol. Biol., 313, 673–681. quence and structure in proteins. EMBO J., 5, 823–826. Roland,C.B. and Shakhnovich,E.I. (2007) Divergent evolution of a structural Chung,S.Y. and Subbiah,S. (1996) A structural explanation for the twilight proteome: phenomenological models. Biophys. J., 92, 701–716. zone of protein sequence homology. Structure, 4, 1123–1127. Rost,B. (1999) Twilight zone of protein sequence alignments. Protein Eng. Cock,P.J.A. et al. (2009) Biopython: freely available Python tools for compu- Des. Sel., 12, 85–94. tational molecular biology and bioinformatics. Bioinformatics, 25, Sander,C. and Schneider,R. (1991) Database of homology-derived protein 1422–1423. structures and the structural meaning of sequence alignment. Proteins Cokelaer,T. et al. (2013) BioServices: a common Python package to access bio- Struct. Funct. Bioinf., 9, 56–68. logical web services programmatically. Bioinformatics, 29, 3241–3242. Shakhnovich,B.E. (2006) Relative contributions of structural designability Cooper,G.M. (2000) Cells as experimental models. In: Cell: A Molecular and functional diversity in molecular evolution of duplicates. Approach, chapter 1, 2nd edn. Sinauer Associates, Sunderland, MA. Bioinformatics, 22, e440–e445. Deeds,E.J. and Shakhnovich,E.I. (2010) A structure-centric view of protein Shakhnovich,B.E. (2005) Protein structure and evolutionary history determine evolution, design, and adaptation. In: Toone,E.J. (ed.) Advances in sequence space topology. Genome Res., 15, 385–392. Enzymology and Related Areas of Molecular Biology: Protein Evolution, Sikosek,T. and Chan,H.S. (2014) Biophysics of protein evolution and evolu- vol. 75. John Wiley and Sons, Inc., Hoboken, NJ, pp. 133–191. tionary protein biophysics. J. R. Soc. Interface, 11, 20140419. Deeds,E.J. et al. (2004) Proteomic traces of speciation. J. Mol. Biol., 336, Sopko,R. et al. (2006) Mapping pathways and phenotypes by systematic gene 695–706. overexpression. Mol. Cell, 21, 319–330. Dokholyan,N.V. and Shakhnovich,E.I. (2001) Understanding hierarchical Szklarczyk,D. and Jensen,L.J. (2015) Protein–protein interaction databases. protein evolution from first principles. J. Mol. Biol., 312, 289–307. In: Meyerkord,C.L. and Fu,H. (eds.) Protein-Protein Interactions, vol. Dokholyan,N.V. et al. (2002) Expanding protein universe and its origin from 1278. Humana Press, New York, NY, pp. 39–56. the biological Big Bang. Proc. Natl. Acad. Sci. USA, 99, 14132–14136. The UniProt Consortium (2017) UniProt: the universal protein knowledge- Douglas,A.C. et al. (2012) Functional analysis with a barcoder yeast gene base. Nucleic Acids Res., 45, D158–D169. overexpression system. G3 Genes Genomes Genet., 2, 1279–1289. Vavouri,T. et al. (2009) Intrinsic protein disorder and interaction promiscuity Drummond,D.A. and Wilke,C.O. (2008) Mistranslation-induced protein mis- are widely associated with dosage sensitivity. Cell, 138, 198–208. folding as a dominant constraint on coding-sequence evolution. Cell, 134, Velankar,S. et al. (2013) SIFTS: structure Integration with Function, 341–352. Taxonomy and Sequences resource. Nucleic Acids Res., 41, D483–D489. Drummond,D.A. et al. (2005) Why highly expressed proteins evolve slowly. Vizcaı´no,J.A. et al. (2015) Proteomics Data Visualisation [Special Issue]. Proc. Natl. Acad. Sci. USA, 102, 14338–14343. Proteomics, 15, 1339–1456. Drummond,D.A. et al. (2006) A single determinant dominates the rate of yeast Vogel,C. and Marcotte,E.M. (2012) Insights into the regulation of protein protein evolution. Mol. Biol. Evol., 23, 327–337. abundance from proteomic and transcriptomic analyses. Nat. Rev. Genet., England,J.L. and Shakhnovich,E.I. (2003) Structural determinant of protein designability. Phys. Rev. Lett., 90, 218101. 13, 227–232. Fraser,H.B. et al. (2003) A simple dependence between protein evolution rate Wall,D.P. et al. (2005) Functional genomic analysis of the rates of protein evo- lution. Proc. Natl. Acad. Sci. USA, 102, 5483–5488. and the number of protein–protein interactions. BMC Evol. Biol., 3,6. Wang,M. et al. (2012) PaxDb, a database of protein abundance averages Ghaemmaghami,S. et al. (2003) Global analysis of protein expression in yeast. Nature, 425, 737–741. across all three domains of life. Mol. Cell. Proteomics, 11, 492–500. Gilson,A.I. et al. (2017) The role of evolutionary selection in the dynamics of Wright,C.F. et al. (2005) The importance of sequence diversity in the aggrega- protein structure evolution. Biophys. J., 112, 1350–1365. tion and evolution. Nature, 438, 878–881. ProteomeVis: a web app for comprehensive protein evolution exploration 3565 Yoav,B. and Hochberg,Y. (1995) Controlling the false discovery rate: a practical Zhang,Y. and Skolnick,J. (2004) Scoring function for automated assessment and powerful approach to multiple testing. J. R. Stat. Soc., 57, 289–300. of protein structure template quality. Proteins Struct. Funct. Genet., 57, Zeldovich,K.B. and Shakhnovich,E.I. (2008) Understanding protein evolution: 702–710. from protein physics to Darwinian selection. Annu. Rev. Phys. Chem., 59, Zhang,Y. and Skolnick,J. (2005) TM-align: a protein structure alignment 105–127. algorithm based on the TM-score. Nucleic Acids Res., 33, 2302–2309. Zhang,J. and Yang,J.R. (2015) Determinants of the rate of protein sequence Zhou,T. et al. (2008) Contact density affects protein evolutionary rate from evolution. Nat. Rev. Genet., 16, 409–420. bacteria to animals. J. Mol. Evol., 66, 395–404.

Journal

BioinformaticsOxford University Press

Published: May 8, 2018

References