Abstract Motivation Protein evolution spans time scales and its effects span the length of an organism. A web app named ProteomeVis is developed to provide a comprehensive view of protein evolution in the Saccharomyces cerevisiae and Escherichia coli proteomes. ProteomeVis interactively creates protein chain graphs, where edges between nodes represent structure and sequence similarities within user-defined ranges, to study the long time scale effects of protein structure evolution. The short time scale effects of protein sequence evolution are studied by sequence evolutionary rate (ER) correlation analyses with protein properties that span from the molecular to the organismal level. Results We demonstrate the utility and versatility of ProteomeVis by investigating the distribution of edges per node in organismal protein chain universe graphs (oPCUGs) and putative ER determinants. S.cerevisiae and E.coli oPCUGs are scale-free with scaling constants of 1.79 and 1.56, respectively. Both scaling constants can be explained by a previously reported theoretical model describing protein structure evolution. Protein abundance most strongly correlates with ER among properties in ProteomeVis, with Spearman correlations of –0.49 (P-value < 10−10) and –0.46 (P-value < 10−10) for S.cerevisiae and E.coli, respectively. This result is consistent with previous reports that found protein expression to be the most important ER determinant. Availability and implementation ProteomeVis is freely accessible at http://proteomevis.chem.harvard.edu. Supplementary information Supplementary data are available at Bioinformatics online. 1 Introduction Protein evolution spans time scales, from the billions of years during which the universe of extant protein structures emerged to the fraction of microseconds in which protein sequences mutate (Caetano-Anollés et al., 2009). Protein evolution also spans length scales, from a protein’s length and stability, to its copy number and interactions with other proteins, and to its influence on organismal fitness (Bershtein et al., 2017; Sikosek and Chan, 2014; Zeldovich and Shakhnovich, 2008). Accordingly, protein evolution is studied through lines of inquiry usually separated by scale. One approach takes a long time scale view, inferring evolutionary dynamics of proteins by ordering extant proteins into a graph based on relationships between their structures and sequences (Deeds et al., 2004; Dokholyan et al., 2002; Koonin et al., 2002; Qian et al., 2001). Dokholyan et al. (2002) constructed the graph by representing protein domains as nodes and connecting pairs of nodes with edges only if their structure similarities, measured by DALI Z-scores (Holm and Sander, 1993), exceed 9 and their sequence identities (SIDs) are less than 0.25. A Z-score of 9 was found to differentiate protein pairs with similar (Z-score > 9) or different (Z-score < 9) structures (Dokholyan et al., 2002). A SID of 0.25 originates from numerous studies finding that two aligned protein sequences sharing greater than a SID value ranging from 0.2 to 0.4, are likely to share the same structure (Chothia and Lesk, 1986; Chung and Subbiah, 1996; Rost, 1999; Sander and Schneider, 1991). Such sequence pairs are called homologous. Strikingly, the distribution of edges per node in this non-homologous protein domain graph, dubbed the protein domain universe graph (PDUG), scales as k−1.6, where k stands for degree (number of edges per node) and 1.6 is the scaling constant (Dokholyan et al., 2002). Scale-free graphs were first shown to form by a process of preferential attachment, but models of this type predict scaling constants greater than 2 (Albert and Barabási, 2000, 2002; Krapivsky et al., 2000). For the PDUG, an alternative model called Big Bang was developed, where new structures originate from gene duplication events followed by significant divergence in sequence. Simulations of the Big Bang model generated scale-free degree distributions with scaling constants ranging from 1.4 to 1.9. The average over 100 independent simulations yielded a scaling constant of 1.6, matching that observed in nature (Dokholyan et al., 2002). Further studies found PDUG sub-graphs built from proteins belonging to 59 prokaryotic organisms’ proteomes are also scale-free (Deeds et al., 2004; Roland and Shakhnovich, 2007). These sub-graphs, called organismal PDUGs (oPDUGs), have scaling constants ranging from 1.3 to 1.9, closely matching the extent of the Big Bang model (Deeds et al., 2004). The Big Bang model’s successes in predicting scaling constants, indicate that the universe of extant protein structures emerged by a process of divergent evolution (Deeds and Shakhnovich, 2010). A complementary approach to protein evolution takes a shorter time scale view, inferring determinants of sequence evolutionary rate (ER) by correlating ER with other protein properties (Zhang and Yang, 2015). A complex cellular environment determines the fitness effects of mutations that arise randomly in genes. Diverse and sometimes countervailing selection pressures channel the course of a protein’s evolution (Bershtein et al., 2017; Sikosek and Chan, 2014; Zeldovich and Shakhnovich, 2008). Over the years in which organisms diverge from their common ancestor, selection pressures aggregate in each protein’s ER. ERs span three orders of magnitude (Zhang and Yang, 2015) and by examining which properties correlate with ER, previous work attempted to extract the major constraints of protein evolution. Protein expression was found to be the most prominent constraint. Quantified by mRNA abundance because it was more tractable to measure than protein abundance (Vogel and Marcotte, 2012), protein expression correlates negatively with ER across all three kingdoms of life, with correlation coefficients exceeding those observed for other quantities (Drummond et al., 2005, 2006; Drummond and Wilke, 2008; Pál et al., 2001; Zhang and Yang, 2015). Much effort has been expended to identify other protein properties that correlate with ER, independent of protein expression. Protein length and contact density were found to correlate positively with ER in Escherichia coli, Saccharomyces cerevisiae and Drosophila melanogaster (Bloom et al., 2006; Zhou et al., 2008), while the number of protein–protein interaction partners (PPI degree) was reported to correlate negatively with ER in S.cerevisiae (Fraser et al., 2003). However, the evidence is not incontrovertible. In S.cerevisiae, a significant negative correlation was found between contact density and ER (Shakhnovich, 2006), and no correlation was found between PPI degree and ER when carefully controlling for protein expression (Bloom and Adami, 2003). By juxtaposing protein structure graphs and ER correlation analyses, insights into one time scale of protein evolution can be gained by information from the other. For example, not only do proteins with high contact densities tend to evolve more quickly at the sequence level than their less compact counterparts according to Zhou et al. (2008), they also tend to have larger structure neighborhoods (Shakhnovich, 2005), defined as the number of different sequences encoding a structure. This is because selection for protein stability, roughly approximated by contact density, biases evolution towards exploring closely related structures (Gilson et al., 2017). An example that demonstrates long time scale structural effects explaining sequence evolution constraints: proteins with related structures (Lukatsky et al., 2007) or sequences (Wright et al., 2005) have a statistically enhanced likelihood of forming stable PPIs. This is most evident by the prevalence of homodimers and superfamily of dimers in nature (Ispolatov et al., 2005). In this work, protein chain graphs are merged with correlation analyses of ER determinants into an interactive visualization tool called ProteomeVis. ProteomeVis’ graph-theoretical approach takes a global view of protein evolution over long time scales to capture structure evolution, while ER correlation analyses span relatively short time scales. ProteomeVis joins the growing list of visualization tools developed to facilitate understanding of complex and interwoven experimental data (Vizcaíno et al., 2015). What makes ProteomeVis unique is the disparate data it integrates. A tool already exists that enables users to interactively create protein graphs based on sequence and structure similarities (Nepomnyachiy et al., 2015). However, no tool, to the best of our knowledge, includes protein properties from the molecular to the organismal level, as well as sequence and structure similarities to create protein graphs, providing a comprehensive overview of protein evolution. The roadmap for the remainder of the text is as follows. After describing how we collect the underlying data and incorporate it in the web app (‘Materials and methods’ section), we demonstrate the utility of ProteomeVis by recapitulating and expanding upon previous literature results regarding protein universe graphs and ER determinants (Results). The Results section highlights ProteomeVis’ versatility, illustrating how it facilitates evolutionary studies of protein structure and sequence. 2 Materials and methods 2.1 Data curation To access the in-house Python scripts that manage ProteomeVis’ data and create the web app’s static files, please refer to https://github.com/rrazban/proteomevis_scripts. ProteomeVis currently contains S.cerevisiae and E.coli proteomes. These two organisms have the largest number of protein structures deposited in the Protein Data Bank (PDB) (Berman et al., 2000), accessed in March 2018, per proteome size. In addition, S.cerevisiae and E.coli have a plethora of experimentally measured protein property data since they are model organisms (Cooper, 2000). Future iterations of ProteomeVis will include more proteomes, such as those from Homo sapiens and Mus musculus. Data are collected at the level of protein chains. Although most structure evolution studies and some sequence evolution studies analyze protein domains, the advantage of chains is that they correspond directly to genes. Protein chain identification is straightforward, while protein domain identification is dependent on the subjective definition chosen (Ingolfsson and Yona, 2008). In addition, the protein chain is the smallest level at which cellular-wide protein properties such as abundance, protein–protein interaction and dosage tolerance can be measured. For the remainder of the text, the word ‘protein’ is used to mean ‘protein chain’. 2.1.1 Protein structure identification The S.cerevisiae S288c proteome has 6049 proteins and the E.coli K-12 proteome has 4306, manually verified by the Universal Protein Resource (UniProt) (The UniProt Consortium, 2017) and last modified on February 19, 2017 and January 5, 2018, respectively. We develop a procedure that draws upon existing resources to match S.cerevisiae S288c and E.coli K-12 sequences in UniProt with S.cerevisiae and E.coli structures in the PDB. First, the Structure Integration with Function, Taxonomy and Sequences resource (SIFTS) (Velankar et al., 2013) is employed to identify protein sequences with potentially matching structures. Briefly, SIFTS assigns PDB structures to UniProt sequences with a minimum SID of 0.9 and the source organism being zero, one or two levels up the species level in the taxonomy tree (Velankar et al., 2013). If multiple structures are found for a given sequence, the structure with the largest sequence coverage is chosen. If multiple structures have the largest coverage, the structure solved at the lowest resolution is chosen. Second, protein structures are downloaded from the PDB and their lengths are calculated. Structures are kept if their lengths are at least 80% of that listed for protein chains in UniProt. This step filters out those protein structures that align well with only a small portion of the protein chain, e.g. a single domain from a multidomain protein. Running through the outlined steps yields 778 S.cerevisiae and 1099 E.coli proteins. ProteomeVis includes protein property data for only these proteins. Monthly updates are performed to incorporate modifications to reported proteomes in UniProt and additional structures deposited in the PDB. 2.1.2 Structure similarity and sequence identity The template modeling score (TM-score) is used as a metric for structure similarity. TM-score and SID for all intra-organism protein pairs of S.cerevisiae and E.coli are calculated by the TM-align software (Zhang and Skolnick, 2005), freely downloadable from Yang Zhang’s Research Group website. Briefly, TM-align iteratively aligns the two protein structures until its TM-score is maximized and remains stable between iterations (Zhang and Skolnick, 2005). Both TM-score and SID vary from 0 to 1, where 0 signifies complete inequality and 1 signifies identity. TM-score is proportional to the sum over aligned residue pair distance-dependent weights. TM-score is independent of alignment length, an advantage over other structure similarity scores (Zhang and Skolnick, 2004). SID is the fraction of aligned residues determined by the TM-align software that have the same amino acid identity. There are N*(N−1)/2 unique pairs, equating to 302253 S.cerevisiae and 603351 E.coli protein pairs in ProteomeVis. 2.1.3 Length and contact density A protein’s length is taken from UniProt. The contact density is the total number of residue-residue contacts in the PDB structure file, normalized by its length, as determined by the number of residues considered in the calculation of contacts. Two residues are defined to be in contact if any non-hydrogen atom of the residue is within 4.5 Angstroms (Å) from another non-hydrogen atom of a non-adjacent residue (Bloom et al., 2006; Zhou et al., 2008). In Results 3.2, we also explore defining contacts as non-adjacent residue pairs having Cβ ( Cα for glycine) distances <7.5 Å (Shakhnovich, 2006). The Biopython module (Cock et al., 2009) is employed to parse PDB files and calculate contact density. 2.1.4 Protein abundance Protein abundance per cell was measured in Ghaemmaghami et al. (2003) and Arike et al. (2012) for S.cerevisiae and E.coli K-12 MG1655, respectively. Abundance data are downloaded from the Protein Abundances Across Organisms database (PaxDb) (Wang et al., 2012). They cover 64% and 95% of the S.cerevisiae and E.coli K-12 UniProt proteomes. They are the largest and most accurate single-sourced datasets for their respective proteome (PaxDb accessed in March 2018). To allow for fair comparison between organisms, abundance data in ProteomeVis are presented as protein abundance per cell divided by the total protein abundance per cell for the respective organism. 2.1.5 Protein–protein interaction PPI partners are downloaded from the IntAct database (Orchard et al., 2014) via the Bioservices Python module (Cokelaer et al., 2013). IntAct has the largest number of experimentally determined PPIs among PPI databases (Szklarczyk and Jensen, 2015). Only interactions identified by affinity purification methods for S.cerevisiae S288c and E.coli K-12 are included. Affinity purification methods capture more stable interactions, rather than transient interactions that can vary greatly between independent experiments (Berggård et al., 2007). 2.1.6 Dosage tolerance Dosage tolerance was measured in Douglas et al. (2012) for S.cerevisiae and in Kitagawa et al. (2005) for E.coli K-12. Douglas et al. (2012) reported dosage tolerance, called overexpression toxicity in their paper, as a logarithmic base 2 (log2) ratio between cell growth when the gene of interest is overexpressed compared to that when the gene is expressed at wild-type levels. Dosage sensitive genes are defined as those with log2 ratios less than -1. Kitagawa et al. (2005) reported dosage tolerance, called growth inhibitory effects in their paper, as an integer value of 1, 2, or 3. One signifies no growth, two signifies some growth and three signifies wild-type growth when the gene is overexpressed. All studied genes, regardless of their wild-type expressions, are overexpressed to the same elevated copy number for both studies. The most recent and least invasive dosage tolerance measurement for S.cerevisiae is from Douglas et al. (2012). Kitagawa et al. (2005) is the only known study that measured dosage tolerance for E.coli. 2.1.7 Evolutionary rate ER data for S.cerevisiae S288c and E.coli K-12 MG1655 are taken from Zhang and Yang (2015). Briefly, orthologous proteins for S.cerevisiae S288c were identified in Saccharomyces baynus and for E.coli K-12 MG1655, in Salmonella typhimurium LT2. ER was calculated as the SID between orthologous proteins aligned by sequence. This approximates nonsynonymous substitutions per nonsynonymous site ( dN), a more popular metric for ER. Indeed, the correlation between ER from Zhang and Yang (2015) and dN from Wall et al. (2005) for S.cerevisiae is very strong—Spearman correlation = 0.98, P-value< 10−300 (Supplementary Fig. S1). We will add dN and dS data in future iterations of ProteomeVis. 2.2 Web app The ProteomeVis web app can be accessed through any web browser at http://proteomevis.chem.harvard.edu. To access the source code, refer to https://github.com/rrazban/proteomevis. ProteomeVis is implemented in the Django web framework and hosted on Amazon Web Services. If users wish to scrutinize their own data using ProteomeVis, they can do so by altering the data files from which the database is created and deploying the web app locally (see GitHub repository for more details). The ProteomeVis web app layout comprises a top bar—called the Control Strip (Fig. 1A), two left panels—Edge Filtering (1B) and Protein Chain Graph (PCG) (1C), one center panel—Protein Inspection (1D) and one right panel—Scatter Plot Matrix (SPloM) (1E). Fig. 1. View largeDownload slide The ProteomeVis web app after exploration has occurred. Hovering over asterisks located in the top-right corner of each panel opens an informational window. Pressing the help button (question mark symbol) located to the right of the Control Strip (A) leads to a quick tour. Alphabetical labels (A–E) are added to each component of ProteomeVis in this figure to introduce the tool in the text Fig. 1. View largeDownload slide The ProteomeVis web app after exploration has occurred. Hovering over asterisks located in the top-right corner of each panel opens an informational window. Pressing the help button (question mark symbol) located to the right of the Control Strip (A) leads to a quick tour. Alphabetical labels (A–E) are added to each component of ProteomeVis in this figure to introduce the tool in the text 2.2.1 Control strip The Control Strip (Fig. 1A) controls the overall data presented. From left to right, the user can (i) select which organism’s proteome to visualize (ii) enter the TM-score and SID ranges that define a graph in the PCG panel (iii) select the differentially colored property in the PCG panel in log10 units (except for dosage tolerance which has no additional log transformation) (iv) click the download button to obtain spreadsheets of the underlying data and (v) click the help button for a quick tour of the web app. In Figure 1, the Control Strip is set up such that the S.cerevisiae proteome is visualized. Protein pairs with TM-score greater than 0.5 and SID less than 0.25 are connected by an edge and nodes are differentially colored by degree in the PCG. 2.2.2 Edge filtering The Edge Filtering panel (Fig. 1B) displays the TM-score versus SID scatter plot for all protein pairs in the designated proteome. As an alternative to entering values into the TM-score and SID fields in the Control Strip, users can define TM-score and SID ranges by brushing over the TM-score versus SID plot. Each point corresponds to a potential edge in the PCG. All edges within the brushed region are rendered in the PCG panel below the Edge Filtering panel. In addition, datapoints colored green denote protein pairs that physically interact according to the IntAct database. 2.2.3 Protein chain graph Upon setting TM-score and SID ranges, nodes and edges appear in the PCG panel (Fig. 1C). A node represents an individual protein, while an edge signifies that the two proteins it connects have a TM-score and SID within the user-defined range. Only nodes with at least one edge are shown in the PCG for visual clarity. The horizontal bar at the top of the panel allows users to gauge the fraction of nodes displayed (in black). As in the Edge Filtering panel, edges colored green denote two physically interacting proteins. To aid in tracking new edges upon successive definition of TM-score and SID ranges, edges that do not appear in the first set of defined ranges for that organism appear in red. The node’s color is given by the gray scale shown in the Control Strip. In Figure 1, nodes are colored by degree. Nodes in small clusters tend to be light in color as they have low degrees, while nodes in large clusters are darker because they are typically connected to multiple other nodes. When the user hovers over a node, the protein structure and PDB identifier appear at the top left of the panel. When the user hovers over an edge, the TM-score, SID and alignment length appear at the top left of the panel. If the user clicks on the node, detailed information about the protein appears in the Protein Inspection panel. 2.2.4 Protein inspection The Protein Inspection panel (Fig. 1D) invites users to examine the properties of any protein in ProteomeVis. Clicking on a node in the PCG or searching for a protein in the search bar populates the Individual tab of the Protein Inspection panel with the corresponding protein complex and its chains, PDB identifier and associated genes. Clicking the plus button provides information on the protein complex’ function, as well as PubMed identifiers and Evidence codes (ECOs) (Chibucos et al., 2014) for tracking references, if available. The PDB identifier and gene name are clickable links to its PDB and UniProt web page, respectively. Clicking the image of the protein structure opens a window with the image enlarged and values of protein properties below it. The Cluster tab in the Protein Inspection panel makes it easy for users to peruse all proteins in a connected component in the PCG (Supplementary Fig. S2). The cluster size is the number of nodes comprising the cluster in the PCG. The tab displays a histogram of cluster sizes, for all cluster sizes larger than 2. In the histogram, each cluster is represented by a square. The gray scale color of each square in the histogram denotes the average degree (default) or other property defined by the user in the Control Strip. All proteins belonging to a cluster can be displayed by clicking on the square representing the cluster, which displays a header below, and then clicking this header (Supplementary Fig. S2). Protein chains identified in the Individual and Cluster tabs are given a color, which allows them and data associated with them to be identified in the PCG and SPloM panels, respectively. In these tabs, the color is displayed in the protein chain letter underneath the image of the protein structure. In Figure 1, 2m80 chain A is colored orange. Its representative node in the PCG panel and its associated datapoints in the SPloM panel are also orange. If the default color assigned is not appealing, the user can cycle through different colors by clicking the protein chain letter. Fig. 2. View largeDownload slide Degree distribution of the organismal protein chain universe graph (oPCUG) with its best-fit line on a log-log plot. The slope of the best-fit line (scaling constant, γ) with its standard error and the Pearson correlation coefficient (r) with its P-value in parenthesis are calculated by Python SciPy’s linear regression function. Less S.cerevisiae datapoints are seen because its maximal degree is less than E.coli’s maximal degree. All degrees are horizontally shifted by 1 unit to allow viewing of orphans (degree = 0) and be consistent with Deeds et al. (2004) in evaluating the scaling constant Fig. 2. View largeDownload slide Degree distribution of the organismal protein chain universe graph (oPCUG) with its best-fit line on a log-log plot. The slope of the best-fit line (scaling constant, γ) with its standard error and the Pearson correlation coefficient (r) with its P-value in parenthesis are calculated by Python SciPy’s linear regression function. Less S.cerevisiae datapoints are seen because its maximal degree is less than E.coli’s maximal degree. All degrees are horizontally shifted by 1 unit to allow viewing of orphans (degree = 0) and be consistent with Deeds et al. (2004) in evaluating the scaling constant 2.2.5 Scatter plot matrix The SPloM panel (Fig. 1E) allows users to quickly identify statistically significant correlations between pairs of protein properties. ProteomeVis includes eight protein properties. Degree: number of proteins sharing an edge in the PCG, e.g. four for the gray node in Supplementary Figure S3A. Weighted degree: its abundance plus the sum of protein abundances of proteins for which it shares an edge with in the PCG, e.g. 12 for the gray node in Supplementary Figure S3. Length: number of residues. Contact density: average number of contacts per residue in the PDB structure. Abundance: normalized protein copy number per cell in parts per million. PPI degree: number of stable protein–protein interaction partners. Dosage tolerance: relative insensitivity of the cell’s growth rate to the overexpression of a protein. Evolutionary rate: SID between orthologous proteins. Not all proteins in ProteomeVis have recorded values for all its properties. The last four in the list are obtained from external sources as described in ‘Materials and methods’ 2.1, and some have incomplete overlap with proteins covered in ProteomeVis (Supplementary Fig. S4). Each of the 8*7/2=28 graphs in the panel, except for those involving dosage tolerance, is a log-log scatter plot of one trait against another. Values of 0 are approximated by an order of magnitude smaller than the second smallest possible value for the respective protein property. For example, the second smallest value for degree is 1, therefore 0 is approximated by 0.1 in the log scale. The scatter plots only involving the last six properties in the list are static, i.e. no matter the TM-score and SID ranges chosen, the plots remain the same. Only plots involving degree and weighted degree change each time a new TM-score and SID selection is made, such that they always reflect the current PCG. As described above, nodes with no edges, called orphans, are not displayed in the PCG. However, data associated with orphans are always shown in the SPloM, including their degree (i.e. zero, approximated by 0.1 in the log scale) and weighted degree (i.e. their abundance). All plots in the SPloM have a background color determined by the P-value color bar at the top of the panel. Red corresponds to a P-value < 0.01; black, P-value = 0.05; and white, P-value > 0.1. If the Spearman correlation coefficient is less than 0.15, no matter how small the P-value, the plot will be colored white. Results from correlation analysis can be obtained by hovering over a plot in the SPloM. Clicking on a specific scatter plot enlarges it in the bottom right corner of the SPloM panel for closer inspection and presents its corresponding correlation analysis results to its left. In Figure 1, the evolutionary rate versus abundance plot is enlarged. 2.2.6 Data download At any point, users can download the data making up ProteomeVis by clicking the download button located to the right of the Control Strip. As shown in Supplementary Figure S5, there are three main options. First, users can download a spreadsheet of individual protein property values, selecting which of the eight properties they desire. Second, they can choose to download data on the relationships between protein pairs, either the edges that fulfill the set TM-score and SID ranges (‘Edges in current range’) or all edges. Edge files include TM-scores and SIDs for each protein pair, as well as the length of the alignment and whether they form a PPI. Third, in the Correlations tab, users can download matrices of correlation analyses values presented in the SPloM panel. In addition to downloading from the Control Strip, ProteomeVis offers a fourth option that enables users to download data associated with specific protein clusters. In the Cluster tab of the Protein Inspection panel, clicking on a cluster, hovering over its header line and clicking the download icon that appears to the left, downloads the data. In Supplementary Figure S2, the download button is not seen but if the user was to hover over the header line with their cursor, the download button would appear to the left of the 11 label. We encourage users to download data to rigorously test any hypotheses. Simply scanning the SPloM panel for significant correlations is prone to the multiple testing problem and may lead to incorrect conclusions (Yoav and Hochberg, 1995). 3 Results 3.1 Degree distribution of the organismal protein chain universe graph A characteristic behavior seen in the PDUG compiled across (Dokholyan et al., 2002) as well as within organisms (oPDUGs) (Deeds et al., 2004; Roland and Shakhnovich, 2007) is that they are scale-free, meaning that the probability to find a node with degree k scales as a power law of k with a scaling constant, γ: P(k)∼k−γ. This behavior, especially for oPDUGs, provides support that protein structures evolutionarily diverge rather than converge. In Figure 2, ProteomeVis reproduces the scale-free behavior for the S.cerevisiae and E.coli oPCUGs. The oPCUGs are composed of protein pairs from the respective organism with TM-scores greater than 0.5 and SIDs less than 0.25. For an indepth discussion on threshold and cutoff values, and the scaling constant’s robustness, refer to Supplementary Material 3.1a. To create Figure 2, the TM-score and SID ranges are entered in the Control Strip of the ProteomeVis web app and then the degree data is downloaded via option 1 as described in ‘Materials and methods’ 2.2.6. The slope of the best-fit line in a log-log plot is the additive inverse of the scaling constant. Both organisms exhibit highly significant scale-free behavior, with Pearson correlation coefficients < –0.9 and corresponding P-values< 10−8 (Fig. 2). E.coli’s oPCUG scaling constant is 1.56±0.10, which matches Deeds et al. (2004)’s measurement of 1.56 for E.coli’s oPDUG. We hypothesize that this agreement results from protein domains within protein chain pairs dominating alignments and leave it for a future study. S.cerevisiae’s scaling constant is 1.79±0.18. No corresponding protein universe graph result is present in the literature for S.cerevisiae. Qian et al. (2001) and Koonin et al. (2002) demonstrated that the distribution of S.cerevisiae domains exhibiting the same structure classified by SCOP is scale-free. However, Dokholyan et al. (2002) showed that this distribution is one of cluster sizes, making its scale-free behavior simply a result of random graphs and not indicative of any unique mechanism behind protein structure evolution. Both organisms exhibit oPCUG scaling constants that fall within the 1.4–1.9 range of scaling constants output by Big Bang model simulations (Dokholyan and Shakhnovich, 2001), described in the Introduction. This result has interesting implications for the generality of protein structure evolution. Although S.cerevisiae and E.coli belong to different kingdoms of life, their oPCUG scaling constants are captured by the Big Bang model, indicating that the mechanism behind protein structure evolution may be inherent to the last universal common ancestor. 3.2 Putative evolutionary rate determinants Multiple studies have determined protein expression to be the strongest predictor of ER (Drummond et al., 2005, 2006; Drummond and Wilke, 2008; Pál et al., 2001; Zhang and Yang, 2015). Quantified by mRNA abundance, protein expression correlates negatively with ER. There has been continuous debate as to whether other properties impact ER. Figure 3 summarizes the correlations between ER and all five non-PCG properties in ProteomeVis, alongside their corresponding correlations in the literature, if available. The scatter plots from which correlation coefficients and P-values are calculated, can be seen in the SPloM panel of the ProteomeVis web app. Protein abundance has the largest magnitude correlation coefficient at –0.49 for S.cerevisiae and –0.46 for E.coli, with the lowest P-values (<10−10), in agreement with protein expression results. Previous studies have mostly focused on the correlation between mRNA abundance and ER because measuring protein abundance across the proteome was more laborious (Vogel and Marcotte, 2012). Just one other study examined the protein abundance-ER relationship in S.cerevisiae, and they reported a strong negative correlation (Drummond et al., 2006). To our knowledge, no other study has performed this analysis for E.coli. Our results expand the scope of previous protein expression studies by demonstrating that a protein level property, protein abundance, strongly correlates with ER for both S.cerevisiae and E.coli. This result is nontrivial because the correlation between mRNA and protein abundance is not very strong (Greenbaum et al., 2003; Vogel and Marcotte, 2012). Fig. 3. View largeDownload slide Spearman correlation coefficients (ρ) between evolutionary rate (ER) and five protein properties (x) in ProteomeVis and the literature. X marks and asterisks above or below bar plots denote P-value ranges: xxx ≡P‐value<10−100, xx <10−50, x <10−10, *** < 0.001, ** < 0.01, * < 0.05. Length (Zhou et al., 2008), contact density (Zhou et al., 2008) and (mRNA) abundance (Zhang and Yang, 2015) correlations with evolutionary rate in the literature for both organisms are presented. PPI degree and dosage tolerance correlations with evolutionary rate in the literature for S.cerevisiae are from Fraser et al. (2003) and Vavouri et al. (2009), respectively; for E.coli, not available (N/A) as of March 2018 Fig. 3. View largeDownload slide Spearman correlation coefficients (ρ) between evolutionary rate (ER) and five protein properties (x) in ProteomeVis and the literature. X marks and asterisks above or below bar plots denote P-value ranges: xxx ≡P‐value<10−100, xx <10−50, x <10−10, *** < 0.001, ** < 0.01, * < 0.05. Length (Zhou et al., 2008), contact density (Zhou et al., 2008) and (mRNA) abundance (Zhang and Yang, 2015) correlations with evolutionary rate in the literature for both organisms are presented. PPI degree and dosage tolerance correlations with evolutionary rate in the literature for S.cerevisiae are from Fraser et al. (2003) and Vavouri et al. (2009), respectively; for E.coli, not available (N/A) as of March 2018 The second strongest ER determinant in ProteomeVis is PPI degree. In agreement with Fraser et al., 2003, we report a significant negative correlation between PPI degree and ER for S.cerevisiae (Fig. 3). Fraser et al. (2003) hypothesized that proteins with more interaction partners are under stronger negative selection to keep their sequences unchanged to avoid detrimental loss of interaction partners. Bloom and Adami (2003) argued that the negative correlation between PPI degree and ER is confounded by protein expression. In the dataset that Fraser et al. (2003) used, Bloom and Adami (2003) found that proteins with more PPIs were more likely to have larger protein expression measurements. When controlling for protein expression by partial correlation analysis, Bloom and Adami (2003) found no significant correlation between PPI degree and ER. To see if this correlation as revealed by ProteomeVis could be confounded by protein abundance, the PPI degree versus abundance plot is explored in the SPloM panel for S.cerevisiae. Indeed, there is a significant positive correlation between them in ProteomeVis—Spearman correlation = 0.19, P-value< 10−5, whose absolute value is similar to that of the PPI degree versus ER correlation—Spearman correlation = –0.22, P-value< 10−6 for S.cerevisiae. Focusing on the correlation between contact density and ER, ProteomeVis can partially explain previously conflicting reports for S.cerevisiae. Mathematical relationships predict that the number of sequences stably folding into a native structure is approximated by contact density (Choi et al., 2017; England and Shakhnovich, 2003). Therefore, proteins with larger contact densities are expected to have larger ERs because more mutations can be accommodated while still stably folding. In agreement, Zhou et al. (2008) reported a significant positive correlation between contact density and ER for S.cerevisiae and E.coli. However, Shakhnovich (2006) found a corresponding correlation for S.cerevisiae that is significantly negative. In ProteomeVis, corresponding positive correlations are reported that is significant (P-value< 10−3) for S.cerevisiae and slightly significant (P-value < 0.01) for E.coli. To probe whether ProteomeVis’ consistency with Zhou et al. (2008) is due to having the same contact definition, we define contacts as in Shakhnovich (2006) (‘Materials and methods’ 2.1.3). The corresponding correlation for S.cerevisiae becomes insignificant—Spearman correlation = 0.00, P-value = 0.99. Although we do not recapitulate the significant negative correlation found in Shakhnovich (2006), we find that the significant positive correlation between contact density and ER in ProteomeVis is contingent on the employed contact definition. As seen in Figure 3, all statistically significant correlations in ProteomeVis agree with those available in the literature, except for the correlation between length and ER for E.coli. Although ProteomeVis agrees with Zhou et al. (2008) regarding correlations between contact density and ER, they vary regarding the correlation between protein length and ER for E.coli. Zhou et al. (2008) reported a significant positive correlation, while ProteomeVis reports an insignificant negative correlation for E.coli. Methodological differences do not explain the discrepancy. Zhou et al. (2008) applied the same filter based on length to identify structures in the respective organism’s proteome as in ProteomeVis. As expected, the length versus ER correlation for S.cerevisiae in ProteomeVis and Zhou et al. (2008) are consistent. Further studies are required to elucidate why inconsistent length-ER correlation are seen in E.coli but not S.cerevisiae. 4 Conclusion Proteomic data visualization tools are increasingly developed to facilitate understanding of complex and interwoven experimental data (Vizcaíno et al., 2015). Here, we present a novel web app called ProteomeVis, accessible at http://proteomevis.chem.harvard.edu, to study protein structure and sequence evolution simultaneously across organisms’ proteomes. Data downloaded from three published papers, programmatically accessed from four databases and generated by running three software packages are organized into ProteomeVis’ four panels. Interacting with the panels gives users quick insight into the underlying data. Once more detailed hypotheses are formulated throughout their web sessions, users can freely download the data to further probe on their own. In this paper, we applied ProteomeVis to investigate two important problems in evolutionary protein biology: the scale-free degree distribution of oPCUGs and the correlations between ER and protein properties. These two topics span time and length scales in protein evolution, and we successfully recapitulated and extended previous studies. We report the previously unknown scaling constant for S.cerevisiae’s protein universe graph. Interestingly, its oPCUG scaling constant fits the range predicted by the Big Bang model, which had only been applied to prokaryotic organisms (Deeds et al., 2004; Dokholyan et al., 2002). This observation points to the mechanism of protein structure evolution being common among organisms since the last universal common ancestor. In our studies of potential ER determinants, protein abundance is the strongest correlate among properties in ProteomeVis. Previously, the importance of mRNA abundance had been highlighted (Zhang and Yang, 2015), but our work joins only one other study to demonstrate the importance of protein abundance influencing ER (Drummond et al., 2006). ProteomeVis captures 13% of the S.cerevisiae and 25% of the E.coli proteomes as of March 2018. These percentages will grow as monthly updates are performed to incorporate newly deposited PDB structures. We plan to further expand ProteomeVis in future iterations by including more proteomes, such as those from Homo sapiens and Mus musculus, and adding genomic data, such as dN and dS. Acknowledgements RMR acknowledges fruitful email communications with Professor Brenda Andrews about dosage tolerance data in Sopko et al. (2006), and Professor Jianzhi Zhang and Professor Jian-Rong Yang about evolutionary rate data in Zhang and Yang (2015). RMR would like to thank Will Jacobs for helpful discussions surrounding data curation. All figures in the text and Supplementary Material, except for those of the ProteomeVis web app and Supplementary Figure S3, were created using Python’s Matplotlib package. Funding This work was supported by National Institutes of Health [2R01GM068670-13 to E.I.S.]; and the National Science Foundation Graduate Research Fellowships Program [2013139945 to A.I.G.]. Conflict of Interest: none declared. References Albert R. , Barabási A.L. ( 2000 ) Topology of evolving networks: local events and universality . Phys. Rev. Lett ., 85 , 5234 – 5237 . Google Scholar Crossref Search ADS PubMed Albert R. , Barabási A.L. ( 2002 ) Statistical mechanics of complex networks . Rev. Mod. Phys ., 74 , 47 – 97 . Google Scholar Crossref Search ADS Arike L. et al. ( 2012 ) Comparison and applications of label-free absolute proteome quantification methods on Escherichia coli . J. Proteomics , 75 , 5437 – 5448 . Google Scholar Crossref Search ADS PubMed Berggård T. et al. ( 2007 ) Methods for the detection and analysis of protein–protein interactions . Proteomics , 7 , 2833 – 2842 . Google Scholar Crossref Search ADS PubMed Berman H.M. et al. ( 2000 ) The protein data bank . Nucleic Acids Res ., 28 , 235 – 242 . Google Scholar Crossref Search ADS PubMed Bershtein S. et al. ( 2017 ) Bridging the physical scales in evolutionary biology: from protein sequence space to fitness of organisms and populations . Curr. Opin. Struct. Biol ., 42 , 31 – 40 . Google Scholar Crossref Search ADS PubMed Bloom J.D. , Adami C. ( 2003 ) Apparent dependence of protein evolutionary rate on number of interactions is linked to biases in protein–protein interactions data sets . BMC Evol. Biol ., 3 , 21. Google Scholar Crossref Search ADS PubMed Bloom J.D. et al. ( 2006 ) Structural determinants of the rate of protein evolution in yeast . Mol. Biol. Evol ., 23 , 1751 – 1761 . Google Scholar Crossref Search ADS PubMed Caetano-Anollés G. et al. ( 2009 ) The origin, evolution and structure of the protein world . Biochem. J ., 417 , 621 – 637 . Google Scholar Crossref Search ADS PubMed Chibucos M.C. et al. ( 2014 ) Standardized description of scientific evidence using the Evidence Ontology (ECO) . Database J. Biol. Databases Curation , 2014 , 1 – 11 . Choi J.M. et al. ( 2017 ) Graph’s topology and free energy of a spin model on the graph . Phys. Rev. Lett ., 118 , 1 – 5 . Chothia C. , Lesk A.M. ( 1986 ) The relation between the divergence of sequence and structure in proteins . EMBO J ., 5 , 823 – 826 . Google Scholar Crossref Search ADS PubMed Chung S.Y. , Subbiah S. ( 1996 ) A structural explanation for the twilight zone of protein sequence homology . Structure , 4 , 1123 – 1127 . Google Scholar Crossref Search ADS PubMed Cock P.J.A. et al. ( 2009 ) Biopython: freely available Python tools for computational molecular biology and bioinformatics . Bioinformatics , 25 , 1422 – 1423 . Google Scholar Crossref Search ADS PubMed Cokelaer T. et al. ( 2013 ) BioServices: a common Python package to access biological web services programmatically . Bioinformatics , 29 , 3241 – 3242 . Google Scholar Crossref Search ADS PubMed Cooper G.M. ( 2000 ) Cells as experimental models. In: Cell: A Molecular Approach, chapter 1, 2nd edn. Sinauer Associates, Sunderland, MA. Deeds E.J. , Shakhnovich E.I. ( 2010 ) A structure-centric view of protein evolution, design, and adaptation. In: Toone E.J. (ed.) Advances in Enzymology and Related Areas of Molecular Biology: Protein Evolution , vol. 75. John Wiley and Sons, Inc ., Hoboken, NJ , pp. 133 – 191 . Deeds E.J. et al. ( 2004 ) Proteomic traces of speciation . J. Mol. Biol ., 336 , 695 – 706 . Google Scholar Crossref Search ADS PubMed Dokholyan N.V. , Shakhnovich E.I. ( 2001 ) Understanding hierarchical protein evolution from first principles . J. Mol. Biol ., 312 , 289 – 307 . Google Scholar Crossref Search ADS PubMed Dokholyan N.V. et al. ( 2002 ) Expanding protein universe and its origin from the biological Big Bang . Proc. Natl. Acad. Sci. USA , 99 , 14132 – 14136 . Google Scholar Crossref Search ADS Douglas A.C. et al. ( 2012 ) Functional analysis with a barcoder yeast gene overexpression system . G3 Genes Genomes Genet ., 2 , 1279 – 1289 . Drummond D.A. , Wilke C.O. ( 2008 ) Mistranslation-induced protein misfolding as a dominant constraint on coding-sequence evolution . Cell , 134 , 341 – 352 . Google Scholar Crossref Search ADS PubMed Drummond D.A. et al. ( 2005 ) Why highly expressed proteins evolve slowly . Proc. Natl. Acad. Sci. USA , 102 , 14338 – 14343 . Google Scholar Crossref Search ADS Drummond D.A. et al. ( 2006 ) A single determinant dominates the rate of yeast protein evolution . Mol. Biol. Evol ., 23 , 327 – 337 . Google Scholar Crossref Search ADS PubMed England J.L. , Shakhnovich E.I. ( 2003 ) Structural determinant of protein designability . Phys. Rev. Lett ., 90 , 218101 . Google Scholar Crossref Search ADS PubMed Fraser H.B. et al. ( 2003 ) A simple dependence between protein evolution rate and the number of protein–protein interactions . BMC Evol. Biol ., 3 , 6. Google Scholar Crossref Search ADS PubMed Ghaemmaghami S. et al. ( 2003 ) Global analysis of protein expression in yeast . Nature , 425 , 737 – 741 . Google Scholar Crossref Search ADS PubMed Gilson A.I. et al. ( 2017 ) The role of evolutionary selection in the dynamics of protein structure evolution . Biophys. J ., 112 , 1350 – 1365 . Google Scholar Crossref Search ADS PubMed Greenbaum D. et al. ( 2003 ) Comparing protein abundance and mRNA expression levels on a genomic scale . Genome Biol ., 4 , 117. Google Scholar Crossref Search ADS PubMed Holm L. , Sander C. ( 1993 ) Protein structure comparison by alignment of distance matrices . J. Mol. Biol ., 233 , 123 – 138 . Google Scholar Crossref Search ADS PubMed Ingolfsson H. , Yona G. ( 2008 ). Protein domain prediction. In: Kobe B. et al. (eds.) Struct. Proteomics High-Throughput Methods , vol. 426, chapter 7. Humana Press , Totowa, NJ , pp. 117 – 43 . Ispolatov I. et al. ( 2005 ) Binding properties and evolution of homodimers in protein–protein interaction networks . Nucleic Acids Res ., 33 , 3629 – 3635 . Google Scholar Crossref Search ADS PubMed Kitagawa M. et al. ( 2005 ) Complete set of ORF clones of Escherichia coli ASKA library (A complete set of E. coli K-12 ORF archive): unique resources for biological research . DNA Res ., 12 , 291 – 299 . Google Scholar Crossref Search ADS PubMed Koonin E.V. et al. ( 2002 ) The structure of the protein universe and genome evolution . Nature , 420 , 218 – 223 . Google Scholar Crossref Search ADS PubMed Krapivsky P.L. et al. ( 2000 ) Connectivity of growing random networks . Phys. Rev. Lett ., 85 , 4629 – 4632 . Google Scholar Crossref Search ADS PubMed Lukatsky D.B. et al. ( 2007 ) Structural similarity enhances interaction propensity of proteins . J. Mol. Biol ., 365 , 1596 – 1606 . Google Scholar Crossref Search ADS PubMed Nepomnyachiy S. et al. ( 2015 ) CyToStruct: augmenting the network visualization of CyToStruct with the power of molecular viewers . Structure , 23 , 941 – 948 . Google Scholar Crossref Search ADS PubMed Orchard S. et al. ( 2014 ) The MIntAct project – IntAct as a common curation platform for 11 molecular interaction databases . Nucleic Acids Res ., 42 , D358 – D363 . Google Scholar Crossref Search ADS PubMed Pál C. et al. ( 2001 ) Highly expressed genes in yeast evolve slowly . Genet. Soc. Am ., 158 , 927 – 931 . Qian J. et al. ( 2001 ) Protein family and fold occurrence in genomes: power-law behaviour and evolutionary model . J. Mol. Biol ., 313 , 673 – 681 . Google Scholar Crossref Search ADS PubMed Roland C.B. , Shakhnovich E.I. ( 2007 ) Divergent evolution of a structural proteome: phenomenological models . Biophys. J ., 92 , 701 – 716 . Google Scholar Crossref Search ADS PubMed Rost B. ( 1999 ) Twilight zone of protein sequence alignments . Protein Eng. Des. Sel ., 12 , 85 – 94 . Google Scholar Crossref Search ADS Sander C. , Schneider R. ( 1991 ) Database of homology-derived protein structures and the structural meaning of sequence alignment . Proteins Struct. Funct. Bioinf ., 9 , 56 – 68 . Google Scholar Crossref Search ADS Shakhnovich B.E. ( 2006 ) Relative contributions of structural designability and functional diversity in molecular evolution of duplicates . Bioinformatics , 22 , e440 – e445 . Google Scholar Crossref Search ADS PubMed Shakhnovich B.E. ( 2005 ) Protein structure and evolutionary history determine sequence space topology . Genome Res ., 15 , 385 – 392 . Google Scholar Crossref Search ADS PubMed Sikosek T. , Chan H.S. ( 2014 ) Biophysics of protein evolution and evolutionary protein biophysics . J. R. Soc. Interface , 11 , 20140419. Google Scholar Crossref Search ADS PubMed Sopko R. et al. ( 2006 ) Mapping pathways and phenotypes by systematic gene overexpression . Mol. Cell , 21 , 319 – 330 . Google Scholar Crossref Search ADS PubMed Szklarczyk D. , Jensen L.J. ( 2015 ) Protein–protein interaction databases. In: Meyerkord C.L. , Fu H. (eds.) Protein-Protein Interactions , vol. 1278. Humana Press , New York, NY , pp. 39 – 56 . The UniProt Consortium ( 2017 ) UniProt: the universal protein knowledgebase . Nucleic Acids Res ., 45 , D158 – D169 . Crossref Search ADS PubMed Vavouri T. et al. ( 2009 ) Intrinsic protein disorder and interaction promiscuity are widely associated with dosage sensitivity . Cell , 138 , 198 – 208 . Google Scholar Crossref Search ADS PubMed Velankar S. et al. ( 2013 ) SIFTS: structure Integration with Function, Taxonomy and Sequences resource . Nucleic Acids Res ., 41 , D483 – D489 . Google Scholar Crossref Search ADS PubMed Vizcaíno J.A. et al. ( 2015 ) Proteomics Data Visualisation [Special Issue] . Proteomics , 15 , 1339 – 1456 . Google Scholar Crossref Search ADS PubMed Vogel C. , Marcotte E.M. ( 2012 ) Insights into the regulation of protein abundance from proteomic and transcriptomic analyses . Nat. Rev. Genet ., 13 , 227 – 232 . Google Scholar Crossref Search ADS PubMed Wall D.P. et al. ( 2005 ) Functional genomic analysis of the rates of protein evolution . Proc. Natl. Acad. Sci. USA , 102 , 5483 – 5488 . Google Scholar Crossref Search ADS Wang M. et al. ( 2012 ) PaxDb, a database of protein abundance averages across all three domains of life . Mol. Cell. Proteomics , 11 , 492 – 500 . Google Scholar Crossref Search ADS PubMed Wright C.F. et al. ( 2005 ) The importance of sequence diversity in the aggregation and evolution . Nature , 438 , 878 – 881 . Google Scholar Crossref Search ADS PubMed Yoav B. , Hochberg Y. ( 1995 ) Controlling the false discovery rate: a practical and powerful approach to multiple testing . J. R. Stat. Soc ., 57 , 289 – 300 . Zeldovich K.B. , Shakhnovich E.I. ( 2008 ) Understanding protein evolution: from protein physics to Darwinian selection . Annu. Rev. Phys. Chem ., 59 , 105 – 127 . Google Scholar Crossref Search ADS PubMed Zhang J. , Yang J.R. ( 2015 ) Determinants of the rate of protein sequence evolution . Nat. Rev. Genet ., 16 , 409 – 420 . Google Scholar Crossref Search ADS PubMed Zhang Y. , Skolnick J. ( 2004 ) Scoring function for automated assessment of protein structure template quality . Proteins Struct. Funct. Genet ., 57 , 702 – 710 . Google Scholar Crossref Search ADS Zhang Y. , Skolnick J. ( 2005 ) TM-align: a protein structure alignment algorithm based on the TM-score . Nucleic Acids Res ., 33 , 2302 – 2309 . Google Scholar Crossref Search ADS PubMed Zhou T. et al. ( 2008 ) Contact density affects protein evolutionary rate from bacteria to animals . J. Mol. Evol ., 66 , 395 – 404 . Google Scholar Crossref Search ADS PubMed © The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: firstname.lastname@example.org This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Bioinformatics – Oxford University Press
Published: Oct 15, 2018
It’s your single place to instantly
discover and read the research
that matters to you.
Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.
All for just $49/month
Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly
Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.
Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.
Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.
All the latest content is available, no embargo periods.
“Hi guys, I cannot tell you how much I love this resource. Incredible. I really believe you've hit the nail on the head with this site in regards to solving the research-purchase issue.”Daniel C.
“Whoa! It’s like Spotify but for academic articles.”@Phil_Robichaud
“I must say, @deepdyve is a fabulous solution to the independent researcher's problem of #access to #information.”@deepthiw
“My last article couldn't be possible without the platform @deepdyve that makes journal papers cheaper.”@JoseServera