Pythoscape: a framework for generation of large protein similarity networks

Alan E. Barber; Patricia C. Babbitt

doi:10.1093/bioinformatics/bts532

Bioinformatics, Vol. 28 no. 21 2012, pages 2845–2846. Applications Note, Sequence analysis. Advance Access publication September 8, 2012.

Alan E. Barber II and Patricia C. Babbitt
Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94158, USA

Associate Editor: Martin Bishop

ABSTRACT

Pythoscape is a framework implemented in Python for processing large protein similarity networks for visualization in other software packages. Protein similarity networks are graphical representations of sequence, structural and other similarities among proteins for which pairwise all-by-all similarity connections have been calculated. Mapping of biological and other information to network nodes or edges enables hypothesis creation about sequence–structure–function relationships across sets of related proteins. Pythoscape provides several options to calculate pairwise similarities for input sequences or structures, applies filters to network edges and defines sets of similar nodes and their associated data as single nodes (termed representative nodes) for compression of network information and output data or formatted files for visualization.

Contact: [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

Received on June 7, 2012; revised on August 8, 2012; accepted on August 23, 2012

1 INTRODUCTION

The rapid growth of databases of protein information (e.g. sequences and structures) provides both new opportunities and challenges for analysis and clustering by similarity. For example, global analysis of entire superfamilies and association of their members with biological information and other types of metadata has become a useful tool for functional annotation and discovery (Brown and Babbitt, 2012). As these sets become larger (sometimes many thousands of sequences) and their members more divergent, their fast exploration on a large scale becomes less feasible using traditional approaches such as alignments and trees.

Protein similarity networks (PSNs) enable analysis and visualization of structure–function relationships in large protein data sets by clustering of individual protein sets for more complex analysis while summarizing 'connectivity' relationships among the clusters. Mapping orthogonal sources of biological information onto PSNs then provides a powerful way to view functional trends across the set that can be interpreted in the context of their similarities. (See Atkinson et al., 2009 for an initial analysis of some uses and statistical validation of PSNs.) While databases like Similarity Matrix of Proteins (SIMAP) (Rattei et al., 2010) store pairwise similarities, and plug-ins available with software such as Cytoscape (Smoot et al., 2011) allow creation of small PSNs (Wittkop et al., 2010), no software solution exists to create and manage large PSNs. And while PSNs are inherently amenable to association with orthogonal information sources, the many information types available complicate development of a single software solution for managing such diverse features. Pythoscape addresses these issues and provides a software framework to create PSNs and develop new analyses for inference of functional properties in proteins.

2 DESCRIPTION AND SIGNIFICANCE

Pythoscape is an extensible computational framework implemented in Python to generate and analyze PSNs. For the user interested in generating large networks, the Pythoscape package has a core set of plug-ins (Supplementary Table S1) and tutorials, so that no development is needed to create simple networks painted with useful metadata. For software developers, Pythoscape provides a framework for rapid modification along with well-documented application programming interfaces for development of additional plug-ins using new sources of metadata.

Unlike sparser networks such as interaction networks, PSNs are frequently close to complete, often requiring storage and management of large quantities of data, and fast calculation (Supplementary Table S2). Pythoscape allows for flexible storage of data through the use of storage interfaces. Appropriate storage solutions can be chosen based on network size or developed as needed, allowing for easy updating to faster and more reliable database software solutions. Pythoscape can create, store and manage large networks, then, using representative nodes and edges to compress the information, output smaller summary networks for visualization (Fig. 1A and B). Users can choose how distances between representative nodes are calculated and, importantly, the full set of sequences in each node is retained for later use.

Additionally, Pythoscape has plug-ins for creating structure similarity networks and for generating correlations for edge distances between networks generated from a set of sequences and a corresponding set of available structures (Supplementary Table S1 and Supplementary Figs. S2 and S3).

3 EXAMPLE USAGE

Glutathione transferases (GSTs) are enzymes that typically catalyze the addition of glutathione to substrate compounds. They play roles in many biological processes, including metabolism of endogenous compounds and xenobiotics such as drugs. Of the thousands of GSTs that have been identified, the physiological substrates of only a small proportion are known; thus, they are principally classified into putative functional classes according to enzymatic, structural, and other features (Mannervik and Danielson, 1988). Recently, PSNs have been used to summarize and guide a global interpretation of GST sequence and structure relationships (Atkinson and Babbitt, 2009).

A PSN of GST sequences is shown in Figure 1A (see supplementary information for network creation and graph statistics). It illustrates how representative nodes computed by Pythoscape enable analysis of PSNs too large to be visualized in total while retaining their value for developing hypotheses from sequence similarities across the whole set. For comparison, individual clusters of interest can be output with all nodes present (Fig. 1B). This full non-abstracted network (representing a node for each sequence) shows a similar pattern of relationships to those shown in the corresponding representative node network (boxed in Fig. 1A). The correlation between the ideal representative node mean distances calculated in Pythoscape and the corresponding full network ideal distance for Fig. 1A is provided in Supplementary Figure S1. A quantitative description of the relationships between filtered networks and full networks has also recently been described elsewhere for some example systems (Atkinson et al., 2009), but these differences appear also to depend on the specific system analyzed. While 'missing data' is an inherent feature of representative nodes, the trade-off is in visualizing similarity relationships across large datasets that would not be practically achievable because of memory and speed limitations in their calculation.

The network shown in Figure 1A demonstrates another issue in the use of representative nodes that could complicate interpreting relationships between functional features and sequence similarity. In the example given here, some GST families are represented by multiple representative nodes, whereas other representative nodes contain multiple SwissProt families (HSP26, Phi and Tau), obscuring how sequence similarity tracks with annotation. Thus, we recommend that analysis using representative networks be accompanied by examination of the relevant parts of the corresponding full networks.

Fig. 1. Sequence similarity network of the GST superfamily generated by Pythoscape and visualized in Cytoscape. To compact the view for this figure, networks were laid out using the organic layout in Cytoscape rather than the distances computed from a similarity metric. In all, 664 representative nodes are used to describe pairwise relationships among 7447 sequences. (A) Representative network with functional classes colored, if annotated by SwissProt in a family (The UniProt Consortium, 2011). Family membership is indicated if one or more sequences in the abstracted node are associated with that family. (B) Full non-abstracted network for the group of GSTs found mostly in eukaryotes (boxed in A).

4 CONCLUSION

Pythoscape is a software framework to efficiently create and manage protein similarity networks. Tutorials, Pythoscape documentation, source code and future development plans are available at http://www.rbvi.ucsf.edu/trac/Pythoscape.

ACKNOWLEDGEMENTS

The authors thank Michael Hicks and John Morris (UCSF) and Patrick Frantom (University of Alabama) for helpful discussions.

Funding: Supported by Pharmaceutical Research and Manufacturers of America, Achievement Rewards for College Scientists Foundation, NIH Training Grant T32 GM007175 and R01 GM60595.

Conflict of Interest: none declared.

To whom correspondence should be addressed.

© The Author 2012. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

REFERENCES

Atkinson,H.J. and Babbitt,P.C. (2009) Glutathione transferases are structural and functional outliers in the thioredoxin fold. Biochemistry, 48, 11108–11116.
Atkinson,H.J. et al. (2009) Using sequence similarity networks for visualization of relationships across diverse protein superfamilies. PLoS One, 4, e4345.
Brown,S.D. and Babbitt,P.C. (2012) Inference of functional properties from large-scale analysis of enzyme superfamilies. J. Biol. Chem., 287, 35–42.
Mannervik,B. and Danielson,U.H. (1988) Glutathione transferases—structure and catalytic activity. CRC Crit. Rev. Biochem., 23, 283–337.
Rattei,T. et al. (2010) SIMAP—a comprehensive database of pre-calculated protein sequence similarities, domains, annotations and clusters. Nucleic Acids Res., 38, D223–D226.
Smoot,M.E. et al. (2011) Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics, 27, 431–432.
The UniProt Consortium. (2011) Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res., 40, D71–D75.
Wittkop,T. et al. (2010) Comprehensive cluster analysis with Transitivity Clustering. Nat. Protoc., 6, 285–295.
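
Illustrative sketch (not part of the original article). The representative-node compression described in the text can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not Pythoscape's API: the function name `build_representative_network`, the parameters `group_cutoff` and `edge_cutoff`, and the choice of single-linkage grouping are all hypothetical. Scores are assumed to be "higher means more similar" (for example, a negative log of a BLAST E-value). The sketch groups sequences into representative nodes, keeps the full member list of each node, and uses the mean of the member-to-member scores as the edge score between two representative nodes, which is one of several possible choices.

```python
from itertools import combinations
from statistics import mean


def build_representative_network(scores, group_cutoff, edge_cutoff):
    """Compress a pairwise similarity network into representative nodes.

    scores: dict mapping (a, b) tuples with a < b to a similarity score
            (hypothetical input format; higher = more similar).
    group_cutoff: pairs scoring at or above this value are merged into one
                  representative node (single linkage, an assumption here).
    edge_cutoff: representative edges whose mean score falls below this
                 value are filtered out.
    Returns (members, rep_edges).
    """
    nodes = sorted({n for pair in scores for n in pair})
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Merge every pair that is similar enough; the smallest label becomes
    # the name of the representative node so the result is deterministic.
    for (a, b), score in scores.items():
        if score >= group_cutoff:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)

    # Retain the full set of sequences behind each representative node.
    members = {}
    for n in nodes:
        members.setdefault(find(n), []).append(n)

    # Edge score between two representative nodes: mean over all available
    # member-to-member scores, then filtered by edge_cutoff.
    rep_edges = {}
    for g1, g2 in combinations(sorted(members), 2):
        between = [
            scores[tuple(sorted((a, b)))]
            for a in members[g1]
            for b in members[g2]
            if tuple(sorted((a, b))) in scores
        ]
        if between and mean(between) >= edge_cutoff:
            rep_edges[(g1, g2)] = mean(between)
    return members, rep_edges


# Toy all-by-all scores for four made-up sequences.
toy_scores = {
    ("A", "B"): 50.0, ("A", "C"): 12.0, ("A", "D"): 2.0,
    ("B", "C"): 8.0, ("B", "D"): 1.0, ("C", "D"): 30.0,
}
members, rep_edges = build_representative_network(toy_scores, 25.0, 5.0)
print(members)    # {'A': ['A', 'B'], 'C': ['C', 'D']}
print(rep_edges)  # {('A', 'C'): 5.75}
```

In this toy case, four sequences collapse into two representative nodes joined by one edge, while `members` still records which sequences each node stands for, so a full network for any cluster of interest could be regenerated later. Raising `edge_cutoff` above the mean score removes the edge, which mirrors the kind of information loss the text describes for filtered and representative networks.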

Publisher: Oxford University Press
Copyright: © The Author 2012. Published by Oxford University Press.
ISSN: 1367-4803
eISSN: 1460-2059
DOI: 10.1093/bioinformatics/bts532
pmid: 22962345

Journal

Bioinformatics – Oxford University Press

Published: Sep 8, 2012
