Globally, unrelated protein sequences appear random

Bioinformatics, Volume 26 (3) – Feb 1, 2010

Publisher
Oxford University Press
Copyright
The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/btp660
pmid
19948773

Abstract

Motivation: To test whether protein folding constraints and secondary structure sequence preferences significantly reduce the space of amino acid words in proteins, we compared the frequencies of four- and five-amino acid word clumps (independent words) in proteins to the frequencies predicted by four random sequence models.

Results: While the human proteome has many overrepresented word clumps, these words come from large protein families with biased compositions (e.g. Zn-fingers). In contrast, in a non-redundant sample of Pfam-AB, only 1% of four-amino acid word clumps (4.7% of 5mer words) are 2-fold overrepresented compared with our simplest random model [MC(0)], and 0.1% (4mers) to 0.5% (5mers) are 2-fold overrepresented compared with a window-shuffled random model. Using a false discovery rate q-value analysis, the number of exceptional four- or five-letter words in real proteins is similar to the number found when comparing words from one random model to another. Consensus overrepresented words are not enriched in conserved regions of proteins, but four-letter words are enriched 1.18- to 1.56-fold in α-helical secondary structures (but not β-strands). Five-residue consensus exceptional words are enriched for α-helix 1.43- to 1.61-fold. Protein word preferences in regular secondary structure do not appear to significantly restrict the use of sequence words in unrelated proteins, although the consensus exceptional words have a secondary structure bias for α-helix. Globally, words in protein sequences appear to be under very few constraints; for the most part, they appear to be random.

Contact: wrp@virginia.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
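The comparison against the simplest random model can be illustrated with a minimal sketch. It is not the authors' code and rests on several assumptions: it counts every overlapping k-letter word rather than the word clumps used in the paper, it takes MC(0) to mean an order-0 model in which residues are drawn independently at their observed frequencies, and the function names (`word_counts`, `mc0_expected`, `overrepresented`) are hypothetical.

```python
from collections import Counter
from math import prod


def word_counts(seqs, k):
    """Count overlapping k-letter words across all sequences.

    Assumption: plain overlapping counts, not the paper's word clumps.
    """
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def mc0_expected(seqs, k):
    """Return a function giving a word's expected count under an order-0 model.

    Residue frequencies are estimated from the sequences themselves; the
    expected count is (number of word positions) * product of frequencies.
    """
    residues = Counter("".join(seqs))
    total = sum(residues.values())
    freq = {a: n / total for a, n in residues.items()}
    positions = sum(max(len(s) - k + 1, 0) for s in seqs)
    return lambda word: positions * prod(freq[a] for a in word)


def overrepresented(seqs, k, fold=2.0):
    """Map each word to its observed/expected ratio if the ratio is >= fold."""
    observed = word_counts(seqs, k)
    expected = mc0_expected(seqs, k)
    ratios = {w: n / expected(w) for w, n in observed.items()}
    return {w: r for w, r in ratios.items() if r >= fold}
```

Calling `overrepresented(sequences, 4)` on a list of protein strings would return the 4mers that are at least 2-fold overrepresented under these assumptions. Dividing the size of that result by the number of distinct observed words gives a fraction loosely comparable to the percentages quoted above, although the clump counting and the false discovery rate analysis described in the abstract are not reproduced here.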

Journal

Bioinformatics, Oxford University Press

Published: Feb 1, 2010
