Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Homologous over-extension: a challenge for iterative similarity searches

Homologous over-extension: a challenge for iterative similarity searches Published online 11 January 2010 Nucleic Acids Research, 2010, Vol. 38, No. 7 2177–2189 doi:10.1093/nar/gkp1219 Homologous over-extension: a challenge for iterative similarity searches 1 2, Mileidy W. Gonzalez and William R. Pearson * Department of Biological Sciences, University of Maryland Baltimore County, 1000 Hilltop Circle, Baltimore, MD 21250 and Department of Biochemistry and Molecular Genetics, Jordan Hall Box 800733, Charlottesville, VA 22908, USA Received November 19, 2009; Revised December 16, 2009; Accepted December 17, 2009 ABSTRACT homologous, and homologous proteins share similar structures and, often, similar functions. Thus, reliable We have characterized a novel type of PSI-BLAST transfer of knowledge between homologous proteins error, homologous over-extension (HOE), using requires accurate identification of homologs. embedded PFAM domain queries on searches For pair-wise similarity searches using BLAST (1), against a reference library containing Pfam- FASTA (2) or Smith–Waterman (3,4), the statistical esti- annotated UniProt sequences and random synthetic mates used to infer homology are very accurate; unrelated sequences rarely obtain expectation values lower than sequences. PSI-BLAST makes two types of errors: expected by chance (5,6). Statistical estimates for iterative alignments to non-homologous regions and HOE methods like PSI-BLAST can be much less accurate (7,8); alignments that begin in a homologous region, but sometimes sequences that are clearly unrelated (they have extend beyond the homology into neighboring different three-dimensional topologies) obtain highly sig- sequence regions. When the neighboring sequence –6 nificant expectation values (<10 ) that imply clear region contains a non-homologous domain, homology. These misleading alignments are thought to PSI-BLAST can incorporate the unrelated result from profile or position specific scoring matrix sequence into its position specific scoring matrix, (PSSM) contamination and from the PSI-BLAST which then finds non-homologous proteins with sig- sequence weighting strategy; when distant homologs are nificant expectation values. HOE accounts for the included in the PSSM, their contribution to the matrix is largest fraction of the initial false positive (FP) given a higher weight than when closely related sequences errors, and the largest fraction of FPs at iteration are included. An unfortunate side effect of this weighting occurs when unrelated sequences are included; they also 5. In searches against complete protein sequences, contribute strongly to the new PSSM, which then 5–9% of alignments at iteration 5 are non- produces high scores for homologs to the unrelated homologous. HOE frequently begins in a partial domain. The corruption of PSSMs leads to the assignment protein domain; when partial domains are removed of high scores and statistical significance to from the library, HOE errors decrease from 16 to 3% biologically-incorrect relationships—a serious problem of weighted coverage (hard queries; 35–5% for for any protein characterization effort. sampled queries) and no-error searches increase Alternatively, statistical errors in profile-based align- from 2 to 58% weighed coverage (hard; 16–78% ments might reflect inherent limits in distinguishing sampled). When HOE is reduced by not extending similarities produced by divergent evolution from those previously found sequences, PSI-BLAST specificity produced by convergence from unrelated proteins (7,9). Currently, the most sensitive structural comparison improves 4–8-fold, with little loss in sensitivity. methods often assign statistically significant similarity to non-homologous structural alignments, and this trend INTRODUCTION extends to profile–sequence and profile–profile alignments. This inability to distinguish divergent from convergent Protein similarity searching is central to genome annota- similarity may reflect the small number of regular struc- tion, characterization of protein families and exploration tural motifs in proteins. Similarly, PSI-BLAST’s inaccu- of distant evolutionary relationships. Similarity searching rate statistics may reflect its ability to capture some of this is effective because proteins that share statistically signif- structural information. icant sequence similarity can be inferred to be *To whom correspondence should be addressed. Tel: +1 434 924 2818; Fax: +1 434 924 5069; Email: wrp@virginia.edu The Author(s) 2010. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. Downloaded from https://academic.oup.com/nar/article-abstract/38/7/2177/3100605 by Ed 'DeepDyve' Gillespie user on 06 February 2018 2178 Nucleic Acids Research, 2010, Vol. 38, No. 7 Previous refinements to PSI-BLAST have addressed the two of the three domains of life; (v) only one family PSSM corruption issue (8,10–14) by improving the statis- from each clan superfamily was chosen; (vi) the family tical estimates used to evaluate alignment scores. In this contained only non-nested (contiguous) domains. Large article, we describe a novel cause of PSSM contamination: families were reduced to 1500 members by random over-extension of a homologous domain into a removal of members, to avoid an out-of-memory non-homologous region. Because this problem begins problem when PSI-BLAST stored large numbers of with a homologous alignment, it cannot be addressed alignments. with more accurate statistical estimates or more conserva- Query sequence selection. For each of the 320 Pfam tive inclusion thresholds; these errors can occur with domain families meeting these criteria, two domain very significant E()-values (<10 ). It is the alignment, members were chosen as queries based on their location not the score that is inaccurate. However, the problem on the family phylogenetic tree. One query was selected can be reduced dramatically by limiting alignment exten- from a populated area and another from a deserted sion after a domain is included in the PSSM. area of the tree. Each of the 640 queries was compared In evaluations of similarity searching programs, align- to the target database using BLAST, and the 50 queries ments are typically scored as true positives (TPs) or false producing the lowest family coverage at E() < 0.001 were positives (FPs) based on whether the library sequence chosen for the hard query set. In addition, 50 queries contains a homologous domain, rather than whether the were chosen at random with replacement to be part domain of interest is aligned correctly. For pair-wise of the sampled query set. A third set of 50 families sequence comparison, this approach summarizes search (100 queries) was selected to give strong differences performance relatively accurately; incorrect alignment between the populated and deserted regions of the boundaries rarely detract from the identification of the tree; one query was taken from each region for the homologous protein, and may reflect difficulties in accu- rately annotating domain boundaries. 50 families. With iterative methods like PSI-BLAST, accurate Query embedding. While our goal is to simulate searches domain alignments are more important, because inaccu- performed with full-length proteins against a full-length rate alignments can cause non-homologous domains protein database (the typical use of PSI-BLAST), our to be included in the PSSM used for the next iteration. query sequences are not complete proteins; they contain Therefore, to characterize PSI-BLAST errors, we recorded a single Pfam domain (complete proteins can contain the beginning and end of the alignments included in the multiple domains, with complex homology relationships). PSSM for the next iteration. Using the domain boundary To more accurately simulate searches with full-length annotations in our reference library, we could follow the proteins, we embedded bare Pfam domains in random homologous or non-homologous sequence regions that synthetic sequences. The embeddings were created were added to the PSSM. This process allowed us to by randomly shuffling the domain, splitting it in half, identify a novel source of PSSM corruption—homologous and placing each shuffled flank before and after the real over-extension (HOE) errors. HOE, particularly of partial domain. We confirmed that the random embedding domains, is a problem that can affect any iterative strategy sequences did not produce significant alignments for building profiles, PSSMs, or Hidden Markov Models by comparing them to the Swiss-Prot database (16). (HMM). We have developed a very simple strategy for Embedding the domains also allowed HOE to occur in reducing HOE errors—once a sequence is included in the query sequence, since alignments could extend the PSSM, the boundaries of its alignment do not beyond the Pfam domain boundaries. change. This strategy reduces FP errors by more than a factor of 8, with little loss in search sensitivity. Target library construction. The standard search library was constructed by bringing together full-length UniProt (17) proteins containing domains from the query families MATERIALS AND METHODS (excluding viral sequences), supplemented with an equal Evaluation datasets number of randomly generated sequences. The random sequences were generated by shuffling each full-length We generated two sequence datasets to characterize protein sequence. A second long-domain-only library was PSI-BLAST performance: (i) a set of query families and constructed from UniProt sequences containing family (ii) two target (reference) libraries. The query families are domains that were at least 75% of their respective Pfam groups of homologous protein domains derived from model length. Pfam (15). The target libraries are collections of full-length sequences used for iterative PSI-BLAST searching. Annotating the datasets. The target library sequence homology annotations specified by Pfam (v.21) were sup- Query family selection. A subset of domain families from plemented to reduce the number of unannotated Pfam version 21.0, originally 9000 domain families, was homologs and to extend domain boundaries in some selected and filtered down to 320 domain families meeting cases. Because Pfam identifies homologs by alignment the following criteria: (i) at least one PDB structure in the family, (ii) HMM models longer than 200 residues, (iii) with the model HMM, some library sequences con- more than 100 members in the family, (iv) families were tain cryptic domains that share strong similarity to a taxonomically-broad, with members from organisms in domain in the family that is distant from the model Downloaded from https://academic.oup.com/nar/article-abstract/38/7/2177/3100605 by Ed 'DeepDyve' Gillespie user on 06 February 2018 Nucleic Acids Research, 2010, Vol. 38, No. 7 2179 HMM, so that the cryptic domains do not meet the Pfam compatible library of all significant (new plus threshold. Potentially cryptic homologs were identified by HOE-corrected) HSP sequences plus 10 000 randomized reverse-searching apparent ‘non-homologs’ with E()- sequences—the latter set provides a large sequence sample for more accurate E-value calculation. (vi) We values <10 against the target library using SSEARCH run PSI-BLAST for two iterations against the v36 (3,4). Additional unannotated homologs were identified by examining Pfam v.23, SCOP (18) and significant-hits library with the first-iteration PSSM. (vii) CATH (19) classifications. Domain boundaries were We use all alignments with E() < 10 to build the adjusted using GLSEARCH, which produces an align- HOE-corrected second-iteration PSSM. (viii) We ment that is global in the query domain and local in continue to repeat steps (iii)–(vii) using the the library full-length sequence. The detailed curation HOE-corrected PSSM from the previous iteration (e.g. of the database will be described elsewhere (Gonzalez for iteration 3, we use the HOE-corrected second-iteration and Pearson, manuscript in preparation). The database PSSM) for a user-defined number of times or through may be accessed at http://faculty.virginia.edu/wrpear convergence. All PSI-BLAST searches use the t 1 son/fasta/PUBS/gonzalez09a. composition-based score adjustment (8), which is the As a result of Pfam family consolidation based on only permitted option when re-starting from a checkpoint updates in later versions of Pfam, examination of SCOP PSSM file. and CATH classifications, and additional searches, a Classification of PSI-BLAST errors. Error types were number of our initial Pfam families were merged. In the classified based on differences between the known standard library, the distribution of query hard and query embeddings, the annotated domain boundaries, randomly sampled domain family sizes (the number and the query and library sequence boundaries in the of homologous domains, or TPs) ranged from 84 to PSI-BLAST alignment. TPs: in a TP alignment, at least 11 435 with a median family size of 847 and first and 50% of the residues in the alignment overlapped the third quartiles at 473 and 1514 members. For the long query domain and the homologous region in the library domain library, family sizes (TPs) ranged from 29 sequence, as defined by the annotated Pfam boundaries to 6048, with a median of 425 and quartiles of 249 and (Figure 1A). Because of the 50% overlap requirement 870 members. for TPs, two types of FP errors can occur: (i) non-homologous FP alignments (NH-FP), which align PSI-BLAST searches the query domain to either a random sequence (Fig. 1B, PSI-BLAST (blastpgp 2.2.19) searches were performed left), a non-domain region (Fig. 1B middle), or a with test queries against the standard and long-domain non-homologous domain (Fig. 1B right); (2) HOE FPs libraries at four different inclusion thresholds [E() < 0.01; (Figure 1C). HOEs begin with a TP alignment but <0.005, the blastpgp default; <0.001; and <0.0001]. Each extend it so that more than 50% of the alignment is query family was evaluated by searching with the bare outside the homologous domains. Extensions can occur domain and with 10 different embedding replicates on the library domain (Figure 1C, left and middle) and (the same domain embedded in different shuffles). on the query domain (Figure 1C, right). Each different Searches ran to convergence or were stopped at 20 itera- error type for a sequence was recorded; multiple errors tions. All other parameters used the default values. of the same type were counted once for the sequence. By default, blastpgp 2.2.19 uses a composition-based For coverage at iteration 5 and the sensitivity/specificity score adjustment (t 2) described in ref. (14). However, analysis, HOE errors were counted both as TPs and FPs. for our modification of PSI-BLAST, we interrupted the Likewise, non-homologous (NH) alignments due to a program after each iteration, extracted the aligned library sequence domain that had appeared in an earlier sequences, and built a new PSSM for the sequences HOE alignment were scored as HOE alignments, but only using the appropriate alignment boundaries. Because the as FPs. program was stopped and restarted using its checkpoint Family and tree coverage. Because homologous target facility, a different composition-based score adjustment library domains are rarely uniformly distributed across [t 1, ref. (8)] is required by blastpgp. their phylogeny, simply counting the fraction of the PSI-BLAST noExt. We implemented a simple modifica- homologous sequences identified in a search can obscure tion to PSI-BLAST to prevent the propagation of HOE the evaluation of a method’s success in identifying distant errors. For the first iteration, (i) we search with homologs. Thus, we measured both family coverage, PSI-BLAST against the standard search library for two the fraction of the homologous domains identified, and rounds; (ii) the first-iteration PSSM and all significant tree coverage, the weighted fraction of tree partitions [E() < 0.005] alignments are stored. For the second itera- covered in the search. A phylogenetic tree for each tion, (iii) we search with PSI-BLAST against the standard query domain family was generated using Quicktree [v. search library for one round, using the first-iteration 1.1, ref. (20)]. The multiple sequence alignments were PSSM. (iv) The significant [E() < 0.005] alignments that built with HMMER [v. 2.3.2, ref. (21)] and the Pfam overlap with previously found high-scoring segment HMM model. Tree coverage is the weighted sum of the pairs (HSP) are replaced by the original alignments, branch lengths of the sub-tree covered by the homologs while all new HSP alignments are stored. (v) We build a found by the search, as illustrated in Supplementary significant-hits library: a formatdb PSI-BLAST Figure S1. Downloaded from https://academic.oup.com/nar/article-abstract/38/7/2177/3100605 by Ed 'DeepDyve' Gillespie user on 06 February 2018 2180 Nucleic Acids Research, 2010, Vol. 38, No. 7 A True Postive (TP) Alignment Query Embedded Domain Library Homologous Domain Sequence B Non-homologous False Positives (NH-FP) Unrelated Domain Random Library Sequence C Homologous Over-Extension (HOE) False Positives HOE-Library HOE-Library HOE-Query Figure 1. True positive and two types of PSI-BLAST errors. PSI-BLAST alignments are classified after comparing the alignment boundaries to the embedding boundaries in the query sequence, and to the annotated domain boundaries in the library sequence. (A) TP—an alignment is classified as a TP if at least 50% of the aligned residues overlap the Pfam annotation. Two types of FP errors can occur: (B) non-homologous FP alignments (NH-FP) and (C) homologous over-extension (HOE) FPs. Non-homologous FPs map entirely to random (and thus unrelated) sequences (B, left), to non-domain regions of the protein (B, middle), or to regions covered by unrelated Pfam domains (B, right). Homologous over-extension FPs occur when two homologous domains align, but the alignments overextend, so that more than 50% of the alignment is outside the homologous region. Both the library domain (C, left and middle) and the query domain (C, right) can overextend. RESULTS regions (NH-FP, Figure 1B) and alignments that begin in a homologous domain, but that extend into a PSI-BLAST makes two distinct types of errors non-homologous region (HOE, Figure 1C). Characterizing the sources of PSI-BLAST errors, in par- Non-homologous FP errors are well recognized. These ticular the process by which a PSI-BLAST PSSM becomes errors involve alignments either to random sequences contaminated, is challenging. PSI-BLAST searches (Figure 1B, left), to regions of the protein that do not proceed iteratively, with the addition of new sequences contain Pfam domain annotation (Figure1B, middle), or that can either improve the sensitivity of the PSSM, or, to regions that contain unrelated domains (Figure 1B, with the inclusion of unrelated sequences, that can turn right). Non-homologous errors have been the target of the PSSM toward an unrelated family. Moreover, the improved PSI-BLAST statistical estimates. sequence inclusions that take place early may not have a Non-homologous regions align by chance, so methods major effect until several iterations later. To try to follow that reduce chance alignments, either with more accurate the process of sequence inclusion and PSSM contamina- statistics or stricter inclusion thresholds, will reduce tion, we searched a mixture of real, full-length, UniProt non-homologous errors. This is an effective approach proteins and random synthetic sequences using two sets of because the E()-values for the first NH-FP errors are query sequences that contained a single Pfam domain. The often near the threshold for inclusion of a sequence for boundaries of each significant alignment, at each iteration, the next iteration. In our hard and randomly sampled data were recorded and classified by alignment type (Figure 1). sets, the first NH-FP errors had expectation values 7 3 TPs align the domain in the query sequence to homolo- ranging E() < 10 –10 (Figure 2A and C). gous domains in the reference library, while FPs align We were initially surprised that in our comparisons, the query domain to a non-homologous region. some FPs had extremely significant similarity scores, 70 40 Examination of FP alignments revealed two distinct with E()-values (expectations) < 10 –10 (Figure 2A error morphologies: alignments on non-homologous and C). Examination of these alignments provided a Downloaded from https://academic.oup.com/nar/article-abstract/38/7/2177/3100605 by Ed 'DeepDyve' Gillespie user on 06 February 2018 Nucleic Acids Research, 2010, Vol. 38, No. 7 2181 A hard - all domains B hard - long domains, non-embedded 0 0 –10 5 –10 –20 –20 HOE–L –30 –30 HOE–Q –40 –40 NH –50 –50 0 10 20 30 40 50 0 10 20 30 40 50 domain family domain family C sampled - all domains D sampled - long domains, non-embedded –10 –10 –20 –20 –30 –30 –40 –40 –50 –50 0 10 20 30 40 50 0 10 20 30 40 50 domain family domain family Figure 2. Distribution of initial alignment errors. The expectation values for the first FP errors are shown for each of 50 hard (A and B)or randomly-sampled (C and D) searches, classified by error type (i.e. HOE-L, HOE-Q and NH). FP E()-values are plotted from lowest (most-significant) to highest in each of the four panels; thus, query families are ordered differently in each panel. The iteration number for the first independent FP type (when two FP types occur for a query in the same iteration, the less significant FP is not considered independent) with the lowest E()-value (expectation) for each error type is also shown. (A) FP E()-values and error types with embedded hard queries against the standard domain library. (B) Searches with non-embedded hard queries against the long-domain library. (C) Searches with the embedded randomly sampled queries against the standard domain library [the E()-value for the lowest HOE-L first FP in this panel is E() = 5  10 , but it is graphed at E() = 10 ]. (D) Searches with non-embedded randomly sampled queries against the long-domain library. simple explanation; they began in a homologous region, searching with non-embedded domains (Figure 2B and which sometimes produced a high similarity score, but D). extended into a non-homologous region. In contrast to PSI-BLAST searches have a history that affects the non-homologous alignments, HOE cannot be reduced number of errors with better statistics or reasonable inclusion thresholds; a threshold that excluded alignments with E()-values less The pathological consequences of HOE are illustrated in than E() < 10 would exclude most homologs. HOE is Figure 3 and Table 1, which follow an over-extension that not caused by inaccurate statistical estimates; it is caused brings a new domain into the PSSM, so that the unrelated by inaccurate alignments. domain eventually crowds out the signal from the original In our searches, HOE errors can occur on the library homology. Here, the query domain is a 298-residue region sequence (HOE-L: Figure 1C, left and middle) and on the from 2 to 300 from Q8KR72_PHOLU, a condensation query sequence (HOE-Q: Figure 1C, right). But, library domain in the photobactin biosynthesis protein PhG. sequence over-extensions tend to cause more problems, For this PF00668 query, the first error occurs at iteration because, in the library sequence, the (HOE-L) extension 3 when an initially homologous alignment extends onto can continue into a non-homologous domain, which, in the unrelated PF00550 domain. As a result of the initial turn, can recruit additional homologs to that unrelated over-extension, by iteration 5, most of the alignments only domain. HOE-Q errors are found at all inclusion thresh- contain the PF00550 domain (Figure 3, Table 1). In fact, olds, but not as frequently as HOE-L errors: HOE-L while these two families are similar in size (PF00668: 2528 errors occur 10-times more often than HOE-Q errors members; PF00550: 2855 members) and the actual query (Figure 2A and C). HOE-Q errors cannot occur when was a member of the PF00668 family, the search at Downloaded from https://academic.oup.com/nar/article-abstract/38/7/2177/3100605 by Ed 'DeepDyve' Gillespie user on 06 February 2018 log(Expectation) log(Expectation) log(Expectation) log(Expectation) 2182 Nucleic Acids Research, 2010, Vol. 38, No. 7 >Q71EG8_BACSU|1368305 Length = 203 Score = 137 bits (346), Expect = 3e-31, Method: Composition-based stats. Identities = 25/214 (11%), Positives = 59/214 (27%), Gaps = 17/214 (7%) Query: 52 NHAPNKSGSFNNTLYKTVALLLALFPVTATRISEFSYQHAYLDNLTEHAWPYY-RQIFLR 110 + N T K AL + I + ++ ++ ++ Sbjct: 5 ADRQAYTAPRNVTEMKLCALWEEVLKNGPVGIRDHFFERGGHSLKATALVSRIAKEFGVQ 64 5th, 4th Iter 3rd Iter Query: 111 QWHSGERPYHKICDTSQLYSNGALGMVTAYSAQLARYVNPLT-QAQQYYWHEFSLHSEKV 169 + + + + + + P A+ +L V Sbjct: 65 VPLQDIFARPTVEELASVIQDLEESPYESIQPAQKTGHLPGVFGAETDVCAPTALEDGGV 124 2nd Iter Query: 170 VSTVAHYLDIQGNVDQEALCKAITMVISETDVLSVRFKRLEEERLPLQQPNQAATPQLKF 229 + L++ G +D+ L + ++ + L F+ + P+Q+ + + QL Sbjct: 125 GYNMPAVLELTGPLDRGRLEETFRQLVERHESLRTSFETGPD-GEPVQRIHDSVPFQL-- 181 Query: 230 IDLQTQPDPFNTALQLMRADVESPLNLLTQLLSA 263 D +A + P L L Sbjct: 182 -------DEAESADAFV-----RPFCLEEGPLFR 203 B 50 164 100 263 PF00668 150 448 19 83 117 203 PF00550 PF00668 5 119 58 203 Figure 3. Iterative growth of a homologous over-extension. (A) The raw PSI-BLAST output of a search querying a PF00668 embedded domain against the standard curated Pfam library at iteration 5. The portion of the alignment that contains the PF00668 homologous domain is shown in blue, while the over-extension onto the structurally unrelated PF00550 domain is shown in red. (B) A diagram that tracks the progression of the alignments shown in (A) from iterations 2 through 5. The alignment on the partial PF00668 domain begins as the first FP in the search at iteration 3, and continues to overextend further onto the unrelated PF00550 domain (in red) in subsequent iterations. By iteration 5, the entire unrelated PF00550 domain is covered by the overextended alignment. domain is characterized by a 2-layer ab sandwich fold, Table 1. HOE from PF00668 domain onto the unrelated PF00550 while the PF00550 acyl carrier protein domain is an all Iteration TPs Family Tree FPs Unrelated a structure. coverage coverage Pfam coverage HOE into unrelated domains can have a dramatic effect on alignments in subsequent iterations. As illustrated 1 271 0.11 0.14 0 2 998 0.49 0.93 0 in Figure 3 and Table 1, when an alignment extends 3 478 0.19 0.93 113 PF00550 (10), CL61(1) into an adjacent non-homologous domain, many of the 4 330 0.13 0.72 1784 PF00550 (1111) unrelated-domain homologs can be found. An extension PF08415 (15) into a non-domain region or an error that maps entirely to CL61 (1) a random sequence can also contaminate the PSSM, but CL28 (52) CL63 (14) this contamination is less likely to produce additional FPs, PF00501 (4) since it is much less likely that the non-domain or random 5 318 0.13 0.67 2174 PF00550 (1347/2855) sequence has additional homologs in the database. PF08415 (19/140) Non-homologous alignments with unrelated domains PF02911 (1/8) CL28 (38/2153) can cause error propagation in subsequent iterations, CL63 (16/9791) but these errors are relatively rare. The level of unrelated CL46 (6/3586) error propagation is proportional to the size of the PF00501 (4/1905) unrelated family: i.e. errors to larger families lead to The number of alignments to the unrelated family in the given itera- stronger corruption. tion is shown in parenthesis. The size of the family is shown for the last iteration. The number of FPs is often less than the sum on the align- ments that map to unrelated domains because sometimes HOE causes the largest number of errors over-extension in one sequence covered multiple unrelated domains. Because PSI-BLAST errors can propagate in later itera- iteration 5 covered 50% of the non-homologous tions, we consider the relative importance of PF00550 family and only 13% of the PF00668 family. non-homologous and HOE alignments from three differ- PF00668 and PF00550 have clearly distinct structures; ent perspectives. To focus on the initial cause of the errors, the PF00668 (Chloramphenicol Acetyl Transferase fold) we show that HOE errors are the most common initial Downloaded from https://academic.oup.com/nar/article-abstract/38/7/2177/3100605 by Ed 'DeepDyve' Gillespie user on 06 February 2018 Nucleic Acids Research, 2010, Vol. 38, No. 7 2183 HOE-Q non-hom Total HOE-L no error A hard - all domains B hard - long domains 0.5 1.0 0.4 0.8 0.3 0.6 0.2 0.4 0.1 0.1 0.0 0.0 0.01 0.005 0.001 0.0001 0.01 0.005 0.001 0.0001 E()-threshold E()-threshold C sampled - all domain D sampled - long domains 1.0 1.0 0.8 0.8 0.6 0.6 0.3 0.3 0.2 0.2 0.1 0.1 0.0 0.0 0.01 0.005 0.001 0.0001 0.01 0.005 0.001 0.0001 E()-threshold E()-threshold Figure 4. Error-free family coverage before the first FP. (A–D) show the weighted fraction of family coverage by FP type—HOE on the library sequence (HOE-L, filled down triangles), HOE on the query sequence (HOE-Q, filled up triangles), non-hom (non-homologous error, filled diamonds)—on two datasets: hard families (A and B) and sampled families (C and D). Performance against the standard library (all domains, panels A and C) and the long-domain library (B and D) is shown. Coverage for searches that converged without any error (no error) is plotted with an ‘X’. Results are weighted so that each of the 50 searches contributes 2% of the coverage. The weighted coverage of each search was calculated as 1=50 ðfp =FP Þðtp =total Þ where fp is the number of errors of each type (NH, HOE-Q, HOE-L) in family f;FP is the total number of FP f f f f¼1 f f f errors for the family in the error iteration; tp is the number of TPs found before the first error iteration, and total is the total family size in the complete or long-domain database. Thus, a search that achieves 50% family coverage before the first FP and had an equal number of HOE-L and NH errors would contribute 0.005 on both the HOE-L and the NH curves. The total family coverage in the iteration before the first FP for each search is also shown (open squares). error (Figure 2), and that HOE errors account errors occurred first in four searches. In addition, three for the greatest loss of error-free family coverage (Figure of the searches converged without an error. (These 4). However, since there is no practical way to recognize values add up to more than 50 because, in some the initial error in real searches, we also show that searches, several first errors occurred in the same itera- HOE errors produce the largest fraction of FPs at tion.) For the randomly-sampled query searches, the prev- iteration 5, a commonly used PSI-BLAST stopping alence of HOE-L first errors is similar; 34 queries had first point (Figure 5). HOE-L errors, 10 queries had HOE-Q first errors, 10 had Figure 2A and C display the E()-value and error type NH first errors and eight searches had no errors at con- for each of 50 hard and randomly-sampled queries in our vergence. Figure 2 also reports the iteration in which the test set for a search with the default PSI-BLAST inclusion first error of each type occurred. Figure 2 shows clearly threshold of E() < 0.005. In this figure, we report the the challenge posed by HOE alignments; many have lowest E()-value for each error type in the first iteration extremely significant E()-values (produced because the that produced an error. Thus, we seek to catch the initial alignment starts in a homologous region), which prevent error event, before the error alignment has been incorpo- a more conservative statistical approach from excluding rated into the PSSM. For the searches with hard families, these errors. HOE-L errors occurred first for 43/50 query domains, While Figure 2 shows that HOE errors occur early, HOE-Q errors occurred first in 12 searches, and NH often, and with very significant E()-values, Figure 2 does Downloaded from https://academic.oup.com/nar/article-abstract/38/7/2177/3100605 by Ed 'DeepDyve' Gillespie user on 06 February 2018 family coverage family coverage family coverage family coverage 2184 Nucleic Acids Research, 2010, Vol. 38, No. 7 HOE-Q non-hom True Positive HOE-L False Positive A hard - all domains B hard - long domains 1.0 1.0 0.8 0.8 0.1 0.1 0.0 0.0 0.01 0.005 0.001 0.0001 0.01 0.005 0.001 0.0001 E()-threshold E()-threshold C sampled - all domains D sampled - long domains 1.0 1.0 0.8 0.8 0.1 0.1 0.0 0.0 0.01 0.005 0.001 0.0001 0.01 0.005 0.001 0.0001 E()-threshold E()-threshold Figure 5. Family coverage at iteration 5. The frequency of TPs and FPs found at iteration 5, weighted by number of alignments from each search is shown. Both hard (A, B) and randomly sampled (C, D) families were tested against the standard library (all domains, A, C) and the long-domain library (B, D). The weighted FP frequency at iteration 5 is 1=50 ðfp Þ=ðtp þ FP Þ, where fp is the number of FPs of the specified type (HOE-Q, f¼1 f f f HOE-L, NH, or total errors) in family f at iteration 5, FP is the total number of FPs at iteration 5, and tp is the number of TPs found for the family at iteration 5. Filled squares plot the total weighted coverage of all three error types: HOE-Q (up-triangle), HOE-L (down-triangle) and NH (diamond). Total family coverage (open squares) is defined as 1=50 ðfp Þ=ðtp þ FP Þ, where total is the total number of homologs in the family. f f f f f¼1 With this weighting, a family that finds all of its homologs without any errors will contribute 0.02 to the coverage; a family that finds half of its homologs and an equal number of non-homologs will get a weighted frequency of (0.02  0.333) for the HOE-L, HOE-Q or non-hom. error type. For this figure and Figure 7, an HOE-L or HOE-Q alignment is counted both as a TP, reflecting the homologous alignment, and as a FP, because more than half of the alignment is outside the homologous domain; NH alignments are counted as only as FPs. not show the effect of HOE errors on search sensitivity PSI-BLAST inclusion threshold of 0.01, 3.7% of frac- (family coverage). HOE errors are not only important tional error-free coverage ends with an NH error; this because they happen early, but also because early errors drops to 2.2% at E() <10 . For the randomly sampled dramatically reduce the sensitivity of the search. Figure 4 query families, the reduction in NH errors is more provides an alternative perspective on the searches shown dramatic as the PSSM inclusion threshold becomes more in Figure 2; for exactly the same searches, it reports the stringent; NH errors terminate weighted coverage at 7.2% fractional error-free family coverage before the first FP. for E()<0.01 dropping to 0.9% at E()<10 . As expected, We measure error-free coverage as the family coverage NH first-errors drop with more stringent statistical achieved at the iteration before the first FP, and distribute thresholds. the coverage among the three different types of errors. Implementing more stringent inclusion thresholds has Thus, each of the 50 queries contributes 2% to the little effect on HOE errors (Figure 4). For hard queries, HOE-L fractional error-free coverage remains steady graph if all of its homologs are found. Here again (Figure 4A and C), we see that HOE-L and HOE-Q between 12 and 14% across the threshold range, and errors are responsible for the largest reduction in HOE-Q error-free coverage ranges between 2 and 3%. error-free coverage. NH errors have a small effect on the For the randomly-sampled queries, 29–32% of error-free level of error-free coverage. With hard queries, at a coverage ends with an HOE-L while 6–7% of error-free Downloaded from https://academic.oup.com/nar/article-abstract/38/7/2177/3100605 by Ed 'DeepDyve' Gillespie user on 06 February 2018 family coverage family coverage family coverage family coverage Nucleic Acids Research, 2010, Vol. 38, No. 7 2185 in an exon alignment can cause the alignment to extend into a non-homologous intron or flanking sequence (22). populated Because local alignment algorithms, like Smith– deserted Waterman, FASTA, BLAST and PSI-BLAST, determine their alignment boundaries to maximize the similarity 30 score, local alignment over-extension can occur adjacent to any high-scoring region. However, the problem may be exaggerated with profile-based methods like PSI-BLAST, which can produce higher positive scoring matrix values in conserved regions, and may weight non-homologous aligned residues to encourage alignment across the entire PSSM. (Indeed, the drop-off threshold was introduced in BLAST (1), which is implemented as the x-drop parameter in BLAST2 and PSI-BLAST (8), to reduce this tendency.) Thus, we expect that many HOE alignments occur iteration between a full-length domain (such as our query sequences), and partial homologous domains in other Figure 6. Difference between tree coverage and family coverage by iter- ation. A comparison between family and tree coverage for the proteins. non-embedded searches of two queries from each of 50 families We examined the importance of partial domains in against the long-domain library. Two queries per family were chosen nucleating HOE by comparing our query domains to a based on tree location: one domain query from a populated and reference library with a reduced number of partial another from a deserted area of the tree. The number of searches domains. In this long-domain library, Uniprot sequences where tree coverage was larger than family coverage is plotted by iteration. containing query homologs were excluded if their homol- ogous domains were shorter than 75% of the Pfam model length. Our standard domain library contains 234 505 coverage ends with an HOE-Q error. Thus, not only do UniProt sequences (and an equal number of randomly HOE-L and HOE-Q errors happen early, HOE errors shuffled sequences). Our long-domain library contains dramatically reduce the number of homologs that can be 139 165 sequences (and an equal number of random found before the first error. sequences). The plots in Figures 2 and 4 describe what happens Searches against the long-domain library dramatically when no errors are made; Figure 4 reports sensitivity illustrate the importance of HOE from partial domains. (family coverage) at very high specificity, i.e. before the Out of 50 hard searches, 45 generate HOE errors against first error occurs. In general, researchers are comfortable the standard library. Against the long-domain library, with a modest number of errors if that reduction in the number of searches with initial HOE errors decreases specificity allows more homologs to be detected (increased to 17, and to only one when the embedding sensitivity). Indeed, for the hard families, average fam- is removed (Figure 2B). In the randomly sampled ily coverage increases from 20 (Figure 4A) to 76% at families, HOE errors decrease from 36 (standard library) iteration 5 (Figure 5A). However, that increased sensitiv- to 7 (long domain library) to 1 (long-domain ity comes with a cost; selectivity drops from zero FPs library + non-embedded domains, Figure 2D). However, (Figure 4) to from 3% [E() <10 ]to5%[E() <0.01] the long-domain library also has better performance FPs. For the randomly sampled queries, coverage because long domains are easier to find than short increases from 59% (Figure 4C) to 86% coverage domains. Error free coverage improves by 15% (from (Figure 5C), with 9% FPs. 21 to 25% for hard queries, and from 56 to 67% for HOE is also a major source of error at iteration 5, sampled queries) when coverage of only long domains is although its predominance among the other error types calculated (data not shown). is not as dramatic as it is at first FP; HOE errors A similar trend is seen with weighted coverage before were found 1.5–3 times as frequently as the NH errors the first FP (Figure 4B and D) and with coverage and (Figure 5A and C). Our HOE measurements are underes- fraction of FPs at iteration 5 (Figure 5B and D). In timates. We only record HOE errors once they cover 50% both cases, when the long-domain library is searched, the of the alignment; a shorter over-extension might include a number of HOE errors drops dramatically (from 16 to 3% non-homologous domain, but the errors caused by that of weighted error free coverage for hard queries, 35–5% inclusion would be counted as NH errors. If HOE align- for randomly sampled queries), the no-error at conver- ments are classified after 25% over-extension, HOEs are gence coverage increases (from 2 to 58%), and total identified sooner, their frequency increases, and error-free coverage increases as well (from 20 to 65% for hard coverage is reduced even more. However, we used a 50% queries before the first FP). alignment threshold to focus on clear over-extensions. HOE happens frequently because partial domains occur in many proteins. Domains <75% of Pfam model length HOE errors largely occur because of partial domains are found in 95 340 (41%) of our protein sequences, and in proteins removing them from the library removes 47% of all HOE has been recognized for more than 10 years in domains. Twenty percent of the proteins in our standard genomic DNA sequence alignments; a strong similarity library contain a domain that is half the Pfam model Downloaded from https://academic.oup.com/nar/article-abstract/38/7/2177/3100605 by Ed 'DeepDyve' Gillespie user on 06 February 2018 tree coverage – family coverage family better tree better 2186 Nucleic Acids Research, 2010, Vol. 38, No. 7 length or less. The distribution of query domain homolog embedded replicate searches the query could only find lengths in our reference libraries is shown in itself, causing search convergence after just two iterations. For one of the 10 embedding replicates, the alignment of Supplementary Figure S2. the only recovered sequence at iteration 1 extends 11 While many HOEs of library sequences (HOE-Ls) start residues into the unrelated embedding. This extended in partial domains, HOE of the query sequence (HOE-Qs) alignment allowed homologs to be found, so that a cannot involve partial sequences; all our query sequences are full-length domains. For example, in Figure 3, while PSSM could be built. As a result, that one search in 10 the library sequence contains a partial domain, the achieved 95% family coverage at iteration 5. homology over-extends into the random embedding sequence around the query domain. The role of HOE-Q PSI-BLAST searches follow an opportunistic extension can be estimated by preventing over-extension phylogenetic path of the query sequence by removing the embedding. For the The traditional measure of search sensitivity, the fraction hard query sequences, searching with non-embedded of related homologs found in a search, can obscure the queries improves sensitivity (weighted coverage) from 20 performance of searches in different protein families. to 29% in the standard library and from 63 to 75% in the Many protein families in Pfam and UniProt are sampled long-domain library. Reducing HOE-L extension with the non-uniformly. Many vertebrate homologs might be avail- long-domain library and HOE-Q extension with able, but few bacteria or archea are present in the non-embedded query domains increases no-error at con- databases. Thus, a query sequence from a vertebrate vergence from 2% (hard) and 16% (sampled) weighted might achieve high coverage, or apparent sensitivity, coverage at E() < 0.005 to 67% (hard) and 90% while an archeal sequence might be much less successful (sampled) no error weighted coverage on the long-domain in identifying homologs. library without embedding. Tree coverage scoring complements the traditional While embedding the query domains usually reduces family coverage evaluation of similarity searching search coverage (Figure 4), we were surprised to find methods. Here, we count how completely the results that sometimes query embedding can improve search per- from a search cover a phylogenetic tree of the family formance (Table 2). The data in Figures 2, 4 and 5 reflect members. For the example above, a vertebrate query the result from a single random embedding for each of the might find 80% of the family members, but miss a large query domains, but searches were done with 10 different fraction of the tree, while the archeal query sequence embeddings. In 20% of the hard queries and 14% of the might find homologs in prokaryotes and eukaryotes, and sampled ones, family coverage at first FP differs by more in animals, plants and fungi, and thus sample the than 50% across the 10 embeddings. And for 6 and 20% phylogenetic tree much more completely. Tree coverage of the hard and sampled queries, respectively, the number is measured by counting the branch-length-weighted par- of FPs at iteration 5 also varies by more than 50% across titions traversed by the searches at each iteration the embedding replicates. For example, one embedding of (Supplementary Figure S1). the PF01018 domain finds 98% of the homologs in the To control for the different phylogenetic distributions library and makes very few errors; while the another of different domain families, we tested queries that were embedding of the same domain, only finds 22% of the chosen based on their location on the family tree; family and 78% of the results that PSI-BLAST reports one from a relatively populated region of the tree, a are FPs. The random embeddings alone do not produce second from a deserted region. In the early iterations, any statistically significant scores, but different searches from deserted parts of the family phylogeny embeddings can produce slightly different over-extension traverse the tree in a ‘depth-first’ order, while alignments. searches that begin in populated regions use an initial Indeed, for one query set, HOE of the query improved ‘breadth-first’ approach. PSI-BLAST performance because the random By traveling deep on the tree initially, queries from over-extension allowed a marginally insignificant deserted areas achieve better tree coverage, but worse homolog to score within the statistical threshold. Thus, family coverage, than queries from populated areas in for searches with a PF00589 query (Q1YWW7_PHOPR, early iterations. At iteration 1, 78% of the deserted (and a phage integrase domain), in nine out of 10 different 48% of the populated) queries have better tree coverage than family coverage (Figure 6). Tree location was not used as a criterion to choose the hard and sampled Table 2. Performance of different embeddings of PF01018 at queries. Yet, 84% of the hard queries were found to iteration 5 come from deserted areas of the tree while the sampled queries were split almost evenly across both locations Embedding replicate TPs TP FPs FP a a (48% populated and 52% deserted). As one might performance coverage frequency expect, the limitations in family coverage observed for Best 507 0.98 67 0.07 the hard families (Figures 4 and 5), are likely explained Worse 116 0.22 419 0.78 by their patterns of tree coverage. a Despite the difference in tree coverage paths between TP coverage is calculated by dividing the frequency of TPs at iteration ‘deserted’ and ‘populated’ queries, their family coverage 5 by family size. FP frequency is defined as the number of FPs found divided by the number of significant alignments found by the search. in the late iterations is similar. At first FP, tree and family Downloaded from https://academic.oup.com/nar/article-abstract/38/7/2177/3100605 by Ed 'DeepDyve' Gillespie user on 06 February 2018 Nucleic Acids Research, 2010, Vol. 38, No. 7 2187 coverage did not differ significantly in searches with PSI-BLAST noExt also performs well by the perfor- queries from deserted and populated areas of the tree. mance measures displayed in Figures 2, 4 and 5. Since Likewise, by iteration 5, 86 and 89% of the homologs PSI-BLAST noExt does not modify the initial alignment have been found by the deserted and populated queries, boundaries, its performance on the first FP measures respectively (Figure 6). should be similar to unmodified PSI-BLAST, and this is largely true. On the hard queries, PSI-BLAST noExt -73 (threshold E() <0.005) has a first FP at E() <10 , Preventing alignment extension improves PSI-BLAST similar to the unmodified PSI-BLAST. But with the specificity sampled queries the worst first FP is E() <10 ,a dramatic improvement over the unmodified PSI-BLAST We implemented a small modification to PSI-BLAST to E() <10 (Figure 2). PSI-BLAST noExt coverage reduce the effect of HOE that prevents alignments from [threshold E() <0.005] at first FP is 21.4% (hard) and being extended once they have been included in the PSSM 60.9% (sampled), which is slightly better than unmodified (PSI-BLAST noExt; see ‘Materials and Methods’ section). PSI-BLAST (20.8 and 56.5%, Figure 4). For the sampled With this modification, the alignment boundaries families, the coverage by searches with no-error at conver- calculated when a library sequence is first included in the gence increases from 15.9 to 28.9% and it is about the PSSM are not extended when that library sequence is same (2%) for both methods on the hard families. For aligned in subsequent iterations. In some cases, the the hard families, PSI-BLAST noExt coverage at iteration non-extension may decrease the sensitivity of the search 5 is 74.6% (hard) and 89.0% (sampled) with 1.4 and 2.0% because the initial alignment was too short, but the FPs, compared with 77.3 and 85.3% coverage, and 5 and no-extension modification provides a simple strategy for 9% FPs (Figure 5). The no-extension strategy does not reducing over-extension. With the PSI-BLAST default reduce over-extension when it first occurs, but it dramat- E() <0.005 threshold, the no extension modification dra- ically improves specificity, at a small cost in sensitivity, by matically improves PSI-BLAST specificity with a very preventing additional extension in subsequent iterations. small cost in sensitivity (Figure 7). At 50% family coverage, the PSI-BLAST noExt reduces the weighted fraction of errors from 0.02% of TPs (hard families) to 0.0036%, a 5.6-fold reduction in FPs. At the end of iter- DISCUSSION ation 5 with hard families, PSI-BLAST noExt achieves For iterative searches that build a PSSM, HOE, is an 93% of the family coverage of unmodified PSI-BLAST, important source of error caused by alignment errors. with one-quarter the FPs. For the randomly sampled On our data set, HOE errors account for the largest families, the improvement is even greater. For the fraction of PSI-BLAST errors (Figure 2), the greatest randomly sampled queries, PSI-BLAST noExt finds 98% reduction of error-free coverage (Figure 4), and the of the homologs found by conventional PSI-BLAST, but largest fractions of FPs at iteration 5 (Figure 5). reduces the FP errors 8.3-fold at maximum coverage and Moreover, we exclude nested or discontinuous domains 14.7-fold at 50% family coverage (Figure 7). A PSI-BLAST hard families B PSI-BLAST sampled families 1.0 1.0 0.8 0.8 0.6 0.6 0.4 0.4 PSI-BLAST -t1 0.2 PSI-BLAST -t2 0.2 PSI-BLAST noExt 0.0 0.0 0 0.02 0.04 0.06 0.08 0.1 0 0.04 0.08 0.12 0.16 0.2 fraction false positives (weighted) fraction false positives (weighted) Figure 7. Sensitivity and Specificity of PSI-BLAST and PSI-BLAST noExt. Weighted fractional family coverage for TPs is plotted as a function of weighted fractional FPs at iteration five using a threshold of E() <0.005. (A) Performance of unmodified PSI-BLAST with the t 1 composition adjustment (long dashed line), unmodified PSI-BLAST with the default t 2 composition adjustment (short dashed line), and PSI-BLAST noExt (solid line) on the hard queries. (B) Performance of PSI-BLAST t 1, PSI-BLAST t 2 and PSI-BLAST noExt on the randomly sampled queries. As in Figures 4 and 5, each family contributes 2% to the fraction of TPs (weighted). On the y-axis, fraction TPs (weighted) is calculated as 1=50 ðtp Þ=ðtotal Þ where tp is the number of TPs at iteration 5, and total is the total number of homologs in family f. Likewise, fraction f f f¼1 f f FPs (weighted) is calculated as 1=50 ðfp Þ=ðtotal Þ, where fp is the number of FPs at iteration 5. For this figure and Figure 5, an HOE-L and f¼1 f f HOE-Q alignment is counted both as a TP and a FP. Downloaded from https://academic.oup.com/nar/article-abstract/38/7/2177/3100605 by Ed 'DeepDyve' Gillespie user on 06 February 2018 fraction true positives (weighted) fraction true positives (weighted) 2188 Nucleic Acids Research, 2010, Vol. 38, No. 7 from our queries; such domains are likely to cause even strategy can recruit unrelated domains in iterative more HOE errors. searches. Alignment boundaries are less important for We used several strategies to reduce the possibility that pair-wise comparison, where errors do not propagate, HOE errors are an artifact of domain boundary annota- and less common in datasets that contain relatively few tion in the query and library sequences. For the queries, shuffled domains (e.g. the yeast genome) or domain-sized we embedded the domain queries in random flanking sequences [PDB (24), Astral (25) and CATH (19) sequences to clearly specify the boundary between the sequences]. In contrast, more comprehensive databases homologous and non-homologous (random) regions. of complete proteins contain many partial domains and Yet, long over-extensions from the query sequence domains in different combinations. account for about a quarter as much error-free coverage While we have focused on PSI-BLAST because it is the loss as is caused by over-extension of homologous library most popular iterative program for building PSSMs, HOE domains. While we are less certain of the accuracy of the will affect any automatic iterative profile searching library sequence domain boundaries, we extended the method (PSSM, HMM, etc.). The HMMs constructed annotations on the library sequences in an attempt to for Pfam (26) are curated and aligned manually, and reduce error misclassification. We carefully examined the thus less sensitive to the over-extension problem, and the HOE-L errors with very low E()-values shown in Figure 2, HMMER package provides two scoring strategies, ls and are confident that they involve alignments between (global) and –fs (local), which may more accurately sequences that are structurally unrelated at the fold delimit partial domain boundaries. But even the hmm-fs (SCOP) or topology (CATH) level. While over-extensions (local) scoring system may have a tendency to over-extend onto non-domain regions may be attributed to conserva- (domain) alignment boundaries, particularly if some parts tive homology annotations, we are confident that of the domain are highly conserved while others are more over-extensions onto unrelated domains are true errors. neutral. Domain over-extension is a problem for every HOE provides an explanation for the observation that sequence alignment, but when erroneous alignments are PSSMs built from manually curated Pfam HMMs iteratively incorporated, over-extension is more perform significantly better than that same PSSMs supple- pathological. mented with iterative searches against ‘nr’ [Figure 2 in ref. We can imagine three strategies for reducing the effect (12)]. The ‘nr’ proteins contain partial domains; HOE of iterative HOE and tested a crude implementation of one from these domains may be responsible for the reduction of these strategies. A simple strategy is to search with a in search specificity. domain against a library of domains, rather than against a HOE is an alignment error, rather than a statistical library of complete protein sequences. Pfam domain error. The high similarity scores that nucleate a HOE assignments are available for a large fraction of UniProt reflect genuine homologies. Thus, HOE errors occur sequences (15); iterative searches against the proteins early and with extremely low E-values; they cannot be broken at domain boundaries, so that both domain and fixed with better statistics or by lowering the inclusion inter-domain sequences are aligned, but separately, might threshold. Since its release in 1997, adjustments to the dramatically reduce PSSM corruption. construction of PSSMs (8,11,14) and improved estimates We have implemented a second strategy—limiting the of the statistical parameters used in the scoring function extension of an alignment once it has been included in the (12,23) have been described to address PSI-BLAST’s sus- PSSM. Currently, PSI-BLAST searches a database, ceptibility to PSSM corruption. Such refinements have identifies all the statistically significant sequences in the improved the sensitivity and specificity on the evaluation database, aligns them, builds a PSSM, and then repeats test sets, which are often sets of proteins from yeast the process, re-adjusting the boundaries of every sequence or Astral that contain few multiple-domain proteins or found in the database at every iteration. HOE takes place partial domains. Likewise, the approach suggested by over multiple iterations. Thus, to reduce it, we set the Lee et al. to resolve PSSM corruption does not address alignment boundaries when a sequence is first brought the HOE problem (10,13). Lee et al. suggest that the into the PSSM alignment, and preserve the alignment results from the first two iterations can be used to discrim- boundaries for that sequence in subsequent iterations. inate TPs from FPs at later iterations, but our results show Our initial implementation of this strategy is effective; that dramatic HOE errors are often seen at iteration 2, for hard and sampled queries, the strategy reduces 70 40 with expectation values as low as E() <10 – <10 the number of FPs 4–8-fold, with little reduction in sensi- (Figure 2). tivity (Figure 7). However, while this strategy reduces HOE is a natural consequence of an alignment strategy iterative over-extension, there is room for improvement: that is very effective in conventional pair-wise similarity initial alignments often extend beyond the domain searching. Because a longer homologous alignment boundaries. produces a higher score, modern alignment programs A third, more sophisticated strategy might use localized use gap penalties that are both as low as possible and PSSM/HMM scoring parameters to terminate alignments still allow local sequence alignments against unrelated sooner, which should, in turn, reduce over-extension. sequences, yet tend to produce near-global alignments in Scoring matrices have preferred alignment lengths. homologous sequences or domains. While this strategy Evolutionarily-deep matrices (e.g. BLOSUM45/50) improves sensitivity and has little effect on specificity generate longer alignments against random (unrelated) for pair-wise alignments (which are often bounded by sequences than shallower matrices (e.g. BLOSUM62/80) the ends of the sequences), the ‘longer alignment’ (27). In diverse families, a single PSSM or HMM model Downloaded from https://academic.oup.com/nar/article-abstract/38/7/2177/3100605 by Ed 'DeepDyve' Gillespie user on 06 February 2018 Nucleic Acids Research, 2010, Vol. 38, No. 7 2189 accuracy of PSI-BLAST protein database searches with will always be distant from the leaves of the family’s composition-based statistics and other refinements. Nucleic Acids phylogenetic tree. Thus, as the model incorporates more Res., 29, 2994–3005. diverse homologs, the evolutionary distances become 9. Sierk,M.L. and Pearson,W.R. (2004) Sensitivity and selectivity in larger and, as a result, these profiles produce longer align- protein structure comparison. Protein Sci., 13, 773–785. 10. Lee,M.M., Chan,M.K. and Bundschuh,R. (2009) SIB-BLAST: a ments against unrelated sequences. Having multiple web server for improved delineation of true and false shallower scoring models near the leaves of a tree, much positives in PSI-BLAST searches. Nucleic Acids Res., 37, like those SATCHMO (28) generates, might make it W53–W56. possible to design PSMMs/HMMs that are more likely 11. Altschul,S.F., Gertz,E.M., Agarwala,R., Schaffer,A.A. and to generate shorter alignments against unrelated Yu,Y.K. (2009) PSI-BLAST pseudocounts and the minimum description length principle. Nucleic Acids Res., 37, sequences. Alternatively, initial alignment boundaries 815–824. might be set by pair-wise alignment to the closest 12. Stojmirovic,A., Gertz,E.M., Altschul,S.F. and Yu,Y.K. (2008) homolog in the family using the appropriate The effectiveness of position- and composition-specific gap costs position-independent scoring matrix. for protein similarity searches. Bioinformatics, 24, i15–i23. 13. Lee,M.M., Chan,M.K. and Bundschuh,R. (2008) Simple is In addition to suggesting strategies for dramatically beautiful: a straightforward approach to improve the delineation improving PSI-BLAST specificity, our results suggest of true and false positives in PSI-BLAST searches. Bioinformatics, that sensitive profile-based comparison methods can 24, 1339–1343. produce accurate statistical estimates. One could argue 14. Altschul,S.F., Wootton,J.C., Gertz,E.M., Agarwala,R., that, as profile comparison methods approach the sensi- Morgulis,A., Schaffer,A.A. and Yu,Y.K. (2005) Protein database searches using compositionally adjusted substitution matrices. tivity of structure comparison, the reduced variety of Febs J., 272, 5101–5109. structural motifs might make it more difficult to distin- 15. Finn,R.D., Tate,J., Mistry,J., Coggill,P.C., Sammut,S.J., guish divergent from convergent similarities. While this Hotz,H.R., Ceric,G., Forslund,K., Eddy,S.R., Sonnhammer,E.L. may be true, the current inaccuracies in PSI-BLAST are et al. (2008) The Pfam protein families database. Nucleic Acids Res., 36, D281–D288. more easily explained by HOE. 16. Bairoch,A. and Apweiler,R. (1997) The SWISS-PROT protein sequence data bank and its supplement TrEMBL. Nucleic Acids Res., 25, 31–36. SUPPLEMENTARY DATA 17. UniProt Consortium. (2009) The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res., 38, D142–D148. Supplementary Data are available at NAR Online. 18. Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, FUNDING 536–540. 19. Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. National Library of Medicine, LM04969. Funding for and Thornton,J.M. (1997) CATH—a hierarchic classification open access charge: National Library of Medicine, R01 of protein domain structures. Structure, 5, 1093–1108. grant (LM04969). 20. Howe,K., Bateman,A. and Durbin,R. (2002) QuickTree: building huge neighbour-joining trees of protein sequences. Bioinformatics, Conflict of interest statement. None declared. 18, 1546–1547. 21. Durbin,R., Eddy,S., Krogh,A. and Mitchison,G. (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK. REFERENCES 22. Zhang,Z., Berman,P., Wiehe,T. and Miller,W. (1999) Post-processing long pairwise alignments. Bioinformatics, 15, 1. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. 1012–1019. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 23. Altschul,S.F., Bundschuh,R., Olsen,R. and Hwa,T. (2001) 403–410. The estimation of statistical parameters for local alignment score 2. Pearson,W.R. (2000) Flexible sequence similarity searching with distributions. Nucleic Acids Res., 29, 351–361. the FASTA3 program package. Methods Mol. Biol., 132, 185–219. 24. Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., 3. Pearson,W.R. (1991) Searching protein sequence libraries: Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein comparison of the sensitivity and selectivity of the Data Bank. Nucleic Acids Res., 28, 235–242. Smith-Waterman and FASTA algorithms. Genomics, 11, 635–650. 25. Chandonia,J.M., Hon,G., Walker,N.S., Lo Conte,L., Koehl,P., 4. Smith,T.F. and Waterman,M.S. (1981) Identification of common Levitt,M. and Brenner,S.E. (2004) The ASTRAL compendium molecular subsequences. J. Mol. Biol., 147, 195–197. in 2004. Nucleic Acids Res., 32, D189–D192. 5. Brenner,S.E., Chothia,C. and Hubbard,T.J. (1998) Assessing 26. Bateman,A., Coin,L., Durbin,R., Finn,R.D., Hollich,V., sequence comparison methods with reliable structurally identified Griffiths-Jones,S., Khanna,A., Marshall,M., Moxon,S., distant evolutionary relationships. Proc. Natl Acad. Sci. USA, 95, Sonnhammer,E.L. et al. (2004) The Pfam protein families 6073–6078. database. Nucleic Acids Res., 32, D138–D141. 6. Pearson,W.R. (1995) Comparison of methods for searching 27. Altschul,S.F. (1991) Amino acid substitution matrices from an protein sequence databases. Protein Sci., 4, 1145–1160. 7. Pearson,W.R. and Sierk,M.L. (2005) The limits of protein information theoretic perspective. J. Mol. Biol., 219, 555–565. sequence comparison? Curr. Opin. Struct. Biol., 15, 254–260. 28. Edgar,R.C. and Sjolander,K. (2003) SATCHMO: sequence 8. Schaffer,A.A., Aravind,L., Madden,T.L., Shavirin,S., Spouge,J.L., alignment and tree construction using hidden Markov models. Wolf,Y.I., Koonin,E.V. and Altschul,S.F. (2001) Improving the Bioinformatics, 19, 1404–1411. Downloaded from https://academic.oup.com/nar/article-abstract/38/7/2177/3100605 by Ed 'DeepDyve' Gillespie user on 06 February 2018 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Nucleic Acids Research Oxford University Press

Homologous over-extension: a challenge for iterative similarity searches

Loading next page...
 
/lp/oxford-university-press/homologous-over-extension-a-challenge-for-iterative-similarity-zUkAjocEdn

References (31)

Publisher
Oxford University Press
Copyright
© The Author(s) 2010. Published by Oxford University Press.
ISSN
0305-1048
eISSN
1362-4962
DOI
10.1093/nar/gkp1219
pmid
20064877
Publisher site
See Article on Publisher Site

Abstract

Published online 11 January 2010 Nucleic Acids Research, 2010, Vol. 38, No. 7 2177–2189 doi:10.1093/nar/gkp1219 Homologous over-extension: a challenge for iterative similarity searches 1 2, Mileidy W. Gonzalez and William R. Pearson * Department of Biological Sciences, University of Maryland Baltimore County, 1000 Hilltop Circle, Baltimore, MD 21250 and Department of Biochemistry and Molecular Genetics, Jordan Hall Box 800733, Charlottesville, VA 22908, USA Received November 19, 2009; Revised December 16, 2009; Accepted December 17, 2009 ABSTRACT homologous, and homologous proteins share similar structures and, often, similar functions. Thus, reliable We have characterized a novel type of PSI-BLAST transfer of knowledge between homologous proteins error, homologous over-extension (HOE), using requires accurate identification of homologs. embedded PFAM domain queries on searches For pair-wise similarity searches using BLAST (1), against a reference library containing Pfam- FASTA (2) or Smith–Waterman (3,4), the statistical esti- annotated UniProt sequences and random synthetic mates used to infer homology are very accurate; unrelated sequences rarely obtain expectation values lower than sequences. PSI-BLAST makes two types of errors: expected by chance (5,6). Statistical estimates for iterative alignments to non-homologous regions and HOE methods like PSI-BLAST can be much less accurate (7,8); alignments that begin in a homologous region, but sometimes sequences that are clearly unrelated (they have extend beyond the homology into neighboring different three-dimensional topologies) obtain highly sig- sequence regions. When the neighboring sequence –6 nificant expectation values (<10 ) that imply clear region contains a non-homologous domain, homology. These misleading alignments are thought to PSI-BLAST can incorporate the unrelated result from profile or position specific scoring matrix sequence into its position specific scoring matrix, (PSSM) contamination and from the PSI-BLAST which then finds non-homologous proteins with sig- sequence weighting strategy; when distant homologs are nificant expectation values. HOE accounts for the included in the PSSM, their contribution to the matrix is largest fraction of the initial false positive (FP) given a higher weight than when closely related sequences errors, and the largest fraction of FPs at iteration are included. An unfortunate side effect of this weighting occurs when unrelated sequences are included; they also 5. In searches against complete protein sequences, contribute strongly to the new PSSM, which then 5–9% of alignments at iteration 5 are non- produces high scores for homologs to the unrelated homologous. HOE frequently begins in a partial domain. The corruption of PSSMs leads to the assignment protein domain; when partial domains are removed of high scores and statistical significance to from the library, HOE errors decrease from 16 to 3% biologically-incorrect relationships—a serious problem of weighted coverage (hard queries; 35–5% for for any protein characterization effort. sampled queries) and no-error searches increase Alternatively, statistical errors in profile-based align- from 2 to 58% weighed coverage (hard; 16–78% ments might reflect inherent limits in distinguishing sampled). When HOE is reduced by not extending similarities produced by divergent evolution from those previously found sequences, PSI-BLAST specificity produced by convergence from unrelated proteins (7,9). Currently, the most sensitive structural comparison improves 4–8-fold, with little loss in sensitivity. methods often assign statistically significant similarity to non-homologous structural alignments, and this trend INTRODUCTION extends to profile–sequence and profile–profile alignments. This inability to distinguish divergent from convergent Protein similarity searching is central to genome annota- similarity may reflect the small number of regular struc- tion, characterization of protein families and exploration tural motifs in proteins. Similarly, PSI-BLAST’s inaccu- of distant evolutionary relationships. Similarity searching rate statistics may reflect its ability to capture some of this is effective because proteins that share statistically signif- structural information. icant sequence similarity can be inferred to be *To whom correspondence should be addressed. Tel: +1 434 924 2818; Fax: +1 434 924 5069; Email: wrp@virginia.edu The Author(s) 2010. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. Downloaded from https://academic.oup.com/nar/article-abstract/38/7/2177/3100605 by Ed 'DeepDyve' Gillespie user on 06 February 2018 2178 Nucleic Acids Research, 2010, Vol. 38, No. 7 Previous refinements to PSI-BLAST have addressed the two of the three domains of life; (v) only one family PSSM corruption issue (8,10–14) by improving the statis- from each clan superfamily was chosen; (vi) the family tical estimates used to evaluate alignment scores. In this contained only non-nested (contiguous) domains. Large article, we describe a novel cause of PSSM contamination: families were reduced to 1500 members by random over-extension of a homologous domain into a removal of members, to avoid an out-of-memory non-homologous region. Because this problem begins problem when PSI-BLAST stored large numbers of with a homologous alignment, it cannot be addressed alignments. with more accurate statistical estimates or more conserva- Query sequence selection. For each of the 320 Pfam tive inclusion thresholds; these errors can occur with domain families meeting these criteria, two domain very significant E()-values (<10 ). It is the alignment, members were chosen as queries based on their location not the score that is inaccurate. However, the problem on the family phylogenetic tree. One query was selected can be reduced dramatically by limiting alignment exten- from a populated area and another from a deserted sion after a domain is included in the PSSM. area of the tree. Each of the 640 queries was compared In evaluations of similarity searching programs, align- to the target database using BLAST, and the 50 queries ments are typically scored as true positives (TPs) or false producing the lowest family coverage at E() < 0.001 were positives (FPs) based on whether the library sequence chosen for the hard query set. In addition, 50 queries contains a homologous domain, rather than whether the were chosen at random with replacement to be part domain of interest is aligned correctly. For pair-wise of the sampled query set. A third set of 50 families sequence comparison, this approach summarizes search (100 queries) was selected to give strong differences performance relatively accurately; incorrect alignment between the populated and deserted regions of the boundaries rarely detract from the identification of the tree; one query was taken from each region for the homologous protein, and may reflect difficulties in accu- rately annotating domain boundaries. 50 families. With iterative methods like PSI-BLAST, accurate Query embedding. While our goal is to simulate searches domain alignments are more important, because inaccu- performed with full-length proteins against a full-length rate alignments can cause non-homologous domains protein database (the typical use of PSI-BLAST), our to be included in the PSSM used for the next iteration. query sequences are not complete proteins; they contain Therefore, to characterize PSI-BLAST errors, we recorded a single Pfam domain (complete proteins can contain the beginning and end of the alignments included in the multiple domains, with complex homology relationships). PSSM for the next iteration. Using the domain boundary To more accurately simulate searches with full-length annotations in our reference library, we could follow the proteins, we embedded bare Pfam domains in random homologous or non-homologous sequence regions that synthetic sequences. The embeddings were created were added to the PSSM. This process allowed us to by randomly shuffling the domain, splitting it in half, identify a novel source of PSSM corruption—homologous and placing each shuffled flank before and after the real over-extension (HOE) errors. HOE, particularly of partial domain. We confirmed that the random embedding domains, is a problem that can affect any iterative strategy sequences did not produce significant alignments for building profiles, PSSMs, or Hidden Markov Models by comparing them to the Swiss-Prot database (16). (HMM). We have developed a very simple strategy for Embedding the domains also allowed HOE to occur in reducing HOE errors—once a sequence is included in the query sequence, since alignments could extend the PSSM, the boundaries of its alignment do not beyond the Pfam domain boundaries. change. This strategy reduces FP errors by more than a factor of 8, with little loss in search sensitivity. Target library construction. The standard search library was constructed by bringing together full-length UniProt (17) proteins containing domains from the query families MATERIALS AND METHODS (excluding viral sequences), supplemented with an equal Evaluation datasets number of randomly generated sequences. The random sequences were generated by shuffling each full-length We generated two sequence datasets to characterize protein sequence. A second long-domain-only library was PSI-BLAST performance: (i) a set of query families and constructed from UniProt sequences containing family (ii) two target (reference) libraries. The query families are domains that were at least 75% of their respective Pfam groups of homologous protein domains derived from model length. Pfam (15). The target libraries are collections of full-length sequences used for iterative PSI-BLAST searching. Annotating the datasets. The target library sequence homology annotations specified by Pfam (v.21) were sup- Query family selection. A subset of domain families from plemented to reduce the number of unannotated Pfam version 21.0, originally 9000 domain families, was homologs and to extend domain boundaries in some selected and filtered down to 320 domain families meeting cases. Because Pfam identifies homologs by alignment the following criteria: (i) at least one PDB structure in the family, (ii) HMM models longer than 200 residues, (iii) with the model HMM, some library sequences con- more than 100 members in the family, (iv) families were tain cryptic domains that share strong similarity to a taxonomically-broad, with members from organisms in domain in the family that is distant from the model Downloaded from https://academic.oup.com/nar/article-abstract/38/7/2177/3100605 by Ed 'DeepDyve' Gillespie user on 06 February 2018 Nucleic Acids Research, 2010, Vol. 38, No. 7 2179 HMM, so that the cryptic domains do not meet the Pfam compatible library of all significant (new plus threshold. Potentially cryptic homologs were identified by HOE-corrected) HSP sequences plus 10 000 randomized reverse-searching apparent ‘non-homologs’ with E()- sequences—the latter set provides a large sequence sample for more accurate E-value calculation. (vi) We values <10 against the target library using SSEARCH run PSI-BLAST for two iterations against the v36 (3,4). Additional unannotated homologs were identified by examining Pfam v.23, SCOP (18) and significant-hits library with the first-iteration PSSM. (vii) CATH (19) classifications. Domain boundaries were We use all alignments with E() < 10 to build the adjusted using GLSEARCH, which produces an align- HOE-corrected second-iteration PSSM. (viii) We ment that is global in the query domain and local in continue to repeat steps (iii)–(vii) using the the library full-length sequence. The detailed curation HOE-corrected PSSM from the previous iteration (e.g. of the database will be described elsewhere (Gonzalez for iteration 3, we use the HOE-corrected second-iteration and Pearson, manuscript in preparation). The database PSSM) for a user-defined number of times or through may be accessed at http://faculty.virginia.edu/wrpear convergence. All PSI-BLAST searches use the t 1 son/fasta/PUBS/gonzalez09a. composition-based score adjustment (8), which is the As a result of Pfam family consolidation based on only permitted option when re-starting from a checkpoint updates in later versions of Pfam, examination of SCOP PSSM file. and CATH classifications, and additional searches, a Classification of PSI-BLAST errors. Error types were number of our initial Pfam families were merged. In the classified based on differences between the known standard library, the distribution of query hard and query embeddings, the annotated domain boundaries, randomly sampled domain family sizes (the number and the query and library sequence boundaries in the of homologous domains, or TPs) ranged from 84 to PSI-BLAST alignment. TPs: in a TP alignment, at least 11 435 with a median family size of 847 and first and 50% of the residues in the alignment overlapped the third quartiles at 473 and 1514 members. For the long query domain and the homologous region in the library domain library, family sizes (TPs) ranged from 29 sequence, as defined by the annotated Pfam boundaries to 6048, with a median of 425 and quartiles of 249 and (Figure 1A). Because of the 50% overlap requirement 870 members. for TPs, two types of FP errors can occur: (i) non-homologous FP alignments (NH-FP), which align PSI-BLAST searches the query domain to either a random sequence (Fig. 1B, PSI-BLAST (blastpgp 2.2.19) searches were performed left), a non-domain region (Fig. 1B middle), or a with test queries against the standard and long-domain non-homologous domain (Fig. 1B right); (2) HOE FPs libraries at four different inclusion thresholds [E() < 0.01; (Figure 1C). HOEs begin with a TP alignment but <0.005, the blastpgp default; <0.001; and <0.0001]. Each extend it so that more than 50% of the alignment is query family was evaluated by searching with the bare outside the homologous domains. Extensions can occur domain and with 10 different embedding replicates on the library domain (Figure 1C, left and middle) and (the same domain embedded in different shuffles). on the query domain (Figure 1C, right). Each different Searches ran to convergence or were stopped at 20 itera- error type for a sequence was recorded; multiple errors tions. All other parameters used the default values. of the same type were counted once for the sequence. By default, blastpgp 2.2.19 uses a composition-based For coverage at iteration 5 and the sensitivity/specificity score adjustment (t 2) described in ref. (14). However, analysis, HOE errors were counted both as TPs and FPs. for our modification of PSI-BLAST, we interrupted the Likewise, non-homologous (NH) alignments due to a program after each iteration, extracted the aligned library sequence domain that had appeared in an earlier sequences, and built a new PSSM for the sequences HOE alignment were scored as HOE alignments, but only using the appropriate alignment boundaries. Because the as FPs. program was stopped and restarted using its checkpoint Family and tree coverage. Because homologous target facility, a different composition-based score adjustment library domains are rarely uniformly distributed across [t 1, ref. (8)] is required by blastpgp. their phylogeny, simply counting the fraction of the PSI-BLAST noExt. We implemented a simple modifica- homologous sequences identified in a search can obscure tion to PSI-BLAST to prevent the propagation of HOE the evaluation of a method’s success in identifying distant errors. For the first iteration, (i) we search with homologs. Thus, we measured both family coverage, PSI-BLAST against the standard search library for two the fraction of the homologous domains identified, and rounds; (ii) the first-iteration PSSM and all significant tree coverage, the weighted fraction of tree partitions [E() < 0.005] alignments are stored. For the second itera- covered in the search. A phylogenetic tree for each tion, (iii) we search with PSI-BLAST against the standard query domain family was generated using Quicktree [v. search library for one round, using the first-iteration 1.1, ref. (20)]. The multiple sequence alignments were PSSM. (iv) The significant [E() < 0.005] alignments that built with HMMER [v. 2.3.2, ref. (21)] and the Pfam overlap with previously found high-scoring segment HMM model. Tree coverage is the weighted sum of the pairs (HSP) are replaced by the original alignments, branch lengths of the sub-tree covered by the homologs while all new HSP alignments are stored. (v) We build a found by the search, as illustrated in Supplementary significant-hits library: a formatdb PSI-BLAST Figure S1. Downloaded from https://academic.oup.com/nar/article-abstract/38/7/2177/3100605 by Ed 'DeepDyve' Gillespie user on 06 February 2018 2180 Nucleic Acids Research, 2010, Vol. 38, No. 7 A True Postive (TP) Alignment Query Embedded Domain Library Homologous Domain Sequence B Non-homologous False Positives (NH-FP) Unrelated Domain Random Library Sequence C Homologous Over-Extension (HOE) False Positives HOE-Library HOE-Library HOE-Query Figure 1. True positive and two types of PSI-BLAST errors. PSI-BLAST alignments are classified after comparing the alignment boundaries to the embedding boundaries in the query sequence, and to the annotated domain boundaries in the library sequence. (A) TP—an alignment is classified as a TP if at least 50% of the aligned residues overlap the Pfam annotation. Two types of FP errors can occur: (B) non-homologous FP alignments (NH-FP) and (C) homologous over-extension (HOE) FPs. Non-homologous FPs map entirely to random (and thus unrelated) sequences (B, left), to non-domain regions of the protein (B, middle), or to regions covered by unrelated Pfam domains (B, right). Homologous over-extension FPs occur when two homologous domains align, but the alignments overextend, so that more than 50% of the alignment is outside the homologous region. Both the library domain (C, left and middle) and the query domain (C, right) can overextend. RESULTS regions (NH-FP, Figure 1B) and alignments that begin in a homologous domain, but that extend into a PSI-BLAST makes two distinct types of errors non-homologous region (HOE, Figure 1C). Characterizing the sources of PSI-BLAST errors, in par- Non-homologous FP errors are well recognized. These ticular the process by which a PSI-BLAST PSSM becomes errors involve alignments either to random sequences contaminated, is challenging. PSI-BLAST searches (Figure 1B, left), to regions of the protein that do not proceed iteratively, with the addition of new sequences contain Pfam domain annotation (Figure1B, middle), or that can either improve the sensitivity of the PSSM, or, to regions that contain unrelated domains (Figure 1B, with the inclusion of unrelated sequences, that can turn right). Non-homologous errors have been the target of the PSSM toward an unrelated family. Moreover, the improved PSI-BLAST statistical estimates. sequence inclusions that take place early may not have a Non-homologous regions align by chance, so methods major effect until several iterations later. To try to follow that reduce chance alignments, either with more accurate the process of sequence inclusion and PSSM contamina- statistics or stricter inclusion thresholds, will reduce tion, we searched a mixture of real, full-length, UniProt non-homologous errors. This is an effective approach proteins and random synthetic sequences using two sets of because the E()-values for the first NH-FP errors are query sequences that contained a single Pfam domain. The often near the threshold for inclusion of a sequence for boundaries of each significant alignment, at each iteration, the next iteration. In our hard and randomly sampled data were recorded and classified by alignment type (Figure 1). sets, the first NH-FP errors had expectation values 7 3 TPs align the domain in the query sequence to homolo- ranging E() < 10 –10 (Figure 2A and C). gous domains in the reference library, while FPs align We were initially surprised that in our comparisons, the query domain to a non-homologous region. some FPs had extremely significant similarity scores, 70 40 Examination of FP alignments revealed two distinct with E()-values (expectations) < 10 –10 (Figure 2A error morphologies: alignments on non-homologous and C). Examination of these alignments provided a Downloaded from https://academic.oup.com/nar/article-abstract/38/7/2177/3100605 by Ed 'DeepDyve' Gillespie user on 06 February 2018 Nucleic Acids Research, 2010, Vol. 38, No. 7 2181 A hard - all domains B hard - long domains, non-embedded 0 0 –10 5 –10 –20 –20 HOE–L –30 –30 HOE–Q –40 –40 NH –50 –50 0 10 20 30 40 50 0 10 20 30 40 50 domain family domain family C sampled - all domains D sampled - long domains, non-embedded –10 –10 –20 –20 –30 –30 –40 –40 –50 –50 0 10 20 30 40 50 0 10 20 30 40 50 domain family domain family Figure 2. Distribution of initial alignment errors. The expectation values for the first FP errors are shown for each of 50 hard (A and B)or randomly-sampled (C and D) searches, classified by error type (i.e. HOE-L, HOE-Q and NH). FP E()-values are plotted from lowest (most-significant) to highest in each of the four panels; thus, query families are ordered differently in each panel. The iteration number for the first independent FP type (when two FP types occur for a query in the same iteration, the less significant FP is not considered independent) with the lowest E()-value (expectation) for each error type is also shown. (A) FP E()-values and error types with embedded hard queries against the standard domain library. (B) Searches with non-embedded hard queries against the long-domain library. (C) Searches with the embedded randomly sampled queries against the standard domain library [the E()-value for the lowest HOE-L first FP in this panel is E() = 5  10 , but it is graphed at E() = 10 ]. (D) Searches with non-embedded randomly sampled queries against the long-domain library. simple explanation; they began in a homologous region, searching with non-embedded domains (Figure 2B and which sometimes produced a high similarity score, but D). extended into a non-homologous region. In contrast to PSI-BLAST searches have a history that affects the non-homologous alignments, HOE cannot be reduced number of errors with better statistics or reasonable inclusion thresholds; a threshold that excluded alignments with E()-values less The pathological consequences of HOE are illustrated in than E() < 10 would exclude most homologs. HOE is Figure 3 and Table 1, which follow an over-extension that not caused by inaccurate statistical estimates; it is caused brings a new domain into the PSSM, so that the unrelated by inaccurate alignments. domain eventually crowds out the signal from the original In our searches, HOE errors can occur on the library homology. Here, the query domain is a 298-residue region sequence (HOE-L: Figure 1C, left and middle) and on the from 2 to 300 from Q8KR72_PHOLU, a condensation query sequence (HOE-Q: Figure 1C, right). But, library domain in the photobactin biosynthesis protein PhG. sequence over-extensions tend to cause more problems, For this PF00668 query, the first error occurs at iteration because, in the library sequence, the (HOE-L) extension 3 when an initially homologous alignment extends onto can continue into a non-homologous domain, which, in the unrelated PF00550 domain. As a result of the initial turn, can recruit additional homologs to that unrelated over-extension, by iteration 5, most of the alignments only domain. HOE-Q errors are found at all inclusion thresh- contain the PF00550 domain (Figure 3, Table 1). In fact, olds, but not as frequently as HOE-L errors: HOE-L while these two families are similar in size (PF00668: 2528 errors occur 10-times more often than HOE-Q errors members; PF00550: 2855 members) and the actual query (Figure 2A and C). HOE-Q errors cannot occur when was a member of the PF00668 family, the search at Downloaded from https://academic.oup.com/nar/article-abstract/38/7/2177/3100605 by Ed 'DeepDyve' Gillespie user on 06 February 2018 log(Expectation) log(Expectation) log(Expectation) log(Expectation) 2182 Nucleic Acids Research, 2010, Vol. 38, No. 7 >Q71EG8_BACSU|1368305 Length = 203 Score = 137 bits (346), Expect = 3e-31, Method: Composition-based stats. Identities = 25/214 (11%), Positives = 59/214 (27%), Gaps = 17/214 (7%) Query: 52 NHAPNKSGSFNNTLYKTVALLLALFPVTATRISEFSYQHAYLDNLTEHAWPYY-RQIFLR 110 + N T K AL + I + ++ ++ ++ Sbjct: 5 ADRQAYTAPRNVTEMKLCALWEEVLKNGPVGIRDHFFERGGHSLKATALVSRIAKEFGVQ 64 5th, 4th Iter 3rd Iter Query: 111 QWHSGERPYHKICDTSQLYSNGALGMVTAYSAQLARYVNPLT-QAQQYYWHEFSLHSEKV 169 + + + + + + P A+ +L V Sbjct: 65 VPLQDIFARPTVEELASVIQDLEESPYESIQPAQKTGHLPGVFGAETDVCAPTALEDGGV 124 2nd Iter Query: 170 VSTVAHYLDIQGNVDQEALCKAITMVISETDVLSVRFKRLEEERLPLQQPNQAATPQLKF 229 + L++ G +D+ L + ++ + L F+ + P+Q+ + + QL Sbjct: 125 GYNMPAVLELTGPLDRGRLEETFRQLVERHESLRTSFETGPD-GEPVQRIHDSVPFQL-- 181 Query: 230 IDLQTQPDPFNTALQLMRADVESPLNLLTQLLSA 263 D +A + P L L Sbjct: 182 -------DEAESADAFV-----RPFCLEEGPLFR 203 B 50 164 100 263 PF00668 150 448 19 83 117 203 PF00550 PF00668 5 119 58 203 Figure 3. Iterative growth of a homologous over-extension. (A) The raw PSI-BLAST output of a search querying a PF00668 embedded domain against the standard curated Pfam library at iteration 5. The portion of the alignment that contains the PF00668 homologous domain is shown in blue, while the over-extension onto the structurally unrelated PF00550 domain is shown in red. (B) A diagram that tracks the progression of the alignments shown in (A) from iterations 2 through 5. The alignment on the partial PF00668 domain begins as the first FP in the search at iteration 3, and continues to overextend further onto the unrelated PF00550 domain (in red) in subsequent iterations. By iteration 5, the entire unrelated PF00550 domain is covered by the overextended alignment. domain is characterized by a 2-layer ab sandwich fold, Table 1. HOE from PF00668 domain onto the unrelated PF00550 while the PF00550 acyl carrier protein domain is an all Iteration TPs Family Tree FPs Unrelated a structure. coverage coverage Pfam coverage HOE into unrelated domains can have a dramatic effect on alignments in subsequent iterations. As illustrated 1 271 0.11 0.14 0 2 998 0.49 0.93 0 in Figure 3 and Table 1, when an alignment extends 3 478 0.19 0.93 113 PF00550 (10), CL61(1) into an adjacent non-homologous domain, many of the 4 330 0.13 0.72 1784 PF00550 (1111) unrelated-domain homologs can be found. An extension PF08415 (15) into a non-domain region or an error that maps entirely to CL61 (1) a random sequence can also contaminate the PSSM, but CL28 (52) CL63 (14) this contamination is less likely to produce additional FPs, PF00501 (4) since it is much less likely that the non-domain or random 5 318 0.13 0.67 2174 PF00550 (1347/2855) sequence has additional homologs in the database. PF08415 (19/140) Non-homologous alignments with unrelated domains PF02911 (1/8) CL28 (38/2153) can cause error propagation in subsequent iterations, CL63 (16/9791) but these errors are relatively rare. The level of unrelated CL46 (6/3586) error propagation is proportional to the size of the PF00501 (4/1905) unrelated family: i.e. errors to larger families lead to The number of alignments to the unrelated family in the given itera- stronger corruption. tion is shown in parenthesis. The size of the family is shown for the last iteration. The number of FPs is often less than the sum on the align- ments that map to unrelated domains because sometimes HOE causes the largest number of errors over-extension in one sequence covered multiple unrelated domains. Because PSI-BLAST errors can propagate in later itera- iteration 5 covered 50% of the non-homologous tions, we consider the relative importance of PF00550 family and only 13% of the PF00668 family. non-homologous and HOE alignments from three differ- PF00668 and PF00550 have clearly distinct structures; ent perspectives. To focus on the initial cause of the errors, the PF00668 (Chloramphenicol Acetyl Transferase fold) we show that HOE errors are the most common initial Downloaded from https://academic.oup.com/nar/article-abstract/38/7/2177/3100605 by Ed 'DeepDyve' Gillespie user on 06 February 2018 Nucleic Acids Research, 2010, Vol. 38, No. 7 2183 HOE-Q non-hom Total HOE-L no error A hard - all domains B hard - long domains 0.5 1.0 0.4 0.8 0.3 0.6 0.2 0.4 0.1 0.1 0.0 0.0 0.01 0.005 0.001 0.0001 0.01 0.005 0.001 0.0001 E()-threshold E()-threshold C sampled - all domain D sampled - long domains 1.0 1.0 0.8 0.8 0.6 0.6 0.3 0.3 0.2 0.2 0.1 0.1 0.0 0.0 0.01 0.005 0.001 0.0001 0.01 0.005 0.001 0.0001 E()-threshold E()-threshold Figure 4. Error-free family coverage before the first FP. (A–D) show the weighted fraction of family coverage by FP type—HOE on the library sequence (HOE-L, filled down triangles), HOE on the query sequence (HOE-Q, filled up triangles), non-hom (non-homologous error, filled diamonds)—on two datasets: hard families (A and B) and sampled families (C and D). Performance against the standard library (all domains, panels A and C) and the long-domain library (B and D) is shown. Coverage for searches that converged without any error (no error) is plotted with an ‘X’. Results are weighted so that each of the 50 searches contributes 2% of the coverage. The weighted coverage of each search was calculated as 1=50 ðfp =FP Þðtp =total Þ where fp is the number of errors of each type (NH, HOE-Q, HOE-L) in family f;FP is the total number of FP f f f f¼1 f f f errors for the family in the error iteration; tp is the number of TPs found before the first error iteration, and total is the total family size in the complete or long-domain database. Thus, a search that achieves 50% family coverage before the first FP and had an equal number of HOE-L and NH errors would contribute 0.005 on both the HOE-L and the NH curves. The total family coverage in the iteration before the first FP for each search is also shown (open squares). error (Figure 2), and that HOE errors account errors occurred first in four searches. In addition, three for the greatest loss of error-free family coverage (Figure of the searches converged without an error. (These 4). However, since there is no practical way to recognize values add up to more than 50 because, in some the initial error in real searches, we also show that searches, several first errors occurred in the same itera- HOE errors produce the largest fraction of FPs at tion.) For the randomly-sampled query searches, the prev- iteration 5, a commonly used PSI-BLAST stopping alence of HOE-L first errors is similar; 34 queries had first point (Figure 5). HOE-L errors, 10 queries had HOE-Q first errors, 10 had Figure 2A and C display the E()-value and error type NH first errors and eight searches had no errors at con- for each of 50 hard and randomly-sampled queries in our vergence. Figure 2 also reports the iteration in which the test set for a search with the default PSI-BLAST inclusion first error of each type occurred. Figure 2 shows clearly threshold of E() < 0.005. In this figure, we report the the challenge posed by HOE alignments; many have lowest E()-value for each error type in the first iteration extremely significant E()-values (produced because the that produced an error. Thus, we seek to catch the initial alignment starts in a homologous region), which prevent error event, before the error alignment has been incorpo- a more conservative statistical approach from excluding rated into the PSSM. For the searches with hard families, these errors. HOE-L errors occurred first for 43/50 query domains, While Figure 2 shows that HOE errors occur early, HOE-Q errors occurred first in 12 searches, and NH often, and with very significant E()-values, Figure 2 does Downloaded from https://academic.oup.com/nar/article-abstract/38/7/2177/3100605 by Ed 'DeepDyve' Gillespie user on 06 February 2018 family coverage family coverage family coverage family coverage 2184 Nucleic Acids Research, 2010, Vol. 38, No. 7 HOE-Q non-hom True Positive HOE-L False Positive A hard - all domains B hard - long domains 1.0 1.0 0.8 0.8 0.1 0.1 0.0 0.0 0.01 0.005 0.001 0.0001 0.01 0.005 0.001 0.0001 E()-threshold E()-threshold C sampled - all domains D sampled - long domains 1.0 1.0 0.8 0.8 0.1 0.1 0.0 0.0 0.01 0.005 0.001 0.0001 0.01 0.005 0.001 0.0001 E()-threshold E()-threshold Figure 5. Family coverage at iteration 5. The frequency of TPs and FPs found at iteration 5, weighted by number of alignments from each search is shown. Both hard (A, B) and randomly sampled (C, D) families were tested against the standard library (all domains, A, C) and the long-domain library (B, D). The weighted FP frequency at iteration 5 is 1=50 ðfp Þ=ðtp þ FP Þ, where fp is the number of FPs of the specified type (HOE-Q, f¼1 f f f HOE-L, NH, or total errors) in family f at iteration 5, FP is the total number of FPs at iteration 5, and tp is the number of TPs found for the family at iteration 5. Filled squares plot the total weighted coverage of all three error types: HOE-Q (up-triangle), HOE-L (down-triangle) and NH (diamond). Total family coverage (open squares) is defined as 1=50 ðfp Þ=ðtp þ FP Þ, where total is the total number of homologs in the family. f f f f f¼1 With this weighting, a family that finds all of its homologs without any errors will contribute 0.02 to the coverage; a family that finds half of its homologs and an equal number of non-homologs will get a weighted frequency of (0.02  0.333) for the HOE-L, HOE-Q or non-hom. error type. For this figure and Figure 7, an HOE-L or HOE-Q alignment is counted both as a TP, reflecting the homologous alignment, and as a FP, because more than half of the alignment is outside the homologous domain; NH alignments are counted as only as FPs. not show the effect of HOE errors on search sensitivity PSI-BLAST inclusion threshold of 0.01, 3.7% of frac- (family coverage). HOE errors are not only important tional error-free coverage ends with an NH error; this because they happen early, but also because early errors drops to 2.2% at E() <10 . For the randomly sampled dramatically reduce the sensitivity of the search. Figure 4 query families, the reduction in NH errors is more provides an alternative perspective on the searches shown dramatic as the PSSM inclusion threshold becomes more in Figure 2; for exactly the same searches, it reports the stringent; NH errors terminate weighted coverage at 7.2% fractional error-free family coverage before the first FP. for E()<0.01 dropping to 0.9% at E()<10 . As expected, We measure error-free coverage as the family coverage NH first-errors drop with more stringent statistical achieved at the iteration before the first FP, and distribute thresholds. the coverage among the three different types of errors. Implementing more stringent inclusion thresholds has Thus, each of the 50 queries contributes 2% to the little effect on HOE errors (Figure 4). For hard queries, HOE-L fractional error-free coverage remains steady graph if all of its homologs are found. Here again (Figure 4A and C), we see that HOE-L and HOE-Q between 12 and 14% across the threshold range, and errors are responsible for the largest reduction in HOE-Q error-free coverage ranges between 2 and 3%. error-free coverage. NH errors have a small effect on the For the randomly-sampled queries, 29–32% of error-free level of error-free coverage. With hard queries, at a coverage ends with an HOE-L while 6–7% of error-free Downloaded from https://academic.oup.com/nar/article-abstract/38/7/2177/3100605 by Ed 'DeepDyve' Gillespie user on 06 February 2018 family coverage family coverage family coverage family coverage Nucleic Acids Research, 2010, Vol. 38, No. 7 2185 in an exon alignment can cause the alignment to extend into a non-homologous intron or flanking sequence (22). populated Because local alignment algorithms, like Smith– deserted Waterman, FASTA, BLAST and PSI-BLAST, determine their alignment boundaries to maximize the similarity 30 score, local alignment over-extension can occur adjacent to any high-scoring region. However, the problem may be exaggerated with profile-based methods like PSI-BLAST, which can produce higher positive scoring matrix values in conserved regions, and may weight non-homologous aligned residues to encourage alignment across the entire PSSM. (Indeed, the drop-off threshold was introduced in BLAST (1), which is implemented as the x-drop parameter in BLAST2 and PSI-BLAST (8), to reduce this tendency.) Thus, we expect that many HOE alignments occur iteration between a full-length domain (such as our query sequences), and partial homologous domains in other Figure 6. Difference between tree coverage and family coverage by iter- ation. A comparison between family and tree coverage for the proteins. non-embedded searches of two queries from each of 50 families We examined the importance of partial domains in against the long-domain library. Two queries per family were chosen nucleating HOE by comparing our query domains to a based on tree location: one domain query from a populated and reference library with a reduced number of partial another from a deserted area of the tree. The number of searches domains. In this long-domain library, Uniprot sequences where tree coverage was larger than family coverage is plotted by iteration. containing query homologs were excluded if their homol- ogous domains were shorter than 75% of the Pfam model length. Our standard domain library contains 234 505 coverage ends with an HOE-Q error. Thus, not only do UniProt sequences (and an equal number of randomly HOE-L and HOE-Q errors happen early, HOE errors shuffled sequences). Our long-domain library contains dramatically reduce the number of homologs that can be 139 165 sequences (and an equal number of random found before the first error. sequences). The plots in Figures 2 and 4 describe what happens Searches against the long-domain library dramatically when no errors are made; Figure 4 reports sensitivity illustrate the importance of HOE from partial domains. (family coverage) at very high specificity, i.e. before the Out of 50 hard searches, 45 generate HOE errors against first error occurs. In general, researchers are comfortable the standard library. Against the long-domain library, with a modest number of errors if that reduction in the number of searches with initial HOE errors decreases specificity allows more homologs to be detected (increased to 17, and to only one when the embedding sensitivity). Indeed, for the hard families, average fam- is removed (Figure 2B). In the randomly sampled ily coverage increases from 20 (Figure 4A) to 76% at families, HOE errors decrease from 36 (standard library) iteration 5 (Figure 5A). However, that increased sensitiv- to 7 (long domain library) to 1 (long-domain ity comes with a cost; selectivity drops from zero FPs library + non-embedded domains, Figure 2D). However, (Figure 4) to from 3% [E() <10 ]to5%[E() <0.01] the long-domain library also has better performance FPs. For the randomly sampled queries, coverage because long domains are easier to find than short increases from 59% (Figure 4C) to 86% coverage domains. Error free coverage improves by 15% (from (Figure 5C), with 9% FPs. 21 to 25% for hard queries, and from 56 to 67% for HOE is also a major source of error at iteration 5, sampled queries) when coverage of only long domains is although its predominance among the other error types calculated (data not shown). is not as dramatic as it is at first FP; HOE errors A similar trend is seen with weighted coverage before were found 1.5–3 times as frequently as the NH errors the first FP (Figure 4B and D) and with coverage and (Figure 5A and C). Our HOE measurements are underes- fraction of FPs at iteration 5 (Figure 5B and D). In timates. We only record HOE errors once they cover 50% both cases, when the long-domain library is searched, the of the alignment; a shorter over-extension might include a number of HOE errors drops dramatically (from 16 to 3% non-homologous domain, but the errors caused by that of weighted error free coverage for hard queries, 35–5% inclusion would be counted as NH errors. If HOE align- for randomly sampled queries), the no-error at conver- ments are classified after 25% over-extension, HOEs are gence coverage increases (from 2 to 58%), and total identified sooner, their frequency increases, and error-free coverage increases as well (from 20 to 65% for hard coverage is reduced even more. However, we used a 50% queries before the first FP). alignment threshold to focus on clear over-extensions. HOE happens frequently because partial domains occur in many proteins. Domains <75% of Pfam model length HOE errors largely occur because of partial domains are found in 95 340 (41%) of our protein sequences, and in proteins removing them from the library removes 47% of all HOE has been recognized for more than 10 years in domains. Twenty percent of the proteins in our standard genomic DNA sequence alignments; a strong similarity library contain a domain that is half the Pfam model Downloaded from https://academic.oup.com/nar/article-abstract/38/7/2177/3100605 by Ed 'DeepDyve' Gillespie user on 06 February 2018 tree coverage – family coverage family better tree better 2186 Nucleic Acids Research, 2010, Vol. 38, No. 7 length or less. The distribution of query domain homolog embedded replicate searches the query could only find lengths in our reference libraries is shown in itself, causing search convergence after just two iterations. For one of the 10 embedding replicates, the alignment of Supplementary Figure S2. the only recovered sequence at iteration 1 extends 11 While many HOEs of library sequences (HOE-Ls) start residues into the unrelated embedding. This extended in partial domains, HOE of the query sequence (HOE-Qs) alignment allowed homologs to be found, so that a cannot involve partial sequences; all our query sequences are full-length domains. For example, in Figure 3, while PSSM could be built. As a result, that one search in 10 the library sequence contains a partial domain, the achieved 95% family coverage at iteration 5. homology over-extends into the random embedding sequence around the query domain. The role of HOE-Q PSI-BLAST searches follow an opportunistic extension can be estimated by preventing over-extension phylogenetic path of the query sequence by removing the embedding. For the The traditional measure of search sensitivity, the fraction hard query sequences, searching with non-embedded of related homologs found in a search, can obscure the queries improves sensitivity (weighted coverage) from 20 performance of searches in different protein families. to 29% in the standard library and from 63 to 75% in the Many protein families in Pfam and UniProt are sampled long-domain library. Reducing HOE-L extension with the non-uniformly. Many vertebrate homologs might be avail- long-domain library and HOE-Q extension with able, but few bacteria or archea are present in the non-embedded query domains increases no-error at con- databases. Thus, a query sequence from a vertebrate vergence from 2% (hard) and 16% (sampled) weighted might achieve high coverage, or apparent sensitivity, coverage at E() < 0.005 to 67% (hard) and 90% while an archeal sequence might be much less successful (sampled) no error weighted coverage on the long-domain in identifying homologs. library without embedding. Tree coverage scoring complements the traditional While embedding the query domains usually reduces family coverage evaluation of similarity searching search coverage (Figure 4), we were surprised to find methods. Here, we count how completely the results that sometimes query embedding can improve search per- from a search cover a phylogenetic tree of the family formance (Table 2). The data in Figures 2, 4 and 5 reflect members. For the example above, a vertebrate query the result from a single random embedding for each of the might find 80% of the family members, but miss a large query domains, but searches were done with 10 different fraction of the tree, while the archeal query sequence embeddings. In 20% of the hard queries and 14% of the might find homologs in prokaryotes and eukaryotes, and sampled ones, family coverage at first FP differs by more in animals, plants and fungi, and thus sample the than 50% across the 10 embeddings. And for 6 and 20% phylogenetic tree much more completely. Tree coverage of the hard and sampled queries, respectively, the number is measured by counting the branch-length-weighted par- of FPs at iteration 5 also varies by more than 50% across titions traversed by the searches at each iteration the embedding replicates. For example, one embedding of (Supplementary Figure S1). the PF01018 domain finds 98% of the homologs in the To control for the different phylogenetic distributions library and makes very few errors; while the another of different domain families, we tested queries that were embedding of the same domain, only finds 22% of the chosen based on their location on the family tree; family and 78% of the results that PSI-BLAST reports one from a relatively populated region of the tree, a are FPs. The random embeddings alone do not produce second from a deserted region. In the early iterations, any statistically significant scores, but different searches from deserted parts of the family phylogeny embeddings can produce slightly different over-extension traverse the tree in a ‘depth-first’ order, while alignments. searches that begin in populated regions use an initial Indeed, for one query set, HOE of the query improved ‘breadth-first’ approach. PSI-BLAST performance because the random By traveling deep on the tree initially, queries from over-extension allowed a marginally insignificant deserted areas achieve better tree coverage, but worse homolog to score within the statistical threshold. Thus, family coverage, than queries from populated areas in for searches with a PF00589 query (Q1YWW7_PHOPR, early iterations. At iteration 1, 78% of the deserted (and a phage integrase domain), in nine out of 10 different 48% of the populated) queries have better tree coverage than family coverage (Figure 6). Tree location was not used as a criterion to choose the hard and sampled Table 2. Performance of different embeddings of PF01018 at queries. Yet, 84% of the hard queries were found to iteration 5 come from deserted areas of the tree while the sampled queries were split almost evenly across both locations Embedding replicate TPs TP FPs FP a a (48% populated and 52% deserted). As one might performance coverage frequency expect, the limitations in family coverage observed for Best 507 0.98 67 0.07 the hard families (Figures 4 and 5), are likely explained Worse 116 0.22 419 0.78 by their patterns of tree coverage. a Despite the difference in tree coverage paths between TP coverage is calculated by dividing the frequency of TPs at iteration ‘deserted’ and ‘populated’ queries, their family coverage 5 by family size. FP frequency is defined as the number of FPs found divided by the number of significant alignments found by the search. in the late iterations is similar. At first FP, tree and family Downloaded from https://academic.oup.com/nar/article-abstract/38/7/2177/3100605 by Ed 'DeepDyve' Gillespie user on 06 February 2018 Nucleic Acids Research, 2010, Vol. 38, No. 7 2187 coverage did not differ significantly in searches with PSI-BLAST noExt also performs well by the perfor- queries from deserted and populated areas of the tree. mance measures displayed in Figures 2, 4 and 5. Since Likewise, by iteration 5, 86 and 89% of the homologs PSI-BLAST noExt does not modify the initial alignment have been found by the deserted and populated queries, boundaries, its performance on the first FP measures respectively (Figure 6). should be similar to unmodified PSI-BLAST, and this is largely true. On the hard queries, PSI-BLAST noExt -73 (threshold E() <0.005) has a first FP at E() <10 , Preventing alignment extension improves PSI-BLAST similar to the unmodified PSI-BLAST. But with the specificity sampled queries the worst first FP is E() <10 ,a dramatic improvement over the unmodified PSI-BLAST We implemented a small modification to PSI-BLAST to E() <10 (Figure 2). PSI-BLAST noExt coverage reduce the effect of HOE that prevents alignments from [threshold E() <0.005] at first FP is 21.4% (hard) and being extended once they have been included in the PSSM 60.9% (sampled), which is slightly better than unmodified (PSI-BLAST noExt; see ‘Materials and Methods’ section). PSI-BLAST (20.8 and 56.5%, Figure 4). For the sampled With this modification, the alignment boundaries families, the coverage by searches with no-error at conver- calculated when a library sequence is first included in the gence increases from 15.9 to 28.9% and it is about the PSSM are not extended when that library sequence is same (2%) for both methods on the hard families. For aligned in subsequent iterations. In some cases, the the hard families, PSI-BLAST noExt coverage at iteration non-extension may decrease the sensitivity of the search 5 is 74.6% (hard) and 89.0% (sampled) with 1.4 and 2.0% because the initial alignment was too short, but the FPs, compared with 77.3 and 85.3% coverage, and 5 and no-extension modification provides a simple strategy for 9% FPs (Figure 5). The no-extension strategy does not reducing over-extension. With the PSI-BLAST default reduce over-extension when it first occurs, but it dramat- E() <0.005 threshold, the no extension modification dra- ically improves specificity, at a small cost in sensitivity, by matically improves PSI-BLAST specificity with a very preventing additional extension in subsequent iterations. small cost in sensitivity (Figure 7). At 50% family coverage, the PSI-BLAST noExt reduces the weighted fraction of errors from 0.02% of TPs (hard families) to 0.0036%, a 5.6-fold reduction in FPs. At the end of iter- DISCUSSION ation 5 with hard families, PSI-BLAST noExt achieves For iterative searches that build a PSSM, HOE, is an 93% of the family coverage of unmodified PSI-BLAST, important source of error caused by alignment errors. with one-quarter the FPs. For the randomly sampled On our data set, HOE errors account for the largest families, the improvement is even greater. For the fraction of PSI-BLAST errors (Figure 2), the greatest randomly sampled queries, PSI-BLAST noExt finds 98% reduction of error-free coverage (Figure 4), and the of the homologs found by conventional PSI-BLAST, but largest fractions of FPs at iteration 5 (Figure 5). reduces the FP errors 8.3-fold at maximum coverage and Moreover, we exclude nested or discontinuous domains 14.7-fold at 50% family coverage (Figure 7). A PSI-BLAST hard families B PSI-BLAST sampled families 1.0 1.0 0.8 0.8 0.6 0.6 0.4 0.4 PSI-BLAST -t1 0.2 PSI-BLAST -t2 0.2 PSI-BLAST noExt 0.0 0.0 0 0.02 0.04 0.06 0.08 0.1 0 0.04 0.08 0.12 0.16 0.2 fraction false positives (weighted) fraction false positives (weighted) Figure 7. Sensitivity and Specificity of PSI-BLAST and PSI-BLAST noExt. Weighted fractional family coverage for TPs is plotted as a function of weighted fractional FPs at iteration five using a threshold of E() <0.005. (A) Performance of unmodified PSI-BLAST with the t 1 composition adjustment (long dashed line), unmodified PSI-BLAST with the default t 2 composition adjustment (short dashed line), and PSI-BLAST noExt (solid line) on the hard queries. (B) Performance of PSI-BLAST t 1, PSI-BLAST t 2 and PSI-BLAST noExt on the randomly sampled queries. As in Figures 4 and 5, each family contributes 2% to the fraction of TPs (weighted). On the y-axis, fraction TPs (weighted) is calculated as 1=50 ðtp Þ=ðtotal Þ where tp is the number of TPs at iteration 5, and total is the total number of homologs in family f. Likewise, fraction f f f¼1 f f FPs (weighted) is calculated as 1=50 ðfp Þ=ðtotal Þ, where fp is the number of FPs at iteration 5. For this figure and Figure 5, an HOE-L and f¼1 f f HOE-Q alignment is counted both as a TP and a FP. Downloaded from https://academic.oup.com/nar/article-abstract/38/7/2177/3100605 by Ed 'DeepDyve' Gillespie user on 06 February 2018 fraction true positives (weighted) fraction true positives (weighted) 2188 Nucleic Acids Research, 2010, Vol. 38, No. 7 from our queries; such domains are likely to cause even strategy can recruit unrelated domains in iterative more HOE errors. searches. Alignment boundaries are less important for We used several strategies to reduce the possibility that pair-wise comparison, where errors do not propagate, HOE errors are an artifact of domain boundary annota- and less common in datasets that contain relatively few tion in the query and library sequences. For the queries, shuffled domains (e.g. the yeast genome) or domain-sized we embedded the domain queries in random flanking sequences [PDB (24), Astral (25) and CATH (19) sequences to clearly specify the boundary between the sequences]. In contrast, more comprehensive databases homologous and non-homologous (random) regions. of complete proteins contain many partial domains and Yet, long over-extensions from the query sequence domains in different combinations. account for about a quarter as much error-free coverage While we have focused on PSI-BLAST because it is the loss as is caused by over-extension of homologous library most popular iterative program for building PSSMs, HOE domains. While we are less certain of the accuracy of the will affect any automatic iterative profile searching library sequence domain boundaries, we extended the method (PSSM, HMM, etc.). The HMMs constructed annotations on the library sequences in an attempt to for Pfam (26) are curated and aligned manually, and reduce error misclassification. We carefully examined the thus less sensitive to the over-extension problem, and the HOE-L errors with very low E()-values shown in Figure 2, HMMER package provides two scoring strategies, ls and are confident that they involve alignments between (global) and –fs (local), which may more accurately sequences that are structurally unrelated at the fold delimit partial domain boundaries. But even the hmm-fs (SCOP) or topology (CATH) level. While over-extensions (local) scoring system may have a tendency to over-extend onto non-domain regions may be attributed to conserva- (domain) alignment boundaries, particularly if some parts tive homology annotations, we are confident that of the domain are highly conserved while others are more over-extensions onto unrelated domains are true errors. neutral. Domain over-extension is a problem for every HOE provides an explanation for the observation that sequence alignment, but when erroneous alignments are PSSMs built from manually curated Pfam HMMs iteratively incorporated, over-extension is more perform significantly better than that same PSSMs supple- pathological. mented with iterative searches against ‘nr’ [Figure 2 in ref. We can imagine three strategies for reducing the effect (12)]. The ‘nr’ proteins contain partial domains; HOE of iterative HOE and tested a crude implementation of one from these domains may be responsible for the reduction of these strategies. A simple strategy is to search with a in search specificity. domain against a library of domains, rather than against a HOE is an alignment error, rather than a statistical library of complete protein sequences. Pfam domain error. The high similarity scores that nucleate a HOE assignments are available for a large fraction of UniProt reflect genuine homologies. Thus, HOE errors occur sequences (15); iterative searches against the proteins early and with extremely low E-values; they cannot be broken at domain boundaries, so that both domain and fixed with better statistics or by lowering the inclusion inter-domain sequences are aligned, but separately, might threshold. Since its release in 1997, adjustments to the dramatically reduce PSSM corruption. construction of PSSMs (8,11,14) and improved estimates We have implemented a second strategy—limiting the of the statistical parameters used in the scoring function extension of an alignment once it has been included in the (12,23) have been described to address PSI-BLAST’s sus- PSSM. Currently, PSI-BLAST searches a database, ceptibility to PSSM corruption. Such refinements have identifies all the statistically significant sequences in the improved the sensitivity and specificity on the evaluation database, aligns them, builds a PSSM, and then repeats test sets, which are often sets of proteins from yeast the process, re-adjusting the boundaries of every sequence or Astral that contain few multiple-domain proteins or found in the database at every iteration. HOE takes place partial domains. Likewise, the approach suggested by over multiple iterations. Thus, to reduce it, we set the Lee et al. to resolve PSSM corruption does not address alignment boundaries when a sequence is first brought the HOE problem (10,13). Lee et al. suggest that the into the PSSM alignment, and preserve the alignment results from the first two iterations can be used to discrim- boundaries for that sequence in subsequent iterations. inate TPs from FPs at later iterations, but our results show Our initial implementation of this strategy is effective; that dramatic HOE errors are often seen at iteration 2, for hard and sampled queries, the strategy reduces 70 40 with expectation values as low as E() <10 – <10 the number of FPs 4–8-fold, with little reduction in sensi- (Figure 2). tivity (Figure 7). However, while this strategy reduces HOE is a natural consequence of an alignment strategy iterative over-extension, there is room for improvement: that is very effective in conventional pair-wise similarity initial alignments often extend beyond the domain searching. Because a longer homologous alignment boundaries. produces a higher score, modern alignment programs A third, more sophisticated strategy might use localized use gap penalties that are both as low as possible and PSSM/HMM scoring parameters to terminate alignments still allow local sequence alignments against unrelated sooner, which should, in turn, reduce over-extension. sequences, yet tend to produce near-global alignments in Scoring matrices have preferred alignment lengths. homologous sequences or domains. While this strategy Evolutionarily-deep matrices (e.g. BLOSUM45/50) improves sensitivity and has little effect on specificity generate longer alignments against random (unrelated) for pair-wise alignments (which are often bounded by sequences than shallower matrices (e.g. BLOSUM62/80) the ends of the sequences), the ‘longer alignment’ (27). In diverse families, a single PSSM or HMM model Downloaded from https://academic.oup.com/nar/article-abstract/38/7/2177/3100605 by Ed 'DeepDyve' Gillespie user on 06 February 2018 Nucleic Acids Research, 2010, Vol. 38, No. 7 2189 accuracy of PSI-BLAST protein database searches with will always be distant from the leaves of the family’s composition-based statistics and other refinements. Nucleic Acids phylogenetic tree. Thus, as the model incorporates more Res., 29, 2994–3005. diverse homologs, the evolutionary distances become 9. Sierk,M.L. and Pearson,W.R. (2004) Sensitivity and selectivity in larger and, as a result, these profiles produce longer align- protein structure comparison. Protein Sci., 13, 773–785. 10. Lee,M.M., Chan,M.K. and Bundschuh,R. (2009) SIB-BLAST: a ments against unrelated sequences. Having multiple web server for improved delineation of true and false shallower scoring models near the leaves of a tree, much positives in PSI-BLAST searches. Nucleic Acids Res., 37, like those SATCHMO (28) generates, might make it W53–W56. possible to design PSMMs/HMMs that are more likely 11. Altschul,S.F., Gertz,E.M., Agarwala,R., Schaffer,A.A. and to generate shorter alignments against unrelated Yu,Y.K. (2009) PSI-BLAST pseudocounts and the minimum description length principle. Nucleic Acids Res., 37, sequences. Alternatively, initial alignment boundaries 815–824. might be set by pair-wise alignment to the closest 12. Stojmirovic,A., Gertz,E.M., Altschul,S.F. and Yu,Y.K. (2008) homolog in the family using the appropriate The effectiveness of position- and composition-specific gap costs position-independent scoring matrix. for protein similarity searches. Bioinformatics, 24, i15–i23. 13. Lee,M.M., Chan,M.K. and Bundschuh,R. (2008) Simple is In addition to suggesting strategies for dramatically beautiful: a straightforward approach to improve the delineation improving PSI-BLAST specificity, our results suggest of true and false positives in PSI-BLAST searches. Bioinformatics, that sensitive profile-based comparison methods can 24, 1339–1343. produce accurate statistical estimates. One could argue 14. Altschul,S.F., Wootton,J.C., Gertz,E.M., Agarwala,R., that, as profile comparison methods approach the sensi- Morgulis,A., Schaffer,A.A. and Yu,Y.K. (2005) Protein database searches using compositionally adjusted substitution matrices. tivity of structure comparison, the reduced variety of Febs J., 272, 5101–5109. structural motifs might make it more difficult to distin- 15. Finn,R.D., Tate,J., Mistry,J., Coggill,P.C., Sammut,S.J., guish divergent from convergent similarities. While this Hotz,H.R., Ceric,G., Forslund,K., Eddy,S.R., Sonnhammer,E.L. may be true, the current inaccuracies in PSI-BLAST are et al. (2008) The Pfam protein families database. Nucleic Acids Res., 36, D281–D288. more easily explained by HOE. 16. Bairoch,A. and Apweiler,R. (1997) The SWISS-PROT protein sequence data bank and its supplement TrEMBL. Nucleic Acids Res., 25, 31–36. SUPPLEMENTARY DATA 17. UniProt Consortium. (2009) The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res., 38, D142–D148. Supplementary Data are available at NAR Online. 18. Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, FUNDING 536–540. 19. Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. National Library of Medicine, LM04969. Funding for and Thornton,J.M. (1997) CATH—a hierarchic classification open access charge: National Library of Medicine, R01 of protein domain structures. Structure, 5, 1093–1108. grant (LM04969). 20. Howe,K., Bateman,A. and Durbin,R. (2002) QuickTree: building huge neighbour-joining trees of protein sequences. Bioinformatics, Conflict of interest statement. None declared. 18, 1546–1547. 21. Durbin,R., Eddy,S., Krogh,A. and Mitchison,G. (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK. REFERENCES 22. Zhang,Z., Berman,P., Wiehe,T. and Miller,W. (1999) Post-processing long pairwise alignments. Bioinformatics, 15, 1. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. 1012–1019. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 23. Altschul,S.F., Bundschuh,R., Olsen,R. and Hwa,T. (2001) 403–410. The estimation of statistical parameters for local alignment score 2. Pearson,W.R. (2000) Flexible sequence similarity searching with distributions. Nucleic Acids Res., 29, 351–361. the FASTA3 program package. Methods Mol. Biol., 132, 185–219. 24. Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., 3. Pearson,W.R. (1991) Searching protein sequence libraries: Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein comparison of the sensitivity and selectivity of the Data Bank. Nucleic Acids Res., 28, 235–242. Smith-Waterman and FASTA algorithms. Genomics, 11, 635–650. 25. Chandonia,J.M., Hon,G., Walker,N.S., Lo Conte,L., Koehl,P., 4. Smith,T.F. and Waterman,M.S. (1981) Identification of common Levitt,M. and Brenner,S.E. (2004) The ASTRAL compendium molecular subsequences. J. Mol. Biol., 147, 195–197. in 2004. Nucleic Acids Res., 32, D189–D192. 5. Brenner,S.E., Chothia,C. and Hubbard,T.J. (1998) Assessing 26. Bateman,A., Coin,L., Durbin,R., Finn,R.D., Hollich,V., sequence comparison methods with reliable structurally identified Griffiths-Jones,S., Khanna,A., Marshall,M., Moxon,S., distant evolutionary relationships. Proc. Natl Acad. Sci. USA, 95, Sonnhammer,E.L. et al. (2004) The Pfam protein families 6073–6078. database. Nucleic Acids Res., 32, D138–D141. 6. Pearson,W.R. (1995) Comparison of methods for searching 27. Altschul,S.F. (1991) Amino acid substitution matrices from an protein sequence databases. Protein Sci., 4, 1145–1160. 7. Pearson,W.R. and Sierk,M.L. (2005) The limits of protein information theoretic perspective. J. Mol. Biol., 219, 555–565. sequence comparison? Curr. Opin. Struct. Biol., 15, 254–260. 28. Edgar,R.C. and Sjolander,K. (2003) SATCHMO: sequence 8. Schaffer,A.A., Aravind,L., Madden,T.L., Shavirin,S., Spouge,J.L., alignment and tree construction using hidden Markov models. Wolf,Y.I., Koonin,E.V. and Altschul,S.F. (2001) Improving the Bioinformatics, 19, 1404–1411. Downloaded from https://academic.oup.com/nar/article-abstract/38/7/2177/3100605 by Ed 'DeepDyve' Gillespie user on 06 February 2018

Journal

Nucleic Acids ResearchOxford University Press

Published: Jan 11, 2010

There are no references for this article.