Toward completion of the Earth’s proteome: an update a decade later

Abstract

Protein databases are steadily growing, driven by the spread of new, more efficient sequencing techniques. This growth is dominated by an increase in redundancy (homologous proteins with various degrees of sequence similarity) and by the inability to process and curate sequence entries as fast as they are created. To understand these trends and aid bioinformatic resources that might be compromised by the increasing size of the protein sequence databases, we have created a less-redundant protein data set. In parallel, we analyzed the evolution of protein sequence databases in terms of size and redundancy. While the SwissProt database has decelerated its growth, mostly because of a focus on increasing the level of annotation of its sequences, its counterpart TrEMBL, much less limited by curation steps, is still in a phase of accelerated growth. However, we predict that before 2020, almost all entries deposited in UniProtKB will be homologous to known proteins. We propose that new sequencing projects can be made more useful if they are directed to sequencing voids, parts of the tree of life far from already sequenced species or model organisms. We show that these voids are present in the Archaea and Eukarya domains of life. The near certainty that new protein sequence entries will be redundant suggests that most of the protein diversity on Earth has already been described; we estimate this diversity at around 3.75 million proteins, revising down the prediction we made a decade ago.

Keywords: protein databases, sequencing projects

Introduction

As our group already reported 10 years ago [1], while protein sequence databases such as Entrez [2] and UniProtKB [3] grow at increasing speed, their increase in novelty, understood as the rate at which new entries deposited in these databases are nonhomologous to already known entries, decreases.
Our analysis allowed us to estimate that by 2018 all proteins on Earth, which we estimated to be 5 million, would be known. Here, we update this study using new data. The sequencing capability has increased since then without a parallel increase in new sequences, as we predicted. Ten years ago, complete genomes were starting to be sequenced as a standard procedure (148 genome projects were completed in 2007) [4]. Since then, sequencing strategies keep becoming faster and more affordable, which results in increasing numbers of genomes being sequenced each year (just in 2016, 1903 genomes were completely sequenced, and there were 28 377 genomes close to completion) [4]. The related processes of identifying coding genes, annotating genomic regions and predicting novel functions are also being backed up computationally with an increasingly larger myriad of Web tools and automatic programs that generate more data than humans could possibly curate. The desire for new biological knowledge collides with the demands for a responsible storage and curation of the data. In 2015, UniProtKB retired almost 47 million entries by removing strains of the same bacterial species with highly redundant proteomes [5]. However, apart from this type of redundancy, sequence databases also have protein redundancy, in which several entries for the same protein are stored in different records (e.g. UniProtKB:Q9BZZ5, UniProtKB:G3V1C3 and UniProtKB:H0YER7, for human protein API5). One could also argue that orthologous proteins are redundant as they are in fact the same protein in different organisms, or even that homologous proteins are redundant. Redundancy of protein sequences is thus a relative concept. Redundancy is not necessarily a bad feature for a sequence database; it offers valuable information in relation to sequence variability and helps to establish solid differences between taxa. 
But redundancy also introduces noise in sequence similarity searches, which negatively affects the automatic and manual assessment of the results [6]. A possible solution to control the redundancy of the protein sequence entries is the creation of databases such as UniRef [7] and Uniclust [8], in which sequences are clustered together based on predefined levels of identity. While this strategy helps to structure the database and thus reduce the number of entries to deal with, it does not address in general the fundamental issue of the repetition of sequence information, which is independent of any fixed level of identity. Here, we take a first step toward the creation of a data set of unique proteins (none of them similar over its full length to any other) representing the minimum data set covering all the protein diversity on Earth. To generate this low-redundancy protein data set, we simplified UniProtKB from >74.2 million sequences to 8.2 million, using a pipeline based on a non-exhaustive all-versus-all strategy. Because the strategy is non-exhaustive, the redundancy of the resulting data set is not completely removed. Our data set might be used as a step toward the generation of the nonredundant data set. Once a nonredundant data set of unique proteins is created, new proteins can be compared with it, and only the sequences without a similar protein would be added to the data set. This would constitute a meaningful protein database with sustainable growth. In parallel to generating this data set, we studied the way UniProtKB has evolved historically, and how it has responded to the caveats about protein and proteome redundancy throughout the years. The study of the redundancy of our reduced protein data set, along with two intermediate ones created during the process of reduction of the initial database, allows us to estimate the number of unique proteins on Earth.
Finally, we describe the current sequencing voids based on the known and sequenced species from the three domains of life and viruses, to guide future sequencing projects toward closing those sequencing gaps.

Methods

Data retrieval

Old SwissProt data sets from release 9.0 (November 1988) to release 45.0 (October 2004) were downloaded from the ftp site of ExPASy (ftp://ftp.expasy.org/databases/swiss-prot/sw_old_releases/). Later releases were obtained from the ftp site of UniProtKB (ftp://ftp.uniprot.org/pub/databases/uniprot/previous_releases/). Similarly, the complete sequence and annotation data sets from UniProtKB versions were downloaded from its ftp site. Only the first release of the year was stored for years 2010 and 2011. From 2012 onward, we downloaded just releases ‘_01’ and ‘_07’ for each year because of the increase in the number of sequences per version. To calculate the Earth’s proteome data set, we used UniRef50 release 2017_01 (20 083 468 sequences). We downloaded the list of TaxID from all the known species and subspecies (lower taxa) from NCBI Taxonomy, the list of TaxID from all the fully sequenced proteomes from UniProtKB > Taxonomy and the list of TaxID from all the reference proteomes from UniProtKB > Taxonomy > Reference, on 13 February 2017. The set of TaxID from all the species and subspecies with at least one sequence in UniProtKB was extracted from UniProtKB release 2017_01.

Clustering of the SwissProt releases

To study the redundancy of the 61 SwissProt releases, we used a strategy based on a modified version of the FastaHerder2 algorithm [9]. First, the initial SwissProt release (version 9.0, 8702 proteins) is clustered using FastaHerder2 mode 1 with default parameters, and a sequence identity minimum threshold of 50%.
As a first step, that release is compared with the next one (version 11.0, 10 855 proteins) to obtain the set of sequences added from one version to the following, and similarly the set of deleted entries, which are taken out of the clusters. In a second step, FastaHerder2 mode 2 (which co-clusters a sequence with previously made clusters) clusters the added sequences with the current clusters. The updated clusters can be taken as if they were obtained by clustering version 11.0 ab initio. The two-step procedure is repeated for each version of SwissProt, using as seed the clusters for version x and clustering with them the entries added in version x + 1.

Study of the qualitative growth of a database

We selected UniProtKB releases 4.0, 7.0, 10.0, 13.0 and 15.0 as representatives of the database for the period 2005–09. In addition, the first yearly release of UniProtKB (‘_01’) from 2010 to 2017 was considered. No release was used for the year 2015, as the number of sequences grew from 53 million (2014_01) to 91 million (2015_01), and then descended to 64.4 million sequences; therefore, we did not consider the steps from 2014_01 to 2015_01 or from 2015_01 to 2016_01, and instead used the step from 2014_01 to 2016_01. We calculated the set of added entries from each data set x to the following x + 1. To compare UniProtKB version x with the following x + 1, we randomly selected 500 proteins from the data set of added entries x + 1 and performed for each of them a BLAST similarity search with default parameters (BLOSUM62 matrix, low-complexity filter off, e-value threshold ≤ 0.001) against the database x. The Rost curve [10] was used as the minimum identity threshold for considering each protein homologous to any of the proteins in the database x. We then calculated the percentage of added entries in x + 1 similar to an existing one in x. Three replicates were performed for each comparison of release x + 1 versus release x.
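The sampling procedure above can be sketched in a few lines. The block below is only an illustration under stated assumptions: the threshold function uses the HSSP-curve form commonly quoted from Rost (1999), which the text cites but does not restate, and `has_homolog` is a hypothetical callback standing in for the BLAST search against release x.

```python
import math
import random
from typing import Callable, Sequence


def rost_identity_threshold(aligned_length: int, n: float = 0.0) -> float:
    """Minimum percent identity to call homology at a given alignment length.

    Assumed form of the HSSP curve as commonly quoted from Rost (1999);
    the source only cites the curve. `n` shifts it (n = 0 is the curve itself).
    """
    if aligned_length <= 11:
        return min(100.0, n + 100.0)
    if aligned_length <= 450:
        exponent = -0.32 * (1.0 + math.exp(-aligned_length / 1000.0))
        return n + 480.0 * aligned_length ** exponent
    return n + 19.5


def estimate_redundant_fraction(
    added_entries: Sequence[str],
    has_homolog: Callable[[str], bool],  # hypothetical: wraps the similarity search
    sample_size: int = 500,
    replicates: int = 3,
    seed: int = 0,
) -> float:
    """Mean fraction of sampled added entries that have a homolog in release x."""
    if not added_entries:
        raise ValueError("no added entries to sample from")
    rng = random.Random(seed)
    fractions = []
    for _ in range(replicates):
        sample = rng.sample(list(added_entries), min(sample_size, len(added_entries)))
        redundant = sum(1 for entry in sample if has_homolog(entry))
        fractions.append(redundant / len(sample))
    return sum(fractions) / len(fractions)
```

In a real run, `has_homolog` would parse the BLAST hits of one query and return true when any hit has a percent identity above `rost_identity_threshold` for its alignment length; that parsing is left out here because it depends on the chosen output format.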
The same strategy was followed to calculate the percentage of redundancy of each database [UniRef50 2017_01, the two intermediate less-redundant databases, and a low-redundancy data set generated by us that we named pre-Unique Earth’s Proteome (preUEP); see below]. In these cases, 500 randomly selected proteins from the databases were compared against the same complete data sets to assess their uniqueness. Three replicates were also performed for each calculation. Similarly, to estimate the fraction of redundant entries added to UniProtKB per taxon, we split the entries from releases 2016_01 and 2017_01 into five data sets depending on their taxonomy (Archaea, viruses, bacteria, Eukaryota and unclassified); we also calculated the added entries from one version to the other, and repeated the splitting procedure on that set. A total of 500 randomly selected entries per replicate (three replicates) from each of these added-entries data sets were compared with the 2016_01 data sets following the same strategy.

Reduction of the protein redundancy of UniRef50

UniRef50 release 2017_01 (20 083 468 sequences) was initially split into 201 bins of 100 000 sequences. Redundancy within each bin was computed following an all-versus-all comparison. The identity threshold given by the Rost curve [10] was taken as the cutoff to consider a sequence redundant (if at least one result was above the curve) or not (no results, or results below the curve). The similarity searches were performed using BLAST default parameters (BLOSUM62 matrix, low-complexity filter off, e-value threshold ≤ 0.001). A minimum coverage length of 80% was required, similar to the one used by UniProt to generate UniRef50. After all sequences from UniRef50 were analyzed, the resulting intermediate database (12 542 332 sequences) was ordered by protein length, and the process was repeated using it as initial database. This second data set contained 8 616 716 sequences.
A third iteration was performed using a shuffled version of the second database as the initial data set. The resulting data set (preUEP) contained 8 225 772 sequences.

Hardware used for the experiments

All the experiments presented in this article were executed on a Lenovo ThinkPad 64-bit with 7.7 GB of RAM and an Intel Core i7-4600U CPU @ 2.10 GHz × 4, running Ubuntu 16.04 LTS.

Results

Change of pace in the growth of SwissProt

We analyzed the growth over time of the SwissProt database, which contains only manually curated protein sequences, from one of its first releases in 1988 (release 9.0, November 1988; 8702 sequences) to the first released version of 2017 (553 474 sequences). We computed a cumulative clustering of the corresponding 61 versions of SwissProt as follows. First, we clustered the initial version of SwissProt using a clustering method we previously developed [9] (see ‘Methods’ section for details). Then, we compared that version of the database with the next one (release 11.0, July 1989), and calculated both the deleted and the added entries from one version to the other. Deleted entries were then removed from the clusters, and new entries were either added to an existing cluster or formed a new cluster if they were different enough. The step was iterated using the next versions of the database. This strategy allowed us to replicate the results we obtained >10 years ago (until October 2004, release 45.0) [1], and to study the historical evolution of SwissProt over the past years (Figure 1), for which the increased number of entries would otherwise have made the analysis computationally expensive. The minor discrepancies observed in the estimated redundancies obtained in the two studies are due to the difference in the sequence identity threshold used for the clustering (see ‘Methods’ section).

Figure 1. Analysis of the evolution of the SwissProt database.
The number of entries and clusters is represented for all the releases in the SwissProt history (‘17) and only for the releases between 1988 and 2005 (‘07), following the study in Perez-Iratxeta et al. [1]. Redundancy of each version of the database is calculated as sequences divided by the clusters (see ‘Methods’ section for details).

The results show variable trends in the addition of entries to the database over time, in accordance with the changes in release naming. In the first period (releases 9.0–45.0, from 1988 to 2004), covered already in our previous work [1], proteins were deposited at a linearly growing rate. The similar growth in the number of clusters showed that redundancy grew in that period, which we define as an initial lag phase. The switch in the second period (releases 4.0–15.0, from 2005 to 2009) to an exponential growth in the deposition of entries came with the realization that not many of the added proteins were new, but orthologs of already known proteins. This second period was characterized by continuity in the trend of cluster growth, reaching a redundancy of around five sequences per cluster in SwissProt 2010_01. The latest period (releases 2010_01–2017_01, from 2010 to 2017) acts as a death phase in this database, in which the deposition of entries has almost stopped (only 0.17% of entries were added from 2016_07 to 2017_01).

Historical growth in size and redundancy of UniProtKB

The yearly addition of sequences to UniProtKB is happening at an ever faster rate, lately on the order of millions.
To characterize the rate of deposition of new nonredundant entries, we used one release per year from UniProtKB, in the period 2005–17. The number of deposited sequences in UniProtKB increased steadily from 2005 to 2017 (Figure 2A); the period involving release 2015_01 should not be considered, as the database was reduced drastically from 91 to 64.4 million sequences.

Figure 2. Historical evolution of the UniProtKB database. (A) Number of sequences and added entries per yearly release of UniProtKB. (B) Percentage of sampled added entries from release x redundant to release (x−1). No release was used for year 2015. Three replicates per sampling, 500 randomly selected entries per replicate.

To determine the real number of redundant entries added in the period (x−1)→x, one would have to compare every added entry from a given release x with the complete previous release x−1. Given the database size, performing these comparisons is not feasible in a reasonable time. As an approximation, we sampled the added entries for each period (x−1)→x, from 2005 to 2017, and calculated the percentage of entries redundant to those already in the database. We consider two sequences to be redundant if they are similar enough to be homologous (see ‘Methods’ section). According to this approach, the redundancy of added sequences in the period 2006–14 follows a linear trend (Figure 2B). If nothing had changed, theoretically in 2017, the entirety of added sequences would have been redundant. These results are in accordance with what we proposed in 2007 [1]: a 95% redundancy in new sequences by 2012, and 99% by 2018.
The database mode change in release 2015_01 (proteins from different strains not being considered anymore) resulted in a temporary decrease in the level of redundancy of the incoming sequences. However, the trend of increasing redundancy resumed afterward (Figure 2B). We estimate that the addition of redundant entries will asymptotically approach 100% before 2020.

An approximation to a nonredundant Earth’s proteome

More than 10 million sequences are added annually to the UniProtKB database. Most of them would be redundant, as shown in the previous section. Given that only a few new nonredundant sequences are predicted or discovered per year, following the current trend, we can assume that the increase of unique protein sequences in the database will soon stop. This entails that, in theory, we could generate a data set including one representative sequence per distinct protein on Earth, or at least estimate the size of such a nonredundant data set. To our knowledge, the only actual way to do that would be by performing an exhaustive all-versus-all strategy, in which every sequence would be compared with the rest of the sequences in the database. Then, only one sequence per distinct protein would be kept. But there are computational and time limitations that make it impossible to compare 74.2 million sequences with the same 74.2 million sequences (UniProtKB release 2017_01). Considering just the UniRef50 database, a full analysis would require $(2 \times 10^{7}) \times (2 \times 10^{7}) = 4 \times 10^{14}$ comparisons. To estimate the execution time for such an analysis, we ran 500 independent searches using random proteins as queries against UniRef50 version 2017_01 as database, and recorded the execution time of each search (data not shown). We obtained a median of 68 s per search, which would mean ∼43 years [$68\ \mathrm{s} \times (2 \times 10^{7})$ searches] to compare each sequence from UniRef50 against the whole database. To overcome such limitations, we followed a simplified all-versus-all strategy (Figure 3A).
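The cost estimate above is simple arithmetic and can be checked directly. The short sketch below uses the rounded figures from the text (2 × 10^7 sequences, a median of 68 s per search); the function name and the 365-day year are choices made here for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year


def all_vs_all_cost(n_sequences: int, seconds_per_search: float):
    """Return (pairwise comparisons, serial run time in years).

    One search compares a single query against the whole database, so the
    run time scales with n_sequences while comparisons scale with its square.
    """
    comparisons = n_sequences * n_sequences
    total_seconds = n_sequences * seconds_per_search
    return comparisons, total_seconds / SECONDS_PER_YEAR


comparisons, years = all_vs_all_cost(2 * 10**7, 68.0)
print(f"{comparisons:.0e} comparisons, about {years:.0f} years on one machine")
```

The result is a serial estimate on a single machine; running searches in parallel would divide the wall-clock time but not the total amount of computation.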
Figure 3. Redundancy reduction of the UniRef50 database. (A) Pipeline followed to reduce the redundancy of the UniRef50 database, to generate a preUEP database. (B) Number of sequences per database (millions), both initial and last data set, plus the two intermediate ones created while iterating the strategy in (A). (C) Estimation of fraction of redundant entries per data set. (D) Number of sequences (millions) for the ETM of unique proteins on Earth calculated from the redundancy of each data set.

UniRef50 is generated from the clustering of the UniProtKB database. In this clustering, the sequences in each cluster have at least 50% sequence identity to the longest sequence in it. We take this database as a proxy to start the reduction of the UniProtKB database, as it already is a simplified version of it. The cluster representatives then have <50% identity between themselves. UniRef50 was split into 201 bins of 100 000 sequences, and an all-versus-all strategy was performed in each bin separately (Figure 3A) (see ‘Methods’ section for details). The 201 intermediate result data sets were joined to form an intermediate database (Intermediate1), with 12 542 332 sequences (a 37.55% compression compared with the initial 20 083 468 proteins) (Figure 3B).
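One pass of the binned strategy can be sketched as follows. This is a minimal illustration, not the pipeline itself: `is_redundant_pair` is a hypothetical predicate standing in for the BLAST search with the Rost-curve and 80% coverage criteria, and the greedy rule for choosing which member of a redundant pair to keep is an assumption, since the text does not specify it.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def split_into_bins(items: Sequence[T], bin_size: int) -> List[Sequence[T]]:
    """Cut the data set into consecutive bins of at most bin_size items."""
    return [items[i:i + bin_size] for i in range(0, len(items), bin_size)]


def reduce_bin(bin_items: Sequence[T],
               is_redundant_pair: Callable[[T, T], bool]) -> List[T]:
    """Greedy all-versus-all inside one bin: keep an item only if it is not
    redundant with any item already kept (assumed selection rule)."""
    kept: List[T] = []
    for candidate in bin_items:
        if not any(is_redundant_pair(candidate, k) for k in kept):
            kept.append(candidate)
    return kept


def reduce_once(items: Sequence[T], bin_size: int,
                is_redundant_pair: Callable[[T, T], bool]) -> List[T]:
    """One iteration: reduce every bin separately, then join the results."""
    reduced: List[T] = []
    for one_bin in split_into_bins(items, bin_size):
        reduced.extend(reduce_bin(one_bin, is_redundant_pair))
    return reduced


# Toy data: two "proteins" count as redundant when they share a first letter.
same_family = lambda a, b: a[0] == b[0]
toy = ["aa", "ab", "ba", "ac", "bb", "ca"]
first_pass = reduce_once(toy, 3, same_family)
second_pass = reduce_once(sorted(first_pass), 3, same_family)
```

In the toy run, the first pass cannot remove "ac" or "bb" because their relatives sit in another bin; reordering the data before the second pass brings some of those pairs into the same bin, which is why each iteration starts from a differently ordered data set.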
A second iteration of the same procedure was performed, using as initial data set an ordered-by-length version of the Intermediate1 database. This database was then further reduced to an Intermediate2 database, with 8 616 716 sequences (31.30% compressed compared with the Intermediate1 data set). As a last iteration, a randomly shuffled version of the Intermediate2 database was used as the initial data set. In this case, the compression achieved was low (4.54%), and a data set called preUEP with 8 225 772 sequences was generated (Figure 3B). The simplified all-versus-all strategy using bins was no longer able to reduce the redundancy of the intermediate data sets. To calculate the Unique Earth’s Proteome (UEP) data set, one would have to perform comparisons of all the proteins in the preUEP database between themselves, which we are still not capable of doing because of computational and time limitations. The generated data sets can be downloaded from a dedicated Web page: http://cbdm-01.zdv.uni-mainz.de/∼munoz/uep/. Although we cannot produce the ultimate UEP data set, we can approximate the estimated theoretical minimum (ETM) of unique proteins on Earth by studying the redundancy of the data sets we are working with. For each of them, we compared 500 randomly selected proteins from the database against it as a whole, and assessed their uniqueness (using three replicates; see ‘Methods’ section for details). As expected, the larger the database, the more redundant it is (Figure 3C). Considering both the redundancy for each database and its size, we can produce an estimation of the ETM of proteins on Earth per data set (Figure 3D). Given the 8 225 772 sequences in the preUEP data set (Figure 3B), and considering that we calculated its redundancy to be 54.27% (Figure 3C), the estimated ETM would be ∼3.76 million sequences. 
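The ETM calculation can be reproduced with one line of arithmetic. The formula itself is not reproduced in this text, so the sketch below assumes the simplest reading, data set size multiplied by the non-redundant fraction, which matches the reported ∼3.76 million for preUEP; the function name is hypothetical.

```python
def estimated_theoretical_minimum(n_sequences: int, redundant_fraction: float) -> float:
    """Assumed ETM formula: sequences expected to remain once the sampled
    redundant fraction is removed from a data set of n_sequences."""
    if not 0.0 <= redundant_fraction <= 1.0:
        raise ValueError("redundant_fraction must be between 0 and 1")
    return n_sequences * (1.0 - redundant_fraction)


# preUEP: 8 225 772 sequences with an estimated redundancy of 54.27%.
etm_preuep = estimated_theoretical_minimum(8_225_772, 0.5427)
print(f"ETM from preUEP: {etm_preuep / 1e6:.2f} million sequences")
```

Because the redundant fraction comes from samples of 500 proteins, the estimate carries sampling error, which is one reason to compare the values obtained from the different data sets rather than rely on a single one.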
It is important to note that, while the calculated ETM for the different databases is similar, we believe the real ETM value to be closer to the result given by the smaller databases (Intermediate2 and preUEP). Ten years ago, we proposed a value of 5 million for the total number of distinct proteins on Earth [1]. Based on our current results, we can estimate a lower value, of around 3.75 million different proteins in the proposed theoretical UEP data set. It is expected that every protein in the UniProtKB database would be homologous to at most one protein in the UEP.

Study of known proteins and database growth by taxa

Sequencing projects are often biased toward specific branches of the tree of life, mostly populated by similar species or model organisms [11, 12]. Regarding the number of species, it is widely accepted that there is still an indefinite number of organisms to be discovered and characterized [13, 14]. Based on the currently described species, it is crucial to detect and to describe sequencing voids. It is necessary to carefully devise a thorough strategy for future sequencing projects to level those proteome gaps, to maximize the number of nonredundant entries that would be added to the protein databases. As of February 2017, in the Taxonomy resource from NCBI, there were 1 434 345 species and lower taxa characterized. There is an asymmetry in the number of the described species, as eukaryotes and bacteria sum up to >87% of the total (Figure 4A). This number is influenced both by the large number of characterized insects [15] and by the popularity of metagenomic studies [16], respectively. More than half of those known eukaryotic species have at least one protein entry in UniProtKB (Figure 4B). On the other hand, only around a fifth of the bacterial proteomes are represented in UniProtKB. In any case, bacterial and viral proteomes make up almost all the currently fully sequenced proteomes, given their small genome size.
Archaeal genomes are also small, but they have been described in a much more limited number, as they are typically more difficult to culture and study in vitro [17]. Reference proteomes are chosen by UniProtKB as an indicator of well-studied model organisms and other species of interest; model organisms do not reflect directly the distribution of full proteomes, as there are many more eukaryotes in proportion than viruses. An overview of the knowledge status per taxon (Figure 4B) confirms that while viruses are the best covered taxon (47.5% of the known species are fully sequenced, and for 93.1% of them, there is at least one sequence in the database), the archaeal taxon is largely unexplored, as it was 10 years ago [1].

Figure 4. Taxonomic distribution of known, partially sequenced, fully sequenced and reference proteomes. (A) Comparison of the proportion of known, full, partial and reference proteomes, distributed per taxa. (B) Number of proteomes per taxa distributed depending on their status.

To characterize the fraction of redundant entries added per taxon to the database in the past year, we followed a similar procedure to the one used above to describe the evolution of the redundancy in UniProtKB (see ‘Methods’ section for details). Of the 74 265 355 entries in UniProtKB release 2017_01, bacterial proteins make up >63% of the total (Figure 5A). It is then more difficult to describe new proteins for this taxon than for any of the others; the sampling strategy shows that 96.93% of the bacterial entries added since release 2016_01 are redundant (Figure 5B).
Similarly, newly added viral entries tend to be redundant (94.07%) because almost half of the known viruses in nature have already been fully sequenced (Figure 4B). Eukaryotic entries behave differently because their proteomes are far less well covered than those from the rest of the taxa (Figure 4B), and only 90.13% of the added entries were considered redundant. Apart from archaeal, viral, bacterial and eukaryotic entries, UniProtKB contains >1 million unclassified entries (usually from uncultured organisms, like the ones from the recently described Asgard superphylum [18]), which are more likely to be new (Figure 5B). Collectively, our results suggest that Eukarya and Archaea taxa need more sequencing.

Figure 5. Distinctive contribution of entries based on their taxonomy to the UniProtKB redundancy. (A) Proportion of entries in UniProtKB 2017_01 per taxa. (B) Percentage of sampled added entries from UniProtKB release 2017_01 similar to an existing entry in UniProtKB release 2016_01, divided by taxa. Three replicates per taxa, 500 randomly chosen entries per replicate.

Conclusions

The spread in the past years of a multitude of sequencing techniques is driving the protein databases to a continuously accelerating increase in size. This dynamic challenges the development, maintenance and use of the protein databases and requires a reaction. On the other hand, the new sequences entering the databases are not produced to fill our gaps in biological knowledge in any rational manner.
In our opinion, the work we presented here, especially when put in perspective of our previous analysis from a decade ago, suggests that proper reactions to organize the protein database and rationalize its growth did not materialize. Based on our results, we propose a small number of simple directives to avoid the course of the sequence databases toward an information catastrophe and to efficiently guide sequencing efforts toward organisms holding the last unknown protein functions. In this project, we have resumed the characterization of the growth in protein diversity in the protein sequence databases over the past 10 years, retaking the story where we left it [1]. On the one hand, we analyzed the SwissProt database, which has a particular dynamic, given its high level of curation, and contains a small fraction of all known protein sequences. On the other hand, the TrEMBL/UniProtKB database includes mostly larger numbers of automatically predicted protein sequences and in consequence behaves in a dramatically different way than SwissProt (Figure 1). The automatic prediction of sequences does not require any curation, thus simplifying the process of creation of new entries. We have shown that most of the newly added entries are redundant, and that this trend is still growing. We propose that a resource to give structure to the sequence database is badly needed: mostly to avoid the large amount of sequences similar to each other in the lists of hits resulting from most sequence similarity searches, which currently grow with each new version of the database and can obscure interesting results. This resource would be built using a strategy purely dependent on sequence comparison. We would not use protein features like function, which derives from sequence, because the correspondence between sequence similarity and function similarity is extremely variable across protein families and thus combining them to cluster sequences would not be helpful [19]. 
For example, pseudokinases are catalytically deficient counterparts of active kinases with inhibitory functions; while they have highly similar sequences, they have opposite functions [20]. This is not a rare case: 10% of human kinases are pseudokinases [21]. In any case, we acknowledge the importance of contrasting such a database organized purely by sequence with functional annotations; annotating these clusters is a necessary subsequent step. We envision this resource as a data set of proteins, where no two proteins would be homologous to each other for a large part of their sequences. Operationally, novel sequences would only need to be compared with this data set to decide if they would enlarge the data set or not. We predict that, given the evolution of the sequence databases, if such a data set were generated today, it would not grow much further. However, we cannot currently handle the computational effort to produce this data set in reasonable time; thus, we present a pipeline to produce it in incremental steps, and provide an incomplete reduced set of sequences as an intermediate step. In more detail, through a simple pipeline based on sequence identity, we reduced the number of unique proteins in the UniRef50 database from 20 million to 8.2 million (a 59% reduction); this number could be further reduced using a time-consuming all-versus-all strategy. Taking the complete UniProtKB database release 2017_01 as the initial reference, the reduction rises to 89%. This non-exhaustive strategy, in addition to the calculation of redundancy of these databases, leads us to estimate the total number of distinct proteins on Earth at around 3.75 million. We propose to make a collaborative effort to perform the all-versus-all strategy over our 8.2 million preUEP data set to simplify it and get the real UEP data set, from which we would all benefit.
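The reduction percentages quoted in this article follow directly from the data set sizes it reports. The sketch below recomputes them; the helper name and the dictionary layout are illustrative choices, not part of the original pipeline.

```python
def percent_reduction(before: int, after: int) -> float:
    """Percentage of entries removed when going from `before` to `after`."""
    return 100.0 * (1.0 - after / before)


# Data set sizes as reported in the text (number of sequences).
SIZES = {
    "UniProtKB 2017_01": 74_265_355,
    "UniRef50 2017_01": 20_083_468,
    "Intermediate1": 12_542_332,
    "Intermediate2": 8_616_716,
    "preUEP": 8_225_772,
}

per_iteration = [
    percent_reduction(SIZES["UniRef50 2017_01"], SIZES["Intermediate1"]),
    percent_reduction(SIZES["Intermediate1"], SIZES["Intermediate2"]),
    percent_reduction(SIZES["Intermediate2"], SIZES["preUEP"]),
]
from_uniref50 = percent_reduction(SIZES["UniRef50 2017_01"], SIZES["preUEP"])
from_uniprotkb = percent_reduction(SIZES["UniProtKB 2017_01"], SIZES["preUEP"])
```

The shrinking per-iteration values show the diminishing returns of the binned approach, which is the practical argument for a final exhaustive comparison over preUEP.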
For example, comparing the UEP with a sequenced genome would result in an automatic and fast proteome annotation. Furthermore, given an unknown protein, one could easily obtain its homologous protein from the UEP with a similarity search. Were the complete UEP data set manually curated and thoroughly annotated (a second-phase task), it would simplify the prediction and annotation of protein sequences. Our proposed way to start tackling the issue of reducing the redundancy of the UniProtKB database is only one of several that could be followed. A different approach could be based on profile models that summarize the diversity of the different protein families. It could function in a similar way to current protein domain databases, but taking full proteins as entities. Furthermore, this database of profile models could be developed once the full UEP was generated; each sequence in the UEP would serve as the representative of a protein family, to which all known proteins similar to it would be attached. The profile models would then be derived from them. One can even think of taxa-specific profiles, to describe the diversity of a protein family at different taxonomic levels. Sequencing projects normally proceed toward organisms with a health or economic benefit and, even if sequencing is cheap, it is difficult to convince a laboratory to attempt isolating and sequencing an organism just because it contains many novel proteins. We propose that funding agencies should guide this effort with specially allocated funds. We hope that our efforts will make the community of genomic researchers, database developers and related funding agencies aware of the current status of protein and protein function knowledge, spurring novel strategies to choose species for sequencing based on phylogenetic analyses of the database. The work presented here should be used as a guide to speed up the last steps in ending the frontier era of protein research.
Key Points

The novelty of the newly deposited sequences in the protein databases is decreasing over the years.
We predict that before 2020, almost all entries deposited in UniProtKB will be homologous to known proteins.
We estimate the size of the Earth’s proteome to be 3.75 million sequences.
Eukarya and Archaea are the taxa that are probably holding most of the remaining unknown proteins.
Sequencing voids can only be covered with encouragement from funding agencies.

Funding

This work has been supported with funds from the Center for Computational Sciences Mainz (CSM, Johannes Gutenberg University of Mainz, Germany).

Pablo Mier is a postdoctoral researcher interested in the development of Web tools and databases related to protein evolution and low-complexity regions. He works in the Faculty of Biology at Johannes Gutenberg University Mainz.

Miguel A. Andrade-Navarro is a professor of Faculty of Biology, at the Johannes Gutenberg University of Mainz. His group (‘Computational Biology and Data Mining’) is interested in exploring gene function using computational techniques including algorithms and databases.

References

1 Perez-Iratxeta C, Palidwor G, Andrade-Navarro MA. Towards completion of the Earth's proteome. EMBO Rep 2007;8(12):1135–41.
2 NCBI Resource Coordinators. Database resources of the national center for biotechnology information. Nucleic Acids Res 2017;45(D1):D12–7.
3 The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res 2017;45(D1):D158–69.
4 Mukherjee S, Stamatis D, Bertsch J, et al. Genomes OnLine Database (GOLD) v.6: data updates and feature enhancements. Nucleic Acids Res 2017;45(D1):D446–56.
5 Bursteinas B, Britto R, Bely B, et al. Minimizing proteome redundancy in the UniProt Knowledgebase. Database 2016;2016:baw139.
6 Chen Q, Zobel J, Verspoor K. Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study. Database 2017;2017:baw163.
7 Suzek BE, Wang Y, Huang H, et al. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 2015;31(6):926–32.
8 Mirdita M, von den Driesch L, Galiez C, et al. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res 2017;45(D1):D170–6.
9 Mier P, Andrade-Navarro MA. FastaHerder2: four ways to research protein function and evolution with clustering and clustered databases. J Comput Biol 2016;23(4):270–8.
10 Rost B. Twilight zone of protein sequence alignments. Protein Eng 1999;12(2):85–94.
11 Land M, Hauser L, Jun SR, et al. Insights from 20 years of bacterial genome sequencing. Funct Integr Genomics 2015;15(2):141–61.
12 del Campo J, Sieracki ME, Molestina R, et al. The others: our biased perspective of eukaryotic genomes. Trends Ecol Evol 2014;29(5):252–9.
13 Mora C, Tittensor DP, Adl S, et al. How many species are there on Earth and in the ocean? PLoS Biol 2011;9(8):e1001127.
14 Strain D. Biodiversity. 8.7 million: a new estimate for all the complex species on Earth. Science 2011;333(6046):1083.
15 Stork NE, McBroom J, Gely C, et al. New approaches narrow global species estimates for beetles, insects, and terrestrial arthropods. Proc Natl Acad Sci USA 2015;112(24):7519–23.
16 Roumpeka DD, Wallace RJ, Escalettes F, et al. A review of bioinformatics tools for bio-prospecting from metagenomic sequence data. Front Genet 2017;8:23.
17 Cowan DA, Ramond JB, Makhalanyane TP, et al. Metagenomics of extreme environments. Curr Opin Microbiol 2015;25:97–102.
18 Zaremba-Niedzwiedzka K, Caceres EF, Saw JH, et al. Asgard archaea illuminate the origin of eukaryotic cellular complexity. Nature 2017;541(7637):353–8.
19 Devos D, Valencia A. Practical limits of function prediction. Proteins 2000;41(1):98–107.
20 Jacobsen AV, Murphy JM. The secret life of kinases: insights into non-catalytic signaling functions from pseudokinases. Biochem Soc Trans 2017;45(3):665–81.
21 Boudeau J, Miranda-Saavedra D, Barton GJ, et al. Emerging roles of pseudokinases. Trends Cell Biol 2006;16(9):443–52.

© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

Briefings in Bioinformatics, Oxford University Press

Toward completion of the Earth’s proteome: an update a decade later

Publisher
Oxford University Press
Copyright
© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
ISSN
1467-5463
eISSN
1477-4054
D.O.I.
10.1093/bib/bbx127

Abstract

Protein databases are steadily growing, driven by the spread of new, more efficient sequencing techniques. This growth is dominated by an increase in redundancy (homologous proteins with various degrees of sequence similarity) and by the inability to process and curate sequence entries as fast as they are created. To understand these trends and aid bioinformatic resources that might be compromised by the increasing size of the protein sequence databases, we have created a less-redundant protein data set. In parallel, we analyzed the evolution of protein sequence databases in terms of size and redundancy. While the SwissProt database has decelerated its growth, mostly because of a focus on increasing the level of annotation of its sequences, its counterpart TrEMBL, much less limited by curation steps, is still in a phase of accelerated growth. However, we predict that before 2020, almost all entries deposited in UniProtKB will be homologous to known proteins. We propose that new sequencing projects can be made more useful if they are driven to sequencing voids, parts of the tree of life far from already sequenced species or model organisms. We show these voids are present in the Archaea and Eukarya domains of life. The approach to the certainty of the redundancy of new protein sequence entries leads to the consideration that most of the protein diversity on Earth has already been described, which we estimate to be around 3.75 million proteins, revising down the prediction we made a decade ago.

Keywords: protein databases, sequencing projects

Introduction

As our group already reported 10 years ago [1], while protein sequence databases such as Entrez [2] and UniProtKB [3] grow at increasing speed, their increase in novelty, understood as the rate at which new entries deposited in these databases are nonhomologous to already known entries, decreases.
Our analysis allowed us to estimate that by 2018 all proteins on Earth, which we estimated to be 5 million, would be known. Here, we update this study using new data. The sequencing capability has increased since then without a parallel increase in new sequences, as we predicted. Ten years ago, complete genomes were starting to be sequenced as a standard procedure (148 genome projects were completed in 2007) [4]. Since then, sequencing strategies keep becoming faster and more affordable, which results in increasing numbers of genomes being sequenced each year (just in 2016, 1903 genomes were completely sequenced, and there were 28 377 genomes close to completion) [4]. The related processes of identifying coding genes, annotating genomic regions and predicting novel functions are also being backed up computationally with an increasingly larger myriad of Web tools and automatic programs that generate more data than humans could possibly curate. The desire for new biological knowledge collides with the demands for a responsible storage and curation of the data. In 2015, UniProtKB retired almost 47 million entries by removing strains of the same bacterial species with highly redundant proteomes [5]. However, apart from this type of redundancy, sequence databases also have protein redundancy, in which several entries for the same protein are stored in different records (e.g. UniProtKB:Q9BZZ5, UniProtKB:G3V1C3 and UniProtKB:H0YER7, for human protein API5). One could also argue that orthologous proteins are redundant as they are in fact the same protein in different organisms, or even that homologous proteins are redundant. Redundancy of protein sequences is thus a relative concept. Redundancy is not necessarily a bad feature for a sequence database; it offers valuable information in relation to sequence variability and helps to establish solid differences between taxa. 
But redundancy also introduces noise in sequence similarity searches, which negatively affects the automatic and manual assessment of the results [6]. A possible solution to control the redundancy of the protein sequence entries is the creation of databases such as UniRef [7] and Uniclust [8], in which sequences are clustered together based on predefined levels of identity. While this strategy helps to structure the database and thus reduce the number of entries to deal with, it does not address in general the fundamental issue of the repetition of sequence information, which is independent of a fixed level of identity. Here, we take a first step toward the creation of a data set of unique proteins (none of them similar over its full length to any other) representing the minimum data set covering all the protein diversity on Earth. To generate this low-redundancy protein data set, we simplified UniProtKB from >74.2 million sequences to 8.2 million, using a pipeline based on a non-exhaustive all-versus-all strategy. Because the strategy is non-exhaustive, the redundancy of the resulting data set is not completely eliminated. Our data set might be used as a step toward the generation of the nonredundant data set. Once a nonredundant data set of unique proteins is created, new proteins could be compared with it, and only the sequences without a similar protein would be added to the data set. This would constitute a meaningful protein database with a sustainable growth. In parallel to generating this data set, we studied the way UniProtKB has evolved historically, and how it has responded to the caveats about protein and proteome redundancy throughout the years. The study of the redundancy of our reduced protein data set, along with two intermediate ones created during the process of reduction of the initial database, allows us to estimate the number of unique proteins on Earth.
Finally, we describe the current sequencing voids based on the known and sequenced species from the three domains of life and viruses, to lead future sequencing projects to level those sequencing gaps.

Methods

Data retrieval

Old SwissProt data sets from release 9.0 (November 1988) to release 45.0 (October 2004) were downloaded from the ftp site of ExPASy (ftp://ftp.expasy.org/databases/swiss-prot/sw_old_releases/). Later releases were obtained from the ftp site of UniProtKB (ftp://ftp.uniprot.org/pub/databases/uniprot/previous_releases/). Similarly, the complete sequence and annotation data sets from UniProtKB versions were downloaded from its ftp site. Only the first release of the year was stored for years 2010 and 2011. From 2012 onward, we downloaded just releases ‘_01’ and ‘_07’ for each year because of the increase in the number of sequences per version. To calculate the Earth’s proteome data set, we used UniRef50 release 2017_01 (20 083 468 sequences). We downloaded the list of TaxID from all the known species and subspecies (lower taxa) from NCBI Taxonomy, the list of TaxID from all the fully sequenced proteomes from UniProtKB > Taxonomy and the list of TaxID from all the reference proteomes from UniProtKB > Taxonomy > Reference, on 13 February 2017. The set of TaxID from all the species and subspecies with at least one sequence in UniProtKB was extracted from UniProtKB release 2017_01.

Clustering of the SwissProt releases

To study the redundancy of the 61 SwissProt releases, we used a strategy based on a modified version of the FastaHerder2 algorithm [9]. First, the initial SwissProt release (version 9.0, 8702 proteins) is clustered using FastaHerder2 mode 1 with default parameters, and a sequence identity minimum threshold of 50%.
As a first step, that release is compared with the next one (version 11.0, 10 855 proteins) to obtain the set of sequences added from one version to the next, and similarly the set of deleted entries, which are taken out of the clusters. In a second step, FastaHerder2 mode 2 (which co-clusters a sequence with previously made clusters) clusters the added sequences with the current clusters. The updated clusters can be taken as if they were obtained by clustering version 11.0 ab initio. The two-step procedure is repeated for each version of SwissProt, using as seed the clusters for version x and clustering with them the entries added in version x + 1.

Study of the qualitative growth of a database

We selected UniProtKB releases 4.0, 7.0, 10.0, 13.0 and 15.0 as representatives of the database for the period 2005–09. In addition, the first yearly release of UniProtKB (‘_01’) from 2010 to 2017 was considered. No release was used for the year 2015, as the number of sequences grew from 53 million (2014_01) to 91 million (2015_01), and then dropped to 64.4 million sequences; therefore, we did not consider the steps from 2014_01 to 2015_01 or from 2015_01 to 2016_01, and instead used the step from 2014_01 to 2016_01. We calculated the set of added entries from each data set x to the following x + 1. To compare UniProtKB version x with the following x + 1, we randomly selected 500 proteins from the data set of added entries x + 1 and performed for each of them a BLAST similarity search with default parameters (BLOSUM62 matrix, low-complexity filter off, e-value threshold ≤ 0.001) against the database x. The Rost curve [10] was used as the minimum identity threshold to consider each protein homologous to any of the proteins in the database x. We then calculated the percentage of added entries in x + 1 similar to an existing one in x. Three replicates were performed for each comparison release x + 1 versus release x.
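The sampling estimate described above can be sketched as follows. This is a minimal illustration, not the authors' code: `rost_threshold` assumes the length-dependent identity curve commonly attributed to Rost (1999) with offset n = 0, and `has_homolog` is a hypothetical callable standing in for a BLAST search against release x followed by the threshold check.

```python
import math
import random
from typing import Callable, Sequence


def rost_threshold(alignment_length: int) -> float:
    """Minimum % identity to call homology at a given alignment length.

    Assumption: the HSSP-style curve attributed to Rost (1999), offset n = 0.
    """
    if alignment_length <= 11:
        return 100.0
    if alignment_length > 450:
        return 19.5
    exponent = -0.32 * (1.0 + math.exp(-alignment_length / 1000.0))
    return 480.0 * alignment_length ** exponent


def estimate_redundant_fraction(
    added_ids: Sequence,
    has_homolog: Callable[[object], bool],
    sample_size: int = 500,
    replicates: int = 3,
    seed: int = 0,
) -> float:
    """Mean fraction of sampled added entries that have a homolog in release x."""
    rng = random.Random(seed)
    pool = list(added_ids)
    fractions = []
    for _ in range(replicates):
        # Draw one replicate of randomly selected added entries.
        sample = rng.sample(pool, min(sample_size, len(pool)))
        hits = sum(1 for entry in sample if has_homolog(entry))
        fractions.append(hits / len(sample))
    return sum(fractions) / len(fractions)
```

In practice `has_homolog` would wrap the similarity search and return True when at least one hit lies above `rost_threshold` for its alignment length.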
The same strategy was followed to calculate the percentage of redundancy of each database [UniRef50 2017_01, the two intermediate less-redundant databases, and a low-redundancy data set generated by us that we named pre-Unique Earth’s Proteome (preUEP); see below]. In these cases, 500 randomly selected proteins from the databases were compared against the same complete data sets to assess their uniqueness. Three replicates were also performed for each calculation. Similarly, to estimate the fraction of redundant entries added to UniProtKB per taxa, we split the entries from releases 2016_01 and 2017_01 into five data sets depending on their taxonomy (Archaea, viruses, bacteria, Eukaryota and unclassified); we also calculated the added entries from one version to the other one, and repeated the splitting procedure of the data sets. A total of 500 randomly selected entries per replicate (three replicates) from each of these added entries data sets were compared with the 2016_01 data sets following the same strategy.

Reduction of the protein redundancy of UniRef50

UniRef50 release 2017_01 (20 083 468 sequences) was initially split into 201 bins of 100 000 sequences. Redundancy within each bin was computed following an all-versus-all comparison. The identity threshold given by the Rost curve [10] was taken as the cutoff to consider a sequence redundant (if at least one result was above the curve) or not (no results, or results below the curve). The similarity searches were performed using BLAST default parameters (BLOSUM62 matrix, low-complexity filter off, e-value threshold ≤ 0.001). A minimum coverage length of 80% was required, similar to the one used by UniProt to generate UniRef50. After all sequences from UniRef50 were analyzed, the resulting intermediate database (12 542 332 sequences) was ordered by protein length, and the process was repeated using it as initial database. This second data set contained 8 616 716 sequences.
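The bin-wise reduction described above can be sketched on toy data. This is an illustration under stated assumptions rather than the authors' pipeline: the greedy rule (keep a sequence only if it is not similar to one already kept in the same bin) is an assumption about how redundant sequences are dropped, and `same_first_letter` is a hypothetical stand-in for the BLAST comparison with the Rost cutoff and 80% coverage.

```python
from typing import Callable, List


def reduce_in_bins(
    sequences: List[str],
    bin_size: int,
    is_similar: Callable[[str, str], bool],
) -> List[str]:
    """Remove redundancy inside consecutive bins only (non-exhaustive)."""
    kept_all: List[str] = []
    for start in range(0, len(sequences), bin_size):
        kept_in_bin: List[str] = []
        for seq in sequences[start:start + bin_size]:
            # Assumed greedy rule: drop seq if a kept sequence in this bin is similar.
            if not any(is_similar(seq, other) for other in kept_in_bin):
                kept_in_bin.append(seq)
        kept_all.extend(kept_in_bin)
    return kept_all


def same_first_letter(a: str, b: str) -> bool:
    """Toy similarity test used only for this illustration."""
    return a[0] == b[0]


# Similar sequences that fall into different bins are not detected in one pass.
first_pass = reduce_in_bins(["aa", "ba", "ab", "bb"], 2, same_first_letter)
# Reordering before the next iteration brings them into the same bin.
second_pass = reduce_in_bins(sorted(first_pass), 2, same_first_letter)
print(first_pass, second_pass)
```

The toy run shows why the procedure is iterated on a reordered data set: a single pass only sees redundancy between sequences that share a bin.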
A third iteration was performed using a shuffled version of the second database as the initial data set. The resulting data set (preUEP) contained 8 225 772 sequences.

Hardware used for the experiments

All the experiments presented in this article were executed on a Lenovo ThinkPad 64-bit with 7.7 GB of RAM and an Intel Core i7-4600U CPU @ 2.10 GHz × 4, running Ubuntu 16.04 LTS.

Results

Change of pace in the growth of SwissProt

We analyzed the growth over time of the SwissProt database, which contains only manually curated protein sequences, from one of its first releases in 1988 (release 9.0, November 1988; 8702 sequences) to the first released version of 2017 (553 474 sequences). We computed a cumulative clustering of the corresponding 61 versions of SwissProt as follows. First, we clustered the initial version of SwissProt using a clustering method we previously developed [9] (see ‘Methods’ section for details). Then, we compared that version of the database with the next one (release 11.0, July 1989), and calculated both the deleted and the added entries from one version to the other. Deleted entries were then removed from the clusters, and new entries were either added to an existing cluster or formed a new cluster if they were different enough. The step was iterated using the next versions of the database. This strategy allowed us to replicate the results we obtained >10 years ago (until October 2004, release 45.0) [1], and to study the historical evolution of SwissProt over the past years (Figure 1), when the increased number of entries would otherwise have made the clustering computationally expensive. The minor discrepancies observed in the estimated redundancies obtained in the two studies are due to the difference in the sequence identity threshold used for the clustering (see ‘Methods’ section).

Figure 1. Analysis of the evolution of the SwissProt database. The number of entries and clusters is represented for all the releases in the SwissProt history (‘17) and only for the releases between 1988 and 2005 (‘07), following the study in Perez-Iratxeta et al. [1]. Redundancy of each version of the database is calculated as sequences divided by the clusters (see ‘Methods’ section for details).

The results show variable trends in the addition of entries to the database over time, in accordance with the changes in release naming. In the first period (releases 9.0–45.0, from 1988 to 2004), covered already in our previous work [1], proteins were deposited at a linearly growing rate. The similar growth in the number of clusters evidenced that redundancy grew in that period, which we define as an initial lag phase. The switch in the second period (releases 4.0–15.0, from 2005 to 2009) to an exponential growth in the deposition of entries came with the realization that not many of the added proteins were new, but orthologs of already known proteins. This second period was characterized by continuity in the trend of cluster growth, reaching a redundancy of around five sequences per cluster in SwissProt 2010_01. The latest period (releases 2010_01–2017_01, from 2010 to 2017) is acting as a death phase in this database, in which the deposition of entries has almost stopped (0.17% of entries were added from 2016_07 to 2017_01).

Historical growth in size and redundancy of UniProtKB

The yearly addition of sequences to UniProtKB is happening at an increasingly fast rate, lately on the order of millions.
To characterize the rate of deposition of new nonredundant entries, we used one release per year from UniProtKB, in the period 2005–17. The number of deposited sequences in UniProtKB has increased throughout the period 2005–17 (Figure 2A); the period involving release 2015_01 should not be considered, as the database was reduced drastically from 91 to 64.4 million sequences.

Figure 2. Historical evolution of the UniProtKB database. (A) Number of sequences and added entries per yearly release of UniProtKB. (B) Percentage of sampled added entries from release x redundant to release (x−1). No release was used for year 2015. Three replicates per sampling, 500 randomly selected entries per replicate.

One would have to compare every added entry from a given release x with the complete previous release x−1 to determine the real number of redundant entries added in the period (x−1)→x. Given the database size, performing these comparisons is not viable time-wise. As an approximation, we sampled the added entries for each period (x−1)→x, from 2005 to 2017, and calculated the percentage of entries redundant to those already in the database. We consider two sequences to be redundant if they are similar enough to be homologous (see ‘Methods’ section). According to this approach, the redundancy of added sequences in the period 2006–14 follows a linear trend (Figure 2B). If nothing had changed, theoretically in 2017, the entirety of added sequences would have been redundant. Results are in accordance with what we proposed in 2007 [1]: a 95% redundancy in new sequences by 2012, and 99% by 2018.
The database mode change in release 2015_01 (proteins from different strains not being considered anymore) resulted in a temporary decrease in the level of redundancy of the incoming sequences. However, the trend of increasing redundancy resumed afterward (Figure 2B). We estimate that the addition of redundant entries will asymptotically approach 100% before 2020.

An approximation to a nonredundant Earth’s proteome

More than 10 million sequences are added annually to the UniProtKB database. Most of them would be redundant, as described in the previous section. Given that only a few new nonredundant sequences are predicted or discovered per year, following the current trend, we can assume that the increase of unique protein sequences in the database will soon stop. This entails that, in theory, we could generate a data set including one representative sequence per distinct protein on Earth, or at least estimate the size of such a nonredundant data set. To our knowledge, the only actual way to do that would be by performing an exhaustive all-versus-all strategy, in which every sequence would be compared with the rest of the sequences in the database. Then, only one sequence per distinct protein would be kept. But there are computational and time limitations that make it impossible to compare 74.2 million sequences with the same 74.2 million sequences (UniProtKB release 2017_01). Considering just the UniRef50 database, a full analysis would require $(2 \times 10^{7}) \times (2 \times 10^{7}) = 4 \times 10^{14}$ comparisons. To estimate the execution time for such an analysis, we computed 500 independent executions using random proteins as queries versus UniRef50 version 2017_01 as database, and calculated the execution time for each search (data not shown). We obtained a median of 68 s per execution, which would mean ∼43 years [68 s $\times$ ($2 \times 10^{7}$) searches] to compare each sequence from UniRef50 against the whole database. To overcome such limitations, we followed a simplified all-versus-all strategy (Figure 3A).
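The cost estimate above can be reproduced with a few lines of arithmetic. The figures are the ones quoted in the text (about 2 × 10^7 UniRef50 sequences and a median of 68 s per search); the year length of 365.25 days is an assumption made here for the conversion.

```python
# Values quoted in the text; the year length is an assumption for the conversion.
N_SEQUENCES = 2 * 10**7             # approximate size of UniRef50 2017_01
MEDIAN_SECONDS_PER_SEARCH = 68      # one query against the whole database
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Exhaustive all-versus-all: every sequence against every sequence.
pairwise_comparisons = N_SEQUENCES ** 2

# One database search per query sequence, run sequentially.
total_seconds = MEDIAN_SECONDS_PER_SEARCH * N_SEQUENCES
total_years = total_seconds / SECONDS_PER_YEAR

print(f"{pairwise_comparisons:.0e} comparisons, about {total_years:.0f} years")
```

The estimate assumes sequential execution on a single machine; parallel searches would divide the wall-clock time accordingly.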
Figure 3. Redundancy reduction of the UniRef50 database. (A) Pipeline followed to reduce the redundancy of the UniRef50 database, to generate a preUEP database. (B) Number of sequences per database (millions), both initial and last data set, plus the two intermediate ones created while iterating the strategy in (A). (C) Estimation of fraction of redundant entries per data set. (D) Number of sequences (millions) for the ETM of unique proteins on Earth calculated from the redundancy of each data set, using the formula.

UniRef50 is generated from the clustering of the UniProtKB database. In this clustering, the sequences in each cluster have at least 50% sequence identity to the longest sequence in it. We take this database as a proxy to start the reduction of the UniProtKB database, as it already is a simplified version of it. The cluster representatives then have <50% identity between themselves. UniRef50 was split into 201 bins of 100 000 sequences, and an all-versus-all strategy was performed in each bin separately (Figure 3A) (see ‘Methods’ section for details). The 201 intermediate result data sets were joined to form an intermediate database (Intermediate1), with 12 542 332 sequences (a 37.55% compression compared with the initial 20 083 468 proteins) (Figure 3B).
A second iteration of the same procedure was performed, using as initial data set an ordered-by-length version of the Intermediate1 database. This database was then further reduced to an Intermediate2 database, with 8 616 716 sequences (31.30% compressed compared with the Intermediate1 data set). As a last iteration, a randomly shuffled version of the Intermediate2 database was used as the initial data set. In this case, the compression achieved was low (4.54%), and a data set called preUEP with 8 225 772 sequences was generated (Figure 3B). The simplified all-versus-all strategy using bins was no longer able to reduce the redundancy of the intermediate data sets. To calculate the Unique Earth’s Proteome (UEP) data set, one would have to perform comparisons of all the proteins in the preUEP database between themselves, which we are still not capable of doing because of computational and time limitations. The generated data sets can be downloaded from a dedicated Web page: http://cbdm-01.zdv.uni-mainz.de/∼munoz/uep/. Although we cannot produce the ultimate UEP data set, we can approximate the estimated theoretical minimum (ETM) of unique proteins on Earth by studying the redundancy of the data sets we are working with. For each of them, we compared 500 randomly selected proteins from the database against it as a whole, and assessed their uniqueness (using three replicates; see ‘Methods’ section for details). As expected, the larger the database, the more redundant it is (Figure 3C). Considering both the redundancy for each database and its size, we can produce an estimation of the ETM of proteins on Earth per data set (Figure 3D). Given the 8 225 772 sequences in the preUEP data set (Figure 3B), and considering that we calculated its redundancy to be 54.27% (Figure 3C), the estimated ETM would be ∼3.76 million sequences. 
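The compression figures and the ETM quoted above can be checked numerically. The caption of Figure 3 refers to a formula that is not reproduced in the text, so the relation used here (data set size multiplied by its non-redundant fraction) is an assumption; it does reproduce the reported value of about 3.76 million.

```python
def compression_percent(before: int, after: int) -> float:
    """Percentage of sequences removed between two data sets."""
    return 100.0 * (before - after) / before


def estimated_theoretical_minimum(size: int, redundant_fraction: float) -> float:
    """Assumed relation: ETM = size * (1 - redundant fraction)."""
    return size * (1.0 - redundant_fraction)


# Data set sizes reported in the text.
UNIREF50 = 20_083_468
INTERMEDIATE1 = 12_542_332
INTERMEDIATE2 = 8_616_716
PRE_UEP = 8_225_772

steps = [
    compression_percent(UNIREF50, INTERMEDIATE1),
    compression_percent(INTERMEDIATE1, INTERMEDIATE2),
    compression_percent(INTERMEDIATE2, PRE_UEP),
]
etm_pre_uep = estimated_theoretical_minimum(PRE_UEP, 0.5427)

print([round(s, 2) for s in steps], round(etm_pre_uep))
```

The same function applied to UniRef50 and the two intermediate data sets would need their measured redundancies, which are shown in Figure 3C but not stated in the text.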
It is important to note that, while the calculated ETM for the different databases is similar, we believe the real ETM value to be closer to the result given by the smaller databases (Intermediate2 and preUEP). Ten years ago, we proposed a value of 5 million for the total number of distinct proteins on Earth [1]. Based on our current results, we estimate a lower value, of around 3.75 million different proteins in the proposed theoretical UEP data set. It is expected that every protein in the UniProtKB database would be homologous to a maximum of one protein in the UEP.

Study of known proteins and database growth by taxa

Sequencing projects are often biased toward specific branches of the tree of life, mostly populated by similar species or model organisms [11, 12]. Regarding the number of species, it is widely accepted that there is still an indefinite number of organisms to be discovered and characterized [13, 14]. Based on the currently described species, it is crucial to detect and to describe sequencing voids. It is necessary to carefully devise a thorough strategy for future sequencing projects to level those proteome gaps, to maximize the number of nonredundant entries that would be added to the protein databases. As of February 2017, in the Taxonomy resource from NCBI, there were 1 434 345 species and lower taxa characterized. There is an asymmetry in the number of described species, as eukaryotes and bacteria sum up to >87% of the total (Figure 4A). This number is influenced by both the large number of characterized insects [15] and the popularity of metagenomic studies [16], respectively. More than half of those known eukaryotic species have at least one protein entry in UniProtKB (Figure 4B). On the other hand, only around a fifth of the bacterial proteomes are represented in UniProtKB. In any case, bacterial and viral proteomes make up almost all the currently fully sequenced proteomes, given their small genome size.
Archaeal genomes are also small, but far fewer of them have been described, as archaea are typically more difficult to culture and study in vitro [17]. Reference proteomes are chosen by UniProtKB as an indicator of well-studied model organisms and other species of interest; model organisms do not directly reflect the distribution of full proteomes, as there are proportionally many more eukaryotes than viruses among them. An overview of the knowledge status per taxon (Figure 4B) confirms that while viruses are the best covered taxon (47.5% of the known species are fully sequenced, and for 93.1% of them, there is at least one sequence in the database), the archaeal taxon is largely unexplored, as it was 10 years ago [1].

Figure 4. Taxonomic distribution of known, partially sequenced, fully sequenced and reference proteomes. (A) Comparison of the proportion of known, full, partial and reference proteomes, distributed per taxon. (B) Number of proteomes per taxon distributed depending on their status.

To characterize the fraction of redundant entries added per taxon to the database in the past year, we followed a procedure similar to the one used above to describe the evolution of the redundancy in UniProtKB (see ‘Methods’ section for details). Of the 74 265 355 entries in UniProtKB release 2017_01, bacterial proteins make up >63% (Figure 5A). It is therefore more difficult to describe new proteins for this taxon than for any of the others; the sampling strategy shows that 96.93% of the bacterial entries added since release 2016_01 are redundant (Figure 5B).
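The sampling strategy referred to here (500 randomly chosen entries per replicate, three replicates per taxon, as stated in the caption of Figure 5) could be sketched as follows. This is a minimal illustration, not the procedure from the ‘Methods’ section: the `is_redundant` predicate is a hypothetical stand-in for a real similarity search against the older release, and the toy identifiers are invented for the example.

```python
import random
from statistics import mean
from typing import Callable, Sequence


def sampled_redundancy(
    new_entries: Sequence[str],
    is_redundant: Callable[[str], bool],
    sample_size: int = 500,
    replicates: int = 3,
    seed: int = 0,
) -> float:
    """Mean percentage of sampled entries flagged as redundant."""
    rng = random.Random(seed)
    population = list(new_entries)
    percentages = []
    for _ in range(replicates):
        # Sample without replacement; cap at the population size.
        sample = rng.sample(population, min(sample_size, len(population)))
        hits = sum(1 for entry in sample if is_redundant(entry))
        percentages.append(100.0 * hits / len(sample))
    return mean(percentages)


# Toy data (assumption): identifiers present in the older release,
# and the entries added in the newer one.
known = {f"P{i}" for i in range(900)}
added = [f"P{i}" for i in range(1000)]  # 900 of these 1000 are already known

estimate = sampled_redundancy(added, lambda entry: entry in known)
print(f"estimated redundancy: {estimate:.1f}%")
```

Applied per taxon, the same function would be called once for each group of added entries (archaeal, bacterial, eukaryotic, viral, unclassified), with the predicate replaced by an actual homology test.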
Similarly, newly added viral entries tend to be redundant (94.07%) because almost half of the known viruses in nature have already been fully sequenced (Figure 4B). Eukaryotic entries behave differently because their proteomes are far less well covered than those from the rest of the taxa (Figure 4B), and only 90.13% of the added entries were considered redundant. Apart from archaeal, viral, bacterial and eukaryotic entries, UniProtKB contains >1 million unclassified entries (usually from uncultured organisms, like the ones from the recently described Asgard superphylum [18]), which are more likely to be new (Figure 5B). Collectively, our results suggest that the Eukarya and Archaea taxa need more sequencing.

Figure 5. Distinctive contribution of entries based on their taxonomy to the UniProtKB redundancy. (A) Proportion of entries in UniProtKB 2017_01 per taxon. (B) Percentage of sampled added entries from UniProtKB release 2017_01 similar to an existing entry in UniProtKB release 2016_01, divided by taxon. Three replicates per taxon, 500 randomly chosen entries per replicate.

Conclusions

The spread of a multitude of sequencing techniques in recent years is driving the protein databases toward a continuously accelerating increase in size. This dynamic challenges the development, maintenance and use of the protein databases and requires a reaction. On the other hand, the new sequences entering the databases are not produced to fill our gaps in biological knowledge in any rational manner.
In our opinion, the work we presented here, especially when put in the perspective of our previous analysis from a decade ago, suggests that proper reactions to organize the protein database and rationalize its growth did not materialize. Based on our results, we propose a small number of simple directives to steer the sequence databases away from an information catastrophe and to efficiently guide sequencing efforts toward organisms holding the last unknown protein functions. In this project, we have resumed the characterization of the growth in protein diversity in the protein sequence databases over the past 10 years, picking up the story where we left it [1]. On the one hand, we analyzed the SwissProt database, which has a particular dynamic, given its high level of curation, and contains a small fraction of all known protein sequences. On the other hand, the TrEMBL/UniProtKB database includes much larger numbers of mostly automatically predicted protein sequences and in consequence behaves in a dramatically different way from SwissProt (Figure 1). The automatic prediction of sequences does not require any curation, thus simplifying the creation of new entries. We have shown that most of the newly added entries are redundant, and that this trend is still growing. We propose that a resource to give structure to the sequence database is badly needed, mainly to avoid the large number of mutually similar sequences in the lists of hits resulting from most sequence similarity searches, which currently grow with each new version of the database and can obscure interesting results. This resource would be built using a strategy purely dependent on sequence comparison. We would not use protein features like function, which derives from sequence, because the correspondence between sequence similarity and function similarity is extremely variable across protein families and thus combining them to cluster sequences would not be helpful [19].
For example, pseudokinases are catalytically deficient counterparts of active kinases with inhibitory functions; while they have highly similar sequences, they have opposite functions [20]. This is not a rare case: 10% of human kinases are pseudokinases [21]. In any case, we acknowledge the importance of contrasting such a database organized purely by sequence with functional annotations; annotating these clusters is a necessary subsequent step. We envision this resource as a data set of proteins in which no two proteins would be homologous to each other over a large part of their sequences. Operationally, novel sequences would only need to be compared with this data set to decide whether they would enlarge it. We predict that, given the evolution of the sequence databases, if such a data set were generated today, it would not grow much further. However, we cannot currently handle the computational effort needed to produce this data set in reasonable time; thus, we present a pipeline to produce it in incremental steps and provide an incomplete reduced set of sequences as an intermediate step. In more detail, through a simple pipeline based on sequence identity, we reduced the number of unique proteins in the UniRef50 database from 20 million to 8.2 million (a 59% reduction); this number could be further reduced using a time-consuming all-versus-all strategy. Taking the complete UniProtKB database release 2017_01 as the initial reference, the reduction rises to 89%. This non-exhaustive strategy, in addition to the calculation of the redundancy of these databases, leads us to estimate the total number of distinct proteins on Earth at around 3.75 million. We propose a collaborative effort to perform the all-versus-all strategy over our 8.2 million preUEP data set to simplify it and obtain the real UEP data set, from which we would all benefit.
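The operational idea described above, in which a novel sequence is compared only with the current non-redundant data set to decide whether it enlarges it, can be illustrated with a toy greedy filter. This is a sketch under stated assumptions and not the pipeline used to build preUEP: the `similarity` function uses Python's `difflib` as a stand-in for a real alignment-based identity measure, the 0.5 threshold is arbitrary, and the example sequences are invented.

```python
from difflib import SequenceMatcher
from typing import Iterable, List


def similarity(a: str, b: str) -> float:
    """Toy similarity in [0, 1]; a stand-in for a real alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def add_if_novel(representatives: List[str], candidate: str,
                 threshold: float = 0.5) -> bool:
    """Append `candidate` only if no representative is similar enough to it."""
    for rep in representatives:
        if similarity(rep, candidate) >= threshold:
            return False  # redundant: an existing representative covers it
    representatives.append(candidate)
    return True


def reduce_redundancy(sequences: Iterable[str],
                      threshold: float = 0.5) -> List[str]:
    """Greedy single pass that keeps one representative per similar group."""
    representatives: List[str] = []
    for seq in sequences:
        add_if_novel(representatives, seq, threshold)
    return representatives


# Invented toy sequences: the second and fourth are near-copies of the first.
toy = ["MKTAYIAKQR", "MKTAYIAKQK", "GSSHHHHHHS", "MKTAYLAKQR", "PLWVNDCEFG"]
print(reduce_redundancy(toy))
```

A greedy pass of this kind depends on the order of the input, which is one reason why reordering the data set between iterations (by length, or by random shuffling) can expose redundancy that an earlier pass missed. Each candidate is compared with every representative, so the cost grows with the size of the data set; a real implementation would rely on an indexed similarity search.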
For example, comparing the UEP with a sequenced genome would result in an automatic and fast proteome annotation. Furthermore, given an unknown protein, one could easily obtain its homologous protein from the UEP with a similarity search. Were the complete UEP data set manually curated and thoroughly annotated (a second-phase task), it would simplify the prediction and annotation of protein sequences. Our proposed way to start reducing the redundancy of the UniProtKB database is only one of several that could be followed. A different approach could be based on profile models that summarize the diversity of the different protein families. It could function similarly to current protein domain databases, but taking full proteins as entities. Furthermore, this database of profile models could be developed once the full UEP was generated; each sequence in the UEP would serve as the representative of a protein family, to which all known proteins similar to it would be attached. The profile models would then be derived from them. One can even think of taxon-specific profiles, to describe the diversity of a protein family at different taxonomic levels. Sequencing projects normally proceed toward organisms with a health or economic benefit and, even if sequencing is cheap, it is difficult to convince a laboratory to attempt isolating and sequencing an organism just because it contains many novel proteins. We propose that funding agencies should guide this effort with specially allocated funds. We hope that our efforts will make the community of genomic researchers, database developers and related funding agencies aware of the current status of protein and protein function knowledge, spurring novel strategies to choose species for sequencing based on phylogenetic analyses of the database. The work presented here should be used as a guide to speed up the last steps in ending the frontier era of protein research.
Key Points

The novelty of the newly deposited sequences in the protein databases is decreasing over the years.
We predict that before 2020, almost all entries deposited in UniProtKB will be homologous to known proteins.
We estimate the size of the Earth’s proteome to be 3.75 million sequences.
Eukarya and Archaea are the taxa that are probably holding most of the remaining unknown proteins.
Sequencing voids can only be covered with encouragement from funding agencies.

Funding

This work has been supported with funds from the Center for Computational Sciences Mainz (CSM, Johannes Gutenberg University of Mainz, Germany).

Pablo Mier is a postdoctoral researcher interested in the development of Web tools and databases related to protein evolution and low-complexity regions. He works in the Faculty of Biology at Johannes Gutenberg University Mainz.

Miguel A. Andrade-Navarro is a professor of Faculty of Biology, at the Johannes Gutenberg University of Mainz. His group (‘Computational Biology and Data Mining’) is interested in exploring gene function using computational techniques including algorithms and databases.

References

1. Perez-Iratxeta C, Palidwor G, Andrade-Navarro MA. Towards completion of the Earth's proteome. EMBO Rep 2007; 8(12): 1135–41.
2. NCBI Resource Coordinators. Database resources of the national center for biotechnology information. Nucleic Acids Res 2017; 45(D1): D12–7.
3. The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res 2017; 45(D1): D158–69.
4. Mukherjee S, Stamatis D, Bertsch J, et al. Genomes OnLine Database (GOLD) v.6: data updates and feature enhancements. Nucleic Acids Res 2017; 45(D1): D446–56.
5. Bursteinas B, Britto R, Bely B, et al. Minimizing proteome redundancy in the UniProt Knowledgebase. Database 2016; 2016: baw139.
6. Chen Q, Zobel J, Verspoor K. Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study. Database 2017; 2017: baw163.
7. Suzek BE, Wang Y, Huang H, et al. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 2015; 31(6): 926–32.
8. Mirdita M, von den Driesch L, Galiez C, et al. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res 2017; 45(D1): D170–6.
9. Mier P, Andrade-Navarro MA. FastaHerder2: four ways to research protein function and evolution with clustering and clustered databases. J Comput Biol 2016; 23(4): 270–8.
10. Rost B. Twilight zone of protein sequence alignments. Protein Eng 1999; 12(2): 85–94.
11. Land M, Hauser L, Jun SR, et al. Insights from 20 years of bacterial genome sequencing. Funct Integr Genomics 2015; 15(2): 141–61.
12. del Campo J, Sieracki ME, Molestina R, et al. The others: our biased perspective of eukaryotic genomes. Trends Ecol Evol 2014; 29(5): 252–9.
13. Mora C, Tittensor DP, Adl S, et al. How many species are there on Earth and in the ocean? PLoS Biol 2011; 9(8): e1001127.
14. Strain D. Biodiversity. 8.7 million: a new estimate for all the complex species on Earth. Science 2011; 333(6046): 1083.
15. Stork NE, McBroom J, Gely C, et al. New approaches narrow global species estimates for beetles, insects, and terrestrial arthropods. Proc Natl Acad Sci USA 2015; 112(24): 7519–23.
16. Roumpeka DD, Wallace RJ, Escalettes F, et al. A review of bioinformatics tools for bio-prospecting from metagenomic sequence data. Front Genet 2017; 8: 23.
17. Cowan DA, Ramond JB, Makhalanyane TP, et al. Metagenomics of extreme environments. Curr Opin Microbiol 2015; 25: 97–102.
18. Zaremba-Niedzwiedzka K, Caceres EF, Saw JH, et al. Asgard archaea illuminate the origin of eukaryotic cellular complexity. Nature 2017; 541(7637): 353–8.
19. Devos D, Valencia A. Practical limits of function prediction. Proteins 2000; 41(1): 98–107.
20. Jacobsen AV, Murphy JM. The secret life of kinases: insights into non-catalytic signaling functions from pseudokinases. Biochem Soc Trans 2017; 45(3): 665–81.
21. Boudeau J, Miranda-Saavedra D, Barton GJ, et al. Emerging roles of pseudokinases. Trends Cell Biol 2006; 16(9): 443–52.

© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

Journal

Briefings in Bioinformatics, Oxford University Press

Published: Oct 12, 2017
