Benchmarks for measurement of duplicate detection methods in nucleotide databases

Duplication of information in databases is a major data quality challenge. The presence of duplicates, implying either redundancy or inconsistency, can have a range of impacts on the quality of analyses that use the data. To provide a sound basis for research on this issue in databases of nucleotide sequences, we have developed new, large-scale validated collections of duplicates, which can be used to test the effectiveness of duplicate detection methods. Previous collections were either designed primarily to test efficiency, or contained only a limited number of duplicates of limited kinds. To date, duplicate detection methods have been evaluated on separate, inconsistent benchmarks, leading to results that cannot be compared and, due to limitations of the benchmarks, of questionable generality. In this study, we present three nucleotide sequence database benchmarks, based on information drawn from a range of resources, including information derived from mapping to two data sections within the UniProt Knowledgebase (UniProtKB), UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. Each benchmark has distinct characteristics. We quantify these characteristics and argue for their complementary value in evaluation. The benchmarks collectively contain a vast number of validated biological duplicates; the largest has nearly half a billion duplicate pairs (although this is probably only a tiny fraction of the total that is present). They are also the first benchmarks targeting the primary nucleotide databases. The records include the 21 most heavily studied organisms in molecular biology research. Our quantitative analysis shows that duplicates in the different benchmarks, and in different organisms, have different characteristics. It is thus unreliable to evaluate duplicate detection methods against any single benchmark.
For example, the benchmark derived from UniProtKB/Swiss-Prot mappings identifies more diverse types of duplicates, showing the importance of expert curation, but is limited to coding sequences. Overall, these benchmarks form a resource that we believe will be of great value for development and evaluation of the duplicate detection or record linkage methods that are required to help maintain these essential resources.

Database URL: https://bitbucket.org/biodbqual/benchmarks

© The Author(s) 2017. Published by Oxford University Press. Database, Vol. 2017, Article ID baw164. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction

Sequencing technologies are producing massive volumes of data. GenBank, one of the primary nucleotide databases, increased in size by over 40% in 2014 alone (1). However, researchers have been concerned about the underlying data quality in biological sequence databases since the 1990s (2). A particular problem of concern is duplicates, when a database contains multiple instances representing the same entity. Duplicates introduce redundancies, such as repetitive results in database search (3), and may even represent inconsistencies, such as contradictory functional annotations on multiple records that concern the same entity (4). Recent studies have noted duplicates as one of five central data quality problems (5), and it has been observed that detection and removal of duplicates is a key early step in bioinformatics database curation (6).

Existing work has addressed duplicate detection in biological sequence databases in different ways. This work falls into two broad categories: efficiency-focused methods that are based on assumptions such as duplicates have identical or near-identical sequences, where the aim is to detect similar sequences in a scalable manner; and quality-focused methods that examine record fields other than the sequence, where the aim is accurate duplicate detection. However, the value of these existing approaches is unclear, due to the lack of broad-based, validated benchmarks; as some of this previous work illustrates, there is a tendency for investigators of new methods to use custom-built collections that emphasize the kind of characteristic their method is designed to detect.

Thus, different methods have been evaluated using separate, inconsistent benchmarks (or test collections). The efficiency-focused methods used large benchmarks. However, the records in these benchmarks are not necessarily duplicates, due to use of mechanical assumptions about what a duplicate is. The quality-focused methods have used collections of expert-labelled duplicates. However, as a result of the manual effort involved, these collections are small and contain only limited kinds of duplicates from limited data sources. To date, no published benchmarks have included duplicates that are explicitly marked as such in the primary nucleotide databases, GenBank, the EMBL European Nucleotide Archive, and the DNA DataBank of Japan. (We refer to these collectively as INSDC: the International Nucleotide Sequence Database Collaboration (7).)

In this study, we address these issues by accomplishing the following:

- We introduce three benchmarks containing INSDC duplicates that were collected based on three different principles: records merged directly in INSDC (111 826 pairs); INSDC records labelled as references during UniProtKB/Swiss-Prot expert curation (2 465 891 pairs); and INSDC records labelled as references in UniProtKB/TrEMBL automatic curation (473 555 072 pairs);
- We quantitatively measure similarities between duplicates, showing that our benchmarks have duplicates with dramatically different characteristics, and are complementary to each other. Given these differences, we argue that it is insufficient to evaluate against only one benchmark; and
- We demonstrate the value of expert curation, in its identification of a much more diverse set of duplicate types.

It may seem that, with so many duplicates in our benchmarks, there is little need for new duplicate detection methods. However, the limitations of the mechanisms that led to discovery of these duplicates, and the fact that the prevalences are so very different between different species and resources, strongly suggest that these are a tiny fraction of the total that is likely to be present. While a half billion duplicates may seem like a vast number, they only involve 710 254 records, while the databases contain 189 264 014 records (http://www.ddbj.nig.ac.jp/breakdown_stats/dbgrowth-e.html#ddbjvalue) altogether to date. Also, as suggested by the effort expended in expert curation, there is a great need for effective duplicate detection methods.

Background

In the context of general databases, the problems of quality control and duplicate detection have a long history of research. However, this work has only limited relevance for bioinformatics databases, because, for example, it has tended to focus on tasks such as ensuring that each real-world entity is only represented once, and the attributes of entities (such as ‘home address’) are externally verifiable. In this section we review prior work on duplicate detection in bioinformatics databases. We show that researchers have approached duplicate detection with different assumptions. We then review the more general duplicate detection literature, showing that the issue of a lack of rigorous benchmarks is a key problem for duplicate detection in general domains and is what motivates our work. Finally, we describe the data quality control in INSDC, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL, as the sources for construction of the duplicate benchmark sets that we introduce.

Kinds of duplicate

Different communities, and even different individuals, may have inconsistent understandings of what a duplicate is. Such differences may in turn lead to different strategies for de-duplication.

A generic definition of a duplicate is that it occurs when there are multiple instances that point to the same entity. Yet this definition is inadequate; it requires a definition that allows identification of which things are ‘the same entity’. We have explored definitions of duplicates in other work (8). We regard two records as duplicates if, in the context of a particular task, the presence of one means that the other is not required. Here we explain that duplication has at least four characteristics, as follows.

First, duplication is not simply redundancy. The latter can be defined using a simple threshold. For example, if two instances have over 90% similarity, they can arguably be defined as redundant. Duplicate detection often regards such examples as ‘near duplicates’ (9) or ‘approximate duplicates’ (10). In bioinformatics, ‘redundancy’ is commonly used to describe records with sequence similarity over a certain threshold, such as 90% for CD-HIT (11). Nevertheless, instances with high similarity are not necessarily duplicates, and vice versa. For example, curators working with human pathway databases have found records labelled with the same reaction name that are not duplicates, while legitimate duplicates may exist under a variety of different names (12). Likewise, as we present later, nucleotide sequence records with high sequence similarity may not be duplicates, whereas records whose sequences are relatively different may be true duplicates.

Second, duplication is context dependent. From one perspective, two records might be considered duplicates while from another they are distinct; one community may consider them duplicates whereas another may not. For instance, amongst gene annotation databases, broader duplicate types are considered in Wilming et al. (13) than in Williams et al. (14), whereas, for genome characterization, ‘duplicate records’ means creation of a new record in the database using configurations of existing records (15). Different attributes have been emphasized in the different databases.

Third, duplication has various types with distinct characteristics. Multiple types of duplicates could be found even from the same perspective (8). By categorizing duplicates collected directly from INSDC, we have already found diverse types: similar or identical sequences; similar or identical fragments; duplicates with relatively different sequences; working drafts; sequencing in progress records; and predicted records. The prevalence of each type varies considerably between organisms. Studies on duplicate detection in general indicate that performance on a single dataset may be biased if we do not consider the independence and underlying stratifications (16). Thus, as well as creating benchmarks from different perspectives, we collect duplicates from multiple organisms from the same perspectives.

We do not regard these discrepancies as shortcomings or errors. Rather, we stress the diversity of duplication. The understanding of ‘duplicates’ may be different between database staff, computer scientists, biological curators and so on, and benchmarks need to reflect this diversity. In this work, we assemble duplicates from three different perspectives: expert curation (how data curators understand duplicates); automatic curation (how automatic software without expert review identifies duplicates); and merge-based quality checking (how records are merged in INSDC). These different perspectives reflect the diversity: a pair considered as duplicates from one perspective may not be so in another. For instance, nucleotide coding records might not be duplicates strictly at the DNA level, but they might be considered to be duplicates if they concern the same proteins. Use of different benchmarks derived from different assumptions tests the generality of duplicate detection methods: a method may have strong performance on one benchmark but very poor performance on another; only verification against different benchmarks can establish that a method is robust.

Currently, understanding of duplicates via expert curation is the best approach. Here ‘expert curation’ means that curation either is purely manually performed, as in ONRLDB (17); or not entirely manual but involving expert review, as in UniProtKB/Swiss-Prot (18). Experts use experience and intuition to determine whether a pair is a duplicate, and will often check additional resources to ensure the correctness of a decision (16). Studies on clinical (19) and biological databases (17) have demonstrated that expert curation can find a greater variety of duplicates, and ultimately improves the data quality. Therefore, in this work we derive one benchmark from UniProtKB/Swiss-Prot expert curation.

Impact of duplicates

There are many types of duplicate, and each type has different impacts on use of the databases.
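The contrast between redundancy and duplication drawn under ‘Kinds of duplicate’ can be made concrete with a small sketch. It applies a fixed similarity threshold (0.9, echoing the 90% figure quoted for CD-HIT) to a few labelled pairs and counts where the threshold rule disagrees with the validated labels. Everything here is an illustrative assumption: the similarity function is Python's `difflib.SequenceMatcher.ratio`, used only as a stand-in and not the measure used by CD-HIT or in this study, and the sequences, labels and function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; NOT an alignment-based sequence identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def is_redundant(a: str, b: str, threshold: float = 0.9) -> bool:
    """Threshold rule: call a pair 'redundant' if similarity meets the threshold."""
    return similarity(a, b) >= threshold


def threshold_errors(pairs, threshold: float = 0.9):
    """Compare the threshold rule against validated duplicate labels.

    pairs: iterable of (seq_a, seq_b, is_validated_duplicate).
    Returns (false_positives, false_negatives).
    """
    false_pos = false_neg = 0
    for a, b, is_duplicate in pairs:
        flagged = is_redundant(a, b, threshold)
        if flagged and not is_duplicate:
            false_pos += 1  # similar sequences, but distinct records
        elif is_duplicate and not flagged:
            false_neg += 1  # validated duplicate with dissimilar sequences
    return false_pos, false_neg


# Hypothetical toy data: (sequence A, sequence B, validated duplicate?)
pairs = [
    ("ACGTACGTACGTACGTACGT", "ACGTACGTACGTACGTACGA", False),
    ("ACGTACGTACGTACGTACGT", "TTTTGGGGCCCCAAAATTTT", True),
    ("ACGTACGTACGTACGTACGT", "ACGTACGTACGTACGTACGT", True),
]
print(threshold_errors(pairs))  # (1, 1)
```

On this toy data the threshold rule produces one false positive and one false negative, which is the kind of disagreement that only a validated benchmark can expose.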
Approximate or Downloaded from https://academic.oup.com/database/advance-article-abstract/doi/10.1093/database/baw164/2870676 by Ed 'DeepDyve' Gillespie user on 17 July 2018 Page 4 of 17 Database, Vol. 2017, Article ID baw164 near duplicates introduce redundancies, whereas other Duplicate detection methods types may lead to inconsistencies. Most duplicate detection methods use pairwise compari- Approximate or near duplicates in biological databases is son, where each record is compared against others in pairs not a new problem. We found related literature in 1994 (3), using a similarity metric. The similarity score is typically 2006 (20) and as recently as 2015 (http://www.uniprot.org/ computed by comparing the specific fields in the two re- help/proteome_redundancy). A recent significant issue was cords. The two classes of methods that we previously intro- proteome redundancy in UniProtKB/TrEMBL (2015). duced, efficiency-focused and quality-focused, detect UniProt staff observed that many records were over- duplicates in different ways; we now summarize those represented, such as 5.97 million entries for just 1692 strains approaches. of Mycobacterium tuberculosis. This redundancy impacts se- quence similarity searches, proteomics identification and motif searches. In total, 46.9 million entries were removed. Efficiency-focused methods Additionally, recall that duplicates are not just redun- Efficiency-focused methods have two common features. dancies. Use of a simple similarity threshold will result in One is that they typically rest on simple assumptions, such many false positives (distinct records with high similarity) as that duplicates are records with identical or near- and false negatives (duplicates with low similarity). Studies identical sequences. These are near or approximate dupli- show that both cases matter: in clinical databases, merging cates as above. 
The other is an application of heuristics to of records from distinct patients by mistake may lead to filter out pairs to compare, in order to reduce the running withholding of a treatment if one patient is allergic but the time. Thus, a common pattern of such methods is to assume other is not (21); failure to merge duplicate records for the that duplicates have sequence similarity greater than a cer- same patient could lead to a fatal drug administration error tain threshold. In one of the earliest methods, nrdb90,itis (22). Likewise, in biological databases, merging of records assumed that duplicates have sequence similarities over with distinct functional annotations might result in incor- 90%, with k-mer matching used to rapidly estimate similar- rect function identification; failing to merge duplicate re- ity (28). In CD-HIT, 90% similarity is assumed, with cords with different functional annotations might lead to short-substring matching as the heuristic (11); in starcode, incorrect function prediction. One study retrieved corres- a more recent method, it is assumed that duplicates have se- ponding records from two biological databases, Gene quences with a Levenshtein distance of no > 3, and pairs of Expression Omnibus and ArrayExpress, but surprisingly sequences with greater estimated distance are ignored (29). found the number of records to be significantly different: Using these assumptions and associated heuristics, the the former has 72 whereas only 36 in latter. Some of the re- methods are designed to speed up the running time, which cords were identical, but in some cases records were in one is typically the main focus of evaluation (11,28). While but not the other (23). Indeed, duplication commonly some such methods consider accuracy, efficiency is still the interacts with inconsistency (5). major concern (29). 
The collections are often whole data- Further, we cannot ignore the propagated impacts of bases, such as the NCBI non-redundant database (Listed at duplicates. The above duplication issue in UniProtKB/ https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_ TrEMBL not only impacts UniProtKB/TrEMBL itself, but TYPE¼BlastSearch) for nucleotide databases and Protein also significantly impacts databases or studies using Data Bank (http://www.rcsb.org/pdb/home/home.do) for UniProtKB/TrEMBL data. For instance, release of Pfam, a protein databases. These collections are certainly large, but curated protein family database, was delayed for close to 2 are not validated, that is, records are not known to be du- years; the duplication issue in UniProtKB/TrEMBL was the plicates via quality-control or curation processes. The major reason (24). Even removal of duplicates in methods based on simple assumptions can reduce redun- UniProtKB/TrEMBL caused problems: ‘the removal of bac- dancies, but recall that duplication is not limited to redun- terial duplication in UniProtKB (and normal flux in pro- dancy: records with similar sequences may not be tein) would have meant that nearly all (>90%) of Pfam duplicates and vice versa. For instance, records INSDC seed alignments would have needed manual verification AL592206.2 and INSDC AC069109.2 have only 68% (and potential modification) .. . This imposes a significant local identity (measured in Section 3.2 advised by NCBI manual biocuration burden’ (24). BLAST staff), but they have overlapped clones and were Finally, duplicate detection across multiple sources pro- merged as part of the finishing strategy of the human gen- vides valuable record linkages (25–27). Combination of in- ome. 
Therefore, records measured solely based on a simi- formation from multiple sources could link literature larity threshold are not validated and do not provide a databases, containing papers mentioning the record; gene basis for measuring the accuracy of a duplicate detection databases; and protein databases. method, that is, the false positive or false negative rate. Downloaded from https://academic.oup.com/database/advance-article-abstract/doi/10.1093/database/baw164/2870676 by Ed 'DeepDyve' Gillespie user on 17 July 2018 Database, Vol. 2017, Article ID baw164 Page 5 of 17 Quality-focused methods records yields over 12 million pairs; even a small data set In contrast to efficiency-focused methods, quality-focused requires a large processing time under these conditions. methods tend to have two main differences: use of a Hence, there is no large-scale validated benchmark, and greater number of fields; and evaluation on validated data- no verified collections of duplicate nucleotide records in sets. An early method of this kind compared the similarity INSDC. However, INSDC contains primary nucleotide of both metadata (such as description, literature and biolo- data sources that are essential for protein databases. For gical function annotations) and sequence, and then used instance, 95% of records in UniProt are from INSDC association rule mining (30) to discover detection rules. (http://www.uniprot.org/help/sequence_origin). A further More recent proposals focus on measuring metadata using underlying problem is that fundamental understanding of approximate string matching: Markov random models duplication is missing. The scale, characteristics and im- (31), shortest-path edit distance (32) or longest approxi- pacts of duplicates in biological databases remain unclear. mately common prefix matching (33), the former two for general bioinformatics databases and the latter specifically for biomedical databases. 
The first method used a 1300-re- Benchmarks in duplicate detection cord dataset of protein records labelled by domain experts, whereas the others used a 1900-record dataset of protein Lack of large-scale validated benchmarks is a problem in records labelled in UniProt Proteomes, of protein sets from duplicate detection in general domains. Researchers sur- fully sequenced genomes in UniProt. veying duplicate detection methods have stated that the The collections used in this work are validated, but most challenging obstacle is lack of ‘standardized, large- have significant limitations. First, both of the collections scale benchmarking data sets’ (34). It is not easy to identify have <2000 records, and only cover limited types of dupli- whether new methods surpass existing ones without reli- cates (46). We classified duplicates specifically on one of able benchmarks. Moreover, some methods are based on the benchmarks (merge-based) and it demonstrates that machine learning, which require reliable training data. In different organisms have dramatically distinct kinds of du- general domains, many supervised or semi-supervised du- plicate: in Caenorhabditis elegans, the majority duplicate plicate detection methods exist, such as decision trees (35) type is identical sequences, whereas in Danio rerio the ma- and active learning (36). jority duplicate type is of similar fragments. From our case The severity of this issue is illustrated by the only super- study of GC content and melting temperature, those differ- vised machine-learning method for bioinformatics of ent types introduce different impacts: duplicates under the which we are aware, which was noted above (30). The exact sequence category only have 0.02% mean difference method was developed on a collection of 1300 records. 
In of GC content compared with normal pairs in Homo sapi- prior work, we reproduced the method and evaluated ens, whereas another type of duplicates that have relatively against a larger dataset with different types of duplicates. low sequence identity introduced a mean difference of The results were extremely poor compared with the ori- 5.67%. A method could easily work well in a limited data- ginal outcomes, which we attribute to the insufficiency of set of this kind but not be applicable for broader datasets the data used in the original work (37). with multiple types of duplicates. Second, they only cover We aim to create large-scale validated benchmarks of a limited number of organisms; the first collection had two duplicates. By assembling understanding of duplicates and the latter had five. Authors of prior studies, such as from different perspectives, it becomes possible to test dif- Rudniy et al. (33), acknowledged that differences of dupli- ferent methods in the same platform, as well as test the ro- cates (different organisms have different kinds of duplicate; bustness of methods in different contexts. different duplicate types have different characteristics) are the main problem impacting the method performance. In some respects, the use of small datasets to assess Quality control in bioinformatics databases quality-based methods is understandable. It is difficult to find explicitly labelled duplicates. Typically, for nucleotide To construct a collection of explicitly labelled duplicates, databases, sources of labelled duplicates are limited. In an essential step is to understand the quality control pro- addition, these methods focus on the quality and so are un- cess in bioinformatics databases, including how duplicates likely to use strategies for pruning the search space, mean- are found and merged. Here we describe how INSDC and ing that they are compute intensive. 
These methods also UniProt perform quality control in general and indicate generally consider many more fields and many more pairs how these mechanisms can help in construction of large than the efficiency-focused methods. A dataset with 5000 validated collections of duplicates. Downloaded from https://academic.oup.com/database/advance-article-abstract/doi/10.1093/database/baw164/2870676 by Ed 'DeepDyve' Gillespie user on 17 July 2018 Page 6 of 17 Database, Vol. 2017, Article ID baw164 Figure 1. A screenshot of the revision history for record INSDC AC034192.5 (http://www.ncbi.nlm.gov/nuccore/AC034192.5?report¼girevhist). Note the differences between normal updates (changes on a record itself) and merged records (duplicates). For instance, the record was updated from ver- sion 3 to 4, which is a normal update. A different record INSDC AC087090.1 is merged in during Apr 2002. This is a case of duplication confirmed by ENA staff. We only collected duplicates, not normal updates. Quality control in INSDC 3. Literature curation: identify relevant papers, read the full text and extract the related context, assign gene Merging of records addresses duplication in INSDC. The ontology terms accordingly; merge may occur due to various reasons, including cases 4. Family curation: analyse putative homology relation- where different submitters adding records for the same bio- logical entities, or changes of database policies. We have ships; perform steps 1–3 for identified instances; 5. Evidence attribution: link all expert curated data to the discussed various reasons for merging elsewhere (8). original source; Different merge reasons reflect the fact that duplication may arise from diverse causes. Figure 1 shows an example. 6. Quality assurance and integration: final check of fin- ished entries and integration into UniProtKB/Swiss- Record INSDC AC034192.5 is merged with Record Prot. INSDC AC087090.1 in Apr 2002. (We used recommended accession.version format to describe record. 
Since the UniProtKB/Swiss-Prot curation is sophisticated and sen- paper covers three data sources, we also added data source sitive, and involves substantial expert effort, so the data name.) In contrast, the different versions of Record INSDC quality can be assumed to be high. UniProtKB/TrEMBL AC034192 (version 2 in April 2000 and version 3 in May complements UniProtKB/Swiss-Prot using purely auto- 2000) are just normal updates on the same record. matic curation. The automatic curation in UniProtKB/ Therefore we only collect the former. TrEMBL mainly comes from two sources: (1) the Unified Staff confirmed that this is the only resource for merged Rule (UniRule) system, which derives curator-tested rules records in INSDC. Currently there is no completely auto- from UniProtKB/Swiss-Prot manually annotated entries. matic way to collect such duplicates from the revision his- For instance, the derived rules have been used to determine tory. Elsewhere we have explained the procedure that we family membership of uncharacterized protein sequences developed to collect these duplicates, why we believe that (39); and (2) Statistical Automatic Annotation System many duplicates are still present in INSDC, and why the (SAAS), which generates automatic rules for functional an- collection is representative (8). notations. For instance, it applies C4.5 decision tree algo- rithm to UniProtKB/Swiss-Prot entries to generate Quality control in UniProt automatic functional annotation rules (38). The whole pro- UniProt Knowledgebase (UniProtKB) is a protein database cess is automatic and does not have expert review. that is a main focus of the UniProt Consortium. It has two Therefore, it avoids expert curation with the trade-off of sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. lower quality assurance. Overall both collections represent UniProtKB/Swiss-Prot is expert curated and reviewed, with the state of the art in biological data curation. 
software support, whereas UniProtKB/TrEMBL is curated automatically without review. Here, we list the steps of curation in UniProtKB/Swiss-Prot (http://www.uniprot.org/help/), as previously explained elsewhere (38):

1. Sequence curation: identify and merge records from the same genes and same organisms; identify and document sequence discrepancies such as natural variations and frameshifts; explore homologs to check existing annotations and propagate other information;
2. Sequence analysis: predict sequence features using sequence analysis programs, then experts check the results;

Recall that nucleotide records in INSDC are primary sources for other databases. From a biological perspective, protein coding nucleotide sequences are translated into protein sequences (40). Both UniProtKB/Swiss-Prot and UniProtKB/TrEMBL generate cross-references from the coding sequence records in INSDC to their translated protein records. This provides a mapping between INSDC and curated protein databases. We can use the mapping between INSDC and UniProtKB/Swiss-Prot and the mapping between INSDC and UniProtKB/TrEMBL, respectively, to construct two collections of nucleotide duplicate records. We detail the methods and underlying ideas below.

Methods

We now explain how we construct our benchmarks, which we call the merge-based, expert curation and automatic curation benchmarks; we then describe how we measure the duplicate pairs for all three benchmarks.

Benchmark construction

Our first benchmark is the merge-based collection, based on direct reports of merged records provided by record submitters, curators, and users to any of the INSDC databases. Creation of this benchmark involves searching the revision history of records in INSDC, tracking merged record IDs, and downloading the corresponding records. We have described the process in detail elsewhere, in work where we analysed the scale, classification and impacts of duplicates specifically in INSDC (8).

The other two benchmarks are the expert curation and automatic curation benchmarks. Construction of these benchmarks of duplicate nucleotide records is based on the mapping between INSDC and protein databases (UniProtKB/Swiss-Prot and UniProtKB/TrEMBL), and consists of two main steps. The first is to perform the mapping: downloading record IDs and using the existing mapping service; the second is to interpret the mapping results and find the cases where duplicates occur.

The first step has the following sub-steps. Our expert and automatic curation benchmarks are constructed using the same steps, except that one is based on mapping between INSDC and UniProtKB/Swiss-Prot and the other is based on mapping between INSDC and UniProtKB/TrEMBL.

1. Retrieve a list of coding record IDs for an organism in INSDC. We call these IIDs (I for INSDC). Databases under INSDC exchange data daily so the data is the same (though the representations may vary). Thus, records can be retrieved from any one of the databases in INSDC. This list is used in the interpretation step;
2. Download a list of record IDs for an organism in either UniProtKB/Swiss-Prot or UniProtKB/TrEMBL. We call these UIDs (U for UniProt). This list is used in mapping;
3. Use the mapping service provided in UniProt (41) to generate mappings: provide the UIDs from Step 2; choose the ‘UniProtKB AC/ID to EMBL/GenBank/DDBJ’ option; and click ‘Generate Mapping’. This will generate a list of mappings. Each mapping contains the record ID in UniProt and the cross-referenced ID(s) in INSDC. We will use the mappings and IIDs in the interpretation step.

We interpret the mapping based on biological knowledge and database policies, as confirmed by UniProt staff. Recall that protein coding nucleotide sequences are translated into protein sequences. In principle, one coding sequence record in INSDC can be mapped into one protein record in UniProt; it can also be mapped into more than one protein record in UniProt. More specifically, if one protein record in UniProt cross-references multiple coding sequence records in INSDC, those coding sequence records are duplicates. Some of those duplicates may have distinct sequences due to the presence of introns and other regulatory regions in the genomic sequences. We classify the mappings into six cases, as follows. Note that the merging-related cases below occur within the same species.

Case 1: A protein record maps to one nucleotide coding sequence record. No duplication is detected.
Case 2: A protein record maps to many nucleotide coding sequence records. This is an instance of duplication. Here UniProtKB/Swiss-Prot and UniProtKB/TrEMBL represent different duplicate types. In the former, splice forms, genetic variations and other sequences are merged, whereas in the latter merges are mainly of records with close to identical sequences (either from the same or different submitters). That is also why we construct two different benchmarks accordingly.
Case 3: Many protein records have the same mapped coding sequence records. There may be duplication, but we assume that the data is valid. For example, the cross-referenced coding sequence could be a complete genome that links to all corresponding coding sequences.
Case 4: Protein records do not map to nucleotide coding sequence records. No duplication is detected.
Case 5: The nucleotide coding sequences exist in IIDs but are not cross-referenced. Not all nucleotide records with a coding region will be integrated, and some might not be selected in the cross-reference process.
Case 6: The nucleotide coding sequence records are cross-referenced, but are not in IIDs. A possible explanation is that the cross-referenced nucleotide sequence was predicted to be a coding sequence by curators or automatic software, but was not annotated as a coding sequence by the original submitters in INSDC. In other words, UniProt corrects the original missing annotations in INSDC. Such cases can be identified with the NOT_ANNOTATED_CDS qualifier on the DR line when searching in EMBL.

In this study, we focus on Case 2, given that this is where duplicates are identified. We collected all the related nucleotide records and constructed the benchmarks accordingly.

Quantitative measures

After building the benchmarks as above, we quantitatively measured the similarities in nucleotide duplicate pairs in all three benchmarks to understand their characteristics. Typically, for each pair, we measured the similarity of description, literature and submitter, the local sequence identity and the alignment proportion. The methods are described briefly here (in the ‘Description similarity’, ‘Submitter similarity’ and ‘Local sequence identity and alignment proportion’ sections); more detail is available in our other work (8).

Local sequence identity and alignment proportion

We used NCBI BLAST (version 2.2.30) (43) to measure local sequence identity. We used the bl2seq application, which aligns sequences pairwise and reports the identity of every pair. NCBI BLAST staff advised on the recommended parameters for running BLAST pairwise alignment in general. We disabled the dusting parameter (which automatically filters low-complexity regions) and selected the smallest word size (4), aiming to achieve the highest possible accuracy.
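The pairwise BLAST settings described above can be illustrated with a minimal sketch. It is not the authors' script: it assumes the BLAST+ `blastn` program with a `-subject` file standing in for bl2seq, tabular output (`-outfmt 6`, whose default columns place percent identity third, alignment length fourth and expected value eleventh), and hypothetical file names. The command is only built, not executed.

```python
from typing import List, NamedTuple, Optional


class Hit(NamedTuple):
    identity: float  # percent identity of the local alignment
    length: int      # alignment length in bases
    evalue: float    # expected value reported by BLAST


def pairwise_blastn_command(query_fasta: str, subject_fasta: str) -> List[str]:
    """Build (but do not run) a pairwise blastn call.

    Assumption: DUST filtering off and word size 4, mirroring the
    settings named in the text; all other options are left at defaults.
    """
    return [
        "blastn",
        "-task", "blastn",
        "-query", query_fasta,
        "-subject", subject_fasta,
        "-dust", "no",
        "-word_size", "4",
        "-outfmt", "6",
    ]


def first_hit(tabular_output: str) -> Optional[Hit]:
    """Return the first alignment row of tabular output, or None if there are no hits."""
    for line in tabular_output.splitlines():
        fields = line.split("\t")
        if len(fields) >= 12:
            return Hit(float(fields[2]), int(fields[3]), float(fields[10]))
    return None


# Hypothetical file names, for illustration only.
command = pairwise_blastn_command("record_a.fasta", "record_b.fasta")
```

An empty result from `first_hit` corresponds to the ‘no hits’ outcome; the expected-value threshold itself is not stated in the text, so it is left to the caller.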
Given these settings, we can reasonably conclude that a pair has low sequence identity if the output reports ‘no hits’ or the expected value is over the threshold.

We also used another metric, which we called the alignment proportion, to estimate the likelihood of the global identity between a pair. This has two advantages. First, in some cases where a pair has very high local identity, their lengths are significantly different; use of alignment proportion can identify these cases. Second, running of global alignment is computationally intensive, whereas alignment proportion can directly estimate an upper bound on the possible global identity. It is computed using Formula (2), where L is the local alignment proportion, I is the locally aligned identical bases, D and R are sequences of the pair, and len(S) is the length of a sequence S.

$$L = \frac{\mathrm{len}(I)}{\max(\mathrm{len}(D), \mathrm{len}(R))} \tag{2}$$

Description similarity

A description is provided in each nucleotide record’s DEFINITION field. This is typically a one-line description of the record, manually entered by record submitters. We have applied the following approximate string matching process to measure the description similarity of two records, using the Python NLTK package (42):

1. Tokenising: split the whole description word by word;
2. Lowering case: for each token, change all its characters to lower case;
3. Removing stop words: remove the words that are commonly used but not content bearing, such as ‘so’, ‘too’, ‘very’ and certain special characters;
4. Lemmatising: convert a word to its base form. For example, ‘encoding’ will be converted to ‘encode’, or ‘cds’ (coding sequences) will be converted into ‘cd’;
5. Set representation: for each description, we represent it as a set of tokens after the above processing. We remove any repeated tokens.

We applied set comparison to measure the similarity using the Jaccard similarity defined by Equation (1). Given two sets, it reports the number of shared elements as a fraction of the total number of elements. This similarity metric can successfully find descriptions containing the same tokens but in different orders.

$$\frac{|\mathit{set}_1 \cap \mathit{set}_2|}{|\mathit{set}_1 \cup \mathit{set}_2|} \tag{1}$$

Submitter similarity

The REFERENCE field of a record in the primary nucleotide databases contains two kinds of reference. The first is the literature citation that first introduced the record and the second is the submitter details. Here, we measure the submitter details to find out whether two records are submitted by the same group.

We label a pair as ‘Same’ if it shares at least one submission author, and otherwise as ‘Different’. If a pair does not have such a field, we label it as ‘N/A’. The author name is formatted as ‘last name, first initial’.

We constructed three benchmarks containing duplicates covering records for 21 organisms, using the above mapping process. We also quantitatively measured their characteristics in selected organisms. These 21 organisms are commonly used in molecular research projects and the NCBI Taxonomy provides direct links (http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/).

Results and discussion

We present our results in two stages. The first introduces the statistics of the benchmarks constructed using the methods described above. The second provides the outcome of quantitative measurement of the duplicate pairs in different benchmarks.

We applied our methods to records for 21 popularly studied organisms, listed in the NCBI Taxonomy website (http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/). Tables 1, 2 and 3 show the summary statistics of the duplicates collected in the three benchmarks. Table 1 is reproduced from another of our papers (8). All the benchmarks are significantly larger than previous collections of verified duplicates. The submitter-based benchmark has over 100 000 duplicate pairs.
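The pairwise measures described above (description similarity by Equation (1), the submitter label, and the alignment proportion of Formula (2)) can be made concrete with a short sketch. It is a simplified illustration, not the authors' implementation: the stop-word list is a hypothetical stand-in for NLTK's resources, lemmatisation is omitted so that the block runs without external data, treating a pair as ‘N/A’ when either record lacks submitter details is an assumption, and the example descriptions and author names are invented.

```python
import re
from typing import Iterable, Set

# Hypothetical stop list; the paper relies on NLTK instead.
STOP_WORDS = {"so", "too", "very", "the", "of", "and", "a", "for"}


def description_tokens(description: str) -> Set[str]:
    """Tokenise, lower-case and drop stop words (lemmatisation omitted here)."""
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    return {token for token in tokens if token not in STOP_WORDS}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Equation (1): shared tokens as a fraction of all distinct tokens."""
    union = a | b
    # Assumption: two empty descriptions are given similarity 0.
    return len(a & b) / len(union) if union else 0.0


def normalise(names: Iterable[str]) -> Set[str]:
    """Compare author names case-insensitively and without padding."""
    return {name.strip().lower() for name in names}


def submitter_label(authors_a: Iterable[str], authors_b: Iterable[str]) -> str:
    """'Same' if any submission author is shared, otherwise 'Different'."""
    set_a, set_b = normalise(authors_a), normalise(authors_b)
    if not set_a or not set_b:
        return "N/A"  # assumption: missing details on either record
    return "Same" if set_a & set_b else "Different"


def alignment_proportion(identical_bases: int, len_d: int, len_r: int) -> float:
    """Formula (2): locally aligned identical bases over the longer sequence."""
    return identical_bases / max(len_d, len_r)


first = description_tokens("Xenopus laevis actin mRNA, complete cds")
second = description_tokens("Actin mRNA of Xenopus laevis, partial cds")
similarity = jaccard(first, second)  # 5 shared tokens out of 7 distinct ones
```

In this invented example the two descriptions share five of seven distinct tokens despite the different word order, which is the behaviour the set representation is meant to capture. A pair with 450 identical aligned bases and lengths 500 and 900 has an alignment proportion of 0.5, flagging the length mismatch even if local identity is high.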
The other two benchmarks contain even more duplicate pairs: the expert curation benchmark has around 2.46 million pairs and the automatic curation benchmark has around 0.47 billion pairs; hence, these two are also appropriate for evaluation of efficiency-focused methods.

Table 1. Submitter-based benchmark

| Organism | Total records | Available merged groups | Duplicate pairs |
| --- | --- | --- | --- |
| Arabidopsis thaliana | 337 640 | 47 | 50 |
| Bos taurus | 245 188 | 12 822 | 20 945 |
| Caenorhabditis elegans | 74 404 | 1881 | 1904 |
| Chlamydomonas reinhardtii | 24 891 | 10 | 17 |
| Danio rerio | 153 360 | 7895 | 9227 |
| Dictyostelium discoideum | 7943 | 25 | 26 |
| Drosophila melanogaster | 211 143 | 431 | 3039 |
| Escherichia coli | 512 541 | 201 | 231 |
| Hepatitis C virus | 130 456 | 32 | 48 |
| Homo sapiens | 12 506 281 | 16 545 | 30 336 |
| Mus musculus | 1 730 943 | 13 222 | 23 733 |
| Mycoplasma pneumoniae | 1009 | 2 | 3 |
| Oryza sativa | 108 395 | 6 | 6 |
| Plasmodium falciparum | 43 375 | 18 | 26 |
| Pneumocystis carinii | 528 | 1 | 1 |
| Rattus norvegicus | 318 577 | 12 411 | 19 295 |
| Saccharomyces cerevisiae | 68 236 | 165 | 191 |
| Schizosaccharomyces pombe | 4086 | 39 | 545 |
| Takifugu rubripes | 51 654 | 64 | 72 |
| Xenopus laevis | 35 544 | 1620 | 1660 |
| Zea mays | 613 768 | 454 | 471 |

Total records: total number of records that directly belong to the organism; Available merged groups: number of groups that are tracked in record revision histories. One group may contain multiple records. Duplicate pairs: total number of duplicate pairs. This table also appears in the paper (8).

Table 2. Expert curation benchmark

| Organism | Cross-referenced coding records | Cross-referenced coding records that are duplicates | Duplicate pairs |
| --- | --- | --- | --- |
| Arabidopsis thaliana | 34 709 | 34 683 | 162 983 |
| Bos taurus | 9605 | 5646 | 28 443 |
| Caenorhabditis elegans | 3225 | 2597 | 4493 |
| Chlamydomonas reinhardtii | 369 | 255 | 421 |
| Danio rerio | 5244 | 3858 | 4942 |
| Dictyostelium discoideum | 1242 | 1188 | 1757 |
| Drosophila melanogaster | 13 385 | 13 375 | 573 858 |
| Escherichia coli | 611 | 420 | 1042 |
| Homo sapiens | 132 500 | 131 967 | 1 392 490 |
| Mus musculus | 74 132 | 72 840 | 252 213 |
| Oryza sativa | 40 0 |
| Plasmodium falciparum | 97 | 68 | 464 |
| Pneumocystis carinii | 33 | 19 | 11 |
| Rattus norvegicus | 15 595 | 11 686 | 24 000 |
| Saccharomyces cerevisiae | 84 | 67 | 297 |
| Schizosaccharomyces pombe | 33 2 |
| Takifugu rubripes | 153 | 64 | 59 |
| Xenopus laevis | 4701 | 2259 | 2279 |
| Zea mays | 1218 | 823 | 16 137 |

Cross-referenced coding records: total number of records in INSDC that are cross-referenced; Cross-referenced coding records that are duplicates: number of records that are duplicates based on interpretation of the mapping (Case 2); Duplicate pairs: total number of duplicate pairs.

Table 3. Automatic curation benchmark

| Organism | Cross-referenced coding records | Cross-referenced coding records that are duplicates | Duplicate pairs |
| --- | --- | --- | --- |
| Arabidopsis thaliana | 42 697 | 31 580 | 229 725 |
| Bos taurus | 35 427 | 25 050 | 440 612 |
| Caenorhabditis elegans | 2203 | 1541 | 20 513 |
| Chlamydomonas reinhardtii | 1728 | 825 | 1342 |
| Danio rerio | 43 703 | 29 236 | 74 170 |
| Dictyostelium discoideum | 935 | 289 | 2475 |
| Drosophila melanogaster | 49 599 | 32 305 | 527 246 |
| Escherichia coli | 56 459 | 49 171 | 3 671 319 |
| Hepatitis C virus | 105 613 171 639 |
| Homo sapiens | 141 373 | 79 711 | 467 101 272 |
| Mus musculus | 58 292 | 32 102 | 95 728 |
| Mycoplasma pneumoniae | 65 | 20 | 13 |
| Oryza sativa | 3195 | 1883 | 32 727 |
| Plasmodium falciparum | 32 561 | 15 114 | 997 038 |
| Pneumocystis carinii | 314 | 38 | 23 |
| Rattus norvegicus | 39 199 | 30 936 | 115 910 |
| Saccharomyces cerevisiae | 4763 | 3784 | 107 928 |
| Schizosaccharomyces pombe | 80 | 6 | 3 |
| Takifugu rubripes | 1341 | 288 | 1650 |
| Xenopus laevis | 15 320 | 3615 | 26 443 |
| Zea mays | 55 097 | 25 139 | 108 296 |

The headings are the same as in Table 2.

We measured duplicates for Bos taurus, Rattus norvegicus, Saccharomyces cerevisiae, Xenopus laevis and Zea mays quantitatively as stated above. Figures 2–9 show representative results, for Xenopus laevis and Zea mays. These figures demonstrate that duplicates in different benchmarks have dramatically different characteristics, and that duplicates from different organisms in the same benchmarks also have variable characteristics. We elaborate further as follows.

Construction of benchmarks from three different perspectives has yielded different numbers of duplicates with distinct characteristics in each benchmark. These benchmarks have their own advantages and limitations. We analyse and present them here.

• The merge-based benchmark is broad. Essentially all types of records in INSDC are represented, including clones, introns, and binding regions; that is, all types in addition to the coding sequences that are cross-referenced in protein databases. Elsewhere we have detailed different reasons for merging INSDC records; for instance, many records from Bos taurus and Rattus norvegicus in the merge-based benchmark are owned by RefSeq (searchable via INSDC), and RefSeq merges records using a mix of manual and automatic curation (8). However, only limited duplicates have been identified using this method. Our results clearly show that it contains far fewer duplicates than the other two, even though the original total number of records is much larger.
• The expert curation benchmark is shown to contain a much more diverse set of duplicate types. For instance, Figure 4 clearly illustrates that the expert curation benchmark identifies much more diverse kinds of duplicate in Xenopus laevis than the other two benchmarks. It not only identifies 25.0% of duplicates with close to the same sequences, but also finds duplicates with very different lengths and even duplicates with relatively low sequence identity. In contrast, the other two mainly identify duplicates having almost the same sequence: 83.9% for the automatic curation benchmark and 96.8% for the merge-based benchmark. However, the volume of duplicates is smaller than for automatic curation. The use of the protein database means that only coding sequences will be found.
• The automatic curation benchmark holds the highest number of duplicates amongst the three. However, even though it represents the state-of-the-art in automatic curation, it mainly uses rule-based curation and does not have expert review, so is still not as diverse or exhaustive as expert curation. For example, in Figure 2, over 70% of the identified duplicates have high description similarity, whereas the expert curation benchmark contains duplicates with description similarity in different distributions. As with the expert curation benchmark, it only contains coding sequences by construction.

Figure 2. Description similarities of duplicates from Xenopus laevis in three benchmarks: Auto for auto curation based; Expert for expert curation; and Merge for merge-based collection. X-axis defines the similarity range. For instance, [0.5, 0.6) means greater than or equal to 0.5 and <0.6. Y-axis defines the proportion for each similarity range.

Figure 3.
Submitter similarities of duplicates from Xenopus laevis in three benchmarks. Different: the submitters of the records are completely different; Same: the pair shares at least one submitter; Not specified: no submitter details are specified in the REFERENCE field of the records. The rest is the same as above.

Figure 4. Alignment proportion of duplicates from Xenopus laevis. LOW refers to pairs for which BLAST reports an expected value greater than the threshold or NO HITS. Recall that we chose the parameters to produce reliable BLAST output.

Figure 5. Local sequence identity of duplicates from Xenopus laevis in three benchmarks. The rest is the same as above.

Figure 6. Description similarity of duplicates from Zea mays in three benchmarks.

The analysis shows that these three benchmarks complement each other. Merging records in INSDC provides preliminary quality checking across all kinds of records in INSDC. Curation (automatic and expert) provides more reliable and detailed checking specifically for coding sequences. Expert curation contains more kinds of duplicates and automatic curation has a larger volume of identified duplicates.

Recall that previous studies used a limited number of records with a limited number of organisms and kinds of duplication. Given the richness evidenced in our benchmarks, and the distinctions between them, it is unreliable to evaluate against only one benchmark, or multiple benchmarks constructed from the same perspective. As shown above, the expert curation benchmark contains considerable numbers of duplicates that have distinct alignment proportions or relatively low sequence similarity. The efficiency-focused duplicate detection methods discussed earlier thus would fail to find many of the duplicates in our expert curation benchmark.

Also, duplicates in one benchmark yet in different organisms have distinct characteristics. For instance, as shown in the figures for Xenopus laevis and Zea mays, duplicates in Zea mays generally have higher description similarity (comparing Figure 2 with Figure 6), are more often submitted by the same submitters (comparing Figure 3 with Figure 7), and have more similar sequence lengths (comparing Figure 4 with Figure 8) and higher sequence identity (comparing Figure 5 with Figure 9). However, duplicates in Xenopus laevis have different characteristics. For instance, in Zea mays the expert curation benchmark contains 40.0 and 57.7% of duplicates submitted by different and same submitters, respectively. Yet the same benchmark shows many more duplicates in Xenopus laevis from different submitters (47.4%), which is nearly double the amount for the same submitters (26.4%). Due to these differences, methods that demonstrate good performance on one organism may not display comparable performance on others.

Figure 7. Submitter similarity of duplicates from Zea mays in three benchmarks.

Figure 8. Alignment proportion of duplicates from Zea mays in three benchmarks.

Additionally, the two curation-based benchmarks indicate that there are potentially many undiscovered duplicates in the primary nucleotide databases. Using Arabidopsis thaliana as an example, only 47 groups of duplicates were merged out of 337 640 records in total. The impression from this would be that the overall prevalence of duplicates in INSDC is quite low. However, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL only cross-referenced 34 709 and 42 697 Arabidopsis thaliana records, respectively, yet tracing their mappings results in finding that 34 683 (99.93%) records in Table 2 and 31 580 (73.96%) records in Table 3 have at least one corresponding duplicate record, even though they only examine coding sequences. It may be possible to construct another benchmark through the mapping between INSDC and RefSeq, using the approach described in this paper.

Another observation is that UniProtKB/Swiss-Prot, with expert curation, contains a more diverse set of duplicates than the other benchmarks. From the results, it can be observed that expert curation can find occurrences of duplicates that have low description similarity, are submitted by completely different groups, have varied lengths, or are of comparatively low local sequence identity. This illustrates that it is not sufficient to focus on duplicates that have highly similar sequences of highly similar lengths. A case study has already found that expert curation rectifies errors in original studies (39). Our study on duplicates illustrates this from another angle.

Figure 9. Local sequence identity of duplicates from Zea mays in three benchmarks.

These results also highlight the complexity of duplicates that are present in bioinformatics databases. The overlap among our benchmarks is remarkably minimal. The submitter benchmark includes records that do not correspond to coding sequences, so they are not considered by the protein databases. UniProtKB/Swiss-Prot and UniProtKB/TrEMBL use different curation processes as mentioned above. This shows that, from the perspective of one resource, a pair may be considered as a duplicate, but on the basis of another resource it may not be.

More fundamentally, records that are considered as duplicates for one task may not be duplicates for another. Thus, it is not possible to use a simple and universal definition to conceptualize duplicates. Given that the results show that kinds and prevalence of duplicates vary amongst organisms and benchmarks, studies are needed to answer fundamental questions: what kinds of duplicates are there? What are their corresponding impacts for biological studies that draw from the sequence databases? Can existing duplicate detection methods successfully find the type of duplicates that has impacts for specific kinds of biomedical investigations? These questions are currently unanswered. The benchmarks here enable such discovery (46). We explored the prevalence, categories and impacts of duplicates in the submitter-based benchmark to understand the duplication directly in INSDC.

To summarise, we review the benefits of having created these benchmarks.

First, the records in the benchmarks can be used for two main purposes: (1) as duplicates to merge; (2) as records to label or cross-reference to support record linkage. We now examine the two cases:

Case 1: record INSDC AL592206.2 (https://www.ncbi.nlm.nih.gov/nuccore/AL592206.2) and INSDC AC069109.2 (https://www.ncbi.nlm.nih.gov/nuccore/AC069109.2?report=genbank). This is an example that we noted earlier from the submitter collection. Record gi:8616100 was submitted by the Whitehead Institute/MIT Center for Genome Research. It concerns the RP11-301H18 clone in Homo sapiens chromosome 9. It has 18 unordered pieces as the submitters documented. The later record gi:15029538 was submitted by the Sanger Centre. That record also concerns the RP11-301H18 clone but it only has three unordered pieces. Therefore, this case shows an example of duplication where different submitters submit records about the same entities. Note that they are inconsistent, in that both the annotation data and sequence are quite different. Therefore, a merge was done (by either database staff or submitter). Record INSDC AC069109.2 was replaced by INSDC AL592206.2, as INSDC AL592206.2 has fewer unordered pieces, that is, is closer to being complete. Then record AC069109.2 became obsolete. Only record INSDC AL592206.2 can be updated. This record has had a complete sequence (no unordered pieces) since around 2012, after 18 updates from the version since the merge.

Case 2: record INSDC AC055725.22 (https://www.ncbi.nlm.nih.gov/nuccore/AC055725.22), INSDC BC022542.1 (https://www.ncbi.nlm.nih.gov/nuccore/BC022542.1) and INSDC AK000529.1 (https://www.ncbi.nlm.nih.gov/nuccore/AK000529.1). These records are from the expert curation collection. At the protein level, they correspond to the same protein record Q8TBF5, about a Phosphatidylinositol-glycan biosynthesis class X protein. Those three records have been explicitly cross-referenced into the same protein entry during expert curation. The translations of record INSDC BC022542.1 and INSDC AK000529.1 are almost the same. Further, the expert-reviewed protein record UniProtKB/Swiss-Prot Q8TBF5 is documented as follows (http://www.uniprot.org/uniprot/Q8TBF5):

AC055725 [INSDC AC055725.22] Genomic DNA. No translation available;
BC022542 [INSDC BC022542.1] mRNA. Translation: AAH22542.1. Sequence problems;
AK000529 [INSDC AK000529.1] mRNA. Translation: BAA91233.1. Sequence problems.

Those annotations were made via curation to mark problematic sequences submitted to INSDC. The ‘no translation available’ annotation indicates that the original submitted INSDC records did not specify the coding sequence (CDS) regions, but the UniProt curators have identified the CDS. ‘Sequence problems’ refers to ‘discrepancies due to an erroneous gene model prediction, erroneous ORF assignment, miscellaneous discrepancy, etc.’ (http://www.uniprot.org/help/cross_references_section) resolved by the curator. Therefore, without expert curation, it is indeed difficult to access the correct information and difficult to know that they refer to the same protein. As mentioned earlier, an important impact of duplicate detection is record linkage. Cross-referencing across multiple databases is certainly useful, regardless of whether the linked records are regarded as duplicates.

Second, considering the three benchmarks as a whole, they cover diverse duplicate types. The detailed types are summarized elsewhere (8), but broadly three types are evident: (1) similar records, if not identical; (2) fragments; (3) somewhat different records belonging to the same entities. Existing studies have already shown all of them have specific impacts on biomedical tasks. Type (1) may affect database searches (44); type (2) may affect meta-analyses (45); while type (3) may confuse novice database users.

Third, those benchmarks are constructed based on different principles. The large volume of the dataset, and diversity in type of duplicate, can provide a basis for evaluation of both efficiency and accuracy. Benchmarks are always a problem for duplicate detection methods: a method can detect duplicates in one dataset successfully, but may get poor performance on another. This is because the methods have different definitions of duplicate, or those datasets have different types or distributions. This is why the duplicate detection survey identified the creation of benchmarks as a pressing task (34). Multiple benchmarks enable testing of the robustness and generalization of the proposed methods. We used six organisms from the expert curated benchmark as the dataset and developed a supervised learning duplicate detection method (46). We tested the generality of the trained model as an example: whether a model trained from duplicate records in one organism maintains the performance in another organism. This is effectively showing how users can use the benchmarks as test cases, perhaps organized by organisms or by type.

Conclusion

In this study, we established three large-scale validated benchmarks of duplicates in bioinformatics databases, specifically focusing on identifying duplicates from primary nucleotide databases (INSDC). The benchmarks are available for use at https://bitbucket.org/biodbqual/benchmarks. These benchmark data sets can be used to support development and evaluation of duplicate detection methods. The three benchmarks contain the largest number of duplicates validated by submitters, database staff, expert curation or automatic curation presented to date, with nearly half a billion record pairs in the largest of our collections.

We explained how we constructed the benchmarks and their underlying principles. We also measured the characteristics of duplicates collected in these benchmarks quantitatively, and found substantial variation among them. This demonstrates that it is unreliable to evaluate methods with only one benchmark. We find that expert curation in UniProtKB/Swiss-Prot can identify much more diverse kinds of duplicates and emphasize that we appreciate the effort of expert curation due to its finer-grained assessment of duplication.

In future work, we plan to explore the possibility of mapping other curated databases to INSDC to construct more duplicate collections. We will assess these duplicates in more depth to establish a detailed taxonomy of duplicates and collaborate with biologists to measure the possible impacts of different types of duplicates in practical biomedical applications. However, this work already provides new insights into the characteristics of duplicates in INSDC, and has created a resource that can be used for the development of duplicate detection methods. With, in all likelihood, vast numbers of undiscovered duplicates, such methods will be essential to maintenance of these critical databases.

Funding

Qingyu Chen’s work is supported by an International Research Scholarship from The University of Melbourne. The project receives funding from the Australian Research Council through a Discovery Project grant, DP150101550.

Conflict of interest. None declared.

Acknowledgements

We greatly appreciate the assistance of Elisabeth Gasteiger from UniProtKB/Swiss-Prot, who advised on and confirmed the mapping process in this work with domain expertise. We also thank Nicole Silvester and Clara Amid from the EMBL European Nucleotide Archive, who advised on the procedures regarding merged records in INSDC. Finally we are grateful to Wayne Mattern from NCBI, who advised how to use BLAST properly by setting reliable parameter values.

References

1. Benson,D.A., Clark,K., Karsch-Mizrachi,I. et al. (2015) GenBank. Nucleic Acids Res., 43, D30.
2. Bork,P. and Bairoch,A. (1996) Go hunting in sequence databases but watch out for the traps. Trends Genet., 12, 425–427.
3. Altschul,S.F., Boguski,M.S., Gish,W. et al. (1994) Issues in searching molecular sequence databases. Nat. Genet., 6, 119–129.
4. Brenner,S.E. (1999) Errors in genome annotation. Trends Genet., 15, 132–133.
5. Fan,W. (2012) Web-Age Information Management. Springer, Berlin, pp. 1–16.
6. UniProt Consortium. (2014) Activities at the universal protein resource (UniProt). Nucleic Acids Res., 42, D191–D198.
7. Nakamura,Y., Cochrane,G., and Karsch-Mizrachi,I. (2013) The international nucleotide sequence database collaboration. Nucleic Acids Res., 41, D21–D24.
8. Chen,Q., Zobel,J., and Verspoor,K. (2016) Duplicates, redundancies, and inconsistencies in the primary nucleotide databases: a descriptive study. Database, doi: http://dx.doi.org/10.1101/085019.
9. Lin,Y.S., Liao,T.Y., and Lee,S.J. (2013) Detecting near-duplicate documents using sentence-level features and supervised learning. Expert Syst. Appl., 40, 1467–1476.
10. Liu,X. and Xu,L. (2013) Proceedings of the International Conference on Information Engineering and Applications (IEA) 2012. Springer, Heidelberg, pp. 325–332.
11. Fu,L., Niu,B., Zhu,Z. et al. (2012) CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28, 3150–3152.
12. Jupe,S., Jassal,B., Williams,M., and Wu,G. (2014) A controlled vocabulary for pathway entities and events. Database, 2014, bau060.
13. Wilming,L.G., Boychenko,V., and Harrow,J.L. (2015) Comprehensive comparative homeobox gene annotation in human and mouse. Database, 2015, bav091.
14. Williams,G., Davis,P., Rogers,A. et al. (2011) Methods and strategies for gene structure curation in WormBase. Database, 2011, baq039.
15. Safran,M., Dalah,I., Alexander,J. et al. (2010) GeneCards Version 3: the human gene integrator. Database, 2010, baq020.
16. Christen,P. and Goiser,K. (2007) Quality Measures in Data Mining. Springer, Berlin, pp. 127–151.
17. Nanduri,R., Bhutani,I., Somavarapu,A.K. et al. (2015) ONRLDB - manually curated database of experimentally validated ligands for orphan nuclear receptors: insights into new drug discovery. Database, 2015, bav112.
18. UniProt Consortium. (2014) UniProt: a hub for protein information. Nucleic Acids Res., 43, D204–D212.
19. Joffe,E., Byrne,M.J., Reeder,P. et al. (2013) AMIA Annual Symposium Proceedings. American Medical Informatics Association, Washington, DC, Vol. 2013, pp. 721.
20. Li,W. and Godzik,A. (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22, 1658–1659.
21. Verykios,V.S., Moustakides,G.V., and Elfeky,M.G. (2003) A Bayesian decision model for cost optimal record matching. VLDB J., 12, 28–40.
22. McCoy,A.B., Wright,A., Kahn,M.G. et al. (2013) Matching identifiers in electronic health records: implications for duplicate records and patient safety. BMJ Qual. Saf., 22, 219–224.
23. Bagewadi,S., Adhikari,S., Dhrangadhariya,A. et al. (2015) NeuroTransDB: highly curated and structured transcriptomic metadata for neurodegenerative diseases. Database, 2015, bav099.
24. Finn,R.D., Coggill,P., Eberhardt,R.Y. et al. (2015) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res., 44, D279–D285.
25. Herzog,T.N., Scheuren,F.J., and Winkler,W.E. (2007) Data Quality and Record Linkage Techniques. Springer, Berlin.
26. Christen,P. (2012) A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng., 24, 1537–1555.
27. Joffe,E., Byrne,M.J., Reeder,P. et al. (2014) A benchmark comparison of deterministic and probabilistic methods for defining manual review datasets in duplicate records reconciliation. J. Am. Med. Informat. Assoc., 21, 97–104.
28. Holm,L. and Sander,C. (1998) Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics, 14, 423–429.
29. Zorita,E.V., Cusco,P., and Filion,G. (2015) Starcode: sequence clustering based on all-pairs search. Bioinformatics, 31, 1913–1919.
30. Koh,J.L., Lee,M.L., Khan,A.M., Tan,P.T., and Brusic,V. (2004) Duplicate detection in biological data using association rule mining. Locus, 501, S22388.
31. Cross,G.R. and Jain,A.K. (1983) Markov random field texture models. IEEE Trans. Pattern Anal. Mach. Intell., 5, 25–39.
32. Rudniy,A., Song,M., and Geller,J. (2010) Detecting duplicate biological entities using shortest path edit distance. Int. J. Data Mining Bioinformatics, 4, 395–410.
33. Rudniy,A., Song,M., and Geller,J. (2014) Mapping biological entities using the longest approximately common prefix method. BMC Bioinformatics, 15, 187.
34. Elmagarmid,A.K., Ipeirotis,P.G., and Verykios,V.S. (2007) Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng., 19, 1–16.
35. Martins,B. (2011) GeoSpatial Semantics. Springer, Berlin, pp. 34–51.
36. Bilenko,M. and Mooney,R.J. (2003) Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, New York, pp. 39–48.
37. Chen,Q., Zobel,J., and Verspoor,K. (2015) Evaluation of a Machine Learning Duplicate Detection Method for Bioinformatics Databases. ACM Ninth International Workshop on Data and Text Mining in Biomedical Informatics in conjunction with CIKM, Washington, DC. ACM Press, New York.
38. Magrane,M. and UniProt Consortium. (2011) UniProt Knowledgebase: a hub of integrated protein data. Database, 2011, bar009.
39. Poux,S., Magrane,M., Arighi,C.N. et al. (2014) Expert curation in UniProtKB: a case study on dealing with conflicting and erroneous data. Database, 2014, bau016.
40. Crick,F. (1970) Central dogma of molecular biology. Nature, 227, 561–563.
41. Huang,H., McGarvey,P.B., Suzek,B.E. et al. (2011) A comprehensive protein-centric ID mapping service for molecular data integration. Bioinformatics, 27, 1190–1191.
42. Bird,S., Klein,E., and Loper,E. (2009) Natural Language Processing with Python. O’Reilly Media, Inc., Sebastopol, CA.
43. Camacho,C., Coulouris,G., Avagyan,V. et al. (2009) BLAST+: architecture and applications. BMC Bioinformatics, 10, 421.
44. Suzek,B.E., Wang,Y., Huang,H. et al. (2014) UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics, 31, 926–932.
45. Rosikiewicz,M., Comte,A., Niknejad,A. et al. (2013) Uncovering hidden duplicated content in public transcriptomics data. Database, 2013, bat010.
46. Chen,Q., Zobel,J., Zhang,X., and Verspoor,K. (2016) Supervised learning for detection of duplicates in genomic sequence databases. PLoS One, 11, e0159644.

Database, Vol. 2017, Article ID baw164, Oxford University Press

Benchmarks for measurement of duplicate detection methods in nucleotide databases

Publisher
Oxford University Press
Copyright
© The Author(s) 2017. Published by Oxford University Press.
eISSN
1758-0463
D.O.I.
10.1093/database/baw164

Abstract

Duplication of information in databases is a major data quality challenge. The presence of duplicates, implying either redundancy or inconsistency, can have a range of impacts on the quality of analyses that use the data. To provide a sound basis for research on this issue in databases of nucleotide sequences, we have developed new, large-scale validated collections of duplicates, which can be used to test the effectiveness of duplicate detection methods. Previous collections were either designed primarily to test efficiency, or contained only a limited number of duplicates of limited kinds. To date, duplicate detection methods have been evaluated on separate, inconsistent benchmarks, leading to results that cannot be compared and, due to limitations of the benchmarks, of questionable generality. In this study, we present three nucleotide sequence database benchmarks, based on information drawn from a range of resources, including information derived from mapping to two data sections within the UniProt Knowledgebase (UniProtKB), UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. Each benchmark has distinct characteristics. We quantify these characteristics and argue for their complementary value in evaluation. The benchmarks collectively contain a vast number of validated biological duplicates; the largest has nearly half a billion duplicate pairs (although this is probably only a tiny fraction of the total that is present). They are also the first benchmarks targeting the primary nucleotide databases. The records include the 21 most heavily studied organisms in molecular biology research. Our quantitative analysis shows that duplicates in the different benchmarks, and in different organisms, have different characteristics. It is thus unreliable to evaluate duplicate detection methods against any single benchmark.
For example, the benchmark derived from UniProtKB/Swiss-Prot mappings identifies more diverse types of duplicates, showing the importance of expert curation, but is limited to coding sequences. Overall, these benchmarks form a resource that we believe will be of great value for development and evaluation of the duplicate detection or record linkage methods that are required to help maintain these essential resources.

Database URL: https://bitbucket.org/biodbqual/benchmarks

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction

Sequencing technologies are producing massive volumes of data. GenBank, one of the primary nucleotide databases, increased in size by over 40% in 2014 alone (1). However, researchers have been concerned about the underlying data quality in biological sequence databases since the 1990s (2). A particular problem of concern is duplicates, when a database contains multiple instances representing the same entity. Duplicates introduce redundancies, such as repetitive results in database search (3), and may even represent inconsistencies, such as contradictory functional annotations on multiple records that concern the same entity (4). Recent studies have noted duplicates as one of five central data quality problems (5), and it has been observed that detection and removal of duplicates is a key early step in bioinformatics database curation (6).

Existing work has addressed duplicate detection in biological sequence databases in different ways. This work falls into two broad categories: efficiency-focused methods that are based on assumptions such as that duplicates have identical or near-identical sequences, where the aim is to detect similar sequences in a scalable manner; and quality-focused methods that examine record fields other than the sequence, where the aim is accurate duplicate detection. However, the value of these existing approaches is unclear, due to the lack of broad-based, validated benchmarks; as some of this previous work illustrates, there is a tendency for investigators of new methods to use custom-built collections that emphasize the kind of characteristic their method is designed to detect.

Thus, different methods have been evaluated using separate, inconsistent benchmarks (or test collections). The efficiency-focused methods used large benchmarks. However, the records in these benchmarks are not necessarily duplicates, due to use of mechanical assumptions about what a duplicate is. The quality-focused methods have used collections of expert-labelled duplicates. However, as a result of the manual effort involved, these collections are small and contain only limited kinds of duplicates from limited data sources. To date, no published benchmarks have included duplicates that are explicitly marked as such in the primary nucleotide databases, GenBank, the EMBL European Nucleotide Archive, and the DNA DataBank of Japan. (We refer to these collectively as INSDC: the International Nucleotide Sequence Database Collaboration (7).)

In this study, we address these issues by accomplishing the following:

We introduce three benchmarks containing INSDC duplicates that were collected based on three different principles: records merged directly in INSDC (111 826 pairs); INSDC records labelled as references during UniProtKB/Swiss-Prot expert curation (2 465 891 pairs); and INSDC records labelled as references in UniProtKB/TrEMBL automatic curation (473 555 072 pairs);

We quantitatively measure similarities between duplicates, showing that our benchmarks have duplicates with dramatically different characteristics, and are complementary to each other. Given these differences, we argue that it is insufficient to evaluate against only one benchmark; and

We demonstrate the value of expert curation, in its identification of a much more diverse set of duplicate types.

It may seem that, with so many duplicates in our benchmarks, there is little need for new duplicate detection methods. However, the limitations of the mechanisms that led to discovery of these duplicates, and the fact that the prevalences are so very different between different species and resources, strongly suggest that these are a tiny fraction of the total that is likely to be present. While a half billion duplicates may seem like a vast number, they only involve 710 254 records, while the databases contain 189 264 014 records (http://www.ddbj.nig.ac.jp/breakdown_stats/dbgrowth-e.html#ddbjvalue) altogether to date. Also, as suggested by the effort expended in expert curation, there is a great need for effective duplicate detection methods.

Background

In the context of general databases, the problems of quality control and duplicate detection have a long history of research. However, this work has only limited relevance for bioinformatics databases, because, for example, it has tended to focus on tasks such as ensuring that each real-world entity is only represented once, and the attributes of entities (such as ‘home address’) are externally verifiable. In this section we review prior work on duplicate detection in bioinformatics databases. We show that researchers have approached duplicate detection with different assumptions. We then review the more general duplicate detection literature, showing that the issue of a lack of rigorous benchmarks is a key problem for duplicate detection in general domains and is what motivates our work. Finally, we describe the data quality control in INSDC, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL, as the sources for construction of the duplicate benchmark sets that we introduce.

Kinds of duplicate

Different communities, and even different individuals, may have inconsistent understandings of what a duplicate is. Such differences may in turn lead to different strategies for de-duplication.

A generic definition of a duplicate is that it occurs when there are multiple instances that point to the same entity. Yet this definition is inadequate; it requires a definition that allows identification of which things are ‘the same entity’. We have explored definitions of duplicates in other work (8). We regard two records as duplicates if, in the context of a particular task, the presence of one means that the other is not required. Here we explain that duplication has at least four characteristics, as follows.

First, duplication is not simply redundancy. The latter can be defined using a simple threshold. For example, if two instances have over 90% similarity, they can arguably be defined as redundant. Duplicate detection often regards such examples as ‘near duplicates’ (9) or ‘approximate duplicates’ (10). In bioinformatics, ‘redundancy’ is commonly used to describe records with sequence similarity over a certain threshold, such as 90% for CD-HIT (11). Nevertheless, instances with high similarity are not necessarily duplicates, and vice versa. For example, curators working with human pathway databases have found records labelled with the same reaction name that are not duplicates, while legitimate duplicates may exist under a variety of different names (12). Likewise, as we present later, nucleotide sequence records with high sequence similarity may not be duplicates, whereas records whose sequences are relatively different may be true duplicates.

Second, duplication is context dependent. From one perspective, two records might be considered duplicates while from another they are distinct; one community may consider them duplicates whereas another may not. For instance, amongst gene annotation databases, broader duplicate types are considered in Wilming et al. (13) than in Williams et al. (14), whereas, for genome characterization, ‘duplicate records’ means creation of a new record in the database using configurations of existing records (15). Different attributes have been emphasized in the different databases.

Third, duplication has various types with distinct characteristics. Multiple types of duplicates could be found even from the same perspective (8). By categorizing duplicates collected directly from INSDC, we have already found diverse types: similar or identical sequences; similar or identical fragments; duplicates with relatively different sequences; working drafts; sequencing in progress records; and predicted records. The prevalence of each type varies considerably between organisms. Studies of duplicate detection in general have noted that performance on a single dataset may be biased if we do not consider the independence and underlying stratifications (16). Thus, as well as creating benchmarks from different perspectives, we collect duplicates from multiple organisms from the same perspectives.

We do not regard these discrepancies as shortcomings or errors. Rather, we stress the diversity of duplication. The understanding of ‘duplicates’ may be different between database staff, computer scientists, biological curators and so on, and benchmarks need to reflect this diversity. In this work, we assemble duplicates from three different perspectives: expert curation (how data curators understand duplicates); automatic curation (how automatic software without expert review identifies duplicates); and merge-based quality checking (how records are merged in INSDC). These different perspectives reflect the diversity: a pair considered as duplicates from one perspective may not be so in another. For instance, nucleotide coding records might not be duplicates strictly at the DNA level, but they might be considered to be duplicates if they concern the same proteins. Use of different benchmarks derived from different assumptions tests the generality of duplicate detection methods: a method may have strong performance on one benchmark but very poor performance on another; only by verifying a method against different benchmarks can we have reasonable confidence that it is robust.

Currently, understanding of duplicates via expert curation is the best approach. Here ‘expert curation’ means that curation either is purely manually performed, as in ONRLDB (17); or not entirely manual but involving expert review, as in UniProtKB/Swiss-Prot (18). Experts use experience and intuition to determine whether a pair is duplicate, and will often check additional resources to ensure the correctness of a decision (16). Studies on clinical (19) and biological databases (17) have demonstrated that expert curation can find a greater variety of duplicates, and ultimately improves the data quality. Therefore, in this work we derive one benchmark from UniProtKB/Swiss-Prot expert curation.

Impact of duplicates

There are many types of duplicate, and each type has different impacts on use of the databases.
Approximate or near duplicates introduce redundancies, whereas other types may lead to inconsistencies.

Approximate or near duplicates in biological databases are not a new problem. We found related literature in 1994 (3), 2006 (20) and as recently as 2015 (http://www.uniprot.org/help/proteome_redundancy). A recent significant issue was proteome redundancy in UniProtKB/TrEMBL (2015). UniProt staff observed that many records were over-represented, such as 5.97 million entries for just 1692 strains of Mycobacterium tuberculosis. This redundancy impacts sequence similarity searches, proteomics identification and motif searches. In total, 46.9 million entries were removed.

Additionally, recall that duplicates are not just redundancies. Use of a simple similarity threshold will result in many false positives (distinct records with high similarity) and false negatives (duplicates with low similarity). Studies show that both cases matter: in clinical databases, merging of records from distinct patients by mistake may lead to withholding of a treatment if one patient is allergic but the other is not (21); failure to merge duplicate records for the same patient could lead to a fatal drug administration error (22). Likewise, in biological databases, merging of records with distinct functional annotations might result in incorrect function identification; failing to merge duplicate records with different functional annotations might lead to incorrect function prediction. One study retrieved corresponding records from two biological databases, Gene Expression Omnibus and ArrayExpress, but surprisingly found the number of records to be significantly different: the former had 72 whereas the latter had only 36. Some of the records were identical, but in some cases records were in one but not the other (23). Indeed, duplication commonly interacts with inconsistency (5).

Further, we cannot ignore the propagated impacts of duplicates. The above duplication issue in UniProtKB/TrEMBL not only impacts UniProtKB/TrEMBL itself, but also significantly impacts databases or studies using UniProtKB/TrEMBL data. For instance, release of Pfam, a curated protein family database, was delayed for close to 2 years; the duplication issue in UniProtKB/TrEMBL was the major reason (24). Even removal of duplicates in UniProtKB/TrEMBL caused problems: ‘the removal of bacterial duplication in UniProtKB (and normal flux in protein) would have meant that nearly all (>90%) of Pfam seed alignments would have needed manual verification (and potential modification) ... This imposes a significant manual biocuration burden’ (24).

Finally, duplicate detection across multiple sources provides valuable record linkages (25–27). Combination of information from multiple sources could link literature databases, containing papers mentioning the record; gene databases; and protein databases.

Duplicate detection methods

Most duplicate detection methods use pairwise comparison, where each record is compared against others in pairs using a similarity metric. The similarity score is typically computed by comparing the specific fields in the two records. The two classes of methods that we previously introduced, efficiency-focused and quality-focused, detect duplicates in different ways; we now summarize those approaches.

Efficiency-focused methods

Efficiency-focused methods have two common features. One is that they typically rest on simple assumptions, such as that duplicates are records with identical or near-identical sequences. These are near or approximate duplicates as above. The other is an application of heuristics to filter out pairs to compare, in order to reduce the running time. Thus, a common pattern of such methods is to assume that duplicates have sequence similarity greater than a certain threshold. In one of the earliest methods, nrdb90, it is assumed that duplicates have sequence similarities over 90%, with k-mer matching used to rapidly estimate similarity (28). In CD-HIT, 90% similarity is assumed, with short-substring matching as the heuristic (11); in starcode, a more recent method, it is assumed that duplicates have sequences with a Levenshtein distance of no more than 3, and pairs of sequences with greater estimated distance are ignored (29).

Using these assumptions and associated heuristics, the methods are designed to speed up the running time, which is typically the main focus of evaluation (11,28). While some such methods consider accuracy, efficiency is still the major concern (29). The collections are often whole databases, such as the NCBI non-redundant database (listed at https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch) for nucleotide databases and Protein Data Bank (http://www.rcsb.org/pdb/home/home.do) for protein databases. These collections are certainly large, but are not validated, that is, records are not known to be duplicates via quality-control or curation processes. The methods based on simple assumptions can reduce redundancies, but recall that duplication is not limited to redundancy: records with similar sequences may not be duplicates and vice versa. For instance, records INSDC AL592206.2 and INSDC AC069109.2 have only 68% local identity (measured in Section 3.2 advised by NCBI BLAST staff), but they have overlapped clones and were merged as part of the finishing strategy of the human genome. Therefore, records measured solely based on a similarity threshold are not validated and do not provide a basis for measuring the accuracy of a duplicate detection method, that is, the false positive or false negative rate.

Quality-focused methods

In contrast to efficiency-focused methods, quality-focused methods tend to have two main differences: use of a greater number of fields; and evaluation on validated datasets. An early method of this kind compared the similarity of both metadata (such as description, literature and biological function annotations) and sequence, and then used association rule mining (30) to discover detection rules. More recent proposals focus on measuring metadata using approximate string matching: Markov random models (31), shortest-path edit distance (32) or longest approximately common prefix matching (33), the first two for general bioinformatics databases and the last specifically for biomedical databases. The first method used a 1300-record dataset of protein records labelled by domain experts, whereas the others used a 1900-record dataset of protein records labelled in UniProt Proteomes, of protein sets from fully sequenced genomes in UniProt.

The collections used in this work are validated, but have significant limitations. First, both of the collections have <2000 records, and only cover limited types of duplicates (46). We classified duplicates specifically on one of the benchmarks (merge-based), and the classification demonstrates that different organisms have dramatically distinct kinds of duplicate: in Caenorhabditis elegans, the majority duplicate type is identical sequences, whereas in Danio rerio the majority duplicate type is of similar fragments. From our case study of GC content and melting temperature, those different types introduce different impacts: duplicates under the exact sequence category only have 0.02% mean difference of GC content compared with normal pairs in Homo sapiens, whereas another type of duplicates that have relatively low sequence identity introduced a mean difference of 5.67%. A method could easily work well in a limited dataset of this kind but not be applicable for broader datasets with multiple types of duplicates. Second, they only cover a limited number of organisms; the first collection had two and the latter had five. Authors of prior studies, such as Rudniy et al. (33), acknowledged that differences of duplicates (different organisms have different kinds of duplicate; different duplicate types have different characteristics) are the main problem impacting the method performance.

In some respects, the use of small datasets to assess quality-based methods is understandable. It is difficult to find explicitly labelled duplicates. Typically, for nucleotide databases, sources of labelled duplicates are limited. In addition, these methods focus on the quality and so are unlikely to use strategies for pruning the search space, meaning that they are compute intensive. These methods also generally consider many more fields and many more pairs than the efficiency-focused methods. A dataset with 5000 records yields over 12 million pairs; even a small data set requires a large processing time under these conditions.

Hence, there is no large-scale validated benchmark, and no verified collections of duplicate nucleotide records in INSDC. However, INSDC contains primary nucleotide data sources that are essential for protein databases. For instance, 95% of records in UniProt are from INSDC (http://www.uniprot.org/help/sequence_origin). A further underlying problem is that fundamental understanding of duplication is missing. The scale, characteristics and impacts of duplicates in biological databases remain unclear.

Benchmarks in duplicate detection

Lack of large-scale validated benchmarks is a problem in duplicate detection in general domains. Researchers surveying duplicate detection methods have stated that the most challenging obstacle is lack of ‘standardized, large-scale benchmarking data sets’ (34). It is not easy to identify whether new methods surpass existing ones without reliable benchmarks. Moreover, some methods are based on machine learning, which requires reliable training data. In general domains, many supervised or semi-supervised duplicate detection methods exist, such as decision trees (35) and active learning (36).

The severity of this issue is illustrated by the only supervised machine-learning method for bioinformatics of which we are aware, which was noted above (30). The method was developed on a collection of 1300 records. In prior work, we reproduced the method and evaluated against a larger dataset with different types of duplicates. The results were extremely poor compared with the original outcomes, which we attribute to the insufficiency of the data used in the original work (37).

We aim to create large-scale validated benchmarks of duplicates. By assembling understanding of duplicates from different perspectives, it becomes possible to test different methods on the same platform, as well as test the robustness of methods in different contexts.

Quality control in bioinformatics databases

To construct a collection of explicitly labelled duplicates, an essential step is to understand the quality control process in bioinformatics databases, including how duplicates are found and merged. Here we describe how INSDC and UniProt perform quality control in general and indicate how these mechanisms can help in construction of large validated collections of duplicates.

Quality control in INSDC

Merging of records addresses duplication in INSDC. The merge may occur due to various reasons, including cases where different submitters add records for the same biological entities, or changes of database policies. We have discussed various reasons for merging elsewhere (8). Different merge reasons reflect the fact that duplication may arise from diverse causes. Figure 1 shows an example. Record INSDC AC034192.5 is merged with Record INSDC AC087090.1 in Apr 2002. (We use the recommended accession.version format to describe records. Since the paper covers three data sources, we also add the data source name.) In contrast, the different versions of Record INSDC AC034192 (version 2 in April 2000 and version 3 in May 2000) are just normal updates on the same record. Therefore we only collect the former.

Figure 1. A screenshot of the revision history for record INSDC AC034192.5 (http://www.ncbi.nlm.gov/nuccore/AC034192.5?report=girevhist). Note the differences between normal updates (changes on a record itself) and merged records (duplicates). For instance, the record was updated from version 3 to 4, which is a normal update. A different record INSDC AC087090.1 is merged in during Apr 2002. This is a case of duplication confirmed by ENA staff. We only collected duplicates, not normal updates.

Staff confirmed that this is the only resource for merged records in INSDC. Currently there is no completely automatic way to collect such duplicates from the revision history. Elsewhere we have explained the procedure that we developed to collect these duplicates, why we believe that many duplicates are still present in INSDC, and why the collection is representative (8).

Quality control in UniProt

UniProt Knowledgebase (UniProtKB) is a protein database that is a main focus of the UniProt Consortium. It has two sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. UniProtKB/Swiss-Prot is expert curated and reviewed, with software support, whereas UniProtKB/TrEMBL is curated automatically without review. Here, we list the steps of curation in UniProtKB/Swiss-Prot (http://www.uniprot.org/help/), as previously explained elsewhere (38):

1. Sequence curation: identify and merge records from same genes and same organisms; identify and document sequence discrepancies such as natural variations and frameshifts; explore homologs to check existing annotations and propagate other information;
2. Sequence analysis: predict sequence features using sequence analysis programs, then experts check the results;
3. Literature curation: identify relevant papers, read the full text and extract the related context, assign gene ontology terms accordingly;
4. Family curation: analyse putative homology relationships; perform steps 1–3 for identified instances;
5. Evidence attribution: link all expert curated data to the original source;
6. Quality assurance and integration: final check of finished entries and integration into UniProtKB/Swiss-Prot.

UniProtKB/Swiss-Prot curation is sophisticated and sensitive, and involves substantial expert effort, so the data quality can be assumed to be high. UniProtKB/TrEMBL complements UniProtKB/Swiss-Prot using purely automatic curation. The automatic curation in UniProtKB/TrEMBL mainly comes from two sources: (1) the Unified Rule (UniRule) system, which derives curator-tested rules from UniProtKB/Swiss-Prot manually annotated entries (for instance, the derived rules have been used to determine family membership of uncharacterized protein sequences (39)); and (2) the Statistical Automatic Annotation System (SAAS), which generates automatic rules for functional annotations (for instance, it applies the C4.5 decision tree algorithm to UniProtKB/Swiss-Prot entries to generate automatic functional annotation rules (38)). The whole process is automatic and does not have expert review. Therefore, it avoids expert curation with the trade-off of lower quality assurance. Overall both collections represent the state of the art in biological data curation.

Recall that nucleotide records in INSDC are primary sources for other databases. From a biological perspective, protein coding nucleotide sequences are translated into protein sequences (40). Both UniProtKB/Swiss-Prot and UniProtKB/TrEMBL generate cross-references from the coding sequence records in INSDC to their translated protein records. This provides a mapping between INSDC and curated protein databases. We can use the mapping between INSDC and UniProtKB/Swiss-Prot and the mapping between INSDC and UniProtKB/TrEMBL, respectively, to construct two collections of nucleotide duplicate records. We detail the methods and underlying ideas below.

Methods

We now explain how we construct our benchmarks, which we call the merge-based, expert curation and automatic curation benchmarks; we then describe how we measure the duplicate pairs for all three benchmarks.

Benchmark construction

Our first benchmark is the merge-based collection, based on direct reports of merged records provided by record submitters, curators, and users to any of the INSDC databases. Creation of this benchmark involves searching the revision history of records in INSDC, tracking merged record IDs, and downloading accordingly. We have described the process in detail elsewhere, in work where we analysed the scale, classification and impacts of duplicates specifically in INSDC (8).

The other two benchmarks are the expert curation and automatic curation benchmarks. Construction of these benchmarks of duplicate nucleotide records is based on the mapping between INSDC and protein databases (UniProtKB/Swiss-Prot and UniProtKB/TrEMBL), and consists of two main steps. The first is to perform the mapping: downloading record IDs and using the existing mapping service; the second is to interpret the mapping results and find the cases where duplicates occur.

The first step has the following sub-steps. Our expert and automatic curation benchmarks are constructed using the same steps, except that one is based on mapping between INSDC and UniProtKB/Swiss-Prot and the other is based on mapping between INSDC and UniProtKB/TrEMBL.

1. Retrieve a list of coding record IDs for an organism in INSDC. We call these IIDs (I for INSDC). Databases under INSDC exchange data daily so the data is the same (though the representations may vary). Thus, records can be retrieved from any one of the databases in INSDC. This list is used in the interpretation step;
2. Download a list of record IDs for an organism in either UniProtKB/Swiss-Prot or UniProtKB/TrEMBL. We call these UIDs (U for UniProt). This list is used in mapping;
3. Use the mapping service provided in UniProt (41) to

We interpret the mapping based on biological knowledge and database policies, as confirmed by UniProt staff. Recall that protein coding nucleotide sequences are translated into protein sequences. In principle, one coding sequence record in INSDC can be mapped into one protein record in UniProt; it can also be mapped into more than one protein record in UniProt. More specifically, if one protein record in UniProt cross-references multiple coding sequence records in INSDC, those coding sequence records are duplicates. Some of those duplicates may have distinct sequences due to the presence of introns and other regulatory regions in the genomic sequences. We classify the mappings into six cases, as follows. Note that the following cases related to merging occur in the same species.

Case 1: A protein record maps to one nucleotide coding sequence record. No duplication is detected.

Case 2: A protein record maps to many nucleotide coding sequence records. This is an instance of duplication. Here UniProtKB/Swiss-Prot and UniProtKB/TrEMBL represent different duplicate types. In the former, splice forms, genetic variations and other sequences are merged, whereas in the latter merges are mainly of records with close to identical sequences (either from the same or different submitters). That is also why we construct two different benchmarks accordingly.

Case 3: Many protein records have the same mapped coding sequence records. There may be duplication, but we assume that the data is valid. For example, the cross-referenced coding sequence could be a complete genome that links to all corresponding coding sequences.

Case 4: Protein records do not map to nucleotide coding sequence records. No duplication is detected.

Case 5: The nucleotide coding sequences exist in IIDs but are not cross-referenced. Not all nucleotide records with a coding region will be integrated, and some might not be selected in the cross-reference process.

Case 6: The nucleotide coding sequence records are cross-referenced, but are not in IIDs. A possible explanation is that the cross-referenced nucleotide sequence was predicted to be a coding sequence by curators or automatic software, but was not annotated as a coding sequence by the original submitters in INSDC. In other words, UniProt corrects the original missing annotations in INSDC.
Such cases can be identified with the generate mappings: Provide the UIDs from Step 2; NOT_ANNOTATED_CDS qualifier on the DR line Choose ‘UniProtKB AC/ID to EMBL/GenBank/DDBJ’ when searching in EMBL. option; and Click ‘Generate Mapping’. This will gener- In this study, we focus on Case 2, given that this is ate a list of mappings. Each mapping contains the re- where duplicates are identified. We collected all the related cord ID in UniProt and the cross-referenced ID(s) in nucleotide records and constructed the benchmarks INSDC. We will use the mappings and IIDs in the inter- accordingly. pretation step. Downloaded from https://academic.oup.com/database/advance-article-abstract/doi/10.1093/database/baw164/2870676 by Ed 'DeepDyve' Gillespie user on 17 July 2018 Page 8 of 17 Database, Vol. 2017, Article ID baw164 Quantitative measures Local sequence identity and alignment proportion We used NCBI BLAST (version 2.2.30) (43) to measure After building the benchmarks as above, we quantitatively local sequence identity. We used the bl2seq application measured the similarities in nucleotide duplicate pairs in all that aligns sequences pairwise and reports the identity of three benchmarks to understand their characteristics. every pair. NCBI BLAST staff advised on the recom- Typically, for each pair, we measured the similarity of de- mended parameters for running BLAST pairwise alignment scription, literature and submitter, the local sequence iden- in general. We disabled the dusting parameter (which auto- tity and the alignment proportion. The methods are matically filters low-complexity regions) and selected the described briefly here; more detail (‘Description similarity’, smallest word size (4), aiming to achieve the highest accur- ‘Submitter similarity’ and ‘Local sequence identity and align- acy as possible. Thus, we can reasonably conclude that a ment proportion’ sections is available in our other work (8). 
pair has low sequence identity if the output reports ‘no hits’ or the expected value is over the threshold. Description similarity We also used another metric, which we called the align- A description is provided in each nucleotide record’s ment proportion, to estimate the likelihood of the global DEFINITION field. This is typically a one-line description identity between a pair. This has two advantages: in some of the record, manually entered by record submitters. We cases where a pair has very high local identity, their lengths have applied the following approximate string matching are significantly different. Use of alignment proportion can process to measure the description similarity of two re- identify these cases; and running of global alignment is cords, using the Python NLTK package (42): computationally intensive. Alignment proportion can dir- 1. Tokenising: split the whole description word by word; ectly estimate an upper bound on the possible global iden- 2. Lowering case: for each token, change all its characters tity. It is computed using Formula (2) where L is the local into small cases; alignment proportion, I is the locally aligned identical 3. Removing stop words: removes the words that are com- bases, D and R are sequences of the pair, and len(S) is the monly used but not content bearing, such as ‘so’, ‘too’, length of a sequence S. ‘very’ and certain special characters; 4. Lemmatising: convert to a word to its base form. For L ¼ lenðIÞ=maxðLenðDÞ; LenðRÞÞ (2) example, ‘encoding’ will be converted to ‘encode’, or We constructed three benchmarks containing duplicates ‘cds’ (coding sequences) will be converted into ‘cd’; covering records for 21 organisms, using the above map- 5. Set representation: for each description, we represent it ping process. We also quantitatively measured their char- as a set of tokens after the above processing. We re- acteristics in selected organisms. 
These 21 organisms are move any repeated tokens; commonly used in molecular research projects and the We applied set comparison to measure the similarity NCBI Taxonomy provides direct links (http://www.ncbi. using the Jaccard similarity defined by Equation (1). Given nlm.nih.gov/Taxonomy/taxonomyhome.html/). two sets, it reports the number of shared elements as a frac- tion of the total number of elements. This similarity metric can successfully find descriptions containing the same Results and discussion tokens but in different orders. We present our results in two stages. The first introduces intersectionðset1; set2Þ=unionðset1; set2Þ (1) the statistics of the benchmarks constructed using the methods described above. The second provides the out- come of quantitative measurement of the duplicate pairs in Submitter similarity different benchmarks. The REFERENCE field of a record in the primary nucleo- We applied our methods to records for 21 organisms tide databases contains two kinds of reference. The first is popularly studied organisms, listed in the NCBI Taxonomy the literature citation that first introduced the record and website (http://www.ncbi.nlm.nih.gov/Taxonomy/taxono the second is the submitter details. Here, we measure the myhome.html/). Tables 1, 2 and 3 show the summary stat- submitter details to find out whether two records are sub- istics of the duplicates collected in the three benchmarks. mitted by the same group. Table 1 is reproduced from another of our papers (8). All We label a pair as ‘Same’ if it shares one of submission the benchmarks are significantly larger than previous col- authors, and otherwise as ‘Different’. If a pair does not lections of verified duplicates. The submitter-based bench- have such field, we label it as ‘N/A’. The author name is mark has over 100 000 duplicate pairs. Even more formatted as ‘last name, first initial’. 
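The three pairwise measures defined in the Methods (description similarity by Equation (1), the submitter label, and the alignment proportion by Formula (2)) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the stop-word list and the lemma table below are tiny hypothetical stand-ins for the NLTK resources the paper uses, the function names are invented for illustration, treating a pair as ‘N/A’ when either record lacks submitter authors is an assumption, and the count of identical aligned bases is assumed to have been read from the BLAST output beforehand.

```python
import re

# Hypothetical stand-ins for the NLTK stop-word list and lemmatiser;
# they cover only the examples quoted in the text plus a few common words.
STOP_WORDS = {"so", "too", "very", "the", "a", "an", "of", "and", "for"}
LEMMAS = {"encoding": "encode", "cds": "cd"}


def description_tokens(description):
    """Tokenise, lower-case, drop stop words, lemmatise; return a token set."""
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    return {LEMMAS.get(tok, tok) for tok in tokens if tok not in STOP_WORDS}


def description_similarity(desc_a, desc_b):
    """Jaccard similarity of the two token sets, as in Equation (1)."""
    set_a, set_b = description_tokens(desc_a), description_tokens(desc_b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty descriptions score 0
    return len(set_a & set_b) / len(union)


def submitter_label(authors_a, authors_b):
    """Label a pair from its submitter author lists ('last name, first initial')."""
    if not authors_a or not authors_b:
        return "N/A"  # assumption: missing submitter details on either record
    return "Same" if set(authors_a) & set(authors_b) else "Different"


def alignment_proportion(identical_bases, len_d, len_r):
    """Formula (2): identical aligned bases over the longer sequence length."""
    return identical_bases / max(len_d, len_r)


# Usage: two descriptions with the same content words in a different order.
sim = description_similarity(
    "Xenopus laevis mRNA encoding actin, complete cds",
    "Complete cds of actin mRNA, Xenopus laevis",
)
label = submitter_label(["Chen,Q.", "Zobel,J."], ["Zobel,J."])
proportion = alignment_proportion(900, 1000, 3000)
```

In this example the second description lacks only the token ‘encode’, so six of the seven distinct tokens are shared. The alignment proportion shows why the measure complements local identity: even if all 900 aligned bases are identical, a pair whose longer sequence has 3000 bases cannot exceed a global identity of 0.3.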
The other two benchmarks contain even more duplicate pairs: the expert curation benchmark has around 2.46 million pairs and the automatic curation benchmark has around 0.47 billion pairs; hence, these two are also appropriate for evaluation of efficiency-focused methods.

We measured duplicates for Bos taurus, Rattus norvegicus, Saccharomyces cerevisiae, Xenopus laevis and Zea mays quantitatively as stated above. Figures 2–9 show representative results, for Xenopus laevis and Zea mays. These figures demonstrate that duplicates in different benchmarks have dramatically different characteristics, and that duplicates from different organisms in the same benchmarks also have variable characteristics. We elaborate further as follows.

Construction of benchmarks from three different perspectives has yielded different numbers of duplicates with distinct characteristics in each benchmark. These benchmarks have their own advantages and limitations. We analyse and present them here.

• The merge-based benchmark is broad. Essentially all types of records in INSDC are represented, including clones, introns, and binding regions; all types in addition to the coding sequences that are cross-referenced in protein databases. Elsewhere we have detailed different reasons for merging INSDC records; for instance, many records from Bos taurus and Rattus norvegicus in the merge-based benchmark are owned by RefSeq (searchable via INSDC), and RefSeq merges records using a mix of manual and automatic curation (8). However, only limited duplicates have been identified using this method. Our results clearly show that it contains far fewer duplicates than the other two, even though the original total number of records is much larger.
• The expert curation benchmark is shown to contain a much more diverse set of duplicate types. For instance, Figure 4 clearly illustrates that the expert curation benchmark identifies much more diverse kinds of duplicate in Xenopus laevis than the other two benchmarks. It not only identifies 25.0% of duplicates with close to the same sequences, but also finds duplicates with very different lengths and even duplicates with relatively low sequence identity. In contrast, the other two mainly identify duplicates having almost the same sequence: 83.9% for the automatic curation benchmark and 96.8% for the merge-based benchmark. However, the volume of duplicates is smaller than for automatic curation. The use of the protein database means that only coding sequences will be found.
• The automatic curation benchmark holds the highest number of duplicates amongst the three. However, even though it represents the state of the art in automatic curation, it mainly uses rule-based curation and does not have expert review, so is still not as diverse or exhaustive as expert curation. For example, in Figure 2, over 70% of the identified duplicates have high description similarity, whereas the expert curation benchmark contains duplicates with description similarity in different distributions. As with the expert curation benchmark, it only contains coding sequences by construction.

Table 1. Submitter-based benchmark

| Organism | Total records | Available merged groups | Duplicate pairs |
|---|---|---|---|
| Arabidopsis thaliana | 337 640 | 47 | 50 |
| Bos taurus | 245 188 | 12 822 | 20 945 |
| Caenorhabditis elegans | 74 404 | 1881 | 1904 |
| Chlamydomonas reinhardtii | 24 891 | 10 | 17 |
| Danio rerio | 153 360 | 7895 | 9227 |
| Dictyostelium discoideum | 7943 | 25 | 26 |
| Drosophila melanogaster | 211 143 | 431 | 3039 |
| Escherichia coli | 512 541 | 201 | 231 |
| Hepatitis C virus | 130 456 | 32 | 48 |
| Homo sapiens | 12 506 281 | 16 545 | 30 336 |
| Mus musculus | 1 730 943 | 13 222 | 23 733 |
| Mycoplasma pneumoniae | 1009 | 2 | 3 |
| Oryza sativa | 108 395 | 6 | 6 |
| Plasmodium falciparum | 43 375 | 18 | 26 |
| Pneumocystis carinii | 528 | 1 | 1 |
| Rattus norvegicus | 318 577 | 12 411 | 19 295 |
| Saccharomyces cerevisiae | 68 236 | 165 | 191 |
| Schizosaccharomyces pombe | 4086 | 39 | 545 |
| Takifugu rubripes | 51 654 | 64 | 72 |
| Xenopus laevis | 35 544 | 1620 | 1660 |
| Zea mays | 613 768 | 454 | 471 |

Total records: total number of records that directly belong to the organism; Available merged groups: number of groups that are tracked in record revision histories. One group may contain multiple records. Duplicate pairs: total number of duplicate pairs. This table also appears in the paper (8).

Table 2. Expert curation benchmark

| Organism | Cross-referenced coding records | Cross-referenced coding records that are duplicates | Duplicate pairs |
|---|---|---|---|
| Arabidopsis thaliana | 34 709 | 34 683 | 162 983 |
| Bos taurus | 9605 | 5646 | 28 443 |
| Caenorhabditis elegans | 3225 | 2597 | 4493 |
| Chlamydomonas reinhardtii | 369 | 255 | 421 |
| Danio rerio | 5244 | 3858 | 4942 |
| Dictyostelium discoideum | 1242 | 1188 | 1757 |
| Drosophila melanogaster | 13 385 | 13 375 | 573 858 |
| Escherichia coli | 611 | 420 | 1042 |
| Homo sapiens | 132 500 | 131 967 | 1 392 490 |
| Mus musculus | 74 132 | 72 840 | 252 213 |
| Oryza sativa | 40 0 | | |
| Plasmodium falciparum | 97 | 68 | 464 |
| Pneumocystis carinii | 33 | 19 | 11 |
| Rattus norvegicus | 15 595 | 11 686 | 24 000 |
| Saccharomyces cerevisiae | 84 | 67 | 297 |
| Schizosaccharomyces pombe | 33 2 | | |
| Takifugu rubripes | 153 | 64 | 59 |
| Xenopus laevis | 4701 | 2259 | 2279 |
| Zea mays | 1218 | 823 | 16 137 |

Cross-referenced coding records: total number of records in INSDC that are cross-referenced; Cross-referenced coding records that are duplicates: number of records that are duplicates based on interpretation of the mapping (Case 2); Duplicate pairs: total number of duplicate pairs.

Table 3. Automatic curation benchmark

| Organism | Cross-referenced coding records | Cross-referenced coding records that are duplicates | Duplicate pairs |
|---|---|---|---|
| Arabidopsis thaliana | 42 697 | 31 580 | 229 725 |
| Bos taurus | 35 427 | 25 050 | 440 612 |
| Caenorhabditis elegans | 2203 | 1541 | 20 513 |
| Chlamydomonas reinhardtii | 1728 | 825 | 1342 |
| Danio rerio | 43 703 | 29 236 | 74 170 |
| Dictyostelium discoideum | 935 | 289 | 2475 |
| Drosophila melanogaster | 49 599 | 32 305 | 527 246 |
| Escherichia coli | 56 459 | 49 171 | 3 671 319 |
| Hepatitis C virus | 105 613 | 171 | 639 |
| Homo sapiens | 141 373 | 79 711 | 467 101 272 |
| Mus musculus | 58 292 | 32 102 | 95 728 |
| Mycoplasma pneumoniae | 65 | 20 | 13 |
| Oryza sativa | 3195 | 1883 | 32 727 |
| Plasmodium falciparum | 32 561 | 15 114 | 997 038 |
| Pneumocystis carinii | 314 | 38 | 23 |
| Rattus norvegicus | 39 199 | 30 936 | 115 910 |
| Saccharomyces cerevisiae | 4763 | 3784 | 107 928 |
| Schizosaccharomyces pombe | 80 | 6 | 3 |
| Takifugu rubripes | 1341 | 288 | 1650 |
| Xenopus laevis | 15 320 | 3615 | 26 443 |
| Zea mays | 55 097 | 25 139 | 108 296 |

The headings are the same as in Table 2.

Figure 2. Description similarities of duplicates from Xenopus laevis in three benchmarks: Auto for auto curation based; Expert for expert curation; and Merge for merge-based collection. X-axis defines the similarity range. For instance, [0.5, 0.6) means greater than or equal to 0.5 and <0.6. Y-axis defines the proportion for each similarity range.

Figure 3.
Submitter similarities of duplicates from Xenopus laevis in three benchmarks. Different: the submitters of the two records are completely different; Same: the pair shares at least one submitter; Not specified: no submitter details are specified in the REFERENCE field of the records. The rest is the same as above.

Figure 4. Alignment proportion of duplicates from Xenopus laevis. LOW refers to pairs whose expected value is greater than the threshold or for which BLAST reports NO HITS. Recall that we chose the parameters to produce reliable BLAST output.

Figure 5. Local sequence identity of duplicates from Xenopus laevis in three benchmarks. The rest is the same as above.

Figure 6. Description similarity of duplicates from Zea mays in three benchmarks.

Figure 7. Submitter similarity of duplicates from Zea mays in three benchmarks.

Figure 8. Alignment proportion of duplicates from Zea mays in three benchmarks.

Figure 9. Local sequence identity of duplicates from Zea mays in three benchmarks.

The analysis shows that these three benchmarks complement each other. Merging records in INSDC provides preliminary quality checking across all kinds of records in INSDC. Curation (automatic and expert) provides more reliable and detailed checking specifically for coding sequences. Expert curation contains more kinds of duplicates and automatic curation has a larger volume of identified duplicates.

Recall that previous studies used a limited number of records with a limited number of organisms and kinds of duplication. Given the richness evidenced in our benchmarks, and the distinctions between them, it is unreliable to evaluate against only one benchmark, or multiple benchmarks constructed from the same perspective. As shown above, the expert curation benchmark contains considerable numbers of duplicates that have distinct alignment proportions or relatively low sequence similarity. The efficiency-focused duplicate detection methods discussed earlier thus would fail to find many of the duplicates in our expert curation benchmark.

Also, duplicates in one benchmark yet in different organisms have distinct characteristics. For instance, as shown in the figures for Xenopus laevis and Zea mays, duplicates in Zea mays generally have higher description similarity (comparing Figure 2 with Figure 6), are more often submitted by the same submitters (comparing Figure 3 with Figure 7), have more similar sequence lengths (comparing Figure 4 with Figure 8) and have higher sequence identity (comparing Figure 5 with Figure 9). However, duplicates in Xenopus laevis have different characteristics. For instance, in Zea mays the expert curation benchmark contains 40.0 and 57.7% of duplicates submitted by different and same submitters, respectively. Yet the same benchmark shows many more duplicates in Xenopus laevis from different submitters (47.4%), nearly double the proportion from the same submitters (26.4%). Due to these differences, methods that demonstrate good performance on one organism may not display comparable performance on others.

Additionally, the two curation-based benchmarks indicate that there are potentially many undiscovered duplicates in the primary nucleotide databases. Using Arabidopsis thaliana as an example, only 47 groups of duplicates were merged out of 337 640 records in total. The impression from this would be that the overall prevalence of duplicates in INSDC is quite low. However, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL only cross-referenced 34 709 and 42 697 Arabidopsis thaliana records, respectively, yet tracing their mappings shows that 34 683 (99.93%) records in Table 2 and 31 580 (73.96%) records in Table 3 have at least one corresponding duplicate record, even though they only examine coding sequences. It may be possible to construct another benchmark through the mapping between INSDC and RefSeq, using the approach described in this paper.

Another observation is that UniProtKB/Swiss-Prot, with expert curation, contains a more diverse set of duplicates than the other benchmarks. From the results, it can be observed that expert curation can find occurrences of duplicates that have low description similarity, are submitted by completely different groups, have varied lengths, or are of comparatively low local sequence identity. This illustrates that it is not sufficient to focus on duplicates that have highly similar sequences of highly similar lengths. A case study has already found that expert curation rectifies errors in original studies (39). Our study on duplicates illustrates this from another angle.

These results also highlight the complexity of duplicates that are present in bioinformatics databases. The overlap among our benchmarks is remarkably minimal. The submitter benchmark includes records that do not correspond to coding sequences, so they are not considered by the protein databases. UniProtKB/Swiss-Prot and UniProtKB/TrEMBL use different curation processes as mentioned above. This shows that a pair may be considered a duplicate from the perspective of one resource but not on the basis of another.

More fundamentally, records that are considered as duplicates for one task may not be duplicates for another. Thus, it is not possible to use a simple and universal definition to conceptualize duplicates. Given that the results show that kinds and prevalence of duplicates vary amongst organisms and benchmarks, studies are needed to answer fundamental questions: what kinds of duplicates are there? What are their corresponding impacts for biological studies that draw from the sequence databases? Can existing duplicate detection methods successfully find the type of duplicates that has impacts for specific kinds of biomedical investigations? These questions are currently unanswered. The benchmarks here enable such discovery (46). We explored the prevalence, categories and impacts of duplicates in the submitter-based benchmark to understand the duplication directly in INSDC.

To summarise, we review the benefits of having created these benchmarks.

First, the records in the benchmarks can be used for two main purposes: (1) as duplicates to merge; (2) as records to label or cross-reference to support record linkage. We now examine the two cases:

Case 1: record INSDC AL592206.2 (https://www.ncbi.nlm.nih.gov/nuccore/AL592206.2) and INSDC AC069109.2 (https://www.ncbi.nlm.nih.gov/nuccore/AC069109.2?report=genbank). This is an example that we noted earlier from the submitter collection. Record gi:8616100 was submitted by the Whitehead Institute/MIT Center for Genome Research. It concerns the RP11-301H18 clone in Homo sapiens chromosome 9. It has 18 unordered pieces, as the submitters documented. The later record gi:15029538 was submitted by the Sanger Centre. That record also concerns the RP11-301H18 clone but it only has three unordered pieces. Therefore, this case shows an example of duplication where different submitters submit records about the same entities. Note that they are inconsistent, in that both the annotation data and sequence are quite different. Therefore, a merge was done (by either database staff or submitter). Record INSDC AC069109.2 was replaced by INSDC AL592206.2, as INSDC AL592206.2 has fewer unordered pieces, that is, is closer to being complete. Then record AC069109.2 became obsolete. Only record INSDC AL592206.2 can be updated. This record reached a complete sequence (no unordered pieces) around 2012, after 18 updates since the merge.

Case 2: record INSDC AC055725.22 (https://www.ncbi.nlm.nih.gov/nuccore/AC055725.22), INSDC BC022542.1 (https://www.ncbi.nlm.nih.gov/nuccore/BC022542.1) and INSDC AK000529.1 (https://www.ncbi.nlm.nih.gov/nuccore/AK000529.1). These records are from the expert curation collection. At the protein level, they correspond to the same protein record Q8TBF5, about a Phosphatidylinositol-glycan biosynthesis class X protein. Those three records have been explicitly cross-referenced into the same protein entry during expert curation. The translations of record INSDC BC022542.1 and INSDC AK000529.1 are almost the same. Further, the expert-reviewed protein record UniProtKB/Swiss-Prot Q8TBF5 is documented as follows (http://www.uniprot.org/uniprot/Q8TBF5):

AC055725 [INSDC AC055725.22] Genomic DNA. No translation available;
BC022542 [INSDC BC022542.1] mRNA. Translation: AAH22542.1. Sequence problems;
AK000529 [INSDC AK000529.1] mRNA. Translation: BAA91233.1. Sequence problems.

Those annotations were made via curation to mark problematic sequences submitted to INSDC. The ‘no translation available’ annotation indicates that the original submitted INSDC records did not specify the coding sequence (CDS) regions, but the UniProt curators have identified the CDS. ‘Sequence problems’ refers to ‘discrepancies due to an erroneous gene model prediction, erroneous ORF assignment, miscellaneous discrepancy, etc.’ (http://www.uniprot.org/help/cross_references_section) resolved by the curator. Therefore, without expert curation, it is difficult to access the correct information and difficult to know that the records refer to the same protein. As mentioned earlier, an important impact of duplicate detection is record linkage. Cross-referencing across multiple databases is certainly useful, regardless of whether the linked records are regarded as duplicates.

Second, considering the three benchmarks as a whole, they cover diverse duplicate types. The detailed types are summarized elsewhere (8), but broadly three types are evident: (1) similar records, if not identical; (2) fragments; (3) somewhat different records belonging to the same entities. Existing studies have already shown all of them have specific impacts on biomedical tasks. Type (1) may affect database searches (44); type (2) may affect meta-analyses (45); while type (3) may confuse novice database users.

Third, those benchmarks are constructed based on different principles. The large volume of the dataset, and diversity in type of duplicate, can provide a basis for evaluation of both efficiency and accuracy. Benchmarks are always a problem for duplicate detection methods: a method can detect duplicates in one dataset successfully, but may get poor performance on another. This is because the methods have different definitions of duplicate, or those datasets have different types or distributions. This is why the duplicate detection survey identified the creation of benchmarks as a pressing task (34). Multiple benchmarks enable testing of the robustness and generalization of the proposed methods. We used six organisms from the expert curated benchmark as the dataset and developed a supervised learning duplicate detection method (46). We tested the generality of the trained model as an example: whether a model trained from duplicate records in one organism maintains the performance in another organism. This is effectively showing how users can use the benchmarks as test cases, perhaps organized by organisms or by type.

Conclusion

In this study, we established three large-scale validated benchmarks of duplicates in bioinformatics databases, specifically focusing on identifying duplicates from primary nucleotide databases (INSDC). The benchmarks are available for use at https://bitbucket.org/biodbqual/benchmarks. These benchmark data sets can be used to support development and evaluation of duplicate detection methods. The three benchmarks contain the largest number of duplicates validated by submitters, database staff, expert curation or automatic curation presented to date, with nearly half a billion record pairs in the largest of our collections.

We explained how we constructed the benchmarks and their underlying principles. We also measured the characteristics of duplicates collected in these benchmarks quantitatively, and found substantial variation among them. This demonstrates that it is unreliable to evaluate methods with only one benchmark. We find that expert curation in UniProtKB/Swiss-Prot can identify much more diverse kinds of duplicates and emphasize that we appreciate the effort of expert curation due to its finer-grained assessment of duplication.

In future work, we plan to explore the possibility of mapping other curated databases to INSDC to construct more duplicate collections. We will assess these duplicates in more depth to establish a detailed taxonomy of duplicates and collaborate with biologists to measure the possible impacts of different types of duplicates in practical biomedical applications. However, this work already provides new insights into the characteristics of duplicates in INSDC, and has created a resource that can be used for the development of duplicate detection methods. With, in all likelihood, vast numbers of undiscovered duplicates, such methods will be essential to maintenance of these critical databases.

Funding

Qingyu Chen’s work is supported by an International Research Scholarship from The University of Melbourne. The project receives funding from the Australian Research Council through a Discovery Project grant, DP150101550.

Conflict of interest. None declared.

Acknowledgements

We greatly appreciate the assistance of Elisabeth Gasteiger from UniProtKB/Swiss-Prot, who advised on and confirmed the mapping process in this work with domain expertise. We also thank Nicole Silvester and Clara Amid from the EMBL European Nucleotide Archive, who advised on the procedures regarding merged records in INSDC. Finally we are grateful to Wayne Mattern from NCBI, who advised how to use BLAST properly by setting reliable parameter values.

References

1. Benson,D.A., Clark,K., Karsch-Mizrachi,I. et al. (2015) GenBank. Nucleic Acids Res., 43, D30.
2. Bork,P. and Bairoch,A. (1996) Go hunting in sequence databases but watch out for the traps. Trends Genet., 12, 425–427.
3. Altschul,S.F., Boguski,M.S., Gish,W., et al. (1994) Issues in searching molecular sequence databases. Nat. Genet., 6, 119–129.
4. Brenner,S.E. (1999) Errors in genome annotation. Trends Genet., 15, 132–133.
5. Fan,W. (2012) Web-Age Information Management. Springer, Berlin, pp. 1–16.
6. UniProt Consortium. (2014) Activities at the universal protein resource (UniProt). Nucleic Acids Res., 42, D191–D198.
7. Nakamura,Y., Cochrane,G., and Karsch-Mizrachi,I. (2013) The international nucleotide sequence database collaboration. Nucleic Acids Res., 41, D21–D24.
8. Chen,Q., Justin,Z., and Verspoor,K. (2016) Duplicates, redundancies, and inconsistencies in the primary nucleotide databases: a descriptive study. Database, doi: http://dx.doi.org/10.1101/085019.
9. Lin,Y.S., Liao,T.Y., and Lee,S.J. (2013) Detecting near-duplicate documents using sentence-level features and supervised learning. Expert Syst. Appl., 40, 1467–1476.
10. Liu,X. and Xu,L. (2013), Proceedings of the International Conference on Information Engineering and Applications (IEA) 2012. Springer, Heidelberg, pp. 325–332.
11. Fu,L., Niu,B., Zhu,Z. et al. (2012) CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28, 3150–3152.
12. Jupe,S., Jassal,B., Williams,M., and Wu,G. (2014) A controlled vocabulary for pathway entities and events. Database, 2014, bau060.
13. Wilming,L.G., Boychenko,V., and Harrow,J.L. (2015) Comprehensive comparative homeobox gene annotation in human and mouse. Database, 2015, bav091.
14. Williams,G., Davis,P., Rogers,A. et al. (2011) Methods and strategies for gene structure curation in WormBase. Database,
20. Li,W. and Godzik,A. (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22, 1658–1659.
21. Verykios,V.S., Moustakides,G.V., and Elfeky,M.G. (2003) A Bayesian decision model for cost optimal record matching. VLDB J., 12, 28–40.
22. McCoy,A.B., Wright,A., Kahn,M.G. et al. (2013) Matching identifiers in electronic health records: implications for duplicate records and patient safety. BMJ Qual. Saf., 22, 219–224.
23. Bagewadi,S., Adhikari,S., Dhrangadhariya,A. et al. (2015) NeuroTransDB: highly curated and structured transcriptomic metadata for neurodegenerative diseases. Database, 2015, bav099.
24. Finn,R.D., Coggill,P., Eberhardt,R.Y. et al. (2015) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res., 44:D279–D285.
25. Herzog,T.N., Scheuren,F.J., and Winkler,W.E. (2007) Data Quality and Record Linkage Techniques. Springer, Berlin.
26. Christen,P. (2012) A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng., 24, 1537–1555.
27. Joffe,E., Byrne,M.J., Reeder,P. et al. (2014) A benchmark comparison of deterministic and probabilistic methods for defining manual review datasets in duplicate records reconciliation. J. Am. Med. Informat. Assoc., 21, 97–104.
28. Holm,L. and Sander,C. (1998) Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics, 14, 423–429.
29. Zorita,E.V., Cusco,P., and Filion,G. (2015) Starcode: sequence clustering based on all-pairs search. Bioinformatics, 31, 1913–1919.
30. Koh,J.L., M.L., Lee,M., Khan,A.M., Tan,P.T., and Brusic,V. (2004) Duplicate detection in biological data using association rule mining. Locus, 501, S22388.
31. Cross,G.R. and Jain,A.K. (1983) Markov random field texture models. IEEE Trans. Pattern Anal. Mach. Intell., 5, 25–39.
32. Rudniy,A., Song,M., and Geller,J. (2010) Detecting duplicate biological entities using shortest path edit distance. Int. J. Data Mining Bioinformatics, 4, 395–410.
33. Rudniy,A., Song,M., and Geller,J. (2014) Mapping biological entities using the longest approximately common prefix method. BMC Bioinformatics, 15, 187.
34. Elmagarmid,A.K., Ipeirotis,P.G., and Verykios,V.S. (2007) Duplicate record detection: a survey. IEEE Trans.
Knowl. Data 2011, baq039. Eng., 19, 1–16. 15. Safran,M., Dalah,I., Alexander,J. et al. (2010) GeneCards 35. Martins,B. (2011), GeoSpatial Semantics. Springer, Berlin, pp. Version 3: the human gene integrator. Database, 2010, baq020. 34–51. 16. Christen,P. and Goiser,K. (2007) Quality Measures in Data 36. Bilenko,M. and Mooney,R.J. (2003) Proceedings of the ninth Mining. Springer, Berlin, pp. 127–151. ACM SIGKDD international conference on Knowledge discov- 17. Nanduri,R., Bhutani,I., Somavarapu,A.K. et al. (2015) ery and data mining. ACM, New York, pp. 39–48. ONRLDB—manually curated database of experimentally vali- 37. Chen,Q., Zobel,J., and Verspoor,K. (2015) Evaluation of a dated ligands for orphan nuclear receptors: insights into new Machine Learning Duplicate Detection Method for drug discovery. Database, 2015, bav112. Bioinformatics Databases. ACM Ninth International 18. UniProt Consortium. (2014) UniProt: a hub for protein informa- Workshop on Data and Text Mining in Biomedical Informatics tion. Nucleic Acids Res., 43:D204–D212. in conjunction with CIKM, Washington, DC. ACM Press, New 19. Joffe,E., Byrne,M.J., Reeder,P. et al. (2013), AMIA Annual York. Symposium Proceedings. American Medical Informatics Association, Washington, DC, Vol. 2013, pp. 721. Downloaded from https://academic.oup.com/database/advance-article-abstract/doi/10.1093/database/baw164/2870676 by Ed 'DeepDyve' Gillespie user on 17 July 2018 Database, Vol. 2017, Article ID baw164 Page 17 of 17 38. Magrane,M. and UniProt Consortium. (2011) UniProt 43. Camacho,C., Coulouris,G., Avagyan,V. et al. (2009) BLASTþ: Knowledgebase: a hub of integrated protein data. Database, architecture and applications. BMC Bioinformatics, 10, 421. 2011, bar009. 44. Suzek,B.E., Wang,Y., Huang,H. et al. (2014) UniRef clusters: a 39. Poux,S., Magrane,M., Arighi,C.N. et al. 
(2014) Expert curation comprehensive and scalable alternative for improving sequence in UniProtKB: a case study on dealing with conflicting and erro- similarity searches. Bioinformatics,31, 926–932. neous data. Database, 2014, bau016. 45. Rosikiewicz,M., Comte,A., Niknejad,A. et al. (2013) 40. Crick,F. (1970) Central dogma of molecular biology. Nature, Uncovering hidden duplicated content in public transcriptomics 227, 561–563. data. Database, 2013, bat010. 41. Huang,H., McGarvey,P.B., Suzek,B.E. et al. (2011) A compre- 46. Chen,Q., Zobel,J., Zhang,X., and Verspoor,K. (2016) hensive protein-centric ID mapping service for molecular data in- Supervised learning for detection of duplicates in genomic se- tegration. Bioinformatics, 27, 1190–1191. quence databases. PLoS One, 11, e0159644. 42. Bird,S., Klein,E., and Loper,E. (2009) Natural Language Processing with Python. O’Reilly Media, Inc., Sebastopol, CA. Downloaded from https://academic.oup.com/database/advance-article-abstract/doi/10.1093/database/baw164/2870676 by Ed 'DeepDyve' Gillespie user on 17 July 2018

Journal

Database (Oxford University Press)

Published: Jan 8, 2017
