
Development and user experiences of an open source data cleaning, deduplication and record linkage system

ACM SIGKDD Explorations Newsletter, Volume 11 (1) – Nov 16, 2009


Publisher
Association for Computing Machinery
Copyright
The ACM Portal is published by the Association for Computing Machinery. Copyright © 2010 ACM, Inc.
ISSN
1931-0145
DOI
10.1145/1656274.1656282

Abstract

Record linkage, also known as database matching or entity resolution, is now recognised as a core step in the KDD process. Data mining projects increasingly require that information from several sources is combined before the actual mining can be conducted. Also of increasing interest is the deduplication of a single database. The objectives of record linkage and deduplication are to identify, match and merge all records that relate to the same real-world entities. Because real-world data is commonly 'dirty', data cleaning is an important first step in many deduplication, record linkage, and data mining projects. In this paper, an overview of the Febrl (Freely Extensible Biomedical Record Linkage) system is provided, and the results of a recent survey of Febrl users are discussed. Febrl includes a variety of functionalities required for data cleaning, deduplication and record linkage, and it provides a graphical user interface that facilitates its application for users who do not have programming experience.
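
The clean-then-match idea described in the abstract can be illustrated with a minimal Python sketch. This is not Febrl's API or its algorithms: the function names (`clean`, `similarity`, `deduplicate`), the sample records, the 0.85 threshold, and the use of the standard library's `difflib.SequenceMatcher` as the string comparator are all assumptions made for illustration. The merge step is left out; the sketch only groups records that appear to refer to the same entity.

```python
from difflib import SequenceMatcher

def clean(record):
    """Hypothetical cleaning step: lowercase and collapse whitespace in every field."""
    return {key: " ".join(str(value).lower().split()) for key, value in record.items()}

def similarity(a, b, fields):
    """Average approximate string similarity (0.0 to 1.0) over the chosen fields."""
    scores = [SequenceMatcher(None, a[f], b[f]).ratio() for f in fields]
    return sum(scores) / len(scores)

def deduplicate(records, fields, threshold=0.85):
    """Greedy grouping: a record joins the first cluster whose first member
    is at least `threshold` similar; otherwise it starts a new cluster."""
    clusters = []
    for rec in map(clean, records):
        for cluster in clusters:
            if similarity(rec, cluster[0], fields) >= threshold:
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters

# Illustrative 'dirty' records (made up for this example).
records = [
    {"name": "John  Smith", "city": "Canberra"},
    {"name": "john smith", "city": "canberra "},
    {"name": "Jon Smith", "city": "Canberra"},
    {"name": "Mary Jones", "city": "Sydney"},
]
clusters = deduplicate(records, ["name", "city"])
for cluster in clusters:
    print(cluster)
```

Comparing every record against every cluster is quadratic in the worst case, so a sketch like this only suits small inputs; the threshold is a tunable assumption, not a recommended value.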

Journal

ACM SIGKDD Explorations Newsletter, Association for Computing Machinery

Published: Nov 16, 2009
