Deep learning based approach to unstructured record linkage


Publisher: Emerald Publishing
Copyright: © Emerald Publishing Limited
ISSN: 1744-0084
DOI: 10.1108/ijwis-05-2021-0058

Abstract

Purpose: In the world of big data, data integration technology is crucial for maximising the capability of data-driven decision-making. Integrating data from multiple sources drastically expands the power of information and allows us to address questions that are impossible to answer using a single data source. Record Linkage (RL) is the task of identifying and linking records from multiple sources that describe the same real-world object (e.g. a person), and it plays a crucial role in the data integration process. RL is challenging, as it is uncommon for different data sources to share a unique identifier. Hence, the records must be matched based on the comparison of their corresponding values. Most of the existing RL techniques assume that records across different data sources are structured and represented by the same schema (i.e. set of attributes). Given the increasing number of heterogeneous data sources, those assumptions are rather unrealistic. The purpose of this paper is to propose a novel RL model for unstructured data.

Design/methodology/approach: In the previous work (Jurek-Loughrey, 2020), the authors proposed a novel approach to linking unstructured data based on the application of the Siamese Multilayer Perceptron model. It was demonstrated that the method performed on par with other approaches that make constraining assumptions regarding the data. This paper expands the previous work, originally presented at iiWAS2020 [16], by exploring new architectures of the Siamese Neural Network, which improve the generalisation of the RL model and make it less sensitive to parameter selection.

Findings: The experimental results confirm that the new Autoencoder-based architecture of the Siamese Neural Network obtains better results than the Siamese Multilayer Perceptron model proposed in (Jurek et al., 2020). Better results have been achieved on three out of four data sets. Furthermore, it has been demonstrated that the second proposed (hybrid) architecture, based on integrating the Siamese Autoencoder with a Multilayer Perceptron model, makes the model more stable in terms of parameter selection.

Originality/value: To address the problem of unstructured RL, this paper presents a new deep learning based approach that improves the generalisation of the Siamese Multilayer Perceptron model and makes it less sensitive to parameter selection.
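The abstract names a Siamese network but does not describe its inputs, layer sizes or training. As a rough illustration only of what "Siamese" means for linking free-text records, the sketch below passes two records through the same encoder (shared weights) and compares the resulting embeddings. Everything in it is an assumption made for illustration, not the authors' model: the hashed bag-of-tokens features, the names `featurise`, `make_encoder` and `siamese_distance`, the sizes `DIM` and `HIDDEN`, and the untrained random weights.

```python
import hashlib
import math
import random

DIM = 32     # size of the hashed bag-of-tokens input (assumed for illustration)
HIDDEN = 8   # size of the shared embedding (assumed for illustration)


def featurise(record, dim=DIM):
    """Hash each lowercase token of a free-text record into a fixed-size count vector."""
    vec = [0.0] * dim
    for token in record.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    return vec


def make_encoder(dim=DIM, hidden=HIDDEN, seed=0):
    """Build one randomly initialised tanh layer; both branches reuse these weights."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(hidden)]
    bias = [0.0] * hidden

    def encode(vec):
        return [
            math.tanh(sum(w * x for w, x in zip(row, vec)) + b)
            for row, b in zip(weights, bias)
        ]

    return encode


def distance(a, b):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def siamese_distance(encode, left, right):
    """Encode both records with the SAME encoder and compare the embeddings."""
    return distance(encode(featurise(left)), encode(featurise(right)))


encode = make_encoder()
same = siamese_distance(encode, "John Smith Belfast", "belfast smith john")
different = siamese_distance(encode, "John Smith", "John Smith Belfast")
```

Because both branches share one set of weights, the comparison is symmetric, and two records with the same tokens map to the same embedding regardless of attribute order. In a trained model the weights would be learned from labelled matching and non-matching pairs so that small distances indicate a likely link; that training step is omitted here.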

Journal

International Journal of Web Information Systems, Emerald Publishing

Published: Dec 1, 2021

Keywords: Unstructured data; Record linkage; Siamese neural network
