Trie-join: a trie-based method for efficient string similarity joins

Jianhua Feng; Jiannan Wang; Guoliang Li

doi:10.1007/s00778-011-0252-8

Publisher: Springer Journals
Copyright: Copyright © 2012 by Springer-Verlag
Subject: Computer Science; Database Management
ISSN: 1066-8888
eISSN: 0949-877X
DOI: 10.1007/s00778-011-0252-8

Abstract

A string similarity join finds similar pairs between two collections of strings. Many applications, e.g., data integration and cleaning, can benefit significantly from an efficient string-similarity-join algorithm. In this paper, we study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and suffer from the following limitations: (1) they are inefficient for data sets with short strings (average string length no larger than 30); (2) they involve large indexes; (3) they make dynamic updates of the data sets expensive to support. To address these problems, we propose a novel method called trie-join, which can generate results efficiently with small indexes. We index the strings in a trie and use the trie structure to find similar string pairs efficiently through subtrie pruning. We devise efficient trie-join algorithms and pruning techniques to achieve high performance. Our method can be easily extended to support dynamic updates of the data sets efficiently. We conducted extensive experiments on four real data sets. Experimental results show that our algorithms outperform state-of-the-art methods by an order of magnitude on the data sets with short strings.
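
To make the idea of subtrie pruning concrete, here is a minimal Python sketch. It is not the trie-join algorithm of the paper, whose details are not given in this abstract; it is a simpler, well-known technique that probes a trie once per string and carries one row of the edit-distance table down each path. A subtrie is skipped as soon as every entry of its row exceeds the threshold, because deeper rows can only grow. The names `TrieNode`, `build_trie`, `search`, `similarity_self_join`, and the threshold parameter `tau` are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


class TrieNode:
    """One trie node; `ids` holds the indexes of strings that end here."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.ids: List[int] = []


def build_trie(strings: List[str]) -> TrieNode:
    """Index every string in a character trie (shared prefixes share nodes)."""
    root = TrieNode()
    for i, s in enumerate(strings):
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.ids.append(i)
    return root


def search(root: TrieNode, query: str, tau: int) -> List[Tuple[int, int]]:
    """Return (string id, distance) for indexed strings within tau of query."""
    results: List[Tuple[int, int]] = []
    # Row for the empty prefix: distance from "" to each prefix of query.
    stack = [(root, list(range(len(query) + 1)))]
    while stack:
        node, row = stack.pop()
        if row[-1] <= tau:
            results.extend((i, row[-1]) for i in node.ids)
        for ch, child in node.children.items():
            new_row = [row[0] + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1,   # insertion
                                   row[j] + 1,           # deletion
                                   row[j - 1] + cost))   # match or substitution
            # Subtrie pruning: the row minimum never decreases with depth,
            # so nothing below this child can be within tau.
            if min(new_row) <= tau:
                stack.append((child, new_row))
    return results


def similarity_self_join(strings: List[str], tau: int) -> List[Tuple[int, int]]:
    """All index pairs (i, j), i < j, whose edit distance is at most tau."""
    root = build_trie(strings)
    pairs = set()
    for i, s in enumerate(strings):
        for j, _ in search(root, s, tau):
            if i < j:
                pairs.add((i, j))
    return sorted(pairs)
```

For example, `similarity_self_join(["kaushik", "kaushuk", "chaudhuri", "chadhuri", "trie"], 1)` pairs the two spellings of each name and leaves `"trie"` unmatched. The sketch shares work only across strings with a common prefix on the index side; it still runs one probe per string on the query side.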

Journal

The VLDB Journal – Springer Journals

Published: Aug 1, 2012
