A string similarity join finds similar pairs between two collections of strings. Many applications, e.g., data integration and cleaning, can benefit significantly from an efficient string-similarity-join algorithm. In this paper, we study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and suffer from the following limitations: (1) they are inefficient on data sets with short strings (average string length of at most 30); (2) they require large indexes; (3) they are expensive to maintain under dynamic updates of the data sets. To address these problems, we propose a novel method called trie-join, which generates results efficiently with small indexes. We index the strings in a trie and use the trie structure to find similar string pairs efficiently through subtrie pruning. We devise efficient trie-join algorithms and pruning techniques to achieve high performance. Our method can be easily extended to support dynamic updates of data sets efficiently. We conducted extensive experiments on four real data sets. Experimental results show that our algorithms outperform state-of-the-art methods by an order of magnitude on data sets with short strings.
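To make the idea of subtrie pruning concrete, the following is a minimal Python sketch, not the trie-join algorithm of the paper. It assumes a simpler baseline: one collection is indexed in a trie, and each string of the other collection is matched against it by carrying one row of the edit-distance table down the trie. A subtrie is skipped as soon as every entry of the current row exceeds the threshold, since no string below that node can then be within the threshold. The names `TrieNode`, `build_trie`, `search`, `similarity_join`, and `tau` are illustrative and do not come from the paper.

```python
from typing import Dict, List, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.ids: List[int] = []  # ids of indexed strings that end at this node


def build_trie(strings: List[str]) -> TrieNode:
    """Index the strings in a trie; strings with a common prefix share a path."""
    root = TrieNode()
    for i, s in enumerate(strings):
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.ids.append(i)
    return root


def search(root: TrieNode, query: str, tau: int) -> List[int]:
    """Return ids of indexed strings within edit distance tau of query."""
    results: List[int] = []
    n = len(query)

    def visit(node: TrieNode, ch: str, prev_row: List[int]) -> None:
        # row[j] = edit distance between this node's prefix and query[:j]
        row = [prev_row[0] + 1]
        for j in range(1, n + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1,           # skip a query character
                           prev_row[j] + 1,          # skip a trie character
                           prev_row[j - 1] + cost))  # match or substitute
        if row[n] <= tau:
            results.extend(node.ids)
        # Subtrie pruning: the row minimum never decreases along a path,
        # so if it already exceeds tau, nothing below can qualify.
        if min(row) <= tau:
            for c, child in node.children.items():
                visit(child, c, row)

    first_row = list(range(n + 1))  # distance from the empty prefix
    if first_row[n] <= tau:
        results.extend(root.ids)
    for c, child in root.children.items():
        visit(child, c, first_row)
    return results


def similarity_join(left: List[str], right: List[str], tau: int) -> List[Tuple[int, int]]:
    """Return sorted index pairs (i, j) with edit distance(left[i], right[j]) <= tau."""
    root = build_trie(right)
    return sorted((i, j) for i, s in enumerate(left) for j in search(root, s, tau))


if __name__ == "__main__":
    print(similarity_join(["vldb", "pvldb", "sigmod"],
                          ["vldb", "icde", "sigmd", "pvldb"], 1))
```

This baseline probes the trie once per string, so it shares work only among the indexed strings. The abstract describes finding similar pairs from the trie structure itself, which this sketch does not attempt.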
The VLDB Journal – Springer Journals
Published: Aug 1, 2012