Conferences in Research and Practice in Information Technology
  

Online Version - Last Updated - 20 Jan 2012

 

 

Two Stage Similarity-aware Indexing for Large-scale Real-time Entity Resolution

Li, S., Liang, H. and Ramadan, B.

    Entity resolution is the process of identifying records in one or multiple data sources that represent the same real-world entity. Finding all the records that belong to the same entity as a query record in real time is challenging for existing entity resolution approaches, especially on large-scale datasets. In this paper, we propose a two-stage similarity-aware indexing approach for large-scale real-time entity resolution. In the first stage, we use locality sensitive hashing to filter out records with low similarities, decreasing the number of comparisons. In the second stage, we pre-calculate the comparison similarities of the attribute values to further decrease the query time. Experiments conducted on a large-scale dataset with over 2 million records show the effectiveness of the proposed approach.
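
The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract names only locality sensitive hashing and pre-calculated attribute similarities, so the MinHash-with-banding scheme, the character-bigram Jaccard similarity, the parameter values, and every name below (`TwoStageIndex`, `minhash`, `NUM_HASHES`, `BANDS`) are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 16  # MinHash signature length (assumed value)
BANDS = 8        # number of LSH bands (assumed value)
ROWS = NUM_HASHES // BANDS


def bigrams(value):
    """Return the set of character bigrams of a string."""
    v = value.lower()
    return {v[i:i + 2] for i in range(len(v) - 1)} or {v}


def jaccard(a, b):
    """Jaccard similarity between the bigram sets of two strings."""
    x, y = bigrams(a), bigrams(b)
    return len(x & y) / len(x | y)


def _hash(seed, token):
    """Deterministic seeded hash of a token."""
    return int(hashlib.md5(f"{seed}:{token}".encode()).hexdigest(), 16)


def minhash(value):
    """MinHash signature of a string over its bigram set."""
    grams = bigrams(value)
    return tuple(min(_hash(s, g) for g in grams) for s in range(NUM_HASHES))


class TwoStageIndex:
    """Hypothetical index: LSH blocking plus cached value similarities."""

    def __init__(self):
        self.buckets = defaultdict(set)  # (band, band signature) -> record ids
        self.records = {}                # record id -> attribute value
        self.sim_cache = {}              # frozenset of two values -> similarity

    def _keys(self, value):
        sig = minhash(value)
        return [(b, sig[b * ROWS:(b + 1) * ROWS]) for b in range(BANDS)]

    def add(self, rec_id, value):
        self.records[rec_id] = value
        for key in self._keys(value):
            # Stage 2: pre-calculate similarities to values sharing a bucket.
            for other in self.buckets[key]:
                pair = frozenset((value, self.records[other]))
                if pair not in self.sim_cache:
                    self.sim_cache[pair] = jaccard(value, self.records[other])
            self.buckets[key].add(rec_id)

    def query(self, value, threshold=0.5):
        # Stage 1: only records sharing at least one LSH bucket are compared.
        candidates = set()
        for key in self._keys(value):
            candidates |= self.buckets.get(key, set())
        results = []
        for rec_id in candidates:
            other = self.records[rec_id]
            sim = self.sim_cache.get(frozenset((value, other)))
            if sim is None:  # value pair not seen at build time
                sim = jaccard(value, other)
            if sim >= threshold:
                results.append((rec_id, sim))
        return sorted(results, key=lambda r: -r[1])
```

In this sketch, a query value that already occurs in the index reuses cached similarities where they exist, and any other pair falls back to an on-the-fly comparison. Records that share no bucket with the query are never compared, which is where the reduction in comparisons comes from.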
Cite as: Li, S., Liang, H. and Ramadan, B. (2013). Two Stage Similarity-aware Indexing for Large-scale Real-time Entity Resolution. In Proc. Eleventh Australasian Data Mining Conference (AusDM13) Canberra, Australia. CRPIT, 146. Christen, P., Kennedy, P., Liu, L., Ong, K.L., Stranieri, A. and Zhao, Y. Eds., ACS. 107-115