Two Stage Similarity-aware Indexing for Large-scale Real-time Entity Resolution
Li, S., Liang, H. and Ramadan, B.
Entity resolution is the process of identifying records
in one or multiple data sources that represent the
same real-world entity. Finding all the records
that belong to the same entity as a query record
in real time poses a challenge for existing entity resolution
approaches, and the challenge is especially acute
for large-scale datasets. In this paper, we propose
a two-stage similarity-aware indexing approach
for large-scale real-time entity resolution. In the first
stage, we use locality sensitive hashing to filter out
records with low similarity, thereby decreasing the
number of comparisons. In the
second stage, we pre-calculate the comparison similarities
of the attribute values to further decrease the
query time. Experiments conducted on a large-scale
dataset with over 2 million records show the
effectiveness of the proposed approach.
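The two stages described above can be illustrated with a minimal sketch. The abstract does not specify the hash family, the similarity function, or the index layout, so everything below is an assumption made for illustration: MinHash over character bigrams with banding stands in for the locality sensitive hashing stage, Jaccard similarity stands in for the attribute comparison, and the names `TwoStageIndex`, `bands`, `rows`, and `threshold` are hypothetical.

```python
import hashlib
from collections import defaultdict


def bigrams(value):
    """Character bigrams of a lower-cased string (assumed tokenisation)."""
    s = value.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)} or {s}


def jaccard(a, b):
    """Jaccard similarity of two strings' bigram sets (assumed comparison)."""
    x, y = bigrams(a), bigrams(b)
    return len(x & y) / len(x | y)


def minhash(tokens, num_hashes):
    """MinHash signature: per seed, the minimum seeded hash over all tokens."""
    def h(seed, token):
        digest = hashlib.blake2b(f"{seed}:{token}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big")
    return tuple(min(h(seed, t) for t in tokens) for seed in range(num_hashes))


class TwoStageIndex:
    """Hypothetical index: LSH buckets (stage 1) plus cached similarities (stage 2)."""

    def __init__(self, bands=8, rows=2):
        self.bands, self.rows = bands, rows
        self.buckets = defaultdict(set)  # (band, signature slice) -> record ids
        self.records = {}                # record id -> attribute value
        self.sim = {}                    # sorted value pair -> similarity

    def _keys(self, value):
        sig = minhash(bigrams(value), self.bands * self.rows)
        return [(b, sig[b * self.rows:(b + 1) * self.rows])
                for b in range(self.bands)]

    def add(self, rec_id, value):
        self.records[rec_id] = value
        for key in self._keys(value):
            bucket = self.buckets[key]
            # Stage 2: pre-calculate similarities for values sharing a bucket.
            for other in bucket:
                pair = tuple(sorted((value, self.records[other])))
                if pair not in self.sim:
                    self.sim[pair] = jaccard(*pair)
            bucket.add(rec_id)

    def query(self, value, threshold=0.5):
        # Stage 1: only records sharing at least one LSH bucket are candidates.
        candidates = set()
        for key in self._keys(value):
            candidates |= self.buckets.get(key, set())
        results = []
        for rec_id in candidates:
            pair = tuple(sorted((value, self.records[rec_id])))
            score = self.sim.get(pair)
            if score is None:  # value pair not seen at index time
                score = jaccard(*pair)
            if score >= threshold:
                results.append((rec_id, score))
        return sorted(results, key=lambda r: -r[1])
```

In this sketch, a query touches only the records that share a bucket with it, and a similarity is computed at query time only when the value pair was not already cached during indexing. Raising `rows` makes each band stricter and so yields fewer candidates, while raising `bands` gives similar records more chances to collide.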
Cite as: Li, S., Liang, H. and Ramadan, B. (2013). Two Stage Similarity-aware Indexing for Large-scale Real-time Entity Resolution. In Proc. Eleventh Australasian Data Mining Conference (AusDM13), Canberra, Australia. CRPIT, 146. Christen, P., Kennedy, P., Liu, L., Ong, K.L., Stranieri, A. and Zhao, Y. Eds., ACS. 107-115.