A Two-Step Classification Approach to Unsupervised Record Linkage
Christen, P.
Linking or matching databases is becoming increasingly important in many data mining projects, as linked data can contain information that is not available otherwise, or that would be too expensive to collect manually. A main challenge when linking large databases is the classification of the compared record pairs into matches and non-matches. In traditional record linkage, classification thresholds have to be set either manually or using an EM-based approach. More recently developed classification methods are mainly based on supervised machine learning techniques and thus require training data, which is often not available in real-world situations or has to be prepared manually. In this paper, a novel two-step approach to record pair classification is presented. In the first step, example training data of high quality is generated automatically; in the second step, this data is used to train a supervised classifier. Initial experimental results on both real and synthetic data show that this approach can outperform traditional unsupervised clustering, and even achieve linkage quality almost as good as fully supervised techniques.
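The two-step idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes each compared record pair is represented as a vector of field similarities in [0, 1], selects seed examples with hypothetical thresholds `hi` and `lo`, and uses a simple nearest-centroid rule as a stand-in for the supervised classifier, since the abstract does not say which classifier is trained. All function names are illustrative.

```python
from statistics import fmean


def select_seeds(vectors, hi=0.9, lo=0.1):
    """Step 1: pick pairs that are very likely matches or non-matches.

    Assumption: a pair whose similarities are all >= hi is treated as a
    match seed, and one whose similarities are all <= lo as a non-match seed.
    """
    matches = [v for v in vectors if min(v) >= hi]
    non_matches = [v for v in vectors if max(v) <= lo]
    return matches, non_matches


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_step_classify(vectors, hi=0.9, lo=0.1):
    """Step 2: train on the seeds, then label every pair.

    A nearest-centroid rule stands in for the supervised classifier.
    Returns True for pairs classified as matches.
    """
    matches, non_matches = select_seeds(vectors, hi, lo)
    if not matches or not non_matches:
        raise ValueError("no seed examples found; relax the thresholds")
    match_c, non_match_c = centroid(matches), centroid(non_matches)
    return [sq_dist(v, match_c) <= sq_dist(v, non_match_c) for v in vectors]


# Made-up similarity vectors (e.g. name, address, date of birth).
vectors = [
    [1.0, 1.0, 0.95],  # clear match, used as seed
    [0.9, 1.0, 1.0],   # clear match, used as seed
    [0.0, 0.1, 0.0],   # clear non-match, used as seed
    [0.05, 0.0, 0.1],  # clear non-match, used as seed
    [0.8, 0.7, 1.0],   # ambiguous, labelled in step 2
    [0.2, 0.4, 0.0],   # ambiguous, labelled in step 2
]
labels = two_step_classify(vectors)
print(labels)
```

In this toy data the first four pairs serve as automatically selected training examples, and the two ambiguous pairs are labelled only by the classifier trained on them. No manually labelled data is needed at any point, which is the property the abstract emphasises.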
Cite as: Christen, P. (2007). A Two-Step Classification Approach to Unsupervised Record Linkage. In Proc. Sixth Australasian Data Mining Conference (AusDM 2007), Gold Coast, Australia. CRPIT, 70. Christen, P., Kennedy, P. J., Li, J., Kolyshkina, I. and Williams, G. J., Eds. ACS. 111-119.