A Load-Balanced MapReduce Algorithm for Blocking-based Entity-resolution with Multiple Keys
Hsueh, S., Lin, M. and Chiu, Y.
Entity resolution (ER), which detects records referring to the same entity across data sources, is a long-standing challenge in database management research. The sheer volume of today's data collections calls for a blocking-based ER algorithm that uses the MapReduce framework for cloud computing. Most studies on blocking-based ER assume that only one blocking key is associated with an entity. In reality, an entity may have multiple blocking keys in some applications. When entities have a number of blocking keys, ER can be more efficient, since two entities can form a similar pair only if they share several common keys. Therefore, we propose a MapReduce algorithm to solve the ER problem for a huge collection of entities with multiple keys. The algorithm is characterized by combination-based blocking and load-balanced matching. The combination-based blocking uses the multiple keys to select the entity pairs that need to be matched later. The load-balanced matching evenly distributes the required similarity computations across all the reducers in the matching step, removing the bottleneck of skewed matching computations on a single node in a MapReduce framework. Our experiments using the well-known CiteSeerX digital library show that the proposed algorithm is both efficient and scalable.
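The two ideas named in the abstract can be illustrated with a small single-process sketch. It is not the authors' algorithm: the abstract does not give the details, so the sketch assumes that "several common keys" means at least `min_shared` shared keys, that blocks are formed from key combinations of that size, and that balancing is done by a simple round-robin assignment of candidate pairs to reducers. All function names and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def combination_blocks(entities, min_shared):
    """Map step (sketch): one block per combination of `min_shared` keys."""
    blocks = defaultdict(set)
    for entity_id, keys in entities.items():
        # Sorting makes the combination a canonical block identifier.
        for combo in combinations(sorted(keys), min_shared):
            blocks[combo].add(entity_id)
    return blocks


def candidate_pairs(blocks):
    """Collect every pair that co-occurs in some block, exactly once."""
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return sorted(pairs)


def assign_round_robin(pairs, num_reducers):
    """Assumed balancing rule: reducer loads differ by at most one pair."""
    loads = [[] for _ in range(num_reducers)]
    for index, pair in enumerate(pairs):
        loads[index % num_reducers].append(pair)
    return loads


# Hypothetical toy data: entity id -> set of blocking keys.
entities = {
    "e1": {"k1", "k2", "k3"},
    "e2": {"k1", "k2", "k4"},
    "e3": {"k2", "k3", "k4"},
    "e4": {"k5"},
}
blocks = combination_blocks(entities, min_shared=2)
pairs = candidate_pairs(blocks)
loads = assign_round_robin(pairs, num_reducers=2)
```

In this toy run, `e4` has too few keys to form any combination and is never compared, while the remaining candidate pairs are split across the two reducers regardless of how large any individual block is. A real MapReduce job would perform the similarity computation inside each reducer and would need a distributed way to number or partition the pairs.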
Cite as: Hsueh, S., Lin, M. and Chiu, Y. (2014). A Load-Balanced MapReduce Algorithm for Blocking-based Entity-resolution with Multiple Keys. In Proc. Twelfth Australasian Symposium on Parallel and Distributed Computing (AusPDC 2014), Auckland, New Zealand. CRPIT, 152. Javadi, B. and Garg, S. K. Eds., ACS. 3-9.