In standard text retrieval systems, the documents are gathered and indexed on a single server. In distributed information retrieval (DIR), the documents are held in multiple collections; answers to queries are produced by selecting the collections to query and then merging results from these collections. However, in most prior research in the area, collections are assumed to be disjoint. In this paper, we investigate the effectiveness of different combinations of server selection and result merging algorithms in the presence of duplicates. We also test our hash-based method for efficiently detecting duplicates and near-duplicates in the lists of documents returned by collections. Our results, based on two different designs of test data, indicate that some DIR methods are more likely to return duplicate documents, and show that removing such redundant documents can have a significant impact on the final search effectiveness.
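As a rough illustration of the merging-with-duplicate-removal step described above, the following minimal sketch merges ranked lists from several collections and drops documents whose normalized text hashes to a fingerprint already seen. It is not the authors' method: it catches exact duplicates only (after whitespace and case normalization), not near-duplicates. It also assumes that scores from different collections are directly comparable. The names `fingerprint`, `merge_without_duplicates`, and the `(doc_id, score, text)` result shape are hypothetical and chosen purely for illustration.

```python
import hashlib
from typing import Dict, List, Tuple

# Hypothetical result shape: (doc_id, score, text) as returned by one collection.
Result = Tuple[str, float, str]


def fingerprint(text: str) -> str:
    """Hash of the case-folded, whitespace-normalized text (exact duplicates only)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def merge_without_duplicates(
    result_lists: Dict[str, List[Result]],
) -> List[Tuple[str, str, float]]:
    """Merge per-collection result lists by score, keeping the first copy of each document.

    Assumes scores are comparable across collections, which real DIR
    systems generally have to arrange through score normalization.
    """
    pooled = [
        (score, collection, doc_id, text)
        for collection, results in result_lists.items()
        for doc_id, score, text in results
    ]
    # Highest score first; Python's sort is stable, so ties keep input order.
    pooled.sort(key=lambda item: -item[0])

    seen = set()
    merged = []
    for score, collection, doc_id, text in pooled:
        fp = fingerprint(text)
        if fp in seen:
            continue  # a copy of this document was already ranked higher
        seen.add(fp)
        merged.append((collection, doc_id, score))
    return merged
```

Calling `merge_without_duplicates` with a dictionary that maps collection names to result lists returns a single ranked list of `(collection, doc_id, score)` tuples in which each distinct document appears once, at the position of its highest-scoring copy.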
Cite as: Shokouhi, M., Zobel, J. and Bernstein, Y. (2007). Distributed Text Retrieval From Overlapping Collections. In Proc. Eighteenth Australasian Database Conference (ADC 2007), Ballarat, Australia. CRPIT, 63. Bailey, J. and Fekete, A., Eds. ACS. 141-150.