Detecting Digital Newspaper Duplicates with Focus on eliminating OCR errors
Peden, Y. and Nayak, R.
Documents such as newspapers have been increasingly
converted into electronic documents and become
available for user search. Many of these newspaper
articles appear in several publication avenues with some
variations. Their presence decreases both the effectiveness
and the efficiency of search engines, which directly affects
the user experience. This calls for the development of a
duplicate detection method; however, digitized
newspapers in particular pose their own unique
challenges. One important challenge discussed in
this paper is the presence of OCR (Optical Character
Recognition) errors, which negatively affect the value of
the document collection. The frequency of syndicated stories
within the newspaper domain poses another challenge
during the duplicate/near-duplicate detection process. This
paper introduces a clustering-based duplicate detection
method that detects duplicate and near-duplicate digitized
newspaper articles. We present the experiments and
assessments of the results on three different data subsets
obtained from the Trove digitized newspaper collection.
Cite as: Peden, Y. and Nayak, R. (2014). Detecting Digital Newspaper Duplicates with Focus on eliminating OCR errors. In Proc. Twelfth Australasian Data Mining Conference (AusDM14) Brisbane, Australia. CRPIT, 158. Li, X., Liu, L., Ong, K.L. and Zhao, Y. Eds., ACS. 43-49