Conferences in Research and Practice in Information Technology
  

Online Version - Last Updated - 20 Jan 2012

 

 

Detecting Digital Newspaper Duplicates with Focus on eliminating OCR errors

Peden, Y. and Nayak, R.

Documents such as newspapers have increasingly been converted into electronic form and made available for user search. Many of these newspaper articles appear in several publication avenues with some variations. The presence of such duplicates decreases both the effectiveness and the efficiency of search engines, which directly affects the user experience. This motivates the development of a duplicate detection method; however, digitized newspapers in particular pose their own unique challenges. One important challenge discussed in this paper is the presence of OCR (Optical Character Recognition) errors, which negatively affect the value of the document collection. The frequency of syndicated stories within the newspaper domain poses another challenge during the duplicate/near-duplicate detection process. This paper introduces a clustering-based method that detects duplicate and near-duplicate digitized newspaper articles. We present experiments and an assessment of the results on three different data subsets obtained from the Trove digitized newspaper collection.
Cite as: Peden, Y. and Nayak, R. (2014). Detecting Digital Newspaper Duplicates with Focus on eliminating OCR errors. In Proc. Twelfth Australasian Data Mining Conference (AusDM14) Brisbane, Australia. CRPIT, 158. Li, X., Liu, L., Ong, K.L. and Zhao, Y. Eds., ACS. 43-49