Conferences in Research and Practice in Information Technology
  

Online Version - Last Updated - 20 Jan 2012

 

 
 
    

Reassembling Multilingual Temporal News Datasets with Incomplete Information

Robertson, C.S.

    Institutional investors are building increasingly sophisticated algorithmic trading engines that account for textual as well as numerical information. To train these engines, they need large datasets with highly accurate timestamps that cover long periods with differing trading conditions. As a result, there is growing demand for temporal news datasets that reach back beyond the point where full archives are available. Rebuilding the temporal news dataset that was actually transmitted to the market relies on merging multiple datasets, each with incomplete information and sometimes questionable quality. Doing so requires near-duplicate detection in a very large dataset that includes news in many languages. This research is novel in that, in our scenario, the language of any given news article is unknown. In this paper we describe a language-independent near-duplicate detection algorithm and demonstrate its performance on a dataset of tens of millions of news messages in over 20 languages, totalling hundreds of gigabytes of content.
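    To make the idea of language-independent near-duplicate detection concrete, the sketch below compares two messages by their overlapping character n-grams, which needs no tokenizer or language identification. This is a generic illustration and not the algorithm described in the paper; the function names, the n-gram length of 5, and the 0.8 similarity threshold are all assumptions chosen for the example.

```python
def char_shingles(text, n=5):
    """Return the set of overlapping character n-grams in text.

    Working on characters rather than words avoids any
    language-specific tokenization (an assumption of this sketch).
    """
    # Lowercase and collapse whitespace runs so trivial layout
    # differences do not affect the comparison.
    normalized = " ".join(text.lower().split())
    if len(normalized) <= n:
        return {normalized}
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, n=5, threshold=0.8):
    """Treat two messages as near duplicates when their shingle sets
    overlap by at least `threshold` (a hypothetical cut-off)."""
    return jaccard(char_shingles(text_a, n), char_shingles(text_b, n)) >= threshold
```

    A pairwise comparison like this would not scale to tens of millions of messages on its own; in practice it would typically be combined with a hashing or indexing scheme to limit the number of candidate pairs.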
Cite as: Robertson, C.S. (2011). Reassembling Multilingual Temporal News Datasets with Incomplete Information. In Proc. Australasian Data Mining Conference (AusDM 11), Ballarat, Australia. CRPIT, 121. Vamplew, P., Stranieri, A., Ong, K.-L., Christen, P. and Kennedy, P. J. Eds., ACS. 91-102.