Exploiting a Proximity-based Positional Model to Improve the Quality of Information Extraction by Text Segmentation

Huynh, D. and Zhou, X.

    A large number of web pages contain information of entities in a form of lists of field values. Those implicit semi-structured records are often available in textual sources on the web such as advertisings of products, postal addresses, bibliographic information, etc. Harvesting information of those entities from such lists of field values is challenge task because the lists are manually generated, not written in a well-defined templates or may miss some information. In this paper, we introduce a proximity-based positional model (PPM) to improve the quality of extracting information by text segmentation. Our proposed model offers improvements over the fixed-positional model proposed in ONDUX, a current state-of-art method for information extraction by text segmentation (IETS) to revise the labels of text segments in an input list of field values. Different from fixed-positional model in previous work, the key idea of PPM is to define proximity heuristic for labels in an input list in a unified language model. Our proposed model is estimated based on propagated counts of labels through a proximity-based density function. We propose and study several density functions and experimental results on different domains show that PPM is effective to revise labels and helps to improve performance of current state-of-art method.
Cite as: Huynh, D. and Zhou, X. (2013). Exploiting a Proximity-based Positional Model to Improve the Quality of Information Extraction by Text Segmentation. In Proc. Database Technologies 2013 (ADC 2013) Adelaide, Australia. CRPIT, 137. Wang, H. and Zhang, R. Eds., ACS. 23-32
pdf (from crpit.com) pdf (local if available) BibTeX EndNote GS