Conferences in Research and Practice in Information Technology
  

Online Version - Last Updated - 20 Jan 2012

 

 
A Two-Phase Rule Generation and Optimization Approach for Wrapper Generation

Hao, Y. and Zhang, Y.

    Web information extraction is a fundamental issue for web information management and integration. A common approach is to use wrappers to extract data from web pages or documents. However, a critical issue in wrapper development is how to generate the extraction rules. In this paper, we propose a novel two-phase rule generation and optimization (2P-RULE) approach for wrapper generation. 2P-RULE consists of an internal rule optimization (IRO) process and an external rule optimization (ERO) process. In IRO, a user first creates, through a GUI, a mapping from useful values in a web page to a schema that the user specifies according to the target web information. Based on this mapping, the system automatically generates a rule list for the schema. In ERO, the user can create multiple mappings to generate further rule lists. All the acquired rule lists are then merged and refined into one optimized rule list, which is expressed in XQuery as the final extraction rules. Experiments show that the 2P-RULE approach is suitable for extracting information from web pages with complex nested structure, and that it also achieves better precision and recall.
Cite as: Hao, Y. and Zhang, Y. (2006). A Two-Phase Rule Generation and Optimization Approach for Wrapper Generation. In Proc. Seventeenth Australasian Database Conference (ADC2006), Hobart, Australia. CRPIT, 49. Dobbie, G. and Bailey, J., Eds. ACS. 39-48.
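
The merging step that the abstract attributes to ERO can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it assumes that each rule list is a dictionary from a schema field to an XPath-like path string, and that merging means dropping positional indexes that differ between two mappings. The names `generalize` and `merge_rule_lists`, the path format, and the sample fields are all hypothetical.

```python
import re
from typing import Dict, List

# Hypothetical representation: schema field -> path of the value the user selected.
RuleList = Dict[str, str]

# One path step: a tag name with an optional positional index, e.g. "tr[3]".
STEP = re.compile(r"^(?P<tag>[^\[\]]+)(\[(?P<idx>\d+)\])?$")


def generalize(path_a: str, path_b: str) -> str:
    """Merge two paths step by step, dropping positional indexes that differ."""
    steps_a = path_a.strip("/").split("/")
    steps_b = path_b.strip("/").split("/")
    if len(steps_a) != len(steps_b):
        raise ValueError("paths have different depths")
    merged = []
    for a, b in zip(steps_a, steps_b):
        match_a, match_b = STEP.match(a), STEP.match(b)
        if match_a is None or match_b is None or match_a["tag"] != match_b["tag"]:
            raise ValueError(f"incompatible steps: {a!r} and {b!r}")
        # Keep the step when both mappings agree, otherwise keep only the tag.
        merged.append(a if a == b else match_a["tag"])
    return "/" + "/".join(merged)


def merge_rule_lists(rule_lists: List[RuleList]) -> RuleList:
    """Fold several rule lists (one per user mapping) into a single rule list."""
    merged: RuleList = dict(rule_lists[0])
    for rules in rule_lists[1:]:
        for field, path in rules.items():
            merged[field] = generalize(merged[field], path) if field in merged else path
    return merged


# Two mappings made on different rows of the same (hypothetical) table.
first = {"title": "/html/body/table/tr[1]/td[1]", "price": "/html/body/table/tr[1]/td[2]"}
second = {"title": "/html/body/table/tr[3]/td[1]", "price": "/html/body/table/tr[3]/td[2]"}
optimized = merge_rule_lists([first, second])
```

Under these assumptions, the row index that differs between the two mappings is dropped while the column index that both mappings share is kept, so each merged rule matches every row of the table. Translating the merged paths into XQuery, as the paper does for its final rules, is outside the scope of this sketch.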
 

 

© Copyright Australian Computer Society Inc. 2001-2014.
Comments should be sent to the webmaster at crpit@scem.uws.edu.au.
This page last updated 16 Nov 2007