Document Classification via Structure Synopses

Ma, L., Shepherd, J. and Nguyen, A.

    Information available on the Internet is frequently supplied simply as plain ASCII text, structured according to orthographic and semantic conventions. Traditional document classification is typically formulated as a learning problem in which each instance is a whole document represented by a feature vector. Such feature vectors are often generated from the occurrence and frequency of words in the documents. The high dimensionality of these feature vectors causes problems: important clues may be missed, and the classification may be misled by trivial elements. In this paper, we propose a method that uses structuring conventions to reduce the size of the feature vector without affecting the accuracy of the classification process. Effectively, a synopsis of the document structure is extracted, which contains only the most informative features; a succinct feature vector is then generated to represent the instance. Finally, a decision tree machine learning algorithm is used to classify the document based on its succinct feature vector.
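The three stages described above (synopsis extraction, succinct feature vector, decision tree classification) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not list the structural features or name the decision tree learner, so the five features in `structure_synopsis`, the small Gini-based tree in `build_tree`, and the toy "email" and "list" documents are all hypothetical stand-ins, not the authors' method.

```python
import re
from collections import Counter

def structure_synopsis(text):
    """Map plain text to a short vector of structural (not lexical) features.
    The five features are illustrative assumptions, not taken from the paper."""
    lines = text.splitlines()
    nonblank = [ln for ln in lines if ln.strip()]
    n = max(len(nonblank), 1)
    return [
        sum(1 for ln in lines if not ln.strip()) / max(len(lines), 1),  # blank-line ratio
        sum(1 for ln in nonblank if ln[0] in " \t") / n,                # indented lines
        sum(1 for ln in nonblank
            if re.match(r"\s*([-*]|\d+[.)])\s", ln)) / n,               # list items
        sum(1 for ln in nonblank
            if re.match(r"[A-Za-z-]+:\s", ln)) / n,                     # "Key: value" lines
        sum(len(ln) for ln in nonblank) / n,                            # mean line length
    ]

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, depth=0, max_depth=3):
    """Tiny threshold-split decision tree (a stand-in for the paper's learner).
    A leaf is a class label; an inner node is (feature, threshold, low, high)."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or depth == max_depth:
        return majority
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            lo = [i for i, r in enumerate(rows) if r[f] <= t]
            hi = [i for i, r in enumerate(rows) if r[f] > t]
            if not lo or not hi:
                continue
            score = (len(lo) * gini([labels[i] for i in lo])
                     + len(hi) * gini([labels[i] for i in hi])) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t, lo, hi)
    if best is None:  # no feature separates the rows
        return majority
    _, f, t, lo, hi = best
    return (f, t,
            build_tree([rows[i] for i in lo], [labels[i] for i in lo], depth + 1, max_depth),
            build_tree([rows[i] for i in hi], [labels[i] for i in hi], depth + 1, max_depth))

def classify(tree, row):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        f, t, lo, hi = tree
        tree = lo if row[f] <= t else hi
    return tree

# Toy training set (hypothetical documents and class names).
TRAIN = [
    ("From: alice@example.org\nTo: bob@example.org\nSubject: Meeting\n\nSee you at noon.\n", "email"),
    ("From: carol@example.org\nSubject: Report\nDate: Monday\n\nThe report is attached.\n", "email"),
    ("Shopping\n\n- milk\n- eggs\n- bread\n", "list"),
    ("Steps\n\n1. open the file\n2. edit it\n3. save\n", "list"),
]
tree = build_tree([structure_synopsis(doc) for doc, _ in TRAIN],
                  [label for _, label in TRAIN])

print(classify(tree, structure_synopsis("Todo\n\n* write tests\n* fix bug\n")))
```

Each document is reduced to five numbers regardless of its vocabulary, which is the point of the approach: the classifier sees only structural cues rather than a high-dimensional word-frequency vector.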
Cite as: Ma, L., Shepherd, J. and Nguyen, A. (2003). Document Classification via Structure Synopses. In Proc. Fourteenth Australasian Database Conference (ADC2003), Adelaide, Australia. CRPIT, 17. Schewe, K.-D. and Zhou, X., Eds. ACS. 59-65.