Categorical Proportional Difference: A Feature Selection Method for Text Categorization

Simeon, M. and Hilderman, R.

    Supervised text categorization is a machine learning task where a predefined category label is automatically assigned to a previously unlabelled document based upon characteristics of the words contained in the document. Since the number of unique words in a learning task (i.e., the number of features) can be very large, the efficiency and accuracy of the learning task can be increased by using feature selection methods to extract from a document a subset of the features that are considered most relevant. In this paper, we introduce a new feature selection method called categorical proportional difference (CPD), a measure of the degree to which a word contributes to differentiating a particular category from other categories. The CPD for a word in a particular category in a text corpus is a ratio that considers the number of documents of a category in which the word occurs and the number of documents from other categories in which the word also occurs. We conducted a series of experiments to evaluate CPD when used in conjunction with SVM and Naive Bayes text classifiers on the OHSUMED, 20 Newsgroups, and Reuters-21578 text corpora. Recall, precision, and the F-measure were used as the measures of performance. The results obtained using CPD were compared to those obtained using six common feature selection methods found in the literature: chi-squared, information gain, document frequency, mutual information, odds ratio, and simplified chi-squared. Empirical results showed that, in general, according to the F-measure, CPD outperforms the other feature selection methods in four out of six text categorization tasks.
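The abstract describes CPD only in words and does not state its formula. As a minimal sketch, assume the ratio has the form CPD(w, c) = (A - B) / (A + B), where A is the number of documents of category c that contain word w and B is the number of documents from other categories that contain w. Under that assumption, a word found only in one category scores 1 for it, and a word split evenly between a category and the rest scores 0. The function name `cpd_scores` and the toy corpus are illustrative and do not come from the paper.

```python
from collections import defaultdict


def cpd_scores(docs):
    """Return {(word, category): CPD} for a labelled corpus.

    docs is an iterable of (category, words) pairs.
    Assumed form (not stated in the abstract): CPD = (A - B) / (A + B).
    """
    in_category = defaultdict(int)  # (word, category) -> docs of that category containing word
    in_corpus = defaultdict(int)    # word -> docs in the whole corpus containing word

    for category, words in docs:
        for word in set(words):     # count each word once per document
            in_category[(word, category)] += 1
            in_corpus[word] += 1

    scores = {}
    for (word, category), a in in_category.items():
        b = in_corpus[word] - a     # docs from other categories containing word
        scores[(word, category)] = (a - b) / (a + b)
    return scores


# Hypothetical toy corpus, for illustration only.
corpus = [
    ("sports", ["goal", "match", "team"]),
    ("sports", ["goal", "team", "win"]),
    ("politics", ["vote", "team", "election"]),
    ("politics", ["vote", "win"]),
]
scores = cpd_scores(corpus)
```

In this toy corpus, "goal" appears only in sports documents, so its score for that category is at the maximum, while "win" appears once in each category and carries no discriminating weight. A selection step could then keep only the words whose score exceeds a chosen threshold; the threshold itself would be a tuning choice that the abstract does not specify.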
Cite as: Simeon, M. and Hilderman, R. (2008). Categorical Proportional Difference: A Feature Selection Method for Text Categorization. In Proc. Seventh Australasian Data Mining Conference (AusDM 2008), Glenelg, South Australia. CRPIT, 87. Roddick, J. F., Li, J., Christen, P. and Kennedy, P. J., Eds. ACS. 201-208.