Categorical Proportional Difference: A Feature Selection Method for Text Categorization
Simeon, M. and Hilderman, R.
Supervised text categorization is a machine learning
task where a predefined category label is automatically
assigned to a previously unlabelled document
based upon characteristics of the words contained in
the document. Since the number of unique words in
a learning task (i.e., the number of features) can be
very large, the efficiency and accuracy of the learning
task can be increased by using feature selection
methods to extract from a document a subset of the
features that are considered most relevant. In this
paper, we introduce a new feature selection method
called categorical proportional difference (CPD), a
measure of the degree to which a word contributes
to differentiating a particular category from other
categories. The CPD for a word in a particular
category in a text corpus is a ratio that considers
the number of documents of a category in which
the word occurs and the number of documents from
other categories in which the word also occurs. We
conducted a series of experiments to evaluate CPD
when used in conjunction with SVM and Naive Bayes
text classifiers on the OHSUMED, 20 Newsgroups,
and Reuters-21578 text corpora. Recall, precision,
and the F-measure were used as the measures of
performance. The results obtained using CPD
were compared to those obtained using six common
feature selection methods found in the literature:
chi-squared, information gain, document frequency, mutual
information, odds ratio, and simplified chi-squared. Empirical
results showed that, in general, according to the
F-measure, CPD outperforms the other feature selection
methods in four out of six text categorization
tasks.
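The abstract describes CPD only as a ratio over two document counts and does not give the formula. As a minimal sketch, assume the form CPD(w, c) = (A - B) / (A + B), where A is the number of documents in category c that contain word w and B is the number of documents in other categories that contain w. The function name `cpd_scores` and the toy corpus are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from collections import defaultdict


def cpd_scores(docs):
    """Compute an assumed CPD score for every (word, category) pair.

    docs: iterable of (category, words) pairs, one per document.
    Returns a dict mapping (word, category) to (A - B) / (A + B), where
    A counts documents of that category containing the word and
    B counts documents of all other categories containing the word.
    This formula is an assumption; the abstract only says "a ratio".
    """
    in_category = defaultdict(int)  # (word, category) -> A
    anywhere = defaultdict(int)     # word -> A + B
    for category, words in docs:
        # Use a set so each word counts once per document.
        for word in set(words):
            in_category[(word, category)] += 1
            anywhere[word] += 1

    scores = {}
    for (word, category), a in in_category.items():
        b = anywhere[word] - a
        scores[(word, category)] = (a - b) / (a + b)
    return scores


# Hypothetical toy corpus: two categories, four documents.
corpus = [
    ("sport", ["ball", "team"]),
    ("sport", ["ball", "win"]),
    ("tech", ["ball", "chip"]),
    ("tech", ["chip"]),
]

for (word, category), score in sorted(cpd_scores(corpus).items()):
    print(f"{word:5s} {category:6s} {score:+.2f}")
```

Under this assumed form, a word that occurs only in documents of one category scores 1 for that category, and a word spread across categories scores lower, which matches the abstract's description of CPD as a measure of how strongly a word differentiates a category.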
Cite as: Simeon, M. and Hilderman, R. (2008). Categorical Proportional Difference: A Feature Selection Method for Text Categorization. In Proc. Seventh Australasian Data Mining Conference (AusDM 2008), Glenelg, South Australia. CRPIT, 87. Roddick, J. F., Li, J., Christen, P. and Kennedy, P. J., Eds. ACS. 201-208.