Documents cannot be automatically classified unless they have been represented as a collection of computable features. A model is a representation of a document with computable features. However, a model may not be sufficient to express a document: two documents with the same features are not necessarily classified into the same category. We propose a method for determining the fitness of a document model by using conflict instances. Conflict instances are instances with exactly the same features but with different category labels, given by a human expert in an interactive document labelling process for training the classifier. In this paper, we do not treat conflict instances as noise, but as evidence that can reveal a distribution of positive instances. We develop an approach to representing this distribution information as a hyperplane, called the distribution hyperplane. The fitness problem then becomes a problem of computing the distribution hyperplane. Besides determining the fitness of a model, the distribution hyperplane can also be used 1) as a classifier itself and 2) as a membership function of fuzzy sets. We also propose criteria for selecting an effectiveness measure for a model in the process of fitness computation.
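To make the notion of a conflict instance concrete, the following is a minimal sketch, not the method of the paper. It assumes each labelled instance is a pair of a feature tuple and a binary label (1 for positive, 0 for negative). The function names `find_conflicts` and `positive_fraction` and the toy data are hypothetical and chosen only for illustration. The per-feature-vector share of positive labels is one simple way to look at the distribution of positive instances among conflicts; the abstract does not specify how the distribution hyperplane is computed, so that step is not attempted here.

```python
from collections import defaultdict


def find_conflicts(instances):
    """Group labelled instances by feature vector and keep only the
    groups whose labels disagree (the conflict instances)."""
    labels_by_features = defaultdict(list)
    for features, label in instances:
        labels_by_features[tuple(features)].append(label)
    return {
        features: labels
        for features, labels in labels_by_features.items()
        if len(set(labels)) > 1
    }


def positive_fraction(labels, positive=1):
    """Share of labels in a group that equal the positive label."""
    return sum(1 for label in labels if label == positive) / len(labels)


# Hypothetical toy data: (feature vector, label assigned by the expert).
instances = [
    ((1, 0, 1), 1),
    ((1, 0, 1), 0),
    ((1, 0, 1), 1),
    ((0, 1, 0), 0),
    ((0, 1, 0), 0),
    ((1, 1, 0), 1),
]

conflicts = find_conflicts(instances)
for features, labels in conflicts.items():
    print(features, labels, positive_fraction(labels))
```

In this toy data only the feature vector `(1, 0, 1)` is a conflict, since it was labelled both positive and negative. Rather than discarding it as noise, the sketch keeps the group and reports how often it was labelled positive.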
Cite as: Chen, D.-Y., Li, X., Dong, Z. Y. and Chen, X. (2005). Determining the Fitness of a Document Model by Using Conflict Instances. In Proc. Sixteenth Australasian Database Conference (ADC2005), Newcastle, Australia. CRPIT, 39. Williams, H. E. and Dobbie, G., Eds. ACS. 125-133.