This paper introduces a novel method for selecting a minimum number of genes (features) for a classification problem based on gene expression data, with the objective of maximising classification accuracy. The method uses a hybrid of the Pearson correlation coefficient (PCC) and signal-to-noise ratio (SNR) methods, combined with an evolving classification function (ECF). First, the correlation coefficients between genes in a set of thousands are calculated. Genes that are highly correlated across samples are considered either dependent or coregulated and form a group (a cluster). The SNR method is then applied to rank the correlated genes in each group according to their discriminative power between the classes. The genes with the highest SNR are placed in a preliminary feature set as representatives of their groups. An incremental algorithm is then applied to build an optimum classification system: starting from one gene, it selects a minimum number of genes (variables) from the preliminary feature set. Only variables that increase the classification rate in a validation iteration are added to the final feature set. The results show that the proposed hybrid PCC, SNR and ECF method improves the feature selection process in terms of the number of variables required, and also improves the classification rate. The classification accuracy of the ECF classifier is tested using leave-one-out validation.
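The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions that the abstract does not state: two classes labelled 0 and 1, the SNR form |mu0 - mu1| / (sd0 + sd1), a hypothetical correlation threshold of 0.8, a greedy grouping rule, and a nearest-centroid classifier standing in for the ECF. All function names are made up for the example.

```python
import numpy as np

def snr_scores(X, y):
    """Per-gene signal-to-noise ratio (assumed form: |mu0 - mu1| / (sd0 + sd1))."""
    a, b = X[y == 0], X[y == 1]
    spread = a.std(axis=0) + b.std(axis=0) + 1e-12  # guard against division by zero
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / spread

def group_representatives(X, y, threshold=0.8):
    """Greedily group genes by |PCC| and keep the highest-SNR gene of each group.

    Genes are visited in decreasing SNR order, so each seed gene is the
    top-SNR member of the group it forms. The threshold is an assumption.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))  # gene-by-gene correlation matrix
    snr = snr_scores(X, y)
    unassigned = set(range(X.shape[1]))
    reps = []
    for g in np.argsort(-snr):
        g = int(g)
        if g not in unassigned:
            continue
        group = {j for j in unassigned if corr[g, j] >= threshold}
        unassigned -= group | {g}
        reps.append(g)
    return reps

def loo_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier (stand-in for ECF)."""
    n = len(y)
    correct = 0
    for i in range(n):
        mask = np.arange(n) != i
        X_train, y_train = X[mask], y[mask]
        centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
        pred = int(np.argmin(((centroids - X[i]) ** 2).sum(axis=1)))
        correct += int(pred == y[i])
    return correct / n

def incremental_select(X, y, candidates):
    """Add a candidate gene only if it raises the leave-one-out accuracy."""
    selected, best = [], 0.0
    for g in candidates:
        acc = loo_accuracy(X[:, selected + [g]], y)
        if acc > best:
            selected.append(g)
            best = acc
    return selected, best
```

Given an expression matrix `X` (samples by genes) and labels `y`, one would call `group_representatives(X, y)` to obtain the preliminary feature set and pass it to `incremental_select(X, y, reps)`. Each class needs at least two samples so that a centroid still exists when one sample is held out.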
Cite as: Goh, L., Kasabov, N. and Song, Q. (2004). A Novel Feature Selection Method to Improve Classification of Gene Expression Data. In Proc. Second Asia-Pacific Bioinformatics Conference (APBC2004), Dunedin, New Zealand. CRPIT, 29. Chen, Y.-P. P., Ed. ACS. 161-166.