Sequential Optimization Based Feature Selection Algorithm for Efficient Cancer Classification and Prediction
Date
2018
Publisher
Proceedings of the 14th iSTEAMS International Multidisciplinary Conference, Al-Hikmah University, Ilorin, Nigeria
Abstract
This study proposes an efficient method for selecting near-optimal feature subsets to enhance the classification performance of the Support Vector Machine (SVM) on high-dimensional genomic microarray data with binary and multiclass responses, using a Multi-Objective Optimization (MOO) approach. In a Monte-Carlo experiment, the features were first pre-selected with a filter method based on the Šidák alpha value to reduce the number of false-positive features in the data. The optimal values of the tuning parameters for both the SVM cost and the Radial Basis Function (RBF) kernel were determined by grid search with 10-fold cross-validation. The SVM with RBF kernel was then fitted sequentially to select the set of near-optimal genes that are correlated with the response class. The proposed algorithm was compared with four machine learning methods: Naïve Bayes (NB), Random Forest (RF), Random Forest with variable selection (RFVS) and LASSO. On simulated data, the Misclassification Error Rate (MER) of the proposed method was 1.1%, with a sensitivity of 97.8%, using four near-optimal selected genes. By comparison, the NB, RF, RFVS and LASSO classifiers, using 10, 10, 9 and 37 genes, had MERs of 4.28%, 5.03%, 4.98% and 0.00% respectively on the same data. Applying the proposed method to published Leukaemia data yielded an MER of 0.03% with a sensitivity of 99.95%, based on three optimally selected genes. The MERs of the NB, RF, RFVS and LASSO classifiers on the Leukaemia data were 1.0%, 3.0%, 5.67% and 0.00%, based on 93, 93, 2 and 31 genes respectively. All the methods considered showed the same pattern of performance on a multiclass-response DNA data set. The results generally showed that the proposed algorithm is more parsimonious and achieved better predictive performance than some of the existing methods considered. The optimally selected gene subsets identified in these data can be investigated further by molecular biologists to establish the pathology of these genes with respect to their tumour classes.
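The abstract does not spell out how the Šidák alpha value or the reported metrics are computed, so the following is only a minimal sketch under stated assumptions. It uses the standard Šidák correction (per-test level 1 - (1 - alpha)^(1/m) for m tests) and the usual definitions of MER and sensitivity. The function names, the 0.05 family-wise level and the idea that each gene arrives with a precomputed p-value are illustrative assumptions, not details taken from the paper.

```python
from typing import Dict, List, Sequence


def sidak_alpha(family_alpha: float, n_tests: int) -> float:
    """Per-test significance level under the standard Sidak correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return 1.0 - (1.0 - family_alpha) ** (1.0 / n_tests)


def preselect(p_values: Dict[str, float], family_alpha: float = 0.05) -> List[str]:
    """Keep the features whose p-value is below the Sidak-adjusted threshold.

    Assumption: p_values maps a (hypothetical) gene name to the p-value of
    some per-gene test; the paper's actual filter statistic is not given.
    """
    if not p_values:
        return []
    threshold = sidak_alpha(family_alpha, len(p_values))
    return [name for name, p in p_values.items() if p < threshold]


def misclassification_error_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted class differs from the true class."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def sensitivity(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """True positive rate: TP / (TP + FN) for the chosen positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("no positive samples in y_true")
    return tp / (tp + fn)


if __name__ == "__main__":
    # Made-up p-values for four hypothetical genes, purely for illustration.
    demo = {"g1": 1e-6, "g2": 0.03, "g3": 0.002, "g4": 0.5}
    print("per-test alpha:", sidak_alpha(0.05, len(demo)))
    print("kept genes:", preselect(demo))
    print("MER:", misclassification_error_rate([1, 1, 0, 0], [1, 0, 0, 0]))
    print("sensitivity:", sensitivity([1, 1, 0, 0], [1, 0, 0, 0]))
```

With thousands of genes the adjusted threshold becomes very small, which is why such a filter sharply reduces the number of features passed on to the SVM stage.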
Keywords
Support Vector Machines, Feature Selection, Multi-Objective Optimization