Feature selection and computational optimization in high-dimensional microarray cancer datasets via InfoGain-modified bat algorithm
Date
2022
Publisher
Multimedia Tools and Applications
Abstract
Achieving a satisfactory cancer classification accuracy with the complete set of genes
remains a great challenge, due to the high dimensions, small sample size, and presence of
noise in gene expression data. Feature reduction is a critical and sensitive step in the classification
task, most importantly in heterogeneous multimedia data. One of the major difficulties
in cancer studies is identifying informative genes among the thousands of genes available
in microarray data. Traditional feature selection algorithms fail to scale to large feature
spaces such as microarray data. Therefore, an effective feature selection algorithm is
required to explore the most significant subset of genes by removing non-predictive genes
from the dataset without compromising the accuracy of the classification algorithm. The
study proposes an Information Gain-Modified Bat Algorithm (InfoGain-MBA) feature selection model for selecting relevant and informative features from high-dimensional
microarray cancer datasets and evaluates the approach with four classifiers: C4.5,
Decision Tree, Random Forest, and classification and regression tree (CART). The results
obtained show that the proposed approach is promising for the classification of microarray
cancer data. Random Forest achieved 100% accuracy with few genes on all seven
datasets used. Further investigations were also conducted to determine the optimal
threshold for each of the datasets.
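The information gain filtering stage mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy data, and the threshold of 0.4 are all hypothetical, the features are assumed to be already discretized, and the modified bat algorithm search that would follow the filter is omitted.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy reduction in `labels` after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def select_by_threshold(columns, labels, threshold):
    """Return indices of feature columns whose information gain is >= threshold."""
    return [i for i, col in enumerate(columns)
            if information_gain(col, labels) >= threshold]


# Hypothetical toy data: 3 discretized "genes" (one list per gene) over 6 samples.
genes = [
    ["high", "high", "high", "low", "low", "low"],  # separates the classes perfectly
    ["high", "low", "high", "low", "high", "low"],  # nearly uninformative
    ["high", "high", "low", "low", "low", "low"],   # partially informative
]
classes = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]

# The threshold is an assumed value; the abstract reports tuning it per dataset.
selected = select_by_threshold(genes, classes, threshold=0.4)
print(selected)
```

On this toy data the filter keeps the first and third genes and drops the second. Raising or lowering the threshold changes how many genes are passed on to the later search and classification stages, which is presumably why a per-dataset threshold matters.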
Keywords
Feature selection, Binary bat algorithm, Information gain, Cancer classification, Microarray data, Random forest, Computational optimization