A novel approach to outliers removal in a noisy numeric dataset for efficient mining

Date

2016

Publisher

Ilorin Journal of Computer Science and Information Technology

Abstract

Data pre-processing is a key task in the data mining process. It generally consumes the largest portion of the total data engineering effort involved in unveiling useful patterns from datasets. At its core, data mining is about fitting descriptive or predictive models to data. However, the presence of outliers can reduce the reliability of the models created. It is therefore essential to have raw data properly pre-processed before mining them. In this paper, an algorithm that detects and removes outliers in a numeric dataset is proposed. To establish the effectiveness of the proposed algorithm, the clean data obtained from it are used to create a prediction model. Similarly, the clean data obtained from an existing technique, the clustering method, are used to create a second prediction model. Each model is then simulated on a set of data not used in training, and the error associated with each model is measured. The results show that the prediction model created from the output of the proposed algorithm has an error of 0.38, while the prediction model created from the data cleaned by the clustering method has an error of 0.61. Comparison of the two errors shows that the proposed algorithm is suitable for cleaning numeric datasets. The results of the experiment also indicate that the proposed approach is efficient and can be used as an alternative to other existing cleaning methods.
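The evaluation protocol described in the abstract (clean the training data, fit a prediction model, measure its error on data not used in training) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the proposed algorithm, the prediction model, or the error metric, so the sketch swaps in a simple interquartile-range filter as the cleaner, a least-squares line as the model, and root-mean-square error as the metric. The names `iqr_filter`, `no_cleaning`, `fit_line`, `rmse`, `evaluate`, and `make_data` are hypothetical, and the data are synthetic.

```python
import random
import statistics


def iqr_filter(rows):
    """Stand-in cleaner (NOT the paper's algorithm): keep rows whose
    y value lies within the 1.5 * IQR fences."""
    ys = [y for _, y in rows]
    q1, _, q3 = statistics.quantiles(ys, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [(x, y) for x, y in rows if low <= y <= high]


def no_cleaning(rows):
    """Baseline for comparison: return the data unchanged."""
    return list(rows)


def fit_line(rows):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    b = sxy / sxx
    return my - b * mx, b


def rmse(model, rows):
    """Root-mean-square error of the fitted line on the given rows
    (the metric is an assumption; the abstract does not name one)."""
    a, b = model
    return (sum((a + b * x - y) ** 2 for x, y in rows) / len(rows)) ** 0.5


def evaluate(cleaner, train, test):
    """Clean the training data, fit a model, score it on held-out data."""
    return rmse(fit_line(cleaner(train)), test)


def make_data(seed=0, n=200, outlier_rate=0.1):
    """Synthetic data: y = 2x + 1 plus noise, with outliers injected
    into the training set only."""
    rng = random.Random(seed)

    def sample(k):
        return [(x, 2.0 * x + 1.0 + rng.gauss(0, 0.5))
                for x in (rng.uniform(0, 10) for _ in range(k))]

    train = [(x, y + 60.0) if rng.random() < outlier_rate else (x, y)
             for x, y in sample(n)]
    return train, sample(n // 4)


if __name__ == "__main__":
    train, test = make_data()
    for name, cleaner in [("iqr_filter", iqr_filter), ("no_cleaning", no_cleaning)]:
        print(name, round(evaluate(cleaner, train, test), 3))
```

Passing a second cleaning function to `evaluate` in place of `no_cleaning` gives the kind of head-to-head comparison the abstract reports, with the lower held-out error indicating the more effective cleaning step.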

Keywords

Algorithm; Data mining; Data pre-processing; Outliers; Prediction
