Introducing a novel hybrid algorithm to resolve class imbalance problem for binary classification in two-dimensional space

Date

2024

Publisher

Faculty of Science, University of Kelaniya Sri Lanka

Abstract

Classification is the task of categorizing data into predefined classes based on their features. The class imbalance problem (CIP), in which instances are unevenly distributed across the classes of the response variable, is a critical issue when classifying many real-world datasets. Typically, the number of minority class instances (the positive class), which is often the class of interest, is significantly smaller than the number of majority class instances (the negative class). This imbalance biases predictions towards the majority class. Techniques such as oversampling, under-sampling, and hybrid resampling can be used to handle CIP. Oversampling increases the number of instances in the minority class by either duplicating existing instances or generating synthetic examples, while under-sampling reduces the number of instances in the majority class. However, oversampling alone causes data replication, while under-sampling alone causes loss of valuable information. The objective of this study is to propose a novel hybrid resampling technique for handling CIP that overcomes the disadvantages of using oversampling or under-sampling alone. The study focuses on binary classification problems, that is, datasets in which the target variable has only two classes. The proposed algorithm applies oversampling and under-sampling to the imbalanced data together, leveling the number of instances in both the majority and minority classes to half the size of the original dataset using a quartile-based approach. The proposed hybrid resampling technique is evaluated on the Pima Indian Diabetes medical dataset, which has an imbalanced class distribution.
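As a rough illustration of the leveling target described above (each class brought to half the size of the original dataset), the following Python sketch resamples row indices. It is only a sketch under stated assumptions: the abstract does not detail the quartile-based selection, so plain random duplication and random removal are used here as stand-ins, and the function name `level_classes` and the toy labels are hypothetical.

```python
import random


def level_classes(labels, minority_label=1, seed=0):
    """Return (minority_rows, majority_rows), each of length len(labels) // 2.

    Assumption: random selection replaces the quartile-based selection of the
    proposed algorithm, which the abstract does not specify.
    """
    rng = random.Random(seed)
    target = len(labels) // 2  # half the size of the original dataset
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    if not minority or len(minority) > target:
        raise ValueError("expected a non-empty minority class of at most half the data")
    # Oversample: keep every minority row, then add random duplicates up to the target.
    extra = rng.choices(minority, k=target - len(minority))
    # Under-sample: keep a random subset of the majority rows, without replacement.
    kept = rng.sample(majority, k=target)
    return minority + extra, kept


labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # toy data: 3 minority, 7 majority
min_idx, maj_idx = level_classes(labels)
print(len(min_idx), len(maj_idx))  # both classes now have 5 rows
```

Because the two classes together always total the original dataset size (or one less when that size is odd), the resampled data stays at the original scale rather than growing, as it would under pure oversampling, or shrinking, as it would under pure under-sampling.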
Logistic regression was employed to identify the two most influential variables for testing in two-dimensional space. Performance metrics including accuracy, recall, precision, and F-measure were used to assess the effectiveness of the approach. A Support Vector Machine (SVM) with the polynomial kernel, one of the simplest kernel functions, was used as the classifier. A training-testing split of 85% to 15% was employed for the evaluation. For comparison, the existing oversampling techniques ROS, SMOTE, and ADASYN, the under-sampling techniques RUS, NCL, and Tomek Links, and the hybrid technique SMOTETomek were used. Performance was evaluated using the recall averaged over 100 iterations. The proposed algorithm obtained the highest average recall, 86.96%, compared with 42% for ROS, 42.57% for SMOTE, 47.1% for ADASYN, 40.7% for RUS, 73.7% for NCL, 27.46% for Tomek Links, and 49.95% for SMOTETomek. The experimental results demonstrate significant improvements in classification performance with the proposed algorithm compared to existing oversampling, under-sampling, and hybrid techniques for handling class imbalance. Future studies will extend this work to multi-class classification problems and increase the number of explanatory variables.
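The evaluation protocol described above (85%/15% split, polynomial-kernel SVM, recall averaged over 100 iterations) could be sketched with scikit-learn roughly as follows. Several details are assumptions not stated in the abstract: the synthetic two-feature data stands in for the two selected Pima variables, the split is stratified, features are standardized, the SVM uses scikit-learn's default polynomial degree, and no resampling step is included. The helper name `average_recall` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def average_recall(X, y, n_iterations=100, test_size=0.15):
    """Mean recall of the positive class (label 1) over repeated random splits."""
    recalls = []
    for seed in range(n_iterations):
        # Assumption: stratified split; the abstract only gives the 85/15 ratio.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed
        )
        # Assumption: standardized inputs and the default polynomial degree.
        model = make_pipeline(StandardScaler(), SVC(kernel="poly"))
        model.fit(X_tr, y_tr)
        recalls.append(
            recall_score(y_te, model.predict(X_te), pos_label=1, zero_division=0)
        )
    return float(np.mean(recalls))


# Synthetic, imbalanced two-dimensional data (a stand-in, not the Pima dataset).
X, y = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0,
    weights=[0.65, 0.35], random_state=0,
)
mean_recall = average_recall(X, y)
print(f"average recall over 100 splits: {mean_recall:.3f}")
```

To compare resampling techniques in this setup, one reasonable design would be to resample only the training portion of each split, so that the test set keeps the original class distribution; the abstract does not say where resampling was applied.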

Keywords

Binary Classification, Class Imbalance Problem, SVM, Resampling, Hybrid techniques

Citation

Madhuwanthi U. S. P.; Chandrasekara N. V. (2024), Introducing a novel hybrid algorithm to resolve class imbalance problem for binary classification in two-dimensional space, Proceedings of the International Conference on Applied and Pure Sciences (ICAPS 2024-Kelaniya) Volume 4, Faculty of Science, University of Kelaniya Sri Lanka. Page 126
