Keywords
|
Imbalanced classification, Imbalanced datasets, Class imbalance problem, Over-sampling, SMOTE.
|
Abstract
|
The Synthetic Minority Over-sampling Technique (SMOTE) is one of the best-known methods for addressing unequal class distribution in imbalanced datasets. However, it has three shortcomings: 1) it may over-generalize because it over-samples noisy samples, 2) it over-samples uninformative samples, and 3) it increases the overlap between classes around the class boundaries. In this research, an improved SMOTE-based method, Range-Controlled SMOTE (RCSMOTE), is proposed that targets all three problems simultaneously. To cope with the first two problems, a sample categorization scheme is applied to identify the minority samples that are suitable for over-sampling. To mitigate the third problem, an improved sample generation process is proposed that generates synthetic samples within an accurately calculated safe range. This range is calculated from the characteristics of the input data and provides a safe over-sampling region for each dimension of the feature space. The extracted range is used to control the location of new synthetic samples in the data space and prevents them from penetrating the majority class regions. Experiments conducted on various datasets confirm that RCSMOTE overcomes the above-mentioned problems of SMOTE.
|
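As a rough illustration of the idea summarized in the abstract, the sketch below pairs the classic SMOTE interpolation step with a simple per-dimension clamp. It is not the RCSMOTE procedure: the abstract does not specify how the safe range is computed, so the bounds `low` and `high`, the helper names `smote_sample` and `clip_to_range`, and the use of clipping itself are assumptions made only for illustration.

```python
import numpy as np

def smote_sample(x, neighbor, rng):
    """Classic SMOTE step: a random point on the segment from x to neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return x + gap * (neighbor - x)

def clip_to_range(sample, low, high):
    """Hypothetical range control: clamp each feature to [low, high]."""
    return np.clip(sample, low, high)

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0])         # a minority sample
neighbor = np.array([3.0, 6.0])  # one of its minority-class neighbors

# Placeholder bounds (assumption). RCSMOTE derives its safe range from the
# input data; the abstract does not give that calculation.
low = np.array([1.0, 2.0])
high = np.array([2.0, 4.0])

raw = smote_sample(x, neighbor, rng)
synthetic = clip_to_range(raw, low, high)
```

With plain SMOTE, `raw` may fall anywhere on the segment between the two minority samples, including regions occupied by the majority class. Restricting each dimension to a bounded interval is one simple way to keep the synthetic sample inside a region considered safe.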