Journal of Lanzhou University of Technology ›› 2023, Vol. 49 ›› Issue (6): 80-89.

• Automation Technique and Computer Technology •

Random forest imbalanced data classification algorithm based on DBSCAN clustering decomposition and oversampling

ZHAO Xiao-qiang1,2,3, YAO Qing-lei1   

  1. College of Electrical Engineering and Information Engineering, Lanzhou Univ. of Tech., Lanzhou 730050, China;
  2. Key Laboratory of Advanced Control of Industrial Processes of Gansu Province, Lanzhou Univ. of Tech., Lanzhou 730050, China;
  3. National Electrical and Control Engineering Experimental Teaching Center, Lanzhou Univ. of Tech., Lanzhou 730050, China
  • Received: 2021-12-31   Online: 2023-12-28   Published: 2024-01-05

Abstract: Traditional methods for classifying imbalanced data tend either to generate a large number of false samples or to lose data. To address this problem, a random forest classification algorithm for imbalanced data based on DBSCAN clustering decomposition and oversampling is proposed. First, the density-based DBSCAN clustering decomposition algorithm is applied to the majority class of the imbalanced dataset, which reduces the dominance of that class without any data loss. Second, the minority class is oversampled with the Borderline-SMOTE algorithm, which increases the number of minority samples and yields a more balanced dataset. This combination addresses the overfitting caused by generating too many false samples during oversampling and, at the same time, avoids the data loss caused by undersampling. Finally, on data preprocessed by clustering decomposition and oversampling, random forest achieves better results than SVM, AdaBoost, Bagging, and XGBoost. Experimental comparisons with other popular algorithms on the KEEL public datasets show that the proposed algorithm effectively improves classification performance on imbalanced data.
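The two data-level steps named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. It assumes that "clustering decomposition" means relabeling each DBSCAN cluster of the majority class as its own sub-class (with noise points kept as a separate group so that no sample is dropped), and it uses the standard Borderline-SMOTE rule for selecting borderline minority samples. The function names, the `maj_*` labels, the parameter values, and the toy data are all hypothetical.

```python
import math
import random

def dbscan(points, eps, min_pts):
    """Plain DBSCAN. Returns one cluster id per point; -1 marks noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1                 # provisional noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # only core points expand the cluster
                queue.extend(reachable)
    return labels

def decompose_majority(majority, eps, min_pts):
    """Assumed decomposition: every majority sample keeps its place but gets a sub-class label."""
    return ["maj_noise" if c < 0 else f"maj_{c}" for c in dbscan(majority, eps, min_pts)]

def borderline_smote(minority, majority, m=5, k=2, n_new=6, seed=0):
    """Interpolate new minority samples, starting only from borderline ("danger") samples."""
    rng = random.Random(seed)
    danger = []
    for i, p in enumerate(minority):
        others = [(q, False) for q in majority]
        others += [(q, True) for j, q in enumerate(minority) if j != i]
        nearest = sorted(others, key=lambda t: math.dist(p, t[0]))[:m]
        n_maj = sum(1 for _, is_minority in nearest if not is_minority)
        if m / 2 <= n_maj < m:             # n_maj == m would mean the sample is treated as noise
            danger.append(i)
    synthetic = []
    while danger and len(minority) > 1 and len(synthetic) < n_new:
        i = rng.choice(danger)
        p = minority[i]
        mates = sorted((q for j, q in enumerate(minority) if j != i),
                       key=lambda q: math.dist(p, q))[:k]
        q = rng.choice(mates)
        gap = rng.random()                 # position along the segment from p to q
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(p, q)))
    return synthetic

# Toy data (hypothetical): two majority blobs, one isolated majority point, three minority points.
majority = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5),
            (10, 10), (10, 11), (11, 10), (11, 11)]
minority = [(2, 1), (2, 2), (4, 4)]

sub_labels = decompose_majority(majority, eps=1.5, min_pts=3)
synthetic = borderline_smote(minority, majority)

# Rebalanced training set: decomposed majority sub-classes plus original and synthetic minority.
X = list(majority) + list(minority) + synthetic
y = sub_labels + ["min"] * (len(minority) + len(synthetic))
```

The resulting `X` and `y` could then be passed to any random forest implementation. Predictions for the `maj_*` sub-classes would need to be mapped back to the single majority class before evaluation, which is again an assumption of this sketch rather than something stated in the abstract.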

Key words: imbalanced data, classification algorithm, DBSCAN, random forest
