Random forest imbalanced data classification algorithm based on DBSCAN clustering decomposition and oversampling

Abstract

Traditional methods for classifying imbalanced data tend either to generate a large number of false samples or to lose data. To address this problem, a random forest classification algorithm for imbalanced data based on DBSCAN clustering decomposition and oversampling is proposed. First, the density-based DBSCAN clustering algorithm is used to decompose the majority class of the imbalanced dataset, which reduces its dominance without discarding any data. Second, the minority class is oversampled with the Borderline-SMOTE algorithm, increasing the number of minority samples to obtain a more balanced dataset. This combination effectively addresses the over-fitting caused by generating too many false samples during oversampling and, at the same time, avoids the data loss caused by undersampling. Finally, on data processed by the clustering decomposition and oversampling steps, random forest achieves better results than SVM, AdaBoost, Bagging, and XGBoost. Experimental comparisons with other popular algorithms on the KEEL public datasets show that the proposed algorithm effectively improves classification performance on imbalanced data.
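The clustering decomposition step can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `dbscan` and `decompose_majority`, the toy data, the parameter values `eps=0.3` and `min_pts=3`, and the choice to give noise points their own sub-class are all assumptions made for illustration. The idea shown is only that the majority class is split into cluster-level sub-classes while every sample is kept. The later Borderline-SMOTE and random forest steps are not shown; in practice they would typically come from libraries such as imbalanced-learn and scikit-learn.

```python
from collections import Counter
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one cluster id per point (NOISE for outliers)."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # core point: keep growing the cluster
                queue.extend(nbrs)
        cluster += 1
    return labels


def decompose_majority(X, y, majority, eps, min_pts):
    """Relabel majority samples as cluster-level sub-classes; keep every sample."""
    idx = [i for i, label in enumerate(y) if label == majority]
    clusters = dbscan([X[i] for i in idx], eps, min_pts)
    new_y = list(y)
    for i, c in zip(idx, clusters):
        # Assumption: noise points form their own sub-class instead of being dropped.
        new_y[i] = f"{majority}_noise" if c == NOISE else f"{majority}_c{c}"
    return new_y


# Toy data (hypothetical): 8 majority samples in two dense groups, 3 minority samples.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
     (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
     (2.5, 2.5), (2.6, 2.5), (2.5, 2.6)]
y = ["neg"] * 8 + ["pos"] * 3

new_y = decompose_majority(X, y, majority="neg", eps=0.3, min_pts=3)
print(Counter(new_y))
```

On this toy data the 8:3 imbalance between `neg` and `pos` becomes three classes of sizes 4, 4, and 3, with no sample removed. Oversampling the minority class would then need to generate far fewer synthetic samples to reach balance.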

Key words: imbalanced data, classification algorithm, DBSCAN, random forest

CLC Number: TP274

ZHAO Xiao-qiang, YAO Qing-lei. Random forest imbalanced data classification algorithm based on DBSCAN clustering decomposition and oversampling[J]. Journal of Lanzhou University of Technology, 2023, 49(6): 80-89.
