Random forest imbalanced data classification algorithm based on DBSCAN clustering decomposition and oversampling

Journal of Lanzhou University of Technology, 2023, Vol. 49, Issue (6): 80-89.

• Automation Technology and Computer Technology •

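ZHAO Xiao-qiang^*1,2,3, YAO Qing-lei^1

1. College of Electrical Engineering and Information Engineering, Lanzhou Univ. of Tech., Lanzhou 730050, China;
2. Key Laboratory of Advanced Control of Industrial Processes of Gansu Province, Lanzhou Univ. of Tech., Lanzhou 730050, China;
3. National Electrical and Control Engineering Experimental Teaching Center, Lanzhou Univ. of Tech., Lanzhou 730050, China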

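Received: 2021-12-31 Online: 2023-12-28 Published: 2024-01-05
Corresponding author: ZHAO Xiao-qiang (born 1969), male, from Qishan, Shaanxi; Ph.D., professor, doctoral supervisor. Email: xqzhao@lut.edu.cn
Funding: National Natural Science Foundation of China (62263021), Gansu Province Higher Education Industry Support Program (2023CYZC-24), Gansu Province Science and Technology Program (21YF5GA072)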

Abstract

Abstract: Traditional methods for classifying imbalanced data tend either to generate a large number of false samples or to lose data. To address these problems, a random forest classification algorithm for imbalanced data based on DBSCAN clustering decomposition and oversampling is proposed. First, the density-based DBSCAN clustering decomposition algorithm is applied to the majority class of the imbalanced dataset, which reduces the dominance of the majority-class samples without any data loss. Second, the minority class is oversampled with the Borderline-SMOTE algorithm to increase the number of minority samples and obtain a more balanced dataset. This effectively addresses the over-fitting caused by generating too many false samples during oversampling, while avoiding the data loss caused by under-sampling. Finally, with the clustering decomposition and oversampling steps in place, random forest is verified to achieve better results than SVM, Adaboost, Bagging, and XGBoost. Experimental comparisons with other popular algorithms on public KEEL datasets show that the proposed algorithm effectively improves classification performance on imbalanced data.

Key words: imbalanced data, classification algorithm, DBSCAN, random forest
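
The first two stages summarized in the abstract can be illustrated with a small, dependency-free Python sketch. It is only a reading of the abstract, not the authors' implementation: the function names (`decompose_majority`, `borderline_oversample`), the parameters `eps`, `min_pts`, `m`, `k`, and `n_new`, the toy data, and the choice to keep DBSCAN noise points as a separate majority sub-class are all assumptions made for illustration. The oversampling step follows the general Borderline-SMOTE idea (interpolate only from minority samples whose neighborhood is dominated by other classes) in simplified form, and the random forest stage is omitted.

```python
import math
import random

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Textbook DBSCAN: one cluster id per point, NOISE (-1) for outliers."""
    labels = [None] * len(points)          # None means "not visited yet"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE              # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster        # border point: claimed, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep growing
        cluster += 1
    return labels

def decompose_majority(X, y, majority, eps, min_pts):
    """Relabel majority samples into per-cluster sub-classes; nothing is dropped.

    Assumption: noise points are kept under their own '<majority>_noise' label.
    """
    idx = [i for i, label in enumerate(y) if label == majority]
    clusters = dbscan([X[i] for i in idx], eps, min_pts)
    new_y = list(y)
    for i, c in zip(idx, clusters):
        new_y[i] = f"{majority}_noise" if c == NOISE else f"{majority}_c{c}"
    return new_y

def borderline_oversample(X, y, minority, m, k, n_new, seed=0):
    """Simplified Borderline-SMOTE: interpolate only from 'danger' minority points."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    danger = []
    for i in minority_idx:
        nearest = sorted((j for j in range(len(X)) if j != i),
                         key=lambda j: math.dist(X[i], X[j]))[:m]
        n_other = sum(y[j] != minority for j in nearest)
        if m / 2 <= n_other < m:           # mostly, but not only, other classes
            danger.append(i)
    if not danger or len(minority_idx) < 2:
        return []
    synthetic = []
    for _ in range(n_new):
        i = rng.choice(danger)
        mates = sorted((j for j in minority_idx if j != i),
                       key=lambda j: math.dist(X[i], X[j]))[:k]
        j = rng.choice(mates)
        gap = rng.random()                 # position on the segment X[i] -> X[j]
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(X[i], X[j])))
    return synthetic

# Toy data (hypothetical): two dense majority blobs and three minority points.
blob_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
blob_b = [(u + 10.0, v + 10.0) for u, v in blob_a]
minority_pts = [(2.0, 2.0), (2.4, 2.4), (5.0, 5.0)]
X = blob_a + blob_b + minority_pts
y = ["maj"] * 10 + ["min"] * 3

y_decomposed = decompose_majority(X, y, "maj", eps=1.5, min_pts=3)
synthetic = borderline_oversample(X, y_decomposed, "min", m=3, k=2, n_new=4)
print(sorted(set(y_decomposed)), len(synthetic))
```

In this sketch the majority class is split into smaller sub-classes rather than under-sampled, so every original sample is retained, and synthetic minority points are created only near the class boundary. A multi-class classifier such as a random forest could then be trained on the decomposed labels plus the synthetic points. How sub-class predictions are mapped back to the original majority class is not specified in the abstract and is left open here.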

CLC number: TP274

ZHAO Xiao-qiang, YAO Qing-lei. Random forest imbalanced data classification algorithm based on DBSCAN clustering decomposition and oversampling[J]. Journal of Lanzhou University of Technology, 2023, 49(6): 80-89.
