Journal of Lanzhou University of Technology ›› 2023, Vol. 49 ›› Issue (6): 80-89.

• Automation Technology and Computer Technology •

Random forest imbalanced data classification algorithm based on DBSCAN clustering decomposition and oversampling

ZHAO Xiao-qiang*1,2,3, YAO Qing-lei1

  1. College of Electrical Engineering and Information Engineering, Lanzhou Univ. of Tech., Lanzhou 730050, China;
  2. Key Laboratory of Advanced Control of Industrial Processes of Gansu Province, Lanzhou Univ. of Tech., Lanzhou 730050, China;
  3. National Electrical and Control Engineering Experimental Teaching Center, Lanzhou Univ. of Tech., Lanzhou 730050, China
  • Received: 2021-12-31 Online: 2023-12-28 Published: 2024-01-05
  • Corresponding author: ZHAO Xiao-qiang (born 1969), male, from Qishan, Shaanxi; Ph.D., professor, doctoral supervisor. Email: xqzhao@lut.edu.cn
  • Funding:
    National Natural Science Foundation of China (62263021); Gansu Provincial Higher Education Industry Support Program (2023CYZC-24); Gansu Provincial Science and Technology Program (21YF5GA072)

Abstract: Traditional methods for classifying imbalanced data tend either to generate a large number of false samples or to lose data. To address these problems, a random forest classification algorithm for imbalanced data based on DBSCAN clustering decomposition and oversampling is proposed. First, the density-based DBSCAN clustering decomposition algorithm is applied to the majority class of the imbalanced dataset, which reduces the dominance of the majority class without any data loss. Second, the minority class is oversampled with the Borderline-SMOTE algorithm, which increases the number of minority samples and yields a more balanced dataset. This effectively addresses the over-fitting caused by generating too many false samples during oversampling, and at the same time avoids the data loss caused by under-sampling. Finally, with clustering decomposition and oversampling applied, random forest is shown to achieve better results than SVM, Adaboost, Bagging, and XGBoost. Experimental comparisons with other popular algorithms on KEEL public datasets show that the proposed algorithm effectively improves classification performance on imbalanced data.

Key words: imbalanced data, classification algorithm, DBSCAN, random forest
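
The two preprocessing steps named in the abstract can be sketched in a few lines of standard-library Python. This is an illustrative sketch and not the authors' implementation. It assumes that "clustering decomposition" means relabeling each DBSCAN cluster of the majority class as its own subclass, it uses a simplified Borderline-SMOTE1 variant, and it assumes the minority class is grown to the size of the largest majority subclass. The function names, the toy data, and the values of `eps`, `min_pts`, and `k` are all hypothetical.

```python
import math
import random
from collections import Counter


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Plain DBSCAN: one cluster id per point, -1 marks noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = region_query(points, i, eps)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = region_query(points, j, eps)
            if len(reach) >= min_pts:  # j is a core point, so keep expanding
                seeds.extend(reach)
    return labels


def decompose_majority(X, y, majority, eps, min_pts):
    """Split the majority class into one subclass per DBSCAN cluster; no sample is dropped."""
    idx = [i for i, label in enumerate(y) if label == majority]
    clusters = dbscan([X[i] for i in idx], eps, min_pts)
    y_new = list(y)
    for i, c in zip(idx, clusters):
        y_new[i] = f"{majority}_{c}"  # noise points share the subclass "<majority>_-1"
    return y_new


def borderline_smote(X, y, minority, k, n_new, rng):
    """Simplified Borderline-SMOTE1: interpolate only from 'danger' minority points."""
    min_idx = [i for i, label in enumerate(y) if label == minority]
    if len(min_idx) < 2:
        return []

    def nearest(i, candidates):
        return sorted((j for j in candidates if j != i),
                      key=lambda j: math.dist(X[i], X[j]))[:k]

    # Danger points: at least half, but not all, of the k nearest neighbours are non-minority.
    danger = [i for i in min_idx
              if k / 2 <= sum(y[j] != minority for j in nearest(i, range(len(X)))) < k]
    synthetic = []
    for _ in range(n_new if danger else 0):
        i = rng.choice(danger)
        j = rng.choice(nearest(i, min_idx))
        gap = rng.random()  # position along the segment from X[i] to X[j]
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(X[i], X[j])))
    return synthetic


# Toy data (hypothetical): two majority blobs and four minority points beside the first blob.
blob = [(float(a), float(b)) for a in range(3) for b in range(3)]
X = blob + [(a + 10.0, b) for a, b in blob] + [(2.6, 0.5), (2.6, 1.5), (4.0, 1.0), (5.0, 1.0)]
y = ["maj"] * 18 + ["min"] * 4

y_dec = decompose_majority(X, y, "maj", eps=1.5, min_pts=3)
counts = Counter(y_dec)
# Assumed balancing target: grow the minority class to the size of the largest subclass.
n_new = max(c for label, c in counts.items() if label != "min") - counts["min"]
synthetic = borderline_smote(X, y_dec, "min", k=3, n_new=n_new, rng=random.Random(0))
X_bal, y_bal = X + synthetic, y_dec + ["min"] * len(synthetic)
print(Counter(y_bal))
```

In this toy set the 18 majority samples split into two subclasses of 9, so only 5 synthetic minority samples are requested, compared with the 14 that balancing against the undivided majority class would need. The balanced pair `X_bal`, `y_bal` could then be passed to a random forest classifier (for example scikit-learn's `RandomForestClassifier`), with every predicted `maj_*` subclass mapped back to the original majority label.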

CLC number: