Research on image semantic description method based on RVC network

Journal of Lanzhou University of Technology ›› 2026, Vol. 52 ›› Issue (2): 99-106.

• Automation Technology and Computer Technology •

LIU Zhong-min^*1,2, CHEN Heng^1,2, HU Wen-jin^3

1. School of Automation and Electrical Engineering, Lanzhou University of Technology, Lanzhou 730050, China;
2. Key Laboratory of Gansu Advanced Control for Industrial Processes, Lanzhou University of Technology, Lanzhou 730050, China;
3. College of Mathematics and Computer Science, Northwest Minzu University, Lanzhou 730000, China

Received: 2023-09-17 Online: 2026-04-28 Published: 2026-04-28
Corresponding author: LIU Zhong-min (born 1978), male, from Jingyuan, Gansu; Ph.D., associate professor. Email: shisl05@lut.edu.cn
Funding:
National Natural Science Foundation of China (62061042)

Abstract

Abstract: To address inaccurate descriptive sentences and excessive image-irrelevant information in image semantic description, an image semantic description method based on the RVC network is proposed. First, in the feature extraction stage, the ResNeXt-101 network and the Vision Transformer network extract visual region features from the image. Second, a channel attention mechanism assigns larger weights to salient regions of the extracted visual features and smaller weights to non-salient regions, and unclear regions of the image are optimized. Finally, the image decoding module combines the visual features with semantic features to generate the descriptive sentence for the image. To verify the effectiveness of the RVC network for image semantic description, experiments were conducted on the MS COCO dataset and compared with existing methods. The results show that the RVC network extracts image features more effectively and produces more accurate and richer descriptive sentences.

Key words: semantic description, feature extraction, Vision Transformer, channel attention mechanism
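
The channel weighting step described in the abstract can be illustrated with a small sketch. The abstract does not say which channel attention variant the RVC network uses, so the code below assumes a squeeze-and-excitation style block (global average pooling followed by a two-layer bottleneck with a sigmoid gate). The function name `channel_attention`, the weight matrices `w_reduce` and `w_expand`, and the toy sizes are all hypothetical and are not taken from the paper.

```python
import math
import random


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def channel_attention(features, w_reduce, w_expand):
    """Reweight feature channels (assumed squeeze-and-excitation style).

    features: list of C channels, each a flat list of spatial activations.
    w_reduce: R x C matrix (hypothetical bottleneck weights).
    w_expand: C x R matrix (hypothetical expansion weights).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(channel) / len(channel) for channel in features]
    # Excitation: ReLU bottleneck, then a sigmoid gate per channel.
    hidden = [max(0.0, h) for h in matvec(w_reduce, squeezed)]
    weights = [sigmoid(z) for z in matvec(w_expand, hidden)]
    # Scale: channels with larger gates keep more of their activation.
    reweighted = [[w * v for v in channel]
                  for w, channel in zip(weights, features)]
    return reweighted, weights


if __name__ == "__main__":
    rng = random.Random(0)
    num_channels, num_positions, bottleneck = 4, 6, 2  # toy sizes
    features = [[rng.random() for _ in range(num_positions)]
                for _ in range(num_channels)]
    w_reduce = [[rng.uniform(-1, 1) for _ in range(num_channels)]
                for _ in range(bottleneck)]
    w_expand = [[rng.uniform(-1, 1) for _ in range(bottleneck)]
                for _ in range(num_channels)]
    _, weights = channel_attention(features, w_reduce, w_expand)
    print("per-channel weights:", [round(w, 3) for w in weights])
```

In a trained model the two weight matrices would be learned, so that channels carrying salient content receive gates near 1 and less informative channels are suppressed. The random values here only demonstrate the data flow.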

CLC number: TP391

LIU Zhong-min, CHEN Heng, HU Wen-jin. Research on image semantic description method based on RVC network[J]. Journal of Lanzhou University of Technology, 2026, 52(2): 99-106.


Link to this article: https://journal.lut.edu.cn/CN/

https://journal.lut.edu.cn/CN/Y2026/V52/I2/99

