Journal of Lanzhou University of Technology ›› 2026, Vol. 52 ›› Issue (2): 99-106.

• Automation Technology and Computer Technology •

Image semantic description model based on the RVC network

LIU Zhong-min*1,2, CHEN Heng1,2, HU Wen-jin3

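  1. School of Automation and Electrical Engineering, Lanzhou University of Technology, Lanzhou, Gansu 730050;
    2. Key Laboratory of Gansu Advanced Control for Industrial Processes, Lanzhou University of Technology, Lanzhou, Gansu 730050;
    3. College of Mathematics and Computer Science, Northwest Minzu University, Lanzhou, Gansu 730000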
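  • Received: 2023-09-17 Publication date: 2026-04-28 Release date: 2026-04-28
  • Corresponding author: LIU Zhong-min (b. 1978), male, native of Jingyuan, Gansu; Ph.D., associate professor. Email: shisl05@lut.edu.cn
  • Funding:
    National Natural Science Foundation of China (62061042)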

Research on image semantic description method based on RVC network

LIU Zhong-min1,2, CHEN Heng1,2, HU Wen-jin3   

  1. School of Automation and Electrical Engineering, Lanzhou University of Technology, Lanzhou 730050, China;
    2. Key Laboratory of Gansu Advanced Control for Industrial Processes, Lanzhou University of Technology, Lanzhou 730050, China;
    3. College of Mathematics and Computer Science, Northwest Minzu University, Lanzhou 730000, China
  • Received: 2023-09-17 Online: 2026-04-28 Published: 2026-04-28

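Abstract: Image semantic description often yields sentences that are insufficiently accurate and that describe too much information unrelated to the image. To address these problems, an image semantic description method based on the RVC network is proposed. First, in the image feature extraction stage, the ResNeXt-101 network and the Vision Transformer network are used to extract visual region features from the image. Second, a channel attention mechanism assigns larger weights to salient regions of the extracted visual features and smaller weights to non-salient regions, and unclear regions of the image are optimized. Finally, the image decoding module combines the visual features with the semantic features to generate the descriptive sentence for the image. To verify the effectiveness of the RVC network for image semantic description, experiments were carried out on the MS COCO dataset and the results were compared with existing methods. The RVC network extracts image features well and makes the generated image descriptions more accurate and richer.

Key words: semantic description, feature extraction, Vision Transformer, channel attention mechanism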

Abstract: To address the problems of inaccurate descriptive sentences and excessive image-irrelevant information in image semantic description, an image semantic description method based on the RVC network is proposed. First, in the image feature extraction stage, visual region features are extracted using the ResNeXt-101 network and the Vision Transformer network. Second, a channel attention mechanism assigns larger weights to salient regions of the extracted visual features and smaller weights to non-salient regions, and the unclear regions of the image are optimized. Finally, the image decoding module combines the visual features with the semantic features to generate the descriptive sentence for the image. To verify the effectiveness of the RVC network for image semantic description, experiments were conducted on the MS COCO dataset and the results were compared with existing methods. The results demonstrate that the RVC network extracts image features more effectively and produces more accurate and richer descriptive sentences.

Key words: semantic description, feature extraction, Vision Transformer, channel attention mechanism
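
The channel attention step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which form of channel attention the RVC network uses, so the squeeze-and-excitation style gating below is an assumption, not the authors' implementation. The function name `channel_attention`, the weight matrices `w1` and `w2`, and the toy feature map are all hypothetical; in a trained model the weights would be learned and the features would come from the ResNeXt-101 and Vision Transformer extractors.

```python
import math


def sigmoid(x):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(features, w1, w2):
    """Squeeze-and-excitation style channel gating (assumed form, not from the paper).

    features: list of C channels, each an H x W list of lists of floats.
    w1: hidden x C weight matrix (hypothetical; learned in practice).
    w2: C x hidden weight matrix (hypothetical; learned in practice).
    Returns (gates, scaled_features).
    """
    # Squeeze: global average pooling gives one scalar per channel.
    pooled = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in features
    ]
    # Excitation: small two-layer network, ReLU then sigmoid.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w1]
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    # Scale: each channel is multiplied by its gate, so channels with a
    # strong response keep more of their signal than weak ones.
    scaled = [
        [[gate * value for value in row] for row in channel]
        for gate, channel in zip(gates, features)
    ]
    return gates, scaled


# Toy input: channel 0 responds strongly, channel 1 weakly.
features = [
    [[4.0, 4.0], [4.0, 4.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
# Identity weights keep the example easy to follow (no channel reduction).
w1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [[1.0, 0.0], [0.0, 1.0]]

gates, scaled = channel_attention(features, w1, w2)
print("channel gates:", gates)
```

With these toy weights the strongly responding channel receives the larger gate, which mirrors the abstract's description of giving salient features more weight and non-salient features less.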

CLC number: