Research on image semantic description method based on RVC network

Abstract

To address inaccurate descriptive sentences and excessive irrelevant information in image semantic description, an image semantic description method based on the RVC network is proposed. First, in the image feature extraction stage, visual region features are extracted using the ResNeXt-101 and Vision Transformer networks. Second, a channel attention mechanism assigns more weight to the salient regions of the extracted visual features and less weight to the non-salient regions, and unclear regions of the image are optimized. Finally, the image decoding module combines the visual features with the semantic features to generate descriptive sentences for the image. To verify the effectiveness of the RVC network in describing image semantics, experiments were conducted on the MS COCO dataset and compared with existing methods. The results demonstrate that the RVC network extracts image features more effectively and produces more accurate and richer descriptive sentences.
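The channel-weighting step can be illustrated with a minimal sketch. The abstract does not specify the exact attention module used in the RVC network, so a squeeze-and-excitation-style gate is assumed here purely for illustration; the function name `channel_attention`, the weight matrices `w1` and `w2`, and the toy dimensions are all hypothetical and are not taken from the paper.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(features, w1, w2):
    """Reweight channels with an SE-style gate (illustrative assumption).

    features: list of C channels, each a flat list of spatial activations.
    w1: HIDDEN x C weight matrix (reduction layer, hypothetical parameters).
    w2: C x HIDDEN weight matrix (expansion layer, hypothetical parameters).
    Returns (reweighted features, per-channel weights in (0, 1)).
    """
    # Squeeze: global average pooling yields one descriptor per channel.
    squeezed = [sum(ch) / len(ch) for ch in features]
    # Excitation: bottleneck layer with ReLU, then a sigmoid gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    weights = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    # Scale: channels with larger weights contribute more to later stages.
    scaled = [[v * a for v in ch] for ch, a in zip(features, weights)]
    return scaled, weights


# Toy input: 4 channels, 6 spatial positions, random (untrained) weights.
random.seed(0)
C, HW, HIDDEN = 4, 6, 2
features = [[random.random() for _ in range(HW)] for _ in range(C)]
w1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(HIDDEN)]
w2 = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(C)]
scaled, weights = channel_attention(features, w1, w2)
```

In a trained model the weights would be learned so that informative channels receive values near 1 and uninformative ones values near 0; with the random weights used here the output only demonstrates the data flow.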

Key words: semantic description, feature extraction, Vision Transformer, channel attention mechanism

CLC Number: TP391

LIU Zhong-min, CHEN Heng, HU Wen-jin. Research on image semantic description method based on RVC network[J]. Journal of Lanzhou University of Technology, 2026, 52(2): 99-106.


URL: https://journal.lut.edu.cn/EN/

https://journal.lut.edu.cn/EN/Y2026/V52/I2/99
