[1] SUN J,LAPUSCHKIN S,SAMEK W,et al.Explain and improve:LRP-inference fine-tuning for image captioning models [J].Information Fusion,2022,77:233-246. [2] CORNIA M,STEFANINI M,BARALDI L,et al.Meshed-memory transformer for image captioning [C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).Seattle,WA:IEEE,2020:10575-10584. [3] 刘文婷,卢新明.基于计算机视觉的Transformer研究进展 [J].计算机工程与应用,2022,58(6):1-16. [4] XU K,BA J,KIROS R,et al.Show,attend and tell:neural image caption generation with visual attention [C]//Proceedings of the 32nd International Conference on Machine Learning.Lille,France:PMLR,2015:2048-2057. [5] ANDERSON P,HE X,BUEHLER C,et al.Bottom-up and top-down attention for image captioning and visual question answering [C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.Salt Lake City:IEEE,2018:6077-6086. [6] FANG Z,WANG J,HU X,et al.Injecting semantic concepts into end-to-end image captioning [C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition.New Orleans:IEEE,2022:17988-17998. [7] MA Y,JI J,SUN X,et al.Knowing what it is:semantic-enhanced dual attention transformer [J].IEEE Transactions on Multimedia,2023,25:3723-3736. [8] LI G,ZHU L,LIU P,et al.Entangled transformer for image captioning [C]//2019 IEEE/CVF International Conference on Computer Vision.Seoul,Korea:IEEE,2019:8927-8936. [9] HUANG L,WANG W,CHEN J,et al.Attention on attention for image captioning[C]//2019IEEE/CVF International Conference on Computer Vision.Seoul,Korea:IEEE,2019:4633-4642. [10] VINYALS O,TOSHEV A,BENGIO S,et al.Show and tell:a neural image caption generator [C]//2015 IEEE Conference on Computer Vision and Pattern Recognition.Boston:IEEE,2015:3156-3164. [11] JIANG W,MA L,JIANG Y G,et al.Recurrent fusion network for image captioning [C]//2018 European Conference on Computer Vision (ECCV).Cham:Springer International Publishing,2018:510-526. [12] PAN Y,YAO T,LI Y,et al.X-linear attention networks for image captioning [C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition.Seattle,WA:IEEE,2020:10968-10977. [13] SHARMA P,DING N,GOODMAN S,et al.Conceptual captions:a cleaned,hypernymed,image alt-text dataset for automatic image captioning [C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1:Long Papers).Melbourne:Association for Computational Linguistics,2018:2556-2565. [14] ALAHMADI R,HAHN J.Improve image captioning by estimating the gazing patterns from the aption [C]//2022IEEE/CVF Winter Conference on Applications of Computer Vision.Waikoloa:IEEE,2022:1025-1034. [15] YAO T,PAN Y,LI Y,et al.Exploring visual relationship for image captioning [C]//2018 European Conference on Computer Vision (ECCV).Cham:Springer International Publishing,2018:711-727. [16] HU J,SHEN L,SUN G.Squeeze-and-excitation networks [C]//2018 IEEE Conference on Computer Vision and Pattern Recognition.Salt Lake City:IEEE,2018:7132-7141. [17] WANG Q,WU B,ZHU P,et al.ECA-Net:efficient channel attention for deep convolutional neural networks [C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).Seattle:IEEE,2020:11531-11539. [18] WOO S,PARK J,LEE J Y,et al.CBAM:convolutional block attention module [C]//2018 European Conference on Computer Vision (ECCV).Cham:Springer International Publishing,2018:3-19. [19] RENNIE S J,MARCHERET E,MROUEH Y,et al.Self-critical sequence training for image captioning [C]//2017 IEEE Conference on Computer Vision and Pattern Recognition.Honolulu:IEEE,2017:1179-1195. |