Journal of Lanzhou University of Technology ›› 2022, Vol. 48 ›› Issue (5): 99-106.

• Automation Technology and Computer Technology •

  • Corresponding author: YANG Ya-ting (born 1984, Qitai, Xinjiang), Ph.D., researcher. Email: yangyt@ms.xjb.ac.cn
  • Funding:
    National Natural Science Foundation of China (U2003303); Xinjiang High-level Talent Introduction Program (Xin Ren She Han [2017] No. 699); Chinese Academy of Sciences "Light of West China" Program, Class A (2017-XBQNXZ-A-005); CAS Youth Innovation Promotion Association (2017472; Ke Fa Ren Han Zi [2019] No. 26)

Uyghur-Chinese neural machine translation method based on back translation and ensemble learning

FENG Xiao1,2,3, YANG Ya-ting1,2,3, DONG Rui1,2,3, AZMAT Anwar1,2,3, MA Bo1,2,3   

  1. Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China;
    2. University of Chinese Academy of Sciences, Beijing 100049, China;
    3. Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011, China
  • Received: 2021-04-16; Online: 2022-10-28; Published: 2022-11-21


Abstract: From the perspective of efficiently utilizing existing resources, a method based on back-translation and ensemble learning is proposed to address the poor performance of Uyghur-Chinese neural machine translation caused by the scarcity of parallel corpora. First, a Uyghur-Chinese pseudo-parallel corpus is constructed via back-translation from a large-scale Chinese monolingual corpus, and an intermediate model is trained on this pseudo-parallel corpus. Second, bootstrap sampling is used to resample the original parallel corpus N times, yielding N sub-datasets that follow approximately the same distribution but differ in composition; the intermediate model is fine-tuned on each of the N sub-datasets, producing N distinct sub-models. Finally, these sub-models are ensembled. Experiments on the CWMT2015 and CWMT2017 test sets show that the method improves the BLEU (Bilingual Evaluation Understudy) score over the baseline system by 2.37 and 1.63 points, respectively.
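The bootstrap-resampling and ensembling steps described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `bootstrap_subsets` and `ensemble_next_token` are hypothetical, and probability averaging is assumed as the ensembling rule, whereas the paper's sub-models are full neural translation systems.

```python
import random

def bootstrap_subsets(corpus, n_subsets, seed=0):
    # Sample with replacement (the bootstrap): each sub-dataset has the
    # same size as the original corpus but a different composition, so
    # the subsets approximate the same distribution while differing.
    rng = random.Random(seed)
    size = len(corpus)
    return [[corpus[rng.randrange(size)] for _ in range(size)]
            for _ in range(n_subsets)]

def ensemble_next_token(models, source, probs_fn):
    # One common way to ensemble NMT sub-models at decoding time:
    # average the next-token probability distributions that each
    # fine-tuned sub-model assigns given the source sentence.
    dists = [probs_fn(model, source) for model in models]
    vocab = dists[0].keys()
    return {tok: sum(d[tok] for d in dists) / len(dists) for tok in vocab}
```

In the paper's setting, each element of `corpus` would be a Uyghur-Chinese sentence pair, and `probs_fn` would query one of the N fine-tuned sub-models; here it is left abstract so the sketch stays self-contained.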

Key words: neural machine translation, back translation, ensemble learning, intermediate model, fine-tuning, catastrophic forgetting
