Journal of Lanzhou University of Technology, 2024, Vol. 50, Issue (5): 77-85.

• Automation and Computer Technology •

  • Corresponding author: CHEN Hui (1978-), male, from Wenxi, Shanxi; Ph.D., professor, doctoral supervisor. Email: huich78@hotmail.com
  • Funding:
    National Natural Science Foundation of China (62163023, 62366031, 62363023, 61873116); Major Special Project of the Gansu Academy of Sciences (2023ZDZX-03); 2023 Gansu Province Special Fund for Civil-Military Integration Development; 2024 Gansu Province Key Talent Project

Trajectory planning of radar observer based on Monte Carlo policy gradient

CHEN Hui1, WANG Jing-yu1, ZHANG Wen-xu1, ZHAO Yong-hong2, XI Lei3   

  1. College of Electrical and Information Engineering, Lanzhou Univ. of Tech., Lanzhou 730050, China;
    2. Gansu Province Changfeng Electronic Technology Co., Ltd., Lanzhou 730070, China;
    3. Institute of Automation, Gansu Academy of Sciences, Lanzhou 730000, China
  • Received:2022-06-12 Online:2024-10-28 Published:2024-10-31



Abstract: For radar observer trajectory planning (OTP) during target tracking, the intelligent decision-making problem of Markov step-wise planning is addressed by a radar trajectory planning method based on the Monte Carlo policy gradient (MCPG) algorithm over a discrete action space. First, the OTP process is modeled as a continuous Markov decision process (MDP) that jointly accounts for the target tracking state, the reward mechanism, the action scheme, and the radar observer position, and a global intelligent planning method based on MCPG is proposed. Second, by treating each time step within the tracking episode as a separate episode for policy updates, a step-wise intelligent planning method for the observer trajectory in MCPG-based target tracking is proposed; the tracking estimation characteristics of the target are studied in depth, and a reward function aimed at optimizing tracking performance is constructed. Finally, simulation experiments on reinforcement-learning-based intelligent OTP decision-making during optimal nonlinear target tracking demonstrate the effectiveness of the proposed method.
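The abstract only summarizes the method, so the following is a minimal, hypothetical sketch of the general technique it names: Monte Carlo policy gradient (REINFORCE) with a discrete action space and a tracking-geometry reward. The toy environment, the linear softmax policy, and all names (`ToyOTPEnv`, `reinforce`, the stand-off-distance reward) are illustrative assumptions, not the paper's actual model, reward function, or planner.

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # numerical stability
    e = np.exp(z)
    return e / e.sum()

class ToyOTPEnv:
    """Hypothetical stand-in for a tracking scenario: a 1-D observer
    picks discrete moves and is rewarded for keeping a desired
    stand-off distance to a fixed target (a simple proxy for a
    tracking-performance reward)."""
    ACTIONS = np.array([-1.0, 0.0, 1.0])   # discrete action space

    def __init__(self, target=10.0, desired_range=4.0, horizon=20):
        self.target, self.desired, self.horizon = target, desired_range, horizon

    def reset(self):
        self.pos, self.t = 0.0, 0
        return self._obs()

    def _obs(self):
        # feature vector: bias term + signed range error
        err = abs(self.target - self.pos) - self.desired
        return np.array([1.0, err])

    def step(self, a):
        self.pos += self.ACTIONS[a]
        self.t += 1
        err = abs(self.target - self.pos) - self.desired
        reward = -abs(err)                 # better geometry -> higher reward
        done = self.t >= self.horizon
        return self._obs(), reward, done

def discounted_returns(rewards, gamma=0.99):
    """Monte Carlo return G_t for every step of one episode."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]

def reinforce(env, episodes=200, alpha=0.01, gamma=0.99, seed=0):
    """REINFORCE: sample a full episode, then ascend the policy
    gradient  G_t * grad log pi(a_t | s_t)  at every step."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((len(env.ACTIONS), 2))    # linear softmax policy
    for _ in range(episodes):
        obs, done, traj = env.reset(), False, []
        while not done:
            p = softmax(theta @ obs)
            a = rng.choice(len(p), p=p)
            nxt, r, done = env.step(a)
            traj.append((obs, a, r))
            obs = nxt
        Gs = discounted_returns([r for *_, r in traj], gamma)
        for (o, a, _), G in zip(traj, Gs):
            p = softmax(theta @ o)
            grad = -np.outer(p, o)             # grad log pi: -p_k * s ...
            grad[a] += o                       # ... plus s on the taken action
            theta += alpha * G * grad
    return theta
```

The per-step update inside the episode loop mirrors the abstract's idea of using each time step's Monte Carlo return for a policy update; the paper's actual per-step-as-episode scheme and nonlinear-tracking reward would replace the toy return and reward here.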

Key words: target tracking, radar observer trajectory planning, policy gradient, reward function
