Safe Policy Improvement via Diffusion Model for Offline Reinforcement Learning

Xiaohan Yang,
Jun Li,
Jiang Liu,
Mengting Sun,
Baozhu Chen

Abstract


Offline reinforcement learning is a paradigm that learns policies without real-time interaction with the environment. Using offline datasets avoids the potential dangers and overhead of sampling through real-time interaction. However, a key challenge of offline reinforcement learning is that the learned policy is not guaranteed to be safe, because high-reward actions in the offline dataset may not satisfy safety constraints. In this paper, we propose a novel algorithm, named SPI, that improves policy safety for offline reinforcement learning. The core idea of SPI is to use a diffusion model to expand the original action space by guiding the generation of high-reward actions that satisfy safety constraints within similar data distributions. First, we remap the offline dataset to enhance the cost property. Second, we integrate the diffusion model, the action value function, and the safety value function into a framework that drives actions to converge toward a distribution compliant with safety constraints. Extensive experiments on the DSRL benchmark demonstrate that SPI outperforms the relevant baselines in most tasks.
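To illustrate the kind of guided generation the abstract describes, the sketch below shows one way a reverse-diffusion step over actions can be steered by both an action value function and a safety value function. This is a minimal, assumption-based sketch, not the paper's implementation: the network architectures, noise schedule, guidance weights (w_r, w_c), cost budget, and dimensions are all illustrative placeholders.

```python
# Minimal sketch of guided diffusion sampling for actions (illustrative only).
# At each reverse step, the denoised action is nudged toward higher estimated
# reward Q(s, a) and away from regions where the estimated cost C(s, a) exceeds
# a budget. All networks and hyperparameters below are assumptions.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, T = 17, 6, 10           # assumed dimensions / diffusion steps
betas = torch.linspace(1e-4, 0.02, T)          # simple linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(),
                         nn.Linear(256, out_dim))

eps_model = mlp(STATE_DIM + ACTION_DIM + 1, ACTION_DIM)   # noise predictor eps(s, a_t, t)
q_reward  = mlp(STATE_DIM + ACTION_DIM, 1)                # action value Q(s, a)
q_cost    = mlp(STATE_DIM + ACTION_DIM, 1)                # safety (cost) value C(s, a)

def guided_sample(state, w_r=1.0, w_c=1.0, cost_budget=0.0):
    """Reverse diffusion over actions with reward/safety guidance (sketch)."""
    a = torch.randn(state.shape[0], ACTION_DIM)            # start from Gaussian noise
    for t in reversed(range(T)):
        t_emb = torch.full((state.shape[0], 1), t / T)
        eps = eps_model(torch.cat([state, a, t_emb], dim=-1))
        # Standard DDPM posterior mean for the previous step
        mean = (a - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        # Guidance: gradient of (reward - penalty on constraint violation) w.r.t. the action
        a_req = a.detach().requires_grad_(True)
        score = q_reward(torch.cat([state, a_req], dim=-1)).sum() \
                - w_c * torch.relu(q_cost(torch.cat([state, a_req], dim=-1)) - cost_budget).sum()
        grad = torch.autograd.grad(score, a_req)[0]
        mean = mean + w_r * betas[t] * grad                 # nudge toward safe, high-reward actions
        noise = torch.randn_like(a) if t > 0 else torch.zeros_like(a)
        a = mean + torch.sqrt(betas[t]) * noise
    return a.clamp(-1.0, 1.0)

# Example usage with a random batch of states
states = torch.randn(4, STATE_DIM)
actions = guided_sample(states)
print(actions.shape)   # torch.Size([4, 6])
```

The design choice to penalize only the positive part of (C(s, a) - budget) is one common way to express a constraint-aware guidance signal; the paper's actual remapping of the dataset and its exact guidance formulation may differ.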

Keywords


Offline reinforcement learning, Safe policy, Diffusion model, Learning from experience

Citation Format:
Xiaohan Yang, Jun Li, Jiang Liu, Mengting Sun, Baozhu Chen, "Safe Policy Improvement via Diffusion Model for Offline Reinforcement Learning," Journal of Internet Technology, vol. 26, no. 7, pp. 883-892, Dec. 2025.

