Learning expressive and multimodal policies is essential for solving complex continuous control tasks. We propose Fokker–Planck Soft Actor–Critic (FP-SAC), which reinterprets soft policy improvement through stochastic differential equations and derives a distribution-level objective from the corresponding Fokker–Planck equation. By minimizing the PDE residual with physics-informed learning and leveraging normalizing flows, FP-SAC learns multimodal policies accurately, achieving higher performance and greater stability on multigoal, MuJoCo, and Meta-World benchmarks.
@article{hwang2026fpsac,
  title     = {Fokker--Planck Soft Actor--Critic},
  author    = {Hwang, Hyo-Seok and Kim, Jaewon and Seok, Junhee},
  journal   = {IEEE Transactions on Neural Networks and Learning Systems},
  year      = {2026},
  publisher = {IEEE},
  doi       = {10.1109/TNNLS.2026.3699399}
}