Robuta

https://openreview.net/forum?id=zcIV8OQFVF
In this work, we study the issue of reward hacking on response length, a challenge emerging in Reinforcement Learning from Human Feedback (RLHF) for LLMs. A...
odin, reward, mitigates, hacking, rlhf
https://arxiv.org/abs/2511.18397
Abstract page for arXiv paper 2511.18397: Natural Emergent Misalignment from Reward Hacking in Production RL
reward hacking, natural, emergent, production
https://www.anthropic.com/research/emergent-misalignment-reward-hacking
Anthropic research post on natural emergent misalignment from reward hacking in production RL.
reward hacking, shortcuts, sabotage, natural, emergent
https://arxiv.org/abs/2506.22777
Abstract page for arXiv paper 2506.22777: Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning
reward hacking, teaching, models, verbalize, chain
https://arxiv.org/abs/2501.19358
Abstract page for arXiv paper 2501.19358: The Energy Loss Phenomenon in RLHF: A New Perspective on Mitigating Reward Hacking
the energy, a new, loss, phenomenon, rlhf