Robuta

https://openreview.net/forum?id=zcIV8OQFVF
In this work, we study the issue of reward hacking on response length, a challenge emerging in Reinforcement Learning from Human Feedback (RLHF) for LLMs. A...
odin, reward, mitigates, hacking, rlhf
https://arxiv.org/abs/2511.18397
Abstract page for arXiv paper 2511.18397: Natural Emergent Misalignment from Reward Hacking in Production RL
reward hacking, natural, emergent, production
https://www.anthropic.com/research/emergent-misalignment-reward-hacking
Anthropic research post on natural emergent misalignment from reward hacking in production RL.
reward hacking, shortcuts, sabotage, natural, emergent
https://arxiv.org/abs/2506.22777
Abstract page for arXiv paper 2506.22777: Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning
reward hacking, teaching, models, verbalize, chain
https://arxiv.org/abs/2501.19358
Abstract page for arXiv paper 2501.19358: The Energy Loss Phenomenon in RLHF: A New Perspective on Mitigating Reward Hacking
the energy, a new, loss, phenomenon, rlhf