https://www.anthropic.com/research/emergent-misalignment-reward-hacking
From shortcuts to sabotage: natural emergent misalignment from reward hacking \ Anthropic
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
Tags: reward hacking, shortcuts, sabotage, natural, emergent
https://danmackinlay.name/post/proposal_human_reward_hacking.html
AI Risk Research Idea - Human Reward Hacking — The Dan MacKinlay stable of variably-well-consider’d...
Wherein causal inference methods for user–AI interactions are proposed, counterfactuals are invoked to attribute influence, and a real-time cognitive immunity...
Tags: dan mackinlay stable, ai risk, reward hacking, variably well, research
https://lilianweng.github.io/posts/2024-11-28-reward-hacking/
Reward Hacking in Reinforcement Learning | Lil'Log
Nov 28, 2024 - Reward hacking occurs when a reinforcement learning (RL) agent exploits flaws or ambiguities in the reward function to achieve high rewards, without genuinely...
Tags: reward hacking, reinforcement learning, lillog
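The definition in the last entry — an agent exploiting flaws in the reward function rather than genuinely solving the task — can be made concrete with a toy sketch. The example below is hypothetical (not taken from any of the linked posts): the designer intends the agent to reach a goal cell, but the reward function also pays per step spent on a "shiny" tile, so the return-maximizing policy camps on that tile and never completes the task.

```python
# Toy illustration of reward hacking (hypothetical example, states and
# rewards invented for this sketch). The intended task is "reach the goal",
# but the flawed reward pays +1 per step on a "shiny" tile.

def flawed_reward(state: str) -> float:
    # Proxy reward: +1 for standing on the shiny tile, +5 once for the goal.
    if state == "shiny":
        return 1.0
    if state == "goal":
        return 5.0
    return 0.0

def rollout(policy, horizon: int = 20) -> float:
    """Run one episode; reaching the goal ends the episode."""
    state, total = "start", 0.0
    for _ in range(horizon):
        state = policy(state)
        total += flawed_reward(state)
        if state == "goal":
            break  # task complete, episode ends
    return total

intended = lambda s: "goal"   # go straight to the goal: return 5.0
hacking = lambda s: "shiny"   # camp on the shiny tile: return 20.0

assert rollout(hacking) > rollout(intended)
```

Under this reward, the "hacking" policy earns 20.0 versus 5.0 for the intended behavior, so any competent optimizer will prefer the loophole — the flaw is in the reward specification, not the agent.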