Robuta

https://miraflow.ai/blog/finetuning-llms-direct-preference-optimization-dpo
Finetuning LLMs with Direct Preference Optimization (DPO): A Simpler Alternative to RLHF
Apr 23, 2026 - RLHF made ChatGPT possible. DPO makes the same alignment possible without the reward model, without the PPO loop, and without the engineering nightmare. This...

https://arxiv.org/abs/2502.17424
[2502.17424] Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Abstract page for arXiv paper 2502.17424: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs