https://openreview.net/forum?id=rxKC8v2uHc
Reward models play an important role in aligning large language models (LLMs), but they are typically trained as discriminative models and rely only on...
reward model, GRAM, generative, foundation, generalization
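The snippet above contrasts generative reward models with the standard discriminative setup. As a point of reference, discriminative reward models are typically trained with a pairwise Bradley-Terry objective over chosen/rejected responses; below is a minimal NumPy sketch of that loss, with toy scalar scores (the values and function are illustrative, not taken from the linked paper).

```python
import numpy as np

def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise Bradley-Terry loss for a discriminative reward model:
    -log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    margin = np.asarray(r_chosen) - np.asarray(r_rejected)
    # Numerically stable log-sigmoid: log sigma(x) = -log(1 + exp(-x))
    return float(np.mean(np.log1p(np.exp(-margin))))

# Toy scalar scores from a hypothetical reward head (not real model outputs)
loss = bradley_terry_loss([2.0, 1.5], [0.5, 1.0])
```

The loss shrinks as the chosen response's score pulls further ahead of the rejected one, which is the margin the generative approaches above aim to learn without a dedicated scalar head.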
https://openreview.net/forum?id=dbaYQyruY2
Reward modeling, crucial for aligning large language models (LLMs) with human preferences, is often bottlenecked by the high cost of preference data. Existing...
reward model, latent space, limited preference data
https://arxiv.org/abs/2406.02900v1
Abstract page for arXiv paper 2406.02900v1: Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms
scaling laws, reward model, direct alignment
https://huggingface.co/collections/matlok/papers-training-process-reward-model
A curated collection from matlok of papers, models, datasets, and Spaces related to training process reward models.
training, process reward model, papers, collection
https://arxiv.org/abs/2305.18290
Abstract page for arXiv paper 2305.18290: Direct Preference Optimization: Your Language Model is Secretly a Reward Model
preference optimization, language model, direct, secretly
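The DPO paper linked above derives an implicit reward from a policy's log-probability margin over a frozen reference model and trains it with a Bradley-Terry-style loss. A minimal sketch of that objective, using toy sequence log-probabilities rather than real model outputs:

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO objective (arXiv:2305.18290): the implicit reward is
    beta * (log pi(y|x) - log pi_ref(y|x)); the loss is
    -log sigmoid(beta * (delta_chosen - delta_rejected))."""
    delta_chosen = np.asarray(logp_chosen) - np.asarray(ref_logp_chosen)
    delta_rejected = np.asarray(logp_rejected) - np.asarray(ref_logp_rejected)
    logits = beta * (delta_chosen - delta_rejected)
    # Stable -log sigmoid(x) = log(1 + exp(-x))
    return float(np.mean(np.log1p(np.exp(-logits))))

# Toy per-sequence log-probs (sums over tokens); illustrative values only.
loss = dpo_loss([-10.0], [-12.0], [-11.0], [-11.0], beta=0.1)
```

When the policy prefers the chosen response more strongly than the reference does, the loss drops below log 2 (the value at zero margin); no separate reward model is fit, which is the paper's point.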