https://openreview.net/forum?id=rxKC8v2uHc
Reward models play an important role in aligning large language models (LLMs), but they are typically trained as discriminative models and rely only on...
reward model, GRAM, generative, foundation, generalization
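The snippet above contrasts generative reward models with the standard discriminative setup. As a point of reference, discriminative reward models are typically trained with a pairwise Bradley-Terry objective over chosen/rejected responses; below is a minimal NumPy sketch of that loss, with toy scalar scores (the values and function are illustrative, not taken from the linked paper).

```python
import numpy as np

def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise Bradley-Terry loss for a discriminative reward model:
    -log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    margin = np.asarray(r_chosen) - np.asarray(r_rejected)
    # Numerically stable log-sigmoid: log sigma(x) = -log(1 + exp(-x))
    return float(np.mean(np.log1p(np.exp(-margin))))

# Toy scalar scores from a hypothetical reward head (not real model outputs)
loss = bradley_terry_loss([2.0, 1.5], [0.5, 1.0])
```

The loss shrinks as the chosen response's score pulls further ahead of the rejected one, which is the margin the generative approaches above aim to learn without a dedicated scalar head.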
https://openreview.net/forum?id=dbaYQyruY2
Reward modeling, crucial for aligning large language models (LLMs) with human preferences, is often bottlenecked by the high cost of preference data. Existing...
reward model, latent space, limited preference data
https://arxiv.org/abs/2406.02900v1
Abstract page for arXiv paper 2406.02900v1: Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms
scaling laws, reward model, direct alignment
https://huggingface.co/collections/matlok/papers-training-process-reward-model
A curated collection from matlok of papers, models, datasets, and Spaces related to training process reward models.
training, process reward model, papers, collection
https://arxiv.org/abs/2305.18290
Abstract page for arXiv paper 2305.18290: Direct Preference Optimization: Your Language Model is Secretly a Reward Model
preference optimization, language model, direct, secretly
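The DPO paper linked above derives an implicit reward from a policy's log-probability margin over a frozen reference model and trains it with a Bradley-Terry-style loss. A minimal sketch of that objective, using toy sequence log-probabilities rather than real model outputs:

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO objective (arXiv:2305.18290): the implicit reward is
    beta * (log pi(y|x) - log pi_ref(y|x)); the loss is
    -log sigmoid(beta * (delta_chosen - delta_rejected))."""
    delta_chosen = np.asarray(logp_chosen) - np.asarray(ref_logp_chosen)
    delta_rejected = np.asarray(logp_rejected) - np.asarray(ref_logp_rejected)
    logits = beta * (delta_chosen - delta_rejected)
    # Stable -log sigmoid(x) = log(1 + exp(-x))
    return float(np.mean(np.log1p(np.exp(-logits))))

# Toy per-sequence log-probs (sums over tokens); illustrative values only.
loss = dpo_loss([-10.0], [-12.0], [-11.0], [-11.0], beta=0.1)
```

When the policy prefers the chosen response more strongly than the reference does, the loss drops below log 2 (the value at zero margin); no separate reward model is fit, which is the paper's point.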