https://openreview.net/forum?id=efwbxMJ5M6
We offer a novel perspective on reward modeling by formulating it as a policy discriminator, which quantifies the difference between two policies to generate a...
Keywords: pretrained policy, general reward
https://openreview.net/forum?id=L8hYdTQVcs
While direct policy optimization methods exist, pioneering LLMs are fine-tuned with reinforcement learning from human feedback (RLHF) to generate better...
Keywords: policy filtration, RLHF, mitigate noise