https://arxiv.org/abs/2502.07193v1
Provably Efficient RLHF Pipeline: A Unified View from Contextual Bandits
https://openreview.net/forum?id=LXftdR11io
We study off-policy learning (OPL) of contextual bandit policies in large discrete action spaces, where existing methods -- most of which rely crucially on...