https://openreview.net/forum?id=pk5XGSOqbQ&referrer=%5Bthe%20profile%20of%20M.%20Sadegh%20Talebi%5D(%2Fprofile%3Fid%3D~M._Sadegh_Talebi1)
In this paper we analyze the sample complexities of learning the optimal state-action value function $ Q^* $ and an optimal policy $ \pi^* $ in a discounted...
sample complexity, entropic, risk, optimization, discounted, MDPs
https://www.laborunion.lt/index.php
https://www.amazon.science/publications/near-optimal-regret-in-linear-mdps-with-aggregate-bandit-feedback
In many real-world applications, it is hard to provide a reward signal in each step of a Reinforcement Learning (RL) process and more natural to give feedback...
near-optimal, regret, MDPs, aggregate