Model-free RL for Constrained Markov Decision Processes: Foundations and Applications

Speaker: Arnob Ghosh
Affiliation: Research Scientist

Abstract: Many constrained sequential decision-making processes, such as safe AV navigation, wireless network control, caching, and cloud computing, can be cast as Constrained Markov Decision Processes (CMDPs). Reinforcement Learning (RL) algorithms have been used to learn optimal policies for unknown unconstrained MDPs. Extending these RL algorithms to unknown CMDPs brings the additional challenge of not only maximizing the reward but also satisfying the constraints. In contrast to existing model-based approaches, or model-free methods accompanied by a 'simulator', we aim to develop the first model-free, simulator-free RL algorithm that achieves sublinear regret and sublinear constraint violation with respect to the number of episodes. To this end, we consider the episodic linear CMDP, where the transition probabilities and rewards are linear in the feature space. In our NeurIPS paper, we show that our approach achieves Õ(√K) regret and Õ(√K) constraint violation, where K is the number of episodes. Our result does not depend on the dimension of the state space, and hence it is applicable to large state spaces. This is the first result showing that Õ(√K) regret and Õ(√K) constraint violation are achievable using model-free RL even for small state spaces (a.k.a. the tabular case). We also show that we can in fact achieve zero constraint violation for large K. We propose a primal-dual adaptation of the LSVI-UCB algorithm and demonstrate how we overcome the analytical challenges of infinite state spaces by using a soft-max policy instead of a greedy policy. In our recent ICLR paper, we also show the first (model-free or model-based) results for constrained MDPs in the infinite-horizon setting with large state spaces.
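The abstract's key ingredients are a primal-dual treatment of the constraint and a soft-max (rather than greedy) policy over estimated Q-values. The snippet below is a minimal illustrative sketch of those two ideas only, not the paper's LSVI-UCB-based algorithm; all names (q_reward, q_utility, dual_step, constraint_threshold) and the toy numbers are hypothetical, and the Q-value estimates are assumed to come from some separate least-squares/optimistic fitting step.

import numpy as np

# Hypothetical sketch of a primal-dual, soft-max policy step for a CMDP.
# Assumes per-state Q-value estimates for the reward and for the constraint
# (utility) signal are already available, e.g., from an LSVI-UCB-style
# least-squares fit with optimistic bonuses. Names are illustrative only.

def softmax_policy(q_reward, q_utility, dual_var, temperature=1.0):
    """Soft-max policy over the Lagrangian Q-values Q_r + dual_var * Q_g.

    A soft-max (rather than greedy argmax) policy is the device the abstract
    highlights for handling large/infinite state spaces in the analysis.
    """
    q_lagrangian = q_reward + dual_var * q_utility
    logits = q_lagrangian / temperature
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def dual_update(dual_var, estimated_utility, constraint_threshold,
                dual_step=0.1, dual_max=10.0):
    """Projected gradient step on the Lagrange multiplier.

    The multiplier grows when the estimated constraint return falls below
    the required threshold, and shrinks (toward zero) otherwise.
    """
    dual_var += dual_step * (constraint_threshold - estimated_utility)
    return float(np.clip(dual_var, 0.0, dual_max))

# Toy usage: 3 actions in some state, one constraint, one dual variable.
q_r = np.array([1.0, 0.5, 0.2])   # estimated reward Q-values
q_g = np.array([0.1, 0.8, 0.9])   # estimated constraint (utility) Q-values
lam = 0.5
pi = softmax_policy(q_r, q_g, lam)
lam = dual_update(lam, estimated_utility=0.4, constraint_threshold=0.6)
print(pi, lam)

In this toy step, the policy trades off reward against constraint utility through the multiplier, and the dual update raises the multiplier because the estimated utility (0.4) falls short of the threshold (0.6).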
Bio: