# Near-Optimal Reinforcement Learning with Self-Play

Dec 06, 2020
###### Details
This paper considers the problem of designing optimal algorithms for reinforcement learning in two-player zero-sum games. We focus on self-play algorithms, which learn the optimal policy by playing against themselves without any direct supervision. In a tabular episodic Markov game with $S$ states, $A$ max-player actions and $B$ min-player actions, the best existing algorithm for finding an approximate Nash equilibrium requires $\tilde{\mathcal{O}}(S^2AB)$ steps of game playing, where we highlight only the dependency on $(S,A,B)$. In contrast, the best existing lower bound scales as $\Omega(S(A+B))$, leaving a significant gap to the upper bound. This paper closes this gap for the first time: we propose an optimistic variant of the *Nash Q-learning* algorithm with sample complexity $\tilde{\mathcal{O}}(SAB)$, and a new *Nash V-learning* algorithm with sample complexity $\tilde{\mathcal{O}}(S(A+B))$. The latter result matches the information-theoretic lower bound in all problem-dependent parameters except for a polynomial factor of the length of each episode. Towards understanding learning objectives in Markov games other than finding the Nash equilibrium, we present a computational hardness result for learning the best responses against a fixed opponent. This also implies the computational hardness of achieving sublinear regret when playing against adversarial opponents.

Speakers: Yu Bai, Chi Jin, Tiancheng Yu
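To get a feel for how far apart the bounds quoted in the abstract are, the following minimal sketch evaluates only their leading $(S,A,B)$ factors for one illustrative game size. The function name `leading_terms` and the values $S=100$, $A=B=10$ are assumptions chosen for illustration; logarithmic factors, the episode length, and the target accuracy are all ignored, so the numbers indicate relative scaling, not actual sample counts.

```python
def leading_terms(S: int, A: int, B: int) -> dict[str, int]:
    """Leading (S, A, B) factors of the bounds quoted in the abstract.

    Hypothetical helper for illustration only: it drops log factors,
    the episode length, and the accuracy parameter.
    """
    return {
        "prior upper bound, S^2 * A * B": S * S * A * B,
        "optimistic Nash Q-learning, S * A * B": S * A * B,
        "Nash V-learning, S * (A + B)": S * (A + B),
        "lower bound, S * (A + B)": S * (A + B),
    }


if __name__ == "__main__":
    # Illustrative sizes (assumed, not taken from the paper).
    for name, value in leading_terms(S=100, A=10, B=10).items():
        print(f"{name}: {value:,}")
```

For these assumed sizes the leading factors are 1,000,000 for the prior upper bound, 10,000 for optimistic Nash Q-learning, and 2,000 for both Nash V-learning and the lower bound, which shows why replacing the product $AB$ with the sum $A+B$ matters as the action sets grow.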