We propose Learning Off-Policy with Online Planning (LOOP), an efficient reinforcement learning framework that combines the benefits of model-based local trajectory optimization and off-policy algorithms. The agent learns a dynamics model and then uses trajectory optimization with the learned model to select actions. To sidestep the myopic effect of fixed-horizon trajectory optimization, a value function learned through an off-policy algorithm is attached to the end of the planning horizon. We investigate various instantiations of this framework and demonstrate its benefits in three settings: online reinforcement learning, offline reinforcement learning, and safe learning. We show that this method significantly improves over the underlying model-based and model-free algorithms and achieves state-of-the-art performance in a variety of settings.
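The core idea described above can be sketched in a minimal random-shooting planner: roll out candidate action sequences through a learned dynamics model, sum model-predicted rewards over a fixed horizon, and append a learned value estimate at the terminal state to avoid myopia. Everything here (`dynamics_model`, `reward_fn`, `value_fn`, and the planner hyperparameters) is a toy stand-in for illustration, not LOOP's actual networks or optimizer (LOOP uses a more sophisticated trajectory optimizer in practice).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for LOOP's learned components:
def dynamics_model(state, action):
    # Toy linear dynamics; in LOOP this is a learned neural model.
    return 0.9 * state + action

def reward_fn(state, action):
    # Toy quadratic cost; rewards the state and action staying near zero.
    return -(state ** 2).sum() - 0.1 * (action ** 2).sum()

def value_fn(state):
    # Terminal value; in LOOP this comes from an off-policy critic.
    return -(state ** 2).sum()

def plan_action(state, horizon=5, n_candidates=64, action_dim=1):
    """Score random action sequences by model rollout plus a terminal
    value estimate; return the first action of the best sequence."""
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        seq = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, ret = state.copy(), 0.0
        for a in seq:
            ret += reward_fn(s, a)
            s = dynamics_model(s, a)
        ret += value_fn(s)  # value function attached at the horizon's end
        if ret > best_return:
            best_return, best_first_action = ret, seq[0]
    return best_first_action

action = plan_action(np.array([1.0]))
```

Only the first planned action is executed; the agent replans at every step, which is what makes the optimization "local" and online.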
Speakers: Harshit Sikchi, Wenxuan Zhou, David Held