COG: Connecting New Skills to Past Experience with Offline Reinforcement Learning - Motivation:
Reinforcement learning methods typically involve collecting a large amount of data for every new task. Since the amount of data we can collect for any single task is limited due to time and cost considerations in the real world, the learned behavior is usually quite narrow.
In this paper, we propose an approach to incorporate a large amount of prior data, either from previously solved tasks or from unsupervised or undirected environment interaction, to extend and generalize learned behavior. This prior data is not specific to any one task, and can be used to extend a variety of downstream skills.
We train our policies in an end-to-end fashion, mapping high-dimensional image observations to low-level robot control commands, and present results in both simulated and real-world domains. Our hardest experimental setting involves composing four robotic skills in a row: picking, placing, drawer opening, and grasping, where a +1/0 sparse reward is provided only on task completion.
Conservative Q-Learning For Offline RL - Abstract:
Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously-collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a principled policy improvement procedure. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions.
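To make the idea of "the standard Bellman error objective plus a simple Q-value regularizer" concrete, the following is a minimal sketch for the discrete-action case. The abstract does not spell out the regularizer, so the specific form used here (log-sum-exp of Q-values over all actions minus the Q-value of the dataset action) is an assumption based on a commonly described variant, and the names `cql_loss`, `alpha`, and `gamma` are hypothetical, chosen only for illustration. Q-values are passed in as plain lists rather than computed by a network.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_loss(q_values, next_q_values, actions, rewards, dones,
             alpha=1.0, gamma=0.99):
    """Sketch of a conservative Q-learning loss for discrete actions.

    q_values[i][a]      : Q(s_i, a) for every action a (assumed given)
    next_q_values[i][a] : target-network Q(s'_i, a) (assumed given)
    actions[i]          : index of the action stored in the dataset
    rewards[i], dones[i]: reward and terminal flag (1.0 if terminal)
    alpha               : hypothetical weight on the conservative term
    """
    n = len(actions)
    bellman = 0.0
    penalty = 0.0
    for q_row, next_row, a, r, d in zip(
            q_values, next_q_values, actions, rewards, dones):
        q_data = q_row[a]
        # Standard one-step Q-learning target.
        target = r + gamma * (1.0 - d) * max(next_row)
        bellman += 0.5 * (q_data - target) ** 2
        # Assumed regularizer: push Q down on all actions (soft maximum)
        # while pushing it up on the action actually seen in the data.
        penalty += logsumexp(q_row) - q_data
    bellman /= n
    penalty /= n
    return bellman + alpha * penalty, bellman, penalty


# One terminal transition with two actions and all-zero Q-values.
total, bellman, penalty = cql_loss(
    q_values=[[0.0, 0.0]],
    next_q_values=[[0.0, 0.0]],
    actions=[0],
    rewards=[1.0],
    dones=[1.0],
    alpha=2.0,
)
```

Because the log-sum-exp is never smaller than any individual Q-value, the penalty term in this sketch is non-negative, and minimizing it lowers the values of actions the dataset does not support. Setting `alpha=0.0` recovers the plain Bellman error.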