In reinforcement learning, Return, which is the weighted accumulated future rewards, and Value, which is the expected return, serve as the objective that guides the learning of the policy. In classic RL, return is defined as the exponentially discounted sum of future rewards. One key insight is that there could be many feasible ways to define the form of the return function (and thus the value), from which the same optimal policy can be derived, yet these different forms might render dramatically different speeds of learning this policy. In this paper, we research how to modify the form of the return function to enhance the learning towards the optimal policy. We propose to use a general mathematical form for return function, and employ meta-learning to learn the optimal return function in an end-to-end manner. We test our methods on a specially designed maze environment and several Atari games, and our experimental results clearly indicate the advantages of automatically learning optimal return functions in reinforcement learning.
Speakers: Yufei Wang, Qiwei Ye, Tie-Yan Liu