Neural network optimization methods fall into two broad classes: adaptive methods such as Adam and non-adaptive methods such as vanilla stochastic gradient descent (SGD). Here, we formulate the problem of neural network optimization as Bayesian filtering. We find that state-of-the-art adaptive (AdamW) and non-adaptive (SGD) methods can be recovered by taking limits as the amount of information about the parameter gets large or small, respectively. As such, we develop a new neural network optimization algorithm, AdaBayes, that adaptively transitions between SGD-like and Adam(W)-like behaviour. This algorithm converges more rapidly than Adam in the early part of learning, and has generalisation performance competitive with SGD.
Speakers: Laurence Aitchison