How far apart are two neural networks? This is a foundational question in their theory. We derive a simple and tractable bound that relates distance in function space to distance in parameter space for a broad class of nonlinear compositional functions. The bound makes explicit the dependence on the depth of the composition. The theory is of practical relevance since it establishes a trust region for first-order optimisation. In turn, this suggests an optimiser that we call Frobenius matched gradient descent, or Fromage. Fromage involves a principled form of gradient rescaling and enjoys guarantees on the stability of both the spectra and the Frobenius norms of the weights. We find that the new algorithm increases the depth at which a multilayer perceptron may be trained as compared to Adam and SGD, and that it is competitive with Adam for training generative adversarial networks. We further verify that Fromage scales up to a language transformer with over $10^8$ parameters. Code and reproducibility instructions are available at https://github.com/jxbz/fromage.
Speakers: Jeremy Bernstein, Arash Vahdat, Yisong Yue, Ming-Yu Liu
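
The gradient rescaling mentioned in the abstract can be illustrated with a minimal sketch. It assumes a per-layer update of the form $W \leftarrow (W - \eta \frac{\|W\|_F}{\|g\|_F} g) / \sqrt{1 + \eta^2}$; that form is not stated in the abstract itself, so treat it as an assumption and consult the linked repository for the actual implementation. The names `frobenius_norm` and `fromage_step`, and the handling of zero norms, are hypothetical choices made for illustration.

```python
import math


def frobenius_norm(matrix):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def fromage_step(weights, grads, lr=0.01):
    """One Fromage-style update for a single layer (illustrative sketch).

    The gradient is rescaled so that its Frobenius norm matches that of
    the weights. Dividing by sqrt(1 + lr^2) offsets the norm growth that
    a step orthogonal to the weights would otherwise cause.
    """
    w_norm = frobenius_norm(weights)
    g_norm = frobenius_norm(grads)
    if w_norm == 0.0 or g_norm == 0.0:
        # Assumption: leave the layer unchanged in the degenerate case.
        return [row[:] for row in weights]
    scale = lr * w_norm / g_norm
    shrink = 1.0 / math.sqrt(1.0 + lr * lr)
    return [
        [(w - scale * g) * shrink for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, grads)
    ]


# Usage: a gradient orthogonal to the weights leaves the weight norm unchanged.
weights = [[3.0, 0.0]]
grads = [[0.0, 2.0]]
updated = fromage_step(weights, grads, lr=0.1)
```

In this sketch the step size is relative to the size of the layer, so multiplying the gradient by any positive constant gives the same update.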