We present PyHessian, a new scalable framework that enables fast computation of Hessian (i.e., second-order derivative) information for deep neural networks. This framework is developed in Pytorch, and it enables distributed-memory execution on cloud or supercomputer systems. PyHessian enables fast computations of the top Hessian eigenvalue, the Hessian trace, and the full Hessian eigenvalue density. This general framework can be used to analyze neural network models, including the topology of the loss landscape (i.e., curvature information) to gain insight into the behavior of different models/optimizers. To illustrate this, we apply PyHessian to analyze the effect of residual connections and Batch Normalization layers on the smoothness of the loss landscape. One recent claim, based on simpler first-order analysis, is that residual connections and batch normalization make the loss landscape ``smoother'', thus making it easier for Stochastic Gradient Descent to converge to a good solution. We perform an extensive analysis by measuring directly the Hessian spectrum using PyHessian. This analysis leads to finer-scale insight, demonstrating that while conventional wisdom is sometimes validated, in other cases it is simply incorrect. In particular, we find that batch normalization layers do not necessarily make the loss landscape smoother, especially for shallow networks. Instead, the claimed smoother loss landscape only becomes evident for deep neural networks. We perform extensive experiments on four residual networks (ResNet20/32/38/56) on Cifar-10/100 dataset. We have open-sourced our PyHessian framework for Hessian spectrum computation.
Speakers: Zhewei Yao, Amir Gholaminejad, Kurt Keutzer, Michael Mahoney