Abstract: Conditioning analysis characterizes the landscape of an optimization objective by examining the spectrum of its curvature matrix. While well studied theoretically for linear models, it has received little attention for deep neural networks (DNNs). We extend this analysis to DNNs by proposing a layer-wise conditioning analysis that explores the optimization landscape with respect to each layer independently. This analysis is theoretically justified under mild assumptions that approximately hold in practice. Based on our analysis, we show that batch normalization (BN) adjusts the magnitudes of the layer activations/gradients and thereby stabilizes training. However, this stabilization can create a false impression of a local minimum, which sometimes harms learning. We further observe experimentally that BN improves the layer-wise conditioning of the optimization problem. Finally, we find that the last linear layer of a very deep residual network exhibits ill-conditioned behavior during training. We address this by adding a single BN layer before the last linear layer, which improves performance over the original residual networks, especially for deeper networks.
Authors: Lei Huang, Jie Qin, Li Liu, Fan Zhu, Ling Shao (Inception Institute of Artificial Intelligence)
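The proposed fix of inserting one BN layer before the final classifier can be sketched as follows. This is a minimal NumPy illustration under our own assumptions (feature dimensions, variable names, and the omission of BN's learnable affine parameters are ours, not the authors' code): batch normalization standardizes the features fed into the last linear layer.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Standardize each feature over the batch dimension, as BN does
    # at training time (learnable scale/shift parameters omitted).
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
# Stand-in for pooled features from a deep residual network:
# poorly scaled activations entering the final classifier.
features = rng.normal(loc=5.0, scale=10.0, size=(64, 128))
W = rng.normal(size=(128, 10))  # weights of the last linear layer

# BN inserted before the final linear layer, per the abstract's fix.
logits = batch_norm(features) @ W
```

After normalization, each input feature to the classifier has roughly zero mean and unit variance over the batch, which is the scale adjustment the abstract credits for stabilizing training.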