NFNets: High-Performance Large-Scale Image Recognition Without Normalization (ML Paper Explained)
Details
#nfnets #deepmind #machinelearning

Batch Normalization is a core component of modern deep learning. It enables training at higher batch sizes, prevents mean shift, provides implicit regularization, and allows networks to reach higher performance than they would without it. However, BatchNorm also has disadvantages, such as its dependence on batch size and its computational overhead, especially in distributed settings. Normalizer-Free Networks, developed at Google DeepMind, are a class of CNNs that achieve state-of-the-art classification accuracy on ImageNet without batch normalization. This is achieved by using adaptive gradient clipping (AGC), combined with a number of improvements in general network architecture. The resulting networks train faster, are more accurate, and provide better transfer learning performance. Code is provided in JAX.

OUTLINE:
0:00 - Intro & Overview
2:40 - What's the problem with BatchNorm?
11:00 - Paper contribution Overview
13:30 - Beneficial properties of BatchNorm
15:30 - Previous work: NF-ResNets
18:15 - Adaptive Gradient Clipping
21:40 - AGC and large batch size
23:30 - AGC induces implicit dependence between training samples
28:30 - Are BatchNorm's problems solved?
30:00 - Network architecture improvements
31:10 - Comparison to EfficientNet
33:00 - Conclusion & Comments

Paper: https://arxiv.org/abs/2102.06171
Code: https://github.com/deepmind/deepmind-research/tree/master/nfnets

My Video on BatchNorm: https://www.youtube.com/watch?v=OioFONrSETc
My Video on ResNets: https://www.youtube.com/watch?v=GWt6Fu05voI

Abstract:
Batch normalization is a key component of most image classification models, but it has many undesirable properties stemming from its dependence on the batch size and interactions between examples. Although recent work has succeeded in training deep ResNets without normalization layers, these models do not match the test accuracies of the best batch-normalized networks, and are often unstable for large learning rates or strong data augmentations. In this work, we develop an adaptive gradient clipping technique which overcomes these instabilities, and design a significantly improved class of Normalizer-Free ResNets. Our smaller models match the test accuracy of an EfficientNet-B7 on ImageNet while being up to 8.7x faster to train, and our largest models attain a new state-of-the-art top-1 accuracy of 86.5%. In addition, Normalizer-Free models attain significantly better performance than their batch-normalized counterparts when finetuning on ImageNet after large-scale pre-training on a dataset of 300 million labeled images, with our best models obtaining an accuracy of 89.2%. Our code is available at https://github.com/deepmind/deepmind-research/tree/master/nfnets

Authors: Andrew Brock, Soham De, Samuel L. Smith, Karen Simonyan

Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/
BiliBili: https://space.bilibili.com/1824646584

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
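
For a concrete picture of the adaptive gradient clipping (AGC) named in the description, here is a minimal JAX sketch. It is not the authors' released code. The idea it illustrates: for each output unit, compare the gradient norm to the parameter norm, and rescale the gradient whenever that ratio exceeds a threshold. The names `unitwise_norm` and `agc_clip` and the default values `clip=0.01` and `eps=1e-3` are illustrative assumptions, as is the convention that the last axis of a weight tensor indexes output units.

```python
import jax
import jax.numpy as jnp


def unitwise_norm(x):
    """L2 norm per output unit (assumption: the last axis indexes output units)."""
    if x.ndim <= 1:
        # Biases and gains: a single norm over the whole tensor.
        return jnp.sqrt(jnp.sum(x ** 2))
    # Weights: reduce over every axis except the last one.
    axes = tuple(range(x.ndim - 1))
    return jnp.sqrt(jnp.sum(x ** 2, axis=axes, keepdims=True))


def agc_clip(grads, params, clip=0.01, eps=1e-3):
    """Rescale each unit's gradient so that ||g|| <= clip * max(||w||, eps)."""

    def clip_one(g, p):
        # eps keeps zero-initialized parameters from forcing their gradients to zero.
        p_norm = jnp.maximum(unitwise_norm(p), eps)
        g_norm = unitwise_norm(g)
        max_norm = clip * p_norm
        # Rescaled gradient; the small constant guards against division by zero.
        scaled = g * (max_norm / jnp.maximum(g_norm, 1e-6))
        # Only units whose gradient norm exceeds the threshold are rescaled.
        return jnp.where(g_norm > max_norm, scaled, g)

    return jax.tree_util.tree_map(clip_one, grads, params)


# Hypothetical toy parameters and gradients, for illustration only.
params = {"w": jnp.ones((3, 2)), "b": jnp.zeros((2,))}
grads = {"w": jnp.full((3, 2), 10.0), "b": jnp.full((2,), 0.5)}
clipped = agc_clip(grads, params)
```

In this toy example every column of the weight gradient is far larger than 0.01 times the norm of the matching weight column, so each column is scaled down to exactly that limit. Gradients already below the limit pass through unchanged.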
