Linear Transformers Are Secretly Fast Weight Memory Systems (Machine Learning Paper Explained)
Details
#fastweights #deeplearning #transformers Transformers are dominating Deep Learning, but their quadratic memory and compute requirements make them expensive to train and hard to use. Many papers have attempted to linearize the core module: the attention mechanism, using kernels - for example, the Performer. However, such methods are either not satisfactory or have other downsides, such as a reliance on random features. This paper establishes an intrinsic connection between linearized (kernel) attention and the much older Fast Weight Memory Systems, in part popularized by Jürgen Schmidhuber in the 90s. It shows the fundamental limitations of these algorithms and suggests new update rules and new kernels in order to fix these problems. The resulting model compares favorably to Performers on key synthetic experiments and real-world tasks. OUTLINE: 0:00 - Intro & Overview 1:40 - Fast Weight Systems 7:00 - Distributed Storage of Symbolic Values 12:30 - Autoregressive Attention Mechanisms 18:50 - Connecting Fast Weights to Attention Mechanism 22:00 - Softmax as a Kernel Method (Performer) 25:45 - Linear Attention as Fast Weights 27:50 - Capacity Limitations of Linear Attention 29:45 - Synthetic Data Experimental Setup 31:50 - Improving the Update Rule 37:30 - Deterministic Parameter-Free Projection (DPFP) Kernel 46:15 - Experimental Results 50:50 - Conclusion & Comments Paper: https://arxiv.org/abs/2102.11174 Code: https://github.com/ischlag/fast-weight-transformers Machine Learning Street Talk on Kernels: https://youtu.be/y_RjsDHl5Y4 Abstract: We show the formal equivalence of linearised self-attention mechanisms and fast weight memories from the early '90s. From this observation we infer a memory capacity limitation of recent linearised softmax attention variants. With finite memory, a desirable behaviour of fast weight memory models is to manipulate the contents of memory and dynamically interact with it. Inspired by previous work on fast weights, we propose to replace the update rule with an alternative rule yielding such behaviour. We also propose a new kernel function to linearise attention, balancing simplicity and effectiveness. We conduct experiments on synthetic retrieval problems as well as standard machine translation and language modelling tasks which demonstrate the benefits of our methods. Authors: Imanol Schlag, Kazuki Irie, Jürgen Schmidhuber Links: TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick YouTube: https://www.youtube.com/c/yannickilcher Twitter: https://twitter.com/ykilcher Discord: https://discord.gg/4H8xxDF BitChute: https://www.bitchute.com/channel/yannic-kilcher Minds: https://www.minds.com/ykilcher Parler: https://parler.com/profile/YannicKilcher LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/ BiliBili: https://space.bilibili.com/1824646584 If you want to support me, the best thing to do is to share out the content :) If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this): SubscribeStar: https://www.subscribestar.com/yannickilcher Patreon: https://www.patreon.com/yannickilcher Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2 Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

0:00 - Intro & Overview 1:40 - Fast Weight Systems 7:00 - Distributed Storage of Symbolic Values 12:30 - Autoregressive Attention Mechanisms 18:50 - Connecting Fast Weights to Attention Mechanism 22:00 - Softmax as a Kernel Method (Performer) 25:45 - Linear Attention as Fast Weights 27:50 - Capacity Limitations of Linear Attention 29:45 - Synthetic Data Experimental Setup 31:50 - Improving the Update Rule 37:30 - Deterministic Parameter-Free Projection (DPFP) Kernel 46:15 - Experimental Results 50:50 - Conclusion & Comments
Comments
loading...