TransGAN: Two Transformers Can Make One Strong GAN (Machine Learning Research Paper Explained)
Details
#transformer #gan #machinelearning Generative Adversarial Networks (GANs) hold the state-of-the-art when it comes to image generation. However, while the rest of computer vision is slowly taken over by transformers or other attention-based architectures, all working GANs to date contain some form of convolutional layers. This paper changes that and builds TransGAN, the first GAN where both the generator and the discriminator are transformers. The discriminator is taken over from ViT (an image is worth 16x16 words), and the generator uses pixelshuffle to successfully up-sample the generated resolution. Three tricks make training work: Data augmentations using DiffAug, an auxiliary superresolution task, and a localized initialization of self-attention. Their largest model reaches competitive performance with the best convolutional GANs on CIFAR10, STL-10, and CelebA. OUTLINE: 0:00 - Introduction & Overview 3:05 - Discriminator Architecture 5:25 - Generator Architecture 11:20 - Upsampling with PixelShuffle 15:05 - Architecture Recap 16:00 - Vanilla TransGAN Results 16:40 - Trick 1: Data Augmentation with DiffAugment 19:10 - Trick 2: Super-Resolution Co-Training 22:20 - Trick 3: Locality-Aware Initialization for Self-Attention 27:30 - Scaling Up & Experimental Results 28:45 - Recap & Conclusion Paper: https://arxiv.org/abs/2102.07074 Code: https://github.com/VITA-Group/TransGAN My Video on ViT: https://youtu.be/TrdevFK_am4 Abstract: The recent explosive interest on transformers has suggested their potential to become powerful "universal" models for computer vision tasks, such as classification, detection, and segmentation. However, how further transformers can go - are they ready to take some more notoriously difficult vision tasks, e.g., generative adversarial networks (GANs)? Driven by that curiosity, we conduct the first pilot study in building a GAN \textbf{completely free of convolutions}, using only pure transformer-based architectures. Our vanilla GAN architecture, dubbed \textbf{TransGAN}, consists of a memory-friendly transformer-based generator that progressively increases feature resolution while decreasing embedding dimension, and a patch-level discriminator that is also transformer-based. We then demonstrate TransGAN to notably benefit from data augmentations (more than standard GANs), a multi-task co-training strategy for the generator, and a locally initialized self-attention that emphasizes the neighborhood smoothness of natural images. Equipped with those findings, TransGAN can effectively scale up with bigger models and high-resolution image datasets. Specifically, our best architecture achieves highly competitive performance compared to current state-of-the-art GANs based on convolutional backbones. Specifically, TransGAN sets \textbf{new state-of-the-art} IS score of 10.10 and FID score of 25.32 on STL-10. It also reaches competitive 8.64 IS score and 11.89 FID score on Cifar-10, and 12.23 FID score on CelebA 64×64, respectively. We also conclude with a discussion of the current limitations and future potential of TransGAN. The code is available at \url{this https URL}. Authors: Yifan Jiang, Shiyu Chang, Zhangyang Wang Links: TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick YouTube: https://www.youtube.com/c/yannickilcher Twitter: https://twitter.com/ykilcher Discord: https://discord.gg/4H8xxDF BitChute: https://www.bitchute.com/channel/yannic-kilcher Minds: https://www.minds.com/ykilcher Parler: https://parler.com/profile/YannicKilcher LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/ BiliBili: https://space.bilibili.com/1824646584 If you want to support me, the best thing to do is to share out the content :) If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this): SubscribeStar: https://www.subscribestar.com/yannickilcher Patreon: https://www.patreon.com/yannickilcher Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2 Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

0:00 - Introduction & Overview 3:05 - Discriminator Architecture 5:25 - Generator Architecture 11:20 - Upsampling with PixelShuffle 15:05 - Architecture Recap 16:00 - Vanilla TransGAN Results 16:40 - Trick 1: Data Augmentation with DiffAugment 19:10 - Trick 2: Super-Resolution Co-Training 22:20 - Trick 3: Locality-Aware Initialization for Self-Attention 27:30 - Scaling Up & Experimental Results 28:45 - Recap & Conclusion
Comments
loading...