Abstract: Vision-and-Language (V+L) research sits at the nexus of Computer Vision and Natural Language Processing and has attracted rapidly growing attention from both communities. A variety of V+L tasks, benchmarked over large-scale human-annotated datasets, have driven tremendous progress in joint multimodal representation learning. This tutorial focuses on several recently popular tasks in this domain: visual captioning, visual grounding, visual question answering and reasoning, text-to-image generation, and self-supervised learning for universal image-text representations. We will cover state-of-the-art approaches in each of these areas and discuss key principles that epitomize the core challenges and opportunities in multimodal understanding, reasoning, and generation.
Authors: J.J. Liu