In this work, we investigate the problem of lip-syncing a talking face video of an arbitrary identity to match a target speech segment. Current works excel at producing accurate lip movements on a static image or on videos of specific identities seen during training. However, they fail to accurately morph the lip movements of arbitrary identities in dynamic, unconstrained talking face videos, leaving significant portions of the generated video out of sync with the new audio. We identify the key reasons for this failure and address them by learning from a powerful lip-sync discriminator. We also propose new, rigorous evaluation benchmarks and metrics to accurately measure lip synchronization in unconstrained videos. Extensive quantitative evaluations on these challenging benchmarks show that the lip-sync accuracy of videos generated by our Wav2Lip model is almost as good as that of real synced videos.
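The central training signal described above, learning from a powerful lip-sync discriminator, can be pictured as a SyncNet-style two-tower network that scores how well a short window of face frames matches the corresponding slice of audio, with the generator penalised whenever that score says "out of sync". The PyTorch sketch below is only illustrative: the encoder depths, the 5-frame window, and the mapping of cosine similarity to a probability are assumptions for readability, not the paper's exact architecture or loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LipSyncDiscriminator(nn.Module):
    """Illustrative SyncNet-style sync discriminator (not the paper's exact network).

    Two encoders map a short window of face crops and the matching
    mel-spectrogram slice into a shared embedding space; the cosine
    similarity of the two embeddings acts as a sync score.
    """

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        # Face encoder: 5 consecutive RGB face crops stacked along channels (5 * 3 = 15).
        self.face_encoder = nn.Sequential(
            nn.Conv2d(15, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(256, embed_dim),
        )
        # Audio encoder: mel-spectrogram of the same time window (1 input channel).
        self.audio_encoder = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, faces: torch.Tensor, mels: torch.Tensor) -> torch.Tensor:
        v = F.normalize(self.face_encoder(faces), dim=-1)
        s = F.normalize(self.audio_encoder(mels), dim=-1)
        # Cosine similarity mapped from [-1, 1] to (0, 1): probability of being in sync.
        return (F.cosine_similarity(v, s, dim=-1) + 1) / 2


def sync_loss(p_sync: torch.Tensor) -> torch.Tensor:
    """Generator-side sync penalty: push generated frames toward the 'in sync' label (1)."""
    return F.binary_cross_entropy(p_sync.clamp(1e-7, 1 - 1e-7),
                                  torch.ones_like(p_sync))
```

In this reading, the generator's output frames are passed through the (frozen) discriminator during training and the resulting sync penalty is added to the usual reconstruction and visual-quality terms, which is what steers arbitrary-identity videos toward accurate lip movements.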
Authors: K R Prajwal, Rudrabha Mukhopadhyay, Vinay Namboodiri, C V Jawahar (IIIT Hyderabad, University of Bath)
00:00 - 00:10 Introduction
00:10 - 01:02 Initial result on Angela Merkel
01:02 - 01:35 Architecture
01:35 - 02:07 Comparison
02:07 - 02:29 Features of the model
02:29 - 02:35 Snapshot of applications
02:35 - 02:54 Syncing interviews with the translator's speech
02:54 - 04:18 Lip-syncing dubbed movie scenes
04:18 - 04:43 Lip-syncing famous professors in different languages
04:43 - 04:57 Lip-syncing animated characters
04:57 - 05:35 Futuristic video conferencing application
05:35 - 06:00 Creating social media content
06:00 - 06:16 Links and conclusion