Authors: Chong Huang, Kazuhito Koishida
Description: Active speaker detection is the task of inferring which (if any) of the visible people in a video are speaking. Existing methods based on audiovisual fusion are often confused by factors such as non-speaking facial motions, varied illumination, and low-resolution recording. To address these problems, we propose a robust active speaker detection model that incorporates dense optical flow to strengthen the visual representation of facial motion. The audio and visual features are processed by a two-stream embedding network, and the resulting embeddings are fed into a prediction network for binary speaking/non-speaking classification. To improve the learning efficiency of the entire network, we train it with a multi-task learning strategy. The proposed method is evaluated on the most challenging audiovisual speaker detection benchmark, the AVA-ActiveSpeaker dataset. The results demonstrate that optical flow improves the performance of neural networks when combined with raw pixels and the audio signal. They also show that our method consistently outperforms the state-of-the-art method in terms of both the area under the receiver operating characteristic curve (+4.4%) and the balanced accuracy (+5.28%).
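
The two evaluation metrics named in the description can be illustrated with a minimal sketch in plain Python. This is not the authors' evaluation code: the toy labels, the scores, and the 0.5 decision threshold are hypothetical values chosen for illustration, and the AUC is computed with the standard pairwise-ranking formulation (the probability that a speaking sample is scored above a non-speaking one), which is equivalent to the area under the ROC curve.

```python
from typing import Sequence


def balanced_accuracy(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Mean of the true-positive rate and the true-negative rate."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    return 0.5 * (tp / pos + tn / neg)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = speaking, 0 = not speaking.
labels = [1, 1, 0, 0, 0, 1]
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.7]

# Assumed decision threshold of 0.5 for the binary prediction.
preds = [1 if s >= 0.5 else 0 for s in scores]

print("balanced accuracy:", balanced_accuracy(labels, preds))
print("ROC AUC:", roc_auc(labels, scores))
```

Balanced accuracy depends on the chosen threshold, whereas the AUC summarizes ranking quality across all thresholds, which is presumably why both are reported.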