Zhengqi Li is a third-year Ph.D. student at Cornell University and interned at Google Research. His paper "Learning the Depth of Moving People by Watching Frozen People" won the Best Paper Honorable Mention at the 2019 CVPR Conference.
"This paper examines 3D reconstruction of scenes with people from monocular video. The paper shows insightful creation of a dataset for this task, in addition to a solid execution of a state-of-the-art algorithm," the Award Committee stated. "It has strong potential to be impactful paper in this area, facilitating future work in outdoor reconstruction with complex moving scenes."
This episode is a live recording of Zhengqi Li presenting his paper during the CVPR poster session. He discussed his project in detail and how he analyzed data gathered from videos to predict dense depth in a scene.
/ Full Interview Transcripts /
My name is Zhengqi Li from Cornell University, and I'm a third-year Ph.D. student working with Prof. Noah Snavely. I will present our work on "Learning the Depth of Moving People by Watching Frozen People." This work was done when I was an intern at Google Research.
So the basic goal of our work is to predict dense depth maps from an ordinary video when both the camera and the people in the scene are freely moving. As we know, classical geometric motion stereo algorithms do not work for moving people, so we tackled this fundamental challenge using a data-driven approach.
However, data for learning the depth of moving people in the wild is difficult to collect at large scale. In this work, we created a dataset called the Mannequin Challenge. This dataset comes from a very surprising source on the internet. It contains thousands of YouTube videos in which people are imitating mannequins. All the people in the scenes are static while a hand-held camera tours the scenes. Because the people in the scenes are frozen, we can use classical Structure from Motion (SfM) and Multi-View Stereo (MVS) to recover camera poses and depth reliably.
Here are some examples from our Mannequin Challenge dataset. We can see it covers a wide range of scenes, with natural human poses and varying numbers of people. One thing I need to mention is that, as we know, internet videos are very noisy: people may "unfreeze," and there might also be fisheye lens distortion or camera motion blur in the videos. So we proposed a series of post-processing methods to remove such outlier frames and obtain the final dataset for training the model.
Once we have a dataset, the question is how to train the model on frozen people at training time but apply it to moving people at inference time. The simplest idea we can imagine is to input only a single RGB frame into the network and regress to the Multi-View Stereo depth. However, this method ignores the 3D information that exists in the neighboring frames of the video sequence.
In our proposed method, instead of feeding just a single RGB image into the network, we also include depth from motion parallax as additional information. In particular, we use Mask R-CNN to predict human masks. Then, for the frame t that we focus on, we select frame t - delta as the key frame, compute optical flow between these two frames, and convert the optical flow to depth using the camera poses and triangulation. At the same time, we can also compute a confidence from the optical flow and the camera poses. We then input the RGB image, the human-masked depth from motion parallax, and the confidence into the network. We hope the network will make use of that additional information to predict better depth over the entire scene.
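The flow-to-depth step described above can be illustrated for a single pixel. The following is a minimal sketch, not the authors' implementation: it assumes a pinhole camera with no skew, a relative pose (R, t) that maps reference-frame coordinates into key-frame coordinates, and a least-squares intersection of the two viewing rays. All function and parameter names are hypothetical.

```python
def backproject(K, u, v):
    """Viewing ray K^-1 [u, v, 1] for a pinhole camera K = (fx, fy, cx, cy)."""
    fx, fy, cx, cy = K
    return ((u - cx) / fx, (v - cy) / fy, 1.0)


def mat_vec(R, x):
    """Multiply a 3x3 matrix (nested sequences) by a 3-vector."""
    return tuple(sum(R[i][j] * x[j] for j in range(3)) for i in range(3))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def depth_from_flow(K, R, t, u, v, flow):
    """Depth of pixel (u, v) in the reference frame, from its flow to the key frame.

    Assumed convention: X_key = R @ X_ref + t. With X_ref = d_ref * r_ref and
    X_key = d_key * r_key, this gives d_ref * (R r_ref) - d_key * r_key = -t,
    three equations in two unknowns, solved here via the 2x2 normal equations.
    Returns None when the rays are (near-)parallel or the depth is not positive.
    """
    r_ref = backproject(K, u, v)
    r_key = backproject(K, u + flow[0], v + flow[1])
    a = mat_vec(R, r_ref)            # reference ray rotated into the key frame
    b = tuple(-x for x in r_key)
    rhs = tuple(-x for x in t)
    aa, ab, bb = dot(a, a), dot(a, b), dot(b, b)
    ar, br = dot(a, rhs), dot(b, rhs)
    det = aa * bb - ab * ab
    if abs(det) < 1e-12:             # no parallax: depth is unobservable
        return None
    d_ref = (ar * bb - ab * br) / det  # Cramer's rule for the first unknown
    return d_ref if d_ref > 0 else None


# Example: the camera translates 0.1 units to the right between the two frames.
K = (500.0, 500.0, 320.0, 240.0)
R_identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
t_right = (-0.1, 0.0, 0.0)
example_depth = depth_from_flow(K, R_identity, t_right, 370.0, 265.0, (-25.0, 0.0))
```

In this example the pixel corresponds to a static point at depth 2.0, and the sketch recovers a value close to that. A pixel with zero flow under a pure translation yields None, which is one reason a per-pixel confidence is useful alongside the depth.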
We evaluated our method on different datasets. We first evaluated it on the Mannequin Challenge test set, comparing our proposed full model with the RGB-only single-view depth prediction method. The results show that our proposed model predicts much better depth than the single-view baseline. We also evaluated our method on the standard TUM RGB-D dataset, where both the camera and the people are moving. There we also compared with the baseline method, the state-of-the-art motion stereo method DeMoN, and the single-view depth prediction method DORN. We can see that our full method can take advantage of motion parallax for the rigid scene and significantly outperforms the other baseline and state-of-the-art approaches.
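The talk does not name the error metric behind these comparisons. As an illustration only, one common choice when depth is known only up to scale is a scale-invariant RMSE in log-depth; the sketch below assumes that metric and uses flat lists of per-pixel depths, with the hypothetical function name `si_rmse`.

```python
import math


def si_rmse(pred, gt):
    """Scale-invariant RMSE between predicted and reference depths.

    Works on log-depth differences and subtracts their mean, so multiplying
    all predictions by a constant does not change the score. Pixels where
    either depth is non-positive are treated as invalid and skipped.
    """
    diffs = [math.log(p) - math.log(g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not diffs:
        raise ValueError("no valid pixels to compare")
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum(d * d for d in diffs) / n - mean * mean
    return math.sqrt(max(variance, 0.0))  # clamp tiny negative rounding error
```

Under this assumed metric, a prediction that is correct up to a global scale factor scores approximately zero, so the score reflects errors in relative depth rather than in the overall scale.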
Here is a qualitative comparison with the ground-truth depth captured by the Kinect sensor. We can see that our depth predictions are much better than the prior approaches in both human and non-human regions. We can also apply our model to arbitrary internet videos where both the camera and the people are moving simultaneously. Comparing our full model with the state-of-the-art motion stereo and single-view depth prediction methods, we can see that our method provides much more accurate depth predictions for entire scenes.
Finally, we can use such depth predictions for a variety of visual effects in augmented reality applications. For example, we can do video defocus and object insertion using our depth predictions. And when the people and the camera are moving, we can also remove the people entirely using our depth predictions.