Self-supervised Single-view 3D Reconstruction
via Semantic Consistency
Xueting Li1, Sifei Liu2, Kihwan Kim2, Shalini De Mello2, Varun Jampani2, Ming-Hsuan Yang1, and Jan Kautz2
1 University of California, Merced
2 NVIDIA
Abstract. We learn a self-supervised, single-view 3D reconstruction model that predicts the 3D mesh shape, texture, and camera pose of a target object from a collection of 2D images and silhouettes. The proposed method does not require 3D supervision, manually annotated keypoints, multi-view images of an object, or a prior 3D template. The key insight of our work is that objects can be represented as a collection of deformable parts, and each part is semantically coherent across different instances of the same category (e.g., wings on birds). Therefore, by leveraging part segmentations of a large collection of category-specific images learned via self-supervision, we can effectively enforce semantic consistency between the reconstructed meshes and the original images. This significantly reduces ambiguities during the joint prediction of an object's shape and camera pose, along with its texture. We demonstrate that our unsupervised method performs comparably to, if not better than, existing category-specific reconstruction methods learned with supervision. More details can be found at the project page https://sites.google.com/nvidia.com/unsup-mesh-2020.