Authors: Haoyu Ren, Aman Raj, Mostafa El-Khamy, Jungwon Lee

Description: We introduce SUW-Learn, a framework for deep learning with joint supervised (S), unsupervised (U), and weakly supervised (W) learning. We deploy SUW-Learn for deep learning of monocular depth from images and video sequences. The supervised learning module optimizes a depth estimation network using ground-truth depth. In contrast, the unsupervised learning module has no access to ground-truth depth; instead, it optimizes the depth estimation network by predicting the current frame from the estimated 3D geometry. The weakly supervised module optimizes the depth estimation by evaluating the consistency between the estimated depth and weak labels derived from other information, such as semantic segmentation. SUW-Learn trains the networks end-to-end with joint optimization of the desired SUW objectives. To improve the performance of monocular depth networks on scenes containing people, we construct the M&M dataset by combining two recent datasets with different domain knowledge and from different sources: the MegaDepth dataset, with images of people around landmarks, and the Mannequin Challenge dataset, with video sequences of frozen people. We demonstrate the benefits of joint SUW learning in improving generalization on the M&M dataset. We benchmark SUW-Learn on the proposed M&M dataset and on the KITTI driving-scene dataset, and achieve state-of-the-art performance.
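The three modules described above can be read as three loss terms combined into one end-to-end objective. The following is a minimal NumPy sketch of that combination; the loss weights, the weak-label formulation (per-segment depth consistency), and all function names are illustrative assumptions, not the paper's actual loss definitions.

```python
import numpy as np

def supervised_loss(pred_depth, gt_depth, valid_mask):
    """Supervised term: L1 error on pixels with valid ground-truth depth."""
    n = max(valid_mask.sum(), 1)
    return float(np.abs((pred_depth - gt_depth) * valid_mask).sum() / n)

def unsupervised_loss(synthesized_frame, target_frame):
    """Unsupervised term: photometric error between the frame predicted
    from the estimated 3D geometry and the observed frame."""
    return float(np.abs(synthesized_frame - target_frame).mean())

def weak_loss(pred_depth, segment_ids):
    """Weakly supervised term (hypothetical): encourage depth consistency
    within each semantic segment derived from a segmentation map."""
    segments = np.unique(segment_ids)
    return float(sum(pred_depth[segment_ids == s].var()
                     for s in segments) / len(segments))

def suw_loss(pred_depth, gt_depth, valid_mask,
             synthesized_frame, target_frame, segment_ids,
             w_s=1.0, w_u=0.5, w_w=0.1):
    """Joint SUW objective: a weighted sum of the three terms.
    The weights w_s, w_u, w_w are placeholder values."""
    return (w_s * supervised_loss(pred_depth, gt_depth, valid_mask)
            + w_u * unsupervised_loss(synthesized_frame, target_frame)
            + w_w * weak_loss(pred_depth, segment_ids))
```

In practice each term would be computed on network outputs inside a training loop and backpropagated jointly; when a sample lacks ground-truth depth, its supervised term can simply be masked out, which is what the `valid_mask` argument stands in for here.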