Authors: Yunhang Shen, Rongrong Ji, Zhiwei Chen, Xiaopeng Hong, Feng Zheng, Jianzhuang Liu, Mingliang Xu, Qi Tian Description: We investigate the emerging task of learning object detectors with sole image-level labels on the web without requiring any other supervision like precise annotations or additional images from well-annotated benchmark datasets. Such a task, termed as fully webly supervised object detection, is extremely challenging, since image-level labels on the web are always noisy, leading to poor performance of the learned detectors. In this work, we propose an end-to-end framework to jointly learn webly supervised detectors and reduce the negative impact of noisy labels. Such noise is heterogeneous, which is further categorized into two types, namely background noise and foreground noise. Regarding the background noise, we propose a residual learning structure incorporated with weakly supervised detection, which decomposes background noise and models clean data. To explicitly learn the residual feature between clean data and noisy labels, we further propose a spatially-sensitive entropy criterion, which exploits the conditional distribution of detection results to estimate the confidence of background categories being noise. Regarding the foreground noise, a bagging-mixup learning is introduced, which suppresses foreground noisy signals from incorrectly labelled images, whilst maintaining the diversity of training data. We evaluate the proposed approach on popular benchmark datasets by training detectors on web images, which are retrieved by the corresponding category tags from photo-sharing sites. Extensive experiments show that our method achieves significant improvements over the state-of-the-art methods.