Authors: Qi Qian, Juhua Hu, Hao Li
Description: With the tremendous success of deep learning in visual tasks, the representations extracted from intermediate layers of learned models, that is, deep features, have attracted much attention from researchers. Previous empirical analysis shows that these features can contain appropriate semantic information. Therefore, with a model trained on a large-scale benchmark data set (e.g., ImageNet), the extracted features can work well on other tasks. In this work, we investigate this phenomenon and demonstrate that deep features can be suboptimal because they are learned by minimizing the empirical risk. When the data distribution of the target task differs from that of the benchmark data set, the performance of deep features can degrade. Hence, we propose a hierarchically robust optimization method to learn more generic features. Considering example-level and concept-level robustness simultaneously, we formulate the problem as a distributionally robust optimization problem with Wasserstein ambiguity set constraints and propose an efficient algorithm that works with the conventional training pipeline. Experiments on benchmark data sets demonstrate the effectiveness of the robust deep representations.
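
For orientation, the contrast between empirical risk minimization and a distributionally robust objective can be written in its generic textbook form. This is only a sketch: the symbols (model parameters \theta, loss \ell, empirical distribution \hat{P}_n, Wasserstein distance W, radius \epsilon) are notation assumed here, not taken from the description, and the hierarchical formulation that combines example-level and concept-level robustness is not reproduced.

```latex
% Empirical risk minimization over n training examples (assumed notation):
\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \ell(x_i, y_i; \theta)

% Generic distributionally robust counterpart: minimize the worst-case
% expected loss over all distributions Q within Wasserstein distance
% \epsilon of the empirical distribution \hat{P}_n (the ambiguity set):
\min_{\theta} \; \max_{Q \,:\, W(Q, \hat{P}_n) \le \epsilon} \;
  \mathbb{E}_{(x, y) \sim Q} \big[ \ell(x, y; \theta) \big]
```

Setting \epsilon = 0 collapses the ambiguity set to \hat{P}_n itself and recovers the empirical risk, which is one way to see why features trained purely by empirical risk minimization carry no guarantee once the target distribution shifts.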