Authors: Kaiyue Pang, Yongxin Yang, Timothy M. Hospedales, Tao Xiang, Yi-Zhe Song

Description: ImageNet pre-training has long been considered crucial by the fine-grained sketch-based image retrieval (FG-SBIR) community due to the lack of large sketch-photo paired datasets for FG-SBIR training. In this paper, we propose a self-supervised alternative for representation pre-training. Specifically, we consider the jigsaw puzzle game of recomposing images from shuffled parts. We identify two key facets of jigsaw task design that are required for effective FG-SBIR pre-training. The first is formulating the puzzle in a mixed-modality fashion. Second, we show that framing the optimisation as permutation matrix inference via Sinkhorn iterations is more effective than the common classifier formulation of jigsaw self-supervision. Experiments show that this self-supervised pre-training strategy significantly outperforms the standard ImageNet-based pipeline across all four product-level FG-SBIR benchmarks. Interestingly, it also leads to improved cross-category generalisation across both pre-train/fine-tune and fine-tune/testing stages.
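For intuition on the permutation-inference formulation mentioned above, the sketch below shows the generic Sinkhorn normalisation procedure: alternately normalising the rows and columns of an exponentiated score matrix yields a doubly-stochastic matrix, i.e. a differentiable relaxation of a permutation matrix over patch positions. This is a minimal illustration of the general technique, not the paper's actual implementation; the function name, temperature, and iteration count are assumptions.

```python
import numpy as np

def sinkhorn(scores, n_iters=20, tau=1.0):
    """Relax a raw score matrix into a doubly-stochastic matrix
    via Sinkhorn iterations: exponentiate, then alternately
    normalise rows and columns. (Illustrative sketch, not the
    paper's implementation.)"""
    S = np.exp(scores / tau)
    for _ in range(n_iters):
        S = S / S.sum(axis=1, keepdims=True)  # rows sum to 1
        S = S / S.sum(axis=0, keepdims=True)  # columns sum to 1
    return S

# Toy example: scores favouring the identity permutation of 3 patches.
scores = np.array([[5.0, 0.0, 0.0],
                   [0.0, 5.0, 0.0],
                   [0.0, 0.0, 5.0]])
P = sinkhorn(scores)
```

At convergence every row and column of `P` sums to 1, so `P` can be treated as a soft assignment of shuffled patches to grid positions and trained end-to-end, unlike the classifier formulation, which must enumerate a fixed subset of permutations as discrete classes.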