Knowledge distillation, introduced in the deep learning context, is a method for transferring knowledge from one architecture to another. In particular, when the two architectures are identical, this is called self-distillation. The idea is to feed the predictions of the trained model back in as new target values for retraining (and possibly to iterate this loop a few times). It has been empirically observed that the self-distilled model often achieves higher accuracy on held-out data. Why this happens, however, has been a mystery: the self-distillation dynamics receives no new information about the task and evolves solely by looping over training.
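The loop described above can be sketched in a toy setting. This is a minimal illustration, not the paper's construction: it fits noisy data with ridge-regularized regression in an arbitrary random-feature basis (all names and the feature map are illustrative choices), then repeatedly retrains on the model's own predictions.

```python
import numpy as np

# Hypothetical toy setup: noisy samples of a smooth target function.
rng = np.random.default_rng(0)
n, d = 40, 100
X = np.sort(rng.uniform(-1, 1, size=n))
y = np.sin(3 * X) + 0.3 * rng.normal(size=n)   # noisy training targets

# Random cosine feature map (an illustrative basis, not the paper's).
W = rng.normal(size=d)
b = rng.uniform(0, 2 * np.pi, size=d)
Phi = np.cos(np.outer(X, W) + b)               # n x d design matrix

lam = 1e-1                                      # l2 (ridge) penalty
targets = y.copy()
norms = []
for step in range(4):                           # self-distillation rounds
    # Closed-form ridge solution: w = (Phi^T Phi + lam I)^{-1} Phi^T targets
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ targets)
    preds = Phi @ w
    norms.append(float(np.linalg.norm(preds)))
    # The next round trains on the current model's predictions, not the data.
    targets = preds
```

Because each round passes the targets through a linear smoother whose eigenvalues lie strictly below one, the fitted function shrinks monotonically from round to round, consistent with the progressive-regularization picture analyzed in the paper.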
We provide the first theoretical analysis of self-distillation. We focus on fitting a nonlinear function to training data, where the model space is a Hilbert space and the fit is subject to ℓ2 regularization in this function space. We show that self-distillation iterations result in a dynamic regularization that progressively limits the number of basis functions available to represent the solution. This implies that while a few rounds of self-distillation may reduce over-fitting, further rounds can lead to under-fitting and thus worse performance. The regularization induced by self-distillation is very different from that of ridge regression: no choice of ridge penalty coefficient can achieve a similar regularization effect.
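The spectral intuition behind this claim can be sketched numerically. Under ℓ2 regularization, one round of retraining on predictions acts on the targets as a linear smoother with eigenvalues in [0, 1); t rounds raise each eigenvalue to the t-th power, so weakly-retained basis directions are suppressed much faster than strongly-retained ones. A single ridge fit, by contrast, rescales the spectrum uniformly and cannot reproduce this profile. The eigenvalues below are illustrative numbers, not values from the paper:

```python
import numpy as np

# Illustrative eigenvalues of the one-round smoothing operator, each of the
# form s_i^2 / (s_i^2 + lam), hence in [0, 1).
a = np.array([0.95, 0.80, 0.50, 0.20, 0.05])

effective_sizes = []
for t in [1, 2, 4, 8]:                 # number of self-distillation rounds
    at = a ** t                        # t rounds compound the smoother
    # Count eigen-directions that still retain at least 10% of their weight:
    # a crude proxy for the number of usable basis functions.
    effective_sizes.append(int(np.sum(at > 0.1)))

print(effective_sizes)                 # prints [4, 3, 2, 2]
```

The effective basis size shrinks as rounds accumulate, matching the abstract's point that a few rounds regularize while many rounds under-fit.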