Towards Better Generalization of Adaptive Gradient Methods