Train Big, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers