An Image is Worth 16 × 16 Tokens: Visual Priors for Efficient Image Synthesis with Transformers