On Importance Sampling-Based Evaluation of Latent Language Models