Provably Efficient Policy Optimization with Thompson Sampling