Self-Supervised Video Models from Sound and Speech