Self-supervised learning with sight, sound, and space