This episode is an interview with Josh Tobin, a former OpenAI researcher, discussing highlights from his paper, Geometry-aware Neural Rendering, which was accepted as an oral presentation at the NeurIPS 2019 conference.
Josh Tobin is a researcher working at the intersection of machine learning and robotics. His research focuses on applying deep reinforcement learning, generative models, and synthetic data to problems in robotic perception and control. He did his PhD in Computer Science at UC Berkeley advised by Pieter Abbeel and was a research scientist at OpenAI for 3 years during his PhD.
Paper At A Glance:
Understanding the 3-dimensional structure of the world is a core challenge in computer vision and robotics. Neural rendering approaches learn an implicit 3D model by predicting what a camera would see from an arbitrary viewpoint. We extend existing neural rendering to more complex, higher dimensional scenes than previously possible. We propose Epipolar Cross Attention (ECA), an attention mechanism that leverages the geometry of the scene to perform efficient non-local operations, requiring only O(n) comparisons per spatial dimension instead of O(n^2). We introduce three new simulated datasets inspired by real-world robotics and demonstrate that ECA significantly improves the quantitative and qualitative performance of Generative Query Networks (GQN).
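To make the O(n) versus O(n^2) claim concrete, here is a minimal sketch of the geometric idea only, not the paper's implementation. It assumes the relative camera geometry is available as a fundamental matrix F (a standard multi-view geometry object), so that a pixel in the target view can only match pixels on one line in the context view. The function names and the example F (a pure sideways camera shift) are hypothetical choices made for illustration.

```python
import numpy as np

def epipolar_line(F, pixel_xy):
    """Return (a, b, c) with a*x + b*y + c = 0: the line in the context
    image on which the match for `pixel_xy` (target image) must lie."""
    x = np.array([pixel_xy[0], pixel_xy[1], 1.0])  # homogeneous coordinates
    return F @ x

def candidates_on_line(line, width, height):
    """Collect integer pixel locations along the line, at most one per
    column (or one per row when the line is closer to vertical)."""
    a, b, c = line
    points = []
    if a == 0 and b == 0:
        return points  # degenerate line, no candidates
    if abs(b) >= abs(a):
        for x in range(width):
            y = int(round(-(a * x + c) / b))
            if 0 <= y < height:
                points.append((x, y))
    else:
        for y in range(height):
            x = int(round(-(b * y + c) / a))
            if 0 <= x < width:
                points.append((x, y))
    return points

# Assumed example: camera translated along its x-axis, so every
# epipolar line is the image row with the same y as the query pixel.
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
width, height = 16, 16
candidates = candidates_on_line(epipolar_line(F, (5, 7)), width, height)

print(len(candidates), "candidates on the epipolar line")
print(width * height, "candidates for attention over the whole image")
```

For an n-by-n image, a query pixel is compared against roughly n locations along its line rather than all n^2 locations in the context image, which is where the saving comes from.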
Presentation Slides: https://github.com/dicarlolab/neurips2019/blob/master/slides.pdf
/ Full Interview Transcripts /
Wenli: We’re at NeurIPS 2019 with Josh Tobin. He is a former researcher at UC Berkeley and OpenAI. Nice to meet you, and thank you for joining us here.
Josh Tobin: Great to meet you as well. Thanks.
Wenli: You're here because you have a paper that recently got accepted. Congratulations!
Josh Tobin: Thanks so much.
Wenli: The paper is titled “Geometry-aware Neural Rendering”. Can you introduce what it is about?
Josh Tobin: The goal of the paper is to help robots understand the scenes in the world that they're interacting with. Typically, the way you do that in robotics is to have some state representation of the world: things like where all the objects are, what pose the robot is in, where the robot is, etc. The challenge is that those types of scene representations are really difficult to scale to more complex scenes, where you have a lot of objects and the objects themselves are really complex.
The topic of the paper is implicit scene representations. What that means is, if you take some observations of a scene, say some camera images that show the scene from different viewpoints, you want to train a model that has some internal understanding of what's happening in the scene. The way we do that is using a formulation called “neural rendering”. The way neural rendering works is, you train a neural network that takes as input one or more viewpoints of the scene, for example the camera looking at the scene from above, from the left, and from the right. The goal of that model is, given some other arbitrary viewpoint, like over here where it has never seen the world before, to accurately render what the world would look like from that viewpoint. If you can do that well, the intuition is that internally the model has to have some representation that understands everything that's happening in the world.
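The setup described here can be sketched as a simple interface: context images and their camera poses go in, and an image for a new query pose comes out. The sketch below is only a toy illustration of that data flow. The function names, the 7-number pose vector, the image sizes, and the averaging "encoder" are all assumptions, and the mean squared error is a stand-in rather than the training objective used in the paper.

```python
import numpy as np

def encode_scene(context_images, context_poses):
    """Toy stand-in for a learned encoder: collapse all observations
    into one scene representation (here, just the average image)."""
    return context_images.mean(axis=0)

def render(scene_repr, query_pose):
    """Toy stand-in for a learned renderer. A real model would condition
    on query_pose; this placeholder ignores it."""
    return scene_repr

def rendering_loss(predicted, target):
    """Mean squared pixel error between prediction and ground truth."""
    return float(np.mean((predicted - target) ** 2))

rng = np.random.default_rng(0)
context_images = rng.random((3, 8, 8, 3))  # 3 views, 8x8 RGB (assumed sizes)
context_poses = rng.random((3, 7))         # one pose vector per view (assumed)
query_pose = rng.random(7)                 # viewpoint never seen by the model
target_image = rng.random((8, 8, 3))       # what the camera really sees there

scene = encode_scene(context_images, context_poses)
prediction = render(scene, query_pose)
loss = rendering_loss(prediction, target_image)
print(prediction.shape, loss)
```

Training would replace the two toy functions with neural networks and lower the loss across many scenes, so that the scene representation is forced to capture what is in the world.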