Xin Wang is a fourth-year Ph.D. student at UC Santa Barbara and interned at Microsoft Research. His paper "Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation" won the Best Student Paper at the 2019 CVPR Conference.
"Visual navigation is an important area of computer vision; this paper makes advances in vision-language navigation," the Award Committee stated. "Building on previous work in this area, this paper demonstrates exciting results based on self-imitation learning within a cross-modal setting."
This episode is a live recording of our interview with Xin Wang at the conference. He discusses his project in detail and compares research and development in industry and academia.
/ Full Interview Transcripts /
Wenli: We have Xin Wang, the Best Student Paper Award winner here with us. And he's from UC Santa Barbara, a fourth year PhD student. So tell us a bit more about the paper that just won the award. Congratulations, first of all.
Xin Wang: Thank you. I'm Xin Wang, from UC Santa Barbara. I just finished my fourth-year PhD study. I'm generally interested in computer vision, natural language processing and machine learning, especially the intersection of those three areas. So this paper is about vision and language navigation. It’s the task of navigating embodied agents inside 3D environments by following natural language instructions. And this work is a collaborative project with Microsoft Research. I started the project at Microsoft Research when I was interning there in [the] summer. After summer, I went back to the university and continued working on it and eventually, we submitted the work to CVPR.
Wenli: So you were an intern at Microsoft Research. Are you more interested in working in industry in the future?
Xin Wang: I was especially interested in industry positions before, but recently I have changed my mind. Now I'm more interested in looking for a faculty position.
Wenli: Why is that? Emphasizing the long term effects?
Xin Wang: I still think the best place for pure research is academia.
Wenli: Right, that's true.
Xin Wang: We will have more freedom to do what we would like to do. And I also enjoy advising students.
Wenli: Okay, you are teaching? That's nice. Do you see the gap that people are constantly talking about, the development gap between industry and academia, in terms of data and resources? Which one is far behind the other?
Xin Wang: I think one of the big advantages of industry is resources and data. You can have unlimited GPU resources to train your models, and you have access to both internal and external data at the companies. You can also find more people working on the same project together. That's really good.
Wenli: But you would trade off that with the freedom in the research?
Xin Wang: That's my thought on that.
Wenli: Interesting. Let's go back to the paper. So what was the process? What inspired you to write this paper and to find the advisor and the team you’re working with?
Xin Wang: So I have been working on vision and language for two or three years and I have been working on teaching the machine to describe the visual world. So along this direction, eventually if you think about this, your robot should not only describe the static scene, the visual world, it should also be able to interact with the physical world to perform some physical actions. So when I saw this vision and language navigation dataset, I was very excited. So that’s something I really wanted to work on. So I decided to work on this problem, together with my advisors and Microsoft Research collaborators.
Wenli: What would be the biggest contribution of your paper?
Xin Wang: For this task, one of the limitations is that the success signal is rather coarse. If the agent reaches the destination, it counts as a success, completely ignoring whether it has followed the natural language instructions or not. For example, you could randomly walk inside a house and happen to stop at the final destination, and that would still count as a success. But this is not what we want; we really want the agent to understand the natural language and actually follow the instructions. So one of the ideas of this paper is a reinforced cross-modal matching method with an additional matching critic, which evaluates to what extent the original instruction can be reconstructed from the generated trajectory, so that the agent has an incentive to actually follow the instruction.
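The matching-critic idea described above can be illustrated with a toy sketch. This is not the paper's implementation (which trains an attention-based sequence model to score how well the trajectory reconstructs the instruction); here the critic is just word overlap between the instruction and a textual rendering of the trajectory, and the names `matching_reward` and `total_reward` are hypothetical:

```python
def matching_reward(instruction, trajectory_description):
    """Toy stand-in for the matching critic: fraction of instruction words
    that also appear in a textual description of the trajectory, as a crude
    proxy for 'can the instruction be reconstructed from this trajectory?'"""
    inst = set(instruction.lower().split())
    traj = set(trajectory_description.lower().split())
    if not inst:
        return 0.0
    return len(inst & traj) / len(inst)

def total_reward(extrinsic_success, instruction, trajectory_description, weight=0.5):
    """Combine the coarse success signal (1.0 if the agent reached the goal,
    else 0.0) with the intrinsic instruction-matching reward."""
    return extrinsic_success + weight * matching_reward(instruction, trajectory_description)
```

Under this combined reward, an agent that reaches the goal while following the instruction scores higher than one that wanders to the goal by luck, which is exactly the distinction the coarse success signal cannot make.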
Another critical challenge of this task is generalizability. The agent is usually trained on seen environments and tested on unseen environments, and the performance gap between the two is really large. In practical settings, for example with an in-home robot, we would like the robot to get familiar with the house it is deployed to. So we proposed a self-supervised imitation learning method that lets the robot explore the unseen environment with self-supervision, so that the policy can adapt to those new environments. By doing so, the performance gap between seen and unseen environments is greatly reduced.
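One round of that exploration can be sketched as follows. This is a toy illustration, not the paper's method: the real approach samples trajectories with the navigation policy in the unseen environment and imitates the one the matching critic scores best. Here trajectories are random action strings, the critic is a simple position-wise match against a fixed instruction, and the function names are hypothetical:

```python
import random

def self_imitation_explore(sample_trajectory, critic, num_samples=8, seed=0):
    """Sample several candidate trajectories in the unseen environment, score
    each with the matching critic, and return the best one as a pseudo
    ground truth that the policy would then imitate."""
    rng = random.Random(seed)
    candidates = [sample_trajectory(rng) for _ in range(num_samples)]
    return max(candidates, key=critic)

# Toy setup: trajectories are five-step action strings, and the critic
# counts position-wise matches against this (hypothetical) instruction.
instruction = "go forward forward left stop"

def sample_trajectory(rng):
    return " ".join(rng.choice(["go", "forward", "left", "right", "stop"])
                    for _ in range(5))

def critic(trajectory):
    inst = instruction.split()
    return sum(a == b for a, b in zip(inst, trajectory.split()))

pseudo_label = self_imitation_explore(sample_trajectory, critic)
# pseudo_label would then serve as the target of a supervised imitation update.
```

The design point is that no human annotation of the unseen environment is needed: the critic learned on seen environments supplies the supervision signal.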
Wenli: What’s the task you'll be working on next?
Xin Wang: I will certainly continue working on this exciting and necessary direction to combine vision, language and robotics to teach the robot to see the world, describe the world and even interact with the world.
Wenli: Well, thank you so much. And also congratulations again for winning the award. Thank you so much for coming here to share with us.
Xin Wang: Thanks for inviting me here.