This episode is a live recording of our interview with Torsten Sattler at the CVPR 2019 conference. Torsten Sattler is an Associate Professor in the Department of Electrical Engineering at Chalmers University of Technology in Sweden. Sattler's research primarily focuses on mixed reality, autonomous driving and robotics.
Sattler organized a workshop titled "Long Term Visual Localization Under Changing Conditions". The workshop focused on the visual localization problem as it applies to robotics, augmented reality and autonomous vehicles. During the interview, he shared key applications of this research, the open challenges in the field, his takeaways from the workshop, and the difference between working in the US and in Sweden.
/ Full Interview Transcripts /
Wenli: Today we have Torsten Sattler here with us. He is an associate professor at Chalmers University of Technology in Sweden, and he's also the organizer of the workshop “Long Term Visual Localization Under Changing Conditions”. Thank you so much for being here to share with our community.
Torsten Sattler: Thank you very much for having me. I feel very honored.
Wenli: Tell us a little about the workshop. What is it about?
Torsten Sattler: The visual localization problem is: given an image of a known scene, try to figure out where it was taken. That has important applications. If you want to build a self-driving car, the car needs to know where it is, so that it can figure out where to go. If you have a robot that should navigate through a scene, let’s say a cleaning robot, it needs to know where it is in your apartment, so that it can figure out where it has been and where it still needs to clean. If you think about a service robot that's going to fetch something from the kitchen, the robot should know where it is in the building to get there. Or if you're doing augmented or mixed reality, where you try to project virtual objects into the field of view of the user, you actually need to know where you are in the world and where you are with respect to those virtual objects, so that you can actually do the augmentation.
There has been a lot of research on it, and what we as a community always did was capture the scene once and then assume the scene representation is valid forever, which is obviously not true. A lot of things change. There are seasonal changes - leaves on the tree, no leaves on the tree; there might be snow on the ground, day/night changes, or indoor scenes with a lot of furniture moving around. Localization algorithms need to be robust against this. And that motivated the workshop. So it essentially consists of two parts: invited talks by experts from industry and academia, and the challenge, where people were supposed to evaluate their algorithms on our datasets and then try to see who could build the best algorithm that's able to localize images accurately while being robust against various changes in the scene.
Wenli: When you organized this workshop, how did you select the speakers among the data industry experts?
Torsten Sattler: It's a bit of a family thing in the sense that we have ties to many of the people that were invited. So we invited Jan-Michael Frahm from University of North Carolina at Chapel Hill; nowadays he is also at Facebook. He has been doing a lot of work on 3D reconstruction, also in time-varying scenes, [so it] seemed natural to invite him. We invited Bernhard Zeisl from Google, who is a tech lead in their division that builds visual localization algorithms: the capability where you take the phone up, wave it around and it will figure out where you are, and then display where you need to go. He has been working on some of the core components there that enable this.
Wenli: What does it have to do with AR? Google Map?
Torsten Sattler: AR in the sense that they don't show you on a map where you have to go, but they display an arrow in your camera view. So it's a much more direct visualization of where you want to go, and he seemed to be a natural choice. We had Niko Sünderhauf, who’s more of a robotics person, but we wanted to have this combination of computer vision and robotics because we're all working on the same thing, but in a lot of separate communities. I think there's very much value in trying to bring those two communities together. We have Srikumar Ramalingam, who's at the University of Utah, and I think by now also at Google, so a lot of things that he didn't plan.
Wenli: Yeah, the industry and academia have started to get really blurred.
Torsten Sattler: Yes, and he did a lot of interesting work on localization with semantics that we thought is very interesting for the community, so that's how we got the invited talks. For the organizers, this is a very good example of the value of getting your paper rejected because we submitted a paper in 2017 that described part of this benchmark that we built, and that got rejected. We met a couple of other people who wanted to build the same thing, and we decided to join forces. That led to part of this benchmark, and then for the workshop, we initially planned to have a workshop on our benchmark datasets. We met a bunch of other people who were working on the same thing, and we figured, well, let's join forces and build one large team.
Wenli: Hopefully, this provides some guidance for the industry and people who are interested in the field.
Torsten Sattler: I think it's a very hot topic in the industry based on what I've seen. There were quite some people from industry attending the workshop.
Wenli: What's the highlight? What did you learn this time?
Torsten Sattler: One of my personal highlights was Bernhard's talk about what Google is doing. He couldn't go into details, but it was very interesting to see that side. The other highlights are actually the talks of the people who participated in the challenge. If you organize a challenge like this, you hope you get some decent contributions, and I was very, very happy with the quality of all the submissions that we got. So it was very nice to see.
Wenli: Are you going to host it again next year?
Torsten Sattler: We’re considering this, probably yes. I got very good feedback. I think it's still an interesting problem. We haven't solved it yet.
Wenli: Some teams will continue and their datasets will increase.
Torsten Sattler: Hopefully yes.
Wenli: What are some of the challenges that you are facing right now?
Torsten Sattler: One big challenge is localizing nighttime images against a scene representation built during the day.
Wenli: Is that an entirely different dataset?
Torsten Sattler: Now we have multiple datasets. For each dataset, we have one reference representation taken during the day in a certain season, and then we captured images taken at night, in different seasons and under different weather conditions, so that we could actually have a fine-grained evaluation of what happens if you start messing around with things. There are certain things that seem to be easy, or easier than we thought, but localizing nighttime images against a scene representation built during the day, that’s a hard problem far from being solved.
Then the other thing is, if you have scenes with lots of vegetation, say trees, a lot of grass on the side that change over time. That's hard enough. No one really has a solution for this. I think there are some exciting possibilities with machine learning on predicting those changes, because they're not arbitrary. So if I ask you how a tree looks in spring, you would say, well, it has leaves. If I ask you how it looks in autumn, you would say the same thing without leaves. A machine learning algorithm should be able to predict those changes, and then hopefully, we can handle those scenes better.
Wenli: How long have you been working in this field?
Torsten Sattler: I have been working on this since 2011, eight years by now.
Wenli: What are some of the breakthroughs that you can share with us?
Torsten Sattler: I think the amazing thing is being able to scale things really to city scale, trying to figure out what the challenges are in terms of designing scalable algorithms. The larger the scene is, the more complicated it gets, the more ambiguous it gets, in the sense that there might be buildings that look very similar in different parts of the scene. So handling this, understanding how to deal with repeating elements, I think that was a very interesting thing.
Wenli: Deep learning of the big data.
Torsten Sattler: Exactly. And then seeing machine learning algorithms start to replace some components of the handcrafted pipelines that we designed back in the days, and making them substantially better. That was an exciting thing.
Wenli: You just said that you had a paper rejected in 2017. But this year, you had several papers accepted. In particular, two of your papers caught a lot of attention. Can you briefly tell us about the papers?
Torsten Sattler: The first paper we presented here yesterday was called BAD SLAM, which suggests that it is nothing that you would like to run, but actually the SLAM problem is the problem of simultaneously building a 3D map and trying to figure out where you are. We designed a new and very accurate algorithm for building those maps from images and update it synchronously. That was a fun project to do.
The other paper that we presented yesterday looked a bit into a machine learning algorithm for localization, understanding the limitations of absolute camera pose regression. The idea is that one way to build a localization system is to train your neural network to take an image as input and then output the position and orientation from which you took this image. One thing that we noticed is that those things don't work that well, but they seem to operate similarly to an algorithm that takes the set of images used to train the neural network, finds the nearest one, takes the position and orientation of that one, and uses it as an approximation. That works about as well as trying to train a network that predicts an accurate pose. That was an interesting observation that will hopefully lead the community to try and develop algorithms that work much better.
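[Illustration, not part of the interview: the retrieval baseline Sattler describes can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the method from the paper. It assumes each training image is already summarized by a global descriptor vector; the toy descriptors, poses, and the names `retrieve_pose` and `descriptor_distance` are hypothetical.]

```python
import math

def descriptor_distance(a, b):
    """Euclidean distance between two image descriptors (plain lists of floats)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve_pose(query_descriptor, database):
    """Return the pose of the training image whose descriptor is nearest to the query.

    `database` is a list of (descriptor, pose) pairs. The returned pose is only
    an approximation: it is the pose of the nearest training image, not of the query.
    """
    nearest = min(database, key=lambda entry: descriptor_distance(query_descriptor, entry[0]))
    return nearest[1]

# Toy database: made-up descriptors paired with made-up camera poses.
database = [
    ([0.9, 0.1, 0.0], {"position": (0.0, 0.0, 0.0), "yaw_deg": 0.0}),
    ([0.1, 0.8, 0.1], {"position": (5.0, 0.0, 0.0), "yaw_deg": 90.0}),
    ([0.0, 0.2, 0.9], {"position": (5.0, 5.0, 0.0), "yaw_deg": 180.0}),
]

query = [0.2, 0.7, 0.1]  # descriptor of the image we want to localize
approx_pose = retrieve_pose(query, database)
print(approx_pose)
```

[The accuracy of such a baseline is bounded by how densely the training images cover the scene, since it can only ever return a pose that was already seen.]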
Wenli: You showed them that there’s a solution on this.
Torsten Sattler: Yes, hopefully. For the D2-Net paper (“D2-Net: A Trainable CNN for Joint Detection and Description of Local Features”), which I will present in two hours, the idea is that we would like to train local features. So you'd like to be able to establish correspondences between pixels in one image and pixels in another image. We developed a new approach that jointly detects the positions of those features and also computes a vector representation that you can use to figure out which region in the first image belongs to which region or which pixel in the second image.
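[Illustration, not part of the interview: once each detected feature has a descriptor vector, correspondences between two images can be found by comparing descriptors. The sketch below uses mutual nearest-neighbor matching, a common baseline that is assumed here for illustration and is not claimed to be the exact procedure of the paper. The descriptors are toy values and the function names are hypothetical.]

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two descriptor vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_index(descriptor, candidates):
    """Index of the candidate descriptor closest to `descriptor`."""
    return min(range(len(candidates)), key=lambda j: squared_distance(descriptor, candidates[j]))

def mutual_nearest_neighbors(desc_a, desc_b):
    """Keep a pair (i, j) only if j is the best match for i AND i is the best match for j."""
    matches = []
    for i, descriptor in enumerate(desc_a):
        j = nearest_index(descriptor, desc_b)
        if nearest_index(desc_b[j], desc_a) == i:
            matches.append((i, j))
    return matches

# Toy descriptors for features detected in two images.
desc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
desc_b = [[0.1, 0.9], [0.9, 0.1]]

matches = mutual_nearest_neighbors(desc_a, desc_b)
print(matches)
```

[The mutual check discards the third feature of the first image: its closest descriptor in the second image is already a better match for another feature.]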
Wenli: What are the future applications of this technology?
Torsten Sattler: We're using this for localization. It makes nice progress on day/night images. That's one of the things that I like about it. It allows us to get correspondences where we couldn't get correspondences before. Once you know which pixels in the two images are related, you can reason about geometry and reason about the position of one camera with respect to the other. That’s a core building block for augmented reality and for self-driving cars, so hopefully, this will allow us to take the next step. I'm very excited about this, it’s one of my favorites.
Wenli: It sounds like you have a lot of collaborations with teams in the US. What are the differences between the research environment in Sweden and the US?
Torsten Sattler: They're all very talented people, and they're all doing great work. In this respect, there is no difference. You can do good work anywhere in the world. For me, it's fun to work with people with different backgrounds because you learn a lot about approaching the research problem from a different perspective, in the sense that everyone brings his or her own opinion to the table, their experience in doing research and way of approaching research. It's just a fun thing to know the experiences and you get to learn from other people. You try to figure out if what works for them also works for you. And if it doesn't, why doesn't it work? Can I improve what I'm doing based on what I've seen from others?
Wenli: Thank you so much for coming to our platform to share with us.
Torsten Sattler: Thank you very much for your interest and for having me.