This episode is a live recording of our interview with Profs. Kilian Q. Weinberger and Bharath Hariharan as well as their students Yan Wang, Wei-Lun Chao and Divyansh Garg at the CVPR 2019 conference. Dr. Kilian Q. Weinberger is an Associate Professor of computer science at Cornell University. He has won several best paper awards at ICML, CVPR, AISTATS and KDD (runner-up award). He was awarded the Outstanding AAAI Senior Program Chair Award in 2011 and served as co-Program Chair for ICML 2016 and for AAAI 2018. Dr. Hariharan is an Assistant Professor of computer science at Cornell University. Yan Wang is a second-year PhD student at Cornell University, and Divyansh Garg is a fourth-year undergraduate student currently interning at Google AI. Dr. Wei-Lun Chao is a postdoctoral researcher and will become an assistant professor at Ohio State University in August.
Their paper “Pseudo-LiDAR from Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving”, presented at CVPR 2019, focuses on improving 3D object detection for autonomous driving. During the interview, they shared important contributions and key applications from the paper, as well as the challenges they are currently facing.
/ Full Interview Transcripts /
Part 1: Interview with Divyansh Garg, Yan Wang & Wei-Lun Chao
Wenli: Thank you so much for joining Robin.ly here. Can you introduce yourself to our audience?
Divyansh Garg: Hi, I’m Divyansh. I'm a fourth year undergrad at Cornell. I'm currently interning at Google AI.
Yan Wang: I'm a second-year PhD student at Cornell University. I work with Kilian Weinberger on the self-driving car project.
Wei-Lun (Harry) Chao: So I'm Harry. I'm currently a postdoc at Cornell. I will become an assistant professor at Ohio State University in August.
Wenli: Nice. Tell us more about the paper “3D Object Detection for Autonomous Driving”?
Divyansh Garg: In our paper, we do high accuracy, 3D object detection using image data. So what most companies do currently is they rely on a LiDAR sensor, which gives you highly precise 3D points. But if you talk about image-only 3D detection, that has a very low accuracy, that's only like 10% compared to the 90% for LiDAR.
So our approach is about how you can improve on this. We talk about the data representation and how it matters. So what we do is we take images from two cameras, the left and right. We calculate the distance between corresponding pixels and estimate the dense disparity map. We want to convert this to a 3D representation, which is called the point cloud. And we can use this as a pseudo-LiDAR. That's where the name comes from, similar to LiDAR, but it's only a complement [from images]. And this can be used to train a convolutional neural network which can do 3D object detection, similar to how it's done in the case of LiDAR. And we get a very high improvement from 20% to 70%, just using this change in representation.
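[Editor's note: the conversion from a disparity map to a point cloud described above can be sketched with the standard pinhole stereo relations, depth = focal length × baseline / disparity. This is a minimal illustration, not the authors' implementation. The function name, the toy disparity values, and the camera parameters (`focal_px`, `baseline_m`, `cx`, `cy`) are all assumptions made for the example.]

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def disparity_to_point_cloud(
    disparity: Sequence[Sequence[float]],
    focal_px: float,
    baseline_m: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Back-project every pixel with a valid disparity into 3D camera coordinates."""
    points: List[Point] = []
    for v, row in enumerate(disparity):        # v: pixel row
        for u, d in enumerate(row):            # u: pixel column
            if d <= 0:                         # no stereo match, so no depth
                continue
            z = focal_px * baseline_m / d      # depth along the optical axis
            x = (u - cx) * z / focal_px        # horizontal offset from the axis
            y = (v - cy) * z / focal_px        # vertical offset from the axis
            points.append((x, y, z))
    return points


# Hypothetical 2x2 disparity map (in pixels); 0.0 marks a pixel with no match.
toy_disparity = [[0.0, 10.0], [20.0, 40.0]]
cloud = disparity_to_point_cloud(
    toy_disparity, focal_px=700.0, baseline_m=0.5, cx=0.5, cy=0.5
)
for point in cloud:
    print(point)
```

[The resulting list of (x, y, z) points plays the role of a LiDAR scan, so in principle it can be passed to a detector that expects point clouds.]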
Wenli: That’s so impressive. Anything you want to add on to the paper?
Yan Wang: For this one, we achieved a very high accuracy. There are still some things we can improve. And we also have follow-up work to continue improving it.
Wenli: Yeah, that's exciting, right? Working on things that you're interested in, there're more problems that need to be solved. So my question is why 3D LiDAR and camera? Why did this field trigger your interest?
Yan Wang: The reason we work in this direction is that 3D object detection is a really important problem in the self-driving car system. You need to detect the object first, and then you can do the tracking and the planning, so it’s a fundamental thing. And right now, people only focus on LiDAR-based object detection. But there are two problems. One is it’s not very robust. If you only rely on one sensor, maybe it’s not okay, you may need an alternative. The second reason is LiDAR is very expensive. One 64-line LiDAR costs over $40,000. So we want to select a cheaper one. That's the camera. Actually the camera can be used in this direction. We have looked at some depth estimation by camera, and it is really good. But if we look at the accuracy for the 3D object detection, it is really bad. So we think there is still a lot of research we can do in this direction.
Wenli: How quickly do you think this can be applied to the business?
Divyansh Garg: So we have some companies already contacting us about our research and how they can apply it to their cars. So we had a startup in the UK who contacted us and said they are interested in working on camera-only self-driving cars, and asked if they can use our research on that. So there are companies already trying to do this. This is a very interesting future thing. It could happen in a year.
Wei-Lun Chao: I think it also depends on different companies’ standards. Some companies may need a very high frame rate. So I think computation is one thing, accuracy is the other thing. So it depends on different companies, what standard they want.
Wenli: We're just talking about the problems you'll be facing, there are a lot of things that need to be solved. But there is also a bright future in the areas that you're studying right now. So what are some future plans? What does the future look like?
Wei-Lun Chao: I think in the future, the first thing will be how to make our methods achieve the level of LiDAR. Because there's still a gap. I think it's very important to try to close the gap even more. And the other thing is how to speed it up, because now we use convolutional networks to estimate the depth, compared with LiDAR, which directly gets the depth. So how to improve the speed will also be another thing. And then finally, I think overall for autonomous driving, a very important thing is how to generalize your system to different environments. In academia, we usually focus on datasets, but you will really want to generalize all the techniques to different environments. And to do that, how to fuse different sensors, LiDAR, radar, cameras, that will be the future work that we want to achieve.
Wenli: Also for the paper that you just submitted. What are some of the biggest innovative points that you made?
Wei-Lun Chao: I think the innovative point for us is, nowadays, a lot of papers are talking about new deep learning architectures. But what we found out in this problem is, the gap between LiDAR and image-based detection is not because your data is not good or you need a new architecture. It's just because of how you represent your data. In our work, we proposed a general framework, you can combine any new depth estimation network with any good detector. So we found a bridge in how to represent your data. So I think that's the most important innovative point, how to represent your image depth as a point cloud and how you can apply this to a 3D object detector. And I think that's the most important point.
Part 2: Interview with Profs. Bharath Hariharan and Kilian Q. Weinberger
Wenli: Thank you so much for helping your students on the paper that you advised. Here are a couple questions I have for you. What's the current smart sensing technology that a lot of autonomous driving companies are using? What’s your opinion on that?
Kilian Q. Weinberger: Well, they use many different sensors. Our focus in particular is on the question whether you should use LiDAR or not. And LiDAR is actually a great sensor. It's an active sensor, works in the dark, and essentially sends out a pulse of laser and then waits until it comes back and can measure the distance because of it. But the downside of LiDAR is that it's very expensive. So essentially the limitation of LiDAR is that it increases the cost of the car.
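[Editor's note: the time-of-flight principle described here, where the sensor sends a laser pulse and waits for it to return, reduces to distance = speed of light × round-trip time / 2. The short sketch below illustrates only that relation. The function name and the 200 ns example value are assumptions, and real LiDAR units involve far more signal processing.]

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def time_of_flight_distance(round_trip_s: float) -> float:
    """Distance to the reflecting surface; the pulse travels out and back."""
    return SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2.0


# Hypothetical pulse that returns after 200 nanoseconds.
distance_m = time_of_flight_distance(200e-9)
print(f"{distance_m:.2f} m")  # prints "29.98 m"
```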
Wenli: But the camera is not accurate.
Kilian Q. Weinberger: Camera is a passive sensor, that's inherently different because you are relying on light that comes from some other sources. But what we show in our paper is that actually you can get surprisingly accurate results even with passive sensors. And that is in some sense the surprising outcome of our line of work. And the key is that essentially we showed in the paper that the reason the results were previously not as accurate with stereo cameras was not, as people had believed, that stereo is just less accurate. It is because of the internal representation that people used, the way that people processed the camera images. And that was just inherently different. We tried to get these two processing techniques, processing LiDAR and processing stereo images, as close together as possible, because we wanted to find out what's the difference, where really is the breaking point? That's when we realized that it actually comes down to the way the data is represented internally. And that's what our paper is about.
Wenli: Do you have an example in the real world scenario that the outcomes are pretty similar, very close to each other?
Kilian Q. Weinberger: Between LiDAR and stereo? LiDAR has an advantage. When a car is far away, essentially what you're doing in stereo depth estimation is you're measuring how far that car is displaced between the right and left images. And if you are off by just a little bit, essentially by one pixel of resolution, then if the car is really far away, that could mean an offset of a whole car lane. So that's essentially where LiDAR has an advantage.
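[Editor's note: the effect of a one-pixel disparity error at different distances can be checked with the stereo relation depth = focal length × baseline / disparity. The sketch below is a back-of-the-envelope illustration. The focal length (700 px) and baseline (0.5 m) are assumed values chosen for the example, not figures from the interview or the paper.]

```python
def depth_error_for_pixel_shift(
    true_depth_m: float,
    focal_px: float,
    baseline_m: float,
    shift_px: float = 1.0,
) -> float:
    """Depth error caused by underestimating the disparity by `shift_px` pixels."""
    true_disparity = focal_px * baseline_m / true_depth_m
    shifted_disparity = true_disparity - shift_px
    if shifted_disparity <= 0:
        raise ValueError("shift is as large as the disparity itself")
    estimated_depth = focal_px * baseline_m / shifted_disparity
    return estimated_depth - true_depth_m


# Hypothetical stereo rig: 700 px focal length, 0.5 m baseline.
for depth_m in (10.0, 35.0, 70.0):
    error_m = depth_error_for_pixel_shift(depth_m, focal_px=700.0, baseline_m=0.5)
    print(f"true depth {depth_m:5.1f} m -> error {error_m:5.2f} m")
```

[Under these assumed parameters, an object at 70 m has a disparity of only 5 pixels, so a one-pixel error moves the estimate to 87.5 m. The same one-pixel error at 10 m shifts the estimate by well under half a meter, which is why the problem mostly affects distant objects.]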
But for close-by objects, that actually goes away and you can do really, really well with stereo too. And a new paper our students just submitted shows that a very, very cheap LiDAR is enough, it essentially just needs a few points on that car. Stereo actually estimates the car but the depth may be a little off, so it may be just too close or too far away, but it gets the car. It's just that the whole thing has a little bit of an offset. If you have just a few LiDAR points, and you know this point is actually here, then you can just move the whole car, and then you can correct it.
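[Editor's note: the idea of moving the whole car using a few LiDAR points can be illustrated with a deliberately simplified sketch: take the median depth difference between a handful of LiDAR measurements and the stereo estimates at the same locations, then shift every stereo point by that amount. This is an assumption-laden illustration of the intuition only, not the method of the submitted paper. The function name, the pairing of measurements, and all numbers are hypothetical.]

```python
from statistics import median
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def shift_object_to_lidar(
    stereo_points: Sequence[Point],
    matched_depths: Sequence[Tuple[float, float]],
) -> List[Point]:
    """Shift all stereo points along z by the median (LiDAR - stereo) depth gap.

    `matched_depths` holds (stereo_depth, lidar_depth) pairs measured at the
    same few locations on the object.
    """
    offset = median([lidar_z - stereo_z for stereo_z, lidar_z in matched_depths])
    return [(x, y, z + offset) for x, y, z in stereo_points]


# Hypothetical car estimated about 1.2 m too close by stereo.
stereo_car = [(-1.0, 0.0, 20.0), (0.0, 0.0, 20.5), (1.0, 0.0, 21.0)]
sparse_matches = [(20.0, 21.2), (20.5, 21.7), (21.0, 22.1)]
corrected_car = shift_object_to_lidar(stereo_car, sparse_matches)
for point in corrected_car:
    print(point)
```

[Using the median rather than the mean is a design choice for the sketch: with only a few points, it keeps a single bad match from dragging the whole object.]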
Wenli: Do you agree with the industry trend that people are combining the camera and LiDAR together?
Kilian Q. Weinberger: People have already looked into that before. But one thing we showed is essentially how to make camera data much more powerful by changing the way it's processed. It's actually a small change, that’s the interesting thing. But that's also what makes it so robust, it's a simple thing, and it works for many different approaches. And if you do this, and then you add a little bit of LiDAR, you actually can get very accurate results.
Wenli: So without a little bit of LiDAR, we couldn't reach L4, right?
Kilian Q. Weinberger: I'm not sure if I agree. You could use high resolution cameras, you could also use some active sensors, etc. I think there are multiple ways. There're still a lot of avenues that are not explored with stereo. Bharath, you agree with her?
Bharath Hariharan: I agree with Kilian actually. I think the step we took was a very simple thing. And it produced a massive improvement. And we've not even tapped into the massive amounts of research that's going on in camera-based 3D reconstruction, there're a lot of tools that have not been exploited yet. And especially in domains like self-driving cars, where it's fairly constrained, like you are on a road, you're not in space, and you have lots of things you can ground yourself on. I personally think that there is a lot you can do with cameras, with images, that has not been explored. And I would suggest that what’s missing is that people have not yet looked at image information carefully enough, even simple things have yet to be incorporated in the right manner.
Wenli: That's so exciting. There're still a lot of areas that need to be explored in your field. So thank you so much, that’s all.
Kilian Q. Weinberger: Thank you.