This episode is a live recording of our interview with Yong Jae Lee at the CVPR 2019 conference. Lee is the Demo Chair at CVPR this year and Assistant Professor of Computer Science at UC Davis focusing on computer vision and machine learning. Prior to working at UC Davis, he was a postdoctoral fellow at Carnegie Mellon University and then UC Berkeley. He received his PhD at the University of Texas at Austin in 2012.
His newly submitted paper "FineGAN: Unsupervised Hierarchical Disentanglement for Fine-Grained Object Generation and Discovery" presented at CVPR 2019 aims to create an unsupervised model of fine-grained details of objects. During the interview, he shared important contributions and key applications from the paper.
/ Full Interview Transcripts /
Wenli: We have Yong Jae here with us to talk about his newly submitted paper, “FineGAN: Unsupervised Hierarchical Disentanglement for Fine-Grained Object Generation and Discovery”. Can you briefly tell us about the paper?
Yong Jae Lee: Sure. What we're trying to do in that work is to model fine-grained details of objects in an unsupervised manner. If you think about something like birds, within birds, there are species. For example, there are different types of cuckoo birds. One could be a yellow-billed cuckoo, and another could be a black-billed cuckoo. So what we're trying to do is model, in an unsupervised way, these characteristics of birds like common shape: one could be a duck shape, one could be a cuckoo shape, one could be a seagull shape. In addition to shape, we want to also model the fine-grained color and texture details that characterize different species across groups. So far we're able to do this on birds, on cars, and also on dogs.
Wenli: Interesting. So what's the most innovative point of your methodology?
Yong Jae Lee: It's an unsupervised model, and we're building off a lot of the great work that's already out there. But what makes our work unique is that we have a generative model, which is modeling the hierarchical structure of objects. We're doing this in a way where we hierarchically disentangle the different factors of variation. So as I pointed out earlier, for example, with birds, you can first group birds based on a common shape, like it could be a duck shape, and then further fill in or characterize appearance details like color and texture conditioned on a common shape. So our model is able to generate an image in a hierarchical way where it can actually control these different factors. So an example would be, with our model, we can generate a duck in white color in a water background. But by changing one factor of variation, we can now just change only the color of that duck. So let's say we could make it a black duck now, but retain the background and shape details exactly the way it was with the white duck.
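The hierarchical disentanglement Lee describes can be illustrated with a toy sketch. This is purely hypothetical code, not the actual FineGAN architecture: the factor names, stages, and vocabulary are invented to show how changing only the child (color) code leaves the background and shape codes untouched.

```python
# Toy illustration (NOT the real FineGAN) of hierarchical, disentangled
# latent codes: background, parent (shape), and child (color/texture).
from dataclasses import dataclass

@dataclass(frozen=True)
class LatentCode:
    background: int  # e.g. 0 = water, 1 = grass
    shape: int       # parent code: coarse shape group (duck-like, gull-like)
    color: int       # child code: fine appearance, conditioned on the shape

BACKGROUNDS = ["water", "grass"]
SHAPES = ["duck-shaped", "gull-shaped"]
COLORS = ["white", "black"]

def generate(z: LatentCode) -> str:
    """Stand-in for the hierarchical generator: each stage adds one factor."""
    bg = BACKGROUNDS[z.background]   # stage 1: background
    shape = SHAPES[z.shape]          # stage 2: parent / shape
    color = COLORS[z.color]          # stage 3: child / color and texture
    return f"a {color} {shape} bird on {bg}"

# Changing only the child (color) code keeps shape and background fixed,
# mirroring the white-duck -> black-duck example in the interview:
white_duck = generate(LatentCode(background=0, shape=0, color=0))
black_duck = generate(LatentCode(background=0, shape=0, color=1))
```

The point of the sketch is only the interface: each factor of variation is a separate code, so one factor can be resampled independently of the others.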
Wenli: That's interesting. Among all the areas that you’re studying, what triggered you to start in the fine-grained field?
Yong Jae Lee: I've always been interested in unsupervised learning, and how we got interested in this particular topic was, we noticed that there's been a lot of work on unsupervised modeling of basic level categories. By basic level, I mean being able to differentiate, for example, cars vs. dogs vs. humans. But there hasn't been much work on modeling more fine-grained details of objects, or finding categories of objects. So it's a novel problem domain that we wanted to tackle. We wanted to have an unsupervised model that can learn a representation that is useful for unsupervised grouping of fine-grained object categories.
Wenli: What are some of the business applications that you can think of for the fine-grained field?
Yong Jae Lee: One of the applications that we're actually working on, as an extension to the existing work, is a conditional variant of our model. Right now it's completely unsupervised: we provide random input codes, and the model generates an image from them. But we're working on an extension where we can condition the model on real images instead.
So imagine that you have three image examples, and you want your model to create a new image which captures specific properties from each image. Let's say you have image A, B, and C, you want to take the background of image A, the shape of the bird in image B, and the color and texture of the bird in image C to create a new image. This kind of model could be potentially useful for e-commerce applications or something like applications that designers could use for clothing and so on. I see several possible real world applications.
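The three-image mixing idea above can be sketched in a few lines. Again this is a hypothetical simplification: the "factors" here are plain dictionary fields standing in for the background, shape, and appearance representations the conditional model would extract from real images.

```python
# Hypothetical sketch of the conditional variant: compose a new "image"
# by taking one factor from each of three example images (A, B, C).
def mix(background_src: dict, shape_src: dict, appearance_src: dict) -> dict:
    """Take the background from one source, the shape from another,
    and the appearance (color/texture) from a third."""
    return {
        "background": background_src["background"],
        "shape": shape_src["shape"],
        "appearance": appearance_src["appearance"],
    }

img_a = {"background": "lake", "shape": "gull", "appearance": "grey"}
img_b = {"background": "field", "shape": "duck", "appearance": "brown"}
img_c = {"background": "sky", "shape": "sparrow", "appearance": "black"}

# Background of A + shape of B + appearance of C, as in the interview:
new_image = mix(img_a, img_b, img_c)
```

In the real model the factors would be learned latent representations rather than labels, but the composition pattern is the same.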
Wenli: How did you develop your interest in robust visual recognition systems? Where do you see the trends?
Yong Jae Lee: I first got interested in computer vision when I was an undergraduate student. I studied at the University of Illinois at Urbana-Champaign, and the first project that I worked on was face detection: given an image, being able to detect, with bounding boxes, all the faces present in the image. That got me very intrigued by this area. For my PhD, I was fortunate to be able to work with Kristen Grauman at the University of Texas at Austin, and afterwards to postdoc under Alyosha Efros. I was really fortunate to work with great and super nice researchers and to explore this large and very interesting area of computer vision more deeply.
In terms of the current trends, as a community and field we've made a tremendous amount of progress, especially in recent years, on various problems like image classification, object detection, instance segmentation, and action recognition. But the state-of-the-art methods all rely on lots of human-annotated training data, and this reliance on labeled data has become a bottleneck for the whole field. I think the current trend is that we're trying to move towards systems that can learn with minimal human supervision and less reliance on fixed datasets, learning in real environments the way humans and animals do. That's where I see our field moving.
Wenli: So you see that unsupervised learning will be more applicable to the industry one day.
Yong Jae Lee: We're definitely not there yet. But I think that's the way to go.
Wenli: Rather than labeling data for tens of thousands of categories.
Yong Jae Lee: Yeah. When there are specific applications, for example, if I want to be able to detect a particular bottle like that one, it makes sense to label images, because you have something very specific. But if you want to have a system that can interact with other agents, with humans, with animals, in unstructured, novel environments, then labeling can't cover everything. It's got to have the capability to adapt to unfamiliar things. So I believe that can only happen when these models learn in a more unsupervised setting.
Wenli: Besides this paper, you have many other papers also accepted at CVPR this year. One of them is “You Reap What You Sow”. What is that about?
Yong Jae Lee: This is one paper that's moving towards this minimal-human-supervision paradigm. We're trying to train object detectors without any bounding box annotations. Typically, when you want to train an object detector, you have to collect training images with bounding box annotations that tightly fit the object of interest. In this work, we learn only from image-level tag annotations. So rather than saying, this image contains a car, and here's the box that tightly encloses it, we only train with images that say, this image contains a car, and the system needs to figure out automatically where it is.
We're not the first to work in this space; there's been lots of work. The common underlying theme that runs through all the existing work in weakly supervised object detection is that, in the initial step, they need to generate candidate object regions. Because the system doesn't know where the object is, it's got to propose a lot of candidate object regions. Then, from there, it finds the common patterns that appear across images labeled with the same tag and, at the same time, don't appear in images of other categories.
So this is a very, very challenging problem. It's akin to finding a needle in a haystack. What we proposed was: rather than generate a bunch of candidate object regions using only static appearance cues, why not leverage motion information from video? We can get that information for free, without any human supervision, because things that move together usually constitute the same object. So in this way, by using motion, we can automatically generate good localizations and use them to initialize our weakly supervised object detector.
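The motion-cue intuition, that pixels which move together likely belong to one object, can be sketched with a toy example. This is a deliberate simplification (the paper works with real optical flow on video, not a hand-built grid): we threshold a grid of motion magnitudes and return the bounding box of the coherently moving region as a candidate object proposal.

```python
# Toy sketch of motion-based object proposals: a region of coherent motion
# in a frame yields one candidate bounding box, with no human supervision.
# (Hypothetical simplification; real systems estimate optical flow first.)
def motion_proposal(flow_magnitude, threshold=1.0):
    """Return the bounding box (r0, c0, r1, c1) enclosing all pixels whose
    motion magnitude exceeds `threshold`, or None if nothing moves."""
    moving = [(r, c)
              for r, row in enumerate(flow_magnitude)
              for c, mag in enumerate(row) if mag > threshold]
    if not moving:
        return None
    rows = [r for r, _ in moving]
    cols = [c for _, c in moving]
    return (min(rows), min(cols), max(rows), max(cols))

# A 4x5 frame where a small 2x2 region (the "object") has large motion:
flow = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 3.0, 0.0, 0.0],
    [0.0, 2.5, 3.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
box = motion_proposal(flow)
```

Such boxes would then replace the appearance-based proposals in the initial step of a weakly supervised detection pipeline.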
Wenli: That's very interesting. I haven’t heard anyone talking about that.
Yong Jae Lee: The idea of using videos for unsupervised learning is again, also not new, but for weakly supervised object detection, replacing the initial standard step of object proposals using appearance cues with motion-based ones is where the main novelty lies in that work.
Wenli: That's so exciting that you're working in this area. There are so many problems to solve. It's very exciting, very interesting. Are there any other papers you'd like to share with us?
Yong Jae Lee: Yes. I think one other problem space that I recently got interested in is this idea of privacy preserving visual recognition. We had an ECCV 2018 paper on this topic, and what we wanted to do there was be able to create a system that can recognize the action or activity of a human in a video, but at the same time, preserve their privacy.
Wenli: Will they look like stick figures?
Yong Jae Lee: We still try to make them look realistic, but we make them look like a different person. So that when you look at the video as a human, you will not be able to tell who the original person is, and the idea is we want the system not only to be able to fool humans, but also be able to fool other machine learning classifiers.
Wenli: Why would you want to do that?
Yong Jae Lee: To preserve the privacy of the human in the video.
Wenli: We know that one of the applications is to preserve privacy while still monitoring the elderly, detecting what they're doing and whether they've fallen down in the room, but those systems use stick figures.
Yong Jae Lee: I'm not familiar with that one, but I think there are related works like it. That's the motivation: you want to still be able to extract useful information.
Wenli: What are the scenarios where you'd want a person to not look like themselves but like another human being?
Yong Jae Lee: Just like what you were saying, let’s say you have a home monitor, you have some cameras in your home because you're worried about your elderly parents or grandparents or a young child. So you want the system to be able to tell you if your young child is doing something potentially dangerous. So it should be able to detect the activity of the child. But at the same time, you might be worried that a hacker might hack into your camera and steal the video. That's why we want to anonymize.
Wenli: I heard that you are also the Demo Chair of the conference. What is that about?
Yong Jae Lee: The demos are where researchers can present their work to a live audience in real time. For example, if you have something like a real-time human pose detector, that's a wonderful project to demo live: you have a camera setup, people can come by, and you can show on a screen that your system is detecting their pose in real time. Those are the kinds of things being presented in the demo session.
Wenli: What’s the result so far?
Yong Jae Lee: The demos haven't started yet. They'll start tomorrow, but I can tell you with respect to prior conferences that had demos: when I was a student, there were very, very few demos, and now there are lots and lots. I think that shows how much the field has progressed. We now really have systems that work in real time and accurately, where a live audience can appreciate what's happening. That was quite rare when I was a student. I think it's awesome to see.
Wenli: Yes, this field is growing fast. It’s receiving more attention than ever. Thank you so much for sharing this.
Yong Jae Lee: Thank you.