[NeurIPS 2019 Paper Highlight] Anish Agarwal @ MIT: On Robustness of Principal Component Regression

NeurIPS 2019

This episode is a live recording of an interview with Anish Agarwal,3rd Ph.D. student in EECS at MIT. He shared highlights from his paper On Robustness of Principal Component Regression, which was accepted for oral presentation at NeurIPS 2019 conference. This paper rigorously establishes that Principal Component Regression (PCR) is robust to noisy, sparse, and possibly mixed valued covariates, and works much better than linear regression, which is the most common method in machine learning or statistics. The 2-step architecture implied from this paper, i.e., first de-noises the data with low dimensional structure, and second applies a more sophisticated prediction model, also sheds light on the architecture design for deep learning research. Anish Agarwal received his BS and MS degree in Computer Science from Caltech. Before joining MIT, he was a management consultant at Boston Consulting Group (BCG) and was the head of Machine Learning in a Bay Area startup, Ikasi Inc. His research interests are in high-dimensional statistics and the design of data marketplaces. Paper At A Glance: Consider the setting of Linear Regression where the observed response variables, in expectation, are linear functions of the p-dimensional covariates. Then to achieve vanishing prediction error, the number of required samples scales faster than pσ2, where σ2 is a bound on the noise variance. In a high-dimensional setting where p is large but the covariates admit a low-dimensional representation (say r ≪ p), then Principal Component Regression (PCR), is an effective approach; here, the response variables are regressed with respect to the principal components of the covariates. The resulting number of required samples to achieve vanishing prediction error now scales faster than rσ2(≪ pσ2). Despite the tremendous utility of PCR, its ability to handle settings with noisy, missing, and mixed (discrete and continuous) valued covariates is not understood and remains an important open challenge. As the main contribution of this work, we address this challenge by rigorously establishing that PCR is robust to noisy, sparse, and possibly mixed valued covariates. Specifically, under PCR, vanishing prediction error is achieved with the number of samples scaling as r max(σ2, ρ−4 log5(p)), where ρ denotes the fraction of observed (noisy) covariates. We establish generalization error bounds on the performance of PCR, which provides a systematic approach in selecting the correct number of components r in a data-driven manner. The key to our result is a simple, but powerful equivalence between (i) PCR and (ii) Linear Regression with covariate pre-processing via Hard Singular Value Thresholding (HSVT). From a technical standpoint, this work advances the state-of-the-art analysis for HSVT by establishing stronger guarantees with respect to the ∥·∥2,∞-error for the estimated matrix rather than the Frobenius norm/mean-squared error (MSE) as is commonly done in the matrix estimation / completion literature. Presentation Slides: https://nips.cc/media/Slides/nips/2019/westexhibitionhallc+b3(12-15-50)-12-16-25-15894-on_robustness_o.pdf Poster: https://www.dropbox.com/s/5b4kw37z0nu5vwp/neurips_poster.pdf?dl=0 / Full Interview Transcripts / Robin.ly Host - Wenli Zhou We're here at NeurIPS 2019 with Anish Agarwal. He is a Ph.D. candidate from MIT. And we're here to discuss his newly submitted paper "On robustness of principal component regression". Thank you so much for being here. Anish Agarwal: My pleasure. Robin.ly Host - Wenli Zhou Yes. Can you tell our community a little bit about the paper? And what is the greatest contribution of the of your topic? Anish Agarwal: Sure. So there's this algorithm that's been used, I think, that was invented maybe 100 years ago, called principal component regression, or PCR. And people use it in the industry all the time. There isn't a really good theoretical understanding of when and why you should use it. And what we essentially did is a rigorous theoretical analysis of this very ubiquitous method. And what we found is that PCR has many surprising properties, that has implications for how we deal with modern datasets tend to be huge; you have millions of measurements and maybe even tens to hundreds of millions of features for every person or measurement. And these measurements are normally very noisy and have a lot of missing data. Also increasingly, such datasets contain very sensitive data, such as highly private information about a user. And turns out that if you have privacy issues, you have problems with noise, you have problems with missing data. This one algorithm that was invented hundreds of years ago that everyone uses actually works as is and solves those problems really, really well in practice. And so what we showed is that it actually does and then we gave many examples in which it has, it works much, much better than something like linear regression, which is still the most common method in machine learning or statistics that are still used. It will show its effectiveness of the problems like we macroeconomists are trying to forecast trajectories of GDPs. In biology, we have genetic information, athlete datasets, what have you, it turns out that PCR is this one method that gets you really far because of this implicit de-noising that that first step of PCR does before you do linear regression. Robin.ly Host - Wenli Zhou Okay, do we have any limitations in this area? Anish Agarwal: So principal component regression as it stands works really well when your data has something called low rank, which means it has a particular kind of structure in the matrix. And that is a linear structure. And when you want to prove things, you normally have to assume linearity because that's when you can be very precise with the statements you make. In reality, a lot of the data you have actually turns out to be linear most of the time. But for very highly unstructured data, like your video data, speech data, the linearity no longer works. And you need to come up with more complicated architectures to make sure you get good prediction values. So the lesson we hope to impart from this particular algorithm is that before you run your regression method, you should first find some way of finding some low dimensional representation, maybe nonlinear below dimensional representations of your high and nonlinear low dimensional representation because it turns out has many interesting properties that are good for the purposes of prediction, such as preserving privacy of the People's data, de-noise the data when you have a lot of noisy measurements, which is true of most datasets, now, we have a lot of missing data, or you just want to make sure that your predictions are going to be good because you're now searching over a smaller space of functions rather than the original space. Robin.ly Host - Wenli Zhou Yeah, Anish Agarwal: So that movement from linear to nonlinear, PCR won't work as well. But hopefully that architectural - first de-noising and then regressing - that kind of architecture can be translated in more complicated see a neural network architecture. Robin.ly Host - Wenli Zhou I think you just answered the question that I wanted to ask next. So you were talking about that the linear the method doesn't work really well in terms of speech and video. Sure, yeah. So yeah. So what will be the real-life applications for your methods? Anish Agarwal: You know, it may not have a direct application for speech. But I think the way I think about what's the point of theory in machine learning and statistics, a big part of the theory is to gain some rules of thumbs gain some ways of thinking about new architectures. And so I think if I designed a method that works well there, we still care about things like privacy, all your data is extremely noisy, you should design an architecture such that the first step de-noises the data, find low dimensional structure, and then you run it through some sort of neural network that's meant to learn a good prediction model. So what I'm proposing is trying to break up your problem into two steps, which is first a low dimensional part, which instead of doing PCA, you do something like learn a variational autoencoder, or learn a generative adversarial network. And then on top of that, you can run some sort of convolution neural network to actually do the prediction part. And so I think the point of the theory is to give you lessons for how to apply or for how to design new architectures, but I think it's it in practice, we still use linear methods 80 90% of the time, right? We think machine learning is just deep learning but 90% of the problems we face in the industry, it's still be solved really, really well with just linear methods. And so they should not be discounted away and they should be tried first to give you a good baseline to see at least you should do at least as well as that. Robin.ly Host - Wenli Zhou Okay. Thank you so much for coming here to share with us your thoughts, Anish. Anish Agarwal: Thank you.