[KDD 2020] Imputing Various Incomplete Attributes via Distance Likelihood Maximization

Dec 16, 2020
Missing values may appear in various attributes. By “various”, we mean (1) different types of values in a tuple, such as numerical or categorical, and (2) different attributes in a tuple, either the dependent or determinant attributes of regression models or dependency rules. Such varieties unfortunately prevent imputation from performing well. In this paper, we propose to study distance models that predict distances between tuples for missing data imputation. The immediate benefits are twofold: (1) uniformly processing and collaboratively utilizing the distances on all the attributes with various types of values, and (2) rather than enumerating the combinations of imputation candidates on various attributes, we can directly calculate the most likely distances of missing values to other complete ones and thus infer the corresponding imputations. Our major technical highlights include (1) introducing an imputation that is statistically explainable by the likelihood on distances, (2) proving NP-hardness of finding the maximum likelihood imputation, and (3) devising an approximation algorithm with performance guarantees. Experiments over datasets with real missing values demonstrate the superiority of the proposed method compared to 11 existing approaches in 5 categories. Our proposal improves not only the imputation accuracy but also downstream applications such as classification, clustering and record matching.
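
To make the idea of treating numerical and categorical attributes uniformly through distances more concrete, here is a minimal Python sketch. It is not the paper's algorithm: it swaps the distance likelihood maximization for a simple nearest-neighbor heuristic. The function names (`attribute_distance`, `tuple_distance`, `impute`), the per-attribute `scales`, and the toy data are all assumptions made for illustration only.

```python
def attribute_distance(a, b, scale=1.0):
    """Distance on one attribute: 0/1 mismatch for categories,
    scaled absolute difference for numbers (an assumed convention)."""
    if isinstance(a, str) or isinstance(b, str):
        return 0.0 if a == b else 1.0
    return abs(a - b) / scale


def tuple_distance(t, u, attrs, scales):
    """Sum of per-attribute distances over the given attribute indices."""
    return sum(attribute_distance(t[i], u[i], scales[i]) for i in attrs)


def impute(incomplete, complete, scales, k=3):
    """Fill each missing (None) value with the neighbor value that has the
    smallest total distance to the k nearest complete tuples.
    This is a simplified heuristic, not likelihood maximization."""
    observed = [i for i, v in enumerate(incomplete) if v is not None]
    missing = [i for i, v in enumerate(incomplete) if v is None]

    # Neighbors are ranked using only the attributes that are observed.
    neighbors = sorted(
        complete,
        key=lambda u: tuple_distance(incomplete, u, observed, scales),
    )[:k]

    filled = list(incomplete)
    for i in missing:
        candidates = [u[i] for u in neighbors]
        filled[i] = min(
            candidates,
            key=lambda c: sum(
                attribute_distance(c, u[i], scales[i]) for u in neighbors
            ),
        )
    return filled


# Hypothetical toy data: (age, occupation, weight)
complete = [
    (30.0, "engineer", 70.0),
    (32.0, "engineer", 72.0),
    (55.0, "teacher", 50.0),
    (58.0, "teacher", 48.0),
]
scales = [28.0, 1.0, 24.0]  # assumed value ranges for the numeric columns

result = impute((31.0, None, None), complete, scales, k=3)
print(result)
```

Because every attribute contributes through a distance, the same selection rule fills both the categorical and the numerical gap without type-specific models. The method in the paper instead learns distance models and maximizes a likelihood over the predicted distances.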