Aug 13, 20206 views
Together with the curse of dimensionality, nonlinear dependencies,in large data sets persist as major challenges in data mining tasks.,A reliable way to accurately preserve nonlinear structure is to,compute geodesic distances between data points. Manifold learning,methods, such as Isomap, aim to preserve geodesic distances in a,Riemannian manifold. However, as manifold learning algorithms,operate on the ambient dimensionality of the data, the essential step,of geodesic distance computation is sensitive to high-dimensional,noise. Therefore, a direct application of these algorithms to highdimensional, noisy data often yields unsatisfactory results and does,not accurately capture nonlinear structure.,We propose an unsupervised random forest approach called geodesic forests (GF) to geodesic distance estimation in linear and nonlinear manifolds with noise. GF operates on low-dimensional sparse,linear combinations of features, rather than the full observed dimensionality. To choose the optimal split in a computationally efficient,fashion, we developed Fast-BIC, a fast Bayesian Information Criterion statistic for Gaussian mixture models. We additionally propose,geodesic precision and geodesic recall as novel evaluation metrics,that quantify how well the geodesic distances of a latent manifold,are preserved. Empirical results on simulated and real data demonstrate that GF is robust to high-dimensional noise, whereas other,methods, such as Isomap, UMAP, and FLANN, quickly deteriorate,in such settings. Notably, GF is able to estimate geodesic distances,better than other approaches on a real connectome dataset.