Fast and Scalable Outlier Detection with Sorted Hypercubes

CIKM 2020

Abstract: Outlier detection is the task responsible for finding novel or rare phenomena that provide valuable insights in many areas of the industry. The neighborhood-based algorithms are largely used to tackle this problem due to the intuitive interpretation and wide applicability in different domains. Their major drawback is the intensive neighborhood search that takes hours or even days to complete in large data, thus being impractical in many real-world scenarios. This paper proposes HySortOD – a novel algorithm that uses an efficient hypercube-ordering-and-searching strategy for fast outlier detection. Its main focus is the analysis of data with many instances and a low-to-moderate number of dimensions. We performed comprehensive experiments using real data with up to ∼500k instances and ∼120 dimensions, where our new algorithm outperformed 7 state-of-the-art competitors in runtime, being up to 4 orders of magnitude faster in large data. Specifically, 12 well-known benchmark datasets were deeply investigated and one case study in the crucial task of breast cancer detection was also performed to demonstrate that our approach can be successfully used as an outof-the-box solution for real-world, non-benchmark problems. Based on our experiments, we also identified default parameter values that allow us to be parameter-free and yet report high-quality results.