[KDD 2020] Catalysis Clustering With GAN By Incorporating Domain Knowledge
Clustering is an important unsupervised learning method with serious challenges when data is sparse and high-dimensional. Generated clusters are often evaluated with general measures, which may not be meaningful or useful for practical applications and domains. Using a distance metric, a clustering algorithm searches through the data space, groups close items into one cluster, and assigns faraway samples to different clusters. In many real-world applications, the number of dimensions is high and the data space becomes very sparse. Selection of a suitable distance metric is very difficult and becomes even harder when categorical data is involved. Moreover, existing distance metrics are mostly generic, and clusters created based on them will not necessarily make sense to domain-specific applications. One option to address these challenges is to integrate domain-defined rules and guidelines into the clustering process. In this work we propose a GAN-based approach called Catalysis Clustering to incorporate domain knowledge into the clustering process. With GANs we generate catalysts, which are special synthetic points drawn from the original data distribution and verified to improve clustering quality when measured by a domain-specific metric. We then perform clustering analysis using both catalysts and real data. Final clusters are produced after catalyst points are removed. Experiments on two challenging real-world datasets clearly show that our approach is effective and can generate clusters that are meaningful and useful for real-world applications.
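
The generate, verify, cluster, and remove steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a simple jitter-based sampler stands in for a trained GAN generator, a plain k-means stands in for the base clustering algorithm, the greedy accept rule is one possible way to "verify" a catalyst, and the names `kmeans`, `tag_purity`, `catalysis_clustering`, and `demo` are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]
Labels = List[int]


def kmeans(points: Sequence[Point], k: int, iters: int = 20, seed: int = 0) -> Labels:
    """Plain Lloyd's k-means (stand-in base clusterer); returns one label per point."""
    rng = random.Random(seed)
    centers = rng.sample(list(points), k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def tag_purity(tags: Sequence[str], labels: Labels) -> float:
    """Hypothetical domain metric: share of points carrying their cluster's majority tag."""
    total = 0
    for c in set(labels):
        cluster_tags = [t for t, lab in zip(tags, labels) if lab == c]
        total += max(cluster_tags.count(t) for t in set(cluster_tags))
    return total / len(labels)


def catalysis_clustering(
    real: Sequence[Point],
    k: int,
    sample_candidate: Callable[[], Point],
    domain_score: Callable[[Sequence[Point], Labels], float],
    n_trials: int = 50,
    seed: int = 0,
) -> Tuple[Labels, List[Point], float]:
    """Greedily keep synthetic points that raise the domain score on the real data."""
    catalysts: List[Point] = []
    best_labels = kmeans(real, k, seed=seed)
    best = domain_score(real, best_labels)
    for _ in range(n_trials):
        trial = catalysts + [sample_candidate()]
        # Cluster real data together with the candidate catalysts, then
        # drop the catalyst labels so only real points are scored and returned.
        labels = kmeans(list(real) + trial, k, seed=seed)[: len(real)]
        score = domain_score(real, labels)
        if score > best:  # "verified": the candidate improved the domain metric
            catalysts, best, best_labels = trial, score, labels
    return best_labels, catalysts, best


def demo(seed: int = 0):
    """Toy run on two overlapping 2-D Gaussian blobs with made-up domain tags."""
    rng = random.Random(seed)
    real = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)]
    real += [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(30)]
    tags = ["a"] * 30 + ["b"] * 30

    def score(points: Sequence[Point], labels: Labels) -> float:
        return tag_purity(tags, labels)

    def sample_candidate() -> Point:
        # Assumption: jittering a real point stands in for sampling a GAN generator.
        base = rng.choice(real)
        return tuple(x + rng.gauss(0, 0.3) for x in base)

    baseline = score(real, kmeans(real, 2, seed=seed))
    labels, catalysts, best = catalysis_clustering(real, 2, sample_candidate, score, seed=seed)
    return baseline, best, labels, catalysts
```

Calling `demo()` returns the baseline domain score, the score after catalysts are added, the final labels for the real points only, and the accepted catalysts. Because a candidate is kept only when it raises the score, the result is never worse than the baseline under this metric; whether it improves at all depends on the data and the sampler.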