Dip-based Deep Embedded Clustering with k-Estimation

Lena Bauer, Collin Leiber, Benjamin Schelling, Christian Boehm, Claudia Plant

Publications: Contribution to book > Contribution to proceedings > Peer Reviewed

Abstract

The combination of clustering with Deep Learning has gained much attention in recent years. Unsupervised neural networks like autoencoders can autonomously learn the essential structures in a data set. This idea can be combined with clustering objectives to learn relevant features automatically. Unfortunately, such approaches are often based on a k-means framework, from which they inherit various assumptions, such as spherical clusters. Another assumption, also found in approaches outside the k-means family, is that the number of clusters is known a priori. In this paper, we present the novel clustering algorithm DipDECK, which estimates the number of clusters while simultaneously improving a Deep Learning-based clustering objective. Additionally, it can cluster complex data sets without assuming that all clusters are spherical. Our algorithm works by heavily overestimating the number of clusters in the embedded space of an autoencoder and then using Hartigan's Dip-test, a statistical test for unimodality, to analyse the resulting micro-clusters and determine which to merge. We show in extensive experiments the various benefits of our method: (1) we achieve competitive results while learning the clustering-friendly representation and number of clusters simultaneously; (2) our method is robust regarding parameters, stable in performance, and allows for more flexibility in the cluster shape; (3) we outperform relevant competitors in the estimation of the number of clusters.
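
The overestimate-then-merge idea from the abstract can be illustrated with a small sketch. This is not the DipDECK implementation: there is no autoencoder, and `gap_score` is a crude placeholder used where a real unimodality test such as Hartigan's Dip-test would go. The function names, the threshold of 0.3, and the choice to project each pair of clusters onto the line between their centroids are all assumptions made for this illustration.

```python
from itertools import combinations
from statistics import mean


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return [mean(column) for column in zip(*points)]


def project(points, a, b):
    """Sorted scalar positions of the points along the axis from centroid a to b."""
    axis = [bj - aj for aj, bj in zip(a, b)]
    return sorted(
        sum((pj - aj) * xj for pj, aj, xj in zip(p, a, axis)) for p in points
    )


def gap_score(values):
    """Placeholder score, NOT Hartigan's Dip-test: the largest gap between
    neighbouring sorted values, relative to the total range. Small values
    suggest that the projected points form a single mode."""
    span = values[-1] - values[0]
    if span == 0:
        return 0.0
    return max(hi - lo for lo, hi in zip(values, values[1:])) / span


def merge_micro_clusters(points, labels, score=gap_score, threshold=0.3):
    """Repeatedly merge the pair of clusters whose joint 1-D projection looks
    most unimodal, until no pair scores below `threshold` (hypothetical value)."""
    labels = list(labels)
    while True:
        groups = {}
        for point, label in zip(points, labels):
            groups.setdefault(label, []).append(point)
        best = None
        for a, b in combinations(sorted(groups), 2):
            values = project(
                groups[a] + groups[b], centroid(groups[a]), centroid(groups[b])
            )
            s = score(values)
            if s < threshold and (best is None or s < best[0]):
                best = (s, a, b)
        if best is None:
            return labels
        _, keep, drop = best
        labels = [keep if label == drop else label for label in labels]


# Toy data: two well-separated groups, each deliberately split into two
# micro-clusters (an overestimate of k = 4 for data with 2 clusters).
points = [(i / 10, 0.0) for i in range(8)] + [(10 + i / 10, 0.0) for i in range(8)]
micro_labels = [0] * 4 + [1] * 4 + [2] * 4 + [3] * 4
merged = merge_micro_clusters(points, micro_labels)
print(len(set(merged)), "clusters remain")
```

Passing a different `score` callable is the intended extension point: a function that returns a small value for unimodal samples could stand in for the gap heuristic without changing the merge loop.
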
Original language: English
Title of host publication: KDD 2021 - Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
Subtitle of host publication: August 14 - 18, 2021
Place of Publication: New York, NY
Publisher: Association for Computing Machinery (ACM)
Pages: 903-913
Number of pages: 11
ISBN (Electronic): 978-1-4503-8332-5
Publication status: Published - 14 Aug 2021
Event: 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (KDD), Singapore
Duration: 14 Aug 2021 to 18 Aug 2021

Conference

Conference: 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (KDD)
Country/Territory: Singapore
Period: 14/08/21 to 18/08/21

Austrian Fields of Science 2012

  • 102033 Data mining

Keywords

  • Deep Clustering
  • Dip-test
  • Estimating the number of clusters
