Fair k-Center Clustering in MapReduce and Streaming Settings

Suman Bera, Syamantak Das, Sainyam Galhotra, Sagar Kale

Publications: Contribution to bookContribution to proceedingsPeer Reviewed

Abstract

Center-based clustering techniques are fundamental to many real-world applications such as data summarization and social network analysis. In this work, we study the problem of fairness aware k-center clustering over large datasets. We are given an input dataset comprising a set of n points, where each point belongs to a specific demographic group characterized by a protected attribute, such as race or gender. The goal is to identify k clusters such that all clusters have considerable representation from all groups and the maximum radius of these clusters is minimized. The majority of the prior techniques do not scale beyond 100K points for k = 50. To address the scalability challenges, we propose an efficient 2-round algorithm for the MapReduce setting that is guaranteed to be a 9-approximation to the optimal solution. Additionally, we develop a 2-pass streaming algorithm that is efficient and has a low memory footprint. These theoretical results are complemented with an empirical evaluation on million-scale datasets, demonstrating that our techniques are effective to identify high-quality fair clusters and efficient as compared to the state-of-the-art.

Original languageEnglish
Title of host publicationWWW 2022 - Proceedings of the ACM Web Conference 2022
EditorsFrédérique Laforest, Raphael Troncy, Elena Simperl
Place of PublicationNew York, NY
PublisherAssociation for Computing Machinery (ACM)
Pages1414-1422
Number of pages9
ISBN (Electronic)9781450390965
DOIs
Publication statusPublished - Apr 2022
EventThe Web Conf 2022 - Virtual Online, Lyon, France
Duration: 25 Apr 202229 Apr 2022

Conference

ConferenceThe Web Conf 2022
Country/TerritoryFrance
CityLyon
Period25/04/2229/04/22

Austrian Fields of Science 2012

  • 102031 Theoretical computer science

Keywords

  • disparate impact
  • fairness
  • k-center clustering

Cite this