# Variational EM for Big Fast Clustering

## Contact

#### Head of lab

Prof. Dr. Jörg Lücke

+49 441 798 5486

+49 441 798-3902

W30 2-201

#### Secretary

Nicole Kulbach (temporary)

+49 441 798-3326

+49 441 798-3902

W30 3-316

Prof. Dr. Jörg Lücke
Arbeitsgruppe Machine Learning
Exzellenzcluster Hearing4all und
Department für Medizinische Physik und Akustik
Fakultät für Medizin und Gesundheitswissenschaften
Carl von Ossietzky Universität Oldenburg
D-26111 Oldenburg

Room 201 (2nd floor)
Building W30 (NeSSy)
Küpkersweg 74
26129 Oldenburg

# Variational EM for Big Fast Clustering

Illustration of truncated distributions for a GMM in D = 2 dimensions. The figure considers a data point $$\vec{y}^{(n)}$$ which lies (for illustrative purposes) to some extend in between some clusters. For C = 8 cluster, the full posterior (aka. the responsibilities)  $$r_c^{(n)} = p(c | \vec{y}^{(n)},Θ)$$ are shown in the top-left. Below, a truncated approximation $$q_c^{(n)}$$ with |K| = C' = 3 is shown for the same data point. The truncated approximation maintains the C' highest posterior values, sets all others to zero, and renormalizes the distribution to sum to one. The three closest clusters, which correspond to the three highest posterior values, are connected with black lines in the main figure.

## Description

In this line of research, we investigate novel variational approaches to decisively improve learning algorithms for clustering. We are interested in the theoretical as well as practical aspects of such approaches.

In practice, variational approaches allow for the derivation of very efficient clustering algorithms for large-scale problems, i.e., when the number of clusters and the number of data points is large, and when the dimensionality of data points is high. Variational approaches can, in this respect, provide a very favorable scaling behavior with the number of clusters. Such a scaling can then be combined with approaches developed by other groups that allow for favorable scaling with the number of data points (e.g. coreset approaches) or the data dimensionality (e.g., by exploiting triangle inequalities).

On the theoretical side, the variational approaches we investigate allow the definition of learning algorithms that interpolate between hard EM clustering approaches (such as k-means) and soft-EM clustering as, e.g., provided by EM for Gaussian mixture models (GMMs). The illustration above shows, for instance, a k-means-like algorithm based on three instead of one 'winning' cluster (this is an instance of the k-means-C' algorithm [Lücke & Forster, 2019] with C' = 3). In the framework of variational EM, k-means is itself obtained as a limit case of the k-means-C' algorithm (namely when C' = 1). As a consequence, k-means (Lloyd's algorithm) can be obtained as an approximation of EM for GMMs without the requirement to take a limit to small or zero cluster variances (which is the usual textbook approach). This implies further, that k-means optimizes a variational lower bound (i.e. a free energy a.k.a. ELBO) of a GMM loglikelihood. The paper [Lücke & Forster, 2019] provides a number of quantitative results on the relation of k-means algorithms to GMM data models including a formula to directly translate the quantization error to a variational lower bound. [Lücke & Forster, 2019] is itself based on more general theoretical results about the family of truncated posteriors as variational distributions [Lücke, 2019].

Based on variational EM interpretations of k-means-like algorithms, accelerations of clustering algorithms can be obtained by defining partial instead of full variational E-steps. Variational lower bounds can then be increased without having to compute the distances of a given data point to all cluster centers. In the paper [Forster & Lücke, 2018] it was shown that such partial E-steps can be defined independently of the number of clusters. As a consequence, the number of required distance evaluations could (for clustering problems with many clusters) be drastically reduced. That the number of strongly reduced distance evaluations translates into algorithms that are in practice also strongly improving the state-of-the-art efficiency was then shown by the paper [Hirschberger et al., 2019]. The paper showed for large-scale problems that (A) the reduction of distance evaluations clearly outweighs to memory overhead and auxiliary computations required to implement the variational EM procedure, and (B) that the variational complexity reduction can be combined with complexity reduction approaches for large data sets (we used coresets) and complexity reduction for large data dimensionality (we used the triangle inequality).

A list of our so-far published clustering algorithms together with their implementations is provided below. Implementations based on python are primarily of interest for the theoretical studies [Lücke & Forster, 2019; Forster & Lücke, 2018] . The python implementations can readily be used in relation to theoretical studies, e.g., on convergence properties or for comparisons with k-means-like algorithms. They are, however, not optimized for execution times or large data sets. The implementations based on C++ were optimized both in terms of memory I/O efficiency and run-time efficiency. Furthermore, their more hardware-near C++ code is supporting fast computation time. The algorithms vc-GMM and vc-k-means are only available as C++ implementations and, most importantly, contain combinations with fast-seeding, fast coreset construction, and efficient computations for high data dimensionalities. These algorithms stand to compete as the fastest clustering algorithms for very large-scale clustering problems.

## Algorithms

 Algorithm Implementation Reference k-means-C' Python Code Repository (not optimized*) Lücke & Forster, 2019 var-k-means-X Python Code Repository (not optimized*) Forster & Lücke, 2018 var-k-means-S Python Code Repository (not optimized*) C++ Code Repository (optimized) Forster & Lücke, 2018 Hirschberger et al., 2019 var-GMM-X Python Code Repository (not optimized*) Forster & Lücke, 2018 var-GMM-S Python Code Repository (not optimized*) C++ Code Repository (optimized) Forster & Lücke, 2018 Hirschberger et al., 2019 vc-GMM C++ Code Repository (optimized) Hirschberger et al., 2019 vc-k-means C++ Code Repository (optimized) Hirschberger et al., 2019

*the Python code is not optimized for real-time execution speed, but for analysis and research only.

## References (own publications)

[Hirschberger et al., 2019]
Large Scale Clustering with Variational EM for Gaussian Mixture Models
Florian Hirschberger, Dennis Forster, and Jörg Lücke, arXiv:1810.00803, 2019.

[Lücke & Forster, 2019]
k-means as a variational EM approximation of Gaussian mixture models
Jörg Lücke and Dennis Forster, Pattern Recognition Letters, 2019.

[Lücke, 2019]
Truncated Variational Expectation Maximization
Jörg Lücke, arXiv:1610.03113, 2019.

[Forster & Lücke, 2018]
Can clustering scale sublinearly with its clusters? A variational EM acceleration of GMMs and k-means
Dennis Forster and Jörg Lücke, AISTATS, 2018.

(Changed: 2021-07-28)