X-means: Extending K-means with Efficient Estimation of the Number of Clusters (2000)
Tags
Astrostatistics, Cached Sufficient Statistics, Clustering, Kd-trees and Ball-trees, Mixture Models, Statistical Data Mining for Astrophysics
Abstract
A K-means tutorial.
Despite its popularity for general clustering, k-means suffers three major
shortcomings; it scales poorly computationally, the number of clusters K has to
be supplied by the user, and the search is prone to local minima. We propose
solutions for the first two problems, and a partial remedy for the third.
Building on prior work for algorithmic acceleration that is not based on
approximation, we introduce a new algorithm that efficiently, searches the space
of cluster locations and number of clusters to optimize the Bayesian Information
Criterion (BIC) or the Akaike Information Criterion (AIC) measure. The
innovations include two new ways of exploiting cached sufficient statistics and
a new very efficient test that in one k-means sweep selects the most promising
subset of classes for refinement. This gives rise to a fast, statistically
founded algorithm that outputs both the number of classes and their parameters.
Experiments show this technique reveals the true number of classes in the
underlying distribution, and that it is much faster than repeatedly using
accelerated k-means for different values of K.
Full text
Download (application/pdf, 287.7 kB)
Approximate BibTeX Entry
@inproceedings{pelleg-xmeans,
Year = {2000},
Pages = {727-734},
Publisher = {Morgan Kaufmann},
Address = {San Francisco},
Booktitle = {Proceedings of the Seventeenth International Conference on Machine Learning},
Author = {
Dan Pelleg, Andrew Moore
},
Title = {X-means: Extending K-means with Efficient Estimation of the Number of Clusters}
}