autonlab.org

X-means: Extending K-means with Efficient Estimation of the Number of Clusters (2000)

Dan Pelleg, Andrew Moore

Tags

Astrostatistics, Cached Sufficient Statistics, Clustering, Kd-trees and Ball-trees, Mixture Models, Statistical Data Mining for Astrophysics

Abstract

A K-means tutorial.
Despite its popularity for general clustering, k-means suffers three major shortcomings; it scales poorly computationally, the number of clusters K has to be supplied by the user, and the search is prone to local minima. We propose solutions for the first two problems, and a partial remedy for the third. Building on prior work for algorithmic acceleration that is not based on approximation, we introduce a new algorithm that efficiently, searches the space of cluster locations and number of clusters to optimize the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) measure. The innovations include two new ways of exploiting cached sufficient statistics and a new very efficient test that in one k-means sweep selects the most promising subset of classes for refinement. This gives rise to a fast, statistically founded algorithm that outputs both the number of classes and their parameters. Experiments show this technique reveals the true number of classes in the underlying distribution, and that it is much faster than repeatedly using accelerated k-means for different values of K.

Full text

Download (application/pdf, 287.7 kB)

Approximate BibTeX Entry

@inproceedings{pelleg-xmeans,
    Year = {2000},
    Pages = {727-734},
    Publisher = {Morgan Kaufmann},
    Address = {San Francisco},
    Booktitle = {Proceedings of the Seventeenth International Conference on Machine Learning},
    Author = { Dan Pelleg, Andrew Moore },
    Title = {X-means: Extending K-means with Efficient Estimation of the Number of Clusters}
}

Copyright 2008, Carnegie Mellon University, Auton Lab. All Rights Reserved.