Accelerating Exact k-means Algorithms with Geometric Reasoning (1999)
Tags
Astrostatistics, Cached Sufficient Statistics, Clustering, Efficient Statistical Algorithms, Kd-trees and Ball-trees, Mixture Models, Statistical Data Mining for Astrophysics
Abstract
A K-means tutorial.
We present new algorithms for the k-means clustering problem. They use the
kd-tree data structure to reduce the large number of nearest-neighbor queries
issued by the traditional algorithm. Sufficient statistics are stored in the
nodes of the kd-tree. Then, an analysis of the geometry of the current cluster
centers results in great reduction of the work needed to update the centers. Our
algorithms behave exactly as the traditional k-means algorithm. Proofs of
correctness are included. The kd-tree can also be used to initialize the k-means
starting centers efficiently. Our algorithms can be easily extended to provide
fast ways of computing the error of a given cluster assignment, regardless of
the method in which those clusters were obtained. We also show how to use them
in a setting which allows approximate clustering results, with the benefit of
running faster. We have implemented and tested our algorithms on both real and
simulated data. Results show a speedup factor of up to 170 on real astrophysical
data, and superiority over the naive algorithm on simulated data in up to 5
dimensions. Our algorithms scale well with respect to the number of points and
number of centers, allowing for clustering with tens of thousands of centers.
Full text
Download (application/pdf, 437.9 kB)
Approximate BibTeX Entry
@inproceedings{pelleg-accelerating,
Month = {aug},
Year = {1999},
Pages = {277--281},
Publisher = {AAAI Press},
Booktitle = {Proceedings of the Fifth International Conference on Knowledge Discovery in Databases},
Editor = {Surajit Chaudhuri and David Madigan},
Author = {Dan Pelleg Andrew Moore},
Title = {Accelerating Exact k-means Algorithms with Geometric Reasoning}
}