autonlab.org

Very Fast EM-based Mixture Model Clustering Using Multiresolution KD-trees (1999)

Andrew Moore

Tags

Bayesian Networks, Clustering, Kd-trees and Ball-trees, Mixture Models, Statistical Data Mining for Astrophysics

Abstract

Clustering is important in many fields including manufacturing, biology, finance, and astronomy. Mixture models are a popular approach due to their statistical foundations, and EM is a very popular method for finding mixture models. EM, however, requires many accesses of the data, and thus has been dismissed as impractical (e.g. (Zhang, Ramakrishnan, & Livny, 1996)) for data mining of enormous datasets. We present a new algorithm, based on the multiresolution kd-trees of (Moore, Schneider, & Deng, 1997), which dramatically reduces the cost of EM-based clustering, with savings rising linearly with the number of datapoints. Although presented here for maximum likelihood estimation of Gaussian mixture models, it is also applicable to nonGaussian models (provided class densities are monotonic in Mahalanobis distance), mixed categorical/numeric clusters, and Bayesian methods such as Autoclass (Cheeseman & Oldford, 1994).

Full text

Download (application/pdf, 297.6 kB)

Approximate BibTeX Entry

@inproceedings{moore-veryfast,
    Month = {April},
    Year = {1999},
    Pages = {543--549},
    Publisher = {Morgan Kaufman},
    Address = {340 Pine Street, 6th Fl., San Francisco, CA 94104},
    Booktitle = {Advances in Neural Information Processing Systems},
    Editor = {M. Kearns and D. Cohn},
    Author = {Andrew Moore},
    Title = {Very Fast EM-based Mixture Model Clustering Using Multiresolution KD-trees}
}

Copyright 2010, Carnegie Mellon University, Auton Lab. All Rights Reserved.