autonlab.org

Research Thrust

Massive Data Mining

The Auton Lab has over 10 years of experience with data mining on massive data streams.  We have expertise both in applying established techniques and in developing new algorithms that provide robust and efficient solutions for massive data sets.  Our work has addressed problems in a range of fields, including biosurveillance, large-scale astronomy, the intelligence community, robotics, life sciences, and a variety of industrial applications.  This work includes both a large number of successful software deployments and a range of downloadable general-purpose software.

Our work in massive-scale data mining allows users to tractably process large data sets, addressing problems such as:

  • Discovering (previously unknown) structure or patterns in the data - What can we say about the underlying structure of the data?  Our work on this problem focuses on learning underlying probabilistic models.  In particular, we have significant experience in efficiently learning large Bayesian networks, which provide a powerful and readable description of the underlying model.
  • Finding anomalous or interesting data points buried within the data -  Given a large set of data points, can we identify any as anomalous?  Our work on this problem has been used to find new, interesting objects in such data sets as the Sloan Digital Sky Survey.
  • Accurately classifying new data points - Can we accurately classify a new observation given a historical set of data points?  Our work on this problem has touched a variety of applications and includes developing new, more efficient methods for such techniques as nearest neighbor classification and logistic regression.
  • Intelligently choosing the best action to perform - Given a noisy view of the current world state, how do we best choose the next action to perform?  Our work on this problem includes both traditional questions in robotics and the question of active learning.  Active learning asks which data point we should sample next so as to get the most useful information, allowing us to minimize the number of potentially expensive experiments.
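
As a rough illustration of the active learning question in the last item, the sketch below uses uncertainty sampling, one common textbook heuristic and not necessarily the approach taken in the lab's own work. It assumes a model has already assigned each untested candidate a predicted probability of a positive outcome; the candidate names and probabilities are hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) prediction; largest at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def pick_next_query(predicted_probs: dict[str, float]) -> str:
    """Return the candidate whose predicted outcome is most uncertain."""
    return max(predicted_probs, key=lambda name: binary_entropy(predicted_probs[name]))


# Hypothetical pool of experiments that have not been run yet, with the
# current model's predicted probability of a positive result for each.
pool = {"experiment_a": 0.97, "experiment_b": 0.55, "experiment_c": 0.10}
next_query = pick_next_query(pool)
print(next_query)
```

The model is already fairly sure about the first and third candidates, so running the second one is expected to be the most informative under this heuristic.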

Our primary specialty is in developing novel ways to exploit structure within both the data and the problem itself to make our approaches significantly faster.  In particular, we have developed a range of efficient data structures and search algorithms that focus the computation on the important aspects of the problem.  Thus our work enables experts in other fields to accurately and tractably mine massive data streams in their area of interest.
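
To make the idea of focusing computation concrete, here is a minimal sketch of exact nearest-neighbor search with a kd-tree, a standard spatial data structure chosen purely for illustration and not a reproduction of any specific Auton Lab algorithm. The function names, the random test data, and the query point are all assumptions. The search skips a whole subtree whenever the splitting plane is already farther from the query than the best match found so far.

```python
import math
import random
from dataclasses import dataclass
from typing import Optional, Sequence

Point = tuple[float, ...]


@dataclass
class Node:
    point: Point
    axis: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(points: Sequence[Point], depth: int = 0) -> Optional[Node]:
    """Build a kd-tree by splitting on the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(ordered[mid], axis,
                build(ordered[:mid], depth + 1),
                build(ordered[mid + 1:], depth + 1))


def nearest(root: Optional[Node], query: Point) -> tuple[Optional[Point], float, int]:
    """Return (nearest point, its distance, number of tree nodes examined)."""
    best_point: Optional[Point] = None
    best_dist = math.inf
    visited = 0

    def search(node: Optional[Node]) -> None:
        nonlocal best_point, best_dist, visited
        if node is None:
            return
        visited += 1
        d = math.dist(node.point, query)
        if d < best_dist:
            best_point, best_dist = node.point, d
        # Signed distance from the query to this node's splitting plane.
        gap = query[node.axis] - node.point[node.axis]
        near, far = (node.left, node.right) if gap < 0 else (node.right, node.left)
        search(near)
        # Every point on the far side is at least |gap| away, so the far
        # subtree can only matter if the plane is closer than the best match.
        if abs(gap) < best_dist:
            search(far)

    search(root)
    return best_point, best_dist, visited


random.seed(0)
points = [(random.random(), random.random()) for _ in range(1000)]
tree = build(points)
query = (0.5, 0.5)
found, dist, visited = nearest(tree, query)
print(f"nearest={found} distance={dist:.4f} examined {visited} of {len(points)} points")
```

The result is the same as a brute-force scan over all points; the saving comes only from the pruning test, which is the kind of structure exploitation described above.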

Papers
Name | Authors
Accelerating Exact k-means Algorithms with Geometric Reasoning | Dan Pelleg, Andrew Moore
Accelerating Exact k-means Algorithms with Geometric Reasoning (Extended version) | Dan Pelleg, Andrew Moore
A Comparison of Statistical and Machine Learning Algorithms on the Task of Link Completion | Anna Goldenberg, Jeremy Kubica, Paul Komarek, Andrew Moore, Jeff Schneider
Active Learning in Discrete Input Spaces | Jeff Schneider, Andrew Moore
AD-trees for Fast Counting and for Fast Learning of Association Rules | Brigham Anderson, Andrew Moore
A Dynamic Adaptation of AD-trees for Efficient Machine Learning on Large Data Sets | Paul Komarek, Andrew Moore
A Fast Multi-Resolution Method for Detection of Significant Spatial Disease Clusters | Daniel Neill, Andrew Moore
A Fast Multi-Resolution Method for Detection of Significant Spatial Overdensities | Daniel Neill, Andrew Moore
A short tutorial note on computing information gain from counts | Andrew Moore
A tutorial on using the Vizier memory-based learning system | Jeff Schneider, Mary Soon Lee, Andrew Moore
Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets | Andrew Moore, Mary Soon Lee
Dependency Trees in Sub-linear Time and Bounded Memory | Dan Pelleg, Andrew Moore
Detecting Significant Multidimensional Spatial Clusters | Daniel Neill, Andrew Moore
Efficient Algorithms for Non-Parametric Clustering with Clutter | Weng-Keen Wong, Andrew Moore
Efficient Exact k-NN and Nonparametric Classification in High Dimensions | Ting Liu, Andrew Moore, Alexander Gray
Efficient Locally Weighted Polynomial Regression Predictions | Andrew Moore, Jeff Schneider, Kan Deng
Empirical Bayes Screening for Link Analysis | Anna Goldenberg, Andrew Moore
Fast, Robust Adaptive Control by Learning only Forward Models | Andrew Moore
Fast Inference and Learning in Large-State-Space HMMs | Sajid Siddiqi, Andrew Moore
Fast Nonlinear Regression via Eigenimages Applied to Galactic Morphology | Brigham Anderson, Andrew Moore, Andrew Connolly, Robert Nichol
Fast Robust Logistic Regression for Large Sparse Datasets with Binary Outputs | Paul Komarek, Andrew Moore
High-Dimensional Probabilistic Classification for Drug Discovery | Alexander Gray, Paul Komarek, Ting Liu, Andrew Moore
Hoeffding Races: Accelerating Model Selection Search for Classification and Function Approximation | Oded Maron, Andrew Moore
Interpolating Conditional Density Trees | Scott Davies, Andrew Moore
Learning Filaments | Geoff Gordon, Andrew Moore
Logistic Regression for Data Mining and High-Dimensional Classification | Paul Komarek
Making Logistic Regression A Core Data Mining Tool: A Practical Investigation of Accuracy, Speed, and Simplicity | Paul Komarek, Andrew Moore
Making Logistic Regression A Core Data Mining Tool With TR-IRLS | Paul Komarek, Andrew Moore
Multiresolution Instance-based Learning | Kan Deng, Andrew Moore
N-Body Problems in Statistical Learning | Alexander Gray, Andrew Moore
Optimal Reinsertion: A new search operator for accelerated and more accurate Bayesian network structure learning | Andrew Moore, Weng-Keen Wong
Q2: Memory-based active learning for optimizing noisy continuous functions | Andrew Moore, Jeff Schneider, Justin Boyan, Mary Soon Lee
Rapid Detection of Significant Spatial Clusters | Daniel Neill, Andrew Moore
Rapid Evaluation of Multiple Density Models | Alexander Gray, Andrew Moore
Real-valued All-Dimensions search: Low-overhead rapid searching over subsets of attributes | Andrew Moore, Jeff Schneider
Repairing Faulty Mixture Models using Density Estimation | Peter Sand, Andrew Moore
Rule-based Anomaly Pattern Detection for Detecting Disease Outbreaks | Weng-Keen Wong, Andrew Moore, Gregory Cooper, Michael Wagner
The Anchors Hierarchy: Using the Triangle Inequality to Survive High-Dimensional Data | Andrew Moore
The IOC algorithm: Efficient Many-Class Non-parametric Classification for High-Dimensional Data | Ting Liu, Ke Yang, Andrew Moore
The Racing Algorithm: Model Selection for Lazy Learners | Oded Maron, Andrew Moore
Tractable Group Detection on Large Link Data Sets | Jeremy Kubica, Andrew Moore, Jeff Schneider
Using Tarjan's Red Rule for Fast Dependency Tree Construction | Dan Pelleg, Andrew Moore

Copyright 2008, Carnegie Mellon University, Auton Lab. All Rights Reserved.