Massive Data Mining
The Auton Lab has over 10 years of experience with data mining on massive data streams. We have expertise both in established techniques and in developing new algorithms that provide robust, efficient solutions for massive data sets. Our work has addressed problems in a range of fields, including biosurveillance, large-scale astronomy, the intelligence community, robotics, the life sciences, and a variety of industrial applications. It has produced both a large number of successful software deployments and a range of downloadable general-purpose software.
Our work in massive-scale data mining allows users to tractably process large data sets, addressing such problems as:
- Discovering (previously unknown) structure or patterns in the data - What can we say about the underlying structure of the data? Our work on this problem focuses on learning underlying probabilistic models. In particular, we have significant experience in efficiently learning large Bayesian networks, which provide a powerful and readable description of the underlying model.
- Finding anomalous or interesting data points buried within the data - Given a large set of data points, can we identify any as anomalous? Our work on this problem has been used to find new, interesting objects in such data sets as the Sloan Digital Sky Survey.
- Accurately classifying new data points - Can we accurately classify a new observation given a historical set of data points? Our work on this problem has touched a variety of applications and includes developing new, more efficient methods for such techniques as nearest neighbor classification and logistic regression.
- Intelligently choosing the best action to perform - Given a noisy view of the current world state, how do we best choose the next action to perform? Our work on this problem includes both traditional questions in robotics and the question of active learning. Active learning asks which data point we should sample next so as to gain the most useful information, minimizing the number of potentially expensive experiments.
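The active-learning idea in the last bullet can be illustrated with a minimal uncertainty-sampling sketch: train a simple model on the labeled data so far, then query the unlabeled point the model is least sure about. This is an illustrative example only, not the lab's actual software; the logistic model fit by plain gradient descent and the function names are our own assumptions.

```python
import numpy as np

def train_logistic(X, y, lr=0.1, epochs=200):
    """Fit a logistic-regression model by plain gradient descent
    (a deliberately simple stand-in for a real learner)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w += lr * X.T @ (y - p) / len(y)
    return w

def most_uncertain(X_pool, w):
    """Uncertainty sampling: return the index of the pooled point whose
    predicted probability is closest to 0.5 (the decision boundary)."""
    p = 1.0 / (1.0 + np.exp(-X_pool @ w))
    return int(np.argmin(np.abs(p - 0.5)))

# Tiny 1-D example with a bias column: labels flip near x = 0.
X = np.array([[-2.0, 1], [-1.5, 1], [1.5, 1], [2.0, 1]])
y = np.array([0, 0, 1, 1])
w = train_logistic(X, y)

pool = np.array([[-3.0, 1], [0.1, 1], [3.0, 1]])
pick = most_uncertain(pool, w)   # the point near the boundary is chosen
```

Labeling the chosen point tells the learner far more about where the boundary lies than labeling either of the points it already classifies confidently.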
Our primary specialty is developing novel ways to exploit structure in both the data and the problem itself to make our approaches significantly faster. In particular, we have developed a range of efficient data structures and search algorithms that focus computation on the important aspects of the problem. This work enables experts in other fields to accurately and tractably mine massive data streams in their areas of interest.
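One flavor of this structure exploitation, used in work such as "The Anchors Hierarchy: Using the Triangle Inequality to Survive High-Dimensional Data," is to precompute distances from data points to a few anchor points and then use the triangle inequality to rule out candidates without ever computing their distance to the query. The sketch below shows the core pruning test with a single anchor; the function name and toy data are our own assumptions, not the published algorithm itself.

```python
import numpy as np

def nn_with_pruning(query, points, anchor_idx=0):
    """Exact nearest neighbor with triangle-inequality pruning: since
    |d(q,a) - d(a,x)| <= d(q,x), any point whose lower bound already
    meets the best distance so far cannot be a better neighbor."""
    anchor = points[anchor_idx]
    d_qa = np.linalg.norm(query - anchor)
    # Distances from every point to the anchor; in a real system these
    # are computed once and reused across many queries.
    d_ax = np.linalg.norm(points - anchor, axis=1)
    best_i, best_d = anchor_idx, d_qa
    pruned = 0
    for i, x in enumerate(points):
        if abs(d_qa - d_ax[i]) >= best_d:
            pruned += 1          # lower bound rules this point out
            continue
        d = np.linalg.norm(query - x)
        if d < best_d:
            best_i, best_d = i, d
    return best_i, pruned

points = np.array([[0.0, 0.0], [5.0, 5.0], [1.0, 1.0], [10.0, 0.0]])
query = np.array([1.2, 0.9])
best_i, pruned = nn_with_pruning(query, points)
```

The answer is identical to brute-force search, but most candidates are eliminated by a cheap subtraction instead of a full distance computation; with many anchors arranged hierarchically, the savings grow dramatically on large data sets.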
- Accelerating Exact k-means Algorithms with Geometric Reasoning
- Accelerating Exact k-means Algorithms with Geometric Reasoning (Extended version)
- A Comparison of Statistical and Machine Learning Algorithms on the Task of Link Completion
- Active Learning in Discrete Input Spaces
- AD-trees for Fast Counting and for Fast Learning of Association Rules
- A Dynamic Adaptation of AD-trees for Efficient Machine Learning on Large Data Sets
- A Fast Multi-Resolution Method for Detection of Significant Spatial Disease Clusters
- A Fast Multi-Resolution Method for Detection of Significant Spatial Overdensities
- A short tutorial note on computing information gain from counts
- A tutorial on using the Vizier memory-based learning system
- Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets
- Dependency Trees in Sub-linear Time and Bounded Memory
- Detecting Significant Multidimensional Spatial Clusters
- Efficient Algorithms for Non-Parametric Clustering with Clutter
- Efficient Exact k-NN and Nonparametric Classification in High Dimensions
- Efficient Locally Weighted Polynomial Regression Predictions
- Empirical Bayes Screening for Link Analysis
- Fast, Robust Adaptive Control by Learning only Forward Models
- Fast Inference and Learning in Large-State-Space HMMs
- Fast Nonlinear Regression via Eigenimages Applied to Galactic Morphology
- Fast Robust Logistic Regression for Large Sparse Datasets with Binary Outputs
- High-Dimensional Probabilistic Classification for Drug Discovery
- Hoeffding Races: Accelerating Model Selection Search for Classification and Function Approximation
- Interpolating Conditional Density Trees
- Logistic Regression for Data Mining and High-Dimensional Classification
- Making Logistic Regression A Core Data Mining Tool: A Practical Investigation of Accuracy, Speed, and Simplicity
- Making Logistic Regression A Core Data Mining Tool With TR-IRLS
- Multiresolution Instance-based Learning
- N-Body Problems in Statistical Learning
- Optimal Reinsertion: A new search operator for accelerated and more accurate Bayesian network structure learning
- Q2: Memory-based active learning for optimizing noisy continuous functions
- Rapid Detection of Significant Spatial Clusters
- Rapid Evaluation of Multiple Density Models
- Real-valued All-Dimensions search: Low-overhead rapid searching over subsets of attributes
- Repairing Faulty Mixture Models using Density Estimation
- Rule-based Anomaly Pattern Detection for Detecting Disease Outbreaks
- The Anchors Hierarchy: Using the Triangle Inequality to Survive High-Dimensional Data
- The IOC algorithm: Efficient Many-Class Non-parametric Classification for High-Dimensional Data
- The Racing Algorithm: Model Selection for Lazy Learners
- Tractable Group Detection on Large Link Data Sets
- Using Tarjan's Red Rule for Fast Dependency Tree Construction