Research Thrust
- Rapid Detection of Emerging Pattern
- Massive Data Mining
- Social Network Analysis/Link Analysis/Group Detection
- Life Science Data Mining
Rapid Detection of Emerging Pattern
Data mining algorithms at the Auton Lab have successfully detected newly emerging patterns in a variety of domains: health services, agriculture, manufacturing, and oil companies. Our algorithms are 10-1000 times faster than traditional techniques. The results demonstrate significantly higher detection power with much lower false positive rates. We have applied these algorithms in semi-automated and fully automated modes, in supervised and unsupervised settings, and for both retrospective and prospective surveillance. A few of our algorithms for rapid detection of emerging patterns are WSARE, Ultra Fast SSS, and TipMon.
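As a rough illustration of what prospective surveillance for an emerging pattern can look like, the sketch below flags a recent count that is improbably high relative to a historical baseline, using a Poisson tail probability. This is a generic textbook-style check, not WSARE, Ultra Fast SSS, or TipMon; the function names, the Poisson assumption, and the `alpha` threshold are all illustrative assumptions.

```python
import math


def poisson_tail(observed: int, expected: float) -> float:
    """Return P(X >= observed) for X ~ Poisson(expected)."""
    if observed <= 0:
        return 1.0
    # Compute 1 - P(X <= observed - 1) by accumulating the pmf term by term.
    term = math.exp(-expected)  # P(X = 0)
    cdf = term
    for k in range(1, observed):
        term *= expected / k    # P(X = k) from P(X = k - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def flag_emerging(baseline_counts, recent_count, alpha=0.01):
    """Flag recent_count if it is unusually high versus the baseline mean.

    Assumption: counts are roughly Poisson with a stable baseline rate.
    Returns (flagged, p_value).
    """
    expected = sum(baseline_counts) / len(baseline_counts)
    p_value = poisson_tail(recent_count, expected)
    return p_value < alpha, p_value


if __name__ == "__main__":
    history = [10, 12, 9, 11, 8]   # hypothetical daily case counts
    print(flag_emerging(history, 25))
    print(flag_emerging(history, 11))
```

A simple threshold like this ignores seasonality and multiple testing, which is part of why purpose-built detection algorithms are needed in practice.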
Massive Data Mining
The Auton Lab has over 10 years of experience with data mining on massive data streams. We have expertise both with established techniques and in the development of new algorithms that provide robust and efficient solutions for massive data sets. Our work has addressed problems in a range of fields, including bio-surveillance, large-scale astronomy, the intelligence community, robotics, life sciences, and a variety of industrial applications. This work includes both a large number of successful software deployments and a range of available general-purpose software.
Our work in massive-scale data mining allows users to tractably process large data sets, addressing such problems as:
- Discovering (previously unknown) structure or patterns in the data - What can we say about the underlying structure of the data? Our work on this problem focuses on learning underlying probabilistic models. In particular, we have significant experience in efficiently learning large Bayesian networks, which provide a powerful and readable description of the underlying model.
- Finding anomalous or interesting data points buried within the data - Given a large set of data points, can we identify any as anomalous? Our work on this problem has been used to find new, interesting objects in such data sets as the Sloan Digital Sky Survey.
- Accurately classifying new data points - Can we accurately classify a new observation given a historical set of data points? Our work on this problem has touched a variety of applications and includes developing new, more efficient methods for such techniques as nearest neighbor classification and logistic regression.
- Intelligently choosing the best action to perform - Given a noisy view of the current world state, how do we best choose the next action to perform? Our work on this problem includes both traditional questions in robotics and the question of active learning. Active learning asks which data point we should sample next so as to get the most useful information, allowing us to minimize the number of potentially expensive experiments.
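To make the anomaly-finding problem in the list above concrete, here is a minimal sketch that scores each point by the distance to its k-th nearest neighbor, so isolated points receive high scores. It is a brute-force illustration under assumed names (`knn_anomaly_scores`, `k`), not the lab's method; it compares every pair of points and so does not scale to massive data sets without an efficient spatial data structure.

```python
import math


def knn_anomaly_scores(points, k=2):
    """Score each point by the Euclidean distance to its k-th nearest neighbor.

    Larger scores mean the point sits farther from the rest of the data.
    Brute force: O(n^2) distance computations.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]  # hypothetical points
    scores = knn_anomaly_scores(data, k=2)
    most_anomalous = max(range(len(data)), key=scores.__getitem__)
    print(data[most_anomalous], scores[most_anomalous])
```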
Our primary specialty is in developing novel ways to exploit structure within both the data and the problem itself to make our approaches significantly faster. In particular, we have developed a range of efficient data structures and search algorithms that focus the computation on the important aspects of the problem. Thus our work enables experts in other fields to accurately and tractably mine massive data streams in their area of interest.
Social Network Analysis/Link Analysis/Group Detection
Social Network Analysis/Link Analysis/Group Detection seek to discover interesting relationships and patterns among people or other entities, for example:
- Who communicates with whom? And who appears to avoid communicating with whom?
- Are there cliques of people who mostly communicate among themselves and rarely with others, or is communication more evenly distributed?
- Are there "stars" who are linked with a very large number of people, and/or isolated people who are only linked with one or two others?
- Might there be aliases? That is, if we see two people with essentially the same link patterns, but who are never linked with each other, might they in fact be the same person?
- How do patterns of association among entities evolve over time?
- Can we identify groups of entities, based on link data and/or demographic properties?
- If we know that a communication took place, but we don't know the identity of one of the participants, can we infer who that entity was?
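The alias question above can be sketched in a few lines: look for pairs of entities whose sets of contacts overlap heavily but who are never linked to each other. The example below uses Jaccard similarity over neighbor sets as the measure of "essentially the same link patterns". This is an illustrative heuristic only, not the MNOP algorithm; the function name, the similarity measure, and the `min_similarity` threshold are assumptions.

```python
from itertools import combinations


def alias_candidates(links, min_similarity=0.8):
    """Return (u, v, similarity) for unlinked pairs with similar neighbor sets.

    links: iterable of undirected (a, b) pairs.
    Similarity is the Jaccard index of the two neighbor sets.
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    candidates = []
    for u, v in combinations(sorted(neighbors), 2):
        if v in neighbors[u]:
            continue  # directly linked, so treated as two distinct entities
        shared = neighbors[u] & neighbors[v]
        union = neighbors[u] | neighbors[v]
        similarity = len(shared) / len(union)
        if similarity >= min_similarity:
            candidates.append((u, v, similarity))
    # Most similar pairs first.
    return sorted(candidates, key=lambda c: -c[2])


if __name__ == "__main__":
    # Hypothetical link data: "ann" and "a.smith" share all contacts
    # but never communicate with each other.
    links = [
        ("ann", "x"), ("ann", "y"), ("ann", "z"),
        ("a.smith", "x"), ("a.smith", "y"), ("a.smith", "z"),
        ("bob", "x"), ("y", "carol"),
    ]
    print(alias_candidates(links))
```

Comparing all pairs is quadratic in the number of entities, so a real system working on large datasets would need to prune candidate pairs first.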
Auton Lab researchers have developed, and continue to develop, many algorithms and associated software packages for investigating these kinds of questions. As usual at the Auton Lab, these technologies place great emphasis on efficient analysis of large datasets. Representative software packages in this thrust include:
- AFDL - Activity From Demographics and Links
- Bayes Net Learner
- SBNS - Screen-based Bayes Net Structure search
- GDA/k-groups - Group Detection Algorithm
- MNOP - Many Names, One Person alias detection
- XGDA - A fast group detection algorithm
Life Science Data Mining
Life sciences is a collective term encompassing biochemistry, genetics, ecology, pharmacology, medicine, and many other sciences concerned with living organisms. The Auton Lab has diverse experience in data mining applications for these disciplines, from core areas like drug discovery and drug classification to big-picture problems in epidemiology and pathogen detection.
Medicinal drugs are typically created through a process similar to Edison's work on the light bulb: very smart scientists think very hard about the desired effect of a drug, then work very hard to limit their ideas to those few they can afford to test carefully. An alternative methodology is High Throughput Screening (HTS), where truly enormous libraries of drug candidates are tested for efficacy in robotic chemistry labs. Modern HTS labs might make only 1 mistake in 1,000 experiments, but this leads to hundreds of mistakes on a small HTS library -- roughly the same order of magnitude as the number of useful chemicals in the library.
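The arithmetic behind that claim is simple to check. The sketch below computes the expected number of erroneous results from the 1-in-1,000 error rate quoted above; the library size of 200,000 compounds is a hypothetical figure chosen only for illustration, and the function name is likewise an assumption.

```python
def expected_hts_mistakes(library_size, mistakes=1, per_experiments=1000):
    """Expected number of erroneous results when each compound is tested once."""
    return library_size * mistakes / per_experiments


if __name__ == "__main__":
    # Hypothetical library size; the 1-in-1,000 rate is the figure from the text.
    print(expected_hts_mistakes(200_000))  # 200.0
```

If only a few hundred compounds in such a library are genuinely useful, the mistakes are as numerous as the hits, which is why error detection matters.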
Detecting mistakes in HTS data can save hundreds or thousands of hours of expensive wet lab time, as well as recover wrongly disqualified candidates for further testing. This is a job for fast, robust, and correct statistics. When traditional statistical software packages failed to scale up to these demands, the Auton Lab developed new algorithms that met the challenge. Beyond the research, the Auton Lab delivered custom software libraries and user interfaces to our collaborators to help them make use of our algorithmic innovations.