papers

**Authors:** Daniel Neill, Andrew Moore, Gregory Cooper

This paper develops a new Bayesian method for cluster detection, the “Bayesian spatial scan statistic,” and compares this method to the standard (frequentist) scan statistic approach on the task of prospective disease surveillance.

**Authors:** Anna Goldenberg, Jeremy Kubica, Paul Komarek, Andrew Moore, Jeff Schneider

Link data, consisting of a collection of subsets of entities, can be an important source of information for a variety of fields including the social sciences, biology, criminology, and business intelligence. However, these links may be incomplete, containing one or more unknown members. We consider the problem of link completion: identifying which entities are the most likely missing members of a link given the previously observed links. We concentrate on the case of one missing entity. We compare a variety of recently developed algorithms, along with standard machine learning methods and strawman baselines adjusted to suit the task. The algorithms were tested extensively on simulated data and a range of real-world data sets.
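As a concrete illustration of the link-completion task, a minimal co-occurrence strawman can rank candidate missing members by how often they appeared alongside the link's known members. This is only a baseline sketch, not one of the paper's algorithms, and the entity names and scoring rule below are invented for the example:

```python
from collections import Counter
from itertools import combinations

def complete_link(observed_links, partial_link):
    """Rank candidates for the one missing member of partial_link by
    total pairwise co-occurrence with its known members."""
    cooc = Counter()
    for link in observed_links:
        for a, b in combinations(link, 2):
            cooc[(a, b)] += 1
            cooc[(b, a)] += 1
    # candidates: every previously seen entity not already in the partial link
    candidates = {e for link in observed_links for e in link} - set(partial_link)
    scores = {c: sum(cooc[(c, m)] for m in partial_link) for c in candidates}
    return max(scores, key=scores.get)

# toy history of links (sets of entities)
links = [{"alice", "bob", "carol"},
         {"alice", "bob", "carol"},
         {"alice", "dave"}]
best = complete_link(links, {"alice", "bob"})  # most likely missing member
```

Here "carol" co-occurred with both known members twice, while "dave" only ever appeared with "alice" once, so the strawman completes the link with "carol".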

**Authors:** Yi Zhang and Jeff Schneider

Given limited training samples, learning to classify multiple labels is challenging. Problem decomposition is widely used in this case: the original problem is decomposed into a set of easier-to-learn subproblems, and predictions from the subproblems are combined to make the final decision.

In this paper we show the connection between composite likelihoods and many multilabel decomposition methods, e.g., one-vs-all, one-vs-one, calibrated label ranking, and probabilistic classifier chains. This connection holds promise for improving problem decomposition in both the choice of subproblems and the combination of subproblem decisions.

As an attempt to exploit this connection, we design a composite marginal method that improves pairwise decomposition. Pairwise label comparisons, which seem to be a natural choice for subproblems, are replaced by bivariate label densities, which are more informative and natural components in a composite likelihood. For combining subproblem decisions, we propose a new mean-field approximation that minimizes a notion of composite divergence and is potentially more robust to inaccurate estimation in the subproblems.

Empirical studies on five data sets show that, given limited training samples, the proposed method outperforms many alternatives.

**Authors:** Sajid Siddiqi, Byron Boots, Geoff Gordon

Stability is a desirable characteristic for linear dynamical systems, but it is often ignored by algorithms that learn these systems from data. We propose a novel method for learning stable linear dynamical systems: we formulate an approximation of the problem as a convex program, start with a solution to a relaxed version of the program, and incrementally add constraints to improve stability. Rather than continuing to generate constraints until we reach a feasible solution, we test stability at each step; because the convex program is only an approximation of the desired problem, this early stopping rule can yield a higher-quality solution. We apply our algorithm to the task of learning dynamic textures from image sequences as well as to modeling biosurveillance drug-sales data. The constraint generation approach leads to noticeable improvement in the quality of simulated sequences. We compare our method to those of Lacy and Bernstein, with positive results in terms of accuracy, quality of simulated sequences, and efficiency.
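The stability test the abstract mentions can be made concrete: a linear system x_{t+1} = A x_t is stable when the spectral radius of A is at most one. The sketch below fits A by least squares from a state sequence and checks that condition; the crude rescaling fallback shown here is only an illustration, not the paper's constraint-generation method, and the example system is invented:

```python
import numpy as np

def estimate_dynamics(X):
    """Least-squares estimate of A in x_{t+1} ~ A x_t from a T x d state sequence."""
    X0, X1 = X[:-1], X[1:]
    M, *_ = np.linalg.lstsq(X0, X1, rcond=None)  # solves X0 @ M = X1
    return M.T

def spectral_radius(A):
    """Largest eigenvalue magnitude; <= 1 means the system is stable."""
    return max(abs(np.linalg.eigvals(A)))

def naive_stabilize(A):
    """Crude fallback: shrink A uniformly until its spectral radius is <= 1.
    (The paper instead adds convex constraints incrementally.)"""
    r = spectral_radius(A)
    return A / r if r > 1 else A

rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.2],
                   [0.0, 1.1]])          # unstable: eigenvalues 0.9 and 1.1
X = np.zeros((50, 2)); X[0] = [1.0, 1.0]
for t in range(49):
    X[t + 1] = A_true @ X[t] + 0.01 * rng.standard_normal(2)

A_hat = estimate_dynamics(X)             # least squares recovers an unstable A
A_stable = naive_stabilize(A_hat)        # projected back to the stable region
```

The point of the example is the diagnosis, not the fix: least squares happily returns an unstable `A_hat` here, which is exactly the failure mode the constraint-generation algorithm addresses.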

**Authors:** Paul Komarek, Andrew Moore

This talk was presented at ICML 2000. It describes a modification to AD-trees to allow incremental and lazy growth. We discuss our implementation of these Dynamic AD-trees and present results for datasets with scores of high-arity attributes and millions of rows.

These slides are best understood with the help of my notes from the presentation.

**Authors:** Daniel Neill, Andrew Moore

Given an N x N grid of squares, where each square has a count and an underlying population, our goal is to find the square region with the highest density, and to calculate its significance by randomization. Any density measure D, dependent on the total count and total population of a region, can be used. For example, if each count represents the number of disease cases occurring in that square, we can use Kulldorff's spatial scan statistic D_K to find the most significant spatial disease cluster. A naive approach to finding the maximum density region requires O(N^3) time, and is generally computationally infeasible. We present a novel algorithm which partitions the grid into overlapping regions, bounds the maximum score of subregions contained in each region, and prunes regions which cannot contain the maximum density region. For sufficiently dense regions, this method finds the maximum density region in optimal O(N^2) time, in practice resulting in significant (10-200x) speedups.

**Authors:** Daniel Neill, Andrew Moore

Given an NxN grid of squares, where each square s_ij has count c_ij and an underlying population p_ij, our goal is to find the square region S with the highest density, and to calculate the significance of this region by Monte Carlo testing. Any density measure D, which depends on the total count and total population of the region, can be used. For example, if each count c_ij represents the number of disease cases occurring in that square, we can use Kulldorff's spatial scan statistic D_K to find the most significant spatial disease cluster. A naive approach to finding the region of maximum density would be to calculate the density measure for every square region: this requires O(RN^3) calculations, where R is the number of Monte Carlo replications, and hence is generally computationally infeasible. We present a novel multi-resolution algorithm which partitions the grid into overlapping regions, bounds the maximum score of subregions contained in each region, and prunes regions which cannot contain the maximum density region. For sufficiently dense regions, this method finds the maximum density region in optimal O(RN^2) time, and in practice it results in significant (10-200x) speedups as compared to the naive approach.
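The naive baseline these two abstracts describe can be sketched directly: enumerate every square region and score it, using 2-D prefix sums so each region costs O(1), for O(N^3) total per replication. For illustration the density measure below is a plain count/population ratio rather than Kulldorff's D_K, and the grid is a toy example:

```python
import numpy as np

def prefix_sums(grid):
    """2-D inclusive prefix sums with a zero border, for O(1) rectangle sums."""
    P = np.zeros((grid.shape[0] + 1, grid.shape[1] + 1))
    P[1:, 1:] = grid.cumsum(0).cumsum(1)
    return P

def square_sum(P, i, j, k):
    """Sum over the k x k square with top-left corner (i, j)."""
    return P[i + k, j + k] - P[i, j + k] - P[i + k, j] + P[i, j]

def max_density_square(counts, pops):
    """Naive O(N^3) scan over all square regions, keeping the best ratio of
    total count to total population (D_K would replace this simple measure)."""
    N = counts.shape[0]
    C, Pop = prefix_sums(counts), prefix_sums(pops)
    best_d, best_region = -1.0, None
    for k in range(1, N + 1):                 # region size
        for i in range(N - k + 1):            # top-left row
            for j in range(N - k + 1):        # top-left column
                d = square_sum(C, i, j, k) / square_sum(Pop, i, j, k)
                if d > best_d:
                    best_d, best_region = d, (i, j, k)
    return best_d, best_region

counts = np.array([[1, 1, 1, 1],
                   [1, 9, 9, 1],
                   [1, 9, 9, 1],
                   [1, 1, 1, 1]], float)
pops = np.full((4, 4), 10.0)                  # uniform population of 10 per cell
density, region = max_density_square(counts, pops)
```

The paper's contribution is avoiding this full enumeration by bounding and pruning whole groups of regions at once.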

**Authors:** Purnamrita Sarkar, Sajid Siddiqi, Geoff Gordon

We consider dynamic co-occurrence data, such as author-word links in papers published in successive years of the same conference. For static co-occurrence data, researchers often seek an embedding of the entities (authors and words) into a low-dimensional Euclidean space. We generalize a recent static co-occurrence model, the CODE model of Globerson et al. (2004), to the dynamic setting: we seek coordinates for each entity at each time step. The coordinates can change with time to explain new observations, but since large changes are improbable, we can exploit data at previous and subsequent steps to find a better explanation for current observations. To make inference tractable, we show how to approximate our observation model with a Gaussian distribution, allowing the use of a Kalman filter for tractable inference. The result is the first algorithm for dynamic embedding of co-occurrence data which provides distributional information for its coordinate estimates. We demonstrate our model both on synthetic data and on author-word data from the NIPS corpus, showing that it produces intuitively reasonable embeddings. We also provide evidence for the usefulness of our model by its performance on an author-prediction task.

**Authors:** Michelle Ntampaka, Hy Trac, Dougal J. Sutherland, Nicholas Battaglia, Barnabás Póczos, and Jeff Schneider

We present a modern machine learning approach for cluster dynamical mass measurements that is a factor of two improvement over using a conventional scaling relation. Different methods are tested against a mock cluster catalog constructed using halos with mass >= 10^14 Msolar/h from Multidark's publicly-available N-body MDPL halo catalog. In the conventional method, we use a standard M(sigma_v) power law scaling relation to infer cluster mass, M, from line-of-sight (LOS) galaxy velocity dispersion, sigma_v. The resulting fractional mass error distribution is broad, with width=0.87 (68% scatter), and has extended high-error tails. The standard scaling relation can be simply enhanced by including higher-order moments of the LOS velocity distribution. Applying the kurtosis as a correction term to log(sigma_v) reduces the width of the error distribution to 0.74 (16% improvement). Machine learning can be used to take full advantage of all the information in the velocity distribution. We employ the Support Distribution Machines (SDMs) algorithm that learns from distributions of data to predict single values. SDMs trained and tested on the distribution of LOS velocities yield width=0.46 (47% improvement). Furthermore, the problematic tails of the mass error distribution are effectively eliminated. Decreasing cluster mass errors will improve measurements of the growth of structure and lead to tighter constraints on cosmological parameters.
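The conventional baseline in this abstract, an M(sigma_v) power law, is simple enough to sketch. The pivot mass, pivot dispersion, and slope below are placeholder values chosen for illustration, not the paper's fitted relation, and the mock velocities are synthetic:

```python
import numpy as np

def los_dispersion(velocities):
    """Sample standard deviation of line-of-sight galaxy velocities (km/s)."""
    return np.std(velocities, ddof=1)

def power_law_mass(sigma_v, M_pivot=1e15, sigma_pivot=1000.0, alpha=3.0):
    """Illustrative M(sigma_v) scaling relation, M proportional to sigma_v^alpha.
    Pivot values and slope are placeholders, not the paper's fit."""
    return M_pivot * (sigma_v / sigma_pivot) ** alpha

rng = np.random.default_rng(1)
vels = rng.normal(0.0, 800.0, size=200)   # mock LOS velocities, km/s
sigma = los_dispersion(vels)
mass = power_law_mass(sigma)              # inferred mass in Msolar/h
```

The paper's point is that this scalar summary discards most of the velocity distribution; SDMs consume the whole distribution instead of just sigma_v (or sigma_v plus kurtosis).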

**Authors:** Brigham Anderson, Andrew Moore, David Cohn

This paper describes Pairwise Bisection: a nonparametric approach to optimizing a noisy function with few function evaluations. The algorithm uses nonparametric reasoning about simple geometric relationships to find minima efficiently. Two factors often frustrate optimization: noise and cost. Output can contain significant quantities of noise or error, while time or money allow for only a handful of experiments. Pairwise Bisection is used here to attempt to automate the process of robust and efficient experiment design. Real-world functions also tend to violate traditional assumptions of continuity and Gaussian noise. Since nonparametric statistics do not depend on these assumptions, this algorithm can optimize a wide variety of phenomena with fewer restrictions placed on noise. The algorithm's performance is compared to that of three competing algorithms, Amoeba, PMAX, and Q2, on several different test functions. Results on these functions indicate competitive performance and superior resistance to noise.

**Authors:** Josep Roure, Artur Dubrawski, Jeff Schneider

This paper reviews the results of a study into combining evidence from multiple streams of surveillance data in order to improve timeliness and specificity of detection of bio-events. In the experiments we used three streams of real food- and agriculture-safety-related data that are routinely collected at slaughterhouses across the nation and carry mutually complementary information about potential outbreaks of bio-events. The results indicate that: (1) non-specific aggregation of p-values produced by event detectors set on individual streams of data can lead to superior detection power over that of the individual detectors, and (2) design of multi-stream detectors tailored to the particular characteristics of the events of interest can further improve timeliness and specificity of detection. In a practical setup, we recommend combining a set of specific multi-stream detectors, focused on individual types of predictable and definable scenarios of interest, with non-specific multi-stream detectors, to account for both anticipated and emerging types of bio-events.

**Authors:** Purnamrita Sarkar, Andrew W. Moore

Recently there has been much interest in graph-based learning, with applications in collaborative filtering for recommender networks, link prediction for social networks, and fraud detection. These networks can consist of millions of entities, and so it is very important to develop highly efficient techniques. We are especially interested in accelerating random walk approaches to compute some very interesting proximity measures on these kinds of graphs. These measures have been shown to do well empirically (Liben-Nowell & Kleinberg, 2003; Brand, 2005). We introduce a truncated variation on a well-known measure, namely commute times arising from random walks on graphs. We present a novel algorithm to compute all interesting pairs of approximate nearest neighbors in truncated commute time, without computing the measure between all pairs. We show results on both simulated and real graphs of size up to 100,000 entities, which indicate near-linear scaling in computation time.
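The truncated measure itself is easy to state: the truncated hitting time caps random walks at T steps, and the truncated commute time sums the two hitting-time directions. The sketch below computes it by direct dynamic programming on a toy graph; this brute-force computation is exactly what the paper's algorithm avoids doing for all pairs, and the recurrence shown is a common formulation rather than a quote of the paper:

```python
import numpy as np

def truncated_hitting_times(P, T):
    """h[i, j] = expected steps for a walk from i to reach j, with walks
    capped at T steps. DP recurrence: h^t(i,j) = 1 + sum_k P[i,k] h^{t-1}(k,j),
    h^t(i,i) = 0. Direct computation, not the paper's fast algorithm."""
    n = P.shape[0]
    h = np.zeros((n, n))
    for _ in range(T):
        h_new = 1.0 + P @ h
        np.fill_diagonal(h_new, 0.0)
        h = h_new
    return h

def truncated_commute_times(P, T):
    """Symmetric truncated commute time: c(i,j) = h(i,j) + h(j,i)."""
    h = truncated_hitting_times(P, T)
    return h + h.T

# tiny path graph 0 - 1 - 2, and its random-walk transition matrix
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], float)
P = A / A.sum(1, keepdims=True)
C = truncated_commute_times(P, T=10)
```

Adjacent nodes end up with a smaller truncated commute time than the endpoints of the path, which is the "nearest neighbor" ordering the paper's algorithm recovers without the all-pairs computation.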

**Authors:** Yifei Ma and Roman Garnett and Jeff Schneider

The selection of data collection locations is a problem that has received significant research attention from classical design of experiments to various recent active learning algorithms. Typical objectives are to map an unknown function, optimize it, or find level sets in it. Each of these objectives focuses on an assessment of individual points. The introduction of set kernels has led to algorithms that instead consider labels assigned to sets of data points. In this paper we combine these two concepts and consider the problem of choosing data collection locations when the goal is to identify regions whose set of collected data would be labeled positively by a set classifier. We present an algorithm for the case where the positive class is defined in terms of a region's average function value being above some threshold with high probability, a problem we call active area search. To this end, we model the latent function using a Gaussian process and use Bayesian quadrature to estimate its integral on predefined regions. Our method is the first which directly solves the active area search problem. In experiments it outperforms previous algorithms that were developed for other active search goals.
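The region-scoring step described above reduces to a simple Gaussian tail probability once the posterior over a region's average is available. The sketch below shows only that final step; in the paper the mean and standard deviation would come from a Gaussian process with Bayesian quadrature over the region, whereas here they are supplied by hand as illustrative numbers:

```python
from math import erf, sqrt

def prob_region_positive(mu, sigma, tau):
    """P(region average > tau) when the posterior over the region's average
    function value is N(mu, sigma^2): one minus the normal CDF at tau."""
    z = (tau - mu) / sigma
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

# hypothetical posteriors for two candidate regions, threshold tau = 1.0
p_hi = prob_region_positive(mu=1.2, sigma=0.3, tau=1.0)  # likely above threshold
p_lo = prob_region_positive(mu=0.5, sigma=0.3, tau=1.0)  # likely below threshold
```

Active area search would prefer collecting data that pushes some region's probability past a confidence level, so regions like the first are close to being confidently labeled positive while the second is not.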

**Authors:** Dougal J. Sutherland, Barnabás Póczos, and Jeff Schneider

Collaborative prediction is a powerful technique, useful in domains from recommender systems to guiding the scientific discovery process. Low-rank matrix factorization is one of the most powerful tools for collaborative prediction. This work presents a general approach for active collaborative prediction with the Probabilistic Matrix Factorization model. Using variational approximations or Markov chain Monte Carlo sampling to estimate the posterior distribution over models, we can choose query points to maximize our understanding of the model, to best predict unknown elements of the data matrix, or to find as many “positive” data points as possible. We evaluate our methods on simulated data, and also show their applicability to movie ratings prediction and the discovery of drug-target interactions.

**Authors:** Dan Pelleg, Andrew Moore

We introduce a novel active-learning scenario in which a user wants to work with a learning algorithm to identify *useful* anomalies. These are distinguished from the traditional statistical definition of anomalies as outliers or merely ill-modeled points. Our distinction is that the usefulness of anomalies is categorized subjectively by the user. We make two additional assumptions. First, there exist extremely few useful anomalies to be hunted down within a massive dataset. Second, both useful and useless anomalies may sometimes exist within tiny classes of similar anomalies. The challenge is thus to identify "rare category" records in an unlabeled noisy set with help (in the form of class labels) from a human expert who has a small budget of datapoints that they are prepared to categorize. We propose a technique to meet this challenge, which assumes a mixture model fit to the data, but otherwise makes no assumptions on the particular form of the mixture components. This property promises wide applicability in real-life scenarios and for various statistical models. We give an overview of several alternative methods, highlighting their strengths and weaknesses, and conclude with a detailed empirical analysis. We show that our method can quickly zoom in on an anomaly set containing a few tens of points in a dataset of hundreds of thousands.

**Authors:** Brent Bryan, Jeff Schneider, Robert C. Nichol, Christopher J. Miller, Christopher R. Genovese, Larry Wasserman

We present an efficient algorithm to actively select queries for learning the boundaries separating a function domain into regions where the function is above and below a given threshold. We develop experiment selection methods based on entropy, misclassification rate, variance, and their combinations, and show how they perform on a number of data sets. We then show how these algorithms are used to determine simultaneously valid $1-\alpha$ confidence intervals for seven cosmological parameters. Experimentation shows that the algorithm reduces the computation necessary for the parameter estimation problem by an order of magnitude.

**Authors:** William Herlands, Maria De-Arteaga, Daniel Neill, and Artur Dubrawski

NIPS Workshop: Optimization for Machine Learning, 2015.

**Authors:** Maria De-Arteaga, Artur Dubrawski, Peter Huggins

Heinz First Paper, Heinz College / Data Analysis Project, Machine Learning Department, 2016. Carnegie Mellon University.

papers.txt · Last modified: 2016/10/05 22:00 by dbayani

Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial 4.0 International