autonlab.org
WARNING: you are not looking at the live version but at an older version.

Fast Classifiers

Paul Komarek Ting Liu Ashwin Tengli Andrew Moore Charles DiFazio

Point of contact: Jeanie Komarek

Software Information

Description

A collection of fast classifiers including knn, aknn, naive bayes, decision tree, and logistic regression.

The listed classifiers can handle any type of data

  • dense categorical
  • dense real
  • sparse categorical
  • sparse real

The dense data can be loaded using a cvs or fds file. The sparse categorical data can be loaded using a two column sparse format file. The sparse real data can be loaded using a three column sparse format file. If the data is sparse then the "is_sparse" attribute must be selected. Internally the algorithms make use of a data-structure called "abdat" to handle the data.

One or more classifiers can be selected to perform cross-validation or predictions. The classifiers perform k-fold cross-validation on the training data and output the average auc value and the roc information. They can be made to do only a prediction on the test data by loading a test dataset and selecting the "do_test" attribute.

Available Classifiers

  • knn
  • aknn
  • naive bayes
  • decision tree
  • logistic regression

K Nearest Neighbor - option "knn"

Author: Ashwin Tengli <tengli@cs.cmu.edu>

The standard knn classifier implemented using a balltree. A balltree algorithm efficient on high dimensional attributes is used. The balltree algorithm does not calculate a center for the balls and instead efficiently indentifies the best(approximately) datapoint that can be used as the center. This saves a lot of memory required for the tree.

Approximate K Nearest Neighbor - option "aknn"

Author: Ting Liu <tingliu@gmail.com>

This is an approximate knn classifier implemented using a spill-tree. The spill-tree algorithm is faster than ball-tree based knn, but it could make some mistakes during classification. The spill-tree is a variant of ball tree, in which the child nodes can spill onto each other and have shared data points. This property allows us to search a spill-tree without backtracking, and still get pretty good accuracy.

More details about using this algorithm are available in the file aknn_readme.txt distributed with this algorithm. The thesis discussion of this algorithm, "An Investigation of Practical Approximate Nearest Neighbor Algorithms (2004)", is available for download at www.autonlab.org.

Naive Bayes - option "naive bayes"

Author: Ashwin Tengli <tengli@cs.cmu.edu>

A fast naive bayes classifier that uses

  • gaussian naive bayes for real attributes
  • counting tricks to speed up calcuations on sparse data
  • Uses Dirichlet prior with Laplace smoothing on categorial atts

Note: This classifier does not require any parameters to be set.

Decison Tree - option "decision tree"

Author: Charles DiFazio <cdifazio@cs.cmu.edu>

Author: Ashwin Tengli <tengli@cs.cmu.edu>

A decision tree built using Information Gain (IG) score for attribute selection. Chi-Square method is used for pruning. For sparse categorical attributes an efficient algorithm to calculate IG is used. This algorithm iterates over rows of the data and updates counts in the contigency table of attribute vs ouput. Since the sparse data does not store default values this iteration is very fast compared to individually calculating IG of attribute and output. The algorithm then uses the counting trick to calculate counts for default value of the attribute. The IGs are obtained using the contingency tables.

Logistic Regression - option "logistic regression"

Author: Paul Komarek <komarek@cmu.edu>

Logistic Regression with Truncated Regularized Iteratively Re-weighted Least Squares (LR-TRIRLS). LR-TRIRLS uses IRLS to learn the LR parameters, with some modifications. IRLS is a quasi-Newton method used for fitting Generalized Linear Models (GLiMs) to data. It is an iterative method, solving a weighted least squares (WLS) linear regression problem on each iteration. These WLS subproblems require solving a linear system of equations, which done naively and exactly is slow, unstable, and prone to overfitting. We approximate the solution to the WLS subproblems using linear conjugate gradient (CG). This provides a significant acceleration, improves stability, and reduces overfitting through early termination. To further reduce overfitting, ridge regularization is used when solving the WLS subproblems.

More Information on Logistic Regression Classifier: http://komarix.org/ac/lr/

Changelog

Download (text/plain, 246 bytes)

Downloads

Notice

Some items below may be inaccessible, since you are not logged in. Please login or register.

FilenameDescriptionUpdate timeSizePermission required
Documentation
aknn_readme.txtadditional information about the aknn classifier6/12/06 2:44:44 PM6.8 kB
fast_classifiers_doc.htmlhtml software documentation6/12/06 2:43:24 PM71.9 kB
Command-line version
fast_classifiers_20070319.tar.gzlinux binary3/19/07 1:43:20 PM903.5 kB Yes
fast_classifiers_20070316.zipwindows binary3/16/07 1:14:40 PM782.7 kB Yes
Copyright 2008, Carnegie Mellon University, Auton Lab. All Rights Reserved.