Diff report

General

Version 1Version 28
Variantmain - enmain - en
Document NameFast Classifiers(not changed)
Creation time6/12/06 2:35:59 PM5/12/10 11:07:00 AM
Created byJeanie KomarekMike Baysek
Statepublishpublish

Changes to parts

Part Authors has changed


Version 1Version 28
Mime typetext/xml(not changed)
File name(not changed)
Size (bytes)228233

Content diff
<html><body><p><a href="daisy:10222">Paul Komarek</a> <a href="daisy:10216">Ting Liu</a> <a href="daisy:10246">Ashwin Tengli</a> <a href="daisy:10224">Andrew Moore</a> <a href="daisy:10260">Charles DiFazio</a> </p></body></html>
<html>
<body>
 
<p><a href="daisy:10222">Paul Komarek</a> <a href="daisy:10216">Ting Liu</a>
<a href="daisy:10246">Ashwin Tengli</a> <a href="daisy:10224">Andrew Moore</a>
<a href="daisy:10260">Charles DiFazio</a></p>
 
</body>
</html>

Part Description has changed


Version 1Version 28
Mime typetext/xml(not changed)
File name(not changed)
Size (bytes)61264674

Content diff
<html>
<body>
 
<p>A collection of fast classifiers including knn, aknn, naive bayes, decision
tree, and logistic regression.</p>
 
<p>The listed classifiers can handle any type of data</p>
 
<ul>
<li>dense categorical</li>
<li>dense real</li>
<li>sparse categorical</li>
<li>sparse real</li>
</ul>
 
<p>The dense data can be loaded using a cvs or fds file. The sparse categorical
data can be loaded using a two column sparse format file. The sparse real data
can be loaded using a three column sparse format file. If the data is sparse
then the "is_sparse" attribute must be selected. Internally the algorithms make
use of a data-structure called "abdat" to handle the data.</p>
 
<p>One or more classifiers can be selected to perform cross-validation or
predictions. The classifiers perform k-fold cross-validation on the training
data and output the average auc value and the roc information. They can be made
to do only a prediction on the test data by loading a test dataset and selecting
the "do_test" attribute.</p>
 
<h2>Available Classifiers</h2>
 
<ul>
<li>knn</li>
<li>aknn</li>
<li>naive bayes</li>
<li>decision tree</li>
<li>logistic regression</li>
</ul>
 
<h3>K Nearest Neighbor - option "knn"</h3>
 
<p>Author: Ashwin Tengli &lt;tengli@cs.cmu.edu&gt;</p>
 
<p>The standard knn classifier implemented using a balltree. A balltree
algorithm efficient on high dimensional attributes is used. The balltree
algorithm does not calculate a center for the balls and instead efficiently
indentifies the best(approximately) datapoint that can be used as the center.
This saves a lot of memory required for the tree.</p>
 
<h3>Approximate K Nearest Neighbor - option "aknn"</h3>
 
<p>Author: Ting Liu &lt;tingliu@gmail.com&gt;</p>
 
<p>This is an approximate knn classifier implemented using a spill-tree. The
spill-tree algorithm is faster than ball-tree based knn, but it could make some
mistakes during classification. The spill-tree is a variant of ball tree, in
which the child nodes can spill onto each other and have shared data points.
This property allows us to search a spill-tree without backtracking, and still
get pretty good accuracy.</p>
 
<p>More details about using this algorithm are available in the file
aknn_readme.txt distributed with this algorithm. The thesis discussion of this
algorithm, "An Investigation of Practical Approximate Nearest Neighbor
Algorithms (2004)", is available for download at www.autonlab.org.</p>
 
<h3>Naive Bayes - option "naive bayes"</h3>
 
<p>Author: Ashwin Tengli &lt;tengli@cs.cmu.edu&gt;</p>
 
<p>A fast naive bayes classifier that uses</p>
 
<ul>
<li>gaussian naive bayes for real attributes</li>
<li>counting tricks to speed up calcuations on sparse data</li>
<li>Uses Dirichlet prior with Laplace smoothing on categorial atts</li>
</ul>
 
<p>Note: This classifier does not require any parameters to be set.</p>
 
<h3>Decison Tree - option "decision tree"</h3>
 
<p>Author: Charles DiFazio &lt;cdifazio@cs.cmu.edu&gt;</p>
 
<p>Author: Ashwin Tengli &lt;tengli@cs.cmu.edu&gt;</p>
 
<p>A decision tree built using Information Gain (IG) score for attribute
selection. Chi-Square method is used for pruning. For sparse categorical
attributes an efficient algorithm to calculate IG is used. This algorithm
iterates over rows of the data and updates counts in the contigency table of
attribute vs ouput. Since the sparse data does not store default values this
iteration is very fast compared to individually calculating IG of attribute and
output. The algorithm then uses the counting trick to calculate counts for
default value of the attribute. The IGs are obtained using the contingency
tables.</p>
 
<h3>Logistic Regression - option "logistic regression"</h3>
 
<p>Author: Paul Komarek &lt;komarek@cmu.edu&gt;</p>
 
<p>Logistic Regression with Truncated Regularized Iteratively Re-weighted Least
Squares (LR-TRIRLS). LR-TRIRLS uses IRLS to learn the LR parameters, with some
modifications. IRLS is a quasi-Newton method used for fitting Generalized Linear
Models (GLiMs) to data. It is an iterative method, solving a weighted least
squares (WLS) linear regression problem on each iteration. These WLS subproblems
require solving a linear system of equations, which done naively and exactly is
slow, unstable, and prone to overfitting. We approximate the solution to the WLS
subproblems using linear conjugate gradient (CG). This provides a significant
acceleration, improves stability, and reduces overfitting through early
termination. To further reduce overfitting, ridge regularization is used when
solving the WLS subproblems.</p>
 
<p>More Information on Logistic Regression Classifier: http://komarix.org/ac/lr/
</p>
 
<h2>Note on Sparse Data</h2>
 
<p>Sparse datasets can be provided as input in 2-column csv or 3-column csv
format. Sparse means not every possible value of the dataset is explicitly
stored. The unstored values are assumed to take a default value. For the abdat
data structure the default value is zero.</p>
 
<p>The 2-column csv format, also called as Simple Sparse Dataset (SSD), is used
for sparse boolean attributes. The SSD format can be used to store sparse
true/false data in a CSV file. For each nonzero element (i,j) in the data, the
SSD will have a row: i, j. In other words, each row of the SSD format lists the
row number and attribute name of nonzero value.</p>
 
<p>This format shares many of the qualities of CSV, including editability using
text editors or spreadsheets.</p>
 
<p>Example data:</p>
 
<pre>
retired,homeowner,republican,cheeselover
0, 0, 1, 0
1, 0, 0, 1
0, 0, 0, 0
0, 1, 1, 0
</pre>
 
<p>Corresponding SSD:</p>
 
<pre>
0,republican
1,retired
1,cheeselover
3,homeowner
3,republican
</pre>
 
<p>For sparse real datasets a similar format that is represented in 3 columns
instead of 2 can be used. The third column contains the real value of the
nonzero element. For each nonzero element (i,j) or (rowno,attribute-name) the
3-column csv file will have a row with format:</p>
 
<pre>
rowno, attribute-name, nonzero-real-value
</pre>
 
</body>
</html>

Part Point of contact has changed


Version 1Version 28
Mime typetext/xml(not changed)
File name(not changed)
Size (bytes)7476

Content diff
<html><body><p><a href="daisy:10263">Jeanie Komarek</a> </p></body></html>
<html>
<body>
 
<p><a href="daisy:17066">Saswati Ray</a></p>
 
</body>
</html>

Part Changelog has been added

Changes to links

No changes detected

Changes to fields

No changes detected