| www.autonlab.org | ver. 2005-05-13 |
|
Top
Overview of the afc_knn_dense Application |
This program performs fast dense K Nearest Neighbor classification. It can be used to train a model and produce an ROC curve based on a testset or k-fold cross-validation. Along with the ROC curve, the program will output an AUC score and a list of predictions for the data.
For details about our implementation of the K Nearest Neighbor algorithm, refer to the readme.afc_knn_dense.pdf document included in the downloaded software package.
A sparse K Nearest Neighbor application is also available for download from www.autonlab.org.
Input Datasets |
Dataset: train |
|
The training dataset. --------------- Format of the dataset ---------------- This is a conventional non-sparse tabular dataset representation. The 'output' parameter is used to specify which column to interpret as the output column. The remaining columns will be interpreted as attributes. The values in the dataset must all be interpretable as numbers. The output column should only have the values 0 or 1. Note - for datasets consisting primarily of zeros, there is a sparse version of this software available for download from www.autonlab.org.
|
Dataset: test (optional) |
|
The testset or set of tests to be predicted. Using the 'test' dataset as a testset: The 'test' dataset will be used as a testset if it includes labeled outputs and the 'predictonly' flag is set to false. The testset is used to assess the logistic regression model learned from the trained dataset. The outputs of the testset are only used to evaluate the performance of the model and are not used during training. Using the 'test' dataset as a set of tests to be predicted: The 'test' dataset will be used as a set of tests to be predicted if the 'predictonly' flag is set to 'true'. In this case, the test dataset does not need to have labeled outputs. If it does have labeled outputs, they will be ignored. The ROC curve and AUC score will not be computed. If a 'test' dataset is specified, then the 'num_folds' parameter must be 0. --------------- Format of the dataset ---------------- This is a conventional non-sparse tabular dataset representation. The 'output' parameter is used to specify which column to interpret as the output column. The remaining columns will be interpreted as attributes. The values in the dataset must all be interpretable as numbers. The output column should only have the values 0 or 1. Note - for datasets consisting primarily of zeros, there is a sparse version of this software available for download from www.autonlab.org.
|
Parameters Documentation |
| verbosity |
Program run Verbosity. |
| num_folds |
Number of folds in a k-fold cross-validation. |
| predictonly |
Use the 'test' dataset for predictions only. |
| output |
The output column of the 'train' dataset. |
| algorithm |
Which knn algorithm to run. |
| k |
Number of nearest neighbors to find. |
| max_points |
Max points per node. |
| min_ball_width |
Width at which a ball is considered a leaf. |
verbosity | ||||||
| ||||||
|
Program run Verbosity. Controls the amount of information output to the screen. Set to -1 for minimal output.
|
num_folds | ||||||
| ||||||
|
Number of folds in a k-fold cross-validation. If a 'test' dataset (testset) is specified, this must be 0. Otherwise, it must be set to 2 or more folds.
|
predictonly | ||||
| ||||
|
Use the 'test' dataset for predictions only. Set 'predictonly' to true if you have specified a 'test' dataset merely to get a set of predictions, while not caring about the ability of the learned model to agree with the classes in the test set. If this is set, the test dataset does not need to have labeled outputs. If it does have labeled outputs, they will be ignored. This parameter is ignored for k-fold cross-validation experiments. The available values of 'predictonly' are: true false
|
output | ||||
| ||||
|
The output column of the 'train' dataset. Enter any fully defined real-valued attribute from the 'train' dataset. The available values of 'output' are: avgmpg_is_bad cylinders_is_3 cylinders_is_4 cylinders_is_5 cylinders_is_6 displacement_is_lo displacement_is_medium horsepower_is_lo horsepower_is_medium weight_is_lo weight_is_medium maxaccel_is_lo maxaccel_is_medium modelyear_is_70 modelyear_is_71 modelyear_is_72 modelyear_is_73 modelyear_is_74 modelyear_is_75 modelyear_is_76 modelyear_is_77 modelyear_is_78 modelyear_is_79 maker_is_asia maker_is_usa
|
algorithm | ||||
| ||||
|
Which knn algorithm to run. 'oldknn' is an implementation of the conventional Ball-tree based k-nearest neighbor search. 'newknn' is an alternative implementation in which two ball trees are built - one from the positive points and one from the negative points. The 'newknn' algorithm has been seen to run faster than the 'oldknn' algorithm for datasets with significantly more negative outputs than postiive ones. The available values of 'algorithm' are: newknn oldknn
|
k | ||||||
| ||||||
|
Number of nearest neighbors to find. |
max_points | ||||||
| ||||||
|
Max points per node. The maximum number of points that will be allowed in a tree node. Sometimes it saves memory to use max_points > 1. It even seems to speed things up, possibly because of L1/L2 cache effects.
|
min_ball_width | ||||||
| ||||||
|
Width at which a ball is considered a leaf. A node will not be split if it finds that all its points are within distance 'min_ball_width' of the node centroid. Nodes that meet this criterion are leaf nodes.
|
Output Tables |
| score_summary |
AUC score and algorithm run time. |
| roc_curve |
The ROC curve for the classifier. |
| per_fold_scores |
Per fold AUC scores and run times. |
| per_fold_score_summary |
Summary of per fold scores. |
| predictions |
Predictions on the data |
| holdout_foldnum |
Table showing which fold each row was held out for. |
| predict_compare |
Summary table of hold out, predictions, and original outputs. |
Table: score_summary |
|
AUC score and algorithm run time. The 'AUC score' is the area under the ROC curve. If a testset is supplied the AUC for the testing dataset is reported. For a k-fold cross-validaiton experiment, the composite AUC for all fold predictions combined is reported. An AUC score near 1.0 is very good, and an AUC of 0.5 is expected when randomly guessing output values. The 'Algorithm Time' is the wall clock time it took to run the experiment. This table will not be computed for a train/test experiment when the 'predictonly' flag is set.
|
Table: roc_curve |
|
The ROC curve for the classifier. ROC is an abbreviation for Receiver Operating Characteristic. A ROC curve measures how well a learner ranks factors (data points) in order of decreasing probability of being active (belonging to the positive class). If a 'test' dataset (testset) is provided, the ROC curve is based on testset performance. Otherwise, it is based on k-fold cross-validation. A ROC curve will not be computed for a train/test experiment when the 'predictonly' flag is set.
|
Table: per_fold_scores |
|
Per fold AUC scores and run times. Lists the AUC score and the wall clock time spent running the algorithm for each fold. This table is only created for k-fold cross-validation experiments. |
Table: per_fold_score_summary |
|
Summary of per fold scores. This table lists the AUC mean, AUC standard deviation (sdev), and the AUC confidence interval radius (ci). It also lists the mean and standard deviation of the algorithm run time (in wall clock time.) This table is only created for k-fold cross-validation experiments. |
Table: predictions |
|
Predictions on the data Predictions on the 'test' dataset for train/test experiments or combined predictions on the training dataset for k-fold cross-validation experiments.
|
Table: holdout_foldnum |
|
Table showing which fold each row was held out for. This table contains the hold out information for each row in the training dataset. This table is only created for k-fold cross-validation experiments. |
Table: predict_compare |
|
Summary table of hold out, predictions, and original outputs. For each row, list the fold each row was held out for (only if doing k-fold cross validation), the prediction, and the original output value. The hold out information can also be seen in the 'holdout_foldnum' table. The predictions can also be seen in the 'predictions' table. |
Running the Software |
To run the non-graphical version of the software, you need to
start a command prompt.
To bring up a command prompt, click on the start button
and select run. In the box, type cmd.exe.
(or, if that fails, type command). You can use
cd to change directories and dir
to see the current directory. Navigate to the directory
where the afc_knn_dense.exe executable is and type
afc_knn_dense.exe to run the program.
Run the program from the command line without arguments to bring
up the graphical interface. Run with the -help flag to see other
options for running the software, including all options listed
below.
Microsoft Windows users will also have a desktop icon and a start menu item for running Dense K Nearest Neighbor with the graphical interface.
| Linux/Unix: | ./afc_knn_dense <key1> <value1> <key2> <value2> ... |
| Windows: | afc_knn_dense <key1> <value1> <key2> <value2> ... |
You must supply all required datasets. Any other parameter you omit will take on its default value.
Note that, from the command-line, you can provide a parameter "prefix" which will be prefixed to any output datasets generated when they're saved. If it includes a directory seperator, the app will create directories as needed to store the output.
Example command line:
| Linux/Unix: | ./afc_knn_dense train example-afc_knn_dense-train.csv num_folds 10 output avgmpg_is_bad |
| Windows: | afc_knn_dense train example-afc_knn_dense-train.csv num_folds 10 output avgmpg_is_bad |
To run from a config file, type this at the shell:
<configfile> is an ASCII file in which each line
contains a <key> <value> pair, using the convention
described in the section on running from the
command line.You can also specify the values of other parameters directly
on the command line, even when using a configuration file.
These parameters will override any parameters provided in the
config file.The same config files can be used by the graphical interface. To generate an example config file, like the one shown to the right, type this at the shell:
|
|
./afc_knn_dense gives the
message 'Permission denied'
chmod +x afc_knn_dense