www.autonlab.org ver. 2005-05-13

Dense K Nearest Neighbor Documentation

Top

Overview of the afc_knn_dense Application

This program performs fast dense K Nearest Neighbor classification. It can be used to train a model and produce an ROC curve based on a testset or k-fold cross-validation. Along with the ROC curve, the program will output an AUC score and a list of predictions for the data.

For details about our implementation of the K Nearest Neighbor algorithm, refer to the readme.afc_knn_dense.pdf document included in the downloaded software package.

A sparse K Nearest Neighbor application is also available for download from www.autonlab.org.

Top

Input Datasets

Dataset: train

The training dataset.

--------------- Format of the dataset ----------------

This is a conventional non-sparse tabular dataset representation. The 'output' parameter is used to specify which column to interpret as the output column. The remaining columns will be interpreted as attributes.

The values in the dataset must all be interpretable as numbers. The output column should only have the values 0 or 1.

Note - for datasets consisting primarily of zeros, there is a sparse version of this software available for download from www.autonlab.org.

Dataset: test (optional)

The testset or set of tests to be predicted.

Using the 'test' dataset as a testset: The 'test' dataset will be used as a testset if it includes labeled outputs and the 'predictonly' flag is set to false. The testset is used to assess the logistic regression model learned from the trained dataset. The outputs of the testset are only used to evaluate the performance of the model and are not used during training.

Using the 'test' dataset as a set of tests to be predicted: The 'test' dataset will be used as a set of tests to be predicted if the 'predictonly' flag is set to 'true'. In this case, the test dataset does not need to have labeled outputs. If it does have labeled outputs, they will be ignored. The ROC curve and AUC score will not be computed.

If a 'test' dataset is specified, then the 'num_folds' parameter must be 0.

--------------- Format of the dataset ----------------

This is a conventional non-sparse tabular dataset representation. The 'output' parameter is used to specify which column to interpret as the output column. The remaining columns will be interpreted as attributes.

The values in the dataset must all be interpretable as numbers. The output column should only have the values 0 or 1.

Note - for datasets consisting primarily of zeros, there is a sparse version of this software available for download from www.autonlab.org.

Top

Parameters Documentation

verbosity

Program run Verbosity.

num_folds

Number of folds in a k-fold cross-validation.

predictonly

Use the 'test' dataset for predictions only.

output

The output column of the 'train' dataset.

algorithm

Which knn algorithm to run.

k

Number of nearest neighbors to find.

max_points

Max points per node.

min_ball_width

Width at which a ball is considered a leaf.

Parameter Details

verbosity

Type:int
Default value:0
Legal values:an integer in the range -1 <= verbosity

Program run Verbosity.

Controls the amount of information output to the screen. Set to -1 for minimal output.

Back to Parameters

num_folds

Type:int
Default value:0
Legal values:an integer in the range 0 <= num_folds

Number of folds in a k-fold cross-validation.

If a 'test' dataset (testset) is specified, this must be 0. Otherwise, it must be set to 2 or more folds.

Back to Parameters

predictonly

Type:string
Default value:false

Use the 'test' dataset for predictions only.

Set 'predictonly' to true if you have specified a 'test' dataset merely to get a set of predictions, while not caring about the ability of the learned model to agree with the classes in the test set.

If this is set, the test dataset does not need to have labeled outputs. If it does have labeled outputs, they will be ignored.

This parameter is ignored for k-fold cross-validation experiments.

The available values of 'predictonly' are:

  true  false

Back to Parameters

output

Type:string
Default value:avgmpg_is_bad

The output column of the 'train' dataset.

Enter any fully defined real-valued attribute from the 'train' dataset.

The available values of 'output' are:

  avgmpg_is_bad           cylinders_is_3        
  cylinders_is_4          cylinders_is_5        
  cylinders_is_6          displacement_is_lo    
  displacement_is_medium  horsepower_is_lo      
  horsepower_is_medium    weight_is_lo          
  weight_is_medium        maxaccel_is_lo        
  maxaccel_is_medium      modelyear_is_70       
  modelyear_is_71         modelyear_is_72       
  modelyear_is_73         modelyear_is_74       
  modelyear_is_75         modelyear_is_76       
  modelyear_is_77         modelyear_is_78       
  modelyear_is_79         maker_is_asia         
  maker_is_usa          

Back to Parameters

algorithm

Type:string
Default value:newknn

Which knn algorithm to run.

'oldknn' is an implementation of the conventional Ball-tree based k-nearest neighbor search.

'newknn' is an alternative implementation in which two ball trees are built - one from the positive points and one from the negative points. The 'newknn' algorithm has been seen to run faster than the 'oldknn' algorithm for datasets with significantly more negative outputs than postiive ones.

The available values of 'algorithm' are:

  newknn  oldknn

Back to Parameters

k

Type:int
Default value:9
Legal values:an integer in the range 0 <= k

Number of nearest neighbors to find.

Back to Parameters

max_points

Type:int
Default value:5
Legal values:an integer in the range 1 <= max_points

Max points per node.

The maximum number of points that will be allowed in a tree node. Sometimes it saves memory to use max_points > 1. It even seems to speed things up, possibly because of L1/L2 cache effects.

Back to Parameters

min_ball_width

Type:double
Default value:0.01
Legal values:a number in the range 1e-007 <= min_ball_width

Width at which a ball is considered a leaf.

A node will not be split if it finds that all its points are within distance 'min_ball_width' of the node centroid. Nodes that meet this criterion are leaf nodes.

Back to Parameters

Top

Output Tables

score_summary

AUC score and algorithm run time.

roc_curve

The ROC curve for the classifier.

per_fold_scores

Per fold AUC scores and run times.

per_fold_score_summary

Summary of per fold scores.

predictions

Predictions on the data

holdout_foldnum

Table showing which fold each row was held out for.

predict_compare

Summary table of hold out, predictions, and original outputs.

Output Table Details

Table: score_summary

AUC score and algorithm run time.

The 'AUC score' is the area under the ROC curve. If a testset is supplied the AUC for the testing dataset is reported. For a k-fold cross-validaiton experiment, the composite AUC for all fold predictions combined is reported.

An AUC score near 1.0 is very good, and an AUC of 0.5 is expected when randomly guessing output values.

The 'Algorithm Time' is the wall clock time it took to run the experiment.

This table will not be computed for a train/test experiment when the 'predictonly' flag is set.

Back To Output Tables

Table: roc_curve

The ROC curve for the classifier.

ROC is an abbreviation for Receiver Operating Characteristic. A ROC curve measures how well a learner ranks factors (data points) in order of decreasing probability of being active (belonging to the positive class).

If a 'test' dataset (testset) is provided, the ROC curve is based on testset performance. Otherwise, it is based on k-fold cross-validation.

A ROC curve will not be computed for a train/test experiment when the 'predictonly' flag is set.

Back To Output Tables

Table: per_fold_scores

Per fold AUC scores and run times.

Lists the AUC score and the wall clock time spent running the algorithm for each fold.

This table is only created for k-fold cross-validation experiments.

Back To Output Tables

Table: per_fold_score_summary

Summary of per fold scores.

This table lists the AUC mean, AUC standard deviation (sdev), and the AUC confidence interval radius (ci). It also lists the mean and standard deviation of the algorithm run time (in wall clock time.)

This table is only created for k-fold cross-validation experiments.

Back To Output Tables

Table: predictions

Predictions on the data

Predictions on the 'test' dataset for train/test experiments or combined predictions on the training dataset for k-fold cross-validation experiments.

Back To Output Tables

Table: holdout_foldnum

Table showing which fold each row was held out for.

This table contains the hold out information for each row in the training dataset.

This table is only created for k-fold cross-validation experiments.

Back To Output Tables

Table: predict_compare

Summary table of hold out, predictions, and original outputs.

For each row, list the fold each row was held out for (only if doing k-fold cross validation), the prediction, and the original output value.

The hold out information can also be seen in the 'holdout_foldnum' table. The predictions can also be seen in the 'predictions' table.

Back To Output Tables

Top

Running the Software

---------- Notes for Microsoft Windows Users ----------

If you installed the graphical version of Dense K Nearest Neighbor you can run the program from the start menu or from a desktop icon.

To run the non-graphical version of the software, you need to start a command prompt. To bring up a command prompt, click on the start button and select run. In the box, type cmd.exe. (or, if that fails, type command). You can use cd to change directories and dir to see the current directory. Navigate to the directory where the afc_knn_dense.exe executable is and type afc_knn_dense.exe to run the program.

Back to Running the Software

---------- Running with a graphical interface ----------

To run with the graphical interface, make sure you have downloaded the graphical installation version of the software. The graphical interface is available on both Windows and Unix platforms. On Unix, you must have gtk+ 1.2.6 or later installed. It is known that the graphical interface will not run on Red Hat 7.1.

Run the program from the command line without arguments to bring up the graphical interface. Run with the -help flag to see other options for running the software, including all options listed below.

Microsoft Windows users will also have a desktop icon and a start menu item for running Dense K Nearest Neighbor with the graphical interface.

Back to Running the Software

---------- Running from the command line ----------

To run from the command line, type this at the shell:
Linux/Unix:./afc_knn_dense <key1> <value1> <key2> <value2> ...
Windows:afc_knn_dense <key1> <value1> <key2> <value2> ...
where each key is either a formal name of a dataset or a parameter name, and each value is either a filename or a parameter value. For parameters that can accept a list of values, specify the list by enclosing it in quotes.

You must supply all required datasets. Any other parameter you omit will take on its default value.

Note that, from the command-line, you can provide a parameter "prefix" which will be prefixed to any output datasets generated when they're saved. If it includes a directory seperator, the app will create directories as needed to store the output.

Example command line:
Linux/Unix:./afc_knn_dense train example-afc_knn_dense-train.csv num_folds 10 output avgmpg_is_bad
Windows:afc_knn_dense train example-afc_knn_dense-train.csv num_folds 10 output avgmpg_is_bad
Back to Running the Software

---------- Running from a saved configuration file ----------

To run from a config file, type this at the shell:
Linux/Unix:./afc_knn_dense config <configfile>
Windows:afc_knn_dense config <configfile>
where <configfile> is an ASCII file in which each line contains a <key> <value> pair, using the convention described in the section on running from the command line.You can also specify the values of other parameters directly on the command line, even when using a configuration file. These parameters will override any parameters provided in the config file.

The same config files can be used by the graphical interface.

To generate an example config file, like the one shown to the right, type this at the shell:
Linux/Unix:./afc_knn_dense sample
Windows:afc_knn_dense sample

# ---------------------------------
# Sample configuration file
# Generated by 'afc_knn_dense'.
# 2005-05-13
# ---------------------------------
train example-afc_knn_dense-train.csv
test example-afc_knn_dense-test.csv
verbosity 0
num_folds 0
predictonly false
output avgmpg_is_bad
algorithm newknn
k 9
max_points 5
min_ball_width 0.01
      
Back to Running the Software

---------- Troubleshooting ----------

Back to Running the Software


Top
www.autonlab.org