autonlab.org

Nearest Neighbor (NIPS 2004) Datasets

The following datasets were used in Liu, Moore, Gray and Yang (2004), An Investigation of Practical Approximate Nearest Neighbor Algorithms. NIPS 2004.

They are provided here so that other researchers can run experiments on the same datasets with identical preprocessing, including the discretization levels of real-valued attributes and the compensation for missing values.

  • Aerial.gz (60.1 megs): the gzip-compressed Aerial dataset described in the paper. It is texture feature data containing 275,465 feature vectors of 60 dimensions.
  • Corel_hist.csv (3.5 megs): the Corel_hist dataset described in the paper, with 20,000 histograms (64-dimensional) of color thumbnail-sized images taken from the COREL STOCK PHOTO library.
  • Corel_uci.gz (9.2 megs): the gzip-compressed Corel_uci dataset described in the paper, with 68,040 histograms (64-dimensional) of color thumbnail-sized images taken from the COREL STOCK PHOTO library.
  • Disk_trace.gz (19.5 megs): the gzip-compressed Disk_trace dataset described in the paper, with 40,000 content traces of disk-write operations, each a 1-kilobyte block.
  • The large (900 megs) galaxy dataset is currently not available on the web.
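
After downloading, it may be worth checking that a file matches the vector count and dimensionality quoted in the list above. The following is a minimal sketch only: this page does not document the on-disk layout, so the code assumes one vector per line with numeric fields separated by whitespace (or by a delimiter you pass in). The helper names `load_vectors` and `shape_of` and the table `EXPECTED_SHAPES` are hypothetical and are not part of the original distribution.

```python
import gzip

# Sizes quoted in the list above: name -> (number of vectors, dimensions).
# Disk_trace is omitted because its dimensionality is not stated on this page.
EXPECTED_SHAPES = {
    "Aerial": (275465, 60),
    "Corel_hist": (20000, 64),
    "Corel_uci": (68040, 64),
}


def load_vectors(path, delimiter=None):
    """Read one numeric vector per line; gzip is used when path ends in .gz.

    ASSUMPTION: the layout (one vector per line, fields separated by
    `delimiter`, or by whitespace when it is None) is a guess and is not
    documented on this page.
    """
    opener = gzip.open if path.endswith(".gz") else open
    rows = []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                rows.append([float(x) for x in line.split(delimiter)])
    return rows


def shape_of(rows):
    """Return (count, dim); raise ValueError if the rows differ in length."""
    dims = {len(r) for r in rows}
    if len(dims) > 1:
        raise ValueError(f"inconsistent row lengths: {sorted(dims)}")
    return (len(rows), dims.pop() if dims else 0)
```

If the layout assumption holds, `shape_of(load_vectors("Aerial.gz"))` should equal `EXPECTED_SHAPES["Aerial"]`, and `load_vectors("Corel_hist.csv", delimiter=",")` would be the natural call for the CSV file. A mismatch or a parse error more likely means the assumed layout is wrong than that the download is corrupt.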

Please feel free to contact Ting Liu with questions or comments.

Copyright 2008, Carnegie Mellon University, Auton Lab. All Rights Reserved.