autonlab.org

Fast Nearest-neighbor Search in Disk-resident Graphs (2010)

Purnamrita Sarkar

Andrew W. Moore

Abstract

Link prediction, personalized graph search, fraud detection,
and many such graph mining problems revolve around the
computation of the most "similar" k nodes to a given query
node. One widely used class of similarity measures is based
on random walks on graphs, e.g., personalized pagerank, hit-
ting and commute times, and simrank. There are two fun-
damental problems associated with these measures. First,
existing online algorithms typically examine the local neigh-
borhood of the query node which can become signicantly
slower whenever high-degree nodes are encountered (a com-
mon phenomenon in real-world graphs). We prove that turn-
ing high degree nodes into sinks results in only a small ap-
proximation error, while greatly improving running times.
The second problem is that of computing similarities at
query time when the graph is too large to be memory-
resident. The obvious solution is to split the graph into
clusters of nodes and store each cluster on a disk page; ide-
ally random walks will rarely cross cluster boundaries and
cause page-faults. Our contributions here are twofold: (a)
we present an efficient deterministic algorithm to nd the
k closest neighbors (in terms of personalized pagerank) of
any query node in such a clustered graph, and (b) we de-
velop a clustering algorithm (RWDISK) that uses only se-
quential sweeps over data les. Empirical results on sev-
eral large publicly available graphs like DBLP, Citeseer and
Live-Journal (~ 90 M edges) demonstrate that turning high
degree nodes into sinks not only improves running time of
RWDISK by a factor of 3 but also boosts link prediction
accuracy by a factor of 4 on average. We also show that
RWDISK returns more desirable (high conductance and small
size) clusters than the popular clustering algorithm METIS,
while requiring much less memory. Finally our deterministic
algorithm for computing nearest neighbors incurs far fewer
page-faults (factor of 5) than actually simulating random
walks.

Full text

Download (application/pdf, 558.9 kB)

Approximate BibTeX Entry

@inproceedings{sarkar-moore-kdd10,
    Howpublished = {KDD},
    Year = {2010},
    Organization = {Carnegie Mellon University},
    Title = {Fast Nearest-neighbor Search in Disk-resident Graphs}
}

Copyright 2010, Carnegie Mellon University, Auton Lab. All Rights Reserved.