Monday April 06, 12pm-1pm, 1116-E Klaus
Dual-tree k-means with bounded single-iteration runtimeRyan Curtin
Advisor: Prof. Richard Vuduc
|
|
ABSTRACT
K-means is a widely used clustering algorithm, but for k clusters and
a dataset size of N, each iteration of Lloyd's algorithm costs O(kN)
time. Although there are tree-based techniques to accelerate single
Lloyd iterations, none of these techniques are tailored to the case of
large k, which is increasingly common as dataset sizes grow. We
propose a dual-tree algorithm that gives the exact result of Lloyd
iterations; when using cover trees, we are able to bound the
worst-case runtime of each algorithm as O(N + k log k) (this bound
also depends on dataset-dependent quantities). To our knowledge these
are the first sub-O(kN) bounds for exact Lloyd iterations. We then
show that these theoretically favorable algorithms significantly
outperform other approaches in practice, especially for large N and k.
BIO
Ryan Curtin is a nearly-finished Ph.D. student in the School of
Electrical and Computer Engineering at Georgia Tech; his research
focuses on the acceleration of core primitives that make up machine
learning algorithms, generally via the use of trees and other
hierarchical structures. He is also the primary developer and
maintainer of the mlpack machine learning library
(http://www.mlpack.org/). He doesn't enjoy writing biographies, and
also it is not possible to beat him at Double Dash.