Monday April 06, 12pm1pm, 1116E Klaus
Dualtree kmeans with bounded singleiteration runtimeRyan Curtin
Advisor: Prof. Richard Vuduc


ABSTRACT
Kmeans is a widely used clustering algorithm, but for k clusters and
a dataset size of N, each iteration of Lloyd's algorithm costs O(kN)
time. Although there are treebased techniques to accelerate single
Lloyd iterations, none of these techniques are tailored to the case of
large k, which is increasingly common as dataset sizes grow. We
propose a dualtree algorithm that gives the exact result of Lloyd
iterations; when using cover trees, we are able to bound the
worstcase runtime of each algorithm as O(N + k log k) (this bound
also depends on datasetdependent quantities). To our knowledge these
are the first subO(kN) bounds for exact Lloyd iterations. We then
show that these theoretically favorable algorithms significantly
outperform other approaches in practice, especially for large N and k.
BIO
Ryan Curtin is a nearlyfinished Ph.D. student in the School of
Electrical and Computer Engineering at Georgia Tech; his research
focuses on the acceleration of core primitives that make up machine
learning algorithms, generally via the use of trees and other
hierarchical structures. He is also the primary developer and
maintainer of the mlpack machine learning library
(http://www.mlpack.org/). He doesn't enjoy writing biographies, and
also it is not possible to beat him at Double Dash.