Wednesday April 05, 12pm1pm, 1116E Klaus
Scalable Distributed Indexing of Unstructured TextsPatrick Flick
Advisor: Prof. Srinivas Aluru


ABSTRACT
Suffix Arrays and Suffix Trees are fundamental and versatile string data structures which are frequently used in important application areas such as text processing, information retrieval, and computational biology. Sequentially, the construction of suffix trees takes linear time, and optimal parallel algorithms exist only for the PRAM model. Recent works mostly target low corecount sharedmemory implementations but achieve suboptimal complexity, and prior distributedmemory parallel algorithms have quadratic worstcase complexity. Suffix trees can be constructed from suffix and longest common prefix (LCP) arrays by solving the AllNearestSmallerValues (ANSV) problem. We formulate a more generalized version of the ANSV problem, and present a distributedmemory parallel algorithm for solving it in O(np+p) time. Our algorithm minimizes the overall and pernode communication volume. Building on this, we present a parallel algorithm for constructing a distributed representation of suffix trees, yielding both superior theoretical complexity and better practical performance compared to previous distributedmemory algorithms.
BIO
Patrick Flick is a 4th year Ph.D. student in CSE, advised by Prof. Srinivas Aluru. His research interests are in parallel and high performance computing, algorithm engineering, string algorithms, and graph algorithms.