HotCSE Seminar
Computational Science & Engineering
Wednesday April 05, 12pm-1pm, 1116-E Klaus

Scalable Distributed Indexing of Unstructured Texts

Patrick Flick
Advisor: Prof. Srinivas Aluru


Suffix Arrays and Suffix Trees are fundamental and versatile string data structures which are frequently used in important application areas such as text processing, information retrieval, and computational biology. Sequentially, the construction of suffix trees takes linear time, and optimal parallel algorithms exist only for the PRAM model. Recent works mostly target low core-count shared-memory implementations but achieve suboptimal complexity, and prior distributed-memory parallel algorithms have quadratic worst-case complexity. Suffix trees can be constructed from suffix and longest common prefix (LCP) arrays by solving the All-Nearest-Smaller-Values (ANSV) problem. We formulate a more generalized version of the ANSV problem, and present a distributed-memory parallel algorithm for solving it in O(np+p) time. Our algorithm minimizes the overall and per-node communication volume. Building on this, we present a parallel algorithm for constructing a distributed representation of suffix trees, yielding both superior theoretical complexity and better practical performance compared to previous distributed-memory algorithms.


Patrick Flick is a 4th year Ph.D. student in CSE, advised by Prof. Srinivas Aluru. His research interests are in parallel and high performance computing, algorithm engineering, string algorithms, and graph algorithms.