February 12th, 12pm-1pm

CUTLASS 3.x: A Microkernel Abstraction for GPU Linear Algebra

Vijay Thakkar

Advisor: Prof. Rich Vuduc

ABSTRACT

We have created a microkernel abstraction for GPUs that is robust enough to uniformly represent the tensor core and data movement operations of NVIDIA GPU architectures spanning Maxwell all the way to Blackwell. In this talk, we describe the novel two-level microkernel abstraction that makes this generality possible. Spatial microkernels, described by CuTe layouts and layout algebra, allow us to uniformly represent GPU architecture-specific operations regardless of the threads and data they operate upon. CUTLASS 3.x's temporal microkernels allow for a hierarchical organization of synchronization and programmer-managed on-chip memory, abstracting away architecture-specific synchronization. Together, these result in a robust programming model that is performant, extensible, and simply a joy to use.

BIO

Vijay is a senior architect in the fast kernels team, where he has worked on the CUTLASS 3.0 project since its inception as one of its leads. For the past three years, he has focused on the development of Blackwell kernels via CuTe and CUTLASS's programming model and the PTX ISA, which were just released in CUTLASS 3.8 and CUDA 12.8, respectively. He collaborates broadly with the GPU architecture, compiler, and programming model teams on software/hardware codesign for tensor cores and other DL features of datacenter GPUs. At GaTech, he is registered as a part-time PhD student in Rich Vuduc's HPC Garage lab, where he hopes to defend his PhD some day.