SSD-Backed PagedAttention: Extending the KV Cache with GPUDirect Storage
In CS599 (GPU Programming) at Boston University, I explored techniques for optimizing CUDA kernels. For the final project, my team investigated how far LLM serving can be pushed when GPU memory becomes the bottleneck. We extended PagedAttention-style KV-cache management with an SSD-backed tier, using NVIDIA GPUDirect Storage (via KvikIO) to move KV blocks between GPU memory and NVMe with minimal CPU involvement.
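The core idea, spilling cold KV blocks to NVMe and streaming them back on demand, can be sketched as a two-tier block store with LRU eviction. This is a minimal illustration of the tiering logic only, not the report's actual implementation: plain `bytes` buffers and ordinary file I/O stand in for GPU buffers and `kvikio.CuFile` transfers, and all names here are hypothetical.

```python
from collections import OrderedDict

class TieredKVCache:
    """Sketch of a two-tier KV-block store: a fixed budget of
    "GPU-resident" blocks plus an SSD spill file. In the real system the
    spill reads/writes would go through kvikio.CuFile (GPUDirect Storage)
    directly between NVMe and GPU memory."""

    def __init__(self, gpu_block_budget: int, block_bytes: int, spill_path: str):
        self.budget = gpu_block_budget
        self.block_bytes = block_bytes
        self.resident = OrderedDict()   # block_id -> bytes, kept in LRU order
        self.spill = open(spill_path, "w+b")
        self.offsets = {}               # block_id -> byte offset in spill file
        self.next_off = 0

    def put(self, block_id, block: bytes):
        assert len(block) == self.block_bytes
        self._make_room()
        self.resident[block_id] = block

    def get(self, block_id) -> bytes:
        if block_id in self.resident:
            self.resident.move_to_end(block_id)   # LRU touch
            return self.resident[block_id]
        # Miss: stream the block back from the spill file.
        self._make_room()
        self.spill.seek(self.offsets[block_id])
        block = self.spill.read(self.block_bytes)
        self.resident[block_id] = block
        return block

    def _make_room(self):
        while len(self.resident) >= self.budget:
            victim_id, victim = self.resident.popitem(last=False)  # evict LRU
            if victim_id not in self.offsets:  # KV blocks are immutable: spill once
                self.offsets[victim_id] = self.next_off
                self.spill.seek(self.next_off)
                self.spill.write(victim)
                self.next_off += self.block_bytes
```

Because KV blocks are written once and then only read, an evicted block never needs to be re-spilled, which keeps the SSD traffic to one write per block plus reads on reuse.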
Below, I attach the full report, including the design, implementation details, and evaluation on representative chat workloads.