Boston University’s EC527, High Performance Programming with Multicore and GPUs, covers the intricacies of high-performance programming on multicore CPUs and GPUs. Martine Herbort taught the course in spring 2023. We began with single-core optimizations such as loop interchange, loop unrolling, reassociation, and SSE intrinsics. We then moved on to multithreading with pthreads and OpenMP, and finally applied what we had learned to CUDA programming.
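To give a flavor of the first of those techniques, here is a minimal sketch of loop interchange applied to square matrix multiplication. It is not the course template or my project code: the function names `mmm_ijk` and `mmm_ikj`, the `float` element type, and the row-major layout are all assumptions made for illustration.

```c
#include <stddef.h>

/* Naive i-j-k order: the inner loop reads b down a column (stride n),
 * which is unfriendly to the cache for row-major storage. */
void mmm_ijk(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Interchanged i-k-j order: the inner loop walks b and c along rows
 * (stride 1). The arithmetic is the same; only the loop order changes. */
void mmm_ikj(const float *a, const float *b, float *c, size_t n)
{
    /* c is accumulated into, so clear it first. */
    for (size_t idx = 0; idx < n * n; idx++)
        c[idx] = 0.0f;

    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            float r = a[i * n + k];  /* reused across the whole inner loop */
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += r * b[k * n + j];
        }
    }
}
```

Both functions compute the same product. The interchanged version simply accesses memory contiguously in its innermost loop, which is typically what makes it faster on large matrices.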
This post details the final project I completed for the class.
Writing code from scratch can be daunting, so I started from a template provided in our homework. This saved considerable time and made it easy to compare different matrix multiplication methods. I extended the code, added tests to measure running times, and then implemented both the plain Strassen algorithm and a SIMD-optimized version. To my dismay, the plain Strassen algorithm was slower than the loop-interchanged matrix multiplication, which I attribute to the significant overhead of the Strassen method. However, once I used a SIMD-optimized kernel inside the Strassen matrix multiplication, performance improved substantially. For the full report and the code, please refer to: