SC23 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

ACM Student Research Competition Poster Archive

Accelerating Collective Communications with Lossy Compression on GPU

Student: Jiajun Huang (University of California, Riverside)
Supervisor: Zizhong Chen (University of California, Riverside)

Abstract: GPU-aware collective communication has become a major bottleneck for modern computing platforms as GPU computing power rapidly rises. To address this issue, traditional approaches integrate lossy compression directly into GPU-aware collectives, which still suffer from serious issues such as underutilized GPU devices and uncontrolled data distortion.

In this poster, we propose GPU-LCC, a general framework that designs and optimizes GPU-aware, compression-enabled collectives with well-controlled error propagation. To validate our framework, we evaluate the performance on up to 512 NVIDIA A100 GPUs with real-world applications and datasets. Experimental results demonstrate that our GPU-LCC-accelerated collective computation (Allreduce), can outperform NCCL as well as Cray MPI by up to 4.5X and 20.2X, respectively. Furthermore, our accuracy evaluation with an image-stacking application confirms the high reconstructed data quality of our accuracy-aware framework.

ACM-SRC Semi-Finalist: no

Poster: PDF
Poster Summary: PDF

Back to Poster Archive Listing