Early Experience in Characterizing Training Large Language Models on Modern HPC Clusters

SC23 Proceedings

Research Posters Archive

Authors: Hao Qi, Liuyao Dai, Weicong Chen, and Xiaoyi Lu (University of California, Merced)

Abstract: In natural language processing, Large Language Models (LLMs) have emerged as powerful tools for tasks such as language translation, text generation, and sentiment analysis. However, their immense parameter counts and complexity make training them challenging. This work characterizes high-performance interconnects in the distributed training of various LLMs. Our findings reveal that high-performance network protocols, notably RDMA, significantly outperform IPoIB and TCP/IP in training performance, by factors of 2.51x and 4.79x, respectively. We also observe that LLMs with more parameters tend to demand higher interconnect utilization. Even so, our study suggests that overall interconnect utilization has room for further optimization. This research contributes to a deeper understanding of the performance characteristics of LLM training over high-speed interconnects, paving the way for more efficient training methodologies.
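
As a reading aid for the two speedup factors quoted in the abstract, the following minimal sketch works through the arithmetic they imply. It assumes, without confirmation from the abstract, that both factors were measured on the same workload and that "training performance" is inversely proportional to training time. The helper name `time_reduction` and the derived IPoIB-versus-TCP/IP ratio are illustrative and are not results reported by the authors.

```python
# Speedup factors quoted in the abstract (RDMA relative to each baseline).
RDMA_OVER_IPOIB = 2.51
RDMA_OVER_TCPIP = 4.79


def time_reduction(speedup: float) -> float:
    """Fraction of training time saved for a given speedup factor.

    Assumes performance is inversely proportional to training time.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# Implied IPoIB-vs-TCP/IP ratio. This is an inference that holds only if
# both quoted factors come from the same configuration (an assumption).
ipoib_over_tcpip = RDMA_OVER_TCPIP / RDMA_OVER_IPOIB

if __name__ == "__main__":
    print(f"Time saved vs IPoIB:  {time_reduction(RDMA_OVER_IPOIB):.1%}")
    print(f"Time saved vs TCP/IP: {time_reduction(RDMA_OVER_TCPIP):.1%}")
    print(f"Implied IPoIB over TCP/IP: {ipoib_over_tcpip:.2f}x")
```

Under these assumptions, a 2.51x speedup corresponds to saving roughly three fifths of the training time, and a 4.79x speedup to saving roughly four fifths.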

Best Poster Finalist (BP): no