Benchmarking and In-Depth Performance Study of Large Language Models on Habana Gaudi Processors

SC23 Proceedings

Workshop: Workshop on Software and Hardware Co-Design of Deep Learning Systems in Accelerators (SHDA)

Authors: Chengming Zhang and Baixi Sun (Indiana University); Xiaodong Yu (Stevens Institute of Technology, Argonne National Laboratory (ANL)); Zhen Xie, Weijian Zheng, Kamil A. Iskra, and Pete Beckman (Argonne National Laboratory (ANL)); and Dingwen Tao (Indiana University)

Abstract: Transformer models suffer from high computational complexity. The Habana GAUDI architecture offers a promising solution to this problem. GAUDI features a Matrix Multiplication Engine (MME) and a cluster of fully programmable Tensor Processing Cores (TPCs). This paper explores the untapped potential of using GAUDI processors to accelerate Transformer-based models, addressing key challenges in the process. First, we provide a performance comparison between the MME and TPC components, illuminating their relative strengths and weaknesses. Second, we explore strategies to optimize MME and TPC utilization, offering practical insights to enhance computational efficiency. Third, we evaluate the performance of Transformers on GAUDI, particularly in handling long sequences, and uncover performance bottlenecks. Last, we evaluate the end-to-end performance of two Transformer-based large language models (LLMs) on GAUDI. This work contributes practical insights for practitioners and researchers alike, examining GAUDI's capabilities for Transformers through systematic profiling, analysis, and optimization exploration.
