Student: Sohei Koyama (University of Tsukuba)
Supervisor: Kohei Hiraga (University of Tsukuba)
Abstract: When training large language models, the frequency of checkpoint creation is limited by the write bandwidth to the parallel file system. In this study, we aim to reduce checkpoint creation time by writing checkpoints to the Intel Optane Persistent Memory installed on the compute nodes.
We propose TensorStore CHFS, a storage driver that adds support for CHFS, an ad hoc parallel file system, to TensorStore. With the proposed method, the checkpoint creation bandwidth of the T5 1.1 model increased by a factor of 4.5 on 32 nodes.
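To give a feel for what the reported bandwidth gain could mean for checkpoint time, the following is a minimal sketch, not a result from the poster. It assumes a fixed checkpoint size, so write time scales inversely with bandwidth. Only the 4.5x factor comes from the abstract; the checkpoint size and baseline bandwidth are hypothetical placeholders.

```python
def checkpoint_seconds(size_gib: float, bandwidth_gib_per_s: float) -> float:
    """Time to write one checkpoint at a sustained write bandwidth."""
    if bandwidth_gib_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gib / bandwidth_gib_per_s


# Factor reported in the abstract (T5 1.1, 32 nodes).
SPEEDUP = 4.5

# Hypothetical values chosen only for illustration; not from the poster.
size_gib = 100.0       # assumed checkpoint size
baseline_bw = 2.0      # assumed baseline write bandwidth in GiB/s

baseline_t = checkpoint_seconds(size_gib, baseline_bw)
improved_t = checkpoint_seconds(size_gib, baseline_bw * SPEEDUP)

# For a fixed size, the improved time is 1/4.5 of the baseline time.
print(f"baseline: {baseline_t:.1f} s, improved: {improved_t:.1f} s")
print(f"time ratio: {improved_t / baseline_t:.3f}")
```

Under these assumptions, the same time budget would allow roughly 4.5 times as many checkpoints, which is the sense in which bandwidth limits checkpoint frequency.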
ACM-SRC Semi-Finalist: no
Poster: PDF
Poster Summary: PDF