SC23 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Workshops Archive

Extra-Deep: Automated Empirical Performance Modeling for Distributed Deep Learning


Workshop: 5th Workshop on Programming and Performance Visualization Tools (ProTools 2023)

Authors: Marcus Ritter and Felix Wolf (Technical University of Darmstadt)


Abstract: With the rapidly increasing size and complexity of deep neural networks (DNNs), equally sophisticated methods are needed to train them efficiently, including distributed training and various model and hybrid parallelism approaches. Although developers rely heavily on state-of-the-art frameworks such as PyTorch and TensorFlow, these frameworks provide little insight into an application's training behavior at scale, which leads to latent performance bottlenecks and inefficient training configurations. We propose Extra-Deep, an automated empirical performance modeling approach for distributed deep learning. We leverage the created models to analyze a training task's performance, scalability, efficiency, and cost. Using an efficient sampling strategy that reduces the profiling time for the required empirical measurements by about 94.9% on average, we can identify cost-effective training configurations even for large-scale applications. We evaluated our approach on three parallelization strategies, with four DNN models and five datasets. The results show that Extra-Deep has an average prediction accuracy of 93.6% when compared to empirical results.
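
The abstract does not describe how Extra-Deep builds its models, so the following is only a minimal sketch of the general idea behind empirical performance modeling, not the authors' implementation. It fits runtime measurements taken at a few small scales to several candidate scaling functions, keeps the best fit, and extrapolates to a larger scale. The candidate terms, the function names (fit_linear, fit_model, predict), and the measurement values are all hypothetical and chosen for illustration.

```python
import math

# Hypothetical candidate scaling terms f(p) for p parallel workers.
# These are illustrative assumptions, not taken from Extra-Deep.
CANDIDATES = {
    "log2(p)": lambda p: math.log2(p),
    "p": lambda p: float(p),
    "p*log2(p)": lambda p: p * math.log2(p),
    "p^2": lambda p: float(p) ** 2,
}


def fit_linear(xs, ys):
    """Ordinary least squares for y = c0 + c1 * x; returns (c0, c1, sse)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    c1 = sxy / sxx
    c0 = mean_y - c1 * mean_x
    sse = sum((y - (c0 + c1 * x)) ** 2 for x, y in zip(xs, ys))
    return c0, c1, sse


def fit_model(ranks, times):
    """Fit t(p) = c0 + c1 * f(p) for each candidate f; keep the lowest error."""
    best = None
    for name, f in CANDIDATES.items():
        c0, c1, sse = fit_linear([f(p) for p in ranks], times)
        if best is None or sse < best[3]:
            best = (name, c0, c1, sse)
    return best


def predict(model, p):
    """Evaluate the selected model at a (possibly unmeasured) scale p."""
    name, c0, c1, _ = model
    return c0 + c1 * CANDIDATES[name](p)


# Synthetic measurements (made up): time per training step at small scales.
ranks = [2, 4, 8, 16, 32]
times = [10.0 + 3.0 * math.log2(p) for p in ranks]

model = fit_model(ranks, times)
print("selected term:", model[0])
print("predicted time at 64 workers:", predict(model, 64))
```

Because only a handful of small-scale runs are needed to fit such a model, this style of analysis can estimate behavior at scales that were never profiled, which is the kind of saving in measurement effort the abstract refers to.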




