Authors: Tom St. John (Meta AI), Murali Emani (Argonne National Laboratory), Geoffrey Fox (University of Virginia), Huihuo Zheng (Argonne National Laboratory), John Tran (NVIDIA)
Abstract: Machine learning applications are rapidly expanding into scientific domains and challenging the hallmarks of traditional high performance computing workloads. We present MLPerf, a community-driven system performance benchmark suite that spans a range of machine learning tasks. The speakers at this BoF are experts in the fields of HPC, scientific applications, machine learning, and computer architecture, representing academia, government research organizations, and private industry. In this session, we will cover the past year's developments within the MLPerf organization, provide an update on the latest round of submissions to the MLPerf-HPC benchmark suite, and solicit input from interested parties within the HPC community.
Long Description: Deep learning has transformed machine learning from theory into practice. Following its widespread adoption over the past several years, ML workloads now stand alongside traditional scientific computing workloads in the high performance computing application space. These workloads have sparked a renaissance in computer system design. Academia and industry alike are racing to integrate ML-centric designs into their products, and numerous research efforts are focused on scaling ML problems to extreme-scale systems.
Despite the breakneck pace of innovation, a crucial issue affects the research and industry communities at large: how to enable fair and useful benchmarking of ML software frameworks, ML hardware accelerators, and ML systems. The ML field requires systematic benchmarking that is both representative of real-world use cases and useful for making fair comparisons across different software and hardware platforms. This is increasingly relevant as the scientific community adopts ML in its research, for example in model-driven simulations, data analysis, and surrogate models.
MLPerf answers this call. MLPerf is a machine learning benchmark standard driven by more than 70 companies and more than 1,000 engineers and researchers. The benchmark suite comprises a set of key machine learning training and inference workloads that are representative of important production use cases, ranging from image classification and object detection to recommendation.
The MLPerf-HPC benchmark suite includes scientific applications that use ML, especially deep learning, at HPC scale. These benchmarks can be used to help project future system performance and to assist in the design of future HPC systems. They aim to evaluate behaviors unique to HPC applications, such as:
- On-node versus off-node communication characteristics for various training schemes
- Large datasets, I/O bottlenecks, reliability, and MPI versus alternative communication backends
- Complex workloads in which model training or inference is coupled to simulations, high-dimensional data, or hyperparameter optimization
We will also introduce the inaugural submission results from the new MLPerf Storage working group, whose benchmark suite examines in detail the I/O patterns of a range of ML workloads and their interaction with compute.
In this session, we will focus on the following topics of discussion:
- A senior member of the MLPerf committee will present the structure of the benchmark suite and the design choices made during its creation.
- Key stakeholders will present their perspectives on MLPerf and explain how it provides value to their organizations.
- A representative from a national HPC research center will discuss their unique needs when quantifying the performance of machine learning workloads on large-scale systems, and how these requirements have been incorporated into the MLPerf-HPC benchmark suite.
- We will host an interactive community session in which audience members can pose questions to the speakers, driving discussion on how best to address the needs of the ML-oriented HPC community and on community-building initiatives across labs, industry, and HPC centers.
The outcomes of these discussions will be summarized in a publicly available report hosted online.
Website: https://mlcommons.org