SC23 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Birds of a Feather

Operational Data Analytics


Authors: Rachel Palumbo (Oak Ridge National Laboratory (ORNL)), Kadidia Konaté (Lawrence Berkeley National Laboratory), Melissa Romanus (Lawrence Berkeley National Laboratory (LBNL)), Norm Bourassa (Lawrence Berkeley National Laboratory (LBNL), Energy Efficient HPC Working Group), Jim Brandt (Sandia National Laboratories), Jeff Hanson (Hewlett Packard Enterprise (HPE)), Tim Osborne (Oak Ridge National Laboratory (ORNL)), Michael Ott (Leibniz Supercomputing Centre, Energy Efficient HPC Working Group), Ben Schwaller (Sandia National Laboratories), Woong Shin (Oak Ridge National Laboratory (ORNL)), Kathleen Shoga (Lawrence Livermore National Laboratory), Keiji Yamamoto (RIKEN)

Abstract: Operational Data Analytics (ODA) provides unique opportunities to analyze, understand, and optimize operations of HPC systems. Readily available open-source frameworks make the collection of monitoring data from different domains of the HPC system increasingly easy. However, making the data work for HPC operations is not straightforward, and effort is being duplicated at many HPC sites to develop methods and tools to analyze the data and leverage it for operations. There is a clear demand to collaborate on this within the community, but because standards for the semantics and naming of monitoring data are currently missing, such collaboration is severely hampered.

Long Description: Most sites that operate HPC systems are engaged in Operational Data Analytics in one way or another. Some may only monitor their HPC system for faults or emergencies, while others try to collect as much data as possible from their HPC operations, covering the whole data center with its supporting infrastructure, the system hardware and software, and the applications running on the system. Many are overwhelmed by the amount of data they are collecting and find it difficult either to visualize the data in enough detail or to find the right tool or approach to analyze it and extract actionable knowledge. In the big data world, a plethora of tools and methods are available to analyze such large amounts of data, but choosing the right ones is not trivial and requires expertise not only in data analytics but also in the respective domain.

Some sites have been successful in leveraging their monitoring data to better understand or optimize their HPC operations, but their approaches are not easily transferable to other sites. Consequently, many sites are duplicating efforts even though experts in data analysis are already in short supply. The main stumbling blocks for closer collaboration seem to be different approaches to organizing data, incompatible naming schemes, and a lack of metadata.

Previous instances of the proposed BoF at SC were quite successful and have gradually progressed from discussing holistic monitoring of data centers and handling the data tsunami, through methods for visualizing the data, to analyzing it with statistical methods and AI/ML. During the SC22 BoF on Operational Data Analytics, there was lively discussion among the participants on open data and standardization of monitoring data. This discussion has continued during the monthly meetings of the ODA team within the Energy Efficient HPC Working Group (EEHPCWG), whose members are also organizers of the proposed BoF. The ODA team comprises practitioners in HPC operations who deal with monitoring and data analytics on a daily basis. This BoF will allow for reporting back on current efforts to drive standardization, help to establish a larger community that is backed by the EEHPCWG, foster collaboration among different HPC sites, and drive further discussion towards standardization and data sharing.



