Authors: Ryan Grant (Queen's University, Canada), Siddhartha Jana (Intel Corporation, Energy Efficient HPC Working Group), Natalie Bates (Energy Efficient HPC Working Group), Jeff Autor (Hewlett Packard Enterprise (HPE)), Barry Rountree (Lawrence Livermore National Laboratory), Daniel Wilson (Boston University), Eishi Arima (Technical University Munich), Yiannis Georgiou (Ryax Technologies), David Grant (Oak Ridge National Laboratory), Christopher DePrater (Lawrence Livermore National Laboratory)
Abstract: This BoF will bring together academia, government research laboratories, and industry to discuss and contribute to the two active community-driven, vendor-neutral forums focusing on energy efficiency in HPC software stacks. For more than 7 years, these two complementary forums, HPC-PowerStack and PowerAPI, have led the efforts to identify and build energy-efficiency solutions across the software stack.
This interactive BoF will enable the community to discuss ongoing challenges in designing cost-effective, cohesive, portable, and interoperable implementations of HPC software for monitoring and control of system efficiency. Attendees will also contribute toward brainstorming solutions for addressing ongoing exascale power challenges.
Long Description:
** Relevance:
Despite recent advances at exascale, power and energy are still of great concern to HPC sites of all sizes. Because of the importance of the topic, several parallel R&D efforts are pursuing energy-efficient solutions, and a community has formed around solving future energy problems. The majority of techniques developed so far have been designed in accordance with vendor- or site-specific restrictions. State-of-the-art specifications stop short of defining which software components should actually interoperate in a unified stack. System-wide coordination is critical for avoiding underutilization of system FLOPS/Watt.
For 7+ years, two complementary forums, HPC-PowerStack and PowerAPI, have been addressing power challenges from within the software stack. The efforts have focused on: (A) identifying the critical software actors needed in a system stack; (B) reaching a consensus on their roles and responsibilities; (C) designing protocols for bidirectional control and feedback signals among them to enable scalable coordination at multiple granularities; (D) establishing unified hierarchical communication models/APIs to access power monitoring and control knobs in hardware and software; and (E) leveraging existing prototypes and building a community that actively participates in open development and engineering efforts.
** Pre-SC23:
Within the PowerStack consortium, 40+ representatives from industry, labs, and academia have convened twice a year (SC/ISC timeframe) for knowledge transfer and collaboration on community-wide standardization efforts for designing a power-management stack. Likewise, the PowerAPI community has convened monthly to focus on the design of the API specification that enables interoperability between the stack components. Over these past ~7 years, the community has arrived at a consensus on two points: (1) job/application awareness is going to be critical for boosting system-wide optimization, which implies the need to drive interoperation between a job-level runtime and the job scheduler; and (2) hierarchical control systems are good models for scalable global optimization across the system, so the power stack should be a multi-tiered system with bidirectional control and feedback signals flowing between the layers. Today’s systems are inefficiently designed in that they break this hierarchical model, and we as a community need to work towards fixing this. These points align with the attendee feedback from our ISC19, SC19, ISC21, SC22, and ISC23 BoFs.
** BoF-Goals:
Based on feedback from past BoFs and HPC sites, we are extending this proposal to cover facility-level challenges and carbon-neutral principles. This session will give attendees an opportunity to discuss the latest developments in these topics.
The goals are to: (1) make attendees aware of the emerging community effort to design a common power stack and discuss the lessons learned during the past seminar; (2) provide updates on the current and future prototyping efforts that have begun; and (3) align efforts across the community so that the SC22 BoF attendees reach a consensus on sharing R&D resources, avoid duplicating effort, agree on standard interfaces, and reap the rewards together as a community.
Website: http://hpcpowerstack.github.io