SC23 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Birds of a Feather

Go with the (Energy) Flow: Adaptive Capacity Computing


Authors: Felix Wolf (Technical University Darmstadt), Hans-Christian Hoppe (Forschungszentrum Juelich), Andrew Chien (University of Chicago, Argonne National Laboratory (ANL)), Michael Ott (Leibniz Supercomputing Centre), Ana Radovanovic (Google LLC), Utkarsh Shah (Google LLC)

Abstract: The increasing reliance on inherently variable Green energy is poised to impact HPC centers fundamentally: they cannot count on a guaranteed supply of grid power, yet could play a significant role in stabilizing the Grid by quickly adapting their load.

“Adaptive Capacity Computing” touches on system architecture, hardware, scheduling and resource management, programming models, and applications with the objective of enabling future HPC centers to react gracefully to varying power profiles, achieving optimal throughput and avoiding loss of computational state wherever possible.

This BoF discusses the challenges of supporting this paradigm, and approaches to overcoming them, should such support become necessary.


Long Description: The ongoing transformation of industrialized societies towards circular economies that rely on the efficient use of Green energy sources is poised to have a pronounced effect on HPC infrastructures: energy is becoming increasingly expensive, and the inherently variable nature of solar and wind energy threatens to make a guaranteed, constant energy supply in the tens of megawatts (as required for large Exascale centers) an untenable or at least very costly proposition. In addition, other energy uses will likely be considered more critical than HPC in the event of a sudden energy shortage. Yet there is also an opportunity for HPC centers to play a significant role in stabilizing the Grid, should they become able to adapt their power consumption within milliseconds. Even today, power grid operators offer financial incentives to large consumers that show elasticity, meaning that they agree to reduce their power use on cue from the supplier.

The new field of “Adaptive Capacity Computing” (referred to as ACC below) investigates how current HPC systems, software, and centers could be empowered to react gracefully to a varying power profile. This requires adapting compute capacity to the available energy over time without impairing hardware reliability or lifetime, while still optimizing computational throughput and, if at all possible, preserving the computational state of applications that have to be pre-empted.

For the BoF discussion, we focus on the challenges HPC centers face in adapting to ACC, and on approaches to overcome them. The topic cuts through almost the complete HW/SW stack: HPC systems (nodes, networks, storage) must become able to shut down safely on short notice and to support fast restarts; fast and preferably local storage would allow application states to be saved and restored quickly; resource management has to support dynamic changes in available resources, and it would be advantageous to enable dynamic consolidation of applications; schedulers need to react swiftly and rebuild their schedules in case of impending power changes. Of course, applications should adapt, too: malleable applications would be able to quickly reduce their footprint while continuing to run, and could later expand again once additional resources become available.
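As a purely illustrative sketch of the scheduler-side reaction described above, the following Python fragment shows one way a resource manager might respond to a reduced power budget: it first shrinks malleable jobs towards their minimum footprint and only then suspends jobs that would need to be checkpointed. Everything here is an assumption made for illustration (the `Job` fields, the `adapt_to_budget` function, the fixed per-node power draw, and the lowest-priority-first policy); it does not describe any existing scheduler or a method proposed by the organizers.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record; all fields are illustrative assumptions."""
    name: str
    nodes: int       # nodes currently allocated
    min_nodes: int   # smallest footprint the job can run on (== nodes if rigid)
    priority: int    # higher value means more important


def adapt_to_budget(jobs, budget_watts, watts_per_node):
    """Return (node allocation per job, names of jobs to checkpoint and suspend).

    Assumes every node draws the same constant power, which is a strong
    simplification of real hardware.
    """
    max_nodes = int(budget_watts // watts_per_node)
    alloc = {job.name: job.nodes for job in jobs}
    suspended = []
    excess = sum(alloc.values()) - max_nodes
    by_priority = sorted(jobs, key=lambda job: job.priority)

    # Step 1: shrink malleable jobs, lowest priority first.
    for job in by_priority:
        if excess <= 0:
            break
        give_back = min(job.nodes - job.min_nodes, excess)
        alloc[job.name] -= give_back
        excess -= give_back

    # Step 2: if shrinking was not enough, suspend whole jobs,
    # again starting with the lowest priority.
    for job in by_priority:
        if excess <= 0:
            break
        excess -= alloc[job.name]
        alloc[job.name] = 0
        suspended.append(job.name)

    return alloc, suspended


if __name__ == "__main__":
    running = [
        Job("climate", nodes=100, min_nodes=100, priority=3),  # rigid
        Job("cfd", nodes=80, min_nodes=20, priority=2),        # malleable
        Job("ml", nodes=60, min_nodes=10, priority=1),         # malleable
    ]
    print(adapt_to_budget(running, budget_watts=100_000, watts_per_node=500))
    print(adapt_to_budget(running, budget_watts=50_000, watts_per_node=500))
```

With a mild reduction, the malleable jobs absorb the shortfall and nothing is suspended; with a severe reduction, lower-priority jobs are suspended so that the rigid high-priority job keeps running. A realistic policy would also have to account for checkpoint cost, node-level power variation, and re-expansion once power returns, none of which is modeled here.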

Finally, the interconnect fabric can play an important part, since it could support a quick transfer of applications to different resources or a “quick freeze” that enables a later restart.

This BoF will serve as an open forum for discussing the topics around ACC, and will help to start building a community of interested researchers and developers. The organizers will create an online repository for storing documents and a mailing list to provide a forum for further discussions.

The event will be announced through various channels across different HPC R&D communities and geographies. These will include general HPC publications and multipliers such as HPCwire or HiPEAC, specialized channels of associated R&D projects (such as the Exascale Projects in Europe), and activities of the participating institutions (JSC, LRZ, TU Darmstadt, University of Chicago).


Website: https://fz-juelich.sciebo.de/s/exElp9Tm0Qt3QEc




