SC23 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Birds of a Feather

Integrating Cloud Infrastructure with Large Scale HPC Environments


Authors: John Lange (Oak Ridge National Laboratory (ORNL), University of Pittsburgh), Chris Zimmer (Oak Ridge National Laboratory (ORNL)), Kevin Pedretti (Sandia National Laboratories), Vitali Morozov (Argonne National Laboratory (ANL)), Shane Canon (Lawrence Berkeley National Laboratory (LBNL)), Todd Gamblin (Lawrence Livermore National Laboratory), Robert Ciotti (NASA), Bill Magro (Google LLC), Heidi Poxon (Amazon), Roy Varghese (Microsoft Corporation), Sarp Oral (Oak Ridge National Laboratory (ORNL))

Abstract: As cloud environments deploy HPC-capable infrastructure, large-scale supercomputing and HPC centers are exploring how to integrate these resources into their ecosystems. This BoF will give these centers an opportunity to share their experiences and insights, and a venue to establish collaborative efforts and develop broader strategies across the community. It will serve as a forum for discussion among supercomputing facility operators, cloud service providers, and the user community, covering strategies and approaches for integrating cloud resources into existing HPC facility environments.

Long Description: Cloud environments are quickly becoming viable platforms for delivering capabilities that have traditionally existed only in large-scale HPC centers. This shift is driven largely by AI workloads, which have given cloud providers a strong economic incentive to deploy HPC-capable infrastructure and are also emerging as an important class of workload for HPC facilities. At the same time, large-scale HPC centers face an uncertain future: underlying technology trends and changing workload patterns are creating headwinds for the continued delivery of supercomputing capabilities that are both economically feasible and technologically relevant. Integrating cloud resources with large-scale HPC environments offers a path to address these challenges by providing flexible infrastructure that can be dynamically adapted to current demand and to application use cases that do not map well to supercomputing system architectures designed for traditional modeling and simulation workloads. To support the full range of future HPC applications, large-scale HPC centers should find a path to integrate cloud platforms while continuing to deliver on their traditional large-scale supercomputing mission. While this is broadly recognized across the supercomputing community, a broad and coherent strategy for doing so has yet to emerge. Individual and independent efforts have been started at various labs, facilities, and centers, but these are almost entirely ad hoc and focused on local use cases. This BoF is intended to foster collaboration and cooperation across the supercomputing community to develop a collective strategy for incorporating cloud capabilities into supercomputing ecosystems. It will provide a venue for representatives from various supercomputing facilities to share their experiences, plans, and visions for integrating cloud resources into large-scale HPC and supercomputing facilities.
Through this BoF, we hope that participants will develop an awareness and understanding of current cloud integration efforts across the broader supercomputing community, and that the session will foster collaborative efforts to develop collective strategies and approaches that can be adopted by the broader SC community. While the primary participants will likely be representatives from DOE and other large-scale computing facilities and from cloud service providers, we also welcome participation from the broader HPC community to share experiences, insights, and requirements. Potential topics include (1) cloud environment use cases such as load shedding and per-application system specialization, (2) approaches for integrating cloud resources into current facility environments, job management systems, and user account and project management infrastructure, and (3) strategies for achieving pricing advantages through collective negotiation and demand aggregation.


