Program
Organizations
3M
3M Company
ACE Lab, Center for High Performance Computing (CHPC), South Africa
ACE Lab, CHPC
AceNet
Achronix
ACM
ACM Riken R-CCS
ADIA Lab
Advanced Computing Service for Latin America and the Caribbean (SCALAC)
Advanced Micro Devices (AMD) Inc
AGH-UST
Agnostiq
Agrofocal Technologies, Inc.
AHUG
Alibaba Group
Amazon
Amazon Web Services
Amazon Web Services (AWS)
AMD Research
Apple Inc
Archaeologic Inc. and Sourcery Institute
Archaeologic, Inc.
Argonne National Laboratory (ANL)
ARM Ltd
Atomic Energy and Alternative Energies Commission (CEA)
Autonomous University of Barcelona, Spain
AWS
AWS AI
AWS AI Research and Education
Ayar Labs
Barcelona Supercomputing Center (BSC)
BASF SE
Beihang University
Beijing Sankuai Online Technology Co, Ltd
Beijing University of Posts and Telecommunications
Boise State University
Boston University
Brookhaven National Lab
Brookhaven National Laboratory
Calcul Québec
Center for Applied Scientific Computing
Center for High Performance Computing (CHPC), South Africa
Center for Information Services and High Performance Computing (ZIH)
Centre Europeen de Calcul Atomique et Moleculaire (CECAM)
Centro Nacional de Alta Tecnologia (CENAT)
Cerebras Systems
Cerebras Systems, Inc.
Chalmers University of Technology, Sweden
Charité Universitätsmedizin Berlin
China Institute of Atomic Energy
China University of Petroleum-Beijing
Chinese Academy of Sciences
CHPC
CINECA
CINI - Consorzio Interuniversitario Nazionale per l’Informatica
CINI HPC-KTT lab
Cisco Systems
Cledar
Cloudrun Inc.
CODATA
Codeplay
Codeplay Software Ltd
Codeplay Software Ltd, UK
CodeRefinery
Colorado School of Mines
Columbia University
Computer Network Information Center, Chinese Academy of Sciences
Cornell University
Cranfield University
CSC
CSC IT Center for Science
CSC – IT Center for Science, Finland
CUNY Queens College & Graduate Center
CUNY Queens College and Graduate Center
CXL Consortium
DapuStor Corporation
Data Direct Networks
Dell Technologies Inc
Department of Computer Technology and Application, Qinghai University
Department of Mathematical and Statistical Sciences
DEVCOM Army Research Lab
DEVCOM US Army Research Lab
Digital Research Alliance of Canada
Discovery Partner Institute, University of Illinois Chicago
DoD High Performance Computing Modernization Program
DOE Office of Advanced Scientific Computing Research
Duke University
Edinburgh Parallel Computing Centre
Edinburgh Parallel Computing Centre (EPCC)
Electricité de France
Energy Efficient HPC Working Group
ENI, Italy
ENS Lyon
EPCC
EPCC, The University of Edinburgh
Erlangen National High Performance Computing Center
ETH Zurich
ETH Zürich
European Open File System Association (EOFS)
EVIDEN
ExxonMobil
FAU Erlangen-Nürnberg
Finnish IT Center for Science
Florida State University
Forschungszentrum Juelich
Forschungszentrum Jülich
Frederick National Laboratory for Cancer Research
Free University of Bozen-Bolzano
French Alternative Energies and Atomic Energy Commission (CEA)
French Institute for Research in Computer Science and Automation (INRIA)
Friedrich-Alexander University, Erlangen-Nuremberg
Friedrich-Alexander-Universität (FAU) Erlangen-Nürnberg
Fudan University
GDIT
GE Aerospace Research
Gem State Informatics Inc
GENCI
GENCI, France
General Electric
George Mason University
Georgia Institute of Technology
Georgia Tech Research Institute
Goethe University Frankfurt
Google
Google Cloud
Graphcore
Green Revolution Cooling
Green500
Groq
GSI Technology
GWDG
GWDG, Germany
Habana
Hangzhou Dianzi University
Hanyang University
Harbin Institute of Technology
Hartree Centre - STFC
Hartree Centre, Science and Technology Facilities Council (STFC), UK
Harvard University
Hawai‘i Data Science Institute
HDF Group
Hewlett Packard Enterprise
High Performance Computing Center Stuttgart
HPE
HQS Quantum Simulations
Huazhong University of Science & Technology
Huazhong University of Science and Technology
Huazhong University of Science and Technology (HUST)
Hunan Normal University
Hyperion Research
IBM
IBM Corporation
IBM T. J. Watson Research Center
IBM T.J. Watson Research Center
Icahn School of Medicine at Mount Sinai
ICM UW
ICT/CAS
ICT/CAS, UCAS
IFDC
Illinois Institute of Technology
Imperial College London
Indian Institute of Science, Bangalore
Indiana University
INESC-ID
INESC-ID, Instituto Superior Técnico, Universidade de Lisboa
INESC-ID, IST, University of Lisbon
InfoTrend
InfoTrend Inc
Innovative Computing Laboratory University of Tennessee
Innovative Computing Laboratory, University of Tennessee
Inria Rennes - Bretagne Atlantique Research Centre
Inspire Semiconductor
Institute of Chinese Academy of Sciences
Institute of Computing Technology
Institute of Computing Technology, Chinese Academy of Sciences
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Institute of Software, Chinese Academy of Sciences
Intel
Intel Corporation
IntelliProp Inc.
International School for Advanced Studies, Italy
Intersect360
Intersect360 Research
Interuniversity Microelectronics Centre (IMEC)
IQM
Irish Centre for High‑End Computing (ICHEC)
ISC Group
Istanbul University - Cerrahpaşa
IT4I
IT4Innovations
Johannes Gutenberg University Mainz
Johns Hopkins University
Juelich Supercomputing Centre
Jülich Supercomputing Centre
Jülich Supercomputing Centre (JSC)
Jülich Supercomputing Centre (JSC), Institute for Advanced Simulation
Karlsruhe Institute of Technology
Khronos Group Inc
King Abdullah University of Science & Technology
KISTI
Kitware Inc
Kitware, Inc
Knox College
Koç University
Koç University, Turkey
KTH Royal Institute of Technology
KTH Royal Institute of Technology, Sweden
Lawrence Berkeley National Laboratory
Lawrence Berkeley National Laboratory (LBNL)
Lawrence Livermore National Laboratory
Lawrence Livermore National Laboratory (LLNL)
Lawrence Livermore National Laboratory (LLNL), Energy Efficient High Performance Computing Working Group EEHPCWG)
Leibniz Supercomputing Centre
Leibniz Supercomputing Centre (LRZ)
Leiden Institute of Advanced Computer Science (LIACS)
Lenovo
Lifeboat LLC
LINKS Foundation Inc
Los Alamos National Laboratory
Los Alamos National Laboratory (LANL)
Loyola University Chicago
Maricopa Agricultural Center
Marquette University
Massachusetts Green High Performance Computing Center (MGHPCC)
Massachusetts Institute of Technology (MIT)
Massachusetts Institute of Technology (MIT) Lincoln Laboratory
Mathematical Institute University of Oxford
Max Planck Computing and Data Facility
McMaster University
MEGWARE Computer
MemVerge Inc
Meta
Meta AI
MGHPCC
Micron technology
Micron Technology Inc
Microsoft Corporation
Mila – Quebec AI Institute
MIT
MIT Lincoln Laboratory
MIT LL
MITRE
MITRE Corporation
Multicore World
NASA
NASA Ames Research Center
NASA Center for Climate Simulation
NASA Goddard Space Flight Center
National University of Defense Technology
National Cancer Institute
National Center for Atmospheric Research
National Center for Atmospheric Research (NCAR)
National Center for Materials Service Safety, University of Science and Technology Beijing
National Center for Supercomputing Applications (NCSA)
National Energy Research Scientific Computing Center (NERSC)
National Institute for Research in Digital Science and Technology
National Institute of Advanced Industrial Science & Technology
National Institute of Standards and Technology (NIST)
National Nuclear Security Administration
National Oceanic and Atmospheric Administration (NOAA)
National Renewable Energy Laboratory
National Renewable Energy Laboratory (NREL)
National Research Center of Parallel Computer Engineering & Technology, China
National Research Center of Parallel Computer Engineering and Technology
National Research Center of Parallel Computer Engineering and Technology, China
National Science Foundation
National Science Foundation (NSF)
National Supercomputer Center in Tianjin
National Supercomputing Center in Wuxi
National Supercomputing Center in Wuxi, China
National University of Defence Technology
National University of Defense Technology (NUDT), China
National University of Singapore
NERSC at LBNL (Lawrence Berkeley National Laboratory)
Netherlands eScience Center
New York University (NYU)
NextSilicon
NIOVA
NIST
NOAA
NOAA N-Wave
NOAA/N-Wave
Northeastern University
Northeastern University, Boston
Northwestern University
np-complete, S.r.l.
np-complete, S.r.l. a socio unico
NUDT
NVIDIA Corp
Nvidia Corporation
Oak Ridge National Lab
Oak Ridge National Laboratory
Oak Ridge National Laboratory (ORNL)
Oak Ridge National Laboratory, University of Manchester
OctoML
Oden Institute for Computational Engineering and Sciences, The University of Texas at Austin
Ohio State University
Ohio Supercomputer Center
Open Parallel Ltd
Open Storage Network
OpenMP ARB
OpenSFS
OpenStack Scientific SIG
Oracle
ORNL
Osaka University, Osaka, Japan
Pacific Northwest National Lab
Pacific Northwest National Laboratory
Pacific Northwest National Laboratory (PNNL)
Paderborn University
ParaTools, Inc.
ParTec AG
Partnership for Advanced Computing in Europe (PRACE)
Pawsey
Pawsey Supercomputing Centre
Penguin Solutions
Pennsylvania State University
Pittsburgh Supercomputing Center (PSC)
PNNL
Politecnico di Torino
PRACE
PRACE aisbl
Predictive Science Inc.
Purdue University
Purdue University Rosen Center for Advanced Computing
Qinghai University
Qualcomm
Quantum Machines
Queen's University, Canada
QuEra
Raytheon Technologies
Renaissance Computing Institute
Renaissance Computing Institute (RENCI)
Rice University
RIKEN
RIKEN Center for Computational Science (R-CCS)
RIKEN R-CCS
Riken R-CCS
Rutgers University
Rutherford Appleton Laboratory, Science and Technology Facilities Council (STFC)
RWTH Aachen University
Sabancı University
SambaNova Systems
San Diego Supercomputer Center
San Diego Supercomputer Center (SDSC)
San Diego Supercomputer Center, UC San Diego
San Francisco State University
Sandia National Laboratories
Sandia National Laboratories, USA
SANO
Sano Centre for Computational Medicine
São Paulo Research Agency (FAPESP)
Scapos
scapos AG
SchedMD
SchedMD LLC
Schneider Electric
School of Computer Science, Beijing University of Posts and Telecommunications
School of Computing at the University of Leeds
School of Computing, University of Leeds, United Kingdom
Science and Technology Facilities Council (STFC), UK
Seagate Research
Shandong University
Shandong University, China
Shanghai Jiao Tong University
Shell
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Shenzhen University
Siemens
Simula Research Laboratory
Siranga Sp. z o.o.
SLAC National Accelerator Laboratory
Slippery Rock University of Pennsylvania
Software Sustainability Institute
Southern University of Science and Technology, China
St. John’s University, MN.
StackHPC Ltd
Stanford University
State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences
Stevens Institute of Technology
STFC DiRAC HPC Facility
Stony Brook University
Stoppels Consulting
Strangeworks
Submer
Sungkyunkwan University
Swiss National Supercomputing Centre
Swiss National Supercomputing Centre (CSCS)
Tactical Computing Laboratories
Tactical Computing Laboratories LLC
Tactical Computing Laboratories, LLC
Tactical Computing Labs
TAE Technologies
TAE Technologies Inc
Taiyuan University of Technology
TCL Eagle Lab
Technical University Darmstadt
Technical University Dresden
Technical University Ilmenau
Technical University Munich
Technical University Munich, Computer Architecture and Parallel Systems
Technical University of Berlin
Technical University of Denmark
Technische Universität Dresden
Texas
Texas A&M University
Texas Advanced Computing Center (TACC)
Texas State University
The Digital Research Alliance of Canada
The Ohio State University
The University of Tokyo
the University of Toronto
Tianjin University
TMGcore
Tokyo Institute of Technology
TOP500
TOP500, Green500
Tsinghua University
Tsinghua University, China
TU Berlin
U. Chicago
U.S. Department of Energy
UC San Diego
UK Met Office
UK Research and Innovation
Unaffiliated
Univ. Tennessee Knoxville
Universidad Industrial de Santander
Universidade da Coruña
Universidade Federal do Rio Grande do Sul
Universitat Paderborn
Universität Stuttgart
Université Grenoble Alpes
Université Laval
University at Buffalo
University Carlos III of Madrid, Spain
University College of London
University of A Coruña
University of Alabama
University of Alabama at Birmingham
University of Arizona
University of Basel
University of Bergen
University of Bergen, Norway
University of Birmingham
University of Bologna
University of Bonn
University of Bristol
University of British Columbia
University of Buffalo
University of California at Riverside
University of California Los Angeles
University of California Los Angeles (UCLA)
University of California, Berkeley
University of California, Davis
University of California, Irvine
University of California, Los Angeles (UCLA)
University of California, Merced
University of Cambridge
University of Chicago
University of Chinese Academy of Sciences
University of Chinese Academy of Sciences (ICT/CAS)
University of Chinese Academy of Sciences, Beijing, China
University of Colorado
University of Colorado (retired)
University of Colorado, Denver
University of Connecticut
University of Delaware
University of Edinburgh
University of Florida
University of Geneva
University of Göttingen
University of Groningen
University of Hagen
University of Hamburg
University of Hawaii at Manoa
University of Heidelberg
University of Illinois
University of Illinois at Urbana-Champaign
University of Illinois Urbana-Champaign
University of Iowa
University of Kassel
University of Kentucky
University of Klagenfurt, Austria
University of Leiden
University of Macau
University of Maryland
University of Maryland, Laboratory for Physical Sciences (LPS)
University of Miami
University of Michigan
University of Montreal
University of New Mexico
University of North Carolina at Charlotte
University of North Carolina, Charlotte
University of North Florida
University of North Texas
University of Oregon
University of Pittsburgh
University of Rennes
University of Rochester
University of Salerno
University of Science and Technology of China
University of South Carolina
University of Southern California
University of Southern California (USC)
University of Stuttgart
University of Sydney
University of Tennessee
University of Tennessee Knoxville
University of Tennessee, Chattanooga
University of Tennessee, Global Computing Laboratory
University of Tennessee, Knoxville
University of Tennessee, Innovative Computing Laboratory (ICL)
University of Texas
University of Texas at Austin
University of Texas at El Paso
University of Texas at San Antonio
University of Texas, Oden Institute
University of Tokyo
University of Torino
University of Toronto
University of Trento
University of Tsukuba
University of Utah
University of Utah, Scientific Computing and Imaging Institute (SCI)
University of Victoria
University of Vienna
University of Virginia
University of Washington
University of Waterloo
University of Western Australia
University of Wisconsin
University of Wisconsin, Madison
University of Wisconsin-Madison
University of Wroclaw
University of Wyoming
University of York
US Army Engineer Research and Development Center (ERDC)
US Department of Defense
US Department of Energy
US Drug Enforcement Administration
USDA-ARS SCINet Program
UTK/ICL
Virginia Tech
Wake Forest University
Warsaw University of Technology
Washington State University Vancouver
WeBank
WeBank Blockchain Team
Westphalian University of Applied Sciences
Whamcloud Inc
Whamcloud, inc.
Whamcloud/DDN
Wolley Inc.
X-ScaleSolutions
Xilinx Research
Xi’an Jiaotong-Liverpool University
Xscape Photonics
Zhejiang Lab
Zhejiang University
Contributors
Ondřej Čertík
Shreyas A Kudari
Omar Aaziz
Sameh Abdulah
Raneem Abu Yosef
Nael Abu-Ghazaleh
Ankit Agrawal
Nesreen Ahmed
James P. Ahrens
Alex Aiken
Burak Aksar
Ghadeer Alabandi
Sadaf R. Alam
Marco Aldinucci
Vassil Alexandrov
Evguenia Alexandrova
Yuri Alexeev
Gabrielle Allen
Tyler Allen
Nawras Alnaasan
Gustavo Alonso
Ilkay Altintas
Aaron Andersen
Bill Anderson
Diego Andrade
Katerina Antypas
Hartwig Anzt
Thomas Applencourt
Carlos Arango Gutierrez
Bill Arndt
Ishank Arora
Richard Arthur
Ali Asgari Khoshouyeh
Scott Atchley
Danny Auble
Olivier Aumage
Axel Auweter
Noushin Azami
Ann Backhaus
CAPT Joseph Baczkowski
Frank Baetke
Sharmistha Bagchi-Sen
Xingqiang Bai
Jason Bakos
Pavan Balaji
Ilya Baldin
Riccardo Balin
Peter Balogh
Sahan Bandara
Purushotham Bangalore
Deborah Bard
Kevin J. Barker
João Barreto
Carlos Jaime Barrios Hernandez
Rohan Basu Roy
Natalie Bates
Simon Batzner
Michael Bauer
Matthew E. Baughman
Paul Bauman
Harun Bayraktar
Tom Beck
Gregory B. Becker
Peter Beckman
Mehmet E Belviranli
Tal Ben-Nun
John Bent
Florian Berberich
Randy Berger
Keren Bergman
David E. Bernholdt
Jean-Yves Berthou
Maciej Besta
E. Wes Bethel
Ron Bewtra
Wahid Bhimji
Sanjukta Bhowmick
LK Bhupathi
Matt Bidwell
Marcin Bienkowski
Amanda J. Bienz
Elliott Biondo
George Biros
John Blaas
Nils Blach
Johannes P Blaschke
Brett M. Bode
David Boehme
Serge Bogaerts
Taisuke Boku
Carlos Boneti
Chris Bording
Andrea Borghesi
Lynn Borkon
George Bosilca
Ramin Bostanabad
Brendan Bouffler
Norman J Bourassa
Aurelien Bouteiller
Kurtis Bowman
Ebru Bozdag
Pete Bradley
Jim Brandt
Wesley Brewer
Marcel Breyer
Patrick Bridges
Ron Brightwell
Michael Brim
André Brinkmann
Gonzalo Brito Gadeschi
Eleanor Broadway
James C. Brodman
Sharon Broude Geva
Benjamin Brown
Nick Brown
Timothy Brown
Lingkun Bu
Reuben D. Budiardja
Muhammed Fatih Bulut
Aydın Buluç
David P. Bunde
Robert Bunger
Luk Burchard
Rod Burns
Martin Burtscher
Anastasiia Butko
Richarda Butler
Dmytro Bykov
Suren Byna
Di Cai
Xing Cai
Silvina Caino-Lores
Jesse Caldwell
Scott Callaghan
Paul Calleja
Alexandru Calotoiu
Richard Shane Canon
Bob Cantwell
Ronald Caplan
Franck Cappello
Armon Carigiet
Lorenzo Carpentieri
Jesus Carretero
Laura Carriere
Tiernan Casey
Vito Giovanni Castellana
Roberto Castro
Robert Alexander Caulk
Daniele Cesarini
Luis Ceze
Stuart Chalk
Alan Chalker
Noel Chalmers
Steve Chan
Aparna Chandramowlishwaran
Sunita Chandrasekaran
Barbara Chapman
Kyle Chard
Chen-Chun Chen
Dexun Chen
Fan Chen
Gang Chen
Jacqueline Chen
Jieyang Chen
Qiqian Chen
Quan Chen
Shiyang Chen
Tiancheng Chen
Wenguang Chen
Wenyan Chen
Xiang Chen
Xiaofei Chen
Yaojian Chen
Yiran Chen
Yujie Chen
Shenggan Cheng
Wen Cheng
Kazem Cheshmi
Xuebin Chi
Andrew Chien
Wojciech Chlapek
Wonil Choi
Alok Choudhary
Marcin Chrapek
Steven Carter Christopher
Neil Chue Hong
Bob Ciotti
Emily Clark
Thomas L Clune
Albert Cohen
Jeremy Cohen
Yonatan Cohen
Jaime Combariza
Gene Cooperman
Biagio Cosenza
Ayse Kivilcim Coskun
Cameron Cross
Huimin Cui
Jiahuan Cui
Massimiliano Culpo
Milan Curcic
Nicholas Curtis
William Cutler
Maciej Cytowski
Marco D'Antonio
Victoria Da Poian
Tamara L. Dahlgren
Dong Dai
Donglai Dai
Yi Dai
John Daly
Anthony Danalis
Sambit Das
Sajal Dash
John Davis
Marcus Day
Johannes de Fine Licht
Bert de Jong
Marco De La Pierre
Bronis R. de Supinski
Tom Deakin
Xavier Delaruelle
Robert L. DeLeon
Philippe DENIEL
Larry Dennison
Chris DePrater
Sean Dettrick
Aditya Devarakonda
Sheng Di
Andreas Dilger
Caiwen Ding
Jianru Ding
Nan Ding
Qiyang Ding
Johannes Doerfert
Jens Domke
Dezun Dong
Jack Dongarra
Matthew G. F. Dosanjh
Erik Draeger
Yu Du
Xiaohui Duan
Nicolas Dube
Jeremy Duckworth
Ann Dunkin
Dmitry Duplyakin
Emre Düzakın
David Eder
Manuel Egele
Berke Egeli
Thomas Eickermann
Markus Eisenbach
Nasir Eisty
Jorge Ejarque
Sinan Ekmekçibaşı
Irfan Elahi
Huseyin M. Elibol
Sally Ellingson
J. Austin Ellis
Hatem Elshazly
Murali Emani
Christian Engelmann
John-David Enright
Nicolás Erdödy
Lucas Esclapez
Thomas Evans
Matthew A Ezell
Alex Fallin
Kaijie Fan
Yi Fan
Yong Fan
Alessandro Fanfarillo
Bo Fang
Jianbin Fang
Jun Fang
Steven Farrell
Jean Favre
Arthur Feeney
Dan Feng
Wang Feng
Wu Feng
Xiaobing Feng
Yangde Feng
John Feo
Charles Ferenbaugh
Fernando Fernandes dos Santos
Rafael Ferreira da Silva
Nicola Ferrier
Federico Ficarelli
Gabe Fierro
Romain Fihue
Weronika Filinger
Zane Fink
Hal Finkel
Jesun Firoz
Jeremy Fischer
Marc Fischer
Paul Fischer
Barton Fiske
Claudia Fohry
Sam Foreman
Félix-Antoine Fortin
Ian Foster
Geoffrey Fox
Basilio B. Fraguela
Olivier Franza
Chip Freitag
Nicholas Frontiere
Haohuan Fu
Kaihua Fu
Xu Fu
Yuhang Fu
David Fuchssteiner
Kasimir Gabert
Vijay Gadepally
Ana Gainaru
Todd Gamblin
Lin Gan
Harshitta Gandhi
Auroop Ganguly
Jie Gao
Ping Gao
Yingxiang Gao
Yue Gao
Simon Garcia De Gonzalo
Arti Garg
Michael Garland
Vikram Gavini
Chio Ge
Markus Geimer
Florian Geissler
Al Geist
Tong Geng
Xiaohan Geng
Ann Gentile
Giorgis Georgakoudis
Anjus George
Antigoni Georgiadou
Anja Gerbes
Balazs Gerofi
Robert Gerstenberger
Bill Gervasi
Sandra Gesing
Mahdieh Ghazimirsaeed
Lukas Gianinazzi
Tom Gibbs
Maike Gilliot
Thomas Gillis
Sergi Girona
Joseph Glenski
William Godoy
Jan Goetz
Maya Gokhale
Harrison Goldwyn
Elsa J. Gonsiorowski
Andres Gonzalez
Lev Gorenstein
Wyatt Gorman
Andy Gothard
Kalyana Gottiparthi
John Gounley
Richard Graham
David Grant
Ryan Grant
Gretchen Greene
Philipp Grete
James Griffioen
William D. Gropp
Pascal Grosset
Pat Grubel
Thomas Gruber
Nalinrat Guba
Juan David Guerrero Balaguera
Giulia Guidi
Thomas Gulbransen
Haryadi S. Gunawi
Anqi Guo
Chu Guo
Luanzheng Guo
Minyi Guo
Rui Guo
Yanfei Guo
Yang Guo
Zhuoqiang Guo
Brajesh Gupt
Rinku Gupta
Bilel Hadri
Georg Hager
Pouya Haghi
Mahantesh Halappanavar
Steven Hamilton
Simon Hammond
David Hancock
Sean Hanlon
Jeff Hanson
Zixu Hao
Kevin Harms
Peter Harrington
Cyrus Harrison
Rebecca Hartman-Baker
Christine Harvey
Valérie Hayot-Sasson
Shuibing He
Yun (Helen) He
Zhengyang He
Zhenhao He
Zhongqiu He
Esa Heiskanen
Stijn Heldens
Michael Hennecke
Marc Henry de Frahan
Alexandra Henzinger
Thomas Herault
Martin C. Herbordt
Oscar Hernandez
Benjamin Hernández
Michael A. Heroux
Anna Herr
Andreas Herten
Simon Hettrick
Elisa Heymann
Dean Hildebrand
Frances C Hill
Judith C. Hill
Conrad Hillairet
Torsten Hoefler
Henry Hoffmann
Petrina Hollingsworth
John Holmen
Yuxi Hong
Hans-Christian Hoppe
Kaiyuan Hou
Markus Hrywniak
Yun Hu
Zhengding Hu
Mengyuan Hua
Chengying Huan
Chun Huang
Haitao Huang
Lei Huang
Yafan Huang
Yunxin Huang
Zanhua Huang
Kevin Huck
David Hudak
Nathaniel Hudson
Axel Huebl
John Huffman
Clayton Hughes
Maxime Hugues
Travis Humble
Edward Hutter
Jinho Hwang
Huda Ibeid
Shadi Ibrahim
Patrick Iff
Shuichi Ihara
Aleksandar Ilic
Thomas Ilsche
Nicholas Ilyadis
Neena Imam
Frank Indiviglio
Mikhail Isaev
Sergio Iserte
Kamil Iskra
Abdullah Al Raqibul Islam
Andrei Ivanov
Christiane Jablonowski
Doug Jacobsen
Daniel Jacobson
Mathias Jacquelin
Sammy Jaeger
Vijay Janapa Reddi
Gustav R. Jansen
Niclas Jansson
Michael Jantz
Stephen Jarvis
Emmanuel Jeannot
Tingwei Ji
Yuede Ji
Qilong Jia
Weile Jia
Xianyan Jia
Wenqi Jiang
Sian Jin
Zhou Jin
Anders Johansson
Graham Johnson
Bryan Johnston
Terry Jones
Wayne Joubert
Xiting Ju
Yi Ju
Wonyeong Jung
Mozhgan Kabiri Chimeh
Ryan Kabrick
Yuhong Kan
Raghavendra Kanakagiri
Kaushik Kandadi Suresh
Mahmut Kandemir
Rajgopal Kannan
Venkatesh Kannan
Bikash Kanungo
Vasileios Karakasis
Sven Karlsson
Martin Karp
Edward Karrels
George Karypis
Karthik Kashinath
Daniel S. Katz
Kamer Kaya
Engin Kayraklioglu
Alison Kennedy
Ronan Keryell
Gokcen Kestor
David E. Keyes
Soheil Khadirsharbiyani
Mikhail Khalilov
Dongwhee Kim
Jungrae Kim
Mariam Kiran
Christine Kirkpatrick
Fredrik Kjolstad
Scott Klasky
Michael Klemm
Fabio Kon
Martin Kong
Alice Koniges
Seid Koric
Anton Korzh
Tevfik Kosar
Kimmo Koski
Anthony Kougkas
Nicholson K Koukpaizan
Patricia Kovatch
Boris Kozinsky
Quincey Koziol
Dieter A. Kranzlmueller
Jiri Kraus
Jeff Kuehn
Brian Kulis
Nalini Kumar
Julian Kunkel
Thorsten Kurth
Jakub Kurzak
Karsten Kutzer
JaeHyuk Kwack
Grzegorz Kwasniewski
William Ladd
Pierre-Axel Lagadec
Ignacio Laguna
Siyao Lai
Yu-Hsiang Lan
Eric Lancon
Julia Lane
John Lange
Johannes Langguth
Julien Langou
Jeff M. Larkin
Robert Latham
Scott Lathrop
Jonas Latt
Erwin Laure
Richard Lawrence
Hyungro Lee
Jaeyoon Lee
Seyong Lee
Wonchan Lee
Taylor Lee-Patti
Mark Leggot
Kelun Lei
John Leidel
Veli-Antti Leinonen
Kurt Lender
Vitus J. Leung
Scott Levy
Daniele Lezzi
Ang Li
Baolin Li
Chao Li
Dong Li
Fangying Li
Guanpeng Li
Hai Li
Huizhong Li
Jianxiong Li
Juan Li
Jun Li
Mingyi Li
Mingzhen Li
Shengguo Li
Sherry Li
Shigang Li
Shunde Li
Wenhao Li
Wenlin Li
Xi Li
Xipeng Li
Xu Li
Yiling Li
Yiming Li
Yiyuan Li
Yong Li
Zitong Li
Xin Liang
Komorebi Liao
Wei-keng Liao
Justin Lietz
Seung-Hwan Lim
Kehao Lin
Rongfen Lin
Wei Lin
Volker Lindenstruth
Peter Lindstrom
John Linford
Chuan Liu
Chun-Yi Liu
Fang Liu
Hang Liu
Hanyue Liu
Honggao Liu
Jiahui Liu
Jie Liu
Lijun Liu
Sha Liu
Weifeng Liu
Weiguo Liu
Xiaohui Liu
Xin Liu
Xiyang Liu
Xu Liu
Yang Liu
Yi Liu
Ying Liu
Yiqian Liu
Zhao Liu
Ziming Liu
Glenn K. Lockwood
Jay Lofstead
Bruce Loftis
Gabriel Loh
Bill Long
Yingchi Long
Guy Lonsdale
Daniela Loreti
Mike Lowe
Hatem Ltaief
Tao Lu
Xiaomin Lu
Yuechen Lu
Zhongzhi Luan
Jakob Luettgau
Zarija Lukić
Shirui Luo
Piotr Luszczek
XiaoJing Lv
Dmitry Lyakh
Hui Ma
Julie Ma
Ming Ma
Arthur Maccabe
Tommaso Macrì
Bill Magro
Samreen T. Mahmud
Nicholas Malaya
Anirban Mandal
Joseph Manzano
Jiajun Mao
Daniel Margala
Stefano Markidis
Georgios Markomanolis
Andres Marquez
Nicole Marsaglia
Michael Marthaler
Aristotle Martin
David Martin
Steven Martin
Maxime Martinasso
David J. Martinez
Daniel Martinez-Gonzalez
Satoshi Matsuoka
Timothy Mattson
Patrick S. McCormick
Nic McDonald
Marshall McDonnell
Damon McDougall
Lois Curfman McInnes
Simon McIntosh-Smith
Kim H McMahon
Maryam Mehri Dehnavi
Neil Mehta
Verónica G. Melesse Vergara
Pete Mendygral
Esteban Meneses
Harshitha Menon
Anu Mercian
Elia Merzari
Bronson Messer
Peter Messmer
Martin Meuer
Lucas Thibaut Meyer
Marek Michalewicz
Martial Michel
George Michelogiannakis
Petro Junior Milan
Lauren Milechin
Barton Miller
Ross Miller
Steven Miller
William L. Miller
Daniel J Milroy
Misun Min
Marco Minutoli
Dmitry Mishin
Georgy Mitenkov
Nan Mo
Zizhao Mo
Bernd Mohr
Jose Manuel Monsalve Diaz
Shirley Moore
Stan Moore
José Moreira
Vitali Morozov
Karla Vanessa Morris Wright
Phani Motamarri
Irene Moulitsas
Timofey Mukha
Julie Mullen
Hausi Muller
Paul Mullowney
Miranda Mundt
Richard Murphy
Ruslan Murtazin
Albert Musaelian
Aaron Myers
Andrew Myers
Jürgen Müller
Ambarish Nag
Vijay Nain
Prashant Nair
Yuji Nakatsukasa
Prineha Narang
Akira Naruse
Philippe Olivier Navaux
Sarah M. Neuwirth
CJ Newburn
Phuong Nguyen
Stephen Nichols
Bogdan Nicolae
Ningming Nie
Christoph Niethammer
Hubert Niewiadomski
Zulkar Nine
Israt Nisa
Hendrik Nolte
Matthew Norman
Douglas Norton
Paul Nowoczynski
Kenneth O'Brien
Alan O'Cais
Andrew Ochoa
Lena Oden
Sven Oehme
Daniel Olds
Serkay Olmez
Sarp Oral
Alessandro Orso
Tim Osborne
George Ostrouchov
Michael Ott
Scott Pakin
Rachel Palumbo
Zhe Pan
Dhabaleswar K. (DK) Panda
Yunfei Pang
Gourab Panigrahi
Manolis Papadakis
Tom Papatheodore
Manish Parashar
Konstantinos Parasyris
Ojas Parekh
EunJung (EJ) Park
Scott J. Parker
Valerio Pascucci
Tirthak Patel
Tapasya Patki
Karthik Pattabiraman
Michael Paulitsch
J. Gregory Pauloski
Roger Pearce
Massoud Pedram
Kevin Pedretti
Sean Peisert
Ivy Peng
Jintao Peng
John Pennycook
Adalberto Perez
Danny Perez
Miquel Pericàs
Alexis Perry-Holby
Karina Pesatova
Antonio J. Peña
Dirk Pflüger
Malachi Phillips
Jacques Pienaar
Anna Pietarila Graham
Dirk Pleiter
Christian Plessl
Norbert Podhorszki
Michal Podstawski
Phillip Pokorny
Theresa Pollinger
Steve Poole
Daniel Pope
Elena Pourmal
Marissa Powers
Heidi Poxon
Sushil K. Prasad
Viktor K. Prasanna
Benjamin Priest
Radu Prodan
Carlos Puchol
Daniel F. Puleri
Jesus Pulido
Satish Puri
Szilárd Páll
Mathias Pütz
Apan Qasem
Depei Qian
Simeng Qian
Yingjin Qian
Yong Qin
Wenyu Qu
Syed Qutub
Santosh Radha
Ana Radovanovic
Ken Raffenetti
Bruno Raffin
David Rager
Vivek Raghunathan
Emily Rakestraw
Bharath Ramesh
Rajdeep Rana
Amanda Randles
Esteban Rangel
Aditya Ranjan
Garrett Wilson Ransom
Nageswara S. Rao
Siddhisanket Raskar
Katherine Rasmussen
Thilina Rathnayake
Matteo Ravasi
Paolo Rech
Daniel Reed
Daniel A. Reed
James Reinders
Theodoros Rekatsinas
Shiru Ren
Pawel Renc
Cedric Renggli
Albert Reuther
Tahsin Reza
Duncan Riach
Alejandro Ribes
Brad Richardson
Moe Richert
Eleanor Rieffel
Irina Rish
Silvio Rizzi
Yves Robert
Alan Robertson
Dana E Robinson
Channa Rock
Josie Esteban Rodriguez Condia
David M. Rogers
James H. Rogers
Paul Romano
Melissa Romanus
Joshua Romero
Jon Rood
Caitlin Ross
Robert B. Ross
Philip C. Roth
Atanas Rountev
Damian Rouson
Sayan Roychowdhury
Katherine Royston
Cindy Rubio-González
Amit Ruhela
Hoon Ryu
Thiago S. F. X. Teixeira
Charlie Sabino
Ponnuswamy Sadayappan
Seppo Sahrakorpi
Phil Sakievich
Siddharth Samsi
William Sands
Sergiu Sanielevici
Kentaro Sano
Piyush Sao
Heather Savoy
Steve Scargall
Philipp Schaad
Florian Scheidl
Gabin Schieffer
Philipp Schlatter
Stefan Schmid
Perry Schmidt
Evan Schneider
Timo Schneider
William Schonbein
Marc Schouler
Martin Schreiber
Karl Schulz
Laura Schulz
Martin Schulz
Joerg Schumacher
Catherine Schuman
Benjamin Schwaller
Anita Schwartz
Nicholas Schwarz
Alessio Sclocco
Thomas R. W. Scogland
Sudip Seal
Robert Sears
Ada Sedova
Seetharami Seelam
Philippe Segers
Efe Sencan
Jarim Seo
Harald Servat
Peter Seto
Jean M. Sexton
Igor Sfiligoi
Aamir Shafi
Gilad Shaier
John Shalf
Pavel Shamis
Honghui Shang
Mallikarjun (Arjun) Shankar
Sanjif Shanmugavelu
Andrew Shao
Weiqi Shen
Sameer Shende
Jiuchen Shi
Junda Shi
Peng Shi
Runbin Shi
Shunchen Shi
Shupeng Shi
Tianhui Shi
Xiang Shi
Yumeng Shi
Zhan Shi
Shumpei Shiina
Woong Shin
Kathleen Shoga
Fumiyoshi Shoji
Tong Shu
Chaoyang Shui
Laura Shultz
Sergey Shumarayev
David Sickinger
Eva Siegmann
Daniel Silver
Christopher Simmons
Horst Simon
Robert S. Sinkovits
John Sirevicius
Happy Sithole
Ganesh Sivaraman
Anthony Skjellum
Elliott Slaughter
Preston Smith
Sean Smith
Winona Snapp-Childs
Shane Snyder
Avinash Sodani
Edgar Solomonik
Carol Song
Chaobo Song
Linghao Song
Shuaiwen Leon Song
Xiang Song
Xiaoyu Song
Zeyu Song
Matteo Sonza Reorda
Bob Sorensen
Robert Speck
Filippo Spiga
Tracy Spitler
Jeffrey M. Squyres
Sarat Sreepathi
Ashok Srinivasan
Tom St. John
Eric Stahlberg
Dan C Stanzione
Stefan Kerkemeier
Trevor Steil
Eric Stephan
Laurie A. Stephey
Sebastian Stern
Rick Stevens
Adam J. Stewart
Vladimir Stojanovic
Harmen Stoppels
Quentin F. Stout
Fred Streitz
Magnus Strengert
Erich Strohmaier
Michelle Strout
Joe Stubbs
Brian Stucky
Estela Suarez
Shashank Subramanian
Vishal Subramanian
Hari Subramoni
Joshua Suetterlein
Nitin Sukhija
Dalal Sukkari
Sreenivas Sukumar
Michael B. Sullivan
Biao Sun
Qiang Sun
Xian-He Sun
Yi Sun
Zibin Sun
Frédéric Suter
Paolo Sylos Labini
László Szűcs
Kalman Szenes
Ryousei Takano
Nathan Tallent
Guangming Tan
Jian Tan
Cyrus Tanade
Houjun Tang
Meng Tang
Tao Tang
Dingwen Tao
Guocheng Tao
Mahidhar Tatineni
Michela Taufer
Kenjiro Taura
Stig Telfer
Keita Teranishi
Christian Terboven
Olivier Terzo
Francois Tessier
Gautam Thakur
Rajeev Thakur
William W. Thigpen
George K. Thiruvathukal
Jeyan Thiyagalingam
Haodong Tian
Jiannan Tian
Devesh Tiwari
Karen Tomko
Tugba Torun
Georgia Tourassi
Andrea Townsend-Nicholson
Mike Townsley
Leon Tran
James D. Trotter
Miwako Tsuji
Alexander Tsyplikhin
Paul Tucker
Henry Tufo
Antonino Tumeo
Gabe Turner
Didem Unat
Robert R. Underwood
Pedro Valero-Lara
Alexander Van Craen
Ruud van der Pas
Ben van Werkhoven
Avery VanAusdal
Koyickal Roy Varghese
Dilip Vasudevan
Matthew Vaughn
Jean-Luc Vay
Marc-André Vef
Flavio Vella
Shivaram Venkataraman
Lydia Vermeyden
Jeffrey S. Vetter
Tom Vierjahn
Bapi Vinnakota
Venkatram Vishwanath
Richard Vuduc
Jarosław Wąs
Jacob Wahlgren
Genna Waldvogel
Curt Wallace
Wubing Wan
Daniel Wang
Daoce Wang
Fang Wang
Fei Wang
Feiyi Wang
Haojie Wang
Jin Wang
Jue Wang
Kangyu Wang
Meng Wang
Pengyu Wang
Rui Wang
Tengcheng Wang
Wenqiang Wang
Xuan Wang
Yangang Wang
Yinuo Wang
Yinzhi Wang
Yong Wang
Yusong Wang
Zhan Wang
Zhang Wang
Zhen Wang
Zheng Wang
Zijia Wang
Zongguo Wang
Tim Warburton
Logan Ward
Nick Ward
Greg Watson
Josef Weidendorfer
Michèle Weiland
Tino Weinkauf
Brent Welch
Gerhard Wellein
Jack Wells
Bert Wesarg
Corey Wetterer-Nelson
James B. White III
Tim Wickberg
Mark I. Wilkinson
Floris-Jan Willemsen
Samuel W. Williams
Leighton Wilson
Samantha Wittke
J. Lowell Wofford
Felix Wolf
Noah Wolfe
Michael Wong
Mike Woodacre
Justin Wozniak
Steven A. Wright
Chunshu Wu
Fei Wu
Yangjun Wu
Yutong Wu
Zhikun Wu
Franck Wuerthwein
Frank Wuerthwein
Brian J. N. Wylie
Shengye Xiang
Junmin Xiao
Wencong Xiao
Yi Xiao
Bing Xie
Fangfang Xie
Min Xie
Zhikuang Xin
Cheng-Zhong Xu
Huanle Xu
Hui Xu
Kevin Xu
Wei Xu
Wei Xue
Rohan Yadav
Keiji Yamamoto
Limin Yan
Yan Yan
Bin Yang
Bo Yang
Guangwen Yang
Hailong Yang
Haoyu Yang
Lishan Yang
Weiling Yang
Yafei Yang
Yanan Yang
Ying Yang
Yuzhuo Yang
Jienan Yao
Jinghan Yao
Kehan Yao
Kejiang Ye
Pui Kuen Yeung
Enxin Yi
Junqi Yin
Wanwang Yin
Zekun Yin
Rio Yokota
Kazutomo Yoshii
Xin You
Yang You
Jeff Young
Andrew Younge
Xiaodong Yu
Guojun Yuan
Xinhui Yuan
Lingfang Zeng
Yan Zeng
Jidong Zhai
Mingshu Zhai
Bingbin Zhang
Claire Zhang
Jifa Zhang
Jingrong Zhang
Jingwen Zhang
Peng Zhang
Pengmiao Zhang
Shuai Zhang
Wang Zhang
Wei Zhang
Xuechen Zhang
Yuyang Zhang
Zhao Zhang
Zhenguo Zhang
Zhongcheng Zhang
Hanyu Zhao
Jianqi Zhao
Kai Zhao
Laiping Zhao
Max Xiaohang Zhao
Tong Zhao
Da Zheng
Pengfei Zheng
Weimin Zheng
Yao Zheng
Yu Zhong
Amelie Chi Zhou
Chunbao Zhou
Haotian Zhou
Hongkuan Zhou
Hui Zhou
Pengyu Zhou
Yongxiao Zhou
You Zhou
Yu Zhu
Sean Ziegeler
Henk-Jan Zilverberg
Christopher Zimmer
Paul Zimmerman
Sammy Zimmerman
Alexandros Nikolaos Ziogas
Sagi Zisman
Presentations
Workshop
10th Annual International Workshop on Innovating the Network for Data Intensive Science (INDIS) Final
Description
Networks for data-intensive science have more extreme requirements than general-purpose networks. These requirements not only closely impact the design of processor interconnects in supercomputers and cluster computers, but also campus networks, regional networks, and national backbone networks. Developments in network technologies are tremendous, enabling a fundamentally different approach to integrating networks into supercomputing applications.
This workshop encourages research papers that address one or more of these networking needs, as well as developments that are essential to the information systems infrastructure for the scientific discovery process.
This workshop will also serve as a platform for participants in Network Research Exhibitions and SCinet to present experimental papers on their latest applications, designs and solutions. SCinet is the high-speed network engine of the SC conference. The show floor network connects to many laboratories and universities worldwide using high-bandwidth connections.
Workshop
13th International Workshop on Runtime and Operating Systems for Supercomputers (ROSS)
Description
The complexity of node architectures in supercomputers increases as we cross milestones on the way toward exascale and beyond. Increasing levels of parallelism in multi- and many-core chips and emerging heterogeneity of computational resources coupled with energy and memory constraints force a reevaluation of our approaches towards operating systems and runtime environments.
The International Workshop on Runtime and Operating Systems for Supercomputers (ROSS) provides a forum for researchers to exchange ideas and discuss research questions that are relevant to upcoming supercomputers and cloud environments for high-performance computing. In addition to typical workshop publications, we encourage novel and possibly immature ideas, provided that they are interesting and on-topic. Well-argued position papers are also welcome.
Workshop
13th Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2023)
Description
Increases in the number, variety and complexity of components required to compose next-generation extreme-scale systems mean that systems will experience significant increases in aggregate fault rates, fault diversity, and fault complexity. Additionally, the widespread availability of new storage devices (NVMM, NVMe, SSD), increasing system heterogeneity, and the emergence of novel computing paradigms (neuromorphic, quantum) introduce fault tolerance issues that the research community has just begun to address.
Due to the continued need for research on fault tolerance in extreme-scale systems, the 13th Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2023) will present an opportunity for innovative research ideas to be shared, discussed, and evaluated by researchers in fault-tolerance, resilience, and reliability from academic, government, and industrial institutions. Building on the success of the previous editions of the FTXS workshop, we will assemble quality publications and a featured speaker to facilitate a lively and thought-provoking group discussion.
Workshop
14th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Heterogeneous Systems (ScalAH'23)
Description
Novel hybrid scalable scientific algorithms are needed with the advent of a variety of novel accelerators, including GPUs and FPGAs, as well as the growing scale of quantum computing devices, neuromorphic chips, and various AI-specific processors. This myriad of devices requires a unified approach that allows efficient, scalable hybrid methods combining classical and novel computing paradigms to be implemented at scale. These extreme-scale heterogeneous systems require novel scientific algorithms that hide complexity as well as network and memory latency, use advanced communication, and avoid synchronization points where possible. With the advent of AI in the past few years, the need for scalable mathematical methods and algorithms for such hybrid architectures that can handle data- and compute-intensive applications at scale has become even more pressing.
Workshop
1st Workshop on Enabling Predictive Science with Optimization and Uncertainty Quantification in HPC
Description
EPSOUQ-HPC 2023 is a workshop that connects engineers and scientists across disciplines that conduct simulations or data analysis on supercomputing platforms, with the specific theme of approaches for enabling optimization and uncertainty quantification workflows. The exchange of ideas and methods for the optimization of codes and solutions, simulation validation, and assessment of uncertainties in mathematical models, computational solutions, and experimental data is the key focus. Networking opportunities throughout the workshop will establish an environment to spark discussions among scientists, engineers, students, and professionals from around the world centered on verified and validated supercomputing-enabled predictions. Our aim is to share methods and case studies demonstrating optimization and uncertainty quantification success stories and lessons learned across HPC architectures for a multitude of distributed computing, big data, and AI applications.
Workshop
2023 International Workshop on Performance, Portability, and Productivity in HPC (P3HPC)
Description
The Performance, Portability, and Productivity in HPC workshop aims to bring together developers and researchers with an interest in practical solutions, technologies, tools, and methodologies that enable the development of performance-portable applications across a diverse set of current and future high‑performance computers.
The topic of Performance, Portability, and Productivity focuses on enabling applications and libraries to run across multiple architectures without significant impact on achieved performance and with the goal of maintaining developer productivity. This workshop provides a forum for discussing successes and failures in tackling the compelling problems that lie at the intersection of performance, portability, productivity, and high-performance computing. This area touches on many aspects of HPC software development, and the workshop program is expected to reflect a wide range of experiences and perspectives, including those of compiler, language, and runtime experts; application developers; performance engineers; and domain scientists. For more information see: https://p3hpc.org/workshop/2023/
Birds of a Feather
27th Graph500 List
Performance Measurement, Modeling, and Tools
Description
Data-intensive supercomputer applications are increasingly important workloads, especially for “Big Data” problems, but are ill suited for most of today’s computing platforms (at any scale!). The Graph500 list has grown to over 357 entries and has demonstrated the challenges of even simple analytics. The new SSSP kernel introduced at SC17 has increased the benchmark’s overall difficulty. This BoF will unveil the latest Graph500 lists, provide in-depth analysis of the kernels and machines, and highlight the new energy metrics of the Green Graph500. It will offer a forum for the community and provide a rallying point for data-intensive supercomputing problems.
Workshop
2nd International Workshop on Cyber Security in High Performance Computing (S-HPC 2023)
Description
HPC Security has traditionally been an "operational" challenge (i.e., restrict access and usage to certified users). However, as HPC gradually permeates more areas of public interest, a hands-off approach to security aspects in favor of performance/power is becoming imprudent at best. Paired with HPC’s traditional role of early technology adoption, a new set of vulnerabilities worth targeting early is emerging that is not necessarily found in other computing scenarios operating with more established technologies.
Moreover, additional exploits specific to the HPC community arise from acute hardware heterogeneity, novel network technologies, massive resource-management orchestration, heavy reliance on brittle experimental software not hardened by numerous deployments, and poorly maintained software. In combination with single-node exploits, these vulnerabilities open fertile new attack surfaces. Due to these factors, and based on the interest in the previous iteration, we are proposing the next iteration of the Security in High Performance Computing workshop.
Workshop
3rd International Workshop on RESource DISaggregation in High Performance Computing (RESDIS)
Description
Disaggregation is an emerging compute paradigm that splits existing monolithic servers into a number of consolidated single-resource pools that communicate over a fast interconnect. This model decouples individual hardware resources and enables the creation of logical compute platforms with flexible and dynamic hardware configurations. The concept of disaggregation is driven by various recent trends in computation. From an application perspective, the increasing importance of data analytics and machine learning workloads brings unprecedented need for memory capacity, which is in stark contrast with the growing imbalance in the peak compute-to-memory capacity ratio of traditional system-board-based servers. On the hardware front, the proliferation of heterogeneous, special-purpose computing elements promotes the need for composable platforms, while the increasing maturity of optical interconnects elevates the prospects of distance independence in networking infrastructure. The workshop intends to explore various aspects of resource disaggregation, composability, and their implications for future HPC platforms.
Paper
5 ExaFlop/s HPL-MxP Benchmark with Linear Scalability on the 40-Million-Core Sunway Supercomputer
Exascale
Large Scale Systems
State of the Practice
Best Paper Finalist
Description
HPL-MxP is an emerging high-performance benchmark used to measure the mixed-precision computing capability of leading supercomputers. This work presents our efforts on the new Sunway that linearly scale the benchmark to over 40 million cores, sustain an overall mixed-precision performance exceeding 5 ExaFlop/s, and achieve over 85% of peak performance, the highest efficiency reached among all heterogeneous systems on the HPL-MxP list. The optimizations of our HPL-MxP implementation include: (1) a Two-Direction Look-Ahead and Overlap algorithm that enables overlap of all communications with computation; (2) a multi-level process-mapping and communication-scheduling method that uses the network as efficiently as possible while maintaining a conflict-free algorithm flow; and (3) a CG-Fusion computing framework that eliminates up to 60% of inter-chip communications and removes the memory-access bottleneck while serving computation and communication simultaneously. This work also provides useful insights for tuning cutting-edge applications on Sunway supercomputers as well as other heterogeneous supercomputers.
Workshop
5th Workshop on Programming and Performance Visualization Tools (ProTools 2023)
Description
Understanding program behavior is critical to overcome the expected architectural and programming complexities that arise on modern HPC platforms. To do so, HPC software developers need intuitive support tools for debugging, performance measurement, analysis, and tuning of large-scale HPC applications. Moreover, data collected from these tools such as hardware counters, communication traces, and network traffic can be far too large and too complex to be analyzed in a straightforward manner. We need new automatic analysis and visualization approaches to help application developers intuitively understand the multiple, interdependent effects that algorithmic choices have on application correctness or performance. The Workshop on Programming and Performance Visualization Tools (ProTools) brings together HPC application developers, tool developers, and researchers from the visualization, performance, and program analysis fields for an exchange of new approaches to assist developers in analyzing, understanding, and optimizing programs for extreme-scale platforms.
Paper
69.7-PFlops Extreme Scale Earthquake Simulation with Crossing Multi-Faults and Topography on Sunway
Accelerators
Applications
Modeling and Simulation
Best Paper Finalist
Best Student Paper Finalist
Description
A highly scalable and fully optimized earthquake model is presented, based on the latest Sunway supercomputer. Contributions include:
1) the curvilinear grid finite-difference method (CGFDM) and a flexible model applying perfectly matched layers (PML), enabling more accurate and realistic terrain descriptions;
2) a hybrid and non-uniform domain decomposition scheme that efficiently maps the model across different levels of the computing system; and
3) sophisticated optimizations that largely alleviate or even eliminate bottlenecks in memory, communication, etc., obtaining a speedup of over 140x.
Combining all innovations, the design fully exploits the hardware potential of all aspects and enables us to perform the largest CGFDM-based earthquake simulation ever reported (69.7 PFlops using over 39 million cores).
Based on our design, the Turkey earthquakes (February 6, 2023) and the Ridgecrest earthquake (July 4, 2019) are successfully simulated with a maximum resolution of 12 m. Precise hazard evaluations for hazard reduction in earthquake-stricken areas are also conducted.
Workshop
6th International Workshop on Emerging Parallel Distributed Runtime Systems and Middleware
Description
Node architectures of extreme-scale systems are rapidly increasing in complexity. Emerging homogeneous and heterogeneous designs provide massive multi-level parallelism, but developing efficient runtime systems and middleware that allow applications to efficiently and productively exploit these architectures is extremely challenging. Moreover, current state-of-the-art approaches may become unworkable once energy consumption, resilience, and data movement constraints are added. The goal of this workshop is to attract the international research community to share new and bold ideas that will address the challenges of design, implementation, deployment, and evaluation of future runtime systems and middleware.
Workshop
7th International Workshop on Software Correctness for HPC Applications (Correctness '23)
Description
Ensuring correctness in HPC applications is one of the fundamental challenges that the HPC community faces today. While significant advances in verification, testing, and debugging have been made to isolate software defects in the context of non-HPC software, several factors make achieving correctness in HPC applications and systems much more challenging than in general systems software: growing heterogeneity (CPUs, GPUs, and special-purpose accelerators), massive-scale computations, use of combined parallel programming models (e.g., MPI+X), new scalable numerical algorithms (e.g., to leverage reduced precision in floating-point arithmetic), and aggressive compiler optimizations/transformations are some of the challenges that make correctness harder in HPC. As the complexity of future architectures, algorithms, and applications increases, the ability to fully exploit exascale systems will be limited without correctness. The goal of this workshop is to bring together researchers and developers to present and discuss novel ideas to address the problem of correctness in HPC.
Birds of a Feather
A Component-Based Approach for Integrating Quantum Computing Test Beds into HPC Environments: Challenges and Opportunities
Description
Integrating quantum computing (QC) test beds into scientific computing environments presents challenges in software interfaces and system familiarity. High-performance computing (HPC) centers are taking on this task, but selecting suitable test bed technologies is complex due to the numerous providers with varying maturity levels and the associated risk of single-vendor systems.
A component-based approach is promising but faces challenges with the lack of standardized benchmarks, and the need for device-specific calibrations. This discussion addresses the challenge of component-based approaches and explores unifying access to diverse QC technologies, leveraging HPC for optimization, and fulfilling researcher needs.
Paper
A GPU Algorithm for Detecting Strongly Connected Components
Accelerators
Algorithms
Graph Algorithms and Frameworks
Description
Detecting strongly connected components (SCCs) is an important step in various graph computations. The fastest GPU and CPU implementations from the literature work well on graphs where most of the vertices belong to a single SCC and the vertex degrees follow a power-law distribution. However, these algorithms can be slow on the mesh graphs used in certain radiative transfer simulations, which have a nearly constant vertex degree and can have significant variability in the number and size of SCCs. We introduce ECL-SCC, an SCC detection algorithm that addresses these shortcomings. Our approach is GPU-friendly and employs innovative techniques such as maximum ID propagation and edge removal. On an A100 GPU, ECL-SCC performs on par with the fastest prior GPU code on power-law graphs and outperforms it by 7.8x on mesh graphs. Moreover, ECL-SCC running on the GPU outperforms fast parallel CPU code by three orders of magnitude on meshes.
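ECL-SCC's GPU-specific kernels (maximum ID propagation, edge removal) are not spelled out in this abstract. As a point of reference, the classic reachability-based forward-backward decomposition that GPU SCC detectors commonly build on can be sketched serially, here using a max-ID pivot; all names and structure are illustrative, not the paper's implementation:

```python
from collections import defaultdict

def reach(adj, start, allowed):
    """Vertices in `allowed` reachable from `start` via adjacency `adj`."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v in allowed and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def fb_scc(vertices, edges):
    """Forward-backward SCC decomposition: the SCC of a pivot is the
    intersection of its forward- and backward-reachable sets; the three
    remaining partitions are processed independently (in parallel on a GPU)."""
    fwd, bwd = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        bwd[v].append(u)
    sccs, work = [], [set(vertices)]
    while work:
        part = work.pop()
        if not part:
            continue
        pivot = max(part)              # max-ID pivot selection
        f = reach(fwd, pivot, part)    # forward-reachable within partition
        b = reach(bwd, pivot, part)    # backward-reachable within partition
        scc = f & b
        sccs.append(scc)
        work += [f - scc, b - scc, part - f - b]
    return sccs
```

On meshes with many small SCCs, this recursion degenerates to one SCC per pivot step, which is the kind of shortcoming the paper's ID-propagation technique targets.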
Paper
A High-Performance MST Implementation for GPUs
Accelerators
Algorithms
Graph Algorithms and Frameworks
Description
Finding a minimum spanning tree (MST) is a fundamental graph algorithm with applications in many fields. This paper presents ECL-MST, a fast MST implementation designed specifically for GPUs. ECL-MST is based on a parallelization approach that unifies Kruskal's and Borůvka's algorithm and incorporates new and existing optimizations from the literature, including implicit path compression and edge-centric operation. On two test systems, it outperforms leading GPU and CPU codes from the literature on all of our 17 input graphs from various domains. On a Titan V GPU, ECL-MST is, on average, 4.6 times faster than the next fastest code, and on an RTX 3080 Ti GPU, it is 4.5 times faster. On both systems, ECL-MST running on the GPU is roughly 30 times faster than the fastest parallel CPU code.
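Borůvka's algorithm, one half of the unified parallelization the abstract mentions, is attractive for GPUs because each round's cheapest-edge selection is independent per component. A minimal serial sketch with a union-find structure (illustrative only, not the ECL-MST code; assumes a connected graph with distinct edge weights):

```python
class DSU:
    """Union-find with path halving."""
    def __init__(self, n):
        self.p = list(range(n))
    def find(self, x):
        while self.p[x] != x:
            self.p[x] = self.p[self.p[x]]
            x = self.p[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.p[ra] = rb
        return True

def boruvka_mst(n, edges):
    """edges: list of (weight, u, v) with distinct weights; returns MST weight.
    Each round, every component picks its cheapest outgoing edge; all picked
    edges are merged at once, halving (at least) the component count."""
    dsu, total, components = DSU(n), 0, n
    while components > 1:
        cheapest = {}  # component root -> best (weight, u, v)
        for w, u, v in edges:
            ru, rv = dsu.find(u), dsu.find(v)
            if ru != rv:
                for r in (ru, rv):
                    if r not in cheapest or cheapest[r][0] > w:
                        cheapest[r] = (w, u, v)
        for w, u, v in cheapest.values():
            if dsu.union(u, v):  # duplicate picks are rejected here
                total += w
                components -= 1
    return total
```

The per-round independence is what maps to edge-centric GPU operation; the union-find paths play the role of the implicit path compression cited in the abstract.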
Birds of a Feather
A National Science Data Fabric to Democratize Data Access and Reusability
Cloud Computing
Distributed Computing
Description
We are building a National Science Data Fabric (NSDF) that introduces a novel trans-disciplinary approach for integrated data delivery and access to shared storage, networking, computing, and educational resources. Such a data fabric can democratize data-driven scientific discovery across the growing data science community. In this BoF, we want to engage the data science community to discuss the challenges and opportunities of the NSDF project and other similar efforts to connect an open network of institutions, including resource-disadvantaged institutions, and develop a federated testbed configurable for individual and shared scientific use.
Paper
A Quantitative Approach for Adopting Disaggregated Memory in HPC Systems
Cloud Computing
Distributed Computing
Data Movement and Memory
Performance Measurement, Modeling, and Tools
Description
Memory disaggregation has recently been adopted in major data centers to improve resource utilization, driven by cost and sustainability. Meanwhile, studies on large-scale HPC facilities have also highlighted memory under-utilization. A promising and non-disruptive option for memory disaggregation is rack-scale memory pooling, where node-local memory is supplemented by shared memory pools. This work outlines the prospects and requirements for adoption and clarifies several misconceptions. We propose a quantitative method for dissecting application requirements on the memory system in three levels, moving from general, to multi-tier memory, and then to memory pooling. We also provide tools to facilitate the quantitative approach. We evaluated a set of representative HPC workloads on an emulated platform. Our results show that interference in memory pooling has varied application impact, depending on access ratio and arithmetic intensity. Finally, our method is applied in two case studies to show benefits at both the application and system level.
Paper
Accelerating Communications in Federated Applications with Transparent Object Proxies
Cloud Computing
Distributed Computing
Data Movement and Memory
Performance Measurement, Modeling, and Tools
Description
Advances in networks, accelerators, and cloud services encourage programmers to reconsider where to compute---such as when fast networks make it cost-effective to compute on remote accelerators despite added latency. Workflow and cloud-hosted serverless computing frameworks can manage multi-step computations spanning federated collections of cloud, high-performance computing (HPC), and edge systems, but passing data among computational steps via cloud storage can incur high costs. Here, we overcome this obstacle with a new programming paradigm that decouples control flow from data flow by extending the pass-by-reference model to distributed applications. We describe ProxyStore, a system that implements this paradigm by providing object proxies that act as wide-area object references with just-in-time resolution. This proxy model enables data producers to communicate data unilaterally, transparently, and efficiently to both local and remote consumers. We demonstrate the benefits of this model with synthetic benchmarks and real-world scientific applications, running across various computing platforms.
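The essence of the proxy pattern described above can be sketched in a few lines: an object that carries a fetch callable instead of data, and resolves it just in time on first use. This is a minimal illustration of the idea, not ProxyStore's actual API:

```python
_UNRESOLVED = object()  # sentinel: target not fetched yet

class Proxy:
    """Pass-by-reference stand-in with just-in-time resolution.

    `factory` is any zero-argument callable that fetches the real object
    (e.g., a get() against a remote object store) the first time the
    proxy is actually used.
    """
    def __init__(self, factory):
        self._factory = factory
        self._target = _UNRESOLVED

    def _resolve(self):
        if self._target is _UNRESOLVED:
            self._target = self._factory()  # the fetch happens here, once
        return self._target

    def __getattr__(self, name):
        # Invoked only for attributes not found on the proxy itself,
        # i.e., everything belonging to the target object.
        return getattr(self._resolve(), name)
```

Passing such a proxy between workflow steps moves only the lightweight reference; the consumer that first touches it transparently triggers the (possibly wide-area) data transfer.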
Birds of a Feather
Accelerating Storage IO to GPUs
Data Analysis, Visualization, and Storage
Description
Storage IO is becoming more of a bottleneck, especially for a new generation of AI-based workloads that are accelerated by GPUs. This session will provide a brief overview of key trends, available solutions presented as lightning talks, and illustrative application performance gains in this space. The majority of the session will engage in an open, forward-looking discussion with the gathered community on promising areas for investigation. Presenters will include those from academia and industry with new and challenging applications, storage partners with characterization, and innovators with new solutions in GPU-initiated storage and greater security. Join us for an exciting exchange!
Birds of a Feather
ACCESS Resource Provider Forum
Description
The ACCESS Resource Providers (RPs) will give an overview of the available resources and their unique characteristics. These resources are open to a broad audience of computational researchers. Individuals can apply for allocations by submitting a request to ACCESS. Once this request is approved, they can exchange their awarded service units for resources at one or several of the providers (e.g., node hours, GPU hours, storage).
The presentations will highlight the variety of resources and will be followed by a discussion with the community, allowing the audience to directly interact with the RPs.
Visit https://app.meet.ps/attendee/fcqctplo to submit questions beforehand
Paper
Adaptive Workload-Balanced Scheduling Strategy for Global Ocean Data Assimilation on Massive GPUs
Accelerators
Algorithms
Graph Algorithms and Frameworks
Description
Global ocean data assimilation is a crucial technique to estimate the actual oceanic state by combining numerical model outcomes and observation data, and it is widely used in climate research. Due to the imbalanced distribution of observation data in the global ocean, the parallel efficiency of recent methods suffers from workload imbalance. When massive GPUs are applied to global ocean data assimilation, the workload imbalance becomes more severe, resulting in poor scalability. In this work, we propose a novel adaptive workload-balanced scheduling strategy for assimilation, which successfully estimates the total workload prior to execution and ensures a balanced workload assignment. Further, we design a parallel dynamic programming approach to accelerate the scheduling decision, and develop a factored dataflow to exploit the parallel potential of GPUs. Evaluation demonstrates that our algorithm outperforms the state-of-the-art method by up to 9.1x speedup. This work is the first to scale global ocean data assimilation to 4,000 GPUs.
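The abstract does not detail its dynamic program, but the general shape of DP-based balanced scheduling can be illustrated with the textbook linear-partition problem: split per-region workloads into k contiguous chunks so the heaviest chunk is as light as possible. A serial sketch under those assumptions (names hypothetical, not the paper's algorithm):

```python
from itertools import accumulate

def balanced_partition(costs, k):
    """Split `costs` into k contiguous chunks minimizing the heaviest chunk.
    Returns (optimal max chunk cost, sorted list of chunk end indices)."""
    n = len(costs)
    prefix = [0] + list(accumulate(costs))
    INF = float("inf")
    # dp[j][i]: best max-load when the first i items form j chunks
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0
    for j in range(1, k + 1):
        for i in range(1, n + 1):
            for m in range(j - 1, i):  # m = end of the previous chunk
                cand = max(dp[j - 1][m], prefix[i] - prefix[m])
                if cand < dp[j][i]:
                    dp[j][i], cut[j][i] = cand, m
    # walk the cut table backward to recover chunk boundaries
    bounds, i = [], n
    for j in range(k, 0, -1):
        bounds.append(i)
        i = cut[j][i]
    return dp[k][n], sorted(bounds)
```

The inner minimization over `m` is independent for each `(j, i)` cell within a diagonal, which is the kind of structure a parallel DP over GPUs can exploit.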
Tutorial
ADIOS-2: A Framework to Enable HPC Tools for Extreme Scale I/O, In Situ Visualization, and Performance Analysis
Description
As concurrency and complexity continue to increase on high-end machines, storage I/O performance is rapidly becoming a fundamental challenge to scientific discovery. At the exascale, online analysis will become a dominant form of data analytics, and thus scalable in situ workflows will become critical, along with high performance I/O to storage. The many components of a workflow running simultaneously pose another challenge of evaluating and improving the performance of these workflows. Therefore, performance data collection needs to be an integral part of the entire workflow.
In this tutorial, we present ADIOS-2, which allows building in situ and file-based data-processing workflows for extreme-scale systems, including interactive, on-demand, in situ visualization of the data, as well as performance profiling of the entire workflow. Half of this tutorial will be hands-on sessions, where we provide access to the software and together build a complete MiniApp with in situ analytics and performance analysis that users can run on their laptops and on supercomputers at large scale. We will show how ADIOS-2 is fully integrated into three popular visualization and performance tools: Jupyter Notebook, ParaView, and TAU, creating a software ecosystem for in situ processing of both performance and scientific data.
Paper
ADT-FSE: A New Encoder for SZ
Accelerators
Data Analysis, Visualization, and Storage
Data Compression
Description
SZ is a lossy floating-point data compressor that excels in compression ratio and throughput for high-performance computing (HPC), time series databases, and deep learning applications. However, SZ performs poorly for small chunks and has slow decompression. We pinpoint the Huffman tree in the quantization factor encoder as the bottleneck of SZ. In this paper, we propose ADT-FSE, a new quantization factor encoder for SZ. Based on the Gaussian distribution of quantization factors, we design an adaptive data transcoding (ADT) scheme to map quantization factors to codes for better compressibility, and then use finite state entropy (FSE) to compress the codes. Experiments show that ADT-FSE improves the quantization factor compression ratio, compression and decompression throughput by up to 5x, 2x and 8x, respectively, over the original SZ Huffman encoder. On average, SZ_ADT is over 2x faster than ZFP in decompression.
Birds of a Feather
Advanced Architecture "Playgrounds" - Past Lessons and Future Accesses of Testbeds
Architecture and Networks
Description
Testbeds play a vital role in assessing the readiness of novel architectures for upcoming supercomputers for the exascale and post-exascale era. These testbeds also act as co-design hubs, enabling the collection of application operational requirements, while identifying critical gaps that need to be addressed for an architecture to become viable for HPC. Various research centers are actively deploying testbeds, and our aim is to build a community that facilitates the sharing of information, encouraging collaboration and understanding of the available evaluation resources. This BoF will facilitate the exchange of best practices, including testbed design, benchmarking, system evaluation, and availability.
Tutorial
Advanced MPI Programming
Description
The vast majority of production parallel scientific applications today use MPI and run successfully on the largest systems in the world. Parallel system architectures are evolving to include complex, heterogeneous nodes comprising general-purpose CPUs as well as accelerators such as GPUs. At the same time, the MPI standard itself is evolving to address the needs and challenges of future extreme-scale platforms as well as applications. This tutorial will cover several advanced features of MPI that can help users program modern systems effectively. Using code examples based on scenarios found in real applications, we will cover several topics including efficient ways of doing 2D and 3D stencil computation, derived datatypes, one-sided communication, hybrid programming (MPI + threads, shared memory, GPUs), topologies and topology mapping, neighborhood and nonblocking collectives, and some of the new performance-oriented features in MPI-4. Attendees will leave the tutorial with an understanding of how to use these advanced features of MPI and guidelines on how they might perform on different platforms and architectures.
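As a small, language-neutral illustration of one tutorial topic, the neighbor-rank arithmetic behind a 2D stencil halo exchange (what MPI provides via MPI_Cart_create and MPI_Cart_shift) can be sketched in plain Python, with no MPI installation required. The non-periodic, row-major grid here is an assumption for illustration.

```python
def cart_neighbors(rank, px, py):
    """Return (west, east, north, south) ranks on a px-by-py process
    grid with row-major rank ordering, or None at a boundary."""
    x, y = rank % px, rank // px
    west = rank - 1 if x > 0 else None
    east = rank + 1 if x < px - 1 else None
    north = rank - px if y > 0 else None
    south = rank + px if y < py - 1 else None
    return west, east, north, south

# Rank 4 sits in the middle of a 3x3 grid and has all four neighbors;
# rank 0 is a corner with only east and south neighbors.
print(cart_neighbors(4, 3, 3))  # (3, 5, 1, 7)
print(cart_neighbors(0, 3, 3))  # (None, 1, None, 3)
```

In real MPI code, these ranks would feed a neighborhood collective or nonblocking point-to-point calls for the halo exchange.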
Tutorial
Advanced OpenMP: Host Performance and 5.2 Features
Description
With the increasing prevalence of multicore processors, shared-memory programming models are essential. OpenMP is a popular, portable, widely supported, and easy-to-use shared-memory model. Developers usually find OpenMP easy to learn. However, they are often disappointed with the performance and scalability of the resulting code. This disappointment stems not from shortcomings of OpenMP, but rather from the lack of depth with which it is employed. Our “Advanced OpenMP Programming” tutorial addresses this critical need by exploring the implications of possible OpenMP parallelization strategies, both in terms of correctness and performance.
We assume attendees understand basic parallelization concepts and know the fundamentals of OpenMP. We focus on performance aspects, such as data and thread locality on NUMA architectures, false sharing, and exploitation of vector units. All topics are accompanied by extensive case studies, and we discuss the corresponding language features in-depth. Continuing the emphasis of this successful tutorial series, we focus solely on performance programming for multi-core architectures. Throughout all topics, we present the recent additions of OpenMP 5.2 and comment on developments targeting OpenMP 6.0.
Birds of a Feather
Advances in FPGA Programming and Technology for HPC
Architecture and Networks
Description
FPGAs have gone from niche components to being a central part of many data centers worldwide. The last year has seen tremendous advances in FPGA programmability and technology, especially in the shift to reconfigurable architectures that are heterogeneous and/or based on CGRAs or other AI engines. This BoF has two parts. The first is a series of lightning talks presenting advances in tools, technologies, and use-cases for these emerging architectures. The second part of the BoF will be a general discussion driven by the interests of the attendees, potentially including additional topics.
Birds of a Feather
Agriculture Empowered by Supercomputing
Applications
Description
Agriculture worldwide is facing massive challenges in production, distribution, pollution reduction, and food security and waste: less than 40% of any crop is actually marketed. The farm, the oldest human-engineered system, produces the vast majority of human sustenance and consumes the majority of global freshwater. Its efficient operation is of vital importance, particularly when supply chains are disrupted by wars and pandemics. This BoF will discuss how novel supercomputing technologies and related distributed heterogeneous systems at scale could empower the primary sector so that it no longer operates in a needlessly fragile and inefficient way.
Workshop
AI Assisted Software Development for HPC (AI4DEV)
Description
While software is an important component in the pursuit of scientific discovery, software development in HPC continues to be challenging. Today the software development process combines contributions from many domain scientists, and involves complex programming models. As the development complexity increases, it requires a steep learning curve for new developers, resulting in a slow development pace. With the continuous integration of applications in deep software stacks (workflows, compilers, runtime libraries, heterogeneous systems) novel techniques and practical tools for assisting software development in HPC are invaluable. Recent advances in generative AI and large language models, such as GitHub’s Copilot and OpenAI’s GPT, demonstrate potential for developer assistance and automated code synthesis. The goal of the AI4DEV workshop is to create a forum for researchers, scientists, and practitioners to discuss ideas on how AI can help in the whole development process. The workshop features contributed papers and invited talks in the area.
Birds of a Feather
Americas HPC Collaboration: Global Actions
Description
The SC23 edition of the Birds of a Feather "Americas High-Performance Computing Collaboration: Global Actions" seeks to showcase collaborations that have resulted from the partnerships formed since the first edition at SC19, presenting opportunities and experiences between HPC networks and laboratories in North, Central, and South America and their counterparts on other continents, mainly in Europe. The BoF will discuss different aspects of the expectations and experiences of collaboration in HPC, to feed the continental roadmap. This BoF is a crucial step toward the signing of an MoU that starts the formalization of the Americas HPC Collaboration.
Paper
AMRIC: A Novel In Situ Lossy Compression Framework for Efficient I/O in Adaptive Mesh Refinement Applications
Accelerators
Data Analysis, Visualization, and Storage
Data Compression
Description
As supercomputers advance toward exascale capabilities, computational intensity increases significantly, and the volume of data requiring storage and transmission grows exponentially. Adaptive Mesh Refinement (AMR) has emerged as an effective solution to address these two challenges. Concurrently, error-bounded lossy compression is recognized as one of the most efficient approaches to tackle the latter issue. Despite their respective advantages, few attempts have been made to investigate how AMR and error-bounded lossy compression can function together. To this end, this study presents AMRIC, a novel in situ lossy compression framework that employs the HDF5 filter to both reduce I/O costs and boost compression quality for AMR applications. We implement our solution in the AMReX framework and evaluate it on two real-world AMR applications, Nyx and WarpX, on the Summit supercomputer. Experiments with 512 cores demonstrate that AMRIC improves the compression ratio by 81x and the I/O performance by 39x over AMReX's original compression solution.
Birds of a Feather
Analyzing Parallel I/O
Data Analysis, Visualization, and Storage
Description
Parallel I/O performance can be a critical bottleneck for applications, yet users are often ill-equipped to identify and diagnose I/O performance issues. Increasingly complex hierarchies of storage hardware and software deployed on many systems only compound this problem. Tools that can effectively capture, analyze, and tune I/O behavior for these systems empower users to realize performance gains for many applications.
In this BoF, we form a community around best practices in analyzing parallel I/O and cover recent advances to help address the above-mentioned problem, drawing on the expertise of users, I/O researchers, and administrators in attendance.
Paper
ANT-MOC: Scalable Neutral Particle Transport Using 3D Method of Characteristics on Multi-GPU Systems
Accelerators
Applications
Modeling and Simulation
Best Paper Finalist
Best Student Paper Finalist
Description
The Method of Characteristics (MOC) for solving the Neutron Transport Equation (NTE) is the core of full-core reactor simulation. High resolution is enabled by discretizing the NTE with massive numbers of tracks that traverse the 3D reactor geometry. However, 3D full-core simulation is prohibitively expensive because of high memory consumption and severe load imbalance. To deal with these challenges, we develop ANT-MOC. Specifically, we build a performance model for memory footprint, computation, and communication, based on which a track management strategy is proposed to overcome the resolution bottlenecks caused by limited GPU memory. Furthermore, we implement a novel multi-level load mapping strategy to ensure load balancing among nodes, GPUs, and CUs. ANT-MOC enables a 3D full-core reactor simulation with 100 billion tracks on 16,000 GPUs, with 70.69% and 89.38% parallel efficiency for strong and weak scalability, respectively.
Paper
Application Performance Modeling via Tensor Completion
Artificial Intelligence/Machine Learning
Compilers
Performance Measurement, Modeling, and Tools
Performance Optimization
Programming Frameworks and System Software
Tensors
Description
Performance tuning, software/hardware co-design, and job scheduling are among the many tasks that rely on models to predict application performance. We propose and evaluate low-rank tensor decomposition for modeling application performance. We discretize the input and configuration domains of an application using regular grids. Application execution times mapped within grid-cells are averaged and represented by tensor elements. We show that low-rank canonical-polyadic (CP) tensor decomposition is effective in approximating these tensors. We further show that this decomposition enables accurate extrapolation of unobserved regions of an application's parameter space. We then employ tensor completion to optimize a CP decomposition given a sparse set of observed execution times. We consider alternative piecewise/grid-based models and supervised learning models for six applications and demonstrate that CP decomposition optimized using tensor completion offers higher prediction accuracy and memory-efficiency for high-dimensional performance modeling.
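The core idea, fitting a CP model to a sparse set of observed values and then predicting unobserved cells, can be sketched for rank 1 in plain Python. This is a toy with synthetic data, not the paper's method or scale; the factor sizes and observation pattern are made up.

```python
# Rank-1 CP tensor completion by alternating least squares: fit factors
# a, b, c so that T[i,j,k] is approximately a[i]*b[j]*c[k], using only a
# sparse set of observed entries, then predict the unobserved ones.

def als_rank1(obs, dims, iters=500):
    """obs: dict {(i, j, k): value}; dims: (I, J, K)."""
    factors = [[1.0] * d for d in dims]
    for _ in range(iters):
        for mode in range(3):                 # update one factor at a time
            for idx in range(dims[mode]):
                num = den = 0.0
                for pos, v in obs.items():
                    if pos[mode] != idx:
                        continue
                    rest = 1.0                # product of the other factors
                    for m in range(3):
                        if m != mode:
                            rest *= factors[m][pos[m]]
                    num += v * rest
                    den += rest * rest
                if den > 0:                   # closed-form least-squares update
                    factors[mode][idx] = num / den
    return factors

# Ground truth: an exactly rank-1 tensor, with roughly 2/3 of entries observed.
ta, tb, tc = [1.0, 2.0, 3.0], [1.0, 0.5], [2.0, 4.0]
full = {(i, j, k): ta[i] * tb[j] * tc[k]
        for i in range(3) for j in range(2) for k in range(2)}
obs = {p: v for n, (p, v) in enumerate(full.items()) if n % 3 != 0}
a, b, c = als_rank1(obs, (3, 2, 2))
err = max(abs(a[i] * b[j] * c[k] - v) for (i, j, k), v in full.items())
assert err < 1e-4    # unobserved entries recovered by the completed model
```

In the performance-modeling setting, the tensor elements would be averaged execution times over grid-cells of the configuration space, and the rank would generally be higher than 1.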
Birds of a Feather
Applications, Libraries, and Tools in Modern Fortran (alt.fortran)
State of the Practice
Description
This BoF provides a forum for Fortran developers to engage with its modern programming features. Fortran continues to play a crucial role in numerous legacy applications, but with features introduced in recent standards, the language also supports modern programming practices and high-performance computing. As Fortran 2023 approaches, this BoF brings together developers from various domains to share experiences and explore the language's evolving capabilities. After some brief panelist presentations, the session will focus on an interactive discussion where audience members will be encouraged to share their own experiences and ask questions of our panelists.
Birds of a Feather
Arm in HPC: Experiences and Lessons Learned in Operating Arm-Based HPC Systems
State of the Practice
Description
This BoF brings together the Arm HPC community to discuss experiences and lessons learned in delivering and operating Arm-based HPC systems. The topic of Arm HPC ecosystem maturity has been extensively discussed, focusing especially on the upper part of the stack (compilers, libraries, applications). This BoF focuses instead on the other side of the coin: the administration and management of systems. Primed by a short opening session from well-recognized experts in the community, the host and panel will engage attendees to share and ask probing questions. Audience participation is strongly encouraged.
Paper
Automated Mapping of Task-Based Programs onto Distributed and Heterogeneous Machines
Heterogeneous Computing
Programming Frameworks and System Software
Task Parallelism
Description
In a parallel and distributed application, a mapping is a selection of a processor for each computation or task and memories for the data collections that each task accesses. Finding high-performance mappings is challenging, particularly on heterogeneous hardware with multiple choices for processors and memories. We show that fast mappings are sensitive to the machine, application, and input. Porting to a new machine, modifying the application, or using a different input size may necessitate re-tuning the mapping to maintain the best possible performance.
We present AutoMap, a system that automatically tunes the mapping to the hardware used and finds fast mappings without user intervention or code modification. In contrast, hand-written mappings often require days of experimentation. AutoMap utilizes a novel constrained coordinate-wise descent search algorithm that balances the trade-off between running computations quickly and minimizing data movement. AutoMap discovers mappings up to 2.41x faster than custom, hand-written mappers.
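A generic coordinate-wise descent over a discrete mapping space can be sketched as follows. The cost model below is a hypothetical toy (load balance plus a communication penalty), not AutoMap's actual search, constraints, or cost function.

```python
# Coordinate-wise descent: each task is mapped to one processor; we
# sweep the tasks, greedily re-assigning one coordinate at a time while
# holding the others fixed, until a full sweep makes no change.

def coordinate_descent(num_tasks, procs, cost, sweeps=10):
    mapping = [procs[0]] * num_tasks          # start: everything on procs[0]
    for _ in range(sweeps):
        changed = False
        for t in range(num_tasks):
            best = min(procs,
                       key=lambda p: cost(mapping[:t] + [p] + mapping[t + 1:]))
            if best != mapping[t]:
                mapping[t] = best
                changed = True
        if not changed:                       # local optimum reached
            break
    return mapping

def cost(mapping):
    """Toy objective: load imbalance plus a penalty when neighboring
    tasks (assumed to share data) land on different processors."""
    load = {p: mapping.count(p) for p in ("cpu", "gpu")}
    imbalance = max(load.values()) - min(load.values())
    comm = sum(mapping[i] != mapping[i + 1] for i in range(len(mapping) - 1))
    return 3 * imbalance + comm

m = coordinate_descent(6, ["cpu", "gpu"], cost)
print(m, cost(m))  # ['gpu', 'gpu', 'gpu', 'cpu', 'cpu', 'cpu'] 1
```

The two cost terms mirror the trade-off named in the abstract: running computations quickly (balance) versus minimizing data movement (the neighbor penalty).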
Paper
Automatic Generation of Distributed-Memory Mappings for Tensor Computations
Artificial Intelligence/Machine Learning
Compilers
Performance Measurement, Modeling, and Tools
Performance Optimization
Programming Frameworks and System Software
Tensors
Description
While considerable research has been directed at automatic parallelization for shared-memory platforms, little progress has been made in automatic parallelization schemes for distributed-memory systems. We introduce an innovative approach to automatically produce distributed-memory parallel code for an important sub-class of affine tensor computations common to Coupled Cluster (CC) electronic structure methods, neuro-imaging applications, and deep learning models.
We propose a novel systematic approach to modeling the relations and trade-offs of mapping computations and data onto multi-dimensional grids of homogeneous nodes. Our formulation explores the space of computation and data distributions across processor grids. Tensor programs are modeled as a non-linear symbolic formulation accounting for the volume of data communication and per-node capacity constraints induced under specific mappings. Solutions are found, iteratively, using the Z3 SMT solver, and used to automatically generate efficient MPI code. Our evaluation demonstrates the effectiveness of our approach over Distributed-Memory Pluto and the Cyclops Tensor Framework.
Tutorial
Best Practices of HPC in the Cloud
Description
High Performance Computing in the cloud has grown significantly over the last five years. Weather, computational fluid dynamics (CFD), genomic analysis, and other workloads leverage the elasticity and broad compute choices of the cloud to innovate faster and deliver results sooner. The large choice of compute, storage, and network options and the dynamic nature of the cloud can make the first experience a daunting proposition. Cloud technologies also provide new capabilities to scientists, engineers, and HPC specialists; however, how to use them may not be immediately clear.
This tutorial provides intermediate and advanced content on running and managing HPC in the cloud. It is organized as four series of progressive lectures and labs that provide a hands-on learning experience. It starts with a primer on cloud foundations and how they map to common HPC concepts, dives deeper into cloud core components, and presents best practices for running HPC in the cloud.
This tutorial uses a combination of lectures and hands-on labs on provided temporary Amazon Web Services (AWS) accounts to provide both conceptual and hands-on learning.
Tutorial
Better Software for Reproducible Science
Description
Producing scientific software is a challenge. The high-performance modeling and simulation community, in particular, faces the confluence of disruptive changes in computing architectures and new opportunities (and demands) for greatly improved simulation capabilities, especially through coupling physics and scales. Simultaneously, computational science and engineering (CSE), as well as other areas of science, are experiencing an increasing focus on scientific reproducibility and software quality. Code coupling requires aggregate team interactions including integration of software processes and practices. These challenges demand large investments in scientific software development and improved practices. Focusing on improved developer productivity and software sustainability is both urgent and essential.
Attendees will learn about practices, processes, and tools to improve the productivity of those who develop CSE software, increase the sustainability of software artifacts, and enhance trustworthiness in their use. We will focus on aspects of scientific software development that are not adequately addressed by resources developed for industrial software engineering. Topics include the design, refactoring, and testing of complex scientific software systems; collaborative software development; and software packaging. The second half of this full-day tutorial will focus on reproducibility, and why and how to keep a lab notebook for computationally based research.
Paper
BLAD: Adaptive Load Balanced Scheduling and Operator Overlap Pipeline for Accelerating the Dynamic GNN Training
Artificial Intelligence/Machine Learning
Description
Dynamic graph networks are widely used for learning time-evolving graphs, but prior work on training these networks is inefficient due to communication overhead, long synchronization, and poor resource usage. Our investigation shows that communication and synchronization can be reduced by carefully scheduling the workload, and that the execution order of operators in GNNs can be adjusted without hurting training convergence.
We propose a system called BLAD to consider the above factors, comprising a two-level load scheduler and an overlap-aware topology manager. The scheduler allocates each snapshot group to a GPU, alleviating cross-GPU communication.
The snapshots in a group are then carefully allocated to processes on a GPU, enabling overlap of compute-intensive NN operators and memory-intensive graph operators. The topology manager adjusts the operators' execution order to maximize the overlap. Experiments show that BLAD achieves a 27.2% average speedup in training time over state-of-the-art solutions without affecting final accuracy.
Paper
Breaking Boundaries: Distributed Domain Decomposition with Scalable Physics-Informed Neural PDE Solvers
Artificial Intelligence/Machine Learning
Applications
Modeling and Simulation
State of the Practice
Description
Mosaic Flow is a novel domain decomposition method designed to scale physics-informed neural PDE solvers to large domains. Its unique approach leverages pre-trained networks on small domains to solve partial differential equations on large domains purely through inference, resulting in high reusability. This paper presents an end-to-end parallelization of Mosaic Flow, combining data parallel training and domain parallelism for inference on large-scale problems. By optimizing the network architecture and data parallel training, we significantly reduce the training time for learning the Laplacian operator to minutes on 32 GPUs. Moreover, our distributed domain decomposition algorithm enables scalable inferences for solving the Laplace equation on domains 4096x larger than the training domain, demonstrating strong scaling while maintaining accuracy on 32 GPUs. The reusability of Mosaic Flow, combined with the improved performance achieved through the distributed-memory algorithms, makes it a promising tool for modeling complex physical phenomena and accelerating scientific discovery.
Paper
Bringing Order to Sparsity: A Sparse Matrix Reordering Study on Multicore CPUs
Accelerators
Applications
Graph Algorithms and Frameworks
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
Description
Many real-world computations involve sparse data structures in the form of sparse matrices. A common strategy for optimizing sparse matrix operations is to reorder a matrix to improve data locality. However, it's not always clear whether reordering will provide benefits over the unordered matrix, as its effectiveness depends on several factors, such as structural features of the matrix, the reordering algorithm and the hardware that is used. This paper aims to establish the relationship between matrix reordering algorithms and the performance of sparse matrix operations. We thoroughly evaluate six different matrix reordering algorithms on 490 matrices across eight multicore architectures, focusing on the commonly used sparse matrix-vector multiplication (SpMV) kernel. We find that reordering based on graph partitioning provides better SpMV performance than the alternatives for a large majority of matrices, and that the resulting performance is explained through a combination of data locality and load balancing concerns.
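The locality effect that reordering targets can be illustrated with a toy CSR SpMV kernel and a BFS (Cuthill-McKee-style) permutation that shrinks the matrix bandwidth. This is illustrative code, not the paper's evaluation harness; the example graph is made up.

```python
from collections import deque

def spmv_csr(indptr, indices, data, x):
    """y = A @ x with A stored in CSR format."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for p in range(indptr[row], indptr[row + 1]):
            y[row] += data[p] * x[indices[p]]
    return y

def bfs_order(adj, start=0):
    """BFS visit order: a simplified Cuthill-McKee permutation."""
    seen, order, q = {start}, [], deque([start])
    while q:
        u = q.popleft()
        order.append(u)
        for v in sorted(adj[u]):
            if v not in seen:
                seen.add(v)
                q.append(v)
    return order

def bandwidth(adj, perm):
    """Largest |row - col| over nonzeros after renumbering by perm."""
    pos = {old: new for new, old in enumerate(perm)}
    return max(abs(pos[u] - pos[v]) for u in adj for v in adj[u])

# A path graph whose vertices were labeled in a scattered order: the
# chain is 0-3-1-4-2, so the natural ordering has bandwidth 3, while
# the BFS ordering renumbers it back into a tight band.
adj = {0: [3], 1: [3, 4], 2: [4], 3: [0, 1], 4: [1, 2]}
print(bandwidth(adj, sorted(adj)), "->", bandwidth(adj, bfs_order(adj)))  # 3 -> 1
assert spmv_csr([0, 2, 3], [0, 1, 1], [1.0, 2.0, 3.0], [1.0, 1.0]) == [3.0, 3.0]
```

A smaller bandwidth means the `x[indices[p]]` accesses in the kernel touch a narrower window of the input vector, which improves cache behavior; the paper's finding is that partitioning-based orderings usually do this best.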
Paper
Calculon: a Methodology and Tool for High-Level Codesign of Systems and Large Language Models
Artificial Intelligence/Machine Learning
Codesign
Performance Optimization
Programming Frameworks and System Software
Description
This paper presents a parameterized analytical performance model of transformer-based Large Language Models (LLMs) for guiding high-level algorithm-architecture codesign studies. This model derives from an extensive survey of performance optimizations that have been proposed for the training and inference of LLMs; the model's parameters capture application characteristics, the hardware system, and the space of implementation strategies. With such a model, we can systematically explore a joint space of hardware and software configurations to identify optimal system designs under given constraints, like the total amount of system memory. We implemented this model and methodology in a Python-based open-source tool called Calculon. Using it, we identified novel system designs that look significantly different from current inference and training systems, showing quantitatively the estimated potential to achieve higher efficiency, lower cost, and better scalability.
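The flavor of such an analytical model can be sketched with a roofline-style estimate for the GEMMs in one transformer MLP block. The formulas and hardware numbers below are made up for illustration; this is not Calculon's actual model.

```python
def gemm_time(m, n, k, flops_per_s, mem_bw, bytes_per_el=2):
    """Roofline estimate for an m x k by k x n GEMM: the max of
    compute time and memory-movement time."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_el * (m * k + k * n + m * n)
    return max(flops / flops_per_s, bytes_moved / mem_bw)

def layer_time(batch, hidden, tp, flops_per_s, mem_bw, link_bw, bytes_per_el=2):
    """One MLP block sharded across tp tensor-parallel GPUs, plus a
    simple ring all-reduce term for the activations."""
    t_gemm = gemm_time(batch, 4 * hidden // tp, hidden, flops_per_s, mem_bw) \
           + gemm_time(batch, hidden, 4 * hidden // tp, flops_per_s, mem_bw)
    t_comm = 2 * (tp - 1) / tp * batch * hidden * bytes_per_el / link_bw
    return t_gemm + t_comm

# Sweep the tensor-parallel degree under assumed hardware parameters
# (peak FLOP/s, memory bandwidth, interconnect bandwidth).
hw = dict(flops_per_s=300e12, mem_bw=2e12, link_bw=300e9)
for tp in (1, 2, 4, 8):
    print(tp, layer_time(batch=2048, hidden=8192, tp=tp, **hw))
```

A codesign tool generalizes this kind of sweep to the full joint space of parallelization strategies and hardware parameters, subject to constraints such as total memory.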
Workshop
Canopie-HPC
Description
The ongoing revolution enabled via containerization, virtualization, and new orchestration models has dramatically changed how applications and services are delivered and managed across the computing industry. This revolution has established a new ecosystem of tools and techniques with new, flexible, and agile approaches, and continues to gain traction in the HPC community. In addition to HPC-optimized container runtimes, emerging technologies like Kubernetes create a new set of opportunities and challenges. While adoption is growing, questions regarding best practices, foundational concepts, tools, and standards remain. Our goal is to promote the adoption of these tools and to examine the impact of this new ecosystem on HPC use cases. This workshop serves as a key venue for presenting late-breaking research, sharing experiences and best practices, and fostering collaboration in this field. Our fifth workshop iteration will continue to emphasize real-world experiences and challenges in adopting and optimizing these new approaches for HPC.
Panel
Carbon-Neutrality, Sustainability, and HPC
Energy Efficiency
Green Computing
Sustainability
Description
What does it mean for computer systems to be sustainable? We have made significant improvements to operational efficiency in HPC systems. We now need to consider a broader scope of environmental impacts across the life cycle of our systems: how they are designed and manufactured, how they are transported, how they are operated, and how they are torn down, reused, and recycled after they are no longer useful. These considerations may not be obvious. For example, manufacturing costs dominate the life-cycle carbon footprint of systems, and that trend is on the rise. How can we start to consider the carbon footprint across the end-to-end life cycle of our systems? We have many capabilities for understanding the performance, power, and energy of our systems, but the same cannot be said for carbon footprint. Should carbon footprint be a first-order optimization target?
Panel
Chiplet Ecosystem in High Performance Computing, AI/ML, and Data Acceleration
Artificial Intelligence/Machine Learning
Codesign
Heterogeneous Computing
Description
Chiplets have become a compelling approach to incorporating specialization and massive bandwidth into the compute and memory devices used in HPC. But there are many challenges in realizing the vision of affordable modular HPC built with advanced packaging technology. We bring together a diverse panel of experts to discuss whether an ecosystem or marketplace of chiplets will be available for system developers to build next-generation devices, and to weigh the pros and cons of off-the-shelf versus custom-designed chiplets. Chiplets could be processors, GPUs, networking interfaces, optical engines, memory controllers, or FPGAs.
Paper
Choosing the Best Parallelization and Implementation Styles for Graph Analytics Codes: Lessons Learned from 1106 Programs
Architecture and Networks
Data Movement and Memory
Graph Algorithms and Frameworks
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
Description
Graph analytics has become a major workload in recent years. The underlying core algorithms tend to be irregular and data dependent, making them challenging to parallelize. Yet, these algorithms can be implemented and parallelized in many ways for CPUs and even more ways for GPUs. We took 6 key graph algorithms and created hundreds of parallel CUDA, OpenMP, and parallel C++ versions of each of them, most of which have never been described or studied. To determine which parallelization and implementation styles work well and under what circumstances, we evaluated the resulting 1106 programs on 2 GPUs and 2 CPUs using 5 input graphs. Our results show which styles and combinations thereof work well and which ones should be avoided. We found that choosing the wrong implementation style can yield over a 10x performance loss on average. The worst combinations of styles can cost 6 orders of magnitude in performance.
Paper
Cloud Computing to Enable Wearable-Driven Longitudinal Hemodynamic Maps
Algorithms
Cloud Computing
Distributed Computing
Heterogeneous Computing
Large Scale Systems
State of the Practice
Description
Tracking hemodynamic responses to treatment and stimuli over long periods remains a grand challenge. Moving from established single-heartbeat technology to longitudinal profiles would require continuous data describing how the patient’s state evolves, new methods to extend the temporal domain over which flow is sampled, and high-throughput computing resources. While personalized digital twins can accurately measure 3D hemodynamics over several heartbeats, state-of-the-art methods would require hundreds of years of wallclock time on leadership scale systems to simulate one day of activity. To address these challenges, we propose a cloud-based, parallel-in-time framework leveraging continuous data from wearable devices to capture the first 3D patient-specific, longitudinal hemodynamic maps. We demonstrate the validity of our method by establishing ground truth data for 750 beats and comparing the results. Our cloud-based framework is based on an initial fixed set of simulations to enable the wearable-informed creation of personalized longitudinal hemodynamic maps.
Paper
Clover: Toward Sustainable AI with Carbon-Aware Machine Learning Inference Service
Cloud Computing
Distributed Computing
Energy Efficiency
Green Computing
Programming Frameworks and System Software
State of the Practice
Sustainability
Description
This paper presents a solution to the challenge of mitigating carbon emissions from hosting large-scale machine learning (ML) inference services. ML inference is critical to modern technology products, but it is also a significant contributor to carbon footprint. We introduce Clover, a carbon-friendly ML inference service runtime system that balances performance, accuracy, and carbon emissions through mixed-quality models and GPU resource partitioning. Our experimental results demonstrate that Clover is effective in substantially reducing carbon emissions while maintaining high accuracy and meeting service-level agreement (SLA) targets.
Paper
Co-Design Hardware and Algorithm for Vector Search
Accelerators
Artificial Intelligence/Machine Learning
Codesign
Fault Handling and Tolerance
Performance Measurement, Modeling, and Tools
Post-Moore Computing
Description
Vector search has emerged as the foundation for large-scale information retrieval and machine learning systems, with search engines like Google and Bing processing tens of thousands of queries per second on petabyte-scale document datasets by evaluating vector similarities between encoded query texts and web documents. As performance demands for vector search systems surge, accelerated hardware offers a promising solution in the post-Moore's Law era. We introduce FANNS, an end-to-end and scalable vector search framework on FPGAs. Given a user-provided recall requirement on a dataset and a hardware resource budget, FANNS automatically co-designs hardware and algorithm, subsequently generating the corresponding accelerator. The framework also supports scale-out by incorporating a hardware TCP/IP stack in the accelerator. FANNS attains up to 23.0x and 37.2x speedup compared to FPGA and CPU baselines, respectively, and demonstrates superior scalability to GPUs, achieving 5.5x and 7.6x speedup in median and 95th percentile latency within an eight-accelerator configuration.
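The semantics such accelerators approximate, top-k retrieval by vector similarity, can be stated in a few lines of reference code. This is a toy exact-search CPU sketch; real systems like FANNS implement approximate indexes (e.g., IVF-PQ variants) in hardware.

```python
# Exact top-k retrieval by inner-product similarity over a small
# in-memory database of vectors.
import heapq

def top_k(query, database, k):
    scored = ((sum(q * d for q, d in zip(query, vec)), idx)
              for idx, vec in enumerate(database))
    return [idx for _, idx in heapq.nlargest(k, scored)]

db = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(top_k([1.0, 0.0], db, 2))  # [0, 1]
```

The recall requirement mentioned above measures how often an approximate index returns the same identifiers this exact search would.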
Birds of a Feather
Commercial and Industrial HPC Use – What Is Really Needed?
Description
New HPC technologies offer new opportunities but also bring challenges for the users in a fast-developing HPC ecosystem. In order to get a better understanding and to prepare adapted offers for industrial / commercial HPC users, the EC funded the HPC-GIG project to organize three market studies on the current HPC offers for industry, the current and future needs of industrial and commercial HPC use and the legal and business requirements for industrial/commercial use. In this BoF, we will present the highlights of the market studies and discuss with both industrial users and HPC experts the outlook for future services.
Birds of a Feather
Community Engagement on NSF Learning and Workforce Development Programs to Democratize Cyberinfrastructure Access
Description
The National Science Foundation's Office of Advanced Cyberinfrastructure (OAC) supports the development and provisioning of state-of-the-art cyberinfrastructure resources, including HPC systems, tools, and services essential to the advancement of science and engineering. A critical vision and investment plan of OAC is to support inclusive and sustainable workforce development that will lead to transformative research leveraging such cyberinfrastructure. We seek to engage with the community and institutions to obtain feedback on preparing a workforce to address the evolving needs of research communities, including facilitating the invention and usage of CI, promoting democratized access, and fostering sustainable CI ecosystems.
Tutorial
Compression for Scientific Data
Description
Large-scale numerical simulations, observations, experiments, and AI computations are generating or consuming very large datasets that are difficult to analyze, store, and transfer. Data compression is an attractive and efficient technique to significantly reduce scientific datasets. This tutorial reviews the motivations, principles, techniques, and error analysis methods for lossy compression of scientific datasets. It details the main compression stages (e.g., decorrelation, approximation, and coding) and their variations through the presentation of state-of-the-art lossy compressors: SZ, ZFP, TThresh, MGARD, and SPERR. Special attention is paid to the trustworthiness of lossy compression. The tutorial addresses the following questions: Why lossy compression? How does compression work? How to measure and control compression error? What are the current use cases? The tutorial uses examples of real-world scientific datasets to illustrate the different compression techniques and their performance. From a participant perspective, the tutorial will detail how to use compression software as executables and as modules integrated in parallel I/O libraries (ADIOS, HDF5). This half-day tutorial, given by two of the leading teams in this domain and targeting primarily beginners interested in learning about lossy compression for scientific data, is improved from the highly rated tutorials given at ISC17-22 and SC17-22.
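The question "how to measure and control compression error?" can be made concrete with a tiny absolute-error-bounded quantizer and the standard metrics used to assess it. This is an illustrative sketch with made-up data, not the behavior of SZ, ZFP, or any compressor named above.

```python
# An absolute-error-bounded uniform quantizer plus two standard lossy
# compression metrics: maximum pointwise error and PSNR.
import math

def compress(data, bound):
    return [round(x / (2 * bound)) for x in data]   # integer codes

def decompress(codes, bound):
    return [q * 2 * bound for q in codes]

def max_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def psnr(a, b):
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    rng = max(a) - min(a)
    return 10 * math.log10(rng * rng / mse) if mse else float("inf")

data = [0.11, 0.42, 0.73, 0.98, 1.57]
bound = 0.05
recon = decompress(compress(data, bound), bound)
assert max_error(data, recon) <= bound              # the bound is respected
print(round(max_error(data, recon), 3), round(psnr(data, recon), 1))
```

Real compressors add decorrelation (prediction or transforms) before quantization and entropy coding after it, but the error-control contract is the same: the reconstructed data never deviates from the original by more than the user-set bound.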
Panel
Computing at the Edge: HPC and AI Supporting Recent US Space Missions
Artificial Intelligence/Machine Learning
Edge Computing
IoT
Description
NASA’s space missions have captured the imagination of those around the world for generations. From the International Space Station to Artemis, there is a need for HPC, data movement, analytics, and AI capabilities delivered as efficient pipelines. For example, future missions such as the Dragonfly mission to Titan and other icy-moon missions may require AI at the extreme edge, with planned data flows to the core. Demand is skyrocketing, with use cases spanning operational decision-making at the edge, ensuring the health and safety of our astronauts, and advancing scientific discovery. In fact, this edge capability, combined with AI/ML, is changing the business models of the evolving space and climate economies and the way we architect HPC systems. Hear from and engage with our panel of experts about how recent missions have expanded our concept of computing at the edge, for both space-based and terrestrial challenges.
Birds of a Feather
Continuum Computing: A Multi-Paradigm Approach
Cloud Computing
Distributed Computing
Description
High-Performance Computing systems that have been traditionally deployed at a single site are expected to significantly expand their reach to include a variety of remote edge systems. These edge systems include computing platforms located near instruments as well as instruments themselves. Examples range from interconnected ecosystems of large science instruments to smart energy grids supported by complex analytics and control. These interconnected systems form a compute and instrument continuum wherein computation is orchestrated in various stages. This BoF will discuss the aggregation and synthesis of previously distinct techniques and tools (including HPC, AI/ML, and digital twins) to enable continuum computing.
Birds of a Feather
Current and Future HPC Storage Environments
Data Analysis, Visualization, and Storage
Description
Storage is an important part of HPC environments, especially with the explosion of data that comes with increasing computational power. But there are a number of evolving options and tradeoffs for storage (POSIX/S3, SSD/HDD/tape, on-premises/public cloud, management policies, etc.). The goal of this BoF is to facilitate a discussion about storage environments, and to share and hear plans and ideas from the audience and from the BoF leaders. Ultimately, we hope to help each other and the community better understand the options and best practices in the storage landscape.
Tutorial
Custom FPGA Workload Development Using Open FPGA Stack and oneAPI
Description
Open FPGA Stack (OFS) is the first complete hardware and software infrastructure that is fully open source, comprising composable hardware code and kernel code upstreamed to Linux.org, to enable a collaborative community of FPGA developers. The intention of OFS is to provide an efficient approach to developing a custom FPGA-based platform or solution by providing a framework of synthesizable code, a simulation environment, and scripts that developers can use as-is or modify. OFS source code can be used for the development of an Intel, third-party, or custom FPGA solution. This hands-on tutorial will spotlight Open FPGA Stack, as well as oneAPI (supported by OFS), by giving FPGA developers the opportunity to do basic FPGA workload development using the open-source OFS infrastructure, source code, and documentation we provide on GitHub at www.github.com/OFS. Attendees will modify the Acceleration Functional Unit Region (AFU Region) to create their own FPGA workload using both RTL and C++ (enabled by oneAPI).
Paper
cuSZp: An Ultra-Fast GPU Error-Bounded Lossy Compression Framework with Optimized End-to-End Performance
Accelerators
Data Analysis, Visualization, and Storage
Data Compression
Description
Modern scientific applications and supercomputing systems are generating large amounts of data in various fields, leading to critical challenges in data storage footprints and communication times. To address this issue, error-bounded GPU lossy compression has been widely adopted, since it can reduce the volume of data within a customized threshold on data distortion. In this work, we propose cuSZp, an ultra-fast error-bounded GPU lossy compressor. Specifically, cuSZp computes the linear recurrences with hierarchical parallelism to fuse the massive computation into one kernel, drastically improving the end-to-end throughput. In addition, cuSZp adopts a block-wise design along with a lightweight fixed-length encoding and bit-shuffle inside each block such that it achieves high compression ratios and data quality. Our experiments on an NVIDIA A100 GPU with six representative scientific datasets demonstrate that cuSZp achieves ultra-fast end-to-end throughput (95.53x that of cuSZ) along with a high compression ratio and high reconstructed data quality.
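To make the block-wise fixed-length encoding idea concrete (a generic sketch of the technique, not cuSZp's GPU kernel), each block of integer codes can be stored at the bit width of its largest value, so blocks of mostly small codes compress well:

```python
def fixed_length_encode(codes, block_size=32):
    """Toy block-wise fixed-length encoding: each block records one
    bit width plus its values, conceptually stored at that width."""
    blocks = []
    for i in range(0, len(codes), block_size):
        block = codes[i:i + block_size]
        # Bits needed to represent the largest value in this block.
        width = max(v.bit_length() for v in block)
        blocks.append((width, block))
    return blocks

def fixed_length_decode(blocks):
    return [v for _, b in blocks for v in b]

def encoded_bits(blocks):
    # One 8-bit width header per block, then width bits per value.
    return sum(8 + w * len(b) for w, b in blocks)
```

A block of codes all below 4 costs 2 bits per value instead of a full byte, which is where the compression ratio comes from.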
Birds of a Feather
DAOS Storage Community BoF
Data Analysis, Visualization, and Storage
Description
DAOS (https://docs.daos.io/) is an open-source scale-out object store that delivers extremely high performance to the most data-intensive HPC/AI workloads. With growing adoption, DAOS has seen significant community contributions like domain-specific container types, additional hardware support beyond x86_64 (e.g. ARM64), and enabling DAOS in the cloud.
This BoF brings together the DAOS community to discuss, share experiences, and brainstorm on future enhancements of DAOS. Topics include practical experiences with on-prem and cloud deployments, application use cases, and the software roadmap. This session targets end users, middleware developers, system administrators, DAOS core software developers, and vendors of DAOS-based hardware/software/cloud offerings.
Paper
DASP: Specific Dense Matrix Multiply-Accumulate Units Accelerated General Sparse Matrix-Vector Multiplication
Algorithms
Linear Algebra
Post-Moore Computing
Description
Sparse matrix-vector multiplication (SpMV) plays a key role in computational science and engineering, graph processing, and machine learning applications. Much SpMV work has been devoted to resolving problems such as random access to the vector x and load imbalance. However, we have experimentally found that the computation of inner products still accounts for a large share of the overhead in the SpMV operation, which has been largely ignored in existing work.
In this paper, we propose DASP, a new algorithm using specific dense MMA units for accelerating the compute part of general SpMV. We analyze the row-wise distribution of nonzeros and group the rows into three categories. We then organize them into small blocks of proper sizes to meet the requirement of MMA computation. For the three categories, DASP offers different strategies to complete SpMV. The experimental results on the latest NVIDIA Ampere and Hopper GPUs show that our DASP brought significant speedups over state-of-the-art SpMV work.
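As a point of reference for what DASP accelerates (a plain sequential sketch; the nnz thresholds in `group_rows` are made up for illustration, not the paper's), SpMV over the CSR format and a row categorization by nonzero count look like this:

```python
def csr_spmv(indptr, indices, vals, x):
    """Plain CSR sparse matrix-vector product y = A @ x."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        # Inner product of the row's nonzeros with x: the part DASP
        # maps onto dense MMA units.
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += vals[k] * x[indices[k]]
    return y

def group_rows(indptr, short=4, long=64):
    """Bucket rows into three categories by nonzero count, in the
    spirit of DASP's row-wise grouping (thresholds are illustrative)."""
    groups = {"short": [], "medium": [], "long": []}
    for row in range(len(indptr) - 1):
        nnz = indptr[row + 1] - indptr[row]
        key = "short" if nnz < short else "long" if nnz >= long else "medium"
        groups[key].append(row)
    return groups
```

Each category can then be packed into blocks sized to match the MMA tile shape.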
Paper
Data Flow Lifecycles for Optimizing Workflow Coordination
Cloud Computing
Distributed Computing
Data Movement and Memory
Performance Measurement, Modeling, and Tools
Description
A critical performance challenge in distributed scientific workflows is coordinating tasks and data flows on distributed resources. To guide these decisions, this paper introduces data flow lifecycle analysis. Workflows are commonly represented using directed acyclic graphs (DAGs). Data flow lifecycles (DFL) enrich task DAGs with data objects and properties that describe data flow and how tasks interact with that flow. Lifecycles enable analysis from several important perspectives: task, data, and data flow. We describe representation, measurement, analysis, visualization, and opportunity identification for DFLs. Our measurement is both distributed and scalable, using space that is constant per data file. We use lifecycles and opportunity analysis to reason about improved task placement and reduced data movement for five scientific workflows with different characteristics. Case studies show improvements of 15×, 1.9×, and 10–30×. Our work is implemented in the DataLife tool.
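The core enrichment step can be sketched in a few lines (a conceptual illustration, not DataLife's implementation; the task/file names below are hypothetical): invert each task's read/write sets into per-file producer and consumer sets, from which placement opportunities fall out.

```python
def data_flow_lifecycle(tasks):
    """Given {task: {"reads": [...], "writes": [...]}}, build the
    per-file view: who produces and who consumes each data object."""
    files = {}
    for name, io in tasks.items():
        for f in io.get("writes", []):
            files.setdefault(f, {"producers": set(), "consumers": set()})["producers"].add(name)
        for f in io.get("reads", []):
            files.setdefault(f, {"producers": set(), "consumers": set()})["consumers"].add(name)
    return files

def colocation_candidates(files):
    """Producer/consumer pairs sharing a file: co-locating them would
    eliminate that file's data movement."""
    return {(p, c, f) for f, pc in files.items()
            for p in pc["producers"] for c in pc["consumers"]}
```

Real DFL analysis additionally weights these edges with measured data volumes and access patterns.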
Tutorial
Deep Learning at Scale
Description
Deep learning is rapidly and fundamentally transforming the way science and industry use data to solve problems. Deep neural network models have been shown to be powerful tools for extracting insights from data across a large number of domains, from large language models (LLMs) to protein folding. As these models grow in complexity to solve increasingly challenging problems with larger and larger datasets, the need for scalable methods and software to train them grows accordingly.
The Deep Learning at Scale tutorial aims to provide attendees with a working knowledge of deep learning on HPC-class systems, including core concepts, scientific applications, performance optimization, tips, and techniques for scaling. We will provide training accounts on some of the world's largest GPU systems, example code, and datasets to allow attendees to experiment hands-on with optimized, scalable distributed training of deep neural network machine learning models from real scientific computing applications.
Paper
Demystifying and Mitigating Cross-Layer Deficiencies of Soft Error Protection in Instruction Duplication
Accelerators
Artificial Intelligence/Machine Learning
Codesign
Fault Handling and Tolerance
Performance Measurement, Modeling, and Tools
Post-Moore Computing
Description
Soft errors are prevalent in modern High-Performance Computing (HPC) systems, resulting in silent data corruptions (SDCs) that compromise system reliability. Instruction duplication is a widely used software-based protection technique against SDCs. Existing instruction duplication techniques are mostly implemented at the LLVM level and may suffer from low SDC coverage at the assembly level. In this paper, we evaluate instruction duplication at both the LLVM and assembly levels. Our study shows that existing instruction duplication techniques have protection deficiencies at the assembly level and are usually over-optimistic about their protection. We investigate the root causes of the protection deficiency and propose a mitigation technique, Flowery, to solve the problem. Our evaluation shows that Flowery can effectively protect programs from SDCs evaluated at the assembly level.
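The principle behind instruction duplication can be shown with a tiny redundancy-and-compare sketch (conceptual only; real duplication is inserted by the compiler at the instruction level, and `BitFlipOnce` is a made-up fault injector for the demo):

```python
def duplicated(op, *args):
    """Execute op twice on the same inputs and compare the results;
    a mismatch signals a (simulated) silent data corruption."""
    a, b = op(*args), op(*args)
    if a != b:
        raise RuntimeError("SDC detected: redundant results disagree")
    return a

class BitFlipOnce:
    """Fault injector: flips the lowest result bit on the first call,
    mimicking a transient soft error."""
    def __init__(self):
        self.fired = False

    def add(self, x, y):
        r = x + y
        if not self.fired:
            self.fired = True
            r ^= 1  # inject a single-bit error exactly once
        return r
```

The paper's point is that duplication inserted at the LLVM level can be weakened by later compilation stages, so coverage must be evaluated at the assembly level.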
Paper
Design Considerations and Analysis of Multi-Level Erasure Coding in Large-Scale Data Centers
Accelerators
Architecture and Networks
Data Analysis, Visualization, and Storage
Fault Handling and Tolerance
Best Student Paper Finalist
Description
Multi-level erasure coding (MLEC) has seen large deployments in the field, but there is no in-depth study of design considerations for MLEC at scale. In this paper, we provide comprehensive design considerations and analysis of MLEC at scale. We introduce the design space of MLEC in multiple dimensions, including various code parameter selections, chunk placement schemes, and various repair methods. We quantify their performance and durability, and show which MLEC schemes and repair methods can provide the best tolerance against independent/correlated failures and reduce repair network traffic by orders of magnitude. To achieve this, we use various evaluation strategies including simulation, splitting, dynamic programming, and mathematical modeling. We also compare the performance and durability of MLEC with other EC schemes such as SLEC and LRC and show that MLEC can provide high durability with higher encoding throughput and less repair network traffic over both SLEC and LRC.
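The two-level structure can be illustrated with the simplest possible code, single XOR parity at each level (a sketch only; production MLEC uses Reed-Solomon codes at both levels, and the group sizes here are arbitrary):

```python
def encode_stripe(chunks):
    """Append one XOR parity chunk: the simplest (k+1, k) erasure code."""
    p = 0
    for c in chunks:
        p ^= c
    return list(chunks) + [p]

def recover(stripe, lost):
    """Rebuild the chunk at index `lost` by XOR-ing the survivors."""
    r = 0
    for i, c in enumerate(stripe):
        if i != lost:
            r ^= c
    return r

def mlec_encode(groups):
    """Two-level sketch: local parity inside each node group (cheap
    local repair), plus a network-level parity group XOR-ed across
    groups (protection against a whole group failing)."""
    local = [encode_stripe(g) for g in groups]
    net = []
    for i in range(len(local[0])):
        p = 0
        for s in local:
            p ^= s[i]
        net.append(p)
    return local + [net]
```

A single-chunk failure repairs within its group, touching only that group's chunks, which is why MLEC can cut repair network traffic so sharply relative to one flat code.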
Birds of a Feather
Designing HPC Outreach Activities
Description
HPC Outreach is essential to enthusing young minds about computational science, informing the public and growing the HPC community, and yet many institutions do not have sufficient funding or staff effort to support the outreach activities. Effective outreach requires well designed activities that are suitable to the target audience and event type. Different activities are needed for different age groups, scientific backgrounds or venues. Each activity also has its own lifecycle and cannot be reused indefinitely. The goal of this session is to design several new activities that the community would be able to develop over the coming year.
Paper
DGAP: Efficient Dynamic Graph Analysis on Persistent Memory
Architecture and Networks
Data Movement and Memory
Graph Algorithms and Frameworks
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
Description
Dynamic graphs have grown in importance for numerous real-world applications. To accommodate this, graph frameworks, particularly their internal data structures, must support both persistent graph updates and rapid graph analysis simultaneously. Emerging persistent memory technologies, such as Optane DCPMM, offer a promising choice to simplify the designs by providing data persistence, low latency, and high IOPS together. We propose DGAP, a framework for efficient dynamic graph analysis on persistent memory. DGAP utilizes mutable Compressed Sparse Row (CSR) with new designs for persistent memory to construct the framework. Specifically, DGAP introduces a per-section edge log to reduce write amplification; a per-thread undo log to enable high-performance, crash-consistent rebalancing operations; and a data placement schema to minimize in-place updates. Our extensive evaluation results demonstrate that DGAP can achieve up to 3.2x better graph update performance and up to 3.77x better graph analysis performance compared to state-of-the-art dynamic graph frameworks for persistent memory.
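The mutable-CSR idea, edge arrays with reserved gaps so inserts avoid a global rebuild, can be sketched in a few lines (an in-memory toy; DGAP's actual design adds per-section edge logs, undo logs, and persistent-memory placement, none of which are modeled here):

```python
class MutableCSR:
    """Toy mutable CSR: the edge array reserves a fixed-capacity
    segment per vertex, so an insert is usually an in-place append."""
    def __init__(self, n, cap_per_vertex=4):
        self.cap = cap_per_vertex
        self.arr = [None] * (n * cap_per_vertex)  # edge array with gaps
        self.deg = [0] * n                        # per-vertex degree

    def add_edge(self, u, v):
        if self.deg[u] == self.cap:
            # A real design rebalances/rebuilds segments here.
            raise MemoryError("segment full")
        self.arr[u * self.cap + self.deg[u]] = v
        self.deg[u] += 1

    def neighbors(self, u):
        base = u * self.cap
        return self.arr[base: base + self.deg[u]]
```

Analysis keeps CSR-like locality (neighbors are contiguous), while updates stay cheap until a segment fills.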
Workshop
Digital Twins: Practices and Principles for High Performance Computing
Description
Digital twins are physically accurate virtual representations of real-world systems, providing beneficial information in actionable time by combining sensor data with surrogate models. Recent shifts in HPC combining simulation, AI, and edge computing have not only given us the opportunity to apply digital twins in science, but has also magnified their impact on global public policy and institutions, in domains including climate change, renewable energy, industry 4.0 and global healthcare. Increasingly accurate simulations become virtual sources of truth, capable of multi-physics synchrony in real world time ranging from subatomic to interstellar spaces. Digital twins in HPC are crucial to enabling breakthroughs in computational biomedicine, nuclear fusion, and building automation. This workshop will bring like minds together to identify challenges and opportunities in establishing digital twins as a common HPC practice and will highlight key principles for their use in high performance computing.
Paper
DistTGL: Distributed Memory-Based Temporal Graph Neural Network Training
Artificial Intelligence/Machine Learning
Description
Memory-based Temporal Graph Neural Networks are powerful tools in dynamic graph representation learning and have demonstrated superior performance in many real-world applications. However, their node memory favors smaller batch sizes to capture more dependencies in graph events and needs to be maintained synchronously across all trainers. As a result, existing frameworks suffer from accuracy loss when scaling to multiple GPUs. Even worse, the tremendous overhead of synchronizing the node memory makes it impractical to deploy them to distributed GPU clusters.
In this work, we propose DistTGL, an efficient and scalable solution to train memory-based TGNNs on distributed GPU clusters.
DistTGL has three improvements over existing solutions: an enhanced TGNN model, a novel training algorithm, and an optimized system. In experiments, DistTGL achieves near-linear convergence speedup, outperforming the state-of-the-art single-machine method by 14.5% in accuracy and 10.17x in training throughput.
Paper
DPS: Adaptive Power Management for Overprovisioned Systems
Architecture and Networks
Performance Measurement, Modeling, and Tools
Resource Management
Description
Maximizing performance under a power budget is essential for HPC systems and has inspired the development of many power management frameworks. These can be broadly characterized into two groups: model-based and stateless. Model-based frameworks achieve good performance under a power budget but are highly dependent on the quality of the model and the data used to train it. Stateless frameworks are more robust and require no training, but generally achieve lower performance. In this paper, we propose a new framework that does not require a model, but does track state in the form of recent power dynamics. We implement this idea and test it on a public cloud running both Spark and HPC jobs. We find that when total power demand is low, our framework achieves performance equivalent to prior work, but when power demand is high it achieves a mean 8% performance improvement (with no reliance on a learned model).
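A model-free, state-tracking controller of this flavor can be sketched as a simple feedback step on a per-node power cap (an illustrative sketch; the thresholds and step size below are invented, not the paper's tuning):

```python
def adjust_cap(cap, observed_power, budget, floor, ceiling, step=0.05):
    """One feedback step for a power cap: lower the cap when recent
    observed power exceeds the budget, raise it when there is slack.
    The only state is the recent power observation; no trained model."""
    if observed_power > budget:
        cap = cap * (1 - step)          # demand too high: tighten
    elif observed_power < 0.9 * budget:
        cap = cap * (1 + step)          # clear slack: loosen
    return max(floor, min(ceiling, cap))  # clamp to hardware limits
```

Run per epoch per node, such a loop redistributes an overprovisioned system's budget toward the nodes that can use it.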
Paper
EasyScale: Elastic Training with Consistent Accuracy and Improved Utilization on GPUs
Artificial Intelligence/Machine Learning
Description
Distributed synchronized GPU training is commonly used for deep learning. The resource constraint of using a fixed number of GPUs makes large-scale training jobs suffer long queuing times for resource allocation and lowers cluster utilization. Adapting to resource elasticity can alleviate this, but often introduces inconsistent model accuracy due to the lack of capability to decouple the model training procedure from resource allocation. We propose EasyScale, an elastic training system that achieves consistent model accuracy under resource elasticity for both homogeneous and heterogeneous GPUs. EasyScale strictly preserves data-parallel training behaviors, carefully traces the consistency-relevant factors, and exploits deep learning characteristics in its EasyScaleThread abstraction and fast context switching. To utilize heterogeneous clusters, EasyScale dynamically assigns workers based on intra- and inter-job schedulers, minimizing load imbalance and maximizing aggregate job throughput. Deployed in an online serving cluster, EasyScale powers training jobs to utilize idle GPUs opportunistically, improving overall cluster utilization by 62.1%.
Workshop
EduHPC-23: Workshop on Education for High Performance Computing
Description
The EduHPC workshop brings together stakeholders from industry (developers, hardware and software vendors), national labs, and academia in the context of SC, to hear the pedagogical challenges others are facing, share approaches to meeting such challenges, and generally exchange ideas related to high-performance computing, parallel and distributed computing, distributed data science, scalable AI and IoT/Edge computing in undergraduate and graduate education. In addition to paper presentations, this workshop will feature invited keynotes, panels (e.g., reproducibility in HPC education and training, inclusive pedagogy and efforts in broadening participation in HPC), special sessions such as “Peachy Assignments,” and invited talks on opportunities for collaboration, resource sharing, educator training, internships, and other means of increasing cross-fertilization between industry, government, and academia.
Tutorial
Efficient Distributed GPU Programming for Exascale
Description
Over the past decade, GPUs became ubiquitous in HPC installations around the world, delivering the majority of performance of some of the largest supercomputers (e.g. Summit, Sierra, JUWELS Booster). This trend continues in the recently deployed and upcoming Pre-Exascale and Exascale systems (JUPITER, LUMI, Leonardo; El Capitan, Frontier, Perlmutter): GPUs are chosen as the core computing devices to enter this next era of HPC. To take advantage of future GPU-accelerated systems with tens of thousands of devices, application developers need to have the proper skills and tools to understand, manage, and optimize distributed GPU applications.
In this tutorial, participants will learn techniques to efficiently program large-scale multi-GPU systems. While programming multiple GPUs with MPI is explained in detail, also advanced tuning techniques and complementing programming models like NCCL and NVSHMEM are presented. Tools for analysis are shown and used to motivate and implement performance optimizations. The tutorial teaches fundamental concepts that apply to GPU-accelerated systems in general, taking the NVIDIA platform as an example. It is a combination of lectures and hands-on exercises, using one of Europe’s fastest supercomputers, JUWELS Booster, for interactive learning and discovery.
Paper
Efficient Maximal Biclique Enumeration on GPUs
Accelerators
Algorithms
Graph Algorithms and Frameworks
Description
Maximal biclique enumeration (MBE) in bipartite graphs is an important problem in data mining with many real-world applications. All existing solutions for MBE are designed for CPUs. Parallel MBE algorithms for GPUs are needed to accelerate MBE by leveraging their many computing cores. However, enumerating maximal bicliques on GPUs faces three main challenges: large memory requirements, thread divergence, and load imbalance. In this paper, we propose GMBE, the first highly efficient GPU solution for the MBE problem. To overcome these challenges, we design a stack-based iteration approach to reduce GPU memory usage, a proactive pruning method using each vertex’s local neighborhood size to alleviate thread divergence, and a load-aware task scheduling framework to achieve load balance among threads within GPU warps and blocks. Our experimental results show that GMBE on an NVIDIA A100 GPU achieves a 70.6× speedup over the state-of-the-art parallel MBE algorithm ParMBE on a 96-core CPU machine.
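For readers unfamiliar with the problem, here is a brute-force definition-level reference (not GMBE itself): a pair (L, R) is a maximal biclique exactly when R is the common neighborhood of L and L is the common neighborhood of R, so one can enumerate subsets and apply that closure check.

```python
from itertools import combinations

def maximal_bicliques(adj, left):
    """Brute-force MBE on a bipartite graph. adj maps every vertex to
    its neighbor set; left is the left-side vertex set. Exponential in
    |left| -- GPU algorithms like GMBE exist because this cannot scale."""
    def common(nodes):
        sets = [set(adj[v]) for v in nodes]
        return set.intersection(*sets) if sets else set()

    found = set()
    for r in range(1, len(left) + 1):
        for L in combinations(sorted(left), r):
            R = common(L)
            # Closure check: (L, R) is maximal iff each side is exactly
            # the common neighborhood of the other.
            if R and common(R) == set(L):
                found.add((frozenset(L), frozenset(R)))
    return found
```

The branch-and-bound algorithms GMBE parallelizes explore this search space far more cleverly, which is where the memory, divergence, and load-balance challenges arise.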
Paper
Embracing Irregular Parallelism in HPC with YGM
Distributed Computing
Message Passing
Programming Frameworks and System Software
Best Student Paper Finalist
Description
YGM is a general-purpose asynchronous distributed computing library for C++/MPI, designed to handle the irregular data access patterns and small messages of graph algorithms and data science applications. It uses data serialization to provide an easy-to-use active message interface and message aggregation to maximize application throughput. Our design philosophy makes a tradeoff that increases network bandwidth utilization at the cost of added latency. We provide a suite of benchmarks showcasing YGM’s performance. Compared to similar distributed active message benchmark implementations that do not provide message buffering, we achieve over 10x throughput on thousands of cores at a latency cost that can be as small as 2x or as large as 100x, depending on the machine being used. For applications that can be written to be latency-tolerant, this represents a significant potential performance improvement from using YGM.
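The bandwidth-for-latency tradeoff comes from message aggregation, which can be sketched independently of MPI (a conceptual toy, not YGM's C++ API; `transport` stands in for the network send):

```python
class AggregatingChannel:
    """Sketch of message aggregation: buffer small messages per
    destination and ship them in batches. Each message waits until its
    batch fills (added latency) but the network sees few large sends
    instead of many tiny ones (higher effective bandwidth)."""
    def __init__(self, transport, batch_size=1024):
        self.transport = transport      # callable(dest, list_of_msgs)
        self.batch = batch_size
        self.buffers = {}

    def async_send(self, dest, msg):
        buf = self.buffers.setdefault(dest, [])
        buf.append(msg)
        if len(buf) >= self.batch:
            self.flush(dest)

    def flush(self, dest=None):
        """Drain one destination's buffer, or all of them."""
        dests = [dest] if dest is not None else list(self.buffers)
        for d in dests:
            if self.buffers.get(d):
                self.transport(d, self.buffers.pop(d))
```

Latency-tolerant applications simply keep issuing `async_send` calls and flush at synchronization points.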
Birds of a Feather
Enabling I/O and Computation Malleability in High-Performance Computing
Programming Frameworks and System Software
Description
Traditional interest in increasing the parallelism of individual jobs in HPC systems is conditioned by the diversity and dynamics of their resource demands at runtime. Malleability techniques can help dynamically adapt resource usage to achieve maximum efficiency. Malleable HPC systems, however, present a series of fundamental research challenges in the fields of resource management, scheduling, malleability control, flexibilization of application structures, and data movement. All of the aforementioned issues will be discussed in this Birds of a Feather session, which aims to build a community of developers and users around the topic of malleability in high-performance computing, networking, and storage.
Paper
Enabling Real World Scale Structural Superlubricity All-Atom Simulation on the Next-Generation Sunway Supercomputer
Accelerators
Applications
Architecture and Networks
Modeling and Simulation
Description
Molecular dynamics (MD) simulation provides an affordable way to inspect microscopic phenomena and is a powerful complement to real-world experiments. But the spatial scale of MD simulations is usually orders of magnitude smaller than that of experimental systems. In this paper, we present our work redesigning the widely used inter-layer potential in structural superlubricity. By designing a specialized neighbor list for inter-layer potential computation, the total amount of memory access is reduced significantly. In addition, a simple but efficient vectorization strategy is implemented based on the new neighbor list. In the extreme case, our work scales to 38 million cores to achieve a sustained performance of 61 PFLOPS, enabling a simulation of a superlubricity system of 32 um^2 with 7.2 billion atoms at 4.75 ns/day, which is 11,834 times the contact area of the largest previously reported superlubricity simulation and almost ten times faster in time-to-solution.
Tutorial
Energy-Efficient GPU Computing
Description
Energy efficiency has become a critical concern in High Performance Computing (HPC) and supercomputing, especially with the advent of exascale systems. The increasing demand for computational power and the associated energy consumption have led to a growing need for optimization techniques to reduce power consumption. GPUs, now the primary source of compute power in exascale supercomputers, contribute significantly to the overall energy expenditure of these systems. Consequently, the development and implementation of energy-efficient strategies for GPU applications are essential to reduce the environmental impact and operational costs of HPC facilities.
This tutorial offers a comprehensive introduction to energy-efficient computing in the context of HPC, focusing on GPU applications. As a participant, you will gain insight into code optimization techniques that improve energy efficiency, automatically explore performance-energy trade-offs using Kernel Tuner, dive into mixed-precision techniques, and learn how to write clean code for reduced-precision arithmetic on GPUs.
Finally, the tutorial addresses GPU clock frequency optimization as a means to improve energy efficiency, including how to find the optimal core clock frequency range. The hands-on approach of this tutorial enables participants to acquire valuable knowledge and practical experience in energy-efficient computing, essential for advancing environmentally sustainable and cost-effective HPC and supercomputing solutions.
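The frequency-optimization idea reduces to a small search once per-frequency measurements exist: energy is power times runtime, and the energy-optimal clock is usually below the performance-optimal one. A minimal sketch (the frequency table and numbers below are hypothetical; Kernel Tuner automates this measurement and search):

```python
def best_frequency(freqs, power_at, runtime_at):
    """Pick the core clock frequency (MHz) that minimizes energy,
    where energy = average power (W) x kernel runtime (s)."""
    return min(freqs, key=lambda f: power_at[f] * runtime_at[f])

# Hypothetical measurements for one GPU kernel:
freqs = [1000, 1400, 1800]
power = {1000: 150, 1400: 220, 1800: 320}     # watts
runtime = {1000: 12.0, 1400: 9.0, 1800: 8.0}  # seconds
```

Here the highest clock is fastest but costs the most joules, so the search trades a modest slowdown for a large energy saving.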
Paper
Enhance the Strong Scaling of LAMMPS on Fugaku
Accelerators
Applications
Architecture and Networks
Modeling and Simulation
Description
Physical phenomena such as protein folding require simulation up to microseconds of physical time, which directly depends on the strong scaling of molecular dynamics (MD) on modern supercomputers. In this paper, we present a highly scalable implementation of the state-of-the-art MD code LAMMPS on Fugaku by exploiting the 6D mesh/torus topology of the TofuD network. Based on our detailed analysis of the MD communication pattern, we first adopt coarse-grained peer-to-peer ghost-region communication with the uTofu interface, then further improve scalability via a fine-grained thread pool. Finally, remote direct memory access (RDMA) primitives are utilized to avoid buffer overhead. Numerical results show that our optimized code reduces communication time by 77%, improving the performance of baseline LAMMPS by factors of 2.9x and 2.2x for Lennard-Jones and embedded-atom method potentials when scaling to 36,846 computing nodes. Our optimization techniques can also benefit other applications that use stencil or domain decomposition methods.
Paper
Enhancing Adaptive Physics Refinement Simulations through the Addition of Realistic Red Blood Cell Counts
Applications
Modeling and Simulation
Description
Simulations of cancer cell transport require accurately modeling mm-scale and longer trajectories through a circulatory system containing trillions of deformable red blood cells, whose intercellular interactions require submicron fidelity. Using a hybrid CPU-GPU approach, we extend the advanced physics refinement (APR) method to couple a finely-resolved region of explicitly-modeled red blood cells to a coarsely-resolved bulk fluid domain. We further develop algorithms that: capture the dynamics at the interface of differing viscosities, maintain hematocrit within the cell-filled volume, and move the finely-resolved region and encapsulated cells while tracking an individual cancer cell. Comparison to a fully-resolved fluid-structure interaction model is presented for validation. Finally, we use the advanced APR method to simulate cancer cell transport over a mm-scale distance while maintaining a local region of RBCs, using a fraction of the computational power required to run a fully-resolved model.
Workshop
ESPM2 2023: Eighth International Workshop on Extreme Scale Programming Models and Middleware
Description
Next generation architectures and systems are characterized by high concurrency, low memory per-core, and multiple levels of hierarchy and heterogeneity. These characteristics bring out new challenges in performance, fault-tolerance, and scalability that must be tackled by next generation programming models and associated middleware/runtimes. This workshop focuses on different aspects of programming models for emerging domain-specific AI hardware (Cerebras, Habana, Graphcore, SambaNova etc.), task-based parallelism (Charm++, X10, etc.), PGAS (OpenSHMEM, UPC/UPC++, CAF, etc.), Deep Learning (PyTorch, TensorFlow, etc.), directive-based languages (OpenMP, OpenACC) and hybrid MPI+X, etc. It also focuses on their associated middleware (unified runtimes, interoperability for hybrid programming, tight integration of MPI+X, and support for accelerators and FPGAs) for next generation systems and architectures. The ultimate objective of the ESPM2 workshop is to serve as a forum that brings together researchers from academia and industry working in the areas of programming models, runtime systems, languages, and application developers.
Birds of a Feather
European HPC Ecosystem - Updates and Gap Analysis
Description
In recent years, the European HPC ecosystem has undergone profound changes. EuroHPC JU, a joint initiative between the EU, European countries, and private partners to develop a world-class supercomputing ecosystem in Europe, was created. PRACE is in the process of transforming itself into a European HPC User and Centre Association.
The objective of this BoF is to give an overview of the current state of European HPC activities. We will present and discuss with the different European HPC stakeholders the current state of play, future plans, and challenges, and critically analyze the European HPC offers and services.
Birds of a Feather
European RISC-V HPC and AI Pre-Exascale Accelerators
Architecture and Networks
Description
This BoF aims to foster discussion on RISC-V accelerators led by efforts on European accelerators for HPC and foster community interest in these projects. There are several accelerator efforts around the HPC community in Europe, many of them leveraging and fostering the RISC-V ecosystem. We will start with a short presentation (15 minutes) on a brief overview of current efforts and a quick insight into EUPILOT (part of the European Processor Initiative - EPI effort) to start the conversation. A Q&A session and open discussion with audience members will follow the introduction.
Workshop
ExaMPI: Workshop on Exascale MPI
Description
The aim of this workshop is to bring together researchers and developers to present and discuss innovative algorithms and concepts in the message passing programming model and to create a forum for open and potentially controversial discussions on the future of MPI in the exascale era and beyond.
Birds of a Feather
Example Projects of HPC Data Center Heat Reuse
State of the Practice
Description
Efficient energy usage of data centers has attracted attention locally, nationally, and globally. Many data centers are increasingly interested in reusing waste heat. Two organizations, CSC and NREL, will provide an overview of their cooling and heat reuse processes, with lessons learned from design, construction, and operations.
The session will outline the metrics (ERF, ERE, CoP etc.) used and foster discussion of standards, gaps and the different approaches. Both sites will highlight metrics, methodologies and how differences affect the calculations.
Audience discussion and Q&A are aimed at engaging the community to understand the potential for new waste heat reuse projects.
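As a toy illustration of how the metrics named above relate (Green Grid definitions; the energy figures are invented, not from CSC or NREL):

```python
# Illustrative heat-reuse metric calculations (hypothetical numbers).

def pue(total_facility_energy, it_energy):
    """Power Usage Effectiveness: total facility energy / IT energy."""
    return total_facility_energy / it_energy

def erf(reused_energy, total_facility_energy):
    """Energy Reuse Factor: fraction of facility energy reused elsewhere."""
    return reused_energy / total_facility_energy

def ere(total_facility_energy, it_energy, reused_energy):
    """Energy Reuse Effectiveness: (total - reused) / IT energy."""
    return (total_facility_energy - reused_energy) / it_energy

total, it, reused = 13.0, 10.0, 4.0  # GWh/year, made-up site
print(pue(total, it))                # 1.3
print(ere(total, it, reused))        # 0.9: reuse pushes ERE below PUE
# Identity relating the three: ERE = PUE * (1 - ERF)
assert abs(ere(total, it, reused) - pue(total, it) * (1 - erf(reused, total))) < 1e-12
```

An ERE below 1.0, as in this sketch, is only reachable through heat reuse, which is why the two sites' methodologies for counting reused energy matter so much to the comparison.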
ACM Gordon Bell Finalist
Awards
Exascale Multiphysics Nuclear Reactor Simulations for Advanced Designs
TP
Description
ENRICO is a coupled application developed under the US Department of Energy’s Exascale Computing Project (ECP) targeting the modeling of advanced nuclear reactors. It couples radiation transport with heat and fluid simulation, including the high-fidelity, high-resolution Monte Carlo code Shift and the computational fluid dynamics code NekRS. NekRS is based on rapidly convergent high-order spectral element discretizations that feature minimal numerical dissipation and dispersion.
On Frontier, NekRS has recently achieved an unprecedented milestone, surpassing 1 billion spectral elements and 350 billion degrees of freedom. Shift has demonstrated the capability to transport upwards of 1 billion particles per second in full-core nuclear reactor simulations featuring complete temperature-dependent, continuous-energy physics on Frontier. Shift achieved a weak-scaling efficiency of 97.8% on 8,192 nodes of Frontier and calculated 6 reactions in 214,896 fuel pin regions to below 1% statistical error, yielding first-of-a-kind resolution for a Monte Carlo transport application.
Panel
Exascale Software Ecosystems: States of the Unions and SWOT Analysis
Exascale
Heterogeneous Computing
Software Engineering
Description
This panel brings together experts and leads from national exascale initiatives around the globe, focusing on stacks encompassing algorithms to system-level software, to share their insights and experiences and to identify synergies for collaboration. Exascale systems being deployed and on the horizon feature diversity and heterogeneity not only of hardware but of software ecosystems. On one hand, the variety of accelerator technologies, alongside processor, memory, networking, and storage configurations, poses challenges for algorithm developers, domain-specific language and library architects, and performance engineers. On the other hand, there are expectations to support modern software development and delivery tools for reproducibility, portability, efficiency, and security to fulfill the edge-to-cloud-to-supercomputing continuum requirements for workflows. Against this backdrop, national initiatives are prioritizing and funding a diverse portfolio of efforts to address these programmatic needs, which the panel will reflect on as a SWOT (strengths, weaknesses, opportunities, and threats) analysis.
Paper
Experiences Readying Applications for Exascale
Exascale
Large Scale Systems
State of the Practice
Best Paper Finalist
Description
The advent of exascale computing invites an assessment of existing best practices for developing application readiness on the world’s largest supercomputers. This work details observations from the last four years in preparing scientific applications to run on the Oak Ridge Leadership Computing Facility's (OLCF) Frontier system. This paper addresses a range of topics in software including programmability, tuning, and portability considerations that are key to moving applications from existing systems to future installations. A set of representative workloads provides case studies for general system and software testing. We evaluate the use of early access systems for development across several generations of hardware. Finally, we discuss how best practices were identified and disseminated to the community through a wide range of activities, including user guides and training sessions. We conclude with recommendations for ensuring application readiness on future leadership computing systems.
Paper
Experimental Evaluation of Xanadu X8 Photonic Quantum Computer: Error Measurement, Characterization, and Implications
Post-Moore Computing
Quantum Computing
Best Paper Finalist
Description
Among the various types of quantum computers, photonic quantum computers have shown great potential due to their high degree of scalability. However, the development of photonic quantum computers is still in its infancy, and the characterization of their performance is of critical importance to guide further improvements. In this work, we present the first characterization and insights derived from Xanadu's X8 photonic quantum computer. Our work represents an important step toward the development of practical and scalable photonic quantum computers.
ACM Gordon Bell Finalist
Awards
Exploring the Ultimate Regime of Turbulent Rayleigh–Bénard Convection through Unprecedented Spectral-Element Simulations
TP
Description
We detail our developments in the high-fidelity spectral-element code Neko that are essential for unprecedented large-scale direct numerical simulations of fully developed turbulence. Major innovations are a modular multi-backend design enabling performance portability across a wide range of GPUs and CPUs, a GPU-optimized preconditioner with task overlapping for the pressure-Poisson equation, and in-situ data compression. We carry out initial runs of Rayleigh–Bénard Convection (RBC) at extreme scale on the LUMI and Leonardo supercomputers. We show how Neko is able to strongly scale to 16,384 GPUs and obtain results that are not possible without careful consideration and optimization of the entire simulation workflow. These developments in Neko will help resolve the long-standing question regarding the ultimate regime in RBC.
Paper
FASDA: An FPGA-Aided, Scalable, and Distributed Accelerator for Range-Limited Molecular Dynamics
Accelerators
Applications
Architecture and Networks
Modeling and Simulation
Description
Conducting long-timescale simulations of small molecules using Molecular Dynamics (MD) is crucial in drug design. However, traditional methods to accelerate the process, including ASICs or GPUs, have limitations: ASIC solutions are not generally available, while GPU solutions may not scale when processing small molecules. FPGAs are both communication processors and accelerators, with tight coupling between these capabilities, and so could be used to address strong scaling in this domain.
We present FASDA, the first FPGA-based MD accelerator available for community development. FASDA enables the use of FPGA-enhanced clusters and clouds to execute range-limited MD, the most resource-intensive and computation-demanding component of MD. FASDA is built from a series of pluggable components that are adjustable based on user requirements, and it demonstrates nearly linear scaling on an eight-FPGA cluster. It outperforms the state-of-the-art GPU solution by 4.67x, with the resulting prospect of significantly reducing lead evaluation time.
Tutorial
Fault-Tolerance for High-Performance and Big Data Applications: Theory and Practice
Description
Resilience is a critical issue for large-scale platforms. This tutorial provides a comprehensive survey of fault-tolerant techniques for high-performance and big data applications, with a fair balance between theory and practice. This tutorial is organized across four main topics:
(i) Overview of failure types (software/hardware, transient/fail-stop), and typical probability distributions (Exponential, Weibull, Log-Normal);
(ii) General-purpose techniques, which include several checkpoints and rollback recovery protocols, replication, prediction, and silent error detection;
(iii) Application-specific techniques, such as user-level in-memory checkpointing, data replication (map-reduce), or fixed-point convergence for iterative applications (back-propagation);
(iv) Practical deployment of fault tolerance techniques with User Level Fault Mitigation (MPI standard extension). Relevant examples will include widely used routines such as Monte-Carlo methods, SPMD stencil, map-reduce, and back-propagation in neural networks.
A step-by-step approach will show how to protect these routines and make them fault-tolerant, using a variety of techniques, in a hands-on session.
The tutorial is open to all SC23 attendees who are interested in the current status and expected promise of fault-tolerant approaches for scientific and big data applications. There are no audience prerequisites: background will be provided for all protocols and probabilistic models. However, basic knowledge of MPI will be helpful for the hands-on session.
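The application-specific techniques in topic (iii) can be previewed with a minimal sketch of user-level checkpoint/restart for an iterative routine (an illustrative serial example, not the tutorial's actual hands-on code; the file name and checkpoint interval are arbitrary):

```python
# Minimal application-level checkpoint/restart for an iterative loop.
import os
import pickle
import tempfile

CKPT = os.path.join(tempfile.gettempdir(), "demo.ckpt")

def save_checkpoint(state, path=CKPT):
    # Write to a temp file, then atomically rename, so a crash
    # mid-write leaves the previous checkpoint intact.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path=CKPT):
    # Restart from the last checkpoint if one exists, else start fresh.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"iteration": 0, "x": 0.0}

if os.path.exists(CKPT):          # clean slate for this demo run
    os.remove(CKPT)

state = load_checkpoint()
for i in range(state["iteration"], 100):
    state["x"] += 1.0             # one step of the "solver"
    state["iteration"] = i + 1
    if state["iteration"] % 10 == 0:   # periodic checkpoint
        save_checkpoint(state)
print(state["iteration"], state["x"])  # 100 100.0
```

If the process dies between checkpoints, rerunning the script resumes from the last saved iteration rather than from zero; choosing the checkpoint interval to balance I/O cost against recomputation is exactly the kind of trade-off the probabilistic models in topic (i) quantify.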
Paper
Fine-Grained Policy-Driven I/O Sharing for Burst Buffers
Data Analysis, Visualization, and Storage
I/O and File Systems
State of the Practice
Description
A burst buffer is commonly deployed on large-scale supercomputers to bridge the performance gap between the shared file system and the I/O needs of modern supercomputing applications. Existing I/O sharing methods either require resource isolation, offline profiling, or repeated execution that significantly limit the utilization and applicability of these systems. Here we present ThemisIO, a policy-driven I/O sharing framework for a remote-shared burst buffer. ThemisIO can accurately and efficiently allocate I/O cycles among applications purely based on real-time I/O behavior, without requiring user-supplied information or offline-profiled application characteristics. By exploiting a statistical token-based strategy, ThemisIO can precisely balance I/O cycles between applications via time slicing to enforce processing isolation, enabling a variety of fair sharing policies. Our experiments show that ThemisIO sustains 13.5–13.7% higher I/O throughput and 19.5–40.4% lower performance variation than existing algorithms. For applications, ThemisIO significantly reduces or nearly eliminates the slowdown caused by I/O interference.
Workshop
First International Workshop on HPC Testing and Evaluation of Systems, Tools, and Software (HPCTESTS 2023)
Description
This workshop brings together HPC researchers, practitioners, and vendors from around the globe to present and discuss state-of-the-art HPC system testing methodologies, tools, benchmarks, tests, procedures, and best practices. The increasing complexity of HPC architectures requires a larger number of tests to thoroughly evaluate the status of a system after installation or a software upgrade, before it is transitioned to production users. Therefore, HPC centers and vendors use different methodologies to evaluate their systems throughout their lifetime, not only during installation and acceptance but also regularly during maintenance windows. This workshop will provide a venue to present and discuss the latest HPC system test technologies. The event will include a keynote focused on current HPC system testing topics, followed by a series of paper presentations from peer-reviewed accepted submissions, and will conclude with a panel discussion.
Birds of a Feather
First Steps Toward Adopting Direct Liquid Cooled HPC
State of the Practice
Description
Liquid cooling mitigates the effects of heat density, reduces energy consumption and increases performance. It is now a requirement to stay on the chip technology roadmap. After a decade's experience with liquid cooling in large-scale supercomputing centers, many data centers are still facing challenges with adoption. Building on deep expertise from major supercomputing centers, this BoF will present recommendations for initial adoption of direct liquid cooling (DLC). See https://sites.google.com/lbl.gov/ee-hpc-wg-liquid-cooling/home. There will be presentations on experiences from sites that have just adopted DLC. We are expecting a lot of audience discussion and networking that extends beyond the BoF.
Paper
FISCO-BCOS: An Enterprise-Grade Permissioned Blockchain System with High-Performance
Cloud Computing
Distributed Computing
Energy Efficiency
Performance Measurement, Modeling, and Tools
Description
Enterprise-grade permissioned blockchain systems provide a promising infrastructure for data sharing and cooperation between different companies. However, performance bottlenecks seriously hinder the adoption of these systems in many industrial applications that process complex business logic and huge transaction volumes.
In this paper, we present FISCO-BCOS, an enterprise-grade permissioned blockchain system with high performance. We conducted experiments on two popular test platforms and compared FISCO-BCOS with state-of-the-art platforms in academia and industry such as BIDL and Hyperledger Fabric (HLF). The results show that FISCO-BCOS achieves 7.4 times and 28.4 times the throughput of BIDL and HLF, respectively, at half their latency. FISCO-BCOS has already been used in over 300 different large-scale industrial scenarios and has become one of the most popular permissioned blockchains.
Paper
FORGE: Pre-Training Open Foundation Models for Science
Artificial Intelligence/Machine Learning
Applications
Modeling and Simulation
State of the Practice
Description
Large language models (LLMs) are poised to revolutionize the way we conduct scientific research, yet their complexity and cost hinder adoption by the wider science community. Identifying suitable scientific use cases, optimizing model and data sizes, and scaling up training are among the most pressing issues. Here we provide practical solutions for building and using LLM-based foundation models targeting scientific use cases. We present an end-to-end examination of the effectiveness of LLMs in scientific research, including their scaling behavior and computational requirements on Frontier, the first exascale supercomputer. We have also developed for release to the scientific community a suite of open foundation models called FORGE with up to 26B parameters using 257B tokens from over 200M scientific articles. We have demonstrated the use and effectiveness of FORGE on scientific downstream tasks. Our research establishes best practices that can be applied across various fields to utilize LLMs for scientific discovery.
Workshop
Fourth International Symposium on Checkpointing for Supercomputing (SuperCheck-SC23)
Description
As a primary approach to fault-tolerant computing, Checkpoint/Restart (C/R) is essential to a wide range of HPC communities. While there has been much C/R research and tools development, continued C/R research is indispensable to keep pace with ever-changing HPC architectures, technologies, and workloads. More effort is also needed to narrow the gap between proof-of-concept C/R research codes and production-quality codes capable of deployment in real-world workloads. In this workshop, we will bring together C/R researchers and tools developers, practitioners, application developers, and end users to focus on C/R research and successes in production use, motivating the development of usable C/R tools, the closing of the gap between state-of-the-art research and production, and the harnessing of the full benefits of C/R for the HPC community. Paper submissions will be peer-reviewed, and a venue for accepted papers will be identified. We especially encourage PhD students and HPC end users to participate.
Workshop
Fourth International Workshop on Quantum Computing Software
Description
Quantum computing is emerging as a remarkable technology that promises to achieve major scientific breakthroughs. This includes solving complex problems whose solution lies well beyond contemporary and even future supercomputers based on conventional technologies. Interacting with these quantum computers, including noisy-intermediate scale quantum devices, for both basic and applied research will require a unique collection of software tools.
The purpose of this workshop is to explore the innovative software needed to make quantum computing practical and accessible. The workshop will focus heavily on the tools and software for quantum computing with a particular emphasis on realized implementations.
Topics of interest for this workshop include but are not limited to: Languages, Compilers/Profilers, Quantum Machine Learning Software, Numerical Simulators, Workflows, Debugging/Verification, and Optimal Quantum Control Software.
Topics that are not relevant to the workshop include domain-specific applications of quantum computing, development of quantum computing hardware or devices, and benchmarking of quantum computers.
Workshop
Fourth Workshop on Heterogeneous Memory Systems (HMEM)
Description
Heterogeneous memory architectures have recently emerged and revolutionized the traditional memory hierarchy. Today’s architectures may comprise multiple memory technologies next to DRAM, such as 3D-stacked memory, high-bandwidth multi-channel RAM, persistent memory, or Compute Express Link (CXL)-based architectures.
Even though heterogeneous memory architectures can benefit applications in terms of improved performance, energy efficiency, and cost trade-offs, exploiting the full potential of such complex architectures poses significant challenges. Since heterogeneous memory architectures dramatically disrupt the memory hierarchy assumptions that have guided decades of system and software design, we need to rethink solutions across all layers of the system and software stack to embrace the new era of memory heterogeneity and satisfy the demands of modern applications.
As in previous years, the Workshop on Heterogeneous Memory Systems (HMEM) will serve as a forum to bring together researchers from the HPC community to present and discuss ongoing research around heterogeneous memory systems.
Tutorial
From Zero to Hero: Conquering the Arm Neoverse
Description
Arm technology has increasingly become a compelling choice for HPC due to its promise of higher efficiency, density, scalability, and a broad ecosystem of software. Arm's expansion in the datacenter started in 2018 with Arm Neoverse, a set of infrastructure CPU IPs designed for high-end computing. The Arm-based Fugaku supercomputer, the first of its kind to implement the Arm SVE instruction set, entered the TOP500 in June 2020 at the top position and has retained a leadership position over the years not only in HPL but also in HPCG (where it is still unbeaten). This event was a wake-up call for the HPC community. The datacenter and HPC space have long been dominated by x86 CPUs, and there is growing interest in diversifying and exploring new architectures to re-create the vibrant and diverse ecosystem of architectures that existed more than a decade ago. Arm technology is at the forefront of this wave of change. This tutorial welcomes scientists and engineers interested in running a variety of workloads on an Arm-based system, either on-premises or in the cloud. The tutorial will guide attendees through compiling, executing, profiling, and optimizing codes for Arm, demystifying the claim that changing CPU architecture is hard.
Paper
Frontier: Exploring Exascale
Exascale
Large Scale Systems
State of the Practice
Best Paper Finalist
Description
As the US Department of Energy (DOE) computing facilities began deploying petascale systems in 2008, DOE was already setting its sights on exascale. In that year, DARPA published a report on the feasibility of reaching exascale. The report authors identified several key challenges in the pursuit of exascale including power, memory, concurrency, and resiliency. That report informed the DOE's computing strategy for reaching exascale. With the deployment of Oak Ridge National Laboratory's Frontier supercomputer, we have officially entered the exascale era. In this paper, we discuss Frontier's architecture, how it addresses those challenges, and describe some early application results from Oak Ridge Leadership Computing Facility's Center of Excellence and the Exascale Computing Project.
Workshop
Future Is Sparse: Methods and Tools for Sparse Computations
Description
Many real-world computations involve sparse data structures in the form of sparse matrices, graphs, or sparse tensors. In the computational sciences, sparse matrices are commonly used for numerically solving partial differential equations. Likewise, many approaches to deep learning on graph representations, namely graph neural networks (GNNs), have been proposed. More general multi-dimensional sparse tensors are currently at the heart of data-driven fields such as AI and deep learning. Achieving high performance on such sparse data structures is well known to be challenging and is still an open research problem. This workshop will gather a group of experts researching various aspects of this topic and aims to present attendees with an overview of state-of-the-art research activities. More importantly, the workshop will provide a forum for interactions between SC participants, so that new ideas can be generated to push forward the state of the art.
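The canonical sparse kernel behind PDE solvers is sparse matrix-vector multiplication over a compressed sparse row (CSR) matrix; a toy pure-Python version (illustrative only, not from the workshop) shows why it is hard to make fast, since memory access through `indices` is irregular:

```python
# Toy CSR (compressed sparse row) sparse matrix-vector product y = A @ x.

def csr_spmv(indptr, indices, data, x):
    """indptr[row]..indptr[row+1] delimits row's nonzeros in data/indices."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]   # gather: irregular access to x
    return y

# A = [[4, 0, 1],
#      [0, 3, 0],
#      [2, 0, 5]]   stored as three flat arrays:
indptr  = [0, 2, 3, 5]
indices = [0, 2, 1, 0, 2]
data    = [4.0, 1.0, 3.0, 2.0, 5.0]
print(csr_spmv(indptr, indices, data, [1.0, 1.0, 1.0]))  # [5.0, 3.0, 7.0]
```

The indirect load `x[indices[k]]` defeats hardware prefetching and vectorization, which is exactly the class of problem the workshop's methods and tools target.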
Paper
FuzzyFlow: Leveraging Dataflow to Find and Squash Program Optimization Bugs
Compilers
Performance Measurement, Modeling, and Tools
Performance Optimization
Programming Frameworks and System Software
Description
The current hardware landscape and application scale is driving performance engineers toward writing bespoke optimizations. Verifying such optimizations, and generating minimal failing cases, is important for robustness in the face of changing program conditions, such as inputs and sizes. However, isolation of minimal test-cases from existing applications and generating new configurations are often difficult due to side effects on the system state, mostly related to dataflow. This paper introduces FuzzyFlow: a fault localization and test case extraction framework designed to test program optimizations. We leverage dataflow program representations to capture a fully reproducible system state and area-of-effect for optimizations to enable fast checking for semantic equivalence. To reduce testing time, we design an algorithm for minimizing test inputs, trading off memory for recomputation. We demonstrate FuzzyFlow on exemplary use cases in real-world applications where the approach provides up to 528 times faster optimization testing and debugging compared to traditional approaches.
Birds of a Feather
Go with the (Energy) Flow: Adaptive Capacity Computing
Middleware and System Software
Description
The increasing reliance on inherently variable green energy is poised to impact HPC centers fundamentally: they cannot count on a guaranteed supply of grid power, yet they could play a significant role in stabilizing the grid by quickly adapting their load.
“Adaptive Capacity Computing” touches on system architecture, hardware, scheduling and resource management, programming models, and applications with the objective of enabling future HPC centers to react gracefully to varying power profiles, achieving optimal throughput and avoiding loss of computational state wherever possible.
This BoF discusses challenges and approaches to support this paradigm, should it become necessary to do so.
Paper
Graph3PO: A Temporal Graph Data Processing Method for Latency QoS Guarantee in Object Cloud Storage System
Cloud Computing
Data Analysis, Visualization, and Storage
Graph Algorithms and Frameworks
Best Paper Finalist
Description
Object cloud storage systems are deployed with diverse applications that have varying latency service level objectives (SLOs), posing challenges for supporting quality of service with limited storage resources. Existing methods provide prediction-based recommendations for dispatching requests from applications to storage devices, but the prediction accuracy can be affected by complex system topology. To address this issue, Graph3PO is designed to combine storage device queue information with system topological information to form a temporal graph, which can accurately predict device queue states. Additionally, Graph3PO contains an urgency degree model and a cost model for measuring SLO violation risks and the penalties of scheduling requests on storage device queues. When the urgency degree of a request exceeds a threshold, Graph3PO determines whether to schedule it in the queue or initiate a hedge request to another storage device. Experimental results show that Graph3PO outperforms its competitors, with SLO violation rates 2.8 to 201.1 times lower.
Paper
GRAPHINE: Enhanced Neutral Atom Quantum Computing Using Application-Specific Rydberg Atom Arrangement
Post-Moore Computing
Quantum Computing
Best Paper Finalist
Description
Multiple technologies for realizing quantum computing are currently under development. Neutral atom quantum computing is one such promising technology; it offers advantages such as the ability to perform long-distance interactions and gates consisting of more than two qubits. A particular advantage it provides is the flexibility to arrange the qubits in different topologies by customizing atom layouts. We design GRAPHINE, which, to the best of our knowledge, is the first technique to leverage this flexibility to design application-specific topologies for different quantum algorithms based on the structural characteristics of the algorithm circuits. This enables GRAPHINE to improve key performance metrics like the number of gates and pulses by up to 56% and the probability of error by up to 42% on average over widely-used topology designs.
Paper
GraphSet: High Performance Graph Mining through Equivalent Set Transformations
Accelerators
Applications
Graph Algorithms and Frameworks
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
Description
Graph mining is of critical use in a number of fields such as social networks, knowledge graphs, and fraud detection. Because graph mining is NP-complete, improving computational performance is the main target of current optimizations. Owing to their excellent performance, state-of-the-art graph mining systems mainly rely on pattern-aware algorithms. Despite previous efforts, the complex control flows introduced by pattern-aware algorithms bring large overhead and also impede further acceleration on heterogeneous hardware.
To address these challenges, we propose a set-based equivalent transformation approach for the optimization of pattern-aware graph mining applications, which can leverage set properties to eliminate most control flows and reduce computation overhead exponentially. We implement a high-performance pattern-aware graph mining system supporting both CPU and GPU, namely GraphSet, to automatically apply these transformations. Evaluation results show that GraphSet outperforms state-of-the-art cross-platform and hardware-specific graph mining frameworks by up to 3384.1x and 243.2x (18.0x and 10.2x on average), respectively.
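The flavor of replacing nested control flow with set operations can be seen in a toy triangle count via neighbor-set intersection (illustrative only; GraphSet's actual transformations are far more general):

```python
# Counting triangles with set intersections instead of a triply nested
# enumeration loop: each edge (u, v) contributes one triangle per common
# neighbor, and the data-parallel intersection replaces branchy control flow.

def triangles(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    count = 0
    for u, v in edges:
        count += len(adj[u] & adj[v])  # common neighbors close triangles
    return count // 3                  # each triangle seen once per edge

edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
print(triangles(edges))  # 1
```

The per-edge intersection has no data-dependent branching inside it, which is what makes this style amenable to GPUs and other accelerators.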
Paper
GreenNFV: Energy-Efficient Network Function Virtualization with Service Level Agreement Constraints
Cloud Computing
Distributed Computing
Energy Efficiency
Green Computing
Programming Frameworks and System Software
State of the Practice
Sustainability
Description
Network Function Virtualization (NFV) platforms consume significant energy, introducing high operational costs in edge and data centers. This paper presents a novel framework called GreenNFV that optimizes resource usage for network function chains using deep reinforcement learning. GreenNFV optimizes resource parameters such as CPU sharing ratio, CPU frequency scaling, last-level cache (LLC) allocation, DMA buffer size, and packet batch size. GreenNFV learns the resource scheduling model from benchmark experiments and takes Service Level Agreements (SLAs) into account to optimize resource usage models based on different throughput and energy consumption requirements. Our evaluation shows that GreenNFV models achieve high transfer throughput and low energy consumption while satisfying various SLA constraints. Specifically, GreenNFV with Throughput SLA can achieve 4.4x higher throughput and 1.5x better energy efficiency over the baseline settings, whereas GreenNFV with Energy SLA can achieve 3x higher throughput while reducing energy consumption by 50%.
Paper
Hanayo: Harnessing Wave-Like Pipeline Parallelism for Enhanced Large Model Training Efficiency
Artificial Intelligence/Machine Learning
Description
Large-scale language models have become increasingly challenging and expensive to train. Among various methods addressing this issue, Pipeline Parallelism has been widely employed to accommodate massive model weights within limited GPU memory. This paper introduces Hanayo, a wave-like pipeline parallelism strategy that boasts a concise structure and practical applicability, alongside a high-performance pipeline execution runtime to tackle the challenges of pipeline strategy implementation. Hanayo mitigates the issues of pipeline bubbles and excessive memory consumption prevalent in existing schemes, without resorting to model duplicates as in Chimera. Our evaluation, conducted on four distinct computing clusters and involving both GPT-like and BERT-like architectures with up to 32 GPUs, demonstrates up to a 30.4% increase in throughput compared to the state-of-the-art approach.
Tutorial
Hands-On HPC Application Development Using C++ and SYCL
Description
SYCL is a programming model that lets developers support a wide variety of devices (CPUs, GPUs, and more) from a single code base. Given the growing heterogeneity of processor roadmaps, moving to an open standard, platform-independent model such as SYCL is essential for modern software developers. SYCL has the further advantage of supporting a single-source style of programming from completely standard C++.
In this tutorial, we will introduce SYCL and provide programmers with a solid foundation they can build on to gain mastery of this language. The main benefit of using SYCL over other heterogeneous programming models is the single programming language approach, which enables one to target multiple devices using the same programming model, and therefore to have a cleaner, portable, and more readable code.
This is a hands-on tutorial. The real learning will happen as students write code. The format will be short presentations followed by hands-on exercises. Hence, attendees will require their own laptop to perform the hands-on exercises.
Tutorial
Hands-On Practical Hybrid Parallel Application Performance Engineering
Description
This tutorial presents state-of-the-art performance tools for leading-edge HPC systems founded on the community-developed Score-P instrumentation and measurement infrastructure, demonstrating how they can be used for performance engineering of effective scientific applications based on standard MPI, OpenMP, hybrid combination of both, and increasingly common usage of accelerators. Parallel performance tools from the Virtual Institute – High Productivity Supercomputing (VI-HPS) are introduced and featured in hands-on exercises with Score-P, Scalasca, Vampir, and TAU. We present the complete workflow of performance engineering, including instrumentation, measurement (profiling and tracing, timing and PAPI hardware counters), data storage, analysis, tuning, and visualization. Emphasis is placed on how tools are used in combination for identifying performance problems and investigating optimization alternatives. Using their own notebook computers, participants will conduct exercises on a contemporary HPC system where remote access will be provided for the hands-on sessions through AWS running an E4S [http://e4s.io] image containing all of the necessary tools. This image supports NVIDIA GPUs using CUDA 12 and Python. This will help to prepare participants to locate and diagnose performance bottlenecks in their own parallel programs.
Birds of a Feather
HDF5: Building on 25 years of success
Data Analysis, Visualization, and Storage
Description
HDF5 is a critical I/O library for scientific applications. It has been 25 years since its first release in November 1998. HDF5’s sustainability and adaptation to today’s computational and storage environments would not be possible without feedback and contributions from the HDF5 community. We will begin with a panel whose members will present case studies on how they use, or would like to use, HDF5 in current and emerging computational environments. We will then invite our community members to discuss the roadmap, how to contribute to HDF5, and what is required to sustain HDF5 for another 25 years.
Paper
HEAR: Homomorphically Encrypted Allreduce
Distributed Computing
Message Passing
Programming Frameworks and System Software
Best Student Paper Finalist
Description
Allreduce is one of the most commonly used collective operations. Its latency and bandwidth can be improved by offloading the calculations to the network. However, no way exists to conduct such offloading securely; in state-of-the-art solutions, the data is passed unprotected into the network. Security is a significant concern for High-Performance Computing applications, but achieving it while maintaining performance remains challenging. We present HEAR, the first high-performance system for securing in-network compute and Allreduce operations based on homomorphic encryption. HEAR implements carefully designed and modified encryption schemes for the most common Allreduce functions and leverages communication domain knowledge in MPI programs to obtain encryption and decryption routines with high performance. HEAR operates on integers and floats with no changes to the code base and little or no hardware changes. We design and evaluate HEAR, showing its minimal overhead, and open-source our implementation. HEAR represents the first step towards achieving confidential HPC.
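The core idea, that ciphertexts can be summed inside the network and decrypted only at the end, can be illustrated with a toy additive-masking scheme. This is a didactic sketch, not HEAR's actual construction; the function names and the modulus are invented for illustration:

```python
import random

MOD = 2**32  # toy modulus; a real scheme chooses parameters for security

def keygen(n_ranks, seed=0):
    """One additive pad per rank; the decrypting side knows the pads' sum."""
    rng = random.Random(seed)
    return [rng.randrange(MOD) for _ in range(n_ranks)]

def encrypt(x, pad):
    return (x + pad) % MOD

def decrypt_sum(ct_sum, pads):
    return (ct_sum - sum(pads)) % MOD

# Each "rank" contributes one value; the switch sums ciphertexts blindly.
values = [3, 7, 11, 21]
pads = keygen(len(values))
ciphertexts = [encrypt(v, p) for v, p in zip(values, pads)]
in_network_sum = sum(ciphertexts) % MOD   # the network never sees plaintexts
result = decrypt_sum(in_network_sum, pads)
assert result == sum(values)  # 3 + 7 + 11 + 21 = 42
```

Because encryption here is modular addition, the sum of ciphertexts equals the ciphertext of the sum, so the network can reduce values it cannot read.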
Birds of a Feather
High Performance Computing for Environmental and Earth Sciences
Applications
Description
The growth of climate and Earth data brings a pressing need to enhance techniques for its handling with HPC. This is crucial for our understanding of the coupling between the solid Earth, atmosphere, hydrology, and oceans, enabling proactive responses to extremes through improved forecasting of weather, climate change, and sudden disasters like earthquakes. This BoF will discuss the HPC community’s interface with Earth data and related engagements of stakeholder communities in climate, environmental, and Earth sciences, including the mathematics and spatial statistics that are involved. It endeavors to begin a targeted approach to the democratization of HPC for Earth sciences.
Workshop
High Performance Python for Science at Scale
Description
This workshop aims to connect researchers, developers, and Python practitioners to share their experiences scaling Python applications and codes on supercomputers. The goal is to provide a platform for topical discussion of best practices, hands-on demonstrations, and community engagement via open-source contributions to new libraries, runtimes, and frameworks. Based on keynote talks that survey and summarize the best practices and recent success stories, panel sessions that discuss details of implementation and live demo sessions for hands-on enthusiasts – the workshop will serve as a requirement gathering exercise for the future of Python in HPC and science.
Paper
High Throughput Training of Deep Surrogates from Large Ensemble Runs
Artificial Intelligence/Machine Learning
Description
Recent years have seen a surge in deep learning approaches to accelerate numerical solvers, which provide faithful but computationally intensive simulations of the physical world. These deep surrogates are generally trained in a supervised manner from limited amounts of data slowly generated by the same solver they intend to accelerate. We propose an open-source framework that enables the online training of these models from a large ensemble run of simulations. It leverages multiple levels of parallelism to generate rich datasets. The framework avoids I/O bottlenecks and storage issues by directly streaming the generated data. A training reservoir mitigates the inherent bias of streaming while maximizing GPU throughput. Experiment on training a deep surrogate for the heat equation shows the proposed approach enables training on 8TB of data in 2 hours with an accuracy improved by 47% and a batch throughput multiplied by 13 compared to a traditional offline procedure.
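The training reservoir described above mitigates the bias inherent in streaming. The paper's reservoir is purpose-built for GPU throughput, but the underlying idea resembles classic reservoir sampling (Vitter's Algorithm R), sketched here purely as an illustration:

```python
import random

def reservoir_update(reservoir, item, seen, capacity, rng):
    """Algorithm R: after seeing `seen` items, each one is in the
    reservoir with probability capacity/seen, a uniform stream sample."""
    if len(reservoir) < capacity:
        reservoir.append(item)
    else:
        j = rng.randrange(seen)
        if j < capacity:
            reservoir[j] = item

rng = random.Random(0)
reservoir = []
for i, sample in enumerate(range(10_000), start=1):
    reservoir_update(reservoir, sample, i, capacity=100, rng=rng)

assert len(reservoir) == 100  # bounded memory regardless of stream length
```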
Paper
High-Performance and Programmable Attentional Graph Neural Networks with Global Tensor Formulations
Artificial Intelligence/Machine Learning
Compilers
Performance Measurement, Modeling, and Tools
Performance Optimization
Programming Frameworks and System Software
Tensors
Description
Graph attention models (A-GNNs), a type of Graph Neural Network (GNN), have been shown to be more powerful than simpler convolutional GNNs (C-GNNs). However, A-GNNs are more complex to program and difficult to scale. To address this, we develop a novel mathematical formulation, based on tensors that group all the feature vectors, targeting both training and inference of A-GNNs. The formulation enables straightforward adoption of communication-minimizing routines, fosters optimizations such as vectorization, and enables seamless integration with established linear algebra DSLs or libraries such as GraphBLAS. Our implementation uses a data redistribution scheme explicitly developed for the sparse-dense tensor operations used heavily in GNNs, and fuses optimizations that further minimize memory usage and communication cost. We ensure theoretical asymptotic reductions in communicated data compared to the established message-passing GNN paradigm. Finally, we provide excellent scalability and speedups of >5x over modern libraries such as Deep Graph Library.
Paper
High-Performance SVD Partial Spectrum Computation
Algorithms
Linear Algebra
Post-Moore Computing
Description
We introduce a new singular value decomposition (SVD) solver based on the QR-based Dynamically Weighted Halley (QDWH) algorithm for computing partial-spectrum SVD (QDWHpartial-SVD) problems. By optimizing the rational function underlying the algorithm only in the desired part of the spectrum, QDWHpartial-SVD efficiently computes a fraction (say 1-20%) of the most significant singular values/vectors. We develop a high-performance implementation of QDWHpartial-SVD on distributed-memory manycore systems and demonstrate its numerical robustness. We perform a benchmarking campaign against counterparts from state-of-the-art numerical libraries across various matrix sizes using up to 36K MPI processes. Experimental results show performance speedups for QDWHpartial-SVD of up to 6X and 2X against PDGESVD from ScaLAPACK and KSVD, respectively. We also report energy consumption for these algorithms and demonstrate how QDWHpartial-SVD can further outperform PDGESVD in that regard by performing fewer memory-bound operations.
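QDWHpartial-SVD itself builds on dynamically weighted Halley iterations for the polar decomposition; as a far simpler illustration of the general idea of computing only a leading part of the spectrum rather than the full SVD, here is textbook power iteration on A^T A (an illustrative stand-in, not the paper's algorithm):

```python
def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def norm(x):
    return sum(v * v for v in x) ** 0.5

def top_singular_value(A, iters=200):
    """Power iteration on A^T A converges to the leading right singular
    vector of A; the norm of A x then gives the largest singular value."""
    At = transpose(A)
    x = [1.0] * len(A[0])
    for _ in range(iters):
        y = matvec(At, matvec(A, x))
        n = norm(y)
        x = [v / n for v in y]
    return norm(matvec(A, x))

A = [[3.0, 0.0], [0.0, 2.0]]
assert abs(top_singular_value(A) - 3.0) < 1e-6
```

Repeating such an iteration with deflation yields the leading few singular triplets, which is the regime (a small fraction of the spectrum) where partial-SVD solvers pay off.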
Paper
HPAC-Offload: Accelerating HPC Applications with Portable Approximate Computing on the GPU
Accelerators
Distributed Computing
Middleware and System Software
Performance Measurement, Modeling, and Tools
Post-Moore Computing
Best Paper Finalist
Description
The end of Dennard scaling and the slowdown of Moore's law led to a shift in technology trends toward parallel architectures, particularly in HPC systems. To continue providing performance benefits, HPC should embrace Approximate Computing (AC), which trades application quality loss for improved performance. However, existing AC techniques have not been extensively applied and evaluated in state-of-the-art hardware architectures such as GPUs, the primary execution vehicle for HPC applications today.
This paper presents HPAC-Offload, a pragma-based programming model that extends OpenMP offload applications to support AC techniques, allowing portable approximations across different GPU architectures. We conduct a comprehensive performance analysis of HPAC-Offload across GPU-accelerated HPC applications, revealing that AC techniques can significantly accelerate HPC applications (1.64x for LULESH on AMD GPUs, 1.57x on NVIDIA GPUs) with minimal quality loss (0.1%). Our analysis offers deep insights into the performance of GPU-based AC that guide the future development of AC algorithms and systems for these architectures.
Panel
HPC and Cloud Converged Computing: Merging Infrastructures and Communities
Artificial Intelligence/Machine Learning
Cloud Computing
Heterogeneous Computing
Description
The end of Dennard scaling and the tapering of Moore’s law have led to economic conditions that favor cloud hyperscalers. Consequently, the cloud is projected to be the largest sector of computing by revenue by 2025. This tremendous growth translates into substantial investment in research and development to manage the complexity of emerging systems. Cloud technologies such as elasticity, containerization and orchestration, and automation are gaining prevalence in HPC due to their abilities to manage new composite scientific workflows. Similarly, HPC techniques for performance optimization, scheduling, and fine-grained resource management are being integrated into the cloud to improve performance. The trend of integrating technologies from each community into the other leads to Converged Computing, an environment that combines the best capabilities from both worlds. In this highly interactive panel, we invite experts from industry, national laboratories, and academia to discuss their experiences with converged computing and share their views on its future.
Birds of a Feather
HPC for Geospatial (HPC4Geo)
Artificial Intelligence/Machine Learning
Description
The HPC for Geospatial (HPC4Geo) BoF will bring together researchers and practitioners across industry, national labs, academia, and government to share and discuss the state of the art in high performance computing for geospatial applications. Topics will include HPC implementations, data and computing architectures, real-time analytics, scalable artificial intelligence algorithms, and storage systems. The HPC4Geo BoF will primarily focus on reviewing current research and operational needs within the geospatial community, and discussion of 5-year horizon challenges/opportunities, with an emphasis on technical intersections between AI at the edge, massive data collection/throughput/processing, and scalable/portable AI models.
Birds of a Feather
HPC Graph Toolkits and the GraphBLAS Forum
Algorithms
Description
Government agencies, industry, and academia are demanding a new generation of tools to efficiently solve large-scale analytics problems in a variety of business, scientific, and national security applications. This BoF gathers the community developing high-performance frameworks and workflows for large-scale graph analytics to survey current approaches, identify new challenges and opportunities, and discuss the interoperability of emerging infrastructures. A central goal is developing requirements and recommendations for future tools. As in previous editions, this BoF will explore, compare, and contrast conventional implementations as well as algebraic approaches, inviting the GraphBLAS community to discuss its state and evolution.
Birds of a Feather
HPC Next: The RISC-V Ecosystem
Architecture and Networks
Description
RISC-V is an open instruction set standard that is experiencing extraordinary growth and has the potential to revolutionize supercomputing. There is a growing number of RISC-V activities in the HPC community, and the goal of this BoF is to continue the discussion with the community about the RISC-V ecosystem and how it can best support HPC research and development. The session will begin with a short overview of the status of the RISC-V HPC ecosystem, followed by a Q&A with the panel and audience. There will be directed questions, as well as ad hoc questions and discussions with the audience.
Workshop
HPC Systems Professionals Workshop (HPCSYSPROS23)
Description
The complexity of High Performance Computing (HPC) systems necessitates advanced techniques in system administration, configuration, and engineering, carried out by staff who are well-versed in the best practices of the field. HPC systems professionals include system engineers, system administrators, network administrators, storage administrators, and operations staff who face problems unique to HPC systems. The ACM SIGHPC SYSPROS Virtual Chapter, the sponsor of this workshop, was established to provide opportunities to develop and grow relationships focused specifically on the needs of HPC systems practitioners and to act as a support resource to help with the issues encountered in this specialized field.
This workshop is designed to share best practices for common HPC system deployment and maintenance, to provide a platform to discuss upcoming technologies, and to present the state of the practice techniques that increase performance and reliability of systems, and in turn increase researcher and analyst productivity.
Workshop
HUST-23: 10th International Workshop on HPC User Support Tools
Description
The HPC User Support Tools (HUST) workshop has become a key forum for promoting new and innovative user support tools, such as XALT, Spack, EasyBuild, and ReFrame, to the HPC community. Many of the tools presented at earlier HUST workshops have matured to the point of becoming community standards and are now integral to user support at HPC centers around the world. The workshop brings together system administrators, user support staff, tool developers, policy makers, and end users to learn about new and innovative tools. Its central aims are to serve as a publication venue for current and ongoing support-tool development, to promote the uptake of these tools, and to identify and support best practices, novel tools, and novel ideas that help streamline user support efforts within the evolving technology ecosystems at HPC centers.
Paper
I/O in WRF: A Case Study in Modern Parallel I/O Techniques
Data Analysis, Visualization, and Storage
I/O and File Systems
State of the Practice
Description
Large-scale parallel applications can face significant I/O performance bottlenecks, making efficient I/O crucial. This work presents a comparative study of several parallel I/O implementations in the Weather Research and Forecasting model, including PnetCDF blocking and non-blocking I/O options, netCDF4, HDF5 Log VOL, and ADIOS. For I/O methods creating files in a canonical data layout, PnetCDF's non-blocking option offers up to 2x improvement over its blocking option and up to 4.5x over HDF5 via netCDF4, demonstrating the effectiveness of the write request aggregation technique. The HDF5 Log VOL outperforms ADIOS with a 4x improvement in write performance when creating files in the log layout, although both require non-negligible time to convert the file back to canonical order for post-run analysis. From these results, we extract some observations that can guide I/O strategies for modern parallel codes.
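The write-request aggregation behind PnetCDF's non-blocking interface can be sketched schematically: small writes are posted without touching the file and flushed together on a single wait call. The class and method names below are hypothetical stand-ins echoing the iput/wait_all naming style, not PnetCDF's API:

```python
import io

class AggregatingWriter:
    """Buffer many small 'posted' writes and flush them as one batch,
    the idea behind non-blocking write-request aggregation."""
    def __init__(self, f):
        self.f, self.pending = f, []

    def iput(self, offset, payload):
        self.pending.append((offset, payload))   # post only, no I/O yet

    def wait_all(self):
        for off, data in sorted(self.pending):   # coalesce in file order
            self.f.seek(off)
            self.f.write(data)
        self.pending.clear()

f = io.BytesIO()
w = AggregatingWriter(f)
w.iput(4, b"wxyz")       # posted out of order...
w.iput(0, b"abcd")
w.wait_all()             # ...flushed once, in file order
assert f.getvalue() == b"abcdwxyz"
```

Sorting posted requests into file order before flushing is what turns many scattered small writes into a few large sequential ones, which is where the reported speedups come from.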
Workshop
IA^3 2023 - 13th Workshop on Irregular Applications: Architectures & Algorithms
Description
Due to the heterogeneous datasets they process, data intensive applications employ a diverse set of methods and data structures, exhibiting irregular memory accesses, control flows, and communication patterns. Current supercomputing systems are organized around components optimized for data locality and bulk synchronous computations. Managing any form of irregularity on them requires a substantial programming effort, and often leads to poor performance. Holistic solutions to these challenges emerge only by considering the problem from multiple perspectives: from micro- to system-architectures, from compilers to languages, from libraries to runtimes, and from algorithm design to data characteristics. Only collaborative efforts among researchers with different expertise, including domain experts and end users, can lead to significant breakthroughs. This workshop brings together scientists with different backgrounds to discuss methods and technologies for efficiently supporting irregular applications on current and future architectures.
Birds of a Feather
IEEE Quantum-HPC Working Group
Post-Moore Computing
Description
The IEEE Quantum-HPC Working Group BoF is the second community-building session targeting academic and enterprise stakeholders in HPC and hybrid HPC-QCS (Quantum Computing and Simulation). Launched at IEEE Quantum Week 2023, the Quantum-HPC Working Group addresses the challenges and opportunities of interfacing HPC and QCS through a full-stack approach across infrastructure, system software, programming tools and use cases. The BoF brings together attendees who are interested in the role of QCS in the HPC ecosystem to chart a sustainable path forward to interface the two technologies by collaborating on the focus areas and technical working structure of Quantum-HPC.
Panel
Immersion Cooling: 3 Considerations You Should Care about and Real-World Deployment Experiences
Artificial Intelligence/Machine Learning
Energy Efficiency
Hardware Technologies
Description
As the world increasingly relies on a new era of high-wattage CPU and GPU platforms to deliver HPC and AI breakthroughs, deploying and cooling these systems within traditional data centers presents a problem. This panel will discuss the need to create more sustainable deployment environments and how immersion cooling is a critical piece to this puzzle.
Join expert panelists in a discussion about the following three considerations: 1) Cost, 2) Available Options, and 3) Service/Support/Warranty Implications. The panel will discuss why out with the old (cooling with fans) and in with the immersive (immersing HPC and AI servers into high-tech non-conductive fluid) is a sustainable option for the future of modern data centers. Most people can agree on one thing: driving rack density and cooling of high-wattage processors presents a new set of challenges. The good news? We have options! Join our SC23 panel to learn more.
Birds of a Feather
Implementing Zero Trust on HPC
Architecture and Networks
Description
Zero Trust is the cybersecurity architecture of choice and is now being discussed in supercomputing environments. Zero Trust is based on a least-privilege, per-request approach, which has serious implications for HPC centers, application developers, and end-user workflows. Join this discussion with US Federal CIOs on their expectations and with HPC leaders on their approach.
Tutorial
In-Situ Analysis and Visualization with Ascent and ParaView Catalyst
Description
Scientific visualization and analysis are key ingredients in HPC simulation workflows. For decades, the dominant paradigm has been post-hoc visualization; simulation codes iterate and save files to disk, giving the domain scientists the opportunity to read the data back at a later time for analysis. In recent years though, this paradigm has been stressed by an ever-diverging rate of growth between I/O and compute speeds. In-situ processing helps mitigate these I/O bottlenecks, enabling simulation and visualization calculations to run in-memory, at higher spatial and temporal resolution, avoiding the transfer of raw data to disks. Even in cases where I/O bottlenecks do not dominate, in-situ processing is well suited for batch-focused analysis, allowing simulation users to obtain distilled results without additional workflow steps.
This half-day tutorial introduces the in-situ visualization paradigm along with Ascent and ParaView Catalyst, two open-source in-situ processing libraries. Both libraries leverage a common interface, Conduit, which provides an intuitive model for describing hierarchical scientific data in C++, C, Fortran, and Python. Attendees will gain hands-on experience learning how to describe simulation data with Conduit and how to use Ascent and Catalyst to transform data, render images, and export results.
Birds of a Feather
Increasing Memory Utilization and Reducing Total Memory Cost Using CXL
Artificial Intelligence/Machine Learning
Description
CXL’s advanced memory expansion and fabric management capabilities can be used to increase system scalability and flexibility across multiple compute domains, enabling resource sharing for higher performance, reduced software stack complexity, and lower overall datacenter memory cost. The fabric enhancements and memory expansion features included in CXL 3.0 deliver new levels of composability required by the large models used in HPC and AI in the modern datacenter. Expert representatives from CXL Consortium member companies who are implementing the specification will explore the CXL 3.0 features, new use case enablement, and ROI examples when implementing CXL attached memory.
Birds of a Feather
Integrating Cloud Infrastructure with Large Scale HPC Environments
Cloud Computing
Description
As cloud environments deploy HPC-capable infrastructure, large-scale supercomputing and HPC centers are exploring how to integrate these resources into their ecosystems. This BoF will provide an opportunity for these centers to share their experiences and insights, as well as a venue to establish collaborative efforts and develop broader strategies across the community. It will serve as a forum for discussion between supercomputing facility operators, cloud service providers, and the user community, covering strategies and approaches for integrating cloud resources into existing HPC facility environments.
Birds of a Feather
Interactive and Urgent HPC
State of the Practice
Description
Many HPC systems are managed using batch queues; however, not all HPC applications and workflows are best served by batch queue systems. Interactive prototyping, urgent streaming data analysis, application steering, and in-situ visualization are among the workflows that require interactive and urgent capabilities to be effective. After three successful SC BoFs and seven successful workshops at SC and ISC, the interactive and urgent HPC community is writing a position paper during the summer of 2023 to document progress and cast future research foci. In this BoF, we will present the state of the draft paper and solicit discussion and feedback.
Paper
Interference-Aware Multiplexing for Deep Learning in GPU Clusters: A Middleware Approach
Accelerators
Distributed Computing
Middleware and System Software
Performance Measurement, Modeling, and Tools
Post-Moore Computing
Best Paper Finalist
Description
A common strategy for improving efficiency in training deep learning entails multiplexing tasks on a single GPU. To mitigate the interference caused by multiplexing, existing approaches primarily employ kernel-level solutions to regulate GPU kernel execution, or harness hardware-level techniques to explicitly restrict GPU streaming multiprocessors and memory. Nevertheless, none of them perform satisfactorily in optimizing the completion time of tasks.
In this paper, we present IADeep, a middleware solution designed to significantly improve multiplexing efficiency. The core concept is the co-optimization of task assignments within a cluster and interference mitigation on each device. IADeep coordinates the configuration of all co-located tasks in a less fine-grained fashion, effectively reducing interference and enhancing task training performance. Across the entire cluster, IADeep intelligently selects applications suitable for multiplexing to further amplify the advantages of optimizing task configurations. Evaluations on a 20 RTX 3090-GPU cluster demonstrate that IADeep can significantly outperform state-of-the-art multiplexing solutions.
Birds of a Feather
Introducing MPI 4.1, the Newest Version of the Message Passing Interface Standard
Programming Frameworks and System Software
Description
The Message Passing Interface (MPI) API is the dominant programming approach for HPC environments. Its specification is driven by the MPI Forum, an open forum consisting of MPI developers, vendors, and users. Just before SC23, the MPI Forum published the latest version of the standard, MPI 4.1. We will take a look at the new features and discuss what they mean for users of MPI. However, MPI 4.1 is not the end of the MPI standard; the Forum is already working toward MPI 5.0, and we will discuss ideas and directions and solicit feedback from the community.
Tutorial
Introduction to High-Performance Parallel Distributed Computing Using Chapel, UPC++, and Coarray Fortran
Description
A majority of HPC system users utilize scripting languages such as Python to prototype their computations, coordinate their large executions, and analyze the data resulting from their computations. Python is great for these many uses, but it frequently falls short when significantly scaling up the amount of data and computation, as required to fully leverage HPC system resources. In this tutorial, we show how example computations such as heat diffusion, k-mer counting, file processing, and distributed maps can be written to efficiently leverage distributed computing resources in the Chapel, UPC++, and Fortran parallel programming models.
The tutorial is targeted for users with little-to-no parallel programming experience, but everyone is welcome. A partial differential equation example will be demonstrated in all three programming models. That example and others will be provided to attendees in a virtual environment. Attendees will be shown how to compile and run these programming examples, and the virtual environment will remain available to attendees throughout the conference, along with Slack-based interactive tech support.
Come join us to learn about some productive and performant parallel programming models!
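As a taste of the heat-diffusion example mentioned above, here is the serial numerical kernel in Python; the tutorial itself develops parallel, distributed versions of computations like this in Chapel, UPC++, and Coarray Fortran:

```python
def diffuse(u, alpha=0.25, steps=1):
    """Explicit finite-difference update for 1D heat diffusion with
    fixed (Dirichlet) boundary values at both ends."""
    u = list(u)
    for _ in range(steps):
        nxt = list(u)
        for i in range(1, len(u) - 1):
            nxt[i] = u[i] + alpha * (u[i-1] - 2*u[i] + u[i+1])
        u = nxt
    return u

u0 = [0.0, 0.0, 100.0, 0.0, 0.0]
u1 = diffuse(u0)
# heat spreads from the hot center cell toward its neighbors
assert u1[1] > 0 and u1[3] > 0 and u1[2] < 100.0
```

In the distributed versions, the interior loop is partitioned across locales or ranks, and only the boundary cells of each partition need to be exchanged each step, which is what makes the example a good vehicle for teaching distributed-memory programming.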
Tutorial
Introduction to Quantum Computing
Description
Quantum computing offers the potential to revolutionize high-performance computing by providing a means to solve certain computational problems faster than any classical computer. Relatively recently, quantum computing has advanced from a theoretical possibility to engineered reality, with commercial entities offering early prototype quantum processors representing a variety of qubit technologies and computational paradigms. The media have been showcasing each new development and implicitly conveying the message that quantum-computing ubiquity is nigh. Here, we will respond to this hype and provide an overview of the exciting but still early state of the field.
We introduce participants to the computational models underlying quantum computing. We work through examples of its immense computational power while highlighting what the quantum computing community still does not know in terms of quantum algorithms and where the power of quantum computing comes from. We examine the thought processes that programmers use to map problems to circuit-model quantum computers, quantum annealers, measurement-based quantum systems, analog Rydberg atom arrays, and other recent inventions in the quantum-computing space. We conclude with an overview of the hardware and algorithmic challenges that must be overcome before quantum computing becomes a component of the HPC developer's repertoire.
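As a minimal taste of the circuit model discussed above, applying a Hadamard gate to |0> produces an equal superposition. This toy state-vector sketch uses plain Python lists rather than any quantum SDK:

```python
import math

def apply_gate(gate, state):
    """Multiply a 2x2 gate matrix by a single-qubit state vector."""
    return [sum(gate[r][c] * state[c] for c in range(2)) for r in range(2)]

# Hadamard gate: maps |0> to (|0> + |1>)/sqrt(2)
H = [[1 / math.sqrt(2),  1 / math.sqrt(2)],
     [1 / math.sqrt(2), -1 / math.sqrt(2)]]

ket0 = [1.0, 0.0]            # |0>
psi = apply_gate(H, ket0)    # equal superposition
probs = [abs(a) ** 2 for a in psi]

# measuring yields 0 or 1 with equal probability
assert abs(probs[0] - 0.5) < 1e-12 and abs(probs[1] - 0.5) < 1e-12
```

The state vector doubles in length with each added qubit, which is exactly why classical simulation becomes intractable and why the tutorial's question of where quantum computing's power comes from is subtle.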
Birds of a Feather
IO500: The High-Performance Storage Community
Data Analysis, Visualization, and Storage
Description
As efficient I/O becomes increasingly critical to reaching peak computing performance, IO500 has become the de facto standard for measuring HPC storage performance. Developed in 2017, the IO500 has released bi-annual lists at SC and ISC ever since. This BoF’s highlight is the presentation of the new IO500 list.
This BoF’s goal is to foster the IO500 community to progress common goals of creating, sharing, and benefiting from a large corpus of shared storage performance data. We are also building a detailed repository of high-performance production storage systems as they evolve, providing a knowledge base for HPC researchers and system designers.
Workshop
ISAV23: In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization
Description
As HPC platforms increase in size and complexity, one significant challenge is the widening gap between computational capacity and our ability to store data for subsequent analysis. One promising approach known as in situ processing is to perform as much analysis as possible while computed data is still resident in memory. The ISAV workshop at SC has become the “center of gravity” in the HPC space for a community of in situ developers, practitioners, and researchers from industry, government laboratories, and academia. The goals include presentation and discussion of research findings, lessons learned, early ideas, and insights related to developing and applying in situ methods across many science and engineering applications in HPC environments. Submitted papers undergo a peer review process and appear in a published proceedings.
Paper
Itoyori: Reconciling Global Address Space and Global Fork-Join Task Parallelism
Heterogeneous Computing
Programming Frameworks and System Software
Task Parallelism
Description
This paper introduces Itoyori, a task-parallel runtime system designed to tackle the challenge of scaling task parallelism (more specifically, nested fork-join parallelism) beyond a single node. The partitioned global address space (PGAS) model is often employed in task-parallel systems, but naively combining them can lead to poor performance due to fine-grained and redundant remote memory accesses. Itoyori addresses this issue by automatically caching global memory accesses at runtime, enabling efficient cache sharing among parallel tasks running on the same processor. As a real-world case study, we ported an existing task-parallel implementation of the Fast Multipole Method (FMM) to distributed memory with Itoyori and achieved a 7.5x speedup when scaled from a single node to 12 nodes and up to 6.0x faster performance than without caching. This study demonstrates that global-view fork-join programming can be made practical and scalable, while requiring minimal changes to the shared-memory code.
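Itoyori targets distributed memory with PGAS caching; the nested fork-join pattern it scales can be sketched in shared memory with plain threads (an illustrative analogy, not Itoyori's API):

```python
import threading

def fork_join_sum(data, lo, hi, out, idx, cutoff=2000):
    """Nested fork-join parallelism: fork the left half into a new
    thread, recurse on the right half in this thread, then join."""
    if hi - lo <= cutoff:
        out[idx] = sum(data[lo:hi])            # serial leaf task
        return
    mid = (lo + hi) // 2
    part = [0, 0]
    t = threading.Thread(target=fork_join_sum,
                         args=(data, lo, mid, part, 0, cutoff))
    t.start()                                  # fork
    fork_join_sum(data, mid, hi, part, 1, cutoff)
    t.join()                                   # join
    out[idx] = part[0] + part[1]

data = list(range(10_000))
result = [0]
fork_join_sum(data, 0, len(data), result, 0)
assert result[0] == sum(range(10_000))
```

The challenge the paper addresses is making this same recursive structure efficient when `data` lives in a partitioned global address space across nodes, where naive fine-grained remote accesses would dominate; Itoyori's runtime caching plays the role that hardware caches play in this shared-memory sketch.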
Birds of a Feather
Julia for HPC
Description
The “Julia for HPC” birds-of-a-feather (BoF) session provides a gathering place for members of the high-performance computing (HPC) community with an interest in the Julia programming language. Julia offers an integrated, end-to-end co-design model, implemented as an LLVM front-end for science, that aims to close the gap between high-productivity languages and the performance of traditionally compiled languages on extremely heterogeneous systems.
We invite participants from academia, government, and industry to share and discuss their experiences, and to identify and learn about current opportunities and gaps. Potential topics include: community; adoption and support at leadership facilities; the Julia ecosystem; and programming models and packages targeting HPC workflows.
Birds of a Feather
Khronos SYCL: What’s Next?
Programming Frameworks and System Software
Description
The SYCL programming model provides an open standard way to program heterogeneous systems in modern C++. Since the major SYCL2020 release, which added abstractions and features for HPC, SYCL has seen increased use in application domains needing large exascale-class machines, including fusion energy, molecular dynamics, and aerospace.
In this Birds of a Feather session, we will bring together the community of everyone using and developing SYCL applications and implementations. We will discuss future directions and seek feedback on priorities for SYCLNext. A panel of SYCL experts, runtime/compiler implementers, and application specialists will lead an audience discussion and Q&A.
Birds of a Feather
Knowledge Graphs: How Will They Transform Science?
Applications
Description
Diverse big data, interdisciplinary science, ML/AI applications, and in-situ computations all necessitate knowledge representation. Knowledge organized for machine understanding in graph form, known as knowledge graphs, augments large-scale science; biology and the semantic web, for example, rely on large knowledge graphs. Combined with AI, knowledge graphs enable natural-language querying of linked information, semantic recommendation systems, and knowledge completion. HPC challenges abound, including parallelizing queries, retrieval-efficient knowledge representation, and AI that exploits knowledge-graph context. This BoF will introduce big ideas as lightning talks followed by discussion, and engage a general audience in a discussion of emerging research topics, aiming to seed a community for collaboration.
ACM Gordon Bell Finalist
Awards
Large-Scale Materials Modeling at Quantum Accuracy: Ab Initio Simulations of Quasicrystals and Interacting Extended Defects in Metallic Alloys
TP
Description
Ab initio electronic-structure calculations have long faced a dichotomy between achievable accuracy and length scale. Quantum many-body (QMB) methods realize quantum accuracy but fail to scale. Density functional theory (DFT) scales favorably but remains far from quantum accuracy. We present a framework that breaks this dichotomy through three interconnected modules: (i) invDFT: a methodological advance in inverse DFT linking QMB methods to DFT; (ii) MLXC: a machine-learned density functional trained with invDFT data, commensurate with quantum accuracy; (iii) DFT-FE-MLXC: an adaptive higher-order spectral finite-element (FE) based DFT implementation that integrates MLXC with efficient solver strategies and HPC innovations in FE-specific dense linear algebra, mixed-precision algorithms, and asynchronous compute-communication overlap. We demonstrate a paradigm shift in DFT that not only provides ground-state energies with accuracy commensurate with QMB methods, but also attains an unprecedented performance of 659.7 PFLOPS (43.1% of peak FP64 performance) on 619,124 electrons using 8,000 GPU nodes of the Frontier supercomputer.
Paper
Large-Scale Simulation of Structural Dynamics Computing on GPU Clusters
Accelerators
Applications
Modeling and Simulation
Best Paper Finalist
Best Student Paper Finalist
Description
Structural dynamics simulation plays an important role in research on reactor design and complex engineering. The Hybrid Total Finite Element Tearing and Interconnecting (HTFETI) method combined with the Newmark method is an efficient way to solve large-scale structural dynamics problems. However, the sparse direct solver and the load imbalance caused by inconsistent density models are two critical issues limiting the performance and scalability of structural dynamics computing. For the former, we propose an efficient variable-size batched method to accelerate SpMV on GPUs. For the latter, we establish an online performance-prediction model, based on which we design a novel inter-cluster subdomain fine-tuning algorithm to balance the workload of HTFETI parallel computing. We are the first to achieve a high-fidelity structural dynamics simulation of the China Experimental Fast Reactor core assembly with up to 53.4 billion grid cells. The weak and strong scalability efficiencies reach 91.77% and 86.13% on 12,800 GPUs, respectively.
Paper
Legate Sparse: Distributed Sparse Computing in Python
Heterogeneous Computing
Programming Frameworks and System Software
Task Parallelism
Description
The sparse module of the popular SciPy Python library is widely used across applications in scientific computing, data analysis, and machine learning. The standard implementation of SciPy is restricted to a single CPU and cannot take advantage of modern distributed and accelerated computing resources. We introduce Legate Sparse, a system that transparently distributes and accelerates unmodified sparse matrix-based SciPy programs across clusters of CPUs and GPUs, and composes with cuNumeric, a distributed NumPy library. Legate Sparse uses a combination of static and dynamic techniques to performantly compose independently written sparse and dense array programming libraries, providing a unified Python interface for distributed sparse and dense array computations. We show that Legate Sparse is competitive with single-GPU libraries like CuPy and the industry-standard PETSc library on up to 1280 CPU cores and 192 GPUs of the Summit supercomputer, while offering the productivity benefits of idiomatic SciPy and NumPy.
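The kind of kernel at the heart of such sparse workloads can be sketched in a few lines; below is a minimal, dependency-free compressed sparse row (CSR) matrix-vector product of the sort scipy.sparse executes internally (illustrative only; it says nothing about how Legate Sparse distributes the computation).

```python
# Minimal CSR sparse matrix-vector product (y = A @ x), the basic kernel
# behind scipy.sparse workloads. Pure Python for illustration.

def csr_matvec(indptr, indices, data, x):
    """CSR storage: row r owns entries data[indptr[r]:indptr[r+1]],
    whose column positions are indices[indptr[r]:indptr[r+1]]."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y

# A = [[1, 0, 2],
#      [0, 3, 0]]  stored as CSR:
indptr  = [0, 2, 3]
indices = [0, 2, 1]
data    = [1.0, 2.0, 3.0]
print(csr_matvec(indptr, indices, data, [1.0, 1.0, 1.0]))  # -> [3.0, 3.0]
```

Distributing this loop nest across nodes and GPUs, while composing with dense NumPy-style arrays, is exactly the part the paper's system automates.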
Birds of a Feather
Less Worrying, More Learning, More Sharing - Ways to Embrace IPv6
Architecture and Networks
Description
IPv6 is quickly becoming the dominant protocol on the internet. As the global transition from IPv4 to IPv6 continues, many ISPs now see over 50% of their traffic over IPv6. SCinet22 saw wireless IPv6 traffic ranging from 35% to 55%. This BoF continues the engagement from SC22 with discussions centered on international migration efforts, cybersecurity, HPC, IPAM, and real-time IPv6 usage from SCinet23! Join our discussion on the efforts, implications, and challenges of transitioning HPC, data centers, and networks. Ask questions, provide updates, and hear from others about their real-world experience: learn all the ways you can embrace IPv6.
Tutorial
Leveraging SmartNICs for HPC Applications
Description
The past few years have witnessed a surge in the number of advanced network adapters, known as "SmartNICs", that offer additional functionalities beyond standard packet processing capabilities. These devices often feature programmable lightweight processing cores, FPGAs, and even CPU- and GPU-based platforms capable of running separate operating systems. Though primarily aimed at data center operations, such as infrastructure management, packet filtering, and I/O acceleration, SmartNICs are increasingly being explored for high-performance computing (HPC) application acceleration.
This tutorial offers an in-depth exploration of the state-of-the-art for SmartNICs and the emerging software ecosystems supporting them. Attendees will engage in hands-on exercises to better understand how to use SmartNICs for HPC application acceleration, including MPI collective operation offloading, OpenMP offload, and algorithmic modifications to maximize on-board processing power. Participants will have the opportunity to execute these exercises using cutting-edge SmartNICs like NVIDIA's BlueField-3 Data Processing Unit (DPU). The tutorial presenters will discuss additional techniques for optimizing applications to harness SmartNICs as communication accelerators in HPC systems.
Paper
Leveraging the Compute Power of Two HPC Systems for Higher-Dimensional Grid-Based Simulations with the Widely-Distributed Sparse Grid Combination Technique
Algorithms
Cloud Computing
Distributed Computing
Heterogeneous Computing
Large Scale Systems
State of the Practice
Description
This paper presents the core concepts of the widely-distributed combination technique, which allows us to use the compute power and memory of more than one HPC system for the same simulation. We apply the sparse-grid combination technique to a six-dimensional advection problem serving as a proxy for plasma simulations. The full-grid solution approximated by the combination technique would contain ≈5ZB if computed with conventional grid-based methods. The combination-technique simulation operates on ≈988GB plus the supporting sparse grid data structures. We propose a new approach to divide the compute load, requiring only 76GB to be exchanged. Based on this, we have realized the first synchronous grid-based simulation using two HPC systems, the Tier-0 supercomputers Hawk and SuperMUC-NG. The simulation is computed at an average overhead of ≈35% (108s per combination step) for file-I/O and transfer. The presented concepts apply to any pair of HPC systems if high-speed data transfer is possible.
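For context, one common formulation of the classical sparse grid combination technique (the standard scheme; the widely-distributed variant in the paper additionally partitions the component grids across two HPC systems) combines anisotropic full-grid solutions $u_\ell$ at multi-indices $\ell \in \mathbb{N}^d$ with alternating binomial weights:

```latex
u_n^{(c)} \;=\; \sum_{q=0}^{d-1} (-1)^q \binom{d-1}{q}
\sum_{|\ell|_1 \,=\, n - q} u_\ell
```

For the six-dimensional advection problem, $d = 6$, so each combination step sums component grids drawn from six consecutive level sums, which is what lets the scheme approximate the ≈5 ZB full-grid solution with only ≈988 GB of component-grid data.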
Workshop
LLVM-HPC2023: The Ninth Workshop on the LLVM Compiler Infrastructure in HPC
Description
LLVM, winner of the 2012 ACM Software System Award, has become an integral part of the software-development ecosystem for optimizing compilers, dynamic-language execution engines, source-code analysis and transformation tools, debuggers and linkers, and a whole host of programming-language and toolchain-related components. Now heavily used in both academia and industry, where it allows for rapid development of production-quality tools, LLVM is increasingly used in work targeted at high-performance computing. Research in, and implementation of, program analysis, compilation, execution, and profiling have clearly benefited from the availability of a high-quality, freely available infrastructure on which to build. This workshop will focus on recent developments, from both academia and industry, that build on LLVM to advance the state of the art in high-performance computing.
Birds of a Feather
LUSTRE Community BoF: Lustre in HPC, AI, and the Cloud
Data Analysis, Visualization, and Storage
Description
Lustre is the leading open-source and open-development file system for HPC. Around two thirds of the top 100 supercomputers use Lustre. It is a community-developed technology with contributors from around the world. Lustre currently supports many infrastructures beyond scientific research, including financial services, energy, manufacturing, and life sciences. Lustre clients are available for broadly deployed instruction-set architectures such as x86, POWER, and ARM.
At this BoF, Lustre developers, administrators, and solution providers will gather to discuss recent Lustre developments and challenges, including the role of Lustre in AI and its use in Cloud environments.
Birds of a Feather
Machine Learning from the Data’s Perspective: Data-Centric AI for Scientific Computing
Artificial Intelligence/Machine Learning
Description
This BoF will spotlight the underemphasized role of inputs and data in machine learning (ML), contrasting the prevalent focus on hardware aspects. It invites the SC community to contribute insights in these areas: 1) the value proposition for data-centric AI in scientific computing; 2) foundation models for the long tail of science; 3) the role of benchmarks in data-centric AI. To foster interactive dialogue, we will facilitate discussions, conduct live polling, and arrange short breakout sessions. These activities will enable participants to delve into the practical implications of data-centric AI, benchmarking, and contributing to scientific foundation models.
Tutorial
Magic Castle: Terraforming the Cloud to Teach HPC
Description
Are you new to the world of HPC and are trying to find an affordable and accessible way that you can learn, practice and experiment? Do you miss the days when learning about HPC was connecting a few grey boxes together and configuring a cluster? Do you wish you could transfer all the complexity inherent in production HPC systems into an accessible sandbox environment, designed to facilitate teaching and experimental development? Stop wishing and come explore Magic Castle with this tutorial!
Magic Castle is open-source software that replicates the HPC infrastructure experience on community or commercial cloud resources. It is easy to deploy, and a cluster can be created in minutes. Once the cluster is deployed, users are provided with a complete HPC cluster software environment including a scheduler, a data-transfer node, JupyterHub, and thousands of software applications compiled by experts and accessible via CVMFS. Since its initial public release in 2018, Magic Castle has been used for thousands of workshops and tutorials worldwide.
In this tutorial, you will learn how to deploy a virtual HPC cluster on your preferred cloud resource in minutes, and fully customize your environment to suit your application, whether that be training, development, or practice.
Tutorial
Managing HPC Software Complexity with Spack
Description
The modern scientific software stack includes thousands of packages, from C, C++, and Fortran libraries, to packages written in interpreted languages like Python and R. HPC applications may depend on hundreds of packages spanning all of these ecosystems. To achieve high performance, they must also leverage low-level and difficult-to-build libraries such as MPI, BLAS, and LAPACK. Integrating this stack is extremely challenging. The complexity can be an obstacle to deployment at HPC sites and deters developers from building on each other's work.
Spack is an open source tool for HPC package management that simplifies building, installing, customizing, and sharing HPC software stacks. Its adoption has grown rapidly: it is used by end-users, developers, clouds, and the world's largest HPC centers. Spack provides a powerful and flexible dependency model, a simple Python syntax for writing package build recipes, and a repository of over 7,000 packages maintained by a community of over 1,100 contributors. This tutorial provides an introduction to Spack's capabilities: installing and authoring packages, integrating Spack with development workflows, and deploying software at HPC facilities. Attendees will learn foundational skills for automating day-to-day tasks, as well as deeper knowledge of Spack for advanced use cases.
Tutorial
Mastering Tasking with OpenMP
Description
With the increasing prevalence of multi-core processors, shared-memory programming models are essential. OpenMP is a popular, portable, widely supported, and easy-to-use shared-memory model. Since version 3.0, released in 2008, OpenMP has offered tasking to support the creation of composable parallel software blocks and the parallelization of irregular algorithms. Developers usually find OpenMP easy to learn. However, mastering the tasking concept of OpenMP requires a change in the way developers reason about the structure of their code and how to expose its parallelism. Our tutorial addresses this critical aspect by examining the tasking concept in detail and presenting patterns as solutions to many common problems.
We assume attendees understand basic parallelization concepts and know the fundamentals of OpenMP. We present the OpenMP tasking language features in detail and focus on performance aspects, such as introducing cut-off mechanisms, exploiting task dependencies, and preserving locality. All aspects are accompanied by extensive case studies. If accepted as a full-day tutorial, we will include hands-on sessions. Throughout all topics, we present the recent additions of OpenMP 5.1 and 5.2 and comment on the developments targeting OpenMP 6.0.
Paper
MBFGraph: An SSD-Based External Graph System for Evolving Graphs
Cloud Computing
Data Analysis, Visualization, and Storage
Graph Algorithms and Frameworks
Best Paper Finalist
Description
The challenge of executing extensive graph analyses in memory intensifies with growing graph sizes. This has given rise to disk-based external graph analytics systems that prioritize cost-effective HDDs/SSDs over pricier memory solutions. In response to this issue, our paper introduces and assesses the MBFGraph external graph system. This system leverages millions of Bloom filters within 1KB or 2KB graph data blocks to diminish graph analysis execution delays. Through our innovative MBF-query and MBF-construct algorithms, MBFGraph utilizes these Bloom filters as approximate indices, enabling the reading of only pertinent sections of dynamic graph data and thereby facilitating scalable analytics. Our tests revealed that, on a 475GB graph, MBFGraph cut the execution duration of BFS and PageRank by 24% and 60%, respectively, using a mere 4GB of memory. This is in comparison to a sequential, workload-tailored, disk-based external graph analytics system.
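The key property that makes a Bloom filter usable as an approximate index is that it never produces false negatives: if the filter says a key is absent, the data block definitely does not contain it and can be skipped. A minimal sketch (parameters and hashing scheme are illustrative, not MBFGraph's):

```python
# Sketch of a Bloom filter as an approximate index: k hash positions per
# item are set in an m-bit vector; a lookup that finds any unset bit
# proves absence, so the corresponding data block can be skipped.

import hashlib

class BloomFilter:
    def __init__(self, m=256, k=3):
        self.m, self.k = m, k
        self.bits = 0  # m-bit vector packed into an int

    def _positions(self, item):
        # Derive k positions from a salted cryptographic hash (simple,
        # not fast; real systems use cheaper hash families).
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        return all((self.bits >> p) & 1 for p in self._positions(item))

bf = BloomFilter()
for v in [10, 42, 99]:        # vertices stored in one data block
    bf.add(v)

print(bf.might_contain(42))   # -> True (stored items are never missed)
# might_contain() on an absent key is usually False (false positives are
# possible but rare), letting a query read only pertinent blocks.
```

With one such filter per 1KB/2KB block, a query touches disk only for blocks whose filters report a possible hit.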
Birds of a Feather
Meeting HPC Community Needs: How SIGHPC, TCPP, and SIAG-SC Join Efforts to Engage Communities and Deliver Services
Description
Come and learn from the leaders of the professional societies focused on HPC from ACM, IEEE, and SIAM! Your SIGHPC, TCPP, and SIAG-SC representatives invite SC23 participants to join this cross-society BoF to learn about joint societies' efforts to promote collaborations, discuss the status of HPC as a community, and engage the audience to address common challenges.
Paper
Mirage: Toward Low-interruption Services on Batch GPU Clusters with Reinforcement Learning
Architecture and Networks
Performance Measurement, Modeling, and Tools
Resource Management
Description
Accommodating long-running deep learning (DL) training and inference jobs is challenging on GPU clusters that use traditional batch schedulers, such as Slurm. Given fixed wall-clock time limits, DL researchers usually need to run a sequence of batch jobs and experience long interruptions on overloaded machines. Such interruptions significantly lower research productivity and the QoS of services deployed in production. To mitigate these interruptions, we explore a set of machine learning and reinforcement learning techniques to design a proactive provisioner. We examine the generality of the method using production job traces from three GPU clusters, and validate the effectiveness and generality of our proactive provisioner using the validation trace of each cluster. Our experiments show that the proposed resource provisioner safeguards 23%-76% of jobs with zero interruption across varying load levels on the three clusters.
Paper
Mitigating Coupling Map Constrained Correlated Measurement Errors on Quantum Devices
Post-Moore Computing
Quantum Computing
Best Paper Finalist
Description
We introduce a technique for suppressing state-dependent and correlated measurement errors, which are commonly observed on modern superconducting quantum devices. Our method leverages previous results establishing that correlated errors tend to be physically localized on quantum devices: we perform characterizations over the coupling map of the device and join overlapping measurement calibrations as a series of sparse matrices, a procedure we term "Coupling Map Calibration". We quantitatively demonstrate the advantages of our proposed error-mitigation system design across a range of current IBM quantum devices. Our experimental results on common benchmark circuits demonstrate up to a 41% reduction in error rate, without increasing the number of required executions on the quantum device, when compared to conventional error-mitigation methods.
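The baseline idea being improved on can be shown in miniature: textbook readout-error mitigation characterizes a calibration matrix and inverts it to correct noisy counts (this is the conventional single-qubit version, not the paper's sparse coupling-map method; the calibration probabilities below are made up for illustration).

```python
# Illustrative single-qubit readout-error mitigation via calibration-
# matrix inversion. A[i][j] = P(measure i | prepared j); applying the
# inverse of A to measured counts recovers the ideal distribution.

def invert_2x2(a):
    (p, q), (r, s) = a
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]

# Hypothetical calibration: |0> reads as 0 with prob 0.95,
# |1> reads as 1 with prob 0.90.
A = [[0.95, 0.10],
     [0.05, 0.90]]
A_inv = invert_2x2(A)

# 1000 shots of |0>: readout noise turns the ideal [1000, 0]
# into roughly [950, 50].
measured = [950.0, 50.0]
mitigated = [sum(A_inv[i][j] * measured[j] for j in range(2))
             for i in range(2)]
print(mitigated)  # -> approximately [1000.0, 0.0]
```

On n qubits the naive calibration matrix is 2^n by 2^n; the paper's contribution is to exploit the physical locality of correlated errors so that only sparse, overlapping calibrations over the device's coupling map are needed.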
Birds of a Feather
Mixed Feelings about Mixed Precisions
Applications
Description
What if we have been oversolving in computational science and engineering for decades? Are low-precision arithmetic formats only for AI workloads? How can HPC applications exploit mixed-precision hardware features? This BoF invites the HPC community at large interested in applying mixed precision in their workflows and discussing the impact on time-to-solution, memory footprint, data motion, and energy consumption. Experts from scientific applications, software libraries, and hardware architectures will briefly provide context on this trending topic, share their own perspectives, and mostly engage with the audience via a set of questions, while gathering feedback to define a roadmap moving forward.
Birds of a Feather
MLPerf: A Benchmark for Machine Learning
Artificial Intelligence/Machine Learning
Description
Machine learning applications are rapidly expanding into scientific domains and challenging the hallmarks of traditional high-performance computing workloads. We present MLPerf, a community-driven system performance benchmark which spans a range of machine learning tasks. The speakers at this BoF are experts in the fields of HPC, science applications, machine learning, and computer architecture, representing academia, government research organizations, and private industry. In this session, we will cover the past year’s development within the MLPerf organization, provide an update on the latest round of submissions to the MLPerf-HPC benchmark suite, and solicit input from interested parties within the HPC community.
Birds of a Feather
Modular, Container, and Pallet Racking - for the Next Gen Data Center?
State of the Practice
Description
Modular and container-based industrial structures for HPC buildings are now common. Resulting CapEx reductions include shorter design-build schedules and commodity pricing of the structural envelope, while flexibility for expansion and upgrades is enhanced. Typical HPC life cycles for power, cooling, and compute machinery are highly varied and require constant modification and renovation of facilities; commodity structures can reduce this problem. Replacing concrete with steel and vertically stacking compute racks might allow 3-D cube compute architectures with low-latency communication and high accessibility for servicing. The transition from air to liquid cooling will drive this change.
Birds of a Feather
MPICH: A High Performance Open-Source MPI Implementation
Programming Frameworks and System Software
Description
MPICH is a widely used, open-source implementation of the MPI message passing standard. It has been ported to many platforms and used by several vendors and research groups as the basis for their own MPI implementations. This BoF session will provide a forum for users of MPICH as well as developers of MPI implementations derived from MPICH to discuss experiences and issues in using and porting MPICH. Future plans for MPICH will be discussed. Representatives from MPICH-derived implementations will provide brief updates on the status of their efforts. MPICH developers will also be present for an open forum discussion.
Birds of a Feather
Navigating Complexity: Achieving Performance Portability in the Evolving Landscape of Heterogeneous HPC Systems
State of the Practice
Description
With increasing demand for AI in HPC, there has been an explosion of architectures, programming models, and AI frameworks. The already-daunting task of programming for heterogeneous systems has become even more challenging. This BoF, organized by IXPUG but not limited to Intel technology, will focus on portable programming across a wide variety of architectures running a diverse set of HPC and AI workloads.
This BoF will explore challenges, state-of-the-art approaches, and emergent best practices for programming across heterogeneous systems and novel architectures, identifying common principles and practices that enable development and maintenance of software across sites, architectures, and applications.
Tutorial
Networking Technologies for High-Performance Computing: Principles and Solutions
Description
InfiniBand (IB), High-speed Ethernet (HSE), RoCE, Omni-Path, EFA, Tofu, and Slingshot technologies are generating a lot of excitement toward building next-generation High-End Computing (HEC) systems including clusters, datacenters, filesystems, storage, cloud computing, Big Data (Spark), and AI (Deep Learning and Machine Learning) environments. This tutorial will provide an overview of these emerging technologies, their offered architectural features, their current market standing, and their suitability for designing HEC systems. It will start with a brief overview of IB, HSE, RoCE, Omni-Path, EFA, Tofu, and Slingshot. An in-depth overview of the architectural features of IB, HSE (including iWARP and RoCE), and Omni-Path, their similarities and differences, and the associated protocols will be presented. An overview of the emerging NVLink2, NVSwitch, AMD Infinity Fabric, Slingshot, and Tofu architectures will also be given. Next, an overview of the OpenFabrics and Libfabric software stacks, which support a range of different interconnects, will be provided. Hardware/software solutions and the market trends behind these networking technologies will be highlighted. Sample performance numbers of these technologies and protocols for different environments will be presented. Finally, hands-on exercises will be carried out for attendees to gain first-hand experience running experiments with high-performance networks.
Birds of a Feather
New Competences: Are We Ready for the Uptake of Exascale and Hybrid Quantum-Classical Computing?
State of the Practice
Description
Exascale computing (EC) can process larger quantities of data faster than ever before and the technologies being developed can help accelerate innovation across the economy. Quantum-classical hybrid solutions have already gone beyond research environments into the business spheres. The first-generation EC projects in the USA and UK are soon ending.
Which tools and environments are emerging as most sought-after? How ready are we to answer the skills needs of computational researchers and business users? Do we have a clear competence framework? What are the needed skills to harness the promise and potential of emerging technologies?
Workshop
Ninth Computational Approaches for Cancer Workshop (CAFCW23)
Description
New computational opportunities and challenges have emerged within cancer research and clinical application areas as the size, source, and complexity of cancer datasets have grown. Simultaneously, advances in computational capabilities, with exceptional growth in AI and deep learning, are reaching unprecedented scales. The workshop focuses on bringing together interested individuals ranging from clinicians, cancer biologists, mathematicians, data and computational scientists, engineers, developers, thought leaders, and others with an interest in advancing computation to better understand, diagnose, treat, and prevent cancer. As a unifying theme in 2023, special emphasis will be given to “Diversity, Equity and Inclusion – from Science to Scientist,” bringing forward the need to involve contributions from all in progressing from data to partnerships, spanning demographics as well as extending to patient involvement. As an interdisciplinary workshop, sharing insights and challenges fosters collaborations and future innovations accelerating progress in computationally and data-driven cancer research and clinical applications.
Workshop
Ninth International Workshop on Heterogeneous High-Performance Reconfigurable Computing (H2RC 2023)
Description
As in the previous eight years, this workshop will bring together application experts, software developers, and hardware engineers, both from industry and academia, to share experiences and best practices to leverage the practical application of reconfigurable logic to Scientific Computing, Machine/Deep Learning, and “Big Data” applications. In particular, the workshop will focus on sharing experiences and techniques for accelerating applications and/or improving energy efficiency with FPGAs using OpenCL, OpenMP, OpenACC, SYCL, DPC++, C, C++, and other high-level design flows, which enable and improve cross-platform functional and performance portability while also improving productivity. Particular emphasis is given to cross-platform comparisons and combinations that foster a better understanding within the industry and research community on what are the best mappings of applications to a diverse range of hardware architectures that are available today (e.g., FPGA, GPU, Many-cores and hybrid devices, ASICs), and on how to most effectively achieve cross-platform compatibility.
Paper
NNQS-Transformer: An Efficient and Scalable Neural Network Quantum States Approach for Ab Initio Quantum Chemistry
Applications
Modeling and Simulation
Description
Neural network quantum states (NNQS) have emerged as a promising candidate for quantum many-body problems, but their practical application is often hindered by the high cost of sampling and local-energy calculation. We develop a high-performance NNQS method for ab initio electronic-structure calculations. The major innovations include:
(1) A transformer based architecture as the quantum wave function ansatz;
(2) A data-centric parallelization scheme for the variational Monte Carlo (VMC) algorithm that preserves data locality and adapts well to different computing architectures;
(3) A parallel batch sampling strategy which reduces the sampling cost and achieves good load balance;
(4) A parallel local energy evaluation scheme which is both memory and computationally efficient;
(5) Studies of real chemical systems that demonstrate both the superior accuracy of our method compared to the state of the art and its strong and weak scalability for large molecular systems with up to 120 spin orbitals.
Tutorial
Node-Level Performance Engineering
Description
The gap between peak performance and application performance continues to widen. Paradoxically, poor node-level performance can lead to highly scalable code, but at the price of increased overall time to solution. Consequently, valuable resources are wasted, often on a massive scale. If the user cares about time to solution on any scale, optimal performance at the node level is often the key factor. We convey the architectural features of current processor chips, multiprocessor nodes, and accelerators, as far as they are relevant for the practitioner. Peculiarities like SIMD vectorization, shared vs. separate caches, data transfer bottlenecks, and ccNUMA characteristics are introduced, and the influence of system topology and affinity on the performance of typical parallel programming constructs is demonstrated. Performance engineering and performance patterns are suggested as powerful tools that help the user understand the bottlenecks at hand and assess the impact of possible code optimizations. A cornerstone of these concepts is the roofline model, which is described in detail, including useful case studies, limits of its applicability, and possible refinements. We also show how simple performance tools can support node-level performance analysis by providing the developer with useful information about the bottlenecks of their code.
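The basic roofline model mentioned above fits in one line: attainable performance is capped either by the peak FLOP rate or by memory bandwidth times arithmetic intensity. A tiny sketch with made-up machine numbers:

```python
# The roofline model: performance <= min(peak compute, bandwidth * I),
# where I is arithmetic intensity in flops per byte of memory traffic.
# The machine numbers below are illustrative, not a specific system.

def roofline(peak_gflops, bw_gbytes, intensity_flops_per_byte):
    return min(peak_gflops, bw_gbytes * intensity_flops_per_byte)

PEAK = 3000.0   # GFLOP/s (peak FP64 rate)
BW   = 200.0    # GB/s (memory bandwidth)

# STREAM-triad-like kernel, a[i] = b[i] + s*c[i]: ~2 flops per 24 bytes
# of traffic -> deep in the bandwidth-bound regime.
print(roofline(PEAK, BW, 2 / 24))   # -> ~16.7 GFLOP/s (bandwidth-bound)

# Well-blocked dense matrix multiply: high intensity -> compute-bound.
print(roofline(PEAK, BW, 50.0))     # -> 3000.0 GFLOP/s (compute-bound)
```

The ridge point, PEAK / BW = 15 flops/byte here, separates kernels that profit from bandwidth optimizations from those that profit from compute optimizations, which is exactly the diagnosis step the tutorial teaches.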
Birds of a Feather
Open Cloud Infrastructure Solutions to Run HPC Workloads
Cloud Computing
Description
Cloud-native methods are increasingly used for HPC infrastructure. The advantages claimed include agility in system management and flexible support of new and evolving workflows.
In the last ten years, open cloud infrastructure has become widespread in scientific computing and OpenStack is the dominant open source cloud solution. The OpenStack Scientific SIG represents this community.
This session brings together leading practitioners of OpenStack and related technologies for open solutions in production operations. The session will present current use cases of cloud-native open infrastructure. The advantages and challenges of this approach will be presented. Attendees will be invited to share experiences.
Birds of a Feather
Open MPI State of the Union
Programming Frameworks and System Software
Description
Open MPI continues to drive the state of the art in HPC. This year, we've added new features, fixed bugs, improved performance, and collaborated with many across the HPC community. We'll discuss what Open MPI has accomplished over the past year and present a roadmap for the next year.
One of Open MPI's strengths lies in its diversity: we represent many different viewpoints across the HPC ecosystem. To that end, many developers from the community will be present to discuss and answer your questions both during and after the BoF.
Birds of a Feather
Open OnDemand User Group Meeting
Middleware and System Software
Description
This BoF is meant to be an open discussion to guide the future roadmap for Open OnDemand (openondemand.org), by getting feedback from the community on the prioritization of the various tasks planned for the next few years. OOD is extremely relevant to ongoing discussions within the HPC community about user interfaces and science gateways. The session leaders, all part of the OOD development team, will jointly develop the content for the presentation in advance to ensure a wide range of viewpoints and topics are presented. We will also consult with our user advisory group in advance for their suggestions.
Birds of a Feather
OpenACC Users Forum
Programming Frameworks and System Software
Description
The OpenACC organization helps researchers and developers advance science by expanding their parallel computing skills and supporting a directive-based, high-level parallel programming model on CPUs, GPUs, and more. OpenACC supports over 25 global hackathons annually and has facilitated the acceleration of over 200 applications on multiple platforms (e.g., Frontier, Perlmutter, JUWELS, Summit, and Piz Daint). This BoF serves as a forum for OpenACC users, implementers, and the organization's officers to openly discuss the status of OpenACC and its community. Presentations will be given by OpenACC officers, compiler implementers, and invited users, followed by an open mic discussion with the audience.
Birds of a Feather
OpenMP API Version 6.0 - What to Expect
Programming Frameworks and System Software
Description
This BoF is highly interactive and provides attendees with first-hand information from OpenMP implementers and language designers on the future of the OpenMP API. Lightning talks and discussion rounds will give BoF participants ample opportunity to learn from and interact with OpenMP experts, ask questions, and provide community feedback. Sub-committee leaders of the OpenMP ARB will provide insight into the future of OpenMP, focusing on the upcoming release of the OpenMP API version 6.0 in November 2024 and the progress that has been made. Vendor representatives will discuss support and timelines for OpenMP features, and expert users will describe their journeys.
Birds of a Feather
Operational Data Analytics
State of the Practice
Description
Operational Data Analytics (ODA) provides unique opportunities to analyze, understand, and optimize the operations of HPC systems. Readily available open-source frameworks make the collection of monitoring data from different domains of the HPC system increasingly easy. However, making the data work for HPC operations is not straightforward, and effort is being duplicated at many HPC sites to develop methods and tools to analyze the data and leverage it for operations. There is a clear demand for collaboration on this within the community, but because standards for the semantics and naming of monitoring data are currently missing, such collaboration is severely hampered.
Paper
Optimizing Direct Convolutions on ARM Multi-Cores
Artificial Intelligence/Machine Learning
Codesign
Performance Optimization
Programming Frameworks and System Software
Description
Convolution kernels are widely seen in deep learning workloads and are often responsible for performance bottlenecks. Recent research has demonstrated that a direct convolution approach can outperform the traditional convolution implementation based on tensor-to-matrix conversions. However, existing approaches for direct convolution still have room for performance improvement. We present NDIRECT, a new direct convolution approach that targets ARM-based multi-core CPUs commonly found in smartphones and HPC systems. NDIRECT is designed to be compatible with the data layout formats used by mainstream deep learning frameworks but offers new optimizations for the computational kernel, data packing, and parallelization. We evaluate NDIRECT by applying it to representative convolution kernels and demonstrating its performance on four distinct ARM multi-core CPU platforms. We compare NDIRECT against state-of-the-art convolution optimization techniques. Experimental results show that NDIRECT gives the best overall performance across evaluation scenarios and platforms.
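To make the "direct" approach concrete, here is a minimal direct convolution loop nest in pure Python (stride 1, no padding, CHW layout). This is only the baseline loop structure the paper starts from; NDIRECT's actual data packing, parallelization, and ARM-specific kernels are far more involved and are not reproduced here.

```python
# Minimal direct 2D convolution: illustrative sketch only, not NDIRECT.
def direct_conv2d(inp, weights):
    """inp: [C][H][W], weights: [K][C][R][S] -> out: [K][H-R+1][W-S+1]."""
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    K, R, S = len(weights), len(weights[0][0]), len(weights[0][0][0])
    OH, OW = H - R + 1, W - S + 1
    out = [[[0.0] * OW for _ in range(OH)] for _ in range(K)]
    for k in range(K):              # output channels
        for oh in range(OH):        # output rows
            for ow in range(OW):    # output columns
                acc = 0.0
                for c in range(C):  # input channels
                    for r in range(R):
                        for s in range(S):
                            acc += inp[c][oh + r][ow + s] * weights[k][c][r][s]
                out[k][oh][ow] = acc
    return out
```

The tensor-to-matrix (im2col) alternative would first copy each input window into a matrix row; the direct form above avoids that extra data movement, which is the property the paper's optimizations build on.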
Paper
Optimizing High-Performance Linpack for Exascale Accelerated Architectures
Accelerators
Algorithms
Linear Algebra
Best Paper Finalist
Description
We detail the performance optimizations made in rocHPL, AMD's open-source implementation of the High-Performance Linpack (HPL) benchmark targeting accelerated node architectures designed for exascale systems such as the Frontier supercomputer. The implementation leverages the high-throughput GPU accelerators on the node via highly optimized linear algebra libraries, as well as the entire CPU socket to perform latency-sensitive factorization phases. We detail novel performance improvements such as a multi-threaded approach to computing the panel factorization phase on the CPU, time-sharing of CPU cores between processes on the node, as well as several optimizations which hide MPI communication. We present some performance results of this implementation of the HPL benchmark on a single node of the Frontier early access cluster at Oak Ridge National Laboratory, as well as scaling to multiple nodes.
Paper
Optimizing MPI Collectives on Shared Memory Multi-Cores
Distributed Computing
Message Passing
Programming Frameworks and System Software
Best Student Paper Finalist
Description
Collective communication operations, such as broadcasts and reductions, often contribute to performance bottlenecks in Message Passing Interface (MPI) programs. As the number of processor cores integrated into CPUs increases, running multiple MPI processes on shared-memory machines to leverage hardware parallelism is becoming increasingly common. In this context, optimizing MPI collective communications for shared-memory execution is crucial. This paper identifies two primary limitations of existing MPI collective implementations on shared-memory systems. The first is the extensive redundant data movement when performing reduction collectives, and the second is the ineffective use of non-temporal instructions to optimize streamed data processing. To address these challenges, we propose two optimization techniques designed to minimize data movement and enhance the use of non-temporal instructions. We integrate our optimizations into OpenMPI and evaluate their performance through micro-benchmarks and real-world application tests on two multi-core clusters. Experiments show that our approach significantly outperforms existing techniques by 1.2-6.4x.
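For intuition about why data movement dominates reductions, the following sketch simulates a binomial-tree reduction over P "ranks" whose buffers live in shared memory. Each round halves the number of active ranks, so every element is combined O(log P) times rather than gathered sequentially. This is a conceptual illustration only, not the paper's technique or OpenMPI code.

```python
# Conceptual binomial-tree reduction across P rank buffers (shared memory).
def tree_reduce(buffers, op):
    """buffers: one list per rank; reduces element-wise into rank 0's buffer."""
    P = len(buffers)
    step = 1
    while step < P:                       # log2(P) rounds
        for r in range(0, P, 2 * step):
            if r + step < P:              # partner rank exists this round
                src, dst = buffers[r + step], buffers[r]
                for i in range(len(dst)):
                    dst[i] = op(dst[i], src[i])
        step *= 2
    return buffers[0]
```

In a real shared-memory MPI implementation each combine step is a memory copy plus arithmetic, which is exactly where the paper's redundant-data-movement and non-temporal-store optimizations apply.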
Paper
Optimizing Reconfigurable Optical Datacenters: The Power of Randomization
Algorithms
Cloud Computing
Distributed Computing
Heterogeneous Computing
Large Scale Systems
State of the Practice
Description
Reconfigurable optical topologies are a promising new technology to improve datacenter network performance and cope with the explosive growth of traffic. In particular, these networks make it possible to adaptively connect racks that currently exchange much traffic, hence making optimal use of the bandwidth by avoiding multi-hop forwarding.
This paper studies the dynamic optimization of such reconfigurable topologies, adapting to the traffic in an online manner. The underlying algorithmic problem can be described as an online maximum weight b-matching problem, a generalization of maximum weight matching where each node has at most b>=1 incident matching edges.
We make the case for a randomized approach to matching optimization. Our main contribution is an O(log b)-competitive algorithm, and we show that it is asymptotically optimal. This algorithm is exponentially better than the best possible deterministic online algorithm.
We complement our theoretical results with trace-driven simulations, based on real-world datacenter workloads.
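To make the b-matching problem concrete, here is a simple offline greedy baseline: repeatedly take the heaviest edge whose endpoints both still have spare degree. This is only an illustration of the problem's constraints; the paper's contribution is a randomized *online* O(log b)-competitive algorithm, which this deterministic sketch is not.

```python
# Illustrative offline greedy b-matching (baseline, not the paper's algorithm).
def greedy_b_matching(edges, b):
    """edges: [(weight, u, v)]; each node may be matched to at most b edges."""
    degree = {}
    matching = []
    for w, u, v in sorted(edges, reverse=True):   # heaviest edge first
        if degree.get(u, 0) < b and degree.get(v, 0) < b:
            matching.append((u, v))
            degree[u] = degree.get(u, 0) + 1
            degree[v] = degree.get(v, 0) + 1
    return matching
```

With b = 1 this reduces to ordinary greedy matching; larger b models a rack with b reconfigurable optical links.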
Paper
PanguLU: A Scalable Regular Two-Dimensional Block-Cyclic Sparse Direct Solver on Distributed Heterogeneous Systems
Accelerators
Algorithms
Linear Algebra
Best Paper Finalist
Description
Sparse direct solvers play a vital role in large-scale high performance computing in science and engineering. Existing distributed sparse direct methods employ multifrontal/supernodal patterns to aggregate columns of nearly identical forms and to exploit dense basic linear algebra subprograms (BLAS) for computation. We propose a new sparse direct solver called PanguLU. Our work relies on simpler regular 2D blocking and stores blocks in their sparse forms to avoid any extra fill-ins. Based on sparse patterns of blocks, a variety of block-wise sparse BLAS methods are developed and selected for higher efficiency on local GPUs. To make PanguLU more scalable, we also adjust mapping of blocks to processes for overall more balanced workload, and propose a synchronisation-free communication strategy to reduce overall latency overhead. Experiments on two distributed heterogeneous platforms consisting of 128 A100 GPUs and 128 MI50 GPUs demonstrate that PanguLU achieves up to 11.70x and 17.97x speedups over SuperLU_DIST.
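The regular 2D block-cyclic distribution the solver builds on can be sketched in a few lines: block (i, j) of the matrix is owned by process (i mod P, j mod Q) on a P x Q process grid, spreading an irregular sparse factorization evenly across processes. This is a generic illustration of the mapping, not PanguLU's actual (dynamically adjusted) implementation.

```python
# Regular 2D block-cyclic block-to-process mapping on a P x Q grid.
def block_owner(i, j, P, Q):
    """Linear rank owning block (i, j)."""
    return (i % P) * Q + (j % Q)

def blocks_of_rank(rank, nb, P, Q):
    """All (i, j) block coordinates owned by `rank` in an nb x nb block grid."""
    return [(i, j) for i in range(nb) for j in range(nb)
            if block_owner(i, j, P, Q) == rank]
```

On a 2 x 2 grid, each of the four ranks owns every other block in both dimensions, which is the balance property the paper then refines for sparse workloads.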
Tutorial
Parallel Computing 101
Description
This tutorial provides a comprehensive overview of parallel computing, emphasizing those aspects most relevant to the user. It is suitable for new users, students, managers, and anyone seeking an overview of parallel computing. It discusses software and hardware/software interaction, with an emphasis on standards, portability, and systems that are widely available.
The tutorial surveys basic parallel computing concepts, using examples selected from multiple engineering, scientific, and machine learning problems. These examples illustrate using MPI on distributed memory systems; OpenMP on shared memory systems; MPI+OpenMP on hybrid systems; and CUDA and compiler directives on GPUs and accelerators. It discusses numerous parallelization and load balancing approaches, and software engineering and performance improvement aspects, including the use of state-of-the-art tools.
The tutorial helps attendees make intelligent decisions by covering the primary options that are available, explaining how the different components work together and what they are most suitable for. Extensive pointers to web-based resources are provided to facilitate follow-up studies.
Tutorial
Parallel I/O In Practice
Description
I/O on HPC systems is a black art. This tutorial sheds light on the state of the art in parallel I/O and provides the knowledge necessary for attendees to best leverage the I/O resources available to them. We cover the entire I/O software stack, including storage and parallel file systems at the lowest layer, the role of NVRAM devices, intermediate layers (such as MPI-IO), and high-level I/O libraries (such as HDF5). We emphasize ways to use these interfaces that result in high performance, as well as tools for generating insight into these stacks.
The first third of the tutorial covers parallel I/O fundamentals. We discuss storage technologies, both present and near-future, and the major parallel and distributed file systems. The second third focuses on applications, connecting storage to our examination of the upper library layers of the I/O stack and covering MPI-IO, Parallel netCDF, and HDF5. Finally, we discuss tools for understanding I/O behavior.
Paper
Parallel Top-K Algorithms on GPU: A Comprehensive Study and New Methods
Accelerators
Algorithms
Graph Algorithms and Frameworks
Description
The top-K problem is an essential part of many important applications in scientific computing, information retrieval, etc. As data volumes grow rapidly, high-performance parallel top-K algorithms become critical. We propose two parallel top-K algorithms for GPU, AIR top-K (Adaptive and Iteration-fused Radix top-K) and GridSelect. AIR top-K employs an iteration-fused design to minimize CPU-GPU communication and device data access. Its adaptive strategy automatically eliminates unnecessary device memory traffic under various data distributions. GridSelect can process data on-the-fly. It adopts a shared queue and parallel two-step insertion to decrease the frequency of costly operations. We comprehensively compare 8 open-source GPU implementations and our methods for a wide range of problem sizes and data distributions. For batch sizes 1 and 100, respectively, AIR top-K shows 1.98-21.48X and 8.01-574.78X speedup over the previous radix top-K algorithm, and 1.44-7.34X and 1.38-31.91X speedup over state-of-the-art methods. GridSelect shows up to 882.29X speedup over its baseline.
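As a point of reference for what the GPU algorithms parallelize, the standard sequential approach keeps the K largest elements seen so far in a bounded min-heap and compares each new element against the current minimum. The sketch below shows that baseline only; AIR top-K and GridSelect replace it with radix- and grid-based GPU selection.

```python
# Sequential bounded-min-heap top-K: the CPU baseline, not the paper's method.
import heapq

def top_k(values, k):
    """Return the k largest values in descending order."""
    heap = []                         # min-heap holding the current top-k
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)
        elif v > heap[0]:             # beats the smallest of the top-k
            heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True)
```

This runs in O(n log k) time but is inherently serial at the heap, which is why data-parallel selection strategies are needed on GPUs.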
Birds of a Feather
Pathfinding in HPC Education and Training
Description
Despite the quantity of existing training materials, acquiring and developing HPC skills is not straightforward enough to address the needs of the growing and diversifying HPC community. To address this, the HPC teaching and training ecosystem must mirror the growth and diversification of the HPC community and its technologies. This BoF creates an opportunity to gather user/learner community perspectives and explore new requirements, in order to identify new entry points and build well-defined learning pathways that more accurately represent the aims of the user/learner community and the changing technology landscape. We encourage those interested in HPC training to attend.
Workshop
PDSW23: 8th International Parallel Data Systems Workshop
Description
Efficient data storage and data management are crucial to scientific productivity in both traditional simulation-oriented HPC environments and Big Data analysis environments. This issue is further exacerbated by the growing volume of experimental and observational data, the widening gap between the performance of computational hardware and storage hardware, and the emergence of new data-driven algorithms in machine learning. The goal of this workshop is to facilitate research that addresses the most critical challenges in scientific data storage and data processing.
PDSW will continue to build on the successful tradition established by its predecessor workshops: the Petascale Data Storage Workshop (PDSW, 2006-2015) and the Data Intensive Scalable Computing Systems workshop (DISCS, 2012-2015). These workshops were successfully combined in 2016, and the resulting joint workshop attracted up to 38 full paper submissions and 140 attendees per year from 2016 to 2022.
Paper
PeeK: A Prune-Centric Approach for K Shortest Path Computation
Accelerators
Algorithms
Graph Algorithms and Frameworks
Description
The K shortest path (KSP) algorithm, which finds the top K shortest simple paths from a given source to a target vertex, has a wide range of real-world applications. While the top K shortest simple paths offer invaluable insights, computing them is time-consuming. In this work, we observe that existing works search for the K shortest paths in the original graph, even though the top K shortest paths cover only a meager portion of it. This paper devises PeeK, which first applies K upper bound pruning to remove the vertices and edges that cannot appear in any of the K shortest paths. Second, PeeK adaptively compacts the graph, which not only removes the pruned vertices and edges but also speeds up the downstream computation. We compare PeeK with five algorithms. For parallel computation with 32 threads, PeeK achieves 5.1x and 28.8x speedup over the state of the art for K = 8 and K = 128, respectively.
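The pruning idea can be sketched briefly: a vertex v can lie on an s-t path of length at most U only if dist(s, v) + dist(v, t) <= U, where U is any upper bound on the length of the K-th shortest path. The sketch below applies that test using two Dijkstra sweeps (forward from s, backward from t); it illustrates the principle only and is not PeeK's implementation.

```python
# Upper-bound pruning sketch: keep only vertices that can appear on an
# s-t path of length <= upper_bound. Illustrative, not PeeK itself.
import heapq

def dijkstra(adj, src):
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue                      # stale queue entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def prune(adj, radj, s, t, upper_bound):
    """adj: forward adjacency, radj: reversed adjacency."""
    ds, dt = dijkstra(adj, s), dijkstra(radj, t)
    return {v for v in set(ds) & set(dt) if ds[v] + dt[v] <= upper_bound}
```

Any vertex failing the test is provably useless for the KSP search, so the expensive enumeration can run on the much smaller compacted graph.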
Tutorial
Performance Tuning with the Roofline Model on GPUs and CPUs
Description
The Roofline performance model offers an insightful and intuitive method for extracting the key execution characteristics of HPC applications and comparing them against the performance bounds of modern CPUs and GPUs. Its ability to abstract the complexity of memory hierarchies and identify the most profitable optimization techniques has made Roofline-based analysis increasingly popular in the HPC community. Although different flavors of the Roofline model have been developed to deal with various definitions of memory data movement, there remains a need for a systematic methodology when applying them to analyze applications running on multicore and accelerated systems. The tutorial aims to bridge this gap on both CPUs and GPUs by exposing the fundamental aspects behind different Roofline modeling principles and providing several practical use case scenarios that highlight their efficacy for application optimization. This tutorial presents a unique combination of an introduction to Roofline by its creator, hands-on instruction in using Roofline within Intel's, NVIDIA's, and AMD's production performance tools, and discussions of real-world Roofline use cases at the ALCF, NERSC, and OLCF computing centers. The tutorial presenters have a long history of collaborating on the Roofline model and have presented several Roofline-based tutorials.
Paper
Phases, Modalities, Spatial and Temporal Locality: Domain Specific ML Prefetcher for Accelerating Graph Analytics
Architecture and Networks
Data Movement and Memory
Graph Algorithms and Frameworks
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
Description
Memory performance is a bottleneck in graph analytics acceleration. Existing Machine Learning (ML) prefetchers struggle with phase transitions and irregular memory accesses in graph processing. We propose MPGraph, an ML-based Prefetcher for Graph analytics using domain specific models. MPGraph introduces three novel optimizations: soft detection for phase transitions, phase-specific multi-modality models for access delta and page predictions, and chain spatio-temporal prefetching (CSTP) for prefetch control.
Our transition detector achieves 34.17–82.15% higher precision compared with Kolmogorov–Smirnov Windowing and decision trees. Our predictors achieve 6.80–16.02% higher F1-score for delta prediction and 11.68–15.41% higher accuracy-at-10 for page prediction compared with LSTM and vanilla attention models. Using CSTP, MPGraph achieves 12.52–21.23% IPC improvement, outperforming the state-of-the-art non-ML prefetcher BO by 7.58–12.03% and the ML-based prefetchers Voyager and TransFetch by 3.27–4.58%. For practical implementation, we demonstrate that MPGraph with compressed, reduced-latency models still shows significantly superior accuracy and coverage compared with BO, leading to a 3.58% higher IPC improvement.
Workshop
PMBS23: The 14th International Workshop on Performance Modeling, Benchmarking, and Simulation of High-Performance Computer Systems
Description
The PMBS23 workshop is concerned with the comparison of high-performance computing systems through performance modeling, benchmarking or through the use of tools such as simulators. We are particularly interested in research which reports the ability to measure and make tradeoffs in software/hardware co-design to improve sustained application performance. We are also keen to capture the assessment of future systems.
The aim of this workshop is to bring together researchers, from industry and academia, concerned with the qualitative and quantitative evaluation and modeling of high-performance computing systems. Authors are invited to submit novel research in all areas of performance modeling, benchmarking and simulation, and we welcome research that brings together current theory and practice. We recognize that the term 'performance' has broadened to include power consumption and reliability, and that performance modeling is practiced through analytical methods and approaches based on software tools and simulators.
Paper
Portable and Scalable All-Electron Quantum Perturbation Simulations on Exascale Supercomputers
Applications
Modeling and Simulation
Description
Quantum perturbation theory is pivotal in determining the critical physical properties of materials. The first-principles computations of these properties have yielded profound and quantitative insights in diverse domains of chemistry and physics.
In this work, we propose a portable and scalable OpenCL implementation of quantum perturbation theory, which can be generalized across various high-performance computing (HPC) systems. Optimal portability is realized through the utilization of a cross-platform unified interface and a collection of performance-portable heterogeneous optimizations. Exceptional scalability is attained by addressing major constraints on memory and communication, employing a locality-enhanced task mapping strategy and a packed hierarchical collective communication scheme. Experiments on two advanced supercomputers demonstrate that the quantum perturbation calculation exhibits remarkable performance on various material systems, scaling to 200,000 atoms with all-electron precision. This research enables all-electron quantum perturbation simulations on substantially larger molecular scales, with a potentially significant impact on progress in the material sciences.
Tutorial
Portable GPU Acceleration of HPC Applications with Standard C++
Description
This hands-on tutorial teaches how to parallelize and optimize HPC applications for multi-core CPUs and GPUs using the portable parallelism and concurrency features of the ISO C++23 standard, without any language or vendor extensions. We further show how to integrate this approach with MPI to target large multi-node homogeneous and heterogeneous HPC systems. The attendees learn problem-solving strategies for parallelizing classic HPC patterns (multi-dimensional loops, map-reduce, scans) and concurrency problems, e.g., hiding the latency of MPI communication behind computation. The tutorial provides attendees zero-setup web access to Jupyter Lab running on modern multi-GPU accelerated systems, enabling attendees to solve the hands-on exercises directly in their web browser. These hands-on exercises apply the above-mentioned techniques to produce a portable multi-node, heterogeneous, and asynchronous 2D unsteady heat-equation mini-application. Finally, we synthesize practical techniques acquired from our professional experience applying the portable ISO C++23 parallel and asynchronous programming models to port large real-world HPC applications to heterogeneous supercomputers, and point to further learning resources.
Birds of a Feather
Power Consumption and Exascale Computing: Toward a “Short Production Circuit” Model
Architecture and Networks
Description
HPC computer centers have had to evolve considerably since the "Good Old Times" of petascale. As a result, HPC centers require huge amounts of power to run, and TCO has gone through the roof, particularly in times of rising energy prices.
This BoF proposes a "short production circuit" compute model, in which the compute, storage, and network systems collaborate to execute applications and workflows in a small, compact, and contiguous part of the system, exploiting the locality of compute and data resources, and thereby reducing energy usage and cost by avoiding spreading applications and data across the whole system.
Tutorial
Principles and Practice of High Performance Deep/Machine Learning Training and Inference
Description
Recent advances in Machine and Deep Learning (ML/DL) have led to many exciting challenges and opportunities. Modern ML/DL frameworks including TensorFlow, PyTorch, and cuML enable high-performance training, inference, and deployment for various types of ML models and Deep Neural Networks (DNNs). This tutorial provides an overview of recent trends in ML/DL and the role of cutting-edge hardware architectures and interconnects in moving the field forward. We will also present an overview of different DNN architectures, ML/DL frameworks, DL Training and Inference, and Hyperparameter Optimization with special focus on parallelization strategies for model training. We highlight new challenges and opportunities for communication runtimes to exploit high-performance CPU/GPU architectures to efficiently support large-scale distributed training. We also highlight some of our co-design efforts to utilize MPI for large-scale DNN training on cutting-edge CPU/GPU/DPU architectures available on modern HPC clusters. Throughout the tutorial, we include several hands-on exercises to enable attendees to gain first-hand experience of running distributed ML/DL training and hyperparameter optimizations on a modern GPU cluster.
Paper
Prodigy: Toward Unsupervised Anomaly Detection in Production HPC Systems
Architecture and Networks
Performance Measurement, Modeling, and Tools
Resource Management
Description
Performance variations caused by anomalies in modern High Performance Computing (HPC) systems lead to decreased efficiency, impaired application performance, and increased operational costs. While machine learning (ML)-based frameworks for automated anomaly detection (often based on time series telemetry data) are gaining popularity in the literature, practical deployment challenges are often overlooked. Some ML-based frameworks require extensive customization, while others need a rich set of labeled samples; neither is feasible for a production HPC system.
This paper introduces a variational autoencoder-based anomaly detection framework, Prodigy, that outperforms the state-of-the-art alternatives by achieving a 0.95 F1-score when detecting performance anomalies. The paper also provides a real system implementation of Prodigy that enables easy integration with monitoring frameworks and rapid deployment. We deploy Prodigy on a production HPC system and demonstrate 88% accuracy in detecting anomalies. Prodigy also provides an interface for job- and node-level analysis and explanations of anomaly predictions.
Tutorial
Programming Novel AI Accelerators for Scientific Computing
Description
Scientific applications are increasingly adopting Artificial Intelligence (AI) techniques to advance science. There are specialized hardware accelerators designed and built to run AI applications efficiently. With a wide diversity in the hardware architectures and software stacks of these systems, it is challenging to understand the differences between these accelerators, their capabilities, programming approaches, and how they perform, particularly for scientific applications. In this tutorial, we will cover an overview of the AI accelerators landscape with a focus on SambaNova, Cerebras, Graphcore, Groq, and Habana systems along with architectural features and details of their software stacks. We will have hands-on exercises that will help attendees understand how to program these systems by learning how to refactor codes written in standard AI framework implementations and compile and run the models on these systems. The tutorial will enable the attendees with an understanding of the key capabilities of emerging AI accelerators and their performance implications for scientific applications.
Tutorial
Programming Your GPU with OpenMP: A “Hands-On” Introduction
Description
If you are an HPC programmer, you know OpenMP. Alongside MPI, OpenMP is the open, cross-vendor foundation of HPC. As hardware complexity has grown, OpenMP has grown as well adding GPU support in OpenMP 4.0 (2013). With a decade of evolution since then, OpenMP GPU technology is now a mature option for programming any GPU you are likely to find on the market.
While there are many ways to program a GPU, the best way is through OpenMP. Why? Because the GPU does not exist in isolation. There are always one or more CPUs on a node. Programmers need portable code that fully exploits all available processors. In other words, programmers need a programming model, such as OpenMP, that fully embraces heterogeneity.
In this tutorial, we explore GPU programming with OpenMP. We assume attendees already know the fundamentals of multi-threading with OpenMP, so we use our time on the directives that define how to map loops onto GPUs and optimize data movement between the CPU and GPU. Students will use their own laptops (with Windows, Linux, or macOS) to connect to remote servers we will provide with GPUs and all the software needed for the tutorial.
Birds of a Feather
Providing a Unified User Interface and Experience for Geographically Dispersed Computing Resources
Distributed Computing
Description
This BoF session will address user experience challenges that arise from geographically dispersed computing resources, such as when an organization operates multiple HPC clusters or wishes to combine on-premises and cloud-based compute services. A series of speakers will provide an overview of current perspectives on and solutions for making dispersed computing resources available to user communities. We invite participants to engage in a facilitated follow-up discussion to identify key unresolved hurdles and document emerging community best practices for providing the best possible user experience in geographically dispersed HPC settings.
Panel
Quantum Computing and HPC: Opportunities and Challenges for New Companies in the Field of HPC
Codesign
Quantum Computing
Description
Quantum Computing is quickly maturing and has started to enter the area of High-Performance Computing. As a consequence, we are seeing more and more work on quantum computing in the SC program, and also more and more exhibitors focusing on this new technology and its relationship to HPC. This, however, comes with many challenges, especially for new companies in this field, as they have to bridge the gap between physics and computer science, both from a technology and a community point of view. In this panel, we will discuss this topic with five quantum computing companies covering hardware, software, and workflow aspects: their take on the impact of HPC on them as well as their impact on HPC, special challenges, and the future prospects of quantum computing as a new accelerator technology for HPC.
Paper
Rapid Simulations of Atmospheric Data Assimilation of Hourly-Scale Phenomena with Modern Neural Networks
Artificial Intelligence/Machine Learning
Applications
Modeling and Simulation
State of the Practice
Description
Atmospheric data assimilation is essential for numerical weather prediction. Ensemble-based data assimilation connects multiple instances of an atmospheric model through a Kalman-filter-based algorithm, which is regarded as a challenging computing task today. In this work, we present our efforts to build a fast, low-cost, and scalable atmospheric data assimilation prototype for the new-generation Sunway supercomputer, including (1) a UNet-neural-network-based surrogate model for atmospheric dynamic simulation that generates the entire background ensemble with both satisfactory accuracy and reasonable robustness; (2) a batched LETKF with an efficient eigenvalue decomposition implementation and a data staging strategy to hide the observation IO time; and (3) a framework able to flexibly deploy the components, making it possible to reach maximum resource efficiency. Experimental evaluations show that our AI-integrated ensemble data assimilation prototype can finish hour-cycle assimilation in minutes, maintains linear scalability, and saves an order of magnitude of computing resources compared with the traditional scientific method.
Paper
ReFloat: Low-Cost Floating-Point Processing in ReRAM for Accelerating Iterative Linear Solvers
Algorithms
Linear Algebra
Post-Moore Computing
Description
Resistive random access memory (ReRAM) is a promising technology that can perform low-cost and in-situ matrix-vector multiplication (MVM) in the analog domain. Scientific computing requires high-precision floating-point (FP) processing. However, performing floating-point computation in ReRAM is challenging because of the high hardware cost and execution time caused by the large FP value range. In this work, we present ReFloat, a data format and an accelerator architecture for low-cost and high-performance floating-point processing in ReRAM for iterative linear solvers. ReFloat matches the ReRAM crossbar hardware and represents a block of FP values with reduced bits and an optimized exponent base for a high dynamic representation range. Thus, ReFloat achieves lower ReRAM crossbar consumption and fewer processing cycles, and overcomes the non-convergence issue in a prior work. The evaluation on the SuiteSparse matrices shows that ReFloat achieves 5.02x to 84.28x improvement in terms of solver time compared to a state-of-the-art ReRAM-based accelerator.
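The idea of representing a block of FP values with reduced bits and a shared exponent can be illustrated with a toy block-floating-point encoder. This is a deliberately simplified sketch: ReFloat's exponent-base optimization and its mapping onto ReRAM crossbars are considerably more sophisticated.

```python
# Toy block-floating-point: one shared exponent per block, small integer
# mantissas per value. Illustrative only, not the ReFloat format itself.
import math

def encode_block(values, mantissa_bits):
    """Return (shared_exponent, integer mantissas) for a block of floats."""
    # Shared exponent = largest binary exponent in the block (0 if all zero).
    exp = max((math.frexp(v)[1] for v in values if v != 0.0), default=0)
    scale = 2 ** (mantissa_bits - exp)
    return exp, [round(v * scale) for v in values]

def decode_block(exp, mantissas, mantissa_bits):
    """Reconstruct approximate floats from the block representation."""
    scale = 2.0 ** (exp - mantissa_bits)
    return [m * scale for m in mantissas]
```

Values near the block's maximum are represented accurately, while much smaller values lose precision; choosing the exponent base well (as ReFloat does) controls that trade-off against crossbar cost.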
Workshop
Research Software Engineers in HPC (RSE-HPC-2023)
Description
Research software engineers (RSEs) are critical to the impact of HPC, data science, and the larger scientific community. They have existed for decades, though often not under that name. The past several years, however, have seen the development of the RSE concept, common job titles, and career paths; the creation of professional networks to connect RSEs; and the emergence of RSE groups at universities, national laboratories, and industry.
This workshop will bring together RSEs and allies involved in HPC, from all over the world, to grow the RSE community by establishing and strengthening professional networks of current RSEs and RSE leaders. We will hear about successes and challenges that RSEs and RSE groups have experienced and discuss ways to increase awareness of RSE opportunities and improve support for RSEs.
The workshop will be highly interactive, featuring breakout discussions and panels, as well as invited addresses and submitted talks.
Paper
Rethinking Deployment for Serverless Functions: A Performance-First Perspective
Cloud Computing
Distributed Computing
Energy Efficiency
Performance Measurement, Modeling, and Tools
Description
Serverless computing commonly adopts strong isolation mechanisms for deploying functions, which may bring significant performance overhead because each function needs to run in a completely new environment, i.e., the “one-to-one” model. To accelerate function computation, prior work has proposed sandbox sharing to reduce this overhead, i.e., the “many-to-one” model. Nonetheless, both process-based true parallelism and thread-based pseudo-parallelism prevent its adoption for latency-sensitive web services.
To achieve optimal performance and resource efficiency for serverless workflows, we argue for an “m-to-n” deployment model that manipulates multiple granularities of computing abstractions (e.g., processes, threads) and sandboxes to amortize overhead. We propose wrap, a new deployment abstraction that balances the tradeoffs among interaction overhead, startup overhead, and function execution. We further design Chiron, a wrap-based deployment manager that automatically orchestrates multiple computing abstractions based on performance prioritization. Our comprehensive evaluation indicates that Chiron outperforms state-of-the-art systems by 1.3x to 21.8x in system throughput.
Workshop
RSDHA: Redefining Scalability for Diversely Heterogeneous Architectures
Description
"Scalable computing" has governed another dimension. Contrary to the traditional use, which is often proportional to the total number of nodes or transistors in a system working collaboratively to compute a given workload, the newly rising dimension rather scales with the number of different type of processors (i.e., accelerators) each optimized for a domain specific task. The proposed workshop targets to investigate the far-end intersection of the two dimensions where we believe that future architectures will be located at.
The proposed workshop seeks solutions for the two types of future architectures:
- For diversely heterogeneous HPC systems: How could the traditional HPC applications adopt the architectural, programming and runtime approaches employed by the state-of-the-art diversely heterogeneous embedded systems?
- For distributed, large-scale embedded systems: How could the future embedded systems take lessons from traditional HPC to beat the multi-node scalability challenges as they become increasingly more connected?
Panel
RSEs in HPC Centers: Funding, Coordinating, Doing
Software Engineering
Description
Research Software Engineering (RSEng) as a professional designation has grown over the last 10+ years in industry, academia, and government sectors. Within HPC centers, Research Software Engineers (RSE) fill the role of combining software engineering expertise with the in-depth process of participating in and applying research. In this panel, we invite practicing RSEs, funders, university, and HPC center leaders who are experienced and dedicated to Research Software Engineering to present their varying perspectives on funding, managing, and doing RSEng within worldwide HPC centers. The moderator is Daniel S. Katz (Chief Scientist, NCSA; co-founder, US-RSE), and panelists are Gabrielle Allen (Director, School of Computing, University of Wyoming), Neil Chue Hong (EPCC, University of Edinburgh; Director, Software Sustainability Institute), Alison Kennedy (Strategic Advisor, UK Research and Innovation), Fabio Kon (Special Advisor, São Paulo Research Foundation), and Miranda Mundt (RSE, Sandia National Laboratories; Steering Committee Member, US-RSE).
Paper
Runtime Composition of Iterations for Fusing Loop-Carried Sparse Dependence
Compilers
Performance Measurement, Modeling, and Tools
Performance Optimization
Programming Frameworks and System Software
Description
Dependence between iterations in sparse computations causes inefficient use of memory and computation resources. This paper proposes sparse fusion, a technique that generates efficient parallel code for the combination of two sparse matrix kernels, where at least one of the kernels has loop-carried dependencies. Existing implementations optimize individual sparse kernels separately. However, this approach leads to synchronization overheads and load imbalance due to the irregular dependence patterns of sparse kernels, as well as inefficient cache usage due to their irregular memory access patterns. Sparse fusion uses a novel inspection strategy and code transformation to generate parallel fused code optimized for data locality and load balance. Sparse fusion outperforms the best of the unfused implementations using ParSy and MKL by an average of 4.2x and is faster than the best of the fused implementations using existing scheduling algorithms such as LBC, DAGP, and wavefront by an average of 4x for various kernel combinations.
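For context, the wavefront (level-set) scheduling that sparse fusion is compared against groups iterations whose dependencies are already satisfied, so each group can run in parallel. A minimal sketch over a hypothetical dependence graph:

```python
def wavefront_levels(deps):
    """Assign each iteration to the earliest wavefront in which all of
    its dependencies have completed; iterations within one wavefront are
    mutually independent and can run in parallel."""
    levels = {}
    def level(i):
        if i not in levels:
            levels[i] = 0 if not deps[i] else 1 + max(level(j) for j in deps[i])
        return levels[i]
    for i in deps:
        level(i)
    waves = {}
    for i, l in levels.items():
        waves.setdefault(l, []).append(i)
    return [sorted(waves[l]) for l in sorted(waves)]

# Hypothetical loop-carried dependence pattern (iteration i waits on listed ones),
# e.g. rows of a sparse triangular solve depending on their off-diagonal nonzeros
deps = {0: [], 1: [0], 2: [], 3: [1, 2], 4: [0]}
waves = wavefront_levels(deps)
```

The synchronization barrier between consecutive wavefronts is exactly the overhead that fusing two kernels' iteration spaces, as the paper does, aims to reduce.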
Panel
Runtimes and Workflow Systems for Extreme Heterogeneity: Challenges and Opportunities
Codesign
Heterogeneous Computing
Runtime Systems
Description
Extreme heterogeneity has been identified as one of the most important priority research directions today. Additionally, applications are expected to grow in complexity to enable progress in multiple areas of science, technology, and engineering. This urges consideration of hardware/software co-design to facilitate the adoption of emerging technologies. With this scenario in mind, the opportunities lie in designing new features in runtimes and workflows. This panel aims to debate what future systems will look like. Advances in this matter are key to executing science workflows and understanding their results, enabling efficient execution on diverse platforms, ensuring scalability of high-level descriptions of analytics workflows, and increasing user productivity and system utilization; in other words, how easily and rapidly a science team can develop or port a workflow to a new platform, and how well the resulting implementation makes use of the platform and its resources.
Panel
Scalable and Adaptable Architectures for AI/HPC Advancement
Artificial Intelligence/Machine Learning
Description
AI/machine learning usage is exploding in both application scope and model size. Predictive analytics, physics, modeling, and new use cases for generative AI/ML are increasing model sizes by 10x every 18 months. The custom processors and accelerators used for AI/ML require continually higher I/O bandwidth to keep up with this model growth. How, though, does one deploy a high-performance architecture that is scalable and adaptable over time to address this phenomenon? The panel will discuss the architectures, I/O, and large-scale system topologies needed to grow well beyond 200 billion parameters. You will gain insights into system concepts, scaled across workload sizes, that are both cost-effective, thanks to new configurability options, and energy-efficient. Is there a new billion-parameters-per-watt metric? These are the topics the panel will discuss and debate.
Tutorial
Scalable Big Data Processing on High Performance Computing Systems
Description
There are several popular Big Data processing frameworks, including Apache Spark and Dask. These frameworks are not capable of exploiting high-speed, low-latency networks like InfiniBand, Omni-Path, Slingshot, and others. In the High Performance Computing (HPC) community, the Message Passing Interface (MPI) libraries are widely adopted to tackle this issue by executing scientific and engineering applications on parallel hardware connected via fast interconnects.
This tutorial introduces MPI4Spark and MPI4Dask, enhanced Spark and Dask frameworks, respectively, that utilize MPI for communication in parallel and distributed settings on HPC systems. MPI4Spark can launch the Spark ecosystem using MPI launchers to utilize MPI communication. It also maintains isolation for application execution by forking new processes using Dynamic Process Management (DPM). MPI4Spark also provides portability and performance benefits as it can utilize popular HPC interconnects. MPI4Dask is an MPI-based custom Dask framework targeted at modern HPC clusters built with CPUs and NVIDIA GPUs.
This tutorial provides a detailed overview of the design, implementation, and evaluation of MPI4Spark and MPI4Dask on state-of-the-art HPC systems. Later, we also cover writing, running, and demonstrating user Big Data applications on HPC systems.
Paper
Scalable Tuning of (OpenMP) GPU Applications via Kernel Record and Replay
Accelerators
Distributed Computing
Middleware and System Software
Performance Measurement, Modeling, and Tools
Post-Moore Computing
Best Paper Finalist
Description
HPC is a heterogeneous world in which host and device code are interleaved throughout the application. Given the significant performance advantage of accelerators, device code execution time is becoming the new bottleneck. Tuning the accelerated parts is consequently highly desirable but often impractical due to the large overall application runtime which includes unrelated host parts.
We propose a Record-Replay (RR) mechanism to facilitate auto-tuning of large (OpenMP) offload applications. RR dissects the application, effectively isolating GPU kernels into independent executables. These comparatively small codelets are amenable to various forms of post-processing, including elaborate auto-tuning. By eliminating the resource requirements and application dependencies, massively parallel and distributed auto-tuning becomes feasible.
Using RR, we run scalable Bayesian Optimization to determine optimal kernel launch parameters. LULESH showcases an end-to-end speedup of up to 1.53x, while RR enables 102x faster tuning compared to existing approaches using the entire application.
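The tuning loop over replayed kernels can be mimicked with a stand-in cost model: "replay" the isolated kernel under candidate launch parameters and keep the cheapest. Everything below (the synthetic cost model, the candidate block sizes) is a hypothetical illustration, not the paper's Bayesian optimizer.

```python
import random

def replay_cost(block_size, recorded_n=1_000_000):
    """Stand-in for replaying a recorded kernel: a synthetic cost model
    over one launch parameter (a real replay would time the extracted
    kernel executable on its recorded inputs)."""
    blocks = -(-recorded_n // block_size)          # ceil-divide: grid size
    occupancy_penalty = abs(block_size - 256) / 256
    return blocks * (1 + occupancy_penalty)

def tune(candidates, trials=20, seed=1):
    """Sample candidate launch parameters and keep the cheapest replay."""
    rng = random.Random(seed)
    sampled = rng.sample(candidates, min(trials, len(candidates)))
    return min(sampled, key=replay_cost)

best_block = tune([32, 64, 128, 256, 512, 1024])
```

Because each replayed codelet is a self-contained executable, such evaluations can be farmed out across many nodes, which is what makes the distributed tuning in the paper feasible.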
ACM Gordon Bell Finalist
Awards
Scaling the “Memory Wall” for Multi-Dimensional Seismic Processing with Algebraic Compression on Cerebras CS-2 Systems
TP
Description
We exploit the high memory bandwidth of AI-customized Cerebras CS-2 systems for seismic processing. Through low-rank matrix approximation, memory-hungry seismic applications fit onto memory-austere SRAM wafer-scale hardware, addressing a challenge arising in many wave-equation-based algorithms that rely on multi-dimensional convolution (MDC) operators. Exploiting sparsity inherent in seismic data in the frequency domain, we implement embarrassingly parallel tile low-rank matrix-vector multiplications (TLR-MVM), which account for most of the elapsed time in MDC operations, to solve the Multi-Dimensional Deconvolution (MDD) inverse problem. By reducing memory footprint along with arithmetic complexity, we fit a standard seismic benchmark dataset into the local memories of Cerebras processing elements. TLR-MVM on 48 CS-2 systems in support of MDD delivers a sustained memory bandwidth of 92.58 PB/s on 35,784,000 processing elements, a significant milestone that highlights the capability of AI-customized architectures to enable a new generation of seismic algorithms that will empower multiple technologies of our low-carbon future.
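The tile low-rank matrix-vector product at the heart of this approach can be sketched as follows: each tile is kept as a pair of thin factors, so the per-tile product costs O(b·r) rather than O(b²). The sizes here are toy assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def tlr_mvm(tiles, x, b):
    """Tile low-rank MVM: tile (i, j) is stored as thin factors (U, V)
    with U @ V standing in for the dense b-by-b block, so each tile
    contributes U @ (V @ x_j) at rank-r cost."""
    n_tile_rows = max(i for i, _ in tiles) + 1
    y = np.zeros(n_tile_rows * b)
    for (i, j), (U, V) in tiles.items():
        y[i*b:(i+1)*b] += U @ (V @ x[j*b:(j+1)*b])
    return y

# Toy 2x2 tiling with tile size b=4 and rank r=2, checked against dense
b, r = 4, 2
tiles, dense = {}, np.zeros((2 * b, 2 * b))
for i in range(2):
    for j in range(2):
        U, V = rng.normal(size=(b, r)), rng.normal(size=(r, b))
        tiles[(i, j)] = (U, V)
        dense[i*b:(i+1)*b, j*b:(j+1)*b] = U @ V   # exactly rank-r here
x = rng.normal(size=2 * b)
y = tlr_mvm(tiles, x, b)
```

The memory saving is what matters on the CS-2: factors of size b·r per tile fit in SRAM where dense b² blocks would not.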
ACM Gordon Bell Finalist
Awards
Scaling the Leading Accuracy of Deep Equivariant Models to Biomolecular Simulations of Realistic Size
TP
Description
This work brings the leading accuracy, sample efficiency, and robustness of deep equivariant neural networks to the extreme computational scale. This is achieved through a combination of innovative model architecture, massive parallelization, and models and implementations optimized for efficient GPU utilization. The resulting Allegro architecture bridges the accuracy/speed tradeoff of atomistic simulations and enables the description of dynamics in structures of unprecedented complexity at quantum fidelity. To illustrate the scalability of Allegro, we perform nanoseconds-long stable simulations of protein dynamics and scale up to a 44-million-atom structure of a complete, all-atom, explicitly solvated HIV capsid on the Perlmutter supercomputer. We demonstrate excellent strong scaling up to 100 million atoms and 70% weak-scaling efficiency up to 5,120 A100 GPUs.
Tutorial
Scientific Computing with Kubernetes
Description
Kubernetes has emerged as the leading container orchestration solution, working on resources ranging from on-prem clusters to commercial clouds. Developed at Google and now maintained by the Cloud Native Computing Foundation, it sports a diverse and active development community. At SDSC, Kubernetes capabilities are available on the Expanse, Voyager, and Prototype National Research Platform (PNRP) Nautilus (a multi-site distributed resource) clusters. The ability to run services in Kubernetes enables the execution of non-traditional workloads, including some complex scientific workflows that are difficult to handle through traditional batch scheduling on HPC clusters.
Kubernetes does not have a traditional batch interface, but the concepts are similar enough to allow for porting of existing batch-focused workloads to it. Users can customize their software environment in containers. Kubernetes provides significantly richer semantics, including explicit storage and network provisioning, that allow execution of scientific computing workflows typically not feasible on batch systems.
In this tutorial, the attendees will get an overview of the Kubernetes architecture, typical job and workflow submission procedures, learn how to use various storage options, and will learn how to run their software using Kubernetes. Theoretical information will be paired with hands-on sessions operating on the PNRP production Kubernetes cluster Nautilus.
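As a flavor of how a batch-style workload is expressed on Kubernetes, here is a minimal Job manifest written as a Python dict (the name, image, and resource numbers are placeholders). The explicit resource requests are what let Kubernetes make batch-like placement decisions.

```python
import json

# Minimal Kubernetes Job manifest as a Python dict; names, image, and
# resource figures are placeholders, not values from the tutorial.
job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "example-compute-job"},
    "spec": {
        "template": {
            "spec": {
                "containers": [{
                    "name": "worker",
                    "image": "python:3.11",
                    "command": ["python", "-c", "print('hello')"],
                    "resources": {
                        "requests": {"cpu": "2", "memory": "4Gi"},
                        "limits": {"cpu": "2", "memory": "4Gi"},
                    },
                }],
                "restartPolicy": "Never",
            }
        },
        "backoffLimit": 2,
    },
}
print(json.dumps(job, indent=2))
```

The same manifest is usually written in YAML and submitted with `kubectl apply`; expressing it programmatically is common when generating many similar jobs.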
Birds of a Feather
Scientific Software and the People Who Make It Happen: Building Communities of Practice
Description
Software has become central to all aspects of modern science and technology. Especially in high-performance computing (HPC) and computational science and engineering (CSE), it is becoming ever-larger and more complex while computer platforms evolve and become more diverse. Simultaneously, the teams behind the software are becoming larger, more technically diverse, and more geographically distributed.
This BoF provides an opportunity for people concerned about these topics to share existing experiences and activities, discuss how we can improve on them, and share the results. Presentations and discussion notes will be made available at the BoF series website, http://bit.ly/swe-cse-bof.
Workshop
Second International Workshop on RISC-V for HPC
Description
RISC-V is an open standard Instruction Set Architecture (ISA) which enables the open development of CPUs and a shared common software ecosystem. There are already approximately 10 billion RISC-V cores, a number expected to grow rapidly. Nonetheless, for all the success RISC-V has enjoyed, it has yet to become popular in HPC. Recent advances, however, such as the vectorisation standard and data-center-class RISC-V CPUs, mean that this technology is becoming a more realistic proposition for our workloads.
This workshop aims to connect those currently involved in RISC-V with the wider HPC community. We look to bring together RISC-V experts with scientific software developers, vendors, and supercomputing center operators to explore the advantages, challenges, and opportunities that RISC-V can bring to HPC. Furthermore, we aim to further expand the RISC-V HPC SIG, enabling interested attendees to participate in one of the most exciting open-source technological activities of our time.
Tutorial
Secure Coding Practices and Dependency Analysis Tools
Description
HPC increasingly involves the development and deployment of network and cloud services. Unique to the HPC field is the large amount of software that we develop to drive these services. These services must assure data integrity and availability, while providing access to a global scientific and engineering community.
Securing your network is not enough! Every service that you deploy is a window into your data center from the outside world, and a window that could be exploited by an attacker.
This tutorial is relevant to anyone wanting to learn about minimizing security flaws in the software they develop or manage. We share our experiences gained from performing vulnerability assessments of critical middleware. You will learn skills critical for software developers and analysts.
Dependency analysis tools, which find weaknesses in the software supply chain, are the first line of defense in assessing the security of a software project. These tools can catch flaws in the packages and libraries a program depends upon, flaws that affect the safety of the application. This tutorial is also relevant to anyone wanting to learn how to use these automated dependency analysis tools to minimize security flaws in the software they develop or manage.
Birds of a Feather
SIGHPC Annual Member Meeting
Data Analysis, Visualization, and Storage
Description
The annual business meeting of SIGHPC is your opportunity to hear about and discuss the status of SIGHPC and its chapters. All of the elected officers and many of the other volunteers will be present to answer your questions about SIGHPC. Representatives from our chapters will also be available. We will also be discussing upcoming plans for the year.
Birds of a Feather
Slurm Community BoF
Middleware and System Software
Description
Slurm is an open-source workload manager used on many Top500 systems. It provides a rich set of features, including topology-aware optimized resource allocation, cloud bursting, hierarchical bank accounts with fair-share job prioritization, and many resource limits. The meeting will consist of three parts: the Slurm development team will present details about the newly released version 23.11 and changes in the upcoming version 24.08, describe the Slurm roadmap, and solicit user feedback. Everyone interested in Slurm use and/or development is encouraged to attend.
Birds of a Feather
SmartNICs: Exploring the Future of In-Network Computation with the HPC Community
Architecture and Networks
Description
SmartNIC availability has rapidly increased in recent years due to wider adoption in the cloud. Leveraging these emerging devices in HPC can provide the infrastructure needed to develop new offloading capabilities that go beyond the traditional packet processing to support HPC optimizations. This BoF aims at building a community to discuss SmartNIC use-cases to accelerate applications, improve storage, enable software-defined infrastructures, address operational aspects of HPC centers and more. It also aims to serve as the state-of-the-union for SmartNICs within the HPC audience, acting as a central hub for sharing information on this emerging technology.
Birds of a Feather
Software Testing for Scientific Computing in HPC
Programming Frameworks and System Software
Description
Effective software testing plays a critical role in guaranteeing the performance, correctness, and reproducibility of applications and software. When it comes to testing high-performance computing (HPC) software and applications, unique requirements arise due to factors such as massive parallelism, concurrency and heterogeneity, the scale of target platforms, lack of oracles, and application-specific verification and validation techniques. In this BoF session, we aim to foster insightful discussions among a panel of expert speakers and the audience, focusing on methodologies and challenges in HPC software testing, and deepen our understanding in this crucial part of HPC software development.
Tutorial
Solving Optimization Problems Using Near Term Quantum Devices
Description
Optimization problems are among the most promising quantum applications that combine the use of quantum and classical processors. This tutorial aims to teach participants to solve optimization problems using two distinct quantum computing paradigms: (i) hybrid quantum-classical algorithms on gate-based systems, and (ii) neutral-atom analog Hamiltonian simulation devices. In gate-based systems, a parameterized quantum circuit is designed and used to compute the value of an objective function, which is iteratively optimized via classical optimization algorithms. Such hybrid algorithms rely on rapid iterative computations on quantum and classical processors, requiring regular sharing of data between them. The analog Hamiltonian simulation device comprises an array of two-level neutral Rydberg atoms with a ground state and an excited Rydberg state. The atoms can be arranged in any 1D or 2D geometry and are initially prepared in the ground state. The parameters of the driving Hamiltonian are then adiabatically varied, and the state of each neutral atom is measured, representing the final solution. The tutorial will provide an introduction to quantum computing and demonstrate the aforementioned solutions in hands-on sessions via free cloud access to quantum hardware.
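The hybrid loop for gate-based systems (a quantum device evaluates a parameterized objective while a classical optimizer updates the parameters) can be sketched with a simulated single-qubit circuit. Here the hardware call is replaced by its closed form, cos(theta), and the parameter-shift rule supplies gradients; this is a didactic stand-in, not the tutorial's actual notebooks.

```python
import math

def expectation(theta):
    """Simulated quantum evaluation: <Z> after RY(theta) on |0> equals
    cos(theta); on hardware this would be estimated from many shots."""
    return math.cos(theta)

def optimize(objective, theta=0.5, lr=0.4, steps=100):
    """Classical side of the loop: gradient descent where each gradient
    comes from two extra circuit evaluations (parameter-shift rule)."""
    for _ in range(steps):
        grad = (objective(theta + math.pi / 2)
                - objective(theta - math.pi / 2)) / 2
        theta -= lr * grad
    return theta

theta_opt = optimize(expectation)   # converges toward theta = pi, <Z> = -1
```

The same pattern (quantum evaluation inside a classical optimization loop, with data exchanged every iteration) underlies algorithms such as QAOA and VQE on real devices.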
Paper
Space Efficient Sequence Alignment for SRAM-Based Computing: X-Drop on the Graphcore IPU
Accelerators
Applications
Graph Algorithms and Frameworks
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
Description
Dedicated accelerator hardware has become essential for processing AI-based workloads, leading to the rise of novel accelerator architectures. Furthermore, fundamental differences in memory architecture and parallelism have made these accelerators targets for scientific computing. The sequence alignment problem is fundamental in bioinformatics; we have implemented the X-Drop algorithm, a heuristic method for pairwise alignment that reduces the search space, on the Graphcore Intelligence Processing Unit (IPU) accelerator. The X-Drop algorithm has an irregular computational pattern, which makes it difficult to accelerate due to load-balancing challenges.
Here, we introduce a graph-based partitioning and queue-based batch system to improve load balancing. Our implementation achieves 10x speedup over a state-of-the-art GPU implementation and up to 4.65x compared to CPU. In addition, we introduce a memory-restricted X-Drop algorithm that reduces memory footprint by 55x and efficiently uses the IPU's limited low-latency SRAM. This optimization further improves the strong scaling performance by 3.6x.
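The X-drop idea itself is simple to state: extend an alignment while the running score stays within X of the best score seen so far. A minimal ungapped (BLAST-style) sketch, far simpler than the banded gapped variant implemented on the IPU:

```python
def xdrop_extend(a, b, match=1, mismatch=-1, x=3):
    """Ungapped X-drop extension: walk both sequences in lockstep and
    stop once the running score falls more than x below the best."""
    best = score = best_len = 0
    for i in range(min(len(a), len(b))):
        score += match if a[i] == b[i] else mismatch
        if score > best:
            best, best_len = score, i + 1
        elif best - score > x:
            break                      # X-drop termination
    return best, best_len

best, length = xdrop_extend("ACGTACGT", "ACGTTTTT")
# best score 4 over the first 4 matching bases
```

The irregularity the paper targets comes from the gapped version, where the pruned band widens and narrows per anti-diagonal, so work per alignment pair varies unpredictably.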
Birds of a Feather
Spack Community BoF
Programming Frameworks and System Software
Description
Spack is a package manager for scientific computing, with a rapidly growing open-source community. Spack has over 1000 contributors from academia, industry, and laboratories across the world, and is used to manage software releases for the U.S. Exascale Computing Project. At this BoF, Spack developers will give updates on the community, new features, and the roadmap for future development. We will poll the audience to gather valuable information on how Spack is being used, and will open the floor for questions. All are invited to provide feedback, request features, and discuss future directions. Help us make installing HPC software simple!
Paper
Structural Coding: A Low-Cost Scheme to Protect CNNs from Large-Granularity Memory Faults
Accelerators
Artificial Intelligence/Machine Learning
Codesign
Fault Handling and Tolerance
Performance Measurement, Modeling, and Tools
Post-Moore Computing
Description
The advent of High Performance Computing has led to the adoption of Convolutional Neural Networks (CNNs) in safety-critical applications such as autonomous vehicles. However, CNNs are vulnerable to DRAM errors corrupting their parameters, thereby degrading their accuracy. Existing techniques for protecting CNNs from DRAM errors are either expensive or fail to protect from large-granularity, multi-bit errors, which occur commonly in DRAMs.
We propose a software-implemented coding scheme, Structural Coding (SC), for protecting CNNs from large-granularity memory errors. SC achieves three orders of magnitude reduction in Silent Data Corruption (SDC) rates of CNNs compared to no protection. Its average error correction coverage is also higher than that of other software techniques for protecting CNNs from memory faults. Further, its average performance, memory, and energy overheads are respectively 3%, 15.71%, and 4.38%. These overheads are much lower than those of other software protection techniques.
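A toy version of the software-coding idea is to attach checksums to blocks of weights and recheck them later; a multi-bit DRAM error then shows up as a checksum mismatch. This is a generic detection sketch, not the paper's Structural Coding construction (which also corrects errors).

```python
import numpy as np

def add_checksums(weights, block=8):
    """Attach one checksum row (column sums) per block of weight rows;
    rechecking later flags blocks corrupted by memory faults."""
    blocks = weights.reshape(-1, block, weights.shape[1])
    return blocks, blocks.sum(axis=1)

def corrupted_blocks(blocks, sums, tol=1e-6):
    """Return indices of blocks whose recomputed checksum disagrees."""
    return [i for i in range(len(blocks))
            if not np.allclose(blocks[i].sum(axis=0), sums[i], atol=tol)]

w = np.arange(32.0).reshape(16, 2)        # stand-in for a weight tensor
blocks, sums = add_checksums(w, block=8)
blocks[1, 3, 0] += 100.0                  # simulate a large-granularity error
```

Because one checksum covers a whole block, the per-weight storage overhead stays small, which is the same tradeoff that keeps SC's overheads low.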
Birds of a Feather
SuperCompCloud: Emerging Topics in Supercomputing and Cloud Interoperability
Cloud Computing
Description
The SuperCompCloud series of panels, workshops, and BoFs brings together experts and practitioners from academia, national labs, and industry to discuss technologies, use cases, and best practices, and to share a vision and direction for leveraging high-performance, extreme-scale computing and on-demand cloud ecosystems in light of increasing software complexity, narrowing on-premise infrastructure options, and cloud-only architectures. The session will continue the discussion of the latest challenges and plans, with interactive polling to engage the community at a level of interactivity distinct from the workshop series.
Panel
Superconducting Digital Computing in HPC
Codesign
Hardware Technologies
Description
Superconducting digital computing (SDC) has significant potential to preserve performance scaling for a wide range of HPC applications due to its tens to hundreds of GHz operating frequencies coupled with low dynamic energy. The current limitations of the technology such as device density, EDA tools, data movement, and cooling are active areas of research with promising directions. This, combined with studies that designed SDC accelerators for compute-intensive applications, hint that SDC may play an important role in HPC, though significant work remains to show the best integration strategy with HPC systems and on-sensor processing. In this panel, we invite experts from the superconducting community to discuss SDC’s ecosystem, how SDC may be used in practice in future systems, and the positive impact SDC can have to the performance and efficiency of key HPC applications.
Workshop
Sustainable Supercomputing
Description
Providing a sustainable path for supercomputing is a pressing topic for our community, industry, and governments. Supercomputing has an insatiable appetite for computational cycles, while we face increasing challenges in delivering performance-per-watt advances with silicon technology trends, all within the context of climate change, the drive toward net zero, and economic pressures driven by geopolitical challenges.
Improving the sustainability of supercomputing offers many opportunities when the end-to-end cycle is considered: from the design of computational circuits and systems to the power and cooling used to operate them, along with the suite of software tools used to administer, maintain, and raise the operational efficiency of HPC systems. All elements of the system must be considered, from compute nodes and interconnects to the I/O and storage components.
This workshop will gather users, researchers, hardware and software developers to address opportunities and challenges of sustainability in the supercomputing context.
Paper
SYnergy: Fine-Grained Energy-Efficient Heterogeneous Computing for Scalable Energy Saving
Cloud Computing
Distributed Computing
Energy Efficiency
Performance Measurement, Modeling, and Tools
Description
Energy-efficient computing uses power management techniques such as frequency scaling to save energy. Implementing energy-efficient techniques on large-scale computing systems is challenging. While most modern architectures, including GPUs, are capable of frequency scaling, these features are often not available on large systems.
We propose SYnergy, a novel energy-efficient approach that spans languages, compilers, runtimes, and job schedulers to achieve unprecedented fine-grained energy savings on large-scale heterogeneous clusters. SYnergy defines an extension to the SYCL programming model that allows programmers to define a specific energy goal for each kernel. Through compiler integration and a machine learning model, each kernel is statically optimized for its specific target. The methodology is inherently portable and has been evaluated on both NVIDIA and AMD GPUs. Experimental results show substantial improvements in energy and energy-related metrics on real-world applications, as well as scalable energy savings on a 64-GPU cluster.
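The underlying frequency-scaling tradeoff can be illustrated with a tiny selection routine: given per-kernel measurements of time and power at each frequency, pick the setting that minimizes energy within a performance target. The numbers are hypothetical, and SYnergy itself makes this choice statically via a learned model rather than an exhaustive profile.

```python
def pick_frequency(measurements, max_slowdown=1.10):
    """Pick the frequency minimizing energy (time * power) among settings
    that stay within a per-kernel performance target."""
    fastest = min(t for _, t, _ in measurements)
    feasible = [(f, t, p) for f, t, p in measurements
                if t <= fastest * max_slowdown]
    return min(feasible, key=lambda m: m[1] * m[2])[0]

# (frequency MHz, kernel time s, average power W): hypothetical profile
profile = [(1500, 1.00, 200.0), (1200, 1.06, 150.0), (900, 1.35, 110.0)]
chosen = pick_frequency(profile)   # modest slowdown, large energy saving
```

In this made-up profile, running at 1200 MHz saves roughly 20% energy for a 6% slowdown, while 900 MHz is rejected for violating the performance bound.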
Birds of a Feather
System Software for Quantum Accelerated HPC
Post-Moore Computing
Description
As quantum computing (QC) systems mature and make their way out of laboratories and into HPC centers as accelerators, we must also rethink the role of system software. We require stable software environments, targeted at broad, non-physics end-user communities, that are directly integrated into HPC system software as well as HPC schedulers. In this BoF, we will highlight recent developments relating to the first QC system installations in HPC centers and discuss open questions and challenges. We aim to establish an international discussion on this emerging, critical issue and to help clear the road for the next steps toward efficient quantum acceleration.
Paper
TANGO: Re-Thinking Quantization for Graph Neural Network Training on GPUs
Artificial Intelligence/Machine Learning
Description
Graph Neural Networks (GNNs) are rapidly gaining popularity since they hold state-of-the-art performance for various critical graph-related tasks. While quantization is a primary approach to accelerating GNN computation, quantized training faces remarkable challenges. We observe that current quantized GNN training systems often experience longer training times than their full-precision counterparts for two reasons: (i) addressing the accuracy challenge incurs too much overhead, and (ii) the optimization opportunity exposed by quantization is not well leveraged. This paper introduces Tango, which rethinks quantization challenges and opportunities for graph neural network training on GPUs with the following contributions: First, we introduce lightweight rules to meet the accuracy requirement for quantized GNN training. Second, we design and implement quantization-aware primitives and inter-primitive optimizations to accelerate GNN training. Third, we integrate Tango with the mainstream Deep Graph Library (DGL) system and demonstrate that Tango outperforms the state-of-the-art across all the evaluated GNN models and datasets.
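The quantization being traded off here can be sketched as a generic symmetric per-tensor int8 round trip on a feature matrix; Tango's actual rules are per-primitive and accuracy-aware, so treat this only as background.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: scale so the largest
    magnitude maps to 127, then round (generic sketch)."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.array([0.5, -1.27, 0.01, 1.27], dtype=np.float32)
q, s = quantize_int8(x)
```

The speedup opportunity comes from running the expensive aggregation and matmul primitives on the int8 tensors; the accuracy challenge is that rounding errors accumulate across layers and training steps.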
Birds of a Feather
TCHPC Career Panel
Description
This BoF will take the form of a panel consisting of representatives from industry, national labs, and academia with backgrounds in HPC. The panel will share advice on different career options in HPC and their experiences in their respective career trajectories. The primary audience for this event is current, preferably ABD, graduate students. The format will include a brief introduction by each speaker, followed by a moderated discussion based on a set of previously submitted questions, ending with further questions from the audience.
Workshop
Tenth SC Workshop on Best Practices for HPC Training and Education
Description
The inherent wide distribution, heterogeneity, and dynamism of the current and emerging high-performance computing and software environments increasingly challenge cyberinfrastructure facilitators, trainers, and educators. The challenge is how to support and train the current diverse users and prepare the future educators, researchers, developers, and policymakers to keep pace with the rapidly evolving HPC environments to advance discovery and economic competitiveness for many generations.
The tenth annual full-day workshop on HPC training and education is an ACM SIGHPC Education Chapter-coordinated effort aimed at fostering collaboration among practitioners from traditional and emerging fields to explore educational needs in HPC, to develop and deploy HPC training, and to identify new challenges and opportunities for the latest HPC platforms. The workshop is also a platform for disseminating results and lessons learned in these areas, which will be captured in a Special Edition of the Journal of Computational Science Education.
Workshop
Tenth Workshop on Accelerator Programming Using Directives (WACCPD 2023)
Description
Heterogeneous node architectures are becoming omnipresent in today’s HPC systems. Extracting maximum compute capability from such systems, while also maintaining code portability and maintainability, necessitates accelerator programming approaches such as OpenMP offloading, OpenACC, standard C++/Fortran parallelism, SYCL, DPC++, Kokkos, and RAJA. However, the use of these programming approaches remains a research activity, and there are many possible trade-offs between performance, portability, maintainability, and ease of use that must be considered for optimal use of accelerator-based HPC systems.
Toward this end, the workshop will highlight improvements over the state of the art through its accepted papers and talks. In addition, the event will foster discussion through a keynote/panel that draws the community’s attention to key areas that will facilitate the transition to accelerator-based HPC. The workshop aims to showcase innovative high-level language features, lessons learned while using directives/abstractions to migrate scientific legacy code, and experiences using novel accelerator architectures, among other topics.
Workshop
The 18th Workshop on Workflows in Support of Large-Scale Science (WORKS23)
Description
Scientific workflows have underpinned some of the most significant discoveries of the past several decades. Workflow management systems provide abstraction and automation that enable a broad range of researchers to easily define sophisticated computational processes and then execute them efficiently on parallel and distributed computing systems. As workflows become more complex, they require more sophisticated workflow management capabilities.
This workshop focuses on the many facets of scientific workflow management systems, ranging from actual execution to service management and the coordination and optimization of data, service, and job dependencies. The workshop covers a broad range of issues in the scientific workflow lifecycle that include: scientific workflows representation; workflow scheduling techniques to optimize the execution on heterogeneous infrastructures; provisioning workflows on infrastructures; workflow engines that deal with failures in the application and infrastructure; and computer science problems related to scientific workflows such as semantic technologies, compiler methods, fault tolerance, etc.
Workshop
The 1st International Workshop on the Environmental Sustainability of High-Performance Software
Description
Sustainability has recently been identified as a crucial goal for many human activities. The term encompasses not only the commitment to reduce the environmental footprint but also economic and societal goals of equality and dignity.
Specific research areas of High Performance Computing (HPC) have so far focused on hardware and infrastructural strategies to reduce the environmental footprint of HPC systems, but further efforts are needed to study the sustainability of high-performance algorithms. It is crucial to develop methods and tools to assess the sustainability of parallel software in practice, define scheduling policies intrinsically oriented toward sustainability, and assess the footprint of parallel programming paradigms.
The workshop’s objective is to foster the exchange of ideas and, possibly, build a community of researchers working on sustainable HPC algorithms, platforms, schedulers, programming paradigms, and theoretical models.
Workshop
The 6th Annual Parallel Applications Workshop, Alternatives to MPI+X (PAW-ATM)
Description
Supercomputers get faster and more complex every year. MPI, long the dominant model for distributed computation, has adapted by combining with models for intra-node parallelism (e.g. OpenMP, CUDA). These MPI+X hybrids offer performance but demand significant programmer effort to write, debug and tune applications.
Alternatives to MPI+X are worth exploring as programmer productivity becomes a major component of the time to science. Alternatives include parallel programming languages (e.g., Chapel, Regent, Fortran 2018), general-purpose libraries (e.g., Charm++, COMPSs, HPX, Legion, UPC++), and domain-specific libraries (e.g., Arkouda, Dask, Spark). With many options to choose from, it is hard for programmers to know which alternative models are appropriate for their application, and for programming model developers to understand the opportunities for improvement.
Through discussion of specific applications, PAW-ATM brings together application experts and programming model developers to improve applications and models.
Workshop
The 9th International Workshop on Data Analysis and Reduction for Big Scientific Data (DRBSD-9)
Description
In this new exascale computing era, applications must increasingly perform online data analysis and reduction—tasks that introduce algorithmic, implementation, and programming model challenges that are unfamiliar to many scientists and that have major implications for the design and use of various elements of exascale systems. There are at least three important topics that our community is striving to address: (1) whether several orders of magnitude of data reduction are possible for exascale sciences; (2) understanding the performance and accuracy trade-off of data reduction; and (3) solutions to effectively reduce data while preserving the information hidden in large scientific data. Tackling these challenges requires expertise from computer science, mathematics, and application domains to study the problem holistically, and to develop solutions and hardened software tools that can be used by production applications. DRBSD-9 is a great venue to publish and share the latest research findings and achievements in this critical research topic.
Workshop
The First Workshop on Democratizing High-Performance Computing (D-HPC)
Description
High-performance computing (HPC) is at an inflection point in which the near-end of Moore’s Law, the big-data explosion from AI workflows and experiments, increasing operational costs, and heterogeneous architectures have led to significant technical and economic barriers.
The “Democratizing HPC” (D-HPC) workshop will bring together interdisciplinary communities of developers, facility staff and users, vendors, researchers, educators, etc., to define, understand, and quantify the accessibility of HPC technologies and ecosystems that characterize the path from idea to scientific discovery. Past success stories are the message passing interface (MPI) and graphics processing units (GPUs). We understand the democratization of HPC in its broadest meaning, pursuing success stories about enabling and/or improving accessibility in data- and compute-intensive applications across all domains from simulations, data management and analysis, to more recent fields, such as AI, for a wide variety of computing targets: manycore, quantum, neuromorphic, field-programmable gate arrays (FPGAs), chiplets, etc.
Birds of a Feather
The Future of Benchmarks in Supercomputing
Performance Measurement, Modeling, and Tools
Description
As supercomputing welcomes new workflows of simulations, data science, and artificial intelligence in the exascale era, the goal of this session is to pose, debate, and address the question: "How should the SC community evolve performance benchmarks?" The session will be organized as presentations and panel discussions with audience participation, inviting active members of the Top500, HPCG, MLPerf, TeraSort, etc., and key personnel from industry, academia, and government to discuss the value, need, and desire for evolving a benchmark suite that is inclusive of and accommodates emerging applications, to guide future supercomputing system design and architecture.
Birds of a Feather
The Future of NSF Supported Advanced Cyberinfrastructure
Description
The National Science Foundation's vision and investment plans for cyberinfrastructure (CI) are designed to address the evolving needs of the science and engineering research community. Senior leadership and program staff from NSF’s Office of Advanced Cyberinfrastructure (OAC) will discuss OAC's vision, strategic and national priorities, as well as the latest funding opportunities across all aspects of the research cyberinfrastructure ecosystem. Substantial time will be devoted to Q&A between attendees and NSF staff.
Panel
The Golden Age of Compilers: Analyzing Cross-Cutting Issues and Opportunities across HPC and AI Domains
Artificial Intelligence/Machine Learning
Compilers
Performance Optimization
Description
This panel discussion aims at identifying cross-cutting issues, opportunities, similarities, and discrepancies between HPC and AI workloads and systems, as well as defining the role of compilers in the development of HPC applications and AI models. While there is a clear overlap in problems being solved in the HPC and AI communities, solutions are often siloed to one, with software fragmentation and increased maintenance cost. It has become critical to identify current gaps and potential solutions in current compiler frameworks and to develop an interoperable environment to help researchers move to the next stage of scientific discoveries, such as moving from classification models to machine reasoning. This panel brings together the experience of distinguished researchers from industry, academia, U.S. national laboratories, and the U.S. Department of Energy, to share their vision, identify current gaps and research opportunities, and define a future research agenda.
Paper
The Graph Database Interface: Scaling Online Transactional and Analytical Graph Workloads to Hundreds of Thousands of Cores
Cloud Computing
Data Analysis, Visualization, and Storage
Graph Algorithms and Frameworks
Best Paper Finalist
Description
Graph databases (GDBs) are crucial in academic and industry applications. The key challenges in developing GDBs are achieving high performance, scalability, programmability, and portability. To tackle these challenges, we harness established practices from the HPC landscape to build a system that outperforms all past GDBs presented in the literature by orders of magnitude, for both OLTP and OLAP workloads. For this, we first identify and crystallize performance-critical building blocks in the GDB design, and abstract them into a portable and programmable API specification, called the Graph Database Interface (GDI), inspired by the best practices of MPI. We then use GDI to design a GDB for distributed-memory RDMA architectures. Our implementation harnesses one-sided RDMA communication and collective operations, and it offers architecture-independent theoretical performance guarantees. The resulting design achieves extreme scales of more than a hundred thousand cores. Our work will facilitate the development of next-generation extreme-scale graph databases.
Birds of a Feather
The Green500: Trends in Energy-Efficient Supercomputing
Description
With power being a first-order design constraint on par with performance, it is important to measure and analyze energy-efficiency trends in supercomputing. To raise the awareness of greenness as a first-order design constraint, the Green500 seeks to characterize the energy-efficiency of supercomputers for different metrics, workloads, and methodologies. This BoF discusses trends across the Green500 and highlights from the current Green500 list. In addition, the Green500, Top500, and Energy Efficient HPC Working Group have been working together on improving power-measurement methodology, and this BoF presents recommendations for changes to sampling rates that will improve ease of submission without compromising accuracy.
Panel
The Impact of Exascale and the Exascale Computing Project on Industry
Artificial Intelligence/Machine Learning
Applications
Exascale
Description
Exascale computing promises broad advances in simulation, data analytics, and machine learning. The US Department of Energy (DOE) is funding the Exascale Computing Project (ECP) to develop the applications, software, and integration needed to harness the immense computing power of exascale machines. As part of this effort, the ECP established the Industry and Agency Council (IAC), made up of executives from US industry, US government agencies, and US independent software vendors (ISVs). As the ECP winds down, this panel is a chance for IAC members to reflect on how the ECP and the move to exascale computing are impacting industry’s current and planned use of HPC in saving energy, boosting competitiveness, and building global technology leadership. Moderated by Fran Hill (Chief Scientist for DoD’s HPC Modernization Program), the panel will be a lively and informative discussion of how exascale and the ECP are impacting businesses both large and small.
Birds of a Feather
The International Post-Exascale (InPEx) Project
Description
Efficient use of exascale systems for large-scale applications implies developing, in a combined manner, the applications, the full software stack, and the machine. As the BoF organizers did in the context of the IESP and BDEC workshops (exascale.org), we plan to launch a new series of workshops that will gather stakeholders in Europe (EuroHPC, the French NumPEX project, BSC, JSC), the USA (DOE, NSF partners), and Japan (FugakuNEXT, Riken-CC), along with large-scale application communities, to target the co-design of software and hardware components of future exascale systems and to prepare the major scientific and industrial application domains to fully exploit the capabilities of these systems.
Tutorial
The OpenMP Common Core: A “Hands-On” Introduction
Description
OpenMP is the de facto standard for writing parallel applications for shared-memory computers. Born roughly 25 years ago, in 1997, it runs on just about every shared-memory platform on the market. It’s also very complicated. We created OpenMP to be the “simple API” for application programmers, but with a specification running to over 600 pages, OpenMP has grown into an intimidating API viewed by many as for “experts only”.
Most OpenMP programmers, however, use around 21 items from the specification. We call these 21 items the “OpenMP Common Core”. By focusing on the common core, we make OpenMP what it was always meant to be: a simple API for parallel application programmers.
In this hands-on tutorial, we explore the OpenMP Common Core. We utilize active learning through a carefully selected set of exercises, so students will master the Common Core and learn to apply it to their own problems. Students will use their own laptops (running Windows, Linux, or macOS) to access remote systems that support OpenMP (a remote SMP server). Alternatively, students can install an OpenMP compiler on their laptops before the tutorial. Information about OpenMP compilers is available at www.openmp.org.
Workshop
The Second Workshop on Federated and Privacy Preserving AI for HPC
Description
Federated learning (FL) is a distributed machine learning method that allows a network of participants to collaboratively train a shared model without exchanging their data. Since this is a new and fast-evolving scientific domain, myriad outstanding algorithmic, technical, administrative, and policy issues surround the deployment and use of federated learning across different compute environments. The goal of the First Workshop on Federated and Privacy Preserving AI for HPC (FPPAI4HPC) was to start building the community by discussing different aspects of federated and privacy-preserving AI, as well as the potential benefits and challenges of developing and deploying FL and other privacy-preserving methods for AI applications on HPC platforms. The goal of the second FPPAI4HPC workshop is to continue building the community by taking a more focused approach to early experiences and best practices in developing and deploying FPPAI frameworks on different computing platforms.
Workshop
Third International Symposium on Quantitative Codesign of Supercomputers
Description
This symposium aims to combine two methodologies—collaborative codesign and data-driven analysis—to realize the potential of supercomputing more fully. We refer to design solutions that rely on intelligence from data-driven insights across applications, systems, system software, workflows, and facilities as Quantitative Codesign of Supercomputers (QCSC). We seek to bring together the community to overcome challenges in extracting meaning from data across such wide-ranging sources. For SC23, our focus will be on opportunities and challenges in QCSC arising from the explosion of new architectural directions in and new paradigms for HPC. Experts will interact with the community on how directions in AI/ML, cloud, and HPC will change the computing landscape, and how we can still get comparative and meaningful quantitative insight across the expanding space of use cases, programming paradigms, and architectures.
Birds of a Feather
TOP500 Supercomputers
Description
The TOP500 list of supercomputers serves as a “Who’s Who” in the field of High Performance Computing (HPC). It started as a list of the most powerful supercomputers in the world and has evolved into a major source of information about trends in HPC. The 62nd TOP500 list will be published in November 2023, just in time for SC23.
This BoF will present detailed analyses of the TOP500 and discuss the changes in the HPC marketplace during the past years. The BoF is meant as an open forum for discussion and feedback between the TOP500 authors and the user community.
Birds of a Feather
Toward a National Artificial Intelligence (AI) Research Resource for Strengthening and Democratizing AI R&D
Artificial Intelligence/Machine Learning
Description
AI is driving scientific discovery and economic growth. While AI R&D is advancing rapidly, access to the computational and data resources that drive the frontiers of AI remains limited. This BoF will explore how democratizing access to national-level cyberinfrastructure (CI) for AI R&D can help strengthen the AI research and innovation ecosystem. Specifically, this BoF will catalyze a discussion about the nature and composition of such CI, how it can be realized nationally and connected internationally, how to measure both successes and failures, and what are necessary guardrails to ensure responsible AI.
ACM Gordon Bell Finalist
Awards
Toward Exascale Computation for Turbomachinery Flows
Description
A state-of-the-art large eddy simulation code has been developed to solve compressible flows in turbomachinery. The code has been engineered with a high degree of scalability, enabling it to effectively leverage the many-core architecture of the new Sunway system. A consistent performance of 115.8 DP-PFLOPs has been achieved on a high-pressure turbine cascade consisting of over 1.69 billion mesh elements and 865 billion degrees of freedom (DOFs). By leveraging a high-order unstructured solver and its portability to large heterogeneous parallel systems, we have progressed toward solving the grand-challenge problem outlined by NASA: a time-dependent simulation of a complete engine, incorporating all of its aerodynamic and heat-transfer components.
Paper
Toward Sustainable HPC: Carbon Footprint Estimation and Environmental Implications of HPC Systems
Cloud Computing
Distributed Computing
Energy Efficiency
Green Computing
Programming Frameworks and System Software
State of the Practice
Sustainability
Description
The rapid growth in demand for HPC systems has led to a rise in carbon footprint, which requires urgent intervention. In this work, we present a comprehensive analysis of the carbon footprint of high-performance computing (HPC) systems, considering the carbon footprint during both the hardware manufacturing and system operational stages. Our work employs HPC hardware component carbon footprint modeling, regional carbon intensity analysis, and experimental characterization of the system life cycle to highlight the importance of quantifying the carbon footprint of HPC systems.
Paper
TrivialSpy: Identifying Software Triviality via Fine-Grained and Dataflow-Based Value Profiling
Compilers
Performance Measurement, Modeling, and Tools
Performance Optimization
Programming Frameworks and System Software
Description
Trivial operations cause software inefficiencies, wasting functional units and memory bandwidth on useless instructions. Although previous works have identified a significant amount of trivial operations in widely used programs, the proposed solutions only provide useful observations rather than actionable guidance for eliminating trivial operations to improve performance. In this paper, we propose TrivialSpy, a fine-grained and dataflow-based value profiler that effectively identifies software triviality and estimates its optimization potential. With the help of dataflow analysis, TrivialSpy can detect software trivialities of heavy operations, trivial chains, and redundant backward slices. In addition, TrivialSpy can identify trivial breakpoints that combine multiple trivial conditions for more optimization opportunities. The evaluation results demonstrate that TrivialSpy is capable of identifying software triviality in highly optimized programs. Based on the optimization guidance provided by TrivialSpy, we achieve a maximum performance speedup of 52.09% after eliminating trivial operations.
Birds of a Feather
Two Worlds Collide: Forging Sustainable Coupled HPC Simulation/Deep Learning Applications from Hardware to Algorithm
Artificial Intelligence/Machine Learning
Description
This Birds of a Feather session, “Two Worlds Collide: Forging Sustainable Coupled HPC Simulation/Deep Learning Applications from Hardware to Algorithm,” continues a series started in 2021 with a theme of discussing and brainstorming solutions for a new paradigm in HPC: the coupling of simulation with machine learning for state-of-the-art research. In this installment, we focus on sustainability and assurance for coupled simulation and deep learning. We discuss the current state and needs for enabling integration of HPC simulation with modern deep learning stacks to provide transformative scientific discoveries while delivering productivity, portability, and correctness for safety and mission critical applications.
Paper
Understanding the Effects of Permanent Faults in GPU’s Parallelism Management and Control Units
Accelerators
Architecture and Networks
Data Analysis, Visualization, and Storage
Fault Handling and Tolerance
Best Student Paper Finalist
Description
Modern Graphics Processing Units (GPUs) are expected to remain in service for many years, exposing the hardware to aging (i.e., permanent faults arising after the end-of-manufacturing test). Hence, techniques to assess the impact of permanent faults in GPUs are strongly required, especially in safety-critical domains.
This paper presents a method to evaluate permanent faults in the GPU's scheduler and control units, together with the first figures quantifying these effects. We inject 5.83x10^5 permanent faults in the gate-level units of a GPU model. Then, we map the observed error categories onto software errors by instrumenting 13 applications and two convolutional neural networks, injecting more than 1.65x10^5 permanent errors (1,000 errors per application) and reducing evaluation times from several years to hundreds of hours. Our results highlight that faults in GPU parallelism-management units impact software execution parameters. Moreover, errors in resource management or instruction codes hang the code, while 45% of errors induce silent data corruption.
Panel
Understanding the Performance, Reproducibility, Validation, Portability, and Sustainability of Coupled HPC Simulation and Deep Learning Calculations
Artificial Intelligence/Machine Learning
Applications
Reproducibility
Description
Recent advances in deep learning (DL) for scientific computing have paved the way for a new type of integrated programming environment. This environment must support the seamless integration of simulation applications with deep learning frameworks using methods such as in-memory coupling and inference serving. Especially for HPC, this environment brings a slew of challenges, forcing developers to revisit decades of solved problems in scientific computing: kernel optimization, verification/validation strategies, building/porting practices. Interfacing HPC simulation codes with DL frameworks from industry—whose philosophies and strategies may differ from those within HPC—brings critical questions about how these two communities can work together to develop sustainable, integrated programming environments that are trustworthy, vetted, and portable, and where HPC communities can express requirements for scientific software and can track ownership. Discussions are needed about how to overcome these challenges: here, panelists from academia, national laboratories and industry will start a conversation, sharing perspectives and experiences.
Paper
Unified Communication Optimization Strategies for Sparse Triangular Solver on CPU and GPU Clusters
Accelerators
Algorithms
Linear Algebra
Best Paper Finalist
Description
This paper presents a unified framework for reducing communication costs of sparse triangular solvers (SpTRSV) on CPU and GPU clusters. The proposed framework builds upon a 3D communication-avoiding process layout that distributes a sparse triangular matrix into a 3D layout consisting of 2D grids. This work significantly reduces inter-process communication by replicating computation and using sparse allreduce operations across the 2D grids. This also allows for integration of a number of communication-optimized 2D SpTRSV algorithms, including binary communication tree-based CPU algorithms and one-sided GPU communication (e.g., NVSHMEM)-based algorithms. With all these communication reduction schemes, the resulting SpTRSV exhibits significantly better scalability than existing works on leadership CPU and GPU clusters such as Cori, Perlmutter, and Crusher.
Birds of a Feather
Unified Communication X (UCX) Community
Architecture and Networks
Description
In order to exploit the capabilities of new HPC systems and to meet their demands for scalability, communication software needs to scale to millions of cores and support applications with adequate functionality. UCX is a collaboration among industry, national labs, and academia that provides a unified open-source communication framework.
The UCX project is managed by the UCF consortium (http://www.ucfconsortium.org/) and includes members from LANL, ANL, Ohio State University, AMD, ARM, IBM, NVIDIA, and more. The session will serve as the UCX community meeting and will introduce the latest developments to HPC developers and the broader user community.
Paper
Unity ECC: Unified Memory Protection Against Bit and Chip Errors
Accelerators
Architecture and Networks
Data Analysis, Visualization, and Storage
Fault Handling and Tolerance
Best Student Paper Finalist
Description
DRAM vendors utilize On-Die Error Correction Codes (OD-ECC) to correct random bit errors internally. Meanwhile, system companies utilize Rank-Level ECC (RL-ECC) to protect data against chip errors. Separate protection increases the redundancy ratio to 32.8% in DDR5 and incurs significant performance penalties. This paper proposes a novel RL-ECC, Unity ECC, that can correct both single-chip and double-bit error patterns. Unity ECC corrects double-bit errors using unused syndromes of single-chip correction. Our evaluation shows that Unity ECC without OD-ECC can provide the same reliability level as Chipkill RL-ECC with OD-ECC. Moreover, it can significantly improve system performance and reduce DRAM energy and area by eliminating OD-ECC.
Panel
Unleashing the Power within Data Democratization: Needs, Challenges, and Opportunities
Applications
Reproducibility
Description
The scientific community needs a data fabric that integrates data delivery and access to shared storage, networking, computing, and educational resources. Such a data fabric can potentially democratize data-driven scientific discovery across the growing data science community.
In this panel, we will discuss the needs, challenges, and opportunities of the data science community leveraging the existing cyberinfrastructures and software tools while strategizing on what is missing to connect an open network of institutions, including resource-disadvantaged institutions.
Tutorial
Unlocking the Potential of HPC in the Google Cloud with Open-Source Tools
Description
Cloud computing technologies have seen tremendous growth in recent years, with many organizations moving their HPC workloads to the cloud due to its flexibility in the organization and provisioning of HPC infrastructure. While such a diverse and flexible set of options brings additional degrees of freedom, it also brings a daunting set of hardware and software choices. Furthermore, the lines between traditional system administration and application deployment can be blurred.
In this tutorial, we will provide a foundation to understand how to run HPC workloads in the cloud effectively and with minimal complexity. We start with a primer on cloud foundations and how they map to common HPC concepts, and then dive deeper into core HPC cloud components. We then introduce important HPC partners, discuss industry-specific solutions and present blueprints describing infrastructure, scheduler and applications.
Finally, we present the best practices to run HPC in the cloud and how to explore your options for the best configuration for price/performance.
This tutorial will use a combination of lectures and hands-on labs using Google Cloud, the open-source Google Cloud HPC Toolkit, Slurm, Spack, and other popular open-source HPC software to provide a balance of both theoretical and hands-on learning.
Birds of a Feather
Updates from the HPC Certification Forum
Description
Creating and providing HPC training for practitioners with diverse backgrounds is challenging, and requires a multitude of educational resources covering different skills. However, the sheer volume does not guarantee discoverability or quality of the content. The main goal of the International HPC Certification program is to ease the provision and uptake of training by clearly categorizing, defining and eventually assessing the skills required to efficiently use HPC resources. The session aims to present the current status, discuss the developed processes, tools, and skills, and to ensure community involvement. Anyone interested in HPC education is invited to participate in the discussion.
Tutorial
Using Containers to Accelerate HPC
Description
Within just the past few years, the use of containers has revolutionized the way in which industries and enterprises have developed and deployed computational software and distributed systems. The containerization model has gained traction within the HPC community as well, with the promise of improved reliability, reproducibility, portability, and levels of customization that were not previously possible on supercomputers. This adoption has been enabled by a number of HPC container runtimes that have emerged, including Singularity, Shifter, Enroot, Charliecloud, and others.
This hands-on tutorial looks to train users on the usability of containers on HPC resources. We will provide a detailed background on Linux containers, along with introductory hands-on experience building a container image, sharing the container, and running it on an HPC cluster. Furthermore, the tutorial will provide more advanced information on how to run MPI-based and GPU-enabled HPC applications, how to optimize I/O-intensive workflows, and how to set up GUI-enabled interactive sessions. Cutting-edge examples will include machine learning and bioinformatics. Users will leave the tutorial with a solid foundational understanding of how to utilize containers on HPC resources using Podman, Shifter, and Singularity, and in-depth knowledge to deploy custom containers on their own resources.
Paper
VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores
Artificial Intelligence/Machine Learning
Codesign
Performance Optimization
Programming Frameworks and System Software
Description
The increasing success and scale of deep learning models demand higher computational efficiency and power. Sparsification can yield both smaller models and higher compute efficiency, and accelerated hardware is becoming available. However, exploiting it efficiently requires kernel implementations, pruning algorithms, and storage formats that utilize hardware support for specialized sparse vector units. One example is NVIDIA's Sparse Tensor Cores (SPTCs), which promise a 2x speedup. However, SPTCs support only the 2:4 format, limiting achievable sparsity ratios to 50%. We present the V:N:M format, which enables the execution of arbitrary N:M ratios on SPTCs. To exploit the resulting format efficiently, we propose Spatha, a high-performance sparse library for DL routines. We show that Spatha achieves up to 37x speedup over cuBLAS. We also demonstrate a second-order pruning technique that enables sparsification to high sparsity ratios with V:N:M with little to no loss of accuracy in modern transformers.
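The 2:4 pattern the abstract mentions keeps the two largest-magnitude weights in every group of four. A minimal NumPy sketch of generic N:M magnitude pruning, to illustrate the constraint (the function `prune_n_m` is illustrative only, not code from the paper or from Spatha):

```python
import numpy as np

def prune_n_m(weights, n=2, m=4):
    """Zero all but the n largest-magnitude entries in every group of m.

    The 2:4 pattern (n=2, m=4) yields the 50% structured sparsity that
    NVIDIA's Sparse Tensor Cores accelerate natively; other N:M ratios
    are what formats like V:N:M aim to make executable on that hardware.
    """
    flat = weights.reshape(-1, m)                     # groups of m weights
    # indices of the (m - n) smallest-magnitude entries in each group
    drop = np.argsort(np.abs(flat), axis=1)[:, : m - n]
    pruned = flat.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)      # zero them in place
    return pruned.reshape(weights.shape)

w = np.arange(1.0, 9.0).reshape(2, 4)   # [[1,2,3,4],[5,6,7,8]]
print(prune_n_m(w))                      # keeps 3,4 and 7,8 per group
```

Each row of four retains exactly two nonzeros, which is the structural invariant the sparse hardware exploits.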
Birds of a Feather
Welcome to C++ 23, the “Pandemic” Edition and C++ NEXT in 2026
Programming Frameworks and System Software
Description
Welcome to C++23, the “pandemic” edition. C++ was named Programming Language of the Year for 2022 by the TIOBE index of language popularity, and C/C++ is used by 79.4% of parallel programming applications, according to Hyperion Research's 2021 HPC briefing at ISC 2021. We will review the final content of C++23 and its implementation status across compilers, and look ahead at what is coming in C++26. This BoF will bring together important leaders within the ISO C++ standard committee who are co-authors of key C++23 features such as ML, executors, mdspan, the standard library, and concurrency.
Workshop
WHPC@SC23: 16th International Women in HPC Workshop
Description
The 16th International Women in HPC workshop will be held at SC23 in Denver, CO, USA, with the goal of fostering a diverse and inclusive HPC community. The WHPC workshop series has become the leading SC event focused on DEI topics. We aim to cultivate skills for valuing a diverse workforce and creating a welcoming environment for all. New this year, we will place increased emphasis on the diversity and inclusion of both women and men from underrepresented groups.
At WHPC@SC23, we will focus on the following topics:
- Improving diversity and inclusion for all in the HPC workforce
- Building a deeper understanding of what diversity, equity, and inclusion means for different groups
- Strategies for recruitment, retention, and success
- Building community through real-time networking
- Learning from, and valuing, different experiences and career paths
We will also include short lightning talks by early career researchers from underrepresented groups.
Birds of a Feather
With Great Power Comes Great Responsibility: Ethics in HPC
Description
HPC changes the world and society around us on a daily basis. Ensuring that HPC resources are both used ethically and made available ethically is of utmost importance for a more equitable world. Since our first BoF in 2019, we have met annually (save for 2020's COVID limitation) and expanded to ISC 2023 as well, fostering lively discussion with the community about what our ethical standards should be. This BoF will continue that tradition while incorporating, for the first time, efforts to establish specific ethical principles driving toward a formal community declaration.
Birds of a Feather
Workflows Community: Modern Workflows for Continuum and Cross-Facility Computing
Cloud Computing
Distributed Computing
Description
Discoveries in science increasingly rely on workflows to coordinate complex experiments, ranging from cloud-based data preprocessing to multi-facility computational workflows. Continuum and cross-facility workflows have gained prominence, providing continuous computing access and spanning multiple sites. This BoF session, organized by the Workflows Community Initiative, will address challenges, opportunities, and future directions for continuum and cross-facility workflows. Participants will share domain-specific insights, covering topics such as facility coordination, metadata tracking, and standardization. The BoF will produce tangible outputs, including lightning talks and a community roadmap, fostering networking and international collaborations.
Workshop
Workshop on Artificial Intelligence and Machine Learning for Scientific Applications (AI4S)
Description
Artificial intelligence (AI) is a game-changing technology that has shown tremendous advantages and improvements in algorithms, implementation, and applications. We have seen many success stories of applying AI to scientific applications. However, a number of problems remain to be studied to enhance the usability of AI in scientific applications. Addressing these problems will bridge the gap between AI and scientific applications and enable wider employment of AI in HPC. The purpose of this workshop is to bring together computer scientists and domain scientists from academia, government, and industry to share recent advances in the use of AI in various scientific applications, introduce new scientific application problems to the broader community, and stimulate tools and infrastructures to support the application of AI in scientific applications. The workshop will be organized as plenary talks based on peer-reviewed paper submissions, accompanied by keynotes from distinguished researchers and a panel discussion.
Workshop
Workshop on Machine Learning with Graphs in High Performance Computing Environments
Description
The intent of this workshop is to bring together researchers, practitioners, and scientific communities to discuss methods that utilize extreme scale systems for learning graph data. This workshop will focus on the greatest challenges in utilizing High Performance Computing (HPC) for machine learning with graphs and methods for exploiting extreme scale parallelism for data, computation, and model optimization.
We invite researchers and practitioners to participate in this workshop to discuss the challenges in using HPC for machine learning with graphs and to share the wide range of applications that would benefit from HPC powered machine learning with graphs.
Workshop
Workshop on Memory Technologies, Systems, and Applications
Description
The growing disparity between compute and memory speed, known as the memory wall problem, has been one of the most critical and long-standing challenges in the computing industry. The prevalence of heterogeneous computing, the ongoing expansion of the memory hierarchy, and the advent of disaggregated architectures have considerably expanded the scope of this problem. Computer architecture, operating systems, storage systems, performance models, tools, and applications themselves are being enhanced or even redesigned to address the performance, programmability, and energy efficiency challenges of the increasingly complex and heterogeneous memory systems. Exploring the intersection of these research areas will enable cohesive and synergistic development and collaboration on the future of memory technologies, systems, and applications. MTSA’23: Workshop on Memory Technologies, Systems, and Applications aims to bring together researchers from industry, government labs, and academia concerned with the challenges of efficiently using existing and emerging memory systems.
Workshop
Workshop on Software and Hardware Co-Design of Deep Learning Systems on Accelerators (SHDA)
Description
Advanced accelerators designed to speed up deep learning systems are emerging, while scientific domains place customized demands on the system performance of deep learning applications. Fully utilizing the new features of state-of-the-art hardware accelerators while meeting the diverse demands of scientific domains poses major software and hardware co-design challenges for deep learning systems. This requires new research and software tools that can deliver higher performance and resource efficiency in deep learning systems by taking advantage of the new architectures and hardware available on next-generation accelerators. This international workshop focuses on a promising research field, software and hardware co-design of deep learning systems on accelerators, and provides a platform for researchers to show preliminary results, inspire ideas, explore novel directions, promote collaborations, and enlarge the community.
Paper
Xfast: Extreme File Attribute Stat Acceleration for Lustre
Data Analysis, Visualization, and Storage
I/O and File Systems
State of the Practice
Description
Directory tree walks on parallel file systems are costly operations frequently required by many storage management tasks. Even listing the contents of a single directory can take minutes to hours for huge directories, as the tree walk performance of parallel file systems in Linux is severely throttled by sequentially accessing distributed metadata for each file through the syscall interface.
We present extreme file attribute stat (Xfast), which scales the performance of directory tree walks by combining techniques developed over a time frame of 10 years for the Lustre file system. Scalable statahead predicts file access patterns and prefetches the required attributes, while the Size on MDT (SOM) mechanism reduces the number of RPC calls needed to collect file attributes. Xfast improves the performance of common directory operations, e.g., reducing the time to list one million files from 11 minutes to less than one minute for a single process.
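The bottleneck the paper targets, one synchronous stat() per directory entry, can be seen in miniature with Python's standard os APIs: os.scandir fetches attributes together with the directory listing where the OS allows, avoiding a per-file round trip, loosely analogous to how statahead prefetches attributes ahead of the consumer. This is a generic illustration of the access pattern, not the Lustre mechanism itself:

```python
import os
import tempfile

def total_size_slow(root):
    """One os.stat() syscall per file: the pattern that throttles tree walks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.stat(os.path.join(dirpath, name)).st_size
    return total

def total_size_fast(root):
    """os.scandir() yields DirEntry objects whose cached attributes come
    with the directory listing, so entry.stat() often needs no extra call."""
    total = 0
    stack = [root]
    while stack:
        with os.scandir(stack.pop()) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                else:
                    total += entry.stat(follow_symlinks=False).st_size
    return total

# tiny demo tree: three 10-byte files
with tempfile.TemporaryDirectory() as root:
    for i in range(3):
        with open(os.path.join(root, f"f{i}.bin"), "wb") as f:
            f.write(b"x" * 10)
    assert total_size_slow(root) == total_size_fast(root) == 30
```

On a local file system the difference is modest; on a parallel file system, where each stat is a network RPC, eliminating per-file metadata round trips is exactly what makes million-file listings tractable.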
Workshop
XLOOP 2023: The 5th Annual Workshop on Extreme-Scale Experiment-in-the-Loop Computing
Description
Advancement in computational power and high-speed networking is enabling a new model of scientific experiment: experiment-in-the-loop computing (EILC). In this model, simulation and/or learning modules are run as data is collected from observational and experimental sources. The amount and complexity of data generated by simulations and by observational and experimental sources, such as sensor networks and large-scale scientific facilities, continue to increase. Several research challenges exist, many of which are independent of the scientific application domain. New algorithms, including artificial intelligence and machine learning algorithms, must be developed to merge simulation ensembles and experimental data sets. Data transfer techniques and workflows must be constructed to control the ensembles and integrate simulated and observed data sets. The Workshop on Extreme-Scale Experiment-in-the-Loop Computing (XLOOP 2023) will be a unique opportunity to promote this interdisciplinary topic area. We invite papers, presentations, and participants from the physical and computer sciences.
Sessions
ACM Gordon Bell Finalist
Awards
ACM Gordon Bell Finalists Presentations 1
TP
ACM Gordon Bell Finalist
Awards
ACM Gordon Bell Finalists Presentations 2
TP
Paper
Algorithms on GPUs
Accelerators
Algorithms
Graph Algorithms and Frameworks
Paper
Applications in Materials Science and Biology
Applications
Modeling and Simulation
Paper
Applications of Machine Learning
Artificial Intelligence/Machine Learning
Applications
Modeling and Simulation
State of the Practice
Paper
Architecture-Specific Optimization
Artificial Intelligence/Machine Learning
Codesign
Performance Optimization
Programming Frameworks and System Software
Paper
Code Optimization
Compilers
Performance Measurement, Modeling, and Tools
Performance Optimization
Programming Frameworks and System Software
Paper
Data Centers and Large Distributed Systems
Algorithms
Cloud Computing
Distributed Computing
Heterogeneous Computing
Large Scale Systems
State of the Practice
Paper
Data Compression
Accelerators
Data Analysis, Visualization, and Storage
Data Compression
Paper
Data Coordination
Cloud Computing
Distributed Computing
Data Movement and Memory
Performance Measurement, Modeling, and Tools
Paper
Exascale Computing
Exascale
Large Scale Systems
State of the Practice
Best Paper Finalist
Paper
Extreme-Scale Applications
Accelerators
Applications
Modeling and Simulation
Best Paper Finalist
Best Student Paper Finalist
Paper
Fault Tolerance and FPGA Codesign
Accelerators
Artificial Intelligence/Machine Learning
Codesign
Fault Handling and Tolerance
Performance Measurement, Modeling, and Tools
Post-Moore Computing
Paper
Global Task Parallelism
Heterogeneous Computing
Programming Frameworks and System Software
Task Parallelism
Paper
GPU Middleware and System Software
Accelerators
Distributed Computing
Middleware and System Software
Performance Measurement, Modeling, and Tools
Post-Moore Computing
Best Paper Finalist
Paper
Graph Algorithms in HPC
Accelerators
Algorithms
Graph Algorithms and Frameworks
Paper
Graph Analytics
Architecture and Networks
Data Movement and Memory
Graph Algorithms and Frameworks
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
Paper
Graph Frameworks and Databases
Cloud Computing
Data Analysis, Visualization, and Storage
Graph Algorithms and Frameworks
Best Paper Finalist
Paper
Handling Hardware Faults
Accelerators
Architecture and Networks
Data Analysis, Visualization, and Storage
Fault Handling and Tolerance
Best Student Paper Finalist
Paper
High Performance for Graph Operations
Accelerators
Applications
Graph Algorithms and Frameworks
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
Paper
High Performance I/O
Data Analysis, Visualization, and Storage
I/O and File Systems
State of the Practice
Paper
Linear Algebra I
Accelerators
Algorithms
Linear Algebra
Best Paper Finalist
Paper
Linear Algebra II
Algorithms
Linear Algebra
Post-Moore Computing
Paper
Message Passing Innovations
Distributed Computing
Message Passing
Programming Frameworks and System Software
Best Student Paper Finalist
Paper
Molecular Dynamics Applications and Accelerators
Accelerators
Applications
Architecture and Networks
Modeling and Simulation
Press Briefing
Press Briefing
Paper
Quantum Computing
Post-Moore Computing
Quantum Computing
Best Paper Finalist
Paper
Resource Management
Architecture and Networks
Performance Measurement, Modeling, and Tools
Resource Management
Inclusivity
SC First Timers
Inclusivity
Paper
Sustainable Computing
Cloud Computing
Distributed Computing
Energy Efficiency
Green Computing
Programming Frameworks and System Software
State of the Practice
Sustainability
Paper
Tensor Computation
Artificial Intelligence/Machine Learning
Compilers
Performance Measurement, Modeling, and Tools
Performance Optimization
Programming Frameworks and System Software
Tensors
Birds of a Feather
Top500 BOF
Paper
Topics in Cloud Computing
Cloud Computing
Distributed Computing
Energy Efficiency
Performance Measurement, Modeling, and Tools
Paper
Training Graph Neural Networks
Artificial Intelligence/Machine Learning
Paper
Training in HPC Machine Learning
Artificial Intelligence/Machine Learning