Authors: Christine Kirkpatrick (San Diego Supercomputer Center, UC San Diego; CODATA), Geoffrey Fox (University of Virginia), Vijay Janapa Reddi (Harvard University)
Abstract: This BoF will spotlight the underemphasized role of inputs and data in machine learning (ML), in contrast to the prevalent focus on hardware. It invites the SC community to contribute insights in three areas: 1) the value proposition for data-centric AI in scientific computing; 2) foundation models for the long tail of science; and 3) the role of benchmarks in data-centric AI. To foster interactive dialogue, we will facilitate discussions, conduct live polling, and arrange short breakout sessions. These activities will enable participants to explore the practical implications of data-centric AI, benchmarking, and contributions to scientific foundation models.
Long Description: Hardware aspects of machine learning (ML) often receive considerable attention, overshadowing the crucial role that inputs and data play in accuracy and efficiency. We believe it is important to address this imbalance and examine the emerging field of data-centric AI. With this in mind, we invite the scientific community attending SC to provide input on the following key areas: 1) the priorities and gaps in data-centric AI; 2) the value proposition for improved benchmark (AI-ready) data; and 3) AI reproducibility.
To ensure a comprehensive discussion, we encourage participation from the HPC community, particularly those who design and operate ML clusters. Their expertise will help shape research agendas, develop new methodologies, and streamline technologies.
During this session, we aim to explore how data preparation choices influence computation, illuminating the relationship between data and AI. A series of short talks will serve as catalysts for discussion. For instance, DataPerf will present its work in data-centric AI through benchmark competitions. Additionally, experts will discuss the importance of AI readiness for cyberinfrastructure professionals and explain sources of irreproducibility in ML, including model variance across hardware and software versions.
To foster interactive dialogue, we will facilitate discussions, conduct live polling, and arrange short breakout sessions followed by group report-outs. These activities will enable participants to explore the practical implications of AI readiness, data-centric AI, and AI reproducibility.
The insights and perspectives shared during this BoF session will be compiled into a concise summary report. This report will be made available to the HPC community as a resource for future efforts, such as community research roadmaps and benchmarking competition guidelines. By collectively addressing these topics, we aim to advance the field of ML and support its continued growth within the scientific community.
Website: https://www.farr-rcn.org/