Authors: Junqi Yin, Sajal Dash, Feiyi Wang, and Mallikarjun Shankar (Oak Ridge National Laboratory (ORNL))
Abstract: Large language models (LLMs) are poised to revolutionize the way we conduct scientific research, yet their complexity and cost hinder adoption by the wider science community. Identifying suitable scientific use cases, optimizing model and data sizes, and scaling up training are among the most pressing issues. Here we provide practical solutions for building and using LLM-based foundation models targeting scientific use cases. We present an end-to-end examination of the effectiveness of LLMs in scientific research, including their scaling behavior and computational requirements on Frontier, the first exascale supercomputer. We have also developed and released to the scientific community a suite of open foundation models, called FORGE, with up to 26B parameters, trained on 257B tokens from over 200M scientific articles, and we demonstrate its use and effectiveness on scientific downstream tasks. Our research establishes best practices that can be applied across various fields to utilize LLMs for scientific discovery.