Workshop: Workshop on Artificial Intelligence and Machine Learning for Scientific Applications (AI4S)
Authors: Gautham Dharuman, Logan Ward, Heng Ma, and Priyanka V. Setty (Argonne National Laboratory); Ozan Gokdemir (University of Chicago); Sam Foreman, Murali Emani, Kyle Hippe, and Alexander Brace (Argonne National Laboratory); Kristopher Keipert and Thomas Gibbs (NVIDIA); Ian Foster (Argonne National Laboratory); Anima Anandkumar (California Institute of Technology); and Venkatram Vishwanath and Arvind Ramanathan (Argonne National Laboratory)
Abstract: Large language models (LLMs) trained on vast biological datasets can learn biological motifs and correlations across the evolutionary landscape of natural proteins. LLMs can then be used for de novo design of proteins with specific structures, functions, and physicochemical properties. We employ a pre-trained genome-scale language model that uses codons as tokens and integrate it into a workflow for targeted sequence generation. Our framework proposes new gene sequences and ranks them for downstream evaluation using metrics that collectively capture a broad range of sequence-specific, biophysical, and biochemical properties. We demonstrate the integrated workflow by designing novel variants of the enzyme malate dehydrogenase (MDH) that exhibit more favorable activation energies than their natural counterparts (a reduction of 4.01 kJ/mol), with sustained sequence generation rates of 10^4/hr and simulation rates of 10^2/hr on 64 nodes of Polaris at about 99.7% system utilization during the run.
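
To illustrate what "codons as tokens" means in practice, the following is a minimal sketch of codon-level tokenization, assuming an in-frame DNA coding sequence. The function name codon_tokenize and the example sequence are hypothetical and chosen for illustration; this is not the authors' tokenizer, which may handle vocabulary mapping, special tokens, and ambiguous bases differently.

```python
def codon_tokenize(dna: str) -> list[str]:
    """Split a DNA coding sequence into non-overlapping 3-nucleotide codons.

    Assumption: the input is already in frame, so its length is a
    multiple of three. A real tokenizer would also map each codon to an
    integer ID in a fixed vocabulary.
    """
    seq = dna.strip().upper()
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    # Step through the sequence three bases at a time.
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


if __name__ == "__main__":
    # Hypothetical toy sequence: start codon, two sense codons, stop codon.
    print(codon_tokenize("ATGGCTAAATAA"))
```

With this scheme each token corresponds to one codon rather than one nucleotide, so a gene of N bases yields N/3 tokens and synonymous codons for the same amino acid remain distinguishable to the model.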