SC23 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

ACM Student Research Competition Poster Archive

Better Data Splits for Machine Learning with Astartes

Student: Jackson Burns (Massachusetts Institute of Technology (MIT))
Supervisor: William Green (Massachusetts Institute of Technology (MIT))

Abstract: Machine Learning (ML) has become an increasingly popular tool to accelerate traditional workflows. Critical to the use of ML is the process of splitting datasets into training, validation, and testing subsets to develop and evaluate models. Common practice is to assign these subsets randomly. Although this approach is fast, it only measures a model's capacity to interpolate. These testing errors may be overly optimistic on out-of-scope data; thus, there is a growing need to easily measure performance for extrapolation tasks. To address this issue, we report astartes, an open-source Python package that implements many similarity- and distance-based algorithms to partition data into more challenging splits. This poster focuses on use-cases within cheminformatics. However, astartes operates on arbitrary vectors, so its principals and workflow are generalizable to other ML domains as well. astartes is available via the Python package managers pip and conda and is publicly hosted on GitHub (

ACM-SRC Semi-Finalist: no

Poster: PDF
Poster Summary: PDF

Back to Poster Archive Listing