SC23 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Workshops Archive

Heterogeneous Syslog Analysis: There Is Hope


Workshop: HPC Systems Professionals Workshop (HPCSYSPROS23)

Authors: Andres Quan, Leah Howell, and Hugh Greenberg (Los Alamos National Laboratory (LANL))


Abstract: Heterogeneous test-bed clusters make it unusually difficult to identify system hardware failures and anomalies, because errors and warnings are reported through the system log in many different ways. We present a novel approach for the real-time classification of syslog messages generated by a heterogeneous test-bed cluster, with the goal of proactively identifying potential hardware issues and security events. By integrating machine learning models with high-performance computing systems, our approach facilitates continuous system health monitoring.

The paper introduces a taxonomy that classifies system issues into actionable categories of problems while filtering out groups of messages that system administrators would consider unimportant "noise". Finally, we experiment with newly available large language models as message classifiers and share our results and experience. The results show promising performance and better explainability than currently available techniques, but the computational costs may offset the benefits.




