Workshop: 7th International Workshop on Software Correctness for HPC Applications (Correctness '23)
Authors: Pedro Valero-Lara (Oak Ridge National Laboratory (ORNL)), Ian Jorquera (Colorado State University), and Frank Lui and Jeffrey Vetter (Oak Ridge National Laboratory (ORNL))
Abstract: NVIDIA Tensor Cores have enabled significant acceleration of general matrix multiplication for applications in AI and high-performance computing. These specialized accelerators can provide a performance increase of between 8x and 20x, albeit with a loss in precision. However, many applications require higher precision. Fortunately, mixed-precision methods can maintain high precision while still taking advantage of the performance of lower-precision AI cores. We extend the state of the art by using NVIDIA’s new TF32 framework, which not only relaxes some constraints of the previous frameworks but also provides equivalent precision and performance with a much simpler approach. We also propose a new framework called TF64 that attempts double-precision arithmetic with low-precision Tensor Cores. Although this framework does not exist yet, we validated the correctness of this idea and achieved the equivalent of 64-bit precision on 32-bit hardware.
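To make the mixed-precision idea concrete, the following is a minimal sketch of one common split-and-accumulate scheme. It is not necessarily the method used in the paper. It rests on several assumptions: TF32 is emulated in software by truncating an FP32 value to 10 mantissa bits (real Tensor Core rounding may differ), accumulation is done in Python's native double precision rather than on hardware, and all function names (`to_tf32`, `split`, `gemm_tf32`, `gemm_split`) are hypothetical. Each FP32 operand is split into a TF32 "high" part and a TF32 "low" residual, and three low-precision products are summed to recover most of the lost accuracy.

```python
import struct

def to_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_tf32(x):
    """Emulate TF32 (assumption): keep 10 of FP32's 23 mantissa bits by truncation."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # clear the 13 low-order mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def split(x):
    """Split an FP32 value into a TF32 high part and a TF32 residual."""
    hi = to_tf32(x)
    lo = to_tf32(x - hi)  # what the high part failed to capture
    return hi, lo

def gemm_tf32(A, B):
    """C = A @ B with every operand truncated to TF32 (single low-precision pass)."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(to_tf32(A[i][p]) * to_tf32(B[p][j]) for p in range(k))
             for j in range(m)] for i in range(n)]

def gemm_split(A, B):
    """C = A @ B using three TF32 products per term; the lo*lo product is dropped."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0  # accumulated in FP64 here, for simplicity
            for p in range(k):
                a_hi, a_lo = split(A[i][p])
                b_hi, b_lo = split(B[p][j])
                acc += a_hi * b_hi + a_hi * b_lo + a_lo * b_hi
            C[i][j] = acc
    return C

def max_abs_error(C, ref):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(c - r) for row_c, row_r in zip(C, ref)
               for c, r in zip(row_c, row_r))

# Small illustrative inputs, stored as FP32 values.
A = [[to_fp32(1 / 3), to_fp32(2 / 3)], [to_fp32(1 / 7), to_fp32(5 / 7)]]
B = [[to_fp32(3 / 11), to_fp32(1 / 9)], [to_fp32(7 / 13), to_fp32(2 / 3)]]

# Reference product computed in FP64 from the FP32 inputs.
ref = [[sum(A[i][p] * B[p][j] for p in range(2)) for j in range(2)]
       for i in range(2)]

err_plain = max_abs_error(gemm_tf32(A, B), ref)
err_split = max_abs_error(gemm_split(A, B), ref)
print("plain TF32 error:", err_plain)
print("split TF32 error:", err_split)
```

The split variant costs three low-precision multiplications per term instead of one, which is the usual trade-off in schemes of this kind: part of the Tensor Core speedup is spent on recovering precision.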