SC23 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Technical Papers Archive

Unity ECC: Unified Memory Protection Against Bit and Chip Errors


Authors: Dongwhee Kim, Jaeyoon Lee, and Wonyeong Jung (Sungkyunkwan University); Michael Sullivan (NVIDIA Corporation); and Jungrae Kim (Sungkyunkwan University)

Abstract: DRAM vendors utilize On-Die Error Correction Codes (OD-ECC) to correct random bit errors internally. Meanwhile, system companies utilize Rank-Level ECC (RL-ECC) to protect data against chip errors. Separate protection increases the redundancy ratio to 32.8% in DDR5 and incurs significant performance penalties. This paper proposes a novel RL-ECC, Unity ECC, that can correct both single-chip and double-bit error patterns. Unity ECC corrects double-bit errors using unused syndromes of single-chip correction. Our evaluation shows that Unity ECC without OD-ECC can provide the same reliability level as Chipkill RL-ECC with OD-ECC. Moreover, it can significantly improve system performance and reduce DRAM energy and area by eliminating OD-ECC.




Back to Technical Papers Archive Listing