Evaluating Performance of Modern GPU with Partitioned Last-Level Caches
  • Yoon, Jihun
  • Hwang, Joonseong
  • Han, Sukhyun
  • Jang, Sungbin
  • Jang, Yoonho
  • Hong, Seokin

Abstract

To deliver higher compute capability for highly parallel workloads, data-center GPUs (A100, H100, etc.) split the Network-on-Chip (NoC) into two partitions to improve local bandwidth. However, these partitioned designs introduce non-uniform access latencies, which significantly affect performance but are often neglected in current simulation frameworks. In this paper, we characterize the performance discrepancy that arises when traditional non-partitioned GPU simulations leave part of the network contention unmodeled and therefore fail to reflect real hardware behavior. We extend Accel-Sim to accurately model the A100 GPU architecture, incorporating realistic L2 partitioning, cross-partition latency, and latency asymmetry. Through microbenchmark-guided calibration and hardware profiler validation, we demonstrate that our simulator correlates closely with real hardware and reveals critical workload-specific trade-offs. Our model achieves a correlation of 0.991 with real hardware, compared to 0.989 for the prior design. Moreover, the prior non-partitioned design yields 1.17× longer L2 access latency than the partitioned design. This finding highlights the need for precise delay modeling when evaluating and designing future GPU architectures.

Keywords

GPGPU; Last-Level Cache (LLC); Network-on-Chip (NoC)
Title
Evaluating Performance of Modern GPU with Partitioned Last-Level Caches
Authors
Yoon, Jihun; Hwang, Joonseong; Han, Sukhyun; Jang, Sungbin; Jang, Yoonho; Hong, Seokin
DOI
10.1109/ITC-CSCC66376.2025.11137585
Publication Date
2025-07
Type
Conference Paper
Journal Title
2025 International Technical Conference on Circuits/Systems, Computers, and Communications, ITC-CSCC 2025