Compile-Time QoS Scheme for Deep Learning Inferences

Abstract

With the proliferation of deep learning technologies across various service domains, the sharing of accelerators such as GPUs, TPUs, and NPUs for inference processing has become increasingly common. These accelerators must efficiently handle multiple deep learning services operating concurrently. However, inference requests, characterized by sequences of short-duration kernels, create significant challenges for online schedulers attempting to maintain Quality of Service (QoS) guarantees. This paper presents QoSlicer, a novel compile-time QoS management framework that employs kernel slicing to relieve the burden on schedulers. By generating multiple pre-determined slicing plans, QoSlicer enables more efficient, lightweight QoS scheduling while ensuring target latency requirements are met. Our approach incorporates a heuristic search algorithm to identify optimal slicing plans and implements robust performance estimation models to validate these plans. Our experimental evaluation across 75 diverse workload combinations demonstrates that QoSlicer improves throughput by an average of 20.2% compared to state-of-the-art scheduling techniques.
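
The abstract does not describe how a scheduler chooses among the pre-determined slicing plans at run time. As a rough illustration only, the sketch below assumes each plan carries an estimated latency and an estimated throughput, and that the scheduler takes the highest-throughput plan that still meets the target latency. The `SlicingPlan` record, the `pick_plan` function, and all numbers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SlicingPlan:
    """Hypothetical record for one precomputed plan (not the paper's format)."""
    name: str
    est_latency_ms: float   # assumed: estimated latency of the latency-critical job
    est_throughput: float   # assumed: estimated aggregate throughput under this plan


def pick_plan(plans: Sequence[SlicingPlan],
              target_latency_ms: float) -> Optional[SlicingPlan]:
    """Return the highest-throughput plan whose estimated latency meets the target.

    Returns None when no plan satisfies the target latency.
    """
    feasible = [p for p in plans if p.est_latency_ms <= target_latency_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda p: p.est_throughput)


# Made-up numbers, purely for illustration.
plans = [
    SlicingPlan("no_slicing", est_latency_ms=14.0, est_throughput=120.0),
    SlicingPlan("coarse_slices", est_latency_ms=9.5, est_throughput=110.0),
    SlicingPlan("fine_slices", est_latency_ms=6.0, est_throughput=95.0),
]
chosen = pick_plan(plans, target_latency_ms=10.0)
```

Because the plans are fixed ahead of time in this sketch, the run-time decision reduces to a filter and a maximum over a short list, which is one way to read the abstract's claim of lightweight scheduling.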

Keywords

Deep learning inference; Kernel slicing; Multi-tenancy; QoS
Title
Compile-Time QoS Scheme for Deep Learning Inferences
Authors
Hong, Sungin; Kim, Hyunjun; Han, Hwansoo
DOI
10.1145/3712285.3759846
Publication date
2025
Type
Proceedings Paper
Published in
Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2025
Pages
1697-1709