- Shin, Changyong;
- Go, Younghun;
- Yoo, Yeonho;
- Jeong, Jinwoo;
- Hwang, Jaehyun;
- Yang, Gyeongsik;
- Yoo, Chuck
Abstract
GPU sharing aims to improve GPU utilization by running distributed deep learning training jobs concurrently. However, GPU sharing poses a significant challenge: the increase in job completion time (JCT) caused by interference between jobs is inconsistent, which complicates job scheduling. Our experiments reveal that the degree of JCT increase varies by as much as ~3.7x. While previous studies have analyzed this JCT inconsistency problem, none of them has been able to minimize the inconsistency. We propose TensorShare, a proactive GPU sharing technique that leverages a deep learning model to predict the extent of JCT increase. This study defines a new metric, called GPU SLA, which represents the upper threshold of JCT increase. TensorShare then introduces a novel scheduler that proactively identifies which jobs meet the GPU SLA while minimizing the JCT increase. Our evaluation shows that TensorShare improves GPU SLA satisfaction rates by 26.1x-47.3x and reduces the JCT increase by 37%-60%. Furthermore, we evaluate TensorShare with large language models that are not included in the training of TensorShare's prediction model, achieving ~7x and ~10.3x improvements in GPU SLA satisfaction and JCT inconsistency, respectively.
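As a rough illustration of the idea in the abstract, the sketch below treats the JCT increase as the ratio of shared to standalone completion time and selects the pair of jobs whose predicted increase stays within a GPU SLA threshold. The abstract does not give the exact definitions, so the ratio form, the function names (`jct_increase`, `pick_pair`), and the lookup-table predictor standing in for the deep learning model are all assumptions made here for illustration, not TensorShare's actual design.

```python
from itertools import combinations


def jct_increase(shared_jct, solo_jct):
    """Assumed definition: JCT under sharing divided by JCT when run alone."""
    return shared_jct / solo_jct


def pick_pair(jobs, predict, sla):
    """Return the job pair with the smallest worst-case predicted JCT increase
    among pairs where both jobs stay within the SLA, or None if no pair does.

    `predict(a, b)` is a hypothetical stand-in for a learned predictor; it
    returns the predicted JCT increase of `a` and of `b` when co-located.
    """
    best, best_cost = None, float("inf")
    for a, b in combinations(jobs, 2):
        inc_a, inc_b = predict(a, b)
        worst = max(inc_a, inc_b)
        # Keep only pairs that satisfy the SLA, then minimize the increase.
        if worst <= sla and worst < best_cost:
            best, best_cost = (a, b), worst
    return best


# Toy predictions (made-up numbers, for illustration only).
PREDICTED = {
    ("A", "B"): (1.2, 1.9),
    ("A", "C"): (1.1, 1.3),
    ("B", "C"): (2.4, 1.5),
}


def toy_predict(a, b):
    return PREDICTED[(a, b)]


chosen = pick_pair(["A", "B", "C"], toy_predict, sla=1.5)
print(chosen)
```

With these toy numbers, only the pair A and C keeps both predicted increases under the 1.5 threshold, so it is the one selected; tightening the threshold to 1.0 leaves no admissible pair.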
Keywords
- Title
- Prediction-based GPU sharing for distributed training
- Authors
- Shin, Changyong; Go, Younghun; Yoo, Yeonho; Jeong, Jinwoo; Hwang, Jaehyun; Yang, Gyeongsik; Yoo, Chuck
- Publication date
- 2026-08
- Type
- Article
- Volume
- 181