- Song, Joon-Seok;
- Lee, Juyeob;
- Park, Eunil
SCOPUS
Abstract
Video summarization aims to selectively extract the important events and contexts in a video and reconstruct them into a concise video while preserving the original content. Previous studies employed either attention-based approaches, which capture long-term global dependencies, or convolutional neural networks, which learn local frame-level patterns. Used separately, these approaches tend to be biased toward either global or local information, which limits their summarization performance. We propose a new video summarization architecture, Multi-Scale Convolution-Attention Residual Fusion (MARs), which integrates convolutional modules with multi-head self-attention mechanisms. The proposed model captures inter-frame variations and fine-grained local features through its convolutional modules while simultaneously learning global contextual information across the entire video through its attention-based modules. Additional design elements, namely temporal-difference embedding, multi-scale convolution, and positional encoding, are incorporated to further improve summarization performance. Experimental results on two benchmark datasets demonstrate that the proposed model outperforms existing methods. Ablation studies further validate the contribution of each module, confirming the effectiveness of the proposed architecture. The code is available at https://github.com/dxlabskku/MARs.
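The idea of fusing a local (convolutional) branch and a global (attention) branch through a residual sum can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the uniform averaging kernel standing in for learned convolutions, the identity projections in a single attention head, the kernel sizes (3, 5), and the plain additive fusion are all assumptions made for illustration, and positional encoding is omitted. The released code at the URL above is the reference for the actual model.

```python
import math


def temporal_difference(frames):
    """Return x_t - x_{t-1} per frame; the first frame gets a zero vector."""
    dim = len(frames[0])
    diffs = [[0.0] * dim]
    for prev, cur in zip(frames, frames[1:]):
        diffs.append([c - p for c, p in zip(cur, prev)])
    return diffs


def local_average(frames, kernel_size):
    """Stand-in for a 1D temporal convolution (assumption: uniform kernel).

    The window is clipped at the sequence edges, so the output keeps the
    same number of frames as the input.
    """
    half = kernel_size // 2
    n, dim = len(frames), len(frames[0])
    out = []
    for t in range(n):
        window = frames[max(0, t - half):min(n, t + half + 1)]
        out.append([sum(f[d] for f in window) / len(window) for d in range(dim)])
    return out


def multi_scale_local(frames, kernel_sizes=(3, 5)):
    """Average the outputs of several kernel sizes (hypothetical choice)."""
    branches = [local_average(frames, k) for k in kernel_sizes]
    n, dim = len(frames), len(frames[0])
    return [
        [sum(b[t][d] for b in branches) / len(branches) for d in range(dim)]
        for t in range(n)
    ]


def self_attention(frames):
    """Single-head scaled dot-product attention with identity projections."""
    scale = math.sqrt(len(frames[0]))
    dim = len(frames[0])
    out = []
    for q in frames:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in frames]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append(
            [sum(w * v[d] for w, v in zip(weights, frames)) / total for d in range(dim)]
        )
    return out


def fuse(frames):
    """Residual fusion sketch: input + local branch + global branch."""
    # Inject inter-frame change into the features before both branches.
    x = [
        [f + d for f, d in zip(fr, df)]
        for fr, df in zip(frames, temporal_difference(frames))
    ]
    local = multi_scale_local(x)
    glob = self_attention(x)
    return [
        [xi + li + gi for xi, li, gi in zip(xr, lr, gr)]
        for xr, lr, gr in zip(x, local, glob)
    ]
```

Calling `fuse` on a list of per-frame feature vectors returns one fused vector per frame with the same dimensionality. A real model would feed these into a scoring head that predicts frame importance, which this sketch leaves out.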
- Title: MARs: Multi-Scale Convolution-Attention Residual Fusion for Video Summarization
- Authors: Song, Joon-Seok; Lee, Juyeob; Park, Eunil
- Publication date: 2026
- Type: Conference Paper
- Published in: MMSys 2026 - Proceedings of the 2026 ACM Multimedia System Conference
- Pages: 84-95