MARs: Multi-Scale Convolution-Attention Residual Fusion for Video Summarization

Abstract

Video summarization aims to select the important events and contexts in a video and reconstruct them into a concise video that preserves the original content. Previous studies have used either attention-based approaches to capture long-term global dependencies or convolutional neural networks to learn local frame-level patterns, but not both together. As a result, these approaches tend to be biased toward either global or local information, which limits their summarization performance. We propose a new video summarization architecture, the Multi-Scale Convolution-Attention Residual Fusion (MARs), which integrates convolutional modules with multi-head self-attention mechanisms. The proposed model captures inter-frame variations and fine-grained local features through convolutional modules while simultaneously learning global contextual information across the entire video through attention-based modules. Additional design components, namely temporal-difference embedding, multi-scale convolution, and positional encoding, are incorporated to further improve summarization performance. Experimental results on two benchmark datasets demonstrate that the proposed model outperforms existing methods. Ablation studies further validate the contribution of each module, confirming the effectiveness of the proposed architecture. Code is available at https://github.com/dxlabskku/MARs.
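The idea of fusing a local convolutional branch with a global attention branch through residual connections can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). Everything here is an assumption made for illustration: the function names, the fixed box filters standing in for learned convolution kernels, the single attention head with identity projections, the kernel widths, and the choice to add the temporal difference directly to the input. Positional encoding is omitted.

```python
import math

def temporal_difference(frames):
    """Per-frame change x[t] - x[t-1]; the first frame gets zeros."""
    dim = len(frames[0])
    diffs = [[0.0] * dim]
    for prev, cur in zip(frames, frames[1:]):
        diffs.append([c - p for c, p in zip(cur, prev)])
    return diffs

def multi_scale_conv(frames, widths=(1, 3, 5)):
    """Local branch: mean of box filters at several temporal widths.
    Fixed box kernels are a stand-in for learned convolution weights."""
    n, dim = len(frames), len(frames[0])
    out = [[0.0] * dim for _ in range(n)]
    for w in widths:
        half = w // 2
        for t in range(n):
            # Window is clipped at the sequence boundaries.
            window = frames[max(0, t - half): t + half + 1]
            for d in range(dim):
                out[t][d] += sum(f[d] for f in window) / len(window) / len(widths)
    return out

def self_attention(frames):
    """Global branch: scaled dot-product attention, identity projections."""
    dim = len(frames[0])
    out = []
    for q in frames:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in frames]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[d] for w, v in zip(weights, frames))
                    for d in range(dim)])
    return out

def fuse(frames):
    """Residual fusion: x + local(x) + global(x), where x is the input
    plus its temporal difference (an assumed embedding scheme)."""
    diffs = temporal_difference(frames)
    x = [[f + d for f, d in zip(fr, df)] for fr, df in zip(frames, diffs)]
    local, glob = multi_scale_conv(x), self_attention(x)
    return [[a + b + c for a, b, c in zip(xr, lr, gr)]
            for xr, lr, gr in zip(x, local, glob)]
```

Calling `fuse` on a list of per-frame feature vectors returns a list of the same shape. In a trained model, a scoring head on these fused features would presumably produce per-frame importance scores, but that step is outside this sketch.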

Keywords

Attention-Convolution fusion; Multi-scale convolution; Temporal-difference embedding; Video summarization
Authors
Song, Joon-Seok; Lee, Juyeob; Park, Eunil
DOI
10.1145/3793853.3795749
Publication Date
2026
Type
Conference Paper
Journal Name
MMSys 2026 - Proceedings of the 2026 ACM Multimedia Systems Conference
Pages
84-95