Object-aware Sound Source Localization via Audio-Visual Scene Understanding
  • Um, Sung Jin
  • Kim, Dongjin
  • Lee, Sangmin
  • Kim, Jung Uk
Citations
Web of Science: 1
Scopus: 3

Abstract

The audio-visual sound source localization task aims to spatially localize sound-making objects within visual scenes by integrating visual and audio cues. However, existing methods struggle to localize sound-making objects accurately in complex scenes, particularly when visually similar silent objects coexist. This limitation arises primarily from their reliance on simple audio-visual correspondence, which does not capture fine-grained semantic differences between sound-making and silent objects. To address these challenges, we propose a novel sound source localization framework that leverages Multimodal Large Language Models (MLLMs) to generate detailed contextual information that explicitly distinguishes between sound-making foreground objects and silent background objects. To effectively integrate this detailed information, we introduce two novel loss functions: Object-aware Contrastive Alignment (OCA) loss and Object Region Isolation (ORI) loss. Extensive experimental results on the MUSIC and VGGSound datasets demonstrate the effectiveness of our approach, which significantly outperforms existing methods in both single-source and multi-source localization scenarios. Code and generated detailed contextual information are available at: https://github.com/VisualAIKHU/OA-SSL.
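
The limitation described in the abstract can be illustrated with a toy example. The sketch below is not the paper's method and does not implement the OCA or ORI losses; it only assumes that "simple audio-visual correspondence" means scoring each spatial location by the cosine similarity between an audio embedding and the visual feature at that location. The function names, the 2 x 2 grid, and the feature values are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)

def localization_map(audio_emb, visual_feats):
    """Score every spatial cell against one audio embedding.

    visual_feats is an H x W grid of D-dimensional vectors
    (a hypothetical layout chosen for this illustration).
    """
    return [[cosine(audio_emb, cell) for cell in row] for row in visual_feats]

def peak(sim_map):
    """Return (row, col) of the first highest-scoring cell."""
    cells = ((r, c) for r, row in enumerate(sim_map) for c, _ in enumerate(row))
    return max(cells, key=lambda rc: sim_map[rc[0]][rc[1]])

# Toy scene: the two top cells hold visually identical objects,
# only one of which is assumed to be making the sound.
audio = [1.0, 0.0]
visual = [
    [[0.9, 0.1], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.1]],
]
sim = localization_map(audio, visual)
```

In this toy scene the two top cells receive identical scores, so a correspondence-only map cannot tell the sounding object from its silent look-alike. This is the kind of ambiguity the abstract says the additional object-level context is meant to resolve.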

Keywords

audio-visual; multimodal learning; sound source localization
Title
Object-aware Sound Source Localization via Audio-Visual Scene Understanding
Authors
Um, Sung Jin; Kim, Dongjin; Lee, Sangmin; Kim, Jung Uk
DOI
10.1109/CVPR52734.2025.00781
Publication date
2025
Type
Proceedings Paper
Journal
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Pages
8342-8351