Memory-efficient cross-modal attention for RGB-X segmentation and crowd counting
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Zhang, Youjia | - |
dc.contributor.author | Choi, Soyun | - |
dc.contributor.author | Hong, Sungeun | - |
dc.date.accessioned | 2025-02-04T02:30:25Z | - |
dc.date.available | 2025-02-04T02:30:25Z | - |
dc.date.issued | 2025-06 | - |
dc.identifier.issn | 0031-3203 | - |
dc.identifier.issn | 1873-5142 | - |
dc.identifier.uri | https://scholarx.skku.edu/handle/2021.sw.skku/120192 | - |
dc.description.abstract | In multimodal visual understanding, fusing RGB images with additional modalities such as depth or thermal data is essential for improving both accuracy and robustness. However, traditional approaches often rely on task-specific architectures that are difficult to generalize across different multimodal scenarios. To address this limitation, we propose the Cross-modal Spatio-Channel Attention (CSCA) module, designed to flexibly integrate diverse modalities into various model architectures while enhancing performance. CSCA employs spatial attention to effectively capture interactions between modalities, improving model adaptability. Additionally, we introduce a patch-based cross-modal interaction mechanism that optimizes the processing of spatial and channel features, reducing memory overhead while preserving critical spatial information. These refinements significantly simplify cross-modal interactions and increase computational efficiency. Extensive experiments demonstrate that CSCA generalizes well across various multimodal combinations, achieving promising performance in crowd counting and image segmentation tasks, particularly in RGB-Depth, RGB-Thermal, and RGB-Polarization scenarios. Our approach provides a scalable and efficient solution for multimodal integration, with the potential for broader applications in future work. © 2025 Elsevier Ltd | - |
dc.language | English | - |
dc.language.iso | ENG | - |
dc.publisher | Elsevier Ltd | - |
dc.title | Memory-efficient cross-modal attention for RGB-X segmentation and crowd counting | - |
dc.type | Article | - |
dc.publisher.location | United Kingdom | - |
dc.identifier.doi | 10.1016/j.patcog.2025.111376 | - |
dc.identifier.scopusid | 2-s2.0-85215868535 | - |
dc.identifier.wosid | 001410592600001 | - |
dc.identifier.bibliographicCitation | Pattern Recognition, v.162 | - |
dc.citation.title | Pattern Recognition | - |
dc.citation.volume | 162 | - |
dc.type.docType | Article | - |
dc.description.isOpenAccess | N | - |
dc.description.journalRegisteredClass | scie | - |
dc.description.journalRegisteredClass | scopus | - |
dc.relation.journalResearchArea | Computer Science | - |
dc.relation.journalResearchArea | Engineering | - |
dc.relation.journalWebOfScienceCategory | Computer Science, Artificial Intelligence | - |
dc.relation.journalWebOfScienceCategory | Engineering, Electrical & Electronic | - |
dc.subject.keywordAuthor | Multimodal learning | - |
dc.subject.keywordAuthor | Non-local attention | - |
dc.subject.keywordAuthor | RGB-D semantic segmentation | - |
dc.subject.keywordAuthor | RGB-D/T crowd counting | - |
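To make the abstract's two ideas concrete — cross-modal spatial attention plus a patch-based reduction that cuts the attention memory cost — here is a minimal PyTorch sketch. It is an illustration under stated assumptions, not the authors' published implementation: the class name `CSCASketch`, the `patch_size` parameter, and the `channel_gate` design are all hypothetical stand-ins for whatever the paper actually uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CSCASketch(nn.Module):
    """Illustrative cross-modal spatio-channel attention block (assumption,
    not the paper's exact design).

    Queries come from the RGB stream; keys/values come from the auxiliary
    modality (depth, thermal, or polarization). Pooling the auxiliary map
    to patch_size x patch_size tokens shrinks the attention matrix from
    (HW x HW) to (HW x P^2), which is where the memory saving comes from.
    """

    def __init__(self, channels: int, patch_size: int = 8):
        super().__init__()
        self.patch_size = patch_size
        self.query = nn.Conv2d(channels, channels, kernel_size=1)
        self.key = nn.Conv2d(channels, channels, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        # Channel re-weighting of the fused features: the "channel" half
        # of spatio-channel attention.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        b, c, h, w = rgb.shape
        q = self.query(rgb).flatten(2).transpose(1, 2)              # (B, HW, C)
        # Patch-based reduction of the auxiliary modality.
        k = F.adaptive_avg_pool2d(self.key(aux), self.patch_size)   # (B, C, P, P)
        v = F.adaptive_avg_pool2d(self.value(aux), self.patch_size)
        k = k.flatten(2)                                            # (B, C, P*P)
        v = v.flatten(2).transpose(1, 2)                            # (B, P*P, C)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)              # (B, HW, P*P)
        fused = (attn @ v).transpose(1, 2).reshape(b, c, h, w)      # (B, C, H, W)
        fused = fused * self.channel_gate(fused)                    # channel gating
        return rgb + fused                                          # residual fusion


if __name__ == "__main__":
    block = CSCASketch(channels=64, patch_size=8)
    rgb = torch.randn(2, 64, 60, 80)    # RGB backbone features
    depth = torch.randn(2, 64, 60, 80)  # depth/thermal/polarization features
    print(block(rgb, depth).shape)      # torch.Size([2, 64, 60, 80])
```

On this sketch's assumptions, a 60x80 feature map with P = 8 yields a 4800 x 64 attention matrix instead of 4800 x 4800 — roughly a 75x reduction in that term's memory — which mirrors the memory-efficiency claim in the abstract.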