Abstract
Multimodal Large Language Models (MLLMs), which integrate large language models (LLMs) with vision models, aim to overcome the text-centric limitations of traditional LLMs. While models like GPT-4 and PaLM-E excel at processing text data, they face limitations in complex fields such as medical image analysis and cross-modality reasoning. MLLMs enhance understanding and reasoning by combining textual and non-linguistic data, such as images, expanding their applicability across various domains. This study proposes optimization strategies for effectively handling multimodal inputs through prompt engineering techniques. We evaluate three MLLMs (Llama-3.2, Phi-3.5, and Qwen2-VL) using datasets such as Flickr30k, NoCaps, and MSCOCO, and analyze their performance on tasks including image captioning, object recognition, and scene understanding. Furthermore, a comparison of the impact of Chain-of-Thought (CoT) and In-Context Learning (ICL) techniques on model performance shows that CoT is more effective for tasks requiring logical reasoning, while ICL enhances adaptability across diverse scenarios. © 2025 IEEE.
Keywords
- Title
- Strategic Application of Prompt Engineering in Multi-Modal Large Language Models
- Authors
- Son, Minjun; Jun, Woomin; Lee, Sungjin
- Publication date
- 2025
- Type
- Conference paper
- Journal
- Digest of Technical Papers - IEEE International Conference on Consumer Electronics