Strategic Application of Prompt Engineering in Multi-Modal Large Language Models
  • Son, Minjun
  • Jun, Woomin
  • Lee, Sungjin
Citations (SCOPUS): 1

Abstract

Multimodal Large Language Models (MLLMs), which integrate large language models (LLMs) with vision models, aim to overcome the text-centric limitations of traditional LLMs. While models like GPT-4 and PaLM-E excel at processing text data, they face limitations in complex fields such as medical image analysis and cross-modality reasoning. MLLMs enhance understanding and reasoning by combining textual and non-linguistic data, such as images, expanding their applicability across various domains. This study proposes optimization strategies for effectively handling multimodal inputs through prompt engineering techniques. We evaluate three MLLMs (Llama-3.2, Phi-3.5, and Qwen2-VL) on datasets such as Flickr30k, NoCaps, and MSCOCO, and analyze their performance on tasks including image captioning, object recognition, and scene understanding. Furthermore, a comparison of Chain-of-Thought (CoT) and In-Context Learning (ICL) prompting shows that CoT is more effective for tasks requiring logical reasoning, while ICL enhances adaptability across diverse scenarios. © 2025 IEEE.

Keywords

Chain of Thought; In-context learning; Multimodal Large Language Model; Prompt Engineering
Title
Strategic Application of Prompt Engineering in Multi-Modal Large Language Models
Authors
Son, Minjun; Jun, Woomin; Lee, Sungjin
DOI
10.1109/ICCE63647.2025.10930109
Publication date
2025
Type
Conference paper
Journal
Digest of Technical Papers - IEEE International Conference on Consumer Electronics