- Park, Leo Hyun;
- Kim, YoonSik;
- Hwang, Eunbi;
- Han, Sangsoo;
- Kim, Hyoungshick;
- Kwon, Taekyoung
SCOPUS
Abstract
Large Language Models (LLMs) have rapidly advanced in reasoning capability and accessibility, driving their deployment across diverse applications. Yet this progress has also widened the surface for safety and security vulnerabilities. Adversaries can exploit prompt diversity, dialog memory, or multimodal inputs to induce unsafe or confidential outputs, while continual fine-tuning and third-party integration render static assurance infeasible. This paper introduces our ongoing national R&D project on developing the AutoPT Framework, an Autonomous Purple Teaming architecture that extends the collaborative principles of purple teaming toward self-adaptive, continuously verifiable LLM assurance. AutoPT unifies autonomous adversarial exploration and adaptive defensive reinforcement through two co-evolving agents. The red module, AutoPT-Red, employs coverage-guided fuzzing and internal measurement metrics to autonomously uncover vulnerabilities. The blue module, AutoPT-Blue, performs self-healing adaptation by updating guardrails and detecting integrity or confidentiality violations using embedding-based feedback. Preliminary case studies on jailbreak fuzzing and backdoor-poisoning defense validate the feasibility of this closed-loop, self-adapting architecture. As part of a broader national initiative, this work lays the conceptual and technical foundation for transitioning industrial purple teaming into a fully autonomous, scalable, and measurable assurance paradigm for generative AI systems.
- Title: Toward an Autonomous Purple Teaming Framework for Security and Safety in Large Language Models
- Authors: Park, Leo Hyun; Kim, YoonSik; Hwang, Eunbi; Han, Sangsoo; Kim, Hyoungshick; Kwon, Taekyoung
- Publication date: 2025
- Type: Conference Paper
- Journal: Proceedings of IEEE Pacific Rim International Symposium on Dependable Computing, PRDC
- Pages: 182-187