Unsupervised Detection of LLM-Generated Text in Korean Using Syntactic and Semantic Cues

Abstract

As Large Language Models (LLMs) are increasingly used for content creation, detecting AI-generated text has become a critical challenge. Prior work has largely focused on English, leaving low-resource languages such as Korean underexplored. We propose an unsupervised detection framework that integrates two complementary signals: syntactic token cohesiveness (TOCSIN) and semantic regeneration similarity (SimLLM). To support evaluation, we construct a Korean pairwise dataset of 1,000 anchors with continuation- and regeneration-style generations, and we further assess performance across domains (news, research paper abstracts, essays) and model families (GPT-3.5 Turbo, GPT-4o, HyperCLOVA X, LLaMA-3-8B). Without any training, our ensemble achieves up to 0.963 F1 and 0.985 ROC-AUC, outperforming baselines. These results demonstrate that combining syntactic and semantic cues enables robust unsupervised detection in low-resource settings. The code is available at https://github.com/dxlabskku/llm-detection-main.

Authors
Jeon, Heejeong; Park, Minsu; Choi, YunSeok; Park, Eunil
DOI
10.18653/v1/2026.findings-eacl.77
Publication date
2026
Type
Conference Paper
Venue
19th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL 2026
Pages
1504-1518