Encoder-Free Style-Controllable Text-to-Speech with Voice Attribute Vectors

Abstract

Natural language-based style control in text-to-speech (TTS) synthesis typically relies on large pretrained text encoders, which introduce significant computational overhead. In this paper, we propose a novel encoder-free approach that replaces the heavy text encoder with precomputed prototype attribute vectors and requires no additional training. The key idea is to construct each attribute vector by averaging the embeddings of diverse style prompts whose descriptions contain the target attribute. Analysis of the embedding space shows that tokens representing the same attribute form tight clusters, indicating that compact prototype vectors can effectively capture their semantic meaning. Experiments show that our model achieves word error rates and automatic evaluation scores comparable to those of the baseline that uses the text encoder. These results demonstrate that lightweight attribute representations can replace large text encoders in production-oriented TTS, reducing model size by 38% without compromising perceived quality.
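
The averaging step described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the function name `build_attribute_vectors`, the whole-word matching rule, and the toy 3-dimensional embeddings, which stand in for the outputs of a pretrained text encoder. The record does not specify how prompts are matched to attributes or how the resulting vectors are consumed by the TTS model.

```python
from statistics import fmean


def build_attribute_vectors(prompts, embeddings, attributes):
    """Average the embeddings of all prompts that mention each attribute.

    Hypothetical helper: a prompt "mentions" an attribute when the attribute
    appears as a whole word in the lowercased prompt text.
    """
    vectors = {}
    for attr in attributes:
        rows = [
            emb
            for text, emb in zip(prompts, embeddings)
            if attr in text.lower().split()
        ]
        if not rows:
            continue  # no prompt mentions this attribute, so no prototype
        # Element-wise mean across the matching embeddings.
        vectors[attr] = [fmean(column) for column in zip(*rows)]
    return vectors


# Made-up style prompts and embeddings (not data from the paper).
prompts = [
    "a fast and happy female voice",
    "a slow and happy male voice",
    "a fast and sad male voice",
]
embeddings = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [5.0, 4.0, 4.0],
]

attribute_vectors = build_attribute_vectors(
    prompts, embeddings, ["happy", "fast", "whisper"]
)
```

Under these assumptions the prototypes are computed once offline, so at inference time a lookup into `attribute_vectors` would take the place of a text-encoder forward pass.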

Keywords

computational efficiency; prototype vector; representation learning; speech style control; text-to-speech
Title
Encoder-Free Style-Controllable Text-to-Speech with Voice Attribute Vectors
Authors
Kang, Jaehoon; Lee, Yejin; Shim, Kyuhong
DOI
10.1109/ICEIC69189.2026.11386084
Publication date
2026
Type
Conference Paper
Journal name
2026 International Conference on Electronics, Information, and Communication, ICEIC 2026