Detail view
- Kang, Jaehoon;
- Lee, Yejin;
- Shim, Kyuhong
SCOPUS
0
Abstract
Natural language-based style control in text-to-speech (TTS) synthesis typically relies on large pretrained text encoders, which introduce significant computational overhead. In this paper, we propose a novel encoder-free approach that replaces heavy text encoders with precomputed prototype attribute vectors without additional training. The key idea is to construct each attribute vector by averaging the embeddings of diverse style prompts that contain the target attribute in their descriptions. Analysis of the embedding space shows that tokens representing the same attribute form tight clusters, indicating that compact prototype vectors can effectively capture their semantic meaning. Experiments show that our model achieves comparable word error rates and automatic evaluation scores to the baseline with the text encoder. These results demonstrate that lightweight attribute representations can successfully replace large text encoders in production-oriented TTS, reducing the model size by 38% without compromising perceived quality.
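The averaging step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_attribute_vectors`, the whole-word matching rule, and the 3-dimensional toy embeddings are all assumptions made for illustration. In practice the embeddings would presumably be computed once, offline, by the text encoder that is later removed.

```python
from statistics import fmean


def build_attribute_vectors(prompts, embeddings, attributes):
    """Average the embeddings of all prompts that mention each attribute.

    Hypothetical helper: the matching rule (whole-word, case-insensitive)
    is an assumption, not something the abstract specifies.
    """
    vectors = {}
    for attr in attributes:
        # Keep the embeddings of prompts whose description contains the attribute word.
        matched = [
            emb
            for text, emb in zip(prompts, embeddings)
            if attr in text.lower().split()
        ]
        if not matched:
            continue  # no prompt mentions this attribute, so no prototype is built
        # Element-wise mean over the matched embeddings gives the prototype vector.
        vectors[attr] = [fmean(dim) for dim in zip(*matched)]
    return vectors


# Toy data: made-up 3-dimensional stand-ins for text-encoder outputs.
prompts = [
    "a fast male voice",
    "a slow male voice",
    "a fast female voice",
]
embeddings = [
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 0.0],
    [1.0, 2.0, 4.0],
]

attribute_vectors = build_attribute_vectors(
    prompts, embeddings, ["fast", "male", "female"]
)
```

Under these assumptions, inference would look up the stored vectors for the requested attributes rather than run a text encoder. How several attribute vectors are combined and fed to the TTS model is not stated in the abstract and is left out of the sketch.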
Keywords
- Title
- Encoder-Free Style-Controllable Text-to-Speech with Voice Attribute Vectors
- Authors
- Kang, Jaehoon; Lee, Yejin; Shim, Kyuhong
- Publication date
- 2026
- Type
- Conference Paper
- Journal name
- 2026 International Conference on Electronics, Information, and Communication, ICEIC 2026