FVTTS: Face Based Voice Synthesis for Text-to-Speech
- Authors
- Lee, Minyoung; Park, Eunil; Hong, Sungeun
- Issue Date
- 2024
- Publisher
- International Speech Communication Association
- Keywords
- end-to-end TTS; face to speech; face voice conversion; face-based TTS
- Citation
- Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 4953-4957
- Pages
- 5
- Indexed
- SCOPUS
- Journal Title
- Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
- Start Page
- 4953
- End Page
- 4957
- URI
- https://scholarx.skku.edu/handle/2021.sw.skku/119906
- DOI
- 10.21437/Interspeech.2024-140
- ISSN
- 2308-457X
- Abstract
- A face expresses individual identity and is used in tasks such as identification, authentication, and personalization. A voice likewise expresses individual identity, and personalized voice synthesis from voice references is an active research area. However, voice-based methods depend on the availability of voice samples. We propose Face-based Voice synthesis for Text-To-Speech (FVTTS) to synthesize voice from face images, which are more expressive of personal identity than voice samples. A major challenge in face-based TTS is extracting distinct, voice-related features from the face image. Our face encoder tackles this by integrating global facial attributes with voice-related features to represent personalized characteristics. FVTTS shows superiority on various metrics and adaptability across different data domains. We establish a new standard in face-based TTS, leading the way in personalized voice synthesis. © 2024 International Speech Communication Association. All rights reserved.
- Appears in Collections
- Computing and Informatics > Convergence > 1. Journal Articles
