Sub-center Speaker Embeddings for Voice Conversion

Abstract: Modeling the rich prosodic variation inherent in human speech is essential for generating natural-sounding speech. While speaker embeddings are commonly used as conditioning inputs in personalized speech generation, they are typically optimized for speaker recognition, which encourages suppressing intra-speaker variation. This makes them suboptimal for speech generation, where the rich variation in the output speech distribution must be modeled. In this work, we propose a novel speaker embedding network that employs multiple sub-centers per speaker class during training, instead of the single center used in conventional approaches. This sub-center modeling allows the embedding to capture a broader range of speaker-specific variation while maintaining speaker classification performance. We demonstrate the effectiveness of the proposed embeddings on a voice conversion task, showing improved naturalness and prosodic expressiveness in the synthesized speech.
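To make the idea concrete, the sketch below shows a sub-center classification head in the spirit of sub-center ArcFace: every speaker class owns C sub-centers, and the class logit is pooled over the cosine similarities to those sub-centers. The class name, the temperature-based soft-max pooling, and the way C and T enter are our assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubCenterHead(nn.Module):
    """Illustrative sub-center classification head (sketch, not the paper's code).

    Each speaker class owns C sub-centers instead of a single center, so one
    class can cover several modes of intra-speaker variation while the head
    is still trained for speaker classification.
    """

    def __init__(self, emb_dim, num_speakers, num_subcenters=10, temperature=0.1):
        super().__init__()
        # One set of sub-centers per speaker class: (speakers, sub-centers, dim)
        self.weight = nn.Parameter(torch.randn(num_speakers, num_subcenters, emb_dim))
        self.temperature = temperature

    def forward(self, embeddings):
        # Cosine similarity between each embedding and every sub-center
        emb = F.normalize(embeddings, dim=-1)        # (B, D)
        w = F.normalize(self.weight, dim=-1)         # (S, C, D)
        sims = torch.einsum("bd,scd->bsc", emb, w)   # (B, S, C)
        # Temperature-controlled soft maximum over a class's sub-centers;
        # a hard max (sims.max(dim=-1).values) recovers the usual sub-center pooling.
        logits = self.temperature * torch.logsumexp(sims / self.temperature, dim=-1)
        return logits  # fed into (margin-based) softmax cross-entropy over speakers
```

With a very small temperature the pooling approaches a hard max over sub-centers, which plausibly tightens each speaker's cluster; this matches the configurations compared below, where C=10 with T=0.1 is the least-variance variant and C=20 the most.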


Speech Samples
The samples are from speakers that are unseen during VC training. For each conversion, a randomly selected reference utterance (~3 s) from the target speaker is used.
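For context, zero-shot conversion in this setup follows the usual recipe of the speech-resynthesis pipeline [2]: disentangled content (and pitch) features are extracted from the source utterance, a speaker embedding is extracted from the short target reference, and a decoder resynthesizes the waveform. The sketch below is purely illustrative; every function passed in is a placeholder, not an actual API.

```python
# Illustrative zero-shot VC inference flow (all callables are placeholders, not a real API).
def convert(source_wav, reference_wav, content_encoder, f0_encoder, speaker_encoder, decoder):
    """Convert source speech into the reference speaker's voice."""
    units = content_encoder(source_wav)       # speaker-independent content units from the source
    f0 = f0_encoder(source_wav)               # pitch contour from the source
    spk_emb = speaker_encoder(reference_wav)  # e.g., (sub-center) ECAPA-TDNN embedding of a ~3 s reference
    return decoder(units, f0, spk_emb)        # resynthesized waveform conditioned on the target speaker
```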
Methods:
- VC with baseline ECAPA-TDNN [1,2]
- VC with Sub-center ECAPA-TDNN, C=10, T=0.1 (least intra-class variance)
- Proposed Method: VC with Sub-center ECAPA-TDNN, C=20 (most intra-class variance)
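The "least/most intra-class variance" labels above refer to how tightly each speaker's embeddings cluster around that speaker's mean embedding. One simple way to quantify this over a set of extracted embeddings is the mean squared distance to each speaker's centroid; the sketch below is our own illustration (function name and the length-normalization choice are not from the paper).

```python
import torch
import torch.nn.functional as F

def intra_class_variance(embeddings: torch.Tensor, speaker_ids: torch.Tensor) -> float:
    """Average squared distance of length-normalized embeddings to their speaker centroid.

    embeddings: (N, D) tensor of extracted speaker embeddings
    speaker_ids: (N,) integer tensor of speaker labels
    """
    emb = F.normalize(embeddings, dim=-1)
    total, count = 0.0, 0
    for spk in speaker_ids.unique():
        spk_emb = emb[speaker_ids == spk]              # all embeddings of this speaker
        centroid = spk_emb.mean(dim=0, keepdim=True)   # speaker centroid
        total += ((spk_emb - centroid) ** 2).sum().item()
        count += spk_emb.shape[0]
    return total / count
```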
Zero-shot Voice Conversion
| Conversion | Source | Target | Ground-Truth | VC with Baseline ECAPA-TDNN [1,2] | VC with Sub-center ECAPA-TDNN, C=10, T=0.1 (least intra-class variance) | Proposed: VC with Sub-center ECAPA-TDNN, C=20 (most intra-class variance) |
|---|---|---|---|---|---|---|
| Female-to-Male | p229 | p345 | (audio) | (audio) | (audio) | (audio) |
| Female-to-Male | p308 | p260 | (audio) | (audio) | (audio) | (audio) |
| Female-to-Female | p329 | p305 | (audio) | (audio) | (audio) | (audio) |
| Female-to-Female | p265 | s5 | (audio) | (audio) | (audio) | (audio) |
| Male-to-Female | p260 | p310 | (audio) | (audio) | (audio) | (audio) |
| Male-to-Female | p246 | p305 | (audio) | (audio) | (audio) | (audio) |
| Male-to-Male | p298 | p345 | (audio) | (audio) | (audio) | (audio) |
| Male-to-Male | p246 | p260 | (audio) | (audio) | (audio) | (audio) |
[2] Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, & Emmanuel Dupoux (2021). Speech Resynthesis from Discrete Disentangled Self-Supervised Representations. In Proc. Interspeech 2021 (pp. 3615–3619).