RETHINKING SPEAKER EMBEDDINGS FOR SPEECH GENERATION:
SUB-CENTER MODELING FOR CAPTURING INTRA-SPEAKER DIVERSITY


Ismail Rasim Ulgen1, John H. L. Hansen2, Carlos Busso3, and Berrak Sisman1

1 Center for Language and Speech Processing (CLSP), Johns Hopkins University, USA
2 Center for Robust Speech Systems (CRSS), The University of Texas at Dallas, USA
3 Language Technologies Institute (LTI), Carnegie Mellon University, USA

Abstract: Modeling the rich prosodic variations inherent in human speech is essential for generating natural-sounding speech. While speaker embeddings are commonly used as conditioning inputs in personalized speech generation, they are typically optimized for speaker recognition, which encourages the loss of intra-speaker variation. This makes them suboptimal for speech generation, where the rich variations of the output speech distribution must be modeled. In this work, we propose a novel speaker embedding network that employs multiple sub-centers per speaker class during training, instead of a single center as in conventional approaches. This sub-center modeling allows the embedding to capture a broader range of speaker-specific variations while maintaining speaker classification performance. We demonstrate the effectiveness of the proposed embeddings on a voice conversion task, showing improved naturalness and prosodic expressiveness in the synthesized speech.
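For a concrete picture of the idea, below is a minimal PyTorch sketch of a sub-center classification head in the spirit of the abstract. The module name SubCenterHead, the cosine-similarity scoring, and the hard max over sub-centers (as in sub-center ArcFace) are our assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubCenterHead(nn.Module):
    """Hypothetical sub-center classification head: each speaker class
    owns C sub-centers instead of a single class center."""

    def __init__(self, embed_dim: int, num_speakers: int, num_subcenters: int):
        super().__init__()
        # One learnable center per (speaker, sub-center) pair: (S, C, D).
        self.weight = nn.Parameter(
            torch.empty(num_speakers, num_subcenters, embed_dim))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between each embedding and every sub-center.
        emb = F.normalize(embeddings, dim=-1)            # (B, D)
        centers = F.normalize(self.weight, dim=-1)       # (S, C, D)
        sim = torch.einsum("bd,scd->bsc", emb, centers)  # (B, S, C)
        # Collapse the sub-center axis into one logit per speaker; a hard
        # max (sub-center ArcFace style) is one common choice.
        return sim.max(dim=-1).values                    # (B, S)
```

The aggregated logits can then be trained with cross-entropy or an additive-angular-margin loss; a larger number of sub-centers gives each speaker class more room to spread prosodically distinct utterances, which is exactly the intra-speaker diversity the abstract targets.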

------------------> Sub-center Speaker Embeddings for Voice Conversion <------------------


Figure 1: Proposed sub-center modeling (pink) on the ECAPA-TDNN [1] network
Figure 2: VC framework [2] that utilizes the proposed sub-center speaker embeddings

-----------------------------> Speech Samples <---------------------------

Experimental Setup:

The samples are from speakers unseen during VC training. For each conversion, a random reference utterance from the target speaker (~3 s) is used.
Methods (C denotes the number of sub-centers per speaker; see the sketch after this list for one reading of C and T):
  • VC with baseline ECAPA-TDNN [1,2]
  • VC with Sub-center ECAPA-TDNN, C=10, T=0.1 (least intra-class variance)
  • Proposed Method: VC with Sub-center ECAPA-TDNN, C=20 (most intra-class variance)
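The C and T settings above map naturally onto the head sketched earlier: C is the number of sub-centers per speaker, and one plausible reading (our assumption, not stated on this page) is that T is a softmax temperature used to aggregate sub-center similarities, where a small T approaches the hard max and therefore yields the least intra-class variance. A minimal sketch under that assumption:

```python
import torch

def aggregate_subcenters(sim: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Temperature-weighted aggregation over sub-center similarities.

    sim: (batch, num_speakers, num_subcenters) cosine similarities,
    e.g. the `sim` tensor from the SubCenterHead sketch above.
    A small temperature (T=0.1) concentrates the weights on the best
    sub-center, approximating a hard max; larger temperatures mix
    sub-centers more evenly and retain more intra-class variance.
    """
    weights = torch.softmax(sim / temperature, dim=-1)  # (B, S, C)
    return (weights * sim).sum(dim=-1)                  # (B, S)
```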



Zero-shot Voice Conversion


Columns (left to right): Ground-Truth | VC with Baseline ECAPA-TDNN [1,2] | VC with Sub-center ECAPA-TDNN, C=10, T=0.1 (least intra-class variance) | Proposed Method: VC with Sub-center ECAPA-TDNN, C=20 (most intra-class variance)

Female-to-Male

Source: p229 Target: p345
Source: p308 Target: p260

Female-to-Female

Source: p329 Target: p305
Source: p265 Target: s5

Male-to-Female

Source: p260 Target: p310
Source: p246 Target: p305

Male-to-Male

Source: p298 Target: p345
Source: p246 Target: p260
[1] Brecht Desplanques, Jenthe Thienpondt, & Kris Demuynck (2020). ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. In Proc. Interspeech 2020 (pp. 3830–3834).
[2] Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, & Emmanuel Dupoux (2021). Speech Resynthesis from Discrete Disentangled Self-Supervised Representations. In Proc. Interspeech 2021 (pp. 3615–3619).