RETHINKING SPEAKER EMBEDDINGS FOR SPEECH GENERATION:
SUB-CENTER MODELING FOR CAPTURING INTRA-SPEAKER DIVERSITY


Ismail Rasim Ulgen1, John H. L. Hansen2, Carlos Busso3, and Berrak Sisman1

1 Center for Language and Speech Processing (CLSP), Johns Hopkins University, USA
2 Center for Robust Speech Systems (CRSS), The University of Texas at Dallas, USA
3 Language Technologies Institute (LTI), Carnegie Mellon University, USA

Abstract: Modeling the rich prosodic variations inherent in human speech is essential for generating natural-sounding speech. While speaker embeddings are commonly used as conditioning inputs in personalized speech generation, they are typically optimized for speaker recognition, which encourages the loss of intra-speaker variation. This makes them suboptimal for speech generation, where the rich variations of the output speech distribution must be modeled. In this work, we propose a novel speaker embedding network that employs multiple sub-centers per speaker class during training, instead of a single center as in conventional approaches. This sub-center modeling allows the embedding to capture a broader range of speaker-specific variations while maintaining speaker classification performance. We demonstrate the effectiveness of the proposed embeddings on a voice conversion task, showing improved naturalness and prosodic expressiveness in the synthesized speech.
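For a concrete picture of the idea, below is a minimal PyTorch sketch of a sub-center classification head in the spirit of the abstract. The module name SubCenterHead, the cosine-similarity scoring, and the hard max over sub-centers (as in sub-center ArcFace) are our assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubCenterHead(nn.Module):
    """Hypothetical sub-center classification head: each speaker class
    owns C sub-centers instead of a single class center."""

    def __init__(self, embed_dim: int, num_speakers: int, num_subcenters: int):
        super().__init__()
        # One learnable center per (speaker, sub-center) pair: (S, C, D).
        self.weight = nn.Parameter(
            torch.empty(num_speakers, num_subcenters, embed_dim))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between each embedding and every sub-center.
        emb = F.normalize(embeddings, dim=-1)            # (B, D)
        centers = F.normalize(self.weight, dim=-1)       # (S, C, D)
        sim = torch.einsum("bd,scd->bsc", emb, centers)  # (B, S, C)
        # Collapse the sub-center axis into one logit per speaker; a hard
        # max (sub-center ArcFace style) is one common choice.
        return sim.max(dim=-1).values                    # (B, S)
```

The aggregated logits can then be trained with cross-entropy or an additive-angular-margin loss; a larger number of sub-centers gives each speaker class more room to spread prosodically distinct utterances, which is exactly the intra-speaker diversity the abstract targets.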

------------------> Sub-center Speaker Embeddings for Voice Conversion <------------------


Figure 1: Proposed sub-center modeling (pink) on the ECAPA-TDNN [1] network
Figure 2: VC framework [2] that utilizes the proposed sub-center speaker embeddings

-----------------------------> Speech Samples <---------------------------

Experimental Setup:

The samples are from speakers unseen during VC training. For each conversion, a random reference utterance from the target speaker (~3 s) is used.
Methods (C denotes the number of sub-centers per speaker; see the sketch after this list for one reading of C and T):
  • VC with baseline ECAPA-TDNN [1,2]
  • VC with Sub-center ECAPA-TDNN, C=10, T=0.1 (least intra-class variance)
  • Proposed Method: VC with Sub-center ECAPA-TDNN, C=20 (most intra-class variance)
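The C and T settings above map naturally onto the head sketched earlier: C is the number of sub-centers per speaker, and one plausible reading (our assumption, not stated on this page) is that T is a softmax temperature used to aggregate sub-center similarities, where a small T approaches the hard max and therefore yields the least intra-class variance. A minimal sketch under that assumption:

```python
import torch

def aggregate_subcenters(sim: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Temperature-weighted aggregation over sub-center similarities.

    sim: (batch, num_speakers, num_subcenters) cosine similarities,
    e.g. the `sim` tensor from the SubCenterHead sketch above.
    A small temperature (T=0.1) concentrates the weights on the best
    sub-center, approximating a hard max; larger temperatures mix
    sub-centers more evenly and retain more intra-class variance.
    """
    weights = torch.softmax(sim / temperature, dim=-1)  # (B, S, C)
    return (weights * sim).sum(dim=-1)                  # (B, S)
```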



Zero-shot Voice Conversion


Columns (left to right): Ground-Truth | VC with Baseline ECAPA-TDNN [1,2] | VC with Sub-center ECAPA-TDNN, C=10, T=0.1 (least intra-class variance) | Proposed Method: VC with Sub-center ECAPA-TDNN, C=20 (most intra-class variance)

Female-to-Male

Source: p229 Target: p345
Source: p308 Target: p260

Female-to-Female

Source: p329 Target: p305
Source: p265 Target: s5

Male-to-Female

Source: p260 Target: p310
Source: p246 Target: p305

Male-to-Male

Source: p298 Target: p345
Source: p246 Target: p260
[1] Brecht Desplanques, Jenthe Thienpondt, & Kris Demuynck (2020). ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. In Proc. Interspeech 2020 (pp. 3830–3834).
[2] Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, & Emmanuel Dupoux (2021). Speech Resynthesis from Discrete Disentangled Self-Supervised Representations. In Proc. Interspeech 2021 (pp. 3615–3619).