Abstract

Cycle-consistent generative adversarial networks have been widely used in non-parallel voice conversion (VC). Their ability to learn mappings between source and target features without relying on parallel training data eliminates the need for temporal alignments. However, most methods decouple the conversion of acoustic features from the synthesis of the audio signal, using separate models for conversion and waveform generation. This work unifies conversion and synthesis into a single model, thereby eliminating the need for a separate vocoder. By leveraging cycle-consistent training and a self-supervised auxiliary training task, our model efficiently generates high-quality converted raw audio waveforms. Subjective listening tests show that our method outperforms the baseline in whispered speech conversion (up to 6.7% relative improvement), and mean opinion score predictions yield competitive results in conventional VC (between 0.5% and 2.4% relative improvement).
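For illustration, the cycle-consistency objective at the heart of this training setup can be sketched in a few lines of PyTorch. The generators G_xy and G_yx below are placeholders operating directly on raw waveforms; this is a minimal sketch of the general technique, not the released implementation.

```python
import torch
import torch.nn as nn

# Placeholder waveform-domain generators: x is the source (e.g. whispered)
# domain, y the target (voiced) domain. The real model is more elaborate.
G_xy = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=15, padding=7), nn.GLU(dim=1),
    nn.Conv1d(8, 1, kernel_size=15, padding=7),
)
G_yx = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=15, padding=7), nn.GLU(dim=1),
    nn.Conv1d(8, 1, kernel_size=15, padding=7),
)
l1 = nn.L1Loss()

def cycle_consistency_loss(x, y):
    """L1 error after mapping each waveform to the other domain and back."""
    x_cycled = G_yx(G_xy(x))  # x -> y' -> x''
    y_cycled = G_xy(G_yx(y))  # y -> x' -> y''
    return l1(x_cycled, x) + l1(y_cycled, y)

# Toy usage: a batch of four 1-second, 16 kHz single-channel waveforms.
x = torch.randn(4, 1, 16000)
y = torch.randn(4, 1, 16000)
loss = cycle_consistency_loss(x, y)
loss.backward()
```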

Source code is available at: https://github.com/audiodemo/voice-conversion/tree/main/src

Audio samples

Whispered speech conversion

Whispered speech conversion aims to recover the fundamental frequency F0, without changing the linguistic content of an utterance, by mapping the whispered input to a corresponding normally phonated output produced by the same speaker. Samples from the test set are listed below.

Each test utterance (s006u110, s007u238, s008u098, s015u422, s105u147, s109u189, s111u083) is provided as: the whispered input, the normally phonated (voiced) reference, our method (cycle-consistent only), our method (+ masking + L_adv2), our method (+ feature encoder), MaskCycleGAN-VC [3], HiFi-GAN + DTW [4], and NVC-Net [5].

Whispered and voiced audio samples were taken from the wTIMIT corpus [1].
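The voicing difference between the whispered inputs and the converted outputs can also be checked programmatically. The sketch below uses librosa's pYIN pitch tracker to estimate the fraction of voiced frames in a file; the file names are placeholders for the samples listed above, not actual paths from this page.

```python
import librosa
import numpy as np

def voiced_ratio(path):
    """Fraction of frames that pYIN marks as voiced (i.e., carrying F0)."""
    y, sr = librosa.load(path, sr=None)
    f0, voiced_flag, _ = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C6"),
        sr=sr,
    )
    return float(np.mean(voiced_flag))

# Placeholder file names: whispered input should score near zero, while a
# successful conversion should approach the normally phonated reference.
for name in ("s006u110_whisper.wav", "s006u110_converted.wav", "s006u110_normal.wav"):
    print(name, voiced_ratio(name))
```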

Conventional voice conversion

Conventional voice conversion is performed between different speakers. Samples from the test set are listed below.

Each conversion direction (F→F, M→M, M→F, F→M) is provided as: the source speaker's utterance, the target speaker's utterance, our method (full), and MaskCycleGAN-VC [3].

F→F: female to female (p225 → p229); M→M: male to male (p273 → p274); M→F: male to female (p232 → p231); F→M: female to male (p231 → p232).

"Our method (full)" refers to the complete model, i.e., including masking, the additional adversarial loss, and the GLU feature encoder.

Source and target audio samples were taken from the VCTK dataset [2].
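For reference, the GLU feature encoder mentioned above is built from gated convolutional blocks. A minimal PyTorch sketch of such a block is given below; the layer sizes are illustrative assumptions and do not reflect the actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUBlock(nn.Module):
    """1-D convolution followed by a gated linear unit: half of the output
    channels gate the other half through a sigmoid (illustrative sizes only)."""

    def __init__(self, in_channels, hidden_channels, kernel_size=5):
        super().__init__()
        # Produce 2 * hidden_channels so GLU can split into value and gate halves.
        self.conv = nn.Conv1d(
            in_channels, 2 * hidden_channels, kernel_size, padding=kernel_size // 2
        )

    def forward(self, x):
        return F.glu(self.conv(x), dim=1)

# Toy usage on a batch of mel-spectrogram-like features (80 bins, 128 frames).
encoder = nn.Sequential(GLUBlock(80, 128), GLUBlock(128, 256))
features = encoder(torch.randn(2, 80, 128))
print(features.shape)  # torch.Size([2, 256, 128])
```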

References

[1] Lim, B. P. (2010). Computational differences between whispered and non-whispered speech. University of Illinois.

[2] Yamagishi, J., Veaux, C., & MacDonald, K. (2019). CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit. University of Edinburgh. The Centre for Speech Technology Research (CSTR).

[3] Kaneko, T., Kameoka, H., Tanaka, K., & Hojo, N. (2021). MaskCycleGAN-VC: Learning Non-Parallel Voice Conversion with Filling in Frames. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5919-5923.

[4] Kong, J., Kim, J., & Bae, J. (2020). HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, & H. Lin (Eds.), Advances in Neural Information Processing Systems (Vol. 33, pp. 17022-17033).

[5] Nguyen, B., & Cardinaux, F. (2022). NVC-Net: End-To-End Adversarial Voice Conversion. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7012-7016.