Vocoder-Free Non-Parallel Conversion of Whispered Speech With Masked
Cycle-Consistent Generative Adversarial Networks
Abstract
Cycle-consistent generative adversarial networks have been widely used in
non-parallel voice conversion (VC).
Their ability to learn mappings between source and target features without
relying on parallel training data eliminates the need for temporal alignments.
However, most methods decouple the conversion of acoustic features from
the synthesis of the audio signal, using separate models for conversion
and waveform generation.
This work unifies conversion and synthesis into a single model,
thereby eliminating the need for a separate vocoder.
By leveraging cycle-consistent training and a self-supervised auxiliary
training task, our model efficiently generates high-quality converted
raw audio waveforms.
Subjective listening tests show that our method outperforms the baseline in
whispered speech conversion (up to 6.7% relative improvement),
and mean opinion score predictions yield competitive results
in conventional VC (between 0.5% and 2.4% relative improvement).
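As a rough illustration of the training principle, here is a minimal PyTorch sketch of the cycle-consistency objective used in CycleGAN-style VC; the generator names (G_xy, G_yx) and tensor shapes are hypothetical placeholders, not the paper's code.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(G_xy, G_yx, x, y):
    """L1 cycle loss: mapping x -> y' -> x'' should recover x (and vice versa)."""
    x_cycled = G_yx(G_xy(x))  # e.g., whisper -> voiced -> whisper
    y_cycled = G_xy(G_yx(y))  # e.g., voiced -> whisper -> voiced
    return F.l1_loss(x_cycled, x) + F.l1_loss(y_cycled, y)

# Toy check with identity "generators" on dummy feature batches:
G_xy = G_yx = torch.nn.Identity()
x = torch.randn(4, 80, 64)  # (batch, feature bins, frames) -- illustrative shape
y = torch.randn(4, 80, 64)
print(cycle_consistency_loss(G_xy, G_yx, x, y))  # tensor(0.) for identity maps
```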
Whispered speech conversion
Whispered speech conversion aims to recover the fundamental
frequency F0, without changing the linguistic content of an utterance,
by mapping the whispered input to a corresponding normally phonated
output produced by the same speaker.
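To make the task concrete: whispered speech carries essentially no fundamental frequency, which is exactly what the conversion has to recover. A small librosa sketch (file names are placeholders, not part of the corpus tooling) can verify this on a whisper/normal pair:

```python
import librosa
import numpy as np

for path in ["whisper.wav", "normal.wav"]:  # placeholder file names
    y, sr = librosa.load(path, sr=16000)
    # pYIN estimates F0 and a per-frame voicing decision.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    print(f"{path}: {np.mean(voiced_flag):.0%} of frames detected as voiced")
# Expected: almost no voiced frames for the whispered file,
# a substantial fraction for the normally phonated one.
```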
Some samples from the test set are presented in the table below.
Columns: Whisper | Normal | Our method (cycle-consistent only) | Our method (+ masking + Ladv2) | Our method (+ feature encoder) | MaskCycleGAN-VC [3] | HiFi-GAN + DTW [4] | NVC-Net [5]
Rows (test-set utterance IDs): s006u110, s007u238, s008u098, s015u422, s105u147, s109u189, s111u083
[Each table cell contains the corresponding audio sample.]
Our method (cycle-consistent only):
Regular cycle-consistent training only (i.e., no masking,
no additional adversarial loss, and no GLU feature encoder)
Our method (+ masking + Ladv2):
Masking and the second adversarial loss Ladv2 added to the training
procedure (see the first sketch below)
Our method (+ feature encoder):
Full model with masking, additional adversarial loss, and GLU feature encoder
MaskCycleGAN-VC:
The MaskCycleGAN-VC proposed in [3]
HiFi-GAN + DTW:
Original HiFi-GAN [4] trained with DTW-aligned parallel input features
(see the second sketch below)
NVC-Net:
The model proposed in [5] trained in a many-to-many setting
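The masking referenced above follows the filling-in-frames idea of MaskCycleGAN-VC [3]: a random span of time frames is dropped from the input spectrogram, and the generator has to reconstruct it through the cycle. A minimal sketch, where the mask width and tensor shapes are illustrative assumptions rather than the paper's settings:

```python
import torch

def mask_frames(mel: torch.Tensor, max_width: int = 32):
    """Zero out a random contiguous span of time frames per example.

    mel: (batch, mel_bins, frames). Returns the masked spectrogram and
    the binary mask; the mask can be fed to the generator as an extra
    channel so it knows which frames to fill in.
    """
    batch, _, frames = mel.shape
    mask = torch.ones_like(mel)
    for i in range(batch):
        width = int(torch.randint(0, max_width + 1, (1,)))
        start = int(torch.randint(0, frames - width + 1, (1,)))
        mask[i, :, start:start + width] = 0.0
    return mel * mask, mask

masked, mask = mask_frames(torch.randn(4, 80, 64))  # dummy mel batch
```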
Whispered and voiced audio samples were taken from the wTIMIT corpus [1].
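For the HiFi-GAN + DTW baseline, the parallel whisper/normal recordings must be time-aligned before training, since the vocoder expects frame-synchronous input and target features. A sketch of such an alignment with librosa's DTW, where the MFCC features and file names are illustrative assumptions:

```python
import librosa

whisper, sr = librosa.load("whisper.wav", sr=16000)  # placeholder paths
normal, _ = librosa.load("normal.wav", sr=16000)

X = librosa.feature.mfcc(y=whisper, sr=sr)  # (n_mfcc, frames)
Y = librosa.feature.mfcc(y=normal, sr=sr)

# DTW returns the accumulated cost matrix and the warping path
# (pairs of frame indices), which librosa emits end-to-start.
_, wp = librosa.sequence.dtw(X=X, Y=Y, metric="cosine")
wp = wp[::-1]

# Re-index both feature sequences so corresponding frames line up.
X_aligned = X[:, wp[:, 0]]
Y_aligned = Y[:, wp[:, 1]]
print(X_aligned.shape, Y_aligned.shape)  # same number of frames
```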
Conventional voice conversion
Conventional voice conversion is performed between two different
speakers.
Some samples from the test set are presented in the table below.
Columns: Source | Target | Our method (full) | MaskCycleGAN-VC [3]
Rows (conversion directions): F→F, M→M, M→F, F→M
[Each table cell contains the corresponding audio sample.]
F→F: Female to female (p225 → p229);
M→M: Male to male (p273 → p274);
M→F: Male to female (p232 → p231);
F→M: Female to male (p231 → p232).
"Our method (full)" refers to the full model, i.e., including masking,
the additional adversarial loss, and the GLU feature encoder.
Source and target audio samples were taken from the VCTK dataset [2].