FINALLY: Fast and Universal Speech Enhancement With Studio-like Quality

Abstract

In this paper, we address the challenge of speech enhancement in real-world recordings, which often contain various forms of distortion, such as background noise, reverberation, and microphone artifacts. We revisit the use of Generative Adversarial Networks (GANs) for speech enhancement and theoretically show that GANs are naturally inclined to seek the point of maximum density within the conditional clean speech distribution, which, as we argue, is essential for the speech enhancement task. We study various feature extractors for perceptual loss to facilitate the stability of adversarial training, developing a methodology for probing the structure of the feature space. This leads us to integrate a WavLM-based perceptual loss into the MS-STFT adversarial training pipeline, creating an effective and stable training procedure for the speech enhancement model. The resulting speech enhancement model, which we refer to as FINALLY, builds upon the HiFi++ architecture, augmented with a WavLM encoder and a novel training pipeline. Empirical results on various datasets confirm our model's ability to produce clear, high-quality speech at 48 kHz, achieving state-of-the-art performance in the field of speech enhancement.
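To make the "point of maximum density" argument concrete, the toy example below contrasts a mean estimate with a mode estimate. It is only an illustration, not the paper's proof: the two candidate values and their probabilities are made up, and the only fact relied on is that the minimizer of expected squared error is the mean of the distribution.

```python
# Toy conditional distribution p(y | x) over two plausible clean values.
# Both the values and the probabilities are invented for illustration.
candidates = {-1.0: 0.45, 1.0: 0.55}


def expected_squared_error(estimate):
    """Expected squared error of a point estimate under the toy distribution."""
    return sum(p * (y - estimate) ** 2 for y, p in candidates.items())


# The estimate that minimizes expected squared error is the mean.
mean_estimate = sum(y * p for y, p in candidates.items())

# A mode-seeking estimator returns the most probable candidate instead.
mode_estimate = max(candidates, key=candidates.get)

print("mean estimate:", round(mean_estimate, 2),
      "probability:", candidates.get(mean_estimate, 0.0))
print("mode estimate:", mode_estimate,
      "probability:", candidates[mode_estimate])
```

The mean has the lower squared error but falls between the two candidates, where the toy distribution has no mass, while the mode is an outcome that can actually occur. In speech terms, an averaged output tends to sound blurred, which is one way to read the motivation for seeking the highest-density point.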

Model Architecture

The model is a six-component neural network consisting of the WavLM-large, SpectralUNet, Upsampler, WaveUNet, SpectralMaskNet, and Upsample WaveUNet modules. SpectralUNet is responsible for the initial preprocessing of audio in the spectral domain using two-dimensional convolutions. Additionally, the SSL features obtained from WavLM-large are added to the spectral features. The Upsampler is a HiFi-GAN generator-based module that increases the temporal resolution of the input tensor, mapping it to the waveform domain. WaveUNet performs post-processing in the waveform domain and improves the output of the Upsampler by incorporating phase information taken directly from the raw input waveform. SpectralMaskNet then performs spectrum-based post-processing to remove any artifacts that remain after WaveUNet. In this way, the model alternates between the time and frequency domains, allowing for effective audio restoration. Finally, the Upsample WaveUNet is a learnable upsampler of the signal sampling rate, consisting of a WaveUNet with an additional convolutional upsampling block in the decoder that increases the temporal resolution by a factor of 3.
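The order of the modules described above can be sketched as a plain data-flow skeleton. This is a minimal sketch and not the released implementation: every function is a hypothetical identity placeholder for a neural module, real feature tensors have different shapes than waveforms, and the 16 kHz input rate is an assumption inferred from the 48 kHz output and the 3x upsampling factor.

```python
INPUT_RATE_HZ = 16_000   # assumption: inferred as 48 kHz / 3, not stated in the text
UPSAMPLE_FACTOR = 3      # stated in the text


# Placeholders only: the real modules are neural networks.
def wavlm_large(waveform):
    return list(waveform)


def spectral_unet(waveform):
    return list(waveform)


def upsampler(features):
    return list(features)


def wave_unet(waveform, raw_input):
    # The real module also reads the raw input waveform for phase information.
    return list(waveform)


def spectral_mask_net(waveform):
    return list(waveform)


def upsample_waveunet(waveform, factor=UPSAMPLE_FACTOR):
    # Sample repetition stands in for the learnable convolutional upsampling.
    return [s for s in waveform for _ in range(factor)]


def enhance(raw_input):
    spectral = spectral_unet(raw_input)             # spectral-domain preprocessing
    ssl = wavlm_large(raw_input)                    # SSL features
    fused = [a + b for a, b in zip(spectral, ssl)]  # SSL features added to spectral ones
    coarse = upsampler(fused)                       # map to the waveform domain
    refined = wave_unet(coarse, raw_input)          # time-domain post-processing
    cleaned = spectral_mask_net(refined)            # spectrum-based post-processing
    return upsample_waveunet(cleaned)               # 3x sampling-rate increase


one_second = [0.0] * INPUT_RATE_HZ
output = enhance(one_second)
print(len(output), "samples at", INPUT_RATE_HZ * UPSAMPLE_FACTOR, "Hz")
```

Under the stated assumption, one second of input yields three times as many output samples, which matches a 48 kHz output.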

Real data samples

In this section we illustrate the performance of FINALLY on real-world samples. As you can hear, our model removes the distortions present in the recordings, producing high-quality audio samples.

Comparison with UNIVERSE diffusion model

Additionally, we provide a comparison of our model with UNIVERSE on their validation data. Our model is less prone to hallucinations while delivering high perceptual quality.

Additional Comparison with HiFi-GAN-2 and UNIVERSE

Examples of clusters obtained during LMOS studies

As mentioned in our paper, we generated clusters with the help of VITS. In this part we provide examples of different clusters. As can be heard, the diversity of samples within one cluster is not caused by a mismatch in phrase, speaker, or phoneme duration. WavLM tends to preserve this cluster structure, whereas the L2 distance, for instance, usually does not.

Cluster 1	Cluster 2	Cluster 3	Cluster 4
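One simple way to check whether a feature space keeps such clusters together is a nearest-neighbor purity score. The sketch below is a generic probe and not necessarily the procedure used in the paper: the function name `nearest_neighbor_purity` is hypothetical, and the 2-D coordinates are made up to stand in for two candidate feature spaces.

```python
import math


def nearest_neighbor_purity(points, labels):
    """Fraction of samples whose nearest other sample shares their cluster label."""
    hits = 0
    for i, point in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        nearest = min(others, key=lambda j: math.dist(point, points[j]))
        hits += labels[nearest] == labels[i]
    return hits / len(points)


# Cluster label of each of the four toy samples.
labels = [0, 0, 1, 1]

# Made-up coordinates: in the first space same-cluster samples are close,
# in the second space each sample sits next to one from the other cluster.
structured_space = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
mixed_space = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (5.1, 5.0)]

print("structured:", nearest_neighbor_purity(structured_space, labels))
print("mixed:", nearest_neighbor_purity(mixed_space, labels))
```

A score near 1 means the distance in that space respects the cluster assignment, and a score near 0 means it does not.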

Data

To ensure a fair comparison with our work, we provide samples from five datasets enhanced by our model:

VoxCeleb: 50 audio clips from VoxCeleb1 (Nagrani et al., 2017), covering the Speech Transmission Index (STI) range of 0.75-0.99, balanced between male and female speakers. Link
UNIVERSE: 100 audio clips randomly generated by the authors of UNIVERSE (Serrà et al., 2022) from clean utterances sampled from VCTK and Harvard sentences, alongside noises from DEMAND and FSDnoisy18k. The data includes various simulated distortions like band limiting, reverberation, codec, and transmission artifacts. For more details, refer to (Serrà et al., 2022). Link
VCTK-DEMAND: Validation samples from the Valentini denoising benchmark (Valentini-Botinhao et al., 2017). This dataset facilitates broad comparisons across various speech enhancement models, with a test set of 824 utterances containing artificially simulated noisy samples from 2 speakers at 4 SNR levels (17.5, 12.5, 7.5, and 2.5 dB). Link
LibriTTS: A multi-speaker corpus of English speech at a 24 kHz sampling rate, originally intended for TTS. We provide enhanced versions of 100 randomly selected samples from the test-other set. Link
Deep Noise Suppression Challenge: We provide an enhanced version of the dns5-blind-testset data for both the headset and non-headset tracks. For more information about the challenge, please refer to the DNS GitHub page. Our enhanced data is available through the link.
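For readers unfamiliar with the SNR levels listed for VCTK-DEMAND, the sketch below shows the standard way to scale a noise signal so that a mixture reaches a target SNR in dB. It is not the dataset authors' mixing script: the helper names `power` and `mix_at_snr` are hypothetical, and the sine tone and seeded Gaussian noise are stand-ins for real speech and noise recordings.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so that the clean-to-noise power ratio equals snr_db."""
    target_noise_power = power(clean) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    scaled_noise = [gain * n for n in noise]
    noisy = [c + n for c, n in zip(clean, scaled_noise)]
    return noisy, scaled_noise


# Stand-in signals: a 220 Hz tone and seeded Gaussian noise, one second at 16 kHz.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 220 * t / 16_000) for t in range(16_000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16_000)]

measured = {}
for snr_db in (17.5, 12.5, 7.5, 2.5):
    noisy, scaled_noise = mix_at_snr(clean, noise, snr_db)
    measured[snr_db] = 10 * math.log10(power(clean) / power(scaled_noise))
    print(snr_db, round(measured[snr_db], 2))
```

The measured ratio matches the requested level by construction, since the gain is solved directly from the definition SNR = 10 log10(P_clean / P_noise).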

BibTeX

@misc{babaev2024finallyfastuniversalspeech,
          title={FINALLY: fast and universal speech enhancement with studio-like quality}, 
          author={Nicholas Babaev and Kirill Tamogashev and Azat Saginbaev and Ivan Shchekotov and Hanbin Bae and Hosang Sung and WonJun Lee and Hoon-Young Cho and Pavel Andreev},
          year={2024},
          eprint={2410.05920},
          archivePrefix={arXiv},
          primaryClass={cs.SD},
          url={https://arxiv.org/abs/2410.05920}, 
      }

