FINALLY
Fast and Universal Speech Enhancement With Studio-like Quality

Samsung Research
NeurIPS 2024

*Indicates Equal Contribution

Abstract

In this paper, we address the challenge of speech enhancement in real-world recordings, which often contain various forms of distortion, such as background noise, reverberation, and microphone artifacts. We revisit the use of Generative Adversarial Networks (GANs) for speech enhancement and theoretically show that GANs are naturally inclined to seek the point of maximum density within the conditional clean speech distribution, which, as we argue, is essential for the speech enhancement task. We study various feature extractors for perceptual loss to facilitate the stability of adversarial training, developing a methodology for probing the structure of the feature space. This leads us to integrate a WavLM-based perceptual loss into the MS-STFT adversarial training pipeline, creating an effective and stable training procedure for the speech enhancement model. The resulting speech enhancement model, which we refer to as FINALLY, builds upon the HiFi++ architecture, augmented with a WavLM encoder and a novel training pipeline. Empirical results on various datasets confirm our model's ability to produce clear, high-quality speech at 48 kHz, achieving state-of-the-art performance in the field of speech enhancement.

Model Architecture

The model is a six-component neural network consisting of the WavLM-large, SpectralUNet, Upsampler, WaveUNet, SpectralMaskNet, and Upsample WaveUNet modules. SpectralUNet performs initial preprocessing of the audio in the spectral domain using two-dimensional convolutions. Additionally, the self-supervised (SSL) features obtained from WavLM-large are added to the spectral ones. The Upsampler is a HiFi-GAN generator-based module that increases the temporal resolution of the input tensor, mapping it to the waveform domain. WaveUNet performs post-processing in the waveform domain and improves the output of the Upsampler by incorporating phase information taken directly from the raw input waveform. SpectralMaskNet then performs spectrum-based post-processing to remove any artifacts that remain after WaveUNet. In this way, the model alternates between the time and frequency domains, allowing for effective audio restoration. Finally, the Upsample WaveUNet is a learnable upsampler of the signal sampling rate: a WaveUNet with an additional convolutional upsampling block in the decoder that increases the temporal resolution by a factor of 3.
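The data flow described above can be summarized in a short sketch. This is not the authors' implementation: the `Stage` class and the `trace` helper are hypothetical names, the domain labels paraphrase the description, and the 16 kHz input rate is an assumption chosen to be consistent with the factor-of-3 upsampling and the 48 kHz output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    domain: str           # domain the stage operates in, per the description
    rate_factor: int = 1  # multiplier applied to the signal sampling rate


# Listed in the order given in the text. WavLM-large features are added to
# the SpectralUNet features rather than forming a separate sequential step.
STAGES = [
    Stage("WavLM-large", "SSL features"),
    Stage("SpectralUNet", "spectral"),
    Stage("Upsampler", "spectral -> waveform"),
    Stage("WaveUNet", "waveform"),
    Stage("SpectralMaskNet", "spectral"),
    Stage("Upsample WaveUNet", "waveform", rate_factor=3),
]


def trace(input_rate_hz: int) -> list[tuple[str, str, int]]:
    """Return (stage name, domain, signal sampling rate after the stage)."""
    rate = input_rate_hz
    rows = []
    for stage in STAGES:
        rate *= stage.rate_factor
        rows.append((stage.name, stage.domain, rate))
    return rows


if __name__ == "__main__":
    # 16 kHz is an assumed input rate, not a value stated on this page.
    for name, domain, rate in trace(16_000):
        print(f"{name:18s} {domain:22s} {rate} Hz")
```

Only the last stage changes the signal sampling rate in this sketch; the Upsampler increases the temporal resolution of intermediate features while mapping them to the waveform domain, which is why its `rate_factor` is left at 1.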

Real data samples

In this section, we illustrate the performance of FINALLY on real-world samples. As you can hear, our model removes the distortions present in the recordings and produces high-quality audio.

Comparison with UNIVERSE diffusion model

Additionally, we provide a comparison of our model with UNIVERSE on their validation data. Our model is less prone to hallucinations while delivering high perceptual quality.

Additional Comparison with HiFi-GAN-2 and UNIVERSE


Examples of clusters obtained during LMOS studies

As mentioned in our paper, we generated the clusters with the help of VITS. Here we provide examples of different clusters. As can be heard, the diversity of samples within a cluster is not caused by a mismatch in phrase, speaker, or phoneme duration. WavLM features tend to preserve this cluster structure, whereas the L2 distance, for instance, usually does not.

Cluster 1 Cluster 2 Cluster 3 Cluster 4
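One way to make "preserving the cluster structure" concrete is to compare distances within a cluster to distances between clusters under a given feature representation. The sketch below is illustrative only and is not the criterion used in the paper: `separation_ratio` is a hypothetical helper, and the toy 2-D vectors are made-up stand-ins for features that would in practice come from an extractor such as WavLM or from raw waveforms.

```python
import math
from itertools import combinations


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def separation_ratio(clusters):
    """Mean within-cluster distance divided by mean between-cluster distance.

    `clusters` is a list of clusters, each a list of equal-length feature
    vectors. Values well below 1 mean the clusters are compact and separated.
    """
    within = [
        math.dist(a, b)
        for cluster in clusters
        for a, b in combinations(cluster, 2)
    ]
    between = [
        math.dist(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    ]
    return mean(within) / mean(between)


# Toy stand-ins for features (hypothetical values, not real data).
# `structured`: samples of the same cluster lie close together.
structured = [[(0.0, 0.0), (0.1, 0.0)], [(5.0, 5.0), (5.0, 5.1)]]
# `mixed`: the same points, but the representation scatters each cluster.
mixed = [[(0.0, 0.0), (5.0, 5.0)], [(0.1, 0.0), (5.0, 5.1)]]

if __name__ == "__main__":
    print("structured:", round(separation_ratio(structured), 3))
    print("mixed:     ", round(separation_ratio(mixed), 3))
```

A representation that keeps samples of one cluster close together yields a small ratio, as in `structured`; one that does not yields a ratio near or above 1, as in `mixed`.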

Data

To facilitate a fair comparison with our work, we provide samples from five datasets enhanced by our model:

BibTeX

@misc{babaev2024finallyfastuniversalspeech,
          title={FINALLY: fast and universal speech enhancement with studio-like quality}, 
          author={Nicholas Babaev and Kirill Tamogashev and Azat Saginbaev and Ivan Shchekotov and Hanbin Bae and Hosang Sung and WonJun Lee and Hoon-Young Cho and Pavel Andreev},
          year={2024},
          eprint={2410.05920},
          archivePrefix={arXiv},
          primaryClass={cs.SD},
          url={https://arxiv.org/abs/2410.05920}, 
      }