Utterance restoration is an automated voice processing task whose goal is to recreate high-fidelity speech from imperfect recordings affected by diverse distortions. In recent years, generative diffusion models have proven remarkably effective in this domain, demonstrating leading performance on various benchmarks. However, their computational demands make them impractical on edge devices or in real-time scenarios. In this paper we introduce LAFUFU, a novel approach to the utterance restoration problem that leverages latent-space acoustic representations. Rather than working directly with raw audio inputs, our method operates on compact, information-dense features extracted by a dedicated pretrained encoder network. By doing so, we achieve multifold improvements in model inference speed without compromising output integrity. We also show that, under an equivalent time constraint, LAFUFU produces higher-quality restored utterances than classical non-latent alternatives, as evidenced by its competitive performance on the EARS-WHAM and EARS-Reverb frontier benchmarks. These results highlight representation learning as a key enabler for unlocking the potential of generative diffusion in audio applications and suggest that further progress is achievable along this research avenue.
Listen to LAFUFU's speech restoration results across different speakers and distortion types
Speech restoration examples with reverberation distortion
Speech restoration examples with background noise distortion
@article{lafufu2025,
title={LAFUFU: Latent Acoustic Features for Ultra-Fast Utterance Restoration},
author={Rados{\l}aw {\L}azarz and Mateusz Wosik and Miko{\l}aj Pudo and Urszula Krywalska and Adam Cie{\'s}lak},
year={2025},
url={https://github.com/SamsungLabs/LAFUFU}
}