Model Architecture
The model is a neural network with six components: WavLM-large, SpectralUNet, Upsampler, WaveUNet, SpectralMaskNet, and Upsample WaveUNet.
SpectralUNet is responsible for initial preprocessing of audio in the spectral domain using two-dimensional convolutions.
Additionally, the SSL features obtained from WavLM-large are added to the spectral features.
The Upsampler is a HiFi-GAN generator-based module that increases the temporal resolution of the input tensor, mapping it to the waveform domain.
WaveUNet performs post-processing in the waveform domain and improves the output of the Upsampler by incorporating phase information gleaned directly from the raw input waveform.
SpectralMaskNet then performs spectrum-based post-processing to remove any artifacts that remain after WaveUNet.
In this way, the model alternates between the time and frequency domains, which allows for effective audio restoration.
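The order in which the components are chained can be summarised in a short Python sketch. It only illustrates the data flow described above and is not the actual implementation: the function name, the dictionary keys, and the call signatures are hypothetical. It further assumes that the SSL features are added after SpectralUNet and that SpectralMaskNet takes and returns a waveform, computing its spectral transform internally.

```python
from typing import Any, Callable, Dict

Module = Callable[..., Any]


def restore(waveform: Any, spectrogram: Any, modules: Dict[str, Module]) -> Any:
    """Chain the six components in the order given in the text.

    `modules` maps hypothetical names to callables; the names and call
    signatures are illustrative assumptions, not the original API.
    """
    # SSL features from WavLM-large, computed from the raw input waveform.
    ssl_features = modules["wavlm"](waveform)
    # Initial preprocessing in the spectral domain.
    spectral_features = modules["spectral_unet"](spectrogram)
    # Assumption: the SSL features are added after SpectralUNet.
    fused = modules["add"](spectral_features, ssl_features)
    # Spectral domain -> waveform domain.
    coarse_wave = modules["upsampler"](fused)
    # Waveform-domain post-processing that also sees the raw input waveform.
    refined_wave = modules["wave_unet"](coarse_wave, waveform)
    # Spectrum-based post-processing (assumed waveform in, waveform out).
    cleaned_wave = modules["spectral_mask_net"](refined_wave)
    # Final learnable upsampling of the sampling rate.
    return modules["upsample_wave_unet"](cleaned_wave)


def trace_stub(name: str) -> Module:
    """Return a stand-in module that records its name and its inputs as text."""
    return lambda *inputs: f"{name}({', '.join(inputs)})"


STAGE_NAMES = (
    "wavlm", "spectral_unet", "add", "upsampler",
    "wave_unet", "spectral_mask_net", "upsample_wave_unet",
)
stubs = {name: trace_stub(name) for name in STAGE_NAMES}
# Prints the nested call structure, innermost stage first.
print(restore("wav", "spec", stubs))
```

Running the sketch with the string stubs prints the nesting of the calls, which makes the alternation visible: spectral processing, then waveform, then spectral again, then waveform.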
Finally, the Upsample WaveUNet is a learnable upsampler of the signal sampling rate: a WaveUNet with an additional convolutional upsampling block in the decoder that increases the temporal resolution by a factor of 3.
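A factor of 3 means that every input sample yields three output samples, so, for instance, a 16 kHz signal would come out at 48 kHz. The sketch below shows one way a convolutional block can achieve this, assuming a transposed convolution with stride 3 and kernel length 3. The text does not specify the layer type, so this choice, the function name, and the kernel values are illustrative assumptions. In a trained model the kernel weights would be learned; here they are fixed to ones, which reduces the operation to repeating each sample three times.

```python
from typing import List, Sequence


def transposed_conv1d(
    x: Sequence[float], kernel: Sequence[float], stride: int
) -> List[float]:
    """Single-channel 1-D transposed convolution without padding.

    The output length is (len(x) - 1) * stride + len(kernel).
    """
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, sample in enumerate(x):
        for j, weight in enumerate(kernel):
            # Each input sample spreads a scaled copy of the kernel
            # into the output, starting at position i * stride.
            out[i * stride + j] += sample * weight
    return out


FACTOR = 3
# Hypothetical fixed kernel; a learnable block would train these weights.
kernel = [1.0, 1.0, 1.0]
low_rate = [0.5, -1.0, 2.0]
high_rate = transposed_conv1d(low_rate, kernel, stride=FACTOR)

# With kernel length equal to the stride, the output is exactly 3x longer.
assert len(high_rate) == FACTOR * len(low_rate)
print(high_rate)
```

Choosing a kernel length equal to the stride keeps the output length at exactly three times the input length. Longer kernels would overlap between neighbouring samples and require padding or cropping to preserve that ratio.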