Abstract: In this paper, we explore the FFTNet architecture for the speech enhancement task. FFTNet is an end-to-end model, initially proposed for speech synthesis, that operates in the waveform domain. We propose a non-causal dilated convolution extension of FFTNet for speech enhancement. With a dilation factor of 2, the current sample in each layer is estimated from the immediate past and future samples of the previous layer. Conventional WaveNet uses an increasing dilation factor from layer to layer. In contrast, the proposed method uses a decreasing dilation factor starting from 512, so that faraway samples of the noisy speech are used when extracting feature representations in the initial layers. This may be the better choice given the stationarity of noise compared to the quasi-stationarity of speech. We investigated the impact of the proposed architecture on a dataset consisting of 28 speakers, 10 noise types, and 4 signal-to-noise ratios. We compare our work against recent waveform-domain speech enhancement algorithms such as SEGAN and WaveNet. The proposed model considerably reduces the number of model parameters: by 32% compared to the WaveNet architecture and by 87% compared to SEGAN. At the same time, subjective and objective metrics confirm that the proposed method outperforms WaveNet and SEGAN for speech enhancement. A TensorFlow implementation is provided here.
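The non-causal dilated scheme described above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the per-tap weights, the layer count, and the uniform halving schedule from 512 down to 1 are illustrative assumptions; in the actual model the taps are learned convolution weights followed by nonlinearities.

```python
import numpy as np

def noncausal_dilated_layer(x, d, w_past=0.25, w_center=0.5, w_future=0.25):
    """One non-causal dilated layer with dilation d: each output sample
    combines the sample d steps in the past, the current sample, and the
    sample d steps in the future (zero-padded at the signal edges).
    The fixed weights here are illustrative; in the model they are learned."""
    past = np.pad(x, (d, 0))[:len(x)]   # past[n]   = x[n - d] (0 near the start)
    future = np.pad(x, (0, d))[d:]      # future[n] = x[n + d] (0 near the end)
    return w_past * past + w_center * x + w_future * future

# Decreasing dilation schedule starting from 512, halving at each layer,
# so the first layers already see faraway context of the noisy input.
dilations = [512 // (2 ** i) for i in range(10)]  # [512, 256, ..., 2, 1]

x = np.random.randn(2048).astype(np.float32)  # a noisy waveform segment
h = x
for d in dilations:
    h = noncausal_dilated_layer(h, d)  # output length equals input length
```

With kernel taps at n-d, n, and n+d in every layer, stacking this schedule gives each output sample a symmetric receptive field of 1 + 2·(512 + 256 + … + 1) = 2047 input samples.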
Acknowledgment: This work was funded by the E.U. Horizon2020 Grant Agreement 675324, Marie Sklodowska-Curie Innovative Training Network, ENRICH.