
Icassp18-2

ON THE USE OF WAVENET AS A STATISTICAL VOCODER

Nagaraj Adiga, Vassilis Tsiaras, Yannis Stylianou
Abstract - In this paper, we explore the possibility of using the WaveNet architecture as a statistical vocoder. In that case, the generation of speech waveforms is locally conditioned only by acoustic features. Focusing for now on the single-speaker case, we investigate the impact of the local conditions as well as that of the amount of data available for training. Furthermore, variations of the WaveNet architecture are considered and discussed in the context of our work. We compare our approach with a very recent study that also used the WaveNet architecture as a speech vocoder, using the same speech data. More specifically, we used two female and two male speakers from the CMU-ARCTIC database to contrast the use of cepstrum coefficients and filter-bank features as local conditioners, with the goal of improving the overall quality for both male and female speakers. In the paper we also discuss the impact of the size of the training data. Objective metrics for the quality and intelligibility of the speech generated by WaveNet, as well as subjective tests, support our suggestions.
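
To illustrate what "locally conditioned" means in the abstract, here is a minimal sketch of the gated activation from the original WaveNet formulation, z = tanh(W_f * x + V_f * h) ⊙ σ(W_g * x + V_g * h), where h is the acoustic feature sequence brought up to the sample rate. This is not the code used in our experiments: the function names, the repetition-based upsampling, and the toy weights are assumptions made for illustration, and each convolution is reduced to a dot product over a single receptive field.

```python
import math


def sigmoid(v):
    """Logistic function used for the gate."""
    return 1.0 / (1.0 + math.exp(-v))


def dot(weights, vector):
    """Dot product of two equal-length sequences of floats."""
    return sum(w * v for w, v in zip(weights, vector))


def upsample_by_repetition(frames, hop):
    """Repeat each frame-level feature vector `hop` times so that there is
    one conditioning vector per waveform sample (assumed upsampling scheme)."""
    return [frame for frame in frames for _ in range(hop)]


def gated_unit(x_t, h_t, w_f, w_g, v_f, v_g):
    """One output channel of a locally conditioned gated activation.

    x_t : past waveform values seen by the (dilated) convolution at time t
    h_t : acoustic feature vector aligned with time t (e.g. MCEPS+BAP+F0)
    w_f, w_g : filter and gate weights applied to x_t
    v_f, v_g : filter and gate weights applied to the conditioning h_t
    """
    filt = math.tanh(dot(w_f, x_t) + dot(v_f, h_t))
    gate = sigmoid(dot(w_g, x_t) + dot(v_g, h_t))
    return filt * gate


# Toy usage: two feature frames, three samples per frame (hypothetical values).
frames = [[0.2, -0.1], [0.5, 0.3]]
conditioning = upsample_by_repetition(frames, hop=3)
receptive_field = [0.0, 0.1]  # two past samples, hypothetical
outputs = [
    gated_unit(receptive_field, h_t,
               w_f=[0.5, -0.5], w_g=[0.3, 0.3],
               v_f=[1.0, 0.0], v_g=[0.0, 1.0])
    for h_t in conditioning
]
```

Because the waveform context is held fixed in this toy usage, the output changes only when the conditioning frame changes, which is the sense in which the acoustic features steer the generated samples.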

Examples of synthesized waveforms from our experiments on data size variations and local conditioning:
A) Data size variation tested for SLT speaker:

Original 80 sentences 160 sentences 320 sentences 640 sentences 1050 sentences

B) Local conditioning experiments using MCEPS+BAP+F0 and MFBANK+F0 features:
1) SLT Female speaker

Original MCEPS+BAP+F0 MFBANK+F0

2) BDL Male speaker

Original MCEPS+BAP+F0 MFBANK+F0

3) CLB Female speaker

Original MCEPS+BAP+F0 MFBANK+F0

4) RMS Male speaker

Original MCEPS+BAP+F0 MFBANK+F0
  • For any clarification, do not hesitate to contact me.
Page last modified on August 27, 2019, at 12:04 PM