Listening Test | Main / Icassp18-2

ON THE USE OF WAVENET AS A STATISTICAL VOCODER

Nagaraj Adiga, Vassilis Tsiaras, Yannis Stylianou
Abstract - In this paper, we explore the possibility of using the WaveNet architecture as a statistical vocoder. In that case, the generation of speech waveforms is locally conditioned only on acoustic features. Focusing for the moment on the single-speaker case, we investigate the impact of the local conditioning features as well as that of the amount of data available for training. Furthermore, variations of the WaveNet architecture are considered and discussed in the context of our work. We compare our work against a very recent study that also used the WaveNet architecture as a speech vocoder, using the same speech data. More specifically, we used two female and two male speakers from the CMU-ARCTIC database to contrast the use of cepstrum coefficients and filter-bank features as local conditioners, with the goal of improving the overall quality for both male and female speakers. In the paper we also discuss the impact of the size of the training data. Objective metrics for the quality and intelligibility of the speech generated by WaveNet, as well as subjective tests, support our suggestions.
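
For readers unfamiliar with local conditioning, the following is a minimal sketch of the idea, not the implementation used in our experiments. It follows the gated activation of the original WaveNet formulation, z = tanh(W_f * x + V_f * y) ⊙ σ(W_g * x + V_g * y), where y is the acoustic feature sequence upsampled to the waveform sample rate. The function names, the upsampling by plain repetition, and the toy values are assumptions made for illustration only; the convolution and projection outputs are passed in as precomputed numbers.

```python
import math


def upsample_by_repetition(frames, hop):
    """Repeat each frame-level feature vector `hop` times (one vector per sample).

    Assumption: simple repetition; a learned upsampling layer is another option.
    """
    return [frame for frame in frames for _ in range(hop)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_unit(conv_filter, conv_gate, cond_filter, cond_gate):
    """Per-channel gated activation with a local conditioning term.

    conv_* : outputs of the dilated convolution on the waveform (W * x)
    cond_* : projections of the upsampled acoustic features (V * y)
    """
    return [
        math.tanh(a + c) * sigmoid(b + d)
        for a, b, c, d in zip(conv_filter, conv_gate, cond_filter, cond_gate)
    ]


# Toy usage: two frames of 2-dimensional features, 3 samples per frame.
frames = [[0.1, 0.2], [0.3, 0.4]]
per_sample = upsample_by_repetition(frames, hop=3)

# Hypothetical values: the conditioning projection is taken as the identity here.
z = gated_unit([0.0, 1.0], [0.0, 0.0], per_sample[0], per_sample[0])
```

In this sketch the conditioning features shift both the filter and the gate at every sample, which is how acoustic features such as MCEPS+BAP+F0 or MFBANK+F0 can steer the generated waveform.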

Examples of synthesized waveforms from our experiments on data size variations and local conditioning:
A) Data size variation, tested on the SLT speaker:

Original	80 sentences	160 sentences	320 sentences	640 sentences	1050 sentences

B) Local conditioning experiments using MCEPS+BAP+F0 and MFBANK+F0 features:
1) SLT Female speaker

Original	MCEPS+BAP+F0	MFBANK+F0

2) BDL Male speaker

Original	MCEPS+BAP+F0	MFBANK+F0

3) CLB Female speaker

Original	MCEPS+BAP+F0	MFBANK+F0

4) RMS Male speaker

Original	MCEPS+BAP+F0	MFBANK+F0

If anything needs clarification, please do not hesitate to contact me.