TTS Style Conversion

Enhancing Speech Intelligibility in Text-To-Speech Synthesis using Speaking Style Conversion

Dipjyoti Paul, Muhammed PV Shifas, Yannis Pantazis and Yannis Stylianou

Abstract: The increased adoption of digital assistants makes text-to-speech (TTS) synthesis systems an indispensable feature of modern mobile devices. It is hence desirable to build a system capable of generating highly intelligible speech in the presence of noise. Past studies have investigated style conversion in TTS synthesis, yet degraded synthesis quality often leads to worse intelligibility. To overcome such limitations, we propose a novel transfer learning approach using Tacotron and WaveRNN based TTS synthesis. The proposed speech system exploits two modification strategies: (a) Lombard speaking style data and (b) Spectral Shaping and Dynamic Range Compression (SSDRC), which has been shown to provide high intelligibility gains by redistributing the signal energy in the time-frequency domain. We refer to this extension as the Lombard-SSDRC TTS system. Intelligibility enhancement, as quantified by the Intelligibility in Bits (SIIB-Gauss) measure, shows that the proposed Lombard-SSDRC TTS system achieves a significant relative improvement of between 110% and 130% in speech-shaped noise (SSN) and between 47% and 140% in competing-speaker noise (CSN) over the state-of-the-art TTS approach. Additional subjective evaluation shows that Lombard-SSDRC TTS successfully increases speech intelligibility, with relative improvements of 455% for SSN and 104% for CSN in median keyword correction rate compared to the baseline TTS method.
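To make the percentages in the abstract easier to read, the following is a minimal sketch of how a relative improvement is conventionally computed. It assumes the usual definition, (proposed - baseline) / baseline; the function name and the example scores are hypothetical and are not taken from the paper's data.

```python
def relative_improvement(baseline: float, proposed: float) -> float:
    """Return the relative improvement of `proposed` over `baseline`, in percent.

    Assumes the conventional definition 100 * (proposed - baseline) / baseline.
    """
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return 100.0 * (proposed - baseline) / baseline


# Hypothetical scores chosen only to illustrate the arithmetic:
# a 455% relative improvement means the proposed score is 5.55x the baseline.
example = relative_improvement(baseline=10.0, proposed=55.5)
print(f"{example:.0f}% relative improvement")
```

Under this definition, a 100% relative improvement corresponds to doubling the baseline score, so the reported gains above 100% indicate more than a twofold increase.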

Audio samples of different styles:

Original	Original Lombard	TTS	Lombard TTS [1]	Lombard TTS (ours)	SSDRC TTS	Lombard-SSDRC TTS

Audio samples of different styles in TTS under speech-shaped noise (-5 dB):

TTS	Lombard TTS [1]	Lombard TTS (ours)	Lombard-SSDRC TTS

Audio samples of different styles in TTS under competing-speaker noise (-14 dB):

TTS	Lombard TTS [1]	Lombard TTS (ours)	Lombard-SSDRC TTS
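
As a rough illustration of what the noise levels above mean, the sketch below mixes a speech signal with a noise signal at a target level, assuming the dB values denote signal-to-noise ratio (SNR) computed from mean power over the whole signal. This is not the authors' mixing pipeline; the function names and the synthetic test signals are hypothetical.

```python
import math
import random


def mean_power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech-to-noise power ratio equals `snr_db`, then add.

    Returns (mixture, scaled_noise). Assumes SNR is defined on mean power
    over the full length of the speech signal.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]
    p_speech, p_noise = mean_power(speech), mean_power(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("signals must have non-zero power")
    # Gain g such that p_speech / (g^2 * p_noise) = 10^(snr_db / 10)
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    scaled = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled)]
    return mixture, scaled


# Synthetic stand-ins for real recordings: a sine tone and Gaussian noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

mixture, scaled_noise = mix_at_snr(speech, noise, snr_db=-5.0)
achieved = 10 * math.log10(mean_power(speech) / mean_power(scaled_noise))
print(f"achieved SNR: {achieved:.2f} dB")
```

At negative SNRs such as these, the noise carries more power than the speech, which is why intelligibility differences between the systems become audible.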

Reference:

[1] Bajibabu Bollepalli, Lauri Juvela, and Paavo Alku, "Lombard speech synthesis using transfer learning in a Tacotron text-to-speech system," in Proc. Interspeech, 2019, pp. 2833-2837.

*ENRICH has received funding from the EU H2020 research and innovation programme under the MSCA GA 675324