Syllabus and Presenters
Please note that the start and end times of each lecture and hands-on session are given in the Singapore time zone (UTC+8)!
Lecture 1: Introduction to neural vocoders: Tuesday, May 24, 2022, 13h00-14h00
Presenters: Junichi Yamagishi and Xin Wang
- Non-neural, signal-processing-based vocoders (a minimal Griffin-Lim sketch follows this list)
- Neural vocoders
- Fusion of neural and signal processing vocoders
- Flow models
- Diffusion models
- Hands-on: neural vocoders
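As a warm-up for the hands-on session, here is a minimal sketch of the classical, non-neural baseline mentioned above: reconstructing a waveform from a magnitude spectrogram with the Griffin-Lim algorithm via librosa. The file name and STFT parameters are placeholders, not the settings used in the session.

    # Minimal non-neural vocoder baseline: Griffin-Lim phase reconstruction.
    # "speech.wav" and the analysis parameters are illustrative placeholders.
    import librosa
    import soundfile as sf

    y, sr = librosa.load("speech.wav", sr=22050)          # mono input
    S = abs(librosa.stft(y, n_fft=1024, hop_length=256))  # magnitude spectrogram

    # Iteratively estimate the discarded phase, then invert to a waveform.
    y_hat = librosa.griffinlim(S, n_iter=60, n_fft=1024, hop_length=256)
    sf.write("speech_griffinlim.wav", y_hat, sr)

Neural vocoders replace this generic phase-reconstruction step with a learned model of the waveform itself, which is the subject of the rest of the lecture.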
Lecture 2: Neural acoustic modeling: Tuesday, May 24, 2022, 15h00-16h00
Presenters: Vassilis Tsiaras and George Kafentzis
- Sequence-to-sequence modeling with attention (an attention sketch follows this list)
- Tacotron 2
- Transformer TTS
- FastSpeech-based modeling
- Hands-on: acoustic modeling
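The scaled dot-product attention at the core of both the seq2seq models and Transformer TTS above fits in a few lines. A minimal NumPy sketch; shapes and names are illustrative and not taken from any particular toolkit:

    # Scaled dot-product attention over encoder outputs (illustrative only).
    import numpy as np

    def attention(Q, K, V):
        """Q: (T_dec, d); K, V: (T_enc, d); returns contexts of shape (T_dec, d)."""
        scores = Q @ K.T / np.sqrt(Q.shape[-1])        # alignment logits
        scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)             # softmax over encoder steps
        return w @ V                                   # weighted sum of values

    # Toy usage: 5 decoder steps attend over 8 encoder frames of width 16.
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((n, 16)) for n in (5, 8, 8))
    print(attention(Q, K, V).shape)  # (5, 16)

Tacotron 2 computes attention one decoder step at a time (with a location-sensitive variant), Transformer TTS applies multi-head attention over whole sequences, and FastSpeech replaces the learned alignment with predicted durations.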
Lecture 3: TTS Frontend Using Machine Learning: Wednesday, May 25, 2022, 13h00-14h00
Presenters: Alistair Conkie and Soumi Maiti
- Basic components of a traditional TTS frontend: pronunciation, normalization (a toy normalization sketch follows this list)
- Scalable, Multilingual Frontend
- Neural networks and transformers
- BERT
- Augmenting data with Snorkel
- Hands-on: TTS frontend
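To make the normalization step concrete, here is a deliberately toy sketch that expands two non-standard word classes (bare digit strings and a currency pattern) into spoken form. The regexes and the digit-by-digit reading are illustrative; a production frontend such as the one discussed in the lecture handles far more classes, languages, and ambiguity.

    # Toy text normalization: expand digits and "$N" into words.
    import re

    ONES = ["zero", "one", "two", "three", "four",
            "five", "six", "seven", "eight", "nine"]

    def spell_digits(number: str) -> str:
        return " ".join(ONES[int(d)] for d in number)

    def normalize(text: str) -> str:
        # "$5" -> "five dollars" (naively spelled digit by digit)
        text = re.sub(r"\$(\d+)", lambda m: spell_digits(m.group(1)) + " dollars", text)
        # remaining bare digit strings, spelled digit by digit
        return re.sub(r"\d+", lambda m: spell_digits(m.group(0)), text)

    print(normalize("Gate 3 opens at 9, tickets cost $5."))
    # Gate three opens at nine, tickets cost five dollars.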
Lecture 4: Inclusive Neural TTS Ia: Thursday, May 26, 2022, 13h00-14h00
Presenter: Malcolm Slaney
- Approaches to speeding up speech for screen reading (see the time-stretch sketch below)
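As a much simpler point of comparison for this lecture, a uniform phase-vocoder time-stretch, which speeds speech up without changing its pitch, is a single librosa call. Note that Mach1 in the pre-reading list is nonuniform, so this sketch is only a baseline, not the lecture's method; the file paths and rate are placeholders.

    # Uniform time-scale modification (phase vocoder); NOT Mach1's nonuniform method.
    import librosa
    import soundfile as sf

    y, sr = librosa.load("speech.wav", sr=None)         # placeholder input file
    y_fast = librosa.effects.time_stretch(y, rate=1.5)  # 1.5x faster, same pitch
    sf.write("speech_1.5x.wav", y_fast, sr)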
Lecture 5: Inclusive Neural TTS Ib: Thursday, May 26, 2022, 14h00-15h00
Presenter: Yutian Chen
- Custom voices and voice banking: using WaveNet to reunite speech-impaired users with their original voices
Lecture 6: Inclusive Neural TTS II: Thursday, May 26, 2022, 15h00-16h00
Presenters: Yannis Stylianou, Petko Petkov, and Shifas Padinjaru Veetil
- Speech perception in adverse listening conditions
- DSP-based solutions for improving listening in noise and for users with hearing loss (a toy compression sketch follows this list)
- Neural-network-based approaches for improving intelligibility
- End-to-end intelligibility improvement for communications
- Hands-on: intelligibility improvement
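One classical ingredient of the DSP-based solutions above is dynamic range compression, used together with spectral shaping in the Zorila et al. (2012) pre-reading. The frame-wise sketch below illustrates the idea only; every constant is arbitrary, and none of it reproduces the settings of that paper.

    # Frame-wise dynamic range compression (all constants are illustrative).
    import numpy as np

    def compress(y, sr, frame_ms=20, ratio=3.0, threshold_db=-25.0):
        frame = int(sr * frame_ms / 1000)
        out = y.astype(np.float64).copy()
        for start in range(0, len(out) - frame, frame):
            seg = out[start:start + frame]                 # view into out
            level_db = 20 * np.log10(np.sqrt(np.mean(seg ** 2)) + 1e-9)
            if level_db > threshold_db:
                # attenuate frames above threshold by the compression ratio
                gain_db = (threshold_db - level_db) * (1 - 1 / ratio)
                seg *= 10 ** (gain_db / 20)                # in-place gain
        return out

    # Toy usage: a quiet tone with a loud burst; the burst is attenuated.
    sr = 16000
    y = 0.05 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
    y[6000:8000] *= 10
    print(y.max(), compress(y, sr).max())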
Pre-reading material
- A. Conkie and A. Finch, Scalable Multilingual Frontend, in Proc. ICASSP, 2020. https://arxiv.org/abs/2004.04934v1
- Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, FastSpeech 2: Fast and High-Quality End-to-End Text to Speech, arXiv:2006.04558, 2020.
- J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. J. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, https://arxiv.org/abs/1712.05884, 2017.
- J. Kong, J. Kim, and J. Bae, HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis, in Proc. NeurIPS, 2020, vol. 33, pp. 17022–17033.
- J.-M. Valin and J. Skoglund, LPCNet: Improving Neural Speech Synthesis Through Linear Prediction, in Proc. ICASSP, 2019, pp. 5891–5895.
- N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan, WaveGrad: Estimating gradients for waveform generation, Proc. International Conference on Learning Representations, 2021.
- R. Yamamoto, E. Song, and J.-M. Kim, Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram, in Proc. ICASSP, 2020, pp. 6199–6203.
- A. van den Oord et al., WaveNet: A generative model for raw audio, arXiv:1609.03499, 2016.
- M. Covell, M. Withgott, and M. Slaney, Mach1: Nonuniform Time-Scale Modification of Speech, in Proc. ICASSP, Seattle, WA, May 1998.
- Y. Chen, Y. Assael, B. Shillingford, D. Budden, S. Reed, H. Zen, Q. Wang, L. C. Cobo, A. Trask, B. Laurie, and C. Gulcehre, Sample Efficient Adaptive Text-to-Speech, in Proc. International Conference on Learning Representations, 2019.
- T.-C. Zorila, V. Kandia, and Y. Stylianou, Speech-in-noise intelligibility improvement based on spectral shaping and dynamic range compression, in Proc. Interspeech, 2012.
- M. P. V. Shifas, C. Zorilă, and Y. Stylianou, End-to-End Neural Based Modification of Noisy Speech for Speech-in-Noise Intelligibility Improvement, in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 162–173, 2022.
- P. N. Petkov and W. B. Kleijn, Spectral Dynamics Recovery for Enhanced Speech Intelligibility in Noise, in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 2, pp. 327-338, Feb. 2015.
- P. N. Petkov and Y. Stylianou, Adaptive Gain Control for Enhanced Speech Intelligibility Under Reverberation, in IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1434-1438, Oct. 2016.