Syllabus and Presenters
Please note that the start and end times of each lecture and hands-on session are given in the Singapore time zone (UTC+8)!
Lecture 1: Introduction to neural vocoders: Tuesday, May 24, 2022, 13h00-14h00
Presenters: Junichi Yamagishi and Xin Wang
- Non-neural, signal-processing-based vocoders (a minimal Griffin-Lim sketch follows this list)
- Neural vocoders
- Fusion of neural and signal processing vocoders
- Flow models
- Diffusion models
- Hands-on: neural vocoders
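As a warm-up for the hands-on session, here is a minimal sketch of the classical, non-neural baseline mentioned above: reconstructing a waveform from a magnitude spectrogram with the Griffin-Lim algorithm via librosa. The file name and STFT parameters are placeholders, not the settings used in the session.

    # Minimal non-neural vocoder baseline: Griffin-Lim phase reconstruction.
    # "speech.wav" and the analysis parameters are illustrative placeholders.
    import librosa
    import soundfile as sf

    y, sr = librosa.load("speech.wav", sr=22050)          # mono input
    S = abs(librosa.stft(y, n_fft=1024, hop_length=256))  # magnitude spectrogram

    # Iteratively estimate the discarded phase, then invert to a waveform.
    y_hat = librosa.griffinlim(S, n_iter=60, n_fft=1024, hop_length=256)
    sf.write("speech_griffinlim.wav", y_hat, sr)

Neural vocoders replace this generic phase-reconstruction step with a learned model of the waveform itself, which is the subject of the rest of the lecture.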
Lecture 2: Neural acoustic modeling: Tuesday, May 24, 2022, 15h00-16h00
Presenters: Vassilis Tsiaras and George Kafentzis
- Sequence-to-sequence modeling with attention (an attention sketch follows this list)
- Tacotron 2
- Transformer TTS
- FastSpeech-based modeling
- Hands-on: acoustic modeling
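The scaled dot-product attention at the core of both the seq2seq models and Transformer TTS above fits in a few lines. A minimal NumPy sketch; shapes and names are illustrative and not taken from any particular toolkit:

    # Scaled dot-product attention over encoder outputs (illustrative only).
    import numpy as np

    def attention(Q, K, V):
        """Q: (T_dec, d); K, V: (T_enc, d); returns contexts of shape (T_dec, d)."""
        scores = Q @ K.T / np.sqrt(Q.shape[-1])        # alignment logits
        scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)             # softmax over encoder steps
        return w @ V                                   # weighted sum of values

    # Toy usage: 5 decoder steps attend over 8 encoder frames of width 16.
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((n, 16)) for n in (5, 8, 8))
    print(attention(Q, K, V).shape)  # (5, 16)

Tacotron 2 computes attention one decoder step at a time (with a location-sensitive variant), Transformer TTS applies multi-head attention over whole sequences, and FastSpeech replaces the learned alignment with predicted durations.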
Lecture 3: TTS Frontend Using Machine Learning: Wednesday, May 25, 2022, 13h00-14h00
Presenters: Alistair Conkie and Soumi Maiti
- Basic components of a traditional TTS frontend: pronunciation, normalization (a toy normalization sketch follows this list)
- Scalable, Multilingual Frontend
- Neural networks and transformers
- BERT
- Augmenting data with Snorkel
- Hands-on: TTS frontend
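To make the normalization step concrete, here is a deliberately toy sketch that expands two non-standard word classes (bare digit strings and a currency pattern) into spoken form. The regexes and the digit-by-digit reading are illustrative; a production frontend such as the one discussed in the lecture handles far more classes, languages, and ambiguity.

    # Toy text normalization: expand digits and "$N" into words.
    import re

    ONES = ["zero", "one", "two", "three", "four",
            "five", "six", "seven", "eight", "nine"]

    def spell_digits(number: str) -> str:
        return " ".join(ONES[int(d)] for d in number)

    def normalize(text: str) -> str:
        # "$5" -> "five dollars" (naively spelled digit by digit)
        text = re.sub(r"\$(\d+)", lambda m: spell_digits(m.group(1)) + " dollars", text)
        # remaining bare digit strings, spelled digit by digit
        return re.sub(r"\d+", lambda m: spell_digits(m.group(0)), text)

    print(normalize("Gate 3 opens at 9, tickets cost $5."))
    # Gate three opens at nine, tickets cost five dollars.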
Lecture 4: Inclusive Neural TTS Ia: Thursday, May 26, 2022, 13h00-14h00
Presenter: Malcolm Slaney
- Approaches to speeding up speech for screen reading (see the time-stretch sketch below)
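As a much simpler point of comparison for this lecture, a uniform phase-vocoder time-stretch, which speeds speech up without changing its pitch, is a single librosa call. Note that Mach1 in the pre-reading list is nonuniform, so this sketch is only a baseline, not the lecture's method; the file paths and rate are placeholders.

    # Uniform time-scale modification (phase vocoder); NOT Mach1's nonuniform method.
    import librosa
    import soundfile as sf

    y, sr = librosa.load("speech.wav", sr=None)         # placeholder input file
    y_fast = librosa.effects.time_stretch(y, rate=1.5)  # 1.5x faster, same pitch
    sf.write("speech_1.5x.wav", y_fast, sr)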
Lecture 5: Inclusive Neural TTS Ib: Thursday, May 26, 2022, 14h00-15h00
Presenter: Yutian Chen
- Custom voices and voice banking: using WaveNet to reunite speech-impaired users with their original voices
Lecture 6: Inclusive Neural TTS II: Thursday, May 26, 2022, 15h00-16h00
Presenters: Yannis Stylianou, Petko Petkov, and Shifas Padinjaru Veetil
- Speech perception in adverse listening conditions
- DSP-based solutions for improving listening in noise and for users with hearing loss (a toy compression sketch follows this list)
- Neural-network-based approaches for improving intelligibility
- End-to-end intelligibility improvement for communications
- Hands-on: intelligibility improvement
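One classical ingredient of the DSP-based solutions above is dynamic range compression, used together with spectral shaping in the Zorila et al. (2012) pre-reading. The frame-wise sketch below illustrates the idea only; every constant is arbitrary, and none of it reproduces the settings of that paper.

    # Frame-wise dynamic range compression (all constants are illustrative).
    import numpy as np

    def compress(y, sr, frame_ms=20, ratio=3.0, threshold_db=-25.0):
        frame = int(sr * frame_ms / 1000)
        out = y.astype(np.float64).copy()
        for start in range(0, len(out) - frame, frame):
            seg = out[start:start + frame]                 # view into out
            level_db = 20 * np.log10(np.sqrt(np.mean(seg ** 2)) + 1e-9)
            if level_db > threshold_db:
                # attenuate frames above threshold by the compression ratio
                gain_db = (threshold_db - level_db) * (1 - 1 / ratio)
                seg *= 10 ** (gain_db / 20)                # in-place gain
        return out

    # Toy usage: a quiet tone with a loud burst; the burst is attenuated.
    sr = 16000
    y = 0.05 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
    y[6000:8000] *= 10
    print(y.max(), compress(y, sr).max())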
Pre-reading material
- A. Conkie and A. Finch, Scalable Multilingual Frontend, in Proc. ICASSP, 2020. https://arxiv.org/abs/2004.04934v1
- Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, FastSpeech 2: Fast and High-Quality End-to-End Text to Speech, arXiv:2006.04558, 2020.
- J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. J. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, https://arxiv.org/abs/1712.05884, 2017.
- J. Kong, J. Kim, and J. Bae, HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis, in Proc. NeurIPS, 2020, vol. 33, pp. 17022–17033.
- J.-M. Valin and J. Skoglund, LPCNet: Improving Neural Speech Synthesis Through Linear Prediction, in Proc. ICASSP, 2019, pp. 5891–5895.
- N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan, WaveGrad: Estimating gradients for waveform generation, Proc. International Conference on Learning Representations, 2021.
- R. Yamamoto, E. Song, and J.-M. Kim, Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram, in Proc. ICASSP, 2020, pp. 6199–6203.
- A. van den Oord et al., WaveNet: A generative model for raw audio, arXiv:1609.03499, 2016.
- M. Covell, M. Withgott, and M. Slaney, Mach1: Nonuniform Time-Scale Modification of Speech, in Proc. ICASSP, Seattle, WA, May 1998.
- Y. Chen, Y. Assael, B. Shillingford, D. Budden, S. Reed, H. Zen, Q. Wang, L. C. Cobo, A. Trask, B. Laurie, and C. Gulcehre, Sample Efficient Adaptive Text-to-Speech, in Proc. International Conference on Learning Representations, 2019.
- T.-C. Zorila, V. Kandia, and Y. Stylianou, Speech-in-noise intelligibility improvement based on spectral shaping and dynamic range compression, in Proc. Interspeech, 2012.
- M. P. V. Shifas, C. Zorilă, and Y. Stylianou, End-to-End Neural Based Modification of Noisy Speech for Speech-in-Noise Intelligibility Improvement, in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 162–173, 2022.
- P. N. Petkov and W. B. Kleijn, Spectral Dynamics Recovery for Enhanced Speech Intelligibility in Noise, in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 2, pp. 327-338, Feb. 2015.
- P. N. Petkov and Y. Stylianou, Adaptive Gain Control for Enhanced Speech Intelligibility Under Reverberation, in IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1434-1438, Oct. 2016.