2022 Speech Processing Courses in Crete
Inclusive Neural Speech Synthesis

22-27 May 2022    Singapore @ ICASSP22

PROGRAMME

Syllabus and Presenters


Please note that the start and end times of each lecture and hands-on session are given in the Singapore time zone (UTC+8)!


Lecture 1: Introduction to Neural Vocoders: Tuesday, 24 May 2022, 13h00-14h00
Presenters: Junichi Yamagishi and Xin Wang
  • Non-neural, signal-processing-based vocoders
  • Neural vocoders
  • Fusion of neural and signal processing vocoders
  • Flow models
  • Diffusion models
Hands-on session: Tuesday, 24 May 2022, 14h00-15h00
  • Hands on neural vocoders (see the sketch below)
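
As a warm-up for the hands-on session, the snippet below is a minimal sketch of the non-neural baseline discussed in the lecture: analysis to an 80-band mel spectrogram followed by Griffin-Lim resynthesis with librosa (assumed installed; the input file name is hypothetical). A trained neural vocoder from the pre-reading (WaveNet, HiFi-GAN, WaveGrad) would replace the inversion step.

    import librosa
    import soundfile as sf

    # Analysis: waveform -> 80-band mel spectrogram (typical TTS settings).
    y, sr = librosa.load("speech.wav", sr=22050)  # hypothetical input file
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=80)

    # Synthesis: approximate mel inversion plus Griffin-Lim phase estimation.
    # This is the step a trained neural vocoder would replace.
    y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024,
                                                 hop_length=256)
    sf.write("resynthesized.wav", y_hat, sr)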

Lecture 2: Neural Acoustic Modeling: Tuesday, 24 May 2022, 15h00-16h00
Presenters: Vassilis Tsiaras and George Kafentzis
  • Sequence-to-sequence models with attention
    • Tacotron 2
    • Transformer TTS
  • FastSpeech-based modeling
Hands-on session: Tuesday, 24 May 2022, 16h00-17h00
  • Hands on acoustic modeling (see the sketch below)
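
To make the FastSpeech bullet concrete ahead of the hands-on, here is a minimal sketch (in PyTorch, assumed for the exercises) of the length regulator that turns phoneme-level encoder states into frame-level decoder inputs by repeating each state according to a predicted duration; all shapes and values are illustrative.

    import torch

    def length_regulate(encoder_out, durations):
        # encoder_out: (num_phonemes, hidden_dim) phoneme-level states
        # durations:   (num_phonemes,) integer frame counts per phoneme
        # Repeat each phoneme state durations[i] times along the time axis.
        return torch.repeat_interleave(encoder_out, durations, dim=0)

    enc = torch.randn(5, 256)            # 5 phonemes, 256-dim encoder states
    dur = torch.tensor([3, 7, 2, 5, 4])  # predicted frames per phoneme
    frames = length_regulate(enc, dur)   # -> (21, 256) frame-level sequence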

Lecture 3: TTS Frontend Using Machine Learning: Wednesday, 25 May 2022, 13h00-14h00
Presenters: Alistair Conkie and Soumi Maiti
  • Basic components of a traditional TTS frontend: pronunciation and text normalization
  • A scalable, multilingual frontend
  • Neural networks and transformers
  • BERT
  • Data augmentation with Snorkel
Hands-on session: Wednesday, 25 May 2022, 14h00-15h00
  • Hands on TTS frontend (see the sketch below)
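
As a preview of the hands-on, the sketch below is a toy rule-based normalizer; the lookup tables and example sentence are made up. Its built-in failure mode, "St." meaning either "street" or "saint", is exactly the kind of ambiguity the lecture's ML- and BERT-based frontends are meant to resolve.

    # Toy lookup tables; a real frontend uses large, language-specific resources.
    ABBREV = {"dr.": "doctor", "mr.": "mister", "st.": "street"}
    DIGITS = "zero one two three four five six seven eight nine".split()

    def normalize(text):
        words = []
        for tok in text.lower().split():
            if tok in ABBREV:
                # A static table cannot disambiguate "St." (street vs. saint);
                # that is where learned, context-aware frontends come in.
                words.append(ABBREV[tok])
            elif tok.isdigit():
                # Digit-by-digit expansion; real systems parse full number
                # grammars, dates, currencies, and so on.
                words.extend(DIGITS[int(d)] for d in tok)
            else:
                words.append(tok)
        return " ".join(words)

    print(normalize("Dr. Smith lives at 221 Baker St."))
    # -> doctor smith lives at two two one baker street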

Lecture 4: Inclusive Neural TTS Ia: Thursday, 26 May 2022, 13h00-14h00
Presenter: Malcolm Slaney
  • Approaches to speeding up speech for screen reading (see the sketch below)
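
As a point of reference for the lecture, the snippet below (file names hypothetical, librosa assumed) applies a uniform 2x phase-vocoder speed-up; Mach1 from the pre-reading improves on this by compressing pauses and steady-state regions more aggressively than transient, information-dense ones.

    import librosa
    import soundfile as sf

    y, sr = librosa.load("screen_reader_output.wav", sr=None)
    # Uniform time-scale modification: every segment is sped up equally.
    fast = librosa.effects.time_stretch(y, rate=2.0)
    sf.write("double_speed.wav", fast, sr)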

Lecture 5: Inclusive Neural TTS Ib: Thursday, 26 May 2022, 14h00-15h00
Presenter: Yutian Chen
  • Custom voices and voice banking: using WaveNet to reunite speech-impaired users with their original voices

Lecture 6: Inclusive Neural TTS II: Thursday, 26 May 2022, 15h00-16h00
Presenters: Yannis Stylianou, Petko Petkov, and Shifas Padinjaru Veetil
  • Speech perception in adverse listening conditions
  • DSP-based solutions for improving listening in noise and for users with hearing loss
  • Neural approaches to improving intelligibility
  • End-to-end intelligibility improvement for communications
Hands-on session: Thursday, 26 May 2022, 16h00-17h00
  • Hands on intelligibility improvement (see the sketch below)
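
Ahead of the hands-on, here is a crude numpy stand-in for the spectral-shaping-plus-compression idea in the Zorila et al. pre-reading: pre-emphasis tilts energy toward the consonant-rich high band that noise masks first, and an envelope-based gain compresses the dynamic range so weak segments survive. The coefficients and input file are illustrative, not the published SSDRC settings.

    import numpy as np
    import soundfile as sf

    y, sr = sf.read("speech.wav")  # hypothetical mono input

    # Spectral shaping (crude): first-order pre-emphasis boosts high
    # frequencies relative to low ones.
    shaped = np.concatenate(([y[0]], y[1:] - 0.97 * y[:-1]))

    # Dynamic range compression: estimate a 10 ms RMS envelope and apply a
    # gain that maps envelope e to e**0.5, lifting weak segments.
    win = int(0.010 * sr)
    env = np.sqrt(np.convolve(shaped**2, np.ones(win) / win, mode="same")) + 1e-8
    out = shaped * env ** (0.5 - 1.0)

    out /= np.max(np.abs(out)) + 1e-8  # normalize to avoid clipping
    sf.write("enhanced.wav", out, sr)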


Pre-reading material

  • A. Conkie and A. Finch, "Scalable Multilingual Frontend," in Proc. IEEE ICASSP, 2020. https://arxiv.org/abs/2004.04934v1
  • Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech," arXiv:2006.04558, 2020.
  • J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions," 2017. https://arxiv.org/abs/1712.05884
  • J. Kong, J. Kim, and J. Bae, "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis," in Proc. NeurIPS, 2020, vol. 33, pp. 17022–17033.
  • J.-M. Valin and J. Skoglund, "LPCNet: Improving Neural Speech Synthesis Through Linear Prediction," in Proc. IEEE ICASSP, 2019, pp. 5891–5895.
  • N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan, "WaveGrad: Estimating Gradients for Waveform Generation," in Proc. ICLR, 2021.
  • R. Yamamoto, E. Song, and J.-M. Kim, "Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram," in Proc. IEEE ICASSP, 2020, pp. 6199–6203.
  • A. van den Oord et al., "WaveNet: A Generative Model for Raw Audio," arXiv:1609.03499, 2016.
  • M. Covell, M. Withgott, and M. Slaney, "Mach1: Nonuniform Time-Scale Modification of Speech," in Proc. IEEE ICASSP, Seattle, WA, May 1998.
  • Y. Chen, Y. Assael, B. Shillingford, D. Budden, S. Reed, H. Zen, Q. Wang, L. C. Cobo, A. Trask, B. Laurie, and C. Gulcehre, "Sample Efficient Adaptive Text-to-Speech," in Proc. ICLR, 2019.
  • T.-C. Zorila, V. Kandia, and Y. Stylianou, "Speech-in-Noise Intelligibility Improvement Based on Spectral Shaping and Dynamic Range Compression," in Proc. Interspeech, 2012.
  • M. P. V. Shifas, C. Zorilă, and Y. Stylianou, "End-to-End Neural Based Modification of Noisy Speech for Speech-in-Noise Intelligibility Improvement," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 162–173, 2022.
  • P. N. Petkov and W. B. Kleijn, "Spectral Dynamics Recovery for Enhanced Speech Intelligibility in Noise," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 2, pp. 327–338, Feb. 2015.
  • P. N. Petkov and Y. Stylianou, "Adaptive Gain Control for Enhanced Speech Intelligibility Under Reverberation," IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1434–1438, Oct. 2016.