The Organizing Committee of INTERSPEECH 2023 can confirm the following Tutorials will take place on Sunday, 20th August in the Convention Centre Dublin.

Morning Tutorials

T1: Speech Assessment Metrics: From Psychoacoustics to Machine Learning

Fei Chen (Department of Electrical and Electronic Engineering, Southern University of Science and Technology, China) and Yu Tsao (The Research Center for Information Technology Innovation (CITI), Academia Sinica, Taiwan)

An important measure of the effectiveness of speech technology applications is the intelligibility and quality of the processed speech signals provided by these applications. A number of speech evaluation metrics have been derived to quantitatively measure specific properties of speech signals. Objective speech assessment metrics have been developed as surrogates for human listening tests. Speech assessment metrics based on deep learning models have garnered significant attention.


T2: Recent Advances in Speech Processing, Multi-talker ASR and Diarization for Cocktail Party Problem

Shi-Xiong Zhang (Tencent AI lab, Bellevue, USA), Yong Xu (Tencent AI lab, Bellevue, USA), Shinji Watanabe (Carnegie Mellon University, Pittsburgh, USA) and Dong Yu (Tencent AI lab, Bellevue, USA)

A new trend in today’s speech field is to develop systems for wilder and more challenging scenarios, such as multiple simultaneous speakers in meetings or cocktail party environments. Significant research activity has occurred in recent years in this area and great advances have been made. This tutorial will bring together state-of-the-art research on solving “Who said What and When” in multi-talker scenarios, including: 1) front-end speech separation and beamforming, and back-end speaker diarization and speech recognition; 2) modeling techniques for single-channel, multi-channel or audio-visual inputs; 3) pipeline systems of multiple speech modules vs. end-to-end integrated neural networks. The goal is to give the audience a complete picture of this cross-disciplinary field and to shed light on future directions and collaborations.


T3: Resource-Efficient and Cross-Modal Learning Toward Foundation Models

Pin-Yu Chen (IBM AI and MIT-IBM Watson AI Lab, NY, USA), C.-H. Huck Yang (Amazon Alexa Speech, WA, USA), Shalini Ghosh (Amazon Alexa Speech, WA, USA), Jia-Hong Huang (Universiteit van Amsterdam, the Netherlands) and Marcel Worring (Universiteit van Amsterdam, the Netherlands)

In this tutorial, the first session will introduce the theoretical advantages of large-scale pre-trained foundation models through universal approximation theory and show how to update large-scale speech and acoustic models effectively using parameter-efficient learning. The second session will introduce effective cross-modal pre-training of representations across visual, speech, and language modalities, which can be learned without necessarily needing aligned data across modalities and can also benefit tasks in individual modalities. Finally, the third session will explore applications in multimedia processing that benefit from the pre-training of acoustic and language models, with benchmark performance.


T4: Advancements in Speech and Sound Processing for Cochlear Implants: Science, Technology, ML, and Cloud Connectivity

Juliana Saba (Centre for Robust Speech Systems, The University of Texas, USA), Ram C.M.C Shekar (Centre for Robust Speech Systems, The University of Texas, USA), Oldooz Hazrati (Food & Drug Administration, USA) and John H.L. Hansen (Centre for Robust Speech Systems, The University of Texas, USA)

This tutorial will provide an overview of speech and sound perception specific to cochlear implant users and discuss how human subjects research is conducted using research platforms. Signal processing aspects, such as sound coding strategies, the translation of acoustic parameters into the electric space, and the performance of these clinical devices in various listening situations, will be discussed. Two types of strategies will be presented: speech-specific and non-speech. Advancements in speech processing, research platforms, and cloud-based technology, as well as future directions, will also be briefly discussed.

Afternoon Tutorials


T5: Foundations, Extensions and Applications of Statistical Multichannel Speech Separation Models

Kazuyoshi Yoshii (Kyoto University, Japan/SSU Team with AIP, RIKEN, Tokyo, Japan), Aditya Arie Nugraha (SSU Team with AIP, RIKEN, Tokyo, Japan), Mathieu Fontaine (LTCI, Télécom Paris, Palaiseau, France/SSU Team with AIP, RIKEN, Tokyo, Japan) and Yoshiaki Bando (Artificial Intelligence Research Centre (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan/SSU Team with AIP, RIKEN, Tokyo, Japan)

This tutorial aims to show audio and speech researchers who are interested in source separation and speech enhancement how to formulate a physics-aware probabilistic model that explicitly represents the generative process of observed audio signals (direct problem) and how to derive its maximum likelihood estimator (inverse problem) in a principled manner. Under mismatched conditions and/or with less training data, the separation performance of supervised methods might be degraded drastically in the real world, as is often the case with deep learning-based methods that work well in controlled benchmarks. First, we show that state-of-the-art blind source separation (BSS) methods can work comparably or even better in the real world and play a vital role in drawing out the full potential of deep learning-based methods. Second, we introduce how to develop an augmented reality (AR) application for smart glasses with real-time speech enhancement and recognition of target speakers.


T6: Advances in Audio Anti-Spoofing and Deepfake Detection Using Graph Neural Networks and Self-Supervised Learning

Jee-weon Jung (Naver Corporation, Korea / Carnegie Mellon University, USA), Hye-jin Shim (University of Eastern Finland, Finland), Hemlata Tak (EURECOM, France) and Xin Wang (National Institute of Informatics, Japan)

This tutorial will delve into the latest advances in audio anti-spoofing and audio deepfake detection, driven by the application of graph neural networks and self-supervised learning. We will provide a comprehensive overview of state-of-the-art techniques, including in-depth analysis and hands-on coding demonstrations. By attending this tutorial, participants will gain a thorough understanding of current audio anti-spoofing models and will be knowledgeable enough to experiment with these models and leverage them as future baselines.

T7: Navigating the Evolving Landscape of Conversational AI for Digital Health: From Yesterday to Tomorrow

Tulika Saha (University of Liverpool, United Kingdom), Abhisek Tiwari (Indian Institute of Technology Patna, India) and Sriparna Saha (Indian Institute of Technology Patna, India)

In the past few years, dozens of surveys have revealed a scarcity of healthcare professionals, particularly psychiatrists, limiting access to healthcare for severely ill individuals. With the motivation of efficiently utilizing doctors’ time and providing an accessible platform for early diagnosis, clinical assistance using artificial intelligence is gaining immense popularity and demand in both the research and industry communities. As a result, telemedicine has grown substantially in recent years, particularly since the COVID-19 outbreak. The tutorial aims to present a comprehensive overview of the use of conversational agents in healthcare, including recent advancements and future prospects. The tutorial will also provide a demonstration of our newly developed virtual disease diagnosis assistant. The tutorial covers fundamentals through to advanced concepts, which makes it beneficial for both beginners and experts.


T8: Open-Source Tools for Automatic Speech Recognition with Lhotse and Icefall: Training Efficient Transducers with Large Data

Fangjun Kuang (Xiaomi, China), Matthew Wiesner (Johns Hopkins University, USA), Piotr Zelasko (Meaning, USA), Desh Raj (Johns Hopkins University, USA), Dan Povey (Xiaomi, China) and Sanjeev Khudanpur (Johns Hopkins University, USA)

The focus of this tutorial is on the new features in Lhotse and Icefall, such as: efficient algorithms and architectures that enable fast and memory-efficient training of Transducers, even in academic environments using modest GPU resources; novel fast decoding algorithms; sequential data storage and I/O to enable easy storage and processing of large corpora (>30,000 hrs); new Lhotse workflows with Whisper and Wav2Vec2.0; and new ASR recipes focusing on corpora with 5000+ hrs of speech. We also demonstrate how Lhotse can be used to support full-stack speech processing with blind source separation in multi-talker, multi-microphone recordings. Finally, we present a new ASR server framework in Python, called Sherpa, that supports both streaming and non-streaming recognition. We hope this tutorial will encourage the wider community, including industrial and academic researchers, to develop and deploy full-stack, Transducer-based ASR solutions trained on large corpora such as the GigaSpeech or SPGISpeech corpora.