Special Sessions/Challenges

The Organizing Committee of INTERSPEECH 2023 can confirm the following Special Sessions, Challenges and Panel Session.

Special Sessions


Biosignals, such as those arising from articulatory or neurological activity, provide information about the human speech process and can thus serve as an alternative modality to the acoustic speech signal. As such, they can be the primary driver for speech-driven human-computer interfaces intended to support humans when acoustic speech is not available or perceivable. For instance, articulatory-related biosignals, such as Electromyography (EMG) or Electromagnetic Articulography (EMA), can be leveraged to synthesize the acoustic speech signal from silent articulation. By the same token, neuro-steered hearing aids process neural activity, reflected in signals such as Electroencephalography (EEG), to detect the listener's selective auditory attention and thereby single out and enhance the attended speech stream. Progress in the field of speech-related biosignal processing will lead to the design of novel biosignal-enabled speech communication devices and speech rehabilitation for everyday situations.


With the special session “Biosignal-enabled Spoken Communication”, we aim to bring together researchers working on biosignals and speech processing to exchange ideas on these interdisciplinary topics. Topics include, but are not limited to:

  • Processing of biosignals related to spoken communication, such as brain activity captured by, e.g., EEG, Electrocorticography (ECoG), or functional magnetic resonance imaging (fMRI).
  • Processing of biosignals stemming from respiratory, laryngeal, or articulatory activity, represented by, e.g., EMA, EMG, videos, or similar.
  • Application of biosignals for speech processing, e.g., speech recognition, synthesis, enhancement, voice conversion, or auditory attention detection.
  • Utilization of biosignals to increase the explainability or performance of acoustic speech processing methods.
  • Development of novel machine learning algorithms, feature representations, model architectures, as well as training and evaluation strategies for improved performance or to address common challenges.
  • Applications such as speech restoration, training and therapy, speech-related brain-computer interfaces (BCIs), speech communication in noisy environments, or acoustic-free speech communication for preserving privacy.




Dr. Siqi Cai, Human Language Technology Laboratory, National University of Singapore, Singapore

Kevin Scheck, Cognitive Systems Lab, University of Bremen, Germany

Assoc. Prof. Hiroki Tanaka, Augmented Human Communication Labs, Nara Institute of Science and Technology, Japan

Prof. Dr.-Ing. Tanja Schultz, Cognitive Systems Lab, University of Bremen, Germany

Prof. Haizhou Li, The Chinese University of Hong Kong, Shenzhen, China; National University of Singapore, Singapore


Speech technology is increasingly embedded in everyday living, with applications ranging from critical domains such as medicine, psychiatry, and education to more commercial settings. This rapid growth can be largely attributed to the successful use of deep learning in modelling large amounts of speech data. However, the performance of speech technology in these applications varies depending on the demographics of the population represented in the data it is trained on and applied to. That is, inequity in speech technology appears across age and gender, and affects people with vocal disorders, people from atypical populations, and people with non-native accents.

A large group vulnerable to the inequities of speech technology and its performance is children. The goal of this interdisciplinary session is to address the limitations and advances of speech-technology and speech-science, focusing on child speech, while bringing together researchers working within these domains.


We invite papers on topics including, but not limited to:

  • Using speech science (knowledge from children’s speech acquisition, production, perception, and generally natural language understanding) to develop and improve speech technology applications.
  • Using techniques used for developing speech technology to learn more about child speech production, perception and processing.
  • Computational modelling of child speech.
  • Speech technology applications for children, including (but not limited to) speech recognition, voice conversion, language identification, segmentation, diarization, etc.
  • Use and/or modification of data creation techniques, feature extraction schemes, tools and training architectures developed for adult speech for developing child speech applications.
  • Speech technology for children from typical and non-typical groups (atypical, non-native speech, slow-learners, etc.)




Line H. Clemmensen, Technical University of Denmark, Denmark

Nina R. Benway, Syracuse University, USA

Odette Scharenborg, Delft University of Technology, the Netherlands

Sneha Das, Technical University of Denmark, Denmark

Tanvina Patel, Delft University of Technology, the Netherlands

Zhengjun Yue, Delft University of Technology, the Netherlands


Speech and language technology (SLT) has the potential to help educate, facilitate medical treatment, provide access to services and information, empower, support independent living, and enable communication and cultural exchange between communities.

While speech synthesis and automatic speech recognition have been used to aid accessibility for several decades, a wider range of speech and language technologies are powerful tools in applications useful to society. Dialog technology has been used in domains including public education and cultural exhibits, independent learning applications, anti-bullying initiatives, health, digital resources for minority or lesser spoken languages, and companion/assistive systems for the elderly. These applications have the potential to provide societal benefits or public good by giving access to highly interactive services in sectors or contexts where dialogue and language are a critical interaction component, and where other interface paradigms would be less effective or have higher infrastructure barriers. Other applications seek to improve access to information and provide spoken word versions of written texts for education and entertainment (Daisy Digital Books), while machine translation and a wide range of NLP tools also have the potential to aid communication and access to information.

The Dialog for Good special session (DiGo) aims to highlight the use of SLT for social good. It will promote novel use cases, cutting-edge research, and technological developments in any domain that benefits society, building awareness of the opportunities that SLT offers. We hope the workshop will foster networking among researchers and service providers, leading to further initiatives to develop this highly interdisciplinary area of speech and language research and technology.


We welcome submissions on dialog and speech and language technology and applications in areas including, but not limited to:

  • Education
  • Access to social services / participation in society
  • Lesser Resourced Languages
  • Health
  • Social/Public services
  • Culture
  • Mobility/Migration
  • Political Freedom
  • Agriculture
  • Sustainability


Emer Gilmartin (Inria, Paris)

Neasa Ni Chiarain (Centre for Language and Communications Studies, Trinity College, Dublin)

Jens Edlund (KTH, Stockholm)

Brendan Spillane (University College Dublin/ADAPT)

David Traum (ICT)

Justine Cassell (INRIA Paris/CMU)

Vinny Wade (ADAPT Centre, Dublin)


Pre-trained acoustic models learned in an unsupervised fashion have exploded in the domain of speech. The representations discovered by CPC, wav2vec 2.0, HuBERT, WavLM, and others, can be used to massively reduce the amount of labelled data to train speech recognizers; they also produce excellent speech resynthesis.

However, while pre-trained acoustic representations seem to be nicely isomorphic with phones or phone states under optimal listening conditions, very little work has addressed invariances. Do the representations remain consistent across instances of the same phoneme in different phonetic contexts (i.e., are they phonemic or merely allophone representations)? Do they hold up under noise and distortions? Are they invariant to different talkers and/or accents?

Progress on these issues could unlock new levels of performance on higher-level tasks such as word segmentation, named entity recognition, and language modelling, where using the discretized “units” discovered by pre-trained acoustic models still lags behind state-of-the-art text-based models. Importantly, progress on talker and accent robustness would help address the serious fairness problem that current ASR models have (including those using pre-trained acoustic models as features), whereby lower socioeconomic status is highly correlated with higher word error rate.

The 2023 Interspeech Special Session on Invariant and Robust Pretrained Acoustic Models (IRPAM) aims to address both the evaluation problem and the problem of invariance in pretrained acoustic models. The evaluation track will accept proposed systematic evaluation measures, test sets, or benchmarks for pre-trained acoustic models, including but not limited to context-invariance, talker-invariance, accent-invariance, robustness to noise and distortions, etc. The model track will propose new models or techniques and demonstrate empirically that they improve the invariance or robustness properties of pre-trained speech representations, evaluating using existing approaches or variants on existing benchmarks/measures. This could also include techniques for disentanglement in pre-trained acoustic models.




Ewan Dunbar, University of Toronto

Emmanuel Dupoux, École des Hautes Études en Sciences Sociales / École Normale Supérieure / Meta AI

Hung-yi Lee, National Taiwan University

Abdelrahman Mohamed


Developing methods that are able to handle multiple simultaneous speakers represents a major challenge for researchers in many fields of speech technology and speech science, for example, in speech enhancement, auditory modelling and machine listening or speaking. Significant research activity has occurred in many of these fields in recent years and great advances have been made, but often in a siloed manner. This cross-disciplinary special session will bring together researchers from across the whole field to present and discuss their latest research on multi-talker methods, encouraging a sharing of ideas and fertilising future collaboration.


We welcome submissions on many different topics, including, but not limited to:

  • Single channel speech separation;
  • Automatic speech recognition of overlapped speech;
  • Speech enhancement in the presence of competing speakers;
  • Diarization of overlapped speech;
  • Target speaker ASR and speech enhancement;
  • Understanding human speech perception in multi-talker environments;
  • Improving speech synthesis in competing-speaker scenarios;
  • Multi-modal approaches to multi-talker speech processing: for example audio-visual methods, location-aware approaches;
  • Clinical applications of multi-talker methods, e.g., for hearing impaired listeners;
  • Downstream technologies operating in multi-talker scenarios, e.g., meeting transcription, human-robot interaction;
  • Evaluation methods for multi-talker speech technologies.

Note, however, that we intend the focus of the session to be on applications in single-channel or binaural conditions, rather than on methods pertaining specifically to microphone arrays or other specialist hardware.




Peter Bell, University of Edinburgh, UK

Michael Akeroyd, University of Nottingham, UK

Marc Delcroix, NTT, Japan

Liang Lu, Otter.ai, USA

Jonathan Le Roux, MERL, USA

Jinyu Li, Microsoft, USA

Cassia Valentini, University of Edinburgh, UK

DeLiang Wang, Ohio State University, USA

Jon Barker, University of Sheffield


This special session has the goal of serving as a central hub for researchers investigating how the human brain processes speech under various acoustic/linguistic conditions and in various populations. Understanding speech requires our brain to rapidly process a variety of acoustic and linguistic properties, with variability due to age, language proficiency, attention, and neurocognitive ability among other factors. Until recently, neurophysiology research was limited to studying the encoding of individual linguistic units in isolation (e.g., syllables) using tightly controlled and uniform experiments that were far from realistic scenarios. Recent advances in modelling techniques led to the possibility of studying the neural processing of speech with more ecologically constructed stimuli involving natural, conversational speech, enabling researchers to examine the contribution of factors such as native language and language proficiency, speaker sex, and age to speech perception.

One of the approaches, known as forward modelling, involves modelling how the brain encodes speech information as a function of certain parameters (e.g., time, frequency, brain region), contributing to our understanding of what happens to the speech signal as it passes along the auditory pathway. This framework has been used to study both young and ageing populations, as well as neurocognitive deficits. Another approach, known as backward modelling, involves decoding speech features or other relevant parameters from the neural response recorded during natural listening tasks. A noteworthy contribution of this approach was the discovery that auditory attention can be reliably decoded from several seconds of non-invasive brain recordings (EEG/MEG) in multi-speaker environments, leading to a new subfield of auditory neuroscience focused on neuro-enabled hearing technology applications.


Giovanni Di Liberto, Trinity College Dublin (School of Computer Science and Statistics, ADAPT Centre, TCIN)

Alejandro Lopez Valdes, Trinity College Dublin (School of Engineering, Electronic and Electrical Engineering, Global Brain Health Institute, TCBE, TCIN)

Mick Crosse, SEGOTIA; Trinity College Dublin (School of Engineering)

Mounya Elhilali, Johns Hopkins University (Department of Electrical and Computer Engineering, Department of Psychological and Brain Sciences)


Technological advancements have been rapidly transforming healthcare in the last several years, with speech and language tools playing an integral role. However, this brings a multitude of unique challenges to consider when integrating speech and language tools in healthcare and health research settings. Many of these challenges are common to the two themes of this special session. The first theme, From Collection and Analysis to Clinical Translation, seeks to draw attention to all aspects of speech-health studies that affect the overall quality and reliability of any analysis undertaken on the data and thus affect user acceptance and clinical translation. These factors include increasing our understanding of how changes in health affect the neuroanatomical and neurophysiological mechanisms related to speech and language, and how best to go about capturing, analyzing and quantifying these changes. Alongside these efforts, the speech-health community also needs to consider practical issues of feasibility to help advance the translational potential of speech as a health signal. The second theme, Speech and Language Technology For Medical Conversations, covers a growing field of ambient intelligence in which automatic speech recognition and natural language processing tools are combined to automatically transcribe and interpret clinician-patient conversations and generate subsequent medical documentation. This multifaceted area includes many foci centered around language technologies, such as those for long-form conversations, for translation of conversations into accurate clinical documentation, for providing feedback to medical students, for diagnostic support from spontaneous conversations with physicians, or for novel applications of language technology.
By combining these themes, this session will bring the wider speech-health community together to discuss innovative ideas, challenges, and opportunities for utilizing speech technologies within the scope of healthcare applications.




Nicholas Cummins, King’s College London and Thymia

Thomas Schaaf, 3M

Heidi Christensen, University of Sheffield

Julien Epps, University of New South Wales

Matt Gormley, Carnegie Mellon University

Sandeep Konam, Abridge.ai

Emily Mower Provost, University of Michigan

Chaitanya Shivade, Amazon.com

Thomas Quatieri, MIT Lincoln Laboratory



The inaugural MERLIon CCS Challenge focuses on developing robust language identification and language diarization systems that are reliable for non-standard, accented, spontaneous code-switched, child-directed speech collected via Zoom.

Due to a bias towards standard speech varieties, non-standard, accented speech remains an ongoing challenge for automatic processing. Although existing works have explored automatic speech recognition and language diarization in code-switching speech corpora, those tasks are still challenging for natural in-the-wild speech containing more than one language, particularly when the code-switching occurs in short language spans.

Aligning closely with Interspeech 2023’s theme, ‘Inclusive Spoken Language Science and Technology – Breaking Down Barriers’, we present the challenge of developing robust language identification and language diarization systems that are reliable for non-standard accented, bilingual, child-directed speech collected via a video call platform.

As video calls become increasingly ubiquitous, we present a unique first-of-its-kind Zoom video call dataset. The MERLIon CCS Challenge will tackle automatic language identification and language diarization in a subset of audio recordings from the Talk Together Study, where parents narrated an onscreen wordless picture book to their child.

The main objectives of this inaugural challenge are:

  1. to benchmark the current and novel language identification and language diarization systems in a code-switching scenario, including extremely short utterances;
  2. to test the robustness of such systems under accented speech;
  3. to challenge the research community to propose novel solutions in terms of adaptation, training, and novel embedding extraction for this particular set of tasks.

Techniques developed in the challenge may benefit other related fields, allowing a greater understanding of how code-switching occurs in real-life situations.

The challenge will feature language identification and language diarization. Two tracks, open and closed, are available. The tracks differ by the data used during system training.




Leibny Paola Garcia Perera, Johns Hopkins University

YH Victoria Chua, Nanyang Technological University

Hexin Liu, Nanyang Technological University

Fei Ting Woon, Nanyang Technological University

Andy Khong, Nanyang Technological University

Justin Dauwels, TU Delft

Sanjeev Khudanpur, Johns Hopkins University

Suzy J Styles, Nanyang Technological University

Panel Sessions

Note: there are no paper submissions for the panel session.


The capacity of speech processing systems to learn about human speech and to enable speech-based human-computer interaction has opened up many possibilities. While the community has worked hard on the topic of speech processing system errors, we have not really grappled with the risks and negative impacts of speech applications, not because they don’t happen, but presumably because these topics are rarely in scope for the activity. The field of trustworthy and responsible AI seeks to explore limitations of technology and reduce its risks to individuals, communities, and society. There is mounting evidence of significant AI risks, such as AI bias causing harms to certain groups (e.g., in facial recognition technology), leading research communities to pay more attention to these concerns.

This special session will focus on this topic in the context of speech processing systems. It will consist of a moderated discussion of bias in speech processing from a broader, more holistic socio-technical perspective that (1) goes beyond computational and statistical biases in the data and model pipelines to include systemic bias, default culture, the role of domain expertise and contextual considerations, and human-cognitive biases across the AI lifecycle; (2) centers on impacts, how risks in AI lead to those impacts, and how design considerations and organizational practices can be developed and normalized to address risks; and (3) elicits thoughts and questions from session attendees and from panels made up of the session organizers.

The session’s goals are to encourage the speech community to:

  • develop a research roadmap for evaluating and mitigating bias propagation beyond the model pipeline in speech applications
  • avoid pitfalls of other AI tasks/applications in bias
  • consider how we can:
      • explore limitations within speech applications?
      • evaluate speech application impacts in real-world settings?
      • improve our capacity for bringing socio-technical context into the design and development of speech applications?
      • know which variables within human speech are being learned by speech processing systems that have contributed to risk or unintended impacts?


Aylin Caliskan, University of Washington

Craig Greenberg, National Institute of Standards and Technology

John Hansen, University of Texas, Dallas

Abigail Jacobs, University of Michigan

Nina Markl, University of Edinburgh

Doug Reynolds, National Security Agency; MIT Lincoln Laboratory

Hilke Schellmann, New York University

Reva Schwartz, National Institute of Standards and Technology

Mona Sloane, NYU; University of Tübingen