Survey Talks

Corinne Fredouille

Avignon Université - LIA

A communication disorder is “an impairment in the ability to receive, send, process and understand verbal, nonverbal and graphic concepts or symbol systems” (American Speech-Language-Hearing Association – ASHA, 1993). This presentation will deal with a specific case of communication disorder, namely speech and voice disorders. After defining this specific context, we will focus on the assessment of this type of disorder, which is necessary in the clinical field, and on how automatic approaches can overcome the limitations of perceptual assessment, particularly in terms of subjectivity and reproducibility. We will briefly review the classical machine learning approaches used since the 1990s and, more recently, the application of deep learning. We will then look at the concepts of explainability/interpretability and how they can be used to provide useful information to clinicians.

Tamás Gábor Csapó

Budapest University of Technology and Economics

For articulatory-to-acoustic mapping experiments, articulatory data (i.e., information about the movement of the articulatory organs) is recorded while the subject is speaking. An example of such an articulatory acquisition technique is Ultrasound Tongue Imaging (UTI). Typically, during ultrasound recordings, the transducer is placed below the chin in mid-sagittal orientation. For ultrasound-to-speech conversion, machine learning methods are applied to predict the speech signal, with the network conditioned on the articulatory input. A potential long-term application might be a ‘Silent Speech Interface’ (SSI), where silent (mouthed) articulation can be converted to audible speech. Such an SSI could be helpful for communication by people with speech impairments, in military applications, or in extremely noisy conditions. In this survey talk we will review the progress of ultrasound-to-speech conversion over the last 20 years, including several open questions and unsolved challenges in the field.

Oldrich Plchot

Brno University of Technology
Czech Republic

This talk will cover state-of-the-art and emerging methods for extracting speaker representations (embeddings) from speech. We will compare unsupervised, self-supervised, weakly supervised, and fully supervised approaches and discuss various applications and use cases that fit the methods. More focus will be given to self-supervised Transformers and their use for extracting speaker representations, as these models have quickly risen in popularity and become an integral part of state-of-the-art speech modeling for automatic speech recognition. Apart from fine-tuning them to extract speaker embeddings, we will discuss strategies for domain adaptation and self-pretraining.

Jan Skoglund

United States

Speech compression is a fundamental component of digital voice communication, such as video conferencing and telephony. Today’s systems rely heavily on technology developed decades ago. However, as modern advances in AI and deep learning have found much success in other areas of speech processing, such as recognition and synthesis, we have recently seen promising results in speech compression as well. This talk will give an overview of the topic and discuss some recent progress in AI-based systems.

Paola Garcia

Johns Hopkins University
United States

In recent years, speech and language technology has achieved remarkable progress, impacting daily lives. However, when it comes to children’s speech, state-of-the-art technologies have shown limited effectiveness. Children’s speech patterns differ from those of adults, varying not only in acoustics but also in linguistic structures and language developmental stages. In this survey, we will give an overview of the advancements in speech and language technology, emphasizing the aspects tailored to address children’s speech. Recognizing the impact of children’s speech across several interdisciplinary fields, we will also shed light on specific applications, such as virtual assistants, bilingualism, language development, educational settings, and gaming. Additionally, we will delve into potential future applications and innovative solutions to address current challenges in speech technologies for children and their implications across related fields.

Zofia Malisz

KTH Royal Institute of Technology

In this survey talk, I discuss the latest contributions of phonetic research in prosody to improvements in speech synthesis. I also talk about the ways in which recent advances in synthesis are used to explain natural speech prosody. I argue that speech scientists and speech engineers would benefit from working more with each other: in particular, in the pursuit of acoustic parameter control in neural speech synthesis or computational modeling of speech rhythm, timing and conversational phenomena. My hope is to inspire more students and researchers to take up these research challenges and explore the potential of working at the intersection of the two fields.

Nicole Holliday

Pomona College
United States

As speech technology becomes an increasingly integral part of the everyday lives of humans around the world, issues related to language variation and change and algorithmic inequality will come to the forefront for citizens and researchers alike. Indeed, over the past few years, researchers across disciplines such as computer science, communications, and linguistics have begun to approach these concerns from a variety of scholarly perspectives. For sociolinguists who are primarily interested in how social factors influence language use and vice versa, the fact that humans and machines are regularly speaking with one another presents an entirely new area of research interest with major impacts for the public. In this talk, I will discuss three main issues related to sociolinguistic variation and speech technology. The first is the issue of how speakers may alter their linguistic behavior as a response to repeated interactions with digital speech systems that behave differently from human interlocutors. The second issue is how speech technology may respond differently to speakers who employ linguistic variation in ways that its models were not trained to accommodate, thus reproducing social inequality in access to such systems. Finally, I will discuss large-scale challenges related to algorithmic bias, as well as the pitfalls that speech researchers need to be aware of when designing and evaluating new systems.

Reinhold Haeb-Umbach

University of Paderborn

Petra Wagner

Bielefeld University

In the earlier days of our disciplines, speech technology was strongly inspired by insights and models from speech production and perception. Similarly, speech scientists relied on speech technology to test their own theoretical assumptions, e.g., when using synthetic speech stimuli in listening tests. With the success of neural network architectures in speech technology, this interdisciplinary connection has weakened considerably, as neural architectures lack the phonetic interpretability that appears necessary to link the two disciplines. Without trying to be comprehensive, our talk will provide several examples of research in which both disciplines continue to work together and to profit from one another.

Wenwu Wang

University of Surrey
United Kingdom

In automated audio captioning, the aim is to provide a meaningful language description of the content of an audio clip. This can be used in a variety of applications, e.g., for assisting the hearing-impaired in understanding environmental sounds, facilitating retrieval of multimedia content, and analyzing sounds for security surveillance. To generate text descriptions for an audio clip, it is essential to comprehend the audio events and scenes within the clip, as well as to interpret the textual information presented in natural language. In addition, learning the mapping and alignment of these two streams of information is crucial. In this survey talk, we will give an introduction to this field, including problem description, potential applications, datasets, open challenges, recent technical progress, and possible future research directions.