Survey Talks
Corinne Fredouille
Avignon Université - LIA
France
Deep Learning and Explainability/Interpretability for Pathological Voice and Speech Analysis
A communication disorder is “an impairment in the ability to receive, send, process and understand verbal, nonverbal and graphic concepts or symbol systems” (American Speech-Language-Hearing Association – ASHA, 1993). This presentation will deal with a specific case of communication disorder, namely speech and voice disorders. After defining this specific context, we will focus on the assessment of this type of disorder, which is necessary in the clinical field, and on how automatic approaches can overcome the limitations of perceptual assessment, particularly in terms of subjectivity and reproducibility. We will briefly review the classical machine learning approaches used since the 1990s and, more recently, the application of deep learning. At this point, we will look at the concepts of explainability/interpretability and how they can be used to provide useful information to clinicians.
Tamás Gábor Csapó
Budapest University of Technology and Economics
Hungary
Ultrasound-to-Speech Conversion
For articulatory-to-acoustic mapping experiments, articulatory data (i.e., information about the movement of the articulatory organs) is recorded while the subject is speaking. An example of such an articulatory acquisition technique is Ultrasound Tongue Imaging (UTI). Typically, during ultrasound recordings, the transducer is placed below the chin in mid-sagittal orientation. For ultrasound-to-speech conversion, machine learning methods are applied to predict the speech signal, with the network conditioned on the articulatory input. A potential long-term application might be a ‘Silent Speech Interface’ (SSI), in which silent (mouthed) articulation is converted to audible speech. Such an SSI could support communication for people with speech impairments and could also be useful in military applications or in extremely noisy conditions. In this survey talk, we will review the progress of ultrasound-to-speech conversion over the last 20 years, including several open questions and unsolved challenges in the field.
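To make the conditioning concrete, here is a minimal sketch in PyTorch that regresses one mel-spectrogram frame from one ultrasound tongue image. The dimensions (64x128-pixel frames, 80 mel bands) are illustrative assumptions, not taken from any specific study; a real system would add temporal context and a vocoder to turn the predicted spectrogram into a waveform.

import torch
import torch.nn as nn

class UltrasoundToMel(nn.Module):
    """Regress one mel-spectrogram frame from one ultrasound tongue image."""
    def __init__(self, mel_bands: int = 80):
        super().__init__()
        self.encoder = nn.Sequential(                   # small CNN over the image
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Linear(32 * 16 * 32, mel_bands)  # 64x128 input -> 16x32 feature maps

    def forward(self, x):                               # x: (batch, 1, 64, 128)
        return self.head(self.encoder(x))               # (batch, mel_bands)

model = UltrasoundToMel()
frames = torch.randn(8, 1, 64, 128)   # toy stand-in for recorded ultrasound frames
target = torch.randn(8, 80)           # time-aligned mel frames of the spoken audio
loss = nn.functional.mse_loss(model(frames), target)
loss.backward()                       # one gradient step; the optimizer loop is omitted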
Oldrich Plchot
Brno University of Technology
Czech Republic
Current Trends in Speaker Verification / Extracting Speaker-Related Representations from Speech
This talk will cover state-of-the-art and emerging methods for extracting speaker representations (embeddings) from speech. We will compare unsupervised, self-supervised, weakly supervised, and fully supervised approaches and discuss the applications and use cases to which each method is suited. More focus will be given to self-supervised Transformers and their use for extracting speaker representations, as these models have quickly risen in popularity and become an integral part of state-of-the-art speech modeling for automatic speech recognition. Apart from fine-tuning them to extract speaker embeddings, we will discuss strategies for domain adaptation and self-pretraining.
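As a hedged illustration of the extraction step only, the sketch below mean-pools the hidden states of a pretrained wav2vec 2.0 model (HuggingFace's Wav2Vec2Model) into a fixed-size utterance embedding and scores a verification trial by cosine similarity. In practice the backbone would first be fine-tuned with a speaker classification or metric-learning objective, which is not shown here.

import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

def speaker_embedding(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (1, num_samples) mono audio at 16 kHz."""
    with torch.no_grad():
        hidden = model(waveform).last_hidden_state   # (1, frames, 768)
    emb = hidden.mean(dim=1).squeeze(0)              # mean pooling over time
    return emb / emb.norm()                          # length-normalized embedding

# A verification trial: cosine similarity of two utterance embeddings,
# accepted or rejected against a tuned threshold (noise stands in for real audio here).
enroll = speaker_embedding(torch.randn(1, 16000))
test = speaker_embedding(torch.randn(1, 16000))
score = torch.dot(enroll, test)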
Jan Skoglund
Google
United States
On Speech Compression (in the AI Era)
Speech compression is a fundamental component of digital voice communication, such as video conferencing and telephony. Today’s systems rely heavily on technology developed decades ago. However, just as modern AI and deep learning methods have found much success in other areas of speech processing, such as recognition and synthesis, we have recently seen promising results in speech compression as well. This talk will give an overview of the topic and discuss some recent progress in AI-based systems.
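To make the AI-based approach concrete, the deliberately tiny PyTorch sketch below shows the encoder-quantizer-decoder pattern behind recent neural codecs (SoundStream- or EnCodec-style systems). The single learned codebook and all sizes are simplifying assumptions; real codecs use residual vector quantization, adversarial and perceptual losses, and much deeper networks.

import torch
import torch.nn as nn

class TinyCodec(nn.Module):
    """Encoder -> vector quantizer -> decoder: the skeleton of a neural codec."""
    def __init__(self, codebook_size: int = 64, dim: int = 32):
        super().__init__()
        self.enc = nn.Conv1d(1, dim, kernel_size=8, stride=4, padding=2)   # 4x downsampling
        self.codebook = nn.Embedding(codebook_size, dim)                   # learned codewords
        self.dec = nn.ConvTranspose1d(dim, 1, kernel_size=8, stride=4, padding=2)

    def forward(self, wav):                            # wav: (batch, 1, samples)
        z = self.enc(wav)                              # (batch, dim, frames)
        dists = torch.cdist(z.transpose(1, 2),         # nearest codeword per latent frame
                            self.codebook.weight.unsqueeze(0))
        idx = dists.argmin(-1)                         # these indices are the "bitstream"
        q = self.codebook(idx).transpose(1, 2)         # dequantization on the receiver side
        q = z + (q - z).detach()                       # straight-through gradient trick
        return self.dec(q), idx

codec = TinyCodec()
wav = torch.randn(1, 1, 16000)               # one second of toy "speech" at 16 kHz
recon, bits = codec(wav)                     # `bits`: 4000 indices of 6 bits each
loss = nn.functional.mse_loss(recon, wav)    # reconstruction objective for training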
Paola Garcia
Johns Hopkins University
United States
Speech and Language Technology for Children
In recent years, speech and language technology has achieved remarkable progress, impacting daily lives. However, when it comes to children’s speech, state-of-the-art technologies have shown limited effectiveness. Children’s speech patterns differ from adults’, varying not only in acoustics but also in linguistic structures and language developmental stages. In this survey, we will give an overview of the advancements in speech and language technology, emphasizing the aspects tailored to address children’s speech. Recognizing the impact of children’s speech across several interdisciplinary fields, we will also shed light on specific applications, such as virtual assistants, bilingualism, language development, educational settings, and gaming. Additionally, we will delve into potential future applications and innovative solutions to address current challenges in speech technologies for children and their implications across related fields.
Zofia Malisz
KTH Royal Institute of Technology
Sweden
Realising the potential of modern speech synthesis for prosodic research
In this survey talk, I discuss the latest contributions of phonetic research in prosody to improvements in speech synthesis. I also talk about the ways in which recent advances in synthesis are used to explain natural speech prosody. I argue that speech scientists and speech engineers would benefit from working more with each other: in particular, in the pursuit of acoustic parameter control in neural speech synthesis or computational modeling of speech rhythm, timing and conversational phenomena. My hope is to inspire more students and researchers to take up these research challenges and explore the potential of working at the intersection of the two fields.
Nicole Holliday
Pomona College
United States
A Sociolinguistic Perspective on Speech Technology
As speech technology becomes an increasingly integral part of the everyday lives of humans around the world, issues related to language variation and change and algorithmic inequality will come to the forefront for citizens and researchers alike. Indeed, over the past few years, researchers across disciplines such as computer science, communications, and linguistics have begun to approach these concerns from a variety of scholarly perspectives. For sociolinguists, who are primarily interested in how social factors influence language use and vice versa, the fact that humans and machines are regularly speaking with one another presents an entirely new area of research interest with major impacts for the public. In this talk, I will discuss three main issues related to sociolinguistic variation and speech technology. The first is the issue of how speakers may alter their linguistic behavior in response to repeated interactions with digital speech systems that behave differently from human interlocutors. The second issue is how speech technology may respond differently to speakers who employ linguistic variation in ways that their models were not trained to accommodate, thus reproducing social inequality in access to such systems. Finally, I will discuss large-scale challenges related to algorithmic bias, as well as the pitfalls that speech researchers need to be aware of when designing and evaluating new systems.
Reinhold Haeb-Umbach
University of Paderborn
Germany
Petra Wagner
Bielefeld University
Germany
How Neural Network Architectures can Inform Basic Research in Phonetics - and Vice Versa
In the early days of our disciplines, speech technology was strongly inspired by insights and models from speech production and perception. Similarly, speech scientists relied on speech technology to test their own theoretical assumptions, e.g., when using synthetic speech stimuli in listening tests. With the success of neural network architectures in speech technology, this interdisciplinary connection has weakened considerably, as neural architectures lack the phonetic interpretability that appears necessary to link the two disciplines. Without trying to be comprehensive, our talk will provide several examples of research in which the two disciplines continue to work together and profit from one another.
Wenwu Wang
University of Surrey
United Kingdom
Automated Audio Captioning: Audio-Text Cross-Modal Learning
In automated audio captioning, the aim is to provide a meaningful natural-language description of the content of an audio clip. This can be used in a variety of applications, e.g., assisting the hearing-impaired in understanding environmental sounds, facilitating retrieval of multimedia content, and analyzing sounds for security surveillance. To generate text descriptions for an audio clip, it is essential to comprehend the audio events and scenes within the clip, as well as to interpret the textual information presented in natural language. In addition, learning the mapping and alignment between these two streams of information is crucial. In this survey talk, we will give an introduction to this field, including problem description, potential applications, datasets, open challenges, recent technical progress, and possible future research directions.
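As a hedged sketch of one common formulation, the PyTorch skeleton below encodes a mel-spectrogram with a Transformer encoder and generates caption tokens with a Transformer decoder that cross-attends to the audio. The vocabulary and layer sizes are toy placeholders, and positional encodings are omitted for brevity.

import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    """Audio encoder + autoregressive text decoder with cross-attention."""
    def __init__(self, vocab_size: int = 1000, d: int = 256, mel_bands: int = 64):
        super().__init__()
        self.audio_proj = nn.Linear(mel_bands, d)      # mel frames -> model dimension
        self.audio_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)
        self.embed = nn.Embedding(vocab_size, d)       # caption token embeddings
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d, nhead=4, batch_first=True), num_layers=2)
        self.out = nn.Linear(d, vocab_size)

    def forward(self, mel, tokens):
        # mel: (batch, frames, mel_bands); tokens: (batch, caption_len)
        memory = self.audio_enc(self.audio_proj(mel))  # contextualized audio features
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.decoder(self.embed(tokens), memory, tgt_mask=causal)
        return self.out(h)                             # next-token logits

model = CaptionModel()
logits = model(torch.randn(2, 500, 64), torch.randint(0, 1000, (2, 12)))
# Training minimizes cross-entropy between `logits` and the caption shifted by one token;
# inference decodes autoregressively, feeding each predicted token back in.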