Learning Speech Audio Representations with Multimodal Self-Supervision

Learning Speech Audio Representations with Multimodal Self-Supervision

David Harwath

March 30, 2023, 4:30-5:30pm, ECEB 3017 or online


Humans learn spoken language and visual perception at an early age by being immersed in the world around them. Why can't computers do the same? In this talk, I will describe our ongoing work to develop methodologies for grounding continuous speech signals at the raw waveform level to natural image scenes. I will first present self-supervised models capable of discovering structure (words and sub-word units) in the speech signal. Instead of conventional annotations, these models learn from correspondences between speech sounds and visual patterns such as objects and textures. Next, I will demonstrate how these models can be leveraged to learn cross-lingual correspondences. Finally, I will show how these representations can be used as a drop-in replacement for text transcriptions in an image captioning system, enabling us to directly synthesize spoken descriptions of images without the need for text as an intermediate representation.


David Harwath is an assistant professor in the computer science department at UT Austin. His research focuses on multimodal, self-supervised learning algorithms for speech, audio, vision, and text. He as received the NSF CAREER award (2023), an ASRU best paper nomination (2015), and was awarded the 2018 George M. Sprowls Award for best computer science PhD thesis at MIT. He holds a B.S. in electrical engineering from UIUC (2010), a S.M. in computer science from MIT (2013), and a Ph.D. in computer science from MIT (2018).