An overview of speech-visual multimodal learning
Xinsheng Wang, November 2, 2020, 4:00-5:00pm CST
Inspired by human infants’ ability to learn spoken language by listening and attending to concurrent speech and visual scenes, many efforts have been made to learn speech semantic embeddings grounded in visual information. Beyond speech embedding learning, several recent works have been proposed to tackle higher-level cross-modal tasks between speech and images, e.g., speech-to-image generation and image-to-speech generation. This presentation will give a brief overview of speech-visual multimodal learning based on the related literature from Interspeech 2020.