An overview of speech-visual multimodal learning
Xinsheng Wang, November 2, 2020, 4:00-5:00pm CST
Inspired by human infants’ ability to learn spoken language by listening and attending to concurrent speech and visual scenes, many efforts have been made to learn speech semantic embeddings grounded in visual information. Beyond speech embedding learning, several recent works have been proposed to tackle higher-level cross-modal tasks between speech and images, e.g., speech-to-image generation and image-to-speech generation. This presentation will give a brief overview of speech-visual multimodal learning based on the related literature from Interspeech 2020.