Audio-Visual Scene Analysis with
Self-Supervised Multisensory Features

Andrew Owens, Alexei A. Efros
UC Berkeley

We apply our self-supervised audio-visual representation to sound localization,
action recognition, and on/off-screen audio-visual source separation.

The thud of a bouncing ball, the onset of speech as lips open: when visual and audio events occur together, it suggests that there might be a common, underlying event that produced both signals. In this paper, we argue that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation. We propose to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned. We use this learned representation for three applications: (a) sound source localization, i.e., visualizing the source of sound in a video; (b) audio-visual action recognition; and (c) on/off-screen audio source separation, e.g., removing the off-screen translator's voice from a foreign official's speech.

Code available!

We learn a fused audio-visual representation through self-supervision, by training a neural network to predict whether audio and visual signals are temporally aligned.
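To make the alignment task concrete, here is a minimal sketch of how training examples for such a classifier might be drawn from a single video. It is an illustration under stated assumptions, not the released code: the function name `sample_alignment_example`, its parameters, the 50/50 label balance, and the choice to build misaligned examples by shifting the audio window within the same video are all assumptions made for this example. This page only states that the network predicts whether audio and video are temporally aligned.

```python
import random


def sample_alignment_example(frames, audio, fps, sample_rate,
                             clip_seconds, min_shift_seconds, rng):
    """Return (frame_clip, audio_clip, label): label 1 = aligned, 0 = shifted.

    Hypothetical helper for illustration; the shifting scheme is an assumption.
    """
    n_frames = int(clip_seconds * fps)
    n_samples = int(clip_seconds * sample_rate)
    duration = min(len(frames) / fps, len(audio) / sample_rate)
    max_start = duration - clip_seconds
    # Require enough room that a sufficiently shifted audio window always exists.
    if max_start < 2 * min_shift_seconds:
        raise ValueError("video too short for the requested shift")

    video_start = rng.uniform(0.0, max_start)
    label = 1 if rng.random() < 0.5 else 0
    if label == 1:
        # Aligned example: audio window starts where the video window starts.
        audio_start = video_start
    else:
        # Misaligned example: pick an audio start at least
        # min_shift_seconds away from the video start.
        intervals = []
        if video_start - min_shift_seconds >= 0.0:
            intervals.append((0.0, video_start - min_shift_seconds))
        if video_start + min_shift_seconds <= max_start:
            intervals.append((video_start + min_shift_seconds, max_start))
        low, high = rng.choice(intervals)
        audio_start = rng.uniform(low, high)

    f0 = int(video_start * fps)
    s0 = int(audio_start * sample_rate)
    return frames[f0:f0 + n_frames], audio[s0:s0 + n_samples], label
```

A classifier trained on such triples gets its labels for free from the video itself, which is what makes the setup self-supervised. Because the shifted audio in this sketch comes from the same video, it would not be possible to solve the task by recognizing content alone, and the timing of the two streams has to be compared.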

Download Paper
Slides: keynote
Poster: pdf
Code here!

Concurrent work

Concurrently with and independently of our work, a number of groups have proposed closely related (and very interesting!) methods for source separation and sound localization. Here is a partial list: