The sound of crashing waves, the roar of fast-moving
cars — sound conveys important information
about the objects in our surroundings. In this work,
we show that ambient sounds can be used as a
supervisory signal for learning visual models. To
demonstrate this, we train a convolutional neural
network to predict a statistical summary of the
sound associated with a video frame. We show that,
through this process, the network learns a
representation that conveys information about
objects and scenes. We evaluate this representation
on several recognition tasks, finding that its
performance is comparable to that of other
state-of-the-art unsupervised learning
methods. Finally, we show through visualizations
that the network learns units that are selective to
objects that are often associated with
characteristic sounds.
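The abstract describes the setup only at a high level. Below is a minimal, hypothetical sketch of what "predicting a statistical summary of the sound associated with a video frame" could look like, written in a PyTorch style. The network architecture, the band-energy summary used as the target, and all tensor shapes are illustrative assumptions, not the authors' released model or their actual sound statistics.

```python
# Hypothetical sketch: a small CNN regresses from a video frame to a
# summary of the co-occurring audio. Architecture and target statistics
# are assumptions for illustration only.
import torch
import torch.nn as nn

class FramePredictsSound(nn.Module):
    """Maps a video frame to a fixed-length audio summary vector."""
    def __init__(self, num_stats=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, num_stats)

    def forward(self, frame):
        h = self.features(frame).flatten(1)
        return self.head(h)

def sound_summary(waveform, num_stats=64):
    """Hypothetical stand-in for the paper's statistical summary:
    mean magnitude in num_stats frequency bands of a spectrogram."""
    spec = torch.stft(waveform, n_fft=2 * num_stats - 2,
                      return_complex=True).abs()
    return spec.mean(dim=-1)  # shape: (batch, num_stats)

model = FramePredictsSound()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()  # regress onto the audio summary

# One training step on a batch of (frame, waveform) pairs drawn
# from video; random tensors stand in for real data here.
frames = torch.randn(8, 3, 128, 128)   # batch of video frames
waveforms = torch.randn(8, 16000)      # matching 1 s audio clips
targets = sound_summary(waveforms)
loss = loss_fn(model(frames), targets)
loss.backward()
optimizer.step()
```

Whether the objective is posed as regression onto the raw statistics or as classification over a discretized version of them is a design choice the abstract does not pin down; the plain regression above is simply the most direct instantiation of the frame-to-sound mapping it describes.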