Person Tracking via Audio and Video Fusion

Eleonora D'Arca, Neil Robertson, James R. Hopgood

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution



In this paper we present a joint audio-video (AV)
tracker which can track the active source between
two freely moving persons speaking in turn, simulating
a meeting scenario but with fewer constraints. Our
tracker differs from existing work in that it requires
only a small number of sensors, works when the
speaker is not close to the sensors, and relies on simple,
yet efficient, inference techniques in AV processing.
The system uses audio and video measurements
of the target position on the ground plane
to strengthen the single-modality predictions, which
would be weak on their own since occlusions,
clutter, reverberation and speech pauses occur
in the test environment. In particular, the inter-microphone
signal delays and the target image locations
are input to single-modality Bayesian filters,
whose proposed likelihoods are multiplied in a
Kalman filter to give the joint AV final estimate.
Despite the low complexity of the system, results
show that the multi-modal tracker does not fail,
tolerating video occlusion and intermittent speech
(to within 50 cm accuracy) in the context of a non-meeting
scenario. The system is evaluated on both
single-modality and multi-modality tracking,
and the performance improvement given by the
AV fusion is discussed and quantified, i.e. a 24% improvement
in accuracy over the audio-only tracker.
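The abstract's fusion step — multiplying the Gaussian likelihoods proposed by the audio and video filters inside a Kalman filter — can be illustrated with a minimal sketch. The paper itself provides no code, so the state model, noise covariances, and measurement values below are illustrative assumptions; only the idea that sequential Kalman updates implement the product of the two measurement likelihoods comes from the text.

```python
import numpy as np

def kalman_update(x, P, z, R, H):
    """Standard Kalman measurement update for one modality."""
    y = z - H @ x                       # innovation
    S = H @ P @ H.T + R                 # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new

# State: target position (x, y) on the ground plane; both modalities
# are assumed to observe it directly, so H is the identity.
H = np.eye(2)
x = np.zeros(2)                         # prior position estimate
P = np.eye(2) * 4.0                     # broad prior covariance (m^2)

z_audio = np.array([1.2, 0.8])          # position from inter-mic delays (illustrative)
z_video = np.array([1.0, 1.0])          # position from image detection (illustrative)
R_audio = np.eye(2) * 0.5               # audio assumed noisier (reverberation)
R_video = np.eye(2) * 0.1

# Fusing the two Gaussian likelihoods by multiplication is equivalent
# to applying the two measurement updates one after the other.
x, P = kalman_update(x, P, z_audio, R_audio, H)
x, P = kalman_update(x, P, z_video, R_video, H)
```

After both updates the fused estimate lies between the two measurements, weighted toward the less noisy video observation, and the posterior covariance is smaller than either single-modality measurement noise — which is the sense in which the fusion "strengthens" each modality's weak individual prediction.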
Original language: English
Title of host publication: Proc. 9th IET Data Fusion & Target Tracking Conference 2012 (DF&TT'12)
Subtitle of host publication: Algorithms and Applications
Publisher: Institution of Engineering and Technology
Number of pages: 6
Publication status: Published - 6 May 2012


  • multimodal fusion
  • audio tracking
  • video tracking
  • Kalman filtering
  • AV occlusion


