Abstract
In this paper we present a joint audio-video (AV) tracker which can track the active source between two freely moving persons speaking in turn, simulating a meeting scenario, but less constrained. Our tracker differs from existing work in that it requires only a small number of sensors, works when the speaker is not close to the sensors, and relies on simple, yet efficient, inference techniques in AV processing. The system uses audio and video measures of the target position on the ground plane to strengthen the single-modality predictions, which would be weak if taken on their own, as occlusions, clutter, reverberation and speech pauses occur in the test environment. In particular, the inter-microphone signal delays and the target image locations are input to single-modality Bayesian filters, whose proposed likelihoods are multiplied in a Kalman filter to give the final joint AV estimate. Despite the low complexity of the system, results show that the multi-modal tracker does not fail, tolerating video occlusion and intermittent speech (within 50 cm accuracy) in the context of a non-meeting scenario. The system is evaluated on both single-modality and multi-modality tracking, and the performance improvement given by the AV fusion is discussed and quantified, i.e. a 24% improvement over the audio tracker accuracy.
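As a rough illustration of the fusion scheme the abstract describes, the sketch below combines an audio-derived and a video-derived ground-plane position estimate in a single Kalman filter. The constant-velocity state model, the noise covariances and all variable names are illustrative assumptions, not the paper's actual parameters.

```python
# Illustrative sketch (assumed parameters): fusing audio and video
# ground-plane position measurements in one Kalman filter. For Gaussian
# likelihoods, applying the two measurement updates sequentially after a
# single predict step is equivalent to multiplying the two likelihoods.
import numpy as np

dt = 0.04  # frame period, assuming 25 fps

# Constant-velocity state [x, y, vx, vy] (an assumed motion model).
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)  # observe ground-plane position
Q = 0.01 * np.eye(4)                       # process noise (assumed)

def predict(x, P):
    """Propagate state and covariance one time step."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z, R):
    """Fold one position measurement z (covariance R) into the estimate."""
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P

# Assumed single-modality noise: audio localisation noisier than video.
R_audio = np.diag([0.25, 0.25])  # m^2
R_video = np.diag([0.05, 0.05])

x, P = np.zeros(4), np.eye(4)
z_audio = np.array([1.2, 0.9])   # example TDOA-based audio estimate
z_video = np.array([1.0, 1.0])   # example video estimate

x, P = predict(x, P)
x, P = update(x, P, z_audio, R_audio)  # audio likelihood
x, P = update(x, P, z_video, R_video)  # video likelihood
print("fused ground-plane estimate:", x[:2])
```

In this formulation a missing modality (a speech pause or a video occlusion) is handled naturally by simply skipping that update, so the filter coasts on the remaining measurement and the motion model.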
| Original language | English |
| --- | --- |
| Title of host publication | Proc. 9th IET Data Fusion & Target Tracking Conference 2012 (DF&TT'12) |
| Subtitle of host publication | Algorithms and Applications |
| Publisher | Institution of Engineering and Technology |
| Number of pages | 6 |
| Publication status | Published - 6 May 2012 |
Keywords
- multimodal fusion
- audio tracking
- video tracking
- Kalman filtering
- AV occlusion