Abstract
In this paper we present a convolutional neural network (CNN)-based model for human head-pose estimation in low-resolution multi-modal RGB-D data. We pose the problem as classification of human gaze direction, then fine-tune a regressor based on the learned deep classifier. We combine the two models (classification and regression) to estimate an approximate regression confidence. We present state-of-the-art results on datasets ranging from high-resolution human-robot interaction data (close-up faces plus depth information) to challenging low-resolution outdoor surveillance data. Building on our robust head-pose estimation, we introduce a new visual attention model to recover interaction with the environment. Using this probabilistic model, we show that many higher-level scene understanding tasks, such as human-human and human-scene interaction detection, can be achieved. Our solution runs in real time on commercial hardware.
Original language | English |
---|---|
Pages (from-to) | 2094-2107 |
Number of pages | 14 |
Journal | IEEE Transactions on Multimedia |
Volume | 17 |
Issue number | 11 |
Early online date | 28 Sept 2015 |
Publication status | Published - Nov 2015 |
Keywords
- Convolutional neural networks (CNNs)
- deep learning
- gaze direction
- head-pose
- RGB-D
ASJC Scopus subject areas
- Electrical and Electronic Engineering
- Signal Processing
- Media Technology
- Computer Science Applications