Abstract
In this paper we present a convolutional neural network (CNN)-based model for human head pose estimation in low-resolution multi-modal RGB-D data. We pose the problem as one of classifying human gaze direction. We then fine-tune a regressor based on the learned deep classifier, and combine the two models (classification and regression) to estimate an approximate regression confidence. We present state-of-the-art results on datasets that range from high-resolution human-robot interaction data (close-up faces plus depth information) to challenging low-resolution outdoor surveillance data. Building on our robust head-pose estimation, we introduce a new visual attention model to recover interaction with the environment. Using this probabilistic model, we show that many higher-level scene understanding tasks, such as human-human and human-scene interaction detection, can be achieved. Our solution runs in real time on commercial hardware.
| Original language | English |
|---|---|
| Pages (from-to) | 2094-2107 |
| Number of pages | 14 |
| Journal | IEEE Transactions on Multimedia |
| Volume | 17 |
| Issue number | 11 |
| Early online date | 28 Sept 2015 |
| Publication status | Published - Nov 2015 |
Keywords
- Convolutional neural networks (CNNs)
- deep learning
- gaze direction
- head-pose
- RGB-D
ASJC Scopus subject areas
- Electrical and Electronic Engineering
- Signal Processing
- Media Technology
- Computer Science Applications
Fingerprint
Dive into the research topics of 'Deep Head Pose: Gaze-Direction Estimation in Multimodal Video'. Together they form a unique fingerprint.