Deep Head Pose: Gaze-Direction Estimation in Multimodal Video

Sankha Subhra Mukherjee, Neil Robertson

Research output: Contribution to journal › Article › peer-review

148 Citations (Scopus)
299 Downloads (Pure)


In this paper we present a convolutional neural network (CNN)-based model for human head-pose estimation in low-resolution multimodal RGB-D data. We pose the problem as one of classifying human gaze direction. We then fine-tune a regressor based on the learned deep classifier, and combine the two models (classification and regression) to estimate an approximate regression confidence. We present state-of-the-art results on datasets that range from high-resolution human-robot interaction data (close-up faces plus depth information) to challenging low-resolution outdoor surveillance data. Building on this robust head-pose estimation, we introduce a new visual attention model to recover interaction with the environment. Using this probabilistic model, we show that many higher-level scene-understanding tasks, such as human-human and human-scene interaction detection, can be achieved. Our solution runs in real time on commercial hardware.
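The abstract does not say how the classifier and the regressor are combined, so the following is only an illustrative sketch of one plausible reading, not the authors' method. It assumes the classifier outputs logits over equal-width gaze-direction bins covering 360 degrees and takes the probability of the bin containing the regressed angle as an approximate confidence. The names `bin_index`, `softmax`, and `regression_confidence`, and the equal-width binning itself, are hypothetical.

```python
import math


def bin_index(angle_deg, n_bins):
    """Map an angle in degrees to one of n_bins equal-width direction classes.

    Assumption: bins are equal-width and bin 0 starts at 0 degrees.
    """
    width = 360.0 / n_bins
    # The final modulo guards against floating-point wrap-around at 360 degrees.
    return int((angle_deg % 360.0) // width) % n_bins


def softmax(logits):
    """Numerically stable softmax over a list of classifier logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def regression_confidence(class_logits, regressed_angle_deg):
    """Probability the classifier assigns to the bin holding the regressed angle.

    Illustrative only: a high value means the two models agree on direction.
    """
    probs = softmax(class_logits)
    return probs[bin_index(regressed_angle_deg, len(probs))]
```

Under these assumptions, a regressed angle that falls in the bin the classifier favours receives a confidence near 1, while an angle in a bin the classifier considers unlikely receives a confidence near 0.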
Original language: English
Pages (from-to): 2094-2107
Number of pages: 14
Journal: IEEE Transactions on Multimedia
Issue number: 11
Early online date: 28 Sept 2015
Publication status: Published - Nov 2015


  • Convolutional neural networks (CNNs)
  • deep learning
  • gaze direction
  • head-pose
  • RGB-D

ASJC Scopus subject areas

  • Electrical and Electronic Engineering
  • Signal Processing
  • Media Technology
  • Computer Science Applications

