Multimodal Engagement Prediction in Human-Robot Interaction Using Transformer Neural Networks

Jia Yap Lim, John See*, Christian Dondrup

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Engagement estimation in human-robot interaction (HRI) remains a challenging task: maintaining continuous, spontaneous communication between humans and social robots requires gauging user engagement levels. To some extent, users can pretend to be engaged in a conversation with a robot, so it is crucial to analyse other viable cues obtainable from the video. Some recent studies have used only a single modality to estimate user engagement, particularly audio-visual cues. Meanwhile, the use of emotions has not been extensively explored, even though it may provide critical information such as behavioural patterns and facial expressions, allowing for a better understanding of engagement levels. In this paper, we propose a framework that utilises Transformer-based models to demonstrate the effectiveness of a multimodal architecture for engagement prediction in HRI. Experiments on the UE-HRI dataset, a real-life dataset of users communicating spontaneously with a social robot in a dynamic environment, demonstrated the efficacy of a fully Transformer-based architecture compared to other standard models described in the existing literature. An online-mode assessment showed the feasibility of predicting user engagement in real-time HRI scenarios.
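To illustrate the general idea of multimodal Transformer-based fusion described above, the following is a minimal NumPy sketch, not the authors' actual architecture: per-modality feature sequences (here hypothetical audio, visual, and emotion features of an assumed shared dimension) are tagged with modality embeddings, concatenated into one token sequence, passed through a single self-attention layer, and pooled into an engagement score. All dimensions, weights, and the single-head design are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 16  # assumed shared feature dimension after per-modality projection

# Hypothetical per-modality feature sequences: (time steps, features)
audio = rng.normal(size=(10, d))
visual = rng.normal(size=(10, d))
emotion = rng.normal(size=(10, d))

# Fuse modalities: add a (here random) modality embedding to each stream,
# then concatenate along the time axis into one token sequence
mod_emb = rng.normal(size=(3, d)) * 0.1
tokens = np.concatenate(
    [audio + mod_emb[0], visual + mod_emb[1], emotion + mod_emb[2]]
)

# Random stand-ins for learned attention weights
wq, wk, wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
attended = self_attention(tokens, wq, wk, wv)  # shape: (30, d)

# Mean-pool over tokens and apply a linear head to get an engagement score
w_out = rng.normal(size=(d,))
engagement_score = float(attended.mean(axis=0) @ w_out)
```

In a real system each modality would first pass through its own feature extractor and projection, and the attention layer would be a full multi-head Transformer encoder trained on labelled engagement data; the sketch only shows how heterogeneous streams can share one attention-based fusion stage.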

Original language: English
Title of host publication: MultiMedia Modeling. MMM 2025
Publisher: Springer
Pages: 3-17
Number of pages: 15
ISBN (Electronic): 9789819620746
ISBN (Print): 9789819620739
DOIs
Publication status: Published - 1 Jan 2025
Event: 31st International Conference on Multimedia Modeling 2025 - Nara, Japan
Duration: 8 Jan 2025 - 10 Jan 2025

Publication series

Name: Lecture Notes in Computer Science
Volume: 15524
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 31st International Conference on Multimedia Modeling 2025
Abbreviated title: MMM 2025
Country/Territory: Japan
City: Nara
Period: 8/01/25 - 10/01/25

Keywords

  • Human-robot interaction
  • Multimodal framework
  • Transformer neural networks
  • User engagement prediction

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
