TY - GEN
T1 - Multimodal Engagement Prediction in Human-Robot Interaction Using Transformer Neural Networks
AU - Lim, Jia Yap
AU - See, John
AU - Dondrup, Christian
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
PY - 2025/1/1
Y1 - 2025/1/1
AB - Engagement estimation in human-robot interaction (HRI) remains a challenging task, as gauging user engagement levels is essential to maintaining continuous, spontaneous communication between humans and social robots. Users can, to some extent, pretend to be engaged in a conversation with a robot; hence, it is crucial to analyse other viable cues obtainable from video. Some recent studies have used only a single modality, particularly audio-visual cues, to estimate user engagement. Meanwhile, the use of emotions has not been extensively explored, although it may provide critical information, such as behavioural patterns and facial expressions, that allows for a better understanding of engagement levels. In this paper, we propose a framework that utilises Transformer-based models to demonstrate the effectiveness of a multimodal architecture for engagement prediction in HRI. Experimentation on the UE-HRI dataset, a real-life dataset of users communicating spontaneously with a social robot in a dynamic environment, demonstrated the efficacy of a fully Transformer-based architecture compared to other standard models described in the existing literature. An online-mode assessment showed the feasibility of predicting user engagement in real-time HRI scenarios.
KW - Human-robot interaction
KW - Multimodal framework
KW - Transformer neural networks
KW - User engagement prediction
UR - http://www.scopus.com/inward/record.url?scp=85215949164&partnerID=8YFLogxK
U2 - 10.1007/978-981-96-2074-6_1
DO - 10.1007/978-981-96-2074-6_1
M3 - Conference contribution
AN - SCOPUS:85215949164
SN - 9789819620739
T3 - Lecture Notes in Computer Science
SP - 3
EP - 17
BT - MultiMedia Modeling. MMM 2025
PB - Springer
T2 - 31st International Conference on Multimedia Modeling 2025
Y2 - 8 January 2025 through 10 January 2025
ER -