Abstract
User engagement prediction in human-robot interaction (HRI) is conducted across diverse environmental settings, both uncontrolled and controlled. These environmental variations require social robots to capture and analyse user behaviours differently. To the best of our knowledge, most prior work relies on video, audio and feature vectors extracted from the UE-HRI (uncontrolled) dataset to estimate user engagement. The existing literature has overlooked the potential of Multimodal Large Language Models (MLLMs) for user engagement prediction in HRI contexts, leaving a critical gap in understanding how these models operate and whether they can improve prediction performance. To address this gap, this paper pioneers an investigation into MLLM efficacy for engagement prediction across environmental settings, using the UE-HRI (uncontrolled) and eHRI (controlled) datasets. We also perform rigorous experiments to identify the factors that most influence MLLM performance, including prompts, model types, model parameters, and keyword extraction strategies.
| Original language | English |
|---|---|
| Title of host publication | HRI Companion '26: Companion Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction |
| Publisher | Association for Computing Machinery |
| Pages | 650-654 |
| Number of pages | 5 |
| ISBN (Print) | 9798400723216 |
| Publication status | Published - 16 Mar 2026 |
Keywords
- Multimodal large language model
- Uncontrolled and controlled human-robot interaction
- User engagement prediction
ASJC Scopus subject areas
- Artificial Intelligence
- Human-Computer Interaction
Title
Stable or Stuck? Understanding MLLM Engagement Prediction in Uncontrolled and Controlled HRI