Abstract
The post-COVID-19 era has seen continued adoption of, and reliance on, video-based communication, underscoring the need for unobtrusive affect recognition in digital interactions. This paper proposes an efficient multimodal approach to emotion recognition in video conversational scenarios, leveraging linear attention-based Transformer networks to process both visual and audio cues. We explore various linear attention mechanisms, comparing them with classical self-attention. Using the K-EmoCon dataset, we demonstrate that the proposed approach yields competitive performance in predicting the affective states of people in conversation while significantly improving memory efficiency. Our ablation studies reveal that carefully tuned simple fusion methods can match or exceed more complex approaches. This research contributes to developing more accessible and efficient multimodal emotion recognition systems for video-based conversations, with applications in enhancing remote communication and monitoring digital well-being in the post-pandemic era.
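For readers unfamiliar with the memory argument behind linear attention, the sketch below contrasts classical softmax self-attention with one common kernelized formulation (the elu(x) + 1 feature map of Katharopoulos et al., 2020). This is a minimal illustrative example, not the paper's implementation: the paper compares several linear attention variants, and the tensor names and shapes here are assumptions.

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    """Classical self-attention: materializes an n x n score matrix,
    so memory grows quadratically in sequence length n."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    """One kernelized linear-attention variant (Katharopoulos et al., 2020).

    With a positive feature map phi(x) = elu(x) + 1, attention factorizes as
    phi(q) @ (phi(k)^T v); only a d x d summary is stored, so memory grows
    linearly in sequence length n.
    """
    q, k = F.elu(q) + 1.0, F.elu(k) + 1.0
    kv = k.transpose(-2, -1) @ v                 # (..., d, d) summary
    z = q @ k.sum(dim=-2).unsqueeze(-1) + eps    # (..., n, 1) normalizer
    return (q @ kv) / z

# Illustrative shapes: batch of 2, sequence length 1024, head dimension 64.
q = k = v = torch.randn(2, 1024, 64)
out = linear_attention(q, k, v)  # (2, 1024, 64), no n x n matrix materialized
```

The softmax version stores the full n × n attention matrix, whereas the linear version keeps only a d × d key–value summary; this factorization is the general source of the memory savings the abstract refers to.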
Original language | English |
---|---|
Title of host publication | MRAC '24: Proceedings of the 2nd International Workshop on Multimodal and Responsible Affective Computing |
Publisher | Association for Computing Machinery |
Pages | 15-23 |
Number of pages | 9 |
ISBN (Electronic) | 9798400712036 |
DOIs | |
Publication status | Published - 28 Oct 2024 |
Event | 32nd ACM International Conference on Multimedia 2024, Melbourne, Australia, 28 Oct 2024 → 1 Nov 2024 (Conference number: 32), https://icmsaust.com.au/event/acm-international-conference-for-multimedia-2024/ |
Conference
Conference | 32nd ACM International Conference on Multimedia 2024 |
---|---|
Abbreviated title | MM '24 |
Country/Territory | Australia |
City | Melbourne |
Period | 28/10/24 → 1/11/24 |
Internet address | https://icmsaust.com.au/event/acm-international-conference-for-multimedia-2024/ |
Keywords
- multimodal transformers
- linear attention
- affect prediction
- video conversations