Abstract
Multimodal Emotion Recognition in Conversation (MERC) is an important element of human-machine interaction. It allows machines to automatically identify and track the emotional state of speakers during a conversation in a multimodal setting. Such conversations are complex, however, involving audio and visual cues aligned with textual cues. Recent works have integrated the audio and visual modalities with the textual modality to improve emotion recognition in conversation. Although many MERC models leverage textual, audio, and visual modalities, they assume that the speaker’s textual utterance, audio speech, and facial sequences are all present. In practice, a conversation may involve multiple parties, only one of whom is the speaker, and one or more modalities may be unavailable during multiparty conversations. To tackle these issues, we propose the Possible Speaker Informed Multimodal Emotion Recognition in Conversation framework (PSI). PSI is specifically designed to extract the audio (speech) and visual (face) sequences of a possible speaker in the presence of multiple parties. Further, PSI extracts rich unimodal features and fuses them while addressing the unavailability of specific modalities. PSI demonstrates performance competitive with existing state-of-the-art models in experiments on a benchmark dataset.
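The paper does not include code here, so the following is only a minimal PyTorch sketch of the general idea of fusing unimodal features while tolerating missing modalities: unavailable modalities are masked out before averaging in a shared space. The class name `MaskedModalityFusion`, the feature dimensions, and the seven-class output are assumptions made for illustration and are not taken from PSI.

```python
import torch
import torch.nn as nn


class MaskedModalityFusion(nn.Module):
    """Toy fusion over text/audio/visual features with per-modality availability masks."""

    def __init__(self, dims, hidden=256, num_classes=7):
        super().__init__()
        # Project each modality into a shared hidden space,
        # e.g. dims = {"text": 768, "audio": 512, "visual": 512}.
        self.proj = nn.ModuleDict({m: nn.Linear(d, hidden) for m, d in dims.items()})
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, feats, avail):
        # feats[m]: (batch, dims[m]) utterance-level features
        # avail[m]: (batch,) flag, 1 if the modality is present for that utterance, else 0
        fused, count = 0.0, 0.0
        for m, x in feats.items():
            w = avail[m].float().unsqueeze(-1)            # (batch, 1); zeroes out missing modalities
            fused = fused + w * torch.relu(self.proj[m](x))
            count = count + w
        fused = fused / count.clamp(min=1.0)              # mean over the modalities actually available
        return self.classifier(fused)                     # (batch, num_classes) emotion logits


if __name__ == "__main__":
    model = MaskedModalityFusion({"text": 768, "audio": 512, "visual": 512})
    feats = {"text": torch.randn(4, 768), "audio": torch.randn(4, 512), "visual": torch.randn(4, 512)}
    avail = {"text": torch.ones(4), "audio": torch.ones(4), "visual": torch.zeros(4)}  # visual track missing
    print(model(feats, avail).shape)  # torch.Size([4, 7])
```

In this sketch, a caller passes per-modality features together with 0/1 availability flags, for example setting the visual flag to zero for an utterance whose face track could not be extracted; masked averaging is just one simple stand-in for whatever fusion strategy the framework actually uses.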
| Original language | English |
|---|---|
| Title of host publication | 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
| Publisher | IEEE |
| ISBN (Electronic) | 9798350368741 |
| DOIs | |
| Publication status | Published - 7 Mar 2025 |