Abstract
Thanks to deep learning models that can learn from Internet-scale corpora, we have observed tremendous advances in both text-only and multi-modal tasks such as question answering and image captioning. However, real-world tasks require agents that are embodied in the environment and can collaborate with humans by following language instructions. In this work, we focus on ALFRED, a large-scale instruction-following dataset proposed to develop artificial agents that can execute both navigation and manipulation actions in 3D simulated environments. We present a new Natural Language Understanding component for embodied agents as well as an in-depth error analysis of the model failures for this challenge, going beyond the success-rate performance that has been driving progress on this benchmark. Furthermore, we provide the research community with important directions for future work in this field, which are essential to developing collaborative embodied agents.
Original language | English |
---|---|
Article number | 13 |
Journal | CEUR Workshop Proceedings |
Volume | 3596 |
Publication status | Published - 22 Dec 2023 |
Event | 9th Italian Conference on Computational Linguistics 2023, Venice, Italy (30 Nov 2023 → 2 Dec 2023) |
Keywords
- deep learning
- embodied AI
- situated interaction
- visual grounding
ASJC Scopus subject areas
- General Computer Science