An Analysis of Visually Grounded Instructions in Embodied AI Tasks

Marco Grazioso*, Alessandro Suglia

*Corresponding author for this work

Research output: Contribution to journal, Conference article, peer-review



Thanks to deep learning models that can learn from Internet-scale corpora, we have observed tremendous advances in both text-only and multi-modal tasks such as question answering and image captioning. However, real-world tasks require agents that are embodied in the environment and can collaborate with humans by following language instructions. In this work, we focus on ALFRED, a large-scale instruction-following dataset proposed to develop artificial agents that can execute both navigation and manipulation actions in 3D simulated environments. We present a new Natural Language Understanding component for embodied agents, as well as an in-depth error analysis of model failures on this challenge that goes beyond the success-rate performance that has been driving progress on this benchmark. Furthermore, we provide the research community with important directions for future work in this field, which are essential for developing collaborative embodied agents.

Original language: English
Article number: 13
Journal: CEUR Workshop Proceedings
Publication status: Published - 22 Dec 2023
Event: 9th Italian Conference on Computational Linguistics 2023 - Venice, Italy
Duration: 30 Nov 2023 to 2 Dec 2023


Keywords

  • deep learning
  • embodied AI
  • situated interaction
  • visual grounding

ASJC Scopus subject areas

  • Computer Science (all)

