XR: Cross-Modal Agents for Composed Image Retrieval

Research output: Chapter in Book/Report/Conference proceeding › Chapter (peer-reviewed) › peer-review

Abstract

Retrieval is being redefined by agentic AI, demanding multimodal reasoning beyond conventional similarity-based paradigms. Composed Image Retrieval (CIR) exemplifies this shift, as each query combines a reference image with textual modifications, requiring compositional understanding across modalities. While embedding-based CIR methods have achieved progress, they remain narrow in perspective, capturing limited cross-modal cues and lacking semantic reasoning. To address these limitations, we introduce XR, a training-free multi-agent framework that reframes retrieval as a progressively coordinated reasoning process. It orchestrates three specialised types of agents: imagination agents synthesize target representations through cross-modal generation, similarity agents perform coarse filtering via hybrid matching, and question agents verify factual consistency through target reasoning for fine filtering. Through progressive multi-agent coordination, XR iteratively refines retrieval to meet both semantic and visual query constraints, achieving up to a 38% gain over strong training-free and training-based baselines on FashionIQ, CIRR, and CIRCO, while ablations show each agent is essential. Code is available: https://01yzzyu.github.io/xr.github.io/.
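
The coarse-to-fine flow described in the abstract (imagine a target, filter coarsely by similarity, then verify) can be illustrated with a toy sketch. This is not the XR implementation: every name below (`Candidate`, `imagine_target`, `coarse_filter`, `fine_filter`, `retrieve`) is hypothetical, captions stand in for images, word-set Jaccard overlap stands in for hybrid matching, and a keyword check stands in for question-based verification. It also runs a single pass, whereas the abstract describes iterative refinement.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    image_id: str
    caption: str  # assumption: a caption stands in for the image content


def imagine_target(reference_caption: str, modification: str) -> str:
    """Toy imagination step: describe the hypothetical target image."""
    return f"{reference_caption} {modification}"


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets, a stand-in for a real similarity model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def coarse_filter(target: str, gallery: list[Candidate], k: int) -> list[Candidate]:
    """Toy similarity step: keep the k candidates closest to the imagined target."""
    ranked = sorted(gallery, key=lambda c: token_overlap(target, c.caption), reverse=True)
    return ranked[:k]


def fine_filter(candidates: list[Candidate], required_words: list[str]) -> list[Candidate]:
    """Toy question step: move candidates that pass every check to the front."""
    def passes(c: Candidate) -> bool:
        words = set(c.caption.lower().split())
        return all(w in words for w in required_words)

    verified = [c for c in candidates if passes(c)]
    rest = [c for c in candidates if not passes(c)]
    return verified + rest  # relative order inside each group is preserved


def retrieve(reference_caption, modification, gallery, required_words, k=3):
    target = imagine_target(reference_caption, modification)
    shortlist = coarse_filter(target, gallery, k)
    return fine_filter(shortlist, required_words)


gallery = [
    Candidate("a", "red dress long sleeves"),
    Candidate("b", "blue dress short sleeves"),
    Candidate("c", "blue dress long sleeves"),
    Candidate("d", "green hat"),
]
result = retrieve("red dress long sleeves", "blue", gallery, required_words=["blue"])
print([c.image_id for c in result])  # ['c', 'b', 'a']
```

In this toy run the coarse step alone ranks candidate `a` (a near copy of the reference) first, because similarity to the imagined target does not enforce the requested change; the verification step then promotes the candidates that actually satisfy it.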
Original language: English
Title of host publication: Proceedings of the ACM Web Conference 2026
Publisher: Association for Computing Machinery
Pages: 2071-2082
Number of pages: 12
ISBN (Print): 9798400723070
Publication status: Published - 12 Apr 2026
Event: ACM Web Conference 2026 - Dubai, United Arab Emirates
Duration: 29 Jun 2026 → 3 Jul 2026
https://www2026.thewebconf.org/

Conference

Conference: ACM Web Conference 2026
Abbreviated title: WWW 26
Country/Territory: United Arab Emirates
City: Dubai
Period: 29/06/26 → 3/07/26

Keywords

  • Composed Image Retrieval
  • Agents
  • Cross-modality
