Abstract
Retrieval is being redefined by agentic AI, demanding multimodal reasoning beyond conventional similarity-based paradigms. Composed Image Retrieval (CIR) exemplifies this shift, as each query combines a reference image with textual modifications, requiring compositional understanding across modalities. While embedding-based CIR methods have achieved progress, they remain narrow in perspective, capturing limited cross-modal cues and lacking semantic reasoning. To address these limitations, we introduce XR, a training-free multi-agent framework that reframes retrieval as a progressively coordinated reasoning process. It orchestrates three specialised types of agents: imagination agents synthesize target representations through cross-modal generation, similarity agents perform coarse filtering via hybrid matching, and question agents verify factual consistency through targeted reasoning for fine filtering. Through progressive multi-agent coordination, XR iteratively refines retrieval to meet both semantic and visual query constraints, achieving up to a 38% gain over strong training-free and training-based baselines on FashionIQ, CIRR, and CIRCO, while ablations show each agent is essential. Code is available at https://01yzzyu.github.io/xr.github.io/.
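The coarse-to-fine flow described in the abstract (imagine a target, shortlist by similarity, then verify facts) can be illustrated with a toy sketch. This is not the authors' implementation: the `Candidate` class, the functions `imagine_target`, `coarse_filter`, `fine_filter`, and `retrieve`, the two-dimensional embeddings, and the attribute dictionaries are all hypothetical stand-ins chosen for illustration. A real system would presumably use generative and vision-language models at each stage.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Candidate:
    name: str
    embedding: list   # hypothetical stand-in for a learned feature vector
    attributes: dict  # hypothetical stand-in for facts an agent could check


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def imagine_target(reference, modification):
    # Assumption: the "imagined" target is the reference shifted by a
    # modification vector. This only mimics the role of an imagination agent.
    return [r + m for r, m in zip(reference, modification)]


def coarse_filter(target, gallery, k):
    # Similarity stage: keep the k candidates closest to the imagined target.
    ranked = sorted(gallery, key=lambda c: cosine(target, c.embedding), reverse=True)
    return ranked[:k]


def fine_filter(shortlist, required):
    # Question stage: keep only candidates whose attributes match every
    # required fact, preserving the coarse ranking order.
    return [
        c for c in shortlist
        if all(c.attributes.get(key) == value for key, value in required.items())
    ]


def retrieve(reference, modification, required, gallery, k=2):
    target = imagine_target(reference, modification)
    shortlist = coarse_filter(target, gallery, k)
    return fine_filter(shortlist, required)


# Toy gallery: the query is "the same dress, but blue with long sleeves".
gallery = [
    Candidate("red_long", [1.0, 0.0], {"color": "red", "sleeve": "long"}),
    Candidate("blue_short", [1.0, 0.8], {"color": "blue", "sleeve": "short"}),
    Candidate("blue_long", [0.9, 1.0], {"color": "blue", "sleeve": "long"}),
]
result = retrieve(
    reference=[1.0, 0.0],
    modification=[0.0, 1.0],
    required={"color": "blue", "sleeve": "long"},
    gallery=gallery,
)
print([c.name for c in result])
```

In this toy setup the similarity stage alone cannot separate the two blue dresses, since both lie close to the imagined target; the attribute check is what rules out the short-sleeved one. That is the intuition behind pairing coarse matching with a verification step.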
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the ACM Web Conference 2026 |
| Publisher | Association for Computing Machinery |
| Pages | 2071-2082 |
| Number of pages | 12 |
| ISBN (Print) | 9798400723070 |
| Publication status | Published - 12 Apr 2026 |
| Event | ACM Web Conference 2026, Dubai, United Arab Emirates. Duration: 29 Jun 2026 → 3 Jul 2026. https://www2026.thewebconf.org/ |
Conference
| Conference | ACM Web Conference 2026 |
|---|---|
| Abbreviated title | WWW 26 |
| Country/Territory | United Arab Emirates |
| City | Dubai |
| Period | 29/06/26 → 3/07/26 |
Keywords
- Composed Image Retrieval
- Agents
- Cross-modality
Fingerprint
Dive into the research topics of 'XR: Cross-Modal Agents for Composed Image Retrieval'. Together they form a unique fingerprint.