Abstract
Multimodal models leverage complementary information across modalities to enrich feature representations. While visual information shows potential for representing structure in some combinatorial optimization problems (COPs), its application to complex scheduling problems such as the Flexible Job Shop Scheduling Problem (FJSP) remains underexplored. Current learning-based FJSP solvers predominantly rely on handcrafted state features. This dependence can lead to inconsistencies and may not fully capture the problem's intricate dynamics. Crucially, these methods overlook visual modalities. Visual representations offer a distinct advantage by inherently capturing the global topological structure and complex resource interactions within the FJSP state. Unlike localized handcrafted features, this holistic, structural view provides a richer foundation for understanding scheduling complexity and making informed decisions. To overcome these limitations, we introduce the AO-framework, a multimodal feature fusion approach that enhances handcrafted state features by integrating insights from visual data. Our core contribution is a novel fusion mechanism utilizing orthogonal projection and local attention. Unlike traditional methods that often rely on simple concatenation of visual data, our method reduces redundancy by projecting global image-derived features onto local handcrafted features. This process extracts distinct information inherent to the visual modality, significantly improving the quality and complementarity of the resulting state features and enabling more informed scheduling decisions. To our knowledge, the AO-framework is the first multimodal framework applied to scheduling problems, demonstrating the significant potential of visual information in this domain.
Extensive experiments across various FJSP solvers and datasets confirm that our framework yields substantial enhancements in solution quality, decision-making capabilities, and generalization.
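The redundancy-reduction idea in the abstract can be illustrated with a small sketch. This is not the AO-framework's implementation: the abstract does not specify the projection operator, so the code assumes one common reading, in which the visual feature vector is decomposed against the handcrafted one and only the orthogonal residual (the part the handcrafted features cannot already express) is kept. The function names `orthogonal_residual` and `fuse`, the toy vectors, and the final concatenation step are all hypothetical, and the local attention component is omitted.

```python
def orthogonal_residual(visual, handcrafted, eps=1e-12):
    """Split `visual` into parts parallel and orthogonal to `handcrafted`.

    Returns (parallel, residual) with parallel + residual == visual and
    residual orthogonal to handcrafted. Assumed reading of the abstract,
    not the paper's confirmed operator.
    """
    if len(visual) != len(handcrafted):
        raise ValueError("feature vectors must share one dimension")
    dot_vh = sum(v * h for v, h in zip(visual, handcrafted))
    dot_hh = sum(h * h for h in handcrafted)
    # Guard against a zero handcrafted vector: nothing to project onto.
    scale = dot_vh / dot_hh if dot_hh > eps else 0.0
    parallel = [scale * h for h in handcrafted]
    residual = [v - p for v, p in zip(visual, parallel)]
    return parallel, residual


def fuse(visual, handcrafted):
    """Hypothetical fusion: handcrafted features followed by the residual."""
    _, residual = orthogonal_residual(visual, handcrafted)
    return list(handcrafted) + residual


# Toy feature vectors (illustrative values only).
handcrafted = [1.0, 0.0, 0.0]
visual = [2.0, 3.0, 0.0]
parallel, residual = orthogonal_residual(visual, handcrafted)
fused = fuse(visual, handcrafted)
```

In this toy case the component of `visual` along `handcrafted` is discarded as redundant, and only the remaining direction is appended to the state features. Plain concatenation would instead carry the overlapping component twice.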
| Original language | English |
|---|---|
| Title of host publication | MM '25: Proceedings of the 33rd ACM International Conference on Multimedia |
| Publisher | Association for Computing Machinery |
| Pages | 2496-2505 |
| Number of pages | 10 |
| ISBN (Electronic) | 9798400720352 |
| Publication status | Published - 27 Oct 2025 |
| Event | 33rd ACM International Conference on Multimedia 2025 - Dublin, Ireland. Duration: 27 Oct 2025 → 31 Oct 2025 |
Conference
| Conference | 33rd ACM International Conference on Multimedia 2025 |
|---|---|
| Abbreviated title | MM 2025 |
| Country/Territory | Ireland |
| City | Dublin |
| Period | 27/10/25 → 31/10/25 |
Keywords
- combinatorial optimization
- flexible job-shop scheduling problem
- multimodal fusion
- reinforcement learning
ASJC Scopus subject areas
- Human-Computer Interaction
- Software
- Artificial Intelligence
- Computer Graphics and Computer-Aided Design
Fingerprint
Dive into the research topics of 'Visual-Enhanced Multimodal Framework for Flexible Job Shop Scheduling Problem'. Together they form a unique fingerprint.