Script: Graph-Structured and Query-Conditioned Semantic Token Pruning for Multimodal Large Language Models

Research output: Contribution to journal › Article › peer-review

Abstract

The rapid growth of visual tokens in multimodal large language models (MLLMs) leads to excessive memory consumption and inference latency, especially when handling high-resolution images and videos. Token pruning mitigates this issue by removing redundant tokens, but existing methods often ignore relevance to the user query or inherit the limitations of attention mechanisms, which reduces their adaptability and effectiveness. To address these challenges, we propose Script, a plug-and-play pruning method that requires no retraining and generalizes across diverse MLLMs. Script comprises two modules: a graph-structured pruning module that removes visually redundant tokens, and a query-conditioned semantic pruning module that preserves query-relevant visual information. Together, they enhance performance on multimodal tasks. Experiments on fourteen benchmarks across image and video understanding tasks show that Script consistently achieves higher model efficiency and predictive accuracy than existing pruning methods. On LLaVA-NeXT-7B, it achieves up to 6.8× prefill speedup and 10× FLOP reduction while retaining 96.88% of the original performance.
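
The abstract does not give algorithmic details, so the following is only a minimal sketch of the general two-stage idea it describes, not Script's actual algorithm. It assumes tokens and the query are plain embedding vectors, builds an implicit similarity graph with a cosine threshold, and then ranks the survivors by similarity to the query. The function names, the greedy redundancy rule, the threshold value, and the choice to run the two stages sequentially are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_redundant(tokens, sim_threshold=0.9):
    """Hypothetical graph-style redundancy removal.

    Two tokens are treated as connected when their cosine similarity is at
    least `sim_threshold`. A token is kept only if it is not connected to
    any token that was already kept (a greedy rule assumed for this sketch).
    Returns the indices of the kept tokens.
    """
    kept = []
    for i, token in enumerate(tokens):
        if all(cosine(token, tokens[j]) < sim_threshold for j in kept):
            kept.append(i)
    return kept


def select_query_relevant(tokens, candidates, query, keep):
    """Hypothetical query-conditioned selection.

    Ranks candidate indices by cosine similarity to the query embedding and
    returns the top `keep` indices in their original order.
    """
    ranked = sorted(candidates, key=lambda i: cosine(tokens[i], query), reverse=True)
    return sorted(ranked[:keep])


def prune(tokens, query, keep, sim_threshold=0.9):
    """Chain the two illustrative stages: redundancy removal, then query ranking."""
    survivors = prune_redundant(tokens, sim_threshold)
    return select_query_relevant(tokens, survivors, query, keep)


# Toy example: token 1 nearly duplicates token 0; the query points along the second axis.
tokens = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.7, 0.7]]
query = [0.0, 1.0]
kept_indices = prune(tokens, query, keep=2)
print(kept_indices)  # [2, 3]
```

In this toy run the near-duplicate token 1 is dropped in the first stage, and the second stage keeps the two remaining tokens most aligned with the query. A real system would operate on model hidden states rather than hand-written vectors.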

Original language: English
Journal: Transactions on Machine Learning Research
Publication status: Published - 22 Nov 2025

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition
  • Artificial Intelligence
