Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation

Ruoyu Chen1,2    Xiaoqing Guo3    Kangwei Liu1,2    Siyuan Liang4    Shiming Liu5    Qunli Zhang5    Hua Zhang1    Xiaochun Cao6   
1Institute of Information Engineering, Chinese Academy of Sciences          2University of Chinese Academy of Sciences          
3Hong Kong Baptist University            4National University of Singapore           5Imperial College London          6Sun Yat-sen University
Intro Image

An overview of our proposed Eagle explanation method for multimodal large language models. Eagle attributes which perceptual regions drive the generation (Where MLLMs Attend) and quantifies modality reliance (What They Rely On).

Abstract

Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet, the extent to which generated tokens depend on visual modalities remains poorly understood, limiting interpretability and reliability. In this work, we present Eagle, a lightweight black-box framework for explaining autoregressive token generation in MLLMs. Eagle attributes any selected tokens to compact perceptual regions while quantifying the relative influence of language priors and perceptual evidence. The framework introduces an objective function that unifies sufficiency (insight score) and indispensability (necessity score), optimized via greedy search over sparsified image regions for faithful and efficient attribution. Beyond spatial attribution, Eagle performs modality-aware analysis that disentangles what tokens rely on, providing fine-grained interpretability of model decisions. Extensive experiments across open-source MLLMs show that Eagle consistently outperforms existing methods in faithfulness, localization, and hallucination diagnosis, while requiring substantially less GPU memory. These results highlight its effectiveness and practicality for advancing the interpretability of MLLMs.
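To make the objective concrete, below is a minimal Python sketch of the greedy-search idea: sparsified image regions are added one at a time so as to maximize a combination of a sufficiency (insight) term and an indispensability (necessity) term. The helper token_logprob, the weighting lam, and the exact form of the two terms are illustrative assumptions rather than the paper's precise formulation.

import numpy as np

def greedy_attribution(regions, token_logprob, full_mask, lam=1.0):
    """regions: list of boolean masks (H, W) from image sparsification.
    token_logprob: hypothetical callable(mask) -> log-probability of the
    explained tokens when only the masked pixels are visible.
    Returns regions in the order they were selected, with their scores."""
    selected = np.zeros_like(full_mask, dtype=bool)
    full_score = token_logprob(full_mask)   # score with the whole image visible
    remaining = list(range(len(regions)))
    order = []
    while remaining:
        best_idx, best_score = None, -np.inf
        for i in remaining:
            cand = selected | regions[i]
            insight = token_logprob(cand)                               # sufficiency of the kept regions
            necessity = full_score - token_logprob(full_mask & ~cand)   # drop when they are removed
            score = insight + lam * necessity
            if score > best_score:
                best_idx, best_score = i, score
        selected |= regions[best_idx]
        order.append((best_idx, best_score))
        remaining.remove(best_idx)
    return order

The number of model calls in this sketch grows quadratically with the number of regions, which is one reason a coarse sparsification into a small set of sub-regions keeps such a search tractable.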

Eagle Explanation Method

EAGLE Model

Overview of the proposed Eagle framework. The input image is first sparsified into sub-regions, then attributed via greedy search with the designed objective, and finally analyzed for modality relevance between language priors and perceptual evidence.
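The sparsification step can be approximated with off-the-shelf superpixels; the sketch below uses SLIC from scikit-image purely as an assumed stand-in, since the paper may partition the image differently (e.g., into grid patches). The resulting boolean masks are the regions consumed by the greedy search sketched above.

import numpy as np
from skimage.segmentation import slic

def sparsify_image(image, n_segments=49):
    """image: float array (H, W, 3) in [0, 1].
    Returns one boolean mask per sub-region."""
    segments = slic(image, n_segments=n_segments, compactness=10)
    return [segments == s for s in np.unique(segments)]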

SOTA Results in Faithfulness, Localization, and Hallucination Diagnosis


We evaluate our method on open-source MLLMs, including LLaVA-1.5, Qwen2.5-VL, and InternVL3.5, using the MS COCO and MMVP datasets for image captioning and VQA. On faithfulness metrics, our approach outperforms existing attribution methods (LLaVA-CAM, IGOS++, and TAM) by an average of 20.0% in insertion and 13.4% in deletion for image captioning, and by 20.6% and 8.1% on the same metrics for VQA. At the word level, our method achieves more rational explanations of object tokens, surpassing TAM by 36.42% and 42.63% on the Pointing Game under box-level and mask-level annotations, respectively. Finally, on the RePOPE benchmark for object hallucination, our method accurately localizes the visual elements responsible for hallucinations and mitigates them by removing only a minimal set of interfering regions. These results demonstrate the versatility of our method across diverse tasks and benchmarks.
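For context, insertion and deletion are standard faithfulness curves: pixels are revealed (or erased) in order of attributed importance while the model's score for the explained tokens is tracked, and the area under each curve is reported. The sketch below shows one common way to compute them, with predict a hypothetical scoring callable and the AUC approximated by the curve mean; it is not the paper's evaluation code.

import numpy as np

def insertion_deletion_auc(image, saliency, predict, steps=50, baseline=0.0):
    """image: (H, W, C); saliency: (H, W); predict: hypothetical callable(image)
    -> score of the explained tokens. Returns (insertion_auc, deletion_auc)."""
    order = np.argsort(saliency.ravel())[::-1]            # most important pixels first
    ins_img = np.full_like(image, baseline, dtype=float)  # insertion starts from a blank canvas
    del_img = image.astype(float).copy()                  # deletion starts from the original image
    ins_scores, del_scores = [], []
    for chunk in np.array_split(order, steps):
        ys, xs = np.unravel_index(chunk, saliency.shape)
        ins_img[ys, xs] = image[ys, xs]                   # reveal important pixels
        del_img[ys, xs] = baseline                        # erase important pixels
        ins_scores.append(predict(ins_img))
        del_scores.append(predict(del_img))
    return float(np.mean(ins_scores)), float(np.mean(del_scores))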

🚀 Examples of Sentence-level Explanations for Image Captioning Tasks

LLaVA-CAM often misses key regions and IGOS++ yields redundant maps, while our method highlights critical regions that align closely with visually grounded tokens, producing concise and human-consistent explanations.

Some Examples of Sentence-level Explanations for LLaVA-1.5

sentence llava 1

Some Examples of Sentence-level Explanations for Qwen2.5-VL

sentence qwen 1

Some Examples of Sentence-level Explanations for InternVL3.5

sentence internvl 1

🚀 Examples of Sentence-level Explanations for Visual Question Answering Tasks

As in captioning, LLaVA-CAM often misses key regions and IGOS++ yields redundant maps on VQA, while our method highlights critical regions that align closely with visually grounded tokens, producing concise and human-consistent explanations.

Some Examples of VQA Explanations for LLaVA-1.5

sentence vqa llava 1

Some Examples of VQA Explanations for Qwen2.5-VL

sentence vqa qwen 1

Some Examples of VQA Explanations for InternVL3.5

sentence vqa internvl 1

🚀 Examples of Word-level Explanations for Image Captioning Tasks

For localization, our method achieves the best Pointing Game results under both box- and mask-level settings, confirming that predictions are grounded in specific objects. TAM performs well on stronger models but poorly on LLaVA-1.5, while IGOS++ benefits from overly redundant maps. In contrast, our method yields sparse yet focused highlights that more accurately localize the objects mentioned in captions.
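The Pointing Game check itself is simple: an explanation counts as a hit when the peak of its saliency map falls inside the annotated bounding box (box-level) or segmentation mask (mask-level). A hedged sketch, with argument names chosen for illustration:

import numpy as np

def pointing_game_hit(saliency, box=None, mask=None):
    """saliency: (H, W); box: (x1, y1, x2, y2) in pixel coordinates;
    mask: boolean (H, W). Exactly one of box/mask should be given."""
    y, x = np.unravel_index(np.argmax(saliency), saliency.shape)
    if mask is not None:
        return bool(mask[y, x])
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2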

Some Examples of Word-level Explanations for LLaVA-1.5 7B

word llava 1

Some Examples of Word-level Explanations for Qwen2.5-VL 7B

word qwen 1

Some Examples of Word-level Explanations for InternVL3.5 4B

word internvl 1

🚀 Examples Explaining the Causes of Object Hallucinations in MLLMs

The following examples include the Hallucination Map, where highlighted purple regions indicate areas identified by our method as prone to hallucination. Hallucination Mitigation denotes the minimal region that must be removed to eliminate the hallucination. The curve illustrates how the logit of the ground-truth token changes as hallucination-prone regions are progressively deleted, with the red line marking the deletion point determined by Hallucination Mitigation. Our method rapidly localizes the regions that cause hallucinations.
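The curve can be read as a simple deletion sweep: regions ranked as hallucination-prone are erased one by one while the ground-truth token's logit is tracked, and the mitigation point is the first step at which the hallucination no longer dominates. The sketch below illustrates this reading only; gt_logit, the region ranking, and the stopping rule (logit rising above its full-image value) are assumptions, not the paper's exact criterion.

import numpy as np

def mitigation_curve(image, ranked_regions, gt_logit, baseline=0.0):
    """ranked_regions: boolean masks (H, W), most hallucination-prone first.
    gt_logit: hypothetical callable(image) -> logit of the ground-truth token.
    Returns the logit curve and the first deletion step where the logit exceeds
    its value on the full image (a simplified stand-in for the mitigation point)."""
    img = image.astype(float).copy()
    base = gt_logit(img)
    curve, stop = [base], None
    for step, region in enumerate(ranked_regions, start=1):
        img[region] = baseline                 # erase this hallucination-prone region
        curve.append(gt_logit(img))
        if stop is None and curve[-1] > base:
            stop = step                        # minimal set of regions removed so far
    return curve, stop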

Examples of Explaining Object Hallucinations for LLaVA-1.5

hallucination llava 1

Examples of Explaining Object Hallucinations for LLaVA-1.5

hallucination llava 2

Examples of Explaining Object Hallucinations for LLaVA-1.5

hallucination llava 3

Examples of Explaining Object Hallucinations for LLaVA-1.5

hallucination llava 4

Examples of Explaining Object Hallucinations for LLaVA-1.5

hallucination llava 5

Examples of Explaining Object Hallucinations for LLaVA-1.5

hallucination llava 6

Examples of Explaining Object Hallucinations for Qwen2.5-VL

hallucination qwen 1

Examples of Explaining Object Hallucinations for Qwen2.5-VL

hallucination qwen 2

Examples of Explaining Object Hallucinations for Qwen2.5-VL

hallucination qwen 3

Examples of Explaining Object Hallucinations for Qwen2.5-VL

hallucination qwen 4

Examples of Explaining Object Hallucinations for Qwen2.5-VL

hallucination qwen 5

Examples of Explaining Object Hallucinations for Qwen2.5-VL

hallucination qwen 6

Examples of Explaining Object Hallucinations for InternVL3.5

hallucination internvl 1

Examples of Explaining Object Hallucinations for InternVL3.5

hallucination internvl 2

Examples of Explaining Object Hallucinations for InternVL3.5

hallucination internvl 3

Examples of Explaining Object Hallucinations for InternVL3.5

hallucination internvl 4

Examples of Explaining Object Hallucinations for InternVL3.5

hallucination internvl 5

Examples of Explaining Object Hallucinations for InternVL3.5

hallucination internvl 6

BibTeX

@article{chen2025where,
        title={Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation},
        author={Chen, Ruoyu and Guo, Xiaoqing and Liu, Kangwei and Liang, Siyuan and Liu, Shiming and Zhang, Qunli and Zhang, Hua and Cao, Xiaochun},
        journal={arXiv preprint arXiv:2509.22496},
        year={2025}
}