Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation

Ruoyu Chen1,2    Xiaoqing Guo3    Kangwei Liu1,2    Siyuan Liang4    Shiming Liu5    Qunli Zhang5    Hua Zhang1    Xiaochun Cao6   
1Institute of Information Engineering, Chinese Academy of Sciences          2University of Chinese Academy of Sciences          
3Hong Kong Baptist University            4National University of Singapore           5Imperial College London          6Sun Yat-sen University
Intro Image

An overview of our proposed Eagle explanation method for multimodal large language models. Eagle attributes which perceptual regions drive the generation (Where MLLMs Attend) and quantifies modality reliance (What They Rely On).

Abstract

Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet, the extent to which generated tokens depend on visual modalities remains poorly understood, limiting interpretability and reliability. In this work, we present Eagle, a lightweight black-box framework for explaining autoregressive token generation in MLLMs. Eagle attributes any selected tokens to compact perceptual regions while quantifying the relative influence of language priors and perceptual evidence. The framework introduces an objective function that unifies sufficiency (insight score) and indispensability (necessity score), optimized via greedy search over sparsified image regions for faithful and efficient attribution. Beyond spatial attribution, Eagle performs modality-aware analysis that disentangles what tokens rely on, providing fine-grained interpretability of model decisions. Extensive experiments across open-source MLLMs show that Eagle consistently outperforms existing methods in faithfulness, localization, and hallucination diagnosis, while requiring substantially less GPU memory. These results highlight its effectiveness and practicality for advancing the interpretability of MLLMs.
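To make the objective concrete, below is a minimal Python sketch of the greedy-search idea: sparsified image regions are added one at a time so as to maximize a combination of a sufficiency (insight) term and an indispensability (necessity) term. The helper token_logprob, the weighting lam, and the exact form of the two terms are illustrative assumptions rather than the paper's precise formulation.

import numpy as np

def greedy_attribution(regions, token_logprob, full_mask, lam=1.0):
    """regions: list of boolean masks (H, W) from image sparsification.
    token_logprob: hypothetical callable(mask) -> log-probability of the
    explained tokens when only the masked pixels are visible.
    Returns regions in the order they were selected, with their scores."""
    selected = np.zeros_like(full_mask, dtype=bool)
    full_score = token_logprob(full_mask)   # score with the whole image visible
    remaining = list(range(len(regions)))
    order = []
    while remaining:
        best_idx, best_score = None, -np.inf
        for i in remaining:
            cand = selected | regions[i]
            insight = token_logprob(cand)                               # sufficiency of the kept regions
            necessity = full_score - token_logprob(full_mask & ~cand)   # drop when they are removed
            score = insight + lam * necessity
            if score > best_score:
                best_idx, best_score = i, score
        selected |= regions[best_idx]
        order.append((best_idx, best_score))
        remaining.remove(best_idx)
    return order

The number of model calls in this sketch grows quadratically with the number of regions, which is one reason a coarse sparsification into a small set of sub-regions keeps such a search tractable.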

Eagle Explanation Method

EAGLE Model

Overview of the proposed Eagle framework. The input image is first sparsified into sub-regions, then attributed via greedy search with the designed objective, and finally analyzed for modality relevance between language priors and perceptual evidence.
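The sparsification step can be approximated with off-the-shelf superpixels; the sketch below uses SLIC from scikit-image purely as an assumed stand-in, since the paper may partition the image differently (e.g., into grid patches). The resulting boolean masks are the regions consumed by the greedy search sketched above.

import numpy as np
from skimage.segmentation import slic

def sparsify_image(image, n_segments=49):
    """image: float array (H, W, 3) in [0, 1].
    Returns one boolean mask per sub-region."""
    segments = slic(image, n_segments=n_segments, compactness=10)
    return [segments == s for s in np.unique(segments)]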

SOTA Results in Faithfulness, Localization, and Hallucination Diagnosis


We evaluate our method on open-source MLLMs, including LLaVA-1.5, Qwen2.5-VL, and InternVL3.5, using the MS COCO and MMVP datasets for image captioning and VQA. On faithfulness metrics, our approach outperforms existing attribution methods (LLaVA-CAM, IGOS++, and TAM) by an average of 20.0% in insertion and 13.4% in deletion for image captioning, and by 20.6% and 8.1% on the same metrics for VQA. At the word level, our method achieves more rational explanations of object tokens, surpassing TAM by 36.42% and 42.63% on the Pointing Game under box-level and mask-level annotations, respectively. Finally, on the RePOPE benchmark for object hallucination, our method accurately localizes the visual elements responsible for hallucinations and mitigates them by removing only a minimal set of interfering regions. These results demonstrate the versatility of our method across diverse tasks and benchmarks.
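For context, insertion and deletion are standard faithfulness curves: pixels are revealed (or erased) in order of attributed importance while the model's score for the explained tokens is tracked, and the area under each curve is reported. The sketch below shows one common way to compute them, with predict a hypothetical scoring callable and the AUC approximated by the curve mean; it is not the paper's evaluation code.

import numpy as np

def insertion_deletion_auc(image, saliency, predict, steps=50, baseline=0.0):
    """image: (H, W, C); saliency: (H, W); predict: hypothetical callable(image)
    -> score of the explained tokens. Returns (insertion_auc, deletion_auc)."""
    order = np.argsort(saliency.ravel())[::-1]            # most important pixels first
    ins_img = np.full_like(image, baseline, dtype=float)  # insertion starts from a blank canvas
    del_img = image.astype(float).copy()                  # deletion starts from the original image
    ins_scores, del_scores = [], []
    for chunk in np.array_split(order, steps):
        ys, xs = np.unravel_index(chunk, saliency.shape)
        ins_img[ys, xs] = image[ys, xs]                   # reveal important pixels
        del_img[ys, xs] = baseline                        # erase important pixels
        ins_scores.append(predict(ins_img))
        del_scores.append(predict(del_img))
    return float(np.mean(ins_scores)), float(np.mean(del_scores))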

🚀 Examples of Sentence-level Explanations for Image Captioning Tasks

LLaVA-CAM often misses key regions and IGOS++ yields redundant maps, while our method highlights critical regions that align closely with visually grounded tokens, producing concise and human-consistent explanations.

Some Examples of Sentence-level Explanations for LLaVA-1.5

sentence llava 1

Some Examples of Sentence-level Explanations for Qwen2.5-VL

sentence qwen 1

Some Examples of Sentence-level Explanations for InternVL3.5

sentence internvl 1

🚀 Examples of Sentence-level Explanations for Visual Question Answering Tasks

As in captioning, LLaVA-CAM often misses key regions and IGOS++ yields redundant maps on VQA, while our method highlights critical regions that align closely with visually grounded tokens, producing concise and human-consistent explanations.

Some Examples of VQA Explanations for LLaVA-1.5

sentence vqa llava 1

Some Examples of VQA Explanations for Qwen2.5-VL

sentence vqa qwen 1

Some Examples of VQA Explanations for InternVL3.5

sentence vqa internvl 1

🚀 Examples of Word-level Explanations for Image Captioning Tasks

For localization, our method achieves the best Pointing Game results under both box- and mask-level settings, confirming that predictions are grounded in specific objects. TAM performs well on stronger models but poorly on LLaVA-1.5, while IGOS++ benefits from overly redundant maps. In contrast, our method yields sparse yet focused highlights that more accurately localize the objects mentioned in captions.
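The Pointing Game check itself is simple: an explanation counts as a hit when the peak of its saliency map falls inside the annotated bounding box (box-level) or segmentation mask (mask-level). A hedged sketch, with argument names chosen for illustration:

import numpy as np

def pointing_game_hit(saliency, box=None, mask=None):
    """saliency: (H, W); box: (x1, y1, x2, y2) in pixel coordinates;
    mask: boolean (H, W). Exactly one of box/mask should be given."""
    y, x = np.unravel_index(np.argmax(saliency), saliency.shape)
    if mask is not None:
        return bool(mask[y, x])
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2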

Some Examples of Word-level Explanations for LLaVA-1.5 7B

word llava 1

Some Examples of Word-level Explanations for Qwen2.5-VL 7B

word qwen 1

Some Examples of Word-level Explanations for InternVL3.5 4B

word internvl 1

🚀 Examples Explaining the Causes of Object Hallucinations in MLLMs

The following examples include the Hallucination Map, where highlighted purple regions indicate areas identified by our method as prone to hallucination. Hallucination Mitigation denotes the minimal region that must be removed to eliminate the hallucination. The curve illustrates how the logit of the ground-truth token changes as hallucination-prone regions are progressively deleted, with the red line marking the deletion point determined by Hallucination Mitigation. Our method rapidly localizes the regions that cause hallucinations.
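The curve can be read as a simple deletion sweep: regions ranked as hallucination-prone are erased one by one while the ground-truth token's logit is tracked, and the mitigation point is the first step at which the hallucination no longer dominates. The sketch below illustrates this reading only; gt_logit, the region ranking, and the stopping rule (logit rising above its full-image value) are assumptions, not the paper's exact criterion.

import numpy as np

def mitigation_curve(image, ranked_regions, gt_logit, baseline=0.0):
    """ranked_regions: boolean masks (H, W), most hallucination-prone first.
    gt_logit: hypothetical callable(image) -> logit of the ground-truth token.
    Returns the logit curve and the first deletion step where the logit exceeds
    its value on the full image (a simplified stand-in for the mitigation point)."""
    img = image.astype(float).copy()
    base = gt_logit(img)
    curve, stop = [base], None
    for step, region in enumerate(ranked_regions, start=1):
        img[region] = baseline                 # erase this hallucination-prone region
        curve.append(gt_logit(img))
        if stop is None and curve[-1] > base:
            stop = step                        # minimal set of regions removed so far
    return curve, stop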

Examples of Explaining Object Hallucinations for LLaVA-1.5

hallucination llava 1

Examples of Explaining Object Hallucinations for LLaVA-1.5

hallucination llava 2

Examples of Explaining Object Hallucinations for LLaVA-1.5

hallucination llava 3

Examples of Explaining Object Hallucinations for LLaVA-1.5

hallucination llava 4

Examples of Explaining Object Hallucinations for LLaVA-1.5

hallucination llava 5

Examples of Explaining Object Hallucinations for LLaVA-1.5

hallucination llava 6

Examples of Explaining Object Hallucinations for Qwen2.5-VL

hallucination qwen 1

Examples of Explaining Object Hallucinations for Qwen2.5-VL

hallucination qwen 2

Examples of Explaining Object Hallucinations for Qwen2.5-VL

hallucination qwen 3

Examples of Explaining Object Hallucinations for Qwen2.5-VL

hallucination qwen 4

Examples of Explaining Object Hallucinations for Qwen2.5-VL

hallucination qwen 5

Examples of Explaining Object Hallucinations for Qwen2.5-VL

hallucination qwen 6

Examples of Explaining Object Hallucinations for InternVL3.5

hallucination internvl 1

Examples of Explaining Object Hallucinations for InternVL3.5

hallucination internvl 2

Examples of Explaining Object Hallucinations for InternVL3.5

hallucination internvl 3

Examples of Explaining Object Hallucinations for InternVL3.5

hallucination internvl 4

Examples of Explaining Object Hallucinations for InternVL3.5

hallucination internvl 5

Examples of Explaining Object Hallucinations for InternVL3.5

hallucination internvl 6

BibTeX

@article{chen2025where,
        title={Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation},
        author={Chen, Ruoyu and Guo, Xiaoqing and Liu, Kangwei and Liang, Siyuan and Liu, Shiming and Zhang, Qunli and Zhang, Hua and Cao, Xiaochun},
        journal={arXiv preprint arXiv:2509.22496},
        year={2025}
}