Cross-modal Information Flow in Multimodal Large Language Models
Zhi Zhang*, Srishti Yadav*†, Fengze Han‡, Ekaterina Shutova*
*ILLC, University of Amsterdam, Netherlands
†Dept. of Computer Science, University of Copenhagen, Denmark
‡Dept. of Computer Engineering, Technical University of Munich, Germany
zhangzhizz2626@gmail.com, srya@di.ku.dk, fengze.han@tum.de, e.shutova@uva.nl
Abstract
The recent advancements in auto-regressive multimodal large language models (MLLMs) have demonstrated promising progress for vision-language tasks. While there exists a variety of studies investigating the processing of linguistic information within large language models, little is currently known about the inner working mechanism of MLLMs and how linguistic and visual information interact within these models. In this study, we aim to fill this gap by examining the information flow between different modalities (language and vision) in MLLMs, focusing on visual question answering. Specifically, given an image-question pair as input, we investigate where in the model and how the visual and linguistic information are combined to generate the final prediction. Conducting experiments with a series of models from the LLaVA series, we find that there are two distinct stages in the process of integration of the two modalities. In the lower layers, the model first transfers the more general visual features of the whole image into the representations of (linguistic) question tokens. In the middle layers, it once again transfers visual information about specific objects relevant to the question to the respective token positions of the question. Finally, in the higher layers, the resulting multimodal representation is propagated to the last position of the input sequence for the final prediction. Overall, our findings provide a new and comprehensive perspective on the spatial and functional aspects of image and language processing in MLLMs, thereby facilitating future research into multimodal information localization and editing.
(Figure 1 example: given an image and the question "Are the blinds up or down?", the assistant answers "Up".)

Figure 1. Illustration of the internal mechanism of MLLMs when solving multimodal tasks. From bottom to top layers, the model first propagates general visual information from the whole image into the linguistic hidden representations; next, selected visual information relevant to answering the question is transferred to the linguistic representation; finally, the integrated multimodal information within the hidden representation of the question flows to the last position, facilitating the final prediction. In addition, the answer is initially generated in lowercase form and its first letter is then converted to uppercase.
1. Introduction

Multimodal large language models (MLLMs) [5, 11, 24, 27, 28] have demonstrated notable performance across a wide range of vision-language tasks, which is largely attributed to the combination of powerful auto-regressive large language models [39, 40, 44, 47] and visual encoders [13, 16, 35]. Specifically, LLMs generate responses based on both visual and linguistic inputs, where visual representations extracted from an image encoder precede the word embeddings in the input sequence. Despite the successful performance and wide applicability of MLLMs, there is still a lack of understanding of their internal working mechanisms at play when solving multimodal tasks. Acquiring deeper insights into these mechanisms could not only enhance the interpretability and transparency [31, 33] of these models but also pave the way for developing more efficient and robust models for multimodal interactions.
Some initial studies have begun to explore the internal states corresponding to external behaviors of MLLMs, focusing on specific aspects such as information storage in the model's parameters [6], the reflection of undesirable content generation in the logit distributions of the generated tokens [46], the localization and evolution of object-related visual information [32, 34, 37], the localization of safety mechanisms [43] and the reduction of redundant visual tokens [45]. However, the information flow between the two modalities within MLLMs remains poorly understood, prompting our main question: Where in the model and how is visual and linguistic information integrated within auto-regressive MLLMs to generate the final prediction in vision-language tasks?

To address this question, we investigate the interaction of the different modalities by locating and analyzing the information flow [15] between them across different layers. Our focus is on the task of visual question answering (VQA), a popular multimodal task, where the answer is generated by the MLLM based on the input image and the corresponding question. Specifically, we aim to reverse-engineer the information flow between the two modalities at inference time, by selectively inhibiting specific attention patterns between tokens corresponding to the visual and linguistic inputs and by observing the resulting changes in the performance of answer prediction.
In modern auto-regressive MLLMs, which employ a decoder-only Transformer architecture [41], the attention layer is the sole module enabling communication between hidden representations corresponding to different positions of the input. To inhibit cross-modal information flow, we therefore adopt the attention knockout approach proposed by Geva et al. [19]. We use it to block attention edges connecting different types of hidden representations (e.g. image and question) at specific transformer layers.

We apply this method to a range of MLLMs from the LLaVA series, including LLaVA-1.5-7b, LLaVA-1.5-13b [27], LLaVA-v1.6-Vicuna-7b [28] and Llama3-LLaVA-NEXT-8b [2], and a number of diverse question types in VQA, as shown in Table 1. Our experiments focus on the following research questions: (1) How is the (more general) visual information from the whole image fused with the linguistic information in the question? (2) How is the more targeted visual information (i.e. specific image regions directly relevant to answering the question) integrated with the linguistic information from the question? and (3) In what ways do the linguistic and visual components of the input contribute to the final answer prediction? To answer these questions, we conduct a series of experiments, blocking information flow between (1) the input positions corresponding to the whole image and the different parts of the question; (2) the input positions corresponding to image regions containing objects relevant to answering the question and the question; and (3) the input positions corresponding to the image and the question and the final prediction, across different layers of the MLLM.

Our results reveal that in MLLMs, visual information undergoes a two-stage integration into the language representation within the lower-to-middle layers: first in a comprehensive manner, and subsequently in a more targeted fashion. This integrated multimodal representation is then propagated to the hidden representations in the subsequent layers, ultimately reaching the last position for generating an accurate response. A visualization of this mechanism is shown in Figure 1. To the best of our knowledge, ours is the first paper to elucidate the information flow between the two modalities in auto-regressive MLLMs. It thus contributes to enhancing the transparency of these models and provides novel and valuable insights for their development.
2. Related work
MLLMs  Multimodal large language models have demonstrated remarkable performance across a wide range of vision-language tasks, which is largely attributed to the development of auto-regressive large language models. Representative MLLMs [5, 11, 24-28] consist of an image encoder [13, 16, 35] and a powerful decoder-only large language model [39, 40, 44, 47]. The visual and linguistic information is integrated within the original LLM. In this paper, we investigate the inner working mechanism of multimodal information processing in these models.
Interpretability of multimodal models  The interpretability of multimodal models has attracted a great deal of attention in the research community. The works in [7, 17] treat the model as a black box, analyzing input-output relationships to interpret model behavior, such as comparing the importance of different modalities [7] and the different modalities' contributions to visual or textual tasks [17]. The works in [3, 8, 29, 38] aim to explain predictions by tracing outputs to specific input contributions for a single sample, including by merging attention scores [3, 38], using gradient-based methods [8] or model disentanglement [29]. Additionally, some works [9, 20, 36] adopt a top-down approach, probing learned representations to uncover high-level concepts, such as visual semantics [9], verb understanding [20], and shape and size [36]. In contrast, our work focuses on the model's internal processing mechanisms when solving multimodal tasks.
Mechanistic interpretability of MLLMs  Mechanistic interpretability [31, 33] is an emerging research area in NLP, aiming to reverse-engineer the detailed computations within neural networks. While it has gained traction in NLP, research in the multimodal domain remains limited. Palit et al. [34] introduced a causal tracing tool for image-conditioned text generation on BLIP [23], marking one of the few early efforts in this area. Several initial studies have started to explore the internal states of MLLMs by linking external behaviours to specific mechanisms, such as information storage in model parameters [6], undesirable content generation reflected in the logit distributions of the first generated token [46], localization and evolution of object-related visual information [32, 34, 37], safety mechanism localization [43], and the reduction of redundant visual tokens [45]. However, research offering a comprehensive understanding of the internal mechanisms behind multimodal information integration in MLLMs is still lacking. This paper makes an important first step towards filling this gap.

Figure 2. The typical architecture of a multimodal large language model. It consists of an image encoder and a decoder-only large language model in which the multimodal information is integrated. We omit the projection matrix for the visual patch features as it is nonessential for our analysis.
3. Tracing information flow in MLLMs

The focus of this paper is on auto-regressive multimodal large language models, which consist of an image encoder and a decoder-only language model, as shown in Figure 2. The image encoder transforms images into representations that the language model can take as input, while the language model integrates these visual cues with any provided text, generating responses one word at a time. Often, these components are initialized from a pre-trained image encoder (e.g. CLIP-ViT-L-336px [35]) and a large language model (e.g. Llama 2 [40]), respectively. Since the interaction between modalities only occurs in the decoder-only transformer, our analysis centers around it and we refer to it as the MLLM for brevity unless otherwise specified.

3.1. Background: MLLMs

Input  The input to an MLLM typically comprises image and text features, with the image features being initially extracted from an image encoder and the text being encoded through word embeddings. Formally, an image $x$ is evenly split into fixed-size patches and encoded by an image encoder to obtain $N_V$ visual patch features $V = [v_i]_{i=1}^{N_V}$, $v_i \in \mathbb{R}^d$. Similarly, the text $t$, consisting of $N_T$ tokens, is embedded into representations through a lookup table of word embeddings, resulting in the text input $T = [t_i]_{i=1}^{N_T}$, $t_i \in \mathbb{R}^d$. By concatenating $V$ and $T$, the multimodal input sequence $I = [v_1, \dots, v_{N_V}, t_1, \dots, t_{N_T}] \in \mathbb{R}^{N \times d}$, where $N = N_V + N_T$, is fed into the MLLM.
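To make the input construction concrete, the following minimal PyTorch sketch (not the authors' code; the toy dimensions and the assumption that the visual features are already projected to the LLM hidden size are ours) concatenates visual patch features and word embeddings into a single multimodal input sequence.

```python
import torch

# Toy dimensions (assumptions for illustration): N_V visual patches,
# N_T text tokens, LLM hidden size d.
N_V, N_T, d = 576, 32, 4096

# Visual patch features from the image encoder, assumed to be already
# projected to the LLM hidden size d (the projection matrix is omitted,
# as in Figure 2).
V = torch.randn(N_V, d)           # [N_V, d]

# Text token embeddings looked up from the LLM word-embedding table.
T = torch.randn(N_T, d)           # [N_T, d]

# Multimodal input sequence I: visual tokens precede the text tokens.
I = torch.cat([V, T], dim=0)      # [N, d] with N = N_V + N_T
assert I.shape == (N_V + N_T, d)
```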
Hidden representation  The input sequence is fed into the MLLM, where the hidden representation at each token position is encoded across $L$ transformer layers. Each layer primarily consists of two modules: a masked multi-head attention (MHAT) followed by a fully connected feed-forward network (FFN) [41]. For conciseness, we have excluded the bias terms and layer normalization, as they are not crucial for our analysis. Formally, the hidden representation $h_i^\ell \in \mathbb{R}^d$ at position $i$ of the input sequence at layer $\ell$ can be expressed as

$$h_i^\ell = h_i^{\ell-1} + a_i^\ell + f_i^\ell, \quad (1)$$

where $a_i^\ell \in \mathbb{R}^d$ and $f_i^\ell \in \mathbb{R}^d$ are the outputs of the MHAT and FFN modules at layer $\ell$, respectively, and $h_i^0$ is the vector at position $i$ of the input $I$. All hidden representations at layer $\ell$ corresponding to the whole input $I$ are denoted by $H^\ell = [h_i^\ell]_{i=1}^{N} \in \mathbb{R}^{N \times d}$.

MHAT  The masked multi-head attention (MHAT) module in each transformer layer $\ell$ contains four projection matrices: $W_Q^\ell, W_K^\ell, W_V^\ell, W_O^\ell \in \mathbb{R}^{d \times d}$. For the multi-head attention, the input $H^{\ell-1}$ is first projected to query, key and value matrices: $Q^\ell = H^{\ell-1} W_Q^\ell$, $K^\ell = H^{\ell-1} W_K^\ell$, $V^\ell = H^{\ell-1} W_V^\ell$. The projected query, key and value matrices are then evenly split along the columns into $H$ heads: $\{Q^{\ell,j}\}_{j=1}^{H}$, $\{K^{\ell,j}\}_{j=1}^{H}$, $\{V^{\ell,j}\}_{j=1}^{H} \in \mathbb{R}^{N \times \frac{d}{H}}$, respectively. After splitting $W_O^\ell$ into $\{W_O^{\ell,j}\}_{j=1}^{H} \in \mathbb{R}^{\frac{d}{H} \times d}$, we follow the works in [12, 15, 19] to represent the output of MHAT at layer $\ell$, $A^\ell = [a_i^\ell]_{i=1}^{N} \in \mathbb{R}^{N \times d}$, as the sum of the outputs from the different heads:

$$A^\ell = \sum_{j=1}^{H} A^{\ell,j} V^{\ell,j} W_O^{\ell,j}, \quad (2)$$

$$A^{\ell,j} = \mathrm{softmax}\!\left(\frac{Q^{\ell,j} (K^{\ell,j})^{T}}{\sqrt{d/H}} + M^{\ell,j}\right), \quad (3)$$

where $M^{\ell,j}$ is a strictly upper triangular mask for $A^{\ell,j}$ of the $j$-th head at layer $\ell$. For an auto-regressive transformer model, $M^{\ell,j}$ is used to guarantee that every position of the input sequence cannot attend to succeeding positions and attends to all preceding positions. Therefore, for the element $M_{s,t}^{\ell,j}$ at coordinate $(s,t)$ in $M^{\ell,j}$,

$$M_{s,t}^{\ell,j} = \begin{cases} -\infty & \text{if } t > s, \\ 0 & \text{otherwise.} \end{cases} \quad (4)$$
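As an illustration of Eqs. (2)-(4), the sketch below (a simplified single-layer re-implementation under our own assumptions about sizes and head count, not the authors' code) builds the strictly upper triangular mask $M$ and computes the masked multi-head attention output.

```python
import torch
import torch.nn.functional as F

def masked_multi_head_attention(H_prev, W_Q, W_K, W_V, W_O, n_heads):
    """Compute the MHAT output of one layer per Eqs. (2)-(4), without bias or LayerNorm."""
    N, d = H_prev.shape
    d_h = d // n_heads

    Q = H_prev @ W_Q                   # [N, d]
    K = H_prev @ W_K
    V = H_prev @ W_V

    # Strictly upper triangular causal mask M: -inf wherever t > s (Eq. (4)).
    M = torch.triu(torch.full((N, N), float("-inf")), diagonal=1)

    out = torch.zeros(N, d)
    for j in range(n_heads):
        cols = slice(j * d_h, (j + 1) * d_h)
        Qj, Kj, Vj = Q[:, cols], K[:, cols], V[:, cols]
        Aj = F.softmax(Qj @ Kj.T / (d_h ** 0.5) + M, dim=-1)   # Eq. (3)
        out = out + Aj @ Vj @ W_O[cols, :]                     # summand of Eq. (2)
    return out

# Toy usage with arbitrary sizes.
N, d, n_heads = 8, 64, 4
H_prev = torch.randn(N, d)
W_Q, W_K, W_V, W_O = (torch.randn(d, d) * 0.02 for _ in range(4))
a = masked_multi_head_attention(H_prev, W_Q, W_K, W_V, W_O, n_heads)
print(a.shape)   # torch.Size([8, 64])
```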
FFN  The FFN computes its output representation through

$$f_i^\ell = W_U^\ell \, \sigma\!\left(W_B^\ell \left(a_i^\ell + h_i^{\ell-1}\right)\right), \quad (5)$$

where $W_U^\ell \in \mathbb{R}^{d \times d_{ff}}$ and $W_B^\ell \in \mathbb{R}^{d_{ff} \times d}$ are projection matrices with inner dimensionality $d_{ff}$, and $\sigma$ is a nonlinear activation function.

Output  The hidden representation $h_N^L$ corresponding to the last position $N$ of the input sequence at the final layer $L$ is projected by an unembedding matrix $E \in \mathbb{R}^{|\mathcal{V}| \times d}$, and the probability distribution over all words in the vocabulary $\mathcal{V}$ is computed by

$$P_N = \mathrm{softmax}\!\left(E h_N^L\right), \quad (6)$$

where the word with the highest probability in $P_N$ is the final prediction.
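A minimal sketch of Eq. (6) (toy sizes and a randomly initialized unembedding matrix, purely for illustration):

```python
import torch
import torch.nn.functional as F

d, vocab_size = 64, 32000            # toy hidden size and vocabulary size

h_last = torch.randn(d)              # h_N^L: final-layer hidden state at the last position
E = torch.randn(vocab_size, d)       # unembedding matrix E

P_N = F.softmax(E @ h_last, dim=-1)  # Eq. (6): distribution over the vocabulary
answer_id = int(P_N.argmax())        # the highest-probability word is the prediction
p_1 = float(P_N[answer_id])          # its probability, used as p_1 in Sec. 4 (Evaluation)
```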
3.2. Attention knockout

In this paper, we mainly investigate the interaction between the different modalities by locating and analyzing the information flow between them. We adopt a reverse-engineering approach to trace this information flow: by intentionally blocking specific connections between different components in the computation process, we trace the information flow within them through the resulting changes in the probability of the final prediction.

In MLLMs, the attention module (MHAT) is the only module that enables communication between the different types of hidden representations corresponding to different positions in the input sequence. Therefore, we intentionally block the attention edges between hidden representations at different token positions (termed attention knockout) to trace the information flow between them. We take inspiration from the work of Geva et al. [19], where the authors use attention knockout to assess how factual information is extracted from a single-modality LLM by evaluating the contribution of certain words in a sentence to the last-position prediction. We extend this method to multimodal research by examining not only the contribution of each modality to the last-position prediction but also the transfer of information between the different modalities. Intuitively, if blocking the attention edge connecting two hidden representations at different positions of the input sequence leads to a significant deterioration in model performance, this suggests that there is functionally important information transfer between these two representations. Therefore, we locate the information flow between hidden representations corresponding to different parts of the input sequence, such as the visual inputs, the linguistic inputs, and the last position (the position of answer prediction), by blocking the attention edges between them in the MHAT module and observing the resulting decline in performance compared to the original model with an intact attention pattern.

Formally, in order to prevent information flow from the hidden representations $h_s^\ell$ at positions $s$ in a source set $S$ (e.g. all positions of visual tokens in the input sequence) to the hidden representations $h_t^\ell$ at positions $t$ in a target set $T$ (e.g. all positions of linguistic tokens in the input sequence) at a specific layer $\ell < L$, we set the corresponding elements $M_{s,t}^{\ell,j}$ in $M^{\ell,j}$ to $-\infty$, so that the updated Eq. (4) becomes

$$M_{s,t}^{\ell,j} = \begin{cases} -\infty & \text{if } (t > s) \text{ or } (s \in S \text{ and } t \in T), \\ 0 & \text{otherwise.} \end{cases} \quad (7)$$

This prevents the token positions in the target set from attending to those in the source set when the MLLM generates the predicted answer.
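The sketch below illustrates attention knockout in the spirit of Eq. (7) (a simplified stand-in for the authors' implementation; the position sets and token counts are placeholders). It extends the causal mask from the earlier attention example so that, at the chosen layers, the target positions can no longer attend to the source positions.

```python
import torch

def knockout_mask(n_tokens, source_pos, target_pos):
    """Causal mask with attention knockout (cf. Eq. (7)).

    Rows index the attending (query) positions and columns the attended-to
    (key) positions. On top of the causal constraint, every target position
    is prevented from attending to every source position, which cuts the
    information flow from the source set S to the target set T.
    """
    M = torch.triu(torch.full((n_tokens, n_tokens), float("-inf")), diagonal=1)
    for t in target_pos:            # query positions in the target set T
        for s in source_pos:        # key positions in the source set S
            M[t, s] = float("-inf")
    return M

# Toy usage: block the flow from image positions to question positions.
N_V, N_T = 576, 32                           # placeholder token counts
image_pos = list(range(N_V))                 # positions of visual tokens (source set S)
question_pos = list(range(N_V, N_V + N_T))   # positions of question tokens (target set T)
M = knockout_mask(N_V + N_T, image_pos, question_pos)
# M would then replace the causal mask added to the pre-softmax attention
# scores of every head (Eq. (3)) at the layers inside the knockout window.
```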
4. Experimental setting

Setup  Our paper investigates the inner working mechanism of MLLMs, focusing on visual question answering (VQA). Typically, the VQA setup involves an image and a corresponding question about this image, which the model needs to answer. We first investigate where the information from the different modalities (image and textual question) is processed in MLLMs, and then how it is integrated within the model. Finally, we explore how the MLLM makes the final decision using this multimodal information.

Tasks and data  We collect our data from the validation set of the GQA dataset [21]. GQA is designed to support visual reasoning and compositional question answering, offering the semantic and visual richness of real-world images. It is derived from the Visual Genome dataset, which includes detailed scene graph structures [22]. In GQA, the questions are categorized along two dimensions: structure and semantics. The former defines the question format (5 classes) and the latter refers to the semantic information of the main subject of the question (5 classes). The answers to these questions consist of only one word or phrase, which is easy to evaluate. Based on the two dimensions, the questions in GQA are categorized into 15 groups. We exclude most groups that consist of simple binary questions (yes/no) or on which the models investigated in this paper perform poorly. Finally, we select 6 of the 15 groups (covering 4 structural and 4 semantic classes) whose average accuracy is higher than 80%, as shown in Table 1. The difficulty of the selected groups ranges from simple multimodal perception to more complex multimodal reasoning. For example, ChooseAttr and ChooseCat ask about basic object attributes and categories of one object in the image, ChooseRel and QueryAttr involve spatial reasoning, and CompareAttr and LogicalObj require more challenging comparisons and logical reasoning between two objects in the image. For each selected group, we sample an average of 920 image-question pairs that are correctly predicted by most of the models used in this paper. For each model, we only use correctly predicted samples for analysis (each model achieves an accuracy greater than 95% on the dataset we collected). More details about the dataset and the collection process can be found in Appendix A.
Name        | Structural type | Semantic type | Open/Binary | Question example                                    | Answer | Num.
ChooseAttr  | Choose          | Attribute     | Open        | What was used to make the door, wood or metal?      | Wood   | 1000
ChooseCat   | Choose          | Category      | Open        | Which piece of furniture is striated, bed or door?  | Bed    | 1000
ChooseRel   | Choose          | Relation      | Open        | Is the door to the right or to the left of the bed? | Right  | 964
CompareAttr | Compare         | Attribute     | Open        | What is common to the bike and the dog?             | Color  | 570
LogicalObj  | Logical         | Object        | Binary      | Are there either women or men that are running?     | No     | 991
QueryAttr   | Query           | Attribute     | Open        | In which part of the image is the dog?              | Left   | 1000

Table 1. Different types of questions in our VQA dataset. The questions are categorized based on two dimensions: structure and semantics. The structural types define the question format, including: Choose for selecting between alternatives, Compare for comparisons between objects, Logical for logical inference, and Query for open-ended questions. The semantic types focus on the subject matter, covering Object existence, and the Attribute, Category and Relation of objects. Additionally, questions are labeled as Open for open-ended queries or Binary for yes/no answers. The dataset is derived from the GQA dataset [21]. Due to space limitations, we present two images, noting that 50% of question samples in our dataset have unique images.
Models  We investigate current state-of-the-art, open-source multimodal large language models from the LLaVA series: LLaVA-1.5-7b, LLaVA-1.5-13b [27], LLaVA-v1.6-Vicuna-7b [28] and Llama3-LLaVA-NEXT-8b [2], which achieve state-of-the-art performance across a diverse range of 11 tasks including GQA. These models are trained on similar publicly available data but with different architectures and model sizes, which allows us to explore cross-modal interaction and processing across architectures while minimizing interference from unknown factors in the training data. All of these models use the same image encoder (CLIP-ViT-L-336px [35]) but different LLMs: Vicuna-v1.5-7b [47] with 32 layers (transformer blocks) in LLaVA-1.5-7b and LLaVA-v1.6-Vicuna-7b, Vicuna-v1.5-13b [47] with 40 layers in LLaVA-1.5-13b, and Llama3-8b [14] with 32 layers in Llama3-LLaVA-NEXT-8b, where Vicuna-v1.5 is the standard dense transformer architecture [41] and Llama3 adopts grouped-query attention [4]. In terms of image processing, LLaVA-1.5-7b and LLaVA-1.5-13b directly feed the original fixed-length image patch features from the image encoder into the LLM as input tokens. In contrast, LLaVA-v1.6-Vicuna-7b and Llama3-LLaVA-NEXT-8b employ a dynamic high-resolution technique, which dynamically adjusts image resolution, resulting in variable-length image patch features with higher resolution. Due to space limitations, we primarily present the results for LLaVA-1.5-13b in the subsequent sections of this paper; similar findings for the other models are presented in Appendix E.
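For reference, one way to load a model from this family and run a single VQA query is sketched below with the Hugging Face transformers API (a minimal illustration rather than the authors' pipeline; the checkpoint name, prompt template, and generation settings are assumptions on our side).

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumed Hub checkpoint; the paper uses the original LLaVA releases,
# so treat this identifier as an illustrative stand-in.
model_id = "llava-hf/llava-1.5-13b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

image = Image.open("example.jpg")  # a GQA image (placeholder path)
prompt = "USER: <image>\nWhich piece of furniture is striated, bed or door? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```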
Format  Formally, given an image $i$ and a question $q$ (the question may contain answer options $os = [o_1, o_2]$), the model is expected to generate the answer $a$ at the last position of the input sequence. In addition, the correct option is referred to as the true option ($o_t$), while the other one is denoted as the false option ($o_f$). Since the image, question and options might span multiple input tokens, we use $I$, $Q$, $O_t$, $O_f$ to represent the sets of input positions corresponding to the image, question, true option and false option, respectively.
Evaluation  We quantify the information flow between different input parts by evaluating the relative change in the probability of the answer word caused by blocking the connections between these parts (attention knockout). Formally, given an image-question pair, the MLLM generates the answer $a$ with the highest probability $p_1$ from the output distribution $P_N$ defined in Eq. (6). After applying attention knockout at specific layers, we record the updated probability $p_2$ for the same answer $a$. The relative change in probability, $p_c\%$, is calculated as $p_c\% = \frac{p_2 - p_1}{p_1} \times 100$. In this paper, attention knockout is applied to each transformer layer (within a defined window) individually, and we evaluate the respective $p_c$ values.
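A schematic of this evaluation loop is given below; `run_model` and `run_model_with_knockout` are hypothetical helpers standing in for a forward pass with the intact and the modified attention mask, respectively.

```python
def relative_change(p1: float, p2: float) -> float:
    """Relative probability change p_c% = ((p2 - p1) / p1) * 100."""
    return (p2 - p1) / p1 * 100.0

# Schematic layer-wise evaluation; run_model and run_model_with_knockout are
# hypothetical stand-ins for the forward passes described in Sec. 3.2.
def layerwise_knockout_curve(sample, answer_id, n_layers, window, source, target):
    p1 = run_model(sample)[answer_id]                 # original answer probability
    curve = []
    for start in range(n_layers):
        layers = range(start, min(start + window, n_layers))
        p2 = run_model_with_knockout(sample, layers, source, target)[answer_id]
        curve.append(relative_change(p1, p2))         # one p_c value per window position
    return curve
```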
5. Contribution of different modalities to the final prediction

For a successful answer prediction in the VQA task, the MLLM processes the input image-question pair $[i, q]$ and generates the final answer from the output layer of the model at the last position. We first investigate
information from the whole image into the question positions, building a more generic representation. It is only in the later layers that the model starts to pay attention to the specific regions of the image relevant to the question, fusing the more fine-grained linguistic and visual representations. The other MLLMs show similar results, as presented in Appendix E. An additional, more fine-grained analysis that intervenes on the attention edges between the object words in the question and the corresponding image region can be found in Appendix F. Moreover, we find that, compared with LLaVA-1.5-13b, the smaller LLaVA-1.5-7b has less information flow from the positions of $V_{oth}$ to those of the question in the first stage, as shown in Appendix E.
References
[1] Interpreting GPT: the logit lens. LessWrong. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens. Accessed: 2024-11-14.
[2] lmms-lab/llama3-llava-next-8b. Hugging Face. https://huggingface.co/lmms-lab/llama3-llava-next-8b, 2024. Accessed: 2024-11-13.
[3] Estelle Aflalo, Meng Du, Shao-Yen Tseng, Yongfei Liu, Chenfei Wu, Nan Duan, and Vasudev Lal. VL-InterpreT: An interactive visualization tool for interpreting vision-language transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21406–21415, 2022.
[4] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.
[5] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
[6] Samyadeep Basu, Martin Grayson, Cecily Morrison, Besmira Nushi, Soheil Feizi, and Daniela Massiceti. Understanding information storage and transfer in multi-modal large language models. arXiv preprint arXiv:2406.04236, 2024.
[7] Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, and Jingjing Liu. Behind the scene: Revealing the secrets of pre-trained vision-and-language models. Lecture Notes in Computer Science, 12351 LNCS:565–580, 2020. arXiv:2005.07310.
[8] Hila Chefer, Shir Gur, and Lior Wolf. Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. pages 397–406, 2021. arXiv:2103.15679.
[9] Adam Dahlgren Lindström, Johanna Björklund, Suna Bensch, and Frank Drewes. Probing multimodal embeddings for linguistic properties: the visual-semantic case. In Proceedings of the 28th International Conference on Computational Linguistics, pages 730–744, Barcelona, Spain (Online), 2020. International Committee on Computational Linguistics.
[10] Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge neurons in pretrained transformers. arXiv preprint arXiv:2104.08696, 2022.
[11] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
[12] Guy Dar, Mor Geva, Ankit Gupta, and Jonathan Berant. Analyzing transformers in embedding space. arXiv preprint arXiv:2209.02535, 2022.
[13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[14] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[15] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html.
[16] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369, 2023.
[17] Stella Frank, Emanuele Bugliarello, and Desmond Elliott. Vision-and-language or vision-for-language? On cross-modal influence in multimodal transformers. pages 9847–9857, 2021. arXiv:2109.04448.
[18] Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2021.
[19] Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models. arXiv preprint arXiv:2304.14767, 2023.
[20] Lisa Anne Hendricks and Aida Nematzadeh. Probing image-language transformers for verb understanding. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3635–3644, Online, 2021. Association for Computational Linguistics.
[21] Drew A. Hudson and Christopher D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.
[22] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123:32–73, 2017.
[23] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. Technical report.
[24] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
[25] Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-Gemini: Mining the potential of multi-modality vision language models, 2024.
[26] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
[27] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26296–26306, 2024.
[28] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024.
[29] Yiwei Lyu, Paul Pu Liang, Zihao Deng, Ruslan Salakhutdinov, and Louis-Philippe Morency. DIME: Fine-grained interpretations of multimodal models via disentangled local explanations. In Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society, pages 455–467, 2022.
[30] Kevin Meng, David Bau, Alex J. Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems, 2022.
[31] Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217, 2023.
[32] Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, and Fazl Barez. Towards interpreting visual information processing in vision-language models. arXiv preprint arXiv:2410.07149, 2024.
[33] Chris Olah. Mechanistic interpretability, variables, and the importance of interpretable bases. https://www.transformer-circuits.pub/2022/mech-interp-essay, 2024. Accessed: 2024-10-20.
[34] Vedant Palit, Rohan Pandey, Aryaman Arora, and Paul Pu Liang. Towards vision-language mechanistic interpretability: A causal tracing tool for BLIP.
[35] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[36] Emmanuelle Salin, Badreddine Farah, Stéphane Ayache, and Benoit Favre. Are vision-language transformers learning multimodal representations? A probing perspective. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, 2022.
[37] Sarah Schwettmann, Neil Chowdhury, Samuel Klein, David Bau, and Antonio Torralba. Multimodal neurons in pretrained text-only transformers. In 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 2854–2859, Paris, France, 2023. IEEE.
[38] Gabriela Ben Melech Stan, Raanan Yehezkel Rohekar, Yaniv Gurwicz, Matthew Lyle Olson, Anahita Bhiwandiwalla, Estelle Aflalo, Chenfei Wu, Nan Duan, Shao-Yen Tseng, and Vasudev Lal. LVLM-Interpret: An interpretability tool for large vision-language models. arXiv preprint arXiv:2404.03118, 2024.
[39] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models.
[40] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[41] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
[42] Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, and Xu Sun. Label words are anchors: An information flow perspective for understanding in-context learning. arXiv preprint arXiv:2305.14160, 2023.
[43] Shicheng Xu, Liang Pang, Yunchang Zhu, Huawei Shen, and Xueqi Cheng. Cross-modal safety mechanism transfer in large vision-language models. arXiv preprint arXiv:2410.12662, 2024.
[44] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
[45] Xiaofeng Zhang, Chen Shen, Xiaosong Yuan, Shaotian Yan, Liang Xie, Wenxiao Wang, Chaochen Gu, Hao Tang, and Jieping Ye. From redundancy to relevance: Enhancing explainability in multimodal large language models. arXiv preprint arXiv:2406.06579, 2024.
[46] Qinyu Zhao, Ming Xu, Kartik Gupta, Akshay Asthana, Liang Zheng, and Stephen Gould. The first to know: How token distributions reveal hidden knowledge in large vision-language models? arXiv preprint arXiv:2403.09037, 2024.
[47] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023.
Cross-modal Information Flow in Multimodal Large Language Models
Supplementary Material