Cross-modal Information Flow in Multimodal Large Language Models
Zhi Zhang*, Srishti Yadav*†, Fengze Han‡, Ekaterina Shutova*
*ILLC, University of Amsterdam, Netherlands
†Dept. of Computer Science, University of Copenhagen, Denmark
‡Dept. of Computer Engineering, Technical University of Munich, Germany
zhangzhizz2626@gmail.com, srya@di.ku.dk, fengze.han@tum.de, e.shutova@uva.nl
Abstract
The recent advancements in auto-regressive multimodal large language models (MLLMs) have demonstrated promising progress for vision-language tasks. While there exists a variety of studies investigating the processing of linguistic information within large language models, little is currently known about the inner working mechanism of MLLMs and how linguistic and visual information interact within these models. In this study, we aim to fill this gap by examining the information flow between different modalities (language and vision) in MLLMs, focusing on visual question answering. Specifically, given an image-question pair as input, we investigate where in the model and how the visual and linguistic information are combined to generate the final prediction. Conducting experiments with a series of models from the LLaVA series, we find that there are two distinct stages in the process of integration of the two modalities. In the lower layers, the model first transfers the more general visual features of the whole image into the representations of (linguistic) question tokens. In the middle layers, it once again transfers visual information about specific objects relevant to the question to the respective token positions of the question. Finally, in the higher layers, the resulting multimodal representation is propagated to the last position of the input sequence for the final prediction. Overall, our findings provide a new and comprehensive perspective on the spatial and functional aspects of image and language processing in MLLMs, thereby facilitating future research into multimodal information localization and editing.
(Figure 1 example: given an image and the question "Are the blinds up or down?", the assistant answers "Up".)

Figure 1. Illustration of the internal mechanism of MLLMs when solving multimodal tasks. From bottom to top layers, the model first propagates general visual information from the whole image into the linguistic hidden representations; next, selected visual information relevant to answering the question is transferred to the linguistic representation; finally, the integrated multimodal information within the hidden representation of the question flows to the last position, facilitating the final prediction. In addition, the answer is initially generated in lowercase form and its first letter is then converted to uppercase.
1. Introduction

Multimodal large language models (MLLMs) [5, 11, 24, 27, 28] have demonstrated notable performance across a wide range of vision-language tasks, which is largely attributed to the combination of powerful auto-regressive large language models [39, 40, 44, 47] and visual encoders [13, 16, 35]. Specifically, LLMs generate responses based on both visual and linguistic inputs, where visual representations extracted from an image encoder precede the word embeddings in the input sequence. Despite the successful performance and wide applicability of MLLMs, there is still a lack of understanding of their internal working mechanisms at play when solving multimodal tasks. Acquiring deeper insights into these mechanisms could not only enhance the interpretability and transparency [31, 33] of these models but also pave the way for developing more efficient and robust models for multimodal interactions.
Some initial studies have begun to explore the internal states corresponding to external behaviors of MLLMs, focusing on specific aspects such as information storage in the model's parameters [6], the reflection of undesirable content generation in the logit distributions of the generated tokens [46], the localization and evolution of object-related visual information [32, 34, 37], the localization of safety mechanisms [43] and the reduction of redundant visual tokens [45]. However, the information flow between the two modalities within MLLMs remains poorly understood, prompting our main question: Where in the model and how is visual and linguistic information integrated within auto-regressive MLLMs to generate the final prediction in vision-language tasks?

To address this question, we investigate the interaction of the different modalities by locating and analyzing the information flow [15] between them across different layers. Our focus is on the task of visual question answering (VQA), a popular multimodal task, where the answer is generated by the MLLM based on the input image and the corresponding question. Specifically, we aim to reverse-engineer the information flow between the two modalities at inference time, by selectively inhibiting specific attention patterns between tokens corresponding to the visual and linguistic inputs and by observing the resulting changes in the performance of answer prediction.
In modern auto-regressive MLLMs, which employ a decoder-only Transformer architecture [41], the attention layer is the sole module enabling communication between hidden representations corresponding to different positions of the input. To inhibit cross-modal information flow, we therefore adopt the attention knockout approach proposed by Geva et al. [19]. We use it to block attention edges connecting different types of hidden representations (e.g. image and question) at specific transformer layers.

We apply this method to a range of MLLMs from the LLaVA series, including LLaVA-1.5-7b, LLaVA-1.5-13b [27], LLaVA-v1.6-Vicuna-7b [28] and Llama3-LLaVA-NEXT-8b [2], and a number of diverse question types in VQA, as shown in Table 1. Our experiments focus on the following research questions: (1) How is the (more general) visual information from the whole image fused with the linguistic information in the question? (2) How is the more targeted visual information (i.e. specific image regions directly relevant to answering the question) integrated with the linguistic information from the question? and (3) In what ways do the linguistic and visual components of the input contribute to the final answer prediction? To answer these questions, we conduct a series of experiments, blocking information flow between (1) the input positions corresponding to the whole image and the different parts of the question; (2) the input positions corresponding to image regions containing objects relevant to answering the question and the question; and (3) the input positions corresponding to the image and the question and the final prediction, across different layers of the MLLM.

Our results reveal that in MLLMs, visual information undergoes a two-stage integration into the language representation within the lower-to-middle layers: first in a comprehensive manner, and subsequently in a more targeted fashion. This integrated multimodal representation is then propagated to the hidden representations in the subsequent layers, ultimately reaching the last position for generating an accurate response. A visualization of this mechanism is shown in Figure 1. To the best of our knowledge, ours is the first paper to elucidate the information flow between the two modalities in auto-regressive MLLMs. It thus contributes to enhancing the transparency of these models and provides novel and valuable insights for their development.
2. Related work
MLLMs  Multimodal large language models have demonstrated remarkable performance across a wide range of vision-language tasks, which is largely attributed to the development of auto-regressive large language models. Representative MLLMs [5, 11, 24-28] consist of an image encoder [13, 16, 35] and a powerful decoder-only large language model [39, 40, 44, 47]. The visual and linguistic information is integrated within the original LLM. In this paper, we investigate the inner working mechanism of multimodal information processing in these models.
Interpretability of multimodal models  The interpretability of multimodal models has attracted a great deal of attention in the research community. The works in [7, 17] treat the model as a black box, analyzing input-output relationships to interpret model behavior, such as comparing the importance of different modalities [7] and the different modalities' contributions to visual or textual tasks [17]. The works in [3, 8, 29, 38] aim to explain predictions by tracing outputs to specific input contributions for a single sample, including by merging attention scores [3, 38], using gradient-based methods [8] or model disentanglement [29]. Additionally, some works [9, 20, 36] adopt a top-down approach, probing learned representations to uncover high-level concepts, such as visual semantics [9], verb understanding [20], and shape and size [36]. In contrast, our work focuses on the model's internal processing mechanisms when solving multimodal tasks.
Mechanistic interpretability of MLLMs  Mechanistic interpretability [31, 33] is an emerging research area in NLP, aiming to reverse-engineer the detailed computations within neural networks. While it has gained traction in NLP, research in the multimodal domain remains limited. Palit et al. [34] introduced a causal tracing tool for image-conditioned text generation on BLIP [23], marking one of the few early efforts in this area. Several initial studies have started to explore the internal states of MLLMs by linking external behaviours to specific mechanisms, such as information storage in model parameters [6], undesirable content generation reflected in the logit distributions of the first generated token [46], localization and evolution of object-related visual information [32, 34, 37], safety mechanism localization [43], and the reduction of redundant visual tokens [45]. However, research offering a comprehensive understanding of the internal mechanisms behind multimodal information integration in MLLMs is still lacking. This paper makes an important first step towards filling this gap.

Figure 2. The typical architecture of a multimodal large language model. It consists of an image encoder and a decoder-only large language model in which the multimodal information is integrated. We omit the projection matrix for the visual patch features as it is nonessential for our analysis.
3. Tracing information flow in MLLMs

The focus of this paper is on auto-regressive multimodal large language models, which consist of an image encoder and a decoder-only language model, as shown in Figure 2. The image encoder transforms images into representations that the language model can take as input, while the language model integrates these visual cues with any provided text, generating responses one word at a time. Often, these components are initialized from a pre-trained image encoder (e.g. CLIP-ViT-L-336px [35]) and a large language model (e.g. Llama 2 [40]), respectively. Since the interaction between modalities only occurs in the decoder-only transformer, our analysis centers around it and we refer to it as the MLLM for brevity unless otherwise specified.

3.1. Background: MLLMs

Input  The input to an MLLM typically comprises image and text features, with the image features being initially extracted from an image encoder and the text being encoded through word embeddings. Formally, an image $x$ is evenly split into fixed-size patches and encoded by an image encoder to obtain $N_V$ visual patch features $V = [v_i]_{i=1}^{N_V}$, $v_i \in \mathbb{R}^d$. Similarly, the text $t$, consisting of $N_T$ tokens, is embedded into representations through a lookup table of word embeddings, resulting in the text input $T = [t_i]_{i=1}^{N_T}$, $t_i \in \mathbb{R}^d$. By concatenating $V$ and $T$, the multimodal input sequence $I = [v_1, \dots, v_{N_V}, t_1, \dots, t_{N_T}] \in \mathbb{R}^{N \times d}$, where $N = N_V + N_T$, is fed into the MLLM.
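To make the input construction concrete, the following minimal PyTorch sketch (not the authors' code; the toy dimensions and the assumption that the visual features are already projected to the LLM hidden size are ours) concatenates visual patch features and word embeddings into a single multimodal input sequence.

```python
import torch

# Toy dimensions (assumptions for illustration): N_V visual patches,
# N_T text tokens, LLM hidden size d.
N_V, N_T, d = 576, 32, 4096

# Visual patch features from the image encoder, assumed to be already
# projected to the LLM hidden size d (the projection matrix is omitted,
# as in Figure 2).
V = torch.randn(N_V, d)           # [N_V, d]

# Text token embeddings looked up from the LLM word-embedding table.
T = torch.randn(N_T, d)           # [N_T, d]

# Multimodal input sequence I: visual tokens precede the text tokens.
I = torch.cat([V, T], dim=0)      # [N, d] with N = N_V + N_T
assert I.shape == (N_V + N_T, d)
```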
Hidden representation  The input sequence is fed into the MLLM, where the hidden representation at each token position is encoded across $L$ transformer layers. Each layer primarily consists of two modules: a masked multi-head attention (MHAT) followed by a fully connected feed-forward network (FFN) [41]. For conciseness, we have excluded the bias terms and layer normalization, as they are not crucial for our analysis. Formally, the hidden representation $h_i^\ell \in \mathbb{R}^d$ at position $i$ of the input sequence at layer $\ell$ can be expressed as

$$h_i^\ell = h_i^{\ell-1} + a_i^\ell + f_i^\ell, \quad (1)$$

where $a_i^\ell \in \mathbb{R}^d$ and $f_i^\ell \in \mathbb{R}^d$ are the outputs of the MHAT and FFN modules at layer $\ell$, respectively, and $h_i^0$ is the vector at position $i$ of the input $I$. All hidden representations at layer $\ell$ corresponding to the whole input $I$ are denoted by $H^\ell = [h_i^\ell]_{i=1}^{N} \in \mathbb{R}^{N \times d}$.

MHAT  The masked multi-head attention (MHAT) module in each transformer layer $\ell$ contains four projection matrices: $W_Q^\ell, W_K^\ell, W_V^\ell, W_O^\ell \in \mathbb{R}^{d \times d}$. For the multi-head attention, the input $H^{\ell-1}$ is first projected to query, key and value matrices: $Q^\ell = H^{\ell-1} W_Q^\ell$, $K^\ell = H^{\ell-1} W_K^\ell$, $V^\ell = H^{\ell-1} W_V^\ell$. The projected query, key and value matrices are then evenly split along the columns into $H$ heads: $\{Q^{\ell,j}\}_{j=1}^{H}$, $\{K^{\ell,j}\}_{j=1}^{H}$, $\{V^{\ell,j}\}_{j=1}^{H} \in \mathbb{R}^{N \times \frac{d}{H}}$, respectively. After splitting $W_O^\ell$ into $\{W_O^{\ell,j}\}_{j=1}^{H} \in \mathbb{R}^{\frac{d}{H} \times d}$, we follow the works in [12, 15, 19] to represent the output of MHAT at layer $\ell$, $A^\ell = [a_i^\ell]_{i=1}^{N} \in \mathbb{R}^{N \times d}$, as the sum of the outputs from the different heads:

$$A^\ell = \sum_{j=1}^{H} A^{\ell,j} V^{\ell,j} W_O^{\ell,j}, \quad (2)$$

$$A^{\ell,j} = \mathrm{softmax}\!\left(\frac{Q^{\ell,j} (K^{\ell,j})^{T}}{\sqrt{d/H}} + M^{\ell,j}\right), \quad (3)$$

where $M^{\ell,j}$ is a strictly upper triangular mask for $A^{\ell,j}$ of the $j$-th head at layer $\ell$. For an auto-regressive transformer model, $M^{\ell,j}$ is used to guarantee that every position of the input sequence cannot attend to succeeding positions and attends to all preceding positions. Therefore, for the element $M_{s,t}^{\ell,j}$ at coordinate $(s,t)$ in $M^{\ell,j}$,

$$M_{s,t}^{\ell,j} = \begin{cases} -\infty & \text{if } t > s, \\ 0 & \text{otherwise.} \end{cases} \quad (4)$$
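As an illustration of Eqs. (2)-(4), the sketch below (a simplified single-layer re-implementation under our own assumptions about sizes and head count, not the authors' code) builds the strictly upper triangular mask $M$ and computes the masked multi-head attention output.

```python
import torch
import torch.nn.functional as F

def masked_multi_head_attention(H_prev, W_Q, W_K, W_V, W_O, n_heads):
    """Compute the MHAT output of one layer per Eqs. (2)-(4), without bias or LayerNorm."""
    N, d = H_prev.shape
    d_h = d // n_heads

    Q = H_prev @ W_Q                   # [N, d]
    K = H_prev @ W_K
    V = H_prev @ W_V

    # Strictly upper triangular causal mask M: -inf wherever t > s (Eq. (4)).
    M = torch.triu(torch.full((N, N), float("-inf")), diagonal=1)

    out = torch.zeros(N, d)
    for j in range(n_heads):
        cols = slice(j * d_h, (j + 1) * d_h)
        Qj, Kj, Vj = Q[:, cols], K[:, cols], V[:, cols]
        Aj = F.softmax(Qj @ Kj.T / (d_h ** 0.5) + M, dim=-1)   # Eq. (3)
        out = out + Aj @ Vj @ W_O[cols, :]                     # summand of Eq. (2)
    return out

# Toy usage with arbitrary sizes.
N, d, n_heads = 8, 64, 4
H_prev = torch.randn(N, d)
W_Q, W_K, W_V, W_O = (torch.randn(d, d) * 0.02 for _ in range(4))
a = masked_multi_head_attention(H_prev, W_Q, W_K, W_V, W_O, n_heads)
print(a.shape)   # torch.Size([8, 64])
```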
FFN  The FFN computes its output representation through

$$f_i^\ell = W_U^\ell \, \sigma\!\left(W_B^\ell \left(a_i^\ell + h_i^{\ell-1}\right)\right), \quad (5)$$

where $W_U^\ell \in \mathbb{R}^{d \times d_{ff}}$ and $W_B^\ell \in \mathbb{R}^{d_{ff} \times d}$ are projection matrices with inner dimensionality $d_{ff}$, and $\sigma$ is a nonlinear activation function.

Output  The hidden representation $h_N^L$ corresponding to the last position $N$ of the input sequence at the final layer $L$ is projected by an unembedding matrix $E \in \mathbb{R}^{|\mathcal{V}| \times d}$, and the probability distribution over all words in the vocabulary $\mathcal{V}$ is computed by

$$P_N = \mathrm{softmax}\!\left(E h_N^L\right), \quad (6)$$

where the word with the highest probability in $P_N$ is the final prediction.
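A minimal sketch of Eq. (6) (toy sizes and a randomly initialized unembedding matrix, purely for illustration):

```python
import torch
import torch.nn.functional as F

d, vocab_size = 64, 32000            # toy hidden size and vocabulary size

h_last = torch.randn(d)              # h_N^L: final-layer hidden state at the last position
E = torch.randn(vocab_size, d)       # unembedding matrix E

P_N = F.softmax(E @ h_last, dim=-1)  # Eq. (6): distribution over the vocabulary
answer_id = int(P_N.argmax())        # the highest-probability word is the prediction
p_1 = float(P_N[answer_id])          # its probability, used as p_1 in Sec. 4 (Evaluation)
```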
3.2. Attention knockout

In this paper, we mainly investigate the interaction between the different modalities by locating and analyzing the information flow between them. We adopt a reverse-engineering approach to trace this information flow: by intentionally blocking specific connections between different components in the computation process, we trace the information flow within them through the resulting changes in the probability of the final prediction.

In MLLMs, the attention module (MHAT) is the only module that enables communication between the different types of hidden representations corresponding to different positions in the input sequence. Therefore, we intentionally block the attention edges between hidden representations at different token positions (termed attention knockout) to trace the information flow between them. We take inspiration from the work of Geva et al. [19], where the authors use attention knockout to assess how factual information is extracted from a single-modality LLM by evaluating the contribution of certain words in a sentence to the last-position prediction. We extend this method to multimodal research by examining not only the contribution of each modality to the last-position prediction but also the transfer of information between the different modalities. Intuitively, if blocking the attention edge connecting two hidden representations at different positions of the input sequence leads to a significant deterioration in model performance, this suggests that there is functionally important information transfer between these two representations. Therefore, we locate the information flow between hidden representations corresponding to different parts of the input sequence, such as the visual inputs, the linguistic inputs, and the last position (the position of answer prediction), by blocking the attention edges between them in the MHAT module and observing the resulting decline in performance compared to the original model with an intact attention pattern.

Formally, in order to prevent information flow from the hidden representations $h_s^\ell$ at positions $s$ in a source set $S$ (e.g. all positions of visual tokens in the input sequence) to the hidden representations $h_t^\ell$ at positions $t$ in a target set $T$ (e.g. all positions of linguistic tokens in the input sequence) at a specific layer $\ell < L$, we set the corresponding elements $M_{s,t}^{\ell,j}$ in $M^{\ell,j}$ to $-\infty$, so that the updated Eq. (4) becomes

$$M_{s,t}^{\ell,j} = \begin{cases} -\infty & \text{if } (t > s) \text{ or } (s \in S \text{ and } t \in T), \\ 0 & \text{otherwise.} \end{cases} \quad (7)$$

This prevents the token positions in the target set from attending to those in the source set when the MLLM generates the predicted answer.
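The sketch below illustrates attention knockout in the spirit of Eq. (7) (a simplified stand-in for the authors' implementation; the position sets and token counts are placeholders). It extends the causal mask from the earlier attention example so that, at the chosen layers, the target positions can no longer attend to the source positions.

```python
import torch

def knockout_mask(n_tokens, source_pos, target_pos):
    """Causal mask with attention knockout (cf. Eq. (7)).

    Rows index the attending (query) positions and columns the attended-to
    (key) positions. On top of the causal constraint, every target position
    is prevented from attending to every source position, which cuts the
    information flow from the source set S to the target set T.
    """
    M = torch.triu(torch.full((n_tokens, n_tokens), float("-inf")), diagonal=1)
    for t in target_pos:            # query positions in the target set T
        for s in source_pos:        # key positions in the source set S
            M[t, s] = float("-inf")
    return M

# Toy usage: block the flow from image positions to question positions.
N_V, N_T = 576, 32                           # placeholder token counts
image_pos = list(range(N_V))                 # positions of visual tokens (source set S)
question_pos = list(range(N_V, N_V + N_T))   # positions of question tokens (target set T)
M = knockout_mask(N_V + N_T, image_pos, question_pos)
# M would then replace the causal mask added to the pre-softmax attention
# scores of every head (Eq. (3)) at the layers inside the knockout window.
```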
4. Experimental setting

Setup  Our paper investigates the inner working mechanism of MLLMs, focusing on visual question answering (VQA). Typically, the VQA setup involves an image and a corresponding question about this image, which the model needs to answer. We first investigate where the information from the different modalities (image and textual question) is processed in MLLMs, and then how it is integrated within the model. Finally, we explore how the MLLM makes the final decision using this multimodal information.

Tasks and data  We collect our data from the validation set of the GQA dataset [21]. GQA is designed to support visual reasoning and compositional question answering, offering the semantic and visual richness of real-world images. It is derived from the Visual Genome dataset, which includes detailed scene graph structures [22]. In GQA, the questions are categorized along two dimensions: structure and semantics. The former defines the question format (5 classes) and the latter refers to the semantic information of the main subject of the question (5 classes). The answers to these questions consist of only one word or phrase, which is easy to evaluate. Based on the two dimensions, the questions in GQA are categorized into 15 groups. We exclude most groups that consist of simple binary questions (yes/no) or on which the models investigated in this paper perform poorly. Finally, we select 6 of the 15 groups (covering 4 structural and 4 semantic classes) whose average accuracy is higher than 80%, as shown in Table 1. The difficulty of the selected groups ranges from simple multimodal perception to more complex multimodal reasoning. For example, ChooseAttr and ChooseCat ask about basic object attributes and categories of one object in the image, ChooseRel and QueryAttr involve spatial reasoning, and CompareAttr and LogicalObj require more challenging comparisons and logical reasoning between two objects in the image. For each selected group, we sample an average of 920 image-question pairs that are correctly predicted by most of the models used in this paper. For each model, we only use correctly predicted samples for analysis (each model achieves an accuracy greater than 95% on the dataset we collected). More details about the dataset and the collection process can be found in Appendix A.
Name        | Structural type | Semantic type | Open/Binary | Question example                                    | Answer | Num.
ChooseAttr  | Choose          | Attribute     | Open        | What was used to make the door, wood or metal?      | Wood   | 1000
ChooseCat   | Choose          | Category      | Open        | Which piece of furniture is striated, bed or door?  | Bed    | 1000
ChooseRel   | Choose          | Relation      | Open        | Is the door to the right or to the left of the bed? | Right  | 964
CompareAttr | Compare         | Attribute     | Open        | What is common to the bike and the dog?             | Color  | 570
LogicalObj  | Logical         | Object        | Binary      | Are there either women or men that are running?     | No     | 991
QueryAttr   | Query           | Attribute     | Open        | In which part of the image is the dog?              | Left   | 1000

Table 1. Different types of questions in our VQA dataset. The questions are categorized based on two dimensions: structure and semantics. The structural types define the question format, including: Choose for selecting between alternatives, Compare for comparisons between objects, Logical for logical inference, and Query for open-ended questions. The semantic types focus on the subject matter, covering Object existence, and the Attribute, Category and Relation of objects. Additionally, questions are labeled as Open for open-ended queries or Binary for yes/no answers. The dataset is derived from the GQA dataset [21]. Due to space limitations, we present two images, noting that 50% of question samples in our dataset have unique images.
Models  We investigate current state-of-the-art, open-source multimodal large language models from the LLaVA series: LLaVA-1.5-7b, LLaVA-1.5-13b [27], LLaVA-v1.6-Vicuna-7b [28] and Llama3-LLaVA-NEXT-8b [2], which achieve state-of-the-art performance across a diverse range of 11 tasks including GQA. These models are trained on similar publicly available data but with different architectures and model sizes, which allows us to explore cross-modal interaction and processing across architectures while minimizing interference from unknown factors in the training data. All of these models use the same image encoder (CLIP-ViT-L-336px [35]) but different LLMs: Vicuna-v1.5-7b [47] with 32 layers (transformer blocks) in LLaVA-1.5-7b and LLaVA-v1.6-Vicuna-7b, Vicuna-v1.5-13b [47] with 40 layers in LLaVA-1.5-13b, and Llama3-8b [14] with 32 layers in Llama3-LLaVA-NEXT-8b, where Vicuna-v1.5 is the standard dense transformer architecture [41] and Llama3 adopts grouped-query attention [4]. In terms of image processing, LLaVA-1.5-7b and LLaVA-1.5-13b directly feed the original fixed-length image patch features from the image encoder into the LLM as input tokens. In contrast, LLaVA-v1.6-Vicuna-7b and Llama3-LLaVA-NEXT-8b employ a dynamic high-resolution technique, which dynamically adjusts image resolution, resulting in variable-length image patch features with higher resolution. Due to space limitations, we primarily present the results for LLaVA-1.5-13b in the subsequent sections of this paper; similar findings for the other models are presented in Appendix E.
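For reference, one way to load a model from this family and run a single VQA query is sketched below with the Hugging Face transformers API (a minimal illustration rather than the authors' pipeline; the checkpoint name, prompt template, and generation settings are assumptions on our side).

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumed Hub checkpoint; the paper uses the original LLaVA releases,
# so treat this identifier as an illustrative stand-in.
model_id = "llava-hf/llava-1.5-13b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

image = Image.open("example.jpg")  # a GQA image (placeholder path)
prompt = "USER: <image>\nWhich piece of furniture is striated, bed or door? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```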
Format  Formally, given an image $i$ and a question $q$ (the question may contain answer options $os = [o_1, o_2]$), the model is expected to generate the answer $a$ at the last position of the input sequence. In addition, the correct option is referred to as the true option ($o_t$), while the other one is denoted as the false option ($o_f$). Since the image, question and options might span multiple input tokens, we use $I$, $Q$, $O_t$, $O_f$ to represent the sets of input positions corresponding to the image, question, true option and false option, respectively.
Evaluation  We quantify the information flow between different input parts by evaluating the relative change in the probability of the answer word caused by blocking the connections between these parts (attention knockout). Formally, given an image-question pair, the MLLM generates the answer $a$ with the highest probability $p_1$ from the output distribution $P_N$ defined in Eq. (6). After applying attention knockout at specific layers, we record the updated probability $p_2$ for the same answer $a$. The relative change in probability, $p_c\%$, is calculated as $p_c\% = \frac{p_2 - p_1}{p_1} \times 100$. In this paper, attention knockout is applied to each transformer layer (within a defined window) individually, and we evaluate the respective $p_c$ values.
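A schematic of this evaluation loop is given below; `run_model` and `run_model_with_knockout` are hypothetical helpers standing in for a forward pass with the intact and the modified attention mask, respectively.

```python
def relative_change(p1: float, p2: float) -> float:
    """Relative probability change p_c% = ((p2 - p1) / p1) * 100."""
    return (p2 - p1) / p1 * 100.0

# Schematic layer-wise evaluation; run_model and run_model_with_knockout are
# hypothetical stand-ins for the forward passes described in Sec. 3.2.
def layerwise_knockout_curve(sample, answer_id, n_layers, window, source, target):
    p1 = run_model(sample)[answer_id]                 # original answer probability
    curve = []
    for start in range(n_layers):
        layers = range(start, min(start + window, n_layers))
        p2 = run_model_with_knockout(sample, layers, source, target)[answer_id]
        curve.append(relative_change(p1, p2))         # one p_c value per window position
    return curve
```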
5. Contribution of different modalities to the final prediction

For a successful answer prediction in the VQA task, the MLLM processes the input image-question pair $[i, q]$ and generates the final answer from the output layer of the model at the last position. We first investigate
information from the whole image into the question positions, building a more generic representation. It is only in the later layers that the model starts to pay attention to the specific regions of the image relevant to the question, fusing the more fine-grained linguistic and visual representations. The other MLLMs show similar results, as presented in Appendix E. An additional, more fine-grained analysis that intervenes on the attention edges between the object words in the question and the corresponding image region can be found in Appendix F. Moreover, we find that, compared with LLaVA-1.5-13b, the smaller LLaVA-1.5-7b has less information flow from the positions of $V_{oth}$ to those of the question in the first stage, as shown in Appendix E.
References
[1] Interpreting GPT: the logit lens. LessWrong. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens. Accessed: 2024-11-14.
[2] lmms-lab/llama3-llava-next-8b. Hugging Face. https://huggingface.co/lmms-lab/llama3-llava-next-8b, 2024. Accessed: 2024-11-13.
[3] Estelle Aflalo, Meng Du, Shao-Yen Tseng, Yongfei Liu, Chenfei Wu, Nan Duan, and Vasudev Lal. VL-InterpreT: An interactive visualization tool for interpreting vision-language transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21406–21415, 2022.
[4] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.
[5] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
[6] Samyadeep Basu, Martin Grayson, Cecily Morrison, Besmira Nushi, Soheil Feizi, and Daniela Massiceti. Understanding information storage and transfer in multi-modal large language models. arXiv preprint arXiv:2406.04236, 2024.
[7] Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, and Jingjing Liu. Behind the scene: Revealing the secrets of pre-trained vision-and-language models. Lecture Notes in Computer Science, 12351 LNCS:565–580, 2020. arXiv:2005.07310.
[8] Hila Chefer, Shir Gur, and Lior Wolf. Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. pages 397–406, 2021. arXiv:2103.15679.
[9] Adam Dahlgren Lindström, Johanna Björklund, Suna Bensch, and Frank Drewes. Probing multimodal embeddings for linguistic properties: the visual-semantic case. In Proceedings of the 28th International Conference on Computational Linguistics, pages 730–744, Barcelona, Spain (Online), 2020. International Committee on Computational Linguistics.
[10] Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge neurons in pretrained transformers. arXiv preprint arXiv:2104.08696, 2022.
[11] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
[12] Guy Dar, Mor Geva, Ankit Gupta, and Jonathan Berant. Analyzing transformers in embedding space. arXiv preprint arXiv:2209.02535, 2022.
[13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[14] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[15] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html.
[16] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369, 2023.
[17] Stella Frank, Emanuele Bugliarello, and Desmond Elliott. Vision-and-language or vision-for-language? On cross-modal influence in multimodal transformers. pages 9847–9857, 2021. arXiv:2109.04448.
[18] Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2021.
[19] Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models. arXiv preprint arXiv:2304.14767, 2023.
[20] Lisa Anne Hendricks and Aida Nematzadeh. Probing image-language transformers for verb understanding. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3635–3644, Online, 2021. Association for Computational Linguistics.
[21] Drew A. Hudson and Christopher D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.
[22] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123:32–73, 2017.
[23] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. Technical report.
[24] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
[25] Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-Gemini: Mining the potential of multi-modality vision language models, 2024.
[26] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
[27] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26296–26306, 2024.
[28] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024.
[29] Yiwei Lyu, Paul Pu Liang, Zihao Deng, Ruslan Salakhutdinov, and Louis-Philippe Morency. DIME: Fine-grained interpretations of multimodal models via disentangled local explanations. In Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society, pages 455–467, 2022.
[30] Kevin Meng, David Bau, Alex J. Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems, 2022.
[31] Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217, 2023.
[32] Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, and Fazl Barez. Towards interpreting visual information processing in vision-language models. arXiv preprint arXiv:2410.07149, 2024.
[33] Chris Olah. Mechanistic interpretability, variables, and the importance of interpretable bases. https://www.transformer-circuits.pub/2022/mech-interp-essay, 2024. Accessed: 2024-10-20.
[34] Vedant Palit, Rohan Pandey, Aryaman Arora, and Paul Pu Liang. Towards vision-language mechanistic interpretability: A causal tracing tool for BLIP.
[35] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[36] Emmanuelle Salin, Badreddine Farah, Stéphane Ayache, and Benoit Favre. Are vision-language transformers learning multimodal representations? A probing perspective. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, 2022.
[37] Sarah Schwettmann, Neil Chowdhury, Samuel Klein, David Bau, and Antonio Torralba. Multimodal neurons in pretrained text-only transformers. In 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 2854–2859, Paris, France, 2023. IEEE.
[38] Gabriela Ben Melech Stan, Raanan Yehezkel Rohekar, Yaniv Gurwicz, Matthew Lyle Olson, Anahita Bhiwandiwalla, Estelle Aflalo, Chenfei Wu, Nan Duan, Shao-Yen Tseng, and Vasudev Lal. LVLM-Interpret: An interpretability tool for large vision-language models. arXiv preprint arXiv:2404.03118, 2024.
[39] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models.
[40] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[41] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
[42] Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, and Xu Sun. Label words are anchors: An information flow perspective for understanding in-context learning. arXiv preprint arXiv:2305.14160, 2023.
[43] Shicheng Xu, Liang Pang, Yunchang Zhu, Huawei Shen, and Xueqi Cheng. Cross-modal safety mechanism transfer in large vision-language models. arXiv preprint arXiv:2410.12662, 2024.
[44] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
[45] Xiaofeng Zhang, Chen Shen, Xiaosong Yuan, Shaotian Yan, Liang Xie, Wenxiao Wang, Chaochen Gu, Hao Tang, and Jieping Ye. From redundancy to relevance: Enhancing explainability in multimodal large language models. arXiv preprint arXiv:2406.06579, 2024.
[46] Qinyu Zhao, Ming Xu, Kartik Gupta, Akshay Asthana, Liang Zheng, and Stephen Gould. The first to know: How token distributions reveal hidden knowledge in large vision-language models? arXiv preprint arXiv:2403.09037, 2024.
[47] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023.
Cross-modal Information Flow in Multimodal Large Language Models
Supplementary Material