Generalized Visual Language Models

Processing images to generate text, such as image captioning and visual question-answering, has been studied for years. Traditionally such systems rely on an object detection network as a vision encoder to capture visual features and then produce text via a text decoder. Given a large amount of existing literature, in this post, I would like to only focus on one approach for solving vision language tasks, which is to extend pre-trained generalized language models to be capable of consuming visual signals.

I roughly group such vision language models (VLMs) into four buckets:

  1. Translating images into embedding features that can be jointly trained with token embeddings.
  2. Learning good image embeddings that can work as a prefix for a frozen, pre-trained language model.
  3. Using a specially designed cross-attention mechanism to fuse visual information into layers of the language model.
  4. Combining vision and language models without any training.
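To make the second bucket concrete before diving in: a small trainable projection maps frozen vision-encoder features into the language model's embedding space, and the projected vectors are prepended to the text token embeddings as a "visual prefix". The sketch below is a hypothetical illustration; all names and dimensions are assumptions, not any specific model's API.

```python
import numpy as np

def prefix_inputs(image_features, text_token_embs, W_proj):
    """Project image features into the LM embedding space and prepend
    them as a visual prefix to the text token embeddings."""
    visual_prefix = image_features @ W_proj              # (n_img, d_lm)
    return np.concatenate([visual_prefix, text_token_embs], axis=0)

# Hypothetical dimensions: a frozen vision encoder emits 5 feature
# vectors of size 1024; the frozen LM uses 768-d token embeddings.
rng = np.random.default_rng(0)
image_features = rng.normal(size=(5, 1024))
W_proj = rng.normal(scale=0.02, size=(1024, 768))  # the only trainable weights
text_embs = rng.normal(size=(12, 768))

sequence = prefix_inputs(image_features, text_embs, W_proj)  # (17, 768)
```

Only `W_proj` would receive gradients in this family of methods; the vision encoder and the language model both stay frozen.
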

Jointly Training with Image and Text

One straightforward approach to fuse visual information into language models is to treat images as normal text tokens and train the model on a sequence of joint representations of both text and images. Precisely, images are divided into multiple smaller patches and each patch is treated as one “token” in the input sequence.

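The patch-as-token step can be sketched in a few lines of numpy (ViT-style patchification; the patch size and dimensions below are illustrative assumptions): each non-overlapping patch is flattened into one vector, and a learned linear layer would then project it to the token-embedding dimension.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches,
    one row per patch, so each patch can act as one input 'token'."""
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "image must tile evenly into patches"
    patches = image.reshape(H // P, P, W // P, P, C)
    patches = patches.transpose(0, 2, 1, 3, 4)   # (H/P, W/P, P, P, C)
    return patches.reshape(-1, P * P * C)        # (num_patches, P*P*C)

image = np.random.rand(224, 224, 3)
tokens = patchify(image, patch_size=16)  # (196, 768): 196 image "tokens"
```

In a real model, a trainable linear projection then maps each flattened patch into the language model's token-embedding space before it joins the text sequence.
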
VisualBERT (Li et al. 2019) feeds both text inputs and image regions into BERT so that it can discover the internal alignment between images and text via the self-attention mechanism.
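As a stripped-down sketch of self-attention over such a joint text-plus-region sequence: this omits the learned query/key/value projections, multiple heads, and the positional/segment embeddings that the real model uses, and all shapes are assumptions for illustration.

```python
import numpy as np

def self_attention(X):
    """Single-head scaled dot-product attention with identity Q/K/V
    projections (a simplification of what BERT layers actually learn)."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                   # all-pairs similarity
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ X

rng = np.random.default_rng(0)
text_embs = rng.normal(size=(10, 768))              # 10 text tokens
region_feats = rng.normal(size=(4, 2048))           # 4 detected region features
W_proj = rng.normal(scale=0.02, size=(2048, 768))   # project regions to 768-d
joint = np.concatenate([text_embs, region_feats @ W_proj], axis=0)

out = self_attention(joint)  # (14, 768): every text token attends to every region and vice versa
```

Because text tokens and image regions live in one sequence, attention weights can link a word directly to the region it describes, which is the alignment the model discovers during training.
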

