How we built a custom Vision LLM to improve document processing at Grab

In the world of digital services, accurate extraction of information from user-submitted documents such as identification (ID) cards, driver’s licenses, and registration certificates is a critical first step for processes like electronic know-your-customer (eKYC). This task is especially challenging in Southeast Asia (SEA) due to the diversity of languages and document formats.

We began this journey to address the limitations of traditional Optical Character Recognition (OCR) systems, which struggled with the variety of document templates they had to process. Powerful proprietary Large Language Models (LLMs) were an option, but they often fell short in understanding SEA languages, produced errors and hallucinations, and had high latency. Open-source Vision LLMs, on the other hand, were more efficient but not accurate enough for production.

This prompted us to fine-tune existing models and, ultimately, to develop a lightweight, specialized Vision LLM from the ground up. This blog post is our account of the entire process.

Figure 1: Simplified overview of how a Vision LLM works.

You’ve likely heard of LLMs that process text. You give the LLM a text prompt, and it responds with a text output. A Vision LLM takes this a step further by allowing the model to understand images. The basic architecture involves three key components:

Image encoder: This component ‘looks’ at an image and converts it into a numerical (vectorized) format.
Vision-language projector: It acts as a translator, converting the image’s numerical format into a representation that the language model can understand.
Language mo...
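
To make the data flow between these components concrete, here is a minimal sketch in plain Python. It is a toy illustration only, not the model described in this post: the patch size, the vector widths, the random weights, and every function name are assumptions made for the example. It shows an image being split into patches and encoded, the resulting vectors being projected to the width the language model expects, and the projected "image tokens" being placed in front of the prompt embeddings that a language model would then consume.

```python
import random

random.seed(0)  # deterministic toy weights

PATCH = 2        # hypothetical patch size (pixels per side)
VISION_DIM = 4   # hypothetical width of the image encoder output
TEXT_DIM = 6     # hypothetical width of the language model embeddings


def random_matrix(rows, cols):
    """Stand-in for learned weights; a real model would load trained parameters."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def image_encoder(image, weights):
    """Cut a grayscale image into PATCH x PATCH tiles and embed each tile.

    Assumes the image height and width are multiples of PATCH.
    """
    features = []
    for top in range(0, len(image), PATCH):
        for left in range(0, len(image[0]), PATCH):
            tile = [image[top + r][left + c]
                    for r in range(PATCH) for c in range(PATCH)]
            features.append(matvec(weights, tile))
    return features


def projector(features, weights):
    """Map each vision vector to the width of the language model embeddings."""
    return [matvec(weights, f) for f in features]


def build_llm_input(image, prompt_embeddings, enc_w, proj_w):
    """Place projected image tokens in front of the text prompt embeddings."""
    image_tokens = projector(image_encoder(image, enc_w), proj_w)
    return image_tokens + prompt_embeddings


# Toy usage: a 4x4 "image" and a 3-token prompt with made-up embeddings.
image = [[(r * 4 + c) / 16 for c in range(4)] for r in range(4)]
prompt_embeddings = random_matrix(3, TEXT_DIM)
enc_w = random_matrix(VISION_DIM, PATCH * PATCH)
proj_w = random_matrix(TEXT_DIM, VISION_DIM)

sequence = build_llm_input(image, prompt_embeddings, enc_w, proj_w)
print(len(sequence), len(sequence[0]))  # 4 image tokens + 3 text tokens, width 6
```

In a real Vision LLM the encoder and language model are deep neural networks and the projector is trained rather than random. The sketch keeps only the shape bookkeeping: whatever the encoder produces must be projected to the language model's embedding width before the two sequences can be combined.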