Inside Marker: A Guided Source Code Tour for an AI-Powered PDF Layout Detection Engine

Last week, Marker, the PDF-to-Markdown converter, topped the Hacker News homepage for a while. As a curious student in the ML world, I thought it would be a good opportunity to look under the hood and learn how this awesome Document AI tool works.

What is Marker?

As an analogy, think of Marker as an intelligent transcriber that can read through complex books and scientific articles in PDF form and convert them into clean, text-oriented Markdown files. In short, it is a digitization assistant for your documents.

The official description of the tool is a bit more technical:

Marker converts PDF, EPUB, and MOBI to markdown. It's 10x faster than nougat, more accurate on most documents, and has low hallucination risk.

Support for a range of PDF documents (optimized for books and scientific papers)
Removes headers/footers/other artifacts
Converts most equations to latex
Formats code blocks and tables
Support for multiple languages (although most testing is done in English). See settings.py for a language list.
Works on GPU, CPU, or MPS
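
The GPU, CPU, and MPS support listed above implies some fallback order between devices. Here is a minimal sketch of what that selection could look like, assuming a PyTorch backend (the description above does not name the framework). `pick_device` is a hypothetical helper for illustration, not part of Marker's API.

```python
def pick_device() -> str:
    """Return the name of the best available compute device.

    Assumption: PyTorch is the ML backend. If it is not installed,
    fall back to "cpu" so the sketch still runs.
    """
    try:
        import torch
    except ImportError:
        return "cpu"

    # Prefer an NVIDIA GPU when CUDA is usable.
    if torch.cuda.is_available():
        return "cuda"

    # Older PyTorch builds have no torch.backends.mps, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


if __name__ == "__main__":
    print(pick_device())
```

The order here (CUDA, then MPS, then CPU) is a common convention rather than something the tool's description specifies.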

Working Overview

Marker works in roughly six phases:

Preparation: Use PyMuPDF to convert any document into PDF format
OCR: Run either Tesseract or OCRmyPDF to detect textual content (optionally, a naive text grab through PyMuPDF as well)
Layout Detection: Use a custom LayoutLMv3 model to detect tables, diagrams, titles, captions, headers and footers.
布局检测：使用自定义的LayoutLMv3模型来...