Inside Marker: A Guided Source Code Tour of an AI-Powered PDF Layout Detection Engine
Last week, Marker, the PDF to Markdown converter, topped the Hacker News homepage for a while. As a curious student in the ML world, I thought it’d be a good opportunity to look under the hood, and learn more about how this awesome Document AI tool works.
What is Marker?
As an analogy, think of Marker as an intelligent transcriber, capable of reading through complex books and scientific articles in PDF form and converting them into clean, text-oriented Markdown files. In other words, it is a smart assistant for your document digitization needs.
The official description of the tool is a bit more technical:
Marker converts PDF, EPUB, and MOBI to markdown. It's 10x faster than nougat, more accurate on most documents, and has low hallucination risk.
- Support for a range of PDF documents (optimized for books and scientific papers)
- Removes headers/footers/other artifacts
- Converts most equations to latex
- Formats code blocks and tables
- Support for multiple languages (although most testing is done in English). See settings.py for a language list.
- Works on GPU, CPU, or MPS
Working Overview
Marker functions in roughly 6 phases, as listed below:
- Preparation: Use PyMuPDF to convert any document into PDF format
- OCR: Run either Tesseract or OCRmyPDF to detect textual content (optionally, a naive text grab through PyMuPDF as well)
- Layout Detection: Use a custom LayoutLMv3 model to detect tables, diagrams, titles, captions, headers and footers.