Inside Marker: A Guided Source Code Tour of an AI-Powered PDF Layout Detection Engine

Last week, Marker, the PDF-to-Markdown converter, topped the Hacker News homepage for a while. As a curious student in the ML world, I thought it would be a good opportunity to look under the hood and learn how this Document AI tool works.

What is Marker?

As an analogy, think of Marker as an intelligent transcriber: it reads through complex books and scientific articles in PDF form and converts them to clean, text-oriented Markdown files.

The official description of the tool is a bit more technical:

Marker converts PDF, EPUB, and MOBI to markdown. It's 10x faster than nougat, more accurate on most documents, and has low hallucination risk.

  • Support for a range of PDF documents (optimized for books and scientific papers)

  • Removes headers/footers/other artifacts

  • Converts most equations to latex

  • Formats code blocks and tables

  • Support for multiple languages (although most testing is done in English). See settings.py for a language list.

  • Works on GPU, CPU, or MPS

Working Overview

Marker functions in roughly 6 phases, as listed below:

  1. Preparation: Use PyMuPDF to convert any document into PDF format

  2. OCR: Run either Tesseract or OCRmyPDF to detect textual content (optionally, text can be grabbed directly through PyMuPDF instead)

  3. Layout Detection: Use a custom LayoutLMv3 model to detect tables, diagrams, titles, captions, headers and footers.
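
To make the role of layout detection more concrete, here is a minimal sketch of how per-block labels could be used downstream to drop headers and footers and emit Markdown. This is not Marker's actual code: the `Block` type, the label strings, and the `blocks_to_markdown` function are all hypothetical names chosen for illustration, and the real model output and post-processing may look quite different.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A hypothetical layout block: a label from the layout model plus its text."""
    label: str  # assumed label set: "title", "text", "caption", "header", "footer"
    text: str


# Labels treated as page artifacts and removed (an assumption for this sketch).
DROP_LABELS = {"header", "footer"}


def blocks_to_markdown(blocks):
    """Turn labeled blocks into Markdown, skipping headers and footers."""
    parts = []
    for block in blocks:
        if block.label in DROP_LABELS:
            continue  # page artifact, not part of the document body
        text = block.text.strip()
        if not text:
            continue  # ignore empty blocks
        if block.label == "title":
            parts.append(f"# {text}")
        elif block.label == "caption":
            parts.append(f"*{text}*")
        else:
            parts.append(text)
    # Separate blocks with blank lines so each becomes its own paragraph.
    return "\n\n".join(parts)


page = [
    Block("header", "Journal of Examples, Vol. 1"),
    Block("title", "Introduction"),
    Block("text", "Layout labels decide what is kept."),
    Block("footer", "Page 1"),
]
print(blocks_to_markdown(page))
```

The point of the sketch is only that once each region carries a label, removing artifacts and formatting titles or captions becomes a simple filter over the blocks.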
