Inside Marker: A Guided Source Code Tour for an AI-powered PDF Layout Detection Engine

Last week, Marker, the PDF-to-Markdown converter, topped the Hacker News homepage for a while. As a curious student of the ML world, I thought it'd be a good opportunity to look under the hood and learn more about how this awesome Document AI tool works.

What is Marker?

As an analogy, think of Marker as an intelligent transcriber, capable of reading through complex books and scientific-article PDFs and converting them into clean, text-oriented Markdown files: an intelligent digitization assistant for your documents.

The official description of the tool is a bit more technical:

Marker converts PDF, EPUB, and MOBI to markdown. It's 10x faster than nougat, more accurate on most documents, and has low hallucination risk.

  • Support for a range of PDF documents (optimized for books and scientific papers)

  • Removes headers/footers/other artifacts

  • Converts most equations to latex

  • Formats code blocks and tables

  • Support for multiple languages (although most testing is done in English). See settings.py for a language list.

  • Works on GPU, CPU, or MPS

Working Overview


Marker functions in roughly 6 phases, as listed below:

  1. Preparation: Use PyMuPDF to convert any document into PDF format

  2. OCR: Run either Tesseract or OCRMyPDF to detect textual content (optionally, naive text grab through PyMuPDF as well)

  3. Layout Detection: Use a custom LayoutLMv3 model to detect tables, diagrams, titles, captions, headers and footers.

  4. Column Detection and Ordering: Use another custom LayoutLMv3 model to detect columns and order the blocks correctly (top to bottom, left to right)

  5. Equation/Code Handling: Use Nougat to convert images of equations into equivalent latex code; also detect and fix code and table blocks through heuristics.

  6. Text Cleanup/Beautification: Use a custom T5 model to clean up the text, e.g. removing unnecessary whitespace and stray characters, all in a conservative, intent-preserving way.

With the help of the above 6 phases, Marker converts any document into a clean Markdown file.

Examples

Simple Textbook

Before (PDF)

After (Markdown)

Two Column Layout

Before (PDF)

After (Markdown)

Equations in Scientific Paper

Before (PDF)

After (Markdown + Latex)

An Intuitive Explanation for Marker’s Working

At a high level, the conversion process combines many manual heuristics with four custom AI/ML models to get the job done.

The document is scanned multiple times by different processes in different stages, and each stage either adds useful information/annotations to the page or removes unnecessary elements from the page.

Step 1: Preparation

Tools: PyMuPDF 

Marker accepts epub, mobi and PDF files as input to the conversion pipeline. It uses PyMuPDF to convert epub or mobi files into PDF, as part of the preparatory stage. The rest of the pipeline operates completely on PDF content.

Step 2: OCR

Tools: OCRMyPDF, Tesseract

The first layer of information added is OCR, which gives us the text lines in the document. Marker uses OCRMyPDF or Tesseract for this step (or, when the PDF already has embedded text, a naive text grab through PyMuPDF).
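
Marker drives this step internally, but to see what OCR does in isolation, here is a minimal standalone sketch using OCRMyPDF's Python API. This is an illustration, not Marker's actual code:

    import ocrmypdf

    # Add a searchable text layer to a scanned PDF.
    # force_ocr re-runs OCR even if some text is already present.
    ocrmypdf.ocr(
        "scanned_input.pdf",
        "searchable_output.pdf",
        language="eng",
        force_ocr=True,
    )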

Step 3: Layout Detection

Tools: LayoutLMv3 (custom model - layout_segmenter)

Another layer of information added is the block type of each element, detected using LayoutLMv3. Marker can detect the following block types:

    {
      "id2label": {
        "0": "Caption",
        "1": "Footnote",
        "2": "Formula",
        "3": "List-item",
        "4": "Page-footer",
        "5": "Page-header",
        "6": "Picture",
        "7": "Section-header",
        "8": "Table",
        "9": "Text",
        "10": "Title"
      }
    }
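
If you want to inspect this mapping yourself, here is a short sketch using the Hugging Face transformers library. I am assuming the fine-tuned checkpoint is published under an id like vikp/layout_segmenter; substitute the actual model id referenced in the Marker repository if it differs:

    from transformers import AutoConfig

    # Model id is an assumption -- check the Marker repo for the real checkpoint name.
    config = AutoConfig.from_pretrained("vikp/layout_segmenter")

    # Prints the id -> block-type mapping shown above (Caption, Footnote, Formula, ...).
    for idx, label in sorted(config.id2label.items()):
        print(idx, label)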

Step 4: Ordering

Tools: LayoutLMv3 (custom model - column_detector)

A second custom LayoutLMv3 model (the column detector) detects columns and determines the reading order of blocks: top to bottom, left to right. Heuristics also remove many unwanted element types, such as headers and footers.

Step 5: Formula/Equation Conversion

Tools: Nougat

Next, the Nougat model is used to convert images of formulae and equations into latex code.

Step 6: Post-Process, Beautify

Tools: T5 (custom model - pdf_postprocessor_t5_base)

Finally, we have a model which cleans up the resultant text; the output is then converted into Markdown format and written to disk.

How to get started?

The official guide to installing Marker from source can be found on the project's GitHub page.

To convert a PDF to markdown, you can run the following command:

    python convert_single.py --parallel_factor 2 --max_pages 10 input.pdf output.md
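
If you prefer calling Marker from Python, convert_single.py is a thin wrapper around the library. The sketch below shows roughly what it does; the helper names convert_single_pdf and load_all_models are my reading of the repo's convert_single.py, so check the source for the exact imports and signatures:

    # Rough sketch of convert_single.py -- names and signatures are assumptions, verify against the repo.
    from marker.convert import convert_single_pdf
    from marker.models import load_all_models

    model_lst = load_all_models()  # loads the layout, column, Nougat, and T5 models
    full_text, out_meta = convert_single_pdf("input.pdf", model_lst, max_pages=10)

    with open("output.md", "w", encoding="utf-8") as f:
        f.write(full_text)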

Interesting Models/Libraries/Tools used in the Project

Ray: Scale AI Workloads

The official description for Ray is as follows:

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

If that definition sounds a bit abstract, you're not the only one who thinks so. I had to dig a bit deeper to understand why one would use Ray when building AI/ML-enabled software.

The most instructive explanation I could find for Ray was from the book Scaling Python with Ray.

In its simplest form, Ray is a boosted multiprocessing library: not only can it parallelize work across processes, it can do so across multiple remote machines. So if you build your AI/ML code on Ray primitives, then when you have more data to process or want a job done faster, you can simply add machines to your configuration and the computation scales out across many machines and processes.
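
As a concrete (if toy) illustration of that idea, here is a minimal Ray sketch that fans work out to parallel workers. It is not taken from Marker's code; process_page is a made-up placeholder:

    import ray

    ray.init()  # starts a local Ray runtime; point it at a cluster to scale out

    @ray.remote
    def process_page(page_number: int) -> str:
        # Placeholder for per-page work (OCR, layout detection, ...).
        return f"processed page {page_number}"

    # Launch tasks in parallel and gather the results.
    futures = [process_page.remote(i) for i in range(10)]
    results = ray.get(futures)
    print(results[:3])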

PyMuPDF: PDF Utility Kit

PyMuPDF is a fast and powerful library for creating and manipulating PDF files. In Marker, PyMuPDF is used in the first step of processing.

If the input document is, say, an EPUB or MOBI file, PyMuPDF is used to convert it into a PDF.

Each page is accessed and represented using PyMuPDF data structures.

And if the document already has text embedded and does not require OCR, PyMuPDF is used to extract the text content directly from the PDF.
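
Here is a small standalone sketch of those two uses (format conversion and direct text extraction) with PyMuPDF; it illustrates the library calls rather than Marker's actual code:

    import fitz  # PyMuPDF

    # Convert an EPUB to PDF bytes, then reopen the result as a PDF document.
    book = fitz.open("input.epub")
    pdf_bytes = book.convert_to_pdf()
    doc = fitz.open("pdf", pdf_bytes)

    # Extract embedded text page by page (the naive text-grab path).
    for page in doc:
        print(page.get_text())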

Nougat: Equation Detection for Scientific Works

Nougat is a transformer-based AI model for converting scanned and digital-born PDFs into textual markup.

A digital-born PDF is one generated directly from a digital source (for example, latex) rather than scanned. In either case, Nougat does not rely on an embedded text layer; it works from an image representation of each page's content.

Nougat works nicely for extracting equation figures from the document and converting them into valid latex code.

For example, given an image of a paragraph containing inline equations, Nougat will convert it into markup like the following:

    The well known Pythagorean theorem $x^2 + y^2 = z^2$ was 
    proved to be invalid for other exponents. 
    Meaning the next equation has no integer solutions:

    $ x^n + y^n = z^n $

Nougat uses the popular Markdown format to represent document structure, while the formula components are converted to latex code.

LayoutLMv3: Layout Detection for Books and Documents

LayoutLMv3 is a transformer-based, multimodal document AI model, and an evolution of its predecessors (LayoutLM and LayoutLMv2).

It understands images and text together, using techniques such as patch embedding (in the ViT style), masked image modeling, and word-patch alignment. We will not go into the technical details of how the training is done in this article.

First of all, it is important to understand that LayoutLMv3 is a powerful, general-purpose framework that can be fine-tuned for many text- and image-oriented document tasks:

  • Form understanding

  • Document Summary

  • Document classification

  • Question and Answer on Documents

  • Scene Understanding

  • Object Detection Within Documents

  • Column Detection

  • And more

Within Marker, we use LayoutLMv3 for the specific purpose of finding document elements such as paragraphs, figures, tables, formulae, and so on. We also use a second LayoutLMv3 model to determine column ordering within documents.
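
To give a feel for what token classification over a page looks like, here is a minimal inference sketch with the Hugging Face transformers API. It uses the public microsoft/layoutlmv3-base checkpoint and made-up words and boxes; Marker's fine-tuned checkpoints and its pre- and post-processing differ:

    from PIL import Image
    from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification

    # Base checkpoint for illustration; Marker ships its own fine-tuned weights.
    processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
    model = LayoutLMv3ForTokenClassification.from_pretrained("microsoft/layoutlmv3-base", num_labels=11)

    image = Image.open("page.png").convert("RGB")
    words = ["Introduction", "This", "paper", "proposes"]   # OCR'd words (made up here)
    boxes = [[70, 50, 300, 80], [70, 100, 120, 120],        # word boxes, normalized to 0-1000
             [130, 100, 190, 120], [200, 100, 300, 120]]

    encoding = processor(image, words, boxes=boxes, return_tensors="pt", truncation=True)
    logits = model(**encoding).logits        # one prediction per token
    block_type_ids = logits.argmax(-1)       # map ids to labels via model.config.id2label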

T5: PDF Post-Processing Model

T5, or Text-to-Text Transfer Transformer, is a powerful AI/ML model developed by Google AI.  It's specifically designed for text-to-text tasks, making it a good fit for post-processing our textual content.

Similar to LayoutLMv3, T5 is also a unified framework for text manipulation tasks. Marker uses its own custom PDF post-processing model to make conservative improvements to the extracted full text (see the sketch after this list):

  1. Remove unwanted and weird characters from previous stages. Often OCR and other steps introduce noisy characters, which do not belong with the actual content from the source.

  2. Beautify or clean up content: Marker deletes unnecessary indents, spaces, and similar artifacts around elements.

  3. Preserve original intent: Marker is conservative in applying changes or edits; so only the most obvious mistakes are corrected, to preserve the integrity of the original document.
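
To illustrate the text-to-text calling pattern such a cleanup pass uses, here is a generic sketch with Hugging Face's T5 classes. Marker's pdf_postprocessor_t5_base is a custom model, so the checkpoint, task prefix, and behaviour below are purely illustrative; a stock t5-base will not actually clean anything:

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    # Generic checkpoint for illustration; Marker ships its own custom post-processor.
    tokenizer = T5Tokenizer.from_pretrained("t5-base")
    model = T5ForConditionalGeneration.from_pretrained("t5-base")

    noisy = "The   quick brown f0x jumps  over the lazy dog ."
    inputs = tokenizer("clean: " + noisy, return_tensors="pt")   # hypothetical task prefix
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))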

A Note About Performance

Detailed benchmark results are available in the Marker repository. Here I will quickly summarise the speed, accuracy, and memory usage numbers.

Accuracy

Marker is comparable in accuracy to nougat. While nougat performs slightly better on scientific papers alone, in general nougat and marker hover around 65% and 63% accuracy respectively. In comparison, naive "get text" extraction delivers around 28% accuracy.

Speed

Marker is around 10x faster than nougat at processing pages and documents. While nougat takes around 3 seconds per page, Marker gets it done in around 0.3 seconds per page.

Memory Usage

Marker takes around 2 GB of VRAM on average per task. On an A6000 GPU (48 GB of VRAM), that works out to roughly 24 documents processed in parallel.

Conclusion

Marker tends to work great on PDFs of the digital-native variety, by which I mean latex-generated PDFs. It also works well on high-quality scans of various documents. However, it seemed to fail on an old book from archive.org that I wanted to read; state-of-the-art commercial solutions such as Amazon Textract or Azure Document AI handle this type of document nearly perfectly.

While it is clear that open-source document AI such as Marker still has a long way to go, compared to previous solutions Marker is a rich contribution to the FOSS Document AI ecosystem. For this, we must be grateful to Vik Paruchuri, the author of Marker. Thanks, Vik!