XY-Cut++: Advanced Layout Ordering via Hierarchical Mask Mechanism on a Novel Benchmark
Shuai Liu1 , Youmeng Li1*, Jizeng Wei1
arXiv:2504.10258v3 [cs.CV] 28 Jan 2026
1 College of Intelligence and Computing, Tianjin University, Tianjin, 300350, China.
*Corresponding author(s). E-mail(s): liyoumeng@tju.edu.cn;
Contributing authors: shuai liu@tju.edu.cn; weijizeng@tju.edu.cn;
Abstract
Document Reading Order Recovery is a fundamental task in document image understanding, playing a
pivotal role in enhancing Retrieval-Augmented Generation (RAG) and serving as a critical preprocess-
ing step for large language models (LLMs). Existing methods often struggle with complex layouts (e.g.,
multi-column newspapers), high-overhead interactions between cross-modal elements (visual regions and
textual semantics), and a lack of robust block-level evaluation benchmarks. We introduce XY-Cut++,
an advanced layout ordering method that integrates pre-mask processing, multi-granularity segmenta-
tion, and cross-modal matching to address these challenges. Our method significantly enhances layout
ordering accuracy compared to traditional XY-Cut techniques. Specifically, XY-Cut++ achieves state-
of-the-art performance (98.8 BLEU overall) while maintaining simplicity and efficiency. It outperforms
existing baselines by up to 24% and demonstrates consistent accuracy across simple and complex layouts
on the newly introduced DocBench-100 dataset. This advancement establishes a reliable foundation
for document structure recovery, setting a new standard for layout ordering tasks and facilitating more
effective RAG and LLM preprocessing.
Keywords: Document reading order, Document layout analysis, XY-Cut++, DocBench-100
1 Introduction
Document Reading Order Recovery is a fundamental task in document image understanding, serving as
a cornerstone for enhancing Retrieval-Augmented Generation (RAG) systems and enabling high-quality
data preprocessing for large language models (LLMs). Accurate layout recovery is essential for applications
such as digital publishing and knowledge base construction. However, this task faces several significant
challenges: (1) complex layout structure (e.g., multi-column layout, nested text boxes, non-rectangular text
regions, cross-page content), (2) inefficient cross-modal alignment due to the high computational costs of
integrating visual and textual features, and (3) the lack of standardized evaluation protocols for block-
level reading order. Traditional approaches like XY-Cut [1] fail to model semantic dependencies in complex
designs, while deep learning methods such as LayoutReader [2] suffer from prohibitive latency, limiting real-
world deployment. Compounding these issues, existing datasets like ReadingBank [2] focus on word-level
sequence annotations, which inadequately evaluate block-level structural reasoning—a necessity for real-
world layout recovery. Although OmniDocBench [3] recently introduced block-level analysis support, its
coverage of diverse complex layouts (e.g., newspapers, technical reports) remains sparse, further hindering
systematic progress. Meanwhile, large-scale layout datasets such as DocBank [4] and PubLayNet [5] have
advanced document layout analysis but primarily target word/line-level or scientific article layouts and do
not directly benchmark block-level reading order.
To address these challenges, we propose XY-Cut++, an advanced framework for layout ordering that
incorporates three core innovations: (a) pre-mask processing to mitigate interference from high-dynamic-
range elements (e.g., title, figure, table), (b) multi-granularity region splitting for the adaptive decomposition
of diverse layouts, and (c) lightweight cross-modal matching leveraging minimal semantic cues. Our approach
significantly outperforms traditional methods in layout ordering accuracy. Additionally, we introduce
Fig. 1 XY-Cut++ Architecture with Hierarchical Mask Mechanism and DocBench-100 Benchmark. (left) DocBench-100 dataset composition and dual evaluation protocols. (right) Algorithm workflow integrating adaptive pre-mask processing, multi-granularity segmentation, and cross-modal matching.
DocBench-100, a novel benchmark dataset designed for evaluating layout ordering techniques. It includes
100 pages (30 complex and 70 regular layouts) with block-level reading order annotations. Extensive exper-
iments on DocBench-100 demonstrate that our method achieves state-of-the-art performance. Specifically,
our method achieves BLEU-4 scores of 98.6 (complex) and 98.9 (regular), surpassing baselines by 24% on
average while maintaining 1.06× the FPS of geometry-only approaches.
The contributions of this paper are summarized as follows:
1. We present a simple yet high-performance enhanced XY-Cut framework that fuses shallow semantics
and geometry awareness to achieve accurate layout ordering.
2. We curate DocBench-100, a block-level benchmark with diverse complex layouts that complements
existing datasets and directly targets reading-order evaluation.
3. We achieve state-of-the-art results on both existing and new block-level datasets, and provide
extensive ablations of each component of our method.
2 Related Work
Document Reading Order Recovery has advanced from early heuristic rule-based methods to modern deep
learning models, yet reliably handling complex layouts remains challenging.
2.1 Traditional Approaches
The XY-Cut algorithm [1] is a foundational technique in Document Reading Order Recovery that recur-
sively divides documents into smaller regions based on horizontal and vertical projections. As illustrated in
Figure 2, while it is effective for simple layouts, it struggles with complex structures, leading to inaccuracies
Fig. 2 XY-Cut Recursive Partitioning Workflow and Failure Analysis in Complex Layouts: (1) Initial document segmentation
steps, (2) Connectivity assumption violations caused by cross-layout cell structures (cell 5), and (3) Error propagation through
subsequent layout ordering. The correct reading order is 1, 3, 2, 4, 5, 6, 7.
Fig. 3 Challenges posed by L-shaped inputs: (1) segmentation failure due to the inability to process L-shaped structures, and
(2) missegmentation caused by improper handling of L-shaped regions. The correct segmentation order is ➁+➂, then ➀.
in reading order recovery. Specifically, the rigid threshold mechanisms of XY-Cut can introduce hierarchical
errors when handling intricate layouts, resulting in suboptimal performance. To address these limitations,
various enhancements have been proposed. For instance, dynamic programming optimizations [6] have
been introduced to improve segmentation stability. Additionally, mask-based normalization techniques [7]
have been developed to mitigate some of the challenges associated with complex layouts. However, these
improvements are still insufficient for handling intricate layouts and cross-page content. In our analysis of
the XY-Cut algorithm, as discussed in [6, 7] and illustrated in Figure 3, we identified that many challeng-
ing cases arise from L-shaped inputs. A straightforward solution involves initially masking these L-shaped
regions and subsequently restoring them. Upon implementing this approach, we found it to be remarkably
effective while maintaining both simplicity and efficiency.
2.2 Deep Learning Approaches
Recent advances in Document Reading Order Recovery have been driven by deep learning models that effec-
tively leverage multimodal cues. The LayoutLM series [8–12] pioneered the integration of textual semantics
with 2D positional encoding, where LayoutLMv2 [9] introduced spatially aware pre-training objectives for
cross-modal alignment and LayoutLMv3 [10] further unified text/image embeddings through self-supervised
masked language modeling and image-text matching. LayoutXLM [11] extended this framework to multilin-
gual document understanding. XYLayoutLM [12] advanced the field with its Augmented XY-Cut algorithm
and Dilated Conditional Position Encoding, addressing variable-length layout modeling and generating
reading order-aware representations for enhanced document understanding. Building upon these founda-
tions, LayoutReader [2] demonstrated the effectiveness of deep learning in explicit reading order prediction
through the sequential modeling of paragraph relationships. Critical to these advancements are large-scale
datasets like DocBank [4], which provides weakly supervised layout annotations for fine-grained spatial mod-
eling, and PubLayNet [5], which contains over 360,000 scientific document pages with hierarchical labels
that encode implicit reading order priors through structural patterns. Specialized benchmarks like Table-
Bank [13] further address domain-specific challenges by preserving the ordering of tabular data for table
structure recognition.
Architectural innovations have significantly enhanced spatial reasoning capabilities. DocRes [14] intro-
duces dynamic task-specific prompts (DTSPrompt), enabling the model to perform various document
restoration tasks, such as dewarping, deshadowing, and appearance enhancement, meanwhile improving
the overall readability and structure of documents. [15] iteratively aggregates features across different lay-
ers and resolutions, resulting in more refined feature representations that are beneficial for complex layout
analysis tasks. By leveraging Hierarchical Document Analysis (HDA), models can more effectively capture
intricate structural relationships within documents, facilitating more accurate predictions of reading order.
Furthermore, advancements in unsupervised learning methods, such as those employed in DocUNet [16],
have enabled more effective handling of document distortions, thereby enhancing OCR performance and
layout analysis accuracy. DocFormer [17] unified text, visual, and layout embeddings through transformer
fusion, improving contextual understanding for logical flow prediction. Complementary to these approaches,
EAST [18] established robust text detection through geometry-aware frameworks, serving as a critical
preprocessing step for element-level sequence derivation.
2.3 Benchmarks for Document Reading Order Recovery
Document Reading Order Recovery has seen significant advancements with deep learning models like Lay-
outLM [8] and LayoutReader [2], which integrate visual and textual features for tasks such as reading order
prediction. However, existing methods face critical limitations in handling complex layouts (e.g., multi-
column structures, cross-page content). A key challenge is the lack of benchmarks that directly evaluate
block-level reading order, which is essential for applications like document digitization. While datasets such
as ReadingBank [2] provide word-level annotations for predicting reading sequences, their design complicates
the evaluation of methods focused on block-level performance. Specifically, word-level sequence annotations
Table 1 DocBench-100 subset statistics by column structure. Percentages are computed within each subset.

Subset | Pages | Single | Double | ≥3
Dc     | 30    | 3.3%   | 6.7%   | 90.0%
Dr     | 70    | 38.6%  | 54.4%  | 7.0%
do not directly support the assessment of dependencies between text blocks (e.g., the order of paragraphs or figures
spanning multiple columns), which are essential for modeling complex layouts. Recently, OmniDocBench [3]
has supported block-level analysis but suffers from limited coverage of complex layouts and sparse repre-
sentation of domain-specific layouts (e.g., newspapers). To address these gaps, we introduce DocBench-100,
which offers a broader range of layout structures and explicit metrics for assessing reading order accuracy,
thereby enabling robust benchmarking of layout recovery systems.
3 DocBench-100 Dataset
To enable rigorous evaluation of block-level reading order, we construct DocBench-100, a curated
benchmark emphasizing diverse real-world layouts. It complements existing datasets (ReadingBank [2],
OmniDocBench [3], DocBank [4], PubLayNet [5]) by focusing on page-level block ordering across complex
structures.
3.1 Sources and Composition
We source candidate pages from public document detection corpora (notably PP-DocLayout [19]) and
MinerU [20] extraction outputs, then select pages exhibiting challenging phenomena: multi-column articles,
irregular or spanning titles, interleaved figures/captions, and nested regions. The dataset comprises 100
pages split into: complex (Dc , 30 pages) and regular (Dr , 70 pages), following the prevalence in practical
applications. An overview is shown in Figure 4.
3.2 Fields and File Structure
Each page includes an image and two JSON files: (1) an input JSON (No Index) with fields page id,
page size, bbox, label; (2) a GT JSON that additionally includes block-level reading order index. This
design supports both end-to-end and oracle-detection evaluation protocols.
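For illustration, the two per-page JSON files described above might look as follows. The exact schema (underscore spellings, the `blocks` wrapper, and coordinate conventions) is an assumption for this sketch, not the released format:

```python
import json

# Hypothetical input JSON (No Index): page_id, page_size, and per-block
# bbox/label, as listed in the text. Coordinates are (x1, y1, x2, y2).
input_record = {
    "page_id": 0,
    "page_size": [1654, 2339],  # width, height in pixels (assumed convention)
    "blocks": [
        {"bbox": [120, 90, 1530, 180], "label": "doc_title"},
        {"bbox": [120, 220, 810, 1400], "label": "text"},
        {"bbox": [860, 220, 1530, 1400], "label": "text"},
    ],
}

# The GT JSON additionally carries a block-level reading-order index.
gt_record = {
    "page_id": 0,
    "page_size": [1654, 2339],
    "blocks": [
        {"bbox": [120, 90, 1530, 180], "label": "doc_title", "index": 0},
        {"bbox": [120, 220, 810, 1400], "label": "text", "index": 1},
        {"bbox": [860, 220, 1530, 1400], "label": "text", "index": 2},
    ],
}

# Round-trip through JSON to confirm serializability, then read off the order.
assert json.loads(json.dumps(gt_record)) == gt_record
order = [b["index"] for b in gt_record["blocks"]]
```

The input file deliberately omits `index`, so an ordering method can be evaluated by comparing its predicted sequence against the GT indices.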
3.3 Annotation Pipeline
We adopt a two-stage pipeline:
• Automatic pre-annotation: MinerU [20] provides initial blocks, labels, and an order hypothesis.
• Manual verification and screening: Annotators correct segmentation and labels, and assign final
index. Pages with inherently ambiguous global reading order (e.g., collage-like scans lacking semantic
anchors) are excluded. All remaining reading orders are manually verified.
3.4 Statistics and Usage Protocols
Dc contains predominantly multi-column layouts (≥ 3 columns) and irregular titles, while Dr is dominated
by single/double columns common in academic and business documents. Summary statistics by column
structure are reported in Table 1. We recommend reporting both: (a) end-to-end image-based evaluation
(requires detection) and (b) JSON-based evaluation on block sequences. We will release the evaluation script
for block-level BLEU and dataset documentation.
4 Methods
Our geometry–semantic fusion pipeline (Figure 5) has four stages: (1) PP-DocLayout [19] extracts visual
features and shallow semantic labels; (2) highly dynamic elements (e.g., titles, tables, figures) are pre-
masked to alleviate the L-shape problem; (3) cross-layout elements are identified via global analysis, then
processed with masking, real-time density estimation, and heuristic sorting; and (4) masked elements are
re-mapped using nearest IoU edge-weighted margins. This geometry-aware pipeline with shallow semantics
attains state-of-the-art results on DocBench-100 and OmniDocBench [3].
Fig. 4 DocBench-100 image overview: (a) 70-page regular subset Dr , and (b) 30-page complex subset Dc . All pages provide
block-level reading-order ground truth for benchmarking layout-ordering methods.
4.1 Pre-Mask Processing
Our Pre-Masking mechanism is designed to reduce the interference of highly dynamic elements (e.g., figures
and tables) on core block sorting. These elements can appear anywhere in documents with high positional
flexibility, which may disrupt sorting logic. Specifically, we first identify such elements using shallow semantic
labels from the PP-DocLayout detection model. We then build a binary mask to temporarily exclude these
elements from subsequent multi-granularity segmentation and core block sorting. After multi-granularity
sorting, we remap the masked elements back to the ordered layout in the cross-modal matching phase, using
an IoU-weighted distance nearest-neighbor strategy. This two-stage “mask-then-remap” pipeline effectively
separates highly dynamic elements from the key sorting stage, ensuring accurate core block ordering while
maintaining layout fidelity, and also reduces the previously mentioned “L”-shaped input issues.
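The mask step of this pipeline can be sketched in a few lines. The block representation and the concrete set of highly dynamic labels below are illustrative assumptions (the text names figures, tables, and titles as examples):

```python
# Assumed set of highly dynamic labels; the paper's detector may use a
# different label inventory.
DYNAMIC_LABELS = {"figure", "table", "title"}

def pre_mask(blocks):
    """Split blocks into core blocks (sorted by multi-granularity
    segmentation) and masked blocks (restored later during
    cross-modal matching)."""
    core = [b for b in blocks if b["label"] not in DYNAMIC_LABELS]
    masked = [b for b in blocks if b["label"] in DYNAMIC_LABELS]
    return core, masked

blocks = [
    {"bbox": [0, 0, 100, 20], "label": "title"},
    {"bbox": [0, 30, 100, 200], "label": "text"},
    {"bbox": [0, 210, 100, 300], "label": "figure"},
    {"bbox": [0, 310, 100, 400], "label": "text"},
]
core, masked = pre_mask(blocks)
```

Only `core` enters the recursive segmentation; `masked` is reinserted afterwards by the IoU-weighted nearest-neighbor remapping.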
4.2 Multi-Granularity Segmentation
As shown in Figure 6, our hybrid segmentation framework integrates cross-layout element masking, geometric
pre-segmentation, and density-driven adaptive refinement in three key phases. Each phase targets specific
layout challenges, enabling robust segmentation for both regular and complex documents.
Phase 1: Cross-Layout Detection We first compute the document-level median bounding box length
to establish an adaptive threshold for cross-layout elements (e.g., cross-column text spanning multiple grid
units). Each content block $B_i$ is represented by a 4-tuple $(x_1, y_1, x_2, y_2)$, where $(x_1, y_1)$ is the top-left corner
and $(x_2, y_2)$ is the bottom-right corner. Let $\{l_i\}_{i=1}^{N}$ denote the length of each content bounding box $B_i$:
for horizontal-layout documents (commonly used in daily scenarios, e.g., reports, articles), $l_i = x_{2,i} - x_{1,i}$
(horizontal width); for vertical-layout documents (e.g., classical Chinese texts), $l_i = y_{2,i} - y_{1,i}$ (vertical
height). $\beta = 1.3$ is an empirically tuned scaling factor. The adaptive threshold $T_l$ is:

$$T_l = \beta \cdot \mathrm{median}\left(\{l_i\}_{i=1}^{N}\right) \tag{1}$$
Cross-layout elements are detected via two criteria: (1) li > Tl , and (2) horizontal projection overlap
with at least 2 other blocks. The judgment function $C_{\mathrm{cross}}(B_i)$ is defined as:

$$C_{\mathrm{cross}}(B_i) = \begin{cases} 1, & l_i > T_l \ \wedge\ \sum_{j \neq i} I_{\mathrm{overlap}}(B_i, B_j) \geq 2 \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

Here, $I_{\mathrm{overlap}}(B_i, B_j)$ is an indicator function (1 for horizontal projection overlap, 0 otherwise). Detected
cross-layout elements are masked for subsequent layout-aware splitting, while single-layout elements (within
one grid unit) are processed immediately to avoid interference.
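Eqs. (1)–(2) for horizontal-layout pages might be sketched as follows; the box tuple format and the overlap test are simplified assumptions:

```python
from statistics import median

BETA = 1.3  # empirically tuned scaling factor from Eq. (1)

def h_overlap(a, b):
    """Indicator for horizontal-projection overlap of boxes (x1, y1, x2, y2)."""
    return a[0] < b[2] and b[0] < a[2]

def cross_layout_flags(boxes, alpha=2):
    """Flag boxes whose width exceeds beta * median width AND whose
    horizontal projection overlaps at least `alpha` other boxes."""
    lengths = [x2 - x1 for (x1, y1, x2, y2) in boxes]
    t_l = BETA * median(lengths)
    flags = []
    for i, box in enumerate(boxes):
        overlaps = sum(h_overlap(box, other)
                       for j, other in enumerate(boxes) if j != i)
        flags.append(lengths[i] > t_l and overlaps >= alpha)
    return flags

boxes = [
    (0, 0, 400, 40),      # wide block spanning both columns
    (0, 50, 180, 300),    # left column
    (220, 50, 400, 300),  # right column
    (0, 320, 180, 500),   # left column
]
flags = cross_layout_flags(boxes)
```

Here the median width is 180, so $T_l = 234$; only the first, full-width block exceeds it while overlapping at least two others, so only it is flagged as cross-layout.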
Fig. 5 End-to-End Layout Ordering for Diverse Document Layouts Framework Overview: (a) Layout Detection (PP-DocLayout [19]), (b) Pre-Mask Processing, (c) Multi-Granularity Segmentation, and (d) Cross-Modal Matching.
Fig. 6 Multi-Granularity Segmentation: (1) cross-layout element masking, (2) preliminary segmentation via pre-cut (①), and (3) recursive density-driven partitioning. The enhanced XY-Cut algorithm adapts its splitting axis selection through real-time density evaluation (τd), prioritizing horizontal splits for content-dense regions and vertical splits otherwise (②③).
Fig. 7 Cross-Modal Matching: (1) semantic hierarchy-aware stage decomposition and (2) adaptive distance metric matching with dynamic policies and semantic-specific tuning. Subsequently, cells are reordered based on Index, Label Priority, Y1, and X1.
Phase 2: Geometric Pre-Segmentation This phase aims to identify central content elements, includ-
ing body titles (e.g., chapter and section titles), and isolated graphical components (e.g., figures and tables).
These elements are leveraged to perform coarse-grained partitioning of the original page into multiple
non-overlapping sub-regions, denoted as R. Each R is subsequently subjected to density-driven adaptive
refinement in Phase 3 to achieve fine-grained segmentation. Such elements are classified based on geometric
features, formalized by the judgment function $P(B_i)$:

$$P(B_i) = \mathbb{I}\left(\frac{\lVert c_i - c_{\mathrm{page}} \rVert_2}{d_{\mathrm{page}}} \leq 0.2 \ \wedge\ \phi_{\mathrm{text}}(B_i) = \infty\right) \tag{3}$$
(3)
Where:
• $c_i = (c_{x,i}, c_{y,i})$: Center of $B_i$, computed as $c_{x,i} = \frac{x_{1,i} + x_{2,i}}{2}$, $c_{y,i} = \frac{y_{1,i} + y_{2,i}}{2}$;
• $c_{\mathrm{page}}$: Page center coordinate;
• $\lVert c_i - c_{\mathrm{page}} \rVert_2$: Euclidean distance between $B_i$ and the page center;
• $d_{\mathrm{page}}$: The normalization factor $d_{\mathrm{page}}$ is determined by the block type:

$$d_{\mathrm{page}} = \begin{cases} W_{\mathrm{page}}, & \text{if } r_i < 3, \\ H_{\mathrm{page}}, & \text{if } r_i \geq 3, \end{cases} \qquad \text{where } r_i = \frac{w_i}{h_i} = \frac{x_{2,i} - x_{1,i}}{y_{2,i} - y_{1,i}},$$

and $W_{\mathrm{page}}$, $H_{\mathrm{page}}$ are the page width and height, respectively;
• $\mathbb{I}(\cdot)$: Indicator function (1 if the condition holds, 0 otherwise);
• $\phi_{\mathrm{text}}(B_i)$: Minimum Euclidean distance from $B_i$ to the nearest text block; $\phi_{\mathrm{text}}(B_i) = \infty$ means $B_i$ is isolated (no adjacent text);
• Target elements $B_i$: Body titles (chapter/section titles) and visual components (e.g., Figure, Table), all with high layout flexibility.
Elements with $P(B_i) = 1$ are marked as isolated and added to the mask set. Their coordinates are used to
split the page into non-overlapping sub-regions R, laying the groundwork for recursive refinement.
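The predicate of Eq. (3) might be sketched as follows, assuming $\phi_{\mathrm{text}}$ (the distance to the nearest text block) has been precomputed upstream:

```python
import math

def is_isolated_central(box, page_w, page_h, nearest_text_dist):
    """Sketch of Eq. (3): mark a block when its center lies within 20% of
    the page center (normalized by page width or height depending on the
    aspect ratio r_i) and it has no adjacent text (phi_text = infinity).
    `nearest_text_dist` is assumed precomputed elsewhere."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    r = (x2 - x1) / (y2 - y1)                 # aspect ratio r_i = w_i / h_i
    d_page = page_w if r < 3 else page_h      # normalization factor
    dist = math.hypot(cx - page_w / 2, cy - page_h / 2)
    return dist / d_page <= 0.2 and math.isinf(nearest_text_dist)

# A centered figure with no adjacent text is masked; the same figure with
# nearby text (finite phi_text) is kept for normal processing.
page_w, page_h = 1000, 1400
fig = (300, 600, 700, 800)
masked_flag = is_isolated_central(fig, page_w, page_h, math.inf)
kept_flag = is_isolated_central(fig, page_w, page_h, 12.0)
```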
Phase 3: Density-Driven Refinement To achieve fine-grained segmentation, we apply the adaptive
XY-Cut algorithm to each coarse-grained sub-region R from Phase 2. This algorithm dynamically selects
the splitting axis (horizontal/vertical) based on regional content density of the current sub-region R. The
density metric τd quantifies the ratio of cross-layout element area to single-layout element area within R,
calculated as:
$$\tau_d = \frac{\sum_{k=1}^{N_c} w_k^{(C_c)} h_k^{(C_c)}}{\sum_{k=1}^{N_s} w_k^{(C_s)} h_k^{(C_s)}} \tag{4}$$

where $C_c$ and $C_s$ are the cross-layout and single-layout element sets in R; $N_c$, $N_s$ are their counts; and $w_k^{(C_c)} h_k^{(C_c)}$
and $w_k^{(C_s)} h_k^{(C_s)}$ represent the area of the k-th element in each set, respectively.
A higher τd indicates a denser distribution of cross-layout elements (typically horizontal content). We set
a density threshold $\theta_v = 0.9$ to determine the splitting direction, as defined by the strategy function $S(R)$:

$$S(R) = \begin{cases} \text{XY-Cut}, & \tau_d > \theta_v \\ \text{YX-Cut}, & \text{otherwise} \end{cases} \tag{5}$$

Here, XY-Cut refers to prioritizing horizontal splitting (along the Y-axis): we compute the Y-axis projection
histogram of region R, find the position $y_{\mathrm{cut}}$ with the minimum projection value (i.e., the “gap” between
content blocks), and split R into upper and lower sub-regions. YX-Cut prioritizes vertical splitting (along
the X-axis) using the same logic on the X-axis projection histogram.
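The direction choice of Eqs. (4)–(5) might be sketched as follows; boxes are plain $(x_1, y_1, x_2, y_2)$ tuples, which is an assumed representation:

```python
THETA_V = 0.9  # density threshold from Eq. (5)

def density(cross_elems, single_elems):
    """tau_d from Eq. (4): total cross-layout area over total
    single-layout area within the region."""
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return sum(area(b) for b in cross_elems) / sum(area(b) for b in single_elems)

def choose_split(cross_elems, single_elems):
    """Strategy function S(R) from Eq. (5): horizontal-first (XY-Cut) for
    cross-layout-dense regions, vertical-first (YX-Cut) otherwise."""
    return "XY-Cut" if density(cross_elems, single_elems) > THETA_V else "YX-Cut"

# A region dominated by wide, cross-column strips is split horizontally first.
cross = [(0, 0, 400, 50), (0, 60, 400, 110)]   # two full-width strips
single = [(0, 120, 180, 200)]                  # one column-sized block
direction = choose_split(cross, single)
```

In this toy region the cross-layout area dominates ($\tau_d \approx 2.8 > 0.9$), so XY-Cut is selected.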
The recursive splitting process terminates when each sub-region contains only one non-masked content
block. Each final atomic region Ri is represented as a 7-tuple, encapsulating spatial coordinates, content
type, semantic label, and positional index:
$$R_i = \left\langle x_1^{(i)}, y_1^{(i)}, x_2^{(i)}, y_2^{(i)}, C_i, \mathrm{Label}_i, \mathrm{Index}_i \right\rangle \tag{6}$$

where $(x_1^{(i)}, y_1^{(i)})$ and $(x_2^{(i)}, y_2^{(i)})$ denote the top-left and bottom-right coordinates of $R_i$; $C_{\mathrm{type}} =$
$\{\text{cross-layout}, \text{single-layout}\}$, with $C_i \in C_{\mathrm{type}}$ indicating the content type of $R_i$; $\mathrm{Label}_i$ represents the semantic
label (e.g., title, figure, paragraph) of $R_i$; and $\mathrm{Index}_i$ denotes the positional index of $R_i$ in the document
layout. This 7-tuple representation provides comprehensive information to support subsequent cross-modal
matching.
4.3 Cross-Modal Matching
To establish reading coherence consistent with human habits, we propose a geometry-aware cross-modal
alignment framework, consisting of multi-stage semantic filtering and adaptive distance metric modules
(Figure 7). This framework fuses semantic priority and geometric constraints to realize accurate ordering of
atomic regions.
Multi-Stage Semantic Filtering: This module realizes the restoration of masked elements and opti-
mizes the ordering sequence by leveraging label priority, ensuring high-semantic elements (e.g., cross-layout
text, titles) are prioritized in the matching and restoration process. We first define core sets and a global
label priority sequence, then formalize the matching logic to sequentially restore masked elements into the
ordered sequence.
The multi-stage semantic filtering module restores masked elements by leveraging a global label priority
sequence $L_{\mathrm{order}}$. The process operates on three core sets: the pre-ordered atomic region set S, the masked
element set M, and the dynamically updated target sequence T (initialized as T = S). The mathematical
formulation is:

$$L_{\mathrm{order}}: L_{\text{cross-layout}} \succ L_{\text{title}} \succ L_{\text{vision}} \succ L_{\text{others}}$$

$$F(B_p, T, l_{\mathrm{current}}) = \begin{cases} 1, & \exists B_o \in T,\ L_{\mathrm{order}}(l(B_o)) \succ L_{\mathrm{order}}(l_{\mathrm{current}}) \\ 0, & \text{otherwise} \end{cases}$$

$$M_{\mathrm{sorted}}^{(l_{\mathrm{current}})} = \{(B_p, B_{\mathrm{best}}) \mid B_p \in M,\ l(B_p) = l_{\mathrm{current}},\ F = 1,\ B_{\mathrm{best}} \in T\}$$

$$T = T \cup M_{\mathrm{sorted}}^{(l_{\mathrm{current}})}, \quad \forall\, l_{\mathrm{current}} \in L_{\mathrm{order}} \tag{7}$$

Here, the binary matching function F verifies for a pending element $B_p$ (with label $l_{\mathrm{current}}$) whether a higher-priority
candidate exists in T. For each $B_p$ that satisfies F = 1, the specific optimal anchor $B_{\mathrm{best}} \in T$ is
determined by minimizing a joint geometric distance (detailed in the following Adaptive Distance Metric).
The set $M_{\mathrm{sorted}}^{(l_{\mathrm{current}})}$ then contains all such matched pairs $(B_p, B_{\mathrm{best}})$ for the current label. These pairs are
subsequently integrated into T, and the process iterates through $L_{\mathrm{order}}$, ensuring high-priority elements are
restored first.
This multi-stage semantic filtering strategy effectively eliminates irrational matching pairs (e.g., visual
elements paired with footnotes) and reinforces the semantic coherence of the final ordered sequence, laying
a solid foundation for downstream cross-modal alignment.
Adaptive Distance Metric: We design a joint geometric distance metric with early termination to find
the optimal matching position for each $B_p$ in $M_{\mathrm{sem}}$. Given a pending layout element $B_p = (x_1, y_1, x_2, y_2)$
and an ordered candidate $B_o = (x'_1, y'_1, x'_2, y'_2)$, the distance $D(B_p, B_o, l)$ (parameterized by $B_p$'s label $l$) is
a weighted sum of four geometric constraints $\phi_k$:

$$D(B_p, B_o, l) = \sum_{k=1}^{4} w_k \cdot \phi_k(B_p, B_o), \tag{8}$$
where wk are adaptive weights (detailed later), and ϕ1 ∼ ϕ4 encode different geometric constraints to
capture spatial relationships:
• Intersection Constraints (ϕ1 ): Filters invalid pairs by layout direction and overlap. The overlap
threshold is set to τoverlap = 0.3.
$$\phi_1 = \begin{cases} 1, & \text{if } \mathrm{direction}(B_p) \neq \mathrm{direction}(B_o) \ \vee\ \mathrm{IoU}_{\mathrm{projection}} < \tau_{\mathrm{overlap}} \\ 0, & \text{otherwise} \end{cases} \tag{9}$$
• Boundary Proximity (ϕ2 ): Measures spatial adjacency. Weighted by wedge , it uses center distances,
prioritizing axis-aligned adjacency.
$$\phi_2 = w_{\mathrm{edge}} \times \begin{cases} d_x + d_y, & \text{(diagonal adjacency)} \\ \min(d_x, d_y), & \text{(axis-aligned)} \end{cases} \tag{10}$$
• Vertical Continuity (ϕ3 ): Optimizes vertical ordering.
$$\phi_3 = \begin{cases} -y'_2, & l \in L_{\text{cross-layout}} \wedge y_1 > y'_2 \quad \text{(baseline alignment)} \\ y'_1, & \text{otherwise} \end{cases} \tag{11}$$
• Horizontal Ordering (ϕ4): Follows left-to-right reading logic via $B_o$'s left boundary:

$$\phi_4 = x'_1 \tag{12}$$
We introduce two parameterization strategies to optimize the metric’s adaptability:
Dynamic Weight Adaptation. To ensure the four geometric constraint distances do not interfere
with each other, we design scale-sensitive weights with staggered scaling levels. The weight vector $w_k =$
$[w_1, w_2, w_3, w_4]$ is formulated as:

$$w_k = \left[\max(h, w)^2,\ \max(h, w),\ 1,\ \max(h, w)^{-1}\right] \tag{13}$$

where $h$ and $w$ denote the page dimensions; this staggering establishes a distance metric with clear priorities.
Semantic-Specific Tuning. Optimal weights from grid search on 2.8k documents:

$$w_{\mathrm{edge}} = \begin{cases} [1, 0.1, 0.1, 1], & l \in L_{\text{title}} \cap O_{\text{horizontal}} \\ [0.2, 0.1, 1, 1], & l \in L_{\text{title}} \cap O_{\text{vertical}} \\ [1, 1, 0.1, 1], & l \in L_{\text{cross-layout}} \\ [1, 1, 1, 0.1], & l \in L_{\text{otherwise}} \end{cases} \tag{14}$$
where Ohorizontal and Overtical denote the horizontal layout set and vertical layout set, respectively; l is the
semantic label of the element. This weight design assigns differentiated weights based on label characteristics
and layout orientations, further weakening mutual interference between constraints and improving matching
stability across various scenarios.
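A minimal sketch of Eqs. (8) and (13), including the early-termination trick used later in Algorithm 1; the $\phi$ values are passed in as plain numbers here, which is a simplification:

```python
def dynamic_weights(h, w):
    """Scale-staggered weight vector from Eq. (13)."""
    m = max(h, w)
    return [m ** 2, m, 1, m ** -1]

def joint_distance(phis, weights, d_min=float("inf")):
    """Weighted sum of the four constraint distances (Eq. 8). Accumulation
    stops early once the running sum exceeds the best distance so far."""
    d = 0.0
    for w_k, phi_k in zip(weights, phis):
        d += w_k * phi_k
        if d > d_min:              # early termination, as in Algorithm 1
            return float("inf")
    return d

h, w = 1400, 1000
wk = dynamic_weights(h, w)
# Candidate A violates the intersection constraint (phi_1 = 1); candidate B
# does not, so B wins despite larger lower-priority terms: the staggered
# scales guarantee phi_1 dominates phi_2, which dominates phi_3, and so on.
d_a = joint_distance([1, 5, 5, 5], wk)
d_b = joint_distance([0, 50, 50, 50], wk)
```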
Statistical validation demonstrates significant improvements: +2.3 BLEU-4 over uniform weighting baselines. The
overall process is summarized in Algorithm 1.
Fig. 8 Visualization of Complex Page Dc from DocBench-100 Dataset Using Different Layout Analysis Methods.
(a) Input Image, (b) XY-Cut [1], classic projection-based segmentation, (c) MinerU [20], an end-to-end Document Content
Extraction tool, (d) XY-Cut++, our proposed method.
Fig. 9 Visualization of Regular Page Dr from DocBench-100 Dataset Using Different Layout Analysis Methods.
(a) Input Image, (b) XY-Cut [1], a classic projection-based segmentation, (c) MinerU [20], an end-to-end Document Content
Extraction tool, (d) XY-Cut++, our proposed method.
Table 2 Progressive Component Analysis on DocBench-100. Metric Key: BLEU-4 ↑ / ARD ↓ / Tau ↑.

Method    | Dc: BLEU-4 / ARD / Tau | Dr: BLEU-4 / ARD / Tau | µ: BLEU-4 / ARD / Tau
XY-Cut    | 0.749 / 0.233 / 0.878  | 0.819 / 0.098 / 0.912  | 0.797 / 0.139 / 0.902
+Pre-Mask | 0.818 / 0.196 / 0.887  | 0.823 / 0.087 / 0.920  | 0.822 / 0.120 / 0.910
+MGS      | 0.946 / 0.164 / 0.969  | 0.969 / 0.036 / 0.985  | 0.962 / 0.074 / 0.980
+CMM      | 0.986 / 0.023 / 0.995  | 0.989 / 0.003 / 0.997  | 0.988 / 0.009 / 0.996
5 Experiments
In this section, we describe the experimental setup for evaluating our proposed method on the DocBench-
100 benchmark. We compare our approach with several state-of-the-art baselines and demonstrate its
effectiveness through both quantitative and qualitative analyses.
5.1 Dataset
We evaluate on DocBench-100; detailed sources, fields, annotation pipeline, screening criteria, statistics,
and usage protocols are described in Section 3. For fairness, we report both end-to-end and JSON-based
protocols, and clarify FPS timing scope alongside Table 6.
5.2 Setup
Baselines
We compare our method with the following state-of-the-art approaches:
• XY-Cut [1]: A classic method for projection-based segmentation.
Algorithm 1 Geometry-Aware Cross-Modal Matching with Semantic Filtering
Require: Sorted anchor set S, masked element set M, distance weights w1, w2, w3, w4
Ensure: Matching result set T
 1: Initialize global label priority: Lorder : Lcross-layout ≻ Ltitle ≻ Lvision ≻ Lother
 2: T ← {(Bo, Bo) | Bo ∈ S}                        ▷ Initialize with self-pairs of anchors
 3: for each semantic label l in Lorder (descending priority) do
 4:   for each pending box Bp ∈ M with label l do  ▷ Check F(Bp, T, l)
 5:     Dmin ← ∞, Bbest ← null
 6:     for each ordered box Bo in T do            ▷ Search candidates in T for F
 7:       Dcurr ← 0
 8:       for k ← 1 to 4 do
 9:         Dcurr ← Dcurr + wk · ϕk(Bp, Bo)
10:         if Dcurr > Dmin then
11:           break                                ▷ Early termination
12:         end if
13:       end for
14:       if Dcurr < Dmin then
15:         Dmin ← Dcurr, Bbest ← Bo
16:       end if
17:     end for
18:     if Bbest ≠ null then                       ▷ Valid match found for Bp
19:       T ← T ∪ {(Bp, Bbest)}                    ▷ M(l)sorted ← M(l)sorted ∪ {(Bp, Bbest)}
20:     end if
21:   end for
22:   Sort T by: Label priority of Bp (desc) ≻ Index of Bbest (asc) ≻ y1 of Bp (asc) ≻ x1 of Bp (asc)
23: end for
24: return T
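For concreteness, the core loop of Algorithm 1 might be rendered in Python roughly as follows. The dict-based box format and the stand-in ϕ callables are illustrative assumptions; for brevity, newly matched pairs are not fed back into the candidate set and the per-label re-sort of line 22 is omitted:

```python
import math

# Priority order from line 1 of Algorithm 1.
PRIORITY = {"cross-layout": 3, "title": 2, "vision": 1, "other": 0}

def match(anchors, masked, weights, phis):
    """Greedy nearest-anchor matching with early termination (lines 3-21)."""
    pairs = []
    for label in sorted(PRIORITY, key=PRIORITY.get, reverse=True):
        for bp in (b for b in masked if b["label"] == label):
            d_min, best = math.inf, None
            for bo in anchors:
                d = 0.0
                for w_k, phi_k in zip(weights, phis):
                    d += w_k * phi_k(bp, bo)
                    if d > d_min:      # early termination (lines 10-12)
                        break
                if d < d_min:
                    d_min, best = d, bo
            if best is not None:
                pairs.append((bp, best))
    return pairs

anchors = [{"label": "other", "bbox": (0, 0, 100, 50), "index": 0},
           {"label": "other", "bbox": (0, 400, 100, 450), "index": 1}]
masked = [{"label": "title", "bbox": (0, 380, 100, 395)}]
# Stand-in constraints: only the first phi is non-trivial here
# (vertical center distance); the rest contribute zero.
center_dist = lambda a, b: abs((a["bbox"][1] + a["bbox"][3]) / 2
                               - (b["bbox"][1] + b["bbox"][3]) / 2)
zero = lambda a, b: 0.0
pairs = match(anchors, masked, [1, 1, 1, 1], [center_dist, zero, zero, zero])
```

In this toy case the title sits just above the second anchor, so it is attached to the anchor with index 1.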
Table 3 Ablation Study of Pre-Mask, Multi-Granularity Segmentation (MGS), and Cross-Modal Matching (CMM) on DocBench-100. Metric Key: BLEU-4 ↑. Component columns: Mask, Mask Cross-Layout, Pre-Cut, Adaptive Scheme, ϕ1, ϕ2, ϕ3, ϕ4, Dynamic Weights, Multi-Stage. BLEU-4 rises from 0.797 (Baseline) to 0.822 (+Pre-Mask), reaches 0.962 with the full MGS configuration, and peaks at 0.988 with all CMM components enabled; intermediate component combinations score 0.905, 0.914, 0.923, 0.963, 0.765, 0.858, 0.985, 0.881, and 0.694.
• LayoutReader [2, 10]: A LayoutLMv3-based model fine-tuned on 500k samples.
• MinerU [20]: An end-to-end document content extraction tool.
Evaluation Metrics
We evaluate the performance using the following metrics:
• BLEU-4 [21] (↑): Measures the similarity between candidate and reference texts using up to 4-gram
overlap.
• ARD (↓): Absolute Relative Difference, quantifies ordering accuracy by comparing predicted and reference
reading-order positions.
• Tau (↑): Kendall’s Tau, measures the rank correlation between two sets of data.
• FPS (↑): Frames Per Second; here, the number of document pages a system processes per second.
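For concreteness, ARD and Kendall's Tau can be computed directly from the predicted and reference block-ID sequences, as sketched below. The exact ARD normalization used in the paper is an assumption here (mean absolute rank displacement, normalized by sequence length):

```python
def ard(pred, ref):
    """Absolute Relative Difference between two orderings of the same blocks.
    Assumed form: mean absolute rank displacement, normalized by length."""
    pos = {b: i for i, b in enumerate(ref)}
    n = len(ref)
    return sum(abs(i - pos[b]) for i, b in enumerate(pred)) / (n * n)

def kendall_tau(pred, ref):
    """Kendall's Tau: (concordant - discordant) pairs over total pairs."""
    pos = {b: i for i, b in enumerate(ref)}
    ranks = [pos[b] for b in pred]
    n = len(ranks)
    concordant = sum(
        1 for i in range(n) for j in range(i + 1, n) if ranks[i] < ranks[j]
    )
    total = n * (n - 1) // 2
    return (2 * concordant - total) / total  # discordant = total - concordant
```

An identical ordering gives ARD 0 and Tau 1; a fully reversed ordering gives Tau −1.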
Block-level BLEU (detection order). We evaluate the ordering of detection boxes (block IDs), not tex-
tual content. Following the original BLEU definition [21], we compute n-gram precisions on block identifiers
Table 4 Reading Order Recovery Performance: BLEU-4 ↑ Results on DocBench-100. The best results are in bold.

Method                  Dc       Dr       µ
XY-Cut [1]              0.749    0.818    0.797
LayoutReader [2, 10]    0.656    0.844    0.788
MinerU [20]             0.701    0.946    0.873
XY-Cut++ (ours)         0.986    0.989    0.988
Table 5 Reading Order Recovery Performance on Textual Content of OmniDocBench (Excluding Figures/Tables with
Insignificant Layout Impact). Metric Key: BLEU-4 ↑ / ARD ↓ / Tau ↑. Best results are in bold; second-best are underlined.
(Each cell: BLEU-4 / ARD / Tau.)

Method                  Single               Double               Three                Complex              Mean
XY-Cut [1]              0.895/0.042/0.931    0.695/0.230/0.794    0.702/0.090/0.923    0.717/0.120/0.866    0.753/0.118/0.878
LayoutReader [2, 10]    0.988/0.004/0.995    0.831/0.084/0.918    0.595/0.208/0.805    0.716/0.116/0.864    0.783/0.099/0.906
MinerU [20]             0.961/0.025/0.969    0.933/0.037/0.971    0.923/0.042/0.965    0.887/0.050/0.932    0.926/0.039/0.959
XY-Cut++ (ours)         0.993/0.004/0.996    0.951/0.027/0.974    0.967/0.033/0.984    0.901/0.064/0.942    0.953/0.037/0.972
Table 6 Model Efficiency and Semantic Information Usage on DocBench-100 and OmniDocBench. Key Metric: FPS
(Total Pages/Total Time) ↑. FPS values are averaged over 10 runs on an Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz with
256GB memory. Best results are in bold; second-best are underlined.

Method                  Semantic Info    DocBench-100    OmniDocBench    Mean
XY-Cut [1]              ✗                685             289             487
LayoutReader [2, 10]    ✗                17              27              22
MinerU [20]             ✗                10              12              11
XY-Cut++ (ours)         ✓                781             248             514
and report BLEU-4 with brevity penalty:
BLEU-4(ŝ, s) = BP · exp( (1/4) Σ_{n=1}^{4} log p_n ),    (15)

where p_n is the precision of block-level n-grams and BP = exp(1 − r/c) if c ≤ r, and BP = 1 if c > r, with c the
hypothesis length and r the reference length.
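A minimal implementation of Eq. (15) on block-ID sequences (hypothesis ŝ versus reference s) might look like:

```python
import math
from collections import Counter

def block_bleu4(pred_ids, ref_ids):
    """Block-level BLEU-4 (Eq. 15): geometric mean of 1..4-gram precisions
    over block identifiers, scaled by the brevity penalty BP."""
    log_p_sum = 0.0
    for n in range(1, 5):
        pred_ngrams = Counter(
            tuple(pred_ids[i:i + n]) for i in range(len(pred_ids) - n + 1)
        )
        ref_ngrams = Counter(
            tuple(ref_ids[i:i + n]) for i in range(len(ref_ids) - n + 1)
        )
        overlap = sum((pred_ngrams & ref_ngrams).values())  # clipped counts
        if overlap == 0:
            return 0.0  # any zero n-gram precision zeroes the geometric mean
        log_p_sum += 0.25 * math.log(overlap / sum(pred_ngrams.values()))
    c, r = len(pred_ids), len(ref_ids)  # hypothesis / reference lengths
    bp = math.exp(1 - r / c) if c <= r else 1.0
    return bp * math.exp(log_p_sum)
```

A perfectly ordered page scores 1.0, while a single swapped pair of adjacent blocks already breaks the higher-order n-grams and sharply lowers the score, which is why BLEU-4 is sensitive to local ordering errors.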
FPS measurement scope. Table 6 reports FPS for the ordering/sorting module only; upstream detection
(e.g., PP-DocLayout) and downstream OCR/LM are excluded. Results are averaged over 10 runs on Intel(R)
Xeon(R) Gold 6326 CPU @ 2.90GHz, 256GB RAM.
5.3 Main Results
We evaluate our method on DocBench-100, analyzing component contributions and comparing against
state-of-the-art baselines. All metrics are computed over the union of the Dc and Dr subsets unless otherwise specified.
5.3.1 Progressive Component Ablation
Table 2 demonstrates the cumulative impact of each technical component:
• XY-Cut Baseline: Achieves 0.749 BLEU-4 on complex layouts (Dc ), showing limitations in handling
complex layout elements (e.g., L-shaped regions).
• +Pre-Mask: Improves BLEU-4 by 6.9 points (from 0.749 to 0.818) on Dc via adaptive thresholding
(Eq. (1), β = 1.3), reducing false splits by 15.9%.
• +MGS: Delivers a 19.7 absolute BLEU-4 gain on Dc through three-phase segmentation, with density-
aware splitting (Eq. (5)) reducing ARD by 29.7%.
• +CMM: Achieves near-perfect alignment (0.995 τ ) on Dc through geometric constraints (Eq. (9)–(12)),
finalizing a 90.1% ARD reduction from baseline.
The complete model reduces ARD by 93.5% compared to baseline (0.139 → 0.009), demonstrating
superior ranking consistency. Notably, our method maintains balanced performance across both subsets
(0.988 µ-BLEU), proving effective for diverse layout types.
5.3.2 Architectural Analysis
We perform a systematic dissection of core components through controlled ablations to evaluate their
contributions:
Pre-Mask Processing (Pre-Mask): To alleviate the "L-shaped" problem, we apply preliminary
masking to highly dynamic elements such as titles, tables, and figures (see Methods Section 4.1). This
reduces visual noise and improves reading order recovery, yielding a 2.5-point BLEU-4 gain, as shown
in Table 3.
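In code terms, pre-masking amounts to a label-driven partition of the detected blocks. The label set below is a hypothetical stand-in for the detector's actual categories:

```python
# Hypothetical masked-label set; the paper masks "highly dynamic" elements
# (titles, tables, figures) before running the projection-based cut.
MASKED_LABELS = {"title", "table", "figure"}

def pre_mask(blocks):
    """Partition detected blocks into a stable backbone (to be ordered by
    XY-Cut) and masked elements (re-attached later by cross-modal matching).
    blocks: list of dicts with at least a "label" key."""
    backbone = [b for b in blocks if b["label"] not in MASKED_LABELS]
    masked = [b for b in blocks if b["label"] in MASKED_LABELS]
    return backbone, masked
```

The backbone is what the projection cut actually sees, which is why removing the dominant elements suppresses L-shape artifacts.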
Multi-Granularity Segmentation (MGS): Masking cross-layout elements is a direct remedy for
XY-Cut's inability to segment L-shaped inputs. Pre-Cut performs a preliminary sub-page division
through page analysis, preventing mixed page content from corrupting the ordering. The adaptive splitting
strategy chooses reasonable cuts through real-time density estimation. Table 3 shows additive benefits
of the mask cross-layout (+8.3), Pre-Cut (+9.2), and the adaptive splitting scheme (+10.1) over the baseline
(0.822 BLEU-4).
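As a rough illustration of density-driven axis selection (the actual criterion in Eqs. (4)–(5) differs in detail), one can compare the widest whitespace gap in each axis's projection profile and cut along the axis with more separable structure:

```python
def widest_gap(intervals):
    """Largest whitespace gap between sorted 1-D occupancy intervals."""
    intervals = sorted(intervals)
    gap, reach = 0, intervals[0][1]
    for lo, hi in intervals[1:]:
        gap = max(gap, lo - reach)  # negative when intervals overlap
        reach = max(reach, hi)
    return gap

def choose_split_axis(boxes):
    """Pick the projection axis with the wider whitespace gap as the cut
    direction. boxes: (x1, y1, x2, y2) tuples. Returns 'x' or 'y'."""
    gx = widest_gap([(x1, x2) for x1, _, x2, _ in boxes])
    gy = widest_gap([(y1, y2) for _, y1, _, y2 in boxes])
    return "x" if gx > gy else "y"
```

On a two-column page the x-projection shows a wide inter-column gap while the y-projection is nearly solid, so the cut is made vertically; dense newspaper layouts benefit from exactly this kind of per-region decision.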
Cross-Modal Matching (CMM): As shown in Table 3, the single-stage strategy performs comparably
to the baseline approach, suggesting that a detection model equipped with text, title, and annotation
labels is sufficient to achieve nearly consistent performance across scenarios. Notably, on the
OmniDocBench dataset, which has few label categories, our method still achieves state-of-the-art
results. Among the four distance metrics examined, the edge-weighted margin distance plays the most
crucial role, highlighting the significance of dynamic weights driven by shallow semantics. Overall,
cross-modal matching yields a 2.7-point BLEU-4 improvement.
5.3.3 Benchmark Comparison
As shown in Table 4, our approach establishes new benchmarks across all evaluation dimensions. Specifically,
it outperforms XY-Cut by a significant margin, achieving a +23.7 absolute improvement on Dc (from 74.9
to 98.6). Additionally, it surpasses LayoutReader by +5.3 on Dr (from 94.6 to 98.9), despite not using any
learning-based components. Furthermore, our method achieves a Kendall’s τ of 0.996 overall, indicating
near-perfect ordinal consistency (with p < 0.001 in the Wilcoxon signed-rank test). Visual results presented
in Figures 8 and 9 further demonstrate the robustness in handling multi-column layouts and cross-page
elements, where previous methods frequently fail.
To further validate the versatility and robustness of our proposed method, we conducted extensive eval-
uations on the OmniDocBench dataset [3], which features a diverse and challenging set of document images.
As shown in Table 5, our proposed method (XY-Cut++) achieves state-of-the-art (SOTA) performance
across almost all layout types, despite challenging subpage nesting patterns (see Limitations). Notably, as
shown in Table 6, XY-Cut++ achieves a superior balance between performance and speed, attaining an
average FPS of 514 across DocBench-100 and OmniDocBench. This performance surpasses even the direct
projection-based XY-Cut algorithm, which achieves an average FPS of 487. The significant speed improve-
ment of XY-Cut++ is primarily attributed to semantic filtering, which minimizes redundant processing by
handling each block only once. In contrast, XY-Cut requires repeatedly partitioning blocks into different
subsets, resulting in increased recursive depth and computational overhead. This optimization enhances com-
putational efficiency without losing performance, making XY-Cut++ more robust and versatile for diverse
document layouts.
6 Discussion
XY-Cut++ closes the gap between classical projection-based methods and neural models for block-level
reading-order recovery through a simple three-stage pipeline. Directional ordering (left-to-right, top-to-
bottom) is largely solved; the remaining pain points are fine-grained semantic segmentation (e.g., ambiguous
block boundaries and caption linkage; see Figure A8) and sub-page structures. Across DocBench-100 and
OmniDocBench, XY-Cut++ delivers large, consistent gains (Table 2, Table 5) while retaining high through-
put on CPU (Table 6). On Dc , BLEU-4 rises from 0.749 to 0.986 and ARD falls from 0.233 to 0.023, with
visual results (Figures 8 and 9) confirming robustness on multi-column and L-shaped cases.
Mechanistically, the three components are complementary: Pre-Mask suppresses dominant elements to
expose a stable backbone; MGS uses an adaptive density τd (Eq. (4)) to choose the split axis (Eq. (5)); and
CMM applies four geometric constraints (Eqs. (9)–(12)) with scale-aware (Eq. (13)) and semantic-specific
weights (Eq. (14)). Ablations indicate edge-weighted margins are especially effective.
Practically, reliable block-level ordering benefits RAG/LLM preprocessing by improving chunking
and caption attachment, and XY-Cut++ is easy to deploy after a detector with coarse labels (e.g.,
PP-DocLayout). DocBench-100 helps standardize block-level evaluation and emphasizes difficult page
topologies.
Limitations primarily stem from label granularity for fine-grained segmentation and from nested “sub-
pages” (Figure A8); detector quality also matters. Heuristic hyperparameters (e.g., β in Eq. (1), θv in Eq. (5))
may require domain tuning. Future work targets: (i) sub-page detection with hierarchical reasoning to local-
ize ordering, (ii) learning the split policy and edge weights from weak supervision to improve within-block
segmentation, and (iii) coupling lightweight language priors (caption/title cues) with end-to-end RAG/LLM
evaluation.
7 Conclusion
We studied block-level reading-order recovery for real-world documents, a prerequisite for reliable RAG/LLM
pipelines. We proposed XY-Cut++, a hierarchical, geometry-aware framework that integrates pre-mask
processing, multi-granularity segmentation, and cross-modal matching. XY-Cut++ delivers precise block
ordering and achieves state-of-the-art performance on DocBench-100 and OmniDocBench, while maintaining
high throughput on CPU (cf. Tables 2, 6).
This accuracy–efficiency balance is achieved via two design choices: (i) shallow semantic labels used
as structural priors and (ii) a hierarchical mask mechanism that stabilizes XY-Cut through density-aware
splitting and edge-weighted matching. These components jointly improve accuracy without sacrificing
throughput, making XY-Cut++ a practical module for production-grade document parsing. Beyond the
algorithm, DocBench-100 offers a focused, block-level benchmark that emphasizes challenging page topolo-
gies and promotes standardized evaluation. Future work will extend XY-Cut++ with sub-page detection
and lightweight language priors to better handle nested and fine-grained structures (see Section 6).
Appendix A Additional Visual Results

A.1 Other Cases
Fig. A1 DocBench-100 (complex subset) — Example A. A challenging multi-column page with spanning titles and interleaved
elements. XY-Cut++ maintains a coherent block-level reading order across columns by combining pre-mask and density-aware
splitting.
Fig. A2 DocBench-100 (complex subset) — Example B. Nested regions and interleaved figures/captions. The mask-then-
remap strategy reduces L-shape artifacts and preserves local grouping for globally consistent ordering.
Fig. A3 DocBench-100 (complex subset) — Example C. L-shaped text around graphics. Pre-mask normalization avoids early
incorrect splits, and the final sequence aligns with GT ordering.
Fig. A4 DocBench-100 (complex subset) — Example D. Newspaper-like dense columns. The adaptive axis selection (horizontal
vs. vertical) driven by regional density yields stable column-wise ordering.
Fig. A5 DocBench-100 (regular subset) — Example A. A standard single/double-column page illustrating the method’s
stability on common business and academic layouts.
Fig. A6 OmniDocBench (complex) — Example A. A multi-column page with captions and side notes; geometry-aware
matching preserves inter-column flow and caption linkage, illustrating robustness on challenging document layouts.
Fig. A7 OmniDocBench (double) — Example B and ancient-text case. Despite right-to-left script conventions and sparse
annotations, the method maintains correct block-level order, evidencing robustness to atypical typography.
A.2 Failure Cases
Fig. A8 Challenging cases on OmniDocBench and DocBench-100. Failures stem from (1) insufficient fine-grained semantics
and (2) sub-page complexity, leading to sorting errors; these motivate sub-page detection and stronger semantic priors.
References
[1] Ha J, Haralick RM, Phillips IT (1995) Recursive X-Y cut using bounding boxes of connected components.
In: Proceedings of 3rd International Conference on Document Analysis and Recognition, IEEE, pp
952–955
[2] Wang Z, Xu Y, Cui L, et al (2021) Layoutreader: Pre-training of text and layout for reading order detec-
tion. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,
pp 4735–4744
[3] Ouyang L, Qu Y, Zhou H, et al (2024) Omnidocbench: Benchmarking diverse pdf document parsing
with comprehensive annotations. arXiv preprint arXiv:2412.07626
[4] Li M, Xu Y, Cui L, et al (2020) Docbank: A benchmark dataset for document layout analysis. In: Pro-
ceedings of the 28th International Conference on Computational Linguistics, International Committee
on Computational Linguistics
[5] Zhong X, Tang J, Yepes AJ (2019) Publaynet: largest dataset ever for document layout analysis. In:
2019 International conference on document analysis and recognition (ICDAR), IEEE, pp 1015–1022
[6] Meunier JL (2005) Optimized xy-cut for determining a page reading order. In: Eighth International
Conference on Document Analysis and Recognition (ICDAR’05), IEEE, pp 347–351
[7] Sutheebanjard P, Premchaiswadi W (2010) A modified recursive xy cut algorithm for solving block
ordering problems. In: 2010 2nd International Conference on Computer Engineering and Technology,
IEEE, pp V3–307
[8] Xu Y, Li M, Cui L, et al (2020) Layoutlm: Pre-training of text and layout for document image under-
standing. In: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery
& data mining, pp 1192–1200
[9] Xu Y, Xu Y, Lv T, et al (2021) Layoutlmv2: Multi-modal pre-training for visually-rich document
understanding. In: Proceedings of the 59th Annual Meeting of the Association for Computational
Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1:
Long Papers), Association for Computational Linguistics
[10] Huang Y, Lv T, Cui L, et al (2022) Layoutlmv3: Pre-training for document ai with unified text and
image masking. In: Proceedings of the 30th ACM International Conference on Multimedia, pp 4083–
4091
[11] Xu Y, Lv T, Cui L, et al (2021) Layoutxlm: Multimodal pre-training for multilingual visually-rich
document understanding. arXiv preprint arXiv:2104.08836
[12] Gu Z, Meng C, Wang K, et al (2022) Xylayoutlm: Towards layout-aware multimodal networks for
visually-rich document understanding. In: Proceedings of the IEEE/CVF conference on computer vision
and pattern recognition, pp 4583–4592
[13] Li M, Cui L, Huang S, et al (2020) Tablebank: Table benchmark for image-based table detection
and recognition. In: Proceedings of the Twelfth Language Resources and Evaluation Conference, pp
1918–1925
[14] Zhang J, Peng D, Liu C, et al (2024) Docres: a generalist model toward unifying document image
restoration tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp 15654–15664
[15] Yu F, Wang D, Shelhamer E, et al (2018) Deep layer aggregation. In: Proceedings of the IEEE conference
on computer vision and pattern recognition, pp 2403–2412
[16] Ma K, Shu Z, Bai X, et al (2018) Docunet: Document image unwarping via a stacked u-net. In:
Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4700–4709
[17] Xu Y, Li M, Cui L, et al (2021) Layoutlmv2: Multi-modal pre-training for visually-rich document
understanding. In: Proceedings of the 59th Annual Meeting of the Association for Computational
Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1:
Long Papers), pp 3452–3464
[18] Zhou X, Yao C, Wen H, et al (2017) East: An efficient and accurate scene text detector. In: Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2642–2651
[19] Sun T, Cui C, Du Y, et al (2025) Pp-doclayout: A unified document layout detection model to accelerate
large-scale data construction. arXiv preprint arXiv:2503.17213
[20] Wang B, Xu C, Zhao X, et al (2024) Mineru: An open-source solution for precise document content
extraction. arXiv preprint arXiv:2409.18839
[21] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine trans-
lation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics,
pp 311–318