Layout-Aware Parsing Meets Efficient LLMs: A Unified, Scalable
Framework for Resume Information Extraction and Evaluation
Fanwei Zhu ∗ (zhufw@hzcu.edu.cn), Hangzhou City University, Hangzhou, China
Jinke Yu † (yujinke.yjk@alibaba-inc.com), Alibaba Group, Hangzhou, China
Zulong Chen † (zulong.czl@alibaba-inc.com), Alibaba Group, Hangzhou, China
Ying Zhou ‡ (zhouying@zhejianglab.org), Zhejiang Lab, Hangzhou, China
Junhao Ji † (jijunhao.jjh@alibaba-inc.com), Alibaba Group, Hangzhou, China
Zhibo Yang § (yangzhibo450@gmail.com), Alibaba Cloud, Hangzhou, China
Yuxue Zhang † (yuxue.zyx@alibaba-inc.com), Alibaba Group, Hangzhou, China
Haoyuan Hu ¶ (haoyuan.huhy@antgroup.com), Ant Group, Hangzhou, China
Zhenghao Liu ∥ (liuzhenghao@mail.neu.edu.cn), Northeastern University, Shenyang, China
Abstract
Automated resume information extraction is critical for scaling
talent acquisition, yet real-world deployment faces three major
challenges: the extreme heterogeneity of resume layouts and con-
tent, the high cost and latency of large language models (LLMs), and
the lack of standardized datasets and evaluation tools. In this work,
we present a layout-aware and efficiency-optimized auto-extraction
and evaluation framework that addresses all three challenges. Our
system combines a fine-tuned layout parser to normalize diverse
document formats, an inference-efficient LLM extractor based on
parallel prompting and instruction tuning, and a robust two-stage
automated evaluation framework supported by new benchmark
datasets. Extensive experiments show that our framework signifi-
cantly outperforms strong baselines in both accuracy and efficiency.
In particular, we demonstrate that a fine-tuned compact 0.6B LLM
achieves top-tier accuracy yet significantly reduces inference la-
tency and computational cost. The system is fully deployed in Al-
ibaba’s intelligent HR platform, supporting real-time applications
across business units.
Keywords
Layout-Aware Parser, Parallel Prompt Routing, Automatic Resume
Analysis
1 Introduction
The ability to efficiently and accurately screen resumes has become
a critical part of the recruitment process in modern enterprises.
However, manual review is slow, costly, and prone to error, making
it impractical for industrial use. While automated resume analy-
sis offers a solution, existing methods often struggle to balance
accuracy, latency, and computational cost.
Early resume parsing systems, built on handcrafted rules or tradi-
tional statistical models [6, 22], offer fast processing but fail to gen-
eralize across the vast diversity of linguistic styles and visual layouts
found in real-world resumes. Conversely, modern LLMs [4, 9, 11]
provide the deep semantic understanding needed for robust extrac-
tion, but their high inference latency and computational expense are
often prohibitive for real-time, large-scale deployment. An effective
industrial system must therefore bridge this gap, delivering state-
of-the-art accuracy without compromising on production efficiency
and cost.
Challenges. Specifically, building a practical resume analysis system
at industrial scale requires addressing three key challenges:
• Layout and content heterogeneity: Resumes submitted by candi-
dates are highly diverse in both structure and content. Many
contain important information embedded in images or use com-
plex, multi-column formats that disrupt standard reading order,
making consistent parsing difficult.
• High cost and latency of LLMs: While LLMs offer strong extraction
capabilities, directly applying them to raw text leads to high
latency and token usage, which is costly and unsuitable for real-
time, large-scale applications.
• Lack of data and evaluation tools: Due to privacy concerns, high-
quality annotated resume datasets are rare. Evaluating extraction
quality, especially for list-style entities like work experience, is
hard to do manually at scale, calling for automated and reliable
evaluation frameworks.
In this work, we present a practical, layout-aware, and efficiency-
optimized framework for automatic resume extraction and evalua-
tion. Our system addresses the above challenges through the follow-
ing key components: First, we introduce a unified layout parsing
model that fuses PDF metadata with OCR content and employs
a fine-tuned resume layout parser to reconstruct a semantically
coherent reading order from diverse, often multi-column resume
layouts. Second, to enable efficient LLM extraction, we adopt a task
decomposition strategy with index-based pointer outputs, reducing
both token usage and response time. On top of this, we fine-tune a
compact 0.6B model using instruction supervision, enabling high
accuracy at low cost. Third, we develop a two-stage evaluation
framework using the Hungarian algorithm for entity alignment
and multi-strategy field matching. This enables robust, fine-grained
assessment without human involvement.
Conference’17, July 2017, Washington, DC, USA
Extensive experiments on a synthetic dataset with diverse lay-
outs and a complex real-world resume dataset demonstrate that
our system consistently outperforms state-of-the-art baselines in
accuracy and efficiency. Notably, our fine-tuned Qwen3-0.6B-SFT
model surpasses the accuracy of top-tier models like Claude-4 while
offering 3–4× faster inference. The complete system is deployed
within Alibaba Group’s intelligent HR platform, where it supports
real-time parsing with high throughput across multiple business
units.
Contributions. In summary, our key contributions are as follows:
• We propose a unified, layout-aware resume parsing framework
that robustly handles layout and content heterogeneity.
• We design an inference-efficient LLM extraction strategy and
fine-tune a compact model to achieve competitive accuracy at
low latency and cost.
• We introduce a robust two-stage automated evaluation protocol
for field-level performance measurement.
• We open-source the full pipeline, along with the datasets to pro-
mote future research and practical adoption. 1
2 Related work
Rich document understanding. Resume information extraction
is closely related to the broader field of visually-rich document
understanding. Traditional NLP models, which process text as a 1D
sequence, often fail on semi-structured documents like resumes,
invoices, and forms. Recent advances in this area have been driven
by multimodal models that jointly capture textual content and
layout structure. Xu et al. [20] introduced a pre-training model Lay-
outLM that integrates text content with bounding box information,
overcoming limitations of purely text-based NLP models on various
document understanding tasks. LayoutLMv2 [21] further enhanced
this approach by incorporating visual features from raw document
images alongside text and layout signals. LayoutLMv3 [10] unified
token and patch representations in a single Transformer, demon-
strating improved multimodal reasoning.
Recent efforts have also explored more efficient alternatives.
Wang et al. [19] proposed to integrate spatial structure into a lan-
guage model using disentangled attention, avoiding the use of a
heavy vision encoder. Zhang et al. [23] focused on training-free
prompting strategies for LLMs, using entity, layout, and document
similarities to construct more effective in-context examples.
Automatic Resume Analysis. The automated analysis of resumes
has a rich history, with research evolving from traditional rule-
based systems to modern neural models. Early approaches framed it
as a Named Entity Recognition (NER) problem and solved it with hi-
erarchical extraction pipelines. For example, Yu et al. [22] proposed
a two-pass cascaded hybrid model that first used a Hidden Markov
Model (HMM) to segment the resume into general sections (e.g.,
Education, Experience) and then applied specialized models (HMM
or SVM) within each block. Chen et al. [6] advanced the cascaded
approach by explicitly incorporating PDF-specific layout features
(e.g., font size, coordinates) into both their SVM-based block classifier
and their CRF-based field extractor. More recently, Zu et al. [24]
modernized this pipeline by replacing feature-engineered models
with neural networks. Their system first employs neural classifiers
for a sophisticated line-by-line segmentation of the resume into text
blocks, which are then processed by a BiLSTM-CNN-CRF model
for final entity extraction. Beyond direct extraction, research has
also explored more complex and application-oriented tasks. For
instance, Pawar et al. [16] moved beyond simple NER to tackle the
extraction of complex, N-ary, cross-sentence relations, using a joint
hierarchical neural model. Daryani et al. [7] built an end-to-end
candidate ranking system using an NLP parser followed by IR-based
resume-to-job matching. Ali et al. [2] focused on document-level
classification to sort resumes into job categories using SVMs.
1 The code and dataset will be released at Alibaba Open Source: https://github.com/ALIBABA upon completion of the internal approval process.
While prior methods have advanced the field, our work uniquely
targets the practical challenges of deploying resume extraction
systems in real-world hiring workflows. We explicitly address the
layout and content heterogeneity of resumes, the high latency and
cost of LLM inference at scale, and the inefficiency of manual eval-
uation—enabling accurate, scalable, and production-ready resume
analysis for industrial use.
3 Approach
Our layout-aware automatic resume information extraction system
follows a three-stage architecture as illustrated in Figure 1. First,
a Layout-Aware Parsing and Regeneration stage performs hybrid
content fusion (integrating PDF metadata and OCR) and uses a
fine-tuned layout regenerator to re-order text from complex, multi-
column layouts into a single, indexed sequence. Second, a Paral-
lelized, Instruction-Tuned LLM Extractor processes this normalized
text, using parallelized task decomposition and an index-based
pointer mechanism with our Qwen3-0.6B-SFT model to balance high
accuracy with low-latency inference. Finally, a Two-Stage Auto-
mated Evaluation framework ensures objective assessment by first
using the Hungarian algorithm for robust entity alignment and
then applying a multi-strategy matching logic for a fine-grained,
field-level comparison.
3.1 Layout-Aware Parsing and Regeneration
3.1.1 PDF Parser: Text Extraction & Fusion. Resume files submitted
by candidates vary widely in format (e.g., PDF, Word).
To support consistent downstream processing, we first convert all
files to PDF. However, PDF resumes still pose
challenges such as embedded images, non-standard fonts, and
custom encodings that hinder reliable text extraction. To address
this, we adopt a hybrid content extraction strategy that combines
metadata-based parsing with OCR.
Hybrid PDF Content Extraction. We extract available metadata
from the PDF, which typically includes structured text content
and associated positional metadata (e.g., bounding boxes) for each
text object. In parallel, we render each PDF page into an image
and identify candidate image regions by masking out known text
regions using metadata bounding boxes. Remaining unmasked areas
are treated as image regions and processed with a state-of-the-art
OCR engine. The OCR results are then spatially aligned back to
their corresponding coordinates within the document.
Figure 1: Overview of the layout-aware, LLM-powered resume extraction and evaluation pipeline. Panels: I. Layout-Aware Parsing and Regeneration; II. Parallelized Instruction-Tuned LLM Extractor; III. Two-Stage Evaluation.
Content Fusion. The final step in this stage is to fuse the textual
content obtained from the metadata-based PDF parsing with
OCR-derived image region text. The result is a single, layout-aware set
of content primitives. Each primitive is a tuple containing a text
string and its corresponding bounding box coordinates
$(\text{text}, x_{min}, y_{min}, x_{max}, y_{max})$. This fusion ensures that all textual information
from the resume is captured, creating a complete, but structurally
unordered collection of resume content that serves as the input for
our layout unification stage.
3.1.2 Layout Regeneration: Fine-tuned Layout Parser & Unification.
Having obtained a complete but unordered set of text primitives, the
next critical task is to reconstruct a semantically coherent reading
flow.
Our empirical analysis shows that approximately 20% of resumes
employ non-linear, multi-column layouts that break the standard
top-to-bottom, left-to-right reading flow. This makes content-level
parsing alone insufficient; layout understanding and unification
are essential. Conventional document layout analysis requires
fine-grained annotations, which are costly and impractical for
privacy-sensitive resume data. To address this, we propose a lightweight
layout reconstruction approach that segments resumes into logically
coherent blocks and reorders them hierarchically to produce
a unified reading sequence.
Layout Segments Identification. We frame the problem of identifying
layout regions (e.g., sidebars, main content columns) as
an object detection task with a simplified objective: to partition
a non-linear page into a set of smaller, linearly-structured layout
segments. A segment is considered linear if its internal content can
be correctly read by a simple top-to-bottom, left-to-right sort.
To achieve this, we fine-tune the YOLOv10 object detection
model [18] for the resume domain. We construct a resume-specific
segmentation dataset of around 500 resumes, annotating only the
bounding boxes of the major layout segments required to enforce
linearity within each box. This lightweight annotation strategy
avoids the burden of fine-grained labeling while effectively capturing
internally linear blocks.
Hierarchical Re-ordering and Indexed Linearization. After
segmenting each document into layout regions (also referred to as
big blocks), we impose a two-level hierarchical sorting strategy to
regenerate a uniform layout with consistent reading order:
• Inter-segment sorting: The detected layout segments are sorted
globally according to their visual positions (i.e., the coordinates
of their top-left corners), following a standard top-to-bottom, left-
to-right rule. This establishes the high-level reading flow, such
as processing the main content column before a right-aligned
sidebar if they start at the same vertical position.
• Intra-segment text block sorting: Within each layout segment,
we further sort the individual text primitives (i.e., text blocks)
that fall within its boundaries. This local sort also follows the
top-to-bottom, left-to-right principle, arranging the lines and
words into a coherent, readable sequence.
These sorted blocks are concatenated to form a single, linearized
text stream that conforms to human reading patterns. Notably,
when constructing this sequence, we assign a unique, sequential
index (i.e., a line number) to each line in the text block. This
indexing brings a significant efficiency gain, as we discuss in the
subsequent information extraction stage (Section 3.2).
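The two-level sort and the line indexing described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Box` and `Primitive` types, the rule that a primitive belongs to the segment containing its center, and the treatment of each primitive as one line are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass
class Primitive:
    text: str
    box: Box


def center_inside(inner: Box, outer: Box) -> bool:
    # Assumption: a primitive belongs to the segment that contains its center.
    cx = (inner.x_min + inner.x_max) / 2
    cy = (inner.y_min + inner.y_max) / 2
    return outer.x_min <= cx <= outer.x_max and outer.y_min <= cy <= outer.y_max


def reading_key(box: Box):
    # Top-to-bottom first, then left-to-right (y grows downward on the page).
    return (box.y_min, box.x_min)


def linearize(segments, primitives):
    """Return [(line_number, text), ...] in the regenerated reading order."""
    lines = []
    for seg in sorted(segments, key=reading_key):                    # inter-segment sort
        inside = [p for p in primitives if center_inside(p.box, seg)]
        for p in sorted(inside, key=lambda q: reading_key(q.box)):   # intra-segment sort
            lines.append(p.text)
    # Sequential 1-based line numbers, later used as pointers by the extractor.
    return list(enumerate(lines, start=1))
```

On a two-column page, a plain global sort by (y, x) would interleave the main column with the sidebar, whereas sorting segments first keeps each column contiguous. Primitives that fall outside every detected segment are dropped in this sketch; a real system would need a fallback for them.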
3.2 Parallelized Instruction-Tuned LLM Extractor
With layout unification completed, the document is converted into
an indexed, linear text sequence. The final step is to extract struc-
tured key-value information suitable for automated recruitment
systems. To achieve high accuracy and robustness, our framework
adopts an LLM-based extraction strategy that builds on the semantic
understanding and reasoning abilities of LLMs, with a particular
focus on efficiency optimizations.
3.2.1 Parallelized Extraction via Task Decomposition. Our primary
goal is to extract three categories of resume content essential for
downstream HR applications: Basic Information (e.g., name, con-
tact details, location), Work Experience (e.g., company, title, dates,
description), and Education Background (e.g., institution, degree,
major, dates). A naïve approach would be to prompt the LLM to ex-
tract all fields in a single pass. However, this method is suboptimal,
as complex prompts requiring the model to identify disparate types
of information simultaneously can degrade performance. Instead,
we adopt a parallelized task decomposition strategy, decomposing
the task into three independent sub-tasks, one for each information
category. The complete, linearized resume text is processed in three
parallel threads, with each thread invoking the LLM with a highly
specialized prompt tailored to its specific extraction target. For
example, the prompt for basic information extraction explicitly re-
quests name, phone, email, etc., while the work experience prompt
specifies extraction of job titles, time spans, and description. This
parallelized, decomposed approach not only improves extraction
accuracy but also significantly reduces end-to-end latency. Example
prompts are provided in Appendix A.
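The decomposition into three concurrent sub-tasks can be sketched with a thread pool. The prompt strings below are hypothetical placeholders (the actual prompts are in Appendix A), and `call_llm` is an assumed caller-supplied function that sends one prompt to the model and returns its raw response.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical prompt templates, one per information category.
PROMPTS = {
    "basic_info": "Extract name, phone, and email as JSON.",
    "work_experience": "Extract company, title, dates, and description line range as JSON.",
    "education": "Extract institution, degree, major, and dates as JSON.",
}


def extract_all(indexed_resume: str, call_llm) -> dict:
    """Run one specialized prompt per category concurrently over the same resume text."""
    def run(task: str) -> str:
        return call_llm(f"{PROMPTS[task]}\n\n{indexed_resume}")

    with ThreadPoolExecutor(max_workers=len(PROMPTS)) as pool:
        futures = {task: pool.submit(run, task) for task in PROMPTS}
        # Total latency is roughly that of the slowest sub-task, not the sum of all three.
        return {task: future.result() for task, future in futures.items()}
```

Threads suit this workload because each sub-task spends its time waiting on a model call rather than computing locally.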
Index-based Pointer Mechanism. To further address the high latency
and content drift that arise when extracting long, descriptive
fields (e.g., work description, project description), we introduce
an index-based pointer mechanism. Rather than prompting the LLM
to generate the full verbatim content, a process that is slow,
token-intensive, and prone to generative errors such as paraphrasing
or hallucination, we explicitly prompt it to return a line number
range (e.g., [15, 25]) referencing the indexed text from our layout
regeneration stage (Section 3.1.2).
This transforms the task for the LLM from a complex, open-
ended generative task into a much simpler and more constrained
span identification task, offering two major benefits:
• Efficiency: Returning index ranges requires only a few tokens,
significantly reducing inference time and cost compared to gen-
erating full descriptions.
• Content Fidelity: During our post-processing stage, we use these
returned indices to re-extract the exact text block from the origi-
nal source document, guaranteeing 100% content fidelity.
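Resolving a returned pointer back into text can be sketched in a few lines. The function name, the dictionary representation of the indexed lines, and the treatment of the range as inclusive on both ends are assumptions for illustration.

```python
def resolve_pointer(indexed_lines: dict, span) -> str:
    """Return the verbatim source text for an inclusive [start, end] line range."""
    start, end = span
    if start > end:
        # Tolerate a reversed range instead of returning nothing.
        start, end = end, start
    # Skip indices the model invented that do not exist in the document.
    picked = [indexed_lines[i] for i in range(start, end + 1) if i in indexed_lines]
    return "\n".join(picked)
```

For example, with lines 1 to 4 indexed, a model output of `[2, 3]` costs only a handful of tokens yet yields lines 2 and 3 exactly as they appear in the source.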
3.2.2 Model Selection and Fine-Tuning. Although large language
models such as GPT-4o and Qwen-max exhibit strong generaliza-
tion and semantic capabilities, their high latency and inference cost
are often prohibitive for real-time, large-scale enterprise applica-
tions. In contrast, compact models like Qwen-0.6B offer significantly
faster inference (approximately 150 tokens per second), enabling
them to process a typical resume (300–400 tokens) in 1–2 seconds
per extraction sub-task. However, the lightweight Qwen-0.6B lacks
sufficient performance for complex extraction tasks due to limited
capacity.
To address this, we fine-tune Qwen-0.6B on a carefully con-
structed supervised dataset tailored for resume parsing. The su-
pervised fine-tuning (SFT) dataset consists of 15,500 resumes and
59,500 instruction-based training samples. Each training sample
is represented as a triplet of (instruction, input, output), where
the instruction specifies the extraction task (e.g., extracting work
experience), the input is the indexed resume text, and the output
is the corresponding human-verified JSON-format label. The dataset
combines both synthetic and real-world resumes covering diverse
formats and content styles. Further construction details are pro-
vided in Appendix B. After fine-tuning, the adapted Qwen-0.6B
model achieves strong extraction accuracy comparable to much
larger LLMs, while maintaining high inference efficiency, making it
practical and scalable for real-time resume information extraction
in industrial deployment.
Output Formatting. To select a suitable output format, we
compared several formats experimentally and adopted JSON.
While alternatives like YAML or Markdown
reduce token length, JSON yields the most stable results, likely
due to Qwen’s fine-tuning on JSON-style data. Instead of using
JSON-compatible decoding modes (e.g., an automaton decoder), which
often compromise the model’s parsing capabilities, we adopt a
prompt-based strategy that naturally guides the model to produce
valid JSON and apply lightweight string-based extraction
(e.g., text.find("{") and text.rfind("}")) to retrieve valid JSON blocks.
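A minimal sketch of this string-based extraction is shown below. The function name and the choice to return `None` on failure are assumptions; only the use of `find("{")` and `rfind("}")` comes from the text.

```python
import json


def extract_json_block(text: str):
    """Slice from the first '{' to the last '}' and parse it; return None on failure."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        # The model produced malformed JSON; the caller can retry or skip.
        return None
```

This tolerates chatter before and after the JSON object, such as an introductory sentence or a closing remark. It does not handle outputs whose top-level value is a JSON array, which would need the analogous search for square brackets.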
3.2.3 Post-Processing and Data Refinement. As the raw model out-
puts often contain formatting inconsistencies or hallucinated con-
tent, we further implement a multi-stage post-processing and data
refinement pipeline to enhance the data fidelity. The process con-
sists of four key stages:
• Grounded content re-extraction: For long-form fields (e.g., job
or project descriptions), we use the line indices
returned by the LLM as pointers to re-extract all descriptive
text directly from the original source document, eliminating any
content drift introduced by the model.
• Domain-specific normalization: We apply a set of domain-specific
cleaning and normalization rules to standardize volatile fields,
such as normalizing diverse date formats and removing suffix
noise in organization names.
• Context-aware de-duplication: We identify and filter redundant
entries, such as a project that is also mentioned within a work
experience, by comparing the source text spans (i.e., line number
ranges) of all extracted entities.
• Source text verification: Finally, we perform a verification step for
every extracted record, discarding any entity whose key identify-
ing fields (e.g., company name and job title) cannot be found in the
original document text, thereby pruning model hallucinations.
Through this rigorous four-stage pipeline, we transform raw
LLM outputs into clean, trustworthy, and application-ready struc-
tured data that faithfully reflects the content of the original docu-
ment.
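Three of the four stages (normalization, de-duplication, and source verification) can be sketched as below. The concrete rules are assumptions chosen for illustration: the date regex only handles numeric year-month strings, the 0.8 overlap threshold is hypothetical, and the field names are placeholders. The source states only that dates are normalized, that duplicates are found by comparing line spans, and that key fields must appear in the original text.

```python
import re


def normalize_date(raw: str):
    """Map strings such as '2021.03', '2021-3-15', or '2021年3月' to (year, month)."""
    m = re.search(r"(\d{4})\D+(\d{1,2})", raw)
    if m and 1 <= int(m.group(2)) <= 12:
        return (int(m.group(1)), int(m.group(2)))
    m = re.search(r"(\d{4})", raw)
    return (int(m.group(1)), None) if m else None


def spans_overlap(a, b, threshold=0.8) -> bool:
    """True if inclusive line spans a and b share at least `threshold` of the shorter span."""
    shared = min(a[1], b[1]) - max(a[0], b[0]) + 1
    shorter = min(a[1] - a[0], b[1] - b[0]) + 1
    return shared / shorter >= threshold


def drop_duplicate_projects(projects, work_experiences):
    """Filter projects whose source span is already covered by a work experience."""
    return [
        p for p in projects
        if not any(spans_overlap(p["span"], w["span"]) for w in work_experiences)
    ]


def is_grounded(entity: dict, source_text: str, key_fields) -> bool:
    """Keep an entity only if every key field appears verbatim in the source text."""
    haystack = source_text.lower()
    for field in key_fields:
        value = str(entity.get(field) or "").strip().lower()
        if not value or value not in haystack:
            return False
    return True
```

Month names (e.g., "Mar 2021") and month-first strings are not parsed to a month by this regex; a production rule set would need to cover them.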
3.3 Two-Stage Automatic Evaluation
After extracting the key information, another critical task is to
evaluate the accuracy of extraction. Manual evaluation is impracti-
cal for large-scale system development as it is prohibitively time-
consuming, expensive, and often subject to human inconsistency.
To enable efficient comparison and provide objective, reliable re-
sults, an automated evaluation methodology is essential.
For clarity, we define two key terms in resume information ex-
traction:
• An Entity refers to a complete block of information. For example,
a single "Work Experience" or one "Education History" entry is
considered one entity.
• A Field refers to a specific attribute within an entity. For a "Work
Experience" entity, its fields would be company name, position,
start date, end date, and description.
Our primary evaluation challenge is to accurately compare the
list of entities, along with their fields, predicted by our model against
a ground-truth list. However, designing such an automatic evaluation
model is non-trivial. A naive approach, such as a one-to-one
comparison of entities based on their sequential order, often fails
due to the following challenges:
• Quantity mismatch: The number of predicted entities may dif-
fer from ground-truth, leading to either missed extractions or
spurious extractions.
• Order mismatch: Even if both lists contain the same entities,
the model may extract them in an order that differs from the
ground-truth, causing a naive index-based comparison to fail.
• Partial match: An extracted entity could be only a partial match,
as some fields may be incorrectly extracted or even missing. For
instance, a predicted Work Experience entity might have the correct
company name and position fields, but a different date format or
incomplete description text.
To overcome the above challenges, we propose a robust, two-stage
automated evaluation framework that intelligently aligns entities,
performs a fine-grained field-level comparison, and aggregates the
results into clear, quantitative metrics.
Entity Alignment via the Hungarian Algorithm. For any pair
of lists to be compared, for instance, a ground-truth list of $M$ work
experiences $G = \{g_1, \dots, g_M\}$ and a predicted list of $N$ experiences
$P = \{p_1, \dots, p_N\}$, we construct an $M \times N$ similarity matrix $S$. Each
element $S_{ij}$ represents the similarity between $g_i$ and $p_j$, which is
computed as the average normalized string similarity of their key
fields, such as company name and position. We then apply the
Hungarian algorithm [12] to $S$ to find a one-to-one assignment that
maximizes the total similarity of all matched pairs $\{(g_i, p_j)\}$. This
naturally resolves the aforementioned challenges: it is insensitive
to order mismatch and finds the optimal partial assignment even
when $M \neq N$.
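The alignment step can be sketched with only the standard library. Several choices here are assumptions rather than details from the paper: `difflib.SequenceMatcher` stands in for the normalized string similarity, the key fields are named `company` and `position`, and the assignment is solved by a compact potentials-based variant of the Hungarian algorithm on negated similarities. In practice a library routine such as `scipy.optimize.linear_sum_assignment` (with `maximize=True`) would do the same job.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def entity_similarity(g: dict, p: dict, key_fields) -> float:
    # Average similarity over the key fields (one entry S_ij of the matrix).
    return sum(similarity(g.get(f, ""), p.get(f, "")) for f in key_fields) / len(key_fields)


def hungarian_min(cost):
    """Minimum-cost assignment for an n x m matrix with n <= m.
    Returns a list whose entry i is the column assigned to row i."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u, v = [0.0] * (n + 1), [0.0] * (m + 1)   # row and column potentials
    p = [0] * (m + 1)                          # p[j]: 1-based row matched to column j
    way = [0] * (m + 1)
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [inf] * (m + 1), [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # walk back along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j]:
            assignment[p[j] - 1] = j - 1
    return assignment


def align(ground_truth, predicted, key_fields=("company", "position")):
    """Return sorted (i, j) pairs matching ground_truth[i] to predicted[j]."""
    if not ground_truth or not predicted:
        return []
    S = [[entity_similarity(g, p, key_fields) for p in predicted] for g in ground_truth]
    transposed = len(S) > len(S[0])            # the solver needs rows <= columns
    rows = list(zip(*S)) if transposed else S
    cost = [[-s for s in row] for row in rows]  # maximize similarity = minimize negation
    pairs = []
    for row, col in enumerate(hungarian_min(cost)):
        pairs.append((col, row) if transposed else (row, col))
    return sorted(pairs)
```

When $M \neq N$, the entities on the longer side that receive no partner correspond to missed or spurious extractions.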
Multi-Strategy Fields Matching. Once entity pairs are aligned,
we proceed to a fine-grained comparison of the constituent fields
of each aligned pair. Recognizing that a single "exact match"
rule is inadequate for diverse data types, we designed a
multi-strategy comparison function that dynamically selects the most
appropriate validation rule based on the field’s semantic nature.
• Period Fields (e.g., dates): Normalized into (year, month) format
to support flexible matching across date representations.
• Named Entities (e.g., organizations, schools, job titles): Compared
using partial substring matching to tolerate abbreviations or
suffix differences.
• Long Descriptions (e.g., project or job descriptions): Matched via
edit-distance-based similarity (e.g., normalized similarity > 0.9) to allow
for minor paraphrasing.
• Other Fields (e.g., names, email addresses): Evaluated using nor-
malized exact match after lowercasing and punctuation removal.
An extracted field is considered correct only if its entity is
successfully aligned with a ground-truth entity by the Hungarian algorithm
and the field subsequently passes the multi-strategy field matching criteria.
To validate the reliability of this automatic evaluation framework,
we conducted human verification on a subset of results. The eval-
uation consistently aligned with human judgment, confirming its
effectiveness in producing accurate matching outcomes.
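The four matching strategies can be sketched as a single dispatch function. The field names and groupings are hypothetical, the date regex only covers numeric year-month strings, and `difflib.SequenceMatcher.ratio()` is used as a stand-in for an edit-distance-based similarity (it is a related but not identical measure).

```python
import re
import string
from difflib import SequenceMatcher

# Hypothetical field groupings for illustration.
PERIOD_FIELDS = {"start_date", "end_date"}
NAMED_ENTITY_FIELDS = {"company", "position", "school", "major", "degree", "department"}
LONG_TEXT_FIELDS = {"description"}


def to_year_month(value: str):
    m = re.search(r"(\d{4})\D+(\d{1,2})", value)
    return (int(m.group(1)), int(m.group(2))) if m else None


def normalize(value: str) -> str:
    # Lowercase and strip ASCII punctuation; other scripts would need extra rules.
    return value.lower().strip().translate(str.maketrans("", "", string.punctuation))


def fields_match(field: str, truth: str, pred: str, threshold: float = 0.9) -> bool:
    if field in PERIOD_FIELDS:
        t = to_year_month(truth)
        return t is not None and t == to_year_month(pred)
    if field in NAMED_ENTITY_FIELDS:
        t, p = normalize(truth), normalize(pred)
        # Partial substring match in either direction tolerates abbreviations and suffixes.
        return bool(t and p) and (t in p or p in t)
    if field in LONG_TEXT_FIELDS:
        return SequenceMatcher(None, truth, pred).ratio() > threshold
    return normalize(truth) == normalize(pred)
```

Keeping the rules in one function makes it straightforward to audit which strategy produced a given field-level decision.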
Finally, we aggregate the field-level matching outcomes across all
evaluated resumes to compute quantitative performance metrics for
each field. Such per-field evaluation not only aggregates to overall
extraction quality but also highlights field-specific weaknesses,
offering valuable insights for further model improvement.
4 Experiments
4.1 Experimental Settings
4.1.1 Datasets. As there is no public resume dataset for evalua-
tion, we construct two distinct datasets to evaluate our model’s
performance.
• SynthResume. SynthResume is a synthetic corpus of 2,994 re-
sumes with different layouts, which is constructed through a
semi-automated, LLM-based generation pipeline (Figure 8). This
process involves curating a diverse set of resume templates and
using an LLM to populate them with new, randomized yet plau-
sible content. Each sample undergoes layout parsing and pre-
labeling via Qwen-max, followed by manual correction to ensure
annotation quality. Details of dataset construction are provided in
Appendix C.
• RealResume. RealResume is a private dataset of
13,100 real-world resumes collected from Alibaba’s HR system.
This dataset is inherently more complex, featuring documents
with challenging characteristics such as custom fonts, intricate
layouts, and mixed Chinese-English content. The annotation
process for this dataset followed the same methodology as for
SynthResume.
4.1.2 Methods for Comparison. To comprehensively evaluate our
proposed framework, we compare it against a wide range of baseline
systems, as well as benchmark different LLMs in our pipeline. We
categorize the compared methods into four groups:
• Non-LLM Baselines. This includes Bello [5], a commercial resume
parsing system widely used in industry, and PaddleNLP [17], a deep
learning-based information extraction pipeline built on the
PaddlePaddle framework.
• Naïve LLM Baseline. This approach directly applies a large
language model (Claude-4 [4]) to OCR-extracted resume text,
without any layout-aware preprocessing or task decomposition.
• Our Pipeline (Zero-Shot LLM). We evaluate our full layout-
aware pipeline paired with various off-the-shelf LLMs, including
Claude-4 [4], Gemini-2.5flash [9], GPT-4o [15], Deepseek-v3 [11],
Qwen-max [1], and Qwen3 series models [3], to examine how
different models perform in a zero-shot setting within our archi-
tecture.
• Our Pipeline (Fine-Tuned LLM). This variant uses our pipeline
with a supervised fine-tuned model, Qwen3-0.6B-SFT, trained
on our SFT dataset. It represents our final, production-oriented
system, designed to balance top-tier accuracy with the efficiency
required for large-scale deployment.
Details about these models are provided in Appendix D.
Table 1: Statistics of Datasets.
Dataset | Source | Size | Fields | Language | Layout | Data Split (Train/Test)
SynthResume | Semi-Automated Synthetic | 2,994 | 15 | Chinese | Linear & Non-Linear Templates | 2,500 / 494
RealResume | Real-World | 13,100 | 19 | Mixed (Eng/Chn) | More Complex, Custom Fonts | 13,000 / 100
4.1.3 Metrics. We evaluate our system using standard information
extraction metrics: Precision, Recall, and F1-score, following prior
work [8]. Evaluation is conducted at the field level, where an ex-
tracted field is considered a correct match only if it satisfies both
entity alignment and field-matching criteria. Additionally, we intro-
duce a novel alignment Accuracy metric that measures the fraction
of correctly extracted fields among all aligned fields. This metric
helps us distinguish errors caused by the alignment algorithm from
those caused by our field-matching rules.
Formal definitions of the metrics are provided in Appendix E.
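As a rough guide to how these quantities relate, the sketch below computes them from per-field counts. It assumes the standard definitions of Precision, Recall, and F1 and reads alignment Accuracy as correct fields divided by aligned fields, following the sentence above; the formal definitions in Appendix E take precedence, and the function and argument names are illustrative.

```python
def field_metrics(n_correct: int, n_predicted: int, n_ground_truth: int, n_aligned: int) -> dict:
    """Field-level metrics from counts aggregated over all evaluated resumes."""
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_ground_truth if n_ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Accuracy is restricted to fields whose entity was aligned, so it isolates
    # field-matching errors from alignment failures.
    accuracy = n_correct / n_aligned if n_aligned else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}
```

A field with high Accuracy but low Recall points to entities that were missed or left unaligned, whereas low Accuracy points to fields that were aligned but failed the matching rules.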
4.1.4 Implementation Details. We fine-tune the Qwen-0.6B model
using a full-parameter supervised fine-tuning approach, updating
all trainable parameters to maximize task-specific performance. For
optimization, we employ the AdamW optimizer [14]. The training
process is stabilized by a low initial learning rate of 5e-6, which
helps prevent catastrophic forgetting. The learning rate is managed
by a cosine annealing scheduler [13] to facilitate smooth conver-
gence in the later stages of training. To accommodate GPU memory
constraints, we set the per-device batch size to 2. We apply gradient
accumulation over 2 steps, resulting in an effective batch size of
4. This configuration provides a robust balance between training
stability and computational efficiency, allowing us to effectively
train the model within our hardware limitations.
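For intuition about the schedule, the sketch below evaluates a cosine-annealed learning rate starting from 5e-6 and derives the effective batch size. It assumes a single annealing cycle with no warmup and a minimum rate of zero, none of which is stated in the text; the constant and function names are illustrative.

```python
import math

PER_DEVICE_BATCH = 2
GRAD_ACCUM_STEPS = 2
# Gradients from 2 micro-batches of size 2 are summed before each optimizer step.
EFFECTIVE_BATCH = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS


def cosine_lr(step: int, total_steps: int, base_lr: float = 5e-6, min_lr: float = 0.0) -> float:
    """Cosine annealing from base_lr at step 0 down to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The rate decays slowly at first, fastest around the midpoint, and flattens near the end, which is what smooths convergence in the later stages of training.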
4.2 Impact of Hyper-parameters
We analyze the effects of two key decoding parameters, repetition
penalty and temperature, on extraction performance. We report the
results on the SynthResume dataset in Figure 2; similar
trends are observed on RealResume.
The repetition penalty discourages the model from generating
repetitive content by penalizing previously generated tokens. We
vary it while keeping temperature at 0 and find that a small penalty
of 1.01 yields the best F1 score. The temperature controls the ran-
domness of token sampling. For this experiment, the repetition
penalty is fixed at its optimal value (1.01). We find that a mod-
erate temperature of 0.5 provides the most stable and accurate
performance. These optimal settings are adopted for all subsequent
experiments.
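To make the two decoding parameters concrete, the sketch below applies them to a toy logit vector. It assumes the common repetition-penalty formulation introduced with the CTRL model (positive logits of already generated tokens are divided by the penalty, negative ones multiplied by it) and standard temperature-scaled softmax; the text does not state which formulation the serving stack uses, and the function names are hypothetical.

```python
import math


def apply_repetition_penalty(logits, generated_ids, penalty=1.01):
    """Lower the scores of tokens that have already been generated."""
    out = list(logits)
    for tok in set(generated_ids):
        # Dividing a positive logit or multiplying a negative one both reduce it.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def softmax_with_temperature(logits, temperature=0.5):
    """Sampling distribution; temperature 0 is treated as greedy (one-hot argmax)."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: tokens 0 and 1 were already generated.
penalized = apply_repetition_penalty([2.0, -1.0, 0.5], generated_ids=[0, 1])
probs = softmax_with_temperature(penalized, temperature=0.5)
```

A penalty as small as 1.01 changes each affected logit by about 1%, which is consistent with the observation that only a mild penalty is beneficial for verbatim extraction, where legitimate repetition of tokens from the resume is common.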
Figure 2: Impact of Hyper-parameters.
4.3 Performance Comparison
We evaluate and compare the performance of various resume
information extraction methods on both the SynthResume and
RealResume datasets. The overall performance averaged across all
resume fields, along with the average processing time per resume, is
presented in Table 2. We also report fine-grained accuracy
comparisons across different field groups in Table 3. Specifically,
the Period group includes four date-related fields: employment
start/end and education start/end. The Named Entity group includes
six entity fields: company name, job title, school, major, degree,
and department. The Long Text group comprises two descriptive
fields: job description and education description. From the general
and fine-grained comparisons, we observe the following key findings.
First, our layout-aware pipeline is critical for achieving
state-of-the-art performance. As shown in Table 2, our pipeline
consistently outperforms all baselines. On the SynthResume dataset,
the naïve LLM baseline (Claude-4) achieves an F1-score of 0.927,
whereas integrating Claude-4 into our layout-aware pipeline boosts
it to 0.946. The improvement is even more pronounced on the
RealResume dataset, which includes more complex, mixed-language
resumes: the F1-score rises from 0.919 to 0.959, a 4-point gain.
This improvement validates the effectiveness of our framework,
particularly in handling real-world resume complexity. Moreover, our
pipeline outperforms traditional non-LLM methods by a wide margin.
Compared to Bello, a state-of-the-art industrial system, our
fine-tuned Qwen3-0.6B-SFT model improves F1-score by over 14 points
(0.817 → 0.964), while also reducing inference latency
(1.62s → 1.54s), making it well-suited for real-world deployment.
Second, our fine-tuned small model offers the best trade-off
between accuracy and efficiency. While top-tier models such as
Claude-4 and Gemini-2.5 deliver strong performance, their inference
latency (4–13 seconds per resume) limits scalability in
high-throughput environments. As shown in Table 2, supervised
fine-tuning boosts the base Qwen3-0.6B model’s F1-score on
RealResume from 0.641 to 0.964, surpassing even Claude-4 (0.959),
with a latency of just 1.54 seconds, achieving a 3–4× speedup. These
results validate the effectiveness of fine-tuning compact models to
meet both high accuracy and low latency demands in production-scale
resume processing.
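The gains quoted in the first two findings can be recomputed from the RealResume columns of Table 2. The short check below is only a worked example of that arithmetic; "points" are taken to mean F1 differences multiplied by 100, and the helper names are hypothetical.

```python
def f1_gain_points(before, after):
    """F1 improvement expressed in points (difference times 100)."""
    return round((after - before) * 100, 1)


def speedup(baseline_latency_s, new_latency_s):
    """How many times faster the new system is than the baseline."""
    return round(baseline_latency_s / new_latency_s, 1)


# RealResume values from Table 2.
layout_gain = f1_gain_points(0.919, 0.959)   # naive Claude-4 vs. Claude-4 in our pipeline
bello_gain = f1_gain_points(0.817, 0.964)    # Bello vs. Qwen3-0.6B-SFT
sft_gain = f1_gain_points(0.641, 0.964)      # base Qwen3-0.6B vs. Qwen3-0.6B-SFT
claude_speedup = speedup(4.62, 1.54)         # Claude-4 in our pipeline vs. Qwen3-0.6B-SFT
```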
Table 2: Comparison of Overall Model Performance on the SynthResume and RealResume Datasets. Results are averaged across
all resume fields to provide a holistic evaluation of each method.

SynthResume Dataset
| Category | Model | Acc. ↑ | Prec. ↑ | Recall ↑ | F1 ↑ | Time (s) ↓ |
|---|---|---|---|---|---|---|
| Non-LLM Baselines | Bello | 0.815 | 0.787 | 0.741 | 0.762 | 1.43 |
| Non-LLM Baselines | PaddleNLP | 0.576 | 0.669 | 0.474 | 0.523 | 22.0 |
| Naïve LLM Baseline | Claude-4 | 0.926 | 0.923 | 0.933 | 0.927 | 20.54 |
| Our Pipeline (Zero-Shot LLM) | Claude-4 | 0.949 | 0.950 | 0.943 | 0.946 | 4.98 |
| Our Pipeline (Zero-Shot LLM) | Gemini2.5-flash | 0.949 | 0.958 | 0.945 | 0.951 | 11.23 |
| Our Pipeline (Zero-Shot LLM) | GPT-4o | 0.952 | 0.958 | 0.948 | 0.952 | 5.50 |
| Our Pipeline (Zero-Shot LLM) | Deepseek-v3 | 0.951 | 0.959 | 0.941 | 0.950 | 8.66 |
| Our Pipeline (Zero-Shot LLM) | Qwen-max | 0.950 | 0.945 | 0.947 | 0.946 | 9.40 |
| Our Pipeline (Zero-Shot LLM) | Qwen3-14B | 0.911 | 0.906 | 0.911 | 0.908 | 6.30 |
| Our Pipeline (Zero-Shot LLM) | Qwen3-4B | 0.885 | 0.876 | 0.895 | 0.885 | 5.24 |
| Our Pipeline (Zero-Shot LLM) | Qwen3-0.6B | 0.618 | 0.671 | 0.663 | 0.645 | 1.22 |
| Our Pipeline (Fine-Tuned Model) | Qwen3-0.6B-SFT | 0.931 | 0.918 | 0.917 | 0.917 | 1.22 |

RealResume Dataset
| Category | Model | Acc. ↑ | Prec. ↑ | Recall ↑ | F1 ↑ | Time (s) ↓ |
|---|---|---|---|---|---|---|
| Non-LLM Baselines | Bello | 0.835 | 0.836 | 0.746 | 0.817 | 1.62 |
| Non-LLM Baselines | PaddleNLP | 0.515 | 0.584 | 0.422 | 0.492 | 20.9 |
| Naïve LLM Baseline | Claude-4 | 0.896 | 0.896 | 0.901 | 0.919 | 22.71 |
| Our Pipeline (Zero-Shot LLM) | Claude-4 | 0.948 | 0.937 | 0.952 | 0.959 | 4.62 |
| Our Pipeline (Zero-Shot LLM) | Gemini2.5-flash | 0.947 | 0.933 | 0.955 | 0.954 | 13.67 |
| Our Pipeline (Zero-Shot LLM) | GPT-4o | 0.944 | 0.936 | 0.950 | 0.954 | 6.26 |
| Our Pipeline (Zero-Shot LLM) | Deepseek-v3 | 0.939 | 0.935 | 0.936 | 0.944 | 10.58 |
| Our Pipeline (Zero-Shot LLM) | Qwen-max | 0.935 | 0.927 | 0.934 | 0.937 | 19.2 |
| Our Pipeline (Zero-Shot LLM) | Qwen3-14B | 0.914 | 0.898 | 0.911 | 0.912 | 8.55 |
| Our Pipeline (Zero-Shot LLM) | Qwen3-4B | 0.861 | 0.833 | 0.887 | 0.869 | 6.85 |
| Our Pipeline (Zero-Shot LLM) | Qwen3-0.6B | 0.589 | 0.632 | 0.622 | 0.641 | 1.54 |
| Our Pipeline (Fine-Tuned Model) | Qwen3-0.6B-SFT | 0.961 | 0.938 | 0.964 | 0.964 | 1.54 |

Third, the gains of our framework are most evident in complex
Long Text fields. Long Text fields are the most challenging because
they require coherent extraction of multi-sentence descriptions. As
shown in Table 3, on RealResume, the naïve LLM baseline (Claude-4)
achieves an F1-score of only 0.548 in this group, whereas the same
model within our full pipeline reaches 0.854. A similar boost is
seen with model fine-tuning: Qwen3-0.6B’s F1-score for Long Text
jumps from 0.136 to 0.846 after SFT. Accurately extracting long text
(e.g., job descriptions and project summaries) is essential for
downstream tasks such as candidate-job matching, making this
improvement especially impactful in real-world hiring applications.
Interestingly, the naïve LLM baseline achieves the highest F1-score
on Period fields. This suggests that LLMs may possess strong
intrinsic capabilities for handling short, visually distinct, and
highly regular patterns such as dates. In such cases, our layout
regenerator may introduce minor segmentation noise that slightly
degrades performance. This observation motivates future work on
field-specific extraction strategies to further optimize performance.
4.4 Ablation Study
To understand the impact of each major component in our pipeline,
we conduct an ablation study on the SynthResume dataset by removing
three core modules individually: (1) w/o Text Fusion: the PDF’s
metadata-based text content is discarded and text extraction relies
solely on OCR; (2) w/o Layout Generator: hierarchical layout
reordering is disabled and a naive top-to-bottom, left-to-right sort
is applied directly to the text boxes; (3) w/o Post-Processor: the
post-processing step in the LLM-based extractor is skipped and the
model directly generates the full long description.

Table 4: Ablation study on key system components.
| Variants | Overall Acc. ↑ | Period Acc. ↑ | Named Entity Acc. ↑ | Long Text Acc. ↑ |
|---|---|---|---|---|
| w/o Text Fusion | 0.907 | 0.892 | 0.945 | 0.743 |
| w/o Layout Generator | 0.916 | 0.892 | 0.950 | 0.758 |
| w/o Post Processor | 0.921 | 0.897 | 0.952 | 0.781 |
| Full system | 0.932 | 0.897 | 0.952 | 0.858 |

As shown in Table 4, each component of our framework contributes
meaningfully to overall performance. The full system achieves the
highest overall accuracy (0.932), and removing any component leads
to a clear drop, validating the effectiveness of our integrated
design. Finer-grained analysis further shows that Long Text fields
are the most sensitive. Removing Text Fusion or the Layout Generator
lowers Long Text accuracy by 11.5 and 10.0 points, respectively, due
to OCR errors and disrupted reading order in multi-line content.
Omitting the Post-Processor also leads to a 7.7-point decline,
highlighting the difficulty LLMs face in generating long, verbatim
text accurately. Our post-processing module provides crucial
robustness and traceability for such complex fields. In contrast,
Named Entity and Period fields are more robust, but still benefit
from Text Fusion, confirming that dual-modal PDF text extraction is
consistently superior to OCR alone.
4.5 Online Deployment
Following its success in offline experiments, our pipeline has been
deployed to support Alibaba’s Intelligent HR system (CaiMi). The
deployment involves both offline training and online serving, as
shown in Figure 3.
Figure 3: System Deployment and Serving Framework. (Panels: Data & Training; Online Service.)
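Table 3: Fine-grained Accuracy and F1 Scores on the SynthResume and RealResume datasets, grouped by field types: Period,
Named Entity, and Long Text. Missing values indicate that the PaddleNLP baseline does not support extraction of Long Text
fields. Full results including Precision and Recall are provided in Appendix F.

SynthResume Dataset
| Category | Model | Period Acc. ↑ | Period F1 ↑ | Named Entity Acc. ↑ | Named Entity F1 ↑ | Long Text Acc. ↑ | Long Text F1 ↑ |
|---|---|---|---|---|---|---|---|
| Non-LLM Baselines | Bello | 0.852 | 0.885 | 0.841 | 0.749 | 0.469 | 0.259 |
| Non-LLM Baselines | PaddleNLP | 0.433 | 0.511 | 0.818 | 0.699 | – | – |
| Naïve LLM Baseline | Claude-4 | 0.977 | 0.980 | 0.939 | 0.951 | 0.731 | 0.684 |
| Our Pipeline (Zero-Shot LLM) | Claude-4 | 0.960 | 0.973 | 0.963 | 0.956 | 0.825 | 0.801 |
| Our Pipeline (Zero-Shot LLM) | Gemini2.5-flash | 0.963 | 0.975 | 0.959 | 0.957 | 0.835 | 0.831 |
| Our Pipeline (Zero-Shot LLM) | GPT-4o | 0.960 | 0.974 | 0.966 | 0.961 | 0.840 | 0.829 |
| Our Pipeline (Zero-Shot LLM) | Deepseek-v3 | 0.961 | 0.971 | 0.968 | 0.962 | 0.829 | 0.813 |
| Our Pipeline (Zero-Shot LLM) | Qwen-max | 0.942 | 0.945 | 0.979 | 0.970 | 0.824 | 0.811 |
| Our Pipeline (Zero-Shot LLM) | Qwen3-14B | 0.888 | 0.902 | 0.961 | 0.954 | 0.698 | 0.664 |
| Our Pipeline (Zero-Shot LLM) | Qwen3-4B | 0.893 | 0.889 | 0.955 | 0.953 | 0.513 | 0.527 |
| Our Pipeline (Zero-Shot LLM) | Qwen3-0.6B | 0.619 | 0.668 | 0.637 | 0.691 | 0.256 | 0.184 |
| Our Pipeline (Fine-Tuned Model) | Qwen3-0.6B-SFT | 0.897 | 0.907 | 0.951 | 0.938 | 0.858 | 0.767 |

RealResume Dataset
| Category | Model | Period Acc. ↑ | Period F1 ↑ | Named Entity Acc. ↑ | Named Entity F1 ↑ | Long Text Acc. ↑ | Long Text F1 ↑ |
|---|---|---|---|---|---|---|---|
| Non-LLM Baselines | Bello | 0.921 | 0.879 | 0.885 | 0.769 | 0.540 | 0.500 |
| Non-LLM Baselines | PaddleNLP | 0.387 | 0.451 | 0.722 | 0.622 | – | – |
| Naïve LLM Baseline | Claude-4 | 0.979 | 0.986 | 0.937 | 0.959 | 0.582 | 0.548 |
| Our Pipeline (Zero-Shot LLM) | Claude-4 | 0.963 | 0.972 | 0.964 | 0.949 | 0.869 | 0.854 |
| Our Pipeline (Zero-Shot LLM) | Gemini2.5-flash | 0.970 | 0.978 | 0.945 | 0.931 | 0.888 | 0.865 |
| Our Pipeline (Zero-Shot LLM) | GPT-4o | 0.963 | 0.972 | 0.950 | 0.941 | 0.880 | 0.870 |
| Our Pipeline (Zero-Shot LLM) | Deepseek-v3 | 0.960 | 0.971 | 0.939 | 0.918 | 0.867 | 0.852 |
| Our Pipeline (Zero-Shot LLM) | Qwen-max | 0.971 | 0.978 | 0.936 | 0.915 | 0.827 | 0.820 |
| Our Pipeline (Zero-Shot LLM) | Qwen3-14B | 0.963 | 0.972 | 0.939 | 0.899 | 0.724 | 0.707 |
| Our Pipeline (Zero-Shot LLM) | Qwen3-4B | 0.901 | 0.930 | 0.921 | 0.886 | 0.579 | 0.567 |
| Our Pipeline (Zero-Shot LLM) | Qwen3-0.6B | 0.680 | 0.734 | 0.647 | 0.671 | 0.120 | 0.136 |
| Our Pipeline (Fine-Tuned Model) | Qwen3-0.6B-SFT | 0.981 | 0.984 | 0.956 | 0.937 | 0.880 | 0.846 |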
In the offline phase, we construct the training dataset on the
MaxCompute Platform, where a diverse set of resume templates is
first generated. These templates are populated using LLM-based
content synthesis followed by manual correction, producing
high-quality synthetic resumes. Combined with real resumes collected
from Alibaba’s hiring applications, we construct a fine-tuning
corpus of labeled examples. All resume texts and labels are stored
on Alibaba Cloud’s Object Storage Service (OSS). Model fine-tuning
is conducted on Nebula, Alibaba’s internal distributed AI training
platform. We fine-tune the Qwen3-0.6B model using full-parameter
supervised learning on a compute node with 8× NVIDIA A800 GPUs.
With Nebula’s optimized training infrastructure, the full
fine-tuning process completes within 30 minutes.
In the online serving phase, the fine-tuned model is deployed on the
Whale Platform, Alibaba’s LLM-serving infrastructure. Serving
orchestration is managed by TPP, Alibaba’s internal online inference
engine. Upon a parsing request from the CaiMi HR system, TPP
orchestrates the entire resume parsing workflow: initial OCR and
text fusion, layout regeneration, LLM API invocation via Whale, and
finally, returning the extracted structured results to the HR
system. The entire pipeline demonstrates strong real-time
performance, achieving a throughput of 240–300 resumes per minute
(i.e., 4–5 QPS), with an average response latency of 1.54 seconds
per resume. This meets the strict latency and throughput
requirements of large-scale enterprise hiring systems.
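The throughput and latency figures above also imply how many requests are in flight at once. The sketch below uses Little's law (average concurrency = arrival rate × average latency), which assumes a stable, steady-state system; the concurrency values are derived for illustration and are not measurements from the deployment.

```python
def qps_from_per_minute(resumes_per_minute):
    """Convert a per-minute throughput into queries per second."""
    return resumes_per_minute / 60.0


def average_in_flight(qps, latency_s):
    """Little's law: mean number of requests being processed concurrently."""
    return qps * latency_s


low_qps = qps_from_per_minute(240)    # lower end of the reported throughput
high_qps = qps_from_per_minute(300)   # upper end of the reported throughput
# With a 1.54 s average latency, roughly 6 to 8 resumes are processed concurrently.
low_concurrency = average_in_flight(low_qps, 1.54)
high_concurrency = average_in_flight(high_qps, 1.54)
```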
5 Conclusion and Future Work
In this paper, we present a layout-aware, efficiency-optimized auto-
matic resume extraction and assessment pipeline that successfully
addresses the key challenges of layout heterogeneity, LLM latency,
and evaluation difficulty in industrial-scale resume information ex-
traction. We demonstrate that by combining a robust layout-aware
parsing pipeline with a lightweight fine-tuned LLM model, our ap-
proach delivers both high accuracy and low latency. The framework
significantly outperforms existing baselines and has been success-
fully deployed in Alibaba’s intelligent HR system, serving real-time
scenarios with high throughput and reliability. We also open-source
the entire pipeline and contribute benchmark datasets to advance
future research and practical application. Future work will explore
dynamic, field-specific extraction strategies that selectively apply
different extraction models, such as applying simpler methods for
regular fields while reserving our full pipeline for more complex,
context-dependent ones, to further optimize performance.
References
[1] Alibaba DAMO Academy. 2024. Qwen: Alibaba’s Open Multilingual LLM Family.
https://huggingface.co/Qwen.
[2] Irfan Ali, Nimra Mughal, Zahid Hussain Khand, Javed Ahmed, and Ghulam Mu-
jtaba. 2022. Resume classification system using natural language processing and
machine learning techniques. Mehran University Research Journal Of Engineering
& Technology 41, 1 (2022), 65–79.
[3] Alibaba DAMO Academy. 2024. Qwen3: Think Deeper, Act Faster. https://
qwenlm.github.io/blog/qwen3/. Accessed: 2025-07-26.
[4] Anthropic. 2024. Claude 4 Model Overview. https://www.anthropic.com/news/
claude-4. Accessed: 2025-07-26.
[5] Bello AI. 2024. Bello Intelligent Resume Parser. https://www.belloai.com/parser?
lan=en. Accessed: 2025-07-26.
[6] Jiaze Chen, Liangcai Gao, and Zhi Tang. 2016. Information extraction from
resume documents in pdf format. Electronic Imaging 28 (2016), 1–8.
[7] Chirag Daryani, Gurneet Singh Chhabra, Harsh Patel, Indrajeet Kaur Chhabra,
and Ruchi Patel. 2020. An automated resume screening system using natural
language processing and similarity. ETHICS AND INFORMATION TECHNOLOGY
[Internet]. VOLKSON PRESS (2020), 99–103.
[8] Y Gyana Deepa, Ankathi Sindhu, Alakuntla Shruthi, and Bitla Neha. 2025. Auto-
mated Resume Parsing: A Review of Techniques, Challenges and Future Direc-
tions. (2025).
[9] Google DeepMind. 2024. Gemini Flash. https://deepmind.google/technologies/
gemini/#models. Accessed: 2025-07-26.
[10] Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. 2022. Layoutlmv3:
Pre-training for document ai with unified text and image masking. In Proceedings
of the 30th ACM international conference on multimedia. 4083–4091.
[11] DeepSeek Inc. 2024. DeepSeek LLM Technical Report. Technical Report (2024).
https://huggingface.co/deepseek-ai.
[12] Harold W Kuhn. 1955. The Hungarian method for the assignment problem. Naval
research logistics quarterly 2, 1-2 (1955), 83–97.
[13] Ilya Loshchilov and Frank Hutter. 2017. SGDR: Stochastic Gradient Descent with
Warm Restarts. In International Conference on Learning Representations (ICLR).
https://arxiv.org/abs/1608.03983
[14] Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization.
In International Conference on Learning Representations (ICLR). https://arxiv.org/
abs/1711.05101
[15] OpenAI. 2024. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL] https://arxiv.
org/abs/2303.08774
[16] Sachin Pawar, Devavrat Thosar, Nitin Ramrakhiyani, Girish K Palshikar, Anindita
Sinha, and Rajiv Srivastava. 2021. Extraction of complex semantic relations from
resumes. In ASEA workshop@ IJCAI.
[17] Baidu NLP Team. 2021. PaddleNLP: An Easy-to-Use and High-Performance NLP
Library. https://github.com/PaddlePaddle/PaddleNLP.
[18] Ao Wang, Hui Chen, Lihao Liu, Kai Chen, Zijia Lin, Jungong Han, et al. 2024.
Yolov10: Real-time end-to-end object detection. Advances in Neural Information
Processing Systems 37 (2024), 107984–108011.
[19] Dongsheng Wang, Natraj Raman, Mathieu Sibue, Zhiqiang Ma, Petr Babkin,
Simerjot Kaur, Yulong Pei, Armineh Nourbakhsh, and Xiaomo Liu. 2023. Do-
cLLM: A layout-aware generative language model for multimodal document
understanding. arXiv preprint arXiv:2401.00908 (2023).
[20] Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. 2020.
Layoutlm: Pre-training of text and layout for document image understanding.
In Proceedings of the 26th ACM SIGKDD international conference on knowledge
discovery & data mining. 1192–1200.
[21] Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan
Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, et al. 2020. Layoutlmv2: Multi-
modal pre-training for visually-rich document understanding. arXiv preprint
arXiv:2012.14740 (2020).
[22] Kun Yu, Gang Guan, and Ming Zhou. 2005. Resume information extraction
with cascaded hybrid model. In Proceedings of the 43rd annual meeting of the
Association for Computational Linguistics (ACL’05). 499–506.
[23] Jinyu Zhang, Zhiyuan You, Jize Wang, and Xinyi Le. 2025. Sail: Sample-centric
in-context learning for document information extraction. In Proceedings of the
AAAI Conference on Artificial Intelligence, Vol. 39. 25868–25876.
[24] Shicheng Zu and Xiulai Wang. 2019. Resume information extraction with a novel
text block segmentation algorithm. Int J Nat Lang Comput 8, 2019 (2019), 29–48.
Appendix
A Task-specific Prompts
In the LLM-based Extractor, we adopt a parallelized task
decomposition strategy, where each extraction task is handled
independently using a specialized instruction prompt. This modular
approach improves both extraction accuracy and efficiency in
production settings. The task-specific prompts for extracting basic
information, education background, and work experience are
illustrated in Figures 4, 5, and 6, respectively.
B Construction of Supervised Fine-Tuning Dataset
To adapt the Qwen3-0.6B language model for resume information
extraction, we construct a high-quality supervised fine-tuning (SFT)
dataset in an instruction-based format. The SFT dataset consists of
two parts:
• Synthetic Dataset: We generate 2,500 synthetic resumes using
a layout-diverse generation pipeline. For each resume, we create
training instances for three extraction tasks (i.e., extracting
basic information, work experience, and education background),
yielding a total of 7,500 instruction-format samples.
• Real-World Dataset: We collect 13,000 real resumes with complex
formats and mixed-language content. Each resume is annotated for
four extraction tasks (i.e., extracting basic information,
work experience, project experience, and education background),
resulting in 52,000 SFT training examples.
Each training example is represented as a triplet of (instruction,
input, output), where the instruction specifies the extraction task,
the input is a fully indexed resume text, and the output is the
corresponding structured JSON-formatted label derived from
human-annotated ground truth. An example of an instruction-based
sample for basic information extraction is illustrated in Figure 7.
This SFT dataset enables the model to learn fine-grained extraction
behavior under explicit instructions and to generalize across
diverse layout structures and resume styles.
C Construction of SynthResume Dataset
To facilitate training and evaluation of our models, we construct a
large-scale synthetic resume dataset, SynthResume, consisting of
2,994 richly structured samples. This dataset is generated using a
semi-automated pipeline as shown in Figure 8, designed to simulate
realistic resumes with diverse layout and content variations.
Figure 8: Pipeline of Synthetic Resume Dataset Construction. (Stages: 1. Template Collection; 2. Layout Analysis; 3. Text Injection.)
The construction process includes the following key stages:
Step 1: Template Collection. We first manually curate a collection of
resume templates that reflect a wide variety of real-world styles,
including both linear templates (single-column, top-down formats)
and non-linear templates (complex, multi-column layouts with
sidebars).
Step 2: Layout Analysis and Text Extraction. Each resume template
is processed through a layout analysis module to identify its major
visual blocks (e.g., header, main content column, sidebar) and extract
the text within each block. This process results in a clean, ordered
text representation of the original resume’s content, which serves
as the backbone for synthesizing new resumes.
Step 3: Content Generation and Text Injection. Each template’s
extracted content is used as input context to a large language model,
Prompt for Basic Information Extraction
Extract the following information into a JSON object. If any field does not exist, output an empty string "".
{
"basicInfo": {
"name": "",
// Full name, e.g., "Zhang San"
"personalEmail": "",
// Email address, e.g., "610730297@qq.com"
"phoneNumber": "",
// Phone number, e.g., "13915732235".
// Preserve original format, including country/area code if present (e.g., "+1 (201) 706 1136")
"age": "",
// Current age (numeric only)
"born": "",
// Birth year and month if available, e.g., "1996-11"
"gender": "",
// "Male" or "Female". Leave blank if not specified.
"desiredLocation": ["city name", ...],
// Explicitly mentioned preferred job location(s), e.g., ["Beijing", "Shanghai"].
// Only extract if clearly stated. If none, return [].
"jobIntention": "",
// Job intention or target position, e.g., "Algorithm Engineer". Leave blank if unclear.
"currentLocation": "", // Current city of residence. Do NOT infer from work experience or place of origin.
"placeOfOrigin": ""
// Place of origin or hometown. Should not be confused with current location.
}
}
Figure 4: Prompt for Extracting Basic Information in JSON Format.
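Because the prompt in Figure 4 fixes the output schema, a model response can be checked against it before it is used downstream. The following is a minimal sketch of such a check, assuming the raw response is a JSON string; the helper `normalize_basic_info` is hypothetical and is not the post-processing module of our pipeline.

```python
import json

# Field names and empty defaults taken from the prompt in Figure 4.
BASIC_INFO_DEFAULTS = {
    "name": "", "personalEmail": "", "phoneNumber": "", "age": "", "born": "",
    "gender": "", "desiredLocation": [], "jobIntention": "",
    "currentLocation": "", "placeOfOrigin": "",
}


def normalize_basic_info(raw_output):
    """Parse a model response and coerce it to the Figure 4 schema."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        parsed = {}  # unparseable output falls back to an all-empty record
    info = parsed.get("basicInfo", {}) if isinstance(parsed, dict) else {}
    if not isinstance(info, dict):
        info = {}
    result = {}
    for key, default in BASIC_INFO_DEFAULTS.items():
        value = info.get(key, default)
        if isinstance(default, list):
            # List-valued field: anything that is not a list becomes [].
            result[key] = [str(v) for v in value] if isinstance(value, list) else []
        else:
            # String-valued field: None becomes "", other types are stringified.
            result[key] = "" if value is None else str(value)
    return {"basicInfo": result}


record = normalize_basic_info('{"basicInfo": {"name": "Zhang San", "age": 28}}')
```

Unknown keys are dropped and missing keys are filled with the empty values the prompt asks for, so every record has the same shape regardless of what the model returned.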
Prompt for Education Background Extraction
Extract the following education experiences into a JSON array. If any field does not exist, output an empty string "". If no education
experience is mentioned, return an empty array [].
{
"education": [
{
"school": "",
// Full name of the educational institution, e.g., "Tsinghua University"
"major": "",
// Major or field of study, e.g., "Computer Science"
"degree": "",
// Degree earned, e.g., "Bachelor", "Master", "PhD"
"startDate": "",
// Start date in "YYYY-MM" format if available, e.g., "2018-09"
"endDate": "",
// End date in "YYYY-MM" format. If still studying, use "present"
"location": ""
// City or region of the school, e.g., "Beijing"
}
]
}
Figure 5: Prompt for Extracting Education Background in JSON Format.
with a specific prompt designed to generate new content while
preserving the original field structure: “Given the above resume as
a template, generate a new resume by randomly replacing the field
values while preserving the structure and semantics.” The
LLM-generated text undergoes manual inspection to ensure quality
and coherence. Once verified, this new text is injected back into the
original visual template. The final output of this stage is a complete,
fully formatted synthetic resume PDF that retains the layout of the
source template but contains entirely new, synthetic content.
Finally, to create high-quality labels for these resumes, we employ
a two-pass annotation process:
• Automated Pre-labeling: Each of the 2,994 synthetic resumes is
processed by our full extraction pipeline, using a powerful LLM,
Qwen-Max, to generate an initial set of indexed text and
pre-labeled JSON annotations.
• Manual Correction: To ensure label quality, these pre-labels are
then subjected to a rigorous manual correction phase in which human
annotators correct any errors made by the initial automated pass.
The fully annotated dataset is sorted by the length of the resume
text. The longest 2,500 resumes are allocated to the training set,
with the remaining 494 forming the test set. This entire process
yields a high-quality, structurally diverse dataset ideal for
Prompt for Work Experience Extraction
Extract the following work experiences into a JSON array. If any field does not exist, output an empty string "". If no work experience is
mentioned, return an empty array [].
{
"workExperience": [
{
"company": "",
// Company name, e.g., "Alibaba Group"
"position": "",
// Job title or role, e.g., "Backend Engineer"
"startDate": "",
// Start date in "YYYY-MM" format, e.g., "2020-07"
"endDate": "",
// End date in "YYYY-MM" format; use "present" if currently employed
"location": "",
// City or region of the job location, e.g., "Hangzhou"
"description": ""
// Description of responsibilities and achievements; preserve original wording
}
]
}
Figure 6: Prompt for Extracting Work Experience in JSON Format.
training and evaluating layout-aware information extraction models.
D Details of Baselines
To comprehensively evaluate our proposed framework, we compare
it against a wide range of baseline systems and benchmark
different LLMs within our pipeline.
• Bello [5]. A widely deployed commercial resume analysis service
in HR automation. It processes resumes across varying formats
and layouts, and applies bilingual parsing, knowledge-graph-enhanced
field extraction, and document structure understanding at scale.
We treat Bello’s Intelligent Parser as a strong industrial
baseline in our work.
• PaddleNLP [17]. An open-source NLP library developed by
Baidu based on the PaddlePaddle deep learning framework. It
provides pre-trained models and end-to-end pipelines for a wide
range of NLP tasks. In our experiments, we adopt its IE workflow
for resume field extraction.
• Claude-4 [4]. The latest large language model from Anthropic,
designed for high-performance reasoning and long-context
understanding.
• Gemini2.5-flash [9]. A lightweight version of Google’s Gemini
2.5 model, optimized for fast inference and reduced latency, while
maintaining reasonable performance for common LLM tasks.
• GPT-4o [15]. OpenAI’s flagship multimodal model released in
2024, capable of handling text, audio, and image inputs natively.
• Deepseek-v3 [11]. An open-source, multilingual LLM developed
by DeepSeek that achieves competitive performance across
various reasoning and language tasks.
• Qwen-max [1]. A state-of-the-art, multilingual large language
model with over 100 billion parameters, released as part of the
Qwen model series by Alibaba. It is designed for general-purpose
reasoning and excels in multi-step, instruction-following tasks.
• Qwen3 Series [3]. The Qwen3 series includes models of various
sizes designed for efficiency and controllability. We compare
Qwen3-14B, Qwen3-4B, and Qwen3-0.6B to evaluate the performance
of our pipeline with LLMs of different sizes.
E Details of Metrics
In our experiments, we adapt three standard evaluation metrics,
Precision, Recall, and F1-score [8], to the resume entity extraction
task, and introduce a novel metric, Accuracy, to specifically evaluate
our multi-strategy field matching logic. Let $E_{gt}$ be the set of
ground-truth entity fields, $E_{pred}$ be the set of fields predicted
by our model, $E_{align}$ be the set of fields aligned via the
Hungarian algorithm, and $E_{correct}$ be the set of correct matches
through multi-strategy field matching. An entity field
$e \in E_{pred}$ is considered correct and included in $E_{correct}$
if and only if it is first successfully aligned with a ground-truth
entity via the Hungarian algorithm (i.e., it is in $E_{align}$) and
subsequently passes our multi-strategy field matching criteria.
Based on these definitions, we compute the following metrics:
• Precision measures the proportion of correctly extracted entity
fields among all fields predicted by the model.
$$\text{Precision} = \frac{|E_{correct}|}{|E_{pred}|} \tag{1}$$
• Recall quantifies the proportion of ground-truth entity fields
that are correctly extracted by the model.
$$\text{Recall} = \frac{|E_{correct}|}{|E_{gt}|} \tag{2}$$
• F1-Score is the harmonic mean of Precision and Recall, providing
a single, balanced measure of overall performance.
$$\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}$$
• Accuracy measures the fraction of correctly extracted entity
fields among all aligned fields. This metric helps us distinguish
errors caused by the alignment algorithm from those caused by
our field-matching rules.
$$\text{Accuracy} = \frac{|E_{correct}|}{|E_{align}|} \tag{4}$$
F Finer-Grained Comparison of Different Models
In addition to the overall performance averaged across all resume
fields, we also conduct finer-grained accuracy comparisons across
different field groups in the SynthResume and RealResume datasets.
The complete results are reported in Table 5 and Table 6,
respectively.
Example Instruction-Based SFT Sample for Basic Information Extraction
Instruction: You are a professional resume analysis assistant. Your task is to convert the given resume text into the JSON format specified
below. (If both Chinese and English resumes appear, only extract from the Chinese one.)
Extract the following information into a JSON object. If any field does not exist, output an empty string "".
{
"basicInfo": {
"name": "",
# Name, e.g.: Zhang San
"personalEmail": "",
# Email, e.g.: 610730297@qq.com
"phoneNumber": "",
# Phone/Mobile number, preserve original format, e.g., "+1 (201) 706 1136"
"age": "",
# Current age
"born": "",
# Birth year (and month if available), e.g.: 1996-11
"gender": "",
# Male/Female. Leave empty if not present.
"desiredLocation": ["city name", ...],
# Expected job location(s), e.g., ["Beijing", "Shanghai"]
# Must be explicitly mentioned. If not, set to [].
"currentLocation": "", # Current city of residence. Must be explicitly mentioned.
"placeOfOrigin": ""
# Place of origin. Do not confuse with current location.
}
}
/no_think
Input (Indexed Resume Text):
[0]: Gu Dabai [1]: Phone: 13987898888 [2]: Email: 123245677@123.com [3]: Job Objective: New Media Operations [4]: Education [5]:
2012.09—2016.06 Hebei University of Science and Technology, School of Business, Major: Information Management and Information
Systems, Bachelor’s degree [6]: Core Courses: Principles of Management, Macroeconomics, Data Structures, Database Systems, Intro
to Software Engineering, CAD [7]: Internet Business Intelligence [8]: Work Experience [9]: 2016.07—Present Rainbow Network Co.,
New Media Operations Director [10]: Responsible for managing multiple social media accounts (Weibo, WeChat, Toutiao) [11]: Led
online/offline brand campaigns and media exposure strategies [12]: Organized celebrity livestreams, reaching 2+ billion views; [13]:
Gained 500K new followers annually per account [14]: 2013.09—2015.06 Haichao Consulting Co., Senior Consultant [15]: Assisted in
public relations and brand image development [16]: Participated in project management and business analysis; [17]: Built client profiles
and data processing tools [18]: Skills & Certifications [19]: TEM-8 (Advanced English), experience with Facebook, LinkedIn [20]: National
Level 5 JAVA Certificate [21]: Self-Evaluation [22]: Creative, passionate, and curious thinker [23]: Keen on digital and online marketing
trends [24]: Highly organized and responsible [25]: Excellent in teamwork and independent execution
Output:
{
"basicInfo": {
"name": "Gu Dabai",
"phoneNumber": "13987898888",
"personalEmail": "123245677@123.com",
"age": "",
"born": "",
"gender": "",
"desiredLocation": [],
"currentLocation": "",
"placeOfOrigin": ""
}
}
Figure 7: An Instruction-based SFT Sample for Extracting Basic Personal Information.
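The (instruction, input, output) triplet shown in Figure 7 can be assembled programmatically. The sketch below is illustrative only: the `[i]: text` indexing mirrors the input format in Figure 7, while the function names, the dictionary keys used for storage, and the shortened instruction string are assumptions.

```python
import json


def index_lines(lines):
    """Render text segments in the '[i]: text' indexed form shown in Figure 7."""
    return " ".join(f"[{i}]: {line}" for i, line in enumerate(lines))


def build_sft_sample(instruction, lines, label):
    """Assemble one instruction-format training example as a triplet."""
    return {
        "instruction": instruction,
        "input": index_lines(lines),
        # The target is the JSON label serialized as text; non-ASCII is kept as-is.
        "output": json.dumps(label, ensure_ascii=False, indent=2),
    }


sample = build_sft_sample(
    instruction="Extract the following information into a JSON object.",
    lines=["Gu Dabai", "Phone: 13987898888", "Email: 123245677@123.com"],
    label={"basicInfo": {"name": "Gu Dabai", "phoneNumber": "13987898888"}},
)
```

Keeping the segment index in the input gives each extracted value a position in the source text that it can be traced back to.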
Table 5: Fine-grained Performance Comparison of Different model on SynthResume Dataset.
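| Setting | Model | Period Acc. | Period Prec. | Period Recall | Period F1 | Named Entity Acc. | Named Entity Prec. | Named Entity Recall | Named Entity F1 | Long Text Acc. | Long Text Prec. | Long Text Recall | Long Text F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Non-LLM Baselines | Bello | 0.852 | 0.922 | 0.852 | 0.885 | 0.841 | 0.781 | 0.720 | 0.749 | 0.469 | 0.246 | 0.273 | 0.259 |
| Non-LLM Baselines | PaddleNLP | 0.433 | 0.723 | 0.407 | 0.511 | 0.818 | 0.727 | 0.679 | 0.699 | 0.272 | – | – | – |
| OCR + LLM | Claude-4 | 0.977 | 0.984 | 0.975 | 0.980 | 0.939 | 0.955 | 0.951 | 0.951 | 0.731 | 0.634 | 0.741 | 0.684 |
| Our Pipeline (LLM) | Claude-4 | 0.960 | 0.983 | 0.963 | 0.973 | 0.963 | 0.964 | 0.949 | 0.956 | 0.825 | 0.786 | 0.816 | 0.801 |
| Our Pipeline (LLM) | Gemini2.5-flash | 0.963 | 0.983 | 0.967 | 0.975 | 0.959 | 0.972 | 0.944 | 0.957 | 0.835 | 0.823 | 0.842 | 0.831 |
| Our Pipeline (LLM) | GPT-4o | 0.960 | 0.987 | 0.961 | 0.974 | 0.966 | 0.968 | 0.955 | 0.961 | 0.840 | 0.819 | 0.840 | 0.829 |
| Our Pipeline (LLM) | Deepseek-v3 | 0.961 | 0.978 | 0.965 | 0.971 | 0.968 | 0.976 | 0.951 | 0.962 | 0.829 | 0.834 | 0.794 | 0.813 |
| Our Pipeline (LLM) | Qwen-max | 0.942 | 0.948 | 0.943 | 0.945 | 0.979 | 0.969 | 0.970 | 0.970 | 0.824 | 0.798 | 0.826 | 0.811 |
| Our Pipeline (LLM) | Qwen3-14B | 0.888 | 0.892 | 0.913 | 0.902 | 0.961 | 0.955 | 0.954 | 0.954 | 0.698 | 0.664 | 0.665 | 0.664 |
| Our Pipeline (LLM) | Qwen3-4B | 0.893 | 0.872 | 0.907 | 0.889 | 0.955 | 0.949 | 0.956 | 0.953 | 0.513 | 0.503 | 0.554 | 0.527 |
| Our Pipeline (LLM) | Qwen3-0.6B | 0.619 | 0.682 | 0.660 | 0.668 | 0.637 | 0.749 | 0.734 | 0.691 | 0.256 | 0.162 | 0.226 | 0.184 |
| Our Pipeline (SFT) | Qwen3-0.6B-sft | 0.897 | 0.908 | 0.907 | 0.907 | 0.951 | 0.941 | 0.936 | 0.938 | 0.858 | 0.760 | 0.777 | 0.767 |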
Table 6: Fine-grained Performance Comparison of Different Models on the RealResume Dataset.
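| Setting | Model | Period Acc. | Period Prec. | Period Recall | Period F1 | Named Entity Acc. | Named Entity Prec. | Named Entity Recall | Named Entity F1 | Long Text Acc. | Long Text Prec. | Long Text Recall | Long Text F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Non-LLM Baselines | Bello | 0.921 | 0.968 | 0.813 | 0.879 | 0.885 | 0.801 | 0.740 | 0.769 | 0.540 | 0.553 | 0.459 | 0.500 |
| Non-LLM Baselines | PaddleNLP | 0.387 | 0.587 | 0.381 | 0.451 | 0.722 | 0.653 | 0.597 | 0.622 | – | – | – | – |
| OCR + LLM | Claude-4 | 0.979 | 0.987 | 0.984 | 0.986 | 0.937 | 0.958 | 0.960 | 0.959 | 0.582 | 0.512 | 0.598 | 0.548 |
| Our Pipeline (LLM) | Claude-4 | 0.963 | 0.985 | 0.960 | 0.972 | 0.964 | 0.924 | 0.980 | 0.949 | 0.869 | 0.819 | 0.899 | 0.854 |
| Our Pipeline (LLM) | Gemini2.5-flash | 0.970 | 0.984 | 0.973 | 0.978 | 0.945 | 0.898 | 0.974 | 0.931 | 0.888 | 0.851 | 0.880 | 0.865 |
| Our Pipeline (LLM) | GPT-4o | 0.963 | 0.982 | 0.962 | 0.972 | 0.950 | 0.914 | 0.973 | 0.941 | 0.880 | 0.848 | 0.899 | 0.870 |
| Our Pipeline (LLM) | Deepseek-v3 | 0.960 | 0.987 | 0.957 | 0.971 | 0.939 | 0.903 | 0.940 | 0.918 | 0.867 | 0.842 | 0.865 | 0.852 |
| Our Pipeline (LLM) | Qwen-max | 0.971 | 0.989 | 0.968 | 0.978 | 0.936 | 0.895 | 0.941 | 0.915 | 0.827 | 0.810 | 0.832 | 0.820 |
| Our Pipeline (LLM) | Qwen3-14B | 0.963 | 0.982 | 0.963 | 0.972 | 0.939 | 0.871 | 0.935 | 0.899 | 0.724 | 0.699 | 0.717 | 0.707 |
| Our Pipeline (LLM) | Qwen3-4B | 0.901 | 0.902 | 0.963 | 0.930 | 0.921 | 0.849 | 0.936 | 0.886 | 0.579 | 0.533 | 0.612 | 0.567 |
| Our Pipeline (LLM) | Qwen3-0.6B | 0.680 | 0.750 | 0.724 | 0.734 | 0.647 | 0.683 | 0.697 | 0.671 | 0.120 | 0.126 | 0.182 | 0.136 |
| Our Pipeline (SFT) | Qwen3-0.6B-sft | 0.956 | 0.976 | 0.951 | 0.963 | 0.953 | 0.909 | 0.962 | 0.932 | 0.866 | 0.807 | 0.874 | 0.838 |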