cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree

Yilin Zhang¹* Xinran Zhao¹ Zora Zhiruo Wang¹ Chenyang Yang¹ Jiayi Wei² Tongshuang Wu¹
¹Carnegie Mellon University, ²Augment Code
Abstract
Retrieval-Augmented Generation (RAG) has
become essential for large-scale code gener-
ation, grounding predictions in external code
corpora to improve factuality. However, a criti-
cal yet underexplored aspect of RAG pipelines
is chunking—the process of dividing docu-
ments into retrievable units. Existing line-
based chunking heuristics often break semantic
structures, splitting functions or merging unre-
lated code, which can degrade generation qual-
ity. We propose chunking via Abstract Syntax
Trees (cAST), a structure-aware method that
recursively breaks large AST nodes into smaller
chunks and merges sibling nodes while respect-
ing size limits. This approach generates self-
contained, semantically coherent units across
programming languages and tasks, improving
performance on diverse code generation tasks,
e.g., boosting Recall@5 by 4.3 points on Re-
poEval retrieval and Pass@1 by 2.67 points on
SWE-bench generation. Our work highlights
the importance of structure-aware chunking for
scaling retrieval-enhanced code intelligence.
1 Introduction
Large-scale code generation has emerged as a cor-
nerstone of modern software engineering, powering
tasks that range from automated bug fixing (Meng
et al., 2024) to full-fledged repository-level com-
pletion (Zhang et al., 2023a). Retrieval-augmented
generation (RAG) pushes this frontier further by al-
lowing language models to ground their predictions
in a rich external corpus of data (Guu et al., 2020),
effectively mitigating hallucinations and improving
factual correctness (Izacard et al., 2022).
One crucial preprocessing step in RAG is chunking (Bohnet
et al., 2023)—breaking large documents into man-
ageable segments that can be efficiently indexed,
* Corresponding contact email addresses: {jasonzh3,sherryw}@andrew.cmu.edu. Our code is available at https://github.com/yilinjz/astchunk.
[Figure 1 illustration: a code-completion query asks to call compute_stats and print a summary; with syntax-agnostic chunks that split compute_stats mid-function, the model mispredicts its return value, whereas syntax-aware chunks keep the full function intact and yield a correct completion.]
Figure 1: Syntax-agnostic chunking often omits cru-
cial information needed to generate functional code. In
this example, fixed-size chunking breaks the structure
of the compute_stats method, causing the model to
lose context regarding its return value. As a result, the
model generates incorrect code based on a mistaken
assumption of what is returned. In contrast, when given
syntax-aware chunks, the model accurately identifies
the return values and integrates them correctly within
the existing codebase.
retrieved, and used as contextual input during gen-
eration. To date, most chunking approaches rely on
fixed-size, line-based splitting (Lewis et al., 2020).
While simple and generally effective, this method
struggles with structured content like code, where
the document naturally contains semantic or syntac-
tic blocks. As shown in Figure 1, naive chunking
often splits meaningful units (e.g., functions and
classes) across different chunks, losing structural
integrity and context.
Can we chunk documents more intelligently,
preserving their original structure? In this work,
we explore cAST—Chunking via Abstract Syntax
Trees. ASTs represent code as hierarchical trees
with typed nodes corresponding to program units.
By parsing source code into an AST, we apply a re-
cursive, split-then-merge algorithm to convert tree
structures into chunks that are better aligned with
syntactic boundaries.
Extensive experiments show that cAST im-
proves performance across a range of code gen-
eration tasks. Specifically, it offers three key ad-
vantages: (1) Structure-preserving chunks: AST
traversal yields more self-contained chunks, im-
proving both retrieval and generation. For instance,
StarCoder2-7B sees an average of 5.5 points gain
on RepoEval (Zhang et al., 2023b). (2) Cross-
language consistency: The language-agnostic na-
ture of cAST enables better generalization across
programming languages, achieving up to 4.3 points
gain on CrossCodeEval (Ding et al., 2023). (3)
Metadata retention: AST-based chunks more faith-
fully capture metadata at the file, class, and func-
tion levels, enhancing context matching in hybrid
code+natural language tasks, e.g., up to 2.7 points
gain on SWE-bench (Jimenez et al., 2024), which
focuses on resolving GitHub issues.
2 cAST
We focus on the first stage of the RAG pipeline:
chunking. In this step, source code is parsed into
semantically meaningful units (such as functions
or classes) while preserving the structure of the
code. These units are then grouped into coherent
chunks, which serve as the retrievable context that
can be obtained by a subsequent retriever and used
to prompt a language model.
Design Goals. Our design for cAST pursues four
aligned goals: (1) syntactic integrity—whenever
possible, chunk boundaries should align with com-
plete syntactic units instead of splitting them; (2)
high information density—each chunk is packed
up to, but not beyond, a fixed size budget to maxi-
mize content utility; (3) language invariance—the
algorithm employs no language-specific heuristics
so it works unchanged across diverse programming
languages and code-related tasks; and (4) plug-
and-play compatibility—concatenating the chunks
must reproduce the original file verbatim, enabling
seamless drop-in replacement within existing RAG
pipelines.
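As a concrete illustration of the fourth goal, a chunker that satisfies it passes a check like the following (a minimal sketch; the helper name is ours, not the released API):

def check_plug_and_play(source: str, chunks: list[str]) -> None:
    # Goal (4): concatenating the chunks must reproduce the file verbatim,
    # so swapping the chunker into an existing RAG pipeline loses no code.
    assert "".join(chunks) == source, "chunking must be lossless"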
AST Parsing. To support syntax-aware chunk-
ing, we leverage the Abstract Syntax Tree (AST)
representation of code. An AST is a tree-structured
abstraction that captures the syntactic structure of
source code in a way that is both hierarchical and
semantically rich. Rather than treating code as
plain text, an AST encodes language constructs—like
functions, classes, loops, and conditionals—as dis-
tinct nodes in a structured parse tree. This en-
ables us to identify meaningful code boundaries
with precision, ensuring that chunking respects the
underlying syntax. Since ASTs are widely sup-
ported across languages, this approach also en-
hances the language-invariance and portability of
our method. Our work uses the tree-sitter library (Tree-sitter, 2025) for AST parsing.
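For illustration, a source file can be parsed into an AST with the tree-sitter Python bindings roughly as follows (a sketch; the grammar-loading calls vary slightly across tree-sitter versions, so treat the setup lines as an assumption):

# pip install tree-sitter tree-sitter-python
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

PY_LANGUAGE = Language(tspython.language())  # newer bindings; older ones use parser.set_language
parser = Parser(PY_LANGUAGE)

source = b"import os\n\nclass A:\n    def f(self):\n        return os.getcwd()\n"
tree = parser.parse(source)

# Top-level syntactic units become candidate chunk boundaries.
for child in tree.root_node.children:
    # prints one line per top-level node, e.g. import_statement, class_definition
    print(child.type, child.start_byte, child.end_byte)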
AST-based Recursive Chunking. With the AST at hand, we use a recursive, split-then-merge algorithm to convert tree structures into chunks, as shown in Figure 2. To retain as much syntactic information as possible, we first traverse the tree top-down, fitting each large AST node into a single chunk whenever possible. For nodes that must be split because they exceed the chunk size limit, we then perform a greedy merging step that combines adjacent small sibling nodes into one chunk, avoiding an excess of overly small chunks and maximizing per-chunk information density. The detailed process is also described in Alg. 1.
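Beyond the pseudocode in Alg. 1, the traversal can be sketched in Python over tree-sitter nodes as follows (illustrative only, assuming the parse from the previous snippet; the released implementation may differ in how whitespace between siblings and oversized leaf nodes are handled):

MAX_CHUNK_SIZE = 2000  # budget in non-whitespace characters (see the metric below)

def node_size(node) -> int:
    # Size of a node = number of non-whitespace characters in its source span.
    return sum(not ch.isspace() for ch in node.text.decode("utf8", errors="ignore"))

def chunk_nodes(nodes) -> list[list]:
    """Greedily merge sibling nodes into chunks; recursively split oversized nodes."""
    chunks, current, size = [], [], 0
    for node in nodes:
        s = node_size(node)
        if s > MAX_CHUNK_SIZE:
            # Oversized node: flush the running chunk, then descend into its children.
            # (Oversized leaves without children would need text-level splitting, omitted here.)
            if current:
                chunks.append(current)
                current, size = [], 0
            chunks.extend(chunk_nodes(node.children))
        elif size + s > MAX_CHUNK_SIZE:
            # Budget exceeded: close the current chunk and start a new one with this node.
            chunks.append(current)
            current, size = [node], s
        else:
            # Merge small adjacent siblings to keep per-chunk information density high.
            current.append(node)
            size += s
    if current:
        chunks.append(current)
    return chunks

root = tree.root_node  # from the parsing snippet above
ast_chunks = [[root]] if node_size(root) <= MAX_CHUNK_SIZE else chunk_nodes(root.children)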
Chunk size metric. Choosing an appropriate
budget for each chunk is nontrivial: two seg-
ments of equal line count can carry wildly different
amounts of code, and AST-aligned chunks natu-
rally vary in their physical span (e.g., a single im-
port line versus an entire class body). So unlike
prior work (Wang et al., 2024), we measure chunk
size by the number of non-whitespace characters
rather than by lines. This keeps chunks text-dense
and comparable across diverse files, languages, and
coding styles, ensuring that our budget reflects ac-
tual content rather than incidental formatting.
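The contrast with a line-based budget is easy to see in a toy comparison (a sketch of the metric only, not the exact accounting used in our pipeline):

dense = "x=1;y=2;z=compute(x,y)\n"                               # 1 line, much code
sparse = "x = 1\n\ny = 2\n\nz = compute(\n    x,\n    y,\n)\n"   # 8 lines, same code

def non_ws_size(text: str) -> int:
    # Chunk size = non-whitespace characters, so blank lines and indentation
    # do not consume the budget.
    return sum(1 for ch in text if not ch.isspace())

print(len(dense.splitlines()), len(sparse.splitlines()))  # 1 vs. 8 lines
print(non_ws_size(dense), non_ws_size(sparse))            # 22 vs. 21 characters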
3 Experiments
We evaluate cAST with various top retrieval and generation models across diverse code task settings. We present results of selected end-to-end RACG pipelines (retriever + LM) in Section 3.2 and full tables in the Appendix (Tables 5, 6, 7, and 8).
3.1 Experiment Settings
Datasets. We evaluate cAST on various software
engineering (SE) tasks using three benchmarks:
• RepoEval (Zhang et al., 2023b): Code comple-
tion tasks with long intra-file contexts;
• CrossCodeEval (Ding et al., 2023): Multi-
language queries requiring cross-file reasoning;
• SWE-bench (Jimenez et al., 2024): General SE tasks involving code patch generation. We use the SWE-bench Lite variant (SWE-bench Lite, 2024), a 300-problem subset where each issue is solvable by editing a single file.
[Figure 2 illustration, panel A: fixed-size chunking splits a source file into arbitrary line-based chunks that cut across imports, class A, class B, and top-level statements; panel B: cAST chunking first parses the file into AST nodes (import statements, class and function definitions, expressions) and then merges them into chunks aligned with these syntactic units.]
Figure 2: Comparison of fixed-size chunking vs. cAST. For cAST, we first parse the document into a tree of AST nodes. Then, starting from the first level, we greedily merge AST nodes into chunks. If adding a node would exceed the chunk size limit, we recursively break it into smaller nodes. The output of cAST is a list of chunks where each chunk contains a list of AST nodes.
Metrics. For retrieval performance, we report
three common metrics: nDCG, Precision and Re-
call, with k = 5. Notably, since retrieval scores
from different corpus distributions are not directly
comparable, we implement a score mapping tech-
nique to align AST-based retrieval scores with
those of the baseline, with details in Appendix A.2.
As for generation, we use Pass@k (Chen et al.,
2021) for execution-based datasets and match-
based metrics for the others, following prior work
(Wang et al., 2024; Ding et al., 2023). Specifically,
we report the canonical Pass@1 score for RepoE-
val and SWE-bench. Additionally, we record the
Pass@8 score for SWE-bench by sampling mul-
tiple responses with high temperature following
Agentless (Xia et al., 2024a) to examine the robustness of cAST. For CrossCodeEval, we report exact match (EM), edit similarity (ES), and the identifier-match metrics used in the original work.
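For reference, Pass@k is estimated with the standard unbiased estimator of Chen et al. (2021); a small sketch of the per-problem computation (the sampling settings we use are listed in Appendix A.1):

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # n samples drawn for a problem, c of which pass the tests.
    # pass@k = 1 - C(n - c, k) / C(n, k), computed as a numerically stable product.
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: with 8 samples and 2 passing, pass@1 = 0.25 and pass@8 = 1.0.
print(pass_at_k(8, 2, 1), pass_at_k(8, 2, 8))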
Retrieval and Generation Models. We adopt various kinds of retrievers, including two general-text dense retrievers, BGE-base (Xiao et al., 2023) and GIST-base (Solatorio, 2024), and a code-specific retriever, Codesage-small-v2 (Zhang et al., 2024), following CodeRAG-Bench (Wang et al., 2024). Similarly, for generation, we include two code-specific LMs, StarCoder2-7B (Lozhkov et al., 2024) and CodeLlama-7B-Python (Roziere et al., 2023), and two general-purpose LMs (claude-3.7-sonnet, gemini-2.5-pro-0325), as both families represent the state of the art in coding.
Further details of our experimental setup are in-
troduced in Appendix A.1.
3.2 Results and Analysis
Table 1 presents the end-to-end RACG results with selected retrievers (BGE-base, GIST-base, Codesage-small-v2) on the three datasets. The results highlight several key observations:
Retrieval. cAST's structure-aware chunking
steadily improves retrieval performance across
datasets and retrievers. Specifically, all models
show gains of 1.2–3.3 points in Precision and
1.8–4.3 in Recall on code-to-code retrieval (Repo-
Eval), and 0.5–1.4 in Precision and 0.7–1.1 in Re-
call on the more challenging NL-to-code retrieval
(SWE-Bench). These improvements suggest that
aligning chunks with abstract syntax boundaries
helps diverse retrievers surface semantically co-
herent code fragments, supplying richer and more
accurate evidence for downstream tasks.
                              cAST chunking               Fixed-size chunking
Metric (Model)                BGE    GIST   CodeSage      BGE    GIST   CodeSage

RepoEval
R  nDCG                       71.1   75.9   85.1          71.3   74.2   83.0
R  Precision                  34.9   38.1   44.1          32.8   34.8   42.9
R  Recall                     69.8   75.0   83.9          67.4   70.7   82.1
G  Pass@1 (StarCoder2)        51.7   57.9   73.2          47.5   51.2   67.6
G  Pass@1 (CodeLlama)         49.6   56.6   72.1          45.6   51.5   66.5

SWE-Bench
R  nDCG                       44.0   44.4   43.1          42.4   43.1   42.6
R  Precision                  39.7   39.1   38.8          38.3   38.6   37.5
R  Recall                     18.4   18.5   18.3          17.3   17.8   17.5
G  Pass@1 (Claude)            16.3   15.0   16.7          13.7   14.7   14.0
G  Pass@8 (Gemini)            35.3   33.7   32.7          32.3   33.0   31.0

CrossCodeEval
R  Identifier Match (EM)      34.7   34.0   39.9          32.0   33.5   36.3
G  EM (StarCoder2)            23.8   23.4   29.1          21.2   23.0   24.8
G  ES (StarCoder2)            72.2   71.9   74.3          71.0   71.7   73.1

Table 1: Retrieval and Generation Performances across three benchmarks, using different retrieval models (BGE, GIST, CodeSage) and different LMs (full model names in §3.1).
Generation. cAST benefits both intra-file and
cross-file code completion. Notably, gains are
most pronounced when the RACG pipeline em-
ploys code-specific retrievers, implying that the
structurally aligned chunks deliver fuller context
to both the specialized retriever and the generation
model, which in turn facilitates more accurate con-
text retrieval and coherent code synthesis. On NL-
to-code generation, we observe remarkable gains
with BGE-base and CodeSage retrievers under one
and multiple rounds of sampling.
Correlation between retrieval and generation
performance. Among the three retrieval metrics
we use, we notice that higher precision tends to
convert into better generation performance, align-
ing with conclusions from prior work (Zhao et al.,
2024). This suggests that ensuring the top-k con-
text is highly relevant reduces noise and enables the
language model to concentrate on concise, accurate
evidence, thereby boosting answer fidelity (Fang
et al., 2024; Salemi and Zamani, 2024).
By contrast, recall-oriented metrics and nDCG
correlate only weakly with downstream qual-
ity—once the necessary evidence appears in the
retrieved set, adding lower-ranked chunks yields
diminishing returns or can even hurt performance
by introducing distractors.
4 Ablations
Necessity of merging. The motivation for introducing merging in our algorithm is to maximize the information density of each chunk.
                              Split-then-merge (cAST)     Split-only
Metric (Model)                BGE    GIST   CodeSage      BGE    GIST   CodeSage

RepoEval
R  nDCG                       71.1   75.9   85.1          53.5   59.1   66.1
G  Pass@1 (StarCoder2)        51.7   57.9   73.2          48.3   45.0   65.4
G  Pass@1 (CodeLlama)         49.6   56.6   72.1          47.2   48.5   58.4

Table 2: Ablation study comparing performance metrics for Split-then-merge (cAST) and Split-only methodologies across different models.
                              Context length (tokens)
Pipeline (R + G)              3500   4000   8000
BGE + StarCoder2              46.9   51.7   51.7
GIST + StarCoder2             57.1   57.9   58.2
CodeSage + StarCoder2         70.5   73.2   69.2

Table 3: Ablation study evaluating the impact of different context lengths on the overall performance of several retrieval and generation pipelines.
Under a split-only approach, small AST nodes, such as im-
port statements and variable assignments, gener-
ate an excessive number of chunks, which unnec-
essarily enlarges the index and degrades retrieval
performance. These fine-grained chunks also con-
tain limited context, making them less effective for
downstream tasks, as shown in Table 2. Across all
retrievers, we find that both retrieval and generation
performance decline under the split-only strategy.
Selection of context length. In our experiments,
we set max_context_length = 4000, which roughly
corresponds to the top five chunks. A comparison
of different context lengths is shown in Table 3. We
observe that doubling the context length does not
necessarily improve generation, whereas a modest
reduction in context length can lead to performance
degradation, likely due to chunk truncation.
Selection of maximum chunk size. We set
max_chunk_size = 2000 in our experiments, as the
resulting chunks exhibit similar statistics (e.g., line
counts and token counts) to the fixed-size chunking
baseline. A sensitivity analysis of max_chunk_size
is presented in Table 4. We observe that re-
trieval and generation performance peak when
max_chunk_size is between 2000 and 2500 char-
acters. Additionally, generation performance also
depends on max_context_length, as shown in the
previous analysis. When context length allows,
larger chunks can provide more information, while
smaller chunks help mitigate the risk of truncation.
                              Maximum chunk size
Metric (Model)                1000   1500   2000   2500   3000
R  nDCG                       69.0   68.4   71.1   72.3   69.4
G  Pass@1 (StarCoder2)        43.4   45.8   51.7   50.1   51.2

Table 4: Ablation study of maximum chunk size effects on retrieval and generation performance.
5 Related Work
Structure-aware modeling in code tasks. Early
work showed clear benefits from feeding explicit
syntax to models: TranX (grammar-guided decod-
ing) and path-based encoders code2vec/code2seq
leveraged AST productions or paths to outperform
token-only baselines in NL-to-code and summariza-
tion (Yin and Neubig, 2018; Alon et al., 2019b,a).
Transformer-era studies refined this idea. Graph-
CodeBERT (Guo et al., 2021) and the Code Trans-
former (Zügner et al., 2021) inject data-flow edges
or AST distances, while CODEDISEN (Zhang
et al., 2021) disentangles syntax from semantics
for cross-language transfer. More recent models
layer structure-aware objectives onto large LMs:
TypeT5 (Wei et al., 2023) adds static-analysis con-
text for type inference, and AST-T5 (Gong et al.,
2024) and StructCoder (Tipirneni et al., 2024) mask
or generate subtrees to boost transpilation and Java-
Python translation.
Although modern LLMs can often internal-
ize such structure from raw tokens, these results
indicate that explicit syntax still provides mea-
surable gains—especially in preprocessing steps
like chunking, where respecting function or class
boundaries directly controls what the model sees.
In light of the importance of structure awareness in
the above literature, we propose to leverage the tree
structure of code snippets to improve chunking.
Retrieval-augmented code generation. Suc-
cessful code RAG hinges on pairing high-quality
retrievers with generation frameworks that can ef-
fectively leverage the fetched context. General-
purpose systems—RAG (Lewis et al., 2020),
FiD (Izacard and Grave, 2021), and RePlug (Shi
et al., 2023)—demonstrate that feeding high-recall
evidence to a language model markedly improves
factuality. In the software-engineering domain,
CodeRAG-Bench (Wang et al., 2024) confirms
these gains on repository-level tasks while reveal-
ing that lexical-matching retrievers often miss rele-
vant code, motivating code-specific retrieval mod-
els. State-of-the-art code retrievers such as Code-
BERT (Feng et al., 2020), UniXcoder (Guo et al.,
2022), and CodeRetriever (Li et al., 2022) learn
joint code–text or code–code embeddings and con-
sistently surpass generic dense models in code
search and question answering. Most pipelines
still inherit fixed line-based chunking from natural-
language RAG. Our work shows that respecting
syntactic units with AST-aware chunks further en-
hances these retrieval-generation loops.
Most relevantly, CodeGRAG (Du et al., 2025) utilizes a graphical view of code flow to improve the overall LLM code generation pipeline. Shen et al. (2024); Xia et al. (2024b); Song et al. (2024) propose computing code similarity based on the graph structure of code. In our work, we conduct a fine-grained study of one important building block of the code RAG workflow: chunking.
6 Conclusion and Discussion
In this work, we present cAST, a simple and effective chunking strategy for retrieval-augmented code generation. The structural awareness brought by the AST allows us to maintain syntactic integrity and high information density during chunking. Extensive experiments on various retrievers, LLM generators, and code generation tasks validate the gains from cAST over the commonly used fixed-size chunking strategy on both retrieval and RAG tasks.
Because cAST leaves the rest of the RAG pipeline unchanged, code-agent practitioners can use it as a simple plug-and-play tool that provides informative, well-formed chunks for later agent stages. For code RAG benchmark developers, cAST chunks can serve as an additional resource and an effective alternative or complement to existing retrieval units.
Limitations
Contextual Awareness. In our experiments, for a fair comparison, we maintain the original retrieval-augmented code generation pipeline and parse code snippets into self-contained chunks, without explicit contextual awareness from higher-level chunking units in the AST. However, as shown in textual RAG (Sarthi et al., 2024; Cai et al., 2024), including multi-level information from tree structures can improve retrieval performance, which could also benefit code retrieval given the natural structures that our AST framework can extract.
Multi-view of the code. In this work, we mainly explore chunking over pure code files. However, each code snippet can have multiple views, e.g., input-output examples elicited in comments, natural language descriptions, and pseudocode. Each view emphasizes different facets of the same code snippet. Previous work shows that including multiple views helps models with math reasoning (Liang et al., 2023). Similarly, instead of purely AST-based chunking over code snippets, including chunk candidates from different views can potentially reduce cAST's reliance on code completeness.
Inner Execution Dynamics. In this work, we focus on introducing structural awareness to retrieval-augmented generation via the AST, a static analysis of code semantics. However, execution traces (Ni et al., 2024), type inference (Wei et al., 2023), and compilation (Cummins et al., 2024) can potentially lead to a deeper understanding of variable dynamics. Introducing awareness of such in-depth query analysis could augment cAST with per-query adaptiveness.
Acknowledgments
The authors thank Jamie Callan, Fernando Diaz,
Graham Neubig, Daniel Fried, and Pengcheng Yin
for their insights into design and evaluation choices.
The authors also thank colleagues from the CMU WInE Lab and Augment Code for constructive discussions. Xinran Zhao is supported by the ONR
Award N000142312840. This work is supported by
the OpenAI Research Credit program, the Amazon
AI Research Gift Fund, and the Gemma Academic
Program GCP Credit Award.
References
Uri Alon, Shaked Brody, Omer Levy, and Eran Ya-
hav. 2019a. code2seq: Generating sequences from
structured representations of code. In International
Conference on Learning Representations (ICLR).
Uri Alon, Meital Zilberstein, Omer Levy, and Eran
Yahav. 2019b. code2vec: Learning distributed
representations of code. In Proceedings of the
ACM/IEEE Symposium on Principles of Program-
ming Languages (POPL).
SWE-bench Lite. 2024. SWE-bench Lite. https://www.swebench.com/lite.html.
Bernd Bohnet, Vinh Q. Tran, Pat Verga, Roee Aha-
roni, Daniel Andor, Livio Baldini Soares, Massimil-
iano Ciaramita, Jacob Eisenstein, Kuzman Ganchev,
Jonathan Herzig, Kai Hui, Tom Kwiatkowski, Ji Ma,
Jianmo Ni, Lierni Sestorain Saralegui, Tal Schus-
ter, William W. Cohen, Michael Collins, Dipanjan
Das, and 3 others. 2023. Attributed question answer-
ing: Evaluation and modeling for attributed large
language models. Preprint, arXiv:2212.08037.
Fengyu Cai, Xinran Zhao, Tong Chen, Sihao Chen,
Hongming Zhang, Iryna Gurevych, and Heinz
Koeppl. 2024. MixGR: Enhancing retriever general-
ization for scientific domain through complementary
granularity. Preprint, arXiv:2407.10691.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan,
Henrique Ponde de Oliveira Pinto, Jared Kaplan,
Harri Edwards, Yuri Burda, Nicholas Joseph, Greg
Brockman, Alex Ray, Raul Puri, Gretchen Krueger,
Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela
Mishkin, Brooke Chan, Scott Gray, and 39 others.
2021. Evaluating large language models trained on
code. Preprint, arXiv:2107.03374.
Tong Chen, Hongwei Wang, Sihao Chen, Wenhao
Yu, Kaixin Ma, Xinran Zhao, Hongming Zhang,
and Dong Yu. 2023. Dense x retrieval: What re-
trieval granularity should we use? arXiv preprint
arXiv:2312.06648.
Chris Cummins, Volker Seeker, Dejan Grubisic, Bap-
tiste Roziere, Jonas Gehring, Gabriel Synnaeve, and
Hugh Leather. 2024. Meta large language model
compiler: Foundation models of compiler optimiza-
tion. Preprint, arXiv:2407.02524.
Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Han-
tian Ding, Ming Tan, Nihal Jain, Murali Krishna
Ramanathan, Ramesh Nallapati, Parminder Bhatia,
Dan Roth, and Bing Xiang. 2023. Crosscodeeval:
A diverse and multilingual benchmark for cross-file
code completion. In Thirty-seventh Conference on
Neural Information Processing Systems Datasets and
Benchmarks Track.
Kounianhua Du, Jizheng Chen, Renting Rui, Huacan
Chai, Lingyue Fu, Wei Xia, Yasheng Wang, Ruiming
Tang, Yong Yu, and Weinan Zhang. 2025. Code-
grag: Bridging the gap between natural language
and programming language via graphical retrieval
augmented generation. Preprint, arXiv:2405.02355.
Feiteng Fang, Yuelin Bai, Shiwen Ni, Min Yang, Xi-
aojun Chen, and Ruifeng Xu. 2024. Enhancing
noise robustness of retrieval-augmented language
models with adaptive adversarial training. Preprint,
arXiv:2405.20978.
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xi-
aocheng Feng, Ming Gong, Linjun Shou, Bing Qin,
Ting Liu, Daxin Jiang, and Ming Zhou. 2020. Code-
BERT: A pre-trained model for programming and
natural languages. In Findings of the Association
for Computational Linguistics: EMNLP, pages 1536–
1547.
Linyuan Gong, Mostafa Elhoushi, and Alvin Cheung.
2024. AST-T5: Structure-aware pretraining for
code generation and understanding. arXiv preprint
arXiv:2401.03003.
Michael Günther, Jackmin Ong, Isabelle Mohr, Alaed-
dine Abdessalem, Tanguy Abel, Mohammad Kalim
Akram, Susana Guzman, Georgios Mastrapas, Saba
Sturua, Bo Wang, and 1 others. 2023. Jina em-
beddings 2: 8192-token general-purpose text em-
beddings for long documents. arXiv preprint
arXiv:2310.19923.
Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming
Zhou, and Jian Yin. 2022. UniXcoder: Unified cross-
modal pre-training for code representation. In Pro-
ceedings of the 60th Annual Meeting of the Associ-
ation for Computational Linguistics (ACL), pages
7212–7225.
Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu
Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svy-
atkovskiy, Shengyu Fu, Michele Tufano, Shao Kun
Deng, Colin Clement, Dawn Drain, Neel Sundaresan,
Jian Yin, Daxin Jiang, and Ming Zhou. 2021. Graph-
CodeBERT: Pre-training code representations with
data flow. In International Conference on Learning
Representations (ICLR).
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasu-
pat, and Mingwei Chang. 2020. Retrieval augmented
language model pre-training. In International confer-
ence on machine learning, pages 3929–3938. PMLR.
Charles R. Harris, K. Jarrod Millman, Stéfan van der
Walt, Ralf Gommers, Pauli Virtanen, David Cour-
napeau, Eric Wieser, Julian Taylor, Sebastian Berg,
Nathaniel J. Smith, Robert Kern, Matti Picus,
Stephan Hoyer, Marten H. van Kerkwijk, Matthew
Brett, Allan Haldane, Jaime Fernández del Río, Mark
Wiebe, Pearu Peterson, and 7 others. 2020. Array
programming with numpy. Nature, 585:357–362.
John D Hunter. 2007. Matplotlib: A 2d graphics en-
vironment. Computing in science & engineering,
9(03):90–95.
Gautier Izacard and Edouard Grave. 2021. Leveraging
passage retrieval with generative models for open
domain question answering. In International Confer-
ence on Learning Representations (ICLR).
Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas
Hosseini, Fabio Petroni, Timo Schick, Jane A. Yu,
Armand Joulin, Sebastian Riedel, and Edouard Grave.
2022. Few-shot learning with retrieval augmented
language models. ArXiv, abs/2208.03299.
Carlos E Jimenez, John Yang, Alexander Wettig,
Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R
Narasimhan. 2024. SWE-bench: Can language mod-
els resolve real-world github issues? In The Twelfth
International Conference on Learning Representa-
tions.
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying
Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E.
Gonzalez, Hao Zhang, and Ion Stoica. 2023. Effi-
cient memory management for large language model
serving with pagedattention. In Proceedings of the
ACM SIGOPS 29th Symposium on Operating Systems
Principles.
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio
Petroni, Vladimir Karpukhin, Naman Goyal, Hein-
rich Küttler, Mike Lewis, Wen-tau Yih, Tim Rock-
täschel, Sebastian Riedel, and Douwe Kiela. 2020.
Retrieval-augmented generation for knowledge-
intensive NLP tasks. In Advances in Neural Infor-
mation Processing Systems (NeurIPS), pages 9459–
9474.
Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas
Muennighoff, Denis Kocetkov, Chenghao Mou, Marc
Marone, Christopher Akiki, Jia Li, Jenny Chim,
Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo,
Thomas Wang, Olivier Dehaene, Mishig Davaadorj,
Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko,
and 48 others. 2023. Starcoder: may the source be
with you! Preprint, arXiv:2305.06161.
Xiaonan Li, Yeyun Gong, Yelong Shen, Xipeng Qiu,
Hang Zhang, Bolun Yao, Weizhen Qi, Daxin Jiang,
Weizhu Chen, and Nan Duan. 2022. CodeRetriever:
A large scale contrastive pre-training method for code
search. In Proceedings of the 2022 Conference on
Empirical Methods in Natural Language Process-
ing (EMNLP), pages 2898–2910, Abu Dhabi, United
Arab Emirates. Association for Computational Lin-
guistics.
Zhenwen Liang, Dian Yu, Xiaoman Pan, Wenlin Yao,
Qingkai Zeng, Xiangliang Zhang, and Dong Yu.
2023. Mint: Boosting generalization in mathemati-
cal reasoning via multi-view fine-tuning. Preprint,
arXiv:2307.07951.
Anton Lozhkov, Raymond Li, Loubna Ben Allal, Fed-
erico Cassano, Joel Lamy-Poirier, Nouamane Tazi,
Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei,
and 1 others. 2024. Starcoder 2 and the stack v2: The
next generation. arXiv preprint arXiv:2402.19173.
Xiangxin Meng, Zexiong Ma, Pengfei Gao, and
Chao Peng. 2024. An empirical study on llm-
based agents for automated bug fixing. Preprint,
arXiv:2411.10213.
Ansong Ni, Miltiadis Allamanis, Arman Cohan, Yinlin
Deng, Kensen Shi, Charles Sutton, and Pengcheng
Yin. 2024. Next: Teaching large language mod-
els to reason about code execution. Preprint,
arXiv:2404.14662.
Adam Paszke, Sam Gross, Francisco Massa, Adam
Lerer, James Bradbury, Gregory Chanan, Trevor
Killeen, Zeming Lin, Natalia Gimelshein, Luca
Antiga, Alban Desmaison, Andreas Köpf, Edward
Yang, Zachary DeVito, Martin Raison, Alykhan Te-
jani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang,
and 2 others. 2019. Pytorch: An imperative style,
high-performance deep learning library. In Advances
in Neural Information Processing Systems 32: An-
nual Conference on Neural Information Processing
Systems 2019, NeurIPS 2019, December 8-14, 2019,
Vancouver, BC, Canada, pages 8024–8035.
Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten
Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi,
Jingyu Liu, Tal Remez, Jérémy Rapin, and 1 oth-
ers. 2023. Code llama: Open foundation models for
code. arXiv preprint arXiv:2308.12950.
Alireza Salemi and Hamed Zamani. 2024. Evaluating
retrieval quality in retrieval-augmented generation.
Preprint, arXiv:2404.13781.
Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh
Khanna, Anna Goldie, and Christopher D. Manning.
2024. Raptor: Recursive abstractive processing for
tree-organized retrieval. In International Conference
on Learning Representations (ICLR).
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien
Chaumond, Clement Delangue, Anthony Moi, Pier-
ric Cistac, Tim Rault, Rémi Louf, Morgan Funtow-
icz, and 1 others. 2019. Huggingface’s transformers:
State-of-the-art natural language processing. ArXiv
preprint, abs/1910.03771.
Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and
Lingming Zhang. 2024a. Agentless: Demystifying
llm-based software engineering agents. Preprint,
arXiv:2407.01489.
Yu Xia, Tian Liang, Weihuan Min, and Li Kuang. 2024b.
Improving ast-level code completion with graph re-
trieval and multi-field attention. In Proceedings of
the 32nd IEEE/ACM International Conference on
Program Comprehension, pages 125–136.
Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas
Muennighoff. 2023. C-pack: Packaged resources
to advance general chinese embedding. arXiv.
Zhili Shen, Pavlos Vougiouklis, Chenxin Diao, Kaustubh Vyas, Yuanyi Ji, and Jeff Z Pan. 2024. Improving retrieval-augmented text-to-sql with ast-based ranking and schema pruning. arXiv preprint arXiv:2407.03227.

Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023. REPLUG: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652.

Pengcheng Yin and Graham Neubig. 2018. TRANX: A transition-based neural abstract syntax parser for semantic parsing and code generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (System Demonstrations), pages 7–12, Brussels, Belgium. Association for Computational Linguistics.

Dejiao Zhang, Wasi Ahmad, Ming Tan, Hantian Ding, Ramesh Nallapati, Dan Roth, Xiaofei Ma, and Bing Xiang. 2024. Code representation learning at scale. arXiv preprint arXiv:2402.01935.
Aivin V. Solatorio. 2024. Gistembed: Guided in-sample
selection of training negatives for text embedding
fine-tuning.
Yewei Song, Cedric Lothritz, Xunzhu Tang, Tegawendé
Bissyandé, and Jacques Klein. 2024. Revisiting code
similarity evaluation with abstract syntax tree edit
distance. In Proceedings of the 62nd Annual Meet-
ing of the Association for Computational Linguistics
(Volume 2: Short Papers), pages 38–46, Bangkok,
Thailand. Association for Computational Linguistics.
Sindhu Tipirneni, Ming Zhu, and Chandan K. Reddy.
2024. Structcoder: Structure-aware transformer for
code generation. ACM Transactions on Knowledge
Discovery from Data, 18(3):70:1–70:20.
Tree-sitter. 2025. Tree-sitter documentation. https:
//tree-sitter.github.io/tree-sitter/ . Accessed:
May 11, 2025.
Zora Zhiruo Wang, Akari Asai, Xinyan Yu, Frank F.
Xu, Yiqing Xie, Graham Neubig, and Daniel Fried.
2024. CodeRAG-Bench: Can retrieval augment code
generation? arXiv preprint arXiv:2406.14497.
Jiayi Wei, Greg Durrett, and Isil Dillig. 2023. TypeT5:
Seq2seq type inference using static analysis. In In-
ternational Conference on Learning Representations
(ICLR).
Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin
Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and
Weizhu Chen. 2023a. RepoCoder: Repository-level
code completion through iterative retrieval and gen-
eration. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2471–2484, Singapore. Association for Computational Linguistics.
Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin
Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and
Weizhu Chen. 2023b. RepoCoder: Repository-level
code completion through iterative retrieval and gen-
eration. In Proceedings of the 2023 Conference on
Empirical Methods in Natural Language Processing,
pages 2471–2484, Singapore. Association for Com-
putational Linguistics.
Jingfeng Zhang, Haiwen Hong, Yin Zhang, Yao Wan,
Ye Liu, and Yulei Sui. 2021. Disentangled code rep-
resentation learning for multiple programming lan-
guages. In Findings of the Association for Computa-
tional Linguistics: ACL–IJCNLP, pages 4454–4466.
Xinran Zhao, Tong Chen, Sihao Chen, Hongming
Zhang, and Tongshuang Wu. 2024. Beyond rele-
vance: Evaluate and improve retrievers on perspec-
tive awareness. Preprint, arXiv:2405.02714.
Daniel Zügner, Tobias Kirschstein, Michele Catasta,
Jure Leskovec, and Stephan Günnemann. 2021.
Language-agnostic representation learning of source
code from structure and context. In International
Conference on Learning Representations.
A Appendix

A.1 Implementation Details
For Gemini and Claude models, we use the official API service. For other open-source models, we serve the models locally on nodes with 8 Nvidia A100 (40G) GPUs and 8 Nvidia A6000 (40G) GPUs with CUDA 12 installed. Our inference infrastructure is built upon vLLM (Kwon et al., 2023).
For fair comparison of chunks with varying sizes,
instead of using top-k chunks directly, we use
max_context_length to sequentially include re-
trieved chunks up to a threshold, truncating the
final chunk if needed. We set the limit to 4000
for RepoEval and SWE-Bench, and extend it to
10000 for CrossCodeEval to test cross-file retrieval.¹
For generation, we adopt different settings based on the evaluation metric, following prior work (Wang et al., 2024; Li et al., 2023; Xia et al., 2024a): we use t = 0.2, top-p = 0.95, and 1 sample for Pass@1; and t = 0.8 and 8 samples for Pass@8.
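A sketch of the context-packing step described above (hypothetical helper; `tokenizer` stands for the per-model tokenizer, and real code would also budget for separators and prompt overhead):

def pack_context(retrieved_chunks: list[str], tokenizer, max_context_length: int = 4000) -> str:
    # Include retrieved chunks in score order until the token budget is reached,
    # truncating the final chunk if needed, instead of taking a fixed top-k.
    pieces, used = [], 0
    for chunk in retrieved_chunks:
        ids = tokenizer.encode(chunk)
        if used + len(ids) <= max_context_length:
            pieces.append(chunk)
            used += len(ids)
        else:
            remaining = max_context_length - used
            if remaining > 0:
                pieces.append(tokenizer.decode(ids[:remaining]))
            break
    return "\n".join(pieces)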
A.2 Metric Score Mapping Details
In Section 3.1, we note the distributional incomparability across corpora. We implement a score mapping technique to align AST-based retrieval scores with those of the baseline.
Specifically, similar to (Chen et al., 2023), we
assign each line of code a score inherited from
its corresponding AST chunk. These line-level
scores are then aggregated to recompute the scores
of baseline chunks, allowing us to rerank them and
estimate AST-based retrieval performance within
the baseline framework.
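A minimal sketch of this mapping (illustrative data structures; the actual implementation may use a different per-chunk aggregation than the mean shown here):

def remap_to_baseline(ast_chunks, baseline_chunks):
    # Both inputs: lists of dicts with "start_line", "end_line", and (for AST
    # chunks) a retrieval "score".
    # Step 1: each line inherits the score of the AST chunk that contains it.
    line_scores = {}
    for ch in ast_chunks:
        for line in range(ch["start_line"], ch["end_line"] + 1):
            line_scores[line] = ch["score"]
    # Step 2: re-score every baseline (fixed-size) chunk by aggregating its line
    # scores, so AST-based retrieval can be reranked in the baseline's units.
    rescored = []
    for ch in baseline_chunks:
        lines = range(ch["start_line"], ch["end_line"] + 1)
        score = sum(line_scores.get(line, 0.0) for line in lines) / len(lines)
        rescored.append((ch, score))
    return sorted(rescored, key=lambda item: item[1], reverse=True)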
A.3 AST-based Chunking Algorithm Details
In the main paper, we provide textual descriptions
of our algorithm. Here, we present the pseudocode of our implementation in Alg. 1.
Algorithm 1 AST-based Chunking Algorithm

MAX_SIZE ← maximum chunk size

function ChunkCode(code)
    tree ← ParseAST(code)
    if GetSize(code) ≤ MAX_SIZE then
        return [tree]
    else
        return ChunkNodes(tree.children)
    end if
end function

function ChunkNodes(nodes)
    chunks ← [ ], chunk ← [ ], size ← 0
    for node in nodes do
        s ← GetSize(node)
        if s > MAX_SIZE then                      ▷ oversized node: flush, then split recursively
            if chunk ≠ [ ] then
                chunks.append(chunk); chunk, size ← [ ], 0
            end if
            chunks.extend(ChunkNodes(node.children))
        else if size + s > MAX_SIZE then          ▷ budget exceeded: start a new chunk with this node
            chunks.append(chunk)
            chunk, size ← [node], s
        else                                      ▷ greedily merge small siblings
            chunk.append(node); size ← size + s
        end if
    end for
    if chunk ≠ [ ] then
        chunks.append(chunk)
    end if
    return chunks
end function
A.4 Extended Experiment Results
In the main paper, we show concise results from
our experiment to demonstrate a clear contribu-
tion. We further include detailed results from our
settings here. In Table 5, we present the retrieval
performance with various metrics and retrievers on
RepoEval and SWE-bench. In Table 7, we present
the RAG performance on SWE-Bench with various
retrievers and LLM generators. In Table 6, we present the RAG performance on RepoEval with various retrievers and generators. In Table 8, we show the RAG performance with various retrievers on CrossCodeEval across different programming languages.
These tables show conclusions similar to our findings in the main paper: cAST consistently outperforms fixed-size line-based chunking by preserving syntactic integrity and high information density.
¹ We use default tokenizers for open-weight models, and cl100k_base for API models.
A.5 Performance differences across different programming languages
A key limitation of fixed-size, line-based chunk-
ing is its poor generalizability across program-
ming languages. Language-specific syntax means
a line limit tuned for one language over- or under-
segments another, leading to uneven information
density and degraded retrieval and generation qual-
ity. In contrast, C AST uses structure-aware seg-
mentation based on abstract-syntax units common
across languages, mitigating these issues.
Table 8 reports results with the Codesage-small-
v2 + Starcoder2-7B pipeline. Though both meth-
                      cAST                                                      Fixed-size
Method                nDCG@5  nDCG@10  P@5   P@10  Recall@5  Recall@10          nDCG@5  nDCG@10  P@5   P@10  Recall@5  Recall@10

RepoEval
BGE-base              71.1    74.7     34.9  20.4  69.8      77.6               71.3    74.6     32.8  19.1  67.4      74.1
BGE-large             72.2    75.4     34.9  20.2  69.6      76.3               71.1    73.9     31.3  18.1  64.9      70.6
GIST-base             75.9    78.5     38.1  21.2  75.0      80.5               74.2    78.0     34.8  20.6  70.7      78.5
GIST-large            78.9    81.9     38.8  22.0  76.6      82.8               75.1    79.5     34.8  21.1  71.1      80.2
Codesage-small-v2     85.1    88.8     44.1  25.3  83.9      91.0               83.0    86.4     42.9  24.5  82.1      89.1
Jina-v2-code          87.1    90.5     47.9  27.1  87.9      94.7               86.8    90.9     46.3  26.7  84.9      92.9

SWE-bench
BGE-base              44.0    41.5     39.7  32.5  18.4      26.8               42.4    39.5     38.3  31.2  17.3      24.4
BGE-large             42.2    40.4     37.7  31.6  17.5      26.1               42.8    39.9     38.3  31.2  17.0      24.6
GIST-base             44.4    42.5     39.1  32.9  18.5      27.6               43.1    40.6     38.6  31.8  17.8      25.9
GIST-large            44.0    41.9     39.5  33.1  18.5      27.0               43.5    41.7     39.2  33.2  18.0      26.5
Codesage-small-v2     43.1    41.4     38.8  32.8  18.3      26.4               42.6    40.0     37.5  31.0  17.5      24.7
Table 5: Retrieval performance (nDCG, Precision, Recall@{5,10}) on RepoEval and SWE-bench.
                      cAST                        Fixed-size
Method                StarCoder2  CodeLlama       StarCoder2  CodeLlama
BGE-base              51.7        49.6            47.5        45.6
BGE-large             48.8        50.9            45.8        49.9
GIST-base             57.9        56.6            51.2        51.5
GIST-large            61.7        60.3            59.2        55.5
Codesage-small-v2     73.2        72.1            67.6        66.5
Jina-v2-code          80.7        75.9            75.1        75.1
Table 6: RAG performance (Pass@1) on RepoEval with various retrievers.
ods use fixed chunk lengths, performance variation
across languages is notably higher for the baseline.
Averaged over four languages, cAST improves EM
by 2.9 on code and 3.0 on identifier, with the largest
gains on TypeScript—the noisiest language. These
consistent gains highlight the value of respecting
syntax when handling multilingual code.
The performance differences across different lan-
guages with different chunking strategies, as well
as RAG design choices, can form an interesting
future line of work.
A.6 Ethical Statements
We foresee no ethical concerns or potential risks in
our work. All of the retrieval models, code genera-
tors, and datasets are open-source or accessible through public APIs, as shown in Section 3.
plied in the experiments are also publicly available.
Given our context, the outputs of LLMs (code snip-
pets) are unlikely to contain harmful and dangerous
information. All the code is executed in sandboxes,
with no threat to the public internet. The natu-
ral language part of our experiments is mainly in
English. Multiple programming languages are in-
cluded: Python, Java, C#, and TypeScript.
Our code is open source and available at https:
//github.com/yilinjz/astchunk .
A.7 Licenses of scientific artifacts
We summarize the licenses of the scientific artifacts we used in Table 9. All of our usage for scientific discovery follows the original purpose of the artifacts.
                      cAST                                     Fixed-size
Method                Claude-3.7-Sonnet  Gemini-2.5-pro        Claude-3.7-Sonnet  Gemini-2.5-pro
BGE-base              16.3               35.3                  13.7               32.3
BGE-large             13.3               30.3                  14.6               33.7
GIST-base             15.0               33.7                  14.7               33.0
GIST-large            15.3               31.0                  13.0               33.0
Codesage-small-v2     16.7               32.7                  14.0               31.0
Table 7: RAG performance (Claude w/ Pass@1 & Gemini w/ Pass@8) on SWE-bench.
             cAST                                     Fixed-size
             EM (code)  ES (code)  EM (id)  F1 (id)   EM (code)  ES (code)  EM (id)  F1 (id)

BGE-base + Starcoder2-7B
Python       23.8       72.2       34.7     63.8      21.2       71.0       32.0     62.1
Java         27.8       70.9       37.5     63.8      27.3       71.6       37.1     64.1
C#           26.9       73.5       32.0     56.4      23.9       71.8       28.3     53.8
TypeScript   13.4       49.6       19.5     43.6      11.4       46.0       17.4     40.2

GIST-base + Starcoder2-7B
Python       23.4       71.9       34.0     63.7      23.0       71.7       33.5     63.3
Java         28.0       71.2       37.7     64.3      27.0       71.3       36.8     63.7
C#           26.6       73.2       31.2     56.0      24.3       72.5       28.7     54.3
TypeScript   13.0       49.3       19.7     43.9      11.2       46.1       17.2     40.2

Codesage-small-v2 + Starcoder2-7B
Python       29.1       74.3       39.9     67.6      24.8       73.1       36.3     65.7
Java         30.9       72.2       41.2     66.1      28.1       71.5       38.3     64.6
C#           28.3       74.2       33.4     58.2      25.5       72.4       29.9     54.9
TypeScript   13.7       49.1       19.6     43.5      11.9       46.0       17.7     40.6
Table 8: RAG performance (Code Match & Identifier Match) on CrossCodeEval.
Artifacts/Packages   Citation                  Link                                                          License
RepoEval             (Zhang et al., 2023b)     https://github.com/irgroup/repro_eval                         MIT License
SWE-bench            (Jimenez et al., 2024)    https://github.com/SWE-bench/SWE-bench                        MIT License
CrossCodeEval        (Ding et al., 2023)       https://github.com/amazon-science/cceval                      Apache License 2.0
PyTorch              (Paszke et al., 2019)     https://pytorch.org/                                          BSD-3 License
transformers         (Wolf et al., 2019)       https://huggingface.co/transformers/v2.11.0/index.html        Apache License 2.0
numpy                (Harris et al., 2020)     https://numpy.org/                                            BSD License
matplotlib           (Hunter, 2007)            https://matplotlib.org/                                       BSD compatible License
vllm                 (Kwon et al., 2023)       https://github.com/vllm-project/vllm                          Apache License 2.0
BGE                  (Xiao et al., 2023)       https://huggingface.co/BAAI/bge-large-en                      MIT license
GIST                 (Solatorio, 2024)         https://huggingface.co/avsolatorio/GIST-Embedding-v0          MIT license
CodeSage             (Zhang et al., 2024)      https://huggingface.co/codesage/codesage-small-v2             Apache License 2.0
Jina-v2-Code         (Günther et al., 2023)    https://huggingface.co/jinaai/jina-embeddings-v2-base-code    Apache License 2.0
StarCoder2           (Lozhkov et al., 2024)    https://huggingface.co/bigcode/starcoder2-7b                  LICENSE
CodeLlama            (Roziere et al., 2023)    https://huggingface.co/codellama/CodeLlama-7b-hf              LICENSE
Table 9: Details of datasets, major packages, and existing models we use. The curated datasets and our code/software
are under the MIT License.