Planning In Natural Language Improves LLM Search For Code Generation
Evan Wang 1,2, Federico Cassano °3,4, Catherine Wu °, Yunfeng Bai 1, Will Song 1, Vaskar Nath 1, Ziwen Han 1, Sean Hendryx 1, Summer Yue 1, Hugh Zhang 1
1 Scale AI, 2 California Institute of Technology, 3 Northeastern University, 4 Cursor AI
° Work conducted while at Scale AI
Correspondence to evan.wang@scale.com and hugh.zhang@scale.com
Abstract
While scaling training compute has led to remarkable improvements in large language models (LLMs),
scaling inference compute has not yet yielded analogous gains. We hypothesize that a core missing
component is a lack of diverse LLM outputs, leading to inefficient search due to models repeatedly
sampling highly similar, yet incorrect generations. We empirically demonstrate that this lack of diversity
can be mitigated by searching over candidate plans for solving a problem in natural language. Based
on this insight, we propose PlanSearch, a novel search algorithm which shows strong results across
HumanEval+, MBPP+, and LiveCodeBench (a contamination-free benchmark for competitive coding).
PlanSearch generates a diverse set of observations about the problem and then uses these observations
to construct plans for solving the problem. By searching over plans in natural language rather than
directly over code solutions, PlanSearch explores a significantly more diverse range of potential
solutions compared to baseline search methods. Using PlanSearch on top of Claude 3.5 Sonnet
achieves a state-of-the-art pass@200 of 77.0% on LiveCodeBench, outperforming both the best score
achieved without search (pass@1 = 41.4%) and using standard repeated sampling (pass@200 = 60.6%).
Finally, we show that, across all models, search algorithms, and benchmarks analyzed, we can accurately
predict performance gains due to search as a direct function of the diversity over generated ideas.
“If you fail to plan, you plan to fail.” — Mastermind, Taylor Swift
1. Introduction
The bitter lesson [40] famously posits that two forms of scaling trump everything else: learning and
search. While recent advances in large language models have removed all doubt on the effectiveness of
learning, search has not yet proven its value for large language models, despite its success with classical
machine learning techniques [4, 7, 8, 10, 17, 37, 38].
Here, we refer to search as any method of spending compute at inference time to improve overall
performance [28]. In this work, we focus our efforts on improving LLM search for code generation, one
of the most important current applications of LLMs. We hypothesize the major bottleneck preventing
widespread use of search at inference time for code is a lack of high-level diversity in model outputs.
This lack of diversity is likely in part due to specific post-training objectives commonly used to train
LLMs as chatbots, in which models are oftentimes optimized to produce a single correct answer [31, 33]. We empirically demonstrate that this is the case for many open-source language models which have undergone significant post-training. Specifically, we show that in many cases, despite instruction-tuned models outperforming base models by large margins in the single-sample regime (pass@1), this trend disappears, and sometimes even reverses, when generating many samples. We refer to Figure 3 as an example of this phenomenon.

Figure 1: Comparison of Repeated Sampling (pass@1 and pass@200) and our novel method PlanSearch (pass@200) on LiveCodeBench. On every model, our method outperforms baselines by a wide margin, with the best model-method combination, Claude 3.5 Sonnet with PlanSearch, achieving performance nearly double that of the best model without search.
Furthermore, the lack of diversity is particularly harmful for search algorithms. In the most egregious
of cases with little to no diversity, such as greedy decoding, repeated sampling from the model returns
highly similar programs, resulting in minimal gain from additional inference-time compute. This di-
versity problem is also not reflected in many public leaderboards (e.g. LMSYS Chatbot Arena [14],
LiveCodeBench [22], OpenLLMLeaderboard [1]), which often report only the pass rate from a single sam-
ple of the model, ignoring an entire dimension along which to compare models. While the performance
of one sample is the primary metric of relevance for applications such as chatbots, as users typically
are sensitive to latency, this single scalar is insufficient to fully capture the quality of a model when it is
allowed to use more inference time compute.
In this paper, we explore several directions for improving the diversity of LLMs at inference time. We
hypothesize that the right axis of diversity to search over is the natural language conceptual/idea
space, and we validate our hypothesis across several experiments. First, we show that models can
produce the correct final program when fed the correct solution sketches, where these sketches have been
“backtranslated” from passing solution code into sketches in idea space (Section 3.2). Second, we show
that when models are asked to generate their own ideas before implementing them on LiveCodeBench (IdeaSearch), their accuracy conditioned on a particular sketch trends towards either 0% or 100%, suggesting that most of the variance in passing a particular problem is captured by whether the sketch is correct rather than any other factor. These two experiments suggest a natural method for improving LLM search for code generation: searching for the correct idea to implement.

[Figure 2 diagram: Stage 1 generates diverse first-order observations about the coding problem (e.g. "Use hashmaps", "Use greedy search", "Use binary search") and second-order observations derived from them (e.g. "Precompute positions for each subsequence", "Return False if the array is empty"); Stage 2 implements code from observations via strategy, description, pseudocode, and code, prompting new solutions with "Your idea is wrong".]

Figure 2: An example trajectory of PlanSearch, which searches over plans in natural language as a method of increasing diversity in the search process. PlanSearch first generates observations, then combinatorially samples subsets of these observations to generate the next step in the search process. To generate the next layer of observations, the combinations derived from the first observations are used as a stepping stone to generate the next observations, and the process repeats. After generating both the first- and second-order observations, PlanSearch then generates a natural language description of a strategy to solve the problem. For additional diversity, the model is prompted to regenerate its strategy as an additional sample before generating code. See Section 4.2 for additional discussion.
Guided by this principle of maximizing exploration of ideas, we propose PlanSearch. In contrast to many existing search methods that search over individual tokens [44, 46], lines of code [24], or even entire programs [25], PlanSearch searches over possible plans for solving the problem at hand, where a plan is defined as a collection of high-level observations and sketches helpful for solving a particular problem (Figure 2). To generate novel plans, PlanSearch generates a number of observations about the problem, before combining these observations into a candidate plan for solving the problem. This is done for every possible subset of the generated observations to maximally encourage exploration in idea space, before each plan is eventually translated into a final code solution (Section 4.2). We find that searching over plans outperforms both standard repeated sampling and directly searching over ideas (IdeaSearch, introduced in Section 4.1.2) in terms of effectively using compute at inference time. Applying PlanSearch on top of Claude 3.5 Sonnet achieves a state-of-the-art pass@200 of 77.0% on LiveCodeBench, outperforming both the best score achieved without search (pass@1 = 41.4%) and the standard best-of-n sampling score (pass@200 = 60.6%). Furthermore, consistent with recent findings on the effectiveness of search on top of small models [5, 6, 12, 42], PlanSearch on top of a small model (GPT-4o-mini) outperforms larger models not augmented with search after merely 4 attempts. Evaluations of PlanSearch across two other coding benchmarks, HumanEval+ and MBPP+ [26], suggest similar improvements.
Finally, we measure the diversity of output code over the idea space of all search methods via an LLM-as-
a-judge procedure (Section 6.1) and show that the resulting diversity score is highly correlated with the
performance gains generated by that search method. This provides further support for our hypothesis
that the effective exploration of plans in idea space is key to LLM search for code generation (Figure 6).
2. Related Work
We reiterate that search as defined in the context of our paper refers to any method which expends
inference time compute to improve performance. We further specify planning as any form of high level
observation or abstract thought that assists a model in generating a final solution. Our work builds off a
long history of work in scaling search and planning.
2.1 Search in Classical AI
Classical search algorithms like breadth-first search, depth-first search, and A* search have been widely
used for pathfinding, planning, and optimization [34]. More advanced search techniques like Monte
Carlo tree search (MCTS) have achieved remarkable success in domains like game playing, enabling
superhuman performance in Go [37, 38], poker [7, 8], and Diplomacy [17]. More recently, Jones [23] finds scaling laws for the performance of AI systems in board games, where Elo improves logarithmically with the amount of compute spent at inference.
2.2 Search with Language Models
Applying search on top of LLMs has been a topic of much interest, especially with an eye towards code
generation [13, 25]. Historically, methods such as beam search significantly improved performance for
translation systems [18]. Closer to the present day, several recent works have explored repeated sampling
[5, 6, 12, 42] as a search method for improving performance. Repeated sampling is a method which
directly generates candidate code solutions from the model many times at moderate to high temperatures
in hopes that one of the resulting generations will be correct. However, although these works address
the roughly linear increase in pass@k with respect to log k, they only focus on the most basic version of
repeated sampling, without searching in idea space.
When combined with a verifier, reward model, or other filtering algorithm to select the best generation
(in cases where pass@k is not a viable metric due to lack of test cases), it is also known under the name of
best-of-n sampling [29]. Many works show promising results given intelligent selection of such a filtering algorithm [11, 12]. Recently, several approaches have demonstrated the power of repeated sampling. For example, repeated sampling from a small model can sometimes outperform taking a single sample from a large model on an equalized compute basis [39]. Unlike algorithms such as repeated sampling, which search over the output space, the key insight of PlanSearch is that it is far more effective to instead search over plans in the latent idea space. By explicitly searching over different natural language plans before generating the code, we significantly increase the diversity of the final code outputs and thus the resulting pass@k scores for sufficiently large k.
Regarding searching over plans in natural language, several approaches have also proposed generalizing
chain-of-thought [41] reasoning into a search-like process, such as Tree of Thoughts [43] and Reasoning
via Planning [20]. However, prior methods have largely demonstrated effectiveness on somewhat
contrived problems designed to highlight the power of search, such as the game of 24, or classic
planning benchmarks such as Blocksworld [27], where both benchmarks are easier to solve by explicitly
considering many options, and where the 'steps' over which to search are fairly obvious. By contrast, most real-world planning is used to assist in domains that are complex enough to benefit from, but not require, the additional exploration of plans. We demonstrate that PlanSearch, which plans in natural language, outperforms baseline search methods in one such domain: code generation. Moreover, our analysis reveals the underlying reason that such search is effective: it increases the diversity of the generated ideas, allowing more efficient search relative to other methods which repeatedly submit highly similar, incorrect solutions.

Figure 3: Pass@k vs. k for the DeepSeek-Coder-V2-Lite models on MBPP+. Despite DeepSeek-Coder-V2-Lite-Base having significantly lower pass@1 than its instruct counterpart, we observe that this trend reverses as k increases, suggesting that the instruct model has less diversity than its base model counterpart. We observe this trend for many, but not all, models and benchmarks, and provide the full data in Appendix H.
3. Motivation
Coding is a powerful area in which search should excel. While search in other domains requires both
generating many solutions and selecting the correct solution amongst all the resulting generations,
coding often only requires the former, as any valid piece of code can be tested via code execution against
given test cases. This allows code search algorithms to sidestep many of the issues that plague search
algorithms for more open-ended domains (e.g. generating poetry) due to difficulty in selecting correct
solutions out of all the generated solutions.
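As a concrete illustration, the following is a minimal sketch of checking candidate programs via execution against given test cases. It assumes solutions expose a solve(x) function, which is an illustrative assumption; a real harness would sandbox execution and handle timeouts.

```python
# A minimal sketch of checking a candidate program against given test cases by
# execution. Assumes candidates define a `solve(x)` function (an illustrative
# assumption, not the paper's actual evaluation harness).
def passes_tests(code: str, tests: list[tuple]) -> bool:
    """Return True iff the candidate code produces the expected output on every test."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # define the candidate's functions
        return all(namespace["solve"](x) == y for x, y in tests)
    except Exception:
        return False  # crashes and exceptions count as failures

# Example usage with a trivial candidate program.
candidate = "def solve(x):\n    return x * 2"
print(passes_tests(candidate, [(1, 2), (3, 6)]))  # True
```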
3.1 Defining the Search Space
Perhaps the most important question for eliciting strong search capacities is determining which space to
search over, as finding the proper layer of abstraction is critical to progress in the field. Prior approaches
have varied, with many people searching over individual tokens [44, 46], lines of code [24], or even
entire programs [25]. We hypothesize that the key factor is obtaining the correct solution sketch, which
we define as a description of the correct program in natural language space. Intuitively, conducting
the reasoning process in natural language space allows us to effectively harness the training process
of LLMs, which have observed many human reasoning traces in both pre- and post-training. Prior
work [41] has observed strong positive effects from being allowed to conduct such reasoning in natural
language, making it a natural place to search over. We describe two experiments providing evidence for
this hypothesis by testing on the LiveCodeBench benchmark using GPT-4o-mini as our model.
[Figure 4 plots: (a) "Effects of Backtranslation on Performance": pass@1 and pass@5 vs. average solution token length, with no-backtranslation baselines; (b) "Distribution of Solve Rates Conditioned on Idea": per-problem vs. per-idea distributions over solve rate.]
(a) Performance of GPT-4o-mini on LiveCodeBench
when provided with backtranslated solutions of vary-
ing lengths. The baselines plot performance without
backtranslated solutions. Providing the model with a
compressed solution in natural language, even as short
as 10 tokens, significantly increases performance.
(b) We plot the distribution of solve rates conditioned on
being given a solution sketch and without. When con-
ditioning on a given sketch, we notice that downstream
solve rates polarize towards either 0% or 100%. Most of
the variance in performance is predicted by whether a
given idea is correct or not.
Figure 4: Backtranslation shows the promise of providing good sketches, and conditioning on ideas shows that the presence of a solution sketch polarizes performance.
3.2 Backtranslation
To investigate the hypothesis that the idea space, instantiated as solution sketches, is the right space to explore, a natural question is whether LLMs can correctly implement a code solution when given a correct sketch. Inspired by approaches to backtranslation in machine translation [16, 32, 35], we experiment with "backtranslating" passing code solutions back into idea space. First, we use GPT-4o to generate 1000 attempts per problem and filter out problems without any passing solutions. As we also do not have a dataset of correct solution sketches associated with each solution, we
generate a candidate correct idea via backtranslation. We do this by feeding an LLM both the problem
and code solution and asking the LLM to convert said solution into a natural language description of the
solution. Additionally, we vary the detail of the backtranslated idea via instructions to the LLM in the
prompt (e.g. ‘in w words’). A full description of the prompts can be found in Appendix J.1, alongside
several example backtranslated solutions of various lengths.
We observe that prompting a model with a backtranslated idea significantly improves accuracy, with gains increasing with the length of the translated idea (Figure 4a). This suggests that having a correct sketch is sufficient to produce the correct final solution with relatively high accuracy, even with only 10 tokens of backtranslated solution, and that the correct direction of search is therefore to explore through idea space to maximize the chance of arriving at a correct idea.
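As a concrete illustration, the following is a minimal sketch of this setup, assuming a generic llm(prompt) completion function and hypothetical prompt wording (the actual prompts are given in Appendix J.1):

```python
# A minimal sketch of the backtranslation setup. `llm` is a hypothetical
# text-completion callable and the prompt wording is illustrative only; the
# word budget `w` controls the level of detail of the backtranslated idea.
def backtranslate(problem: str, passing_solution: str, w: int, llm) -> str:
    """Compress a passing code solution into a natural language sketch of ~w words."""
    return llm(
        f"Problem:\n{problem}\n\nCorrect solution:\n{passing_solution}\n\n"
        f"Describe the idea behind this solution in natural language, in {w} words."
    )

def solve_with_sketch(problem: str, sketch: str, llm) -> str:
    """Ask the model to implement the problem while following the provided sketch."""
    return llm(
        f"Problem:\n{problem}\n\nSolution sketch:\n{sketch}\n\n"
        "Follow the sketch and implement a solution in Python."
    )
```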
3.3 Conditioning on Idea Quality
In a follow-up experiment, we prompt an LLM to generate its own sketches to solve LiveCodeBench
problems instead of providing it with golden ones via backtranslation. First, we generate 5 ideas per
problem using I DEA S EARCH , defined in Section 4.1.2. For each idea, we then sample 25 candidate
solutions and measure their pass rate. For this experiment, we filter out any problem that GPT-4o-mini
solves with either a 100% or a 0% solve rate, since such problems are either too easy or too hard for the
model and would not be informative for this experiment. We end with 75 problems and 375 sketches.
To test our hypothesis that generating a correct sketch is a critical factor for solving problems, we compare
the distribution of solve rates for generating correct code solutions conditioned on a given sketch to the
distribution over solve rates given a sketch drawn at random, i.e., just the distribution over solve rates.
Formally, for any problem $P_i$, we sample a sketch $I$ from a conditional distribution with probability mass $P(I \mid P_i)$. The probability of solving $P_i$ given that sketch is then $P(\text{solve} \mid P_i, I)$. We compare the solve-rate distribution $P(\text{solve} \mid P_i, I)$ over all problems and all sketches against the solve-rate distribution $\sum_I P(\text{solve} \mid P_i, I) \cdot P(I \mid P_i) = P(\text{solve} \mid P_i)$ over all problems.
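As a small illustration of this comparison, the sketch below computes both empirical distributions from sampled outcomes; the data layout and variable names are assumptions for illustration, not the paper's actual code:

```python
import numpy as np

# Hypothetical data layout: results[problem][sketch] is a list of 0/1 outcomes
# over the sampled candidate solutions for that (problem, sketch) pair.
results = {
    "problem_1": {"sketch_a": [1, 1, 0, 1], "sketch_b": [0, 0, 0, 0]},
    "problem_2": {"sketch_a": [1, 0, 1, 1], "sketch_b": [1, 1, 1, 1]},
}

# Per-sketch (conditional) solve rates: P(solve | P_i, I) for each sampled sketch I.
per_sketch = [
    np.mean(outcomes)
    for sketches in results.values()
    for outcomes in sketches.values()
]

# Per-problem (marginal) solve rates: averaging over sketches approximates
# sum_I P(solve | P_i, I) * P(I | P_i) = P(solve | P_i), since the sketches
# were sampled from P(I | P_i).
per_problem = [
    np.mean([np.mean(outcomes) for outcomes in sketches.values()])
    for sketches in results.values()
]

# Plotting these two distributions side by side (e.g. as histograms) gives the
# style of comparison in Figure 4b: the per-sketch one polarizes toward {0, 1}.
print("per-sketch solve rates:", per_sketch)
print("per-problem solve rates:", per_problem)
```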
While verifying whether a sketch is correct or incorrect is difficult without access to external labels, a key
insight is that if generating the correct idea is a critical factor in solving the problem, then conditioning
on a particular sketch should polarize the distribution of solve rates towards {0, 1}. If the model is given
a correct sketch, it should consistently generate correct solutions, while if given a bad sketch, it should
consistently generate incorrect solutions.
Our results confirm this to be the case. Figure 4b shows the distribution of solve rates across problems,
both unconditionally (in red) and conditioned on each sketch (in blue). We notice that when grouping by
sketches, the solve rates indeed become polarized towards {0, 1}. This result has important implications
for improving code generation, suggesting that a large portion of variance in performance may come from
whether the model is able to generate a correct idea or not. Therefore, a natural path for improvement is
to focus on the sketch generation step and search for correct sketches and observations in idea space
before generating solution code.
4. Methods
We provide a description of the various methods of search we explore in our work. If additional
background on competitive programming and related notation is desired, we provide more (optional)
information in Appendix K.
4.1 Baselines

4.1.1 Repeated Sampling
We consider the basic prompting approach as a baseline, in which we use few-shot prompting by
providing the LLM with a number of problem-solution pairs before asking it to solve the desired
question [9]. A full example of the prompt is given in Appendix J.2. In code generation, the most
common variant of search utilized is repeated sampling, where models are repeatedly sampled from
until they generate an output that passes the test or the maximum number of samples is reached. Refer
to the Related Work for more information (Section 2.2).
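A minimal sketch of this baseline is shown below, assuming hypothetical generate_code and passes_tests helpers as stand-ins for the model call and the test harness (not the paper's actual implementation):

```python
# A minimal sketch of the Repeated Sampling baseline. `generate_code(problem, temperature)`
# and `passes_tests(code, tests)` are hypothetical stand-ins for the LLM call and the
# execution-based checker.
def repeated_sampling(problem, tests, generate_code, passes_tests,
                      max_samples=200, temperature=0.9):
    """Sample candidate programs until one passes or the budget is exhausted."""
    candidates = []
    for _ in range(max_samples):
        code = generate_code(problem, temperature=temperature)
        candidates.append(code)
        if passes_tests(code, tests):
            return code, candidates  # first passing solution found
    return None, candidates  # no candidate passed within the budget
```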
4.1.2 IdeaSearch
A natural extension of the Repeated Sampling approach discussed in Section 4.1.1 is to avoid prompting the LLM for the solution code immediately. This can be viewed as an application of the commonly used "chain-of-thought" prompting to programming problems [41], although we find that IdeaSearch shows non-negligible performance boosts over standard chain-of-thought prompting (see Appendix E).

In IdeaSearch, the LLM is given the problem P and is asked to output a natural language solution S to the problem. Then, a separate instance of the LLM is given P and S and tasked with following the proposed solution S to solve the problem P. The purpose of IdeaSearch is to isolate the effectiveness of having the correct "idea/sketch" for solving the problem. Empirically, we find that explicitly forcing the search
algorithm to articulate an idea for solving the problem increases diversity. See Appendix J.3 for detailed
prompts.
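A minimal sketch of this two-stage procedure, assuming a generic llm(prompt) completion function and hypothetical prompt wording (the actual prompts are in Appendix J.3):

```python
# A minimal sketch of IdeaSearch. `llm` is a hypothetical text-completion callable
# and the prompt wording is illustrative only.
def idea_search(problem, llm):
    """Two-stage generation: first a natural language solution sketch, then code."""
    # Stage 1: one LLM call proposes a natural language solution S for problem P.
    sketch = llm(f"Propose a natural language solution to this problem:\n{problem}")
    # Stage 2: a separate LLM call implements the proposed solution S as code.
    code = llm(
        f"Problem:\n{problem}\n\nProposed solution:\n{sketch}\n\n"
        "Follow the proposed solution exactly and implement it in Python."
    )
    return sketch, code
```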
4.2 PlanSearch
While both Repeated Sampling and IdeaSearch are successful and lead to improved benchmark results, we observe that in many cases, prompting multiple times (pass@k), even at high temperatures, only leads to small, narrow changes in the output code that alter minor aspects but fail to fix flaws in the underlying idea.
4.2.1 Prompting for Observations
Starting from the problem statement P, we prompt an LLM for "observations"/hints about the problem. We denote these observations as $O^1_i$, $i \in \{1, \ldots, n_1\}$, where the superscript indicates that they are first-order observations. Typically, $n_1$ is on the order of 3 to 6; the exact number depends on the LLM output. To use these observations to inspire future idea generation, we create all subsets of size at most 2 of $S^1 = \{O^1_1, \ldots, O^1_{n_1}\}$. Each of these subsets is a combination of observations, and for clarity we denote each subset as $C^1_i$, $i \in \{1, \ldots, l_1\}$, where $l_1 = 1 + n_1 + \binom{n_1}{2}$.
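A short sketch of this subset construction follows; the example observations are taken from Figure 2, and the variable names are illustrative assumptions:

```python
# A minimal sketch of forming the observation subsets C^1_i of size at most 2.
from itertools import combinations
from math import comb

observations = [  # first-order observations O^1_1, ..., O^1_{n_1}
    "Use hashmaps",
    "Use greedy search",
    "Use binary search",
]

# All subsets of size 0, 1, or 2 (the empty subset corresponds to the bare problem).
subsets = [
    list(combo)
    for size in range(3)
    for combo in combinations(observations, size)
]

n1 = len(observations)
assert len(subsets) == 1 + n1 + comb(n1, 2)  # l_1 = 1 + n_1 + C(n_1, 2)
```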
4.2.2 Deriving New Observations
The set of all observations can thus be viewed as a directed tree of depth 1, where the root node is P and an edge points from P to each $C^1_i$. We then repeat the procedure from Section 4.2.1 on each leaf node $C^1_i$ to generate a set of second-order observations, $S^2_i = \{O^2_{i,1}, \ldots, O^2_{i,n_{i,2}}\}$. To obtain second-order observations, we prompt the model with both the original problem P and all observations contained in $C^1_i$, framed as primitive observations that are necessary in order to solve P. The LLM is then prompted to use/merge the observations found in $C^1_i$ in order to derive new ones.

The same procedure as in Section 4.2.1 is used to create all subsets $C^2_{i,j}$, for all $i \in \{1, \ldots, l_1\}$. This process may be repeated arbitrarily many times, but we truncate the tree at depth 2 due to computational constraints.
Note that we make no assumption that any of the generated observations are correct. In fact, it is critical to note that many of them may be incorrect. The observations merely serve to encourage the model to search over a more diverse set of ideas.
4.2.3 Observations to Code
After the observations have been made, they must be implemented as ideas before being translated into
code. For each leaf node, we prompt the model with all observations, along with the original problem
P, in order to generate a natural language solution to the problem P. To add more diversity, for each
generated idea, we generate an additional idea by supposing the idea is wrong, and asking an LLM to
give criticisms/feedback, thus increasing our proposed ideas by a factor of 2.
These natural language solutions are then translated into pseudocode, which is subsequently translated into actual Python code. We take a more granular approach to reduce the translation error (which may
cause the model to revert to its original mode, disregarding the reasoned-through observations). We
provide all prompts for all sections in Appendix J.4.
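A minimal end-to-end sketch of this translation chain for a single leaf node, assuming a generic llm(prompt) completion function and hypothetical prompt wording (the actual prompts are in Appendix J.4):

```python
# A minimal sketch of PlanSearch's translation chain from an observation subset to
# code. `llm` is a hypothetical text-completion callable; the prompts are illustrative.
def observations_to_code(problem, observation_subset, llm):
    """Turn one combination of observations into candidate code solutions."""
    obs_text = "\n".join(f"- {o}" for o in observation_subset)

    # Natural language strategy derived from the problem and observations.
    idea = llm(
        f"Problem:\n{problem}\n\nObservations:\n{obs_text}\n\n"
        "Using these observations, describe a strategy to solve the problem."
    )
    # Extra diversity: suppose the idea is wrong and ask for criticisms and a revision.
    critiqued_idea = llm(
        f"Problem:\n{problem}\n\nProposed idea:\n{idea}\n\n"
        "This idea is wrong. Give criticisms and propose a corrected strategy."
    )

    codes = []
    for strategy in (idea, critiqued_idea):
        pseudocode = llm(f"Translate this strategy into pseudocode:\n{strategy}")
        code = llm(f"Translate this pseudocode into Python code:\n{pseudocode}")
        codes.append(code)
    return codes  # two candidate solutions per observation subset
```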
Model               Benchmark       Pass@1   Pass@200   IdeaSearch@200   PlanSearch@200
GPT-4o-mini         LiveCodeBench   39.0     53.3       59.4             64.9
GPT-4o              LiveCodeBench   41.3     60.6       70.4             73.0
DeepSeek-Coder-V2   LiveCodeBench   41.4     53.2       65.9             70.3
Claude-Sonnet-3.5   LiveCodeBench   40.3     55.6       70.2             77.0
GPT-4o-mini         HumanEval+      83.7     95.0       97.5             98.2
GPT-4o              HumanEval+      86.4     98.2       97.6             99.5
DeepSeek-Coder-V2   HumanEval+      82.8     91.4       97.2             99.3
Claude-Sonnet-3.5   HumanEval+      81.6     88.9       95.6             98.5
GPT-4o-mini         MBPP+           73.5     83.8       87.3             91.0
GPT-4o              MBPP+           77.2     87.4       89.3             92.2
DeepSeek-Coder-V2   MBPP+           76.3     81.9       89.1             92.6
Claude-Sonnet-3.5   MBPP+           77.1     83.0       87.8             93.7
Table 1: We find that PlanSearch and IdeaSearch improve upon search baselines across all models, with PlanSearch achieving the best results across all models and benchmarks considered. Notably, using PlanSearch on top of Claude 3.5 Sonnet [2] has a pass@200 of 77.0 on LiveCodeBench, which is nearly double the performance of the top model without using search (41.4). We highly encourage readers to check Appendix A for complete results and pass@k curves.
5. Experimental Results
5.1 Datasets
We evaluate our search methods on three benchmarks: MBPP+, HumanEval+ [26], and LiveCodeBench [22].
MBPP [3] and HumanEval [13] are some of the most widely used code benchmarks in the field. However,
since both benchmarks provide only a few test cases, [26] updates both benchmarks with additional test
cases that increase the benchmarks’ robustness to reward hacking. LiveCodeBench is a benchmark for
coding that consists of competitive programming problems which typically require advanced reasoning
capabilities. Given the reality that coding data is often highly upsampled during pre-training [15, 30],
LiveCodeBench differentiates itself from other benchmarks by taking care to segregate problems by
date to avoid data contamination concerns. For this paper, we use only the subset of problems between
May 2024 and September 2024 to avoid possibilities of contamination. We choose May 2024 as the
cutoff date to ensure that our results with our best performing model (Claude 3.5 Sonnet) are not due
to contamination, because Claude 3.5 Sonnet has a knowledge cutoff of April 2024. To ensure fair
comparison, we use the same cutoff for all models evaluated, even though the precise cutoff dates for
other models may vary slightly from May 2024.
5.2 Experiment Details
For all search algorithms, we require that all output code be in the correct format specified, and we
mark a solution as incorrect if it does not follow the intended formatting. The extracted code is then run
through all tests of the program and marked as correct if and only if it passes all tests.
All models are run with temperature 0.9 and top-p of 0.95. The temperature was determined through a coarse hyperparameter sweep over $T \in \{0.0, 0.1, 0.2, \ldots, 1.2\}$ on Repeated Sampling and IdeaSearch, which we describe in Appendix F.
[Figure 5 plots: "Pass@k vs k for Methods with Public Filtering on LiveCodeBench", with panels for GPT-4o-mini, GPT-4o, DeepSeek-Coder-V2, and Sonnet-3.5; solid lines show Repeated Sampling, IdeaSearch, and PlanSearch with public filtering, dotted lines without.]
Figure 5: Performance of all models and methods on LiveCodeBench with public test filtering. The idea of public test filtering is to shift pass@k curves leftward (i.e., bringing high k to low k), so we plot curves in detail over $k \in \{1, \ldots, 20\}$. Dotted lines are provided for reference of the base method pass@k before filtering. Even at 10 completions, PlanSearch outperforms filtered Repeated Sampling by a flat 30 to 40%. Again, full pass@k plots are included in their entirety in Appendix A.
Both Repeated Sampling and IdeaSearch generate exactly n codes, whereas PlanSearch generates a variable number of codes, usually on the order of 300 to 400. To compute pass@k, we use the unbiased estimator in Equation 4 [13].¹ If k > n, we assume the remaining generations did not pass. To compute pass@k for filtering, we limit the pool of codes to those that survive the filter, meaning that both n and c may shrink in size. This can be thought of as a conditional probability, where the condition is that the code passes the public tests.
¹ Note that the estimator in Equation 4 theoretically requires that the number of successes follows a binomial distribution. Repeated Sampling and IdeaSearch obey this, but PlanSearch generations may not be independent. See Appendix N for more discussion.
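For reference, the following is a sketch of this standard unbiased estimator from [13]; the numerically stable product form below is an implementation choice, not taken from the paper:

```python
# Unbiased pass@k estimator from Chen et al. [13]: pass@k = 1 - C(n - c, k) / C(n, k).
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k given n samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples with 12 correct ones.
print(pass_at_k(n=200, c=12, k=1))    # ~0.06
print(pass_at_k(n=200, c=12, k=200))  # 1.0
```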
5.3 Results
Our summarized results for Repeated Sampling, IdeaSearch, and PlanSearch can be found in Table 1, Figure 1, and Figure 5. Additionally, we plot our full pass@k curves for all methods, models, and datasets in Appendix A. For ease of comparison, we also plot all relative gains compared to Repeated Sampling@1, averaged over all models, in Appendix C. For a compute-normalized comparison between Repeated Sampling and PlanSearch, see Figure 18.
5.4 Public Test Filtering
Public test filtering is a method which submits only those samples out of the original pool of n that pass the public tests. This is particularly useful in settings such as code deployment where executing the full suite of tests may be computationally costly or otherwise undesirable (e.g. in a coding contest where every incorrect submission is penalized). Thus, instead of submitting all n codes, after public test filtering, only those codes $c_i$ are submitted such that $c_i(x_j) = y_j$ for all $j \in \{1, \ldots, u\}$, where $c_i(x)$ denotes the output of running the code on input x. The primary effect of public test filtering is to shift the pass@k curve leftward, since it discards low-quality candidate solutions that either fail to compile or fail elementary test cases for the problem.
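A minimal sketch of this filtering step, assuming a hypothetical sandboxed run_code(code, x) executor that returns the program's output on input x:

```python
# A minimal sketch of public test filtering. `run_code(code, x)` is a hypothetical
# sandboxed executor, standing in for whatever harness actually runs submissions.
def public_test_filter(candidates, public_tests, run_code):
    """Keep only candidate programs that pass every public test (x_j, y_j)."""
    surviving = []
    for code in candidates:
        if all(run_code(code, x) == y for x, y in public_tests):
            surviving.append(code)
    return surviving  # pass@k is then computed over this smaller, higher-quality pool
```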
All problems in MBPP+, HumanEval+, and LiveCodeBench come with a few public tests which are usually used to sanity check any submission. We can further improve performance by filtering on these public tests before a final submission, as described above. Applying public test filtering reduces the number of samples needed to achieve the same accuracy roughly tenfold: PlanSearch achieves 77.1% accuracy on LiveCodeBench after just 20 submissions (pass@20), compared to a pass@200 of 77.0% without public filtering (see Figure 5). We provide full results for the other datasets in Appendix B.
6. Analysis
Our results suggest that both PlanSearch and IdeaSearch outperform basic sampling by a wide margin (Figures 12, 13, 14), with PlanSearch achieving the best score across all methods and models considered. We show the detailed pass@k results for each dataset in Figures 7, 8 and 9. We also compare with chain-of-thought [41] in Appendix E. Interestingly, we find that IdeaSearch performs somewhat better, which we speculate comes from splitting the sketch and the code solution into two model responses, instead of producing both the chain-of-thought and the code solution in one response.
Investigating the differences in specific models, we notice that trends exhibited by the pass@k curves are
not uniform across all models; in fact, each curve seems unique. We hypothesize that these differences
are in part due to changes in idea diversity, as investigated in Figures 6, 26, 27. From the figures, we
can see that our approximate diversity score accounts for much of the variance we see in the relative
improvement that arrives from scaling-up inference-time compute. This correlation holds across all
methods and models on the same dataset, suggesting that the diversity score can be used as a proxy to predict relative pass@k improvement. For further discussion on the specifics of the diversity score,
see Section 6.1.
One interesting observation is that PlanSearch often hurts pass@1 for several models, most notably Sonnet 3.5 on LiveCodeBench, our best performing combination. Intuitively, this is because increasing the diversity across ideas likely dilutes the probability that any particular idea is generated, while simultaneously increasing the chance of having at least one correct idea within the pool. Therefore, pass@1 may be slightly lower than usual, yet pass@k will likely surpass pools of ideas lacking diversity for this reason. See Figure 48 for a graphical intuition.
Finally, in Table 1 and Figure 1, we present our main results normalized across attempts/completions, where each search method is allowed k attempts to solve each problem. An alternative way of normalizing across methods is to equalize the amount of compute spent on each method. Since PlanSearch and IdeaSearch first plan out an idea before implementing the final solution, they both spend more compute at inference time per solution generated. In Appendix D, we report the equivalent plots normalized across compute. Our findings are highly similar and suggest that PlanSearch outperforms all other methods if sufficient compute is expended at inference time.
6.1 Measuring Diversity
[Figure 6 plot: "Idea Diversity vs Relative Gains from Search (on LiveCodeBench)": relative improvement from search vs. idea diversity for Repeated Sampling, IdeaSearch, and PlanSearch across GPT-4o-mini, GPT-4o, DeepSeek-Coder-V2, and Sonnet-3.5.]
Figure 6: We observe a strong positive correlation between the measured amount of idea diversity in
a search algorithm and the resulting improvements due to search (Section 6.1). Diversity score is the
probability that GPT-4o-mini believes two randomly selected output codes implement different ideas
(higher is more diverse). Our findings suggest that diversity in idea space is essential for effective LLM
search.
We find that diversity as measured in idea space is highly predictive of search performance, as measured
by the relative improvement between a model/method’s pass@1 and its pass@200 (Figure 6). While the
most common measure of diversity is entropy [36], entropy is insufficient in the precise setting of LLMs for a number of reasons [21, 45]. As a simple example, consider two different language models, one of
which generates minor variations of the same program while another generates a variety of programs
with different underlying ideas. Even if both models have the same entropy, the latter model will be
significantly better when augmented with search capabilities.
In our setting, we measure diversity by grounding it in idea space using a simple pair-matching strategy
across all generated programs. Formally, suppose we have a pool of n code generations, $\{c_1, \ldots, c_n\}$. We assume that each piece of code implements some sketch, which can be thought of as existing in some latent 'idea' space. We consider two sketches similar if they are within $\epsilon$ of each other in this latent space, for some choice of $\epsilon$. As such, in this space, $c_i$ having an idea similar to $c_j$, and likewise $c_j$ to $c_k$, does not imply that $c_i$ and $c_k$ share a similar idea.
To compute the diversity of a given generation pool, we ask an LLM to judge the similarity of two ideas in the following manner. First, we construct each of the $\binom{n}{2}$ pairs. For each pair $(c_i, c_j)$, we judge (using an LLM) whether both $c_i$ and $c_j$ implement the same idea. We define this as the function $S(c_i, c_j) \in \{0, 1\}$, which evaluates to 1 if $c_i$ and $c_j$ implement the same idea and 0 otherwise. Our overall diversity score for a particular problem is then defined as:

$$D = 1 - \frac{\sum_{i<j} S(c_i, c_j)}{\binom{n}{2}} \qquad (1)$$

Models that output programs which all implement the same idea will have a score of D = 0, while models that output completely unique programs will have a score of D = 1. Overall, a score of D implies that if two codes are chosen at random, the probability that they implement different ideas (as measured by the LLM) is D. In Appendix M, we describe this measure in additional mathematical depth.
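A short sketch of this computation, assuming a hypothetical same_idea(c_i, c_j) LLM judge standing in for $S(c_i, c_j)$; the subsampling to 40 codes mirrors the procedure described in the next paragraph:

```python
# A minimal sketch of the diversity score D in Equation 1. `same_idea` is a
# hypothetical LLM-as-a-judge callable returning True when two programs implement
# the same idea.
from itertools import combinations
import random

def diversity_score(codes, same_idea, max_codes=40, seed=0):
    """D = 1 - (number of same-idea pairs) / C(n, 2), over a subsample of codes."""
    if len(codes) > max_codes:  # subsample for computational feasibility
        codes = random.Random(seed).sample(codes, max_codes)
    pairs = list(combinations(codes, 2))
    if not pairs:
        return 1.0  # fewer than two codes: no pairs to judge
    same = sum(1 for ci, cj in pairs if same_idea(ci, cj))
    return 1.0 - same / len(pairs)
```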
For a particular method, our reported diversity score is simply the diversity score over all problems in
the considered dataset. For computational feasibility, for large n, we instead sample a subset of 40 codes
and test all pairs from that subset instead. In order to test code samples, we first backtranslate using an
LLM to express the code in natural language before comparing each pair using both the code and the
backtranslated idea. We detail the prompts used in Appendix J.5 and use OpenAI’s GPT-4o-mini as the
supporting LLM.
7. Limitations and Future Work
While PlanSearch substantially improves diversity over idea space at inference time, fundamentally, improvements in diversity should come at the post-training stage. This likely requires re-imagining the post-training pipeline for LLMs around search, instead of the current paradigm optimized for a single correct response. This may require both collecting high-quality post-training data that is also sufficiently diverse and designing new learning objectives that do not aim solely to maximize the expected reward of a given response. We are optimistic about future work designing significantly improved post-training objectives that maximize both quality and diversity and that are specifically optimized to use inference-time compute to maximum effectiveness.
In terms of methodological improvements, PlanSearch currently searches all leaf nodes in the search tree uniformly. Because of this, it quickly becomes intractable to go more than a couple of levels deep, and in our experiments we are only able to go two levels down the tree. Several approaches based on Monte Carlo tree search (MCTS), such as Tree of Thoughts [43] or Reasoning via Planning [19], have suggested that some form of dynamic pruning and expansion of nodes can be very helpful. We are optimistic that PlanSearch can be further improved by such methods. Furthermore, PlanSearch is a fairly elementary instance of the paradigm that searching over a conceptual or idea space is an effective way to improve diversity, and thus downstream task performance. It is entirely feasible to search at an even higher level of abstraction than observations, which may inject even more diversity into the final generated outputs.
PlanSearch and IdeaSearch trade off a slight deterioration in pass@1 performance for a large improvement in pass@k performance. However, in many settings outside of code generation, it is infeasible to run an LLM-based model for more than a few attempts at most. For example, in Figure 9, PlanSearch does not significantly outperform Repeated Sampling until k ≥ 4.
Fortunately, many filtering algorithms exist, which implicitly bring pass@k (for high k) to pass@1 (or
lower k), i.e. shifting the original pass@k curve leftward. A simple example of this is public test filtering. As seen in Figure 5, the pass@1 of filtered PlanSearch significantly improves upon the pass@1 of base Repeated Sampling, and the gap grows as k increases. Moreover, nearly all base models with public test filtering outperform their instruct model variants at pass@1, no matter the dataset (see Appendix I), even though base models are generally worse at pass@1; they trade this off for somewhat higher diversity. Thus, we argue that there exists the potential for a new paradigm: developing search algorithms which trade off pass@1 performance for much stronger pass@k performance, then filtering the promising generated solutions to bring the pass@k back down to pass@1.
Additionally, we focus on code generation in this paper and do not consider the applicability of PlanSearch to a broader set of domains. One important point is that the pass@k metric heavily used throughout code generation may not be as applicable to other domains, where a larger focus on selecting the correct solution out of all generated candidates, instead of merely generating one, may be required. However, with good filtering methods, which we demonstrate can be simple in nature, pass@k for medium k can be effectively brought down to pass@1, suggesting a similar paradigm of increasing diversity and then strengthening existing filtering methods.
Finally, a natural extension of this work is training the underlying model itself on successful plans and code solutions obtained from PlanSearch. This has the potential to distill the pass@k into the pass@1, without inference-time methods like filtering, by reducing the likelihood of the model going down branches of the search tree which do not lead to correct solutions. We believe that such training is likely to significantly improve the model and look forward to future work in this direction.
8. Acknowledgements
We would like to thank Jason Wei, Miles Turpin, Sail Wang, Horace He, Kenneth Li, Celia Chen, Rahul
Chalamala, Alan Wu, and Kevin Chang for their helpful comments, suggestions and discussion over the
course of this project.
References
[1] Aidar Myrzakhan, Sondos Mahmoud Bsharat, and Zhiqiang Shen. Open-llm-leaderboard: From multi-choice to open-style questions for llms evaluation, benchmark, and arena. arXiv preprint arXiv:2406.07545, 2024.
[2] Anthropic. Claude 3.5 sonnet. https://www.anthropic.com/news/claude-3-5-sonnet, June 2024.
[3] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan,
Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large
language models, 2021. URL https://arxiv.org/abs/2108.07732.
[4] Anton Bakhtin, David J. Wu, Adam Lerer, Jonathan Gray, Athul Paul Jacob, Gabriele Farina,
Alexander H. Miller, and Noam Brown. Mastering the Game of No-Press Diplomacy via Human-
Regularized Reinforcement Learning and Planning, October 2022. URL http://arxiv.org/abs/
2210.05492. arXiv:2210.05492 [cs].
[5] Hritik Bansal, Arian Hosseini, Rishabh Agarwal, Vinh Q. Tran, and Mehran Kazemi. Smaller,
weaker, yet better: Training llm reasoners via compute-optimal sampling, 2024. URL https:
//arxiv.org/abs/2408.16737.
[6] Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, and
Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling,
2024. URL https://arxiv.org/abs/2407.21787.
[7] Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus
beats top professionals. Science, 359(6374):418–424, 2018. Publisher: American Association for the
Advancement of Science.
[8] Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker. Science, 365(6456):
885–890, 2019. Publisher: American Association for the Advancement of Science.
[9] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel
Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler,
Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray,
Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever,
and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell,
M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp.
1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/
file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
[10] Murray Campbell, A. Joseph Hoane Jr., and Feng-hsiung Hsu. Deep Blue. Artificial Intelligence,
134(1):57–83, 2002. ISSN 0004-3702. doi: https://doi.org/10.1016/S0004-3702(01)00129-1. URL
https://www.sciencedirect.com/science/article/pii/S0004370201001291.
[11] Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen.
Codet: Code generation with generated tests. arXiv preprint arXiv:2207.10397, 2022.
[12] Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Ion Stoica, Matei Zaharia, and James
Zou. Are more llm calls all you need? towards scaling laws of compound inference systems, 2024.
URL https://arxiv.org/abs/2403.02419.
[13] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared
Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen
Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray,
Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter,
Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth
Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang,
Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N.
Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles
Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish,
Ilya Sutskever, and Wojciech Zaremba. Evaluating Large Language Models Trained on Code, July
2021. URL http://arxiv.org/abs/2107.03374. arXiv:2107.03374 [cs].
[14] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng
Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena:
An open platform for evaluating llms by human preference, 2024. URL https://arxiv.org/abs/
2403.04132.
[15] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha
Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.
arXiv preprint arXiv:2407.21783, 2024.
[16] Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. Understanding back-translation at
scale, 2018. URL https://arxiv.org/abs/1808.09381.
[17] FAIR, Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried,
Andrew Goff, Jonathan Gray, Hengyuan Hu, Athul Paul Jacob, Mojtaba Komeili, Karthik Konath,
Minae Kwon, Adam Lerer, Mike Lewis, Alexander H. Miller, Sasha Mitts, Adithya Renduchintala,
Stephen Roller, Dirk Rowe, Weiyan Shi, Joe Spisak, Alexander Wei, David Wu, Hugh Zhang, and
Markus Zijlstra. Human-level play in the game of Diplomacy by combining language models
with strategic reasoning. Science, 378(6624):1067–1074, December 2022. doi: 10.1126/science.
ade9097. URL https://www.science.org/doi/10.1126/science.ade9097. Publisher: American
Association for the Advancement of Science.
[18] Markus Freitag and Yaser Al-Onaizan. Beam search strategies for neural machine translation.
In Proceedings of the First Workshop on Neural Machine Translation. Association for Computational
Linguistics, 2017. doi: 10.18653/v1/w17-3207. URL http://dx.doi.org/10.18653/v1/W17-3207.
[19] Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting
Hu. Reasoning with Language Model is Planning with World Model, May 2023. URL http:
//arxiv.org/abs/2305.14992. arXiv:2305.14992 [cs].
[20] Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu.
Reasoning with language model is planning with world model, 2023. URL https://arxiv.org/
abs/2305.14992.
[21] Tatsunori B. Hashimoto, Hugh Zhang, and Percy Liang. Unifying Human and Statistical Evaluation
for Natural Language Generation. North American Association for Computational Linguistics (NAACL).,
April 2019. URL http://arxiv.org/abs/1904.02792. arXiv: 1904.02792.
[22] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando
Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free
evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024.
[23] Andy L. Jones. Scaling scaling laws with board games, 2021. URL https://arxiv.org/abs/2104.
03113.
[24] Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken, and Percy
Liang. Spoc: Search-based pseudocode to code, 2019. URL https://arxiv.org/abs/1906.04908.
[25] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom
Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation
with alphacode. Science, 378(6624):1092–1097, 2022.
[26] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is Your Code Generated by
ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. In
Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.
net/forum?id=1qvx610Cu7.
[27] Drew M. McDermott. The 1998 ai planning systems competition. AI Magazine, 21(2):35, Jun. 2000.
doi: 10.1609/aimag.v21i2.1506. URL https://ojs.aaai.org/aimagazine/index.php/aimagazine/
article/view/1506.
[28] Aidan McLaughlin. AI Search: The Bitter-er Lesson, 2024. Accessed on September 3, 2024.
[29] Sidharth Mudgal, Jong Lee, Harish Ganapathy, YaGuang Li, Tao Wang, Yanping Huang, Zhifeng
Chen, Heng-Tze Cheng, Michael Collins, Trevor Strohman, Jilin Chen, Alex Beutel, and Ahmad
Beirami. Controlled decoding from language models, 2024. URL https://arxiv.org/abs/2310.
17022.
[30] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni
Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor
Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian,
Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny
Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles
Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea
Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen,
Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung,
Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch,
Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty
Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte,
Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh,
Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross,
Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton,
Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton,
Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela
Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan,
Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan
Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt
Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic,
Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel
Levy, Chak Ming Li, Rachel Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan
Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski,
Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine
McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz,
Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong
Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind
Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex
Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex
Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael
Petrov, Henrique Ponde de Oliveira Pinto, Michael, Pokorny, Michelle Pokrass, Vitchyr H. Pong,
Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae,
Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri
Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather
Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica
Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin,
Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski
Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. Thompson, Phil Tillet,
Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón
Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang,
Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, C. J. Weinmann, Akila Welihinda, Peter Welinder,
Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah
Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin
Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao,
Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. GPT-4 Technical Report, March
2024. URL http://arxiv.org/abs/2303.08774. arXiv:2303.08774 [cs].
[31] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong
Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow
instructions with human feedback. Advances in neural information processing systems, 35:27730–27744,
2022.
[32] Hieu Pham, Xinyi Wang, Yiming Yang, and Graham Neubig. Meta back-translation, 2021. URL
https://arxiv.org/abs/2102.07847.
[33] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea
Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in
Neural Information Processing Systems, 36, 2024.
[34] Stuart Russell and Peter Norvig. Artificial intelligence: a modern approach. 2002.
[35] Rico Sennrich, Barry Haddow, and Alexandra Birch. Improving neural machine translation models
with monolingual data, 2016. URL https://arxiv.org/abs/1511.06709.
[36] Claude E Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(3):
379–423, 623–656, 1948.
[37] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche,
Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman,
Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine
Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the Game of Go with
Deep Neural Networks and Tree Search. Nature, 529(7587):484–489, January 2016. ISSN 0028-0836,
1476-4687. doi: 10.1038/nature16961. URL http://www.nature.com/articles/nature16961.
[38] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez,
Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, and others. Mastering the Game of Go
Without Human Knowledge. Nature, 550(7676):354–359, 2017. Publisher: Nature Publishing Group.
[39] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally
can be more effective than scaling model parameters, 2024. URL https://arxiv.org/abs/2408.
03314.
[40] Richard S Sutton. The bitter lesson. Incomplete Ideas, 2019. URL http://www.incompleteideas.net/
IncIdeas/BitterLesson.html.
[41] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le,
and Denny Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv,
2022. doi: 10.48550/arXiv.2201.11903. URL http://arxiv.org/abs/2201.11903. arXiv:2201.11903
[cs].
[42] Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. An empirical analysis of
compute-optimal inference for problem-solving with language models, 2024. URL https://arxiv.
org/abs/2408.00724.
[43] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik
Narasimhan. Tree of Thoughts: Deliberate Problem Solving with Large Language Models, May
2023. URL http://arxiv.org/abs/2305.10601. arXiv:2305.10601 [cs].
[44] Dan Zhang, Sining Zhoubian, Ziniu Hu, Yisong Yue, Yuxiao Dong, and Jie Tang. Rest-mcts*: Llm self-
training via process reward guided tree search, 2024. URL https://arxiv.org/abs/2406.03816.
[45] Hugh Zhang, Daniel Duckworth, Daphne Ippolito, and Arvind Neelakantan. Trading off diversity
and quality in natural language generation. In Proceedings of the workshop on human evaluation of NLP
systems (HumEval), pp. 25–33, Online, April 2021. Association for Computational Linguistics. URL
https://aclanthology.org/2021.humeval-1.3.
[46] Shun Zhang, Zhenfang Chen, Yikang Shen, Mingyu Ding, Joshua B. Tenenbaum, and Chuang Gan.
Planning with large language models for code generation. In The Eleventh International Conference on
Learning Representations, 2023. URL https://openreview.net/forum?id=Lr8cOOtYbfL.
Appendix
A. Full Pass@k Curves for All Models and All Benchmarks
See Figures 7, 8, 9. We plot all models and methods on HumanEval+, MBPP+ [26], and LiveCodeBench [22], respectively.
Figure 7: Pass@k performance of all models and methods on HumanEval+, plotted over k ∈ { 1, . . . , 200 }.
Figure 8: Pass@k performance of all models and methods on MBPP+, plotted over k ∈ { 1, . . . , 200 }.
Figure 9: Pass@k performance of all models and methods on LiveCodeBench, plotted over k ∈ { 1, . . . , 200 }.
B. Full Pass@k Curves with Public Filtering
See Figures 10, 11, 5. We plot all models and methods with public test filtering on HumanEval+, MBPP+ [26], and LiveCodeBench [22], respectively.
Figure 10: Pass@k performance of all models and methods on HumanEval+, with public test filtering, plotted over k ∈ { 1, . . . , 20 }. Note that dotted lines are provided for reference of the base method pass@k before filtering.
Figure 11: Pass@k performance of all models and methods on MBPP+, with public test filtering, plotted over k ∈ { 1, . . . , 20 }. Note that dotted lines are provided for reference of the base method pass@k before filtering.
C. Average Relative Improvements
See Figures 12, 13, 14. To create these graphs, the relative improvement of each point on every pass@k curve is computed with respect to the corresponding Repeated Sampling pass@1. These values are then averaged over all models, so that there is one curve per method per dataset. The datasets are HumanEval+, MBPP+ [26], and LiveCodeBench [22], respectively. For the public test filtered versions, see Figures 15, 16, 17.
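As a rough illustration, the aggregation just described could be computed as in the following Python sketch. The data layout (a nested dictionary of pass@k arrays) and the percentage scaling are assumptions for illustration, not the exact analysis code behind these figures.

import numpy as np

def relative_gain_curves(pass_at_k, baseline="Repeated Sampling"):
    # pass_at_k[model][method] is assumed to be an array whose index j holds pass@(j+1).
    # For each method, compute the relative gain of pass@k over the baseline's pass@1,
    # then average the resulting curves over models (one curve per method).
    methods = {m for per_model in pass_at_k.values() for m in per_model}
    curves = {}
    for method in methods:
        gains_per_model = []
        for per_model in pass_at_k.values():
            base = per_model[baseline][0]  # Repeated Sampling pass@1 for this model
            gains = (np.asarray(per_model[method]) - base) / base * 100.0  # relative gain in %
            gains_per_model.append(gains)
        curves[method] = np.mean(gains_per_model, axis=0)
    return curves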
Figure 12: Performance gain over Repeated Sampling@1 averaged over all models on HumanEval+, plotted over k ∈ { 1, . . . , 200 }.
Figure 13: Performance gain over Repeated Sampling@1 averaged over all models on MBPP+, plotted over k ∈ { 1, . . . , 200 }.
Figure 14: Performance gain over Repeated Sampling@1 averaged over all models on LiveCodeBench, plotted over k ∈ { 1, . . . , 200 }.
Figure 15: Average performance gain over all models of methods with public test filtering on HumanEval+, compared to Repeated Sampling@1, plotted over k ∈ { 1, . . . , 20 }. Note that dotted lines are provided for reference of the base method pass@k (before filtering).
Figure 16: Average performance gain over all models of methods with public test filtering on MBPP+, compared to Repeated Sampling@1, plotted over k ∈ { 1, . . . , 20 }. Note that dotted lines are provided for reference of the base method pass@k (before filtering).
Figure 17: Average performance gain over all models of methods with public test filtering on LiveCodeBench, compared to Repeated Sampling@1, plotted over k ∈ { 1, . . . , 20 }. Note that dotted lines are provided for reference of the base method pass@k (before filtering).
D. Compute-Normalized Pass@k Graphs
See Figure 18. For each run of a method in Appendix A, we compute the number of generated tokens needed per completion, per problem, independently on each dataset. We then average across all datasets to obtain 244 generated tokens per completion per problem for Repeated Sampling and 1,428 generated tokens per completion per problem for PlanSearch.
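This normalization amounts to rescaling the x-axis from the number of completions k to an estimated token budget. A minimal sketch, assuming the per-completion token counts reported above (the constants and function below are illustrative, not the paper's code):

# Average generated tokens per completion per problem, as reported above.
TOKENS_PER_COMPLETION = {"Repeated Sampling": 244, "PlanSearch": 1428}

def tokens_per_problem(method: str, k: int) -> int:
    # Estimated generated tokens per problem when sampling k completions.
    return TOKENS_PER_COMPLETION[method] * k

# Example: completions affordable under a 10,000-token budget per problem.
for method, per_completion in TOKENS_PER_COMPLETION.items():
    print(method, "->", 10_000 // per_completion, "completions")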
Figure 18: Normalized pass@k by average tokens used per problem. Repeated Sampling uses roughly 244 tokens per completion per problem, and PlanSearch uses roughly 1,428 tokens per completion per problem. When we normalize compute across methods, we find that PlanSearch begins to be more effective than repeated sampling if the user is willing to sample at least 10,000 tokens per problem.
E. Comparison with Chain-of-Thought
See Figures 19, 20, 21, which are run on LiveCodeBench [22], MBPP+, and HumanEval+ [26], respectively. These are the same plots as in Appendix A, with CoT [41] added as a baseline. See Figures 22, 23, 24 for the public test filtered versions.
Figure 19: Pass@k graphs on LiveCodeBench, with the Chain-of-Thought baseline.
Figure 20: Pass@k graphs on MBPP+, with the Chain-of-Thought baseline.
Figure 21: Pass@k graphs on HumanEval+, with the Chain-of-Thought baseline.
Figure 22: Pass@k graphs on LiveCodeBench, with the Chain-of-Thought baseline and public filtering.
Figure 23: Pass@k graphs on MBPP+, with the Chain-of-Thought baseline and public filtering.
Figure 24: Pass@k graphs on HumanEval+, with the Chain-of-Thought baseline and public filtering.
F. Ablation on Temperature for Repeated Sampling and IdeaSearch
See Figure 25. We sweep over temperature in increments of 0.1 from 0.0 to 1.2, inclusive, with a top-p of 0.95, for Repeated Sampling and IdeaSearch.
Figure 25: Sweep over temperature in 0.1 increments from 0.0 to 1.2 on LiveCodeBench. Repeated Sampling and IdeaSearch both exhibit pass@k improvements at higher temperature, although the gains appear to plateau at the highest temperatures.
G. Diversity Score vs. Search Improvement Plots for MBPP+ and HumanEval+
See Figures 26, 27, 6. Each figure is produced by running the diversity measure described in Section 6.1 on the generated code of each run, then comparing the result with the relative gain of pass@k over pass@1.
Figure 26: Relationship between the measured diversity score as described in Section 6.1 (where higher is more diverse) and relative improvement from the pass@1 of the method to the pass@200 of the method on HumanEval+.
Figure 27: Relationship between the measured diversity score as described in Section 6.1 (where higher is more diverse) and relative improvement from the pass@1 of the method to the pass@200 of the method on MBPP+.
H. Base Models vs. Instruct Models for Large Samples
We find that base models, despite performing poorly relative to their instruct counterparts when evaluated with pass@1, frequently match or even exceed instruct-model performance on pass@k for sufficiently high k. This is likely due to the higher diversity of base models, which have not undergone post-training designed to elicit a single strong response from the model.
We see this effect across all models for HumanEval+ and MBPP+, but only for the DeepSeek-Coder-V2 family on LiveCodeBench.
See Figures 29, 30, 31 for Llama-3.1-8B pass@k comparisons.
See Figures 32, 33, 34 for Llama-3.1-70B pass@k comparisons.
See Figures 3, 35, 36 for DeepSeek-Coder-V2-Lite pass@k comparisons.
We also ran Llama-3.1-8B and DeepSeek-Coder-V2-Lite pass@k comparisons for k up to 10,000; see Figures 37, 28.
Figure 28: Pass@k curves comparing DeepSeek-Coder-V2-Lite’s base and instruct versions on LiveCodeBench with up to 10,000 completions.
I. Base Models vs. Instruct Models with Public Test Filtering
We repeat the graphs from Appendix H, but with public test filtering. We find that base models with
public test filtering almost always exceed the pass@1 of their instruct model variants.
See Figures 38, 39, 40 for Llama-3.1-8b pass@k comparisons with public test filtering.
See Figures 41, 42, 43 for Llama-3.1-70b pass@k comparisons with public test filtering.
See Figures 44, 45, 46 for DeepSeek-Coder-V2-Lite pass@k comparisons with public test filtering.
Figure 29: Pass@k curves comparing Llama-3.1-8B’s base and instruct versions on MBPP+.
Figure 30: Pass@k curves comparing Llama-3.1-8B’s base and instruct versions on HumanEval+.
Figure 31: Pass@k curves comparing Llama-3.1-8B’s base and instruct versions on LiveCodeBench.
Figure 32: Pass@k curves comparing Llama-3.1-70B’s base and instruct versions on MBPP+.
Figure 33: Pass@k curves comparing Llama-3.1-70B’s base and instruct versions on HumanEval+.
Figure 34: Pass@k curves comparing Llama-3.1-70B’s base and instruct versions on LiveCodeBench.
Figure 35: Pass@k curves comparing DeepSeek-Coder-V2-Lite’s base and instruct versions on HumanEval+.
Figure 36: Pass@k curves comparing DeepSeek-Coder-V2-Lite’s base and instruct versions on LiveCodeBench.
Figure 37: Pass@k curves comparing Llama-3.1-8B’s base and instruct versions on LiveCodeBench with up to 10,000 completions.
Figure 38: Pass@k curves comparing Llama-3.1-8B’s base and instruct versions on MBPP+ with public test filtering.
Figure 39: Pass@k curves comparing Llama-3.1-8B’s base and instruct versions on HumanEval+ with public test filtering.
Figure 40: Pass@k curves comparing Llama-3.1-8B’s base and instruct versions on LiveCodeBench with public test filtering.
Figure 41: Pass@k curves comparing Llama-3.1-70B’s base and instruct versions on MBPP+ with public test filtering.
Figure 42: Pass@k curves comparing Llama-3.1-70B’s base and instruct versions on HumanEval+ with public test filtering.
Figure 43: Pass@k curves comparing Llama-3.1-70B’s base and instruct versions on LiveCodeBench with public test filtering.
Figure 44: Pass@k curves comparing DeepSeek-Coder-V2-Lite’s base and instruct versions on MBPP+ with public test filtering.
Figure 45: Pass@k curves comparing DeepSeek-Coder-V2-Lite’s base and instruct versions on HumanEval+ with public test filtering.
Figure 46: Pass@k curves comparing DeepSeek-Coder-V2-Lite’s base and instruct versions on LiveCodeBench with public test filtering.
J. Prompts
J.1 Backtranslation
J.1.1 Backtranslate System Prompt
You are an expert Python programmer. You will be given an algorithmic question (problem
specification). You will return a high-level, natural language solution to the question,
like an editorial. You will NOT return any code. Be as creative as possible, going beyond
what you think is intuitively correct.
J.1.2 Implement Backtranslation Idea
You are an expert Python programmer. You will be given a question (problem specification)
and a natural language solution/tutorial that describes how to solve the problem. You will
generate a correct Python program that matches said specification and tutorial and passes
all tests. You will NOT return anything except for the program inside markdown codeblocks.
J.2 Repeated Sampling
You are an expert Python programmer. You will be given a question (problem specification)
and will generate a correct Python program that matches the specification and passes all
tests. You will NOT return anything except for the program inside Markdown codeblocks.
J.3 Simple Idea
You will given a competitive programming problem; please output a high-level description
of how to solve the problem in natural language. Below are examples:
Example input: PROBLEM DESCRIPTION HERE
Example output: EXAMPLE OUTPUT HERE
Here is the competitive programming problem: PROBLEM TO SOLVE
Brainstorm a high-level, natural language solution to the problem above. Note that your
intuition may lead you astray, so come up with simple, creative ideas that go beyond what
you would usually come up with and go beyond your narrow intuition. Brainstorming solutions
that do not seem intuitively correct IS CRUCIAL.
J.4 PlanSearch
J.4.1 Prompt for Observation Part 1
You are an expert Python programmer. You will be given an competitive programming question
(problem specification). You will return several useful, non-obvious, and correct observations
about the problem, like hints to solve the problem. You will NOT return any code. Be as
creative as possible, going beyond what you think is intuitively correct.
J.4.2 Prompt for Observation Part 2
You are an expert Python programmer. You will be given an competitive programming question
(problem specification) and several correct observations about the problem.
You will brainstorm several new, useful, and correct observations about the problem, derived
from the given observations. You will NOT return any code. Be as creative as possible, going
beyond what you think is intuitively correct.
J.4.3 Combining Observations
Here is a sample prompt from the function with placeholders:
Here is the competitive programming problem:
Problem statement placeholder
Here are the intelligent observations to help solve the problem:
Observation 1 placeholder
Observation 2 placeholder
Observation 3 placeholder
Use these observations above to brainstorm a natural language solution to the problem above.
Note that your intuition may lead you astray, so come up with simple, creative ideas that
go beyond what you would usually come up with and exceeds your narrow intuition.
Quote relevant parts of the observations EXACTLY before each step of the solution. QUOTING
IS CRUCIAL.
J.5 Measuring Diversity
You are an expert Python programmer. You will be given a competitive programming problem
and two pieces of code which are attempts to solve the problem. For your convenience, you
will also be given the idea for each code, summarized in natural language. You will be asked
to answer whether the ideas behind the code are the same. You must ONLY output ’Yes.’ or
’No.’
K. Competitive Programming
Competitive programming is a popular class of programming tasks that require complex algorithmic reasoning. Typically, a problem consists of a problem statement P (written in natural language) with associated tests (x_i, y_i), i ∈ { 1, . . . , m }, all of which a solution must pass.
The number of tests m depends on the problem, but typically ranges on the order of 25 to 100. A small subset of the tests is typically given to the solver (we call these public tests) to use as validation that their program passes simple cases; the rest of the tests are hidden. Solutions must generally pass all the tests to be considered correct. Formally, we let f(x) denote the output of the submitted code run on input x. The code is considered correct (passing) if and only if f(x_i) = y_i for all i ∈ { 1, . . . , m }.
Each dataset consists of many (on the order of low hundreds) independent problems, and models are evaluated on each of these problems independently.
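The pass criterion above can be made concrete with a small harness. The sketch below assumes a candidate Python program that reads each test input from stdin and prints its answer to stdout; it mirrors the formalism f(x_i) = y_i, but is only an illustrative checker, not the evaluation infrastructure used in our experiments.

import subprocess
import sys

def passes_all_tests(program_path: str, tests: list[tuple[str, str]],
                     timeout_s: float = 2.0) -> bool:
    # Return True iff the program's output matches the expected output on every
    # test, i.e. f(x_i) == y_i for all i (whitespace-normalized).
    for x, y in tests:
        try:
            result = subprocess.run(
                [sys.executable, program_path], input=x,
                capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0 or result.stdout.strip() != y.strip():
            return False
    return True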
L. A Model of Repeated Sampling: Pass@k
Consider a simplified model of repeated sampling for code generation. Suppose we have a dataset D = { P_1, . . . , P_l } with l problems. For each problem P_i, define p_i as the probability that our code generation model solves P_i in a single submission. The pass@k [13, 24] metric (for problem P_i) is defined as the probability that the model solves P_i at least once out of k submissions. Thus, if we knew the true p_i of our model, we could compute pass@k directly:
pass@k_i = 1 − (1 − p_i)^k    (2)
pass@k = (1/l) ∑_i pass@k_i    (3)
However, it turns out that for k > 1, the naïve estimator in Equation 2 is biased if we sample n_i ≥ k completions from our model for problem P_i, find that c_i ≤ n_i of them are correct, and plug in p_i = c_i / n_i [13]. Instead, pass@k_i is typically computed with the unbiased estimator:
pass@k_i = 1 − C(n_i − c_i, k) / C(n_i, k)    (4)
where C(·, ·) denotes the binomial coefficient.
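Equation 4 is usually implemented in a numerically stable product form rather than with explicit binomial coefficients. A minimal sketch (the function name is ours):

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator of pass@k given n samples with c correct (Equation 4):
    # 1 - C(n - c, k) / C(n, k), computed as a running product to avoid huge binomials.
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: every size-k subset contains a success
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

Dataset-level pass@k then follows Equation 3 by averaging this estimate over the l problems.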
Note that reporting pass@k on a dataset where l = 1 is rather pointless, since pass@k can be derived using only pass@1_1 and n_1. Every curve, over a suitable range of k values, will look like the S-curve seen in Figure 47 (as k is plotted on a log scale).
Figure 47: A simple pass@k ‘S-curve’ plotted with 1 − (1 − p)^k, where p = 0.04.
However, with datasets where l > 1, models are able to differentiate themselves at larger k, since the overall pass@k is an average of these l curves. For example, with l = 3, it is worse to have solve probabilities Set1 = { 0.001, 0.7, 0.9 } than Set2 = { 0.05, 0.1, 0.25 } in the regime from roughly k = 20 to k = 2,000 (by which point both converge to 1), even though Set1 has a pass@1 of 53% and Set2 has a pass@1 of only 13%. See Figure 48.
Although not shown in the graph, Set2 converges close to 1 at roughly k = 400, several orders of magnitude before Set1. In addition, note that the slight notch seen in Set1’s curve at large k is due to the presence of low, but non-zero, solve rates, which can also be seen in the empirical pass@k curves. (These notches can be thought of as the beginning of the ‘ramping-up’ regime of the typical S-curve in Figure 47.)
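The crossover described here follows directly from averaging the idealized per-problem curves of Equations 2 and 3. A short sketch using the two solve-probability sets above (illustrative only):

import numpy as np

def dataset_pass_at_k(solve_probs, k):
    # Idealized dataset-level pass@k (Equations 2 and 3): average of 1 - (1 - p)^k over problems.
    return float(np.mean([1.0 - (1.0 - p) ** k for p in solve_probs]))

set1 = [0.001, 0.7, 0.9]
set2 = [0.05, 0.1, 0.25]
for k in [1, 20, 200, 2000]:
    print(k, round(dataset_pass_at_k(set1, k), 3), round(dataset_pass_at_k(set2, k), 3))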
Figure 48: Two pass@k curves on a hypothetical dataset of size l = 3, where the solve probabilities of Set 1 are { 0.001, 0.7, 0.9 } and those of Set 2 are { 0.05, 0.1, 0.25 }. Note that the pass@1 values are 53% and 13%, respectively. However, at roughly k = 20, Set 2 surpasses Set 1 and, within an order of magnitude, achieves a pass@k of roughly 1.0.
M. Mathematics of the Diversity Measure
While our choice of diversity metric is intuitive, one should note that a number of intriguing details result from our definition. In particular, it is not necessarily the case that a model which outputs k unique ideas across n samples achieves a diversity score of k/n. Consider an example of n = 9 codes, separated into 3 cliques of 3, where each clique implements the same idea (and separate cliques implement separate ideas). In this setup, 1/3 of the ideas are unique, but under our metric there are 3 matching idea pairs per clique (9 matching idea pairs in total) out of C(9, 2) = 36 pairs, giving a diversity score of 1 − 9/36 = 3/4.
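Concretely, the diversity score can be computed from pairwise same-idea judgments. In the sketch below, the pairwise predicate is a placeholder for the LLM judgment elicited by the prompt in Appendix J.5; the worked example reproduces the 3-clique case above.

from itertools import combinations

def diversity_score(ideas, same_idea=lambda a, b: a == b) -> float:
    # 1 minus the fraction of the C(n, 2) pairs judged to implement the same idea.
    pairs = list(combinations(range(len(ideas)), 2))
    matching = sum(same_idea(ideas[i], ideas[j]) for i, j in pairs)
    return 1.0 - matching / len(pairs)

# Three cliques of three identical ideas: 9 matching pairs out of 36, so the score is 0.75.
print(diversity_score(["A", "A", "A", "B", "B", "B", "C", "C", "C"]))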
N. Biased Estimator for Pass@k Due to Non-Independence of PlanSearch
From a purely theoretical standpoint, the estimator in Equation 4 is biased for PlanSearch under the usual interpretation, because the samples from a single PlanSearch run are not independent. It still admits a similar interpretation, however: it computes the probability that a subset of size k, drawn from the set of samples we already generated in one run of PlanSearch, contains at least one success. As such, in theory, the estimator may be slightly biased when used to estimate PlanSearch’s true pass@k. In practice, we do not believe this to be a large concern, especially as our primary results feature a relatively large k = 200.