Improving LLM Reasoning using Self-generated Data: RL and Verifiers
1. Improving LLM Reasoning
using Self-generated data:
RL and Verifiers
Rishabh Agarwal
Research Scientist, Google DeepMind
2. Large Language Models (LLMs)
3. Slide Credit: Hritik Bansal
4. Training LLMs needs high quality data
High quality data, scraped from the web or collected from humans.
5. Are we running out of high-quality data?
➢ Time-consuming and expensive to scale
➢ Hard to create for complex tasks
epochai.org
6. Synthetic data to the rescue?
What if the models could generate
their own training outputs?
Naively doing so can result in model collapse!
The Curse of Recursion: Training on Generated Data Makes Models Forget. Shumailov et al, 2023.
7. Synthetic data to the rescue?
Verification can often be easier than Generation!
Given a string, find the length of
the longest substring without
repeating characters.
Generating code can be harder than verifying it
via test case execution.
Solving sudoku puzzles is
harder than checking one!
Can we use model-generated data for training
given access to some form of feedback?
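The slide's substring example makes this concrete. Below is a minimal sketch (my illustration, not from the talk): a model-generated solution is accepted only if it passes every known test case, which is far cheaper than writing the solution.

def candidate(s: str) -> int:
    """A model-generated attempt: longest substring without repeating characters."""
    start, best, last_seen = 0, 0, {}
    for i, ch in enumerate(s):
        if ch in last_seen and last_seen[ch] >= start:
            start = last_seen[ch] + 1   # move window past the repeated character
        last_seen[ch] = i
        best = max(best, i - start + 1)
    return best

def verify(solution, test_cases):
    """Binary feedback: does the solution pass every (input, expected) pair?"""
    return all(solution(x) == y for x, y in test_cases)

print(verify(candidate, [("abcabcbb", 3), ("bbbbb", 1), ("pwwkew", 3)]))  # True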
8. How do we self-generate data for
problem-solving?
Problem: "A stock loses 10% of its value on Monday. On Tuesday it loses 20%..."
[Diagram: the LLM samples several responses ("Okay, so I have to find the percent of the starting…", "I need to find the overall percent loss in value…", "Let V be the value of the stock at the beginning of…", …); each response is checked with "Is the response correct?", and the correct ones become model-generated fine-tuning data.]
9. A simple recipe for self-training (ReST^EM)
Repeat this process a few times:
1. Generate samples from the model and filter them using
binary feedback. (E-step)
2. Fine-tune the model on these samples (M-step)
This process corresponds to expectation-maximization-based RL! Check the math in the paper.
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models (TMLR) 2023. Singh*, Co-Reyes*, Agarwal* et al.
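A minimal Python-style sketch of this loop. The helpers generate, is_correct, and finetune are hypothetical placeholders for the sampling, binary-feedback, and training machinery, not APIs from the paper.

def rest_em(base_model, problems, generate, is_correct, finetune,
            num_iterations=3, samples_per_problem=32):
    """generate(model, problem, n) -> solutions; is_correct(problem, solution) -> bool;
    finetune(model, data) -> model. All three are assumed, illustrative helpers."""
    model = base_model
    for _ in range(num_iterations):
        # E-step: sample solutions and keep only those with positive binary feedback.
        data = [(p, s) for p in problems
                for s in generate(model, p, n=samples_per_problem)
                if is_correct(p, s)]
        # M-step: fine-tune on the filtered samples (the paper fine-tunes from the
        # base checkpoint each iteration rather than the previous iteration's model).
        model = finetune(base_model, data)
    return model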
10. Expectation-Maximization explained
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models (TMLR) 2023. Singh*, Co-Reyes*, Agarwal* et al.
11. Problem-Solving tasks: Math & Coding
Hendrycks MATH
APPS Coding (Intro)
We will buy a product for N yen (the currency of Japan) at
a shop. If we use only 1000-yen bills to pay the price, how
much change will we receive? Assume we use the
minimum number of bills required.
-----Constraints----- - 1 \leq N \leq 10000 - N is an integer.
-----Input----- Input is given from Standard Input in the
following format: N
-----Output----- Print the amount of change as an integer.
-----Sample Input-----
1900
-----Sample Output-----
100
We will use two 1000-yen bills to pay the price and receive
100 yen in change.
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models (TMLR) 2023. Singh*, Co-Reyes*, Agarwal* et al.
12. This… beats human data!
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models. 2023. Singh*, Co-Reyes*, Agarwal* et al.
13. ReST^EM works on coding too.
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models. 2023. Singh*, Co-Reyes*, Agarwal* et al.
14. Overfitting is an issue
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models. 2023. Singh*, Co-Reyes*, Agarwal* et al.
15. Pass@K performance improves as well
Pass@K measures the probability that at least one of the K generated solutions for a problem is correct.
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models. 2023. Singh*, Co-Reyes*, Agarwal* et al.
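Pass@K is commonly estimated with the unbiased estimator popularized by the Codex paper (Chen et al., 2021), which is what the sketch below computes (not something specific to these slides): sample n ≥ K solutions, count the c correct ones, and take 1 − C(n−c, K)/C(n, K).

import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k given n sampled solutions, c of which are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct solution
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

print(pass_at_k(n=32, c=4, k=8))  # estimated chance a batch of 8 samples contains a correct one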
16. Apples-to-Apples Comparison
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models. 2023. Singh*, Co-Reyes*, Agarwal* et al.
17. Distilling PaLM 2-S using PaLM 2-L
18. Impact on reasoning tasks
19. Held-Out Eval: 2023 Hungarian HS Exam
20. Things we learned so far:
● Self-generated data improves performance, given reliable reward.
● Self-generated data can often outperform human data – it’s more in-distribution!
21. Revisiting ReST^EM
Repeat this process a few times:
1. Generate samples from the model and filter them using
binary feedback.
2. Fine-tune the model on these samples
This discards the large number of incorrect solutions generated along the way, potentially neglecting valuable information!
22. Incorrect solutions for training verifiers
[Diagram: as before, the LLM samples several responses to the stock-loss problem; this time every response, correct or not, is labeled with "Is the response correct?" and used to train a learned verifier.]
Let's Verify Step by Step. OpenAI, 2023.
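A hedged sketch of how such labeled verifier data could be assembled (generate and is_correct are hypothetical helpers, not code from the paper); the point is that incorrect solutions are kept rather than discarded.

def build_verifier_data(model, problems, generate, is_correct, n=16):
    """Turn sampled solutions into labeled examples for a binary verifier."""
    examples = []
    for problem in problems:
        for solution in generate(model, problem, n=n):
            label = 1 if is_correct(problem, solution) else 0  # keep the incorrect ones too
            examples.append({"input": problem + "\n" + solution, "label": label})
    return examples  # train the verifier (a binary classifier) on these examples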
23. How to use a verifier?
[Diagram: for a new problem ("A stock loses 5% of its value on Monday. On Tuesday it loses 10%..."), the LLM samples several responses and the verifier calculates, for each one, the probability of it being correct.]
Let's Verify Step by Step. OpenAI, 2023.
24. Idea: Augmenting ReST^EM with a verifier
Test-time verification
x = problem, y = model-generated solution
V-STaR: Training Verifiers for Self-Taught Reasoners. Hosseini et al. 2024
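A minimal sketch of best-of-N test-time verification (my illustration, with hypothetical generate and verifier_score helpers): sample N solutions y for a problem x and return the one the verifier scores highest.

def best_of_n(generate, verifier_score, problem, n=16):
    candidates = generate(problem, n=n)               # sample N solutions y for problem x
    return max(candidates, key=lambda y: verifier_score(problem, y))  # argmax over V(x, y)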
25. V-STaR: ReST^EM + verifier works quite well!
[Chart: V-STaR compared against ReST^EM]
Large gains on math and code reasoning with LLaMA2 7B and 13B models.
V-STaR: Training Verifiers for Self-Taught Reasoners. Hosseini et al. 2024
26. V-STaR: Performance across iterations
V-STaR: Training Verifiers for Self-Taught Reasoners. Hosseini et al. 2024
27. A Strong Baseline: Majority Voting
[Diagram: the LLM samples several responses to the stock-loss problem; the final answer is extracted from each (e.g. 10, 11, 5, 10) and majority voting returns the most common answer (10).]
Self-Consistency Improves Chain of Thought Reasoning in Language Models. Wang et al., 2022.
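A tiny runnable sketch of majority voting (self-consistency): extract a final answer from each sampled response and return the most common one.

from collections import Counter

def majority_vote(final_answers):
    """Self-consistency baseline: pick the most common final answer across samples."""
    return Counter(final_answers).most_common(1)[0][0]

print(majority_vote(["10", "11", "5", "10"]))  # -> "10"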
28. V-STaR Scales Better With Test-Time
Compute
V-STaR: Training Verifiers for Self-Taught Reasoners. Hosseini et al. 2024
29. Things we learned so far:
● Self-generated data improves performance, given reliable reward.
● Self-generated data can often outperform human data – it’s more in-distribution!
● We can train a verifier, using both correct and incorrect solutions.
● Verifiers can help make use of test-time compute.
30. Revisiting Verifiers
[Diagram: the LLM's sampled responses to the stock-loss problem are each scored by a binary classifier (itself an LLM) that answers "Is the response correct?".]
Let's Verify Step by Step. Lightman et al., 2023.
31. Train Verifiers as Next-token Predictors
[Diagram: the problem ("A stock loses 10% of its value on Monday. On Tuesday it loses 20%...") and a candidate solution ("The stock loses 10% of its value on Monday…") are fed to a Generative Verifier (GenRM), which verifies via next-token prediction.]
Generative Verifiers: Reward Modeling as Next-Token Prediction. Zhang et al., 2024.
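A sketch of the core idea, under my assumptions about the prompt template and a hypothetical token_probability helper: the verifier's score is the probability the model assigns to a "Yes" token when asked whether the solution is correct.

def genrm_score(token_probability, problem, solution):
    # The exact wording/tokenization of this template is an assumption, not from the paper.
    prompt = f"{problem}\n{solution}\nIs the answer correct (Yes/No)?"
    # Reward = next-token probability of "Yes" under the generative verifier.
    return token_probability(prompt, token=" Yes")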
32. Generative Verifiers Can Use
Chain-of-Thought (CoT) Reasoning
Generative Verifiers: Reward Modeling as Next-Token Prediction. Zhang et al., 2024.
33. Generative Verifiers Can Use
Chain-of-Thought Reasoning
Generative Verifiers: Reward Modeling as Next-Token Prediction. Zhang et al., 2024.
34. Generative Verifiers Unify Generation
and Verification Tasks
Generative Verifiers: Reward Modeling as Next-Token Prediction. Zhang et al., 2024.
35. Generative Verifiers Can Use Test-Time
Compute (“Think More”)
Generative Verifiers: Reward Modeling as Next-Token Prediction. Zhang et al., 2024.
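One hedged way to "think more" at verification time, sketched with hypothetical sample_rationale and yes_probability helpers: sample several chain-of-thought verification rationales for the same solution and average their "Yes" probabilities.

def genrm_cot_score(sample_rationale, yes_probability, problem, solution, k=8):
    scores = []
    for _ in range(k):
        rationale = sample_rationale(problem, solution)     # a CoT critique of the solution
        scores.append(yes_probability(problem, solution, rationale))
    return sum(scores) / len(scores)   # more verifier compute -> a better-averaged score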
36. Things we learned so far:
● Self-generated data improves performance, given reliable reward.
● Self-generated data can often outperform human data – it’s more in-distribution!
● We can train a verifier, using both correct and incorrect solutions.
● Verifiers can help make use of test-time compute.
● Training verifiers with next-token prediction has a lot of benefits!
37. Revisiting ReST^EM again!
Repeat this process a few times:
1. Generate samples from the LLM and filter them using
binary feedback.
2. Fine-tune the model on these samples
What if we also have access to a smaller language model?
38. Compute-Matched Sampling
For autoregressive language models, sampling cost (FLOPs) ≈ 2·N·D, where N is the number of model parameters and D is the number of inference tokens.
Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling. Bansal et al, 2024
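Under this 2ND approximation, a fixed FLOPs budget buys roughly N_strong / N_weak times more samples from the smaller model (assuming similar solution lengths D). A small sketch; the 27B-vs-9B ratio is just an illustrative example.

def compute_matched_samples(n_params_strong, n_params_weak, samples_from_strong=1):
    """Samples the weaker model can produce for the same FLOPs as the stronger model."""
    return round(samples_from_strong * n_params_strong / n_params_weak)

print(compute_matched_samples(27e9, 9e9))  # -> 3 weak-model samples per strong-model sample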
39. Compute-Matched Sampling Tradeoffs
Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling. Bansal et al, 2024
40. Compute-Matched Sampling Is Better!
Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling. Bansal et al, 2024
41. Cost-Matched Sampling is Even Better!
Price of Gemini 1.5 Pro ≈ 35x price of Gemini 1.5 Flash
Knowledge distillation: Gemma-7B, 9B, and 27B
Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling. Bansal et al, 2024
42. Things we learned so far:
● Self-generated data improves performance, given reliable reward.
● Self-generated data can often outperform human data – it’s more in-distribution!
● We can train a verifier, using both correct and incorrect solutions.
● Verifiers can help make use of test-time compute.
● Training verifiers with next-token prediction has a lot of benefits!
● Consider whether a smaller model can generate better synthetic data for a given amount of compute.
43. Revisiting ReST^EM (yet again!)
Repeat this process a few times:
1. Generate samples from the model and filter them using
binary feedback.
2. Fine-tune the model on these samples
Is fine-tuning necessary? Wait, what?
44. Background: In-Context Learning
45. Many-Shot In-Context Learning
Many-Shot In-Context Learning. Agarwal et al, 2024
46. In-Context ReST^EM: Reinforced ICL
1. Generate samples from the model and filter
them using binary feedback.
2. Put these (problem, solution) pairs
in-context for the model.
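A minimal sketch of this idea, with hypothetical generate and is_correct helpers: reuse the ReST^EM E-step, but place the filtered (problem, solution) pairs into a many-shot prompt instead of fine-tuning on them.

def reinforced_icl_prompt(generate, is_correct, train_problems, test_problem, n=16):
    shots = []
    for problem in train_problems:
        for solution in generate(problem, n=n):
            if is_correct(problem, solution):
                shots.append(f"Problem: {problem}\nSolution: {solution}")
                break  # keep one correct, self-generated solution per problem
    # Many-shot prompt: long context windows can hold hundreds of such examples.
    return "\n\n".join(shots) + f"\n\nProblem: {test_problem}\nSolution:"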
47. Reinforced ICL on MATH
48. Reinforced ICL on Big-Bench Hard
49. Reinforced ICL: Iteration 2
50. Thank you!
Questions?