Improving LLM Reasoning using Self-generated data: RL and Verifiers

1. Improving LLM Reasoning using Self-generated data: RL and Verifiers. Rishabh Agarwal, Research Scientist, Google DeepMind
2. Large Language Models (LLMs)
3. Slide Credit: Hritik Bansal
4. Training LLMs needs high-quality data. High-quality data is scraped from the web or collected from humans.
5. Are we running out of high-quality data? Human data is:
   ➢ Time-consuming and expensive to scale
   ➢ Hard to create for complex tasks
   Source: epochai.org
6. Synthetic data to the rescue? What if the models could generate their own training outputs? Naively doing so can result in model collapse! The Curse of Recursion: Training on Generated Data Makes Models Forget. Shumailov et al, 2023.
7. Synthetic data to the rescue? Verification can often be easier than generation! Example task: "Given a string, find the length of the longest substring without repeating characters." Generating code can be harder than verifying it via test-case execution. Solving a sudoku puzzle is harder than checking a solution! Can we use model-generated data for training, given access to some form of feedback?
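To make the "verification via test-case execution" point concrete, here is a minimal sketch. The candidate solution stands in for model-generated code, and the test cases and the helper name `passes_tests` are illustrative assumptions, not something shown on the slide.

```python
from typing import Callable, List, Tuple


def longest_unique_substring(s: str) -> int:
    """Candidate solution (stand-in for model output): sliding window."""
    last_seen = {}  # character -> index of its most recent occurrence
    start = 0       # left edge of the current window without repeats
    best = 0
    for i, ch in enumerate(s):
        if ch in last_seen and last_seen[ch] >= start:
            start = last_seen[ch] + 1  # jump past the repeated character
        last_seen[ch] = i
        best = max(best, i - start + 1)
    return best


# Hypothetical test cases: (input, expected output).
TEST_CASES: List[Tuple[str, int]] = [
    ("abcabcbb", 3),
    ("bbbbb", 1),
    ("pwwkew", 3),
    ("", 0),
]


def passes_tests(candidate: Callable[[str], int],
                 test_cases: List[Tuple[str, int]]) -> bool:
    """Binary feedback: True only if every test case passes."""
    for given, expected in test_cases:
        try:
            if candidate(given) != expected:
                return False
        except Exception:
            return False  # a crashing candidate counts as incorrect
    return True
```

Writing `longest_unique_substring` takes some thought, while `passes_tests` only runs the candidate and compares outputs. That asymmetry is what makes binary feedback cheap to obtain.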
8. How do we self-generate data for problem-solving?
   Problem: "A stock loses 10% of its value on Monday. On Tuesday it loses 20%..."
   LLM responses:
   ● "Okay, so I have to find the percent of the starting…"
   ● "The stock loses 10% of its value on Monday…"
   ● "I need to find the overall percent loss in value…"
   ● "Let's start by representing the unknown value of..."
   ● "Let V be the value of the stock at the beginning of…"
   Is the response correct? Responses that pass the check become model-generated fine-tuning data.
9. A simple recipe for self-training (ReST^EM). Repeat this process a few times:
   1. Generate samples from the model and filter them using binary feedback (E-step).
   2. Fine-tune the model on these samples (M-step).
   This process corresponds to expectation-maximization-based RL! Check the math in the paper. Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models (TMLR) 2023. Singh*, Co-Reyes*, Agarwal* et al.
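The two-step recipe can be sketched as a short loop. This is a schematic under stated assumptions, not the authors' implementation: `sample_fn`, `is_correct`, and `finetune_fn` are hypothetical callables supplied by the caller, and the sampling budget is arbitrary.

```python
from typing import Callable, List, Tuple, TypeVar

Model = TypeVar("Model")
Pair = Tuple[str, str]  # (problem, solution)


def rest_em(
    base_model: Model,
    problems: List[str],
    sample_fn: Callable[[Model, str], str],             # hypothetical: draw one solution
    is_correct: Callable[[str, str], bool],             # binary feedback
    finetune_fn: Callable[[Model, List[Pair]], Model],  # hypothetical: return a tuned model
    num_iterations: int = 3,
    samples_per_problem: int = 4,
) -> Model:
    model = base_model
    for _iteration in range(num_iterations):
        # E-step: sample from the current model, keep only correct solutions.
        kept: List[Pair] = []
        for problem in problems:
            for _attempt in range(samples_per_problem):
                solution = sample_fn(model, problem)
                if is_correct(problem, solution):
                    kept.append((problem, solution))
        # M-step: fine-tune on the filtered samples.
        # Assumption: each round restarts from base_model rather than the
        # previous round's model; the slide does not specify this.
        model = finetune_fn(base_model, kept)
    return model
```

Only the filtered samples ever reach `finetune_fn`, so the quality of the loop depends entirely on how reliable `is_correct` is.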
10. Expectation-Maximization explained Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models (TMLR) 2023. Singh*, Co-reyes*, Agarwal* et al
11. Problem-Solving tasks: Math (Hendrycks MATH) & Coding (APPS, Intro). Example APPS problem:
   We will buy a product for N yen (the currency of Japan) at a shop. If we use only 1000-yen bills to pay the price, how much change will we receive? Assume we use the minimum number of bills required.
   Constraints: 1 ≤ N ≤ 10000; N is an integer.
   Input: given from Standard Input in the following format: N
   Output: print the amount of change as an integer.
   Sample Input: 1900
   Sample Output: 100
   Explanation: we will use two 1000-yen bills to pay the price and receive 100 yen in change.
   Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models (TMLR) 2023. Singh*, Co-Reyes*, Agarwal* et al
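For reference, one possible solution to the example APPS problem is sketched below. The function name `change_for` is an illustrative choice, and reading N from standard input is left out to keep the example short.

```python
def change_for(price: int) -> int:
    """Change received when paying `price` yen with the fewest 1000-yen bills."""
    # price % 1000 is the amount beyond the last full bill; the outer
    # modulo handles prices that are exact multiples of 1000 (no change).
    return (1000 - price % 1000) % 1000


print(change_for(1900))  # sample input from the problem statement; prints 100
```

A solution like this is exactly what test-case execution can check automatically, which is what provides the binary feedback on coding tasks.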
12. This… beats human data! Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models. 2023. Singh*, Co-reyes*, Agarwal* et al
13. ReST^EM works on coding too. Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models. 2023. Singh*, Co-Reyes*, Agarwal* et al
14. Overfitting is an issue Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models. 2023. Singh*, Co-reyes*, Agarwal* et al
15. Pass@K performance improves as well. Pass@K measures the probability that at least one of K generated solutions for a problem is correct. Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models. 2023. Singh*, Co-Reyes*, Agarwal* et al
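Pass@K is commonly estimated from n samples per problem, of which c are correct, using the unbiased estimator 1 - C(n-c, k) / C(n, k) introduced with the HumanEval benchmark (Chen et al., 2021). The sketch below implements that estimator; whether the talk's plots use exactly this estimator is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total samples drawn for the problem
    c: number of those samples that are correct
    k: number of samples we are allowed to submit (k <= n)
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain at least one correct sample.
        return 1.0
    # 1 minus the probability that a random k-subset is all incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The per-problem values are then averaged over the benchmark to give the reported Pass@K.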
16. Apples-to-Apples Comparison Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models. 2023. Singh*, Co-reyes*, Agarwal* et al
17. Distilling Palm-2-S using L
18. Impact on reasoning tasks
19. Held-Out Eval: 2023 Hungarian HS Exam
20. Things we learned so far:
   ● Self-generated data improves performance, given reliable reward.
   ● Self-generated data can often outperform human data, as it is more in-distribution!
21. Revisiting ReST^EM. Repeat this process a few times:
   1. Generate samples from the model and filter them using binary feedback.
   2. Fine-tune the model on these samples.
   This discards the large number of incorrect solutions generated during the process, potentially neglecting valuable information!
22. Incorrect solutions for training verifiers
   Problem: "A stock loses 10% of its value on Monday. On Tuesday it loses 20%..."
   LLM responses:
   ● "Okay, so I have to find the percent of the starting…"
   ● "The stock loses 10% of its value on Monday…"
   ● "I need to find the overall percent loss in value…"
   ● "Let's start by representing the unknown value of..."
   Is the response correct? Both correct and incorrect responses are used to train a learned verifier.
   Let's Verify Step by Step. OpenAI, 2023.
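One way to turn both correct and incorrect samples into verifier training data is to pair them up per problem, as sketched below. The pairing scheme and the name `build_verifier_pairs` are assumptions for illustration; the slide does not specify how the training set is constructed.

```python
from itertools import product
from typing import Dict, List, Tuple

LabeledSample = Tuple[str, str, bool]    # (problem, solution, is_correct)
PreferencePair = Tuple[str, str, str]    # (problem, correct, incorrect)


def build_verifier_pairs(samples: List[LabeledSample]) -> List[PreferencePair]:
    """Group samples by problem and pair every correct solution
    with every incorrect one for the same problem."""
    by_problem: Dict[str, Dict[str, List[str]]] = {}
    for problem, solution, ok in samples:
        bucket = by_problem.setdefault(problem, {"pos": [], "neg": []})
        bucket["pos" if ok else "neg"].append(solution)

    pairs: List[PreferencePair] = []
    for problem, bucket in by_problem.items():
        # Problems with only correct or only incorrect samples yield no pairs.
        for good, bad in product(bucket["pos"], bucket["neg"]):
            pairs.append((problem, good, bad))
    return pairs
```

Such pairs could feed a preference-style objective, or the labeled samples could be used directly for binary classification. Either way, the incorrect solutions that filtering throws away become a training signal.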
23. How to use a verifier?
   Problem: "A stock loses 5% of its value on Monday. On Tuesday it loses 10%..."
   Each LLM response is passed to the verifier, which calculates its probability of being correct:
   ● "Okay, so I have to find the 10% of the starting…"
   ● "The stock loses 5% of its value on Monday…"
   ● "I need to find the overall percent loss in value…"
   ● "Let's start by representing the unknown value of..."
   Let's Verify Step by Step. OpenAI, 2023.
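A common way to use such per-response probabilities at test time is best-of-N selection: score every candidate and return the highest-scoring one. The sketch below assumes a hypothetical `verifier(problem, solution)` callable that returns a probability of correctness.

```python
from typing import Callable, List


def best_of_n(
    problem: str,
    candidates: List[str],
    verifier: Callable[[str, str], float],  # hypothetical: P(solution is correct)
) -> str:
    """Return the candidate the verifier considers most likely correct."""
    if not candidates:
        raise ValueError("need at least one candidate solution")
    # Score each candidate once, then pick the argmax.
    scored = [(verifier(problem, y), y) for y in candidates]
    best_score, best_solution = max(scored, key=lambda pair: pair[0])
    return best_solution
```

Sampling more candidates gives the verifier more chances to find a correct solution, which is how a verifier converts extra test-time compute into accuracy.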
24. Idea: Augmenting ReST^EM with a verifier. Test-time verification, where x = problem and y = model-generated solution. V-STaR: Training Verifiers for Self-Taught Reasoners. Hosseini et al. 2024
25. V-STaR: ReST^EM + verifier works quite well! Large gains on math and code reasoning with LLaMA2 7B and 13B models. V-STaR: Training Verifiers for Self-Taught Reasoners. Hosseini et al. 2024
26. V-STaR: Performance across iterations V-STaR: Training Verifiers for Self-Taught Reasoners. Hosseini et al. 2024
27. A Strong Baseline: Majority Voting
   Problem: "A stock loses 10% of its value on Monday. On Tuesday it loses 20%..."
   LLM responses and their final answers:
   ● "Okay, so I have to find the percent of the starting…" Final answer: 10
   ● "The stock loses 10% of its value on Monday…" Final answer: 11
   ● "I need to find the overall percent loss in value…" Final answer: 5
   ● "Let's start by representing the unknown value of..." Final answer: 10
   Majority voting answer: 10
   Self-Consistency Improves Chain of Thought Reasoning in Language Models. Wang et al., 2022
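Majority voting needs no learned model at all: extract the final answer from each response and return the most frequent one. A minimal sketch, using the four final answers from the slide as input:

```python
from collections import Counter
from typing import List


def majority_vote(final_answers: List[str]) -> str:
    """Return the most common final answer.

    Ties go to the answer that appeared first, since Counter
    preserves insertion order among equal counts.
    """
    if not final_answers:
        raise ValueError("need at least one answer")
    return Counter(final_answers).most_common(1)[0][0]


print(majority_vote(["10", "11", "5", "10"]))  # prints 10
```

Unlike a verifier, this baseline only looks at final answers, so it cannot tell a well-reasoned minority answer from a popular mistake.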
28. V-STaR Scales Better With Test-Time Compute V-STaR: Training Verifiers for Self-Taught Reasoners. Hosseini et al. 2024
29. Things we learned so far:
   ● Self-generated data improves performance, given reliable reward.
   ● Self-generated data can often outperform human data, as it is more in-distribution!
   ● We can train a verifier, using both correct and incorrect solutions.
   ● Verifiers can help make use of test-time compute.
30. Revisiting Verifiers
   Problem: "A stock loses 10% of its value on Monday. On Tuesday it loses 20%..."
   LLM responses:
   ● "Okay, so I have to find the percent of the starting…"
   ● "The stock loses 10% of its value on Monday…"
   ● "I need to find the overall percent loss in value…"
   ● "Let's start by representing the unknown value of..."
   Is the response correct? Decided by a binary classifier (LLM).
   Let's Verify Step by Step. Lightman et al., 2023.
31. Train Verifiers as Next-token Predictors. The Generative Verifier (GenRM) takes the problem ("A stock loses 10% of its value on Monday. On Tuesday it loses 20%...") and a response ("The stock loses 10% of its value on Monday…") as input. Generative Verifiers: Reward Modeling as Next-Token Prediction. Zhang et al., 2024.
32. Generative Verifiers Can Use Chain-of-Thought (CoT) Reasoning Generative Verifiers: Reward Modeling as Next-Token Prediction. Zhang et al., 2024.
33. Generative Verifiers Can Use Chain-of-Thought Reasoning Generative Verifiers: Reward Modeling as Next-Token Prediction. Zhang et al., 2024.
34. Generative Verifiers Unify Generation and Verification Tasks Generative Verifiers: Reward Modeling as Next-Token Prediction. Zhang et al., 2024.
35. Generative Verifiers Can Use Test-Time Compute (“Think More”) Generative Verifiers: Reward Modeling as Next-Token Prediction. Zhang et al., 2024.
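A rough sketch of how a generative verifier could produce a score, assuming the score is read off as the next-token probability of "Yes" after a prompt such as "Is the answer correct (Yes/No)?", which is how the cited paper frames it (the slides do not spell this out). With chain-of-thought, "thinking more" then amounts to sampling several verification rationales and averaging their "Yes" probabilities. The two-token logit dictionaries are toy stand-ins for a full vocabulary distribution.

```python
import math
from typing import Dict, List


def yes_probability(next_token_logits: Dict[str, float]) -> float:
    """Softmax over the given logits; return the probability mass on "Yes"."""
    top = max(next_token_logits.values())  # subtract max for numerical stability
    exps = {tok: math.exp(v - top) for tok, v in next_token_logits.items()}
    return exps["Yes"] / sum(exps.values())


def genrm_cot_score(logits_per_rationale: List[Dict[str, float]]) -> float:
    """Average P("Yes") over K sampled verification rationales.

    Each entry holds the next-token logits obtained after one sampled
    chain-of-thought rationale (toy inputs here; a real model would
    supply logits over its whole vocabulary).
    """
    probs = [yes_probability(logits) for logits in logits_per_rationale]
    return sum(probs) / len(probs)
```

Because the score is an ordinary next-token probability, the same model and training objective can serve both generation and verification.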
36. Things we learned so far:
   ● Self-generated data improves performance, given reliable reward.
   ● Self-generated data can often outperform human data, as it is more in-distribution!
   ● We can train a verifier, using both correct and incorrect solutions.
   ● Verifiers can help make use of test-time compute.
   ● Training verifiers with next-token prediction has a lot of benefits!
37. Revisiting ReST^EM again! Repeat this process a few times:
   1. Generate samples from the LLM and filter them using binary feedback.
   2. Fine-tune the model on these samples.
   What if we also have access to a smaller language model?
38. Compute-Matched Sampling. For autoregressive language models, sampling cost (FLOPs) ≈ 2ND, where N is the number of model parameters and D is the number of inference tokens. Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling. Bansal et al, 2024
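The 2ND rule implies a simple exchange rate between models: at a fixed FLOPs budget and equal tokens per sample, the number of samples scales inversely with parameter count. The sketch below works this out; the 27B and 9B parameter counts and the 512-token sample length are illustrative assumptions.

```python
def sampling_flops(num_params: int, num_tokens: int) -> int:
    """Approximate inference cost of generating num_tokens tokens: 2 * N * D."""
    return 2 * num_params * num_tokens


def compute_matched_samples(large_params: int, small_params: int,
                            samples_from_large: int) -> float:
    """Samples the smaller model can draw for the same FLOPs budget,
    assuming both models generate the same number of tokens per sample."""
    return samples_from_large * large_params / small_params


# Illustrative sizes (assumed): a 27B-parameter and a 9B-parameter model.
LARGE, SMALL = 27_000_000_000, 9_000_000_000
ratio = compute_matched_samples(LARGE, SMALL, samples_from_large=1)
print(ratio)  # 3.0 samples from the small model per sample from the large one
```

Under this accounting, the question is whether three samples from the weaker model cover more problems, with acceptable error rates, than one sample from the stronger model.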
39. Compute-Matched Sampling Tradeoffs Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling. Bansal et al, 2024
40. Compute-Matched Sampling Is Better! Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling. Bansal et al, 2024
41. Cost-Matched Sampling is Even Better! Price of Gemini 1.5 Pro = 35x price of Gemini 1.5 Flash. Knowledge distillation: Gemma-7B, 9B, and 27B. Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling. Bansal et al, 2024
42. Things we learned so far:
   ● Self-generated data improves performance, given reliable reward.
   ● Self-generated data can often outperform human data, as it is more in-distribution!
   ● We can train a verifier, using both correct and incorrect solutions.
   ● Verifiers can help make use of test-time compute.
   ● Training verifiers with next-token prediction has a lot of benefits!
   ● Consider whether a smaller model can generate better synthetic data for a given amount of compute.
43. Revisiting ReST^EM (yet again!) Repeat this process a few times:
   1. Generate samples from the model and filter them using binary feedback.
   2. Fine-tune the model on these samples.
   Is fine-tuning necessary? Wait, what?
44. Background: In-Context Learning
45. Many-Shot In-Context Learning Many-Shot In-Context Learning. Agarwal et al, 2024
46. In-Context ReST^EM: Reinforced ICL
   1. Generate samples from the model and filter them using binary feedback.
   2. Put these (problem, solution) pairs in-context for the model.
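The two steps can be sketched as a filter followed by prompt construction. The "Problem:/Solution:" template and both function names are assumptions for illustration; the slide does not show the actual prompt format.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (problem, model-generated solution)


def collect_verified_pairs(
    samples: List[Pair],
    is_correct: Callable[[str, str], bool],  # binary feedback
) -> List[Pair]:
    """Step 1: keep only the samples that pass the correctness check."""
    return [(p, s) for p, s in samples if is_correct(p, s)]


def build_reinforced_icl_prompt(verified_pairs: List[Pair],
                                new_problem: str) -> str:
    """Step 2: place the verified pairs in-context, then append the
    new problem with an empty solution slot for the model to complete."""
    shots = [f"Problem: {p}\nSolution: {s}" for p, s in verified_pairs]
    shots.append(f"Problem: {new_problem}\nSolution:")
    return "\n\n".join(shots)
```

No weights are updated: the filtered samples are consumed through the context window, so the number of usable shots is bounded by context length rather than by training cost.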
47. Reinforced ICL on MATH
48. Reinforced ICL on Big-Bench Hard
49. Reinforced ICL: Iteration 2
50. Thank you! Questions?
