Improving LLM Reasoning using Self-generated Data: RL and Verifiers

1. Improving LLM Reasoning using Self-generated Data: RL and Verifiers. Rishabh Agarwal, Research Scientist, Google DeepMind
2. Large Language Models (LLMs)
3. Slide Credit: Hritik Bansal
4. Training LLMs needs high-quality data, scraped from the web or collected from humans.
5. Are we running out of high-quality data? ➢ Time-consuming and expensive to scale. ➢ Hard to create for complex tasks. (Source: epochai.org)
6. Synthetic data to the rescue? What if the models could generate their own training outputs? Naively doing so can result in model collapse! The Curse of Recursion: Training on Generated Data Makes Models Forget. Shumailov et al, 2023.
7. Synthetic data to the rescue? Verification can often be easier than generation! Example: given a string, find the length of the longest substring without repeating characters; generating the code can be harder than verifying it via test-case execution. Similarly, solving a Sudoku puzzle is harder than checking one! Can we use model-generated data for training, given access to some form of feedback?
8. How do we self-generate data for problem-solving? [Figure: a problem ("A stock loses 10% of its value on Monday. On Tuesday it loses 20%...") is given to an LLM, which samples multiple responses; each response is checked ("Is the response correct?") and the correct ones become model-generated fine-tuning data.]
9. A simple recipe for self-training (ReST^EM). Repeat this process a few times: 1. Generate samples from the model and filter them using binary feedback (E-step). 2. Fine-tune the model on these samples (M-step). This process corresponds to expectation-maximization-based RL; check the math in the paper. Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models (TMLR), 2023. Singh*, Co-Reyes*, Agarwal*, et al.
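A minimal Python sketch of this loop, under stated assumptions: `generate_fn`, `reward_fn`, and `finetune_fn` are hypothetical stand-ins for the model's sampler, the binary feedback (e.g. final-answer matching or test execution), and the fine-tuning step; they are not from any specific library.

```python
def rest_em(base_model, problems, generate_fn, reward_fn, finetune_fn,
            num_iterations=3, samples_per_problem=32):
    """Sketch of ReST^EM: iterate (sample + filter) -> fine-tune."""
    model = base_model
    for _ in range(num_iterations):
        # E-step: sample candidate solutions and keep only positively rewarded ones.
        dataset = []
        for problem in problems:
            for solution in generate_fn(model, problem, n=samples_per_problem):
                if reward_fn(problem, solution):  # binary feedback
                    dataset.append((problem, solution))
        # M-step: fine-tune on the filtered data; the paper fine-tunes from the
        # base checkpoint each iteration rather than continuing from the last one.
        model = finetune_fn(base_model, dataset)
    return model
```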
10. Expectation-Maximization explained. Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models (TMLR), 2023. Singh*, Co-Reyes*, Agarwal*, et al.
11. Problem-Solving tasks: Math & Coding. Datasets: Hendrycks MATH and APPS Coding (Introductory). Example APPS problem: We will buy a product for N yen (the currency of Japan) at a shop. If we use only 1000-yen bills to pay the price, how much change will we receive? Assume we use the minimum number of bills required. -----Constraints----- - 1 \leq N \leq 10000 - N is an integer. -----Input----- Input is given from Standard Input in the following format: N -----Output----- Print the amount of change as an integer. -----Sample Input----- 1900 -----Sample Output----- 100 We will use two 1000-yen bills to pay the price and receive 100 yen in change. Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models (TMLR), 2023. Singh*, Co-Reyes*, Agarwal*, et al.
12. This… beats human data! Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models. 2023. Singh*, Co-Reyes*, Agarwal*, et al.
13. ReST^EM works on coding too. Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models. 2023. Singh*, Co-Reyes*, Agarwal*, et al.
14. Overfitting is an issue. Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models. 2023. Singh*, Co-Reyes*, Agarwal*, et al.
15. Pass@K performance improves as well. Pass@K measures the probability that at least one of the K generated solutions for a problem is correct. Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models. 2023. Singh*, Co-Reyes*, Agarwal*, et al.
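For concreteness, pass@k is usually computed with the standard unbiased estimator of Chen et al. (2021): sample n solutions per problem, count c correct ones, and estimate the chance that a random size-k subset contains at least one correct solution.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k solutions,
    drawn without replacement from n samples (c of them correct), is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: with 64 samples per problem and 10 correct ones,
# pass@1 ≈ 0.16 and pass@8 ≈ 0.76.
print(pass_at_k(64, 10, 1), pass_at_k(64, 10, 8))
```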
16. Apples-to-Apples Comparison. Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models. 2023. Singh*, Co-Reyes*, Agarwal*, et al.
17. Distilling PaLM 2-S using PaLM 2-L
18. Impact on reasoning tasks
19. Held-Out Eval: 2023 Hungarian HS Exam
20. Things we learned so far: ● Self-generated data improves performance, given reliable reward. ● Self-generated data can often outperform human data – it's more in-distribution!
21. Revisiting ReST^EM. Repeat this process a few times: 1. Generate samples from the model and filter them using binary feedback. 2. Fine-tune the model on these samples. This discards the large number of incorrect solutions generated along the way, potentially neglecting valuable information!
22. Incorrect solutions for training verifiers. [Figure: the same stock-loss problem and sampled LLM responses, with each response labeled correct or incorrect ("Is the response correct?") and used to train a learned verifier.] Let's Verify Step by Step. OpenAI, 2023.
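A minimal sketch of how the otherwise-discarded incorrect solutions can be reused as verifier training data; `generate_fn` and `is_correct_fn` are hypothetical helpers, not any particular API.

```python
def build_verifier_dataset(model, problems, generate_fn, is_correct_fn,
                           samples_per_problem=32):
    """Label every sampled solution (correct and incorrect) for verifier training."""
    examples = []
    for problem in problems:
        for solution in generate_fn(model, problem, n=samples_per_problem):
            examples.append({
                "problem": problem,
                "solution": solution,
                "label": 1 if is_correct_fn(problem, solution) else 0,
            })
    # A binary "is this solution correct?" classifier is then trained on these examples.
    return examples
```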
23. How to use a verifier? [Figure: for a given problem, each sampled LLM response is scored by the verifier, which calculates a probability of the response being correct.] Let's Verify Step by Step. OpenAI, 2023.
24. Idea: Augmenting ReST^EM with a verifier. Test-time verification: x = problem, y = model-generated solution. V-STaR: Training Verifiers for Self-Taught Reasoners. Hosseini et al., 2024.
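A sketch of the test-time verification step (Best-of-N reranking), assuming a hypothetical `generate_fn` sampler and a `verifier_fn(x, y)` that returns an estimated probability of correctness.

```python
def best_of_n(model, problem, generate_fn, verifier_fn, n=64):
    """Sample n candidate solutions y for problem x and return the one
    the verifier scores as most likely to be correct."""
    candidates = generate_fn(model, problem, n=n)
    return max(candidates, key=lambda y: verifier_fn(problem, y))
```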
25. V-STaR: ReST^EM + verifier works quite well! Large gains on math and code reasoning with LLaMA2 7B and 13B models. V-STaR: Training Verifiers for Self-Taught Reasoners. Hosseini et al., 2024.
26. V-STaR: Performance across iterations V-STaR: Training Verifiers for Self-Taught Reasoners. Hosseini et al. 2024
27. A Strong Baseline: Majority Voting. [Figure: the LLM samples several responses to one problem, with final answers 10, 11, 5, 10, ...; majority voting over the final answers selects 10.] Self-Consistency Improves Chain of Thought Reasoning in Language Models. Wang et al., 2022.
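A sketch of this self-consistency baseline: sample several chains of thought, parse out each final answer, and return the most frequent one. `generate_fn` and `extract_answer_fn` are hypothetical helpers.

```python
from collections import Counter

def majority_vote(model, problem, generate_fn, extract_answer_fn, n=16):
    """Self-consistency: most common final answer across n sampled solutions."""
    answers = [extract_answer_fn(solution)
               for solution in generate_fn(model, problem, n=n)]
    return Counter(answers).most_common(1)[0][0]
```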
28. V-STaR Scales Better With Test-Time Compute V-STaR: Training Verifiers for Self-Taught Reasoners. Hosseini et al. 2024
29. Things we learned so far: ● Self-generated data improves performance, given reliable reward. ● Self-generated data can often outperform human data – it's more in-distribution! ● We can train a verifier, using both correct and incorrect solutions. ● Verifiers can help make use of test-time compute.
30. Revisiting Verifiers. [Figure: the problem and each sampled LLM response are fed to a binary classifier (LLM) that predicts whether the response is correct.] Let's Verify Step by Step. Lightman et al., 2023.
31. Train Verifiers as Next-token Predictors. [Figure: the problem and a candidate solution are fed to a Generative Verifier (GenRM), which is trained with next-token prediction to say whether the solution is correct.] Generative Verifiers: Reward Modeling as Next-Token Prediction. Zhang et al., 2024.
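A rough sketch of scoring with a generative verifier, assuming a Hugging Face-style causal LM and tokenizer; the verification prompt and the "Yes"/"No" answer format are illustrative assumptions, not the paper's exact setup. The verifier score is simply the next-token probability of "Yes".

```python
import torch
import torch.nn.functional as F

def genrm_score(model, tokenizer, problem, solution):
    """Generative verifier score: p("Yes" | verification prompt),
    computed with ordinary next-token prediction."""
    prompt = f"{problem}\n{solution}\nIs the answer correct? "
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    probs = F.softmax(next_token_logits, dim=-1)
    # Depending on the tokenizer, " Yes" (with a leading space) may be the right token.
    yes_id = tokenizer("Yes", add_special_tokens=False).input_ids[0]
    return probs[yes_id].item()
```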
32. Generative Verifiers Can Use Chain-of-Thought (CoT) Reasoning. Generative Verifiers: Reward Modeling as Next-Token Prediction. Zhang et al., 2024.
33. Generative Verifiers Can Use Chain-of-Thought Reasoning. Generative Verifiers: Reward Modeling as Next-Token Prediction. Zhang et al., 2024.
34. Generative Verifiers Unify Generation and Verification Tasks. Generative Verifiers: Reward Modeling as Next-Token Prediction. Zhang et al., 2024.
35. Generative Verifiers Can Use Test-Time Compute ("Think More"). Generative Verifiers: Reward Modeling as Next-Token Prediction. Zhang et al., 2024.
36. Things we learned so far: ● Self-generated data improves performance, given reliable reward. ● Self-generated data can often outperform human data – it's more in-distribution! ● We can train a verifier, using both correct and incorrect solutions. ● Verifiers can help make use of test-time compute. ● Training verifiers with next-token prediction has a lot of benefits!
37. Revisiting ReST^EM again! Repeat this process a few times: 1. Generate samples from the LLM and filter them using binary feedback. 2. Fine-tune the model on these samples. What if we also have access to a smaller language model?
38. Compute-Matched Sampling. For autoregressive language models, sampling cost (FLOPs) ≈ 2·N·D, where N is the number of model parameters and D is the number of inference tokens. Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling. Bansal et al., 2024.
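A small worked example of the budget implied by the 2·N·D cost: at a fixed sampling FLOP budget (and comparable solution length D), a model with fewer parameters can be sampled proportionally more times. The parameter counts below are purely illustrative.

```python
def compute_matched_samples(params_strong, params_weak, samples_strong=1):
    """Samples per problem from the weaker model that match the FLOPs of
    `samples_strong` samples from the stronger model (cost ≈ 2*N*D each)."""
    return int(samples_strong * params_strong / params_weak)

# e.g. 1 sample/problem from a 27B model costs about the same as
# 3 samples/problem from a 9B model.
print(compute_matched_samples(27e9, 9e9))  # -> 3
```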
39. Compute-Matched Sampling Tradeoffs Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling. Bansal et al, 2024
40. Compute-Matched Sampling Is Better! Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling. Bansal et al, 2024
41. Cost-Matched Sampling is Even Better! Price of Gemini 1.5 Pro ≈ 35x price of Gemini 1.5 Flash. Knowledge distillation: Gemma-7B, 9B, and 27B. Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling. Bansal et al., 2024.
42. Things we learned so far: ● Self-generated data improves performance, given reliable reward. ● Self-generated data can often outperform human data – it's more in-distribution! ● We can train a verifier, using both correct and incorrect solutions. ● Verifiers can help make use of test-time compute. ● Training verifiers with next-token prediction has a lot of benefits! ● Consider whether a smaller model can generate better synthetic data for a given amount of compute.
43. Revisiting ReST^EM (yet again!). Repeat this process a few times: 1. Generate samples from the model and filter them using binary feedback. 2. Fine-tune the model on these samples. Is fine-tuning necessary? Wait, what?
44. Background: In-Context Learning
45. Many-Shot In-Context Learning Many-Shot In-Context Learning. Agarwal et al, 2024
46. In-Context ReST^EM: Reinforced ICL. 1. Generate samples from the model and filter them using binary feedback. 2. Put these (problem, solution) pairs in-context for the model.
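A minimal sketch of step 2: instead of fine-tuning, build a many-shot prompt from the filtered pairs. The "Problem:"/"Solution:" prompt format is an assumption for illustration.

```python
def reinforced_icl_prompt(filtered_pairs, new_problem, max_shots=50):
    """Place self-generated, verified (problem, solution) pairs in-context
    as many-shot examples, then append the new problem to solve."""
    shots = [f"Problem: {p}\nSolution: {s}" for p, s in filtered_pairs[:max_shots]]
    return "\n\n".join(shots) + f"\n\nProblem: {new_problem}\nSolution:"
```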
47. Reinforced ICL on MATH
48. Reinforced ICL on Big-Bench Hard
49. Reinforced ICL: Iteration 2
50. Thank you! Questions?
