Teaching Language Models to Solve Sudoku Through Reinforcement Learning

In the world of AI and language models, we often hear about systems that can write essays, generate code, or answer complex questions. But what about teaching them to solve puzzles that require structured thinking, spatial reasoning, and logical deduction? This is where my recent experiment comes in—teaching language models to solve Sudoku puzzles through reinforcement learning.

Sudoku presents a fascinating challenge for language models. Unlike open-ended text generation, solving a Sudoku puzzle requires:

  • Following strict rules (each row, column, and box must contain numbers 1-9 without repetition)
  • Maintaining a consistent grid format
  • Applying step-by-step logical reasoning
  • Understanding spatial relationships between grid elements
  • Arriving at a single correct solution

What makes this particularly interesting is that language models aren’t designed for structured problem-solving. They’re trained to predict text, not to follow logical rules or maintain grid structures. Yet with the right approach, they can learn these skills.

For this experiment, I leveraged a dataset of 4 million Sudoku puzzles from Kaggle, ranging from very easy to challenging difficulties. The dataset preparation involved several key steps:

  1. Loading and filtering: I used the kagglehub library to download the dataset and filtered puzzles based on difficulty levels.
  2. Difficulty classification: Puzzles were categorized into four difficulty levels based on the number of clues:
    • Level 1 (Very Easy): 50-81 clues
    • Level 2 (Easy): 40-49 clues
    • Level 3 (Medium): 30-39 clues
    • Level 4 (Hard): 17-29 clues
  3. Grid formatting: Each puzzle, originally represented as a string of 81 characters, was transformed into a visually appealing grid format with proper row, column, and box separators (a formatting helper is sketched after this list):
9 1 2 | 3 6 4 | 8 5 7
6 3 4 | 7 5 8 | 2 9 1
5 7 8 | 1 9 2 | 4 6 3
- - - + - - - + - - -
4 5 3 | 9 8 7 | 1 2 6
2 6 1 | 5 4 3 | 7 8 9
8 9 7 | 6 2 1 | 3 4 5
- - - + - - - + - - -
7 2 9 | 4 1 6 | 5 3 8
3 8 5 | 2 7 9 | 6 1 4
1 4 6 | 8 3 5 | 9 7 2
  4. Prompt engineering: Each puzzle was wrapped in a carefully designed prompt that instructed the model to:
  • Think through the solution step by step in <think> tags
  • Provide the final answer in <answer> tags with proper grid formatting
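
To make this pipeline concrete, here is a minimal sketch of steps 2-4 above (clue counting, grid formatting, and prompt wrapping). The helper names and the exact prompt wording are illustrative assumptions, not the verbatim code used in training:

def difficulty_level(puzzle: str) -> int:
    """Map an 81-character puzzle string ('0' = blank) to the four difficulty buckets."""
    clues = sum(ch != "0" for ch in puzzle)
    if clues >= 50:
        return 1  # very easy
    if clues >= 40:
        return 2  # easy
    if clues >= 30:
        return 3  # medium
    return 4      # hard (down to 17 clues)

def format_grid(puzzle: str) -> str:
    """Render an 81-character string as a 9x9 grid with '|' and '- + -' separators."""
    rows = []
    for r in range(9):
        digits = puzzle[r * 9:(r + 1) * 9]
        rows.append(" | ".join(" ".join(digits[c:c + 3]) for c in range(0, 9, 3)))
        if r in (2, 5):
            rows.append("- - - + - - - + - - -")
    return "\n".join(rows)

def build_prompt(puzzle: str) -> str:
    """Wrap the formatted grid in instructions to reason in <think> and answer in <answer>."""
    return (
        "Solve this Sudoku puzzle. Think step by step inside <think>...</think> tags, "
        "then give the completed grid, in the same format as the input, inside "
        "<answer>...</answer> tags.\n\n" + format_grid(puzzle)
    )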

For this initial experiment, I created a focused dataset of 400 training samples, primarily using easier puzzles to establish a baseline for learning. The dataset was deliberately kept small to test how efficiently the models could learn from limited examples. Compute was also a constraint: a single 24GB RTX 4090 only fits a maximum context length of around 3000 tokens with unsloth GRPO training, so I had to stick to the easier puzzles to avoid out-of-memory errors, since the harder puzzles and their reasoning chains are longer.

I decided to explore whether reinforcement learning—specifically Group Relative Policy Optimization (GRPO)—could teach language models to become Sudoku solvers. I experimented with two different model sizes:

  • Qwen 2.5 7B Instruct: Fine-tuned with LoRA rank 16
  • Qwen 2.5 3B Instruct: Fine-tuned with LoRA rank 32

Importantly, I used no cold-start data or distillation from larger models like DeepSeek R1. This was a pure reinforcement learning approach, starting from the base instruction-tuned models. The training configuration included the following (a rough setup sketch follows the list):

  • Batch size: 1
  • Gradient accumulation steps: 8
  • Learning rate: 3e-4 (Karpathy Constant)
  • Max steps: 500
  • Evaluation every 10 steps
  • Max sequence length: 3000 tokens
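
Here is a rough sketch of how this configuration maps onto unsloth and trl's GRPO implementation. Argument names follow those libraries' public APIs as I understand them, and anything not listed above (quantization, LoRA target modules, the prompt/completion length split) is an assumption rather than the exact setup used:

from unsloth import FastLanguageModel
from trl import GRPOConfig

max_seq_length = 3000  # what fits on a single 24GB RTX 4090 for the easier puzzles

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-7B-Instruct",
    max_seq_length=max_seq_length,
    load_in_4bit=True,  # assumption: 4-bit loading to stay within 24GB
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank 16 for the 7B run (rank 32 was used for the 3B run)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # assumption
)

training_args = GRPOConfig(
    learning_rate=3e-4,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    max_steps=500,
    eval_strategy="steps",
    eval_steps=10,
    max_prompt_length=1024,                        # assumption: split of the 3000-token budget
    max_completion_length=max_seq_length - 1024,
)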

The heart of reinforcement learning is the reward function—how we tell the model when it’s doing well. I designed a multi-component reward system with several specialized functions:

For reliable parsing, it is very important that the model always remembers to use the proper thinking and answering tags, <think></think> and <answer></answer> respectively. These tags serve two critical purposes:

  1. They separate the reasoning process from the final answer
  2. They make it easy to extract and evaluate the model’s solution

To enforce this structure, I implemented two complementary reward functions:

Code for the format compliance rewards

import re
from typing import List

def tags_presence_reward_func(completions, **kwargs) -> List[float]:
    """
    Reward for the presence of required tags.
    
    Gives 0.25 points for each required tag present (up to 1.0 total).
    """
    responses = [completion[0]["content"] for completion in completions]
    rewards = []
    
    for r in responses:
        score = 0.0
        # Check each tag individually
        if "<think>" in r: score += 0.25
        if "</think>" in r: score += 0.25
        if "<answer>" in r: score += 0.25
        if "</answer>" in r: score += 0.25
        
        rewards.append(score)
    
    return rewards

def tags_order_reward_func(completions, **kwargs) -> List[float]:
    """
    Reward for the correct order of tags.
    
    Gives 1.0 if tags appear in the correct order: <think>...</think> followed by <answer>...</answer>
    """
    responses = [completion[0]["content"] for completion in completions]
    rewards = []
    
    for r in responses:
        # Check if tags appear in the correct order
        if re.search(r"<think>.*?</think>.*?<answer>.*?</answer>", r, re.DOTALL):
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    
    return rewards

The first function (tags_presence_reward_func) gives partial credit for each tag that appears, encouraging the model to include all required tags. The second function (tags_order_reward_func) ensures these tags appear in the correct order—thinking first, then answering. Together, they teach the model to maintain a consistent structure that separates reasoning from solution.
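
The reward functions in this post also rely on two small helpers, extract_answer and normalize_grid, that aren't shown. A minimal version consistent with how they are used would look like this:

import re

def extract_answer(text: str) -> str:
    """Return whatever sits between <answer> and </answer>, or an empty string if absent."""
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def normalize_grid(grid_text: str) -> str:
    """Strip separators and whitespace, keeping only the grid's digits (81 chars when complete)."""
    return "".join(ch for ch in grid_text if ch.isdigit())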

Sudoku solutions must be presented in a specific grid format to be readable. This reward function evaluates how well the model maintains proper grid structure:

Code for the grid structure rewards

def answer_format_reward_func(completions, **kwargs) -> List[float]:
    """
    Reward for the correct formatting of the answer grid.
    
    Checks if the answer has the correct Sudoku grid format with properly ordered dividers.
    Gives partial rewards for each correctly structured element.
    """
    responses = [completion[0]["content"] for completion in completions]
    rewards = []
    
    for r in responses:
        answer = extract_answer(r)
        
        # No answer at all
        if not answer:
            rewards.append(0.0)
            continue
        
        # Split into lines, ignoring empty lines
        lines = [line for line in answer.split('\n') if line.strip()]
        
        # Initialize score components
        score_components = {
            'correct_line_count': 0,         # Should have 11 lines (9 number lines + 2 divider lines)
            'correct_divider_lines': 0,      # Should have divider lines at positions 3 and 7
            'correct_number_lines': 0,       # Should have 9 properly formatted number lines
            'correct_divider_position': 0,   # Dividers should appear at correct positions in each line
            'correct_box_pattern': 0         # Box pattern with '+' at correct intersections
        }
        
        # Check correct line count (9 number lines + 2 divider lines = 11)
        expected_line_count = 11
        score_components['correct_line_count'] = min(1.0, len(lines) / expected_line_count)
        
        # [Additional detailed scoring logic omitted for brevity]
        
        # Calculate overall score as weighted average of components
        weights = {
            'correct_line_count': 0.2,
            'correct_divider_lines': 0.2,
            'correct_number_lines': 0.2,
            'correct_divider_position': 0.2,
            'correct_box_pattern': 0.2
        }
        
        total_score = sum(score * weights[component] for component, score in score_components.items())
        rewards.append(total_score)
    
    return rewards

This function breaks down grid formatting into multiple components—correct number of lines, proper placement of dividers, appropriate use of separators—and rewards the model for each aspect it gets right. This granular approach helps the model learn the specific spatial structure of a Sudoku grid.
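
As an illustration of the kind of check hidden behind the omitted scoring logic, the divider-line component could look roughly like this (a sketch of one sub-score, not the exact code used in training):

def divider_line_score(lines) -> float:
    """Sketch: in an 11-line grid the dividers should sit at (0-based) lines 3 and 7;
    award 0.5 for each one found in the right place and built only from '-' and '+'."""
    score = 0.0
    for idx in (3, 7):
        if idx < len(lines) and set(lines[idx].replace(" ", "")) <= {"-", "+"}:
            score += 0.5
    return score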

Of course, the ultimate goal is for the model to solve the puzzle correctly. Two reward functions evaluate solution accuracy:

def exact_answer_reward_func(completions, answer, **kwargs) -> List[float]:
    """
    Reward for exact match of the answer with the expected solution.
    
    Gives 5.0 for exact match, 0.0 otherwise.
    """
    responses = [completion[0]["content"] for completion in completions]
    rewards = []
    
    for r, expected in zip(responses, answer):
        extracted = normalize_grid(extract_answer(r))
        expected_norm = normalize_grid(expected)
        
        if extracted == expected_norm:
            rewards.append(5.0)
        else:
            rewards.append(0.0)
    
    return rewards

def simple_robust_partial_reward_function(completions, quiz_str, solution_str, **kwargs) -> list[float]:
    """
    Robust partial reward function with strict clue preservation and progressive rewards.
    """
    rewards = []
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_answer(r) for r in responses]

    for extracted_response, puzzle, solution in zip(extracted_responses, quiz_str, solution_str):
        try:
            # Extract and normalize model's answer
            model_answer = normalize_grid(extracted_response)
            
            if len(model_answer) != 81:
                rewards.append(0.0)
                continue
                
            # Convert to lists for element-wise comparison
            model_digits = list(model_answer)
            puzzle_digits = list(puzzle)
            solution_digits = list(solution)
            
            # Clue preservation check (non-negotiable)
            clue_violation = any(
                p != "0" and m != p 
                for p, m in zip(puzzle_digits, model_digits)
            )
            if clue_violation:
                rewards.append(0.0)
                continue
                
            # Calculate empty cell rewards
            empty_cells = [i for i, d in enumerate(puzzle_digits) if d == "0"]
            correct_empty = sum(
                1 for i in empty_cells 
                if model_digits[i] == solution_digits[i]
            )
            
            # Base reward calculation
            base_reward = correct_empty * 0.250
            rewards.append(base_reward)
            
        except Exception as e:
            print(f"Reward calculation error: {e}")
            rewards.append(0.0)
    
    return rewards

The first function (exact_answer_reward_func) provides a large reward (5.0) for completely correct solutions, creating a strong incentive to solve the entire puzzle.

The second function (simple_robust_partial_reward_function) is more nuanced, providing partial credit for partially correct solutions. It has two key features:

  • It strictly enforces that the model preserves the original clues (giving zero reward if any clue is changed)
  • It rewards the model proportionally for each empty cell it fills correctly

This partial reward is crucial for learning, as it provides a smoother gradient for the model to follow during training.
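
To make the arithmetic concrete (with invented numbers, purely for illustration): on a puzzle with 45 empty cells where every clue is preserved and 30 cells are filled correctly, this function returns 30 × 0.25 = 7.5, while a fully correct fill of all 45 cells earns 11.25 here plus the separate 5.0 exact-match reward.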

Finally, a Sudoku solution must follow the game’s rules—no repeated digits in any row, column, or 3×3 box:

import numpy as np

def rule_compliance_reward_func(completions, **kwargs) -> List[float]:
    """
    Reward for compliance with Sudoku rules.
    
    Checks if the answer follows basic Sudoku rules (no repeated digits in rows, columns, boxes).
    Gives partial rewards for each type of constraint that is satisfied.
    """
    responses = [completion[0]["content"] for completion in completions]
    rewards = []
    
    for r in responses:
        answer = extract_answer(r)
        digits = normalize_grid(answer)
        
        # If we don't have 81 digits, can't properly check rules
        if len(digits) != 81:
            rewards.append(0.0)
            continue
        
        # Convert to 9x9 grid
        grid = np.array([int(digit) for digit in digits]).reshape(9, 9)
        
        # Check rows, columns, and boxes
        valid_rows = valid_columns = valid_boxes = 0
        
        # Check rows
        for i in range(9):
            if len(set(grid[i, :])) == 9:
                valid_rows += 1
                
        # Check columns
        for j in range(9):
            if len(set(grid[:, j])) == 9:
                valid_columns += 1
                
        # Check 3x3 boxes
        for box_i in range(3):
            for box_j in range(3):
                box = grid[box_i*3:(box_i+1)*3, box_j*3:(box_j+1)*3].flatten()
                if len(set(box)) == 9:
                    valid_boxes += 1
        
        # Calculate overall compliance score
        total_units = 9 + 9 + 9  # rows + columns + boxes
        valid_units = valid_rows + valid_columns + valid_boxes
        rewards.append(valid_units / total_units)
    
    return rewards

This function checks each row, column, and 3×3 box for duplicates, rewarding the model for each constraint it satisfies. This teaches the model the fundamental rules of Sudoku, encouraging it to generate valid solutions even if they’re not exactly matching the expected answer.
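
With all the reward components defined, they are handed to the trainer as a list. Below is a hedged sketch of the wiring, assuming trl's GRPOTrainer and the model, tokenizer, config, and datasets from the setup sketched earlier:

from trl import GRPOTrainer

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    args=training_args,
    train_dataset=train_dataset,   # 400 training puzzles
    eval_dataset=eval_dataset,     # 100 held-out puzzles
    reward_funcs=[
        tags_presence_reward_func,
        tags_order_reward_func,
        answer_format_reward_func,
        exact_answer_reward_func,
        simple_robust_partial_reward_function,
        rule_compliance_reward_func,
    ],
)
trainer.train()

In trl's implementation, extra dataset columns such as answer, quiz_str, and solution_str are forwarded to the reward functions as keyword arguments, which is why their signatures differ from one another.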

The training results revealed something fascinating: model size made a dramatic difference in learning stability and performance.

The 7B model (with LoRA rank 16) showed promising results:

  • Maintained stable completion lengths around 1000 tokens
  • Generated consistently formatted solutions
  • Showed steady improvement in reward metrics
  • Maintained policy stability throughout training

In stark contrast, the 3B model (with LoRA rank 32) struggled significantly:

  • Experienced catastrophic instability during training
  • Showed massive policy divergence (KL spikes up to 80!)
  • Failed to maintain consistent performance
  • Eventually collapsed and couldn’t recover

The graphs tell a clear story: while the 7B model (pink line) maintained stable performance, the 3B model (green line) showed wild fluctuations before eventually failing completely.

Completion length (train and eval):

  • Figure: train completion length, 7B vs. 3B model
  • Figure: eval completion length, 7B vs. 3B model

Net reward (train and eval):

  • Figure: train reward on the training set of 400 sudoku puzzles
  • Figure: eval reward on the test set of 100 sudoku puzzles

Answer format rewards:

  • Figure: train formatting rewards
  • Figure: eval formatting rewards

Most important, the exact answer rewards (the model generated a completely correct response grid that exactly matches the expected solution):

  • Figure: train exact-match reward
  • Figure: eval exact-match reward

For the 7B model, the exact-answer reward increases on both train and eval, meaning the model produces perfectly matching grids more and more often, whereas the 3B model collapses. This shows that the 7B model learns to solve Sudoku quickly and with very little data.

Partial rewards:

  • Figure: train partial rewards
  • Figure: eval partial rewards

This experiment reveals several important insights about teaching language models complex reasoning tasks:

1. There’s a minimum size threshold for complex reasoning without the cold-start data described in the DeepSeek R1 paper

Some tasks simply require a certain model capacity to learn stably. The 3B model’s failure suggests Sudoku solving may be one such task.

2. Stability is a prerequisite for learning

Before a model can learn to solve puzzles correctly, it needs to maintain stable training dynamics. The 7B model’s consistent metrics allowed it to make steady progress.

3. Multi-component rewards provide better guidance

Breaking down the reward into format compliance, rule adherence, and solution accuracy helped guide the learning process more effectively than a single pass/fail signal would have.

4. Reinforcement learning can teach structured thinking

Despite the challenges, GRPO successfully taught the 7B model to maintain proper formatting and begin solving puzzles—skills not inherent to language models.

This is very much an ongoing project, with several planned next steps:

  1. Increasing difficulty: Introducing more challenging puzzles to test the model’s reasoning capabilities
  2. Scaling compute: Applying more computational resources to train for longer periods and with larger batch sizes
  3. Exploring model architectures: Testing LoRA rank 32 for the 7B model to see if higher rank improves performance
  4. Distillation approach: Creating a cold-start dataset by distilling from larger models like DeepSeek R1, then applying GRPO on top of that foundation
  5. Advanced reward functions: Implementing more nuanced reward mechanisms that I’ve already designed but haven’t yet deployed in training
  6. Evaluation framework: Developing more sophisticated evaluation metrics to assess reasoning quality, not just solution accuracy

One of the most critical aspects of my future work involves implementing more sophisticated reward functions that I’ve already designed. The current simple reward function works, but my enhanced versions incorporate several key improvements that could significantly boost learning efficiency.

Here’s the enhanced reward function I’ve designed but haven’t yet implemented in training:

def enhanced_partial_answer_reward_func(completions, answer, **kwargs) -> List[float]:
    """
    Optimized reward function using precomputed difficulty levels from dataset
    with clue integrity checks and progressive bonuses.
    """
    responses = [completion[0]["content"] for completion in completions]
    rewards = []
    
    # Get batch data from dataset
    original_puzzles = kwargs['original_puzzles']
    difficulties = kwargs['difficulty']
    
    for r, expected, puzzle, difficulty in zip(responses, answer, original_puzzles, difficulties):
        # Extract grids from dataset fields
        extracted = normalize_grid(extract_answer(r))
        expected_norm = normalize_grid(expected)
        puzzle_norm = normalize_grid(puzzle)
        
        # Validate lengths
        if len(extracted) != 81 or len(expected_norm) != 81:
            rewards.append(0.0)
            continue
            
        # Score components
        clue_preservation = 0
        deduction_accuracy = 0
        total_deductions = 0
        
        for i in range(81):
            if puzzle_norm[i] != '0':  # Original clue
                clue_preservation += 1 if extracted[i] == puzzle_norm[i] else 0
            else:  # Required deduction
                total_deductions += 1
                deduction_accuracy += 1 if extracted[i] == expected_norm[i] else 0
                
        # Normalize scores
        clue_score = clue_preservation / (81 - total_deductions) if total_deductions < 81 else 1.0
        deduction_score = deduction_accuracy / total_deductions if total_deductions > 0 else 0.0
        
        # Non-linear difficulty scaling using precomputed difficulty
        difficulty_multiplier = 1 + (0.5 * (difficulty ** 1.3))  # grows with difficulty level
        
        # Progressive bonuses
        if deduction_score >= 0.95:
            deduction_bonus = 0.6
        elif deduction_score >= 0.85:
            deduction_bonus = 0.3
        elif deduction_score >= 0.75:
            deduction_bonus = 0.1
        else:
            deduction_bonus = 0.0
            
        # Final calculation
        base_reward = (0.15 * clue_score) + (0.85 * deduction_score) + deduction_bonus
        final_reward = base_reward * difficulty_multiplier
        
        # Clue error penalty
        clue_errors = (81 - total_deductions) - clue_preservation
        if clue_errors > 0:
            final_reward *= max(0, 1 - (clue_errors / 5))  # Up to 100% penalty
            
        rewards.append(max(final_reward, 0.05))  # Minimum reward
    
    return rewards

My reward function design philosophy centers on several key principles:

1. Progressive rewards over binary feedback: Rather than simply marking answers as right or wrong, I provide partial credit for partial solutions. This creates a smoother learning gradient that helps the model improve incrementally.

2. Difficulty-aware scaling: The enhanced functions incorporate puzzle difficulty as a multiplier, giving higher rewards for solving harder puzzles. This encourages the model to tackle more challenging problems rather than just optimizing for easy ones.

3. Strict clue preservation: All reward functions enforce a non-negotiable rule that original puzzle clues must be preserved. This prevents the model from “cheating” by changing the puzzle itself.

4. Bonus thresholds: The enhanced functions include bonus rewards when the model crosses certain performance thresholds (75%, 85%, 95% correct). These act as motivational milestones that accelerate learning when the model is on the right track.

5. Minimum reward floor (the one I’m most skeptical of): Even partially correct solutions receive a small minimum reward (0.05), ensuring that the model always gets some feedback, even for minimal progress.

The current simple function focuses on the most critical aspects (clue preservation and partial credit), while the enhanced versions add sophistication through difficulty scaling and progressive bonuses. In future training runs, I plan to implement these more nuanced reward functions to see if they can further improve learning efficiency and solution quality.

The key insight from my reward function design is that process-based rewards—rewarding the journey, not just the destination—are crucial for teaching complex reasoning tasks. By providing feedback on intermediate steps and partial solutions, we create a more effective learning environment than binary success/failure signals could provide.

Teaching language models to solve Sudoku isn’t just about puzzles—it’s about developing AI systems that can:

  1. Follow structured processes
  2. Apply logical reasoning step-by-step
  3. Maintain format consistency
  4. Verify their own work against known rules
  5. Understand spatial relationships

These capabilities have applications far beyond games:

  1. Programming: Teaching models to write code that follows strict syntax and logical constraints
  2. Mathematical problem-solving: Enabling step-by-step solutions to complex math problems
  3. Scientific reasoning: Helping models understand and apply scientific methods and principles
  4. Formal verification: Training models to check their own work against established rules

This experiment represents just the beginning of my exploration into teaching language models structured reasoning through reinforcement learning. While the initial results with the 7B model are promising, there’s still much to learn and improve.

The stark difference between the 3B and 7B models’ performance highlights an important lesson: some tasks have minimum capacity requirements for stable learning. As I continue to refine the approach with more data, better reward functions, and larger models, I expect to see even more impressive results.

I’ll be updating this project regularly as new findings emerge. The journey of teaching machines to think logically and solve structured problems is challenging but fascinating—and I’m excited to see where it leads next.

Note: This is a work in progress, and I welcome feedback and suggestions from the community!
