add-dataset

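A guide to adding a new dataset loader to the AReaL framework. It covers creating a Python file that holds the SFT and RL training processing logic, then registering the dataset name and its loader functions in __init__.py so the dataset is integrated.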

.claude/skills/add-dataset/SKILL.md areal-project/AReaL

Trigger Scenarios

  • User asks how to add a dataset
  • User wants to integrate a new dataset
  • User mentions creating a dataset loader

Install

npx skills add areal-project/AReaL --skill add-dataset -g -y

More Options

Non-standard path

npx skills add https://github.com/areal-project/AReaL/tree/main/.claude/skills/add-dataset -g -y

Use without installing

npx skills use areal-project/AReaL@add-dataset

Target a specific agent (Claude Code)

npx skills add areal-project/AReaL --skill add-dataset -a claude-code -g -y

Install all skills in the repo

npx skills add areal-project/AReaL --all -g -y

Preview the skills in the repo

npx skills add areal-project/AReaL --list

SKILL.md

Frontmatter
{
    "name": "add-dataset",
    "description": "Guide for adding a new dataset loader to AReaL. Use when user wants to add a new dataset."
}

Add Dataset

Add a new dataset loader to AReaL.

When to Use

This skill is triggered when:

  • User asks "how do I add a dataset?"
  • User wants to integrate a new dataset
  • User mentions creating a dataset loader

Step-by-Step Guide

Step 1: Create Dataset File

Create areal/dataset/<name>.py:

from datasets import Dataset, load_dataset


def get_<name>_sft_dataset(
    path: str,
    split: str,
    tokenizer,
    max_length: int | None = None,
) -> Dataset:
    """Load dataset for SFT training.

    Args:
        path: Path to dataset (HuggingFace hub or local path)
        split: Dataset split (train/validation/test)
        tokenizer: Tokenizer for processing
        max_length: Maximum sequence length (optional)

    Returns:
        HuggingFace Dataset with processed samples
    """
    dataset = load_dataset(path=path, split=split)

    def process(sample):
        # Tokenize the full sequence (prompt + response)
        seq_token = tokenizer.encode(
            sample["question"] + sample["answer"] + tokenizer.eos_token
        )
        prompt_token = tokenizer.encode(sample["question"])
        # Loss mask: 0 for prompt, 1 for response
        loss_mask = [0] * len(prompt_token) + [1] * (len(seq_token) - len(prompt_token))
        return {"input_ids": seq_token, "loss_mask": loss_mask}

    dataset = dataset.map(process).remove_columns(["question", "answer"])

    if max_length is not None:
        dataset = dataset.filter(lambda x: len(x["input_ids"]) <= max_length)

    return dataset


def get_<name>_rl_dataset(
    path: str,
    split: str,
    tokenizer,
    max_length: int | None = None,
) -> Dataset:
    """Load dataset for RL training.

    Args:
        path: Path to dataset
        split: Dataset split
        tokenizer: Tokenizer for length filtering
        max_length: Maximum sequence length

    Returns:
        HuggingFace Dataset with prompts and answers for reward computation
    """
    dataset = load_dataset(path=path, split=split)

    def process(sample):
        messages = [
            {
                "role": "user",
                "content": sample["question"],
            }
        ]
        return {"messages": messages, "answer": sample["answer"]}

    dataset = dataset.map(process).remove_columns(["question"])

    if max_length is not None:

        def filter_length(sample):
            content = sample["messages"][0]["content"]
            tokens = tokenizer.encode(content)
            return len(tokens) <= max_length

        dataset = dataset.filter(filter_length)

    return dataset
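
To make the loss-mask arithmetic in the SFT `process` function concrete, here is a minimal sketch that runs the same logic on a single record. `ToyTokenizer` and `build_sft_sample` are hypothetical names used only for illustration; the toy tokenizer emits one id per whitespace-separated word and stands in for a real HuggingFace tokenizer.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer: one id per whitespace-separated word."""

    eos_token = "<eos>"

    def __init__(self):
        self.vocab = {}

    def encode(self, text: str) -> list[int]:
        # Assign a fresh id the first time a word is seen, reuse it afterwards.
        return [self.vocab.setdefault(word, len(self.vocab)) for word in text.split()]


def build_sft_sample(sample: dict, tokenizer) -> dict:
    # Same steps as the `process` function: tokenize prompt + response + EOS,
    # tokenize the prompt alone, and mask out the prompt positions.
    seq_token = tokenizer.encode(
        sample["question"] + sample["answer"] + tokenizer.eos_token
    )
    prompt_token = tokenizer.encode(sample["question"])
    loss_mask = [0] * len(prompt_token) + [1] * (len(seq_token) - len(prompt_token))
    return {"input_ids": seq_token, "loss_mask": loss_mask}


tok = ToyTokenizer()
out = build_sft_sample({"question": "What is 2 + 2 ? ", "answer": "It is 4 . "}, tok)
print(len(out["input_ids"]), out["loss_mask"])
```

The mask always has the same length as `input_ids`, with zeros over the prompt and ones over the response and EOS token. With a real subword tokenizer, the separately tokenized prompt may not always be an exact prefix of the full sequence, so it is worth spot-checking a few samples from a new dataset.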

Step 2: Register in __init__.py

Update areal/dataset/__init__.py:

# Add to VALID_DATASETS
VALID_DATASETS = [
    # ... existing datasets
    "<name>",
]

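# Add a branch to the _get_custom_dataset function
def _get_custom_dataset(name: str, ...):
    # ... existing code
    elif name == "<name>":
        from areal.dataset.<name> import get_<name>_sft_dataset, get_<name>_rl_dataset

        # Pass arguments by keyword so they match the Step 1 signatures,
        # where tokenizer comes before max_length.
        if dataset_type == "sft":
            return get_<name>_sft_dataset(
                path=path, split=split, tokenizer=tokenizer, max_length=max_length
            )
        else:
            return get_<name>_rl_dataset(
                path=path, split=split, tokenizer=tokenizer, max_length=max_length
            )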
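
The registration pattern can be sketched in isolation as follows. This is not AReaL's actual `_get_custom_dataset`; the dataset name `toy`, the stand-in loaders, and `get_custom_dataset` are all hypothetical. The loaders echo their arguments so it is easy to see that keyword passing binds `tokenizer` and `max_length` to the right parameters of the Step 1 signatures.

```python
VALID_DATASETS = ["toy"]  # hypothetical dataset name


def get_toy_sft_dataset(path, split, tokenizer, max_length=None):
    # Stand-in loader: returns its arguments so the binding is visible.
    return {"type": "sft", "path": path, "split": split,
            "tokenizer": tokenizer, "max_length": max_length}


def get_toy_rl_dataset(path, split, tokenizer, max_length=None):
    return {"type": "rl", "path": path, "split": split,
            "tokenizer": tokenizer, "max_length": max_length}


def get_custom_dataset(name, dataset_type, path, split, tokenizer, max_length=None):
    # Reject names that were never registered.
    if name not in VALID_DATASETS:
        raise ValueError(f"Unknown dataset {name!r}; expected one of {VALID_DATASETS}")
    if name == "toy":
        loader = get_toy_sft_dataset if dataset_type == "sft" else get_toy_rl_dataset
        # Keyword arguments cannot be swapped by accident.
        return loader(path=path, split=split, tokenizer=tokenizer, max_length=max_length)
    raise ValueError(f"Dataset {name!r} is registered but has no loader branch")


result = get_custom_dataset("toy", "sft", "data/toy", "train",
                            tokenizer="tok", max_length=512)
print(result)
```

Calling the loaders positionally as `(path, split, max_length, tokenizer)` would bind the length to `tokenizer` and the tokenizer to `max_length` without raising an error at the call site, which is why the sketch uses keywords throughout.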

Step 3: Add Config (Optional)

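If the dataset needs special configuration, add a field with a default value to the dataset config in areal/api/cli_args.py:

from dataclasses import dataclass
from typing import Optional  # skip if cli_args.py already imports these

@dataclass
class TrainDatasetConfig:
    # ... existing fields
    <name>_specific_field: Optional[str] = None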

Step 4: Add Tests

Create tests/test_<name>_dataset.py:

import pytest
from areal.dataset.<name> import get_<name>_sft_dataset, get_<name>_rl_dataset

# Both tests assume a `tokenizer` pytest fixture (for example, one defined in conftest.py).

def test_sft_dataset_loads(tokenizer):
    dataset = get_<name>_sft_dataset("path/to/data", split="train", tokenizer=tokenizer)
    assert len(dataset) > 0
    assert "input_ids" in dataset.column_names
    assert "loss_mask" in dataset.column_names

def test_rl_dataset_loads(tokenizer):
    dataset = get_<name>_rl_dataset("path/to/data", split="train", tokenizer=tokenizer)
    assert len(dataset) > 0
    assert "messages" in dataset.column_names
    assert "answer" in dataset.column_names
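
The tests above rely on a `tokenizer` fixture and on real data at `path/to/data`. One way to also cover the processing logic offline is to test it against a stub tokenizer and in-memory records. The sketch below is an assumption about how such tests might look: `StubTokenizer`, `to_rl_sample`, and `within_length` are hypothetical helpers that mirror the RL `process` and `filter_length` functions from Step 1.

```python
class StubTokenizer:
    """Hypothetical offline tokenizer: one token per character."""

    eos_token = "\n"

    def encode(self, text: str) -> list[int]:
        return [ord(ch) for ch in text]


def to_rl_sample(sample: dict) -> dict:
    # Mirrors the RL `process` function: wrap the question as a user message.
    messages = [{"role": "user", "content": sample["question"]}]
    return {"messages": messages, "answer": sample["answer"]}


def within_length(sample: dict, tokenizer, max_length: int) -> bool:
    # Mirrors `filter_length`: only the prompt is measured.
    content = sample["messages"][0]["content"]
    return len(tokenizer.encode(content)) <= max_length


def test_rl_sample_shape():
    sample = to_rl_sample({"question": "2 + 2?", "answer": "4"})
    assert sample["messages"] == [{"role": "user", "content": "2 + 2?"}]
    assert sample["answer"] == "4"


def test_length_filter():
    tok = StubTokenizer()
    short = to_rl_sample({"question": "hi", "answer": "x"})
    long = to_rl_sample({"question": "h" * 50, "answer": "x"})
    assert within_length(short, tok, max_length=10)
    assert not within_length(long, tok, max_length=10)


if __name__ == "__main__":
    test_rl_sample_shape()
    test_length_filter()
```

Because the functions follow the `test_` naming convention, pytest can collect them as-is, and the file can also be run directly as a script.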

Reference Implementations

| Dataset | File | Description |
| --- | --- | --- |
| GSM8K | areal/dataset/gsm8k.py | Math word problems |
| Geometry3K | areal/dataset/geometry3k.py | Geometry problems |
| CLEVR | areal/dataset/clevr_count_70k.py | Visual counting |
| HH-RLHF | areal/dataset/hhrlhf.py | Helpfulness/Harmlessness |
| TORL | areal/dataset/torl_data.py | Tool-use RL |

Required Fields

SFT Dataset

{
    "messages": [
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."},
    ]
}

RL Dataset

{
    "messages": [
        {"role": "user", "content": "..."},
    ],
    "answer": "ground_truth_for_reward",
    # Optional metadata for reward function
}
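
A quick structural check can catch malformed RL records before training starts. The sketch below is a hypothetical helper, `validate_rl_record`, not part of AReaL; it assumes only the RL record layout shown above (a non-empty `messages` list of dicts with `role` and `content`, plus an `answer` field).

```python
def validate_rl_record(record: dict) -> list[str]:
    """Return a list of problems found in one RL record (empty if it looks valid)."""
    errors = []
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        errors.append("'messages' must be a non-empty list")
    else:
        for i, message in enumerate(messages):
            # Each message needs both keys, as in the format above.
            if not isinstance(message, dict) or not {"role", "content"} <= message.keys():
                errors.append(f"messages[{i}] must be a dict with 'role' and 'content'")
    if "answer" not in record:
        errors.append("missing 'answer'")
    return errors


good = {"messages": [{"role": "user", "content": "2 + 2?"}], "answer": "4"}
bad = {"messages": ["2 + 2?"]}
print(validate_rl_record(good))
print(validate_rl_record(bad))
```

Running the check over a handful of rows from a freshly written loader is a cheap way to confirm the message format before wiring the dataset into a training run.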

Common Mistakes

  • ❌ Returning List[Dict] instead of HuggingFace Dataset
  • ❌ Using Python loops instead of dataset.map()/filter()
  • ❌ Missing "messages" field for RL datasets
  • ❌ Wrong message format (should be list of dicts with role and content)
  • ❌ Not registering in __init__.py

Version History

  • d99124e Current 2026-07-05 09:13
