Reward model training for RLHF pipelines. Covers RewardTrainer, preference dataset preparation, sequence classification heads, and reward scaling for stable reinforcement learning. Includes thinking quality scoring patterns.
# Reward Model Training
## Overview
Reward models learn to score responses based on human preferences. They're used in RLHF pipelines (PPO, GRPO, RLOO) to provide reward signals for policy optimization. The model outputs a scalar reward for each response. This skill includes patterns for scoring thinking/reasoning quality.
## Quick Reference
| Component | Purpose |
|-----------|---------|
| `RewardTrainer` | Trainer for reward model |
| `RewardConfig` | Training hyperparameters |
| `AutoModelForSequenceClassification` | Model with `num_labels=1` |
| `task_type="SEQ_CLS"` | LoRA task type for reward models |
| Preference pairs | Training data format |
| Token ID 151668 | `</think>` boundary for Qwen3-Thinking models (see the sketch below) |
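The `</think>` boundary in the last row is the hook for thinking-quality scoring: everything before it is the model's reasoning, everything after it is the final answer, and the two can be scored separately. A minimal sketch, assuming a Qwen3-Thinking tokenizer where `</think>` is the single token ID 151668 and where `generated_ids` is a hypothetical list of output token IDs:
```python
# Split a completion at the </think> boundary so thinking and answer
# can be scored separately. generated_ids is an assumed list of token IDs.
THINK_END_ID = 151668  # </think> in Qwen3-Thinking tokenizers

def split_thinking(token_ids):
    """Return (thinking_ids, answer_ids) split at the </think> token."""
    if THINK_END_ID in token_ids:
        boundary = token_ids.index(THINK_END_ID)
        return token_ids[:boundary], token_ids[boundary + 1:]
    return [], token_ids  # no thinking block present

thinking_ids, answer_ids = split_thinking(generated_ids)
thinking_text = tokenizer.decode(thinking_ids, skip_special_tokens=True)
answer_text = tokenizer.decode(answer_ids, skip_special_tokens=True)
```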
## Critical Environment Setup
```python
import os
from dotenv import load_dotenv
load_dotenv()
# Force text-based progress in Jupyter
os.environ["TQDM_NOTEBOOK"] = "false"
```
## Critical Import Order
```python
# Standard transformers for reward models (not Unsloth)
from transformers import AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import RewardTrainer, RewardConfig
from datasets import Dataset
import torch
```
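Continuing from these imports, the components in the Quick Reference fit together roughly as follows. This is a minimal sketch rather than a prescribed configuration: the base model name and the LoRA hyperparameters are illustrative assumptions.
```python
# Illustrative setup; the model name and LoRA hyperparameters are assumptions.
from peft import prepare_model_for_kbit_training

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder base model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=1,                      # single scalar reward head
    quantization_config=bnb_config,
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    task_type="SEQ_CLS",               # keeps the classification head trainable
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```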
## Reward Model Concepts
### How Reward Models Work
1. Take prompt + response as input
2. Output scalar reward score
3. Trained on preference pairs (chosen > rejected); see the loss sketch after this list
4. Used to guide RL policy optimization
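Step 3 is the training signal: the model should assign a higher reward to the chosen response than to the rejected one. A minimal sketch of that pairwise objective, with illustrative reward tensors standing in for the model's scalar outputs:
```python
import torch
import torch.nn.functional as F

# Illustrative scalar rewards for a batch of preference pairs.
reward_chosen = torch.tensor([1.2, 0.4, 2.0])
reward_rejected = torch.tensor([0.3, 0.9, -0.5])

# Pairwise loss: push r(chosen) above r(rejected). This mirrors the
# -log(sigmoid(r_chosen - r_rejected)) objective used for reward modeling.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(round(loss.item(), 4))
```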
### Architecture
```
Input: [prompt + response]
↓
Base LLM (frozen or LoRA)
↓
Classification Head (Linear → Scalar)
↓
Output: Reward score (float)
```
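In code, that forward pass amounts to tokenizing the concatenated prompt and response and reading the single logit from the classification head. A minimal scoring sketch, assuming `model` and `tokenizer` are a trained reward model and its tokenizer (the example text is a placeholder):
```python
# Score one prompt + response pair with a trained reward model.
text = "What is recursion?\nRecursion is a function calling itself with a base case."
inputs = tokenizer(text, return_tensors="pt", truncation=True).to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

# With num_labels=1, logits has shape (batch, 1); the single value is the reward.
reward = outputs.logits[0, 0].item()
print(f"reward: {reward:.3f}")
```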
## Dataset Format
### Required Fields
```python
dataset = [
    {
        "prompt": "What is recursion?",
        "chosen": "Recursion is a function calling itself with a base case.",
        "rejected": "Recursion is loops."
    },
    # ... more preference pairs
]
```
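The trainer consumes a `datasets.Dataset` rather than a plain Python list; a minimal conversion sketch using `Dataset.from_list` (the exact columns the trainer expects depend on the TRL version, which is what the preprocessing step below prepares):
```python
from datasets import Dataset

# Convert the preference pairs above into a Hugging Face Dataset.
preference_dataset = Dataset.from_list(dataset)
print(preference_dataset.column_names)  # ['prompt', 'chosen', 'rejected']
print(preference_dataset[0])            # first preference pair
```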
### Preprocessing
```python
def format_for_reward(sample):
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": sample["prompt"]}],