Reward model training for RLHF pipelines. Covers RewardTrainer, preference dataset preparation, sequence classification heads, and reward scaling for stable reinforcement learning. Includes thinking quality scoring patterns.
# Reward Model Training
## Overview
Reward models learn to score responses based on human preferences. They're used in RLHF pipelines (PPO, GRPO, RLOO) to provide reward signals for policy optimization. The model outputs a scalar reward for each response. This skill includes patterns for scoring thinking/reasoning quality.
## Quick Reference
| Component | Purpose |
|-----------|---------|
| `RewardTrainer` | Trainer for reward model |
| `RewardConfig` | Training hyperparameters |
| `AutoModelForSequenceClassification` | Model with `num_labels=1` |
| `task_type="SEQ_CLS"` | LoRA task type for reward models |
| Preference pairs | Training data format |
| Token ID 151668 | `</think>` boundary for Qwen3-Thinking models (see the sketch below) |
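The `</think>` boundary in the last row is the hook for thinking-quality scoring: everything before it is the model's reasoning, everything after it is the final answer, and the two can be scored separately. A minimal sketch, assuming a Qwen3-Thinking tokenizer where `</think>` is the single token ID 151668 and where `generated_ids` is a hypothetical list of output token IDs:
```python
# Split a completion at the </think> boundary so thinking and answer
# can be scored separately. generated_ids is an assumed list of token IDs.
THINK_END_ID = 151668  # </think> in Qwen3-Thinking tokenizers

def split_thinking(token_ids):
    """Return (thinking_ids, answer_ids) split at the </think> token."""
    if THINK_END_ID in token_ids:
        boundary = token_ids.index(THINK_END_ID)
        return token_ids[:boundary], token_ids[boundary + 1:]
    return [], token_ids  # no thinking block present

thinking_ids, answer_ids = split_thinking(generated_ids)
thinking_text = tokenizer.decode(thinking_ids, skip_special_tokens=True)
answer_text = tokenizer.decode(answer_ids, skip_special_tokens=True)
```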
## Critical Environment Setup
```python
import os
from dotenv import load_dotenv
load_dotenv()
# Force text-based progress in Jupyter
os.environ["TQDM_NOTEBOOK"] = "false"
```
## Critical Import Order
```python
# Standard transformers for reward models (not Unsloth)
from transformers import AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import RewardTrainer, RewardConfig
from datasets import Dataset
import torch
```
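Continuing from these imports, the components in the Quick Reference fit together roughly as follows. This is a minimal sketch rather than a prescribed configuration: the base model name and the LoRA hyperparameters are illustrative assumptions.
```python
# Illustrative setup; the model name and LoRA hyperparameters are assumptions.
from peft import prepare_model_for_kbit_training

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder base model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=1,                      # single scalar reward head
    quantization_config=bnb_config,
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    task_type="SEQ_CLS",               # keeps the classification head trainable
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```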
## Reward Model Concepts
### How Reward Models Work
1. Take prompt + response as input
2. Output scalar reward score
3. Trained on preference pairs (chosen > rejected); see the loss sketch after this list
4. Used to guide RL policy optimization
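Step 3 is the training signal: the model should assign a higher reward to the chosen response than to the rejected one. A minimal sketch of that pairwise objective, with illustrative reward tensors standing in for the model's scalar outputs:
```python
import torch
import torch.nn.functional as F

# Illustrative scalar rewards for a batch of preference pairs.
reward_chosen = torch.tensor([1.2, 0.4, 2.0])
reward_rejected = torch.tensor([0.3, 0.9, -0.5])

# Pairwise loss: push r(chosen) above r(rejected). This mirrors the
# -log(sigmoid(r_chosen - r_rejected)) objective used for reward modeling.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(round(loss.item(), 4))
```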
### Architecture
```
Input: [prompt + response]
↓
Base LLM (frozen or LoRA)
↓
Classification Head (Linear → Scalar)
↓
Output: Reward score (float)
```
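In code, that forward pass amounts to tokenizing the concatenated prompt and response and reading the single logit from the classification head. A minimal scoring sketch, assuming `model` and `tokenizer` are a trained reward model and its tokenizer (the example text is a placeholder):
```python
# Score one prompt + response pair with a trained reward model.
text = "What is recursion?\nRecursion is a function calling itself with a base case."
inputs = tokenizer(text, return_tensors="pt", truncation=True).to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

# With num_labels=1, logits has shape (batch, 1); the single value is the reward.
reward = outputs.logits[0, 0].item()
print(f"reward: {reward:.3f}")
```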
## Dataset Format
### Required Fields
```python
dataset = [
    {
        "prompt": "What is recursion?",
        "chosen": "Recursion is a function calling itself with a base case.",
        "rejected": "Recursion is loops."
    },
    # ... more preference pairs
]
```
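The trainer consumes a `datasets.Dataset` rather than a plain Python list; a minimal conversion sketch using `Dataset.from_list` (the exact columns the trainer expects depend on the TRL version, which is what the preprocessing step below prepares):
```python
from datasets import Dataset

# Convert the preference pairs above into a Hugging Face Dataset.
preference_dataset = Dataset.from_list(dataset)
print(preference_dataset.column_names)  # ['prompt', 'chosen', 'rejected']
print(preference_dataset[0])            # first preference pair
```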
### Preprocessing
```python
def format_for_reward(sample):
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": sample["prompt"]}],