
reward (verified)

Reward model training for RLHF pipelines. Covers RewardTrainer, preference dataset preparation, sequence classification heads, and reward scaling for stable reinforcement learning. Includes thinking quality scoring patterns.


Marketplace: bazzite-ai-plugins (atrawog/bazzite-ai-plugins)
Plugin: bazzite-ai-jupyter (development)
Repository: atrawog/bazzite-ai-plugins
Path: bazzite-ai-jupyter/skills/reward/SKILL.md
Last Verified: January 21, 2026

Install Skill

npx add-skill https://github.com/atrawog/bazzite-ai-plugins/blob/main/bazzite-ai-jupyter/skills/reward/SKILL.md -a claude-code --skill reward

Installation path (Claude): `.claude/skills/reward/`

Instructions

# Reward Model Training

## Overview

Reward models learn to score responses based on human preferences. They're used in RLHF pipelines (PPO, GRPO, RLOO) to provide reward signals for policy optimization. The model outputs a scalar reward for each response. This skill includes patterns for scoring thinking/reasoning quality.

## Quick Reference

| Component | Purpose |
|-----------|---------|
| `RewardTrainer` | TRL trainer for reward models |
| `RewardConfig` | Training hyperparameters |
| `AutoModelForSequenceClassification` | Model with `num_labels=1` |
| `task_type="SEQ_CLS"` | LoRA task type for reward models |
| Preference pairs | Training data format |
| Token ID 151668 | `</think>` boundary for Qwen3-Thinking models |
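
For Qwen3-Thinking models, everything before token ID 151668 (`</think>`) is the reasoning trace and everything after it is the final answer. Below is a minimal sketch of splitting a response at that boundary so the two spans can be scored separately; the helper name and the placeholder token IDs in the usage line are illustrative assumptions, not part of the skill.

```python
THINK_END_ID = 151668  # </think> boundary token in Qwen3-Thinking tokenizers

def split_thinking(response_ids: list[int]) -> tuple[list[int], list[int]]:
    """Split response token IDs into (thinking, answer) spans at the </think> boundary."""
    if THINK_END_ID in response_ids:
        idx = response_ids.index(THINK_END_ID)
        return response_ids[:idx], response_ids[idx + 1:]
    return [], response_ids  # no thinking block emitted

# e.g. score only the answer span, or weight thinking and answer quality separately
thinking_ids, answer_ids = split_thinking([9906, 11, 151668, 1917, 0])  # placeholder IDs
```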

## Critical Environment Setup

```python
import os
from dotenv import load_dotenv
load_dotenv()

# Force text-based progress in Jupyter
os.environ["TQDM_NOTEBOOK"] = "false"
```

## Critical Import Order

```python
# Standard transformers for reward models (not Unsloth)
from transformers import AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import RewardTrainer, RewardConfig
from datasets import Dataset
import torch
```
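
A sketch of how these imports typically fit together for a LoRA reward model: load the base checkpoint with a one-logit classification head and wrap it with a `SEQ_CLS` adapter. It continues from the imports above; the base model name, quantization settings, and LoRA hyperparameters are assumptions, not values mandated by the skill.

```python
base_model = "Qwen/Qwen2.5-1.5B-Instruct"    # assumed base checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForSequenceClassification.from_pretrained(
    base_model,
    num_labels=1,                            # single scalar reward head
    quantization_config=bnb_config,
)
model.config.pad_token_id = tokenizer.pad_token_id

lora_config = LoraConfig(
    task_type="SEQ_CLS",                     # classification task type for reward models
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```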

## Reward Model Concepts

### How Reward Models Work

1. Take prompt + response as input
2. Output scalar reward score
3. Trained on preference pairs (chosen > rejected); see the loss sketch after this list
4. Used to guide RL policy optimization
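
The pairwise objective behind step 3 pushes the chosen response's reward above the rejected one's. A standalone sketch of that Bradley-Terry style loss (illustrative, not TRL's exact implementation):

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Penalize pairs where the rejected response scores close to or above the chosen one."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# A larger margin between chosen and rejected gives a smaller loss
print(pairwise_reward_loss(torch.tensor([1.5]), torch.tensor([-0.3])))  # ~0.15
print(pairwise_reward_loss(torch.tensor([0.1]), torch.tensor([0.4])))   # ~0.85
```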

### Architecture

```
Input: [prompt + response]
  ↓
Base LLM (frozen or LoRA)
  ↓
Classification Head (Linear → Scalar)
  ↓
Output: Reward score (float)
```
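
In code, this architecture reduces to a single forward pass whose one logit is the reward. A minimal scoring sketch, assuming `model` and `tokenizer` were loaded as in the import section; the function name, truncation settings, and example strings are assumptions.

```python
import torch

@torch.no_grad()
def score(prompt: str, response: str) -> float:
    """Return the scalar reward for a prompt + response pair."""
    inputs = tokenizer(prompt + response, return_tensors="pt", truncation=True, max_length=1024)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    logits = model(**inputs).logits      # shape (1, 1) because num_labels=1
    return logits[0, 0].item()

print(score("What is recursion?", "A function calling itself with a base case."))
```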

## Dataset Format

### Required Fields

```python
dataset = [
    {
        "prompt": "What is recursion?",
        "chosen": "Recursion is a function calling itself with a base case.",
        "rejected": "Recursion is loops."
    },
    # ... more preference pairs
]
```
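
TRL expects a `datasets.Dataset` rather than a plain Python list, so the pairs above are typically wrapped before training; a small sketch:

```python
from datasets import Dataset

preference_dataset = Dataset.from_list(dataset)
print(preference_dataset)                  # Dataset with prompt / chosen / rejected columns
print(preference_dataset[0]["rejected"])   # "Recursion is loops."
```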

### Preprocessing

The chat-template call below is a minimal sketch; the roles and arguments are assumptions, so adapt them to your tokenizer's chat format.

```python
def format_for_reward(sample):
    # Turn each preference pair into full chosen/rejected texts that share the same prompt.
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": sample["prompt"]}],
        tokenize=False, add_generation_prompt=True,
    )
    return {"chosen": prompt + sample["chosen"], "rejected": prompt + sample["rejected"]}
```
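
From there, a minimal training sketch with `RewardConfig` and `RewardTrainer`, assuming the model, tokenizer, and `preference_dataset` from the earlier sections. The hyperparameters and `output_dir` are illustrative, and the tokenizer argument is named `tokenizer=` rather than `processing_class=` in older TRL releases.

```python
train_dataset = preference_dataset.map(format_for_reward, remove_columns=["prompt"])

reward_config = RewardConfig(
    output_dir="reward-model",           # assumed output path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=1e-5,
    max_length=1024,
    logging_steps=10,
)

trainer = RewardTrainer(
    model=model,
    args=reward_config,
    train_dataset=train_dataset,
    processing_class=tokenizer,          # `tokenizer=` in older TRL versions
)
trainer.train()
```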
