dpo

Direct Preference Optimization for learning from preference pairs. Covers DPOTrainer, preference dataset preparation, implicit reward modeling, and beta tuning for stable preference learning without explicit reward models. Includes thinking quality patterns.

Marketplace: bazzite-ai-plugins (atrawog/bazzite-ai-plugins)
Plugin: bazzite-ai-jupyter (development)
Repository: atrawog/bazzite-ai-plugins
Source: bazzite-ai-jupyter/skills/dpo/SKILL.md

Last Verified: January 21, 2026

Install (via the add-skill CLI):

npx add-skill https://github.com/atrawog/bazzite-ai-plugins/blob/main/bazzite-ai-jupyter/skills/dpo/SKILL.md -a claude-code --skill dpo

Installation path (Claude): .claude/skills/dpo/

Instructions

# Direct Preference Optimization (DPO)

## Overview

DPO learns from preference pairs (chosen vs rejected responses) without training an explicit reward model. It directly optimizes the policy using the Bradley-Terry preference model, making it simpler than RLHF while achieving comparable results. This skill includes patterns for training thinking/reasoning models.

## Quick Reference

| Component | Purpose |
|-----------|---------|
| `DPOTrainer` | Trainer for preference optimization |
| `DPOConfig` | Training hyperparameters |
| `beta` | Temperature for implicit reward (0.1 typical) |
| `learning_rate` | 5e-6 (most conservative of RL methods) |
| `ref_model` | Reference model for KL constraint |
| Token ID 151668 | `</think>` boundary for Qwen3-Thinking models |
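
A minimal configuration sketch tying these values together (the batch size, step count, and output directory below are placeholder choices, and exact `DPOConfig` field names can vary across TRL versions):

```python
from trl import DPOConfig

# Sketch only: beta and learning_rate follow the table above;
# the remaining values are placeholders to adapt to your setup.
config = DPOConfig(
    beta=0.1,                        # implicit-reward temperature
    learning_rate=5e-6,              # conservative LR for preference learning
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    max_steps=200,
    logging_steps=10,
    output_dir="outputs/dpo",
)
```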

## Critical Environment Setup

```python
import os
from dotenv import load_dotenv
load_dotenv()

# Force text-based progress in Jupyter
os.environ["TQDM_NOTEBOOK"] = "false"
```

## Critical Import Order

```python
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported

# Then TRL imports
from trl import DPOConfig, DPOTrainer
from datasets import Dataset
import torch
```

## DPO Concepts

### How DPO Works

1. Start from a prompt paired with a chosen and a rejected response
2. Compute log-probabilities of both responses under the policy and the frozen reference model
3. Optimize the policy so the chosen response becomes more likely relative to the rejected one, measured against the reference (made precise by the loss below)
4. `beta` controls how strongly the preference margin is enforced versus staying close to the reference
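
For reference, this is the objective from the original DPO paper (Rafailov et al., 2023), where $y_w$ is the chosen response, $y_l$ the rejected one, and $\sigma$ the logistic sigmoid:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
$$

The implicit reward is $r_\theta(x,y)=\beta\log\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}$ (up to a prompt-dependent constant), which is why no separate reward model needs to be trained.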

### Key Differences from RLHF

| Aspect | DPO | RLHF |
|--------|-----|------|
| Reward Model | Implicit | Explicit |
| Training | Single stage | Multi-stage |
| Complexity | Simpler | More complex |
| Compute | Lower | Higher |

## Dataset Format

### Required Fields

```python
dataset = [
    {
        "prompt": "What is recursion?",
        "chosen": "Recursion is when a function calls itself with a simpler version of the problem, including a base case to stop.",
        "rejected": "Recursion is loops.",
    },
]
```
  
