dpo

Direct Preference Optimization for learning from preference pairs. Covers DPOTrainer, preference dataset preparation, implicit reward modeling, and beta tuning for stable preference learning without explicit reward models. Includes thinking quality patterns.

Marketplace: bazzite-ai-plugins (atrawog/bazzite-ai-plugins)
Plugin: bazzite-ai-jupyter (development)
Repository: atrawog/bazzite-ai-plugins
Source: bazzite-ai-jupyter/skills/dpo/SKILL.md

Last Verified: January 21, 2026

Install (via the add-skill CLI):

npx add-skill https://github.com/atrawog/bazzite-ai-plugins/blob/main/bazzite-ai-jupyter/skills/dpo/SKILL.md -a claude-code --skill dpo

Installation path (Claude): .claude/skills/dpo/

Instructions

# Direct Preference Optimization (DPO)

## Overview

DPO learns from preference pairs (chosen vs rejected responses) without training an explicit reward model. It directly optimizes the policy using the Bradley-Terry preference model, making it simpler than RLHF while achieving comparable results. This skill includes patterns for training thinking/reasoning models.

## Quick Reference

| Component | Purpose |
|-----------|---------|
| `DPOTrainer` | Trainer for preference optimization |
| `DPOConfig` | Training hyperparameters |
| `beta` | Temperature for implicit reward (0.1 typical) |
| `learning_rate` | 5e-6 (most conservative of RL methods) |
| `ref_model` | Reference model for KL constraint |
| Token ID 151668 | `</think>` boundary for Qwen3-Thinking models |
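
A minimal configuration sketch tying these values together (the batch size, step count, and output directory below are placeholder choices, and exact `DPOConfig` field names can vary across TRL versions):

```python
from trl import DPOConfig

# Sketch only: beta and learning_rate follow the table above;
# the remaining values are placeholders to adapt to your setup.
config = DPOConfig(
    beta=0.1,                        # implicit-reward temperature
    learning_rate=5e-6,              # conservative LR for preference learning
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    max_steps=200,
    logging_steps=10,
    output_dir="outputs/dpo",
)
```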

## Critical Environment Setup

```python
import os
from dotenv import load_dotenv
load_dotenv()

# Force text-based progress in Jupyter
os.environ["TQDM_NOTEBOOK"] = "false"
```

## Critical Import Order

```python
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported

# Then TRL imports
from trl import DPOConfig, DPOTrainer
from datasets import Dataset
import torch
```

## DPO Concepts

### How DPO Works

1. Start from a prompt paired with a chosen and a rejected response
2. Compute log-probabilities of both responses under the policy and the frozen reference model
3. Optimize the policy so the chosen response becomes more likely relative to the rejected one, measured against the reference (made precise by the loss below)
4. `beta` controls how strongly the preference margin is enforced versus staying close to the reference
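
For reference, this is the objective from the original DPO paper (Rafailov et al., 2023), where $y_w$ is the chosen response, $y_l$ the rejected one, and $\sigma$ the logistic sigmoid:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
$$

The implicit reward is $r_\theta(x,y)=\beta\log\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}$ (up to a prompt-dependent constant), which is why no separate reward model needs to be trained.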

### Key Differences from RLHF

| Aspect | DPO | RLHF |
|--------|-----|------|
| Reward Model | Implicit | Explicit |
| Training | Single stage | Multi-stage |
| Complexity | Simpler | More complex |
| Compute | Lower | Higher |

## Dataset Format

### Required Fields

```python
dataset = [
    {
        "prompt": "What is recursion?",
        "chosen": "Recursion is when a function calls itself with a simpler version of the problem, including a base case to stop.",
        "rejected": "Recursion is loops.",
    },
]
```
  
