Group Relative Policy Optimization for reinforcement learning from human feedback. Covers GRPOTrainer, reward function design, policy optimization, and KL divergence constraints for stable RLHF training. Includes thinking-aware reward patterns.
Install with:

`npx add-skill https://github.com/atrawog/bazzite-ai-plugins/blob/main/bazzite-ai-jupyter/skills/grpo/SKILL.md -a claude-code --skill grpo`

Installation path: `.claude/skills/grpo/`

# Group Relative Policy Optimization (GRPO)

## Overview

GRPO is a reinforcement learning method for LLM alignment. It generates multiple completions per prompt, scores them with a reward function, and optimizes the policy to favor higher-reward responses using relative policy gradients. This skill includes patterns for training thinking/reasoning models.

## Quick Reference

| Component | Purpose |
|-----------|---------|
| `GRPOTrainer` | RL trainer for policy optimization |
| `GRPOConfig` | Training hyperparameters |
| `reward_funcs` | Reward function(s) for scoring |
| `completion_ids` | Token IDs passed to reward functions (no re-tokenization) |
| `beta` | KL penalty coefficient (0.1 typical) |
| `num_generations` | Completions per prompt (2-4) |
| `learning_rate` | 1e-5 (10x lower than SFT) |
| Token ID 151668 | `</think>` boundary for Qwen3-Thinking models |

Sketches showing these components in use appear at the end of this skill.

## Critical Environment Setup

```python
import os
from dotenv import load_dotenv

load_dotenv()

# Force text-based progress in Jupyter
os.environ["TQDM_NOTEBOOK"] = "false"

# CRITICAL: Set BEFORE importing unsloth/TRL
os.environ['ACCELERATE_MIXED_PRECISION'] = 'bf16'
```

## Critical Import Order

```python
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported

# Then TRL imports
from trl import GRPOConfig, GRPOTrainer
from datasets import Dataset
import torch
```

**Warning**: Setting `ACCELERATE_MIXED_PRECISION` after imports may cause training issues.

## GRPO Concepts

### How GRPO Works

1. Generate multiple completions for each prompt
2. Score completions with reward function(s)
3. Compute relative advantages within each group (see the sketch below)
4. Update policy to favor higher-reward completions
5. Apply KL penalty to prevent divergence from the reference model

### Key Differences from PPO

| Aspect | GRPO | PPO |
|--------|------|-----|
| Baseline | Group relative | Value function |
| Critic | Not needed | Required |
| Memory | Lower | Higher |
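
### Group-Relative Advantages (Sketch)

Step 3 of "How GRPO Works" replaces PPO's learned value baseline with statistics of each prompt's own completion group. The snippet below is an illustrative sketch of the commonly described normalization (reward minus group mean, divided by group standard deviation); the function name and the epsilon are mine, and TRL's internal implementation may differ in detail.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, num_generations: int) -> torch.Tensor:
    """Normalize rewards within each prompt's group of completions.

    rewards: flat tensor of shape (num_prompts * num_generations,)
    Returns advantages with the same shape.
    """
    groups = rewards.view(-1, num_generations)    # (num_prompts, num_generations)
    mean = groups.mean(dim=1, keepdim=True)
    std = groups.std(dim=1, keepdim=True)
    advantages = (groups - mean) / (std + 1e-4)   # epsilon guards constant-reward groups
    return advantages.view(-1)

# Two prompts, 4 completions each
rewards = torch.tensor([0.1, 0.9, 0.5, 0.5,   # prompt 1
                        0.0, 0.0, 1.0, 0.0])  # prompt 2
print(group_relative_advantages(rewards, num_generations=4))
```

Because the baseline is computed per group, no separate critic network is needed, which is where GRPO's memory advantage over PPO (see the table above) comes from.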
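
## Thinking-Aware Reward (Sketch)

The Quick Reference notes that reward functions receive `completion_ids` directly (no re-tokenization) and that token ID 151668 marks the `</think>` boundary on Qwen3-Thinking models. The following sketch builds a format reward from those two facts; the function name and score values are illustrative, and whether `completion_ids` is passed as a keyword argument depends on your TRL version.

```python
THINK_END_ID = 151668  # </think> token ID for Qwen3-Thinking models

def thinking_format_reward(completions, completion_ids=None, **kwargs):
    """Illustrative reward: 1.0 if the completion closes its <think> block, else 0.0.

    Works on the token IDs the trainer already has, so no re-tokenization is needed.
    """
    return [1.0 if THINK_END_ID in ids else 0.0 for ids in completion_ids]

# Combine with a task-specific reward so format and correctness are both scored, e.g.:
# trainer = GRPOTrainer(..., reward_funcs=[thinking_format_reward, task_reward])
```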
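
## Minimal Training Setup (Sketch)

The pieces above come together in a short script. This is a hedged sketch, not part of the original skill: the model name, LoRA settings, toy dataset, and length-based reward are placeholders, and exact `GRPOConfig` fields can vary between TRL releases.

```python
import os
os.environ["TQDM_NOTEBOOK"] = "false"
os.environ["ACCELERATE_MIXED_PRECISION"] = "bf16"  # set BEFORE importing unsloth/TRL

import unsloth  # import first so TRL gets patched
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer
from datasets import Dataset

# Placeholder base model -- substitute the model you actually train.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-0.5B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Toy prompt-only dataset; GRPO generates the completions itself.
train_dataset = Dataset.from_dict({
    "prompt": ["Explain GRPO in one sentence.", "What is a KL penalty?"],
})

def length_reward(completions, **kwargs):
    """Toy reward: prefer completions of roughly 50 words."""
    return [-abs(len(c.split()) - 50) / 50.0 for c in completions]

config = GRPOConfig(
    output_dir="grpo-demo",
    learning_rate=1e-5,              # ~10x lower than typical SFT
    beta=0.1,                        # KL penalty coefficient
    num_generations=4,               # completions per prompt
    per_device_train_batch_size=4,   # must be divisible by num_generations
    max_completion_length=128,
)

trainer = GRPOTrainer(
    model=model,
    reward_funcs=length_reward,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```

Swapping `length_reward` for a list such as `[thinking_format_reward, task_reward]` lets format and task rewards be combined, as shown in the previous sketch.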