
grpo (verified)

Group Relative Policy Optimization for reinforcement learning from human feedback. Covers GRPOTrainer, reward function design, policy optimization, and KL divergence constraints for stable RLHF training. Includes thinking-aware reward patterns.


Marketplace: bazzite-ai-plugins (atrawog/bazzite-ai-plugins)
Plugin: bazzite-ai-jupyter (development)
Repository: atrawog/bazzite-ai-plugins
Path: bazzite-ai-jupyter/skills/grpo/SKILL.md
Last Verified: January 21, 2026

Install Skill

npx add-skill https://github.com/atrawog/bazzite-ai-plugins/blob/main/bazzite-ai-jupyter/skills/grpo/SKILL.md -a claude-code --skill grpo

Installation paths:

Claude: .claude/skills/grpo/

Instructions

# Group Relative Policy Optimization (GRPO)

## Overview

GRPO is a reinforcement learning method for LLM alignment. It generates multiple completions per prompt, scores them with a reward function, and updates the policy to favor higher-reward responses using group-relative advantages rather than a learned value baseline. This skill includes patterns for training thinking/reasoning models.

## Quick Reference

| Component | Purpose |
|-----------|---------|
| `GRPOTrainer` | RL trainer for policy optimization |
| `GRPOConfig` | Training hyperparameters |
| `reward_funcs` | Reward function(s) for scoring |
| `completion_ids` | Token IDs passed to reward functions (no re-tokenization) |
| `beta` | KL penalty coefficient (0.1 typical) |
| `num_generations` | Completions per prompt (2-4) |
| `learning_rate` | 1e-5 (10x lower than SFT) |
| Token ID 151668 | `</think>` boundary for Qwen3-Thinking models |
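
The sketch below wires these pieces together with unsloth and TRL. Treat it as an illustrative starting point rather than a drop-in recipe: the model checkpoint, LoRA settings, toy dataset, and reward logic are placeholder assumptions, and exact `GRPOConfig` argument names can vary across TRL versions.

```python
# Minimal GRPO setup sketch (placeholder model/dataset/reward; verify
# argument names against your installed TRL and unsloth versions).
import os
os.environ["ACCELERATE_MIXED_PRECISION"] = "bf16"  # set BEFORE unsloth/TRL imports

import unsloth  # must come before TRL so unsloth can patch it
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer
from datasets import Dataset

THINK_END_ID = 151668  # `</think>` boundary token for Qwen3-Thinking models

def thinking_format_reward(completions, completion_ids, **kwargs):
    """Reward completions that close their thinking block exactly once."""
    return [1.0 if ids.count(THINK_END_ID) == 1 else 0.0 for ids in completion_ids]

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B",  # placeholder checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

train_dataset = Dataset.from_list(
    [{"prompt": "Explain KL divergence in one sentence."}]  # toy prompt
)

config = GRPOConfig(
    output_dir="grpo-output",
    learning_rate=1e-5,            # ~10x lower than typical SFT
    num_generations=4,             # completions sampled per prompt
    beta=0.1,                      # KL penalty coefficient
    per_device_train_batch_size=4, # global batch must divide evenly by num_generations
    max_completion_length=512,
    logging_steps=1,
)

trainer = GRPOTrainer(
    model=model,
    reward_funcs=[thinking_format_reward],
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```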

## Critical Environment Setup

```python
import os
from dotenv import load_dotenv
load_dotenv()

# Force text-based progress in Jupyter
os.environ["TQDM_NOTEBOOK"] = "false"

# CRITICAL: Set BEFORE importing unsloth/TRL
os.environ['ACCELERATE_MIXED_PRECISION'] = 'bf16'
```

## Critical Import Order

```python
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported

# Then TRL imports
from trl import GRPOConfig, GRPOTrainer
from datasets import Dataset
import torch
```

**Warning**: Setting `ACCELERATE_MIXED_PRECISION` after imports may cause training issues.

## GRPO Concepts

### How GRPO Works

1. Generate multiple completions for each prompt
2. Score completions with reward function(s)
3. Compute relative advantages within each group (see the sketch after this list)
4. Update policy to favor higher-reward completions
5. Apply KL penalty to prevent divergence from reference
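
As a rough illustration of steps 2-5, the snippet below computes group-relative advantages for one prompt's completions. The reward values are made up, and the KL term (weighted by `beta`) is added separately in the actual loss.

```python
# Group-relative advantage sketch (illustrative rewards for one prompt's group).
import torch

rewards = torch.tensor([0.2, 1.0, 0.0, 0.6])  # one reward per completion (step 2)

# Step 3: baseline each completion against its own group; no critic needed.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)
print(advantages)  # ~ tensor([-0.56, 1.24, -1.01, 0.34])

# Step 4: completions above the group mean get positive advantages and are
# reinforced; those below the mean are pushed down.  Step 5 adds a per-token
# KL penalty (scaled by `beta`) to keep the policy close to the reference model.
```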

### Key Differences from PPO

| Aspect | GRPO | PPO |
|--------|------|-----|
| Baseline | Group relative | Value function |
| Critic | Not needed | Required |
| Memory | Lower | Higher |
