Group Relative Policy Optimization for reinforcement learning from human feedback. Covers GRPOTrainer, reward function design, policy optimization, and KL divergence constraints for stable RLHF training. Includes thinking-aware reward patterns.
Install with:

`npx add-skill https://github.com/atrawog/bazzite-ai-plugins/blob/main/bazzite-ai-jupyter/skills/grpo/SKILL.md -a claude-code --skill grpo`

Installation path: `.claude/skills/grpo/`

# Group Relative Policy Optimization (GRPO)

## Overview

GRPO is a reinforcement learning method for LLM alignment. It generates multiple completions per prompt, scores them with a reward function, and optimizes the policy to favor higher-reward responses using relative policy gradients. This skill includes patterns for training thinking/reasoning models.

## Quick Reference

| Component | Purpose |
|-----------|---------|
| `GRPOTrainer` | RL trainer for policy optimization |
| `GRPOConfig` | Training hyperparameters |
| `reward_funcs` | Reward function(s) for scoring |
| `completion_ids` | Token IDs passed to reward functions (no re-tokenization) |
| `beta` | KL penalty coefficient (0.1 typical) |
| `num_generations` | Completions per prompt (2-4) |
| `learning_rate` | 1e-5 (10x lower than SFT) |
| Token ID 151668 | `</think>` boundary for Qwen3-Thinking models |

Sketches showing these components in use appear at the end of this skill.

## Critical Environment Setup

```python
import os
from dotenv import load_dotenv

load_dotenv()

# Force text-based progress in Jupyter
os.environ["TQDM_NOTEBOOK"] = "false"

# CRITICAL: Set BEFORE importing unsloth/TRL
os.environ['ACCELERATE_MIXED_PRECISION'] = 'bf16'
```

## Critical Import Order

```python
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported

# Then TRL imports
from trl import GRPOConfig, GRPOTrainer
from datasets import Dataset
import torch
```

**Warning**: Setting `ACCELERATE_MIXED_PRECISION` after imports may cause training issues.

## GRPO Concepts

### How GRPO Works

1. Generate multiple completions for each prompt
2. Score completions with reward function(s)
3. Compute relative advantages within each group (see the sketch below)
4. Update policy to favor higher-reward completions
5. Apply KL penalty to prevent divergence from the reference model

### Key Differences from PPO

| Aspect | GRPO | PPO |
|--------|------|-----|
| Baseline | Group relative | Value function |
| Critic | Not needed | Required |
| Memory | Lower | Higher |
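
### Group-Relative Advantages (Sketch)

Step 3 of "How GRPO Works" replaces PPO's learned value baseline with statistics of each prompt's own completion group. The snippet below is an illustrative sketch of the commonly described normalization (reward minus group mean, divided by group standard deviation); the function name and the epsilon are mine, and TRL's internal implementation may differ in detail.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, num_generations: int) -> torch.Tensor:
    """Normalize rewards within each prompt's group of completions.

    rewards: flat tensor of shape (num_prompts * num_generations,)
    Returns advantages with the same shape.
    """
    groups = rewards.view(-1, num_generations)    # (num_prompts, num_generations)
    mean = groups.mean(dim=1, keepdim=True)
    std = groups.std(dim=1, keepdim=True)
    advantages = (groups - mean) / (std + 1e-4)   # epsilon guards constant-reward groups
    return advantages.view(-1)

# Two prompts, 4 completions each
rewards = torch.tensor([0.1, 0.9, 0.5, 0.5,   # prompt 1
                        0.0, 0.0, 1.0, 0.0])  # prompt 2
print(group_relative_advantages(rewards, num_generations=4))
```

Because the baseline is computed per group, no separate critic network is needed, which is where GRPO's memory advantage over PPO (see the table above) comes from.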
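
## Thinking-Aware Reward (Sketch)

The Quick Reference notes that reward functions receive `completion_ids` directly (no re-tokenization) and that token ID 151668 marks the `</think>` boundary on Qwen3-Thinking models. The following sketch builds a format reward from those two facts; the function name and score values are illustrative, and whether `completion_ids` is passed as a keyword argument depends on your TRL version.

```python
THINK_END_ID = 151668  # </think> token ID for Qwen3-Thinking models

def thinking_format_reward(completions, completion_ids=None, **kwargs):
    """Illustrative reward: 1.0 if the completion closes its <think> block, else 0.0.

    Works on the token IDs the trainer already has, so no re-tokenization is needed.
    """
    return [1.0 if THINK_END_ID in ids else 0.0 for ids in completion_ids]

# Combine with a task-specific reward so format and correctness are both scored, e.g.:
# trainer = GRPOTrainer(..., reward_funcs=[thinking_format_reward, task_reward])
```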
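
## Minimal Training Setup (Sketch)

The pieces above come together in a short script. This is a hedged sketch, not part of the original skill: the model name, LoRA settings, toy dataset, and length-based reward are placeholders, and exact `GRPOConfig` fields can vary between TRL releases.

```python
import os
os.environ["TQDM_NOTEBOOK"] = "false"
os.environ["ACCELERATE_MIXED_PRECISION"] = "bf16"  # set BEFORE importing unsloth/TRL

import unsloth  # import first so TRL gets patched
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer
from datasets import Dataset

# Placeholder base model -- substitute the model you actually train.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-0.5B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Toy prompt-only dataset; GRPO generates the completions itself.
train_dataset = Dataset.from_dict({
    "prompt": ["Explain GRPO in one sentence.", "What is a KL penalty?"],
})

def length_reward(completions, **kwargs):
    """Toy reward: prefer completions of roughly 50 words."""
    return [-abs(len(c.split()) - 50) / 50.0 for c in completions]

config = GRPOConfig(
    output_dir="grpo-demo",
    learning_rate=1e-5,              # ~10x lower than typical SFT
    beta=0.1,                        # KL penalty coefficient
    num_generations=4,               # completions per prompt
    per_device_train_batch_size=4,   # must be divisible by num_generations
    max_completion_length=128,
)

trainer = GRPOTrainer(
    model=model,
    reward_funcs=length_reward,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```

Swapping `length_reward` for a list such as `[thinking_format_reward, task_reward]` lets format and task rewards be combined, as shown in the previous sketch.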