Reinforcement Learning with Leave-One-Out estimation for policy optimization. Covers RLOOTrainer, reward function integration, baseline estimation, and variance reduction techniques for stable RL training. Includes thinking-aware patterns.
Install with: `npx add-skill https://github.com/atrawog/bazzite-ai-plugins/blob/main/bazzite-ai-jupyter/skills/rloo/SKILL.md -a claude-code --skill rloo`

Installation path: `.claude/skills/rloo/`

# Reinforcement Learning with Leave-One-Out (RLOO)

## Overview

RLOO is a reinforcement learning method that uses leave-one-out baseline estimation for variance reduction. Like GRPO, it generates multiple completions per prompt but uses a different baseline computation that can provide more stable gradients. This skill includes patterns for training thinking/reasoning models.

## Quick Reference

| Component | Purpose |
|-----------|---------|
| `RLOOTrainer` | RL trainer with RLOO baseline |
| `RLOOConfig` | Training hyperparameters |
| `reward_funcs` | Reward function(s) for scoring |
| `completion_ids` | Token IDs passed to reward functions (no re-tokenization) |
| `num_generations` | Completions per prompt (4 typical) |
| `kl_coef` | KL penalty coefficient (0.05, lower than GRPO) |
| `learning_rate` | 1e-5 (same as GRPO) |
| Token ID 151668 | `</think>` boundary for Qwen3-Thinking models |

## Critical Environment Setup

```python
import os
from dotenv import load_dotenv

load_dotenv()

# Force text-based progress in Jupyter
os.environ["TQDM_NOTEBOOK"] = "false"

# CRITICAL: Set BEFORE importing unsloth/TRL
os.environ['ACCELERATE_MIXED_PRECISION'] = 'bf16'
```

## Critical Import Order

```python
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported

# Then TRL imports
from trl import RLOOConfig, RLOOTrainer
from datasets import Dataset
import torch
```

## RLOO Concepts

### How RLOO Works

1. Generate K completions for each prompt
2. Score all completions with the reward function
3. For each completion, compute the baseline as the mean of the other K-1 rewards
4. Advantage = reward - leave-one-out baseline
5. Update the policy using the advantages

### Leave-One-Out Baseline

```
For completion i:
  baseline_i = mean(rewards except reward_i)
  advantage_i = reward_i - baseline_i

This reduces variance compared to:
- Single-sample estimates (high variance)
- Fixed baselines (may be inaccurate)
```

### Comp
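To make the leave-one-out baseline described above concrete, here is a minimal NumPy sketch of steps 3-4 from "How RLOO Works". The function name and the example reward values are illustrative only; `RLOOTrainer` computes these advantages internally.

```python
import numpy as np

def leave_one_out_advantages(rewards: np.ndarray) -> np.ndarray:
    """RLOO advantages for K completions of a single prompt.

    For completion i, the baseline is the mean of the other K-1 rewards;
    the advantage is reward_i minus that baseline.
    """
    k = rewards.shape[0]
    # (sum - reward_i) / (K - 1) == mean of the other K-1 rewards
    baselines = (rewards.sum() - rewards) / (k - 1)
    return rewards - baselines

# Example: K = 4 completions for one prompt
rewards = np.array([1.0, 0.5, 0.0, 0.5])
print(leave_one_out_advantages(rewards))
# ≈ [ 0.667  0.    -0.667  0.   ]
```

Note that the K advantages for a prompt always sum to zero: subtracting the leave-one-out baseline removes the shared, prompt-level component of the reward, which is what reduces gradient variance relative to using raw rewards.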
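The components in the Quick Reference table typically come together roughly as sketched below. This is a hedged sketch, not a verified recipe: it assumes a GRPO-style `RLOOTrainer` interface (`reward_funcs`, `RLOOConfig` with `num_generations` and `kl_coef`), and the model id, dataset, output path, and reward logic are placeholders; check the TRL documentation for the exact signature in your version.

```python
from trl import RLOOConfig, RLOOTrainer
from datasets import Dataset

# Placeholder reward function: favors longer completions (illustrative only).
# Depending on the TRL version, reward functions may also receive kwargs such
# as completion_ids (see the Quick Reference table).
def length_reward(completions, **kwargs):
    return [min(len(c) / 200.0, 1.0) for c in completions]

train_dataset = Dataset.from_dict({"prompt": ["Explain RLOO in one sentence."]})

config = RLOOConfig(
    output_dir="rloo-demo",   # placeholder output path
    learning_rate=1e-5,       # per the Quick Reference table
    num_generations=4,        # K completions per prompt
    kl_coef=0.05,             # KL penalty, lower than typical GRPO values
)

trainer = RLOOTrainer(
    model="Qwen/Qwen3-0.6B",  # placeholder model id
    args=config,
    reward_funcs=length_reward,
    train_dataset=train_dataset,
)
trainer.train()
```

When combining this with Unsloth, the environment-setup and import-order rules above still apply: set `ACCELERATE_MIXED_PRECISION` and import `unsloth` before any TRL imports.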