rloo (verified)

Reinforcement Learning with Leave-One-Out estimation for policy optimization. Covers RLOOTrainer, reward function integration, baseline estimation, and variance reduction techniques for stable RL training. Includes thinking-aware patterns.

Marketplace: bazzite-ai-plugins (atrawog/bazzite-ai-plugins)
Plugin: bazzite-ai-jupyter (development)
Repository: atrawog/bazzite-ai-plugins
Path: bazzite-ai-jupyter/skills/rloo/SKILL.md
Last Verified: January 21, 2026

Install Skill

npx add-skill https://github.com/atrawog/bazzite-ai-plugins/blob/main/bazzite-ai-jupyter/skills/rloo/SKILL.md -a claude-code --skill rloo

Installation paths:

Claude: .claude/skills/rloo/

Instructions

# Reinforcement Learning with Leave-One-Out (RLOO)

## Overview

RLOO is a reinforcement learning method that uses leave-one-out baseline estimation for variance reduction. Like GRPO, it generates multiple completions per prompt but uses a different baseline computation that can provide more stable gradients. This skill includes patterns for training thinking/reasoning models.
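For reference, here is a minimal sketch of the two baseline computations for K sampled completions with rewards r_1, ..., r_K. The RLOO form matches the pseudocode later in this skill; the GRPO form is shown only for contrast and is the commonly used group-normalized advantage, stated here as an assumption rather than something this skill defines:

```latex
% RLOO: leave-one-out baseline (matches the pseudocode below)
A_i^{\mathrm{RLOO}} = r_i - \frac{1}{K-1} \sum_{j \neq i} r_j

% GRPO (assumed standard form, shown only for contrast)
A_i^{\mathrm{GRPO}} = \frac{r_i - \operatorname{mean}(r_1, \dots, r_K)}{\operatorname{std}(r_1, \dots, r_K)}
```

Because each completion's baseline excludes its own reward, subtracting it does not bias the policy-gradient estimate.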

## Quick Reference

| Component | Purpose |
|-----------|---------|
| `RLOOTrainer` | RL trainer with RLOO baseline |
| `RLOOConfig` | Training hyperparameters |
| `reward_funcs` | Reward function(s) for scoring |
| `completion_ids` | Token IDs passed to reward functions (no re-tokenization) |
| `num_generations` | Completions per prompt (4 typical) |
| `kl_coef` | KL penalty coefficient (0.05, lower than GRPO) |
| `learning_rate` | 1e-5 (same as GRPO) |
| Token ID 151668 | `</think>` boundary for Qwen3-Thinking models |
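Below is a minimal sketch of a reward function wired into the components above. The keyword arguments (`completions`, `completion_ids`) mirror the table, but the exact callback signature depends on your TRL version, and the scoring logic here (rewarding exactly one closed `</think>` block via token ID 151668) is purely illustrative:

```python
# Hedged sketch: a reward function that works on completion_ids directly
# (no re-tokenization). Signature and kwargs are assumptions based on the
# quick-reference table; verify against your installed TRL version.

THINK_END_TOKEN_ID = 151668  # </think> boundary for Qwen3-Thinking models


def thinking_format_reward(completions, completion_ids, **kwargs):
    """Reward completions that close their thinking block exactly once."""
    rewards = []
    for ids in completion_ids:
        closes = sum(1 for tok in ids if tok == THINK_END_TOKEN_ID)
        # +1.0 for a single closed <think>...</think> block, penalty otherwise
        rewards.append(1.0 if closes == 1 else -0.5)
    return rewards
```

A reward function like this is later passed to the trainer via `reward_funcs`, either alone or alongside task-specific rewards (correctness, length, formatting).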

## Critical Environment Setup

```python
import os
from dotenv import load_dotenv
load_dotenv()

# Force text-based progress in Jupyter
os.environ["TQDM_NOTEBOOK"] = "false"

# CRITICAL: Set BEFORE importing unsloth/TRL
os.environ['ACCELERATE_MIXED_PRECISION'] = 'bf16'
```

## Critical Import Order

```python
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported

# Then TRL imports
from trl import RLOOConfig, RLOOTrainer
from datasets import Dataset
import torch
```
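If no model is loaded yet, a typical Unsloth load looks roughly like the sketch below. The model name, sequence length, and LoRA settings are illustrative placeholders, not values prescribed by this skill:

```python
# Hedged sketch: load a base model with Unsloth before building the trainer.
# Model name, max_seq_length, and LoRA hyperparameters are placeholders.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B",  # placeholder; use your target model
    max_seq_length=2048,
    load_in_4bit=True,
    dtype=None,  # let Unsloth choose (bf16 when supported)
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank (illustrative)
    lora_alpha=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
```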

## RLOO Concepts

### How RLOO Works

1. Generate K completions for each prompt
2. Score all completions with the reward function(s)
3. For each completion, compute a baseline as the mean of the other K-1 rewards
4. Advantage = reward - leave-one-out baseline
5. Update the policy using the advantages
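Assuming a model, tokenizer, and prompt dataset are available (for example via the Unsloth sketch above), these steps map onto `RLOOConfig`/`RLOOTrainer` roughly as follows. Hyperparameter values come from the quick reference; argument names can differ across TRL versions, so treat this as a hedged sketch rather than a canonical recipe:

```python
# Hedged sketch: wiring the RLOO loop through TRL. The "prompt" column name
# and the trainer keyword arguments are assumptions; check your TRL version.
dataset = Dataset.from_dict({
    "prompt": ["Solve: 2 + 2 = ?", "Name a prime number greater than 10."],
})

config = RLOOConfig(
    output_dir="rloo-output",
    learning_rate=1e-5,        # same as GRPO (per the quick reference)
    num_generations=4,         # K completions per prompt
    kl_coef=0.05,              # lower KL penalty than GRPO
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    bf16=is_bf16_supported(),
    logging_steps=1,
)

trainer = RLOOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    reward_funcs=[thinking_format_reward],  # reward function(s) defined earlier
    processing_class=tokenizer,             # keyword name varies by TRL version
)
trainer.train()
```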

### Leave-One-Out Baseline

```
For completion i:
  baseline_i = mean(rewards except reward_i)
  advantage_i = reward_i - baseline_i
```

This reduces variance compared to:

- Single-sample estimates (high variance)
- Fixed baselines (may be inaccurate)
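As a concrete check of the formula, here is a small self-contained sketch that computes leave-one-out advantages for one prompt's K rewards (plain PyTorch, independent of the trainer; the reward values are made up):

```python
import torch

# Rewards for K = 4 completions of a single prompt (illustrative values)
rewards = torch.tensor([1.0, 0.0, 0.5, 0.5])
K = rewards.numel()

# Leave-one-out baseline: mean of the other K-1 rewards for each completion
baselines = (rewards.sum() - rewards) / (K - 1)
advantages = rewards - baselines

print(baselines)   # tensor([0.3333, 0.6667, 0.5000, 0.5000])
print(advantages)  # tensor([ 0.6667, -0.6667,  0.0000,  0.0000])
```

Note that the K advantages of a group always sum to zero, and a group whose completions all receive the same reward contributes zero advantage everywhere.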

### Comp
