Reinforcement Learning with Leave-One-Out estimation for policy optimization. Covers RLOOTrainer, reward function integration, baseline estimation, and variance reduction techniques for stable RL training. Includes thinking-aware patterns.
Install with: `npx add-skill https://github.com/atrawog/bazzite-ai-plugins/blob/main/bazzite-ai-jupyter/skills/rloo/SKILL.md -a claude-code --skill rloo`

Installation path: `.claude/skills/rloo/`

# Reinforcement Learning with Leave-One-Out (RLOO)

## Overview

RLOO is a reinforcement learning method that uses leave-one-out baseline estimation for variance reduction. Like GRPO, it generates multiple completions per prompt but uses a different baseline computation that can provide more stable gradients. This skill includes patterns for training thinking/reasoning models.

## Quick Reference

| Component | Purpose |
|-----------|---------|
| `RLOOTrainer` | RL trainer with RLOO baseline |
| `RLOOConfig` | Training hyperparameters |
| `reward_funcs` | Reward function(s) for scoring |
| `completion_ids` | Token IDs passed to reward functions (no re-tokenization) |
| `num_generations` | Completions per prompt (4 typical) |
| `kl_coef` | KL penalty coefficient (0.05, lower than GRPO) |
| `learning_rate` | 1e-5 (same as GRPO) |
| Token ID 151668 | `</think>` boundary for Qwen3-Thinking models |

## Critical Environment Setup

```python
import os
from dotenv import load_dotenv

load_dotenv()

# Force text-based progress in Jupyter
os.environ["TQDM_NOTEBOOK"] = "false"

# CRITICAL: Set BEFORE importing unsloth/TRL
os.environ['ACCELERATE_MIXED_PRECISION'] = 'bf16'
```

## Critical Import Order

```python
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported

# Then TRL imports
from trl import RLOOConfig, RLOOTrainer
from datasets import Dataset
import torch
```

## RLOO Concepts

### How RLOO Works

1. Generate K completions for each prompt
2. Score all completions with the reward function
3. For each completion, compute the baseline as the mean of the other K-1 rewards
4. Advantage = reward - leave-one-out baseline
5. Update the policy using the advantages

### Leave-One-Out Baseline

```
For completion i:
  baseline_i = mean(rewards except reward_i)
  advantage_i = reward_i - baseline_i

This reduces variance compared to:
- Single-sample estimates (high variance)
- Fixed baselines (may be inaccurate)
```

### Comp
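To make the leave-one-out baseline described above concrete, here is a minimal NumPy sketch of steps 3-4 from "How RLOO Works". The function name and the example reward values are illustrative only; `RLOOTrainer` computes these advantages internally.

```python
import numpy as np

def leave_one_out_advantages(rewards: np.ndarray) -> np.ndarray:
    """RLOO advantages for K completions of a single prompt.

    For completion i, the baseline is the mean of the other K-1 rewards;
    the advantage is reward_i minus that baseline.
    """
    k = rewards.shape[0]
    # (sum - reward_i) / (K - 1) == mean of the other K-1 rewards
    baselines = (rewards.sum() - rewards) / (k - 1)
    return rewards - baselines

# Example: K = 4 completions for one prompt
rewards = np.array([1.0, 0.5, 0.0, 0.5])
print(leave_one_out_advantages(rewards))
# ≈ [ 0.667  0.    -0.667  0.   ]
```

Note that the K advantages for a prompt always sum to zero: subtracting the leave-one-out baseline removes the shared, prompt-level component of the reward, which is what reduces gradient variance relative to using raw rewards.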
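The components in the Quick Reference table typically come together roughly as sketched below. This is a hedged sketch, not a verified recipe: it assumes a GRPO-style `RLOOTrainer` interface (`reward_funcs`, `RLOOConfig` with `num_generations` and `kl_coef`), and the model id, dataset, output path, and reward logic are placeholders; check the TRL documentation for the exact signature in your version.

```python
from trl import RLOOConfig, RLOOTrainer
from datasets import Dataset

# Placeholder reward function: favors longer completions (illustrative only).
# Depending on the TRL version, reward functions may also receive kwargs such
# as completion_ids (see the Quick Reference table).
def length_reward(completions, **kwargs):
    return [min(len(c) / 200.0, 1.0) for c in completions]

train_dataset = Dataset.from_dict({"prompt": ["Explain RLOO in one sentence."]})

config = RLOOConfig(
    output_dir="rloo-demo",   # placeholder output path
    learning_rate=1e-5,       # per the Quick Reference table
    num_generations=4,        # K completions per prompt
    kl_coef=0.05,             # KL penalty, lower than typical GRPO values
)

trainer = RLOOTrainer(
    model="Qwen/Qwen3-0.6B",  # placeholder model id
    args=config,
    reward_funcs=length_reward,
    train_dataset=train_dataset,
)
trainer.train()
```

When combining this with Unsloth, the environment-setup and import-order rules above still apply: set `ACCELERATE_MIXED_PRECISION` and import `unsloth` before any TRL imports.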