Back to Skills

book-sft-pipeline

verified

End-to-end system for creating supervised fine-tuning datasets from books and training style-transfer models. Covers text extraction, intelligent segmentation, synthetic instruction generation, Tinker-compatible output, LoRA training, and validation.

View on GitHub

Marketplace

agents-skills-plugins

EricGrill/agents-skills-plugins

Plugin

book-training

ai

Repository

EricGrill/agents-skills-plugins
3stars

plugins/book-training/skills/book-sft-pipeline/SKILL.md

Last Verified

January 20, 2026

Install Skill

Select agents to install to:

Scope:
npx add-skill https://github.com/EricGrill/agents-skills-plugins/blob/main/plugins/book-training/skills/book-sft-pipeline/SKILL.md -a claude-code --skill book-sft-pipeline

Installation paths:

Claude
.claude/skills/book-sft-pipeline/
Powered by add-skill CLI

Instructions

# Book SFT Pipeline

A complete system for converting books into SFT datasets and training style-transfer models. This skill teaches the pipeline from raw ePub to a model that writes in any author's voice.

## When to Activate

Activate this skill when:
- Building fine-tuning datasets from literary works
- Creating author-voice or style-transfer models
- Preparing training data for Tinker or similar SFT platforms
- Designing text segmentation pipelines for long-form content
- Training small models (8B or less) on limited data

## Core Concepts

### The Three Pillars of Book SFT

**1. Intelligent Segmentation**
Text chunks must be semantically coherent. Breaking mid-sentence teaches the model to produce fragmented output. Target: 150-400 words per chunk, always at natural boundaries.

**2. Diverse Instruction Generation**
Use multiple prompt templates and system prompts to prevent overfitting. A single prompt style leads to memorization. Use 15+ prompt templates with 5+ system prompts.

**3. Style Over Content**
The goal is learning the author's rhythm and vocabulary patterns, not memorizing plots. Synthetic instructions describe what happens without quoting the text.

## Pipeline Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                    ORCHESTRATOR AGENT                           │
│  Coordinates pipeline phases, manages state, handles failures   │
└──────────────────────┬──────────────────────────────────────────┘
                       │
       ┌───────────────┼───────────────┬───────────────┐
       ▼               ▼               ▼               ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│  EXTRACTION  │ │ SEGMENTATION │ │  INSTRUCTION │ │   DATASET    │
│    AGENT     │ │    AGENT     │ │    AGENT     │ │   BUILDER    │
│ ePub → Text  │ │ Text → Chunks│ │ Chunks →     │ │ Pairs →      │
│              │ │ 150-400 words│ │ Prompts      │ │ JSONL        │
└──────────────┘ └──────────────

Validation Details

Front Matter
Required Fields
Valid Name Format
Valid Description
Has Sections
Allowed Tools
Instruction Length:
9323 chars