pdftext

verified

Extract text from PDFs for LLM consumption using AI-powered or traditional tools. Use when converting academic PDFs to markdown, extracting structured content (headers, tables, lists), batch processing research papers, preparing PDFs for RAG systems, or when a request mentions "pdf extraction", "pdf to text", "pdf to markdown", "docling", "pymupdf", or "pdfplumber". Provides Docling (AI-powered, structure-preserving, 97.9% table accuracy) and traditional tools (PyMuPDF for speed, pdfplumber for quality). All processing is on-device with no API calls.

Marketplace

warren-claude-code-plugin-marketplace

WarrenZhu050413/Warren-Claude-Code-Plugin-Marketplace

Plugin

claude-context-orchestrator

Repository

WarrenZhu050413/Warren-Claude-Code-Plugin-Marketplace
5 stars

claude-context-orchestrator/skills/pdftext/SKILL.md

Last Verified

January 18, 2026

Install Skill

npx add-skill https://github.com/WarrenZhu050413/Warren-Claude-Code-Plugin-Marketplace/blob/main/claude-context-orchestrator/skills/pdftext/SKILL.md -a claude-code --skill pdftext

Installation paths:

Claude
.claude/skills/pdftext/

Instructions

# PDF Text Extraction

## Tool Selection

| Tool | Speed | Quality | Structure | Use When |
|------|-------|---------|-----------|----------|
| **Docling** | 0.43s/page | Good | ✓ Yes | Need headers/tables/lists, academic PDFs, LLM consumption |
| **PyMuPDF** | 0.01s/page | Excellent | ✗ No | Speed is critical, simple text extraction, quality is good enough |
| **pdfplumber** | 0.44s/page | Perfect | ✗ No | Maximum fidelity needed, slower speed acceptable |

**Decision:**
- Academic research → Docling (structure preservation)
- Batch processing → PyMuPDF (60x faster)
- Critical accuracy → pdfplumber (0 quality issues)
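
The decision list and the per-page timings in the table can be combined into a quick planning aid. The following is a minimal sketch, not part of the skill itself: `choose_tool` and `estimate_seconds` are hypothetical names, the priority order (structure first, then speed) is an assumption, and the timings are simply the table's figures, which will vary with hardware and document layout.

```python
# Per-page timings copied from the Tool Selection table (seconds per page).
SECONDS_PER_PAGE = {
    "docling": 0.43,
    "pymupdf": 0.01,
    "pdfplumber": 0.44,
}


def choose_tool(need_structure: bool, speed_critical: bool) -> str:
    """Hypothetical helper mirroring the decision list (assumed priority order)."""
    if need_structure:
        return "docling"      # only tool in the table that preserves structure
    if speed_critical:
        return "pymupdf"      # fastest per page
    return "pdfplumber"       # maximum fidelity, slower


def estimate_seconds(tool: str, pages: int) -> float:
    """Rough estimate only; ignores model loading and disk I/O."""
    return SECONDS_PER_PAGE[tool] * pages
```

For example, `estimate_seconds(choose_tool(False, True), 5000)` gives a rough figure for running PyMuPDF over a 5000-page batch.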

## Installation

```bash
# Create virtual environment
python3 -m venv pdf_env
source pdf_env/bin/activate

# Install Docling (AI-powered, recommended)
pip install docling

# Install traditional tools
pip install pymupdf pdfplumber
```

**The first Docling run downloads ML models** (~500MB-1GB, cached locally, no API calls).

## Basic Usage

### Docling (Structure-Preserving)

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()  # Reuse for multiple PDFs
result = converter.convert("paper.pdf")
markdown = result.document.export_to_markdown()

# Save output
with open("paper.md", "w", encoding="utf-8") as f:
    f.write(markdown)
```

**Output includes:** Headers (##), tables (|...|), lists (- ...), image markers.
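
Because the markdown output carries headers, it can be cut into sections before being fed to a RAG pipeline. The following is a minimal sketch using only the standard library; `split_by_headers` is a hypothetical helper, not a Docling API, and it assumes ATX-style headers (`#` to `######`) at the start of a line.

```python
import re
from typing import Dict, List

# Matches ATX headers such as "## Methods" (assumed header style).
HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headers(markdown: str) -> List[Dict]:
    """Split markdown text into sections, one per header."""
    sections = []
    title, level, lines = "", 0, []

    def flush():
        # Keep a section only if it has a title or non-blank body text.
        text = "\n".join(lines).strip()
        if title or text:
            sections.append({"title": title, "level": level, "text": text})

    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            flush()  # close the previous section before starting a new one
            title, level, lines = match.group(2).strip(), len(match.group(1)), []
        else:
            lines.append(line)
    flush()
    return sections
```

Each returned dict holds the header title, its level, and the body text (tables and list items included), so a section can be embedded or indexed as one chunk.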

### PyMuPDF (Fast)

```python
import fitz

doc = fitz.open("paper.pdf")
text = "\n".join(page.get_text() for page in doc)
doc.close()

with open("paper.txt", "w", encoding="utf-8") as f:
    f.write(text)
```

### pdfplumber (Highest Quality)

```python
import pdfplumber

with pdfplumber.open("paper.pdf") as pdf:
    text = "\n".join(page.extract_text() or "" for page in pdf.pages)

with open("paper.txt", "w", encoding="utf-8") as f:
    f.write(text)
```

## Batch Processing

See `examples/batch_convert.py` for a ready-to-use script.
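
The bookkeeping half of a batch run (deciding which PDFs still need converting and where each output goes) needs only the standard library. The sketch below is an illustration under stated assumptions, not the contents of `examples/batch_convert.py`: `plan_batch` and its `overwrite` flag are hypothetical, and outputs are assumed to be `.md` files named after each PDF.

```python
from pathlib import Path
from typing import List, Tuple


def plan_batch(input_dir: Path, output_dir: Path,
               overwrite: bool = False) -> List[Tuple[Path, Path]]:
    """Pair each PDF with its markdown output path (hypothetical helper)."""
    jobs = []
    for pdf_path in sorted(input_dir.glob("*.pdf")):
        md_path = output_dir / (pdf_path.stem + ".md")
        # Skip files that were already converted unless overwrite is requested.
        if md_path.exists() and not overwrite:
            continue
        jobs.append((pdf_path, md_path))
    return jobs
```

Each `(pdf_path, md_path)` pair can then be handed to a single shared `DocumentConverter`, using `converter.convert(...)` and `export_to_markdown()` as in Basic Usage, so an interrupted run can resume without redoing finished files.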

**Pattern:**
```python
from pathlib import Path
from docling.document_converter import DocumentConverter

converter = DocumentConverter()  # Ini
```
