# Document Parsers

**Purpose:** Autonomously parse and extract content from multiple document formats (PDF, DOCX, HTML, Markdown) using industry-standard libraries and AI-powered parsing tools.

**Activation Triggers:**
- Building RAG (Retrieval-Augmented Generation) pipelines
- Extracting text, tables, or metadata from documents
- Processing large document collections
- Converting documents to structured formats
- Handling complex PDFs with tables and layouts
- OCR for scanned documents
- Chunking documents for vector embeddings
- Building document search systems
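
For the chunking trigger above, a minimal sketch of fixed-size chunking with overlap may help show what "chunking for vector embeddings" involves. The function name `chunk_text` and the parameters `chunk_size` and `overlap` are hypothetical and are not taken from this skill's scripts. Production pipelines often split on sentence or token boundaries rather than raw characters.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into character windows that overlap by `overlap` characters.

    Hypothetical helper for illustration; not part of this skill's scripts.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk made up only of overlap is emitted.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of some duplicated text in the index.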

**Key Resources:**
- `scripts/setup-llamaparse.sh` - Install and configure LlamaParse (AI-powered parsing)
- `scripts/setup-unstructured.sh` - Install Unstructured.io library
- `scripts/parse-pdf.py` - Functional PDF parser with multiple backend options
- `scripts/parse-docx.py` - DOCX document parser
- `scripts/parse-html.py` - HTML to structured text parser
- `templates/multi-format-parser.py` - Universal document parser template
- `templates/table-extraction.py` - Specialized table extraction template
- `examples/parse-research-paper.py` - Research paper parsing with citations
- `examples/parse-legal-document.py` - Legal document parsing with sections

## Parser Comparison & Selection Guide

### 1. LlamaParse (AI-Powered Premium)

**Best For:**
- Complex PDFs with tables, charts, and mixed layouts
- Scanned documents requiring OCR
- Documents where accuracy is critical
- Multi-column layouts and scientific papers
- Financial reports and invoices

**Pros:**
- AI-powered layout understanding
- Excellent table extraction accuracy
- Built-in OCR support
- Handles complex formatting
- Structured output (Markdown/JSON)

**Cons:**
- Requires API key (paid service)
- API rate limits
- Network dependency
- Slower than local parsers
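
The trade-offs above can be expressed as a small decision helper. This is a sketch under stated assumptions: the `DocumentTraits` fields, the `choose_backend` function, and the backend labels `"llamaparse"` and `"local"` are hypothetical names introduced for illustration, and the rule simply encodes the Best For and Cons lists rather than any measured benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentTraits:
    """Hypothetical description of a document, mirroring the Best For list."""
    is_scanned: bool = False          # needs OCR
    has_complex_layout: bool = False  # tables, charts, multi-column pages
    accuracy_critical: bool = False   # e.g. financial reports, invoices


def choose_backend(traits: DocumentTraits, *, has_api_key: bool,
                   network_available: bool) -> str:
    """Return "llamaparse" when its strengths apply and its requirements are met."""
    benefits_from_ai = (
        traits.is_scanned
        or traits.has_complex_layout
        or traits.accuracy_critical
    )
    # LlamaParse is a hosted service: it needs both an API key and network access.
    can_use_api = has_api_key and network_available
    if benefits_from_ai and can_use_api:
        return "llamaparse"
    # Otherwise fall back to a local parser, which is faster and works offline.
    return "local"
```

For example, a scanned invoice with a configured key and network access would route to `"llamaparse"`, while a simple text document would stay on the local path.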

**Documentation:** https://docs.cloud.llamaindex.ai/llamaparse

**Setup:**
```bash
./scripts/setup-llamaparse.sh
```
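
Once setup has run, a parse call might look like the sketch below. It assumes the `llama_parse` package is installed, that the API key is stored in the `LLAMA_CLOUD_API_KEY` environment variable, and that `LlamaParse` accepts `api_key` and `result_type` and exposes `load_data`. Whether the setup script configures the key this way is an assumption. The wrapper `parse_to_markdown` and its optional `parser` argument are hypothetical and exist mainly so the function can be exercised without network access.

```python
import os


def parse_to_markdown(path: str, parser=None) -> str:
    """Parse one document and return its content as a single Markdown string.

    `parser` is any object with a `load_data(path)` method that returns items
    carrying a `.text` attribute. When omitted, a LlamaParse client is built
    from the LLAMA_CLOUD_API_KEY environment variable (assumed setup).
    """
    if parser is None:
        api_key = os.environ.get("LLAMA_CLOUD_API_KEY")
        if not api_key:
            raise RuntimeError("LLAMA_CLOUD_API_KEY is not set")
        # Imported lazily so the module can be loaded without the package.
        from llama_parse import LlamaParse
        parser = LlamaParse(api_key=api_key, result_type="markdown")

    documents = parser.load_data(path)
    # Join the returned documents into one Markdown string.
    return "\n\n".join(doc.text for doc in documents)
```

Failing early on a missing key avoids a less obvious error from the remote service later in the pipeline.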

**Usage Pattern:**
```python
from llama_parse import LlamaParse
```