Multi-format document parsing tools for PDF, DOCX, HTML, and Markdown, with support for LlamaParse, Unstructured.io, PyPDF2, PDFPlumber, and python-docx. Use when parsing documents, extracting text from PDFs, processing Word documents, converting HTML to text, extracting tables from documents, building RAG pipelines, or chunking documents, or when the user mentions document parsing, PDF extraction, DOCX processing, table extraction, OCR, LlamaParse, Unstructured.io, or document ingestion.
February 1, 2026
`npx add-skill https://github.com/vanman2024/ai-dev-marketplace/blob/main/plugins/rag-pipeline/skills/document-parsers/SKILL.md -a claude-code --skill document-parsers`

Installation paths:

- `.claude/skills/document-parsers/`

# Document Parsers

**Purpose:** Autonomously parse and extract content from multiple document formats (PDF, DOCX, HTML, Markdown) using industry-standard libraries and AI-powered parsing tools.

**Activation Triggers:**

- Building RAG (Retrieval-Augmented Generation) pipelines
- Extracting text, tables, or metadata from documents
- Processing large document collections
- Converting documents to structured formats
- Handling complex PDFs with tables and layouts
- OCR for scanned documents
- Chunking documents for vector embeddings
- Building document search systems

**Key Resources:**

- `scripts/setup-llamaparse.sh` - Install and configure LlamaParse (AI-powered parsing)
- `scripts/setup-unstructured.sh` - Install Unstructured.io library
- `scripts/parse-pdf.py` - Functional PDF parser with multiple backend options
- `scripts/parse-docx.py` - DOCX document parser
- `scripts/parse-html.py` - HTML to structured text parser
- `templates/multi-format-parser.py` - Universal document parser template
- `templates/table-extraction.py` - Specialized table extraction template
- `examples/parse-research-paper.py` - Research paper parsing with citations
- `examples/parse-legal-document.py` - Legal document parsing with sections

## Parser Comparison & Selection Guide

### 1. LlamaParse (AI-Powered Premium)

**Best For:**

- Complex PDFs with tables, charts, and mixed layouts
- Scanned documents requiring OCR
- Documents where accuracy is critical
- Multi-column layouts and scientific papers
- Financial reports and invoices

**Pros:**

- AI-powered layout understanding
- Excellent table extraction accuracy
- Built-in OCR support
- Handles complex formatting
- Structured output (Markdown/JSON)

**Cons:**

- Requires API key (paid service)
- API rate limits
- Network dependency
- Slower than local parsers

**Documentation:** https://docs.cloud.llamaindex.ai/llamaparse

**Setup:**

```bash
./scripts/setup-llamaparse.sh
```

**Usage Pattern:**

```python
from llama_parse import LlamaPars