High-performance Rust web crawler with stealth mode, LLM-ready Markdown export, multi-format output, sitemap discovery, and robots.txt support. Optimized for content extraction, site mapping, structure analysis, and LLM/RAG pipelines.
View on GitHubleobrival/topographic-plugins-official
web-crawler
plugins/web-crawler/skills/web-crawler/SKILL.md
February 1, 2026
Select agents to install to:
npx add-skill https://github.com/leobrival/topographic-plugins-official/blob/main/plugins/web-crawler/skills/web-crawler/SKILL.md -a claude-code --skill web-crawlerInstallation paths:
.claude/skills/web-crawler/# Rust Web Crawler (rcrawler) High-performance web crawler built in **pure Rust** with production-grade features for fast, reliable site crawling. ## When to Use This Skill Use this skill when the user requests: - Web crawling or site mapping - Sitemap discovery and analysis - Link extraction and validation - Site structure visualization - robots.txt compliance checking - Performance-critical web scraping - Generating interactive web reports with graph visualization ## Core Capabilities ### ๐ Performance - **60+ pages/sec** throughput with async Tokio runtime - **<50ms startup** time - Near-instant initialization - **~50MB memory** usage - Efficient resource consumption - **5.4 MB binary** - Single executable, no dependencies ### ๐ค Intelligence - **Sitemap discovery**: Automatically finds and parses sitemap.xml (3 standard locations) - **robots.txt compliance**: Respects crawling rules with per-domain caching - **Smart filtering**: Auto-excludes images, CSS, JS, PDFs by default - **Domain auto-detection**: Extracts and restricts to base domain automatically ### ๐ Safety - **Rate limiting**: Token bucket algorithm (default 2 req/s) - **Configurable timeout**: 30 second default - **Memory safe**: Rust's ownership system prevents crashes - **Graceful shutdown**: 2-second grace period for pending requests ### ๐ Output - **Multiple formats**: JSON, Markdown, HTML, CSV, Links, Text - **LLM-ready Markdown**: Clean content with YAML frontmatter - **Interactive HTML report**: Dashboard with graph visualization - **Stealth mode**: User-agent rotation and realistic headers - **Content filtering**: Remove nav, ads, scripts for clean data - **Real-time progress**: Updates every 5 seconds during crawl ### ๐ Monitoring - **Structured logging**: tracing with timestamps and log levels - **Progress tracking**: `[Progress] Pages: X/Y | Active jobs: Z | Errors: N` - **Detailed statistics**: Pages found, crawled, external links, errors, duration ## Installation & Se