Production-proven Playwright web scraping patterns with a selector-first approach and robust error handling. Use when users need to build web scrapers, extract data from websites, automate browser interactions, or ask about Playwright selectors, text extraction (innerText vs textContent), regex patterns for HTML, fallback hierarchies, or scraping best practices.
View on GitHub: nathanvale/side-quest-marketplace
scraper-toolkit
February 4, 2026
npx add-skill https://github.com/nathanvale/side-quest-marketplace/blob/main/plugins/scraper-toolkit/skills/playwright-scraper/SKILL.md -a claude-code --skill playwright-scraper
Installation paths:
.claude/skills/playwright-scraper/
# Playwright Web Scraper
Production-proven web scraping patterns using Playwright, with a selector-first approach and robust error handling.
---
## Core Principles
### 1. Selector-First Approach
**Always prefer semantic locators over CSS selectors:**
```typescript
// ✅ BEST: Semantic locators (accessible, maintainable)
// Locators are lazy: creating one needs no `await`; only actions and reads on it do.
page.getByRole('button', { name: 'Submit' })
page.getByText('Welcome')
page.getByLabel('Email')
// ⚠️ ACCEPTABLE: Text patterns for dynamic content
page.locator('text=/\\$\\d+\\.\\d{2}/')
// ❌ AVOID: Brittle CSS selectors
page.locator('.btn-primary')
page.locator('#submit-button')
```
### 2. Page Text Extraction
**Critical difference between `textContent` and `innerText`:**
```typescript
// ❌ WRONG for visible text: returns ALL text nodes, including hidden elements and <script>/<style> contents
const rawText = await page.textContent("body");
// ✅ CORRECT: Returns only rendered text (roughly what users see)
const pageText = await page.innerText("body");
```
**Use case for each:**
- `innerText("body")` - Extract visible content for regex matching
- `textContent(selector)` - Get text from specific elements
### 3. Regex Patterns for Extraction
**Handle newlines and whitespace in HTML:**
```typescript
// ❌ FAILS: `.` doesn't match newlines, so .* stops at the end of the line
const missed = pageText.match(/ADULT.*(\$\d+\.\d{2})/);
// ✅ WORKS: [\s\S]{0,10} matches up to 10 characters of any kind, including newlines
const match = pageText.match(/ADULT[\s\S]{0,10}(\$\d+\.\d{2})/);
```
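To make the newline behaviour concrete, here is a minimal sketch that runs both styles of pattern against a hypothetical two-line snippet of visible text. The `sample` string and the variable names are invented for illustration and do not come from a real page.

```typescript
// Hypothetical visible text where the label and price sit on separate lines.
const sample = "ADULT\n$12.50";

// `.` stops at line breaks, so the price on the next line is never reached.
const dotOnly = sample.match(/ADULT.*(\$\d+\.\d{2})/);

// `[\s\S]` matches any character, newlines included; {0,10} bounds the gap
// so the pattern cannot wander off to an unrelated price further down.
const anyChar = sample.match(/ADULT[\s\S]{0,10}(\$\d+\.\d{2})/);

console.log(dotOnly);      // null
console.log(anyChar?.[1]); // "$12.50"
```

On runtimes that support it, the `s` (dotAll) regex flag is another way to let `.` cross newlines; the bounded quantifier is still worth keeping either way.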
**Common patterns:**
```typescript
// Price extraction
/\$(\d+\.\d{2})/
// Date/time
/(\d{1,2}\s+[A-Za-z]{3}\s+\d{4},\s+\d{1,2}:\d{2}[ap]m)/i
// Screen number
/Screen\s+(\d+)/i
```
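As a rough sketch of how these patterns might be combined, the helper below applies all three to one block of visible text. The `Listing` shape, the `parseListing` name, and the sample text are assumptions made for illustration, not part of the skill itself.

```typescript
// Hypothetical result shape; field names are illustrative only.
interface Listing {
  price: number | null;
  dateTime: string | null;
  screen: number | null;
}

// Applies the three patterns above; any field that does not match becomes null.
function parseListing(text: string): Listing {
  const price = text.match(/\$(\d+\.\d{2})/);
  const dateTime = text.match(/(\d{1,2}\s+[A-Za-z]{3}\s+\d{4},\s+\d{1,2}:\d{2}[ap]m)/i);
  const screen = text.match(/Screen\s+(\d+)/i);
  return {
    price: price ? Number(price[1]) : null,
    dateTime: dateTime ? dateTime[1] : null,
    screen: screen ? Number(screen[1]) : null,
  };
}

// Invented sample text standing in for the result of innerText("body").
const listing = parseListing("Screen 4\n12 Feb 2026, 7:30pm\nADULT\n$21.50");
console.log(listing);
// → { price: 21.5, dateTime: "12 Feb 2026, 7:30pm", screen: 4 }
```

Returning `null` per field, rather than throwing, lets the caller decide which missing values are fatal and which can be handled by a fallback.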
### 4. Fallback Hierarchy
Implement a four-tier fallback for robustness:
```typescript
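import type { Page } from 'playwright'; // or '@playwright/test'

async function extractField(page: Page, fieldName: string): Promise<string | null> {
  // Tier 1: Primary semantic selector
  try {
    const value = await page.getByLabel(fieldName).textContent();
    if (value) return value.trim();
  } catch {
    // Locator missing or timed out: fall through to the next tier
  }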
//