# Playwright Web Scraper

Production-proven web scraping patterns for Playwright, built on a selector-first approach and robust error handling.

---

## Core Principles

### 1. Selector-First Approach

**Always prefer semantic locators over CSS selectors:**

```typescript
// Locators are created synchronously; await the action (click, fill, textContent), not the locator.

// ✅ BEST: Semantic locators (accessible, maintainable)
page.getByRole('button', { name: 'Submit' })
page.getByText('Welcome')
page.getByLabel('Email')

// ⚠️ ACCEPTABLE: Text patterns for dynamic content
page.locator('text=/\\$\\d+\\.\\d{2}/')

// ❌ AVOID: Brittle CSS selectors
page.locator('.btn-primary')
page.locator('#submit-button')
```

### 2. Page Text Extraction

**Critical difference between `textContent` and `innerText`:**

```typescript
// ❌ WRONG: Returns ALL text, including hidden elements and inline <script>/<style> contents
const rawText = await page.textContent("body");

// ✅ CORRECT: Returns only rendered text (what users see)
const pageText = await page.innerText("body");
```

**Use case for each:**
- `innerText("body")` - Extract visible content for regex matching
- `textContent(selector)` - Get text from specific elements

### 3. Regex Patterns for Extraction

**Handle newlines and whitespace in HTML:**

```typescript
// ❌ FAILS: `.` does not match newlines (unless the `s` flag is set)
const missed = pageText.match(/ADULT.*(\$\d+\.\d{2})/);

// ✅ WORKS: [\s\S]{0,10} matches up to 10 characters of any kind, including newlines
const match = pageText.match(/ADULT[\s\S]{0,10}(\$\d+\.\d{2})/);
```
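
The bounded-window pattern generalizes to any label. Here is a minimal sketch, assuming a hypothetical helper named `priceNearLabel` and a default window of 10 characters to mirror the pattern above; neither the name nor the parameters come from the original skill.

```typescript
// Hypothetical helper: find the first price that appears within `window`
// characters after a label, allowing newlines in between.
function priceNearLabel(pageText: string, label: string, window = 10): string | null {
  // Escape regex metacharacters so the label is matched literally.
  const escaped = label.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // [\s\S] matches any character including newlines; the lazy quantifier
  // prefers the price closest to the label.
  const pattern = new RegExp(`${escaped}[\\s\\S]{0,${window}}?(\\$\\d+\\.\\d{2})`);
  const match = pageText.match(pattern);
  return match ? match[1] : null;
}

// Usage with made-up sample text:
const sampleText = "ADULT\n\n$12.50\nCHILD\n$8.00";
console.log(priceNearLabel(sampleText, "ADULT")); // "$12.50"
console.log(priceNearLabel(sampleText, "CHILD")); // "$8.00"
```

Capping the window keeps the match from wandering to a price that belongs to a different label further down the page.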

**Common patterns:**
```typescript
// Price extraction
/\$(\d+\.\d{2})/

// Date/time
/(\d{1,2}\s+[A-Za-z]{3}\s+\d{4},\s+\d{1,2}:\d{2}[ap]m)/i

// Screen number
/Screen\s+(\d+)/i
```
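
As an illustration, the three patterns can be applied together to a block of visible page text. This is a sketch under assumptions: `Listing` and `parseListing` are hypothetical names, and the sample text is invented for the example.

```typescript
// Hypothetical result shape for one listing; fields are null when not found.
interface Listing {
  price: number | null;
  showtime: string | null;
  screen: number | null;
}

// Hypothetical helper: apply the common patterns to visible page text.
function parseListing(text: string): Listing {
  const price = text.match(/\$(\d+\.\d{2})/);
  const showtime = text.match(/(\d{1,2}\s+[A-Za-z]{3}\s+\d{4},\s+\d{1,2}:\d{2}[ap]m)/i);
  const screen = text.match(/Screen\s+(\d+)/i);
  return {
    price: price ? Number(price[1]) : null,
    showtime: showtime ? showtime[1] : null,
    screen: screen ? Number(screen[1]) : null,
  };
}

// Usage with made-up sample text:
const listing = parseListing("Screen 4\n12 Mar 2025, 7:30pm\nADULT\n$12.50");
console.log(listing); // { price: 12.5, showtime: "12 Mar 2025, 7:30pm", screen: 4 }
```

Returning `null` per field, rather than throwing, lets callers decide which missing values are fatal.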

### 4. Fallback Hierarchy

Implement a four-tier fallback for robustness:

```typescript
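// Assumes the `playwright` package; `@playwright/test` exports the same `Page` type.
import type { Page } from 'playwright';

async function extractField(page: Page, fieldName: string): Promise<string | null> {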
  // Tier 1: Primary semantic selector
  try {
    const value = await page.getByLabel(fieldName).textContent();
    if (value) return value.trim();
  } catch {}

  // Tiers 2-4 are truncated in the source; with no tier matching, return null.
  return null;
}
```