Discover and explore data for a concept or domain. Use when the user asks what data exists for a topic (e.g., "ARR", "customers", "orders"), wants to find relevant tables, or needs to understand what data is available before analysis.
# Data Exploration
Discover what data exists for a concept or domain. Answer "What data do we have about X?"
## Fast Table Validation
**When you have multiple candidate tables, quickly validate before committing to complex queries.**
### Strategy: Progressive Complexity
Start with the **simplest possible query**, then add complexity only after each step succeeds:
```
Step 1: Does the data exist? → Simple LIMIT query, no JOINs
Step 2: How much data? → COUNT(*) with same filters
Step 3: What are the key IDs? → SELECT DISTINCT foreign_keys LIMIT 100
Step 4: Get related details → JOIN on the specific IDs from step 3
```
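As a concrete sketch of steps 1–3, reusing the `DEPLOYMENTS` table from the example below (the `created_at` column and date filter are illustrative assumptions):

```sql
-- Step 1: Does the data exist? Simple filtered sample, no JOINs.
SELECT deployment_id, org_id, created_at   -- created_at is an assumed column
FROM DEPLOYMENTS
WHERE created_at >= '2024-01-01'
LIMIT 10;

-- Step 2: How much data matches the same filter?
SELECT COUNT(*)
FROM DEPLOYMENTS
WHERE created_at >= '2024-01-01';

-- Step 3: Which foreign keys are involved?
SELECT DISTINCT org_id
FROM DEPLOYMENTS
WHERE created_at >= '2024-01-01'
LIMIT 100;
```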
**Never jump from step 1 to complex aggregations.** If step 1 returns 50 rows, use those IDs directly:
```sql
-- After finding deployment_ids in step 1:
SELECT o.org_name, d.deployment_name
FROM DEPLOYMENTS d
JOIN ORGANIZATIONS o ON d.org_id = o.org_id
WHERE d.deployment_id IN ('id1', 'id2', 'id3') -- IDs from step 1
```
### When a Metadata Table Returns 0 Results
If a smaller metadata/config table (like `*_LOG`, `*_CONFIG`) returns 0 results, **check the execution/fact table** before concluding data doesn't exist.
Metadata tables may have gaps or lag. The actual execution data (in tables with millions/billions of rows) is often more complete.
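A minimal sketch of that fallback (all table and column names here are hypothetical): if a `*_CONFIG` table returns nothing, sample the execution table directly before reporting that no data exists:

```sql
-- FEATURE_CONFIG returned 0 rows; check the large execution table directly.
-- TASK_RUNS and feature_name are illustrative names.
SELECT run_id, org_id, started_at
FROM TASK_RUNS
WHERE feature_name = 'X'
LIMIT 10;
```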
### Use Row Counts as a Signal
When `list_tables` returns row counts:
- **Millions+ rows** → likely execution/fact data (actual events, transactions, runs)
- **Thousands of rows** → likely metadata/config (what's configured, not what happened)
For questions like "who is using X" or "how many times did Y happen", prioritize high-row-count tables first - they contain actual activity data.
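For instance, to answer "how many times did Y happen", a simple filtered count against the high-row-count execution table is usually the right first query (table and column names below are illustrative assumptions):

```sql
-- Hypothetical execution table and columns.
SELECT COUNT(*) AS y_events
FROM TASK_RUNS
WHERE event_type = 'Y'
  AND started_at >= '2024-01-01';
```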
⚠️ **CRITICAL: Tables with 1B+ rows require special handling**
If you see a table with billions of rows (like 6B), you MUST:
1. Use simple queries only: `SELECT col1, col2 FROM table WHERE filter LIMIT 100`
2. NO JOINs, NO GROUP BY, NO aggregations on the first query
3. Only add complexity after the simple query succeeds
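A sketch of that progression against a billion-row table (table and column names are illustrative; `DEPLOYMENTS` is reused from the earlier example):

```sql
-- First query on a ~6B-row table: narrow filter, explicit columns,
-- small LIMIT, no JOINs or aggregations.
SELECT run_id, deployment_id, state
FROM TASK_RUNS
WHERE started_at >= '2024-06-01'
  AND state = 'failed'
LIMIT 100;

-- Only after that succeeds, join on the specific IDs it returned:
SELECT d.deployment_name, t.run_id, t.state
FROM TASK_RUNS t
JOIN DEPLOYMENTS d ON t.deployment_id = d.deployment_id
WHERE t.run_id IN ('r1', 'r2', 'r3');   -- IDs from the first query
```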