discovering-data

Discover and explore data for a concept or domain. Use when the user asks what data exists for a topic (e.g., "ARR", "customers", "orders"), wants to find relevant tables, or needs to understand what data is available before analysis.

Marketplace: astronomer (astronomer/agents)
Plugin: data
Repository: astronomer/agents (Verified Org, 8 stars)
Path: skills/discovering-data/SKILL.md
Last verified: January 23, 2026

Install Skill

npx add-skill https://github.com/astronomer/agents/blob/main/skills/discovering-data/SKILL.md -a claude-code --skill discovering-data

Installation path (Claude): .claude/skills/discovering-data/

Instructions

# Data Exploration

Discover what data exists for a concept or domain. Answer "What data do we have about X?"

## Fast Table Validation

**When you have multiple candidate tables, quickly validate before committing to complex queries.**

### Strategy: Progressive Complexity

Start with the **simplest possible query**, then add complexity only after each step succeeds:

```
Step 1: Does the data exist?     → Simple LIMIT query, no JOINs
Step 2: How much data?           → COUNT(*) with same filters
Step 3: What are the key IDs?    → SELECT DISTINCT foreign_keys LIMIT 100
Step 4: Get related details      → JOIN on the specific IDs from step 3
```
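
As a rough sketch, steps 1 to 3 might look like the queries below. The fact table `TASK_RUNS`, its columns, and the filter value are hypothetical names chosen for illustration; substitute whatever candidate table you are validating.

```sql
-- Hypothetical table and columns (TASK_RUNS, operator_name, start_ts) for illustration only.

-- Step 1: does the data exist? Simple filter plus LIMIT, no JOINs.
SELECT deployment_id, operator_name, start_ts
FROM TASK_RUNS
WHERE operator_name = 'ExampleOperator'
LIMIT 10;

-- Step 2: how much data? Same filter, COUNT(*) only.
SELECT COUNT(*) AS matching_rows
FROM TASK_RUNS
WHERE operator_name = 'ExampleOperator';

-- Step 3: what are the key IDs? Distinct foreign keys, capped at 100.
SELECT DISTINCT deployment_id
FROM TASK_RUNS
WHERE operator_name = 'ExampleOperator'
LIMIT 100;
```

Each query reuses the same `WHERE` clause, so a failure or an empty result at any step points at the filter or the table rather than at added complexity.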

**Never jump from step 1 to complex aggregations.** If step 1 returns 50 rows, use those IDs directly:

```sql
-- After finding deployment_ids in step 1:
SELECT o.org_name, d.deployment_name
FROM DEPLOYMENTS d
JOIN ORGANIZATIONS o ON d.org_id = o.org_id
WHERE d.deployment_id IN ('id1', 'id2', 'id3')  -- IDs from step 1
```

### When a Metadata Table Returns 0 Results

If a smaller metadata/config table (like `*_LOG`, `*_CONFIG`) returns 0 results, **check the execution/fact table** before concluding data doesn't exist.

Metadata tables may have gaps or lag. The actual execution data (in tables with millions/billions of rows) is often more complete.
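
A minimal sketch of that fallback, assuming a hypothetical metadata table `FEATURE_CONFIG` and a hypothetical fact table `FEATURE_EVENTS` that share a `feature_name` column:

```sql
-- Hypothetical names: FEATURE_CONFIG (metadata), FEATURE_EVENTS (execution/fact).

-- 1) The metadata lookup that came back empty.
SELECT feature_name, enabled_at
FROM FEATURE_CONFIG
WHERE feature_name = 'example_feature'
LIMIT 10;

-- 2) Before concluding the data doesn't exist, probe the fact table
--    with the same filter, keeping the query simple (no JOINs, LIMIT applied).
SELECT feature_name, event_ts
FROM FEATURE_EVENTS
WHERE feature_name = 'example_feature'
LIMIT 10;
```

If the second query returns rows, the metadata table is incomplete or lagging, and the fact table is the better source for the question.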

### Use Row Counts as a Signal

When `list_tables` returns row counts:
- **Millions+ rows** → likely execution/fact data (actual events, transactions, runs)
- **Thousands of rows** → likely metadata/config (what's configured, not what happened)

For questions like "who is using X" or "how many times did Y happen", check high-row-count tables first: they contain actual activity data.
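
If `list_tables` is not available, the same signal can sometimes be read from the catalog. The sketch below assumes a warehouse whose `INFORMATION_SCHEMA.TABLES` view exposes a `ROW_COUNT` column (Snowflake does; other engines name it differently or omit it). The schema name and the numeric cutoffs are assumptions that approximate "millions" and "thousands".

```sql
-- Assumes INFORMATION_SCHEMA.TABLES has a ROW_COUNT column (engine-specific).
-- 'ANALYTICS' is a hypothetical schema name; thresholds are rough cutoffs.
SELECT table_name,
       row_count,
       CASE
         WHEN row_count >= 1000000 THEN 'likely execution/fact data'
         WHEN row_count >= 1000    THEN 'likely metadata/config'
         ELSE 'small: inspect manually'
       END AS likely_role
FROM INFORMATION_SCHEMA.TABLES
WHERE table_schema = 'ANALYTICS'
ORDER BY row_count DESC;
```

Sorting by `row_count` descending puts the probable fact tables at the top, which is where activity questions should start.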

⚠️ **CRITICAL: Tables with 1B+ rows require special handling**

If you see a table with billions of rows (like 6B), you MUST:
1. Use simple queries only: `SELECT col1, col2 FROM table WHERE filter LIMIT 100`
2. NO JOINs, NO GROUP BY, NO aggregations on the first query
3. Only add complexity after the simple query succeeds
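
As an illustration of the first query against a billion-row table, the sketch below contrasts a shape to avoid with a shape to start from. `TASK_RUNS`, its columns, and the date filter are hypothetical.

```sql
-- Avoid as a FIRST query on a multi-billion-row table (JOIN + GROUP BY):
--   SELECT o.org_name, COUNT(*)
--   FROM TASK_RUNS t
--   JOIN DEPLOYMENTS d ON t.deployment_id = d.deployment_id
--   JOIN ORGANIZATIONS o ON d.org_id = o.org_id
--   GROUP BY o.org_name;

-- Start here instead: explicit columns, one filter, LIMIT.
SELECT deployment_id, operator_name, start_ts
FROM TASK_RUNS                      -- hypothetical very large fact table
WHERE start_ts >= '2026-01-01'      -- hypothetical filter value
LIMIT 100;
```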
