Trace upstream data lineage. Use when the user asks where data comes from, what feeds a table, upstream dependencies, data sources, or needs to understand data origins.
# Upstream Lineage: Sources
Trace the origins of data - answer "Where does this data come from?"
## Lineage Investigation
### Step 1: Identify the Target Type
Determine what we're tracing:
- **Table**: Trace what populates this table
- **Column**: Trace where this specific column comes from
- **DAG**: Trace what data sources this DAG reads from
### Step 2: Find the Producing DAG
Tables are typically populated by Airflow DAGs. Find the connection:
1. **Search DAGs by name**: Use `list_dags` and look for DAG names matching the table name
   - `load_customers` -> `customers` table
   - `etl_daily_orders` -> `orders` table
2. **Explore DAG source code**: Use `get_dag_source` to read the DAG definition
   - Look for INSERT, MERGE, CREATE TABLE statements
   - Find the target table in the code
3. **Check DAG tasks**: Use `list_tasks` to see what operations the DAG performs
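
The name-matching heuristic from item 1 can be sketched in a few lines. This is a minimal illustration, assuming the DAG ids returned by `list_dags` are already available as a list of strings; the helper name `candidate_dags` is hypothetical and not part of any Airflow or skill API.

```python
import re


def candidate_dags(table: str, dag_ids: list[str]) -> list[str]:
    """Return DAG ids that contain the table name as a whole underscore-delimited token.

    Hypothetical helper: a name match is only a hint, so confirm each
    candidate by reading the DAG source.
    """
    # Drop the schema qualifier: "analytics.orders" -> "orders"
    name = table.split(".")[-1].lower()
    # Require the name to sit between underscores or at either end of the DAG id
    pattern = re.compile(rf"(^|_){re.escape(name)}(_|$)")
    return [dag_id for dag_id in dag_ids if pattern.search(dag_id.lower())]


dag_ids = ["load_customers", "etl_daily_orders", "load_customer_segments"]
print(candidate_dags("customers", dag_ids))
print(candidate_dags("analytics.orders", dag_ids))
```

Matching on whole tokens keeps `load_customer_segments` from being reported for the `customers` table. DAGs whose names do not mention the table at all will be missed, which is why steps 2 and 3 still matter.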
### Step 3: Trace Data Sources
From the DAG code, identify source tables and systems:
**SQL Sources** (look for FROM clauses):
```sql
-- In DAG code:
SELECT * FROM source_schema.source_table  -- <- This is an upstream source
```
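
As a rough way to apply this to a whole DAG file, the sketch below pulls out the identifiers that follow `FROM` or `JOIN`. It assumes the text returned by `get_dag_source` is held in a plain string; the function name `find_sql_sources` is hypothetical. A regular expression is not a SQL parser, so CTE names, templated table names, and the word "from" in comments can produce false hits that need a manual check.

```python
import re

# Identifier that follows FROM or JOIN, e.g. "raw.orders"
_SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)
# Python import lines ("from x import y") would otherwise look like SQL FROM clauses
_IMPORT_RE = re.compile(r"^\s*from\s+[\w.]+\s+import\b")


def find_sql_sources(dag_source: str) -> list[str]:
    """Return table names referenced after FROM/JOIN, de-duplicated, in order of appearance."""
    # Remove Python import statements before scanning for SQL
    text = "\n".join(
        line for line in dag_source.splitlines() if not _IMPORT_RE.match(line)
    )
    sources: list[str] = []
    for match in _SOURCE_RE.finditer(text):
        table = match.group(1)
        if table not in sources:
            sources.append(table)
    return sources
```

Subqueries such as `FROM (SELECT ...)` are skipped because the pattern only accepts an identifier directly after the keyword; the tables inside the subquery are still picked up by their own `FROM` clauses.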
**External Sources** (look for connection references):
- `S3Operator` -> S3 bucket source
- `PostgresOperator` -> Postgres database source
- `SalesforceOperator` -> Salesforce API source
- `HttpOperator` -> REST API source
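
The operator-to-source mapping above can be turned into a simple scan of the DAG source. This is a sketch under stated assumptions: the fragments in `OPERATOR_HINTS` mirror the list above, matching on `Hook` classes as well as `Operator` classes is an added assumption, and the function name `find_external_sources` is hypothetical.

```python
import re

# Class-name fragments mapped to the kind of source they suggest.
# Mirrors the list above; extend it for the providers your DAGs use.
OPERATOR_HINTS = {
    "S3": "S3 bucket",
    "Postgres": "Postgres database",
    "Salesforce": "Salesforce API",
    "Http": "REST API",
}

# Any identifier ending in Operator or Hook (including Hook is an assumption)
_CLASS_RE = re.compile(r"\b([A-Za-z0-9_]*(?:Operator|Hook))\b")


def find_external_sources(dag_source: str) -> dict[str, str]:
    """Map each operator/hook class found in the source to a likely source kind."""
    found: dict[str, str] = {}
    for class_name in _CLASS_RE.findall(dag_source):
        for fragment, source_kind in OPERATOR_HINTS.items():
            if fragment in class_name:
                found[class_name] = source_kind
                break
    return found
```

A class name only shows that the DAG talks to a system, not the direction of data flow: a transfer operator with `S3` in its name may write to S3 rather than read from it, so check the task arguments and connection ids before recording it as an upstream source.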
**File Sources**:
- CSV/Parquet files in object storage
- SFTP drops
- Local file paths
### Step 4: Build the Lineage Chain
Recursively trace each source:
```
TARGET: analytics.orders_daily
    ^
    +-- DAG: etl_daily_orders
        ^
        +-- SOURCE: raw.orders (table)
        |       ^
        |       +-- DAG: ingest_orders
        |               ^
        |               +-- SOURCE: Salesforce API (external)
        |
        +-- SOURCE: dim.customers (table)
                ^
                +-- DAG: load_customers
                        ^
                        +-- SOURCE: PostgreSQL