Patterns for efficient ML data pipelines using Polars, Arrow, and ClickHouse. TRIGGERS - data pipeline, polars vs pandas, arrow format, clickhouse ml, efficient loading, zero-copy, memory optimization.
February 5, 2026
npx add-skill https://github.com/terrylica/cc-skills/blob/main/plugins/devops-tools/skills/ml-data-pipeline-architecture/SKILL.md -a claude-code --skill ml-data-pipeline-architecture
Installation paths:
.claude/skills/ml-data-pipeline-architecture/
# ML Data Pipeline Architecture
Patterns for efficient ML data pipelines using Polars, Arrow, and ClickHouse.
**ADR**: [2026-01-22-polars-preference-hook](/docs/adr/2026-01-22-polars-preference-hook.md) (efficiency preferences framework)
> **Note**: A PreToolUse hook enforces Polars preference. To use Pandas, add `# polars-exception: <reason>` at file top.
## When to Use This Skill
Use this skill when:
- Deciding between Polars and Pandas for a data pipeline
- Optimizing memory usage with zero-copy Arrow patterns
- Loading data from ClickHouse into PyTorch DataLoaders
- Implementing lazy evaluation for large datasets
- Migrating existing Pandas code to Polars
---
## 1. Decision Tree: Polars vs Pandas
```
Dataset size?
├─ < 1M rows → Pandas OK (simpler API, richer ecosystem)
├─ 1M-10M rows → Consider Polars (2-5x faster, less memory)
└─ > 10M rows → Use Polars (required for memory efficiency)

Operations?
├─ Simple transforms → Either works
├─ Group-by aggregations → Polars 5-10x faster
├─ Complex joins → Polars with lazy evaluation
└─ Streaming/chunked → Polars scan_* functions

Integration?
├─ scikit-learn heavy → Pandas (better interop)
├─ PyTorch/custom → Polars + Arrow (zero-copy to tensor)
└─ ClickHouse source → Arrow stream → Polars (optimal)
```
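The tree above can be sketched as a small helper. This is a minimal illustration, not part of the skill itself: the function name `recommend_library`, the string labels for `operation` and `integration`, and the order in which the three questions are checked are all assumptions made for the example. The tree does not say how to combine its three branches.

```python
def recommend_library(n_rows: int, operation: str = "simple", integration: str = "none") -> str:
    """Return "pandas" or "polars" for a workload (hypothetical helper).

    Assumed precedence: very large data first, then integration needs,
    then operation type, then dataset size.
    """
    # > 10M rows: Polars is required for memory efficiency.
    if n_rows > 10_000_000:
        return "polars"
    # scikit-learn heavy pipelines: Pandas has better interop.
    if integration == "sklearn":
        return "pandas"
    # PyTorch or ClickHouse sources benefit from the Arrow path.
    if integration in {"pytorch", "clickhouse"}:
        return "polars"
    # Group-bys, complex joins, and streaming favor Polars.
    if operation in {"groupby", "join", "streaming"}:
        return "polars"
    # Simple transforms: decide on size alone.
    return "pandas" if n_rows < 1_000_000 else "polars"


small_simple = recommend_library(500_000)
mid_sklearn = recommend_library(2_000_000, integration="sklearn")
huge_sklearn = recommend_library(50_000_000, integration="sklearn")
```

With this precedence, a 50M-row scikit-learn workload still resolves to Polars, on the assumption that memory limits outweigh interop convenience. Reorder the checks if your priorities differ.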
---
## 2. Zero-Copy Pipeline Architecture
### The Problem with Pandas
```python
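# BAD: row-wise fetch plus up to 2 full copies
import pandas as pd
import torch

df = pd.read_sql(query, conn)   # Copy 1: DB rows → Python objects → pandas
X = df[features].values         # Copy 2: pandas → numpy (copies when columns span several blocks or dtypes)
tensor = torch.from_numpy(X)    # No copy: the tensor shares memory with X
# Peak memory: roughly 2x data size, plus transient row objects from the fetch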
```
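The difference between a copy and a zero-copy view can be shown with the standard library alone. The sketch below is an analogy, not the Pandas or Arrow code path: `array` and `memoryview` stand in for a column buffer and a view over it, and the data values are made up for illustration.

```python
from array import array

# Stand-in for one float64 feature column (hypothetical values).
column = array("d", [1.0, 2.0, 3.0, 4.0])

# Zero-copy: a memoryview exposes the same underlying buffer.
view = memoryview(column)

# Copy: constructing a new array allocates and fills a second buffer.
copied = array("d", column)

# Mutate the original, then check which object observes the change.
column[0] = 99.0

view_sees_change = view[0] == 99.0    # the view shares the buffer
copy_sees_change = copied[0] == 99.0  # the copy is independent
```

Every copy step in a pipeline behaves like `copied`: a second full-size buffer exists while the first is still alive, which is what drives peak memory up.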
### The Solution with Arrow
```python
# GOOD: 0-1 memory copies
import clickhouse_connect
import polars as pl
import torch
client = clickhouse_connect.get_client(...)
arrow_table = client.query_arrow("SELECT * FROM bars") # PyArrow Table in client memory (columnar, no per-row Python objects)
df = pl.from_arrow(arrow_table) # Zero-copy view
X = df.select(features).to_numpy() # Single allocation
tensor