Columnar file patterns including partitioning, predicate pushdown, and schema evolution.
# Parquet-Coder
Patterns for efficient columnar data storage with Parquet.
## Basic Operations
```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
# Example DataFrame to write (replace with your own data)
df = pd.DataFrame({'id': [1, 2], 'name': ['a', 'b'], 'value': [1.5, 2.5]})
# Write with compression
df.to_parquet('data.parquet', compression='snappy', index=False)
# Common compression options:
# - snappy: Fast, good compression (default)
# - gzip: Slower, better compression
# - zstd: Best balance of speed/compression
# - None: No compression (fastest writes)
# Read entire file
df = pd.read_parquet('data.parquet')
# Read only specific columns (column pruning / projection pushdown)
df = pd.read_parquet('data.parquet', columns=['id', 'name', 'value'])
```
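If you need an explicit compression level or row group size, one option is to drop down to `pyarrow.parquet.write_table` instead of `DataFrame.to_parquet`. A minimal sketch (the path, level, and row group size below are arbitrary examples):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'id': [1, 2, 3], 'value': [1.5, 2.5, 3.5]})  # example data

# Convert to an Arrow table and write with an explicit zstd level;
# smaller row groups allow finer-grained pruning on read at some size cost.
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_table(
    table,
    'data_zstd.parquet',
    compression='zstd',
    compression_level=3,      # assumed moderate level; tune for your data
    row_group_size=500_000
)
```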
## PyArrow for Large Files
```python
# Read as PyArrow Table (more memory efficient)
table = pq.read_table('data.parquet')
# Convert to pandas when needed
df = table.to_pandas()
# Filter rows while reading (predicate pushdown skips non-matching row groups)
table = pq.read_table(
    'data.parquet',
    filters=[
        ('date', '>=', '2024-01-01'),
        ('status', '=', 'active')
    ]
)
# Read in batches for huge files
parquet_file = pq.ParquetFile('huge.parquet')
for batch in parquet_file.iter_batches(batch_size=100_000):
    df_batch = batch.to_pandas()
    process(df_batch)  # placeholder for your per-batch processing
```
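The batch reader pairs naturally with `pyarrow.parquet.ParquetWriter`, which appends one table at a time so a large file can be rewritten without ever loading it whole. A minimal sketch that recompresses the `huge.parquet` file from the example above (the zstd output path is an arbitrary example):

```python
import pyarrow as pa
import pyarrow.parquet as pq

source = pq.ParquetFile('huge.parquet')

# Open a writer with the source's Arrow schema, then stream batches through it
with pq.ParquetWriter('huge_zstd.parquet', source.schema_arrow, compression='zstd') as writer:
    for batch in source.iter_batches(batch_size=100_000):
        writer.write_table(pa.Table.from_batches([batch]))
```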
## Partitioned Datasets
```python
# Write partitioned by columns
df.to_parquet(
    'data/',
    partition_cols=['year', 'month'],
    compression='snappy'
)
# Creates: data/year=2024/month=01/part-0.parquet
# Read partitioned dataset
df = pd.read_parquet('data/') # Reads all partitions
# Read specific partitions only
df = pd.read_parquet('data/year=2024/')
# With PyArrow dataset API (more control)
import pyarrow.dataset as ds
dataset = ds.dataset('data/', format='parquet', partitioning='hive')
# Filter on partition columns (very fast)
table = dataset.to_table(
    filter=(ds.field('year') == 2024) & (ds.field('month') >= 6)
)
```
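The dataset API can also write partitioned data directly from an Arrow table, which avoids the pandas round-trip. A minimal sketch with hive-style directory names (the table contents and output path are example values):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Small example table containing the partition columns
table = pa.table({
    'year': [2024, 2024, 2025],
    'month': [1, 6, 2],
    'value': [10.0, 20.0, 30.0],
})

# Writes data/year=2024/month=1/... style directories
ds.write_dataset(
    table,
    'data/',
    format='parquet',
    partitioning=['year', 'month'],
    partitioning_flavor='hive',
    existing_data_behavior='overwrite_or_ignore'  # assumption: keep unrelated existing files
)
```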
## Schema Definition
```python
# Explicit schema for consistency
schema = pa.schema([
    ('id', pa.int64()),
    ('name', pa.string()),