parquet-coder

verified

Columnar file patterns including partitioning, predicate pushdown, and schema evolution.

Marketplace: majestic-marketplace
Plugin: majestic-data
Repository: majesticlabs-dev/majestic-marketplace (19 stars)
Skill file: plugins/majestic-data/skills/parquet-coder/SKILL.md
Last Verified: January 24, 2026

Install command:
npx add-skill https://github.com/majesticlabs-dev/majestic-marketplace/blob/main/plugins/majestic-data/skills/parquet-coder/SKILL.md -a claude-code --skill parquet-coder

Installation path (Claude): .claude/skills/parquet-coder/

Instructions

# Parquet-Coder

Patterns for efficient columnar data storage with Parquet.

## Basic Operations

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Write with compression
df.to_parquet('data.parquet', compression='snappy', index=False)

# Common compression options:
# - snappy: Fast, good compression (default)
# - gzip: Slower, better compression
# - zstd: Best balance of speed/compression
# - None: No compression (fastest writes)

# Read entire file
df = pd.read_parquet('data.parquet')

# Read specific columns only (column pruning / projection pushdown)
df = pd.read_parquet('data.parquet', columns=['id', 'name', 'value'])
```
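
The codec comment above is where most of the size/speed trade-off lives. A minimal sketch (file name and toy data are illustrative) of writing with zstd at an explicit level and then confirming the codec and row-group layout from the file's metadata:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'id': range(1_000), 'value': [1.5] * 1_000})  # toy data
table = pa.Table.from_pandas(df, preserve_index=False)

# zstd with a higher level: smaller files, slower writes
pq.write_table(table, 'data_zstd.parquet', compression='zstd', compression_level=9)

# Confirm what was actually written
meta = pq.ParquetFile('data_zstd.parquet').metadata
print(meta.num_rows, meta.num_row_groups)
print(meta.row_group(0).column(0).compression)  # ZSTD
```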

## PyArrow for Large Files

```python
# Read as PyArrow Table (more memory efficient)
table = pq.read_table('data.parquet')

# Convert to pandas when needed
df = table.to_pandas()

# Filter while reading (predicate pushdown: row groups are skipped using column statistics)
table = pq.read_table(
    'data.parquet',
    filters=[
        ('date', '>=', '2024-01-01'),
        ('status', '=', 'active')
    ]
)

# Read in batches for huge files
parquet_file = pq.ParquetFile('huge.parquet')
for batch in parquet_file.iter_batches(batch_size=100_000):
    df_batch = batch.to_pandas()
    process(df_batch)
```
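
Row-group filtering and batched reads only pay off if the file has more than one row group with useful column statistics. A sketch, with assumed column names and sizes, of controlling row-group size at write time and inspecting the min/max statistics that `filters=` relies on:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Toy table standing in for a large one (column names are illustrative)
table = pa.table({
    'date': ['2024-01-01', '2024-06-01'] * 500_000,
    'value': [1.0, 2.0] * 500_000,
})

# Smaller row groups -> finer-grained skipping, at the cost of more metadata
pq.write_table(table, 'huge.parquet', row_group_size=250_000, compression='zstd')

# Each row group stores per-column min/max statistics; read_table(..., filters=...)
# uses them to skip whole row groups without reading their data pages
pf = pq.ParquetFile('huge.parquet')
for i in range(pf.metadata.num_row_groups):
    stats = pf.metadata.row_group(i).column(0).statistics
    print(i, stats.min, stats.max)
```

For real data, sort or cluster by the filter column before writing so the row-group ranges don't all overlap.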

## Partitioned Datasets

```python
# Write partitioned by columns
df.to_parquet(
    'data/',
    partition_cols=['year', 'month'],
    compression='snappy'
)
# Creates: data/year=2024/month=01/part-0.parquet

# Read partitioned dataset
df = pd.read_parquet('data/')  # Reads all partitions

# Read specific partitions only
df = pd.read_parquet('data/year=2024/')

# With PyArrow dataset API (more control)
import pyarrow.dataset as ds

dataset = ds.dataset('data/', format='parquet', partitioning='hive')

# Filter on partition columns (very fast)
table = dataset.to_table(
    filter=(ds.field('year') == 2024) & (ds.field('month') >= 6)
)
```
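
For datasets that don't fit in memory even after filtering, the dataset API can stream batches instead of materializing one table. A sketch reusing the hive-partitioned layout above (the projected column names and the `process()` helper are assumptions carried over from the earlier examples):

```python
import pyarrow.dataset as ds

dataset = ds.dataset('data/', format='parquet', partitioning='hive')

# Partition pruning and column projection happen before rows are decoded
scanner = dataset.scanner(
    columns=['id', 'value', 'month'],
    filter=(ds.field('year') == 2024) & (ds.field('month') >= 6),
    batch_size=100_000,
)
for batch in scanner.to_batches():
    process(batch.to_pandas())
```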

## Schema Definition

```python
# Explicit schema for consistency
schema = pa.schema([
    ('id', pa.int64()),
    ('name', pa.string()),
    ('value', pa.float64()),   # illustrative extra field; adjust to your data
])

# Cast to the schema when writing (preserve_index=False drops the pandas index)
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
pq.write_table(table, 'data.parquet')
```
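
When newer files add columns that older files lack (schema evolution), the dataset API can scan them against one unified schema. A sketch assuming hypothetical `v1.parquet`/`v2.parquet` files with compatible column types:

```python
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

files = ['v1.parquet', 'v2.parquet']   # hypothetical old and new files

# Merge the per-file schemas: new columns are appended, shared ones must agree
unified = pa.unify_schemas([pq.read_schema(f) for f in files])

# Scan everything against the unified schema; columns missing from a file
# are filled with nulls for that file's rows
dataset = ds.dataset(files, schema=unified, format='parquet')
df = dataset.to_table().to_pandas()
```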