dataset-processing (verified)

Use this skill when processing large-scale ML datasets. Covers data loading, preprocessing, augmentation, multimodal data handling, and streaming/sharding techniques.

Marketplace: everything-claude-code
Plugin: everything-claude-code (workflow)
Repository: yxbian23/ai-research-claude-code
Path: skills/dataset-processing/SKILL.md

Last Verified: January 25, 2026

Install with the add-skill CLI:

npx add-skill https://github.com/yxbian23/ai-research-claude-code/blob/main/skills/dataset-processing/SKILL.md -a claude-code --skill dataset-processing

Installation path (Claude Code): .claude/skills/dataset-processing/

Instructions

# Dataset Processing

This skill provides comprehensive guidance for processing and managing large-scale machine learning datasets.

## When to Activate

- Loading and preprocessing large datasets
- Creating custom data pipelines
- Implementing data augmentation
- Processing multimodal data (image+text)
- Setting up distributed data loading

## Data Loading Patterns

### Basic PyTorch DataLoader

```python
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, data_path: str, transform=None):
        # _load_data is a placeholder: implement it for your file format
        self.data = self._load_data(data_path)
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        if self.transform:
            item = self.transform(item)
        return item

dataset = CustomDataset("data/train.json")  # example path

# DataLoader tuned for throughput
dataloader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,            # parallel loading processes
    pin_memory=True,          # page-locked memory speeds host-to-GPU copies
    prefetch_factor=2,        # batches prefetched per worker (needs num_workers > 0)
    persistent_workers=True,  # keep workers alive across epochs
)
```
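
A common extension is a custom `collate_fn` for variable-length items, which the default collation cannot stack into a single tensor. A minimal sketch, assuming each dataset item is a `(1-D sequence tensor, integer label)` pair; the names here are illustrative, not prescribed by this skill:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def pad_collate(batch):
    # batch: list of (sequence, label) pairs (an assumption about the dataset)
    sequences, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in sequences])
    padded = pad_sequence(sequences, batch_first=True, padding_value=0)
    return padded, lengths, torch.tensor(labels)

dataloader = DataLoader(dataset, batch_size=32, collate_fn=pad_collate)
```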

### HuggingFace Datasets

```python
from datasets import load_dataset, Dataset, DatasetDict

# Load from Hub
dataset = load_dataset("imagenet-1k", split="train")

# Load from local files
dataset = load_dataset("json", data_files="data.jsonl")
dataset = load_dataset("csv", data_files="data.csv")
dataset = load_dataset("parquet", data_files="data.parquet")

# Load from folder structure
dataset = load_dataset("imagefolder", data_dir="images/")

# Create from pandas
import pandas as pd
df = pd.read_csv("data.csv")
dataset = Dataset.from_pandas(df)
```
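
Preprocessing typically goes through `Dataset.map`, and very large corpora can be read lazily with `streaming=True`. A minimal sketch; the tokenizer, dataset name, and `"text"` column are illustrative assumptions:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed model

def preprocess(batch):
    # Batched tokenization; assumes the dataset has a "text" column
    return tokenizer(batch["text"], truncation=True, max_length=512)

# Parallel, batched preprocessing with automatic on-disk caching
dataset = dataset.map(preprocess, batched=True, num_proc=4)

# Streaming mode returns an IterableDataset and avoids a full download
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
for example in stream.take(5):
    print(example["text"][:80])
```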

### WebDataset for Large-Scale Data

```python
import webdataset as wds

# Streaming pipeline over sharded tar files
dataset = (
    wds.WebDataset("data/shard-{000000..000999}.tar")
    .shuffle(1000)                 # shuffle buffer of 1000 samples
    .decode("pil")                 # decode images to PIL
    .to_tuple("jpg", "json")       # pick the image and metadata entries
    .map_tuple(transform_image, transform_label)  # user-defined transforms
    .batched(32)                   # batch inside the pipeline
)

# batch_size=None: batching already happened in the pipeline above
dataloader = wds.WebLoader(dataset, batch_size=None, num_workers=4)
```
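
For distributed training, shards can be partitioned across ranks so each process reads a disjoint subset. A sketch using webdataset's splitter helper; treat the exact argument names as assumptions and check them against your installed version, since the pipeline API has changed between releases:

```python
import webdataset as wds

# Each DDP rank receives a different subset of shards; worker-level
# splitting within a rank is handled by webdataset's defaults.
dataset = (
    wds.WebDataset(
        "data/shard-{000000..000999}.tar",
        nodesplitter=wds.split_by_node,
    )
    .shuffle(1000)
    .decode("pil")
    .to_tuple("jpg", "json")
)
```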
