Use this skill when processing large-scale ML datasets. Covers data loading, preprocessing, augmentation, multimodal data handling, and streaming/sharding techniques.
# Dataset Processing
This skill provides comprehensive guidance for processing and managing large-scale machine learning datasets.
## When to Activate
- Loading and preprocessing large datasets
- Creating custom data pipelines
- Implementing data augmentation
- Processing multimodal data (image+text)
- Setting up distributed data loading
## Data Loading Patterns
### Basic PyTorch DataLoader
```python
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, data_path: str, transform=None):
        self.data = self._load_data(data_path)  # _load_data: user-defined loading logic
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        if self.transform:
            item = self.transform(item)
        return item

# DataLoader with settings tuned for GPU training
dataloader = DataLoader(
    dataset,                  # an instance such as CustomDataset(...)
    batch_size=32,
    shuffle=True,
    num_workers=4,            # parallel worker processes for loading
    pin_memory=True,          # page-locked memory speeds CPU-to-GPU copies
    prefetch_factor=2,        # batches prefetched per worker
    persistent_workers=True,  # keep workers alive across epochs
)
```
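For variable-length items such as tokenized text, the default collation fails because tensors in a batch must share a shape; a custom `collate_fn` can pad each batch instead. A minimal sketch, assuming each dataset item is a 1-D tensor (`pad_batch` is an illustrative name, not part of this skill):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def pad_batch(batch):
    # Pad a batch of variable-length 1-D tensors to the batch's max length
    lengths = torch.tensor([len(x) for x in batch])
    padded = pad_sequence(batch, batch_first=True, padding_value=0)
    return padded, lengths

dataloader = DataLoader(dataset, batch_size=32, collate_fn=pad_batch)
```

Padding per batch (rather than to a global max length) keeps wasted computation proportional to the length spread within each batch.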
### HuggingFace Datasets
```python
from datasets import load_dataset, Dataset, DatasetDict
# Load from Hub
dataset = load_dataset("imagenet-1k", split="train")
# Load from local files
dataset = load_dataset("json", data_files="data.jsonl")
dataset = load_dataset("csv", data_files="data.csv")
dataset = load_dataset("parquet", data_files="data.parquet")
# Load from folder structure
dataset = load_dataset("imagefolder", data_dir="images/")
# Create from pandas
import pandas as pd
df = pd.read_csv("data.csv")
dataset = Dataset.from_pandas(df)
```
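Preprocessing typically goes through `Dataset.map`; batched mapping with multiple processes is much faster than per-example calls, and `streaming=True` iterates over a dataset without downloading it in full. A minimal sketch — the `tokenizer` and the dataset id used for streaming are illustrative assumptions:

```python
# Batched, multiprocess preprocessing (tokenizer is assumed to exist)
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True, num_proc=4, remove_columns=["text"])

# Streaming mode: yields examples lazily instead of materializing on disk
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
for example in stream.take(5):
    print(example["text"][:80])
```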
### WebDataset for Large-Scale Data
```python
import webdataset as wds

# Build a pipeline over sharded tar files (brace expansion selects shards)
dataset = (
    wds.WebDataset("data/shard-{000000..000999}.tar")
    .shuffle(1000)                # shuffle within a 1000-sample buffer
    .decode("pil")                # decode images to PIL
    .to_tuple("jpg", "json")      # select (image, metadata) fields
    .map_tuple(transform_image, transform_label)  # user-supplied transforms
    .batched(32)                  # batch inside the pipeline
)

# Use with a DataLoader; batch_size=None since batching already happened above
dataloader = wds.WebLoader(dataset, batch_size=None, num_workers=4)
```
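To produce such shards in the first place, `webdataset.ShardWriter` writes sample dictionaries into fixed-size tar files. A minimal sketch, assuming an iterable `samples` of raw JPEG bytes and labels (the key and field names are illustrative):

```python
import webdataset as wds

# Write samples into tar shards of at most 10,000 samples each
with wds.ShardWriter("data/shard-%06d.tar", maxcount=10000) as sink:
    for i, (image_bytes, label) in enumerate(samples):  # samples is assumed
        sink.write({
            "__key__": f"sample{i:08d}",  # unique key shared by a sample's files
            "jpg": image_bytes,           # raw JPEG bytes, stored as key.jpg
            "json": {"label": label},     # dict is serialized to key.json
        })
```

Field names in the writer (`jpg`, `json`) must match the extensions selected later by `.to_tuple("jpg", "json")`.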
## Data