Audio/video transcription using OpenAI Whisper. Covers installation, model selection, transcript formats (SRT, VTT, JSON), timing synchronization, and speaker diarization. Use when transcribing media or generating subtitles.
video-editing
January 23, 2026
npx add-skill https://github.com/MadAppGang/claude-code/blob/main/plugins/video-editing/skills/transcription/SKILL.md -a claude-code --skill transcription
Installation paths:
.claude/skills/transcription/

plugin: video-editing
updated: 2026-01-20

# Transcription with Whisper

Production-ready patterns for audio/video transcription using OpenAI Whisper.

## System Requirements

### Installation Options

**Option 1: OpenAI Whisper (Python)**

```bash
# macOS/Linux/Windows
pip install openai-whisper

# Verify
whisper --help
```

**Option 2: whisper.cpp (C++ - faster)**

```bash
# macOS
brew install whisper-cpp

# Linux - build from source
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && make

# Windows - use pre-built binaries or build with cmake
```

**Option 3: Insanely Fast Whisper (GPU accelerated)**

```bash
pip install insanely-fast-whisper
```

### Model Selection

| Model | Parameters | VRAM | Accuracy | Speed | Use Case |
|-------|------------|------|----------|-------|----------|
| tiny | 39M | ~1GB | Low | Fastest | Quick previews |
| base | 74M | ~1GB | Medium | Fast | Draft transcripts |
| small | 244M | ~2GB | Good | Medium | General use |
| medium | 769M | ~5GB | Better | Slow | Quality transcripts |
| large-v3 | 1550M | ~10GB | Best | Slowest | Final production |

**Recommendation:** Start with `small` for speed/quality balance. Use `large-v3` for final delivery.

## Basic Transcription

### Using OpenAI Whisper

```bash
# Basic transcription (auto-detect language)
whisper audio.mp3 --model small

# Specify language and output format
whisper audio.mp3 --model medium --language en --output_format srt

# Multiple output formats
whisper audio.mp3 --model small --output_format all

# With timestamps and word-level timing
whisper audio.mp3 --model small --word_timestamps True
```

### Using whisper.cpp

```bash
# Download model first
./models/download-ggml-model.sh base.en

# Transcribe
./main -m models/ggml-base.en.bin -f audio.wav -osrt

# With timestamps
./main -m models/ggml-base.en.bin -f audio.wav -ocsv
```

## Output Formats

### SRT (SubRip Subtitle)

```
1
00:00:01,000 --> 00:00:04,500
Hello and welcome to this video.

2
00:00:05,000 --> 00:00:0