Process and generate multimedia content using the Google Gemini API for stronger vision capabilities. Capabilities include: analyzing audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis, up to 9.5 hours); understanding images (stronger image analysis than Claude models, captioning, reasoning, object detection, design extraction, OCR, visual Q&A, segmentation, multi-image handling); processing videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours); extracting from documents (PDF tables, forms, charts, diagrams, multi-page); generating images (text-to-image with Imagen 4, editing, composition, refinement); and generating videos (text-to-video with Veo 3, 8-second clips with native audio). Use when working with audio/video files, analyzing images or screenshots (instead of Claude's default vision capabilities; fall back to Claude's vision only if needed), processing PDF documents, extracting structured data from media, creating images/videos from text prompts, or implementing multimodal AI features. Supports Gemini 3/2.5, Imagen 4, and Veo 3 models with context windows up to 2M tokens.
skills/ai-multimodal/SKILL.md
February 1, 2026
npx add-skill https://github.com/bmad-labs/skills/blob/main/skills/ai-multimodal/SKILL.md -a claude-code --skill ai-multimodal

Installation paths:
.claude/skills/ai-multimodal/

# AI Multimodal

Process audio, images, videos, and documents, and generate images/videos using Google Gemini's multimodal API.

## Setup

```bash
export GEMINI_API_KEY="your-key"  # Get from https://aistudio.google.com/apikey
pip install google-genai python-dotenv pillow
```

## Quick Start

**Verify setup**: `python scripts/check_setup.py`

**Analyze media**: `python scripts/gemini_batch_process.py --files <file> --task <analyze|transcribe|extract>`

- TIP: When asked to analyze an image, check whether the `gemini` command is available; if so, run `echo "<prompt to analyze image>" | gemini -y -m gemini-2.5-flash`. If the `gemini` command is not available, fall back to `python scripts/gemini_batch_process.py --files <file> --task analyze`.

**Generate content**: `python scripts/gemini_batch_process.py --task <generate|generate-video> --prompt "description"`

> **Stdin support**: You can pipe files directly via stdin (the script auto-detects PNG/JPG/PDF/WAV/MP3).
> - `cat image.png | python scripts/gemini_batch_process.py --task analyze --prompt "Describe this"`
> - `python scripts/gemini_batch_process.py --files image.png --task analyze` (traditional)

## Models

- **Image generation**: `imagen-4.0-generate-001` (standard), `imagen-4.0-ultra-generate-001` (quality), `imagen-4.0-fast-generate-001` (speed)
- **Video generation**: `veo-3.1-generate-preview` (8-second clips with audio)
- **Analysis**: `gemini-2.5-flash` (recommended), `gemini-2.5-pro` (advanced)

## Scripts

- **`gemini_batch_process.py`**: CLI orchestrator for `transcribe|analyze|extract|generate|generate-video`. Auto-resolves API keys, picks a sensible default model per task, chooses between inline streaming and the File API per input, and saves structured outputs (text/JSON/CSV/markdown plus generated assets) for Imagen 4 and Veo workflows.
- **`media_optimizer.py`**: ffmpeg/Pillow-based preflight tool that compresses, resizes, and converts audio, image, and video inputs, enforces target sizes/bitrates, splits long clips into hour-long chunks, and batch-processes