Capper — Video Captioning & Animation CLI

Automatically transcribe video audio and overlay animated, customizable captions using OpenAI Whisper and FFmpeg.

Supports both the OpenAI Whisper API and a local whisper installation (openai-whisper or whisper.cpp).

Features

Word-level transcription via Whisper with exact timestamps
API or local — use OpenAI's cloud API or a local whisper binary
Configurable word grouping — control how many words appear per on-screen frame
Multiple animation types — fade-in, pop-in (scale bounce), slide-in (directional), or none
Per-word karaoke highlighting — words light up as they are spoken
Full text styling — font family, size, color, bold, italic, stroke/outline, drop shadow
10 alignment positions — any corner, edge center, or screen center
YAML or JSON configuration files
Auto-detects video resolution via ffprobe

Prerequisites

Go 1.23+
FFmpeg (with libass enabled)

FFmpeg

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg

# Fedora
sudo dnf install ffmpeg

# Arch
sudo pacman -S ffmpeg

Verify FFmpeg has libass support:

ffmpeg -filters 2>&1 | grep ass

Whisper (choose one)

Option A — OpenAI API key: Set OPENAI_API_KEY env var or pass --api-key.

Option B — Local whisper (Python openai-whisper):

pip install openai-whisper
# Verify:
whisper --help

Option C — Local whisper.cpp:

git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && make
# Download a model:
./models/download-ggml-model.sh base
# The binary is named 'main' (or 'whisper-cli')

Installation

git clone https://github.com/anomalyco/capper.git
cd capper
go build -o capper .

Or:

go install .

Windows package

Capper runs on Windows as well as Linux/macOS. To produce a fully self-contained Windows bundle — transcription and rendering work offline, with no Python, no separate FFmpeg, and no API key — run from Linux or macOS:

./scripts/build-windows.sh

This cross-compiles capper.exe and downloads a static win64 FFmpeg, whisper.cpp, and a speech model, assembling:

dist/capper-win64.zip   (~286 MB)
└── capper-win64/
    ├── capper.exe        # CLI + embedded web UI
    ├── ffmpeg.exe        # render & preview
    ├── ffprobe.exe
    ├── whisper-cli.exe   # offline transcription (whisper.cpp)
    ├── *.dll             # whisper.cpp + OpenBLAS runtime
    ├── ggml-base.bin     # bundled speech model (~148 MB)
    ├── run.bat           # double-click to launch the styling UI
    └── my_config.json    # caption style, pre-wired to the bundled whisper

The user unzips it anywhere and double-clicks run.bat to open the UI at http://localhost:8080. capper.exe prepends its own folder to PATH on startup, so it finds the bundled FFmpeg and whisper with no setup. For CLI use, run capper.exe --input video.mp4 ... from that folder.

The model is overridable at build time — e.g. MODEL=ggml-small.bin ./scripts/build-windows.sh for higher accuracy, or ggml-base.en.bin for a smaller English-only model. (You can still switch the config to the OpenAI API or Python whisper instead; the bundle just defaults to offline whisper.cpp.)

GPU build: GPU=1 ./scripts/build-windows.sh bundles the CUDA (cuBLAS) whisper.cpp build, which uses an NVIDIA GPU when present and falls back to CPU automatically if there's no usable GPU. It bundles the full CUDA runtime (self-contained — no CUDA install needed). Requires a reasonably recent NVIDIA driver.

Each release publishes both:

Download	For
`capper-win64.zip`	CPU — works everywhere (~150 MB)
`capper-win64-cuda.zip`	NVIDIA GPU, CPU fallback (~580 MB)

So a GPU user just downloads capper-win64-cuda.zip and runs run.bat — no manual DLL swapping. The capper.exe inside is identical to the CPU one, so the in-app updater works for either.

Models: the bundles ship no speech model — you download one on first run from the Speech model panel, which lists every model (tiny → large) with its size and a one-click download, and remembers your choice. The config points at base by default, so it's pre-selected and one click away. No manual model files to manage.

Updating Windows users

Updates ship as just the ~18 MB capper.exe — the bundled FFmpeg, whisper, and model never change, so users never re-download the full bundle. The version is baked into the binary and the app self-updates from GitHub Releases.

To publish an update — just push a tag:

git tag 1.3.0 && git push --tags

The release GitHub Action (.github/workflows/release.yml) then builds the Linux binary and the full Windows bundle and publishes a GitHub Release with the right assets — nothing is built on your machine. (v1.3.0-style tags work too.) To build a release locally instead, run ./scripts/release.sh 1.3.0.

What the user sees: when they open the UI, capper checks the latest release. If a newer version exists, a green "⬆ Update to 1.3.0" button appears in the header. Clicking it downloads the new capper.exe, swaps it in place, and — when launched via run.bat — capper restarts itself on the same port and the page reloads into the new version automatically. No manual download, no reinstall.

(There's also a CLI: capper.exe update to update in place, and capper.exe version to print the current version.)

Usage

# API mode (default) — requires OPENAI_API_KEY
capper --input video.mp4 --api-key sk-...

# API mode with custom config
capper --input video.mp4 --config my_style.yaml

# Local whisper (Python)
capper --input video.mp4 --config examples/config.yaml

# Local whisper with custom binary and model
capper --input video.mp4 --config my_local.yaml

# Override output path
capper --input video.mp4 --config config.yaml --output final.mp4

Configuration Reference

API mode (`whisper.mode: "api"`)

whisper:
  mode: "api"                 # Use OpenAI API
  model: "whisper-1"          # API model name
  language: "en"
  prompt: ""                  # Optional context for better accuracy
  temperature: 0.0

  # Not needed for API mode:
  # binary_path: ""
  # model_path: ""

Local mode (`whisper.mode: "local"`)

For Python openai-whisper:

whisper:
  mode: "local"
  binary_path: "whisper"      # Path to the whisper command
  model_path: "base"          # Model name: tiny, base, small, medium, large
  language: "en"
  prompt: ""
  temperature: 0.0

For whisper.cpp:

whisper:
  mode: "local"
  binary_path: "/path/to/whisper.cpp/main"   # Or 'whisper-cli', 'whisper-cpp'
  model_path: "/path/to/models/ggml-base.bin" # Full path to model file
  language: "en"

Capper auto-detects which whisper variant you are using based on the binary name.

Full configuration

words_per_frame: 4
display_mode: "static"        # "static" or "karaoke"
output_path: "output.mp4"

font:
  family: "Arial"
  size: 48
  color: "#FFFFFF"
  bold: false
  italic: false
  underline: false

stroke:
  color: "#000000"
  width: 2.0

shadow:
  color: "#000000"
  depth: 2.0

animation:
  type: "fade-in"             # fade-in | pop-in | slide-in | none
  duration_ms: 300
  slide_direction: "bottom"   # left | right | top | bottom
  slide_distance: 50

position:
  alignment: 2                # 2 = bottom center
  margin_left: 60
  margin_right: 60
  margin_top: 20
  margin_bottom: 100

whisper:
  mode: "api"                 # "api" or "local"
  model: "whisper-1"
  language: "en"
  prompt: ""
  temperature: 0.0
  # Local mode only:
  # binary_path: "whisper"
  # model_path: "base"

karaoke:
  active_color: "#FFFF00"
  inactive_color: "#FFFFFF"

Alignment Values

Value	Position
1	Bottom left
2	Bottom center
3	Bottom right
4	Middle left
5	Middle center
6	Middle right
7	Top left
8	Top center
9	Top right

How It Works

Audio extraction — FFmpeg extracts mono 16kHz WAV audio from the input video
Transcription — Audio is sent to OpenAI Whisper API or processed by a local whisper binary to get word-level timestamps
Frame grouping — Words are grouped into on-screen frames based on words_per_frame
ASS generation — An Advanced SubStation Alpha subtitle file is generated with all styles, positions, and animation override tags
Rendering — FFmpeg burns the ASS subtitles directly into the video stream, copying the original audio

Examples

See the examples/ directory for sample configuration files:

examples/config.yaml — Standard API-mode fade-in captions
examples/config.json — Bold pop-in style with JSON format
examples/config-local.json — Local whisper mode

Environment Variables

Variable	Description
`OPENAI_API_KEY`	OpenAI API key (API mode only)

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
.github/workflows		.github/workflows
animation		animation
caption		caption
cmd		cmd
config		config
examples		examples
render		render
scripts		scripts
whisper		whisper
.gitignore		.gitignore
AGENTS.md		AGENTS.md
README.md		README.md
capper		capper
go.mod		go.mod
go.sum		go.sum
main.go		main.go
my_config.json		my_config.json
video.mp4.capper.json		video.mp4.capper.json
video2.mp4.capper.json		video2.mp4.capper.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Capper — Video Captioning & Animation CLI

Features

Prerequisites

FFmpeg

Whisper (choose one)

Installation

Windows package

Updating Windows users

Usage

Configuration Reference

API mode (`whisper.mode: "api"`)

Local mode (`whisper.mode: "local"`)

Full configuration

Alignment Values

How It Works

Examples

Environment Variables

License

About

Uh oh!

Releases 7

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Capper — Video Captioning & Animation CLI

Features

Prerequisites

FFmpeg

Whisper (choose one)

Installation

Windows package

Updating Windows users

Usage

Configuration Reference

API mode (whisper.mode: "api")

Local mode (whisper.mode: "local")

Full configuration

Alignment Values

How It Works

Examples

Environment Variables

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 7

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

API mode (`whisper.mode: "api"`)

Local mode (`whisper.mode: "local"`)

Packages