Commit 9d463de

Merge pull request #166 from WEIFENG2333/feat/videocaptioner-harness
feat: Add VideoCaptioner CLI — AI video captioning with styled subtitles
2 parents 7dd6c2e + a4ef6ff commit 9d463de

20 files changed

Lines changed: 1765 additions & 1 deletion

README.md

Lines changed: 10 additions & 1 deletion
@@ -469,13 +469,14 @@ The catalog auto-updates whenever `registry.json` changes — new community CLIs
 | **🤖 AI/ML Platforms** | Automate model training, inference pipelines, and hyperparameter tuning through structured commands | Stable Diffusion WebUI, ComfyUI, Ollama, InvokeAI, Text-generation-webui, Open WebUI, Fooocus, Kohya_ss, AnythingLLM, SillyTavern |
 | **📊 Data & Analytics** | Enable programmatic data processing, visualization, and statistical analysis workflows | JupyterLab, Apache Superset, Metabase, Redash, DBeaver, KNIME, Orange, OpenSearch Dashboards, Lightdash |
 | **💻 Development Tools** | Streamline code editing, building, testing, and deployment processes via command interfaces | Jenkins, Gitea, Hoppscotch, Portainer, pgAdmin, SonarQube, ArgoCD, OpenLens, Insomnia, Beekeeper Studio, **[iTerm2](https://iterm2.com)** |
-| **🎨 Creative & Media** | Control content creation, editing, and rendering workflows programmatically | Blender, GIMP, OBS Studio, Audacity, Krita, Kdenlive, Shotcut, Inkscape, Darktable, LMMS, Ardour |
+| **🎨 Creative & Media** | Control content creation, editing, and rendering workflows programmatically | Blender, GIMP, OBS Studio, Audacity, Krita, Kdenlive, Shotcut, Inkscape, Darktable, LMMS, Ardour, VideoCaptioner |
 | **🔬 Scientific Computing** | Automate research workflows, simulations, and complex calculations | ImageJ, FreeCAD, QGIS, ParaView, Gephi, LibreCAD, Stellarium, KiCad, JASP, Jamovi |
 | **🏢 Enterprise & Office** | Convert business applications and productivity tools into agent-accessible systems | NextCloud, GitLab, Grafana, Mattermost, LibreOffice, AppFlowy, NocoDB, Odoo (Community), Plane, ERPNext |
 | **📞 Communication & Collaboration** | Automate meeting scheduling, participant management, recording retrieval, and reporting through structured CLI | Zoom, Jitsi Meet, BigBlueButton, Mattermost |
 | **📐 Diagramming & Visualization** | Create and manipulate diagrams, flowcharts, architecture diagrams, and visual documentation programmatically | Draw.io (diagrams.net), Mermaid, PlantUML, Excalidraw, yEd |
 | **🌐 Network & Infrastructure** | Manage network services, DNS, ad-blocking, and infrastructure through structured CLI commands | AdGuardHome |
 | **🔬 Graphics & GPU Debugging** | Analyze GPU frame captures, inspect pipeline state, export shaders, and diff rendering state | RenderDoc |
+| **🎬 Video & Subtitles** | Transcribe speech, translate subtitles, burn styled captions into video — full captioning pipeline | VideoCaptioner |
 | **✨ AI Content Generation** | Generate professional deliverables (slides, docs, diagrams, websites, research reports) through AI-powered cloud APIs | [AnyGen](https://www.anygen.io), Gamma, Beautiful.ai, Tome |
 
 ---
@@ -751,6 +752,13 @@ Each application received complete, production-ready CLI interfaces — not demo
 <td align="center">✅ 98</td>
 </tr>
 <tr>
+<td align="center"><strong>🎬 <a href="videocaptioner/agent-harness/">VideoCaptioner</a></strong></td>
+<td>AI Video Captioning</td>
+<td><code>cli-anything-videocaptioner</code></td>
+<td>videocaptioner CLI (PyPI)</td>
+<td align="center">✅ 26</td>
+</tr>
+<tr>
 <td align="center"><strong>🎨 Sketch</strong></td>
 <td>UI Design</td>
 <td><code>sketch-cli</code></td>
@@ -879,6 +887,7 @@ cli-anything/
 ├── 🦙 ollama/agent-harness/ # Ollama CLI (98 tests)
 ├── 🎨 sketch/agent-harness/ # Sketch CLI (19 tests, Node.js)
 ├── 🔬 renderdoc/agent-harness/ # RenderDoc CLI (59 tests)
+├── 🎬 videocaptioner/agent-harness/ # VideoCaptioner CLI (26 tests)
 └── ☁️ cloudcompare/agent-harness/ # CloudCompare CLI (88 tests)
 ```

registry.json

Lines changed: 12 additions & 0 deletions
@@ -384,6 +384,18 @@
 "contributor_url": "https://github.com/levishilf"
 },
 {
+"name": "videocaptioner",
+"display_name": "VideoCaptioner",
+"version": "1.0.0",
+"description": "AI-powered video captioning — transcribe speech, optimize/translate subtitles, burn styled subtitles into video",
+"requires": "videocaptioner (pip install videocaptioner), ffmpeg",
+"homepage": "https://github.com/WEIFENG2333/VideoCaptioner",
+"install_cmd": "pip install git+https://github.com/HKUDS/CLI-Anything.git#subdirectory=videocaptioner/agent-harness",
+"entry_point": "cli-anything-videocaptioner",
+"skill_md": "videocaptioner/agent-harness/cli_anything/videocaptioner/skills/SKILL.md",
+"category": "video",
+"contributor": "WEIFENG2333",
+"contributor_url": "https://github.com/WEIFENG2333"
 "name": "intelwatch",
 "display_name": "Intelwatch",
 "version": "1.0.0",
Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
+# VideoCaptioner: Project-Specific Analysis & SOP
+
+## Architecture Summary
+
+VideoCaptioner is an AI-powered video captioning tool that provides a complete
+pipeline from speech recognition to styled subtitle synthesis. It ships as a
+standalone CLI (`pip install videocaptioner`) with a well-defined command interface.
+
+```
++----------------------------------------------------------+
+|                    VideoCaptioner CLI                    |
+| +------------+  +----------+ +-----------+ +-----------+ |
+| | Transcribe |  | Subtitle | | Synthesize| | Process   | |
+| |   (ASR)    |  |  (NLP)   | | (FFmpeg)  | | (Pipeline)| |
+| +-----+------+  +----+-----+ +-----+-----+ +-----+-----+ |
+|       |              |             |             |       |
+| +-----+--------------+-------------+-------------+-----+ |
+| |                     Core Engine                      | |
+| |  ASR engines, LLM optimization, Translation,         | |
+| |  Subtitle rendering (ASS + Rounded), FFmpeg          | |
+| +------------------------------------------------------+ |
++----------------------------------------------------------+
+```
+
+## CLI Strategy: Subprocess Wrapper
+
+Unlike applications that need reverse-engineering of internal formats,
+VideoCaptioner already provides a production CLI. Our harness:
+
+1. **Click wrapper** provides the CLI-Anything standard interface
+2. **Subprocess backend** delegates to `videocaptioner` CLI commands
+3. **JSON mode** (`--json`) returns structured output for agents
+4. **REPL mode** provides interactive session with tab-completion
+
+### Why Subprocess?
+
+VideoCaptioner's CLI is:
+- **Production-tested** with 50+ unit tests and 200+ QA test cases
+- **Feature-complete** with 7 subcommands covering the full pipeline
+- **Well-documented** with clear `--help` text and exit codes
+- **Actively maintained** on PyPI with automated releases
+
+Wrapping via subprocess preserves all these qualities without reimplementation.
+
+## Coverage
+
+### Transcription (4 ASR engines)
+- `bijian` — Free, Chinese & English, no setup needed
+- `jianying` — Free, Chinese & English, no setup needed
+- `whisper-api` — All languages, OpenAI-compatible API
+- `whisper-cpp` — All languages, local model
+
+### Subtitle Processing
+- **Split** — Semantic re-segmentation via LLM
+- **Optimize** — Fix ASR errors, punctuation, formatting via LLM
+- **Translate** — 38 languages, 3 translators (LLM, Bing free, Google free)
+- **Layout** — target-above, source-above, target-only, source-only
+
+### Video Synthesis
+- **Soft subtitles** — Embedded subtitle track (switchable)
+- **Hard subtitles** — Burned into video frames
+- **ASS style** — Traditional outline/shadow with presets (default, anime, vertical)
+- **Rounded style** — Modern rounded background boxes
+- **Customizable** — Inline JSON override for any style parameter
+- **Quality levels** — ultra (CRF 18), high (CRF 23), medium (CRF 28), low (CRF 32)
+
+### Utilities
+- Configuration management (TOML config + env vars)
+- Style preset listing with full parameters
+- Online video download (YouTube, Bilibili, etc.)
+
+## Testing Strategy
+
+- **Unit tests**: Mock subprocess calls, verify argument construction
+- **End-to-end tests**: Real videocaptioner CLI with test media files
+- **Prerequisite**: `videocaptioner` and `ffmpeg` must be installed
+
+## Limitations
+
+- Requires `videocaptioner` package to be installed separately
+- Free ASR engines (bijian/jianying) only support Chinese & English
+- LLM features require an OpenAI-compatible API key
+- Hard subtitle styles require FFmpeg
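
The subprocess-wrapper strategy described in this document can be sketched roughly as follows. This is a minimal illustration, not the harness's actual backend: `run_cli` is a hypothetical helper name, and the real `run_quiet` in `vc_backend` (not shown in this commit view) may handle errors, timeouts, and output differently.

```python
from __future__ import annotations

import subprocess
import sys


def run_cli(executable: str, args: list[str], timeout: float | None = None) -> str:
    """Run an external CLI and return its stripped stdout.

    Hypothetical helper for illustration; the harness's own backend may differ.
    """
    result = subprocess.run(
        [executable, *args],   # argv list, so no shell quoting is involved
        capture_output=True,   # collect stdout and stderr instead of inheriting them
        text=True,             # decode bytes to str
        timeout=timeout,
        check=False,           # inspect the exit code ourselves
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"{executable} exited with code {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    return result.stdout.strip()


if __name__ == "__main__":
    # Demonstrated with the Python interpreter so the sketch runs even when
    # videocaptioner is not installed.
    print(run_cli(sys.executable, ["-c", "print('ok')"]))
```

Passing an argv list rather than a shell string keeps file names with spaces or quotes intact, which matters when the arguments are user-supplied media paths.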

videocaptioner/agent-harness/cli_anything/__init__.py

Whitespace-only changes.
Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
+# VideoCaptioner CLI
+
+AI-powered video captioning tool with beautiful customizable subtitle styles.
+
+## Architecture
+
+- **Subprocess backend** delegates to the production `videocaptioner` CLI (`pip install videocaptioner`)
+- **Click** provides the CLI framework with subcommand groups and REPL
+- **JSON output mode** (`--json`) for agent consumption
+- **Free features included**: bijian ASR (Chinese/English), Bing/Google translation
+
+## Pipeline
+
+```
+Audio/Video → ASR Transcription → Subtitle Splitting → LLM Optimization → Translation → Video Synthesis
+              (bijian/whisper)    (semantic)           (fix errors)       (38 languages) (styled subtitles)
+```
+
+## Install
+
+```bash
+pip install videocaptioner click prompt-toolkit
+```
+
+## Run
+
+```bash
+# One-shot: transcribe a Chinese video and add English subtitles
+cli-anything-videocaptioner process video.mp4 --asr bijian --translator bing --target-language en --subtitle-mode hard
+
+# Transcribe only
+cli-anything-videocaptioner transcribe video.mp4 --asr bijian -o output.srt
+
+# Translate existing subtitles
+cli-anything-videocaptioner subtitle input.srt --translator google --target-language ja
+
+# Burn subtitles with anime style
+cli-anything-videocaptioner synthesize video.mp4 -s sub.srt --subtitle-mode hard --style anime
+
+# Custom style (red outline, large font)
+cli-anything-videocaptioner synthesize video.mp4 -s sub.srt --subtitle-mode hard \
+  --style-override '{"outline_color": "#ff0000", "font_size": 48}'
+
+# JSON output mode (for agent consumption)
+cli-anything-videocaptioner --json transcribe video.mp4 --asr bijian
+
+# Interactive REPL
+cli-anything-videocaptioner
+```
+
+## Subtitle Styles
+
+Two rendering modes for beautiful subtitles:
+
+**ASS mode** — traditional outline/shadow:
+- Presets: `default` (white+black), `anime` (warm+orange), `vertical` (portrait videos)
+
+**Rounded mode** — modern rounded background boxes:
+- Preset: `rounded` (dark text on semi-transparent background)
+
+Fully customizable via `--style-override` with inline JSON.
+
+## Coverage
+
+| Feature | Commands |
+|---------|----------|
+| Transcription | 4 ASR engines, auto language detection, word timestamps |
+| Subtitle Processing | Split + optimize + translate, 3 translators, 38 languages |
+| Video Synthesis | Soft/hard subtitles, 4 quality levels, 5 style presets |
+| Styles | ASS outline + rounded background, inline JSON customization |
+| Utilities | Config management, style listing, video download |
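
When an agent builds the `--style-override` value programmatically, hand-written JSON inside shell quotes is easy to get wrong. The sketch below serializes the override with `json.dumps` and passes it as a single argv element. `build_synthesize_command` is a hypothetical helper, not part of the harness; the flags and style keys are the ones used in the README's own example.

```python
import json
import shlex


def build_synthesize_command(video: str, subtitle: str, overrides: dict) -> list[str]:
    """Return an argv list for a hard-subtitle synthesize call with style overrides.

    Hypothetical helper for illustration only.
    """
    # Compact separators keep the JSON on one line without extra spaces.
    style_json = json.dumps(overrides, separators=(",", ":"))
    return [
        "cli-anything-videocaptioner", "synthesize", video,
        "-s", subtitle,
        "--subtitle-mode", "hard",
        "--style-override", style_json,
    ]


cmd = build_synthesize_command(
    "video.mp4", "sub.srt", {"outline_color": "#ff0000", "font_size": 48}
)
# shlex.join renders a copy-pastable shell line with the JSON safely quoted.
print(shlex.join(cmd))
```

Because the JSON travels as one list element, it can be handed to `subprocess.run(cmd)` directly; `shlex.join` is only needed when the command has to be shown to a person or written to a log.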

videocaptioner/agent-harness/cli_anything/videocaptioner/__init__.py

Whitespace-only changes.

videocaptioner/agent-harness/cli_anything/videocaptioner/core/__init__.py

Whitespace-only changes.
Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
+"""Full pipeline — transcribe → optimize → translate → synthesize in one command."""
+
+from cli_anything.videocaptioner.utils.vc_backend import run_quiet
+
+
+def process(
+    input_path: str,
+    output_path: str | None = None,
+    asr: str = "bijian",
+    language: str = "auto",
+    translator: str | None = None,
+    target_language: str | None = None,
+    subtitle_mode: str = "soft",
+    quality: str = "medium",
+    layout: str | None = None,
+    style: str | None = None,
+    style_override: str | None = None,
+    render_mode: str | None = None,
+    no_optimize: bool = False,
+    no_translate: bool = False,
+    no_split: bool = False,
+    no_synthesize: bool = False,
+    reflect: bool = False,
+    prompt: str | None = None,
+    api_key: str | None = None,
+    api_base: str | None = None,
+    model: str | None = None,
+) -> str:
+    """Run the complete captioning pipeline.
+
+    Args:
+        input_path: Video or audio file path.
+        output_path: Output file or directory path.
+        asr: ASR engine (bijian, jianying, whisper-api, whisper-cpp).
+        language: Source language, or "auto" to detect it.
+        translator: Translation service (llm, bing, google).
+        target_language: Target language BCP 47 code.
+        subtitle_mode: soft (embedded track) or hard (burned into frames).
+        quality: Video quality (ultra, high, medium, low).
+        layout: Bilingual layout (target-above, source-above, target-only, source-only).
+        style: Style preset name.
+        style_override: Inline JSON style override.
+        render_mode: ass or rounded.
+        no_optimize: Skip LLM optimization.
+        no_translate: Skip translation.
+        no_split: Skip re-segmentation.
+        no_synthesize: Skip video synthesis.
+        reflect: Enable reflective translation (LLM only).
+        prompt: Custom LLM prompt.
+        api_key: LLM API key.
+        api_base: LLM API base URL.
+        model: LLM model name.
+
+    Returns:
+        Output file path.
+    """
+    args = ["process", input_path, "--asr", asr, "--language", language,
+            "--subtitle-mode", subtitle_mode, "--quality", quality]
+    if output_path:
+        args += ["-o", output_path]
+    if translator:
+        args += ["--translator", translator]
+    if target_language:
+        args += ["--target-language", target_language]
+    if layout:
+        args += ["--layout", layout]
+    if style:
+        args += ["--style", style]
+    if style_override:
+        args += ["--style-override", style_override]
+    if render_mode:
+        args += ["--render-mode", render_mode]
+    if no_optimize:
+        args.append("--no-optimize")
+    if no_translate:
+        args.append("--no-translate")
+    if no_split:
+        args.append("--no-split")
+    if no_synthesize:
+        args.append("--no-synthesize")
+    if reflect:
+        args.append("--reflect")
+    if prompt:
+        args += ["--prompt", prompt]
+    if api_key:
+        args += ["--api-key", api_key]
+    if api_base:
+        args += ["--api-base", api_base]
+    if model:
+        args += ["--model", model]
+    return run_quiet(args)
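
The chain of `if` statements in `process` maps each optional keyword to one flag. A table-driven variant is one way to keep that mapping in a single place. The sketch below is an alternative formulation, not the code in this commit: `optional_args`, `VALUE_FLAGS`, and `SWITCH_FLAGS` are hypothetical names, while the flag strings are copied from the function above. One behavioral difference to note: flags come out in the caller's keyword order, not in the fixed order the original uses.

```python
from __future__ import annotations

from typing import Any

# Options that take a value: keyword name -> CLI flag.
VALUE_FLAGS = {
    "output_path": "-o",
    "translator": "--translator",
    "target_language": "--target-language",
    "layout": "--layout",
    "style": "--style",
    "style_override": "--style-override",
    "render_mode": "--render-mode",
    "prompt": "--prompt",
    "api_key": "--api-key",
    "api_base": "--api-base",
    "model": "--model",
}

# Boolean switches: keyword name -> CLI flag emitted when the value is truthy.
SWITCH_FLAGS = {
    "no_optimize": "--no-optimize",
    "no_translate": "--no-translate",
    "no_split": "--no-split",
    "no_synthesize": "--no-synthesize",
    "reflect": "--reflect",
}


def optional_args(**options: Any) -> list[str]:
    """Translate keyword options into CLI arguments, skipping unset ones."""
    args: list[str] = []
    for name, value in options.items():
        if name in VALUE_FLAGS:
            if value:
                args += [VALUE_FLAGS[name], str(value)]
        elif name in SWITCH_FLAGS:
            if value:
                args.append(SWITCH_FLAGS[name])
        else:
            # Fail loudly on typos instead of silently dropping an option.
            raise TypeError(f"unknown option: {name}")
    return args


print(optional_args(translator="bing", target_language="en", no_split=True, style=None))
```

The explicit `if` chain in the commit is arguably easier to read at a glance; the table form mainly pays off when several wrapper functions share the same option set.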
Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
+"""Subtitle processing — optimize and translate subtitle files."""
+
+from cli_anything.videocaptioner.utils.vc_backend import run_quiet
+
+
+def process_subtitle(
+    input_path: str,
+    output_path: str | None = None,
+    translator: str | None = None,
+    target_language: str | None = None,
+    format: str = "srt",
+    layout: str | None = None,
+    no_optimize: bool = False,
+    no_translate: bool = False,
+    no_split: bool = False,
+    reflect: bool = False,
+    prompt: str | None = None,
+    api_key: str | None = None,
+    api_base: str | None = None,
+    model: str | None = None,
+) -> str:
+    """Optimize and/or translate a subtitle file.
+
+    Args:
+        input_path: Subtitle file (.srt, .ass, .vtt).
+        output_path: Output file or directory path.
+        translator: Translation service (llm, bing, google).
+        target_language: Target language BCP 47 code.
+        format: Output format (srt, ass, txt, json).
+        layout: Bilingual layout (target-above, source-above, target-only, source-only).
+        no_optimize: Skip LLM optimization.
+        no_translate: Skip translation.
+        no_split: Skip re-segmentation.
+        reflect: Enable reflective translation (LLM only).
+        prompt: Custom LLM prompt.
+        api_key: LLM API key.
+        api_base: LLM API base URL.
+        model: LLM model name.
+
+    Returns:
+        Output file path.
+    """
+    args = ["subtitle", input_path, "--format", format]
+    if output_path:
+        args += ["-o", output_path]
+    if translator:
+        args += ["--translator", translator]
+    if target_language:
+        args += ["--target-language", target_language]
+    if layout:
+        args += ["--layout", layout]
+    if no_optimize:
+        args.append("--no-optimize")
+    if no_translate:
+        args.append("--no-translate")
+    if no_split:
+        args.append("--no-split")
+    if reflect:
+        args.append("--reflect")
+    if prompt:
+        args += ["--prompt", prompt]
+    if api_key:
+        args += ["--api-key", api_key]
+    if api_base:
+        args += ["--api-base", api_base]
+    if model:
+        args += ["--model", model]
+    return run_quiet(args)
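
The harness documentation describes its unit tests as mocking subprocess calls and verifying argument construction. A self-contained sketch of that pattern is shown below. It does not import the harness: `translate_subtitle` is a hypothetical stand-in for `process_subtitle` that calls `subprocess.run` directly, and the fake stdout value `out.srt` is invented for the test, since the real CLI's output format is not shown here.

```python
import subprocess
from unittest import mock


def translate_subtitle(input_path: str, translator: str, target_language: str) -> str:
    """Hypothetical stand-in for process_subtitle that shells out directly."""
    argv = [
        "videocaptioner", "subtitle", input_path,
        "--format", "srt",
        "--translator", translator,
        "--target-language", target_language,
    ]
    completed = subprocess.run(argv, capture_output=True, text=True, check=True)
    return completed.stdout.strip()


def check_argument_construction() -> tuple[list[str], str]:
    """Patch subprocess.run, call the wrapper, and return (argv, result)."""
    # Assumed fake result; no real process is started while the patch is active.
    fake = subprocess.CompletedProcess(args=[], returncode=0, stdout="out.srt\n", stderr="")
    with mock.patch("subprocess.run", return_value=fake) as run:
        result = translate_subtitle("input.srt", "google", "ja")
    argv = run.call_args.args[0]  # first positional argument passed to subprocess.run
    return argv, result


argv, result = check_argument_construction()
assert argv[:3] == ["videocaptioner", "subtitle", "input.srt"]
assert argv[argv.index("--target-language") + 1] == "ja"
print(argv)
```

Tests written this way run without `videocaptioner` or `ffmpeg` installed, which is what separates them from the end-to-end tests that exercise the real CLI.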
