Why AI Video Editing Doesn't Need to Watch the Video: The 13.3k-Star video-use Hides a Counterintuitive Answer
A 3-minute 1080p video at 30 fps splits into about 5,400 frames. Priced by GPT-4o's high-detail image tokens (roughly 1,100 tokens per 1080p frame), just "watching" it would cost about six million tokens, before making a single cut.
That's why most "AI auto-editing" products are either painfully slow or ridiculously expensive. They are working against the grain of the model: making a text engine "watch" a video.
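As a rough check on that cost, here is a minimal sketch of the frame-by-frame arithmetic. It assumes a 30 fps source and OpenAI's published high-detail image formula for GPT-4o (fit within 2048 x 2048, shrink the short side to 768 px, then 170 tokens per 512 px tile plus a flat 85). The function name is made up for illustration, and the exact figure will shift with frame rate and detail mode.

```python
import math

def gpt4o_high_detail_tokens(width: int, height: int) -> int:
    """Token cost of one image under the high-detail tiling formula."""
    # Step 1: shrink to fit inside a 2048 x 2048 square (never upscale).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: shrink so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: 170 tokens per 512 px tile, plus a flat 85 tokens.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 170 * tiles + 85

FPS = 30          # assumed frame rate
DURATION_S = 180  # 3 minutes

frames = FPS * DURATION_S
per_frame = gpt4o_high_detail_tokens(1920, 1080)
total = frames * per_frame
print(frames, per_frame, total)  # 5400 1105 5967000
```

Even sampling one frame per second instead of thirty would still cost about 200,000 tokens for three minutes, which is far more than a text transcript of the same footage.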
But the browser-use team thinks differently. Their newly open-sourced video-use has already garnered 13.3k stars and 1.7k forks, and its core idea can be summed up in one sentence:
Don't show AI the footage; give AI the script.
How a Video Becomes 12KB
video-use's technical approach has two layers. The first is "full-load" text transcription.
Each clip is processed through ElevenLabs Scribe, which outputs word-level timestamps, speaker diarization (who is speaking), and audio event markers (laughter, applause, music starts). All clips are then packed into a single takes_packed.md file of about 12KB.
12KB. A video that might be several GB of raw footage is compressed to the size of a WeChat message.
This is AI's "workbench" — not a timeline, not a preview window, but a text file it handles best.
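The article does not show the layout of takes_packed.md, so the line format below is purely hypothetical. As a minimal sketch of the packing idea, the code collapses word-level timestamps into one line per phrase, starting a new phrase whenever the speaker changes or the silence between words exceeds an assumed threshold (`max_gap`).

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float   # seconds
    end: float     # seconds
    speaker: str

def pack_words(words: list[Word], max_gap: float = 0.6) -> str:
    """Collapse word-level timestamps into one line per phrase.

    A new phrase starts when the speaker changes or the silence
    between two consecutive words exceeds max_gap seconds.
    """
    lines: list[str] = []
    phrase: list[Word] = []

    def flush() -> None:
        # Emit the running phrase as "[start-end] speaker: text".
        if phrase:
            text = " ".join(w.text for w in phrase)
            lines.append(
                f"[{phrase[0].start:.2f}-{phrase[-1].end:.2f}] "
                f"{phrase[0].speaker}: {text}"
            )
            phrase.clear()

    for word in words:
        if phrase and (
            word.speaker != phrase[-1].speaker
            or word.start - phrase[-1].end > max_gap
        ):
            flush()
        phrase.append(word)
    flush()
    return "\n".join(lines)

words = [
    Word("So", 0.00, 0.20, "A"),
    Word("um", 0.25, 0.50, "A"),
    Word("welcome", 1.40, 1.90, "A"),
    Word("back", 1.95, 2.30, "A"),
    Word("Thanks", 2.50, 2.90, "B"),
]
packed = pack_words(words)
print(packed)
# [0.00-0.50] A: So um
# [1.40-2.30] A: welcome back
# [2.50-2.90] B: Thanks
```

Each phrase keeps its time range, so an edit decision made on the text can still be mapped back to exact cut points in the footage.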
The second layer is "on-demand" visual snapshots. When a pause is ambiguous or a cut needs checking, the timeline_view tool renders a PNG of the waveform with frame overlays for just those few seconds. The AI does not stare at the screen the whole time; it works like a director calling for "playback at second 42."
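How video-use decides which seconds deserve a look is not described here. One plausible trigger, sketched below as an assumption, is a long silence between two transcribed words. The function name, the `min_gap` threshold, and the `pad` margin are all hypothetical, and the sketch only selects time windows; it does not call timeline_view.

```python
def pauses_to_inspect(word_spans, min_gap=1.0, pad=0.25):
    """Return (start, end) windows around long silences between words.

    word_spans is a list of (start, end) times in seconds, in order.
    """
    windows = []
    for (_, prev_end), (next_start, _) in zip(word_spans, word_spans[1:]):
        if next_start - prev_end >= min_gap:
            # Pad the window so the snapshot shows context on both sides.
            windows.append((max(0.0, prev_end - pad), next_start + pad))
    return windows

spans = [(40.0, 40.5), (40.6, 41.0), (42.5, 43.0)]
windows = pauses_to_inspect(spans)
print(windows)  # [(40.75, 42.75)]
```

Only the returned windows would be rendered as images, so the visual cost scales with the number of doubtful moments rather than with the length of the footage.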
From Raw Footage to Final Cut, AI is Leashed at Every Step
The entire flow looks like an assembly line:
Transcription → Packing → LLM Inference → Generate Edit Decision List → Render → Self-Evaluation
But it's completely different from the "let AI run free" approach — every step is constrained within a controllable range.
The most ingenious design is the self-evaluation loop. After rendering, AI runs timeline_view at every cut point to check three things: visual frame jumps, audio pops, and subtitles blocking the picture. Any issue is corrected on the spot, with a maximum of three retries before handing the preview to you.
No need to watch the entire process; AI grades itself before submitting the homework.
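The loop described above can be sketched as a bounded retry. This is a minimal illustration under stated assumptions: the `render`, `find_issues`, and `fix` callables are hypothetical stand-ins for the real rendering and timeline_view checks, and the EDL is modeled as a plain list of dicts.

```python
from typing import Callable

MAX_RETRIES = 3  # the cap on correction attempts stated in the article

def render_with_self_check(
    edl: list[dict],
    render: Callable[[list[dict]], str],
    find_issues: Callable[[str, list[dict]], list[str]],
    fix: Callable[[list[dict], list[str]], list[dict]],
) -> tuple[str, list[str]]:
    """Render, inspect the cuts, and retry at most MAX_RETRIES times.

    Returns the last preview path and any issues still unresolved.
    """
    preview = render(edl)
    issues = find_issues(preview, edl)
    retries = 0
    while issues and retries < MAX_RETRIES:
        edl = fix(edl, issues)
        preview = render(edl)
        issues = find_issues(preview, edl)
        retries += 1
    return preview, issues

# Stub stages: the first render has an audio pop, the second is clean.
calls = {"render": 0}

def fake_render(edl):
    calls["render"] += 1
    return f"preview_{calls['render']}.mp4"

def fake_find_issues(preview, edl):
    return [] if edl[0].get("fade_ms") else ["audio pop at cut 0"]

def fake_fix(edl, issues):
    return [{**edl[0], "fade_ms": 30}]

preview, remaining = render_with_self_check(
    [{"start": 0.0, "end": 4.2}], fake_render, fake_find_issues, fake_fix
)
print(preview, remaining)  # preview_2.mp4 []
```

Returning the unresolved issues alongside the preview means a human still sees what the loop could not fix once the retry budget runs out.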
What It Can Do
According to the README, the features cover the operations a video editor performs most often:
- Remove filler words: umm, uh, false starts, and abandoned takes, all cut in one pass
- Color correction — warm cinematic / neutral natural / custom ffmpeg chain, independent color grading per clip
- Audio fades — automatically added at every cut point to prevent pops
- Subtitle embedding — two-word capitalized chunks by default, with fully configurable font, position, and style
- Animation overlays — supports four engines (HyperFrames, Remotion, Manim, PIL), generated via parallel sub-agents
But these operations are not hard-coded fixed workflows. AI dynamically generates an EDL (Edit Decision List) based on your natural language instructions. You say "make the second clip's tone warm cinematic and remove all umms," and it generates the corresponding editing command sequence.
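The article does not give the EDL schema, so the field names below (`clip`, `start`, `end`, `grade`) are hypothetical. In video-use the LLM writes the EDL from your instruction; this minimal sketch only shows the kind of structure an instruction like "remove all umms and grade it warm cinematic" might reduce to, using a hard-coded filler list for illustration.

```python
FILLERS = {"um", "umm", "uh", "uhh"}  # illustrative list, not the project's

def build_edl(words, clip, grade=None):
    """Turn (text, start, end) words into keep-segments that skip fillers."""
    segments = []
    current = None
    for text, start, end in words:
        if text.lower().strip(",.") in FILLERS:
            current = None  # close the running segment at the filler
            continue
        if current is None:
            # Open a new keep-segment at the first non-filler word.
            current = {"clip": clip, "start": start, "end": end, "grade": grade}
            segments.append(current)
        else:
            current["end"] = end  # extend the segment through this word
    return segments

words = [
    ("So", 0.0, 0.25), ("umm", 0.25, 0.75), ("welcome", 0.75, 1.25),
    ("back", 1.25, 1.5), ("uh", 1.5, 2.0), ("everyone", 2.0, 2.5),
]
edl = build_edl(words, clip="take_02.mp4", grade="warm_cinematic")
for seg in edl:
    print(seg["start"], seg["end"])
# 0.0 0.25
# 0.75 1.5
# 2.0 2.5
```

Because the EDL is plain data, it can be reviewed, diffed, or corrected before any rendering happens.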
Skill, Not App
The product form of video-use is also noteworthy — it is a Skill, not a standalone App.
Installation:
```shell
# 1. Clone and link to your Agent's skills directory
git clone https://github.com/browser-use/video-use ~/Developer/video-use
ln -sfn ~/Developer/video-use ~/.claude/skills/video-use

# 2. Install dependencies
cd ~/Developer/video-use
uv sync
brew install ffmpeg
brew install yt-dlp  # optional, for downloading online footage

# 3. Configure ElevenLabs API Key
cp .env.example .env
# Edit .env and fill in ELEVENLABS_API_KEY
```
After installation, you don't need to open any editing software. Drop the footage into a folder, tell Claude Code "edit this into a publishable video," and it starts working.
This choice is deliberate: as a Skill, it piggybacks on the Agent's existing capabilities instead of rebuilding them from scratch. Your Agent can already read files, run commands, and call APIs; video-use just adds a set of "craftsmanship" for video editing.
A Problem Bigger Than Video
The browser-use team has done two projects with a consistent philosophy:
- browser-use (100k+ stars): Don't show AI webpage screenshots; give it a structured DOM tree
- video-use (13.3k stars): Don't show AI video frames; give it transcription text + timeline
Same methodology, validated twice: When designing tools, the laziest approach is to imitate humans, and the most efficient approach is to understand the essential capabilities of the tool user.
Humans watch videos with their eyes, so editing software provides timelines and preview windows. But LLMs excel at text understanding, context reasoning, and precise instruction execution — not "visual search."
This insight may apply beyond video. Much hard-to-process data can be handled efficiently by LLMs once you find the right structured representation: audio can become text, 3D models can become scene descriptions, code can become an AST...
It's not about making AI become human; it's about letting AI do what AI does best.
Project link: https://github.com/browser-use/video-use