Why AI Video Editing Doesn't Require Watching Video: A Counterintuitive Answer from video-use (13.3k Stars)
A 3-minute 1080p video at 30 fps consists of approximately 5,400 frames. At GPT-4o's visual token cost, "just watching" could consume millions of tokens before any cutting begins.
This is why most AI auto-editing products either crash or cost a fortune. They are fighting something close to a physical law: they ask a text-based model to "watch" a video.
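The frame and token figures above can be checked with a few lines of arithmetic. This is a rough sketch, not an exact bill: it assumes 30 fps and GPT-4o's published high-detail image rule as I understand it (fit within 2048x2048, scale the shortest side down to 768 px, then 85 base tokens plus 170 per 512 px tile). The function names are made up for illustration.

```python
import math


def frame_count(duration_s: float, fps: float) -> int:
    """Number of frames in a clip of the given length."""
    return round(duration_s * fps)


def image_tokens_high_detail(width: int, height: int,
                             base: int = 85, per_tile: int = 170,
                             tile: int = 512) -> int:
    """Estimate tokens for one image under the assumed high-detail rule."""
    # Step 1 (assumed): shrink to fit inside a 2048 x 2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2 (assumed): shrink so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3 (assumed): count the 512 px tiles needed to cover the image.
    tiles = math.ceil(w / tile) * math.ceil(h / tile)
    return base + per_tile * tiles


n_frames = frame_count(duration_s=180, fps=30)
per_frame = image_tokens_high_detail(1920, 1080)
total = n_frames * per_frame
print(n_frames, per_frame, total)
```

Under these assumptions every frame costs about a thousand tokens, so sending all frames lands in the millions, which is the order of magnitude the article refers to.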
However, the browser-use team does not accept this. Their recently open-sourced video-use has already gathered 13.3k stars and 1.7k forks, and its core idea fits in a single sentence:
Instead of letting AI watch the video, let the AI read the script.
Turning One Video Into a 12KB File
The video-use approach has two layers:
- Full loading: speech-to-text.
  - Every segment runs through ElevenLabs Scribe, which outputs word-level timestamps, speaker diarization (who is speaking), and audio event annotations (laughter, applause, music start). All material is ultimately packed into a single 12KB text file called takes_packed.md.
  - This file is approximately the size of a WeChat message but contains the complete transcript with timing information.
- On-demand loading: visual snapshots.
  - When a pause is ambiguous, or when video quality needs checking at a cut, the timeline_view tool generates a PNG with a waveform and overlaid frames for only that specific 2-10 second span.
  - The AI never has to look at the entire video; instead, it acts like a director calling for "playback at second 42."
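To make the first layer concrete, here is a minimal sketch of how word-level timestamps might be packed into compact text lines. The real layout of takes_packed.md is not described in this article, so the `Word` type, the `pack_words` function, the 0.5-second pause threshold, and the output format are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float   # seconds
    end: float     # seconds
    speaker: str


def _format(phrase: List[Word]) -> str:
    """Render one phrase as '[start-end] speaker: text'."""
    text = " ".join(w.text for w in phrase)
    return f"[{phrase[0].start:.2f}-{phrase[-1].end:.2f}] {phrase[0].speaker}: {text}"


def pack_words(words: List[Word], max_gap: float = 0.5) -> str:
    """Group words into phrases, breaking on a speaker change or a long pause."""
    lines: List[str] = []
    current: List[Word] = []
    for w in words:
        if current:
            prev = current[-1]
            if w.speaker != prev.speaker or w.start - prev.end > max_gap:
                lines.append(_format(current))
                current = []
        current.append(w)
    if current:
        lines.append(_format(current))
    return "\n".join(lines)


# Hypothetical transcript fragment, not real Scribe output.
words = [
    Word("Hello", 0.00, 0.40, "S1"),
    Word("everyone", 0.45, 0.90, "S1"),
    Word("umm", 1.80, 2.10, "S1"),
    Word("welcome", 2.20, 2.70, "S1"),
]
packed = pack_words(words)
print(packed)
```

A representation like this keeps every cut-relevant timestamp while staying small enough to fit comfortably in a model's context window.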
From Raw Footage to Final Cut, the AI Is Always Tethered
The entire process looks like a production line:
Transcription → Packaging → LLM Inference → Generating Edit Decision Table → Rendering → Self-Review
However, unlike approaches that let AI improvise, each step is constrained to a controlled, manageable scope.
The cleverest design is the Self-Review Loop.
After rendering completes, the AI runs the timeline_view tool at each cut point to check three things: visual jumps, audio pops, and subtitle occlusion. If issues are found, they are fixed in place, with a maximum of three retry attempts. Only after this self-correction is the preview file handed to the user.
No mid-process review is required: the AI grades its own work and submits only the result.
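The loop described above can be sketched as a bounded retry around an inspect-and-fix pair. This is an illustration only: `self_review`, `inspect`, and `fix` are hypothetical names, and the sketch assumes the three-attempt limit applies per cut point, which the article does not specify.

```python
from typing import Callable, Dict, List

MAX_RETRIES = 3  # the article states a maximum of three retry attempts


def self_review(cut_points: List[float],
                inspect: Callable[[float], List[str]],
                fix: Callable[[float, List[str]], None],
                max_retries: int = MAX_RETRIES) -> List[float]:
    """Inspect every cut, retry fixes up to max_retries, return unresolved cuts."""
    unresolved: List[float] = []
    for t in cut_points:
        issues = inspect(t)
        attempts = 0
        while issues and attempts < max_retries:
            fix(t, issues)
            attempts += 1
            issues = inspect(t)  # re-check after each fix attempt
        if issues:
            unresolved.append(t)
    return unresolved


# Toy stand-ins: the pop at 42.0 s is fixable, the occlusion at 97.3 s is not.
state: Dict[float, List[str]] = {
    12.5: [],
    42.0: ["audio pop"],
    97.3: ["subtitle occlusion"],
}


def inspect(t: float) -> List[str]:
    return list(state[t])


def fix(t: float, issues: List[str]) -> None:
    if t == 42.0:
        state[t] = []


unresolved = self_review(sorted(state), inspect, fix)
print(unresolved)
```

Capping the retries keeps the loop from running forever on a cut that cannot be repaired automatically; anything left in `unresolved` would need a different strategy or a human look.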
What the Features Cover
According to the README, the features cover the tasks editors perform most often:
- Removing fillers and false starts: cutting "umm," "uh," repeated openings, and flubbed lines in a single pass.
- Color correction: warm film-style tones, neutral natural tones, and custom ffmpeg chains, with independent color grading per segment.
- Audio fade-in and fade-out: transitions are added automatically at each cut to prevent popping sounds.
- Subtitle embedding: by default, captions are split into capitalized two-word chunks, with full control over font, position, and style.
- Animation overlay: support for HyperFrames, Remotion, Manim, and PIL, with multiple animation engines able to run in parallel via subagents.
Note: These operations are not hardcoded into fixed code paths. Instead, the AI generates EDL (Edit Decision List) entries on the fly from natural-language instructions.
- Example: If you say, "Grade the second segment in a warm film style and remove every 'umm'," it generates the corresponding set of editing instructions.
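As an illustration of what such generated instructions could look like, the sketch below turns one segment's word timestamps into keep-ranges that skip filler words and carry a grade label. The actual EDL schema used by video-use is not shown in this article; the `Clip` type, `cut_fillers`, the file name `take_02.mp4`, and the `warm_film` label are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

FILLERS = {"umm", "um", "uh"}  # assumed filler vocabulary


@dataclass
class Clip:
    source: str
    start: float                 # seconds into the source file
    end: float
    grade: Optional[str] = None  # hypothetical color-grade label


def cut_fillers(source: str,
                words: List[Tuple[str, float, float]],
                seg_start: float, seg_end: float,
                grade: Optional[str] = None) -> List[Clip]:
    """Return clips covering the segment with every filler word removed."""
    clips: List[Clip] = []
    cursor = seg_start
    for text, start, end in words:
        if text.lower().strip(",.") in FILLERS:
            if start > cursor:
                clips.append(Clip(source, cursor, start, grade))
            cursor = end  # resume right after the filler
    if seg_end > cursor:
        clips.append(Clip(source, cursor, seg_end, grade))
    return clips


# Hypothetical second segment, as (word, start, end) tuples.
words = [
    ("So", 10.0, 10.2), ("umm", 10.3, 10.8), ("today", 10.9, 11.3),
    ("uh", 11.4, 11.7), ("we", 11.8, 12.0),
]
edl = cut_fillers("take_02.mp4", words, 10.0, 12.0, grade="warm_film")
for clip in edl:
    print(clip)
```

Because the output is plain data, a renderer can apply it deterministically, and the model only has to reason about text and timestamps.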
Skill, Not App
The product philosophy of video-use is worth noting: it is not an App but a Skill.
- Installation:

      # 1. Clone and link your Skill directory
      git clone https://github.com/browser-use/video-use ~/Developer/video-use
      ln -sfn ~/Developer/video-use ~/.config/claude/skills/video-use

      # 2. Install dependencies
      cd ~/Developer/video-use
      uv sync

      # 3. Configure ElevenLabs API Key
      cp .env.example .env
      # Edit .env to set ELEVENLABS_API_KEY

Once installed, you do not need to open any video editor. Simply place the media files in a folder and instruct the AI, "Give me a cut," and it begins working.
The reason for this design lies in what "Skill" implies: it attaches to the capabilities of an existing Agent rather than operating independently. Your Agent can already read files, run commands, and call APIs; video-use simply adds a set of video-editing skills on top.
A Larger Problem Than Video
The browser-use team has developed two projects with a consistent methodology:
- browser-use: Instead of having the AI look at webpage screenshots, it gives the AI a structured DOM tree.
- video-use: Instead of having the AI look at video frames, it gives the AI a transcript and a timeline.
The same methodology has now been validated twice: when designing tools, the laziest approach is to mimic how humans work, while the most effective is to understand what the tool's actual user (here, the model) is good at.
Humans rely on their eyes to watch videos, so editors are given timelines and preview windows. LLMs, however, are strong at processing text, reasoning at a high level, and following instructions accurately; they are not built for visual "search." This approach may extend beyond video. Any unstructured data that can be converted into a suitable structure can be processed efficiently by LLMs. Audio can be transcribed into text; 3D models can be converted into scene descriptions; code can be turned into ASTs...
Rather than trying to make the AI become human, make it do the thing AI does best.
Project Link: https://github.com/browser-use/video-use