AI Agent's Hands and Eyes Finally Aligned


Published on 2026-07-02 10:31:17

Tencent open-sources BrowserSkill, a local bridge that lets an Agent act inside the real browser you are already logged in to

Have you noticed a contradiction? The "brain" of an AI Agent can already read web pages, but its "hands" can never reach the browser you are actually using.

It can analyze page structure but cannot get into your intranet backend. It can generate operation commands but has to cold-start a brand-new Chromium every time, one that runs apart from your everyday browser, shares none of its login state, and even fights it for window focus.

In 2026, the job of "equipping an Agent with a browser" has split into two paths: let the Agent create a new browser of its own, or let the Agent borrow the one you already have. BrowserSkill, open-sourced by Tencent in June, takes the second path, and it is currently the only solution on this path that is both Agent-neutral and purely local.

Two Paths, One Watershed

Let's start with the big picture. This year, AI browser tools have converged into two camps. Before choosing one, be clear about which you want:

Path A: Borrow your browser—reuse your logged-in real browser, with login state, no cold start, human-machine coexistence. Suitable for intranet backends, SaaS operations, semi-automated collaboration.

Path B: Create a new browser—a brand new instance dedicated to the Agent, cleanly isolated, unattended, suitable for CI and batch scraping. But by default, it has no login state.

BrowserSkill stands on Path A, but compared with peers such as the official Claude in Chrome and OpenClaw Relay, it has two distinguishing traits: it is not bound to any specific Agent framework, and it does not connect to any external server. These may look like technical details, but they determine who can use it and who dares to use it.

Why "Agent-Neutral" Is Not a Small Matter

The official Claude in Chrome only works with an official Anthropic account login. Many developers in China run Claude Code with an auth token issued through a relay service, or connect it to third-party APIs, and they cannot use the official browser functionality directly.

To an Agent, BrowserSkill is just an ordinary shell command, bsk, no different from curl. It doesn't care which model, key, or authentication method you use. For Cursor, Claude Code, Codex, OpenClaw, CodeBuddy, WorkBuddy, or Hermes Agent, running bsk install-skill after installation detects the frameworks present on the machine and writes the skill into each one's skills directory in one step.

This means: You can change your Agent without changing tools, and change your API without modifying configuration. In the current era of rapid iteration for Agent frameworks, this neutrality is more valuable than the quantity of features.

Why "Purely Local" Is Not a Small Matter

Official Claude in Chrome reports the URLs you visit to its server for policy validation. That's fine for personal blogs, but if you let the Agent view intranet knowledge bases, Jira tickets, or company backends—URL reporting becomes data leakage.

BrowserSkill's entire chain runs on 127.0.0.1: Agent → bsk CLI → bsk Daemon (WebSocket local port 52800) → browser extension → CDP → Agent Window. No telemetry, no credential reporting, and install.sh is line-by-line auditable. Intranet-friendly is not just icing on the cake; it's the dividing line between usable and unusable.
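One way to check the local-only claim on your own machine is to look at which address the daemon's port is bound to. The sketch below assumes a Linux host where `ss -ltn` is available and takes port 52800 from the description above. The helper name check_bind is made up for this illustration and is not part of bsk.

```shell
# check_bind PORT: read `ss -ltn`-style lines on stdin and report how PORT is bound.
# Prints "loopback-only", "exposed", or "not-listening".
# ASSUMPTION: column 4 is "Local Address:Port", as in Linux `ss -ltn` output.
check_bind() {
  port="$1"
  awk -v port="$port" '
    $1 == "LISTEN" {
      n = split($4, parts, ":")
      if (parts[n] != port) next          # a different port, ignore it
      # Everything before the final ":PORT" is the bind address.
      addr = substr($4, 1, length($4) - length(parts[n]) - 1)
      found = 1
      if (addr != "127.0.0.1" && addr != "[::1]") exposed = 1
    }
    END {
      if (!found) print "not-listening"
      else if (exposed) print "exposed"
      else print "loopback-only"
    }'
}

# Usage (Linux): ss -ltn | check_bind 52800
```

A result of "exposed" would mean the port is reachable from other hosts, which would contradict the purely local design described here.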

Core Mechanism: One Window, Two Identities

BrowserSkill's key design is the Agent Window: inside your already open browser, it carves out a dedicated window with an orange border for the Agent to use. The window shares your login state (Cookie/Session) but does not touch the tabs you are currently using. If the Agent needs one of your tabs, it must borrow it explicitly with bsk tab borrow, and the tab is returned automatically after use.

This solves the biggest fear of Path A: The Agent and the human competing for the same browser. Playwright cold-starts a new instance each time and steals focus; BrowserSkill directly opens an isolated window inside your browser—no cold start, just your real session.

In terms of interaction, BrowserSkill makes snapshot the first choice and screenshots the last resort. snapshot organizes the page's interactive elements into a numbered accessibility tree (@e1, @e2, ...), and the Agent acts on those numbers directly, for example click @e12. Compared with feeding the entire DOM or a screenshot to the model, this uses fewer tokens, takes fewer steps, and is more deterministic.
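To make the numbered-reference idea concrete, here is a minimal sketch of picking a reference out of snapshot text by its label. The article does not show the actual snapshot layout, so this assumes one element per line with its @eN token on the same line as its label. The helper find_ref is hypothetical, not a bsk feature.

```shell
# find_ref LABEL: read snapshot text on stdin and print the first @eN
# reference found on a line that contains LABEL (fixed-string match).
# ASSUMPTION: one element per line, with its @eN token on the same line
# as its label. The real `bsk snapshot` layout may differ.
find_ref() {
  grep -F -- "$1" | grep -o '@e[0-9][0-9]*' | head -n 1
}

# Possible usage with the bsk commands named in this article:
#   ref=$(bsk snapshot --session "$sid" | find_ref "Sign in")
#   [ -n "$ref" ] && bsk click --session "$sid" "$ref"
```

The point of the design is visible even in this toy form: the Agent only has to carry a short token such as @e12 between steps, not a selector or a pixel coordinate.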

Interestingly, this is not just BrowserSkill's own choice. Vercel's agent-browser uses Refs, and BrowserAct uses a state-numbered tree—three different projects have independently arrived at the same design. Snapshot + Numbered References is becoming the industry consensus for Agent browser tools.

Three Steps to Install

# 1. Install CLI (macOS/Linux)
curl -fsSL https://raw.githubusercontent.com/Tencent/BrowserSkill/main/install.sh | sh
# Windows: irm https://raw.githubusercontent.com/Tencent/BrowserSkill/main/install.ps1 | iex

# 2. Install browser extension (Search BrowserSkill on Chrome Web Store)
# ⚠️ Install it into the browser you normally use and have logged into target sites

# 3. Configure Agent
bsk install-skill    # Press space to select framework, then Enter to auto-configure

After installation, run a set of test commands:

bsk browsers                            # View connected browsers, get instance id
bsk session start --browser <id>        # Start a session
bsk navigate --session <sid> https://example.com
bsk snapshot --session <sid>            # Output numbered accessibility tree
bsk click   --session <sid> @e12        # Click by number
bsk fill    --session <sid> @e8 "hello" # Fill input box
bsk request-help --session <sid>        # On CAPTCHA/login, pause and hand back to you
bsk session stop <sid>                  # Must close after use
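
Since a session must be closed after use, it can help to wrap the start and stop steps so the stop always runs, even when a step in between fails. This is a sketch under one loud assumption: that `bsk session start` prints only the session id on stdout, which the article does not confirm. The names with_session and open_and_snapshot are illustrative.

```shell
# with_session BROWSER_ID CMD...: start a session, run CMD with the session
# id appended as its last argument, then always stop the session.
# ASSUMPTION: `bsk session start` prints only the session id on stdout.
with_session() {
  browser_id="$1"; shift
  sid=$(bsk session start --browser "$browser_id") || return 1
  status=0
  "$@" "$sid" || status=$?
  bsk session stop "$sid"   # runs whether CMD succeeded or failed
  return "$status"
}

# Example task: open a page, then print its numbered snapshot.
open_and_snapshot() {
  bsk navigate --session "$1" https://example.com &&
  bsk snapshot --session "$1"
}

# Usage: with_session <id> open_and_snapshot
```

The wrapper returns the task's exit status, so a failed step is still visible to the caller after the session has been closed.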

The most noticeable difference in testing: on a site that requires login, an ordinary isolated browser gets redirected to the login page, while BrowserSkill opens the page in your already logged-in browser and lands directly on the post-login view. Login state sharing does work.

Know the Boundaries: It Is Not Omnipotent

BrowserSkill's CLI + Daemon is written in Rust, lightweight and efficient, but there are several hard boundaries you must know:

1. Your browser must be open. This is a design premise rather than a flaw, since the tool exists to use your logged-in browser. The real shortcoming is an unstable connection: after the service worker goes idle or the browser restarts, the instance id changes and the connection drops. For long unattended tasks it is less reliable than Playwright.

2. It cannot read the console or network traffic. There are no commands for console or network capture, so finding frontend bugs by reading errors means using evaluate as a workaround. This is a hard limitation next to the official Claude in Chrome, which is good at reading the console for debugging. Some developers have already forked the project to add this (intercepting the CDP Log/Network/Runtime domains), and issues have been raised upstream.

3. No GIF or session recording. Screenshots are single PNG images only, with no screen recording. To demonstrate a workflow, you need an external screen recorder.

4. Extension permissions are relatively broad. It requires debugger + <all_urls>, so it is technically able to read all content and Cookies of any site. The "do not extract credentials" statement in SKILL.md is only a prompt-level constraint, not technically enforced. Compared with OpenClaw's manifest, which only needs debugger + localhost, it is less restrained.

5. Page content enters the Agent context, which means it reaches your LLM provider. Pages containing sensitive information go out without any manual desensitization step. Every "real browser + LLM" solution has this issue, not just BrowserSkill, but you must be aware of it.
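
For the first boundary, a script that drives bsk can at least tolerate a dropped connection by retrying. The helper below is a generic sketch, not something bsk provides; the name retry and its arguments are chosen for this example. Because the instance id can change after a restart, the retried unit should be a function that starts again from bsk browsers, not a single command holding a stale id.

```shell
# retry MAX DELAY CMD...: run CMD until it succeeds or MAX attempts are used.
# DELAY is the pause in seconds between attempts.
retry() {
  max="$1"; delay="$2"; shift 2
  attempt=1
  while :; do
    "$@" && return 0                      # success, stop retrying
    [ "$attempt" -ge "$max" ] && return 1 # out of attempts
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Usage: retry 3 2 my_task
# where my_task is your own function that looks up the browser id,
# starts a session, does its work, and stops the session.
```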

Tencent's Agent Ecosystem Puzzle

BrowserSkill is not Tencent's only move in the Agent browser field. Zoom out a bit:

SkillHub: Tencent's AI Skills community, based on the OpenClaw ecosystem, aggregates 13,000 skills, optimized for Chinese users (domestic mirror acceleration, Chinese search, security scanning)
QBotClaw: AI agent built into QQ Browser, the first browser-native AI assistant in China, capable of reading login states, bookmarks, and history
QQBrowserSkill: Web Skill for QQ Browser, allowing OpenClaw to directly access real websites to complete operations

BrowserSkill is the open-source infrastructure layer in this puzzle—not tied to Tencent's own products, MIT license for commercial use and forking, open to all Agent frameworks. SkillHub and QBotClaw handle ecosystem and distribution, while BrowserSkill handles the underlying bridge. This division of labor is clear.

One-Sentence Selection Guide

Need login state, coexistence with humans, no cold start → BrowserSkill. Need CI, unattended, console reading, clean isolation → Playwright.

They are not replacements for each other, but complementary. If you are already having AI do web-related work and have been tormented by "login states" and "window grabbing," BrowserSkill is worth trying. But don't treat it as fully automatic magic—it requires your browser to be open, cannot read the console, and its permission boundaries rely on self-discipline. Understand these three points before deciding where to use it.

GitHub: https://github.com/Tencent/BrowserSkill
