*(Interactive leaderboard tables omitted: three sortable views ranking models by Overall Score, Speed, and Cost Efficiency, each also showing Elapsed time, Cost, and score Δ.)*
AI agents are impressive in demos. They book flights in one turn, summarize documents on command, and generate code that almost works. But ask an agent to watch a full football match and write a report with clipped video highlights — or negotiate a meeting time over multiple rounds of email with three busy colleagues — and things fall apart fast.
We built WildClawBench because we wanted to know: how well do today's best models actually perform when dropped into a real working environment with real tools, real files, and real complexity?
The answer: not well enough. Every frontier model we tested — GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, Grok 4.20, Kimi K2.5, Qwen 3.5 — scores below 0.55 out of 1.0. Most hover between 0.15 and 0.45.
The tasks aren't exotic. They're the kind of work a competent human assistant handles every day. That gap is what makes WildClawBench useful.
WildClawBench makes one thing clear: each frontier model has its own distinct strengths. Claude Opus 4.6 takes the overall lead. Its edge is most apparent in complex multi-step workflows, especially coding tasks that demand reliable tool use and genuine codebase understanding rather than merely plausible output. That top-tier performance, however, comes at the highest cost.
At a quarter of that price, GPT-5.4 follows closely across nearly every metric and shines in Creative Synthesis. MiMo V2 Pro does not top any leaderboard, but its solid performance shows that newer model families are quickly becoming serious contenders in practical agent environments. For cost-effectiveness, MiniMax M2.7 stands out as the cheapest usable option, making it practical for broad deployment.
"Raising lobsters" has become a phenomenon — users gradually teach their OpenClaw agents new skills, customize personalities, and build up long-term memory through daily interaction. A natural question follows: whose lobster is better?
Beyond bragging rights, there is real value in understanding which skill combinations, persona designs, and memory strategies actually improve agent performance on a given model. That's why we created the Personal OpenClaw Leaderboard.
Unlike benchmarks that test agents against mock APIs with canned responses, WildClawBench runs every task inside a real OpenClaw instance: the same open-source personal AI assistant that thousands of real users rely on daily. Agents get access to a real bash shell, a real file system, a real browser, and real email and calendar services. When a web search returns unexpected results, or a Python package throws an undocumented error, the agent has to deal with it, just like a real user would.
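To make the contrast with mock-API benchmarks concrete, here is a minimal sketch of what a real-shell tool wrapper can look like. The function name and return shape are our own illustration, not part of WildClawBench or OpenClaw; the point is that the agent receives the actual exit code and stderr from the system, not a scripted reply.

```python
import subprocess

def run_tool(command: str, timeout: int = 30) -> dict:
    """Run a shell command the way an agent-facing bash tool might,
    surfacing real success and real failure so the agent can react."""
    try:
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout
        )
        return {
            "ok": result.returncode == 0,
            "stdout": result.stdout,
            "stderr": result.stderr,
            "returncode": result.returncode,
        }
    except subprocess.TimeoutExpired:
        # A hung command is also a real outcome the agent must handle.
        return {"ok": False, "stdout": "", "stderr": "timeout", "returncode": None}

# The agent sees whatever the environment actually did:
print(run_tool("echo hello")["stdout"].strip())
print(run_tool("no_such_command_xyz")["ok"])
```

With a mocked API, the second call would return a canned error string; here it fails for the same reason it would fail on a user's machine, and the agent has to recover on its own.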
Every task in WildClawBench was designed from scratch by our team. We didn't adapt tasks from existing benchmarks or auto-generate them from templates. Each one represents a real workflow that we've personally encountered or wanted an AI assistant to handle. They span six categories:
WildClawBench doesn't just test whether an agent can follow instructions. It probes three orthogonal capabilities:
WildClawBench points toward the future trajectory of real-world agents, but it also highlights how much uncharted territory remains. There is a massive opportunity for the community to push the boundaries in several key directions:
WildClawBench builds on the excellent open-source agent ecosystem: