SAIL WILD. BENCHMARK REAL.
WildClawBench
A harder, wilder benchmark for autonomous AI agents — testing real-world task completion across 10 frontier models with adversarial difficulty.
10 Models
60 Tasks
51.1% Top Score
[Interactive leaderboard: Model · Overall Score · Elapsed · Cost — sortable by overall score, speed (shorter = faster), and cost efficiency (longer = cheaper)]
Gemini 3.1 Pro was evaluated in low-effort mode; scores may not reflect peak capability. · All scores are averages across multiple independent runs.

Scatter Analysis

[Scatter: Cost vs. Overall Score — average cost per task (X) vs. average score (Y)]
[Scatter: Speed vs. Overall Score — elapsed time (X) vs. average score (Y)]

Series Rankings

Overall Rankings
[Leaderboard table: Model · Overall Score · Elapsed · Cost]

3D Explorer

[Interactive 3D model comparison — drag · zoom · click to select; providers: Anthropic, OpenAI, Google, Xiaomi / MiMo, Others]
Category Breakdown
Per-category overall scores across all 6 task domains. Switch categories using the tabs below.
6 Categories · 60 Total Tasks · 10 Models
Task Browser
Explore prompts, grading criteria, and execution traces.
Research · March 2026
When AI Agents Meet
the Real World
WildClawBench Team  ·  Mar2Ding  ·  8 min read

The Gap Between Demos and Reality

AI agents are impressive in demos. They book flights in one turn, summarize documents on command, and generate code that almost works. But ask an agent to watch a full football match and write a report with clipped video highlights — or negotiate a meeting time over multiple rounds of email with three busy colleagues — and things fall apart fast.

We built WildClawBench because we wanted to know: how well do today's best models actually perform when dropped into a real working environment with real tools, real files, and real complexity?

The answer: not well enough. Every frontier model we tested — GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, Grok 4.20, Kimi K2.5, Qwen 3.5 — scores below 0.55 out of 1.0. Most hover between 0.15 and 0.45.

The tasks aren't exotic. They're the kind of work a competent human assistant handles every day. That gap is what makes WildClawBench useful.

10 Frontier Models · 60 Hand-crafted Tasks · 51.1% Top Score · 6 Task Categories
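Every headline number here is an average across multiple independent runs per task. That aggregation can be sketched in a few lines — a minimal illustration with toy numbers, not the benchmark's actual scoring code:

```python
from statistics import mean

def overall_score(results: dict[str, list[float]]) -> float:
    """Average each task's per-run scores, then average across tasks."""
    per_task = [mean(runs) for runs in results.values()]
    return mean(per_task)

# Toy example: two tasks, three independent runs each (scores in [0, 1]).
results = {
    "task_01": [0.6, 0.5, 0.7],
    "task_02": [0.3, 0.4, 0.2],
}
print(round(overall_score(results), 3))  # → 0.45
```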

How Frontier Models Perform in the Wild

WildClawBench makes one thing clear: each frontier model has its own distinct strengths. Claude Opus 4.6 takes the overall lead. Its edge is most apparent in complex multi-step workflows, especially coding tasks that demand reliable tool use and codebase understanding rather than merely plausible output. This top-tier performance, however, comes at the highest cost.

At a quarter of that price, GPT-5.4 follows closely behind on nearly every metric and is particularly strong in Creative Synthesis. MiMo V2 Pro does not top the leaderboards, but it remains a noteworthy competitor; its solid performance shows that newer model families are rapidly becoming serious contenders in practical agent environments. On cost effectiveness, MiniMax M2.7 stands out as the cheapest usable option, making it practical for broad deployment.

Best Overall
Claude Opus 4.6
51.1% · Excels at complex multi-step coding workflows
Best Value
MiniMax M2.7
33.0% · $7.47 total — cheapest usable agent
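"Best value" can be made concrete as score per dollar. A sketch of that ranking — the 51.1%, 33.0%, and $7.47 figures come from the leaderboard above, but Claude's total cost below is a made-up placeholder:

```python
def rank_by_value(models: dict[str, tuple[float, float]]) -> list[tuple[str, float]]:
    """Rank models by overall score per dollar of total evaluation cost.
    models maps name -> (overall_score, total_cost_usd)."""
    return sorted(
        ((name, score / cost) for name, (score, cost) in models.items()),
        key=lambda kv: kv[1],
        reverse=True,
    )

models = {
    "Claude Opus 4.6": (0.511, 30.00),  # cost is a hypothetical placeholder
    "MiniMax M2.7":    (0.330, 7.47),   # $7.47 total, from the leaderboard
}
for name, value in rank_by_value(models):
    print(f"{name}: {value:.3f} score points per dollar")
```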

Personal OpenClaw Evaluation

"Raising lobsters" has become a phenomenon — users gradually teach their OpenClaw agents new skills, customize personalities, and build up long-term memory through daily interaction. A natural question follows: whose lobster is better?

Beyond bragging rights, there is real value in understanding which skill combinations, persona designs, and memory strategies actually improve agent performance on a given model. That's why we created the Personal OpenClaw Leaderboard.

Submit Your Lobster
Send your lobster's results to [email protected] and see how it stacks up! Submission details can be found in our repo.

What Makes WildClawBench Different

Real Environment, Not Simulations

Unlike benchmarks that test agents against mock APIs with canned responses, WildClawBench runs every task inside a real OpenClaw instance — the same open-source personal AI assistant that thousands of real users rely on daily. Agents get access to a real bash shell, a real file system, a real browser, real email and calendar services. When a web search returns unexpected results, or a Python package throws an undocumented error, the agent has to deal with it — just like a real user would.

Why this matters
Agents trained on sanitized API calls often choke on the messy, ambiguous, failure-prone reality of actual computing environments. WildClawBench exposes that gap.
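A minimal sketch of what "real shell, real errors" means for a harness (illustrative only, not OpenClaw's actual tool API): the command genuinely executes, and any failure — a non-zero exit code, stderr output, a timeout — comes back to the agent as an observation it must reason about, not a canned response:

```python
import subprocess

def run_bash(command: str, timeout_s: float = 30.0) -> dict:
    """Execute a shell command and report the real outcome to the agent."""
    try:
        proc = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout_s
        )
        return {"exit_code": proc.returncode,
                "stdout": proc.stdout,
                "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"exit_code": None, "stdout": "",
                "stderr": f"timed out after {timeout_s}s"}

# A failing command comes back as data, not as a masked success:
observation = run_bash("exit 3")   # observation["exit_code"] == 3
```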

60 Original Tasks, Crafted by Hand

Every task in WildClawBench was designed from scratch by our team. We didn't adapt tasks from existing benchmarks or auto-generate them from templates. Each one represents a real workflow that we've personally encountered or wanted an AI assistant to handle. They span six categories:

  • 📋 Productivity Flow — 10 tasks · paper classification, scheduling, crawling
  • 💻 Code Intelligence — 12 tasks · where agents should shine but often don't
  • 💬 Social Interaction — 6 tasks · multi-turn with simulated collaborators
  • 🔍 Search & Retrieval — 11 tasks · conflicting info, fuzzy matching
  • Creative Synthesis — 11 tasks · video editing, dubbing, cross-modal
  • 🛡️ Safety Alignment — 10 tasks · adversarial prompts, credential leaks

Dimensions of Difficulty

WildClawBench doesn't just test whether an agent can follow instructions. It probes three orthogonal capabilities:

Multimodal Reasoning
Can the agent watch a 45-minute football match and identify every goal with accurate timestamps? Read a PDF and produce a conference poster? Extract speech from video, translate it, synthesize audio, and produce a dubbed result?

Long-horizon Planning
Can the agent manage a 20-minute workflow with 60+ tool calls — reading dozens of papers, classifying them, extracting metadata, and producing a structured digest? Can it coordinate a meeting across three participants by sending emails, checking calendars, resolving conflicts, and booking the slot?

Code Generation & Debugging
Can the agent read an undocumented SAM3 codebase — no README, no examples — understand the architecture from raw source, and write a working inference script? Can it solve visual puzzles by generating pixel-accurate programs? Reproduce benchmark results from a VLMEvalKit configuration?

Looking Forward

WildClawBench sketches a trajectory for real-world autonomous agents, but it also highlights how much uncharted territory remains. There is a massive opportunity for the community to push the boundaries in several key directions:

  • Self-Evolving Capacity — Can an agent learn from its own attempts? When presented with the same task multiple times, a truly autonomous agent should demonstrate iterative improvement, achieving better results and faster execution rather than starting from scratch every time.
  • Long-Horizon Tasks & Context Management — As workflows stretch into hours or thousands of tool calls, context degradation becomes a fatal bottleneck. We need the community to explore better harnesses, memory architectures, and context scaling techniques to help models maximize their reasoning capabilities over extended periods.
  • Multimodal Capabilities — Currently, most flagship models deployed as agents are strictly pure language models. This inherently restricts their potential when navigating visual interfaces, analyzing charts, or understanding complex spatial layouts. Integrating native vision and robust multimodal grounding is the necessary next leap for agentic AI.
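The first of these directions suggests a simple metric: compare an agent's last attempt at a task with its first. A sketch, assuming each attempt is recorded as a (score, elapsed_seconds) pair — a hypothetical layout, not something the benchmark measures today:

```python
def improvement(attempts: list[tuple[float, float]]) -> tuple[float, float]:
    """Compare the last attempt to the first. A positive score delta and a
    negative time delta both indicate the agent learned from earlier tries."""
    (s0, t0), (sn, tn) = attempts[0], attempts[-1]
    return sn - s0, tn - t0

# Three attempts at the same task: score rises, elapsed time falls.
delta_score, delta_time = improvement([(0.2, 310.0), (0.4, 260.0), (0.6, 190.0)])
```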

Acknowledgements

WildClawBench builds on the excellent open-source agent ecosystem:

  • OpenClaw — the personal AI assistant runtime powering our evaluation environment
  • Claw-Eval — transparent benchmark for real-world agents that inspired our methodology
  • PinchBench — real-world benchmarks for AI coding agents that informed our task design philosophy

↗ GitHub Repository ↗ Dataset on HuggingFace