Arena.ai unveils 'Agent Arena'; GPT-5.5 tops real-task leaderboard

June 4, 2026 at 12:07 EDT

Arena.ai (formerly LMSYS Chatbot Arena / LMArena) on June 4, 2026 announced 'Agent Arena,' a new platform that scales real-world agentic evaluation. Across millions of live user sessions, it measures how models complete actual complex tasks using tools such as web search, filesystem, and terminal.

June 4, 2026 · Arena.ai

Agent Arena: scoring AI on the work it actually does

A new platform retires the static Q&A leaderboard, ranking frontier models on real-world agentic tasks — web search, files, terminal — across millions of live user sessions with continuous updates.

300K+ real tasks recorded at launch

2M+ tool calls executed by agents

40M lines of code generated

50+ turns in long-horizon workflows

Launch leaderboard — top 5 agents

Ranked by task success, steerability, error recovery & more

1 · GPT-5.5 (High)

2 · Claude Opus 4.7

3 · GLM-5.1

4 · Gemini 3.1 Pro

5 · Kimi-K2.6

Bars indicate ranking order, not exact score gaps.

7 days of real tool use

How agents actually reached for their tools

bash · 936K

write_file · 550K

web_search · 276K

Generated output in the same window: 8.5M lines of .py & 7.8M lines of .md.
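
To put the three tool counts above in proportion, here is a minimal sketch that converts them into relative shares. It assumes the figures are call counts (with K meaning thousand) and that shares are taken only among the three tools listed; the article does not say whether other tools were used in the same window. The names `tool_calls` and `shares` are illustrative, not part of any Arena.ai API.

```python
# Counts as reported for the 7-day window (K = thousand).
# Assumption: these are call counts, and only these three tools are considered.
tool_calls = {
    "bash": 936_000,
    "write_file": 550_000,
    "web_search": 276_000,
}


def shares(counts):
    """Return each entry's fraction of the combined total."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


tool_shares = shares(tool_calls)

# Print each tool's share of the listed calls as a percentage.
for name, share in tool_shares.items():
    print(f"{name}: {share:.1%}")
```

On these numbers, bash accounts for a little over half of the calls across the three listed tools, with write_file second and web_search third.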

Praised as

"The most important real-world eval," and more reliable than static benchmarks. Captures full-stack app building, financial modeling, data analysis, and multi-day tasks spanning hundreds of turns.

The pushback

Weak on edge cases and infrequent tasks: "live sessions alone are not enough." How the harness and execution environment affect results still needs deeper analysis.