Cerebras Brings Gemma 4 31B Online at Over 1,500 Tokens Per Second

June 26, 2026 at 13:01 EDT

Cerebras Systems has begun serving Google DeepMind's open-weight Gemma 4 31B model on its inference platform at over 1,500 tokens per second, making it the first multimodal model the company has offered.

Cerebras Inference · Gemma 4 31B

The Fastest Inference Just Went Multimodal

Cerebras is now serving Google DeepMind's open-weight Gemma 4 31B at over 1,500 tokens/sec. It is the platform's first image-capable model, and Cerebras claims up to 10× the speed of GPUs at near parity with a leading proprietary model on intelligence.

1,500+ tokens/sec output speed
256K-token context window
140+ languages supported
~30.7B dense parameters, vision-capable

Output speed: near-equal intelligence, 15× the throughput

Tokens per second:

Gemma 4 31B on Cerebras: 1,500+
Claude Haiku: ~100
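
The 15× figure follows directly from the two rates in the chart. The short sketch below works through that arithmetic and what it means for wall-clock time. The two rates are the approximate values reported above; the 3,000-token response length is a hypothetical chosen only for illustration, and the calculation ignores time to first token and network latency.

```python
# Approximate output speeds taken from the chart (tokens per second).
CEREBRAS_TPS = 1500  # Gemma 4 31B on Cerebras, reported as "1,500+"
HAIKU_TPS = 100      # Claude Haiku, reported as "~100 TPS"

# Hypothetical response length, not a figure from the article.
RESPONSE_TOKENS = 3000


def speedup(fast_tps: float, slow_tps: float) -> float:
    """Ratio of two output speeds."""
    return fast_tps / slow_tps


def generation_seconds(num_tokens: int, tps: float) -> float:
    """Seconds needed to emit num_tokens at a steady output rate."""
    return num_tokens / tps


print(f"Speedup: {speedup(CEREBRAS_TPS, HAIKU_TPS):.0f}x")
print(f"Cerebras: {generation_seconds(RESPONSE_TOKENS, CEREBRAS_TPS):.0f} s")
print(f"Haiku:    {generation_seconds(RESPONSE_TOKENS, HAIKU_TPS):.0f} s")
```

Under these assumptions, a response that takes about 30 seconds at ~100 tokens/sec finishes in about 2 seconds at 1,500 tokens/sec.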

Intelligence index: near parity

Gemma 4 31B: 29
Claude Haiku: 30

Benchmark scores

MMLU Pro: 85.2%
AIME 2026: 89.2%
MMMU Pro (vision): 76.9%

What developers are praising

Generated a game in 12 seconds, fixed bugs in ~1 minute
Comparable output quality at far higher speed
Agent task completion time cut by 60–70%
Real-time UI understanding, docs & agentic loops

Caveats raised

Currently early access only; not yet listed in some chat UIs or docs
General availability expected within the month
On cost and throughput, models small enough to fit in on-chip SRAM hold the edge over larger models

Open weight · Apache 2.0

Open weights plus lower pricing reshape the practicality of agentic and multimodal iteration loops.

$5,000 prize pool · 24-hour Gemma 4 hackathon
