Cerebras Systems has begun serving Google DeepMind's open-weight model Gemma 4 31B on its inference platform at over 1,500 tokens per second, making it the first multimodal model the company has offered.
Cerebras Inference · Gemma 4 31B
The Fastest Inference Just Went Multimodal
Cerebras is now serving Google DeepMind's open-weight Gemma 4 31B at over 1,500 tokens/sec — its first image-capable model, claiming up to 10× the speed of GPUs while matching a leading proprietary model on intelligence.
1,500+ tokens/sec output speed
256K context window (tokens)
~30.7B dense parameters, vision-capable
Output speed — same intelligence, 15× the throughput
Tokens per second, drawn to scale.
Gemma 4 31B on Cerebras: 1,500+
Claude Haiku: ~100
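The 15× figure follows directly from the two chart values, and the same arithmetic shows what the gap means for a single response. The sketch below is illustrative only: it treats "1,500+" as a lower bound and "~100" as exact, ignores time to first token and network latency, and uses a hypothetical 3,000-token response that does not come from the article.

```python
def speedup(fast_tps: float, slow_tps: float) -> float:
    """Ratio of two output speeds, both in tokens per second."""
    return fast_tps / slow_tps


def generation_seconds(num_tokens: int, tps: float) -> float:
    """Seconds to stream num_tokens at a steady rate, ignoring time to first token."""
    return num_tokens / tps


CEREBRAS_TPS = 1500  # "1,500+" in the chart, taken here as a lower bound
HAIKU_TPS = 100      # "~100" in the chart, approximate

RESPONSE_TOKENS = 3000  # hypothetical response length, not from the article

print(speedup(CEREBRAS_TPS, HAIKU_TPS))                   # 15.0
print(generation_seconds(RESPONSE_TOKENS, CEREBRAS_TPS))  # 2.0
print(generation_seconds(RESPONSE_TOKENS, HAIKU_TPS))     # 30.0
```

Under these assumptions, a response that streams for half a minute at ~100 tokens/sec finishes in about two seconds at 1,500 tokens/sec. Real-world latency will differ once queueing, prompt processing and network overhead are included.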
Intelligence index — near parity
Gemma 4 31B: 29
Claude Haiku: 30
Benchmark scores
MMLU Pro 85.2%
AIME 2026 89.2%
MMMU Pro (vision) 76.9%
What developers are praising
Generated a game in 12 seconds, fixed bugs in ~1 minute
Comparable output quality at far higher speed
Agent tasks cut by 60–70%
Real-time UI understanding, docs & agentic loops
Caveats raised
Currently early access only — not yet in some chat UIs or docs
General availability expected within the month
On cost and throughput, models small enough to fit in SRAM hold the edge over larger models
Open weight · Apache 2.0
Open weights plus lower pricing reshape the practicality of agentic and multimodal iteration loops.
$5,000 prize pool · 24-hour Gemma 4 hackathon