BREAKING
NVIDIA cuts DeepSeek V4 token cost to ~1/5
Headline multipliers (counter values not captured): perf after 1 month, full-stack throughput, power efficiency vs H200
How the inference stack co-designs gains
1. CUDA-native frameworks
2. NVFP4 4-bit precision
3. Dynamo prefill/decode split
4. Day-0 optimized
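To make the 4-bit precision step concrete, here is a minimal sketch of block-scaled 4-bit quantization in plain Python. It is an illustration only, not NVIDIA's implementation. The E2M1 value grid, the block size of 16, and the function names are assumptions of this sketch, and the per-block scale is kept as an ordinary Python float rather than a low-precision format.

```python
# Assumption: magnitudes representable by a 4-bit float with 1 sign bit,
# 2 exponent bits and 1 mantissa bit (E2M1).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumption: values per shared scale factor


def quantize_block(block):
    """Return (scale, codes); codes are signed values taken from E2M1_GRID."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = amax / E2M1_GRID[-1]
    codes = []
    for v in block:
        target = abs(v) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(-nearest if v < 0 else nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from a scale and its 4-bit codes."""
    return [scale * c for c in codes]


def quantize(values):
    """Split a flat list into blocks and quantize each block independently."""
    return [
        quantize_block(values[i:i + BLOCK_SIZE])
        for i in range(0, len(values), BLOCK_SIZE)
    ]


def dequantize(blocks):
    """Inverse of quantize: concatenate the reconstructed blocks."""
    out = []
    for scale, codes in blocks:
        out.extend(dequantize_block(scale, codes))
    return out
```

Because each block carries its own scale, one outlier only coarsens the 16 values it shares a block with instead of the whole tensor. In this sketch the worst-case error per value is the block's largest magnitude divided by 6, since the widest gap in the grid is between 4 and 6.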
Model efficiency stats (counter values not captured): lower per-token FLOPs (%), less KV cache memory (%), max context tokens (M)
SGLang on GB300: ~2,200 tokens/s before, ~11,200 tokens/s after
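The SGLang figures line up with the "about a fifth" headline, as a quick check shows. This sketch assumes that the hourly cost of the hardware stays fixed, so cost per token scales as the inverse of throughput. The function names are illustrative and the inputs are the approximate values quoted in the graphic.

```python
# Approximate throughput figures quoted in the graphic (tokens per second).
BEFORE_TPS = 2_200
AFTER_TPS = 11_200


def speedup(before_tps: float, after_tps: float) -> float:
    """Throughput multiplier between the two measurements."""
    return after_tps / before_tps


def relative_cost_per_token(before_tps: float, after_tps: float) -> float:
    """Cost per token afterwards as a fraction of the cost before.

    Assumption: the same hardware at the same hourly price, so cost per
    token is proportional to 1 / throughput.
    """
    return before_tps / after_tps


if __name__ == "__main__":
    print(f"speedup: {speedup(BEFORE_TPS, AFTER_TPS):.2f}x")
    print(f"relative cost per token: "
          f"{relative_cost_per_token(BEFORE_TPS, AFTER_TPS):.3f}")
```

Under that assumption the throughput gain is roughly 5.1x, and the cost per token falls to roughly 0.196 of its earlier level, which is close to one fifth.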
Inference cost race heats up
AI NEWS BLITZ
In just one month, NVIDIA slashed DeepSeek V4 token cost to about a fifth on Blackwell.