BREAKING
NVIDIA cuts DeepSeek V4 token cost to ~1/5
0x
perf after 1 month
0x
full-stack throughput
0x
power eff vs H200
How the inference stack co-designs gains
1CUDA-native frameworks
2NVFP4 4-bit precision
3Dynamo prefill/decode split
4Day-0 optimized
0%
lower per-token FLOPs
0%
less KV cache memory
0M
max context tokens
SGLang GB300: ~2,200 to ~11,200 t/s
Before2200
After11200
Inference cost race heats up
AI NEWS BLITZ
In just one month, NVIDIA slashed DeepSeek V4 token cost to about a fifth on Blackwell.