AI News Blitz

BREAKING

NVIDIA cuts DeepSeek V4 token cost to ~1/5

perf after 1 month

full-stack throughput

power eff vs H200

How the inference stack co-designs gains

1CUDA-native frameworks

↓

2NVFP4 4-bit precision

↓

3Dynamo prefill/decode split

↓

4Day-0 optimized

lower per-token FLOPs

less KV cache memory

max context tokens

SGLang GB300: ~2,200 to ~11,200 t/s

Before2200

After11200

Inference cost race heats up

AI NEWS BLITZ

In just one month, NVIDIA slashed DeepSeek V4 token cost to about a fifth on Blackwell.