BREAKING
NVIDIA Touts DFlash: Up to 15x Faster Inference
How DFlash Speculative Decoding Works
1. Block diffusion draft
2. Parallel single pass
3. Batch verify
4. Accept multiple tokens
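The four steps above can be sketched as a generic draft-and-verify loop. This is a minimal toy, not DFlash's implementation: `target_next`, `draft_block`, and `verify` are hypothetical stand-ins for the target model, the block drafter, and the batched verification pass, and the sketch assumes greedy decoding so that the output matches what the target model alone would produce.

```python
from typing import List, Tuple

Token = int


def target_next(context: List[Token]) -> Token:
    """Hypothetical stand-in for the large target model (greedy next token)."""
    return (sum(context) * 31 + len(context)) % 50


def draft_block(context: List[Token], block_size: int) -> List[Token]:
    """Hypothetical stand-in for the drafter: returns a whole block of guesses.

    A real block drafter would emit the block in one parallel pass; this toy
    just imitates a drafter that is usually right and deliberately wrong at
    every position where the running length is 4 mod 5.
    """
    proposal: List[Token] = []
    ctx = list(context)
    for _ in range(block_size):
        tok = target_next(ctx)
        if len(ctx) % 5 == 4:
            tok = (tok + 1) % 50  # deliberate miss
        proposal.append(tok)
        ctx.append(tok)
    return proposal


def verify(context: List[Token], proposal: List[Token]) -> List[Token]:
    """Target predictions for every prefix of the proposal.

    In a real system this is one batched forward pass over the block; here it
    is a plain loop. Returns len(proposal) + 1 tokens.
    """
    return [target_next(context + proposal[:i]) for i in range(len(proposal) + 1)]


def speculative_generate(
    prompt: List[Token], num_tokens: int, block_size: int
) -> Tuple[List[Token], int]:
    """Generate num_tokens tokens; also return the number of verify passes."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < num_tokens:
        proposal = draft_block(out, block_size)   # steps 1-2: draft a block
        targets = verify(out, proposal)           # step 3: batch verify
        n_accept = 0
        while n_accept < len(proposal) and proposal[n_accept] == targets[n_accept]:
            n_accept += 1
        out.extend(proposal[:n_accept])           # step 4: accept matching prefix
        out.append(targets[n_accept])             # target's own token at the first miss
        passes += 1
    return out[: len(prompt) + num_tokens], passes


def greedy_generate(prompt: List[Token], num_tokens: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(target_next(out))
    return out


if __name__ == "__main__":
    tokens, passes = speculative_generate([1, 2, 3], num_tokens=24, block_size=8)
    assert tokens == greedy_generate([1, 2, 3], 24)
    print(f"generated 24 tokens in {passes} verify passes")
```

Because every accepted token equals the target's own greedy choice, the result is identical to plain decoding; the saving comes from needing fewer target passes than generated tokens.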
Throughput Gains by Configuration
gpt-oss-120b: 15x
Gemma 4 31B: 5.8x
Qwen3-8B: 5.1x
Max block size (tokens)
Acceptance rate (%)
Speedup over EAGLE-3 (x)
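How block size and acceptance rate interact can be illustrated with the standard speculative-decoding estimate of tokens produced per target pass. This is a rough sketch under a simplifying assumption that each drafted token is accepted independently with the same probability; `expected_tokens_per_pass` is a hypothetical helper, and the inputs in the example are illustrative values, not DFlash measurements.

```python
def expected_tokens_per_pass(acceptance_rate: float, block_size: int) -> float:
    """Expected tokens emitted per target pass.

    Assumes independent per-token acceptance with probability
    `acceptance_rate`, and that the target contributes one token of its own
    after the accepted prefix. Under that assumption the value is the
    geometric sum 1 + a + a^2 + ... + a^k for block size k.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if block_size < 0:
        raise ValueError("block_size must be non-negative")
    if acceptance_rate == 1.0:
        return float(block_size + 1)
    return (1.0 - acceptance_rate ** (block_size + 1)) / (1.0 - acceptance_rate)


if __name__ == "__main__":
    # Illustrative inputs only.
    for rate in (0.5, 0.8, 0.9):
        for block in (4, 8, 16):
            est = expected_tokens_per_pass(rate, block)
            print(f"acceptance={rate:.1f} block={block:2d} -> {est:.2f} tokens/pass")
```

The estimate is an upper bound on speedup, since it ignores the cost of running the drafter and any overhead in the verification pass.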
Strengths and Known Limits
Strengths
Strong gains on math and code
Drop-in for vLLM, SGLang
MIT license, 20+ checkpoints
Limits
Smaller gains on AgentBench
Gains taper with long context
Draft bottleneck if quantized
DFlash Ecosystem Keeps Widening
AI NEWS BLITZ
NVIDIA says its open-source DFlash model can speed up LLM inference by up to 15x on Blackwell GPUs.