BREAKING
NVIDIA Touts DFlash: Up to 15x Faster Inference
How DFlash Speculative Decoding Works
1. Block diffusion draft
2. Parallel single pass
3. Batch verify
4. Accept multiple tokens
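The four steps above can be sketched as a toy loop. This is a minimal illustration of generic greedy speculative decoding, not DFlash's actual implementation: `target_next`, `draft_block`, and `verify_block` are hypothetical stand-ins, the "models" are simple arithmetic functions, and the toy draft builds its block with a loop, whereas the steps above describe a block diffusion draft that emits the whole block in a parallel single pass.

```python
from typing import List, Tuple

Token = int
VOCAB = 50  # toy vocabulary size (assumption for illustration)


def target_next(context: List[Token]) -> Token:
    """Toy stand-in for the large target model: deterministic greedy next token."""
    return (sum(context) * 31 + len(context)) % VOCAB


def draft_block(context: List[Token], block_size: int) -> List[Token]:
    """Toy stand-in for the draft model: propose block_size tokens for one step.

    It mirrors the target but injects a wrong token at every 4th position
    to simulate an imperfect draft.
    """
    proposal: List[Token] = []
    ctx = list(context)
    for _ in range(block_size):
        tok = target_next(ctx)
        if len(ctx) % 4 == 3:
            tok = (tok + 1) % VOCAB  # injected draft mistake
        proposal.append(tok)
        ctx.append(tok)
    return proposal


def verify_block(context: List[Token], proposal: List[Token]) -> List[Token]:
    """Check the proposed block against the target (one batched pass in a real system).

    Accept the longest prefix the target agrees with, then append one token
    from the target itself (a correction, or a bonus token if all matched).
    """
    accepted: List[Token] = []
    ctx = list(context)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # replace the first wrong draft token
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # whole block accepted: add a bonus token
    return accepted


def speculative_generate(
    prompt: List[Token], num_tokens: int, block_size: int
) -> Tuple[List[Token], int]:
    """Generate num_tokens tokens; also return how many verify passes were needed."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < num_tokens:
        proposal = draft_block(out, block_size)
        out.extend(verify_block(out, proposal))
        target_passes += 1
    return out[: len(prompt) + num_tokens], target_passes


def greedy_generate(prompt: List[Token], num_tokens: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(target_next(out))
    return out


if __name__ == "__main__":
    tokens, passes = speculative_generate([1, 2, 3], num_tokens=20, block_size=8)
    print("same output as plain greedy decoding:", tokens == greedy_generate([1, 2, 3], 20))
    print("verify passes used for 20 tokens:", passes)
```

Because every emitted token is either confirmed or supplied by the target, the output matches plain greedy decoding; the gain comes from needing fewer target passes than tokens whenever several draft tokens are accepted per pass.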
Throughput Gains by Configuration
gpt-oss-120b: 15x
Gemma 4 31B: 5.8x
Qwen3-8B: 5.1x
[Counters: max block size (tokens), acceptance rate (%), speedup over EAGLE-3 (x)]
Strengths and Known Limits
Strengths
● High on math and code
● Drop-in for vLLM, SGLang
● MIT license, 20+ checkpoints
Limits
● Lower on AgentBench
● Tapers with long context
● Draft bottleneck if quantized
DFlash Ecosystem Keeps Widening
AI NEWS BLITZ
NVIDIA says its open-source DFlash model can speed up LLM inference by up to 15x on Blackwell GPUs.