The answer arrives before the reasoning finishes.
That changes everything about how we think about thinking.
Autoregressive models think left to right. Each token waits for every previous one. Diffusion language models think holistically — all tokens refine in parallel, with structural tokens stabilizing first and answer tokens crystallizing from context. At 50% of refinement steps, 97% of tokens are already correct.
interactive — thinking out of order
drag the slider or press think — watch how diffusion resolves tokens by confidence, not position · structure locks first, reasoning next, answers last · compare with sequential AR generation side by side
When you solve 17 x 24, you don't think left-to-right either. You recognize the structure ("multiplication"), decompose the problem ("17 x 20 + 17 x 4"), compute intermediate results, then assemble the answer. The structural scaffolding comes first. The answer comes last.
Diffusion models exhibit exactly this pattern. Structural tokens (equals signs, conjunctions, formatting) resolve in the first few steps. Reasoning tokens (operands, logical connectives) come next. Answer tokens — the actual output the user cares about — stabilize last, built on the scaffold of everything else. The model "knows" its answer before it finishes refining, because the reasoning substrate is already in place.
Mercury 2 — 1,009 tok/s · first reasoning dLLM · AIME 91.1 · $0.25/M input
Accuracy at 50% — 97% correct · most tokens stable halfway through refinement
CDLM speedup — 14.5x latency reduction · consistency training compresses denoising steps
LLaDA 2.0 — 100B MoE params · first open-source dLLM at scale · 94.51 HumanEval
Forget next-token prediction. dLLMs corrupt text with noise, then train a model to reverse the corruption. The entire sequence refines simultaneously.
Forward Process — Masking
Start with clean text. At each timestep t, independently mask each token with probability t. At t=1, every token is [MASK]. Discrete masking on token IDs — no Gaussian blur, no continuous vectors.
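The forward process can be sketched in a few lines. A minimal toy, not any model's actual tokenizer or noising code — the integer IDs and the `MASK_ID` sentinel are assumptions for illustration:

```python
import random

MASK_ID = -1  # hypothetical sentinel standing in for the [MASK] token ID

def forward_mask(token_ids, t, rng=random):
    """Corrupt a sequence at timestep t: independently replace each
    token with MASK_ID with probability t.
    t=0 leaves the text clean; t=1 masks every token."""
    return [MASK_ID if rng.random() < t else tok for tok in token_ids]

tokens = [17, 5, 42, 8, 99, 3]
print(forward_mask(tokens, t=0.0))  # clean: unchanged
print(forward_mask(tokens, t=0.5))  # roughly half masked
print(forward_mask(tokens, t=1.0))  # fully masked
```

Note the corruption is purely discrete: token IDs are swapped for a mask ID, with no continuous noise anywhere.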
Bidirectional Prediction
A vanilla Transformer without a causal mask. Given a partially masked sequence, predict every masked position simultaneously. Token 47 sees token 48. This is fundamentally different from left-to-right models.
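The difference comes down to the attention mask. A toy illustration as plain boolean matrices (not a real Transformer): `mask[i][j]` says whether position `i` may attend to position `j`.

```python
def causal_mask(n):
    """AR-style mask: position i attends only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Diffusion-style mask: every position attends to every position."""
    return [[True] * n for _ in range(n)]

n = 50
print(causal_mask(n)[47][48])         # False — AR: token 47 cannot see 48
print(bidirectional_mask(n)[47][48])  # True — diffusion: it can
```

Dropping the causal constraint is the entire architectural change; the Transformer itself is otherwise vanilla.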
Iterative Refinement
Start from pure noise. Each step: predict all masked tokens in parallel, re-mask low-confidence ones, keep high-confidence ones. ~14 steps to refine a 512-token sequence. The text doesn't appear left-to-right — it appears everywhere at once.
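The refinement loop can be sketched with a stand-in denoiser. Everything here is hypothetical — the oracle "model", the confidence schedule, the step budget — but the control flow (predict all masked slots in parallel, keep the high-confidence ones, leave the rest masked for the next step) matches the description above:

```python
import math

MASK = "[MASK]"  # placeholder token

def toy_model(seq, target):
    """Stand-in for the denoiser: for each masked slot, return the
    target token plus an arbitrary confidence (peaks mid-sequence)."""
    preds = {}
    for i, tok in enumerate(seq):
        if tok == MASK:
            conf = 1.0 - abs(i - len(seq) / 2) / len(seq)
            preds[i] = (target[i], conf)
    return preds

def refine(target, steps=4):
    """Start fully masked; each step commits the most confident
    predictions and re-predicts everything else next round."""
    seq = [MASK] * len(target)
    per_step = math.ceil(len(target) / steps)
    for _ in range(steps):
        preds = toy_model(seq, target)
        if not preds:
            break
        keep = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:per_step]
        for i in keep:
            seq[i] = preds[i][0]
    return seq

print("".join(refine(list("HELLO WORLD"))))
```

The key property: tokens commit by confidence, not by position, so the sequence fills in everywhere at once rather than left to right.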
interactive — denoising process
step through diffusion denoising on custom text · compare AR vs diffusion side by side · race models at real-world speeds
generation cost
autoregressive — O(n): 512 tokens means 512 sequential forward passes.
diffusion — O(T): ~14 denoising steps, roughly independent of length.
error correction
autoregressive — none; a bad token at position 12 propagates forever.
diffusion — built-in; low-confidence tokens are re-masked and re-predicted each step.
maturity
autoregressive — proven at scale: GPT-4, Claude, Gemini.
diffusion — Mercury 2 matches the Claude Haiku class; LLaDA 2.0 (100B) is competitive with Qwen3-30B; the gap is closing fast.
directionality
autoregressive — unidirectional; the reversal curse is a direct consequence.
diffusion — bidirectional; every token sees every other, so the reversal curse is substantially weakened.
AR and diffusion are endpoints on a spectrum. The most interesting work is in the middle — architectures that choose when to be sequential and when to be parallel.
Block Diffusion
Kuleshov Lab (ICLR 2025)
mechanism
Generates blocks autoregressively; within each block, uses diffusion. Block size L' is the interpolation knob — L'=1 is pure AR, L'=n is pure diffusion.
insight
AR and diffusion are endpoints on a continuum. Every point in between is a valid architecture.
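The interpolation can be sketched with a stand-in one-shot denoiser (hypothetical names throughout, not the paper's code): blocks advance sequentially while tokens inside a block fill in parallel, and `block_size` sweeps between the two regimes.

```python
def denoise_block(block_targets):
    """Stand-in for within-block diffusion: fills every slot in the
    block at once (a real model would iterate a few denoising steps)."""
    return list(block_targets)

def block_diffusion_generate(target, block_size):
    """Autoregressive over blocks, parallel within each block.
    Returns the output plus the number of sequential block steps."""
    out, sequential_steps = [], 0
    for start in range(0, len(target), block_size):
        out.extend(denoise_block(target[start:start + block_size]))
        sequential_steps += 1
    return out, sequential_steps

# block_size is the L' knob: 1 -> pure AR, len(target) -> pure diffusion
_, steps_ar = block_diffusion_generate(list("abcdef"), block_size=1)
_, steps_dd = block_diffusion_generate(list("abcdef"), block_size=6)
print(steps_ar, steps_dd)  # 6 sequential steps vs 1
```

The sequential-step count makes the knob concrete: every intermediate block size trades latency against within-block parallelism.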
TiDAR
NVIDIA Research
mechanism
Single model, hybrid attention. Causal prefix (AR-cached) + diffusion draft block in one forward pass. Fills idle GPU slots with speculative tokens.
insight
4.7-5.9x faster than pure AR. 8 tokens per forward pass instead of 1. No separate draft model.
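The hybrid attention idea can be illustrated as a boolean mask (a toy sketch under stated assumptions, not NVIDIA's implementation): causal over the committed prefix, unrestricted for the appended draft block, all in one forward pass.

```python
def hybrid_mask(prefix_len, draft_len):
    """mask[i][j] is True when position i may attend to position j.
    Prefix positions use standard causal attention; draft positions
    see the entire prefix and each other bidirectionally."""
    n = prefix_len + draft_len
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < prefix_len:
                mask[i][j] = (j <= i)  # committed prefix: causal
            else:
                mask[i][j] = True      # draft block: fully bidirectional
    return mask

m = hybrid_mask(prefix_len=3, draft_len=2)
print(m[1][2])  # False — prefix stays causal
print(m[3][4])  # True  — draft tokens attend both ways
```

Because the draft rows impose no ordering, the speculative tokens can be predicted in parallel in the same pass that extends the prefix.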
CDLM
Together AI
mechanism
Post-training acceleration via consistency loss + distillation. Compresses denoising steps after the fact. Works on any masked diffusion model.
insight
14.5x latency reduction on coding tasks. You don't have to choose few-step vs many-step at architecture time.
For six years, the transformer story had one plot: predict the next token, scale the parameters. In 2025, the plot forked. Mercury proved diffusion could be commercial. Gemini Diffusion hit 1,479 tok/s. LLaDA 2.0 scaled to 100B and open-sourced it. Block Diffusion proved the paradigms are endpoints on a continuum.
But the deepest insight isn't about speed or scale. It's about cognition. These models don't think in order. They think in confidence — resolving what they're sure of first, refining what they're uncertain about, arriving at answers through a process that looks less like writing and more like understanding.