The answer arrives before the reasoning finishes.
That changes everything about how we think about thinking.

Autoregressive models think left to right. Each token waits for every previous one. Diffusion language models think holistically — all tokens refine in parallel, with structural tokens stabilizing first and answer tokens crystallizing from context. At 50% of refinement steps, 97% of tokens are already correct.


When you solve 17 x 24, you don't think left-to-right either. You recognize the structure ("multiplication"), decompose the problem ("17 x 20 + 17 x 4"), compute the intermediate results (340 and 68), then assemble the answer (408). The structural scaffolding comes first. The answer comes last.

Diffusion models exhibit exactly this pattern. Structural tokens (equals signs, conjunctions, formatting) resolve in the first few steps. Reasoning tokens (operands, logical connectives) come next. Answer tokens — the actual output the user cares about — stabilize last, built on the scaffold of everything else. The model "knows" its answer before it finishes refining, because the reasoning substrate is already in place.

Mercury 2: 1,009 tok/s. First reasoning dLLM. AIME 91.1. $0.25/M input.

Accuracy at 50%: 97% of tokens correct. Most tokens are stable halfway through refinement.

CDLM speedup: 14.5x latency reduction. Consistency training compresses denoising steps.

LLaDA 2.0: 100B MoE params. First open-source dLLM at scale. 94.51 HumanEval.

Forget next-token prediction. dLLMs corrupt text with noise, then train a model to reverse the corruption. The entire sequence refines simultaneously.

01

Forward Process — Masking

Start with clean text. At each timestep t, independently mask each token with probability t. At t=1, every token is [MASK]. Discrete masking on token IDs — no Gaussian blur, no continuous vectors.
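The forward process can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not any model's actual preprocessing code: the `MASK` sentinel value and the function name are invented for the example.

```python
import random

MASK = -1  # hypothetical sentinel id standing in for the [MASK] token


def forward_mask(tokens, t, rng=random):
    """Corrupt a token-id sequence: mask each position independently
    with probability t. At t=0 the text is untouched; at t=1 every
    position becomes [MASK]. Purely discrete -- no Gaussian noise."""
    return [MASK if rng.random() < t else tok for tok in tokens]


tokens = [17, 42, 7, 99, 3]
print(forward_mask(tokens, 0.0))  # → [17, 42, 7, 99, 3]
print(forward_mask(tokens, 1.0))  # → [-1, -1, -1, -1, -1]
```

Because each position is masked independently, the expected fraction of masked tokens at timestep t is simply t, which is what makes t a clean "noise level" knob.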

02

Bidirectional Prediction

A vanilla Transformer without a causal mask. Given a partially masked sequence, predict every masked position simultaneously. Token 47 sees token 48. This is fundamentally different from left-to-right models.
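The causal-vs-bidirectional difference can be made concrete with a minimal NumPy sketch of the attention weights. This is illustrative only — a real dLLM is a full Transformer stack — and the function names are invented for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def attention_weights(x, causal=False):
    """Scaled dot-product attention weights over a sequence.
    causal=True masks out future positions (the AR setting);
    causal=False lets every token attend to every other (the dLLM setting)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    if causal:
        n = scores.shape[0]
        scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    return softmax(scores, axis=-1)


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
full = attention_weights(x)               # bidirectional: token 0 sees token 3
ar = attention_weights(x, causal=True)    # causal: token 0 sees only itself
print(full[0, 3] > 0, ar[0, 3] == 0)      # → True True
```

In the causal case the upper triangle of the weight matrix is exactly zero; in the bidirectional case it is not, which is the whole architectural difference.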

03

Iterative Refinement

Start from pure noise. Each step: predict all masked tokens in parallel, re-mask low-confidence ones, keep high-confidence ones. ~14 steps to refine a 512-token sequence. The text doesn't appear left-to-right — it appears everywhere at once.
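The refinement loop can be simulated end to end with a toy denoiser. Since a trained network isn't available here, `toy_denoiser` is an invented stand-in that proposes the known target token with a random confidence score; the keep-high-confidence / re-mask-low-confidence loop is the part that mirrors the real algorithm.

```python
import random

MASK = "[MASK]"


def toy_denoiser(seq, target, rng):
    """Stand-in for the network: for each masked slot, propose the
    target token with a random confidence (a real dLLM returns
    per-token probabilities instead)."""
    return {i: (target[i], rng.random())
            for i, tok in enumerate(seq) if tok == MASK}


def refine(target, steps=4, rng=random.Random(0)):
    seq = [MASK] * len(target)  # start from pure noise
    for _ in range(steps):
        preds = toy_denoiser(seq, target, rng)
        if not preds:
            break
        # keep only the most confident predictions this step;
        # everything else stays masked and is re-predicted next step
        budget = max(1, len(target) // steps)
        ranked = sorted(preds.items(), key=lambda kv: -kv[1][1])
        for i, (tok, _conf) in ranked[:budget]:
            seq[i] = tok
    # commit anything still masked after the step budget
    for i, (tok, _conf) in toy_denoiser(seq, target, rng).items():
        seq[i] = tok
    return seq


print(refine("the answer is 408".split()))  # → ['the', 'answer', 'is', '408']
```

Note that tokens resolve in confidence order, not position order — exactly the behavior described above, where the text appears everywhere at once rather than left to right.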


Latency (dLLM edge)

Autoregressive: O(n) — 512 tokens means 512 forward passes.

Diffusion: O(T) — ~14 steps regardless of length.

Error correction (dLLM edge)

Autoregressive: none. A bad token at position 12 propagates forever.

Diffusion: built-in. Low-confidence tokens are re-masked and re-predicted each step.

Quality (AR edge)

Autoregressive: proven at scale. GPT-4, Claude, Gemini.

Diffusion: Mercury 2 matches the Claude Haiku class; LLaDA 2.0 (100B) is competitive with Qwen3-30B. The gap is closing fast.

Context (dLLM edge)

Autoregressive: unidirectional. The reversal curse is a direct consequence.

Diffusion: bidirectional; every token sees every other. The reversal curse is substantially weakened.
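The latency comparison can be reduced to back-of-envelope pass counts. One caveat, worth stating plainly: this counts forward passes only, and a diffusion pass predicts every position at once, so per-pass cost differs and the wall-clock speedup is not exactly n/T.

```python
def ar_passes(n):
    """Autoregressive decoding: one forward pass per generated token."""
    return n


def diffusion_passes(n, steps=14):
    """Diffusion decoding: a fixed step budget, independent of length."""
    return steps


for n in (128, 512, 2048):
    print(f"{n:5d} tokens  AR: {ar_passes(n):5d} passes"
          f"  diffusion: {diffusion_passes(n)} passes")
```

The gap widens linearly with sequence length, which is why long generations are where dLLMs look most attractive.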

AR and diffusion are endpoints on a spectrum. The most interesting work is in the middle — architectures that choose when to be sequential and when to be parallel.

Block Diffusion (Kuleshov Lab, ICLR 2025)

Mechanism: generates blocks autoregressively; within each block, uses diffusion. Block size L' is the interpolation knob — L'=1 is pure AR, L'=n is pure diffusion.

Insight: AR and diffusion are endpoints on a continuum. Every point in between is a valid architecture.
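Under the same toy assumption as the refinement sketch earlier — a stand-in "denoiser" that already knows the target, since no trained model is available here — the AR-outside, diffusion-inside pattern can be sketched as:

```python
import random

MASK = "[MASK]"


def refine_block(target_block, rng):
    """Toy within-block diffusion: resolve the block's positions in a
    random 'confidence' order (a real model would predict and rank by
    actual per-token confidence)."""
    block = [MASK] * len(target_block)
    order = sorted(range(len(block)), key=lambda _: rng.random())
    for i in order:
        block[i] = target_block[i]
    return block


def block_diffusion(target, block_size, rng=random.Random(0)):
    """Blocks are generated strictly left to right (the AR part);
    tokens inside each block resolve diffusion-style. block_size=1
    degenerates to pure AR; block_size=len(target) to pure diffusion."""
    out = []
    for start in range(0, len(target), block_size):
        out += refine_block(target[start:start + block_size], rng)
    return out


target = "structure first answer last".split()
print(block_diffusion(target, block_size=2))  # → ['structure', 'first', 'answer', 'last']
```

The single `block_size` parameter is the interpolation knob the paper describes: sliding it between 1 and n moves the architecture continuously between the two paradigms.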

TiDAR (NVIDIA Research)

Mechanism: single model, hybrid attention. A causal prefix (AR-cached) plus a diffusion draft block in one forward pass; fills idle GPU slots with speculative tokens.

Insight: 4.7-5.9x faster than pure AR. 8 tokens per forward pass instead of 1, with no separate draft model.

CDLM (Together AI)

Mechanism: post-training acceleration via a consistency loss plus distillation. Compresses denoising steps after the fact; works on any masked diffusion model.

Insight: 14.5x latency reduction on coding tasks. You don't have to choose few-step vs many-step at architecture time.

For six years, the transformer story had one plot: predict the next token, scale the parameters. In 2025, the plot forked. Mercury proved diffusion could be commercial. Gemini Diffusion hit 1,479 tok/s. LLaDA 2.0 scaled to 100B and open-sourced it. Block Diffusion proved the paradigms are endpoints on a continuum.

But the deepest insight isn't about speed or scale. It's about cognition. These models don't think in order. They think in confidence — resolving what they're sure of first, refining what they're uncertain about, arriving at answers through a process that looks less like writing and more like understanding.