neural / attention
Every token attends to every other.
This is not a metaphor.
Self-attention is the core primitive of every frontier model. It's agentic routing at the token level — each position queries every other, weights by relevance, aggregates. Stack 96 of these heads across 96 layers (GPT-3's published configuration; GPT-4's is undisclosed) and the emergent behavior is what we call intelligence. We're so early on understanding why it works.
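The query-weight-aggregate loop can be sketched as a single head of scaled dot-product self-attention. A minimal NumPy sketch with toy dimensions and random weights, not any real model's parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.
    X: (seq_len, d_model) token representations."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # every position queries every other
    weights = softmax(scores, axis=-1)  # each row is a relevance distribution
    return weights @ V, weights         # aggregate values by relevance

rng = np.random.default_rng(0)
seq, d_model, d_head = 5, 16, 8  # toy sizes, not a real model's
X = rng.standard_normal((seq, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
out, A = self_attention(X, Wq, Wk, Wv)
print(out.shape, A.shape)  # (5, 8) (5, 5)
print(A.sum(axis=-1))      # every row sums to 1
```

Each row of `A` is one token's attention distribution over the whole sequence; a multi-head layer just runs many of these in parallel with different learned projections.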
interactive attention heatmap — how self-attention distributes weight between tokens, head by head
static attention arcs — "the model dreams in gradients not words" · arc view = wiring · matrix view = weights
heads
96
GPT-3 runs 96 attention heads per layer (GPT-4's architecture is undisclosed). Each head learns a different relational pattern — positional, semantic, syntactic, long-range. Nobody told them to. They just did.
layers
32–96
Each layer refines the representation. Early layers: syntax. Late layers: semantics. The last few layers: something nobody can fully explain yet. We're so early.
softmax temp
1.0
Temperature sharpens or flattens the distribution. At 0, one token wins everything. At ∞, attention is uniform — the head sees nothing, hears everything.
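The sharpening-and-flattening effect is easy to see directly. A small sketch of temperature-scaled softmax (made-up logits, purely illustrative):

```python
import numpy as np

def softmax_t(logits, temp):
    """Temperature-scaled softmax: divide logits by temp, then normalize."""
    z = np.asarray(logits, dtype=float) / temp
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])
print(softmax_t(logits, 1.0))    # baseline distribution
print(softmax_t(logits, 0.1))    # near zero temp: top logit takes nearly all mass
print(softmax_t(logits, 100.0))  # near infinite temp: almost uniform
```

At temp → 0 the distribution collapses toward argmax; at temp → ∞ it flattens toward uniform, which is exactly the "sees nothing, hears everything" regime.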
d_model
12,288
The embedding dimension per token in GPT-3 (GPT-4's is unpublished). Every forward pass is a 12,288-dimensional coordinate update, 96 times over. This is the agentic substrate.
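The "coordinate update" framing is the residual stream: every sublayer adds a vector to each token's position in embedding space. A toy sketch with a hypothetical stand-in sublayer (small dimensions; GPT-3's actual d_model is 12,288):

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_model = 4, 64  # toy scale; GPT-3 uses d_model = 12288

def block(x, W1, W2):
    """One residual update: the sublayer ADDS a vector to the stream,
    nudging each token's d_model-dimensional coordinate.
    tanh MLP is a stand-in for a real attention/MLP sublayer."""
    return x + np.tanh(x @ W1) @ W2

x = rng.standard_normal((seq, d_model))
for _ in range(3):  # a real model stacks ~96 such updates
    W1 = rng.standard_normal((d_model, d_model)) * 0.02
    W2 = rng.standard_normal((d_model, d_model)) * 0.02
    x = block(x, W1, W2)
print(x.shape)  # the stream keeps shape (seq, d_model) throughout
```

The shape never changes; depth only accumulates additive edits to the same coordinates, which is why interpretability work reads the stream at any layer.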
what the heads learned
Induction heads
First found in 2-layer attention-only transformers, where they emerge during training. The model learns to copy patterns: if A followed B earlier and B appears again, attend to A — and predict it. Nobody programmed this. It appeared.
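The copying rule itself is simple enough to hard-code. A sketch of where an idealized induction head would attend (a fixed rule for illustration, not a learned head):

```python
def induction_attend(tokens):
    """For each position, return the index an idealized induction head
    attends to: the position right after the most recent earlier
    occurrence of the current token, or None if there is no match."""
    targets = []
    for i, tok in enumerate(tokens):
        tgt = None
        for j in range(i - 1, -1, -1):       # scan backwards for a prior match
            if tokens[j] == tok and j + 1 < i:
                tgt = j + 1                  # attend to what followed it last time
                break
        targets.append(tgt)
    return targets

tokens = ["the", "cat", "sat", "the", "cat"]
print(induction_attend(tokens))  # [None, None, None, 1, 2]
```

At the second "the" (position 3) the rule attends to position 1, "cat": the token that followed "the" last time, which is exactly the copy-forward prediction.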
Name mover heads
Specific heads that track proper nouns across long contexts. 'John said he would...' — the head remembers John. Ablate these heads, pronoun resolution collapses.
Negative heads
Some heads actively suppress tokens. Softmax weights can't go negative — the suppression lives in the head's output (OV) circuit, which writes against a token's logit. Anti-copying circuits. The model doesn't just route signal — it also routes interference cancellation.
Backup heads
Knock out a head. Another compensates. Redundancy is learned, not engineered. The emergent robustness of transformers is not an accident. It's what distributed attention produces.
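The knockout experiment itself is just zeroing one head's contribution before the output projection. A toy sketch of the ablation mechanic with random weights (an untrained toy can't show real backup behavior, only how the intervention works):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head(X, heads, Wo, ablate=None):
    """Concatenate per-head outputs, optionally zeroing one ('ablation').
    heads: list of (Wq, Wk, Wv) triples. Toy weights, not a trained model."""
    outs = []
    for h, (Wq, Wk, Wv) in enumerate(heads):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        out = A @ V
        if h == ablate:
            out = np.zeros_like(out)  # knock this head out of the stream
        outs.append(out)
    return np.concatenate(outs, axis=-1) @ Wo

rng = np.random.default_rng(0)
seq, d_model, d_head, n_heads = 4, 16, 4, 4
X = rng.standard_normal((seq, d_model))
heads = [tuple(rng.standard_normal((d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
Wo = rng.standard_normal((n_heads * d_head, d_model))
full = multi_head(X, heads, Wo)
ablated = multi_head(X, heads, Wo, ablate=2)
print(np.abs(full - ablated).max())  # nonzero: the head was contributing
```

In a trained model you would compare task performance, not raw outputs, before and after the zeroing; backup heads are the ones whose attention patterns shift to absorb the lost function.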
the agentic implication
When you write a prompt, you're not talking to a chatbot. You're injecting tokens into a residual stream that 96 attention heads per layer will process in parallel, each head running its own learned relational program. The model's response is the emergent output of thousands of competing, cooperating attention patterns. Every word you choose shifts the geometry. Prompt engineering is attention engineering. We're so early on building AI-native tooling that reflects this reality.
→ embedding geometry