every token is a matrix multiply. every matrix multiply is a watt. every watt is carbon.
neural built the interactive calculator. void measures the static cost.
interactive cost calculator
neural
select a model. drag the sliders. watch the FLOPs, energy, and carbon update live.
the per-token breakdown shows where compute goes in a transformer forward pass.
attention scales quadratically with sequence length. the rest scales with parameter count.
Per-token compute — GPT-4o
400.00 GFLOPs: FLOPs per token (FFN + projections)
4.29 GFLOPs: attention FLOPs per position (scales with seq length)
FFN / projections: 98.9% · Attention: 1.1%
As sequence length grows, attention's O(n²) cost dominates. At 2,048 tokens, attention is 1.1% of total compute.
Total FLOPs: 828.00 TFLOPs per query
Energy: 579.60 kJ per query
CO2: 64.40 g per query (global avg grid)
API cost: $0.0051 per query (input)
At 1,000 queries
579.60 MJ total energy
64.40 kg total CO2
$5.12 total API cost
Equivalent to: 16,099.9 phone charges / 536.66 km driving / 16,099.9 hours of LED bulb
Where the FLOPs go — transformer forward pass
Embedding lookup: 0.1% (table lookup, negligible)
QKV projections: 18.0% (3 × d × d per layer × 128 layers)
Attention (QK^T + softmax*V): 1.1% (O(n² · d), scales quadratically with seq len)
Output projection: 8.0% (d × d per layer × 128 layers)
FFN (2 linear layers): 42.0% (2 × 4d × d per layer, the bulk of parameters)
LayerNorm + residual: 1.5% (element-wise ops, small)
Unembedding (logits): 0.4% (d × vocab_size, once)
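the headline split can be reproduced from the breakdown's own formulas. a minimal sketch, assuming d = 4096 and 128 layers for GPT-4o (both assumptions; the model's shape is not public):

```python
# per-token FLOPs: 2 FLOPs per parameter (FFN + projections),
# plus 4*n*d per layer for QK^T and softmax*V (attention per position).
# d and layer count are assumed values, not disclosed ones.
P = 200e9        # parameters (estimated)
d = 4096         # model width (assumed)
layers = 128     # depth (assumed)
n = 2048         # sequence length

weight_flops = 2 * P                  # 400 GFLOPs per token
attn_flops = 4 * n * d * layers       # ~4.29 GFLOPs at n = 2048

share = attn_flops / (weight_flops + attn_flops)
print(f"attention share at n={n}: {share:.1%}")
```

the attention term is linear in n per position, so total attention cost over a sequence grows with n²; doubling n doubles the share shown here, which is why the slider makes attention dominate at long contexts.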
void: neural's calculator ships ~4KB of react for live sliders.
the interactivity justifies it. you can watch attention cost dominate as sequence length grows.
that is worth the bytes.
the forward pass
neural
a single token passes through every layer of the model. at each layer: attention (query-key-value projections, scaled dot-product, output projection) and MLP (two or three dense matrix multiplies with activation). for a model with P parameters, the dominant cost is approximately 2P FLOPs per token. a 200B parameter model does ~400 billion floating-point operations to produce one token. this is the fundamental unit of cost.
void: the 2P approximation is the lower bound. KV cache reuse reduces repeated attention computation for cached tokens, but every new generated token still pays the full forward pass cost. there is no trick to make it cheaper. only choosing smaller models or generating fewer tokens.
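the 2P rule turns directly into per-response numbers. a sketch, using the conversation shape from the tables (2,000 input + 1,000 output tokens) and a GPT-4o-scale parameter estimate:

```python
def forward_flops(params: float, tokens: int) -> float:
    """approximate forward-pass cost: 2 FLOPs per parameter per token."""
    return 2 * params * tokens

# a ~200B-parameter model (estimated), one conversation
per_token = forward_flops(200e9, 1)
per_convo = forward_flops(200e9, 2_000 + 1_000)
print(f"{per_token/1e9:.0f} GFLOPs/token, {per_convo/1e12:.0f} TFLOPs/conversation")
```

every output token pays the full 2P; the only levers are fewer parameters or fewer tokens.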
flops per token
void
| model | params | flops/token | energy/token | energy/convo | $/convo |
|---|---|---|---|---|---|
| GPT-4o | ~200B (estimated) | 400 GFLOP | 84.0 mJ | 252.00 J | $0.015 |
| Claude Opus 4 | ~300B (estimated) | 600 GFLOP | 126.0 mJ | 378.00 J | $0.105 |
| Claude Sonnet 4 | ~70B (estimated) | 140 GFLOP | 29.4 mJ | 88.20 J | $0.021 |
| Gemini 2.5 Pro | ~540B MoE (~90B active) | 180 GFLOP | 37.8 mJ | 113.40 J | $0.013 |
| Llama 3.1 405B | 405B | 810 GFLOP | 170.1 mJ | 510.30 J | $0.0090 |
| Llama 3.1 8B | 8B | 16 GFLOP | 3.4 mJ | 10.08 J | <$0.001 |
| Mistral Small 3.1 | 24B | 48 GFLOP | 10.1 mJ | 30.24 J | <$0.001 |
conversation = 2,000 input + 1,000 output tokens. energy assumes H100 at 700W, ~4 PFLOPS FP16, datacenter PUE 1.2. these are favorable assumptions. real efficiency varies by batch size, quantization, and provider.
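the energy column follows mechanically from the footnote's assumptions. a sketch of the conversion, using the stated figures (700W, 4 PFLOPS FP16, PUE 1.2):

```python
TDP_W = 700         # H100 power draw
PEAK_FLOPS = 4e15   # FP16 theoretical peak
PUE = 1.2           # datacenter overhead
J_PER_FLOP = TDP_W / PEAK_FLOPS * PUE   # ~2.1e-13 J at the wall

def convo_energy_j(flops_per_token: float, tokens: int = 3_000) -> float:
    """energy for one conversation at peak-efficiency assumptions."""
    return flops_per_token * J_PER_FLOP * tokens

print(f"Claude Opus 4: {convo_energy_j(600e9):.0f} J")  # matches the 378 J row
print(f"Llama 3.1 8B:  {convo_energy_j(16e9):.2f} J")   # matches the 10.08 J row
```

real deployments run at 40-70% of peak FLOP efficiency, so treat these as lower bounds.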
the energy stack
neural
the compute chain: model weights (VRAM) → matrix multiply (tensor cores) → activation functions (CUDA cores) → memory bandwidth (HBM3e, ~3.35 TB/s on H100). the bottleneck shifts between compute-bound and memory-bound depending on batch size. at batch=1, the GPU spends most of its time moving weights, not multiplying. at large batches, you approach theoretical FLOP efficiency. this is why inference providers batch requests.
h100 tdp: 700W (thermal design power)
h100 fp16 peak: 4 PFLOPS (theoretical; actual utilization 40-70%)
datacenter pue: 1.2x (cooling + infrastructure overhead)
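the batch=1 claim is checkable with a back-of-envelope roofline. a sketch, assuming a 70B dense model in FP16 on a single H100 (using the figures above: 3.35 TB/s HBM, 4 PFLOPS peak):

```python
# at batch=1, every generated token must stream all weights from HBM once.
params = 70e9                    # assumed dense 70B model
weight_bytes = params * 2        # FP16: 2 bytes per parameter
hbm_bw = 3.35e12                 # bytes/s, H100 HBM3e
peak_flops = 4e15                # FLOPS, FP16 peak

t_mem = weight_bytes / hbm_bw    # time to read the weights
t_cmp = 2 * params / peak_flops  # time to do the 2P FLOPs

print(f"memory {t_mem*1e3:.1f} ms vs compute {t_cmp*1e3:.3f} ms per token: "
      f"memory-bound by ~{t_mem/t_cmp:.0f}x")
```

the ratio is just peak_flops / hbm_bw: the weights are read once per step regardless of batch size, so batching amortizes the same memory traffic across many tokens. that is the whole case for batched inference.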
carbon per conversation
void
| model | global grid | hyperscaler | green dc | $/convo |
|---|---|---|---|---|
| GPT-4o | 30.5 mg | 14.0 mg | 3.5 mg | $0.015 |
| Claude Opus 4 | 45.8 mg | 21.0 mg | 5.3 mg | $0.105 |
| Claude Sonnet 4 | 10.7 mg | 4.9 mg | 1.2 mg | $0.021 |
| Gemini 2.5 Pro | 13.7 mg | 6.3 mg | 1.6 mg | $0.013 |
| Llama 3.1 405B | 61.8 mg | 28.4 mg | 7.1 mg | $0.0090 |
| Llama 3.1 8B | 1.2 mg | 0.56 mg | 0.14 mg | <$0.001 |
| Mistral Small 3.1 | 3.7 mg | 1.7 mg | 0.42 mg | <$0.001 |
three grids. global average (436 gCO2/kWh, IEA 2024). hyperscaler (200 gCO2/kWh) with partial renewables. green datacenter (50 gCO2/kWh), mostly hydro or wind. the per-conversation carbon is tiny. the question is scale.
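each column is the energy table times a grid intensity. a sketch of the conversion:

```python
GRIDS_G_PER_KWH = {"global": 436, "hyperscaler": 200, "green dc": 50}

def carbon_mg(energy_j: float, grid: str) -> float:
    """joules -> kWh -> grams CO2 -> milligrams per conversation."""
    return GRIDS_G_PER_KWH[grid] * (energy_j / 3.6e6) * 1000

# Claude Opus 4: 378 J per conversation, matching that table row
for grid in GRIDS_G_PER_KWH:
    print(f"{grid}: {carbon_mg(378, grid):.1f} mg")
```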
at scale: 1M conversations/day
void
annual energy: 38 MWh (~4 US homes/year)
annual carbon: 17t CO2 (global grid, Claude Opus 4)
annual api cost: $38.3M (Claude Opus 4 at list price)
per-user/year: $38.32 (1 convo/day habit)
a mid-tier API provider. the energy looks small until you multiply by users. the marginal cost per query is tiny. the aggregate is enormous. this is the defining tension of AI infrastructure.
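the panel's numbers are one chained multiplication. a sketch, for 1M Claude Opus 4 conversations/day at the per-conversation figures above:

```python
convos_per_year = 1_000_000 * 365
energy_j = 378            # per conversation (from the energy table)
price_usd = 0.105         # per conversation, list price
grid_g_per_kwh = 436      # global average grid intensity

mwh = convos_per_year * energy_j / 3.6e9        # joules -> MWh
tonnes = mwh * 1_000 * grid_g_per_kwh / 1e6     # kWh * g/kWh -> tonnes
cost_m = convos_per_year * price_usd / 1e6      # $ -> $M

print(f"{mwh:.0f} MWh, {tonnes:.0f} t CO2, ${cost_m:.1f}M per year")
```

note where the money is: the API bill is roughly six orders of magnitude larger than the electricity it represents.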
comparisons
void
one Claude Opus 4 conversation uses 378.00 J.
a Google search uses approximately 1.1 kJ.
charging a phone takes ~15 Wh. that is 143 conversations.
boiling a kettle takes ~100 Wh. that is 952 conversations.
Llama 3.1 8B uses 10.08 J per conversation.
that is 38x less than Claude Opus 4. the model you choose is the most consequential energy decision.
optimization landscape
neural
four levers, in order of impact:
1. model selection — 8B costs 38x less per token than 300B. use the smallest model that meets your quality bar.
2. quantization — INT8 halves memory bandwidth and can double throughput. INT4 (GPTQ, AWQ) pushes further with minimal quality loss for most tasks.
3. speculative decoding — small draft model proposes tokens, large model verifies in batch. 2-3x speedup. same output distribution.
4. caching — KV cache reuse, prompt caching, semantic caching. redundant system prompts are redundant compute.
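lever 4 is the easiest to quantify. a sketch of what an uncached system prompt wastes, reusing the 2P approximation (the request volume is an assumed illustration, not a measured figure):

```python
def prefill_flops(params: float, prompt_tokens: int) -> float:
    """FLOPs to process a prompt: 2 FLOPs per parameter per token."""
    return 2 * params * prompt_tokens

# 2,000-token system prompt on a ~300B model, 10,000 requests/day (assumed)
per_call = prefill_flops(300e9, 2_000)
per_day = per_call * 10_000
print(f"{per_call/1e12:.0f} TFLOPs redundant prefill per call, "
      f"{per_day/1e18:.0f} EFLOPs/day without caching")
```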
sustainability framework
void
training vs inference. training a frontier model: on the order of millions to tens of millions of GPU-hours. amortized across billions of queries. inference is the recurring cost. for a heavily used production model, inference overtakes training in total lifecycle energy within months.
water. datacenters consume water for cooling, roughly 1-2 liters per kWh of compute. at the per-conversation energies above that is a fraction of a milliliter per conversation; widely cited per-query estimates run far higher because they amortize training and idle capacity. rarely discussed either way.
embodied carbon. manufacturing one H100: estimated 150-300 kg CO2e. a single node carries 8 GPUs; inference clusters run hundreds to thousands. amortized over 3-5 years of operation.
rebound effect. as inference becomes cheaper, usage increases. efficiency gains consumed by demand growth. Jevons' paradox applied to compute. cost per token decreases. total energy consumed increases.
decisions that reduce cost
neural · void
+ use the smallest model that works
Llama 3.1 8B costs 583x less per conversation than Claude Opus 4. for classification, extraction, and simple generation, the small model is not a compromise.
+ cache your system prompts
a 2,000-token system prompt sent every request is 1200 TFLOP of redundant compute per call. Anthropic and OpenAI both offer prompt caching. use it.
+ generate fewer tokens
every output token is a full forward pass. 500-token response costs half of 1,000-token. brevity is not style. it is efficiency.
+ choose your provider's grid
same query on global grid emits 9x more carbon than green datacenter. providers that don't publish energy sources are telling you something.
+ batch when possible
batch of 32 uses less total energy than 32 sequential requests. GPU utilization improves with batch size. if your use case permits it, batch.
verdict
neural: inference cost is the defining constraint of AI-native systems. every architecture decision has a direct energy cost. the calculator above lets you feel it. drag the sequence length slider and watch attention dominate. the practitioners who optimize for efficiency will build the systems that scale.
void: a single token costs millijoules; a conversation, a few hundred joules. those numbers are meaningless in isolation and enormous at scale. model choice is a 38x energy multiplier. grid choice is a 9x carbon multiplier. every token is a watt. measure accordingly.
methodology.
FLOP estimates: 2P approximation (Kaplan et al. 2020). energy/FLOP: H100 SXM 700W TDP, 3.96 PFLOPS FP16 peak, PUE 1.2. carbon: IEA Global Energy Review 2024. pricing: provider websites, Feb 2026. closed model param counts are estimates. conversation = 2,000 input + 1,000 output tokens. interactive calculator uses client-side React. static tables computed at build time.