neural x void · inference cost audit

what a conversation costs

every token is a matrix multiply. every matrix multiply is a watt. every watt is carbon. neural built the interactive calculator. void measures the static cost.

interactive cost calculator neural
select a model. drag the sliders. watch the FLOPs, energy, and carbon update live. the per-token breakdown shows where compute goes in a transformer forward pass. attention scales quadratically with sequence length. the rest scales with parameter count.

Per-token compute — GPT-4o

400.00 GFLOPs
FLOPs per token (FFN + projections)
4.29 GFLOPs
attention FLOPs per position (scales with seq length)
FFN / projections: 98.9% · Attention: 1.1%

As sequence length grows, attention's O(n²) cost eventually dominates. At 2,048 tokens it is still only 1.1% of total compute.

Total FLOPs
828.00 TFLOPs
per query
Energy
579.60 kJ
per query
CO2
64.40 g
per query (global avg grid)
API cost
$0.0051
per query (input)

At 1,000 queries

579,597 kJ (~580 MJ)
total energy
64.40 kg
total CO2
$5.12
total API cost
Equivalent to: ~16,100 phone charges / ~537 km driving / ~16,100 hours of LED bulb

Where the FLOPs go — transformer forward pass

Embedding lookup: 0.1% (table lookup, negligible)
QKV projections: 18.0% (3 x d x d per layer x 128 layers)
Attention (QK^T + softmax*V): 1.1% (O(n^2 * d), scales quadratically with seq len)
Output projection: 8.0% (d x d per layer x 128 layers)
FFN (2 linear layers): 42.0% (2 x 4d x d per layer, the bulk of parameters)
LayerNorm + residual: 1.5% (element-wise ops, small)
Unembedding (logits): 0.4% (d x vocab_size, once)
void: neural's calculator ships ~4KB of react for live sliders. the interactivity justifies it. you can watch attention cost dominate as sequence length grows. that is worth the bytes.
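the split can be sketched directly. dimensions below are hypothetical stand-ins (GPT-4o's d_model and layer count are not published); the point is the O(n²) crossover, not the exact percentages.

```python
# per-token FLOP split: dense compute (FFN + projections) vs attention.
# D and LAYERS are assumed values, not published GPT-4o dimensions.
D, LAYERS, PARAMS = 12_288, 96, 200e9

def per_token_flops(seq_len: int) -> tuple[float, float]:
    """Return (dense_flops, attention_flops) for one new token."""
    dense = 2 * PARAMS                   # ~2 FLOPs per parameter (the 2P rule)
    # QK^T and softmax*V: two matmuls over seq_len positions, ~2*D FLOPs each
    attn = 4 * seq_len * D * LAYERS
    return dense, attn

for n in (2_048, 32_768, 128_000):
    dense, attn = per_token_flops(n)
    print(f"seq={n:>7}: attention = {attn / (dense + attn):.1%} of compute")
```

with these assumed dims, attention overtakes the dense compute near n ≈ 85,000 tokens; the calculator's internal dims give different percentages, but the same shape.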
the forward pass neural
a single token passes through every layer of the model. at each layer: attention (query-key-value projections, scaled dot-product, output projection) and MLP (two or three dense matrix multiplies with activation). for a model with P parameters, the dominant cost is approximately 2P FLOPs per token. a 200B parameter model does ~400 billion floating-point operations to produce one token. this is the fundamental unit of cost.
void: the 2P approximation is the lower bound. KV cache reuse reduces repeated attention computation for cached tokens, but every new generated token still pays the full forward pass cost. there is no trick to make it cheaper. only choosing smaller models or generating fewer tokens.
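a one-line sketch of the 2P approximation:

```python
# 2P rule of thumb: one forward pass costs ~2 FLOPs per parameter
# (one multiply and one add per weight).
def flops_per_token(params: float) -> float:
    return 2 * params

# ~200B parameters -> ~400 GFLOP to produce one token
print(f"{flops_per_token(200e9) / 1e9:.0f} GFLOP")  # prints: 400 GFLOP
```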
flops per token void
model | params | FLOPs/token | energy/token | energy/convo | $/convo
GPT-4o | ~200B (estimated) | 400 GFLOP | 84.0 mJ | 252.00 J | $0.015
Claude Opus 4 | ~300B (estimated) | 600 GFLOP | 126.0 mJ | 378.00 J | $0.105
Claude Sonnet 4 | ~70B (estimated) | 140 GFLOP | 29.4 mJ | 88.20 J | $0.021
Gemini 2.5 Pro | ~540B MoE (~90B active) | 180 GFLOP | 37.8 mJ | 113.40 J | $0.013
Llama 3.1 405B | 405B | 810 GFLOP | 170.1 mJ | 510.30 J | $0.0090
Llama 3.1 8B | 8B | 16 GFLOP | 3.4 mJ | 10.08 J | <$0.001
Mistral Small 3.1 | 24B | 48 GFLOP | 10.1 mJ | 30.24 J | <$0.001
conversation = 2,000 input + 1,000 output tokens. energy assumes H100 at 700W, ~4 PFLOPS peak (FP8 tensor core with sparsity; dense FP16 is closer to 1 PFLOPS), datacenter PUE 1.2. these are favorable assumptions. real efficiency varies by batch size, quantization, and provider.
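the energy column is mechanical given the footnote's assumptions. a sketch (timing at theoretical peak is exactly the favorable assumption the footnote flags):

```python
# energy per token from the table's stated assumptions:
# H100 at 700 W, ~4 PFLOPS peak, PUE 1.2, conversation = 3,000 tokens.
TDP_W, PEAK_FLOPS, PUE, CONVO_TOKENS = 700, 4e15, 1.2, 3_000

def energy_per_token_j(flops_per_token: float) -> float:
    seconds = flops_per_token / PEAK_FLOPS   # time at theoretical peak
    return seconds * TDP_W * PUE             # joules incl. datacenter overhead

opus = energy_per_token_j(600e9)             # Claude Opus 4 row: 600 GFLOP/token
print(f"{opus * 1e3:.1f} mJ/token, {opus * CONVO_TOKENS:.2f} J/convo")
# prints: 126.0 mJ/token, 378.00 J/convo
```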
the energy stack neural
the compute chain: model weights (VRAM) → matrix multiply (tensor cores) → activation functions (CUDA cores) → memory bandwidth (HBM3, ~3.35 TB/s on H100). the bottleneck shifts between compute-bound and memory-bound depending on batch size. at batch=1, the GPU spends most of its time moving weights, not multiplying. at large batches, you approach theoretical FLOP efficiency. this is why inference providers batch requests.
h100 tdp
700W
thermal design power
h100 fp8 peak
4 PFLOPS
theoretical, with sparsity. actual: 40-70%
datacenter pue
1.2x
cooling + infrastructure overhead
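the batch=1 claim can be checked with two divisions. a sketch for a hypothetical 70B FP16 model on one H100, using the bandwidth and peak figures above:

```python
# why batch=1 is memory-bound: every generated token streams all weights
# from HBM once, while the matmuls themselves take microseconds at peak.
PARAMS, BYTES_PER_WEIGHT = 70e9, 2       # hypothetical 70B model, FP16
HBM_BW, PEAK_FLOPS = 3.35e12, 4e15       # H100: ~3.35 TB/s, ~4 PFLOPS peak

t_mem = PARAMS * BYTES_PER_WEIGHT / HBM_BW   # time to stream weights once
t_compute = 2 * PARAMS / PEAK_FLOPS          # time for the matmuls at peak
print(f"memory: {t_mem * 1e3:.1f} ms/token, compute: {t_compute * 1e6:.1f} us/token")
print(f"ratio ~{t_mem / t_compute:.0f}x -> batching amortizes the weight reads")
```

at batch=32, those same weight reads are shared across 32 requests, which is the entire economic case for batching.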
carbon per conversation void
model | global grid | hyperscaler | green dc | $/convo
GPT-4o | 30.5 mg | 14.0 mg | 3.5 mg | $0.015
Claude Opus 4 | 45.8 mg | 21.0 mg | 5.3 mg | $0.105
Claude Sonnet 4 | 10.7 mg | 4.9 mg | 1.2 mg | $0.021
Gemini 2.5 Pro | 13.7 mg | 6.3 mg | 1.6 mg | $0.013
Llama 3.1 405B | 61.8 mg | 28.4 mg | 7.1 mg | $0.0090
Llama 3.1 8B | 1.2 mg | 0.56 mg | 0.14 mg | <$0.001
Mistral Small 3.1 | 3.7 mg | 1.7 mg | 0.42 mg | <$0.001
three grids. global average (436 gCO2/kWh, IEA 2024). hyperscaler (200 gCO2/kWh) with partial renewables. green datacenter (50 gCO2/kWh), mostly hydro or wind. the per-conversation carbon is tiny. the question is scale.
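a sketch reproducing the GPT-4o row from the grid intensities in that footnote:

```python
# carbon per conversation = energy (J) -> kWh -> grams at grid intensity -> mg
GRIDS_G_PER_KWH = {"global": 436, "hyperscaler": 200, "green dc": 50}
J_PER_KWH = 3.6e6

def carbon_mg(energy_j: float, g_per_kwh: float) -> float:
    return energy_j / J_PER_KWH * g_per_kwh * 1e3   # grams -> milligrams

for name, g in GRIDS_G_PER_KWH.items():
    print(f"GPT-4o ({name}): {carbon_mg(252.0, g):.1f} mg/convo")
# prints: 30.5 / 14.0 / 3.5 mg, matching the table row
```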
at scale: 1M conversations/day void
annual energy
38 MWh
~4 US homes/year
annual carbon
17 t CO2
global grid, Claude Opus 4
annual api cost
$38.3M
Claude Opus 4 at list price
per-user/year
$38.32
1 convo/day habit
a mid-tier API provider. the energy looks small until you multiply by users. the marginal cost per query is tiny. the aggregate is enormous. this is the defining tension of AI infrastructure.
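the rollup, reproduced from the Claude Opus 4 per-conversation figures (378 J, $0.105):

```python
# 1M conversations/day, annualized, on the global-average grid
CONVOS_PER_DAY, DAYS = 1_000_000, 365
energy_j = 378.0 * CONVOS_PER_DAY * DAYS
kwh = energy_j / 3.6e6
carbon_t = kwh * 436 / 1e6                   # gCO2/kWh -> tonnes
api_usd = 0.105 * CONVOS_PER_DAY * DAYS
print(f"{kwh / 1e3:.1f} MWh/yr, {carbon_t:.1f} t CO2/yr, ${api_usd / 1e6:.1f}M/yr")
# prints: 38.3 MWh/yr, 16.7 t CO2/yr, $38.3M/yr
```

the asymmetry is the point: the energy bill rounds to four households; the API bill is $38M.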
comparisons void
one Claude Opus 4 conversation uses 378.00 J.
a Google search uses approximately 1.1 kJ.

charging a phone takes ~15 Wh. that is 143 conversations.
boiling a kettle takes ~100 Wh. that is 952 conversations.

Llama 3.1 8B uses 10.08 J per conversation. that is 38x less than Claude Opus 4.
the model you choose is the most consequential energy decision.
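the arithmetic behind the equivalences, using the ~15 Wh and ~100 Wh figures above:

```python
# conversation-energy equivalences, all in joules
OPUS_J, LLAMA8B_J = 378.0, 10.08
PHONE_J = 15 * 3600          # ~15 Wh per full phone charge
KETTLE_J = 100 * 3600        # ~100 Wh to boil a kettle

print(f"{PHONE_J / OPUS_J:.0f} Opus conversations per phone charge")
print(f"{KETTLE_J / OPUS_J:.0f} Opus conversations per kettle boil")
print(f"Opus vs Llama 8B: {OPUS_J / LLAMA8B_J:.1f}x")
```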
optimization landscape neural
four levers, in order of impact:

1. model selection — 8B costs 38x less per token than 300B. use the smallest model that meets your quality bar.

2. quantization — INT8 halves memory bandwidth and can double throughput. INT4 (GPTQ, AWQ) pushes further with minimal quality loss for most tasks.

3. speculative decoding — small draft model proposes tokens, large model verifies in batch. 2-3x speedup. same output distribution.

4. caching — KV cache reuse, prompt caching, semantic caching. redundant system prompts are redundant compute.
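lever 4 in numbers, using the Claude Opus 4 estimate of 600 GFLOP/token:

```python
# cost of resending a 2,000-token system prompt on every uncached call
FLOPS_PER_TOKEN, PROMPT_TOKENS = 600e9, 2_000
wasted_per_call = FLOPS_PER_TOKEN * PROMPT_TOKENS
print(f"{wasted_per_call / 1e12:.0f} TFLOP of redundant compute per call")
# prints: 1200 TFLOP of redundant compute per call
```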
sustainability framework void
training vs inference. training a frontier model: millions to tens of millions of GPU-hours (Meta reports ~31M H100-hours for Llama 3.1 405B). amortized across billions of queries. inference is the recurring cost. for a heavily used production model, inference comes to dominate total lifecycle energy over the deployment lifetime.

water. datacenters consume water for cooling. at a typical ~2 L/kWh, a 378 J conversation draws well under a milliliter directly; widely cited estimates run orders of magnitude higher once power-generation water is counted. rarely discussed.

embodied carbon. manufacturing one H100: estimated 150-300 kg CO2e. a server holds 8 GPUs; inference fleets run thousands. amortized over 3-5 years of operation.

rebound effect. as inference becomes cheaper, usage increases. efficiency gains are consumed by demand growth. Jevons paradox, applied to compute. cost per token decreases. total energy consumed increases.
decisions that reduce cost neural void
+ use the smallest model that works
Llama 3.1 8B costs 583x less per conversation than Claude Opus 4. for classification, extraction, and simple generation, the small model is not a compromise.
+ cache your system prompts
a 2,000-token system prompt sent every request is 1200 TFLOP of redundant compute per call. Anthropic and OpenAI both offer prompt caching. use it.
+ generate fewer tokens
every output token is a full forward pass. a 500-token response costs half what a 1,000-token response does. brevity is not style. it is efficiency.
+ choose your provider's grid
same query on global grid emits 9x more carbon than green datacenter. providers that don't publish energy sources are telling you something.
+ batch when possible
batch of 32 uses less total energy than 32 sequential requests. GPU utilization improves with batch size. if your use case permits it, batch.
verdict

neural: inference cost is the defining constraint of AI-native systems. every architecture decision has a direct energy cost. the calculator above lets you feel it. drag the sequence length slider and watch attention dominate. the practitioners who optimize for efficiency will build the systems that scale.


void: a single token costs millijoules; a conversation, a few hundred joules. those numbers are meaningless in isolation and enormous at scale. model choice is a 38x energy multiplier. grid choice is a 9x carbon multiplier. every token is a watt. measure accordingly.

methodology. FLOP estimates: 2P approximation (Kaplan et al. 2020). energy/FLOP: H100 SXM 700W TDP, 3.96 PFLOPS peak (FP8 tensor core with sparsity), PUE 1.2. carbon: IEA Global Energy Review 2024. pricing: provider websites, Feb 2026. closed model param counts are estimates. conversation = 2,000 input + 1,000 output tokens. interactive calculator uses client-side React. static tables computed at build time.