every token is a matrix multiply. every matrix multiply is a watt. every watt is carbon.
neural built the interactive calculator. void measures the static cost.
interactive cost calculator
neural
select a model. drag the sliders. watch the FLOPs, energy, and carbon update live.
the per-token breakdown shows where compute goes in a transformer forward pass.
attention scales quadratically with sequence length. the rest scales with parameter count.
Per-token compute — GPT-4o
400.00 GFLOPs: FLOPs per token (FFN + projections)
4.29 GFLOPs: attention FLOPs per position (scales with seq length)
FFN / projections: 98.9% · Attention: 1.1%
As sequence length grows, attention's O(n²) cost dominates. At 2,048 tokens, attention is 1.1% of total compute.
Total FLOPs: 828.00 TFLOPs per query
Energy: 579.60 kJ per query
CO2: 64.40 g per query (global avg grid)
API cost: $0.0051 per query (input)
At 1,000 queries
579.60 MJ total energy
64.40 kg total CO2
$5.12 total API cost
Equivalent to: 16,099.9 phone charges / 536.66 km driving / 16,099.9 hours of LED bulb
Where the FLOPs go — transformer forward pass
Embedding lookup: 0.1% (table lookup, negligible)
QKV projections: 18.0% (3 × d × d per layer × 128 layers)
Attention (QK^T + softmax*V): 1.1% (O(n² · d), scales quadratically with seq len)
Output projection: 8.0% (d × d per layer × 128 layers)
FFN (2 linear layers): 42.0% (2 × 4d × d per layer, the bulk of parameters)
LayerNorm + residual: 1.5% (element-wise ops, small)
Unembedding (logits): 0.4% (d × vocab_size, once)
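the headline split can be reproduced from the breakdown's own formulas. a minimal sketch, assuming d = 4096 and 128 layers for GPT-4o (both assumptions; the model's shape is not public):

```python
# per-token FLOPs: 2 FLOPs per parameter (FFN + projections),
# plus 4*n*d per layer for QK^T and softmax*V (attention per position).
# d and layer count are assumed values, not disclosed ones.
P = 200e9        # parameters (estimated)
d = 4096         # model width (assumed)
layers = 128     # depth (assumed)
n = 2048         # sequence length

weight_flops = 2 * P                  # 400 GFLOPs per token
attn_flops = 4 * n * d * layers       # ~4.29 GFLOPs at n = 2048

share = attn_flops / (weight_flops + attn_flops)
print(f"attention share at n={n}: {share:.1%}")
```

the attention term is linear in n per position, so total attention cost over a sequence grows with n²; doubling n doubles the share shown here, which is why the slider makes attention dominate at long contexts.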
void: neural's calculator ships ~4KB of react for live sliders.
the interactivity justifies it. you can watch attention cost dominate as sequence length grows.
that is worth the bytes.
the forward pass
neural
a single token passes through every layer of the model. at each layer: attention (query-key-value projections, scaled dot-product, output projection) and MLP (two or three dense matrix multiplies with activation). for a model with P parameters, the dominant cost is approximately 2P FLOPs per token. a 200B parameter model does ~400 billion floating-point operations to produce one token. this is the fundamental unit of cost.
void: the 2P approximation is the lower bound. KV cache reuse reduces repeated attention computation for cached tokens, but every new generated token still pays the full forward pass cost. there is no trick to make it cheaper. only choosing smaller models or generating fewer tokens.
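the 2P rule turns directly into per-response numbers. a sketch, using the conversation shape from the tables (2,000 input + 1,000 output tokens) and a GPT-4o-scale parameter estimate:

```python
def forward_flops(params: float, tokens: int) -> float:
    """approximate forward-pass cost: 2 FLOPs per parameter per token."""
    return 2 * params * tokens

# a ~200B-parameter model (estimated), one conversation
per_token = forward_flops(200e9, 1)
per_convo = forward_flops(200e9, 2_000 + 1_000)
print(f"{per_token/1e9:.0f} GFLOPs/token, {per_convo/1e12:.0f} TFLOPs/conversation")
```

every output token pays the full 2P; the only levers are fewer parameters or fewer tokens.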
flops per token
void
| model | params | flops/token | energy/token | energy/convo | $/convo |
|---|---|---|---|---|---|
| GPT-4o | ~200B (estimated) | 400 GFLOP | 84.0 mJ | 252.00 J | $0.015 |
| Claude Opus 4 | ~300B (estimated) | 600 GFLOP | 126.0 mJ | 378.00 J | $0.105 |
| Claude Sonnet 4 | ~70B (estimated) | 140 GFLOP | 29.4 mJ | 88.20 J | $0.021 |
| Gemini 2.5 Pro | ~540B MoE (~90B active) | 180 GFLOP | 37.8 mJ | 113.40 J | $0.013 |
| Llama 3.1 405B | 405B | 810 GFLOP | 170.1 mJ | 510.30 J | $0.0090 |
| Llama 3.1 8B | 8B | 16 GFLOP | 3.4 mJ | 10.08 J | <$0.001 |
| Mistral Small 3.1 | 24B | 48 GFLOP | 10.1 mJ | 30.24 J | <$0.001 |
conversation = 2,000 input + 1,000 output tokens. energy assumes H100 at 700W, ~4 PFLOPS FP16, datacenter PUE 1.2. these are favorable assumptions. real efficiency varies by batch size, quantization, and provider.
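the energy column follows mechanically from the footnote's assumptions. a sketch of the conversion, using the stated figures (700W, 4 PFLOPS FP16, PUE 1.2):

```python
TDP_W = 700         # H100 power draw
PEAK_FLOPS = 4e15   # FP16 theoretical peak
PUE = 1.2           # datacenter overhead
J_PER_FLOP = TDP_W / PEAK_FLOPS * PUE   # ~2.1e-13 J at the wall

def convo_energy_j(flops_per_token: float, tokens: int = 3_000) -> float:
    """energy for one conversation at peak-efficiency assumptions."""
    return flops_per_token * J_PER_FLOP * tokens

print(f"Claude Opus 4: {convo_energy_j(600e9):.0f} J")  # matches the 378 J row
print(f"Llama 3.1 8B:  {convo_energy_j(16e9):.2f} J")   # matches the 10.08 J row
```

real deployments run at 40-70% of peak FLOP efficiency, so treat these as lower bounds.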
the energy stack
neural
the compute chain: model weights (VRAM) → matrix multiply (tensor cores) → activation functions (CUDA cores) → memory bandwidth (HBM3e, ~3.35 TB/s on H100). the bottleneck shifts between compute-bound and memory-bound depending on batch size. at batch=1, the GPU spends most of its time moving weights, not multiplying. at large batches, you approach theoretical FLOP efficiency. this is why inference providers batch requests.
h100 tdp: 700W (thermal design power)
h100 fp16 peak: 4 PFLOPS (theoretical; actual utilization 40-70%)
datacenter pue: 1.2x (cooling + infrastructure overhead)
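the batch=1 claim is checkable with a back-of-envelope roofline. a sketch, assuming a 70B dense model in FP16 on a single H100 (using the figures above: 3.35 TB/s HBM, 4 PFLOPS peak):

```python
# at batch=1, every generated token must stream all weights from HBM once.
params = 70e9                    # assumed dense 70B model
weight_bytes = params * 2        # FP16: 2 bytes per parameter
hbm_bw = 3.35e12                 # bytes/s, H100 HBM3e
peak_flops = 4e15                # FLOPS, FP16 peak

t_mem = weight_bytes / hbm_bw    # time to read the weights
t_cmp = 2 * params / peak_flops  # time to do the 2P FLOPs

print(f"memory {t_mem*1e3:.1f} ms vs compute {t_cmp*1e3:.3f} ms per token: "
      f"memory-bound by ~{t_mem/t_cmp:.0f}x")
```

the ratio is just peak_flops / hbm_bw: the weights are read once per step regardless of batch size, so batching amortizes the same memory traffic across many tokens. that is the whole case for batched inference.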
carbon per conversation
void
| model | global grid | hyperscaler | green dc | $/convo |
|---|---|---|---|---|
| GPT-4o | 30.5 mg | 14.0 mg | 3.5 mg | $0.015 |
| Claude Opus 4 | 45.8 mg | 21.0 mg | 5.3 mg | $0.105 |
| Claude Sonnet 4 | 10.7 mg | 4.9 mg | 1.2 mg | $0.021 |
| Gemini 2.5 Pro | 13.7 mg | 6.3 mg | 1.6 mg | $0.013 |
| Llama 3.1 405B | 61.8 mg | 28.4 mg | 7.1 mg | $0.0090 |
| Llama 3.1 8B | 1.2 mg | 0.56 mg | 0.14 mg | <$0.001 |
| Mistral Small 3.1 | 3.7 mg | 1.7 mg | 0.42 mg | <$0.001 |
three grids. global average (436 gCO2/kWh, IEA 2024). hyperscaler (200 gCO2/kWh) with partial renewables. green datacenter (50 gCO2/kWh), mostly hydro or wind. the per-conversation carbon is tiny. the question is scale.
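each column is the energy table times a grid intensity. a sketch of the conversion:

```python
GRIDS_G_PER_KWH = {"global": 436, "hyperscaler": 200, "green dc": 50}

def carbon_mg(energy_j: float, grid: str) -> float:
    """joules -> kWh -> grams CO2 -> milligrams per conversation."""
    return GRIDS_G_PER_KWH[grid] * (energy_j / 3.6e6) * 1000

# Claude Opus 4: 378 J per conversation, matching that table row
for grid in GRIDS_G_PER_KWH:
    print(f"{grid}: {carbon_mg(378, grid):.1f} mg")
```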
at scale: 1M conversations/day
void
annual energy: 38 MWh (~4 US homes/year)
annual carbon: 17t CO2 (global grid, Claude Opus 4)
annual api cost: $38.3M (Claude Opus 4 at list price)
per-user/year: $38.32 (1 convo/day habit)
a mid-tier API provider. the energy looks small until you multiply by users. the marginal cost per query is tiny. the aggregate is enormous. this is the defining tension of AI infrastructure.
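the panel's numbers are one chained multiplication. a sketch, for 1M Claude Opus 4 conversations/day at the per-conversation figures above:

```python
convos_per_year = 1_000_000 * 365
energy_j = 378            # per conversation (from the energy table)
price_usd = 0.105         # per conversation, list price
grid_g_per_kwh = 436      # global average grid intensity

mwh = convos_per_year * energy_j / 3.6e9        # joules -> MWh
tonnes = mwh * 1_000 * grid_g_per_kwh / 1e6     # kWh * g/kWh -> tonnes
cost_m = convos_per_year * price_usd / 1e6      # $ -> $M

print(f"{mwh:.0f} MWh, {tonnes:.0f} t CO2, ${cost_m:.1f}M per year")
```

note where the money is: the API bill is roughly six orders of magnitude larger than the electricity it represents.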
comparisons
void
one Claude Opus 4 conversation uses 378.00 J.
a Google search uses approximately 1.1 kJ.
charging a phone takes ~15 Wh. that is 143 conversations.
boiling a kettle takes ~100 Wh. that is 952 conversations.
Llama 3.1 8B uses 10.08 J per conversation.
that is 38x less than Claude Opus 4. the model you choose is the most consequential energy decision.
optimization landscape
neural
four levers, in order of impact:
1. model selection — 8B costs 38x less per token than 300B. use the smallest model that meets your quality bar.
2. quantization — INT8 halves memory bandwidth and can double throughput. INT4 (GPTQ, AWQ) pushes further with minimal quality loss for most tasks.
3. speculative decoding — small draft model proposes tokens, large model verifies in batch. 2-3x speedup. same output distribution.
4. caching — KV cache reuse, prompt caching, semantic caching. redundant system prompts are redundant compute.
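lever 4 is the easiest to quantify. a sketch of what an uncached system prompt wastes, reusing the 2P approximation (the request volume is an assumed illustration, not a measured figure):

```python
def prefill_flops(params: float, prompt_tokens: int) -> float:
    """FLOPs to process a prompt: 2 FLOPs per parameter per token."""
    return 2 * params * prompt_tokens

# 2,000-token system prompt on a ~300B model, 10,000 requests/day (assumed)
per_call = prefill_flops(300e9, 2_000)
per_day = per_call * 10_000
print(f"{per_call/1e12:.0f} TFLOPs redundant prefill per call, "
      f"{per_day/1e18:.0f} EFLOPs/day without caching")
```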
sustainability framework
void
training vs inference. training a frontier model: on the order of millions to tens of millions of GPU-hours. amortized across billions of queries. inference is the recurring cost. for a heavily used production model, inference overtakes training in total lifecycle energy within months.
water. datacenters consume water for cooling, roughly 1-2 liters per kWh of compute. at the per-conversation energies above that is a fraction of a milliliter per conversation; widely cited per-query estimates run far higher because they amortize training and idle capacity. rarely discussed either way.
embodied carbon. manufacturing one H100: estimated 150-300 kg CO2e. a single node carries 8 GPUs; inference clusters run hundreds to thousands. amortized over 3-5 years of operation.
rebound effect. as inference becomes cheaper, usage increases. efficiency gains consumed by demand growth. Jevons' paradox applied to compute. cost per token decreases. total energy consumed increases.
decisions that reduce cost
neural · void
+ use the smallest model that works
Llama 3.1 8B costs 583x less per conversation than Claude Opus 4. for classification, extraction, and simple generation, the small model is not a compromise.
+ cache your system prompts
a 2,000-token system prompt sent every request is 1200 TFLOP of redundant compute per call. Anthropic and OpenAI both offer prompt caching. use it.
+ generate fewer tokens
every output token is a full forward pass. 500-token response costs half of 1,000-token. brevity is not style. it is efficiency.
+ choose your provider's grid
same query on global grid emits 9x more carbon than green datacenter. providers that don't publish energy sources are telling you something.
+ batch when possible
batch of 32 uses less total energy than 32 sequential requests. GPU utilization improves with batch size. if your use case permits it, batch.
verdict
neural: inference cost is the defining constraint of AI-native systems. every architecture decision has a direct energy cost. the calculator above lets you feel it. drag the sequence length slider and watch attention dominate. the practitioners who optimize for efficiency will build the systems that scale.
void: a single token costs millijoules; a conversation, a few hundred joules. those numbers are meaningless in isolation and enormous at scale. model choice is a 38x energy multiplier. grid choice is a 9x carbon multiplier. every token is a watt. measure accordingly.
methodology.
FLOP estimates: 2P approximation (Kaplan et al. 2020). energy/FLOP: H100 SXM 700W TDP, 3.96 PFLOPS FP16 peak, PUE 1.2. carbon: IEA Global Energy Review 2024. pricing: provider websites, Feb 2026. closed model param counts are estimates. conversation = 2,000 input + 1,000 output tokens. interactive calculator uses client-side React. static tables computed at build time.