Position Paper · arXiv:2605.11733

LLM inference should be evaluated as energy-to-token production.

API prices for frontier LLMs span an order of magnitude across regions. We argue this gap is not a market artifact: it reveals that inference has become a heavy-industry process whose binding constraint is shifting from peak compute to delivered data-center power. Evaluating accuracy and model FLOPs utilization (MFU) alone misses the macro-level production question.


The Wooden Barrel Effect: token output is bounded by the shorter of compute capacity (CapEx) and delivered power (OpEx) staves.

K(t) Effective compute — CapEx, FLOPs/sec

P(t) Delivered facility power — OpEx, Watts

Φ_system System efficiency multiplier — software/architecture levers

Our position

“The ML community must stop treating LLM inference solely as a model or software engineering problem and instead evaluate it through a Token Production Function, in which token output is bounded jointly by compute-per-token and energy-per-token ceilings under fixed quality and service targets. System optimizations like KV-cache compression, sparse attention, and difficulty-adaptive reasoning are macro-level energy levers, not micro-level engineering tricks.”

§ 2 — Framework

A dimensionally consistent production function.

We borrow Leontief's bottleneck logic from production economics: token output rate is bounded by the shorter of two staves — effective compute and PUE-adjusted delivered power — modulated by a system-level efficiency multiplier Φ_system.

Q̇_token(t) = min( K_eff(t) / c_tok, θ · P_facility(t) / e_tok ) × Φ_system(t) × U(t)

K_eff(t) Effective compute (FLOPs/sec, CapEx)

P_facility(t) PUE-adjusted delivered IT power (Watts, OpEx)

c_tok, e_tok FLOPs/token and joules/token at fixed (q*, s*)

Φ_system System efficiency multiplier: KV compression, sparse attention, scheduling

U(t) Utilization factor (batching, scheduling losses)

θ Engineering conversion constant (nats/joule at stated accounting boundary)
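The production function above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class and function names are hypothetical, the numbers in the usage example are made up, and θ is treated as a plain multiplicative factor defaulting to 1.0 so that both staves come out in tokens per second.

```python
from dataclasses import dataclass


@dataclass
class OperatingPoint:
    """Inputs to the token production function (names are illustrative)."""
    k_eff_flops: float        # K_eff(t): effective compute, FLOPs/sec
    p_facility_w: float       # P_facility(t): delivered power, watts (J/s)
    c_tok: float              # FLOPs per token at fixed (q*, s*)
    e_tok: float              # joules per token at fixed (q*, s*)
    phi_system: float = 1.0   # system efficiency multiplier
    utilization: float = 1.0  # U(t): batching and scheduling losses
    theta: float = 1.0        # ASSUMPTION: plain multiplicative factor


def token_rate(op: OperatingPoint) -> tuple[float, str]:
    """Return (tokens/sec, binding stave) under the Leontief min()."""
    # (FLOPs/s) / (FLOPs/token) = tokens/s
    compute_stave = op.k_eff_flops / op.c_tok
    # (J/s) / (J/token) = tokens/s, scaled by theta
    power_stave = op.theta * op.p_facility_w / op.e_tok
    binding = "compute" if compute_stave <= power_stave else "power"
    rate = min(compute_stave, power_stave) * op.phi_system * op.utilization
    return rate, binding


# Hypothetical operating point: the compute stave allows 5000 tok/s,
# the power stave allows 4000 tok/s, so delivered power binds.
example = OperatingPoint(
    k_eff_flops=1e15, p_facility_w=10_000.0,
    c_tok=2e11, e_tok=2.5, utilization=0.7,
)
rate, binding = token_rate(example)
print(f"{rate:.0f} tok/s, {binding}-bound")
```

Returning the binding stave alongside the rate makes the constraint explicit: adding compute to this hypothetical deployment would not raise output until delivered power or joules per token improves.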

415 → 945 TWh

Global DC electricity, 2024 → 2030 (IEA)

> 10×

Φ_system rise since 2020 (illustrative proxy)

3 – 30×

Listed-price spread
Apr 2026, frontier tier

§ 3 — Empirical lens

Three epochs of inference, viewed through the production function.

Inference history (2020–2026) maps cleanly onto a constraint-binding story: compute-abundant → compute-explosion → power-bound. Each epoch is defined by which constraint becomes binding and which Φ_system levers ship.


Epoch 01 2020–2022

Pre-Cambrian

Both compute and power abundant. GPT-3-scale models on concentrated clusters, dense attention, no KV scheduling.

Φ ≈ 1 baseline
Token output negligible vs human data
Energy hidden in OpEx budgets

Epoch 02 2023–2024

LLM Explosion

Exponential compute growth. First wave of system-level wins makes memory traffic the visible bottleneck.

FlashAttention · vLLM/PagedAttention
INT4 / AWQ quantization
API pricing still relatively uniform

Epoch 03 2025–2026

Context War & Power Wall

Context lengths reach 1M+. Some regions hit the delivered-power ceiling first. Price divergence becomes consistent with constraint divergence.

MLA · NSA · sparse-hybrid stacks
415 → 945 TWh global DC electricity
~140 T daily tokens in CN, Mar 2026

Fig. 1. Energy grows linearly; efficiency grows by an order of magnitude. Global data center electricity (TWh/yr, 2020–2030, IEA-anchored) on the left axis versus an illustrative Φ_system proxy on the right axis (log scale, normalized to 2020 = 1). Energy grows roughly linearly while Φ_system rises by over an order of magnitude, so tokens partially decouple from joules. Step-changes are anchored to FlashAttention (2022), vLLM/INT4 (2023), MLA (2024), and NSA and sparse-hybrid stacks (2025–26).

§ 4 — System optimizations as energy levers

A 50 % KV cache cut is not a benchmark trick. It is a national energy multiplier.

Under fixed quality and SLO targets, MLA, CSA/HCA, NSA, hybrid linear-attention, and difficulty-adaptive reasoning all reduce joules-per-token. Composed, they push the throughput ceiling within a fixed power envelope by an order of magnitude.

Fig. 2. Throughput within a fixed power envelope, normalized to a 1× standard baseline. Architectural efficiency multipliers: MLA (2.5×), CSA/HCA (3.7×), Hybrid Linear (4.0×), and a composed MLA + INT4 projection (10×). Solid bars are measured under fixed quality and latency targets in the stated configurations; the hatched Composed row is a Table 1 row D projection, a compositional upper anchor rather than a head-to-head measurement.
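To make the "compositional upper anchor" concrete, the sketch below multiplies per-lever throughput gains under the assumption that the levers are independent. It is illustrative only. The 2.5× MLA figure comes from Fig. 2; the 4.0× INT4 factor is not reported on this page and is a placeholder chosen so the product matches the 10× composed projection. The 2.0 J/token baseline and both function names are likewise hypothetical.

```python
from math import prod


def composed_multiplier(levers: dict[str, float]) -> float:
    """Upper anchor: product of per-lever gains, assuming independence.

    Real stacks usually compose sub-multiplicatively, so treat the result
    as a ceiling rather than a measurement.
    """
    return prod(levers.values())


def joules_per_token_after(baseline_j_per_tok: float, multiplier: float) -> float:
    """In a fixed power envelope, throughput x m implies joules/token / m."""
    return baseline_j_per_tok / multiplier


levers = {
    "MLA": 2.5,              # from Fig. 2
    "INT4 (assumed)": 4.0,   # ASSUMPTION: placeholder, not a reported value
}
total = composed_multiplier(levers)
new_e_tok = joules_per_token_after(2.0, total)  # hypothetical 2.0 J/token baseline
print(f"composed: {total:.1f}x, energy: {new_e_tok:.2f} J/token")
```

The second function carries the argument of this section: with power held fixed, any throughput multiplier is the same multiplier on tokens per joule.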

§ 5 — Two trajectories

Different binding constraints, different token economies.

Two stylized archetypes — real ecosystems combine features of both. The gap between them is not a market accident; it is a different choice of which barrel stave to lengthen.

Path A · CapEx-driven

The luxury-token trajectory.

Unlimited silicon, constrained grids, high PUE.

K(t) grows fast, P(t) grid-bound
PUE 1.5–2.0, legacy infrastructure
Tokens optimized for peak AGI capability
Economics shaped by CapEx amortization

Path B · OpEx + Φ_system

The commodity-token trajectory.

Constrained silicon, aggressive infrastructure, low PUE.

Low PUE 1.1–1.2, modern grid
Φ_system maximized: MLA · CSA/HCA · NSA
Ultra-cheap application tokens
Economics shaped by Joules/token

Fig. 3. Stylized divergence envelope. Token cost index, 2024–2030: Path A rises to ~6.3× while Path B falls to ~0.14×, producing a stylized ~38× divergence by 2030. Anchored to the 3×–30× listed-price spread observed across vendor tiers in April 2026.

§ 6 — What we ask of the community

A reporting agenda for inference papers and benchmarks.

Six disclosure dimensions that turn “report Joules/token” from a slogan into a comparable benchmark. Reviewers should treat the absence of these as a reviewable gap, not a stylistic preference.

Joules per token at stated (q*, s*).

Quality target (e.g., MMLU-class accuracy) and service target (e.g., 100 ms TTFT) must be fixed before energy is reported.

Example: 2.3 J/tok @ MMLU 0.71, ≤ 100 ms TTFT

Active binding constraint.

State whether the deployment is compute-bound or power-bound at the disclosed operating point, with the falsifiable ρ − ρ* diagnostic.

Example: power-bound, ρ − ρ* = +0.42 J/PFLOP

PUE-adjusted delivered power, not theoretical TDP.

Wall-plug power including cooling, networking, and PUE, not GPU TDP under microbenchmarks.

Example: 580 W wall-plug, PUE = 1.18

K_eff convention.

Default to realized effective serving throughput at the disclosed operating point. Peak-throughput K_eff may be reported alongside as an upper-bound calibration.

Example: K_eff = 18 % of peak at b = 64, ctx = 32 K

Utilization-adjusted token output.

Batching, scheduling losses, and memory stalls must be visible, not absorbed into a single peak number.

Example: tok/s/GPU at U = 0.7, scheduler = vLLM 0.6

Energy-accounting boundary.

State the boundary: chip · server · rack · facility. Cross-paper comparison is impossible without it.

Example: boundary = rack-level, incl. networking, excl. cooling
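The disclosure dimensions above could be captured in a small record type so that Joules/token is always reported together with its operating point and boundary. The following is a sketch under stated assumptions, not a format proposed by the paper: the class, field, and method names are hypothetical, the 250 tok/s throughput is invented for the example, and scaling by PUE assumes the measured power is IT power below the facility boundary.

```python
from dataclasses import dataclass

BOUNDARIES = ("chip", "server", "rack", "facility")


@dataclass
class EnergyDisclosure:
    """One disclosure record (field names are illustrative)."""
    quality_target: str      # q*, e.g. "MMLU 0.71"
    service_target: str      # s*, e.g. "<= 100 ms TTFT"
    power_w: float           # measured power at the stated boundary, watts
    pue: float               # power usage effectiveness of the facility
    boundary: str            # one of BOUNDARIES
    tokens_per_s: float      # realized throughput at this operating point
    binding_constraint: str  # "compute" or "power"

    def __post_init__(self) -> None:
        if self.boundary not in BOUNDARIES:
            raise ValueError(f"boundary must be one of {BOUNDARIES}")
        if self.binding_constraint not in ("compute", "power"):
            raise ValueError("binding_constraint must be 'compute' or 'power'")
        if self.pue < 1.0 or self.tokens_per_s <= 0:
            raise ValueError("need PUE >= 1.0 and positive throughput")

    def joules_per_token(self) -> float:
        """Energy per token at the stated boundary: (J/s) / (tokens/s)."""
        return self.power_w / self.tokens_per_s

    def facility_joules_per_token(self) -> float:
        """ASSUMPTION: below the facility boundary, scale IT power by PUE."""
        if self.boundary == "facility":
            return self.joules_per_token()
        return self.pue * self.joules_per_token()


record = EnergyDisclosure(
    quality_target="MMLU 0.71", service_target="<= 100 ms TTFT",
    power_w=580.0, pue=1.18, boundary="server",
    tokens_per_s=250.0,  # hypothetical throughput
    binding_constraint="power",
)
print(f"{record.joules_per_token():.2f} J/tok at {record.boundary} boundary")
print(f"{record.facility_joules_per_token():.2f} J/tok PUE-adjusted")
```

Rejecting a record with no valid boundary mirrors the point of the last item: a Joules/token figure without its accounting boundary cannot be compared across papers.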

LaTeX disclosure block — paste into your paper's experimental setup

% Energy-to-token disclosure (cite Liu et al., 2026)
\paragraph{Energy-to-token disclosure.}
We report Joules/token at the operating point
$(q^{*}, s^{*}) = $ ⟨task & quality target⟩, $\le$ ⟨latency SLO⟩ ms TTFT.
Delivered power: ⟨W⟩ wall-plug, PUE = ⟨x⟩,
boundary = ⟨chip|server|rack|facility⟩.
$K_{\text{eff}}$ = realized serving throughput at this point;
peak $K_{\text{eff}}$ reported as upper bound. We classify this
deployment as ⟨compute|power⟩-bound under the $\rho - \rho^{*}$
diagnostic with the configuration above.

Authors · Paper · Citation

The paper.

Xiang Liu¹,* Shimiao Yuan²,* Zhenheng Tang³ Peijie Dong¹

Kaiyong Zhao⁴ Qiang Wang⁵ Bo Li³,⁶ Xiaowen Chu¹,†

¹HKUST(GZ) ²UCAS ³HKUST ⁴XGRIDS ⁵HITSZ ⁶Guangzhou HKUST Fok Ying Tung Research Institute
*Equal contribution †Corresponding author

Contact (first author): xliu886@connect.hkust-gz.edu.cn

arXiv:2605.11733

Cite this paper

@misc{liu2026positionllminferenceevaluated,
      title={Position: LLM Inference Should Be Evaluated as Energy-to-Token Production},
      author={Xiang Liu and Shimiao Yuan and Zhenheng Tang and Peijie Dong and Kaiyong Zhao and Qiang Wang and Bo Li and Xiaowen Chu},
      year={2026},
      eprint={2605.11733},
      archivePrefix={arXiv},
      primaryClass={cs.CE},
      url={https://arxiv.org/abs/2605.11733},
}
