Position Paper · arXiv:2605.11733

LLM inference should be evaluated as energy-to-token production.

API prices for frontier LLMs span an order of magnitude across regions. We argue this gap is not a market artifact — it reveals that inference has become a heavy-industry process whose binding constraint is shifting from peak compute to delivered data-center power. Evaluating accuracy and MFU alone misses the macro-level production question.

The Wooden Barrel Effect: token output is bounded by the shorter of two staves, compute capacity (CapEx) and delivered power (OpEx).
  • K(t): effective compute (CapEx, FLOPs/sec)
  • P(t): delivered facility power (OpEx, Watts)
  • Φsystem: system efficiency multiplier (software/architecture levers)

Our position

The ML community must stop treating LLM inference solely as a model or software engineering problem and instead evaluate it through a Token Production Function — token output is bounded jointly by compute-per-token and energy-per-token ceilings under fixed quality and service targets. System optimizations like KV-cache compression, sparse attention, and difficulty-adaptive reasoning are macro-level energy levers, not micro-level engineering tricks.

§ 2 — Framework

A dimensionally consistent production function.

We borrow Leontief's bottleneck logic from production economics: token output rate is bounded by the shorter of two staves — effective compute and PUE-adjusted delivered power — modulated by a system-level efficiency multiplier Φsystem.

$\text{Token}(t) \;=\; \min\!\left( \frac{K_{\text{eff}}(t)}{c_{\text{tok}}},\; \frac{\theta \, P_{\text{facility}}(t)}{e_{\text{tok}}} \right) \times \Phi_{\text{system}}(t) \times U(t)$

  • $K_{\text{eff}}(t)$: effective compute (FLOPs/sec, CapEx)
  • $P_{\text{facility}}(t)$: PUE-adjusted delivered IT power (Watts, OpEx)
  • $c_{\text{tok}}$, $e_{\text{tok}}$: FLOPs/token and Joules/token at fixed (q*, s*)
  • $\Phi_{\text{system}}(t)$: system efficiency multiplier (KV compression, sparse attention, scheduling)
  • $U(t)$: utilization factor (batching, scheduling losses)
  • $\theta$: engineering conversion constant at the stated accounting boundary (treated as dimensionless here so that both arms of the min are tokens/sec)
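
To make the accounting concrete, a minimal sketch of the production function in Python follows. It reuses the disclosure examples from § 6 where available (2.3 J/tok, 580 W wall-plug, Keff at 18 % of peak); the remaining values, and the dimensionless treatment of θ, are our illustrative assumptions, not measurements from the paper.

# Minimal sketch of the token production function. Values marked
# "illustrative" are our placeholders, not measurements from this paper.

def token_rate(K_eff, c_tok, P_facility, e_tok, theta=1.0, phi_system=1.0, U=1.0):
    """Tokens/sec = min(compute ceiling, power ceiling) * Phi_system * U."""
    compute_ceiling = K_eff / c_tok              # tokens/sec if compute binds (CapEx stave)
    power_ceiling = theta * P_facility / e_tok   # tokens/sec if power binds (OpEx stave)
    bound = "compute" if compute_ceiling <= power_ceiling else "power"
    return min(compute_ceiling, power_ceiling) * phi_system * U, bound

rate, bound = token_rate(
    K_eff=0.18 * 2e15,  # 18 % of an illustrative 2 PFLOP/s peak
    c_tok=4e11,         # FLOPs/token at fixed (q*, s*) -- illustrative
    P_facility=580.0,   # W wall-plug, PUE-adjusted (Section 6 example)
    e_tok=2.3,          # J/token at fixed (q*, s*) (Section 6 example)
    theta=1.0,          # treated as dimensionless -- our assumption
    U=0.7,              # utilization after batching/scheduling losses
)
print(f"{rate:,.0f} tok/s per accelerator, {bound}-bound")

At this operating point the power stave binds, consistent with the power-bound classification shown in the § 6 item-2 example.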
At a glance: 415 → 945 TWh global DC electricity, 2024 → 2030 (IEA) · >10× Φsystem rise since 2020 · 3–30× listed-price spread (Apr 2026, frontier tier).

§ 3 — Empirical lens

Three epochs of inference, viewed through the production function.

Inference history (2020–2026) maps cleanly onto a constraint-binding story: compute-abundant → compute-explosion → power-bound. Each epoch is defined by which constraint becomes binding and which Φsystem levers ship.

Epoch 01 · 2020–2022

Pre-Cambrian

Both compute and power abundant. GPT-3-scale models on concentrated clusters, dense attention, no KV scheduling.

  • Φ ≈ 1 baseline
  • Token output negligible vs human data
  • Energy hidden in opex budgets
Epoch 02 · 2023–2024

LLM Explosion

Exponential compute growth. First wave of system-level wins makes memory traffic the visible bottleneck.

  • FlashAttention  ·  vLLM/PagedAttention
  • INT4 / AWQ quantization
  • API pricing still relatively uniform
Epoch 03 · 2025–2026

Context War & Power Wall

Context lengths reach 1M+. Some regions hit the delivered-power ceiling first. Price divergence becomes consistent with constraint divergence.

  • MLA  ·  NSA  ·  sparse-hybrid stacks
  • 415 → 945 TWh global DC electricity
  • ~140 T tokens/day in China, Mar 2026
Fig. 1 Energy grows linearly · efficiency grows by an order of magnitude. IEA-anchored global DC electricity (TWh/yr, 2020–2030, left axis) versus an illustrative Φsystem proxy (right axis, log scale, normalized 2020 = 1): energy grows roughly linearly while Φsystem rises over an order of magnitude, so tokens partially decouple from joules. Step-changes anchored to FlashAttention (2022), vLLM/INT4 (2023), MLA (2024), NSA & sparse-hybrid (2025–26).

§ 4 — System optimizations as energy levers

A 50 % KV cache cut is not a benchmark trick. It is a national energy multiplier.

Under fixed quality and SLO targets, MLA, CSA/HCA, NSA, hybrid linear-attention, and difficulty-adaptive reasoning all reduce joules-per-token. Composed, they push the throughput ceiling within a fixed power envelope by an order of magnitude.

Fig. 2 Throughput within a fixed power envelope, normalized to a 1× standard baseline: MLA 2.5×, CSA/HCA 3.7×, Hybrid Linear 4.0×, composed MLA + INT4 projection 10×. Solid bars are measured under stated quality and latency targets; the hatched Composed bar is a compositional upper anchor (Table 1, row D projection), not a head-to-head measurement.
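
As a back-of-envelope check on how such levers stack, the sketch below composes multipliers multiplicatively on joules/token within a fixed power envelope. Multiplicative stacking mirrors the hatched Composed bar's status as an upper anchor, not a measured result; the ~4× attributed to INT4 is implied by dividing the stated 10× composed figure by MLA's 2.5×, and interaction losses between levers can erode the product.

# Levers as joules/token dividers under a fixed power envelope.
# Multiplicative stacking is an upper-bound assumption (see Fig. 2 caption).

BASELINE_J_PER_TOK = 2.3   # illustrative baseline e_tok at fixed (q*, s*)
POWER_ENVELOPE_W = 1e6     # fixed 1 MW of PUE-adjusted IT power

levers = {
    "MLA": 2.5,                  # measured bar in Fig. 2
    "INT4 projection": 10 / 2.5  # implied: Composed 10x divided by MLA 2.5x
}

composed = 1.0
for multiplier in levers.values():
    composed *= multiplier       # assumes no interaction losses between levers

e_tok = BASELINE_J_PER_TOK / composed     # effective joules/token
ceiling_tok_s = POWER_ENVELOPE_W / e_tok  # throughput ceiling, tokens/sec
print(f"composed {composed:.1f}x -> {ceiling_tok_s:,.0f} tok/s per MW envelope")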

§ 5 — Two trajectories

Different binding constraints, different token economies.

Two stylized archetypes — real ecosystems combine features of both. The gap between them is not a market accident; it is a different choice of which barrel stave to lengthen.

Path A  ·  CapEx-driven

The luxury-token trajectory.

Unlimited silicon, constrained grids, high PUE.

  • K(t) grows fast, P(t) grid-bound
  • PUE 1.5–2.0, legacy infrastructure
  • Tokens optimized for peak AGI capability
  • Economics shaped by CapEx amortization

Path B  ·  OpEx + Φsystem

The commodity-token trajectory.

Constrained silicon, aggressive infrastructure, low PUE.

  • Low PUE 1.1–1.2, modern grid
  • Φsystem maximized: MLA · CSA/HCA · NSA
  • Ultra-cheap application tokens
  • Economics shaped by Joules/token
Fig. 3 Stylized divergence envelope for the token-cost index, 2024–2030: Path A rises to ~6.3× while Path B falls to ~0.14×. Anchored to the 3–30× listed-price spread observed across vendor tiers in April 2026.
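
Reading the envelope as annual rates: a constant-rate decomposition of the plotted endpoints (our arithmetic, not the paper's model) implies roughly +36 %/yr for Path A and −28 %/yr for Path B over 2024–2030.

# Implied constant annual rates behind the stylized Fig. 3 endpoints.
# Endpoints are read off the figure; the constant-rate assumption is ours.

YEARS = 2030 - 2024

endpoints = {
    "Path A (CapEx-driven)": 6.3,       # token-cost index by 2030
    "Path B (OpEx + Phi_system)": 0.14,
}

for path, endpoint in endpoints.items():
    cagr = endpoint ** (1 / YEARS) - 1  # compound annual growth rate
    print(f"{path}: {endpoint:.2f}x by 2030 -> {cagr:+.1%}/yr")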

§ 6 — What we ask of the community

A reporting agenda for inference papers and benchmarks.

Six disclosure dimensions that turn “report Joules/token” from a slogan into a comparable benchmark. Reviewers should treat the absence of these as a reviewable gap, not a stylistic preference.

  1. Joules per token at stated (q*, s*).

    Quality target (e.g., MMLU-class accuracy) and service target (e.g., 100 ms TTFT) must be fixed before energy is reported.

    Example: 2.3 J/tok @ MMLU 0.71, ≤ 100 ms TTFT
  2. Active binding constraint.

    State whether the deployment is compute-bound or power-bound at the disclosed operating point, using the falsifiable ρ − ρ* diagnostic (one consistent reading is sketched after this list).

    Example: power-bound, ρ − ρ* = +0.42 J/PFLOP
  3. PUE-adjusted delivered power, not theoretical TDP.

    Wall-plug power including cooling, networking, and PUE — not GPU TDP under microbenchmarks.

    Example: 580 W wall-plug, PUE = 1.18 (i.e., ≈ 492 W at the IT load × 1.18)
  4. Keff convention.

    Default to realized effective serving throughput at the disclosed operating point. Peak-throughput Keff may be reported alongside as an upper-bound calibration.

    Example: Keff = 18 % of peak at b = 64, ctx = 32 K
  5. Utilization-adjusted token output.

    Batching, scheduling losses, and memory stalls must be visible — not absorbed into a single peak number.

    Example: tok/s/GPU at U = 0.7, scheduler = vLLM 0.6
  6. Energy-accounting boundary.

    State the boundary: chip · server · rack · facility. Cross-paper comparison is impossible without it.

    Example: boundary = rack-level, incl. networking, excl. cooling
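
The ρ − ρ* diagnostic is invoked above but not derived in this section. One reading consistent with the § 2 production function, and with the sign convention in the item-2 example (power-bound giving a positive gap), is: ρ = e_tok / c_tok, the realized energy demand per FLOP, and ρ* = θ · P_facility / K_eff, the energy supplied per unit of effective compute (both in J/FLOP, with θ treated as dimensionless); the power stave then binds exactly when ρ > ρ*. The sketch below is our reconstruction under that assumption, not the authors' reference implementation, and its numbers are illustrative placeholders.

# Reconstruction of the rho - rho* diagnostic (our reading; see lead-in).
#   rho  = e_tok / c_tok               : energy demand per FLOP     [J/FLOP]
#   rho* = theta * P_facility / K_eff  : energy supply per unit compute [J/FLOP]
#   power-bound  <=>  rho - rho* > 0   (matches the item-2 sign convention)

J_PER_PFLOP = 1e-15  # 1 J/PFLOP expressed in J/FLOP

def classify(e_tok, c_tok, P_facility, K_eff, theta=1.0):
    rho = e_tok / c_tok
    rho_star = theta * P_facility / K_eff
    gap = rho - rho_star
    label = "power-bound" if gap > 0 else "compute-bound"
    return label, gap / J_PER_PFLOP  # report the gap in J/PFLOP

# Illustrative placeholder operating point (not from the paper); the gap's
# magnitude depends on the disclosed K_eff and boundary conventions.
label, gap = classify(e_tok=2.3, c_tok=4e11, P_facility=580.0, K_eff=3.6e14)
print(f"{label}, rho - rho* = {gap:+,.0f} J/PFLOP")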
LaTeX disclosure block — paste into your paper's experimental setup
% Energy-to-token disclosure (cite Liu et al., 2026)
\paragraph{Energy-to-token disclosure.}
We report Joules/token at the operating point
$(q^{*}, s^{*}) = $ ⟨task & quality target⟩, $\le$ ⟨latency SLO⟩ ms TTFT.
Delivered power: ⟨W⟩ wall-plug, PUE = ⟨x⟩,
boundary = ⟨chip|server|rack|facility⟩.
$K_{\text{eff}}$ = realized serving throughput at this point;
peak $K_{\text{eff}}$ reported as upper bound. We classify this
deployment as ⟨compute|power⟩-bound under the $\rho - \rho^{*}$
diagnostic with the configuration above.

Authors  ·  Paper  ·  Citation


Xiang Liu (1,*) · Shimiao Yuan (2,*) · Zhenheng Tang (3) · Peijie Dong (1) · Kaiyong Zhao (4) · Qiang Wang (5) · Bo Li (3,6) · Xiaowen Chu (1,†)
(1) HKUST(GZ) · (2) UCAS · (3) HKUST · (4) XGRIDS · (5) HITSZ · (6) Guangzhou HKUST Fok Ying Tung Research Institute
* Equal contribution · † Corresponding author
Contact (first author): xliu886@connect.hkust-gz.edu.cn
Cite this paper
@misc{liu2026positionllminferenceevaluated,
      title={Position: LLM Inference Should Be Evaluated as Energy-to-Token Production},
      author={Xiang Liu and Shimiao Yuan and Zhenheng Tang and Peijie Dong and Kaiyong Zhao and Qiang Wang and Bo Li and Xiaowen Chu},
      year={2026},
      eprint={2605.11733},
      archivePrefix={arXiv},
      primaryClass={cs.CE},
      url={https://arxiv.org/abs/2605.11733},
}