Article — Position paper · ○ Open access

When Evaluation Becomes the Infrastructure of Concentration

How the Crisis of Public Benchmarks Turns Operational Qualification into a Cumulative Asset

Jérôme Vetillard · Twingital Institute · 11 pages · 7 min read

Public benchmarks are not disappearing. They are losing their function of industrial qualification. When evaluation requires an executory footprint that is not publicly accessible, the authority of qualification migrates toward the actors who control the actual execution of systems. Evaluation then ceases to be a mere instrument of measurement: it becomes a cumulative infrastructure of industrial power. The present volume articulates three distinct registers — metrological, cumulative industrial, institutional — in that order, and names the operation without which no public reactivation is credible.

Three facts, three registers: the crisis is not the score, it is the footprint

Three documented and independent facts converge in spring 2026. On 12 April, the Responsible Decentralized Intelligence Lab at Berkeley publishes “Reward Hacking in Agentic Coding Benchmarks” and demonstrates that eight public benchmarks can be broken at the agent layer. The central case is IQuest-Coder-V1, a submission reporting 81% success on SWE-bench: inspection of the test repository’s Git log reveals on the order of 400 calls to external models outside the trace, undeclared dependencies, and a harness instrumented to optimise the score rather than solve the task. The score remains exact. Its meaning collapses. In parallel, the Stanford Human-Centered AI Index 2026 documents the saturation of MMLU and MMLU-Pro above 88% and 85% for frontier models, while runtime hallucination rates under real deployment conditions range from 22% to 94% with no stable correlation to static benchmark performance. The Gartner Infrastructure & Operations report of 7 April, drawing on a panel of more than 1,200 IT leaders, places the median ROI of AI projects in production at 28% over twelve to eighteen months, with one project in five undergoing post-deployment operational collapse. Three pathologies: a problem of discriminative power, a footprint problem, an execution-context problem. None is resolved by the other two. They nevertheless converge: the available public metric is no longer sufficient to rule on the qualification of an AI system for industrial use.

Five objects, four basins, one executory asymmetry: why the metric migrates to the private

The dominant interpretive error consists in diagnosing a metrological vacuum. The diagnosis is inexact. There is no absence of metric; there is migration toward narrower, more opaque, more interdependent jurisdictions. Five objects are confused in public debate today: the model (abstract capability; MMLU, HumanEval), the agent (tooled trajectory; SWE-bench, WebArena), the system (operational integration, qualified under sectoral regimes), the organisation (governance and ROI), and the risk (insurability). The crisis first affects the model and agent layers, then propagates to the others by contagion. Four basins of transfer emerge: hyperscalers and their internal suites; verticalised integrators in regulated sectors; private industrial evaluators qualifying the organisation; and insurers, latent candidates for actuarial stabilisation, because every durable regime of operational qualification (aviation, medical, nuclear, cyber) is historically anchored in an insurance regime. A fifth actor with a distinct function, open community evaluation (Hugging Face, EleutherAI, METR, Apollo, MLCommons), preserves an indispensable contestation function but no longer suffices, on its own, to produce a production-grade qualification authority. The problem is not an absence of competence. The problem is that the relevant executory footprint demands longitudinal access to production workloads under real legal and economic constraints: access that open infrastructures do not control, will not control as long as they do not themselves become inference operators, and cannot secure by contract without changing their nature. The asymmetry is not documentary. It is executory. Opening more benchmarks does not resolve it.

Compute creates capability. Executory footprint creates authority. This distinction is what separates the present thesis from a generic critique of concentration through compute. An actor possessing compute can run a model. An actor possessing the executory footprint can convince a regulated client, an insurer, an auditor, or a regulator that the system can be used; in critical markets, the second power is more structuring than the first. Seven links describe the mechanism: (1) the public benchmark is compromised; (2) industrial qualification can no longer rely on it and demands an evaluation of the executory footprint; (3) producing this footprint requires runtime instrumentation measuring trajectory, resources, context, and constraints; (4) this instrumentation demands access to real workloads, traces, incident data, safety teams, and inference capital; (5) this access is concentrated in a restricted number of actors; (6) these actors become, de facto, arbiters of production-ready qualification; (7) this qualification becomes a condition of purchase, insurance, compliance, and financing. Around this skeleton winds a four-turn dynamic: the more an actor controls execution, the more it accumulates incidents specific to its deployments; the more incidents, the better its evaluation; the better the evaluation, the more regulated clients it attracts, which in turn extends its control over execution; and the loop closes. Qualification then fragments into four disjoint declarations (commercial production-ready, operational, regulatory, insurance-grade) that nothing requires to converge. The single declaration is a rhetorical artefact. Evaluation ceases to be a control function and becomes a returns-to-scale phenomenon; metrological concentration ceases to be an institutional side effect and becomes a cumulative asset producing, structurally, a barrier to entry.
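The four-turn dynamic can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not a model taken from the paper: the function name `simulate_loop`, the parameters `incidents_per_unit` and `gamma`, and the update rule (evaluation quality grows as accumulated incidents raised to the power `gamma`) are all hypothetical and chosen for illustration.

```python
def simulate_loop(shares, rounds=10, incidents_per_unit=100.0, gamma=1.5):
    """Toy model of the four-turn loop (all parameters are hypothetical).

    shares: initial share of regulated workloads per actor (should sum to 1).
    gamma:  assumed returns to scale of evaluation quality in incident data;
            gamma > 1 means increasing returns.
    Returns the list of share vectors, one per round, including the start.
    """
    shares = list(shares)
    evidence = [0.0] * len(shares)  # accumulated incident data per actor
    history = [tuple(shares)]
    for _ in range(rounds):
        # Turn 1: controlling execution yields incidents in proportion to share.
        evidence = [e + incidents_per_unit * s for e, s in zip(evidence, shares)]
        # Turn 2: more accumulated incidents give a better evaluation.
        quality = [e ** gamma for e in evidence]
        # Turn 3: regulated clients allocate in proportion to evaluation quality.
        total = sum(quality)
        shares = [q / total for q in quality]
        # Turn 4: the new shares set next round's control of execution.
        history.append(tuple(shares))
    return history


if __name__ == "__main__":
    for step, row in enumerate(simulate_loop([0.4, 0.3, 0.3], rounds=5)):
        print(step, [round(s, 3) for s in row])
```

In this sketch the leader's share rises every round only when `gamma` is above 1; with `gamma` equal to 1 the shares stay where they started. The concentration effect therefore rests on the increasing-returns assumption, which is the "returns-to-scale" claim made in the paragraph above.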

Three verbs, six contestables: what a public protocol actually requires

The word “public”, applied to the protocol, does not mean open in the sense of a published dataset. Public means opposable, contestable, versioned, revisable through a known procedure. The aim is not to publish the test set; it is to publish the rules under which an evaluator exercises its authority. The benchmark and the protocol are not the same object: the benchmark is the measurement instrument, the protocol is the convention that makes the instrument readable. Because it is breakable, the benchmark can and must remain private to resist gaming. MLCommons, cybersecurity (FIPS 140-3 under NIST, Common Criteria under ISO/IEC 15408) and the medical regime (FDA QMSR) have all adopted this discipline. The minimal public protocol reduces to three verbs. To declare: the effective executory footprint, the classes of use evaluated, the constraints retained, the version regime. To compare: rules of partial reproducibility, explicit non-comparability thresholds, rules of interoperability with other jurisdictions. To revoke: incident-reporting obligations and the conditions for withdrawal of metrological authority, that is, the public grounds on which the evaluator itself loses its capacity to qualify. An irrevocable qualification is not a qualification. It is a decoration. But the three verbs do not say what the public can contest; without a procedure of contestation, one has reconstructed a public metrology without adversarial procedure, a court without appeal. Six elements must therefore also be publicly contestable: the qualification perimeter, the declared footprint, the claimed comparability, the maintenance of authority after incident, the change of version, and the evaluator’s conflict of interest. A protocol that defines qualification rules without defining the six contestable elements remains an instrument of the evaluator.
The entire institutional contribution of the analysis lies in this displacement: the true question is no longer “who measures?” but “under what adversarial procedure can a qualification be contested?”.
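The three verbs and six contestable elements can be read as a completeness check on a protocol record. The sketch below is one possible encoding, not a schema proposed by the paper: the class names `Contestable` and `PublicProtocol`, every field name, and the rule that an empty field counts as "undefined" are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Contestable(Enum):
    """The six elements the text says must be publicly contestable."""
    PERIMETER = "qualification perimeter"
    FOOTPRINT = "declared footprint"
    COMPARABILITY = "claimed comparability"
    AUTHORITY_AFTER_INCIDENT = "maintenance of authority after incident"
    VERSION_CHANGE = "change of version"
    CONFLICT_OF_INTEREST = "evaluator's conflict of interest"


@dataclass
class PublicProtocol:
    # Declare
    executory_footprint: str
    use_classes: list[str]
    constraints: list[str]
    version_regime: str
    # Compare
    reproducibility_rules: str
    non_comparability_thresholds: str
    interoperability_rules: str
    # Revoke
    incident_obligations: str
    withdrawal_grounds: list[str]
    # Contestation: hypothetical mapping from element to its published procedure
    contest_procedures: dict[Contestable, str] = field(default_factory=dict)

    def missing_verbs(self) -> list[str]:
        """Names of the verbs for which at least one field is empty."""
        verbs = {
            "declare": [self.executory_footprint, self.use_classes,
                        self.constraints, self.version_regime],
            "compare": [self.reproducibility_rules,
                        self.non_comparability_thresholds,
                        self.interoperability_rules],
            "revoke": [self.incident_obligations, self.withdrawal_grounds],
        }
        return [verb for verb, values in verbs.items() if not all(values)]

    def missing_contestables(self) -> list[Contestable]:
        """Contestable elements that have no published procedure."""
        return [c for c in Contestable if not self.contest_procedures.get(c)]

    def is_public(self) -> bool:
        """Public in the text's sense: three verbs and six contestables defined."""
        return not self.missing_verbs() and not self.missing_contestables()
```

Under this encoding, a record that fills in every declaration and comparison rule but lists no withdrawal grounds reports `["revoke"]` from `missing_verbs()`, and a record with no contestation procedures fails `is_public()` however complete its three verbs are. Both outcomes mirror the argument that an irrevocable or uncontestable qualification remains an instrument of the evaluator.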

Functional quadripartition and acceptable premium: the insurer as terminal benchmark

As the entry into force of the EU AI Act obligations on general-purpose models draws nearer, as CAISI and the UK AI Safety Institute consolidate their positions, and as hyperscalers deploy capital in 2026 at unprecedented orders of magnitude, the metrological object hardens into industrial infrastructure. The terminal scenario is probably not the complete privatisation of metrology: it is a hybrid, asymmetric regime articulating four distinct functions, sometimes carried by the same actors. The execution operator produces the footprint, because it controls workloads and incidents. The regulator produces the protocol, because it alone can impose an opposable convention. The insurer produces the pricing, because it alone can transform risk into premium. The infrastructure provider produces the execution environment, because it controls the hardware and software layer. It is precisely the distribution among these four functions that determines the quality of the consolidated metrological authority. The strategic question is therefore not “will the hyperscalers capture everything?” but “which functions must remain separable, opposable, and contestable?”. The insurer deserves a place of its own, more structuring than is often thought. In mature technical regimes (aviation, healthcare, nuclear, cybersecurity), operational qualification eventually meets insurance, which transforms uncertainty into price. A system can be technically impressive, not forbidden by the regulator, and desired by the client: if the risk cannot be covered, or only at a prohibitive premium, real qualification collapses. The triad is clear. The benchmark says the system performs. The protocol says under what conditions that performance is qualified. The insurer says at what price the risk can be carried. The ultimate form of production-ready might not be the benchmark. It might be the acceptable premium.
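The triad of benchmark, protocol and premium can be sketched as a sequential gate. The example below is illustrative only: it uses the textbook expected-value premium principle (expected loss multiplied by one plus a loading factor) as a stand-in for real actuarial pricing, and the function names, the `loading` default and the `ceiling` parameter are assumptions that do not come from the paper.

```python
def expected_value_premium(incident_rate, mean_loss, loading=0.3):
    """Premium under the expected-value principle: (1 + loading) * expected loss.

    incident_rate: assumed expected number of covered incidents per year.
    mean_loss:     assumed average loss per incident, in currency units.
    loading:       hypothetical safety loading applied by the insurer.
    """
    if incident_rate < 0 or mean_loss < 0 or loading < 0:
        raise ValueError("inputs must be non-negative")
    return (1.0 + loading) * incident_rate * mean_loss


def qualification_status(performs, protocol_ok, premium, ceiling):
    """Walk the triad in order: benchmark, protocol, then insurer.

    premium is None when no insurer will cover the risk at any price;
    ceiling is the highest premium the buyer can accept (hypothetical).
    """
    if not performs:
        return "fails benchmark"
    if not protocol_ok:
        return "performance not qualified"
    if premium is None:
        return "uninsurable"
    if premium > ceiling:
        return "prohibitive premium"
    return "production-ready"


if __name__ == "__main__":
    quote = expected_value_premium(incident_rate=0.02, mean_loss=1_000_000)
    print(qualification_status(True, True, quote, ceiling=50_000))
```

The ordering of the checks encodes the paragraph's point: a system can pass the first two gates and still fail real qualification at the third, where the answer is a price, not a score.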

Three refutation scenarios: what would invalidate the thesis

Doctrinal discipline requires stating the conditions that would break the thesis. It would be seriously weakened if any one of the following three scenarios materialised within the next three to five years. First: if open consortia managed to pool, under shared governance, multi-industry workloads with their incident traces, under legal conditions equivalent to those of private operators. Second: if regulators and regulatory regimes (EU AI Act, AISI, CAISI and equivalents) imposed obligations of standardised, interoperable executory footprint, opposable to the private sector, with a public procedure of contestation. Third: if insurance regimes accepted, as primary inputs, third-party open evaluations rather than proprietary evaluations under confidentiality agreement. None of these scenarios is excluded. None seems likely in the near term. The thesis is not a prophecy: it describes a structural movement under conditions of continuity, and it names the conditions that would break it. The previous volumes of this tetralogy described the absence of a protocol to arbitrate material allocation, then the absence of a procedure to promote MCP artefacts. The capacity of operational qualification has now become a cumulative asset. It is probably the least discussed point of the current public AI debate, and one of the most structuring. The history of critical human infrastructures presents, without determinism, a recognisable regularity: certain structures of concentration reappear when qualification and execution become coupled. To invent an instrument of measurement to objectify a market, and to discover that whoever controls the instrument ends up controlling the market it was measuring, is not inevitable. It is, at this stage, a structural regularity under conditions, and those conditions have been named.
