A Clinical Benchmark Is a Perimeter Decision, Not a Measurement

The benchmark as institutional filter, not technical instrument

The debate on evaluating clinical AI systems has settled on a finding no one contests: benchmarks are imperfect. That is true, and it is beside the point. No serious practitioner has ever believed a benchmark to be exhaustive, and one does not build a doctrine on a self-evident fact. The real problem is harder, and it can be stated in a single sentence that the rest of this article holds, tests, and operationalizes:

A clinical benchmark defines which forms of failure become governable objects for the system.

The benchmark, understood here in a deliberately broad sense as a validation space, is not only a measurement instrument: it is the perimeter actually used to decide that a system is sufficiently valid to be promoted, that a behavior deserves to be monitored, that an incident deserves to be treated as a systemic signal. What does not enter this perimeter does not disappear from clinical reality: it exits institutional attention. Perimeter validity is not governance. It is the precondition for governance.

How the burden of proof has shifted three times, and why that is not enough

The decade 2015 to 2025 produced a first shift. Systems were no longer judged on methodological plausibility alone; reproducible performance metrics were required: AUROC, sensitivity, calibration, F1. That shift was sound but limited. An average score on a narrow benchmark guarantees neither transferability nor utility.

The editorial sequence of 2026 produced the second shift. Nature Medicine no longer asks whether models hallucinate, but whether they improve clinical outcomes, and answers, across several convergent texts, that in many cases we do not know (Is AI actually improving healthcare?; Show us the evidence for the value of medical AI, 2026). The burden of proof passed from intrinsic performance to proof of impact. That is correct, and necessary.

But this shift does not suffice either. The trap is subtle: even the proof of clinical impact silently inherits the visibility perimeter defined upstream. A randomized pragmatic trial demonstrating an improvement in length of stay or readmission rate says nothing about the classes of incidents that remain invisible to the measurement instrument used in the trial itself. The outcome validates what it observes. It remains silent on what it has not instrumented.

The third shift, the one this article seeks to formulate, does not replace the first two: it completes them. The relevant question is no longer only “does the model perform well?” or “does it improve outcomes?”, but “will the incidents it produces in production remain observable, attributable, and contestable?”

Three validity levels that must never be conflated

Three validity levels are conflated under the word “validation,” and separating them is the central contribution of this text.

Statistical validity answers: does the model predict correctly within the benchmark? Clinical validity answers: does the benchmark represent the relevant clinical situations? Governable validity, the third level, answers a question the first two never ask: will the incidents produced in production remain institutionally manageable?

Governable validity itself stratifies into three planes. Observable: the incident produces a trace the system records. Attributable: the trace can be linked to a class of incidents, to a model, to an identifiable decision. Governable: a stable and contestable organizational policy exists to handle this class. An incident can be observable without being attributable; it can be attributable without being governable. A model’s AUROC score says nothing about any of these three planes. A model can excel at statistical validity, pass clinical validity, and fail governable validity without anything in its score signaling it.
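The three planes can be read as successive checks applied to a single incident. The Python sketch below is illustrative only: the `Incident` record, the `governability_plane` function, and the class names are hypothetical, and attribution is reduced to a class label, whereas the text also requires a link to a model and to an identifiable decision.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class Incident:
    # Hypothetical record: None means the corresponding link was never made.
    trace_id: Optional[str]        # set if the system recorded a trace
    incident_class: Optional[str]  # set if the trace was linked to a class


def governability_plane(incident: Incident, policy_classes: Set[str]) -> str:
    """Return the highest plane the incident reaches."""
    if incident.trace_id is None:
        return "invisible"      # no trace: the incident never enters the system
    if incident.incident_class is None:
        return "observable"     # traced, but not linked to any class
    if incident.incident_class not in policy_classes:
        return "attributable"   # classified, but no policy handles the class
    return "governable"


policies = {"common_class"}  # classes with an escalation policy (assumed)
print(governability_plane(Incident("t-1", None), policies))            # observable
print(governability_plane(Incident("t-2", "rare_class"), policies))    # attributable
print(governability_plane(Incident("t-3", "common_class"), policies))  # governable
```

Nothing in this function reads a performance score, which is the point: the plane an incident reaches depends on traces, taxonomy, and policy, not on the model's accuracy.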

An example makes the absurdity of the leaderboard reflex tangible. Model A: AUROC 0.96 on a narrow benchmark. Model B: AUROC 0.91, with explicit coverage of critical rare classes and a dedicated escalation policy for each. Aggregate performance ranking prefers A. Governability prefers B, because B recognizes the rare incident as an incident, attributes it, traces it, escalates it, where A dissolves it in a flattering average. This is not a paradox; it is what happens once average accuracy is no longer conflated with contestable safety.
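The arithmetic behind "dissolving a rare class in an average" can be made explicit. The sketch below uses invented case counts and plain accuracy instead of AUROC (which would require per-case scores), so the figures are not the 0.96 and 0.91 of the example; `ClassResult`, `model_a`, and `model_b` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ClassResult:
    name: str
    n_cases: int
    n_correct: int


def aggregate_accuracy(results: List[ClassResult]) -> float:
    """Pooled accuracy: every case weighs the same, so large classes dominate."""
    total = sum(r.n_cases for r in results)
    correct = sum(r.n_correct for r in results)
    return correct / total


def per_class_accuracy(results: List[ClassResult]) -> Dict[str, float]:
    """Accuracy computed separately for each clinical class."""
    return {r.name: r.n_correct / r.n_cases for r in results}


# Invented counts: 980 common cases, 20 cases of a critical rare class.
model_a = [ClassResult("common", 980, 970), ClassResult("rare_critical", 20, 2)]
model_b = [ClassResult("common", 980, 930), ClassResult("rare_critical", 20, 18)]

for label, model in (("A", model_a), ("B", model_b)):
    rare = per_class_accuracy(model)["rare_critical"]
    print(f"Model {label}: aggregate={aggregate_accuracy(model):.3f} rare={rare:.2f}")
# Model A: aggregate=0.972 rare=0.10
# Model B: aggregate=0.948 rare=0.90
```

With these assumed counts, a leaderboard sorted on the aggregate column ranks A first even though A misses nine rare cases out of ten. The rare class contributes only 2 percent of the denominator, so its failure costs A less than two points.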

The preference for B must not become a naive defense of over-signaling. A system too eager to recognize classes produces the symmetric pathology: alert fatigue, which destroys governance through saturation. Authentic governability is not the maximization of detectability; it is its calibration. The criterion is not how many events are escalated, but which classes, under which arbitration policy. Conflating the two replaces a blind spot with fog.

The propagation theorem: why the initial perimeter fixes the cost of all subsequent corrections

At this point, the dominant industrial objection presents itself: post-market will compensate. Real-world surveillance, incident tracking, modification management plans will correct downstream what the benchmark missed upstream. The objection deserves to be taken seriously, because it describes a real mechanism.

The response is what can be called a propagation theorem. Let S be a deployed surveillance system and T its initial visibility taxonomy. An event class c absent from T can still be discovered by S: a weak signal, a clinician report, a pharmacovigilance cluster, or a retrospective qualitative analysis can identify it. What the theorem asserts is more precise: the absence of c from the initial perimeter conditions the institutional cost of its subsequent recognition as a governable class.

A class not instrumented from the outset requires, in order to become governable, a complete institutional reconstruction: new taxonomy, new threshold, new protocol, new monitoring, new KPI, new SLA, new budget line, new contractual clause, sometimes a new regulatory revision. The cost is not impossibility; it is the institutional debt accumulated for each omitted class. And that institutional debt, unlike technical debt, is not repaid in a sprint.

The problem is not specific to healthcare. It is isomorphic to difficulties that several mature disciplines have already named: observability in distributed systems, detectability in control theory, support mismatch in statistical learning, causal insufficiency in pharmacovigilance. When the same invariant reappears across four independent formulations, it is not a vocabulary coincidence. It is a structural constraint.

PCCP, AI Act article 9, MDR: how regulators have begun to organize the theorem’s use

The theorem does not imply that the perimeter is frozen. It implies that broadening the taxonomy requires a dedicated procedure, anticipated at design time, without which it does not exist.

That is precisely what regulators have begun to industrialize. The Predetermined Change Control Plan (PCCP), formalized by the FDA in its December 2024 final guidance and progressively adopted for AI devices in 2025 to 2026, is an instrument of explicit feedback: it obligates the manufacturer to declare in advance the classes of modification achievable without a new submission, which amounts to pre-specifying the conversion regime between post-market signal and perimeter revision. The European AI Act, in articles 9 (risk management system) and 10 (data governance), requires a mechanism equivalent in spirit. The MDR and IVDR complete the architecture through post-market surveillance.

None of these mechanisms eliminates the propagation theorem. They organize its use. They transform the initial blind spot into a declared blind spot and institute a protocol for reducing it over time. A system governed by PCCP or AI Act article 9 is not a system without blind spots. It is a system whose blind spots are contestable, provided the declaration is honest and the nomenclature used in the declaration is the one used in monitoring.

From dataset choice to behavioral architecture: what the benchmark actually compiles

What the composition of a benchmark actually does can be captured in one image, and only one, because the image is accurate on one point and false everywhere else. The benchmark acts as the implicit compiler of the system’s risk policy: it translates a data choice into a behavioral architecture. The image fixes the idea that the dataset is not upstream of the system but inside its conduct. It ceases to hold as soon as it is extended: a compiler produces deterministic execution; a benchmark acts probabilistically, the runtime remains partially adaptive, and the human operator modifies the trajectory. Three operational terms replace it: visibility perimeter, detectability perimeter, governability perimeter.

The propagation can then be described step by step. Rare class absent from the benchmark, therefore no class-specific calibration, therefore no dedicated alert threshold, therefore no escalation policy, therefore no targeted monitoring, therefore no associated KPI, therefore no corresponding contractual SLA, therefore no specific budgetary arbitration, therefore no exploitable post-market signal. When the incident occurs, it is interpreted as individual noise rather than as a systemic class. At each link, nothing malfunctions. Each component does exactly what it was configured to do. The defect is nowhere and everywhere. It is in the initial visibility boundary, which propagated without any engineer ever deciding to ignore the risk.
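The chain above can be sketched as a pipeline in which each link is configured only from what the previous link hands over. This is a deliberately minimal Python illustration: the link names follow the paragraph, while `derive_pipeline`, `interpret`, and the class labels are hypothetical, and real organizations obviously do not copy taxonomies this mechanically.

```python
from typing import Dict, FrozenSet, Iterable

# Downstream links, in the order given in the text.
CHAIN = (
    "calibration", "alert_threshold", "escalation_policy", "monitoring",
    "kpi", "sla", "budget_line", "post_market_signal",
)


def derive_pipeline(benchmark_classes: Iterable[str]) -> Dict[str, FrozenSet[str]]:
    """Configure every link from the classes visible at the previous link."""
    pipeline = {}
    upstream = frozenset(benchmark_classes)
    for link in CHAIN:
        pipeline[link] = upstream  # each link faithfully copies what it receives
        upstream = pipeline[link]
    return pipeline


def interpret(incident_class: str, pipeline: Dict[str, FrozenSet[str]]) -> str:
    """How the organization reads an incident once it reaches the last link."""
    if incident_class in pipeline["post_market_signal"]:
        return "systemic_class"
    return "individual_noise"


pipeline = derive_pipeline({"common_class"})  # the rare class was never benchmarked
print(interpret("common_class", pipeline))  # systemic_class
print(interpret("rare_class", pipeline))    # individual_noise
```

No step in `derive_pipeline` contains a bug or a decision to ignore anything. The rare class is missing at the last link only because it was missing at the first.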

Escalation workflows, SLAs, budgetary arbitrations: where propagation ends up

Propagation does not stop at the software. It extends into the organization, structuring escalation workflows, clinical review timelines, support contracts, reimbursement policies, insurance clauses, internal compliance obligations, staffing arbitrations. The benchmark does not only compile a software architecture; it compiles an operational bureaucracy.

The three most often forgotten links are the three financial ones: KPI, SLA, budgetary arbitration. A CFO who validates a support contract on the basis of an SLA defined without a benchmark coverage declaration signs, without knowing it, an implicit exemption for all out-of-perimeter classes.

The field makes this assertion verifiable. Consider OCTOPUS, the multicenter observational study on mNSCLC BRAF V600E (n=184, five European countries, survival modeling by SurvTRACE) conducted within the TweenMe framework. This mutation represents approximately 1 to 2 percent of metastatic non-small cell lung cancers. If the validation cohort does not explicitly include this sub-population, the consequence is not that an individual patient will be poorly stratified: it is that the class becomes operationally nonexistent even as it remains clinically real. The presence or absence of a few hundred patients in the validation perimeter can determine, through propagation, how much institutional attention the class receives over several years of deployment.

Visible classes become tracked, funded, auditable, prioritized. Invisible classes become statistically rare, operationally marginal, organizationally silent. The system no longer merely sees certain incidents poorly: it eventually ceases to treat them as central objects of decision. This is what separates this text from a machine learning safety reflection. The subject is not the robustness of a model; it is the institutional distribution of visibility, of which the model is only the technical core.

1,400 FDA authorizations and the governability horizon buyers never receive

By early 2026, the FDA had authorized more than 1,400 AI/ML devices, approximately three quarters in radiology, on benchmarks that are not published device by device. The polemical reflex would be to denounce the authorization of black boxes; that reflex would miss the point.

The problem is not that an authorization rests on an incomplete benchmark. Every benchmark is, as conceded from the outset. The problem is more precise: the opacity of the benchmark is not only a methodological opacity; it becomes an opacity on the effective clinical visibility boundary of the authorized system. The market and buyers hold a market authorization certificate; they do not hold the perimeter of the classes the system will know how to escalate, qualify, contest. They purchase a performance without knowing the governability horizon that accompanies it.

The European counterpart receives less commentary, yet it matters more structurally for European actors. The AI Act requires documented governance practices and dataset representativeness (article 10), as well as a risk management system operating throughout the lifecycle (article 9), which formally includes perimeter revision in response to post-market signals. On paper, these provisions constitute a coherent architecture for the propagation theorem and its conversion regime. In practice, their implementation remains largely to be invented. A representativeness declaration that enumerates demographic proportions without declaring the clinical classes covered does not satisfy the propagation theorem. It satisfies its cosmetic version.

The upstream promotion gate: a principle and its three field proofs

This mechanism extends a prior doctrinal line. In the work on the promotion gate, an institutional passage point decides what gains access to a higher status. The benchmark plays exactly this role for a clinical system. It is its upstream promotion gate.

The principle that follows is simple to state and demanding to hold: a system should not be promoted to production if its validation space does not explicitly cover the classes of incidents it will have to govern, or if it does not declare the classes it does not cover and the conversion regime by which it will subsequently integrate them.
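The principle can be read as a check that runs before promotion. The sketch below assumes one particular reading of it: a system passes if every class it will have to govern is either covered by the validation space, or explicitly declared as uncovered together with a conversion regime. The function `promotion_gate` and its parameters are hypothetical names, not an existing API.

```python
from typing import Iterable, List, Tuple


def promotion_gate(
    required: Iterable[str],
    covered: Iterable[str],
    declared_uncovered: Iterable[str],
    has_conversion_regime: bool,
) -> Tuple[bool, List[str]]:
    """Return (decision, blocking_classes) under the reading stated above."""
    gaps = set(required) - set(covered)
    undeclared = gaps - set(declared_uncovered)
    if undeclared:
        # Silent gaps: neither covered nor declared.
        return False, sorted(undeclared)
    if gaps and not has_conversion_regime:
        # Declared gaps, but no procedure to integrate them later.
        return False, sorted(gaps)
    return True, []


required = {"common_class", "rare_class"}
print(promotion_gate(required, {"common_class"}, set(), False))
# (False, ['rare_class'])
print(promotion_gate(required, {"common_class"}, {"rare_class"}, True))
# (True, [])
```

The gate never asks for an aggregate score. It asks for the composition of the validation space and for what the supplier has said about its own gaps.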

The field gives this rule its substance across three distinct forms. The BRAF V600E patient of the OCTOPUS cohort is not a pedagogical example: it is a precise clinical class whose presence or absence in a perimeter decides the system’s capacity to recognize it subsequently. PREDICARE, on the territorial decompensation trajectory, raises the same question at physiological scale: what is in the perimeter that guarantees the at-risk patient will be seen before decompensation, not after? ToxTwin, on graph-predicted molecular toxicity, raises it again at the molecular level: which structural families are absent from training, and what is the escalation policy when the system encounters them? Three systems, three domains, one invariant: the answer is never in the aggregate score. It is in the composition.

What the thesis does not claim, and why that matters

The thesis does not require an exhaustive benchmark. It requires the declaration of the perimeter and the honest propagation of that taxonomy through to monitoring.

It does not claim that human systems are exempt from the same limit. Human clinical taxonomies have their blind spots; AI changes the scale, speed, and standardization. A clinician locally reconstructs an emerging category; a deployed system propagates its blind spot at the speed of its deployment and with the uniformity of its code.

It does not exempt itself. The identification of critical classes absent from the benchmark depends recursively on a prior visibility structure, that of the designer or the regulator. The thesis shifts the burden of visibility one notch without claiming to exhaust it. This is a gradient of governability, not a guarantee, and that is exactly what the PCCP and AI Act article 9 regimes are in the process of instituting.

Three operational requirements: coverage declaration, untested classes, taxonomic propagation clause

For a decision-maker, whether buyer, regulator, or medical director, three operational demands are sufficient to transform the thesis into an instrument.

First, a coverage declaration by clinical class. For each relevant pathology or sub-population, the supplier produces the number of cases included in the benchmark, the inclusion protocol, and the performance measured on that specific class. The aggregate average becomes one proxy among others, not the primary measure.

Second, an explicit statement of untested classes. This demand is paradoxical in appearance and essential in practice. An organization that knows what it does not know can allocate its attention. An organization that believes it covers everything can prioritize nothing. An honest declaration of blind spots is more protective than an optimistic declaration of total coverage.

Third, a taxonomic propagation clause. The perimeter declared in the validation space appears, class by class, in the monitoring grid, in the alert thresholds, in the tracked KPIs, and in the contractual SLAs. Any discontinuity between these spaces is a governance gap. The minimal control consists of verifying that the nomenclature used to describe the validation space is identical to the one used to describe monitoring. When they diverge, the propagation theorem has already begun its work.
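The minimal control described here amounts to a set comparison between nomenclatures. The following Python sketch shows one way to run it, assuming each space can be exported as a set of class names; `propagation_gaps`, the space names, and the class labels are illustrative assumptions.

```python
from typing import Dict, Set


def propagation_gaps(
    validation: Set[str], downstream: Dict[str, Set[str]]
) -> Dict[str, Dict[str, Set[str]]]:
    """For each downstream space, list classes dropped or added relative to validation."""
    gaps = {}
    for space, classes in downstream.items():
        dropped = validation - classes  # declared in validation, absent downstream
        added = classes - validation    # named downstream, never validated
        if dropped or added:
            gaps[space] = {"dropped": dropped, "added": added}
    return gaps


validation = {"class_a", "class_b", "rare_class_c"}
downstream = {
    "monitoring": {"class_a", "class_b", "rare_class_c"},
    "alert_thresholds": {"class_a", "class_b"},          # rare class lost
    "kpi": {"class_a", "class_b", "Rare class C"},       # naming drift
    "sla": {"class_a", "class_b", "rare_class_c"},
}

for space, gap in propagation_gaps(validation, downstream).items():
    print(space, "dropped:", sorted(gap["dropped"]), "added:", sorted(gap["added"]))
# alert_thresholds dropped: ['rare_class_c'] added: []
# kpi dropped: ['rare_class_c'] added: ['Rare class C']
```

The comparison is intentionally strict on spelling. A renamed class is reported as both dropped and added, because a class that cannot be matched by name across spaces cannot be audited across them either.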

These three requirements are not a metrological ideal. They are the minimal operational translation of the propagation theorem for a buyer who does not wish to sign an implicit exemption.

The problem is not only statistical

A clinical benchmark does not only measure a performance. It defines which forms of failure will exist as governable objects for the system, and it fixes the institutional cost of recognizing all the others.

As long as industry ranks its models by average accuracy, it will continue to optimize what it sees and to deploy what it will not know how to govern. Systems will not necessarily become less performant. They will become capable of producing forms of failure they will not know how to recognize, qualify, or contest, and operational responsibility will dissolve exactly where the benchmark left a blank.

The problem is therefore not only statistical. It is institutional.

A broader question surfaces behind this one, and another article will have to address it: if our measurement instruments end up selecting the operationally accessible reality for systems, do we ultimately compress what we hold to be clinically real? The question is noted here, not developed.

A benchmark does not only define what a system knows how to do. It defines what the system will be able to recognize as a problem.

Key takeaways

A clinical benchmark defines which forms of failure become governable objects for the system. What does not enter the perimeter does not disappear from clinical reality: it exits institutional attention. Perimeter validity is not governance. It is its condition of possibility.
Three validity levels must be held separate: statistical validity (does the model predict correctly?), clinical validity (does the benchmark represent relevant clinical situations?), and governable validity (will incidents produced in production remain institutionally manageable?). An AUROC score says nothing about the third level.
Governable validity stratifies into three planes: observable (the incident produces a trace the system records), attributable (the trace can be linked to a class and an identifiable decision), governable (a stable contestable organizational policy exists). A model can excel at level one, pass level two, and fail level three without anything in its score signaling it.
The propagation theorem: the absence of an event class in the initial perimeter conditions the institutional cost of its subsequent recognition as a governable class. A class not instrumented from the outset requires a complete institutional reconstruction to become governable: new taxonomy, threshold, protocol, monitoring, KPI, SLA, budget line, contractual clause, sometimes regulatory revision.
The leaderboard reflex optimizes the wrong quantity. Model A at AUROC 0.96 on a narrow benchmark is preferred by aggregate ranking. Model B at AUROC 0.91 with explicit coverage of critical rare classes and dedicated escalation policies is preferred by governability. B recognizes the rare incident as an incident; A dissolves it in a flattering average.
The benchmark compiles the operational bureaucracy: escalation workflows, clinical review timelines, support contracts, reimbursement policies, insurance clauses, staffing arbitrations. The three most forgotten links are the three financial ones: KPI, SLA, budgetary arbitration. A CFO validating a support contract on an SLA without a benchmark coverage declaration signs an implicit exemption for all out-of-perimeter classes.
The benchmark is the upstream promotion gate of a clinical system. A system should not be promoted to production if its validation space does not explicitly cover the classes of incidents it will have to govern, or if it does not declare the classes it does not cover and the conversion regime by which it will subsequently integrate them.
Three operational requirements: a coverage declaration by clinical class; an explicit statement of untested classes (an honest blind-spot declaration is more protective than an optimistic total-coverage claim); a taxonomic propagation clause ensuring nomenclature continuity from validation space to monitoring grid, alert thresholds, KPIs, and contractual SLAs.