A Prediction Error Degrades a Metric, an Intervention Error Degrades a Patient

A performance curve proves prediction and nothing else

A health digital twin that displays a good performance curve has demonstrated exactly one thing: that it predicts. It has not shown that it explains, that it simulates, or that it can be used to intervene on a patient. This article advances a single rule of proportionality: a model’s claims must be commensurate with the proof it supplies and the responsibility it takes on, and a proof of prediction satisfies neither when the claim is to intervene. The target is the statistical twin and the statistical part of the hybrid twin; a mechanistic twin that carries its causal assumptions in its equations poses a different evaluation problem.

Four verbs, one hierarchy of responsibility

To predict is to associate an output with an input. To explain is to know why the association holds. To simulate is to produce reliable counterfactuals. To intervene is to act on a real body. These verbs do not form a staircase of knowledge: medicine has intervened for decades without explaining, from aspirin to lithium to general anaesthesia. What separates them is responsibility. A prediction error degrades a metric, an explanation error compromises a model, a simulation error invalidates hypotheses, and an intervention error degrades a patient. The required proof grows heavier at each level because the bearer of the error changes, not because the cognition does. The industrial conflation consists in presenting a system that has supplied the proof for the first verb as if it had supplied the proof for the other three, and the slippage hides inside a verb: the twin that “models,” “anticipates,” or “tests.”

Calibration, not discrimination, is the bridge to decision

Most twin evaluations over-weight discrimination and neglect calibration. The order should be reversed. Discrimination measures whether the model ranks the most at-risk patients above the least at-risk, which the AUC or the c-index summarizes. Calibration measures whether an announced twenty percent risk corresponds to one event in five. A clinical decision rests on a probability threshold, not on a rank, and a threshold means something only on calibrated probabilities. A model that discriminates well but calibrates poorly sends the wrong patients across the threshold with deceptive confidence, and calibration is usually what distribution shift degrades first.
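The gap between ranking and threshold can be made concrete with a toy simulation. The sketch below is illustrative only: the synthetic risks, the square-root distortion, the ten-bin calibration error, and the 0.2 threshold are all assumptions of this example, not material from the paper. It compares a model that announces the true risk with one that applies a monotone distortion to it, so both rank patients identically.

```python
import random

def auc(y_true, y_prob):
    """Discrimination: chance that a random event case outranks a random non-event case."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Calibration: weighted mean gap between announced risk and observed rate, per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        bins[min(int(p * n_bins), n_bins - 1)].append((y, p))
    ece = 0.0
    for b in bins:
        if b:
            mean_risk = sum(p for _, p in b) / len(b)
            event_rate = sum(y for y, _ in b) / len(b)
            ece += len(b) / len(y_true) * abs(mean_risk - event_rate)
    return ece

rng = random.Random(0)  # fixed seed so the sketch is reproducible
true_risk = [rng.uniform(0.02, 0.6) for _ in range(1000)]  # synthetic patients
outcome = [1 if rng.random() < r else 0 for r in true_risk]

calibrated = true_risk                     # announces the true risk
distorted = [r ** 0.5 for r in true_risk]  # same ranking, inflated probabilities

THRESHOLD = 0.2  # hypothetical decision threshold
flagged_calibrated = sum(p >= THRESHOLD for p in calibrated)
flagged_distorted = sum(p >= THRESHOLD for p in distorted)

auc_calibrated = auc(outcome, calibrated)
auc_distorted = auc(outcome, distorted)
ece_calibrated = expected_calibration_error(outcome, calibrated)
ece_distorted = expected_calibration_error(outcome, distorted)

print(f"AUC     calibrated={auc_calibrated:.3f}  distorted={auc_distorted:.3f}")
print(f"ECE     calibrated={ece_calibrated:.3f}  distorted={ece_distorted:.3f}")
print(f"flagged calibrated={flagged_calibrated}  distorted={flagged_distorted}")
```

Because the distortion preserves order, the two AUC values coincide, while the calibration error and the number of patients sent across the threshold both grow. A report that showed only the AUC would not distinguish the two models.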

Provenance explains, transferability decides

A performance that is largely imported is not fragile for that reason alone. AlphaFold depends on a gigantic exogenous prior and is more robust than many models trained on a single narrow task. Importing a regularity is often what saves a model when the patient cohort is too small to supply it. Hence a sharp distinction: provenance explains why a model holds or gives way, transferability decides whether it may be deployed. The two dimensions are decoupled, and the decisive criterion is stability under distribution shift, not the purity of the model’s origin. The dependency audit does not purge the exogenous in the name of some endogenous purity; it declares it and tests it where it counts.

Simulation has a domain of validity, or it is out of warranty

A twin rarely claims merely to predict; it claims to simulate, and simulation demands a sufficiently correct causal structure, not just good performance. Even with a strong signal, a twin can be wrong if it simulates on an incorrect structure. Two properties bound the scope of a simulation: transportability, whether a result established in one population holds in another, and invariance, whether a relation survives across environments. Outside the domain where transport and invariance have been shown, a simulation is not wrong by misfortune; it is out of warranty. Most serious therapeutic systems live at the level of local interventional causality over a bounded domain. The fault is to sell that level as generalizable structural causality.
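Invariance can be probed with a simple check: fit the same relation in several environments and see whether it survives. The sketch below is a toy under stated assumptions, not the paper's procedure. The variables `dose` and `proxy`, the two environments, and the gains are hypothetical; the outcome depends causally on the dose, while the proxy is a downstream marker whose link to the outcome differs between environments.

```python
import random

def ols_slope(xs, ys):
    """Ordinary least-squares slope of y regressed on x."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def simulate_environment(rng, n, proxy_gain):
    """Synthetic environment: outcome is caused by dose; proxy is derived from outcome."""
    dose = [rng.uniform(0.0, 1.0) for _ in range(n)]
    outcome = [2.0 * d + rng.gauss(0.0, 0.1) for d in dose]
    proxy = [proxy_gain * y + rng.gauss(0.0, 0.1) for y in outcome]
    return dose, proxy, outcome

rng = random.Random(1)  # fixed seed so the sketch is reproducible
dose_a, proxy_a, outcome_a = simulate_environment(rng, 500, proxy_gain=1.0)
dose_b, proxy_b, outcome_b = simulate_environment(rng, 500, proxy_gain=-0.5)

# The causal relation keeps roughly the same slope in both environments.
slope_dose_a = ols_slope(dose_a, outcome_a)
slope_dose_b = ols_slope(dose_b, outcome_b)

# The proxy relation predicts well inside each environment but changes sign across them.
slope_proxy_a = ols_slope(proxy_a, outcome_a)
slope_proxy_b = ols_slope(proxy_b, outcome_b)

print(f"outcome ~ dose : env A {slope_dose_a:+.2f}, env B {slope_dose_b:+.2f}")
print(f"outcome ~ proxy: env A {slope_proxy_a:+.2f}, env B {slope_proxy_b:+.2f}")
```

A model trained in environment A on the proxy would show a strong signal there and still simulate in the wrong direction in environment B, which is the sense in which good performance alone does not establish a domain of validity.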

Responsibility commands the proof, and the card records it

Critical engineering already grades proof by the severity of failure: DO-178C defines five assurance levels and scales its verification objectives with criticality. Health regulation classifies the risk of an AI medical device under the AI Act and the MDR/IVDR, but provides no scale for the legitimacy of simulation. The proposed deliverable closes that gap without overreach. It is an identity card for the twin in four blocks: the object, with its predictive and decisional targets; the data, with effective sample size and declared dependencies; the validity, with out-of-distribution transferability, causal level, and the domain beyond which the twin must not simulate; the assurance, with the targeted level and the explicit conditions of non-deployment. Object, data, validity, assurance: what is claimed, with what, how far, and under what warranty. The card is not a standard in force; it makes the spirit of existing regulation operational for the case of twins.
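As a data structure, the four blocks might look like the following. This is a minimal sketch, not the paper's schema: every class name, field name, and example value is hypothetical, and representing the validity domain as inclusive numeric ranges per covariate is an assumption made here to keep the warranty check simple.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass(frozen=True)
class ObjectBlock:
    predictive_target: str   # what the twin predicts
    decisional_target: str   # what decision the prediction is meant to inform

@dataclass(frozen=True)
class DataBlock:
    effective_sample_size: int
    declared_dependencies: Tuple[str, ...]  # exogenous priors, pretrained components

@dataclass(frozen=True)
class ValidityBlock:
    ood_transferability: str  # summary of out-of-distribution evidence
    causal_level: str
    domain: Dict[str, Tuple[float, float]]  # assumption: inclusive range per covariate

@dataclass(frozen=True)
class AssuranceBlock:
    targeted_level: str
    non_deployment_conditions: Tuple[str, ...]

@dataclass(frozen=True)
class TwinIdentityCard:
    object_: ObjectBlock
    data: DataBlock
    validity: ValidityBlock
    assurance: AssuranceBlock

def in_warranty(card: TwinIdentityCard, patient: Dict[str, float]) -> bool:
    """True only if every covariate of the validity domain is present and in range."""
    for name, (low, high) in card.validity.domain.items():
        if name not in patient or not (low <= patient[name] <= high):
            return False
    return True

# Hypothetical card; none of these values come from the paper.
card = TwinIdentityCard(
    object_=ObjectBlock("12-month event risk", "treatment escalation"),
    data=DataBlock(effective_sample_size=850,
                   declared_dependencies=("pretrained encoder",)),
    validity=ValidityBlock(ood_transferability="tested on two external sites",
                           causal_level="local interventional",
                           domain={"age": (40.0, 75.0), "egfr": (45.0, 120.0)}),
    assurance=AssuranceBlock(targeted_level="decision support only",
                             non_deployment_conditions=("covariate outside domain",)),
)

print(in_warranty(card, {"age": 60.0, "egfr": 75.0}))  # inside the declared domain
print(in_warranty(card, {"age": 82.0, "egfr": 75.0}))  # age outside the declared domain
```

Treating a missing covariate as out of warranty is a deliberate choice in this sketch: the twin refuses to simulate when it cannot establish that the patient lies inside the declared domain.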

A digital twin that claims to intervene with a proof of prediction is not a virtual patient. It is a decision disguised as a measurement, and a responsibility that no one has signed. The full paper, with the responsibility table, the dependency audit, and the references, is available below.

Read the document

↓ Download PDF

Key takeaways

Predicting, explaining, simulating, and intervening are four distinct claims. Each commits its own proof and its own responsibility, and a proof of prediction covers none of the other three.
The hierarchy among the four verbs is one of responsibility, not of knowledge. Moving from prediction to intervention does not raise the understanding required; it changes what bears the error, from a metric to a patient.
For the step from prediction to decision, calibration matters more than discrimination. A high AUC says nothing reliable about a probability threshold, and distribution shift degrades calibration first.
Provenance is not transferability. Where a performance comes from explains why it holds or fails; whether it survives distribution shift by subpopulation is what authorizes deployment. The two are decoupled.
Responsibility commands the proof. As in DO-178C avionics, the cost of proof must rise with the cost of the error, not with displayed performance. EU regulation classifies the risk but sets no scale for the legitimacy of simulation.
The deliverable is an identity card for the twin (object, data, validity, assurance): what is claimed, with what, how far, and under what warranty. It is the output of the dependency audit, not a separate idea.