Health digital twins, and the proportionality between claim, proof, and responsibility
A health digital twin that displays a good performance curve has demonstrated exactly one thing: that it predicts. It has not shown that it explains, that it simulates, or that it can be used to intervene on a patient. This article advances a single rule of proportionality: a model’s claims must be commensurate with the proof it supplies and the responsibility it commits, and a proof of prediction satisfies neither when the claim is to intervene. The target is the statistical twin, and the statistical part of the hybrid twin; a mechanistic twin that carries its causal assumptions in its equations is a different evaluation problem.
To predict is to associate an output with an input. To explain is to know why the association holds. To simulate is to produce reliable counterfactuals. To intervene is to act on a real body. These verbs do not form a staircase of knowledge: medicine has intervened for decades without explaining, from aspirin to lithium to general anaesthesia. What separates them is responsibility. A prediction error degrades a metric, an explanation error compromises a model, a simulation error invalidates hypotheses, and an intervention error degrades a patient. The required proof grows heavier at each step because the bearer of the error changes, not because the cognition does. The industry's conflation consists in presenting a system that has supplied the proof for the first verb as if it had supplied it for the others, and the slippage hides inside a verb: the twin that “models,” “anticipates,” or “tests.”
Most twin evaluations over-weight discrimination and starve calibration. The order should be reversed. Discrimination measures whether the model ranks the most at-risk patients above the least at-risk, a property that the AUC or the c-index summarizes. Calibration measures whether an announced twenty percent risk corresponds to one event in five. A clinical decision rests on a probability threshold, not on a rank, and a threshold means something only on calibrated probabilities. A model that discriminates well but calibrates poorly sends the wrong patients across the threshold with deceptive confidence, and calibration is usually what distribution shift degrades first.
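The gap between discrimination and calibration can be made concrete with a minimal sketch on synthetic data. Everything here is an assumption made for illustration: the cohort is simulated, the "distorted" model is simply the true risk raised to the third power (a monotone transformation, so ranks are unchanged), and the helper names `auc` and `expected_calibration_error` are hypothetical, not taken from any library.

```python
import random

def auc(scores, labels):
    """Probability that a random event case is scored above a random non-event case."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # ties count for half
    return wins / (len(pos) * len(neg))

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between announced risk and observed event rate, per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # equal-width bins on [0, 1]
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_p = sum(p for p, _ in b) / len(b)
        event_rate = sum(y for _, y in b) / len(b)
        ece += (len(b) / total) * abs(mean_p - event_rate)
    return ece

# Synthetic cohort (assumption): each patient has a true risk, and the outcome
# is drawn from that risk, so the true risk is calibrated by construction.
rng = random.Random(0)
true_risk = [rng.random() for _ in range(1500)]
labels = [1 if rng.random() < p else 0 for p in true_risk]

calibrated = true_risk
distorted = [p ** 3 for p in true_risk]  # same ranking, different probabilities

auc_calibrated = auc(calibrated, labels)
auc_distorted = auc(distorted, labels)
ece_calibrated = expected_calibration_error(calibrated, labels)
ece_distorted = expected_calibration_error(distorted, labels)

# A decision threshold acts on the probability, not on the rank.
threshold = 0.2
flagged_calibrated = sum(p >= threshold for p in calibrated)
flagged_distorted = sum(p >= threshold for p in distorted)

print("AUC  calibrated / distorted:", auc_calibrated, auc_distorted)
print("ECE  calibrated / distorted:", ece_calibrated, ece_distorted)
print("Flagged at 0.2 calibrated / distorted:", flagged_calibrated, flagged_distorted)
```

Both score vectors rank patients identically, so a discrimination metric cannot tell them apart, while the calibration error and the number of patients crossing the threshold differ sharply. The binned calibration error used here is one common choice among several; a reliability diagram or a Brier score decomposition would serve the same purpose.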
A model whose performance is heavily imported, that is, owed to data or priors from outside the patient cohort, is not for that reason fragile. AlphaFold depends on a gigantic exogenous prior and is more robust than many models trained on a single narrow task. Importing a regularity is often what saves a model when the patient cohort is too small to supply it. Hence a sharp distinction: provenance explains why a model holds or gives way; transferability decides whether it may be deployed. The two dimensions are decoupled, and the decisive criterion is stability under distribution shift, not the purity of the model’s origin. The dependency audit does not purge the exogenous in the name of some endogenous purity; it declares it and tests it where it counts.
A twin rarely claims merely to predict; it claims to simulate, and simulation demands a sufficiently correct causal structure, not just good performance. Even with a strong predictive signal, a twin can be wrong if it simulates on an incorrect structure. Two properties bound the scope of a simulation: transportability, whether a result established in one population holds in another, and invariance, whether a relation survives across environments. Outside the domain where transport and invariance have been shown, a simulation is not wrong by misfortune; it is out of warranty. Most serious therapeutic systems live at the level of local interventional causality over a bounded domain. The fault is to sell that level as generalizable structural causality.
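Invariance, in the sense used above, can be illustrated with a toy check: fit the same relation in two environments and see whether it survives. The sketch below is an assumption-laden illustration, not a validation method. The two cohorts are synthetic, the variable `x` stands for a cause acting through a fixed mechanism, and `z` stands for a downstream marker whose link to the outcome is site-specific; the function names `slope` and `make_environment` are hypothetical.

```python
import random

def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

def make_environment(rng, n, x_spread, z_gain):
    """Synthetic cohort (assumption): y = 2x + noise everywhere; z depends on y
    with a gain that changes from one environment to the next."""
    xs, ys, zs = [], [], []
    for _ in range(n):
        x = rng.gauss(0.0, x_spread)
        y = 2.0 * x + rng.gauss(0.0, 0.5)      # fixed mechanism
        z = z_gain * y + rng.gauss(0.0, 0.5)   # environment-specific marker
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return xs, ys, zs

rng = random.Random(1)
xs_a, ys_a, zs_a = make_environment(rng, 2000, x_spread=1.0, z_gain=1.0)
xs_b, ys_b, zs_b = make_environment(rng, 2000, x_spread=2.0, z_gain=0.2)

# Relation y ~ x: the same mechanism in both environments.
slope_x_a = slope(xs_a, ys_a)
slope_x_b = slope(xs_b, ys_b)

# Relation y ~ z: predictive in each environment, but not the same relation.
slope_z_a = slope(zs_a, ys_a)
slope_z_b = slope(zs_b, ys_b)

print("y ~ x slopes:", slope_x_a, slope_x_b)
print("y ~ z slopes:", slope_z_a, slope_z_b)
```

In each environment taken alone, `z` predicts `y` well, yet the fitted relation changes between environments, whereas the relation on `x` stays put. A twin that simulated on the `z` relation would be operating outside the domain where invariance had been shown.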
Critical engineering already grades proof by the severity of failure: DO-178C defines five assurance levels and scales its verification objectives with criticality. Health regulation classifies the risk of an AI medical device under the AI Act and the MDR/IVDR, but provides no scale for the legitimacy of simulation. The proposed deliverable closes that gap without overreach. It is an identity card for the twin in four blocks: the object, with its predictive and decisional targets; the data, with effective sample size and declared dependencies; the validity, with out-of-distribution transferability, causal level, and the domain beyond which the twin must not simulate; the assurance, with the targeted level and the explicit conditions of non-deployment. Object, data, validity, assurance: what is claimed, with what, how far, and under what warranty. The card is not a standard in force; it makes the spirit of existing regulation operational for the case of twins.
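One way to picture the four blocks is as a small typed record with a gate that refuses simulation outside the declared domain. This is a sketch under stated assumptions, not the card's specification: every class name, field name, and example value below is hypothetical, and the domain is reduced to simple numeric ranges per feature, which a real card would likely need to refine.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TwinObject:
    predictive_target: str   # what the twin predicts
    decisional_target: str   # what decision the prediction feeds

@dataclass(frozen=True)
class TwinData:
    effective_sample_size: int
    declared_dependencies: tuple  # exogenous priors, pretrained components, ...

@dataclass(frozen=True)
class TwinValidity:
    ood_transferability: str      # summary of out-of-distribution testing
    causal_level: str             # e.g. "local interventional, bounded domain"
    simulation_domain: dict       # feature name -> (low, high), assumed numeric

@dataclass(frozen=True)
class TwinAssurance:
    targeted_level: str
    non_deployment_conditions: tuple

@dataclass(frozen=True)
class TwinIdentityCard:
    twin_object: TwinObject
    data: TwinData
    validity: TwinValidity
    assurance: TwinAssurance

def within_simulation_domain(card, patient):
    """True only if every feature of the declared domain is present and in range."""
    for feature, (low, high) in card.validity.simulation_domain.items():
        value = patient.get(feature)
        if value is None or not (low <= value <= high):
            return False
    return True

# Illustrative values only; none of them comes from a real system.
card = TwinIdentityCard(
    twin_object=TwinObject("12-month event risk", "treatment escalation threshold"),
    data=TwinData(effective_sample_size=800,
                  declared_dependencies=("pretrained encoder (hypothetical)",)),
    validity=TwinValidity(
        ood_transferability="tested on one external cohort",
        causal_level="local interventional, bounded domain",
        simulation_domain={"age": (40, 75), "egfr": (30, 120)},
    ),
    assurance=TwinAssurance(
        targeted_level="to be defined",
        non_deployment_conditions=("calibration drift above agreed tolerance",),
    ),
)

in_domain = within_simulation_domain(card, {"age": 62, "egfr": 70})
out_of_domain = within_simulation_domain(card, {"age": 82, "egfr": 70})
missing_feature = within_simulation_domain(card, {"age": 62})
print(in_domain, out_of_domain, missing_feature)
```

The design choice worth noting is that a missing feature is treated as out of domain: the card declares where simulation is warranted, so absence of evidence defaults to refusal rather than to extrapolation.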
A digital twin that claims to intervene with a proof of prediction is not a virtual patient. It is a decision disguised as a measurement, and a responsibility that no one has signed. The full paper, with the responsibility table, the dependency audit, and the references, is available below.