
D7 · Performance evaluation and measurement

Axis III · D7


Thesis. Measuring performance in production is an epistemological problem before it is a metrics problem. A metric without an interpretation frame yields a number that can be brandished in argument but carries no information.

The distinction that cuts

Test performance vs production performance. The first is measured on a distribution fixed by the designer; the second is subject to the distribution imposed by reality. Any gap between the two is attributable to the evaluation design, not to the model.
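
This gap can be made concrete with a toy simulation (all numbers below are illustrative, not from the OCTOPUS study): the same pairwise-ranking AUC is computed on the designer's test distribution and on a shifted "production" distribution.

```python
import random

def auc(pos, neg):
    """Probability that a random positive score outranks a random negative (ties count half)."""
    wins = ties = 0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1
            elif p == n:
                ties += 1
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

random.seed(0)
# "Test" cohort: score distribution fixed by the designer.
test_pos = [random.gauss(2.0, 1.0) for _ in range(500)]
test_neg = [random.gauss(0.0, 1.0) for _ in range(500)]
# "Production" cohort: the real world shifts and widens the score distributions.
prod_pos = [random.gauss(1.0, 1.5) for _ in range(500)]
prod_neg = [random.gauss(0.3, 1.5) for _ in range(500)]

print(f"test AUC       = {auc(test_pos, test_neg):.3f}")
print(f"production AUC = {auc(prod_pos, prod_neg):.3f}")
```

The model's scoring rule is identical in both cases; only the evaluated distribution changed, which is the point: the number reported depends on the cohort it was computed on.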

Typical market error

Reporting an average AUC as if it were an intrinsic property of the model, when it has meaning only for a given cohort, at a given time, under a given event definition. Worse: using aggregate metrics that mask collapses in subpopulations. Algorithmic fairness is not an ethical option but a condition of metrological validity.
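
A minimal sketch of the masking effect, on synthetic data (the subgroups and effect sizes are invented for illustration): an aggregate AUC can look acceptable while one subpopulation is scored at chance level.

```python
import random

def auc(pos, neg):
    """Pairwise-ranking AUC (ties count half)."""
    wins = ties = 0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1
            elif p == n:
                ties += 1
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

random.seed(1)
# Subgroup A: the model discriminates well.
a_pos = [random.gauss(2.0, 1.0) for _ in range(400)]
a_neg = [random.gauss(0.0, 1.0) for _ in range(400)]
# Subgroup B: the scores are pure noise, i.e. the model has collapsed here.
b_pos = [random.gauss(0.0, 1.0) for _ in range(200)]
b_neg = [random.gauss(0.0, 1.0) for _ in range(200)]

overall = auc(a_pos + b_pos, a_neg + b_neg)
print(f"aggregate AUC  = {overall:.3f}")            # looks acceptable
print(f"subgroup A AUC = {auc(a_pos, a_neg):.3f}")
print(f"subgroup B AUC = {auc(b_pos, b_neg):.3f}")  # near chance
```

Only the stratified report reveals that subgroup B receives no discriminative signal at all; the aggregate number hides it.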

Failure signals

A single metric reported per model, with no confidence interval or computation protocol. No stratification by subpopulation (age, sex, comorbidity, centre, period). Calibration never evaluated separately from discrimination. No measurement of the withhold rate or of its clinical correlate. Evaluation metrics chosen after seeing the results, a variant of the garden of forking paths (Gelman & Loken, 2013).
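
The first failure signal (a point estimate with no interval) has a cheap remedy. A nonparametric bootstrap sketch, on toy scores standing in for a held-out cohort: resample patients with replacement, recompute the metric, take the empirical 2.5 % and 97.5 % quantiles.

```python
import random

def auc(pos, neg):
    """Pairwise-ranking AUC (ties count half)."""
    wins = ties = 0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1
            elif p == n:
                ties += 1
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

random.seed(2)
# Toy held-out scores; in practice these come from the frozen evaluation cohort.
pos = [random.gauss(1.2, 1.0) for _ in range(60)]
neg = [random.gauss(0.0, 1.0) for _ in range(150)]

point = auc(pos, neg)

# Percentile bootstrap: resample cases with replacement, recompute the metric.
boots = []
for _ in range(300):
    bp = random.choices(pos, k=len(pos))
    bn = random.choices(neg, k=len(neg))
    boots.append(auc(bp, bn))
boots.sort()
lo, hi = boots[int(0.025 * len(boots))], boots[int(0.975 * len(boots))]
print(f"AUC = {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

The same resampling loop works for any of the stratified or calibration metrics above, as long as the resampling unit is the patient, not the prediction.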

References

TRIPOD+AI statement (Collins et al., BMJ 2024); STARD 2015; CONSORT-AI and SPIRIT-AI; ISO/IEC TS 4213 (assessment of machine learning classification performance); on calibration, Van Calster et al., BMC Medicine 2019; on fairness, Mitchell et al., Model Cards for Model Reporting (FAccT 2019), and Barocas, Hardt & Narayanan, Fairness and Machine Learning (2023); on post hoc metric selection, Gelman & Loken, The Garden of Forking Paths (2013).

Implementation ground

OCTOPUS / ISPOR study: mNSCLC BRAF V600E, n=184, five EU countries, ambispective RWE. The TweenMe pipeline produces 299-feature patient vectors; synthesis via TM-CTGAN/TM-TVAE; TSTR at 95.2 %; OS log-rank p=0.911. Survival modelling uses SurvTRACE (transformer architecture; Wang & Sun 2022, arXiv:2110.00855). The 95.2 % TSTR establishes operational fidelity for downstream tasks; it does not establish general statistical indistinguishability between synthetic and real cohorts, and it does not exempt from per-task validation. Top 5 % poster selection at ISPOR 2026.
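
The TSTR logic itself is simple to sketch. The harness below is generic and deliberately toy-scale (a nearest-centroid classifier on invented 2-feature cohorts, not the TweenMe pipeline or SurvTRACE): train one model on real data, one on synthetic data, evaluate both on the same held-out real cohort, and compare.

```python
import random

random.seed(3)

def cohort(n, mu1):
    """Toy binary cohort: class-1 features centred at mu1, class-0 at 0."""
    xs, ys = [], []
    for y in (0, 1):
        for _ in range(n):
            xs.append([random.gauss(mu1 * y, 1.0), random.gauss(mu1 * y, 1.0)])
            ys.append(y)
    return xs, ys

def centroid_fit(xs, ys):
    """Nearest-centroid classifier: store per-class feature means."""
    cent = {}
    for y in (0, 1):
        pts = [x for x, yy in zip(xs, ys) if yy == y]
        cent[y] = [sum(col) / len(pts) for col in zip(*pts)]
    return cent

def accuracy(cent, xs, ys):
    def d2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    preds = [min((0, 1), key=lambda y: d2(x, cent[y])) for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

real_train = cohort(300, 1.5)
real_test = cohort(300, 1.5)
# Stand-in for a generative model's output: a slightly degraded copy of the real distribution.
synth_train = cohort(300, 1.3)

acc_trtr = accuracy(centroid_fit(*real_train), *real_test)   # train real,      test real
acc_tstr = accuracy(centroid_fit(*synth_train), *real_test)  # train synthetic, test real
print(f"TRTR = {acc_trtr:.3f}, TSTR = {acc_tstr:.3f}, ratio = {acc_tstr / acc_trtr:.1%}")
```

A TSTR ratio near 100 % says only that the synthetic data supports this downstream task as well as the real data does; as the paragraph above insists, it says nothing about distributional indistinguishability, and each new task needs its own check.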

Articulation

Feeds D3 with the reference metrology for drift detection, and conditions D4. Without a defensible metrology, the regulatory file is documented fiction.