No clinical AI should be scaled without independent health-economic evidence
Doctrinal position on the gap between the spectacular technical performance of AI diagnostic systems and the persistent absence of independent health-economic evidence of their systemic benefit.
Microsoft AI unveiled MAI-DxO, a multi-agent “orchestrator” that solved 85% of 304 complex NEJM clinical cases, about four times the score of a physician control group, at lower notional cost. The system operates as a virtual panel of specialists that asks questions, orders tests, and self-corrects before naming a diagnosis.
Three advances deserve attention: sequential, cost-aware reasoning that goes beyond the multiple-choice benchmarks LLMs already dominate; a built-in cost signal in which every test carries a CPT price tag, targeting the $100 billion per year over-testing problem in the United States alone; and simultaneous breadth and depth of coverage that no single clinician can match.
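The cost signal can be illustrated with a minimal sketch. It assumes a hypothetical price table (the test names and dollar amounts below are invented for illustration; they are not real CPT codes or fees, and this is not MAI-DxO's actual interface). The sketch only shows the general idea: each ordered test adds to a running total, and an order that would exceed the budget is refused.

```python
from dataclasses import dataclass, field

# Hypothetical price table: names and amounts are illustrative only,
# not real CPT codes or fees.
HYPOTHETICAL_PRICES = {
    "complete_blood_count": 30.0,
    "chest_xray": 120.0,
    "ct_abdomen": 900.0,
}


@dataclass
class CostAwareWorkup:
    """Tracks cumulative spend while tests are ordered one at a time."""

    budget: float
    spent: float = 0.0
    ordered: list = field(default_factory=list)

    def order(self, test: str) -> bool:
        """Order a test if it fits the remaining budget; return True if ordered."""
        price = HYPOTHETICAL_PRICES[test]
        if self.spent + price > self.budget:
            return False  # too expensive for what is left of the budget
        self.spent += price
        self.ordered.append(test)
        return True


workup = CostAwareWorkup(budget=200.0)
workup.order("complete_blood_count")       # accepted, 30.0 spent
workup.order("chest_xray")                 # accepted, 150.0 spent
ct_accepted = workup.order("ct_abdomen")   # refused, would exceed 200.0
print(workup.ordered, workup.spent, ct_accepted)
```

A hard budget cap is only one possible design; a real system could instead weigh each test's price against its expected diagnostic value.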
The analysis reveals four structural biases: sampling bias (rare, “signal-rich” NEJM cases are not representative of primary-care prevalence), hindsight bias (solved cases retrofitted into dialogues may leak textual cues), baseline bias (21 off-specialty GPs constitute a fragile yardstick for “superhuman” claims), and model-on-model bias (Microsoft’s own LLM grades its sibling system, with shared incentives and blind spots).
Three dimensions remain outside the benchmark scope but lie at the heart of clinical practice: text-only semiology with no imaging, auscultation or tactile signs; a post-anamnesis starting point on pre-digested vignettes that bypasses the complexity of real history-taking; and no confrontation with the inconsistency, emotion and non-verbal cues of real patients.
The fundamental question is twofold. Will AI genuinely outperform, or at best augment, human clinicians and integrate efficiently into the healthcare value chain? And will it reduce system-wide costs and improve outcomes, or become a Trojan horse that lets hyperscalers siphon public-health budgets?
Health-economic evaluation must be integrated into every AI-for-health initiative from its inception.
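As one illustration of what such an evaluation can compute, the sketch below implements the incremental cost-effectiveness ratio (ICER), a standard health-economic measure: the extra cost of a new strategy divided by the extra health effect it delivers relative to a comparator. The function name and the figures in the usage example are hypothetical and come from no study; the effect unit (QALYs here) is also an assumption.

```python
def icer(cost_new: float, effect_new: float,
         cost_old: float, effect_old: float) -> float:
    """Incremental cost-effectiveness ratio: extra cost per extra unit of effect."""
    delta_cost = cost_new - cost_old
    delta_effect = effect_new - effect_old
    if delta_effect == 0:
        # The ratio is undefined when both strategies deliver the same effect.
        raise ValueError("ICER is undefined when effects are equal")
    return delta_cost / delta_effect


# Hypothetical figures for illustration only (cost per patient, effect in QALYs).
ratio = icer(cost_new=12000.0, effect_new=8.5,
             cost_old=10000.0, effect_old=8.0)
print(ratio)  # 4000.0 per additional QALY
```

The ratio only becomes meaningful when compared against a willingness-to-pay threshold set by the payer, and a negative value needs care because it can mean either a cheaper and better strategy or a costlier and worse one.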