AI in Healthcare: Impressive Progress, Missing Proof

Doctrinal position on the gap between the spectacular technical performance of AI diagnostic systems and the persistent absence of independent health-economic evidence of their systemic benefit.

The MAI-DxO case

Microsoft AI unveiled MAI-DxO, a multi-agent “orchestrator” that solved 85% of 304 complex NEJM clinical cases, four times the score of a physician control group, and did so at lower notional cost. The system operates as a virtual panel of specialists that asks questions, orders tests, and self-corrects before naming a diagnosis.

What is remarkable

Three advances deserve attention: sequential, cost-aware reasoning that goes beyond the multiple-choice benchmarks LLMs already dominate; a built-in cost signal in which every test carries a CPT price tag, attacking the over-testing problem, estimated at $100 billion per year in the United States alone; and simultaneous breadth and depth of coverage that no single clinician can match.
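To make the idea of a built-in cost signal concrete, here is a minimal sketch of how a sequential benchmark could score an episode on both accuracy and spend. It is not MAI-DxO's implementation: the `Episode` structure, the `summarize` helper, the test names, and the prices are hypothetical placeholders. Only the principle that each ordered test carries a price comes from the description above.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical price list: illustrative figures, not real CPT prices.
PRICE_USD = {"cbc": 30.0, "chest_xray": 120.0, "ct_abdomen": 900.0}


@dataclass
class Episode:
    """One simulated case: tests ordered in sequence, then a final diagnosis."""
    truth: str
    ordered_tests: list = field(default_factory=list)
    diagnosis: Optional[str] = None

    def order(self, test: str) -> None:
        # Record each test as it is ordered, so cost accrues step by step.
        self.ordered_tests.append(test)

    def cost(self) -> float:
        # Cumulative cost of every test ordered in this episode.
        return sum(PRICE_USD[t] for t in self.ordered_tests)

    def correct(self) -> bool:
        return self.diagnosis == self.truth


def summarize(episodes):
    """Return (accuracy, mean cost per case) over a list of episodes."""
    n = len(episodes)
    accuracy = sum(e.correct() for e in episodes) / n
    mean_cost = sum(e.cost() for e in episodes) / n
    return accuracy, mean_cost
```

Reporting accuracy and mean cost side by side, rather than accuracy alone, is what separates this kind of evaluation from a multiple-choice benchmark: a system that reaches the right answer by ordering every available test is penalized on the second number.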

Methodological biases to identify

The analysis reveals several structural biases: sampling bias (rare, “signal-rich” NEJM cases are not representative of primary-care prevalence), hindsight bias (solved cases retro-fitted into dialogues may leak textual cues), baseline bias (21 off-specialty GPs constitute a fragile yardstick for “superhuman” claims), and model-on-model bias (Microsoft’s own LLM grades its sibling system, with shared incentives and blind spots).

Structural clinical limitations

Three dimensions remain outside the benchmark scope but lie at the heart of clinical practice: text-only semiology with no imaging, auscultation or tactile signs; a post-anamnesis starting point on pre-digested vignettes that bypasses the complexity of real history-taking; and no confrontation with the inconsistency, emotion and non-verbal cues of real patients.

The health-economic question

The fundamental question is twofold. First, will AI genuinely outperform human clinicians, or at best augment them, and can it integrate efficiently into the healthcare value chain? Second, will it reduce system-wide costs and improve outcomes, or become a Trojan horse that lets hyperscalers siphon public-health budgets?

Health-economic evaluation must be integrated into every AI-for-health initiative from its inception.
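One standard way to frame such an evaluation is the incremental cost-effectiveness ratio (ICER): the extra cost per extra unit of health outcome, often measured in quality-adjusted life years (QALYs), when a new care pathway is compared with usual care. The sketch below is illustrative only. The function names and every figure in the usage note are hypothetical and are not drawn from the position paper or from any study.

```python
def icer(cost_new, cost_old, effect_new, effect_old):
    """Incremental cost-effectiveness ratio: extra cost per extra unit of effect."""
    delta_effect = effect_new - effect_old
    if delta_effect == 0:
        raise ValueError("no difference in effect: ICER is undefined")
    return (cost_new - cost_old) / delta_effect


def net_monetary_benefit(delta_cost, delta_effect, willingness_to_pay):
    """Incremental net monetary benefit at a given willingness-to-pay threshold.

    A positive value means the health gain, valued at the threshold,
    exceeds the extra cost.
    """
    return willingness_to_pay * delta_effect - delta_cost
```

With made-up inputs, say an AI-assisted pathway costing 12,000 per patient against 10,000 for usual care, and yielding 8.5 QALYs against 8.0, `icer(12000, 10000, 8.5, 8.0)` gives 4,000 per QALY gained. At a hypothetical threshold of 30,000 per QALY, `net_monetary_benefit(2000, 0.5, 30000)` is positive. The point of the exercise is that both numbers depend on independently measured costs and outcomes, which a diagnostic benchmark score alone does not supply.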

Key takeaways

Starting point: MAI-DxO (Microsoft AI) scores 85% on 304 complex NEJM cases — four times the physician control group score, at lower notional cost.
Identified methodological biases: sampling bias (rare NEJM cases vs. primary-care prevalence), hindsight bias (textual cues in solved cases), baseline bias (21 off-specialty GPs), model-on-model bias (Microsoft's own LLM evaluates its sibling system).
Structural clinical limitations: text-only semiology (no imaging, auscultation, tactile signs), post-anamnesis starting point on pre-digested vignettes, no confrontation with real-patient inconsistency, emotion, and non-verbal cues.
Unresolved health-economic question: will AI reduce system-wide costs and improve outcomes, or become a Trojan horse letting hyperscalers siphon public-health budgets?
Core thesis: health-economic evaluation must be integrated into every AI-for-health initiative from inception — no clinical AI should be scaled without an independent study demonstrating net benefit for the common good.