Article · Position paper · AI & Healthcare · 2025

Bayesian: the word that buys regulatory credibility without earning it.

Why medical AI calls itself Bayesian — and what it actually delivers.

"Most Bayesian implementations in healthcare are not truly Bayesian in the epistemological sense. It is frequentism dressed up as a probabilistic graph. The result looks Bayesian. It sells as Bayesian. It does not have the properties."

Browse the pitch decks of health AI start-ups. Read the CE marking dossiers. Attend the medtech conferences.

One word comes up with troubling regularity: Bayesian.

"Our model is Bayesian." Implying: rigorous, transparent, uncertainty-quantifying, regulator-approved. Implying: we are not a black box. We are serious statistics.

Before asking whether it really is Bayesian — let us first meet Reverend Bayes.

What Bayesianism actually is

Thomas Bayes, an 18th-century Anglican minister, formulated something simple and revolutionary: probability is not a frequency. It is a state of belief. A measure of our uncertainty about the world — not a property of the world itself.

Formally: P(A|B) — the probability of A given B. If you observe B, to what extent should you revise your belief about A? Bayes' theorem answers this question with brutal elegance:

P(A|B) = P(B|A) × P(A) / P(B)

Three terms. Three concepts.

P(A) is the prior — what you believed before observing anything. P(B|A) is the likelihood — the probability of observing B if A is true. P(A|B) is the posterior — what you should believe now that you have seen B.

It is an engine for updating knowledge. You start with an initial belief. You observe data. You revise. You observe more. You revise again. At every step, your uncertainty is explicit, quantified, propagated.
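That loop fits in a few lines of code. A minimal sketch in plain Python; the prevalence, sensitivity and false-positive rate below are invented purely for illustration:

```python
def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """One turn of the engine: prior belief -> posterior belief."""
    evidence = likelihood_if_true * prior + likelihood_if_false * (1 - prior)
    return likelihood_if_true * prior / evidence  # P(A|B)

# Start from a 1% prevalence, then observe two positive tests in a row
# (sensitivity 0.9, false-positive rate 0.05 -- illustrative numbers).
belief = 0.01
for _ in range(2):
    belief = bayes_update(belief, 0.9, 0.05)

print(round(belief, 3))  # -> 0.766: belief rises with each observation
```

Each pass through the loop takes the previous posterior as the new prior: the revision is explicit, repeatable, and auditable.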

In medicine, this philosophy is natural. A clinician examining a patient performs intuitive Bayesian reasoning: starting from a pre-test probability (the prior — epidemiology, clinical presentation), observing a test result (the likelihood), and revising the diagnosis (the posterior). The positive or negative likelihood ratio is nothing other than the Bayes factor dressed in medical language.
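The clinician's likelihood-ratio reasoning can be written out in the odds form of Bayes' theorem. A sketch with illustrative numbers (a 10% pre-test probability and a test with a positive likelihood ratio of 8):

```python
def posttest_probability(pretest_prob, likelihood_ratio):
    """Pre-test probability -> post-test probability, odds form of Bayes."""
    pretest_odds = pretest_prob / (1 - pretest_prob)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)

# Illustrative: 10% pre-test probability, test with LR+ = 8.
p = posttest_probability(0.10, 8.0)
print(round(p, 2))  # -> 0.47
```

The likelihood ratio multiplies the odds, not the probability; the conversion at each end is what makes the two formulations equivalent.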

The promise of Bayesian AI in healthcare is therefore precise: encode this reasoning formally, propagate uncertainty at every step, produce not a score — but a distribution. Not "87% probability" — but "87% probability, with ±12% uncertainty given the quality of data available for this patient."

It is a serious promise. It deserves to be taken seriously — precisely because it is rarely kept.

The real virtues — let us be honest

In medical AI, this philosophy has legitimate, documented applications.

Rare data first. In rare cancers, orphan diseases, paediatrics — cohorts are small. Too small to train a deep neural network without massive overfitting. These are HDLSS (High Dimension, Low Sample Size) situations. Bayesian methods allow incorporation of priors from the literature, from previous trials, from expert opinion — compensating for lack of data with structured knowledge. This is a real strength, not a marketing argument.

Uncertainty quantification next. A Bayesian model does not produce a point estimate — it produces a distribution. It says not "87% probability of sepsis" but "between 74% and 96%, depending on the quality of data available for this patient." This distinction is fundamental: there are two types of uncertainty that most models conflate. Aleatoric uncertainty — linked to intrinsic data noise, irreducible — and epistemic uncertainty — linked to the model's ignorance, reducible with more data or a better model. A rigorous Bayesian framework distinguishes them and explicitly propagates the latter into predictions. This distribution is directly actionable for the clinician: it tells them when to trust the model and when to seek additional input.
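That distinction can be made concrete with the simplest possible Bayesian model: a Beta posterior on an event probability, whose credible interval narrows as evidence accumulates. A sketch using a flat Beta(1, 1) prior and a normal approximation to the posterior; the counts are illustrative:

```python
import math

def beta_posterior_interval(successes, n, z=1.96):
    """Approximate 95% credible interval for an event probability under a
    Beta(1, 1) prior, via a normal approximation to the Beta posterior."""
    a, b = 1 + successes, 1 + n - successes
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    half = z * math.sqrt(var)
    return mean - half, mean + half

# Same observed rate, very different epistemic uncertainty:
lo_small, hi_small = beta_posterior_interval(successes=87, n=100)
lo_large, hi_large = beta_posterior_interval(successes=8700, n=10_000)
print((round(lo_small, 2), round(hi_small, 2)))  # wide interval
print((round(lo_large, 2), round(hi_large, 2)))  # narrow interval
```

The width of the interval is the epistemic part: it shrinks with more data, while the aleatoric noise in any single patient's outcome does not.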

Integration of causal knowledge last. A Bayesian network can explicitly encode known pathophysiological relationships — "fever influences CRP, which influences the sepsis score" — as a directed graph. The model does not discover these relationships from data: it integrates them as structural constraints. In domains where causality is well established, this is a decisive advantage over a neural network that learns correlations without distinguishing cause and effect.

The FDA and EMA understood this. Their guidances on Bayesian methods in clinical trials — 2010 and 2018 respectively — explicitly recognise the value of this approach for small populations, rare diseases and medical devices with limited data.

Bayesianism in healthcare thus has genuine legitimacy. It is not an archaism. It is not a fad.

Which is what makes the rest all the more uncomfortable.

The DAG, the causal graph — and the fatal confusion

A Bayesian network is built around a DAG — a directed acyclic graph — in which each arrow says: "this variable influences that one."

Each node is a variable. Each edge is a probabilistic dependency. The entire structure encodes a joint probability distribution over all variables — enabling efficient calculation of conditional probabilities: P(sepsis | fever, elevated CRP, hypotension).
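The factorisation is easy to see on a toy version of the article's own example. The chain below (fever influences CRP, which influences sepsis) uses invented conditional probability tables, and the query is answered by brute-force enumeration over the hidden node:

```python
# Illustrative chain DAG: Fever -> HighCRP -> Sepsis, with made-up CPTs.
P_fever = {True: 0.2, False: 0.8}
P_crp_given_fever = {True: {True: 0.7, False: 0.3},
                     False: {True: 0.1, False: 0.9}}
P_sepsis_given_crp = {True: {True: 0.4, False: 0.6},
                      False: {True: 0.02, False: 0.98}}

def joint(fever, crp, sepsis):
    """The DAG factorises the joint: P(F) * P(C|F) * P(S|C)."""
    return (P_fever[fever] * P_crp_given_fever[fever][crp]
            * P_sepsis_given_crp[crp][sepsis])

def conditional(sepsis_value, fever_value):
    """P(sepsis | fever), summing out the unobserved CRP node."""
    num = sum(joint(fever_value, crp, sepsis_value) for crp in (True, False))
    den = sum(joint(fever_value, crp, s)
              for crp in (True, False) for s in (True, False))
    return num / den

print(round(conditional(True, True), 3))   # P(sepsis | fever)     -> 0.286
print(round(conditional(True, False), 3))  # P(sepsis | no fever)  -> 0.058
```

Real implementations replace enumeration with message passing, but the object being queried is the same factorised joint distribution.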

It is elegant. It is powerful. And this is where the confusion begins.

An edge in a Bayesian network does not say: "A causes B." It says: "if I observe A, I must revise my belief about B." It is a statistical relationship — a dependency in the data. Not a statement about the world.

The distinction is the work of Judea Pearl — 2011 Turing Award winner — who formalised what epidemiologists knew intuitively but could not state precisely. Pearl distinguishes three levels of reasoning:

Observe: P(B|A). If I see A, what can I say about B? This is the level of correlation. Of classical statistics. Of the Bayesian network.

Intervene: P(B|do(A)). If I act on A — if I fix it through external intervention — what happens to B? This is the level of the randomised controlled trial. Of causality.

Imagine: If A had been different, would B have changed? This is the level of the counterfactual. Of medical liability. Of personalised medicine.

A Bayesian network operates at the first level. A causal graph operates at all three.

The difference is not technical. It is epistemological.

Concrete example. Smoking causes lung cancer. Smoking also yellows fingers. In clinical data, yellow fingers and lung cancer are therefore correlated — not because one causes the other, but because they share a common cause.

Smoking → Lung cancer

Smoking → Yellow fingers

A Bayesian network learned from this data will detect the correlation between yellow fingers and cancer and include an edge between them. This is statistically justified. It is causally wrong.

Intervening on yellow fingers — bleaching them — will have no effect on cancer risk. The Bayesian network, queried about this intervention, gives the wrong answer. The causal graph, which encodes the fork structure with smoking as common cause, answers correctly: the edge between yellow fingers and cancer is a correlation illusion, not an influence relationship.

This is precisely the distinction between observing and intervening — Pearl's first and second levels. A Bayesian network operates at the first. It cannot answer second-level questions.
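The two levels can be computed side by side on the fork above. The probabilities are invented; the point is structural: conditioning on yellow fingers changes the cancer probability, while intervening on them (cutting the incoming edge, in the spirit of Pearl's do-operator) leaves it at baseline:

```python
# Fork: Smoking -> Cancer, Smoking -> YellowFingers (made-up numbers).
P_smoke = 0.3
P_cancer_given_smoke = {True: 0.15, False: 0.01}
P_yellow_given_smoke = {True: 0.6, False: 0.05}

def p_smoking(s):
    return P_smoke if s else 1 - P_smoke

def p_cancer_observing_yellow():
    """Level 1 -- P(cancer | yellow): conditioning makes smoking informative."""
    num = sum(p_smoking(s) * P_yellow_given_smoke[s] * P_cancer_given_smoke[s]
              for s in (True, False))
    den = sum(p_smoking(s) * P_yellow_given_smoke[s] for s in (True, False))
    return num / den

def p_cancer_do_yellow():
    """Level 2 -- P(cancer | do(yellow)): the edge into YellowFingers is cut,
    smoking keeps its marginal, so the intervention changes nothing."""
    return sum(p_smoking(s) * P_cancer_given_smoke[s] for s in (True, False))

print(round(p_cancer_observing_yellow(), 3))  # -> 0.127, inflated by the common cause
print(round(p_cancer_do_yellow(), 3))         # -> 0.052, baseline risk, unchanged
```

A Bayesian network queried naively returns the first number for both questions; only the causal graph knows that bleaching fingers buys nothing.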

Most start-ups sell causality. They deliver correlation. In a formalism that looks enough like one for nobody to demand the other.

One further detail worsens the diagnosis. A DAG can be built in two ways: specified by hand by a clinical expert, who encodes their hypotheses about relationships between variables — or learned automatically from statistical dependencies observed in the data. In the first case, edges may reflect real causality if the expert knows it. In the second, a DAG learned solely from data supports no valid causal interpretation — it encodes correlations, nothing more. The yellow fingers example applies in both situations: through ignorance in the first case, by construction in the second.

Is it really Bayesian?

Let us return to the question posed at the opening.

In the majority of implementations encountered in health start-ups — the honest answer is: no.

Not entirely. Not in the sense that Bayes, Laplace or Pearl would understand it.

Here is what actually happens.

The graph structure — the DAG, the nodes, the edges — is drawn by hand by a clinical expert or biostatistician. This is already a strong assumption: one presumes to know a priori which variables influence which others. In well-understood pathologies, this is defensible. In complex, multifactorial, poorly elucidated pathologies — it is structured fiction.

The probabilities in the nodes are then estimated by maximum likelihood on the training data. That is: by a classical frequentist method. One seeks the parameters that maximise the probability of observing the data — with no prior, no posterior distribution, no uncertainty propagation.

The result looks Bayesian. It calls itself Bayesian. It is not.

It is frequentism in a Bayesian costume.

The difference is not cosmetic. It is operational.

A true Bayesian model propagates uncertainty through the graph. If I am uncertain about a parameter value — because my training data is scarce, or because the patient before me bears little resemblance to the training population — that uncertainty ripples through to the output. The clinician receives a distribution, not a score.

A disguised frequentist model propagates nothing. It produces "87% probability of sepsis" with the same confidence on ten thousand patients as on ten. Epistemic uncertainty is invisible. It is hidden behind a number that looks precise.
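The operational difference is easy to demonstrate. Maximum likelihood returns the same point estimate from ten patients as from ten thousand; a Beta posterior (flat prior here, counts illustrative) does not hide the difference:

```python
import math

def mle_estimate(events, n):
    """Maximum likelihood: a bare point estimate, blind to sample size."""
    return events / n

def posterior_sd(events, n):
    """Standard deviation of the Beta(1, 1)-prior posterior: the epistemic
    uncertainty, which shrinks only as evidence accumulates."""
    a, b = 1 + events, 1 + n - events
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

for events, n in ((9, 10), (9000, 10_000)):
    # Same point estimate, very different epistemic spread.
    print(mle_estimate(events, n), round(posterior_sd(events, n), 4))
```

The frequentist pipeline ships only the first column; the second column, the one the clinician actually needs, is discarded before it is ever computed.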

This is precisely what the FDA and EMA guidances on Bayesian methods seek to guarantee — rigorous uncertainty propagation in limited-data contexts. Which most commercial implementations do not provide.

They benefit from the regulatory aura of Bayesianism without meeting its requirements. It is an unearned transfer of legitimacy.

Why does nobody say so? Because the regulator rarely verifies implementation at the inference level. Because the clinician lacks the tools to detect it. And because the development team — here we come to it — sometimes lacks the expertise to do otherwise.

The HR deficit nobody names

Let us ask the question directly. Why do so many health start-ups arrive at the same architectural conclusion?

The official answer: Bayesian methods are explainable, rigorous, accepted by regulators. As we have seen — this is partially true.

The real answer: it is what they can recruit.

A senior data scientist proficient in transformer architectures, fine-tuning medical language models, uncertainty calibration via conformal prediction, federated learning on FHIR data — costs between €80,000 and €140,000 in France. Leaves for London, Zurich or San Francisco as soon as a serious offer appears. And does not exist in sufficient numbers on the market.

A biostatistician trained in logistic regression, survival models, Bayesian networks — exists. Graduates in volume from biostatistics master's programmes, pharmacy schools, CROs. Costs €45,000 to €60,000. Knows the regulatory vocabulary. Knows how to talk to physicians.

The architectural choice then becomes an HR choice disguised as a technical choice.

This is not an accusation of bad faith. It is a market constraint. Start-ups work with what they find — and what they find is solid, honest, competent biostat profiles. The problem is not their competence. The problem is the drift.

Because a competent frequentist biostatistician knows how to build well-calibrated models, interpret confidence intervals, handle missing data. These skills are real and valuable. But transposing them into a Bayesian formalism without mastering rigorous Bayesian inference — without knowing how to specify an informative prior, without understanding posterior sensitivity to prior choice, without being able to diagnose the machinery that actually computes the posterior — produces exactly what we have described. Stan and PyMC will run the sampler; they will not choose the prior, check convergence or test that sensitivity for you. No spreadsheet will either.

Dressed-up frequentism. Sold as Bayesian.
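Posterior sensitivity to the prior, in particular, is not an abstract worry. With a conjugate Beta-Binomial model (counts and priors invented for illustration), the same ten patients yield very different posteriors under a flat prior and under a sceptical informative one:

```python
def beta_posterior_mean(prior_a, prior_b, events, n):
    """Posterior mean under a Beta(prior_a, prior_b) prior, binomial data."""
    return (prior_a + events) / (prior_a + prior_b + n)

# Eight responders out of ten patients, under two different priors:
flat = beta_posterior_mean(1, 1, 8, 10)        # uninformative Beta(1, 1)
skeptical = beta_posterior_mean(2, 18, 8, 10)  # informative: ~10% expected
print(round(flat, 2), round(skeptical, 2))     # -> 0.75 0.33
```

With ten patients, the prior dominates. A team that has never run this comparison does not know which of the two numbers its model would have reported — and that is the competence gap in miniature.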

Hidden technical debt

The start-up delivers something that works under development conditions. It does not deliver something that will work under real deployment conditions — different population, evolving data, out-of-distribution cases.

The debt is invisible at delivery. It reveals itself over time.

A frozen prior is a bias that sleeps. A causal structure imposed in 2021 on a Parisian teaching hospital population will be silently wrong in 2026 on a rural general practice population. Nobody in the start-up has the expertise — or often the mandate — to evolve the model. Rigorous Bayesian maintenance is a rare profession. It is not on the organisational chart.

What was sold as a durable architecture because it is transparent is in reality a frozen architecture because it is understaffed.

What it costs — and to whom

Let us return to what matters.

A frequentist model disguised as Bayesian, hand-specified by a competent biostatistician not trained in rigorous Bayesian inference, deployed on a population different from the training population, without drift detection procedures, without planned architectural maintenance — this model will be wrong.

Silently. Progressively. Without alarm.

It will not say "I am outside my domain of validity." It will produce "84% probability of rehospitalisation" for a patient on whom it has no statistical legitimacy. With the same apparent confidence as for the ten thousand patients in the training cohort.

And the clinician, informed that the model is "Bayesian — therefore rigorous, therefore explainable" — will trust it.

This is not a hypothetical scenario. It is the exact mechanism by which correctly certified medical algorithms at time T become dangerous at T+18 months.

One more point worsens the diagnosis. A model's overall performance can remain acceptable while its calibration has degraded — calibration being the correspondence between predicted probability and actual event frequency. A poorly calibrated model displays a precise number. It does not say that this number is wrong for the subgroup to which the patient before you belongs. This is precisely what the clinician cannot detect without specific tooling.
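Calibration can be audited with nothing more than binning. A minimal sketch; the predictions and outcomes are synthetic, chosen to show a model that looks precise and is wrong:

```python
def calibration_by_bin(predictions, outcomes, n_bins=5):
    """Compare mean predicted probability to observed event frequency per bin.
    A well-calibrated model has the two close in every occupied bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(predictions, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    report = []
    for cell in bins:
        if cell:
            mean_pred = sum(p for p, _ in cell) / len(cell)
            freq = sum(y for _, y in cell) / len(cell)
            report.append((round(mean_pred, 2), round(freq, 2)))
    return report

# A model that predicts 0.8 for events occurring 50% of the time
# is precise-looking but miscalibrated:
preds = [0.8] * 10
obs = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
print(calibration_by_bin(preds, obs))  # -> [(0.8, 0.5)]
```

Run per clinical subgroup rather than globally, this is exactly the tooling the paragraph above says the clinician lacks.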

Drift — dataset shift in the methodological literature — is documented. Under-detection of this drift in deployed systems is documented too. Three vectors feed it silently: evolution of patient populations, changes in medical practice, transformation of clinical information systems. None is exceptional. All are inevitable.
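A first-line drift detector for a single feature needs only the two-sample Kolmogorov-Smirnov statistic. A sketch in plain Python with invented lab values; production monitoring would track many features and apply proper significance thresholds:

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    two empirical CDFs. A cheap first-line drift alarm on one feature."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in sorted(set(a) | set(b)))

# Training-era lab values vs deployment-era lab values (illustrative):
train = [4.1, 4.3, 4.0, 4.2, 4.4, 4.1, 4.3, 4.2]
deployed = [5.0, 5.2, 4.9, 5.1, 5.3, 5.0, 5.2, 5.1]
d = ks_statistic(train, deployed)
print(round(d, 2))  # -> 1.0: the populations no longer overlap at all
```

None of this requires retraining anything: it is monitoring, not modelling — which is precisely why its absence is an organisational choice, not a technical constraint.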

The cost is not borne by the start-up. It delivered what was in the specification. The CE marking was obtained. The investor is satisfied.

The cost is borne by the patient who receives an erroneous recommendation. By the clinician who trusted a number that looked rigorous. By the health system that purchased a solution presented as perennial and will have to replace it.

And by the next honest project leader — who arrives with a deep learning model that is better calibrated, better validated, better monitored — and is met with: "but is it explainable?"

The loop is closed.

What must be demanded

Not necessarily deep learning. Bayesian methods have real virtues in specific contexts — rare data, established causal knowledge, well-defined populations. These virtues deserve to be used honestly.

Not necessarily frequentism. Classical methods have their place — provided they are not sold for what they are not.

What must be demanded is the same as for any medical device: independent clinical validation on the target population, not on the development cohort; honest characterisation of uncertainty, not a point score presented as a distribution; a drift monitoring procedure, not a certification frozen at the deployment date; a team capable of maintaining the model over time, not merely delivering it.

And an honest answer to the question posed at the opening: is it really Bayesian? Or is it frequentism in costume — useful, perhaps, but not what is on the label?

Health AI does not need false Bayesianism to be legitimate. It needs rigour. Transparency about what it actually is. And professionals honest enough to tell their clients — and their investors — what they really deliver.

This is not a critique of ambition. It is a demand for precision. In a medicine that readily calls itself "precision medicine" — precision is not optional.

References

(1) Pearl, J. (2009). Causality: Models, Reasoning, and Inference. Cambridge University Press.

(2) FDA Guidance (2010). Guidance for the Use of Bayesian Statistics in Medical Device Clinical Trials.

(3) EMA Reflection Paper (2018). Use of Bayesian methods in clinical studies.

(4) Gelman, A. et al. (2013). Bayesian Data Analysis. CRC Press.