Position paper · Open access

How Do You Attribute Performance in a Complex Socio-Technical Pipeline?

Drug discovery is a chain, not an act: you do not attribute a chain to one of its links, you distribute it

Jérôme Vetillard · Twingital Institute · 11 pages · 7 min read

Why a Pivotal Trial Cannot Tell Whether AI Discovers Drugs

A Phase III readout measures a molecule, not the process that produced it. This is the first reason the 2026 verdict on AI drug discovery is mis-framed: it asks for an author when it should ask for a distribution. The fifteen or so AI-originated compounds entering pivotal Phase III will establish whether each of these survivors holds, not whether the method that generated it constitutes a reproducible industrial capability. Drug discovery is not an act; it is a chain, and one does not attribute a chain to one of its links.

The question is ill-posed twice over. In substance, because a trial arbitrates a candidate, not a procedure. In form, because “did the AI discover?” seeks an origin where the operative question is “what share of the performance belongs to which component?”. The expectation of a moment of truth belongs to a specific audience: specialized press, a few influencers, a fraction of funds counting clinical programs (on the order of 173, around fifteen in Phase III) and reading a promise into the volume. R&D leadership knows a pivotal trial does not adjudicate a method, and seasoned investors reason in risk-adjusted value, not in validation narrative.

What “Capability” Must Mean Before the Word Is Usable

A capability is the reproducible property of a process producing a measurable advantage within a declared domain of use. Three conditions, none optional: reproducible (not a one-off), measurable (not a narrative), domain-declared (not general by default). A capability without a declared domain is not a capability; it is an average. A platform excellent on kinases may be mediocre elsewhere, which is why the pertinent question is never “can AI discover drugs” but “can this platform discover, within this declared domain, at a known cost”.

This is the substitutability discipline applied to digital twins: a system is worth something only inside its stated domain of validity, and extrapolating beyond it is a bias, not a generalization.

Why “AI-Discovered Drug” Names Nothing Measurable

“Discovered by AI” is a communication term, not an operational category. It aggregates contributions that share neither the same point of application nor the same causal weight, across three non-commensurable tiers: discovery (target identification and validation), design (de novo generation, lead optimization), and development (ADMET prediction, biomarker discovery, patient stratification, trial-protocol optimization). Repositioning of known compounds cuts across all three. A platform that identifies a novel target and one that reorders a library of lead series do not do the same job, and their success is not counted in the same currency. Treating “AI drugs” as one population is the first error, because the label fuses three disjoint populations into a false unit.

Why the Phase III Population Cannot Be Compared Naively

A molecule in Phase III is not a sample; it is a survivor. Base rates are blunt: across indications, the probability of moving from Phase II to Phase III is on the order of 30%, falling below 20% in neurology (Wong, Siah and Lo, Biostatistics, 2019). The signal of a platform does not simply dilute across filters, because modern pipelines are iterative, not sequential: algorithmic output informs a human decision, which prompts a new query, which redirects optimization.

The right notion is therefore entanglement, not dilution. Entanglement here means non-separability: a component’s marginal contribution depends on the realized values of the others, because algorithmic outputs and human decisions condition one another by iteration. Two entangled quantities do not subtract. This is what obstructs naive attribution, and it leads to a subtler point about survivorship bias. If a platform’s intrinsic value is precisely to eliminate bad candidates better upstream, then the over-representation of survivors is not an artifact to correct; it is a mediator, a link on the causal path of the very effect one seeks. We do not yet know, as things stand, whether survival acts as confounder or as mediator, and that ambiguity is the identification obstacle, stated properly.

Within Which Causal Framework Attribution Becomes Meaningful

To say AI “does not directly cause” a better molecule has meaning only inside a specified causal framework. The potential-outcomes framework (Rubin, 1974) defines a component’s contribution as a counterfactual contrast: the quantity of interest with the component, minus the same quantity without it, on a comparable unit. The directed-graph framework (Pearl, 2009) determines whether that contrast is even identifiable from observed data, distinguishing confounders, mediators and colliders. One defines the estimand; the other rules on its identifiability. Articulating them in a real industrial pipeline is an open research problem, not a formality. Within this frame, one thing is clear: a component’s causal contribution is not its rank in the chain but its marginal contribution. AI upstream is not AI responsible.
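The counterfactual contrast can be written down explicitly. The notation below is an assumption made for illustration and does not come from the paper: Y is the quantity of interest for a program, A indicates whether the component is present, and X is a set of observed covariates. The second line is the standard adjustment formula, which holds only if X blocks every backdoor path between A and Y and contains no mediator or collider. That condition is precisely what the graph has to rule on.

```latex
% Assumed notation: Y = outcome of interest, A \in \{0,1\} = component present,
% X = observed covariates. Illustrative only.
\[
  \tau \;=\; \mathbb{E}\big[\, Y(A{=}1) \;-\; Y(A{=}0) \,\big]
\]
% Identification from observed data, valid only under the stated graph conditions:
\[
  \tau \;=\; \mathbb{E}_{X}\Big[\, \mathbb{E}[\,Y \mid A{=}1, X\,] \;-\; \mathbb{E}[\,Y \mid A{=}0, X\,] \,\Big]
\]
```

If survival to a given phase is a mediator rather than a confounder, placing it in X would block part of the very effect being estimated, so the second line would no longer equal the first.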

Why an 80 to 90 Percent Phase I Clearance Proves Less Than It Seems

A high Phase I clearance rate, reported at 80 to 90% for AI-labeled compounds against roughly 52% historically, does not on its own demonstrate better molecular design. Phase I tests tolerability, not efficacy, and the gap admits two explanations equally compatible with the data: better design, or selection of candidates near a chemical space already known to be safe. An upstream toxicological filter (ToxTwin is one such terrain) does not turn a mediocre molecule into a good one; it modifies the distribution of molecules committed to the experimental pipeline. The gain comes from the selection function, not the design function. Two causally distinct mechanisms, generating better candidates versus selecting more efficiently among them, produce the same observable figure, and a clinical readout cannot tell them apart. In an agentic architecture, where a generator, a toxicological filter, ADMET modules and a decision orchestrator each reshape the candidate distribution passed downstream, final performance becomes a property of the composition, not of any isolated algorithm.
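The point that design and selection can yield the same observable figure can be made concrete with a toy calculation. Everything below is hypothetical: the per-candidate clearance probabilities are invented, and the upstream filter is assumed to predict clearance perfectly, which no real filter does.

```python
# Toy illustration: two distinct mechanisms, one observable clearance rate.
# All numbers are invented for the example, not taken from any dataset.

def clearance_rate(pass_probs_pct):
    """Expected Phase I clearance (in %) of the candidates actually committed."""
    return sum(pass_probs_pct) / len(pass_probs_pct)

def select(pool, threshold_pct):
    """Upstream filter: commit only candidates at or above the threshold.
    Assumes the filter sees the true clearance probability (an idealization)."""
    return [p for p in pool if p >= threshold_pct]

# Baseline generator: per-candidate probability (in %) of clearing Phase I.
baseline_pool = [25, 45, 65, 85]

# Mechanism A, better design: the generator itself yields safer candidates.
designed_pool = [75, 85, 85, 95]

# Mechanism B, better selection: same generator, stricter upstream filter.
selected_pool = select(baseline_pool, threshold_pct=80)

rate_baseline = clearance_rate(baseline_pool)
rate_design = clearance_rate(designed_pool)
rate_selection = clearance_rate(selected_pool)

# What differs is the yield: how many generated candidates get committed.
yield_design = len(designed_pool) / len(designed_pool)
yield_selection = len(selected_pool) / len(baseline_pool)

print(rate_baseline, rate_design, rate_selection)
print(yield_design, yield_selection)
```

In this sketch both mechanisms report the same clearance rate, while only the yield (candidates committed per candidate generated) separates them. The yield is an upstream quantity that a clinical readout does not record.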

Why the Right Unit Is eNPV, Not Probability of Success

The economic error is to reduce a platform’s capability to P(success). What pharmaceutical leadership maximizes is the risk-adjusted expected net present value (eNPV): expected future cash flows weighted by transition probabilities and discounted for time. Three levers enter it, not one: probability of success, cost and time. Eighteen months saved shift cash flows toward the present and raise eNPV at unchanged probability. A platform can therefore transform a portfolio’s economics without touching biology: discovering 20% faster and 50% cheaper, even at the price of a few additional points of failure, can raise aggregate eNPV. Serious investors are not looking for a better AI; they are looking for a better portfolio, built as much by cost structure and iteration speed as by terminal success rate.
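A minimal sketch of the three levers, under strong simplifying assumptions: each phase cost is paid in full at the start of the phase, post-approval cash flows are collapsed into a single value at launch, and the discount rate is constant. The function name `enpv`, the phase figures, the launch value and the rate are all hypothetical, chosen only to make the mechanics visible.

```python
def enpv(phases, launch_value, rate):
    """Simplified risk-adjusted expected NPV of one program.

    phases: list of (cost, duration_years, p_success), in chronological order.
    launch_value: value at launch of all post-approval cash flows (assumed lump sum).
    rate: annual discount rate.
    """
    value = 0.0
    p_reach = 1.0  # probability of reaching the current phase
    t = 0.0        # years elapsed at the start of the current phase
    for cost, years, p_success in phases:
        value -= p_reach * cost / (1 + rate) ** t  # expected discounted cost
        p_reach *= p_success
        t += years
    value += p_reach * launch_value / (1 + rate) ** t  # expected discounted payoff
    return value

# Hypothetical program, amounts in millions: (cost, years, transition probability).
clinical = [(40, 1.5, 0.60), (80, 2.5, 0.30), (250, 3.0, 0.60)]
LAUNCH_VALUE, RATE = 8000, 0.10

# Classical discovery stage.
classical = enpv([(60, 4.5, 0.35)] + clinical, LAUNCH_VALUE, RATE)

# Eighteen months saved in discovery, same cost, same probabilities.
faster = enpv([(60, 3.0, 0.35)] + clinical, LAUNCH_VALUE, RATE)

# 20% faster and 50% cheaper discovery, at the price of 3 points of success.
cheaper_riskier = enpv([(30, 3.6, 0.32)] + clinical, LAUNCH_VALUE, RATE)

print(round(classical, 1), round(faster, 1), round(cheaper_riskier, 1))
```

With these invented inputs, both alternative scenarios come out above the classical one even though neither improves a transition probability. The ordering depends on the inputs: time saved helps only when the expected discounted value downstream of discovery is positive.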

How Attribution Would Actually Operate: Ablation and the Shapley Value

The correct question is causal: what is the effect of the process on eNPV. The unit to attribute is a component’s marginal contribution to eNPV, decomposable into three interpretable effects, on transition probability, on cost and on time. The ideal arrangement, a matched arm running an AI-assisted pipeline against a classical one on the same target and indication, is an ablation: remove the component, observe the variation. It is unexecutable, because no one funds two competing several-hundred-million-euro programs for the pleasure of the inference. The realistic path runs through quasi-experiments familiar to econometrics: historical matching, propensity scores, instrumental variables, synthetic controls, target trial emulation, Bayesian causal inference. None equals a randomized trial; their convergence forms the body of evidence the isolated readout cannot provide.

The Shapley value (Shapley, 1953) indicates the form a rigorous attribution would take. Its appeal rests on two properties no intuition replaces: efficiency (contributions sum exactly to total performance, with no unexplained remainder) and symmetry (interchangeable components receive equal shares). These are precisely the guarantees one wants when contributions are entangled and no naive decomposition holds. The objection is serious: Shapley requires a well-defined coalition and a value function, neither given in advance in a real pipeline, and it carries a combinatorial cost that explodes with component count. The tool is sound; its application is a labor of its own. This is a sketch, not a method.
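The two properties can be checked on a toy pipeline. The three components and the value function below are invented for illustration: the filter and the ADMET module are worth nothing without a generator, and their marginal value grows with what is already in place, which is the entanglement case. The code enumerates every ordering exactly, so it is only practical for a handful of components.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical components and coalition values (e.g. eNPV in arbitrary units).
PLAYERS = ("generator", "tox_filter", "admet")
VALUE = {
    frozenset(): 0,
    frozenset({"generator"}): 10,
    frozenset({"tox_filter"}): 0,
    frozenset({"admet"}): 0,
    frozenset({"generator", "tox_filter"}): 30,
    frozenset({"generator", "admet"}): 30,
    frozenset({"tox_filter", "admet"}): 0,
    frozenset({"generator", "tox_filter", "admet"}): 60,
}

def shapley(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: Fraction(0) for p in players}
    n_orders = 0
    for order in permutations(players):
        n_orders += 1
        coalition = frozenset()
        for p in order:
            joined = coalition | {p}
            totals[p] += value[joined] - value[coalition]
            coalition = joined
    return {p: total / n_orders for p, total in totals.items()}

phi = shapley(PLAYERS, VALUE)
grand_total = VALUE[frozenset(PLAYERS)]

# Entanglement: the filter's marginal contribution depends on what else is present.
filter_after_generator = (VALUE[frozenset({"generator", "tox_filter"})]
                          - VALUE[frozenset({"generator"})])
filter_added_last = grand_total - VALUE[frozenset({"generator", "admet"})]

for name, share in phi.items():
    print(name, share)
```

Here the shares sum exactly to the value of the full pipeline (efficiency), the two interchangeable modules receive the same share (symmetry), and the filter's marginal contribution differs depending on whether the ADMET module is already present. The hard part, as the text notes, is that a real pipeline does not hand over the `VALUE` table.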

Performance, Provenance, Capability: The Distinction That Resolves the Debate

The debate persists because it conflates three questions. Performance asks: does this molecule work. Provenance asks: how did it come about, and what share belongs to AI. Capability asks: is the advantage a reproducible property of the process, within a declared domain, at a known cost. An approval answers the first and stays silent on the third. The objection that mass statistics will settle it holds only on homogeneous populations, and the AI-labeled population is not homogeneous: it fuses three tiers, it has been filtered by survival before the count, and no matching of target or period corrects these biases.

The critique must state its own conditions of refutation, on pain of being itself unfalsifiable. Here they are: if, across several hundred cases, independent quasi-experiments converged on robust reductions in time and cost, superior chemical diversity and better approval rates at matched target and indication, and if these effects held across several confounder specifications, then the capability hypothesis would become reasonably credible even without a randomized arm. Until that body of evidence exists, capability remains to be estimated, not proclaimed.

The scope extends beyond chemistry. Protein generation, laboratory robotics, materials design and industrial optimization all satisfy the same condition: production results from a chain of heterogeneous interacting agents, human and algorithmic. Wherever that holds, attributing a collective result to a single agent is a category error, and allocating it is a measurement program. The survivor that reaches Phase III is an excellent indicator of clinical performance and a poor estimator of provenance, because its very survival has erased the trace of what produced it. 2026 will not tell whether AI can discover drugs. It will tell whether this or that survivor holds, and offer, to whoever cares to build it, the first occasion to attribute the rest.
