From V2.3 to V2.4: Ames and hERG corpus enrichment, selective tri-model architecture, and the methodological lessons of a controlled improvement cycle
ToxTwin V2.3 covers 14 toxicological endpoints — 12 Tox21, Ames mutagenicity, hERG inhibition — via an end-to-end GNN pipeline validated under strict 5-fold scaffold CV. The V2.3 router selected, per endpoint, the better of two models: a fine-tuned GINEConv (V2.0a) and a multi-representation ensemble of GINEConv + AttentiveFP + Morgan ECFP6 (V2.2). V2.3 metrics stood at 0.867 mean AUC on Tox21, 0.843 on Ames, and 0.785 on hERG. Twelve of the fourteen endpoints reached their target threshold. The two that did not — Ames (−0.027 below target) and hERG (−0.045) — raised a precise question: is the right lever architecture or data?
PubChem AID 1259411, labeled “Ames” in several meta-analyses, is in reality an in vivo multi-species carcinogenicity assay. Its labels correlate only partially with bacterial mutagenicity as defined by ICH S2(R1). Naively integrating its 547 compounds into multi-task fine-tuning produced a regression across all endpoints; on V2.0a, the Ames AUC dropped from 0.769 to 0.698.
The lesson is simple to state and costly in its consequences: a compound scored positive in an in vivo carcinogenicity assay and a compound Ames-positive in the ICH S2(R1) sense (Salmonella reverse mutation, TA98/TA100 ± S9) do not carry the same information. Conflating them injects noise, not signal.
Ames corpus — ISS as primary regulatory source. The Istituto Superiore di Sanità dataset, distributed via Mendeley Data, offers three decisive properties: strain-level labels (TA98, TA100, TA102, TA1535, TA1537), documented multi-step curation, and harmonization according to the ICH S2(R1) convention (positive if at least one strain is positive). After RDKit sanitization and InChIKey deduplication against the existing Hansen corpus, 1,511 new compounds were added, with a near-balanced positive/negative ratio (1.1:1).
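The two integration rules, ICH S2(R1) label harmonization and InChIKey deduplication against the existing corpus, fit in a few lines. A minimal sketch, assuming strain calls are already parsed and InChIKeys precomputed (in the real pipeline RDKit's `MolToInchiKey` would supply them; field names here are illustrative):

```python
# Sketch of the ISS integration rules. Hypothetical record layout;
# InChIKeys assumed precomputed (e.g. via RDKit's MolToInchiKey).
STRAINS = ("TA98", "TA100", "TA102", "TA1535", "TA1537")

def harmonize_ich_s2r1(strain_calls: dict) -> int:
    """ICH S2(R1) convention: overall positive if at least one strain is positive."""
    return int(any(strain_calls.get(s) == 1 for s in STRAINS))

def merge_new_compounds(existing: dict, candidates: list) -> dict:
    """Deduplicate candidates against the existing Hansen corpus by InChIKey,
    keeping the already-curated label when a key is present in both."""
    merged = dict(existing)
    for c in candidates:
        key = c["inchikey"]
        if key not in merged:  # skip duplicates of existing entries
            merged[key] = harmonize_ich_s2r1(c["strains"])
    return merged
```

The priority rule on collisions (keep the existing Hansen label) is an assumption; the source only states that deduplication was performed by InChIKey.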
hERG corpus — ChEMBL v34. Target CHEMBL240 (KCNH2) aggregates 22,273 IC50/Ki activities. After filtering to nM-scale measurements, binarization at the 10 µM threshold, sanitization, and deduplication, 6,485 new compounds were added. A beneficial side effect: the hERG corpus moves from a 65/35 class imbalance to a 48.5/51.5 ratio, a rebalancing that directly improves model calibration.
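The filtering and binarization step can be sketched as follows, assuming activities are exported as rows carrying ChEMBL's `standard_value` / `standard_units` / `standard_type` fields; the per-compound aggregation rule (here, the median) is an assumption not stated in the text:

```python
# Sketch of the ChEMBL hERG binarization. Median aggregation across replicate
# measurements is an assumed choice, not documented pipeline behavior.
from statistics import median

THRESHOLD_NM = 10_000  # 10 µM threshold: below it, the compound is a hERG blocker

def binarize_herg(activities: list) -> dict:
    """Keep nM-scale IC50/Ki rows, aggregate per compound, binarize at 10 µM.
    Each row is a dict with 'inchikey', 'standard_value' (float, nM),
    'standard_units', and 'standard_type'."""
    per_compound = {}
    for a in activities:
        if a["standard_units"] != "nM" or a["standard_type"] not in ("IC50", "Ki"):
            continue  # drop non-nM units and other activity types
        per_compound.setdefault(a["inchikey"], []).append(a["standard_value"])
    return {k: int(median(v) < THRESHOLD_NM) for k, v in per_compound.items()}
```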
The hypothesis tested: because the hERG channel is pharmacologically sensitive to the 3D shape of the ligand in its vestibule, adding conformational descriptors (NPR, USR, USRCAT; 79 dimensions) should improve prediction. Results on the same 5-fold scaffold CV invalidate it.
| Configuration | Ames AUC | hERG AUC |
|---|---|---|
| Morgan ECFP6 only | 0.841 | 0.750 |
| 3D only | 0.763 | 0.679 |
| Morgan + 3D | 0.842 | 0.764 |
The marginal gain on hERG (+0.014) and the negligible one on Ames (+0.001) indicate that the GNN and topological fingerprints already capture most of the exploitable structural information at this corpus scale. The current performance ceiling is one of chemical coverage, not of representational capacity.
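For concreteness, the USR block of the 3D descriptor vector depends only on conformer coordinates. A pure-NumPy sketch of the 12-dimensional USR descriptor (in the actual pipeline RDKit's `rdMolDescriptors.GetUSR` would be used, with NPR and USRCAT supplying the remaining dimensions):

```python
import numpy as np

def usr_descriptor(coords: np.ndarray) -> np.ndarray:
    """12-dim USR shape descriptor: for each of four reference points, the
    mean, standard deviation, and cube-rooted third central moment of the
    distances from that point to every atom. coords has shape (n_atoms, 3)."""
    ctd = coords.mean(axis=0)                       # molecular centroid
    d_ctd = np.linalg.norm(coords - ctd, axis=1)
    cst = coords[np.argmin(d_ctd)]                  # atom closest to centroid
    fct = coords[np.argmax(d_ctd)]                  # atom farthest from centroid
    ftf = coords[np.argmax(np.linalg.norm(coords - fct, axis=1))]  # farthest from fct
    moments = []
    for ref in (ctd, cst, fct, ftf):
        d = np.linalg.norm(coords - ref, axis=1)
        mu = d.mean()
        moments += [mu, d.std(), np.cbrt(((d - mu) ** 3).mean())]
    return np.asarray(moments)
```

Because the descriptor is built purely from intramolecular distance distributions, it is rotation- and translation-invariant, which is what makes it cheap enough to compute corpus-wide.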
The operational discovery of V2.4 is that the Ames ISS data and the hERG ChEMBL data do not benefit the same ensemble model. An ensemble trained on the enriched Ames corpus (V2.4b) produces the best Ames score but a lower hERG score; an ensemble trained on the enriched hERG corpus (V2.4d) produces the reverse. The solution is structural, not parametric: a tri-router that selects, for each endpoint, the optimal model among three — V2.0a (multi-task GINEConv), V2.4b (Ames-optimized ensemble), and V2.4d (hERG-optimized ensemble).
V2.4 routing table: V2.4b serves 10 endpoints (NR-AR-LBD, NR-AhR, NR-ER, NR-ER-LBD, NR-PPAR-γ, SR-ARE, SR-HSE, SR-MMP, SR-p53, Ames); V2.4d serves 4 endpoints (NR-AR, NR-Aromatase, SR-ATAD5, hERG). Encoder V2.0a remains available as fallback but is selected by no endpoint in the optimal routing.
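The routing rule itself reduces to a per-endpoint argmax over validation AUC. A minimal sketch; model names come from the text, but apart from the three AUC values quoted in this report, the numbers below are illustrative only:

```python
def build_routing_table(val_auc: dict) -> dict:
    """Select, for each endpoint, the model with the highest validation AUC.
    val_auc maps endpoint -> {model_name: auc}."""
    return {ep: max(scores, key=scores.get) for ep, scores in val_auc.items()}

# Illustrative inputs: 0.769 (V2.0a/Ames), 0.853 (V2.4b/Ames) and 0.800
# (V2.4d/hERG) are quoted in the text; the other values are made up.
val_auc = {
    "Ames": {"V2.0a": 0.769, "V2.4b": 0.853, "V2.4d": 0.812},
    "hERG": {"V2.0a": 0.741, "V2.4b": 0.773, "V2.4d": 0.800},
}
routing = build_routing_table(val_auc)
```

This is why V2.0a can remain loaded as a fallback yet serve no endpoint: it simply never wins the argmax in the current table.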
| Metric | V2.3 | V2.4 | Delta |
|---|---|---|---|
| Tox21 mean AUC | 0.867 | 0.898 | +0.031 |
| Ames AUC | 0.843 | 0.853 | +0.010 |
| hERG AUC | 0.785 | 0.800 | +0.015 |
| Targets reached | 12/14 | 12/14 | = |
Protocol unchanged: strict 5-fold Bemis-Murcko scaffold CV, InChIKey verification that train ∩ val = ∅, frozen holdout untouched by selection. The two endpoints still short of target remain Ames (0.853 vs target 0.87, gap −0.017) and hERG (0.800 vs target 0.83, gap −0.030). Relative to V2.3, the gaps shrank by 37% (Ames) and 33% (hERG).
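The leakage guarantee comes from grouping compounds by Bemis-Murcko scaffold before splitting, so no scaffold ever spans train and validation. A sketch of the fold assignment, assuming scaffold SMILES are precomputed (RDKit's `MurckoScaffold.MurckoScaffoldSmiles` in practice); the greedy balancing heuristic is an assumption, not the documented splitter:

```python
# Scaffold-grouped fold assignment. Greedy largest-group-first packing is an
# assumed heuristic; the guarantee that matters is one scaffold = one fold.
from collections import defaultdict

def scaffold_folds(scaffold_of: dict, n_folds: int = 5) -> dict:
    """Assign compounds to folds so that no Bemis-Murcko scaffold spans two
    folds. scaffold_of maps inchikey -> precomputed scaffold SMILES."""
    groups = defaultdict(list)
    for key, scaf in scaffold_of.items():
        groups[scaf].append(key)
    folds = {i: [] for i in range(n_folds)}
    # Largest scaffold groups first, each into the currently smallest fold.
    for members in sorted(groups.values(), key=len, reverse=True):
        smallest = min(folds, key=lambda i: len(folds[i]))
        folds[smallest].extend(members)
    return folds
```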
A single model is a compromise, not an optimum. Multi-task fine-tuning produces a generalist encoder, but endpoint specialization — specific data, a specific head, router selection — systematically outperforms the shared encoder. The tri-router does not choose the best model overall; it chooses the best model for each task.
Data beats architecture, but not just any data. Injecting mislabeled data degrades performance even as volume increases. Curation — verifying the primary source, harmonizing labels to regulatory conventions, controlling the positive/negative ratio — is the work that produces the gain, not the download itself.
Additional features do not compensate for missing data. 3D descriptors, despite their pharmacological justification, provide only marginal gain when the topological model is already correctly trained.
The tri-router introduces operational complexity: three models loaded in GPU memory instead of two, and a routing table that must be revalidated with every data refresh. Endpoint selection relies on a single validation set per fold, so selection variance is not quantified. V2.4 metrics are not directly comparable to V2.3 metrics for endpoints whose corpus changed (Ames, hERG), even though the validation protocol is identical. The frozen holdout has not been touched: it remains available for a final pre-deployment validation.
The V2.4 REST API exposes the tri-router via the same endpoints as V2.3. No client-side modification is necessary — the routing field in the API response now indicates V2.4b or V2.4d instead of V2.0a or V2.2.