From V2.3 to V2.4: Ames and hERG corpus enrichment, selective tri-model architecture, and the methodological lessons of a controlled improvement cycle
ToxTwin V2.3 covers 14 toxicological endpoints — 12 Tox21, Ames mutagenicity, hERG inhibition — via an end-to-end GNN pipeline validated under strict 5-fold scaffold CV. The V2.3 router selected, per endpoint, the better of two models: a fine-tuned GINEConv (V2.0a) and a multi-representation ensemble of GINEConv + AttentiveFP + Morgan ECFP6 (V2.2). V2.3 metrics stood at 0.867 mean AUC on Tox21, 0.843 on Ames, and 0.785 on hERG. Twelve of the fourteen endpoints reached their target threshold. The two that did not — Ames (−0.027 below target) and hERG (−0.045) — raised a precise question: is the right lever architecture or data?
PubChem AID 1259411, labeled “Ames” in several meta-analyses, is in reality an in vivo multi-species carcinogenicity assay. Its labels correlate only partially with bacterial mutagenicity as defined by ICH S2(R1). Naively integrating its 547 compounds into multi-task fine-tuning produced a regression across all endpoints; on V2.0a, the Ames AUC dropped from 0.769 to 0.698.
The lesson is simple to state and costly in its consequences: a compound scored positive in an in vivo carcinogenicity assay and a compound Ames-positive in the ICH S2(R1) sense (Salmonella reverse mutation, TA98/TA100 ± S9) do not carry the same information. Conflating them injects noise, not signal.
Ames corpus — ISS as primary regulatory source. The Istituto Superiore di Sanità dataset, distributed via Mendeley Data, offers three decisive properties: strain-level labels (TA98, TA100, TA102, TA1535, TA1537), documented multi-step curation, and harmonization according to the ICH S2(R1) convention (positive if at least one strain is positive). After RDKit sanitization and InChIKey deduplication against the existing Hansen corpus, 1,511 new compounds were added, with a near-balanced positive/negative ratio (1.1:1).
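The two integration rules, ICH S2(R1) label harmonization and InChIKey deduplication against the existing corpus, fit in a few lines. A minimal sketch, assuming strain calls are already parsed and InChIKeys precomputed (in the real pipeline RDKit's `MolToInchiKey` would supply them; field names here are illustrative):

```python
# Sketch of the ISS integration rules. Hypothetical record layout;
# InChIKeys assumed precomputed (e.g. via RDKit's MolToInchiKey).
STRAINS = ("TA98", "TA100", "TA102", "TA1535", "TA1537")

def harmonize_ich_s2r1(strain_calls: dict) -> int:
    """ICH S2(R1) convention: overall positive if at least one strain is positive."""
    return int(any(strain_calls.get(s) == 1 for s in STRAINS))

def merge_new_compounds(existing: dict, candidates: list) -> dict:
    """Deduplicate candidates against the existing Hansen corpus by InChIKey,
    keeping the already-curated label when a key is present in both."""
    merged = dict(existing)
    for c in candidates:
        key = c["inchikey"]
        if key not in merged:  # skip duplicates of existing entries
            merged[key] = harmonize_ich_s2r1(c["strains"])
    return merged
```

The priority rule on collisions (keep the existing Hansen label) is an assumption; the source only states that deduplication was performed by InChIKey.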
hERG corpus — ChEMBL v34. Target CHEMBL240 (KCNH2) aggregates 22,273 IC50/Ki activities. After filtering to nM-scale measurements, binarization at the 10 µM threshold, sanitization, and deduplication, 6,485 new compounds were added. A beneficial side effect: the hERG corpus moves from a 65/35 class imbalance to a 48.5/51.5 ratio, a rebalancing that directly improves model calibration.
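The filtering and binarization step can be sketched as follows, assuming activities are exported as rows carrying ChEMBL's `standard_value` / `standard_units` / `standard_type` fields; the per-compound aggregation rule (here, the median) is an assumption not stated in the text:

```python
# Sketch of the ChEMBL hERG binarization. Median aggregation across replicate
# measurements is an assumed choice, not documented pipeline behavior.
from statistics import median

THRESHOLD_NM = 10_000  # 10 µM threshold: below it, the compound is a hERG blocker

def binarize_herg(activities: list) -> dict:
    """Keep nM-scale IC50/Ki rows, aggregate per compound, binarize at 10 µM.
    Each row is a dict with 'inchikey', 'standard_value' (float, nM),
    'standard_units', and 'standard_type'."""
    per_compound = {}
    for a in activities:
        if a["standard_units"] != "nM" or a["standard_type"] not in ("IC50", "Ki"):
            continue  # drop non-nM units and other activity types
        per_compound.setdefault(a["inchikey"], []).append(a["standard_value"])
    return {k: int(median(v) < THRESHOLD_NM) for k, v in per_compound.items()}
```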
The hypothesis tested: because the hERG channel is pharmacologically sensitive to the 3D shape of the ligand in its vestibule, adding conformational descriptors (NPR, USR, USRCAT; 79 dimensions) should improve prediction. Results on the same 5-fold scaffold CV invalidate it.
| Configuration | Ames AUC | hERG AUC |
|---|---|---|
| Morgan ECFP6 only | 0.841 | 0.750 |
| 3D only | 0.763 | 0.679 |
| Morgan + 3D | 0.842 | 0.764 |
The marginal gain on hERG (+0.014) and the negligible one on Ames (+0.001) indicate that the GNN and topological fingerprints already capture most of the exploitable structural information at this corpus scale. The current performance ceiling is one of chemical coverage, not of representational capacity.
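For concreteness, the USR block of the 3D descriptor vector depends only on conformer coordinates. A pure-NumPy sketch of the 12-dimensional USR descriptor (in the actual pipeline RDKit's `rdMolDescriptors.GetUSR` would be used, with NPR and USRCAT supplying the remaining dimensions):

```python
import numpy as np

def usr_descriptor(coords: np.ndarray) -> np.ndarray:
    """12-dim USR shape descriptor: for each of four reference points, the
    mean, standard deviation, and cube-rooted third central moment of the
    distances from that point to every atom. coords has shape (n_atoms, 3)."""
    ctd = coords.mean(axis=0)                       # molecular centroid
    d_ctd = np.linalg.norm(coords - ctd, axis=1)
    cst = coords[np.argmin(d_ctd)]                  # atom closest to centroid
    fct = coords[np.argmax(d_ctd)]                  # atom farthest from centroid
    ftf = coords[np.argmax(np.linalg.norm(coords - fct, axis=1))]  # farthest from fct
    moments = []
    for ref in (ctd, cst, fct, ftf):
        d = np.linalg.norm(coords - ref, axis=1)
        mu = d.mean()
        moments += [mu, d.std(), np.cbrt(((d - mu) ** 3).mean())]
    return np.asarray(moments)
```

Because the descriptor is built purely from intramolecular distance distributions, it is rotation- and translation-invariant, which is what makes it cheap enough to compute corpus-wide.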
The operational discovery of V2.4 is that the Ames ISS data and the hERG ChEMBL data do not benefit the same ensemble model. An ensemble trained on the enriched Ames corpus (V2.4b) produces the best Ames score but a lower hERG score; an ensemble trained on the enriched hERG corpus (V2.4d) produces the reverse. The solution is structural, not parametric: a tri-router that selects, for each endpoint, the optimal model among three — V2.0a (multi-task GINEConv), V2.4b (Ames-optimized ensemble), and V2.4d (hERG-optimized ensemble).
V2.4 routing table: V2.4b serves 10 endpoints (NR-AR-LBD, NR-AhR, NR-ER, NR-ER-LBD, NR-PPAR-γ, SR-ARE, SR-HSE, SR-MMP, SR-p53, Ames); V2.4d serves 4 endpoints (NR-AR, NR-Aromatase, SR-ATAD5, hERG). Encoder V2.0a remains available as fallback but is selected by no endpoint in the optimal routing.
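The routing rule itself reduces to a per-endpoint argmax over validation AUC. A minimal sketch; model names come from the text, but apart from the three AUC values quoted in this report, the numbers below are illustrative only:

```python
def build_routing_table(val_auc: dict) -> dict:
    """Select, for each endpoint, the model with the highest validation AUC.
    val_auc maps endpoint -> {model_name: auc}."""
    return {ep: max(scores, key=scores.get) for ep, scores in val_auc.items()}

# Illustrative inputs: 0.769 (V2.0a/Ames), 0.853 (V2.4b/Ames) and 0.800
# (V2.4d/hERG) are quoted in the text; the other values are made up.
val_auc = {
    "Ames": {"V2.0a": 0.769, "V2.4b": 0.853, "V2.4d": 0.812},
    "hERG": {"V2.0a": 0.741, "V2.4b": 0.773, "V2.4d": 0.800},
}
routing = build_routing_table(val_auc)
```

This is why V2.0a can remain loaded as a fallback yet serve no endpoint: it simply never wins the argmax in the current table.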
| Metric | V2.3 | V2.4 | Delta |
|---|---|---|---|
| Tox21 mean AUC | 0.867 | 0.898 | +0.031 |
| Ames AUC | 0.843 | 0.853 | +0.010 |
| hERG AUC | 0.785 | 0.800 | +0.015 |
| Targets reached | 12/14 | 12/14 | = |
Protocol unchanged: strict 5-fold Bemis-Murcko scaffold CV, InChIKey verification that train ∩ val = ∅, frozen holdout untouched by selection. The two endpoints still short of target remain Ames (0.853 vs target 0.87, gap −0.017) and hERG (0.800 vs target 0.83, gap −0.030). Relative to V2.3, the gaps shrank by 37% (Ames) and 33% (hERG).
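The leakage guarantee comes from grouping compounds by Bemis-Murcko scaffold before splitting, so no scaffold ever spans train and validation. A sketch of the fold assignment, assuming scaffold SMILES are precomputed (RDKit's `MurckoScaffold.MurckoScaffoldSmiles` in practice); the greedy balancing heuristic is an assumption, not the documented splitter:

```python
# Scaffold-grouped fold assignment. Greedy largest-group-first packing is an
# assumed heuristic; the guarantee that matters is one scaffold = one fold.
from collections import defaultdict

def scaffold_folds(scaffold_of: dict, n_folds: int = 5) -> dict:
    """Assign compounds to folds so that no Bemis-Murcko scaffold spans two
    folds. scaffold_of maps inchikey -> precomputed scaffold SMILES."""
    groups = defaultdict(list)
    for key, scaf in scaffold_of.items():
        groups[scaf].append(key)
    folds = {i: [] for i in range(n_folds)}
    # Largest scaffold groups first, each into the currently smallest fold.
    for members in sorted(groups.values(), key=len, reverse=True):
        smallest = min(folds, key=lambda i: len(folds[i]))
        folds[smallest].extend(members)
    return folds
```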
A single model is a compromise, not an optimum. Multi-task fine-tuning produces a generalist encoder, but endpoint specialization — specific data, a specific head, router selection — systematically outperforms the shared encoder. The tri-router does not choose the best model overall; it chooses the best model for each task.
Data beats architecture, but not just any data. Injecting mislabeled data degrades performance even as volume increases. Curation — verifying the primary source, harmonizing labels to regulatory conventions, controlling the positive/negative ratio — is the work that produces the gain, not the download itself.
Additional features do not compensate for missing data. 3D descriptors, despite their pharmacological justification, provide only marginal gain when the topological model is already correctly trained.
The tri-router introduces operational complexity: three models loaded in GPU memory instead of two, and a routing table that must be revalidated with every data refresh. Endpoint selection relies on a single validation set per fold, so selection variance is not quantified. V2.4 metrics are not directly comparable to V2.3 metrics for endpoints whose corpus changed (Ames, hERG), even though the validation protocol is identical. The frozen holdout has not been touched: it remains available for a final pre-deployment validation.
The V2.4 REST API exposes the tri-router via the same endpoints as V2.3. No client-side modification is necessary — the routing field in the API response now indicates V2.4b or V2.4d instead of V2.0a or V2.2.