Technical project · Case study

ToxTwin V2.4 — Molecular Toxicity Digital Twin

From SMILES to calibrated toxicological scoring across 14 endpoints — trajectory V1.0 to V2.4

Twingital Institute · Lead architect & developer · 2025–2026 · France

Context

Pre-Phase 1 toxicological prediction is one of the most expensive bottlenecks in pharmaceutical development.

Approach

Design and deployment of an end-to-end pipeline based on Graph Neural Networks, with a Medallion architecture, a multi-representation ensemble, an endpoint tri-router and sovereign infrastructure.

Summary

ToxTwin V2.4 is a predictive toxicological scoring platform for pre-Phase 1 assessment, built on an end-to-end GNN pipeline. It covers 14 toxicological endpoints (12 Tox21 assays, Ames mutagenicity and hERG channel inhibition), with a pharmacological interpretation layer powered by a local LLM. V2.4 introduces an endpoint tri-router, together with Ames and hERG corpora enriched from primary regulatory sources. It is the first version to exceed 0.89 Tox21 mean AUC on a strict 5-fold scaffold CV protocol. The entire system runs on sovereign infrastructure with no cloud dependency.

Keywords — Predictive toxicology · Graph Neural Network · GINEConv · AttentiveFP · Ensemble learning · Tri-router · Isotonic calibration · Composite applicability domain · Scaffold split · hERG · Ames · ISS · ICH S2R1 · ChEMBL · Sovereign infrastructure

Key points

  • Tox21 mean AUC 0.898 — 12/14 targets reached, Ames and hERG gaps reduced by 33–37%
  • Endpoint tri-router: corpus × architecture specialization systematically outperforms the shared encoder
  • Ames ISS corpus (ICH S2R1 strain-level labels) + hERG ChEMBL v34 (rebalanced to 48.5/51.5)
  • 3D conformational features tested and rejected: the ceiling is one of chemical coverage, not of representation
  • Transparent development cycle: 6 versions, 4 major biases documented and corrected
  • V2.4 REST API backward-compatible with V2.3, routing field exposed per response

Trajectory V1.0 → V2.4: Six Versions, Four Biases, One Architecture

The development of ToxTwin illustrates a principle that product presentations usually leave unspoken: most performance gains come from correcting pre-existing errors, not from introducing innovations. The trajectory below traces each version, the problem it was intended to solve, the blocker encountered, and the workaround adopted.

V1.0 — First GINEConv Pipeline (September 2025)

Objective: establish a GNN baseline on 12 Tox21 endpoints with a 9-dimensional GINEConv trained from scratch. Announced AUC: 0.857 on random split.

Blocker — silent OGB bug. The GINEConv encoder was actually operating in 9D mode (basic atomic features) instead of the 163D OGB mode (enriched atomic and bond features). The model passed through the pipeline without raising a dimension error, but was ignoring the majority of available chemical information.
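
A guard of the following kind would turn this class of silent mismatch into a hard failure. It is a minimal sketch, not the project's code: the names `EXPECTED_NODE_DIM` and `check_feature_dim` are hypothetical, and the value 163 is simply the OGB-mode width quoted above.

```python
EXPECTED_NODE_DIM = 163  # OGB-mode feature width quoted in the text (assumed per atom)

def check_feature_dim(node_features, expected_dim=EXPECTED_NODE_DIM):
    """Raise if any atom feature vector does not have the expected width."""
    for i, row in enumerate(node_features):
        if len(row) != expected_dim:
            raise ValueError(
                f"atom {i}: got {len(row)} features, expected {expected_dim}"
            )
    return True

# A 9D featurization is rejected instead of flowing silently through the pipeline.
basic_atoms = [[0.0] * 9 for _ in range(3)]
try:
    check_feature_dim(basic_atoms)
    guard_triggered = False
except ValueError:
    guard_triggered = True
```

Running the check once at data-loading time, before any training step, is usually enough to catch a featurizer that fell back to a smaller mode.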

Discovery: during the V1.3 audit, checkpoint comparison revealed that the weights corresponded to a 9D encoder. The 0.857 AUC could not be reproduced on a scaffold split: under the strict protocol, the actual AUC was 0.594 ± 0.056.

Workaround: migration to GINEConv OGB 163D, full retraining, freezing of a SHA256 holdout for all future evaluations.
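
One way to freeze a holdout is to fingerprint its membership. The sketch below assumes the holdout is identified by a list of InChIKeys and that the hash is taken over their sorted, newline-joined form; the source does not say how its SHA256 is computed, and `holdout_fingerprint` is a hypothetical helper.

```python
import hashlib

def holdout_fingerprint(inchikeys):
    """SHA256 over the sorted, de-duplicated InChIKeys, one per line."""
    canonical = "\n".join(sorted(set(inchikeys))).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

# Two example keys (aspirin and paracetamol), used here only as sample data.
holdout = ["BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "RZVAJINKPMORJF-UHFFFAOYSA-N"]
frozen = holdout_fingerprint(holdout)

# Order and duplicates do not change the fingerprint; a membership change does.
same_set = holdout_fingerprint(list(reversed(holdout)) + holdout[:1])
smaller_set = holdout_fingerprint(holdout[:1])
```

Storing the hex digest next to every evaluation report makes it possible to verify later that all versions were scored on the same compounds.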


V1.1 → V1.3 — OGB Bug Fix and Isotonic Calibration (October–November 2025)

Objective: restart from a clean base with the correct encoder, introduce probabilistic calibration and define an applicability domain.

Blocker — data leakage in the val set. The naive fold construction for cross-validation allowed structurally similar compounds (Tanimoto > 0.6) to appear simultaneously in training and validation. The model was partially learning to recognize its validation examples, producing an artificially favorable validation AUC.

Workaround: introduction of a strict Bemis-Murcko scaffold split, with InChIKey verification that train ∩ val = ∅ at each fold. Corrected Ames AUC: 0.864 ± 0.056. Introduction of out-of-fold (OOF) isotonic calibration and a k-NN applicability domain (AD) (p95 threshold = 0.332 on 163 OGB features).
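
The scaffold-grouped fold construction can be sketched as follows. This is an illustration under assumptions: scaffolds are taken as precomputed strings keyed by InChIKey (for example from RDKit's `MurckoScaffold` module), the greedy balancing rule is a common choice rather than the documented one, and the function names and sample data are hypothetical.

```python
from collections import defaultdict

def scaffold_folds(scaffold_by_key, n_folds=5):
    """Assign whole scaffold groups to folds, largest groups first."""
    groups = defaultdict(list)
    for key, scaffold in scaffold_by_key.items():
        groups[scaffold].append(key)
    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: each group goes to the currently smallest fold.
    for scaffold in sorted(groups, key=lambda s: (-len(groups[s]), s)):
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(groups[scaffold])
    return folds

def train_val_split(folds, val_index):
    """Return (train, val) key sets and verify that they do not intersect."""
    val = set(folds[val_index])
    train = {k for i, f in enumerate(folds) if i != val_index for k in f}
    if not train.isdisjoint(val):
        raise ValueError("leak: identifier present in both train and val")
    return train, val

# Placeholder identifiers standing in for InChIKeys.
scaffold_by_key = {
    "K1": "c1ccccc1", "K2": "c1ccccc1", "K3": "c1ccccc1",
    "K4": "c1ccncc1", "K5": "c1ccncc1",
    "K6": "C1CCCCC1", "K7": "c1ccc2ccccc2c1", "K8": "c1ccoc1",
}
folds = scaffold_folds(scaffold_by_key, n_folds=3)
train, val = train_val_split(folds, val_index=0)
```

Because a scaffold is never split across folds, compounds sharing a core structure cannot sit on both sides of a validation boundary.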

Secondary blocker — Ames circularity. V1.0 Ames labels were derived by SMARTS rules from structures, not from experimental results. The model was learning chemical rules, not real mutagenicity. Circular Ames AUC: not interpretable.

Workaround: full replacement of the Ames corpus with the Hansen dataset (validated experimental labels), removal of all SMARTS-derived labels.


V2.0 — Multi-Representation Ensemble Architecture (December 2025 – January 2026)

Objective: break through the single GINEConv ceiling by combining multiple molecular representations.

Architecture: GINEConv (topological, 163D OGB) + AttentiveFP (attentional, adaptive edge weights) + Morgan ECFP6 (sub-structural fingerprints, 2048 bits). Fusion by concatenation + per-endpoint classification head.

Blocker — hERG class imbalance. The initial hERG corpus had a 65/35 ratio (blockers/non-blockers). The classification head converged toward the majority class, producing near-zero recall on non-blockers. ECE > 0.15 on hERG.

Workaround: inverse class weighting per endpoint, decision threshold adjustment, focal loss on imbalanced endpoints. hERG AUC: 0.785 in V2.3.
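
Two of these ingredients, inverse class weighting and focal loss, can be written down compactly. The sketch below is in pure Python for readability and is not the project's training code; the weighting formula and the default `gamma=2.0` are common conventions assumed here.

```python
import math

def inverse_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n = len(labels)
    pos = sum(labels)
    neg = n - pos
    return {0: n / (2 * neg), 1: n / (2 * pos)}

def focal_loss(p, y, gamma=2.0, weight=1.0, eps=1e-7):
    """Binary focal loss for a single prediction p = P(y = 1)."""
    p = min(max(p, eps), 1 - eps)
    p_true = p if y == 1 else 1 - p
    # (1 - p_true) ** gamma down-weights examples that are already well classified.
    return -weight * (1 - p_true) ** gamma * math.log(p_true)

labels = [1] * 65 + [0] * 35  # the 65/35 ratio quoted for the initial hERG corpus
weights = inverse_class_weights(labels)
```

With `gamma=0` the focal term disappears and the expression reduces to weighted cross-entropy, which makes the effect of the focusing parameter easy to ablate.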

Blocker — degenerate applicability domain. The V1.3 AD based on k-NN in raw feature space produced incoherent coverage zones across endpoints: some endpoints declared 98% of submitted molecules “in domain” regardless of their structure. The AD did not discriminate.

Workaround (V2.3): composite tri-signal AD — structural similarity (Tanimoto on ECFP4), latent space proximity (cosine distance post-encoding), regional density (kernel estimation in PCA-reduced space). Each signal contributes to a calibrated composite score.
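
A composite score of this kind might look like the sketch below. Everything beyond the three signal types is an assumption: the equal weights, the Gaussian kernel and its bandwidth, the use of the nearest training neighbour, and the dictionary layout (`fp`, `z`, `pca`) are placeholders, since the actual fusion weights and thresholds are not published. Cosine similarity (one minus cosine distance) is used so that all three signals increase with proximity.

```python
import math

def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def gaussian_density(x, train_points, bandwidth=1.0):
    """Mean (unnormalized) Gaussian kernel value between x and the training points."""
    total = 0.0
    for t in train_points:
        d2 = sum((a - b) ** 2 for a, b in zip(x, t))
        total += math.exp(-d2 / (2 * bandwidth ** 2))
    return total / len(train_points)

def composite_ad(query, train, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted mean of the three signals, each in [0, 1]; weights are placeholders."""
    s_struct = max(tanimoto(query["fp"], t["fp"]) for t in train)
    s_latent = max(0.0, max(cosine_similarity(query["z"], t["z"]) for t in train))
    s_density = gaussian_density(query["pca"], [t["pca"] for t in train])
    return sum(w * s for w, s in zip(weights, (s_struct, s_latent, s_density)))

train = [
    {"fp": {1, 2, 3}, "z": (1.0, 0.0), "pca": (0.0, 0.0)},
    {"fp": {3, 4, 5}, "z": (0.0, 1.0), "pca": (1.0, 1.0)},
]
near = {"fp": {1, 2, 3}, "z": (1.0, 0.0), "pca": (0.0, 0.0)}
far = {"fp": {10, 11}, "z": (-1.0, -1.0), "pca": (5.0, 5.0)}
score_near, score_far = composite_ad(near, train), composite_ad(far, train)
```

A query identical to a training compound scores close to 1, while a structurally unrelated, low-density query scores close to 0, which is the discrimination the single k-NN signal failed to provide.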


V2.1 → V2.2 — Per-Endpoint Selection and Binary Router (February–March 2026)

Objective: identify, per endpoint, the optimal ensemble configuration rather than a single configuration for all.

Mechanism: binary router selecting, per endpoint, between V2.0a (fine-tuned multi-task GINEConv) and V2.2 (full ensemble). Selection based on median AUC per fold.

Blocker — inter-fold selection instability. For some endpoints (SR-ARE, NR-Aromatase), the router selected different models depending on the fold. Selection was not stable across scaffold split sampling variance.

Workaround: selection based on median AUC across 5 folds (not the best individual fold), with a minimum of 3 concordant folds required for selection validation. Tox21 mean AUC V2.3: 0.867.
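
The stabilized rule can be sketched as below. The reading of "concordant folds" as folds in which the selected model is also the per-fold winner is an assumption, the AUC values are invented for illustration, and `select_model` is a hypothetical name.

```python
from statistics import median

def select_model(fold_aucs, min_concordant=3):
    """fold_aucs maps model name -> per-fold AUCs (same fold order for all models).

    Returns (model, validated): the model with the highest median AUC, and
    whether it also wins at least `min_concordant` individual folds.
    """
    best = max(fold_aucs, key=lambda m: median(fold_aucs[m]))
    n_folds = len(fold_aucs[best])
    wins = sum(
        1 for i in range(n_folds)
        if fold_aucs[best][i] == max(aucs[i] for aucs in fold_aucs.values())
    )
    return best, wins >= min_concordant

# Illustrative numbers only, not measured results.
stable = {
    "V2.0a": [0.80, 0.82, 0.79, 0.85, 0.81],
    "V2.2": [0.83, 0.81, 0.84, 0.82, 0.86],
}
unstable = {
    "V2.0a": [0.80, 0.80, 0.80, 0.82, 0.80],
    "V2.2": [0.70, 0.70, 0.81, 0.81, 0.90],
}
stable_choice = select_model(stable)
unstable_choice = select_model(unstable)
```

In the second case the candidate has the higher median but wins only two folds, so the selection is flagged as not validated rather than accepted on the strength of one favourable split.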


V2.3 — Full Audit and Consolidation (March–April 2026)

Objective: produce the first version with all metrics audited, reproducible, and comparable to the literature on an identical protocol.

V2.3 outcome: 12/14 targets reached. Missing points: Ames (0.843 vs target 0.87, gap −0.027) and hERG (0.785 vs target 0.83, gap −0.045). Frozen holdout audit: SHA256 = 052a2aa2c4cff3d8… — holdout metrics consistent with CV metrics.

Question posed for V2.4: are the two missing endpoints an architecture problem or a data problem?


V2.4 — Tri-Router and Corpus Enrichment (April 2026)

Answer to the V2.3 question: data. But not just any data.

First documented failure — PubChem AID 1259411. This assay, labeled “Ames” in several meta-analyses, is in reality an in vivo multi-species carcinogenicity assay (labels by animal strain, not bacterial strain). Naive integration of these 547 compounds into multi-task fine-tuning produced a regression across all endpoints, including Ames itself: from 0.769 to 0.698 on V2.0a. Additional volume injected noise, not signal.

Workaround — ISS corpus for Ames. The Istituto Superiore di Sanità dataset (Mendeley Data) provides strain-level bacterial labels (TA98, TA100, TA102, TA1535, TA1537) harmonized according to the ICH S2R1 convention (positive if at least one strain positive). After RDKit sanitization and InChIKey deduplication against the Hansen corpus, 1,511 new compounds integrated, 1.1:1 ratio.
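
The label harmonization and deduplication steps can be sketched as follows. The record layout, the placeholder keys and the first-occurrence rule for repeated ISS entries are assumptions for illustration; only the "positive if at least one strain positive" rule and the InChIKey deduplication against Hansen come from the text.

```python
def overall_ames_call(strain_results):
    """Return 1 if at least one tested strain is positive, else 0."""
    return int(any(strain_results.values()))

def new_compounds(iss_records, hansen_keys):
    """Keep ISS records whose InChIKey is absent from Hansen, one label per key."""
    kept = {}
    for rec in iss_records:
        key = rec["inchikey"]
        if key in hansen_keys or key in kept:
            continue  # already covered by Hansen, or seen earlier in ISS
        kept[key] = overall_ames_call(rec["strains"])
    return kept

# Placeholder keys and strain outcomes, not real ISS data.
iss_records = [
    {"inchikey": "KEY-A", "strains": {"TA98": 0, "TA100": 1}},
    {"inchikey": "KEY-B", "strains": {"TA98": 0, "TA100": 0, "TA1535": 0}},
    {"inchikey": "KEY-C", "strains": {"TA98": 1}},
    {"inchikey": "KEY-A", "strains": {"TA98": 0}},
]
hansen_keys = {"KEY-C"}
ames_labels = new_compounds(iss_records, hansen_keys)
# ames_labels == {"KEY-A": 1, "KEY-B": 0}
```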

hERG corpus — ChEMBL v34. CHEMBL240 (KCNH2): 22,273 IC50/Ki activities. After nM filtering, binarization at 10 µM threshold, and deduplication, 6,485 new compounds integrated. Rebalancing 65/35 → 48.5/51.5 — direct improvement of model calibration.
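
The filtering and binarization described above can be sketched as below. The 10 µM threshold and the nM filter come from the text; collapsing replicate measurements by their median, treating a value at or below the threshold as a blocker, and the tuple layout are assumptions.

```python
from statistics import median

THRESHOLD_NM = 10_000.0  # 10 µM expressed in nM

def binarize_herg(activities):
    """activities: iterable of (inchikey, value, unit). Returns {inchikey: 0 or 1}."""
    by_key = {}
    for key, value, unit in activities:
        if unit != "nM" or value <= 0:
            continue  # keep only positive measurements reported in nM
        by_key.setdefault(key, []).append(float(value))
    # Assumption: replicates are collapsed by median; <= threshold means blocker (1).
    return {k: int(median(v) <= THRESHOLD_NM) for k, v in by_key.items()}

def positive_fraction(labels):
    """Share of blockers among the labeled compounds."""
    return sum(labels.values()) / len(labels)

# Placeholder records, not ChEMBL data.
records = [
    ("KEY-A", 500.0, "nM"), ("KEY-A", 1500.0, "nM"),
    ("KEY-B", 50_000.0, "nM"),
    ("KEY-C", 10_000.0, "nM"),
    ("KEY-D", 12.0, "%"),
]
herg_labels = binarize_herg(records)
# herg_labels == {"KEY-A": 1, "KEY-B": 0, "KEY-C": 1}
```

Tracking `positive_fraction` before and after integration gives a direct check on the class ratio the enrichment is meant to rebalance.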

3D feature hypothesis tested and invalidated. NPR, USR, USRCAT descriptors (79 dimensions) added for hERG: AUC gain +0.014 only. Topological fingerprints already capture the essentials; the ceiling is a chemical coverage ceiling.

Structural discovery — tri-router. Ames ISS data and hERG ChEMBL data do not benefit the same ensemble model. A tri-router replaces the binary router: V2.4b (Ames-optimized ensemble, 10 endpoints), V2.4d (hERG-optimized ensemble, 4 endpoints). Encoder V2.0a remains as fallback but is selected by no endpoint in optimal routing.

V2.4 results:

Metric            V2.3    V2.4    Delta
Tox21 mean AUC    0.867   0.898   +0.031
Ames AUC          0.843   0.853   +0.010
hERG AUC          0.785   0.800   +0.015
Targets reached   12/14   12/14   =

The gaps on the two missing endpoints narrowed by 37% and 33% respectively (Ames: −0.027 to −0.017; hERG: −0.045 to −0.030). Protocol unchanged: strict 5-fold Bemis-Murcko scaffold CV, InChIKey train ∩ val = ∅, frozen holdout not used for selection.
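
The gap-reduction percentages follow directly from the targets and AUCs reported for V2.3 and V2.4. The short calculation below reproduces them; the variable names are arbitrary.

```python
targets = {"Ames": 0.87, "hERG": 0.83}
v23 = {"Ames": 0.843, "hERG": 0.785}
v24 = {"Ames": 0.853, "hERG": 0.800}

def gap_reduction(endpoint):
    """Percentage by which the distance to target shrank between V2.3 and V2.4."""
    gap_before = targets[endpoint] - v23[endpoint]
    gap_after = targets[endpoint] - v24[endpoint]
    return round(100 * (gap_before - gap_after) / gap_before)

reductions = {e: gap_reduction(e) for e in targets}
# reductions == {"Ames": 37, "hERG": 33}
```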


V2.4 Architecture

The tri-router selects, for each endpoint, the optimal model among three: V2.0a (multi-task GINEConv), V2.4b (Ames-optimized ensemble, active on 10 endpoints), V2.4d (hERG-optimized ensemble, active on 4 endpoints). Each ensemble combines GINEConv + AttentiveFP + Morgan ECFP6 with per-endpoint classification heads. The routing table, fusion weights and AD thresholds constitute Twingital Institute intellectual property. The V2.4 REST API is backward-compatible with V2.3 — the routing field now exposes V2.4b or V2.4d per response.
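
Since the real routing table is proprietary, the sketch below only illustrates the mechanism. The two table entries are placeholders rather than the actual assignments, and every response field except `routing` is a hypothetical name.

```python
ROUTING_TABLE = {"Ames": "V2.4b", "hERG": "V2.4d"}  # placeholder entries only
FALLBACK_MODEL = "V2.0a"

def route(endpoint, table=ROUTING_TABLE):
    """Return the model assigned to an endpoint, or the fallback encoder."""
    return table.get(endpoint, FALLBACK_MODEL)

def build_response(endpoint, probability, in_domain):
    """Assemble a per-endpoint response that exposes the routing decision."""
    return {
        "endpoint": endpoint,
        "probability": probability,  # hypothetical field name
        "in_domain": in_domain,      # hypothetical field name
        "routing": route(endpoint),
    }

response = build_response("hERG", 0.42, True)
unlisted = route("SR-MMP")  # not in the placeholder table, so the fallback applies
```

Adding a field to each response while leaving existing fields untouched is one straightforward way to stay backward-compatible with clients written against the previous version.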

Medallion Pipeline — V2.4 State

The pipeline ingests PubChem (~100K compounds), ChEMBL v34 (~100K + 22,273 hERG activities), Tox21 (7,831 compounds), ISS Ames (1,511 new compounds) and NER enrichment via local LLM. The Silver layer produces ~152,000 deduplicated compounds after RDKit sanitization and InChIKey control. The Gold layer builds training datasets with a strict Bemis-Murcko scaffold split, InChIKey verification that train, val and test are pairwise disjoint, and per-endpoint positive/negative ratio control.
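
A leak check across three sets needs to be pairwise: an empty three-way intersection still allows a compound to sit in two of the sets. The sketch below shows the difference on placeholder keys; `assert_pairwise_disjoint` is a hypothetical helper.

```python
from itertools import combinations

def assert_pairwise_disjoint(**splits):
    """Raise if any identifier appears in more than one named split."""
    for (name_a, keys_a), (name_b, keys_b) in combinations(splits.items(), 2):
        shared = set(keys_a) & set(keys_b)
        if shared:
            raise ValueError(f"{name_a} and {name_b} share {sorted(shared)}")
    return True

train, val, test = {"KEY-A", "KEY-B"}, {"KEY-C"}, {"KEY-B"}

# The three-way intersection is empty even though KEY-B leaks from train to test.
three_way_empty = not (train & val & test)
try:
    assert_pairwise_disjoint(train=train, val=val, test=test)
    leak_detected = False
except ValueError:
    leak_detected = True
```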

Three Methodological Lessons from V2.4

A single model is a compromise, not an optimum. Endpoint specialization (specific data, specific head, router selection) systematically outperforms the shared encoder. The tri-router does not choose the “best model” overall but the best model for each endpoint.

Data beats architecture, but not just any data. Injecting mislabeled data degrades performance even when volume increases. Curation — verification of the primary regulatory source, label harmonization, ratio control — is the work that produces the gain.

Additional features do not compensate for missing data. 3D descriptors, despite their pharmacological justification, provide only marginal gain when the topological model is already correctly trained.

Regulatory Compliance

ToxTwin is not a medical device. Pre-clinical toxicological scoring is outside the “high risk” category (EU AI Act, Annex III). The pipeline is nevertheless designed in anticipation of high-risk requirements: Delta Lake audit trail, MLflow versioning, and uncertainty and applicability domain exposed in every API response.

V3.0 Roadmap

The frozen holdout remains available for a final pre-deployment validation. V3.0 planned scope covers: transition metals (cisplatin, carboplatin, oxaliplatin — specialized featurization for Pt/Ru/Au oxidation states and coordination geometry), DILI, ClinTox and Carcinogens endpoints, and external validation on ECHA or industrial partner corpus.


Technical Note V2.4 — The Tri-Router

Ames and hERG corpus enrichment, selective tri-model architecture, and the methodological lessons of a controlled improvement cycle.



ToxTwin Series — Associated Publications

Technical note · April 2026

ToxTwin V2.4 — The Tri-Router

Ames ISS corpus enrichment (1,511 compounds, ICH S2R1 strain-level labels) and hERG ChEMBL v34 (6,485 compounds, rebalanced to 48.5/51.5), selective tri-model architecture. Tox21 mean AUC 0.898 (+0.031), Ames 0.853, hERG 0.800.


Technical synthesis · 8 pages

ToxTwin V2.3 — Consolidated Technical Synthesis

Architecture, V1→V2.3 trajectory, performance, applicability domain and quality plan. First release with metrics audited on strict scaffold CV protocol.


User guide · 9 pages

ToxTwin V2.3 User Guide

Submit a SMILES, interpret calibrated probability scores, understand the applicability domain, read the pharmacological interpretation.


Validation report · V1.3

Predictive Toxicological Report — Doxorubicin

First published validation report of the ToxTwin pipeline. SR-ARE / SR-MMP / Ames triplet — mechanistic signature of anthracyclines.


Interactive demonstration

Test ToxTwin on your molecules

Analyze your own SMILES via ToxTwin V2.4 — 14 toxicological endpoints, certified calibration, composite applicability domain, endpoint tri-router. 3 free analyses, extended access on registration.
