ToxTwin: Industrialisation of a Pre-Phase 1 Toxicological Prediction Pipeline

ToxTwin Series — Article 1/3. See also: Report RPT-2026-001 · API Tests & Guide

Toxicological prediction is one of the most costly bottlenecks in pharmaceutical development. Approximately 90% of drug candidates fail in clinical trials, and roughly 30% of those failures are attributed to toxicity (DiMasi et al., Tufts CSDD, 2016). This article presents ToxTwin, an end-to-end pipeline designed to score drug candidates prior to Phase 1 entry, on sovereign infrastructure with full regulatory traceability.

The economic problem and conventional limitations

A toxicity signal detected in Phase 2 or 3 represents an unrecoverable cost of several hundred million dollars. In vitro assays cover only part of the relevant toxicity mechanisms. Animal models suffer from limited inter-species translatability (~70% concordance for hepatic toxicity). Classical QSAR approaches depend on predefined descriptors and generalise poorly outside their training domain. Published deep learning models (DeepTox, AttentiveFP, GROVER) focus on predictive performance without documenting the industrial pipeline required for deployment in a regulated context.

The ToxTwin hypothesis: build a “molecular digital twin” integrating multi-source data into a traceable, reproducible pipeline aligned with European regulatory requirements (MDR, EU AI Act), to produce an actionable risk score upstream of Phase 1, accompanied by uncertainty quantification and an applicability domain indicator.

Data architecture: Medallion model applied to toxicology

Bronze layer: raw ingestion

Immutable archive of source data, stored in Delta Lake 3.2 on PySpark 3.5.3 (ACID transactions, time-travel). Three sources are ingested: PubChem (100,000 compounds via PUG-REST), ChEMBL v34 (100,000 compounds with MW ≤ 900 Da, 500,000 biological activities including hERG for cardiotoxicity), and Tox21 (7,831 compounds, 78,000+ binary labels across 12 biological assays covering stress response and nuclear receptors).
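As an illustration of the PubChem side of the ingestion, the sketch below builds batched PUG-REST property requests. The batch size of 100, the choice of properties, and the helper names are assumptions made for the example, not details of the ToxTwin pipeline; no request is actually sent.

```python
from urllib.parse import quote

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_property_url(cids, properties=("MolecularWeight", "InChIKey")):
    """Build a PUG-REST URL requesting compound properties for a batch of CIDs."""
    if not cids:
        raise ValueError("at least one CID is required")
    cid_part = ",".join(str(int(c)) for c in cids)
    prop_part = ",".join(quote(p) for p in properties)
    return f"{PUG_REST_BASE}/compound/cid/{cid_part}/property/{prop_part}/JSON"


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical batch size: 100 CIDs per request (an assumption, not from the article).
urls = [pubchem_property_url(batch) for batch in batched(list(range(1, 251)), 100)]
aspirin_url = pubchem_property_url([2244])  # CID 2244 is aspirin
```

Each raw JSON response would then be written unchanged to the Bronze layer, so that later curation steps can always be replayed from the original payload.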

NER enrichment via local LLM

ToxTwin’s main differentiation: PubChem toxicological monographs are submitted to Phi-4 14B (deployed locally via Ollama) for named entity extraction. From 7,345 compounds, the extraction yields 2,453 LD50s, 189 LDLos, 151 TDLos, 112 NOAELs, 3,138 target organ annotations, and 876 chronic toxicity data points. NER success rate: 99.99%. A composite tox_score (0–5) is calculated on a severity scale. The choice of a local LLM rests on data sovereignty, zero marginal cost at scale, and reproducibility via a versioned model.

Silver layer: 5-phase curation

The pipeline produces 144,879 deduplicated compounds in five phases: chemical sanitisation (RDKit); cross-source deduplication on InChIKey (priority ChEMBL > PubChem); drug-like filtering (Lipinski +20%); a join with the Tox21 tox_annotations (77,733 pivoted labels); and a join with the enriched tox_profiles (5,400 Phi-4 profiles).
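The deduplication and drug-likeness phases can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions: "Lipinski +20%" is read here as the four classical Lipinski limits each relaxed by 20%, the record layout and function names are hypothetical, and the descriptor values are illustrative rather than taken from the ToxTwin dataset.

```python
# Lower value wins when the same InChIKey appears in several sources.
SOURCE_PRIORITY = {"chembl": 0, "pubchem": 1}

# Classical Lipinski limits; the 20% tolerance is one reading of "Lipinski +20%".
LIPINSKI_LIMITS = {"mw": 500.0, "logp": 5.0, "hbd": 5, "hba": 10}
TOLERANCE = 1.20


def deduplicate(records):
    """Keep one record per InChIKey, preferring ChEMBL over PubChem."""
    best = {}
    for rec in records:
        kept = best.get(rec["inchikey"])
        if kept is None or SOURCE_PRIORITY[rec["source"]] < SOURCE_PRIORITY[kept["source"]]:
            best[rec["inchikey"]] = rec
    return list(best.values())


def is_drug_like(rec):
    """True when every descriptor stays within its relaxed Lipinski limit."""
    return all(rec[name] <= limit * TOLERANCE for name, limit in LIPINSKI_LIMITS.items())


# Illustrative records; the last key is a made-up placeholder, not a real InChIKey.
records = [
    {"inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "source": "pubchem", "mw": 180.2, "logp": 1.2, "hbd": 1, "hba": 4},
    {"inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "source": "chembl", "mw": 180.2, "logp": 1.2, "hbd": 1, "hba": 4},
    {"inchikey": "HEFNNWSXXWATRW-UHFFFAOYSA-N", "source": "pubchem", "mw": 206.3, "logp": 3.5, "hbd": 1, "hba": 2},
    {"inchikey": "HYPOTHETICAL-LARGE-COMPOUND", "source": "chembl", "mw": 750.0, "logp": 6.5, "hbd": 7, "hba": 13},
]

unique = deduplicate(records)
curated = [rec for rec in unique if is_drug_like(rec)]
```

In the actual pipeline the same logic would more naturally be expressed as a PySpark window over InChIKey, but the priority rule and the filter are the same.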

Gold layer: training dataset

Featurisation: Morgan fingerprints (ECFP4, 2,048 bits) + 14 RDKit physicochemical descriptors = 2,062 dimensions per compound. Bemis-Murcko scaffold split: 4,074 train (≈81%), 452 validation (≈9%), 502 test holdout (≈10%).
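A scaffold split keeps every compound sharing a Bemis-Murcko scaffold in the same partition, so the test set measures generalisation to unseen chemotypes. The sketch below assumes the scaffold SMILES have already been computed (for instance with RDKit) and uses a common greedy assignment, largest scaffold groups first; the target fractions and the function name are assumptions, not the article's exact procedure.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Assign whole scaffold groups to train, validation, or test.

    `scaffolds[i]` is the precomputed scaffold SMILES of compound i.
    Returns three lists of compound indices.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first; ties broken by scaffold string for determinism.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))

    n = len(scaffolds)
    train, valid, test = [], [], []
    for _, members in ordered:
        if len(train) + len(members) <= frac_train * n:
            train.extend(members)
        elif len(valid) + len(members) <= frac_valid * n:
            valid.extend(members)
        else:
            test.extend(members)
    return train, valid, test


# Toy input: ten compounds spread over four scaffolds.
scaffolds = ["c1ccccc1"] * 5 + ["c1ccncc1"] * 3 + ["C1CCCCC1", "c1ccoc1"]
train, valid, test = scaffold_split(scaffolds)
```

Because groups are assigned whole, the realised proportions only approximate the targets, which is consistent with split sizes that do not land exactly on round percentages.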

Modelling: ToxGNN-V1

Baseline and architecture

A per-endpoint Random Forest (mean ROC-AUC 0.836) establishes the reference baseline. ToxGNNEncoder is a 5-layer Graph Isomorphism Network (GIN) with jump connections, 2,132,038 trainable parameters, hidden dimension 512, dropout 0.3. Each atom is a node (128-dimensional features) and each bond a bidirectional edge (11-dimensional features).
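For reference, the standard GIN node update (Xu et al., 2019) is given below. How ToxGNNEncoder injects the 11-dimensional bond features into the aggregation is not stated in this summary, so the formula should be read as the generic layer rather than the exact ToxGNN-V1 variant; here h_v^{(k)} is the embedding of atom v after layer k, N(v) its bonded neighbours, and ε^{(k)} a learnable or fixed scalar.

```latex
h_v^{(k)} = \mathrm{MLP}^{(k)}\!\left( \left(1 + \epsilon^{(k)}\right) h_v^{(k-1)}
            + \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)} \right),
\qquad k = 1, \dots, 5
```

The sum aggregator, as opposed to a mean or max, is what lets the layer distinguish neighbourhoods that differ only in the number of identical neighbours.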

AttrMasking pre-training

GROVER-inspired strategy: 25% of atoms are masked and their one-hot attributes reconstructed, the molecular equivalent of BERT's masked-token objective. Training ran for 100 epochs on 144,877 compounds (loss: 0.084 → 0.0045). Hyperparameters were optimised via Optuna (TPE sampler, 30 trials).
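The masking step itself is simple to sketch. The example below picks 25% of the atoms of a molecule at random, replaces their type with a mask token, and keeps the original types as reconstruction targets. Only the 25% rate comes from the text; the mask token, the rounding rule, and the restriction to atom type (rather than the full 128-dimensional attribute vector) are assumptions made for illustration.

```python
import random

MASK_RATE = 0.25  # fraction of atoms masked, as stated in the text


def choose_masked_atoms(num_atoms, rate=MASK_RATE, rng=None):
    """Pick the indices of atoms to mask (at least one per molecule)."""
    if num_atoms < 1:
        raise ValueError("molecule must contain at least one atom")
    if rng is None:
        rng = random.Random()
    num_masked = max(1, round(num_atoms * rate))
    return sorted(rng.sample(range(num_atoms), num_masked))


def mask_atoms(atom_types, masked_idx, mask_token="[MASK]"):
    """Return the masked input sequence and the reconstruction targets."""
    masked = set(masked_idx)
    inputs = [mask_token if i in masked else a for i, a in enumerate(atom_types)]
    targets = {i: atom_types[i] for i in masked_idx}
    return inputs, targets


# Heavy atoms of aspirin, in the order of the SMILES CC(=O)Oc1ccccc1C(=O)O.
aspirin_atoms = ["C", "C", "O", "O", "C", "C", "C", "C", "C", "C", "C", "O", "O"]
masked_idx = choose_masked_atoms(len(aspirin_atoms), rng=random.Random(0))
inputs, targets = mask_atoms(aspirin_atoms, masked_idx)
```

During pre-training, the encoder sees `inputs` and is trained with a cross-entropy loss to recover `targets` at the masked positions only.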

Supervised multi-task fine-tuning

A shared layer feeds 12 independent classification heads. Freeze/unfreeze strategy: 10 frozen epochs, then unfrozen training at lr ×0.1. The loss is a weighted BCE with masking of missing labels. Version v2 was retained (dropout 0.5, AUC 0.857).
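Tox21 labels are sparse: most compounds were not tested in all 12 assays, so the loss has to skip missing entries instead of treating them as negatives. The following sketch shows one way to write a weighted BCE with missing-label masking for a single compound. Representing a missing label as `None`, applying the weight to the positive term only, and averaging over observed labels are assumptions; the article does not give the exact formulation.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def masked_weighted_bce(logits, labels, pos_weights):
    """Weighted binary cross-entropy over observed labels only.

    logits      : one raw score per endpoint
    labels      : 1, 0, or None when the assay was not run
    pos_weights : per-endpoint weight applied to the positive term
    """
    total, count = 0.0, 0
    for z, y, w in zip(logits, labels, pos_weights):
        if y is None:  # missing label: contributes nothing to the loss
            continue
        log_p = -softplus(-z)     # log(sigmoid(z))
        log_not_p = -softplus(z)  # log(1 - sigmoid(z))
        total += -(w * y * log_p + (1 - y) * log_not_p)
        count += 1
    return total / count if count else 0.0


# Three endpoints, the second one untested for this compound.
loss = masked_weighted_bce([0.0, 5.0, -2.0], [1, None, 0], [2.0, 1.0, 3.0])
```

Up-weighting positives compensates for class imbalance but also pushes predicted probabilities upward, which is one reason a separate calibration check is worthwhile.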

Results and benchmark positioning

ToxGNN-V1 outperforms the RF baseline on 7/12 endpoints (+0.021 mean AUC). Most significant gains: SR-p53 (+0.077), NR-ER-LBD (+0.078). Watchpoint: NR-AR (0.659). MoleculeNet positioning: between DeepTox (0.846) and AttentiveFP (0.863); the gap to GROVER (0.876) is attributed to pre-training corpus size (144K vs 10M+).

Post-training evaluation

ECE calibration: 0.149 (FAIL, target < 0.10) — root cause: weighted BCE shifting logits. MC Dropout epistemic uncertainty (30 passes): mean 0.0459 (PASS). Tanimoto applicability domain: 38.2% out-of-domain on scaffold split (expected by design, critical protection flag in production).
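Two of these checks are easy to reproduce on toy data. The sketch below computes an expected calibration error with equal-width bins on the predicted positive-class probability, and a nearest-neighbour Tanimoto applicability-domain flag on fingerprints represented as sets of on-bits. The binning scheme, the nearest-neighbour rule, and the function names are assumptions; the 0.3 threshold is the one quoted in the key takeaways.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: weighted mean gap between predicted and observed positive rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes to the last bin
        bins[idx].append((p, y))

    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        pos_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(mean_p - pos_rate)
    return ece


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def in_domain(query_bits, training_fps, threshold=0.3):
    """Flag a query as in-domain when its nearest training neighbour is similar enough."""
    nearest = max((tanimoto(query_bits, fp) for fp in training_fps), default=0.0)
    return nearest >= threshold, nearest


ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9, 0.1, 0.1], [1, 1, 1, 0, 0, 0])
flag, nearest = in_domain({1, 2, 3}, [{2, 3, 4}, {9}])
```

A query that falls below the threshold would still receive a score, but with the out-of-domain flag set so that downstream users can discount it.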

Sovereign infrastructure and regulatory considerations

Stack: Windows Server 2025 + WSL2 Ubuntu 24.04, RTX 5080, PySpark/Delta Lake, MLflow, Ollama Phi-4 14B, FastAPI. The ToxTwin score is not a medical device under MDR, but the pipeline is designed as if it were high-risk, to anticipate potential reclassification. Four regulatory differentiation mechanisms: Phi-4 NER extraction, epistemic uncertainty exposed in the API, applicability domain with automatic flag, and an auditable Medallion pipeline.

Functional validation and API access

Validation on aspirin (NR-ER 0.350, LD50 200 mg/kg, tox_score 4/5, Tanimoto 0.594) and ibuprofen (NR-PPAR-gamma 0.383 — documented structure-activity relationship, Lehmann et al. 1997, emerging without explicit supervision). REST API (FastAPI) with Supabase Auth, 10 requests/week. Roadmap: v1.1 (ECE fix, GPU, NR-AR enrichment), v2.0 (1M+ compounds, multi-head attention, full ADMET profile).

Key takeaways

ToxGNN-V1: 5-layer GIN, Mean AUC 0.857 on Bemis-Murcko scaffold split, 12 Tox21 endpoints — between DeepTox (0.846) and AttentiveFP (0.863).
Medallion Bronze/Silver/Gold architecture on PySpark/Delta Lake: 144,879 deduplicated compounds, ACID transactions, time-travel for audit trail.
NER extraction by local Phi-4 14B: 2,453 LD50s, 3,138 target organ annotations, 99.99% success rate on 7,345 compounds.
AttrMasking pre-training (molecular BERT equivalent) on 144,877 compounds, Bayesian optimisation via Optuna (30 trials).
Epistemic uncertainty (MC Dropout, PASS) and applicability domain (Tanimoto, threshold 0.3) integrated in every API response.
Sovereign infrastructure (Windows Server 2025, RTX 5080, no cloud) aligned with MDR Class IIa and EU AI Act.