From raw data to toxicological score — architecture, pipeline and regulatory considerations
ToxTwin Series — Article 1/3. See also: Report RPT-2026-001 · API Tests & Guide
Toxicological prediction is one of the most costly bottlenecks in pharmaceutical development. Approximately 90% of drug candidates fail in clinical trials, ~30% due to toxicity (DiMasi et al., Tufts CSDD, 2016). This article presents ToxTwin, an end-to-end pipeline designed to score drug candidates prior to Phase 1 entry, on sovereign infrastructure with full regulatory traceability.
A toxicity signal detected in Phase 2 or 3 represents an irrecoverable cost of several hundred million dollars. In vitro assays offer only partial toxicity mechanism coverage. Animal models suffer from limited inter-species translatability (~70% concordance for hepatic toxicity). Classical QSAR approaches depend on predefined descriptors and generalise poorly outside their training domain. Published GNNs (DeepTox, AttentiveFP, GROVER) focus on predictive performance without documenting the industrial pipeline required for deployment in a regulated context.
The ToxTwin hypothesis: build a “molecular digital twin” integrating multi-source data into a traceable, reproducible pipeline aligned with European regulatory requirements (MDR, EU AI Act), to produce an actionable risk score upstream of Phase 1, accompanied by uncertainty quantification and an applicability domain indicator.
Source data are kept as an immutable archive in Delta Lake 3.2 on PySpark 3.5.3, which provides ACID transactions and time travel. Three sources feed it: PubChem (100,000 compounds retrieved via PUG-REST), ChEMBL v34 (100,000 compounds with MW ≤ 900 Da and 500,000 biological activities, including hERG for cardiotoxicity), and Tox21 (7,831 compounds with more than 78,000 binary labels across 12 biological assays covering stress response and nuclear receptors).
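As an illustration of the PubChem retrieval step, here is a minimal sketch that builds a PUG-REST property request for a batch of compound IDs. The helper name `property_url`, the batch contents and the chosen properties are assumptions made for the example; the article does not describe how ToxTwin batches or stores its requests, and no network call is made here.

```python
# Hypothetical helper: builds a PUG-REST URL, does not send the request.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def property_url(cids, properties=("InChIKey", "MolecularWeight")):
    """Build a PUG-REST URL requesting compound properties for a batch of CIDs."""
    cid_part = ",".join(str(c) for c in cids)
    prop_part = ",".join(properties)
    return f"{PUG_REST_BASE}/compound/cid/{cid_part}/property/{prop_part}/JSON"

# CID 2244 is aspirin and CID 3672 is ibuprofen.
url = property_url([2244, 3672])
print(url)
```

In a raw, immutable layer the JSON response would typically be stored as received, with parsing deferred to a later stage.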
ToxTwin’s main differentiator is LLM-based extraction: PubChem toxicological monographs are submitted to Phi-4 14B, deployed locally via Ollama, for named entity recognition (NER). From 7,345 compounds, the model extracted 2,453 LD50s, 189 LDLos, 151 TDLos, 112 NOAELs, 3,138 target organ annotations and 876 chronic toxicity data points. The NER success rate is 99.99%. A composite tox_score (0–5) is then calculated on a severity scale. The choice of a local LLM rests on data sovereignty, zero marginal cost at scale, and reproducibility through a versioned model.
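One way to make such an extraction step auditable is to validate every model response against a fixed schema before it enters the pipeline. The sketch below assumes the model is prompted to return JSON; the key names and the definition of "success" (a response that parses and contains every expected key) are hypothetical and not taken from ToxTwin.

```python
import json

# Hypothetical schema: one key per entity type listed in the text.
EXPECTED_KEYS = {"ld50", "ldlo", "tdlo", "noael", "target_organs", "chronic_toxicity"}

def parse_extraction(raw):
    """Return the parsed record, or None when the model output is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not EXPECTED_KEYS.issubset(data):
        return None
    return data

def success_rate(outputs):
    """Fraction of raw outputs that parse into a complete record."""
    if not outputs:
        return 0.0
    return sum(parse_extraction(o) is not None for o in outputs) / len(outputs)

outputs = [
    '{"ld50": "200 mg/kg", "ldlo": null, "tdlo": null, "noael": null, '
    '"target_organs": [], "chronic_toxicity": null}',
    "not json at all",        # rejected: does not parse
    '{"ld50": "200 mg/kg"}',  # rejected: missing keys
]
rate = success_rate(outputs)
```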
The pipeline produces 144,879 deduplicated compounds in five steps: chemical sanitisation (RDKit), cross-source deduplication on InChIKey (ChEMBL takes priority over PubChem), drug-like filtering (Lipinski +20%), a join with the Tox21 tox_annotations (77,733 pivoted labels), and a join with the enriched tox_profiles (5,400 Phi-4 profiles).
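The cross-source deduplication rule can be sketched in a few lines. The record layout (`inchikey`, `source`, `name`) is an assumption made for the example; only the key (InChIKey) and the priority order (ChEMBL over PubChem) come from the text. In the actual pipeline this would presumably run as a Spark job rather than in plain Python.

```python
# Lower number = higher priority. The ordering ChEMBL > PubChem is from the text.
SOURCE_PRIORITY = {"chembl": 0, "pubchem": 1}

def deduplicate(records):
    """Keep one record per InChIKey, preferring the higher-priority source."""
    best = {}
    for rec in records:
        key = rec["inchikey"]
        current = best.get(key)
        if current is None or SOURCE_PRIORITY[rec["source"]] < SOURCE_PRIORITY[current["source"]]:
            best[key] = rec
    return list(best.values())

records = [
    {"inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "source": "pubchem", "name": "aspirin"},
    {"inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "source": "chembl", "name": "ASPIRIN"},
    {"inchikey": "HEFNNWSXXWATRW-UHFFFAOYSA-N", "source": "pubchem", "name": "ibuprofen"},
]
unique = deduplicate(records)
```

Here the two aspirin records collapse into the ChEMBL one, while ibuprofen, present only in PubChem, is kept as is.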
Featurisation combines Morgan fingerprints (ECFP4, 2,048 bits) with 14 RDKit physicochemical descriptors, giving 2,062 dimensions per compound. A Bemis-Murcko scaffold split yields 4,074 training compounds (80%), 452 validation compounds and a 502-compound test holdout.
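A scaffold split keeps every compound sharing a scaffold in the same subset, so the test set probes generalisation to unseen chemotypes. The sketch below shows the common greedy variant, assuming scaffold strings have already been computed (for instance with RDKit's `MurckoScaffold.MurckoScaffoldSmiles`). The 80/10/10 fractions and the tie-breaking rule are assumptions; the exact procedure used by ToxTwin is not described.

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split compound indices so that no scaffold spans two subsets."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    # Largest scaffold families first, so rare scaffolds fall into validation/test.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    n = len(scaffolds)
    train_cut = frac_train * n
    valid_cut = (frac_train + frac_valid) * n
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cut:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cut:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test

# Toy input: scaffold SMILES for ten compounds (benzene, pyridine, cyclohexane, furan).
scaffolds = ["c1ccccc1"] * 5 + ["c1ccncc1"] * 3 + ["C1CCCCC1", "c1ccoc1"]
train, valid, test = scaffold_split(scaffolds)
```

Because whole scaffold families are assigned at once, the realised proportions only approximate the requested fractions, which is consistent with the slightly uneven 4,074 / 452 / 502 counts reported above.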
A Random Forest per endpoint (mean ROC-AUC 0.836) establishes the reference baseline. ToxGNNEncoder is a 5-layer Graph Isomorphism Network (GIN) with jump connections, 2,132,038 trainable parameters, a hidden dimension of 512 and dropout of 0.3. Each atom is a node (128 dimensions) and each bond is a bidirectional edge (11 dimensions).
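To make the graph representation concrete, the sketch below turns a bond list into bidirectional edges and applies the sum aggregation at the core of a GIN layer, (1 + eps) * h_v plus the sum of neighbour features. It is a simplified illustration: the MLP that follows the aggregation, the 11-dimensional edge features and the jump connections are all omitted, and the function names are hypothetical.

```python
def bidirectional_edges(bonds):
    """Expand each undirected bond (u, v) into two directed edges."""
    edges = []
    for u, v in bonds:
        edges.append((u, v))
        edges.append((v, u))
    return edges

def gin_aggregate(features, edges, eps=0.0):
    """Return (1 + eps) * h_v + sum of neighbour features for every node v."""
    out = [[(1.0 + eps) * x for x in h] for h in features]
    for src, dst in edges:
        for k, x in enumerate(features[src]):
            out[dst][k] += x
    return out

# Three atoms in a chain (0-1-2) with toy 2-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = bidirectional_edges([(0, 1), (1, 2)])
aggregated = gin_aggregate(features, edges)
```

Stacking five such layers lets each atom's representation depend on its neighbourhood up to five bonds away.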
Pre-training follows a GROVER-inspired strategy: 25% of atoms are masked and their one-hot attributes reconstructed, the molecular equivalent of BERT's masked language modelling. Training ran for 100 epochs on 144,877 compounds (loss: 0.084 → 0.0045). Hyperparameters were optimised with Optuna (TPE sampler, 30 trials).
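The masking step of this pre-training objective can be sketched as follows. The 25% rate comes from the text; the rounding rule, the `[MASK]` token and the use of element symbols as the attribute to reconstruct are assumptions made for the example.

```python
import random

def mask_atoms(num_atoms, mask_rate=0.25, seed=None):
    """Pick the atom indices to hide; at least one atom for non-empty molecules."""
    if num_atoms == 0:
        return []
    rng = random.Random(seed)
    n_mask = max(1, round(num_atoms * mask_rate))
    return sorted(rng.sample(range(num_atoms), n_mask))

def apply_mask(atom_types, masked, mask_token="[MASK]"):
    """Return the corrupted atom list and the targets the model must reconstruct."""
    masked_set = set(masked)
    corrupted = [mask_token if i in masked_set else a for i, a in enumerate(atom_types)]
    targets = {i: atom_types[i] for i in masked}
    return corrupted, targets

# Aspirin has 13 heavy atoms (9 carbons, 4 oxygens); the ordering here is illustrative.
atoms = ["C"] * 9 + ["O"] * 4
masked = mask_atoms(len(atoms), seed=0)
corrupted, targets = apply_mask(atoms, masked)
```

The reconstruction loss is then computed only on the masked positions, so the encoder has to infer each hidden atom from its chemical context.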
The multi-task model uses a shared layer followed by 12 independent classification heads. Training follows a freeze/unfreeze strategy: 10 epochs frozen, then unfrozen with the learning rate multiplied by 0.1. The loss is a weighted BCE that masks missing labels. Version v2 was retained (dropout 0.5, AUC 0.857).
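Tox21 labels are sparse: most compounds were not measured on every assay, so the loss has to skip missing entries rather than treat them as negatives. The sketch below shows one way to write a weighted BCE with masking for a single endpoint. It mirrors the `pos_weight` convention of PyTorch's `BCEWithLogitsLoss`; whether ToxTwin uses that exact weighting scheme is an assumption.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def masked_weighted_bce(logits, labels, pos_weight=1.0):
    """Mean weighted BCE over observed labels; entries equal to None are skipped."""
    total, count = 0.0, 0
    for z, y in zip(logits, labels):
        if y is None:  # missing assay result: contributes nothing to the loss
            continue
        log_p = -softplus(-z)         # log(sigmoid(z))
        log_one_minus_p = -softplus(z)  # log(1 - sigmoid(z))
        total += -(pos_weight * y * log_p + (1 - y) * log_one_minus_p)
        count += 1
    return total / count if count else 0.0

# Third label is missing, so only the first two predictions are scored.
loss = masked_weighted_bce([0.0, 0.0, 5.0], [1, 0, None], pos_weight=2.0)
```

Up-weighting positives compensates for class imbalance but also pushes predicted probabilities upward, which can degrade calibration.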
ToxGNN-V1 outperforms the RF baseline on 7 of 12 endpoints (+0.021 mean AUC). The largest gains are on SR-p53 (+0.077) and NR-ER-LBD (+0.078). NR-AR (0.659) remains a watchpoint. Against MoleculeNet results, the model sits between DeepTox (0.846) and AttentiveFP (0.863); the gap with GROVER (0.876) is explained by pre-training corpus size (144K vs 10M+).
ECE calibration is 0.149 (FAIL, target < 0.10); the root cause is the weighted BCE shifting logits. MC Dropout epistemic uncertainty over 30 passes has a mean of 0.0459 (PASS). The Tanimoto applicability domain flags 38.2% as out-of-domain on the scaffold split, which is expected by design and serves as a critical protection flag in production.
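Two of these checks are simple enough to sketch directly. The first function computes an expected calibration error with equal-width bins, comparing the mean predicted probability in each bin with the observed positive rate; several ECE variants exist and the one used by ToxTwin is not specified. The second flags a query compound as out-of-domain when its nearest training neighbour falls below a Tanimoto threshold; the threshold value and the representation of fingerprints as sets of on-bit indices are assumptions.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between predicted probability and observed positive rate."""
    n = len(probs)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_p = sum(p for p, _ in bucket) / len(bucket)
        pos_rate = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / n * abs(mean_p - pos_rate)
    return ece

def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def out_of_domain(query_bits, train_fps, threshold=0.3):
    """Return (flag, nearest similarity); the threshold is a hypothetical value."""
    nearest = max((tanimoto(query_bits, fp) for fp in train_fps), default=0.0)
    return nearest < threshold, nearest

ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9, 0.1, 0.1], [1, 1, 1, 0, 0, 0])
flag, nearest = out_of_domain({1, 2, 3}, [{2, 3, 4}, {7, 8}], threshold=0.6)
```

Returning the nearest similarity alongside the flag lets an API consumer see how far a compound sits from the training set, not just whether it crossed the cut-off.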
The stack is Windows Server 2025 with WSL2 Ubuntu 24.04, an RTX 5080, PySpark/Delta Lake, MLflow, Ollama Phi-4 14B and FastAPI. The ToxTwin score is not a medical device under the MDR, but the pipeline is designed as if it were high-risk, to anticipate potential reclassification. Four mechanisms provide regulatory differentiation: Phi-4 NER extraction, epistemic uncertainty exposed in the API, an applicability domain with an automatic flag, and an auditable Medallion pipeline.
Validation was run on aspirin (NR-ER 0.350, LD50 200 mg/kg, tox_score 4/5, Tanimoto 0.594) and ibuprofen (NR-PPAR-gamma 0.383). The ibuprofen signal matches a documented structure-activity relationship (Lehmann et al. 1997) and emerges without explicit supervision. The REST API (FastAPI) uses Supabase Auth and is limited to 10 requests/week. Roadmap: v1.1 (ECE fix, GPU, NR-AR enrichment), v2.0 (1M+ compounds, multi-head attention, full ADMET profile).