ToxTwin V2.3 — Consolidated Technical Synthesis

Overview

ToxTwin V2.3 is a predictive toxicological scoring platform for pre-Phase 1 assessment, built on an end-to-end graph neural network (GNN) pipeline. It covers 14 toxicological endpoints (12 Tox21 assays, Ames mutagenicity and hERG channel inhibition) and adds a pharmacological interpretation layer powered by a local LLM. The entire system runs on sovereign infrastructure with no cloud dependency. V2.3 is the first release whose metrics are validated under a strict 5-fold scaffold cross-validation (CV) protocol.

Trajectory V1.0 → V2.3

Development went through six major versions. V1.0 reported an AUC of 0.857 that proved irreproducible. A full audit in V1.3 identified four fundamental biases: a silent architecture incompatibility that masked randomly initialised weights, data leakage through validation set contamination, circularity in the Ames SMARTS labels, and an applicability domain computed in a degenerate space. Once these were corrected, the true AUC was 0.594 ± 0.056.

Three corrective phases then raised the mean Tox21 AUC to 0.867: replacement of the pre-training corpus with a reference drug-like corpus (Phase 1), fusion of complementary molecular representations (Phase 2), and a per-endpoint selection mechanism (Phase 3). The fusion logic, routing mechanism and assignment tables are Twingital Institute intellectual property.

Architecture

The V2.3 architecture combines multiple molecular representations — topological, attentional and substructural — through a fusion and per-endpoint selection mechanism. Detailed component specifications, dimensionalities and parameter counts are Twingital Institute intellectual property.

The inference pipeline proceeds through SMILES resolution, molecular featurisation, encoding and fusion (intellectual property), probabilistic calibration and applicability domain evaluation.

Medallion Data Pipeline

The three-layer Medallion pipeline (Bronze, Silver, Gold) integrates data from PubChem (~100,000 compounds), ChEMBL v34 (~100,000 compounds plus activities), Tox21 (7,831 compounds) and named-entity recognition (NER) enrichment via a local LLM (~7,300 processed profiles). Multi-phase curation produces approximately 145,000 deduplicated compounds. The training dataset uses a strict Bemis-Murcko scaffold split with InChIKey verification.
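The scaffold split described above can be illustrated with a minimal sketch. It assumes that the Bemis-Murcko scaffold of each compound has already been computed upstream and is supplied as a string, and that compounds are keyed by a unique identifier such as an InChIKey. The function name, the greedy balancing rule and the example identifiers are illustrative assumptions, not the ToxTwin implementation.

```python
from collections import defaultdict


def scaffold_kfold(scaffold_by_id, n_folds=5):
    """Assign compounds to folds so that no scaffold appears in two folds.

    scaffold_by_id maps a unique compound key (for example an InChIKey)
    to a precomputed scaffold string. Both are assumptions of this sketch.
    """
    # Group compound keys by scaffold.
    groups = defaultdict(list)
    for compound_id, scaffold in scaffold_by_id.items():
        groups[scaffold].append(compound_id)

    # Greedy balancing: largest scaffold groups first, each placed
    # in the fold that currently holds the fewest compounds.
    folds = [[] for _ in range(n_folds)]
    for scaffold in sorted(groups, key=lambda s: (-len(groups[s]), s)):
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(groups[scaffold])
    return folds


# Hypothetical compound keys and scaffolds, for illustration only.
example = {
    "cmpd-1": "c1ccccc1",
    "cmpd-2": "c1ccccc1",
    "cmpd-3": "c1ccncc1",
    "cmpd-4": "C1CCCCC1",
    "cmpd-5": "c1ccc2ccccc2c1",
    "cmpd-6": "c1ccncc1",
}
folds = scaffold_kfold(example, n_folds=3)
```

Because whole scaffold groups are moved together, a model evaluated on one fold never sees a training compound that shares its scaffold, which is what makes the split stricter than a random one.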

Applicability Domain

The composite AD score is computed from independent components assessing structural similarity, latent space proximity and regional density. The components, weights and decision thresholds are Twingital Institute intellectual property. An out-of-domain score does not mean toxicity — it means insufficient data for a reliable prediction.
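As a purely illustrative sketch of how three such signals could be combined, the snippet below computes a Tanimoto similarity on feature sets as one possible structural component and takes a weighted mean of three scores in [0, 1]. The actual components, weights and thresholds are not disclosed; the equal weights, the function names and the choice of Tanimoto similarity are all assumptions made for the example.

```python
def tanimoto(features_a, features_b):
    """Tanimoto (Jaccard) similarity between two sets of feature identifiers."""
    a, b = set(features_a), set(features_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def composite_ad(structural, latent, density, weights=(1.0, 1.0, 1.0)):
    """Weighted mean of three applicability-domain signals, each in [0, 1].

    The equal default weights are placeholders, not the ToxTwin weights.
    """
    parts = (structural, latent, density)
    if any(not 0.0 <= p <= 1.0 for p in parts):
        raise ValueError("each component must lie in [0, 1]")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * p for w, p in zip(weights, parts)) / total


# Hypothetical inputs: nearest-neighbour structural similarity plus
# latent-proximity and density scores assumed to be normalised upstream.
structural = tanimoto({1, 2, 3}, {2, 3, 4})
score = composite_ad(structural, latent=0.8, density=0.6)
```

A low composite score in this sketch carries the same meaning as in the text: the query lies far from the training data, so the prediction is unreliable, not that the compound is toxic.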

Probabilistic Calibration

Each endpoint has a calibrator trained on out-of-fold predictions from the 5-fold CV protocol. Post-calibration ECE is below 0.05 across all endpoints.
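For reference, the expected calibration error (ECE) of a binary endpoint can be computed as sketched below. The sketch uses equal-width bins over the predicted probability of the positive class and compares the mean prediction in each bin with the observed positive fraction. The number of bins (10) and the function name are assumptions, since the binning scheme used in ToxTwin is not stated.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE with equal-width bins over the predicted probability."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equal in length")

    # Place each (probability, label) pair in its bin; p == 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    # Weighted average of |mean predicted probability - observed positive rate|.
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(mean_pred - frac_pos)
    return ece


# Four predictions of 0.9 with three true positives: |0.9 - 0.75| = 0.15.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

Computing the metric on out-of-fold predictions, as the protocol above does, keeps the calibrator from being scored on data it was fitted to.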

LLM Interpretation

A local LLM transforms raw scores into pharmacologically grounded toxicological reports. The anti-hallucination mechanisms, prompt architecture and knowledge base structure are Twingital Institute intellectual property. The system produces a dual output: a human-readable structured report and machine-readable structured data.

V2.3 Performance

The 12 Tox21 endpoints achieve a mean AUC of 0.867 ± 0.043. All 12 Tox21 targets are reached; Ames (0.843, target 0.87) and hERG (0.785, target 0.83) remain below target. ToxTwin outperforms AttentiveFP on SR-MMP (+0.061) and matches GROVER on NR-Aromatase (+0.005). It remains behind DeepAmes (−0.046) and CardioTox (−0.087).

Technical Limits

ToxTwin V2.3 is limited to molecules with fewer than 500 heavy atoms and to the elements C, H, N, O, S, P, F, Cl, Br and I. Transition metals are not supported (support is planned for V3.0), and the platform is not recommended for peptides longer than 5 amino acids.
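The element and size limits lend themselves to a simple pre-check before scoring. The sketch below assumes the molecule has already been parsed into a list of element symbols, one per atom with hydrogens explicit or omitted, and reads "fewer than 500" as a strict bound. The function name and the message wording are hypothetical, and the peptide-length recommendation is not covered.

```python
SUPPORTED_ELEMENTS = frozenset({"C", "H", "N", "O", "S", "P", "F", "Cl", "Br", "I"})
MAX_HEAVY_ATOMS = 500  # exclusive upper bound, as read from the stated limit


def check_limits(atom_symbols):
    """Return a list of limit violations; an empty list means the input passes."""
    issues = []

    # Any element outside the supported set (for example a transition metal).
    unsupported = sorted(set(atom_symbols) - SUPPORTED_ELEMENTS)
    if unsupported:
        issues.append("unsupported elements: " + ", ".join(unsupported))

    # Heavy atoms are all atoms other than hydrogen.
    heavy = sum(1 for symbol in atom_symbols if symbol != "H")
    if heavy >= MAX_HEAVY_ATOMS:
        issues.append(f"{heavy} heavy atoms (limit: fewer than {MAX_HEAVY_ATOMS})")

    return issues


# Ethanol passes; an iron-containing input is flagged.
ethanol_issues = check_limits(["C", "C", "O"] + ["H"] * 6)
iron_issues = check_limits(["C", "C", "Fe"])
```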

Security and Compliance

ToxTwin is not a medical device. Pre-clinical toxicological scoring falls outside the “high-risk” category (EU AI Act, Annex III). The pipeline is nonetheless designed proactively to meet high-risk requirements: an audit trail, model versioning, and uncertainty and applicability domain values exposed in every API response.

V2.4 Plan

Internal validation (strict 5-fold scaffold CV, ECE calibration, consistent routing) is complete. Robustness tests (reproducibility, SMILES invariance, SAR sensitivity, AD coverage) and external validation (frozen holdout, prospective validation, DeepAmes and CardioTox benchmarks) remain to be conducted. Ames and hERG corpus enrichment is a priority.

Key takeaways

Tox21 Mean AUC 0.867 ± 0.043 on strict 5-fold scaffold CV — 12/14 targets reached.
4 major biases discovered and corrected between V1.0 and V2.3: OGB bug, data leakage, Ames circularity, degenerate AD.
Multi-representation ensemble architecture with per-endpoint selection (Twingital Institute intellectual property).
Probabilistic calibration via out-of-fold isotonic regression — ECE < 0.05 across 14/14 endpoints.
Composite tri-signal applicability domain: structural similarity, latent proximity, regional density.
Medallion Bronze/Silver/Gold pipeline — 145,000 curated compounds from PubChem, ChEMBL and Tox21.