Architecture, V1→V2.3 trajectory, performance, applicability domain and quality plan
ToxTwin V2.3 is a predictive toxicological scoring platform for pre-Phase 1 assessment, built on an end-to-end graph neural network (GNN) pipeline. It covers 14 toxicological endpoints (12 Tox21 assays, Ames mutagenicity and hERG channel inhibition), with a pharmacological interpretation layer powered by a local LLM. The entire system runs on sovereign infrastructure with no cloud dependency. V2.3 is the first release whose metrics are validated under a strict 5-fold scaffold cross-validation (CV) protocol.
Development went through six major versions. V1.0 reported an AUC of 0.857 that proved irreproducible. A full audit in V1.3 identified four fundamental biases: a silent architecture incompatibility that masked random weights, data leakage through validation set contamination, circularity in the Ames SMARTS labels, and an applicability domain computed in a degenerate space. Once these were corrected, the true AUC was 0.594 ± 0.056.
Three corrective phases then raised the mean Tox21 AUC to 0.867: replacement of the pre-training corpus with a reference drug-like corpus (Phase 1), fusion of complementary molecular representations (Phase 2), and a per-endpoint selection mechanism (Phase 3). The fusion logic, routing mechanism and assignment tables are Twingital Institute intellectual property.
The V2.3 architecture combines multiple molecular representations — topological, attentional and substructural — through a fusion and per-endpoint selection mechanism. Detailed component specifications, dimensionalities and parameter counts are Twingital Institute intellectual property.
The inference pipeline proceeds through SMILES resolution, molecular featurisation, encoding and fusion (intellectual property), probabilistic calibration and applicability domain evaluation.
The three-layer Medallion pipeline (Bronze, Silver, Gold) integrates data from PubChem (~100,000 compounds), ChEMBL v34 (~100,000 compounds with activities), Tox21 (7,831 compounds) and NER enrichment via a local LLM (~7,300 processed profiles). Multi-phase curation produces approximately 145,000 deduplicated compounds. The training dataset uses a strict Bemis-Murcko scaffold split, with InChIKey verification that no compound appears in more than one partition.
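The scaffold split with InChIKey verification can be illustrated with a minimal sketch. It assumes that each compound already carries a precomputed Bemis-Murcko scaffold string and an InChIKey (computing them would normally require a cheminformatics toolkit, which is omitted here). The function names `scaffold_kfold` and `assert_no_overlap`, and the greedy largest-group-first assignment, are illustrative assumptions and not the ToxTwin implementation.

```python
from collections import defaultdict


def scaffold_kfold(records, n_folds=5):
    """Assign compounds to folds so that a scaffold never spans two folds.

    records: iterable of (inchikey, scaffold) pairs (hypothetical input format).
    Returns a list of n_folds lists of InChIKeys.
    """
    groups = defaultdict(set)
    for inchikey, scaffold in records:
        groups[scaffold].add(inchikey)  # the set drops exact duplicates

    folds = [[] for _ in range(n_folds)]
    # Largest scaffold groups first, each placed in the currently smallest fold.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for _scaffold, keys in ordered:
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(sorted(keys))
    return folds


def assert_no_overlap(folds):
    """Raise ValueError if any InChIKey appears in more than one fold."""
    seen = set()
    for fold in folds:
        shared = seen.intersection(fold)
        if shared:
            raise ValueError(f"InChIKeys present in several folds: {sorted(shared)}")
        seen.update(fold)


# Toy data: keys and scaffold labels are placeholders, not real compounds.
toy = [("K1", "s1"), ("K2", "s1"), ("K3", "s2"), ("K4", "s3"), ("K1", "s1")]
toy_folds = scaffold_kfold(toy, n_folds=2)
assert_no_overlap(toy_folds)
```

Grouping by scaffold before assigning folds is what keeps structurally related compounds on the same side of every train/validation boundary; the InChIKey check is a second, independent guard against duplicated records.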
The composite AD score is computed from independent components assessing structural similarity, latent space proximity and regional density. The components, weights and decision thresholds are Twingital Institute intellectual property. An out-of-domain score does not mean toxicity — it means insufficient data for a reliable prediction.
Each endpoint has its own calibrator, trained on out-of-fold predictions from the 5-fold CV protocol. Post-calibration expected calibration error (ECE) is below 0.05 for all endpoints.
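For reference, one common way to compute ECE is sketched below: predictions are grouped into equal-width probability bins, and the gap between mean predicted probability and observed positive rate is averaged with weights proportional to bin size. The choice of 10 equal-width bins is an assumption; the binning scheme used by ToxTwin is not stated.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """ECE for binary predictions with equal-width bins (assumed scheme).

    probs: predicted probabilities of the positive class, each in [0, 1].
    labels: matching 0/1 ground-truth labels.
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equally long")

    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_prob - positive_rate)
    return ece
```

With this definition, a set of predictions that all say 0.1 and contain exactly 10% positives has an ECE of essentially zero, while two predictions of 0.9 that are both negative give an ECE of 0.9.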
A local LLM transforms raw scores into pharmacologically grounded toxicological reports. The anti-hallucination mechanisms, prompt architecture and knowledge base structure are Twingital Institute intellectual property. The system produces a dual output: a human-readable structured report and machine-readable structured data.
The 12 Tox21 endpoints achieve a mean AUC of 0.867 ± 0.043. All 12 Tox21 endpoints reach their targets; Ames (0.843, target 0.87) and hERG (0.785, target 0.83) remain below target. ToxTwin outperforms AttentiveFP on SR-MMP (+0.061) and matches GROVER on NR-Aromatase (+0.005). It remains behind the specialised models DeepAmes (−0.046) and CardioTox (−0.087).
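The remaining gaps for the two endpoints that miss their targets follow directly from the figures above. The short sketch below only restates that arithmetic; the dictionary layout and function name are illustrative.

```python
# (achieved AUC, target AUC) exactly as reported in the text above.
REPORTED = {
    "Ames": (0.843, 0.87),
    "hERG": (0.785, 0.83),
}


def gap_to_target(achieved, target):
    """Signed distance to target, rounded to three decimals (negative = below)."""
    return round(achieved - target, 3)


gaps = {name: gap_to_target(*pair) for name, pair in REPORTED.items()}
# gaps -> {"Ames": -0.027, "hERG": -0.045}
```

hERG is therefore the endpoint furthest from its target, at 0.045 AUC below it, against 0.027 for Ames.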
ToxTwin V2.3 is limited to molecules with fewer than 500 heavy atoms, composed only of the elements C, H, N, O, S, P, F, Cl, Br and I. Transition metals are not supported (planned for V3.0), and the platform is not recommended for peptides longer than 5 amino acids.
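These input limits lend themselves to a simple pre-screening step. The sketch below assumes the molecule has already been parsed into a list of element symbols with explicit hydrogens (parsing SMILES is out of scope here); `check_input_limits` is a hypothetical helper and not part of the ToxTwin API. The peptide-length recommendation is not checked.

```python
SUPPORTED_ELEMENTS = frozenset({"C", "H", "N", "O", "S", "P", "F", "Cl", "Br", "I"})
MAX_HEAVY_ATOMS = 500  # "under 500" is read here as strictly fewer than 500


def check_input_limits(atom_symbols):
    """Return a list of human-readable reasons why the input is out of scope.

    atom_symbols: element symbols, one per atom (hypothetical input format).
    An empty list means the molecule passes both checks.
    """
    reasons = []

    unsupported = sorted(set(atom_symbols) - SUPPORTED_ELEMENTS)
    if unsupported:
        reasons.append("unsupported elements: " + ", ".join(unsupported))

    heavy_atoms = sum(1 for symbol in atom_symbols if symbol != "H")
    if heavy_atoms >= MAX_HEAVY_ATOMS:
        reasons.append(f"{heavy_atoms} heavy atoms (limit is under {MAX_HEAVY_ATOMS})")

    return reasons
```

Returning reasons instead of a bare boolean lets a caller report why a structure was rejected, for example an iron-containing complex versus an oversized macromolecule.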
ToxTwin is not a medical device. Pre-clinical toxicological scoring falls outside the “high-risk” category (EU AI Act, Annex III). The pipeline is nonetheless designed proactively to meet high-risk requirements: an audit trail, model versioning, and uncertainty and applicability domain exposed in every API response.
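As an illustration of what "exposed in every API response" could look like, the sketch below defines a response payload that carries the model version, a request identifier for the audit trail, and per-endpoint uncertainty and applicability-domain fields. Every class name, field name and value here is hypothetical; the actual ToxTwin response schema is not described in this document.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class EndpointPrediction:
    endpoint: str        # e.g. a Tox21 assay name, "Ames" or "hERG"
    probability: float   # calibrated probability of a positive outcome
    uncertainty: float   # hypothetical uncertainty measure
    ad_score: float      # hypothetical composite applicability-domain score
    in_domain: bool      # False means "insufficient data", not "toxic"


@dataclass(frozen=True)
class ScoringResponse:
    request_id: str      # hypothetical key linking the call to the audit trail
    model_version: str
    predictions: List[EndpointPrediction]


def to_json(response: ScoringResponse) -> str:
    """Serialise the response, including nested predictions, to JSON."""
    return json.dumps(asdict(response), sort_keys=True)


# Placeholder values for illustration only; these are not real model outputs.
example = ScoringResponse(
    request_id="req-0001",
    model_version="V2.3",
    predictions=[EndpointPrediction("hERG", 0.5, 0.1, 0.5, True)],
)
payload = to_json(example)
```

Keeping `model_version` and `request_id` at the top level means every stored response can be traced back to the exact model and call that produced it.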
Internal validation (strict 5-fold scaffold CV, ECE calibration, consistent routing) is complete. Robustness tests (reproducibility, SMILES invariance, SAR sensitivity, AD coverage) and external validation (frozen holdout, prospective validation, DeepAmes and CardioTox benchmarks) remain to be conducted. Ames and hERG corpus enrichment is a priority.