SMILES → Scoring → Interpretation workflow on an out-of-domain candidate
ToxTwin Series — Article 2/3. See also: ToxGNN-V1 Pipeline · API Tests & Guide
This document illustrates the complete ToxTwin workflow for predictive toxicological analysis of a drug candidate from its SMILES representation. It serves as a template for routine analyses and demonstrates the system’s behaviour when faced with a structurally novel compound.
SMILES (Simplified Molecular Input Line Entry System) is a standardised textual representation of chemical structure. Before submission to ToxTwin, the string’s validity is checked with RDKit (MolFromSmiles() → canonicalisation → 2D/3D coordinate generation). Complementary structure-drawing tools include ChemDraw (PerkinElmer), MarvinSketch (ChemAxon), PubChem Sketcher, and JSME or Ketcher for web applications.
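The RDKit check described above can be sketched roughly as follows. This is a minimal illustration and not ToxTwin's actual pre-processing code: the function name prepare_smiles is hypothetical, and the example uses ethanol as a placeholder SMILES because the candidate's structure is not reproduced here.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def prepare_smiles(smiles: str):
    """Validate a SMILES string; return (canonical SMILES, molecule with 3D coordinates)."""
    mol = Chem.MolFromSmiles(smiles)  # returns None if parsing or sanitisation fails
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles!r}")

    canonical = Chem.MolToSmiles(mol)  # canonical form by default
    AllChem.Compute2DCoords(mol)       # 2D depiction coordinates

    mol_3d = Chem.AddHs(mol)           # explicit hydrogens are needed for embedding
    # EmbedMolecule returns the conformer id (0) on success and -1 on failure.
    if AllChem.EmbedMolecule(mol_3d, randomSeed=42) != 0:
        raise RuntimeError("3D embedding failed")
    return canonical, mol_3d


# Placeholder input (ethanol), not the RPT-2026-001 structure.
canonical, mol_3d = prepare_smiles("OCC")
print(canonical)
```

Rejecting invalid strings before the API call keeps parsing errors separate from model errors, and submitting the canonical form means the same structure always reaches the model as the same string.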
ToxTwin can be accessed in three ways: a POST request with curl, the Swagger UI interface (http://localhost:8000/docs), or programmatic integration from Python. Candidate RPT-2026-001 is submitted with a SMILES containing a [F+] motif (cationic fluorine), a potential alkylating agent that is rare in approved drugs.
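For the programmatic route, a submission might look like the sketch below, using only the Python standard library. Only the host and port come from the Swagger UI address above. The endpoint path /predict and the JSON field names are assumptions made for illustration (the real schema is listed under /docs), and the SMILES is a placeholder rather than the candidate's structure.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # host and port taken from the Swagger UI address
PREDICT_PATH = "/predict"           # hypothetical path; check /docs for the real one


def build_request(candidate_id: str, smiles: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for one candidate."""
    payload = {"id": candidate_id, "smiles": smiles}  # hypothetical field names
    return urllib.request.Request(
        BASE_URL + PREDICT_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_request("RPT-2026-001", "CCO")  # placeholder SMILES (ethanol)

# Sending requires a running ToxTwin instance, so it is left commented out:
# with urllib.request.urlopen(req, timeout=10) as resp:
#     result = json.load(resp)
```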
The compound is not referenced in PubChem, ChEMBL or Tox21, and the Silver database holds no experimental data for it: no LD50, no organ profile, no Ames test result. This is precisely ToxTwin’s primary use case: scoring before any experimental testing.
Notable structural features: [F+] (potential alkylating agent, expected SR-p53 signal), piperidine ring (oxidative metabolism, N-oxide formation), pyridine ring (possible CYP450 inhibitor), hydroxyl group (glucuronide conjugation), amine (hERG interaction risk if lipophilic). Inference time: 582.8 ms in direct SMILES resolution.
Maximum Tanimoto similarity: 0.153 (threshold: 0.30), so the compound is out of domain. Even the closest molecule in the training set has a circular-fingerprint overlap (Morgan FP, radius 2) of only ~15%, measured as shared substructures over the combined substructures of both molecules. Causes: [F+] is under-represented in Tox21/ChEMBL, the pharmacophoric combination (piperidine + pyridine + F+) is unusual, and the compound is absent from public databases.
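The applicability-domain check reduces to a nearest-neighbour Tanimoto comparison, which can be sketched as follows. Fingerprints are represented here as plain sets of on-bit indices so the example needs no chemistry library. The function names and toy fingerprints are illustrative assumptions, and only the 0.30 threshold comes from the analysis above.

```python
AD_THRESHOLD = 0.30  # applicability-domain threshold quoted in the report


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Shared on-bits divided by the union of on-bits of both fingerprints."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def max_similarity(query: frozenset, training: list) -> float:
    """Similarity of the query to its nearest neighbour in the training set."""
    return max((tanimoto(query, fp) for fp in training), default=0.0)


def in_domain(query: frozenset, training: list, threshold: float = AD_THRESHOLD) -> bool:
    return max_similarity(query, training) >= threshold


# Toy fingerprints: 2 shared bits out of 20 distinct bits gives 0.1.
query = frozenset(range(0, 10))
training = [frozenset(range(8, 20)), frozenset({100, 101})]
print(max_similarity(query, training), in_domain(query, training))
```

Because the denominator is the union of both fingerprints, a score of 0.153 means that about 15% of the combined substructures are shared, not that 15% of the query's substructures appear in the neighbour.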
MC Dropout (20 stochastic passes) gives elevated uncertainties (> 5%) on SR-p53 (± 6.2%), SR-ARE (± 6.0%) and NR-AR (± 7.1%), which include the two most active endpoints. This is expected and coherent behaviour for an out-of-domain compound.
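One plausible way to turn the stochastic passes into a score with a ± figure is sketched below. It assumes the reported uncertainty is the sample standard deviation across passes, which the report does not state. The function name and the sample values are illustrative and are not ToxTwin outputs.

```python
from statistics import mean, stdev

UNCERTAINTY_FLAG = 0.05  # the "> 5%" level used in the report


def summarise_passes(samples: list, flag: float = UNCERTAINTY_FLAG):
    """Return (mean score, spread, flagged) for one endpoint.

    `samples` holds the predicted probability from each stochastic forward
    pass, i.e. inference run with dropout left active.
    """
    score = mean(samples)
    spread = stdev(samples)  # assumption: uncertainty = sample standard deviation
    return score, spread, spread > flag


# Illustrative values only: 20 passes alternating between two predictions.
noisy = [0.30, 0.40] * 10
stable = [0.05] * 20

print(summarise_passes(noisy))
print(summarise_passes(stable))
```

A wide spread means the dropout-perturbed sub-networks disagree about the compound, which is the behaviour expected when the input lies far from the training data.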
The two priority signals are SR-p53 at 32.6% (genotoxicity: p53 is the “guardian of the genome”, and its activation indicates potential genotoxic stress, consistent with the electrophilic [F+]) and SR-ARE at 25.5% (oxidative stress: activation of the Nrf2/KEAP1 pathway, also consistent with [F+]). Secondary signals (NR-AhR 15.4%, SR-HSE 13.1%, NR-AR 13.0%) remain in the low zone. The remaining 7 endpoints score below 10% with low uncertainties.
Priority 1: Ames test (OECD 471) for bacterial mutagenicity, Comet assay (OECD 489) for DNA strand breaks, ROS/GSH assay for oxidative stress. Priority 2: in vitro micronucleus test (chromosomal aberrations), rat LD50 (OECD 420). Priority 3: hERG inhibition by patch-clamp (potential cardiotoxicity linked to the amine and the nitrogen-containing ring), CYP450 panel (metabolic interactions).
MODERATE risk profile on two stress-response pathways (SR-p53 32.6%, SR-ARE 25.5%), consistent with the [F+] group. The compound is out of domain (Tanimoto 0.153), so the predictions are extrapolations. Before any development decision or progression towards regulatory studies, the genotoxic signals must be validated experimentally (Ames + Comet). This report is a decision-support tool and does not replace the judgement of a qualified toxicologist or mandatory regulatory studies (ICH S2, S7A, S7B).