SMILES → Scoring → Interpretation workflow on an out-of-domain candidate
ToxTwin Series — Article 2/3. See also: ToxGNN-V1 Pipeline · API Tests & Guide
This document illustrates the complete ToxTwin workflow for predictive toxicological analysis of a drug candidate from its SMILES representation. It serves as a template for routine analyses and demonstrates the system’s behaviour when faced with a structurally novel compound.
SMILES (Simplified Molecular-Input Line-Entry System) is a standardised text representation of chemical structure. Before submission to ToxTwin, each SMILES string is validated with RDKit: MolFromSmiles() parses it, followed by canonicalisation and 2D/3D structure generation. Complementary tools for drawing and editing structures include ChemDraw (PerkinElmer), MarvinSketch (ChemAxon), PubChem Sketcher and, for web applications, JSME and Ketcher.
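The validation step can be sketched as follows. This is a minimal illustration, not ToxTwin's own code: the function name canonicalise_smiles is hypothetical, the 2D/3D step is left out, and RDKit is assumed to be installed (it is imported lazily so the empty-input check works without it).

```python
from typing import Optional


def canonicalise_smiles(smiles: str) -> Optional[str]:
    """Return the RDKit canonical SMILES, or None if the input is unusable.

    Hypothetical helper for illustration; not part of the ToxTwin API.
    """
    # Reject empty or whitespace-only input before touching RDKit.
    if not smiles or not smiles.strip():
        return None

    # Imported here so that the module loads even where RDKit is missing.
    from rdkit import Chem

    # MolFromSmiles returns None when the string cannot be parsed or sanitised.
    mol = Chem.MolFromSmiles(smiles.strip())
    if mol is None:
        return None

    # MolToSmiles produces a canonical SMILES by default.
    return Chem.MolToSmiles(mol)
```

A None result means the string should be corrected in a structure editor before it is submitted for scoring.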
ToxTwin offers three access modes: a POST request with curl, the Swagger UI (http://localhost:8000/docs), or programmatic integration from Python. Candidate RPT-2026-001 is submitted as a SMILES containing an [F+] motif (cationic fluorine), a potential alkylating agent that is rare in approved drugs.
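For the programmatic mode, a request could be assembled roughly as below. Only the host and port come from this article; the /predict path and the JSON field names smiles and compound_id are assumptions made for illustration, so the actual route and schema should be taken from the Swagger UI.

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000"  # host and port as given in the article
PREDICT_PATH = "/predict"           # ASSUMPTION: hypothetical route, check /docs


def build_request(smiles: str, compound_id: str) -> request.Request:
    """Build (but do not send) a JSON POST request for one candidate."""
    # ASSUMPTION: field names are illustrative, not the documented schema.
    payload = {"smiles": smiles, "compound_id": compound_id}
    body = json.dumps(payload).encode("utf-8")
    return request.Request(
        BASE_URL + PREDICT_PATH,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(req: request.Request, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response body."""
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Keeping request construction separate from submission makes it possible to inspect or log the payload before anything is sent to the server.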
The compound is not referenced in PubChem, ChEMBL or Tox21, and the Silver database holds no experimental data for it: no LD50, no organ profile, no Ames test. This is precisely ToxTwin’s primary use case: scoring before any experimental testing.
Notable structural features: [F+] (potential alkylating agent, expected SR-p53 signal), piperidine ring (oxidative metabolism, N-oxide formation), pyridine ring (possible CYP450 inhibition), hydroxyl group (glucuronide conjugation), amine (hERG interaction risk if lipophilic). Inference time: 582.8 ms with direct SMILES resolution.
Maximum Tanimoto similarity to the training set: 0.153, below the 0.30 threshold, so the compound is out of domain. On Morgan fingerprints (radius 2), the circular substructures it shares with its closest training-set neighbour make up only about 15% of the two molecules' combined substructures. Causes: [F+] is under-represented in Tox21/ChEMBL, the pharmacophoric combination is unusual (piperidine + pyridine + F+), and the compound is absent from public databases.
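The domain check reduces to a nearest-neighbour Tanimoto comparison. The sketch below works on plain sets of fingerprint feature identifiers so that it runs without any chemistry library; in practice the sets would hold the on-bits of Morgan fingerprints (radius 2). The function names are hypothetical, and treating a score exactly equal to 0.30 as in-domain is an assumption.

```python
from typing import Iterable, Set

OOD_THRESHOLD = 0.30  # threshold quoted in the article


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient: shared features divided by combined features."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints: define similarity as zero
    return len(a & b) / union


def max_similarity(query: Set[int], training: Iterable[Set[int]]) -> float:
    """Similarity of the query to its closest training-set fingerprint."""
    return max((tanimoto(query, fp) for fp in training), default=0.0)


def is_out_of_domain(
    query: Set[int],
    training: Iterable[Set[int]],
    threshold: float = OOD_THRESHOLD,
) -> bool:
    """ASSUMPTION: strictly below the threshold counts as out of domain."""
    return max_similarity(query, training) < threshold
```

With this definition, a maximum similarity of 0.153 against a 0.30 threshold flags the candidate as out of domain, as reported above.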
MC Dropout (20 stochastic forward passes) yields elevated uncertainties (> 5%) on the most active endpoints: SR-p53 (± 6.2%), SR-ARE (± 6.0%) and NR-AR (± 7.1%). This is the expected, coherent behaviour for an out-of-domain compound.
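The aggregation behind these figures can be illustrated as below. The sketch assumes each stochastic pass returns one probability per endpoint and that the reported ± value is the sample standard deviation across passes; the article does not state which dispersion measure ToxTwin uses, and the function names are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple

UNCERTAINTY_FLAG = 0.05  # the "> 5%" level quoted in the article


def summarise_passes(
    passes: List[Dict[str, float]],
) -> Dict[str, Tuple[float, float]]:
    """Return {endpoint: (mean score, standard deviation)} over all passes.

    ASSUMPTION: uncertainty is the sample standard deviation across passes.
    """
    if len(passes) < 2:
        raise ValueError("at least two stochastic passes are required")
    endpoints = passes[0].keys()
    return {
        ep: (mean(p[ep] for p in passes), stdev(p[ep] for p in passes))
        for ep in endpoints
    }


def flag_uncertain(
    summary: Dict[str, Tuple[float, float]],
    level: float = UNCERTAINTY_FLAG,
) -> List[str]:
    """List the endpoints whose uncertainty exceeds the given level."""
    return sorted(ep for ep, (_, sd) in summary.items() if sd > level)
```

In the real workflow the list of passes would hold 20 entries, one per forward pass with dropout left active at inference time.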
The two priority signals are SR-p53 at 32.6% and SR-ARE at 25.5%. SR-p53 is a genotoxicity signal: p53 is the guardian of the genome, and its activation indicates potential genotoxic stress, consistent with the electrophilic [F+]. SR-ARE is an oxidative stress signal: it reflects activation of the Nrf2/KEAP1 pathway, also consistent with [F+]. Secondary signals (NR-AhR 15.4%, SR-HSE 13.1%, NR-AR 13.0%) remain in the low zone. The remaining 7 endpoints score below 10% with low uncertainties.
Priority 1: Ames test (OECD 471) for bacterial mutagenicity, Comet assay (OECD 489) for DNA strand breaks, ROS/GSH assay for oxidative stress. Priority 2: in vitro micronucleus test (chromosomal aberrations), rat LD50 (OECD 420). Priority 3: hERG inhibition by patch-clamp (potential cardiotoxicity linked to the amine and the nitrogen-containing ring), CYP450 panel (metabolic interactions).
MODERATE risk profile on two stress-response pathways (SR-p53 32.6%, genotoxicity; SR-ARE 25.5%, oxidative stress), consistent with the [F+] group. The compound is out of domain (Tanimoto 0.153), so the predictions are extrapolations. Progression towards regulatory studies requires experimental validation of the genotoxic signals (Ames + Comet) before any development decision. This report is a decision support tool and does not substitute for qualified toxicologist judgement or mandatory regulatory studies (ICH S2, S7A, S7B).