High-Fidelity Synthetic Patient Generation and Validation for NSCLC Using TVAE and OCTOPUS Clinical Data

Authors: Abir Tadmouri (Pierre Fabre) · Salma Barkaoui, Head of ML/AI (Qualees) · Mohammed Bennani (Qualees) · Jérôme Vetillard, VP R&D & CPO (Qualees) · Hadhami Mejbri, Data Scientist (Qualees)

Background and objectives

High-fidelity synthetic health data can strengthen disease prediction models and enable data sharing while protecting patient privacy. In oncology research, data scarcity and patient heterogeneity pose major challenges, particularly for rare disease subtypes such as BRAF V600E-mutated NSCLC. This study aimed to increase statistical power for evaluating oncology treatment efficacy by augmenting the real-world OCTOPUS dataset (~200 patients) with high-quality synthetic data while preserving clinical and statistical integrity.

OCTOPUS data preprocessing and integration

Data integration narrows 59 SAS datasets and 37 SDTM domains down to 11 selected domains. USUBJID is identified as the unique patient ID, and the DM (Demographics) table serves as the central hub. The key domains for survival modelling are integrated (DM, CM, MI, DD, SU, PR, TU, VS, AE + SUPP tables). Treatment flow is mapped per patient: systemic → first-line → second-line → subsequent lines. Patients are classified into four outcome groups (completed treatment, death, physician discontinuation, other interruptions). A hybrid rule-based + MICE imputation strategy showed the best longitudinal coherence compared with KNN and SoftImpute. Final dataset: 156 rows × 190 columns.
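
The hub-and-spoke integration around DM can be illustrated with a minimal pandas sketch. It is not the study's ETL: the three toy tables, the derived column names (AE_COUNT, WEIGHT_MEAN) and the aggregation choices are assumptions made for illustration, and only USUBJID and the domain names come from the description above.

```python
import pandas as pd

# Toy stand-ins for three SDTM domains (illustrative values only).
dm = pd.DataFrame({
    "USUBJID": ["P001", "P002", "P003"],
    "AGE": [64, 71, 58],
    "SEX": ["F", "M", "F"],
})
ae = pd.DataFrame({
    "USUBJID": ["P001", "P001", "P003"],
    "AETERM": ["NAUSEA", "FATIGUE", "RASH"],
})
vs = pd.DataFrame({
    "USUBJID": ["P001", "P002", "P002"],
    "VSTESTCD": ["WEIGHT", "WEIGHT", "WEIGHT"],
    "VSSTRESN": [70.0, 82.0, 80.0],
})

# Collapse each one-to-many domain to one row per patient before joining,
# so the merged table keeps exactly one row per USUBJID.
ae_features = ae.groupby("USUBJID").size().rename("AE_COUNT").reset_index()
vs_features = (
    vs[vs["VSTESTCD"] == "WEIGHT"]
    .groupby("USUBJID")["VSSTRESN"].mean()
    .rename("WEIGHT_MEAN")
    .reset_index()
)

# DM is the hub: left joins keep every patient, including those
# with no records in a given domain.
patients = dm
for features in (ae_features, vs_features):
    patients = patients.merge(
        features, on="USUBJID", how="left", validate="one_to_one"
    )

# Hypothetical rule-based fill: no AE rows is read as zero recorded events.
patients["AE_COUNT"] = patients["AE_COUNT"].fillna(0).astype(int)
```

In this toy example the missing WEIGHT_MEAN for patient P003 is deliberately left empty: gaps that no clinical rule can resolve are the ones a model-based step such as MICE would handle.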

TVAE method and synthetic generation

TVAE (Tabular Variational AutoEncoder) uses a variational autoencoder architecture to learn the underlying distribution of input features and generate new synthetic samples. In this study, TVAE is enhanced by optimising its loss function to better preserve feature relationships. Generation produces 500 synthetic patients × 190 columns from the 156 preprocessed real patients.
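
For reference, the standard objective that a variational autoencoder minimises is the negative evidence lower bound shown below. This is the generic textbook form, not the optimised loss used in the study, whose modification is not detailed here. The symbols are conventional notation rather than the authors': x is one patient row, z its latent code, q_φ the encoder, p_θ the decoder, and p(z) the prior, usually a standard normal.

```latex
\mathcal{L}(\theta, \phi; x) =
  -\,\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  + D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
```

The first term rewards accurate reconstruction of each row, and the second keeps the latent codes close to the prior. Once trained, a new synthetic patient is obtained by drawing z from p(z) and passing it through the decoder.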

Validation pipeline and results

Multi-tier validation covers the Kolmogorov-Smirnov test (0.604) for distributional similarity of continuous variables, the Chi-Square test for categorical variables, entropy comparison for population complexity and variability, ANOVA, mutual information for linear and non-linear dependencies, correlation preservation (0.935), and KL divergence (0.721). ML Utility (Train on Synthetic, Test on Real, or TSTR) reaches 0.907 and is designated the most important metric. The overall practical quality score is 0.802, with a geometric similarity of 0.844.
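
Three of these checks can be sketched on toy data as follows. The exact score definitions behind the reported figures are not given here, so the formulas below (one minus the KS statistic, one minus a scaled correlation-matrix gap, and accuracy for TSTR) are common conventions assumed for illustration. The nearest-centroid classifier is a deliberately simple stand-in for whatever predictive model was actually used, and the function names are hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

def make_cohort(n, shift=0.0):
    """Toy cohort: two correlated continuous features and a binary outcome."""
    x = rng.normal(loc=shift, scale=1.0, size=(n, 2))
    x[:, 1] += 0.8 * x[:, 0]  # induce a correlation between the features
    y = (x[:, 0] + 0.3 * rng.normal(size=n) > shift).astype(int)
    return x, y

x_real, y_real = make_cohort(156)
x_syn, y_syn = make_cohort(500, shift=0.1)  # slightly off, like imperfect synthetic data

def ks_similarity(real, syn):
    """Mean of (1 - KS statistic) over columns; 1.0 means identical empirical CDFs."""
    stats = [ks_2samp(real[:, j], syn[:, j]).statistic for j in range(real.shape[1])]
    return 1.0 - float(np.mean(stats))

def correlation_preservation(real, syn):
    """1 - mean absolute gap between off-diagonal Pearson correlations, scaled to [0, 1]."""
    upper = np.triu_indices(real.shape[1], k=1)
    gap = np.abs(np.corrcoef(real, rowvar=False) - np.corrcoef(syn, rowvar=False))[upper]
    return 1.0 - float(gap.mean()) / 2.0  # a correlation gap is at most 2

def tstr_accuracy(x_train, y_train, x_test, y_test):
    """Train on synthetic, test on real, with a nearest-centroid stand-in model."""
    centroids = np.stack([x_train[y_train == c].mean(axis=0) for c in (0, 1)])
    dists = np.linalg.norm(x_test[:, None, :] - centroids[None, :, :], axis=2)
    return float((dists.argmin(axis=1) == y_test).mean())

scores = {
    "ks_similarity": ks_similarity(x_real, x_syn),
    "correlation_preservation": correlation_preservation(x_real, x_syn),
    "tstr_accuracy": tstr_accuracy(x_syn, y_syn, x_real, y_real),
}
```

The key design point in TSTR is that the model never sees real patients during training: a high score on the real test set therefore indicates that the synthetic cohort carries the same predictive signal as the original data.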

Conclusion and perspectives

This poster was selected in the Top 5% of contributions submitted to ISPOR Europe 2025. The synthetic OCTOPUS cohort of 500 patients successfully preserves complex clinical relationships and predictive signals essential for downstream research applications. This scalable framework opens promising avenues for patient-specific survival risk prediction and personalised treatment selection based on individual patient trajectories, advancing precision oncology research with a high-fidelity dataset ready for production use.

Key takeaways

Poster selected in the Top 5% of ISPOR Europe 2025 contributions.
Optimised TVAE (Tabular Variational AutoEncoder) pipeline generating 500 synthetic patients from 156 real OCTOPUS patients (BRAF V600E NSCLC), 190 columns.
Overall practical quality score of 0.802, with ML Utility 0.907 (most important metric), correlation preservation 0.935 and geometric similarity 0.844.
Structured ETL from 59 SAS datasets / 37 SDTM domains to 11 selected domains (DM, CM, MI, DD, SU, PR, TU, VS, AE + SUPP tables).
Hybrid rule-based + MICE imputation strategy (best longitudinal coherence vs KNN and SoftImpute). Per-patient treatment flow mapping (lines 1→n).
Multi-tier validation: KS test (0.604), Chi-Square, entropy, ANOVA, mutual information, correlation preservation, KL divergence (0.721).
The framework opens avenues for patient-specific survival risk prediction and personalised treatment selection in precision oncology.