Minkey Chang

Data Scientist / AI Engineer

Creating and Evaluating Synthetic Tabular Data

Synthetic tabular data mimics real tables without exposing real rows—handy for privacy, testing, or training when you can’t share the original. The catch is making sure the fake data is both useful and safe. One approach that scales without a GPU is sequential synthesis (one column at a time); then you need to check utility and re-identification risk. Here’s a minimal setup and three checks that work.


Sequential synthesis: one column at a time

Sequential synthesis works well when you don’t have a GPU or need to generate large tables: build the data one column at a time, with each new column modeled and sampled conditional on the columns you have already generated. For each column you pick a simple model that fits its type (e.g. a regression for numeric columns, a multinomial model for categorical ones), fit it on the real data, then sample from it. You choose the column order so that dependencies make sense (e.g. age before income). Everything runs on CPU and you can inspect each step, so the process is interpretable. The downside is that it can miss very complex joint patterns that deep generators might capture, but for many use cases it’s enough, and it scales.
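
To make the column-by-column idea concrete, here is a minimal sketch in plain Python. It is not how synthpop or synthcity implement it. The toy data, the two-column order (age, then income), the bootstrap for the first column, and the helper names `fit_linear` and `synthesize` are all assumptions made for illustration.

```python
import random

def fit_linear(x, y):
    """Least-squares fit of y = a + b * x; returns (a, b, residuals)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    return a, b, residuals

def synthesize(real_age, real_income, n, rng):
    """Generate n synthetic (age, income) pairs, one column at a time."""
    # Column 1 has no predecessors: resample the real marginal distribution.
    syn_age = [rng.choice(real_age) for _ in range(n)]
    # Column 2 is modeled given column 1: fit on the real data, then sample
    # as prediction plus a residual drawn from the real fit.
    a, b, residuals = fit_linear(real_age, real_income)
    syn_income = [a + b * age + rng.choice(residuals) for age in syn_age]
    return syn_age, syn_income

rng = random.Random(0)
# Toy "real" table (made up for this example): income depends on age.
real_age = [rng.randint(20, 65) for _ in range(500)]
real_income = [20000 + 800 * age + rng.gauss(0, 5000) for age in real_age]

syn_age, syn_income = synthesize(real_age, real_income, 500, rng)
```

Each later column would follow the same pattern: fit a model on the real data using the earlier columns as predictors, then sample conditional on the synthetic values generated so far.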

Once you have synthetic data, you need to ask: does it preserve utility (can you still do useful analysis?) and limit privacy risk (can someone re-identify people)? Below are three checks that work in practice.


Check 1: Can you tell real from synthetic? (Propensity score matching)

If a classifier can easily tell “real” from “synthetic” rows using the features alone, the synthetic data is too different from the real data to support the same analyses. Propensity score matching measures this: pool the real and synthetic rows, train a classifier to predict which is which, and look at the predicted probabilities (the propensity scores). If the scores for real and synthetic rows are well separated, the synthetic data is too distinguishable. You want the two score distributions to overlap, with most scores close to the share of synthetic rows in the pooled data (0.5 when both sets are the same size).
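
The check above can be sketched with a small hand-rolled logistic regression, so the example stays dependency-free. In practice you would likely use a library classifier. Summarizing the scores as a propensity mean squared error (pMSE, the mean squared distance of the scores from the synthetic share) is one common choice and an assumption here, as are the toy data and all function names.

```python
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def standardize(rows):
    """Scale every column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5 or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Logistic regression by batch gradient descent; w[0] is the bias."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def pmse(real_rows, syn_rows):
    """Pool the rows, label real=0 / synthetic=1, score them, summarize."""
    X = standardize(real_rows + syn_rows)
    y = [0] * len(real_rows) + [1] * len(syn_rows)
    w = fit_logistic(X, y)
    scores = [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) for xi in X]
    share = len(syn_rows) / len(X)  # expected score if indistinguishable
    return sum((p - share) ** 2 for p in scores) / len(scores)

rng = random.Random(0)
# Toy data (made up): one faithful synthetic set, one with a shifted column.
real = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
good_syn = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
bad_syn = [[rng.gauss(2, 1), rng.gauss(0, 1)] for _ in range(200)]

pmse_good = pmse(real, good_syn)  # close to 0: scores cluster near 0.5
pmse_bad = pmse(real, bad_syn)    # clearly larger: scores separate
```

A pMSE near zero means the classifier cannot do better than guessing. The maximum with equal-sized sets is 0.25, reached when every row is classified with certainty.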

Figure: Propensity score matching, real vs synthetic score distributions.

Check 2: Do key statistics match? (CI overlap)

You usually want the statistics that matter for your analysis (means, proportions, regression coefficients) to come out similar in the synthetic data. CI overlap compares confidence intervals for a chosen statistic: estimate it, together with its interval, once from the real data and once from the synthetic data. If the two intervals largely overlap, the synthetic data supports similar conclusions; if they barely overlap or are disjoint, it’s distorting that statistic. This is about utility for analysis, not row-by-row similarity.
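
As a rough sketch of the comparison above: the overlap measure below averages the shared length relative to each interval's own width, which is one common definition and an assumption here, since the text does not fix a formula. The 95% normal-approximation interval for a mean, the toy data, and the function names are also illustrative.

```python
import math
import random

def mean_ci(values, z=1.96):
    """Approximate 95% confidence interval for the mean (normal approximation)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    half_width = z * math.sqrt(var / n)
    return mean - half_width, mean + half_width

def ci_overlap(ci_real, ci_syn):
    """Average of the shared length relative to each interval's width.

    1.0 means identical intervals; values below 0 mean the intervals are
    disjoint, and more negative means further apart.
    """
    shared = min(ci_real[1], ci_syn[1]) - max(ci_real[0], ci_syn[0])
    return 0.5 * (shared / (ci_real[1] - ci_real[0])
                  + shared / (ci_syn[1] - ci_syn[0]))

rng = random.Random(0)
# Toy columns (made up): synthetic incomes drawn from the same distribution.
real_income = [rng.gauss(50000, 10000) for _ in range(1000)]
syn_income = [rng.gauss(50000, 10000) for _ in range(1000)]

overlap = ci_overlap(mean_ci(real_income), mean_ci(syn_income))
```

The same function works for any statistic with an interval, for example a regression coefficient: compute the interval on each dataset and pass both in.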

Figure: Confidence interval overlap for a statistic (do key statistics match?).

Check 3: Re-identification risk (Quasi-identifiers)

Even without names or IDs, combinations of columns (e.g. age + ZIP + gender) can identify people. Quasi-identifier checks ask: do too many synthetic rows look like specific real individuals on those combinations? Keeping that risk low matters when you release or share synthetic data.
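
One simple way to put a number on this, sketched below, is to count how many synthetic rows share their quasi-identifier combination with exactly one real person. This is only a proxy for re-identification risk and not a standard the text prescribes. The column names, the toy rows, and the helper names `qi_key`, `unique_match_rate`, and `k_anonymity` are assumptions.

```python
from collections import Counter

def qi_key(row, qi_cols):
    """The row's combination of quasi-identifier values, as a hashable tuple."""
    return tuple(row[c] for c in qi_cols)

def unique_match_rate(real_rows, syn_rows, qi_cols):
    """Share of synthetic rows whose QI combination matches exactly one real row."""
    real_counts = Counter(qi_key(r, qi_cols) for r in real_rows)
    hits = sum(1 for s in syn_rows if real_counts.get(qi_key(s, qi_cols), 0) == 1)
    return hits / len(syn_rows)

def k_anonymity(rows, qi_cols):
    """Size of the smallest group of rows sharing a QI combination."""
    return min(Counter(qi_key(r, qi_cols) for r in rows).values())

QI_COLS = ["age", "zip", "gender"]

# Toy tables (made up for this example).
real_rows = [
    {"age": 34, "zip": "10001", "gender": "F"},
    {"age": 34, "zip": "10001", "gender": "F"},
    {"age": 51, "zip": "10002", "gender": "M"},
    {"age": 29, "zip": "10003", "gender": "F"},
]
syn_rows = [
    {"age": 34, "zip": "10001", "gender": "F"},  # matches two real rows
    {"age": 51, "zip": "10002", "gender": "M"},  # matches one real row
    {"age": 40, "zip": "10004", "gender": "M"},  # matches no real row
    {"age": 29, "zip": "10003", "gender": "F"},  # matches one real row
]

risk = unique_match_rate(real_rows, syn_rows, QI_COLS)  # 2 of 4 rows -> 0.5
k_real = k_anonymity(real_rows, QI_COLS)                # smallest group -> 1
```

A high match rate does not prove that anyone can be re-identified, but it flags combinations worth coarsening (for example, age bands instead of exact age) before release.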

Figure: Quasi-identifier risk and re-identification.

Summary

Sequential synthesis is a practical, CPU-only option when you need interpretable generation or don’t have a GPU. To evaluate what you generate, use propensity score matching (can you tell real from synthetic?), CI overlap (do key statistics match?), and quasi-identifier metrics (how high is the re-identification risk?). In practice, the synthpop package in R and syn_seq in synthcity in Python are enough to get started.


References

  1. Synthpop — Nowok, B., Raab, G. M., & Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1–26. doi:10.18637/jss.v074.i11.

  2. TabSyn — Zhang, H., Zhang, J., Srinivasan, B., et al. (2024). Mixed-type tabular data synthesis with score-based diffusion in latent space. ICLR. arXiv:2310.09656.

  3. CTGAN / TVAE — Xu, L., Skoularidou, M., Cuesta-Infante, A., & Veeramachaneni, K. (2019). Modeling tabular data using conditional GAN. NeurIPS, 32. arXiv:1907.00503.

  4. Synthetic data overview — Jordon, J., Szpruch, L., Houssiau, F., et al. (2022). Synthetic data — what, why and how? arXiv preprint. arXiv:2205.03257.

  5. Synthcity — Qian, Z., Cebere, B.-C., & van der Schaar, M. (2023). Synthcity: facilitating innovative use cases of synthetic data. arXiv preprint. arXiv:2301.07573.
