Tuesday, May 13, 2025, 11:45, 4A301

Nikola Simidjievski

Synthesis & Augmentation of Tabular Data in the Age of Foundation Models

Foundation models – large, performant pre-trained models – have shown remarkable success in applications that predominantly focus on vision, language, and sound data. In contrast, tabular data – one of the most prevalent data modalities in many critical domains of business, science, and healthcare – has seen limited benefit from these advances. Tabular data poses unique challenges relating to heterogeneity, dimensionality, and scarcity, as well as a lack of explicit symmetries, implicit structure, and incomplete prior knowledge, all of which limit how we construct, train, and apply or transfer large models to tabular data.

Data synthesis is one remedy for some of these challenges: it can improve model performance in data-scarce but critical applications, and it can also serve as a data augmentation mechanism for training more robust models. Although previous research has sought to transfer the successes of generative modeling from homogeneous modalities to tabular ones, defining an effective generator for tabular data remains an open problem. In this talk, I will present several novel data-centric approaches to data synthesis for tabular data. Our key innovation is transforming recent pre-trained tabular classifiers into data generators, leveraging the information they have learned in both the input and manifold spaces. These methods are fast, require no additional training, and can be paired with any downstream predictive model. They consistently improve performance, especially on small datasets where training well-performing models is hard. These results also uncover several properties and benefits that can inform how we design robust and performant general-purpose tabular foundation models.
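As a rough illustration of the general idea of repurposing a trained classifier as a data generator, the sketch below synthesizes new tabular samples by interpolating between real rows in input space and then labels them with the classifier's predictions. Everything here is an assumption for illustration, not the speaker's method: the `synthesize` helper, the SMOTE-style interpolation, and the `RandomForestClassifier` standing in for a pre-trained tabular model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def synthesize(X, clf, n_new, rng):
    """Generate n_new synthetic rows by convex interpolation between
    random pairs of real rows, labeled by the classifier's predictions.
    (Hypothetical helper for illustration only.)"""
    i = rng.integers(0, len(X), n_new)          # first endpoint of each pair
    j = rng.integers(0, len(X), n_new)          # second endpoint of each pair
    lam = rng.random((n_new, 1))                # per-row mixing coefficient
    X_new = lam * X[i] + (1.0 - lam) * X[j]     # interpolate in input space
    y_new = clf.predict(X_new)                  # classifier supplies the labels
    return X_new, y_new


rng = np.random.default_rng(0)
X, y = make_classification(n_samples=100, n_features=8, random_state=0)

# Stand-in for a pre-trained tabular classifier; no extra training is
# needed at synthesis time, matching the "no additional training" claim.
clf = RandomForestClassifier(random_state=0).fit(X, y)

X_aug, y_aug = synthesize(X, clf, n_new=50, rng=rng)
print(X_aug.shape, y_aug.shape)  # (50, 8) (50,)
```

The synthetic rows can simply be concatenated with the originals to augment the training set of any downstream predictive model, which is one way such a generator stays model-agnostic.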