Leave VAEs and GANs behind: LLMs are all you need for tabular data generation!
We introduce a new method, GReaT (Generation of Realistic Tabular data), with state-of-the-art generative abilities (see below). How did we do it? ↓ (1/n) #tabulardata
(2/n) Tabular data frequently consists of categorical and numerical features. Furthermore, categorical values and feature names are typically words. Thus, it is possible to represent a tabular data sample as a meaningful sentence, e.g., "Age is 42, Education is HS-grad, ..."
(3/n) ... where each feature name and its value are used together. After this step, we can fine-tune a pre-trained large language model (LLM) on the obtained sentences and use it to synthesize new data samples!
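To make the encoding concrete, here is a minimal sketch in Python (row_to_sentence is a hypothetical helper, not the package's internal function). GReaT additionally permutes the feature order, so the model does not rely on a fixed column ordering:

import random

def row_to_sentence(row: dict) -> str:
    # Shuffle the features so no fixed column order is learned
    items = list(row.items())
    random.shuffle(items)
    # Produces e.g. "Age is 42, Education is HS-grad, ..."
    return ", ".join(f"{name} is {value}" for name, value in items)

print(row_to_sentence({"Age": 42, "Education": "HS-grad", "Occupation": "Sales"}))
# e.g. "Education is HS-grad, Age is 42, Occupation is Sales"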
(4/n) While most methods for tabular data generation are based on VAEs or GANs, our approach utilizes the pre-training power of Transformer-based language models. We also demonstrate that this pre-training benefits the generative quality of the synthetic data.
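For illustration, the fine-tuning step can be sketched with the Hugging Face transformers library. This is a simplified stand-in, not GReaT's actual training code; the model choice and hyperparameters here are assumptions:

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Serialized rows from the previous step
sentences = ["Age is 42, Education is HS-grad",
             "Age is 31, Education is Bachelors"]

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token

dataset = Dataset.from_dict({"text": sentences}).map(
    lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

model = AutoModelForCausalLM.from_pretrained("distilgpt2")  # pre-trained LLM
Trainer(
    model=model,
    args=TrainingArguments(output_dir="great-finetuned", num_train_epochs=3),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()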
(5/n) Try GReaT today! We have developed an easy-to-use Python package, be-great, which can be installed with pip:
> pip install be-great
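A minimal usage sketch, based on the package's README (the exact API may differ across versions):

from be_great import GReaT
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing(as_frame=True).frame  # a pandas DataFrame

model = GReaT(llm="distilgpt2", epochs=50)    # fine-tune a small GPT-2
model.fit(data)                               # train on the real table
synthetic_data = model.sample(n_samples=100)  # generate new rows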
(7/n) I would like to thank all contributors: @t_leemann, @MartinPawelczyk, @Gjergji_. Special thanks go to Kathrin Seßler; this project wouldn't exist without her.
(8/n) Lastly, if you want to learn more about tabular data and deep neural networks, definitely check out our survey: arxiv.org/abs/2110.01889