Leave VAEs and GANs behind: LLMs are all you need for tabular data generation!
We introduce a new method, GReaT (Generation of Realistic Tabular data), with state-of-the-art generative abilities (see below). How did we do it? ↓ (1/n)
#tabulardata
(2/n) Tabular data frequently consists of categorical and numerical features. Furthermore, categorical values and feature names are typically words. Thus, it is possible to represent a tabular data sample as a meaningful sentence, e.g., "Age is 42, Education is HS-grad, ..."
(3/n) ... where the feature name and value appear together. After this step, we can fine-tune a pre-trained large language model (LLM) on the obtained sentences and use the LLM to synthesise new data samples!
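To make the encoding concrete, here is a minimal sketch in plain pandas. The helper row_to_sentence is purely illustrative (made up for this thread), not part of the released package:

import pandas as pd

def row_to_sentence(row: pd.Series) -> str:
    # Encode each (feature name, value) pair as "<name> is <value>"
    # and join the pairs into one sentence per table row.
    return ", ".join(f"{col} is {val}" for col, val in row.items())

df = pd.DataFrame({"Age": [42, 31], "Education": ["HS-grad", "Bachelors"]})
sentences = [row_to_sentence(df.iloc[i]) for i in range(len(df))]
print(sentences[0])  # Age is 42, Education is HS-grad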
(4/n) While most methods for tabular data generation are based on VAEs or GANs, our approach utilizes the pre-training power of Transformer-based language models. We also demonstrate that pre-training benefits the generative quality of the synthetic data.
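For illustration only (this is not our exact training code), fine-tuning an off-the-shelf causal LM such as distilgpt2 on the encoded sentences could look roughly like this with the Hugging Face transformers and datasets libraries; the hyperparameters here are placeholders:

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

sentences = ["Age is 42, Education is HS-grad",
             "Age is 31, Education is Bachelors"]

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Tokenize the encoded rows into a causal-LM training set
ds = Dataset.from_dict({"text": sentences}).map(
    lambda ex: tokenizer(ex["text"], truncation=True), remove_columns=["text"]
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="great-demo", num_train_epochs=3),  # placeholder settings
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM, no masking
)
trainer.train()

New samples are then drawn with model.generate and parsed back into table rows.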
(5/n) Try GReaT today! We have developed an easy-to-use Python package, be-great, which can be installed with pip:
> pip install be-great
Code is available on GitHub: github.com/kathrinse/be_g…
Here is an example of training and sampling on the California housing dataset:
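A sketch of what that looks like (based on the package's documented interface; parameter names such as llm, epochs, and n_samples are taken from the README and may change between versions):

from be_great import GReaT
from sklearn.datasets import fetch_california_housing

# Load the California housing data as a pandas DataFrame
data = fetch_california_housing(as_frame=True).frame

# Fine-tune a pre-trained LLM (here: distilgpt2) on the encoded rows
model = GReaT(llm="distilgpt2", epochs=50)
model.fit(data)

# Sample new synthetic rows; the result is again a DataFrame
synthetic_data = model.sample(n_samples=100)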
(6/n) For more details please refer to our preprint:
arxiv.org/abs/2210.06280
(7/n) I would like to thank all contributors: @t_leemann, @MartinPawelczyk, @Gjergji_. Special thanks to Kathrin Seßler; this project wouldn't exist without her.
(8/n) Lastly, if you want to know more about tabular data and deep neural networks, you should definitely check out our survey: arxiv.org/abs/2110.01889