By Jaewon Kim in paper — Jul 28, 2024

[Paper Review] CTAB-GAN: Effective Table Data Synthesizing

This post reviews the paper on CTAB-GAN, a probabilistic generative model for synthesizing tabular data.

The arXiv link to the paper is below.

CTAB-GAN: Effective Table Data Synthesizing

While data sharing is crucial for knowledge development, privacy concerns and strict regulation (e.g., European General Data Protection Regulation (GDPR)) unfortunately limit its full effectiveness. Synthetic tabular data emerges as an alternative to enable data sharing while fulfilling regulatory and privacy constraints. The state-of-the-art tabular data synthesizers draw methodologies from generative Adversarial Networks (GAN) and address two main data types in the industry, i.e., continuous and categorical. In this paper, we develop CTAB-GAN, a novel conditional table GAN architecture that can effectively model diverse data types, including a mix of continuous and categorical variables. Moreover, we address data imbalance and long-tail issues, i.e., certain variables have drastic frequency differences across large values. To achieve those aims, we first introduce the information loss and classification loss to the conditional GAN. Secondly, we design a novel conditional vector, which efficiently encodes the mixed data type and skewed distribution of data variable. We extensively evaluate CTAB-GAN with the state of the art GANs that generate synthetic tables, in terms of data similarity and analysis utility. The results on five datasets show that the synthetic data of CTAB-GAN remarkably resembles the real data for all three types of variables and results into higher accuracy for five machine learning algorithms, by up to 17%.

arXiv.org · Zilong Zhao

The authors' code is available at the GitHub link below.

0. Abstract

Data sharing is crucial for knowledge development, yet privacy concerns and strict regulation limit its full effectiveness.

Synthetic tabular data, that is, artificially generated tabular data, is proposed as an alternative that enables data sharing while still satisfying privacy constraints.

The authors propose CTAB-GAN, a conditional table GAN architecture, to achieve the following.

💡

Problem

Model diverse data types: handle a mixture of continuous and discrete (categorical) variables.

Address data imbalance and long-tail issues: some variables show drastic frequency differences across a large range of values.

💡

Solution: CTAB - GAN

To solve the problems above, the authors propose the following two ideas.

1) Information Loss and Classification Loss to the conditional GAN

2) Novel Conditional Vector (encodes mixed data types and the skewed distribution of data variables)

💡

Evaluation

CTAB-GAN is evaluated against the state-of-the-art GANs that generate synthetic tables, along two axes.

Data similarity
Analysis utility

On five datasets, the synthetic data produced by CTAB-GAN closely resembles the real data for all three variable types, and it raises the accuracy of five machine learning algorithms by up to 17%.

3. CTAB - GAN

💡

Problems

CTAB-GAN is a tabular data generator designed to address the problems stated in Section 1.1 of the paper.

Those problems are as follows:

1) Mixed data type variables

A variable (column) does not always consist of a single data type (continuous or categorical); it can also be a mixture of the two.

It may also contain missing values.

For example, consider a column that holds a categorical value (0) together with continuous values (any positive value).

Existing SOTA models ignore the meaning of the categorical value (0) and treat such a mixed-type variable as continuous when generating tabular data.

As a result, they may generate values near 0 rather than exactly 0, and they may even generate negative values that are meaningless in practice.

2) Long tail distributions

Real-world data often has a long-tail distribution: most occurrences lie near the initial value of the distribution, with rare cases toward the end.

In real data, as much as 99% of the values can sit at the start of the range, yet synthetic data generators tend not to reproduce this shape.

3) Skewed multi-mode continuous variable

The term multi-mode comes from VGM (Variational Gaussian Mixture).

SOTA models other than CTGAN model continuous variables with a plain Gaussian distribution, but real distributions have more than one peak, and the peaks differ in height.

To model the real distribution accurately, the authors propose CTAB-GAN, which uses VGM while preventing the loss of some modes of the original distribution that occurs in CTGAN.

💡

Solutions

To solve the problems above, the authors propose the following solutions.

1) Mixed - type encoder

A mixed-type encoder is used to represent categorical-continuous mixed variables, as well as missing values, more faithfully.

2) CGAN with classification and information loss

Classification loss and information loss are added to the conditional table GAN architecture so that minority classes are handled effectively; they improve semantic integrity and training stability, respectively.

In total, four loss terms are used.

1) Discriminator Update: Original GAN Loss (Real & Fake)

2) Generator Update:

2-1) Original GAN Loss + Conditional Loss (Conditional Vector)

2-2) Information Loss ($\mu, \sigma$)

3) Classifier & Generator Update:

3-1) Classification Loss (Real): Update Classifier
3-2) Classification Loss (Fake): Update Generator

3) Leverage a log-frequency sampler

A log-frequency sampler is used to overcome the mode collapse problem that arises for imbalanced variables.

3.1) Technical Background

GAN is one of the most popular methods for generating synthetic data.

It plays an adversarial game between a Generator, which synthesizes realistic data, and a Discriminator, which distinguishes real samples from synthetic ones.

1) To address problem of dataset imbalance (Mixed data type variables)

Conditional Generator
Training - by - sampling

A conditional vector represents the classes of the categorical variables; it is fed to the generator as input and, at the same time, restricts the sampling of real training data to rows that match it.
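As a rough illustration of training-by-sampling, the sketch below draws a real row that matches a chosen condition. The row format (one dict per row), the column names, and the helper name `sample_matching_row` are assumptions made for this example, not the authors' implementation.

```python
import random

def sample_matching_row(rows, column, value, rng):
    """Pick a real row whose `column` equals `value` (hypothetical helper)."""
    matching = [row for row in rows if row[column] == value]
    if not matching:
        raise ValueError("no real row satisfies the condition")
    return rng.choice(matching)

# Toy real data; the columns are made up for illustration.
rows = [
    {"gender": "F", "income": 0.0},
    {"gender": "M", "income": 1200.0},
    {"gender": "M", "income": 0.0},
]

# The condition "gender == M" restricts which real rows can be drawn.
picked = sample_matching_row(rows, "gender", "M", random.Random(0))
```

The generator would receive the same condition as input, so the real and synthetic samples seen by the discriminator agree on the selected class.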

2) To enhance the generation quality

Two loss function terms are introduced.

Information Loss: training stability ↑
- the difference in statistics between generated data and real data

Classification Loss: semantic integrity ↑ (how accurately the data reflects the real world)
- requires an auxiliary classifier added to the GAN architecture in parallel with the Discriminator
- the difference between the synthesized class and the predicted class (whether the record is classified into its corresponding class)

3) To counter complex distributions in continuous variables

The mode-specific normalization idea is used: each value is encoded as a value-mode pair derived from a Gaussian mixture model.

3.2) Design of CTAB - GAN

CTAB-GAN consists of three models.

Generator $\mathcal{G}$
Discriminator $\mathcal{D}$
auxiliary Classifier $\mathcal{C}$

Since the base algorithm is the Conditional GAN (CGAN), the Generator $\mathcal{G}$ takes a noise vector $z$ and a conditional vector $c$ as input.

The mixed-type encoder of Section 3.3 (encoding and decoding of real and synthetic data) is omitted from the figure, and the conditional vector of Section 3.4 is shown in simplified form.

A GAN is trained through a zero-sum minimax game between the Discriminator $\mathcal{D}$ and the Generator $\mathcal{G}$: the Discriminator tries to maximize the objective, while the Generator tries to minimize it.
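For reference, the minimax game can be written with the standard objective of the original GAN. This is the unconditional textbook form; in the conditional setting both networks would also receive the conditional vector $c$, and this review does not state which exact loss variant the CTAB-GAN code uses.

$$\min_{\mathcal{G}} \max_{\mathcal{D}} \; \mathbb{E}_{x \sim p_{\text{data}}}\left[\log \mathcal{D}(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - \mathcal{D}(\mathcal{G}(z))\right)\right]$$

$p_{\text{data}}$: distribution of real records
$p_z$: distribution of the noise vector $z$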

This can be interpreted as the Discriminator $\mathcal{D}$ giving the Generator $\mathcal{G}$ feedback on the quality of its work.

The additional feedback given to the Generator $\mathcal{G}$ comes from the information loss and the classification loss.

Information loss
- matches the statistics of real and synthetic records (first-order: $\mu$, second-order: $\sigma$)
- makes synthetic records share the same statistical characteristics as real records

Classification Loss
- semantic integrity (how accurately the data reflects the real world)
- penalizes combinations of values that are semantically incorrect

Both losses are added to the original loss term of the Generator $\mathcal{G}$ and thus take part in training.
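To make the information loss more concrete, here is a minimal sketch that compares the per-feature mean and standard deviation of a real batch and a synthetic batch. Which features are compared and the use of the Euclidean norm are assumptions of this sketch; `information_loss` is a hypothetical name, and the inputs are plain lists of feature vectors.

```python
import math
from statistics import fmean, pstdev

def information_loss(real_feats, fake_feats):
    """Gap between first-order (mean) and second-order (std) statistics."""
    # Transpose the row-major batches so that each tuple holds one feature.
    real_cols = list(zip(*real_feats))
    fake_cols = list(zip(*fake_feats))
    # Euclidean norm of the difference between per-feature means.
    mean_gap = math.sqrt(sum((fmean(r) - fmean(f)) ** 2
                             for r, f in zip(real_cols, fake_cols)))
    # Euclidean norm of the difference between per-feature standard deviations.
    std_gap = math.sqrt(sum((pstdev(r) - pstdev(f)) ** 2
                            for r, f in zip(real_cols, fake_cols)))
    return mean_gap + std_gap

real = [[0.0, 1.0], [2.0, 3.0]]
shifted = [[1.0, 1.0], [3.0, 3.0]]   # first feature shifted by +1, same spread

loss_same = information_loss(real, real)
loss_shifted = information_loss(real, shifted)
```

Identical batches give a loss of zero, while the shifted batch is penalized only through the mean term because its spread is unchanged.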

In total, four loss terms are used.

1) Discriminator Update: Original GAN Loss (Real & Fake)

2) Generator Update:

2-1) Original GAN Loss + Conditional Loss (Conditional Vector)

2-2) Information Loss ($\mu, \sigma$)

3) Classifier & Generator Update:

3-1) Classification Loss (Real): Update Classifier
3-2) Classification Loss (Fake): Update Generator

The Generator $\mathcal{G}$ and the Discriminator $\mathcal{D}$ consist of four and two CNN layers, respectively.

CNNs are effective at capturing relations between the pixels of an image; applied to tabular data, this helps improve the semantic integrity of the synthetic data.

The Classifier $\mathcal{C}$ uses a 7-layer MLP and is trained on the original data so that it can better judge semantic integrity.

Synthetic records are therefore reverse-transformed from their encoded form back to the original data format before being fed to the Classifier.

3.3) Mixed-type Encoder

Tabular data is encoded variable by variable.

There are three variable types: categorical, continuous, and mixed.

Here, mixed covers two cases: a variable that contains both categorical and continuous values, or one that contains continuous values together with missing values.

The authors propose the mixed-type encoder to handle such variables.

With this encoder, each value of a mixed variable is represented as a concatenation of a value-mode pair.

The values of a mixed variable are paired as value-mode.

Panel (a) shows the distribution of the mixed variable.

In the categorical part, a value is exactly either $\mu_0$ or $\mu_3$; in the continuous part, values are spread around the two peaks $\mu_1$ and $\mu_2$.

💡

1.1) Mixed Variable: Continuous value

A Variational Gaussian Mixture (VGM) is used.

1) First, estimate the number of modes $k$ and fit a Gaussian mixture.

$$\mathbb{P} = \sum_{k=1}^2 \omega_k\mathcal{N}(\mu_k, \sigma_k)$$

$\mathcal{N}$: Gaussian Distribution
$\omega_k, \mu_k, \sigma_k$: weight, mean, standard deviation of each mode $k$

In this example, the variable's distribution has two continuous regions, with modes at $\mu_1$ and $\mu_2$.

2) For each value $\tau$, select the mode of the fitted Gaussian mixture that gives the highest probability.

For the value $\tau$ to be encoded, the two modes yield the probability densities $\rho_1$ and $\rho_2$.

Since $\rho_1$ is higher than $\rho_2$, mode 1 is selected and used for normalization.

3) Normalize the value $\tau$ using the parameters $\mu_{k'}, \sigma_{k'}$ of the selected mode $k'$.

$$\alpha = \frac {\tau - \mu_{k'}} {4\sigma_{k'}}$$

1) $$\alpha = \frac {\tau - \mu_{k'}} {\sigma_{k'}}$$

The usual approach maps data within $[-\sigma, \sigma]$ of the zero-centered distribution to $[-1, 1]$.

2) $$\alpha = \frac {\tau - \mu_{k'}} {4\sigma_{k'}}$$

The approach used here maps data within $[-4\sigma, 4\sigma]$ of the zero-centered distribution to $[-1, 1]$.

In other words, the goal is not to map values onto a $\mathcal{N}(0, 1)$ distribution but to map each value into $[-1, 1]$, and dividing by $4\sigma$ achieves this.

4) Concatenate the resulting scalar $\alpha$ with the one-hot encoded mode vector $\beta$, so that the encoding of the continuous value is $\alpha \oplus \beta$.

$\oplus$: vector concatenation operator
$\beta = [0, 1, 0, 0]$: the one-hot vector marking the second of the four modes ($\mu_1$)
Note that the $4$ in $4\sigma$ is not the number of modes, even though this example happens to have four; it is the scaling factor that maps $[-4\sigma, 4\sigma]$ onto $[-1, 1]$.
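Steps 2) to 4) can be sketched as follows, assuming the Gaussian mixture has already been fitted. The mode parameters are made-up numbers, `encode_value` is a hypothetical helper, and weighting each density by its mixture weight $\omega_k$ is an assumption of this sketch.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def encode_value(tau, modes):
    """Encode tau as (alpha, beta); modes is a list of (weight, mu, sigma)."""
    # Step 2: pick the mode with the highest (weighted) density at tau.
    densities = [w * normal_pdf(tau, mu, sigma) for w, mu, sigma in modes]
    k = max(range(len(modes)), key=densities.__getitem__)
    # Step 3: normalize with the selected mode, dividing by 4 * sigma.
    _, mu, sigma = modes[k]
    alpha = (tau - mu) / (4 * sigma)
    # Step 4: one-hot vector beta marks the selected mode.
    beta = [1 if i == k else 0 for i in range(len(modes))]
    return alpha, beta

# Hypothetical fitted mixture with two modes.
modes = [(0.6, 10.0, 2.0), (0.4, 50.0, 5.0)]

alpha_a, beta_a = encode_value(12.0, modes)   # close to the first mode
alpha_b, beta_b = encode_value(45.0, modes)   # close to the second mode
```

Here 12.0 lies one standard deviation above the first mode, so it is encoded as $\alpha = 0.25$ with $\beta = [1, 0]$; the full encoding is the concatenation of the two.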

💡

1.2) Mixed Variable: Categorical value

A categorical value is handled in the same way as a continuous value, except that $\alpha$ directly represents the value of the mode.

That is, in the figure above, $\alpha$ is simply $\mu_0$ or $\mu_3$.

For example, if the value is $\mu_3$, the final encoding is $\mu_3 \oplus [0, 0, 0, 1]$.

Note that categorical values are not necessarily numbers: strings are allowed, and so are missing values.

Such symbols can be handled by mapping them to numeric values outside the given continuous region.
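A small sketch of how a mixed variable might be encoded end to end, covering both its categorical and continuous values. The slot order of $\beta$ (categorical values first, then continuous modes), the sentinel `MISSING = -1.0` for missing values, and the mode parameters are all assumptions chosen for illustration.

```python
import math

MISSING = -1.0  # hypothetical numeric stand-in, outside the continuous region

def encode_mixed(tau, categorical_values, continuous_modes):
    """Return (alpha, beta) for one value of a mixed variable.

    continuous_modes is a list of (weight, mu, sigma) tuples.
    """
    beta = [0] * (len(categorical_values) + len(continuous_modes))
    if tau in categorical_values:
        # Categorical part: alpha is the value of the mode itself.
        beta[categorical_values.index(tau)] = 1
        return tau, beta

    def density(mode):
        # Weighted Gaussian density up to a constant factor.
        w, mu, sigma = mode
        return w * math.exp(-0.5 * ((tau - mu) / sigma) ** 2) / sigma

    # Continuous part: most likely mode, then divide by 4 * sigma.
    k = max(range(len(continuous_modes)), key=lambda i: density(continuous_modes[i]))
    _, mu, sigma = continuous_modes[k]
    beta[len(categorical_values) + k] = 1
    return (tau - mu) / (4 * sigma), beta

categorical_values = [0.0, MISSING]
continuous_modes = [(0.5, 100.0, 10.0), (0.5, 500.0, 50.0)]

enc_zero = encode_mixed(0.0, categorical_values, continuous_modes)
enc_missing = encode_mixed(MISSING, categorical_values, continuous_modes)
enc_cont = encode_mixed(120.0, categorical_values, continuous_modes)
```

The exact value 0 keeps its own slot in $\beta$, so it can be generated as exactly 0 instead of a nearby continuous value.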

💡

2) Categorical Variable

So far we have seen how the continuous and categorical values of a mixed variable are turned into vectors.

Continuous variables use the same encoding as the continuous intervals of mixed variables.

Categorical variables are encoded with a one-hot vector $\gamma$.

Missing values are handled by adding an extra bit to the one-hot vector.

💡

3) Final Encoding

Consider a single row with a total of $N$ variables (columns).

A row with $[1, 2, ... , N]$ variables ($N = m + n$)
$n$ continuous / mixed variables ($\alpha \oplus \beta$)
$m$ categorical variables ($\gamma$)

The final encoding vector can be written as follows.

$$\bigoplus_{i=1}^{n} \alpha_i \oplus \beta_i \bigoplus_{j=n+1}^N \gamma_j$$

$\beta$: Mode one-hot encoding vector
$\gamma$: Class(Category) one-hot encoding vector
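The concatenation in the formula can be sketched as below, given per-variable encodings that are already computed. The helper name `encode_row` and the example values are hypothetical.

```python
def encode_row(continuous_parts, categorical_parts):
    """Concatenate alpha ⊕ beta for each continuous/mixed variable, then each gamma."""
    row = []
    for alpha, beta in continuous_parts:
        row.append(alpha)    # scalar alpha
        row.extend(beta)     # one-hot mode vector beta
    for gamma in categorical_parts:
        row.extend(gamma)    # one-hot class vector gamma
    return row

# Two continuous/mixed variables (n = 2) and one categorical variable (m = 1).
continuous_parts = [(0.25, [1, 0]), (-0.5, [0, 0, 1])]
categorical_parts = [[0, 1, 0]]

encoded = encode_row(continuous_parts, categorical_parts)
```

The encoded row has one scalar plus one bit per mode for every continuous or mixed variable, and one bit per class for every categorical variable.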

There are three variables in total: two mixed / continuous variables and one categorical variable.

Note that the vector shown in the figure is the conditional vector $\mathcal{V}$, not the final encoding.

This is because $\mathcal{V}$ is a vector whose entries are all 0 except for the single selected mode/class of the selected variable.

3.4) Counter imbalanced training datasets

CTAB-GAN uses a conditional GAN to counter imbalanced training datasets.

That is, when sampling real data, the conditional vector is used to filter and rebalance the training data.

The conditional vector $\mathcal{V}$ is a bit vector formed by concatenating all mode one-hot encodings $\beta$ (for continuous and mixed variables) and all class one-hot encodings $\gamma$ (for categorical variables).

Importantly, the conditional vector $\mathcal{V}$ is a zero vector with a single one at the position of the selected mode/class of the selected variable.

One continuous(C1), one mixed(C2), one categorical(C3), with class 2 selected on C3.
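The caption's example can be reproduced with a short sketch. The number of modes/classes per variable (3, 4, 3) is assumed here for illustration, as is reading "class 2" as the second class (zero-based index 1); `conditional_vector` is a hypothetical helper.

```python
def conditional_vector(slot_sizes, variable, choice):
    """Zero vector with a single 1 at the chosen mode/class of the chosen variable."""
    vec = [0] * sum(slot_sizes)
    offset = sum(slot_sizes[:variable])   # bits occupied by earlier variables
    vec[offset + choice] = 1
    return vec

# Assumed sizes: C1 has 3 modes, C2 has 4 modes, C3 has 3 classes.
slot_sizes = [3, 4, 3]

# Select class 2 of C3: variable index 2, choice index 1.
cond = conditional_vector(slot_sizes, variable=2, choice=1)
```

All bits belonging to C1 and C2 stay at zero, so the vector conditions on exactly one variable at a time.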

To rebalance the dataset, a conditional vector is needed at every training step. It is built as follows.

1) Pick one of the $N$ variables at random with uniform probability.

2) For the selected variable, compute the probability distribution over its modes (continuous or mixed) or classes (categorical).

Using the frequency of each mode/class as a proxy for its probability, a mode is sampled based on the logarithm of that probability.

3) The logarithm is used because it gives minority modes/classes a higher chance of appearing during training.

This helps prevent mode collapse for rare modes/classes.
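The sampling steps above can be sketched as follows. Turning log-frequencies into a valid distribution via `log(count + 1)` followed by normalization is an assumption of this sketch, and the function names and toy columns are hypothetical.

```python
import math
import random
from collections import Counter

def log_frequency_probs(labels):
    """Sampling probabilities proportional to log(count + 1) per mode/class."""
    counts = Counter(labels)
    classes = sorted(counts)
    log_counts = [math.log(counts[c] + 1) for c in classes]
    total = sum(log_counts)
    return {c: lc / total for c, lc in zip(classes, log_counts)}

def sample_condition(columns, rng):
    """Step 1: uniform choice of variable. Steps 2-3: log-frequency choice of class."""
    name = rng.choice(sorted(columns))
    probs = log_frequency_probs(columns[name])
    classes = list(probs)
    picked = rng.choices(classes, weights=[probs[c] for c in classes], k=1)[0]
    return name, picked

# Heavily imbalanced toy column: class "b" has a raw frequency of 1%.
columns = {"fraud": ["a"] * 99 + ["b"], "gender": ["F", "M", "M"]}

probs = log_frequency_probs(columns["fraud"])
name, picked = sample_condition(columns, random.Random(0))
```

With raw frequencies the minority class "b" would be drawn 1% of the time; under the log-frequency weights its probability rises to roughly 13%, which is the rebalancing effect described above.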

3.5) Treat long tail

Continuous values are encoded with a VGM (Variational Gaussian Mixture) to handle multi-mode data distributions.

However, a Gaussian mixture cannot model every kind of data distribution.

In particular, the few rare points far from the bulk of the data, that is, the long tail of a distribution, cannot be captured well by a Gaussian mixture.

To prevent this, a logarithm transformation is applied to long-tail distributions as a preprocessing step, before the variables are encoded.

For such a variable with lower bound $l$, each value $\tau$ is replaced by $\tau^c$.

$$\tau^c = \begin{cases} \log(\tau) & \text{if } l > 0 \\ \log(\tau - l + \epsilon) & \text{if } l \leq 0, \text{ where } \epsilon > 0 \end{cases}$$

This log transformation reduces the distance between the tail and the bulk of the data, making it easier for the VGM to encode all values, including those in the tail.
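A minimal sketch of the transformation and its inverse, which would be needed to map synthetic values back to the original scale. Taking the lower bound $l$ as the minimum of the observed values and setting $\epsilon = 1$ are assumptions of this sketch, and the function names are hypothetical.

```python
import math

def log_transform(values, eps=1.0):
    """Apply the piecewise log transform; returns (transformed, lower_bound)."""
    lower = min(values)
    if lower > 0:
        return [math.log(v) for v in values], lower
    # Shift so that the smallest value maps to log(eps).
    return [math.log(v - lower + eps) for v in values], lower

def inverse_log_transform(transformed, lower, eps=1.0):
    """Undo log_transform given the same lower bound and eps."""
    if lower > 0:
        return [math.exp(t) for t in transformed]
    return [math.exp(t) + lower - eps for t in transformed]

# Long-tailed toy column: most values are small, one is far out in the tail.
amounts = [0.0, 1.0, 10.0, 8000.0]

compressed, lower = log_transform(amounts)
restored = inverse_log_transform(compressed, lower)
```

After the transform the gap between the bulk and the tail shrinks from thousands to single digits, while the inverse recovers the original values up to floating-point error.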
