
EE 641 - Unit 3
Fall 2025
Generative Adversarial Networks
[Score Matching] Y. Song and S. Ermon, “Generative modeling by estimating gradients of the data distribution,” in Advances in Neural Information Processing Systems, 2019, pp. 11918–11930.
[GAN] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[GAN Review] I. Goodfellow, “NIPS 2016 tutorial: Generative adversarial networks,” arXiv preprint arXiv:1701.00160, 2016.
[GAN Theory] S. Arora and Y. Zhang, “Do GANs actually learn the distribution? An empirical study,” arXiv preprint arXiv:1706.08224, 2017.
[DCGAN] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” in International Conference on Learning Representations, 2016.
[WGAN] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in International Conference on Machine Learning, 2017, pp. 214–223.
[WGAN-GP] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of Wasserstein GANs,” in Advances in Neural Information Processing Systems, 2017, pp. 5767–5777.
[BigGAN] A. Brock, J. Donahue, and K. Simonyan, “Large scale GAN training for high fidelity natural image synthesis,” in International Conference on Learning Representations, 2019.
[StyleGAN] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4401–4410.
[Pix2Pix] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
[VAE] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in International Conference on Learning Representations, 2014.
[VQ-VAE] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” in Advances in Neural Information Processing Systems, 2017, pp. 6306–6315.
Energy defines probability
\[p(\mathbf{x}) = \frac{1}{Z} \exp\left(-\frac{E(\mathbf{x})}{T}\right)\]
where partition function: \[Z = \int \exp\left(-\frac{E(\mathbf{x})}{T}\right) d\mathbf{x}\]
Why this distribution?
Maximizes entropy \(S = -\sum_i p_i \log p_i\) subject to normalization \(\sum_i p_i = 1\) and a fixed mean energy \(\sum_i p_i E_i = \langle E \rangle\).
Result from Lagrange multipliers: \[p_i \propto \exp(-\beta E_i)\] where \(\beta = 1/T\) is inverse temperature

Temperature controls exploration-exploitation:
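To make the temperature effect concrete, a minimal sketch (with illustrative, made-up energy values) that normalizes Boltzmann probabilities over a few discrete states:
import numpy as np

def boltzmann(energies, T):
    # p_i ∝ exp(-E_i / T); subtract the max logit for numerical stability
    logits = -np.asarray(energies, dtype=float) / T
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()          # division by p.sum() plays the role of 1/Z

E = [0.0, 1.0, 2.0]             # hypothetical energies of three states
print(boltzmann(E, T=0.1))      # low T: mass concentrates on the minimum (exploitation)
print(boltzmann(E, T=10.0))     # high T: nearly uniform (exploration)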
Computing \(Z\) is intractable
\[Z(\boldsymbol{\theta}) = \int \exp(-E(\mathbf{x}; \boldsymbol{\theta})) d\mathbf{x}\]
What \(Z\) does:
Computational reality for images (224×224×3):

Approximation strategies:
Helmholtz Free Energy \[F(\boldsymbol{\theta}) = -T \log Z(\boldsymbol{\theta})\]
Derivatives give expectations: \[\frac{\partial F}{\partial \theta_i} = \langle \frac{\partial E}{\partial \theta_i} \rangle_{p(\mathbf{x}|\boldsymbol{\theta})}\]
Second derivatives give covariances (up to sign): \[\frac{\partial^2 F}{\partial \theta_i \partial \theta_j} = -\text{Cov}\left[\frac{\partial E}{\partial \theta_i}, \frac{\partial E}{\partial \theta_j}\right]\] for energies linear in \(\boldsymbol{\theta}\) (e.g., RBMs), with \(T = 1\)

Physics interpretation:
Learning problem:
Why \(F = -T \log Z\) connects physics and ML:
Objective: Given data \(\{\mathbf{x}_1, ..., \mathbf{x}_N\}\), maximize: \[\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^N \log p(\mathbf{x}_i|\boldsymbol{\theta})\]
Gradient of log-likelihood: \[\nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}|\boldsymbol{\theta}) = -\nabla_{\boldsymbol{\theta}} E(\mathbf{x};\boldsymbol{\theta}) + \mathbb{E}_{p(\mathbf{x}'|\boldsymbol{\theta})}[\nabla_{\boldsymbol{\theta}} E(\mathbf{x}';\boldsymbol{\theta})]\]
Two phases:

Critical Challenge: Computing negative phase requires samples from current model \(p(\mathbf{x}|\boldsymbol{\theta})\)!
Learning = Energy Sculpting: \[\nabla_{\boldsymbol{\theta}} \log p(data) = \underbrace{-\nabla_{\boldsymbol{\theta}} E(data)}_{\text{push down}} + \underbrace{\mathbb{E}_{model}[\nabla_{\boldsymbol{\theta}} E]}_{\text{push up}}\]
Positive phase (easy) vs Negative phase (intractable expectation)
Theorem: For any energy-based model, the log-likelihood gradient decomposes as:
\[\frac{\partial}{\partial \boldsymbol{\theta}} \log p(\mathbf{x}^{(n)}; \boldsymbol{\theta}) = -\frac{\partial E(\mathbf{x}^{(n)}; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} + \frac{\partial F(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\]
Proof: Starting from \(p(\mathbf{x}; \boldsymbol{\theta}) = \frac{1}{Z(\boldsymbol{\theta})} e^{-E(\mathbf{x};\boldsymbol{\theta})}\):
\[\frac{\partial}{\partial \boldsymbol{\theta}} \log p(\mathbf{x}^{(n)}) = \frac{\partial}{\partial \boldsymbol{\theta}} [-E(\mathbf{x}^{(n)}) - \log Z(\boldsymbol{\theta})]\]
\[= -\frac{\partial E(\mathbf{x}^{(n)})}{\partial \boldsymbol{\theta}} - \frac{1}{Z} \frac{\partial Z}{\partial \boldsymbol{\theta}}\]
Since \(\frac{\partial Z}{\partial \boldsymbol{\theta}} = \int \frac{\partial}{\partial \boldsymbol{\theta}} e^{-E(\mathbf{x};\boldsymbol{\theta})} d\mathbf{x} = -\int e^{-E(\mathbf{x};\boldsymbol{\theta})} \frac{\partial E}{\partial \boldsymbol{\theta}} d\mathbf{x}\):
\[\frac{1}{Z} \frac{\partial Z}{\partial \boldsymbol{\theta}} = -\mathbb{E}_{p(\mathbf{x};\boldsymbol{\theta})}\left[\frac{\partial E}{\partial \boldsymbol{\theta}}\right] = -\frac{\partial F}{\partial \boldsymbol{\theta}}\] which, substituted above, gives the claimed decomposition. ∎
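Because the proof uses only the definition of \(Z\), it can be checked numerically on any model where \(Z\) is computable exactly. A sketch on a toy discrete EBM with hypothetical linear energy \(E(\mathbf{x};\boldsymbol{\theta}) = -\boldsymbol{\theta}^T\mathbf{x}\) over 3-bit states:
import itertools
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=3)
# All 2^3 states of a 3-bit toy model; E(x; theta) = -theta^T x
states = np.array(list(itertools.product([0, 1], repeat=3)), dtype=float)

def log_Z(theta):
    return np.log(np.exp(states @ theta).sum())

def log_p(x, theta):
    return x @ theta - log_Z(theta)

x = states[5]
# Analytic gradient: positive phase (-dE/dtheta = x) plus negative phase
probs = np.exp(states @ theta - log_Z(theta))
analytic = x - probs @ states

# Finite-difference check of d(log p)/d(theta)
eps = 1e-6
numeric = np.array([
    (log_p(x, theta + eps * np.eye(3)[i]) -
     log_p(x, theta - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
print(np.allclose(analytic, numeric))  # True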
Fisher Information Connection: \(I(\boldsymbol{\theta}) = \text{Cov}\left[\frac{\partial E}{\partial \boldsymbol{\theta}}\right] = -\frac{\partial^2 F}{\partial \boldsymbol{\theta}^2}\)
Why Modern ML Exists:
Positive phase: \(-\frac{\partial E(data)}{\partial \boldsymbol{\theta}}\)
Negative phase: \(\frac{\partial F}{\partial \boldsymbol{\theta}}\)
Modern Solutions:
Main Problem: All tractable learning requires avoiding the negative phase computation.
Markov Chain Monte Carlo (MCMC)
To sample from \(p(\mathbf{x}) \propto \exp(-E(\mathbf{x}))\):
Langevin Dynamics: \[\mathbf{x}_{t+1} = \mathbf{x}_t - \frac{\epsilon}{2}\nabla_{\mathbf{x}} E(\mathbf{x}_t) + \sqrt{\epsilon}\boldsymbol{\eta}_t\] where \(\boldsymbol{\eta}_t \sim \mathcal{N}(0, \mathbf{I})\)
Hamiltonian Monte Carlo:

Why MCMC fails at scale:
Full gradient requires equilibrium samples: \[\nabla_{\boldsymbol{\theta}} \mathcal{L} = \underbrace{-\frac{1}{N}\sum_{i=1}^N \nabla_{\boldsymbol{\theta}} E(\mathbf{x}_i)}_{\text{Data term (easy)}} + \underbrace{\mathbb{E}_{p_{\boldsymbol{\theta}}}[\nabla_{\boldsymbol{\theta}} E(\mathbf{x})]}_{\text{Model term (hard!)}}\]
Computational cost per gradient step:
For images (\(d = 150\)K):

Approximation strategies emerged from this problem:
Bipartite Structure
Visible units \(\mathbf{v} \in \{0,1\}^D\), Hidden units \(\mathbf{h} \in \{0,1\}^H\)
Energy function: \[E(\mathbf{v}, \mathbf{h}) = -\mathbf{v}^T \mathbf{W} \mathbf{h} - \mathbf{b}^T \mathbf{v} - \mathbf{c}^T \mathbf{h}\]
Conditional independence: \[p(\mathbf{h}|\mathbf{v}) = \prod_{j=1}^H p(h_j|\mathbf{v})\] \[p(\mathbf{v}|\mathbf{h}) = \prod_{i=1}^D p(v_i|\mathbf{h})\]
where: \[p(h_j = 1|\mathbf{v}) = \sigma(\mathbf{W}_{:,j}^T \mathbf{v} + c_j)\] \[p(v_i = 1|\mathbf{h}) = \sigma(\mathbf{W}_{i,:} \mathbf{h} + b_i)\]

Why RBMs work: Tractable Gibbs sampling
The CD-k Algorithm
Instead of running chain to equilibrium, use k steps:
Gradient approximation: \[\Delta \mathbf{W} \approx \langle \mathbf{v}\mathbf{h}^T \rangle_{\text{data}} - \langle \mathbf{v}\mathbf{h}^T \rangle_{\text{CD-k}}\]
CD-1 often sufficient! (k=1)
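A minimal CD-1 step for a binary RBM, built directly from the conditionals above; the weights W, b, c and the data batch v0 are assumed given, and this is a sketch rather than production code:
import torch

def cd1_step(v0, W, b, c):
    """One CD-1 gradient estimate for a binary RBM.
    v0: (B, D) data batch; W: (D, H); b: (D,); c: (H,)."""
    # Positive phase: p(h | v0) from the data
    ph0 = torch.sigmoid(v0 @ W + c)
    h0 = torch.bernoulli(ph0)
    # One Gibbs step: v1 ~ p(v | h0), then p(h | v1)
    pv1 = torch.sigmoid(h0 @ W.T + b)
    v1 = torch.bernoulli(pv1)
    ph1 = torch.sigmoid(v1 @ W + c)
    # <v h^T>_data - <v h^T>_CD-1, averaged over the batch
    dW = (v0.T @ ph0 - v1.T @ ph1) / v0.size(0)
    db = (v0 - v1).mean(dim=0)
    dc = (ph0 - ph1).mean(dim=0)
    return dW, db, dc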

What CD actually does:
What CD actually optimizes:
Not KL(p_data || p_model), but: \[\text{CD}_k = \text{KL}(p_{\text{data}} || p_{\text{model}}) - \text{KL}(p_k || p_{\text{model}})\]
where \(p_k\) = distribution after \(k\) Gibbs steps from data
Partition function cancellation:
\[\frac{\partial \text{CD}_k}{\partial \boldsymbol{\theta}} = \mathbb{E}_{p_{\text{data}}}[\nabla_{\boldsymbol{\theta}} E] - \mathbb{E}_{p_k}[\nabla_{\boldsymbol{\theta}} E] + \underbrace{(\nabla_{\boldsymbol{\theta}} \log Z - \nabla_{\boldsymbol{\theta}} \log Z)}_{=0}\]
The problematic \(\nabla_{\boldsymbol{\theta}} \log Z(\boldsymbol{\theta})\) terms cancel.
Tractable gradient computation without equilibrium sampling!
Implications:
Persistent CD: Continue chains across minibatches

Connection to score matching:
Avoid the partition function entirely!
Score function: \(\mathbf{s}(\mathbf{x}; \boldsymbol{\theta}) = \nabla_{\mathbf{x}} \log p(\mathbf{x}; \boldsymbol{\theta})\)
For EBM: \(\mathbf{s}(\mathbf{x}) = -\nabla_{\mathbf{x}} E(\mathbf{x}; \boldsymbol{\theta})\) (no \(Z\)!)
Score Matching Objective: \[J(\boldsymbol{\theta}) = \frac{1}{2}\mathbb{E}_{p_{\text{data}}}[||\mathbf{s}(\mathbf{x}; \boldsymbol{\theta}) - \nabla_{\mathbf{x}} \log p_{\text{data}}(\mathbf{x})||^2]\]
Problem: Don’t know \(\nabla_{\mathbf{x}} \log p_{\text{data}}\)!
Solution (Integration by parts): \[J(\boldsymbol{\theta}) = \mathbb{E}_{p_{\text{data}}}[\text{tr}(\nabla_{\mathbf{x}} \mathbf{s}(\mathbf{x}; \boldsymbol{\theta})) + \frac{1}{2}||\mathbf{s}(\mathbf{x}; \boldsymbol{\theta})||^2] + C\]

Why score matching works: No sampling required!
Practical score matching via denoising
Perturb data: \(\tilde{\mathbf{x}} = \mathbf{x} + \boldsymbol{\epsilon}\), where \(\boldsymbol{\epsilon} \sim \mathcal{N}(0, \sigma^2 \mathbf{I})\)
Result (Vincent, 2011): minimizing \[\mathbb{E}_{\mathbf{x},\tilde{\mathbf{x}}}\left[||\mathbf{s}(\tilde{\mathbf{x}}) - \nabla_{\tilde{\mathbf{x}}} \log p(\tilde{\mathbf{x}}|\mathbf{x})||^2\right] = \mathbb{E}_{\mathbf{x},\tilde{\mathbf{x}}}\left[||\mathbf{s}(\tilde{\mathbf{x}}) + \frac{\tilde{\mathbf{x}} - \mathbf{x}}{\sigma^2}||^2\right]\] is equivalent, up to a constant \(C\), to score matching on the perturbed distribution \(p(\tilde{\mathbf{x}})\).
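In code, the denoising objective reduces to regressing the score network onto \(-(\tilde{\mathbf{x}}-\mathbf{x})/\sigma^2\); in this sketch, score_net is a hypothetical network mapping \(\tilde{\mathbf{x}}\) to \(\mathbf{s}(\tilde{\mathbf{x}};\boldsymbol{\theta})\):
import torch

def dsm_loss(score_net, x, sigma=0.1):
    """Denoising score matching loss for one batch x of shape (B, D)."""
    noise = torch.randn_like(x) * sigma
    x_tilde = x + noise
    target = -noise / sigma**2        # = grad log N(x_tilde | x, sigma^2 I)
    s = score_net(x_tilde)
    return 0.5 * ((s - target) ** 2).sum(dim=1).mean()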
Denoising Autoencoder Connection:

Connection to diffusion models:
Neural Networks as Energy Functions
Energy: \(E(\mathbf{x}; \boldsymbol{\theta}) = ||f_{\boldsymbol{\theta}}(\mathbf{x})||^2\)
Or more generally: \[E(\mathbf{x}; \boldsymbol{\theta}) = -\log \sum_y \exp(f_{\boldsymbol{\theta}}(\mathbf{x}, y))\]
Training with Langevin Dynamics:
# Sampling loop (inner): Langevin dynamics on x
x = x_init
for t in range(T):
    x = x - lam * grad_x_E(x) + (2 * lam) ** 0.5 * randn_like(x)
# Parameter update (outer): positive phase minus negative phase
grad_theta = grad_theta_E(x_data) - grad_theta_E(x)
theta = theta - lr * grad_theta

Memory Requirements:

Problems with deep EBMs:
Computational Reality
ImageNet image: 224 × 224 × 3 = 150,528 dims
Per gradient step:
Per epoch (1.2M images):
Memory explosion:

Fundamental problem beyond computation:
Score Matching → Diffusion Models
Contrastive Learning → Self-Supervised
Energy Functions → Implicit Models

Lessons learned:
Legacy → Modern ML:
EBMs: Model \(p(\mathbf{x})\) explicitly
GANs: Generate samples directly
What changes: Replace intractable sampling with adversarial training
Instead of computing \(\mathbb{E}_{p_{\text{model}}}[f(\mathbf{x})]\), use a discriminator to estimate density ratios

GANs trade computational intractability for training instability:
Generated samples from modern GANs (2024):

Quality metrics:

Computational requirements:
Inaugural results from Goodfellow et al.:

Architecture (fully connected):
Quantitative results (MNIST):
Historical impact:
Kullback-Leibler divergence
\[\text{KL}(p||q) = \mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right]\]
Properties:
Forward KL \((p||q)\): Mean-seeking
Reverse KL \((q||p)\): Mode-seeking

GAN behavior matches reverse KL:
Minimax game between two networks:
\[\min_G \max_D V(D,G) = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}[\log(1-D(G(\mathbf{z})))]\]
Interpretation as binary classification:
Discriminator \(D\) solves:
Generator \(G\):

Step-by-step derivation for fixed \(G\):
Starting from: \[V(D,G) = \int_{\mathbf{x}} p_{\text{data}}(\mathbf{x})\log D(\mathbf{x}) + p_g(\mathbf{x})\log(1-D(\mathbf{x})) d\mathbf{x}\]
For any \(\mathbf{x}\), maximize integrand: \[f(y) = a \log(y) + b \log(1-y)\] where \(a = p_{\text{data}}(\mathbf{x})\), \(b = p_g(\mathbf{x})\), \(y = D(\mathbf{x})\)
Taking derivative: \[\frac{df}{dy} = \frac{a}{y} - \frac{b}{1-y}\]
Setting to zero and solving: \[\frac{a}{y} = \frac{b}{1-y} \Rightarrow a(1-y) = by\] \[a - ay = by \Rightarrow a = y(a+b)\]
Therefore: \[D^*(\mathbf{x}) = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_g(\mathbf{x})}\]
Verification: Second derivative \(\frac{d^2f}{dy^2} = -\frac{a}{y^2} - \frac{b}{(1-y)^2} < 0\) confirms maximum.

With \(D = D^*\), generator objective becomes:
\[C(G) = \max_D V(D,G) = V(D^*, G)\]
Substituting \(D^*\): \[C(G) = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}\left[\log \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_g(\mathbf{x})}\right]\] \[+ \mathbb{E}_{\mathbf{x} \sim p_g}\left[\log \frac{p_g(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_g(\mathbf{x})}\right]\]
This equals: \[C(G) = -\log 4 + 2 \cdot \text{JS}(p_{\text{data}} || p_g)\]
where Jensen-Shannon divergence: \[\text{JS}(p||q) = \frac{1}{2}\text{KL}(p||m) + \frac{1}{2}\text{KL}(q||m)\] \[m = \frac{1}{2}(p + q)\]
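A quick numerical sketch of JS on two discrete distributions (values chosen for illustration) shows the symmetry and the \(\log 2\) bound:
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p, q = [0.9, 0.1], [0.1, 0.9]
print(js(p, q))   # ≈ 0.368: symmetric, and always ≤ log 2 ≈ 0.693
print(js(p, p))   # 0.0 when the distributions coincide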

Properties of JS divergence:
Forward KL: \(D_{KL}(p_{data}||p_g)\) \[\mathbb{E}_{x \sim p_{data}}\left[\log \frac{p_{data}(x)}{p_g(x)}\right]\]
Reverse KL: \(D_{KL}(p_g||p_{data})\)
JS Divergence (GAN objective): \[JS(p||q) = \frac{1}{2}KL(p||m) + \frac{1}{2}KL(q||m)\]

Generator gradient: \[\nabla_{\theta_G} V = \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}\left[\nabla_{\theta_G} \log(1-D(G(\mathbf{z};\theta_G)))\right]\]
Expanding: \[\nabla_{\theta_G} V = -\mathbb{E}_{\mathbf{z}}\left[\frac{1}{1-D(G(\mathbf{z}))} \cdot \nabla_{\theta_G} D(G(\mathbf{z}))\right]\]
Problem when D is too good:

Vanishing gradients when \(D\) too good:
Modified generator objective:
Instead of: \(\min_G \mathbb{E}_{\mathbf{z}}[\log(1-D(G(\mathbf{z})))]\)
Use: \(\max_G \mathbb{E}_{\mathbf{z}}[\log D(G(\mathbf{z}))]\)
Same optimum, different dynamics: \[\nabla_{\theta_G} = \mathbb{E}_{\mathbf{z}}\left[\frac{1}{D(G(\mathbf{z}))} \cdot \nabla_{\theta_G} D(G(\mathbf{z}))\right]\]
When \(D(G(\mathbf{z})) \approx 0\):
Changes gradient dynamics completely
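The difference is easy to see numerically: differentiate both generator losses with respect to \(D(G(\mathbf{z}))\) at a few illustrative values (a sketch, not from the original results):
import torch

d_fake = torch.tensor([0.01, 0.5, 0.9], requires_grad=True)  # D(G(z))

# Saturating: generator minimizes log(1 - D(G(z)))
g_sat, = torch.autograd.grad(torch.log(1 - d_fake).sum(), d_fake)
# Non-saturating: generator minimizes -log D(G(z))
g_ns, = torch.autograd.grad((-torch.log(d_fake)).sum(), d_fake)

print(g_sat)  # ≈ [-1.01,  -2.00, -10.0]: weakest where D rejects fakes
print(g_ns)   # ≈ [-100.0, -2.00, -1.11]: strongest exactly there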

Reverse KL behavior in GANs:
Generator minimizes (approximately): \[\text{KL}(p_g || p_{\text{data}})\]
Properties of reverse KL:
Sequential mode hopping:

Why mode collapse happens:
Per iteration costs:
For batch size B, image size H×W×C:
Typical architecture sizes:
Training time comparison:
Cost vs other methods:
| Method | Forward/iter | Memory | Training stability |
|---|---|---|---|
| VAE | 1× | 2× params | Stable |
| GAN | 3× | 4× params | Unstable |
| EBM (CD-1) | 4× | 5× params | Biased |
| EBM (Full) | 1000× | 10× params | Accurate |
D/G update ratio affects cost:
Computational bottlenecks:
Trade-off: GANs are 3× slower than VAEs per iteration but produce sharper samples
Nash equilibrium exists:
But not unique and not stable:
Actual training dynamics:
Theoretical result (Goodfellow et al.): if G and D have enough capacity, D reaches its optimum at each step, and updates to \(p_g\) are small, then \(p_g\) converges to \(p_{\text{data}}\).

What happens in practice:
Label Smoothing (Salimans et al. 2016):
# Instead of hard labels 0 and 1
real_labels = torch.ones(batch_size)
fake_labels = torch.zeros(batch_size)
# Use soft labels
real_labels = 0.7 + 0.3 * torch.rand(batch_size)  # [0.7, 1.0]
fake_labels = 0.0 + 0.3 * torch.rand(batch_size)  # [0.0, 0.3]

Impact: 15% reduction in mode collapse frequency
Optimizer Configuration:
# Discriminator: SGD with momentum
opt_D = torch.optim.SGD(D.parameters(),
                        lr=0.0002, momentum=0.9)
# Generator: Adam for stability
opt_G = torch.optim.Adam(G.parameters(),
                         lr=0.0001, betas=(0.5, 0.999))

Update Ratio:
Batch Size Impact:
| Batch Size | Training Time | FID Score | Stability |
|---|---|---|---|
| 8 | 48h | 45.2 | Poor |
| 32 | 24h | 28.4 | Moderate |
| 64 | 18h | 22.1 | Good |
| 128 | 12h | 20.3 | Best |
| 256 | 10h | 21.8 | Good* |
*Diminishing returns, memory limited
Gradient Penalties:
# Gradient clipping (basic)
torch.nn.utils.clip_grad_norm_(
    G.parameters(), max_norm=10.0)
# Spectral normalization (better)
D = SpectralNorm(D)  # σ(W) = 1
# R1 regularization (StyleGAN)
grad = autograd.grad(d_real.sum(), real_img,
                     create_graph=True)[0]
r1_loss = (grad ** 2).sum() / 2

Memory overhead: +25% for gradient penalties
Weight Initialization:
def init_weights(m):
    if isinstance(m, nn.Conv2d):
        # He initialization for ReLU
        nn.init.kaiming_normal_(m.weight, mode='fan_out')
        if m.bias is not None:
            nn.init.constant_(m.bias, 0)
    elif isinstance(m, nn.ConvTranspose2d):
        # Xavier for generator
        nn.init.xavier_normal_(m.weight)
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.constant_(m.weight, 1)
        nn.init.constant_(m.bias, 0)

model.apply(init_weights)

Normalization Strategies:
| Network | Normalization | Location | Why |
|---|---|---|---|
| Generator | BatchNorm | All except output | Stabilizes gradients |
| Discriminator | None/LayerNorm | Optional | BN causes correlation |
| Both | SpectralNorm | All layers | Enforces Lipschitz |
Impact on training:

Common Failure Modes and Fixes:
1. Mode Collapse
# Symptoms: D loss → 0, G produces identical outputs
# Check: sample diversity
diversity = torch.std(generated_batch, dim=0).mean()
if diversity < threshold:
    # Fixes:
    # - Reduce learning rate
    # - Add noise to inputs
    # - Increase batch size
    # - Use unrolled GANs

2. Vanishing Gradients
# Monitor gradient norms
for name, param in model.named_parameters():
    if param.grad is not None:
        grad_norm = param.grad.norm().item()
        if grad_norm < 1e-5:
            print(f"Vanishing gradient in {name}")
# Fix: Use WGAN-GP or non-saturating loss

3. Oscillation
Diagnostic Metrics:
@torch.no_grad()
def diagnose_gan(G, D, dataloader):
    metrics = {}
    # 1. Gradient health (norms from the most recent backward pass)
    metrics['d_grad_norm'] = compute_grad_norm(D)
    metrics['g_grad_norm'] = compute_grad_norm(G)
    # 2. Mode coverage: inter-batch diversity of generated samples
    fake_samples = []
    for _ in range(10):
        z = torch.randn(100, latent_dim)
        fake_samples.append(G(z))
    metrics['diversity'] = compute_diversity(fake_samples)
    # 3. Discriminator confidence
    real_scores, fake_scores = [], []
    for real_batch in dataloader:
        z = torch.randn(real_batch.size(0), latent_dim)
        real_scores.append(D(real_batch).mean())
        fake_scores.append(D(G(z)).mean())
    real_scores = torch.stack(real_scores)
    fake_scores = torch.stack(fake_scores)
    metrics['d_real_acc'] = (real_scores > 0.5).float().mean()
    metrics['d_fake_acc'] = (fake_scores < 0.5).float().mean()
    return metrics

Warning signs:
Complete Training Recipe:
class GANTrainer:
    def __init__(self, G, D, config):
        self.G = G
        self.D = D
        self.d_steps = config.d_steps
        self.latent_dim = config.latent_dim
        # Optimizers with different LR
        self.opt_G = Adam(G.parameters(),
                          lr=config.lr_g, betas=(0.5, 0.999))
        self.opt_D = Adam(D.parameters(),
                          lr=config.lr_d, betas=(0.5, 0.999))
        # Learning rate scheduling
        self.scheduler_G = ExponentialLR(self.opt_G, gamma=0.99)
        self.scheduler_D = ExponentialLR(self.opt_D, gamma=0.99)
        # Gradient penalty weight
        self.lambda_gp = config.lambda_gp
        # Label smoothing
        self.real_label = 0.9
        self.fake_label = 0.1

    def train_step(self, real_batch):
        batch_size = real_batch.size(0)
        # Train Discriminator
        for _ in range(self.d_steps):
            self.opt_D.zero_grad()
            # Real samples
            real_validity = self.D(real_batch)
            real_labels = torch.full_like(real_validity,
                                          self.real_label)
            real_labels += torch.rand_like(real_labels) * 0.1
            # Fake samples
            z = torch.randn(batch_size, self.latent_dim)
            fake = self.G(z).detach()
            fake_validity = self.D(fake)
            fake_labels = torch.full_like(fake_validity,
                                          self.fake_label)
            fake_labels += torch.rand_like(fake_labels) * 0.1
            # Gradient penalty
            gp = self.gradient_penalty(real_batch, fake)
            # Total loss
            d_loss = (
                F.binary_cross_entropy(real_validity, real_labels) +
                F.binary_cross_entropy(fake_validity, fake_labels) +
                self.lambda_gp * gp
            )
            d_loss.backward()
            torch.nn.utils.clip_grad_norm_(self.D.parameters(), 5.0)
            self.opt_D.step()
        # Train Generator
        self.opt_G.zero_grad()
        z = torch.randn(batch_size, self.latent_dim)
        fake = self.G(z)
        fake_validity = self.D(fake)
        # Generator wants D to output 1 for fake
        g_loss = F.binary_cross_entropy(fake_validity,
                                        torch.ones_like(fake_validity))
        g_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.G.parameters(), 5.0)
        self.opt_G.step()
        return {'d_loss': d_loss.item(), 'g_loss': g_loss.item()}

Training Configuration:
# config.yaml
training:
  epochs: 200
  batch_size: 64
  # Learning rates
  lr_g: 0.0001
  lr_d: 0.0002
  # Update frequency
  d_steps: 1       # 5 for WGAN
  g_steps: 1
  # Regularization
  lambda_gp: 10    # WGAN-GP
  label_smoothing: 0.1
  # Stability
  gradient_clip: 5.0
  spectral_norm: true
  # Data augmentation
  augmentation:
    horizontal_flip: 0.5
    color_jitter: 0.1
  # Checkpointing
  save_every: 1000
  validate_every: 500
  # Early stopping
  patience: 10000
  min_fid: 20.0

Hardware Requirements:
| Model | GPU Memory | Training Time | Batch Size |
|---|---|---|---|
| DCGAN 64×64 | 4GB | 12h | 128 |
| StyleGAN 256×256 | 16GB | 3d | 32 |
| BigGAN 128×128 | 32GB | 7d | 256 |
| StyleGAN2 1024×1024 | 48GB | 14d | 8 |
Guidance:
Wasserstein-1 distance (Earth Mover’s):
\[W(p,q) = \inf_{\gamma \in \Pi(p,q)} \mathbb{E}_{(x,y) \sim \gamma}[||x - y||]\]
where \(\Pi(p,q)\) = set of all joint distributions with marginals p and q
Intuition:
Example:

Advantages of Wasserstein distance:
Dual formulation:
\[W(p,q) = \sup_{||f||_L \leq 1} \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{y \sim q}[f(y)]\]
where \(||f||_L \leq 1\) means f is 1-Lipschitz: \[|f(x_1) - f(x_2)| \leq |x_1 - x_2|\]
Why this helps:
Connection to discriminator:

WGAN formulation:
\[\max_{D} \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[D(\mathbf{x})] - \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}[D(G(\mathbf{z}))]\]
subject to: D is 1-Lipschitz
No more sigmoids:
Training algorithm:
for iteration in range(num_iterations):
    # Train critic multiple times
    for _ in range(n_critic):  # typically 5
        x_real = sample_batch(data)
        z = sample_noise(batch_size)
        x_fake = G(z)
        d_loss = -D(x_real).mean() + D(x_fake).mean()
        update_D(d_loss)
        enforce_lipschitz(D)  # Weight clipping or gradient penalty
    # Train generator
    z = sample_noise(batch_size)
    g_loss = -D(G(z)).mean()
    update_G(g_loss)
Why Lipschitz constraint matters:
Original WGAN: Enforce Lipschitz via weight clipping
Problems with weight clipping:
Capacity underuse:
Gradient issues:
Optimization difficulty:

Weight distribution after clipping:
Better Lipschitz enforcement:
Instead of clipping, add a penalty to the critic loss (written here as a loss to minimize): \[L = \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}[D(G(\mathbf{z}))] - \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[D(\mathbf{x})]\] \[+ \lambda \mathbb{E}_{\hat{\mathbf{x}} \sim p_{\hat{\mathbf{x}}}}[(||\nabla_{\hat{\mathbf{x}}} D(\hat{\mathbf{x}})||_2 - 1)^2]\]
Sample \(\hat{\mathbf{x}}\) along interpolations: \[\hat{\mathbf{x}} = \epsilon \mathbf{x}_{\text{real}} + (1-\epsilon) \mathbf{x}_{\text{fake}}\] where \(\epsilon \sim U[0,1]\)
Why ||∇D|| = 1:
Implementation:
# Interpolate between real and fake
eps = torch.rand(batch_size, 1, 1, 1)
x_hat = eps * x_real + (1 - eps) * x_fake
x_hat.requires_grad_(True)
# Compute gradient penalty
d_hat = discriminator(x_hat)
grad = autograd.grad(d_hat.sum(), x_hat,
                     create_graph=True)[0]
grad_norm = grad.view(batch_size, -1).norm(2, dim=1)
gp = lambda_gp * ((grad_norm - 1) ** 2).mean()

Computational cost:

WGAN-GP advantages:
Upsampling in Generator Networks:
Standard convolution (stride=1, 3×3 kernel):
Transposed convolution (stride=2, 3×3 kernel):
Matrix perspective:
Computational cost:

Implementation (PyTorch):
# Upsampling from 4×4 to 8×8
nn.ConvTranspose2d(
    in_channels=512,
    out_channels=256,
    kernel_size=4,
    stride=2,
    padding=1
)
# Output: [B, 256, 8, 8]
# Parameters: 4×4×512×256 = 2,097,152

Checkerboard artifacts:
Deep Convolutional GAN (2015)
All-convolutional nets
Batch normalization
Activation functions
No fully connected hidden layers
Computational requirements:

Impact: First architecture to reliably generate sharp images at 64×64
Control generation with conditions
Standard GAN: G(z) → x
Conditional GAN: G(z, y) → x
where y can be:
Modified objective: \[\min_G \max_D V(D,G) = \mathbb{E}_{x,y}[\log D(x|y)]\] \[+ \mathbb{E}_{z,y}[\log(1-D(G(z|y)|y))]\]
Both G and D see the condition y

How to inject conditions:
1. Concatenation (simplest)
# Generator
x = concat([z, y], dim=1)
x = linear(x, 4*4*1024)
# Discriminator
x = concat([image, y_spatial], dim=1)
x = conv2d(x, 64)

2. Projection Discriminator (better)
# Inner product of embedding and features
h = conv_layers(x)   # → features
y_emb = embed(y)     # → embedding
score = h @ y_emb.T + bias

3. Adaptive Instance Norm (StyleGAN)

Trade-offs:
Growing resolution during training
Start: 4×4 → 8×8 → … → 1024×1024
Progressive training schedule:
Smooth transition:
# Alpha increases from 0 to 1
low_res = upsample(prev_layer)
high_res = new_conv_layer(prev_layer)
output = (1 - alpha) * low_res + alpha * high_res

Benefits:

Style-based generator
1. Mapping network:
2. Style injection via AdaIN:
3. Stochastic variation:
Perceptual path length:

Problem: No paired training data
Want: horses ↔ zebras, summer ↔ winter
CycleGAN solution:
Objectives:
Adversarial loss: \[\mathcal{L}_{\text{GAN}}(G, D_Y) = \mathbb{E}_y[\log D_Y(y)] + \mathbb{E}_x[\log(1-D_Y(G(x)))]\]
Cycle consistency loss: \[\mathcal{L}_{\text{cyc}} = \mathbb{E}_x[||F(G(x)) - x||_1] + \mathbb{E}_y[||G(F(y)) - y||_1]\]
Total: \(\mathcal{L} = \mathcal{L}_{\text{GAN}} + \lambda \mathcal{L}_{\text{cyc}}\)
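A minimal sketch of the cycle term, assuming hypothetical generators G: X→Y and F_yx: Y→X (named F_yx to avoid clashing with torch.nn.functional):
import torch.nn.functional as F

def cycle_loss(G, F_yx, x, y, lam=10.0):
    """Cycle-consistency: F(G(x)) should return to x, and vice versa."""
    forward_cycle = F.l1_loss(F_yx(G(x)), x)   # X -> Y -> X
    backward_cycle = F.l1_loss(G(F_yx(y)), y)  # Y -> X -> Y
    return lam * (forward_cycle + backward_cycle)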

What works well:
Common failure modes:
Semantic changes:
Mode collapse in cycles:
Color/texture bias:
Computational cost:

BigGAN (2018)
StyleGAN2/3 (2019/2021)
Diffusion-GAN Hybrids

Computational requirements:
Architecture (Isola et al. 2017):
Generator: U-Net with skip connections
Discriminator: PatchGAN (70×70 receptive field)
Objective: \[\mathcal{L} = \mathcal{L}_{cGAN}(G,D) + \lambda \mathcal{L}_{L1}(G)\]
where: \[\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}[||y - G(x,z)||_1]\]
λ = 100 typical (balances adversarial vs reconstruction)
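The generator loss is then a weighted sum of the two terms; a sketch assuming a conditional discriminator D(x, y) with sigmoid output and a hypothetical generator G (dropout inside G plays the role of z):
import torch
import torch.nn.functional as F

def pix2pix_g_loss(G, D, x, y, lam=100.0):
    """Generator loss: conditional GAN term + lambda-weighted L1 term."""
    fake = G(x)
    pred = D(x, fake)                   # PatchGAN scores in (0, 1)
    adv = F.binary_cross_entropy(pred, torch.ones_like(pred))
    l1 = F.l1_loss(fake, y)             # pulls outputs toward the target
    return adv + lam * l1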

Performance (256×256 images):
| Task | PSNR | SSIM | FID | Time/img |
|---|---|---|---|---|
| Maps→Aerial | 21.2 | 0.42 | 45.3 | 22ms |
| Edges→Photo | 18.8 | 0.38 | 62.1 | 22ms |
| Day→Night | 19.5 | 0.51 | 38.9 | 22ms |
Memory requirements:
Inception Score (IS):
\[\text{IS}(G) = \exp\left(\mathbb{E}_{\mathbf{x} \sim p_g}\left[\text{KL}(p(y|\mathbf{x}) || p(y))\right]\right)\]
where:
What IS measures:
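For reference, a minimal sketch computing IS from precomputed class posteriors (probs would come from Inception-v3 softmax outputs on N generated images; in practice the score is averaged over splits):
import torch

def inception_score(probs, eps=1e-8):
    """probs: (N, K) class posteriors p(y|x), rows summing to 1."""
    p_y = probs.mean(dim=0, keepdim=True)                  # marginal p(y)
    kl = (probs * (torch.log(probs + eps)
                   - torch.log(p_y + eps))).sum(dim=1)     # KL(p(y|x) || p(y))
    return kl.mean().exp()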
Inception Score problems:
Mode collapse increases IS
ImageNet-specific

Inception Score doesn’t measure quality:
FID measures distribution similarity:
\[\text{FID} = ||\boldsymbol{\mu}_r - \boldsymbol{\mu}_g||^2 + \text{Tr}(\boldsymbol{\Sigma}_r + \boldsymbol{\Sigma}_g - 2\sqrt{\boldsymbol{\Sigma}_r \boldsymbol{\Sigma}_g})\]
where:
Assumptions:
Requirements:
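A minimal sketch of the formula, assuming the 2048-dimensional Inception features have already been extracted for the real and generated sets:
import numpy as np
from scipy import linalg

def fid(feat_r, feat_g):
    """feat_*: (N, 2048) Inception activations for real / generated sets."""
    mu_r, mu_g = feat_r.mean(axis=0), feat_g.mean(axis=0)
    cov_r = np.cov(feat_r, rowvar=False)
    cov_g = np.cov(feat_g, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g).real   # matrix sqrt; drop tiny imaginary parts
    return float(((mu_r - mu_g) ** 2).sum()
                 + np.trace(cov_r + cov_g - 2 * covmean))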

FID correlates with human judgment:
FID variance depends on sample size:
With N samples:
Recommended sample sizes:
Computational cost:
Common mistakes:

Spectral Normalization
Self-Attention Layers
Moving Average Generator
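A sketch of the moving-average trick: maintain a shadow generator whose weights are an exponential moving average of the training weights (decay = 0.999 is a typical but illustrative choice):
import torch

@torch.no_grad()
def update_ema(G_ema, G, decay=0.999):
    # theta_ema <- decay * theta_ema + (1 - decay) * theta
    for p_ema, p in zip(G_ema.parameters(), G.parameters()):
        p_ema.lerp_(p, 1.0 - decay)

# Usage: G_ema = copy.deepcopy(G) once at startup; call update_ema(G_ema, G)
# after every generator step; sample from G_ema at evaluation time.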

Other useful tricks:
Text Generation (SeqGAN, LeakGAN):
Audio Synthesis (WaveGAN, MelGAN):
Molecular Generation (MolGAN, ChemGAN):
Tabular Data (CTGAN, TGAN):
# Handling mixed data types
continuous_cols = ['age', 'income']
categorical_cols = ['education', 'occupation']
# Mode-specific normalization:
#   Gaussian mixture for continuous
#   One-hot + embedding for categorical

Performance comparison:
| Domain | GAN Type | Metric | Score |
|---|---|---|---|
| Text | SeqGAN | BLEU-4 | 0.85 |
| Audio | MelGAN | MOS | 4.2/5 |
| Molecules | MolGAN | Validity | 95% |
| Tabular | CTGAN | F1 | 0.89 |
Discrete Sequences (Text/Code):
Problem: Sampling is non-differentiable
Solutions:
REINFORCE: Treat as RL problem
Gumbel-Softmax: Continuous relaxation (see the sketch after this list)
Continuous embeddings: Skip discrete entirely
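A sketch of the Gumbel-Softmax relaxation with the straight-through variant (PyTorch also ships this as F.gumbel_softmax; the version below spells out the logic):
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau=1.0, hard=True):
    """Differentiable (approximate) sampling from a categorical."""
    g = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    y = F.softmax((logits + g) / tau, dim=-1)   # soft sample, differentiable
    if hard:
        # Straight-through: one-hot forward pass, soft gradients backward
        idx = y.argmax(dim=-1, keepdim=True)
        y_hard = torch.zeros_like(y).scatter_(-1, idx, 1.0)
        y = (y_hard - y).detach() + y
    return y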
Time Series (Audio/Finance):
Challenges:
Architecture modifications:
# Dilated convolutions for receptive field
layers = [
    Conv1d(dilation=2**i)
    for i in range(10)
]
# Receptive field: 1024 timesteps

Memory requirements:
Domain-Specific Evaluation:
Images:
Text:
Audio:
Molecules:

Joint distribution: \[p(\mathbf{x}, \mathbf{z}) = p(\mathbf{x}|\mathbf{z})p(\mathbf{z})\]
Components:
What we want: \[p(\mathbf{x}) = \int p(\mathbf{x}|\mathbf{z})p(\mathbf{z})d\mathbf{z}\]
Problem: Integral intractable for neural network decoder
Contrast with EBMs:

Need p(x) for: Maximum likelihood training, sampling, model comparison
Posterior intractability:
\[p(\mathbf{z}|\mathbf{x}) = \frac{p(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{p(\mathbf{x})} = \frac{p(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{\int p(\mathbf{x}|\mathbf{z}')p(\mathbf{z}')d\mathbf{z}'}\]
Intractability both ways:

Standard solution: Variational Inference
Goal: Find q(z|x) ≈ p(z|x)
Direct KL minimization: \[\text{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) || p(\mathbf{z}|\mathbf{x})) = \mathbb{E}_{q_{\phi}}\left[\log \frac{q_{\phi}(\mathbf{z}|\mathbf{x})}{p(\mathbf{z}|\mathbf{x})}\right]\]
Problem: Requires p(z|x) which needs p(x)!
Solution: Rewrite using Bayes rule: \[\log p(\mathbf{x}) = \text{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) || p(\mathbf{z}|\mathbf{x})) + \mathcal{L}(\theta, \phi; \mathbf{x})\]
where Evidence Lower BOund (ELBO): \[\mathcal{L}(\theta, \phi; \mathbf{x}) = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\mathbf{z})] - \text{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) || p(\mathbf{z}))\]

Why ELBO works:
KL ≥ 0 → ELBO ≤ log p(x)
Maximizing ELBO:
Tractable: only needs p(z) and p(x|z)
Encoder Network \(q_{\phi}(\mathbf{z}|\mathbf{x})\):
Reparameterization sampling: \[\mathbf{z} = \boldsymbol{\mu}_{\phi}(\mathbf{x}) + \boldsymbol{\sigma}_{\phi}(\mathbf{x}) \odot \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(0, \mathbf{I})\]
Decoder Network \(p_{\theta}(\mathbf{x}|\mathbf{z})\):
ELBO Loss Computation: \[\mathcal{L} = \underbrace{\log p_{\theta}(\mathbf{x}|\mathbf{z})}_{\text{Decoder output}} - \underbrace{\text{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) || p(\mathbf{z}))}_{\text{Encoder regularity}}\]

Connections:
Start with log marginal: \[\log p(\mathbf{x}) = \log \int p(\mathbf{x}, \mathbf{z}) d\mathbf{z}\]
Introduce q(z|x) via importance sampling: \[\log p(\mathbf{x}) = \log \int \frac{p(\mathbf{x}, \mathbf{z})}{q_{\phi}(\mathbf{z}|\mathbf{x})} q_{\phi}(\mathbf{z}|\mathbf{x}) d\mathbf{z}\]
Rewrite as expectation: \[\log p(\mathbf{x}) = \log \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\frac{p(\mathbf{x}, \mathbf{z})}{q_{\phi}(\mathbf{z}|\mathbf{x})}\right]\]
Apply Jensen’s inequality (log is concave): \[\log \mathbb{E}[f(\mathbf{z})] \geq \mathbb{E}[\log f(\mathbf{z})]\]
Therefore: \[\log p(\mathbf{x}) \geq \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log \frac{p(\mathbf{x}, \mathbf{z})}{q_{\phi}(\mathbf{z}|\mathbf{x})}\right]\]
Expand joint p(x,z) = p(x|z)p(z): \[\mathcal{L} = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\mathbf{z})] - \text{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) || p(\mathbf{z}))\]

Result: ELBO = Reconstruction - Regularization
How this solves the EBM problem:
Two forms of ELBO:
Form 1: Reconstruction - KL \[\mathcal{L} = \underbrace{\mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x}|\mathbf{z})]}_{\text{Reconstruction}} - \underbrace{\text{KL}(q(\mathbf{z}|\mathbf{x}) || p(\mathbf{z}))}_{\text{Regularization}}\]
Form 2: Negative free energy \[\mathcal{L} = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x}, \mathbf{z})] + H[q(\mathbf{z}|\mathbf{x})]\]
where \(H\) is entropy of \(q\)
Information theory view:

Practical implications:
Encoder outputs Gaussian q(z|x): \[q_{\phi}(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}(\mathbf{x}), \text{diag}(\boldsymbol{\sigma}^2(\mathbf{x})))\]
Prior is standard Gaussian: \[p(\mathbf{z}) = \mathcal{N}(0, \mathbf{I})\]
Closed form KL: \[\text{KL}(q || p) = \frac{1}{2}\sum_{j=1}^J \left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)\]
Derivation: For Gaussians \(\mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)\) and \(\mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)\): \[\text{KL} = \frac{1}{2}\left[\text{tr}(\boldsymbol{\Sigma}_2^{-1}\boldsymbol{\Sigma}_1) + (\boldsymbol{\mu}_2-\boldsymbol{\mu}_1)^T\boldsymbol{\Sigma}_2^{-1}(\boldsymbol{\mu}_2-\boldsymbol{\mu}_1) - k + \log\frac{|\boldsymbol{\Sigma}_2|}{|\boldsymbol{\Sigma}_1|}\right]\]
For our case: \(\boldsymbol{\mu}_2 = 0\), \(\boldsymbol{\Sigma}_2 = \mathbf{I}\), diagonal \(\boldsymbol{\Sigma}_1\)

What each term penalizes:
Need gradient of expectation: \[\nabla_{\phi} \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}[f(\mathbf{z})] = \nabla_{\phi} \int q_{\phi}(\mathbf{z}|\mathbf{x}) f(\mathbf{z}) d\mathbf{z}\]
Gradient w.r.t. encoder parameters φ that define q(z|x):
The gradient: \(\nabla_{\phi} \mathbb{E}_{z \sim q_{\phi}(z|x)}[\log p_{\theta}(x|z)]\)
Problem: Expectation over distribution depends on parameters φ. Creates high-variance gradient estimates.
Why naive sampling fails:
Two approaches:
1. Score function estimator (REINFORCE): \[\nabla_{\phi} \mathbb{E}_{q_{\phi}}[f] = \mathbb{E}_{q_{\phi}}[f(\mathbf{z}) \nabla_{\phi} \log q_{\phi}(\mathbf{z}|\mathbf{x})]\]
Problem: Variance scales as \(O(e^D)\) with dimension D
2. Reparameterization trick: \[\mathbf{z} = \boldsymbol{\mu}_{\phi}(\mathbf{x}) + \boldsymbol{\sigma}_{\phi}(\mathbf{x}) \odot \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(0, \mathbf{I})\]
Now: \(\nabla_{\phi} \mathbb{E}_{\boldsymbol{\epsilon}}[f(\mathbf{z})] = \mathbb{E}_{\boldsymbol{\epsilon}}[\nabla_{\phi} f(\mathbf{z})]\)

Why reparameterization works:
Gradient variance comparison:
Score function gradient: \[\text{Var}[\nabla_{\phi}^{\text{SF}}] \approx \text{Var}[f(\mathbf{z})] \cdot \text{Var}[\nabla_{\phi} \log q_{\phi}]\]
Reparameterized gradient: \[\text{Var}[\nabla_{\phi}^{\text{Reparam}}] \approx \text{Var}_{\boldsymbol{\epsilon}}[\nabla_{\mathbf{z}} f(\mathbf{z})] \cdot ||\nabla_{\phi} \mathbf{z}||^2\]
Empirical comparison (MNIST VAE):

Practical impact:
Forward pass:
def encode(x):
    h = encoder_network(x)
    mu = fc_mu(h)
    log_var = fc_logvar(h)  # Log variance for stability
    return mu, log_var

def reparameterize(mu, log_var):
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)
    z = mu + eps * std
    return z

def forward(x):
    mu, log_var = encode(x)
    z = reparameterize(mu, log_var)
    x_recon = decode(z)
    return x_recon, mu, log_var

Numerical stability tricks:
log_var = torch.clamp(log_var, -10, 10)
Gradient computation:
Stable ELBO computation:
1. KL divergence for Gaussians:
def kl_divergence(mu, log_var):
    # Analytical KL for N(mu, sigma) || N(0, I)
    # Avoid computing sigma directly
    kl = -0.5 * torch.sum(
        1 + log_var - mu.pow(2) - log_var.exp(),
        dim=1
    )
    return kl

2. Reconstruction loss (Bernoulli):
def stable_bce_loss(x_logits, x_true):
    # Use logits directly, avoid sigmoid:
    # max(x, 0) - x*z + log(1 + exp(-|x|)) is stable for both signs
    loss = (torch.clamp(x_logits, min=0)
            - x_logits * x_true
            + torch.log1p(torch.exp(-x_logits.abs())))
    return loss.sum(dim=1)

3. Log-sum-exp for marginal likelihood:

Common pitfalls and fixes:
Encoder network q(z|x):
class Encoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=400,
                 latent_dim=20):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc2_logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x):
        h = F.relu(self.fc1(x))
        mu = self.fc2_mu(h)
        log_var = self.fc2_logvar(h)
        # Clamp for numerical stability
        log_var = torch.clamp(log_var, min=-10, max=10)
        return mu, log_var

Decoder network p(x|z):
class Decoder(nn.Module):
    def __init__(self, latent_dim=20, hidden_dim=400,
                 output_dim=784):
        super().__init__()
        self.fc1 = nn.Linear(latent_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, z):
        h = F.relu(self.fc1(z))
        x_recon = torch.sigmoid(self.fc2(h))  # For binary data
        return x_recon

Computational advantage over EBMs:

Design choices:
Binary data (MNIST): \[p(\mathbf{x}|\mathbf{z}) = \prod_{i=1}^D \text{Bernoulli}(x_i | p_i)\]
Decoder outputs logits, loss = binary cross-entropy:
x_logits = decoder(z)  # No sigmoid
recon_loss = F.binary_cross_entropy_with_logits(
    x_logits, x, reduction='sum')

Continuous data (natural images): \[p(\mathbf{x}|\mathbf{z}) = \mathcal{N}(\boldsymbol{\mu}_{\theta}(\mathbf{z}), \sigma^2 \mathbf{I})\]
Decoder outputs mean, fixed variance:
x_mean = decoder(z)
# Gaussian with unit variance
recon_loss = F.mse_loss(x_mean, x, reduction='sum')
# Or with learned variance
log_var = decoder_logvar(z)
recon_loss = F.gaussian_nll_loss(x_mean, x, log_var.exp(),
                                 reduction='sum')

Impact on reconstruction:

Recommendation:
Posterior collapse phenomenon:
VAE ignores latent code: q(z|x) ≈ p(z) = N(0,I)
Why it happens:
Solutions:
1. KL annealing: Start with β=0, increase to 1 (sketched after this list)
2. Free bits: Minimum KL per dimension
3. Decoder weakening: Dropout, smaller network
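Sketches of the first two fixes; warmup_steps and min_nats are illustrative values, and kl_per_dim is the per-dimension KL of shape (batch, latent_dim):
import torch

def kl_weight(step, warmup_steps=10000):
    # KL annealing: beta ramps linearly from 0 to 1
    return min(1.0, step / warmup_steps)

def free_bits_kl(kl_per_dim, min_nats=0.5):
    # Each latent dimension pays at least min_nats, so collapsing a
    # dimension below the floor no longer reduces the loss
    return torch.clamp(kl_per_dim, min=min_nats).sum(dim=1).mean()

# loss = recon_loss + kl_weight(step) * free_bits_kl(kl_per_dim)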

Monitoring collapse:
Modified objective with β > 1: \[\mathcal{L}_{\beta} = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x}|\mathbf{z})] - \beta \cdot \text{KL}(q(\mathbf{z}|\mathbf{x}) || p(\mathbf{z}))\]
What β controls:
Disentanglement metrics:
Trade-off: Reconstruction quality vs disentanglement

Applications:
Vector Quantization VAE:
Replace continuous z with discrete codes from codebook
Architecture:
Codebook learning:
# K vectors of dimension D
codebook = nn.Embedding(num_embeddings=512, embedding_dim=64)

def quantize(z_e):
    # Find nearest codebook vector
    distances = torch.cdist(z_e, codebook.weight)
    indices = distances.argmin(dim=-1)
    z_q = codebook(indices)
    # Straight-through estimator: copy decoder gradients to the encoder
    z_q = z_e + (z_q - z_e).detach()
    return z_q, indices

Loss function: \[\mathcal{L} = \log p(\mathbf{x}|\mathbf{z}_q) + ||\text{sg}[\mathbf{z}_e] - \mathbf{e}||^2 + \beta ||\mathbf{z}_e - \text{sg}[\mathbf{e}]||^2\]
where sg = stop gradient, \(\mathbf{e}\) = codebook vectors
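A sketch of this loss written as a quantity to minimize, with MSE standing in for \(-\log p(\mathbf{x}|\mathbf{z}_q)\) under a Gaussian decoder; here z_e is the encoder output and z_q the raw (pre-straight-through) quantized vectors:
import torch.nn.functional as F

def vqvae_loss(x, x_recon, z_e, z_q, beta=0.25):
    recon = F.mse_loss(x_recon, x)               # stands in for -log p(x|z_q)
    codebook = F.mse_loss(z_q, z_e.detach())     # ||sg[z_e] - e||^2: moves the codes
    commitment = F.mse_loss(z_e, z_q.detach())   # ||z_e - sg[e]||^2: commits the encoder
    return recon + codebook + beta * commitment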

Advantages over continuous VAE:
Ladder VAE structure:
Multiple stochastic layers: \[p(\mathbf{x}, \mathbf{z}_1, ..., \mathbf{z}_L) = p(\mathbf{x}|\mathbf{z}_1)p(\mathbf{z}_1|\mathbf{z}_2)...p(\mathbf{z}_L)\]
Benefits:
Implementation approach:
class HierarchicalVAE(nn.Module):
    def encode(self, x):
        # Bottom-up pass
        h1 = self.enc1(x)
        z1_mu, z1_logvar = self.z1_params(h1)
        h2 = self.enc2(h1)
        z2_mu, z2_logvar = self.z2_params(h2)
        return [(z1_mu, z1_logvar),
                (z2_mu, z2_logvar)]

    def decode(self, z_list):
        # Top-down generation
        h = self.dec2(z_list[1])
        h = self.dec1(torch.cat([h, z_list[0]], dim=1))
        return self.output(h)
Performance gains:
Quantitative comparison:
| Metric | VAE | GAN |
|---|---|---|
| Training stability | Stable | Unstable |
| Mode coverage | Good | Poor (collapse) |
| Sample quality | Blurry | Sharp |
| Likelihood | Tractable bound | None |
| Latent inference | q(z\|x) | Requires BiGAN |
| Training time (CIFAR-10, V100) | 12 hours | 24-48 hours |
| Hyperparameter sensitivity | Low | High |
Computational requirements (same architecture):
Use VAE when: need likelihood, stable training, latent inference
Use GAN when: need sharp samples, mode coverage less critical

Empirical results (CIFAR-10, single V100):
EBMs (2000s): Explicit density, intractable
VAEs (2014): Tractable lower bound
GANs (2014): Implicit generation
Modern (2020s): Best of both worlds
2024 methods:

Open problems: