Homework #2: Generative Adversarial Networks and Variational Autoencoders

EE 641: Fall 2025

Assignment Details

Assigned: 24 September
Due: Tuesday, 07 October at 23:59

Submission: Gradescope via GitHub repository

Requirements
  • PyTorch >= 2.0 must be installed
  • Allowed libraries: PyTorch, NumPy, Pillow (PIL), matplotlib, librosa (for audio only), and Python standard library
  • No other external libraries permitted (including no torchvision.models, pre-trained models, or GAN-specific libraries)

Overview

In this assignment you will implement and analyze two fundamental generative models: Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). The first problem explores GAN training dynamics and mode collapse through font generation. The second problem implements a hierarchical VAE for drum pattern generation with style control.

Getting Started

Download the starter code: hw2-starter.zip

Extract and generate the datasets:

unzip hw2-starter.zip
cd hw2-starter
python setup_data.py --seed 641

This creates a data/ directory with:

  • fonts/: Synthetic font dataset (28×28 grayscale letter images)
  • drums/: Drum pattern dataset (16×9 binary matrices)

Use seed 641 to ensure consistent results across submissions.

Problem 1: Font Generation GAN - Understanding Mode Collapse

Build a GAN that generates letter images and observe mode collapse firsthand. You will implement diagnostic tools and test different stabilization techniques.

Part A: Dataset and Data Loading

You will work with a font dataset containing grayscale images of letters A-Z in 10 different fonts.

The dataset is provided in the following structure:

  • Images: 28×28 grayscale, normalized to [0, 1]
  • Classes: 26 letters × 10 fonts = 260 unique letter-font combinations
  • Training: 200 samples per letter (mixed fonts)
  • Validation: 60 samples per letter

The starter code provides dataset.py with the data loader:

Part B: GAN Architecture

Implement models.py with Generator and Discriminator networks.

The starter code provides architecture skeletons that work well for 28×28 grayscale images:
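When sizing layers for 28×28 images, it helps to check the spatial arithmetic before writing any network code. The sketch below is not the starter skeleton. It assumes one common layout (kernel 4, stride 2, padding 1, dilation 1) and uses the standard output-size formulas for convolution and transposed convolution; the helper names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a standard convolution along one spatial axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride, padding, output_padding=0):
    """Output length of a transposed convolution along one axis (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


# Hypothetical generator path: a 7x7 feature map upsampled twice.
g_sizes = [7]
for _ in range(2):
    g_sizes.append(conv_transpose_out(g_sizes[-1], kernel=4, stride=2, padding=1))

# Hypothetical discriminator path: a 28x28 image downsampled twice.
d_sizes = [28]
for _ in range(2):
    d_sizes.append(conv_out(d_sizes[-1], kernel=4, stride=2, padding=1))

print(g_sizes)  # [7, 14, 28]
print(d_sizes)  # [28, 14, 7]
```

With these settings the generator and discriminator mirror each other, which is one reason the 7 → 14 → 28 progression is a popular choice for 28×28 data.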

Part C: Training Dynamics and Mode Collapse

Implement the training loop in training_dynamics.py to observe mode collapse:
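One simple way to quantify mode collapse is to count how many of the 26 letters still appear among generated samples. The sketch below is an illustration, not the starter code. It assumes you already have a predicted letter label for each generated image from some classifier (not shown), and the function name and the min_count threshold are hypothetical.

```python
from collections import Counter
import string

LETTERS = string.ascii_uppercase  # 26 modes, one per letter


def mode_coverage(predicted_letters, min_count=1):
    """Summarize which letters appear among classified generator samples.

    predicted_letters: iterable of 'A'..'Z' labels, assumed to come from a
        letter classifier applied to generated images (not shown here).
    min_count: hypothetical threshold for counting a letter as covered.
    """
    counts = Counter(predicted_letters)
    covered = [c for c in LETTERS if counts[c] >= min_count]
    missing = [c for c in LETTERS if counts[c] < min_count]
    return {
        "coverage": len(covered) / len(LETTERS),
        "covered": covered,
        "missing": missing,
        "histogram": {c: counts[c] for c in LETTERS},
    }


# Toy example: a collapsed generator that only produces three letters.
labels = ["O"] * 50 + ["A"] * 30 + ["I"] * 20
report = mode_coverage(labels)
print(f"{report['coverage']:.3f}")  # 0.115
print(report["covered"])            # ['A', 'I', 'O']
```

Logging this dictionary every few epochs gives both a coverage curve and the per-letter histogram.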

Part D: Implementing Fixes for Mode Collapse

The starter code provides three techniques to combat mode collapse. Choose ONE to implement in fixes.py:

Part E: Analysis and Experiments

Complete evaluate.py to analyze your trained models:

Deliverables

Your problem1/ directory must contain:

  1. All code files as specified above
  2. results/training_log.json with loss curves and mode coverage metrics
  3. results/best_generator.pth - saved model weights
  4. results/mode_collapse_analysis.png - visualization of mode collapse
  5. results/visualizations/ containing:
    • Generated letter grids at epochs 10, 30, 50, 100
    • Mode coverage histogram (which letters survive)
    • Interpolation sequences
    • Comparison of vanilla vs fixed GAN

Your report must include analysis of:

  • Why certain letters (like O, A) survive mode collapse while others (Q, X, Z) disappear
  • Quantitative comparison of mode coverage with and without your chosen fix
  • Discussion of training dynamics: when does collapse begin?
  • Evaluation of your chosen stabilization technique’s effectiveness

Problem 2: Hierarchical VAE for Music Generation

Build a Variational Autoencoder that learns to generate drum patterns with controllable style. You will implement hierarchical latent variables, handle discrete outputs, and prevent posterior collapse.

Part A: Dataset and Representation

You will work with a dataset of drum patterns represented as binary matrices.

The dataset contains:

  • Format: 16×9 binary matrices (16 timesteps, 9 drum instruments)
  • Instruments: Kick, Snare, Closed Hi-hat, Open Hi-hat, Tom1, Tom2, Crash, Ride, Clap
  • Styles: Rock, Jazz, Hip-hop, Electronic, Latin (200 patterns each)
  • Total: 1000 unique drum patterns from MIDI files
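To make the representation concrete, the sketch below builds one 16×9 binary pattern by hand. It assumes the columns follow the instrument order listed above and that the 16 timesteps are sixteenth notes in a single bar. Neither is confirmed by the dataset description, and the helper names are hypothetical.

```python
INSTRUMENTS = ["Kick", "Snare", "Closed Hi-hat", "Open Hi-hat",
               "Tom1", "Tom2", "Crash", "Ride", "Clap"]
N_STEPS = 16


def empty_pattern():
    """Return a 16x9 grid of zeros indexed as pattern[step][instrument]."""
    return [[0] * len(INSTRUMENTS) for _ in range(N_STEPS)]


def add_hits(pattern, instrument, steps):
    """Set the given instrument to 1 at each listed timestep."""
    col = INSTRUMENTS.index(instrument)
    for step in steps:
        pattern[step][col] = 1
    return pattern


# A simple illustrative beat (assumed column order and step resolution).
beat = empty_pattern()
add_hits(beat, "Kick", [0, 8])
add_hits(beat, "Snare", [4, 12])
add_hits(beat, "Closed Hi-hat", range(0, 16, 2))

# Fraction of active cells: a crude measure of pattern density.
density = sum(map(sum, beat)) / (N_STEPS * len(INSTRUMENTS))
print(len(beat), len(beat[0]))  # 16 9
print(f"{density:.3f}")         # 0.083
```

The density value is the kind of quantity a "complexity" latent dimension might end up controlling.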

The starter code provides dataset.py with the data loader:

Part B: Hierarchical VAE Architecture

A hierarchical VAE uses multiple levels of latent variables to capture different aspects of the data. The provided architecture design separates high-level style from low-level variations:

Implement the VAE architecture in hierarchical_vae.py:
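As a minimal, framework-free illustration of the two-level idea (not the starter skeleton), the sketch below draws z_high from a standard normal prior and then draws z_low from a prior conditioned on z_high, using the reparameterization z = mu + sigma * eps. The latent sizes and the toy_prior_low function are assumptions. In a real model the conditional prior would be a small learned network operating on tensors.

```python
import math
import random


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), elementwise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def sample_hierarchy(z_high_dim, prior_low, rng):
    """Ancestral sampling: z_high ~ N(0, I), then z_low ~ p(z_low | z_high).

    prior_low is a hypothetical callable returning (mu, logvar) for z_low.
    """
    z_high = reparameterize([0.0] * z_high_dim, [0.0] * z_high_dim, rng)
    mu_low, logvar_low = prior_low(z_high)
    z_low = reparameterize(mu_low, logvar_low, rng)
    return z_high, z_low


def toy_prior_low(z_high):
    # Stand-in for a learned network: concatenate z_high with itself to get
    # an 8-d mean, with a fixed small variance (log-variance of -2).
    mu = z_high + z_high
    return mu, [-2.0] * len(mu)


rng = random.Random(641)
z_high, z_low = sample_hierarchy(4, toy_prior_low, rng)
print(len(z_high), len(z_low))  # 4 8
```

Because z_low is centered on a function of z_high, fixing z_high and resampling z_low gives variations that share the same high-level code.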

Part C: Training Techniques for Discrete Data

Discrete outputs and posterior collapse are major challenges for VAEs. The starter code provides proven techniques:

Use the provided utilities in training_utils.py:
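For intuition about what such schedules look like, here is a minimal sketch. The function names match the calls used in train.py, but the bodies, parameters, and default values below are assumptions and may differ from the provided training_utils.py.

```python
import math


def kl_annealing_schedule(epoch, method="cyclical", total_epochs=100,
                          n_cycles=4, ramp_fraction=0.5):
    """Return the KL weight beta in [0, 1] for a given epoch.

    'linear':   ramp from 0 to 1 over the first ramp_fraction of training.
    'cyclical': restart the ramp every total_epochs / n_cycles epochs and
                hold beta at 1 for the rest of each cycle.
    All default values here are illustrative assumptions.
    """
    if method == "linear":
        return min(1.0, epoch / (ramp_fraction * total_epochs))
    if method == "cyclical":
        cycle_len = total_epochs / n_cycles
        position = (epoch % cycle_len) / cycle_len  # in [0, 1)
        return min(1.0, position / ramp_fraction)
    raise ValueError(f"unknown method: {method}")


def temperature_annealing_schedule(epoch, t_start=1.0, t_min=0.5, rate=0.03):
    """Exponentially decay a sampling temperature toward a floor t_min."""
    return max(t_min, t_start * math.exp(-rate * epoch))


for epoch in (0, 5, 10, 20, 25, 30):
    beta = kl_annealing_schedule(epoch)
    temp = temperature_annealing_schedule(epoch)
    print(f"epoch {epoch:3d}: beta={beta:.2f}, temp={temp:.2f}")
```

Starting each cycle with a small beta lets the decoder learn to use the latent code before the KL term pulls the posterior toward the prior, which is the usual argument for annealing as a guard against posterior collapse.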

Part D: Training Implementation

The starter code provides train.py with the training loop structure:

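import torch
import torch.nn as nn
import torch.nn.functional as F  # used by binary_cross_entropy_with_logits below
import torch.optim as optim
from torch.utils.data import DataLoader
import json

# Annealing helpers used in train_epoch; assumed to live in training_utils.py
from training_utils import kl_annealing_schedule, temperature_annealing_schedule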

def compute_hierarchical_elbo(recon_x, x, mu_low, logvar_low, mu_high, logvar_high, beta=1.0):
    """
    Compute the negative ELBO (the loss to minimize) for the hierarchical VAE.
    
    ELBO = E[log p(x|z_low)] - beta * KL(q(z_low|x) || p(z_low|z_high)) 
           - beta * KL(q(z_high|z_low) || p(z_high))
    
    Args:
        recon_x: Reconstructed pattern logits [batch, 16, 9]
        x: Original patterns [batch, 16, 9]
        mu_low, logvar_low: Low-level latent parameters
        mu_high, logvar_high: High-level latent parameters
        beta: KL weight
        
    Returns:
        loss: Total loss
        recon_loss: Reconstruction component
        kl_low: KL for low-level latent
        kl_high: KL for high-level latent
    """
    # Reconstruction loss (binary cross-entropy)
    recon_loss = F.binary_cross_entropy_with_logits(
        recon_x.view(-1), x.view(-1), reduction='sum'
    )
    
    # TODO: Implement KL divergences
    # KL(q(z_high) || p(z_high)) where p(z_high) = N(0, I)
    kl_high = -0.5 * torch.sum(1 + logvar_high - mu_high.pow(2) - logvar_high.exp())
    
    # TODO: KL(q(z_low) || p(z_low|z_high))
    # This is more complex - can simplify to standard KL for now
    kl_low = -0.5 * torch.sum(1 + logvar_low - mu_low.pow(2) - logvar_low.exp())
    
    return recon_loss + beta * (kl_low + kl_high), recon_loss, kl_low, kl_high

def train_epoch(model, data_loader, optimizer, epoch, device):
    """
    Train for one epoch with annealing schedules.
    """
    model.train()
    total_loss = 0
    
    # Get annealing parameters for this epoch
    beta = kl_annealing_schedule(epoch, method='cyclical')
    temperature = temperature_annealing_schedule(epoch)
    
    for batch_idx, (patterns, styles, _) in enumerate(data_loader):
        patterns = patterns.to(device)
        
        # TODO: Forward pass
        # TODO: Compute loss with current beta (assign the total to `loss`)
        # TODO: Backward and optimize
        # TODO: Accumulate total_loss += loss.item()
        
        # Log progress
        if batch_idx % 10 == 0:
            print(f'Epoch {epoch}, Batch {batch_idx}: Loss = {loss.item():.4f}, '
                  f'Beta = {beta:.3f}, Temp = {temperature:.2f}')
    
    return total_loss / len(data_loader)

def main():
    # Configuration
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    batch_size = 32
    num_epochs = 100
    learning_rate = 0.001
    
    # TODO: Initialize dataset, model, optimizer
    # TODO: Training loop with logging
    # TODO: Save checkpoints and final model
    pass

if __name__ == '__main__':
    main()
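
For the low-level TODO in compute_hierarchical_elbo, one option is to let the model predict a diagonal Gaussian conditional prior p(z_low | z_high) = N(mu_p, sigma_p^2). This is an assumption about the intended design, since the prior network is not shown here. The KL divergence between two diagonal Gaussians then has the closed form below, where logvar corresponds to log sigma^2. Setting mu_p = 0 and sigma_p^2 = 1 recovers the standard-normal expression used for kl_high and for the simplified kl_low.

```latex
\begin{aligned}
\mathrm{KL}\big(\mathcal{N}(\mu_q, \operatorname{diag}\sigma_q^2)\,\|\,\mathcal{N}(\mu_p, \operatorname{diag}\sigma_p^2)\big)
&= \frac{1}{2}\sum_{j}\left[\log\frac{\sigma_{p,j}^2}{\sigma_{q,j}^2}
 + \frac{\sigma_{q,j}^2 + (\mu_{q,j}-\mu_{p,j})^2}{\sigma_{p,j}^2} - 1\right] \\
\text{with } \mu_p = 0,\ \sigma_p^2 = 1:\quad
&= -\frac{1}{2}\sum_{j}\left[1 + \log\sigma_{q,j}^2 - \mu_{q,j}^2 - \sigma_{q,j}^2\right]
\end{aligned}
```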

Part E: Analysis and Music Generation

Complete analyze_latent.py to analyze the trained model and generate music:

Part F: Creative Experiments

Create a notebook experiments.ipynb with the following analyses:

  1. Genre Blending: Interpolate between jazz and rock patterns
  2. Complexity Control: Find latent dimensions that control pattern density
  3. Humanization: Add controlled variations to mechanical patterns
  4. Style Consistency: Generate full drum tracks with consistent style
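
Genre blending can start from plain linear interpolation between two latent codes. The sketch below assumes you have already encoded one jazz and one rock pattern into latent vectors. The 4-dimensional values are made up for illustration, and each interpolated vector would still need to be passed through your decoder (not shown).

```python
def lerp(z_a, z_b, t):
    """Linear interpolation between two latent vectors at fraction t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


def interpolation_path(z_a, z_b, steps):
    """Return `steps` evenly spaced latent vectors from z_a to z_b inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [lerp(z_a, z_b, i / (steps - 1)) for i in range(steps)]


# Hypothetical 4-d style codes for a jazz and a rock pattern.
z_jazz = [1.0, 0.0, -2.0, 0.5]
z_rock = [-1.0, 2.0, 2.0, 0.5]
path = interpolation_path(z_jazz, z_rock, steps=5)
print(path[2])  # midpoint: [0.0, 1.0, 0.0, 0.5]
```

Interpolating only the high-level code while holding the low-level code fixed is one way to test whether style and rhythmic detail are actually separated.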

Deliverables

Your problem2/ directory must contain:

  1. All code files as specified above
  2. results/training_log.json with loss curves and KL values
  3. results/best_model.pth - saved model weights
  4. results/generated_patterns/ containing:
    • 10 samples from each style
    • Interpolation sequences
    • Style transfer examples
  5. results/latent_analysis/ containing:
    • t-SNE visualization of latent space
    • Disentanglement analysis
    • Dimension interpretation results
  6. results/audio_samples/ with generated drum loops (optional but encouraged)

Your report must include analysis of:

  • Evidence of posterior collapse and how annealing prevented it
  • Interpretation of what each latent dimension learned to control
  • Quality assessment: Do generated patterns sound musical?
  • Comparison of different annealing strategies
  • Success rate of style transfer while preserving rhythm

Submission Requirements

Your GitHub repository must follow this exact structure:

ee641-hw2-[username]/
├── problem1/
│   ├── models.py
│   ├── dataset.py
│   ├── training_dynamics.py
│   ├── fixes.py
│   ├── train.py
│   ├── evaluate.py
│   └── results/
│       ├── training_log.json
│       ├── best_generator.pth
│       ├── mode_collapse_analysis.png
│       └── visualizations/
├── problem2/
│   ├── dataset.py
│   ├── hierarchical_vae.py
│   ├── training_utils.py
│   ├── train.py
│   ├── analyze_latent.py
│   └── results/
│       ├── training_log.json
│       ├── best_model.pth
│       ├── generated_patterns/
│       └── latent_analysis/
├── report.pdf
└── README.md

The README.md in your repository root must contain:

  • Your full name
  • USC email address
  • Instructions to run each problem if they differ from the standard commands
  • Any implementation notes

Testing Your Submission

Before submitting:

  1. Your repository structure must match the required layout exactly
  2. python train.py must run without errors in each problem directory
  3. All output files must be generated in the correct locations
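
A small script can catch missing files before you push. The sketch below is a hypothetical helper, not part of the starter code or the autograder. Its path list is transcribed from the repository layout above, so it does not cover optional items such as results/audio_samples/.

```python
from pathlib import Path

# Required paths, transcribed from the repository layout in this handout.
REQUIRED = {
    "problem1": [
        "dataset.py", "models.py", "training_dynamics.py", "fixes.py",
        "train.py", "evaluate.py",
        "results/training_log.json", "results/best_generator.pth",
        "results/mode_collapse_analysis.png", "results/visualizations",
    ],
    "problem2": [
        "dataset.py", "hierarchical_vae.py", "training_utils.py",
        "train.py", "analyze_latent.py",
        "results/training_log.json", "results/best_model.pth",
        "results/generated_patterns", "results/latent_analysis",
    ],
    ".": ["report.pdf", "README.md"],
}


def missing_paths(repo_root):
    """Return the required paths that do not exist under repo_root."""
    root = Path(repo_root)
    missing = []
    for folder, names in REQUIRED.items():
        for name in names:
            path = root / folder / name
            if not path.exists():
                missing.append(path.relative_to(root).as_posix())
    return missing


if __name__ == "__main__":
    gaps = missing_paths(".")
    print("All required paths found." if not gaps
          else "Missing:\n  " + "\n  ".join(gaps))
```

Run it from the repository root. An empty result only confirms that the files exist, not that python train.py runs cleanly.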