Problem 2: Hierarchical VAE for Music Generation

Build a Variational Autoencoder that learns to generate drum patterns with controllable style. You will implement hierarchical latent variables, handle discrete outputs, and prevent posterior collapse.

Part A: Dataset and Representation

You will work with a dataset of drum patterns represented as binary matrices.

The dataset contains:

  • Format: 16×9 binary matrices (16 timesteps, 9 drum instruments)
  • Instruments: Kick, Snare, Closed Hi-hat, Open Hi-hat, Tom1, Tom2, Crash, Ride, Clap
  • Styles: Rock, Jazz, Hip-hop, Electronic, Latin (200 patterns each)
  • Total: 1000 unique drum patterns from MIDI files

The starter code provides dataset.py with the data loader:
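A minimal sketch of the loader's interface is shown below; the class name, file path, and record format are assumptions for illustration and may differ from the provided dataset.py, but the returned triple matches what train.py (Part D) unpacks:

import json
import torch
from torch.utils.data import Dataset

class DrumPatternDataset(Dataset):
    """Drum patterns as 16x9 binary matrices with style labels (sketch)."""

    STYLES = ['rock', 'jazz', 'hiphop', 'electronic', 'latin']

    def __init__(self, path='data/drum_patterns.json'):
        # Assumed record format: {'pattern': 16x9 list of 0/1, 'style': str, 'name': str}
        with open(path) as f:
            self.records = json.load(f)

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        pattern = torch.tensor(rec['pattern'], dtype=torch.float32)  # [16, 9]
        style = self.STYLES.index(rec['style'])                      # integer label
        return pattern, style, rec['name']  # matches (patterns, styles, _) in train.py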

Part B: Hierarchical VAE Architecture

A hierarchical VAE uses multiple levels of latent variables to capture different aspects of the data. The provided architecture design separates a high-level style latent (z_high) from a low-level latent (z_low) that captures per-pattern variation:

Implement the VAE architecture in hierarchical_vae.py:
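One way to realize the two-level design is sketched below. The latent dimensions, hidden sizes, and layer choices are illustrative assumptions; match them to the provided design. Note the conditional-prior head p(z_low | z_high), which you will need for the full KL term in Part D:

import torch
import torch.nn as nn

class HierarchicalVAE(nn.Module):
    """Two-level VAE sketch: z_high captures style, z_low captures variation."""

    def __init__(self, z_low_dim=16, z_high_dim=4):
        super().__init__()
        # Inference path: x -> q(z_low|x), then z_low -> q(z_high|z_low)
        self.enc_low = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 9, 256), nn.ReLU(),
            nn.Linear(256, 2 * z_low_dim),                 # -> mu_low, logvar_low
        )
        self.enc_high = nn.Linear(z_low_dim, 2 * z_high_dim)   # -> mu_high, logvar_high
        # Generative path: z_high -> p(z_low|z_high) -> Bernoulli logits over the grid
        self.prior_low = nn.Linear(z_high_dim, 2 * z_low_dim)  # used in the conditional KL
        self.dec = nn.Sequential(
            nn.Linear(z_low_dim, 256), nn.ReLU(),
            nn.Linear(256, 16 * 9),
        )

    @staticmethod
    def reparameterize(mu, logvar):
        # z = mu + sigma * eps, with eps ~ N(0, I)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    def forward(self, x):
        mu_low, logvar_low = self.enc_low(x).chunk(2, dim=-1)
        z_low = self.reparameterize(mu_low, logvar_low)
        mu_high, logvar_high = self.enc_high(z_low).chunk(2, dim=-1)
        logits = self.dec(z_low).view(-1, 16, 9)  # feed to BCE-with-logits
        return logits, mu_low, logvar_low, mu_high, logvar_high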

Part C: Training Techniques for Discrete Data

Discrete (binary) outputs and posterior collapse are the two main training challenges for this VAE. The starter code provides techniques to mitigate them: KL-weight (beta) annealing to keep the latents informative, and temperature annealing for the binary outputs.

Use the provided utilities in training_utils.py:
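The two schedules referenced by train.py (Part D) might look like the sketch below; the function names and signatures match the calls in train.py, but the cycle length, ramp shape, and decay constants are illustrative assumptions:

def kl_annealing_schedule(epoch, method='cyclical', cycle_len=20, max_beta=1.0):
    """KL weight (beta) for the given epoch."""
    if method == 'cyclical':
        # Ramp 0 -> max_beta over the first half of each cycle, then hold;
        # repeated low-beta phases give the decoder room to use the latents.
        phase = (epoch % cycle_len) / cycle_len
        return max_beta * min(1.0, 2.0 * phase)
    if method == 'linear':
        return max_beta * min(1.0, epoch / 50)
    return max_beta  # 'constant'

def temperature_annealing_schedule(epoch, t_start=1.0, t_end=0.5, decay=0.95):
    """Sampling temperature for the Bernoulli outputs, cooled geometrically."""
    return max(t_end, t_start * decay ** epoch)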

Part D: Training Implementation

The starter code provides train.py with the training loop structure:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
import json

from training_utils import kl_annealing_schedule, temperature_annealing_schedule

def compute_hierarchical_elbo(recon_x, x, mu_low, logvar_low, mu_high, logvar_high, beta=1.0):
    """
    Compute the negative ELBO (the training loss to minimize) for the hierarchical VAE.
    
    ELBO = E[log p(x|z_low)] - beta * KL(q(z_low|x) || p(z_low|z_high)) 
           - beta * KL(q(z_high|z_low) || p(z_high))
    
    Args:
        recon_x: Reconstructed pattern logits [batch, 16, 9]
        x: Original patterns [batch, 16, 9]
        mu_low, logvar_low: Low-level latent parameters
        mu_high, logvar_high: High-level latent parameters
        beta: KL weight
        
    Returns:
        loss: Total loss
        recon_loss: Reconstruction component
        kl_low: KL for low-level latent
        kl_high: KL for high-level latent
    """
    # Reconstruction loss (binary cross-entropy)
    recon_loss = F.binary_cross_entropy_with_logits(
        recon_x.view(-1), x.view(-1), reduction='sum'
    )
    
    # Closed-form KL(q(z_high|x) || p(z_high)) with p(z_high) = N(0, I)
    kl_high = -0.5 * torch.sum(1 + logvar_high - mu_high.pow(2) - logvar_high.exp())
    
    # TODO: Implement KL(q(z_low|x) || p(z_low|z_high)) using the conditional
    # prior's mean and variance; the standard-normal KL below is a placeholder.
    kl_low = -0.5 * torch.sum(1 + logvar_low - mu_low.pow(2) - logvar_low.exp())
    
    return recon_loss + beta * (kl_low + kl_high), recon_loss, kl_low, kl_high

def train_epoch(model, data_loader, optimizer, epoch, device):
    """
    Train for one epoch with annealing schedules.
    """
    model.train()
    total_loss = 0
    
    # Get annealing parameters for this epoch
    beta = kl_annealing_schedule(epoch, method='cyclical')
    temperature = temperature_annealing_schedule(epoch)
    
    for batch_idx, (patterns, styles, _) in enumerate(data_loader):
        patterns = patterns.to(device)
        
        # TODO: Forward pass through the model
        # TODO: Compute loss with the current beta (assign to `loss`)
        # TODO: Backward pass and optimizer step
        # TODO: Accumulate loss.item() into total_loss for the epoch average
        
        # Log progress (uses `loss` from the TODOs above)
        if batch_idx % 10 == 0:
            print(f'Epoch {epoch}, Batch {batch_idx}: Loss = {loss.item():.4f}, '
                  f'Beta = {beta:.3f}, Temp = {temperature:.2f}')
    
    return total_loss / len(data_loader)

def main():
    # Configuration
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    batch_size = 32
    num_epochs = 100
    learning_rate = 0.001
    
    # TODO: Initialize dataset, model, optimizer
    # TODO: Training loop with logging
    # TODO: Save checkpoints and final model
    pass

if __name__ == '__main__':
    main()

Part E: Analysis and Music Generation

Complete analyze_latent.py to analyze the trained model and generate music:
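Two core routines are sketched below. The imports and model attributes assume the HierarchicalVAE and DrumPatternDataset sketches from Parts A and B, plus scikit-learn and matplotlib; adapt them to your actual interfaces:

import torch
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

from hierarchical_vae import HierarchicalVAE
from dataset import DrumPatternDataset

@torch.no_grad()
def plot_latent_tsne(model, dataset, out='results/latent_analysis/tsne.png'):
    """Project the high-level latent means to 2-D and color by style."""
    model.eval()
    patterns = torch.stack([dataset[i][0] for i in range(len(dataset))])
    styles = np.array([dataset[i][1] for i in range(len(dataset))])
    _, mu_low, _, mu_high, _ = model(patterns)
    coords = TSNE(n_components=2).fit_transform(mu_high.numpy())
    plt.scatter(coords[:, 0], coords[:, 1], c=styles, cmap='tab10', s=8)
    plt.title('t-SNE of z_high, colored by style')
    plt.savefig(out)

@torch.no_grad()
def sample_patterns(model, n=10, z_high_dim=4):
    """Ancestral sampling: z_high ~ N(0, I), z_low ~ p(z_low|z_high), threshold at 0.5."""
    z_high = torch.randn(n, z_high_dim)
    mu_p, logvar_p = model.prior_low(z_high).chunk(2, dim=-1)
    z_low = mu_p + torch.randn_like(mu_p) * (0.5 * logvar_p).exp()
    probs = torch.sigmoid(model.dec(z_low).view(n, 16, 9))
    return (probs > 0.5).float()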

Part F: Creative Experiments

Create a notebook experiments.ipynb with the following analyses:

  1. Genre Blending: Interpolate between jazz and rock patterns (see the interpolation sketch after this list)
  2. Complexity Control: Find latent dimensions that control pattern density
  3. Humanization: Add controlled variations to mechanical patterns
  4. Style Consistency: Generate full drum tracks with consistent style
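
For experiment 1, a minimal interpolation sketch is shown below. It assumes the HierarchicalVAE interface from Part B and uses linear interpolation between posterior means; spherical interpolation (slerp) is a common alternative for Gaussian latents:

import torch

@torch.no_grad()
def interpolate(model, x_a, x_b, steps=8):
    """Decode patterns along a straight line in z_low between two inputs."""
    mu_a, _ = model.enc_low(x_a.unsqueeze(0)).chunk(2, dim=-1)
    mu_b, _ = model.enc_low(x_b.unsqueeze(0)).chunk(2, dim=-1)
    patterns = []
    for t in torch.linspace(0, 1, steps):
        z = (1 - t) * mu_a + t * mu_b
        probs = torch.sigmoid(model.dec(z).view(16, 9))
        patterns.append((probs > 0.5).float())
    return torch.stack(patterns)  # [steps, 16, 9]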

Deliverables

Your problem2/ directory must contain:

  1. All code files as specified above
  2. results/training_log.json with loss curves and KL values
  3. results/best_model.pth - saved model weights
  4. results/generated_patterns/ containing:
    • 10 samples from each style
    • Interpolation sequences
    • Style transfer examples
  5. results/latent_analysis/ containing:
    • t-SNE visualization of latent space
    • Disentanglement analysis
    • Dimension interpretation results
  6. results/audio_samples/ with generated drum loops (optional but encouraged)

Your report must include analysis of:

  • Evidence of posterior collapse and how annealing prevented it
  • Interpretation of what each latent dimension learned to control
  • Quality assessment: Do generated patterns sound musical?
  • Comparison of different annealing strategies
  • Success rate of style transfer while preserving rhythm