Problem 2: Hierarchical VAE for Music Generation
Build a Variational Autoencoder that learns to generate drum patterns with controllable style. You will implement hierarchical latent variables, handle discrete outputs, and prevent posterior collapse.
Part A: Dataset and Representation
You will work with a dataset of drum patterns represented as binary matrices.
The dataset contains:
- Format: 16×9 binary matrices (16 timesteps, 9 drum instruments)
- Instruments: Kick, Snare, Closed Hi-hat, Open Hi-hat, Tom1, Tom2, Crash, Ride, Clap
- Styles: Rock, Jazz, Hip-hop, Electronic, Latin (200 patterns each)
- Total: 1000 unique drum patterns from MIDI files
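For example, a stock rock beat could be encoded like this (illustrative only; the column order follows the instrument list above):

import numpy as np

# Illustrative 16x9 encoding: rows are 16th-note timesteps, columns follow
# [Kick, Snare, Closed HH, Open HH, Tom1, Tom2, Crash, Ride, Clap].
pattern = np.zeros((16, 9), dtype=np.int8)
pattern[[0, 8], 0] = 1    # kick on beats 1 and 3
pattern[[4, 12], 1] = 1   # snare on beats 2 and 4
pattern[::2, 2] = 1       # closed hi-hat on every eighth note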
The starter code provides dataset.py with the data loader:
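The loader itself is not reproduced here; the sketch below shows the expected interface (the class name, storage path, and file format are assumptions, and the actual dataset.py may differ). It yields (pattern, style_id, index) triples, matching how train.py in Part D unpacks each batch:

import numpy as np
import torch
from torch.utils.data import Dataset

class DrumPatternDataset(Dataset):
    """Assumed interface: yields (pattern, style_id, index) per item."""
    def __init__(self, path='drum_patterns.npz'):  # hypothetical storage path
        data = np.load(path)
        self.patterns = torch.from_numpy(data['patterns']).float()  # [N, 16, 9]
        self.styles = torch.from_numpy(data['styles']).long()       # [N]

    def __len__(self):
        return len(self.patterns)

    def __getitem__(self, idx):
        return self.patterns[idx], self.styles[idx], idx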
Part B: Hierarchical VAE Architecture
A hierarchical VAE uses multiple levels of latent variables to capture different aspects of the data. The provided architecture design separates high-level style from low-level variations:
Implement the VAE architecture in hierarchical_vae.py:
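A minimal sketch of one possible design, assuming a two-level hierarchy in which z_high is inferred from z_low (matching the q(z_high | z_low) factorization used in the ELBO of Part D); all layer sizes and latent dimensions here are illustrative, not prescribed by the starter code:

import torch
import torch.nn as nn

class HierarchicalVAE(nn.Module):
    def __init__(self, z_high_dim=4, z_low_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(16 * 9, 256), nn.ReLU())
        # Low-level posterior q(z_low | x)
        self.mu_low = nn.Linear(256, z_low_dim)
        self.logvar_low = nn.Linear(256, z_low_dim)
        # High-level posterior q(z_high | z_low)
        self.high_net = nn.Sequential(nn.Linear(z_low_dim, 64), nn.ReLU())
        self.mu_high = nn.Linear(64, z_high_dim)
        self.logvar_high = nn.Linear(64, z_high_dim)
        # Conditional prior p(z_low | z_high): outputs (mu, logvar) concatenated
        self.prior_net = nn.Linear(z_high_dim, 2 * z_low_dim)
        # Decoder p(x | z_low): produces raw logits for BCE-with-logits
        self.decoder = nn.Sequential(nn.Linear(z_low_dim, 256), nn.ReLU(),
                                     nn.Linear(256, 16 * 9))

    def reparameterize(self, mu, logvar):
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def forward(self, x):
        h = self.encoder(x)
        mu_low, logvar_low = self.mu_low(h), self.logvar_low(h)
        z_low = self.reparameterize(mu_low, logvar_low)
        h2 = self.high_net(z_low)
        mu_high, logvar_high = self.mu_high(h2), self.logvar_high(h2)
        logits = self.decoder(z_low).view(-1, 16, 9)
        return logits, mu_low, logvar_low, mu_high, logvar_high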
Part C: Training Techniques for Discrete Data
Discrete (binary) outputs and posterior collapse are the two main challenges when training VAEs on this data. The starter code provides established countermeasures, including KL annealing and temperature annealing:
Use the provided utilities in training_utils.py:
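The exact signatures in training_utils.py are not shown here; below is a plausible sketch of the two schedules that train.py (Part D) calls, with illustrative constants. The cyclical beta schedule periodically resets the KL weight to zero so the decoder cannot permanently ignore the latents:

import math

def kl_annealing_schedule(epoch, method='cyclical', period=20, max_beta=1.0):
    """Ramp beta from 0 to max_beta; cyclical restarts help prevent posterior collapse."""
    if method == 'cyclical':
        # Linear ramp over the first half of each cycle, then hold at max_beta
        phase = (epoch % period) / period
        return max_beta * min(1.0, 2.0 * phase)
    # 'monotonic': a single linear ramp over the first `period` epochs
    return max_beta * min(1.0, epoch / period)

def temperature_annealing_schedule(epoch, t_start=1.0, t_end=0.5, decay=0.03):
    """Exponentially decay the sampling temperature used for discrete outputs."""
    return max(t_end, t_start * math.exp(-decay * epoch))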
Part D: Training Implementation
The starter code provides train.py with the training loop structure:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
import json

from training_utils import kl_annealing_schedule, temperature_annealing_schedule
def compute_hierarchical_elbo(recon_x, x, mu_low, logvar_low, mu_high, logvar_high, beta=1.0):
    """
    Compute the (negative) ELBO loss for the hierarchical VAE.

    ELBO = E[log p(x|z_low)] - beta * KL(q(z_low|x) || p(z_low|z_high))
                             - beta * KL(q(z_high|z_low) || p(z_high))

    Args:
        recon_x: Reconstructed pattern logits [batch, 16, 9]
        x: Original patterns [batch, 16, 9]
        mu_low, logvar_low: Low-level latent parameters
        mu_high, logvar_high: High-level latent parameters
        beta: KL weight

    Returns:
        loss: Total loss (negative ELBO, to be minimized)
        recon_loss: Reconstruction component
        kl_low: KL for low-level latent
        kl_high: KL for high-level latent
    """
    # Reconstruction loss (binary cross-entropy on logits)
    recon_loss = F.binary_cross_entropy_with_logits(
        recon_x.view(-1), x.view(-1), reduction='sum'
    )

    # KL(q(z_high) || p(z_high)) with p(z_high) = N(0, I), in closed form
    kl_high = -0.5 * torch.sum(1 + logvar_high - mu_high.pow(2) - logvar_high.exp())

    # TODO: KL(q(z_low|x) || p(z_low|z_high)) against the conditional prior.
    # For diagonal Gaussians N(mu_q, s_q^2) and N(mu_p, s_p^2) this is
    #   0.5 * sum(log(s_p^2 / s_q^2) + (s_q^2 + (mu_q - mu_p)^2) / s_p^2 - 1).
    # The standard KL below (prior = N(0, I)) is a valid simplification for now.
    kl_low = -0.5 * torch.sum(1 + logvar_low - mu_low.pow(2) - logvar_low.exp())

    return recon_loss + beta * (kl_low + kl_high), recon_loss, kl_low, kl_high
def train_epoch(model, data_loader, optimizer, epoch, device):
    """
    Train for one epoch with annealing schedules.
    """
    model.train()
    total_loss = 0.0

    # Get annealing parameters for this epoch
    beta = kl_annealing_schedule(epoch, method='cyclical')
    temperature = temperature_annealing_schedule(epoch)

    for batch_idx, (patterns, styles, _) in enumerate(data_loader):
        patterns = patterns.to(device)

        # TODO: Forward pass through the model
        # TODO: Compute loss with the current beta (use compute_hierarchical_elbo)
        # TODO: Backward pass and optimizer step
        # TODO: Accumulate: total_loss += loss.item()

        # Log progress
        if batch_idx % 10 == 0:
            print(f'Epoch {epoch}, Batch {batch_idx}: Loss = {loss.item():.4f}, '
                  f'Beta = {beta:.3f}, Temp = {temperature:.2f}')

    return total_loss / len(data_loader)
def main():
    # Configuration
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    batch_size = 32
    num_epochs = 100
    learning_rate = 0.001

    # TODO: Initialize dataset, model, optimizer
    # TODO: Training loop with logging
    # TODO: Save checkpoints and final model
    pass

if __name__ == '__main__':
    main()

Part E: Analysis and Music Generation
Complete analyze_latent.py to analyze the trained model and generate music:
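As a starting point, here is a sketch of per-style sampling that assumes the HierarchicalVAE interface sketched in Part B (prior_net and decoder are names from that sketch, not guaranteed by the starter code):

import torch

@torch.no_grad()
def generate(model, z_high, num_samples=10, temperature=0.5):
    """Sample binary patterns from a fixed high-level (style) code z_high."""
    model.eval()
    # Draw z_low from the conditional prior p(z_low | z_high)
    mu_p, logvar_p = model.prior_net(z_high).chunk(2, dim=-1)
    z_low = mu_p + torch.randn(num_samples, mu_p.shape[-1]) * torch.exp(0.5 * logvar_p)
    logits = model.decoder(z_low).view(-1, 16, 9)
    probs = torch.sigmoid(logits / temperature)  # temperature sharpens/softens outputs
    return torch.bernoulli(probs)                # binarize to a playable pattern

# Usage: z_high could come from encoding a reference pattern of the target
# style, or from torch.zeros(1, z_high_dim) for the prior mean.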
Part F: Creative Experiments
Create a notebook experiments.ipynb with the following analyses:
- Genre Blending: Interpolate between jazz and rock patterns (see the interpolation sketch after this list)
- Complexity Control: Find latent dimensions that control pattern density
- Humanization: Add controlled variations to mechanical patterns
- Style Consistency: Generate full drum tracks with consistent style
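For the genre-blending experiment, linear interpolation between posterior means is the simplest baseline; a sketch assuming the Part B interface (x_jazz and x_rock are single [16, 9] patterns):

import torch

@torch.no_grad()
def interpolate_styles(model, x_jazz, x_rock, steps=8):
    """Decode patterns along a straight line between two latent codes."""
    _, mu_a, _, _, _ = model(x_jazz.unsqueeze(0))
    _, mu_b, _, _, _ = model(x_rock.unsqueeze(0))
    frames = []
    for t in torch.linspace(0, 1, steps):
        z = (1 - t) * mu_a + t * mu_b            # linear blend of posterior means
        logits = model.decoder(z).view(16, 9)
        frames.append((torch.sigmoid(logits) > 0.5).float())
    return torch.stack(frames)                    # [steps, 16, 9]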
Deliverables
Your problem2/ directory must contain:
- All code files as specified above
- results/training_log.json with loss curves and KL values
- results/best_model.pth - saved model weights
- results/generated_patterns/ containing:
  - 10 samples from each style
  - Interpolation sequences
  - Style transfer examples
- results/latent_analysis/ containing:
  - t-SNE visualization of latent space
  - Disentanglement analysis
  - Dimension interpretation results
- results/audio_samples/ with generated drum loops (optional but encouraged)
Your report must include analysis of:
- Evidence of posterior collapse and how annealing prevented it
- Interpretation of what each latent dimension learned to control
- Quality assessment: Do generated patterns sound musical?
- Comparison of different annealing strategies
- Success rate of style transfer while preserving rhythm