Homework #1: Multi-Scale Detection and Spatial Regression
EE 641: Fall 2025
Assigned: 03 September
Due: Tuesday, 16 September at 23:59
Submission: Gradescope via GitHub repository
- PyTorch >= 2.0 must be installed
- Allowed libraries: PyTorch, NumPy, Pillow (PIL), matplotlib, and Python standard library only
- No other external libraries permitted (including no torchvision.ops, cv2, or detection-specific libraries)
Overview
In this assignment you will build two computer vision systems: a multi-scale object detector and a keypoint localization network. The first problem requires implementing anchor-based detection with feature pyramids. The second problem compares spatial heatmap regression against direct coordinate prediction for keypoint localization.
Dataset Setup
Download the dataset generation script: generate_datasets.py
Generate the synthetic datasets with the following command:
python generate_datasets.py --seed 641 --num_train 1000 --num_val 200

This creates a datasets/ directory with training and validation data for both problems. Use these exact parameters to ensure consistent results.
Problem 1: Multi-Scale Single-Shot Detector
Build a simplified single-shot object detector that handles multiple object scales through a basic feature pyramid architecture.
Part A: Dataset and Data Loading
You will work with a synthetic shape detection dataset containing three classes of objects at different scales. The dataset is provided in COCO-style JSON format.
Create dataset.py implementing the following data loader:
import torch
from torch.utils.data import Dataset
from PIL import Image
import json
class ShapeDetectionDataset(Dataset):
    def __init__(self, image_dir, annotation_file, transform=None):
        """
        Initialize the dataset.

        Args:
            image_dir: Path to directory containing images
            annotation_file: Path to COCO-style JSON annotations
            transform: Optional transform to apply to images
        """
        self.image_dir = image_dir
        self.transform = transform
        # Load and parse annotations
        # Store image paths and corresponding annotations
        pass

    def __len__(self):
        """Return the total number of samples."""
        pass

    def __getitem__(self, idx):
        """
        Return a sample from the dataset.

        Returns:
            image: Tensor of shape [3, H, W]
            targets: Dict containing:
                - boxes: Tensor of shape [N, 4] in [x1, y1, x2, y2] format
                - labels: Tensor of shape [N] with class indices (0, 1, 2)
        """
        pass

The dataset contains:
- Classes: 0: circle (small), 1: square (medium), 2: triangle (large)
- Images: 224×224 RGB images
- Annotations: Bounding boxes in [x1, y1, x2, y2] format with class labels
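For Part A, a hypothetical sketch of the annotation-parsing step. The field names ("images", "annotations", "bbox", "category_id") follow the usual COCO layout, but verify them against the JSON that generate_datasets.py actually emits before relying on them:

```python
# Hypothetical sketch: load COCO-style annotations and group them by image.
# Field names are assumed from the standard COCO layout -- check the
# generated JSON file to confirm them.
import json
from collections import defaultdict

def load_annotations(annotation_file):
    with open(annotation_file) as f:
        coco = json.load(f)
    # Map image id -> file name
    id_to_file = {img["id"]: img["file_name"] for img in coco["images"]}
    # Group boxes and labels by image id; this assignment's boxes are
    # already [x1, y1, x2, y2] (standard COCO uses [x, y, w, h]).
    per_image = defaultdict(lambda: {"boxes": [], "labels": []})
    for ann in coco["annotations"]:
        per_image[ann["image_id"]]["boxes"].append(ann["bbox"])
        per_image[ann["image_id"]]["labels"].append(ann["category_id"])
    return id_to_file, per_image
```

__getitem__ can then look up an image's entry in per_image and convert the lists to tensors.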
Part B: Multi-Scale Architecture
Create model.py with a detector that extracts features at multiple scales:
import torch
import torch.nn as nn
class MultiScaleDetector(nn.Module):
    def __init__(self, num_classes=3, num_anchors=3):
        """
        Initialize the multi-scale detector.

        Args:
            num_classes: Number of object classes (not including background)
            num_anchors: Number of anchors per spatial location
        """
        super().__init__()
        self.num_classes = num_classes
        self.num_anchors = num_anchors
        # Feature extraction backbone
        # Extract features at 3 different scales
        # Detection heads for each scale
        # Each head outputs: [batch, num_anchors * (4 + 1 + num_classes), H, W]
        pass

    def forward(self, x):
        """
        Forward pass.

        Args:
            x: Input tensor of shape [batch, 3, 224, 224]
        Returns:
            List of 3 tensors (one per scale), each containing predictions
            Shape: [batch, num_anchors * (5 + num_classes), H, W]
            where 5 = 4 bbox coords + 1 objectness score
        """
        pass

Architecture Requirements:
Backbone: 4 convolutional blocks
- Block 1 (Stem): Conv(3→32, stride=1) → BN → ReLU → Conv(32→64, stride=2) → BN → ReLU [224→112]
- Block 2: Conv(64→128, stride=2) → BN → ReLU [112→56] → Output as Scale 1
- Block 3: Conv(128→256, stride=2) → BN → ReLU [56→28] → Output as Scale 2
- Block 4: Conv(256→512, stride=2) → BN → ReLU [28→14] → Output as Scale 3
Detection Heads: For each scale, apply:
- 3×3 Conv (keep channels same)
- 1×1 Conv → num_anchors * (5 + num_classes) channels
Output Format: Each spatial location predicts for each anchor:
- 4 values: bbox offsets (tx, ty, tw, th)
- 1 value: objectness score
- num_classes values: class scores
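As a sanity check on head shapes, the per-scale prediction counts can be worked out with plain arithmetic (feature-map sizes taken from the Part B backbone spec; the numbers below are derived, not prescribed):

```python
# Per-scale prediction counts for the three feature maps (56x56, 28x28, 14x14)
# with 3 anchors per spatial location and 3 object classes.
num_anchors = 3
num_classes = 3
channels_per_anchor = 5 + num_classes               # 4 bbox offsets + 1 objectness + class scores
head_channels = num_anchors * channels_per_anchor   # output channels of each 1x1 conv head

anchors_per_scale = [h * w * num_anchors for (h, w) in [(56, 56), (28, 28), (14, 14)]]
total_anchors = sum(anchors_per_scale)              # anchors the loss must match per image
```

With these settings each head emits 24 channels, and the three scales together predict 12,348 anchors per image.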
Part C: Anchor Generation and Matching
Create anchor generation utilities in utils.py:
import torch
import numpy as np
def generate_anchors(feature_map_sizes, anchor_scales, image_size=224):
    """
    Generate anchors for multiple feature maps.

    Args:
        feature_map_sizes: List of (H, W) tuples for each feature map
        anchor_scales: List of lists, scales for each feature map
        image_size: Input image size
    Returns:
        anchors: List of tensors, each of shape [num_anchors, 4]
                 in [x1, y1, x2, y2] format
    """
    # For each feature map:
    # 1. Create grid of anchor centers
    # 2. Generate anchors with specified scales and ratios
    # 3. Convert to absolute coordinates
    pass

def compute_iou(boxes1, boxes2):
    """
    Compute IoU between two sets of boxes.

    Args:
        boxes1: Tensor of shape [N, 4]
        boxes2: Tensor of shape [M, 4]
    Returns:
        iou: Tensor of shape [N, M]
    """
    pass

def match_anchors_to_targets(anchors, target_boxes, target_labels,
                             pos_threshold=0.5, neg_threshold=0.3):
    """
    Match anchors to ground truth boxes.

    Args:
        anchors: Tensor of shape [num_anchors, 4]
        target_boxes: Tensor of shape [num_targets, 4]
        target_labels: Tensor of shape [num_targets]
        pos_threshold: IoU threshold for positive anchors
        neg_threshold: IoU threshold for negative anchors
    Returns:
        matched_labels: Tensor of shape [num_anchors]
                        (0: background, 1-N: classes)
        matched_boxes: Tensor of shape [num_anchors, 4]
        pos_mask: Boolean tensor indicating positive anchors
        neg_mask: Boolean tensor indicating negative anchors
    """
    pass

Anchor Configuration:
- Scale 1 (56×56): anchor scales [16, 24, 32]
- Scale 2 (28×28): anchor scales [48, 64, 96]
- Scale 3 (14×14): anchor scales [96, 128, 192]
- All scales use aspect ratios: [1:1]
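A torch-free sketch of square-anchor generation and pairwise IoU under the configuration above (illustrative only; your utils.py should operate on tensors and vectorize the IoU over all N×M pairs):

```python
# Minimal sketch: square anchors (1:1 aspect ratio) for one feature map, plus
# IoU for a single pair of boxes. Anchor centers sit at cell centers, so for a
# 56x56 map over a 224-pixel image the stride is 4 and the first center is (2, 2).
def generate_anchors_single_scale(fm_h, fm_w, scales, image_size=224):
    stride_y = image_size / fm_h
    stride_x = image_size / fm_w
    anchors = []
    for i in range(fm_h):
        for j in range(fm_w):
            cy = (i + 0.5) * stride_y
            cx = (j + 0.5) * stride_x
            for s in scales:  # one square anchor per scale
                anchors.append([cx - s / 2, cy - s / 2, cx + s / 2, cy + s / 2])
    return anchors

def iou(box_a, box_b):
    # Intersection-over-union for two [x1, y1, x2, y2] boxes.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

match_anchors_to_targets then reduces to thresholding each anchor's best IoU: above pos_threshold it takes that target's class, below neg_threshold it becomes background, and anything in between is ignored.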
Part D: Loss Implementation
Implement the multi-task loss in loss.py:
import torch
import torch.nn as nn
import torch.nn.functional as F
class DetectionLoss(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        self.num_classes = num_classes

    def forward(self, predictions, targets, anchors):
        """
        Compute multi-task loss.

        Args:
            predictions: List of tensors from each scale
            targets: List of dicts with 'boxes' and 'labels' for each image
            anchors: List of anchor tensors for each scale
        Returns:
            loss_dict: Dict containing:
                - loss_obj: Objectness loss
                - loss_cls: Classification loss
                - loss_loc: Localization loss
                - loss_total: Weighted sum
        """
        # For each prediction scale:
        # 1. Match anchors to targets
        # 2. Compute objectness loss (BCE)
        # 3. Compute classification loss (CE) for positive anchors
        # 4. Compute localization loss (Smooth L1) for positive anchors
        # 5. Apply hard negative mining (3:1 ratio)
        pass

    def hard_negative_mining(self, loss, pos_mask, neg_mask, ratio=3):
        """
        Select hard negative examples.

        Args:
            loss: Loss values for all anchors
            pos_mask: Boolean mask for positive anchors
            neg_mask: Boolean mask for negative anchors
            ratio: Negative to positive ratio
        Returns:
            selected_neg_mask: Boolean mask for selected negatives
        """
        pass

Loss Weights:
- Objectness loss weight: 1.0
- Classification loss weight: 1.0
- Localization loss weight: 2.0
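The mining step in loss.py can be sketched in plain Python (illustrative; a real implementation would rank the per-anchor losses with torch.topk rather than sorting lists):

```python
# Sketch of hard negative mining: keep only the ratio * num_pos negative
# anchors with the highest loss, so easy negatives don't swamp the gradient.
def select_hard_negatives(losses, pos_mask, neg_mask, ratio=3):
    num_pos = sum(pos_mask)
    num_keep = ratio * num_pos
    # Rank negative anchors by loss, highest (hardest) first.
    neg_indices = [i for i, is_neg in enumerate(neg_mask) if is_neg]
    neg_indices.sort(key=lambda i: losses[i], reverse=True)
    keep = set(neg_indices[:num_keep])
    return [i in keep for i in range(len(losses))]
```

The objectness loss is then computed over the positive anchors plus only these selected negatives.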
Part E: Training Script
Create train.py that trains the model:
import torch
import torch.optim as optim
from torch.utils.data import DataLoader
import json
def train_epoch(model, dataloader, criterion, optimizer, device):
    """Train for one epoch."""
    model.train()
    # Training loop
    pass

def validate(model, dataloader, criterion, device):
    """Validate the model."""
    model.eval()
    # Validation loop
    pass

def main():
    # Configuration
    batch_size = 16
    learning_rate = 0.001
    num_epochs = 50
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    # Initialize dataset, model, loss, optimizer
    # Training loop with logging
    # Save best model and training log
    pass

if __name__ == '__main__':
    main()

The training script must:
- Train for 50 epochs
- Use SGD with momentum=0.9
- Save the best model based on validation loss
- Log training metrics to results/training_log.json
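The assignment does not fix a schema for training_log.json; one possible layout, sketched with placeholder loss values:

```python
# Illustrative only: one way to accumulate and serialize the training log.
# The keys below are suggestions, not a required schema.
import json

log = {"epoch": [], "train_loss": [], "val_loss": []}
for epoch in range(2):                             # stand-in for the real 50-epoch loop
    log["epoch"].append(epoch)
    log["train_loss"].append(1.0 / (epoch + 1))    # placeholder values
    log["val_loss"].append(1.2 / (epoch + 1))

serialized = json.dumps(log, indent=2)             # write this to results/training_log.json
```

Lists keyed by metric name make it easy to plot loss curves later with a single json.load.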
Part F: Evaluation and Visualization
Create evaluate.py to compute detection metrics and generate visualizations:
def compute_ap(predictions, ground_truths, iou_threshold=0.5):
    """Compute Average Precision for a single class."""
    pass

def visualize_detections(image, predictions, ground_truths, save_path):
    """Visualize predictions and ground truth boxes."""
    pass

def analyze_scale_performance(model, dataloader, anchors):
    """Analyze which scales detect which object sizes."""
    # Generate statistics on detection performance per scale
    # Create visualizations showing scale specialization
    pass

Deliverables
Your problem1/ directory must contain:
- All code files as specified above
- results/training_log.json with loss curves and metrics
- results/best_model.pth with saved model weights
- results/visualizations/ containing:
  - Detection results on 10 validation images
  - Anchor coverage visualization for each scale
  - Analysis showing which scales detect which object sizes
Your report must include analysis of:
- How different scales specialize for different object sizes
- The effect of anchor scales on detection performance
- Visualization of the learned features at each scale
Problem 2: Heatmap vs Direct Regression for Keypoint Detection
Implement and compare two approaches to keypoint localization: spatial heatmap regression and direct coordinate regression. Quantify the performance difference between these methods.
Part A: Dataset and Data Loading
You will work with synthetic “stick figure” images containing 5 keypoints per figure. The dataset includes keypoint annotations in pixel coordinates.
Create dataset.py implementing data loading for both approaches:
import torch
from torch.utils.data import Dataset
from PIL import Image
import numpy as np
import json
class KeypointDataset(Dataset):
    def __init__(self, image_dir, annotation_file, output_type='heatmap',
                 heatmap_size=64, sigma=2.0):
        """
        Initialize the keypoint dataset.

        Args:
            image_dir: Path to directory containing images
            annotation_file: Path to JSON annotations
            output_type: 'heatmap' or 'regression'
            heatmap_size: Size of output heatmaps (for heatmap mode)
            sigma: Gaussian sigma for heatmap generation
        """
        self.image_dir = image_dir
        self.output_type = output_type
        self.heatmap_size = heatmap_size
        self.sigma = sigma
        # Load annotations
        pass

    def generate_heatmap(self, keypoints, height, width):
        """
        Generate gaussian heatmaps for keypoints.

        Args:
            keypoints: Array of shape [num_keypoints, 2] in (x, y) format
            height, width: Dimensions of the heatmap
        Returns:
            heatmaps: Tensor of shape [num_keypoints, height, width]
        """
        # For each keypoint:
        # 1. Create 2D gaussian centered at keypoint location
        # 2. Handle boundary cases
        pass

    def __getitem__(self, idx):
        """
        Return a sample from the dataset.

        Returns:
            image: Tensor of shape [1, 128, 128] (grayscale)
            If output_type == 'heatmap':
                targets: Tensor of shape [5, 64, 64] (5 heatmaps)
            If output_type == 'regression':
                targets: Tensor of shape [10] (x, y for 5 keypoints, normalized to [0, 1])
        """
        pass

Dataset Properties:
- Images: 128×128 grayscale images
- Keypoints: 5 points (head, left_hand, right_hand, left_foot, right_foot)
- Annotations: (x, y) coordinates in pixel space
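A pure-Python sketch of the Gaussian-heatmap construction for a single keypoint, assuming keypoints arrive in 128-pixel image coordinates and heatmaps are 64×64; your generate_heatmap should vectorize this with tensors and handle all five keypoints at once:

```python
# Sketch: build one 2D Gaussian heatmap for a keypoint given in image pixels.
# The keypoint is first rescaled from image coordinates into heatmap coordinates.
import math

def gaussian_heatmap(x, y, heatmap_size=64, image_size=128, sigma=2.0):
    scale = heatmap_size / image_size       # map pixel coords into heatmap coords
    cx, cy = x * scale, y * scale
    # Value at heatmap cell (i, j) falls off with squared distance from (cx, cy).
    return [[math.exp(-((j - cx) ** 2 + (i - cy) ** 2) / (2 * sigma ** 2))
             for j in range(heatmap_size)]
            for i in range(heatmap_size)]
```

Because the Gaussian peaks exactly at the rescaled keypoint, the argmax of the target heatmap recovers the annotation, which is what makes heatmap decoding in Part D consistent with training.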
Part B: Network Architectures
Create model.py with both heatmap and regression networks:
import torch
import torch.nn as nn
import torch.nn.functional as F
class HeatmapNet(nn.Module):
    def __init__(self, num_keypoints=5):
        """
        Initialize the heatmap regression network.

        Args:
            num_keypoints: Number of keypoints to detect
        """
        super().__init__()
        self.num_keypoints = num_keypoints
        # Encoder (downsampling path)
        # Input: [batch, 1, 128, 128]
        # Progressively downsample to extract features
        # Decoder (upsampling path)
        # Progressively upsample back to heatmap resolution
        # Output: [batch, num_keypoints, 64, 64]
        # Skip connections between encoder and decoder
        pass

    def forward(self, x):
        """
        Forward pass.

        Args:
            x: Input tensor of shape [batch, 1, 128, 128]
        Returns:
            heatmaps: Tensor of shape [batch, num_keypoints, 64, 64]
        """
        pass

class RegressionNet(nn.Module):
    def __init__(self, num_keypoints=5):
        """
        Initialize the direct regression network.

        Args:
            num_keypoints: Number of keypoints to detect
        """
        super().__init__()
        self.num_keypoints = num_keypoints
        # Use same encoder architecture as HeatmapNet
        # But add global pooling and fully connected layers
        # Output: [batch, num_keypoints * 2]
        pass

    def forward(self, x):
        """
        Forward pass.

        Args:
            x: Input tensor of shape [batch, 1, 128, 128]
        Returns:
            coords: Tensor of shape [batch, num_keypoints * 2]
                    Values in range [0, 1] (normalized coordinates)
        """
        pass

Architecture Specifications:
Encoder (shared between both networks):
- Conv1: Conv(1→32) → BN → ReLU → MaxPool (128→64)
- Conv2: Conv(32→64) → BN → ReLU → MaxPool (64→32)
- Conv3: Conv(64→128) → BN → ReLU → MaxPool (32→16)
- Conv4: Conv(128→256) → BN → ReLU → MaxPool (16→8)
HeatmapNet Decoder:
- Deconv4: ConvTranspose(256→128) → BN → ReLU (8→16)
- Concat with Conv3 output (skip connection)
- Deconv3: ConvTranspose(256→64) → BN → ReLU (16→32)
- Concat with Conv2 output (skip connection)
- Deconv2: ConvTranspose(128→32) → BN → ReLU (32→64)
- Final: Conv(32→num_keypoints) (no activation)
RegressionNet Head:
- Global Average Pooling
- FC1: Linear(256→128) → ReLU → Dropout(0.5)
- FC2: Linear(128→64) → ReLU → Dropout(0.5)
- FC3: Linear(64→num_keypoints*2) → Sigmoid
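The skip connections double the decoder's input channels, which is why Deconv3 takes 256 channels and Deconv2 takes 128. A quick bookkeeping check of the specification above:

```python
# Channel bookkeeping for the HeatmapNet decoder: each skip connection
# concatenates an encoder stage, so the next deconv's input channel count
# is the sum of the previous deconv output and the matching encoder output.
encoder_channels = {"conv2": 64, "conv3": 128, "conv4": 256}

deconv4_out = 128
deconv3_in = deconv4_out + encoder_channels["conv3"]   # concat skip: 128 + 128
deconv3_out = 64
deconv2_in = deconv3_out + encoder_channels["conv2"]   # concat skip: 64 + 64
deconv2_out = 32                                       # feeds the final 1-channel-per-keypoint conv
```

Getting this arithmetic wrong is the most common source of shape errors when wiring up torch.cat in the decoder.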
Part C: Training Implementation
Create train.py to train both models:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
import json
def train_heatmap_model(model, train_loader, val_loader, num_epochs=30):
    """
    Train the heatmap-based model.

    Uses MSE loss between predicted and target heatmaps.
    """
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    # Training loop
    # Log losses and save best model
    pass

def train_regression_model(model, train_loader, val_loader, num_epochs=30):
    """
    Train the direct regression model.

    Uses MSE loss between predicted and target coordinates.
    """
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    # Training loop
    # Log losses and save best model
    pass

def main():
    # Train both models with same data
    # Save training logs for comparison
    pass

if __name__ == '__main__':
    main()

Training specifications:
- Train both models for 30 epochs
- Use Adam optimizer with lr=0.001
- Batch size: 32
- Save models as heatmap_model.pth and regression_model.pth
- Log training/validation loss to training_log.json
Part D: Evaluation Metrics
Create evaluate.py to compute PCK (Percentage of Correct Keypoints):
import torch
import numpy as np
import matplotlib.pyplot as plt
def extract_keypoints_from_heatmaps(heatmaps):
    """
    Extract (x, y) coordinates from heatmaps.

    Args:
        heatmaps: Tensor of shape [batch, num_keypoints, H, W]
    Returns:
        coords: Tensor of shape [batch, num_keypoints, 2]
    """
    # Find argmax location in each heatmap
    # Convert to (x, y) coordinates
    pass

def compute_pck(predictions, ground_truths, thresholds, normalize_by='bbox'):
    """
    Compute PCK at various thresholds.

    Args:
        predictions: Tensor of shape [N, num_keypoints, 2]
        ground_truths: Tensor of shape [N, num_keypoints, 2]
        thresholds: List of threshold values (as fraction of normalization)
        normalize_by: 'bbox' for bounding box diagonal, 'torso' for torso length
    Returns:
        pck_values: Dict mapping threshold to accuracy
    """
    # For each threshold:
    # Count keypoints within threshold distance of ground truth
    pass

def plot_pck_curves(pck_heatmap, pck_regression, save_path):
    """
    Plot PCK curves comparing both methods.
    """
    pass

def visualize_predictions(image, pred_keypoints, gt_keypoints, save_path):
    """
    Visualize predicted and ground truth keypoints on image.
    """
    pass

Part E: Comparative Analysis
Create baseline.py for additional experiments:
def ablation_study(dataset, model_class):
    """
    Conduct ablation studies on key hyperparameters.

    Experiments to run:
    1. Effect of heatmap resolution (32x32 vs 64x64 vs 128x128)
    2. Effect of Gaussian sigma (1.0, 2.0, 3.0, 4.0)
    3. Effect of skip connections (with vs without)
    """
    # Run experiments and save results
    pass

def analyze_failure_cases(model, test_loader):
    """
    Identify and visualize failure cases.

    Find examples where:
    1. Heatmap succeeds but regression fails
    2. Regression succeeds but heatmap fails
    3. Both methods fail
    """
    pass

Deliverables
Your problem2/ directory must contain:
- All code files as specified above
- results/training_log.json with training curves for both methods
- results/heatmap_model.pth and results/regression_model.pth
- results/visualizations/ containing:
  - PCK curves comparing both methods
  - Predicted heatmaps at different training stages
  - Sample predictions from both methods on test images
  - Failure case analysis
Your report must include:
- PCK curves at thresholds [0.05, 0.1, 0.15, 0.2]
- Analysis of why heatmap approach works better (or worse)
- Ablation study results showing effect of sigma and resolution
- Visualization of learned heatmaps and failure cases
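Before wiring up evaluate.py, it can help to sanity-check its two core computations, peak decoding and PCK, in a torch-free sketch. The normalization length here is a stand-in for the bounding-box diagonal named in Part D:

```python
# Sketch of Part D's evaluation primitives: decode a heatmap's peak into
# (x, y), and score PCK at one threshold against ground truth.
import math

def argmax_coords(heatmap):
    # heatmap: 2D list indexed [row][col]; returns the (x, y) of the maximum.
    best = max((v, i, j) for i, row in enumerate(heatmap) for j, v in enumerate(row))
    return (best[2], best[1])   # (col, row) -> (x, y)

def pck(pred, gt, threshold, norm_length):
    # pred, gt: lists of (x, y); a keypoint is "correct" if its Euclidean
    # error is within threshold * norm_length.
    correct = sum(
        1 for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= threshold * norm_length
    )
    return correct / len(gt)
```

Sweeping the threshold over [0.05, 0.1, 0.15, 0.2] for both models yields the PCK curves the report asks for.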
Submission Requirements
Your GitHub repository must follow this exact structure:
ee641-hw1-[username]/
├── problem1/
│ ├── model.py
│ ├── dataset.py
│ ├── loss.py
│ ├── train.py
│ ├── evaluate.py
│ ├── utils.py
│ └── results/
│ ├── training_log.json
│ ├── best_model.pth
│ └── visualizations/
├── problem2/
│ ├── model.py
│ ├── dataset.py
│ ├── train.py
│ ├── evaluate.py
│ ├── baseline.py
│ └── results/
│ ├── training_log.json
│ ├── heatmap_model.pth
│ ├── regression_model.pth
│ └── visualizations/
├── report.pdf
└── README.md
The README.md in your repository root must contain:
- Your full name
- USC email address
- Instructions to run each problem if they differ from the standard commands
- Any implementation notes
Before submitting:
- Your repository structure must match the requirement exactly
- python train.py must run without errors in each problem directory
- All output files must be generated in the correct locations