Homework #1: Multi-Scale Detection and Spatial Regression

EE 641: Fall 2025

Assignment Details

Assigned: 03 September
Due: Tuesday, 16 September at 23:59

Submission: Gradescope via GitHub repository

Requirements
  • PyTorch >= 2.0 must be installed
  • Allowed libraries: PyTorch, NumPy, Pillow (PIL), matplotlib, and Python standard library only
  • No other external libraries permitted (including no torchvision.ops, cv2, or detection-specific libraries)

Overview

In this assignment you will build two computer vision systems: a multi-scale object detector and a keypoint localization network. The first problem requires implementing anchor-based detection with feature pyramids. The second problem compares spatial heatmap regression against direct coordinate prediction for keypoint localization.

Dataset Setup

Download the dataset generation script: generate_datasets.py

Generate the synthetic datasets with the following command:

python generate_datasets.py --seed 641 --num_train 1000 --num_val 200

This creates a datasets/ directory with training and validation data for both problems. Use these exact parameters to ensure consistent results.

Problem 1: Multi-Scale Single-Shot Detector

Build a simplified single-shot object detector that handles multiple object scales through a basic feature pyramid architecture.

Part A: Dataset and Data Loading

You will work with a synthetic shape detection dataset containing three classes of objects at different scales. The dataset is provided in COCO-style JSON format.

Create dataset.py implementing the following data loader:

import torch
from torch.utils.data import Dataset
from PIL import Image
import json

class ShapeDetectionDataset(Dataset):
    def __init__(self, image_dir, annotation_file, transform=None):
        """
        Initialize the dataset.
        
        Args:
            image_dir: Path to directory containing images
            annotation_file: Path to COCO-style JSON annotations
            transform: Optional transform to apply to images
        """
        self.image_dir = image_dir
        self.transform = transform
        # Load and parse annotations
        # Store image paths and corresponding annotations
        pass
    
    def __len__(self):
        """Return the total number of samples."""
        pass
    
    def __getitem__(self, idx):
        """
        Return a sample from the dataset.
        
        Returns:
            image: Tensor of shape [3, H, W]
            targets: Dict containing:
                - boxes: Tensor of shape [N, 4] in [x1, y1, x2, y2] format
                - labels: Tensor of shape [N] with class indices (0, 1, 2)
        """
        pass

The dataset contains:

  • Classes: 0: circle (small), 1: square (medium), 2: triangle (large)
  • Images: 224×224 RGB images
  • Annotations: Bounding boxes in [x1, y1, x2, y2] format with class labels
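
For reference, one way to group COCO-style annotations by image for use inside `__init__`/`__getitem__`. The field names follow standard COCO conventions (`annotations`, `image_id`, `bbox`, `category_id`), and per the spec above the stored boxes are already in [x1, y1, x2, y2]; verify both assumptions against the generated JSON before relying on this sketch:

```python
import torch

def targets_for_image(coco, image_id):
    # Gather all boxes/labels for one image from a COCO-style dict.
    # Assumes boxes are stored as [x1, y1, x2, y2], as stated in the handout.
    boxes, labels = [], []
    for ann in coco["annotations"]:
        if ann["image_id"] == image_id:
            boxes.append(ann["bbox"])
            labels.append(ann["category_id"])
    return {"boxes": torch.tensor(boxes, dtype=torch.float32),
            "labels": torch.tensor(labels, dtype=torch.long)}
```

Building an `image_id → annotations` index once in `__init__` (rather than scanning per `__getitem__` call) keeps data loading fast.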

Part B: Multi-Scale Architecture

Create model.py with a detector that extracts features at multiple scales:

import torch
import torch.nn as nn

class MultiScaleDetector(nn.Module):
    def __init__(self, num_classes=3, num_anchors=3):
        """
        Initialize the multi-scale detector.
        
        Args:
            num_classes: Number of object classes (not including background)
            num_anchors: Number of anchors per spatial location
        """
        super().__init__()
        self.num_classes = num_classes
        self.num_anchors = num_anchors
        
        # Feature extraction backbone
        # Extract features at 3 different scales
        
        # Detection heads for each scale
        # Each head outputs: [batch, num_anchors * (4 + 1 + num_classes), H, W]
        pass
    
    def forward(self, x):
        """
        Forward pass.
        
        Args:
            x: Input tensor of shape [batch, 3, 224, 224]
            
        Returns:
            List of 3 tensors (one per scale), each containing predictions
            Shape: [batch, num_anchors * (5 + num_classes), H, W]
            where 5 = 4 bbox coords + 1 objectness score
        """
        pass

Architecture Requirements:

  1. Backbone: 4 convolutional blocks

    • Block 1 (Stem): Conv(3→32, stride=1) → BN → ReLU → Conv(32→64, stride=2) → BN → ReLU [224→112]
    • Block 2: Conv(64→128, stride=2) → BN → ReLU [112→56] → Output as Scale 1
    • Block 3: Conv(128→256, stride=2) → BN → ReLU [56→28] → Output as Scale 2
    • Block 4: Conv(256→512, stride=2) → BN → ReLU [28→14] → Output as Scale 3
  2. Detection Heads: For each scale, apply:

    • 3×3 Conv (keep channels same)
    • 1×1 Conv → num_anchors * (5 + num_classes) channels
  3. Output Format: At each spatial location, each anchor predicts:

    • 4 values: bbox offsets (tx, ty, tw, th)
    • 1 value: objectness score
    • num_classes values: class scores
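
Downstream code (loss computation, decoding) is easier to write if the packed head output is flattened to one row per anchor. A sketch of that reshape (the helper name is illustrative, not part of the required API):

```python
import torch

def flatten_predictions(pred, num_anchors, num_classes):
    # [B, A*(5+C), H, W] -> [B, H*W*A, 5+C]: one row per anchor,
    # columns = 4 bbox offsets, 1 objectness score, C class scores.
    B, _, H, W = pred.shape
    pred = pred.view(B, num_anchors, 5 + num_classes, H, W)
    pred = pred.permute(0, 3, 4, 1, 2).contiguous()
    return pred.view(B, H * W * num_anchors, 5 + num_classes)
```

Whatever row ordering you choose here, your anchor generation must produce anchors in the same order, or matching will silently pair the wrong anchors with the wrong predictions.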

Part C: Anchor Generation and Matching

Create anchor generation utilities in utils.py:

import torch
import numpy as np

def generate_anchors(feature_map_sizes, anchor_scales, image_size=224):
    """
    Generate anchors for multiple feature maps.
    
    Args:
        feature_map_sizes: List of (H, W) tuples for each feature map
        anchor_scales: List of lists, scales for each feature map
        image_size: Input image size
        
    Returns:
        anchors: List of tensors, each of shape [num_anchors, 4]
                 in [x1, y1, x2, y2] format
    """
    # For each feature map:
    # 1. Create grid of anchor centers
    # 2. Generate anchors with specified scales and ratios
    # 3. Convert to absolute coordinates
    pass

def compute_iou(boxes1, boxes2):
    """
    Compute IoU between two sets of boxes.
    
    Args:
        boxes1: Tensor of shape [N, 4]
        boxes2: Tensor of shape [M, 4]
        
    Returns:
        iou: Tensor of shape [N, M]
    """
    pass

def match_anchors_to_targets(anchors, target_boxes, target_labels, 
                            pos_threshold=0.5, neg_threshold=0.3):
    """
    Match anchors to ground truth boxes.
    
    Args:
        anchors: Tensor of shape [num_anchors, 4]
        target_boxes: Tensor of shape [num_targets, 4]
        target_labels: Tensor of shape [num_targets]
        pos_threshold: IoU threshold for positive anchors
        neg_threshold: IoU threshold for negative anchors
        
    Returns:
        matched_labels: Tensor of shape [num_anchors]
                       (0: background, 1-N: classes)
        matched_boxes: Tensor of shape [num_anchors, 4]
        pos_mask: Boolean tensor indicating positive anchors
        neg_mask: Boolean tensor indicating negative anchors
    """
    pass
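
The compute_iou stub reduces to a few broadcast operations. As a reference, one self-contained way to get the [N, M] matrix (a sketch, not a drop-in solution):

```python
import torch

def iou_matrix(boxes1, boxes2):
    # Pairwise IoU via broadcasting: boxes1 [N, 4], boxes2 [M, 4] -> [N, M].
    x1 = torch.max(boxes1[:, None, 0], boxes2[None, :, 0])
    y1 = torch.max(boxes1[:, None, 1], boxes2[None, :, 1])
    x2 = torch.min(boxes1[:, None, 2], boxes2[None, :, 2])
    y2 = torch.min(boxes1[:, None, 3], boxes2[None, :, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)  # clamp handles disjoint boxes
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
    return inter / (area1[:, None] + area2[None, :] - inter)
```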

Anchor Configuration:

  • Scale 1 (56×56): anchor scales [16, 24, 32]
  • Scale 2 (28×28): anchor scales [48, 64, 96]
  • Scale 3 (14×14): anchor scales [96, 128, 192]
  • All scales use a single aspect ratio of 1:1 (square anchors)
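
Since all anchors are square, generating them for one scale is a straightforward grid walk. A sketch, assuming centers sit at cell midpoints (one common convention; anchors near the border are left unclipped here, which is a design decision you may revisit):

```python
import torch

def anchors_for_scale(fmap_h, fmap_w, scales, image_size=224):
    # Square (1:1) anchors centered on each cell of one feature map.
    stride_y = image_size / fmap_h
    stride_x = image_size / fmap_w
    anchors = []
    for i in range(fmap_h):
        for j in range(fmap_w):
            cy = (i + 0.5) * stride_y  # cell center in image coordinates
            cx = (j + 0.5) * stride_x
            for s in scales:
                anchors.append([cx - s / 2, cy - s / 2, cx + s / 2, cy + s / 2])
    return torch.tensor(anchors)  # [fmap_h * fmap_w * len(scales), 4]
```

For example, Scale 3 (14×14 with scales [96, 128, 192]) yields 14 × 14 × 3 = 588 anchors with stride 16.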

Part D: Loss Implementation

Implement the multi-task loss in loss.py:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DetectionLoss(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        self.num_classes = num_classes
        
    def forward(self, predictions, targets, anchors):
        """
        Compute multi-task loss.
        
        Args:
            predictions: List of tensors from each scale
            targets: List of dicts with 'boxes' and 'labels' for each image
            anchors: List of anchor tensors for each scale
            
        Returns:
            loss_dict: Dict containing:
                - loss_obj: Objectness loss
                - loss_cls: Classification loss  
                - loss_loc: Localization loss
                - loss_total: Weighted sum
        """
        # For each prediction scale:
        # 1. Match anchors to targets
        # 2. Compute objectness loss (BCE)
        # 3. Compute classification loss (CE) for positive anchors
        # 4. Compute localization loss (Smooth L1) for positive anchors
        # 5. Apply hard negative mining (3:1 ratio)
        pass
    
    def hard_negative_mining(self, loss, pos_mask, neg_mask, ratio=3):
        """
        Select hard negative examples.
        
        Args:
            loss: Loss values for all anchors
            pos_mask: Boolean mask for positive anchors
            neg_mask: Boolean mask for negative anchors
            ratio: Negative to positive ratio
            
        Returns:
            selected_neg_mask: Boolean mask for selected negatives
        """
        pass

Loss Weights:

  • Objectness loss weight: 1.0
  • Classification loss weight: 1.0
  • Localization loss weight: 2.0
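
One way to implement the 3:1 hard negative mining step, assuming the per-anchor objectness losses for one image have been flattened to a 1-D tensor (the function name and masking trick are illustrative, not required):

```python
import torch

def select_hard_negatives(loss, pos_mask, neg_mask, ratio=3):
    # Keep only the `ratio * num_pos` highest-loss negatives.
    num_pos = int(pos_mask.sum())
    num_neg = min(ratio * max(num_pos, 1), int(neg_mask.sum()))
    masked = loss.clone()
    masked[~neg_mask] = float("-inf")   # positives/ignored anchors can never be picked
    _, idx = masked.topk(num_neg)
    selected = torch.zeros_like(neg_mask)
    selected[idx] = True
    return selected
```

Without this step the loss is dominated by the thousands of easy background anchors; mining keeps the positive/negative signal balanced.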

Part E: Training Script

Create train.py that trains the model:

import torch
import torch.optim as optim
from torch.utils.data import DataLoader
import json

def train_epoch(model, dataloader, criterion, optimizer, device):
    """Train for one epoch."""
    model.train()
    # Training loop
    pass

def validate(model, dataloader, criterion, device):
    """Validate the model."""
    model.eval()
    # Validation loop
    pass

def main():
    # Configuration
    batch_size = 16
    learning_rate = 0.001
    num_epochs = 50
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    
    # Initialize dataset, model, loss, optimizer
    # Training loop with logging
    # Save best model and training log
    pass

if __name__ == '__main__':
    main()

The training script must:

  • Train for 50 epochs
  • Use SGD with momentum=0.9
  • Save the best model based on validation loss
  • Log training metrics to results/training_log.json
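
A minimal sketch of the required optimizer and best-model tracking, shown on a toy model (in your train.py the comparison should use validation loss, and the commented `torch.save` line does the checkpointing):

```python
import torch
import torch.nn as nn
import torch.optim as optim

def fit(model, batches, num_epochs=50):
    # SGD with momentum=0.9 per the spec; keep the best weights by loss.
    optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    criterion = nn.MSELoss()
    log, best = {"train_loss": []}, float("inf")
    for epoch in range(num_epochs):
        model.train()
        total = 0.0
        for x, y in batches:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
            total += loss.item()
        log["train_loss"].append(total / len(batches))
        if log["train_loss"][-1] < best:   # in train.py: compare validation loss
            best = log["train_loss"][-1]
            # torch.save(model.state_dict(), "results/best_model.pth")
    return log, best
```

Dump `log` with `json.dump` at the end of training to produce results/training_log.json.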

Part F: Evaluation and Visualization

Create evaluate.py to compute detection metrics and generate visualizations:

def compute_ap(predictions, ground_truths, iou_threshold=0.5):
    """Compute Average Precision for a single class."""
    pass

def visualize_detections(image, predictions, ground_truths, save_path):
    """Visualize predictions and ground truth boxes."""
    pass

def analyze_scale_performance(model, dataloader, anchors):
    """Analyze which scales detect which object sizes."""
    # Generate statistics on detection performance per scale
    # Create visualizations showing scale specialization
    pass

Deliverables

Your problem1/ directory must contain:

  1. All code files as specified above

  2. results/training_log.json with loss curves and metrics

  3. results/best_model.pth - saved model weights

  4. results/visualizations/ containing:

    • Detection results on 10 validation images
    • Anchor coverage visualization for each scale
    • Analysis showing which scales detect which object sizes

Your report must include analysis of:

  • How different scales specialize for different object sizes
  • The effect of anchor scales on detection performance
  • Visualization of the learned features at each scale

Problem 2: Heatmap vs Direct Regression for Keypoint Detection

Implement and compare two approaches to keypoint localization: spatial heatmap regression and direct coordinate regression. Quantify the performance difference between these methods.

Part A: Dataset and Data Loading

You will work with synthetic “stick figure” images containing 5 keypoints per figure. The dataset includes keypoint annotations in pixel coordinates.

Create dataset.py implementing data loading for both approaches:

import torch
from torch.utils.data import Dataset
from PIL import Image
import numpy as np
import json

class KeypointDataset(Dataset):
    def __init__(self, image_dir, annotation_file, output_type='heatmap', 
                 heatmap_size=64, sigma=2.0):
        """
        Initialize the keypoint dataset.
        
        Args:
            image_dir: Path to directory containing images
            annotation_file: Path to JSON annotations
            output_type: 'heatmap' or 'regression'
            heatmap_size: Size of output heatmaps (for heatmap mode)
            sigma: Gaussian sigma for heatmap generation
        """
        self.image_dir = image_dir
        self.output_type = output_type
        self.heatmap_size = heatmap_size
        self.sigma = sigma
        # Load annotations
        pass
    
    def generate_heatmap(self, keypoints, height, width):
        """
        Generate gaussian heatmaps for keypoints.
        
        Args:
            keypoints: Array of shape [num_keypoints, 2] in (x, y) format
            height, width: Dimensions of the heatmap
            
        Returns:
            heatmaps: Tensor of shape [num_keypoints, height, width]
        """
        # For each keypoint:
        # 1. Create 2D gaussian centered at keypoint location
        # 2. Handle boundary cases
        pass
    
    def __getitem__(self, idx):
        """
        Return a sample from the dataset.
        
        Returns:
            image: Tensor of shape [1, 128, 128] (grayscale)
            If output_type == 'heatmap':
                targets: Tensor of shape [5, 64, 64] (5 heatmaps)
            If output_type == 'regression':
                targets: Tensor of shape [10] (x,y for 5 keypoints, normalized to [0,1])
        """
        pass

Dataset Properties:

  • Images: 128×128 grayscale images
  • Keypoints: 5 points (head, left_hand, right_hand, left_foot, right_foot)
  • Annotations: (x, y) coordinates in pixel space
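
The core of generate_heatmap is a single vectorized Gaussian. A sketch for one keypoint (remember to rescale pixel coordinates by heatmap_size / image_size, i.e. 64 / 128, before calling; the function name is illustrative):

```python
import torch

def gaussian_heatmap(cx, cy, size, sigma=2.0):
    # 2-D Gaussian peaking at (cx, cy) on a size x size grid.
    ys = torch.arange(size, dtype=torch.float32)[:, None]  # column of row indices
    xs = torch.arange(size, dtype=torch.float32)[None, :]  # row of column indices
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
```

Because the grid is built over the full heatmap and the exponential decays smoothly, keypoints near the border are handled without explicit boundary cases.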

Part B: Network Architectures

Create model.py with both heatmap and regression networks:

import torch
import torch.nn as nn
import torch.nn.functional as F

class HeatmapNet(nn.Module):
    def __init__(self, num_keypoints=5):
        """
        Initialize the heatmap regression network.
        
        Args:
            num_keypoints: Number of keypoints to detect
        """
        super().__init__()
        self.num_keypoints = num_keypoints
        
        # Encoder (downsampling path)
        # Input: [batch, 1, 128, 128]
        # Progressively downsample to extract features
        
        # Decoder (upsampling path)
        # Progressively upsample back to heatmap resolution
        # Output: [batch, num_keypoints, 64, 64]
        
        # Skip connections between encoder and decoder
        pass
    
    def forward(self, x):
        """
        Forward pass.
        
        Args:
            x: Input tensor of shape [batch, 1, 128, 128]
            
        Returns:
            heatmaps: Tensor of shape [batch, num_keypoints, 64, 64]
        """
        pass

class RegressionNet(nn.Module):
    def __init__(self, num_keypoints=5):
        """
        Initialize the direct regression network.
        
        Args:
            num_keypoints: Number of keypoints to detect
        """
        super().__init__()
        self.num_keypoints = num_keypoints
        
        # Use same encoder architecture as HeatmapNet
        # But add global pooling and fully connected layers
        # Output: [batch, num_keypoints * 2]
        pass
    
    def forward(self, x):
        """
        Forward pass.
        
        Args:
            x: Input tensor of shape [batch, 1, 128, 128]
            
        Returns:
            coords: Tensor of shape [batch, num_keypoints * 2]
                   Values in range [0, 1] (normalized coordinates)
        """
        pass

Architecture Specifications:

  1. Encoder (shared between both networks):

    • Conv1: Conv(1→32) → BN → ReLU → MaxPool (128→64)
    • Conv2: Conv(32→64) → BN → ReLU → MaxPool (64→32)
    • Conv3: Conv(64→128) → BN → ReLU → MaxPool (32→16)
    • Conv4: Conv(128→256) → BN → ReLU → MaxPool (16→8)
  2. HeatmapNet Decoder:

    • Deconv4: ConvTranspose(256→128) → BN → ReLU (8→16)
    • Concat with Conv3 output (skip connection)
    • Deconv3: ConvTranspose(256→64) → BN → ReLU (16→32)
    • Concat with Conv2 output (skip connection)
    • Deconv2: ConvTranspose(128→32) → BN → ReLU (32→64)
    • Final: Conv(32→num_keypoints) (no activation)
  3. RegressionNet Head:

    • Global Average Pooling
    • FC1: Linear(256→128) → ReLU → Dropout(0.5)
    • FC2: Linear(128→64) → ReLU → Dropout(0.5)
    • FC3: Linear(64→num_keypoints*2) → Sigmoid
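
The channel bookkeeping for the skip connections is the easiest place to make a mistake, so here is one decoder step worked out (the kernel_size=2, stride=2 choice is an assumption; any ConvTranspose configuration that doubles the spatial size works):

```python
import torch
import torch.nn as nn

# One decoder step: Deconv4 upsamples the bottleneck, then concatenates Conv3's skip.
deconv4 = nn.Sequential(
    nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2),  # 8 -> 16
    nn.BatchNorm2d(128),
    nn.ReLU(),
)

feat4 = torch.randn(1, 256, 8, 8)       # Conv4 output (bottleneck)
feat3 = torch.randn(1, 128, 16, 16)     # Conv3 output (skip connection)
up = deconv4(feat4)                     # [1, 128, 16, 16]
merged = torch.cat([up, feat3], dim=1)  # [1, 256, 16, 16] -> input to Deconv3
```

This is why Deconv3 takes 256 input channels in the spec: 128 upsampled + 128 skipped.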

Part C: Training Implementation

Create train.py to train both models:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
import json

def train_heatmap_model(model, train_loader, val_loader, num_epochs=30):
    """
    Train the heatmap-based model.
    
    Uses MSE loss between predicted and target heatmaps.
    """
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    # Training loop
    # Log losses and save best model
    pass

def train_regression_model(model, train_loader, val_loader, num_epochs=30):
    """
    Train the direct regression model.
    
    Uses MSE loss between predicted and target coordinates.
    """
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    # Training loop
    # Log losses and save best model
    pass

def main():
    # Train both models with same data
    # Save training logs for comparison
    pass

if __name__ == '__main__':
    main()

Training specifications:

  • Train both models for 30 epochs
  • Use Adam optimizer with lr=0.001
  • Batch size: 32
  • Save models as heatmap_model.pth and regression_model.pth
  • Log training/validation loss to training_log.json

Part D: Evaluation Metrics

Create evaluate.py to compute PCK (Percentage of Correct Keypoints):

import torch
import numpy as np
import matplotlib.pyplot as plt

def extract_keypoints_from_heatmaps(heatmaps):
    """
    Extract (x, y) coordinates from heatmaps.
    
    Args:
        heatmaps: Tensor of shape [batch, num_keypoints, H, W]
        
    Returns:
        coords: Tensor of shape [batch, num_keypoints, 2]
    """
    # Find argmax location in each heatmap
    # Convert to (x, y) coordinates
    pass

def compute_pck(predictions, ground_truths, thresholds, normalize_by='bbox'):
    """
    Compute PCK at various thresholds.
    
    Args:
        predictions: Tensor of shape [N, num_keypoints, 2]
        ground_truths: Tensor of shape [N, num_keypoints, 2]
        thresholds: List of threshold values (as fraction of normalization)
        normalize_by: 'bbox' for bounding box diagonal, 'torso' for torso length
        
    Returns:
        pck_values: Dict mapping threshold to accuracy
    """
    # For each threshold:
    # Count keypoints within threshold distance of ground truth
    pass

def plot_pck_curves(pck_heatmap, pck_regression, save_path):
    """
    Plot PCK curves comparing both methods.
    """
    pass

def visualize_predictions(image, pred_keypoints, gt_keypoints, save_path):
    """
    Visualize predicted and ground truth keypoints on image.
    """
    pass
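
The two evaluation stubs above reduce to an argmax and a thresholded distance. A sketch of both (coordinates come out in heatmap space, so scale them back to 128-px image space before comparing against ground truth; names are illustrative):

```python
import torch

def heatmaps_to_coords(heatmaps):
    # Argmax of each heatmap -> (x, y); heatmaps: [B, K, H, W].
    B, K, H, W = heatmaps.shape
    flat = heatmaps.view(B, K, -1).argmax(dim=-1)
    return torch.stack([flat % W, flat // W], dim=-1).float()  # (col, row) = (x, y)

def pck(pred, gt, threshold, norm):
    # Fraction of keypoints within threshold * norm pixels of ground truth.
    dist = torch.linalg.norm(pred - gt, dim=-1)  # [N, K]
    return (dist <= threshold * norm).float().mean().item()
```

Sweeping `threshold` over [0.05, 0.1, 0.15, 0.2] and plotting the results gives the required PCK curves.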

Part E: Comparative Analysis

Create baseline.py for additional experiments:

def ablation_study(dataset, model_class):
    """
    Conduct ablation studies on key hyperparameters.
    
    Experiments to run:
    1. Effect of heatmap resolution (32x32 vs 64x64 vs 128x128)
    2. Effect of Gaussian sigma (1.0, 2.0, 3.0, 4.0)
    3. Effect of skip connections (with vs without)
    """
    # Run experiments and save results
    pass

def analyze_failure_cases(model, test_loader):
    """
    Identify and visualize failure cases.
    
    Find examples where:
    1. Heatmap succeeds but regression fails
    2. Regression succeeds but heatmap fails
    3. Both methods fail
    """
    pass

Deliverables

Your problem2/ directory must contain:

  1. All code files as specified above

  2. results/training_log.json with training curves for both methods

  3. results/heatmap_model.pth and results/regression_model.pth

  4. results/visualizations/ containing:

    • PCK curves comparing both methods
    • Predicted heatmaps at different training stages
    • Sample predictions from both methods on test images
    • Failure case analysis

Your report must include:

  • PCK curves at thresholds [0.05, 0.1, 0.15, 0.2]
  • Analysis of why the heatmap approach works better (or worse) than direct regression
  • Ablation study results showing effect of sigma and resolution
  • Visualization of learned heatmaps and failure cases

Submission Requirements

Your GitHub repository must follow this exact structure:

ee641-hw1-[username]/
├── problem1/
│   ├── model.py
│   ├── dataset.py
│   ├── loss.py
│   ├── train.py
│   ├── evaluate.py
│   ├── utils.py
│   └── results/
│       ├── training_log.json
│       ├── best_model.pth
│       └── visualizations/
├── problem2/
│   ├── model.py
│   ├── dataset.py
│   ├── train.py
│   ├── evaluate.py
│   ├── baseline.py
│   └── results/
│       ├── training_log.json
│       ├── heatmap_model.pth
│       ├── regression_model.pth
│       └── visualizations/
├── report.pdf
└── README.md

The README.md in your repository root must contain:

  • Your full name
  • USC email address
  • Instructions to run each problem if they differ from the standard commands
  • Any implementation notes

Testing Your Submission

Before submitting:

  1. Your repository structure must match the layout above exactly
  2. python train.py must run without errors in each problem directory
  3. All output files must be generated in the correct locations