Homework #1: Multi-Scale Detection and Spatial Regression

EE 641: Fall 2025

Assignment Details

Assigned: 03 September
Due: Tuesday, 16 September at 23:59

Submission: Gradescope via GitHub repository

Requirements
  • PyTorch >= 2.0 must be installed
  • Allowed libraries: PyTorch, NumPy, Pillow (PIL), matplotlib, and Python standard library only
  • No other external libraries permitted (including no torchvision.ops, cv2, or detection-specific libraries)

Overview

In this assignment you will build two computer vision systems: a multi-scale object detector and a keypoint localization network. The first problem requires implementing anchor-based detection with feature pyramids. The second problem compares spatial heatmap regression against direct coordinate prediction for keypoint localization.

Dataset Setup

Download the dataset generation script: generate_datasets.py

Generate the synthetic datasets with the following command:

python generate_datasets.py --seed 641 --num_train 1000 --num_val 200

This creates a datasets/ directory with training and validation data for both problems. Use these exact parameters to ensure consistent results.

Problem 1: Multi-Scale Single-Shot Detector

Build a simplified single-shot object detector that handles multiple object scales through a basic feature pyramid architecture.

Part A: Dataset and Data Loading

You will work with a synthetic shape detection dataset containing three classes of objects at different scales. The dataset is provided in COCO-style JSON format.

Create dataset.py implementing the following data loader:

import torch
from torch.utils.data import Dataset
from PIL import Image
import json

class ShapeDetectionDataset(Dataset):
    def __init__(self, image_dir, annotation_file, transform=None):
        """
        Initialize the dataset.
        
        Args:
            image_dir: Path to directory containing images
            annotation_file: Path to COCO-style JSON annotations
            transform: Optional transform to apply to images
        """
        self.image_dir = image_dir
        self.transform = transform
        # Load and parse annotations
        # Store image paths and corresponding annotations
        pass
    
    def __len__(self):
        """Return the total number of samples."""
        pass
    
    def __getitem__(self, idx):
        """
        Return a sample from the dataset.
        
        Returns:
            image: Tensor of shape [3, H, W]
            targets: Dict containing:
                - boxes: Tensor of shape [N, 4] in [x1, y1, x2, y2] format
                - labels: Tensor of shape [N] with class indices (0, 1, 2)
        """
        pass

The dataset contains:

  • Classes: 0: circle (small), 1: square (medium), 2: triangle (large)
  • Images: 224×224 RGB images
  • Annotations: Bounding boxes in [x1, y1, x2, y2] format with class labels
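
For reference, one way to group COCO-style annotations by image for use inside `__init__`/`__getitem__`. The field names follow standard COCO conventions (`annotations`, `image_id`, `bbox`, `category_id`), and per the spec above the stored boxes are already in [x1, y1, x2, y2]; verify both assumptions against the generated JSON before relying on this sketch:

```python
import torch

def targets_for_image(coco, image_id):
    # Gather all boxes/labels for one image from a COCO-style dict.
    # Assumes boxes are stored as [x1, y1, x2, y2], as stated in the handout.
    boxes, labels = [], []
    for ann in coco["annotations"]:
        if ann["image_id"] == image_id:
            boxes.append(ann["bbox"])
            labels.append(ann["category_id"])
    return {"boxes": torch.tensor(boxes, dtype=torch.float32),
            "labels": torch.tensor(labels, dtype=torch.long)}
```

Building an `image_id → annotations` index once in `__init__` (rather than scanning per `__getitem__` call) keeps data loading fast.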

Part B: Multi-Scale Architecture

Create model.py with a detector that extracts features at multiple scales:

import torch
import torch.nn as nn

class MultiScaleDetector(nn.Module):
    def __init__(self, num_classes=3, num_anchors=3):
        """
        Initialize the multi-scale detector.
        
        Args:
            num_classes: Number of object classes (not including background)
            num_anchors: Number of anchors per spatial location
        """
        super().__init__()
        self.num_classes = num_classes
        self.num_anchors = num_anchors
        
        # Feature extraction backbone
        # Extract features at 3 different scales
        
        # Detection heads for each scale
        # Each head outputs: [batch, num_anchors * (4 + 1 + num_classes), H, W]
        pass
    
    def forward(self, x):
        """
        Forward pass.
        
        Args:
            x: Input tensor of shape [batch, 3, 224, 224]
            
        Returns:
            List of 3 tensors (one per scale), each containing predictions
            Shape: [batch, num_anchors * (5 + num_classes), H, W]
            where 5 = 4 bbox coords + 1 objectness score
        """
        pass

Architecture Requirements:

  1. Backbone: 4 convolutional blocks

    • Block 1 (Stem): Conv(3→32, stride=1) → BN → ReLU → Conv(32→64, stride=2) → BN → ReLU [224→112]
    • Block 2: Conv(64→128, stride=2) → BN → ReLU [112→56] → Output as Scale 1
    • Block 3: Conv(128→256, stride=2) → BN → ReLU [56→28] → Output as Scale 2
    • Block 4: Conv(256→512, stride=2) → BN → ReLU [28→14] → Output as Scale 3
  2. Detection Heads: For each scale, apply:

    • 3×3 Conv (keep channels same)
    • 1×1 Conv → num_anchors * (5 + num_classes) channels
  3. Output Format: At each spatial location, each anchor predicts:

    • 4 values: bbox offsets (tx, ty, tw, th)
    • 1 value: objectness score
    • num_classes values: class scores
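
Downstream code (loss computation, decoding) is easier to write if the packed head output is flattened to one row per anchor. A sketch of that reshape (the helper name is illustrative, not part of the required API):

```python
import torch

def flatten_predictions(pred, num_anchors, num_classes):
    # [B, A*(5+C), H, W] -> [B, H*W*A, 5+C]: one row per anchor,
    # columns = 4 bbox offsets, 1 objectness score, C class scores.
    B, _, H, W = pred.shape
    pred = pred.view(B, num_anchors, 5 + num_classes, H, W)
    pred = pred.permute(0, 3, 4, 1, 2).contiguous()
    return pred.view(B, H * W * num_anchors, 5 + num_classes)
```

Whatever row ordering you choose here, your anchor generation must produce anchors in the same order, or matching will silently pair the wrong anchors with the wrong predictions.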

Part C: Anchor Generation and Matching

Create anchor generation utilities in utils.py:

import torch
import numpy as np

def generate_anchors(feature_map_sizes, anchor_scales, image_size=224):
    """
    Generate anchors for multiple feature maps.
    
    Args:
        feature_map_sizes: List of (H, W) tuples for each feature map
        anchor_scales: List of lists, scales for each feature map
        image_size: Input image size
        
    Returns:
        anchors: List of tensors, each of shape [num_anchors, 4]
                 in [x1, y1, x2, y2] format
    """
    # For each feature map:
    # 1. Create grid of anchor centers
    # 2. Generate anchors with specified scales and ratios
    # 3. Convert to absolute coordinates
    pass

def compute_iou(boxes1, boxes2):
    """
    Compute IoU between two sets of boxes.
    
    Args:
        boxes1: Tensor of shape [N, 4]
        boxes2: Tensor of shape [M, 4]
        
    Returns:
        iou: Tensor of shape [N, M]
    """
    pass

def match_anchors_to_targets(anchors, target_boxes, target_labels, 
                            pos_threshold=0.5, neg_threshold=0.3):
    """
    Match anchors to ground truth boxes.
    
    Args:
        anchors: Tensor of shape [num_anchors, 4]
        target_boxes: Tensor of shape [num_targets, 4]
        target_labels: Tensor of shape [num_targets]
        pos_threshold: IoU threshold for positive anchors
        neg_threshold: IoU threshold for negative anchors
        
    Returns:
        matched_labels: Tensor of shape [num_anchors]
                       (0: background, 1-N: classes)
        matched_boxes: Tensor of shape [num_anchors, 4]
        pos_mask: Boolean tensor indicating positive anchors
        neg_mask: Boolean tensor indicating negative anchors
    """
    pass
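
The compute_iou stub reduces to a few broadcast operations. As a reference, one self-contained way to get the [N, M] matrix (a sketch, not a drop-in solution):

```python
import torch

def iou_matrix(boxes1, boxes2):
    # Pairwise IoU via broadcasting: boxes1 [N, 4], boxes2 [M, 4] -> [N, M].
    x1 = torch.max(boxes1[:, None, 0], boxes2[None, :, 0])
    y1 = torch.max(boxes1[:, None, 1], boxes2[None, :, 1])
    x2 = torch.min(boxes1[:, None, 2], boxes2[None, :, 2])
    y2 = torch.min(boxes1[:, None, 3], boxes2[None, :, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)  # clamp handles disjoint boxes
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
    return inter / (area1[:, None] + area2[None, :] - inter)
```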

Anchor Configuration:

  • Scale 1 (56×56): anchor scales [16, 24, 32]
  • Scale 2 (28×28): anchor scales [48, 64, 96]
  • Scale 3 (14×14): anchor scales [96, 128, 192]
  • All scales use a single aspect ratio of 1:1 (square anchors)
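
Since all anchors are square, generating them for one scale is a straightforward grid walk. A sketch, assuming centers sit at cell midpoints (one common convention; anchors near the border are left unclipped here, which is a design decision you may revisit):

```python
import torch

def anchors_for_scale(fmap_h, fmap_w, scales, image_size=224):
    # Square (1:1) anchors centered on each cell of one feature map.
    stride_y = image_size / fmap_h
    stride_x = image_size / fmap_w
    anchors = []
    for i in range(fmap_h):
        for j in range(fmap_w):
            cy = (i + 0.5) * stride_y  # cell center in image coordinates
            cx = (j + 0.5) * stride_x
            for s in scales:
                anchors.append([cx - s / 2, cy - s / 2, cx + s / 2, cy + s / 2])
    return torch.tensor(anchors)  # [fmap_h * fmap_w * len(scales), 4]
```

For example, Scale 3 (14×14 with scales [96, 128, 192]) yields 14 × 14 × 3 = 588 anchors with stride 16.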

Part D: Loss Implementation

Implement the multi-task loss in loss.py:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DetectionLoss(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        self.num_classes = num_classes
        
    def forward(self, predictions, targets, anchors):
        """
        Compute multi-task loss.
        
        Args:
            predictions: List of tensors from each scale
            targets: List of dicts with 'boxes' and 'labels' for each image
            anchors: List of anchor tensors for each scale
            
        Returns:
            loss_dict: Dict containing:
                - loss_obj: Objectness loss
                - loss_cls: Classification loss  
                - loss_loc: Localization loss
                - loss_total: Weighted sum
        """
        # For each prediction scale:
        # 1. Match anchors to targets
        # 2. Compute objectness loss (BCE)
        # 3. Compute classification loss (CE) for positive anchors
        # 4. Compute localization loss (Smooth L1) for positive anchors
        # 5. Apply hard negative mining (3:1 ratio)
        pass
    
    def hard_negative_mining(self, loss, pos_mask, neg_mask, ratio=3):
        """
        Select hard negative examples.
        
        Args:
            loss: Loss values for all anchors
            pos_mask: Boolean mask for positive anchors
            neg_mask: Boolean mask for negative anchors
            ratio: Negative to positive ratio
            
        Returns:
            selected_neg_mask: Boolean mask for selected negatives
        """
        pass

Loss Weights:

  • Objectness loss weight: 1.0
  • Classification loss weight: 1.0
  • Localization loss weight: 2.0
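
One way to implement the 3:1 hard negative mining step, assuming the per-anchor objectness losses for one image have been flattened to a 1-D tensor (the function name and masking trick are illustrative, not required):

```python
import torch

def select_hard_negatives(loss, pos_mask, neg_mask, ratio=3):
    # Keep only the `ratio * num_pos` highest-loss negatives.
    num_pos = int(pos_mask.sum())
    num_neg = min(ratio * max(num_pos, 1), int(neg_mask.sum()))
    masked = loss.clone()
    masked[~neg_mask] = float("-inf")   # positives/ignored anchors can never be picked
    _, idx = masked.topk(num_neg)
    selected = torch.zeros_like(neg_mask)
    selected[idx] = True
    return selected
```

Without this step the loss is dominated by the thousands of easy background anchors; mining keeps the positive/negative signal balanced.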

Part E: Training Script

Create train.py that trains the model:

import torch
import torch.optim as optim
from torch.utils.data import DataLoader
import json

def train_epoch(model, dataloader, criterion, optimizer, device):
    """Train for one epoch."""
    model.train()
    # Training loop
    pass

def validate(model, dataloader, criterion, device):
    """Validate the model."""
    model.eval()
    # Validation loop
    pass

def main():
    # Configuration
    batch_size = 16
    learning_rate = 0.001
    num_epochs = 50
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    
    # Initialize dataset, model, loss, optimizer
    # Training loop with logging
    # Save best model and training log
    pass

if __name__ == '__main__':
    main()

The training script must:

  • Train for 50 epochs
  • Use SGD with momentum=0.9
  • Save the best model based on validation loss
  • Log training metrics to results/training_log.json
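
A minimal sketch of the required optimizer and best-model tracking, shown on a toy model (in your train.py the comparison should use validation loss, and the commented `torch.save` line does the checkpointing):

```python
import torch
import torch.nn as nn
import torch.optim as optim

def fit(model, batches, num_epochs=50):
    # SGD with momentum=0.9 per the spec; keep the best weights by loss.
    optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    criterion = nn.MSELoss()
    log, best = {"train_loss": []}, float("inf")
    for epoch in range(num_epochs):
        model.train()
        total = 0.0
        for x, y in batches:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
            total += loss.item()
        log["train_loss"].append(total / len(batches))
        if log["train_loss"][-1] < best:   # in train.py: compare validation loss
            best = log["train_loss"][-1]
            # torch.save(model.state_dict(), "results/best_model.pth")
    return log, best
```

Dump `log` with `json.dump` at the end of training to produce results/training_log.json.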

Part F: Evaluation and Visualization

Create evaluate.py to compute detection metrics and generate visualizations:

def compute_ap(predictions, ground_truths, iou_threshold=0.5):
    """Compute Average Precision for a single class."""
    pass

def visualize_detections(image, predictions, ground_truths, save_path):
    """Visualize predictions and ground truth boxes."""
    pass

def analyze_scale_performance(model, dataloader, anchors):
    """Analyze which scales detect which object sizes."""
    # Generate statistics on detection performance per scale
    # Create visualizations showing scale specialization
    pass

Deliverables

Your problem1/ directory must contain:

  1. All code files as specified above

  2. results/training_log.json with loss curves and metrics

  3. results/best_model.pth - saved model weights

  4. results/visualizations/ containing:

    • Detection results on 10 validation images
    • Anchor coverage visualization for each scale
    • Analysis showing which scales detect which object sizes

Your report must include analysis of:

  • How different scales specialize for different object sizes
  • The effect of anchor scales on detection performance
  • Visualization of the learned features at each scale

Problem 2: Heatmap vs Direct Regression for Keypoint Detection

Implement and compare two approaches to keypoint localization: spatial heatmap regression and direct coordinate regression. Quantify the performance difference between these methods.

Part A: Dataset and Data Loading

You will work with synthetic “stick figure” images containing 5 keypoints per figure. The dataset includes keypoint annotations in pixel coordinates.

Create dataset.py implementing data loading for both approaches:

import torch
from torch.utils.data import Dataset
from PIL import Image
import numpy as np
import json

class KeypointDataset(Dataset):
    def __init__(self, image_dir, annotation_file, output_type='heatmap', 
                 heatmap_size=64, sigma=2.0):
        """
        Initialize the keypoint dataset.
        
        Args:
            image_dir: Path to directory containing images
            annotation_file: Path to JSON annotations
            output_type: 'heatmap' or 'regression'
            heatmap_size: Size of output heatmaps (for heatmap mode)
            sigma: Gaussian sigma for heatmap generation
        """
        self.image_dir = image_dir
        self.output_type = output_type
        self.heatmap_size = heatmap_size
        self.sigma = sigma
        # Load annotations
        pass
    
    def generate_heatmap(self, keypoints, height, width):
        """
        Generate gaussian heatmaps for keypoints.
        
        Args:
            keypoints: Array of shape [num_keypoints, 2] in (x, y) format
            height, width: Dimensions of the heatmap
            
        Returns:
            heatmaps: Tensor of shape [num_keypoints, height, width]
        """
        # For each keypoint:
        # 1. Create 2D gaussian centered at keypoint location
        # 2. Handle boundary cases
        pass
    
    def __getitem__(self, idx):
        """
        Return a sample from the dataset.
        
        Returns:
            image: Tensor of shape [1, 128, 128] (grayscale)
            If output_type == 'heatmap':
                targets: Tensor of shape [5, 64, 64] (5 heatmaps)
            If output_type == 'regression':
                targets: Tensor of shape [10] (x,y for 5 keypoints, normalized to [0,1])
        """
        pass

Dataset Properties:

  • Images: 128×128 grayscale images
  • Keypoints: 5 points (head, left_hand, right_hand, left_foot, right_foot)
  • Annotations: (x, y) coordinates in pixel space
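
The core of generate_heatmap is a single vectorized Gaussian. A sketch for one keypoint (remember to rescale pixel coordinates by heatmap_size / image_size, i.e. 64 / 128, before calling; the function name is illustrative):

```python
import torch

def gaussian_heatmap(cx, cy, size, sigma=2.0):
    # 2-D Gaussian peaking at (cx, cy) on a size x size grid.
    ys = torch.arange(size, dtype=torch.float32)[:, None]  # column of row indices
    xs = torch.arange(size, dtype=torch.float32)[None, :]  # row of column indices
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
```

Because the grid is built over the full heatmap and the exponential decays smoothly, keypoints near the border are handled without explicit boundary cases.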

Part B: Network Architectures

Create model.py with both heatmap and regression networks:

import torch
import torch.nn as nn
import torch.nn.functional as F

class HeatmapNet(nn.Module):
    def __init__(self, num_keypoints=5):
        """
        Initialize the heatmap regression network.
        
        Args:
            num_keypoints: Number of keypoints to detect
        """
        super().__init__()
        self.num_keypoints = num_keypoints
        
        # Encoder (downsampling path)
        # Input: [batch, 1, 128, 128]
        # Progressively downsample to extract features
        
        # Decoder (upsampling path)
        # Progressively upsample back to heatmap resolution
        # Output: [batch, num_keypoints, 64, 64]
        
        # Skip connections between encoder and decoder
        pass
    
    def forward(self, x):
        """
        Forward pass.
        
        Args:
            x: Input tensor of shape [batch, 1, 128, 128]
            
        Returns:
            heatmaps: Tensor of shape [batch, num_keypoints, 64, 64]
        """
        pass

class RegressionNet(nn.Module):
    def __init__(self, num_keypoints=5):
        """
        Initialize the direct regression network.
        
        Args:
            num_keypoints: Number of keypoints to detect
        """
        super().__init__()
        self.num_keypoints = num_keypoints
        
        # Use same encoder architecture as HeatmapNet
        # But add global pooling and fully connected layers
        # Output: [batch, num_keypoints * 2]
        pass
    
    def forward(self, x):
        """
        Forward pass.
        
        Args:
            x: Input tensor of shape [batch, 1, 128, 128]
            
        Returns:
            coords: Tensor of shape [batch, num_keypoints * 2]
                   Values in range [0, 1] (normalized coordinates)
        """
        pass

Architecture Specifications:

  1. Encoder (shared between both networks):

    • Conv1: Conv(1→32) → BN → ReLU → MaxPool (128→64)
    • Conv2: Conv(32→64) → BN → ReLU → MaxPool (64→32)
    • Conv3: Conv(64→128) → BN → ReLU → MaxPool (32→16)
    • Conv4: Conv(128→256) → BN → ReLU → MaxPool (16→8)
  2. HeatmapNet Decoder:

    • Deconv4: ConvTranspose(256→128) → BN → ReLU (8→16)
    • Concat with Conv3 output (skip connection)
    • Deconv3: ConvTranspose(256→64) → BN → ReLU (16→32)
    • Concat with Conv2 output (skip connection)
    • Deconv2: ConvTranspose(128→32) → BN → ReLU (32→64)
    • Final: Conv(32→num_keypoints) (no activation)
  3. RegressionNet Head:

    • Global Average Pooling
    • FC1: Linear(256→128) → ReLU → Dropout(0.5)
    • FC2: Linear(128→64) → ReLU → Dropout(0.5)
    • FC3: Linear(64→num_keypoints*2) → Sigmoid
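
The channel bookkeeping for the skip connections is the easiest place to make a mistake, so here is one decoder step worked out (the kernel_size=2, stride=2 choice is an assumption; any ConvTranspose configuration that doubles the spatial size works):

```python
import torch
import torch.nn as nn

# One decoder step: Deconv4 upsamples the bottleneck, then concatenates Conv3's skip.
deconv4 = nn.Sequential(
    nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2),  # 8 -> 16
    nn.BatchNorm2d(128),
    nn.ReLU(),
)

feat4 = torch.randn(1, 256, 8, 8)       # Conv4 output (bottleneck)
feat3 = torch.randn(1, 128, 16, 16)     # Conv3 output (skip connection)
up = deconv4(feat4)                     # [1, 128, 16, 16]
merged = torch.cat([up, feat3], dim=1)  # [1, 256, 16, 16] -> input to Deconv3
```

This is why Deconv3 takes 256 input channels in the spec: 128 upsampled + 128 skipped.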

Part C: Training Implementation

Create train.py to train both models:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
import json

def train_heatmap_model(model, train_loader, val_loader, num_epochs=30):
    """
    Train the heatmap-based model.
    
    Uses MSE loss between predicted and target heatmaps.
    """
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    # Training loop
    # Log losses and save best model
    pass

def train_regression_model(model, train_loader, val_loader, num_epochs=30):
    """
    Train the direct regression model.
    
    Uses MSE loss between predicted and target coordinates.
    """
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    # Training loop
    # Log losses and save best model
    pass

def main():
    # Train both models with same data
    # Save training logs for comparison
    pass

if __name__ == '__main__':
    main()

Training specifications:

  • Train both models for 30 epochs
  • Use Adam optimizer with lr=0.001
  • Batch size: 32
  • Save models as heatmap_model.pth and regression_model.pth
  • Log training/validation loss to training_log.json

Part D: Evaluation Metrics

Create evaluate.py to compute PCK (Percentage of Correct Keypoints):

import torch
import numpy as np
import matplotlib.pyplot as plt

def extract_keypoints_from_heatmaps(heatmaps):
    """
    Extract (x, y) coordinates from heatmaps.
    
    Args:
        heatmaps: Tensor of shape [batch, num_keypoints, H, W]
        
    Returns:
        coords: Tensor of shape [batch, num_keypoints, 2]
    """
    # Find argmax location in each heatmap
    # Convert to (x, y) coordinates
    pass

def compute_pck(predictions, ground_truths, thresholds, normalize_by='bbox'):
    """
    Compute PCK at various thresholds.
    
    Args:
        predictions: Tensor of shape [N, num_keypoints, 2]
        ground_truths: Tensor of shape [N, num_keypoints, 2]
        thresholds: List of threshold values (as fraction of normalization)
        normalize_by: 'bbox' for bounding box diagonal, 'torso' for torso length
        
    Returns:
        pck_values: Dict mapping threshold to accuracy
    """
    # For each threshold:
    # Count keypoints within threshold distance of ground truth
    pass

def plot_pck_curves(pck_heatmap, pck_regression, save_path):
    """
    Plot PCK curves comparing both methods.
    """
    pass

def visualize_predictions(image, pred_keypoints, gt_keypoints, save_path):
    """
    Visualize predicted and ground truth keypoints on image.
    """
    pass
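
The two evaluation stubs above reduce to an argmax and a thresholded distance. A sketch of both (coordinates come out in heatmap space, so scale them back to 128-px image space before comparing against ground truth; names are illustrative):

```python
import torch

def heatmaps_to_coords(heatmaps):
    # Argmax of each heatmap -> (x, y); heatmaps: [B, K, H, W].
    B, K, H, W = heatmaps.shape
    flat = heatmaps.view(B, K, -1).argmax(dim=-1)
    return torch.stack([flat % W, flat // W], dim=-1).float()  # (col, row) = (x, y)

def pck(pred, gt, threshold, norm):
    # Fraction of keypoints within threshold * norm pixels of ground truth.
    dist = torch.linalg.norm(pred - gt, dim=-1)  # [N, K]
    return (dist <= threshold * norm).float().mean().item()
```

Sweeping `threshold` over [0.05, 0.1, 0.15, 0.2] and plotting the results gives the required PCK curves.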

Part E: Comparative Analysis

Create baseline.py for additional experiments:

def ablation_study(dataset, model_class):
    """
    Conduct ablation studies on key hyperparameters.
    
    Experiments to run:
    1. Effect of heatmap resolution (32x32 vs 64x64 vs 128x128)
    2. Effect of Gaussian sigma (1.0, 2.0, 3.0, 4.0)
    3. Effect of skip connections (with vs without)
    """
    # Run experiments and save results
    pass

def analyze_failure_cases(model, test_loader):
    """
    Identify and visualize failure cases.
    
    Find examples where:
    1. Heatmap succeeds but regression fails
    2. Regression succeeds but heatmap fails
    3. Both methods fail
    """
    pass

Deliverables

Your problem2/ directory must contain:

  1. All code files as specified above

  2. results/training_log.json with training curves for both methods

  3. results/heatmap_model.pth and results/regression_model.pth

  4. results/visualizations/ containing:

    • PCK curves comparing both methods
    • Predicted heatmaps at different training stages
    • Sample predictions from both methods on test images
    • Failure case analysis

Your report must include:

  • PCK curves at thresholds [0.05, 0.1, 0.15, 0.2]
  • Analysis of why the heatmap approach works better (or worse) than direct regression
  • Ablation study results showing effect of sigma and resolution
  • Visualization of learned heatmaps and failure cases

Submission Requirements

Your GitHub repository must follow this exact structure:

ee641-hw1-[username]/
├── problem1/
│   ├── model.py
│   ├── dataset.py
│   ├── loss.py
│   ├── train.py
│   ├── evaluate.py
│   ├── utils.py
│   └── results/
│       ├── training_log.json
│       ├── best_model.pth
│       └── visualizations/
├── problem2/
│   ├── model.py
│   ├── dataset.py
│   ├── train.py
│   ├── evaluate.py
│   ├── baseline.py
│   └── results/
│       ├── training_log.json
│       ├── heatmap_model.pth
│       ├── regression_model.pth
│       └── visualizations/
├── report.pdf
└── README.md

The README.md in your repository root must contain:

  • Your full name
  • USC email address
  • Instructions to run each problem if they differ from the standard commands
  • Any implementation notes

Testing Your Submission

Before submitting:

  1. Your repository structure must match the layout above exactly
  2. python train.py must run without errors in each problem directory
  3. All output files must be generated in the correct locations