Homework #3: Attention Mechanisms and Transformers

EE 641: Fall 2025

Assignment Details

Assigned: 15 October
Due: Friday, 31 October at 23:59 (UPDATED)

Submission: Gradescope via GitHub repository

Requirements
  • PyTorch >= 2.0 must be installed
  • Allowed libraries: PyTorch, NumPy, matplotlib, and Python standard library
  • No other external libraries permitted (including no torchvision.models, pre-trained models, or attention library implementations)

Overview

In this assignment you will implement the core attention mechanisms that power modern transformer architectures. The first problem focuses on building multi-head self-attention from scratch and analyzing what different attention heads learn through a multi-digit addition task. The second problem explores positional encoding strategies and their impact on length generalization.

Getting Started

Download the starter code: hw3-starter.zip.

Extract the starter code:

unzip hw3-starter.zip
cd hw3-starter

Problem 1: Multi-Head Attention - Multi-Digit Addition

Implement scaled dot-product attention and multi-head attention from scratch. Train on a multi-digit addition task and analyze what different attention heads learn.

Part A: Dataset and Data Loading

You will work with a multi-digit addition dataset where the model must add two 3-digit numbers with carry propagation.

Generate the dataset:

cd problem1
python generate_data.py --seed 641 --num-digits 3

The dataset contains:

  • Input: two 3-digit numbers separated by a '+' token, encoded as 10 (e.g., [3, 4, 7, 10, 1, 5, 9])
  • Output: the sum padded to 4 digits (e.g., [0, 5, 0, 5] for 505)
  • Training samples: 10,000
  • Validation samples: 2,000
  • Test samples: 2,000

Example training samples:

{"input": [5, 0, 7, 10, 1, 5, 9], "target": [0, 6, 6, 6]}
{"input": [1, 5, 5, 10, 4, 9, 1], "target": [0, 6, 4, 6]}
{"input": [3, 9, 4, 10, 3, 1, 6], "target": [0, 7, 1, 0]}

Here token 10 represents the '+' operator (507 + 159 = 666, 155 + 491 = 646, 394 + 316 = 710).

The starter code provides dataset.py with the data loader.
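As a shape reference, here is a minimal loader sketch. The class name `AdditionDataset` and the in-memory construction are assumptions, not the starter-code API; the provided dataset.py defines the actual interface and presumably reads these dicts from the generated JSON files.

```python
import torch
from torch.utils.data import Dataset

class AdditionDataset(Dataset):
    """Wraps addition samples; token 10 encodes the '+' operator."""

    def __init__(self, samples):
        # assumption: samples is a list of dicts with "input" and "target" keys,
        # matching the JSON records shown above
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        s = self.samples[idx]
        return (torch.tensor(s["input"], dtype=torch.long),
                torch.tensor(s["target"], dtype=torch.long))

# example: wrap one of the training samples shown above
ds = AdditionDataset([{"input": [5, 0, 7, 10, 1, 5, 9], "target": [0, 6, 6, 6]}])
x, y = ds[0]
```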

Part B: Attention Implementation

Implement attention.py with the core attention mechanisms.
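At minimum this means scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, plus a multi-head wrapper that splits the model dimension across heads. A sketch of the standard formulation follows; the function and class names are suggestions, not the starter-code API.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq, d_k). Returns (output, attention weights)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, heads, seq, seq)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)  # rows sum to 1
    return weights @ v, weights

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        b, t, _ = x.shape

        def split(proj):
            # (b, t, d_model) -> (b, heads, t, d_k)
            return proj(x).view(b, t, self.num_heads, self.d_k).transpose(1, 2)

        out, weights = scaled_dot_product_attention(
            split(self.w_q), split(self.w_k), split(self.w_v), mask)
        out = out.transpose(1, 2).contiguous().view(b, t, -1)  # merge heads
        return self.w_o(out), weights

x = torch.randn(2, 7, 32)          # batch of 2 length-7 sequences
mha = MultiHeadAttention(32, 4)
out, w = mha(x)
```

Returning the per-head weights from forward, as above, is one convenient way to support the attention analysis in Part E.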

Part C: Model Architecture

Implement a sequence-to-sequence transformer in model.py.
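One way to wire your attention module into an encoder block is the pre-norm layout sketched below. This is an assumption about the architecture, not the starter-code design; `_IdentityAttn` is only a stand-in for shape checking and should be replaced by your MultiHeadAttention from attention.py.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm encoder block: x + attn(norm(x)), then x + ff(norm(x))."""

    def __init__(self, attn, d_model, d_ff):
        super().__init__()
        self.attn = attn  # your MultiHeadAttention from attention.py
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x, mask=None):
        a, _ = self.attn(self.norm1(x), mask)  # attn returns (output, weights)
        x = x + a                              # residual connection
        return x + self.ff(self.norm2(x))      # second residual

class _IdentityAttn(nn.Module):
    """Stand-in attention for shape checking only: passes input through."""
    def forward(self, x, mask=None):
        return x, None

block = TransformerBlock(_IdentityAttn(), d_model=16, d_ff=32)
y = block(torch.randn(2, 5, 16))
```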

Part D: Training

Train your model using the provided training script.

Part E: Attention Analysis

Implement analyze.py to extract and visualize attention patterns.
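A sketch of per-head heatmap plotting with matplotlib, suitable for the results/attention_patterns/ deliverable. The function name and signature are assumptions; adapt them to however your model exposes its attention weights.

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import torch

def plot_attention_heads(weights, tokens, out_path):
    """weights: (num_heads, tgt_len, src_len) attention for one sample."""
    num_heads = weights.shape[0]
    fig, axes = plt.subplots(1, num_heads, figsize=(3 * num_heads, 3))
    axes = axes if num_heads > 1 else [axes]
    for h, ax in enumerate(axes):
        # each row of a head's weight matrix sums to 1, so [0, 1] is a fair scale
        ax.imshow(weights[h].detach().cpu().numpy(),
                  cmap="viridis", vmin=0.0, vmax=1.0)
        ax.set_title(f"head {h}")
        ax.set_xticks(range(len(tokens)))
        ax.set_xticklabels(tokens)
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)

# example call on random weights for a 4-token input
import os, tempfile
path = os.path.join(tempfile.gettempdir(), "attn_demo.png")
plot_attention_heads(torch.rand(2, 4, 4), ["3", "+", "1", "5"], path)
```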

Deliverables

Your problem1/ directory must contain:

  1. All code files as specified above
  2. results/training_log.json with loss curves and accuracy metrics
  3. results/best_model.pth - saved model weights
  4. results/attention_patterns/ containing:
    • Heatmaps for each attention head
    • Example visualizations on test cases
  5. results/head_analysis/ containing:
    • Head ablation results
    • Head importance rankings

Your report must include analysis of:

  • Attention pattern visualizations from at least 4 different heads
  • Head ablation study: which heads are critical vs redundant?
  • Discussion: How do attention heads specialize for carry propagation?
  • Quantitative results: percentage of heads that can be pruned with minimal accuracy loss

Problem 2: Positional Encoding and Length Extrapolation

Implement three positional encoding strategies and test their ability to generalize to sequences longer than those seen during training.

Part A: Dataset and Data Loading

You will work with a sorting detection dataset where the model must determine if a sequence of integers is sorted in ascending order.

Generate the dataset:

cd problem2
python scripts/generate_data.py --seed 641

The dataset contains:

  • Task: binary classification (1 = sorted, 0 = unsorted)
  • Training: sequences of length 8-16 with integers 0-99
  • Testing: sequences of length 32, 64, 128, 256 (for extrapolation analysis)
  • Training samples: 10,000 (50% sorted, 50% unsorted)
  • Validation samples: 2,000

Example training samples:

{"sequence": [3, 15, 27, 41, 58, 72, 83, 94], "is_sorted": 1, "length": 8}
{"sequence": [15, 3, 72, 27, 41, 94, 58, 83], "is_sorted": 0, "length": 8}
{"sequence": [1, 5, 12, 18, 23, 29, 34, 40, 47, 53, 61, 68], "is_sorted": 1, "length": 12}

The starter code provides src/dataset.py with the data loader.
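For intuition, here is one way a single sample matching the JSON fields above could be generated. This is a hedged sketch; scripts/generate_data.py defines the actual procedure.

```python
import random

def make_sample(length, rng):
    """Generate one {"sequence", "is_sorted", "length"} record."""
    seq = sorted(rng.choices(range(100), k=length))  # integers 0-99
    is_sorted = rng.random() < 0.5                   # 50/50 class balance
    if not is_sorted:
        rng.shuffle(seq)
        # a shuffle can land sorted by chance; reshuffle until it is not
        while seq == sorted(seq):
            rng.shuffle(seq)
    return {"sequence": seq, "is_sorted": int(is_sorted), "length": length}

rng = random.Random(641)  # the assignment's seed
sample = make_sample(8, rng)
```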

Part B: Positional Encoding Implementations

Implement src/positional_encoding.py with three encoding strategies.
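The three strategies correspond to the results/sinusoidal, results/learned, and results/none directories listed under Deliverables. Below is a sketch of the fixed sinusoidal table from Vaswani et al., which is defined for any position and so can be evaluated past the training lengths; the learned variant is shown only as a one-line contrast. Exact names are assumptions about the starter code.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(...)."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))               # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# Learned encoding: a trainable table, undefined past its training length
learned = nn.Embedding(16, 64)  # rows exist only for positions 0..15
# "None" encoding: add no positional information at all

pe = sinusoidal_encoding(256, 64)  # valid at extrapolation lengths, e.g. 256
```

Note how the sinusoidal table at length 256 requires no parameters beyond those seen at training lengths, which is central to the extrapolation comparison in Part E.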

Part C: Model Architecture

The transformer encoder is provided in src/model.py.

Part D: Training

Train your model using the provided training script.

Part E: Extrapolation Analysis

Implement analyze.py to test models on longer sequences.
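A sketch of the evaluation loop over the test lengths (32, 64, 128, 256). The function name and the loaders-by-length dict layout are assumptions; the point is simply per-length accuracy under torch.no_grad().

```python
import torch

@torch.no_grad()
def accuracy_by_length(model, loaders_by_length, device="cpu"):
    """Map each test length to the model's classification accuracy at that length."""
    model.eval()
    results = {}
    for length, loader in loaders_by_length.items():
        correct = total = 0
        for seq, label in loader:
            logits = model(seq.to(device))        # (batch, 2) class logits
            pred = logits.argmax(dim=-1)
            correct += (pred == label.to(device)).sum().item()
            total += label.numel()
        results[length] = correct / total
    return results
```

Running this once per trained encoding variant yields the data for extrapolation_results.json and the accuracy-vs-length plot.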

Deliverables

Your problem2/ directory must contain:

  1. All code files as specified above
  2. results/sinusoidal/ containing:
    • best_model.pth - saved model weights
    • training_log.json - loss curves and accuracy
    • training_curves.png - visualization
  3. results/learned/ with same structure
  4. results/none/ with same structure
  5. results/extrapolation/ containing:
    • extrapolation_results.json - accuracy data
    • extrapolation_curves.png - main result plot
    • learned_position_embeddings.png - position visualization

Your report must include analysis of:

  • Extrapolation curves showing accuracy vs. sequence length for all three methods
  • Mathematical explanation: Why does sinusoidal encoding extrapolate while learned encoding fails?
  • Position embedding visualization for learned encoding
  • Quantitative comparison: accuracy at lengths 32, 64, 128, 256

Submission Requirements

Your GitHub repository must follow this exact structure:

ee641-hw3-[username]/
├── problem1/
│   ├── attention.py
│   ├── model.py
│   ├── dataset.py
│   ├── train.py
│   ├── analyze.py
│   └── results/
│       ├── training_log.json
│       ├── best_model.pth
│       ├── attention_patterns/
│       └── head_analysis/
├── problem2/
│   ├── positional_encoding.py
│   ├── model.py
│   ├── dataset.py
│   ├── train.py
│   ├── analyze.py
│   └── results/
│       ├── training_logs/
│       ├── best_models/
│       ├── extrapolation_curves.png
│       └── position_viz/
├── report.pdf
└── README.md

The README.md in your repository root must contain:

  • Your full name
  • USC email address
  • Instructions to run each problem if they differ from the standard commands
  • Any implementation notes

Testing Your Submission

Before submitting:

  1. Your repository structure must match the requirement exactly
  2. python train.py and python analyze.py must run without errors in each problem directory
  3. All output files must be generated in the correct locations