Homework #3: Attention Mechanisms and Transformers
EE 641: Fall 2025
Assigned: 15 October
Due: Friday, 31 October at 23:59 (UPDATED)
Submission: Gradescope via GitHub repository
Requirements:
- PyTorch >= 2.0 must be installed
- Allowed libraries: PyTorch, NumPy, matplotlib, and Python standard library
- No other external libraries permitted (including no torchvision.models, pre-trained models, or attention library implementations)
Overview
In this assignment you will implement the core attention mechanisms that power modern transformer architectures. The first problem focuses on building multi-head self-attention from scratch and analyzing what different attention heads learn through a multi-digit addition task. The second problem explores positional encoding strategies and their impact on length generalization.
Getting Started
Download the starter code: hw3-starter.zip.
Extract the starter code:
unzip hw3-starter.zip
cd hw3-starter

Problem 1: Multi-Head Attention - Multi-Digit Addition
Implement scaled dot-product attention and multi-head attention from scratch. Train on a multi-digit addition task and analyze what different attention heads learn.
Part A: Dataset and Data Loading
You will work with a multi-digit addition dataset where the model must add two 3-digit numbers with carry propagation.
Generate the dataset:
cd problem1
python generate_data.py --seed 641 --num-digits 3

The dataset contains:
- Input: two 3-digit numbers separated by a ‘+’ token (e.g., [3, 4, 7, +, 1, 5, 9])
- Output: the sum padded to 4 digits (e.g., [0, 5, 0, 5] for 505)
- Training samples: 10,000
- Validation samples: 2,000
- Test samples: 2,000
Example training samples:
{"input": [5, 0, 7, 10, 1, 5, 9], "target": [0, 6, 6, 6]}
{"input": [1, 5, 5, 10, 4, 9, 1], "target": [0, 6, 4, 6]}
{"input": [3, 9, 4, 10, 3, 1, 6], "target": [0, 7, 1, 0]}Where token 10 represents the ‘+’ operator (507 + 159 = 666, 155 + 491 = 646, 394 + 316 = 710).
The starter code provides dataset.py with the data loader.
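For orientation, a minimal loader along these lines might look like the sketch below. The JSON-lines format and field names are taken from the examples above, but the file path and class name are illustrative; the provided dataset.py is authoritative.

import json
import torch
from torch.utils.data import Dataset

class AdditionDataset(Dataset):
    """Loads addition samples stored as one JSON object per line (assumed format)."""

    def __init__(self, path):
        with open(path) as f:
            self.samples = [json.loads(line) for line in f]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]
        # Inputs are 7 tokens (digit, digit, digit, '+' = 10, digit, digit, digit); targets are 4 digits.
        x = torch.tensor(sample["input"], dtype=torch.long)
        y = torch.tensor(sample["target"], dtype=torch.long)
        return x, y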
Part B: Attention Implementation
Implement attention.py with the core attention mechanisms:
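To make the expected behavior concrete, here is a minimal sketch of scaled dot-product attention and a multi-head wrapper that also returns the attention weights (you will need them for Part E). Class and function names are illustrative; follow whatever interface the starter code specifies.

import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)           # (batch, heads, q_len, k_len)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights                                  # attended values and attention map

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        b = query.size(0)
        # Project, then split into heads: (batch, heads, seq_len, d_k)
        q = self.w_q(query).view(b, -1, self.num_heads, self.d_k).transpose(1, 2)
        k = self.w_k(key).view(b, -1, self.num_heads, self.d_k).transpose(1, 2)
        v = self.w_v(value).view(b, -1, self.num_heads, self.d_k).transpose(1, 2)
        out, weights = scaled_dot_product_attention(q, k, v, mask)
        # Merge heads back into d_model and apply the output projection.
        out = out.transpose(1, 2).contiguous().view(b, -1, self.num_heads * self.d_k)
        return self.w_o(out), weights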
Part C: Model Architecture
Implement a sequence-to-sequence transformer in model.py:
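The exact architecture is defined by the starter code, but each layer typically combines the multi-head attention from Part B with a position-wise feed-forward sublayer, residual connections, and layer normalization. A sketch of one such block (pre-norm variant shown; the starter code may prescribe post-norm or a full encoder-decoder layout):

import torch.nn as nn

class TransformerBlock(nn.Module):
    """One layer: self-attention + feed-forward, each with a residual connection and LayerNorm."""

    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)   # from attention.py (Part B)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        h = self.norm1(x)
        attn_out, attn_weights = self.attn(h, h, h, mask)
        x = x + self.dropout(attn_out)
        x = x + self.dropout(self.ff(self.norm2(x)))
        return x, attn_weights       # keep the weights so analyze.py can inspect them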
Part D: Training
Train your model using the provided training script:
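From the problem directory (any hyperparameter flags are defined by the starter script):

cd problem1
python train.py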
Part E: Attention Analysis
Implement analyze.py to extract and visualize attention patterns:
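A common approach is to run the trained model on test examples, collect the per-head attention weights returned by the forward pass, and render each head as a heatmap. An illustrative sketch (the function name and tensor layout are assumptions, not requirements):

import matplotlib.pyplot as plt

def plot_attention_heads(attn_weights, tokens, out_path):
    """attn_weights: (num_heads, seq_len, seq_len) attention map for one example."""
    num_heads = attn_weights.shape[0]
    fig, axes = plt.subplots(1, num_heads, figsize=(3 * num_heads, 3))
    if num_heads == 1:
        axes = [axes]
    for h, ax in enumerate(axes):
        ax.imshow(attn_weights[h].detach().cpu(), cmap="viridis")
        ax.set_title(f"head {h}")
        ax.set_xticks(range(len(tokens)))
        ax.set_xticklabels(tokens, rotation=90)
        ax.set_yticks(range(len(tokens)))
        ax.set_yticklabels(tokens)
    fig.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)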
Deliverables
Your problem1/ directory must contain:
- All code files as specified above
- results/training_log.json with loss curves and accuracy metrics
- results/best_model.pth - saved model weights
- results/attention_patterns/ containing:
  - Heatmaps for each attention head
  - Example visualizations on test cases
- results/head_analysis/ containing:
  - Head ablation results
  - Head importance rankings
Your report must include analysis of:
- Attention pattern visualizations from at least 4 different heads
- Head ablation study: which heads are critical vs redundant?
- Discussion: How do attention heads specialize for carry propagation?
- Quantitative results: percentage of heads that can be pruned with minimal accuracy loss
Problem 2: Positional Encoding and Length Extrapolation
Implement three positional encoding strategies and test their ability to generalize to sequences longer than those seen during training.
Part A: Dataset and Data Loading
You will work with a sorting detection dataset where the model must determine if a sequence of integers is sorted in ascending order.
Generate the dataset:
cd problem2
python scripts/generate_data.py --seed 641

The dataset contains:
- Task: binary classification (1 = sorted, 0 = unsorted)
- Training: sequences of length 8-16 with integers 0-99
- Testing: sequences of length 32, 64, 128, 256 (for extrapolation analysis)
- Training samples: 10,000 (50% sorted, 50% unsorted)
- Validation samples: 2,000
Example training samples:
{"sequence": [3, 15, 27, 41, 58, 72, 83, 94], "is_sorted": 1, "length": 8}
{"sequence": [15, 3, 72, 27, 41, 94, 58, 83], "is_sorted": 0, "length": 8}
{"sequence": [1, 5, 12, 18, 23, 29, 34, 40, 47, 53, 61, 68], "is_sorted": 1, "length": 12}The starter code provides src/dataset.py with the data loader:
Part B: Positional Encoding Implementations
Implement src/positional_encoding.py with three encoding strategies:
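The three strategies are typically fixed sinusoidal encoding, a learned position-embedding table, and no positional encoding as a baseline. Below is a sketch of the first two, using the standard formulation PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); class names and defaults are illustrative, so match whatever interface the starter code expects.

import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Fixed sinusoidal encoding; defined for any position without training."""

    def __init__(self, d_model, max_len=512):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)                                   # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)                                                  # not a learned parameter

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        return x + self.pe[: x.size(1)]

class LearnedPositionalEncoding(nn.Module):
    """Learned embedding per position; only defined up to max_len, which is typically
    set to the maximum training length, so longer sequences have no trained embedding."""

    def __init__(self, d_model, max_len=16):
        super().__init__()
        self.pe = nn.Embedding(max_len, d_model)

    def forward(self, x):
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.pe(positions)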
Part C: Model Architecture
The transformer encoder is provided in src/model.py.
Part D: Training
Train your model using the provided training script:
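From the problem directory, one run per encoding variant populates results/sinusoidal/, results/learned/, and results/none/. The --encoding flag below is a guess at the interface; check the starter train.py for its actual arguments.

cd problem2
python train.py --encoding sinusoidal   # --encoding is a hypothetical flag; see train.py
python train.py --encoding learned
python train.py --encoding none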
Part E: Extrapolation Analysis
Implement analyze.py to test models on longer sequences:
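A typical structure: load each trained model, measure accuracy on the held-out splits at lengths 32, 64, 128, and 256, and plot accuracy against sequence length. A sketch is below; data loading and model construction are omitted, and the model is assumed to take a padded batch plus mask and return class logits.

import matplotlib.pyplot as plt
import torch

@torch.no_grad()
def evaluate_lengths(model, loaders_by_length):
    """loaders_by_length: dict mapping sequence length -> DataLoader of (x, mask, y) batches."""
    model.eval()
    results = {}
    for length, loader in loaders_by_length.items():
        correct, total = 0, 0
        for x, mask, y in loader:
            logits = model(x, mask)                       # (batch, 2) class logits
            correct += (logits.argmax(dim=-1) == y).sum().item()
            total += y.size(0)
        results[length] = correct / total
    return results

def plot_extrapolation(all_results, out_path):
    """all_results: dict mapping encoding name -> {length: accuracy}."""
    for name, res in all_results.items():
        lengths = sorted(res)
        plt.plot(lengths, [res[l] for l in lengths], marker="o", label=name)
    plt.xlabel("sequence length")
    plt.ylabel("accuracy")
    plt.legend()
    plt.savefig(out_path, dpi=150)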
Deliverables
Your problem2/ directory must contain:
- All code files as specified above
- results/sinusoidal/ containing:
  - best_model.pth - saved model weights
  - training_log.json - loss curves and accuracy
  - training_curves.png - visualization
- results/learned/ with the same structure
- results/none/ with the same structure
- results/extrapolation/ containing:
  - extrapolation_results.json - accuracy data
  - extrapolation_curves.png - main result plot
  - learned_position_embeddings.png - position visualization
Your report must include analysis of:
- Extrapolation curves showing accuracy vs. sequence length for all three methods
- Mathematical explanation: Why does sinusoidal encoding extrapolate while learned encoding fails?
- Position embedding visualization for learned encoding
- Quantitative comparison: accuracy at lengths 32, 64, 128, 256
Submission Requirements
Your GitHub repository must follow this exact structure:
ee641-hw3-[username]/
├── problem1/
│ ├── attention.py
│ ├── model.py
│ ├── dataset.py
│ ├── train.py
│ ├── analyze.py
│ └── results/
│ ├── training_log.json
│ ├── best_model.pth
│ ├── attention_patterns/
│ └── head_analysis/
├── problem2/
│ ├── positional_encoding.py
│ ├── model.py
│ ├── dataset.py
│ ├── train.py
│ ├── analyze.py
│ └── results/
│ ├── training_logs/
│ ├── best_models/
│ ├── extrapolation_curves.png
│ └── position_viz/
├── report.pdf
└── README.md
The README.md in your repository root must contain:
- Your full name
- USC email address
- Instructions to run each problem if they differ from the standard commands
- Any implementation notes
Before submitting:
- Your repository structure must match the structure above exactly
- python train.py and python analyze.py must run without errors in each problem directory
- All output files must be generated in the correct locations