
EE 641 - Unit 6
Fall 2025
[Attention] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[Tutorial] J. Alammar, “The Illustrated Transformer,” 2018.
[Evaluation] G. Tang, M. Müller, A. Rios, and R. Sennrich, “Why self-attention? A targeted evaluation of neural machine translation architectures,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 4263–4272.
[Position] P. Shaw, J. Uszkoreit, and A. Vaswani, “Self-attention with relative position representations,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018, pp. 464–468.
[Analysis] E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov, “Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 5797–5808.
[Training] R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T. Liu, “On layer normalization in the transformer architecture,” in International Conference on Machine Learning, 2020, pp. 10524–10533.
[Theory] M. Phuong and M. Hutter, “Formal algorithms for transformers,” 2022.
[Survey] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler, “Efficient transformers: A survey,” ACM Computing Surveys, vol. 55, no. 6, pp. 1–28, 2022.
[Theory] M. Hahn, “Theoretical limitations of self-attention in neural sequence models,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 156–171, 2020.
[Scaling] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” 2020.
Architecture Components
Encoder: \(\mathbf{H} = \text{Encoder}(\mathbf{X})\)
Decoder: \(p(\mathbf{y}_t | \mathbf{y}_{<t}, \mathbf{H})\)

Encoder and decoder can use any architecture: RNN, CNN, or attention-based mechanisms.

Encoder: Bidirectional context. Decoder: Causal constraint.
From Lecture 05: Attention Mechanism
Decoder queries encoder states: \[\mathbf{c}_t = \sum_{j=1}^{T_s} \alpha_{tj} \mathbf{h}_j\]
Where attention weights: \[\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{T_s} \exp(e_{tk})}\]
This is cross-attention:

Encoder representations are computed once and cached for all decoder timesteps.
Fundamental constraint:
\[\mathbf{h}_t = f(\mathbf{x}_t, \mathbf{h}_{t-1})\]
Fundamental limitation:
Applies to encoding:

RNN hidden state update rule:
\[\mathbf{h}_t = \tanh(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t + \mathbf{b})\]
Where:
Critical observation:
GPU architecture mismatch:

Consider sequence length \(T=100\), hidden dimension \(d=512\):
RNN Encoder (Sequential):
Forward pass operations:
Critical path length: \(T\) operations
Gradient computation (BPTT):
Self-Attention (Parallel - Preview):
Forward pass operations:
Critical path length: \(O(1)\) operations
When is self-attention faster?
Attention trades increased computation (\(T^2\) vs \(T\)) for reduced latency (\(O(1)\) vs \(O(T)\) sequential operations).
Recall: Vanishing gradients in sequential models
RNN gradient through backpropagation through time (BPTT): \[\frac{\partial L}{\partial \mathbf{h}_1} = \frac{\partial L}{\partial \mathbf{h}_T} \prod_{t=2}^T \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}}\]
Quantifying the decay:
For position \(j\) in a sequence of length \(T\): \[\left\|\frac{\partial L}{\partial \mathbf{h}_j}\right\| \approx \left\|\frac{\partial L}{\partial \mathbf{h}_T}\right\| \cdot \gamma^{T-j}\]
where \(\gamma = \|\frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}}\|\) is the spectral norm of the recurrent Jacobian.
For tanh activation: \(\gamma \lesssim 0.5\)
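The decay formula above can be made concrete with a small numeric sketch, assuming a constant per-step Jacobian norm \(\gamma\) (the simplification used in the analysis; the function name is illustrative):

```python
# Sketch: gradient norm decay through an RNN under the constant-gamma
# approximation ||dL/dh_j|| ~ ||dL/dh_T|| * gamma^(T-j).
def gradient_norm_at(j, T, gamma=0.5, grad_at_T=1.0):
    """Approximate gradient norm reaching position j."""
    return grad_at_T * gamma ** (T - j)

# For T=100: one step back retains half the gradient,
# but position 1 receives essentially nothing.
g_late = gradient_norm_at(99, 100)   # 0.5
g_early = gradient_norm_at(1, 100)   # 0.5^99, on the order of 1e-30
```

With \(\gamma = 0.5\) and \(T = 100\), the gradient at the first position is attenuated by roughly 30 orders of magnitude, which is why early tokens are effectively invisible to the loss.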
Attention mechanism gradient: \[\frac{\partial L}{\partial \mathbf{h}_j} = \sum_{t=1}^{T_t} \alpha_{tj} \frac{\partial L}{\partial \mathbf{c}_t}\]
Attention mechanisms achieve:
Remaining limitations:
Transformer components:
Self-attention for encoding
Positional encoding

Traditional view:
Alternative approach:
What changes:

Cross-attention:
Self-attention:
Mathematical form:
For input sequence \(\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_T] \in \mathbb{R}^{T \times d}\):
\[\mathbf{h}_t = \sum_{j=1}^T \alpha_{tj} \mathbf{x}_j\]
where \(\alpha_{tj}\) measures similarity between position \(t\) and position \(j\).

Diagonal elements (self-attention) are non-zero. Each position can attend to itself.

For \(T=100\) tokens, RNN requires 100 sequential operations. Self-attention: 1 parallel operation.
RNN complexity:
For \(T=100\), \(d=512\):
Self-Attention complexity:
For \(T=100\), \(d=512\):

Attention has constant latency regardless of sequence length.
Attention as differentiable database lookup:
Database has:
Query:
Retrieval:
In self-attention:

Input sequence: \(\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_T] \in \mathbb{R}^{T \times d}\)
Linear projections:
\[\mathbf{Q} = \mathbf{X}\mathbf{W}_Q, \quad \mathbf{K} = \mathbf{X}\mathbf{W}_K, \quad \mathbf{V} = \mathbf{X}\mathbf{W}_V\]
Where \(\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{R}^{d \times d_k}\) are learned projection matrices, producing \(\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{T \times d_k}\).
Attention computation:
\[\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{QK}^T}{\sqrt{d_k}}\right)\mathbf{V}\]
Step by step:

Each cell \(S[i,j] = q_i^T k_j\) is one dot product between query \(i\) and key \(j\). Matrix formulation enables parallel computation of all pairwise scores.
Sequential (nested loops):
\(T^2\) sequential operations.
Parallel (matrix multiply):
All \(T^2\) dot products computed simultaneously in one GPU kernel.
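A tiny check (toy sizes, illustrative names) that the nested-loop formulation and the single matrix multiply produce identical scores:

```python
import torch

# The T^2 pairwise scores from nested loops equal one matrix multiply.
T, d_k = 5, 8
torch.manual_seed(0)
Q = torch.randn(T, d_k)
K = torch.randn(T, d_k)

# Sequential: one dot product per (i, j) pair
scores_loop = torch.empty(T, T)
for i in range(T):
    for j in range(T):
        scores_loop[i, j] = Q[i] @ K[j]

# Parallel: all T^2 dot products in a single kernel
scores_mat = Q @ K.T

assert torch.allclose(scores_loop, scores_mat, atol=1e-6)
```

The two computations are mathematically identical; only the scheduling differs, which is exactly what the GPU exploits.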
Without separation: \(\text{Attention}(\mathbf{X}, \mathbf{X}, \mathbf{X})\)
With separation: Separate projection matrices
Allows asymmetric relationships:

Example: “subject” query can match “verb” keys strongly, while “verb” query matches “object” keys.

import torch
import torch.nn.functional as F
import math

def self_attention(X, W_Q, W_K, W_V):
    """
    Self-attention mechanism.

    Args:
        X: Input tensor [batch, seq_len, d_model]
        W_Q, W_K, W_V: Projection matrices [d_model, d_k]

    Returns:
        H: Output tensor [batch, seq_len, d_k]
        attention_weights: [batch, seq_len, seq_len]
    """
    # Get dimensions
    batch_size, seq_len, d_model = X.shape
    d_k = W_Q.shape[1]

    # Project to Q, K, V
    Q = X @ W_Q  # [batch, seq_len, d_k]
    K = X @ W_K  # [batch, seq_len, d_k]
    V = X @ W_V  # [batch, seq_len, d_k]

    # Compute attention scores
    scores = Q @ K.transpose(-2, -1)  # [batch, seq_len, seq_len]
    scores = scores / math.sqrt(d_k)  # Scale by sqrt(d_k)

    # Apply softmax to get attention weights
    attention_weights = F.softmax(scores, dim=-1)  # [batch, seq_len, seq_len]

    # Weighted sum of values
    H = attention_weights @ V  # [batch, seq_len, d_k]
    return H, attention_weights

All operations are matrix multiplications: GPU-efficient and fully parallelizable.
Single attention captures one similarity function.
Language has multiple simultaneous relationships:
Solution: Multiple attention heads
Analogy: Multiple convolutional filters in CNN, each detecting different features.

Each head specializes in different attention patterns.
Divide model dimension into \(h\) heads:
\[d_{model} = h \times d_k\]
For each head \(i \in \{1, \ldots, h\}\):
\[\text{head}_i = \text{Attention}(\mathbf{X}\mathbf{W}_i^Q, \mathbf{X}\mathbf{W}_i^K, \mathbf{X}\mathbf{W}_i^V)\]
Where:
Concatenate and project:
\[\text{MultiHead}(\mathbf{X}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\mathbf{W}^O\]
Where \(\mathbf{W}^O \in \mathbb{R}^{h \cdot d_v \times d_{model}}\)

Total parameters: Same as single attention head at full dimension.
Key operations:
Critical implementation detail:
Shape transformations enable parallel computation across heads without loops.
import torch
import torch.nn.functional as F
import math

def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    """
    Multi-head self-attention with explicit reshaping.

    X: [batch, seq_len, d_model]
    W_Q, W_K, W_V: [d_model, d_model]
    W_O: [d_model, d_model]
    """
    B, T, d_model = X.shape
    d_k = d_model // num_heads

    # Linear projections: [B, T, d_model]
    Q = X @ W_Q
    K = X @ W_K
    V = X @ W_V

    # Reshape and transpose for multi-head
    # [B, T, d_model] -> [B, T, h, d_k] -> [B, h, T, d_k]
    Q = Q.view(B, T, num_heads, d_k).transpose(1, 2)
    K = K.view(B, T, num_heads, d_k).transpose(1, 2)
    V = V.view(B, T, num_heads, d_k).transpose(1, 2)

    # Scaled dot-product attention per head
    # [B, h, T, d_k] @ [B, h, d_k, T] = [B, h, T, T]
    scores = (Q @ K.transpose(-2, -1)) / math.sqrt(d_k)
    attn_weights = F.softmax(scores, dim=-1)

    # Apply attention to values
    # [B, h, T, T] @ [B, h, T, d_k] = [B, h, T, d_k]
    attn_out = attn_weights @ V

    # Concatenate heads
    # [B, h, T, d_k] -> [B, T, h, d_k] -> [B, T, h*d_k]
    attn_out = attn_out.transpose(1, 2).reshape(B, T, d_model)

    # Output projection
    output = attn_out @ W_O  # [B, T, d_model]
    return output

Dimension tracking: Transpose and reshape enable vectorized operations across heads. All \(h\) attention computations happen in parallel.
Single-head attention at full dimension:
Multi-head with \(h\) heads:
Same total parameters and FLOPs.
But with benefits:

Dimension per head: \(d_k = d_{model}/h = 512/8 = 64\) (for 8 heads)
Smaller per-head dimension improves cache locality.

Different heads specialize without supervision. Patterns emerge from data.
WMT’14 En→De Translation (from “Attention Is All You Need”):
| Heads | \(d_k\) | BLEU | Training Time |
|---|---|---|---|
| 1 | 512 | 25.1 | 1.0× |
| 2 | 256 | 26.4 | 0.98× |
| 4 | 128 | 27.1 | 1.02× |
| 8 | 64 | 27.3 | 1.0× |
| 16 | 32 | 26.9 | 1.05× |
| 32 | 16 | 25.8 | 1.12× |
Observations:

8 heads with \(d_k=64\) balances expressiveness and efficiency.

Can remove 40% of heads with <0.5 BLEU loss and 1.3× speedup. Some heads are redundant or specialized for rare patterns.
Recall: Scaled dot-product attention
\[\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{QK}^T}{\sqrt{d_k}}\right)\mathbf{V}\]
Why divide by \(\sqrt{d_k}\)?
In self-attention, all \(T\) positions compete for attention weight. Without scaling:
Problem amplified in self-attention:

Without scaling, large \(d_k\) causes attention saturation.
Assume query and key components are independent with unit variance:
\[\mathbf{q}, \mathbf{k} \sim \mathcal{N}(0, \mathbf{I})\]
Dot product without scaling:
\[\mathbf{q} \cdot \mathbf{k} = \sum_{i=1}^{d_k} q_i k_i\]
Each term \(q_i k_i\) has:
Sum of \(d_k\) independent random variables:
\[\text{Var}(\mathbf{q} \cdot \mathbf{k}) = \sum_{i=1}^{d_k} \text{Var}(q_i k_i) = d_k\]
After scaling by \(1/\sqrt{d_k}\):
\[\text{Var}\left(\frac{\mathbf{q} \cdot \mathbf{k}}{\sqrt{d_k}}\right) = \frac{1}{d_k} \text{Var}(\mathbf{q} \cdot \mathbf{k}) = \frac{d_k}{d_k} = 1\]
Variance is stabilized regardless of dimension.
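The variance argument above is easy to verify empirically with random unit-variance queries and keys (sample sizes here are illustrative):

```python
import torch

# Empirical check: with unit-variance q, k, the raw dot product has
# variance ~ d_k; dividing by sqrt(d_k) restores variance ~ 1.
torch.manual_seed(0)
d_k, n = 256, 20_000
q = torch.randn(n, d_k)
k = torch.randn(n, d_k)

raw = (q * k).sum(dim=-1)      # q . k, one sample per row
scaled = raw / d_k ** 0.5

print(raw.var().item())        # close to d_k = 256
print(scaled.var().item())     # close to 1
```

The same check with \(d_k = 64\) or \(1024\) gives variances near 64 and 1024 before scaling, and near 1 after, which is the dimension-independence the scaling buys.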

Softmax gradient:
\[\frac{\partial}{\partial e_j} \text{softmax}(e)_i = \alpha_i(\delta_{ij} - \alpha_j)\]
Where \(\alpha_i = \text{softmax}(e)_i\)
When scores are large (unscaled):
Gradient vanishes for non-dominant positions.
With scaling:
Entropy of attention distribution:
\[H(\boldsymbol{\alpha}_t) = -\sum_{j=1}^T \alpha_{tj} \log \alpha_{tj}\]
Interpretation:
Without scaling:
With scaling:

Scaling preserves entropy across dimensions, enabling diverse attention patterns.
Attention with temperature \(\tau\):
\[\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}, \tau) = \text{softmax}\left(\frac{\mathbf{QK}^T}{\tau}\right)\mathbf{V}\]
Standard scaling: \(\tau = \sqrt{d_k}\)
Temperature effects:
Application: Can adjust temperature at inference for diversity.

Temperature provides control over attention concentration. Standard \(\tau = \sqrt{d_k}\) balances sharpness and diversity.
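The temperature-entropy relationship can be sketched numerically; the score magnitudes below are illustrative stand-ins for unscaled attention logits:

```python
import torch
import torch.nn.functional as F

# Sketch: temperature controls attention entropy for a fixed score vector.
def attention_entropy(scores, tau):
    alpha = F.softmax(scores / tau, dim=-1)
    # clamp avoids 0 * log(0) = nan for fully suppressed positions
    return -(alpha * alpha.clamp_min(1e-12).log()).sum()

torch.manual_seed(0)
d_k = 64
scores = torch.randn(100) * d_k ** 0.5   # unscaled-score magnitude ~ sqrt(d_k)

h_sharp = attention_entropy(scores, tau=1.0)        # no scaling: peaked
h_std = attention_entropy(scores, tau=d_k ** 0.5)   # standard tau = sqrt(d_k)
assert h_std > h_sharp   # larger tau -> flatter distribution -> higher entropy
```

With \(\tau = 1\) the distribution collapses onto a few positions (entropy near 0); with \(\tau = \sqrt{d_k}\) entropy stays close to the structured regime described above.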
Matrix operations preserve set structure, not sequence order.
Consider self-attention computation: \[\mathbf{h}_i = \sum_{j=1}^T \alpha_{ij} \mathbf{v}_j\]
where attention weights depend only on content: \[\alpha_{ij} = \text{softmax}(\mathbf{q}_i^T \mathbf{k}_j / \sqrt{d_k})\]
Key observation:
Mathematical property: \[\text{SelfAttention}(\mathbf{X}_\pi) = \text{SelfAttention}(\mathbf{X})_\pi\]
Self-attention is permutation equivariant: Permute input → same permutation of output.

Each token gets same contextual representation regardless of where it appears.

Critical limitation: Pure self-attention treats sequences as sets, not ordered sequences.
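The equivariance property stated above can be checked directly on a toy self-attention layer without positional encoding (weights and sizes are illustrative):

```python
import torch
import torch.nn.functional as F

# Numeric check: permuting the input rows permutes the output rows identically.
torch.manual_seed(0)
T, d = 6, 16
X = torch.randn(T, d)
W_Q, W_K, W_V = (torch.randn(d, d) for _ in range(3))

def self_attn(X):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    A = F.softmax(Q @ K.T / d ** 0.5, dim=-1)
    return A @ V

perm = torch.randperm(T)
out = self_attn(X)
out_perm = self_attn(X[perm])

# SelfAttention(X_pi) == SelfAttention(X)_pi
assert torch.allclose(out_perm, out[perm], atol=1e-5)
```

Shuffling the tokens of a sentence therefore leaves every per-token representation unchanged (just relocated), confirming that order must be injected separately.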
Position information requires vectors \(\mathbf{p}_t \in \mathbb{R}^{d_{model}}\) that satisfy:
Requirement 1: Uniqueness
Each position must have distinct encoding: \[\mathbf{p}_i \neq \mathbf{p}_j \text{ for } i \neq j\]
Requirement 2: Bounded magnitude
Position encoding should not dominate token embeddings: \[\|\mathbf{p}_t\| \approx \|\mathbf{x}_t\|\]
Otherwise self-attention would focus on positions rather than content.
Requirement 3: Smoothness
Adjacent positions should have similar encodings: \[\|\mathbf{p}_{t+1} - \mathbf{p}_t\| \text{ is bounded}\]
Enables model to learn relative position relationships.
Requirement 4: Generalization
Ideally: encoding defined for arbitrary position \(t \in \mathbb{N}\), enabling extrapolation beyond training sequence lengths.
Approach:
\[\mathbf{X}_{input} = \mathbf{X}_{token} + \mathbf{X}_{position}\]
Both \(\in \mathbb{R}^{T \times d_{model}}\)

Why periodic functions?
Multiple sinusoids at different frequencies provide:
Formulation:
For position \(\text{pos} \in \{0, 1, \ldots, T-1\}\) and dimension index \(i \in \{0, 1, \ldots, d_{model}-1\}\):
\[\text{PE}(\text{pos}, 2i) = \sin\left(\frac{\text{pos}}{10000^{2i/d_{model}}}\right)\]
\[\text{PE}(\text{pos}, 2i+1) = \cos\left(\frac{\text{pos}}{10000^{2i/d_{model}}}\right)\]
Properties:
Frequency spectrum:

Interpretation:
Each position receives a unique \(d_{model}\)-dimensional vector.
Theorem: For any fixed offset \(k\), \(\text{PE}(\text{pos}+k)\) is a linear function of \(\text{PE}(\text{pos})\).
Proof: Using angle addition formulas for dimension \(i\):
\[\sin(\alpha + \beta) = \sin(\alpha)\cos(\beta) + \cos(\alpha)\sin(\beta)\] \[\cos(\alpha + \beta) = \cos(\alpha)\cos(\beta) - \sin(\alpha)\sin(\beta)\]
Let \(\omega_i = 10000^{-2i/d_{model}}\) denote the angular frequency for dimension pair \(i\).
For the sine component (dimension \(2i\)): \[\sin(\omega_i(\text{pos}+k)) = \sin(\omega_i \cdot \text{pos})\cos(\omega_i k) + \cos(\omega_i \cdot \text{pos})\sin(\omega_i k)\]
For the cosine component (dimension \(2i+1\)): \[\cos(\omega_i(\text{pos}+k)) = \cos(\omega_i \cdot \text{pos})\cos(\omega_i k) - \sin(\omega_i \cdot \text{pos})\sin(\omega_i k)\]
Matrix form for dimension pair \(i\):
\[\begin{bmatrix} \text{PE}(\text{pos}+k, 2i) \\ \text{PE}(\text{pos}+k, 2i+1) \end{bmatrix} = \begin{bmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{bmatrix} \begin{bmatrix} \text{PE}(\text{pos}, 2i) \\ \text{PE}(\text{pos}, 2i+1) \end{bmatrix}\]
The transformation matrix depends only on offset \(k\) and frequency \(\omega_i\), independent of absolute position \(\text{pos}\).
Consequence: Model can learn to compute relative positions through learned linear transformations.
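The rotation-matrix identity above can be verified numerically for one dimension pair (the choices of \(i\), \(k\), and test positions are illustrative):

```python
import math
import torch

# Check: PE(pos + k) equals a pos-independent rotation of PE(pos).
d_model, i = 512, 10
omega = 10000 ** (-2 * i / d_model)   # angular frequency for pair i

def pe_pair(pos):
    """The (sin, cos) pair for dimension pair i at a given position."""
    return torch.tensor([math.sin(omega * pos), math.cos(omega * pos)])

k = 7
R = torch.tensor([
    [math.cos(omega * k), math.sin(omega * k)],
    [-math.sin(omega * k), math.cos(omega * k)],
])

# The same R works at every absolute position
for pos in (0, 5, 123):
    assert torch.allclose(R @ pe_pair(pos), pe_pair(pos + k), atol=1e-5)
```

The matrix `R` depends only on the offset \(k\), so a single learned linear map can shift attention by a fixed relative distance anywhere in the sequence.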
Sinusoidal encoding permits arbitrary sequence length:
Empirical extrapolation performance (WMT’14 En→De, Transformer base):
| Train Length | Test Length | BLEU | Degradation |
|---|---|---|---|
| \(T \leq 512\) | 512 | 27.3 | — |
| \(T \leq 512\) | 768 | 26.0 | -1.3 |
| \(T \leq 512\) | 1024 | 25.0 | -2.3 |
| \(T \leq 512\) | 2048 | 22.7 | -4.6 |
Sources of degradation:
Attention weight distribution shift: Learned queries and keys optimized for positions seen during training. Softmax scores at unseen positions produce different distributions.
Relative magnitude changes: At position 1024, encoding magnitude remains bounded but relative to learned attention patterns, behaves differently than position 512.
No gradient signal: Training never updates parameters based on positions beyond \(T_{train}\). Model cannot learn appropriate behavior for extrapolation region.

Practical implication: Training data must include sequence lengths matching deployment distribution. Sinusoidal encoding enables generation at any length but does not guarantee performance.

Implementation:
import torch
import math

def get_positional_encoding(max_len, d_model):
    """
    Generate sinusoidal positional encodings.

    Returns:
        pe: Tensor [max_len, d_model]
    """
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2) *
        -(math.log(10000.0) / d_model)
    )
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Usage
pos_enc = get_positional_encoding(5000, 512)

# Add to embeddings
# token_emb: [batch, seq_len, d_model]
input_emb = token_emb + pos_enc[:seq_len]

Key properties:
Learned Positional Embeddings
Treat position as categorical variable with \(T_{max}\) classes. Learn embedding matrix \(\mathbf{P} \in \mathbb{R}^{T_{max} \times d_{model}}\) where position \(t\) maps to row \(\mathbf{P}[t]\).
Properties:
Used in: BERT, GPT-2, GPT-3
Empirical performance: Comparable to sinusoidal within training length (WMT’14 En→De: 28.2 vs 28.4 BLEU at \(T \leq 512\))
Extrapolation: Architecturally impossible. Sequences longer than \(T_{max}\) require model retraining with larger embedding matrix.
Relative Position Encoding
Modify attention scores based on pairwise distance between positions.
\[e_{ij} = \mathbf{q}_i^T \mathbf{k}_j + a_{i-j}\]
where \(a_k\) is learned bias for offset \(k \in \{-T_{max}, \ldots, T_{max}\}\).
Properties:
Used in: Transformer-XL, T5
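A minimal sketch of the relative-position bias above, adding a learned offset term to the attention scores (toy sizes; the score here also includes the \(\sqrt{d_k}\) scaling used elsewhere in these notes, which the formula above omits):

```python
import torch
import torch.nn.functional as F

# e_ij = q_i . k_j / sqrt(d_k) + a_{i-j}, with a learned bias per offset.
T, d_k, T_max = 6, 8, 16
torch.manual_seed(0)
Q = torch.randn(T, d_k)
K = torch.randn(T, d_k)
a = torch.randn(2 * T_max + 1)   # one bias per offset in [-T_max, T_max]

i = torch.arange(T).unsqueeze(1)   # [T, 1]
j = torch.arange(T).unsqueeze(0)   # [1, T]
offsets = (i - j) + T_max          # shift offsets into index range [0, 2*T_max]

scores = Q @ K.T / d_k ** 0.5 + a[offsets]   # [T, T] content + position bias
weights = F.softmax(scores, dim=-1)
assert weights.shape == (T, T)
```

Because the bias is indexed by \(i - j\) rather than by absolute position, the same table `a` applies at every position, which is what gives relative encodings their translation invariance.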

Other approaches:
Performance differences minimal on standard benchmarks (\(<1\) BLEU). Architectural choice driven by sequence length requirements and computational constraints.
Architecture overview:
Each layer contains:
Key principle:

Dimension preservation: \(d_{model}\) constant throughout stack enables deep composition.

Two sublayers, each with residual connection and layer normalization.
“Position-wise”: Same FFN applied independently to each position.
\[\text{FFN}(\mathbf{x}) = \text{ReLU}(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2\]
Where:
Why needed?
Interpretation:

For \(d_{model}=512\), \(d_{ff}=2048\): FFN has ~2.1M parameters per layer.
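The FFN formula and the ~2.1M figure can be reproduced with a minimal sketch at the base-config sizes:

```python
import torch
import torch.nn as nn

# Position-wise FFN: ReLU(x W1 + b1) W2 + b2, applied per position.
d_model, d_ff = 512, 2048

ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),   # x W1 + b1
    nn.ReLU(),
    nn.Linear(d_ff, d_model),   # (.) W2 + b2
)

# The same weights act independently at every position
x = torch.randn(2, 100, d_model)   # [batch, T, d_model]
y = ffn(x)
assert y.shape == x.shape

n_params = sum(p.numel() for p in ffn.parameters())
print(n_params)  # 512*2048 + 2048 + 2048*512 + 512 = 2,099,712 (~2.1M)
```

Since `nn.Linear` broadcasts over leading dimensions, "position-wise" needs no loop: each of the \(T\) positions passes through the same two-layer MLP.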
Stacking creates hierarchical representations:
Lower layers (1-2):
Middle layers (3-4):
Upper layers (5-6):
Empirical observation:
Each layer specializes without supervision; patterns emerge from task optimization.

Each position’s representation becomes increasingly context-aware and abstract.
After \(L\) layers, each position has accessed:
Contrast with CNNs:
Transformer advantage:

Transformer sees full sequence at every layer. CNN must stack many layers to achieve same scope.
Standard layer transformation:
\[\mathbf{x}^{(\ell)} = F(\mathbf{x}^{(\ell-1)})\]
Gradient through \(L\) layers: \[\frac{\partial \mathcal{L}}{\partial \mathbf{x}^{(0)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{x}^{(L)}} \prod_{\ell=1}^L \frac{\partial F^{(\ell)}}{\partial \mathbf{x}^{(\ell-1)}}\]
Problem: Product of Jacobians causes vanishing/exploding gradients.
Residual connection:
\[\mathbf{x}^{(\ell)} = \mathbf{x}^{(\ell-1)} + F(\mathbf{x}^{(\ell-1)})\]
Key property: Identity path bypasses transformation.

Identity path ensures gradient can always flow through the network.
Forward pass with residual: \[\mathbf{x}^{(\ell)} = \mathbf{x}^{(\ell-1)} + F(\mathbf{x}^{(\ell-1)})\]
Backward pass (chain rule): \[\frac{\partial \mathcal{L}}{\partial \mathbf{x}^{(\ell-1)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{x}^{(\ell)}} \cdot \frac{\partial \mathbf{x}^{(\ell)}}{\partial \mathbf{x}^{(\ell-1)}}\]
Compute Jacobian: \[\frac{\partial \mathbf{x}^{(\ell)}}{\partial \mathbf{x}^{(\ell-1)}} = \frac{\partial}{\partial \mathbf{x}^{(\ell-1)}}\left[\mathbf{x}^{(\ell-1)} + F(\mathbf{x}^{(\ell-1)})\right] = \mathbf{I} + \frac{\partial F}{\partial \mathbf{x}^{(\ell-1)}}\]
Key result: \[\frac{\partial \mathcal{L}}{\partial \mathbf{x}^{(\ell-1)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{x}^{(\ell)}} \cdot \left(\mathbf{I} + \frac{\partial F}{\partial \mathbf{x}^{(\ell-1)}}\right)\]
Gradient magnitude lower bound: \[\left\|\frac{\partial \mathcal{L}}{\partial \mathbf{x}^{(\ell-1)}}\right\| \geq \left\|\frac{\partial \mathcal{L}}{\partial \mathbf{x}^{(\ell)}}\right\| - \left\|\frac{\partial \mathcal{L}}{\partial \mathbf{x}^{(\ell)}} \cdot \frac{\partial F}{\partial \mathbf{x}^{(\ell-1)}}\right\|\]
Even if \(\frac{\partial F}{\partial x}\) has small norm, identity term \(\mathbf{I}\) prevents vanishing.
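The identity-path argument can be demonstrated with a toy deep stack, using a deliberately contractive linear-plus-tanh sublayer as a stand-in for attention/FFN (all sizes and scales here are illustrative):

```python
import torch

# Gradient survives a 50-layer stack with residuals but collapses without.
torch.manual_seed(0)
d, L = 32, 50
W = torch.randn(d, d) * 0.05   # small-norm sublayer weight (contractive)

def stack(x, residual):
    for _ in range(L):
        F_x = torch.tanh(x @ W)
        x = x + F_x if residual else F_x   # identity path on/off
    return x

x0 = torch.randn(1, d, requires_grad=True)
g_res = torch.autograd.grad(stack(x0, True).sum(), x0)[0].norm()
x0 = torch.randn(1, d, requires_grad=True)
g_plain = torch.autograd.grad(stack(x0, False).sum(), x0)[0].norm()

print(g_res.item(), g_plain.item())
assert g_res > 1e3 * g_plain   # residual gradient is orders of magnitude larger
```

Without the residual, each layer multiplies the gradient by a Jacobian of norm well below 1, so 50 layers annihilate it; the \(\mathbf{I}\) term keeps the residual version well-conditioned.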
Each encoder layer has 2 residual connections:
For \(N\) layers:
6-layer transformer:
Ensemble interpretation:

Exponential path count enables very deep networks (50+ layers) without gradient issues.
CNNs:
Transformers:
Key difference: Global attention amplifies gradient issues. Without identity path, gradients vanish completely at early layers.
Ablation study (WMT’14 En→De):
| Configuration | BLEU | Status |
|---|---|---|
| 6-layer Transformer | | |
| Without residual | — | Diverges |
| With residual | 27.3 | Converges |
| Deep Transformer (48 layers) | | |
| Post-norm, no residual | — | Diverges |
| Pre-norm, with residual | 26.9 | Converges |
Residual connections required for transformer training. CNNs benefit from residuals but can function without them due to locality.
Batch normalization computes statistics across batch dimension.
For minibatch \(\mathcal{B} = \{x^{(1)}, \ldots, x^{(B)}\}\) and feature \(j\):
\[\mu_j = \frac{1}{B}\sum_{b=1}^B x_j^{(b)}, \quad \sigma_j^2 = \frac{1}{B}\sum_{b=1}^B (x_j^{(b)} - \mu_j)^2\]
Normalize: \(\hat{x}_j^{(b)} = \frac{x_j^{(b)} - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}}\)
Works for CNNs: Each feature has consistent meaning across spatial positions and samples.
Fails for sequences:

Position-dependent statistics cannot be captured by batch-level normalization.
Normalize across feature dimension for each sample and position independently:
\[\mu^{(t)} = \frac{1}{d}\sum_{i=1}^d x_i^{(t)}, \quad \sigma^{2(t)} = \frac{1}{d}\sum_{i=1}^d (x_i^{(t)} - \mu^{(t)})^2\]
\[\text{LayerNorm}(\mathbf{x}^{(t)}) = \gamma \odot \frac{\mathbf{x}^{(t)} - \mu^{(t)}}{\sqrt{\sigma^{2(t)} + \epsilon}} + \beta\]
where \(\gamma, \beta \in \mathbb{R}^d\) are learned per-feature scale and shift.
Properties:
Computational cost: \(O(d)\) per position, negligible compared to attention \(O(T \cdot d)\).

Layer normalization standardizes each position independently, removing scale variations across features.
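The formula above can be checked against PyTorch's implementation, computing the statistics manually over the feature dimension:

```python
import torch
import torch.nn as nn

# Manual LayerNorm (gamma=1, beta=0) vs nn.LayerNorm at initialization.
torch.manual_seed(0)
d = 512
x = torch.randn(3, 100, d)   # [batch, T, d]

mu = x.mean(dim=-1, keepdim=True)                 # per-position mean over features
var = x.var(dim=-1, unbiased=False, keepdim=True)  # biased variance, as in the formula
manual = (x - mu) / torch.sqrt(var + 1e-5)

ln = nn.LayerNorm(d, eps=1e-5)   # initializes gamma=1, beta=0
assert torch.allclose(manual, ln(x), atol=1e-5)
```

Note the statistics are computed independently for each of the `3 * 100` (sample, position) pairs; no information crosses the batch or time dimension, which is exactly why this works for variable-length sequences where batch norm fails.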
Layer normalization can be placed before or after the sublayer. Original Transformer used post-norm. Modern deep transformers use pre-norm for stability.
Post-Norm (original Transformer): \[\mathbf{y} = \text{LayerNorm}(\mathbf{x} + \text{Sublayer}(\mathbf{x}))\]
Pre-Norm: \[\mathbf{y} = \mathbf{x} + \text{Sublayer}(\text{LayerNorm}(\mathbf{x}))\]
Post-Norm behavior:
Gradient through post-norm:
\[\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{x}}} \cdot \frac{\partial \text{LN}}{\partial \mathbf{x}}\]
where \(\frac{\partial \text{LN}}{\partial \mathbf{x}}\) includes \(\sigma^{-1}\) term.
For deep networks: Accumulated residuals → large \(\|\mathbf{x}\|\) → small \(\sigma\) → gradient explosion in \(\sigma^{-2}\) term.
Pre-Norm behavior:
Gradient through pre-norm:
\[\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} + \frac{\partial \mathcal{L}}{\partial \text{Sublayer}} \cdot \frac{\partial \text{LN}}{\partial \mathbf{x}}\]
Identity path \(\frac{\partial \mathcal{L}}{\partial \mathbf{y}}\) is unnormalized → stable for arbitrary depth.
Empirical observation: Post-norm diverges beyond 24 layers. Pre-norm stable to 100+ layers.

WMT’14 En-De results: Post-norm optimal at 6 layers (27.8 BLEU), fails at 48+ layers. Pre-norm maintains training stability across all depths, with 6-layer slightly lower (27.3 BLEU, -0.5).
Without normalization in residual path, representation norms accumulate:
After \(L\) layers with residual connections: \[\mathbf{x}^{(L)} = \mathbf{x}^{(0)} + \sum_{\ell=1}^L F^{(\ell)}(\cdot)\]
Expected norm growth: \(\|\mathbf{x}^{(L)}\| \approx \sqrt{L} \cdot \|\mathbf{x}^{(0)}\|\) (random walk in high dimensions).

Pre-norm advantage: Sublayers always operate on normalized inputs (‖input‖ ≈ 1), preventing sensitivity to depth. Residual path accumulates unnormalized updates, but this doesn’t affect gradient flow through identity connection.
Decoder layer structure:
Each of \(N\) layers (typically \(N=6\)) contains three sublayers:
Each sublayer has residual connection and layer normalization.
Critical difference from encoder: Decoder must maintain causality for autoregressive generation.

Three sublayers process target sequence with access to encoder representations.
Autoregressive generation requires: \(p(y_t | y_{<t}, \mathbf{x})\)
Position \(t\) can only depend on positions \(\{1, 2, \ldots, t\}\), not future positions \(\{t+1, \ldots, T\}\).
During training:
Masking mechanism:
Attention scores before softmax: \(\mathbf{S} = \frac{\mathbf{QK}^T}{\sqrt{d_k}} \in \mathbb{R}^{T \times T}\)
Apply mask \(\mathbf{M}\) where \(M_{ij} = \begin{cases} 0 & \text{if } i \geq j \\ -\infty & \text{if } i < j \end{cases}\)
Masked scores: \(\tilde{\mathbf{S}} = \mathbf{S} + \mathbf{M}\)
After softmax: Future positions have weight 0.

Upper triangle (future) forced to zero probability after softmax.
import torch
import torch.nn.functional as F
import math

def create_causal_mask(T):
    """
    Create upper-triangular mask for causal attention.

    Returns:
        mask: [T, T] with -inf in upper triangle
    """
    mask = torch.triu(torch.ones(T, T) * float('-inf'), diagonal=1)
    return mask

def masked_self_attention(Q, K, V):
    """
    Self-attention with causal masking.

    Args:
        Q, K, V: [batch, T, d_k]

    Returns:
        output: [batch, T, d_k]
        weights: [batch, T, T]
    """
    batch_size, T, d_k = Q.shape

    # Compute scores
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # [batch, T, T]

    # Apply causal mask
    mask = create_causal_mask(T).to(Q.device)
    scores = scores + mask  # Broadcasting: [batch, T, T] + [T, T]

    # Softmax (future positions have exp(-inf) = 0)
    weights = F.softmax(scores, dim=-1)  # [batch, T, T]

    # Weighted sum
    output = weights @ V  # [batch, T, d_k]
    return output, weights

Mask is device-agnostic and requires no gradients (constant).
Masking enables parallel training on full sequence.
RNN decoder training:
Masked transformer decoder training:
Training speedup: Up to \(100\times\) for long sequences (parallel vs sequential).

Parallel processing during training compensates for quadratic attention cost: up to 100× faster than RNN despite \(O(T^2)\) complexity.
At inference, generation is sequential:
Naive approach: Recompute attention for all previous tokens at each step.
For position \(t\): Process tokens \(\{y_1, \ldots, y_t\}\) through decoder.
Computational cost: \(O(T^2)\) attention operations for sequence of length \(T\).
Optimization: KV caching
Keys and values from previous positions don’t change. Cache them:

Memory-speed tradeoff: Store \(O(L \times T \times d)\) cached keys/values (where \(L\) = number of layers), reduce computation from \(O(T^2)\) to \(O(T)\) per generation step.
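KV caching for a single attention layer can be sketched as follows; the greedy-decoding loop, weight scales, and variable names are illustrative:

```python
import torch
import torch.nn.functional as F

# Per decoding step, only the newest token's K and V are computed and
# appended; attention then runs over the full cache (O(t) per step).
torch.manual_seed(0)
d_k = 16
W_Q, W_K, W_V = (torch.randn(d_k, d_k) * 0.1 for _ in range(3))

K_cache = torch.empty(0, d_k)
V_cache = torch.empty(0, d_k)

def decode_step(x_t):
    """x_t: [1, d_k] embedding of the newest token."""
    global K_cache, V_cache
    K_cache = torch.cat([K_cache, x_t @ W_K])   # O(1) new projection work
    V_cache = torch.cat([V_cache, x_t @ W_V])
    q = x_t @ W_Q
    w = F.softmax(q @ K_cache.T / d_k ** 0.5, dim=-1)
    return w @ V_cache   # attends over all cached (past) positions

for t in range(5):
    out = decode_step(torch.randn(1, d_k))
assert K_cache.shape == (5, d_k) and out.shape == (1, d_k)
```

Causality comes for free: the cache only ever contains past positions, so no mask is needed at inference. The cost is the \(O(L \times T \times d)\) memory noted above, paid per layer.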
Cross-attention sublayer:
\[\text{CrossAttention}(\mathbf{Q}_{dec}, \mathbf{K}_{enc}, \mathbf{V}_{enc})\]
where:
Attention matrix: \(\mathbb{R}^{T_t \times T_s}\) (target positions × source positions)
This is the same attention mechanism from Lecture 05, now operating on transformer representations.

Each decoder position attends to full source sequence to retrieve relevant information for generation.

Standard transformer: 6-layer encoder, 6-layer decoder, multi-head self-attention and cross-attention, position-wise FFN with residuals and layer norm at each sublayer.
Decoder produces contextualized representations \(\mathbf{H}_{decoder} \in \mathbb{R}^{T \times d_{model}}\)
Output layer maps to vocabulary:
\[\mathbf{logits} = \mathbf{H}_{decoder}\mathbf{W}_{vocab}\]
Where \(\mathbf{W}_{vocab} \in \mathbb{R}^{d_{model} \times |V|}\) projects each position to vocabulary size.
Per-position softmax:
\[p(y_t | y_{<t}, \mathbf{x}) = \text{softmax}(\mathbf{logits}_t) \in \mathbb{R}^{|V|}\]
Each decoder position produces probability distribution over vocabulary.
For translation (WMT’14 En-De):
Training: Cross-entropy loss at each position \[\mathcal{L} = -\sum_{t=1}^T \log p(y_t^* | y_{<t}, \mathbf{x})\]
where \(y_t^*\) is the ground truth token.

Output projection weight matrix often tied with input embeddings (\(\mathbf{W}_{vocab}^T = \mathbf{W}_{embed}\)), reducing parameters by \(|V| \times d_{model}\).
Parameter count (base config: \(N=6\), \(d_{model}=512\), \(d_{ff}=2048\)):
Scaling:
FFN layers dominate: \(2 \times d_{model} \times d_{ff}\) per layer.
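A back-of-envelope count makes the FFN dominance concrete; this sketch covers only the weight matrices of one encoder layer at the base config, ignoring biases and layer-norm parameters:

```python
# Per-layer weight counts, base config.
d_model, d_ff = 512, 2048

attn = 4 * d_model * d_model   # W_Q, W_K, W_V, W_O
ffn = 2 * d_model * d_ff       # W_1, W_2

print(attn)                # 1,048,576
print(ffn)                 # 2,097,152 -> twice the attention weights
print(ffn / (attn + ffn))  # FFN share: 2/3 of per-layer weights
```

With \(d_{ff} = 4 d_{model}\), the FFN always carries twice the attention weights per layer, independent of sequence length.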
Memory scaling with sequence length:
Parameters are constant (65MB), but activations and attention scale with \(T\):
Crossover: Attention memory dominates when \(T > 256\).

At \(T=512\), batch=32: Attention matrices require 4GB, dominating total memory. Parameters (65MB) negligible.
Per-layer complexity:
Crossover point: Self-attention dominates when \(T^2 d_{model} > T d_{model}^2\), i.e., when \(T > d_{model}\).
For \(d_{model} = 512\):
Concrete example (\(N=6\), \(d=512\), \(T=100\)):
FFN uses 5× more computation than attention for translation.
Reality: The “quadratic bottleneck” narrative is misleading for typical NLP tasks. Attention becomes the computational bottleneck only for very long sequences (\(T > 512\)).


Key observation: For \(T=512\), attention matrices require 4GB memory (batch=32), dominating total memory. Parameters (65MB) negligible by comparison.
Each encoder/decoder layer has 2-3 residual connections:
Total gradient paths from output to input:
For \(N\) encoder layers and \(N\) decoder layers: \[\text{Paths} = 2^{2N} \times 2^{3N} = 2^{5N}\]
For \(N=6\): \(2^{30} \approx 1\) billion paths.
Gradient at input is average over all paths.

Ensemble interpretation: Gradient is averaged over \(2^{5N}\) paths. Individual path failures don’t prevent training as long as sufficient paths remain viable. The exponential number of viable paths ensures training stability in deep transformers.
Standard approach for neural networks:
Transformers trained this way:
Standard solution: Learning rate warmup

Transformer learning rate schedule: \(\text{lr} = d_{\text{model}}^{-0.5} \cdot \min(\text{step}^{-0.5}, \text{step} \cdot \text{warmup\_steps}^{-1.5})\)

Without warmup: Gradients reach \(10^2\) magnitude within 200 steps. With 4000-step warmup: Gradients remain stable around \(10^0\) magnitude.
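The schedule above can be written directly as a function (base-config values; step counts below are illustrative):

```python
# lr = d_model^{-0.5} * min(step^{-0.5}, step * warmup_steps^{-1.5})
def transformer_lr(step, d_model=512, warmup_steps=4000):
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Linear increase during warmup, then step^{-0.5} decay after the peak.
lr_1 = transformer_lr(1)          # tiny initial rate
lr_peak = transformer_lr(4000)    # peak, ~7e-4 for d_model=512
lr_late = transformer_lr(100_000) # decayed
assert lr_1 < lr_peak and lr_late < lr_peak
```

The two branches of the `min` meet exactly at `step == warmup_steps`, so the schedule is continuous: linear ramp up, inverse-square-root decay down.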
Transformer has exponentially many gradient paths:
Early training problem:
Effect on updates:
Additional instabilities:

Path variance decreases from \(10^{3}\)-\(10^{4}\) to \(10^{1}\) during warmup. Attention entropy drops from \(\log(100) \approx 4.6\) (uniform) to ~2.0 (structured).
Mechanism: Small learning rate prevents path divergence
At initialization:
Without warmup (\(\eta = 10^{-3}\)):
Large-gradient paths dominate: \[\Delta \theta \approx -10^{-3} \times 10^{1} = -10^{-2}\]
This causes:
With warmup (\(\eta \approx 10^{-5}\) initially):
All updates small: \[\Delta \theta \approx -10^{-5} \times 10^{1} = -10^{-4}\]
Network explores stable basin:

Annealing interpretation: Warmup resembles simulated annealing. Small updates allow system to explore parameter space without leaving stable basin. Gradual increase prevents paths from amplifying variance.
Four components work together:
1. Learning rate warmup
2. Gradient clipping
3. Careful initialization
4. Architecture choices

All four components necessary for deep models (12+ layers). Removing any one typically causes instability.
WMT 2014 English → German:
Transformer results:

WMT 2014 English → French:
Inference speed: ~10× faster than RNN (beam search, single GPU).

Critical findings: Positional encoding essential (-7.6 BLEU without). Residual connections required (training diverges). Layer normalization important (-2.2 BLEU).
1. Parallelization enables faster training
2. Direct gradient paths
3. Scalable to large hardware

But: Requires more training data

WMT 2014 dataset: 4.5M sentence pairs. Transformer advantage: Clear at this scale. LSTM advantage: Below 1M examples (low-resource languages).
Transformer memory breakdown:
Training memory \(\approx O(\text{batch} \times \text{layers} \times (h \cdot T^2 + T \cdot d_{\text{model}}))\): the first term is the stored attention matrices, the second the per-token activations.
For sequence length \(T=512\), batch size \(B=32\):
| Component | Memory | Calculation |
|---|---|---|
| Attention matrices | 4.0 GB | \(B \times N \times h \times T^2 \times 4\) bytes |
| Activations | 1.2 GB | \(B \times N \times T \times d \times 4\) bytes |
| Parameters | 0.25 GB | \(65M \times 4\) bytes |
| Gradients | 0.25 GB | Same as parameters |
| Optimizer states | 0.5 GB | Adam: 2× parameters |
| Total | ~6.2 GB | Single forward + backward |
For \(T=2048\): Attention matrices alone require 64 GB.
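The quadratic term is easy to check numerically. A minimal estimator for the attention-matrix term only (function name ours; this is a lower bound, since stored intermediates such as pre-softmax scores and dropout masks multiply it by a constant factor):

```python
def attention_memory_gb(batch, layers, heads, seq_len, bytes_per=4):
    """Memory for stored attention matrices: B x N x h x T^2 floats."""
    return batch * layers * heads * seq_len ** 2 * bytes_per / 1024 ** 3

# Quadratic blow-up with sequence length (batch=32, 6 layers, 8 heads):
m512 = attention_memory_gb(32, 6, 8, 512)    # 1.5 GB for the raw matrices
m2048 = attention_memory_gb(32, 6, 8, 2048)  # 16x larger: 4x length, squared
assert m2048 / m512 == 16.0
```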
Sequence length bottleneck:

Quadratic memory scaling limits practical sequence length. At \(T=4096\): 64 MB per attention matrix; across 8 heads, 6 layers, and batch=32, the attention matrices alone require ~96 GB.
1. Gradient checkpointing (recomputation)
2. Mixed precision training (FP16 + FP32)
3. KV caching during autoregressive generation
Autoregressive decoding: \(p(y_t | y_{<t})\) requires computing attention at each step.
Without caching: each step re-runs attention over the entire prefix, recomputing keys and values for all \(t\) previous tokens, so step \(t\) costs \(O(t^2)\) and generating \(T\) tokens costs \(O(T^3)\) total.
With KV caching: keys and values of past tokens are stored once, so step \(t\) computes only one new query, key, and value and attends over the cache in \(O(t)\), for \(O(T^2)\) total.
Speedup scales linearly with sequence length.
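The caching idea in miniature (a toy scalar "head", not a real model; `project_kv` and `project_q` stand in for the learned projections):

```python
import math

def attend(q, keys, values):
    """One query attending over cached keys/values (toy 1-d head)."""
    scores = [q * k for k in keys]
    m = max(scores)  # subtract max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return sum(w / z * v for w, v in zip(weights, values))

def generate(project_kv, project_q, steps):
    """Autoregressive decoding with a KV cache: each new token appends
    one key/value pair instead of re-encoding the whole prefix."""
    k_cache, v_cache, outputs = [], [], []
    token = 1.0  # start token (toy scalar)
    for _ in range(steps):
        k, v = project_kv(token)
        k_cache.append(k)  # O(1) append per step;
        v_cache.append(v)  # the cache grows to length T
        token = attend(project_q(token), k_cache, v_cache)
        outputs.append(token)
    return outputs

out = generate(lambda x: (x, x), lambda x: x, steps=4)
```

Without the cache, the two `append` calls would instead be a full recomputation of `k` and `v` for every prefix token at every step.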

Combined optimizations enable training on sequences ~4× longer or generation up to 10× faster.

Training: Transformer ~10× faster due to parallelization. Inference: Both limited by autoregressive generation (must generate one token at a time).
Scaling laws: Log-linear performance improvement with parameters
Translation (WMT’14 En-De BLEU):
| Model | Params | BLEU |
|---|---|---|
| Transformer Base (2017) | 65M | 27.3 |
| Transformer Big (2017) | 213M | 28.4 |
| GPT-2 (2019) | 1.5B | 30.1 |
| GPT-3 (2020) | 175B | 31.3 |
Language modeling (perplexity on validation):
Consistent log-linear improvement across model sizes from 100M to 175B parameters.
Key observations:
Contrast: RNN performance plateaus around 100M parameters, limited by sequential computation bottleneck.

Log-linear scaling continues from 15M to 175B+ parameters. Performance improvements remain predictable across three orders of magnitude in model size.
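The log-linear trend can be checked with a quick least-squares fit over the translation table above (a sketch; the fit function is ours):

```python
import math

def fit_log_linear(params, scores):
    """Ordinary least squares of score against log10(parameter count)."""
    xs = [math.log10(p) for p in params]
    n = len(xs)
    mx, my = sum(xs) / n, sum(scores) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, scores))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

slope, intercept = fit_log_linear(
    [65e6, 213e6, 1.5e9, 175e9],  # parameter counts from the table
    [27.3, 28.4, 30.1, 31.3])     # WMT'14 En-De BLEU
# slope ~ +1.1 BLEU per 10x parameters
```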
Convolutional Neural Networks:
Recurrent Neural Networks:
Transformers:

CNN and RNN embed domain knowledge. Transformer learns structure from data alone.

Transformer requires ~10× more data than RNN for comparable performance (estimated from WMT translation task comparisons). CNN most sample-efficient below 1M examples.
WMT translation tasks (measured performance):
High-resource languages (>4M sentence pairs):
Low-resource languages (<500K sentence pairs):
Data augmentation partially compensates:

Weak inductive bias trades sample efficiency for asymptotic performance.

Transformer excels at pattern matching and interpolation. Fails at systematic generalization and algorithmic reasoning (SCAN benchmark: 35% vs RNN 90%).
Attention computation scales as \(O(T^2)\):
Attention matrix: \(\mathbf{S} \in \mathbb{R}^{T \times T}\)
Memory requirements (batch=1, single layer, float32):
| \(T\) | Attention Matrix (\(T^2 \times 4\) bytes) | Use Case |
|---|---|---|
| 512 | 1.0 MB | Paragraphs |
| 1024 | 4.0 MB | Short documents |
| 2048 | 16 MB | Long documents |
| 4096 | 64 MB | Book chapters |
| 8192 | 256 MB | Cannot fit typical GPU |
Full model with 6 layers requires 6× more memory.
Document length statistics:
Standard transformer cannot process full documents without splitting.

For \(T > d\), transformer uses more FLOPs than RNN. Memory prohibitive beyond \(T \approx 4096\).
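The crossover point follows from comparing the two per-layer costs (constant factors ignored; function names ours):

```python
def attention_flops(T, d):
    """Per-layer self-attention cost: the T x T score matrix against
    d-dimensional vectors dominates, O(T^2 d)."""
    return T * T * d

def rnn_flops(T, d):
    """Per-layer recurrent cost: T sequential d x d matrix-vector
    products, O(T d^2)."""
    return T * d * d

# Crossover at T = d: attention is cheaper for short sequences,
# more expensive for long ones.
d = 512
assert attention_flops(d // 4, d) < rnn_flops(d // 4, d)
assert attention_flops(4 * d, d) > rnn_flops(4 * d, d)
```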
Training: Sequences up to \(T_{\text{train}} = 512\)
Testing: Longer sequences
Performance degradation (WMT En→De, Transformer base):
| Test \(T\) | BLEU | \(\Delta\) |
|---|---|---|
| 512 | 27.3 | 0.0 |
| 768 | 26.0 | -1.3 |
| 1024 | 25.0 | -2.3 |
| 2048 | 22.7 | -4.6 |
Why this occurs:
Learned position embeddings: Cannot extrapolate at all.
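The contrast is visible in code: a learned table with 512 rows simply has no entry for position 2048, while the sinusoidal form from the original paper evaluates at any position (a minimal sketch; extrapolating positions still degrades quality, it merely remains defined):

```python
import math

def sinusoidal_pe(pos, d_model=512):
    """Sinusoidal positional encoding: sin at even dimensions, cos at
    odd ones, with geometrically spaced wavelengths. Defined for ANY
    position, unlike a fixed-size learned embedding table."""
    return [
        math.sin(pos / 10000 ** (2 * (i // 2) / d_model)) if i % 2 == 0
        else math.cos(pos / 10000 ** (2 * (i // 2) / d_model))
        for i in range(d_model)
    ]

pe = sinusoidal_pe(2048)  # 4x training length: still bounded in [-1, 1]
assert all(-1.0 <= x <= 1.0 for x in pe)
```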

-4.6 BLEU at 4× training length (WMT En→De, Transformer base). Attention entropy increases (patterns degrade to nearly uniform).

Arithmetic: 98% (1-digit) → 12% (3-digit) → 3% (division). Counting degrades from 95% to 65% with length.
SCAN benchmark (compositional commands):
Tasks like “jump twice”, “walk left and jump”
| Split | RNN | Transformer |
|---|---|---|
| Simple | 99.8% | 99.5% |
| Add Jump | 90.3% | 35.2% |
| Length | 18.2% | 15.8% |
| Template | 12.1% | 8.4% |
Both architectures fail at systematic generalization, but the transformer degrades more severely.
Other algorithmic tasks:
No explicit state update mechanism for iterative computation.

Pattern matching: 95%. Algorithmic reasoning: 12-45%. Attention mechanism retrieves and combines but does not compute.