
EE 641 - Unit 5
Fall 2025
[Bottleneck] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” in Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, 2014, pp. 103–111.
[Attention] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in International Conference on Learning Representations, 2015.
[Variants] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1412–1421.
[Tutorial] L. Weng, “Attention? Attention!” Lil’Log, June 2018.
[Vision] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International Conference on Machine Learning, 2015, pp. 2048–2057.
[Speech] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 4960–4964.
[Copy] A. See, P. J. Liu, and C. D. Manning, “Get to the point: Summarization with pointer-generator networks,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017, pp. 1073–1083.
[Reading] M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi, “Bidirectional attention flow for machine comprehension,” in International Conference on Learning Representations, 2017.
[Interpret] S. Jain and B. C. Wallace, “Attention is not explanation,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, pp. 3543–3556.
Encoder RNN: \[\mathbf{h}_t = f(\mathbf{h}_{t-1}, \mathbf{x}_t; \theta_{enc})\]
Processes source sequence token-by-token:
Context extraction: \[\mathbf{c} = \mathbf{h}_{T_s}\]
Final hidden state becomes context for decoder.
Decoder RNN: \[\mathbf{s}_t = g(\mathbf{s}_{t-1}, \mathbf{y}_{t-1}, \mathbf{c}; \theta_{dec})\]
Generates target sequence conditioned on context:

Context vector \(\mathbf{c}\) must encode entire source sequence for decoder.
Entropy measures information content: \[H(X) = -\sum_x p(x) \log_2 p(x) \text{ [bits]}\]
For continuous \(\mathbf{X} \in \mathbb{R}^d\): \[H(\mathbf{X}) \leq d \times b \text{ bits}\]
where \(b\) = bits per dimension (32 for float32).
Mutual information quantifies shared information: \[I(X; Y) = H(X) + H(Y) - H(X, Y)\]
Measures how much knowing \(Y\) reduces uncertainty about \(X\).
Data processing inequality:
For Markov chain \(X \rightarrow Y \rightarrow Z\): \[I(X; Z) \leq I(X; Y)\]
Processing cannot create information.
Conditional entropy: \[H(X|Y) = H(X,Y) - H(Y)\]
Information in \(X\) not explained by \(Y\).
For seq2seq: \[H(\mathbf{X}|\mathbf{c}) = H(\mathbf{X}, \mathbf{c}) - H(\mathbf{c})\]
Measures how much source information is lost in context.
Key relationships:
\[I(\mathbf{X}; \mathbf{c}) = H(\mathbf{X}) - H(\mathbf{X}|\mathbf{c})\]
Information preserved = Total info - Lost info
Bottleneck bound:
\[I(\mathbf{X}; \mathbf{Y}) \leq I(\mathbf{X}; \mathbf{c}) \leq H(\mathbf{c}) \leq h \times 32 \text{ bits}\]
Target reconstruction limited by context capacity.
Concrete bound for seq2seq:
Application to seq2seq: \(\mathbf{X}\) (source) → \(\mathbf{c}\) (context) → \(\mathbf{Y}\) (target)
Channel capacity (Shannon, 1948):
Maximum rate of reliable information transmission: \[C = \max_{p(x)} I(X; Y) \text{ [bits/transmission]}\]
For fixed-dimensional channel \(\mathbf{c} \in \mathbb{R}^h\): \[C \leq H(\mathbf{c}) \leq h \times 32 \text{ bits}\]
Rate-distortion tradeoff:
Cannot perfectly reconstruct \(X\) from compressed \(Y\) when \(H(Y) < H(X)\).
Minimum distortion at rate \(R\): \[D(R) = \min_{p(y|x): I(X;Y) \leq R} \mathbb{E}[d(X, \hat{X})]\]
where \(\hat{X}\) is reconstruction from \(Y\).
Implications:

Seq2seq operates in high-distortion regime for long sequences.
Source sequence: \[\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_{T_s}]\]
Context vector: \[\mathbf{c} = \mathbf{h}_{T_s} \in \mathbb{R}^{h}\]
Target sequence: \[\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, ..., \mathbf{y}_{T_t}]\]

The fundamental problem: When \(T_s \times d_{emb} \gg h\), information must be discarded.
The data processing inequality quantifies the bottleneck:
\[I(\mathbf{X}; \mathbf{Y}) \leq I(\mathbf{X}; \mathbf{c}) \leq H(\mathbf{c})\]
where:
Capacity bound:
For \(\mathbf{c} \in \mathbb{R}^h\) with 32-bit floats: \[H(\mathbf{c}) \leq h \times 32 \text{ bits}\]
Concrete example:

Implication: No decoder can recover information lost at the bottleneck.
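A minimal back-of-the-envelope sketch of this bound, assuming \(h = 512\), 32-bit floats, and a 50-token source with 512-dimensional embeddings (illustrative values; this counts raw storage, not true entropy):

```python
# Rough capacity arithmetic for the fixed-context bottleneck.
# Assumed (illustrative) values: h = 512 context dims, 32-bit floats,
# T_s = 50 source tokens, d_emb = 512 embedding dims.
h, bits_per_dim = 512, 32
T_s, d_emb = 50, 512

context_capacity = h * bits_per_dim             # upper bound on H(c)
source_size = T_s * d_emb * bits_per_dim        # raw bits entering the encoder

print(f"Context capacity bound: {context_capacity:,} bits")   # 16,384
print(f"Source representation:  {source_size:,} bits")        # 819,200
print(f"Ratio: {context_capacity / source_size:.1%}")          # 2.0%
```

Even counting raw bits rather than entropy, the fixed context can hold only a small fraction of what enters the encoder.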
Bilingual Evaluation Understudy (BLEU):
Measures n-gram overlap between machine translation and human reference(s).
Computation:
\[\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)\]
where:
Typical n-gram precision:
\[p_n = \frac{\text{\# matching n-grams}}{\text{\# candidate n-grams}}\]
Brevity penalty:
\[BP = \begin{cases} 1 & c > r \\ e^{1-r/c} & c \leq r \end{cases}\]
where \(c\) = candidate length, \(r\) = reference length.
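A minimal sentence-level sketch of this computation (clipped n-gram counts plus the brevity penalty; real toolkits such as sacreBLEU add smoothing and corpus-level aggregation, and the sentence pair below is a made-up example):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: clipped n-gram precision + brevity penalty (no smoothing)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        matches = sum(min(count, ref[g]) for g, count in cand.items())  # clipped counts
        total = max(sum(cand.values()), 1)
        if matches == 0:
            return 0.0                      # this sketch omits smoothing
        log_precisions.append(math.log(matches / total))
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)          # brevity penalty
    return 100 * bp * math.exp(sum(log_precisions) / max_n)

cand = "the quick brown fox jumps over the dog".split()
ref  = "the quick brown fox jumps over the lazy dog".split()
print(f"BLEU = {bleu(cand, ref):.1f}")                  # ≈ 76.7
```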
Score range: 0-100 (percentage scale)
Interpretation guidelines:
| BLEU Score | Quality Assessment |
|---|---|
| <10 | Almost useless |
| 10-20 | Hard to get gist |
| 20-30 | Understandable, many errors |
| 30-40 | Good, some errors |
| 40-50 | Very good |
| 50-60 | Excellent |
| >60 | Near human quality |
WMT’14 En→Fr benchmarks:
Properties:
What differences mean:
BLEU provides automated, reproducible quality metric. Correlates with human judgment (\(r \approx 0.6-0.7\)) at corpus level.
WMT’14 English→French Results:
Seq2seq performance vs sequence length:
| Length | BLEU | ∆ from baseline |
|---|---|---|
| 10-20 | 35.2 | — |
| 20-30 | 32.8 | -2.4 |
| 30-40 | 29.5 | -5.7 |
| 40-50 | 25.1 | -10.1 |
| 50-60 | 19.8 | -15.4 |
| 60+ | 12.3 | -22.9 |
Degradation rate: ~0.5 BLEU per token
Root cause: Fixed context vector cannot scale with input length.

Recency bias in RNN encoding:
The final hidden state \(\mathbf{h}_{T_s}\) is dominated by recent inputs:
\[\mathbf{h}_t = \tanh(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t)\]
Contribution decay:
Influence of \(\mathbf{x}_k\) on \(\mathbf{h}_T\) decays exponentially:
\[\text{Influence}(\mathbf{x}_k \rightarrow \mathbf{h}_T) \propto \|\mathbf{W}_{hh}\|^{T-k}\]
For typical \(\|\mathbf{W}_{hh}\| \approx 0.9\):
Result: Early tokens effectively erased from context.

Bidirectional encoding helps but doesn’t solve the problem: Still compresses to fixed \(h\) dimensions.
Chain rule for sequential dependencies:
For \(y = f_n(f_{n-1}(...f_1(x)))\):
\[\frac{\partial y}{\partial x} = \frac{\partial f_n}{\partial f_{n-1}} \cdot \frac{\partial f_{n-1}}{\partial f_{n-2}} \cdots \frac{\partial f_1}{\partial x}\]
Product of Jacobians:
\[\frac{\partial \mathbf{h}_T}{\partial \mathbf{h}_1} = \prod_{t=2}^T \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}}\]
Each \(\frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}} \in \mathbb{R}^{h \times h}\) is a Jacobian matrix.
For vanilla RNN:
\[\mathbf{h}_t = \tanh(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t + \mathbf{b})\]
\[\frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}} = \text{diag}(\tanh'(\mathbf{z}_{t-1})) \cdot \mathbf{W}_{hh}\]
where \(\mathbf{z}_t\) is the pre-activation.
Gradient magnitude behavior:
\[\left\| \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}} \right\| \leq \|\mathbf{W}_{hh}\| \cdot \max_i |\tanh'(z_i)|\]
Since \(|\tanh'(z)| \leq 1\) and typically \(\leq 0.25\) for saturated activations:
\[\left\| \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}} \right\| \leq \|\mathbf{W}_{hh}\|\]
Over T steps:
\[\left\| \frac{\partial \mathbf{h}_T}{\partial \mathbf{h}_1} \right\| \leq \|\mathbf{W}_{hh}\|^{T-1} \cdot \prod_{t=2}^T |\tanh'(z_t)|\]
Typical case:
After 50 steps: \(0.45^{50} \approx 10^{-18}\) (gradient vanishes)
Complete gradient path from loss to source:
\[\frac{\partial \mathcal{L}}{\partial \mathbf{x}_1} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}_T} \cdot \frac{\partial \mathbf{y}_T}{\partial \mathbf{s}_T} \cdot \frac{\partial \mathbf{s}_T}{\partial \mathbf{c}} \cdot \frac{\partial \mathbf{c}}{\partial \mathbf{h}_{T_s}} \cdot \frac{\partial \mathbf{h}_{T_s}}{\partial \mathbf{h}_1} \cdot \frac{\partial \mathbf{h}_1}{\partial \mathbf{x}_1}\]
Path length: \(T_{tgt}\) (decoder) + 1 (context) + \(T_{src}\) (encoder)
Gradient magnitude decay:
Each term contributes multiplicative factor:
Decoder path: \(\prod_{t=1}^{T_{tgt}} \left\|\frac{\partial \mathbf{s}_t}{\partial \mathbf{s}_{t-1}}\right\| \approx \gamma_{dec}^{T_{tgt}}\)
Context bottleneck: \(\left\|\frac{\partial \mathbf{s}_1}{\partial \mathbf{c}}\right\| \approx 1\) (initialization)
Encoder path: \(\prod_{t=2}^{T_{src}} \left\|\frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}}\right\| \approx \gamma_{enc}^{T_{src}}\)
Combined: \[\left\|\frac{\partial \mathcal{L}}{\partial \mathbf{x}_1}\right\| \approx \gamma_{dec}^{T_{tgt}} \cdot \gamma_{enc}^{T_{src}} \approx \gamma^{T_{src} + T_{tgt}}\]

Problem: Single long path from loss to early source tokens.
Concrete example:
Translation: 50-token source → 50-token target
Gradient to first source token:
Path length: \(T_{src} + T_{tgt} = 100\) steps
With \(\gamma = 0.45\) per step: \[\left\|\frac{\partial \mathcal{L}}{\partial \mathbf{x}_1}\right\| \approx 0.45^{100} \approx 10^{-35}\]
Gradient to middle token (\(t=25\)):
Path length: \(25 + 50 = 75\) steps \[\left\|\frac{\partial \mathcal{L}}{\partial \mathbf{x}_{25}}\right\| \approx 0.45^{75} \approx 10^{-26}\]
Gradient to last token:
Path length: \(0 + 50 = 50\) steps \[\left\|\frac{\partial \mathcal{L}}{\partial \mathbf{x}_{50}}\right\| \approx 0.45^{50} \approx 10^{-18}\]
All are far below float32 machine epsilon (ε ≈ 10⁻⁷), so the weight updates they produce are numerically negligible.
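A quick check of these magnitudes, assuming the constant per-step factor \(\gamma = 0.45\) used above:

```python
# Gradient magnitude after backpropagating through T sequential steps,
# assuming a constant per-step contraction factor gamma (illustrative).
gamma = 0.45
for path_length in [50, 75, 100]:
    grad = gamma ** path_length
    print(f"path length {path_length:3d}: |grad| ≈ {grad:.1e}")
# path length  50: |grad| ≈ 4.6e-18
# path length  75: |grad| ≈ 9.8e-27
# path length 100: |grad| ≈ 2.1e-35
```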

Conclusion: Only last ~10 encoder tokens receive meaningful gradient signal.
LSTM gradient flow:
Cell state provides near-linear path: \[\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} = \mathbf{f}_t + \text{other terms}\]
When forget gate \(\mathbf{f}_t \approx 1\): minimal decay.
Improved gradient behavior:
With \(\gamma_{LSTM} \approx 0.9\) (vs 0.45 for vanilla):
LSTM enables longer sequences:
But bottleneck remains:
All encoder information still compressed to \(\mathbf{c} \in \mathbb{R}^h\).
Better gradient flow ≠ more capacity.

LSTM improves gradient flow but doesn’t increase information capacity.
WMT’14 En→Fr seq2seq+LSTM results:
Previous slides showed severe bottleneck (64% information loss) and gradient vanishing (\(10^{-35}\)). Why does it work?
LSTM solves trainability, not capacity:
Vanilla RNN cannot train (gradients vanish). LSTM makes training possible (gradients flow). Performance reflects capacity constraints, not training failure.
Why partial information is sufficient:
1. Training data distribution favors short sequences
73% of WMT’14 data ≤40 tokens where bottleneck manageable. Model optimizes for common case.
2. Language redundancy enables lossy compression
512-1024 dimensions sufficient for high-level semantics:
Not sufficient for:
3. Compression efficiency varies by content
Simple sentences: 60-70% preservation adequate
Complex sentences: degradation more severe (explains BLEU drop)

BLEU performance by length:
| Length | Seq2seq BLEU | Information preserved |
|---|---|---|
| 20-30 | 32.8 | ~70% |
| 40-50 | 25.1 | ~55% |
| 60+ | 12.3 | ~30% |
Degradation correlates with information loss. Training optimizes for common case (short sequences).
LSTM enables training (gradient flow) but bottleneck remains (capacity limit). BLEU 20-30 reflects successful learning within constraints.
Problem 1: Information Bottleneck
Fixed context capacity regardless of input length: \[H(\mathbf{c}) \leq h \times 32 \text{ bits}\]
Quantified failure for \(T_s = 50\):
Performance impact:
What solution needs:
Context must scale with input: \(\text{capacity} = O(T_s)\)
Access all encoder states, not compressed summary.
Problem 2: Gradient Vanishing
Path length grows with both sequences: \[\text{path} = T_{src} + T_{tgt}\]
Quantified failure for 50+50 tokens:
LSTM improvement:
What solution needs:
Path length \(= O(1)\) to any encoder state.
Direct connections bypass sequential chain.
LSTM solves gradient flow (Problem 2) but not information capacity (Problem 1). Attention addresses both.
Approach 1: Increase context size
\[\mathbf{c} \in \mathbb{R}^h \rightarrow \mathbb{R}^{kh}\]
Pros:
Cons:
Insufficient. Delays problem but doesn’t solve it.
Approach 2: Multiple context vectors
Use \(\mathbf{c}_1, \mathbf{c}_2, ..., \mathbf{c}_K\) from different encoder positions
Pros:
Cons:
Better, but alignment problem remains.
Approach 3: Hierarchical encoding
Build tree structure over encoder states
Pros:
Cons:
Theoretically interesting, practically difficult.
Approach 4: Direct access with soft selection
Decoder can “look at” any encoder state
Selection weights learned and differentiable
Pros:
Cons:
Solves both problems. This is attention.
Information capacity:
Available information: \(T_s \times h\) dimensions
Context at step \(t\): \(\mathbf{c}_t = \sum_j \alpha_{tj} \mathbf{h}_j\)
Total contexts: \(T_t\) different vectors
Total capacity available to decoder: \[T_t \times (T_s \times h) \text{ dimension-accesses}\]
For 50×50 translation: 50 distinct context vectors built from 2,500 attention weights (vs 1 fixed context in seq2seq)
Information preserved:

Gradient flow:
\[\frac{\partial \mathcal{L}}{\partial \mathbf{h}_j} = \sum_{t=1}^{T_t} \alpha_{tj} \frac{\partial \mathcal{L}}{\partial \mathbf{c}_t}\]
Context as weighted sum:
\[\mathbf{c}_t = \sum_{j=1}^{T_s} \alpha_{tj} \mathbf{h}_j\]
where \(\alpha_{tj} \in [0,1]\) and \(\sum_j \alpha_{tj} = 1\)
Attention weights from scores:
\[\alpha_{tj} = \frac{\exp(e_{tj}/\tau)}{\sum_{k=1}^{T_s} \exp(e_{tk}/\tau)}\]
Temperature \(\tau\) controls distribution sharpness: \(\tau \to 0\) gives one-hot (hard selection), \(\tau \to \infty\) gives uniform. Standard: \(\tau = 1\).
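A short sketch of the temperature's effect on sharpness, using an illustrative score vector:

```python
import torch
import torch.nn.functional as F

# Effect of temperature tau on attention sharpness (illustrative scores).
scores = torch.tensor([0.80, 0.54, 0.74, 0.12, 0.36])

for tau in [0.1, 1.0, 10.0]:
    weights = F.softmax(scores / tau, dim=-1)
    print(f"tau={tau:4.1f}: {weights.numpy().round(3)}")
# tau=0.1  -> sharply concentrated on the top-scoring positions
# tau=1.0  -> standard softmax
# tau=10.0 -> nearly uniform (≈0.2 each)
```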
Score function measures relevance:
\[e_{tj} = \text{score}(\mathbf{s}_{t-1}, \mathbf{h}_j)\]
How well does encoder state \(\mathbf{h}_j\) match decoder query \(\mathbf{s}_{t-1}\)?
Different score functions lead to different attention variants.

What should score(\(\mathbf{s}_{t-1}\), \(\mathbf{h}_j\)) be?
Consider this sentence:
“The animal didn’t cross the street because it was too tired.”
Question: What does “it” refer to?
Humans resolve this through context:
Problem for fixed-context seq2seq:
Attention mechanism:

Attention mechanism computes context dynamically based on decoder state. When generating “it”, the model attends strongly to “animal” (0.65 weight), correctly resolving the pronoun reference.
Encoder (unchanged):
Process source sequence with bidirectional LSTM:
\[\overrightarrow{\mathbf{h}}_j = \text{LSTM}(\mathbf{x}_j, \overrightarrow{\mathbf{h}}_{j-1})\] \[\overleftarrow{\mathbf{h}}_j = \text{LSTM}(\mathbf{x}_j, \overleftarrow{\mathbf{h}}_{j+1})\]
Concatenate: \(\mathbf{h}_j = [\overrightarrow{\mathbf{h}}_j; \overleftarrow{\mathbf{h}}_j] \in \mathbb{R}^{2h}\)
All encoder states preserved: \(\{\mathbf{h}_1, ..., \mathbf{h}_{T_s}\}\)
Decoder with attention:
At each timestep \(t\):
\[\mathbf{s}_t = \text{LSTM}([\mathbf{y}_{t-1}; \mathbf{c}_{t-1}], \mathbf{s}_{t-1})\] \[P(\mathbf{y}_t | \mathbf{y}_{<t}, \mathbf{X}) = \text{softmax}(\mathbf{W}_o[\mathbf{s}_t; \mathbf{c}_t])\]

Context \(\mathbf{c}_t\) computed dynamically at each decoder step.
Critical property: Attention weights \(\alpha_{tj}\) recomputed for each target position \(t\).
Decoder timestep \(t=1\):
Decoder timestep \(t=2\):
Each decoder step accesses different weighted combinations of source information.

At \(t=1\): compute \(\alpha_{1j}\) using \(\mathbf{s}_1\), form \(\mathbf{c}_1\).
At \(t=2\): recompute \(\alpha_{2j}\) using \(\mathbf{s}_2\), form different \(\mathbf{c}_2\).
Each timestep accesses source information dynamically based on current decoding context.
Given at decoder timestep \(t=2\):
Decoder state: \[\mathbf{s}_1 = [0.6, 0.4]^\top \in \mathbb{R}^2\]
Encoder states:
\[\mathbf{h}_1 = [1.0, 0.5]^\top\] \[\mathbf{h}_2 = [0.3, 0.9]^\top\] \[\mathbf{h}_3 = [0.7, 0.8]^\top\] \[\mathbf{h}_4 = [-0.2, 0.6]^\top\] \[\mathbf{h}_5 = [0.4, 0.3]^\top\]
We have \(T_s = 5\) encoder positions, each \(\mathbf{h}_j \in \mathbb{R}^2\).
Task: Compute context vector \(\mathbf{c}_2\) for generating \(\mathbf{y}_2\)
Three steps:

Score function (using dot product):
\[e_{tj} = \mathbf{s}_1^\top \mathbf{h}_j\]
Compute each score:
\[e_{21} = [0.6, 0.4] \cdot [1.0, 0.5] = 0.6 + 0.2 = 0.8\] \[e_{22} = [0.6, 0.4] \cdot [0.3, 0.9] = 0.18 + 0.36 = 0.54\] \[e_{23} = [0.6, 0.4] \cdot [0.7, 0.8] = 0.42 + 0.32 = 0.74\] \[e_{24} = [0.6, 0.4] \cdot [-0.2, 0.6] = -0.12 + 0.24 = 0.12\] \[e_{25} = [0.6, 0.4] \cdot [0.4, 0.3] = 0.24 + 0.12 = 0.36\]
Score vector: \[\mathbf{e}_2 = [0.80, 0.54, 0.74, 0.12, 0.36]\]
Interpretation: Position 1 has highest score (0.80) → most aligned with query

Higher score = better alignment between “query” (\(\mathbf{s}_t\)) and “key” (\(\mathbf{h}_j\))
Scores (from previous step): \[\mathbf{e}_2 = [0.80, 0.54, 0.74, 0.12, 0.36]\]
Apply softmax: \[\alpha_{2j} = \frac{\exp(e_{2j})}{\sum_{k=1}^{5} \exp(e_{2k})}\]
Compute exponentials: \[\exp(\mathbf{e}_2) = [2.23, 1.72, 2.10, 1.13, 1.43]\]
Sum: \[\sum_{k=1}^{5} \exp(e_{2k}) = 8.61\]
Normalize: \[\alpha_{21} = \frac{2.23}{8.61} = 0.259\] \[\alpha_{22} = \frac{1.72}{8.61} = 0.200\] \[\alpha_{23} = \frac{2.10}{8.61} = 0.244\] \[\alpha_{24} = \frac{1.13}{8.61} = 0.131\] \[\alpha_{25} = \frac{1.43}{8.61} = 0.166\]
Attention weights: \[\boldsymbol{\alpha}_2 = [0.259, 0.200, 0.244, 0.131, 0.166]\]
Verify: \(\sum_j \alpha_{2j} = 1.000\)

Softmax converts arbitrary scores to valid probability distribution
Attention weights: \[\boldsymbol{\alpha}_2 = [0.259, 0.200, 0.244, 0.131, 0.166]\]
Encoder states: \[\mathbf{h}_1 = [1.0, 0.5]^\top, \quad \mathbf{h}_2 = [0.3, 0.9]^\top\] \[\mathbf{h}_3 = [0.7, 0.8]^\top, \quad \mathbf{h}_4 = [-0.2, 0.6]^\top\] \[\mathbf{h}_5 = [0.4, 0.3]^\top\]
Compute context: \[\mathbf{c}_2 = \sum_{j=1}^{5} \alpha_{2j} \mathbf{h}_j\]
Dimension 1: \[c_{2,1} = 0.259(1.0) + 0.200(0.3) + 0.244(0.7) + 0.131(-0.2) + 0.166(0.4)\] \[= 0.259 + 0.060 + 0.171 - 0.026 + 0.066 = 0.530\]
Dimension 2: \[c_{2,2} = 0.259(0.5) + 0.200(0.9) + 0.244(0.8) + 0.131(0.6) + 0.166(0.3)\] \[= 0.130 + 0.180 + 0.195 + 0.079 + 0.050 = 0.634\]
Context vector: \[\mathbf{c}_2 = [0.530, 0.634]^\top\]

Context is a weighted combination emphasizing relevant encoder positions
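The same three steps in a few lines of PyTorch (values match the hand computation up to rounding):

```python
import torch
import torch.nn.functional as F

# Reproduce the worked example: dot-product scores, softmax, weighted sum.
s1 = torch.tensor([0.6, 0.4])                        # decoder query s_1
H  = torch.tensor([[1.0, 0.5], [0.3, 0.9], [0.7, 0.8],
                   [-0.2, 0.6], [0.4, 0.3]])         # encoder states h_1..h_5

scores  = H @ s1                                     # e_2j = s_1^T h_j
weights = F.softmax(scores, dim=-1)                  # alpha_2j
context = weights @ H                                # c_2

print(scores)    # tensor([0.8000, 0.5400, 0.7400, 0.1200, 0.3600])
print(weights)   # ≈ [0.259, 0.200, 0.244, 0.131, 0.167]
print(context)   # ≈ [0.530, 0.633]
```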
Batch processing all decoder timesteps:
Decoder states: \(\mathbf{S} \in \mathbb{R}^{T_t \times h}\)
Encoder states: \(\mathbf{H} \in \mathbb{R}^{T_s \times 2h}\)
Score computation:
Dot product (requires matching dimensions): \[\mathbf{E} = \mathbf{S} \mathbf{H}^\top\]
General/multiplicative (handles the \(h\) vs \(2h\) mismatch): \[\mathbf{E} = \mathbf{S} \mathbf{W}_a \mathbf{H}^\top\]
where \(\mathbf{W}_a \in \mathbb{R}^{h \times 2h}\), giving \(\mathbf{E} \in \mathbb{R}^{T_t \times T_s}\)
\(E_{ij}\) = score for decoder step \(i\) attending to encoder position \(j\)
Attention weights:
\[\mathbf{A} = \text{softmax}(\mathbf{E}, \text{dim}=-1)\]
Applied row-wise: each row sums to 1
Context vectors:
\[\mathbf{C} = \mathbf{A} \mathbf{H}\]
Result: \(\mathbf{C} \in \mathbb{R}^{T_t \times 2h}\)
Row \(i\) of \(\mathbf{C}\) is context \(\mathbf{c}_i\) for decoder step \(i\)

GPU-friendly: All operations are matrix multiplications.
Key-Value store interpretation:
Encoder states form key-value pairs:
Decoder query:
Retrieval process:
Soft vs hard lookup:
Hard (discrete): Return value with highest score
Soft (differentiable): Return weighted average of all values
Weights indicate relevance of each value to query.

Why soft? Gradients flow through weighted sum, enabling end-to-end learning.
Central question: How to measure relevance?
\[e_{tj} = \text{score}(\mathbf{s}_{t-1}, \mathbf{h}_j)\]
Requirements:
Three main approaches emerged:
Each makes different tradeoffs.

Score function:
\[e_{tj} = \mathbf{v}^\top \tanh(\mathbf{W}_s \mathbf{s}_{t-1} + \mathbf{W}_h \mathbf{h}_j)\]
Parameters:
Computation:
Why tanh?
Cost: \(O(d_a \times (h_d + h_e))\) per score

Learns complex, nonlinear alignment patterns via tanh nonlinearity.
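A minimal PyTorch sketch of the additive score, with illustrative dimensions \(h_d\), \(h_e\), \(d_a\) (not the original model's sizes):

```python
import torch
import torch.nn as nn

class AdditiveScore(nn.Module):
    """Bahdanau-style score: e_tj = v^T tanh(W_s s_{t-1} + W_h h_j)."""
    def __init__(self, h_d, h_e, d_a):
        super().__init__()
        self.W_s = nn.Linear(h_d, d_a, bias=False)
        self.W_h = nn.Linear(h_e, d_a, bias=False)
        self.v   = nn.Linear(d_a, 1, bias=False)

    def forward(self, s_prev, H):
        # s_prev: [B, h_d] decoder state; H: [B, T_s, h_e] encoder states
        proj = self.W_s(s_prev).unsqueeze(1) + self.W_h(H)   # [B, T_s, d_a]
        return self.v(torch.tanh(proj)).squeeze(-1)          # [B, T_s] scores

scores = AdditiveScore(h_d=4, h_e=6, d_a=8)(torch.randn(2, 4), torch.randn(2, 5, 6))
print(scores.shape)   # torch.Size([2, 5])
```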
Three variants:
1. Dot product (simplest): \[e_{tj} = \mathbf{s}_{t-1}^\top \mathbf{h}_j\]
Requires: \(h_d = h_e\)
2. General (learned projection): \[e_{tj} = \mathbf{s}_{t-1}^\top \mathbf{W}_a \mathbf{h}_j\]
Parameters: \(\mathbf{W}_a \in \mathbb{R}^{h_d \times h_e}\)
Works for any dimensions.
3. Concat (hybrid with additive): \[e_{tj} = \mathbf{v}^\top \tanh(\mathbf{W}_a [\mathbf{s}_{t-1}; \mathbf{h}_j])\]
Parameters: \(\mathbf{W}_a \in \mathbb{R}^{d_a \times (h_d + h_e)}\), \(\mathbf{v} \in \mathbb{R}^{d_a}\)
Comparison to Bahdanau:

Tradeoff: Simplicity and speed vs expressiveness.
Score function:
\[e_{tj} = \frac{\mathbf{s}_{t-1}^\top \mathbf{h}_j}{\sqrt{d_k}}\]
where \(d_k\) is the key dimension.
Why scaling is critical:
Dot product variance grows with dimension.
For \(\mathbf{s}, \mathbf{h}\) with i.i.d. components ~ \(\mathcal{N}(0, 1)\):
\[\text{Var}(\mathbf{s}^\top \mathbf{h}) = \sum_{i=1}^{d_k} \text{Var}(s_i h_i) = d_k\]
Without scaling, as \(d_k\) increases:
With scaling: \[\text{Var}\left(\frac{\mathbf{s}^\top \mathbf{h}}{\sqrt{d_k}}\right) = \frac{d_k}{d_k} = 1\]
Variance stays constant regardless of dimension.

Critical for transformers: Enables attention in high-dimensional spaces (512-1024).
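An empirical check of the variance argument, drawing random \(\mathcal{N}(0,1)\) vectors at several dimensions (sample count is illustrative):

```python
import torch

# Dot-product variance grows with d_k; dividing by sqrt(d_k) keeps it near 1.
torch.manual_seed(0)
for d_k in [16, 64, 256, 1024]:
    s = torch.randn(10_000, d_k)
    h = torch.randn(10_000, d_k)
    dots = (s * h).sum(dim=-1)
    scaled = dots / d_k ** 0.5
    print(f"d_k={d_k:5d}  var(dot)≈{dots.var().item():8.1f}  "
          f"var(scaled)≈{scaled.var().item():.2f}")
# var(dot) ≈ d_k in each row; the scaled variance stays ≈ 1
```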
Attention weights normalized via softmax:
\[\alpha_j = \frac{\exp(e_j)}{\sum_{k=1}^{T_s} \exp(e_k)}\]
Gradient computation requires \(\frac{\partial \alpha_j}{\partial e_k}\) for backpropagation.
Derivation:
For \(j = k\) (diagonal):
\[\frac{\partial \alpha_j}{\partial e_j} = \frac{\partial}{\partial e_j}\left(\frac{\exp(e_j)}{Z}\right)\]
where \(Z = \sum_k \exp(e_k)\)
\[= \frac{\exp(e_j) \cdot Z - \exp(e_j) \cdot \exp(e_j)}{Z^2}\]
\[= \frac{\exp(e_j)}{Z} \left(1 - \frac{\exp(e_j)}{Z}\right) = \alpha_j(1 - \alpha_j)\]
For \(j \neq k\) (off-diagonal):
\[\frac{\partial \alpha_j}{\partial e_k} = \frac{0 \cdot Z - \exp(e_j) \cdot \exp(e_k)}{Z^2}\]
\[= -\frac{\exp(e_j)}{Z} \cdot \frac{\exp(e_k)}{Z} = -\alpha_j \alpha_k\]
Combined form:
\[\frac{\partial \alpha_j}{\partial e_k} = \alpha_j(\delta_{jk} - \alpha_k)\]
where \(\delta_{jk} = 1\) if \(j=k\), else \(0\) (Kronecker delta).
Gradient from loss \(\mathcal{L}\) to scores:
\[\frac{\partial \mathcal{L}}{\partial e_k} = \sum_{j=1}^{T_s} \frac{\partial \mathcal{L}}{\partial \alpha_j} \cdot \frac{\partial \alpha_j}{\partial e_k}\]
\[= \sum_j \frac{\partial \mathcal{L}}{\partial \alpha_j} \cdot \alpha_j(\delta_{jk} - \alpha_k)\]
\[= \alpha_k \frac{\partial \mathcal{L}}{\partial \alpha_k} - \alpha_k \sum_j \alpha_j \frac{\partial \mathcal{L}}{\partial \alpha_j}\]
Vectorized:
\[\nabla_{\mathbf{e}} \mathcal{L} = \boldsymbol{\alpha} \odot \left(\nabla_{\boldsymbol{\alpha}} \mathcal{L} - \langle \boldsymbol{\alpha}, \nabla_{\boldsymbol{\alpha}} \mathcal{L} \rangle \mathbf{1}\right)\]
where \(\odot\) is element-wise product, \(\langle \cdot, \cdot \rangle\) is dot product.
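The vectorized formula can be checked against autograd on random scores; a small sketch (the random vector \(g\) stands in for \(\nabla_{\boldsymbol{\alpha}} \mathcal{L}\)):

```python
import torch

# Verify dL/de = alpha ⊙ (dL/dalpha - <alpha, dL/dalpha>) against autograd.
e = torch.randn(7, requires_grad=True)
alpha = torch.softmax(e, dim=-1)

g = torch.randn(7)                 # stand-in for dL/dalpha
alpha.backward(g)                  # autograd computes dL/de into e.grad

analytic = alpha.detach() * (g - (alpha.detach() * g).sum())
print(torch.allclose(e.grad, analytic, atol=1e-6))   # True
```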
Gradient properties:
Empirical results on WMT’14 En→De:
| Score Function | BLEU | Parameters | Speed |
|---|---|---|---|
| Fixed context | 14.0 | — | Fast |
| Dot product | 18.2 | 0 | Fastest |
| General (Luong) | 20.1 | 256K | Fast |
| Additive (Bahdanau) | 20.3 | 640K | Medium |
| Scaled dot-product | 20.8 | 0 | Fastest |
Any attention mechanism: +4-6 BLEU over seq2seq baseline.
Scaling critical: Dot product (18.2) → scaled (20.8), +2.6 BLEU.
Diminishing returns: Additive (20.3) vs scaled (20.8), only +0.5 BLEU difference.
Speed-quality tradeoff: Scaled dot-product fastest with highest quality.
Current practice (2025):

Scaled dot-product dominates modern applications: fastest computation, highest quality, zero additional parameters.
Statistical MT alignment (pre-neural):
Given parallel sentences, find word-to-word mapping:
Hard alignment (IBM Models 1-5, 1990s):
Each target word aligned to exactly one source word (or NULL).
Learned via EM algorithm over large corpora.
Properties:
Problems:

Neural attention learns alignment implicitly during translation training.
Attention weights = soft alignment:
\[\alpha_{tj} \in (0, 1), \quad \sum_j \alpha_{tj} = 1\]
Each target position \(t\) has distribution over all source positions.
Differences from hard alignment:
Interpretation:

Each row sums to 1: Valid probability distribution over source positions.
Source (English): “The agreement on the European Economic Area was signed in August 1992”
Target (German): “Das Abkommen über den Europäischen Wirtschaftsraum wurde im August 1992 unterzeichnet”
Alignment challenges:
Attention patterns:
Model learns these patterns without explicit alignment supervision.

Attention discovers linguistic structure without explicit supervision.
Critical point: Attention alignment learned without alignment annotations.
Training setup:
Input: Parallel sentences (source, target) pairs
Supervision: Only translation loss \[\mathcal{L} = -\sum_{t=1}^{T_t} \log P(\mathbf{y}_t^* | \mathbf{y}_{<t}^*, \mathbf{X}; \theta)\]
No alignment labels: Never told which source word aligns to which target word
What emerges:
Attention weights \(\alpha_{tj}\) converge to linguistically meaningful alignments:
Why this matters:
Traditional MT required:
Attention mechanism:
Alignment is a by-product of learning to translate, not an explicit training target.

Attention weights encode source-target alignments as emergent structure from translation objective. Model discovers linguistic correspondences through backpropagation, without explicit alignment supervision during training.
Pattern 1: Diagonal (Monotonic alignment)
Pattern 2: Off-diagonal blocks (Reordering)
Pattern 3: Vertical stripes (Many-to-one)
Pattern 4: Horizontal stripes (One-to-many)
Pattern 5: Diffuse (Uniform)

Pattern analysis helps diagnose model behavior and failure modes.
Unidirectional encoder:
\[\mathbf{h}_j = \text{RNN}(\mathbf{x}_j, \mathbf{h}_{j-1})\]
State \(\mathbf{h}_j\) only sees \(\mathbf{x}_1, ..., \mathbf{x}_j\) (past context).
Problem: Cannot use future context for alignment.
Example: “bank” in “river bank” vs “savings bank”
Bidirectional encoder:
\[\overrightarrow{\mathbf{h}}_j = \text{RNN}(\mathbf{x}_j, \overrightarrow{\mathbf{h}}_{j-1})\] \[\overleftarrow{\mathbf{h}}_j = \text{RNN}(\mathbf{x}_j, \overleftarrow{\mathbf{h}}_{j+1})\] \[\mathbf{h}_j = [\overrightarrow{\mathbf{h}}_j; \overleftarrow{\mathbf{h}}_j]\]
Each \(\mathbf{h}_j\) incorporates full sentence context.
Benefits:

Bidirectional encoders: 2× computation for full source context. WMT’14: +2.1 BLEU over unidirectional.

Decoder-encoder formulation:
\[\mathbf{c}_t = \sum_{j=1}^{T_s} \alpha_{tj} \mathbf{h}_j\]
where \(\alpha_{tj} = \text{softmax}(\text{score}(\mathbf{s}_{t-1}, \mathbf{h}_j))\)
Ties attention to specific seq2seq architecture. Decoder state \(\mathbf{s}_{t-1}\) queries encoder states \(\mathbf{h}_j\) directly.
Query-Key-Value abstraction:
\[\mathbf{c}_t = \sum_{j=1}^{T_s} \alpha_{tj} \mathbf{v}_j\]
where \(\alpha_{tj} = \text{softmax}(\mathbf{q}_t^\top \mathbf{k}_j / \sqrt{d})\)
General formulation decouples mechanism from architecture. Query \(\mathbf{q}_t = \mathbf{s}_{t-1}\) searches keys \(\mathbf{k}_j = \mathbf{h}_j\) to weight values \(\mathbf{v}_j = \mathbf{h}_j\).
Same computation applies to self-attention, cross-attention, multi-head variants.

Modern GPUs contain thousands of parallel cores. Sequential loops use one core—remaining capacity sits idle.
Batch size 32, sequence length 50: typical speedup 15-40×.
Input tensors: \(\mathbf{Q} \in \mathbb{R}^{B \times T_t \times d}\), \(\mathbf{K} \in \mathbb{R}^{B \times T_s \times d}\), \(\mathbf{V} \in \mathbb{R}^{B \times T_s \times d}\),
where \(B\) is batch size.
Three operations:
1. Score computation: \[\mathbf{E} = \mathbf{Q}\mathbf{K}^\top / \sqrt{d}\]
Result: \(\mathbf{E} \in \mathbb{R}^{B \times T_t \times T_s}\)
Batch matrix multiply processes all samples simultaneously.
2. Softmax normalization: \[\mathbf{A} = \text{softmax}(\mathbf{E}, \text{dim}=-1)\]
Applied per row. Each row in \(\mathbf{A}\) sums to 1.
3. Context assembly: \[\mathbf{C} = \mathbf{A}\mathbf{V}\]
Result: \(\mathbf{C} \in \mathbb{R}^{B \times T_t \times d}\)
All matrix operations, no loops.
```python
import torch
import torch.nn.functional as F
import math

def batched_attention(Q, K, V):
    """
    Args:
        Q: [B, T_t, d] query vectors
        K: [B, T_s, d] key vectors
        V: [B, T_s, d] value vectors
    Returns:
        context: [B, T_t, d]
        weights: [B, T_t, T_s]
    """
    d = Q.size(-1)
    # Scores: [B, T_t, T_s]
    scores = torch.matmul(Q, K.transpose(-2, -1))
    scores = scores / math.sqrt(d)
    # Attention weights: [B, T_t, T_s]
    weights = F.softmax(scores, dim=-1)
    # Context: [B, T_t, d]
    context = torch.matmul(weights, V)
    return context, weights
```
Attention operations per sample:
Score computation \(\mathbf{Q}\mathbf{K}^\top\):
Context computation \(\mathbf{A}\mathbf{V}\):
Total: \(O(T_t \times T_s \times d)\) operations per sample
Memory: \(O(T_t \times T_s)\) for attention matrix storage
RNN baseline:
Encoder: \(O(T_s \times d^2)\) operations
Decoder: \(O(T_t \times d^2)\) operations
Total: \(O((T_s + T_t) \times d^2)\)
Attention cost exceeds the RNN cost only when \(T_s \times T_t > (T_s + T_t) \times d\).
Translation (\(T_s = T_t = 50\), \(d = 512\)): attention adds \(\approx 1.3\)M operations per sample versus \(\approx 26\)M for the RNN, a modest overhead, and unlike the RNN these operations parallelize fully.

Attention: \(O(T^2)\) but parallel. RNN: \(O(T \times d^2)\) but sequential.
Direct softmax computation fails for large scores:
\[\text{softmax}(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}\]
For \(x_i = 50\): \(\exp(50) \approx 5 \times 10^{21}\), which overflows float16 and, after summation and division, wastes most of float32's precision; for \(x_i \gtrsim 89\), \(\exp(x_i)\) overflows float32 outright.
Log-sum-exp trick maintains numerical stability:
\[\text{softmax}(x_i) = \frac{\exp(x_i - \max(x))}{\sum_j \exp(x_j - \max(x))}\]
Subtracting \(\max(x)\) shifts largest value to zero. Exponential of zero equals one (safe). All other values become negative, producing exponentials less than one (safe). Result remains mathematically identical since shift cancels in ratio.
Additional epsilon prevents division by zero:
\[\alpha_j = \frac{\exp(e_j - \max_k e_k)}{\sum_k \exp(e_k - \max_k e_k) + \epsilon}\]
where \(\epsilon = 10^{-9}\).
Gradient clipping remains necessary but requires less aggressive bounds than RNN. Typical maximum gradient norm: 5.0 for attention versus 0.25-1.0 for RNN.
```python
def stable_softmax_attention(scores):
    """
    Args:
        scores: [B, T_t, T_s] attention scores
    Returns:
        weights: [B, T_t, T_s] attention weights
    """
    # Subtract max for numerical stability
    scores_max = scores.max(dim=-1, keepdim=True)[0]
    scores_stable = scores - scores_max
    # Compute exponentials (now safe)
    exp_scores = torch.exp(scores_stable)
    # Normalize with epsilon
    exp_sum = exp_scores.sum(dim=-1, keepdim=True)
    weights = exp_scores / (exp_sum + 1e-9)
    return weights
```
Mixed precision (FP16) considerations:

Training: full sequences in parallel, each score computed once.
Inference: one token at a time, each new token attends over all previous positions.
Encoder keys and values stay constant—recomputing them at every decoder step wastes computation. For \(T\) generated tokens: \(O(T^2)\) scores computed, only \(O(T)\) new.
Observation: At every decoder step,
\[\text{scores}_t = \mathbf{q}_t \mathbf{K}_{enc}^\top\]
But \(\mathbf{K}_{enc}\) never changes. Identical computation repeats \(T_t\) times.
Implementation:
```python
# Compute once before generation
K_enc = encoder_projection(encoder_states)
V_enc = encoder_projection(encoder_states)

# Reuse for all T_t decoder steps
for t in range(T_t):
    q_t = decoder_state[t]
    scores = q_t @ K_enc.T
    weights = softmax(scores)
    context = weights @ V_enc
```
Memory cost: \(O(T_s \times d)\) for cache.
For \(T_s = 50\), \(d = 512\): 25,600 floats = 100 KB.

Speedup: \(T_t\times\) for encoder attention, 10-50× overall.
```python
class AttentionWithCache:
    """Caches keys/values across decoding steps so they are not recomputed."""
    def __init__(self):
        self.k_cache = None
        self.v_cache = None

    def forward(self, query, key, value, use_cache=True):
        # Append new keys/values to those cached from previous steps
        if use_cache and self.k_cache is not None:
            key = torch.cat([self.k_cache, key], dim=1)
            value = torch.cat([self.v_cache, value], dim=1)
        self.k_cache, self.v_cache = key, value
        # attention() is any score/softmax/weighted-sum routine,
        # e.g., batched_attention defined earlier
        return attention(query, key, value)
```
RNN creates sequential dependencies:
\[\mathbf{h}_t = f(\mathbf{h}_{t-1}, \mathbf{x}_t)\]
Must compute \(\mathbf{h}_1\) before \(\mathbf{h}_2\)—hardware waits at each timestep.
Attention has no temporal dependencies:
\[\mathbf{c}_t = \sum_j \alpha_{tj} \mathbf{h}_j\]
All \(\alpha_{tj}\) compute in parallel. GPU processes all positions simultaneously.

RNN:
Attention:
Parallel attention enables efficient GPU utilization despite higher FLOPs.
Random initialization produces near-uniform attention distributions at training start:
\[\alpha_{tj} \approx \frac{1}{T_s} \text{ for all } j\]
Score function parameters contain no prior knowledge of linguistic alignment. Model must discover alignment patterns through translation loss gradient signal alone.

Early training (epochs 1-2): obvious one-to-one alignments. Diagonal emerges for similar word order. Sharp but inaccurate.
Mid training (epochs 3-6): refines weights, learns reordering. Handles phrasal alignment.
Late training (epochs 7-12): fine-tunes rare words, long-distance dependencies. Patterns stabilize.
No explicit alignment supervision—purely from translation loss.
Gradient magnitude to encoder positions quantifies learning signal strength. Measure \(\|\frac{\partial \mathcal{L}}{\partial \mathbf{h}_j}\|\) during training for source position \(j\).
Seq2seq with fixed context exhibits exponential gradient decay:
\[\left\|\frac{\partial \mathcal{L}}{\partial \mathbf{h}_j}\right\| \approx \gamma^{T_s - j}\]
For \(\gamma = 0.45\), the gradient reaching early positions has effectively vanished: a position receiving \(\sim 10^{-18}\) gradient magnitude cannot learn effectively.
Attention distributes gradients based on relevance:
\[\frac{\partial \mathcal{L}}{\partial \mathbf{h}_j} = \sum_t \alpha_{tj} \frac{\partial \mathcal{L}}{\partial \mathbf{c}_t}\]
No exponential decay. Important tokens receive strong gradients regardless of position. Token attended heavily across multiple decoder steps accumulates large gradient.

WMT’14 En→De, 50-token sequences: gradient magnitude 100× larger at position 1, 1000× at position 25, \(10^{10}\times\) at position 45 (attention vs seq2seq).
Convergence: 40-50K iterations (attention) versus 100K (seq2seq).
Seq2seq loss landscape contains single information pathway: \(\mathbf{X} \rightarrow \mathbf{c} \rightarrow \mathbf{Y}\). Optimization bottlenecks at fixed context encoding. Single path to reduce loss creates sharp, narrow minima. Small parameter perturbations significantly increase loss.
Attention loss landscape contains \(O(T_s \times T_t)\) gradient pathways through attention weights. Loss reduction occurs via: alignment adjustments for different tokens, individual attention weight refinement, score function parameter modification. Gradient descent exploits redundant paths. Minima become smoother and wider. Parameter perturbations distribute across multiple pathways, maintaining lower loss.

Learning rate: seq2seq \(\leq 0.001\), attention up to \(0.01\).
Variance: seq2seq high (unstable gradients), attention low (stable).
Convergence: seq2seq 100K iterations, attention 40-50K—2-2.5× speedup.
Pattern:
\[\alpha_{tj} \approx \frac{1}{T_s} \text{ for all } j\]
Attention weights approximately uniform across all source positions.
Measured characteristics:
Occurrence conditions:
Performance impact:
WMT’14 En→De with uniform attention:

Uniform weights provide no source discrimination. Translation accuracy equivalent to random alignment baseline.
Pattern:
\[\alpha_{tj} \approx \begin{cases} 1 & j = k \\ 0 & \text{otherwise} \end{cases}\]
Single source position receives \(>0.9\) weight per timestep.
Measured characteristics:
Occurrence conditions:
Performance impact:
Repetition rate on WMT’14:

Single-position attention ignores phrasal context. Repetition rate increases 5.8× over baseline.
Pattern:
Alignment error increases with sequence length.
Measured degradation:
| Length (tokens) | Alignment F1 | BLEU | Error rate |
|---|---|---|---|
| 20-40 | 0.78 | 28.3 | 8.2% |
| 40-60 | 0.71 | 24.1 | 14.7% |
| 60-80 | 0.63 | 19.8 | 22.1% |
| 80-100 | 0.54 | 15.2 | 31.4% |
Drift rate: Alignment F1 decreases 0.012 per token beyond position 40.

Contributing factors:
Alignment F1 < 0.6 for sequences exceeding 80 tokens. BLEU degradation: 0.15 per additional token.
Standard attention computes independently at each decoder step:
\[\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_k \exp(e_{tk})}\]
where \(e_{tj} = \text{score}(\mathbf{s}_{t-1}, \mathbf{h}_j)\).
No information about which source positions received attention previously. No record of how many times each position has been used. No tracking of what remains to translate.
Consider English to French translation:
Source: “The European Economic Community was founded in nineteen fifty seven”
Step 5: Translating “Communauté” (Community)
Attends to “Community”: \(\alpha_{5,3} = 0.7\)
Step 6: Translating “économique” (Economic)
Can attend to “Community” again: \(\alpha_{6,3} = 0.6\). No penalty for reuse.
Step 10: Generating next word
Might attend to “Community” yet again. Can completely ignore “fifty seven”.
Output: “…Communauté a été Communauté en…”
Missing year translation. Repeated “Communauté”.
Track total attention to each source position:
\[\mathbf{c}^{cov}_t = \sum_{i=1}^{t-1} \boldsymbol{\alpha}_i\]
where \(c^{cov}_{tj}\) represents cumulative attention to position \(j\) through step \(t\).
Modify score function to penalize repeated attention:
\[e_{tj} = \text{score}(\mathbf{s}_{t-1}, \mathbf{h}_j) - \lambda c^{cov}_{tj}\]
Positions with high cumulative attention receive lower scores. Forces model to explore unattended positions.
Example progression:
Source positions: \([1, 2, 3, 4, 5]\)
Step 1: Coverage \([0, 0, 0, 0, 0]\), attention focuses position 3 (\(\alpha = 0.5\))
Step 2: Coverage \([0.1, 0.2, 0.5, 0.1, 0.1]\), position 3 penalized by \(-\lambda \times 0.5\), attention reduces to 0.3
Step 3: Coverage \([0.25, 0.45, 0.8, 0.25, 0.25]\), position 3 heavily penalized \(-\lambda \times 0.8\), attention drops to 0.1
Step 5: Coverage \([0.8, 0.95, 1.5, 0.4, 0.35]\), positions 1-3 saturated, forced to attend positions 4-5
All source positions eventually covered.
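A minimal sketch of the coverage penalty in PyTorch; real coverage models (Tu et al., 2016) feed the coverage vector into the score network rather than subtracting it directly, so the subtraction here is a simplification:

```python
import torch
import torch.nn.functional as F

def coverage_attention(scores, coverage, lam=1.0):
    """Penalize positions already attended: e'_tj = e_tj - lambda * coverage_j."""
    weights = F.softmax(scores - lam * coverage, dim=-1)
    coverage = coverage + weights          # accumulate attention mass
    return weights, coverage

T_s = 5
coverage = torch.zeros(T_s)
for t in range(4):                          # a few decoder steps with fixed scores
    scores = torch.tensor([0.2, 0.4, 1.5, 0.1, 0.0])   # the third position always scores highest
    weights, coverage = coverage_attention(scores, coverage)
    print(f"step {t}: max weight at pos {weights.argmax().item()}, "
          f"weight={weights.max().item():.2f}")
# Attention on the repeatedly-used position decays as its coverage grows.
```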

Empirical results (Tu et al., 2016):
| Dataset | Baseline | +Coverage | Improvement |
|---|---|---|---|
| WMT’14 En→De | 19.8 | 20.7 | +0.9 BLEU |
| WMT’14 En→Fr | 34.2 | 35.1 | +0.9 BLEU |
| Long sequences (>50 tokens) | 16.3 | 18.1 | +1.8 BLEU |
Larger gains on longer sequences where repetition occurs more frequently. Reduces over-translation by 45%, under-translation by 30%. Computational overhead: 2-3% slower training.
Typical coverage penalty: \(\lambda \in [0.5, 2.0]\) depending on task.

Attention entropy quantifies distribution sharpness:
\[H(\alpha_t) = -\sum_j \alpha_{tj} \log \alpha_{tj}\]
Expected range: \(H \in [1, \log(T_s) - 1]\)
Monitor entropy distribution across training batches. Outliers indicate failures.
Maximum attention weight per row:
\[\max_j \alpha_{tj}\]
Expected range: [0.3, 0.7]
>0.9: Over-concentration
Coverage uniformity:
\[\sum_t \alpha_{tj} \text{ for each } j\]
Expected: roughly even distribution across source positions. Large variance indicates some positions ignored, others over-attended.
Visualize attention matrices for failure cases. Patterns reveal specific failure modes immediately.
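A small sketch of these three diagnostics for a single attention matrix (function name and thresholds are illustrative, not a standard API):

```python
import torch

def attention_diagnostics(A, eps=1e-9):
    """Diagnostics for an attention matrix A of shape [T_t, T_s] (rows sum to 1)."""
    entropy  = -(A * (A + eps).log()).sum(dim=-1)    # per-row entropy
    max_w    = A.max(dim=-1).values                  # per-row peak weight
    coverage = A.sum(dim=0)                          # total attention per source position
    return {
        "mean_entropy": entropy.mean().item(),
        "mean_max_weight": max_w.mean().item(),
        "coverage_std": coverage.std().item(),       # large std -> uneven coverage
    }

A = torch.softmax(torch.randn(6, 8), dim=-1)         # illustrative attention matrix
print(attention_diagnostics(A))
```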
Workshop on Machine Translation (WMT):
Annual shared task for machine translation evaluation.
Standard benchmarks:
Evaluation metric: BLEU (Bilingual Evaluation Understudy)
Measures n-gram precision between candidate and reference translations:
\[\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{4} w_n \log p_n\right)\]
where:
Range: 0-100 (higher = better)
Example scores:

Limitations: BLEU correlates with human judgment but misses semantic equivalence, fluency issues.

Key findings:
Why larger gains on En→De?
German requires long-range reordering (verb-final clauses), compound words spanning multiple source tokens, case agreement across phrases. Fixed context bottleneck cannot preserve this information. Attention accesses full source representation dynamically.
Method: Train full model, remove one component, measure BLEU degradation on WMT’14 En→Fr validation set.
Component impacts:
| Configuration | BLEU | ∆ |
|---|---|---|
| Full attention model | 36.7 | — |
| - Scaled scores (no \(1/\sqrt{d_k}\)) | 33.1 | -3.6 |
| - Bidirectional encoder | 34.2 | -2.5 |
| - Input feeding | 35.2 | -1.5 |
| - Coverage mechanism | 35.9 | -0.8 |
| Fixed context (seq2seq) | 31.5 | -5.2 |
Scaled scores: -3.6 BLEU. Without \(1/\sqrt{d_k}\) scaling, dot products grow with dimension. For \(d=512\), unscaled scores reach ±50, softmax saturates to near one-hot. Gradient flow breaks.
Bidirectional encoder: -2.5 BLEU. Unidirectional states lack future context. Ambiguous words (“bank” in “river bank” vs “savings bank”) cannot disambiguate. Alignment accuracy drops.
Input feeding: -1.5 BLEU. Decoder loses track of what was attended previously. Coherence across steps degrades. More repetition, missed source words.
Coverage mechanism: -0.8 BLEU. Effect concentrated on long sequences (>50 tokens): -1.8 BLEU on long subset. Negligible on short sequences.

Attention architecture: Removing all attention components (seq2seq baseline) loses 5.2 BLEU total. Non-additive: interactions between components exist.
Model trained only on translation loss \((y, \hat{y})\). No alignment labels, no parse trees, no linguistic annotations.
Attention weights \(\alpha_{tj}\) emerge as implicit alignment between source and target positions.
Analysis methodology:
Examine trained attention weights on test set. For each target word \(y_t\), inspect which source positions receive high attention weight \(\alpha_{tj} > 0.3\).
Compare discovered alignments to:

Three categories of learned patterns:
English → French translation:
Both languages follow Subject-Verb-Object order. Adjectives differ (French often post-nominal) but core structure similar.
Attention pattern: Strong diagonal
Source: “The quick brown fox jumps over the lazy dog”
Target: “Le rapide renard brun saute sur le chien paresseux”
Attention discovers:
Attention matrix shows clear diagonal with small local deviations for adjective-noun reordering.
Quantification:
Mean absolute diagonal offset: 1.2 positions
85% of attention weight within ±3 positions of diagonal

Interpretation: For languages with similar syntax, attention learns near-identity mapping with local adjustments for grammatical differences.
English (SVO) → German (SOV in subordinate clauses):
Source: “I know that she will come tomorrow”
Target: “Ich weiß dass sie morgen kommen wird”
Verb moves from middle to end. Attention must handle 5-position jump.

Analysis:
Attention discovers:
Without attention, fixed context at end of encoding cannot preserve verb identity after 5 more tokens processed. Attention accesses “will” and “come” directly regardless of distance.
Quantification on WMT’14 En→De test set:
German compound nouns: Single target word maps to multiple source words.
Source: “European Economic Area”
Target: “Wirtschaftsgebiet” (literally “economy-area”)
Attention distributes weight across source phrase:
French multi-word expressions: Multiple target words map to single source idiom.
Source: “kick the bucket” (idiom for “die”)
Target: “il est mort” (literally “he is dead”)
Attention shows diffuse pattern:

Implications:
Attention automatically handles:
Fixed context bottleneck loses phrase structure—all information compressed equally.
Novel/rare words:
Source: “The chrysanthemum bloomed”
Target: “Le chrysanthème a fleuri”
If “chrysanthemum” appears <5 times in training, attention weights diffuse. Cannot learn specific alignment without exposure.
Result: Generic flower translation or hallucination.
Highly idiomatic expressions:
Source: “It’s raining cats and dogs”
Literal German: “Es regnet Katzen und Hunde” (incorrect)
Correct: “Es regnet in Strömen” (“in streams”)
Attention often produces word-by-word translation without cultural context for idioms.
Very long sequences (>100 tokens):
Attention matrix \(T_s \times T_t\) grows quadratically. For \(T_s = T_t = 100\): 10K attention weights per decoder step.
Optimization difficulty: More parameters to learn, noisier gradient signal.
Empirical degradation: -0.3 BLEU per 10 tokens beyond 80-token threshold.

Quantified limitations:
Why these failures matter:
Understanding limitations guides architecture improvements. Rare words → subword tokenization (BPE, WordPiece). Long sequences → sparse attention, local windows. Idioms → retrieval-augmented generation.
Core abstraction:
Attention = soft selection over representations.
No assumption about representation structure:
General formulation:
Given:
Compute: \[\alpha_i = \frac{\exp(\text{score}(\mathbf{q}, \mathbf{k}_i))}{\sum_j \exp(\text{score}(\mathbf{q}, \mathbf{k}_j))}\]
\[\text{context} = \sum_{i=1}^n \alpha_i \mathbf{v}_i\]
What changes across domains:
| Domain | Source | Keys/Values | Score function |
|---|---|---|---|
| Translation | Text tokens | LSTM hidden states | Additive/Dot |
| Captioning | Image regions | CNN feature map | Spatial dot |
| Speech | Audio frames | Spectrogram features | Location-aware |
| Summarization | Document sentences | Sentence embeddings | Hierarchical |

Same mechanism, different representations. Next: Image captioning applies attention to CNN feature maps.
CNN extracts feature map \(\mathbf{F} \in \mathbb{R}^{H \times W \times D}\) where each spatial position \((i,j)\) encodes local image region.
Flatten to sequence: \(L = H \times W\) positions, each a \(D\)-dimensional vector \(\mathbf{v}_k\).
Decoder LSTM generates caption tokens. At each step \(t\):
\[\alpha_{tk} = \frac{\exp(\mathbf{s}_t^\top \mathbf{W}_a \mathbf{v}_k)}{\sum_{j=1}^{L} \exp(\mathbf{s}_t^\top \mathbf{W}_a \mathbf{v}_j)}\]
\[\mathbf{c}_t = \sum_{k=1}^{L} \alpha_{tk} \mathbf{v}_k\]
Attention weights \(\alpha_{tk}\) exhibit spatial concentration: \(\alpha_{tk} > 0.1\) typically spans 3-8% of image area (2-6 adjacent spatial locations out of 49 total in \(7 \times 7\) grid).
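A minimal sketch of spatial attention over a flattened CNN feature map, with illustrative shapes (7×7×512 features, 256-dimensional decoder state):

```python
import torch
import torch.nn.functional as F

def spatial_attention(s_t, F_map, W_a):
    """Attend over CNN feature map F_map: [H, W, D] with decoder state s_t: [d_s].
    W_a: [d_s, D] learned bilinear projection (illustrative shapes)."""
    H, W, D = F_map.shape
    V = F_map.reshape(H * W, D)                    # L = H*W region vectors
    scores  = V @ (W_a.T @ s_t)                    # s_t^T W_a v_k for each region
    weights = F.softmax(scores, dim=-1)            # alpha_tk over L regions
    context = weights @ V                          # c_t: [D]
    return context, weights.reshape(H, W)          # reshape weights for visualization

F_map = torch.randn(7, 7, 512)                     # e.g., 7x7x512 CNN feature map
s_t   = torch.randn(256)
W_a   = torch.randn(256, 512)
context, attn_map = spatial_attention(s_t, F_map, W_a)
print(context.shape, attn_map.shape)               # torch.Size([512]) torch.Size([7, 7])
```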

MS COCO benchmark (5K test images):
Baseline (mean pooling): 25.0 BLEU-4, 23.0 CIDEr
Spatial attention: 28.4 BLEU-4, 27.2 CIDEr
Difference: +3.4 BLEU-4 (+14%), +4.2 CIDEr (+18%)
Attention weights correlate with human gaze patterns: IoU = 0.68 on regions fixated during first 500ms. No explicit region annotations during training.
Audio spectrograms at 100 fps: 10-second utterance = 1000 frames.
Transcript: ~30 characters.
Direct attention over 1000 frames: \(30 \times 1000 = 30\)K attention weights per sequence. Memory: 120KB per sample (fp32).
Pyramidal Bi-LSTM encoder:
Each layer concatenates adjacent timesteps, halves temporal resolution:
Attention on 125-frame representation: \(30 \times 125 = 3.75\)K weights (8× reduction). Memory: 15KB per sample.
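A minimal sketch of the pyramidal reduction (concatenating adjacent frames; in the full model a BLSTM layer follows each reduction):

```python
import torch

def pyramid_reduce(x):
    """Concatenate adjacent timesteps, halving temporal resolution: [B, T, d] -> [B, T//2, 2d]."""
    B, T, d = x.shape
    x = x[:, : T - (T % 2)]                 # drop the last frame if T is odd
    return x.reshape(B, -1, 2 * d)

frames = torch.randn(4, 1000, 64)           # 1000 spectrogram frames per sample
for _ in range(3):                          # three reductions: 1000 -> 500 -> 250 -> 125
    frames = pyramid_reduce(frames)
print(frames.shape)                         # torch.Size([4, 125, 512])
```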

Google Voice Search task (>100K utterances):
GMM-HMM baseline: 23.6% WER
Deep Speech (CTC, no attention): 16.5% WER
Listen, Attend, Spell (pyramidal): 14.1% WER
Pyramidal encoder: -1.5% WER absolute (-9% relative) vs flat encoder at same parameter count. Computational cost identical. Gradient paths shortened 8× through temporal compression.
Abstractive summarization with fixed vocabulary (~50K words) fails on rare words: proper nouns, numbers, technical terms.
Standard decoder substitutes generic alternatives or UNK token:
Pointer-generator mechanism:
Generation probability at decoder step \(t\):
\[p_{\text{gen}} = \sigma(\mathbf{w}_c^\top \mathbf{c}_t + \mathbf{w}_s^\top \mathbf{s}_t + \mathbf{w}_y^\top \mathbf{y}_{t-1} + b)\]
Final distribution over extended vocabulary (original vocab ∪ source words):
\[P(w) = p_{\text{gen}} \cdot P_{\text{vocab}}(w) + (1 - p_{\text{gen}}) \sum_{i:w_i=w} \alpha_{ti}\]
Copy probability for word \(w\) sums attention weights \(\alpha_{ti}\) at all source positions where \(w_i = w\).
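A minimal sketch of the mixture over the extended vocabulary; the helper name, OOV handling, and sizes are illustrative rather than the published implementation:

```python
import torch

def pointer_generator_dist(p_vocab, attn, src_ids, p_gen, vocab_size, max_oov=10):
    """Mix generation and copy distributions over an extended vocabulary.
    p_vocab: [V] decoder vocab distribution, attn: [T_s] attention weights,
    src_ids: [T_s] source token ids in the extended vocab (V + OOV slots)."""
    extended = torch.zeros(vocab_size + max_oov)
    extended[:vocab_size] = p_gen * p_vocab                 # generate from vocabulary
    copy = (1 - p_gen) * attn
    extended.scatter_add_(0, src_ids, copy)                 # sum attention per source word
    return extended                                         # P(w) over extended vocab

V = 8
p_vocab = torch.softmax(torch.randn(V), dim=-1)
attn    = torch.softmax(torch.randn(5), dim=-1)
src_ids = torch.tensor([2, 3, 3, 8, 9])                     # ids 8, 9: source-only (OOV) words
dist = pointer_generator_dist(p_vocab, attn, src_ids, p_gen=0.7, vocab_size=V)
print(dist.sum())                                           # ≈ 1.0
```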

CNN/DailyMail dataset (287K article-summary pairs):
Abstractive baseline (seq2seq): 31.3 ROUGE-1, 11.8 ROUGE-2
Pointer-generator: 36.4 ROUGE-1, 15.7 ROUGE-2
Coverage mechanism added: 39.5 ROUGE-1, 17.3 ROUGE-2
Pointer mechanism: +5.1 ROUGE-1 over baseline. Coverage: +3.1 ROUGE-1.
Copy accuracy on rare proper nouns: 94% vs 31% baseline. Numbers: 89% vs 12%.
Standard attention: unidirectional query from decoder to encoder.
Reading comprehension requires bidirectional interaction between document and question.
Bi-Directional Attention Flow (BiDAF):
Document \(D = \{\mathbf{h}_1, ..., \mathbf{h}_m\}\), question \(Q = \{\mathbf{u}_1, ..., \mathbf{u}_n\}\)
Similarity matrix: \(\mathbf{S}_{ij} = \mathbf{w}^\top [\mathbf{h}_i; \mathbf{u}_j; \mathbf{h}_i \circ \mathbf{u}_j]\)
Context-to-Query (C2Q): For each document word, which question words most relevant?
\[\tilde{\mathbf{u}}_i = \sum_j \frac{\exp(\mathbf{S}_{ij})}{\sum_k \exp(\mathbf{S}_{ik})} \mathbf{u}_j\]
Query-to-Context (Q2C): Which document words most relevant to entire question?
\[\mathbf{b} = \max_j \mathbf{S}_{:j}, \quad \tilde{\mathbf{h}} = \sum_i \frac{\exp(\mathbf{b}_i)}{\sum_k \exp(\mathbf{b}_k)} \mathbf{h}_i\]
Combined representation: \(\mathbf{g}_i = [\mathbf{h}_i; \tilde{\mathbf{u}}_i; \mathbf{h}_i \circ \tilde{\mathbf{u}}_i; \mathbf{h}_i \circ \tilde{\mathbf{h}}]\)
Output layer predicts start and end positions of answer span.

SQuAD v1.1 benchmark (100K question-answer pairs):
Match-LSTM (single direction): 64.7% EM, 73.7% F1
BiDAF (bi-directional): 68.0% EM, 77.3% F1
BiDAF + character embeddings: 73.3% EM, 81.1% F1
Bi-directional flow: +3.3% EM, +3.6% F1. Character embeddings (handle OOV): +5.3% EM.
EM (exact match): predicted answer = gold answer exactly. F1: token-level overlap.
Standard attention computes single weighted sum:
\[\mathbf{c}_t = \sum_{j=1}^{T_s} \alpha_{tj} \mathbf{h}_j\]
One lookup operation per decoder step.
Example question answering:
Context: “John went to the store. The store was closed. John went home.”
Question: “Where did John go after the store was closed?”
Two-step reasoning required:
Single attention step addresses both facts simultaneously. Attention weights distribute across multiple positions: entropy H(α) = 2.3 bits (nearly uniform over 8 tokens) vs focused attention H(α) = 0.8 bits.

Multi-hop attention performs sequential lookups, each conditioned on previous results.
Stack \(K\) attention operations, each querying same memory with refined context:
\[\mathbf{m}_1 = \text{Attention}(\mathbf{q}, \mathbf{M})\] \[\mathbf{m}_2 = \text{Attention}(\mathbf{m}_1, \mathbf{M})\] \[\vdots\] \[\mathbf{m}_K = \text{Attention}(\mathbf{m}_{K-1}, \mathbf{M})\]
Memory \(\mathbf{M} = [\mathbf{h}_1, ..., \mathbf{h}_{T_s}]\) remains fixed (encoder states).
Hop 1: Initial query \(\mathbf{q}\) retrieves \(\mathbf{m}_1\)
Hop 2: Use \(\mathbf{m}_1\) as query, retrieves \(\mathbf{m}_2\)
Hop \(K\): Final representation \(\mathbf{m}_K\) feeds output layer
Each hop conditions on previous result, refining attention focus.
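A minimal sketch of \(K\) stacked hops over a fixed memory; published memory networks additionally use separate input/output embeddings and add the hop output to the query, which this sketch omits:

```python
import torch
import torch.nn.functional as F

def attention_hop(query, M):
    """Single hop: soft lookup over fixed memory M [T_s, d] with query [d]."""
    weights = F.softmax(M @ query / M.size(-1) ** 0.5, dim=-1)
    return weights @ M

def multi_hop(query, M, K=3):
    """K sequential hops; each hop's output becomes the next hop's query."""
    m = query
    for _ in range(K):
        m = attention_hop(m, M)
    return m

M = torch.randn(12, 32)             # fixed memory (e.g., encoder states)
q = torch.randn(32)
print(multi_hop(q, M, K=3).shape)   # torch.Size([32])
```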
How is \(K\) determined?
\(K\) = fixed hyperparameter set during architecture design, not learned.
Typical values:
Trade-off:
No mechanism to adaptively choose \(K\) per example. All examples use same fixed hop count.

Example: Memory Networks (Weston et al., 2015)
Question answering over stories. Each hop retrieves relevant sentence, next hop uses that context to retrieve next relevant sentence.
Complexity: \(O(K \times T_s \times d)\) for \(K\) hops. Linear in hop count.
bAbI QA tasks:
Full attention: \(O(T^2)\) memory and computation.
\(T = 1000\): 1M weights per layer. 12 layers: 12M weights total.
GPU memory (16GB): limits sequence length to ~2048 tokens at batch size 8.
Sparse attention: Each position attends to subset of positions, not all \(T\) positions.
Local window (radius \(r\)):
\[\alpha_{ij} = \begin{cases} \text{softmax}(e_{ij}) & \text{if } |i - j| \leq r \\ 0 & \text{otherwise} \end{cases}\]
Complexity: \(O(T \times r)\) instead of \(O(T^2)\)
Example: \(r = 5\) → each position attends to 11 neighbors (self + 5 left + 5 right).
Strided (every \(k\)-th position):
\[\alpha_{ij} \neq 0 \text{ only if } j \equiv 0 \pmod{k}\]
Complexity: \(O(T \times T/k)\)
Example: \(k = 8\) → attend to T/8 positions.
When to use strided:
Captures long-range dependencies that local window misses. Position 100 can attend to positions 0, 8, 16, …, 96 (distance up to 100), while local window (\(r=5\)) only sees positions 95-105.
Use case: Document-level coherence, paragraph structure, long-range coreference.
Limitation: Misses nearby context. Fixed stride may miss relevant intermediate positions.
Combined: Local + strided in different attention heads.
Local head: nearby context (syntax, adjacent words)
Strided head: long-range dependencies (discourse, document structure)
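A minimal sketch that builds a combined local + strided mask and applies it before the softmax (here both patterns share one mask, rather than separate heads as described above):

```python
import torch

def local_strided_mask(T, radius=5, stride=8):
    """Boolean [T, T] mask: True where attention is allowed.
    Combines a local window (|i - j| <= radius) with strided positions (j % stride == 0)."""
    i = torch.arange(T).unsqueeze(1)
    j = torch.arange(T).unsqueeze(0)
    local   = (i - j).abs() <= radius
    strided = (j % stride) == 0
    return local | strided

mask = local_strided_mask(T=512)
scores = torch.randn(512, 512)
scores = scores.masked_fill(~mask, float("-inf"))   # disallowed positions get zero weight
weights = torch.softmax(scores, dim=-1)
print(mask.float().mean())   # fraction of allowed pairs (≈14% here vs 100% for full attention)
```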

WMT’14 En→De translation with sparse patterns:
Full attention (\(T = 512\)): 27.3 BLEU, 2048 MB memory
Local (\(r = 5\), complexity \(O(10T)\)): 25.3 BLEU (-2.0), 410 MB memory
Strided (\(k = 8\), complexity \(O(T^2/8)\)): 26.3 BLEU (-1.0), 256 MB memory
Combined (local + strided, \(O(15T)\)): 26.8 BLEU (-0.5), 615 MB memory
Combined pattern: 5× memory reduction for -0.5 BLEU. Enables 4× longer sequences at same memory budget.
Hidden layer activations in standard networks: distributed representations without explicit alignment structure.
Attention weights \(\alpha_{tj}\) assign explicit scores to each source position for every target position.
Heatmap visualization: rows = target positions, columns = source positions, intensity = \(\alpha_{tj}\).
High weight (\(\alpha_{tj} > 0.5\)): strong correspondence between source \(j\) and target \(t\).

Applications:
Assumption: \(\alpha_{tj} = 0.8\) → token \(j\) causally important for output \(t\).
Empirical test (Jain & Wallace, 2019): Correlation between attention weights and gradient-based importance.
Sentiment classification experiment:
Counterfactual test:
Attention weight 0.40 on “boring” contradicts low causal importance. Attention weight 0.05 on “not” contradicts high causal importance.
Mechanism behind mismatch:
Attention computes context, then output layer transforms:
\[\mathbf{c}_t = \sum_j \alpha_{tj} \mathbf{h}_j\] \[\mathbf{y}_t = \text{softmax}(W \mathbf{c}_t + b)\]
Matrix \(W\) re-weights features from context \(\mathbf{c}_t\). High-attention features down-weighted by small \(W\) entries. Low-attention features amplified by large \(W\) entries.
Attention indicates retrieval weights, not decision weights.

Attention weights serve as hypothesis generator for inspection. Validate causal importance via ablation or gradient methods.
Attention weights \(\alpha_{tj}\) differentiable → amenable to supervision or constraints.
Supervised attention:
Add alignment supervision loss:
\[\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \lambda \mathcal{L}_{\text{attention}}\]
\[\mathcal{L}_{\text{attention}} = \sum_{t,j} (\alpha_{tj} - \alpha^*_{tj})^2\]
\(\alpha^*_{tj}\) from gold alignments (manual annotation or statistical aligner).
Hard monotonic constraint:
Speech recognition requires forward-only attention. Mask prevents backward attention:
\[\alpha_{t,j} = 0 \quad \text{if } j < j_{t-1}\]
Position \(j_{t-1}\): highest-weight position at step \(t-1\). Forces \(j_t \geq j_{t-1}\).
Attention prior (soft constraint):
Bias scores toward expected diagonal alignment:
\[e_{tj} = \text{score}(\mathbf{s}_{t-1}, \mathbf{h}_j) + \text{prior}(t, j)\]
Gaussian prior centered on diagonal: \[\text{prior}(t, j) = \exp\left(-\frac{(j - t \cdot T_s/T_t)^2}{2\sigma^2}\right)\]
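A minimal sketch of adding this Gaussian prior to a score matrix before the softmax (\(\sigma\) and the prior weight are illustrative hyperparameters):

```python
import torch
import torch.nn.functional as F

def diagonal_prior_scores(scores, T_s, T_t, sigma=2.0, weight=1.0):
    """Add a Gaussian prior centered on the diagonal alignment j ≈ t * T_s / T_t.
    scores: [T_t, T_s] raw attention scores."""
    t = torch.arange(T_t).unsqueeze(1).float()
    j = torch.arange(T_s).unsqueeze(0).float()
    prior = torch.exp(-((j - t * T_s / T_t) ** 2) / (2 * sigma ** 2))
    return scores + weight * prior

scores = torch.randn(6, 10)
weights = F.softmax(diagonal_prior_scores(scores, T_s=10, T_t=6), dim=-1)
print(weights.argmax(dim=-1))   # peaks tend to follow the diagonal
```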

Applications:
Constraints improve sample efficiency at cost of model flexibility.
Monotonic speech recognition:
Traditional memory: Hard indexing by fixed address
```
memory = [value_0, value_1, value_2, ..., value_n]
output = memory[address]  # output = value_address
```
Attention: Soft addressing by content similarity
Memory \(\mathbf{M} = [\mathbf{h}_1, ..., \mathbf{h}_{T_s}]\) (encoder states)
Query \(\mathbf{q}\) specifies retrieval pattern by content
Output weighted combination:
\[\text{output} = \sum_{j=1}^{T_s} \alpha_j \mathbf{h}_j\]
\[\alpha_j = \text{softmax}(\text{similarity}(\mathbf{q}, \mathbf{h}_j))\]
Soft weights \(\alpha_j\) differentiable → gradients flow through retrieval operation.

Differentiable: Gradients backpropagate through \(\alpha_j\).
Robust: Query errors degrade gracefully (distribute weight across neighbors).
Content-based: Similarity function retrieves by semantic match, not position.
Learnable: Query generation and similarity function trained end-to-end.
NTM (Graves et al., 2014) extends read-only attention to read/write external memory.
Architecture:
Read operation:
\[\mathbf{r}_t = \sum_{i=1}^{N} \alpha^r_{ti} \mathbf{M}_{i}\]
where \(\alpha^r_{ti} = \text{softmax}(\text{similarity}(\mathbf{k}^r_t, \mathbf{M}_i))\)
\(\mathbf{k}^r_t\): read key from controller
Write operation:
\[\mathbf{M}_i \leftarrow \mathbf{M}_i \odot (1 - \alpha^w_{ti} \mathbf{e}_t) + \alpha^w_{ti} \mathbf{a}_t\]
\(\alpha^w_{ti}\): write attention weights
\(\mathbf{e}_t\): erase vector (what to remove)
\(\mathbf{a}_t\): add vector (what to write)
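A minimal NumPy sketch of one read and one write, with toy shapes and simple dot-product content addressing standing in for the full NTM addressing mechanism:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(0)
N, d = 8, 4
M = rng.normal(size=(N, d))            # memory matrix: N slots of width d

# Read: content-based weights from the read key, then a weighted sum of slots.
k_read = rng.normal(size=d)
alpha_r = softmax(M @ k_read)
r = alpha_r @ M                        # read vector r_t

# Write: erase then add, gated by the write weights.
k_write = rng.normal(size=d)
alpha_w = softmax(M @ k_write)
e = rng.uniform(size=d)                # erase vector (what to remove), entries in [0, 1]
a = rng.normal(size=d)                 # add vector (what to write)
M = M * (1 - np.outer(alpha_w, e)) + np.outer(alpha_w, a)
```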
NTM learns algorithmic tasks from input-output examples:
Copy: [3, 1, 4, 1, 5] → [3, 1, 4, 1, 5]
Reverse: [3, 1, 4, 1, 5] → [5, 1, 4, 1, 3]
Sort: [3, 1, 4, 1, 5] → [1, 1, 3, 4, 5]
Associative recall: (key, value) pairs → retrieve value given key
No explicit algorithm implementation. Attention patterns encode algorithm.
Copy task execution:
Phase 1 (write): Sequential sharp attention writes input tokens to memory slots 1→N.
Phase 2 (read): Sequential sharp attention reads from same slots 1→N in order.

Limitations:
Training unstable: many local minima in attention weight space.
Poor scaling: N > 100 memory slots → optimization difficulty.
Sharpness critical: soft weights \(\alpha_{ti} \approx 0.2\) across 5 slots → read/write interference.
Memory Networks (Weston et al., 2015): Fixed memory (no writes), multi-hop attention. Better QA performance, easier training.
Modern architectures separate keys (similarity) from values (content):
\[\mathbf{c} = \sum_j \text{softmax}(\mathbf{q}^\top \mathbf{k}_j) \mathbf{v}_j\]
Keys \(\{\mathbf{k}_j\}\): Compute attention scores via \(\mathbf{q}^\top \mathbf{k}_j\)
Values \(\{\mathbf{v}_j\}\): Retrieved content in weighted sum
Query \(\mathbf{q}\): Specifies retrieval pattern
Seq2seq attention uses tied keys/values: \(\mathbf{k}_j = \mathbf{v}_j = \mathbf{h}_j\).
Separation enables asymmetry: key dimensionality optimized for similarity, value dimensionality for information content.
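A minimal sketch of key-value attention with separate key and value projections (all dimensions here are illustrative):

```python
import torch

torch.manual_seed(0)
T_s, d_h, d_k, d_v = 6, 32, 16, 64
H = torch.randn(T_s, d_h)              # encoder states h_j (tied keys/values in seq2seq)
W_k = torch.randn(d_h, d_k)            # key projection
W_v = torch.randn(d_h, d_v)            # value projection
q = torch.randn(d_k)                   # query

K, V = H @ W_k, H @ W_v                # keys for scoring, values for content
alpha = torch.softmax(K @ q, dim=0)    # softmax over q^T k_j
c = alpha @ V                          # retrieved content has dimension d_v, not d_h
```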
Retrieval-augmented generation:
External memory: database, knowledge base, or web corpus (millions to billions of documents).
Query embedding \(\mathbf{q}\) retrieves top-K documents:
\[\text{docs} = \text{TopK}(\text{similarity}(\mathbf{q}, \text{all documents}))\]
Retrieved documents concatenated with input, fed to language model.
Parametric knowledge (weights) + non-parametric knowledge (retrieved text).

External memory scales to billions of documents (Wikipedia: 6M articles, Common Crawl: 250B pages).
Similarity search via approximate nearest neighbors (FAISS, ScaNN): sub-linear time \(O(\log N)\) instead of \(O(N)\).
Modern systems: GPT-4 + web search, Claude + retrieval, medical QA with PubMed (35M abstracts).
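A minimal sketch of the retrieval step with a toy corpus and exact search; production systems replace the exact dot-product scan below with an approximate nearest-neighbor index such as FAISS or ScaNN to obtain sub-linear query time:

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, d, K = 10_000, 128, 5
doc_emb = rng.normal(size=(n_docs, d))       # precomputed document embeddings
q = rng.normal(size=d)                       # query embedding

scores = doc_emb @ q                         # similarity(q, all documents), O(N) here
topk = np.argpartition(-scores, K)[:K]       # unordered top-K indices
topk = topk[np.argsort(-scores[topk])]       # sorted by score: the retrieved documents
```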
Attention eliminates fixed context bottleneck via dynamic access to all encoder states.
Recurrent encoder/decoder still sequential:
\[\mathbf{h}_t = f_{\text{enc}}(\mathbf{h}_{t-1}, \mathbf{x}_t; \theta)\] \[\mathbf{s}_t = f_{\text{dec}}(\mathbf{s}_{t-1}, \mathbf{y}_{t-1}, \mathbf{c}_t; \phi)\]
Sequential dependency limits:
Parallelization: timestep \(t\) cannot compute until \(t-1\) completes.
Gradient paths: BPTT through \(T\) steps still exhibits exponential decay.
Training speed: RNN computation dominates total time (98% of 249ms per batch).
Quadratic attention complexity:
Attention matrix: \(O(T_s \times T_t)\) memory
Translation (\(T_s = T_t = 50\)): 2,500 weights in total, about 50 per decoder step (manageable)
Document summarization (\(T_s = 1000\)): 1M weights (approaching limits)
Speech recognition (\(T_s = 10000\), 1 minute of audio): 100M weights; with batching and the activations stored for backpropagation, this can exceed a 16GB GPU

Quantified limitations: training time (50-token sequences, batch size 32) and memory footprint (single sample, no batching).
RNN state \(\mathbf{h}_t\) implicitly encodes position \(t\) via sequential processing:
\[\mathbf{h}_t = f(\mathbf{h}_{t-1}, \mathbf{x}_t)\]
State \(\mathbf{h}_t\) accumulated information from positions 1 through \(t\).
Attention weighted sum permutation-invariant:
\[\mathbf{c} = \sum_j \alpha_j \mathbf{h}_j\]
Without positional information in \(\mathbf{h}_j\), sum operation identical for any permutation of inputs.
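A quick check of this permutation invariance (toy tensors; attention scores computed purely from content):

```python
import torch

torch.manual_seed(0)
T_s, d = 5, 8
H = torch.randn(T_s, d)                        # encoder states with no position info
q = torch.randn(d)

def attend(H, q):
    alpha = torch.softmax(H @ q, dim=0)        # weights depend only on content
    return alpha @ H

perm = torch.randperm(T_s)
print(torch.allclose(attend(H, q), attend(H[perm], q)))   # True: order is invisible
```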
Current seq2seq architecture:
RNN encoder processes sequentially → embeds position in \(\mathbf{h}_j\) → attention inherits positional information.
RNN-free architecture problem:
Replacing RNN with non-sequential encoder (e.g., stacked attention) loses position information.
“dog bites man” vs “man bites dog” produce identical bag-of-embeddings representations.
Explicit position encoding approaches:
Learned embeddings: \[\mathbf{x}'_t = \mathbf{x}_t + \mathbf{p}_t\] \(\mathbf{p}_t \in \mathbb{R}^d\) trained as parameters, one per position.
Sinusoidal (fixed): \[\text{PE}(t, 2i) = \sin(t / 10000^{2i/d})\] \[\text{PE}(t, 2i+1) = \cos(t / 10000^{2i/d})\] Deterministic encoding, generalizes to unseen lengths.
Relative encoding: Encode offset \(i - j\) instead of absolute positions \(i, j\). Captures relational information.

Without position encoding, “the cat sat” and “sat cat the” indistinguishable to pure attention.
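A minimal sketch of the sinusoidal encoding defined above (assumes an even \(d\); the result is added to the token embeddings as \(\mathbf{x}_t + \mathbf{p}_t\)):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Returns PE of shape (max_len, d_model); assumes d_model is even."""
    pe = np.zeros((max_len, d_model))
    pos = np.arange(max_len)[:, None]                  # position t
    dim = np.arange(0, d_model, 2)[None, :]            # even dimension index 2i
    angle = pos / (10000 ** (dim / d_model))           # t / 10000^{2i/d}
    pe[:, 0::2] = np.sin(angle)                        # PE(t, 2i)
    pe[:, 1::2] = np.cos(angle)                        # PE(t, 2i+1)
    return pe

PE = sinusoidal_pe(max_len=50, d_model=64)             # added to embeddings: x_t + p_t
```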
Attention weights \(\alpha_{tj}\) optimized for translation loss, not alignment quality.
Translation-optimal attention may differ from linguistically-interpretable alignment.
Idiom example:
Linguistically correct: “kicked the bucket” → “died” (entire phrase maps to single word)
Model attention for “died”: “kicked” (0.15), “the” (0.40), “bucket” (0.20)
Translation correct but alignment non-interpretable (high weight on “the”).
Supervised alignment loss:
Gold alignments \(\boldsymbol{\alpha}^*\) from manual annotation:
\[\mathcal{L} = \mathcal{L}_{\text{translation}} + \lambda \|\boldsymbol{\alpha} - \boldsymbol{\alpha}^*\|^2\]
Results:
Alignment accuracy: +10-15% (measured against gold alignments)
Translation quality: ±0.2 BLEU (negligible change)
Cost: Requires expensive manual alignment annotations
Application: Medical/legal domains where interpretability critical.
Core achievement: Bottleneck eliminated
Seq2seq: Fixed \(\mathbf{c} \in \mathbb{R}^h\), capacity \(16,384\) bits regardless of \(T_s\)
Attention: Decoder accesses all \(T_s\) encoder states \(\{\mathbf{h}_1, ..., \mathbf{h}_{T_s}\}\)
Capacity scales: \(O(T_s \times h)\) dimensions available
WMT’14 En→De translation gains:
| Length (tokens) | Seq2seq | +Attention | Improvement |
|---|---|---|---|
| 20-30 | 32.8 | 34.1 | +1.3 |
| 50-60 | 19.8 | 27.7 | +7.9 |
| 60+ | 12.3 | 23.0 | +10.7 |
Gains increase with length, confirming that the bottleneck is resolved.
Gradient flow: Direct paths to all encoder states
Seq2seq: \(\frac{\partial \mathcal{L}}{\partial \mathbf{h}_j}\) through \(T_s + T_t\) sequential steps
Attention: \(\frac{\partial \mathcal{L}}{\partial \mathbf{h}_j} = \sum_t \alpha_{tj} \frac{\partial \mathcal{L}}{\partial \mathbf{c}_t}\)
Path length: 1 hop (direct connection)
All encoder positions receive gradient signal, not just recent positions.
Architectural constraint: RNN remains bottleneck
Attention provides dynamic weighting.
Sequence processing still requires RNN:
\[\mathbf{h}_t = f_{\text{enc}}(\mathbf{h}_{t-1}, \mathbf{x}_t)\] \[\mathbf{s}_t = f_{\text{dec}}(\mathbf{s}_{t-1}, \mathbf{y}_{t-1}, \mathbf{c}_t)\]
Sequential dependency prevents parallelization.
Training time breakdown (50-token batch):
RNN dominates: 98% of computation time
Quadratic attention cost manageable for translation
Attention matrix: \(T_s \times T_t\) weights
Translation (\(T_s, T_t \approx 50\)): 2,500 weights (10KB memory)
Long documents (\(T_s = 2048\)): 4M weights (16MB memory)
Books (\(T = 8192\)): 67M weights (256MB memory)
Quadratic growth limits very long sequences.
The RNN solved gradient vanishing but introduced a sequential bottleneck. Attention solved the information bottleneck but still requires an RNN for sequence processing. Next step: replace the RNN with a parallelizable architecture (self-attention stacks + explicit positional encoding = Transformers).