Attention Mechanisms

EE 641 - Unit 5

Dr. Brandon Franzke

Fall 2025

Introduction

Outline

Problem & Solution

The Bottleneck Problem

  • Information-theoretic analysis
  • Fixed context capacity
  • Performance degradation with length
  • Gradient vanishing: \(\gamma^{T_{src}+T_{tgt}}\)

Attention Mechanism

  • Soft selection via weighted sum
  • Dynamic context per timestep
  • Alignment emerges without supervision

Score Functions

  • Additive (Bahdanau)
  • Multiplicative (Luong)
  • Scaled dot-product (Vaswani)

Implementation & Extensions

Efficient Computation

  • Query-key-value formulation
  • Batched matrix operations
  • Numerical stability

Training & Failure Modes

  • Coverage mechanism
  • Over-concentration detection
  • Alignment learning dynamics

Applications

  • Spatial attention (vision)
  • Pyramidal encoding (speech)
  • Pointer-generator (summarization)
  • BiDAF (reading comprehension)

Advanced Topics

  • Multi-hop reasoning
  • Sparse attention patterns

Reading List

The Bottleneck Problem

Seq2Seq Encoder-Decoder Architecture

Encoder RNN: \[\mathbf{h}_t = f(\mathbf{h}_{t-1}, \mathbf{x}_t; \theta_{enc})\]

Processes source sequence token-by-token:

  • Input: \(\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_{T_s}\)
  • Hidden states: \(\mathbf{h}_1, \mathbf{h}_2, ..., \mathbf{h}_{T_s}\)
  • Each \(\mathbf{h}_t \in \mathbb{R}^h\) captures history up to position \(t\)

Context extraction: \[\mathbf{c} = \mathbf{h}_{T_s}\]

Final hidden state becomes context for decoder.

Decoder RNN: \[\mathbf{s}_t = g(\mathbf{s}_{t-1}, \mathbf{y}_{t-1}, \mathbf{c}; \theta_{dec})\]

Generates target sequence conditioned on context:

  • Initialize: \(\mathbf{s}_0 = \mathbf{c}\)
  • Output: \(P(\mathbf{y}_t | \mathbf{y}_{<t}, \mathbf{c})\)

Context vector \(\mathbf{c}\) must encode entire source sequence for decoder.

Information-Theoretic Quantities for Compression

Entropy measures information content: \[H(X) = -\sum_x p(x) \log_2 p(x) \text{ [bits]}\]

For continuous \(\mathbf{X} \in \mathbb{R}^d\): \[H(\mathbf{X}) \leq d \times b \text{ bits}\]

where \(b\) = bits per dimension (32 for float32).

Mutual information quantifies shared information: \[I(X; Y) = H(X) + H(Y) - H(X, Y)\]

Measures how much knowing \(Y\) reduces uncertainty about \(X\).

Data processing inequality:

For Markov chain \(X \rightarrow Y \rightarrow Z\): \[I(X; Z) \leq I(X; Y)\]

Processing cannot create information.

Conditional entropy: \[H(X|Y) = H(X,Y) - H(Y)\]

Information in \(X\) not explained by \(Y\).

For seq2seq: \[H(\mathbf{X}|\mathbf{c}) = H(\mathbf{X}, \mathbf{c}) - H(\mathbf{c})\]

Measures how much source information is lost in context.

Key relationships:

\[I(\mathbf{X}; \mathbf{c}) = H(\mathbf{X}) - H(\mathbf{X}|\mathbf{c})\]

Information preserved = Total info - Lost info

Bottleneck bound:

\[I(\mathbf{X}; \mathbf{Y}) \leq I(\mathbf{X}; \mathbf{c}) \leq H(\mathbf{c}) \leq h \times 32 \text{ bits}\]

Target reconstruction limited by context capacity.

Concrete bound for seq2seq:

  • Source: \(H(\mathbf{X}) \approx T_s \times 200\) bits (roughly 200 bits/token)
  • Context: \(H(\mathbf{c}) \leq 512 \times 32 = 16,384\) bits
  • For \(T_s = 50\): \(H(\mathbf{X}) \approx 10,000\) bits
  • Maximum preserved: \(\min(10000, 16384) = 10,000\) bits
  • But \(H(\mathbf{X}|\mathbf{c}) > 0\) due to encoding limitations
  • Actual preserved: 40-60% for RNN encoder

Application to seq2seq: \(\mathbf{X}\) (source) → \(\mathbf{c}\) (context) → \(\mathbf{Y}\) (target)

Channel Capacity and Rate-Distortion Theory

Channel capacity (Shannon, 1948):

Maximum rate of reliable information transmission: \[C = \max_{p(x)} I(X; Y) \text{ [bits/transmission]}\]

For fixed-dimensional channel \(\mathbf{c} \in \mathbb{R}^h\): \[C \leq H(\mathbf{c}) \leq h \times 32 \text{ bits}\]

Rate-distortion tradeoff:

Cannot perfectly reconstruct \(X\) from compressed \(Y\) when \(H(Y) < H(X)\).

Minimum distortion at rate \(R\): \[D(R) = \min_{p(y|x): I(X;Y) \leq R} \mathbb{E}[d(X, \hat{X})]\]

where \(\hat{X}\) is reconstruction from \(Y\).

Implications:

  • Fixed \(h\) → fixed capacity
  • Longer sequences → more distortion
  • No encoder can overcome capacity limit

Seq2seq operates in high-distortion regime for long sequences.

Information Flow in Sequence-to-Sequence Models

Source sequence: \[\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_{T_s}]\]

  • \(T_s\) tokens, each \(\mathbf{x}_t \in \mathbb{R}^{d_{emb}}\)
  • Total information: \(T_s \times d_{emb}\) dimensions

Context vector: \[\mathbf{c} = \mathbf{h}_{T_s} \in \mathbb{R}^{h}\]

  • Single fixed-size vector
  • Capacity: \(h\) dimensions

Target sequence: \[\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, ..., \mathbf{y}_{T_t}]\]

  • Generated from context alone
  • Must encode all source information

The fundamental problem: When \(T_s \times d_{emb} \gg h\), information must be discarded.

Mutual Information Analysis

The data processing inequality quantifies the bottleneck:

\[I(\mathbf{X}; \mathbf{Y}) \leq I(\mathbf{X}; \mathbf{c}) \leq H(\mathbf{c})\]

where:

  • \(I(\mathbf{X}; \mathbf{Y})\): Mutual information between source and target
  • \(H(\mathbf{c})\): Entropy of context vector
  • Markov chain: \(\mathbf{X} \rightarrow \mathbf{c} \rightarrow \mathbf{Y}\)

Capacity bound:

For \(\mathbf{c} \in \mathbb{R}^h\) with 32-bit floats: \[H(\mathbf{c}) \leq h \times 32 \text{ bits}\]

Concrete example:

  • Source: 50 tokens × 512 dims = 25,600 dimensions
  • Context: \(h\) = 1024 dimensions
  • Compression ratio: 25:1
  • Information preserved: \(\leq\) 4% (assuming uniform)

Implication: No decoder can recover information lost at the bottleneck.

BLEU Score Quantifies Translation Quality

BiLingual Evaluation Understudy (BLEU):

Measures n-gram overlap between machine translation and human reference(s).

Computation:

\[\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)\]

where:

  • \(p_n\): precision of n-grams (typically \(N=4\))
  • \(w_n = 1/N\): uniform weights
  • \(BP\): brevity penalty for short outputs

Typical n-gram precision:

\[p_n = \frac{\text{\# matching n-grams}}{\text{\# candidate n-grams}}\]

Brevity penalty:

\[BP = \begin{cases} 1 & c > r \\ e^{1-r/c} & c \leq r \end{cases}\]

where \(c\) = candidate length, \(r\) = reference length.

Score range: 0-100 (percentage scale)

Interpretation guidelines:

BLEU Score Quality Assessment
<10 Almost useless
10-20 Hard to get gist
20-30 Understandable, many errors
30-40 Good, some errors
40-50 Very good
50-60 Excellent
>60 Near human quality

WMT’14 En→Fr benchmarks:

  • Human parity: ~40-45 BLEU
  • Best 2014 systems: 35-37 BLEU
  • Statistical MT baseline: 30-33 BLEU

Properties:

  • Precision-focused: Rewards matching reference n-grams
  • Multiple references help: Average over references if available
  • Not perfect: Misses semantic equivalence, ignores word order beyond n-grams
  • Widely used: Standard metric despite limitations

What differences mean:

  • +1 BLEU: Noticeable improvement
  • +5 BLEU: Major improvement
  • +10 BLEU: Transformative difference

BLEU provides automated, reproducible quality metric. Correlates with human judgment (\(r \approx 0.6-0.7\)) at corpus level.
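
As a numerical check of the formula above, a minimal sketch that combines already-computed n-gram precisions with the brevity penalty (the function name and example precisions are illustrative; this is not a full BLEU implementation such as sacrebleu):

import math

def bleu_from_precisions(p, c, r):
    """Combine n-gram precisions p = [p_1, ..., p_N] with the brevity penalty.

    c = candidate length, r = reference length, uniform weights w_n = 1/N.
    """
    if any(p_n == 0 for p_n in p):
        return 0.0                                   # log(0): score collapses to 0
    bp = 1.0 if c > r else math.exp(1 - r / c)       # brevity penalty
    log_avg = sum(math.log(p_n) for p_n in p) / len(p)
    return 100 * bp * math.exp(log_avg)              # 0-100 scale

# Illustrative 1- to 4-gram precisions, candidate slightly shorter than reference
print(round(bleu_from_precisions([0.6, 0.4, 0.3, 0.2], c=18, r=20), 1))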

Empirical Evidence of Performance Degradation

WMT’14 English→French Results:

Seq2seq performance vs sequence length:

Length BLEU ∆ from baseline
10-20 35.2
20-30 32.8 -2.4
30-40 29.5 -5.7
40-50 25.1 -10.1
50-60 19.8 -15.4
60+ 12.3 -22.9

Degradation rate: ~0.5 BLEU per token

Root cause: Fixed context vector cannot scale with input length.

Where Information Is Lost

Recency bias in RNN encoding:

The final hidden state \(\mathbf{h}_{T_s}\) is dominated by recent inputs:

\[\mathbf{h}_t = \tanh(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t)\]

Contribution decay:

Influence of \(\mathbf{x}_k\) on \(\mathbf{h}_T\) decays exponentially:

\[\text{Influence}(\mathbf{x}_k \rightarrow \mathbf{h}_T) \propto \|\mathbf{W}_{hh}\|^{T-k}\]

For typical \(\|\mathbf{W}_{hh}\| \approx 0.9\):

  • Token 45 → final state: \(0.9^5 = 0.59\) (moderate)
  • Token 30 → final state: \(0.9^{20} = 0.12\) (weak)
  • Token 10 → final state: \(0.9^{40} = 0.015\) (negligible)

Result: Early tokens effectively erased from context.

Bidirectional encoding helps but doesn’t solve the problem: Still compresses to fixed \(h\) dimensions.

Backpropagation Through Sequential Computation

Chain rule for sequential dependencies:

For \(y = f_n(f_{n-1}(...f_1(x)))\):

\[\frac{\partial y}{\partial x} = \frac{\partial f_n}{\partial f_{n-1}} \cdot \frac{\partial f_{n-1}}{\partial f_{n-2}} \cdots \frac{\partial f_1}{\partial x}\]

Product of Jacobians:

\[\frac{\partial \mathbf{h}_T}{\partial \mathbf{h}_1} = \prod_{t=2}^T \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}}\]

Each \(\frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}} \in \mathbb{R}^{h \times h}\) is a Jacobian matrix.

For vanilla RNN:

\[\mathbf{h}_t = \tanh(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t + \mathbf{b})\]

\[\frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}} = \text{diag}(\tanh'(\mathbf{z}_{t-1})) \cdot \mathbf{W}_{hh}\]

where \(\mathbf{z}_t\) is the pre-activation.

Gradient magnitude behavior:

\[\left\| \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}} \right\| \leq \|\mathbf{W}_{hh}\| \cdot \max_i |\tanh'(z_i)|\]

Since \(|\tanh'(z)| \leq 1\) and typically \(\leq 0.25\) for saturated activations:

\[\left\| \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}} \right\| \leq \|\mathbf{W}_{hh}\|\]

Over T steps:

\[\left\| \frac{\partial \mathbf{h}_T}{\partial \mathbf{h}_1} \right\| \leq \|\mathbf{W}_{hh}\|^{T-1} \cdot \prod_{t=2}^T |\tanh'(z_t)|\]

Typical case:

  • \(\|\mathbf{W}_{hh}\| \approx 0.9\) (stable training)
  • \(|\tanh'| \approx 0.5\) (average activation)
  • Combined: \(\gamma \approx 0.45\) per step

After 50 steps: \(0.45^{50} \approx 10^{-18}\) (gradient vanishes)
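
A quick numerical check of these decay factors (values rounded):

for gamma in (0.45, 0.9):                     # vanilla-RNN-like vs LSTM-like per-step factor
    for T in (10, 50, 100):
        print(f"gamma={gamma:4.2f}  T={T:3d}  gamma^T = {gamma**T:.1e}")
# gamma=0.45, T=50  ->  ~5e-18  (vanishes)
# gamma=0.90, T=50  ->  ~5e-03  (small but usable)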

Gradient Paths in Seq2Seq Architecture

Complete gradient path from loss to source:

\[\frac{\partial \mathcal{L}}{\partial \mathbf{x}_1} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}_T} \cdot \frac{\partial \mathbf{y}_T}{\partial \mathbf{s}_T} \cdot \frac{\partial \mathbf{s}_T}{\partial \mathbf{c}} \cdot \frac{\partial \mathbf{c}}{\partial \mathbf{h}_{T_s}} \cdot \frac{\partial \mathbf{h}_{T_s}}{\partial \mathbf{h}_1} \cdot \frac{\partial \mathbf{h}_1}{\partial \mathbf{x}_1}\]

Path length: \(T_{tgt}\) (decoder) + 1 (context) + \(T_{src}\) (encoder)

Gradient magnitude decay:

Each term contributes multiplicative factor:

  1. Decoder path: \(\prod_{t=1}^{T_{tgt}} \left\|\frac{\partial \mathbf{s}_t}{\partial \mathbf{s}_{t-1}}\right\| \approx \gamma_{dec}^{T_{tgt}}\)

  2. Context bottleneck: \(\left\|\frac{\partial \mathbf{s}_1}{\partial \mathbf{c}}\right\| \approx 1\) (initialization)

  3. Encoder path: \(\prod_{t=2}^{T_{src}} \left\|\frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}}\right\| \approx \gamma_{enc}^{T_{src}}\)

Combined: \[\left\|\frac{\partial \mathcal{L}}{\partial \mathbf{x}_1}\right\| \approx \gamma_{dec}^{T_{tgt}} \cdot \gamma_{enc}^{T_{src}} \approx \gamma^{T_{src} + T_{tgt}}\]

Problem: Single long path from loss to early source tokens.

Quantifying Gradient Vanishing in Seq2Seq

Concrete example:

Translation: 50-token source → 50-token target

Gradient to first source token:

Path length: \(T_{src} + T_{tgt} = 100\) steps

With \(\gamma = 0.45\) per step: \[\left\|\frac{\partial \mathcal{L}}{\partial \mathbf{x}_1}\right\| \approx 0.45^{100} \approx 10^{-35}\]

Gradient to middle token (\(t=25\)):

Path length: \(25 + 50 = 75\) steps \[\left\|\frac{\partial \mathcal{L}}{\partial \mathbf{x}_{25}}\right\| \approx 0.45^{75} \approx 10^{-26}\]

Gradient to last token:

Path length: \(0 + 50 = 50\) steps \[\left\|\frac{\partial \mathcal{L}}{\partial \mathbf{x}_{50}}\right\| \approx 0.45^{50} \approx 10^{-18}\]

All fall far below float32 machine precision (\(\epsilon \approx 10^{-7}\)); relative to other gradient contributions they are effectively zero.

Conclusion: Only last ~10 encoder tokens receive meaningful gradient signal.

LSTM Helps But Doesn’t Solve the Bottleneck

LSTM gradient flow:

Cell state provides near-linear path: \[\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} = \mathbf{f}_t + \text{other terms}\]

When forget gate \(\mathbf{f}_t \approx 1\): minimal decay.

Improved gradient behavior:

With \(\gamma_{LSTM} \approx 0.9\) (vs 0.45 for vanilla):

  • 50 steps: \(0.9^{50} \approx 0.005\) (small but not zero!)
  • 100 steps: \(0.9^{100} \approx 10^{-5}\) (still learnable)

LSTM enables longer sequences:

  • Vanilla RNN: Effective horizon ~10-20 tokens
  • LSTM: Effective horizon ~100-200 tokens

But bottleneck remains:

All encoder information still compressed to \(\mathbf{c} \in \mathbb{R}^h\).

Better gradient flow ≠ more capacity.

LSTM improves gradient flow but doesn’t increase information capacity.

Then How Does Seq2Seq+LSTM Achieve BLEU 20-30?

WMT’14 En→Fr seq2seq+LSTM results:

  • Short sequences (20-30 tok): BLEU 30-35
  • Long sequences (60-80 tok): BLEU 20-25

Previous slides showed a severe information bottleneck and gradient vanishing (\(10^{-35}\)). Why does it work?

LSTM solves trainability, not capacity:

Vanilla RNN cannot train (gradients vanish). LSTM makes training possible (gradients flow). Performance reflects capacity constraints, not training failure.

Why partial information is sufficient:

1. Training data distribution favors short sequences

73% of WMT’14 sentence pairs are ≤40 tokens, where the bottleneck is manageable. The model optimizes for this common case.

2. Language redundancy enables lossy compression

512-1024 dimensions sufficient for high-level semantics:

  • Grammatical structure
  • Entity references
  • Semantic relations
  • Discourse flow

Not sufficient for:

  • Exact lexical choices
  • Fine-grained details
  • Long-range dependencies >50 tokens

3. Compression efficiency varies by content

Simple sentences: 60-70% preservation adequate

Complex sentences: degradation more severe (explains BLEU drop)

BLEU performance by length:

Length Seq2seq BLEU Information preserved
20-30 32.8 ~70%
40-50 25.1 ~55%
60+ 12.3 ~30%

Degradation correlates with information loss. Training optimizes for common case (short sequences).

LSTM enables training (gradient flow) but bottleneck remains (capacity limit). BLEU 20-30 reflects successful learning within constraints.

Solving the Bottleneck Requires Direct Encoder Access

Problem 1: Information Bottleneck

Fixed context capacity regardless of input length: \[H(\mathbf{c}) \leq h \times 32 \text{ bits}\]

Quantified failure for \(T_s = 50\):

  • Source entropy: \(50 \times 200 = 10,000\) bits
  • Context capacity: \(512 \times 32 = 16,384\) bits
  • RNN compression efficiency: 40-60%
  • Actual preserved: ≈6,400 bits (≈64% of the source; ~36% lost)

Performance impact:

  • BLEU degrades ~0.3 per token (accelerating with length)
  • Critical failure beyond 60 tokens (BLEU <20)

What solution needs:

Context must scale with input: \(\text{capacity} = O(T_s)\)

Access all encoder states, not compressed summary.

Problem 2: Gradient Vanishing

Path length grows with both sequences: \[\text{path} = T_{src} + T_{tgt}\]

Quantified failure for 50+50 tokens:

  • Vanilla RNN: \(\gamma \approx 0.45\)
  • Path decay: \(0.45^{100} \approx 10^{-35}\)
  • Float32 precision: \(10^{-7}\) (vanished)

LSTM improvement:

  • \(\gamma \approx 0.9\)
  • Path decay: \(0.9^{100} \approx 10^{-5}\) (learnable)
  • Gradient flow: SOLVED

What solution needs:

Path length \(= O(1)\) to any encoder state.

Direct connections bypass sequential chain.

LSTM solves gradient flow (Problem 2) but not information capacity (Problem 1). Attention addresses both.

Design Space for Solutions

Approach 1: Increase context size

\[\mathbf{c} \in \mathbb{R}^h \rightarrow \mathbb{R}^{kh}\]

Pros:

  • Simple modification
  • Same architecture

Cons:

  • Still fixed size (just bigger)
  • Does not scale with input length
  • Memory grows with \(k\)
  • Does not solve gradient problem

Insufficient. Delays problem but doesn’t solve it.


Approach 2: Multiple context vectors

Use \(\mathbf{c}_1, \mathbf{c}_2, ..., \mathbf{c}_K\) from different encoder positions

Pros:

  • More capacity: \(K \times h\) dimensions
  • Can capture different aspects

Cons:

  • How to select which context when?
  • Discrete selection → not differentiable
  • Still loses information between samples
  • Fixed \(K\) doesn’t scale

Better, but alignment problem remains.

Approach 3: Hierarchical encoding

Build tree structure over encoder states

Pros:

  • \(\log T\) depth instead of \(T\)
  • Shorter gradient paths

Cons:

  • Assumes hierarchical structure (language doesn’t have this)
  • Still compresses at root
  • Complex architecture
  • Not clear how to decode

Theoretically interesting, practically difficult.


Approach 4: Direct access with soft selection

Decoder can “look at” any encoder state

Selection weights learned and differentiable

Pros:

  • Capacity scales with \(T_s\): \(O(T_s \times h)\) available
  • Direct gradient paths: 1 hop to any encoder state
  • Differentiable selection mechanism
  • Flexible alignment

Cons:

  • Computational cost: \(O(T_s \times T_t)\)
  • Memory for alignment weights

Solves both problems. This is attention.

Attention Solves Both Problems Simultaneously

Information capacity:

Available information: \(T_s \times h\) dimensions

Context at step \(t\): \(\mathbf{c}_t = \sum_j \alpha_{tj} \mathbf{h}_j\)

Total contexts: \(T_t\) different vectors

Total capacity available to decoder: \[T_t \times (T_s \times h) \text{ dimension-accesses}\]

For a 50×50 translation: 50 distinct context vectors built from 2,500 weighted encoder-state accesses (vs a single fixed context in seq2seq)

Information preserved:

  • No forced compression through fixed bottleneck
  • Decoder decides dynamically which information to use
  • Lossy only in weighted sum, not in storage

Gradient flow:

\[\frac{\partial \mathcal{L}}{\partial \mathbf{h}_j} = \sum_{t=1}^{T_t} \alpha_{tj} \frac{\partial \mathcal{L}}{\partial \mathbf{c}_t}\]

  • Path length: 1 (direct connection)
  • No exponential decay: \(\gamma^1 = \gamma \approx 0.9\) (manageable)

Soft Selection: The Attention Mechanism

Context as weighted sum:

\[\mathbf{c}_t = \sum_{j=1}^{T_s} \alpha_{tj} \mathbf{h}_j\]

where \(\alpha_{tj} \in [0,1]\) and \(\sum_j \alpha_{tj} = 1\)

Attention weights from scores:

\[\alpha_{tj} = \frac{\exp(e_{tj}/\tau)}{\sum_{k=1}^{T_s} \exp(e_{tk}/\tau)}\]

Temperature \(\tau\) controls distribution sharpness: \(\tau \to 0\) gives one-hot (hard selection), \(\tau \to \infty\) gives uniform. Standard: \(\tau = 1\).
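
A small sketch of the temperature effect, using illustrative scores (the same values reappear in the worked example later in this unit):

import torch
import torch.nn.functional as F

scores = torch.tensor([0.80, 0.54, 0.74, 0.12, 0.36])   # illustrative scores e_tj

for tau in (0.01, 1.0, 100.0):
    weights = F.softmax(scores / tau, dim=-1)
    print(f"tau={tau:6.2f}:", [round(w, 3) for w in weights.tolist()])
# tau = 0.01 -> nearly one-hot on the largest score
# tau = 1    -> the standard softmax used in this unit
# tau = 100  -> nearly uniform (about 1/T_s each)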

Score function measures relevance:

\[e_{tj} = \text{score}(\mathbf{s}_{t-1}, \mathbf{h}_j)\]

How well does encoder state \(\mathbf{h}_j\) match decoder query \(\mathbf{s}_{t-1}\)?

Different score functions lead to different attention variants.

What should score(\(\mathbf{s}_{t-1}\), \(\mathbf{h}_j\)) be?

Attention is Learned Alignment

Attention Resolves Contextual Dependencies

Consider this sentence:

“The animal didn’t cross the street because it was too tired.”

Question: What does “it” refer to?

  • “the animal” (makes sense)
  • “the street” (streets don’t get tired)

Humans resolve this through context:

  • “tired” applies to living beings
  • Semantic understanding of sentence structure
  • Long-range dependency spanning 7 words

Problem for fixed-context seq2seq:

  • Fixed \(\mathbf{c}\) vector must encode all relationships
  • Information about “animal” compressed through 7 tokens
  • No mechanism to selectively retrieve relevant source context when generating “it”

Attention mechanism:

  • When decoding “it”, compute relevance of each source word
  • “animal” receives high attention weight
  • “street” receives low attention weight
  • Dynamic context based on what’s needed at each decoding step

Attention mechanism computes context dynamically based on decoder state. When generating “it”, the model attends strongly to “animal” (0.65 weight), correctly resolving the pronoun reference.

Complete Attention-Based Seq2Seq Architecture

Encoder (unchanged):

Process source sequence with bidirectional LSTM:

\[\overrightarrow{\mathbf{h}}_j = \text{LSTM}(\mathbf{x}_j, \overrightarrow{\mathbf{h}}_{j-1})\] \[\overleftarrow{\mathbf{h}}_j = \text{LSTM}(\mathbf{x}_j, \overleftarrow{\mathbf{h}}_{j+1})\]

Concatenate: \(\mathbf{h}_j = [\overrightarrow{\mathbf{h}}_j; \overleftarrow{\mathbf{h}}_j] \in \mathbb{R}^{2h}\)

All encoder states preserved: \(\{\mathbf{h}_1, ..., \mathbf{h}_{T_s}\}\)

Decoder with attention:

At each timestep \(t\):

  1. Compute attention over encoder states
  2. Generate context \(\mathbf{c}_t\)
  3. Update decoder state with context
  4. Predict next token

\[\mathbf{s}_t = \text{LSTM}([\mathbf{y}_{t-1}; \mathbf{c}_{t-1}], \mathbf{s}_{t-1})\] \[P(\mathbf{y}_t | \mathbf{y}_{<t}, \mathbf{X}) = \text{softmax}(\mathbf{W}_o[\mathbf{s}_t; \mathbf{c}_t])\]

Context \(\mathbf{c}_t\) computed dynamically at each decoder step.

Attention Recomputed at Every Decoder Timestep

Critical property: Attention weights \(\alpha_{tj}\) recomputed for each target position \(t\).

Decoder timestep \(t=1\):

  1. Input: \(\langle\)END\(\rangle\) token + initial state \(\mathbf{s}_0\)
  2. RNN update: \(\mathbf{s}_1 = \text{LSTM}(\langle\)END\(\rangle, \mathbf{s}_0)\)
  3. Compute attention: Calculate \(\alpha_{1j}\) for all source positions \(j\) \[\alpha_{1j} = \frac{\exp(e_{1j})}{\sum_{k=1}^{T_s} \exp(e_{1k})}\] where \(e_{1j} = \text{score}(\mathbf{s}_1, \mathbf{h}_j)\)
  4. Context vector: \(\mathbf{c}_1 = \sum_{j=1}^{T_s} \alpha_{1j} \mathbf{h}_j\)
  5. Combine and predict: \([\mathbf{s}_1; \mathbf{c}_1] \rightarrow \text{MLP} \rightarrow P(\mathbf{y}_1)\)

Decoder timestep \(t=2\):

  1. Input: Previous output \(\mathbf{y}_1\) + state \(\mathbf{s}_1\)
  2. RNN update: \(\mathbf{s}_2 = \text{LSTM}(\mathbf{y}_1, \mathbf{s}_1)\)
  3. Recompute attention: Calculate \(\alpha_{2j}\) for all source positions
    • Different decoder state \(\mathbf{s}_2 \neq \mathbf{s}_1\)
    • Different scores \(e_{2j} = \text{score}(\mathbf{s}_2, \mathbf{h}_j)\)
    • Different weights \(\alpha_{2j} \neq \alpha_{1j}\)
  4. New context: \(\mathbf{c}_2 = \sum_{j=1}^{T_s} \alpha_{2j} \mathbf{h}_j\)
  5. Combine and predict: \([\mathbf{s}_2; \mathbf{c}_2] \rightarrow \text{MLP} \rightarrow P(\mathbf{y}_2)\)

Each decoder step accesses different weighted combinations of source information.

At \(t=1\): compute \(\alpha_{1j}\) using \(\mathbf{s}_1\), form \(\mathbf{c}_1\).

At \(t=2\): recompute \(\alpha_{2j}\) using \(\mathbf{s}_2\), form different \(\mathbf{c}_2\).

Each timestep accesses source information dynamically based on current decoding context.

Attention Computation: Setup

Given at decoder timestep \(t=2\):

Decoder state: \[\mathbf{s}_1 = [0.6, 0.4]^\top \in \mathbb{R}^2\]

Encoder states:

\[\mathbf{h}_1 = [1.0, 0.5]^\top\] \[\mathbf{h}_2 = [0.3, 0.9]^\top\] \[\mathbf{h}_3 = [0.7, 0.8]^\top\] \[\mathbf{h}_4 = [-0.2, 0.6]^\top\] \[\mathbf{h}_5 = [0.4, 0.3]^\top\]

We have \(T_s = 5\) encoder positions, each \(\mathbf{h}_j \in \mathbb{R}^2\).

Task: Compute context vector \(\mathbf{c}_2\) for generating \(\mathbf{y}_2\)

Three steps:

  1. Compute scores: How relevant is each \(\mathbf{h}_j\) to \(\mathbf{s}_1\)?
  2. Normalize to weights: Convert scores to probabilities
  3. Weighted sum: Combine encoder states

Step 1: Computing Scores

Score function (using dot product):

\[e_{2j} = \mathbf{s}_1^\top \mathbf{h}_j\]

Compute each score:

\[e_{21} = [0.6, 0.4] \cdot [1.0, 0.5] = 0.6 + 0.2 = 0.8\] \[e_{22} = [0.6, 0.4] \cdot [0.3, 0.9] = 0.18 + 0.36 = 0.54\] \[e_{23} = [0.6, 0.4] \cdot [0.7, 0.8] = 0.42 + 0.32 = 0.74\] \[e_{24} = [0.6, 0.4] \cdot [-0.2, 0.6] = -0.12 + 0.24 = 0.12\] \[e_{25} = [0.6, 0.4] \cdot [0.4, 0.3] = 0.24 + 0.12 = 0.36\]

Score vector: \[\mathbf{e}_2 = [0.80, 0.54, 0.74, 0.12, 0.36]\]

Interpretation: Position 1 has highest score (0.80) → most aligned with query

Higher score = better alignment between “query” (\(\mathbf{s}_t\)) and “key” (\(\mathbf{h}_j\))

Step 2: Softmax Normalization

Scores (from previous step): \[\mathbf{e}_2 = [0.80, 0.54, 0.74, 0.12, 0.36]\]

Apply softmax: \[\alpha_{2j} = \frac{\exp(e_{2j})}{\sum_{k=1}^{5} \exp(e_{2k})}\]

Compute exponentials: \[\exp(\mathbf{e}_2) = [2.23, 1.72, 2.10, 1.13, 1.43]\]

Sum: \[\sum_{k=1}^{5} \exp(e_{2k}) = 8.61\]

Normalize: \[\alpha_{21} = \frac{2.23}{8.61} = 0.259\] \[\alpha_{22} = \frac{1.72}{8.61} = 0.200\] \[\alpha_{23} = \frac{2.10}{8.61} = 0.244\] \[\alpha_{24} = \frac{1.13}{8.61} = 0.131\] \[\alpha_{25} = \frac{1.43}{8.61} = 0.166\]

Attention weights: \[\boldsymbol{\alpha}_2 = [0.259, 0.200, 0.244, 0.131, 0.166]\]

Verify: \(\sum_j \alpha_{2j} = 1.000\)

Softmax converts arbitrary scores to valid probability distribution

Step 3: Weighted Sum to Context

Attention weights: \[\boldsymbol{\alpha}_2 = [0.259, 0.200, 0.244, 0.131, 0.166]\]

Encoder states: \[\mathbf{h}_1 = [1.0, 0.5]^\top, \quad \mathbf{h}_2 = [0.3, 0.9]^\top\] \[\mathbf{h}_3 = [0.7, 0.8]^\top, \quad \mathbf{h}_4 = [-0.2, 0.6]^\top\] \[\mathbf{h}_5 = [0.4, 0.3]^\top\]

Compute context: \[\mathbf{c}_2 = \sum_{j=1}^{5} \alpha_{2j} \mathbf{h}_j\]

Dimension 1: \[c_{2,1} = 0.259(1.0) + 0.200(0.3) + 0.244(0.7) + 0.131(-0.2) + 0.166(0.4)\] \[= 0.259 + 0.060 + 0.171 - 0.026 + 0.066 = 0.530\]

Dimension 2: \[c_{2,2} = 0.259(0.5) + 0.200(0.9) + 0.244(0.8) + 0.131(0.6) + 0.166(0.3)\] \[= 0.130 + 0.180 + 0.195 + 0.079 + 0.050 = 0.634\]

Context vector: \[\mathbf{c}_2 = [0.530, 0.634]^\top\]

Context is a weighted combination emphasizing relevant encoder positions
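
The three steps above can be reproduced in a few lines; a sketch using PyTorch with the same numbers as the worked example:

import torch

s1 = torch.tensor([0.6, 0.4])                             # decoder state (query)
H = torch.tensor([[1.0, 0.5], [0.3, 0.9], [0.7, 0.8],
                  [-0.2, 0.6], [0.4, 0.3]])               # encoder states, T_s = 5

e = H @ s1                        # step 1: scores -> [0.80, 0.54, 0.74, 0.12, 0.36]
alpha = torch.softmax(e, dim=0)   # step 2: weights, matching the values above up to rounding
c = alpha @ H                     # step 3: context ≈ [0.530, 0.633] (0.634 above from hand rounding)

print(e, alpha, c, sep="\n")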

Matrix Form for Efficient Implementation

Batch processing all decoder timesteps:

Decoder states: \(\mathbf{S} \in \mathbb{R}^{T_t \times h}\)

Encoder states: \(\mathbf{H} \in \mathbb{R}^{T_s \times 2h}\)

Score computation (plain dot product, assuming matching dimensions):

\[\mathbf{E} = \mathbf{S} \mathbf{H}^\top\]

\(\mathbf{E} \in \mathbb{R}^{T_t \times T_s}\).

General (multiplicative) attention inserts a learned projection, which also handles the dimension mismatch between \(\mathbf{S} \in \mathbb{R}^{T_t \times h}\) and \(\mathbf{H} \in \mathbb{R}^{T_s \times 2h}\):

\[\mathbf{E} = \mathbf{S} \mathbf{W}_a \mathbf{H}^\top\]

where \(\mathbf{W}_a \in \mathbb{R}^{h \times 2h}\)

\(E_{ij}\) = score for decoder step \(i\) attending to encoder position \(j\)

Attention weights:

\[\mathbf{A} = \text{softmax}(\mathbf{E}, \text{dim}=-1)\]

Applied row-wise: each row sums to 1

Context vectors:

\[\mathbf{C} = \mathbf{A} \mathbf{H}\]

Result: \(\mathbf{C} \in \mathbb{R}^{T_t \times 2h}\)

Row \(i\) of \(\mathbf{C}\) is context \(\mathbf{c}_i\) for decoder step \(i\)

GPU-friendly: All operations are matrix multiplications.
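
A minimal sketch of this matrix form, assuming decoder width \(h\) and bidirectional encoder width \(2h\); the learned \(\mathbf{W}_a\) bridges the dimension mismatch:

import torch
import torch.nn.functional as F

T_t, T_s, h = 48, 50, 256
S = torch.randn(T_t, h)              # decoder states
H = torch.randn(T_s, 2 * h)          # bidirectional encoder states
W_a = 0.02 * torch.randn(h, 2 * h)   # learned projection (general attention)

E = S @ W_a @ H.T                    # scores   [T_t, T_s]
A = F.softmax(E, dim=-1)             # weights, each row sums to 1
C = A @ H                            # contexts [T_t, 2h]

print(E.shape, A.shape, C.shape)     # [48, 50], [48, 50], [48, 512]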

Attention as Soft Dictionary Lookup

Key-Value store interpretation:

Encoder states form key-value pairs:

  • Keys: \(\{\mathbf{h}_1, ..., \mathbf{h}_{T_s}\}\) (for matching)
  • Values: \(\{\mathbf{h}_1, ..., \mathbf{h}_{T_s}\}\) (for retrieval)
  • Basic attention: keys = values

Decoder query:

  • Query: \(\mathbf{s}_{t-1}\) (search pattern)

Retrieval process:

  1. Compare query against all keys: score(\(\mathbf{s}_{t-1}\), \(\mathbf{h}_j\))
  2. Compute similarity-based weights: softmax(scores)
  3. Retrieve weighted combination of values

Soft vs hard lookup:

Hard (discrete): Return value with highest score

Soft (differentiable): Return weighted average of all values

Weights indicate relevance of each value to query.

Why soft? Gradients flow through weighted sum, enabling end-to-end learning.

Score Function Measures Query-Key Relevance

Central question: How to measure relevance?

\[e_{tj} = \text{score}(\mathbf{s}_{t-1}, \mathbf{h}_j)\]

Requirements:

  1. Input compatibility: Handle different dimensions
    • Decoder state: \(\mathbf{s}_{t-1} \in \mathbb{R}^{h_d}\)
    • Encoder state: \(\mathbf{h}_j \in \mathbb{R}^{h_e}\)
    • Often \(h_d \neq h_e\) (especially with bidirectional encoder)
  2. Differentiability: Gradient flow for learning
  3. Computational efficiency: Applied \(T_s \times T_t\) times
  4. Expressiveness: Capture semantic similarity

Three main approaches emerged:

  • Additive (Bahdanau, 2015)
  • Multiplicative (Luong, 2015)
  • Scaled dot-product (Vaswani, 2017)

Each makes different tradeoffs.

Additive Attention (Bahdanau, 2015)

Score function:

\[e_{tj} = \mathbf{v}^\top \tanh(\mathbf{W}_s \mathbf{s}_{t-1} + \mathbf{W}_h \mathbf{h}_j)\]

Parameters:

  • \(\mathbf{W}_s \in \mathbb{R}^{d_a \times h_d}\): Project decoder state
  • \(\mathbf{W}_h \in \mathbb{R}^{d_a \times h_e}\): Project encoder state
  • \(\mathbf{v} \in \mathbb{R}^{d_a}\): Score vector
  • \(d_a\): Attention dimension (typically 256-512)

Computation:

  1. Project both inputs to common space: \(\mathbb{R}^{d_a}\)
  2. Element-wise addition
  3. Nonlinear activation (tanh)
  4. Linear projection to scalar

Why tanh?

  • Bounded output: \([-1, 1]\)
  • Creates nonlinear similarity space
  • Prevents score explosion

Cost: \(O(d_a \times (h_d + h_e))\) per score

Learns complex, nonlinear alignment patterns via tanh nonlinearity.
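
A sketch of the additive score as a small PyTorch module; the dimensions (\(h_d\), \(h_e\), \(d_a\)) follow the slide, and the module name is an assumption for illustration:

import torch
import torch.nn as nn

class AdditiveScore(nn.Module):
    """e_tj = v^T tanh(W_s s_{t-1} + W_h h_j), vectorized over encoder positions."""

    def __init__(self, h_d, h_e, d_a):
        super().__init__()
        self.W_s = nn.Linear(h_d, d_a, bias=False)   # project decoder state
        self.W_h = nn.Linear(h_e, d_a, bias=False)   # project encoder states
        self.v = nn.Linear(d_a, 1, bias=False)       # score vector

    def forward(self, s, H):
        # s: [B, h_d] decoder state, H: [B, T_s, h_e] encoder states
        proj = self.W_s(s).unsqueeze(1) + self.W_h(H)   # [B, T_s, d_a]
        return self.v(torch.tanh(proj)).squeeze(-1)     # [B, T_s] scores

score_fn = AdditiveScore(h_d=512, h_e=1024, d_a=256)    # ~393K parameters, as quoted later
print(score_fn(torch.randn(2, 512), torch.randn(2, 7, 1024)).shape)   # torch.Size([2, 7])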

Multiplicative Attention (Luong, 2015)

Three variants:

1. Dot product (simplest): \[e_{tj} = \mathbf{s}_{t-1}^\top \mathbf{h}_j\]

Requires: \(h_d = h_e\)

2. General (learned projection): \[e_{tj} = \mathbf{s}_{t-1}^\top \mathbf{W}_a \mathbf{h}_j\]

Parameters: \(\mathbf{W}_a \in \mathbb{R}^{h_d \times h_e}\)

Works for any dimensions.

3. Concat (hybrid with additive): \[e_{tj} = \mathbf{v}^\top \tanh(\mathbf{W}_a [\mathbf{s}_{t-1}; \mathbf{h}_j])\]

Parameters: \(\mathbf{W}_a \in \mathbb{R}^{d_a \times (h_d + h_e)}\), \(\mathbf{v} \in \mathbb{R}^{d_a}\)

Comparison to Bahdanau:

  • Fewer parameters: 0 (dot) vs 393K (additive at \(d_a=256\))
  • No nonlinearity (dot/general) or single layer (concat)
  • Luong et al. (2015): Comparable BLEU, fewer parameters

Tradeoff: Simplicity and speed vs expressiveness.
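
For comparison, a sketch of the dot and general variants (concat omitted); the batch dimension and shapes are assumptions:

import torch

def luong_dot(s, H):
    """Dot-product score; requires h_d == h_e.  s: [B, h], H: [B, T_s, h]."""
    return torch.bmm(H, s.unsqueeze(-1)).squeeze(-1)            # [B, T_s]

def luong_general(s, H, W_a):
    """General score s^T W_a h_j; W_a: [h_d, h_e] handles mismatched dimensions."""
    return torch.bmm(H, (s @ W_a).unsqueeze(-1)).squeeze(-1)    # [B, T_s]

s = torch.randn(2, 512)
print(luong_dot(s, torch.randn(2, 7, 512)).shape)                               # [2, 7]
print(luong_general(s, torch.randn(2, 7, 1024), torch.randn(512, 1024)).shape)  # [2, 7]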

Scaled Dot-Product Attention (Vaswani, 2017)

Score function:

\[e_{tj} = \frac{\mathbf{s}_{t-1}^\top \mathbf{h}_j}{\sqrt{d_k}}\]

where \(d_k\) is the key dimension.

Why scaling is critical:

Dot product variance grows with dimension.

For \(\mathbf{s}, \mathbf{h}\) with i.i.d. components ~ \(\mathcal{N}(0, 1)\):

\[\text{Var}(\mathbf{s}^\top \mathbf{h}) = \sum_{i=1}^{d_k} \text{Var}(s_i h_i) = d_k\]

Without scaling, as \(d_k\) increases:

  • Dot products become very large
  • Softmax saturates: \(\alpha_j \approx 0\) or \(1\)
  • Gradients vanish

With scaling: \[\text{Var}\left(\frac{\mathbf{s}^\top \mathbf{h}}{\sqrt{d_k}}\right) = \frac{d_k}{d_k} = 1\]

Variance stays constant regardless of dimension.

Critical for transformers: Enables attention in high-dimensional spaces (512-1024).
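
A quick empirical check of the variance argument, assuming i.i.d. standard-normal query and key components:

import torch

torch.manual_seed(0)
for d_k in (16, 256, 1024):
    s = torch.randn(10_000, d_k)
    h = torch.randn(10_000, d_k)
    raw = (s * h).sum(dim=-1)                       # unscaled dot products
    scaled = raw / d_k ** 0.5                       # scaled dot products
    print(f"d_k={d_k:5d}  var(raw)={raw.var():8.1f}  var(scaled)={scaled.var():.2f}")
# var(raw) grows roughly linearly in d_k; var(scaled) stays near 1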

Softmax Gradient: Backpropagation Through Attention Weights

Attention weights normalized via softmax:

\[\alpha_j = \frac{\exp(e_j)}{\sum_{k=1}^{T_s} \exp(e_k)}\]

Gradient computation requires \(\frac{\partial \alpha_j}{\partial e_k}\) for backpropagation.

Derivation:

For \(j = k\) (diagonal):

\[\frac{\partial \alpha_j}{\partial e_j} = \frac{\partial}{\partial e_j}\left(\frac{\exp(e_j)}{Z}\right)\]

where \(Z = \sum_k \exp(e_k)\)

\[= \frac{\exp(e_j) \cdot Z - \exp(e_j) \cdot \exp(e_j)}{Z^2}\]

\[= \frac{\exp(e_j)}{Z} \left(1 - \frac{\exp(e_j)}{Z}\right) = \alpha_j(1 - \alpha_j)\]

For \(j \neq k\) (off-diagonal):

\[\frac{\partial \alpha_j}{\partial e_k} = \frac{0 \cdot Z - \exp(e_j) \cdot \exp(e_k)}{Z^2}\]

\[= -\frac{\exp(e_j)}{Z} \cdot \frac{\exp(e_k)}{Z} = -\alpha_j \alpha_k\]

Combined form:

\[\frac{\partial \alpha_j}{\partial e_k} = \alpha_j(\delta_{jk} - \alpha_k)\]

where \(\delta_{jk} = 1\) if \(j=k\), else \(0\) (Kronecker delta).

Gradient from loss \(\mathcal{L}\) to scores:

\[\frac{\partial \mathcal{L}}{\partial e_k} = \sum_{j=1}^{T_s} \frac{\partial \mathcal{L}}{\partial \alpha_j} \cdot \frac{\partial \alpha_j}{\partial e_k}\]

\[= \sum_j \frac{\partial \mathcal{L}}{\partial \alpha_j} \cdot \alpha_j(\delta_{jk} - \alpha_k)\]

\[= \alpha_k \frac{\partial \mathcal{L}}{\partial \alpha_k} - \alpha_k \sum_j \alpha_j \frac{\partial \mathcal{L}}{\partial \alpha_j}\]

Vectorized:

\[\nabla_{\mathbf{e}} \mathcal{L} = \boldsymbol{\alpha} \odot \left(\nabla_{\boldsymbol{\alpha}} \mathcal{L} - \langle \boldsymbol{\alpha}, \nabla_{\boldsymbol{\alpha}} \mathcal{L} \rangle \mathbf{1}\right)\]

where \(\odot\) is element-wise product, \(\langle \cdot, \cdot \rangle\) is dot product.

Gradient properties:

  • When \(\alpha_j \to 1\) (saturated): \(\frac{\partial \alpha_j}{\partial e_j} \to 0\)
  • When \(\alpha_j \to 0\) (ignored): gradient contribution negligible
  • Scaling prevents saturation, maintains gradient flow
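
The closed-form Jacobian can be checked against autograd; a short sketch:

import torch

e = torch.tensor([0.80, 0.54, 0.74, 0.12, 0.36], requires_grad=True)
alpha = torch.softmax(e, dim=0)

# Closed form: d(alpha_j)/d(e_k) = alpha_j (delta_jk - alpha_k) = diag(alpha) - alpha alpha^T
J_closed = torch.diag(alpha) - torch.outer(alpha, alpha)

# Row j of the Jacobian via autograd: gradient of alpha_j with respect to e
J_auto = torch.stack([torch.autograd.grad(alpha[j], e, retain_graph=True)[0]
                      for j in range(e.numel())])

print(torch.allclose(J_closed, J_auto, atol=1e-6))   # True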

Score Function Performance Comparison

Empirical results on WMT’14 En→De:

Score Function BLEU Parameters Speed
Fixed context 14.0 — Fast
Dot product 18.2 0 Fastest
General (Luong) 20.1 256K Fast
Additive (Bahdanau) 20.3 640K Medium
Scaled dot-product 20.8 0 Fastest

Any attention mechanism: +4-6 BLEU over seq2seq baseline.

Scaling critical: Dot product (18.2) → scaled (20.8), +2.6 BLEU.

Diminishing returns: Additive (20.3) vs scaled (20.8), only +0.5 BLEU difference.

Speed-quality tradeoff: Scaled dot-product fastest with highest quality.

Current practice (2025):

  • Transformers: Scaled dot-product (universal)
  • RNN seq2seq: Additive or general
  • Simple cases: Dot product (when dimensions match)

Scaled dot-product dominates modern applications: fastest computation, highest quality, zero additional parameters.

What Traditional Machine Translation (MT) Alignment Looked Like

Statistical MT alignment (pre-neural):

Given parallel sentences, find word-to-word mapping:

  • Source: “The cat sat on the mat”
  • Target: “Le chat assis sur le tapis”

Hard alignment (IBM Models 1-5, 1990s):

Each target word aligned to exactly one source word (or NULL).

Learned via EM algorithm over large corpora.

Properties:

  • Discrete: Binary decision per word pair
  • One-to-one or one-to-many
  • Not differentiable
  • Trained separately from translation model
  • Used for phrase extraction

Problems:

  • Alignment errors propagate
  • Can’t handle many-to-many
  • No gradient signal to translation
  • Brittle with rare words

Neural attention learns alignment implicitly during translation training.

Attention as Soft Alignment

Attention weights = soft alignment:

\[\alpha_{tj} \in (0, 1), \quad \sum_j \alpha_{tj} = 1\]

Each target position \(t\) has distribution over all source positions.

Differences from hard alignment:

  • Continuous weights: \(\alpha_{tj} \in [0,1]\), not binary
  • Many-to-many: All connections weighted (no hard zeros)
  • Learned jointly: Single end-to-end model, no separate aligner
  • Differentiable: Gradients backpropagate through \(\alpha_{tj}\)
  • Context-dependent: Weights vary with decoder state \(\mathbf{s}_t\)

Interpretation:

  • \(\alpha_{tj} = 0.6\): Target word \(t\) weighted 60% from source word \(j\)
  • Multiple high weights: Target word combines multiple source words
  • Phrasal alignment natural: One-to-many, many-to-one handled automatically
  • Smooth gradient signal
  • No explicit supervision needed
  • Robust to noise

Each row sums to 1: Valid probability distribution over source positions.

Real Translation Example: English → German

Source (English): “The agreement on the European Economic Area was signed in August 1992”

Target (German): “Das Abkommen über den Europäischen Wirtschaftsraum wurde im August 1992 unterzeichnet”

Alignment challenges:

  • Reordering: “was signed” → “wurde … unterzeichnet” (verb splits across positions)
  • Compounds: “European Economic Area” → “Europäischen Wirtschaftsraum” (3 words → 1 word)
  • Articles: “the” → “den” (case agreement morphology)

Attention patterns:

  • “Das” → “The” (article translation)
  • “Abkommen” → “agreement”
  • “unterzeichnet” attends to both “was” and “signed” (split verb)
  • “Wirtschaftsraum” attends to “Economic” and “Area” (compound)

Model learns these patterns without explicit alignment supervision.

Attention discovers linguistic structure without explicit supervision.

Alignment Emerges from Translation Loss Alone

Critical point: Attention alignment learned without alignment annotations.

Training setup:

  • Input: Parallel sentences (source, target) pairs

    • English: “The animal didn’t cross the street”
    • German: “Das Tier überquerte die Straße nicht”
  • Supervision: Only translation loss \[\mathcal{L} = -\sum_{t=1}^{T_t} \log P(\mathbf{y}_t^* | \mathbf{y}_{<t}^*, \mathbf{X}; \theta)\]

  • No alignment labels: Never told which source word aligns to which target word

What emerges:

Attention weights \(\alpha_{tj}\) converge to linguistically meaningful alignments:

  • Content words align to translations
  • Function words show systematic patterns
  • Reorderings captured (SVO → SOV)
  • Compounds and split constructions discovered

Why this matters:

Traditional MT required:

  • Manual word alignment annotations
  • Linguistic rules for reordering
  • Explicit morphological analysis

Attention mechanism:

  • Learns alignment as emergent behavior
  • Driven by translation objective alone
  • Discovers linguistic structure through gradient descent
  • No linguistic engineering required

Alignment is a by-product of learning to translate, not an explicit training target.

Attention weights encode source-target alignments as emergent structure from translation objective. Model discovers linguistic correspondences through backpropagation, without explicit alignment supervision during training.

Common Attention Patterns

Pattern 1: Diagonal (Monotonic alignment)

  • Languages with similar word order (En→Fr, En→De)
  • Strong diagonal indicates word-for-word correspondence

Pattern 2: Off-diagonal blocks (Reordering)

  • Subject-Object-Verb vs Subject-Verb-Object
  • Adjective-Noun order reversal

Pattern 3: Vertical stripes (Many-to-one)

  • Multiple source words → single target word
  • Example: “not good” → “mauvais” (French)
  • Phrasal verbs: “give up” → “abandonner”

Pattern 4: Horizontal stripes (One-to-many)

  • Single source → multiple target words
  • Example: “children” → “les enfants” (article + noun)
  • Compound splitting in German

Pattern 5: Diffuse (Uniform)

  • Indicates uncertainty or failure mode
  • Model cannot decide what to attend to
  • Often occurs with rare words or ambiguous context

Pattern analysis helps diagnose model behavior and failure modes.

Bidirectional Encoding Improves Alignment

Unidirectional encoder:

\[\mathbf{h}_j = \text{RNN}(\mathbf{x}_j, \mathbf{h}_{j-1})\]

State \(\mathbf{h}_j\) only sees \(\mathbf{x}_1, ..., \mathbf{x}_j\) (past context).

Problem: Cannot use future context for alignment.

Example: “bank” in “river bank” vs “savings bank”

  • Disambiguation requires seeing words after “bank”
  • Unidirectional encoder hasn’t seen them yet

Bidirectional encoder:

\[\overrightarrow{\mathbf{h}}_j = \text{RNN}(\mathbf{x}_j, \overrightarrow{\mathbf{h}}_{j-1})\] \[\overleftarrow{\mathbf{h}}_j = \text{RNN}(\mathbf{x}_j, \overleftarrow{\mathbf{h}}_{j+1})\] \[\mathbf{h}_j = [\overrightarrow{\mathbf{h}}_j; \overleftarrow{\mathbf{h}}_j]\]

Each \(\mathbf{h}_j\) incorporates full sentence context.

Benefits:

  • Full sentence context per position
  • Context-aware alignment
  • +2.1 BLEU over unidirectional (WMT’14, Bahdanau et al., 2015)

Bidirectional encoders: 2× computation for full source context. WMT’14: +2.1 BLEU over unidirectional.

Efficient Implementation

From Decoder-Encoder to Query-Key-Value

Decoder-encoder formulation:

\[\mathbf{c}_t = \sum_{j=1}^{T_s} \alpha_{tj} \mathbf{h}_j\]

where \(\alpha_{tj} = \text{softmax}(\text{score}(\mathbf{s}_{t-1}, \mathbf{h}_j))\)

Ties attention to specific seq2seq architecture. Decoder state \(\mathbf{s}_{t-1}\) queries encoder states \(\mathbf{h}_j\) directly.

Query-Key-Value abstraction:

\[\mathbf{c}_t = \sum_{j=1}^{T_s} \alpha_{tj} \mathbf{v}_j\]

where \(\alpha_{tj} = \text{softmax}(\mathbf{q}_t^\top \mathbf{k}_j / \sqrt{d})\)

General formulation decouples mechanism from architecture. Query \(\mathbf{q}_t = \mathbf{s}_{t-1}\) searches keys \(\mathbf{k}_j = \mathbf{h}_j\) to weight values \(\mathbf{v}_j = \mathbf{h}_j\).

Same computation applies to self-attention, cross-attention, multi-head variants.

Sequential Computation Limits GPU Utilization

Modern GPUs contain thousands of parallel cores. Sequential loops use one core—remaining capacity sits idle.

Sequential implementation:

for b in range(batch_size):
    for t in range(T_t):
        for s in range(T_s):
            scores[b,t,s] = query[b,t] @ key[b,s]

The triple-nested loop forces serial execution even though no data dependencies exist between iterations; GPU utilization drops to 5-10%.

Batched implementation:

scores = torch.bmm(Q, K.transpose(1, 2))
# [B, T_t, d] @ [B, d, T_s] -> [B, T_t, T_s]

Batch matrix multiply saturates memory bandwidth. All \(B \times T_t \times T_s\) operations execute in parallel across available cores.

Batch size 32, sequence length 50: typical speedup 15-40×.

Batched Attention Formulation

Input tensors:

  • Queries: \(\mathbf{Q} \in \mathbb{R}^{B \times T_t \times d}\)
  • Keys: \(\mathbf{K} \in \mathbb{R}^{B \times T_s \times d}\)
  • Values: \(\mathbf{V} \in \mathbb{R}^{B \times T_s \times d}\)

where \(B\) is batch size.

Three operations:

1. Score computation: \[\mathbf{E} = \mathbf{Q}\mathbf{K}^\top / \sqrt{d}\]

Result: \(\mathbf{E} \in \mathbb{R}^{B \times T_t \times T_s}\)

Batch matrix multiply processes all samples simultaneously.

2. Softmax normalization: \[\mathbf{A} = \text{softmax}(\mathbf{E}, \text{dim}=-1)\]

Applied per row. Each row in \(\mathbf{A}\) sums to 1.

3. Context assembly: \[\mathbf{C} = \mathbf{A}\mathbf{V}\]

Result: \(\mathbf{C} \in \mathbb{R}^{B \times T_t \times d}\)

All matrix operations, no loops.

import torch
import torch.nn.functional as F
import math

def batched_attention(Q, K, V):
    """
    Args:
        Q: [B, T_t, d] query vectors
        K: [B, T_s, d] key vectors
        V: [B, T_s, d] value vectors

    Returns:
        context: [B, T_t, d]
        weights: [B, T_t, T_s]
    """
    d = Q.size(-1)

    # Scores: [B, T_t, T_s]
    scores = torch.matmul(Q, K.transpose(-2, -1))
    scores = scores / math.sqrt(d)

    # Attention weights: [B, T_t, T_s]
    weights = F.softmax(scores, dim=-1)

    # Context: [B, T_t, d]
    context = torch.matmul(weights, V)

    return context, weights

Example:

B, T_s, T_t, d = 32, 50, 48, 512

Q = torch.randn(B, T_t, d)
K = torch.randn(B, T_s, d)
V = torch.randn(B, T_s, d)

context, weights = batched_attention(Q, K, V)
# context: [32, 48, 512]
# weights: [32, 48, 50]

Computational Complexity

Attention operations per sample:

Score computation \(\mathbf{Q}\mathbf{K}^\top\):

  • Matrix sizes: \((T_t \times d) \times (d \times T_s)\)
  • Operations: \(T_t \times T_s \times d\) multiplies

Context computation \(\mathbf{A}\mathbf{V}\):

  • Matrix sizes: \((T_t \times T_s) \times (T_s \times d)\)
  • Operations: \(T_t \times T_s \times d\) multiplies

Total: \(O(T_t \times T_s \times d)\) operations per sample

Memory: \(O(T_t \times T_s)\) for attention matrix storage

RNN baseline:

Encoder: \(O(T_s \times d^2)\) operations. Decoder: \(O(T_t \times d^2)\) operations. Total: \(O((T_s + T_t) \times d^2)\).

Attention cost dominates the recurrence when \(T_s \times T_t \times d > (T_s + T_t) \times d^2\), i.e., when \(T_s \times T_t > (T_s + T_t) \times d\) (roughly, sequence lengths beyond \(2d\)).

Translation (\(T_s = T_t = 50\), \(d = 512\)): \(2{,}500 \ll 51{,}200\), so attention adds only modest extra computation on top of the RNN while supporting parallelism impossible with the recurrence.

Attention: \(O(T^2)\) but parallel. RNN: \(O(T \times d^2)\) but sequential.
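
Plugging the example numbers into these complexity formulas (a back-of-envelope sketch counting multiplies only):

# Per-sample multiply counts for the example T_s = T_t = 50, d = 512
T_s = T_t = 50
d = 512

attention_ops = 2 * T_t * T_s * d          # QK^T scores plus AV contexts
rnn_ops = (T_s + T_t) * d * d              # encoder + decoder recurrences

print(f"attention layer: {attention_ops:>12,}")   # 2,560,000  -- fully parallelizable
print(f"rnn recurrence:  {rnn_ops:>12,}")         # 26,214,400 -- strictly sequential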

Numerical Stability in Softmax

Direct softmax computation fails for large scores:

\[\text{softmax}(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}\]

For \(x_i = 90\): \(\exp(90) \approx 1.2 \times 10^{39}\), exceeding the float32 maximum (\(\approx 3.4 \times 10^{38}\)); in float16, overflow already occurs near \(x_i \approx 11\).

Log-sum-exp trick maintains numerical stability:

\[\text{softmax}(x_i) = \frac{\exp(x_i - \max(x))}{\sum_j \exp(x_j - \max(x))}\]

Subtracting \(\max(x)\) shifts largest value to zero. Exponential of zero equals one (safe). All other values become negative, producing exponentials less than one (safe). Result remains mathematically identical since shift cancels in ratio.

Additional epsilon prevents division by zero:

\[\alpha_j = \frac{\exp(e_j - \max_k e_k)}{\sum_k \exp(e_k - \max_k e_k) + \epsilon}\]

where \(\epsilon = 10^{-9}\).

Gradient clipping remains necessary but requires less aggressive bounds than RNN. Typical maximum gradient norm: 5.0 for attention versus 0.25-1.0 for RNN.

def stable_softmax_attention(scores):
    """
    Args:
        scores: [B, T_t, T_s] attention scores

    Returns:
        weights: [B, T_t, T_s] attention weights
    """
    # Subtract max for numerical stability
    scores_max = scores.max(dim=-1, keepdim=True)[0]
    scores_stable = scores - scores_max

    # Compute exponentials (now safe)
    exp_scores = torch.exp(scores_stable)

    # Normalize with epsilon
    exp_sum = exp_scores.sum(dim=-1, keepdim=True)
    weights = exp_scores / (exp_sum + 1e-9)

    return weights

Mixed precision (FP16) considerations:

# Use FP32 for softmax accumulation
with torch.cuda.amp.autocast(enabled=False):
    scores = scores.float()
    weights = F.softmax(scores, dim=-1)
    weights = weights.half()

Redundant Computation in Autoregressive Generation

Training: full sequences in parallel, each score computed once.

Inference: one token at a time, each new token attends over all previous positions.

Encoder keys and values stay constant—recomputing them at every decoder step wastes computation. For \(T\) generated tokens: \(O(T^2)\) scores computed, only \(O(T)\) new.

KV Caching Eliminates Redundancy

Observation: At every decoder step,

\[\text{scores}_t = \mathbf{q}_t \mathbf{K}_{enc}^\top\]

But \(\mathbf{K}_{enc}\) never changes. Identical computation repeats \(T_t\) times.

Implementation:

# Compute once before generation
K_enc = encoder_projection(encoder_states)
V_enc = encoder_projection(encoder_states)

# Reuse for all T_t decoder steps
for t in range(T_t):
    q_t = decoder_state[t]
    scores = q_t @ K_enc.T
    weights = softmax(scores)
    context = weights @ V_enc

Memory cost: \(O(T_s \times d)\) for cache.

For \(T_s = 50\), \(d = 512\): 25,600 floats = 100 KB.

Speedup: \(T_t\times\) for encoder attention, 10-50× overall.

class AttentionWithCache:
    """Accumulates keys and values across decoding steps and reuses them."""

    def __init__(self):
        self.k_cache = None
        self.v_cache = None

    def forward(self, query, key, value, use_cache=True):
        # Append the new key/value to everything cached from earlier steps
        if use_cache and self.k_cache is not None:
            key = torch.cat([self.k_cache, key], dim=1)
            value = torch.cat([self.v_cache, value], dim=1)

        self.k_cache, self.v_cache = key, value
        # Reuse the batched_attention routine defined earlier
        return batched_attention(query, key, value)

Parallelization Enables Attention Training Speedup

RNN creates sequential dependencies:

\[\mathbf{h}_t = f(\mathbf{h}_{t-1}, \mathbf{x}_t)\]

Must compute \(\mathbf{h}_1\) before \(\mathbf{h}_2\)—hardware waits at each timestep.

Attention has no temporal dependencies:

\[\mathbf{c}_t = \sum_j \alpha_{tj} \mathbf{h}_j\]

All \(\alpha_{tj}\) compute in parallel. GPU processes all positions simultaneously.

RNN:

  • Sequential over \(T_s\) encoder and \(T_t\) decoder steps
  • Low GPU utilization (sequential bottleneck)
  • Multi-GPU: limited benefit

Attention:

  • Fully parallel across positions
  • High GPU utilization (parallel operations)
  • Multi-GPU: linear scaling via batch parallelism

Parallel attention enables efficient GPU utilization despite higher FLOPs.

Attention Alignment Emerges Without Supervision

Random initialization produces near-uniform attention distributions at training start:

\[\alpha_{tj} \approx \frac{1}{T_s} \text{ for all } j\]

Score function parameters contain no prior knowledge of linguistic alignment. Model must discover alignment patterns through translation loss gradient signal alone.

Early training (epochs 1-2): obvious one-to-one alignments. Diagonal emerges for similar word order. Sharp but inaccurate.

Mid training (epochs 3-6): refines weights, learns reordering. Handles phrasal alignment.

Late training (epochs 7-12): fine-tunes rare words, long-distance dependencies. Patterns stabilize.

No explicit alignment supervision—purely from translation loss.

Gradient Flow Comparison: Attention vs Seq2Seq

Gradient magnitude to encoder positions quantifies learning signal strength. Measure \(\|\frac{\partial \mathcal{L}}{\partial \mathbf{h}_j}\|\) during training for source position \(j\).

Seq2seq with fixed context exhibits exponential gradient decay:

\[\left\|\frac{\partial \mathcal{L}}{\partial \mathbf{h}_j}\right\| \approx \gamma^{T_s - j}\]

For \(\gamma = 0.45\), gradients to early positions vanish entirely; a position receiving \(\sim 10^{-18}\) gradient magnitude cannot learn effectively.

Attention distributes gradients based on relevance:

\[\frac{\partial \mathcal{L}}{\partial \mathbf{h}_j} = \sum_t \alpha_{tj} \frac{\partial \mathcal{L}}{\partial \mathbf{c}_t}\]

No exponential decay. Important tokens receive strong gradients regardless of position. Token attended heavily across multiple decoder steps accumulates large gradient.

WMT’14 En→De, 50-token sequences: gradient magnitude 100× larger at position 1, 1000× at position 25, \(10^{10}\times\) at position 45 (attention vs seq2seq).

Convergence: 40-50K iterations (attention) versus 100K (seq2seq).

Loss Landscape Structure Differs Between Approaches

Seq2seq loss landscape contains single information pathway: \(\mathbf{X} \rightarrow \mathbf{c} \rightarrow \mathbf{Y}\). Optimization bottlenecks at fixed context encoding. Single path to reduce loss creates sharp, narrow minima. Small parameter perturbations significantly increase loss.

Attention loss landscape contains \(O(T_s \times T_t)\) gradient pathways through attention weights. Loss reduction occurs via: alignment adjustments for different tokens, individual attention weight refinement, score function parameter modification. Gradient descent exploits redundant paths. Minima become smoother and wider. Parameter perturbations distribute across multiple pathways, maintaining lower loss.

Learning rate: seq2seq \(\leq 0.001\), attention up to \(0.01\).

Variance: seq2seq high (unstable gradients), attention low (stable).

Convergence: seq2seq 100K iterations, attention 40-50K—2-2.5× speedup.

Uniform Attention Distribution Indicates Failure

Pattern:

\[\alpha_{tj} \approx \frac{1}{T_s} \text{ for all } j\]

Attention weights approximately uniform across all source positions.

Measured characteristics:

  • Entropy: \(H(\alpha_t) > \log(T_s) - 1\) (too high)
  • Max weight: \(\max_j \alpha_{tj} < 0.2\) (no discrimination)
  • Variance: \(\text{Var}(\alpha_t) < 0.01\) (flat distribution)

Occurrence conditions:

  • Score magnitudes \(|e_{tj}| < 0.5\) (insufficient contrast)
  • Out-of-vocabulary words (no learned representations)
  • Training signal variance \(> 2.0\) (conflicting gradients)

Performance impact:

WMT’14 En→De with uniform attention:

  • BLEU: 12.3 (baseline: 19.8)
  • -7.5 BLEU degradation
  • Precision@1: 0.34 (random baseline: 0.33)

Uniform weights provide no source discrimination. Translation accuracy equivalent to random alignment baseline.

Over-Concentrated Attention Distribution

Pattern:

\[\alpha_{tj} \approx \begin{cases} 1 & j = k \\ 0 & \text{otherwise} \end{cases}\]

Single source position receives \(>0.9\) weight per timestep.

Measured characteristics:

  • Entropy: \(H(\alpha_t) < 1.0\) (too low)
  • Max weight: \(\max_j \alpha_{tj} > 0.9\) (excessive peak)
  • Effective positions: \(\sum_j \mathbb{1}[\alpha_{tj} > 0.1] \leq 1\)

Occurrence conditions:

  • Score function produces \(\max_j e_{tj} - \text{mean}_k e_{tk} > 5.0\)
  • Softmax temperature \(\tau < 0.5\) (too sharp)
  • Regularization weight \(\lambda < 10^{-5}\) (insufficient)

Performance impact:

Repetition rate on WMT’14:

  • Baseline: 3.2% repeated n-grams
  • Over-concentrated: 18.7% repeated n-grams
  • +15.5 percentage points
  • Missing translations: 22% of source content untranslated

Single-position attention ignores phrasal context. Repetition rate increases 5.8× over baseline.

Attention Drift in Long Sequences

Pattern:

Alignment error increases with sequence length.

Measured degradation:

Length (tokens) Alignment F1 BLEU Error rate
20-40 0.78 28.3 8.2%
40-60 0.71 24.1 14.7%
60-80 0.63 19.8 22.1%
80-100 0.54 15.2 31.4%

Drift rate: Alignment F1 decreases 0.012 per token beyond position 40.

Contributing factors:

  • Position embedding resolution: \(\sin(i/10000^{2k/d})\) loses precision
  • Hidden state norm decays: \(\|\mathbf{s}_t\|\) decreases 15% per 50 steps
  • Accumulated softmax entropy error: \(\sum_t \epsilon_t\) grows linearly

Alignment F1 < 0.6 for sequences exceeding 80 tokens. BLEU degradation: 0.15 per additional token.

Repetition From Attention Without Memory

Standard attention computes independently at each decoder step:

\[\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_k \exp(e_{tk})}\]

where \(e_{tj} = \text{score}(\mathbf{s}_{t-1}, \mathbf{h}_j)\).

No information about which source positions received attention previously. No record of how many times each position has been used. No tracking of what remains to translate.

Consider English to French translation:

Source: “The European Economic Community was founded in nineteen fifty seven”

Step 5: Translating “Communauté” (Community)

Attends to “Community”: \(\alpha_{5,3} = 0.7\)

Step 6: Translating “économique” (Economic)

Can attend to “Community” again: \(\alpha_{6,3} = 0.6\). No penalty for reuse.

Step 10: Generating next word

Might attend to “Community” yet again. Can completely ignore “fifty seven”.

Output: “…Communauté a été Communauté en…”

Missing year translation. Repeated “Communauté”.

Coverage Tracks Cumulative Attention

Track total attention to each source position:

\[\mathbf{c}^{cov}_t = \sum_{i=1}^{t-1} \boldsymbol{\alpha}_i\]

where \(c^{cov}_{tj}\) represents cumulative attention to position \(j\) through step \(t\).

Modify score function to penalize repeated attention:

\[e_{tj} = \text{score}(\mathbf{s}_{t-1}, \mathbf{h}_j) - \lambda c^{cov}_{tj}\]

Positions with high cumulative attention receive lower scores. Forces model to explore unattended positions.

Example progression:

Source positions: \([1, 2, 3, 4, 5]\)

Step 1: Coverage \([0, 0, 0, 0, 0]\), attention focuses position 3 (\(\alpha = 0.5\))

Step 2: Coverage \([0.1, 0.2, 0.5, 0.1, 0.1]\), position 3 penalized by \(-\lambda \times 0.5\), attention reduces to 0.3

Step 3: Coverage \([0.25, 0.45, 0.8, 0.25, 0.25]\), position 3 heavily penalized \(-\lambda \times 0.8\), attention drops to 0.1

Step 5: Coverage \([0.8, 0.95, 1.5, 0.4, 0.35]\), positions 1-3 saturated, forced to attend positions 4-5

All source positions eventually covered.

Empirical results (Tu et al., 2016):

Dataset Baseline +Coverage Improvement
WMT’14 En→De 19.8 20.7 +0.9 BLEU
WMT’14 En→Fr 34.2 35.1 +0.9 BLEU
Long sequences (>50 tokens) 16.3 18.1 +1.8 BLEU

Larger gains on longer sequences where repetition occurs more frequently. Reduces over-translation by 45%, under-translation by 30%. Computational overhead: 2-3% slower training.

Typical coverage penalty: \(\lambda \in [0.5, 2.0]\) depending on task.
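
A minimal sketch of one coverage-penalized attention step, assuming dot-product scores and a fixed decoder state (names and shapes are illustrative):

import torch
import torch.nn.functional as F

def coverage_attention_step(s_prev, H, cov, lam=1.0):
    """One attention step with a coverage penalty.

    s_prev: [h] decoder state, H: [T_s, h] encoder states,
    cov: [T_s] cumulative attention so far, lam: coverage weight (lambda).
    """
    scores = H @ s_prev - lam * cov        # e_tj = score(s, h_j) - lambda * coverage_j
    alpha = F.softmax(scores, dim=0)       # attention weights
    context = alpha @ H                    # context vector
    return context, alpha, cov + alpha     # updated coverage for the next step

torch.manual_seed(0)
H = torch.randn(5, 8)
s = torch.randn(8)
cov = torch.zeros(5)
for t in range(3):
    context, alpha, cov = coverage_attention_step(s, H, cov)
    print(f"step {t + 1}: max weight {alpha.max():.2f}, coverage {[round(x, 2) for x in cov.tolist()]}")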

Diagnosing Attention Failures

Attention entropy quantifies distribution sharpness:

\[H(\alpha_t) = -\sum_j \alpha_{tj} \log \alpha_{tj}\]

Expected range: \(H \in [1, \log(T_s) - 1]\)

  • \(H < 1\): Over-concentration
  • \(H > \log(T_s) - 1\): Too uniform

Monitor entropy distribution across training batches. Outliers indicate failures.

Maximum attention weight per row:

\[\max_j \alpha_{tj}\]

Expected range: [0.3, 0.7]

  • >0.9: Over-concentration

  • <0.2: Uniform distribution

Coverage uniformity:

\[\sum_t \alpha_{tj} \text{ for each } j\]

Expected: roughly even distribution across source positions. Large variance indicates some positions ignored, others over-attended.

Visualize attention matrices for failure cases. Patterns reveal specific failure modes immediately.
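
These diagnostics follow directly from the attention matrix; a sketch assuming weights of shape [T_t, T_s]:

import torch

def attention_diagnostics(A, eps=1e-9):
    """A: [T_t, T_s] attention weights, rows sum to 1."""
    entropy = -(A * (A + eps).log()).sum(dim=-1)   # H(alpha_t) per decoder step
    max_weight = A.max(dim=-1).values              # peak weight per decoder step
    coverage = A.sum(dim=0)                        # cumulative attention per source position
    return entropy, max_weight, coverage

T_t, T_s = 48, 50
A = torch.softmax(torch.randn(T_t, T_s), dim=-1)
entropy, max_w, coverage = attention_diagnostics(A)
print(f"mean entropy    {entropy.mean():.2f}  (log T_s = {torch.log(torch.tensor(float(T_s))):.2f})")
print(f"mean max weight {max_w.mean():.2f}")
print(f"coverage var    {coverage.var():.3f}")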

Benchmarks and Applications

WMT Translation Tasks and BLEU Metric

Workshop on Machine Translation (WMT):

Annual shared task for machine translation evaluation.

Standard benchmarks:

  • WMT’14 English→French: 36M training pairs
  • WMT’14 English→German: 4.5M training pairs
  • Test sets: ~3K held-out sentence pairs

Evaluation metric: BLEU (Bilingual Evaluation Understudy)

Measures n-gram precision between candidate and reference translations:

\[\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{4} w_n \log p_n\right)\]

where:

  • \(p_n\): precision of n-grams (unigrams through 4-grams)
  • \(w_n = 0.25\): uniform weights
  • \(BP\): brevity penalty for short outputs

Range: 0-100 (higher = better)

Example scores:

  • 50-60: High-quality translation
  • 30-40: Understandable, some errors
  • <20: Poor quality

Limitations: BLEU correlates with human judgment but misses semantic equivalence, fluency issues.
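A toy sentence-level sketch of the computation (real evaluation uses corpus-level counts and smoothing, e.g. via sacrebleu):

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU: clipped n-gram precision + brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        p_n = max(overlap / total, 1e-9)        # avoid log(0) on short toy inputs
        log_precisions.append(math.log(p_n))
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))  # brevity penalty
    return 100 * bp * math.exp(sum(log_precisions) / max_n)

ref = "the cat sat on the mat".split()
print(round(bleu("the cat sat on the mat".split(), ref), 1))   # 100.0
print(round(bleu("the cat on the mat".split(), ref), 1))       # low: no 4-gram match, heavily penalized here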

Attention vs Fixed Context: Translation Performance

Key findings:

  • En→Fr: Attention +5.2 BLEU (16% gain)
  • En→De: Attention +6.3 BLEU (45% gain)
  • Larger gains on morphologically richer language (German)
  • Performance gap widens beyond 50 tokens
  • Seq2seq BLEU drops 0.3-0.5 per 10 tokens after length 30

Why larger gains on En→De?

German requires long-range reordering (verb-final clauses), compound words spanning multiple source tokens, case agreement across phrases. Fixed context bottleneck cannot preserve this information. Attention accesses full source representation dynamically.

Ablation: Component Contributions

Method: Train full model, remove one component, measure BLEU degradation on WMT’14 En→Fr validation set.

Component impacts:

  • Full attention model: 36.7 BLEU
  • Remove scaled scores (no \(1/\sqrt{d_k}\)): 33.1 (-3.6)
  • Remove bidirectional encoder: 34.2 (-2.5)
  • Remove input feeding: 35.2 (-1.5)
  • Remove coverage mechanism: 35.9 (-0.8)
  • Fixed context (seq2seq): 31.5 (-5.2)

Scaled scores: -3.6 BLEU. Without \(1/\sqrt{d_k}\) scaling, dot products grow with dimension. For \(d=512\), unscaled scores reach ±50, softmax saturates to near one-hot. Gradient flow breaks.
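A quick numerical check of the saturation effect (illustrative: random unit-variance vectors stand in for queries and keys):

import numpy as np

rng = np.random.default_rng(0)
d = 512
q = rng.standard_normal(d)
K = rng.standard_normal((50, d))          # 50 keys

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

raw = K @ q                               # std ~ sqrt(d) ~ 22.6
scaled = raw / np.sqrt(d)                 # std ~ 1

print("raw score std:", raw.std().round(1))
print("unscaled max alpha:", softmax(raw).max().round(3))     # typically close to 1 (saturated)
print("scaled max alpha:", softmax(scaled).max().round(3))    # weight spread across many keys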

Bidirectional encoder: -2.5 BLEU. Unidirectional states lack future context. Ambiguous words (“bank” in “river bank” vs “savings bank”) cannot disambiguate. Alignment accuracy drops.

Input feeding: -1.5 BLEU. Decoder loses track of what was attended previously. Coherence across steps degrades. More repetition, missed source words.

Coverage mechanism: -0.8 BLEU. Effect concentrated on long sequences (>50 tokens): -1.8 BLEU on long subset. Negligible on short sequences.

Attention architecture: Removing all attention components (seq2seq baseline) loses 5.2 BLEU total, while the individual deltas sum to 8.4 BLEU. Effects are non-additive: interactions between components exist.

Attention Discovers Linguistic Structure Without Explicit Supervision

Model trained only on translation loss \((y, \hat{y})\). No alignment labels, no parse trees, no linguistic annotations.

Attention weights \(\alpha_{tj}\) emerge as implicit alignment between source and target positions.

Analysis methodology:

Examine trained attention weights on test set. For each target word \(y_t\), inspect which source positions receive high attention weight \(\alpha_{tj} > 0.3\).

Compare discovered alignments to:

  • Gold alignments (when available, rare)
  • Linguistic phenomena (reordering, compounds, idioms)
  • Failure patterns

Three categories of learned patterns:

  1. Monotonic alignment: Similar word order (En→Fr typically diagonal)
  2. Syntactic reordering: Different grammatical structure (SVO→SOV)
  3. Phrasal mapping: Multiple source words→single target (compounds, idioms)

Monotonic Alignment: Similar Word Order Languages

English → French translation:

Both languages follow Subject-Verb-Object order. Adjectives differ (French often post-nominal) but core structure similar.

Attention pattern: Strong diagonal

Source: “The quick brown fox jumps over the lazy dog”

Target: “Le rapide renard brun saute sur le chien paresseux”

Attention discovers:

  • “Le” ← “The” (articles align)
  • “renard” ← “fox” (nouns align)
  • “saute” ← “jumps” (verbs align)
  • “paresseux” ← “lazy” (adjectives align despite position shift)

Attention matrix shows clear diagonal with small local deviations for adjective-noun reordering.

Quantification:

Mean absolute diagonal offset: 1.2 positions

85% of attention weight within ±3 positions of diagonal

Interpretation: For languages with similar syntax, attention learns near-identity mapping with local adjustments for grammatical differences.

Syntactic Reordering: Verb-Final Languages

English (SVO) → German (SOV in subordinate clauses):

Source: “I know that she will come tomorrow”

Target: “Ich weiß dass sie morgen kommen wird”

Verb moves from middle to end. Attention must handle 5-position jump.

Analysis:

Attention discovers:

  • Monotonic alignment for subject/verb clause: “I know that she”
  • Long-range dependency: “morgen” (position 4) ← “tomorrow” (position 6)
  • Verb splitting: “kommen wird” ← “will come” (reversed order)

Without attention, fixed context at end of encoding cannot preserve verb identity after 5 more tokens processed. Attention accesses “will” and “come” directly regardless of distance.

Quantification on WMT’14 En→De test set:

  • 47% of sentences contain verb-final clauses
  • Mean verb displacement: 4.2 positions
  • Attention correctly identifies verb in 89% of cases
  • Seq2seq (fixed context) correct in 31% of cases

Compound Words and Multi-Word Expressions

German compound nouns: Single target word maps to multiple source words.

Source: “European Economic Area”

Target: “Wirtschaftsgebiet” (literally “economy-area”)

Attention distributes weight across source phrase:

  • α(“Wirtschaftsgebiet”, “Economic”) = 0.45
  • α(“Wirtschaftsgebiet”, “Area”) = 0.42
  • α(“Wirtschaftsgebiet”, “European”) = 0.10

French multi-word expressions: Multiple target words map to single source idiom.

Source: “kick the bucket” (idiom for “die”)

Target: “il est mort” (literally “he is dead”)

Attention shows diffuse pattern:

  • All three French words attend to all three English words
  • No one-to-one correspondence
  • Model recognizes non-compositional meaning

Implications:

Attention automatically handles:

  • Variable source-target cardinality (N→1, 1→N, N→M)
  • Compositional vs non-compositional semantics
  • Phrase-level alignment without explicit phrase table

Fixed context bottleneck loses phrase structure—all information compressed equally.

What Attention Cannot Learn: Failure Cases

Novel/rare words:

Source: “The chrysanthemum bloomed”

Target: “Le chrysanthème a fleuri”

If “chrysanthemum” appears <5 times in training, attention weights diffuse. Cannot learn specific alignment without exposure.

Result: Generic flower translation or hallucination.

Highly idiomatic expressions:

Source: “It’s raining cats and dogs”

Literal German: “Es regnet Katzen und Hunde” (incorrect)

Correct: “Es regnet in Strömen” (“in streams”)

Attention often produces word-by-word translation without cultural context for idioms.

Very long sequences (>100 tokens):

Attention matrix \(T_s \times T_t\) grows quadratically. For \(T_s = T_t = 100\): 10K attention weights per sequence (100 per decoder step).

Optimization difficulty: More parameters to learn, noisier gradient signal.

Empirical degradation: -0.3 BLEU per 10 tokens beyond 80-token threshold.

Quantified limitations:

  • Words with frequency <10: Mean BLEU 22 (poor)
  • Words with frequency 10-100: Mean BLEU 35 (moderate)
  • Words with frequency >100: Mean BLEU 42 (good)
  • Sequences >80 tokens: -3 BLEU vs 40-token baseline
  • Pure idioms: 41% translated literally (incorrect)

Why these failures matter:

Understanding limitations guides architecture improvements. Rare words → subword tokenization (BPE, WordPiece). Long sequences → sparse attention, local windows. Idioms → retrieval-augmented generation.

Attention Mechanism Generalizes Beyond Translation

Core abstraction:

Attention = soft selection over representations.

No assumption about representation structure:

  • Sequential (translation)
  • Spatial (images)
  • Graph (molecules, code)
  • Set (unordered)

General formulation:

Given:

  • Query \(\mathbf{q} \in \mathbb{R}^{d_q}\) (decoder state)
  • Keys \(\{\mathbf{k}_1, ..., \mathbf{k}_n\}\) (encoder representations)
  • Values \(\{\mathbf{v}_1, ..., \mathbf{v}_n\}\) (content to aggregate)

Compute: \[\alpha_i = \frac{\exp(\text{score}(\mathbf{q}, \mathbf{k}_i))}{\sum_j \exp(\text{score}(\mathbf{q}, \mathbf{k}_j))}\]

\[\text{context} = \sum_{i=1}^n \alpha_i \mathbf{v}_i\]
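The formulation maps directly to a few lines of code. A sketch using a dot-product score (any score function from earlier slides substitutes):

import numpy as np

def attention(q, K, V):
    """q: (d,), K: (n, d), V: (n, d_v) -> context: (d_v,), alpha: (n,)"""
    scores = K @ q                       # score(q, k_i), here dot product
    scores = scores - scores.max()       # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ V, alpha

K = np.random.randn(7, 16)               # 7 keys of dimension 16
V = np.random.randn(7, 32)               # matching values of dimension 32
context, alpha = attention(np.random.randn(16), K, V)
print(context.shape, alpha.sum().round(3))   # (32,) 1.0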

What changes across domains:

  • Translation: source = text tokens; keys/values = LSTM hidden states; score = additive/dot
  • Captioning: source = image regions; keys/values = CNN feature map; score = spatial dot
  • Speech: source = audio frames; keys/values = spectrogram features; score = location-aware
  • Summarization: source = document sentences; keys/values = sentence embeddings; score = hierarchical

Same mechanism, different representations. Next: Image captioning applies attention to CNN feature maps.

Spatial Attention Distributes Over 2D Image Regions During Caption Generation

CNN extracts feature map \(\mathbf{F} \in \mathbb{R}^{H \times W \times D}\) where each spatial position \((i,j)\) encodes local image region.

Flatten to sequence: \(L = H \times W\) positions, each a \(D\)-dimensional vector \(\mathbf{v}_k\).

Decoder LSTM generates caption tokens. At each step \(t\):

\[\alpha_{tk} = \frac{\exp(\mathbf{s}_t^\top \mathbf{W}_a \mathbf{v}_k)}{\sum_{j=1}^{L} \exp(\mathbf{s}_t^\top \mathbf{W}_a \mathbf{v}_j)}\]

\[\mathbf{c}_t = \sum_{k=1}^{L} \alpha_{tk} \mathbf{v}_k\]

Attention weights \(\alpha_{tk}\) exhibit spatial concentration: \(\alpha_{tk} > 0.1\) typically spans 3-8% of image area (2-6 adjacent spatial locations out of 49 total in \(7 \times 7\) grid).
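A sketch of the spatial case: flatten the feature map to \(L = H \times W\) vectors and reuse the same weighted sum (shapes and the random \(\mathbf{W}_a\) are illustrative):

import numpy as np

H, W, D, d_s = 7, 7, 512, 256            # feature map and decoder state sizes
F = np.random.randn(H, W, D)             # CNN feature map
V = F.reshape(-1, D)                     # flatten to L = 49 spatial vectors
s_t = np.random.randn(d_s)               # decoder LSTM state at step t
W_a = np.random.randn(d_s, D) * 0.01     # bilinear score parameters (illustrative)

scores = V @ (W_a.T @ s_t)               # s_t^T W_a v_k for every position k
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()
c_t = alpha @ V                          # attended visual context, shape (D,)
print(alpha.reshape(H, W).round(2))      # spatial attention map over the 7x7 grid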

MS COCO benchmark (5K test images):

Baseline (mean pooling): 25.0 BLEU-4, 23.0 CIDEr

Spatial attention: 28.4 BLEU-4, 27.2 CIDEr

Difference: +3.4 BLEU-4 (+14%), +4.2 CIDEr (+18%)

Attention weights correlate with human gaze patterns: IoU = 0.68 on regions fixated during first 500ms. No explicit region annotations during training.

Pyramidal Encoder Achieves 8× Temporal Compression for Speech Recognition

Audio spectrograms at 100 fps: 10-second utterance = 1000 frames.

Transcript: ~30 characters.

Direct attention over 1000 frames: \(30 \times 1000 = 30\)K attention weights per sequence. Memory: 120KB per sample (fp32).

Pyramidal Bi-LSTM encoder:

Each layer concatenates adjacent timesteps, halves temporal resolution:

  • Layer 1: 1000 frames → 500 frames (2× reduction)
  • Layer 2: 500 frames → 250 frames (4× total)
  • Layer 3: 250 frames → 125 frames (8× total)

Attention on 125-frame representation: \(30 \times 125 = 3.75\)K weights (8× reduction). Memory: 15KB per sample.
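The temporal reduction is a reshape that concatenates adjacent frames before each layer. A sketch of the reduction alone (BiLSTM processing between reductions omitted):

import numpy as np

def pyramid_reduce(frames):
    """Concatenate adjacent timesteps: (T, d) -> (T//2, 2d)."""
    T, d = frames.shape
    frames = frames[: T - (T % 2)]        # drop odd trailing frame if present
    return frames.reshape(-1, 2 * d)

x = np.random.randn(1000, 40)             # 10 s of 40-dim filterbank features at 100 fps
for layer in range(3):                     # three pyramid layers -> 8x reduction
    x = pyramid_reduce(x)                  # a BiLSTM would process x between reductions
    print("after layer", layer + 1, x.shape)
# after layer 3: (125, 320)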

Google Voice Search task (>100K utterances):

GMM-HMM baseline: 23.6% WER

Deep Speech (CTC, no attention): 16.5% WER

Listen, Attend, Spell (pyramidal): 14.1% WER

Pyramidal encoder: -1.5% WER absolute (-9% relative) vs flat encoder at same parameter count. Computational cost identical. Gradient paths shortened 8× through temporal compression.

Pointer-Generator Network Copies Rare Words via Attention-Weighted Source Distribution

Abstractive summarization with fixed vocabulary (~50K words) fails on rare words: proper nouns, numbers, technical terms.

Standard decoder substitutes generic alternatives or UNK token:

  • Source: “CEO Satya Nadella announced revenue of $52.7B”
  • Output: “CEO Smith announced revenue of large amount”

Pointer-generator mechanism:

Generation probability at decoder step \(t\):

\[p_{\text{gen}} = \sigma(\mathbf{w}_c^\top \mathbf{c}_t + \mathbf{w}_s^\top \mathbf{s}_t + \mathbf{w}_y^\top \mathbf{y}_{t-1} + b)\]

Final distribution over extended vocabulary (original vocab ∪ source words):

\[P(w) = p_{\text{gen}} \cdot P_{\text{vocab}}(w) + (1 - p_{\text{gen}}) \sum_{i:w_i=w} \alpha_{ti}\]

Copy probability for word \(w\) sums attention weights \(\alpha_{ti}\) at all source positions where \(w_i = w\).
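A sketch of assembling the extended-vocabulary distribution (\(p_{\text{gen}}\) taken as given; the ids are illustrative):

import numpy as np

def pointer_generator(p_vocab, alpha, src_ids, p_gen, extended_size):
    """Mix generation and copy distributions over the extended vocabulary.
    p_vocab : (V,) softmax over the fixed vocabulary
    alpha   : (T_s,) attention weights at this decoder step
    src_ids : (T_s,) extended-vocabulary id of each source token
    """
    P = np.zeros(extended_size)
    P[: len(p_vocab)] = p_gen * p_vocab
    np.add.at(P, src_ids, (1 - p_gen) * alpha)   # sum copy mass over repeated words
    return P

p_vocab = np.full(10, 0.1)                        # tiny vocab of 10 types
alpha = np.array([0.1, 0.6, 0.2, 0.1])
src_ids = np.array([3, 11, 3, 5])                 # id 11 = OOV source word (e.g. a name)
P = pointer_generator(p_vocab, alpha, src_ids, p_gen=0.7, extended_size=12)
print(P.round(3), P.sum().round(3))               # copies 0.18 to id 11; sums to 1.0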

CNN/DailyMail dataset (287K article-summary pairs):

Abstractive baseline (seq2seq): 31.3 ROUGE-1, 11.8 ROUGE-2

Pointer-generator: 36.4 ROUGE-1, 15.7 ROUGE-2

Coverage mechanism added: 39.5 ROUGE-1, 17.3 ROUGE-2

Pointer mechanism: +5.1 ROUGE-1 over baseline. Coverage: +3.1 ROUGE-1.

Copy accuracy on rare proper nouns: 94% vs 31% baseline. Numbers: 89% vs 12%.

Bi-Directional Attention Flow Computes Document↔Question Similarity Matrix

Standard attention: unidirectional query from decoder to encoder.

Reading comprehension requires bidirectional interaction between document and question.

Bi-Directional Attention Flow (BiDAF):

Document \(D = \{\mathbf{h}_1, ..., \mathbf{h}_m\}\), question \(Q = \{\mathbf{u}_1, ..., \mathbf{u}_n\}\)

Similarity matrix: \(\mathbf{S}_{ij} = \mathbf{w}^\top [\mathbf{h}_i; \mathbf{u}_j; \mathbf{h}_i \circ \mathbf{u}_j]\)

Context-to-Query (C2Q): For each document word, which question words most relevant?

\[\tilde{\mathbf{u}}_i = \sum_j \frac{\exp(\mathbf{S}_{ij})}{\sum_k \exp(\mathbf{S}_{ik})} \mathbf{u}_j\]

Query-to-Context (Q2C): Which document words most relevant to entire question?

\[\mathbf{b}_i = \max_j \mathbf{S}_{ij}, \quad \tilde{\mathbf{h}} = \sum_i \frac{\exp(\mathbf{b}_i)}{\sum_k \exp(\mathbf{b}_k)} \mathbf{h}_i\]

Combined representation: \(\mathbf{g}_i = [\mathbf{h}_i; \tilde{\mathbf{u}}_i; \mathbf{h}_i \circ \tilde{\mathbf{u}}_i; \mathbf{h}_i \circ \tilde{\mathbf{h}}]\)

Output layer predicts start and end positions of answer span.
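A sketch of the two attention directions given a similarity matrix (a plain dot product stands in for the learned \(\mathbf{w}^\top[\mathbf{h}; \mathbf{u}; \mathbf{h} \circ \mathbf{u}]\) score; output layers omitted):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

m, n, d = 120, 12, 100                    # document length, question length, hidden size
H = np.random.randn(m, d)                 # document encodings h_i
U = np.random.randn(n, d)                 # question encodings u_j
S = H @ U.T                               # stand-in for the learned similarity, shape (m, n)

U_tilde = softmax(S, axis=1) @ U          # C2Q: per-document-word question summary, (m, d)
b = softmax(S.max(axis=1))                # Q2C: attend over document words, (m,)
h_tilde = b @ H                           # single attended document vector, (d,)

G = np.concatenate([H, U_tilde, H * U_tilde, H * h_tilde], axis=1)   # (m, 4d)
print(G.shape)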

SQuAD v1.1 benchmark (100K question-answer pairs):

Match-LSTM (single direction): 64.7% EM, 73.7% F1

BiDAF (bi-directional): 68.0% EM, 77.3% F1

BiDAF + character embeddings: 73.3% EM, 81.1% F1

Bi-directional flow: +3.3% EM, +3.6% F1. Character embeddings (handle OOV): +5.3% EM.

EM (exact match): predicted answer = gold answer exactly. F1: token-level overlap.

Extensions and Interpretability

Single-Hop Attention Computes One Weighted Average Over Encoder States

Standard attention computes single weighted sum:

\[\mathbf{c}_t = \sum_{j=1}^{T_s} \alpha_{tj} \mathbf{h}_j\]

One lookup operation per decoder step.

Example question answering:

Context: “John went to the store. The store was closed. John went home.”

Question: “Where did John go after the store was closed?”

Two-step reasoning required:

  1. Locate “store was closed”
  2. Find subsequent event → “went home”

A single attention step must address both facts simultaneously, so its weights spread across multiple positions: entropy H(α) = 2.3 bits (close to uniform over 8 tokens) vs H(α) = 0.8 bits for focused attention.

Multi-hop attention performs sequential lookups, each conditioned on previous results.

Multi-Hop Attention Chains Sequential Lookups Through Same Memory

Stack \(K\) attention operations, each querying same memory with refined context:

\[\mathbf{m}_1 = \text{Attention}(\mathbf{q}, \mathbf{M})\] \[\mathbf{m}_2 = \text{Attention}(\mathbf{m}_1, \mathbf{M})\] \[\vdots\] \[\mathbf{m}_K = \text{Attention}(\mathbf{m}_{K-1}, \mathbf{M})\]

Memory \(\mathbf{M} = [\mathbf{h}_1, ..., \mathbf{h}_{T_s}]\) remains fixed (encoder states).

Hop 1: Initial query \(\mathbf{q}\) retrieves \(\mathbf{m}_1\)

Hop 2: Use \(\mathbf{m}_1\) as query, retrieves \(\mathbf{m}_2\)

Hop \(K\): Final representation \(\mathbf{m}_K\) feeds output layer

Each hop conditions on previous result, refining attention focus.
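A minimal sketch of \(K\) chained hops over a fixed memory (dot-product scores; memory networks additionally use separate input/output embeddings per hop):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hop(query, M):
    """One attention lookup: query (d,), memory M (T_s, d)."""
    alpha = softmax(M @ query)
    return alpha @ M

M = np.random.randn(20, 64)               # fixed memory: 20 encoder states
q = np.random.randn(64)                   # initial query
for k in range(3):                        # K = 3 hops, fixed at design time
    q = hop(q, M)                         # each hop's output becomes the next query
print(q.shape)                            # (64,) final representation m_K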

How is \(K\) determined?

\(K\) = fixed hyperparameter set during architecture design, not learned.

Typical values:

  • Reading comprehension (bAbI tasks): \(K = 2\) or \(K = 3\)
  • Visual question answering: \(K = 2\)
  • Complex reasoning tasks: \(K = 4\) to \(K = 6\)

Trade-off:

  • Larger \(K\): More reasoning capacity, \(O(K \times T_s \times d)\) computation
  • Smaller \(K\): Faster but may miss multi-step dependencies

No mechanism to adaptively choose \(K\) per example. All examples use same fixed hop count.

Example: Memory Networks (Weston et al., 2015)

Question answering over stories. Each hop retrieves relevant sentence, next hop uses that context to retrieve next relevant sentence.

Complexity: \(O(K \times T_s \times d)\) for \(K\) hops. Linear in hop count.

bAbI QA tasks:

  • 1 hop: 72% accuracy
  • 2 hops: 86% accuracy (+14%)
  • 3 hops: 93% accuracy (+7%)
  • 4+ hops: 93% (saturated)

Sparse Attention Reduces O(T²) Complexity by Restricting Connectivity

Full attention: \(O(T^2)\) memory and computation.

\(T = 1000\): 1M weights per layer. 12 layers: 12M weights total.

GPU memory (16GB): limits sequence length to ~2048 tokens at batch size 8.

Sparse attention: Each position attends to subset of positions, not all \(T\) positions.

Local window (radius \(r\)):

\[\alpha_{ij} = \begin{cases} \text{softmax}(e_{ij}) & \text{if } |i - j| \leq r \\ 0 & \text{otherwise} \end{cases}\]

Complexity: \(O(T \times r)\) instead of \(O(T^2)\)

Example: \(r = 5\) → each position attends to 11 neighbors (self + 5 left + 5 right).

Strided (every \(k\)-th position):

\[\alpha_{ij} \neq 0 \text{ only if } j \equiv 0 \pmod{k}\]

Complexity: \(O(T \times T/k)\)

Example: \(k = 8\) → attend to T/8 positions.

When to use strided:

Captures long-range dependencies that local window misses. Position 100 can attend to positions 0, 8, 16, …, 96 (distance up to 100), while local window (\(r=5\)) only sees positions 95-105.

Use case: Document-level coherence, paragraph structure, long-range coreference.

Limitation: Misses nearby context. Fixed stride may miss relevant intermediate positions.

Combined: Local + strided in different attention heads.

Local head: nearby context (syntax, adjacent words)

Strided head: long-range dependencies (discourse, document structure)
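A sketch of the two connectivity patterns as boolean masks (dense masks shown for clarity; efficient implementations never materialize the full \(T \times T\) matrix):

import numpy as np

def local_mask(T, r):
    idx = np.arange(T)
    return np.abs(idx[:, None] - idx[None, :]) <= r        # |i - j| <= r

def strided_mask(T, k):
    j = np.arange(T)
    return np.tile((j % k) == 0, (T, 1))                   # every row attends to every k-th column

T = 16
combined = local_mask(T, r=2) | strided_mask(T, k=4)       # union used across heads
print(local_mask(T, 2).sum(axis=1))     # 5 neighbors per row (away from edges)
print(combined.mean().round(2))         # fraction of positions kept vs full T^2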

WMT’14 En→De translation with sparse patterns:

Full attention (\(T = 512\)): 27.3 BLEU, 2048 MB memory

Local (\(r = 5\), complexity \(O(10T)\)): 25.3 BLEU (-2.0), 410 MB memory

Strided (\(k = 8\), complexity \(O(T^2/8)\)): 26.3 BLEU (-1.0), 256 MB memory

Combined (local + strided, \(O(15T)\)): 26.8 BLEU (-0.5), 615 MB memory

Combined pattern: 3.3× memory reduction (2048 MB → 615 MB) for -0.5 BLEU. Enables 4× longer sequences at same memory budget.

Attention Weights α Explicitly Score Input-Output Correspondence

Hidden layer activations in standard networks: distributed representations without explicit alignment structure.

Attention weights \(\alpha_{tj}\) assign explicit scores to each source position for every target position.

Heatmap visualization: rows = target positions, columns = source positions, intensity = \(\alpha_{tj}\).

High weight (\(\alpha_{tj} > 0.5\)): strong correspondence between source \(j\) and target \(t\).

Applications:

  • Debugging: Attention heatmap identifies misalignments (model attends to wrong source word)
  • Trust: Users see which input tokens received high weights during prediction
  • Domain validation: Medical experts verify model assigns high weight to relevant symptoms
  • Error analysis: Systematic comparison of attention patterns between correct and incorrect predictions

Attention Weights Do Not Guarantee Causal Importance

Assumption: \(\alpha_{tj} = 0.8\) → token \(j\) causally important for output \(t\).

Empirical test (Jain & Wallace, 2019): Correlation between attention weights and gradient-based importance.

Sentiment classification experiment:

  • Input: “This movie was not boring at all”
  • Prediction: Positive (0.92 probability)
  • Attention weights: “not” (0.05), “boring” (0.40), “at all” (0.08)

Counterfactual test:

  • Remove “boring”: Prediction remains positive (0.89)
  • Remove “not”: Prediction flips negative (0.78)

Attention weight 0.40 on “boring” contradicts low causal importance. Attention weight 0.05 on “not” contradicts high causal importance.

Mechanism behind mismatch:

Attention computes context, then output layer transforms:

\[\mathbf{c}_t = \sum_j \alpha_{tj} \mathbf{h}_j\] \[\mathbf{y}_t = \text{softmax}(W \mathbf{c}_t + b)\]

Matrix \(W\) re-weights features from context \(\mathbf{c}_t\). High-attention features down-weighted by small \(W\) entries. Low-attention features amplified by large \(W\) entries.

Attention indicates retrieval weights, not decision weights.

Attention weights serve as hypothesis generator for inspection. Validate causal importance via ablation or gradient methods.

Constraining Attention Patterns via Auxiliary Loss or Hard Masks

Attention weights \(\alpha_{tj}\) differentiable → amenable to supervision or constraints.

Supervised attention:

Add alignment supervision loss:

\[\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \lambda \mathcal{L}_{\text{attention}}\]

\[\mathcal{L}_{\text{attention}} = \sum_{t,j} (\alpha_{tj} - \alpha^*_{tj})^2\]

\(\alpha^*_{tj}\) from gold alignments (manual annotation or statistical aligner).

Hard monotonic constraint:

Speech recognition requires forward-only attention. Mask prevents backward attention:

\[\alpha_{t,j} = 0 \quad \text{if } j < j_{t-1}\]

Position \(j_{t-1}\): highest-weight position at step \(t-1\). Forces \(j_t \geq j_{t-1}\).

Attention prior (soft constraint):

Bias scores toward expected diagonal alignment:

\[e_{tj} = \text{score}(\mathbf{s}_{t-1}, \mathbf{h}_j) + \text{prior}(t, j)\]

Gaussian prior centered on diagonal: \[\text{prior}(t, j) = \exp\left(-\frac{(j - t \cdot T_s/T_t)^2}{2\sigma^2}\right)\]
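A sketch of adding the soft diagonal prior to raw scores (σ and the score scale are illustrative):

import numpy as np

def diagonal_prior(T_t, T_s, sigma=2.0):
    """prior(t, j): Gaussian centered on the expected diagonal position."""
    t = np.arange(T_t)[:, None]
    j = np.arange(T_s)[None, :]
    center = t * T_s / T_t
    return np.exp(-((j - center) ** 2) / (2 * sigma ** 2))

scores = np.random.randn(10, 25)              # raw score(s_{t-1}, h_j) for all t, j
biased = scores + diagonal_prior(10, 25)      # soft pull toward monotonic alignment
alpha = np.exp(biased - biased.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)
print(alpha.argmax(axis=1))                   # peaks biased toward the diagonal (strength depends on score scale)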

Applications:

  • Speech recognition: Monotonic constraint prevents repetition/skipping errors
  • Low-resource translation: Dictionary-based prior biases attention toward known word pairs
  • Domain adaptation: Initialize attention patterns from source task

Constraints improve sample efficiency at cost of model flexibility.

Monotonic speech recognition:

  • Unconstrained: 18% WER, 500 hours training data
  • Monotonic constraint: 16% WER, 200 hours training data (-2% WER, -60% data)

Attention Implements Soft Content-Based Memory Addressing

Traditional memory: Hard indexing by fixed address

memory = [value_0, value_1, value_2, ..., value_n]
output = memory[address]  # output = value_address

Attention: Soft addressing by content similarity

Memory \(\mathbf{M} = [\mathbf{h}_1, ..., \mathbf{h}_{T_s}]\) (encoder states)

Query \(\mathbf{q}\) specifies retrieval pattern by content

Output weighted combination:

\[\text{output} = \sum_{j=1}^{T_s} \alpha_j \mathbf{h}_j\]

\[\alpha_j = \text{softmax}(\text{similarity}(\mathbf{q}, \mathbf{h}_j))\]

Soft weights \(\alpha_j\) differentiable → gradients flow through retrieval operation.

Differentiable: Gradients backpropagate through \(\alpha_j\).

Robust: Query errors degrade gracefully (distribute weight across neighbors).

Content-based: Similarity function retrieves by semantic match, not position.

Learnable: Query generation and similarity function trained end-to-end.

Neural Turing Machine Adds Differentiable Write Operations to Memory

NTM (Graves et al., 2014) extends read-only attention to read/write external memory.

Architecture:

  • Controller: RNN that generates queries
  • Memory: Matrix \(\mathbf{M} \in \mathbb{R}^{N \times d}\) (\(N\) memory slots, \(d\) dimensions each)
  • Read head: Attention mechanism retrieves from memory
  • Write head: Attention-weighted update to memory

Read operation:

\[\mathbf{r}_t = \sum_{i=1}^{N} \alpha^r_{ti} \mathbf{M}_{i}\]

where \(\alpha^r_{ti} = \text{softmax}(\text{similarity}(\mathbf{k}^r_t, \mathbf{M}_i))\)

\(\mathbf{k}^r_t\): read key from controller

Write operation:

\[\mathbf{M}_i \leftarrow \mathbf{M}_i \cdot (1 - \alpha^w_{ti} \mathbf{e}_t) + \alpha^w_{ti} \mathbf{a}_t\]

\(\alpha^w_{ti}\): write attention weights

\(\mathbf{e}_t\): erase vector (what to remove)

\(\mathbf{a}_t\): add vector (what to write)
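A sketch of one content-addressed read and write (cosine-similarity addressing only; the full NTM also interpolates, shifts, and sharpens the weights):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def content_weights(key, M):
    """Cosine-similarity addressing over N memory slots, M: (N, d)."""
    sims = M @ key / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + 1e-8)
    return softmax(sims)

N, d = 8, 16
M = np.random.randn(N, d)

w_r = content_weights(np.random.randn(d), M)
r = w_r @ M                                           # read vector

w_w = content_weights(np.random.randn(d), M)
erase, add = np.random.rand(d), np.random.randn(d)    # e_t in [0,1]^d, a_t
M = M * (1 - w_w[:, None] * erase[None, :]) + w_w[:, None] * add[None, :]
print(r.shape, M.shape)                               # (16,) (8, 16)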

NTM learns algorithmic tasks from input-output examples:

Copy: [3, 1, 4, 1, 5] → [3, 1, 4, 1, 5]

Reverse: [3, 1, 4, 1, 5] → [5, 1, 4, 1, 3]

Sort: [3, 1, 4, 1, 5] → [1, 1, 3, 4, 5]

Associative recall: (key, value) pairs → retrieve value given key

No explicit algorithm implementation. Attention patterns encode algorithm.

Copy task execution:

Phase 1 (write): Sequential sharp attention writes input tokens to memory slots 1→N.

Phase 2 (read): Sequential sharp attention reads from same slots 1→N in order.

Limitations:

Training unstable: many local minima in attention weight space.

Poor scaling: N > 100 memory slots → optimization difficulty.

Sharpness critical: soft weights \(\alpha_{ti} \approx 0.2\) across 5 slots → read/write interference.

Memory Networks (Weston et al., 2015): Fixed memory (no writes), multi-hop attention. Better QA performance, easier training.

Key-Value Separation and Retrieval-Augmented Generation

Modern architectures separate keys (similarity) from values (content):

\[\mathbf{c} = \sum_j \text{softmax}(\mathbf{q}^\top \mathbf{k}_j) \mathbf{v}_j\]

Keys \(\{\mathbf{k}_j\}\): Compute attention scores via \(\mathbf{q}^\top \mathbf{k}_j\)

Values \(\{\mathbf{v}_j\}\): Retrieved content in weighted sum

Query \(\mathbf{q}\): Specifies retrieval pattern

Seq2seq attention uses tied keys/values: \(\mathbf{k}_j = \mathbf{v}_j = \mathbf{h}_j\).

Separation enables asymmetry: key dimensionality optimized for similarity, value dimensionality for information content.

Retrieval-augmented generation:

External memory: database, knowledge base, or web corpus (millions to billions of documents).

Query embedding \(\mathbf{q}\) retrieves top-K documents:

\[\text{docs} = \text{TopK}(\text{similarity}(\mathbf{q}, \text{all documents}))\]

Retrieved documents concatenated with input, fed to language model.
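A sketch of exact top-K retrieval by inner product (ANN libraries such as FAISS or ScaNN approximate this search in sub-linear time):

import numpy as np

def retrieve_top_k(q, doc_embeddings, k=3):
    """Exact top-K by inner-product similarity."""
    scores = doc_embeddings @ q
    top = np.argsort(-scores)[:k]
    return top, scores[top]

docs = np.random.randn(100_000, 256)       # toy corpus of 100K document embeddings
q = np.random.randn(256)                   # query embedding
ids, scores = retrieve_top_k(q, docs, k=3)
print(ids, scores.round(2))                # indices of documents to prepend to the prompt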

Parametric knowledge (weights) + non-parametric knowledge (retrieved text).

External memory scales to billions of documents (Wikipedia: 6M articles, Common Crawl: 250B pages).

Similarity search via approximate nearest neighbors (FAISS, ScaNN): sub-linear time \(O(\log N)\) instead of \(O(N)\).

Modern systems: GPT-4 + web search, Claude + retrieval, medical QA with PubMed (35M abstracts).

Attention Solves Bottleneck But RNN Sequential Dependency Persists

Attention eliminates fixed context bottleneck via dynamic access to all encoder states.

Recurrent encoder/decoder still sequential:

\[\mathbf{h}_t = f_{\text{enc}}(\mathbf{h}_{t-1}, \mathbf{x}_t; \theta)\] \[\mathbf{s}_t = f_{\text{dec}}(\mathbf{s}_{t-1}, \mathbf{y}_{t-1}, \mathbf{c}_t; \phi)\]

Sequential dependency limits:

Parallelization: timestep \(t\) cannot compute until \(t-1\) completes.

Gradient paths: BPTT through \(T\) steps still exhibits exponential decay.

Training speed: RNN computation dominates total time (98% of 249ms per batch).

Quadratic attention complexity:

Attention matrix: \(O(T_s \times T_t)\) memory

Translation (\(T_s = T_t = 50\)): 2,500 weights per sequence (manageable)

Document summarization (\(T_s = 1000\)): 1M weights (approaching limits)

Speech recognition (\(T_s = 10000\), 1 minute audio): 100M weights (exceeds 16GB GPU)

Quantified limitations:

Training time (50-token sequences, batch 32):

  • RNN encoding: 100ms
  • Attention: 5ms (20× faster)
  • RNN decoding: 144ms
  • Bottleneck: RNN takes 98% of time

Memory (single sample, no batching):

  • 512 tokens: 1MB (translation - OK)
  • 2048 tokens: 16MB (documents - manageable)
  • 8192 tokens: 256MB (books - problematic)
  • 65536 tokens: 16GB (impossible on single GPU)

RNN Provides Implicit Position Encoding; Pure Attention Requires Explicit Encoding

RNN state \(\mathbf{h}_t\) implicitly encodes position \(t\) via sequential processing:

\[\mathbf{h}_t = f(\mathbf{h}_{t-1}, \mathbf{x}_t)\]

State \(\mathbf{h}_t\) accumulated information from positions 1 through \(t\).

Attention weighted sum permutation-invariant:

\[\mathbf{c} = \sum_j \alpha_j \mathbf{h}_j\]

Without positional information in \(\mathbf{h}_j\), sum operation identical for any permutation of inputs.

Current seq2seq architecture:

RNN encoder processes sequentially → embeds position in \(\mathbf{h}_j\) → attention inherits positional information.

RNN-free architecture problem:

Replacing RNN with non-sequential encoder (e.g., stacked attention) loses position information.

“dog bites man” vs “man bites dog” produce identical bag-of-embeddings representations.

Explicit position encoding approaches:

Learned embeddings: \[\mathbf{x}'_t = \mathbf{x}_t + \mathbf{p}_t\] \(\mathbf{p}_t \in \mathbb{R}^d\) trained as parameters, one per position.

Sinusoidal (fixed): \[\text{PE}(t, 2i) = \sin(t / 10000^{2i/d})\] \[\text{PE}(t, 2i+1) = \cos(t / 10000^{2i/d})\] Deterministic encoding, generalizes to unseen lengths.

Relative encoding: Encode offset \(i - j\) instead of absolute positions \(i, j\). Captures relational information.

Without position encoding, “the cat sat” and “sat cat the” indistinguishable to pure attention.
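The sinusoidal variant in a few lines (matching the formulas above):

import numpy as np

def sinusoidal_encoding(T, d):
    """PE[t, 2i] = sin(t / 10000^(2i/d)), PE[t, 2i+1] = cos(...)."""
    t = np.arange(T)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = t / (10000 ** (2 * i / d))
    PE = np.zeros((T, d))
    PE[:, 0::2] = np.sin(angles)
    PE[:, 1::2] = np.cos(angles)
    return PE

X = np.random.randn(12, 64)                # 12 token embeddings of dimension 64
X_pos = X + sinusoidal_encoding(12, 64)    # x'_t = x_t + p_t
print(X_pos.shape)                         # (12, 64) -- order now distinguishable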

Attention Learned from Translation Loss May Not Align Linguistically

Attention weights \(\alpha_{tj}\) optimized for translation loss, not alignment quality.

Translation-optimal attention may differ from linguistically-interpretable alignment.

Idiom example:

Linguistically correct: “kicked the bucket” → “died” (entire phrase maps to single word)

Model attention for “died”: “kicked” (0.15), “the” (0.40), “bucket” (0.20)

Translation correct but alignment non-interpretable (high weight on “the”).

Supervised alignment loss:

Gold alignments \(\boldsymbol{\alpha}^*\) from manual annotation:

\[\mathcal{L} = \mathcal{L}_{\text{translation}} + \lambda \|\boldsymbol{\alpha} - \boldsymbol{\alpha}^*\|^2\]

Results:

Alignment accuracy: +10-15% (measured against gold alignments)

Translation quality: ±0.2 BLEU (negligible change)

Cost: Requires expensive manual alignment annotations

Application: Medical/legal domains where interpretability critical.

What Attention Accomplished vs Architectural Constraints Remaining

Core achievement: Bottleneck eliminated

Seq2seq: Fixed \(\mathbf{c} \in \mathbb{R}^h\), capacity \(16,384\) bits regardless of \(T_s\)

Attention: Decoder accesses all \(T_s\) encoder states \(\{\mathbf{h}_1, ..., \mathbf{h}_{T_s}\}\)

Capacity scales: \(O(T_s \times h)\) dimensions available

WMT’14 En→De translation gains:

  • 20-30 tokens: seq2seq 32.8 → 34.1 with attention (+1.3 BLEU)
  • 50-60 tokens: seq2seq 19.8 → 27.7 with attention (+7.9 BLEU)
  • 60+ tokens: seq2seq 12.3 → 23.0 with attention (+10.7 BLEU)

Gains increase with length—confirms bottleneck solved.

Gradient flow: Direct paths to all encoder states

Seq2seq: \(\frac{\partial \mathcal{L}}{\partial \mathbf{h}_j}\) through \(T_{src} + T_{tgt}\) sequential steps

Attention: \(\frac{\partial \mathcal{L}}{\partial \mathbf{h}_j} = \sum_t \alpha_{tj} \frac{\partial \mathcal{L}}{\partial \mathbf{c}_t}\)

Path length: 1 hop (direct connection)

All encoder positions receive gradient signal, not just recent positions.

Architectural constraint: RNN remains bottleneck

Attention provides dynamic weighting.

Sequence processing still requires RNN:

\[\mathbf{h}_t = f_{\text{enc}}(\mathbf{h}_{t-1}, \mathbf{x}_t)\] \[\mathbf{s}_t = f_{\text{dec}}(\mathbf{s}_{t-1}, \mathbf{y}_{t-1}, \mathbf{c}_t)\]

Sequential dependency prevents parallelization.

Training time breakdown (50-token batch):

  • RNN encoder: 100ms (40%)
  • Attention: 5ms (2%)
  • RNN decoder: 144ms (58%)

RNN dominates: 98% of computation time

Quadratic attention cost manageable for translation

Attention matrix: \(T_s \times T_t\) weights

Translation (\(T_s, T_t \approx 50\)): 2,500 weights (10KB memory)

Long documents (\(T_s = 2048\)): 4M weights (16MB memory)

Books (\(T = 8192\)): 67M weights (256MB memory)

Quadratic growth limits very long sequences.

RNN solved gradient vanishing but introduced sequential bottleneck. Attention solved information bottleneck but requires RNN for sequence processing. Next step: Replace RNN with parallelizable architecture (self-attention stacks + explicit positional encoding = Transformers).