
EE 641 - Unit 4
Fall 2025
[Gradients] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[LSTM] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[RNN] I. Goodfellow, Y. Bengio, and A. Courville, “Deep Learning,” MIT Press, 2016, Chapter 10: Sequence Modeling: Recurrent and Recursive Nets.
[GRU] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder–decoder for statistical machine translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1724–1734.
[Seq2Seq] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[BPTT] P. J. Werbos, “Backpropagation through time: What it does and how to do it,” Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
[Audio] A. Graves, A.-R. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 6645–6649.
[Vision] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber, “A novel connectionist system for unconstrained handwriting recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 855–868, 2009.
[Language] A. Karpathy, J. Johnson, and L. Fei-Fei, “Visualizing and understanding recurrent networks,” arXiv preprint arXiv:1506.02078, 2015.
[Medical] Z. C. Lipton, D. C. Kale, C. Elkan, and R. Wetzel, “Learning to diagnose with LSTM recurrent networks,” in International Conference on Learning Representations, 2016.
State transition function: \[\mathbf{s}_{t+1} = f(\mathbf{s}_t, \mathbf{u}_t)\]
Output function: \[\mathbf{y}_t = g(\mathbf{s}_t, \mathbf{u}_t)\]
where:
Applications: digital circuits (FSMs), control systems, signal processing, recurrent neural networks

| Type | State Space | Dynamics |
|---|---|---|
| FSM | Discrete, finite | Designed |
| Markov Chain | Discrete | Stochastic |
| Linear System | Continuous | Linear, fixed |
| Nonlinear System | Continuous | Nonlinear, fixed |
| RNN | Continuous | Nonlinear, learned |
RNNs optimize dynamics from data

Without state (memoryless): \[\mathbf{y}_t = f(\mathbf{x}_t)\]
With explicit history: \[\mathbf{y}_t = f(\mathbf{x}_t, \mathbf{x}_{t-1}, ..., \mathbf{x}_{t-k})\]
With state (compressed history): \[\mathbf{s}_t = g(\mathbf{s}_{t-1}, \mathbf{x}_t)\] \[\mathbf{y}_t = h(\mathbf{s}_t)\]
Trade-offs:
Memory complexity comparison:
| Approach | Memory | Computation | Gradient Flow |
|---|---|---|---|
| Explicit history | \(O(k \times d)\) | \(O(k \times d)\) per step | Direct to each \(\mathbf{x}_{t-i}\) |
| State compression | \(O(H)\) | \(O(H^2)\) per step | Through state chain |
State influence decay:
For state-based compression with linear dynamics: \[\text{Influence}(\mathbf{x}_{t-k}) \approx \|\mathbf{W}\|^k \cdot \|\mathbf{x}_{t-k}\|\]
Practical implications:
| Aspect | FSM | Linear System | RNN |
|---|---|---|---|
| State space | \(\{S_1, ..., S_n\}\) | \(\mathbb{R}^n\) | \(\mathbb{R}^H\) |
| Transition | Lookup table | \(\mathbf{s}_t = \mathbf{A}\mathbf{s}_{t-1} + \mathbf{B}\mathbf{u}_t\) | \(\mathbf{s}_t = \tanh(\mathbf{W}_{ss}\mathbf{s}_{t-1} + \mathbf{W}_{xs}\mathbf{x}_t)\) |
| Parameters | Designed | Fixed by physics | Learned from data |
| Optimization | Manual | System ID | Backpropagation |
| Capacity | \(\log_2(n)\) bits | Infinite (continuous) | Infinite (continuous) |
RNNs are universal approximators of dynamical systems
Differences:
FSM:
Linear System:
RNN:
Mathematical capacity:
Recurrent form (compact):
Unrolled form (explicit):
Mathematical equivalence: \[\mathbf{s}_t = f(\mathbf{s}_{t-1}, \mathbf{x}_t) = f(f(\mathbf{s}_{t-2}, \mathbf{x}_{t-1}), \mathbf{x}_t)\]
Unrolling depth = sequence length \(T\)

Block diagram (Controls/DSP):
Graphical model (ML):
Computation graph (Deep Learning):
All represent the same mathematics.

| Aspect | FIR Filter | IIR Filter | RNN |
|---|---|---|---|
| Response | Finite | Infinite | Infinite |
| Memory | Explicit (taps) | Implicit (poles) | Implicit (state) |
| Stability | Always stable | Depends on poles | Learned stability |
| Adaptation | LMS/RLS | Adaptive IIR | Backpropagation |
| Nonlinearity | None | None | Activation functions |
Signal processing interpretation:
RNNs: nonlinear adaptive filters in high-dimensional spaces

State update: \[\mathbf{s}_t = \mathbf{A}\mathbf{s}_{t-1} + \mathbf{B}\mathbf{u}_t\]
Output: \[\mathbf{y}_t = \mathbf{C}\mathbf{s}_t + \mathbf{D}\mathbf{u}_t\]
where:
Linearity enables closed-form analysis

Starting from \(\mathbf{s}_0\): \[\mathbf{s}_1 = \mathbf{A}\mathbf{s}_0 + \mathbf{B}\mathbf{u}_1\] \[\mathbf{s}_2 = \mathbf{A}\mathbf{s}_1 + \mathbf{B}\mathbf{u}_2 = \mathbf{A}^2\mathbf{s}_0 + \mathbf{A}\mathbf{B}\mathbf{u}_1 + \mathbf{B}\mathbf{u}_2\]
General solution: \[\mathbf{s}_t = \mathbf{A}^t\mathbf{s}_0 + \sum_{i=0}^{t-1}\mathbf{A}^i\mathbf{B}\mathbf{u}_{t-i}\]
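As a quick numerical check (with an arbitrary illustrative \(\mathbf{A}\), \(\mathbf{B}\), and input sequence, not values from the notes), the closed-form solution above can be compared against direct iteration of the state update:

```python
import numpy as np

# Sketch: check the closed-form state solution against direct iteration.
rng = np.random.default_rng(0)
A = np.array([[0.8, 0.1], [0.1, 0.7]])
B = np.array([[1.0], [0.5]])
s0 = np.array([1.0, -1.0])
u = rng.standard_normal(5)  # scalar inputs u_1..u_5

# Direct iteration: s_t = A s_{t-1} + B u_t
s = s0.copy()
for t in range(5):
    s = A @ s + (B * u[t]).ravel()

# Closed form: s_T = A^T s_0 + sum_{i=0}^{T-1} A^i B u_{T-i}
T = 5
s_closed = np.linalg.matrix_power(A, T) @ s0
for i in range(T):
    s_closed += np.linalg.matrix_power(A, i) @ (B * u[T - 1 - i]).ravel()

assert np.allclose(s, s_closed)
```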
Two components:
Long-term behavior dominated by \(\mathbf{A}^t\)

For diagonalizable \(\mathbf{A}\): \[\mathbf{A} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{-1}\]
where:
Powers simplify: \[\mathbf{A}^t = \mathbf{V}\boldsymbol{\Lambda}^t\mathbf{V}^{-1}\] \[\boldsymbol{\Lambda}^t = \text{diag}(\lambda_1^t, ..., \lambda_n^t)\]
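A minimal check of the power identity, using one of the example matrices from this unit (the rotation-like \(\mathbf{A}_2\), which has complex eigenvalues):

```python
import numpy as np

# Sketch: A^t via eigendecomposition matches repeated multiplication.
A = np.array([[0.9, 0.3], [-0.3, 0.9]])   # complex eigenvalues 0.9 ± 0.3i
lam, V = np.linalg.eig(A)                  # A = V diag(lam) V^{-1}
t = 10
At_eig = (V @ np.diag(lam**t) @ np.linalg.inv(V)).real
At_pow = np.linalg.matrix_power(A, t)
assert np.allclose(At_eig, At_pow)
```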
Eigenspace interpretation:

\[\rho(\mathbf{A}) = \max_i |\lambda_i|\]
Stability conditions:
| Spectral Radius | Behavior | Gradient Flow |
|---|---|---|
| \(\rho < 1\) | Stable, converges to 0 | Vanishing |
| \(\rho = 1\) | Marginal stability | Preserved |
| \(\rho > 1\) | Unstable, diverges | Exploding |
For RNNs:
Time horizon: Effective memory \(\approx \frac{1}{1-\rho}\)

For real matrix \(\mathbf{A}\): \[\lambda = a \pm bi\]
Polar form: \[\lambda = r e^{\pm i\theta}\] where:
Evolution: \[\lambda^t = r^t e^{\pm it\theta} = r^t(\cos(t\theta) \pm i\sin(t\theta))\]
Results in:

Case 1: Real, stable \[\mathbf{A}_1 = \begin{bmatrix} 0.8 & 0.1 \\ 0.1 & 0.7 \end{bmatrix}\] \(\lambda \approx \{0.86, 0.64\}\) → Monotonic decay
Case 2: Complex, stable \[\mathbf{A}_2 = \begin{bmatrix} 0.9 & 0.3 \\ -0.3 & 0.9 \end{bmatrix}\] \(\lambda = 0.9 \pm 0.3i\) → Decaying spiral
Case 3: Real, mixed \[\mathbf{A}_3 = \begin{bmatrix} 1.1 & 0.2 \\ 0.2 & 0.7 \end{bmatrix}\] \(\lambda \approx \{1.18, 0.62\}\) → Saddle point
Each case has distinct implications for gradient flow in RNNs
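The eigenvalues and spectral radii of the three cases can be verified directly:

```python
import numpy as np

# Sketch: eigenvalues of the three example matrices above.
A1 = np.array([[0.8, 0.1], [0.1, 0.7]])   # real, stable
A2 = np.array([[0.9, 0.3], [-0.3, 0.9]])  # complex, stable
A3 = np.array([[1.1, 0.2], [0.2, 0.7]])   # real, mixed (saddle)

for name, A in [("A1", A1), ("A2", A2), ("A3", A3)]:
    lam = np.linalg.eigvals(A)
    rho = np.abs(lam).max()
    print(name, np.round(lam, 3), "spectral radius:", round(float(rho), 3))
```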

First-order AR filter: \[y[n] = \alpha y[n-1] + \beta x[n]\]
Transfer function: \[H(z) = \frac{\beta}{1 - \alpha z^{-1}}\]
Pole location determines behavior:
DC gain (steady-state response): \[H(1) = \frac{\beta}{1 - \alpha}\]
For unit DC gain: \(\beta = 1 - \alpha\)
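A short simulation (scalar case, illustrative \(\alpha = 0.9\)) shows the step response settling at the unit DC gain when \(\beta = 1 - \alpha\):

```python
# Sketch: step response of y[n] = a*y[n-1] + b*x[n] with b = 1 - a,
# which settles at the DC gain H(1) = b / (1 - a) = 1.
a = 0.9
b = 1 - a
y = 0.0
for _ in range(200):        # unit step input x[n] = 1
    y = a * y + b * 1.0
assert abs(y - 1.0) < 1e-6  # converged to unit DC gain
```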

Linear AR (vector form): \[\mathbf{s}_t = \mathbf{A}\mathbf{s}_{t-1} + \mathbf{B}\mathbf{x}_t\]
Vanilla RNN: \[\mathbf{s}_t = \tanh(\mathbf{W}_{ss}\mathbf{s}_{t-1} + \mathbf{W}_{xs}\mathbf{x}_t + \mathbf{b})\]
Generalizations:
Memory mechanism:
Stability:

Forget gate (feedback coefficient): \[\mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{s}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)\]
Input gate (feedforward coefficient): \[\mathbf{i}_t = \sigma(\mathbf{W}_i[\mathbf{s}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i)\]
Coupled gates for unit DC gain:

Linear system frequency response:
RNN frequency response:
Long-term dependencies = Low frequencies
Vanishing gradients = High-pass filtering
Gates modulate bandwidth

CNNs = FIR Filters
RNNs = IIR Filters
Comparison:

| DSP Concept | RNN Equivalent | Purpose |
|---|---|---|
| Saturation/Limiting | Gradient clipping | Prevent overflow |
| AGC | Layer norm | Stabilize magnitude |
| Parallel paths | Skip connections | Preserve signal |
| Filter bank | Attention heads | Frequency decomposition |
| Adaptive filtering | BPTT | Learn coefficients |
| Decimation | Pooling in time | Reduce rate |
| Interpolation | Upsampling | Increase rate |

Classical DSP: Saturation prevents amplifier damage
Deep Learning: Gradient clipping prevents training instability \[\mathbf{g}_{\text{clipped}} = \begin{cases} \mathbf{g} & \text{if } \|\mathbf{g}\| \leq \theta \\ \theta \cdot \frac{\mathbf{g}}{\|\mathbf{g}\|} & \text{if } \|\mathbf{g}\| > \theta \end{cases}\]
Two clipping strategies:
Value clipping: Element-wise bounds
Norm clipping: Preserves direction
Empirical thresholds (LSTM on PTB):

Classical AGC (Automatic Gain Control):
Layer Normalization: \[\text{LN}(\mathbf{x}) = \gamma \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta\]
where across feature dimension:
\(\mu = \frac{1}{D}\sum_{i=1}^D x_i\)
\(\sigma^2 = \frac{1}{D}\sum_{i=1}^D (x_i - \mu)^2\)
AGC: Normalizes signal power
LayerNorm: Normalizes activation magnitude
Both prevent saturation in subsequent stages
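The LayerNorm formula above can be checked against PyTorch's built-in layer (default \(\gamma = 1\), \(\beta = 0\); note the variance is the biased estimate over the feature dimension):

```python
import torch

# Sketch: nn.LayerNorm matches the formula (mean/variance over features).
torch.manual_seed(0)
x = torch.randn(4, 8)                       # (batch, features)
ln = torch.nn.LayerNorm(8)                  # gamma=1, beta=0 by default
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mu) / torch.sqrt(var + ln.eps)
assert torch.allclose(ln(x), manual, atol=1e-6)
```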
Measured stabilization (Transformer, 12 layers):

DSP: Parallel signal paths
ResNet skip connections: \[\mathbf{y} = F(\mathbf{x}, \mathbf{W}) + \mathbf{x}\]
Gradient flow analysis: \[\frac{\partial L}{\partial \mathbf{x}} = \frac{\partial L}{\partial \mathbf{y}} \left(1 + \frac{\partial F}{\partial \mathbf{x}}\right)\]
Even if \(\frac{\partial F}{\partial \mathbf{x}} \approx 0\) (vanishing), gradient still flows through identity path
Frequency domain interpretation:
Impact on trainability (ImageNet):

Filter banks in DSP vs attention mechanisms
Classical filter bank:
Multi-head attention as filter bank: \[\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
Each head captures different “frequency” of relationships:
Adaptive aspect:
Measured specialization (BERT-base):

Adaptive filters learn optimal coefficients
Classical adaptive filtering (LMS, RLS):
BPTT as adaptive filter training: \[\mathbf{W}_{t+1} = \mathbf{W}_t - \alpha \nabla_{\mathbf{W}} L\]
Both minimize prediction error through iterative updates
Convergence characteristics:
Truncated BPTT = Finite impulse response:

Decimation in DSP vs temporal pooling
DSP Decimation:
Temporal pooling in RNNs:
Common strategies:
Computational savings:
Performance-memory (speech recognition):

For loss \(L\) and parameters \(\mathbf{W}^{(1)}\) in layer 1: \[\frac{\partial L}{\partial \mathbf{W}^{(1)}} = \frac{\partial L}{\partial \mathbf{a}^{(L)}} \cdot \frac{\partial \mathbf{a}^{(L)}}{\partial \mathbf{a}^{(L-1)}} \cdots \frac{\partial \mathbf{a}^{(2)}}{\partial \mathbf{W}^{(1)}}\]
Product of Jacobians: \[\frac{\partial L}{\partial \mathbf{W}^{(1)}} = \boldsymbol{\delta}^{(L)} \prod_{l=2}^{L} \mathbf{J}^{(l)}\]
where \(\mathbf{J}^{(l)} = \frac{\partial \mathbf{a}^{(l)}}{\partial \mathbf{a}^{(l-1)}}\)
Gradient magnitude depends on product of L-1 matrices
For RNNs: Replace depth with time

For layer with activation \(\phi\): \[\mathbf{a}^{(l)} = \phi(\mathbf{W}^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)})\]
Jacobian: \[\mathbf{J}^{(l)} = \frac{\partial \mathbf{a}^{(l)}}{\partial \mathbf{a}^{(l-1)}} = \text{diag}(\phi'(\mathbf{z}^{(l)})) \cdot \mathbf{W}^{(l)}\]
Norm bounds: \[\|\mathbf{J}^{(l)}\| \leq \|\phi'\|_{\infty} \cdot \|\mathbf{W}^{(l)}\|\]
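A small numerical check of the bound (random illustrative weights; \(\|\tanh'\|_\infty = 1\)):

```python
import torch

# Sketch: the layer Jacobian diag(phi'(z)) @ W and its norm bound.
torch.manual_seed(0)
W = torch.randn(5, 5) * 0.5
a_prev = torch.randn(5)
z = W @ a_prev
J = torch.diag(1 - torch.tanh(z) ** 2) @ W        # Jacobian for phi = tanh
bound = 1.0 * torch.linalg.matrix_norm(W, ord=2)  # |tanh'| <= 1
assert torch.linalg.matrix_norm(J, ord=2) <= bound + 1e-6
```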
Activation derivatives:

Starting from loss gradient \(\nabla_{\mathbf{a}^{(L)}} L\): \[\|\nabla_{\mathbf{W}^{(1)}} L\| \approx \|\nabla_{\mathbf{a}^{(L)}} L\| \cdot \prod_{l=2}^{L} \|\mathbf{J}^{(l)}\|\]
Controlling factors:
Exponential behavior:
Effective gradient horizon: \[L_{\text{eff}} \approx \frac{\log(\epsilon)}{\log(\|\mathbf{J}\|)}\]
where \(\epsilon\) is numerical precision threshold

Sigmoid saturation: \[\sigma'(x) = \sigma(x)(1-\sigma(x)) \leq 0.25\]
Tanh saturation: \[\tanh'(x) = 1 - \tanh^2(x) \leq 1\]
For deep networks:
Consequences:
This motivated: ResNets, LSTMs, normalization techniques

Causes:
Mathematical analysis: \[\|\nabla\| \approx \|\mathbf{W}\|^L \cdot \|\phi'\|^L\]
For ReLU with \(\|\mathbf{W}\| = 1.5\):
Consequences:
Solutions: Gradient clipping, careful initialization, normalization

RNN unrolled for T steps = T-layer deep network
Gradient path length: O(T) \[\frac{\partial L}{\partial \mathbf{s}_0} = \prod_{t=1}^T \frac{\partial \mathbf{s}_t}{\partial \mathbf{s}_{t-1}}\]
Challenges scale with sequence length:
Effective memory horizon: \[T_{\text{eff}} \approx \frac{1}{1 - \|\mathbf{J}_{\text{RNN}}\|}\]
This motivates:

Hidden state update: \[\mathbf{s}_t = \tanh(\mathbf{W}_{ss}\mathbf{s}_{t-1} + \mathbf{W}_{xs}\mathbf{x}_t + \mathbf{b}_s)\]
Output computation: \[\mathbf{y}_t = \mathbf{W}_{sy}\mathbf{s}_t + \mathbf{b}_y\]
Tanh activation properties:

| Component | Dimension | Description |
|---|---|---|
| \(\mathbf{x}_t\) | \(\mathbb{R}^D\) | Input at time \(t\) |
| \(\mathbf{s}_t\) | \(\mathbb{R}^H\) | Hidden state |
| \(\mathbf{y}_t\) | \(\mathbb{R}^K\) | Output at time \(t\) |
| \(\mathbf{W}_{xs}\) | \(\mathbb{R}^{H \times D}\) | Input-to-hidden |
| \(\mathbf{W}_{ss}\) | \(\mathbb{R}^{H \times H}\) | Hidden-to-hidden |
| \(\mathbf{W}_{sy}\) | \(\mathbb{R}^{K \times H}\) | Hidden-to-output |
| \(\mathbf{b}_s\) | \(\mathbb{R}^H\) | Hidden bias |
| \(\mathbf{b}_y\) | \(\mathbb{R}^K\) | Output bias |
Total parameters: \[N_{\text{params}} = HD + H^2 + KH + H + K\]
Example configuration:
Parameter breakdown:
Input→Hidden (W_xs): 512 × 300 = 153.6K
Hidden→Hidden (W_ss): 512 × 512 = 262.1K
Hidden→Output (W_sy): 10000 × 512 = 5.12M
Biases: 10.5K
────────────────────────────────────────
Total: 5.55M
Output projection dominates (~92%) when the vocabulary is large, while the hidden-to-hidden cost is independent of vocabulary size.
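The parameter formula can be checked against actual layer sizes (D=300, H=512, K=10000 as above). One caveat: `torch.nn.RNN` stores two hidden-bias vectors (`bias_ih` and `bias_hh`), so it carries \(H\) extra parameters relative to the single-bias formulation in these notes:

```python
import torch

# Sketch: N = HD + H^2 + KH + H + K checked against real layers.
D, H, K = 300, 512, 10000
rnn = torch.nn.RNN(input_size=D, hidden_size=H)   # W_xs, W_ss, biases
out = torch.nn.Linear(H, K)                       # W_sy, b_y
n_actual = sum(p.numel() for p in rnn.parameters()) \
         + sum(p.numel() for p in out.parameters())
n_formula = H * D + H * H + K * H + H + K
assert n_actual == n_formula + H   # extra H: PyTorch's duplicated bias
print(f"{n_formula / 1e6:.2f}M parameters")
```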
Memory footprint (float32):
Without sharing:
With sharing (RNN):
CNN shares weights across space; RNN shares across time


One-to-Many (e.g., image captioning)
Many-to-One (e.g., sentiment analysis)
Many-to-Many (Synchronized) (e.g., POS tagging)
Many-to-Many (Encoder-Decoder) (e.g., translation)
Encoder-decoder architecture covered in Seq2Seq section

One-to-Many Applications:
Many-to-One Applications:
Many-to-Many (Synchronized):
Many-to-Many (Seq2Seq):

Theoretical capacity:
Practical capacity much less:
Memory content:
Forgetting curve:

Training mode (fixed length):
Streaming mode (infinite length):
Implementation pattern:
# Training (batch)
h = zeros(H)  # initial state
states = []
for x in sequence:
    h = rnn_cell(h, x)
    states.append(h)

# Streaming (online)
class StreamingRNN:
    def __init__(self):
        self.hidden = None

    def step(self, x):
        if self.hidden is None:
            self.hidden = zeros(H)
        self.hidden = rnn_cell(self.hidden, x)
        return output(self.hidden)

Applications: Live transcription, real-time translation, online monitoring

PyTorch RNN layers
# Basic RNN layer
rnn = torch.nn.RNN(
    input_size=D,         # Input dimension
    hidden_size=H,        # Hidden dimension
    num_layers=1,         # Number of stacked RNNs
    nonlinearity='tanh',  # or 'relu'
    bias=True,
    batch_first=False,    # (seq, batch, features)
    dropout=0.0,          # Between layers (if num_layers > 1)
    bidirectional=False
)

# Forward pass
output, h_n = rnn(input, h_0)

Input/Output dimensions:

# Shapes (batch_first=False):
input:  (T, B, D)  # seq_len, batch, input_size
h_0:    (L, B, H)  # num_layers, batch, hidden_size
output: (T, B, H)  # seq_len, batch, hidden_size
h_n:    (L, B, H)  # num_layers, batch, hidden_size

# If bidirectional=True:
output: (T, B, 2*H)
h_n:    (2*L, B, H)

RNN Cell (single timestep):
LSTM Implementation:
lstm = torch.nn.LSTM(
    input_size=D,
    hidden_size=H,
    num_layers=1,
    bias=True,
    batch_first=False,
    dropout=0.0,
    bidirectional=False,
    proj_size=0  # v1.8+: projection layer
)

# Hidden state is a tuple: (h_n, c_n)
output, (h_n, c_n) = lstm(input, (h_0, c_0))
# Shapes:
# c_0, c_n: (L, B, H)  # cell state
# All others same as RNN

GRU Implementation:

gru = torch.nn.GRU(
    input_size=D,
    hidden_size=H,
    num_layers=1,
    bias=True,
    batch_first=False,
    dropout=0.0,
    bidirectional=False
)
output, h_n = gru(input, h_0)
# Same shapes as RNN

Comparing parameters:
Stacking layers:
# Multi-layer RNN
deep_rnn = torch.nn.RNN(
    input_size=D,
    hidden_size=H,
    num_layers=3,  # Stack 3 RNNs
    dropout=0.2    # Between layers 1→2, 2→3
)

Bidirectional:
# Simplified pseudocode
def rnn_forward(x_sequence, Wss, Wxs, Wsy, bs, by):
    s = zeros(H)  # Initial state
    outputs = []
    for x_t in x_sequence:
        # Update hidden state
        s = tanh(Wss @ s + Wxs @ x_t + bs)
        # Compute output
        y_t = Wsy @ s + by
        outputs.append(y_t)
    return outputs

Sequential requirement:
Full PyTorch implementation
import torch
import torch.nn as nn

class VanillaRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        # Weight matrices
        self.Wxs = nn.Linear(input_size, hidden_size)
        self.Wss = nn.Linear(hidden_size, hidden_size, bias=False)
        self.Wsy = nn.Linear(hidden_size, output_size)

    def forward(self, x, s0=None):
        batch_size, seq_len, _ = x.shape
        if s0 is None:
            s = torch.zeros(batch_size, self.hidden_size, device=x.device)
        else:
            s = s0
        outputs = []
        states = []
        for t in range(seq_len):
            s = torch.tanh(
                self.Wss(s) + self.Wxs(x[:, t, :])
            )
            y = self.Wsy(s)
            states.append(s)
            outputs.append(y)
        return torch.stack(outputs, dim=1), torch.stack(states, dim=1)

Forward pass must store:
Memory complexity:
Example (batch_size=32):
Memory optimization techniques:

Matrix multiplications:
Per timestep total: \(O(H^2 + HD + KH)\)
Full sequence: \(O(T \times (H^2 + HD + KH))\)
Bottlenecks by task:
Comparison with Transformers:
Parallelization constraints:

Standard approach: Zero initialization
Learned initial state:
self.s_0 = nn.Parameter(torch.randn(1, H))
# Broadcast for batch
s_0 = self.s_0.expand(batch_size, -1)

Random initialization:
Weight initialization (specific to RNNs):
Impact: Mainly affects first ~10 timesteps

import torch
import torch.nn as nn

# Standard API usage
rnn = nn.RNN(
    input_size=300,
    hidden_size=512,
    num_layers=1,
    batch_first=True,  # (B, T, D) format
    dropout=0.0,       # No dropout for single layer
    bidirectional=False
)

# Input: (batch, seq_len, input_size)
x = torch.randn(32, 100, 300)
# Initial hidden: (num_layers, batch, hidden_size)
h_0 = torch.zeros(1, 32, 512)

# Forward pass
output, h_n = rnn(x, h_0)
# output: (32, 100, 512) - all hidden states
# h_n: (1, 32, 512) - final hidden state

# For variable length sequences
from torch.nn.utils.rnn import pack_padded_sequence
packed = pack_padded_sequence(
    x, lengths, batch_first=True
)
output_packed, h_n = rnn(packed, h_0)

Custom RNN cell (for debugging)

class CustomRNNCell(nn.Module):
    """Single RNN cell for step-by-step processing"""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # Separate input and recurrent projections
        self.weight_ih = nn.Linear(input_size, hidden_size)
        self.weight_hh = nn.Linear(hidden_size, hidden_size)

    def forward(self, x_t, h_prev):
        """Single timestep forward"""
        h_new = torch.tanh(
            self.weight_ih(x_t) + self.weight_hh(h_prev)
        )
        return h_new

# Usage in custom loop
cell = CustomRNNCell(300, 512)
h = torch.zeros(32, 512)
outputs = []
for t in range(seq_len):
    h = cell(x[:, t, :], h)
    outputs.append(h)
    # Can add custom logic here:
    # - Attention computation
    # - Gradient clipping per step
    # - Visualization/debugging
outputs = torch.stack(outputs, dim=1)

Custom implementation use cases:
Total loss: \[L = \sum_{t=1}^T L_t(\mathbf{y}_t, \hat{\mathbf{y}}_t)\]
Common loss functions:
Classification: Cross-entropy per timestep \[L_t = -\sum_{k=1}^K y_{tk} \log \hat{y}_{tk}\]
Regression: MSE per timestep \[L_t = \|\mathbf{y}_t - \hat{\mathbf{y}}_t\|^2\]
Language modeling: Perplexity \[\text{PPL} = \exp\left(\frac{1}{T}\sum_t L_t\right)\]
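A minimal sketch of the per-timestep cross-entropy and perplexity computation (random illustrative logits and targets):

```python
import torch
import torch.nn.functional as F

# Sketch: per-timestep cross-entropy L_t and perplexity exp(mean L_t).
torch.manual_seed(0)
T, K = 10, 50
logits = torch.randn(T, K)               # one prediction per timestep
targets = torch.randint(0, K, (T,))
loss_per_t = F.cross_entropy(logits, targets, reduction="none")  # L_t
ppl = torch.exp(loss_per_t.mean())
print(f"PPL: {ppl.item():.1f}")
```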
Loss accumulation patterns:
Gradient starts from: \[\frac{\partial L}{\partial \mathbf{y}_t} = \frac{\partial L_t}{\partial \mathbf{y}_t}\]

Error signal at hidden state: \[\boldsymbol{\delta}_t = \frac{\partial L}{\partial \mathbf{s}_t}\]
Recursive computation: \[\boldsymbol{\delta}_t = \frac{\partial L_t}{\partial \mathbf{s}_t} + \mathbf{W}_{ss}^T \left(\boldsymbol{\delta}_{t+1} \odot \tanh'(\mathbf{z}_{t+1})\right)\] where \(\mathbf{z}_{t+1}\) is the pre-activation at step \(t+1\)
Components:
Boundary condition: \[\boldsymbol{\delta}_T = \frac{\partial L_T}{\partial \mathbf{s}_T}\]
Gradient flows backward through same path as forward state flow

Weight gradient accumulation
Hidden-to-hidden weights: \[\frac{\partial L}{\partial \mathbf{W}_{ss}} = \sum_{t=1}^T \boldsymbol{\delta}_t \mathbf{s}_{t-1}^T\]
Input-to-hidden weights: \[\frac{\partial L}{\partial \mathbf{W}_{xs}} = \sum_{t=1}^T \boldsymbol{\delta}_t \mathbf{x}_t^T\]
Hidden-to-output weights: \[\frac{\partial L}{\partial \mathbf{W}_{sy}} = \sum_{t=1}^T \frac{\partial L_t}{\partial \mathbf{y}_t} \mathbf{s}_t^T\]
Computational pattern:


Limiting gradient flow for efficiency
Motivation:
Truncated BPTT algorithm:
def truncated_bptt(sequence, k=20):
    state = init_state()  # e.g., zeros
    states = []
    for t in range(len(sequence)):
        if t % k == 0:
            # Detach: stop gradient flow
            state = state.detach()
        state = rnn_cell(state, sequence[t])
        states.append(state)
        if t % k == k - 1:
            # Backprop through last k steps
            loss = compute_loss(states[-k:])
            loss.backward()

Trade-offs:

Chunking sequences for efficiency
def train_with_tbptt(model, sequence, chunk_size=35):
    """Truncated BPTT with chunking"""
    hidden = None
    total_loss = 0
    # Process in chunks
    for i in range(0, len(sequence), chunk_size):
        # Get chunk
        chunk = sequence[i:i+chunk_size]
        # Detach hidden state (stop gradient)
        if hidden is not None:
            hidden = hidden.detach()
        # Forward pass on chunk
        output, hidden = model(chunk, hidden)
        # Compute loss on chunk
        loss = criterion(output, targets[i:i+chunk_size])
        # Backward pass (only through chunk)
        loss.backward()
        # Apply gradient clipping
        torch.nn.utils.clip_grad_norm_(
            model.parameters(), max_norm=5.0
        )
        # Update weights
        optimizer.step()
        optimizer.zero_grad()
        total_loss += loss.item()
    return total_loss

Implementation details:
Advanced TBPTT strategies
class StatefulRNN:
    """Maintains hidden state across batches"""
    def __init__(self, model, batch_size):
        self.model = model
        self.hidden = None
        self.batch_size = batch_size

    def reset_hidden(self):
        """Reset at epoch boundaries"""
        self.hidden = None

    def forward_chunk(self, chunk, backprop_steps):
        """Process chunk with controlled BPTT"""
        # Detach old hidden state
        if self.hidden is not None:
            if backprop_steps > 0:
                # Allow gradient through last k steps
                self.hidden = self.hidden[-backprop_steps:]
            else:
                # Full detach
                self.hidden = self.hidden.detach()
        # Forward pass
        output, self.hidden = self.model(chunk, self.hidden)
        return output

# Usage for language modeling
model = StatefulRNN(rnn, batch_size=32)
for epoch in range(num_epochs):
    model.reset_hidden()
    for batch in dataloader:
        output = model.forward_chunk(batch, backprop_steps=35)
        loss = compute_loss(output, targets)
        loss.backward()

Built-in features for BPTT
import torch
import torch.nn as nn
import torch.nn.functional as F

# Standard RNN with PyTorch
class LanguageModel(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.RNN(hidden_size, hidden_size,
                          batch_first=True)
        self.output = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden=None):
        embed = self.embedding(x)
        rnn_out, hidden = self.rnn(embed, hidden)
        output = self.output(rnn_out)
        return output, hidden

# Training with gradient clipping
model = LanguageModel(10000, 512)
optimizer = torch.optim.Adam(model.parameters())
for batch in dataloader:
    # Forward pass
    output, _ = model(batch.input)
    loss = F.cross_entropy(
        output.reshape(-1, 10000),
        batch.target.reshape(-1)
    )
    # Backward pass
    loss.backward()
    # Gradient clipping - standard for RNN stability
    torch.nn.utils.clip_grad_norm_(
        model.parameters(), max_norm=5.0
    )
    optimizer.step()
    optimizer.zero_grad()

Debugging gradient flow
def check_gradient_health(model, loss):
    """Monitor gradient statistics"""
    loss.backward(retain_graph=True)
    grad_stats = {}
    for name, param in model.named_parameters():
        if param.grad is not None:
            grad = param.grad.data
            grad_stats[name] = {
                'mean': grad.mean().item(),
                'std': grad.std().item(),
                'max': grad.max().item(),
                'min': grad.min().item(),
                'norm': grad.norm().item()
            }
            # NaN/Inf
            if torch.isnan(grad).any() or torch.isinf(grad).any():
                print(f"ERROR: NaN/Inf in {name}")
    # Check for issues
    for name, stats in grad_stats.items():
        # Vanishing
        if stats['norm'] < 1e-6:
            print(f"WARNING: Vanishing gradient in {name}")
        # Exploding
        if stats['norm'] > 100:
            print(f"WARNING: Exploding gradient in {name}")
    return grad_stats

# Visualize gradient flow
import matplotlib.pyplot as plt

def plot_gradient_flow(named_parameters):
    ave_grads = []
    layers = []
    for n, p in named_parameters:
        if p.grad is not None:
            layers.append(n)
            ave_grads.append(p.grad.abs().mean().cpu())
    plt.plot(ave_grads, alpha=0.7)
    plt.hlines(0, 0, len(ave_grads) + 1, linewidth=1)
    plt.xticks(range(len(ave_grads)), layers, rotation="vertical")
    plt.xlim(xmin=0, xmax=len(ave_grads))
    plt.xlabel("Layers")
    plt.ylabel("Average gradient")
    plt.title("Gradient flow")
    plt.grid(True)

Processing sequences in both directions
Architecture:
\[\vec{\mathbf{s}}_t = \text{RNN}_{\rightarrow}(x_t, \vec{\mathbf{s}}_{t-1})\] \[\overleftarrow{\mathbf{s}}_t = \text{RNN}_{\leftarrow}(x_t, \overleftarrow{\mathbf{s}}_{t+1})\] \[\mathbf{s}_t = [\vec{\mathbf{s}}_t; \overleftarrow{\mathbf{s}}_t]\]
Limitations:
# PyTorch implementation
birnn = nn.RNN(input_size, hidden_size,
               bidirectional=True,
               batch_first=True)
# Output size: (batch, seq_len, 2*hidden_size)

Common applications:

For loss at time T, gradient to time k: \[\frac{\partial L}{\partial \mathbf{s}_k} = \frac{\partial L}{\partial \mathbf{s}_T} \prod_{t=k+1}^{T} \frac{\partial \mathbf{s}_t}{\partial \mathbf{s}_{t-1}}\]
Each temporal Jacobian: \[\frac{\partial \mathbf{s}_t}{\partial \mathbf{s}_{t-1}} = \text{diag}(\tanh'(\mathbf{z}_t)) \cdot \mathbf{W}_{ss}\]
where \(\mathbf{z}_t = \mathbf{W}_{ss}\mathbf{s}_{t-1} + \mathbf{W}_{xs}\mathbf{x}_t + \mathbf{b}\)
Product of T-k Jacobians: \[\prod_{t=k+1}^{T} \left[\text{diag}(\tanh'(\mathbf{z}_t)) \cdot \mathbf{W}_{ss}\right]\]
Each factor shrinks gradient:
Gradient decay scenarios:
| Configuration | \(\gamma = \|\mathbf{W}\| \cdot \max(\tanh')\) | Gradient after T steps | Effective memory |
|---|---|---|---|
| Strong decay | 0.72 | \(0.72^T\) | ~7 steps |
| Typical RNN | 0.86 | \(0.86^T\) | ~15 steps |
| Well-initialized | 0.93 | \(0.93^T\) | ~30 steps |
| Marginal stability | 0.99 | \(0.99^T\) | ~100 steps |
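The "Effective memory" column can be reproduced from \(T_{\text{eff}} \approx \log(\epsilon)/\log(\gamma)\). Here \(\epsilon = 0.1\) is an assumed cutoff that matches the first three rows; the \(\gamma = 0.99\) row instead follows the \(1/(1-\gamma) \approx 100\) approximation used elsewhere in these notes:

```python
import math

# Sketch: effective memory horizon T_eff ≈ log(eps)/log(gamma).
eps = 0.1  # assumed "gradient is gone" cutoff
horizons = {g: math.log(eps) / math.log(g) for g in (0.72, 0.86, 0.93, 0.99)}
for g, t in horizons.items():
    print(f"gamma={g}: T_eff ≈ {t:.0f} steps")
```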
Thresholds:
Empirical measurements (Bengio et al., 1994):
Starting from: \[\left\|\frac{\partial \mathbf{s}_t}{\partial \mathbf{s}_{t-1}}\right\| = \|\text{diag}(\tanh'(\mathbf{z}_t)) \cdot \mathbf{W}_{ss}\|\]
Using submultiplicativity: \[\left\|\frac{\partial \mathbf{s}_t}{\partial \mathbf{s}_{t-1}}\right\| \leq \|\text{diag}(\tanh'(\mathbf{z}_t))\| \cdot \|\mathbf{W}_{ss}\|\]
Bounds:
Final bound: \[\left\|\frac{\partial L}{\partial \mathbf{s}_k}\right\| \leq \gamma^{T-k} \left\|\frac{\partial L}{\partial \mathbf{s}_T}\right\|\]
Gradient behavior depends on \(\gamma^{T-k}\):
Empirically measured:

Experiment setup:
Measured decay rates:
Practical implication: Cannot learn dependencies beyond T_eff

Setup: Penn Treebank language modeling
# Model configuration
hidden_size = 128
num_layers = 1
sequence_length = 35
vocab_size = 10000
# Training
batch_size = 32
learning_rate = 1.0

Gradient measurements:
Results:

Observable behaviors:
Quantified impact (word prediction task):
| Dependency | Vanilla RNN | LSTM |
|---|---|---|
| 1-5 words | 85% acc | 87% acc |
| 5-10 words | 45% acc | 78% acc |
| 10-20 words | 12% acc | 65% acc |
| 20+ words | ~random | 43% acc |
Failure modes:
Compensation strategies (inadequate):

Techniques for visualization:
Gradient behavior:
Diagnostic metrics: \[\text{Gradient ratio} = \frac{\|\nabla_{\mathbf{s}_0}\|}{\|\nabla_{\mathbf{s}_T}\|}\]
Typical values:
Interpretation: Information cannot flow backward beyond the gradient horizon

\[\left\|\frac{\partial \mathbf{s}_t}{\partial \mathbf{s}_{t-1}}\right\| = \|\text{diag}(f'(\mathbf{z}_t)) \cdot \mathbf{W}_{ss}\| > 1\]
When this happens:
Explosion characteristics:
Measured explosion rates:

1. Value clipping (element-wise):
2. Norm clipping (preserve direction):
def clip_grad_norm(grads, max_norm):
    total_norm = torch.sqrt(
        sum(g.norm()**2 for g in grads))
    clip_coef = max_norm / (total_norm + 1e-6)
    clip_coef = min(clip_coef, 1.0)
    for g in grads:
        g.mul_(clip_coef)
    return total_norm

Comparison:

Basic norm clipping:
Complete training loop:
def train_step(model, data, target, clip_val=5.0):
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    # Clip gradients
    total_norm = torch.nn.utils.clip_grad_norm_(
        model.parameters(), clip_val)
    # Monitor gradient explosion
    if total_norm > clip_val:
        print(f"Clipped: {total_norm:.2f} -> {clip_val}")
    optimizer.step()
    return loss.item(), total_norm.item()

Adaptive clipping:

Empirical guidelines:
Monitoring strategy:
# Track clipping statistics
clip_stats = {
    'total_steps': 0,
    'clipped_steps': 0,
    'max_norm_seen': 0,
    'avg_norm': 0
}
# Log percentage clipped
if total_norm > max_norm:
    clip_stats['clipped_steps'] += 1

Warning signs:
>50% of steps clipped → threshold too low
Adaptive strategies:

1. Orthogonal initialization:
def orthogonal_init(weight):
    nn.init.orthogonal_(weight)
    # Ensure spectral radius < 1
    with torch.no_grad():
        weight *= 0.95

2. Identity initialization (IRNN):
3. Spectral normalization:
def spectral_normalize(weight, target=0.95):
    with torch.no_grad():
        s = torch.linalg.svdvals(weight)[0]  # largest singular value
        weight *= target / s

Empirical results (Penn Treebank):

Standard RNN with LayerNorm:
class LayerNormRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size)
        self.ln = nn.LayerNorm(hidden_size)

    def forward(self, x, h0=None):
        # Apply LayerNorm to hidden states
        output, hn = self.rnn(x, h0)
        output = self.ln(output)
        return output, hn

Layer-wise normalization:

# Normalize pre-activation
z = W_ih @ x + W_hh @ h
z_norm = layer_norm(z)  # Before activation
h_new = tanh(z_norm)

Benefits:
Measured improvements:

\[\left\|\frac{\partial L}{\partial \mathbf{s}_0}\right\| \lesssim \prod_{t=1}^T \underbrace{\|\mathbf{W}\| \cdot |\tanh'|}_{\leq 0.5} \approx 0.5^T\]
After 20 steps: \(0.5^{20} \approx 10^{-6}\) (vanished!)
LSTM’s solution: Create bypass paths
Cell state gradient: \[\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} = \mathbf{f}_t + \text{other terms}\]
When \(\mathbf{f}_t \approx 1\): Linear path through time!
Gates control gradient flow:
Practical impact:

\[\frac{\partial L}{\partial \mathbf{c}_{t-1}} = \frac{\partial L}{\partial \mathbf{c}_t} \cdot \frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}}\]
Cell state update equation: \[\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t\]
Taking derivative w.r.t. \(\mathbf{c}_{t-1}\): \[\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} = \text{diag}(\mathbf{f}_t) + \underbrace{\frac{\partial \mathbf{f}_t}{\partial \mathbf{c}_{t-1}} \odot \mathbf{c}_{t-1}}_{\text{second-order terms}}\]
When \(\mathbf{f}_t \approx 1\): \[\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} \approx \mathbf{I}\]
Compare gradient products over T steps:
| Architecture | Gradient Product | T=50 Result |
|---|---|---|
| Vanilla RNN | \(\prod_{t=1}^T (\mathbf{W}^T \text{diag}(\tanh'))\) | \(\approx 10^{-30}\) |
| LSTM (f=0.9) | \(\prod_{t=1}^T \text{diag}(\mathbf{f}_t)\) | \(\approx 0.005\) |
| LSTM (f=0.99) | \(\prod_{t=1}^T \text{diag}(\mathbf{f}_t)\) | \(\approx 0.6\) |
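The last column of the table can be reproduced directly. For the vanilla row, a per-step factor of 0.25 (the maximum of \(\sigma'\), used as an illustrative stand-in for the per-step shrinkage) is assumed:

```python
# Sketch: gradient survival after T = 50 steps for the table above.
T = 50
vanilla = 0.25 ** T   # illustrative per-step factor of 0.25
lstm_90 = 0.9 ** T    # constant forget gate f = 0.9
lstm_99 = 0.99 ** T   # constant forget gate f = 0.99
print(f"vanilla-like: {vanilla:.1e}")   # ≈ 10^-30
print(f"LSTM f=0.9:   {lstm_90:.3f}")   # ≈ 0.005
print(f"LSTM f=0.99:  {lstm_99:.2f}")   # ≈ 0.6
```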
Initialization trick:


\[\mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{s}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)\]
where \([\mathbf{s}_{t-1}, \mathbf{x}_t]\) denotes concatenation
Dimensions:
Element-wise gating: \[\mathbf{c}_t^{\text{kept}} = \mathbf{f}_t \odot \mathbf{c}_{t-1}\]
Interpretation:
Learning dynamics:

Input gate (what to store): \[\mathbf{i}_t = \sigma(\mathbf{W}_i[\mathbf{s}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i)\]
Candidate values (potential new content): \[\tilde{\mathbf{c}}_t = \tanh(\mathbf{W}_c[\mathbf{s}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c)\]
Combined update: \[\mathbf{c}_t^{\text{new}} = \mathbf{i}_t \odot \tilde{\mathbf{c}}_t\]
Complete cell update: \[\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t\]
Design choices:
Learned behavior (task-dependent):

\[\mathbf{o}_t = \sigma(\mathbf{W}_o[\mathbf{s}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o)\]
Hidden state computation: \[\mathbf{s}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)\]
Design rationale:
Interpretation:
Task-dependent behavior:
Dimension: Both \(\mathbf{o}_t, \mathbf{s}_t \in \mathbb{R}^H\)


Forget previous: \(\mathbf{f}_t \odot \mathbf{c}_{t-1}\)
Add new info: \(\mathbf{i}_t \odot \tilde{\mathbf{c}}_t\)
Update cell: \(\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t\)
Output: \(\mathbf{s}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)\)
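The four steps above can be collected into a minimal NumPy cell; the sizes H and D and the random weights are illustrative, not from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, s_prev, c_prev, W, b):
    """One LSTM step. W: (4H, H+D) stacked [f; i; c; o] weights, b: (4H,)."""
    H = s_prev.shape[0]
    z = W @ np.concatenate([s_prev, x_t]) + b
    f_t = sigmoid(z[0:H])                 # forget gate
    i_t = sigmoid(z[H:2*H])               # input gate
    c_tilde = np.tanh(z[2*H:3*H])         # candidate values
    o_t = sigmoid(z[3*H:4*H])             # output gate
    c_t = f_t * c_prev + i_t * c_tilde    # cell update
    s_t = o_t * np.tanh(c_t)              # hidden state
    return s_t, c_t

H, D = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * H, H + D))
b = np.zeros(4 * H)
s, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):  # run 5 timesteps
    s, c = lstm_cell(x, s, c, W, b)
print(s.shape, c.shape)  # (4,) (4,)
```

Stacking the four gates into one (4H, H+D) matrix is how most libraries implement the cell: a single matrix multiply amortizes better than four separate ones.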
Parameter count:
Computational cost per timestep:

Given:
Step 1: Compute gates (simplified weights shown)
# Concatenate inputs
h_x = concat([s_t_1, x_t]) # [0.2, 0.8, 0.5, -0.3]
# Forget gate
f_t = sigmoid(W_f @ h_x + b_f) # [0.9, 0.1]
# Input gate
i_t = sigmoid(W_i @ h_x + b_i) # [0.2, 0.8]
# Candidate
c_tilde = tanh(W_c @ h_x + b_c) # [0.6, -0.4]
# Output gate
o_t = sigmoid(W_o @ h_x + b_o) # [0.7, 0.5]
Step 2: Update cell state
c_t = f_t * c_t_1 + i_t * c_tilde
    = [0.9, 0.1] * [1.5, -0.5] + [0.2, 0.8] * [0.6, -0.4]
    = [1.35, -0.05] + [0.12, -0.32]
    = [1.47, -0.37]
Step 3: Compute hidden state
s_t = o_t * tanh(c_t)
    = [0.7, 0.5] * [tanh(1.47), tanh(-0.37)]
    = [0.7, 0.5] * [0.90, -0.35]
    ≈ [0.63, -0.18]

\[\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t\]
Components:
Improved gradient path:
Contrast with vanilla RNN:
Cell state can grow unbounded:

\[\mathbf{s}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)\]
Two-stage process:
Why this design?
Dimensions:
Usage in network:


Cell state = Long-term memory (RAM)
Hidden state = Working memory (Cache)
Gates = Memory controller
Capacity analysis:
Memory access patterns:

Dimensions:
Weight matrices (4 gates):
Total parameters: \[N = 4 \times [H \times (H + D) + H] = 4H(H + D + 1)\]
Example (typical NLP):
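As a concrete instance of the formula, assuming H = 512 hidden units and D = 256 input dimensions (values chosen for illustration):

```python
H, D = 512, 256              # assumed hidden and embedding sizes
N = 4 * H * (H + D + 1)      # four gates: H x (H + D) weights + H biases each
print(f"{N:,}")              # 1,574,912 parameters
```

Framework counts may differ slightly; for example, PyTorch's nn.LSTM keeps two bias vectors per gate (bias_ih and bias_hh), giving 4H(H + D + 2) per layer.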

Gate computations (f, i, o, g): \[4 \times [H(H + D) + H] = 4H(H + D + 1)\]
Cell state update: \[2H \text{ (element-wise ops)}\]
Hidden state: \[H \text{ (element-wise)}\]
Total forward pass: \[\approx 4H(H + D + 1) \text{ FLOPs/timestep}\]
Backward pass (BPTT): \[\approx 12H(H + D + 1) \text{ FLOPs/timestep}\]
Comparison to vanilla RNN:
Matrix operations dominate:

Cell state gradient: \[\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} = \mathbf{f}_t\]
No nonlinearity - element-wise multiplication
Compare to vanilla RNN: \[\frac{\partial \mathbf{s}_t}{\partial \mathbf{s}_{t-1}} = \text{diag}(\tanh'(\mathbf{z}_t)) \cdot \mathbf{W}_{ss}\]
LSTM advantages:
Long-term gradient: \[\frac{\partial \mathbf{c}_T}{\partial \mathbf{c}_0} = \prod_{t=1}^{T} \mathbf{f}_t\]
Element-wise product, not matrix product!
Gradient magnitude:

Initialization strategy:
Learning progression:
Early training: High forget gates
Mid training: Selective forgetting
Late training: Refined gating
Gradient contribution: \[\frac{\partial L}{\partial \mathbf{f}_t} = \frac{\partial L}{\partial \mathbf{c}_t} \odot \mathbf{c}_{t-1}\]
Gradient signal:

Vanilla RNN gradient product: \[\prod_{t=k}^{T} \|\frac{\partial \mathbf{s}_t}{\partial \mathbf{s}_{t-1}}\| \leq \prod_{t=k}^{T} \gamma_{\text{tanh}} \cdot \|\mathbf{W}_{ss}\|\]
where \(\gamma_{\text{tanh}} \leq 1\) (≈ 0.5 in practice)
LSTM gradient product: \[\prod_{t=k}^{T} \|\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}}\| = \prod_{t=k}^{T} \|\mathbf{f}_t\|\]
where \(\mathbf{f}_t \in [0,1]^H\) (0.9-0.95 in practice)
Comparing gradient flow:
Empirical measurements (Penn Treebank):

Definition: the longest lag over which the gradient stays above a threshold \[\text{Horizon} = \max\left\{T : \left\|\frac{\partial L_T}{\partial \mathbf{s}_0}\right\| > \epsilon\right\}\] where \(L_T\) is the loss at timestep \(T\)
\(\epsilon = 10^{-5}\) or \(10^{-6}\)
Analytical approximation: \[\text{Horizon} \approx \frac{\log \epsilon}{\log \bar{f}}\]
where \(\bar{f}\) is the average forget-gate value
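Requiring \(\bar{f}^{\,T} > \epsilon\) and solving for \(T\) gives a horizon of roughly \(\log\epsilon / \log\bar{f}\); a quick numerical check against the measured ranges:

```python
import math

def horizon(f_bar, eps=1e-5):
    # largest T with f_bar ** T still above eps
    return math.log(eps) / math.log(f_bar)

print(round(horizon(0.90)))  # 109 -> consistent with 100-200 for default LSTMs
print(round(horizon(0.95)))  # 224 -> consistent with 200-500 when tuned
```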
Measured horizons (various tasks):
| Model | Horizon | Task |
|---|---|---|
| Vanilla RNN | 10-20 | Most tasks |
| LSTM (default) | 100-200 | Language modeling |
| LSTM (tuned) | 200-500 | Music generation |
| GRU | 80-150 | Translation |
Still has limits:

Standard LSTM gates: \[\mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{s}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)\]
Peephole LSTM gates: \[\mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{s}_{t-1}, \mathbf{x}_t] + \mathbf{P}_f \odot \mathbf{c}_{t-1} + \mathbf{b}_f)\]
All three gates get peepholes:
Peephole weights:
Motivation:
Performance:
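The difference between the two gate forms is a single diagonal term; a NumPy sketch with made-up sizes and weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

H, D = 4, 3
rng = np.random.default_rng(0)
W_f = rng.normal(scale=0.1, size=(H, H + D))
P_f = rng.normal(scale=0.1, size=H)   # diagonal peephole weights
b_f = np.zeros(H)

s_prev, x_t = np.zeros(H), rng.normal(size=D)
c_prev = np.ones(H)                   # nonzero cell state

hx = np.concatenate([s_prev, x_t])
f_std = sigmoid(W_f @ hx + b_f)                  # standard forget gate
f_peep = sigmoid(W_f @ hx + P_f * c_prev + b_f)  # gate sees the cell state
print(f_std.shape, f_peep.shape)  # (4,) (4,)
```

The peephole term P_f * c_prev is element-wise, so it adds only 3H parameters while letting gate values depend on the memory content they regulate.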

GRU equations: \[\mathbf{z}_t = \sigma(\mathbf{W}_z[\mathbf{s}_{t-1}, \mathbf{x}_t] + \mathbf{b}_z)\] \[\mathbf{r}_t = \sigma(\mathbf{W}_r[\mathbf{s}_{t-1}, \mathbf{x}_t] + \mathbf{b}_r)\] \[\tilde{\mathbf{s}}_t = \tanh(\mathbf{W}[\mathbf{r}_t \odot \mathbf{s}_{t-1}, \mathbf{x}_t] + \mathbf{b})\] \[\mathbf{s}_t = (1-\mathbf{z}_t) \odot \mathbf{s}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{s}}_t\]
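Putting the four GRU equations into a minimal NumPy step (sizes and weights are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x_t, s_prev, Wz, Wr, W, bz, br, b):
    """One GRU step; each weight matrix has shape (H, H+D)."""
    hx = np.concatenate([s_prev, x_t])
    z_t = sigmoid(Wz @ hx + bz)                  # update gate
    r_t = sigmoid(Wr @ hx + br)                  # reset gate
    s_tilde = np.tanh(W @ np.concatenate([r_t * s_prev, x_t]) + b)
    return (1 - z_t) * s_prev + z_t * s_tilde    # convex combination

H, D = 4, 3
rng = np.random.default_rng(0)
Wz, Wr, W = (rng.normal(scale=0.1, size=(H, H + D)) for _ in range(3))
bz = br = b = np.zeros(H)
s = np.zeros(H)
for x in rng.normal(size=(5, D)):  # run 5 timesteps
    s = gru_cell(x, s, Wz, Wr, W, bz, br, b)
print(s.shape)  # (4,)
```

Because \(\mathbf{s}_t\) is a convex combination of \(\mathbf{s}_{t-1}\) and a tanh output, the hidden state stays in \([-1, 1]\) without needing a separate output gate.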
Differences from LSTM:
Gate roles:
Advantages:

Reset gate mechanics:
Update gate as interpolation: \[\mathbf{s}_t = (1-\mathbf{z}_t) \odot \mathbf{s}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{s}}_t\]
This is a convex combination:
Gradient flow in GRU: \[\frac{\partial \mathbf{s}_t}{\partial \mathbf{s}_{t-1}} = (1-\mathbf{z}_t) + \text{other terms}\]
Linear path when \(\mathbf{z}_t\) is small!
Design philosophy:

Norm clipping (standard approach): \[\mathbf{g}_{\text{clipped}} = \begin{cases} \mathbf{g} & \text{if } \|\mathbf{g}\|_2 \leq \theta \\ \theta \cdot \frac{\mathbf{g}}{\|\mathbf{g}\|_2} & \text{if } \|\mathbf{g}\|_2 > \theta \end{cases}\]
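The piecewise rule above, hand-rolled over a list of gradient arrays (torch.nn.utils.clip_grad_norm_ implements the same global-norm scaling in PyTorch):

```python
import numpy as np

def clip_by_norm(grads, theta):
    """Rescale gradient arrays so their global L2 norm is at most theta."""
    total = np.sqrt(sum(np.sum(g * g) for g in grads))
    if total <= theta:
        return grads                      # under threshold: leave unchanged
    return [g * (theta / total) for g in grads]

g = [np.array([3.0, 4.0])]                # global norm 5
clipped = clip_by_norm(g, theta=1.0)
print(np.linalg.norm(clipped[0]))         # 1.0: direction kept, norm capped
```

Note that clipping preserves the gradient direction and only shrinks its magnitude, which is why it tames exploding gradients without changing the descent direction.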
Implementation:
Starting points (tune based on gradient statistics):
Monitoring gradient statistics:
Comparison with CNNs/Transformers:

Training (Teacher Forcing):
Inference (Autoregressive):
Exposure bias accumulation:
Scheduled sampling solution:

Sequential dependency: \[\mathbf{h}_t = f(\mathbf{h}_{t-1}, \mathbf{x}_t)\]
Cannot compute \(\mathbf{h}_t\) without \(\mathbf{h}_{t-1}\)
Computational complexity:
Memory requirements for BPTT:
Truncated BPTT trade-offs:
Speed comparison (sequence length 512):
This motivates attention mechanisms and transformers

Components:
Computational cost per image:

import torch
import torch.nn as nn
from torchvision import models

class ImageCaptioner(nn.Module):
    def __init__(self):
        super().__init__()
        # Frozen CNN encoder
        resnet = models.resnet50(pretrained=True)
        modules = list(resnet.children())[:-1]
        self.cnn = nn.Sequential(*modules)
        for p in self.cnn.parameters():
            p.requires_grad = False
        # Trainable components
        self.project = nn.Linear(2048, 512)
        self.embed = nn.Embedding(10000, 256)
        self.lstm = nn.LSTM(256, 512, 2)
        self.output = nn.Linear(512, 10000)
Forward pass:
Beam search decoding
def beam_search(image, beam_size=3):
    feat = model.cnn(image).squeeze()
    # Expand projected feature to (num_layers, batch, H)
    h0 = model.project(feat).view(1, 1, -1).repeat(2, 1, 1)
    c0 = torch.zeros_like(h0)
    beams = [{'seq': [SOS_TOKEN],
              'score': 0.0,
              'state': (h0, c0)}]
    for _ in range(MAX_LEN):
        candidates = []
        for beam in beams:
            if beam['seq'][-1] == EOS_TOKEN:
                candidates.append(beam)
                continue
            # Next token prediction
            token = torch.tensor([beam['seq'][-1]])
            embed = model.embed(token).unsqueeze(0)  # (1, 1, 256)
            out, state = model.lstm(embed, beam['state'])
            scores = F.log_softmax(model.output(out), dim=-1)
            # Top-k expansion
            topk = scores[0, 0].topk(beam_size)
            for score, idx in zip(*topk):
                candidates.append({
                    'seq': beam['seq'] + [idx.item()],
                    'score': beam['score'] + score.item(),
                    'state': state
                })
        # Beam pruning
        beams = sorted(candidates,
                       key=lambda x: x['score'],
                       reverse=True)[:beam_size]

| Model | BLEU-1 | BLEU-4 | METEOR | CIDEr |
|---|---|---|---|---|
| CNN-RNN | 71.6 | 26.7 | 24.5 | 94.3 |
| + Attention | 74.1 | 31.5 | 26.0 | 107.2 |
| + Beam=5 | 72.8 | 28.9 | 25.2 | 98.1 |
Computational requirements:
Failure modes:
Corrections:

Word order differences:
Non-monotonic alignment:
Statistical MT limitations:

\[p_n = \frac{\sum_{\text{n-gram} \in \hat{y}} \min(\text{Count}(\text{n-gram}), \text{Count}_{\text{ref}}(\text{n-gram}))}{\sum_{\text{n-gram} \in \hat{y}} \text{Count}(\text{n-gram})}\]
Combined score: \[\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^4 w_n \log p_n\right)\]
where \(w_n = 1/4\) (uniform weights)
Brevity penalty: \[BP = \begin{cases} 1 & \text{if } c > r \\ e^{1-r/c} & \text{if } c \leq r \end{cases}\]
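A compact sketch of these formulas (clipped n-gram precision, uniform weights, brevity penalty), without the smoothing that real toolkits add; the toy sentence pair is made up:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    ps = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        clipped = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        total = sum(cand.values())
        if total == 0 or clipped == 0:
            return 0.0          # any zero precision zeroes the geometric mean
        ps.append(clipped / total)
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)                  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in ps) / max_n)

cand = "the quick brown fox jumps over the lazy dog".split()
ref = "the quick brown fox jumped over the lazy dog".split()
print(round(bleu(cand, ref), 3))  # ~0.597: one changed word costs many n-grams
```

Note how a single substituted word breaks every 2-, 3-, and 4-gram that crosses it, which is why BLEU falls off quickly with small edits.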
Typical scores:

WMT’14 English-French dataset:
Preprocessing challenges:
Computational requirements:
Data distribution issues:

Variable-length sequences compressed to fixed H-dimensional vector:
Information bottleneck problem:
Observable failures:
This motivates attention mechanisms (next lecture)

Encoder: Processes source sequence
Decoder: Generates target sequence
Standard configuration:
Training objective: \[\mathcal{L} = -\sum_{t=1}^{T_{tgt}} \log P(\mathbf{y}_t | \mathbf{y}_{<t}, \mathbf{x})\]

# Encoder
for t in range(src_len):
    h_enc[t] = encoder(x[t], h_enc[t-1])
context = h_enc[-1]

# Decoder (teacher forcing)
h_dec[0] = context
for t in range(tgt_len):
    h_dec[t+1] = decoder(y[t], h_dec[t])
    logits[t] = project(h_dec[t+1])
    loss += CrossEntropy(logits[t], y[t+1])

Gradient clipping requirements:
Training tricks:
Inference strategies
Greedy decoding:
h_dec = context
y_pred = [<SOS>]
for t in range(max_len):
    h_dec = decoder(y_pred[-1], h_dec)
    probs = softmax(project(h_dec))
    y_pred.append(argmax(probs))
    if y_pred[-1] == <EOS>:
        break

Beam search (k=5 typical):
Decoding comparison:
Length normalization: \[\text{Score} = \frac{1}{L^\alpha} \sum_{t=1}^L \log P(y_t|y_{<t}, \mathbf{x})\]
Typical α = 0.6-0.7 (mild preference for longer)
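A toy illustration with made-up per-token probabilities: the raw sum of log probabilities favors the 4-token hypothesis, while the normalized score favors the 8-token one:

```python
import math

def norm_score(logprobs, alpha=0.7):
    return sum(logprobs) / (len(logprobs) ** alpha)

short = [math.log(0.5)] * 4   # 4 tokens at p = 0.5 each
long_ = [math.log(0.6)] * 8   # 8 tokens at p = 0.6 each

print(sum(short) > sum(long_))                # True: raw score picks short
print(norm_score(long_) > norm_score(short))  # True: normalized picks long
```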
Why log probabilities:
Beam search complexity:

Discovery (Sutskever et al., 2014):
Reverse the source sentence order
Source “A B C” is fed as “C B A”; the target “X Y Z” is unchanged
+2-4 BLEU improvement
Hypothesis: reduces the average distance between early source words and their translations
Deep architectures:
Ensemble decoding:
Results on WMT’14 En→Fr:

What seq2seq achieved:
What it couldn’t solve:
The bottleneck problem motivates attention:
